DataFlow: AI-Native ETL/ELT

Ever tried to move data between databases and ended up writing more boilerplate than actual logic? Or watched a 3-hour ETL job crash at 97% with a cryptic error that basically says "good luck figuring this out"?

Meet DataFlow - where "copy this table from Oracle to BigQuery" is literally all you need to say.

The Magic: Data Pipeline as Conversation

Traditional ETL tools make you:

Write complex configuration files
Define schemas manually
Handle errors with blind retries
Pray nothing breaks at 3 AM

DataFlow flips the script:

You describe what you want in plain English
QRY figures out the optimal strategy based on your data
If something breaks, the AI actually understands why and fixes it
You sleep soundly while your data moves itself

How It Works

Just Ask

You: I need to copy the sales_transactions table from Oracle to BigQuery.
     It's about 50GB.

QRY: I'll help you move the sales_transactions table. Let me analyze
     the source and plan the optimal strategy.

     Here's what I found:
     - Source: oracle.sales.sales_transactions
     - Estimated size: 47.3 GB (312 million rows)
     - Strategy: Distributed processing (8 parallel workers)
     - Estimated time: 25-35 minutes

     Type mappings:
     - TRANSACTION_ID (NUMBER) → INT64
     - AMOUNT (NUMBER(18,2)) → NUMERIC
     - TRANSACTION_DATE (DATE) → DATE
     - CUSTOMER_NAME (VARCHAR2) → STRING

     Should I proceed?

You: Yes, and add a column for when this data was loaded.

QRY: Got it. Adding LOAD_TIMESTAMP with the current UTC time.

     Pipeline started!

     Progress:
     ████████░░░░░░░░░░░░ 42% | 132M rows | 19.8 GB
     Current partition: 2024-03 to 2024-04
     ETA: 18 minutes

Smart Type Conversions

DataFlow automatically handles type mappings between databases. Oracle NUMBER becomes BigQuery INT64 or NUMERIC. VARCHAR2 becomes STRING. CLOBs, BLOBs, dates, timestamps - all handled automatically.

No more manual schema definitions. No more type mismatch errors at runtime.

Auto-Healing When Things Go Wrong

Traditional ETL:

Error: Connection timeout at row 5,234,891
Action: Retry from beginning (again)
Time wasted: 2 hours

DataFlow:

Issue detected: Connection timeout during write

Analyzing... The target database is experiencing high load.
Action: Reducing batch size and switching to off-peak connection pool.
Resuming from checkpoint at row 5,234,891.

Progress continues...

The AI doesn't just retry blindly - it understands the context, diagnoses the problem, and adapts the strategy.

Key Features

Any-to-Any Database Transfers

Move data between any connected datasources:

Oracle ↔ BigQuery
PostgreSQL ↔ Snowflake
MySQL ↔ Cloudera
Any combination you have connected

Automatic Scaling

DataFlow picks the right execution mode for your data:

Data Size	How It Runs
Small (< 100 MB)	Instant, inline execution
Medium (100 MB - 10 GB)	Background job
Large (10 GB - 100 GB)	Distributed worker pool
Massive (> 100 GB)	Spark cluster

You don't need to configure anything. Just describe what you want.

Schedule Recurring Pipelines

You: Can you run this every night at 2 AM?

QRY: Created scheduled pipeline:
     - Schedule: Daily at 02:00 UTC
     - Mode: Incremental (only new transactions)
     - Filter: WHERE transaction_date >= yesterday
     - Notifications: Email on completion/failure

     The nightly job will only transfer new transactions
     since the last run, making it much faster.

Add Transformations On-The-Fly

You: Copy the customers table, but combine first_name and last_name
     into a full_name column, and filter out inactive customers.

QRY: I'll copy the customers table with these modifications:
     - New column: full_name = first_name + ' ' + last_name
     - Filter: WHERE status = 'active'

     Proceeding...

Real-Time Progress Tracking

Check on your pipelines anytime:

You: How's that Oracle migration going?

QRY: Pipeline Status: oracle_migration_20241027

     Progress: 67% complete
     Processed: 3.2 GB of 4.8 GB
     Rows: 45.2M of 67.5M
     Speed: 125K rows/second
     ETA: 5 minutes

     No errors encountered.

Use Cases

Data Warehouse Loading

Move operational data from your ERP (Oracle, SAP) to your analytics warehouse (BigQuery, Snowflake) on a schedule.

Database Migration

Moving to a new platform? DataFlow handles the heavy lifting with smart type conversions and progress tracking.

Cross-Platform Analytics

Combine data from multiple sources for unified reporting without manual exports and imports.

Backup and Archival

Automatically archive old data to cheaper storage tiers while maintaining queryability.

Dev/Test Data Refresh

Keep development and test environments updated with production-like data (with optional anonymization).

Security & Compliance

RBAC Enforced: You can only move data you have permission to read/write
Audit Logging: Every pipeline execution is logged with user, source, target, and row counts
Encrypted Transit: All database connections use TLS
Credential Isolation: Workers use service accounts, not your personal credentials
Temporary Data Cleanup: Intermediate files are encrypted and auto-deleted

Getting Started

Connect your datasources - Make sure both source and target databases are connected in QRY
Start a conversation - Just describe what data you want to move and where
Review the plan - QRY shows you estimated size, time, and type mappings
Confirm and go - Watch the progress or go grab coffee

That's it. No YAML files. No Airflow DAGs. No Python scripts. Just conversation.

When to Use DataFlow vs Forge

Use Case	Best Choice
Simple table copy	DataFlow - just describe what you want
Ad-hoc data movement	DataFlow - conversational interface
Complex multi-step pipelines	Lakeflow - visual DAG and YAML
Scheduled multi-task workflows	Lakeflow - job orchestration
Data quality expectations	Lakeflow - built-in expectations
AI + ETL + notifications in one workflow	Lakeflow - job with mixed task types

Think of DataFlow as your friendly conversational assistant for data movement, and Lakeflow as your enterprise-grade orchestration platform when you need more control.

DataFlow: Because "SELECT * FROM source INSERT INTO destination" should actually be that simple.

The Magic: Data Pipeline as Conversation​

How It Works​

Just Ask​

Smart Type Conversions​

Auto-Healing When Things Go Wrong​

Key Features​

Any-to-Any Database Transfers​

Automatic Scaling​

Schedule Recurring Pipelines​

Add Transformations On-The-Fly​

Real-Time Progress Tracking​

Use Cases​

Data Warehouse Loading​

Database Migration​

Cross-Platform Analytics​

Backup and Archival​

Dev/Test Data Refresh​

Security & Compliance​

Getting Started​

When to Use DataFlow vs Forge​