DataFlow: AI-Native ETL/ELT
Ever tried to move data between databases and ended up writing more boilerplate than actual logic? Or watched a 3-hour ETL job crash at 97% with a cryptic error that basically says "good luck figuring this out"?
Meet DataFlow - where "copy this table from Oracle to BigQuery" is literally all you need to say.
The Magic: Data Pipeline as Conversation
Traditional ETL tools make you:
- Write complex configuration files
- Define schemas manually
- Handle errors with blind retries
- Pray nothing breaks at 3 AM
DataFlow flips the script:
- You describe what you want in plain English
- QRY figures out the optimal strategy based on your data
- If something breaks, the AI actually understands why and fixes it
- You sleep soundly while your data moves itself
How It Works
Just Ask
You: I need to copy the sales_transactions table from Oracle to BigQuery.
It's about 50GB.
QRY: I'll help you move the sales_transactions table. Let me analyze
the source and plan the optimal strategy.
Here's what I found:
- Source: oracle.sales.sales_transactions
- Estimated size: 47.3 GB (312 million rows)
- Strategy: Distributed processing (8 parallel workers)
- Estimated time: 25-35 minutes
Type mappings:
- TRANSACTION_ID (NUMBER) → INT64
- AMOUNT (NUMBER(18,2)) → NUMERIC
- TRANSACTION_DATE (DATE) → DATE
- CUSTOMER_NAME (VARCHAR2) → STRING
Should I proceed?
You: Yes, and add a column for when this data was loaded.
QRY: Got it. Adding LOAD_TIMESTAMP with the current UTC time.
Pipeline started!
Progress:
████████░░░░░░░░░░░░ 42% | 132M rows | 19.8 GB
Current partition: 2024-03 to 2024-04
ETA: 18 minutes
Smart Type Conversions
DataFlow automatically handles type mappings between databases. Oracle NUMBER becomes BigQuery INT64 or NUMERIC. VARCHAR2 becomes STRING. CLOBs, BLOBs, dates, timestamps - all handled automatically.
No more manual schema definitions. No more type mismatch errors at runtime.
Auto-Healing When Things Go Wrong
Traditional ETL:
Error: Connection timeout at row 5,234,891
Action: Retry from beginning (again)
Time wasted: 2 hours
DataFlow:
Issue detected: Connection timeout during write
Analyzing... The target database is experiencing high load.
Action: Reducing batch size and switching to off-peak connection pool.
Resuming from checkpoint at row 5,234,891.
Progress continues...
The AI doesn't just retry blindly - it understands the context, diagnoses the problem, and adapts the strategy.
Key Features
Any-to-Any Database Transfers
Move data between any connected datasources:
- Oracle ↔ BigQuery
- PostgreSQL ↔ Snowflake
- MySQL ↔ Cloudera
- Any combination you have connected
Automatic Scaling
DataFlow picks the right execution mode for your data:
| Data Size | How It Runs |
|---|---|
| Small (< 100 MB) | Instant, inline execution |
| Medium (100 MB - 10 GB) | Background job |
| Large (10 GB - 100 GB) | Distributed worker pool |
| Massive (> 100 GB) | Spark cluster |
You don't need to configure anything. Just describe what you want.
Schedule Recurring Pipelines
You: Can you run this every night at 2 AM?
QRY: Created scheduled pipeline:
- Schedule: Daily at 02:00 UTC
- Mode: Incremental (only new transactions)
- Filter: WHERE transaction_date >= yesterday
- Notifications: Email on completion/failure
The nightly job will only transfer new transactions
since the last run, making it much faster.
Add Transformations On-The-Fly
You: Copy the customers table, but combine first_name and last_name
into a full_name column, and filter out inactive customers.
QRY: I'll copy the customers table with these modifications:
- New column: full_name = first_name + ' ' + last_name
- Filter: WHERE status = 'active'
Proceeding...
Real-Time Progress Tracking
Check on your pipelines anytime:
You: How's that Oracle migration going?
QRY: Pipeline Status: oracle_migration_20241027
Progress: 67% complete
Processed: 3.2 GB of 4.8 GB
Rows: 45.2M of 67.5M
Speed: 125K rows/second
ETA: 5 minutes
No errors encountered.
Use Cases
Data Warehouse Loading
Move operational data from your ERP (Oracle, SAP) to your analytics warehouse (BigQuery, Snowflake) on a schedule.
Database Migration
Moving to a new platform? DataFlow handles the heavy lifting with smart type conversions and progress tracking.
Cross-Platform Analytics
Combine data from multiple sources for unified reporting without manual exports and imports.
Backup and Archival
Automatically archive old data to cheaper storage tiers while maintaining queryability.
Dev/Test Data Refresh
Keep development and test environments updated with production-like data (with optional anonymization).
Security & Compliance
- RBAC Enforced: You can only move data you have permission to read/write
- Audit Logging: Every pipeline execution is logged with user, source, target, and row counts
- Encrypted Transit: All database connections use TLS
- Credential Isolation: Workers use service accounts, not your personal credentials
- Temporary Data Cleanup: Intermediate files are encrypted and auto-deleted
Getting Started
- Connect your datasources - Make sure both source and target databases are connected in QRY
- Start a conversation - Just describe what data you want to move and where
- Review the plan - QRY shows you estimated size, time, and type mappings
- Confirm and go - Watch the progress or go grab coffee
That's it. No YAML files. No Airflow DAGs. No Python scripts. Just conversation.
When to Use DataFlow vs Forge
| Use Case | Best Choice |
|---|---|
| Simple table copy | DataFlow - just describe what you want |
| Ad-hoc data movement | DataFlow - conversational interface |
| Complex multi-step pipelines | Lakeflow - visual DAG and YAML |
| Scheduled multi-task workflows | Lakeflow - job orchestration |
| Data quality expectations | Lakeflow - built-in expectations |
| AI + ETL + notifications in one workflow | Lakeflow - job with mixed task types |
Think of DataFlow as your friendly conversational assistant for data movement, and Lakeflow as your enterprise-grade orchestration platform when you need more control.
DataFlow: Because "SELECT * FROM source INSERT INTO destination" should actually be that simple.