QRY Lakeflow
Status: Production Ready
Overview
Remember when building data pipelines meant spending weeks wrestling with Airflow DAGs, debugging cryptic YAML errors at 2 AM, and explaining to your manager why "it works on my machine" doesn't apply to production? Those days are over.
Lakeflow is QRY's data orchestration engine for building, scheduling, and monitoring data pipelines and jobs. Think Databricks Lakeflow meets a UX designer who actually uses the product. Define complex data workflows using simple YAML or drag-and-drop visual editing — your choice, and yes, they sync automatically.
Key Features
Professional IDE Experience
Lakeflow features a full IDE interface inspired by VS Code and modern development tools. Work with multiple pipelines and jobs simultaneously using tabs, keyboard shortcuts, and powerful navigation.
Dual Editing Modes
Switch between YAML and visual drag-and-drop editing whenever you want. Changes in one mode automatically sync to the other. It's like having your cake and eating it too, except the cake is a perfectly configured ETL pipeline.
Interactive DAG Canvas
Design task workflows with drag-and-drop connections. No more drawing diagrams on whiteboards that nobody updates. Your DAG is always accurate because it is the configuration.
Real-Time Monitoring
Watch execution progress live with animated task nodes. Blue pulse means running, green means success, red means "time to investigate." It's like watching your data do a choreographed dance, except occasionally one dancer trips.
AI-Assisted Authoring
Tell Lakeflow what you want in plain English, and it generates the YAML for you. "Create a pipeline that aggregates daily sales by region" becomes a working configuration in seconds. Your keyboard thanks you.
Stage-Based Pipeline Design
Visual Source → Transform → Expectations → Sink flow that makes data lineage obvious. Even your manager can follow along during demos.
Git Integration
Version control your pipelines and jobs with Git Folders. Track changes, collaborate with teammates, and integrate with CI/CD workflows.
Two Core Abstractions
Lakeflow gives you two powerful building blocks:
| Type | Purpose | Think of it as... |
|---|---|---|
| Pipelines | Single-purpose data transformations (ETL/ELT) | A well-trained specialist |
| Jobs | Multi-task orchestrations with dependencies | A conductor leading an orchestra |
Getting Started
Navigating to Lakeflow
- Open QRY
- Click Lakeflow in the left navigation rail
- Marvel at the dashboard showing all your pipelines and jobs
The IDE Interface
Lakeflow uses a three-panel IDE layout similar to VS Code:
┌─────────────────────────────────────────────────────────────┐
│ Tab Bar (open pipelines/jobs, drag to reorder) │
├──────────┬────────────────────────────────┬─────────────────┤
│ │ │ │
│ File │ Editor Area │ Right Rail │
│ Explorer │ (YAML or Visual) │ (panel icons) │
│ │ │ │
│ Folders │ ├─────────────────┤
│ & Items │ │ Panel Content │
│ │ │ (contextual) │
└──────────┴────────────────────────────────┴─────────────────┘
Components:
- Tab Bar: Open multiple pipelines/jobs simultaneously, drag tabs to reorder
- File Explorer: Browse folders, search items, filter by type/status
- Editor Area: Main workspace for YAML editing or visual canvas
- Right Rail: Quick access to Run History, Settings, Validation
- Panel Area: Contextual panels that slide in from the right
Keyboard Shortcuts
Master these shortcuts to work efficiently:
| Shortcut | Action |
|---|---|
| `Cmd+S` | Save current item |
| `Cmd+Enter` | Validate configuration |
| `Cmd+Shift+Enter` | Deploy/Activate |
| `Cmd+P` | Quick open (search items) |
| `Cmd+B` | Toggle sidebar |
| `Cmd+\` | Toggle split view |
The Interface at a Glance
| Area | What it does |
|---|---|
| Toolbar | Create new items, filter by status/workspace, bulk actions |
| Folder Tree | Organize pipelines and jobs (yes, folders actually work here) |
| Item List | Table or card view - your choice |
| Status Indicators | Visual badges so you know what's running, broken, or waiting |
Pipelines: Your Data Transformation Workhorses
Pipelines are single-purpose data transformation workflows that move and transform data between sources and targets. One pipeline, one purpose, done well.
Pipeline Lifecycle
DRAFT → DEPLOYED → DEPRECATED
- Draft: Work in progress, edit freely
- Deployed: Production-ready, running in the wild
- Deprecated: Retirement home for pipelines you can't quite delete yet
Creating a Pipeline
- Click + New Pipeline in the toolbar
- Choose your preferred editing mode (YAML or Visual)
- Define your pipeline:
name: sales_daily_summary
description: "Aggregate daily sales by region"
source:
  datasource: bigquery
  catalog: my_project
  schema: raw_data
target:
  catalog: my_project
  schema: analytics
tables:
  - name: daily_sales
    type: live_table
    query: |
      SELECT
        DATE(transaction_date) as sale_date,
        region,
        SUM(amount) as total_amount,
        COUNT(*) as transaction_count
      FROM source.transactions
      WHERE transaction_date >= CURRENT_DATE - 30
      GROUP BY 1, 2
- Click Validate to catch errors before they catch you
- Click Save to preserve your work
- Click Deploy when you're ready for prime time
Visual Pipeline Editor
The Pipeline Editor shows a linear stage-based flow:
Source → Transform → Expectations → Sink
Each stage is a clickable card:
| Stage | Color | What you configure |
|---|---|---|
| Source | Cyan | Where data comes from, query, incremental settings |
| Transform | Purple | SQL or Python transformations |
| Expectations | Amber | Data quality checks (because garbage in = garbage out) |
| Sink | Green | Where data lands, write mode, merge keys |
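To make the four stages concrete, here is a minimal Python sketch of the same Source → Transform → Expectations → Sink shape. It is not Lakeflow code: the function names, the in-memory rows, and the two expectation rules are all hypothetical, chosen only to show where each stage sits and why expectations run before anything is written.

```python
# Illustrative only: plain Python standing in for the four pipeline stages.

def source() -> list[dict]:
    # Stand-in for reading rows from a real datasource.
    return [
        {"region": "EMEA", "amount": 120.0},
        {"region": "EMEA", "amount": 80.0},
        {"region": "APAC", "amount": 50.0},
    ]

def transform(rows: list[dict]) -> list[dict]:
    # Aggregate amounts per region, sorted by region name for stable output.
    totals: dict[str, float] = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return [{"region": r, "total_amount": t} for r, t in sorted(totals.items())]

def check_expectations(rows: list[dict]) -> list[str]:
    # Return a list of human-readable failures; an empty list means all checks passed.
    failures = []
    for index, row in enumerate(rows):
        if not row["region"]:
            failures.append(f"row {index}: region is empty")
        if row["total_amount"] < 0:
            failures.append(f"row {index}: total_amount is negative")
    return failures

def run_pipeline(sink: list[dict]) -> None:
    rows = transform(source())
    failures = check_expectations(rows)
    if failures:
        # Stop before the sink so bad data never lands.
        raise ValueError("; ".join(failures))
    sink.extend(rows)

analytics_table: list[dict] = []
run_pipeline(analytics_table)
```

The point of the ordering is that a failed expectation stops the run before the sink is touched.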
Pipeline Table Types
| Type | When to use |
|---|---|
| `live_table` | Real-time aggregations, computed on-demand |
| `materialized_view` | Performance optimization, pre-computed results |
| `streaming` | Continuous ingestion, real-time data feeds |
Jobs: Orchestrating the Orchestra
Jobs combine multiple tasks into complex workflows. Tasks can depend on each other, forming a directed acyclic graph (DAG) - fancy words for "things happen in the right order."
Job Lifecycle
DRAFT → ACTIVE ⟷ PAUSED → ARCHIVED
- Draft: Build and test without affecting production
- Active: Scheduled and running
- Paused: Taking a break, preserves schedule
- Archived: Soft deleted (in case you change your mind)
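The lifecycle diagram reads naturally as a small state machine. The sketch below encodes only the arrows drawn above; whether Lakeflow allows other moves (for example archiving an Active job directly) is not stated here, so treat the transition table as an assumption. `JobStatus` and `transition` are illustrative names, not part of any Lakeflow API.

```python
from enum import Enum

class JobStatus(Enum):
    DRAFT = "draft"
    ACTIVE = "active"
    PAUSED = "paused"
    ARCHIVED = "archived"

# Assumption: only the arrows shown in the lifecycle diagram are legal.
ALLOWED_TRANSITIONS = {
    JobStatus.DRAFT: {JobStatus.ACTIVE},
    JobStatus.ACTIVE: {JobStatus.PAUSED},
    JobStatus.PAUSED: {JobStatus.ACTIVE, JobStatus.ARCHIVED},
    JobStatus.ARCHIVED: set(),
}

def transition(current: JobStatus, target: JobStatus) -> JobStatus:
    # Reject any move that the diagram does not show.
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move job from {current.name} to {target.name}")
    return target

status = JobStatus.DRAFT
status = transition(status, JobStatus.ACTIVE)   # activate
status = transition(status, JobStatus.PAUSED)   # pause, schedule preserved
status = transition(status, JobStatus.ACTIVE)   # resume
```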
Creating a Job
- Click + New Job in the toolbar
- Define your orchestration:
name: daily_analytics_workflow
description: "End-to-end daily analytics processing"
schedule:
  cron: "0 6 * * *"  # 6 AM UTC daily
  timezone: "UTC"
tasks:
  - name: extract_data
    type: pipeline
    pipeline_name: raw_data_ingestion
    timeout_seconds: 1800
  - name: transform_data
    type: pipeline
    pipeline_name: sales_daily_summary
    depends_on:
      - extract_data
    timeout_seconds: 3600
  - name: generate_report
    type: prompt
    prompt_config:
      prompt: "Analyze today's sales data and summarize key insights"
      model: "gemini-2.0-flash"
      context:
        include_upstream_results: true
    depends_on:
      - transform_data
  - name: notify_team
    type: notification
    notification_config:
      channels:
        - type: email
          recipients:
            - analytics@company.com
      message: "Daily analytics job completed successfully"
    depends_on:
      - generate_report
- Click Validate, then Save, then Activate
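If you want to see how `depends_on` turns into an execution order, the following sketch feeds the four tasks from the example job into Python's standard-library `graphlib`. It illustrates the DAG idea only; how Lakeflow's scheduler actually resolves dependencies is not described in this document, and `execution_order` is a hypothetical helper.

```python
from graphlib import CycleError, TopologicalSorter

# Task name -> list of tasks it depends on (mirrors depends_on in the YAML).
depends_on = {
    "extract_data": [],
    "transform_data": ["extract_data"],
    "generate_report": ["transform_data"],
    "notify_team": ["generate_report"],
}

def execution_order(graph: dict[str, list[str]]) -> list[str]:
    # static_order() yields every task after all of its dependencies.
    # It raises CycleError if the dependencies form a loop.
    return list(TopologicalSorter(graph).static_order())

order = execution_order(depends_on)
# order == ["extract_data", "transform_data", "generate_report", "notify_team"]

try:
    execution_order({"a": ["b"], "b": ["a"]})
    has_cycle = False
except CycleError:
    has_cycle = True  # a job like this is not a valid DAG
```

The cycle check is the "acyclic" part of DAG: two tasks that wait on each other can never start.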
Task Types
Lakeflow supports five task types for maximum flexibility:
Pipeline Task
Run a Lakeflow pipeline as part of your job.
- name: run_etl
  type: pipeline
  pipeline_name: my_pipeline
  timeout_seconds: 3600
Prompt Task
Execute an AI prompt - yes, you can have AI analyze your data as part of the workflow.
- name: analyze_data
  type: prompt
  prompt_config:
    prompt: "Analyze the data and provide insights"
    system: "You are a data analyst"
    model: "gemini-2.0-flash"
    tools:
      - DatabaseTool
      - PythonTool
    context:
      include_upstream_results: true
    output:
      format: "markdown"
      max_tokens: 4000
Python Task
Run custom Python code in a sandboxed environment.
- name: custom_processing
  type: python
  python_config:
    script: |
      import pandas as pd
      # Access upstream results
      upstream_data = context.get('upstream_results', {})
      # Your custom logic
      result = {"processed": True}
      print(f"Processed data: {result}")
    requirements:
      - pandas>=2.0.0
  timeout_seconds: 600
Notification Task
Send alerts when things happen (or don't).
- name: send_alert
  type: notification
  notification_config:
    channels:
      - type: email
        recipients:
          - team@company.com
    message: "Job completed with status: {{ job.status }}"
    subject: "Daily Job Update"
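The `{{ job.status }}` placeholder in the message looks like a Jinja-style template variable. This document does not say which template engine Lakeflow uses or which variables are available, so the sketch below is only an assumption about the general behavior: a dotted name inside double braces is looked up in a nested dictionary. `render` is a hypothetical helper.

```python
import re

# Matches {{ dotted.name }} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")

def render(template: str, variables: dict) -> str:
    def lookup(match: re.Match) -> str:
        value = variables
        # Walk the dotted path one key at a time; a missing key raises KeyError.
        for part in match.group(1).split("."):
            value = value[part]
        return str(value)
    return PLACEHOLDER.sub(lookup, template)

message = render(
    "Job completed with status: {{ job.status }}",
    {"job": {"status": "success"}},
)
# message == "Job completed with status: success"
```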
Condition Task
Control flow based on upstream results - because sometimes you need if/else in your pipelines.
- name: check_quality
  type: condition
  condition_config:
    expression: "upstream.data_quality.score > 0.95"
    on_true: continue
    on_false: skip_downstream
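Here is a rough sketch of how a condition like the one above could be evaluated. Lakeflow's real expression language is not specified in this document; this version assumes the narrowest possible grammar (a dotted path, one comparison operator, one numeric literal), and `evaluate` and `next_action` are hypothetical names.

```python
import operator
import re

OPERATORS = {
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
    "==": operator.eq, "!=": operator.ne,
}

# Assumed grammar: <dotted.path> <operator> <number>
EXPRESSION = re.compile(r"\s*([\w.]+)\s*(>=|<=|==|!=|>|<)\s*(-?\d+(?:\.\d+)?)\s*")

def evaluate(expression: str, context: dict) -> bool:
    match = EXPRESSION.fullmatch(expression)
    if match is None:
        raise ValueError(f"unsupported expression: {expression!r}")
    path, symbol, literal = match.groups()
    value = context
    for part in path.split("."):
        value = value[part]  # walk the nested upstream results
    return OPERATORS[symbol](value, float(literal))

def next_action(expression: str, context: dict,
                on_true: str = "continue", on_false: str = "skip_downstream") -> str:
    return on_true if evaluate(expression, context) else on_false

results = {"upstream": {"data_quality": {"score": 0.97}}}
action = next_action("upstream.data_quality.score > 0.95", results)
# action == "continue"
```

Parsing a tiny grammar instead of calling `eval` keeps arbitrary code out of the condition, which matters when expressions come from user-edited YAML.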
Visual Job Editor
The Job Editor provides an interactive DAG canvas:
Task Palette (left sidebar):
| Task Type | Icon | Color |
|---|---|---|
| Pipeline | Workflow | Indigo |
| Prompt | Message | Purple |
| Python | Code | Green |
| Notification | Bell | Orange |
| Condition | Git branch | Slate |
Creating Tasks:
- Drag and Drop: Grab a task type from the palette, drop it on the canvas
- YAML Editing: Switch to YAML mode, add your task, watch the visual DAG update
Connecting Tasks:
- Hover over a task node
- Drag from the bottom handle
- Connect to another task's top handle
- The `depends_on` relationship is created automatically
Interactive DAG Features
| Feature | How |
|---|---|
| Pan | Click and drag on empty canvas |
| Zoom | Mouse wheel or pinch |
| MiniMap | Overview navigation in corner |
| Select | Click nodes, Shift+click for multi-select |
Scheduling
Use standard cron expressions with timezone support:
schedule:
  cron: "0 6 * * *"  # Daily at 6 AM
  timezone: "America/New_York"
Common Patterns:
| Pattern | When it runs |
|---|---|
| `0 * * * *` | Every hour |
| `0 6 * * *` | Daily at 6 AM |
| `0 6 * * 1` | Every Monday at 6 AM |
| `0 6 1 * *` | First day of month at 6 AM |
| `*/15 * * * *` | Every 15 minutes |
Format: minute hour day-of-month month day-of-week
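To check your reading of a pattern, it can help to test it against a specific timestamp. The sketch below is a deliberately small matcher, not Lakeflow's scheduler: it supports only `*`, `*/n`, and single integers (no ranges, lists, or names), requires all five fields to match, and assumes the datetime is already expressed in the schedule's timezone. `cron_matches` is a hypothetical helper.

```python
from datetime import datetime

def _field_matches(field: str, value: int, minimum: int) -> bool:
    if field == "*":
        return True
    if field.startswith("*/"):
        # Steps count from the field's smallest legal value (0 or 1).
        return (value - minimum) % int(field[2:]) == 0
    return value == int(field)

def cron_matches(expression: str, moment: datetime) -> bool:
    minute, hour, day, month, weekday = expression.split()
    # Cron numbers weekdays 0 = Sunday ... 6 = Saturday;
    # isoweekday() gives 1 = Monday ... 7 = Sunday, so wrap Sunday to 0.
    cron_weekday = moment.isoweekday() % 7
    return (
        _field_matches(minute, moment.minute, 0)
        and _field_matches(hour, moment.hour, 0)
        and _field_matches(day, moment.day, 1)
        and _field_matches(month, moment.month, 1)
        and _field_matches(weekday, cron_weekday, 0)
    )

monday_6am = datetime(2026, 4, 6, 6, 0)   # 6 April 2026 is a Monday
runs_monday = cron_matches("0 6 * * 1", monday_6am)                    # True
runs_tuesday = cron_matches("0 6 * * 1", datetime(2026, 4, 7, 6, 0))   # False
```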
Real-Time Execution Monitoring
When a job runs, the DAG comes alive:
| Status | Appearance |
|---|---|
| Pending | Gray nodes |
| Running | Blue nodes with pulse animation |
| Completed | Green nodes |
| Failed | Red nodes |
| Skipped | Gray nodes with reduced opacity |
Edge animations show data flow direction. It's oddly satisfying to watch.
Folder Organization
Keep your pipelines and jobs organized:
- Click + next to "Folders" in the sidebar
- Configure:
  - Name: URL-safe slug (`sales-etl`)
  - Display Name: Human-readable (`Sales ETL Pipelines`)
  - Color: Visual identifier
  - Icon: Choose from Lucide icons
Move items via drag-and-drop, context menu, or bulk actions.
Git Folders
Folders can be Git-enabled for version control:
- Create a folder or select an existing one
- Click Enable Git in folder settings
- Optionally connect to a remote repository
Git Folders provide:
- Version history for all pipelines and jobs inside
- Branch management for safe experimentation
- Remote sync with GitHub, GitLab, Bitbucket
- CI/CD integration for automated deployments
See the Git Folders documentation for complete details.
Workspace Integration
Lakeflow integrates with QRY Workspaces for team collaboration:
- Personal: Visible only to you
- Workspace: Shared with team members
Permissions follow the usual pattern: View, Execute, Edit, Admin.
AI-Assisted YAML Generation
Open the AI Assistant from the toolbar and describe what you want:
"Create a pipeline that aggregates daily sales by product category
from the transactions table in BigQuery and stores results in
the analytics schema"
The AI generates the YAML. Click Apply. Done.
Example Prompts:
Pipeline:
Create a pipeline to deduplicate customer records from the raw_customers
table based on email, keeping the most recent entry
Job:
Build a job that runs every Monday at 9 AM to:
1. Refresh the weekly sales pipeline
2. Generate an AI summary of sales trends
3. Email the report to the sales team
Best Practices
Pipeline Design
- Single Responsibility: One pipeline, one purpose
- Idempotency: Design to be safely re-runnable
- Data Quality Checks: Use expectations to catch issues early
- Documentation: Future you will thank present you
Job Orchestration
- Modular Tasks: Break complex workflows into discrete steps
- Realistic Timeouts: Don't guess, measure
- Retry Configuration: Handle transient failures gracefully
- Notifications: Alert on failures before your users do
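As a generic illustration of handling transient failures gracefully, here is a retry loop with exponential backoff in plain Python. It is not Lakeflow's retry configuration (this document does not show those settings); `run_with_retries` and its parameters are hypothetical, and the `sleep` argument exists only so the delay can be swapped out in tests.

```python
import time

def run_with_retries(task, max_attempts=3, base_delay=1.0,
                     retry_on=(Exception,), sleep=time.sleep):
    # Call task() until it succeeds or max_attempts is reached.
    # Waits base_delay, then 2x, then 4x ... between attempts.
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))

# Example: a task that fails twice, then succeeds.
calls = {"count": 0}

def flaky_task():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

delays = []
result = run_with_retries(flaky_task, sleep=delays.append)
# result == "ok", delays == [1.0, 2.0]
```

Retries only make sense for idempotent work, which is why idempotency and retry configuration appear together in these lists. In real use, narrow `retry_on` to the errors you know are transient.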
Naming Conventions
Pipelines: {domain}_{action}_{frequency}
  e.g., sales_aggregate_daily
Jobs: {domain}_{workflow}_{frequency}
  e.g., analytics_reporting_weekly
Folders: {domain}-{category}
  e.g., sales-etl, finance-reports
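If you want to enforce these conventions in CI, a small checker is enough. The sketch below assumes each segment is lowercase alphanumeric and that frequency comes from a fixed list; both are assumptions for illustration, not rules Lakeflow enforces, and the function names are hypothetical.

```python
import re

# Assumption: the set of accepted frequency suffixes.
FREQUENCIES = {"hourly", "daily", "weekly", "monthly"}

SEGMENT = r"[a-z][a-z0-9]*"
ITEM_NAME = re.compile(rf"{SEGMENT}_{SEGMENT}_({SEGMENT})")   # pipelines and jobs
FOLDER_NAME = re.compile(rf"{SEGMENT}-{SEGMENT}")             # folders

def is_valid_item_name(name: str) -> bool:
    # Three underscore-separated segments, the last one a known frequency.
    match = ITEM_NAME.fullmatch(name)
    return match is not None and match.group(1) in FREQUENCIES

def is_valid_folder_name(name: str) -> bool:
    # Two hyphen-separated segments.
    return FOLDER_NAME.fullmatch(name) is not None

checks = {
    "sales_aggregate_daily": is_valid_item_name("sales_aggregate_daily"),
    "analytics_reporting_weekly": is_valid_item_name("analytics_reporting_weekly"),
    "SalesDaily": is_valid_item_name("SalesDaily"),
    "sales-etl": is_valid_folder_name("sales-etl"),
}
```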
Scheduling Strategy
- Off-peak hours: Schedule heavy jobs during low-usage times
- Dependency chains: Stagger dependent jobs appropriately
- Timezone awareness: Consider your team's working hours
- Buffer time: Allow gaps between scheduled jobs
Troubleshooting
Pipeline Won't Deploy
- Validate the YAML configuration
- Check for syntax errors in SQL queries
- Verify datasource connections exist
- Ensure you have edit permissions
Job Not Running on Schedule
- Verify job status is Active (not Draft or Paused)
- Check cron expression syntax
- Verify timezone setting
- Review scheduler service logs
Task Stuck in Running
- Check task timeout settings
- Review underlying query/script performance
- Cancel the run and investigate
- Check resource limits (memory, CPU)
API Reference
For programmatic access:
# List pipelines
GET /api/lakeflow/pipelines

# Create pipeline
POST /api/lakeflow/pipelines
Content-Type: application/json

{
  "definition": "<yaml>",
  "format": "yaml"
}

# Run job
POST /api/lakeflow/jobs/{id}/run

# Stream run progress (SSE)
GET /api/lakeflow/job-runs/{run_id}/stream
See the API Reference for complete documentation.
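A minimal Python sketch of calling these endpoints with only the standard library follows. The host name is a placeholder, authentication is left out because this document does not describe it, and the event payload is treated as an opaque string since its schema is not shown here. `create_pipeline_request` and `iter_sse_data` are hypothetical helpers.

```python
import json
from urllib.request import Request

BASE_URL = "https://qry.example.com"  # placeholder host, replace with your own

def create_pipeline_request(yaml_definition: str) -> Request:
    # Build (but do not send) the POST request for creating a pipeline.
    body = json.dumps({"definition": yaml_definition, "format": "yaml"})
    return Request(
        f"{BASE_URL}/api/lakeflow/pipelines",
        data=body.encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )

def iter_sse_data(lines):
    # Yield the data payload of each Server-Sent Event.
    # An event ends at a blank line; consecutive data: lines are joined
    # with newlines, following the SSE wire format.
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            buffer.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and buffer:
            yield "\n".join(buffer)
            buffer = []

request = create_pipeline_request("name: sales_daily_summary")
events = list(iter_sse_data(["data: first\n", "\n", "data: a\n", "data: b\n", "\n"]))
# events == ["first", "a\nb"]
```

To send the request, pass it to `urllib.request.urlopen`. For the stream endpoint, decode each line of the response and feed the lines to `iter_sse_data`.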
See Also
- Git Folders - Version control for pipelines and jobs
- DataFlow - AI-native ETL for simpler migrations
- Forge - LLM-driven database migration platform (Teradata/Oracle/Cloudera → BigQuery)
- Scheduled Tasks - For scheduling conversations and reports
- Notebooks - Reusable analysis workflows
- Workspaces - Team collaboration
Last updated: April 2026