QRY Lakeflow

Status: Production Ready

Overview

Remember when building data pipelines meant spending weeks wrestling with Airflow DAGs, debugging cryptic YAML errors at 2 AM, and explaining to your manager why "it works on my machine" doesn't apply to production? Those days are over.

Lakeflow is QRY's data orchestration engine for building, scheduling, and monitoring data pipelines and jobs. Think Databricks Lakeflow meets a UX designer who actually uses the product. Define complex data workflows using simple YAML or drag-and-drop visual editing: your choice, and yes, they sync automatically.

Key Features

Professional IDE Experience

Lakeflow features a full IDE interface inspired by VS Code and modern development tools. Work with multiple pipelines and jobs simultaneously using tabs, keyboard shortcuts, and powerful navigation.

Dual Editing Modes

Switch between YAML and visual drag-and-drop editing whenever you want. Changes in one mode automatically sync to the other. It's like having your cake and eating it too, except the cake is a perfectly configured ETL pipeline.

Interactive DAG Canvas

Design task workflows with drag-and-drop connections. No more drawing diagrams on whiteboards that nobody updates. Your DAG is always accurate because it is the configuration.

Real-Time Monitoring

Watch execution progress live with animated task nodes. Blue pulse means running, green means success, red means "time to investigate." It's like watching your data do a choreographed dance, except occasionally one dancer trips.

AI-Assisted Authoring

Tell Lakeflow what you want in plain English, and it generates the YAML for you. "Create a pipeline that aggregates daily sales by region" becomes a working configuration in seconds. Your keyboard thanks you.

Stage-Based Pipeline Design

A visual Source → Transform → Expectations → Sink flow makes data lineage obvious. Even your manager can follow along during demos.

Git Integration

Version control your pipelines and jobs with Git Folders. Track changes, collaborate with teammates, and integrate with CI/CD workflows.

Two Core Abstractions

Lakeflow gives you two powerful building blocks:

Type	Purpose	Think of it as...
Pipelines	Single-purpose data transformations (ETL/ELT)	A well-trained specialist
Jobs	Multi-task orchestrations with dependencies	A conductor leading an orchestra

Getting Started

Navigating to Lakeflow

Open QRY
Click Lakeflow in the left navigation rail
Marvel at the dashboard showing all your pipelines and jobs

The IDE Interface

Lakeflow uses a three-panel IDE layout similar to VS Code:

┌─────────────────────────────────────────────────────────────┐
│  Tab Bar (open pipelines/jobs, drag to reorder)             │
├──────────┬────────────────────────────────┬─────────────────┤
│          │                                │                 │
│  File    │     Editor Area                │  Right Rail     │
│ Explorer │     (YAML or Visual)           │  (panel icons)  │
│          │                                │                 │
│  Folders │                                ├─────────────────┤
│  & Items │                                │  Panel Content  │
│          │                                │  (contextual)   │
└──────────┴────────────────────────────────┴─────────────────┘

Components:

Tab Bar: Open multiple pipelines/jobs simultaneously, drag tabs to reorder
File Explorer: Browse folders, search items, filter by type/status
Editor Area: Main workspace for YAML editing or visual canvas
Right Rail: Quick access to Run History, Settings, Validation
Panel Area: Contextual panels that slide in from the right

Keyboard Shortcuts

Master these shortcuts to work efficiently:

Shortcut	Action
`Cmd+S`	Save current item
`Cmd+Enter`	Validate configuration
`Cmd+Shift+Enter`	Deploy/Activate
`Cmd+P`	Quick open (search items)
`Cmd+B`	Toggle sidebar
`Cmd+\`	Toggle split view

The Interface at a Glance

Area	What it does
Toolbar	Create new items, filter by status/workspace, bulk actions
Folder Tree	Organize pipelines and jobs (yes, folders actually work here)
Item List	Table or card view - your choice
Status Indicators	Visual badges so you know what's running, broken, or waiting

Pipelines: Your Data Transformation Workhorses

Pipelines are single-purpose data transformation workflows that move and transform data between sources and targets. One pipeline, one purpose, done well.

Pipeline Lifecycle

DRAFT → DEPLOYED → DEPRECATED

Draft: Work in progress, edit freely
Deployed: Production-ready, running in the wild
Deprecated: Retirement home for pipelines you can't quite delete yet

Creating a Pipeline

Click + New Pipeline in the toolbar
Choose your preferred editing mode (YAML or Visual)
Define your pipeline:

name: sales_daily_summary
description: "Aggregate daily sales by region"

source:
  datasource: bigquery
  catalog: my_project
  schema: raw_data

target:
  catalog: my_project
  schema: analytics

tables:
  - name: daily_sales
    type: live_table
    query: |
      SELECT
        DATE(transaction_date) as sale_date,
        region,
        SUM(amount) as total_amount,
        COUNT(*) as transaction_count
      FROM source.transactions
      WHERE transaction_date >= CURRENT_DATE - 30
      GROUP BY 1, 2

Click Validate to catch errors before they catch you
Click Save to preserve your work
Click Deploy when you're ready for prime time

Visual Pipeline Editor

The Pipeline Editor shows a linear stage-based flow:

Source → Transform → Expectations → Sink

Each stage is a clickable card:

Stage	Color	What you configure
Source	Cyan	Where data comes from, query, incremental settings
Transform	Purple	SQL or Python transformations
Expectations	Amber	Data quality checks (because garbage in = garbage out)
Sink	Green	Where data lands, write mode, merge keys

Pipeline Table Types

Type	When to use
`live_table`	Real-time aggregations, computed on-demand
`materialized_view`	Performance optimization, pre-computed results
`streaming`	Continuous ingestion, real-time data feeds

Jobs: Orchestrating the Orchestra

Jobs combine multiple tasks into complex workflows. Tasks can depend on each other, forming a directed acyclic graph (DAG) - fancy words for "things happen in the right order."

Job Lifecycle

DRAFT → ACTIVE ⟷ PAUSED → ARCHIVED

Draft: Build and test without affecting production
Active: Scheduled and running
Paused: Taking a break, preserves schedule
Archived: Soft deleted (in case you change your mind)
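The lifecycle above is a small state machine, and it can help to see the arrows written out. The following is a minimal Python sketch, not Lakeflow's implementation: `JobState`, `ALLOWED`, and `transition` are hypothetical names, and the transition table assumes the diagram is complete as drawn (for example, that ARCHIVED is reachable only from PAUSED).

```python
from enum import Enum


class JobState(Enum):
    DRAFT = "draft"
    ACTIVE = "active"
    PAUSED = "paused"
    ARCHIVED = "archived"


# Allowed transitions, read literally from the lifecycle diagram.
# Assumption: the diagram shows every arrow, so ARCHIVED is only
# reachable from PAUSED and nothing leaves ARCHIVED.
ALLOWED = {
    JobState.DRAFT: {JobState.ACTIVE},
    JobState.ACTIVE: {JobState.PAUSED},
    JobState.PAUSED: {JobState.ACTIVE, JobState.ARCHIVED},
    JobState.ARCHIVED: set(),
}


def transition(current: JobState, target: JobState) -> JobState:
    """Return the new state, or raise if the diagram has no such arrow."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target
```

Under these assumptions, pausing and resuming a job is a round trip between ACTIVE and PAUSED, while jumping straight from DRAFT to ARCHIVED raises an error.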

Creating a Job

Click + New Job in the toolbar
Define your orchestration:

name: daily_analytics_workflow
description: "End-to-end daily analytics processing"

schedule:
  cron: "0 6 * * *"    # 6 AM UTC daily
  timezone: "UTC"

tasks:
  - name: extract_data
    type: pipeline
    pipeline_name: raw_data_ingestion
    timeout_seconds: 1800

  - name: transform_data
    type: pipeline
    pipeline_name: sales_daily_summary
    depends_on:
      - extract_data
    timeout_seconds: 3600

  - name: generate_report
    type: prompt
    prompt_config:
      prompt: "Analyze today's sales data and summarize key insights"
      model: "gemini-2.0-flash"
      context:
        include_upstream_results: true
    depends_on:
      - transform_data

  - name: notify_team
    type: notification
    notification_config:
      channels:
        - type: email
          recipients:
            - analytics@company.com
          message: "Daily analytics job completed successfully"
    depends_on:
      - generate_report

Click Validate, then Save, then Activate
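To see how `depends_on` turns a task list into a run order, here is a minimal sketch using Python's standard `graphlib` module (Python 3.9+). It illustrates the DAG idea only and does not describe how Lakeflow's scheduler works internally. `DEPENDS_ON` mirrors the job definition above, and `execution_order` is a hypothetical helper name.

```python
from graphlib import CycleError, TopologicalSorter

# Task name -> names it depends on, copied from the job definition above.
DEPENDS_ON = {
    "extract_data": [],
    "transform_data": ["extract_data"],
    "generate_report": ["transform_data"],
    "notify_team": ["generate_report"],
}


def execution_order(depends_on: dict[str, list[str]]) -> list[str]:
    """Return one valid run order, or raise ValueError for a bad graph."""
    known = set(depends_on)
    for task, deps in depends_on.items():
        missing = set(deps) - known
        if missing:
            raise ValueError(f"{task} depends on unknown task(s): {sorted(missing)}")
    try:
        # static_order() yields each task only after all of its dependencies.
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as exc:
        # exc.args[1] holds the list of task names that form the cycle.
        raise ValueError(f"dependency cycle: {exc.args[1]}") from exc


print(execution_order(DEPENDS_ON))
# ['extract_data', 'transform_data', 'generate_report', 'notify_team']
```

A check like this rejects both cycles and references to tasks that do not exist, which is the "acyclic" part of the DAG in practice.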

Task Types

Lakeflow supports five task types for maximum flexibility:

Pipeline Task

Run a Lakeflow pipeline as part of your job.

- name: run_etl
  type: pipeline
  pipeline_name: my_pipeline
  timeout_seconds: 3600

Prompt Task

Execute an AI prompt - yes, you can have AI analyze your data as part of the workflow.

- name: analyze_data
  type: prompt
  prompt_config:
    prompt: "Analyze the data and provide insights"
    system: "You are a data analyst"
    model: "gemini-2.0-flash"
    tools:
      - DatabaseTool
      - PythonTool
    context:
      include_upstream_results: true
    output:
      format: "markdown"
      max_tokens: 4000

Python Task

Run custom Python code in a sandboxed environment.

- name: custom_processing
  type: python
  python_config:
    script: |
      import pandas as pd

      # Access upstream results
      upstream_data = context.get('upstream_results', {})

      # Your custom logic
      result = {"processed": True}
      print(f"Processed data: {result}")
    requirements:
      - pandas>=2.0.0
    timeout_seconds: 600

Notification Task

Send alerts when things happen (or don't).

- name: send_alert
  type: notification
  notification_config:
    channels:
      - type: email
        recipients:
          - team@company.com
        message: "Job completed with status: {{ job.status }}"
        subject: "Daily Job Update"
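
The `{{ job.status }}` placeholder in the message is filled in at run time. The sketch below shows one way such a substitution could work, assuming a simple dotted-path lookup. The real template engine, the available variables, and the status values are not documented here, so `render`, `PLACEHOLDER`, and the `"SUCCESS"` value are illustrative only.

```python
import re

# Matches "{{ dotted.path }}" with optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render(template: str, variables: dict) -> str:
    """Replace each placeholder with the value found at its dotted path."""
    def lookup(match: re.Match) -> str:
        value = variables
        for key in match.group(1).split("."):
            value = value[key]  # KeyError here means an unknown variable
        return str(value)

    return PLACEHOLDER.sub(lookup, template)


# Hypothetical status value, for illustration only.
print(render("Job completed with status: {{ job.status }}",
             {"job": {"status": "SUCCESS"}}))
# Job completed with status: SUCCESS
```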

Condition Task

Control flow based on upstream results - because sometimes you need if/else in your pipelines.

- name: check_quality
  type: condition
  condition_config:
    expression: "upstream.data_quality.score > 0.95"
    on_true: continue
    on_false: skip_downstream

Visual Job Editor

The Job Editor provides an interactive DAG canvas:

Task Palette (left sidebar):

Task Type	Icon	Color
Pipeline	Workflow	Indigo
Prompt	Message	Purple
Python	Code	Green
Notification	Bell	Orange
Condition	Git branch	Slate

Creating Tasks:

Drag and Drop: Grab a task type from the palette, drop it on the canvas
YAML Editing: Switch to YAML mode, add your task, watch the visual DAG update

Connecting Tasks:

Hover over a task node
Drag from the bottom handle
Connect to another task's top handle
The depends_on relationship is created automatically

Interactive DAG Features

Feature	How
Pan	Click and drag on empty canvas
Zoom	Mouse wheel or pinch
MiniMap	Overview navigation in corner
Select	Click nodes, Shift+click for multi-select

Scheduling

Use standard cron expressions with timezone support:

schedule:
  cron: "0 6 * * *"      # Daily at 6 AM
  timezone: "America/New_York"

Common Patterns:

Pattern	When it runs
`0 * * * *`	Every hour
`0 6 * * *`	Daily at 6 AM
`0 6 * * 1`	Every Monday at 6 AM
`0 6 1 * *`	First day of month at 6 AM
`*/15 * * * *`	Every 15 minutes

Format: minute hour day month weekday
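
If you want to sanity-check an expression before activating a job, a tiny matcher is enough. The sketch below implements the classic five-field rules for `*`, `*/n`, ranges, and comma lists, using only the standard library. It is not Lakeflow's scheduler, it does not handle names like `MON` or the `7` alias for Sunday, and it expects a datetime already converted to the schedule's timezone. `field_matches` and `cron_matches` are hypothetical helper names.

```python
from datetime import datetime


def field_matches(field: str, value: int, minimum: int = 0) -> bool:
    """Match one cron field; supports '*', '*/n', 'a-b', and comma lists."""
    for part in field.split(","):
        if part == "*":
            return True
        if part.startswith("*/"):
            # Steps count from the field's lowest value (0 or 1).
            if (value - minimum) % int(part[2:]) == 0:
                return True
        elif "-" in part:
            low, high = map(int, part.split("-"))
            if low <= value <= high:
                return True
        elif int(part) == value:
            return True
    return False


def cron_matches(expr: str, when: datetime) -> bool:
    """Return True if `when` falls on a minute selected by `expr`."""
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {expr!r}")
    minute, hour, dom, month, dow = fields
    # Cron counts Sunday as 0; datetime.weekday() counts Monday as 0.
    cron_weekday = (when.weekday() + 1) % 7
    dom_ok = field_matches(dom, when.day, minimum=1)
    dow_ok = field_matches(dow, cron_weekday)
    # Classic cron rule: if both day fields are restricted, either may match.
    if dom != "*" and dow != "*":
        day_ok = dom_ok or dow_ok
    else:
        day_ok = dom_ok and dow_ok
    return (field_matches(minute, when.minute)
            and field_matches(hour, when.hour)
            and field_matches(month, when.month, minimum=1)
            and day_ok)
```

For example, `cron_matches("0 6 * * 1", datetime(2024, 1, 1, 6, 0))` is true because 1 January 2024 was a Monday, and an expression with a missing field raises a `ValueError` instead of silently never firing.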

Real-Time Execution Monitoring

When a job runs, the DAG comes alive:

Status	Appearance
Pending	Gray nodes
Running	Blue nodes with pulse animation
Completed	Green nodes
Failed	Red nodes
Skipped	Gray nodes with reduced opacity

Edge animations show data flow direction. It's oddly satisfying to watch.

Folder Organization

Keep your pipelines and jobs organized:

Click + next to "Folders" in the sidebar
Configure:
- Name: URL-safe slug (sales-etl)
- Display Name: Human-readable (Sales ETL Pipelines)
- Color: Visual identifier
- Icon: Choose from Lucide icons

Move items via drag-and-drop, context menu, or bulk actions.

Git Folders

Folders can be Git-enabled for version control:

Create a folder or select existing one
Click Enable Git in folder settings
Optionally connect to a remote repository

Git Folders provide:

Version history for all pipelines and jobs inside
Branch management for safe experimentation
Remote sync with GitHub, GitLab, Bitbucket
CI/CD integration for automated deployments

See the Git Folders documentation for complete details.

Workspace Integration

Lakeflow integrates with QRY Workspaces for team collaboration:

Personal: Visible only to you
Workspace: Shared with team members

Permissions follow the usual pattern: View, Execute, Edit, Admin.

AI-Assisted YAML Generation

Open the AI Assistant from the toolbar and describe what you want:

"Create a pipeline that aggregates daily sales by product category
from the transactions table in BigQuery and stores results in
the analytics schema"

The AI generates the YAML. Click Apply. Done.

Example Prompts:

Pipeline:

Create a pipeline to deduplicate customer records from the raw_customers
table based on email, keeping the most recent entry

Job:

Build a job that runs every Monday at 9 AM to:
Refresh the weekly sales pipeline
Generate an AI summary of sales trends
Email the report to the sales team

Best Practices

Pipeline Design

Single Responsibility: One pipeline, one purpose
Idempotency: Design to be safely re-runnable
Data Quality Checks: Use expectations to catch issues early
Documentation: Future you will thank present you

Job Orchestration

Modular Tasks: Break complex workflows into discrete steps
Realistic Timeouts: Don't guess, measure
Retry Configuration: Handle transient failures gracefully
Notifications: Alert on failures before your users do

Naming Conventions

Pipelines:  {domain}_{action}_{frequency}
            e.g., sales_aggregate_daily

Jobs:       {domain}_{workflow}_{frequency}
            e.g., analytics_reporting_weekly

Folders:    {domain}-{category}
            e.g., sales-etl, finance-reports
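
Conventions stick better when something checks them. Here is a minimal sketch of a name linter for the patterns above, assuming lowercase segments and a fixed list of frequency suffixes. Both are assumptions: Lakeflow does not enforce these rules as far as this document states, and `FREQUENCIES`, `ITEM_NAME`, and `FOLDER_NAME` are hypothetical.

```python
import re

# Assumption: these are the only frequency suffixes your team uses.
FREQUENCIES = ("hourly", "daily", "weekly", "monthly")

# {domain}_{action-or-workflow}_{frequency}, all lowercase snake_case.
ITEM_NAME = re.compile(
    rf"^[a-z][a-z0-9]*(?:_[a-z0-9]+)+_(?:{'|'.join(FREQUENCIES)})$"
)

# {domain}-{category}, lowercase and hyphen-separated.
FOLDER_NAME = re.compile(r"^[a-z][a-z0-9]*(?:-[a-z0-9]+)+$")


def is_valid_item_name(name: str) -> bool:
    """Check a pipeline or job name against the convention."""
    return ITEM_NAME.fullmatch(name) is not None


def is_valid_folder_name(name: str) -> bool:
    """Check a folder slug against the convention."""
    return FOLDER_NAME.fullmatch(name) is not None
```

A check like this could run in CI against a Git Folder, flagging names such as `SalesDaily` or `sales_daily` (no action segment) before they reach production.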

Scheduling Strategy

Off-peak hours: Schedule heavy jobs during low-usage times
Dependency chains: Stagger dependent jobs appropriately
Timezone awareness: Consider your team's working hours
Buffer time: Allow gaps between scheduled jobs

Troubleshooting

Pipeline Won't Deploy

Validate the YAML configuration
Check for syntax errors in SQL queries
Verify datasource connections exist
Ensure you have edit permissions

Job Not Running on Schedule

Verify job status is Active (not Draft or Paused)
Check cron expression syntax
Verify timezone setting
Review scheduler service logs

Task Stuck in Running

Check task timeout settings
Review underlying query/script performance
Cancel the run and investigate
Check resource limits (memory, CPU)

API Reference

For programmatic access:

# List pipelines
GET /api/lakeflow/pipelines

# Create pipeline
POST /api/lakeflow/pipelines
Content-Type: application/json
{
  "definition": "<yaml>",
  "format": "yaml"
}

# Run job
POST /api/lakeflow/jobs/{id}/run

# Stream run progress (SSE)
GET /api/lakeflow/job-runs/{run_id}/stream

See the API Reference for complete documentation.
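
As a starting point for scripting against these endpoints, here is a standard-library Python sketch. It builds the requests shown above and parses a Server-Sent Events stream into raw `data` payloads. `BASE_URL` is a hypothetical host, authentication is not covered here (check the API Reference), and the shape of each event payload is not documented in this section, so the parser returns plain strings.

```python
import json
from urllib.request import Request

BASE_URL = "https://qry.example.com"  # hypothetical host, replace with yours


def create_pipeline_request(yaml_definition: str) -> Request:
    """Build the POST request that creates a pipeline from YAML."""
    body = json.dumps({"definition": yaml_definition, "format": "yaml"})
    return Request(
        f"{BASE_URL}/api/lakeflow/pipelines",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_job_request(job_id: str) -> Request:
    """Build the POST request that triggers a job run."""
    return Request(f"{BASE_URL}/api/lakeflow/jobs/{job_id}/run", method="POST")


def parse_sse(lines):
    """Yield the data payload of each event from an iterable of text lines."""
    data = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if data:
                yield "\n".join(data)
                data = []
        elif line.startswith("data:"):
            value = line[5:]
            # Per the SSE format, one leading space after the colon is dropped.
            data.append(value[1:] if value.startswith(" ") else value)
        # Comment lines (":...") and other fields (event, id, retry) are ignored.
```

Each `Request` can be passed to `urllib.request.urlopen` once the host and credentials are in place. For the stream endpoint, decode the response line by line and feed the lines to `parse_sse`.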
