QRY Forge: Database Migrations Without the Pain

Status: Production Ready

Overview

Migrating a 30-year-old Teradata warehouse used to mean 18 months, three consultants named Brad, and a project plan that died on contact with reality. The DDL didn't translate cleanly. The procedures had three flavors of MERGE no documentation explained. Half the validation tickets ended in "looks fine, ship it." The steering committee asked for a status update every Monday and got a PowerPoint slide that hadn't changed in six weeks.

Forge is QRY's migration platform for Teradata, Oracle, and Cloudera sources moving to BigQuery (and now Cloudera/Iceberg). It automates the boring parts — DDL translation, schema mapping, row-count validation, gate approvals — and tells you when something genuinely needs human attention. Brad is now free to consult elsewhere.

What it does

Forge organises a migration into Projects that contain ordered Waves of Artifacts. Each artifact is a real database object — a table, view, stored procedure, UDF, or macro — that needs to land on the target. Forge picks the right strategy per artifact (direct copy vs. Lakeflow delegation, pure DDL vs. LLM translation), runs the work, validates the result, and surfaces the outcome to a steering committee through gate approvals.

Stage	What Forge does	What you do
Discover	Scan the source, inventory tables/views/procedures, estimate sizes	Pick a project scope
Plan	Group artifacts into waves, configure validations and gates	Sanity-check the wave order
Translate	Convert DDL and procedural code via LLM, with bounded self-heal	Review tricky cases in the editor
Migrate	Direct row-copy for small tables, Lakeflow delegation for big ones	Watch the dashboard
Validate	Run row counts, hashes, null rates, business-rule SQL per wave	Approve when checks are green
Approve	Steering committee signs off via chat or UI	Click approve, or say "approve gate X"

Sources and targets

Source dialects

Source	Discovery	DDL extraction	Translation
Teradata	DBC catalog (perm space, row counts)	`SHOW TABLE/VIEW/PROCEDURE`	LLM, dialect-aware
Oracle	`USER_` / `ALL_` catalogs	`DBMS_METADATA.GET_DDL`	LLM, NUMBER → NUMERIC handling
Cloudera (Impala/Hive)	`SHOW DATABASES/TABLES`, `SHOW TABLE STATS`	`SHOW CREATE TABLE/FUNCTION`	LLM, complex types fall through to STRING

Target platforms

Target	DDL strategy	Bulk write
BigQuery	`CREATE OR REPLACE`, dataset-prefixed FQ names	Native client, autodetect off, explicit schema
Cloudera/Iceberg	`CREATE TABLE IF NOT EXISTS … STORED BY ICEBERG` (format-version=2, MERGE INTO)	Direct INSERT batches up to 1M rows
QRY Notebook	LLM emits cell list for procedures it can't fully translate	Notebook bound to target datasource

The qry_notebook target is the escape hatch for stored procedures that don't have a clean cross-dialect mapping. Forge generates a QRY Notebook with markdown context + SQL/Python cells, and a human can finish the conversion in the IDE rather than fighting the translator.

The waves & artifacts model

A project's lifecycle goes:

DRAFT → PLANNING → RUNNING → VALIDATING → APPROVED → COMPLETE
                       │            │
                       └─→ FAILED   └─→ REJECTED

A wave is a group of artifacts that move together. Waves can declare dependencies on other waves so that you don't migrate a fact table before its dimensions. Each wave runs its artifacts in parallel with a bounded semaphore — fast enough to use the source connection pool, slow enough to not melt it.

Per-artifact, Forge auto-routes:

Tables ≤10 GB / ≤10 M rows → direct in-memory copy (driver → pandas → target)
Tables >10 GB / >10 M rows → delegated to a Lakeflow pipeline so external Spark handles the bulk
Cloudera target tables >1 M rows → fail explicitly (Spark delegation for Cloudera target is on the roadmap)
Views, UDFs, macros, procedures → translation via LLM, then DDL deploy with bounded self-heal

Translation engine

Forge translates non-table artifacts (views, procedures, UDFs, macros) by sending the source DDL plus dialect-specific guidance to an LLM. The output is validated structurally before the DDL ever hits the target.

Self-heal loop

When a deploy fails — BigQuery rejects a function definition, Iceberg complains about a partition spec — Forge feeds the error back to the LLM and asks for a repair. Bounded by forge_translation.deploy_repair_attempts (default 2, max 5) so it can't loop forever.

Safety gate

Repair output is filtered for destructive tokens (DROP, TRUNCATE, DELETE, GRANT, REVOKE) before re-execution. A hallucinated DROP TABLE customers; won't be quietly applied just because the LLM thought it would help.

Cost tracking

Every translation logs to qry_forge_translation_cost_dollars_total{dialect, target_platform} so you can answer "what did this migration cost in tokens?" without screen-scraping vendor invoices.

Validations

Forge runs validations per wave. Six built-in kinds, with a registry pattern (@register_validator("kind")) for adding your own:

Kind	Use it for
`row_count`	Source vs target counts, with a tolerance %
`aggregate_equality`	GROUP BY parity (key columns + value columns)
`schema_equality`	Column names + types match the source
`hash_md5`	Deterministic row-hash on a chosen column set
`null_rate`	Caps the % of NULL values per column
`business_rule_sql`	Arbitrary SQL that must return zero rows to pass

Validations gate the wave's transition from VALIDATING → ready-for-approval. If a row-count check fails by more than its tolerance, the wave doesn't unlock the gate, period.

Gates and approvals

Once a wave's validations pass, the wave's gates open for approval. Gates have required_approvers with two shapes:

required_approvers:
  - {type: user,  id: "alice@company.com"}
  - {type: group, id: "data_steering_committee"}

A user-typed entry needs that specific user. A group-typed entry resolves at approval time against the actor's groups (a member of the group can approve on behalf of it). Mixing both is fine. This is how the real steering-committee workflow happens — "data leadership" approves the cutover, not "Alice specifically."

Conversational management

Forge ships with a chat tool (forge_migration) that exposes 19 actions covering the full lifecycle. The steering committee can run the project from a conversation without leaving Slack-shaped real estate.

Read

Action	Question it answers
`list_projects`	"What migrations are in flight?"
`get_project`	"What's the health of project X?"
`list_waves`	"Show me wave statuses"
`get_wave_status`	"How is wave 2 doing right now?"
`list_artifacts`	"Which tables are in this wave?"
`list_validations`	"What checks are configured for wave 2?"
`list_gates`	"What gates are pending my approval?"
`list_audit_events`	"What changed in the last hour?"
`explain_blocker`	"Why is wave 3 stuck?"
`generate_report`	"Make me a one-pager for the committee"
`compare_data`	"Does the row count on `sales_2024` match source?"

Write

Action	Spoken as
`start_wave`	"Run wave 1"
`retry_artifact`	"Retry the failed `customer_dim` table"
`skip_artifact`	"Skip the legacy macro, we'll handle that manually"
`move_artifact`	"Move that procedure into wave 3"
`run_validation`	"Run the schema_equality check on this wave"
`run_all_validations`	"Run all validations on wave 2"
`approve_gate`	"Approve the cutover gate"
`reject_gate`	"Reject this gate, wave 1 had data drift"

Every write action passes through MigrationService so the same RBAC, audit logging, and gate-completion logic that runs from the UI runs from chat. Approvals carry the actor's identity into the audit trail — chat is not a backdoor.

Observability

Surface	Where
Metrics	`qry_forge_*` Prometheus series — waves, artifacts, translation cost, deploy repair counts
Dashboard	Grafana UID `qry-forge`
Alerts	PrometheusRule group `qry.forge` (alert descriptions enforced by a release-check script)
Audit	`migration_audit_events` table, surfaced via `list_audit_events`
SSE	`forge.*` events streamed to the Run Dashboard for live progress

Safety nets

Forge has more emergency brakes than a freight train, because it earns the right to.

Switch	Where	What it does
Feature flag	`runtime_config.forge_translation.enabled`	Disables LLM translation per tenant
Global kill switch	`FORGE_GLOBAL_KILL_SWITCH=1` env var	All Forge endpoints return 404 immediately
Repair safety gate	`_is_safe_repair_ddl`	Rejects destructive DDL from LLM repair
Path whitelist	`_safe_qualified_path`	Blocks SQL injection through artifact paths
DDL allowlist	`_DESTRUCTIVE_DDL_RE`	Word-boundary regex on `DROP`/`TRUNCATE`/`DELETE`/`GRANT`/`REVOKE`
Rate limits	Per-user, per-action via Nexus rate limiter	Translation: 100/h, discovery scan: 20/h

The kill switch returns 404 (not 403) on purpose: in a real incident you don't want clients retrying because they think they got rate-limited. They get "this feature doesn't exist," they stop, you fix it.

Getting Started

Navigating to Forge

Open QRY
Click Forge in the left navigation rail
Pick or create a project — you'll land on the Project page with a wave list and the live Run Dashboard

A typical first migration

Create a project, pick source + target datasources
Run discovery — Forge scans the source and proposes an artifact inventory
Group artifacts into waves — usually by domain (one wave = one functional area)
Configure validations per wave (row_count + schema_equality is a sensible default)
Set required approvers on each wave's gate
Start wave 1 — from the UI, or just say "start wave 1" in chat
Watch the Run Dashboard. When validations complete, approve the gate and Forge unlocks the next wave

From chat, end-to-end

The same flow, but conversationally:

You: List my Forge projects
Forge tool: 1 project — "Teradata→BigQuery Migration", health: amber

You: Why amber?
Forge tool: 2 of 5 waves complete; wave 3 has 1 failed artifact (customer_dim).

You: Explain the blocker on wave 3
Forge tool: customer_dim deploy failed: BigQuery rejected DECIMAL(38,15)
            precision. LLM repair attempted twice, both failed. Likely
            a NUMBER(38,*) column in source — needs manual review.

You: Move customer_dim to wave 5 and skip it for now
Forge tool: artifact moved + status set to skipped.

You: Approve the wave 3 gate
Forge tool: gate approved. Wave 3 → APPROVED. Wave 4 unlocked.

Reference

For the day-to-day operations playbook, the Forge repo ships two long-form guides:

docs/forge/MIGRATION_GUIDE.md — the full end-to-end flow from project creation through cutover, with example configs and decision trees
docs/forge/RUNBOOK.md — incident response: stuck waves, failed translations, kill-switch usage, common error signatures and their fixes

These live in the main QRY repo because they reference internal endpoints, env vars, and feature-flag paths that change per release. The docs you're reading now are the conceptual overview; the runbook is what you read at 2 AM.

Overview​

What it does​

Sources and targets​

Source dialects​

Target platforms​

The waves & artifacts model​

Translation engine​

Self-heal loop​

Safety gate​

Cost tracking​

Validations​

Gates and approvals​

Conversational management​

Read​

Write​

Observability​

Safety nets​

Getting Started​

Navigating to Forge​

A typical first migration​

From chat, end-to-end​

Reference​

See Also​