Qry Forge: Database Migrations Without the Pain
Status: Production Ready
Overview
Migrating a 30-year-old Teradata warehouse used to mean 18 months, three consultants named Brad, and a project plan that died on contact with reality. The DDL didn't translate cleanly. The procedures had three flavors of MERGE no documentation explained. Half the validation tickets ended in "looks fine, ship it." The steering committee asked for a status update every Monday and got a PowerPoint slide that hadn't changed in six weeks.
Forge is Qry's migration platform for Teradata, Oracle, and Cloudera sources moving to BigQuery (and now Cloudera/Iceberg). It automates the boring parts — DDL translation, schema mapping, row-count validation, gate approvals — and tells you when something genuinely needs human attention. Brad is now free to consult elsewhere.
What it does
Forge organises a migration into Projects that contain ordered Waves of Artifacts. Each artifact is a real database object — a table, view, stored procedure, UDF, or macro — that needs to land on the target. Forge picks the right strategy per artifact (direct copy vs. Lakeflow delegation, pure DDL vs. LLM translation), runs the work, validates the result, and surfaces the outcome to a steering committee through gate approvals.
| Stage | What Forge does | What you do |
|---|---|---|
| Discover | Scan the source, inventory tables/views/procedures, estimate sizes | Pick a project scope |
| Plan | Group artifacts into waves, configure validations and gates | Sanity-check the wave order |
| Translate | Convert DDL and procedural code via LLM, with bounded self-heal | Review tricky cases in the editor |
| Migrate | Direct row-copy for small tables, Lakeflow delegation for big ones | Watch the dashboard |
| Validate | Run row counts, hashes, null rates, business-rule SQL per wave | Approve when checks are green |
| Approve | Steering committee signs off via chat or UI | Click approve, or say "approve gate X" |
Sources and targets
Source dialects
| Source | Discovery | DDL extraction | Translation |
|---|---|---|---|
| Teradata | DBC catalog (perm space, row counts) | SHOW TABLE/VIEW/PROCEDURE | LLM, dialect-aware |
| Oracle | USER_* / ALL_* catalogs | DBMS_METADATA.GET_DDL | LLM, NUMBER → NUMERIC handling |
| Cloudera (Impala/Hive) | SHOW DATABASES/TABLES, SHOW TABLE STATS | SHOW CREATE TABLE/FUNCTION | LLM, complex types fall through to STRING |
Target platforms
| Target | DDL strategy | Bulk write |
|---|---|---|
| BigQuery | CREATE OR REPLACE, dataset-prefixed FQ names | Native client, autodetect off, explicit schema |
| Cloudera/Iceberg | CREATE TABLE IF NOT EXISTS … STORED BY ICEBERG (format-version=2, MERGE INTO) | Direct INSERT batches up to 1M rows |
| Qry Notebook | LLM emits cell list for procedures it can't fully translate | Notebook bound to target datasource |
The qry_notebook target is the escape hatch for stored procedures that don't have a clean cross-dialect mapping. Forge generates a Qry Notebook with markdown context + SQL/Python cells, and a human can finish the conversion in the IDE rather than fighting the translator.
The waves & artifacts model
A project's lifecycle goes:
DRAFT → PLANNING → RUNNING → VALIDATING → APPROVED → COMPLETE
│ │
└─→ FAILED └─→ REJECTED
A wave is a group of artifacts that move together. Waves can declare dependencies on other waves so that you don't migrate a fact table before its dimensions. Each wave runs its artifacts in parallel with a bounded semaphore — fast enough to use the source connection pool, slow enough to not melt it.
Per-artifact, Forge auto-routes:
- Tables ≤10 GB / ≤10 M rows → direct in-memory copy (driver → pandas → target)
- Tables >10 GB / >10 M rows → delegated to a Lakeflow pipeline so external Spark handles the bulk
- Cloudera target tables >1 M rows → fail explicitly (Spark delegation for Cloudera target is on the roadmap)
- Views, UDFs, macros, procedures → translation via LLM, then DDL deploy with bounded self-heal
Translation engine
Forge translates non-table artifacts (views, procedures, UDFs, macros) by sending the source DDL plus dialect-specific guidance to an LLM. The output is validated structurally before the DDL ever hits the target.
Self-heal loop
When a deploy fails — BigQuery rejects a function definition, Iceberg complains about a partition spec — Forge feeds the error back to the LLM and asks for a repair. Bounded by forge_translation.deploy_repair_attempts (default 2, max 5) so it can't loop forever.
Safety gate
Repair output is filtered for destructive tokens (DROP, TRUNCATE, DELETE, GRANT, REVOKE) before re-execution. A hallucinated DROP TABLE customers; won't be quietly applied just because the LLM thought it would help.
Cost tracking
Every translation logs to qry_forge_translation_cost_dollars_total{dialect, target_platform} so you can answer "what did this migration cost in tokens?" without screen-scraping vendor invoices.
Validations
Forge runs validations per wave. Six built-in kinds, with a registry pattern (@register_validator("kind")) for adding your own:
| Kind | Use it for |
|---|---|
row_count | Source vs target counts, with a tolerance % |
aggregate_equality | GROUP BY parity (key columns + value columns) |
schema_equality | Column names + types match the source |
hash_md5 | Deterministic row-hash on a chosen column set |
null_rate | Caps the % of NULL values per column |
business_rule_sql | Arbitrary SQL that must return zero rows to pass |
Validations gate the wave's transition from VALIDATING → ready-for-approval. If a row-count check fails by more than its tolerance, the wave doesn't unlock the gate, period.
Gates and approvals
Once a wave's validations pass, the wave's gates open for approval. Gates have required_approvers with two shapes:
required_approvers:
- {type: user, id: "alice@company.com"}
- {type: group, id: "data_steering_committee"}
A user-typed entry needs that specific user. A group-typed entry resolves at approval time against the actor's groups (a member of the group can approve on behalf of it). Mixing both is fine. This is how the real steering-committee workflow happens — "data leadership" approves the cutover, not "Alice specifically."
Conversational management
Forge ships with a chat tool (forge_migration) that exposes 19 actions covering the full lifecycle. The steering committee can run the project from a conversation without leaving Slack-shaped real estate.
Read
| Action | Question it answers |
|---|---|
list_projects | "What migrations are in flight?" |
get_project | "What's the health of project X?" |
list_waves | "Show me wave statuses" |
get_wave_status | "How is wave 2 doing right now?" |
list_artifacts | "Which tables are in this wave?" |
list_validations | "What checks are configured for wave 2?" |
list_gates | "What gates are pending my approval?" |
list_audit_events | "What changed in the last hour?" |
explain_blocker | "Why is wave 3 stuck?" |
generate_report | "Make me a one-pager for the committee" |
compare_data | "Does the row count on sales_2024 match source?" |
Write
| Action | Spoken as |
|---|---|
start_wave | "Run wave 1" |
retry_artifact | "Retry the failed customer_dim table" |
skip_artifact | "Skip the legacy macro, we'll handle that manually" |
move_artifact | "Move that procedure into wave 3" |
run_validation | "Run the schema_equality check on this wave" |
run_all_validations | "Run all validations on wave 2" |
approve_gate | "Approve the cutover gate" |
reject_gate | "Reject this gate, wave 1 had data drift" |
Every write action passes through MigrationService so the same RBAC, audit logging, and gate-completion logic that runs from the UI runs from chat. Approvals carry the actor's identity into the audit trail — chat is not a backdoor.
Observability
| Surface | Where |
|---|---|
| Metrics | qry_forge_* Prometheus series — waves, artifacts, translation cost, deploy repair counts |
| Dashboard | Grafana UID qry-forge |
| Alerts | PrometheusRule group qry.forge (alert descriptions enforced by a release-check script) |
| Audit | migration_audit_events table, surfaced via list_audit_events |
| SSE | forge.* events streamed to the Run Dashboard for live progress |
Safety nets
Forge has more emergency brakes than a freight train, because it earns the right to.
| Switch | Where | What it does |
|---|---|---|
| Feature flag | runtime_config.forge_translation.enabled | Disables LLM translation per tenant |
| Global kill switch | FORGE_GLOBAL_KILL_SWITCH=1 env var | All Forge endpoints return 404 immediately |
| Repair safety gate | _is_safe_repair_ddl | Rejects destructive DDL from LLM repair |
| Path whitelist | _safe_qualified_path | Blocks SQL injection through artifact paths |
| DDL allowlist | _DESTRUCTIVE_DDL_RE | Word-boundary regex on DROP/TRUNCATE/DELETE/GRANT/REVOKE |
| Rate limits | Per-user, per-action via Nexus rate limiter | Translation: 100/h, discovery scan: 20/h |
The kill switch returns 404 (not 403) on purpose: in a real incident you don't want clients retrying because they think they got rate-limited. They get "this feature doesn't exist," they stop, you fix it.
Getting Started
Navigating to Forge
- Open Qry
- Click Forge in the left navigation rail
- Pick or create a project — you'll land on the Project page with a wave list and the live Run Dashboard
A typical first migration
- Create a project, pick source + target datasources
- Run discovery — Forge scans the source and proposes an artifact inventory
- Group artifacts into waves — usually by domain (one wave = one functional area)
- Configure validations per wave (row_count + schema_equality is a sensible default)
- Set required approvers on each wave's gate
- Start wave 1 — from the UI, or just say "start wave 1" in chat
- Watch the Run Dashboard. When validations complete, approve the gate and Forge unlocks the next wave
From chat, end-to-end
The same flow, but conversationally:
You: List my Forge projects
Forge tool: 1 project — "Teradata→BigQuery Migration", health: amber
You: Why amber?
Forge tool: 2 of 5 waves complete; wave 3 has 1 failed artifact (customer_dim).
You: Explain the blocker on wave 3
Forge tool: customer_dim deploy failed: BigQuery rejected DECIMAL(38,15)
precision. LLM repair attempted twice, both failed. Likely
a NUMBER(38,*) column in source — needs manual review.
You: Move customer_dim to wave 5 and skip it for now
Forge tool: artifact moved + status set to skipped.
You: Approve the wave 3 gate
Forge tool: gate approved. Wave 3 → APPROVED. Wave 4 unlocked.
Reference
For the day-to-day operations playbook, the Forge repo ships two long-form guides:
docs/forge/MIGRATION_GUIDE.md— the full end-to-end flow from project creation through cutover, with example configs and decision treesdocs/forge/RUNBOOK.md— incident response: stuck waves, failed translations, kill-switch usage, common error signatures and their fixes
These live in the main Qry repo because they reference internal endpoints, env vars, and feature-flag paths that change per release. The docs you're reading now are the conceptual overview; the runbook is what you read at 2 AM.
See Also
- Lakeflow — Forge delegates large-table migrations to Lakeflow pipelines under the hood
- DataFlow — Conversational AI-native ETL for ad-hoc data movement (lighter than Forge)
- RBAC — Per-datasource permission model that gates Forge actions
- Workspaces — Team-scoped projects and shared audit trail
- Notebooks — Where the
qry_notebooktranslation target lands
Last updated: April 2026