Pre-processed profiling
Without profiling, when a user asks "what's the distribution of country in customers?", QRY has to either run an exploratory query first (slow, costs query time) or guess from column metadata alone (often wrong on real data).
Pre-processed profiling is a background process that computes column statistics — distinct counts, top values, null rates, numeric distributions, date ranges — and stores them so they're already in the LLM's context every time someone queries the table.
The result: better answers, fewer round-trips, lower query cost.
What gets profiled
For each table, the profiler computes per-column statistics:
| Column type | Stats stored |
|---|---|
| Text / categorical | distinct count, top-K most frequent values, null rate |
| Numeric | min, max, mean, median, stddev, percentiles, null rate |
| Date / timestamp | min date, max date, granularity (daily / hourly / mixed), null rate |
| Boolean | true/false counts, null rate |
| Mixed / unknown | distinct count, sample of values |
Statistics are stored in `table_profiling_results` with a `last_profiled_at` timestamp.
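The per-type statistics in the table above map naturally onto a handful of SQL queries per column. A minimal sketch of what such queries could look like, assuming a generic SQL dialect — the function name and SQL shapes are illustrative, not QRY's actual implementation (median/percentiles are omitted because their SQL syntax is dialect-specific):

```python
# Hypothetical sketch of per-column profiling SQL. Illustrative only;
# QRY's real profiler queries are internal.

def profiling_queries(table: str, column: str, col_type: str, top_k: int = 10) -> list[str]:
    """Return the stat-gathering queries for one column, keyed by its type."""
    null_rate = (
        f"SELECT AVG(CASE WHEN {column} IS NULL THEN 1.0 ELSE 0.0 END) "
        f"AS null_rate FROM {table}"
    )
    if col_type in ("text", "categorical"):
        return [
            f"SELECT COUNT(DISTINCT {column}) AS distinct_count FROM {table}",
            # Top-K most frequent values.
            f"SELECT {column}, COUNT(*) AS freq FROM {table} "
            f"GROUP BY {column} ORDER BY freq DESC LIMIT {top_k}",
            null_rate,
        ]
    if col_type == "numeric":
        return [
            f"SELECT MIN({column}), MAX({column}), AVG({column}), "
            f"STDDEV({column}) FROM {table}",
            null_rate,
        ]
    if col_type in ("date", "timestamp"):
        return [f"SELECT MIN({column}), MAX({column}) FROM {table}", null_rate]
    if col_type == "boolean":
        return [f"SELECT {column}, COUNT(*) FROM {table} GROUP BY {column}", null_rate]
    # Mixed / unknown: distinct count plus a small sample of values.
    return [
        f"SELECT COUNT(DISTINCT {column}) FROM {table}",
        f"SELECT {column} FROM {table} LIMIT 20",
        null_rate,
    ]
```

This is also where the cost intuition comes from: a handful of queries per column type, multiplied across every table in the datasource.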
Configuring profiling per datasource
In Admin > Datasources > {datasource}, open the Profiling tab.
Enable
Profiling is opt-in per datasource. Off by default — the upfront cost (a few queries per table) isn't always worth it for fast-changing or huge databases.
Schedule
When enabled, configure cadence:
- Hourly — for tables that change frequently and freshness of stats matters.
- Daily — typical default.
- Weekly — for stable reference data.
- Manual only — admin triggers refresh on demand.
The BatchProfilingService runs on a Celery worker. Profiling jobs are scheduled in Celery Beat with the chosen cadence.
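A sketch of how the four cadences could map onto a Beat-style schedule — the task name, queue layout, and schedule shape here are assumptions for illustration, not QRY's actual configuration:

```python
# Illustrative mapping of profiling cadences onto a Celery-Beat-style
# schedule dict. "profiling.run_batch" is a hypothetical task name.
from datetime import timedelta

CADENCES = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    # "manual" datasources get no Beat entry; an admin triggers them on demand.
}

def beat_schedule(datasources: dict[str, str]) -> dict:
    """Build a Beat-style schedule from {datasource_id: cadence}."""
    schedule = {}
    for ds_id, cadence in datasources.items():
        if cadence == "manual":
            continue
        schedule[f"profile-{ds_id}"] = {
            "task": "profiling.run_batch",  # hypothetical task name
            "schedule": CADENCES[cadence],
            "args": (ds_id,),
        }
    return schedule
```

The key design point is that "manual only" means no Beat entry at all, not a very long interval — a manual-only datasource is never profiled unless an admin asks.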
Sampling
For huge tables, full-table profiling is too expensive. Configure a sample size (default 1M rows) — the profiler runs on the sample, results are flagged as approximate.
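The shape of sampled profiling can be sketched in a few lines — compute stats on a random sample and flag the result as approximate whenever the table exceeds the sample size. This is a simplified illustration, not QRY's sampling strategy:

```python
# Minimal sketch of sampled profiling. Note that a distinct count taken
# on a sample is only a lower bound on the true distinct count.
import random

def sample_profile(values: list, sample_size: int = 1_000_000, seed: int = 0) -> dict:
    """Profile `values`, sampling when the table exceeds `sample_size`."""
    if len(values) <= sample_size:
        # Small enough: profile everything, results are exact.
        return {"approximate": False, "distinct": len(set(values))}
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    # Flag so downstream consumers (and the LLM prompt) know it's approximate.
    return {"approximate": True, "distinct": len(set(sample))}
```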
Per-table overrides
The default cadence and sample size apply to all tables in the datasource. Override per-table for the noisy outliers:
- Tables that change every few minutes → hourly profile.
- Tables that almost never change but are huge → manual only.
- Reference tables (countries, currencies) → small enough to skip sampling.
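The override semantics are simple shadowing: a table inherits the datasource defaults unless it sets its own value. A hypothetical sketch — the table names, keys, and defaults here are illustrative:

```python
# Hypothetical per-table override resolution: table settings shadow
# datasource defaults key by key.
DATASOURCE_DEFAULTS = {"cadence": "daily", "sample_size": 1_000_000}

TABLE_OVERRIDES = {
    "events_stream": {"cadence": "hourly"},  # changes every few minutes
    "archive_2019": {"cadence": "manual"},   # huge, almost never changes
    "countries": {"sample_size": None},      # small enough to skip sampling
}

def effective_config(table: str) -> dict:
    """Merge datasource defaults with any per-table override."""
    return {**DATASOURCE_DEFAULTS, **TABLE_OVERRIDES.get(table, {})}
```

Note that an override only replaces the keys it names: `countries` above still inherits the daily cadence.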
How it shows up to users
When a user opens a conversation against a profiled table, the LLM sees something like:
```
Table: customers (5,000 rows, last profiled 2026-05-04 09:00 UTC)
Columns:
customer_id (text) — 5,000 distinct (unique)
country (text) — 47 distinct, top: ES (28%), DE (14%), FR (11%), GB (9%)
balance (numeric) — min: 0.00, max: 1,247,300.00, mean: 8,432.50, median: 4,100.00
created_at (timestamp) — 2018-01-01 → 2026-05-03, daily granularity
is_active (boolean) — true: 4,138 (83%), false: 862 (17%)
...
```
This is what makes QRY's first answer about a table accurate without an exploratory query.
Operational notes
Cost
Profiling runs SQL queries against your datasources. Query cost adds up across many tables — for a BigQuery datasource with 200 tables, daily profiling can cost real money. Sample, schedule weekly, or selectively enable per table.
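The arithmetic behind that warning is straightforward. Assuming roughly five stat queries per table (an illustrative figure; the real number depends on column mix):

```python
# Back-of-envelope profiling query volume. The 5-queries-per-table
# figure is an assumption for illustration.
def daily_profiling_queries(tables: int, queries_per_table: int = 5) -> int:
    """Queries issued per day by a daily profiling schedule."""
    return tables * queries_per_table
```

For the 200-table BigQuery example, that is on the order of a thousand billed queries per day; switching to weekly cuts the volume by a factor of seven.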
Worker scaling
BatchProfilingService runs in the Celery worker pool. Long-running profile jobs share the pool with other Celery tasks (scheduled notebooks, RAG indexing). For tenants with heavy profiling, scale celery-workers up:
```shell
kubectl scale deployment/celery-worker --replicas=N -n qry-app
```
Cache invalidation
Schema changes invalidate the profile. After adding / removing / renaming columns, trigger a manual re-profile.
Common issues
Profile job stuck in `running` for hours. Either the underlying SQL is genuinely that slow (large unindexed table) or the worker died mid-job. Check Celery worker health (Operations > Monitoring and health) and restart if needed.
Stats look wrong / outdated. `last_profiled_at` tells you how stale they are. If freshness matters, switch to a more frequent cadence or trigger a manual refresh.
Profiling is enabled but the LLM doesn't seem to use the stats. The stats are passed in the system prompt — the LLM sometimes ignores them on first pass. If it matters, add a domain-context note like "Always trust pre-profiled column statistics" (see Domain context).
Massive tables with sampling enabled give surprising distributions. A 1M-row sample of a 1B-row table can misrepresent rare values. Bump the sample size for that specific table or accept the approximation.
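A quick way to see why rare values go missing: the probability that a value with relative frequency p never appears in an n-row random sample is (1 − p)^n. A tiny helper makes the point concrete:

```python
# Probability that a value occurring with frequency p is entirely absent
# from an n-row random sample (assuming independent draws).
def miss_probability(p: float, n: int) -> float:
    return (1.0 - p) ** n
```

For a value appearing in 100 of 1,000,000,000 rows (p = 1e-7) and a 1M-row sample, the miss probability is about e^(-0.1) ≈ 0.90 — the sample will usually show zero occurrences of a value that genuinely exists.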
A new datasource has no profiles yet. Profile the first time manually (Admin > Datasources > {datasource} > Profile now) so users don't get an unprofiled experience while waiting for the first scheduled run.
See also
- Connecting databases — pre-requisite: the datasource has to exist.
- Pre-processed profiling reference — full feature reference.
- Monitoring and health — observe Celery worker health.