
Pre-processed profiling

Without profiling, when a user asks "what's the distribution of country in customers?", QRY has to either run an exploratory query first (slow, costs query time) or guess from column metadata alone (often wrong on real data).

Pre-processed profiling is a background process that computes column statistics — distinct counts, top values, null rates, numeric distributions, date ranges — and stores them so they're already in the LLM's context every time someone queries the table.

The result: better answers, fewer round-trips, lower query cost.

What gets profiled

For each table, the profiler computes per-column statistics:

Column type        | Stats stored
Text / categorical | distinct count, top-K most frequent values, null rate
Numeric            | min, max, mean, median, stddev, percentiles, null rate
Date / timestamp   | min date, max date, granularity (daily / hourly / mixed), null rate
Boolean            | true/false counts, null rate
Mixed / unknown    | distinct count, sample of values

Statistics are stored in table_profiling_results with a last_profiled_at timestamp.
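As an illustration of the text/categorical row above, here is a minimal sketch of computing those stats over an in-memory column. The function name and dict keys are assumptions for illustration; the real profiler runs SQL against the datasource rather than pulling values into Python.

```python
from collections import Counter

def profile_text_column(values, top_k=4):
    # Sketch of the categorical stats QRY stores: distinct count,
    # top-K most frequent values (as % of all rows), and null rate.
    total = len(values)
    nulls = sum(1 for v in values if v is None)
    counts = Counter(v for v in values if v is not None)
    top = [(value, round(count / total * 100))
           for value, count in counts.most_common(top_k)]
    return {
        "distinct_count": len(counts),
        "top_values": top,
        "null_rate": nulls / total if total else 0.0,
    }
```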

Configuring profiling per datasource

In Admin > Datasources > {datasource}, open the Profiling tab.

Enable

Profiling is opt-in per datasource. Off by default — the upfront cost (a few queries per table) isn't always worth it for fast-changing or huge databases.

Schedule

When enabled, configure cadence:

  • Hourly — for tables that change frequently and freshness of stats matters.
  • Daily — typical default.
  • Weekly — for stable reference data.
  • Manual only — admin triggers refresh on demand.

The BatchProfilingService runs on a Celery worker; profiling jobs are scheduled via Celery Beat at the chosen cadence.
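A Beat entry for a daily cadence might look like the sketch below. The entry name, task path, and datasource argument are hypothetical, not QRY's actual identifiers; Celery Beat accepts a schedule as a number of seconds, as shown, or a crontab object.

```python
# Hypothetical Beat schedule entry (names are illustrative).
beat_schedule = {
    "profile-sales-db-daily": {
        "task": "profiling.run_batch_profile",  # assumed task name
        "schedule": 24 * 60 * 60,               # seconds: daily cadence
        "args": ("sales_db",),                  # assumed datasource id
    },
}
```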

Sampling

For huge tables, full-table profiling is too expensive. Configure a sample size (default 1M rows) — the profiler runs on the sample, results are flagged as approximate.
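The sample-then-flag behavior can be sketched as follows. This is an in-memory illustration with an assumed function name; in practice the profiler samples in SQL on the datasource side (e.g. TABLESAMPLE on engines that support it) rather than loading rows.

```python
import random

def sample_for_profiling(rows, sample_size=1_000_000, seed=0):
    # Returns the rows to profile plus a flag marking results approximate.
    if len(rows) <= sample_size:
        return rows, False  # small table: full, exact profile
    rng = random.Random(seed)
    return rng.sample(rows, sample_size), True  # flagged approximate
```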

Per-table overrides

The default cadence and sample size apply to all tables in the datasource. Override per-table for the noisy outliers:

  • Tables that change every few minutes → hourly profile.
  • Tables that almost never change but are huge → manual only.
  • Reference tables (countries, currencies) → small enough to skip sampling.
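Resolution of per-table overrides against the datasource defaults amounts to a simple merge, sketched below. Table names and setting keys are hypothetical examples matching the bullets above.

```python
# Datasource-wide defaults.
DEFAULTS = {"cadence": "daily", "sample_size": 1_000_000}

# Hypothetical per-table overrides (table names are illustrative).
OVERRIDES = {
    "events": {"cadence": "hourly"},       # changes every few minutes
    "raw_archive": {"cadence": "manual"},  # huge, almost never changes
    "countries": {"sample_size": None},    # small: skip sampling
}

def profiling_config(table):
    # Per-table override wins over the datasource default.
    return {**DEFAULTS, **OVERRIDES.get(table, {})}
```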

How it shows up to users

When a user opens a conversation against a profiled table, the LLM sees something like:

Table: customers (5,000 rows, last profiled 2026-05-04 09:00 UTC)

Columns:
customer_id (text) — 5,000 distinct (unique)
country (text) — 47 distinct, top: ES (28%), DE (14%), FR (11%), GB (9%)
balance (numeric) — min: 0.00, max: 1,247,300.00, mean: 8,432.50, median: 4,100.00
created_at (timestamp) — 2018-01-01 → 2026-05-03, daily granularity
is_active (boolean) — true: 4,138 (83%), false: 862 (17%)
...

This is what makes QRY's first answer about a table accurate without an exploratory query.
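Rendering stored stats into a context line like the ones above is straightforward string formatting. The sketch below handles the text-column case only; the field names are assumptions mirroring the example, not QRY's actual schema.

```python
def render_text_column(name, stats):
    # Format one profiled text column as an LLM-facing summary line
    # (format mirrors the "country" line in the example above).
    top = ", ".join(f"{value} ({pct}%)" for value, pct in stats["top_values"])
    return f"{name} (text) — {stats['distinct_count']} distinct, top: {top}"

line = render_text_column("country", {
    "distinct_count": 47,
    "top_values": [("ES", 28), ("DE", 14), ("FR", 11), ("GB", 9)],
})
```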

Operational notes

Cost

Profiling runs SQL queries against your datasources. Query cost adds up across many tables — for a BigQuery datasource with 200 tables, daily profiling can cost real money. Sample, schedule weekly, or selectively enable per table.

Worker scaling

BatchProfilingService runs in the Celery worker pool. Long-running profile jobs share the pool with other Celery tasks (scheduled notebooks, RAG indexing). For tenants with heavy profiling, scale celery-workers up:

kubectl scale deployment/celery-worker --replicas=N -n qry-app

Cache invalidation

Schema changes invalidate the profile. After adding / removing / renaming columns, trigger a manual re-profile.

Common issues

Profile job stuck in running for hours. Either the underlying SQL is genuinely that slow (large unindexed table) or the worker died mid-job. Check Celery worker health (Operations > Monitoring and health) and restart if needed.

Stats look wrong / outdated. last_profiled_at tells you how stale they are. If freshness matters, increase cadence or trigger manual refresh.

Profiling is enabled but the LLM doesn't seem to use the stats. The stats are passed in the system prompt — the LLM sometimes ignores them on first pass. If it matters, add a domain-context note like "Always trust pre-profiled column statistics" (see Domain context).

Massive tables with sampling enabled give surprising distributions. 1M-row sample on a 1B-row table can mis-represent rare values. Bump the sample size for that specific table or accept the approximation.
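The arithmetic behind that mis-representation: under uniform sampling, a value's expected hit count is its frequency times the sampling rate, and the chance the sample misses it entirely follows a binomial model. A quick sketch (function name is illustrative):

```python
def rare_value_in_sample(occurrences, table_rows, sample_size):
    # Expected hits for a rare value under uniform sampling, and the
    # probability the sample contains zero of them (binomial model).
    rate = sample_size / table_rows
    return occurrences * rate, (1 - rate) ** occurrences

# A value with 500 rows in a 1B-row table, sampled at 1M rows:
expected, p_miss = rare_value_in_sample(500, 10**9, 10**6)
# → expected ≈ 0.5 hits; the sample misses the value entirely ~61% of the time
```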

A new datasource has no profiles yet. Profile the first time manually (Admin > Datasources > {datasource} > Profile now) so users don't get an unprofiled experience while waiting for the first scheduled run.
