Monitoring and health
A QRY tenant exposes three layers of observability:
- Health probes — Kubernetes-level liveness / readiness, used by the cluster to decide whether to route traffic and when to restart.
- Prometheus metrics — application-level counters / gauges / histograms, scraped by Prometheus and displayed in Grafana.
- Audit and structured logs — for forensics, see Audit and compliance.
This page covers the first two.
Health probes — why two of them
Two endpoints, deliberately different:
/health/live
Simple. Returns 200 always (unless the process is dead). Used as the liveness probe — Kubernetes restarts a pod that fails liveness, so it has to be aggressively cheap and never-blocking. No DB checks, no Redis checks.
/health/ready
Verifies dependencies. Hits PostgreSQL, hits Redis, returns 503 if either is unreachable. Used as the readiness probe — Kubernetes stops routing traffic to a pod that fails readiness, but doesn't restart it. So a brief Postgres blip doesn't churn pods.
The split is what keeps QRY stable through transient dependency hiccups: liveness stays green, readiness goes red, traffic drains, dependency recovers, traffic returns. No restart cycle.
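What the two endpoints do is easiest to see in code. A minimal sketch, assuming a FastAPI app with an asyncpg pool and a redis.asyncio client already attached to app.state; this is illustrative, not QRY's actual implementation:

```python
# Minimal sketch of the liveness/readiness split. Illustrative only, not QRY's
# actual handlers; assumes FastAPI with an asyncpg pool and a redis.asyncio
# client already stored on app.state during startup.
import asyncio

from fastapi import FastAPI, Response

app = FastAPI()


@app.get("/health/live")
async def live() -> dict:
    # Liveness: no dependency checks, nothing that can block.
    # If this handler runs at all, the process is alive.
    return {"status": "ok"}


@app.get("/health/ready")
async def ready(response: Response) -> dict:
    # Readiness: verify PostgreSQL and Redis with a short timeout,
    # return 503 if either is unreachable.
    try:
        async with asyncio.timeout(3):  # Python 3.11+
            async with app.state.pg_pool.acquire() as conn:  # asyncpg pool (assumed)
                await conn.execute("SELECT 1")
            await app.state.redis.ping()  # redis.asyncio client (assumed)
    except Exception:
        response.status_code = 503
        return {"status": "unavailable"}
    return {"status": "ready"}
```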
Configuring probes
Defaults from the qry-platform Helm chart are sensible:
```yaml
livenessProbe:
  httpGet: { path: /health/live, port: 8000 }
  periodSeconds: 30
  timeoutSeconds: 5
readinessProbe:
  httpGet: { path: /health/ready, port: 8000 }
  periodSeconds: 10
  timeoutSeconds: 5
```
For high-traffic tenants, shorten readiness periodSeconds to 5 to cut user-visible latency on dependency hiccups.
Topology
A typical tenant's pod topology:
| Deployment | Default replicas | What it does |
|---|---|---|
| qry-backend | 2+ | API + chat orchestration |
| qry-worker | 1–5 (autoscaled) | Chat streaming workers; 10 concurrent conversations per pod |
| celery-worker | 1–5 (autoscaled) | Async tasks: scheduled, RAG indexing, profiling |
| celery-beat | 1 | Schedules periodic tasks |
| postgres | 1 (StatefulSet) — only ixenlab; Cloud SQL otherwise | Database |
| redis / valkey | 1 — Memorystore for Cloud, in-cluster for ixenlab | Cache + Celery broker |
With the default WORKER_MAX_CONCURRENT_TASKS=10, each qry-worker pod handles up to 10 concurrent conversations, so total tenant capacity = replicas × 10.
An asyncio.Semaphore enforces the per-pod cap and gives the worker graceful shutdown — in-flight conversations finish before the pod terminates, as sketched below.
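The shape of that worker loop, as a hedged sketch; get_next_conversation() and handle_conversation() are placeholders, not QRY's real functions:

```python
# Sketch of the per-pod cap and graceful shutdown. Not the real qry-worker code;
# get_next_conversation() and handle_conversation() are placeholders.
import asyncio
import os
import signal

MAX_CONCURRENT = int(os.getenv("WORKER_MAX_CONCURRENT_TASKS", "10"))


async def get_next_conversation() -> dict:
    # Placeholder: the real worker pulls the next pending conversation from its queue.
    await asyncio.sleep(1)
    return {"id": "demo"}


async def handle_conversation(conversation: dict) -> None:
    # Placeholder: the real worker streams the chat response here.
    await asyncio.sleep(5)


async def run_worker() -> None:
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)  # per-pod concurrency cap
    shutdown = asyncio.Event()
    in_flight: set[asyncio.Task] = set()

    # SIGTERM (sent by Kubernetes when the pod terminates) stops intake of new work.
    asyncio.get_running_loop().add_signal_handler(signal.SIGTERM, shutdown.set)

    async def run_one(conversation: dict) -> None:
        async with semaphore:  # never more than MAX_CONCURRENT conversations at once
            await handle_conversation(conversation)

    while not shutdown.is_set():
        conversation = await get_next_conversation()
        task = asyncio.create_task(run_one(conversation))
        in_flight.add(task)
        task.add_done_callback(in_flight.discard)

    # Graceful shutdown: let in-flight conversations finish before exiting.
    await asyncio.gather(*in_flight, return_exceptions=True)


if __name__ == "__main__":
    asyncio.run(run_worker())
```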
Prometheus metrics
Every QRY tenant exposes /metrics on each pod for Prometheus to scrape. Metric names are prefixed by feature:
- qry_chat_* — conversation-level (active conversations, messages, model tokens)
- qry_query_* — SQL query execution (latency, errors, by datasource)
- qry_rag_* — embedding pipeline (indexed files, vector dim, retrieval latency)
- qry_scheduled_task_* — scheduled task execution (success / failure / cost)
- qry_forge_* — Forge migration (wave duration, self-heal attempts, deploy failures)
- qry_license_* — license validation results
- qry_workspace_* — workspace operations
- qry_python_exec_* — Python executor usage (native vs. K8s, latency)
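If you are adding your own tenant-side metrics in the same naming style, standard prometheus_client usage is all that's needed. A sketch; the metric and label names below follow the qry_* convention but are illustrative rather than the exact series QRY exports:

```python
# Standard prometheus_client usage. The metric names follow the qry_* convention
# but are illustrative, not necessarily the exact series QRY exports.
from prometheus_client import Counter, Gauge, Histogram, start_http_server

ACTIVE_CONVERSATIONS = Gauge(
    "qry_chat_active_conversations", "Conversations currently being streamed"
)
QUERY_ERRORS = Counter(
    # Exposed as qry_query_errors_total; prometheus_client appends the suffix.
    "qry_query_errors", "SQL query errors", ["datasource"]
)
QUERY_LATENCY = Histogram(
    "qry_query_duration_seconds", "SQL query latency", ["datasource"]
)


def run_query(datasource: str, sql: str) -> None:
    with QUERY_LATENCY.labels(datasource=datasource).time():
        try:
            ...  # execute the query against the datasource
        except Exception:
            QUERY_ERRORS.labels(datasource=datasource).inc()
            raise


if __name__ == "__main__":
    start_http_server(8000)  # standalone /metrics endpoint for this sketch
```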
Useful queries to alert on:
| Metric | Alert when |
|---|---|
| qry_query_errors_total | rate increases > 10% over 1h |
| qry_chat_pending_messages | > 100 for > 5min (worker pool exhausted) |
| qry_forge_deploy_failures_total | > 0 |
| qry_license_validation_failures_total | > 3 in 24h |
| qry_rag_indexing_lag_seconds | > 600 |
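To check one of these conditions by hand before wiring up an alert rule, query Prometheus directly. A sketch that assumes Prometheus is port-forwarded to localhost:9090 (see the troubleshooting note below); the datasource label is an assumption:

```python
# Ad-hoc PromQL checks against a port-forwarded Prometheus (localhost:9090 assumed).
import requests

PROM = "http://localhost:9090"


def instant_query(expr: str) -> list:
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": expr}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]


# Query error rate per datasource over the last hour; compare against your threshold.
print(instant_query("sum by (datasource) (rate(qry_query_errors_total[1h]))"))

# Pending chat messages right now (alert table: > 100 for > 5 min).
print(instant_query("qry_chat_pending_messages"))
```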
Grafana dashboards
Standard dashboards:
- qry-overview — tenant health at a glance.
- qry-chat — conversation throughput, latency, model cost.
- qry-forge — Forge wave / deploy / translation metrics.
- qry-lakeflow — pipeline / job runs.
- qry-rag — embedding pipeline.
Import them from monitoring/grafana/ in the QRY repo, or use the JSON dashboards bundled in the Helm chart.
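If you prefer to script the import, Grafana's HTTP API accepts the dashboard JSON directly. A sketch; the Grafana URL, token, and file name are assumptions to adapt to your setup:

```python
# Import a dashboard JSON via the Grafana HTTP API (illustrative; URL, token,
# and file name are assumptions, not values shipped with QRY).
import json

import requests

GRAFANA = "https://grafana.example.com"  # your Grafana instance
TOKEN = "..."                            # service-account / API token

with open("monitoring/grafana/qry-overview.json") as f:  # assumed file name
    dashboard = json.load(f)

resp = requests.post(
    f"{GRAFANA}/api/dashboards/db",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"dashboard": dashboard, "overwrite": True},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```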
Alert groups
Prometheus alert rules are grouped by area:
- qry.platform — cross-cutting health (any pod down, DB down, Redis down).
- qry.forge — Forge-specific (deploy failures, wave-duration regressions).
- qry.lakeflow — pipeline failures.
- qry.scheduled — scheduled-task failures and quota issues.
- qry.license — license validation issues.
- qry.security — ABAC enforcement violations, unusual access patterns.
Route each group to the right team / channel. The qry.security group should always page; others can be ticket-only.
Logs
Structured JSON logs to stdout. Parse with whatever you have — Loki / ELK / GCP Cloud Logging. Useful filter fields:
- tenant_id — multi-tenant attribution.
- user_id — per-user investigation.
- conversation_id / notebook_id / task_id — artifact-level.
- level — INFO, WARNING, ERROR.
- module — which QRY component emitted the log.
For ABAC violations specifically, qry.security log records contain the parsed SQL and the policy that fired — useful for compliance reports.
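If you don't have a log backend at hand, the JSON structure makes ad-hoc filtering easy. A sketch that filters dumped pod logs by conversation_id using the fields listed above; the input path is just an example:

```python
# Filter structured JSON logs for one conversation. Input is whatever you dumped
# with kubectl logs, one JSON object per line; the path is just an example.
import json
import sys

conversation_id = sys.argv[1]

with open("qry-backend.log") as f:  # example path
    for line in f:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON lines (startup banners etc.)
        if record.get("conversation_id") == conversation_id:
            print(json.dumps(record))
```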
Common issues
Pod is Running but /health/ready returns 503.
PostgreSQL or Redis is unreachable. Check both — kubectl exec into the pod and try psql / redis-cli.
Pods churn (CrashLoopBackOff) on /health/live.
Liveness shouldn't fail for transient deps — if it is, the process probably crashed. Look at logs from the previous container instance.
Grafana dashboards show no data.
Either Prometheus isn't scraping, or the metric names don't match. Port-forward Prometheus (kubectl port-forward svc/prometheus -n monitoring 9090) and query the metric directly to see which.
Alerts firing constantly for qry_query_errors_total after a deploy.
A new datasource was added and is throwing connection errors. Either fix the datasource config or temporarily disable the offending alert.
qry_license_validation_failures_total rising despite "license is fine".
GCP outage or service account key revoked. See License management for the 24h grace logic.
Need to dig into a specific conversation's path.
conversation_id filter across logs gives you the full trail: API request, worker assignment, DB queries, LLM calls, streamed response.
See also
- License management — license metrics and grace period.
- External Spark cluster — Spark job health alongside QRY's.
- Audit and compliance — log retention and forensic queries.