Monitoring and health
A QRY tenant exposes three layers of observability:
- Health probes — Kubernetes-level liveness / readiness, used by the cluster to decide whether to route traffic and when to restart.
- Prometheus metrics — application-level counters / gauges / histograms, scraped by Prometheus and displayed in Grafana.
- Audit and structured logs — for forensics, see Audit and compliance.
This page covers the first two.
Health probes — why two of them
Two endpoints, deliberately different:
/health/live
Simple. Returns 200 always (unless the process is dead). Used as the liveness probe — Kubernetes restarts a pod that fails liveness, so it has to be aggressively cheap and never-blocking. No DB checks, no Redis checks.
/health/ready
Verifies dependencies. Hits PostgreSQL, hits Redis, returns 503 if either is unreachable. Used as the readiness probe — Kubernetes stops routing traffic to a pod that fails readiness, but doesn't restart it. So a brief Postgres blip doesn't churn pods.
The split is what keeps QRY stable through transient dependency hiccups: liveness stays green, readiness goes red, traffic drains, dependency recovers, traffic returns. No restart cycle.
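What the two endpoints do is easiest to see in code. A minimal sketch, assuming a FastAPI app with an asyncpg pool and a redis.asyncio client already attached to app.state; this is illustrative, not QRY's actual implementation:

```python
# Minimal sketch of the liveness/readiness split. Illustrative only, not QRY's
# actual handlers; assumes FastAPI with an asyncpg pool and a redis.asyncio
# client already stored on app.state during startup.
import asyncio

from fastapi import FastAPI, Response

app = FastAPI()


@app.get("/health/live")
async def live() -> dict:
    # Liveness: no dependency checks, nothing that can block.
    # If this handler runs at all, the process is alive.
    return {"status": "ok"}


@app.get("/health/ready")
async def ready(response: Response) -> dict:
    # Readiness: verify PostgreSQL and Redis with a short timeout,
    # return 503 if either is unreachable.
    try:
        async with asyncio.timeout(3):  # Python 3.11+
            async with app.state.pg_pool.acquire() as conn:  # asyncpg pool (assumed)
                await conn.execute("SELECT 1")
            await app.state.redis.ping()  # redis.asyncio client (assumed)
    except Exception:
        response.status_code = 503
        return {"status": "unavailable"}
    return {"status": "ready"}
```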
Configuring probes
Defaults from the qry-platform Helm chart are sensible:
```yaml
livenessProbe:
  httpGet: { path: /health/live, port: 8000 }
  periodSeconds: 30
  timeoutSeconds: 5
readinessProbe:
  httpGet: { path: /health/ready, port: 8000 }
  periodSeconds: 10
  timeoutSeconds: 5
```
For high-traffic tenants, shorten readiness periodSeconds to 5 to cut user-visible latency on dependency hiccups.
Topology
A typical tenant's pod topology:
| Deployment | Default replicas | What it does |
|---|---|---|
| qry-backend | 2+ | API + chat orchestration |
| qry-worker | 1–5 (autoscaled) | Chat streaming workers; 10 concurrent conversations per pod |
| celery-worker | 1–5 (autoscaled) | Async tasks: scheduled, RAG indexing, profiling |
| celery-beat | 1 | Schedules periodic tasks |
| postgres | 1 (StatefulSet) — only ixenlab; Cloud SQL otherwise | Database |
| redis / valkey | 1 — Memorystore for Cloud, in-cluster for ixenlab | Cache + Celery broker |
With the default WORKER_MAX_CONCURRENT_TASKS=10, each qry-worker pod handles up to 10 concurrent conversations, so total tenant capacity = replicas × 10.
An asyncio.Semaphore enforces the per-pod cap and gives the worker graceful shutdown — in-flight conversations finish before the pod terminates, as sketched below.
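The shape of that worker loop, as a hedged sketch; get_next_conversation() and handle_conversation() are placeholders, not QRY's real functions:

```python
# Sketch of the per-pod cap and graceful shutdown. Not the real qry-worker code;
# get_next_conversation() and handle_conversation() are placeholders.
import asyncio
import os
import signal

MAX_CONCURRENT = int(os.getenv("WORKER_MAX_CONCURRENT_TASKS", "10"))


async def get_next_conversation() -> dict:
    # Placeholder: the real worker pulls the next pending conversation from its queue.
    await asyncio.sleep(1)
    return {"id": "demo"}


async def handle_conversation(conversation: dict) -> None:
    # Placeholder: the real worker streams the chat response here.
    await asyncio.sleep(5)


async def run_worker() -> None:
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)  # per-pod concurrency cap
    shutdown = asyncio.Event()
    in_flight: set[asyncio.Task] = set()

    # SIGTERM (sent by Kubernetes when the pod terminates) stops intake of new work.
    asyncio.get_running_loop().add_signal_handler(signal.SIGTERM, shutdown.set)

    async def run_one(conversation: dict) -> None:
        async with semaphore:  # never more than MAX_CONCURRENT conversations at once
            await handle_conversation(conversation)

    while not shutdown.is_set():
        conversation = await get_next_conversation()
        task = asyncio.create_task(run_one(conversation))
        in_flight.add(task)
        task.add_done_callback(in_flight.discard)

    # Graceful shutdown: let in-flight conversations finish before exiting.
    await asyncio.gather(*in_flight, return_exceptions=True)


if __name__ == "__main__":
    asyncio.run(run_worker())
```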
Prometheus metrics
Every QRY tenant exposes /metrics on each pod for Prometheus to scrape. Metric names are prefixed by feature:
- qry_chat_* — conversation-level (active conversations, messages, model tokens)
- qry_query_* — SQL query execution (latency, errors, by datasource)
- qry_rag_* — embedding pipeline (indexed files, vector dim, retrieval latency)
- qry_scheduled_task_* — scheduled task execution (success / failure / cost)
- qry_forge_* — Forge migration (wave duration, self-heal attempts, deploy failures)
- qry_license_* — license validation results
- qry_workspace_* — workspace operations
- qry_python_exec_* — Python executor usage (native vs. K8s, latency)
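If you are adding your own tenant-side metrics in the same naming style, standard prometheus_client usage is all that's needed. A sketch; the metric and label names below follow the qry_* convention but are illustrative rather than the exact series QRY exports:

```python
# Standard prometheus_client usage. The metric names follow the qry_* convention
# but are illustrative, not necessarily the exact series QRY exports.
from prometheus_client import Counter, Gauge, Histogram, start_http_server

ACTIVE_CONVERSATIONS = Gauge(
    "qry_chat_active_conversations", "Conversations currently being streamed"
)
QUERY_ERRORS = Counter(
    # Exposed as qry_query_errors_total; prometheus_client appends the suffix.
    "qry_query_errors", "SQL query errors", ["datasource"]
)
QUERY_LATENCY = Histogram(
    "qry_query_duration_seconds", "SQL query latency", ["datasource"]
)


def run_query(datasource: str, sql: str) -> None:
    with QUERY_LATENCY.labels(datasource=datasource).time():
        try:
            ...  # execute the query against the datasource
        except Exception:
            QUERY_ERRORS.labels(datasource=datasource).inc()
            raise


if __name__ == "__main__":
    start_http_server(8000)  # standalone /metrics endpoint for this sketch
```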
Useful queries to alert on:
| Metric | Alert when |
|---|---|
| qry_query_errors_total | rate increases > 10% over 1h |
| qry_chat_pending_messages | > 100 for > 5min (worker pool exhausted) |
| qry_forge_deploy_failures_total | > 0 |
| qry_license_validation_failures_total | > 3 in 24h |
| qry_rag_indexing_lag_seconds | > 600 |
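To check one of these conditions by hand before wiring up an alert rule, query Prometheus directly. A sketch that assumes Prometheus is port-forwarded to localhost:9090 (see the troubleshooting note below); the datasource label is an assumption:

```python
# Ad-hoc PromQL checks against a port-forwarded Prometheus (localhost:9090 assumed).
import requests

PROM = "http://localhost:9090"


def instant_query(expr: str) -> list:
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": expr}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]


# Query error rate per datasource over the last hour; compare against your threshold.
print(instant_query("sum by (datasource) (rate(qry_query_errors_total[1h]))"))

# Pending chat messages right now (alert table: > 100 for > 5 min).
print(instant_query("qry_chat_pending_messages"))
```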
Grafana dashboards
Standard dashboards:
- qry-overview — tenant health at a glance.
- qry-chat — conversation throughput, latency, model cost.
- qry-forge — Forge wave / deploy / translation metrics.
- qry-lakeflow — pipeline / job runs.
- qry-rag — embedding pipeline.
Import them from monitoring/grafana/ in the QRY repo, or use the JSON dashboards bundled in the Helm chart.
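If you prefer to script the import, Grafana's HTTP API accepts the dashboard JSON directly. A sketch; the Grafana URL, token, and file name are assumptions to adapt to your setup:

```python
# Import a dashboard JSON via the Grafana HTTP API (illustrative; URL, token,
# and file name are assumptions, not values shipped with QRY).
import json

import requests

GRAFANA = "https://grafana.example.com"  # your Grafana instance
TOKEN = "..."                            # service-account / API token

with open("monitoring/grafana/qry-overview.json") as f:  # assumed file name
    dashboard = json.load(f)

resp = requests.post(
    f"{GRAFANA}/api/dashboards/db",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"dashboard": dashboard, "overwrite": True},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```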
Alert groups
Prometheus alert rules are grouped by area:
- qry.platform — cross-cutting health (any pod down, DB down, Redis down).
- qry.forge — Forge-specific (deploy failures, wave-duration regressions).
- qry.lakeflow — pipeline failures.
- qry.scheduled — scheduled-task failures and quota issues.
- qry.license — license validation issues.
- qry.security — ABAC enforcement violations, unusual access patterns.
Route each group to the right team / channel. The qry.security group should always page; others can be ticket-only.
Logs
Structured JSON logs to stdout. Parse with whatever you have — Loki / ELK / GCP Cloud Logging. Useful filter fields:
- tenant_id — multi-tenant attribution.
- user_id — per-user investigation.
- conversation_id / notebook_id / task_id — artifact-level.
- level — INFO, WARNING, ERROR.
- module — which QRY component emitted the log.
For ABAC violations specifically, qry.security log records contain the parsed SQL and the policy that fired — useful for compliance reports.
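If you don't have a log backend at hand, the JSON structure makes ad-hoc filtering easy. A sketch that filters dumped pod logs by conversation_id using the fields listed above; the input path is just an example:

```python
# Filter structured JSON logs for one conversation. Input is whatever you dumped
# with kubectl logs, one JSON object per line; the path is just an example.
import json
import sys

conversation_id = sys.argv[1]

with open("qry-backend.log") as f:  # example path
    for line in f:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON lines (startup banners etc.)
        if record.get("conversation_id") == conversation_id:
            print(json.dumps(record))
```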
Common issues
Pod is Running but /health/ready returns 503.
PostgreSQL or Redis is unreachable. Check both — kubectl exec into the pod and try psql / redis-cli.
Pods churn (CrashLoopBackOff) on /health/live.
Liveness shouldn't fail for transient deps — if it is, the process probably crashed. Look at logs from the previous container instance.
Grafana dashboards show no data.
Either Prometheus isn't scraping, or the metric names don't match. Port-forward Prometheus (kubectl port-forward svc/prometheus -n monitoring 9090) and query the metric directly to see which.
Alerts firing constantly for qry_query_errors_total after a deploy.
A new datasource was added and is throwing connection errors. Either fix the datasource config or temporarily disable the offending alert.
qry_license_validation_failures_total rising despite "license is fine".
GCP outage or service account key revoked. See License management for the 24h grace logic.
Need to dig into a specific conversation's path.
conversation_id filter across logs gives you the full trail: API request, worker assignment, DB queries, LLM calls, streamed response.
See also
- License management — license metrics and grace period.
- External Spark cluster — Spark job health alongside QRY's.
- Audit and compliance — log retention and forensic queries.