Monitoring and health

A QRY tenant exposes three layers of observability:

  1. Health probes — Kubernetes-level liveness / readiness, used by the cluster to decide whether to route traffic and when to restart.
  2. Prometheus metrics — application-level counters / gauges / histograms, scraped by Prometheus and displayed in Grafana.
  3. Audit and structured logs — for forensics, see Audit and compliance.

This page covers (1) and (2).

Health probes — why two of them

Two endpoints, deliberately different:

/health/live

Simple. Always returns 200 (unless the process is dead). Used as the liveness probe — Kubernetes restarts a pod that fails liveness, so it has to be aggressively cheap and never-blocking. No DB checks, no Redis checks.
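
A compliant liveness handler is nearly a one-liner. A minimal sketch, assuming a FastAPI app serving on port 8000 (the framework and handler here are illustrative, not necessarily QRY's actual code):

from fastapi import FastAPI

app = FastAPI()

@app.get("/health/live")
async def live() -> dict:
    # No dependency checks: if this coroutine runs at all, the process is alive.
    return {"status": "ok"}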

/health/ready

Verifies dependencies. Hits PostgreSQL, hits Redis, returns 503 if either is unreachable. Used as the readiness probe — Kubernetes stops routing traffic to a pod that fails readiness, but doesn't restart it. So a brief Postgres blip doesn't churn pods.
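
The readiness handler does the opposite: it touches each dependency under a short timeout and reports 503 on any failure. A sketch assuming asyncpg and redis-py 5.x as the client libraries (the DSNs and the 3-second budget are illustrative; the real checks may differ):

import asyncio
import asyncpg
import redis.asyncio as aioredis
from fastapi import FastAPI, Response

app = FastAPI()
POSTGRES_DSN = "postgresql://qry:qry@postgres:5432/qry"  # illustrative
REDIS_URL = "redis://redis:6379/0"                       # illustrative

@app.get("/health/ready")
async def ready(response: Response) -> dict:
    async def check() -> None:
        conn = await asyncpg.connect(POSTGRES_DSN)
        try:
            await conn.execute("SELECT 1")
        finally:
            await conn.close()
        r = aioredis.from_url(REDIS_URL)
        try:
            await r.ping()
        finally:
            await r.aclose()

    try:
        # Keep the whole check under the probe's timeoutSeconds so a hung
        # dependency surfaces as a clean 503 rather than a probe timeout.
        await asyncio.wait_for(check(), timeout=3)
    except Exception:
        response.status_code = 503
        return {"status": "unavailable"}
    return {"status": "ok"}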

The split is what keeps QRY stable through transient dependency hiccups: liveness stays green, readiness goes red, traffic drains, dependency recovers, traffic returns. No restart cycle.

Configuring probes

Defaults from the qry-platform Helm chart are sensible:

livenessProbe:
  httpGet: { path: /health/live, port: 8000 }
  periodSeconds: 30
  timeoutSeconds: 5
readinessProbe:
  httpGet: { path: /health/ready, port: 8000 }
  periodSeconds: 10
  timeoutSeconds: 5

For high-traffic tenants, shorten readiness periodSeconds to 5 to cut user-visible latency on dependency hiccups.

Topology

A typical tenant's pod topology:

Deployment | Default replicas | What it does
--- | --- | ---
qry-backend | 2+ | API + chat orchestration
qry-worker | 1–5 (autoscaled) | Chat streaming workers; 10 concurrent conversations per pod
celery-worker | 1–5 (autoscaled) | Async tasks: scheduled, RAG indexing, profiling
celery-beat | 1 | Schedules periodic tasks
postgres | 1 (StatefulSet, ixenlab only; Cloud SQL otherwise) | Database
redis / valkey | 1 (Memorystore for Cloud, in-cluster for ixenlab) | Cache + Celery broker

With WORKER_MAX_CONCURRENT_TASKS=10, each qry-worker pod handles up to 10 concurrent conversations. Total tenant capacity = replicas × 10, so three worker replicas give 30 concurrent conversations.

asyncio.Semaphore enforces the per-pod cap and gives the worker graceful shutdown — in-flight conversations finish before the pod terminates.
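
A sketch of that pattern (not QRY's actual worker code), with next_conversation and stream_conversation as hypothetical stand-ins for the broker pull and the streaming work:

import asyncio
import os
import signal

MAX_CONCURRENT = int(os.environ.get("WORKER_MAX_CONCURRENT_TASKS", "10"))

async def stream_conversation(conversation_id: str) -> None:
    await asyncio.sleep(1)  # stand-in for the real streaming work

async def next_conversation() -> str:
    await asyncio.sleep(0.1)  # stand-in for pulling work off the broker
    return "conv-123"

async def main() -> None:
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)  # the per-pod cap
    stopping = asyncio.Event()
    in_flight: set[asyncio.Task] = set()

    # Kubernetes sends SIGTERM before it terminates the pod.
    asyncio.get_running_loop().add_signal_handler(signal.SIGTERM, stopping.set)

    async def handle(conversation_id: str) -> None:
        try:
            await stream_conversation(conversation_id)
        finally:
            semaphore.release()

    while not stopping.is_set():
        # Wait for a free slot before taking new work off the broker.
        await semaphore.acquire()
        task = asyncio.create_task(handle(await next_conversation()))
        in_flight.add(task)
        task.add_done_callback(in_flight.discard)

    # Graceful shutdown: take no new work, let in-flight conversations finish.
    await asyncio.gather(*in_flight, return_exceptions=True)

asyncio.run(main())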

Prometheus metrics

Every QRY tenant exposes /metrics on each pod for Prometheus to scrape. Metric names are prefixed by feature (an emitter sketch follows the list):

  • qry_chat_* — conversation-level (active conversations, messages, model tokens)
  • qry_query_* — SQL query execution (latency, errors, by datasource)
  • qry_rag_* — embedding pipeline (indexed files, vector dim, retrieval latency)
  • qry_scheduled_task_* — scheduled task execution (success / failure / cost)
  • qry_forge_* — Forge migration (wave duration, self-heal attempts, deploy failures)
  • qry_license_* — license validation results
  • qry_workspace_* — workspace operations
  • qry_python_exec_* — Python executor usage (native vs. K8s, latency)
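
Registering a metric in this scheme takes a few lines with prometheus_client. A sketch; the metric names, labels, and port below are illustrative, not necessarily what QRY registers:

from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metrics following the qry_<feature>_* naming scheme.
CHAT_MESSAGES = Counter(
    "qry_chat_messages_total",
    "Chat messages processed",
    ["tenant_id", "model"],
)
QUERY_LATENCY = Histogram(
    "qry_query_duration_seconds",
    "SQL query execution latency",
    ["datasource"],
)

start_http_server(8001)  # serves /metrics on :8001
CHAT_MESSAGES.labels(tenant_id="acme", model="gpt-4o").inc()
with QUERY_LATENCY.labels(datasource="warehouse").time():
    pass  # execute the query here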

Useful queries to alert on:

Metric | Alert when
--- | ---
qry_query_errors_total | rate increases > 10% over 1h
qry_chat_pending_messages | > 100 for > 5 min (worker pool exhausted)
qry_forge_deploy_failures_total | > 0
qry_license_validation_failures_total | > 3 in 24h
qry_rag_indexing_lag_seconds | > 600
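
Before wiring an expression into an alert rule, it helps to evaluate it against Prometheus's HTTP API. A sketch assuming Prometheus port-forwarded to localhost:9090; the PromQL shown is one illustrative reading of the first row above:

import requests

resp = requests.get(
    "http://localhost:9090/api/v1/query",
    params={"query": "rate(qry_query_errors_total[5m])"},
    timeout=10,
)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    # Each result pairs a label set with the latest sample value.
    print(result["metric"], result["value"][1])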

Grafana dashboards

Standard dashboards, listed by UID:

  • qry-overview — tenant health at a glance.
  • qry-chat — conversation throughput, latency, model cost.
  • qry-forge — Forge wave / deploy / translation metrics.
  • qry-lakeflow — pipeline / job runs.
  • qry-rag — embedding pipeline.

Import them from monitoring/grafana/ in the QRY repo, or use the JSON dashboards bundled in the Helm chart.
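
Imports can also be scripted through Grafana's HTTP API. A sketch assuming a service-account token and that the repo files are named after their UIDs (the file name and token are assumptions):

import json
import requests

GRAFANA_URL = "http://localhost:3000"  # illustrative
API_TOKEN = "REPLACE_ME"               # hypothetical service-account token

with open("monitoring/grafana/qry-overview.json") as f:
    dashboard = json.load(f)

resp = requests.post(
    f"{GRAFANA_URL}/api/dashboards/db",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"dashboard": dashboard, "overwrite": True},
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["status"])  # "success" on import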

Alert groups

Prometheus alert rules are grouped by area:

  • qry.platform — cross-cutting health (any pod down, DB down, Redis down).
  • qry.forge — Forge-specific (deploy failures, wave-duration regressions).
  • qry.lakeflow — pipeline failures.
  • qry.scheduled — scheduled-task failures and quota issues.
  • qry.license — license validation issues.
  • qry.security — ABAC enforcement violations, unusual access patterns.

Route each group to the right team / channel. The qry.security group should always page; others can be ticket-only.

Logs

Structured JSON logs to stdout. Parse with whatever you have — Loki / ELK / GCP Cloud Logging. Useful filter fields (a filtering sketch follows the list):

  • tenant_id — multi-tenant attribution.
  • user_id — per-user investigation.
  • conversation_id / notebook_id / task_id — artifact-level.
  • level — INFO, WARNING, ERROR.
  • module — which QRY component emitted the log.
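
Even without a log stack, these fields make ad-hoc filtering straightforward. A minimal sketch that scans a dumped log file for one conversation (the file name, the message field, and the ID value are assumptions):

import json

TARGET = "conv-123"  # illustrative conversation_id

with open("qry-backend.log") as f:  # e.g. captured via kubectl logs
    for line in f:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON lines
        if record.get("conversation_id") == TARGET:
            print(record.get("level"), record.get("module"), record.get("message"))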

For ABAC violations specifically, qry.security log records contain the parsed SQL and the policy that fired — useful for compliance reports.

Common issues

Pod is Running but /health/ready returns 503. PostgreSQL or Redis is unreachable. Check both — kubectl exec into the pod and try psql / redis-cli.

Pods churn (CrashLoopBackOff) on /health/live. Liveness shouldn't fail for transient deps — if it is, the process probably crashed. Look at logs from the previous container instance.

Grafana dashboards show no data. Either Prometheus isn't scraping, or the metric names don't match. Run kubectl port-forward svc/prometheus -n monitoring 9090 and query the metric directly to confirm.

Alerts firing constantly for qry_query_errors_total after a deploy. A new datasource was added and is throwing connection errors. Either fix the datasource config or temporarily disable the offending alert.

qry_license_validation_failures_total rising despite "license is fine". GCP outage or service account key revoked. See License management for the 24h grace logic.

Need to dig into a specific conversation's path. conversation_id filter across logs gives you the full trail: API request, worker assignment, DB queries, LLM calls, streamed response.

See also

  • Audit and compliance
  • License management