Multi-tenant provisioning

A new QRY tenant is one provision_tenant.sh command. The script orchestrates everything: license, database, namespace, RBAC, Helm install, IngressRoute, DNS, bootstrap admin, and optional config clone from an existing tenant.

This page is the operational reference. The full source of truth is kubernetes/tenants/README.md and kubernetes/tenants/provision_tenant.sh in the QRY repo.

What's per-tenant vs. shared

Per-tenant (created by provision_tenant.sh):

  • A namespace qry-<id> in the shared cluster.
  • A database qrydb_<id> on the shared Cloud SQL instance.
  • A Memorystore Valkey instance (created before the script runs — see Prerequisites).
  • A GCP license service account with caps.
  • A Let's Encrypt cert via Traefik ACME + Route53 DNS-01.
  • An IngressRoute for <id>.qry.dev.
  • A bootstrap admin user.

Shared across tenants:

  • The GKE cluster (autopilot-cluster-pue in pue-madrid, region europe-southwest1).
  • The Cloud SQL Postgres instance.
  • The Traefik LB.
  • The python-execution namespace (with per-tenant RBAC).
  • The Artifact Registry.
  • JWT_SECRET_KEY — Fernet-encrypted credentials need to be portable between tenants for migration scenarios, so the secret is global. Don't generate per-tenant secrets unless you understand the implications.
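
Because Fernet portability depends on every tenant sharing the same key, it's worth verifying after provisioning. A minimal sketch, assuming the chart stores the key in a secret named qry-secrets under JWT_SECRET_KEY (both names are assumptions; check the chart templates):

# Compare the shared JWT/Fernet secret across two tenant namespaces.
# Secret and key names are assumptions -- adjust to the chart's actual naming.
a=$(kubectl -n qry-acme get secret qry-secrets -o jsonpath='{.data.JWT_SECRET_KEY}')
b=$(kubectl -n qry-beta get secret qry-secrets -o jsonpath='{.data.JWT_SECRET_KEY}')
[ "$a" = "$b" ] && echo "OK: secrets match" || echo "MISMATCH: Fernet data won't port"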

Prerequisites

Once per operator machine:

gcloud auth login
gcloud config set project pue-madrid
gcloud container clusters get-credentials autopilot-cluster-pue \
--region=europe-southwest1 --project=pue-madrid

helm registry login europe-southwest1-docker.pkg.dev # OCI chart registry

brew install postgresql@18 gettext # macOS — psql client >= 18

AWS credentials with Route53 write on qry.dev hosted zone:

export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...

(Same creds Traefik uses; pull with kubectl -n default get secret traefik-dnsprovider-aws-config -o yaml.)

Python venv at backend/venv/ for passlib (used to hash the bootstrap admin password).
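
Taken together, a quick preflight sketch using the paths and names documented above (envsubst is presumably what gettext is installed for):

# All of these should pass before running provision_tenant.sh.
for c in gcloud kubectl helm psql envsubst; do
  command -v "$c" >/dev/null || echo "missing: $c"
done
psql --version                    # expect 18.x (postgresql@18)
kubectl config current-context    # should point at autopilot-cluster-pue
[ -n "$AWS_ACCESS_KEY_ID" ] || echo "AWS creds not exported"
[ -x backend/venv/bin/python ] || echo "backend venv missing (passlib)"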

Pre-create the Valkey instance

Memorystore Valkey is per-tenant: the chart points Celery at database /0, so tenants sharing one instance collide on the same queues. Create it before running the script:

gcloud memorystore instances create qry-<CUSTOMER_ID> \
--project=pue-madrid \
--location=europe-southwest1 \
--shard-count=1 \
--node-type=SHARED_CORE_NANO \
--authorization-mode=AUTH_DISABLED

Takes ~5 minutes. When it finishes, grab the PSC IP:

gcloud memorystore instances list --project=pue-madrid \
--location=europe-southwest1 \
--filter="name~qry-<CUSTOMER_ID>" \
--format='value(endpoints[0].connections[0].pscAutoConnection.ipAddress)'

You'll pass this to --valkey-ip.
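
To avoid copy-paste mistakes, capture it straight into a variable (same command as above):

VALKEY_IP=$(gcloud memorystore instances list --project=pue-madrid \
  --location=europe-southwest1 \
  --filter="name~qry-<CUSTOMER_ID>" \
  --format='value(endpoints[0].connections[0].pscAutoConnection.ipAddress)')
echo "$VALKEY_IP"   # pass as --valkey-ip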

Run the script

cd kubernetes/tenants

./provision_tenant.sh \
--customer-id acme \
--host acme.qry.dev \
--max-users 20 \
--max-datasources 10 \
--valkey-ip 10.0.100.25 \
--admin-email admin@acme.com \
--admin-password '<strong-password>' \
--features "rag,batch-profiling,scheduled-tasks,workspaces,domain-agents" \
--clone-config-from qrydb

What each flag does

  • --customer-id — short id, lowercase. Becomes namespace qry-<id> and database qrydb_<id>.
  • --host — public DNS name. Routed via Traefik IngressRoute.
  • --max-users, --max-datasources — license caps baked into the GCP SA JSON.
  • --valkey-ip — from the pre-create step above.
  • --admin-email, --admin-password — bootstrap admin; can change password on first login.
  • --features — comma-separated feature flags; restrict to what the customer's plan covers.
  • --clone-config-from — copy system_configuration and model_configurations from an existing tenant DB. Skip if you want a clean default config.
  • --chart-version — qry-platform Helm chart version. Pin for reproducibility.
  • --dry-run — print the plan without applying.
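
A sensible first pass is the example invocation above with --dry-run appended, then the same command without it once the plan looks right:

# Prints the plan; nothing is created.
./provision_tenant.sh \
  --customer-id acme \
  --host acme.qry.dev \
  --max-users 20 \
  --max-datasources 10 \
  --valkey-ip 10.0.100.25 \
  --admin-email admin@acme.com \
  --admin-password '<strong-password>' \
  --dry-run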

What the script does, in order

  1. Database — CREATE DATABASE qrydb_<id>, install pgvector extension.
  2. Namespace — kubectl create namespace qry-<id> with the right labels.
  3. RBAC — service accounts, roles, role bindings inside the namespace; cross-namespace RBAC for python-execution.
  4. Secrets — license JSON, JWT secret, DB password, Valkey URL, license SA key. Fernet-portable.
  5. Helm install — helm install qry oci://europe-southwest1-docker.pkg.dev/pue-madrid/puedata/qry-platform with values rendered for this tenant.
  6. IngressRoute — Traefik routing for the host, with TLS from Traefik's ACME resolver (Let's Encrypt + Route53 DNS-01, per the per-tenant list above).
  7. Route53 record — <host> CNAME → traefik-gcs.puedata.com (manual equivalent sketched after this list).
  8. Bootstrap admin — insert one row in the new DB's users table with the hashed password.
  9. Optional config clone — copy system_configuration rows from the source DB.
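
If step 7 ever needs reproducing by hand, the equivalent Route53 upsert looks roughly like this (the hosted-zone ID is a placeholder; the CNAME target is from the step above):

aws route53 change-resource-record-sets \
  --hosted-zone-id <qry-dev-zone-id> \
  --change-batch '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{
    "Name":"acme.qry.dev","Type":"CNAME","TTL":300,
    "ResourceRecords":[{"Value":"traefik-gcs.puedata.com"}]}}]}'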

If any step fails, the script bails. Re-running is idempotent — already-created resources are skipped.
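
The idempotency comes from check-before-create at each step. An illustrative sketch of the pattern (not the actual script source), shown for the namespace step:

if kubectl get namespace "qry-${CUSTOMER_ID}" >/dev/null 2>&1; then
  echo "namespace qry-${CUSTOMER_ID} exists, skipping"
else
  kubectl create namespace "qry-${CUSTOMER_ID}"
fi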

Verify

After the script finishes:

# Pod health
kubectl get pods -n qry-<id>

# Cert provisioning (inspect what Traefik is serving)
echo | openssl s_client -connect <host>:443 -servername <host> 2>/dev/null \
  | openssl x509 -noout -issuer -dates

# DNS resolution
dig +short <host>.

# Backend health
curl -k https://<host>/health/live

Open https://<host> and log in as the bootstrap admin.
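
Pods can take a minute or two to pass readiness, so a small wait loop on the documented health endpoint saves manual retries (a sketch):

# Poll the backend for up to ~5 minutes.
for i in $(seq 1 30); do
  curl -ksf "https://<host>/health/live" >/dev/null && { echo up; break; }
  sleep 10
done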

Common issues

provision_tenant.sh errors at helm install with image pull failure. The cluster's pod identity may not have read access to the Artifact Registry. Confirm the cluster's service account has artifactregistry.reader on the project.
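
A grant sketch; the service-account email is a placeholder (find it with gcloud container clusters describe autopilot-cluster-pue):

gcloud projects add-iam-policy-binding pue-madrid \
  --member="serviceAccount:<cluster-node-sa>@pue-madrid.iam.gserviceaccount.com" \
  --role="roles/artifactregistry.reader"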

Cert not issued after several minutes. Let's Encrypt rate limits plus DNS-01 propagation delay. Usually clears within 5–10 minutes. If still stuck, check Traefik's logs for ACME errors.
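
To watch ACME progress (the Traefik workload name is an assumption, inferred from the default-namespace secret in Prerequisites):

kubectl -n default logs deploy/traefik --since=15m | grep -i acme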

Backend pod crashloops with license invalid. The license JSON key in the secret was malformed. Re-run the script with a fresh secret, or replace manually.
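
A manual-replacement sketch, assuming the secret is named qry-license with key license.json (both are assumptions; check provision_tenant.sh for the real names):

kubectl -n qry-<id> create secret generic qry-license \
  --from-file=license.json=./license.json \
  --dry-run=client -o yaml | kubectl apply -f -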

asyncpg import error after a fresh tenant. The tenant got an old chart version pinning asyncpg < 0.31, which is incompatible with PG18. Bump --chart-version.
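
For an already-installed tenant, the equivalent manual bump is a helm upgrade against the same OCI ref from step 5 (version is a placeholder):

helm upgrade qry oci://europe-southwest1-docker.pkg.dev/pue-madrid/puedata/qry-platform \
  -n qry-<id> --version <newer-version> --reuse-values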

Tenant suspended unexpectedly. Check the validation log — most likely the license SA key was rotated server-side but not in the tenant's secret. See License management.

Customer's data should be on a different cluster. The script targets the pue-madrid Autopilot cluster by default. For ixenlab (RKE2/Harvester) a different procedure applies — see the project-specific runbook.

See also

  • License management — license caps, validation, and SA key rotation.
  • kubernetes/tenants/README.md and provision_tenant.sh — full source of truth for this procedure.
