PostgreSQL upgrades
Major Postgres versions (PG15 → PG16 → PG17 → PG18) don't share data directory layouts. You can't just bump the image tag — the new container will refuse to start because the on-disk format is incompatible.
This page covers the dump-and-restore procedure, with the specific gotchas that have bitten us in real upgrades. The reference run was PG16 → PG18 on ixenlab on 2026-02-07.
Prerequisites
- A working backup of the source DB. Don't skip this.
- A psql client matching the target version (e.g. postgresql@18 from Homebrew if upgrading to PG18). Older clients can't fully talk to newer servers.
- A pg_dump matching the source version. Mismatched dump tooling occasionally loses features.
- A sufficient downtime window. For a QRY tenant of typical size (~1 GB messages table), expect 30–60 minutes of read-only or full unavailability.
Plan
- Scale down all services that talk to Postgres.
- Take a pg_dump.
- Delete the PVC.
- Helm-upgrade with the new Postgres image.
- Bring Postgres up on a fresh PVC.
- Run CREATE EXTENSION vector on the empty DB.
- Restore the dump.
- Scale services back up.
Step by step
1. Scale down dependents
kubectl scale deployment/qry-backend --replicas=0 -n qry-app
kubectl scale deployment/qry-worker --replicas=0 -n qry-app
kubectl scale deployment/celery-worker --replicas=0 -n qry-app
kubectl scale deployment/celery-beat --replicas=0 -n qry-app
Wait for pods to terminate.
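The scale-downs return immediately, before the pods are actually gone. A sketch of a wait loop, assuming pods carry an app=<deployment-name> label (adjust the selector to your chart's labels):

```shell
# Hypothetical helper, assuming pods are labelled app=<deployment-name>.
# kubectl wait --for=delete returns once no matching pods remain
# (immediately if none exist), or fails after the timeout.
wait_for_scale_down() {
  ns="${1:-qry-app}"
  for app in qry-backend qry-worker celery-worker celery-beat; do
    kubectl wait --for=delete pod -l app="$app" -n "$ns" --timeout=120s
  done
  echo "all dependents terminated in $ns"
}
# wait_for_scale_down qry-app
```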
2. Take the dump
For ixenlab (Postgres in-cluster, postgres-0 pod):
# Dump to a file inside the pod, then copy it out
kubectl exec -n qry-app postgres-0 -- \
pg_dump -U postgres -Fc -f /tmp/qrydb.dump qrydb
kubectl cp qry-app/postgres-0:/tmp/qrydb.dump ./qrydb.dump
For Cloud SQL (Autopilot tenants), use gcloud sql export:
gcloud sql export sql qrydb-instance gs://your-backup-bucket/qrydb-pre-upgrade.sql \
--database=qrydb
Critical: kubectl port-forward drops on transfers over ~1 GB, so always use kubectl exec plus kubectl cp for the in-cluster path. The messages table is typically the bottleneck (~1 GB with embeddings). Restoring with --single-transaction over a port-forward is a guaranteed failure on large DBs; only use it over a direct connection.
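Since a silently truncated copy is the failure mode here, it's worth comparing checksums before deleting anything. A sketch, using the pod name and paths from the commands above:

```shell
# Compare the in-pod dump against the local copy by SHA-256.
# A mismatch means the kubectl cp was truncated: re-copy before step 3.
dump_copy_ok() {
  remote=$(kubectl exec -n qry-app postgres-0 -- \
    sha256sum /tmp/qrydb.dump | awk '{print $1}')
  local_sum=$(sha256sum "${1:-./qrydb.dump}" | awk '{print $1}')
  [ "$remote" = "$local_sum" ]
}
# dump_copy_ok ./qrydb.dump && echo "copy verified"
```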
3. Delete the PVC
For PVC-backed Postgres (ixenlab):
kubectl scale statefulset/postgres --replicas=0 -n qry-app
kubectl delete pvc postgres-data-postgres-0 -n qry-app
The PVC is gone. The data is gone. The dump is your only copy now — confirm it's readable before proceeding.
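One way to confirm readability without touching any database (a sketch; pg_restore --list only parses the dump's table of contents, it restores nothing):

```shell
# Custom-format (-Fc) dumps begin with the magic bytes "PGDMP";
# pg_restore --list then walks the TOC without needing a server.
verify_dump() {
  dump="$1"
  head -c 5 "$dump" | grep -q '^PGDMP' || {
    echo "$dump: not a custom-format dump" >&2; return 1; }
  pg_restore --list "$dump" > /dev/null || {
    echo "$dump: unreadable (truncated?)" >&2; return 1; }
  echo "$dump looks restorable"
}
# verify_dump ./qrydb.dump
```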
For Cloud SQL, this step is replaced by upgrading the instance via gcloud sql instances patch --database-version=POSTGRES_18. No PVC delete needed; Cloud SQL handles the migration.
4. Helm upgrade with new image
helm upgrade --reuse-values qry \
oci://europe-southwest1-docker.pkg.dev/pue-madrid/puedata/qry-platform \
--version <chart-version> \
--set postgres.image=pgvector/pgvector:pg18 \
-n qry-app
For QRY, always use pgvector/pgvector:pg<N> — this image bundles the pgvector extension, which we depend on. A vanilla postgres:18 image would force a separate pgvector install step.
5. Scale Postgres back up
kubectl scale statefulset/postgres --replicas=1 -n qry-app
kubectl wait --for=condition=ready pod/postgres-0 -n qry-app --timeout=300s
The PVC is fresh and empty. Postgres initialises from scratch.
6. Install pgvector
The pgvector image bundles the binary, but the database needs the extension explicitly:
kubectl exec -n qry-app postgres-0 -- \
psql -U postgres -d qrydb -c "CREATE EXTENSION IF NOT EXISTS vector"
7. Restore the dump
kubectl cp ./qrydb.dump qry-app/postgres-0:/tmp/qrydb.dump
kubectl exec -n qry-app postgres-0 -- \
pg_restore -U postgres -d qrydb /tmp/qrydb.dump
Restore time is proportional to dump size; for 1 GB of data, expect 5–15 minutes depending on the database's CPU.
8. Scale services back up
kubectl scale deployment/qry-backend --replicas=2 -n qry-app
kubectl scale deployment/qry-worker --replicas=N -n qry-app
kubectl scale deployment/celery-worker --replicas=N -n qry-app
kubectl scale deployment/celery-beat --replicas=1 -n qry-app
Hit the /health/ready endpoint to confirm Postgres connectivity:
curl -k https://<host>/health/ready
It should return 200. A 503 with a Postgres error means the restore didn't complete or the service can't reach the DB.
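If the backend needs a few restarts to reconnect, a small poll loop saves manual retries. A sketch; the host placeholder and attempt count are yours to adjust:

```shell
# Poll the readiness endpoint until it returns HTTP 200.
# -k mirrors the self-signed-cert setup used above.
wait_ready() {
  url="$1"; attempts="${2:-30}"
  for i in $(seq 1 "$attempts"); do
    code=$(curl -sk -o /dev/null -w '%{http_code}' "$url")
    if [ "$code" = "200" ]; then echo "ready after $i attempt(s)"; return 0; fi
    sleep 5
  done
  echo "not ready after $attempts attempts" >&2; return 1
}
# wait_ready "https://<host>/health/ready"
```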
The JWT_SECRET_KEY gotcha
Settings reads JWT_SECRET_KEY, not SECRET_KEY. This was fixed in v0.85.3. If you're restoring data from a tenant whose credentials were encrypted under a different key, the Fernet-encrypted credentials will fail to decrypt.
Both clusters that share user data (e.g. DR replicas) must use the same JWT_SECRET_KEY. After upgrade, confirm the secret hasn't been regenerated:
kubectl get secret qry-platform-secrets -n qry-app \
-o jsonpath='{.data.JWT_SECRET_KEY}' | base64 -d
Compare against the source cluster.
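To compare without printing the secret itself, hash it on each side. A sketch; the context names are placeholders for your kubeconfig entries:

```shell
# Digest the JWT secret from a given kubeconfig context; matching
# digests across clusters means matching keys.
secret_digest() {
  kubectl --context "$1" get secret qry-platform-secrets -n qry-app \
    -o jsonpath='{.data.JWT_SECRET_KEY}' | base64 -d | sha256sum | awk '{print $1}'
}
# [ "$(secret_digest source-ctx)" = "$(secret_digest target-ctx)" ] \
#   && echo "keys match"
```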
PG18 specifics
PostgreSQL 18 brings:
- Async I/O — ~3× faster vector similarity searches.
- Parallel GIN builds — speeds up ABAC tag indexes.
- Data checksums enabled by default.
- UUIDv7 available — but QRY stays on UUIDv4 (current data sizes don't justify the migration).
asyncpg>=0.31.0 is required. Older asyncpg versions silently misbehave against PG18.
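A quick guard for that requirement (a sketch, using GNU sort's version ordering):

```shell
# version_ge A B: succeeds if dotted version A >= B.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}
# version_ge "$(pip show asyncpg | awk '/^Version:/{print $2}')" 0.31.0 \
#   || echo "asyncpg too old: need >=0.31.0" >&2
```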
Common issues
Restore fails with "extension vector does not exist".
Step 6 was skipped. CREATE EXTENSION IF NOT EXISTS vector first, then re-restore.
Restored DB is missing rows.
Dump was incomplete (partial transfer due to dropped port-forward). Re-take the dump using kubectl cp instead.
Backend pods crash with Fernet token invalid.
JWT_SECRET_KEY mismatch between the cluster and the encrypted credentials in the DB. Fix the secret to match the original.
Restore takes forever.
Parallelise it: pg_restore -j 4 loads tables and builds indexes in parallel. This requires a custom-format (-Fc) dump, which step 2 already produces.
Cloud SQL upgrade got stuck. Major-version Cloud SQL upgrades have their own quirks. Open a GCP support ticket; for ixenlab, you control the procedure and can recover via dump/restore.
See also
- Multi-tenant provisioning — fresh tenants come with the latest PG image.
- Monitoring and health — verify health post-upgrade.
- The backend/scripts/sync_db.sh helper covers similar dump/restore mechanics for cross-cluster sync.