PostgreSQL upgrades
Major Postgres versions (PG15 → PG16 → PG17 → PG18) don't share data directory layouts. You can't just bump the image tag — the new container will refuse to start because the on-disk format is incompatible.
This page covers the dump-and-restore procedure, with the specific gotchas that have bitten us in real upgrades. The reference run was PG16 → PG18 on ixenlab on 2026-02-07.
Prerequisites
- A working backup of the source DB. Don't skip this.
- A psql client matching the target version (e.g. postgresql@18 from Homebrew if upgrading to PG18). Older clients can't fully talk to newer servers.
- A pg_dump matching the source version. Mismatched dump tooling occasionally loses features.
- A sufficient downtime window. For a QRY tenant of typical size (~1 GB messages table), expect 30–60 minutes of read-only or full unavailability.
Plan
- Scale down all services that talk to Postgres.
- Take a pg_dump.
- Delete the PVC.
- Helm-upgrade with the new Postgres image.
- Bring Postgres up on a fresh PVC.
- Run CREATE EXTENSION vector on the empty DB.
- Restore the dump.
- Scale services back up.
Step by step
1. Scale down dependents
kubectl scale deployment/qry-backend --replicas=0 -n qry-app
kubectl scale deployment/qry-worker --replicas=0 -n qry-app
kubectl scale deployment/celery-worker --replicas=0 -n qry-app
kubectl scale deployment/celery-beat --replicas=0 -n qry-app
Wait for pods to terminate.
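The scale-downs return immediately, before the pods are actually gone. A sketch of a wait loop, assuming pods carry an app=<deployment-name> label (adjust the selector to your chart's labels):

```shell
# Hypothetical helper, assuming pods are labelled app=<deployment-name>.
# kubectl wait --for=delete returns once no matching pods remain
# (immediately if none exist), or fails after the timeout.
wait_for_scale_down() {
  ns="${1:-qry-app}"
  for app in qry-backend qry-worker celery-worker celery-beat; do
    kubectl wait --for=delete pod -l app="$app" -n "$ns" --timeout=120s
  done
  echo "all dependents terminated in $ns"
}
# wait_for_scale_down qry-app
```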
2. Take the dump
For ixenlab (Postgres in-cluster, postgres-0 pod):
# Dump to a file inside the pod, then copy it out
kubectl exec -n qry-app postgres-0 -- \
pg_dump -U postgres -Fc -f /tmp/qrydb.dump qrydb
kubectl cp qry-app/postgres-0:/tmp/qrydb.dump ./qrydb.dump
For Cloud SQL (Autopilot tenants), use gcloud sql export:
gcloud sql export sql qrydb-instance gs://your-backup-bucket/qrydb-pre-upgrade.sql \
--database=qrydb
Critical: kubectl port-forward drops on transfers over ~1 GB, so always use kubectl exec plus kubectl cp for the in-cluster path. The messages table is typically the bottleneck (~1 GB with embeddings). Restoring with --single-transaction over a port-forward is a guaranteed failure on large DBs; only use it over a direct connection.
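Since a silently truncated copy is the failure mode here, it's worth comparing checksums before deleting anything. A sketch, using the pod name and paths from the commands above:

```shell
# Compare the in-pod dump against the local copy by SHA-256.
# A mismatch means the kubectl cp was truncated: re-copy before step 3.
dump_copy_ok() {
  remote=$(kubectl exec -n qry-app postgres-0 -- \
    sha256sum /tmp/qrydb.dump | awk '{print $1}')
  local_sum=$(sha256sum "${1:-./qrydb.dump}" | awk '{print $1}')
  [ "$remote" = "$local_sum" ]
}
# dump_copy_ok ./qrydb.dump && echo "copy verified"
```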
3. Delete the PVC
For PVC-backed Postgres (ixenlab):
kubectl scale statefulset/postgres --replicas=0 -n qry-app
kubectl delete pvc postgres-data-postgres-0 -n qry-app
The PVC is gone. The data is gone. The dump is your only copy now — confirm it's readable before proceeding.
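One way to confirm readability without touching any database (a sketch; pg_restore --list only parses the dump's table of contents, it restores nothing):

```shell
# Custom-format (-Fc) dumps begin with the magic bytes "PGDMP";
# pg_restore --list then walks the TOC without needing a server.
verify_dump() {
  dump="$1"
  head -c 5 "$dump" | grep -q '^PGDMP' || {
    echo "$dump: not a custom-format dump" >&2; return 1; }
  pg_restore --list "$dump" > /dev/null || {
    echo "$dump: unreadable (truncated?)" >&2; return 1; }
  echo "$dump looks restorable"
}
# verify_dump ./qrydb.dump
```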
For Cloud SQL, this step is replaced by upgrading the instance via gcloud sql instances patch --database-version=POSTGRES_18. No PVC delete needed; Cloud SQL handles the migration.
4. Helm upgrade with new image
helm upgrade --reuse-values qry \
oci://europe-southwest1-docker.pkg.dev/pue-madrid/puedata/qry-platform \
--version <chart-version> \
--set postgres.image=pgvector/pgvector:pg18 \
-n qry-app
For QRY, always use pgvector/pgvector:pg<N> — this image bundles the pgvector extension, which we depend on. A vanilla postgres:18 image would force a separate pgvector install step.
5. Scale Postgres back up
kubectl scale statefulset/postgres --replicas=1 -n qry-app
kubectl wait --for=condition=ready pod/postgres-0 -n qry-app --timeout=300s
The PVC is fresh and empty. Postgres initialises from scratch.
6. Install pgvector
The pgvector image bundles the binary, but the database needs the extension explicitly:
kubectl exec -n qry-app postgres-0 -- \
psql -U postgres -d qrydb -c "CREATE EXTENSION IF NOT EXISTS vector"
7. Restore the dump
kubectl cp ./qrydb.dump qry-app/postgres-0:/tmp/qrydb.dump
kubectl exec -n qry-app postgres-0 -- \
pg_restore -U postgres -d qrydb /tmp/qrydb.dump
Restore time is proportional to dump size; for 1 GB of data, expect 5–15 minutes depending on the database's CPU.
8. Scale services back up
kubectl scale deployment/qry-backend --replicas=2 -n qry-app
kubectl scale deployment/qry-worker --replicas=N -n qry-app
kubectl scale deployment/celery-worker --replicas=N -n qry-app
kubectl scale deployment/celery-beat --replicas=1 -n qry-app
Hit the /health/ready endpoint to confirm Postgres connectivity:
curl -k https://<host>/health/ready
It should return 200. A 503 with a Postgres error means the restore didn't complete or the service can't reach the DB.
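If the backend needs a few restarts to reconnect, a small poll loop saves manual retries. A sketch; the host placeholder and attempt count are yours to adjust:

```shell
# Poll the readiness endpoint until it returns HTTP 200.
# -k mirrors the self-signed-cert setup used above.
wait_ready() {
  url="$1"; attempts="${2:-30}"
  for i in $(seq 1 "$attempts"); do
    code=$(curl -sk -o /dev/null -w '%{http_code}' "$url")
    if [ "$code" = "200" ]; then echo "ready after $i attempt(s)"; return 0; fi
    sleep 5
  done
  echo "not ready after $attempts attempts" >&2; return 1
}
# wait_ready "https://<host>/health/ready"
```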
The JWT_SECRET_KEY gotcha
Settings reads JWT_SECRET_KEY, not SECRET_KEY. This was fixed in v0.85.3. If you're restoring data from a tenant whose credentials were encrypted under a different key, the Fernet-encrypted credentials will fail to decrypt.
Both clusters that share user data (e.g. DR replicas) must use the same JWT_SECRET_KEY. After upgrade, confirm the secret hasn't been regenerated:
kubectl get secret qry-platform-secrets -n qry-app \
-o jsonpath='{.data.JWT_SECRET_KEY}' | base64 -d
Compare against the source cluster.
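To compare without printing the secret itself, hash it on each side. A sketch; the context names are placeholders for your kubeconfig entries:

```shell
# Digest the JWT secret from a given kubeconfig context; matching
# digests across clusters means matching keys.
secret_digest() {
  kubectl --context "$1" get secret qry-platform-secrets -n qry-app \
    -o jsonpath='{.data.JWT_SECRET_KEY}' | base64 -d | sha256sum | awk '{print $1}'
}
# [ "$(secret_digest source-ctx)" = "$(secret_digest target-ctx)" ] \
#   && echo "keys match"
```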
PG18 specifics
PostgreSQL 18 brings:
- Async I/O — ~3× faster vector similarity searches.
- Parallel GIN builds — speeds up ABAC tag indexes.
- Data checksums enabled by default.
- UUIDv7 available — but QRY stays on UUIDv4 (current data sizes don't justify the migration).
asyncpg>=0.31.0 is required. Older asyncpg versions silently misbehave against PG18.
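A quick guard for that requirement (a sketch, using GNU sort's version ordering):

```shell
# version_ge A B: succeeds if dotted version A >= B.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}
# version_ge "$(pip show asyncpg | awk '/^Version:/{print $2}')" 0.31.0 \
#   || echo "asyncpg too old: need >=0.31.0" >&2
```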
Common issues
Restore fails with "extension vector does not exist".
Step 6 was skipped. CREATE EXTENSION IF NOT EXISTS vector first, then re-restore.
Restored DB is missing rows.
Dump was incomplete (partial transfer due to dropped port-forward). Re-take the dump using kubectl cp instead.
Backend pods crash with Fernet token invalid.
JWT_SECRET_KEY mismatch between the cluster and the encrypted credentials in the DB. Fix the secret to match the original.
Restore takes forever.
Parallelise it: pg_restore -j 4 loads tables and builds indexes in parallel. This requires a custom-format (-Fc) dump, which step 2 already produces.
Cloud SQL upgrade got stuck. Major-version Cloud SQL upgrades have their own quirks. Open a GCP support ticket; for ixenlab, you control the procedure and can recover via dump/restore.
See also
- Multi-tenant provisioning — fresh tenants come with the latest PG image.
- Monitoring and health — verify health post-upgrade.
- The backend/scripts/sync_db.sh helper covers similar dump/restore mechanics for cross-cluster sync.