External Spark cluster

For DataFlow and Lakeflow transfers larger than ~10 GB, QRY delegates execution to an external Spark Standalone cluster. The cluster runs in a separate Kubernetes namespace (spark-system on ixenlab) with its own master and workers, and exposes a REST submission endpoint on port 6066.

Why bother offloading: it's not raw speed (the bottleneck is per-row serialisation in kudu-spark3, ~675 rows/s for 174-column tables with 3 tablets; see below). It's operational decoupling: a runaway 50 GB transfer can't pin the qry-worker pool, can be retried independently, and has its own resource ceiling.

When delegation kicks in

QRY's pipeline routing checks if a transfer is "Spark-eligible":

Condition                                                     Result
Transfer size > 10 GB or > 10 M rows                          Delegate to Spark
Pipeline has transforms / data-quality / watermark / UPSERT   Stay on qry-worker (Spark short-circuit declined)
Bare SELECT ... INSERT                                        Eligible for delegation
Smaller transfer                                              Stay on qry-worker (delegation overhead not worth it)

The short-circuit logic lives in _try_spark_delegation. It bails out for anything that isn't a straight copy, because Spark's pipeline rewrite is not a 1:1 reimplementation of QRY's transform/DQ semantics and risks producing subtly wrong outputs.
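
In outline, the check behaves something like this sketch (field names and thresholds are illustrative, not QRY's actual code):

    # Hypothetical sketch of the delegation short-circuit; names and
    # thresholds are illustrative, not QRY's actual implementation.
    SIZE_THRESHOLD_BYTES = 10 * 1024**3   # ~10 GB
    ROW_THRESHOLD = 10_000_000            # 10 M rows

    def is_spark_eligible(pipeline) -> bool:
        # Anything beyond a bare SELECT ... INSERT stays on qry-worker.
        if pipeline.transforms or pipeline.data_quality_checks:
            return False
        if pipeline.watermark or pipeline.write_mode == "UPSERT":
            return False
        # Small transfers aren't worth the delegation overhead.
        est = pipeline.size_estimate
        return est.bytes > SIZE_THRESHOLD_BYTES or est.rows > ROW_THRESHOLD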

Architecture

┌──────────────┐        ┌──────────────────────┐        ┌──────────────┐
│  qry-worker  │──REST─→│  spark-system        │─reads─→│  Source DB   │
│  (decides to │  6066  │  - spark-master      │        │  (Cloudera/  │
│   delegate)  │        │  - spark-workers (N) │        │   Kudu)      │
└──────────────┘        └──────────┬───────────┘        └──────────────┘
                                   │
                                   └──→ writes ──→ BigQuery

qry-worker submits the Spark job over REST. The Spark cluster reads from the source database, writes to BigQuery, and returns success/failure to qry-worker. The data never passes through qry-worker; it flows source DB → spark-worker → BigQuery directly.
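
Under the hood, a Spark Standalone REST submission is a JSON POST to /v1/submissions/create on port 6066. A minimal sketch of a submission, with placeholder jar path, class name, master URL, and arguments (the job spec QRY actually generates will differ):

    import requests

    # Placeholder values: the jar path, main class, master URL, and app
    # arguments are illustrative, not what QRY actually generates.
    spec = {
        "action": "CreateSubmissionRequest",
        "clientSparkVersion": "3.5.0",
        "appResource": "file:///opt/qry/jobs/transfer-job.jar",
        "mainClass": "example.TransferJob",
        "appArgs": ["--source", "kudu", "--target", "bigquery"],
        "environmentVariables": {"SPARK_ENV_LOADED": "1"},
        "sparkProperties": {
            "spark.app.name": "qry-transfer",
            "spark.master": "spark://spark-master.spark-system.svc:7077",
            "spark.submit.deployMode": "cluster",
        },
    }

    resp = requests.post("http://10.0.80.44:6066/v1/submissions/create",
                         json=spec, timeout=30)
    resp.raise_for_status()
    submission_id = resp.json()["submissionId"]  # used later for status polls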

Configuring on QRY side

In Admin > System Settings > Lakeflow > Spark:

  • Spark REST endpoint — http://<lb-ip>:6066. On ixenlab the LB is 10.0.80.44.
  • Spark master URL — used inside generated job specs.
  • Submission user / token — if your Spark cluster has auth.
  • Default executor config — number of executors, executor memory, cores per executor.

Test the wiring with Submit test job — QRY submits a tiny no-op job and confirms the round-trip.
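
The executor defaults map onto standard Spark properties in the generated job spec. The values below are illustrative only; the right numbers depend on your worker sizing:

    # Illustrative executor defaults as they would appear in the job spec's
    # sparkProperties; tune to your Spark workers.
    executor_defaults = {
        "spark.executor.instances": "4",   # number of executors
        "spark.executor.memory": "4g",     # memory per executor
        "spark.executor.cores": "2",       # cores per executor
    }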

Setting up the Spark cluster itself

That's a separate playbook (docs/etl/SPARK_INTEGRATION.md and SPARK_SETUP.md in the QRY repo). High-level:

  • Spark Standalone (not Kubernetes-mode) for simplicity and isolation.
  • Master + workers in their own namespace.
  • LoadBalancer or NodePort exposing port 6066 for REST submission and 4040+ for the UI.
  • Ingress / network policy locking down 6066 to qry-worker only — REST submission is unauthenticated by default and can take arbitrary jobs.

Throughput expectations

For Cloudera/Kudu sources, the bottleneck is per-row serialisation in kudu-spark3. Empirically:

  • ~675 rows/sec for 174-column tables with 3 tablets.
  • Roughly linear in column count, sublinear in tablet count.

A 50 GB / 100 M row transfer at this rate takes ~41 hours. That's not a bug in QRY; it's the kudu-spark3 driver. For latency-sensitive moves, partition the transfer (per-day chunks scheduled separately) and run multiple jobs in parallel. QRY's Lakeflow can express that as a pipeline.
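
The arithmetic is easy to sanity-check; a quick helper using the measured rate (the parallelism factor assumes per-day chunks running as separate jobs):

    # Back-of-envelope ETA at the measured kudu-spark3 rate.
    ROWS_PER_SEC = 675  # 174-column table, 3 tablets (see above)

    def eta_hours(total_rows: int, parallel_jobs: int = 1) -> float:
        return total_rows / (ROWS_PER_SEC * parallel_jobs) / 3600

    print(eta_hours(100_000_000))                   # ~41 h, single job
    print(eta_hours(100_000_000, parallel_jobs=8))  # ~5 h across 8 chunks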

Submission mode

REST mode (spark.submit.deployMode=cluster) is what QRY uses. The job runs entirely on the Spark cluster; qry-worker just polls for status. If qry-worker dies mid-job, the Spark job keeps running; restart the watching qry-worker and it re-attaches to the existing job by its submission id.
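
Status polling and re-attachment use the Standalone REST status endpoint; the endpoint is standard Spark, the loop itself is an illustrative sketch:

    import time
    import requests

    # Poll until the driver reaches a terminal state. Because the submission
    # id is all that's needed, a restarted qry-worker can resume this loop.
    def wait_for_job(endpoint: str, submission_id: str) -> str:
        while True:
            resp = requests.get(
                f"{endpoint}/v1/submissions/status/{submission_id}", timeout=30)
            state = resp.json()["driverState"]  # RUNNING, FINISHED, FAILED, ...
            if state in ("FINISHED", "FAILED", "KILLED", "ERROR"):
                return state
            time.sleep(10)

    # wait_for_job("http://10.0.80.44:6066", submission_id)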

Common issues

"Connection refused" from qry-worker to 6066. Network policy or firewall. Confirm qry-worker pod can reach the Spark LB IP on port 6066.

Job submitted, status stuck in PENDING. No Spark workers available, or workers are running other jobs. Check Spark master UI (port 8080) for queue.

Job hangs at "writing to BigQuery". BigQuery service-account permissions on the spark-worker pod identity. The worker writes directly; it needs bigquery.dataEditor on the target dataset.
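
For context, the write is a direct spark-bigquery connector call from the job; a sketch with an illustrative table name (the identity running the executors needs bigquery.dataEditor on the target dataset):

    from pyspark.sql import DataFrame

    def write_to_bigquery(df: DataFrame, table: str) -> None:
        # Direct write via the spark-bigquery connector's Storage Write API;
        # avoids a GCS staging bucket. Table name like "project.dataset.table".
        (df.write.format("bigquery")
            .option("writeMethod", "direct")
            .mode("append")
            .save(table))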

Throughput much worse than the kudu-spark3 numbers above. Usually underprovisioned workers. Check executor count and memory; bump spark.executor.memory and spark.executor.instances.

Job failed but qry-worker doesn't know. The status poll lost track. Look in Spark master UI for the job's actual final state and trigger a manual retry from Lakeflow.

_try_spark_delegation not delegating despite a 20 GB job. The pipeline has a transform or DQ step. Spark short-circuit only kicks in for bare SELECT/INSERT. Either remove the in-flight steps (do them in a separate post-processing job) or let it run on qry-worker.
