External Spark Cluster
Status: Production Ready
Overview
Offloads DataFlow/Lakeflow transfers larger than 10 GB from Cloudera/Kudu to BigQuery onto an external Spark Standalone cluster. The cluster is treated as an opaque compute backend, the same pattern used for Ray in ML Training.
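The size-based offload rule can be pictured as a simple routing decision. The sketch below is illustrative only: `choose_backend`, `SPARK_THRESHOLD_BYTES`, the reading of "10GB" as decimal gigabytes, and the use of Celery as the non-offloaded path are all assumptions, not the actual Qry implementation.

```python
# Hypothetical routing helper; names and threshold interpretation are assumptions.
SPARK_THRESHOLD_BYTES = 10 * 10**9  # ">10GB" read as decimal gigabytes (assumption)


def choose_backend(transfer_bytes: int) -> str:
    """Return "spark" for transfers strictly above the threshold, else "celery"."""
    if transfer_bytes < 0:
        raise ValueError("transfer_bytes must be non-negative")
    return "spark" if transfer_bytes > SPARK_THRESHOLD_BYTES else "celery"


if __name__ == "__main__":
    for size in (2 * 10**9, 10 * 10**9, 25 * 10**9):
        print(f"{size / 10**9:.0f} GB -> {choose_backend(size)}")
```

The comparison is strict, so a transfer of exactly 10 GB stays on the non-offloaded path in this sketch.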
Current Status
REST submission mode has been validated end-to-end from chat (100K and 1M rows written to BigQuery). Throughput is limited by per-row serialization in the kudu-spark3 connector (~675 rows/s aggregate for 174-column tables with 3 tablets). For tables with few tablets, Spark's value is operational decoupling, not raw speed compared with Celery. See the "Measured performance" section of docs/etl/SPARK_INTEGRATION.md for details. The k8s_job mode is deferred.
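To put the throughput figure in perspective, a back-of-the-envelope estimate of transfer time can be derived from it. The sketch below assumes throughput stays constant at the quoted ~675 rows/s; the function name is hypothetical, and the results are estimates, not measured durations.

```python
# Rough transfer-time estimate from the aggregate throughput quoted above.
# Assumes constant throughput; real runs include startup and write overhead.
MEASURED_ROWS_PER_SEC = 675  # 174-column table, 3 tablets


def estimated_seconds(rows: int, rows_per_sec: float = MEASURED_ROWS_PER_SEC) -> float:
    """Return the estimated transfer time in seconds for the given row count."""
    if rows_per_sec <= 0:
        raise ValueError("rows_per_sec must be positive")
    return rows / rows_per_sec


if __name__ == "__main__":
    for rows in (100_000, 1_000_000):
        secs = estimated_seconds(rows)
        print(f"{rows:>9,} rows: ~{secs:,.0f} s (~{secs / 60:.1f} min)")
```

Under this assumption, 100K rows take roughly 2.5 minutes and 1M rows roughly 25 minutes, which illustrates why the benefit for few-tablet tables is decoupling and not speed.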
Key Features
Getting Started
Tip: This feature is available in your Qry instance. Check the User Guide for detailed instructions.
See Also
Last updated: April 22, 2026