
External Spark Cluster

Status: Production Ready

Overview

Offloads DataFlow/Lakeflow transfers larger than 10 GB from Cloudera/Kudu to BigQuery onto an external Spark Standalone cluster. The cluster is treated as an opaque compute backend, following the same pattern as Ray for ML Training.
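The size-based routing described above can be sketched as follows. This is a minimal illustration, not the actual dispatcher: the function name `choose_backend`, the constant `SPARK_THRESHOLD_BYTES`, and the reading of "10 GB" as 10 GiB are all assumptions, and Celery is taken as the default path only because it is the alternative this page compares Spark against.

```python
# Hypothetical routing helper: all names here are illustrative assumptions,
# not identifiers from the Qry codebase.
SPARK_THRESHOLD_BYTES = 10 * 1024**3  # "10 GB" interpreted as 10 GiB (assumption)


def choose_backend(estimated_bytes: int) -> str:
    """Return "spark" for transfers above the threshold, otherwise "celery"."""
    if estimated_bytes < 0:
        raise ValueError("estimated_bytes must be non-negative")
    # Strictly greater than, matching the ">10GB" wording of the overview.
    return "spark" if estimated_bytes > SPARK_THRESHOLD_BYTES else "celery"
```

Keeping the decision in one pure function makes the threshold easy to test and to change without touching either compute backend.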

Current Status

REST submission mode has been validated end-to-end from chat (100K and 1M rows to BigQuery). Throughput is limited by per-row serialization in the kudu-spark3 connector: about 675 rows/s in aggregate for 174-column tables with 3 tablets. For tables with few tablets, Spark's value is therefore operational decoupling rather than raw speed compared with Celery. See the "Measured performance" section of docs/etl/SPARK_INTEGRATION.md for details. The k8s_job mode is deferred.
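To put the measured rate in perspective, the sketch below turns it into a rough transfer-time estimate. The only figure taken from this page is the ~675 rows/s aggregate rate; the helper names are hypothetical, and the even per-tablet split assumes that scan parallelism follows the tablet count, which may not hold for every table.

```python
# Back-of-the-envelope estimate based on the aggregate rate quoted above.
MEASURED_ROWS_PER_SEC = 675.0  # 174-column table, 3 tablets


def estimated_minutes(rows: int, rows_per_sec: float = MEASURED_ROWS_PER_SEC) -> float:
    """Estimate wall-clock minutes to move `rows` rows at a fixed rate."""
    if rows_per_sec <= 0:
        raise ValueError("rows_per_sec must be positive")
    return rows / rows_per_sec / 60.0


def per_tablet_rate(aggregate_rows_per_sec: float, tablets: int) -> float:
    """Split an aggregate rate evenly across tablets (simplifying assumption)."""
    if tablets <= 0:
        raise ValueError("tablets must be positive")
    return aggregate_rows_per_sec / tablets


if __name__ == "__main__":
    print(f"1M rows: ~{estimated_minutes(1_000_000):.1f} min")
    print(f"per tablet: ~{per_tablet_rate(MEASURED_ROWS_PER_SEC, 3):.0f} rows/s")
```

At this rate a 1M-row transfer works out to roughly 25 minutes, which illustrates why a table with only a few tablets gains little raw speed from Spark.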

Key Features

Getting Started

Tip: This feature is available in your Qry instance. Check the User Guide for detailed instructions.

See Also


Last updated: April 22, 2026