DATAPROC - com.google.cloud.spark.bigquery.BigQueryRelationProvider not a subtype

rrotter · September 16, 2025, 12:40pm

I’m new to Dataproc and am having trouble running a job that accesses a PostgreSQL database (Compute Engine VM) to write data to BigQuery.

I created a cluster with the following configuration:

gcloud dataproc clusters create NAME-CLUSTER \
    --enable-component-gateway \
    --bucket STAGING-BUCKET \
    --region southamerica-east1 \
    --subnet default \
    --public-ip-address \
    --master-machine-type e2-standard-2 \
    --master-boot-disk-size 100 \
    --num-workers 2 \
    --worker-machine-type e2-standard-2 \
    --worker-boot-disk-size 200 \
    --image-version 2.2-ubuntu22 \
    --properties dataproc:conda.packages=google-cloud-secret-manager==2.24.0,spark:spark.jars=https://jdbc.postgresql.org/download/postgresql-42.7.7.jar \
    --scopes 'https://www.googleapis.com/auth/cloud-platform' \
    --initialization-actions 'gs://goog-dataproc-initialization-actions-southamerica-east1/connectors/connectors.sh' \
    --metadata spark-bigquery-connector-version=0.42.2 \
    --project PROJECT_ID

These are the last two commands I used to submit the job:

#1
gcloud dataproc jobs submit pyspark \
    --cluster NAME-CLUSTER \
    gs://BUCKET/initial_load.py \
    --region southamerica-east1 \
    --files=gs://BUCKET/config.json \
    --jars=gs://BUCKET/jars/spark-3.5-bigquery-0.42.2.jar \
    --project PROJECT_ID

#2
gcloud dataproc jobs submit pyspark \
    --cluster NAME-CLUSTER \
    gs://BUCKET/initial_load.py \
    --region southamerica-east1 \
    --files=gs://BUCKET/config.json \
    --jars=gs://BUCKET/jars/spark-bigquery-with-dependencies_2.13-0.42.2.jar \
    --project PROJECT_ID

Both returned the following error:

25/09/15 19:27:54 INFO GoogleHadoopOutputStream: hflush(): No-op due to rate limit (RateLimiter[stableRate=0.2qps]): readers will not yet see flushed data for gs://dataproc-temp-sa-east1-454845955091-uyoz8e6n/27d1a5a7-6e7e-4247-83e2-5a59bc27a244/spark-job-history/application_1757961849781_0003.inprogress [CONTEXT ratelimit_period="1 MINUTES" ]
Iniciando a leitura dos dados do PostgreSQL…
Traceback (most recent call last):
  File "/tmp/with-dependencies/initial_load.py", line 74, in <module>
    ).load()
      ^^^^^^
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 314, in load
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 179, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o79.load.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: com.google.cloud.spark.bigquery.BigQueryRelationProvider not a subtype

I created and recreated the cluster with updated BigQuery connector versions and changed the job submission commands to see whether I got different results, but neither attempt worked.

I’m not sure if it’s an error in my code or the cluster configuration.
Could someone please help me?

Thank you.

mpinlac · September 16, 2025, 3:11pm

Hi @rrotter,

Based on the error, it looks like the BigQuery connector is being loaded twice. As stated here, the Spark BigQuery connector is already pre-installed on Dataproc 2.1 and later images, and it is also added automatically when you set this metadata flag:
--metadata spark-bigquery-connector-version=0.42.X
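One way to check for a duplicate is to list the BigQuery connector jars present on a cluster node. The following is only a minimal sketch: `list_bigquery_jars` is a hypothetical helper name, and the directories you pass to it are assumptions about where jars live on your image.

```shell
# Hypothetical helper: print every jar whose file name mentions "bigquery"
# under the directories given as arguments.
list_bigquery_jars() {
    # Missing or unreadable directories are ignored by discarding find's errors.
    find "$@" -type f -name '*bigquery*.jar' 2>/dev/null | sort
}
```

For example, after connecting to the master node over SSH, `list_bigquery_jars /usr/lib/spark/jars` would search the Spark jars directory (the path is an assumption based on the `/usr/lib/spark` prefix in the traceback). Seeing a connector jar there in addition to the one passed with --jars would be consistent with the connector being loaded twice.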

Try removing the JAR you are loading manually (--jars=gs://BUCKET/jars/spark-3.5-bigquery-0.42.2.jar in your first command, and likewise the spark-bigquery-with-dependencies JAR in your second), since the BigQuery connector you need is already provided by your cluster configuration. This should resolve the error you are encountering.
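As a sketch of that suggestion, the submit command from the question could look like this with the --jars flag dropped. The placeholders (NAME-CLUSTER, BUCKET, PROJECT_ID) are the ones from the question, and the function name `submit_initial_load` is a hypothetical wrapper added here so the snippet can be pasted into a shell without running the job immediately.

```shell
# Sketch: resubmit the job without --jars, relying on the BigQuery
# connector that the cluster configuration already provides.
submit_initial_load() {
    gcloud dataproc jobs submit pyspark \
        --cluster NAME-CLUSTER \
        gs://BUCKET/initial_load.py \
        --region southamerica-east1 \
        --files=gs://BUCKET/config.json \
        --project PROJECT_ID
}
```

Calling `submit_initial_load` then submits the job. The PostgreSQL JDBC driver is not affected by this change, since the cluster sets it separately through the spark:spark.jars property.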

rrotter · September 16, 2025, 3:14pm

Thank you very much, @mpinlac.
