DATAPROC - com.google.cloud.spark.bigquery.BigQueryRelationProvider not a subtype

I’m new to Dataproc and am having trouble running a job that accesses a PostgreSQL database (Compute Engine VM) to write data to BigQuery.

I created a cluster with the following configuration:

gcloud dataproc clusters create NAME-CLUSTER \
    --enable-component-gateway \
    --bucket STAGING-BUCKET \
    --region southamerica-east1 \
    --subnet default \
    --public-ip-address \
    --master-machine-type e2-standard-2 \
    --master-boot-disk-size 100 \
    --num-workers 2 \
    --worker-machine-type e2-standard-2 \
    --worker-boot-disk-size 200 \
    --image-version 2.2-ubuntu22 \
    --properties dataproc:conda.packages=google-cloud-secret-manager==2.24.0,spark:spark.jars=https://jdbc.postgresql.org/download/postgresql-42.7.7.jar \
    --scopes 'https://www.googleapis.com/auth/cloud-platform' \
    --initialization-actions 'gs://goog-dataproc-initialization-actions-southamerica-east1/connectors/connectors.sh' \
    --metadata spark-bigquery-connector-version=0.42.2 \
    --project PROJECT_ID

Here are the last two commands I used to submit the job:

#1
gcloud dataproc jobs submit pyspark \
    --cluster NAME-CLUSTER \
    gs://BUCKET/initial_load.py \
    --region southamerica-east1 \
    --files=gs://BUCKET/config.json \
    --jars=gs://BUCKET/jars/spark-3.5-bigquery-0.42.2.jar \
    --project PROJECT_ID 
    
#2
gcloud dataproc jobs submit pyspark \
    --cluster NAME-CLUSTER \
    gs://BUCKET/initial_load.py \
    --region southamerica-east1 \
    --files=gs://BUCKET/config.json \
    --jars=gs://BUCKET/jars/spark-bigquery-with-dependencies_2.13-0.42.2.jar \
    --project PROJECT_ID 

Both returned the following error:

25/09/15 19:27:54 INFO GoogleHadoopOutputStream: hflush(): No-op due to rate limit (RateLimiter[stableRate=0.2qps]): readers will not yet see flushed data for gs://dataproc-temp-sa-east1-454845955091-uyoz8e6n/27d1a5a7-6e7e-4247-83e2-5a59bc27a244/spark-job-history/application_1757961849781_0003.inprogress [CONTEXT ratelimit_period="1 MINUTES" ]
Iniciando a leitura dos dados do PostgreSQL…
Traceback (most recent call last):
  File "/tmp/with-dependencies/initial_load.py", line 74, in <module>
    ).load()
      ^^^^^^
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 314, in load
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 179, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o79.load.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: com.google.cloud.spark.bigquery.BigQueryRelationProvider not a subtype

I created and recreated the cluster with newer BigQuery connector versions and changed the job submission commands to see if I would get a different result, but neither attempt worked.

I’m not sure if it’s an error in my code or the cluster configuration.
Could someone please help me?

Thank you.

Hi @rrotter,

The error suggests that the BigQuery connector is being loaded twice. According to the Dataproc documentation, the Spark BigQuery connector is already pre-installed on Dataproc image versions 2.1 and later, and the version you request is added automatically when you set this flag:
--metadata spark-bigquery-connector-version=0.42.X

Try removing the JARs that you are loading manually (--jars=gs://BUCKET/jars/spark-3.5-bigquery-0.42.2.jar), since the BigQuery connector you need is already defined in your cluster configuration. This should help resolve the error you are encountering.
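
As a rough sketch of that suggestion, the first job submission could look like this once the --jars flag is dropped. Every other flag is copied from the question. The function name submit_initial_load is only an illustrative wrapper (it is not part of gcloud), and NAME-CLUSTER, BUCKET, and PROJECT_ID are still placeholders to replace.

```shell
# Hypothetical wrapper around the job submission from the question.
# The only change is that no --jars flag is passed, so the cluster's
# own BigQuery connector is the single copy on the classpath.
submit_initial_load() {
    gcloud dataproc jobs submit pyspark \
        --cluster NAME-CLUSTER \
        gs://BUCKET/initial_load.py \
        --region southamerica-east1 \
        --files=gs://BUCKET/config.json \
        --project PROJECT_ID
}
```

After replacing the placeholders, run submit_initial_load in the same shell session to submit the job.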

Thank you very much, @mpinlac. :smiley: :clap: