I’m new to Dataproc and am having trouble running a PySpark job that reads from a PostgreSQL database (hosted on a Compute Engine VM) and writes the data to BigQuery.
I created a cluster with the following configuration:
gcloud dataproc clusters create NAME-CLUSTER \
--enable-component-gateway \
--bucket STAGING-BUCKET \
--region southamerica-east1 \
--subnet default \
--public-ip-address \
--master-machine-type e2-standard-2 \
--master-boot-disk-size 100 \
--num-workers 2 \
--worker-machine-type e2-standard-2 \
--worker-boot-disk-size 200 \
--image-version 2.2-ubuntu22 \
--properties dataproc:conda.packages=google-cloud-secret-manager==2.24.0,spark:spark.jars=https://jdbc.postgresql.org/download/postgresql-42.7.7.jar \
--scopes 'https://www.googleapis.com/auth/cloud-platform' \
--initialization-actions 'gs://goog-dataproc-initialization-actions-southamerica-east1/connectors/connectors.sh' \
--metadata spark-bigquery-connector-version=0.42.2 \
--project PROJECT_ID
These are the last two commands I used to submit the job:
#1
gcloud dataproc jobs submit pyspark \
--cluster NAME-CLUSTER \
gs://BUCKET/initial_load.py \
--region southamerica-east1 \
--files=gs://BUCKET/config.json \
--jars=gs://BUCKET/jars/spark-3.5-bigquery-0.42.2.jar \
--project PROJECT_ID
#2
gcloud dataproc jobs submit pyspark \
--cluster NAME-CLUSTER \
gs://BUCKET/initial_load.py \
--region southamerica-east1 \
--files=gs://BUCKET/config.json \
--jars=gs://BUCKET/jars/spark-bigquery-with-dependencies_2.13-0.42.2.jar \
--project PROJECT_ID
Both returned the following error:
25/09/15 19:27:54 INFO GoogleHadoopOutputStream: hflush(): No-op due to rate limit (RateLimiter[stableRate=0.2qps]): readers will not yet see flushed data for gs://dataproc-temp-sa-east1-454845955091-uyoz8e6n/27d1a5a7-6e7e-4247-83e2-5a59bc27a244/spark-job-history/application_1757961849781_0003.inprogress [CONTEXT ratelimit_period="1 MINUTES" ]
Iniciando a leitura dos dados do PostgreSQL…
Traceback (most recent call last):
  File "/tmp/with-dependencies/initial_load.py", line 74, in <module>
    ).load()
      ^^^^^^
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 314, in load
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 179, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o79.load.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: com.google.cloud.spark.bigquery.BigQueryRelationProvider not a subtype
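For context, the script itself is not included here. The traceback only shows that line 74 closes a chained reader call with `).load()`, right after the "Iniciando a leitura dos dados do PostgreSQL…" message is printed. A minimal sketch of what a JDBC read of that shape usually looks like follows. The host, database, table, and credential values, and the helper names `build_jdbc_options` and `read_postgres`, are hypothetical and are not taken from `initial_load.py`.

```python
def build_jdbc_options(host, port, database, table, user, password):
    """Build the option map for Spark's built-in "jdbc" data source."""
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        # Driver class shipped in the PostgreSQL JDBC jar.
        "driver": "org.postgresql.Driver",
    }


def read_postgres(spark, options):
    """Read one table through JDBC; `spark` is an existing SparkSession."""
    return spark.read.format("jdbc").options(**options).load()


# Placeholder values only, for illustration (hypothetical, not from the job).
example_options = build_jdbc_options(
    host="10.0.0.5", port=5432, database="mydb",
    table="public.my_table", user="reader", password="change-me",
)
```

Because the exception names `org.apache.spark.sql.sources.DataSourceRegister`, it appears to be raised while Spark looks up the registered data sources for `load()`, which would explain why a BigQuery class shows up during the PostgreSQL read rather than during the BigQuery write.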
I have recreated the cluster with updated BigQuery connector versions and varied the job submission commands to see whether the result changed, but neither approach worked.
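Since the two submissions differ only in the `--jars` value, and the cluster also installs a connector through the `connectors.sh` initialization action, one thing that can be checked is how many BigQuery connector jars are visible to the job and which Scala suffix each one carries. The sketch below is a plain-Python helper for that check. The function names and the filename patterns are assumptions inferred from the two jar names above; they are not part of any library.

```python
import re

# Filename patterns inferred from the two jars passed with --jars above
# (an assumption, not an official naming scheme).
BIGQUERY_JAR = re.compile(r"^spark-(?:[\d.]+-)?bigquery.*\.jar$")
SCALA_SUFFIX = re.compile(r"_(2\.\d+)-")


def basename(path):
    """Return the last path component of a local path or URI."""
    return path.rsplit("/", 1)[-1]


def bigquery_connector_jars(jar_paths):
    """Keep only the paths whose filename looks like a BigQuery connector jar."""
    return [p for p in jar_paths if BIGQUERY_JAR.match(basename(p))]


def scala_suffix(jar_path):
    """Return the Scala binary version in the filename (e.g. "2.13"), or None."""
    match = SCALA_SUFFIX.search(basename(jar_path))
    return match.group(1) if match else None


# Jar locations copied from the cluster and job commands above.
jars = [
    "https://jdbc.postgresql.org/download/postgresql-42.7.7.jar",
    "gs://BUCKET/jars/spark-3.5-bigquery-0.42.2.jar",
    "gs://BUCKET/jars/spark-bigquery-with-dependencies_2.13-0.42.2.jar",
]
for jar in bigquery_connector_jars(jars):
    print(basename(jar), "scala:", scala_suffix(jar))
```

Inside the job, the list could come from `spark.sparkContext.getConf().get("spark.jars", "")` split on commas, plus a listing of the cluster's Spark jars directory (assumed here to be `/usr/lib/spark/jars`, based on the `/usr/lib/spark` paths in the traceback). More than one match, or a Scala suffix that differs from the cluster's Spark build, would be worth ruling out.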
I’m not sure if it’s an error in my code or the cluster configuration.
Could someone please help me?
Thank you.