I’ve been getting an internal error from Spark, a java.lang.NullPointerException, when I attempt to write from PySpark to BigQuery on Dataproc. I’m using the spark-bigquery connector and writing to an existing, partitioned table. The error only seems to occur when the DataFrame I am writing is the result of a join.
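For context, the DataFrame comes from a join along these lines. This is a simplified sketch, not my real schema (the actual result has 22 columns); the tables, columns, and join key are made-up stand-ins:

from datetime import date

# Hypothetical stand-ins for the real inputs.
left = spark.createDataFrame(
    [(1, date(2024, 1, 1), date(2024, 1, 31))],
    ["period_id", "period_start", "period_end"],
)
right = spark.createDataFrame([(1, "some_value")], ["period_id", "value"])

# The write only fails when df is the result of a join like this.
df = left.join(right, on="period_id", how="inner")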
Error Message
py4j.protocol.Py4JJavaError: An error occurred while calling o764.save.
: com.google.cloud.bigquery.connector.common.BigQueryConnectorException: unexpected issue trying to save [period_start: date, period_end: date ... 20 more fields]
...
Caused by: org.apache.spark.SparkException: [INTERNAL_ERROR] The Spark SQL phase optimization failed with an internal error. You hit a bug in Spark or the Spark plugins you use. Please, report this bug to the corresponding communities or vendors, and provide the full stack trace.
...
Caused by: java.lang.NullPointerException
at com.google.cloud.spark.bigquery.repackaged.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:903)
This is the failing call:
(
    df.write.format("bigquery")
    .partitionBy("period_start")
    .option("writeMethod", "direct")
    .option("table", bigquery_destination_table)
    .option("createDisposition", "CREATE_IF_NEEDED")
    .mode("overwrite")
    .save()
)
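For reference, falling back to the indirect write method would look like the sketch below. I have not confirmed whether it avoids the error; it stages the data through a GCS bucket (here the same temporaryGcsBucket configured further down) and loads it with a BigQuery load job instead of the Storage Write API:

(
    df.write.format("bigquery")
    .partitionBy("period_start")
    .option("writeMethod", "indirect")  # stage via GCS rather than the Storage Write API
    .option("temporaryGcsBucket", "my-bucket")  # same bucket as in the session config below
    .option("table", bigquery_destination_table)
    .option("createDisposition", "CREATE_IF_NEEDED")
    .mode("overwrite")
    .save()
)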
I initialized the Spark session with:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("my_app")
    .config(
        "spark.jars",
        ",".join(
            [
                "https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar",
                "https://storage.googleapis.com/spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.36.1.jar",
            ]
        ),
    )
    .getOrCreate()
)
spark.conf.set("temporaryGcsBucket", "my-bucket")
spark.conf.set("viewsEnabled", "true")
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "DYNAMIC")
spark.sparkContext.setLogLevel("WARN")
How do I get around this error to write the table to BigQuery?
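For what it’s worth, one workaround I am considering (untested) is cutting the query lineage so the optimizer no longer sees the join when planning the write, e.g. with a checkpoint; the checkpoint directory below is a placeholder path:

# Untested idea: materialize the join result before the write.
spark.sparkContext.setCheckpointDir("gs://my-bucket/spark-checkpoints")
df = df.checkpoint(eager=True)  # then run the same df.write...save() call as above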
The full stack trace includes a double dollar sign, which is not allowed here, so I am not able to post it.