Google Cloud Platform - Vertex AI - Workbench JupyterLab - Spark/Hadoop - JAVA_HOME is not set error

Hi All,

I am trying to create a SparkSession in Vertex AI's Workbench JupyterLab, but I receive the error below. Locally, my JAVA_HOME and PATH environment variables are already set, and the same code works when I run Jupyter on my own machine. Only on Vertex AI's Workbench JupyterLab do I get this error.

Code:

from pyspark.sql import SparkSession
spark = SparkSession.builder \
  .appName('Jupyter BigQuery Storage') \
  .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar') \
  .getOrCreate()

Full Error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_3404/1949393828.py in <module>
      9 spark = SparkSession.builder \
     10   .appName('Jupyter BigQuery Storage')\
---> 11   .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar') \
     12   .getOrCreate()
     13 

/opt/conda/lib/python3.7/site-packages/pyspark/sql/session.py in getOrCreate(self)
    226                             sparkConf.set(key, value)
    227                         # This SparkContext may be an existing one.
--> 228                         sc = SparkContext.getOrCreate(sparkConf)
    229                     # Do not update `SparkConf` for existing `SparkContext`, as it's shared
    230                     # by all sessions.

/opt/conda/lib/python3.7/site-packages/pyspark/context.py in getOrCreate(cls, conf)
    390         with SparkContext._lock:
    391             if SparkContext._active_spark_context is None:
--> 392                 SparkContext(conf=conf or SparkConf())
    393             return SparkContext._active_spark_context
    394 

/opt/conda/lib/python3.7/site-packages/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
    142                 " is not allowed as it is a security risk.")
    143 
--> 144         SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
    145         try:
    146             self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,

/opt/conda/lib/python3.7/site-packages/pyspark/context.py in _ensure_initialized(cls, instance, gateway, conf)
    337         with SparkContext._lock:
    338             if not SparkContext._gateway:
--> 339                 SparkContext._gateway = gateway or launch_gateway(conf)
    340                 SparkContext._jvm = SparkContext._gateway.jvm
    341 

/opt/conda/lib/python3.7/site-packages/pyspark/java_gateway.py in launch_gateway(conf, popen_kwargs)
    106 
    107             if not os.path.isfile(conn_info_file):
--> 108                 raise RuntimeError("Java gateway process exited before sending its port number")
    109 
    110             with open(conn_info_file, "rb") as info:

RuntimeError: Java gateway process exited before sending its port number

Do let me know if you have any advice, thank you!

You need Java installed on your Mac, Linux, or Windows machine. Without a Java installation, without the JAVA_HOME environment variable pointing to the Java installation path, or without PYSPARK_SUBMIT_ARGS set, you will get this exception.
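To confirm what the notebook kernel actually sees, you can run a quick check from a cell. This is just a diagnostic sketch using the Python standard library; it assumes nothing about Vertex AI itself:

import os
import shutil
import subprocess

# Where (if anywhere) the kernel finds the java launcher on its PATH.
print("java on PATH:", shutil.which("java"))

# JAVA_HOME as the kernel process sees it.
print("JAVA_HOME:", os.environ.get("JAVA_HOME"))

# If java is found, print its version (java writes this to stderr).
if shutil.which("java"):
    subprocess.run(["java", "-version"], check=False)

If shutil.which("java") returns None, the kernel cannot see a Java runtime, which is exactly the condition that makes the gateway process fail to start.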

You need to set PYSPARK_SUBMIT_ARGS with a master; this resolves the exception "Java gateway process exited before sending the driver its port number".

export PYSPARK_SUBMIT_ARGS="--master local[3] pyspark-shell"

Open ~/.bashrc with vi, add the line above, and reload the file with source ~/.bashrc.
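Note that editing ~/.bashrc only affects new shell sessions, not a kernel that is already running. For an open notebook you can set the variable from Python instead, as long as you do it before the first SparkSession is created. A minimal sketch, reusing your builder chain and the local[3] master from above:

import os

# Must be set before the first SparkContext/SparkSession is created.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--master local[3] pyspark-shell'

from pyspark.sql import SparkSession
spark = SparkSession.builder \
  .appName('Jupyter BigQuery Storage') \
  .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar') \
  .getOrCreate()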

If the issue is still not resolved, check your Java installation and your JAVA_HOME environment variable.
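If Java is installed but JAVA_HOME is empty inside the kernel, you can also set it from Python before building the session. The path below is only an assumption (a common OpenJDK location on Debian-based images); check the real location on your instance, for example with readlink -f $(which java):

import os

# ASSUMPTION: a typical OpenJDK 11 location on Debian-based images.
# Replace with the actual JVM directory on your instance.
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-11-openjdk-amd64'
print(os.environ['JAVA_HOME'])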

You can also see the Vertex AI troubleshooting documentation [1].

[1] https://cloud.google.com/vertex-ai/docs/general/troubleshooting