Got it - thanks for the clarification! (And thank you for continuing to respond to me!) In regard to your reply -
- PySpark Environment: Ensure that the PySpark environment is using the same Python interpreter where the google-cloud-secret-manager library was installed. If PySpark is using a different interpreter, it won’t have access to the library. You can set the PYSPARK_PYTHON environment variable to point to the correct interpreter.
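(Just to make sure I follow - I'm assuming setting that would look something like one of the below, with the interpreter path being whatever which python reports? Treat these as my guesses rather than anything I've confirmed:

export PYSPARK_PYTHON=/opt/conda/default/bin/python

or, per job, via the equivalent Spark conf:

spark-submit --conf spark.pyspark.python=/opt/conda/default/bin/python my_job.py

where my_job.py is just a placeholder for the actual job script.)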
Well, as dumb as this sounds, I'm actually not really sure where the google-cloud-secret-manager library ended up getting installed.
You’ll see what I mean in the next response -
- Python Interpreter: When you SSH into the VM, you can check which Python interpreter is being used by default. Run which python and which python3 to see the paths to the Python interpreters. Then, use pip list or pip3 list to see if the google-cloud-secret-manager is listed there.
Oddly enough, though I hadn't specified conda on my Dataproc cluster, it seems like conda is automatically on the path:
annabel:~$ which python
/opt/conda/default/bin/python
annabel:~$ which python3
/opt/conda/default/bin/python3
Looking at the system paths in general, they mostly point to conda as well -
annabel:~$ python -m site
sys.path = [
'/home/annabel',
'/opt/conda/default/lib/python310.zip',
'/opt/conda/default/lib/python3.10',
'/opt/conda/default/lib/python3.10/lib-dynload',
'/opt/conda/default/lib/python3.10/site-packages',
'/usr/lib/spark/python',
]
USER_BASE: '/home/annabel/.local' (doesn't exist)
USER_SITE: '/home/annabel/.local/lib/python3.10/site-packages' (doesn't exist)
ENABLE_USER_SITE: True
And when using pip list and pip3 list, I don't see google-cloud-secret-manager in either list.
annabel:~$ pip list
…
google-cloud-pubsub 2.13.12
google-cloud-redis 2.9.3
google-cloud-spanner 3.19.0
google-cloud-speech 2.15.1
google-cloud-storage 2.5.0
google-cloud-texttospeech 2.12.3
google-cloud-translate 3.8.4
google-cloud-vision 3.1.4
…
annabel:~$ pip3 list
…
google-cloud-pubsub 2.13.12
google-cloud-redis 2.9.3
google-cloud-spanner 3.19.0
google-cloud-speech 2.15.1
google-cloud-storage 2.5.0
google-cloud-texttospeech 2.12.3
google-cloud-translate 3.8.4
google-cloud-vision 3.1.4
…
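One extra check I plan to run, in case the pip on my PATH is tied to a different interpreter than python itself - my understanding is that these two commands report, and query, the interpreter a given pip actually belongs to:

pip --version
python -m pip show google-cloud-secret-manager

If python -m pip show finds nothing either, then the library really isn't in the conda environment above.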
- Dataproc Versions: Ensure that the version of Dataproc you are using is compatible with the google-cloud-secret-manager library. Although the library is installed, there might be compatibility issues with certain versions of Dataproc.
I guess, given the results above, the library doesn't seem to have landed in this environment at all. I double-checked the initialization action, and yes - the logs do indicate that it was successfully installed:
Downloading google_cloud_secret_manager-2.16.4-py2.py3-none-any.whl (116 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 116.6/116.6 kB 5.9 MB/s eta 0:00:00
Installing collected packages: google-cloud-secret-manager
Successfully installed google-cloud-secret-manager-2.16.4
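(One theory on my end: initialization actions run as root, so the plain pip install in my script may have resolved to a different pip than the conda one that python above uses. If that's the case, pointing the install at that interpreter explicitly in the init script, e.g.

/opt/conda/default/bin/pip install google-cloud-secret-manager

using the path which python reported above, might land the package in the right site-packages. Just a guess, though.)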
- Use Initialization Actions to Set Environment Variables: You can use initialization actions not only to install packages but also to set environment variables or perform other setup tasks that might be necessary for your PySpark job to run correctly.
Given that I'm using a secret API key, I think I should be using Secret Manager for this, right? In case I just can't get Dataproc running with this library, though, I could also look into this as an alternative option.
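For what it's worth, once the import works, the usage I have in mind is roughly the below, run on the cluster - the project and secret names are placeholders, and my understanding is this would pick up the VM service account's credentials automatically, assuming it has the Secret Manager accessor role:

python - <<'EOF'
from google.cloud import secretmanager

# Placeholder resource name - swap in the real project and secret.
name = "projects/my-project/secrets/my-api-key/versions/latest"

client = secretmanager.SecretManagerServiceClient()
response = client.access_secret_version(name=name)

# Avoid printing real secrets; this just confirms retrieval worked.
print(len(response.payload.data.decode("UTF-8")), "characters retrieved")
EOF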