Hi,
I’m working on a Dataflow batch job that fetches data from SQL Server using the JayDeBeApi package, which requires a JDBC connection. To make this work, I need to install Java and a few other dependencies on the worker nodes. I went with a Dataflow Flex Template: I put all the required installation steps in a Dockerfile (shown below), built the image, and stored the Flex Template in Artifact Registry.
FROM gcr.io/dataflow-templates-base/python39-template-launcher-base
ARG WORKDIR=/dataflow/template
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}
COPY main.py .
COPY setup.py .
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"
# Install apache-beam and other dependencies to launch the pipeline
RUN pip install apache-beam[gcp]==2.58.0
# Install required dependencies for Java setup
RUN apt-get update && apt-get install -y wget tar
# Download the JRE and the JDBC driver jar from GCS (a sketch of this script is shown after the Dockerfile)
COPY download_jars.sh /tmp/download_jars.sh
RUN chmod +x /tmp/download_jars.sh && /tmp/download_jars.sh
# Set JAVA_HOME and update alternatives for Java
# ENV JAVA_HOME=/usr/lib/jvm/jdk
ENV JAVA_HOME=/usr/lib/jvm/jre1.8.0_351
ENV PATH=$JAVA_HOME/bin:$PATH
RUN update-alternatives --install "/usr/bin/java" "java" "/usr/lib/jvm/jre1.8.0_351/bin/java" 1
RUN ln -s $JAVA_HOME/lib/amd64/server/libjvm.so /usr/lib/libjvm.so
# Verify the Java installation
RUN java -version
RUN echo "export JAVA_HOME=/usr/lib/jvm/jre1.8.0_351" >> /root/.bashrc
ENTRYPOINT ["/opt/google/dataflow/python_template_launcher"]
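For reference, download_jars.sh is roughly the following (a minimal sketch; the bucket and file names are placeholders for my actual GCS locations):

#!/bin/bash
set -euo pipefail
mkdir -p /usr/lib/jvm /opt/jdbc
# Fetch the JRE tarball and the SQL Server JDBC driver (placeholder paths)
wget -q -O /tmp/jre.tar.gz https://storage.googleapis.com/MY_BUCKET/jre1.8.0_351.tar.gz
wget -q -O /opt/jdbc/mssql-jdbc.jar https://storage.googleapis.com/MY_BUCKET/mssql-jdbc.jar
# Unpack the JRE so that /usr/lib/jvm/jre1.8.0_351 (JAVA_HOME) exists
tar -xzf /tmp/jre.tar.gz -C /usr/lib/jvm
rm /tmp/jre.tar.gz

I built and stored the template roughly like this (project, repo, and bucket names are placeholders):

docker build -t REGION-docker.pkg.dev/MY_PROJECT/MY_REPO/jdbc-batch:latest .
docker push REGION-docker.pkg.dev/MY_PROJECT/MY_REPO/jdbc-batch:latest
gcloud dataflow flex-template build gs://MY_BUCKET/templates/jdbc-batch.json \
    --image "REGION-docker.pkg.dev/MY_PROJECT/MY_REPO/jdbc-batch:latest" \
    --sdk-language "PYTHON"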
I’ve tested the JDBC connection separately, and it’s working fine.
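Here is roughly how I verified the connection (a sketch; the host, credentials, and driver jar path are placeholders):

import jaydebeapi

# Standalone connectivity check; connection details are placeholders
conn = jaydebeapi.connect(
    "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "jdbc:sqlserver://MY_HOST:1433;databaseName=MY_DB",
    ["MY_USER", "MY_PASSWORD"],
    "/opt/jdbc/mssql-jdbc.jar",
)
curs = conn.cursor()
curs.execute("SELECT 1")
print(curs.fetchall())
curs.close()
conn.close()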
However, when I run the Dataflow job from this Flex Template, Java and the other dependencies don’t seem to be accessible inside the ParDo function, and I get an error when making the JDBC connection.
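The ParDo in question looks roughly like this (a sketch of my DoFn; the class name and connection details are placeholders):

import apache_beam as beam
import jaydebeapi

class ReadViaJdbcFn(beam.DoFn):
    def setup(self):
        # This is where the job fails on the workers: jaydebeapi needs a JVM
        # (libjvm.so / JAVA_HOME) available on the worker to open the connection.
        self.conn = jaydebeapi.connect(
            "com.microsoft.sqlserver.jdbc.SQLServerDriver",
            "jdbc:sqlserver://MY_HOST:1433;databaseName=MY_DB",
            ["MY_USER", "MY_PASSWORD"],
            "/opt/jdbc/mssql-jdbc.jar",
        )

    def process(self, query):
        # Run the query and emit rows downstream
        curs = self.conn.cursor()
        curs.execute(query)
        for row in curs.fetchall():
            yield row
        curs.close()

    def teardown(self):
        self.conn.close()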
What could be causing this issue? Any insights would be greatly appreciated.
