I have run a few Dataflow jobs (MongoDB to BigQuery), and they completed successfully with the desired data loaded into BigQuery.
However, when I try to transform them into recurring jobs (aka GCP Dataflow Pipelines), the job does not get triggered and here is the error log message I get:
The error message you're seeing indicates that the Cloud Scheduler job is trying to trigger a Dataflow pipeline, but the pipeline is not found at the specified URL. To troubleshoot this issue, you can follow these steps:
Check that the pipeline exists. You can do this by going to the Dataflow page in the Google Cloud console and searching for the pipeline name.
Check that the pipeline is in the correct location. The pipeline must be in the same project and location as the scheduler job.
Check that the pipeline is enabled. You can enable the pipeline by going to the Dataflow page in the Google Cloud console and clicking the "Enable" button.
Check that the pipeline is configured correctly. You can view the pipeline configuration by going to the Dataflow page in the Google Cloud console and clicking the "View" button.
Check pipeline permissions. Ensure that the Cloud Scheduler service account has the necessary permissions to trigger Dataflow pipelines. The service account should have the roles/dataflow.developer role or a custom role with equivalent permissions.
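The permission check in the last step can be sketched in Python. This is only an illustration: it assumes you have exported the project's IAM policy as JSON (for example with gcloud projects get-iam-policy PROJECT_ID --format=json), and the service account email and policy contents below are made-up placeholders, not values from your project.

```python
import json


def roles_for_member(policy, member):
    """Return the roles that `policy` binds directly to `member`."""
    return {
        binding["role"]
        for binding in policy.get("bindings", [])
        if member in binding.get("members", [])
    }


# Hypothetical policy export; the project and account names are placeholders.
EXAMPLE_POLICY = json.loads("""
{
  "bindings": [
    {"role": "roles/dataflow.developer",
     "members": ["serviceAccount:scheduler-sa@my-project.iam.gserviceaccount.com"]},
    {"role": "roles/viewer",
     "members": ["user:someone@example.com"]}
  ]
}
""")

SCHEDULER_SA = "serviceAccount:scheduler-sa@my-project.iam.gserviceaccount.com"

granted = roles_for_member(EXAMPLE_POLICY, SCHEDULER_SA)
print("has dataflow.developer:", "roles/dataflow.developer" in granted)
```

Note that this only looks at roles bound directly to the account in the exported policy. Roles inherited through a group, or granted at the folder or organization level, would not show up here.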
When you set up a recurring Dataflow job from the Cloud Console, you are actually creating a Cloud Scheduler job that triggers a Dataflow pipeline. The Cloud Scheduler job needs to be configured with the service account that will be used to run the Dataflow pipeline.
If you did not configure a service account when you set up the Dataflow job, then the Cloud Scheduler job will use the default service account. This service account must have the following permissions:
If you have created a new service account with the correct permissions and assigned it to the running pipeline, but you are still having problems, then it is possible that there is an issue with the scheduler job itself.
Here are some additional troubleshooting tips:
Check the logs for the scheduler job to see if there are any other errors.
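As a rough aid for that log check, the sketch below builds a Cloud Logging filter that narrows the view to errors from a single scheduler job. The job ID and location are hypothetical placeholders, and it assumes the job's logs are recorded under the cloud_scheduler_job monitored resource type.

```python
def scheduler_error_filter(job_id, location, min_severity="ERROR"):
    """Build a Cloud Logging filter for one Cloud Scheduler job."""
    return " AND ".join([
        'resource.type="cloud_scheduler_job"',
        f'resource.labels.job_id="{job_id}"',
        f'resource.labels.location="{location}"',
        f"severity>={min_severity}",
    ])


# Placeholder job ID and location; substitute your own.
print(scheduler_error_filter("my-pipeline-trigger", "us-central1"))
```

The resulting string can be pasted into the Logs Explorer query box or passed to gcloud logging read.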
It definitely seems like the issue is specifically with triggering the Dataflow pipeline through Cloud Scheduler. Here are a few more things you could check:
Cloud Scheduler Configuration: Make sure that the Cloud Scheduler job is correctly configured to trigger the pipeline. The URL should be the REST API endpoint for running the pipeline, and the HTTP method should be POST. Also, ensure that the body of the POST request is correctly formatted.
Cloud Scheduler Timezone: Check the timezone setting for your Cloud Scheduler job. If the timezone is not set correctly, it might cause the job to trigger at unexpected times or not trigger at all.
Cloud Scheduler Frequency: Check the frequency setting for your Cloud Scheduler job. If the frequency is set too high (e.g., every minute), it might cause the job to fail due to rate limiting.
Dataflow Pipeline State: Check the state of your Dataflow pipeline in the Google Cloud Console. If the pipeline is in a terminal state (e.g., FAILED or CANCELLED), it will not be triggered by Cloud Scheduler.
Dataflow Pipeline Parameters: If your pipeline requires any parameters to run, make sure that these parameters are correctly included in the Cloud Scheduler job.
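Several of the checks above can be sketched as a small Python script run against the scheduler job's JSON description (for example the output of gcloud scheduler jobs describe JOB --location=LOCATION --format=json). This is a minimal sketch under assumptions: it assumes the job is meant to call the Data Pipelines API run method for the pipeline, and the project, location, and pipeline names are hypothetical placeholders.

```python
def expected_run_uri(project, location, pipeline):
    """URI of the Data Pipelines API `run` method (assumed to be the job's target)."""
    return (
        "https://datapipelines.googleapis.com/v1/"
        f"projects/{project}/locations/{location}/pipelines/{pipeline}:run"
    )


def check_scheduler_job(job, project, location, pipeline):
    """Return a list of human-readable problems found in a scheduler job dict."""
    problems = []
    target = job.get("httpTarget", {})
    if target.get("httpMethod") != "POST":
        problems.append("HTTP method is not POST")
    if target.get("uri") != expected_run_uri(project, location, pipeline):
        problems.append("URI does not match the expected pipeline run endpoint")
    if not job.get("timeZone"):
        problems.append("no time zone is set")
    if job.get("state") != "ENABLED":
        problems.append("job is not in the ENABLED state")
    return problems


# Hypothetical job description; all names below are placeholders.
EXAMPLE_JOB = {
    "schedule": "0 * * * *",
    "timeZone": "Etc/UTC",
    "state": "ENABLED",
    "httpTarget": {
        "httpMethod": "POST",
        "uri": expected_run_uri("my-project", "us-central1", "mongo-to-bq"),
    },
}

print(check_scheduler_job(EXAMPLE_JOB, "my-project", "us-central1", "mongo-to-bq"))
```

An empty list means none of these particular problems were found. It does not prove the job is healthy, since the request body, the pipeline parameters, and the permissions still need to be checked separately.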