Converting a successful GCP Dataflow job into a Pipeline?

Dear GCP community,

I have run a few Dataflow jobs (MongoDB to BigQuery), and they completed successfully with the desired data loaded into BigQuery.

However, when I try to turn them into recurring jobs (i.e., GCP Dataflow Pipelines), the job does not get triggered. Here is the error log message I get:

{"@type":"type.googleapis.com/google.cloud.scheduler.logging.AttemptFinished",
"jobName":"projects/<project-name>/locations/us-central1/jobs/datapipelines-<pipeline-name>", 
"status":"NOT_FOUND", 
"targetType":"HTTP",
"url":"https://datapipelines.googleapis.com/v1/projects/<project-name>/locations/us-central1/pipelines/<pipeline-name>:run"}

Did anyone else encounter the same issue? Any recommendations?

Thank you!


The error message you’re seeing indicates that the Cloud Scheduler job is trying to trigger a Dataflow pipeline, but the pipeline is not found at the specified URL. To troubleshoot this issue, you can follow these steps:

  1. Check that the pipeline exists. You can do this by going to the Dataflow page in the Google Cloud console and searching for the pipeline name.
  2. Check that the pipeline is in the correct location. The pipeline must be in the same project and location as the scheduler job.
  3. Check that the pipeline is enabled. You can enable the pipeline by going to the Dataflow page in the Google Cloud console and clicking the “Enable” button.
  4. Check that the pipeline is configured correctly. You can view the pipeline configuration by going to the Dataflow page in the Google Cloud console and clicking the “View” button.
  5. Check pipeline permissions. Ensure that the Cloud Scheduler service account has the necessary permissions to trigger Dataflow pipelines. The service account should have the roles/dataflow.developer role or a custom role with equivalent permissions.
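The first two checks can also be done from the command line against the Data Pipelines REST API, which is the same endpoint Cloud Scheduler is calling in your error log. A minimal sketch, using hypothetical project and location values (substitute your own); the authenticated call is left commented since it needs gcloud credentials:

```shell
# Hypothetical identifiers -- substitute your own.
PROJECT="my-project"
LOCATION="us-central1"

# List endpoint of the Data Pipelines API. A pipeline that does not
# appear in this listing (for this exact project and location) will
# produce the NOT_FOUND error when Cloud Scheduler tries to trigger it.
LIST_URL="https://datapipelines.googleapis.com/v1/projects/${PROJECT}/locations/${LOCATION}/pipelines"
echo "${LIST_URL}"

# Actual call (requires gcloud authentication, so it is left commented):
# curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" "${LIST_URL}"
```

If your pipeline is missing from the response but visible in the console, double-check that the console is showing the same project and region you expect.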

Hi @ms ,

Thank you so much for your help. I truly appreciate it.

  1. The pipeline exists

  2. The pipeline is in the correct location

  3. The pipeline is enabled

  4. Properly configured from a successful job

  5. I did not configure any service account. I am doing it directly from the Cloud Console. Could you please elaborate on this? This may be the problem.

Thank you again!


When you set up a recurring Dataflow job from the Cloud Console, you are actually creating a Cloud Scheduler job that triggers a Dataflow pipeline. The Cloud Scheduler job needs to be configured with the service account that will be used to run the Dataflow pipeline.

If you did not configure a service account when you set up the Dataflow job, then the Cloud Scheduler job will use the default service account. This service account must have the following permissions:

  • roles/dataflow.admin
  • roles/dataflow.worker
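When no service account is configured explicitly, the default is the Compute Engine default service account. A sketch of granting it the two roles above, with hypothetical project ID and number (the IAM commands are commented out because they require gcloud authentication and modify your project):

```shell
# Hypothetical values -- substitute your project ID and project number.
PROJECT_ID="my-project"
PROJECT_NUMBER="123456789012"

# The Compute Engine default service account, used when no service
# account is configured explicitly.
SA="${PROJECT_NUMBER}-compute@developer.gserviceaccount.com"
echo "serviceAccount:${SA}"

# Grant the roles listed above (requires gcloud auth; left commented):
# gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
#   --member="serviceAccount:${SA}" --role="roles/dataflow.admin"
# gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
#   --member="serviceAccount:${SA}" --role="roles/dataflow.worker"
```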

Hi @ms4446 ,

I have created a new service account with these roles and assigned it to the running pipeline. However, the same issue is still happening.

Are there any other plausible reasons behind this bug?

Thank you again for your help!


If you have created a new service account with the correct permissions and assigned it to the running pipeline, but you are still having problems, then it is possible that there is an issue with the scheduler job itself.

Here are some additional troubleshooting tips:

  • Check the logs for the scheduler job to see if there are any other errors.
  • Try restarting the scheduler job.
  • Try creating a new scheduler job.
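For the first tip, Cloud Logging often has more detail than the Scheduler page shows. A sketch of the log filter, with a hypothetical job name; the authenticated queries are left commented:

```shell
# Hypothetical scheduler job name -- substitute your own.
JOB="datapipelines-my-pipeline"

# Cloud Logging filter that surfaces every attempt record for this
# scheduler job, not just what the console page displays.
FILTER="resource.type=\"cloud_scheduler_job\" AND resource.labels.job_id=\"${JOB}\""
echo "${FILTER}"

# Actual queries (require gcloud auth; left commented):
# gcloud logging read "${FILTER}" --limit=20
# gcloud scheduler jobs run "${JOB}" --location=us-central1   # force an attempt now
```

Forcing a run with `gcloud scheduler jobs run` lets you reproduce the failure on demand instead of waiting for the next scheduled trigger.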

Hi @ms4446 ,

Thank you for your follow-up. There are no other errors in the scheduler job's logs, and I created a new scheduler job but still got the same problem.

What would you recommend in this case?

Thank you!


Hi @ms4446 ,

Just a quick follow-up on my previous message. The Dataflow job runs successfully on its own, but when I import it as a Pipeline, it does not get triggered.

Cloud Scheduler does not show any other logs than ‘AttemptStarted’ and ‘AttemptFinished’ (with the ‘NOT_FOUND’ error) as I previously shared.

I appreciate your help and I would love to know if there is a way to report/resolve this bug with GCP teams.

Thanks!


It definitely seems like the issue is specifically with triggering the Dataflow pipeline through Cloud Scheduler. Here are a few more things you could check:

  1. Cloud Scheduler Configuration: Make sure that the Cloud Scheduler job is correctly configured to trigger the pipeline. The URL should be the REST API endpoint for running the pipeline, and the HTTP method should be POST. Also, ensure that the body of the POST request is correctly formatted.

  2. Cloud Scheduler Timezone: Check the timezone setting for your Cloud Scheduler job. If the timezone is not set correctly, it might cause the job to trigger at unexpected times or not trigger at all.

  3. Cloud Scheduler Frequency: Check the frequency setting for your Cloud Scheduler job. If the job is scheduled too frequently (e.g., every minute), it might fail due to rate limiting.

  4. Dataflow Pipeline State: Check the state of your Dataflow pipeline in the Google Cloud Console. If the pipeline is in a terminal state (e.g., FAILED or CANCELLED), it will not be triggered by Cloud Scheduler.

  5. Dataflow Pipeline Parameters: If your pipeline requires any parameters to run, make sure that these parameters are correctly included in the Cloud Scheduler job.
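To verify point 1, you can take Cloud Scheduler out of the equation and call the same :run endpoint yourself. If a direct POST also returns NOT_FOUND, the problem is in the pipeline resource name rather than in the scheduler job. A sketch with hypothetical identifiers; the authenticated call is left commented:

```shell
# Hypothetical identifiers -- substitute your own.
PROJECT="my-project"
LOCATION="us-central1"
PIPELINE="my-pipeline"

# This is exactly the URL shown in the error log above.
RUN_URL="https://datapipelines.googleapis.com/v1/projects/${PROJECT}/locations/${LOCATION}/pipelines/${PIPELINE}:run"
echo "${RUN_URL}"

# Manual trigger (requires gcloud auth; left commented). An empty JSON
# body is sufficient for pipelines.run.
# curl -s -X POST \
#   -H "Authorization: Bearer $(gcloud auth print-access-token)" \
#   -H "Content-Type: application/json" -d '{}' \
#   "${RUN_URL}"
```

Comparing the response of this direct call with the scheduler's log entry should tell you whether the NOT_FOUND comes from a mismatched pipeline name/location or from the scheduler configuration itself.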