Persistent "DAG Failed" error in Vertex AI AutoML training with correct IAM roles

I’ve been working on an ETL (Extract, Transform, Load) project to prepare a dataset for a fuel demand prediction model. The data, sourced from CSV and JSONL files, was successfully loaded into BigQuery. I then used a SQL query to clean the data: it handles negative values and missing payment methods, and unifies the inventory and transaction tables. The final table, Table_Final_DemandaFuel, is ready for machine learning.
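For reference, the cleaning rules can be re-checked on a sample of exported rows with a small script. This is only a sketch under assumptions: rows are plain dictionaries, Litros_Vendidos is the target column, and Metodo_Pago is a hypothetical name standing in for the payment method column.

```python
from typing import Any, Dict, Iterable, List, Sequence


def find_bad_rows(
    rows: Iterable[Dict[str, Any]],
    target: str = "Litros_Vendidos",
    required: Sequence[str] = ("Metodo_Pago",),  # hypothetical column name
) -> List[str]:
    """Return a description of every row that breaks the cleaning rules."""
    problems: List[str] = []
    for i, row in enumerate(rows):
        value = row.get(target)
        if value is None:
            problems.append(f"row {i}: {target} is NULL")
        elif value < 0:
            problems.append(f"row {i}: {target} is negative ({value})")
        for col in required:
            # Treat both NULL and empty strings as missing.
            if row.get(col) in (None, ""):
                problems.append(f"row {i}: {col} is missing")
    return problems


if __name__ == "__main__":
    sample = [
        {"Litros_Vendidos": 120.5, "Metodo_Pago": "card"},
        {"Litros_Vendidos": -3.0, "Metodo_Pago": None},
    ]
    for line in find_bad_rows(sample):
        print(line)
```

An empty result on a representative sample is consistent with the table being clean, although it does not rule out other issues such as unexpected column types.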

The goal is to use this clean data to train a regression model in Vertex AI AutoML Tabular to predict fuel sales (Litros_Vendidos).

However, the training job keeps failing with a recurring error:

The DAG failed because some tasks failed. The failed tasks are: [exit-handler-1].; Job (project_id = braided-circuit-457918-m3, job_id = <JOB_ID>) is failed due to the above error.; Failed to handle the job: {project_number = 847026632307, job_id = <JOB_ID>}.

The failure always occurs at the 8th step; every step before it completes successfully.
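The error text names only exit-handler-1. That name looks like a Kubeflow Pipelines exit-handler wrapper group rather than a concrete step (an assumption on my part), so the message by itself probably does not say which inner task broke. A minimal sketch for pulling the failed task names out of such a message, assuming the wording shown in this post:

```python
import re
from typing import List


def failed_tasks(message: str) -> List[str]:
    """Extract the task names listed after 'The failed tasks are: [...]'."""
    match = re.search(r"The failed tasks are: \[([^\]]*)\]", message)
    if not match:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]


def only_wrappers_failed(tasks: List[str]) -> bool:
    """True when every reported task looks like an exit-handler wrapper.

    Assumption: wrapper groups are named 'exit-handler-<n>'.
    """
    return bool(tasks) and all(t.startswith("exit-handler") for t in tasks)


if __name__ == "__main__":
    msg = (
        "The DAG failed because some tasks failed. "
        "The failed tasks are: [exit-handler-1].; Job (...) is failed."
    )
    tasks = failed_tasks(msg)
    print(tasks, only_wrappers_failed(tasks))
```

If only wrapper names are reported, the underlying error most likely has to be read from the logs of the individual task that ran at the failing step.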

Steps Taken So Far to Fix the Error:

  • Data Validation: The final BigQuery table has been validated, and all known data quality issues (negative values, NULLs) have been resolved. The data looks clean.

  • Permissions Check: I’ve added the following IAM roles to my service account (847026632307-compute@developer.gserviceaccount.com), which should cover everything the training job needs:

    • Vertex AI User

    • BigQuery Data Viewer

    • Service Account User

    • Vertex AI Administrator

  • Service Agent Verification: I’ve also confirmed that the Google-managed service agent for Vertex AI (service-847026632307@gcp-sa-aiplatform.iam.gserviceaccount.com) exists and has the correct Vertex AI Service Agent role.
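The role checks above can be reproduced from an exported project IAM policy. The sketch below assumes the policy was saved as JSON (for example from `gcloud projects get-iam-policy braided-circuit-457918-m3 --format=json`) and loaded into a dictionary; the role IDs are the standard identifiers for the role names listed above. It only looks at project-level bindings, so dataset-level grants and IAM conditions are out of scope.

```python
from typing import Any, Dict, List, Set

# Members and the role IDs each one is expected to hold.
REQUIRED_ROLES: Dict[str, Set[str]] = {
    "serviceAccount:847026632307-compute@developer.gserviceaccount.com": {
        "roles/aiplatform.user",         # Vertex AI User
        "roles/bigquery.dataViewer",     # BigQuery Data Viewer
        "roles/iam.serviceAccountUser",  # Service Account User
        "roles/aiplatform.admin",        # Vertex AI Administrator
    },
    "serviceAccount:service-847026632307@gcp-sa-aiplatform.iam.gserviceaccount.com": {
        "roles/aiplatform.serviceAgent",  # Vertex AI Service Agent
    },
}


def missing_roles(
    policy: Dict[str, Any],
    required: Dict[str, Set[str]] = REQUIRED_ROLES,
) -> Dict[str, List[str]]:
    """Return, per member, the required roles absent from the policy."""
    granted: Dict[str, Set[str]] = {}
    for binding in policy.get("bindings", []):
        for member in binding.get("members", []):
            granted.setdefault(member, set()).add(binding["role"])
    result: Dict[str, List[str]] = {}
    for member, roles in required.items():
        absent = roles - granted.get(member, set())
        if absent:
            result[member] = sorted(absent)
    return result
```

An empty dictionary means every listed role is bound at project level, which matches what I see in the console.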

Despite these steps, the error persists, which suggests either a more complex issue with the project’s configuration or a backend problem within the Vertex AI service itself.