OperationalError Cloud Composer

Hello,

I upgraded my Cloud Composer environment from composer-3-airflow-2.10.2-build.5 to composer-3-airflow-2.10.2-build.11. Since then, all of my DAGs that generate about 40 tasks fail. Tasks go from queued to running, and are sometimes even deferred, but they inevitably fail. These tasks are basic CloudRunExecuteJobOperator calls: I just trigger jobs on GCP Cloud Run Jobs from Airflow. It’s such a basic task that I don’t get why the system struggles with it. When I look at the logs, here is what I get:

airflow-worker Retrying

I already tried increasing the environment size (to Large) and increasing the amount of RAM and CPU on the other components. I feel like the system cannot handle deferring a lot of concurrent tasks.
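
For reference, the tasks are essentially a fan-out of basic operator calls like the sketch below (the DAG id, project, region and job names are placeholders, not my actual values):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.cloud_run import (
    CloudRunExecuteJobOperator,
)

# Minimal sketch of the failing pattern: ~40 independent Cloud Run job
# executions fanned out from one DAG, running in deferrable mode.
with DAG(
    dag_id="cloud_run_fanout",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    for i in range(40):
        CloudRunExecuteJobOperator(
            task_id=f"run_job_{i}",
            project_id="my-project",           # placeholder
            region="europe-west1",             # placeholder
            job_name=f"my-cloud-run-job-{i}",  # placeholder
            deferrable=True,  # tasks hand off to the triggerer while the job runs
        )
```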

Hi @x-alex ,

Thanks for your question. This forum focuses on Google Cloud’s Application Integration. Could you please confirm if the issue you’re describing occurs while using Application Integration? :blush:

Hi @x-alex ,

Welcome to Google Cloud Community!

The error you’re seeing points to a database connection problem between your Airflow workers and the Airflow metadata database (Cloud SQL in Composer). Since it started right after the environment upgrade, the likely cause is a change in the environment configuration or a resource limitation.

Here are some potential solutions that might address your issue:

  • Rollback (Temporary): If possible and if the upgrade is the only change, consider rolling back to the previous Composer environment version (composer-3-airflow-2.10.2-build.5) as a temporary measure to confirm that the upgrade is indeed the root cause. This buys you time to investigate further.
  • Airflow Configuration Settings: Tune the concurrency-related settings (for example [core] parallelism and [celery] worker_concurrency) using Airflow configuration overrides within the Cloud Composer environment.
  • DAG Design: Review your DAGs to see whether you can reduce the number of concurrent tasks or batch operations (see the sketch after this list).
  • Scale Up Cloud SQL: If the Cloud SQL instance is experiencing high CPU, memory, or disk I/O usage, a reasonable solution would be to scale up to a larger instance with more resources.
  • Optimize Database Queries: If the Cloud SQL logs indicate slow queries, identify and optimize those queries. This might involve adding indexes to tables, rewriting queries, or using more efficient data structures.
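
To illustrate the DAG-design point above, here is a minimal sketch (again with placeholder names, not your actual DAG) of the same kind of fan-out with max_active_tasks capping how many tasks can run or defer at the same time, which should also reduce how many tasks hit the metadata database concurrently:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.cloud_run import (
    CloudRunExecuteJobOperator,
)

with DAG(
    dag_id="cloud_run_fanout_capped",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
    max_active_tasks=8,  # at most 8 of the ~40 tasks active at once
) as dag:
    for i in range(40):
        CloudRunExecuteJobOperator(
            task_id=f"run_job_{i}",
            project_id="my-project",   # placeholder
            region="europe-west1",     # placeholder
            job_name=f"my-job-{i}",    # placeholder
            deferrable=True,
        )
```

The ceiling can then be raised step by step while you watch the Cloud SQL and worker metrics, instead of letting all 40 tasks start at once.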

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.

I have the same error…

I’m having the same error as well.
It was my first Composer environment (I had prior experience with self-hosted Airflow).

What did not work for me:

  • Upscaling the environment (Small → Medium)
  • Vertical worker upscale (they can handle at most 4 trivial tasks; 5 tasks raise that error)
  • Vertical scheduler upscale

What “worked” was limiting it to 4 tasks per worker, which is beyond ridiculous for tasks that only consist of a few API calls (no data manipulation involved).

My local worker container can handle 100x that load (and the laptop is old) so there must be a problem.

What I won’t do: change the DAG so that there is less or no concurrency. I’m using Airflow precisely for its parallelism, and I’m aiming to run at least 10 to 20 tasks in parallel.