It sounds like you’re encountering connectivity problems with the second Airflow SQL proxy pod in your Cloud Composer setup. The error message suggests a timeout when attempting to connect to the PostgreSQL database.
Here are some steps you could take to troubleshoot this issue:
Review the logs of the second airflow-sqlproxy pod closely to gather more detail about the connection issues. Look for specific error messages, or for patterns in when the timeouts occur, that might indicate their cause.
Check the resource allocation for the second pod. It’s possible that it’s running out of resources or encountering contention issues due to insufficient memory, CPU, or other resource limitations.
Examine the PostgreSQL database performance at the times when the timeouts occur. High load or long-running queries at those moments could explain why connections from this pod time out.
Review the Airflow configuration for both pods, ensuring they are identical or at least have the same relevant configuration settings related to database connectivity.
Consider redeploying the problematic pod. Sometimes, reinitializing the pod or creating a new instance might resolve underlying issues in the deployment.
If the issue persists and you can’t resolve it with the above steps, it might be beneficial to reach out to Google Cloud support.
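To make the log review more concrete, here is a minimal Python sketch that counts timeout messages per minute in a pod's log output, so you can see whether the timeouts cluster around particular times (for example, when a heavy DAG runs). It assumes you have saved the logs with `kubectl logs <pod-name> -n <namespace> --timestamps`, where the pod name and namespace are placeholders for your own. The regular expressions and the sample log lines are my assumptions, not the exact messages the proxy emits, so adjust them to what you actually see.

```python
import re
from collections import Counter

# Assumed pattern: matches "timeout", "timed out", "time-out" in any case.
# Adjust it to the wording that appears in your own proxy logs.
TIMEOUT_RE = re.compile(r"time(d)?[ -]?out", re.IGNORECASE)

# Assumed timestamp prefix as added by `kubectl logs --timestamps`,
# e.g. 2024-01-15T10:32:07.123456789Z; only the minute part is captured.
MINUTE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2})")


def timeouts_per_minute(lines):
    """Count log lines that mention a timeout, grouped by minute."""
    counts = Counter()
    for line in lines:
        if not TIMEOUT_RE.search(line):
            continue
        match = MINUTE_RE.match(line)
        # Lines without a recognizable timestamp are grouped as "unknown".
        counts[match.group(1) if match else "unknown"] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    sample = [
        "2024-01-15T10:32:07.1Z couldn't connect: dial tcp: i/o timeout",
        "2024-01-15T10:32:40.0Z connection timed out",
        "2024-01-15T10:33:01.0Z new connection accepted",
    ]
    for minute, count in sorted(timeouts_per_minute(sample).items()):
        print(minute, count)
```

Running the same function over the logs of both airflow-sqlproxy pods and comparing the results side by side may show whether the second pod's timeouts line up with specific DAG schedules.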
Actually, we have already checked everything you mentioned, except perhaps the PostgreSQL database performance, and we don’t have much visibility into this managed PostgreSQL instance.
For now we have disabled a couple of DAGs (including the airflow_clean_up DAG), and it’s working fine. We may have to optimize these DAGs.
What is weird is that this is happening on only one of the two running airflow-sqlproxy pods.