Hello Google Cloud Team,
We are running a Cloud SQL for MySQL instance in High Availability (HA) mode. During both manual and automatic failovers, downtime is consistently around 8–10 minutes, even though the documentation states that failover should take around 60–120 seconds.
This extended downtime severely impacts our production application, where user logins and write queries must succeed continuously.
Instance Configuration Details
| Setting | Value |
|---|---|
| Edition | Enterprise |
| Database Version | MySQL 8.0.18 |
| Region | us-central1 |
| Primary Zone | us-central1-a |
| HA | Enabled (Regional Instance) |
| vCPUs | 8 |
| Memory | 28 GB |
| Storage Type | SSD |
| Storage Size | 900 GB |
| Automated Backups | Enabled |
| PITR | Enabled |
| Private IP | Enabled |
| Performance Schema | Enabled |
MySQL flags
innodb_buffer_pool_size = 12663676416
innodb_log_buffer_size = 109051904
table_open_cache = 40000
max_connections = 2000
tmp_table_size = 33554432
max_heap_table_size = 33554432
A manual failover triggered from the GCP Console consistently takes 8–10 minutes before the instance returns to the “RUNNABLE” state. During this time, both reads and writes fail. The standby is in the same region (multi-zone configuration), and synchronous replication is enabled. The issue persists even under low query load during manual testing, and there is no apparent replication lag before the failover event.
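A minimal sketch of how such an outage window can be timed from a client during a failover test. The names `longest_outage` and `probe` are illustrative and not part of any Cloud SQL tooling. The probe is assumed to be a callable that returns True when a trivial query against the instance succeeds, and the outage is counted from the first failed probe to the next successful one.

```python
import time
from typing import Callable, Optional


def longest_outage(probe: Callable[[], bool],
                   checks: int,
                   interval: float = 1.0,
                   clock: Callable[[], float] = time.monotonic,
                   sleep: Callable[[float], None] = time.sleep) -> float:
    """Return the longest continuous window, in seconds, in which probe() failed.

    probe is a hypothetical health check (for example, a function that runs
    SELECT 1 through the application's own driver). Any exception it raises
    is treated as a failed check.
    """
    longest = 0.0
    outage_start: Optional[float] = None
    for _ in range(checks):
        now = clock()
        try:
            ok = probe()
        except Exception:
            ok = False
        if ok:
            if outage_start is not None:
                # First success after a run of failures closes the window.
                longest = max(longest, now - outage_start)
                outage_start = None
        elif outage_start is None:
            # First failure opens a new outage window.
            outage_start = now
        sleep(interval)
    if outage_start is not None:
        # Still failing when the run ended: count the open window too.
        longest = max(longest, clock() - outage_start)
    return longest
```

Running this with a one-second interval for the duration of a manual failover would give a client-side figure that can be compared directly with the time the Console takes to report “RUNNABLE”.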
OUR CURRENT PROBLEM
We need a zero-downtime or near-zero-downtime architecture in which the application can still perform both read and write operations during:
- Cloud SQL maintenance events
- Failover to the standby instance
- Unexpected database crashes
Question:
What is the most reliable design or GCP-recommended solution to achieve continuous read-write capability and avoid application downtime during failover or restart events?
Are there any best practices or patterns (e.g., proxy layer, replication, buffering, or DR strategy) that support this use case?
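To make the "buffering" option concrete, this is roughly the kind of application-side retry we have in mind. It is only a sketch under stated assumptions: `retry_with_backoff` is a hypothetical helper, and the built-in `ConnectionError` stands in for whatever exception the MySQL driver actually raises when the instance is unreachable.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(operation: Callable[[], T],
                       max_attempts: int = 8,
                       base_delay: float = 0.5,
                       max_delay: float = 30.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation() until it succeeds or max_attempts is reached.

    Waits between attempts using capped exponential backoff with full jitter.
    ConnectionError is a placeholder for the driver's real connectivity error.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                # Out of attempts: surface the error to the caller.
                raise
            # Upper bound doubles each attempt, capped at max_delay.
            upper = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random time between 0 and the bound.
            sleep(random.uniform(0, upper))
    raise RuntimeError("max_attempts must be at least 1")
```

With these defaults the total wait is at most about a minute (0.5 + 1 + 2 + 4 + 8 + 16 + 30 seconds), so a retry like this could at best help with part of the documented 60–120 second window and does not help with the 8–10 minutes we currently see.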