We’ve recently noticed that one of our Kubernetes clusters becomes inaccessible via the control plane API during deployments. The downtime typically lasts between 30 minutes and an hour. We’ve also observed that during these periods, a REPAIR_CLUSTER operation starts. Here are the recent occurrences:
To better understand the issue, you will need to check the logs and recent cluster operations.
Check Control Plane Logs & Metrics
View logs from the Kubernetes API Server:
gcloud logging read "resource.type=k8s_cluster AND logName:stderr" --limit 50
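If you want to narrow the query to the affected cluster and the incident window, a variation like the following should work (CLUSTER_NAME and the freshness window are placeholders to adjust):
gcloud logging read 'resource.type="k8s_cluster" AND resource.labels.cluster_name="CLUSTER_NAME"' --limit 100 --freshness=2h --format=json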
Check for control plane maintenance operations (master upgrades or automated repairs), since these can make the API server temporarily unavailable:
gcloud container operations list --filter="operationType=UPGRADE_MASTER"
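Since you are seeing REPAIR_CLUSTER during the outages, it is also worth listing those operations and checking whether their start/end times line up with the API downtime; something along these lines (the --format columns are just a readability choice, and OPERATION_ID/ZONE are placeholders taken from the list output):
gcloud container operations list --filter="operationType=REPAIR_CLUSTER" --format="table(name,operationType,status,startTime,endTime)"
gcloud container operations describe OPERATION_ID --zone=ZONE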
View recent cluster events:
kubectl get events --sort-by=.metadata.creationTimestamp -A
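To cut the noise around an incident window, filtering to warnings only can help; this is just the standard field selector, nothing cluster-specific:
kubectl get events -A --field-selector type=Warning --sort-by=.metadata.creationTimestamp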
Investigate API Overload During Deployments
If too many resources (pods, deployments, services) are updated at once, the API server might become overwhelmed.
Run kubectl get apiservices and look for entries whose AVAILABLE column is False, which indicates timeouts or unreachable aggregated API services.
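As a quick check, the first command below surfaces aggregated APIs that are not reporting Available, and the second (if your credentials allow reading the raw metrics endpoint) gives a rough view of API server requests rejected with HTTP 429, which suggests throttling under load; the grep patterns are illustrative, not exact:
kubectl get apiservices | grep -v ' True '
kubectl get --raw /metrics | grep '^apiserver_request_total' | grep 'code="429"'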