We’ve recently noticed that one of our Kubernetes clusters becomes inaccessible via the control plane API during deployments. The downtime typically lasts between 30 minutes and an hour. We’ve also observed that during these periods, a REPAIR_CLUSTER operation starts. Here are the recent occurrences:
To better understand the issue, you will need to check the logs and recent cluster operations.
Check Control Plane Logs & Metrics
View logs from the Kubernetes API Server:
gcloud logging read "resource.type=k8s_cluster AND logName:stderr" --limit 50
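If you want to narrow the query to the affected cluster and the incident window, a variation like the following should work (CLUSTER_NAME and the freshness window are placeholders to adjust):
gcloud logging read 'resource.type="k8s_cluster" AND resource.labels.cluster_name="CLUSTER_NAME"' --limit 100 --freshness=2h --format=json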
Check for control plane maintenance operations (master upgrades or automated repairs), since these can make the API server temporarily unavailable:
gcloud container operations list --filter="operationType=UPGRADE_MASTER"
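Since you are seeing REPAIR_CLUSTER during the outages, it is also worth listing those operations and checking whether their start/end times line up with the API downtime; something along these lines (the --format columns are just a readability choice, and OPERATION_ID/ZONE are placeholders taken from the list output):
gcloud container operations list --filter="operationType=REPAIR_CLUSTER" --format="table(name,operationType,status,startTime,endTime)"
gcloud container operations describe OPERATION_ID --zone=ZONE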
View recent cluster events:
kubectl get events --sort-by=.metadata.creationTimestamp -A
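To cut the noise around an incident window, filtering to warnings only can help; this is just the standard field selector, nothing cluster-specific:
kubectl get events -A --field-selector type=Warning --sort-by=.metadata.creationTimestamp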
Investigate API Overload During Deployments
If too many resources (pods, deployments, services) are updated at once, the API server might become overwhelmed.
Run kubectl get apiservices and look for entries whose AVAILABLE column is False, which indicates timeouts or unreachable aggregated API services.
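As a quick check, the first command below surfaces aggregated APIs that are not reporting Available, and the second (if your credentials allow reading the raw metrics endpoint) gives a rough view of API server requests rejected with HTTP 429, which suggests throttling under load; the grep patterns are illustrative, not exact:
kubectl get apiservices | grep -v ' True '
kubectl get --raw /metrics | grep '^apiserver_request_total' | grep 'code="429"'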