Transient outbound DNS/TCP timeouts from surviving GKE pods after node pool upgrade and scale-down

ryanhan · June 22, 2026, 3:48pm

Hi Google Cloud Community,

We are investigating a short production incident on GKE Standard where surviving workload pods experienced transient outbound connectivity failures to multiple external dependencies.

Environment:

GKE Standard
Zone: asia-northeast1-c
Node pool: dedicated workload node pool
Node image: COS_CONTAINERD
Node version context:
- The same node pool had previously been running 1.35.5-gke.1000000.
- It was upgraded to 1.35.5-gke.1057002 about 9 hours before the incident.
- The node upgrade operation completed at 2026-06-21 19:33 UTC.
Incident window: 2026-06-22 04:24-04:30 UTC

Observed symptoms:

Multiple external dependencies failed around the same time.
Failures included:
- DNS resolution failure
- TCP connection timeout after DNS resolution
- HTTPS read timeout
- PostgreSQL connection timeout
- LLM/STT provider request timeouts
The issue was not limited to a single external provider.
A separate scaler workload also saw:
- DNS failure resolving a DB host
- later, a TCP timeout to the already-resolved IP of the same DB endpoint
- then recovery a few minutes later
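
To tell the DNS-stage failures apart from the TCP-stage failures listed above, a small probe along these lines could be run from an affected pod. This is a minimal sketch, not something we ran during the incident. The function name `probe`, the injectable `resolve`/`connect` parameters, and the 3-second timeout are illustrative assumptions.

```python
import socket
import time


def probe(host, port, timeout=3.0,
          resolve=socket.getaddrinfo, connect=socket.create_connection):
    """Classify an outbound check as 'dns', 'tcp', or 'ok'.

    Returns (stage, detail, elapsed_seconds). `resolve` and `connect` are
    injectable (hypothetical parameters) so the logic can be exercised
    without touching the network.
    """
    start = time.monotonic()

    # Stage 1: name resolution. gaierror means the failure is on the DNS path.
    try:
        infos = resolve(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return ("dns", str(exc), time.monotonic() - start)

    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP for both IPv4 and IPv6.
    ip = infos[0][4][0]

    # Stage 2: TCP connect to the resolved IP, bypassing DNS entirely.
    try:
        conn = connect((ip, port), timeout=timeout)
    except OSError as exc:  # includes timeouts and connection refusals
        return ("tcp", f"{ip}: {exc}", time.monotonic() - start)

    conn.close()
    return ("ok", ip, time.monotonic() - start)
```

For example, `probe("db.example.internal", 5432)` (a hypothetical hostname) returns a tuple whose first element says which stage failed. Logging that once per second during an incident would show whether the DNS and TCP failures overlap or follow one another.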

What we ruled out so far:

Affected pods did not restart.
No OOMKilled events.
CPU and memory were not saturated.
The issue was not isolated to one external SaaS provider.
Some idle nodes were removed by cluster autoscaler around the same time, but the affected pods were on surviving nodes, not the nodes being deleted.
Aggregate node network metrics did not show a complete NIC outage.

Context:

Cluster autoscaler removed idle nodes during the same window.
Shortly after, new pods had to wait for node scale-up capacity.
We believe low warm capacity amplified the user-visible impact, but it does not explain why existing surviving pods saw DNS/TCP/HTTPS outbound failures.

Current hypothesis:
This looks like a transient degradation of a shared outbound path from GKE workloads to external dependencies, possibly involving DNS, Cloud NAT, the VPC dataplane, node networking, conntrack/netfilter, or an interaction between node churn and the GKE patch upgrade.

Questions:

Has anyone seen short-lived outbound DNS/TCP timeout bursts from surviving GKE pods during or after node pool upgrades or cluster-autoscaler scale-down?
Are there known issues or diagnostics specific to GKE 1.35.5-gke.1057002?
Besides VPC Flow Logs, Cloud NAT logs, packet capture, node serial logs, and GKE events, what else should we inspect?
Are there recommended metrics/logs for detecting conntrack exhaustion, NAT drops, DNS path issues, or node-level egress dataplane issues on GKE Standard?
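
For the conntrack part of the last question, this is the kind of node-level check we have in mind. It is a minimal sketch. It assumes the standard Linux `nf_conntrack_count` and `nf_conntrack_max` files are readable on the node (for example from a privileged debug pod), and the 0.8 warning threshold is an arbitrary illustrative value, not a GKE recommendation.

```python
from pathlib import Path

# Standard Linux netfilter sysctl files; readability from a given pod
# depends on its privileges and is an assumption here.
COUNT_PATH = "/proc/sys/net/netfilter/nf_conntrack_count"
MAX_PATH = "/proc/sys/net/netfilter/nf_conntrack_max"


def conntrack_ratio(count_text, max_text):
    """Parse the two file contents and return (count, limit, ratio)."""
    count = int(count_text.strip())
    limit = int(max_text.strip())
    ratio = count / limit if limit > 0 else 0.0
    return count, limit, ratio


def read_conntrack(count_path=COUNT_PATH, max_path=MAX_PATH):
    """Read the current conntrack table usage from the node."""
    return conntrack_ratio(Path(count_path).read_text(),
                           Path(max_path).read_text())


def near_limit(ratio, threshold=0.8):
    """True when table usage is at or above the (hypothetical) threshold."""
    return ratio >= threshold
```

Sampling `read_conntrack()` every few seconds around scale-down events would show whether the table approaches its limit on surviving nodes. A full table is also typically reported in the kernel log as `nf_conntrack: table full, dropping packet`.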

Any pointers would be appreciated.
