Hi Google Cloud Community,
We are investigating a short production incident on GKE Standard where surviving workload pods experienced transient outbound connectivity failures to multiple external dependencies.
Environment:
- GKE Standard
- Zone: asia-northeast1-c
- Node pool: dedicated workload node pool
- Node image: COS_CONTAINERD
- Node version context:
- The same node pool had previously been running 1.35.5-gke.1000000.
- It was upgraded to 1.35.5-gke.1057002 about 9 hours before the incident.
- The node upgrade operation completed at 2026-06-21 19:33 UTC.
- Incident window: 2026-06-22 04:24-04:30 UTC
Observed symptoms:
- Multiple external dependencies failed around the same time.
- Failures included:
- DNS resolution failure
- TCP connection timeout after DNS resolution
- HTTPS read timeout
- PostgreSQL connection timeout
- LLM/STT provider request timeouts
- The issue was not limited to a single external provider.
- A separate scaler workload also saw:
- DNS failure resolving a DB host
- later, resolved-IP TCP timeout to the same DB endpoint
- then recovery a few minutes later
What we ruled out so far:
- Affected pods did not restart.
- No OOMKilled events.
- CPU and memory were not saturated.
- The issue was not isolated to one external SaaS provider.
- Some idle nodes were removed by cluster autoscaler around the same time, but the affected pods were on surviving nodes, not the nodes being deleted.
- Aggregate node network metrics did not show a complete NIC outage.
Context:
- Cluster autoscaler removed idle nodes during the same window.
- Shortly after, new pods had to wait for node scale-up capacity.
- We believe low warm capacity amplified the user-visible impact, but it does not explain why existing surviving pods saw DNS/TCP/HTTPS outbound failures.
Current hypothesis:
This looks like a transient shared outbound path degradation from GKE workloads to external dependencies, possibly involving DNS, Cloud NAT, VPC dataplane, node networking, conntrack/netfilter, or a node churn / GKE patch interaction.
Questions:
- Has anyone seen short-lived outbound DNS/TCP timeout bursts from surviving GKE pods during or after node pool upgrades or cluster-autoscaler scale-down?
- Are there known issues or diagnostics specific to GKE 1.35.5-gke.1057002?
- Besides VPC Flow Logs, Cloud NAT logs, packet capture, node serial logs, and GKE events, what else should we inspect?
- Are there recommended metrics/logs for detecting conntrack exhaustion, NAT drops, DNS path issues, or node-level egress dataplane issues on GKE Standard?
Any pointers would be appreciated.