Transient outbound DNS/TCP timeouts from surviving GKE pods after node pool upgrade and scale-down

Hi Google Cloud Community,

We are investigating a short production incident on GKE Standard where surviving workload pods experienced transient outbound connectivity failures to multiple external dependencies.

Environment:

  • GKE Standard
  • Zone: asia-northeast1-c
  • Node pool: dedicated workload node pool
  • Node image: COS_CONTAINERD
  • Node version context:
    • The same node pool had previously been running 1.35.5-gke.1000000.
    • It was upgraded to 1.35.5-gke.1057002 about 9 hours before the incident.
    • The node upgrade operation completed at 2026-06-21 19:33 UTC.
  • Incident window: 2026-06-22 04:24-04:30 UTC

Observed symptoms:

  • Multiple external dependencies failed around the same time.
  • Failures included:
    • DNS resolution failure
    • TCP connection timeout after DNS resolution
    • HTTPS read timeout
    • PostgreSQL connection timeout
    • LLM/STT provider request timeouts
  • The issue was not limited to a single external provider.
  • A separate scaler workload also saw:
    • a DNS failure resolving a DB host
    • later, a TCP timeout to the resolved IP of the same DB endpoint
    • recovery a few minutes later
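
To tell DNS failures apart from TCP connect failures on surviving pods, we are considering running a small probe like the one below from inside an affected pod. This is only a minimal sketch and did not produce the observations above. The function name `probe`, its result keys, and the injectable `resolve` parameter are our own hypothetical choices. It uses only the Python standard library.

```python
import socket
import time


def probe(host, port, timeout=3.0, resolve=socket.getaddrinfo):
    """Resolve host, then open one TCP connection, timing each phase separately.

    Returns a dict whose "phase" is the last phase attempted:
    "dns", "tcp_connect", or "done" on success.
    """
    result = {"host": host, "port": port, "phase": "dns", "ok": False}

    # Phase 1: DNS resolution only.
    start = time.monotonic()
    try:
        infos = resolve(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        result["dns_s"] = time.monotonic() - start
        result["error"] = f"dns: {exc}"
        return result
    result["dns_s"] = time.monotonic() - start

    # Phase 2: TCP connect to the first resolved address.
    family, socktype, proto, _canonname, sockaddr = infos[0]
    result["phase"] = "tcp_connect"
    result["addr"] = sockaddr[0]
    start = time.monotonic()
    try:
        with socket.socket(family, socktype, proto) as sock:
            sock.settimeout(timeout)
            sock.connect(sockaddr)
    except OSError as exc:  # covers timeouts and connection refused
        result["tcp_s"] = time.monotonic() - start
        result["error"] = f"tcp: {exc}"
        return result
    result["tcp_s"] = time.monotonic() - start

    result["phase"] = "done"
    result["ok"] = True
    return result
```

Calling `probe("db.example.internal", 5432)` in a loop (the hostname is a placeholder) and logging the result would show whether a failure happened in the `dns` phase or the `tcp_connect` phase, and how long each phase took.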

What we have ruled out so far:

  • Affected pods did not restart.
  • No OOMKilled events.
  • CPU and memory were not saturated.
  • The issue was not isolated to one external SaaS provider.
  • Some idle nodes were removed by cluster autoscaler around the same time, but the affected pods were on surviving nodes, not the nodes being deleted.
  • Aggregate node network metrics did not show a complete NIC outage.

Context:

  • Cluster autoscaler removed idle nodes during the same window.
  • Shortly after, new pods had to wait for node scale-up capacity.
  • We believe low warm capacity amplified the user-visible impact, but it does not explain why existing surviving pods saw DNS/TCP/HTTPS outbound failures.

Current hypothesis:
This looks like a transient degradation of a shared outbound path from GKE workloads to external dependencies. Possible contributors include DNS, Cloud NAT, the VPC dataplane, node networking, conntrack/netfilter, or an interaction between node churn and the GKE patch upgrade.

Questions:

  1. Has anyone seen short-lived outbound DNS/TCP timeout bursts from surviving GKE pods during or after node pool upgrades or cluster-autoscaler scale-down?
  2. Are there known issues or diagnostics specific to GKE 1.35.5-gke.1057002?
  3. Besides VPC Flow Logs, Cloud NAT logs, packet capture, node serial logs, and GKE events, what else should we inspect?
  4. Are there recommended metrics/logs for detecting conntrack exhaustion, NAT drops, DNS path issues, or node-level egress dataplane issues on GKE Standard?
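
For question 4, the only node-level conntrack check we know of is comparing the kernel's `nf_conntrack_count` with `nf_conntrack_max`, alongside looking for "nf_conntrack: table full, dropping packet" in kernel logs. Below is a minimal sketch of that comparison. It assumes the script can read the node's `/proc/sys/net/netfilter` (for example from a privileged pod or the node itself). The function names and the 0.8 threshold are arbitrary choices of ours, not a GKE recommendation.

```python
from pathlib import Path


def conntrack_usage(proc_dir="/proc/sys/net/netfilter"):
    """Read current and maximum conntrack entries and return their ratio."""
    base = Path(proc_dir)
    count = int((base / "nf_conntrack_count").read_text().strip())
    limit = int((base / "nf_conntrack_max").read_text().strip())
    ratio = count / limit if limit else 0.0
    return {"count": count, "max": limit, "ratio": ratio}


def is_near_exhaustion(usage, threshold=0.8):
    """Return True when the table is at or above the given fill ratio."""
    return usage["ratio"] >= threshold


if __name__ == "__main__":
    usage = conntrack_usage()
    print(usage, "near exhaustion:", is_near_exhaustion(usage))
```

If there is a better-supported metric or log source for this on GKE Standard, we would prefer to use that instead.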

Any pointers would be appreciated.
