Long delay on initial outbound connection with Gen1 Cloud Run using Direct VPC Egress + Cloud NAT

We are considering moving from Azure Container Apps to GCP Cloud Run because of the much better "scale to zero" boot times. While running benchmarks, we ran into the following issue:

When we deploy our app in the Gen1 execution environment (we need fast cold boots) using Direct VPC Egress to a Shared VPC, with Cloud NAT on the Shared VPC, there is a significant delay of about 10-15 seconds on the first outbound HTTP(S) connection.
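For context, the deployment is shaped roughly like this (service, image, region, project, and subnet names are placeholders, not our real values):

# Rough sketch of the deploy command for the setup described above (all names are placeholders).
gcloud run deploy my-service \
  --image=europe-west1-docker.pkg.dev/my-project/my-repo/app:latest \
  --region=europe-west1 \
  --execution-environment=gen1 \
  --cpu-boost \
  --vpc-egress=all-traffic \
  --subnet=projects/host-project/regions/europe-west1/subnetworks/cloud-run-subnet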

We see in the logs that the firewall is only hit ~8s after the request on average, and the Cloud NAT allocation is only done after ~10s on average. We've made sure there are enough allocated ports, and we've tried both Standard and Premium network tier IPs. What's more, it seems to be intermittent: sometimes we do get a ~1s cold boot and an immediate outbound HTTPS connection.

When we remove Direct VPC Egress from the Cloud Run instance, the issue is resolved and outbound HTTPS connections are available within 1 second; but we need Cloud NAT for IP whitelisting.

At first this issue did not seem to exist on Gen2 (which of course has slower cold boot times in general), but upon further investigation it also happens on Gen2.

Is there a known issue with Cloud Run instances and Direct VPC Egress delays during cold boot?

Edit: nice illustration of what I’m seeing – a 12 second delay between the firewall seeing the egress request and the NAT allocation.

I found this topic from last Friday which seems to have similar issues, just using VMs: https://www.googlecloudcommunity.com/gc/Infrastructure-Compute-Storage/VMs-with-private-IPs-and-NAT-have-an-extra-40s-network-bootstrap/m-p/789965/thread-id/7982

For the Cloud Run side - do you have CPU Boost enabled? See "Faster cold starts with startup CPU Boost" and "3 Ways to optimize Cloud Run response times" on the Google Cloud Blog.
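If it isn't, it can be turned on for an existing service with something like this (service name and region are placeholders):

# Hypothetical one-liner: enable startup CPU boost on an existing service.
gcloud run services update my-service --region=europe-west1 --cpu-boost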

For the VPC side, see "Direct VPC egress with a VPC network" in the Cloud Run documentation.

Are you using Network tags for the firewall?

How large is the subnet? (It sounds weird, but it has to do with the underlying hosts created.)

How did you set traffic routing? Select one of the following:

  • Route only requests to private IPs to the VPC, to send only traffic to internal addresses through the VPC network.
  • Route all traffic to the VPC, to send all outbound traffic through the VPC network.

Did you set up anything in Network Intelligence Center to test egress? (It would point to a route or service networking configuration issue.)

Based on the above, if there are no firewall or config issues, you can look at the router's NAT mappings with gcloud compute routers get-nat-mapping-info (see the Google Cloud CLI documentation) and potentially add NAT rules with gcloud compute routers nats rules create to adjust the priority. Another way is to use the NGFW with an external IP to whitelist the service, or to put an Application Load Balancer and its WAF in front for whitelisting (understood that the latter comes at an additional cost).
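As a rough sketch (router, NAT, region, rule number, match IP, and address name are placeholders, not a drop-in config):

# Dump which NAT IP:port ranges are mapped to which source IPs/ranges.
gcloud compute routers get-nat-mapping-info my-router --region=europe-west1

# Hypothetical NAT rule that sends traffic to a specific destination through a
# particular reserved static IP (all values are placeholders).
gcloud compute routers nats rules create 100 \
  --router=my-router \
  --region=europe-west1 \
  --nat=my-nat \
  --match="destination.ip == '203.0.113.10'" \
  --source-nat-active-ips=my-reserved-static-ip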

Additionally, another option is to set up Secure Web Proxy with NAT, creating a Private Service Connect (PSC) service attachment, as it has a higher priority; see "Deploy Secure Web Proxy as a Private Service Connect service attachment" in the Google Cloud docs.

So let me try to answer all your questions :grinning_face_with_smiling_eyes:

Yes, CPU boost is enabled.

No, the firewall has no rules related to tags. I do set the tag “test” or “prod” on the Cloud Run egress setting, but nothing is done with it in the firewall.
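For completeness, the tag only exists on the Cloud Run side, set roughly like this (service name and region are placeholders); no firewall rule targets it:

# Hypothetical: apply a network tag to the service's Direct VPC Egress traffic.
gcloud run services update my-service --region=europe-west1 --network-tags=test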

The subnet was /24, but because of your question I changed it to /21 to see what would happen: same issue.
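(For the record, widening the range was a single command, roughly the following - subnet name and region are placeholders; expand-ip-range can only widen an existing range, not shrink it.)

# Hypothetical example: widen the subnet's primary range from /24 to /21.
gcloud compute networks subnets expand-ip-range cloud-run-subnet \
  --region=europe-west1 \
  --prefix-length=21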

Traffic routing is set to “Route all traffic to the VPC”

I did not use Network Intelligence Center, but I don't think there is a generic issue with the routing, because after the delay the routing works just fine.

Here's the output of get-nat-mapping-info:

---
instanceName: ''
interfaceNatMappings:
- natIpPortRanges:
  - [EXTERNAL-STATIC-IP]:1024-1151
  numTotalDrainNatPorts: 0
  numTotalNatPorts: 128
  sourceAliasIpRange: ''
  sourceVirtualIp: 10.0.0.20
- natIpPortRanges:
  - [EXTERNAL-STATIC-IP]:1024-1151
  numTotalDrainNatPorts: 0
  numTotalNatPorts: 128
  sourceAliasIpRange: 10.0.0.21/32
  sourceVirtualIp: ''
- natIpPortRanges:
  - [EXTERNAL-STATIC-IP]:1024-1151
  numTotalDrainNatPorts: 0
  numTotalNatPorts: 128
  sourceAliasIpRange: 10.0.0.22/32
  sourceVirtualIp: ''
- natIpPortRanges:
  - [EXTERNAL-STATIC-IP]:1024-1151
  numTotalDrainNatPorts: 0
  numTotalNatPorts: 128
  sourceAliasIpRange: 10.0.0.23/32
  sourceVirtualIp: ''
- natIpPortRanges:
  - [EXTERNAL-STATIC-IP]:32768-32895
  numTotalDrainNatPorts: 0
  numTotalNatPorts: 128
  sourceAliasIpRange: ''
  sourceVirtualIp: 10.0.0.20
- natIpPortRanges:
  - [EXTERNAL-STATIC-IP]:32768-32895
  numTotalDrainNatPorts: 0
  numTotalNatPorts: 128
  sourceAliasIpRange: 10.0.0.21/32
  sourceVirtualIp: ''
- natIpPortRanges:
  - [EXTERNAL-STATIC-IP]:32768-32895
  numTotalDrainNatPorts: 0
  numTotalNatPorts: 128
  sourceAliasIpRange: 10.0.0.22/32
  sourceVirtualIp: ''
- natIpPortRanges:
  - [EXTERNAL-STATIC-IP]:32768-32895
  numTotalDrainNatPorts: 0
  numTotalNatPorts: 128
  sourceAliasIpRange: 10.0.0.23/32
  sourceVirtualIp: ''
---
instanceName: ''
interfaceNatMappings:
- natIpPortRanges:
  - [EXTERNAL-STATIC-IP]:1152-1279
  numTotalDrainNatPorts: 0
  numTotalNatPorts: 128
  sourceAliasIpRange: ''
  sourceVirtualIp: 10.0.8.16
- natIpPortRanges:
  - [EXTERNAL-STATIC-IP]:1152-1279
  numTotalDrainNatPorts: 0
  numTotalNatPorts: 128
  sourceAliasIpRange: 10.0.8.17/32
  sourceVirtualIp: ''
- natIpPortRanges:
  - [EXTERNAL-STATIC-IP]:1152-1279
  numTotalDrainNatPorts: 0
  numTotalNatPorts: 128
  sourceAliasIpRange: 10.0.8.18/32
  sourceVirtualIp: ''
- natIpPortRanges:
  - [EXTERNAL-STATIC-IP]:1152-1279
  numTotalDrainNatPorts: 0
  numTotalNatPorts: 128
  sourceAliasIpRange: 10.0.8.19/32
  sourceVirtualIp: ''
- natIpPortRanges:
  - [EXTERNAL-STATIC-IP]:32896-33023
  numTotalDrainNatPorts: 0
  numTotalNatPorts: 128
  sourceAliasIpRange: ''
  sourceVirtualIp: 10.0.8.16
- natIpPortRanges:
  - [EXTERNAL-STATIC-IP]:32896-33023
  numTotalDrainNatPorts: 0
  numTotalNatPorts: 128
  sourceAliasIpRange: 10.0.8.17/32
  sourceVirtualIp: ''
- natIpPortRanges:
  - [EXTERNAL-STATIC-IP]:32896-33023
  numTotalDrainNatPorts: 0
  numTotalNatPorts: 128
  sourceAliasIpRange: 10.0.8.18/32
  sourceVirtualIp: ''
- natIpPortRanges:
  - [EXTERNAL-STATIC-IP]:32896-33023
  numTotalDrainNatPorts: 0
  numTotalNatPorts: 128
  sourceAliasIpRange: 10.0.8.19/32
  sourceVirtualIp: ''

The IPs match up with what I see being used in the Cloud NAT allocation.


Sure, maybe that would work. But the recommended and documented way of getting a static outbound IP is using Direct VPC Egress and Cloud NAT – it should "just work", but clearly there is a bug here… and apparently I cannot get any actual support without paying.

I agree, and I hadn't run into latency issues until I created them :slightly_smiling_face: – I'm just providing another method I've used to resolve this, depending on the use case. Even if you had paid support, you'd be lucky to get assistance on a network latency issue.

Sorry, the questions were to narrow down a number of scenarios. I would set up a connectivity test in Network Intelligence Center (see "Connectivity Tests overview") with a client. I'd also enable the Network Recommender API and use Network Analyzer (see "Use Network Analyzer") to look at the node path and see whether the route or the firewall traffic is causing the delay. I would pay attention to the RTT and SRT times for the session.
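Something along these lines, as a sketch (test name, IPs, port, and network path are placeholders):

# Hypothetical connectivity test from a source IP in the shared VPC to the external destination.
gcloud network-management connectivity-tests create cloud-run-egress-test \
  --source-ip-address=10.0.0.20 \
  --source-network=projects/host-project/global/networks/shared-vpc \
  --destination-ip-address=203.0.113.10 \
  --destination-port=443 \
  --protocol=TCP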

Thanks! I did both of those, and there is no issue with a connectivity test from the VPC to the destination IP. Network Analyzer has no recommendations either :slightly_smiling_face: And Network Topology and the other tools in Network Intelligence Center don't seem to work for serverless workloads; they're just blank. So thanks for all the ideas! But still nothing :disappointed_face:


Here is a picture that illustrates what I'm talking about: I've made an "egress-allow-all" firewall rule that just logs everything, and you can clearly see the 12-second delay between the firewall seeing the request and the NAT allocation – even though the NAT mapping (as shown in one of the replies above) is already there for this source IP/port.
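For reference, the rule is essentially the following (the network name is a placeholder):

# Sketch of the "egress-allow-all" rule, used purely to log egress timing.
gcloud compute firewall-rules create egress-allow-all \
  --network=shared-vpc \
  --direction=EGRESS \
  --action=ALLOW \
  --rules=all \
  --destination-ranges=0.0.0.0/0 \
  --enable-logging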


Just noticed that Secure Web Proxy is a $1000/month service (or maybe not, but the GCP price calculator is missing most of the services, so who knows…), so that would kill the use case of Cloud Run for a simple service. I cannot believe this simple use case is so buggy, or at least undocumented.

I've tried changing so many settings, but this freaking delay just keeps popping up. This is turning me off the idea of using GCP at all, to be honest. It's such a shame that this wonderful 1s cold boot time is absolutely destroyed by a 15s wait on a NAT port. It could have been amazing, but I guess I'll just have to give AWS a go.

Yes, the other tools don’t work with serverless

In the one use case where I'm using it, it's about 60 GB/month and it comes out to $4.50 per month. It's about $0.02/GB, so I don't know about that calculator figure.

Ah, that's not too bad. I saw an instance-hour cost of $1.25 somewhere.

Too bad it's also listed as not being supported with Direct VPC Egress under Limitations.

Just to rule it out: are you using a slim base image? Normally in a case like this, given your statement about not having support, I'd say run the GoogleCloudPlatform/vpc-network-tester report (github.com) to investigate the traffic flow and open an issue on that repo, as it goes to the network program team - but it seems they stopped supporting it :disappointed_face: GoogleCloudPlatform/PerfKitBenchmarker (github.com) might be another repo where you could run the benchmark on Cloud Run and ask why there is a delay on the first response; their first question will be whether you're using a slim base image. With the latter you could make the point that AWS or Azure doesn't exhibit this issue.

If you did have support, you could open a case, but that would turn you off even more, as it takes time to get to the program team even if you give them everything up front.

Yes, $1.25/hour for committed use according to the pricing page. As it is serverless, the committed use is low unless it is really chatty; 60 GB isn't a lot of traffic for the service I'm running, so I'm probably under the committed-use threshold.

Yeah, the image is not the issue. Maybe I can try the vpc-network-tester some day, but as you said: even if that proves my point, who is there to listen? I think I will just forget about this Cloud Run setup for now, go back to Azure, and maybe come back in a year to see if anything has improved.

Thanks for all the helpful tips, really appreciate it!