We are considering moving from Azure Container Apps to GCP Cloud Run due to the much better “scale to 0” boot times, and are running benchmarks, but are running into the following issue:
When we deploy our app in the Gen1 environment (we need fast cold boots) using Direct VPC Egress to a Shared VPC, with a Cloud NAT on the Shared VPC, there is a significant delay of about 10-15 seconds on the first outbound HTTP(S) connection.
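For context, the deployment looks roughly like this (service, project, network, subnet and region names are placeholders, not our real values):

# Hypothetical sketch of the Cloud Run deployment with Direct VPC egress to a Shared VPC
gcloud run deploy my-service \
  --image=europe-docker.pkg.dev/my-project/repo/my-service:latest \
  --region=europe-west1 \
  --execution-environment=gen1 \
  --network=projects/host-project/global/networks/shared-vpc \
  --subnet=projects/host-project/regions/europe-west1/subnetworks/run-egress \
  --vpc-egress=all-traffic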
We see in the logs that the firewall is hit only ~8s after the request on average, and the Cloud NAT allocation happens only after ~10s on average. We've made sure there are enough allocated ports, and we've tried both Standard and Premium network tier IPs. What's more, it seems to be intermittent: sometimes we do get a ~1s cold boot and an immediate outbound HTTPS connection.
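For what it's worth, this is roughly how I verified and bumped the port allocation (router/NAT names and region are placeholders):

# Inspect the Cloud NAT configuration and raise the minimum ports per VM
gcloud compute routers nats describe my-nat --router=my-router --region=europe-west1
gcloud compute routers nats update my-nat --router=my-router --region=europe-west1 \
  --min-ports-per-vm=1024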
When we remove Direct VPC Egress from the Cloud Run instance, the issue goes away and outbound HTTPS connections are available within 1 second; but we need Cloud NAT for IP whitelisting.
This issue did not initially seem to exist on Gen2, although that of course leads to colder boot times in general. Upon further investigation, however, the delay also happens on Gen2.
Is there a known issue with Cloud Run instances and Direct VPC Egress delays during cold boot?
Edit: a nice illustration of what I'm seeing – a 12-second delay between the firewall seeing the egress request and the NAT allocation.
Based on the above, if there are no firewall or config issues, you can look at the router's NAT mappings with gcloud compute routers get-nat-mapping-info and potentially add NAT rules with gcloud compute routers nats rules create to adjust the priority. Another way is to use the NGFW with an external IP to whitelist the service, or to put an Application Load Balancer with its WAF in front for whitelisting (understand the latter comes at an additional cost).
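Something along these lines (router, NAT, rule number and IP values are purely illustrative):

# Check the current NAT mappings on the router
gcloud compute routers get-nat-mapping-info my-router --region=europe-west1

# Add a NAT rule that matches a specific destination and uses its own external IP
gcloud compute routers nats rules create 100 \
  --router=my-router --nat=my-nat --region=europe-west1 \
  --match='destination.ip == "203.0.113.10"' \
  --source-nat-active-ips=my-reserved-ip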
No, the firewall has no rules related to tags. I do set the tag “test” or “prod” on the Cloud Run egress setting, but nothing is done with it in the firewall.
The subnet was a /24, but because of your question I changed it to a /21 to see what would happen: same issue.
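(For reference, the change itself was just an expand of the existing range, roughly as below – subnet name and region are placeholders:)

# Expand the existing egress subnet from /24 to /21
gcloud compute networks subnets expand-ip-range run-egress \
  --region=europe-west1 --prefix-length=21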
Traffic routing is set to “Route all traffic to the VPC”
I did not use Network Intelligence, but I don't think there is a generic issue with the routing, because after the delay the routing works just fine.
Sure, maybe that would work. But the recommended and documented way of getting a static outbound IP is using Direct VPC Egress with Cloud NAT – it should "just work", but clearly there is a bug here… just can't get any actual support without paying, apparently.
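The documented setup I'm referring to boils down to something like this (address, router, NAT and subnet names are placeholders):

# Reserve a static external IP and pin the Cloud NAT to it, so all egress uses that IP
gcloud compute addresses create nat-egress-ip --region=europe-west1
gcloud compute routers nats create my-nat \
  --router=my-router --region=europe-west1 \
  --nat-custom-subnet-ip-ranges=run-egress \
  --nat-external-ip-pool=nat-egress-ip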
I agree, and I hadn't run into latency issues until I created them - just providing another method I used to resolve this based on the use case. Even if you had paid support, you'd be lucky to get assistance on a network latency issue.
Sorry, the questions were to narrow down a number of scenarios. I would set up a connectivity test in Network Intelligence Center (Connectivity Tests overview | Google Cloud) with a client. I'd also enable the Network Recommender API and use the Network Analyzer (Use Network Analyzer | Google Cloud) to look at the node path and see whether the route or firewall traffic is causing the delay. Pay attention to the RTT and SRT times for the session.
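Roughly like this (test name, network and IPs are placeholders – use an address from your egress subnet and the real destination):

# Connectivity test from an IP in the Cloud Run egress subnet to the destination
gcloud network-management connectivity-tests create run-egress-test \
  --source-ip-address=10.0.0.10 \
  --source-network=projects/host-project/global/networks/shared-vpc \
  --destination-ip-address=203.0.113.10 \
  --destination-port=443 \
  --protocol=TCP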
Thanks! I did both of those and there is no issue with a Connectivity Test from the VPC to the destination IP. The Network Analyzer has no recommendations either. And it seems the Network Topology and other tools from the Network Intelligence Center don't work for serverless workloads – it's just blank. So thanks for all the ideas! But still nothing.
Here is a picture that illustrates what I'm talking about: I've made an "egress-allow-all" firewall rule that just logs everything, and you can clearly see the 12-second delay between the firewall seeing the request and the NAT allocation – even though the NAT routing (as seen in one of the replies above) is already there for this source IP/port.
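The rule itself is nothing special, roughly this (network name is a placeholder):

# Egress allow-all rule with logging enabled, just to make the traffic visible
gcloud compute firewall-rules create egress-allow-all \
  --network=shared-vpc --direction=EGRESS --action=ALLOW \
  --rules=all --destination-ranges=0.0.0.0/0 \
  --enable-logging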
Just noticed that Secure Web Proxy is a $1000/month service (or maybe not, but the GCP price calculator is missing most of the services, so who knows…), so that would kill the use case of Cloud Run for a simple service. Cannot believe this simple use case is so buggy, or at least so undocumented.
Tried changing so many settings, but this freaking delay just keeps popping up. This is turning me off from the idea of using GCP at all, to be honest. It's such a shame that this wonderful 1s cold boot time is absolutely destroyed by a 15s wait for a NAT port. Could have been amazing, but I guess I'll just have to give AWS a go.
If you did have support, you could open a case, but that would turn you off even more, as it takes time to get to the program team even if you give them everything up front.
Yes, $1.25/hour for committed use according to the pricing page. As it is serverless, the committed use is low unless it is really chatty; 60 GB isn't a lot of traffic for the service I'm running, so I'm probably under the committed-use threshold.
Yeah, the image is not the issue. Maybe I can try the VPC network tester some day, but as you said: even if that proves my point, who is there to listen? I think I will just forget about this Cloud Run setup for now, go back to Azure, and maybe come back in a year to see if anything has improved.
Thanks for all the helpful tips, really appreciate it!