Long delay on initial outbound connection with Gen1 Cloud Run with Direct VPC Egress+Cloud NAT

Anyway, at least PerfKit will provide a method to compare across platforms. If I run into a similar issue/use case – and raise a ticket to resolution – I’ll comment back.

Thanks a lot! I actually found a cheap workaround for now: I run a Cloud Scheduler job that pings my service every 20 minutes. It’s not enough to keep the instance active, nor does it incur much cost, but it’s apparently enough for the networking layer to remember/cache the configuration and not take 15s to set up the port binding etc. Pretty weird, but it seems to work.
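For anyone wanting to replicate the workaround, a minimal keep-warm ping might look like the sketch below; this is what a Cloud Scheduler HTTP-target job would effectively do every ~20 minutes. The service URL and endpoint are hypothetical placeholders, not from the original post.

```python
import time
import urllib.request

# Hypothetical Cloud Run URL -- replace with your own service endpoint.
SERVICE_URL = "https://my-service-abc123-ew.a.run.app/health"

def keep_warm_ping(url: str, timeout: float = 30.0) -> tuple[int, float]:
    """Send one GET to the service and return (HTTP status, elapsed seconds).

    The elapsed time makes the NAT setup delay visible: a cold path can
    take tens of seconds, while a warm path responds almost immediately.
    """
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
        return resp.status, time.monotonic() - start
```

In practice the scheduler side would just be an HTTP-target job pointed at the same URL on the chosen interval.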


So then you could do that with a liveness probe as well, without needing the scheduler – just give it the route and period. Sorry, I overlooked it – I typically have health and liveness probes, which might be why I haven’t run into the delay. Thanks for sharing.

Configure container health checks (services) | Cloud Run Documentation | Google Cloud
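As a sketch of what the linked docs describe, a liveness probe can be declared in the Cloud Run service YAML; the endpoint path, port, and timings below are placeholders to adapt:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-service
spec:
  template:
    spec:
      containers:
      - image: IMAGE_URL
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          periodSeconds: 30
          failureThreshold: 3
```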

I already have a liveness probe (on the same endpoint) as well, but that only lives as long as the instance, so it’s not long enough to mitigate the issue. I think I’m running into this because the service has extremely bursty traffic, but ah well… maybe we will never know for sure :grinning_face_with_smiling_eyes:

@Terranca I was just wondering if you found a solution by chance.
I’m facing the exact same problem, where Cloud NAT adds a 40+ second delay when we deploy the instance the first time. I saw above that you mentioned using Cloud Scheduler to ping your service every X minutes.
Doing that works for the first instance but under some load testing I noticed that every time an additional cloud run instance is deployed, the users that are routed to that new instance also face the 40+ seconds delay (in my situation it’s 40+ seconds for a simple python app using an outbound connection).
It seems like Direct VPC egress with a static outbound connection isn’t working great.
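To make the delay concrete when load testing, here is a small sketch for timing a single outbound TCP connect; host and port are placeholders. On a freshly started instance, the first connect is where the NAT setup delay would show up, while subsequent connects should be fast.

```python
import socket
import time

def tcp_connect_time(host: str, port: int, timeout: float = 60.0) -> float:
    """Return how long one TCP connect to (host, port) takes, in seconds."""
    start = time.monotonic()
    # create_connection resolves the host and completes the TCP handshake;
    # the elapsed time therefore includes any NAT port-allocation delay.
    with socket.create_connection((host, port), timeout=timeout):
        return time.monotonic() - start

# Example (placeholders): compare the first connect on a fresh instance
# against a second one.
# first = tcp_connect_time("example.com", 443)
# second = tcp_connect_time("example.com", 443)
```

Logging these two numbers per instance during a load test would show whether each newly deployed instance pays the cold-path penalty independently.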

Hi, no, I never found a solution and nobody from Google ever responded, so it looks like we’re stuck with this delay until a high-paying Enterprise customer runs into it.

I would not expect the requester to wait for the second instance to spin up when you have liveness checks etc., as the spin-up could happen in the background, but maybe it’s designed that way? Edit: yeah, it seems it is indeed designed that way… that’s a shame.

Hope you have more luck finding a proper solution.


@Terranca Yeah, it seems like we’ll have to wait until Google fixes it indeed, which is very unfortunate.
Thank you so much for your answer!

I am having the same problem: using Direct VPC egress and then Cloud NAT to access the internet. I access my DB over the internet, and with this setup it’s very unpredictable; my startup probes need to allow >15s for this (normally they are fine within 1s, without the VPC).

I played with the Cloud NAT port allocation per VM and increased it a bit; that seemed to make things better, but nowhere near optimal. Same problem as the OP: just delays of tens of seconds.

Yeah, it’s a shame – it’s such a wonderful setup, but ruined by the Cloud NAT. I’ve removed my “accepted answer”, as I just noticed it would make it look like this issue is actually solved, which it obviously isn’t. Hope you find a workaround – let us know :slightly_smiling_face: