502 Bad Gateway | tcpdump Apigee X

error on our production at a particular time regularly (say around midnight every day)

If there is a cadence to the error (in other words, it appears and then disappears around the same time each day), then the source of the error is likely not static, unchanging configuration such as the TargetServer TLS settings.
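Before chasing a nightly job, it can help to confirm the cadence from your own data. The sketch below is a minimal illustration, assuming you can export (timestamp, status) pairs from Apigee analytics or logs as naive ISO-8601 strings in the timezone where "midnight" matters. The function names and sample records are hypothetical. It reports what share of 502s land within a window around a given time of day, wrapping across midnight so that 23:58 and 00:03 are both counted as close to 00:00.

```python
from datetime import datetime

def minutes_from(ts, hour=0, minute=0):
    """Distance in minutes between the time of day in `ts` and hour:minute."""
    t = datetime.fromisoformat(ts)
    delta = (t.hour * 60 + t.minute) - (hour * 60 + minute)
    # Wrap around the day so 23:55 counts as 5 minutes from 00:00, not 1435.
    return min(delta % 1440, -delta % 1440)

def fraction_near(records, status=502, window_min=30, hour=0, minute=0):
    """Share of `status` responses that fall within `window_min` of hour:minute."""
    errors = [ts for ts, code in records if code == status]
    if not errors:
        return 0.0
    near = sum(1 for ts in errors if minutes_from(ts, hour, minute) <= window_min)
    return near / len(errors)

# Hypothetical sample data: (local timestamp, HTTP status).
records = [
    ("2024-05-01T23:58:12", 502),
    ("2024-05-02T00:03:40", 502),
    ("2024-05-02T14:20:00", 200),
    ("2024-05-02T23:59:01", 502),
    ("2024-05-03T09:15:00", 502),
]
print(fraction_near(records))  # 0.75
```

A high fraction over several days supports the "scheduled event" theory; a flat spread across the day points back toward something else.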

You’ve observed that Apigee itself is not reaching its 55s timeout; the failing requests complete in under 1 second. So the 502 is not a system abruptly closing a connection; it’s a system actively rejecting the request. This suggests that some network device or system between your Apigee and the target (possibly including the target itself) goes “offline” at a particular time of day.

It could be a network switch or router. For example, if a scheduled job (“cron job”) applies updated configuration to a WAF or router every night at midnight, it might cause a service disruption that results in 502 errors. Maybe this happens only in prod because prod networks are “more important”. It could be some other scheduled task, such as a cron job that resets or reboots a software-based router. Or maybe it’s not a network device at all, and the target itself is responding with 502. (That would be unusual.)
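The "under 1 second versus 55s timeout" reasoning can be applied to a whole batch of failures rather than one request. This is a rough sketch, assuming you have end-to-end latencies in milliseconds for each 502; the thresholds (95% of the timeout, 1 second) and the bucket names are arbitrary choices for illustration, not Apigee categories.

```python
from collections import Counter

TIMEOUT_MS = 55_000  # the 55s timeout discussed above

def classify_502(latency_ms, timeout_ms=TIMEOUT_MS, fast_ms=1_000):
    """Rough bucket for a 502 based on how long the request took end to end."""
    if latency_ms >= 0.95 * timeout_ms:
        return "near-timeout"    # Apigee waited (almost) the full timeout
    if latency_ms < fast_ms:
        return "fast-rejection"  # something answered or refused almost at once
    return "other"

def summarize(latencies_ms):
    """Count how many 502s fall into each latency bucket."""
    return Counter(classify_502(ms) for ms in latencies_ms)

print(summarize([320, 410, 5_000, 54_800]))
```

If nearly all midnight 502s land in "fast-rejection", that is consistent with an upstream system refusing requests rather than Apigee waiting on a hung connection.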

You can use the HTTP header “x-cloud-trace-context” to correlate the request sent from Apigee with the request received upstream. Suppose a request handled at Apigee gets a 502 around midnight. Check its x-cloud-trace-context header on Apigee, then see whether you can find an inbound request at the target with the same x-cloud-trace-context value. If you find one, the target received the request and itself responded with 502. That is probably unlikely.

More likely, there is an intervening network system. Check the logs of those systems for the x-cloud-trace-context value you see on Apigee.
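Matching header values across several log sources by hand gets tedious. Below is a minimal sketch, assuming you can export the x-cloud-trace-context values of the failed requests from Apigee, plus (hop name, header value) pairs from whatever WAF, load balancer, or target logs record that header. The log shapes and function names are hypothetical. The header's documented form is "TRACE_ID/SPAN_ID;o=OPTIONS"; intermediate hops may rewrite the span ID, so the sketch matches on the trace ID only.

```python
def trace_id(header_value):
    """Return the TRACE_ID part of 'TRACE_ID/SPAN_ID;o=OPTIONS', or None."""
    if not header_value:
        return None
    tid = header_value.strip().split("/", 1)[0].split(";", 1)[0].lower()
    return tid or None

def correlate(apigee_values, upstream_records):
    """For each value seen at Apigee, list the upstream hops that logged the same trace ID.

    apigee_values: x-cloud-trace-context values of failed requests at Apigee.
    upstream_records: (hop_name, header_value) pairs from WAF, LB or target logs.
    Both shapes are hypothetical; adapt them to whatever your logs export.
    """
    seen = {}
    for hop, value in upstream_records:
        tid = trace_id(value)
        if tid:
            seen.setdefault(tid, []).append(hop)
    result = {}
    for value in apigee_values:
        tid = trace_id(value)
        if tid:
            # An empty list means no logged hop ever saw this request.
            result[tid] = seen.get(tid, [])
    return result
```

Reading the result: a trace ID that reaches the WAF but never the target narrows the fault to the segment between them, while a trace ID that appears at the target means the target itself produced the 502.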

If the problem really does happen around midnight every day, you could set up a scheduled task that starts a debug session programmatically a few minutes before midnight. You can even specify a condition such as “trace only requests that result in a 502 status code”. The next day you can download that debug session and examine a batch of failed requests.
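One way to start such a session is a POST to the Apigee API debugsessions collection for a deployed proxy revision. The sketch below assumes that endpoint shape, a bearer token (for example from "gcloud auth print-access-token"), and the body fields "filter", "count" and "timeout". The filter condition, the count of 15 and the 600-second timeout are assumptions to check against the current API reference, including whether the filter can match on the response status as described above.

```python
import json
import urllib.request

APIGEE_API = "https://apigee.googleapis.com/v1"

def debugsession_url(org, env, api, revision):
    """Debug-session collection for one deployed proxy revision."""
    return (f"{APIGEE_API}/organizations/{org}/environments/{env}"
            f"/apis/{api}/revisions/{revision}/debugsessions")

def debugsession_body(status=502, count=15, timeout_s=600):
    """Request body for a debug session limited to one status code.

    The filter uses flow-condition syntax. Whether it can match on the
    response status, and the count/timeout limits, are assumptions to verify.
    """
    return {
        "filter": f"response.status.code = {status}",
        "count": count,             # how many matching requests to capture
        "timeout": str(timeout_s),  # session length in seconds (int64 sent as a string)
    }

def start_debug_session(token, org, env, api, revision, **body_kwargs):
    """POST the debug-session request and return the parsed JSON response."""
    req = urllib.request.Request(
        debugsession_url(org, env, api, revision),
        data=json.dumps(debugsession_body(**body_kwargs)).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

A cron entry such as "55 23 * * *" on a host whose clock is in the same timezone as the failures could call start_debug_session; with a 600-second timeout that covers roughly 23:55 to 00:05, so a longer failure window would need more than one session.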
