The timeout is not on the microservice side, and it happens before the request is sent. Also, I see that the average proxy response time for this endpoint is 785ms, so that is the time spent processing the request and forwarding it to the backend server. Do you know what could cause this poor performance?
YES. I have some ideas.
The #1 reason I see for large latency between Apigee and some upstream system is… network latency. Apigee can run “anywhere”. If you are using Apigee SaaS (Edge or X), then Google runs Apigee on your behalf, in some cloud datacenter, which itself is sited in some geography. That datacenter might be in the US (west, east, central, or south), in the EU (Frankfurt, Netherlands, Belgium, Zurich, London), or in Asia (Seoul, Osaka, Hong Kong, Taiwan)… you get the idea. OR, if you’re using customer-managed Apigee (OPDK or Apigee hybrid), then the datacenter can be literally anywhere.
Connecting from a client system (let’s say Postman) directly to a microservice will be fast if the client and the service are near each other on the network. Suppose your workstation is in an office building in New York City, and you are using a cloud-based Apigee running in the northern Virginia region of GCP. Sending data from NYC to Virginia is pretty quick - less than 10ms roundtrip transit time. Now imagine inserting a proxy between the client and the service, and that proxy runs in Zurich. Now the data must go from NYC to Zurich, then to the microservice back in Virginia, and the response travels the same path in reverse. This can add hundreds of milliseconds of latency to the call. Add in things like VPNs and other network inefficiencies, and it is easy to reach latency in excess of 700ms on a call. To optimize this, you need to know where on the network your client (Postman or whatever), your microservice, and your Apigee proxy each sit. You want the path between those three parties to be as short as possible.
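If you want to quantify this, a quick client-side comparison helps. Here is a minimal Python sketch, with placeholder hostnames (not from your setup), that measures the median latency of calling the microservice directly versus calling it through the proxy; the gap is roughly the extra network path plus the proxy's own processing time.

```python
import time
import requests

DIRECT_URL = "https://microservice.internal.example.com/hello"   # placeholder
PROXY_URL = "https://my-org-prod.apigee.example.net/v1/hello"    # placeholder

def median_latency_ms(url, n=20):
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        requests.get(url, timeout=10)
        samples.append((time.perf_counter() - t0) * 1000)  # milliseconds
    samples.sort()
    return samples[len(samples) // 2]  # median is less noisy than the mean

print("direct to microservice:", round(median_latency_ms(DIRECT_URL), 1), "ms")
print("through the proxy:     ", round(median_latency_ms(PROXY_URL), 1), "ms")
```

Run it from the same place your real clients run, otherwise the comparison doesn't reflect the path your production traffic actually takes.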
Also keep in mind that payload size affects transit time. A 10MB request takes many more packets than a single GET /hello call, and that means more latency - it takes time to transmit all the data in the request. This compounds with the network distance I described above.
The next thing I recommend checking is the TLS handshake negotiation at each connection point. There will be TLS between the client and Apigee, and TLS between Apigee and the microservice. If you have no keep-alive on either end, every request pays for a fresh TCP connection and a full TLS handshake, which is unnecessary, avoidable latency. If you are using HTTP/1.0, switch to HTTP/1.1.
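You can see the cost of a missing keep-alive from the client side with a small sketch like the one below (the URL is a placeholder). Each bare requests.get() opens a new connection and does a full handshake; a Session reuses the connection. The same reasoning applies to the connection between Apigee and your microservice.

```python
import time
import requests

URL = "https://my-org-prod.apigee.example.net/v1/hello"  # placeholder

def avg_ms(call, n=10):
    t0 = time.perf_counter()
    for _ in range(n):
        call()
    return (time.perf_counter() - t0) * 1000 / n

# Each bare requests.get() opens a new TCP connection and negotiates TLS from scratch.
cold = avg_ms(lambda: requests.get(URL, timeout=10))

# A Session pools connections (HTTP/1.1 keep-alive), so after the first call
# the TCP + TLS setup is reused.
session = requests.Session()
warm = avg_ms(lambda: session.get(URL, timeout=10))

print(f"new connection per call: {cold:.0f} ms")
print(f"keep-alive / reused:     {warm:.0f} ms")
```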
Then I would look at the API proxy. Operations like verifying an API key, extracting a path, removing or injecting a header… those are all in-memory operations. VerifyAPIKey requires a lookup to the persistent store on the first call, but subsequent calls read from cache, so it is fast. All of those things should execute within a total of 1-2ms. That’s not where the latency will be. On the other hand, if you have lookups in the API proxy - a ServiceCallout to retrieve an authorization table, or a ServiceCallout to perform message logging to some remote system - those things can consume time, because again, they are calls across the network. All the same issues I mentioned above apply here too. I dealt with one customer recently that had 4 logging calls in their proxy, each one consuming 25-30ms. Collapsing those logging calls to just one, and stuffing it into the PostClientFlow, avoided over 100ms of latency in the API proxy itself.
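In Apigee, that fix was a single logging step attached to the PostClientFlow, so the logging round-trip runs after the response has already gone back to the client. Outside of Apigee policy XML, the same idea looks roughly like this Python sketch (the endpoint and record structure are illustrative, not from that customer's proxy): collect log records in memory while handling the request, then ship them in one batched call off the critical path.

```python
import threading
import requests

LOG_ENDPOINT = "https://logs.example.com/ingest"  # placeholder

def handle_request():
    records = []
    records.append({"stage": "auth", "detail": "api key verified"})  # in-memory, ~free
    records.append({"stage": "route", "detail": "target selected"})  # in-memory, ~free
    response = {"status": "ok"}   # ... the real proxying work happens here ...
    records.append({"stage": "done"})

    # One batched logging call, sent on a background thread after the
    # response is already in hand, so it never blocks the caller.
    threading.Thread(
        target=lambda: requests.post(LOG_ENDPOINT, json=records, timeout=5),
        daemon=True,
    ).start()
    return response
```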
Last thing - if you are seeing timeouts before the request is sent, then… check the TLS connection setup. The Apigee target endpoint has separate timeouts for keepalive, I/O, and connection. It sounds like it is failing on the connection timeout.
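The distinction matters because each timeout points at a different culprit. Here is the same distinction sketched on the client side with Python's requests library, which accepts a separate (connect, read) timeout pair; the URL and the numbers are illustrative only.

```python
import requests

URL = "https://microservice.internal.example.com/hello"  # placeholder

try:
    # First number: time allowed to establish the TCP/TLS connection.
    # Second number: time allowed to wait for the response once connected.
    r = requests.get(URL, timeout=(3, 30))
    print(r.status_code)
except requests.exceptions.ConnectTimeout:
    print("failed during connection setup -> routing / firewall / TLS problem")
except requests.exceptions.ReadTimeout:
    print("connected fine, but the upstream was slow to respond")
```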
When going through Apigee, every few requests are slow - sometimes it’s 3, 5, or 6 seconds, and sometimes it times out at 9. … We have other proxies with similar steps, but they perform much, much better.
Also note, this issue only occurs in the prod environment. It performs normally in development, and the revision is the same.
Taken together, your observations suggest a network issue. It’s probably not an Apigee proxy issue. Maybe there is a “virtual IP” between Apigee and the upstream prod system - a network device like an F5 or a Netscaler or something - and maybe that device is misconfigured and has an incorrect target.
It’s possible that your Apigee prod environment is much more stressed than your Apigee non-prod, and that extra stress is leading to contention in the message processors (MPs), which leads to a higher rate of timeouts. If you are using Apigee cloud, that should never happen. If you are managing the Apigee runtime yourself (OPDK or hybrid), then check the health of the MPs to see that they are not over-stressed.
But I suggest that you investigate “network” before “MP”.
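One concrete way to start: from a machine inside the prod network, bypass Apigee entirely and call the upstream (or the VIP in front of it) directly, repeatedly, and look at the tail of the latency distribution. If you still see occasional multi-second calls or timeouts without Apigee in the path, the problem is in the network / VIP layer, not in the message processors. A hedged sketch, with a placeholder URL:

```python
import time
import requests

UPSTREAM_URL = "https://prod-vip.internal.example.com/health"  # placeholder

samples = []
for _ in range(200):
    t0 = time.perf_counter()
    try:
        requests.get(UPSTREAM_URL, timeout=10)
        samples.append((time.perf_counter() - t0) * 1000)  # milliseconds
    except requests.exceptions.RequestException:
        samples.append(10_000.0)  # count a timeout/error as a 10-second call
    time.sleep(0.5)

samples.sort()
print(f"p50={samples[99]:.0f} ms  p95={samples[189]:.0f} ms  "
      f"p99={samples[197]:.0f} ms  worst={samples[-1]:.0f} ms")
```

If the direct-to-upstream tail is clean but the via-Apigee tail is not, then look at whatever sits between Apigee and the upstream (VIP, firewall, DNS) before suspecting the MPs.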