Yeah, an increase in latency when a system is under load is always an interesting issue.
However, we have observed that during peak hours, certain transactions are taking several seconds to complete. Please note there is no issue at the target application, and its latency is very low even during peak hours.
Interesting. One of the possibilities in distributed systems, as I’m sure you are aware, is that if the upstream target system is slow to respond, that slowness shows up in everything “in front of” that system, like Apigee.
“No issue at target application” is a good data point. How are you assessing that? Are you able to see the network hops between the Apigee MP and the target system? Is it possible that a load balancer between Apigee and the target is holding onto a connection before connecting to the upstream, because of some load-balancing behavior? Is it possible that the data you are looking at in the upstream target (the data that leads you to conclude “there is no problem”) does not account for I/O delays, or for delays in accepting new inbound connections?
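One simple way to gather that data point is to probe the target directly from the MP host, timing the TCP connect separately from the full round trip. Here is a minimal sketch in Python; the hostname, port, and path are placeholders, not anything from your environment. If the connect time itself balloons at peak, suspect the network or an intermediate load balancer rather than the application.

```python
# Minimal sketch: probe the target directly from the MP host, timing the TCP
# connect separately from the full HTTP round trip. Hostname, port, and path
# are placeholders -- substitute your own.
import http.client
import socket
import time

TARGET_HOST = "target.example.internal"   # hypothetical target hostname
TARGET_PORT = 8080
PATH = "/health"                          # hypothetical path on the target

def probe():
    # Time the TCP connect on its own. A slow connect points at the network
    # or an intermediate load balancer rather than the application.
    t0 = time.monotonic()
    socket.create_connection((TARGET_HOST, TARGET_PORT), timeout=10).close()
    t_connect = time.monotonic() - t0

    # Time a full request and response; this includes the application's work.
    conn = http.client.HTTPConnection(TARGET_HOST, TARGET_PORT, timeout=10)
    t1 = time.monotonic()
    conn.request("GET", PATH)
    conn.getresponse().read()
    t_response = time.monotonic() - t1
    conn.close()
    return t_connect, t_response

for _ in range(10):
    c, r = probe()
    print(f"connect={c * 1000:.1f} ms   response={r * 1000:.1f} ms")
    time.sleep(1)
```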
I’m not an expert in on-prem Apigee systems and diagnosing performance issues within them, but… I would look at CPU, memory usage, GC, and I/O waits in the Apigee system. If you see high CPU and high GC (JVM garbage collection) rates in the MP at the same time you are observing increased latency, then it is possible that Apigee is simply spending a lot of time buffering a large number of concurrent calls, and is suffering from resource contention across all those calls. In that case, increasing the number of nodes running Apigee, or the CPU or memory available to the existing nodes (i.e., scaling out or up), may allow you to handle the increased load without increased latency, or while keeping latency under your desired threshold.

On the other hand, if you see a high volume of I/O waits, that means Apigee is waiting for some other system (it could be the caller, or the target, or some intervening network device like a switch) to communicate: either to accept information or to return it. In that case, nothing you do at Apigee will help with performance or latency, because the source of the increased latency is outside of Apigee.
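As a starting point for watching those signals while you run load, here is a rough sketch that samples CPU, iowait, and memory on an MP node. It assumes Linux and the third-party psutil package; for JVM GC rates you can run jstat (e.g., `jstat -gcutil <mp-pid> 1000`) alongside it. The interval and formatting are just illustrative.

```python
# Rough sketch: sample CPU, iowait, and memory on an MP node while a load test
# runs. Assumes Linux and the third-party psutil package (pip install psutil).
# For JVM garbage collection rates, run `jstat -gcutil <mp-pid> 1000` alongside.
import time
import psutil

while True:
    cpu = psutil.cpu_times_percent(interval=1.0)   # percentages over a 1-second window
    mem = psutil.virtual_memory()
    # iowait is reported on Linux; high iowait with mostly-idle CPU suggests the
    # bottleneck is outside the MP (network, disk, or a slow peer).
    print(time.strftime("%H:%M:%S"),
          f"user={cpu.user:5.1f}%  system={cpu.system:5.1f}%  "
          f"iowait={cpu.iowait:5.1f}%  idle={cpu.idle:5.1f}%  "
          f"mem_used={mem.percent:5.1f}%")
```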
At peak load, are you confident that the network is not saturated? If you have a 10G switch, and ALL of the inter-communicating distributed systems are using that switch, you can easily saturate the network to the point that no additional traffic will flow. This can show up as “increased latency”: basically, Apigee’s invocation of the upstream system is queueing at the switch. The solution is to get a fatter network, or to re-design the network so that not all systems are using the same switch. (A variant of this is: move to Google Cloud; Google has a better network than you!)
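A quick back-of-envelope check can tell you whether saturation is even plausible. Every number below is a placeholder, not a measurement from your system; plug in your own observed peak throughput and average request-plus-response size.

```python
# Back-of-envelope saturation check. Every number here is a placeholder --
# substitute your own observed peak throughput and payload sizes.
requests_per_second = 8_000        # hypothetical peak transaction rate
avg_bytes_per_txn   = 120_000      # hypothetical request + response bytes
link_capacity_gbps  = 10.0         # the shared switch/link capacity

estimated_gbps = requests_per_second * avg_bytes_per_txn * 8 / 1e9
print(f"estimated load: {estimated_gbps:.2f} Gbps of {link_capacity_gbps} Gbps "
      f"({100 * estimated_gbps / link_capacity_gbps:.0f}% of the link)")
# Remember the same switch may also carry router<->MP, MP<->Cassandra, logging,
# and monitoring traffic, so the real headroom is smaller than this suggests.
```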
Your test using a “bare” or stripped-down API proxy to evaluate latency is a good data point. Seeing the same added latency could be consistent with a saturated network. More questions, though: is the stripped-down proxy running in the same MP, at the same time that load is going through the other API proxy that exhibits increased latency? In that case, the stripped-down proxy could be competing for the same machine resources (CPU, memory, I/O) as the other proxy, so seeing increased latency in all of the API proxies is exactly what we would expect if resource contention in the MP is the cause. So “increased latency with a simple proxy” is not conclusive one way or the other, without additional data.
Does Apigee start executing the TargetEndpoint flow only once a connection from the target connection pool is available? If so, what is the maximum connection pool size for a single TargetEndpoint?
We don’t describe that stuff as part of the Apigee interface. These are implementation details that you should not need to depend on, as you diagnose and optimize the performance of this system. The Apigee engineering team tests the on-prem system and has optimized the Target connection pooling based on years of experience with many many customers. It’s possible there is a problem in the pool management, but I would suggest that is unlikely, and is probably not the first thing you should explore. Exhaust other possibilities before thinking “I need to look at how Apigee manages the connection pool size”.
The other thing I would recommend is to test this at different throughput and concurrency levels. Ideally you have a test “backend” that can deliver a varying, non-constant latency: a latency distribution that mimics the one delivered by the production system. A sketch of such a backend follows.
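If you don’t already have such a backend, here is a minimal sketch of one: a small HTTP server that sleeps for a lognormally distributed interval before responding. The median and spread values are placeholders; tune them so the distribution roughly matches what you measure from the real target.

```python
# Minimal sketch of a test backend with a non-constant latency distribution:
# each request sleeps for a lognormally distributed interval before responding.
# MEDIAN_MS and SIGMA are placeholders; tune them to match the real target.
import math
import random
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

MEDIAN_MS = 40.0    # hypothetical median service time, in milliseconds
SIGMA     = 0.6     # hypothetical spread; larger means a longer tail

class SlowHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # lognormvariate(mu, sigma) has median exp(mu), so mu = ln(median).
        delay_ms = random.lognormvariate(math.log(MEDIAN_MS), SIGMA)
        time.sleep(delay_ms / 1000.0)
        body = b'{"status":"ok"}'
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass   # keep the console quiet while load is running

if __name__ == "__main__":
    ThreadingHTTPServer(("0.0.0.0", 9090), SlowHandler).serve_forever()
```

Point a dedicated TargetEndpoint at this server, then compare the latency you see through Apigee at different concurrency levels against a direct call to the same server.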
It’s probably not the case that latency remains at 10ms, constant and flat, and then, in some step function, jumps up to 200ms or 500ms, constant and flat. There’s a transition. Where is it? And when you cross that transition, what is happening at the target system? What is happening at Apigee? What is happening on the network?
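One way to find the transition is to sweep concurrency and record latency percentiles at each level. A rough sketch follows; the proxy URL is hypothetical, and the concurrency levels and request count are arbitrary starting values you should adjust to match your peak traffic.

```python
# Rough sketch of a concurrency sweep against the proxy, to find the point
# where latency starts to climb. PROXY_URL is hypothetical; the concurrency
# levels and request count are arbitrary starting values.
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

PROXY_URL = "http://apigee-router.example.internal/myproxy/ping"
REQUESTS_PER_LEVEL = 200

def timed_call(_):
    t0 = time.monotonic()
    with urllib.request.urlopen(PROXY_URL, timeout=30) as resp:
        resp.read()
    return (time.monotonic() - t0) * 1000.0   # latency in milliseconds

for concurrency in (1, 5, 10, 25, 50, 100):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_call, range(REQUESTS_PER_LEVEL)))
    p50 = statistics.median(latencies)
    p95 = latencies[int(0.95 * len(latencies)) - 1]
    print(f"concurrency={concurrency:4d}  p50={p50:7.1f} ms  p95={p95:7.1f} ms")
```

Run the same sweep both through the proxy and directly against the test backend; the gap between the two at each level is roughly what Apigee plus the network path is adding.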
There is no single, pat answer for how to solve performance and latency issues in distributed systems. You just need to keep testing, trying different scenarios while monitoring the key metrics, and then draw conclusions based on the data you can observe.
Good luck!