It seems random: we see spikes in Apigee X proxy processing time from 100-150ms (most of the time) up to 4 minutes, at different hours of the day, for live Apigee X proxies. The overall 95th and 99th percentiles look good, and we don't see any errors in Cloud Logging during the listed times, nor anywhere in the flow from client through Apigee to the backend.
We see these spikes in Apigee Analytics "Proxy Performance", and the transaction volume / TPS is within the limits of the performance tests we have run.
Yes. The first thing to do is understand where the latency increase comes from.
A jump in latency can be due to lots of things. Some examples:
Temporary service disruption. The upstream for the Apigee proxy is a remote service that is available in multiple datacenters or Cloud regions, say us-east4 and us-west1. Normally Apigee distributes load across the regions equally, or based on geographic affinity, and the average latency observed is 100ms: some requests take 50ms and some take 150ms, but on average 100ms. Periodically during the day, one of the regions housing the upstream system goes offline for maintenance, and Apigee must direct all of its load to the remaining region. Apigee then transmits data across a more distant network path, incurring higher latency. The 50ms requests disappear, and the average latency increases to 150ms. When the remote system comes back online, the latency returns to its previous average of 100ms.
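The arithmetic behind that shift is easy to sketch. This is a toy simulation, not anything Apigee-specific; the region names and the 50ms/150ms figures are just the ones from the scenario above:

```python
# Toy illustration of the multi-region failover effect described above.
# The 50ms / 150ms figures come from the scenario in the text; the
# region names and request count are illustrative placeholders.

from statistics import mean

def observed_latencies(us_east4_up: bool, n_requests: int = 1000) -> list[float]:
    """Return per-request latencies (ms) under a simple 50/50 split.

    us-east4 is the 'near' region (~50ms); us-west1 is the 'far'
    region (~150ms). When us-east4 is offline, every request pays
    the far-region cost.
    """
    near, far = 50.0, 150.0
    if us_east4_up:
        # Apigee splits load evenly across both healthy regions.
        half = n_requests // 2
        return [near] * half + [far] * (n_requests - half)
    # Maintenance window: all traffic goes to the distant region.
    return [far] * n_requests

print(mean(observed_latencies(us_east4_up=True)))   # 100.0 (both regions up)
print(mean(observed_latencies(us_east4_up=False)))  # 150.0 (one region offline)
```

The point is that the average can shift noticeably even though no individual request is "failing", which is consistent with seeing no errors in the logs.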
Noisy neighbors. The upstream for the Apigee proxy is a remote service available in a single datacenter or region. Periodically during the day a batch job runs, independent of the workload that runs through Apigee. The batch job loads the network or the clusters hosting the upstream service. While that batch job executes, average latency for all requests going through Apigee increases. When the batch job completes, latency as observed within Apigee returns to its previous average of 100ms.
Cyclic traffic patterns. At several times during the day (the morning rush, the mid-day surge, etc.) traffic volume routinely spikes to 5-8x its typical level. (For example, at 6-7am users sign in, and thousands of them need to obtain new tokens and fill caches.) While that surge happens, systems, caches, and datastores are congested and contending with higher load, resulting in higher latency. After some time, autoscaling adapts to the load and latency drops somewhat. When the surge abates, the average latency returns to its previous 100ms.
Among the various customers I work with, I have seen all of these, and more, as causes of higher latency. One customer had a DNS resolution problem that led to implicit fallbacks in the upstream, and THAT was the cause of the first case (temporary service disruption). We used a DNS monitoring service to diagnose it.
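A minimal sketch of the kind of DNS check that can surface that class of problem. This is generic stdlib code, not the monitoring service mentioned above, and the hostname you'd pass in would be your actual backend host:

```python
# Sketch: time DNS resolution and record the resolved address set.
# Run periodically; a slowdown or a change in the address set can
# reveal a silent DNS fallback like the one described above.

import socket
import time

def time_dns_resolution(hostname: str, port: int = 443) -> tuple[float, list[str]]:
    """Resolve a hostname and return (elapsed_ms, sorted resolved addresses)."""
    start = time.monotonic()
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    elapsed_ms = (time.monotonic() - start) * 1000.0
    addresses = sorted({info[4][0] for info in infos})
    return elapsed_ms, addresses

elapsed, addrs = time_dns_resolution("localhost")
print(f"resolved in {elapsed:.1f} ms -> {addrs}")
```

Alerting when the elapsed time crosses a threshold, or when the address set differs from the last observation, is usually enough to catch a fallback before it shows up as upstream latency.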
It’s also possible, though in my experience less likely, for a cyclic latency change to result from issues internal to the Apigee service. We have thousands of organizations and projects under management, and we’re pretty good at maintaining a stable service. The operations team at Google "managed out" many of the problems long ago. There are still problems, of course, but generally they are novel, not cyclic, and we identify and fix them quickly.
One possible cause is an over-stressed persistence layer. If you have hundreds of thousands of tokens being issued per second, there may be periodic database compaction that causes latency increases. This usually falls under the cases our operations team has already "managed out", but it could still happen.
Collecting more information beyond "we observe latency increases every once in a while" is essential to diagnosing the problem, and potentially to solving it. Is it isolated to a particular proxy? A particular region? A particular target? The Apigee Analytics charts can help immensely with this: they show you latency within Apigee as well as latency of the remote target.
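You can also pull the same breakdown programmatically. The sketch below builds a query against the Apigee X management API stats endpoint, grouping average latencies by a dimension such as apiproxy or target_host; check the Apigee Analytics metrics/dimensions reference for the exact names before relying on them. ORG, ENV, and the date are placeholders, and no authentication is shown:

```python
# Sketch: slice average latency by dimension via the Apigee X stats API,
# then rank the slices to find where the spikes concentrate.
# Endpoint shape and parameter names are from the Apigee Analytics API;
# verify against the official reference. No auth handling shown here.

from urllib.parse import urlencode

APIGEE_BASE = "https://apigee.googleapis.com/v1"

def stats_url(org: str, env: str, dimension: str, time_range: str) -> str:
    """Build a stats query grouping average latencies by one dimension,
    e.g. 'apiproxy' or 'target_host'."""
    params = urlencode({
        "select": "avg(total_response_time),avg(target_response_time)",
        "timeRange": time_range,   # e.g. "06/18/2024 00:00~06/18/2024 23:59"
        "timeUnit": "hour",
    })
    return f"{APIGEE_BASE}/organizations/{org}/environments/{env}/stats/{dimension}?{params}"

def worst_offenders(rows: list[tuple[str, float]], top: int = 3) -> list[tuple[str, float]]:
    """Given (dimension_value, avg_latency_ms) pairs extracted from the
    stats response, return the slowest ones first."""
    return sorted(rows, key=lambda r: r[1], reverse=True)[:top]

print(stats_url("my-org", "prod", "apiproxy", "06/18/2024 00:00~06/18/2024 23:59"))
print(worst_offenders([("orders-v1", 95.2), ("auth-v2", 2400.0), ("catalog", 110.5)]))
```

Comparing avg(total_response_time) against avg(target_response_time) per slice tells you whether the extra time is being spent inside Apigee or in the call to the backend, which is exactly the proxy-vs-target distinction the analytics charts draw.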