Race Condition in Apigee X During Concurrent Token Fetch

Hello Apigee Experts,

We are facing a concurrency issue in Apigee X while implementing token-based authentication with a backend system.

  • We use a Service Callout to authenticate with the backend.
  • The response contains:
    • An access token (valid for 90 minutes)
    • A refresh token/key (valid for 24 hours)
  • We store:
    • The access token in one PopulateCache policy (TTL: 89 mins)
    • The refresh key in another PopulateCache policy (TTL: 23 hrs 55 mins)

This setup works perfectly when handling a single request at a time.
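
For reference, the two cache-population policies look roughly like this (policy names, cache resources, and the flow variables holding the extracted token values are illustrative):

<!-- Access token, cached for 89 minutes (5340 seconds) -->
<PopulateCache name="PopulateAccessToken">
  <CacheResource>access-token-cache</CacheResource>
  <Scope>Exclusive</Scope>
  <CacheKey>
    <KeyFragment ref="request.uri"/>
  </CacheKey>
  <ExpirySettings>
    <TimeoutInSec>5340</TimeoutInSec>
  </ExpirySettings>
  <!-- value extracted from the Service Callout response -->
  <Source>backend.access_token</Source>
</PopulateCache>

<!-- Refresh token, cached for 23 hours 55 minutes (86100 seconds) -->
<PopulateCache name="PopulateRefreshToken">
  <CacheResource>refresh-token-cache</CacheResource>
  <Scope>Exclusive</Scope>
  <CacheKey>
    <KeyFragment ref="request.uri"/>
  </CacheKey>
  <ExpirySettings>
    <TimeoutInSec>86100</TimeoutInSec>
  </ExpirySettings>
  <Source>backend.refresh_token</Source>
</PopulateCache>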

Issue:

When we receive 5 concurrent requests and both caches are initially empty:

  • All 5 requests independently call the backend.
  • The backend returns 5 different access tokens and refresh keys.
  • Each request stores its own token in cache — leading to token-key mismatches and potential downstream authentication issues.

We believe this is a race condition where Apigee doesn’t serialize or lock concurrent flows accessing and writing to the cache at the same time.

  • How can I prevent multiple concurrent requests from refreshing the token at the same time?
  • Is there a recommended pattern in Apigee to coordinate or lock refresh logic across requests?

@dchiesa1 @kurtkanaskie @dknezic @shrenikkumar-s @JayashreeR


Apigee doesn’t guarantee that if you store things in cache in multiple steps from multiple API requests, the cache is updated atomically or transactionally. Apigee doesn’t guarantee that if you call PopulateCache twice within a proxy, those two entries will remain in cache together and that concurrent API proxy requests will not cache other things in between.

This is the problem, isn’t it?

PopulateCache populates the cache. If you call it concurrently, it will populate the cache multiple times. Keeping the refresh and access token synchronized… that’s an application-layer problem. If you are caching the access token and refresh token separately, and they need to be paired, then you have introduced a race condition and the answer is: don’t do that.

If you want to store the pair of {access, refresh} tokens, then store them together, at the same time, in the same cache entry. Or, if they need to be paired but have different lifetimes, then use a unique value in the respective cache keys for the access token and for the refresh token.
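
For example, a minimal sketch, assuming the callout response has already been parsed into flow variables (all names here are illustrative): build one value holding both tokens, then cache that single value under one key.

<!-- Build one value that holds both tokens (pipe-delimited, to keep the sketch simple) -->
<AssignMessage name="BuildTokenPair">
  <AssignVariable>
    <Name>token.pair</Name>
    <Template>{backend.access_token}|{backend.refresh_token}</Template>
  </AssignVariable>
</AssignMessage>

<!-- Cache the pair as ONE entry, so a reader always sees a matched set -->
<PopulateCache name="PopulateTokenPair">
  <CacheResource>token-cache</CacheResource>
  <Scope>Exclusive</Scope>
  <CacheKey>
    <KeyFragment>upstream-tokenpair</KeyFragment>
  </CacheKey>
  <ExpirySettings>
    <TimeoutInSec>5340</TimeoutInSec>
  </ExpirySettings>
  <Source>token.pair</Source>
</PopulateCache>

On the read side, a LookupCache on the same key returns the combined value, which you can split back into the two tokens with a small JavaScript step.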

Good luck.

Thanks @dchiesa1 for your insights.

We are using the same cache key (key fragment) to store both the access token and the refresh token. However, when multiple requests arrive concurrently and the cache is empty, they all call the backend simultaneously. As a result, one response’s access token and another’s refresh token may get cached under the same key, leading to mismatches. This inconsistency in the cache causes authentication failures downstream.

Current Cache Key Setup:
<KeyFragment ref="request.uri"/>

This configuration generally works well — the key remains consistent across a session, and the URI provides a stable basis to fetch or populate the cache.

We thought about making the cache key unique per transaction by adding request-specific elements like a request ID or transaction ID, such as:

<KeyFragment ref="request.uri"/>
<KeyFragment ref="request.headers.reqID"/>

While this ensures that each request has a unique cache key and avoids collisions, it introduces a new problem:
We can’t retrieve the cached data on subsequent requests unless we have the exact same reqID or transactionID, which may not be available or consistent across calls. So although this avoids token overwrites, it defeats the purpose of caching for reuse.


Is there a recommended way in Apigee to handle this kind of concurrency scenario?

Specifically, we’re looking for a way to:

  • Avoid cache key collisions on concurrent requests

  • Ensure consistency between cached access and refresh tokens

  • Still benefit from caching across requests (not just per transaction)

yes, I get it. Thanks for the deeper explanation.

If I were doing this I might look into one of two options.

Option 1

A 2-level cache, in which the first level uses a fixed key, "tokenkey".

And the content of that entry is … the request ID or messageid of one of the 5 (or 7, or whatever) requests that are in a race to get an access token/refresh token pair. And that messageid is used as the key for the access/refresh token pair.

So the flow is

  • LookupCache with key = "tokenkey". Retrieve a messageid.
  • If a messageid is found (cache hit):
    • LookupCache for "token" + messageid
    • Use the access/refresh token pair
  • If not (cache miss):
    • Get a new access/refresh token pair from the backend
    • PopulateCache under "token" + the current messageid, with the access/refresh pair as the content
    • PopulateCache under "tokenkey", with the current messageid as the content

The "race condition" still happens, but with the 2-stage cache approach the last transaction to write "tokenkey" wins, and the messageid it holds always points to a matched access/refresh pair.
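
A rough sketch of the policies involved (policy names and the cache resource are placeholders; the flow conditions wiring them together and the Service Callout itself are not shown):

<!-- Level 1: a fixed key whose value is the messageid of the transaction that fetched the tokens -->
<LookupCache name="LookupTokenKey">
  <CacheResource>token-cache</CacheResource>
  <Scope>Exclusive</Scope>
  <CacheKey>
    <KeyFragment>tokenkey</KeyFragment>
  </CacheKey>
  <AssignTo>cached.messageid</AssignTo>
</LookupCache>

<!-- Level 2: the access/refresh pair, keyed by the messageid found above -->
<LookupCache name="LookupTokenPair">
  <CacheResource>token-cache</CacheResource>
  <Scope>Exclusive</Scope>
  <CacheKey>
    <KeyFragment>token</KeyFragment>
    <KeyFragment ref="cached.messageid"/>
  </CacheKey>
  <AssignTo>cached.token_pair</AssignTo>
</LookupCache>

<!-- On a miss, after the Service Callout: write the pair first, keyed by this transaction's own messageid... -->
<PopulateCache name="PopulateTokenPair">
  <CacheResource>token-cache</CacheResource>
  <Scope>Exclusive</Scope>
  <CacheKey>
    <KeyFragment>token</KeyFragment>
    <KeyFragment ref="messageid"/>
  </CacheKey>
  <ExpirySettings>
    <TimeoutInSec>5340</TimeoutInSec>
  </ExpirySettings>
  <!-- token.pair is an assumed flow variable holding the combined access/refresh values -->
  <Source>token.pair</Source>
</PopulateCache>

<!-- ...then point the fixed key at it -->
<PopulateCache name="PopulateTokenKey">
  <CacheResource>token-cache</CacheResource>
  <Scope>Exclusive</Scope>
  <CacheKey>
    <KeyFragment>tokenkey</KeyFragment>
  </CacheKey>
  <ExpirySettings>
    <TimeoutInSec>5340</TimeoutInSec>
  </ExpirySettings>
  <Source>messageid</Source>
</PopulateCache>

Writing the pair entry before the pointer entry matters: whichever transaction writes "tokenkey" last wins, and the messageid it holds always points at a pair that is already in the cache.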

Option 2

Pipeline all requests for tokens through a separate, single-threaded external service, effectively managing the locking externally.

Thank you @dchiesa1 for sharing the solution options.

We analysed the first option; it appears promising, but it assumes that the access and refresh tokens have the same expiry duration, which is not the case in our environment.

  • In our setup, the access token expires after 90 minutes, while the refresh token remains valid for 24 hours.
  • Due to this difference, the access and refresh tokens are stored in separate caches.

With this approach, a challenge arises when generating a new access token after the initial 90 minutes:

  • A new access token is generated using the existing refresh token and access token combination.
  • The caches are then updated.
    • The tokenKey cache is refreshed with the new requestID/messageID.
    • The access token cache is updated using the new requestID/messageID as the key.

This process functions correctly for the next 90 minutes. However, after 180 minutes, when another token refresh is needed:

  • The current requestID/messageID is read from the tokenKey cache and used to retrieve the access token.
  • The refresh token cache, however, is still keyed by the original requestID/messageID, causing a mismatch.
  • Since the access and refresh tokens now reference different sessions, the token refresh operation fails.

Unless the expiry durations or storage mechanisms can be aligned, this approach may not be feasible. Please advise if there are alternative perspectives.


Proposed Workaround:

As an alternative, the following approach is being considered:

  • Concurrent requests are allowed to generate and cache their own access and refresh token pairs, accepting the possibility of race conditions.
  • For the first 90 minutes, only the access token is used, allowing normal operation.
  • Once the access token expires, both tokens are fetched from the cache to request a new access token.
  • If a race condition causes token mismatches, the refresh request will fail.
  • Upon failure:
    • Relevant caches are invalidated.
    • A failure response is returned to the consumer.
  • When the consumer retries (starting a new session), the caches are empty, prompting a fresh token request.

This approach accepts that one transaction may fail in the event of a race condition but ensures that subsequent transactions recover cleanly.

We would greatly appreciate your thoughts on this workaround, as well as any suggestions or insights you might have.

Yes I am aware they have different lifetimes. But the access token and refresh token are coupled.

When the access token expires, you need the refresh token. Without understanding more of your requirements, I thought it best to couple them.

If this doesn’t work, then I suggest you go implement an external microservice and you can have complete control of the logic.


Hi @dchiesa1

Trust you’re doing well.

Yes, the access and refresh tokens are indeed coupled, or corresponding to each other. However, they’re stored in separate caches with distinct lifetimes; the refresh token has a significantly longer expiry, say 24 hours.

The challenge arises when the access token expires. We then use the refresh token to obtain a new access token, which we cache with its associated msgId. The issue is, we don’t re-cache or update the refresh token at this point, since it’s designed to be purged 24 hours from its initial push to the cache. Consequently, when we only update the msgId to access token mapping, we lose the original msgId association with which the refresh token was initially cached. This means we can’t retrieve the correct refresh token for subsequent “refreshing” operations tied to that msgId.

Your validation is always impactful, and I truly appreciate your insights.

yep, I understand. Storing them separately IS the problem. When you do that you introduce the race condition. Storing them together avoids that.

I think there are probably specific scenarios that you want to allow, specific sequences of the use of a token.

  1. Lookup the token based on … a fixed key, "upstream-accesstoken". Maybe you have multiple well-known fixed keys associated with different upstream systems.
  2. In case of a cache hit: the token is valid/not expired. OK, proceed; use the token. Nothing further to do.
  3. In case of a cache miss: the token is expired. Lookup the refresh token with a fixed key, "upstream-refreshtoken".
  4. In case of a cache hit for the refresh token: use the refresh token to get a new access token, and then cache the access token using the fixed key (upstream-accesstoken). If there is a new refresh token, cache the NEW refresh token using the appropriate fixed key. (At some point you will need a new refresh token, and often when getting a new access token you ALSO get a new refresh token. That’s ideal.) Usually the refresh_token flow requires the old (expired) access token along with the refresh token; that means you must cache the access token and the refresh token together.
  5. In case of a refresh token cache miss: you need to go through the original grant flow. In all cases, when you get a token, cache the refresh+access token PAIR together in the same cache entry. (A sketch of this sequence follows below.)
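
A minimal sketch of that sequence as request-flow steps (policy names are placeholders; the LookupCache, PopulateCache and Service Callout policies behind them are not shown):

<Step>
  <!-- 1. look for the access token under the fixed key "upstream-accesstoken" -->
  <Name>LookupAccessToken</Name>
</Step>
<Step>
  <!-- 3. cache miss: look for the refresh+access pair under "upstream-refreshtoken" -->
  <Name>LookupRefreshPair</Name>
  <Condition>lookupcache.LookupAccessToken.cachehit = false</Condition>
</Step>
<Step>
  <!-- 4. pair found: use the refresh_token grant to get a new access token -->
  <Name>CalloutRefreshGrant</Name>
  <Condition>(lookupcache.LookupAccessToken.cachehit = false) and (lookupcache.LookupRefreshPair.cachehit = true)</Condition>
</Step>
<Step>
  <!-- 5. no pair: go through the original grant flow -->
  <Name>CalloutOriginalGrant</Name>
  <Condition>(lookupcache.LookupAccessToken.cachehit = false) and (lookupcache.LookupRefreshPair.cachehit = false)</Condition>
</Step>
<Step>
  <!-- cache the new access token under the fixed key -->
  <Name>CacheAccessToken</Name>
  <Condition>lookupcache.LookupAccessToken.cachehit = false</Condition>
</Step>
<Step>
  <!-- cache the refresh+access PAIR together, in one entry, under the fixed refresh key -->
  <Name>CacheTokenPair</Name>
  <Condition>lookupcache.LookupAccessToken.cachehit = false</Condition>
</Step>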

But there’s something odd - a system-to-system call is usually client_credentials or similar, in which case there is no refresh token flow. When the access token expires, you go through another client_credentials flow. I don’t understand why there is a refresh token anyway. It’s Apigee to some upstream system, so why the refresh token? Refresh tokens are used for human flows - like auth code, or password flow - to eliminate the involvement of the human keying in a password again. A refresh token is out of place for a system-to-system scenario, which THIS IS. So what’s going on? Why are you trying to solve this anyway?

Let’s think about the race. With no concurrency (a single transaction), you will have this ordering:

  • Tx1 → writecache “accesstoken” with accesstoken1
  • Tx1 → writecache “refreshtoken” with refreshtoken1,accesstoken1

This is the simple case. There is no race. End state of cache: key(accesstoken) => value(accesstoken1), key(refreshtoken) => value(refreshtoken1,accesstoken1).

What if there are two transactions and the updates to the cache get interleaved?

  • Tx1 → writecache “accesstoken” with accesstoken1
  • Tx2 → writecache “accesstoken” with accesstoken2
  • Tx2 → writecache “refreshtoken” with refreshtoken2,accesstoken2
  • Tx1 → writecache “refreshtoken” with refreshtoken1,accesstoken1

End state of cache: key(accesstoken) => value(accesstoken2), key(refreshtoken) => value(refreshtoken1,accesstoken1).

At some later point, the accesstoken will expire and its cache entry will be evicted. The logic will be: cache miss, so retrieve "refreshtoken". Get (refreshtoken1,accesstoken1). That is a matched pair. You can use that for the refresh flow.

If there are 5 transactions getting tokens (a race), then it should not matter. Last write to the cache wins. The last tx to complete will write its refresh+access token to the cache with the key "upstream-refreshtoken". And that will always be a consistent pair.

As I said, caching the items separately is what gets you into trouble. If you cache them together you avoid the problem.

Maybe there is some other constraint I am not considering - like you can only have one active access token at a time, or only one active refresh token at a time. Maybe the upstream invalidates all existing access tokens when it issues a new access token. IF that’s the case then you cannot solve that with Apigee alone. You need an external service that can serialize all access and make sure there is only ONE TRUE access token at any one time.

good luck


Hi @dchiesa1 ,

Thank you for your valuable feedback.

Yes, we agree that this is not a usual or standard scenario or implementation on the provider side, but unfortunately we do not have control over it.

It is a system-to-system scenario, but the provider’s token endpoint also returns a refresh_token.
[screenshot attachment: JayashreeR_0-1751354638951.png]

Based on the approach you described, we have come up with the following solution, which we expect to behave correctly in virtually all cases (theoretically 99.99% of the time).

We plan to implement three caches as outlined below:

  1. Cache-1: Stores the access token, valid for 90 minutes
  2. Cache-2: Stores the refresh token, valid for 24 hours
  3. Cache-3: Stores a combination of access token + refresh token with infinite validity

How the solution works:

  • In the event of a race condition, we allow the last transaction to update Cache-3 with the latest combination of access and refresh tokens.
  • We will always refer to Cache-2 to check the availability of the refresh token, and if it exists, we will then use the refresh token from Cache-3 to refresh the access token.

When the access token expires (after 90 minutes), we’ll refer to Cache-2 to check the availability of the refresh token, and if it exists, then we’ll retrieve the token pair from Cache-3, generate a new access token, and update only Cache-1 & Cache-3, keeping Cache-2 unchanged (so that the refresh_token expires at its original expiry).

When the refresh token in Cache-2 expires (after 24 hours) and a session login request is received, we’ll generate a new set of access and refresh tokens using credentials. At this point, we’ll update all three caches: Cache-1 (the new access_token, 90-minute expiry), Cache-2 (the new refresh_token, 24-hour expiry) and Cache-3 (the latest combination, kept indefinitely).

Key Considerations:

  • In a race condition scenario, we allow the last successful transaction to overwrite Cache-3, which acts as the source of truth for token refresh.
  • Cache-2 is used solely to validate the refresh token’s availability.

This setup appears to handle race conditions effectively and ensures we always have a valid mechanism to obtain a new access token. We’re currently working on implementing this logic in our code.
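
For reference, a rough sketch of the three cache writes as we picture them (policy names, cache resources and source variables are placeholders; since there is no truly infinite TTL, Cache-3 uses a very long timeout as a stand-in, subject to platform limits):

<!-- Cache-1: access token only, 90 minutes -->
<PopulateCache name="CacheAccessToken">
  <CacheResource>access-token-cache</CacheResource>
  <CacheKey>
    <KeyFragment>upstream-accesstoken</KeyFragment>
  </CacheKey>
  <ExpirySettings>
    <TimeoutInSec>5400</TimeoutInSec>
  </ExpirySettings>
  <Source>backend.access_token</Source>
</PopulateCache>

<!-- Cache-2: refresh token only, 24 hours; used only to test whether a refresh is still possible -->
<PopulateCache name="CacheRefreshToken">
  <CacheResource>refresh-token-cache</CacheResource>
  <CacheKey>
    <KeyFragment>upstream-refreshtoken</KeyFragment>
  </CacheKey>
  <ExpirySettings>
    <TimeoutInSec>86400</TimeoutInSec>
  </ExpirySettings>
  <Source>backend.refresh_token</Source>
</PopulateCache>

<!-- Cache-3: the access+refresh combination, long-lived; the source of truth for refresh calls -->
<PopulateCache name="CacheTokenPair">
  <CacheResource>token-pair-cache</CacheResource>
  <CacheKey>
    <KeyFragment>upstream-tokenpair</KeyFragment>
  </CacheKey>
  <ExpirySettings>
    <!-- 30 days as a stand-in for "infinite" -->
    <TimeoutInSec>2592000</TimeoutInSec>
  </ExpirySettings>
  <Source>backend.token_pair</Source>
</PopulateCache>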

Please let us know what you think of it, or if you have further suggestions.

If it’s system-to-system, there is no need for a refresh token. Systems don’t have a problem remembering their credentials. Refresh tokens exist to ease the experience for human users, who use the authorization code grant or the password grant. Systems don’t care if they have to re-authenticate; they don’t get tired and they don’t forget. I understand that you said "your token endpoint generates a refresh token". From what you explained, you do not need it. Why bother with it?

You can literally ignore the refresh token, always just get an access token, it will work fine. I think you may be over-complicating things.

If I were solving this I would make my job, and the job of future maintainers, as simple as possible. Just use the access token and get a new one just before it expires. DONE.
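
For what it’s worth, a minimal sketch of that simpler pattern (placeholder names; the callout re-authenticates with the stored credentials and its response is parsed into a flow variable): one fixed key, one lookup, and a re-fetch only on a miss, with the TTL a little shorter than the 90-minute token lifetime.

<Step>
  <!-- fixed key, e.g. "upstream-accesstoken" -->
  <Name>LookupUpstreamToken</Name>
</Step>
<Step>
  <!-- re-authenticate only when the cached token has expired -->
  <Name>CalloutAuthenticate</Name>
  <Condition>lookupcache.LookupUpstreamToken.cachehit = false</Condition>
</Step>
<Step>
  <!-- re-cache with a TTL slightly below 90 minutes, e.g. 5340 seconds -->
  <Name>CacheUpstreamToken</Name>
  <Condition>lookupcache.LookupUpstreamToken.cachehit = false</Condition>
</Step>

If several concurrent requests hit the expiry window, each gets its own token and the last write wins; barring the single-active-token caveat mentioned earlier, a few redundant authentications at that moment are harmless.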

Look at all the discussion we’re having. Look at the long-winded reply you just sent. Now imagine future YOU, or some other maintainer, who comes into this project and observes that the token refresh is not happening correctly. How do they debug it? How do they diagnose it? Would you want to maintain it?

I would want the system to be as simple as possible to meet the requirements. I think the set of requirements does not include “use the refresh token, whether you need it or not.”
