We have a hybrid environment with two data centers (active-active).
Cassandra is shared across both DCs, and we currently see roughly 3 ms of replication latency between them.
At the moment, OAuth token generation and validation run only in one data center, because:
- Tokens are generated in DC-1
- When requests hit DC-2, token validation fails
- The failure is caused by Cassandra replication lag (~3 ms), so DC-2 cannot reliably verify tokens generated in DC-1 in real time
Because of this, we are forced to keep OAuth services pinned to a single data center, which is not acceptable for availability and scalability. We cannot operate with only one DC handling auth.
Question:
In real-world, production scenarios, how is this typically solved?
Specifically:
- How do we handle OAuth token generation and validation across multiple data centers with network latency?
- Is there a recommended approach (stateless tokens, signing keys, local validation, etc.) to avoid cross-DC dependency?
- Are there best practices or Google-recommended architectures for multi-DC OAuth setups?
Any guidance or reference architectures would be very helpful.
Hi,
Great question. Multi-DC configurations are both common and recommended for production installations, as they provide higher availability and durability. Apigee always runs in an active/active configuration; traffic shaping at the network layer in front of each Apigee runtime determines the effective active/active or active/passive model.
When building out each data center's runtime, it is recommended to spread components across multiple availability zones within the region or data center. Within a single Apigee runtime installation, most components run as replica sets, allowing for internal active/active configurations, with traffic routing controlled internally to load-balance API processing inside each installation.
With multi-DC / multi-region configurations, external traffic shaping can therefore focus on providing the best regional experience to API consumers. It is recommended that multi-DC implementations of Apigee use a regional-affinity model rather than a round-robin model. This provides the best experience for consumers while still allowing for failover when needed. With regional affinity on the traffic reaching the Apigee runtimes, OAuth token replication latency is not a concern, because for most calls the token is minted in the same region where it will be used. And if a regional swing is required for maintenance or an emergency, the tokens have already been replicated to the other data center and are available there.
Based on the information in your post, I suspect a round-robin networking pattern is being used as part of the current active/active configuration. I recommend reviewing the load balancer configuration to determine whether it can be moved to a regional-affinity model. Another option is a session-pinning model, where each consumer stays with the regional installation that started the session for as long as the session is open. This eliminates token-replication latency impacts while providing a better experience for the consumer.
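To make the session-pinning idea concrete, here is an illustrative sketch at the external load-balancer layer using nginx (hostnames and ports are placeholders, not from your environment; your actual traffic-shaping layer may be a hardware LB or cloud LB with an equivalent affinity setting). `ip_hash` keeps a given client pinned to one runtime, so tokens are minted and validated in the same DC:

```nginx
# Hypothetical front-door config: pin each client to one Apigee runtime.
upstream apigee_runtimes {
    ip_hash;                        # same client IP -> same upstream DC
    server dc1.api.example.com:443; # Apigee runtime in DC-1 (placeholder)
    server dc2.api.example.com:443; # Apigee runtime in DC-2 (placeholder)
}

server {
    listen 443 ssl;
    location / {
        # If a pinned DC goes down, nginx re-hashes clients to the
        # surviving DC, where replicated tokens are already present.
        proxy_pass https://apigee_runtimes;
    }
}
```

Note that `backup` cannot be combined with `ip_hash`; failover happens automatically when a pinned server is marked down.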
If modifications to traffic shaping are not an option, moving to an authentication model that does not require token replication is recommended. A signed JWT can be very effective here: minting can be performed in any region that holds the correct signing keys, and validation can be performed anywhere the public certificates are available, with a similar consumer experience. This can be an extremely powerful approach if cross-Apigee-Organization authentication is required. A common pattern is global deployments where geographic data-residency controls are needed but a common authentication token must be usable across geographies.
For additional details and discussion, I also recommend that you reach out to your account team at Google.
Cheers,