A strong disaster recovery (DR) strategy is essential for ensuring business continuity in a distributed system, especially when using a service like Pub/Sub. The goal is to minimize data loss and downtime, which can be achieved through different patterns based on your specific needs for resiliency and cost. This blog post explores two popular DR patterns for Pub/Sub: Active-Active Multi-Region Publishing and Active-Warm Customer-Controlled Failover.
Assumptions
Before we dive in, let’s establish some core assumptions that underpin these patterns:
- Consumers are Idempotent: A consumer must be able to process the same message multiple times without producing incorrect or duplicate results (a sketch of this follows the list).
- Ordered Message Delivery is Not Required: In a distributed system, embracing eventual consistency is a common practice. This means the system’s state will eventually converge to the correct outcome even if messages are not delivered in a strict order.
- Logs and Data can be Delayed: Some recovery patterns may involve a delay in the delivery of logs and data during a failover event.
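To ground the idempotency assumption, here is a minimal consumer sketch in Python using the google-cloud-pubsub client. The project and subscription names, the event_id attribute, and the in-memory dedup set are illustrative assumptions; a real deployment would use a durable store such as Redis or a database:

```python
from google.cloud import pubsub_v1

# Hypothetical names for illustration.
PROJECT_ID = "my-project"
SUBSCRIPTION_ID = "orders-sub"

# A durable store in production; an in-memory set keeps the sketch short.
processed_ids = set()

def process_event(data: bytes) -> None:
    print("processing", data)  # placeholder for real business logic

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    # Dual-published copies of one event carry different Pub/Sub message IDs,
    # so deduplicate on an application-level ID attribute when present.
    event_id = message.attributes.get("event_id", message.message_id)
    if event_id in processed_ids:
        message.ack()  # already handled; acking again is harmless
        return
    process_event(message.data)
    processed_ids.add(event_id)
    message.ack()

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)
subscriber.subscribe(subscription_path, callback=callback).result()  # blocks
```

Note that in the active-active pattern below, the two regional copies of an event have different Pub/Sub message IDs, which is why the deduplication keys on an application-supplied attribute rather than the message ID alone.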
Active-Active Multi-Region Publishing: High availability, high cost
This pattern is the gold standard for high resiliency and availability, offering near-zero impact on your Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
How it works: In this setup, publishers simultaneously send the same message to regional endpoints in two or more separate regions. The Pub/Sub service in each region then delivers the message to its nearest subscribers. This redundancy ensures that if one region experiences an outage, messages are still successfully delivered to the other, active region.
Key components:
- Dual Publishers: Your publisher service is configured with two clients, each connected to a dedicated regional endpoint (e.g., asia-south1 and asia-south2), and publishes every message to both regions at the same time (see the sketch after this list).
- Active Consumers: Consumers in both regions are constantly active, ready to process messages from their respective endpoints.
- Retry Service: Messages that fail to publish are written to a Google Cloud Storage (GCS) bucket, and a batch job periodically picks them up and republishes them.
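As a rough sketch of the dual-publisher setup, the Python snippet below pins one google-cloud-pubsub client to each region's locational endpoint and fans every message out to both. The project ID, topic ID, event_id attribute, and the GCS helper stub are assumptions for illustration, not a prescribed implementation:

```python
from google.cloud import pubsub_v1

# Hypothetical project and topic names; the endpoints are Pub/Sub's
# locational endpoints for the two regions.
PROJECT_ID = "my-project"
TOPIC_ID = "orders"

def regional_publisher(region: str) -> pubsub_v1.PublisherClient:
    # Pin the client to one region's endpoint instead of the global one.
    return pubsub_v1.PublisherClient(
        client_options={"api_endpoint": f"{region}-pubsub.googleapis.com:443"}
    )

publishers = [regional_publisher("asia-south1"), regional_publisher("asia-south2")]
topic_path = pubsub_v1.PublisherClient.topic_path(PROJECT_ID, TOPIC_ID)

def save_to_gcs_for_retry(data: bytes, event_id: str) -> None:
    ...  # assumed helper: write the payload to the GCS retry bucket

def publish_to_both(data: bytes, event_id: str) -> None:
    # Fan the same payload out to both regions. The event_id attribute lets
    # idempotent consumers recognize the two copies as one logical event.
    futures = [p.publish(topic_path, data, event_id=event_id) for p in publishers]
    for future in futures:
        try:
            future.result(timeout=30)  # raises if this region's publish failed
        except Exception:
            save_to_gcs_for_retry(data, event_id)
```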
Resilience scenarios:
- Regional Pub/Sub Failure: The failure of one region has minimal to no impact on message delivery, as the messages are already being sent to and processed by the other region.
- Publisher Unavailability: If a publisher service becomes unavailable, Pub/Sub’s monitoring metrics can alert you to a drop in message flow, triggering a recovery action (see the monitoring sketch after this list).
- Consumer Unavailability: With consumers active in both regions, if one set of subscribers fails, messages are seamlessly processed by the consumers in the other region.
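For the publisher-unavailability scenario, one way to detect a drop in message flow is to query Cloud Monitoring's pubsub.googleapis.com/topic/send_message_operation_count metric, as sketched below. In practice you would likely configure an alerting policy rather than polling; the project ID, topic name, and five-minute window are assumptions:

```python
import time
from google.cloud import monitoring_v3

PROJECT_ID = "my-project"  # hypothetical

def recent_publish_count(topic_id: str, window_seconds: int = 300) -> int:
    """Sum publish operations on a topic over the trailing window."""
    client = monitoring_v3.MetricServiceClient()
    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {"start_time": {"seconds": now - window_seconds},
         "end_time": {"seconds": now}}
    )
    results = client.list_time_series(
        request={
            "name": f"projects/{PROJECT_ID}",
            "filter": (
                'metric.type = "pubsub.googleapis.com/topic/send_message_operation_count" '
                f'AND resource.labels.topic_id = "{topic_id}"'
            ),
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )
    return sum(point.value.int64_value
               for series in results for point in series.points)

# A scheduled check could page the on-call team (or trigger automation)
# when nothing has been published recently.
if recent_publish_count("orders") == 0:
    print("No publishes in the last 5 minutes -- investigate the publisher")
```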
Pros and cons: The primary advantage is unparalleled resiliency. The main drawback is the increased cost due to the need for duplicate resources and ongoing operations across multiple regions.
Active-Warm Customer-Controlled Failover: Cost-effective resiliency
This pattern provides a balance between cost and resiliency. While it’s less resilient than the Active-Active approach due to the manual or automated switchover process, it is a more cost-effective solution for disaster recovery.
How it works: Under normal operation, publishers write messages to a single, primary regional endpoint (e.g., asia-south1). A secondary region (asia-south2) is kept in a “warm” standby state, with its consumers ready to activate if needed.
Key components:
- Primary Publisher: A publisher service is configured to write messages to the primary regional endpoint.
- Standby Consumers: Consumers in the secondary region are running and attached to their subscription, but they receive little to no traffic while the primary region is healthy; they are ready to take over message processing at any time.
- Retry Service: A batch job periodically retries messages that failed to publish, republishing them to the currently active endpoint (see the batch-job sketch after this list).
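A minimal version of that batch job might look like the following, assuming failed payloads are stored as individual GCS objects whose names double as event IDs; the bucket, project, topic, and active-region values are illustrative:

```python
from google.cloud import pubsub_v1, storage

# Hypothetical names; the active region would be flipped by your failover logic.
PROJECT_ID = "my-project"
TOPIC_ID = "orders"
RETRY_BUCKET = "failed-publishes"
ACTIVE_REGION = "asia-south1"

def retry_failed_publishes() -> None:
    publisher = pubsub_v1.PublisherClient(
        client_options={"api_endpoint": f"{ACTIVE_REGION}-pubsub.googleapis.com:443"}
    )
    topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)
    bucket = storage.Client().bucket(RETRY_BUCKET)
    for blob in bucket.list_blobs():
        # The blob name doubles as the event ID so consumers can deduplicate.
        publisher.publish(topic_path, data=blob.download_as_bytes(),
                          event_id=blob.name).result(timeout=30)
        blob.delete()  # only reached if the republish succeeded
```

Deleting the object only after a successful republish keeps the job safe to rerun: a crash mid-run simply leaves messages for the next pass, and idempotent consumers absorb any resulting duplicates.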
Resilience scenarios:
- Regional Pub/Sub Failure: If the primary region’s Pub/Sub service becomes unavailable, the publisher must detect the failure and re-initialize its client to target the secondary region’s endpoint, either through an automated program or a manual redeployment of the service. Error-rate monitoring (e.g., tracking 5xx responses) can signal the switchover (see the failover sketch after this list).
- Publisher Unavailability: The system is resilient to a publisher failure. You can set up alerts on Pub/Sub’s monitoring metrics that fire when no messages are received, which can then trigger recovery actions (the monitoring sketch shown earlier applies here as well).
- Consumer Unavailability: If the primary region’s subscribers fail, Pub/Sub automatically routes messages to the nearest available consumers, which in this case are the standby consumers in the secondary region. This results in minimal impact to RTO/RPO.
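To illustrate the customer-controlled switchover, here is a sketch of a publisher wrapper that flips to the secondary region's endpoint after a run of server-side (5xx) errors. The failure threshold, region order, and error handling are simplified assumptions:

```python
from google.api_core import exceptions
from google.cloud import pubsub_v1

PROJECT_ID = "my-project"  # hypothetical
TOPIC_ID = "orders"
REGIONS = ["asia-south1", "asia-south2"]  # primary first, warm standby second
FAILURE_THRESHOLD = 5  # consecutive 5xx failures before switching over

class FailoverPublisher:
    def __init__(self) -> None:
        self._active = 0
        self._failures = 0
        self._client = self._make_client()

    def _make_client(self) -> pubsub_v1.PublisherClient:
        endpoint = f"{REGIONS[self._active]}-pubsub.googleapis.com:443"
        return pubsub_v1.PublisherClient(client_options={"api_endpoint": endpoint})

    def publish(self, data: bytes, event_id: str) -> None:
        topic_path = self._client.topic_path(PROJECT_ID, TOPIC_ID)
        try:
            self._client.publish(topic_path, data,
                                 event_id=event_id).result(timeout=30)
            self._failures = 0
        except exceptions.ServerError:  # 5xx from the regional endpoint
            self._failures += 1
            if self._failures >= FAILURE_THRESHOLD:
                self._active = 1 - self._active  # flip to the other region
                self._client = self._make_client()
                self._failures = 0
            raise  # let the caller (or the GCS retry path) handle the message
```

Re-raising after the switchover keeps the failed message flowing into the retry service, so nothing is silently dropped during the transition.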
Pros and cons: This pattern significantly reduces costs by only actively processing messages in a single region. The trade-off is a potential RTO/RPO impact, as it requires a failover event to switch to the secondary region.
By understanding these two distinct patterns, you can choose the right disaster recovery strategy for your Pub/Sub architecture based on your business’s specific needs for availability, latency, and cost.
Thanks for reading!