[Public Preview] Mastering WebSocket load balancing on Google Cloud with IN-FLIGHT mode

If you are building real-time applications—like chat platforms, live dashboards, or multiplayer games—you are likely relying on WebSockets. While Google Cloud’s Application Load Balancers (ALB) offer fantastic native support for WebSockets, scaling them across your backend instances introduces a unique challenge: How do you accurately measure and balance the load of persistent connections?

The secret lies in moving away from traditional request-rate metrics and embracing the IN-FLIGHT balancing mode.

The problem with Requests Per Second (RPS)

Typically, HTTP load balancing uses the RATE balancing mode, measuring Requests Per Second (RPS). However, WebSockets break this model.

A WebSocket begins life as a standard HTTP request with an Upgrade header. Once the backend accepts the upgrade, the connection stays open. A load balancer tracking RATE sees only that single initial request. After the connection is established, it drops off the RPS radar, even though it is still consuming backend resources. If you rely on RATE, your load balancer might overwhelm a single instance by funneling hundreds of long-lived WebSockets to it simply because its current requests-per-second figure looks low.
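To make the blind spot concrete, here is a small Python sketch. It is a toy model, not the load balancer's actual accounting, and the numbers (5 upgrades per second, 600-second sessions) are hypothetical. It compares what a rate metric would report with how many connections are actually open.

```python
def simulate(arrivals_per_second: int, session_seconds: int, duration_seconds: int):
    """Return (requests_per_second, open_connections) after duration_seconds.

    Toy model: every second, `arrivals_per_second` clients complete a WebSocket
    upgrade, and each session stays open for `session_seconds`.
    """
    open_connections = 0
    for second in range(duration_seconds):
        open_connections += arrivals_per_second      # new upgrades this second
        if second >= session_seconds:
            open_connections -= arrivals_per_second  # sessions that just ended
    # A rate metric only ever counts the initial upgrade requests.
    return arrivals_per_second, open_connections


# Hypothetical workload: 5 upgrades/s, 10-minute sessions, observed after 5 minutes.
rps, in_flight = simulate(arrivals_per_second=5, session_seconds=600, duration_seconds=300)
print(f"RATE view: {rps} requests/s, IN-FLIGHT view: {in_flight} open connections")
```

In this model the rate stays flat at 5 requests per second while the number of open connections climbs to 1500, which is the load the backend actually carries.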

The solution: IN-FLIGHT balancing mode

Introduced for Application Load Balancers (ALB), the IN-FLIGHT balancing mode changes the paradigm. Instead of measuring how many requests arrive per second, it measures how many requests are currently active (sent to the backend but not yet completed).

For WebSockets, this is perfect. The ALB treats a long-lived WebSocket connection as a single “in-flight” request that remains open for the entire duration of the session.

When you configure a backend with IN-FLIGHT mode, you set a maximum number of in-flight requests (e.g., 100 active WebSockets per VM). The load balancer distributes new connections to the backends with the fewest active sessions. Once an instance hits its in-flight capacity, the ALB overflows new traffic to other available backends.

The accompanying diagram illustrates this flow:

  • WebSocket clients: Multiple users initiate connections. Note that WebSockets use continuous, bidirectional arrows, representing their long-lived nature.

  • Application Load Balancer: Inside the ALB, the IN-FLIGHT mode logic tracks Active Concurrent Connections. It visualizes this as a counter (like a gauge) rather than a request rate.

  • Balancing in action:

    • Backend A is currently handling 198/200 allowed concurrent connections (high utilization). The diagram shows the ALB diverting new traffic away from this backend.

    • Backend B is only handling 12/200 connections (low utilization). The ALB routes the new incoming WebSocket connections (represented by the bold blue arrows) directly to Backend B, balancing the real-time concurrency load.

  • Overflow scenario: If Backend A were completely full (200/200), the diagram illustrates how the ALB would trigger an “Overflow” event, temporarily blocking new connections to that backend until capacity becomes available, so that existing sessions aren’t dropped.
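The balancing behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the ALB's real algorithm: the `Backend` class and `pick_backend` function are hypothetical names, and the numbers come from the diagram description (198/200 and 12/200).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Backend:
    name: str
    in_flight: int      # requests currently open on this backend
    max_in_flight: int  # configured in-flight capacity


def pick_backend(backends: list[Backend]) -> Optional[Backend]:
    """Pick the least-utilized backend that still has in-flight capacity."""
    candidates = [b for b in backends if b.in_flight < b.max_in_flight]
    if not candidates:
        # Every backend is full; what the real ALB does here is not modeled.
        return None
    return min(candidates, key=lambda b: b.in_flight / b.max_in_flight)


backends = [Backend("A", 198, 200), Backend("B", 12, 200)]
for _ in range(10):  # ten new WebSocket connections arrive
    target = pick_backend(backends)
    if target is not None:
        target.in_flight += 1  # the connection stays open, so the count stays up

print([(b.name, b.in_flight) for b in backends])
```

In this sketch all ten new connections land on Backend B, and a backend at 200/200 is simply skipped until one of its sessions closes.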

Two critical rules for WebSockets on GCP

Before you configure this, keep two things in mind:

  1. Increase the timeout: By default, Google Cloud backend services have a 30-second timeout. The ALB closes a WebSocket connection when this timeout expires; depending on the load balancer type, that can apply even to connections that are actively exchanging data, not only to idle ones. You must increase the backend service timeout (e.g., to 3600 seconds or more) to allow long-lived streams.

  2. Session Affinity: If your architecture requires clients to reconnect to the exact same backend server if a drop occurs, ensure you enable Session Affinity (like CLIENT_IP or GENERATED_COOKIE).
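As a rough intuition for what CLIENT_IP-style affinity provides, the sketch below maps a client IP to a backend with a stable hash. This is a simplified illustration under stated assumptions, not how Google Cloud implements affinity: the `backend_for_client` function and the backend names are hypothetical, and the toy hash ignores capacity limits and health changes.

```python
import hashlib


def backend_for_client(client_ip: str, backends: list[str]) -> str:
    """Map a client IP to a backend with a stable hash (simplified affinity)."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


backends = ["vm-a", "vm-b", "vm-c"]  # hypothetical backend names
first = backend_for_client("203.0.113.7", backends)
again = backend_for_client("203.0.113.7", backends)  # the client reconnects
print(first == again)  # the same client IP maps to the same backend
```

The point is only that a reconnecting client is steered back to the backend that holds its state, as long as the set of backends has not changed.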

Sample configuration

Here is how you can configure a Google Cloud backend service for WebSockets using the IN-FLIGHT balancing mode via the gcloud CLI.

1. Create the backend service with a high timeout

First, we create the backend service and increase the timeout to 1 hour (3600 seconds) to prevent the load balancer from closing persistent connections prematurely.

gcloud compute backend-services create my-websocket-backend \
    --load-balancing-scheme=EXTERNAL_MANAGED \
    --protocol=HTTP \
    --port-name=http \
    --timeout=3600 \
    --global

2. Add the backend instance group with IN-FLIGHT mode

Next, we attach our instance group (or Network Endpoint Group) to the backend service. Here, we specify the IN_FLIGHT balancing mode and cap it at 150 concurrent WebSocket connections per instance.

gcloud compute backend-services add-backend my-websocket-backend \
    --instance-group=my-app-instance-group \
    --instance-group-zone=us-central1-a \
    --balancing-mode=IN_FLIGHT \
    --max-in-flight-per-instance=150 \
    --global
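With a per-instance cap in place, a quick back-of-the-envelope calculation tells you how many instances a given peak needs. The Python sketch below uses the cap of 150 from the command above; the peak of 10,000 connections, the 20% headroom, and the `instances_needed` helper are hypothetical values chosen for illustration.

```python
def instances_needed(peak_connections: int,
                     max_in_flight_per_instance: int,
                     headroom_percent: int = 0) -> int:
    """Smallest instance count whose combined in-flight capacity covers the peak.

    Integer arithmetic is used throughout to avoid floating-point rounding.
    """
    target = peak_connections * (100 + headroom_percent)  # scaled by 100
    capacity = max_in_flight_per_instance * 100           # same scale
    return -(-target // capacity)                         # ceiling division


print(instances_needed(10_000, 150))                       # 67
print(instances_needed(10_000, 150, headroom_percent=20))  # 80
```

If the instance group is smaller than this estimate, every backend will sit at its in-flight cap during the peak, so it is worth checking against your autoscaling limits.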

Conclusion & call to action

By switching your WebSocket backend services to IN-FLIGHT balancing mode, you ensure your load balancer accurately understands your active concurrency. This simple configuration change prevents traffic hotspots, ensures smoother scaling, and provides a much more stable real-time experience for your users.

This feature is in Public Preview, so now is a good time to test it yourself. We welcome your feedback and observations.


Wow, great. It can be helpful to lots of developers who wanna build real time applications.