The Stateful Worker problem: Architecting a Scalable GKE Autopilot host for your LiveKit Agent

Part 2 of 4: Beyond the chatbot

By Tanya Dixit, AI Solutions Acceleration Architect, Google Cloud, and Jian Bo Tan, Solutions Acceleration Architect, Google Cloud

In Part 1, we built a “Hello World” voice agent using LiveKit and the Gemini Live model. It was magical: a Python script running on your laptop that could listen, think, and speak in real-time.

But “works on my machine” is a dangerous phrase in production.

When you close your laptop, the agent dies. If ten users call at once, your single Python process might choke. And if you deploy it to a standard serverless platform, you might find your agents getting killed mid-sentence because of timeouts.

This is Part 2 of our series. We are moving from “how to code it” to “where to run it.” We will architect a production-grade hosting environment for our Agent Backend using Google Kubernetes Engine (GKE) Autopilot.

Here is what we’ll cover:

  • The workload: Why a voice agent behaves differently than a web API.

  • The platform: Why GKE Autopilot beats Cloud Run for this specific use case.

  • The architecture: Designing a secure, scalable “Brain” for your agents.

  • The code: A deployment blueprint for handling graceful shutdowns and autoscaling.

Understanding the workload: The “Stateful Worker”

To host this correctly, we first need to understand what we are actually hosting. Most modern web backends are stateless REST APIs. A user sends a request, the server processes it in a few milliseconds, sends a response, and forgets everything. If the server crashes, you just retry the request.

A real-time voice agent is fundamentally different. It is a Stateful Worker.

  • Long-lived connections: A voice session isn’t a millisecond request; it’s a continuous conversation that can last minutes or hours.

  • In-memory state: The agent holds the conversation context, audio buffers, and user data in memory.

  • Critical affinity: If the process running the agent dies, the user is disconnected instantly. There is no “retry.” The magic is broken.
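A minimal sketch makes the point; the class and field names below are hypothetical, not LiveKit SDK types:

```python
# Hypothetical sketch of the per-session state an agent process holds in RAM.
# If the process dies, all of this is gone -- there is nothing to "retry" against.

class VoiceSession:
    def __init__(self, user_id: str):
        self.user_id = user_id
        self.history: list[str] = []     # conversation context for the LLM
        self.audio_buffer = bytearray()  # in-flight audio frames
        self.connected = True            # live media connection to the user

    def on_user_utterance(self, text: str) -> None:
        # Context accumulates turn by turn and exists only in this process.
        self.history.append(f"user: {text}")

session = VoiceSession("user-123")
session.on_user_utterance("Find me a coffee shop nearby")
```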

This means our infrastructure has one primary job: Keep the pod alive.

We need a platform that guarantees a specific amount of CPU/Memory for the duration of the call and gives us precise control over when and how the process terminates.

The compute decision: Why GKE Autopilot?

When deploying containers on Google Cloud, the usual suspects are Cloud Run and GKE.

For stateless APIs, Cloud Run is often the default choice. It scales to zero and manages everything for you. But for our Stateful Worker, it presents challenges:

  • Timeouts: Cloud Run is designed for short-lived requests. While it supports WebSockets, it enforces a request timeout (configurable, but capped at 60 minutes) that can hard-kill long sessions.

  • CPU Throttling: Cloud Run (by default) throttles CPU when a request isn’t active. For a voice agent listening for user input, this “idle” time is actually critical processing time for the VAD (Voice Activity Detection).

  • Scale-down aggression: Cloud Run scales down aggressively to save costs, which is risky for stateful sessions.

Enter GKE Autopilot.

GKE Autopilot offers the best of both worlds: the “serverless” experience of not managing nodes (Google manages the cluster infrastructure) with the Pod lifecycle control of Kubernetes.

With GKE Autopilot, we can:

  1. Guarantee resources: Request exactly 1000m CPU and 1Gi Memory per pod, and get exactly that. No throttling.

  2. Control termination: Configure terminationGracePeriodSeconds to allow agents to finish their conversations before shutting down.

  3. Scale on custom metrics: Use Horizontal Pod Autoscaling (HPA) to scale based on actual load, not just CPU spikes.

Architecture: The Agent Backend

Here is the high-level architecture for our Agent Backend on GKE:

  • Private cluster: The GKE cluster is private. Nodes have no public IP addresses, reducing the attack surface.

  • Cloud NAT: Agents need to reach out to the internet to connect to LiveKit Cloud and Google APIs (Maps, Places). Cloud NAT handles this securely.

  • Private Google access: Traffic to Vertex AI (Gemini) travels over Google’s private backbone, not the public internet. This reduces latency—critical for voice—and improves security.

  • LiveKit affinity: Interestingly, we don’t need to configure Kubernetes Session Affinity. LiveKit Cloud handles the routing. When a user connects, LiveKit assigns them to a specific agent instance. Our job is simply to provide a pool of available agents.

Sizing the agent: The “T-Shirt Size”

Voice AI is heavy. It involves real-time audio encoding/decoding, Voice Activity Detection (VAD), and managing the WebSocket stream. If you size your pods like a typical web server (e.g., 250m CPU), you will hear it. The audio will stutter, the agent will lag, and the user experience will suffer.

Based on our capacity analysis, here is our recommended “T-Shirt Size” for a production agent pod:

YAML

resources:
  requests:
    cpu: "1000m"    # 1 vCPU
    memory: "1Gi"
  limits:
    cpu: "2000m"    # 2 vCPU
    memory: "2Gi"

Why so large? We found that while a single session might only consume ~200-300m CPU, the burst usage during turn-taking (processing user speech + generating response) can spike. By reserving 1 vCPU, we ensure smooth audio even during complex interactions.
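As a quick sanity check of these numbers, here is a back-of-the-envelope helper (ours, not from any SDK; the per-session CPU and utilization figures are the estimates discussed in this post):

```python
import math

def pods_needed(concurrent_sessions: int,
                per_session_mcpu: int = 300,    # upper end of the ~200-300m estimate
                pod_request_mcpu: int = 1000,   # the 1 vCPU request above
                target_utilization: float = 0.75) -> int:
    """Rough capacity plan: pods required to keep CPU below the HPA target."""
    usable_mcpu = pod_request_mcpu * target_utilization
    sessions_per_pod = max(1, int(usable_mcpu // per_session_mcpu))
    return math.ceil(concurrent_sessions / sessions_per_pod)

# 10 concurrent calls at these estimates -> 5 pods
print(pods_needed(10))
```

This ignores the turn-taking bursts entirely; the gap between the 1 vCPU request and the steady-state math is exactly the headroom those bursts consume.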

Scaling strategy: Implementing HPA

Scaling a stateful workload is tricky. If you scale up too slowly, users get rejected. If you scale down too fast, you kill active conversations. We use Horizontal Pod Autoscaling (HPA) with a specific configuration to handle this “dampening.”

The “flapping” problem

Imagine a demo where you show the agent to a colleague. You connect (1 session), CPU spikes. HPA sees the spike and adds 2 pods. You disconnect. HPA sees low CPU and deletes the pods. This “flapping” is wasteful and dangerous.

The solution: Stabilization Windows

We configure the HPA with a stabilizationWindowSeconds to smooth out these jitters.

YAML

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend-agent-hpa
spec:
  scaleTargetRef:            # required: which Deployment to scale
    apiVersion: apps/v1
    kind: Deployment
    name: backend-agent-deployment
  minReplicas: 3
  maxReplicas: 10
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 120  # Scale up only on 2 mins of sustained load
    scaleDown:
      stabilizationWindowSeconds: 900  # Wait 15 mins before scaling down
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 75

  • Scale up (120s): Ignores brief spikes (like pod startup). Only scales if load is sustained.

  • Scale down (900s): The “Safety Net.” If load drops, we wait 15 minutes before removing pods. This ensures that if a user briefly disconnects and reconnects, or if there is a lull in conversation, we don’t prematurely kill capacity.
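A simplified model of the scale-down window shows why this prevents flapping: within the window, the controller acts on the highest replica recommendation it has seen, so brief dips in load don't remove pods. (This is a sketch of the behavior, not the real HPA controller code.)

```python
from collections import deque

def stabilized(recommendations: list[int], window: int) -> list[int]:
    """Simplified HPA scale-down stabilization: at each step, act on the
    maximum desired-replica recommendation seen in the last `window` steps."""
    recent: deque[tuple[int, int]] = deque()
    result = []
    for t, rec in enumerate(recommendations):
        recent.append((t, rec))
        while recent[0][0] <= t - window:
            recent.popleft()
        result.append(max(r for _, r in recent))
    return result

# A demo session: load wants 5 pods, drops, spikes again, then stays low.
# Without the window the controller would flap 5 -> 3 -> 5 -> 3; with it,
# capacity is held until the drop is sustained.
print(stabilized([5, 3, 5, 3, 3, 3, 3], window=3))  # -> [5, 5, 5, 5, 5, 3, 3]
```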

Deployment blueprint

Finally, let’s look at the Deployment manifest. The most critical line here is terminationGracePeriodSeconds.

When Kubernetes decides to scale down, it sends a SIGTERM signal to the pod. By default, it waits 30 seconds before force-killing it (SIGKILL). For a voice conversation, 30 seconds might not be enough to say goodbye.

We increase this to 150 seconds (2.5 minutes) and add a preStop hook.

YAML

apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend-agent-deployment
spec:
  selector:                  # required: must match the pod template labels
    matchLabels:
      app: backend-agent
  template:
    metadata:
      labels:
        app: backend-agent
    spec:
      terminationGracePeriodSeconds: 150  # Give agents time to finish!
      containers:
      - name: backend-agent
        image: backend-agent:latest
        resources:           # the "T-Shirt Size" from above
          requests:
            cpu: "1000m"
            memory: "1Gi"
          limits:
            cpu: "2000m"
            memory: "2Gi"
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 120"]

How it works:

  1. Scale down event: K8s decides to remove a pod.

  2. PreStop Hook: The pod executes sleep 120. It stops accepting new connections (because it’s removed from the Service endpoints) but stays alive.

  3. Draining: Existing conversations continue naturally.

  4. Termination: When the sleep 120 finishes, Kubernetes sends SIGTERM. The remaining 30 seconds of the 150-second grace period give the agent time to exit cleanly before a SIGKILL.

This simple change turns a “dropped call” into a graceful exit.

Conclusion

We now have a robust “Brain” for our voice agent.

  • It runs on GKE Autopilot, giving us serverless ease with infrastructure control.

  • It uses Workload Identity and Private Google Access for security and speed.

  • It scales intelligently, protecting active conversations while adapting to demand.

But a brain needs a heart. Right now, our media traffic (the actual voice packets) is routing through LiveKit Cloud’s public infrastructure.

In Part 3, we will bring the heart home.

We will guide you through self-hosting the open-source LiveKit SFU (Selective Forwarding Unit) on Google Cloud. You will learn:

  • The networking challenge: How to solve the complex world of UDP port ranges, Host Networking, and NAT traversal on Kubernetes.

  • Cost & control: How self-hosting can dramatically reduce per-minute costs for high-volume applications while giving you total data sovereignty.

  • The architecture: A deep dive into deploying the LiveKit Server, Redis (Cloud Memorystore), and Ingress on GKE to create a fully private, low-latency media mesh.

This is the step that transforms your project from a cloud-dependent prototype into a fully independent, enterprise-grade platform. Stay tuned!

