Part 2 of 4: Beyond the chatbot
By Tanya Dixit, AI Solutions Acceleration Architect, Google Cloud, and Jian Bo Tan, Solutions Acceleration Architect, Google Cloud
In Part 1, we built a “Hello World” voice agent using LiveKit and the Gemini Live model. It was magical: a Python script running on your laptop that could listen, think, and speak in real-time.
But “works on my machine” is a dangerous phrase in production.
When you close your laptop, the agent dies. If ten users call at once, your single Python process might choke. And if you deploy it to a standard serverless platform, you might find your agents getting killed mid-sentence because of timeouts.
This is Part 2 of our series. We are moving from “how to code it” to “where to run it.” We will architect a production-grade hosting environment for our Agent Backend using Google Kubernetes Engine (GKE) Autopilot.
Here is what we’ll cover:
- The workload: Why a voice agent behaves differently than a web API.
- The platform: Why GKE Autopilot beats Cloud Run for this specific use case.
- The architecture: Designing a secure, scalable "Brain" for your agents.
- The code: A deployment blueprint for handling graceful shutdowns and autoscaling.
Understanding the workload: The “Stateful Worker”
To host this correctly, we first need to understand what we are actually hosting. Most modern web backends are stateless REST APIs. A user sends a request, the server processes it (ms), sends a response, and forgets everything. If the server crashes, you just retry the request.
A real-time voice agent is fundamentally different. It is a Stateful Worker.
- Long-lived connections: A voice session isn't a millisecond request; it's a continuous conversation that can last minutes or hours.
- In-memory state: The agent holds the conversation context, audio buffers, and user data in memory.
- Critical affinity: If the process running the agent dies, the user is disconnected instantly. There is no "retry." The magic is broken.
This means our infrastructure has one primary job: Keep the pod alive.
We need a platform that guarantees a specific amount of CPU/Memory for the duration of the call and gives us precise control over when and how the process terminates.
The compute decision: Why GKE Autopilot?
When deploying containers on Google Cloud, the usual suspects are Cloud Run and GKE.
For stateless APIs, Cloud Run is often the default choice. It scales to zero and manages everything for you. But for our Stateful Worker, it presents challenges:
- Timeouts: Cloud Run is designed for short-lived requests. While it supports WebSockets, it enforces a request timeout (capped at 60 minutes) that can hard-kill long sessions.
- CPU throttling: By default, Cloud Run throttles CPU when no request is active. For a voice agent listening for user input, this "idle" time is actually critical processing time for Voice Activity Detection (VAD).
- Scale-down aggression: Cloud Run scales down aggressively to save costs, which is risky for stateful sessions.
Enter GKE Autopilot.
GKE Autopilot offers the best of both worlds: the “serverless” experience of not managing nodes (Google manages the cluster infrastructure) with the Pod lifecycle control of Kubernetes.
With GKE Autopilot, we can:
- Guarantee resources: Request exactly 1000m CPU and 1Gi memory per pod, and get exactly that. No throttling.
- Control termination: Configure terminationGracePeriodSeconds to allow agents to finish their conversations before shutting down.
- Scale on custom metrics: Use Horizontal Pod Autoscaling (HPA) to scale based on actual load, not just CPU spikes.
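As a quick sketch, a private Autopilot cluster can be provisioned with a single command (the cluster name and region here are illustrative placeholders, not values from this series):

```shell
# Create a private GKE Autopilot cluster. Google manages the nodes,
# upgrades, and bin-packing; we only describe the workload.
gcloud container clusters create-auto voice-agent-cluster \
  --region=us-central1 \
  --enable-private-nodes
```

The `--enable-private-nodes` flag gives nodes no public IP addresses, which is what makes the Cloud NAT piece of the architecture necessary for outbound traffic.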
Architecture: The Agent Backend
Here is the high-level architecture for our Agent Backend on GKE:
- Private cluster: The GKE cluster is private. Nodes have no public IP addresses, reducing the attack surface.
- Cloud NAT: Agents need to reach out to the internet to connect to LiveKit Cloud and Google APIs (Maps, Places). Cloud NAT handles this securely.
- Private Google Access: Traffic to Vertex AI (Gemini) travels over Google's private backbone, not the public internet. This reduces latency (critical for voice) and improves security.
- LiveKit affinity: Interestingly, we don't need to configure Kubernetes session affinity. LiveKit Cloud handles the routing: when a user connects, LiveKit assigns them to a specific agent instance. Our job is simply to provide a pool of available agents.
Sizing the agent: The “T-Shirt Size”
Voice AI is heavy. It involves real-time audio encoding/decoding, Voice Activity Detection (VAD), and managing the WebSocket stream. If you size your pods like a typical web server (e.g., 250m CPU), you will hear it. The audio will stutter, the agent will lag, and the user experience will suffer.
Based on our capacity analysis, here is our recommended “T-Shirt Size” for a production agent pod:
```yaml
resources:
  requests:
    cpu: "1000m"   # 1 vCPU
    memory: "1Gi"
  limits:
    cpu: "2000m"   # 2 vCPU
    memory: "2Gi"
```
Why so large? We found that while a single session might only consume ~200-300m CPU, the burst usage during turn-taking (processing user speech + generating response) can spike. By reserving 1 vCPU, we ensure smooth audio even during complex interactions.
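The headroom math above can be sanity-checked with a quick back-of-the-envelope calculation. The per-session figure comes from the ~200-300m observation; the 60% planning ceiling is our own assumption to leave room for turn-taking bursts:

```python
import math

# Rough steady-state CPU per session, in millicores (upper end of ~200-300m).
steady_m_cpu_per_session = 300

# The CPU request we reserve per pod, in millicores (1 vCPU).
pod_request_m_cpu = 1000

# Assumption: plan to only ~60% of the request so bursts don't cause stutter.
planning_ceiling = 0.6

sessions_per_pod = math.floor(
    pod_request_m_cpu * planning_ceiling / steady_m_cpu_per_session
)
print(sessions_per_pod)  # 2
```

In other words, a conservatively sized 1 vCPU pod comfortably handles a couple of concurrent sessions while keeping audio smooth.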
Scaling strategy: Implementing HPA
Scaling a stateful workload is tricky. If you scale up too slowly, users get rejected. If you scale down too fast, you kill active conversations. We use Horizontal Pod Autoscaling (HPA) with a specific configuration to handle this “dampening.”
The “flapping” problem
Imagine a demo where you show the agent to a colleague. You connect (1 session), CPU spikes. HPA sees the spike and adds 2 pods. You disconnect. HPA sees low CPU and deletes the pods. This “flapping” is wasteful and dangerous.
The solution: Stabilization Windows
We configure the HPA with a stabilizationWindowSeconds to smooth out these jitters.
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend-agent-hpa
spec:
  scaleTargetRef:          # points the HPA at our agent Deployment
    apiVersion: apps/v1
    kind: Deployment
    name: backend-agent-deployment
  minReplicas: 3
  maxReplicas: 10
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 120   # Wait 2 mins before scaling up
    scaleDown:
      stabilizationWindowSeconds: 900   # Wait 15 mins before scaling down
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 75
```
- Scale up (120s): Ignores brief spikes (like pod startup). Only scales if load is sustained.
- Scale down (900s): The "safety net." If load drops, we wait 15 minutes before removing pods. This ensures that if a user briefly disconnects and reconnects, or if there is a lull in conversation, we don't prematurely kill capacity.
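Under the hood, the HPA's replica math is simple. This sketch mirrors the documented Kubernetes formula, desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization), clamped to the bounds from our manifest (stabilization windows then dampen when the result is acted on):

```python
import math

def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float = 75.0,
                     min_replicas: int = 3,
                     max_replicas: int = 10) -> int:
    """Core Kubernetes HPA formula, clamped to our manifest's min/max bounds."""
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, desired))

# 3 pods running hot at 95% average CPU -> ceil(3 * 95 / 75) = 4 pods.
print(desired_replicas(3, 95.0))  # 4

# Load collapses to 10% -> formula says 1, but minReplicas keeps the floor at 3.
print(desired_replicas(3, 10.0))  # 3
```

Note how minReplicas: 3 acts as a second safety net alongside the 15-minute scale-down window: even an idle pool never drops below three warm agents.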
Deployment blueprint
Finally, let’s look at the Deployment manifest. The most critical line here is terminationGracePeriodSeconds.
When Kubernetes decides to scale down, it sends a SIGTERM signal to the pod. By default, it waits 30 seconds before force-killing it (SIGKILL). For a voice conversation, 30 seconds might not be enough to say goodbye.
We increase this to 150 seconds (2.5 minutes) and add a preStop hook.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend-agent-deployment
spec:
  selector:                 # required: ties the Deployment to its pods
    matchLabels:
      app: backend-agent
  template:
    metadata:
      labels:
        app: backend-agent
    spec:
      terminationGracePeriodSeconds: 150  # Give agents time to finish!
      containers:
      - name: backend-agent
        image: backend-agent:latest
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 120"]
```
How it works:
- Scale-down event: Kubernetes decides to remove a pod.
- PreStop hook: The pod executes sleep 120. It stops accepting new connections (it is removed from the Service endpoints) but stays alive.
- Draining: Existing conversations continue naturally.
- Termination: Once the 120-second hook completes, the pod receives SIGTERM and has the remaining ~30 seconds of the 150-second grace period to exit cleanly.
This simple change turns a “dropped call” into a graceful exit.
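Inside the container, the agent process should cooperate with this lifecycle by catching SIGTERM and draining instead of exiting immediately. Here is a minimal, framework-agnostic sketch (the LiveKit Agents worker has its own drain handling; the function names here are illustrative, and the demo simulates the kubelet by signaling itself):

```python
import asyncio
import os
import signal

async def run_agent_worker() -> str:
    loop = asyncio.get_running_loop()
    shutting_down = asyncio.Event()

    # On SIGTERM (delivered after the 120s preStop sleep), flip the drain flag.
    loop.add_signal_handler(signal.SIGTERM, shutting_down.set)

    # Normally this is where the worker would accept and serve agent sessions;
    # here it simply waits until Kubernetes asks it to stop.
    await shutting_down.wait()

    # Drain: let in-flight conversations finish, bounded by the ~30s of grace
    # period left after the preStop hook. Placeholder for real session cleanup.
    await asyncio.sleep(0)
    return "drained"

async def demo() -> str:
    worker = asyncio.create_task(run_agent_worker())
    await asyncio.sleep(0.05)
    os.kill(os.getpid(), signal.SIGTERM)  # simulate the kubelet's SIGTERM
    return await worker

print(asyncio.run(demo()))  # drained
```

The key point: SIGTERM is treated as "stop taking work and finish what you have," not "die now," which is exactly the contract the preStop hook and grace period set up.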
Conclusion
We now have a robust “Brain” for our voice agent.
- It runs on GKE Autopilot, giving us serverless ease with infrastructure control.
- It uses Workload Identity and Private Google Access for security and speed.
- It scales intelligently, protecting active conversations while adapting to demand.
But a brain needs a heart. Right now, our media traffic (the actual voice packets) is routing through LiveKit Cloud’s public infrastructure.
In Part 3, we will bring the heart home.
We will guide you through self-hosting the open-source LiveKit SFU (Selective Forwarding Unit) on Google Cloud. You will learn:
- The networking challenge: How to solve the complex world of UDP port ranges, host networking, and NAT traversal on Kubernetes.
- Cost & control: How self-hosting can dramatically reduce per-minute costs for high-volume applications while giving you total data sovereignty.
- The architecture: A deep dive into deploying the LiveKit Server, Redis (Cloud Memorystore), and Ingress on GKE to create a fully private, low-latency media mesh.
This is the step that transforms your project from a cloud-dependent prototype into a fully independent, enterprise-grade platform. Stay tuned!
Next steps:
- Check out the GKE Autopilot documentation.
- Review the LiveKit Agents framework for more on the worker lifecycle.
- Stay tuned for Part 3!