Beyond the Chatbot: WebRTC, Gemini, and Your First Real-Time Voice Agent
This is Part 1 of our 4-part series:
Beyond the Chatbot: Building Scalable, Real-Time Voice AI on Google Cloud
By Tanya Dixit, AI Solutions Acceleration Architect, Google Cloud, and Jian Bo Tan, Solutions Acceleration Architect, Google Cloud
Voice is the most natural interface we have. As humans, we don’t just exchange text; we interrupt, we use tone to convey meaning, and we expect immediate feedback. While text-based chatbots have revolutionized customer service, a new generation of AI is enabling us to build applications that go beyond typing—creating fluid, voice-first experiences that feel less like a transaction and more like a conversation.
Imagine a customer service agent that you can talk to while driving, ask to “find a nearby store,” and have it instantly provide information without you ever touching a screen. This is the promise of real-time voice AI.
In this post—the first in our series—we’ll explore how to build this exact experience. We’ll design and write the code for a production-grade, stateful voice agent powered by the Google Gemini Live model and LiveKit.
Here’s what you will learn:
- The Context: How the Gemini Live model enables a new class of streaming, multimodal applications.
- The Protocols: Understanding the different roles of WebSockets and WebRTC in modern application architecture.
- The Code: How to build a complete voice agent that holds a natural conversation and fetches real-time data.
The rise of real-time multimodal AI
Traditional voice assistants have relied on a pipeline of separate services: converting speech to text, processing that text with an LLM, and then converting the text response back to speech. This approach works well for many applications but is inherently designed for turn-based interactions.
The Google Gemini Live model introduces a new paradigm: native multimodality. It can process a continuous stream of audio input and generate a continuous stream of audio output. This shift from discrete turns to continuous streaming unlocks powerful new capabilities for enterprise applications:
- True Interruptibility: Users can speak over the agent to correct it or change direction mid-sentence, creating a much more natural flow.
- Rich Contextual Understanding: The model can perceive non-verbal cues like tone, pitch, and hesitation (“um…”, “ah…”), using them to better understand user intent.
- Seamless Interaction: By eliminating the need to constantly open and close microphone streams, the interaction feels fluid and always-on.
To bring these capabilities to a user in a web browser, we need a transport protocol that can handle this continuous, bi-directional flow of high-fidelity media.
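To make that streaming paradigm concrete, here is a minimal sketch of a bidirectional audio session using the google-genai Python SDK's Live API. The model name, config fields, and the mic_chunks/play_audio helpers are illustrative assumptions, and exact method names may differ between SDK versions.
# A minimal sketch of continuous, bidirectional audio with the Gemini Live API
# (google-genai Python SDK; exact model and method names may vary by version)
import asyncio
from google import genai
from google.genai import types

client = genai.Client()  # assumes GOOGLE_API_KEY is set in the environment

async def run_live_session(mic_chunks, play_audio):
    # mic_chunks: async iterator of raw PCM bytes; play_audio: a playback callback
    async with client.aio.live.connect(
        model="gemini-2.0-flash-live-001",          # assumed Live-capable model
        config={"response_modalities": ["AUDIO"]},  # ask for spoken replies
    ) as session:

        async def send():
            # Stream microphone audio continuously, with no explicit turn-taking
            async for chunk in mic_chunks:
                await session.send_realtime_input(
                    audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000")
                )

        async def receive():
            # Play the model's audio as it streams back, chunk by chunk
            async for message in session.receive():
                if message.data:
                    play_audio(message.data)

        await asyncio.gather(send(), receive())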
The “why”: Choosing the right protocol for voice
Modern web applications rely on various protocols, each optimized for specific tasks. Choosing the right one is critical for delivering the best user experience.
The role of WebSockets
WebSockets are the industry standard for real-time, bi-directional data transfer. They run on TCP, which guarantees that every byte of data sent is received in the exact order it was sent. This makes them perfect for critical signaling, text chat, and real-time data dashboards where missing a single data point is unacceptable.
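As a point of reference, here is a minimal signaling sketch using the Python websockets library. The message shapes are hypothetical, but it illustrates the ordered, message-oriented delivery that makes WebSockets a good fit for signaling and other critical data.
# A minimal signaling sketch with the `websockets` library (message shapes are hypothetical)
import asyncio
import json
import websockets

async def signaling_handler(ws):
    # TCP ordering guarantees every signaling message arrives intact and in sequence
    async for raw in ws:
        event = json.loads(raw)
        # e.g., {"type": "join", "room": "support-123"}; acknowledge each event
        await ws.send(json.dumps({"ack": event.get("type")}))

async def main():
    async with websockets.serve(signaling_handler, "0.0.0.0", 8765):
        await asyncio.Future()  # serve forever

if __name__ == "__main__":
    asyncio.run(main())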
The role of WebRTC for media
For live voice and video, however, the priority shifts from perfect order to low latency. In a live conversation, it is better to skip a lost millisecond of audio than to pause the entire conversation to wait for it.
WebRTC (Web Real-Time Communication) was designed specifically for this need. It runs primarily on UDP, a protocol that favors speed and immediacy. It also includes a mature, enterprise-grade media stack directly in the browser that handles:
- Echo Cancellation: Essential for preventing audio feedback loops when using speakers.
- Automatic Gain Control: Normalizes audio volume so users don’t have to shout or whisper.
- Network Adaptation: Dynamically adjusts audio bitrate to match changing network conditions, ensuring the call stays connected even on weak signals.
For our real-time voice agent, WebRTC provides the robust, low-latency foundation we need.
The “easy button”: The LiveKit Agent Framework
Implementing a full WebRTC stack from scratch is a significant engineering undertaking. To accelerate our development, we’ll use LiveKit, an open-source infrastructure framework that is widely adopted by enterprises for its scalability and reliability.
Specifically, we’ll use the LiveKit Agent Framework, a server-side toolkit designed for building “AI participants.” It allows our Python application to join a real-time session just like any other user, giving it the ability to listen, think, and speak into the room.
Infrastructure Note: For this tutorial, we will use LiveKit Cloud, a fully managed service that handles the complex media server infrastructure (SFU) for us. This allows us to focus entirely on writing our agent code. However, because LiveKit is open-source, you retain the full ability to self-host the media infrastructure on your own Google Cloud environment. We will cover self-hosting the LiveKit SFU in detail in Part 3 of this series.
To build our app, we only need to understand these core concepts (a short connection sketch follows the list):
- Room: The secure, virtual session where participants meet.
- Participant: Any entity in a Room. We will have a Human User (on our React frontend) and an AI Agent (on our Python backend).
- Track: A media stream. The user publishes an AudioTrack (their voice), and the agent publishes one back.
- AgentSession: The runtime instance that manages the conversation state, wiring the user’s audio stream directly to the Gemini Live model and back.
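To make these concepts concrete, here is a minimal sketch of a participant joining a Room and reacting to Tracks with the LiveKit Python realtime SDK. The URL and token are assumed inputs, and in practice the Agent Framework handles this wiring for us.
# A minimal sketch of Rooms, Participants, and Tracks with the LiveKit Python SDK
# (the Agent Framework wires this up for us; URL and token are assumed inputs)
from livekit import rtc

async def join_room(livekit_url: str, token: str):
    room = rtc.Room()

    @room.on("participant_connected")
    def on_participant(participant: rtc.RemoteParticipant):
        print(f"Participant joined: {participant.identity}")

    @room.on("track_subscribed")
    def on_track(track: rtc.Track, publication, participant):
        # The user's published AudioTrack arrives here for the agent to consume
        print(f"Subscribed to a {track.kind} track from {participant.identity}")

    # Connect using a token like the one issued in Step 1 below
    await room.connect(livekit_url, token)
    print(f"Connected to room: {room.name}")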
The “how”: Architecture of a real-time voice agent
Let’s get to the code. We’ll build a voice agent app where a user can have a hands-free conversation with an agent to query information and perform actions.
Here is the logical flow of our application. The User (React) and the Agent (Python) both connect to the managed LiveKit Cloud SFU, which acts as a high-performance media router. The agent then handles the intelligence by interacting with Google Cloud APIs.
Let’s build the four critical pieces of this flow, using code snippets adapted from a production-grade prototype.
Step 1: The authentication (Backend: agent.py)
Security is paramount. Before a user can join a room, they must be authenticated. We’ll create a simple Flask endpoint in our agent’s service to issue secure JSON Web Tokens (JWTs).
# From agent.py - A Flask endpoint to grant tokens
from livekit.api import (
    AccessToken,
    VideoGrants,
)
# … (imports for Flask, etc.)

LIVEKIT_API_KEY = os.environ.get("LIVEKIT_API_KEY")
LIVEKIT_API_SECRET = os.environ.get("LIVEKIT_API_SECRET")

@app.route("/token", methods=["GET"])
def token():
    room_name = request.args.get("roomName")
    identity = request.args.get("identity")

    # Create an access token for the user
    access_token = AccessToken(LIVEKIT_API_KEY, LIVEKIT_API_SECRET)

    # Define strict permissions for this user
    grant = VideoGrants(
        room_join=True,
        room=room_name,
        can_publish=True,    # Allow user to speak
        can_subscribe=True,  # Allow user to hear the agent
    )

    # Grant the permissions and return the token
    access_token.with_identity(identity).with_name(identity).with_grants(grant)
    return jsonify({"token": access_token.to_jwt()})
Step 2: The client connection (Frontend: App.tsx)
Our React frontend uses this token to establish a secure WebRTC connection. The @livekit/components-react library abstracts away the complex signaling and media negotiation.
// From App.tsx - The main React component
import { LiveKitRoom } from '@livekit/components-react';
import { useState, useEffect } from 'react';

function App() {
  const [token, setToken] = useState<string>("");

  // … (fetch token from backend on load) …

  if (token === "") {
    return <div>Getting token…</div>;
  }

  return (
    <LiveKitRoom
      token={token}
      serverUrl={import.meta.env.VITE_LIVEKIT_URL}
      connect={true}
      audio={true} // Automatically acquire and publish microphone
      video={false}
    >
      {/* Our App UI goes here */}
      <VoiceAgentApp />
    </LiveKitRoom>
  );
}
Step 3: The agent’s brain (Backend: agent.py)
Now we define the agent itself. We configure it to use the Gemini Live model and provide a “system prompt” that defines its helpful persona.
# From agent.py - Defining the agent’s AI model and persona
from livekit.agents import (
    Agent,
    JobContext,
    llm,
)
from livekit.plugins import google_ai  # LiveKit's plugin for Google AI

class MyAgent(Agent):
    def __init__(self):
        super().__init__()

        # Configure the Gemini Live model for low-latency streaming
        self.llm = google_ai.LLM(
            model="gemini-2.0-flash-live",
            temperature=0.7,
        )

        # Define the agent's operational rules
        self.instructions = """
        You are a helpful voice assistant.
        You are friendly, concise, and helpful.
        When asked for specific information, you MUST use the
        appropriate tool.
        If any required information is missing, you MUST ask the user
        for it explicitly.
        """

        # ... (rest of init) ...
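The class above still needs to be attached to a Room before it can hear or speak. A typical entrypoint in the LiveKit Agent Framework looks roughly like the sketch below; the AgentSession, WorkerOptions, and entrypoint_fnc names follow the framework's common pattern, but exact signatures may differ between versions.
# From agent.py - A sketch of the worker entrypoint that attaches MyAgent to a Room
# (names follow the LiveKit Agents pattern; exact signatures may vary by version)
from livekit.agents import AgentSession, JobContext, WorkerOptions, cli

async def entrypoint(ctx: JobContext):
    # Join the Room that LiveKit dispatched this job for
    await ctx.connect()

    # The AgentSession streams the user's audio to the model and the reply back
    session = AgentSession()
    await session.start(room=ctx.room, agent=MyAgent())

if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))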
Step 4: Grounding with function tools (Backend: agent.py)
To make our agent truly useful, we must “ground” it in real-world data. We do this by defining “tools”—Python functions that the Gemini Live model can request to execute.
# From agent.py - Giving the agent a “tool”
import json

from livekit.agents import function_tool, llm

class MyAgent(Agent):
    # … (init and instructions from above) …

    # Define a tool that the Gemini Live model can decide to use
    @function_tool()
    async def get_information(
        self,
        query: str,
        parameter: str | None = None,
    ) -> str:
        print(f"Gemini requested information: {query}")

        # In a real app, this would call an external API service
        # data = await external_service.get_data(...)

        # Placeholder for demonstration
        data = {"status": "success", "result": "Here is the requested information..."}

        # Return the data as a JSON string to the model
        return json.dumps(data)
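When you replace the placeholder with a real data source, an async HTTP call keeps the agent responsive while it waits. The sketch below assumes a hypothetical internal endpoint and uses aiohttp; swap in whatever service your tool actually needs.
# A sketch of grounding the tool in a real service (hypothetical endpoint, aiohttp)
import aiohttp

async def fetch_external_data(query: str) -> dict:
    # A non-blocking HTTP call keeps the agent streaming audio while it waits
    async with aiohttp.ClientSession() as http:
        async with http.get(
            "https://internal-api.example.com/search",  # hypothetical endpoint
            params={"q": query},
            timeout=aiohttp.ClientTimeout(total=5),
        ) as resp:
            resp.raise_for_status()
            return await resp.json()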
Key features in action
By combining the native streaming capabilities of the Gemini Live model with these grounded tools, we can unlock powerful new user experiences:
- Natural, Complex Requests: Users can speak naturally, combining multiple intents into a single sentence. For example, “Check the status of my order and let me know the estimated delivery date.”
- Real-time Contextual Data: The agent can understand fuzzy requests like “latest”, “nearest”, or “best rated”, and translate them into precise API queries to fetch real-time data.
- Contextual Multi-turn Dialogue: If a user follows up with, “Actually, can you check for my other order instead?” the agent maintains the context of the conversation and switches the query instantly without needing the user to restate the entire request.
Conclusion
We’ve successfully architected a modern, real-time voice AI. We’ve seen how the Gemini Live model enables a new class of continuous, multimodal interactions, and how WebRTC provides the low-latency transport layer to deliver it. By using the LiveKit Agent Framework, we’ve abstracted away the complexity of the media stack, allowing us to focus entirely on building a great user experience.
But our agent is currently just a Python script on a local machine. To serve enterprise-scale traffic, we need a robust, scalable infrastructure.
Coming in Part 2: The stateful worker problem
In Part 2, we will move from “how to code it” to “where to run it.” We will architect a production-grade hosting environment for our Agent Backend using GKE Autopilot, exploring:
- Why GKE Autopilot? We’ll discuss why GKE Autopilot is the ideal choice for hosting stateful, persistent AI workers compared to stateless options.
- Agent Backend Architecture: A deep dive into the specific challenges of hosting the Python agent service, distinct from the media infrastructure.
- Scaling Strategy: How to use Horizontal Pod Autoscaling (HPA) to dynamically adjust our agent pool based on real-time demand.
What’s next?
- Explore the concepts: Start digging into the official Gemini API documentation and the LiveKit Agent framework.
- Ask questions! As Solutions Architects at Google, we love discussing these types of complex, real-world architectures. What are your biggest challenges in building AI applications? Ask us your questions about GKE, Gemini, or WebRTC in the comments below!