We have all been there: you dial a customer service number, only to be greeted by a robotic voice informing you that you are “number 30 in the queue”. What follows is a loooong ordeal of repetitive hold music and a “sorry, we are experiencing unusually high call volumes” message.
This isn’t just a minor inconvenience — it’s a failure in customer experience. Historically, automating these interactions meant building a voice bot that felt just as robotic as the queue itself. Until recently, the standard engineering approach was to chain three distinct models together: Speech-to-Text (STT), followed by an LLM or AI Agent for intent detection, reasoning and answer generation, and finally a Text-to-Speech (TTS) model to voice the response.
This “assembly line” architecture has several critical flaws from both a technical and business perspective:
- The “Walkie-Talkie” Latency: Each stage must wait for the previous one to complete, leading to awkward, multi-second delays that kill the flow of natural conversation.
- Emotional Information Decay: Translating raw speech into plain text strips away the “how” of what was said — losing the user’s intonation, pacing, and pitch.
- Neglected Expression: Classical TTS models focus on lexical accuracy but neglect the emotional warmth required to de-escalate a frustrated caller.
Google’s Gemini Live API fundamentally changes this landscape by offering a unified, native audio architecture. Instead of a disjointed chain, the model processes continuous streams of audio, video, or text to deliver immediate, human-like spoken responses.
Key features that solve the “Queue Nightmare” include:
- Ultra-Low Latency: It is the fastest model for generating the first token of audio output, enabling real-time, fluid dialogue.
- Affective Dialog: The model adapts its tone and style to match the user’s emotional expression — essential for providing empathetic support.
- Proactive Audio: The model can intelligently distinguish between the speaker and background noise, knowing exactly when to respond and when to listen.
In this article, I will show you how to connect this new “Agentic AI” with traditional telephony. We will look at how to use FreeSWITCH as an operating system for telephony to pipe calls directly into a multi-agent application powered by Google Agent Development Kit (ADK) and Gemini Live API.
Image generated with Google Gemini Nano Banana
Let’s have a closer look at the “brain” of the operation. You can experiment with the Gemini Live API directly from Vertex AI Studio by selecting the “Live API” option to test real-time interactions before writing a single line of code:
Key Features of the Gemini Live API
The Live API is a native multimodal, multilingual engine designed for low-latency, real-time use cases.
Here are the key features that make it really stand out:
- 128k Context Window: Makes it possible to maintain complex, long-running conversation history without losing track of previous turns.
- Ultra-Low Latency: The fastest model at generating the first token of audio output.
- Multimodal: It is not just voice. It also works with video, so the model can reason based on what it sees, not only on what it hears. Think about using your phone or laptop camera, or sharing your screen, and asking the agent for help. This is massive!
- Affective Dialog: The model detects emotion in the user’s voice (frustration, confusion, joy) and adapts its own tone to match.
- Proactive Audio: Gemini can distinguish between the speaker and background noise, intelligently deciding when to respond or remain silent.
- 30+ HD Voices: Choose from a broad library of natural-sounding voices across 30+ languages.
- Multilingual Sync: Supports simultaneous multilingual conversations, identifying the spoken language on the fly without settings changes.
- Environmental Filtering: Native noise suppression allows for clear conversations even in loud, outdoor, or industrial settings.
Beyond that, the Live API integrates directly with the tools required to interact with our knowledge and operational systems:
- Real-Time Grounding: Uses Function Calling and Google Search to fetch factual, up-to-the-minute information during a live call.
- Built-in Session Management: Maintains context across long conversations.
- Ephemeral Tokens: Provide client-side security. Like standard API keys, ephemeral tokens can be extracted from client-side applications such as web browsers or mobile apps, but because they expire quickly and can be restricted, they significantly reduce security risks in a production environment.
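To make the feature toggles above concrete, the sketch below assembles a Live API session config as a plain dictionary. The field names follow the public Live API config shape, but treat the exact keys (and the default voice and language values) as assumptions to verify against your SDK version:

```python
def build_live_config(
    voice: str = "Puck",
    language: str = "pl-PL",
    proactive: bool = True,
    affective: bool = True,
) -> dict:
    """Assemble a Live API session config enabling the features above."""
    config: dict = {
        "response_modalities": ["AUDIO"],
        "speech_config": {
            "voice_config": {"prebuilt_voice_config": {"voice_name": voice}},
            "language_code": language,
        },
    }
    if proactive:
        # Proactive Audio: let the model decide when to speak
        config["proactivity"] = {"proactive_audio": True}
    if affective:
        # Affective Dialog: adapt tone to the caller's emotional state
        config["enable_affective_dialog"] = True
    return config
```

The same switches reappear later in this article as fields on the ADK RunConfig, so keeping them in one builder makes it easy to flip features per call.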
Upgrading to Live Multiagent Apps
If you are already using the Google Agent Development Kit (ADK), the transition to real-time voice and video is incredibly simple. You don’t need to rewrite your agent’s logic.
By simply updating your Agent Declaration to point to a supported Gemini Live model (like gemini-live-2.5-flash-native-audio), your existing tools, prompts, and reasoning paths are instantly voice- and video-enabled:
"""Google Search Agent definition for ADK Bidi-streaming demo."""
import os
from google.adk.agents import Agent
from google.adk.tools import google_search
from dotenv import load_dotenv
from google.adk.tools import VertexAiSearchTool
load_dotenv()
SEARCH_ENGINE_ID = os.getenv('SEARCH_ENGINE_ID')
SEARCH_DATASTORE_ID = os.getenv('SEARCH_DATASTORE_ID')
private_knowledge_corpus = VertexAiSearchTool(
search_engine_id = SEARCH_ENGINE_ID,
max_results = 10
)
knowledge_agent = Agent(
model=os.getenv("GEMINI_LIVE_MODEL"),
name="company_knowledge_expert",
instruction="You are MyCompany expert with access knowledge database",
tools=[private_knowledge_corpus],
)
ai_specialist = Agent(
name="google_ai_specialist",
model=os.getenv("GEMINI_LIVE_MODEL"),
tools=[google_search],
instruction="You are cloud expert focusing on Google AI",
)
marketing_specialist = Agent(
name="marketing_specialist",
model=os.getenv("GEMINI_LIVE_MODEL"),
tools=[google_search],
instruction="You are marketing specialist"
)
root_agent = Agent(
name="company_agent",
model=os.getenv("GEMINI_LIVE_MODEL"),
tools=[google_search],
sub_agents = [ai_specialist, marketing_specialist, knowledge_agent],
instruction="You are a helpful assistant with access to specialist agents for marketing, company knowledge and ai specialits"
)
Once your agents are defined, you can launch the ADK WebUI to interact with them immediately.
It is an impressive experience — especially when you realize that you can simply toggle video collection to enable full multimodal capabilities. This allows the agent to process real-time audio and video streams simultaneously, answering questions about what it “sees” through your camera with the same low latency as a voice-only call.
From Foundations to Phone Calls
We now have the technical foundation in place: a sophisticated AI agent capable of multimodal reasoning. The next step is to bridge this agent to the real world so users can simply dial a phone number and speak to it directly.
Instead of reaching a human operator or a frustrating touch-tone menu, the caller will engage in a natural conversation with the ADK Voice Agent powered by Gemini Live.
Your first thought might be that connecting a phone line to AI Agents is incredibly complex. However, the reality is much simpler than you might expect. To understand how we bridge these two worlds — the Agentic AI runtime and the traditional telephone network (PSTN) — we need to look at the underlying infrastructure.
The Infrastructure: Bridging the PSTN and AI
At the core of this architecture is FreeSWITCH, an open-source framework that serves as the “operating system” for telephony. It manages the complex, low-level mechanics of answering calls, mixing audio, and orchestrating media streams. While it powers some of the world’s largest telecom infrastructures, for our purposes, you can view it as a containerized service running on a virtual machine that hands off calls to our AI agent.
Connecting to the Real World via SIP
To interact with the outside world, FreeSWITCH uses SIP (Session Initiation Protocol) — the universal standard telecom providers use to initiate and manage real-time sessions (signaling) over the internet.
Since FreeSWITCH is internet-based software, it cannot connect directly to physical cell towers or copper landlines. It requires a SIP Trunk Provider to act as the bridge. The provider is also needed for two additional reasons:
- Acquiring a Phone Number: You must purchase a legitimate DID (Direct Inward Dialing) number (e.g., a Polish +48 number).
- Inbound Routing: When a user calls, the signal travels through the traditional phone network to the provider’s data center, where it is converted into VoIP packets and forwarded to your FreeSWITCH server.
From there, FreeSWITCH generates a bidirectional audio stream that we can pipe directly into our Gemini Live-powered ADK application.
There are thousands of companies acting as SIP trunk providers. For this demo, I used Halonet.pl. While the process is rather straightforward, there are a few practical considerations:
- Cost: Maintaining a number is inexpensive (roughly 1.23 PLN/month), but you must pre-fund your account for per-minute usage charges. Factor these per-minute costs into your TCO calculations.
- Verification: To comply with anti-terrorism regulations, you must undergo profile verification. This typically involves submitting a bank transfer confirmation and can take up to two business days.
Once you have acquired a phone number, you are ready to deploy the FreeSWITCH server. For my experiments, I containerized the software and ran it on a Google Cloud virtual machine.
Routing Calls with the Dialplan
To handle incoming calls, FreeSWITCH requires a routing logic defined in an XML-based configuration file called the Dialplan. When a call arrives from your SIP provider, the Dialplan matches the destination number against specific rules — often using regex patterns.
FreeSWITCH uses various modules as action nodes within this plan. These allow you to intercept a call to execute custom scripts (such as Python or Lua) or open a WebSocket connection.
Below is a sample Dialplan that instructs FreeSWITCH to answer every call and simply echo back what the user says. Simultaneously, it records the conversation to a file that can be synced to a GCS bucket.
```xml
<include>
  <extension name="inbound_halonet">
    <condition field="destination_number" expression="^.*$">
      <action application="answer"/>
      <action application="sleep" data="1000"/>
      <action application="log" data="INFO Starting Recording to GCS..."/>
      <action application="set" data="RECORD_TITLE=Call-${destination_number}"/>
      <action application="set" data="RECORD_STEREO=true"/>
      <action application="record_session" data="/recordings/rec_${strftime(%Y-%m-%d-%H-%M-%S)}_${caller_id_number}.wav"/>
      <action application="echo"/>
    </condition>
  </extension>
</include>
```
Linking the FreeSWITCH Server with the SIP Trunk Provider via SIP Profiles
To ensure your SIP Trunk Provider knows where to send calls, you must configure a SIP Profile. This XML file contains the authentication and gateway details needed to link your local server with your SIP account. Registration with the provider occurs automatically as soon as the FreeSWITCH server starts. Here is the configuration I used for the Halonet gateway:
```xml
<include>
  <gateway name="halonet">
    <param name="proxy" value="sip.halonet.pl"/>
    <param name="username" value="xxxxxxxxx"/>
    <param name="password" value="yyyyyyyyyyyyyyy"/>
    <param name="register" value="true"/>
    <param name="from-domain" value="sip.halonet.pl"/>
    <param name="from-user" value="solvewithlucas"/>
    <param name="caller-id-in-from" value="true"/>
    <param name="extension" value="auto_to_user"/>
    <param name="extension-in-contact" value="true"/>
  </gateway>
</include>
```
Adding a Virtual Operator
At this stage, our server acts like a “parrot” — it can receive calls and echo audio. Our real goal, however, is to replace that echo with an AI agent powered by Gemini Live.
To do this, we need to run our AI agent as a standalone service. I implemented it using FastAPI, exposing a WebSocket endpoint for real-time, bidirectional (Bidi) streaming.
The service performs the following steps:
- Session Initialization: It initializes or reinitializes the Google ADK session.
- RunConfig Setup: It configures a RunConfig object to control the behavior of the Gemini Live model. Through this config, you can define specific languages, select voices, and enable native Gemini Live features like affective dialog (emotional intelligence) and proactivity (allowing the model to speak without being prompted).
- Concurrent Streaming: To manage the real-time nature of the conversation, the service relies on a non-blocking architecture driven by two concurrent asynchronous tasks. This ensures that our agent can ingest audio and generate a response simultaneously, maintaining the low latency essential for a natural flow.
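The two-task pattern can be sketched independently of ADK: an upstream producer feeds a queue while a downstream consumer drains it, and neither blocks the other. This is a simplified skeleton; `receive_audio` and `send_event` are hypothetical stand-ins for the WebSocket reader and the event sender:

```python
import asyncio


async def bidi_session(receive_audio, send_event) -> None:
    """Run upstream (ingest) and downstream (respond) tasks concurrently."""
    queue: asyncio.Queue = asyncio.Queue()

    async def upstream():
        # Ingest audio chunks as they arrive and enqueue them
        async for chunk in receive_audio():
            await queue.put(chunk)
        await queue.put(None)  # sentinel: input stream closed

    async def downstream():
        # Emit responses concurrently, without ever blocking ingestion
        while (item := await queue.get()) is not None:
            await send_event(item)

    # Both coroutines run simultaneously on the event loop
    await asyncio.gather(upstream(), downstream())
```

In the real service the queue role is played by ADK’s LiveRequestQueue, but the concurrency shape is the same.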
The Data Flow: Upstream and Downstream
The core logic is handled within our WebSocket endpoint:
- The Upstream Task: This task continuously listens for incoming messages from the WebSocket. When it receives raw audio bytes, it wraps them into a Blob (specifically 16-bit PCM at 16kHz) and pushes them into a LiveRequestQueue for the AI agent to process.
```python
@app.websocket("/ws/{user_id}/{session_id}")
async def websocket_endpoint(
    websocket: WebSocket,
    user_id: str,
    session_id: str,
    proactivity: bool = False,
    affective_dialog: bool = False,
) -> None:
    """WebSocket endpoint for bidirectional streaming with ADK.

    Args:
        websocket: The WebSocket connection
        user_id: User identifier
        session_id: Session identifier
        proactivity: Enable proactive audio (native audio models only)
        affective_dialog: Enable affective dialog (native audio models only)
    """
    await websocket.accept()
    logger.debug("WebSocket connection accepted")

    # ========================================
    # Phase 2: Session Initialization (once per streaming session)
    # ========================================
    # Build RunConfig with optional proactivity and affective dialog.
    # These features are only supported on native audio models.
    run_config = RunConfig(
        streaming_mode=StreamingMode.BIDI,
        response_modalities=response_modalities,
        input_audio_transcription=types.AudioTranscriptionConfig(),
        output_audio_transcription=types.AudioTranscriptionConfig(),
        session_resumption=types.SessionResumptionConfig(),
        proactivity=(
            types.ProactivityConfig(proactive_audio=True)
            if proactivity
            else None
        ),
        enable_affective_dialog=affective_dialog if affective_dialog else None,
    )

    # Get or create session (handles both new sessions and reconnections)
    session = await session_service.get_session(
        app_name=APP_NAME, user_id=user_id, session_id=session_id
    )
    if not session:
        await session_service.create_session(
            app_name=APP_NAME, user_id=user_id, session_id=session_id
        )

    live_request_queue = LiveRequestQueue()

    # ========================================
    # Phase 3: Active Session (concurrent bidirectional communication)
    # ========================================
    async def upstream_task() -> None:
        """Receives messages from WebSocket and sends to LiveRequestQueue."""
        logger.debug("upstream_task started")
        while True:
            # Receive message from WebSocket (text or binary)
            message = await websocket.receive()
            # Handle binary frames (audio data)
            if "bytes" in message:
                audio_data = message["bytes"]
                logger.debug(
                    f"Received binary audio chunk: {len(audio_data)} bytes"
                )
                audio_blob = types.Blob(
                    mime_type="audio/pcm;rate=16000", data=audio_data
                )
                live_request_queue.send_realtime(audio_blob)
```
- The Downstream Task: This task consumes generated events from the ADK runner’s run_live() generator. It takes the model’s responses — which include both text transcripts and generated audio — and streams them back through the WebSocket toward the FreeSWITCH server.
```python
session_service = InMemorySessionService()
runner = Runner(
    app_name=APP_NAME,
    agent=agent,
    session_service=session_service,
)

async def downstream_task() -> None:
    async for event in runner.run_live(
        user_id=user_id,
        session_id=session_id,
        live_request_queue=live_request_queue,
        run_config=run_config,
    ):
        event_json = event.model_dump_json(exclude_none=True, by_alias=True)
        logger.debug(f"[SERVER] Event: {event_json}")
        await websocket.send_text(event_json)
```
For optimal performance with Gemini, the service adheres to specific audio characteristics:
- Input Requirements: 16-bit PCM at a 16kHz sample rate (audio/pcm;rate=16000).
- Output Specifications: 16-bit PCM, mono, at a 24kHz sample rate.
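Note that PSTN audio is natively captured at 8kHz, so somewhere in the chain a resampling step is needed to reach Gemini’s 16kHz input rate (modules like mod_audio_stream can do this for you). If your media layer does not resample, a minimal linear-interpolation resampler for mono 16-bit PCM looks like this — a sketch for experimentation, not production-grade DSP:

```python
import struct


def resample_pcm16(data: bytes, src_rate: int, dst_rate: int) -> bytes:
    """Resample mono 16-bit little-endian PCM via linear interpolation."""
    n_in = len(data) // 2
    if n_in == 0:
        return b""
    samples = struct.unpack("<%dh" % n_in, data[: n_in * 2])
    n_out = int(n_in * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        # Fractional position of this output sample in the input stream
        pos = i * src_rate / dst_rate
        j = int(pos)
        frac = pos - j
        a = samples[min(j, n_in - 1)]
        b = samples[min(j + 1, n_in - 1)]
        out.append(int(a + (b - a) * frac))
    return struct.pack("<%dh" % n_out, *out)
```

For real deployments, prefer a proper polyphase resampler (e.g., from a DSP library) to avoid the aliasing this naive approach introduces.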
In my deployment, I containerized the agent service and ran it on the same virtual machine as FreeSWITCH to minimize network hop latency.
Bridging the AI Agent with the FreeSWITCH Server
To bridge our FreeSWITCH telephony server and our WebSocket-based ADK service, we need a specialized middleware that acts as a real-time translator. This “Bridge” service exposes a WebSocket endpoint that FreeSWITCH connects to at the start of every call.
Inside this middleware, we orchestrate two asynchronous tasks to maintain a fluid, bi-directional conversation:
- The Upstream Tunnel: This task captures raw audio bytes directly from the phone line via FreeSWITCH and forwards them instantly to the Gemini Agent service. It acts as a transparent pipe, ensuring the AI Agent “hears” the caller in real time.
```python
agent_url = f"{AGENT_BASE_URL}/{user_id}/{uuid}"
try:
    async with websockets.connect(agent_url) as agent_ws:
        logger.info(f"🚀 Connected to Agent: {agent_url}")

        # Task A: FreeSWITCH (Mic) -> Gemini Agent
        async def upstream():
            try:
                while True:
                    msg = await websocket.receive()
                    if "bytes" in msg:
                        # Forward raw audio bytes to the agent
                        await agent_ws.send(msg["bytes"])
                    elif "text" in msg:
                        pass
            except Exception:
                pass
```
- The Downstream Translator: This is where the complex work happens. It listens for the AI Agent’s JSON events, which contain both text transcriptions and base64-encoded audio. The bridge extracts these audio chunks and packages them into a format FreeSWITCH understands, allowing the AI’s voice to be played back to the caller.
```python
async def downstream():
    try:
        async for message in agent_ws:
            try:
                data = json.loads(message)
            except json.JSONDecodeError:
                continue  # skip non-JSON frames
            if "content" in data:
                parts = data["content"].get("parts", [])
                # Process audio parts
                for part in parts:
                    if "inlineData" in part:
                        b64_audio = part["inlineData"]["data"]
                        # Send to FreeSWITCH
                        fs_payload = {
                            "type": "streamAudio",
                            "data": {
                                "audioDataType": "raw",
                                "sampleRate": SAMPLE_RATE,
                                "audioData": b64_audio,
                            },
                        }
                        await websocket.send_json(fs_payload)
    except Exception:
        pass  # connection closed
```
By default, FreeSWITCH needs an extra “skill” to stream audio over WebSockets effectively. For this, I utilized mod_audio_stream, a lightweight yet powerful module designed to stream L16 audio from a telephony channel to a WebSocket endpoint and handle the returning audio packets.
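Isolating the event-to-payload translation as a pure function makes it easy to unit test outside the WebSocket loop. The sketch below assumes the same camelCase (by_alias) event shape shown in the downstream snippet above; verify the field names against the ADK version you run:

```python
import json

SAMPLE_RATE = 24000  # Gemini Live native audio output rate


def adk_event_to_fs_payloads(event_json: str) -> list:
    """Translate one ADK event into zero or more mod_audio_stream messages.

    Audio arrives base64-encoded in 'inlineData' parts and is re-wrapped
    in the 'streamAudio' envelope that mod_audio_stream expects.
    """
    data = json.loads(event_json)
    payloads = []
    for part in data.get("content", {}).get("parts", []):
        if "inlineData" in part:
            payloads.append({
                "type": "streamAudio",
                "data": {
                    "audioDataType": "raw",
                    "sampleRate": SAMPLE_RATE,
                    "audioData": part["inlineData"]["data"],
                },
            })
    return payloads
```

Text-only parts (transcripts) pass through untouched here; you could log them or forward them to a monitoring channel in the same loop.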
The Final Piece: Dialplan
With our bridge services in place, we must modify the FreeSWITCH Dialplan. Instead of a simple echo, the Dialplan now instructs the server to answer the call, initiate the audio stream to our bridge service URL, and “park” the call to keep the media session active while the AI takes over.
```xml
<include>
  <extension name="inbound_agent">
    <condition field="destination_number" expression="^.*$">
      <action application="answer"/>
      <action application="sleep" data="200"/>
      <action application="export" data="STREAM_PLAYBACK=1"/>
      <action application="log" data="INFO 🎤 STARTING CALL: ${caller_id_number}"/>
      <action application="set" data="stream_res=${uuid_audio_stream ${uuid} start ws://127.0.0.1:8080/live/${uuid}?caller_id=${caller_id_number} mono 24000}"/>
      <action application="park"/>
    </condition>
  </extension>
</include>
```
With this configuration, your infrastructure is fully wired. It’s time to pick up your phone, dial your number, and experience a low-latency, multimodal conversation powered by Gemini Live.
For those who want to focus entirely on their agent definition without the overhead of managing a custom telephony stack, there is a streamlined alternative: Gemini Enterprise for Customer Experience (GECX).
The Managed Path: Gemini Enterprise (GECX)
GECX leverages the same Google ADK framework behind the scenes but offers a high-level, low-code interface for building, evaluating, and monitoring your agents. Turning a standard AI agent into a sophisticated voice agent is as simple as configuring it to use a native audio-to-audio model, such as gemini-live-2.5-flash-native-audio:
GECX also makes it easy to connect your AI agent to various data sources and functional tools to build truly “action-oriented” voice assistants. This includes:
- Google Managed Connectors: Quickly integrate with established platforms like Salesforce, ServiceNow, and Cloud Storage using pre-built integration connectors.
- MCP Servers: Connect your agents to Model Context Protocol (MCP) servers to standardize how they access local or remote data and services.
- API Integrations: Use OpenAPI specifications to allow your agent to interact with any web-based service or internal microservice.
- Custom Python Code: For maximum flexibility, you can define custom tools directly in Python, allowing the agent to execute specific logic, perform complex calculations, or handle proprietary data formats.
- Grounding and Search: Enhance factual accuracy by enabling Google Search or Vertex AI Search to retrieve real-time information during a live session.
When you are ready, GECX allows you to deploy your agent as a web widget or to a fully managed telco server, providing a seamless customer experience across platforms.
In just a few clicks, you can acquire a fully managed phone number and a Telco server that automatically routes calls to your AI agent.
The Trade-off: While GECX is incredibly efficient, the managed phone numbers are currently US-based. However, as explained in this article, a telco server like FreeSWITCH can be registered with any local SIP trunk provider, so you may be able to register a local number (such as a Polish +48) by working with your Google account team. The custom FreeSWITCH approach detailed earlier in this post also remains a viable option.
Deploying as widget to your web application is also quite straightforward:
What’s Next?
It is truly impressive how much can be achieved with a few clicks in a managed environment. However, if you prefer to maintain full ownership and flexibility over your integration, there is much more to explore.
In my next article, I will explain how to repurpose our Agent Service as a backend microservice for web applications. We will dive into using WebSockets to stream not just audio, but also video snapshots, allowing your ADK agent to answer questions in real time based on what it “sees” through a user’s camera.
Stay tuned! It’s going to be live and multimodal!
This article is authored by Lukasz Olejniczak — Customer Engineer at Google Cloud. The views expressed are those of the authors and don’t necessarily reflect those of Google.
Please clap for this article if you enjoyed reading it. For more about Google Cloud, data science, data engineering, and AI/ML, follow me on LinkedIn.