CX Agent Studio - Twilio Integration

We have deployed the official Google CES Twilio Adapter (GoogleCloudPlatform/ces-twilio-adapter on GitHub) to connect a Twilio phone number to a CES agent. The integration otherwise works, but audio from the CES agent reaches the caller roughly 60 seconds (or more) after it is generated.

What’s Working

- Phone calls connect successfully

- WebSocket connections establish (both Twilio and CES)

- Bidirectional audio streaming (caller can speak, agent receives)

- CES agent processes queries and generates responses

- Audio from CES agent DOES play to caller

- Audio quality is good and intelligible

- No errors in Cloud Run logs

- No Twilio protocol errors

The Problem

- Audio plays to the caller only after a delay of approximately 60 seconds

- Example timeline:

  - 00:00 - Caller says “Hello”

  - 00:01 - CES agent processes and generates response

  - 01:00 - Caller finally hears the agent’s response

(The Conversational Insights log shows no delay.)
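One way to pin down where the 60 seconds accumulate would be Twilio's `mark` message: the adapter sends a `mark` right after a chunk of audio, and Twilio echoes it back once the audio queued before it has finished playing. Below is a minimal sketch, assuming the adapter can be hooked at the points where it sends to and receives from the Twilio WebSocket. `PlayoutTracker` and its method names are hypothetical and not part of the adapter.

```python
import json
import time


class PlayoutTracker:
    """Measures how long Twilio takes to play audio that was already sent."""

    def __init__(self, stream_sid, clock=time.monotonic):
        self.stream_sid = stream_sid
        self.clock = clock
        self.sent_at = {}  # mark name -> time the mark was sent
        self.counter = 0

    def next_mark(self):
        """Build a mark message to send right after a chunk of audio."""
        self.counter += 1
        name = f"chunk-{self.counter}"
        self.sent_at[name] = self.clock()
        return json.dumps({
            "event": "mark",
            "streamSid": self.stream_sid,
            "mark": {"name": name},
        })

    def on_twilio_message(self, raw):
        """Return the playout lag in seconds for an echoed mark, else None."""
        message = json.loads(raw)
        if message.get("event") != "mark":
            return None
        name = message.get("mark", {}).get("name")
        sent = self.sent_at.pop(name, None)
        if sent is None:
            return None  # unknown or already-handled mark
        return self.clock() - sent
```

If the lag reported this way is close to 60 seconds, the audio is sitting in Twilio's playback queue, which would point at how much audio we send rather than at CES itself.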

Audio Format

- CES Output: LINEAR16, 16000Hz

- Conversion: LINEAR16 16kHz → MULAW 8kHz (using Python audioop)

- Twilio Input: MULAW, 8000Hz

- **Conversion Status:** Working correctly
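Since a resampling mistake could also show up as late playback, here is a quick sanity check on the byte counts. It relies only on arithmetic: LINEAR16 uses 2 bytes per sample and MULAW uses 1 byte per sample, so a correct 16 kHz → 8 kHz conversion should shrink the data by a factor of 4. The function names are hypothetical.

```python
LINEAR16_BYTES_PER_SAMPLE = 2
MULAW_BYTES_PER_SAMPLE = 1


def linear16_seconds(num_bytes, sample_rate=16000):
    """Playback duration of raw LINEAR16 audio."""
    return num_bytes / (sample_rate * LINEAR16_BYTES_PER_SAMPLE)


def mulaw_seconds(num_bytes, sample_rate=8000):
    """Playback duration of raw MULAW audio."""
    return num_bytes / (sample_rate * MULAW_BYTES_PER_SAMPLE)


def conversion_preserves_duration(ces_bytes, twilio_bytes, tolerance=0.02):
    """True if the converted audio plays for about as long as the source."""
    before = linear16_seconds(ces_bytes)
    after = mulaw_seconds(twilio_bytes)
    return abs(before - after) <= tolerance
```

For example, one second of CES output is 32000 bytes and should become 8000 bytes for Twilio. If those 32000 bytes reached Twilio unconverted, they would be interpreted as four seconds of MULAW audio.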

Timestamp Field Investigation

We attempted to add timestamps to audio packets to enable proper pacing:

```json
{
  "event": "media",
  "streamSid": "MZxxxxx",
  "media": {
    "payload": "base64-encoded-audio"
  },
  "timestamp": 12345
}
```

**Result:** Twilio returns a protocol error:

- **Error Code:** 31951

- **Error Message:** “Stream - Protocol - Invalid Message”

- **Description:** “The Streamer has received a message non compliant with the protocol”
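To catch this kind of rejection before a message goes out, a small check like the one below could run in the adapter. It is a sketch under one assumption, based on our reading of Twilio's Media Streams documentation and on the behaviour above: a media message sent to Twilio may carry only `event`, `streamSid`, and `media.payload`. The function name `outbound_media_problems` is hypothetical.

```python
import json

# Assumption: these are the only keys Twilio accepts in a media message sent
# to it. Messages received from Twilio carry more fields.
ALLOWED_TOP_LEVEL_KEYS = {"event", "streamSid", "media"}
ALLOWED_MEDIA_KEYS = {"payload"}


def outbound_media_problems(raw_message):
    """Return reasons the message might be rejected (empty list if none)."""
    message = json.loads(raw_message)
    problems = []
    for key in sorted(set(message) - ALLOWED_TOP_LEVEL_KEYS):
        problems.append(f"unexpected top-level key: {key}")
    media = message.get("media")
    if not isinstance(media, dict):
        problems.append("missing media object")
        return problems
    for key in sorted(set(media) - ALLOWED_MEDIA_KEYS):
        problems.append(f"unexpected media key: {key}")
    if not isinstance(media.get("payload"), str):
        problems.append("media.payload must be a base64 string")
    return problems
```

Run against the message with the extra field, this reports `unexpected top-level key: timestamp`, which matches the error Twilio returned.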

### Current Media Message Format

We are using the clean Twilio format without timestamps:

```json
{
  "event": "media",
  "streamSid": "MZxxxxx",
  "media": {
    "payload": "base64-encoded-audio"
  }
}
```

Result:

- ✅ No protocol errors

- ❌ Audio is delayed ~60 seconds (appears to be buffering)
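Without a timestamp field, the only pacing we can think of is on the sending side: split the MULAW audio into 20 ms frames (160 bytes at 8000 Hz) and send them at roughly real-time rate. Here is a minimal sketch of that idea, assuming `send` is any async callable that writes a text message to the Twilio WebSocket. `media_messages` and `send_paced` are hypothetical names, and we do not know whether this actually removes the delay.

```python
import asyncio
import base64
import json

FRAME_BYTES = 160  # 20 ms of MULAW at 8000 Hz, 1 byte per sample


def media_messages(stream_sid, mulaw_audio, frame_bytes=FRAME_BYTES):
    """Yield one Twilio media message per fixed-size frame of MULAW audio."""
    for start in range(0, len(mulaw_audio), frame_bytes):
        frame = mulaw_audio[start:start + frame_bytes]
        yield json.dumps({
            "event": "media",
            "streamSid": stream_sid,
            "media": {"payload": base64.b64encode(frame).decode("ascii")},
        })


async def send_paced(send, stream_sid, mulaw_audio, frame_seconds=0.02):
    """Send frames at roughly real-time rate instead of all at once.

    Returns the number of messages sent.
    """
    loop = asyncio.get_running_loop()
    next_deadline = loop.time()
    sent = 0
    for message in media_messages(stream_sid, mulaw_audio):
        await send(message)
        sent += 1
        # Schedule against absolute deadlines so timing errors do not add up.
        next_deadline += frame_seconds
        delay = next_deadline - loop.time()
        if delay > 0:
            await asyncio.sleep(delay)
    return sent
```

The messages keep exactly the format shown above, so this should not reintroduce the protocol error.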

IAM Permissions Configured

The service account has the following permissions:

1. **roles/dialogflow.client** - For CES API access

2. **roles/contactcenteraiplatform.viewer** - For CES platform access

3. **roles/secretmanager.secretAccessor** - For Twilio credentials

4. **Custom role: ces_session_runner**

   - Permission: `ces.sessions.bidiRunSession`

   - Required for bidirectional CES sessions

All permissions are working correctly (no permission errors in logs).

Questions

1. Is the timestamp field supported in Twilio Media Streams with CES?

   - If yes, what is the correct format/implementation?

   - If no, how can we achieve real-time audio playback?

2. Is the 60-second audio delay a known issue with the CES Twilio Adapter?

   - Are there configuration options to reduce buffering?

   - Is there an updated version of the adapter available?

3. What is the recommended approach for real-time audio playback?

   - Should we use a different audio format?

   - Are there additional Twilio webhook configurations needed?

   - Should we implement custom buffering/pacing logic?

4. Are there any CES-specific configurations that affect audio timing?

   - Agent configuration settings?

   - Deployment-level settings?

   - Regional latency considerations?

5. Is there documentation or examples of working real-time voice integrations with CES?

   - Reference architectures?

   - Best practices?

   - Performance tuning guides?

@Google - why are you requiring a standalone audio streaming microservice for Twilio integration?

With CX, there was one-click deployment.
With Voiceflow, it’s plug and play.

This seems like a pretty clunky part of the new offering.