We have deployed the official Google CES Twilio Adapter (GoogleCloudPlatform/ces-twilio-adapter on GitHub) to connect a Twilio phone number to a CES agent. The integration otherwise works correctly, but audio from the CES agent is delayed by approximately 60 seconds (or longer) before it plays to the caller.
What’s Working
- Phone calls connect successfully
- WebSocket connections establish (both Twilio and CES)
- Bidirectional audio streaming (caller can speak, agent receives)
- CES agent processes queries and generates responses
- Audio from CES agent DOES play to caller
- Audio quality is good and intelligible
- No errors in Cloud Run logs
- No Twilio protocol errors
The Problem
- Audio has approximately 60-second delay before playing to caller
- Example timeline:
  - 00:00 - Caller says “Hello”
  - 00:01 - CES agent processes and generates response
  - 01:00 - Caller finally hears the agent’s response

(No delays are shown in the Conversational Insights log.)
Audio Format
- CES Output: LINEAR16, 16000Hz
- Conversion: LINEAR16 16kHz → MULAW 8kHz (using Python audioop)
- Twilio Input: MULAW, 8000Hz
- **Conversion Status:** Working correctly
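The byte rates implied by these formats give a quick way to check how much audio has been handed to Twilio relative to wall-clock time. The following is only a sketch of that arithmetic: it assumes mono audio on both sides, and the function names are hypothetical rather than part of the adapter.

```python
# Byte-rate arithmetic for the two formats listed above (mono assumed).
LINEAR16_16K_BYTES_PER_SEC = 16000 * 2  # 16-bit samples: 32000 bytes/s
MULAW_8K_BYTES_PER_SEC = 8000 * 1       # 8-bit samples: 8000 bytes/s


def linear16_duration_sec(num_bytes: int) -> float:
    """Playback duration of a LINEAR16 16 kHz buffer."""
    return num_bytes / LINEAR16_16K_BYTES_PER_SEC


def mulaw_duration_sec(num_bytes: int) -> float:
    """Playback duration of a MULAW 8 kHz buffer."""
    return num_bytes / MULAW_8K_BYTES_PER_SEC


def queued_ahead_sec(mulaw_bytes_sent: int, elapsed_sec: float) -> float:
    """Seconds of audio sent beyond what could have played in elapsed_sec."""
    return max(0.0, mulaw_duration_sec(mulaw_bytes_sent) - elapsed_sec)
```

Logging `queued_ahead_sec` per call would show whether the adapter is sending audio well ahead of real time, which is one possible (unconfirmed) source of buffering.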
Timestamp Field Investigation
We attempted to add timestamps to audio packets to enable proper pacing:
```json
{
  "event": "media",
  "streamSid": "MZxxxxx",
  "media": {
    "payload": "base64-encoded-audio"
  },
  "timestamp": 12345
}
```
**Result:** Twilio returns protocol error:
- **Error Code:** 31951
- **Error Message:** “Stream - Protocol - Invalid Message”
- **Description:** “The Streamer has received a message non compliant with the protocol”
### Current Media Message Format
We are using the clean Twilio format without timestamps:
```json
{
  "event": "media",
  "streamSid": "MZxxxxx",
  "media": {
    "payload": "base64-encoded-audio"
  }
}
```
Result:
- No protocol errors
- Audio delays ~60 seconds (appears to be buffering)
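For reference, a message in the timestamp-free format can be built from converted mu-law audio roughly as follows. This is a minimal sketch, not the adapter's actual code: the 20 ms frame size (160 bytes at 8 kHz mu-law) is an illustrative assumption, and `split_frames` and `media_message` are hypothetical helper names.

```python
import base64
import json

# Assumption: 20 ms frames. 8000 bytes/s * 0.020 s = 160 bytes per frame.
FRAME_BYTES = 160


def split_frames(mulaw: bytes, frame_bytes: int = FRAME_BYTES) -> list[bytes]:
    """Split a mu-law buffer into fixed-size frames (the last may be shorter)."""
    return [mulaw[i:i + frame_bytes] for i in range(0, len(mulaw), frame_bytes)]


def media_message(stream_sid: str, frame: bytes) -> str:
    """Serialize one frame as a media message with no extra fields."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(frame).decode("ascii")},
    })
```

Keeping the message limited to `event`, `streamSid`, and `media.payload` matches the format that produced no protocol errors for us.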
IAM Permissions Configured
The service account has the following permissions:
1. **roles/dialogflow.client** - For CES API access
2. **roles/contactcenteraiplatform.viewer** - For CES platform access
3. **roles/secretmanager.secretAccessor** - For Twilio credentials
4. **Custom role: ces_session_runner**
   - Permission: `ces.sessions.bidiRunSession`
   - Required for bidirectional CES sessions
All permissions are working correctly (no permission errors in logs).
1. Is the timestamp field supported in Twilio Media Streams with CES?
   - If yes, what is the correct format/implementation?
   - If no, how can we achieve real-time audio playback?
2. Is the 60-second audio delay a known issue with the CES Twilio Adapter?
   - Are there configuration options to reduce buffering?
   - Is there an updated version of the adapter available?
3. What is the recommended approach for real-time audio playback?
   - Should we use a different audio format?
   - Are there additional Twilio webhook configurations needed?
   - Should we implement custom buffering/pacing logic?
4. Are there any CES-specific configurations that affect audio timing?
   - Agent configuration settings?
   - Deployment-level settings?
   - Regional latency considerations?
5. Is there documentation or examples of working real-time voice integrations with CES?
   - Reference architectures?
   - Best practices?
   - Performance tuning guides?
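To make question 3 concrete, the kind of custom pacing logic it refers to could look roughly like the sketch below. It is illustrative only and rests on assumptions: frames are 20 ms each, `send` is a hypothetical coroutine wrapping the Twilio WebSocket send, and nothing here is taken from the adapter. Whether pacing on our side is needed at all is part of what we are asking.

```python
import asyncio
import time
from typing import Awaitable, Callable, Iterable

FRAME_SEC = 0.020  # assumed frame duration (160 bytes of 8 kHz mu-law)


async def send_paced(
    frames: Iterable[bytes],
    send: Callable[[bytes], Awaitable[None]],
    frame_sec: float = FRAME_SEC,
) -> None:
    """Send frames no faster than real time, one per frame_sec."""
    start = time.monotonic()
    for i, frame in enumerate(frames):
        # Schedule against the start time so sleep jitter does not accumulate.
        delay = start + i * frame_sec - time.monotonic()
        if delay > 0:
            await asyncio.sleep(delay)
        await send(frame)
```

Scheduling each frame against the start time, rather than sleeping a fixed interval after every send, keeps the average rate at one frame per `frame_sec` even when individual sleeps overshoot.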