ADK Agents + Cloud Run deployment + Session on Vertex AI Agent Engine issue

Hi,

I developed a multi-agent system and deployed it on Google Cloud Run (because streaming is broken on Agent Engine deployments). I use Agent Engine for session and memory management.

I have now integrated this into my React-based UI application, using CopilotKit as a wrapper to manage conversations with the agent.

To scale, I run multiple instances of my UI on Cloud Run (I also tried multiple pods on k8s).

I am facing issues when receiving events from /agent-events: some events get lost, messages from the agent arrive out of order, and sometimes the conversation from an old session is picked up in the current session. I use one session per conversation and keep the history in the chat bar.

I tried a single instance of the UI on Cloud Run and everything works, so I concluded that the issue comes from running multiple UI instances: events and responses are received by different instances at random.

So the conversation happens on one instance and the events are received by another instance.

Has anyone come across this distributed-systems issue and implemented a fix? I have some ideas, but I would like to hear from the expert community here.


Hi @Anil_Singh, this sounds like a stateless multi-instance issue. When you scale Cloud Run, requests and event callbacks can land on different instances, so if you rely on in-memory session state or local event handling, events can appear out of order or tied to the wrong session.

You should not depend on instance memory for session or event routing. Instead use a shared backend layer such as Firestore, Redis (Memorystore), or a database to persist session state and message ordering. For streaming or event delivery, consider Pub/Sub or a message queue so events are processed deterministically and tied to a session ID.
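As a rough illustration of the idea, here is a minimal sketch of a per-session event log with sequence numbers. The `SharedEventLog` class, its method names, and the `Event` shape are all hypothetical. The in-memory dict is only a stand-in so the example runs on its own; in a real deployment the same two operations would be backed by the shared store (Redis, Firestore, or a database) so that every instance sees the same log.

```python
import threading
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    session_id: str
    seq: int       # per-session sequence number, starts at 1
    payload: str


class SharedEventLog:
    """Hypothetical stand-in for a shared store.

    Assumption: in production this state lives outside the instance
    (for example in Redis), not in process memory as it does here.
    """

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._events: dict[str, list[Event]] = defaultdict(list)

    def append(self, session_id: str, payload: str) -> Event:
        # Assign the next sequence number for this session atomically,
        # so ordering does not depend on which instance wrote the event.
        with self._lock:
            seq = len(self._events[session_id]) + 1
            event = Event(session_id, seq, payload)
            self._events[session_id].append(event)
            return event

    def read_after(self, session_id: str, last_seq: int) -> list[Event]:
        # Return this session's events newer than the caller's cursor,
        # in sequence order. Other sessions are never included.
        with self._lock:
            return [e for e in self._events.get(session_id, []) if e.seq > last_seq]
```

Any instance can call `append` when the agent emits an event, and any instance can serve the UI by calling `read_after` with the last sequence number the client has seen, so a missed event can be recovered on the next read.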

Also ensure every request and event includes an explicit session ID and conversation ID, and that your UI always fetches state from the shared store rather than local memory.
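On the receiving side, a small guard can enforce this. The sketch below is only an assumption about how events might look: the `session_id` and `seq` fields and the `OrderedSessionBuffer` class are hypothetical names, not part of ADK or CopilotKit. It drops events that belong to another session, ignores duplicates, and releases events strictly in sequence order.

```python
from dataclasses import dataclass, field


@dataclass
class OrderedSessionBuffer:
    """Hypothetical reorder buffer for one active session."""

    session_id: str
    next_seq: int = 1                      # next sequence number to release
    _pending: dict = field(default_factory=dict)

    def offer(self, event: dict) -> list:
        """Accept one event; return the events now ready, in order."""
        # Reject events without the expected IDs or from another session.
        if event.get("session_id") != self.session_id or "seq" not in event:
            return []
        seq = event["seq"]
        # Already delivered: treat as a duplicate and ignore it.
        if seq < self.next_seq:
            return []
        self._pending[seq] = event
        # Release the contiguous run starting at next_seq.
        ready = []
        while self.next_seq in self._pending:
            ready.append(self._pending.pop(self.next_seq))
            self.next_seq += 1
        return ready
```

For example, if event 2 arrives before event 1, `offer` returns an empty list for event 2 and then returns both events, in order, once event 1 arrives. An event tagged with an old session ID is never shown in the current conversation.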

In short, the fix is to externalize session state and event coordination, because Cloud Run instances are stateless and not guaranteed to handle related requests consistently.

Thanks for the input, @a_aleinikov, quite helpful.
I finally moved to Redis (Memorystore).