Hey all,
Vertex AI recently shipped model co-hosting for LLMs. Instead of dedicating a full GPU node to each model, you can now run Llama, Gemma, Mistral, etc. side by side on the same VM with explicit GPU memory partitioning.
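To make the partitioning idea concrete, here's a toy sketch of carving one 80 GB GPU into explicit per-model slices. The model names and slice sizes are made-up placeholders, not numbers from the launch:

```python
# Toy illustration of explicit GPU memory partitioning for co-hosted LLMs.
# Model names and slice sizes are hypothetical examples, not measured values.
GPU_MEMORY_GB = 80  # e.g., one A100 80GB

partitions_gb = {
    "llama-3-8b": 28,   # weights + KV cache budget for this model
    "gemma-2-9b": 30,
    "mistral-7b": 22,
}

# The slices must fit on the device; each model stays inside its own budget.
assert sum(partitions_gb.values()) <= GPU_MEMORY_GB, "plan exceeds GPU memory"

for model, slice_gb in partitions_gb.items():
    print(f"{model}: {slice_gb} GB ({slice_gb / GPU_MEMORY_GB:.0%} of the GPU)")
```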
With model co-hosting, the team found:
- Throughput improvement at saturation
- Near-zero latency regression when GPU memory is properly partitioned
- Virtually no interference between co-hosted models
When you deploy, you can give each model an explicit GPU memory slice via --gpu-memory-partitions, and at runtime you can hot-swap models with the update_models API without restarting the container (see the sketch below).
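Here's a minimal sketch of how the two pieces might fit together, assuming the Vertex AI Python SDK. The endpoint and deploy calls are the standard SDK surface; the partitioning keyword and the update_models call are left as comments because their exact shapes are my assumptions based on the flag and API names above. The tutorial notebook under Resources has the real surface.

```python
# Minimal sketch, assuming the Vertex AI Python SDK. The gpu_memory_partition
# keyword and the update_models call shape are illustrative guesses from the
# flag/API names in this post, not a confirmed interface.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# One endpoint hosts several models on the same GPU VM.
endpoint = aiplatform.Endpoint.create(display_name="llm-cohost-endpoint")

llama = aiplatform.Model("projects/my-project/locations/us-central1/models/llama-3-8b")
gemma = aiplatform.Model("projects/my-project/locations/us-central1/models/gemma-2-9b")

# Deploy each model with an explicit GPU memory slice, mirroring the
# --gpu-memory-partitions deploy flag (the keyword name is hypothetical).
llama.deploy(
    endpoint=endpoint,
    machine_type="a2-highgpu-1g",
    accelerator_type="NVIDIA_TESLA_A100",
    accelerator_count=1,
    # gpu_memory_partition=0.6,  # hypothetical: 60% of the GPU for Llama
)
gemma.deploy(
    endpoint=endpoint,
    machine_type="a2-highgpu-1g",
    accelerator_type="NVIDIA_TESLA_A100",
    accelerator_count=1,
    # gpu_memory_partition=0.4,  # hypothetical: 40% of the GPU for Gemma
)

# Hot-swap at runtime without a container restart: update_models is the API
# named above, but this call shape is an assumption.
# endpoint.update_models(add=["mistral-7b"], remove=["gemma-2-9b"])
```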
Resources:
- Blog post: “Closing the efficiency gap in LLM serving with model co-hosting with Vertex AI” on the Cloud AI Engineering Blog (co-authored with Kathy Yu and Jiuqiang Tang)
- Tutorial notebook: model_garden_model_cohost.ipynb in GoogleCloudPlatform/vertex-ai-samples on GitHub
As always, feel free to connect with me on LinkedIn or X/Twitter with questions or feedback.
