Model co-hosting for LLMs on Vertex AI

Hey all,

Vertex AI recently shipped model co-hosting for LLMs. Instead of dedicating a full GPU node to each model, you can now run Llama, Gemma, Mistral, etc. side by side on the same VM with explicit GPU memory partitioning.

With model co-hosting, the team found:

  1. Throughput improvement at saturation
  2. Near-zero latency regression when properly partitioned
  3. Virtually no interference between co-hosted models
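
A minimal sketch of what this could look like with the Vertex AI Python SDK's DeploymentResourcePool co-hosting flow (whether the new LLM feature uses exactly this path isn't spelled out here; the project, pool ID, machine type, and model IDs are placeholders):

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholder project/region

# One shared A100 VM that several deployed models will run on together.
pool = aiplatform.DeploymentResourcePool.create(
    deployment_resource_pool_id="llm-co-host-pool",  # placeholder pool ID
    machine_type="a2-highgpu-1g",
    accelerator_type="NVIDIA_TESLA_A100",
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=1,
)

# Deploy two LLMs into the same pool instead of giving each its own node.
for model_id in ("llama-3-8b-instruct", "gemma-2-9b-it"):  # placeholder model IDs
    aiplatform.Model(model_name=model_id).deploy(deployment_resource_pool=pool)
```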

Each model gets an explicit GPU memory slice via the --gpu-memory-partitions flag at deploy time, and you can also hot-swap models at runtime with the update_models API without restarting the serving container.
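
As a rough sketch of how those two pieces might look from the Python SDK (the gpu_memory_partition argument and update_models method below simply mirror the flag and API named above; they are assumptions about the SDK surface, not confirmed signatures, and the model IDs are placeholders):

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholder project/region

model = aiplatform.Model(model_name="llama-3-8b-instruct")  # placeholder model ID

# Hypothetical: reserve an explicit GPU memory slice for this model at deploy
# time, mirroring the --gpu-memory-partitions flag described above.
endpoint = model.deploy(
    machine_type="a2-highgpu-1g",
    accelerator_type="NVIDIA_TESLA_A100",
    accelerator_count=1,
    gpu_memory_partition="20GiB",  # hypothetical kwarg, not a confirmed SDK parameter
)

# Hypothetical: swap one co-hosted model for another at runtime, mirroring the
# update_models API, without restarting the serving container.
endpoint.update_models(            # hypothetical method name taken from the post
    remove=["llama-3-8b-instruct"],
    add=["mistral-7b-instruct"],   # placeholder model IDs
)
```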

Resources:

As always, let’s connect on LinkedIn or X/Twitter for questions or feedback.


Thank you Ivan!

Would it be possible to support further quantization, custom deployments, or maybe even Unsloth (e.g. via scripts)?

I haven’t checked it out yet, but the implementation would probably look like this → create a cluster or cloud GPUs, then connect and run scripts.

But for the LLMs, I believe the underlying tech stack would be one of the known ones, like Accelerate, Axolotl, or Unsloth. Or maybe this is already supported.

Best,