Vertex AI Online Predictions Scale Down

Hello, I’d like to set the minimum number of compute nodes for Vertex AI Online Predictions to 0 instead of 1. Why is this not possible?

Hello,
Unfortunately, that is not possible: Vertex AI endpoints need a minimum of 1 replica running at all times. Check this doc:

link
However, there is another option, though it takes more setup: you can serve your model from Cloud Run, which can scale down to zero. Also note that Cloud Run services now support GPUs too. Check out this doc. It talks specifically about LLM deployments, but nothing prevents you from deploying any other model, since at the end of the day you're just running containers on Cloud Run. link
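To give a rough idea of what "running a container on Cloud Run" could look like for a model, here is a minimal sketch of a prediction server using only the Python standard library. The `predict` function is a hypothetical placeholder for your real model, and the `/predict` route and the `{"instances": ...}` / `{"predictions": ...}` JSON shapes are my own assumptions, not something Cloud Run requires. The one thing Cloud Run does expect is that the container listens on the port given in the `PORT` environment variable (8080 by default).

```python
import json
import os
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def predict(instances):
    """Hypothetical stand-in for a real model: sums each feature list."""
    return [sum(features) for features in instances]


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Assumed route name; pick whatever path your clients will call.
        if self.path != "/predict":
            self.send_error(404, "unknown route")
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            result = {"predictions": predict(payload["instances"])}
        except (ValueError, KeyError, TypeError):
            self.send_error(400, "invalid request body")
            return
        body = json.dumps(result).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet; real services usually want request logs


def make_server(port=None):
    # Cloud Run passes the port to listen on via the PORT environment variable.
    if port is None:
        port = int(os.environ.get("PORT", "8080"))
    return ThreadingHTTPServer(("0.0.0.0", port), PredictHandler)


def main():
    # Call this from the container entrypoint to start serving.
    make_server().serve_forever()
```

You would package this in a container image whose entrypoint calls `main()` and deploy it as a Cloud Run service. With the minimum number of instances left at 0, the service can scale to zero when idle, at the cost of a cold start (including model loading) on the first request afterwards.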
Hope this helps!

Hi @izvonkov,

Welcome to the Google Cloud Community!

You might find it helpful to check this case, which suggests an approach to the issue you encountered.

If you’re seeking other options, you could look into manual scaling or opt for AI Platform, which allows scaling down to zero. You may want to check this documentation for more information.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.