Hi everyone,
I am currently deploying a fine-tuned Gemini 2.5 model on Google Cloud Vertex AI for production use. I have a few concerns and would appreciate advice from the community or Google engineers on the following:
- I am being charged hourly for my deployed Vertex AI endpoint even during periods with no prediction traffic, which results in significant fixed costs for keeping the endpoint available 24/7.
- My understanding is that these hourly charges cover the virtual machines (nodes) that stay continuously running to guarantee low-latency responses; I would like to confirm this and better understand the rationale.
- Since my usage is concentrated in specific business hours (e.g., U.S. market hours), paying for idle capacity outside those windows is costly.
- What are the best practices or built-in Vertex AI features for minimizing or avoiding these idle-time hourly charges without sacrificing availability during peak usage?
- I would also appreciate hearing how others manage cost-efficient production deployments of fine-tuned Gemini models, whether through autoscaling, scheduled start/stop, or other approaches.
- Any tips on balancing cost and responsiveness, working around autoscaling limitations (such as minimum replica counts), or alternatives like batch prediction workflows would be extremely helpful.
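For context on the scheduled start/stop idea I mentioned, here is a rough sketch of what I have been considering, using the `google-cloud-aiplatform` SDK. This assumes a dedicated endpoint billed per node-hour; the project/endpoint/model IDs are placeholders, and the fixed UTC-5 offset is a simplification (a production version would use `zoneinfo` to handle DST). The plan would be to trigger `scale_endpoint` periodically from Cloud Scheduler:

```python
# Hedged sketch (my own idea, not an official pattern): undeploy the model
# outside business hours so node-hour billing stops, and redeploy before
# the trading window opens.
from datetime import datetime, time, timezone, timedelta

# Fixed US Eastern offset -- a simplification that ignores DST.
US_EASTERN = timezone(timedelta(hours=-5))
MARKET_OPEN = time(9, 30)
MARKET_CLOSE = time(16, 0)


def within_business_hours(now: datetime) -> bool:
    """Return True if `now` falls inside U.S. market hours, Mon-Fri."""
    local = now.astimezone(US_EASTERN)
    if local.weekday() >= 5:  # Saturday or Sunday
        return False
    return MARKET_OPEN <= local.time() <= MARKET_CLOSE


def scale_endpoint(project: str, region: str, endpoint_id: str, model_id: str) -> None:
    """Deploy or undeploy depending on the clock.

    Requires google-cloud-aiplatform and valid credentials; the IDs are
    placeholders for illustration.
    """
    from google.cloud import aiplatform  # imported lazily; needs credentials

    aiplatform.init(project=project, location=region)
    endpoint = aiplatform.Endpoint(endpoint_id)
    if within_business_hours(datetime.now(timezone.utc)):
        if not endpoint.list_models():  # nothing deployed -> bring it up
            model = aiplatform.Model(model_id)
            model.deploy(endpoint=endpoint, min_replica_count=1, max_replica_count=2)
    else:
        # Stops node-hour billing until the model is redeployed.
        endpoint.undeploy_all()
```

The obvious downside is the redeploy latency at the start of each window, which is part of why I am asking whether there is a better supported mechanism.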
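On the batch prediction alternative, this is roughly what I imagine for off-hours traffic: queue requests as JSONL in Cloud Storage and submit them as a single batch job, so compute is billed only while the job runs. The bucket paths, display name, and machine type below are placeholder assumptions, and I am assuming a custom-deployed model here; the batch flow for tuned Gemini models may differ, which is exactly what I would like to confirm:

```python
# Hedged sketch of an off-hours batch prediction workflow. All names and
# values are illustrative assumptions, not a confirmed configuration.


def batch_job_config(model_id: str, input_uri: str, output_uri: str) -> dict:
    """Assemble keyword arguments for Model.batch_predict as a plain dict."""
    return {
        "job_display_name": f"offhours-batch-{model_id}",
        "gcs_source": input_uri,  # JSONL file of prediction instances
        "gcs_destination_prefix": output_uri,
        "machine_type": "n1-standard-4",  # assumption; size to your model
    }


def run_batch(model_id: str, input_uri: str, output_uri: str):
    """Submit the batch job. Requires google-cloud-aiplatform and credentials."""
    from google.cloud import aiplatform  # imported lazily; needs credentials

    aiplatform.init()
    model = aiplatform.Model(model_id)
    return model.batch_predict(**batch_job_config(model_id, input_uri, output_uri))
```

If batch jobs do work with fine-tuned Gemini models, this plus a small always-on or scheduled endpoint for interactive traffic might be the cost balance I am after.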
Thank you in advance for your guidance!