HELP: Clarification and Cost Optimization for Hourly Billing on Deployed Vertex AI Endpoints

Hi everyone,

I am currently deploying a fine-tuned Gemini 2.5 model on Google Cloud Vertex AI for production use. I have a few concerns and would appreciate advice from the community or Google engineers on the following:

  • I am being charged hourly for my deployed Vertex AI endpoint even during periods with no prediction requests or traffic. This results in significant fixed costs for keeping my endpoint available 24/7.

  • From my understanding, these hourly charges are for virtual machines (nodes) that stay continuously running to ensure low-latency responses, but I want to confirm this and understand the rationale better.

  • Given that my main usage happens only during specific business hours (e.g., American market hours), paying for idle time outside these windows seems costly.

  • I want to know the best practices or available features within Vertex AI to minimize or avoid these idle-time hourly charges without sacrificing availability during peak usage.

  • Also, I would appreciate insights into how others manage cost-efficient deployment of fine-tuned Gemini models in production—whether through autoscaling, scheduled start/stop, or other methods.

  • Any tips for balancing cost and responsiveness, handling autoscaling limitations (like minimum replica counts), or alternatives like batch prediction workflows would be extremely helpful.

Thank you in advance for your guidance!

Hi Mikey,

It looks like you are encountering an issue with high fixed costs from keeping your Vertex AI endpoint for your fine-tuned Gemini 2.5 model running 24/7, even during idle periods. You only need it active during your specific business hours and are looking for cost-optimization strategies that preserve availability during your peak usage times.
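To put rough numbers on the idle-time cost: with a hypothetical per-node rate and a roughly 9-hour business window, a 24/7 deployment spends well over half its bill on idle hours. The $2.50/hour rate below is an illustrative placeholder, not an actual Vertex AI price — check the pricing page for your machine type.

```python
# Back-of-envelope idle-cost estimate. The hourly rate is a
# hypothetical placeholder -- look up the real rate for your
# machine type on the Vertex AI pricing page.
HOURLY_NODE_RATE = 2.50      # USD per node-hour (illustrative only)
BUSINESS_HOURS_PER_DAY = 9   # e.g., US market hours plus a buffer
TRADING_DAYS_PER_MONTH = 21

always_on_cost = HOURLY_NODE_RATE * 24 * 30
business_hours_cost = (HOURLY_NODE_RATE * BUSINESS_HOURS_PER_DAY
                       * TRADING_DAYS_PER_MONTH)

idle_share = 1 - business_hours_cost / always_on_cost
print(f"24/7 cost per month:     ${always_on_cost:,.2f}")
print(f"Business-hours only:     ${business_hours_cost:,.2f}")
print(f"Share spent on idle time: {idle_share:.0%}")
```

Under these illustrative numbers, roughly three-quarters of the monthly bill pays for idle capacity, which is why the strategies below focus on scaling down or tearing down outside business hours.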

Here are some approaches that might help with your use case:

  • Start with Autoscaling: Configure the deployment with min_replica_count=1, a higher max_replica_count, and tuned scaling metrics (e.g., CPU or accelerator utilization targets). This is a balanced approach for production: one node stays warm for low-latency responses while cost growth under load is capped. Note that Vertex AI online endpoints require at least one replica, so autoscaling reduces but does not eliminate idle-time charges.
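As a sketch of what such a deployment might look like with the gcloud CLI (placeholder IDs, machine type, and region throughout — verify the flags against the current `gcloud ai endpoints deploy-model` reference before use):

```shell
# Sketch only -- ENDPOINT_ID, MODEL_ID, machine type, and region are
# placeholders; adjust to your project before running.
gcloud ai endpoints deploy-model ENDPOINT_ID \
  --region=us-central1 \
  --model=MODEL_ID \
  --display-name=gemini-finetune-prod \
  --machine-type=n1-standard-4 \
  --min-replica-count=1 \
  --max-replica-count=3 \
  --autoscaling-metric-specs=cpu-usage=60
```

With this shape, you pay for one node around the clock, and extra replicas spin up only when the scaling metric crosses its target during busy hours.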

  • Migrate Non-Real-Time Loads to Batch: For any part of your workload that doesn’t require instant results, use Vertex AI Batch Prediction, orchestrated by Cloud Scheduler or Vertex AI Pipelines. Batch jobs bill only while they run and need no standing endpoint, which can significantly reduce costs.
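To make the batch path concrete: Vertex AI batch prediction for Gemini models reads a JSONL file of request objects. The helper below builds such a file; the `"request"`/`"contents"` wrapper shown reflects my understanding of the documented schema, so verify it against the current batch prediction docs before relying on it.

```python
import json

def build_batch_input(prompts, path):
    """Write prompts as JSONL in the shape Vertex AI batch prediction
    for Gemini expects (one {"request": ...} object per line -- verify
    the exact schema against the current documentation)."""
    with open(path, "w") as f:
        for prompt in prompts:
            line = {
                "request": {
                    "contents": [
                        {"role": "user", "parts": [{"text": prompt}]}
                    ]
                }
            }
            f.write(json.dumps(line) + "\n")

build_batch_input(
    ["Summarize yesterday's close.", "Flag unusual volume."],
    "batch_requests.jsonl",
)
# Next steps (not shown): upload the JSONL to Cloud Storage, then submit
# the job via the google-cloud-aiplatform SDK or the gcloud CLI, with
# Cloud Scheduler triggering the submission on your desired cadence.
```

Because the job runs to completion and tears itself down, there is no idle-endpoint cost for whatever traffic you move onto this path.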

  • Consider Scheduled Undeployment/Re-deployment (removing the deployed model from the endpoint each evening and re-deploying each morning; an empty endpoint itself incurs no node charges) ONLY IF:

    • Zero idle cost is a non-negotiable requirement.
    • You’re willing to accept noticeable cold-start delays at the start of each business day.
    • You’re prepared to manage the increased operational complexity that comes with it.
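If you do accept those trade-offs, the teardown and bring-up can be as simple as two gcloud calls fired on a schedule (e.g., Cloud Scheduler triggering a Cloud Run job or Cloud Function). All IDs below are placeholders — this is a sketch of the shape, not a drop-in script:

```shell
# Evening: undeploy the model to stop hourly node charges.
gcloud ai endpoints undeploy-model ENDPOINT_ID \
  --region=us-central1 \
  --deployed-model-id=DEPLOYED_MODEL_ID

# Morning: redeploy before business hours; budget several minutes
# of cold start before the endpoint serves traffic again.
gcloud ai endpoints deploy-model ENDPOINT_ID \
  --region=us-central1 \
  --model=MODEL_ID \
  --display-name=gemini-finetune-prod \
  --machine-type=n1-standard-4 \
  --min-replica-count=1 \
  --max-replica-count=3
```

Schedule the morning job with enough lead time that the deployment finishes before your first expected request.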