Hi everyone,
I am currently deploying a fine-tuned Gemini 2.5 model on Google Cloud Vertex AI for production use. I have a few concerns and would appreciate advice from the community or Google engineers on the following:
- I am being charged hourly for my deployed Vertex AI endpoint even during periods with no prediction traffic, which results in significant fixed costs for keeping the endpoint available 24/7.
- My understanding is that these hourly charges cover the virtual machines (nodes) that stay continuously running to guarantee low-latency responses; I would like to confirm this and better understand the rationale.
- Since my usage is concentrated in specific business hours (e.g., U.S. market hours), paying for idle capacity outside those windows is costly.
- What are the best practices or built-in Vertex AI features for minimizing or avoiding these idle-time hourly charges without sacrificing availability during peak usage?
- I would also appreciate hearing how others manage cost-efficient production deployments of fine-tuned Gemini models, whether through autoscaling, scheduled start/stop, or other approaches.
- Any tips on balancing cost and responsiveness, working around autoscaling limitations (such as minimum replica counts), or alternatives like batch prediction workflows would be extremely helpful.
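For context on the scheduled start/stop idea I mentioned, here is a rough sketch of what I have been considering, using the `google-cloud-aiplatform` SDK. This assumes a dedicated endpoint billed per node-hour; the project/endpoint/model IDs are placeholders, and the fixed UTC-5 offset is a simplification (a production version would use `zoneinfo` to handle DST). The plan would be to trigger `scale_endpoint` periodically from Cloud Scheduler:

```python
# Hedged sketch (my own idea, not an official pattern): undeploy the model
# outside business hours so node-hour billing stops, and redeploy before
# the trading window opens.
from datetime import datetime, time, timezone, timedelta

# Fixed US Eastern offset -- a simplification that ignores DST.
US_EASTERN = timezone(timedelta(hours=-5))
MARKET_OPEN = time(9, 30)
MARKET_CLOSE = time(16, 0)


def within_business_hours(now: datetime) -> bool:
    """Return True if `now` falls inside U.S. market hours, Mon-Fri."""
    local = now.astimezone(US_EASTERN)
    if local.weekday() >= 5:  # Saturday or Sunday
        return False
    return MARKET_OPEN <= local.time() <= MARKET_CLOSE


def scale_endpoint(project: str, region: str, endpoint_id: str, model_id: str) -> None:
    """Deploy or undeploy depending on the clock.

    Requires google-cloud-aiplatform and valid credentials; the IDs are
    placeholders for illustration.
    """
    from google.cloud import aiplatform  # imported lazily; needs credentials

    aiplatform.init(project=project, location=region)
    endpoint = aiplatform.Endpoint(endpoint_id)
    if within_business_hours(datetime.now(timezone.utc)):
        if not endpoint.list_models():  # nothing deployed -> bring it up
            model = aiplatform.Model(model_id)
            model.deploy(endpoint=endpoint, min_replica_count=1, max_replica_count=2)
    else:
        # Stops node-hour billing until the model is redeployed.
        endpoint.undeploy_all()
```

The obvious downside is the redeploy latency at the start of each window, which is part of why I am asking whether there is a better supported mechanism.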
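On the batch prediction alternative, this is roughly what I imagine for off-hours traffic: queue requests as JSONL in Cloud Storage and submit them as a single batch job, so compute is billed only while the job runs. The bucket paths, display name, and machine type below are placeholder assumptions, and I am assuming a custom-deployed model here; the batch flow for tuned Gemini models may differ, which is exactly what I would like to confirm:

```python
# Hedged sketch of an off-hours batch prediction workflow. All names and
# values are illustrative assumptions, not a confirmed configuration.


def batch_job_config(model_id: str, input_uri: str, output_uri: str) -> dict:
    """Assemble keyword arguments for Model.batch_predict as a plain dict."""
    return {
        "job_display_name": f"offhours-batch-{model_id}",
        "gcs_source": input_uri,  # JSONL file of prediction instances
        "gcs_destination_prefix": output_uri,
        "machine_type": "n1-standard-4",  # assumption; size to your model
    }


def run_batch(model_id: str, input_uri: str, output_uri: str):
    """Submit the batch job. Requires google-cloud-aiplatform and credentials."""
    from google.cloud import aiplatform  # imported lazily; needs credentials

    aiplatform.init()
    model = aiplatform.Model(model_id)
    return model.batch_predict(**batch_job_config(model_id, input_uri, output_uri))
```

If batch jobs do work with fine-tuned Gemini models, this plus a small always-on or scheduled endpoint for interactive traffic might be the cost balance I am after.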
Thank you in advance for your guidance!