Hello,
I am trying to figure out the cost of using a fine-tuned version of Gemini 2.0 Flash. I believe all the relevant information is here: https://cloud.google.com/vertex-ai/pricing
I understand that there are training costs as well as inference costs.
It says: "Prediction pricing for tuned model endpoints are the same as for the base foundation model."
However, does this mean that keeping an endpoint deployed for my tuned model costs me for each hour the endpoint is up and running, even with no queries? Or do I only pay per token?
This quote comes from the "Pricing for AutoML models" section: "You pay for each model deployed to an endpoint, even if no prediction is made. You must undeploy your model to stop incurring further charges. Models that are not deployed or have failed to deploy are not charged." I think this could also be the case for Generative AI endpoints.
So my question is: is the cost for inference ONLY per token, or do you also pay by the hour for a deployed endpoint?
(Side note: the former would make sense, since you could serve requests from a server already running the base, untuned Gemini Flash and just swap the LoRA weights into VRAM for inference.)
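To make the difference between the two billing models concrete, here is a rough cost sketch. All rates below are made-up placeholders for illustration, not actual Vertex AI prices; the point is only that a per-hour endpoint charge accrues even with zero traffic, while pure per-token billing does not.

```python
# Hypothetical rates for illustration only -- NOT actual Vertex AI prices.
INPUT_PRICE_PER_1M_TOKENS = 0.15   # USD, assumed
OUTPUT_PRICE_PER_1M_TOKENS = 0.60  # USD, assumed
ENDPOINT_PRICE_PER_HOUR = 0.50     # USD, assumed hourly charge, IF one applies

def monthly_cost(input_tokens, output_tokens, hours_deployed, hourly_rate=0.0):
    """Estimate a month's inference cost under per-token billing,
    optionally adding an hourly deployed-endpoint charge."""
    token_cost = ((input_tokens / 1e6) * INPUT_PRICE_PER_1M_TOKENS
                  + (output_tokens / 1e6) * OUTPUT_PRICE_PER_1M_TOKENS)
    return token_cost + hours_deployed * hourly_rate

# Pure per-token billing: an idle endpoint (730 h, no queries) costs nothing.
print(monthly_cost(0, 0, hours_deployed=730))                      # 0.0
# With an hourly endpoint charge, the same idle month still costs money.
print(monthly_cost(0, 0, 730, hourly_rate=ENDPOINT_PRICE_PER_HOUR))  # 365.0
```

This is exactly why the answer matters: under the AutoML-style model, undeploying is the only way to stop the meter.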