Hello,
I am trying to figure out the cost of using a fine-tuned version of Gemini 2.0 Flash. I believe all the relevant information is here: https://cloud.google.com/vertex-ai/pricing
I understand that there are training costs as well as inference costs.
It says: "Prediction pricing for tuned model endpoints are the same as for the base foundation model."
However, does this mean that keeping an endpoint deployed for my tuned model costs me for each hour the endpoint is up and running, even with no queries? Or do I only pay per token?
This quote comes from the "Pricing for AutoML models" section: "You pay for each model deployed to an endpoint, even if no prediction is made. You must undeploy your model to stop incurring further charges. Models that are not deployed or have failed to deploy are not charged." I think this could also be the case for Generative AI endpoints.
So my question is: is the cost for inference ONLY per token, or do you also pay by the hour for a deployed endpoint?
(Side note: the former would make sense, since you could serve requests from a server already running the base, untuned Gemini Flash and just swap the LoRA weights into VRAM for inference.)
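To make the difference between the two billing models concrete, here is a rough cost sketch. All rates below are made-up placeholders for illustration, not actual Vertex AI prices; the point is only that a per-hour endpoint charge accrues even with zero traffic, while pure per-token billing does not.

```python
# Hypothetical rates for illustration only -- NOT actual Vertex AI prices.
INPUT_PRICE_PER_1M_TOKENS = 0.15   # USD, assumed
OUTPUT_PRICE_PER_1M_TOKENS = 0.60  # USD, assumed
ENDPOINT_PRICE_PER_HOUR = 0.50     # USD, assumed hourly charge, IF one applies

def monthly_cost(input_tokens, output_tokens, hours_deployed, hourly_rate=0.0):
    """Estimate a month's inference cost under per-token billing,
    optionally adding an hourly deployed-endpoint charge."""
    token_cost = ((input_tokens / 1e6) * INPUT_PRICE_PER_1M_TOKENS
                  + (output_tokens / 1e6) * OUTPUT_PRICE_PER_1M_TOKENS)
    return token_cost + hours_deployed * hourly_rate

# Pure per-token billing: an idle endpoint (730 h, no queries) costs nothing.
print(monthly_cost(0, 0, hours_deployed=730))                      # 0.0
# With an hourly endpoint charge, the same idle month still costs money.
print(monthly_cost(0, 0, 730, hourly_rate=ENDPOINT_PRICE_PER_HOUR))  # 365.0
```

This is exactly why the answer matters: under the AutoML-style model, undeploying is the only way to stop the meter.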