LLMs deployed from Model Garden cut replies short

I have deployed several models (Gemma, Gemma2, and Qwen) from Model Garden to endpoints in two different regions, europe-west1 and europe-west4.

When I test them, the replies are cut short. These are the two latest tests I saved:

{
"predictions": [
"Prompt:\nWhat is the capital of France and what I can do there?\nOutput:\n Paris.\n\nThe area of Paris is 41 km2 (16 miles"
]
}
From qwen_qwen2-1_5b

{
"predictions": [
"Prompt:\nWhat is the capital of France and what I can do there?\nOutput:\nParis is the capital of France and a global city known for its"
]
}
From gemma2-2b-it

Is there anything I can do about this? I used Vertex AI's one-click deployment.

Hi @Makro ,

Welcome to Google Cloud Community!

Replies being cut short on LLMs (Gemma, Gemma2, and Qwen) deployed from the open models in Model Garden may have several causes.

One possible reason is that you are hitting the max_output_tokens limit of those models: the reply is cut short before the full answer is produced because the token budget is exhausted. This can also be a model limitation, so I suggest considering other models that support a higher max_output_tokens.
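If the token budget is the cause, you can usually raise it per request rather than redeploying. A minimal sketch is below, assuming the one-click deployment serves the model behind a text-generation container whose instance schema accepts "prompt", "max_tokens", and "temperature" fields (the parameter names are assumptions, so check the request schema of your deployed container); the SDK call itself requires google-cloud-aiplatform and a live endpoint, so it is shown commented out:

```python
import json

def build_instance(prompt: str, max_tokens: int = 512,
                   temperature: float = 0.7) -> dict:
    """Build one prediction instance with an explicit output-token budget.

    The field names here are assumptions about the serving container's
    request schema, not a documented contract.
    """
    return {
        "prompt": prompt,
        "max_tokens": max_tokens,   # raise this if replies are truncated
        "temperature": temperature,
    }

payload = {
    "instances": [
        build_instance(
            "What is the capital of France and what I can do there?",
            max_tokens=256,
        )
    ]
}
print(json.dumps(payload, indent=2))

# With the Vertex AI SDK, the call would look roughly like this
# (PROJECT and ENDPOINT_ID are placeholders for your own values):
# from google.cloud import aiplatform
# endpoint = aiplatform.Endpoint(
#     "projects/PROJECT/locations/europe-west1/endpoints/ENDPOINT_ID")
# response = endpoint.predict(instances=payload["instances"])
# print(response.predictions)
```

If the reply grows when max_tokens is raised, the truncation was the token budget rather than a deployment problem.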

You can also check the endpoint configuration in the Vertex AI console, since other factors can truncate replies, such as resource constraints that limit the model's ability to generate a complete response.

Additionally, you can try simplifying your prompts to keep them manageable and prevent truncated replies.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.

Hi @marckevin

My prompts are simple. We used one-click deployment in Model Garden. I would expect max_output_tokens to be more than 10, which looks to be about the length of the replies we are getting. There is very little we can do differently. We have abandoned the idea of using one-click deployments, for this and other reasons. This issue is with one-click deployments.