Fine-Tuning Queue/Jobs Taking Several Hours to Start With Gemini/Vertex AI

Hi,

I’m currently fine-tuning the gemini-1.0-pro-002 model. Has anyone experienced jobs getting stuck in the queue for several hours with no progress? All I see is a page that says “Queuing for capacity…” Normally I don’t have this problem, but when it happens, it stalls my work for several hours at a time and I don’t know the underlying cause. Is there a massive backlog on Google’s end, and/or does fine-tuning have fewer cloud resources allocated to it because it’s fairly new?

Any feedback? Anyone else have such issues?

Best Regards,
Jacob

Hi, is anyone else having issues with fine-tuning? I even opened a new account and I still cannot complete any tasks. The jobs just sit in the queue.

Hi @jsniff12,

Welcome to the Google Cloud Community!

If you’re experiencing extended delays with the “Queuing for capacity…” message during fine-tuning, especially with the gemini-1.0-pro-002 model, this is a common issue, and several factors could be contributing to it:

  • High demand periods: Heavy traffic can lead to long queues, particularly when there are many fine-tuning jobs running simultaneously.
  • Quota limits: Your project may have reached its quota for compute resources (GPUs, CPUs) or for concurrent fine-tuning jobs. You can check your project’s quotas in the Google Cloud Console.
  • Fine-tuning configuration: The size of your dataset, number of epochs, and other parameters impact resource usage. Larger jobs may receive lower priority, resulting in longer queue times.

Here are some steps that could help resolve the issue:

  1. Try Again Later: Sometimes, waiting a bit and retrying later in the day or on another day can work.
  2. Check Your Quotas: Make sure you haven’t hit any quota limits by reviewing your project’s quotas in the Google Cloud Console.
  3. Simplify Your Job: If possible, consider reducing the size of your training dataset or simplifying your fine-tuning settings. This could help your job be processed faster.
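
If you script the "try again later" step rather than watching the console, a small polling helper with a deadline can tell you when a job has been queued longer than you are willing to wait. The following is only a minimal sketch: `wait_for_job`, `get_state`, and the state names are hypothetical and chosen for illustration, not taken from the Vertex AI API. You would supply your own `get_state` callable that returns the job's current state as a string.

```python
import time
from typing import Callable

# Assumed terminal state names, for illustration only.
# Check the values your SDK or API actually reports.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_job(
    get_state: Callable[[], str],
    max_wait_s: float,
    poll_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_state() until it returns a terminal state or the deadline passes.

    Returns the terminal state, or "TIMED_OUT" if the job is still
    non-terminal once max_wait_s seconds have elapsed.
    """
    deadline = clock() + max_wait_s
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if clock() >= deadline:
            return "TIMED_OUT"
        sleep(poll_s)
```

On a "TIMED_OUT" result you could cancel the job and resubmit it later, or submit a smaller job instead. How you fetch the state and cancel depends on the SDK or API you use, so that part is left to the caller.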

Additionally, please note that the gemini-1.0-pro model was scheduled for deprecation on February 15, 2025. Although Google has extended its availability until April 9, 2025, it’s important to migrate to a supported model, such as gemini-1.5-pro or gemini-1.5-flash, before then to avoid any disruptions. After April 9, 2025, gemini-1.0-pro will no longer be available.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.

@dawnberdan I’m facing an issue with Gemini 2.0 Flash tuning through the Vertex AI dashboard: my tuning jobs are stuck in the running state. It was working fine before, but now a job has been running for almost 18 hours!

Hi, were you able to figure out the issue? I also created a tuning job for fine-tuning gemini-2.0-flash-001, but it was stuck in “Preparing for tuning..” for about 5 hours. I eventually cancelled it and tried to create a new one. I’m not sure the new job won’t get stuck the same way.

No, I’m still facing this issue; the job is still in the running state.