Hi,
I am trying to load a model from Hugging Face and prune and benchmark it as a custom training job. I built a custom Docker container, and it runs fine on my local machine.
The job on Vertex AI, however, is extremely slow and gets stuck on the model-loading line:
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", cache_dir=HF_CACHE_DIR)
I have cached the model in a GCS bucket.
I have tried many combinations of arguments and parameters and enabled Hugging Face logging, but the job still makes no progress on those machines.
I am on the first three months of trial credits; could that be the reason?
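For reference, this is roughly how I am enabling the logs and timing the load. `HF_CACHE_DIR` here is a placeholder for my bucket mount path:

```python
import time

import transformers
from transformers import AutoModelForCausalLM

# Surface per-file download/load progress from transformers.
transformers.logging.set_verbosity_info()

HF_CACHE_DIR = "/gcs/my-bucket/hf-cache"  # placeholder mount path

start = time.time()
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    cache_dir=HF_CACHE_DIR,
)
print(f"model load took {time.time() - start:.1f}s")
```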
Based on your description, I strongly suspect this is not related to the initial three-month trial credits; those generally impact billing, not runtime execution like training jobs.
To help, could you please share your full training code and setup scripts on GitHub? I'd be happy to dive into the repository and pinpoint the issue for you.
I changed my app to load the model from the container's local disk, having copied the large model into the image at build time.
Now the model loads in about 5 minutes on an n1-standard-16,
compared to more than 4 hours previously on an n2-standard-16, where I loaded the model from the GCS bucket.
No GPUs in either case.
I had read that GCS imposes a lot of I/O latency, but damn, I was not expecting this much.