Hi,
I am trying to load a model from Hugging Face and prune and benchmark it as a custom training job. I built a custom Docker container, and it runs fine on my local machine.
The job on Vertex AI, however, is extremely slow and gets stuck on the model-loading line:
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", cache_dir=HF_CACHE_DIR)
I have cached the model in a GCS bucket.
I have tried many combinations of arguments and parameters and enabled Hugging Face logging, but the job still makes no progress on those machines.
I am on the first three months of trial credits; could that be the reason?
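For reference, this is roughly how I am enabling the logs and timing the load. `HF_CACHE_DIR` here is a placeholder for my bucket mount path:

```python
import time

import transformers
from transformers import AutoModelForCausalLM

# Surface per-file download/load progress from transformers.
transformers.logging.set_verbosity_info()

HF_CACHE_DIR = "/gcs/my-bucket/hf-cache"  # placeholder mount path

start = time.time()
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    cache_dir=HF_CACHE_DIR,
)
print(f"model load took {time.time() - start:.1f}s")
```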
Based on your description, I strongly suspect this is not related to the initial three-month trial credits; those generally impact billing, not runtime execution like training jobs.
To help, could you please share your full training code and setup scripts on GitHub? I'd be happy to dive into the repository and pinpoint the issue for you.
I changed my app to load the model from the container's local disk, having copied the large model into the image at build time.
Now the model loads in about 5 minutes on an n1-standard-16,
compared to more than 4 hours previously on an n2-standard-16, where I loaded the model from the GCS bucket.
No GPUs in either case.
I had read that GCS imposes a lot of I/O latency, but damn, I was not expecting this much.