I deployed Llama-2 13B and 70B on Vertex AI through the Model Garden. Deployment was successful, but when I hit the endpoint through curl I keep getting the error below. Has anyone tried Llama-2 on Vertex AI?
{
  "error": {
    "code": 503,
    "message": "Took too long to respond when processing endpoint_id: {endpoint_id}, deployed_model_id: {deployed_model_id}",
    "status": "UNAVAILABLE"
  }
}
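For context, this is roughly the request I'm sending, here sketched in Python rather than raw curl. The project, region, and endpoint ID are placeholders, and the instance fields follow the Model Garden text-generation examples, so adjust if your serving container expects a different shape:

```python
import json

# Placeholder identifiers -- substitute your own project/region/endpoint.
PROJECT = "my-project"
REGION = "us-central1"
ENDPOINT_ID = "1234567890"

# Standard Vertex AI online-predict REST URL for a deployed endpoint.
url = (
    f"https://{REGION}-aiplatform.googleapis.com/v1/projects/{PROJECT}"
    f"/locations/{REGION}/endpoints/{ENDPOINT_ID}:predict"
)

# Vertex AI expects a JSON body with an "instances" list; the per-instance
# keys ("prompt", "max_tokens") are assumptions based on the Model Garden
# llama-2 examples and may differ for other containers.
body = json.dumps(
    {"instances": [{"prompt": "Hello, who are you?", "max_tokens": 64}]}
)
print(url)
print(body)
```

The equivalent curl call would POST `body` to `url` with an `Authorization: Bearer $(gcloud auth print-access-token)` header.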
Same here. For batch predictions using the provided Colab, the Docker image does not accept the recommended accelerator,
and the endpoint logs show this error (to follow up on the timeout):
ValueError: The current device_map had weights offloaded to the disk. Please provide an offload_folder for them. Alternatively, make sure you have safetensors installed if the model you are using offers the weights in this format.
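That ValueError comes from accelerate offloading some weights to disk without being told where to put them. A rough sketch of what the fix would look like in a transformers-based loader; the model name and folder are my placeholders, not taken from the Vertex serving container:

```python
# Sketch of the fix: pass an offload_folder so accelerate knows where to
# spill weights that don't fit in GPU/CPU memory. The paths and model name
# are placeholders -- not verbatim from the Vertex container.
load_kwargs = {
    "device_map": "auto",              # let accelerate place layers across devices
    "offload_folder": "/tmp/offload",  # directory for disk-offloaded weights
    "torch_dtype": "auto",
}

# In a transformers-based server these kwargs would be used as:
#   from transformers import AutoModelForCausalLM
#   model = AutoModelForCausalLM.from_pretrained(
#       "meta-llama/Llama-2-13b-hf", **load_kwargs)
print(load_kwargs["offload_folder"])
```

Alternatively, as the error says, installing safetensors can avoid the offload path entirely when the model ships safetensors weights.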
It's possible this is an issue with the resources Vertex AI uses in your project, or with your project hitting quota (you can check these in Logging and on your Quotas page, respectively). I would recommend contacting Google support to investigate further: https://cloud.google.com/contact
I got batch predictions working using a region with the recommended GPU accelerators, so it does seem to be a matter of GPU availability in the region.
I’m having the same issue here. Fortunately, one of the endpoints is working with the configuration below. None of the others, with different machine types or accelerators, worked. Check whether the same works for you all.
It’s not possible to use these configs when “one-click” deploying llama2-7b because of the swap memory required. You can use n1-standard-8 instead, which has more memory capacity. It will be more expensive though
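As a rough sketch, this is what I mean, expressed as the kwargs you'd pass to `Model.deploy()` in the google-cloud-aiplatform SDK. The accelerator values are placeholders and region-dependent; match them to what your region actually offers:

```python
# Deploy settings using the larger machine type; accelerator_type here is a
# placeholder -- pick one that is actually available in your region.
deploy_kwargs = {
    "machine_type": "n1-standard-8",          # more memory than n1-standard-4
    "accelerator_type": "NVIDIA_TESLA_V100",  # placeholder; region-dependent
    "accelerator_count": 1,
    "min_replica_count": 1,
    "max_replica_count": 1,
}

# Usage (not run here): model.deploy(endpoint=endpoint, **deploy_kwargs)
print(deploy_kwargs["machine_type"])
```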