I deployed Llama-2 13B and 70B on Vertex AI through the Model Garden. Deployment was successful, but when I hit the endpoint through curl I keep getting the error below. Has anyone tried Llama-2 on Vertex AI?
{
  "error": {
    "code": 503,
    "message": "Took too long to respond when processing endpoint_id: {endpoint_id}, deployed_model_id: {deployed_model_id}",
    "status": "UNAVAILABLE"
  }
}
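For context, this is roughly the request I'm sending, here sketched in Python rather than raw curl. The project, region, and endpoint ID are placeholders, and the instance fields follow the Model Garden text-generation examples, so adjust if your serving container expects a different shape:

```python
import json

# Placeholder identifiers -- substitute your own project/region/endpoint.
PROJECT = "my-project"
REGION = "us-central1"
ENDPOINT_ID = "1234567890"

# Standard Vertex AI online-predict REST URL for a deployed endpoint.
url = (
    f"https://{REGION}-aiplatform.googleapis.com/v1/projects/{PROJECT}"
    f"/locations/{REGION}/endpoints/{ENDPOINT_ID}:predict"
)

# Vertex AI expects a JSON body with an "instances" list; the per-instance
# keys ("prompt", "max_tokens") are assumptions based on the Model Garden
# llama-2 examples and may differ for other containers.
body = json.dumps(
    {"instances": [{"prompt": "Hello, who are you?", "max_tokens": 64}]}
)
print(url)
print(body)
```

The equivalent curl call would POST `body` to `url` with an `Authorization: Bearer $(gcloud auth print-access-token)` header.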
Same here. For batch predictions using the provided Colab, the Docker image does not accept the recommended accelerator,
and the endpoint logs show this error (to follow up on the timeout):
ValueError: The current device_map had weights offloaded to the disk. Please provide an offload_folder for them. Alternatively, make sure you have safetensors installed if the model you are using offers the weights in this format.
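That ValueError comes from accelerate offloading some weights to disk without being told where to put them. A rough sketch of what the fix would look like in a transformers-based loader; the model name and folder are my placeholders, not taken from the Vertex serving container:

```python
# Sketch of the fix: pass an offload_folder so accelerate knows where to
# spill weights that don't fit in GPU/CPU memory. The paths and model name
# are placeholders -- not verbatim from the Vertex container.
load_kwargs = {
    "device_map": "auto",              # let accelerate place layers across devices
    "offload_folder": "/tmp/offload",  # directory for disk-offloaded weights
    "torch_dtype": "auto",
}

# In a transformers-based server these kwargs would be used as:
#   from transformers import AutoModelForCausalLM
#   model = AutoModelForCausalLM.from_pretrained(
#       "meta-llama/Llama-2-13b-hf", **load_kwargs)
print(load_kwargs["offload_folder"])
```

Alternatively, as the error says, installing safetensors can avoid the offload path entirely when the model ships safetensors weights.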
It's possible this is an issue with the resources Vertex AI uses in your project, or with your project hitting quota (you can check these in Logging and on your Quotas page, respectively). I would recommend contacting Google support to investigate further: https://cloud.google.com/contact
I got batch predictions working using a region with the recommended GPU accelerators, so it does seem to be a matter of GPU availability in the region.
I’m having the same issue here. Fortunately, one of the endpoints is working with the configuration below. None of the others, with different machine types or accelerators, worked. Check whether the same works for you all.
It’s not possible to use these configs when “one-click” deploying llama2-7b because of the swap memory required. You can use n1-standard-8 instead, which has more memory capacity. It will be more expensive though
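As a rough sketch, this is what I mean, expressed as the kwargs you'd pass to `Model.deploy()` in the google-cloud-aiplatform SDK. The accelerator values are placeholders and region-dependent; match them to what your region actually offers:

```python
# Deploy settings using the larger machine type; accelerator_type here is a
# placeholder -- pick one that is actually available in your region.
deploy_kwargs = {
    "machine_type": "n1-standard-8",          # more memory than n1-standard-4
    "accelerator_type": "NVIDIA_TESLA_V100",  # placeholder; region-dependent
    "accelerator_count": 1,
    "min_replica_count": 1,
    "max_replica_count": 1,
}

# Usage (not run here): model.deploy(endpoint=endpoint, **deploy_kwargs)
print(deploy_kwargs["machine_type"])
```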