I do not have much experience with Vertex AI. I fine-tuned Llama 3.3, and now I want to use it to test my prompts. To run batch inference, I registered the model and submitted a batch job as shown below, but the job keeps failing with an error saying it cannot find the config file. However, when I check with gsutil (command below), the file clearly exists at that GCS path. What could be causing this? The account has editor-level permissions.
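For reference, this is the check I ran to confirm the file is there (same path as GCS_MODEL_PATH in the code below):

gsutil ls gs://llm_proj/llama/llama_5/postprocess/node-0/checkpoints/final/config.json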
Error:
Can't load the configuration of '/gcs/llm_proj/llama/llama_5/postprocess/node-0/checkpoints/final/'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/gcs/llm_proj/llama/llama_5/postprocess/node-0/checkpoints/final/' is the correct path to a directory containing a config.json file
Code:
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=LOCATION)

# Reuse the serving container image from the model I originally registered
old_model = aiplatform.Model(model_name=ORIGINAL_MODEL_ID)
old_model._sync_gca_resource()
image_uri = old_model.container_spec.image_uri
GCS_MODEL_PATH = "gs://llm_proj/llama/llama_5/postprocess/node-0/checkpoints/final"
# Same checkpoint as seen from inside the container via the /gcs/ FUSE-style mount
MODEL_PATH_IN_CONTAINER = "/gcs/llm_proj/llama/llama_5/postprocess/node-0/checkpoints/final/"
# VLLM_COMMAND = ["python3", "-m", "vllm.entrypoints.api_server"]
VLLM_COMMAND = ["bash", "-c"]
# Adjacent string literals concatenate, so this list holds one shell command string
CORRECTED_ARGS = [
    f"python3 -m vllm.entrypoints.api_server "
    f"--model {MODEL_PATH_IN_CONTAINER} "
    "--host 0.0.0.0 "
    "--port 7080 "
    "--tensor-parallel-size 4 "
    "--max-model-len 8192 "
    "--gpu-memory-utilization 0.85 "
    "--swap-space 16 "
    "--max-num-seqs 12 "
    "--enable-chunked-prefill "
    "--enable-auto-tool-choice "
    "--tool-call-parser llama3_json "
    "--dtype bfloat16 "
    "--load-format safetensors "
    "--trust-remote-code "
    "--distributed-executor-backend mp"
]
ENVIRONMENT_VARIABLES = {
    "TRANSFORMERS_OFFLINE": "0",
    "HF_HUB_OFFLINE": "0",
    "VLLM_USE_MODELSCOPE": "False",
}
new_model = aiplatform.Model.upload(
    display_name="llama3-3-70b-v4-final-fixed",
    serving_container_image_uri=image_uri,
    serving_container_command=VLLM_COMMAND,
    serving_container_args=CORRECTED_ARGS,
    serving_container_environment_variables=ENVIRONMENT_VARIABLES,
    serving_container_predict_route="/v1/chat/completions",
    serving_container_health_route="/health",
    serving_container_ports=[7080],
)
batch_job = new_model.batch_predict(
    job_display_name="llama3-3-batch-inference-fixed",
    gcs_source="gs://llm_proj/ft_eval_data/eval_input_llama_test.jsonl",
    gcs_destination_prefix="gs://llm_proj/output/",
    machine_type="g2-standard-48",
    accelerator_type="NVIDIA_L4",
    accelerator_count=4,
    starting_replica_count=1,
    max_replica_count=1,
)