Simplify Open Models’ Deployment on Vertex AI with Import Custom Model Weights

This blog has been co-authored with Eliza Huang, Software Engineer, Vertex AI at Google Cloud.

TL;DR

You fine-tuned your open model. So now what? The next step is model deployment. This guide shows how to easily take custom open models from Hugging Face and deploy them to a scalable endpoint on Vertex AI with just a couple of commands. The new ‘Import Custom Model Weights’ feature, using the Vertex AI Model Garden SDK, automates the model deployment process and provisions the necessary serving infrastructure. This guide provides a complete walkthrough.

This guide is current as of August 2025 and requires google-cloud-aiplatform version 1.105.0 or newer.


The challenge: The “last mile” of open model deployment

The open model community is incredible. Platforms like Hugging Face give us access to thousands of powerful, specialized, and fine-tuned models. We can download them, experiment locally, and see what they’re capable of.

However, the journey from a model sitting in a repository to a production-ready, scalable endpoint in the cloud—the ‘last mile’ of model deployment—has often been filled with friction. This final stage of model deployment has traditionally involved a series of steps, as illustrated below:

This process is tedious and slows the journey from experimentation to a successful model deployment.

The solution: Custom Model for simplified model deployment

The new ‘Import Custom Model Weights’ feature directly addresses this problem. Vertex AI built this capability right into the Model Garden SDK, so it handles the heavy lifting of your model deployment for you.

Instead of manual steps, you now have a high-level API that treats model deployment as a simple, repeatable, model-centric process. The core of this new workflow is the vertexai.preview.model_garden.CustomModel class, which acts as your single entry point for importing and deploying custom open models to a scalable endpoint.

The ‘Import Custom Model Weights’ transforms model deployment from a tedious infrastructure task to a streamlined process.

Example: Deploying a fine-tuned Gemma model

To demonstrate this, we’ll deploy a fine-tuned Gemma model from Hugging Face in three steps. You can find the full, runnable notebook for this example here.

Step 1: Transfer the fine-tuned model to Google Cloud Storage

First, the model’s assets need to be in Google Cloud Storage. We’ve created a helper function that automates this entire step, preparing your assets for model deployment. It uses the hf_transfer library for accelerated downloads and our transfer_manager for efficient uploads to your Google Cloud Storage (GCS) bucket.

All you need to do is call it with the model’s Hugging Face ID, your bucket name, and the region in which the bucket should be created:

# The Hugging Face model we want to import.
hf_model_id = "xsanskarx/thinkygemma-4b"

# This command will download the model and upload it to your Google Cloud Storage (GCS) bucket.
imported_custom_model_uri = transfer_model(hf_model_id, BUCKET_NAME, LOCATION)
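
The notebook defines this helper for you. For reference, here is a minimal sketch of what a transfer_model function along these lines could look like; it assumes the hf_transfer and google-cloud-storage packages are installed and that BUCKET_NAME and LOCATION are plain strings you have already defined, and the exact helper in the notebook may differ in its details.

import os

# Enable accelerated downloads via hf_transfer before importing huggingface_hub.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download
from google.cloud import storage
from google.cloud.storage import transfer_manager


def transfer_model(hf_model_id: str, bucket_name: str, location: str) -> str:
    """Downloads a Hugging Face model and uploads its files to a GCS bucket."""
    # Download all model files to a local directory.
    local_dir = snapshot_download(repo_id=hf_model_id, local_dir="/tmp/model")

    client = storage.Client()
    # Reuse the bucket if it already exists, otherwise create it in the given region.
    bucket = client.lookup_bucket(bucket_name) or client.create_bucket(
        bucket_name, location=location
    )

    # Collect the relative paths of every downloaded file.
    filenames = []
    for root, _, files in os.walk(local_dir):
        for name in files:
            filenames.append(os.path.relpath(os.path.join(root, name), local_dir))

    # Upload the files concurrently under a folder named after the model.
    prefix = hf_model_id.replace("/", "--") + "/"
    transfer_manager.upload_many_from_filenames(
        bucket,
        filenames,
        source_directory=local_dir,
        blob_name_prefix=prefix,
        max_workers=8,
    )

    # Return the GCS folder that now holds the model weights.
    return f"gs://{bucket_name}/{prefix}"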

Step 2: Import and deploy the model on Vertex AI

With the model artifacts in Google Cloud Storage, you are now set to deploy your custom model. We use the CustomModel class, pointing it to the Google Cloud Storage path where our model now lives.

from vertexai.preview import model_garden

# Create a CustomModel object from the artifacts in GCS.
model = model_garden.CustomModel(
    gcs_uri=imported_custom_model_uri,
)

Next, a single call to the .deploy() method handles the rest of the model deployment process. This creates a dedicated endpoint, which provides a private, low-latency network path to your model, ensuring consistent performance and enhanced security for production workloads.

# This command initiates the model deployment to a new endpoint.
# This can take 15-20 minutes as it provisions the hardware.
endpoint = model.deploy(
    machine_type="g2-standard-24",
    accelerator_type="NVIDIA_L4",
    accelerator_count=2,
)

Running the command triggers a validation process and provisions the complete serving infrastructure—the underlying compute, GPUs, and optimized inference engine—needed to run your model efficiently. You can monitor the progress in the Vertex AI Endpoint UI, as shown below:
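
If you prefer to check from code rather than the console, the object returned by .deploy() behaves like a regular Vertex AI Endpoint, so once the call returns you can inspect it directly (a small sketch, assuming the standard aiplatform.Endpoint attributes):

# Print the fully qualified resource name of the new endpoint.
print("Endpoint resource name:", endpoint.resource_name)

# List the model(s) currently deployed on the endpoint.
for deployed_model in endpoint.gca_resource.deployed_models:
    print("Deployed model:", deployed_model.id, deployed_model.display_name)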

Step 3: Run inference on the scalable endpoint

Once the model deployment is complete, your model is live on a scalable endpoint. You can immediately start sending requests to it.


# Interact with the deployed model on the scalable endpoint
response = endpoint.predict(
    instances=[{"prompt": "how many r does strawberry have?"}],
    use_dedicated_endpoint=True
)

print(response.predictions)
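
Depending on the serving runtime behind the endpoint (vLLM in this example), the instance payload can usually carry sampling parameters alongside the prompt. The field names below follow the common vLLM serving convention and are an assumption rather than a guarantee for every runtime:

# Same request, with vLLM-style sampling parameters added to the instance.
response = endpoint.predict(
    instances=[
        {
            "prompt": "how many r does strawberry have?",
            "max_tokens": 256,   # cap the length of the generated answer
            "temperature": 0.7,  # add some randomness to the sampling
            "top_p": 0.95,       # nucleus sampling threshold
        }
    ],
    use_dedicated_endpoint=True,
)

print(response.predictions)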

Bonus: Using the OpenAI SDK with your scalable endpoint

If you prefer the OpenAI client library, your dedicated, scalable endpoint is also compatible! This allows you to integrate your custom deployed model into existing applications with minimal code changes.

Here’s how you can connect to the same scalable endpoint using the openai Python library:

import google.auth
import google.auth.transport.requests
import openai

# Authenticate and get credentials
creds, project = google.auth.default()
auth_req = google.auth.transport.requests.Request()
creds.refresh(auth_req)

# Construct the special endpoint URL for the OpenAI client
endpoint_url = f"https://{endpoint.gca_resource.dedicated_endpoint_dns}/v1beta1/{endpoint.resource_name}"

# Initialize the OpenAI client with the endpoint URL and your GCP token
client = openai.OpenAI(base_url=endpoint_url, api_key=creds.token)

# Make the prediction request
prediction = client.chat.completions.create(
    model="", # The model name is managed by the endpoint
    messages=[{"role": "user", "content": "Tell me a joke"}],
    temperature=0.7
)

print(prediction.choices[0].message.content)
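
Because the endpoint speaks the OpenAI chat completions protocol, you can usually stream tokens with the same client. This sketch assumes the serving runtime behind your endpoint supports streaming (vLLM does):

# Stream the response token by token with the same OpenAI client.
stream = client.chat.completions.create(
    model="",  # the model name is managed by the endpoint
    messages=[{"role": "user", "content": "Tell me a joke"}],
    temperature=0.7,
    stream=True,
)

for chunk in stream:
    # Each chunk carries an incremental piece of the assistant's message.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()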

Some considerations for your model deployment

It’s important to note that this ‘Import Custom Model Weights’ functionality is currently in Public Preview. This means the team is actively developing it, and your feedback will help improve it.

Here are a few things to keep in mind:

  • Broad Model Compatibility: This feature is designed to be highly flexible for any model deployment. It supports all deployable models in Model Garden and the vast majority of Hugging Face models, both fully fine-tuned models and PEFT (LoRA) adapters, with backend serving runtimes including vLLM, SGLang, TGI, TEI, and regular PyTorch.
  • Pricing Model: Billing is based on the hourly cost of the provisioned serving infrastructure for your scalable endpoint, not on a per-token basis.
  • Scale-to-Zero: We know cost-efficiency is key. Support for scaling endpoints down to zero is in the works; in the meantime, you can tear endpoints down when they are idle, as shown in the sketch after this list. Stay tuned for updates!
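
Until scale-to-zero is available, a simple way to control cost is to tear the endpoint down whenever you are not using it. The calls below are standard aiplatform Endpoint methods and assume the object returned by .deploy() exposes them:

# Stop paying for the provisioned hardware when the endpoint is not needed.
endpoint.undeploy_all()  # remove all deployed models from the endpoint
endpoint.delete()        # delete the endpoint resource itself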

Conclusion

The ‘Import Custom Model Weights’ feature is an important improvement for open model deployment on Vertex AI. It provides a simple, model-centric mechanism for bridging the gap between open models and a production-ready, scalable endpoint. This new approach to model deployment means you can focus more on customizing models and less on managing serving infrastructure.

What’s next

To start your next model deployment with your own models on Vertex AI, check out the following resources:

We’d love to hear from you! Share your feedback and connect with our community on LinkedIn, X/Twitter.

Happy building!
