Introduction
Are you struggling to customize the deployment of open models on Vertex AI using Hugging Face Deep Learning Containers? Do you need precise control over your inference pipeline? Custom handlers are the solution. This blog post provides a practical example using the Google PaliGemma model for image captioning, demonstrating how to create your own custom handler, deploy it to Vertex AI (including local testing), and send prediction requests.
Deploying open models from Hugging Face on Vertex AI became quite straightforward thanks to Hugging Face Deep Learning Containers support in Vertex AI Model Garden. But what if your model needs special care and attention? Think unique dependencies (like those LoRA-powered diffusion models), complex input/output transformations (converting image segmentations to specific mask formats), or integrations with external services (like fetching weights from private cloud storage).
For these advanced scenarios, you need custom handlers!
What is a custom handler?
Custom handlers are lightweight Python classes that act as the orchestrator for your inference pipeline. They provide granular control over the inference process when running your model within a Hugging Face Deep Learning Container for PyTorch Inference on Vertex AI.
Simply create a handler.py file (and optionally a requirements.txt for dependencies) in your model repository or Google Cloud Storage (GCS) bucket, and the Hugging Face Deep Learning Container for PyTorch Inference will automatically detect and use them. This makes custom handlers a great fit for serving any model that is private, unreleased, or not natively supported in Transformers, Diffusers, or Sentence Transformers.
Let’s illustrate with a practical example: deploying PaliGemma on Vertex AI. You can find the notebook here.
Custom handlers in action: Image captioning with PaliGemma
Let’s say you’re interested in using PaliGemma for image captioning, and you want to use a Hugging Face Deep Learning Container on Vertex AI to accomplish this. PaliGemma, a versatile open vision-language model (VLM), can process both text and images as inputs. Its capabilities include generating captions, answering questions, detecting objects, and more. According to the Hugging Face Deep Learning Containers documentation, you can use the Hugging Face Deep Learning Container for Text Generation Inference (TGI) to deploy PaliGemma on Vertex AI. And, in fact, if you try it, it works (see a detailed guide on how to deploy LLMs using the TGI DLC in Deploy Gemma 7B with TGI DLC on Vertex AI). After deploying the model, you can send a prediction request to the endpoint as shown below.
from google.cloud import aiplatform

PROJECT_ID = ...
LOCATION = "us-central1"
ENDPOINT_ID = ...

aiplatform.init(project=PROJECT_ID, location=LOCATION)

endpoint = aiplatform.Endpoint(
    f"projects/{PROJECT_ID}/locations/{LOCATION}/endpoints/{ENDPOINT_ID}"
)

output = endpoint.predict(
    instances=[
        {
            "inputs": "what is in the image?\n",
            "parameters": {
                "max_new_tokens": 256,
                "do_sample": False,
            },
        },
    ],
)

print(output.predictions[0])
# A rabbit with...
While this basic deployment using the DLC for TGI initially works for simple use-cases, it requires passing image URLs, which isn’t practical for many applications. Custom handlers in the Hugging Face PyTorch DLC for Inference address this limitation.
You start by crafting a handler.py module. As detailed in the guide “Serve any model with Inference Endpoints + Custom Handlers”, your handler needs a specific structure:
from typing import Any, Dict


class EndpointHandler:
    def __init__(self, model_dir: str, **kwargs: Any) -> None:
        ...

    def __call__(self, data: Dict[str, Any]) -> Any:
        ...
The class must be named EndpointHandler and include both the __init__ and __call__ methods. Beyond that, you are free to add other methods within the class, or even functions outside of it, and use them within your class methods.
Here you can find an example of a custom handler for serving PaliGemma in our image captioning scenario.
from typing import Any, Dict, List

import torch
from transformers import PaliGemmaProcessor, PaliGemmaForConditionalGeneration
import base64
from io import BytesIO
from PIL import Image
import logging
import sys

# Configure logging to output to stdout
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    handlers=[logging.StreamHandler(sys.stdout)],
)
logger = logging.getLogger("huggingface_inference_toolkit")


class EndpointHandler:
    """
    Handles inference requests for a PaliGemma model that processes images and text prompts.
    """

    def __init__(
        self,
        model_dir: str = "/opt/huggingface/model",
        **kwargs: Any,
    ) -> None:
        # Initialize the processor from the specified directory
        self.processor = PaliGemmaProcessor.from_pretrained(model_dir)
        # Load the model with memory optimization and automatic device placement
        self.model = PaliGemmaForConditionalGeneration.from_pretrained(
            model_dir,
            low_cpu_mem_usage=True,
            device_map="auto",
        ).eval()

    def __call__(self, data: Dict[str, Any]) -> Dict[str, List[Any]]:
        """
        Process inference requests containing image and text prompts.

        Args:
            data: Dictionary containing an 'instances' list with 'prompt' and 'image_base64' for each instance

        Returns:
            Dictionary containing the list of generated responses
        """
        logger.info("Processing new request")
        predictions = []
        for instance in data["instances"]:
            # Validate that the input contains the required fields
            if any(key not in instance for key in {"prompt", "image_base64"}):
                error_msg = "Missing prompt or image_base64 in request body"
                logger.error(error_msg)
                raise ValueError(error_msg)

            try:
                # Decode the base64 image and convert it to a PIL Image
                image_bytes = BytesIO(base64.b64decode(instance["image_base64"]))
                image = Image.open(image_bytes)
                logger.info("Image loaded successfully")
            except Exception as e:
                error_msg = f"Failed to load image: {str(e)}"
                logger.error(error_msg)
                raise ValueError(error_msg)

            # Process the input text and image using the model's processor
            inputs = self.processor(
                text=instance["prompt"], images=image, return_tensors="pt"
            ).to(self.model.device)
            input_len = inputs["input_ids"].shape[-1]
            logger.info(f"Input processed, length: {input_len}")

            with torch.inference_mode():
                # Get generation parameters from the request or use defaults
                generation_kwargs = data.get(
                    "generation_kwargs", {"max_new_tokens": 100, "do_sample": False}
                )
                logger.info(f"Generation kwargs: {generation_kwargs}")
                # Generate the response using the model
                generation = self.model.generate(**inputs, **generation_kwargs)
                # Extract only the new tokens (excluding the input tokens)
                generation = generation[0][input_len:]
                # Decode the generated tokens to text
                response = self.processor.decode(generation, skip_special_tokens=True)
                logger.info(f"Generated response: {response[:100]}...")
                predictions.append(response)

        logger.info(f"Successfully processed {len(predictions)} instances")
        return {"predictions": predictions}
For each incoming request, the handler preprocesses the input by extracting a text prompt and a base64-encoded image. It then uses the PaliGemma processor to convert these inputs into PyTorch tensors, which are subsequently fed into the PaliGemma model for inference. The generated output is then decoded back into text and returned as a dictionary with the generated text under the “predictions” key.
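To make the expected payload concrete, here is a minimal sketch of how a client could build a request for this handler; the image file name and prompt text are placeholders, and the fields simply mirror the instances and generation_kwargs keys the handler reads above.

import base64

# Placeholder image file; any JPEG or PNG works.
with open("image.jpg", "rb") as f:
    image_base64 = base64.b64encode(f.read()).decode("utf-8")

prediction_request = {
    "instances": [
        {"prompt": "caption the image\n", "image_base64": image_base64},
    ],
    "generation_kwargs": {"max_new_tokens": 100, "do_sample": False},
}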
To deploy this solution, upload both the custom handler code (handler.py) and the PaliGemma model (weights and configuration) to a Google Cloud Storage (GCS) bucket. You can obtain the PaliGemma model weights from the Hugging Face Hub, after accepting their licensing and terms of use, with the Hugging Face CLI (huggingface-cli), or use a locally cached version (if available in your HF_HOME directory). Below is a view of the resulting GCS bucket.
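If you prefer to script this step from Python instead of using the CLI, here is a minimal sketch; the bucket name and prefix are placeholders, and it simply downloads the weights with huggingface_hub and uploads them, together with handler.py, under a single GCS prefix.

from pathlib import Path

from google.cloud import storage
from huggingface_hub import snapshot_download

# Download the model weights locally (requires accepting the PaliGemma license on the Hub).
local_dir = Path("paligemma-3b-mix-448")
snapshot_download(repo_id="google/paligemma-3b-mix-448", local_dir=local_dir)

# Upload the weights and the custom handler side by side under one GCS prefix.
bucket = storage.Client().bucket("your-bucket-name")  # placeholder bucket name
prefix = "google--paligemma-3b-mix-448"  # placeholder prefix
for path in local_dir.rglob("*"):
    if path.is_file():
        bucket.blob(f"{prefix}/{path.relative_to(local_dir)}").upload_from_filename(str(path))
bucket.blob(f"{prefix}/handler.py").upload_from_filename("handler.py")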
Before deploying your model to Vertex AI, you can benefit from the Python SDK’s LocalModel class which simulates a Vertex AI Endpoint deployment, enabling local building and testing. LocalModel creates a Docker container encapsulating your custom predictor code and handler within the Hugging Face Deep Learning Container. The example below demonstrates how to test the PaliGemma model using this approach.
from google.cloud.aiplatform.prediction import LocalModel
import json

local_paligemma_model = LocalModel(
    serving_container_image_uri="us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-inference-cu121.2-2.transformers.4-44.ubuntu2204.py311",
    serving_container_ports=[5000],
)

local_paligemma_endpoint = local_paligemma_model.deploy_to_local_endpoint(
    artifact_uri=str(model_uri),
    gpu_device_ids=get_cuda_device_names(),
)
local_paligemma_endpoint.serve()

vertex_prediction_request = json.dumps(prediction_request)
vertex_prediction_response = local_paligemma_endpoint.predict(
    request=vertex_prediction_request,
    headers={"Content-Type": "application/json"},
)
print(vertex_prediction_response.json()["predictions"])
You initialized a LocalModel instance by setting the Hugging Face PyTorch DLC for Inference as the serving container image (and, optionally, environment variables that control model serving behavior). The deploy_to_local_endpoint and serve methods then build and launch the local endpoint, and you can finally test the model and its handler by sending a prediction request.
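While the local endpoint is running, the returned LocalEndpoint object also exposes helpers that are handy for debugging. The calls below (run_health_check, print_container_logs, stop) are a sketch based on the Vertex AI SDK’s local prediction utilities, so check that they are available in the SDK version you are using.

# Check that the serving container is healthy before sending predictions.
health_check_response = local_paligemma_endpoint.run_health_check()
print(health_check_response.status_code)

# Inspect the container logs (the custom handler logs to stdout, so they show up here).
local_paligemma_endpoint.print_container_logs(show_all=True)

# Stop and remove the local container once you are done testing.
local_paligemma_endpoint.stop()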
After running local testing, you deploy PaliGemma to a Vertex AI Endpoint by registering the model within the Vertex AI Model Registry. This centralized repository manages the lifecycle of your ML models on Vertex AI. The following Python code demonstrates how to register the PaliGemma model using the Vertex AI SDK:
from google.cloud.aiplatform import Model

model = Model.upload(
    display_name="google--paligemma-3b-mix-448",
    artifact_uri=str(model_uri),
    serving_container_image_uri="us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-inference-cu121.2-2.transformers.4-44.ubuntu2204.py311",
    serving_container_ports=[8080],
)
model.wait()
Successfully registering your model in the Vertex AI Model Registry creates a new model version, which is visible in the Model Registry UI.
After registering the model, you can deploy your custom-handler-enabled PaliGemma model for image captioning to an Endpoint within Vertex AI Prediction for scalable online and batch inference. The code example below demonstrates how to create an endpoint and deploy the model, including how to specify both the machine type and the accelerator settings.
from google.cloud.aiplatform import Endpoint

deployed_model = model.deploy(
    endpoint=Endpoint.create(display_name="google--paligemma-3b-mix-448-endpoint"),
    machine_type="g2-standard-4",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
)
You can verify the status of the endpoint in the Vertex AI Prediction UI.
Deploying your model to Vertex AI typically takes some time. Once the endpoint is ready, you can use PaliGemma to generate image captions by submitting your image and prompt as demonstrated in the Gradio application below.
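If you prefer to call the endpoint directly from Python rather than through a Gradio app, a request could look like the sketch below; it reuses the prediction_request payload built for local testing and the Endpoint object returned by model.deploy, and assumes raw_predict is available in your version of the Vertex AI SDK.

import json

# Send the same JSON body used during local testing to the deployed endpoint.
raw_response = deployed_model.raw_predict(
    body=json.dumps(prediction_request).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(raw_response.json()["predictions"][0])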
Conclusion
This post provided a step-by-step guide to deploying open models that need precise control over their inference pipeline on Vertex AI using Hugging Face Deep Learning Containers, with the PaliGemma image captioning model as a practical example. We walked through building a custom handler to tailor the inference process, demonstrated local testing using the Vertex AI SDK, and finally deployed the model to Vertex AI for online predictions. Custom handlers help you tackle even the most intricate open (or private) model deployment scenarios and build new generative AI applications.
So what are you going to build next?
What’s next
Explore these resources to dive deeper into Vertex AI Model Garden and Hugging Face Deep Learning Containers.
Documentation
- Go to Vertex AI Model Garden
- Open models on Vertex AI
GitHub examples
- Hugging Face Google Cloud Containers on GitHub
Thanks for reading
I hope you enjoyed the article. If so, share it or leave a comment. Also, let’s connect on LinkedIn or X to share feedback and any questions about Vertex AI you would like answered.