1.0 Introduction
Choosing the correct endpoint type for a machine learning model is a foundational architectural decision that dictates how the model is accessed and secured. This decision has significant implications for network security, compliance, and operational complexity. Vertex AI offers several primary endpoint types for online prediction, each designed for a specific set of use cases.
This document provides a detailed technical deep-dive into each endpoint type, corrected and enhanced with official Google Cloud documentation:
- Public Endpoints: Accessible from the public internet, secured by IAM. These come in two variations: Shared and Dedicated.
- Private Endpoint with VPC Peering: Accessible only from within a specified Virtual Private Cloud (VPC) network via a VPC Peering connection.
- Private Endpoint with Private Service Connect (PSC): A modern mechanism to securely expose a private endpoint to consumers in other VPC networks or on-premise locations.
Understanding the prerequisites, architecture, creation process, and testing methodology for each is critical to deploying a secure, compliant, and performant ML service.
2.0 Core concept: Network accessibility
The fundamental difference between the endpoint types is their network accessibility.
- Public IP Address: A Public Endpoint is assigned a publicly routable IP address on the internet. Any client with valid credentials and internet access can reach it.
- Private IP Address: Private Endpoints (both VPC Peering and PSC) are not assigned a public IP address. They are assigned an IP address from a private range within your VPC network (e.g., `10.128.0.5`). They are fundamentally unreachable from the public internet, providing a powerful layer of network isolation.
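This distinction can be checked mechanically: RFC 1918 ranges such as `10.0.0.0/8` are never routable on the public internet. As a minimal illustration using only Python's standard library (not a Vertex AI API):

```python
import ipaddress

def is_private_address(ip: str) -> bool:
    """Return True if the IP falls in a private (non-internet-routable) range."""
    return ipaddress.ip_address(ip).is_private

# A VPC-internal address like the example above is private...
print(is_private_address("10.128.0.5"))     # True
# ...while a typical public IP is not.
print(is_private_address("142.250.80.10"))  # False
```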
3.0 Deep dive: Endpoint types
3.1. Public Endpoints
Public endpoints are accessible over the public internet and are secured by Google Cloud’s Identity and Access Management (IAM). They are the easiest to use as they don’t require private network infrastructure. They come in two forms: Shared and Dedicated.
What they are
- Shared (Standard) Endpoint: The default, multi-tenant endpoint type. It is suitable for development, testing, and use cases that are not latency-sensitive. Notably, tuned Gemini models can only be deployed to shared public endpoints.
- Dedicated Public Endpoint: The recommended best practice for production. It provides dedicated networking resources, resulting in optimized network latency, support for larger payloads (up to 10 MB), and longer, configurable request timeouts (up to 1 hour).
Architecture
+-----------------+ +-------------------+ +-------------------------+
| | | | | |
| Client App |---->| Public Internet |---->| Vertex AI Endpoint |
| (e.g., Laptop, | | | | (Public IP, IAM-Secured)|
| Mobile App) | | | | |
+-----------------+ +-------------------+ +-------------------------+
When to use them
- Shared Endpoints:
- For deploying fine-tuned Gemini models.
- During early development and rapid prototyping where ease of use is prioritized.
- When clients are outside of Google Cloud and cannot connect to a VPC.
- Dedicated Endpoints:
- For production applications requiring consistent performance and resource isolation.
- When low latency, large payloads, or long-running inference (e.g., streaming GenAI) is required.
- For most production use cases, this is the recommended public endpoint type.
Prerequisites
- IAM Permissions: The user or service account making the prediction request must have the `aiplatform.endpoints.predict` permission on the endpoint resource. The predefined Vertex AI User role (`roles/aiplatform.user`) includes this permission.
How to create them (Python SDK)
Creating a Shared Endpoint is the default behavior in the console.
To create a Dedicated Endpoint, which is recommended, you must explicitly enable it. You can also configure logging and timeouts.
```python
# In your deploy_model_to_vertex_op component
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=REGION)

# To create a DEDICATED public endpoint.
# This is the recommended approach for production.
dedicated_endpoint = aiplatform.Endpoint.create(
    display_name="my-dedicated-public-endpoint",
    # Key parameter to create a dedicated endpoint
    dedicated_endpoint_enabled=True,
    # Optional: Configure inference timeout (max 3600s)
    inference_timeout=600,  # in seconds
    # Optional: Enable and configure request-response logging to BigQuery
    enable_request_response_logging=True,
    request_response_logging_sampling_rate=1.0,
    request_response_logging_bq_destination_table="bq://your-project-id.your_dataset.your_table"
)
# Citation: For more details, see the official documentation on Creating a public endpoint. [14]

# The endpoint.deploy(...) call remains the same.
```
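Because the dedicated-endpoint parameters have documented bounds (timeout up to 3600 seconds, a logging sampling rate between 0 and 1, and a BigQuery destination in `bq://` form), a small pre-flight check can catch mistakes before the API call. This validator is an illustrative helper, not part of the SDK:

```python
def validate_dedicated_endpoint_config(
    inference_timeout: int,
    logging_sampling_rate: float,
    bq_destination_table: str,
) -> list[str]:
    """Return a list of human-readable problems; empty means the config looks sane."""
    problems = []
    if not (0 < inference_timeout <= 3600):
        problems.append("inference_timeout must be between 1 and 3600 seconds")
    if not (0.0 <= logging_sampling_rate <= 1.0):
        problems.append("request_response_logging_sampling_rate must be in [0, 1]")
    if not bq_destination_table.startswith("bq://"):
        problems.append("BigQuery destination must use the 'bq://' URI scheme")
    return problems

# The values used in the creation call above pass the check:
assert validate_dedicated_endpoint_config(600, 1.0, "bq://p.d.t") == []
```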
How to test them (detailed steps)
These commands can be run from any machine with the Google Cloud SDK (gcloud) and curl.

1. Authenticate your local machine:

```bash
gcloud auth application-default login
```

2. Set environment variables:

```bash
export PROJECT_ID="your-gcp-project-id"
export REGION="your-region"
export ENDPOINT_ID="your-endpoint-id"  # Get this from the Cloud Console or SDK output
```

3. Construct the full prediction URL for your endpoint:

```bash
export PREDICTION_URL="https://${REGION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${REGION}/endpoints/${ENDPOINT_ID}:predict"
echo "Prediction URL: $PREDICTION_URL"
```

4. Craft the JSON payload by creating a file named `request.json`:

```bash
cat > request.json <<EOF
{
  "instances": [
    {
      "Amount": "",
      "Indicator": "",
      "Merchant": "",
      "Description": ""
    }
  ]
}
EOF
```

5. Send the authenticated request. The `curl` command uses a bearer token for authentication:

```bash
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "$PREDICTION_URL" \
  -d @request.json
```

Citation: For more details, see the official documentation on Getting online predictions.
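The same request can also be issued from Python without curl. The sketch below builds the prediction URL and payload with the standard library only; actually sending the request requires an access token (e.g. via `google.auth`), so the network call is left as a commented final step. The instance field names mirror the example `request.json` above:

```python
import json

def build_predict_url(region: str, project_id: str, endpoint_id: str) -> str:
    """Construct the public-endpoint :predict URL in the documented format."""
    return (
        f"https://{region}-aiplatform.googleapis.com/v1/"
        f"projects/{project_id}/locations/{region}/endpoints/{endpoint_id}:predict"
    )

payload = json.dumps(
    {"instances": [{"Amount": "", "Indicator": "", "Merchant": "", "Description": ""}]}
)

url = build_predict_url("europe-west3", "your-gcp-project-id", "1234567890")
# To actually send it (requires credentials):
#   import urllib.request
#   req = urllib.request.Request(url, data=payload.encode(), headers={
#       "Authorization": f"Bearer {token}", "Content-Type": "application/json"})
#   response = urllib.request.urlopen(req)
```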
3.2. Private Endpoint with VPC Peering (private services access)
This endpoint type has no public IP address and uses VPC Network Peering to establish a private connection between your VPC and the Vertex AI service.
What it is
A Private Endpoint accessible only from within a specified VPC network. This provides strong network-level isolation, ensuring that only services within your trusted network can send requests to the model. All traffic stays within Google’s network.
Architecture
+-------------------------------------------------+
| |
| Your VPC Network (e.g., prod-shared-vpc) |
| (Peered with Google's Service Network) |
| +-------------+ +-----------------------+ |
| | | | | |
| | GCE Test VM |---->| Vertex AI Endpoint | |
| | | | (Private IP Only) | |
| +-------------+ | | |
| +-----------------------+ |
+-------------------------------------------------+
^
|
(No Path - Blocked by Firewall)
|
+-------------------+
| |
| Public Internet |
| |
+-------------------+
When to use it
- This is a strong choice for production banking and regulated environments.
- For internal microservices that should never be exposed to the public internet.
- When all clients (consumers) of the model reside within the same VPC network.
- Important Limitations: These endpoints do not support traffic splitting, SSL/TLS, or Vertex Explainable AI.
Prerequisites
- VPC Network: A VPC network must exist.
- Private Services Access: You must have a Private Services Access connection configured for your VPC. This is a one-time setup that reserves an IP range for Google-managed services (like Vertex AI) to peer with your VPC.
- IAM Permissions:
  - The service account creating the endpoint needs the `compute.networks.get` permission on the VPC network (e.g., via `roles/compute.networkUser`).
  - The caller making the prediction needs `aiplatform.endpoints.predict`.
Citation: For a detailed guide, see the official documentation on Using Private Endpoints for Online Prediction and Configuring Private Services Access.
How to create it (Python SDK) - Corrected method
To create a VPC-peered private endpoint, you must use the lower-level gapic.EndpointServiceClient instead of the high-level aiplatform.Endpoint.create() method. This client allows you to construct a full Endpoint object where the network field can be specified.
```python
# In your deploy_model_to_vertex_op component
from google.cloud import aiplatform
from google.cloud.aiplatform_v1.types import Endpoint

# 1. Initialize the low-level GAPIC client for the specified region.
client_options = {"api_endpoint": f"{REGION}-aiplatform.googleapis.com"}
endpoint_client = aiplatform.gapic.EndpointServiceClient(client_options=client_options)

# 2. Define the full parent resource path for endpoint creation.
parent = f"projects/{PROJECT_ID}/locations/{REGION}"

# 3. Get the full network name from your configuration.
NETWORK_NAME = config["network"]

# 4. Construct the Endpoint object. The presence of the 'network' field
#    is what instructs the API to create a VPC-peered private endpoint.
private_endpoint_spec = Endpoint(
    display_name="my-private-peering-endpoint",
    network=NETWORK_NAME,
)

# 5. Call the client's create_endpoint method. This returns a long-running operation.
operation = endpoint_client.create_endpoint(
    parent=parent,
    endpoint=private_endpoint_spec
)
print("Waiting for private endpoint creation operation to complete...")
created_endpoint_resource = operation.result()
print(f"Successfully created private endpoint: {created_endpoint_resource.name}")

# 6. (Optional but recommended) Wrap the created resource in the high-level SDK object.
endpoint = aiplatform.Endpoint(endpoint_name=created_endpoint_resource.name)

# The endpoint.deploy(...) call remains the same.
# endpoint.deploy(model=model, ...)
```
A guide to navigating the documentation: The source of confusion
The confusion you encountered—where a parameter seems to be missing from a high-level SDK method—is a common and important one when working with Google Cloud. Understanding why it happens is key to navigating the documentation effectively.
Google Cloud tools exist in a hierarchy of abstraction:
- Cloud Console (GUI): The highest level of abstraction, designed for interactive use.
- `gcloud` CLI: A powerful command-line tool that often provides direct flags (`--network`) for common operations.
- High-Level Python SDK (`google-cloud-aiplatform`): The "Pythonic" library (e.g., `aiplatform.Endpoint`). It's designed for convenience and the most common 80% of use cases. It simplifies object creation but may hide less common or legacy parameters.
- Low-Level GAPIC Python SDK (`aiplatform.gapic`): An auto-generated library that provides a direct, 1-to-1 mapping of the underlying API. It is more verbose but guarantees access to every single API feature.
- REST/gRPC API: The fundamental service definition. This is the ultimate source of truth for what the service can do.
The discrepancy arose because:
- The `network` parameter exists in the fundamental REST API definition for an `Endpoint` resource.
- The `gcloud` CLI exposes this directly with the `--network` flag.
- The high-level `aiplatform.Endpoint.create()` method chose not to expose this specific parameter, likely to streamline the interface and guide users toward the more modern Private Service Connect (PSC) method.
- Therefore, to access this parameter from Python, you must drop down one level of abstraction to the low-level GAPIC SDK, which is guaranteed to have it.
How to know which tool to use:
| Tool | When to Use It | Key Characteristics |
|---|---|---|
| `gcloud` CLI | Interactive tasks, shell scripting, quick operations. | Fast and convenient for single actions. |
| High-Level SDK | Your default choice for application development in Python. | Pythonic, object-oriented, easy to read. Covers most common use cases. |
| Low-Level GAPIC SDK | When a feature or parameter is missing from the high-level SDK. | Verbose, requires manual object construction, but provides 100% API coverage. |
The golden rule: If you see a feature in the REST API or gcloud documentation but can’t find it in the high-level Python SDK’s method signature, the solution is almost always to use the low-level GAPIC client for that specific task.
How to test it (detailed steps)
You must perform these steps from a machine inside the specified VPC (e.g., a Google Compute Engine VM).
1. Connect to a test VM in the VPC:

```bash
gcloud compute ssh my-test-vm --zone=europe-west3-a --project=your-host-project-id
```

2. Run the test commands inside the VM. The prediction URL for a private endpoint has a different format and must be constructed manually:

```bash
# --- The following commands are run INSIDE the VM ---
export PROJECT_ID="your-gcp-project-id"
export REGION="your-region"
export ENDPOINT_ID="your-private-endpoint-id"

export DEPLOYED_MODEL_ID=$(gcloud ai endpoints describe $ENDPOINT_ID \
  --region=$REGION \
  --project=$PROJECT_ID \
  --format="value(deployedModels[0].id)")

export PREDICTION_URL="http://${ENDPOINT_ID}.aiplatform.googleapis.com/v1/models/${DEPLOYED_MODEL_ID}:predict"
echo "Prediction URL: $PREDICTION_URL"

cat > request.json <<EOF
{
  "instances": [
    {
      "Amount": "1250",
      "Indicator": "D",
      "Merchant": "Myntra",
      "Description": "Online shopping for clothes"
    }
  ]
}
EOF

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "$PREDICTION_URL" \
  -d @request.json
```
Citation: The specific URL format and testing methodology are outlined in the private endpoint documentation.
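The URL-construction step from the VM can be mirrored in Python, which makes the format difference from public endpoints explicit (plain HTTP, a per-endpoint hostname, and a deployed-model path). This is a sketch of the string format shown above, not an SDK call:

```python
def build_private_predict_url(endpoint_id: str, deployed_model_id: str) -> str:
    """VPC-peered private endpoints use HTTP and an endpoint-specific hostname."""
    return (
        f"http://{endpoint_id}.aiplatform.googleapis.com"
        f"/v1/models/{deployed_model_id}:predict"
    )

url = build_private_predict_url("1234567890", "9876543210")
# Note: this hostname is resolvable only from inside the peered VPC, and the
# scheme is plain HTTP -- consistent with the SSL/TLS limitation noted above.
```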
3.3. Private Endpoint with Private Service Connect (PSC)
Private Service Connect (PSC) is a networking feature that allows you to privately and securely consume a service (like a Vertex AI Private Endpoint) from a different VPC network or an on-premise environment. It acts as a secure, one-way entry point into the service producer’s VPC without merging networks.
What it is
This is the recommended approach for private, secure, and low-latency connections, especially when serving multiple consumer projects. It uses a forwarding rule in your VPC to send traffic to a service attachment that exposes the Vertex AI service.
Architecture
+--------------------------+ +--------------------------+
| Service Producer VPC | | Service Consumer VPC |
| (e.g., prod-shared-host) | | (e.g., another team's) |
| | | |
| +----------------------+ | +------------------+ | +----------------------+ |
| | Vertex AI Endpoint | | | Service Attachment |<--| Forwarding Rule | |
| | (Private IP) | |<--| (PSC) | | | (Stable Private IP) | |
| +----------------------+ | +------------------+ | +----------------------+ |
| | | ^ |
+--------------------------+ | | |
| +-------------+ |
| | GCE Test VM |----------+
| +-------------+ |
+--------------------------+
When to use it
- When a central ML platform team needs to provide prediction services to multiple business units, each with their own separate VPC network.
- When an on-premise application needs to securely call a Vertex AI model without traversing the public internet.
- When you need to provide a stable, private IP address for your service that consumers can add to their allowlists.
Prerequisites
- All prerequisites for a Private Endpoint with VPC Peering (Section 3.2).
- IAM roles: Setup may require permissions held by a Network Administrator (`compute.networkAdmin`).
- Consumer VPC: A separate VPC network that will consume the service.
How to create it
Creation involves creating a special type of private endpoint and then connecting it via PSC. There are two methods: manual and automated.
Method 1: Manual PSC Setup (ML Engineer + Network Admin)
1. Create the private endpoint with a PSC config (ML Engineer). The endpoint is created with a `PrivateServiceConnectConfig` that specifies which projects are allowed to connect:

```python
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=REGION)

# Create a PSC-ready private endpoint
psc_endpoint = aiplatform.PrivateEndpoint.create(
    display_name="my-psc-ready-endpoint",
    private_service_connect_config=aiplatform.PrivateEndpoint.PrivateServiceConnectConfig(
        project_allowlist=[CONSUMER_PROJECT_ID_1, CONSUMER_PROJECT_ID_2]
    )
)
# Citation: This code is based on the official PSC documentation. [2]
```

2. Set up the PSC forwarding rule (Network Admin). After the endpoint is created, a network admin gets its service attachment and creates a forwarding rule in the consumer's project:

```bash
# 1. Get the Service Attachment URI from the endpoint in the PRODUCER project
SERVICE_ATTACHMENT=$(gcloud ai endpoints describe $ENDPOINT_ID \
  --region=$REGION \
  --project=$PRODUCER_PROJECT_ID \
  --format="value(serviceAttachment)")

# 2. In the CONSUMER project, create a forwarding rule pointing to the service attachment
gcloud compute forwarding-rules create my-model-consumer-endpoint \
  --project=$CONSUMER_PROJECT_ID \
  --region=$REGION \
  --network=consumer-vpc-name \
  --target-service-attachment="$SERVICE_ATTACHMENT"
```

Citation: This two-part manual process is detailed in the PSC documentation.
Method 2: PSC Automation (preview)
This new feature simplifies the process by automatically creating the PSC forwarding rule, which is ideal for ML developers who lack network admin permissions.
- One-Time Setup (Network Admin): The network admin creates a `service-connection-policy` that allows Vertex AI to create PSC endpoints in the network.
- Create Endpoint with Automation Config (ML Engineer): The ML engineer creates the endpoint with a special configuration:

```python
# Example of creating an endpoint with PSC Automation enabled
config = aiplatform.compat.types.service_networking.PrivateServiceConnectConfig(
    enable_private_service_connect=True,
    project_allowlist=[PROJECT_ID],
    psc_automation_configs=[
        aiplatform.compat.types.service_networking.PSCAutomationConfig(
            project_id=PROJECT_ID,
            network=f"projects/{PROJECT_ID}/global/networks/{NETWORK_NAME}"
        )
    ]
)

psc_endpoint = aiplatform.PrivateEndpoint.create(
    display_name="my-automated-psc-endpoint",
    private_service_connect_config=config,
)
# Citation: This preview feature is described in the PSC documentation. [2]
```
How to test it (detailed steps)
Testing is done from a VM inside the consumer’s VPC, and the request is sent to the forwarding rule’s IP address.
1. Connect to a VM in the consumer VPC:

```bash
gcloud compute ssh my-consumer-vm --zone=... --project=$CONSUMER_PROJECT_ID
```

2. Run the test commands inside the consumer VM. The `curl` command is more complex because it requires overriding the `Host` header:

```bash
# --- The following commands are run INSIDE the consumer VM ---
export PRODUCER_PROJECT_ID="..."
export CONSUMER_PROJECT_ID="..."
export REGION="..."
export ENDPOINT_ID="..."  # The original endpoint ID from the producer project
export FORWARDING_RULE_NAME="my-model-consumer-endpoint"

# 1. Get the IP address of the forwarding rule in the consumer VPC
FORWARDING_RULE_IP=$(gcloud compute forwarding-rules describe $FORWARDING_RULE_NAME \
  --project=$CONSUMER_PROJECT_ID \
  --region=$REGION \
  --format="value(IPAddress)")

# 2. The PREDICTION_URL is built from the forwarding rule's IP. Port 443 is used for HTTPS.
export PREDICTION_URL="https://${FORWARDING_RULE_IP}:443/v1/projects/${PRODUCER_PROJECT_ID}/locations/${REGION}/endpoints/${ENDPOINT_ID}:predict"

# 3. Craft the JSON payload (same as before)
cat > request.json <<EOF
{
  "instances": [
    {
      "Amount": "1250",
      "Indicator": "D",
      "Merchant": "Myntra",
      "Description": "Online shopping for clothes"
    }
  ]
}
EOF

# 4. Send the request.
# The 'Host' header tells the Google Front End which service you are targeting.
# The '--insecure' or '--cacert' flag is needed because the endpoint uses a self-signed certificate.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -H "Host: ${REGION}-aiplatform.googleapis.com" \
  "$PREDICTION_URL" \
  -d @request.json --insecure
```

Citation: This advanced testing pattern is detailed in the documentation for getting online inferences via PSC.
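The moving parts of the PSC request (forwarding-rule IP as the host, producer-side resource path, overridden `Host` header) can be captured in a small helper. This mirrors the shell steps above and is illustrative only:

```python
def build_psc_request(forwarding_rule_ip, region, producer_project_id, endpoint_id):
    """Return (url, headers) for a PSC prediction call, per the pattern above."""
    url = (
        f"https://{forwarding_rule_ip}:443/v1/projects/{producer_project_id}"
        f"/locations/{region}/endpoints/{endpoint_id}:predict"
    )
    headers = {
        "Content-Type": "application/json",
        # The Host header tells the Google Front End which service is targeted.
        "Host": f"{region}-aiplatform.googleapis.com",
    }
    return url, headers

url, headers = build_psc_request("10.10.0.7", "europe-west3", "prod-host", "1234567890")
```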
4.0 Comparison summary
| Feature | Shared Public Endpoint | Dedicated Public Endpoint | Private Endpoint (VPC Peering) | Private Endpoint (PSC) |
|---|---|---|---|---|
| Accessibility | Public Internet | Public Internet | Internal VPC Only | Internal Cross-VPC / On-Prem |
| Connection Method | Internet | Internet | VPC Peering | Private Service Connect |
| Security Level | Good (IAM-based) | Good (IAM-based, dedicated resources) | Excellent (Network Isolation) | Excellent (Network Isolation) |
| Use Case | Dev/Test, Tuned Gemini | Production public apps | Internal services, single VPC | Central ML platform, multi-VPC, hybrid-cloud |
| VPC-SC Support | Yes | No, use PSC instead | Yes | Yes |
| Traffic Splitting | Yes | Yes | No | No (Workaround with proxy like Cloud Run) |
| Creation Complexity | Low | Low | Medium (Requires VPC Peering setup) | High (Requires network admin setup) |
| Testing Complexity | Low (From anywhere) | Low (From anywhere) | Medium (From inside the peered VPC) | High (From consumer VPC to forwarding rule) |
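The decision logic in the table can be encoded as a simple chooser. The rules below are a hypothetical condensation of this document's recommendations, not an official decision tree:

```python
def choose_endpoint_type(
    clients_on_public_internet: bool,
    production: bool,
    consumers_in_multiple_vpcs_or_on_prem: bool,
) -> str:
    if clients_on_public_internet:
        # Dedicated endpoints are the recommended public type for production.
        return "Dedicated Public Endpoint" if production else "Shared Public Endpoint"
    if consumers_in_multiple_vpcs_or_on_prem:
        return "Private Endpoint (PSC)"
    return "Private Endpoint (VPC Peering)"

assert choose_endpoint_type(False, True, False) == "Private Endpoint (VPC Peering)"
assert choose_endpoint_type(False, True, True) == "Private Endpoint (PSC)"
assert choose_endpoint_type(True, True, False) == "Dedicated Public Endpoint"
```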
5.0 Recommendation for your environment
For a production banking environment, security and compliance are the primary drivers.
- Required Standard: Private Endpoint with VPC Peering (Section 3.2). This should be your default choice for all production models whose consumers are within the same VPC network. It provides the necessary network isolation to meet regulatory requirements by ensuring that sensitive transaction data and model interactions never traverse the public internet.
- Future Scalability: Private Endpoint with Private Service Connect (Section 3.3). If, in the future, other departments or business units with their own separate VPCs need to consume this transaction classification model, you should use PSC. You can expose your existing private endpoint via a PSC service attachment without needing to redeploy the model, providing a clear scalability path.
Action: Modify your deploy_model_to_vertex_op component to create the endpoint with the gapic.EndpointServiceClient, setting the network field on the Endpoint object as detailed in Section 3.2, for all initial production deployments. (Recall that the high-level aiplatform.Endpoint.create method does not expose this parameter.)
6.0 Appendix: How to find your network resource name
To create a Private Endpoint (either type), you need the full resource name of your VPC network.
- In the Google Cloud Console, navigate to VPC network.
- Select VPC networks from the left-hand menu.
- Find your Shared VPC network in the list (e.g., `prod-shared-vpc`).
- The full resource name is constructed as follows: `projects/HOST_PROJECT_ID/global/networks/NETWORK_NAME`
  - Example: `projects/prod-shared-host-services/global/networks/prod-shared-vpc`
This is the value that should be placed in your configs/config.prod.json file for the “network” key.
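The construction rule above amounts to a one-line format string, sketched here with a light sanity check (the project and network names are placeholders from the example):

```python
def network_resource_name(host_project_id: str, network_name: str) -> str:
    """Full VPC network resource name, as expected by the Vertex AI 'network' field."""
    if not host_project_id or not network_name:
        raise ValueError("both host project ID and network name are required")
    return f"projects/{host_project_id}/global/networks/{network_name}"

# Matches the example above:
print(network_resource_name("prod-shared-host-services", "prod-shared-vpc"))
# projects/prod-shared-host-services/global/networks/prod-shared-vpc
```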
References
- Set up a Private Service Connect interface for Vertex AI resources
- Medium Blog
- Overview of getting inferences on Vertex AI
- Google Blog
- Vertex AI access online prediction endpoints privately using PSC
- Choose an endpoint type
- Use dedicated private endpoints based on Private Service Connect
- Google Blog
- Create a public endpoint
- Set up VPC Network Peering
- Use dedicated public endpoints for online inference
- Use private services access endpoints for online inference
- About accessing Vertex AI services through Private Service Connect endpoints