I recently spent some time exploring Vertex AI’s OpenAI-compatible endpoint, and wanted to share a small example in case it’s useful to others here.
What I found particularly nice is that you can point standard OpenAI SDK-based code at Gemini with just a couple of environment variables. That means, for some use cases, you may not need a separate client setup or a larger migration effort.
export OPENAI_BASE_URL="https://us-central1-aiplatform.googleapis.com/v1beta1/projects/YOUR_PROJECT/locations/us-central1/endpoints/openapi"
export OPENAI_API_KEY=$(gcloud auth print-access-token)
That was enough for me to get started.
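For anyone curious what those two variables actually wire up, here is a minimal sketch of the request the SDK ends up constructing (placeholder project and token values; nothing is sent over the network here):

```python
import json
import os

# Hypothetical values standing in for a real project; the OpenAI SDK reads
# these two variables automatically, so no client-side code changes are needed.
os.environ.setdefault(
    "OPENAI_BASE_URL",
    "https://us-central1-aiplatform.googleapis.com/v1beta1/"
    "projects/YOUR_PROJECT/locations/us-central1/endpoints/openapi",
)
os.environ.setdefault("OPENAI_API_KEY", "ya29.placeholder-token")

# What an OpenAI-compatible chat completion ends up as on the wire:
# POST {base_url}/chat/completions with a Bearer token and the standard
# messages payload.
url = os.environ["OPENAI_BASE_URL"].rstrip("/") + "/chat/completions"
headers = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "Content-Type": "application/json",
}
payload = json.dumps({
    "model": "google/gemini-2.0-flash-001",
    "messages": [{"role": "user", "content": "Hello from the OpenAI SDK"}],
})
print(url)
```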
What I tried
I’ve been building an open-source agent runtime called JamJet, and I wanted to see whether it would work with Vertex AI without adding any Vertex-specific code.
In my testing, it did:
from jamjet import task, tool

@tool
async def web_search(query: str) -> str:
    """Search the web for current information."""
    ...

@task(model="google/gemini-2.0-flash-001", tools=[web_search])
async def research(question: str) -> str:
    """You are a research assistant. Search first, then summarize clearly."""

result = await research("What are the key trends in AI agents in 2025?")
print(result)
What I liked here is that the task code itself stays unchanged — after setting the environment variables, Gemini can be used in a very similar way to other OpenAI-compatible setups.
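For context on what the endpoint receives, this is roughly the JSON tool definition an OpenAI-compatible runtime would derive from the web_search tool above. JamJet's exact serialization is my assumption; the shape below is just the standard OpenAI "tools" format:

```python
import json

# Standard OpenAI function-calling tool schema; the name, description, and
# parameter come from the decorated Python function's signature and docstring.
tool_def = {
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web for current information.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}
print(json.dumps(tool_def, indent=2))
```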
A quick benchmark I recorded
This was from a real GCP project while generating a research-style output:
- Model – Gemini 2.0 Flash (google/gemini-2.0-flash-001)
- Strategy – plan-and-execute (plan → steps → synthesize)
- Wall-clock – 41.8s for a full research report
- Total tokens – 10,961
- Estimated cost – ~$0.002
I thought the result was encouraging: a structured and coherent report at very low cost.
Of course, this is only one small benchmark, not a broad performance claim, but it gave me confidence that this path is practical.
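As a sanity check on that figure: with Gemini 2.0 Flash's list prices at the time (about $0.10 per million input tokens and $0.40 per million output tokens, worth re-checking) and a hypothetical input/output split of the 10,961 total tokens, the math lands in the same range:

```python
# Hypothetical split of the 10,961 total tokens; the report itself is the
# output-heavy part. Prices are assumed Gemini 2.0 Flash list prices in
# USD per million tokens.
input_tokens, output_tokens = 9_000, 1_961
price_in, price_out = 0.10, 0.40
cost = input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out
print(f"~${cost:.4f}")  # ~$0.0017
```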
One practical note for production
One thing worth keeping in mind: gcloud auth print-access-token expires in about an hour. For anything beyond quick experimentation, refreshing credentials programmatically seems like the better approach.
import os
import google.auth
import google.auth.transport.requests

# Application Default Credentials; refresh issues a new short-lived token.
credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
credentials.refresh(google.auth.transport.requests.Request())
os.environ["OPENAI_API_KEY"] = credentials.token
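In a long-running service, I'd also avoid refreshing on every request. One pattern is a small cache that re-fetches shortly before expiry. This sketch uses a stand-in fetcher (the real one would be the credentials.refresh call above), and the 60-second safety margin is an arbitrary choice:

```python
import time

class TokenCache:
    """Caches a token and re-fetches it shortly before it expires."""

    def __init__(self, fetch_token, ttl_seconds=3600, margin_seconds=60):
        self._fetch = fetch_token        # callable returning a fresh token
        self._ttl = ttl_seconds          # how long each token is valid
        self._margin = margin_seconds    # refresh this early, to be safe
        self._token = None
        self._expires_at = 0.0           # forces a fetch on first use

    def get(self) -> str:
        # Re-fetch once we are within the margin of the expiry time.
        if time.monotonic() >= self._expires_at - self._margin:
            self._token = self._fetch()
            self._expires_at = time.monotonic() + self._ttl
        return self._token
```

Each call to get() returns the cached token until it nears expiry, at which point the fetcher runs again, so the refresh cost is paid roughly once an hour rather than per request.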
Models I tested / looked at through this endpoint
- google/gemini-2.0-flash-001 — fast and cost-efficient
- google/gemini-1.5-pro-002 — useful for longer context workloads
- google/gemini-1.5-flash-002 — fast with large context support
Full example
I put together a complete example with:
- a simple plain OpenAI SDK version
- the JamJet-based example
- a benchmark script
- the recorded output
Here it is: https://github.com/jamjet-labs/examples/tree/main/vertex-ai
I'd be glad if this helps anyone experimenting with Vertex AI interoperability.
Also happy to learn from others here — especially around regional model availability, auth patterns, and any production considerations I may have missed.