Using Vertex AI’s OpenAI-compatible endpoint with a simple agent runtime

I recently spent some time exploring Vertex AI’s OpenAI-compatible endpoint, and wanted to share a small example in case it’s useful to others here.

What I found particularly nice is that you can point standard OpenAI SDK-based code at Gemini with just a couple of environment variables. That means, for some use cases, you may not need a separate client setup or a larger migration effort.

export OPENAI_BASE_URL="https://us-central1-aiplatform.googleapis.com/v1beta1/projects/YOUR_PROJECT/locations/us-central1/endpoints/openapi"
export OPENAI_API_KEY=$(gcloud auth print-access-token)

That was enough for me to get started.
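For reference, the base URL is just a template over project and region, and the request body is the standard OpenAI chat-completions shape. A small sketch of how both are assembled — the helper names here are mine, purely for illustration (in practice the OpenAI SDK reads OPENAI_BASE_URL and OPENAI_API_KEY from the environment and builds this for you):

```python
import json

def vertex_openai_base_url(project: str, region: str = "us-central1") -> str:
    # The OpenAI-compatible endpoint for a given Vertex AI project/region.
    return (
        f"https://{region}-aiplatform.googleapis.com/v1beta1/"
        f"projects/{project}/locations/{region}/endpoints/openapi"
    )

def chat_request_payload(question: str,
                         model: str = "google/gemini-2.0-flash-001") -> str:
    # Same JSON body the OpenAI SDK POSTs to {base_url}/chat/completions.
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": question}],
    })

url = vertex_openai_base_url("my-project") + "/chat/completions"
```

The only Vertex-specific parts are the URL template and the `google/...` model IDs; everything else is the plain OpenAI wire format.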

What I tried

I’ve been building an open-source agent runtime called JamJet, and I wanted to see whether it would work with Vertex AI without adding any Vertex-specific code.

In my testing, it did:

import asyncio

from jamjet import task, tool

@tool
async def web_search(query: str) -> str:
    """Search the web for current information."""
    ...

@task(model="google/gemini-2.0-flash-001", tools=[web_search])
async def research(question: str) -> str:
    """You are a research assistant. Search first, then summarize clearly."""

result = asyncio.run(research("What are the key trends in AI agents in 2025?"))
print(result)

What I liked here is that the task code itself stays unchanged — after setting the environment variables, Gemini can be used in a very similar way to other OpenAI-compatible setups.

A quick benchmark I recorded

These numbers came from a real GCP project while generating a research-style report:

  • Model – Gemini 2.0 Flash (google/gemini-2.0-flash-001)

  • Strategy – plan-and-execute (plan → steps → synthesize)

  • Wall-clock – 41.8s for a full research report

  • Total tokens – 10,961

  • Estimated cost – ~$0.002

I thought the result was encouraging: a structured and coherent report at very low cost.

Of course, this is only one small benchmark, not a broad performance claim, but it gave me confidence that this path is practical.
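For what it's worth, the cost estimate is easy to sanity-check. Assuming Gemini 2.0 Flash list prices of roughly $0.10 per million input tokens and $0.40 per million output tokens (please verify against current pricing), and an illustrative input/output split of the 10,961 total tokens:

```python
# Back-of-the-envelope cost check. The prices and the token split below
# are assumptions for illustration, not values from the benchmark itself.
PRICE_IN = 0.10 / 1_000_000   # USD per input token (assumed list price)
PRICE_OUT = 0.40 / 1_000_000  # USD per output token (assumed list price)

input_tokens = 9_000          # hypothetical split of the 10,961 total
output_tokens = 1_961

cost = input_tokens * PRICE_IN + output_tokens * PRICE_OUT
print(f"~${cost:.4f}")        # lands in the ~$0.002 ballpark
```

Any reasonable split gives a figure in the same ballpark, which is why I'm comfortable quoting ~$0.002.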

One practical note for production

One thing worth keeping in mind: gcloud auth print-access-token expires in about an hour. For anything beyond quick experimentation, refreshing credentials programmatically seems like the better approach.

import os
import google.auth
import google.auth.transport.requests

# Application Default Credentials (ADC): picks up whatever service account
# or user credentials are configured for this environment.
credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
credentials.refresh(google.auth.transport.requests.Request())
os.environ["OPENAI_API_KEY"] = credentials.token  # short-lived; refresh before expiry
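To avoid sprinkling refresh calls around, I've been leaning toward a small cache that re-fetches the token shortly before it expires. A generic sketch — the class and its parameters are my own, not part of google-auth:

```python
import time

class TokenCache:
    """Caches a short-lived token and re-fetches it before expiry."""

    def __init__(self, fetch, lifetime: float = 3600.0, margin: float = 300.0):
        self._fetch = fetch        # zero-arg callable returning a fresh token string
        self._lifetime = lifetime  # seconds a token stays valid (~1h for gcloud)
        self._margin = margin      # refresh this many seconds before expiry
        self._token = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = time.monotonic()
        if self._token is None or now >= self._expires_at - self._margin:
            self._token = self._fetch()
            self._expires_at = now + self._lifetime
        return self._token
```

Here the fetch callable would refresh the ADC credentials and return credentials.token, and each request reads cache.get() instead of a baked-in environment variable.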

Models I tested / looked at through this endpoint

  • google/gemini-2.0-flash-001 — fast and cost-efficient

  • google/gemini-1.5-pro-002 — useful for longer context workloads

  • google/gemini-1.5-flash-002 — fast with large context support

Full example

I put together a complete example with:

  • a plain OpenAI SDK version

  • the JamJet-based example

  • benchmark script

  • recorded output

Here it is: https://github.com/jamjet-labs/examples/tree/main/vertex-ai

Hope this helps anyone experimenting with Vertex AI interoperability.

Also happy to learn from others here — especially around regional model availability, auth patterns, and any production considerations I may have missed.