Now GA: OpenAI's gpt-oss & Qwen3 Models on Vertex AI as Open Model APIs!

Hey all,

I’m excited to share that the new open models from OpenAI and Qwen are now generally available in the Vertex AI Model Garden!

You can now use gpt-oss-120b, gpt-oss-20b, Qwen3-Coder, and Qwen3-235B as a Model-as-a-Service (MaaS). We know that adopting a new model isn’t just about its capabilities; it’s also about performance, cost, and ease of integration.

So, let’s break down what you get when you use these models as fully managed APIs on Vertex AI.

Performance & Scalability

When you use these models on Vertex AI, you get a fully managed, serverless endpoint. This means you can forget about provisioning or managing GPUs. The backend infrastructure scales automatically with your usage, ensuring consistent performance whether you’re prototyping or running a high-traffic production application.

The default endpoints are provisioned with generous throughput limits. Here are the specifics for each model:

OpenAI Models

  • gpt-oss-120b
    • Model ID: openai/gpt-oss-120b-maas
    • Region: us-central1
    • Quota: 650 Queries Per Minute (QPM)
    • Tokens: 790,000 Input TPM / 120,000 Output TPM
    • Context Window: 131,072 tokens
  • gpt-oss-20b
    • Model ID: openai/gpt-oss-20b-maas
    • Region: us-central1
    • Quota: 1,200 QPM
    • Tokens: 1,300,000 Input TPM / 250,000 Output TPM
    • Context Window: 131,072 tokens

Qwen3 Models

  • Qwen3-Coder-480B-A35B-Instruct
    • Model ID: qwen/qwen3-coder-480b-a35b-instruct-maas
    • Region: us-south1
    • Quota: 40 QPM
    • Tokens: 170,000 Input TPM / 8,500 Output TPM
    • Context Window: 262,144 tokens
  • Qwen3-235B-A22B-Instruct-2507
    • Model ID: qwen/qwen3-235b-a22b-instruct-2507-maas
    • Region: us-south1
    • Quota: 20 QPM
    • Tokens: 170,000 Input TPM / 15,000 Output TPM
    • Context Window: 262,144 tokens
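
If you plan to switch between these models in code, a small lookup like the sketch below keeps each model’s publisher, ID, region, and context window together. The values come straight from the lists above; the dictionary itself is just an illustration, with the publisher split out to match the PUBLISHER/MODEL_ID variables used in the code sample later in this post:

# Illustrative lookup built from the specs above; publisher and model ID
# are split to match the variables used in the code sample below.
MAAS_MODELS = {
    "gpt-oss-120b": {
        "publisher": "openai",
        "model_id": "gpt-oss-120b-maas",
        "region": "us-central1",
        "context_window": 131_072,
    },
    "gpt-oss-20b": {
        "publisher": "openai",
        "model_id": "gpt-oss-20b-maas",
        "region": "us-central1",
        "context_window": 131_072,
    },
    "qwen3-coder": {
        "publisher": "qwen",
        "model_id": "qwen3-coder-480b-a35b-instruct-maas",
        "region": "us-south1",
        "context_window": 262_144,
    },
    "qwen3-235b": {
        "publisher": "qwen",
        "model_id": "qwen3-235b-a22b-instruct-2507-maas",
        "region": "us-south1",
        "context_window": 262_144,
    },
}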

For dedicated, high-throughput needs beyond these defaults, please submit a quota increase request. Check out the official documentation to learn more (OpenAI, Qwen).

Transparent, Pay-as-You-Go Costs

The pricing for these models is straightforward and entirely usage-based, so you only pay for what you use. This pricing model lets you experiment freely and scale your costs predictably as your application grows.

There are no monthly fees or upfront commitments for on-demand usage. Here is the current pricing per million tokens (MTOK):

OpenAI Models:

  • gpt-oss-120b:
    • Input: $0.15 per million tokens
    • Output: $0.60 per million tokens
  • gpt-oss-20b:
    • Input: $0.075 per million tokens
    • Output: $0.30 per million tokens

Qwen3 Models:

  • Qwen3-Coder-480B-A35B-Instruct:
    • Input: $1.00 per million tokens
    • Output: $4.00 per million tokens
  • Qwen3-235B-A22B-Instruct-2507:
    • Input: $0.25 per million tokens
    • Output: $1.00 per million tokens
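
As a quick back-of-the-envelope example using the list prices above, here is how a hypothetical month of gpt-oss-120b usage would add up (the token volumes are illustrative, not measurements):

# Prices from the gpt-oss-120b list above, in USD per million tokens
INPUT_PRICE_PER_MTOK = 0.15
OUTPUT_PRICE_PER_MTOK = 0.60

# Hypothetical monthly traffic
input_tokens = 50_000_000   # 50M input tokens
output_tokens = 10_000_000  # 10M output tokens

cost = (input_tokens / 1_000_000) * INPUT_PRICE_PER_MTOK \
     + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_MTOK
print(f"Estimated monthly cost: ${cost:.2f}")  # 50 * $0.15 + 10 * $0.60 = $13.50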

For the most up-to-date information, always refer to the official Vertex AI Pricing Page for OpenAI and Qwen3 models and the model cards in the Model Garden.

How to Use It: Chat Completion API

Getting started is simple. The models are served through an API that’s compatible with OpenAI’s SDK, making it easy to build interactive applications.

Here’s all the Python code you need to start a chat session. Just plug in your project information and the MODEL_ID of the model you want to use.

import google.auth
import google.auth.transport.requests
import openai

# The ID of your Google Cloud project
PROJECT_ID = "your-project-id"
# The region where your model is enabled
LOCATION = "us-central1" 
# The model publisher 
PUBLISHER = "openai"
# The specific model ID you want to use
MODEL_ID = "gpt-oss-120b-maas" 

# Authenticate to Google Cloud and get an access token
creds, _ = google.auth.default()
auth_req = google.auth.transport.requests.Request()
creds.refresh(auth_req)

# Construct the Vertex AI endpoint URL
endpoint_url = f"https://{LOCATION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{LOCATION}/publishers/{PUBLISHER}/models/{MODEL_ID}"

# Create an OpenAI client pointed at your Vertex AI endpoint
# Note: The 'api_key' is your Google Cloud access token in this case
client = openai.OpenAI(base_url=endpoint_url, api_key=creds.token)

# Send a prediction request
prediction = client.chat.completions.create(
    # For Vertex AI endpoints, the model is in the URL, so this can be an empty string
    model="", 
    messages=[{"role": "user", "content": "Explain what a serverless API is in a simple analogy."}],
    temperature=0.7
)

print("🤖 Prediction Response: 🤖")
print(prediction.choices[0].message.content)
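
For interactive applications you’ll usually want to stream tokens as they’re generated. Here’s a minimal streaming sketch that reuses the client from the snippet above, assuming the MaaS endpoints support the OpenAI SDK’s standard stream=True flag:

# Request a streamed response (reuses `client` from the previous example)
stream = client.chat.completions.create(
    model="",
    messages=[{"role": "user", "content": "Write a haiku about serverless APIs."}],
    temperature=0.7,
    stream=True,
)

# Each chunk carries a delta with the next piece of the response
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()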

You can control the model’s output by tuning the generation configuration. Key supported parameters include (see the combined example after this list):

  • temperature: (Range: 0.0 to 1.0) Controls randomness. Higher values (e.g., 0.8) produce more creative responses; lower values (e.g., 0.2) make the output more deterministic.
  • top_p: (Range: 0.0 to 1.0) An alternative to temperature for controlling randomness by selecting from the most probable tokens.
  • max_tokens: Sets the maximum number of tokens to generate. This varies by model:
    • gpt-oss-120b/gpt-oss-20b/Qwen3-Coder: 32,768
    • Qwen3-235B: 16,384
  • stop: (Up to 5 strings) A list of strings that, when encountered, will stop the generation process.
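
Here’s the combined usage sketch mentioned above, with illustrative values for each parameter (again reusing the client from the earlier snippet):

response = client.chat.completions.create(
    model="",
    messages=[{"role": "user", "content": "Summarize the benefits of MaaS in two sentences."}],
    temperature=0.2,   # low randomness for a focused, repeatable answer
    top_p=0.95,        # sample from the top 95% of the probability mass
    max_tokens=256,    # cap the response length (well under the per-model limits above)
    stop=["###"],      # illustrative stop sequence; up to 5 strings are allowed
)
print(response.choices[0].message.content)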

You can explore all these models right now in the Vertex AI Model Garden.

We’d love to see what you build. Drop your questions, projects, and feedback below! And let’s connect on LinkedIn and X/Twitter.

Happy building!