Hey all,
In case you missed it, the DeepSeek-V3.1 model is now Generally Available in the Vertex AI Model Garden!
You can now use deepseek-v3.1-maas as a Model-as-a-Service (MaaS). This latest release from DeepSeek brings new features like a hybrid thinking mode, faster thinking, and improved agentic capabilities.
Here's what you need to know about DeepSeek-V3.1 as a fully managed API on Vertex AI.
Performance & Scalability
By running DeepSeek-V3.1 on Vertex AI, you’re tapping into a serverless, fully managed endpoint that comes with generous default throughput quotas. Here are the specifics for the model:
DeepSeek-V3.1
- Model ID: deepseek/deepseek-v3.1-maas
- Region: us-west2
- Quota: 600 Queries Per Minute (QPM)
- Tokens: 1,000,000 Input Tokens Per Minute (TPM) / 200,000 Output Tokens Per Minute (TPM)
- Context Window: 163,840 tokens
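A 163,840-token context window is large, but long-document workloads can still overrun it. A rough pre-flight check is a cheap safeguard; the sketch below uses a crude ~4-characters-per-token heuristic (not DeepSeek's actual tokenizer, so treat the result as an estimate only):

```python
CONTEXT_WINDOW = 163_840  # tokens, per the model specs above

def rough_token_estimate(text: str) -> int:
    """Crude estimate: ~4 characters per token for English-like text."""
    return max(1, len(text) // 4)

def fits_in_context(prompt: str, reserved_output_tokens: int = 4_096) -> bool:
    """Check whether the prompt likely leaves room for the response."""
    return rough_token_estimate(prompt) + reserved_output_tokens <= CONTEXT_WINDOW

print(fits_in_context("Hello, world!"))  # → True
```

For anything close to the limit, count tokens with the model's real tokenizer instead of this heuristic.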
Transparent, Pay-as-you-go Costs
The pricing for DeepSeek-V3.1 is straightforward and entirely usage-based, so you only pay for what you use. Here is the current pricing per million tokens (MTOK):
DeepSeek-V3.1:
- Input: $0.60 / million tokens
- Output: $1.70 / million tokens
For the most up-to-date information, always refer to the official Vertex AI Pricing Page and the model card in the Model Garden.
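At these rates, a back-of-the-envelope estimate makes budgeting easy. A minimal sketch (the token counts in the example are made-up numbers, and real bills depend on the prices in effect when you call the API):

```python
# Cost estimate at the DeepSeek-V3.1 MaaS rates listed above.
INPUT_PRICE_PER_MTOK = 0.60   # USD per 1M input tokens
OUTPUT_PRICE_PER_MTOK = 1.70  # USD per 1M output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost for one request."""
    return (input_tokens * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000

# Example: a 120k-token context producing a 2k-token answer
print(f"${estimate_cost(120_000, 2_000):.4f}")  # → $0.0754
```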
On the quota side, the model uses Dynamic Shared Quota (DSQ), just like Gemini models: access is granted based on real-time resource availability and real-time demand across all customers.
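Because capacity under shared quota varies with overall demand, individual requests can occasionally be throttled with an HTTP 429. A minimal retry sketch with exponential backoff (`call_model` here is a hypothetical stand-in for whatever zero-argument callable performs your actual request):

```python
import random
import time

def call_with_backoff(call_model, max_retries=5):
    """Retry a model call on resource-exhausted (HTTP 429) errors.

    `call_model` is any zero-argument callable that performs the request
    and raises an exception carrying a `status_code` of 429 when throttled.
    """
    for attempt in range(max_retries):
        try:
            return call_model()
        except Exception as exc:
            if getattr(exc, "status_code", None) != 429 or attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter: ~1s, ~2s, ~4s, ...
            time.sleep(2 ** attempt + random.random())
```

Jitter spreads retries out so that many throttled clients don't all hammer the endpoint again at the same instant.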
How to Use It: Chat Completion API
The model is served through an API that’s compatible with OpenAI’s SDK, making it easy to integrate into your existing applications. Here’s all the Python code you need to start a chat session.
import google.auth
import google.auth.transport.requests
import openai
# The id of your GCP Project
PROJECT_ID = "your-project-id"
# The region where your model is enabled
LOCATION = "us-west2"
# The model publisher
PUBLISHER = "deepseek"
# The specific model ID you want to use
MODEL_ID = "deepseek-v3.1-maas"
# Authenticate to Google Cloud and get an access token
creds, _ = google.auth.default()
auth_req = google.auth.transport.requests.Request()
creds.refresh(auth_req)
# Construct the Vertex AI endpoint URL
endpoint_url = f"https://{LOCATION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{LOCATION}/publishers/{PUBLISHER}/models/{MODEL_ID}"
# Create an OpenAI client pointed at your Vertex AI endpoint
# Note: The 'api_key' is your Google Cloud access token in this case
client = openai.OpenAI(
    base_url=endpoint_url,
    api_key=creds.token,
)

# Send a prediction request
prediction = client.chat.completions.create(
    # For Vertex AI endpoints, the model is in the URL, so this can be an empty string
    model="",
    messages=[{"role": "user", "content": "Explain the concept of 'hybrid thinking mode' in an LLM and how it improves tool use."}],
    temperature=0.7,
)
print("🤖 Prediction Response: 🤖")
print(prediction.choices[0].message.content)
You can control the model’s output by tuning the generation configuration. Key supported parameters include temperature, top_p, max_tokens, and stop.
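A small helper can keep those generation settings in one place and make them easy to reuse across calls. This is just a convenience sketch (the helper name and default values are my own, not part of the API):

```python
def build_generation_config(temperature=0.7, top_p=0.95, max_tokens=1024, stop=None):
    """Assemble the generation parameters supported by the endpoint.

    The defaults here are illustrative, not recommendations.
    """
    config = {
        "temperature": temperature,  # sampling randomness; lower = more deterministic
        "top_p": top_p,              # nucleus sampling cutoff
        "max_tokens": max_tokens,    # cap on generated output tokens
    }
    if stop is not None:
        config["stop"] = stop        # e.g. ["\n\nUser:"] to end a turn early
    return config

# Pass it straight into the call from the snippet above:
# client.chat.completions.create(model="", messages=..., **build_generation_config())
```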
What’s Next
You can explore DeepSeek-V3.1 right now in the Vertex AI Model Garden.
We’d love to see what you build. Drop your questions, projects, and feedback below! Also, let’s connect on LinkedIn and X/Twitter.
Happy building!

