Joshua Broyde is an AI/ML Healthcare and Life Sciences specialist customer engineer at Google Cloud. He works with HCLS and MedTech companies to architect, build, and bring enterprise AI and Generative AI systems to production.
Introduction
Provisioned Throughput (PT) gives your Gemini API calls dedicated capacity — eliminating the resource contention that comes with drawing from a shared pool. Capacity is purchased in Generative Scale Units (GSUs), and the core question is always the same: How many GSUs do you actually need?
This article walks through the analysis you should do before making that decision. The centerpiece of this analysis is a simulation that estimates what percentage of your traffic would be served by PT at any GSU level, before you spend a dollar.
Have you done all you can to sanitize and harden your Gemini calls?
Before reaching for Provisioned Throughput, make sure you are taking full advantage of what standard Gemini calls have to offer. See this blog for more details.
Understand How Generative Scale Units (GSUs) work
GSU math is covered exhaustively in the documentation, but a few key points are worth internalizing before diving into the examples. All examples below use Gemini Flash 3.1-preview — see here for other models. By default, traffic that exceeds your PT quota spills over to standard PayGo rather than being dropped.
GSU capacity scales linearly (each GSU gives you 2,015 tokens per second), but the enforcement window shrinks as the GSU count grows. The table below summarizes this:
| GSUs | Tokens/Sec | Max Window | Token budget | TPM |
|---|---|---|---|---|
| 3 | 6,045 | 120s | 725,400 | 362,700 |
| 10 | 20,150 | 30s | 604,500 | 1,209,000 |
| 50 | 100,750 | 5s | 503,750 | 6,045,000 |
At higher GSU tiers, the shorter window means bursting is less forgiving.
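The arithmetic behind the table can be reproduced in a few lines. This is a minimal sketch: the 2,015 tokens-per-second figure and the three window lengths are copied from this article, while the names `WINDOW_SECONDS` and `gsu_row` are illustrative. Windows for other GSU counts are not covered here and would need to come from the documentation.

```python
TOKENS_PER_SEC_PER_GSU = 2_015  # per-GSU throughput quoted in this article

# Enforcement windows in seconds, copied from the table above.
# Other GSU counts are intentionally left out (not stated in the article).
WINDOW_SECONDS = {3: 120, 10: 30, 50: 5}


def gsu_row(gsus: int) -> dict:
    """Return one row of the table for a GSU count listed in WINDOW_SECONDS."""
    tokens_per_sec = gsus * TOKENS_PER_SEC_PER_GSU
    window_s = WINDOW_SECONDS[gsus]
    return {
        "gsus": gsus,
        "tokens_per_sec": tokens_per_sec,
        "window_s": window_s,
        "token_budget": tokens_per_sec * window_s,  # tokens usable per window
        "tpm": tokens_per_sec * 60,                 # tokens per minute
    }


for g in WINDOW_SECONDS:
    print(gsu_row(g))
```

The token budget is tokens per second multiplied by the window length, so it can shrink even as throughput grows.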
Example 1
You buy 3 GSUs for Gemini Flash 3.1-preview. Every 10 seconds, you have 1 call to the model where you pass 100,000 tokens. What percentage of calls go through PT?
Answer: Your capacity over the maximum 120-second window is 725,400 tokens. After 70 seconds, you will have used 700,000 tokens (7 calls). The 8th call of 100,000 tokens does not fit in the remaining 25,400 tokens and is routed through PayGo, as is every later call until the window resets. In 120 seconds, 12 calls arrive but only 7 go through PT: roughly 58% of calls.
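Example 1 can be checked with a short script. This sketch assumes the budget resets only at the window boundary, as the answer above describes, and that the first call arrives one interval into the window. The function name `pt_share_fixed_window` is illustrative.

```python
def pt_share_fixed_window(budget_tokens, window_s, call_tokens, call_interval_s):
    """Count calls served by PT within a single window.

    Assumption: the token budget is only replenished when the window resets.
    Returns (calls_served_by_pt, total_calls).
    """
    remaining = budget_tokens
    served = total = 0
    t = call_interval_s  # assumption: first call arrives one interval in
    while t <= window_s:
        total += 1
        if call_tokens <= remaining:
            remaining -= call_tokens
            served += 1
        # otherwise the call spills over to PayGo
        t += call_interval_s
    return served, total


served, total = pt_share_fixed_window(725_400, 120, 100_000, 10)
print(served, total, round(100 * served / total))  # 7 12 58
```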
Example 2
You buy 3 GSUs for Gemini Flash 3.1-preview. Every 30 minutes, you have 1 call to the model with a giant prompt of 1 million tokens. What percentage of calls go through PT?
Answer: None. With 3 GSUs, you have a maximum window capacity of 725,400 tokens. 1 million tokens is greater than this, so every call will skip PT and go straight to PayGo. The fact that the call only happens once every 30 minutes is irrelevant. PT capacity does not accumulate window to window.
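The same check applies to any single prompt: a request either fits in one window's budget or it does not, regardless of how rarely it arrives. The sketch below uses only the three tiers and window lengths from the table in this article; `fits_in_window` is an illustrative helper.

```python
TOKENS_PER_SEC_PER_GSU = 2_015             # from this article
WINDOW_SECONDS = {3: 120, 10: 30, 50: 5}   # from the table in this article


def fits_in_window(prompt_tokens: int, gsus: int) -> bool:
    """True if a single prompt fits within one window's token budget."""
    budget = gsus * TOKENS_PER_SEC_PER_GSU * WINDOW_SECONDS[gsus]
    return prompt_tokens <= budget


for g in WINDOW_SECONDS:
    print(g, fits_in_window(1_000_000, g), fits_in_window(100_000, g))
```

Under these window values, how often a call arrives never enters the calculation; only the prompt size and the window budget do.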
Analyze Your Own Actual Traffic
Start by pulling at least a month of traffic data and graphing your actual TPM over time — raw token counts alone won’t tell you what you need to know. From there, look at your 429 error rate and map it against those TPM graphs; the correlation will usually be obvious. Make sure you’re isolating 429s specifically, since other errors like malformed function calls will skew the picture. A 0.5–2% error rate is typical for most applications, but your acceptable threshold depends on how critical the workload is.
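As a rough sketch of that first pass, the snippet below buckets request logs by minute and reports TPM alongside the 429 rate. The record layout `(iso_timestamp, total_tokens, http_status)` is a hypothetical schema; adapt it to however your logs are actually exported.

```python
from collections import defaultdict
from datetime import datetime


def per_minute_stats(records):
    """Aggregate (iso_timestamp, total_tokens, http_status) records by minute.

    Returns {minute: {"tpm": tokens requested, "rate_429": share of 429s}}.
    Only status 429 counts as throttling; other errors are ignored on purpose.
    """
    tokens = defaultdict(int)
    calls = defaultdict(int)
    throttled = defaultdict(int)
    for ts, total_tokens, status in records:
        minute = datetime.fromisoformat(ts).replace(second=0, microsecond=0)
        tokens[minute] += total_tokens
        calls[minute] += 1
        if status == 429:
            throttled[minute] += 1
    return {
        m: {"tpm": tokens[m], "rate_429": throttled[m] / calls[m]}
        for m in sorted(calls)
    }


# Made-up sample records, for illustration only.
sample = [
    ("2025-01-01T00:00:05", 4_000, 200),
    ("2025-01-01T00:00:40", 6_000, 429),
    ("2025-01-01T00:01:10", 2_000, 200),
    ("2025-01-01T00:01:30", 1_000, 400),  # not a 429, so not counted as throttling
]
for minute, stats in per_minute_stats(sample).items():
    print(minute.isoformat(), stats)
```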
Find your Input Tokens per Request
Plot a histogram of your input token sizes — this sets a floor on how many GSUs you actually need.
A well-behaved input distribution: P50 at 5K tokens with a long but thin tail out to 70K.
The workload above is well-behaved. Traffic is concentrated at small token sizes with a modest tail, meaning a few GSUs can realistically cover the bulk of requests. Consider, on the other hand, the following much heavier workload:
A long-tailed workload
The long tail here is the problem. Unless there are generous GSU purchases, requests at the p95 and beyond will routinely exceed PT window capacity and spill to PayGo — PT can’t help them.
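Alongside the histogram, a few percentiles and the share of requests larger than one window's budget make the tail concrete. This is a minimal sketch using only the standard library; `input_size_summary` and its parameters are illustrative names, and the budget you pass in should be the window budget for the GSU level you are considering.

```python
import statistics


def input_size_summary(input_tokens, window_budget):
    """Summarize input sizes and the share that can never fit in one window."""
    # quantiles(n=100) returns 99 cut points; index 49 is the median (p50).
    cuts = statistics.quantiles(input_tokens, n=100, method="inclusive")
    oversized = sum(1 for t in input_tokens if t > window_budget)
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "share_over_budget": oversized / len(input_tokens),
    }


# Made-up sizes for illustration: mostly small prompts plus a few very large ones.
sizes = [5_000] * 90 + [70_000] * 8 + [900_000] * 2
print(input_size_summary(sizes, window_budget=725_400))
```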
Analyze the Spikiness of your workloads
Spiky workloads are a poor fit for PT — they blast through provisioned capacity in bursts while leaving it idle the rest of the time, so you end up paying for throughput you’re not using.
Take a look at the example below:
Extreme spikiness: TPM swings from 30M down to under 100K.
This workload is too volatile for PT to absorb. Spikes this sharp will blow through provisioned capacity immediately, and the troughs mean you’re paying for GSUs that sit idle most of the day.
Consistent traffic hovering around 10K TPM with occasional spikes to 100K.
Steady, predictable traffic like this is exactly what PT is designed for. Capacity is actually used, and the occasional spike is modest enough that spillover to PayGo is minimal.
A TPM histogram gives you another view of the same data, showing how often each TPM level occurs rather than how it moves over time.
A mostly idle workload. Most minutes are idle (2,642 zero-TPM minutes), but when traffic is active it clusters around 10K–50K TPM.
The large zero-TPM bar tells you the workload is mostly idle. This bimodal pattern is worth flagging: PT sits unused most of the time, which may argue for a smaller PT purchase or a different tier strategy altogether.
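The patterns in these charts can also be reduced to a few numbers. The sketch below computes three illustrative metrics from a per-minute TPM series: the share of idle minutes, the peak-to-median ratio over active minutes, and the coefficient of variation. The metric choices and the name `spikiness` are assumptions for illustration, not thresholds taken from the documentation.

```python
import statistics


def spikiness(tpm_series):
    """Describe how idle and how bursty a per-minute TPM series is."""
    active = [v for v in tpm_series if v > 0]
    if not active:
        return {"idle_share": 1.0, "peak_to_median": None, "cv": None}
    return {
        # share of minutes with no traffic at all
        "idle_share": 1 - len(active) / len(tpm_series),
        # how far the largest minute sits above a typical active minute
        "peak_to_median": max(active) / statistics.median(active),
        # coefficient of variation across all minutes (std dev / mean)
        "cv": statistics.pstdev(tpm_series) / statistics.fmean(tpm_series),
    }


# Made-up series: mostly idle, steady when active, one large spike.
series = [0] * 6 + [10_000] * 3 + [100_000]
print(spikiness(series))
```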
Perform Simulations on your traffic to estimate PT usage
A simulation is the most direct way to answer the GSU question: it walks through your actual historical traffic minute by minute, applies the window capacity math, and produces a curve showing what percentage of requests would be served by PT at each GSU level. This simulation tells you where you stand before you commit to buying PT.
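To make the idea concrete, here is a minimal sketch of such a simulation. It is not the code linked from this article. It replays requests one by one rather than minute by minute, and it assumes a window whose budget resets at fixed boundaries. The function names are illustrative, and the window length for each GSU level must be supplied by you; the demo uses only the three tiers from the table above.

```python
def pt_coverage(requests, tokens_per_sec, window_s):
    """Share of requests served by PT.

    requests: list of (arrival_second, tokens), sorted by arrival time.
    Assumption: the budget resets at fixed window boundaries, and a request
    that does not fit in the remaining budget spills over to PayGo.
    """
    budget = tokens_per_sec * window_s
    current_window, remaining, served = None, budget, 0
    for t, tokens in requests:
        window = int(t // window_s)
        if window != current_window:  # new window: budget resets
            current_window, remaining = window, budget
        if tokens <= remaining:
            remaining -= tokens
            served += 1
    return served / len(requests) if requests else 0.0


def coverage_curve(requests, windows_by_gsu, tokens_per_sec_per_gsu=2_015):
    """Map each GSU level to its simulated PT coverage."""
    return {
        g: pt_coverage(requests, g * tokens_per_sec_per_gsu, w)
        for g, w in sorted(windows_by_gsu.items())
    }


# Traffic from Example 1: one 100,000-token call every 10 seconds for 4 minutes.
requests = [(t, 100_000) for t in range(0, 240, 10)]
print(coverage_curve(requests, {3: 120, 10: 30, 50: 5}))
```

Run against real logs, the resulting dictionary is the coverage curve: plot GSUs on the x-axis and coverage on the y-axis.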
Good case - coverage saturates at ~4 GSUs
Coverage saturates quickly here — buying beyond 4–5 GSUs yields almost nothing. This is the result you want to see; it means PT is well-matched to your workload and the purchase decision is straightforward.
Not every workload looks like this:
Challenging case - coverage stays low regardless
When coverage plateaus early like this, the culprit is usually oversized prompts or extreme burstiness — neither of which PT can fully absorb.
In practice, this simulation is often the deciding factor with customers I have spoken to. That said, the simulation alone doesn’t capture the full picture. Your input token distribution, TPM patterns, and error rate all add context the simulation can’t provide — which is why the other analyses in this article matter too.
Conclusion
Done right, Provisioned Throughput becomes the reliability layer for your most critical workloads, while Batch, Flex, and Priority PayGo handle everything else. The goal is a deployment strategy where every tier is earning its cost. The code linked throughout this article is designed to help you get there.
Thank you to my colleague Phillip Knoll for peer reviewing this piece.






