I’m planning to use Google Cloud Platform’s Vertex AI for a few projects. While reading the rate limits section of the documentation, I came across this page:
https://cloud.google.com/vertex-ai/generative-ai/docs/quotas
However, I haven’t found any information about the algorithm used to enforce these limits. I have two scenarios in mind:
- First scenario: The limits apply to fixed time windows. For example, 4 million tokens are available between 08:00:00 AM and 08:00:59 AM, and the count resets at 08:01:00 AM.
- Second scenario: The limits apply to a moving (sliding) window, so the window shifts as requests are made.
Or maybe it works differently from both of these scenarios.
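To make the difference between the two scenarios concrete, here is a minimal Python sketch of how I picture them. This is only my own illustration and not Google's actual implementation: the class names, the 60-second window, and the 4 million token limit are assumptions taken from my example above, and timestamps are passed in as plain seconds so the behavior is easy to follow.

```python
from collections import deque


class FixedWindowLimiter:
    """Scenario 1 (assumed): the budget resets at each clock-aligned window."""

    def __init__(self, limit, window_seconds=60):
        self.limit = limit
        self.window_seconds = window_seconds
        self.window_id = None  # index of the window currently being counted
        self.used = 0

    def allow(self, now, tokens):
        window_id = int(now // self.window_seconds)
        if window_id != self.window_id:
            # A new window has started, so the counter starts from zero again.
            self.window_id = window_id
            self.used = 0
        if self.used + tokens > self.limit:
            return False
        self.used += tokens
        return True


class SlidingWindowLimiter:
    """Scenario 2 (assumed): the budget covers the last `window_seconds`."""

    def __init__(self, limit, window_seconds=60):
        self.limit = limit
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, tokens) of accepted requests
        self.used = 0

    def allow(self, now, tokens):
        # Forget requests that have fallen out of the trailing window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_tokens = self.events.popleft()
            self.used -= old_tokens
        if self.used + tokens > self.limit:
            return False
        self.events.append((now, tokens))
        self.used += tokens
        return True


LIMIT = 4_000_000  # tokens per minute, the figure from my example
fixed = FixedWindowLimiter(LIMIT)
sliding = SlidingWindowLimiter(LIMIT)

# Spend the whole budget at second 59, then try to spend it again at second 61.
results = {}
for name, limiter in (("fixed", fixed), ("sliding", sliding)):
    first = limiter.allow(59, LIMIT)
    second = limiter.allow(61, LIMIT)
    results[name] = (first, second)
    print(name, first, second)
```

In the first scenario, the second request succeeds because the counter was reset at second 60, so 8 million tokens go through within two seconds. In the second scenario, it is rejected because the previous 4 million tokens are still inside the trailing 60 seconds. This difference is why I need to know which behavior applies.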
I would appreciate it if someone could explain how Google calculates this, or point me to the section of the documentation that covers it, since I haven’t been able to find it.