429 error from metadata.google.internal when trying to refresh auth token

I just got two errors like this in a Python worker process that uses the google.cloud.storage.transfer_manager.upload_many function to upload intermediate data as it runs. The worker is a long-lived batch process running in a Docker container on a c2 instance via Google Batch, so it uses the latest Container-Optimized OS image.
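For reference, here is roughly how the worker invokes the helper; this is a stripped-down illustration rather than the real code, and the bucket and file names are placeholders:

```python
from google.cloud import storage
from google.cloud.storage import transfer_manager

# The client picks up the instance service account via the metadata server.
client = storage.Client()
bucket = client.bucket("intermediate-data-bucket")  # placeholder name

# Pairs of (local path, destination blob); upload_many fans these out
# across concurrent workers.
paths = ["part-0000.bin", "part-0001.bin"]  # placeholder file names
file_blob_pairs = [(p, bucket.blob(p)) for p in paths]
transfer_manager.upload_many(file_blob_pairs)
```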

google.auth.exceptions.TransportError: ('Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/pr-0-zephyrus-sa@zephr-xyz-firebase-development.iam.gserviceaccount.com/token?scopes=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdevstorage.full_control%2Chttps%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdevstorage.read_only%2Chttps%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdevstorage.read_write from the Google Compute Engine metadata service. Status: 429 Response:\nb'"Too many requests."'', <google.auth.transport.requests._Response object at 0x79ff3da7bb30>)

I had maybe 200-ish similar instances running at the time we hit the failure. I can't for the life of me figure out what the quotas are for the metadata server. Do I need to adjust them (we just significantly increased the number of concurrent workers allocated to this process), or is there something crazy/unexpected happening inside the worker that causes far more instance metadata requests than needed or expected?

I've been googling for a while and can't find any leads to track this down.


Hi, I am facing a similar issue. Can anyone help?


Same here. We started receiving this error about a week back. No problems before that for at least two years, and nothing was changed on the infra side recently.


Hi, we also started observing the same error recently.


Same thing started on our GKE clusters a week ago. Does anyone have any idea why this is happening?


My theory is that we are hitting one of the mutative API rate limits related to token refresh. As far as I can tell these limits are quite low and can't be adjusted; the metadata endpoint hits the limit internally and passes back a 429 without any information about the actual quota being violated.

In my case I additionally suspect that the upload_many function, which spins up n worker threads to do the uploads concurrently, is probably refreshing the token in every worker thread (in thread-local state?), which roughly 10x's the maximum rate of token refreshes compared with what the workload actually requires.
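If that's what's going on, one mitigation I'm considering is to refresh the credentials once up front and hand that single credentials object (via one shared client) to the upload path, so the threads reuse an already-valid token instead of each hitting the metadata server. A minimal sketch of the idea, assuming application default credentials:

```python
import google.auth
import google.auth.transport.requests
from google.cloud import storage

# Obtain application default credentials (on GCE / Google Batch these come
# from the metadata server) and refresh them once before any uploads start.
credentials, project = google.auth.default()
credentials.refresh(google.auth.transport.requests.Request())

# Build one client on those credentials and reuse it for every upload, so
# the worker threads share the already-valid token instead of each asking
# the metadata server for a new one.
client = storage.Client(project=project, credentials=credentials)

# ... then construct the bucket and file_blob_pairs from this client and
# call transfer_manager.upload_many as before.
```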


This happens to me inside a Dataflow worker. I really don't understand it. What could the issue be? I have added a retry around this in the code, but I still don't know why it is happening; the authorization mechanism is very confusing to me.
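For anyone else adding a retry as a stopgap, here is a minimal sketch of what that can look like, assuming the 429 surfaces as google.auth.exceptions.TransportError as in the original post; the wrapped function and file name are just placeholders:

```python
import random
import time

import google.auth.exceptions


def call_with_backoff(fn, *args, max_attempts=5, **kwargs):
    """Retry fn on metadata-server transport errors with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn(*args, **kwargs)
        except google.auth.exceptions.TransportError:
            if attempt == max_attempts - 1:
                raise
            # Back off 1s, 2s, 4s, ... plus jitter so many workers do not
            # all retry the metadata server at the same instant.
            time.sleep((2 ** attempt) + random.random())


# Example usage (placeholder blob and file name):
# call_with_backoff(blob.upload_from_filename, "part-0001.bin")
```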
