Cloud Batch suddenly refusing to use Spot VMs?

Vinayak_Nagpal · July 31, 2025, 9:41pm

Was there a change in Batch that removed support for Spot VMs?

Suddenly today, all our batch jobs are failing with the error:

Job gets non-retryable information Batch Error: code - CODE_GCE_UNSUPPORTED_OPERATION, description - Creating queued resource with SPOT VMs is not supported..

Nothing changed on our end and these same jobs were running fine till yesterday!
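For anyone triaging many failed jobs, the error string above has a regular shape that can be split into a code and a description. Below is a minimal sketch, assuming the message always follows the `Batch Error: code - ..., description - ...` layout quoted here. The helper name `parse_batch_error` is hypothetical and not part of any Google library.

```python
import re
from typing import NamedTuple, Optional


class BatchError(NamedTuple):
    code: str
    description: str


# Pattern inferred from the single message quoted in this thread; other
# Batch status events may be worded differently.
_PATTERN = re.compile(
    r"Batch Error: code - (?P<code>\w+), description - (?P<desc>.+)"
)


def parse_batch_error(message: str) -> Optional[BatchError]:
    """Return the error code and description, or None if the text does not match."""
    match = _PATTERN.search(message)
    if match is None:
        return None
    # The quoted message ends with a doubled period; strip trailing dots and spaces.
    return BatchError(match.group("code"), match.group("desc").rstrip(". "))


message = (
    "Job gets non-retryable information Batch Error: "
    "code - CODE_GCE_UNSUPPORTED_OPERATION, description - "
    "Creating queued resource with SPOT VMs is not supported.."
)
error = parse_batch_error(message)
print(error.code)
print(error.description)
```

Grouping failures by `error.code` makes it easier to tell this provisioning error apart from quota or capacity failures.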

LeoK · August 1, 2025, 12:25pm

Hello @Vinayak_Nagpal,

I’ve tried running some Cloud Batch jobs with Spot VMs and they worked properly.

Are you creating the Batch job from the Web Console, the gcloud CLI, or an SDK? Would you mind sharing how you’re creating the jobs so we can take a look at your configuration?

Also, if it’s from the Web Console, can you prepare your Batch job as usual and share the equivalent code?

Vinayak_Nagpal · August 1, 2025, 1:31pm

We are creating the job using the Batch Python library. When we set the instance policy to SPOT, it runs a Spot VM, but when we specify a Spot instance template, the job goes from Scheduled to Queued and then to Failed with the error I provided above.

This change in behavior started yesterday; we had several jobs running smoothly with this configuration for weeks.

Vinayak_Nagpal · August 1, 2025, 2:15pm

LeoK · August 1, 2025, 2:39pm

Interesting

From Task Details, can you share the following with us:

Runnable (is it Docker? A script? How many runnables?)
Task Count
Parallelism
Machine type
Allowed locations

Also, do you see anything unusual in the logs tab that you could share?

Last but not least, I think it could be worth sharing the Python code that requests the Cloud Batch job. (Anonymise your data first.)

I would like to try to reproduce the issue.

Vinayak_Nagpal · August 1, 2025, 3:10pm

Most jobs have two runnables: one script and one Docker container. Task count varies, since we have many types of jobs, from a handful to a few thousand. Parallelism also varies between regions (based on our quotas etc.), usually around 200.

The machine type is g2 with L4 GPU. us-central1 and asia-south1 are primary allowed locations.
The logs tab is empty because the jobs never start; they fail immediately at the queued stage.
Below is the relevant part of the code.

We already tried giving it an allocation policy in Python with Spot mode. If we do that, we get a Spot VM, but it overrides everything from our instance template and uses a generic instance type that doesn’t work for our job. If we give it just the instance template, then it respects the instance template. If we specify Standard allocation in the instance template, things work. If the instance template specifies Spot allocation, we get this error. We also verified that absolutely nothing has changed on our end: this worked smoothly for a large set of jobs, and then two days ago we suddenly started getting this unsupported-operation error. We get the same issue in both regions where we are deployed.

    def create_allocation_policy(self, region: str) -> batch.AllocationPolicy:
        """Create allocation policy for batch job."""
        allocation_policy = batch.AllocationPolicy(
            location=batch.AllocationPolicy.LocationPolicy(
                allowed_locations=(
                    ["zones/asia-south1-c", "regions/asia-south1"]
                    if region == INDIA
                    else ["zones/us-central1-c", "regions/us-central1"]
                )
            )
        )

        # Set service account
        service_account = batch.ServiceAccount()
        service_account.email = SERVICE_ACCOUNT_EMAIL

        # Set instance template
        instances = batch.AllocationPolicy.InstancePolicyOrTemplate()
        instances.instance_template = INSTANCE_TEMPLATE

        allocation_policy.instances = [instances]
        allocation_policy.service_account = service_account

        return allocation_policy
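In the Batch API, `policy` and `instanceTemplate` are alternative fields of a single `instances` entry, which would be consistent with the template's settings being ignored once an inline policy is set. One thing to try is restating the template's essentials inline. The sketch below builds a REST-style `allocationPolicy` body under several assumptions: `g2-standard-4` is a guessed size within the g2 family mentioned above, the service account email is a placeholder, the helper name `build_allocation_policy` is hypothetical, and it is untested whether this path avoids the error discussed in this thread.

```python
import json


def build_allocation_policy(region: str, service_account_email: str) -> dict:
    """Build a REST-style allocationPolicy that requests Spot G2 VMs inline."""
    return {
        "location": {"allowedLocations": [f"regions/{region}"]},
        "instances": [
            {
                # Inline policy used in place of "instanceTemplate".
                "policy": {
                    "machineType": "g2-standard-4",  # assumed size, adjust as needed
                    "provisioningModel": "SPOT",
                },
                # Assumption: the boot image does not already ship GPU drivers.
                # Drop this key if your image includes them.
                "installGpuDrivers": True,
            }
        ],
        "serviceAccount": {"email": service_account_email},
    }


policy = build_allocation_policy(
    "us-central1",
    "batch-runner@example-project.iam.gserviceaccount.com",  # placeholder
)
print(json.dumps(policy, indent=2))
```

The resulting dictionary can be placed under `allocationPolicy` in a job JSON file passed to `gcloud batch jobs submit --config`. The Python client exposes the same fields as `machine_type` and `provisioning_model` on `batch.AllocationPolicy.InstancePolicy`. Settings that lived only in the template, such as a custom boot image, would need to be carried over separately.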

LeoK · August 1, 2025, 6:06pm

@Vinayak_Nagpal,

I experimented a little on my side with the following configurations:

SPOT e2-medium NO GPU
SPOT n1-standard-1 + GPU NVIDIA T4
SPOT n1-standard-1 + GPU NVIDIA P100
STANDARD g2-standard-4 L4 GPU
SPOT g2-standard-4 L4 GPU

All templates used the Rocky Linux 8 image with the latest NVIDIA driver (570), x86_64, optimized for GCP, built on 20250710.

I have reproduced the same error:

Job gets non-retryable information Batch Error: code - CODE_GCE_UNSUPPORTED_OPERATION, description - Creating queued resource with SPOT VMs is not supported..

Surprisingly, I was able to create a SPOT g2-standard-4 L4 GPU VM from the same template.

Using STANDARD g2-standard-4 L4 GPU was delayed at first:

Waiting for resources. Currently there are not enough resources available to fulfill the request

But it eventually succeeded after a while.

Using T4 or even P100 works fine, even on SPOT.

I also noticed two interesting things.

From Log Explorer:

From the failed job itself:

The VM provisioning model is stuck at Pending, even after the job failed.

Also, when you create a Cloud Batch job and select a GPU-oriented VM type, you can’t select an L4 type.

Per my understanding:

You can use Spot VMs with Cloud Batch, even with GPUs.
There is an issue with L4 GPUs in Cloud Batch that may be linked to insufficient resources,
OR you never used L4 before and something or someone updated your template, leading to this issue,
OR GCP, without communication, decided to remove L4 from Cloud Batch (point 2?). I found nothing in the documentation saying that L4 cannot be used in Cloud Batch.

As a workaround, I think you could update your template to use another type of GPU (T4, P100).

I’ve opened a ticket with all this information.

Vinayak_Nagpal · August 1, 2025, 7:34pm

Yeah, our workloads currently are tied to L4 GPUs. This issue has become a blocker for us. Thanks for opening a ticket. Is there a way to escalate this?

LeoK · August 1, 2025, 9:27pm

I think that the only way to escalate is to open a case on GCP Support using Customer Care.

Vinayak_Nagpal · August 19, 2025, 9:31pm

The ticket you opened is no longer accessible. I get access denied. Do you understand why this is the case? The issue still seems unfixed.
