Batch task parallelism GPU usage

I am running a Batch job with tasks that use the GPU. When the task parallelism is greater than 1, the GPU memory gets exhausted by the first task. What is the best practice for handling concurrent tasks that use the GPU? How do I prevent container tasks from being scheduled on the same VM? Or should I change the allocationPolicy to request more than one GPU and set the visible GPU device per task in the code? Here is my current allocationPolicy:

"allocationPolicy": {
  "instances": [
    {
      "installGpuDrivers": true,
      "policy": {
        "machineType": "g2-standard-16",
        "accelerators": [
          {
            "type": "nvidia-l4",
            "count": 1
          }
        ]
      }
    }
  ]
}
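
For the last option, I was thinking of something along these lines in the taskGroups section (just a sketch: it assumes the BATCH_TASK_INDEX environment variable that Batch sets for each task maps cleanly onto the VM's GPU indices, and train.py is a placeholder for my workload):

"taskGroups": [
  {
    "taskCount": 4,
    "parallelism": 2,
    "taskSpec": {
      "runnables": [
        {
          "script": {
            "text": "export CUDA_VISIBLE_DEVICES=$((BATCH_TASK_INDEX % 2)); python train.py"
          }
        }
      ]
    }
  }
]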

Hello @mt_ploc,

Welcome to the Google Cloud Community!

Yes, you can modify your allocationPolicy to request 2 GPUs. This ensures the scheduler allocates a VM with enough GPUs for parallel tasks.
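
For example, something along these lines should give each of two parallel tasks its own L4 (a sketch, not a verified config; note that G2 machine types come with a fixed number of L4 GPUs, so requesting two L4s means moving to a machine type such as g2-standard-24):

"allocationPolicy": {
  "instances": [
    {
      "installGpuDrivers": true,
      "policy": {
        "machineType": "g2-standard-24",
        "accelerators": [
          {
            "type": "nvidia-l4",
            "count": 2
          }
        ]
      }
    }
  ]
}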

Another option is to create affinity groups and assign your VMs to them, then configure your Batch job to schedule tasks only onto VMs within a specific affinity group. This keeps the tasks on separate VMs, ensuring they run in isolation.


Alternatively, you can set taskCountPerNode=1 if you only want to run one task per VM while still having parallelism > 1.
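
For example, the relevant fields in taskGroups would look roughly like this (sketch only; the script is a placeholder for your workload):

"taskGroups": [
  {
    "taskCount": 4,
    "parallelism": 2,
    "taskCountPerNode": 1,
    "taskSpec": {
      "runnables": [
        {
          "script": {
            "text": "nvidia-smi && python train.py"
          }
        }
      ]
    }
  }
]

With taskCountPerNode set to 1, each task gets its own VM (and therefore its own GPU), so the original single-GPU allocationPolicy can stay as it is.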
