Batch task parallelism GPU usage

I am running a Batch job with tasks that use the GPU. When the task parallelism is greater than 1, the GPU memory gets exhausted by the first task. What is the best practice for handling concurrent tasks that use the GPU? How do I prevent container tasks from being scheduled on the same VM? Or should I change the allocationPolicy to request more than one GPU and set the visible GPU device per task in the code? Here is my current allocationPolicy:

"allocationPolicy": {
  "instances": [
    {
      "installGpuDrivers": true,
      "policy": {
        "machineType": "g2-standard-16",
        "accelerators": [
          {
            "type": "nvidia-l4",
            "count": 1
          }
        ]
      }
    }
  ]
}
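
For the last option, I was thinking of something along these lines in the taskGroups section (just a sketch: it assumes the BATCH_TASK_INDEX environment variable that Batch sets for each task maps cleanly onto the VM's GPU indices, and train.py is a placeholder for my workload):

"taskGroups": [
  {
    "taskCount": 4,
    "parallelism": 2,
    "taskSpec": {
      "runnables": [
        {
          "script": {
            "text": "export CUDA_VISIBLE_DEVICES=$((BATCH_TASK_INDEX % 2)); python train.py"
          }
        }
      ]
    }
  }
]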

Hello @mt_ploc,

Welcome to the Google Cloud Community!

Yes, you can modify your allocationPolicy to request 2 GPUs. This ensures the scheduler allocates a VM with enough GPUs for parallel tasks.
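
For example, something along these lines should give each of two parallel tasks its own L4 (a sketch, not a verified config; note that G2 machine types come with a fixed number of L4 GPUs, so requesting two L4s means moving to a machine type such as g2-standard-24):

"allocationPolicy": {
  "instances": [
    {
      "installGpuDrivers": true,
      "policy": {
        "machineType": "g2-standard-24",
        "accelerators": [
          {
            "type": "nvidia-l4",
            "count": 2
          }
        ]
      }
    }
  ]
}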

Another option is to create affinity groups and assign your VMs to them, then configure your Batch job to schedule tasks only onto VMs within a specific affinity group. This keeps the tasks on separate VMs, ensuring they run in isolation.


Alternatively, you can set taskCountPerNode=1 if you only want to run one task per VM while still having parallelism > 1.
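
For example, the relevant fields in taskGroups would look roughly like this (sketch only; the script is a placeholder for your workload):

"taskGroups": [
  {
    "taskCount": 4,
    "parallelism": 2,
    "taskCountPerNode": 1,
    "taskSpec": {
      "runnables": [
        {
          "script": {
            "text": "nvidia-smi && python train.py"
          }
        }
      ]
    }
  }
]

With taskCountPerNode set to 1, each task gets its own VM (and therefore its own GPU), so the original single-GPU allocationPolicy can stay as it is.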
