Batch - Add version option to "installGpuDrivers" Flag

Feature Request

In a Batch job specification, the installGpuDrivers Boolean flag can be set to run a script that attempts to install NVIDIA drivers.

This script checks the OS version and runs the appropriate tool. For Debian this is a curl download and a Python script, and for COS this is the cos-extensions install gpu tool.

The default driver that is installed is version ~470, which is not compatible with CUDA 12. Both of these install scripts accept a version argument, allowing a user to install a more recent version. This argument should be plumbed through to the Batch config.

@Shamel

Hi jacksonwb,

This feature is not supported right now. As a short-term workaround, you can set the installGpuDrivers flag to false and add a runnable that installs the GPU drivers with the version you need, placed before the runnables that use GPUs.

Here is a job spec example:

{
  "taskGroups": [
    {
      "taskSpec": {
        "runnables": [
          {
            "script": {
              "text": "cos-extensions install gpu -- -version=latest"
            }
          },
          {
            "container": {
              "imageUri": "busybox",
              "entrypoint": "/bin/sh",
              "commands": [
                "-c",
                "sleep 3000"
              ]
            }
          }
        ]
      },
      "taskCount": 1,
      "parallelism": 1
    }
  ],
  "allocationPolicy": {
    "instances": [
      {
        "policy": {
          "bootDisk": {
            "image": "batch-cos"
          }
        }
      }
    ],
    "location": {
      "allowedLocations": [
        "regions/us-central1"
      ]
    }
  }
}
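The example above does not show the first half of the workaround, namely setting installGpuDrivers to false. A minimal sketch of what the allocationPolicy could look like with that flag and a GPU attached follows. The machine type and accelerator type are illustrative assumptions, not values from this thread, and the location block is omitted for brevity:

```json
{
  "allocationPolicy": {
    "instances": [
      {
        "installGpuDrivers": false,
        "policy": {
          "machineType": "n1-standard-4",
          "accelerators": [
            {
              "type": "nvidia-tesla-t4",
              "count": 1
            }
          ],
          "bootDisk": {
            "image": "batch-cos"
          }
        }
      }
    ]
  }
}
```

With this in place, Batch should skip its own driver installation, and the first script runnable is then responsible for installing the driver.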

Please let me know if it works for you.

Thanks!

Wen

Thanks Wen, yes, this is essentially what I have done. However, I believe that when a script runnable and a container runnable are submitted, the default OS is Debian and not COS, so one would need to use a different driver install command. Is this not the case?

Hi jacksonwb,

Yes, you are right, it will use Debian by default. I have updated the answer to use the COS image.

Thanks!

Wen

Hi @jacksonwb, we recently added the ability to specify GPU driver versions with the driverVersion field. For more details, see

https://cloud.google.com/batch/docs/reference/rest/v1/projects.locations.jobs#accelerator
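As a rough sketch of how this could look in a job spec, driverVersion sits inside an entry of the accelerators list in the instance policy. The machine type, accelerator type, and version string below are illustrative assumptions; check the linked reference and the Compute Engine minimum driver requirements for values that are valid for your GPU:

```json
{
  "allocationPolicy": {
    "instances": [
      {
        "installGpuDrivers": true,
        "policy": {
          "machineType": "n1-standard-4",
          "accelerators": [
            {
              "type": "nvidia-tesla-t4",
              "count": 1,
              "driverVersion": "535.104.05"
            }
          ]
        }
      }
    ]
  }
}
```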

 

Hi Shamel,

I notice that the driverVersion field is defined as part of the Accelerator block, which is used when attaching GPUs such as T4 or V100 to VMs.
Is there any way to specify the driverVersion when a job is launched with a machine type that comes with GPUs built in, such as the L4 and H100 machine types, where that block is not included?