In a Batch job specification, the installGpuDrivers Boolean flag can be set to run a script that attempts to install NVIDIA drivers.
This script checks the OS version and runs the appropriate tool. For Debian this is a curl download followed by a Python script, and for COS this is the cos-extensions install gpu tool.
The default driver that is installed is version ~470, which is not compatible with CUDA 12. Both of these install scripts accept a version argument, allowing a user to install a more recent version. This argument should be plumbed through to the Batch config.
This feature is not currently supported. As a short-term workaround, you can set the installGpuDrivers flag to false and add a runnable that installs the GPU drivers at the version you need, placed before the runnables that use GPUs.
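For anyone following along, the workaround above might be sketched roughly as follows. This is only an illustration, not an official recipe: the machine type, the nvidia-smi check, and the DRIVER_INSTALL_CMD placeholder are assumptions, and the actual install command depends on the VM's OS and the driver version you need.

```python
import json

# Hypothetical placeholder: replace with the driver install command for your
# VM's OS (Debian or COS) and the driver version you need.
DRIVER_INSTALL_CMD = "echo 'replace with your GPU driver install command'"


def build_job(install_cmd: str, machine_type: str) -> dict:
    """Build a Batch job body that skips the built-in driver install and
    installs drivers in a first runnable instead."""
    return {
        "taskGroups": [
            {
                "taskSpec": {
                    "runnables": [
                        # Runs first: installs the GPU driver manually.
                        {"script": {"text": install_cmd}},
                        # Runs afterwards: a stand-in for the real GPU workload.
                        {"script": {"text": "nvidia-smi"}},
                    ]
                }
            }
        ],
        "allocationPolicy": {
            "instances": [
                {
                    # Disable the built-in driver install script.
                    "installGpuDrivers": False,
                    # Example machine type only (an assumption).
                    "policy": {"machineType": machine_type},
                }
            ]
        },
    }


job = build_job(DRIVER_INSTALL_CMD, "g2-standard-4")
print(json.dumps(job, indent=2))
```

The printed JSON could then be saved to a file and passed as the job config when submitting the job.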
Thanks, yes, this is essentially what I have done. However, I believe that when a script runnable and a container runnable are submitted together, the default OS is Debian and not COS, so one would need to use a different driver install command. Is this not the case?
I notice the driverVersion field is defined as part of the Accelerator block, which is used when attaching GPUs such as T4 or V100 to VMs.
Is there any way to specify the driverVersion when a job is launched with a machine type that comes with a GPU built in, so that the Accelerator block is not included, such as the L4 and H100 machine types?
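To make the question concrete, here is a rough sketch of the case where the Accelerator block is present, as with T4 or V100. The machine type and the driver version string are placeholders I have assumed for illustration. It does not cover machine types with a built-in GPU, which is the open question above.

```python
import json


def accelerator_instance(gpu_type: str, count: int, driver_version: str) -> dict:
    """Build one allocationPolicy.instances entry with an explicit
    Accelerator block carrying a driverVersion."""
    return {
        "installGpuDrivers": True,
        "policy": {
            # Example machine type only (an assumption).
            "machineType": "n1-standard-8",
            "accelerators": [
                {
                    "type": gpu_type,
                    "count": count,
                    # Placeholder value; substitute a real driver version.
                    "driverVersion": driver_version,
                }
            ],
        },
    }


instance = accelerator_instance("nvidia-tesla-t4", 1, "<driver-version>")
print(json.dumps(instance, indent=2))
```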