How to fix Insufficient nvidia.com/gpu on GKE

Creating a fresh Kubernetes cluster with GPU nodes on Google Kubernetes Engine (GKE) sometimes (or quite often lately) results in:

```
Warning  FailedScheduling default-scheduler  0/2 nodes are available: 2 Insufficient nvidia.com/gpu. preemption: 0/2 nodes are available: 2 No preemption victims found for incoming pod..
```
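Before diving into the fix, you can confirm that the nodes simply aren't advertising the GPU resource yet. A quick check (plain kubectl; nothing here is GKE-specific):

```sh
# Healthy GPU nodes list nvidia.com/gpu under Capacity and Allocatable.
# If the resource lines are missing, the device plugin never registered.
kubectl describe nodes | grep -E '^Name:|nvidia.com/gpu'
```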

Here is how to solve the problem:

  1. List the pods in kube-system: `kubectl get pods -n kube-system`. If any of the nvidia-gpu-device-plugin pods are failing, this solution should work; for other problems I'm unsure.
  2. You can pinpoint the problem by describing one of the failing pods: `kubectl describe pod nvidia-gpu-device-plugin-small-abcc -n kube-system`. The events might report a MountVolume failure: `MountVolume.SetUp failed for volume "nvidia" : hostPath type check failed: /home/kubernetes/bin/nvidia is not a directory`
  3. Install the NVIDIA GPU device drivers by applying Google's official driver-installer DaemonSet:

```sh
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
```

  4. Re-run `kubectl get pods -n kube-system`. You should see something like:

```
nvidia-driver-installer-2v595                           0/1     Init:0/2            0          19s
nvidia-driver-installer-jls5j                           0/1     Init:0/2            0          19s
nvidia-gpu-device-plugin-small-9d4gg                    0/1     ContainerCreating   0          18m
nvidia-gpu-device-plugin-small-c8k8x                    0/1     ContainerCreating   0          18m
```
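Rather than polling the pod list, you can block until the rollout finishes. A small sketch, assuming the DaemonSet names match the pod-name prefixes above (adjust if your cluster differs):

```sh
# Wait for the driver installer and device plugin DaemonSets to become ready.
kubectl rollout status daemonset/nvidia-driver-installer -n kube-system
kubectl rollout status daemonset/nvidia-gpu-device-plugin-small -n kube-system
```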

After a while you should see all kube-system pods running:

```
...
nvidia-driver-installer-2v595                           1/1     Running   0          7m20s
nvidia-driver-installer-jls5j                           1/1     Running   0          7m20s
nvidia-gpu-device-plugin-small-9d4gg                    1/1     Running   0          25m
nvidia-gpu-device-plugin-small-c8k8x                    1/1     Running   0          25m
...
```
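To verify that scheduling now works end to end, you can submit a throwaway pod that requests a GPU. A minimal sketch; the pod name and CUDA image tag are placeholders, not anything prescribed by GKE:

```yaml
# gpu-smoke-test.yaml -- requests one GPU and runs nvidia-smi once.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04  # placeholder CUDA image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1  # the resource the scheduler complained about
```

Apply it with `kubectl apply -f gpu-smoke-test.yaml`, then check `kubectl logs gpu-smoke-test`: if `nvidia-smi` prints the driver and GPU table, the scheduling error is gone.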
