How to fix Insufficient nvidia.com/gpu on GKE
Creating a fresh Kubernetes cluster with GPUs on Google Kubernetes Engine (GKE) sometimes (or quite often lately) results in:
Warning FailedScheduling default-scheduler 0/2 nodes are available: 2 Insufficient nvidia.com/gpu. preemption: 0/2 nodes are available: 2 No preemption victims found for incoming pod..
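For context, this is the scheduler saying that no node currently advertises free `nvidia.com/gpu` capacity for a pod that requests one. A minimal sketch of such a pod is below; the pod name and image are purely illustrative and not taken from the original setup:

```bash
# Minimal GPU-requesting pod (illustrative name and image only).
# The nvidia.com/gpu limit is the resource the scheduler reports as
# insufficient while the drivers/device plugin are not working.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.0.3-base-ubuntu20.04   # example image only
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```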
Here is how to solve the problem:
1. List the pods in kube-system:
`kubectl get pods -n kube-system`
If any of the `nvidia-gpu-device-plugin` pods are failing, this solution should work (for other problems I'm unsure). You can check what the problem is by describing the `nvidia-gpu-device-plugin` pod:
`kubectl describe pod nvidia-gpu-device-plugin-small-abcc -n kube-system`
The results might report a MountVolume failure such as:
`MountVolume.SetUp failed for volume "nvidia" : hostPath type check failed: /home/kubernetes/bin/nvidia is not a directory`
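As an extra sanity check (my own addition, not part of the original steps), you can confirm that the nodes are not yet advertising any allocatable GPUs:

```bash
# While the drivers are missing, nvidia.com/gpu is absent (or 0) in the
# nodes' Capacity/Allocatable sections, so this grep prints nothing.
kubectl describe nodes | grep "nvidia.com/gpu"
```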
2. Install the NVIDIA GPU device drivers by applying Google's official driver-installer DaemonSet:
`kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml`
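If you want to block until the installer has rolled out on all GPU nodes, a plain rollout-status check works; note that the DaemonSet name below is inferred from the pod names shown in the next step, so treat it as an assumption:

```bash
# Wait for the driver installer DaemonSet to finish rolling out.
kubectl rollout status daemonset/nvidia-driver-installer -n kube-system
```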
3. You should see something like:
nvidia-driver-installer-2v595 0/1 Init:0/2 0 19s
nvidia-driver-installer-jls5j 0/1 Init:0/2 0 19s
nvidia-gpu-device-plugin-small-9d4gg 0/1 ContainerCreating 0 18m
nvidia-gpu-device-plugin-small-c8k8x 0/1 ContainerCreating 0 18m
After a while you should see all kube-system pods running:
...
nvidia-driver-installer-2v595 1/1 Running 0 7m20s
nvidia-driver-installer-jls5j 1/1 Running 0 7m20s
nvidia-gpu-device-plugin-small-9d4gg 1/1 Running 0 25m
nvidia-gpu-device-plugin-small-c8k8x 1/1 Running 0 25m
...
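Once the driver-installer and device-plugin pods are Running, the nodes should start advertising `nvidia.com/gpu` again and the pending workload should get scheduled. Two quick, generic checks (nothing here is specific to this cluster):

```bash
# Each GPU node should now report a non-zero nvidia.com/gpu value under
# its Capacity and Allocatable sections.
kubectl describe nodes | grep "nvidia.com/gpu"

# The pod that was stuck with "Insufficient nvidia.com/gpu" should leave
# Pending and get assigned to a node.
kubectl get pods -o wide
```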