Creating a fresh Kubernetes cluster with GPU nodes on Google Kubernetes Engine (GKE) sometimes (or quite often lately) results in:
```
Warning  FailedScheduling  default-scheduler  0/2 nodes are available: 2 Insufficient nvidia.com/gpu. preemption: 0/2 nodes are available: 2 No preemption victims found for incoming pod..
```
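For context, this warning shows up for any pod that requests `nvidia.com/gpu` before the drivers are in place. A minimal sketch of such a pod spec — the pod name and image here are placeholders, not from the original setup:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test               # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04   # any CUDA-capable image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1    # unsatisfiable until the nodes expose GPUs
```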
Here is how to solve the problem:
- List the pods in `kube-system`:

  ```bash
  kubectl get pods -n kube-system
  ```

  If any of the `nvidia-gpu-device-plugin` pods are failing, this solution should work; for other problems I'm unsure.
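  If the cluster runs many system pods, one way to narrow the listing to the relevant ones is a simple filter (just a convenience, not required):

  ```bash
  # Show only the NVIDIA-related kube-system pods (driver installer and device plugin)
  kubectl get pods -n kube-system | grep nvidia
  ```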
- You can check the potential problem by describing one of the `nvidia-gpu-device-plugin` pods:

  ```bash
  kubectl describe pod nvidia-gpu-device-plugin-small-abcc -n kube-system
  ```

  The events might report a MountVolume failure:

  ```
  MountVolume.SetUp failed for volume "nvidia" : hostPath type check failed: /home/kubernetes/bin/nvidia is not a directory
  ```

  That hostPath is populated by the NVIDIA driver installer, so this failure indicates the GPU drivers have not been installed on the node yet.
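  If you only want the failure events rather than the full describe output, you can also query events directly; the pod name below is the same placeholder as above:

  ```bash
  # List events for the failing device-plugin pod
  kubectl get events -n kube-system --field-selector involvedObject.name=nvidia-gpu-device-plugin-small-abcc
  ```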
- Install the NVIDIA GPU device drivers by applying Google's official DaemonSet:

  ```bash
  kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
  ```
- You should then see something like:

  ```
  nvidia-driver-installer-2v595          0/1   Init:0/2            0   19s
  nvidia-driver-installer-jls5j          0/1   Init:0/2            0   19s
  nvidia-gpu-device-plugin-small-9d4gg   0/1   ContainerCreating   0   18m
  nvidia-gpu-device-plugin-small-c8k8x   0/1   ContainerCreating   0   18m
  ```

  After a while you should see all kube-system pods running:

  ```
  ...
  nvidia-driver-installer-2v595          1/1   Running   0   7m20s
  nvidia-driver-installer-jls5j          1/1   Running   0   7m20s
  nvidia-gpu-device-plugin-small-9d4gg   1/1   Running   0   25m
  nvidia-gpu-device-plugin-small-c8k8x   1/1   Running   0   25m
  ...
  ```
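As a final sanity check, you can confirm that the nodes now advertise GPUs as an allocatable resource; the pending pod should then be scheduled without further action:

```bash
# Each GPU node should now list nvidia.com/gpu under Capacity and Allocatable
kubectl describe nodes | grep nvidia.com/gpu
```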