Multiple NVIDIA GPUs Not Showing on OpenShift Worker
Environment
OpenShift Container Platform 4.18
Issue
Multiple NVIDIA GPU cards are installed in an OpenShift worker node, but nvidia-smi reports only one of them.
Resolution
Check the dmesg logs for hardware errors. In this example, connecting the required power cables to the GPU card should resolve the issue. If the logs show a different hardware error, open a case with NVIDIA to determine how to resolve it.
Root Cause
The issue was caused by a hardware problem. In this example, the power cables were not connected to one of the GPU cards, so the card did not have sufficient power to run.
Diagnostic Steps
Two cards are physically installed in the system, but nvidia-smi shows only one:
sh-5.1# nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla P40                      On  |   00000000:82:00.0 Off |                  Off |
| N/A   25C    P8              8W /  250W |       0MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
The lspci output shows that the kernel detects both cards:
sh-5.1# lspci -nnk -d 10de:
04:00.0 3D controller [0302]: NVIDIA Corporation GP102GL [Tesla P40] [10de:1b38] (rev a1)
	Subsystem: NVIDIA Corporation Device [10de:11d9]
	Kernel driver in use: nvidia
	Kernel modules: nouveau
82:00.0 3D controller [0302]: NVIDIA Corporation GP102GL [Tesla P40] [10de:1b38] (rev a1)
	Subsystem: NVIDIA Corporation Device [10de:11d9]
	Kernel driver in use: nvidia
	Kernel modules: nouveau
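With more than two cards, comparing the two lists by eye becomes error-prone. The comparison above can be sketched as a small shell helper. The function name gpus_missing_from_driver is hypothetical, and the sketch assumes that lspci is run with -D so that it prints the PCI domain, and that nvidia-smi is queried for pci.bus_id in CSV form:

```shell
# Hypothetical helper (not part of any NVIDIA or Red Hat tooling).
#   $1 = output of: lspci -D -d 10de:
#   $2 = output of: nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader
# Prints each PCI address that the kernel lists but the driver does not report.
gpus_missing_from_driver() {
    # nvidia-smi prints an 8-digit PCI domain (00000000:82:00.0); keep the
    # last 12 characters so the ID matches the lspci -D form (0000:82:00.0).
    smi_ids=$(printf '%s\n' "$2" |
        awk 'NF { id = tolower($1); print substr(id, length(id) - 11) }')

    # Keep only lspci lines that start with a PCI address; indented detail
    # lines (Subsystem, Kernel driver in use, ...) are skipped.
    printf '%s\n' "$1" |
        awk '/^[0-9a-fA-F]+:[0-9a-fA-F]+:[0-9a-fA-F]+\.[0-7] / { print tolower($1) }' |
        while read -r addr; do
            printf '%s\n' "$smi_ids" | grep -qxF "$addr" || printf '%s\n' "$addr"
        done
}

# Example usage on the worker node:
#   gpus_missing_from_driver "$(lspci -D -d 10de:)" \
#       "$(nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader)"
```

For the system in this article, the helper would be expected to print only the address of the card that nvidia-smi does not list.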
However, filtering the dmesg logs for NVRM messages shows that the GPU at 0000:04:00.0, the card missing from the nvidia-smi output, does not have its power cables connected:
[11163.919194] NVRM: GPU 0000:04:00.0: GPU does not have the necessary power cables connected.
[11163.919675] NVRM: GPU 0000:04:00.0: RmInitAdapter failed! (0x24:0x1c:1513)
[11163.919718] NVRM: GPU 0000:04:00.0: rm_init_adapter failed, device minor number 0
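The article does not show the exact command used to produce these lines. A minimal sketch of one way to filter the kernel log follows, assuming a root shell on the node (for example from oc debug node/<node-name> followed by chroot /host). The function name nvrm_gpu_addresses is hypothetical:

```shell
# Hypothetical helper: reads kernel log text on stdin and prints the PCI
# address of every GPU named in an NVRM message, one address per line.
nvrm_gpu_addresses() {
    sed -nE 's/.*NVRM: GPU ([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]):.*/\1/p' |
        sort -u
}

# Example usage on the worker node:
#   dmesg | grep NVRM            # show the raw NVRM messages
#   dmesg | nvrm_gpu_addresses   # list only the affected PCI addresses
```

An address printed here that is absent from the nvidia-smi output points to the card that failed to initialize.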
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.