Multiple NVIDIA GPUs Not Showing on OpenShift Worker

Solution Verified

Environment

OpenShift Container Platform 4.18

Issue

Multiple NVIDIA GPU cards are installed in an OpenShift worker node but nvidia-smi is only reporting one of the GPUs.

Resolution

Check the dmesg logs for hardware errors. In this example, connecting the required power cables to the GPU card resolves the issue. If the logs show a different hardware error, open a case with NVIDIA to determine how to resolve it.
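As a minimal sketch (not an official procedure), the kernel ring buffer can be scanned for NVRM (NVIDIA driver) errors like the ones shown in the Diagnostic Steps below. `<worker-node>` is a placeholder for the affected node's name.

```shell
# A sketch, assuming cluster-admin access. On an OpenShift worker you
# would run this from a debug shell, e.g.:
#   oc debug node/<worker-node>
#   chroot /host
# Filter the kernel log for NVRM messages that indicate power or
# initialization failures on a GPU.
check_gpu_power_errors() {
  dmesg | grep -E 'NVRM:.*(power cables|RmInitAdapter failed)'
}
```

Any output from this filter points at a card the driver could not bring up.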

Root Cause

This situation was caused by a hardware issue: in this example, the power cables were not connected, so the GPU card did not have sufficient power to run.

Diagnostic Steps

There are physically two cards in the system, but only one is shown by nvidia-smi:

sh-5.1# nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla P40                      On  |   00000000:82:00.0 Off |                  Off |
| N/A   25C    P8              8W /  250W |       0MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

However, checking the lspci output shows that the kernel detects both cards:

sh-5.1# lspci -nnk -d 10de:
04:00.0 3D controller [0302]: NVIDIA Corporation GP102GL [Tesla P40] [10de:1b38] (rev a1)
    Subsystem: NVIDIA Corporation Device [10de:11d9]
    Kernel driver in use: nvidia
    Kernel modules: nouveau
82:00.0 3D controller [0302]: NVIDIA Corporation GP102GL [Tesla P40] [10de:1b38] (rev a1)
    Subsystem: NVIDIA Corporation Device [10de:11d9]
    Kernel driver in use: nvidia
    Kernel modules: nouveau
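A quick way to confirm the mismatch is to compare the number of GPUs the kernel enumerates on the PCI bus with the number the driver actually initialized. These helpers are illustrative only, not part of any NVIDIA tooling:

```shell
# Illustrative helpers (assumptions, not NVIDIA tooling): 10de is
# NVIDIA's PCI vendor ID. A mismatch between the two counts means a
# card is physically present but failed driver initialization.
count_pci_gpus()    { lspci -d 10de: | wc -l; }
count_driver_gpus() { nvidia-smi -L | wc -l; }
```

On this node, count_pci_gpus reports 2 while count_driver_gpus reports 1, confirming that one card failed to initialize rather than being absent.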

However, if we filter the dmesg logs for NVRM messages, we can see that the missing GPU (0000:04:00.0) does not have its power cables connected:

sh-5.1# dmesg | grep NVRM
[11163.919194] NVRM: GPU 0000:04:00.0: GPU does not have the necessary power cables connected.
[11163.919675] NVRM: GPU 0000:04:00.0: RmInitAdapter failed! (0x24:0x1c:1513)
[11163.919718] NVRM: GPU 0000:04:00.0: rm_init_adapter failed, device minor number 0
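Once the power cables are reconnected and the node rebooted, both cards should enumerate. A simple check (wrapped in a helper here purely for illustration) is that `nvidia-smi -L` prints one line per GPU the driver brought up:

```shell
# After reconnecting the power cables and rebooting the node, every
# physical GPU should initialize; `nvidia-smi -L` lists one line per
# GPU the driver successfully brought up.
list_gpus() { nvidia-smi -L; }
```

On a healthy node with both Tesla P40 cards powered, list_gpus prints two lines.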

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
