CPU frequency issues with the NVIDIA Grace processor

Solution In Progress

Environment

  • Red Hat Enterprise Linux for ARM 64 9.4
  • Red Hat Enterprise Linux for ARM 64 9.3
  • HPE ProLiant Compute DL384 Gen12 (NVIDIA Grace (2 CPU/2 GPU) based system)
  • Pegatron Corporation Pegatron SVR AS201-1N0 (NVIDIA Grace (1 CPU/1 GPU) based system)
  • HPE Cray EX254n (EX Supercomputer)

Issue

  1. The reported minimum and maximum frequencies are inconsistent with the expected values.
  2. A single, random CPU core does not operate at the expected minimum frequency.

lscpu output

    Architecture:                       aarch64
    CPU op-mode(s):                     64-bit
    Byte Order:                         Little Endian
    CPU(s):                             144
    On-line CPU(s) list:                0-143
    Vendor ID:                          ARM
    BIOS Vendor ID:                     NVIDIA
    Model name:                         Neoverse-V2
    BIOS Model name:                    Grace A02
    Model:                              0
    Thread(s) per core:                 1
    Core(s) per socket:                 72
    Socket(s):                          2
    Stepping:                           r0p0
    Frequency boost:                    disabled
    CPU(s) scaling MHz:                 95%
    CPU max MHz:                        3492.0000
    CPU min MHz:                        81.0000
    ......
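
The limits that lscpu prints can be cross-checked against the kernel's cpufreq sysfs files. The following is a minimal sketch, assuming the system exposes `cpuinfo_min_freq` and `cpuinfo_max_freq` (in kHz) under `/sys/devices/system/cpu/cpuN/cpufreq/`. The helper names `read_cpufreq_limits_mhz` and `collect_limits` are illustrative and are not part of lscpu or the certification test suite.

```python
from pathlib import Path

# Standard cpufreq sysfs location; passed as a parameter so another root can be used.
SYSFS_CPU_ROOT = Path("/sys/devices/system/cpu")


def read_cpufreq_limits_mhz(cpu, root=SYSFS_CPU_ROOT):
    """Return (min_mhz, max_mhz) for one CPU. The sysfs values are in kHz."""
    cpufreq = Path(root) / f"cpu{cpu}" / "cpufreq"
    min_khz = int((cpufreq / "cpuinfo_min_freq").read_text().strip())
    max_khz = int((cpufreq / "cpuinfo_max_freq").read_text().strip())
    return min_khz / 1000.0, max_khz / 1000.0


def collect_limits(root=SYSFS_CPU_ROOT):
    """Map CPU number -> (min_mhz, max_mhz), skipping CPUs without cpufreq data."""
    limits = {}
    for cpu_dir in Path(root).glob("cpu[0-9]*"):
        # Hypothetical robustness check: ignore CPUs that expose no cpufreq files.
        if not (cpu_dir / "cpufreq" / "cpuinfo_min_freq").is_file():
            continue
        if not (cpu_dir / "cpufreq" / "cpuinfo_max_freq").is_file():
            continue
        cpu = int(cpu_dir.name[3:])
        limits[cpu] = read_cpufreq_limits_mhz(cpu, root)
    return limits
```

If sysfs agrees with the lscpu output above, `collect_limits()` would return `(81.0, 3492.0)` for each CPU. That is an expectation based on the lscpu output, not a result measured for this article.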

CPU frequencies reported by the RHEL hardware certification test suite:

Run 1:

                User Min                 User Max               Performance
    ---------   ----------------------   --------------------   --------------------
    expected:   81 MHz                   3.490 GHz              3.490 GHz
    cpu 0       351 MHz   (384.00 sec)   3.753 GHz (6.51 sec)   4.248 GHz (6.47 sec)
    cpu 1       369 MHz   (261.16 sec)   2.952 GHz (6.22 sec)   3.870 GHz (6.21 sec)
    cpu 2       342 MHz   (261.13 sec)   2.988 GHz (6.27 sec)   2.943 GHz (6.25 sec)
    cpu 3       333 MHz   (262.44 sec)   3.330 GHz (6.25 sec)   2.844 GHz (6.23 sec)
    ......

Run 2:

                User Min                 User Max               Performance
    ---------   ----------------------   --------------------   --------------------
    expected:   81 MHz                   3.490 GHz              3.490 GHz
    cpu 10      360 MHz   (263.07 sec)   3.978 GHz (6.27 sec)   2.853 GHz (6.22 sec)
    cpu 11      342 MHz   (379.17 sec)   5.832 GHz (6.52 sec)   3.141 GHz (6.47 sec)
    cpu 12      333 MHz   (262.50 sec)   3.465 GHz (6.23 sec)   3.213 GHz (6.25 sec)
    ......
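
To quantify how far the reported figures sit from the expected row, each value can be converted to MHz and expressed as a percentage deviation. The sketch below uses the cpu 0 row of Run 1 as sample data. The helper names are illustrative, and this calculation is an assumption made for this example, not something the certification test suite is known to perform.

```python
# Conversion factors for the two units that appear in the report.
UNIT_TO_MHZ = {"MHz": 1.0, "GHz": 1000.0}


def to_mhz(value, unit):
    """Convert a reported frequency to MHz."""
    return value * UNIT_TO_MHZ[unit]


def deviation_pct(measured_mhz, expected_mhz):
    """Signed deviation of the measured value from the expected one, in percent."""
    return (measured_mhz - expected_mhz) / expected_mhz * 100.0


# Expected row and the cpu 0 row from Run 1, copied from the report.
EXPECTED = {"User Min": (81, "MHz"), "User Max": (3.490, "GHz"), "Performance": (3.490, "GHz")}
CPU0_RUN1 = {"User Min": (351, "MHz"), "User Max": (3.753, "GHz"), "Performance": (4.248, "GHz")}


def row_deviations(measured, expected):
    """Return {column: deviation in percent} for one report row."""
    return {
        column: deviation_pct(to_mhz(*measured[column]), to_mhz(*expected[column]))
        for column in expected
    }
```

For the sample row, `row_deviations(CPU0_RUN1, EXPECTED)` gives roughly +333% for User Min, +7.5% for User Max, and +21.7% for Performance.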

Resolution

  1. lscpu reports minimum and maximum frequencies of 81 MHz and 3492 MHz (shown as 3.490 GHz in the test suite's expected row), but the frequencies reported by the rhcert hardware test suite do not match these values. For example, cpu 0 in Run 1 reports a minimum of 351 MHz and a maximum of 3.753 GHz. This is a reporting issue; the CPU is expected to operate at the expected minimum and maximum frequencies.
  2. A random CPU core fails to operate at its expected minimum frequency. For example:

    Run 1:
    cpu 0    351 MHz   (384.00 sec)   3.753 GHz (6.51 sec)   4.248 GHz (6.47 sec)
    Run 2:
    cpu 11   342 MHz   (379.17 sec)   5.832 GHz (6.52 sec)   3.141 GHz (6.47 sec)

The other CPU cores completed the task at the minimum frequency in about 260 seconds, whereas CPU 0 and CPU 11 took 384 and 379 seconds respectively in these test runs. The longer run time indicates that these cores ran slower than expected at the minimum frequency. However, this does not affect the overall functionality of the CPU. HPE, NVIDIA, and Red Hat are investigating the root cause of the minimum frequency issue and are working towards a resolution.
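
The slow cores can be picked out of a report by comparing each core's elapsed time in the User Min column with the median across cores. The following is a minimal sketch: the row format is taken from the output above, while the 1.2x threshold and the function names are assumptions chosen for illustration and are not criteria used by the certification test suite.

```python
import re
from statistics import median

# Matches rows such as "cpu 0    351 MHz   (384.00 sec)   ..." and captures
# the CPU number and the elapsed seconds of the first (User Min) column.
ROW_RE = re.compile(r"^\s*cpu\s+(\d+)\s+[\d.]+\s+(?:MHz|GHz)\s+\(([\d.]+)\s+sec\)")


def min_freq_durations(report):
    """Return {cpu: seconds spent on the User Min test} parsed from report text."""
    durations = {}
    for line in report.splitlines():
        match = ROW_RE.match(line)
        if match:
            durations[int(match.group(1))] = float(match.group(2))
    return durations


def slow_cores(report, threshold=1.2):
    """Return CPUs whose User Min run time exceeds threshold x the median time."""
    durations = min_freq_durations(report)
    if not durations:
        return []
    typical = median(durations.values())
    return sorted(cpu for cpu, sec in durations.items() if sec > threshold * typical)


# Sample rows copied from the two runs shown in this article.
RUN_1 = """\
cpu 0    351 MHz   (384.00 sec)   3.753 GHz (6.51 sec)   4.248 GHz (6.47 sec)
cpu 1    369 MHz   (261.16 sec)   2.952 GHz (6.22 sec)   3.870 GHz (6.21 sec)
cpu 2    342 MHz   (261.13 sec)   2.988 GHz (6.27 sec)   2.943 GHz (6.25 sec)
cpu 3    333 MHz   (262.44 sec)   3.330 GHz (6.25 sec)   2.844 GHz (6.23 sec)
"""
RUN_2 = """\
cpu 10   360 MHz   (263.07 sec)   3.978 GHz (6.27 sec)   2.853 GHz (6.22 sec)
cpu 11   342 MHz   (379.17 sec)   5.832 GHz (6.52 sec)   3.141 GHz (6.47 sec)
cpu 12   333 MHz   (262.50 sec)   3.465 GHz (6.23 sec)   3.213 GHz (6.25 sec)
"""
```

With the sample rows, `slow_cores(RUN_1)` returns `[0]` and `slow_cores(RUN_2)` returns `[11]`, matching the two cores called out above. The median is used rather than the mean so that the single slow core does not skew the baseline.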

