ETCD is degraded with NAME-PENDING error in RHOCP 4

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4

Issue

  • The etcd cluster operator is degraded with the following error message:

    ClusterMemberControllerDegraded: unhealthy members found during reconciling members EtcdMembersDegraded: 2 of 3 members are available, NAME-PENDING-x.x.x.x has not started reason: ClusterMemberController_SyncError::EtcdMembers_UnhealthyMembers
    

Resolution

  • Open a remote shell in the etcd pod on a healthy node:

    $ oc rsh -n openshift-etcd etcd-ocp-xxx-master-1
    
  • List the etcd members with the following command:

    $ etcdctl member list -w table
    +------------------+-----------+------------------+---------------------------+---------------------------+------------+
    |        ID        |  STATUS   |      NAME        |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
    +------------------+-----------+------------------+---------------------------+---------------------------+------------+
    | 2782xxxxxxxxx409 | unstarted |                  | https://xxx.xx.xx.70:2380 |                           |       true |
    | bc14xxxxxxxxx6ab |   started | ocp-xxx-master-2 | https://xxx.xx.xx.94:2380 | https://xxx.xx.xx.94:2379 |      false |
    | c7ef0xxxxxxx5881 |   started | ocp-xxx-master-1 | https://xxx.xx.xx.95:2380 | https://xxx.xx.xx.95:2379 |      false |
    +------------------+-----------+------------------+---------------------------+---------------------------+------------+
    
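  • Refer to the steps in the official documentation to remove the unhealthy member. In the output above, that is the unstarted learner with ID 2782xxxxxxxxx409 and an empty NAME.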
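
The unhealthy member can also be picked out of the member list mechanically. The following is a minimal sketch, assuming the table layout shown above; `unstarted_members` is a hypothetical helper name, not part of etcdctl or oc. It only reads the table and prints the ID and peer URL of each unstarted member; it does not remove anything.

```shell
#!/bin/sh
# Hypothetical helper: print "ID PEER_URL" for every member whose STATUS
# column is "unstarted". Reads `etcdctl member list -w table` output on stdin.
unstarted_members() {
  awk -F'|' '
    # Border rows contain no "|" and the header row has no "unstarted",
    # so only matching data rows are printed.
    $3 ~ /unstarted/ {
      gsub(/[[:space:]]/, "", $2)   # column 2: member ID
      gsub(/[[:space:]]/, "", $5)   # column 5: peer URL
      print $2, $5
    }'
}

# Example (assumes the table was saved to a file named members.txt):
#   unstarted_members < members.txt
```

The ID printed here is the value that `etcdctl member remove <ID>` expects. Follow the official procedure for the remaining steps rather than running the removal on its own.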

Root Cause

  • One of the etcd members is registered with an IP address that differs from the IP address of the master node on which it is deployed.

Diagnostic Steps

  • Check the status of the etcd cluster operator:

    $ oc get co etcd -o yaml
    status:
      conditions:
      - lastTransitionTime: "2024-06-03T07:58:39Z"
        message: |-
          ClusterMemberControllerDegraded: unhealthy members found during reconciling members
          EtcdMembersDegraded: 2 of 3 members are available, NAME-PENDING-xxx.xx.xx.70 has not started
        reason: ClusterMemberController_SyncError::EtcdMembers_UnhealthyMembers
        status: "True"
        type: Degraded
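
When only the Degraded message is of interest, the full YAML can be narrowed with a JSONPath filter. This is a minimal sketch, assuming `oc` is logged in to the affected cluster; `etcd_degraded_message` is a hypothetical wrapper name chosen for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper: print only the message of the etcd cluster
# operator's Degraded condition instead of the full YAML.
etcd_degraded_message() {
  oc get clusteroperator etcd \
    -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}'
}

# Example:
#   etcd_degraded_message
```

On a cluster in the state described here, the output should contain the NAME-PENDING entry together with the IP address of the member that has not started.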
    
  • Check the member list with the following command:

    $ etcdctl member list -w table
    +------------------+-----------+------------------+---------------------------+---------------------------+------------+
    |        ID        |  STATUS   |      NAME        |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
    +------------------+-----------+------------------+---------------------------+---------------------------+------------+
    | 2782xxxxxxxxx409 | unstarted |                  | https://xxx.xx.xx.70:2380 |                           |       true |
    | bc14xxxxxxxxx6ab |   started | ocp-xxx-master-2 | https://xxx.xx.xx.94:2380 | https://xxx.xx.xx.94:2379 |      false |
    | c7ef0xxxxxxx5881 |   started | ocp-xxx-master-1 | https://xxx.xx.xx.95:2380 | https://xxx.xx.xx.95:2379 |      false |
    +------------------+-----------+------------------+---------------------------+---------------------------+------------+
    
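  • Compare the INTERNAL-IP of each master node with the peer address of the etcd member deployed on it. The two must be identical; any mismatch needs to be fixed. In this example, the unstarted member advertises xxx.xx.xx.70, which matches none of the node IPs: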

    $ oc get nodes -o wide | awk '{print $1"\t\t"$6}'
    NAME              INTERNAL-IP
    ocp-xxx-master-0  xxx.xx.xx.93
    ocp-xxx-master-1  xxx.xx.xx.94
    ocp-xxx-master-2  xxx.xx.xx.95
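
The comparison between member peer addresses and node IPs can be scripted. The following is a minimal sketch, assuming IPv4 peer URLs of the form shown above and two input files prepared beforehand; `orphan_peer_ips` and both file names are hypothetical. It prints every peer IP from the member list that does not belong to any node.

```shell
#!/bin/sh
# Hypothetical helper: print etcd peer IPs that match no node IP.
#   $1: file holding `etcdctl member list -w table` output
#   $2: file holding one node INTERNAL-IP per line (must not be empty)
orphan_peer_ips() {
  awk -F'|' '
    # First file on the command line (node IPs): remember each address.
    NR == FNR {
      gsub(/[[:space:]]/, "", $0)
      if ($0 != "") node[$0] = 1
      next
    }
    # Second file (member table): column 5 holds the peer URL.
    $5 ~ /:\/\// {
      ip = $5
      gsub(/[[:space:]]/, "", ip)
      sub(/^[a-z]+:\/\//, "", ip)   # drop the scheme, e.g. https://
      sub(/:[0-9]+$/, "", ip)       # drop the port, e.g. :2380
      if (!(ip in node)) print ip
    }' "$2" "$1"
}

# Example (members.txt and node-ips.txt are assumed to exist):
#   orphan_peer_ips members.txt node-ips.txt
```

Any address it prints belongs to a member whose registered IP differs from every node IP, which is the condition described under Root Cause.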
    
