What are the different 'modeCode' and 'phase' in NooBaa's BackingStore

If NooBaa's BackingStore is not in a healthy state, it can be difficult to understand the meaning of its modeCode and phase and where to look for the issue. In this article we explain the different modeCodes and phases to help with troubleshooting.

If the modeCode explanation and generalized suggestions do not resolve the issue, please open a Support Case with Red Hat and reference this KCS article, #7015245. Please upload an ODF must-gather as well as any relevant logs and terminal output created by following the modeCode explanation and generalized suggestions in this KCS article.

We have opened Red Hat Documentation Bugzilla #2274762, specifically to have this same information placed in the Troubleshooting OpenShift Data Foundation manual, section 6.4 - Resolving NooBaa Bucket Error State.

MODECODE

In this section we will explain the various modeCodes.
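
Before digging into individual codes, it helps to read the current modeCode and phase straight from the BackingStore resource. The commands below are a minimal sketch; they assume the default ODF namespace openshift-storage and the usual status layout (status.mode.modeCode and status.phase), which can be confirmed from the full YAML output.

    # List all BackingStores and their phase (namespace assumed to be openshift-storage)
    $ oc get backingstore -n openshift-storage

    # Full status of one BackingStore, including the mode and phase fields
    $ oc get backingstore <backingstore-name> -n openshift-storage -o yaml

    # Extract just the modeCode and phase (field paths assumed; verify against the YAML above)
    $ oc get backingstore <backingstore-name> -n openshift-storage \
        -o jsonpath='{.status.mode.modeCode}{"\n"}{.status.phase}{"\n"}'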

HAS_NO_NODES

This error is reported by NooBaa when a pool (BackingStore) has 0 active hosts. For NooBaa in OpenShift this can imply that there are 0 BackingStore pods running in the OpenShift cluster.

Pinpointing and resolving the issue:
  • If the underlying storage is a PV - This error is reported if, for some reason, the desired number of BackingStore pods is not running. This could be due to:

    • BackingStore pods continuously crashing
    • NooBaa operator reconciliation failures
    • BackingStore pods not being created due to some kubernetes specific reason including but not limited to PV mount issue, SELinux issue, etc.

    Logs : Relevant logs can come from all 3 components - noobaa-backingstore, noobaa-operator, noobaa-core.

    Resolution :

    • Check if it is a Kubernetes-specific issue (such as a PV mount or SELinux problem). If it is, resolve accordingly.
    • If the NooBaa operator reconciliation is suspected to be failing in the middle of reconciliation, check the logs and try to identify the issue. The logs might indicate a malformed spec field which needs to be fixed. If that is not the case, reach out to Red Hat Support as detailed above.
    • If the BackingStore pod is continuously crashing, check the pod's logs and determine the cause (see the commands sketched after this list). If the cause appears to be something obvious, such as an inability to connect to the network, resolve accordingly; otherwise reach out to Red Hat Support as detailed above.
    • If none of the above - reach out to Red Hat Support as detailed above.
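
A minimal sketch of the checks described above, assuming the default ODF namespace openshift-storage, the default operator deployment name noobaa-operator, and the default core pod name noobaa-core-0; PV pool pod names normally contain the BackingStore name.

    # Check that the expected number of BackingStore (PV pool) pods are Running
    $ oc get pods -n openshift-storage | grep <backingstore-name>

    # Look for PV mount, scheduling or SELinux problems on a Pending/CrashLooping pod
    $ oc describe pod <backingstore-pod-name> -n openshift-storage

    # Logs from the three relevant components
    $ oc logs <backingstore-pod-name> -n openshift-storage
    $ oc logs deployment/noobaa-operator -n openshift-storage
    $ oc logs noobaa-core-0 --all-containers -n openshift-storage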

INITIALIZING

It is not an error. This status is given to a pool’s node when it is starting up and no RPC address has been assigned to it.

DELETING

This is not an error. This status is given to a node when it is marked for deletion. A node can be marked for deletion when the pool which owns it is marked for deletion. In Kubernetes terminology, a node will be marked for deletion when a BackingStore (roughly an alias for a pool) is deleted.

NOTE: An advanced user might be able to delete the pool by simply executing the delete pool RPC to NooBaa.

ALL_NODES_OFFLINE

This error is reported by NooBaa BackingStore when NooBaa core suspects that a particular BackingStore is offline. Here, offline can mean different things based on the kind of the BackingStore (AWS, IBM COS, Azure Container, S3, PV Pool, etc).

Pinpointing and resolving the issue
  • If the underlying storage is NOT a PV Pool - This error indicates that NooBaa acknowledges that the storage has been initialized and exists, and that NooBaa could authenticate successfully. Due to the ambiguity of the underlying cause here, it is recommended to reach out to Red Hat Support as detailed above.

  • If the underlying storage is a PV Pool - This error indicates that the percentage of offline nodes from NooBaa's perspective has reached 100%. For example, if a BackingStore's numVolumes is set to 3 and NooBaa identifies that all of them are OFFLINE, this error is reported. To debug a node being OFFLINE, refer to HAS_NO_NODES; a quick check is also sketched below.
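
For the PV pool case, a quick sanity check is to compare the desired numVolumes with the pods that are actually Running. This is a sketch assuming the default namespace openshift-storage and the usual spec field path spec.pvPool.numVolumes (confirm with -o yaml).

    # Desired number of volumes (pods) for the BackingStore
    $ oc get backingstore <backingstore-name> -n openshift-storage \
        -o jsonpath='{.spec.pvPool.numVolumes}{"\n"}'

    # Pods that are actually Running for that BackingStore
    $ oc get pods -n openshift-storage | grep <backingstore-name>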

SCALING

Most likely not an error. This status is given to a pool (BackingStore) in the following circumstances:

  • BackingStore numVolumes > 1 but fewer BackingStore pods are running in the cluster. Refer to HAS_NO_NODES for resolving the issue if this mode persists for a significant period.

  • BackingStore numVolumes was previously greater than 1 but was later reduced. This scenario does not lead to an issue on the noobaa-core side; rather, it is an error raised by the operator, so the BackingStore modeCode will still remain OPTIMAL.

    • If the user did do that - Revert the change. Scaling down is not supported.
    • If the user is certain that they did not in fact reduce the numVolumes - reach out to Red Hat Support as detailed above.

NO_CAPACITY

This error is raised when a BackingStore has insufficient free capacity (<= 1MB) in its underlying storage.

Pinpointing and resolving the issue
  • If the underlying storage is a PV Pool - This error is usually reported when the BackingStore pod's storage is full (1MB or less of free space).

    Logs : Relevant logs can come from noobaa-core pod and backingstore pod.

    Resolution :

    • Exec into the BackingStore pod and verify whether the storage is indeed full (a sketch follows this list). If it is, try to clean up some data from the bucket. The NooBaa operator at the moment does not support increasing the storage size.
    • If cleaning data is not possible, consider increasing the numVolumes count.
    • If the BackingStore pod storage is not full and the error still occurs - reach out to Red Hat Support as detailed above.
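
A minimal sketch of the capacity check, assuming the default namespace openshift-storage and that the PV is mounted under /noobaa_storage inside the BackingStore pod (consistent with the root path described later in this article).

    # Check free space on the BackingStore pod's storage mount
    $ oc exec -n openshift-storage <backingstore-pod-name> -- df -h /noobaa_storage

    # Optionally see which directories under the mount consume the space
    $ oc exec -n openshift-storage <backingstore-pod-name> -- sh -c 'du -sh /noobaa_storage/*'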

MOST_STORAGE_ISSUES

This error is reported by NooBaa when more than 90% of the BackingStore nodes are not OPTIMAL.
This error does not point to any specific cause. Refer to other more specific errors to debug the issue.

MANY_STORAGE_ISSUES

This error is reported by NooBaa when more than 50% and less than 90% of the BackingStore nodes are not OPTIMAL.
This error does not point to any specific cause. Refer to other more specific errors to debug the issue.

MANY_NODES_OFFLINE

This error is reported by NooBaa when more than 50% and less than 100% of the BackingStore nodes are OFFLINE.
This error just indicates that multiple nodes are in an offline state. This modeCode rarely occurs because MANY_STORAGE_ISSUES takes precedence over it while having similar triggering conditions.
Refer to other similar issues for debugging guides.

LOW_CAPACITY

This is not necessarily an error; rather, it is NooBaa reporting that the available storage is less than what NooBaa considers appropriate.

Pinpointing and resolving the issue
  • This modeCode occurs when:

    • The combined storage space of all the BackingStore nodes is less than 30GB.
    • There is free storage but NooBaa for some reason can only use <= 20% of the free storage space.

    Logs : Storage-related information can be seen in the logs of the BackingStore pods. It should be verified that the storage stats reported in the logs match the actual values.

    Resolution : Refer to NO_CAPACITY.

HIGH_DATA_ACTIVITY

In most cases, this is not an error. To understand the underlying cause of this modeCode, grep the noobaa-core logs for "data activity reason of" (see the sketch below).
It is expected to see this status when numVolumes was increased on the BackingStore.
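
A sketch of the log search, assuming the default namespace openshift-storage and the default core pod name noobaa-core-0.

    # Find the reason NooBaa gives for the data activity
    $ oc logs noobaa-core-0 --all-containers -n openshift-storage | grep "data activity reason of"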

IO_ERRORS

This modeCode error is reported when NooBaa concludes that it has some issues with the underlying storage which do not fall in the category of authentication failure, storage absent, etc. The underlying cause for different types of storage may differ in case of this error.

Pinpointing and resolving the issue
  • If the underlying storage is an Azure Container - This error will not be reported in the modeCode. It will be logged.

  • If the underlying storage is Google Cloud Storage - This error will not be reported in the modeCode. It will be logged.

  • If the underlying storage is S3-compatible storage like AWS S3/RGW/IBM COS - This error will not be reported in the modeCode. It will be logged.

  • If the underlying storage is PV Pool - This error will be reported when NooBaa either fails to write or read from the disk (first write is tested and then read is tested).

    Logs : Relevant logs can come from backingStore pods.

    Resolution :

    • Check the logs; they will report the exact IO error that NooBaa detected. Grep for "encountered unknown error in test_store_perf" (see the sketch after this list). Respond accordingly.
    • If the above error is not found or is not actionable - reach out to Red Hat Support as detailed above.
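
A sketch of the log search from the resolution above, assuming the default namespace openshift-storage.

    # The exact IO error hit during NooBaa's write/read self-test
    $ oc logs <backingstore-pod-name> -n openshift-storage | grep "encountered unknown error in test_store_perf"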

STORAGE_NOT_EXIST

This error is reported when the NooBaa system cannot reach an underlying storage system, either because the storage is damaged, unresponsive or deleted, or because the network is down. The storage system can be cloud storage (Azure Container, S3 bucket, etc.) or a PV pool (mounted filesystem).

Pinpointing and resolving the issue
  • If the underlying storage is an Azure Container - This error will be reported when the Azure client (via the Azure storage JS SDK) reports ContainerNotFound. This eliminates the probability of the network being down, unresponsive or damaged, leaving the possibilities that the container was deleted or that the requesting credentials do not have sufficient permissions to access it (see the commands sketched at the end of this section).

    Logs : Relevant logs can come from noobaa-core pod.

    Resolution :

    • Check if the underlying container indeed exists. If it doesn’t, create the container with the same name given to NooBaa.
    • If the container exists then check if the credentials have enough privileges to access the container. If it does not then use Azure IAM to fix the issue.
    • If none of the above - reach out to Red Hat Support as detailed above.
  • If the underlying storage is Google Cloud Storage - The error is reported when the Google Cloud Storage wrapper (a thin NooBaa wrapper on top of Google's JS SDK) reports the request status to be 404. This indicates that the bucket does not exist.

    Logs : Relevant logs can come from noobaa-core pod

    Resolution :

    • Check if the underlying bucket indeed exists. If it doesn’t, create the bucket with the same name given to NooBaa.
    • If none of the above - reach out to Red Hat Support as detailed above.
  • If the underlying storage is S3-compatible storage like AWS S3/RGW/IBM COS - This error will be reported when the AWS S3 client (via the AWS JS SDK) reports NoSuchBucket. This eliminates the probability of the network being down, unresponsive or damaged, leaving the possibilities that the bucket was deleted or that the requesting credentials do not have sufficient permissions to access it.

    Logs : Relevant logs can come from noobaa-core pod.

    Resolution :

    • Check if the underlying bucket indeed exists. If it doesn’t, create the bucket with the same name given to NooBaa.
    • If the bucket exists then check if the credentials have enough privileges to access the bucket. If they do not, use a Bucket Policy or ACL (or whatever else is supported).
    • If none of the above - reach out to Red Hat Support as detailed above.
  • If the underlying storage is PV Pool - This error is reported when the root path does not exist on the agent handling the request.
    What is a “root path”? Root path is usually /noobaa_storage/<pv-pool-backingstore-pod-name>.

    Logs : Relevant logs can come from noobaa-core pod and noobaa-backingstore pod.

    Resolution :

    • This issue can arise from faults on the underlying file system. The FS might be corrupted or the mount might have failed. One way to confirm this is to rely on tools like fsck. This is not a NooBaa issue.
    • Although unlikely, the issue might arise from within NooBaa, for example when NooBaa fails to create the appropriate directories on the FS. The most probable cause would be a permissions issue. Reach out to Red Hat Support as detailed above.
    • If none of the above - reach out to Red Hat Support as detailed above.
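
The following sketch shows ways to confirm, independently of NooBaa, that the target storage actually exists. The cloud CLI tools (aws, az, gsutil) and their credentials are assumptions and are not part of NooBaa; the PV pool check uses the root path described above and assumes the default namespace openshift-storage.

    # S3-compatible (AWS S3/RGW/IBM COS): a 404 means the bucket is missing, a 403 means
    # the credentials cannot access it; add --endpoint-url for RGW/IBM COS
    $ aws s3api head-bucket --bucket <target-bucket>

    # Azure: check whether the container exists (storage account credentials must be configured)
    $ az storage container exists --account-name <storage-account> --name <container>

    # Google Cloud Storage: confirm the bucket exists and is reachable
    $ gsutil ls -b gs://<target-bucket>

    # PV pool: confirm the root path exists inside the BackingStore pod
    $ oc exec -n openshift-storage <backingstore-pod-name> -- \
        ls -ld /noobaa_storage/<backingstore-pod-name>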

AUTH_FAILED

This error is reported when the NooBaa system fails to access the underlying storage due to authentication or authorization failures.

Pinpointing and resolving the issue
  • If the underlying storage is an Azure Container - This error will be reported when the Azure client (via the Azure storage JS SDK) reports AuthenticationFailed, indicating that the Azure authentication failed.

    Logs : Relevant logs can come from noobaa-core pod.

    Resolution :

    • If the container exists then check if the credentials have enough privileges to access the container. If it does not then use Azure IAM to fix the issue.
    • If none of the above - reach out to Red Hat Support as detailed above.
  • If the underlying storage is Google Cloud Storage - The error is reported when the Google Cloud Storage wrapper (a thin NooBaa wrapper on top of Google's JS SDK) reports the request status to be 403. This indicates that access to the bucket was denied.

    Logs : Relevant logs can come from noobaa-core pod.

    Resolution :

    • Check whether the credentials used by NooBaa have sufficient privileges to access the bucket. If they do not, fix the permissions on the Google Cloud side.
    • If none of the above - reach out to Red Hat Support as detailed above.
  • If the underlying storage is S3-compatible storage like AWS S3/RGW/IBM COS - This error will be reported when the AWS S3 client (via the AWS JS SDK) reports NoSuchBucket. This eliminates the probability of the network being down, unresponsive or damaged, leaving the possibilities that the bucket was deleted or that the requesting credentials do not have sufficient permissions to access it.

    Logs : Relevant logs can come from noobaa-core pod.

    Resolution :

    • Check if the underlying bucket indeed exists. If it doesn’t, create the bucket with the same name given to NooBaa.
    • If the bucket exists then check if the credentials have enough privileges to access the bucket. If they do not, use a Bucket Policy or ACL (or whatever else is supported); see the sketch at the end of this section.
    • If none of the above - reach out to Red Hat Support as detailed above.
  • If the underlying storage is PV Pool - This error is reported when the root path does not exist on the agent handling the request.
    What is a “root path”? Root path is usually /noobaa_storage/<pv-pool-backingstore-pod-name>.

    Logs : Relevant logs can come from noobaa-core pod and noobaa-backingstore pod.

    Resolution :

    • This issue can arise from faults on the underlying file system. The FS might be corrupted or the mount might have failed. One way to confirm this is to rely on tools like fsck. This is not a NooBaa issue.
    • Although unlikely, the issue might arise from within NooBaa, for example when NooBaa fails to create the appropriate directories on the FS. The most probable cause would be a permissions issue - reach out to Red Hat Support as detailed above.
    • If none of the above - reach out to Red Hat Support as detailed above.
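
A sketch of a credentials check, assuming the default namespace openshift-storage; the secret name comes from the BackingStore spec, and the aws CLI is an assumption used only to test the credentials outside of NooBaa.

    # Find and inspect the secret the BackingStore actually references
    $ oc get backingstore <backingstore-name> -n openshift-storage -o yaml | grep -A3 -i secret
    $ oc get secret <backingstore-secret-name> -n openshift-storage -o yaml

    # For S3-compatible storage, test the same credentials directly:
    # 403 points at permissions, 404 at a missing bucket
    $ AWS_ACCESS_KEY_ID=<key> AWS_SECRET_ACCESS_KEY=<secret> \
        aws s3api head-bucket --bucket <target-bucket>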

OPTIMAL

Not an error. NooBaa deems everything to be OK. Note that this doesn't necessarily mean that everything is indeed OK; it is merely an indication that NooBaa can perform the actions it cares about for the time being.

In the event of BackingStore deterioration, it is worth investigating the logs even when the mode was reported as OPTIMAL.

For example, with 3 storage nodes, NooBaa may not report an error when the first node goes down but start reporting issues when the second goes down, even though the outages of the first and second node might be related.

PHASE

In this section we will explain the various phases.

REJECTED

This phase indicates that the operator has put the BackingStore in a rejected state due to one of the following modeCodes it encountered:

HAS_NO_NODES
ALL_NODES_OFFLINE
NO_CAPACITY
IO_ERRORS
STORAGE_NOT_EXIST
AUTH_FAILED

VERIFYING

This phase indicates that the NooBaa operator is verifying the created BackingStore. Multiple kinds of validations are performed like:

  • Validity of the secret name provided in the BackingStore (for a cloud-backed store).
  • Validity of the target bucket provided in the BackingStore (for a cloud-backed store).
  • Validity of the PV pool - size (should be more than 16GB), numVolumes ([1,20]), name length (< 43).
  • Validity of the AWS signature version. Only v2, v4 and empty (none) are allowed.

If the BackingStore is stuck in this phase and not making any progress, more details can be found both in the resource status and the logs (see the sketch below).
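
A sketch of where to look, assuming the default namespace openshift-storage and the default operator deployment name noobaa-operator.

    # Validation failures usually show up in the BackingStore conditions and events
    $ oc describe backingstore <backingstore-name> -n openshift-storage

    # ...and in the operator log
    $ oc logs deployment/noobaa-operator -n openshift-storage | grep -i <backingstore-name>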

CONNECTING

This phase indicates that the NooBaa operator is establishing a connection to the NooBaa core. It also prepares the parameters for creating or updating the resource in this phase.
Multiple steps are performed in this phase which can potentially fail causing the BackingStore to be stuck in this phase.

Potential failures :

  • The NooBaa operator failing to communicate with NooBaa core:

    • Check the operator log to determine the cause of the failure (see the log commands sketched after this list).
    • Check the core pod to see whether NooBaa core is running without issues and all its processes are running and not crashing.
    • Check if the network connectivity works between the 2 pods.
  • NooBaa failing to invoke RPC with NooBaa core:

    • Check both operator and core logs to debug.
  • An attempt to decrease spec.numVolumes will cause this error as well:

    • Check operator logs to confirm this. Core logs will not indicate any issue.
  • Updating the resource from one type to another is not allowed and will cause an error:

    • Check operator logs to confirm this. Core logs will not indicate any issue.
  • If a cloud resource is used as the underlying storage and the resource URL fails to parse (i.e. a malformed URL), this error will show up.

    • Check operator logs to confirm this. Core logs will not indicate any issue.
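
A sketch of the log checks listed above, assuming the default namespace openshift-storage, the default operator deployment name noobaa-operator and the default core pod name noobaa-core-0.

    # Operator-side view of the failure
    $ oc logs deployment/noobaa-operator -n openshift-storage

    # Confirm the core pod is healthy (Running, no restarts), then check its logs
    $ oc get pods -n openshift-storage | grep noobaa-core
    $ oc logs noobaa-core-0 --all-containers -n openshift-storage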

CREATING

This phase indicates that the NooBaa operator is requesting the resource creation from the NooBaa core. This phase is, to some extent, a 1:1 mapping of core's INITIALIZING.
Multiple steps are performed in this phase which can potentially fail causing the BackingStore to be stuck in this phase.

Potential failures :

  • The NooBaa operator failing to communicate with NooBaa core:

    • Check the operator log to determine the cause of the failure.
    • Check the core pod to see whether NooBaa core is running without issues and all its processes are running and not crashing.
    • Check if the network connectivity works between the 2 pods.
  • NooBaa failing to invoke RPC with NooBaa core:

    • Check both operator and core logs to debug.
  • NooBaa core rejecting the connection parameters (could be due to the parameters being invalid - core cannot use the credentials to control the resources):

    • Check both operator and core logs to debug the cause.

READY

This phase indicates that the BackingStore is ready to be used. It does not necessarily mean that the store is completely healthy. The following modeCodes can be present on the BackingStore even though it is marked Ready:

DELETING
SCALING
MOST_NODES_ISSUES
MANY_NODES_ISSUES
MOST_STORAGE_ISSUES
MANY_STORAGE_ISSUES
MANY_NODES_OFFLINE
LOW_CAPACITY
OPTIMAL

DELETING

This phase indicates that the BackingStore was marked for deletion and the NooBaa operator is working to get it deleted.

If the BackingStore is stuck on deletion for a long period of time then it could be due to following reasons:

  • The BackingStore had a lot of items in it. In some older versions of NooBaa, deletion of a BackingStore can take a very long time. “A lot of items” roughly means > 500k objects (in the entire NooBaa system, not just the target BackingStore).

  • There are some buckets associated with the target BackingStore:

    • Check operator logs to debug.
  • Core refusing to delete the connection parameters:

    • Check operator logs to confirm this issue and then review core logs to identify the underlying cause (see the commands sketched below).
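
A sketch of the checks above, assuming the default namespace openshift-storage; looking for references in BucketClasses and ObjectBucketClaims is a heuristic for "buckets still associated with the BackingStore".

    # A BucketClass still referencing the BackingStore can keep it from being deleted
    $ oc get bucketclass -n openshift-storage -o yaml | grep -B5 <backingstore-name>

    # ObjectBucketClaims that may still be using buckets placed on it
    $ oc get obc -A

    # What the operator reports while trying to delete it
    $ oc logs deployment/noobaa-operator -n openshift-storage | grep -i <backingstore-name>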
