RHACS Central DB repeatedly enters recovery mode after upgrade, applying redo log

Solution Verified - Updated -

Environment

Red Hat Advanced Cluster Security for Kubernetes (RHACS) 4.x, upgraded from 4.0 or earlier

Issue

After an RHACS upgrade, the central-db persistent volume (PV) became full. After the PV capacity was increased, Central started logging errors and warnings, and the Central database restarted repeatedly. Log excerpt:

2024-10-23 19:56:38.457 UTC [281] LOG: PID 134 in cancel request did not match any process
2024-10-23 19:56:38.458 UTC [282] LOG: PID 133 in cancel request did not match any process
2024-10-23 19:56:38.461 UTC [285] FATAL: the database system is in recovery mode
2024-10-23 19:56:38.462 UTC [284] FATAL: the database system is in recovery mode
2024-10-23 19:56:38.538 UTC [286] FATAL: the database system is in recovery mode
2024-10-23 19:56:38.541 UTC [287] FATAL: the database system is in recovery mode
2024-10-23 19:56:38.646 UTC [274] LOG: database system was not properly shut down; automatic recovery in progress
2024-10-23 19:56:38.648 UTC [274] LOG: redo starts at 3A/BCCF3898
2024-10-23 19:56:38.653 UTC [288] FATAL: the database system is in recovery mode
2024-10-23 19:56:38.653 UTC [291] FATAL: the database system is in recovery mode
2024-10-23 19:56:38.654 UTC [289] FATAL: the database system is in recovery mode
2024-10-23 19:56:38.655 UTC [293] LOG: could not accept SSL connection: EOF detected
2024-10-23 19:56:38.655 UTC [292] LOG: could not accept SSL connection: Connection reset by peer
2024-10-23 19:56:38.655 UTC [294] LOG: could not accept SSL connection: EOF detected
2024-10-23 19:56:38.655 UTC [290] LOG: could not accept SSL connection: Connection reset by peer
2024-10-23 19:56:38.656 UTC [296] LOG: could not accept SSL connection: EOF detected
2024-10-23 19:56:38.657 UTC [297] LOG: could not accept SSL connection: EOF detected
2024-10-23 19:56:38.656 UTC [295] LOG: could not accept SSL connection: EOF detected
2024-10-23 19:56:38.744 UTC [298] FATAL: the database system is in recovery mode
2024-10-23 19:56:38.838 UTC [299] FATAL: the database system is in recovery mode
2024-10-23 19:56:38.876 UTC [274] LOG: invalid record length at 3A/BDEDE9E0: wanted 24, got 0
2024-10-23 19:56:38.877 UTC [274] LOG: redo done at 3A/BDEDE9B8
2024-10-23 19:56:39.052 UTC [14] LOG: database system is ready to accept connections
2024-10-23 19:58:09.316 UTC [390] LOG: no left sibling (concurrent deletion?) of block 5 in "pg_toast_16960_index"
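
To gauge how often the database is cycling through recovery, an excerpt like the one above can be summarized programmatically. The following is a minimal sketch, not an official tool: the `summarize` function and its counter names are hypothetical, and it assumes the log line layout shown in this article (timestamp, `UTC`, backend PID in brackets, severity, message).

```python
import re

# Timestamp, backend PID, severity and message of one PostgreSQL log line.
# Assumes the line layout shown in the excerpt above.
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) UTC "
    r"\[(?P<pid>\d+)\] (?P<level>[A-Z]+):\s+(?P<msg>.*)$"
)


def summarize(lines):
    """Count recovery starts, finished redo passes, and rejected connections."""
    summary = {"recovery_started": 0, "recovery_done": 0, "rejected": 0}
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # skip anything that is not a PostgreSQL log line
        msg = match.group("msg")
        if "automatic recovery in progress" in msg:
            summary["recovery_started"] += 1
        elif msg.startswith("redo done at"):
            summary["recovery_done"] += 1
        elif match.group("level") == "FATAL" and "in recovery mode" in msg:
            summary["rejected"] += 1
    return summary


sample = [
    "2024-10-23 19:56:38.461 UTC [285] FATAL: the database system is in recovery mode",
    "2024-10-23 19:56:38.646 UTC [274] LOG: database system was not properly shut down; automatic recovery in progress",
    "2024-10-23 19:56:38.877 UTC [274] LOG: redo done at 3A/BDEDE9B8",
    "2024-10-23 19:56:39.052 UTC [14] LOG: database system is ready to accept connections",
]
print(summarize(sample))
```

A `recovery_started` count that keeps growing over time suggests the database is crashing and replaying the redo log repeatedly rather than completing a single recovery.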

Resolution

Scale down Central to give central-db time to recover. Monitor memory usage while it recovers, because the pod might need more memory to complete recovery, and increase the memory if necessary. Adding more resources to the scanner v4 db pod will also speed up recovery and prevent it from failing.
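
The steps above could be carried out with `oc` commands along the following lines. This is a sketch under stated assumptions rather than a verified procedure: it assumes RHACS runs in the `stackrox` namespace with deployments named `central` and `central-db`, and the `recovery_commands` helper is hypothetical. It only builds and prints the commands so they can be reviewed before being run.

```python
import shlex

NAMESPACE = "stackrox"  # assumption: namespace where RHACS Central is installed


def recovery_commands(namespace=NAMESPACE, central_replicas=0):
    """Return the oc invocations as argument lists without executing them."""
    return [
        # Stop Central so it no longer opens connections against central-db
        ["oc", "-n", namespace, "scale", "deployment/central",
         f"--replicas={central_replicas}"],
        # Check pod memory consumption while the database recovers
        # (requires cluster metrics to be available)
        ["oc", "-n", namespace, "adm", "top", "pods"],
        # Inspect the database log for "ready to accept connections"
        ["oc", "-n", namespace, "logs", "deployment/central-db", "--tail=50"],
    ]


for cmd in recovery_commands():
    print(shlex.join(cmd))
```

Once central-db reports that it is ready to accept connections and stays up, calling `recovery_commands(central_replicas=1)` yields the command to scale Central back up.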

Root Cause

To support backwards compatibility, an upgrade from RHACS 4.0 or earlier makes a copy of the database if there is enough space. Most likely there was enough space, so a copy called central_previous was made.
It is not confirmed, but we expect that the copy brought the volume close to its capacity, and that the additional data written as part of the upgrade to 4.1 or later then filled it.
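
The space reasoning can be made concrete with a small calculation. This is an illustrative sketch with made-up numbers, not measurements from an affected cluster; `usage_after_copy` is a hypothetical helper, and the database size would have to come from the cluster itself (for example from PostgreSQL's `pg_database_size` function).

```python
def usage_after_copy(capacity_gib, db_size_gib, other_gib=0.0):
    """Fraction of the volume in use once a full copy of the database exists."""
    if capacity_gib <= 0:
        raise ValueError("capacity_gib must be positive")
    # The original database plus its copy, plus anything else on the volume
    return (2 * db_size_gib + other_gib) / capacity_gib


# Hypothetical numbers: a 100 GiB PV holding a 45 GiB database
fraction = usage_after_copy(capacity_gib=100, db_size_gib=45)
headroom_gib = 100 - 2 * 45
print(f"used after copy: {fraction:.0%}, headroom: {headroom_gib} GiB")
```

In this hypothetical case the copy fits, but it leaves only 10 GiB for the data written during the rest of the upgrade, which matches the suspected failure pattern.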

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
