CephFS: System Deadlock during File Move Operations on Shared Volumes
Issue
When applications move files (mv command) within a shared directory on a CephFS-mounted Persistent Volume (PV), the mv command can hang and never return.
This triggers a system-wide deadlock, causing the affected directory and its contents to become inaccessible from all applications, pods, and even the host nodes.
Other file operations like ls -l on the affected path will also hang.
The issue is most likely to appear with workloads that perform highly concurrent, automated file operations, particularly Extract-Transform-Load (ETL) processes that frequently move or list files. Telco environments are likely to hit this issue.
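The symptom described above (mv or ls -l never returning on the affected path) can be checked without blocking the calling shell or script. The following is a minimal diagnostic sketch, not part of the original article and not a fix: it runs a command with a deadline and reports whether it finished. The function names command_finishes and path_responds, the 5-second default, and the example mount path are all hypothetical choices for illustration.

```python
import subprocess


def command_finishes(cmd, timeout_s=5.0):
    """Return True if cmd exits within timeout_s seconds, False if it is still running.

    The exit status is ignored on purpose: a command that fails quickly
    (for example, "No such file or directory") still shows that the
    filesystem answered, which is all this probe checks.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best-effort kill only. A process blocked in uninterruptible I/O
        # may not exit even after SIGKILL, so we deliberately do not wait
        # for it again here; waiting could hang this probe as well.
        proc.kill()
        return False
    return True


def path_responds(path, timeout_s=5.0):
    """Probe a path with `ls -l`, the same operation the article reports as hanging."""
    return command_finishes(["ls", "-l", path], timeout_s)


if __name__ == "__main__":
    # Hypothetical mount point; replace with the directory on the CephFS-backed PV.
    target = "/mnt/shared"
    state = "responsive" if path_responds(target) else "NOT responding (possible hang)"
    print(f"{target}: {state}")
```

A False result only indicates that the probe did not complete in time. It does not by itself confirm the deadlock described here, since a slow or overloaded filesystem can also exceed the timeout.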
Environment
- Red Hat OpenShift Container Platform 4.14
- Red Hat OpenShift Container Platform 4.16
- Red Hat OpenShift Data Foundation 4.14
- Red Hat OpenShift Data Foundation 4.16