CephFS: System Deadlock during File Move Operations on Shared Volumes

Solution Verified

Issue

When an application moves files with the mv command within a shared directory on a CephFS-mounted Persistent Volume (PV), the mv command can hang and never return.

This triggers a system-wide deadlock, causing the affected directory and its contents to become inaccessible from all applications, pods, and even the host nodes.

Other file operations like ls -l on the affected path will also hang.
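
To confirm the symptom without blocking the calling process, a listing can be run in a child process with a deadline. The following is a minimal diagnostic sketch, not a fix and not part of the verified solution. The function name probe_path, the timeout values, and the example mount path are assumptions made for illustration.

```python
import subprocess
import time


def probe_path(path, timeout_s=5.0, poll_s=0.1):
    """Run `ls -l` on path in a child process and report how it ended.

    Returns "ok" (exit status 0), "error" (non-zero exit status),
    or "hung" (the listing did not return within timeout_s seconds).
    """
    proc = subprocess.Popen(
        ["ls", "-l", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        rc = proc.poll()
        if rc is not None:
            return "ok" if rc == 0 else "error"
        time.sleep(poll_s)
    # Best-effort kill. A process blocked in uninterruptible I/O may not
    # exit until the I/O completes, so we deliberately do not wait() on
    # it here; waiting could block the caller as well.
    proc.kill()
    return "hung"


if __name__ == "__main__":
    # Hypothetical mount point; substitute the path of the affected directory.
    print(probe_path("/mnt/cephfs/shared"))
```

A "hung" result on the affected directory, while other paths return "ok", is consistent with the behavior described above. Polling with a deadline is used instead of a blocking wait because a blocking wait on a child stuck in the kernel could itself never return.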

The issue is most likely to appear with workloads that perform highly concurrent, automated file operations, particularly Extract-Transform-Load (ETL) processes that frequently move or list files. Telco environments are likely to hit this issue.

Environment

  • Red Hat OpenShift Container Platform 4.14
  • Red Hat OpenShift Container Platform 4.16
  • Red Hat OpenShift Data Foundation 4.14
  • Red Hat OpenShift Data Foundation 4.16
