
OSD stuck in booting state with messenger log error: Operation not permitted


After rebooting the k8s nodes, the OSDs didn't rejoin the cluster and logged authentication errors. I added them to the auth list and that error disappeared. Now the OSDs join the cluster, but they don't show as up and the PGs don't show up in ceph -s.

I have spent 2 weeks on this issue, but I still don't understand why the OSDs don't show as up. With ms subsystem logging set to 20, the log shows an Operation not permitted error on the OSD >> MGR connection:

4038023360,v1:10.244.135.63:6801/4038023360] conn(0x55c2deb3a000 0x55c2dd0ee000 crc :-1 s=READY pgs=2984 cs=0 l=1 rev1=1 rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1
((1) Operation not permitted)

and checking OSD status directly from its daemon:

[root@rook-ceph-osd-3-79b4cddd7f-52kwm ceph]# ceph daemon osd.3 status
{
    "cluster_fsid": "6078f23a-41af-4f36-aa54-ddc67de63c18",
    "osd_fsid": "b89817d2-3752-4f20-a916-b992990dee8d",
    "whoami": 3,
    "state": "booting",
    "oldest_map": 9094,
    "newest_map": 9677,
    "num_pgs": 33
}
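
For checking several OSDs at once, the status output above can be read programmatically. The following is only a minimal sketch: the function name `summarize_osd_status` is hypothetical, and it assumes that a healthy, joined OSD reports the state "active" rather than "booting". The sample JSON is copied from the output above.

```python
import json


def summarize_osd_status(raw: str) -> dict:
    """Summarize the JSON printed by `ceph daemon osd.N status`."""
    status = json.loads(raw)
    return {
        "osd": status["whoami"],
        "state": status["state"],
        # Assumption: an OSD that has fully joined reports "active".
        "is_active": status["state"] == "active",
        # Number of osdmap epochs the daemon currently holds.
        "map_span": status["newest_map"] - status["oldest_map"],
        "num_pgs": status["num_pgs"],
    }


# Sample taken from the `ceph daemon osd.3 status` output in this thread.
SAMPLE = """
{
    "cluster_fsid": "6078f23a-41af-4f36-aa54-ddc67de63c18",
    "osd_fsid": "b89817d2-3752-4f20-a916-b992990dee8d",
    "whoami": 3,
    "state": "booting",
    "oldest_map": 9094,
    "newest_map": 9677,
    "num_pgs": 33
}
"""

if __name__ == "__main__":
    print(summarize_osd_status(SAMPLE))
```

A state that stays at "booting" while `newest_map` keeps advancing would suggest the daemon can reach the monitors but is never marked up.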

What I tried so far:

  • Check dmesg | scsi - seems fine
  • Check network - exec into OSD and ping mgr and mons, OK.
  • Check iostat -x, util is low
  • Upgraded rook -> 1.9, ceph -> 17

What I'll try (unfortunately... :( ):

  • ZAP OSD disk, clear all ceph cluster components, and Redeploy

Any clues on how to fix this issue would be much appreciated.


Don't zap it! That will drop whatever was on the disk.

My guess would be something to do with permissions at the UNIX level, somewhere. Probably within the container, but it could be elsewhere.
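
One way to test that guess is to look for files in the OSD data directory that are not owned by the user the daemon runs as. This is a rough sketch, not a confirmed diagnosis: the helper name `find_unexpected_owners` is hypothetical, and both the path `/var/lib/ceph/osd/ceph-3` and the UID 167 are assumptions about a typical Ceph container layout that should be checked against the actual pod.

```python
import os


def find_unexpected_owners(root: str, expected_uid: int) -> list:
    """Return paths under root (including root) not owned by expected_uid."""
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, name) for name in dirnames + filenames)
    # lstat inspects symlinks themselves instead of following them.
    return [p for p in paths if os.lstat(p).st_uid != expected_uid]


if __name__ == "__main__":
    # Assumptions: default OSD data path and the UID commonly used for the
    # "ceph" user; neither is confirmed by this thread.
    osd_dir = "/var/lib/ceph/osd/ceph-3"
    ceph_uid = 167
    if os.path.isdir(osd_dir):
        for path in find_unexpected_owners(osd_dir, ceph_uid):
            print("unexpected owner:", path)
    else:
        print("not found:", osd_dir)
```

An empty result only rules out ownership on those files; it says nothing about the cephx keys or the container's security context.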

If you can re-create the data stored in your ceph cluster, then I suppose zap & restore might work, but the problem may still persist if the container does.

Ahmad Ahmadi:
How can I re-create the data in the cluster? And, how can I inspect the issue at UNIX level? I don't have any clue to follow.
Charles Bedford:
By default Ceph keeps 3 copies of all the data it stores, so your data is probably still on other OSDs as redundant copies. If that's the case, it should show up in some of the lower-level commands for looking at the PGs that are having problems.