
Ceph's failure to automount after network failure

Tio

I'm having some problems with mounting a Ceph cluster on Debian machines, and I don't know whether I'm doing something wrong, whether it's a version problem, or something else.

I'm using the Ceph cluster from OVH and mounting it via fstab on around 20 VMs (2 bare-metal servers, each running a Proxmox instance).

The problem appears when there is a network failure between the Ceph cluster and our bare-metal servers: from that point on, the Ceph mounts are completely unusable and can only be brought back by rebooting the server. Versions in use:

  • Ceph-Cluster: 14.2.16
  • Debian 10 Buster
  • Ceph installed on Debian: 14.2.21 Nautilus (stable)

Ceph configuration:

[global]
fsid = xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
mon_host = XX.XX.XXX.XX XX.XX.XXX.XX XX.XX.XXX.XX

fstab configuration:

:/     /mnt/ceph     ceph     name=ceph_user,_netdev,noatime        0     0

Running mount:

xx.xx.xx.xx:6789,xx.xx.xx.xx:6789,xx.xx.xx.xx:6789:/ on /mnt/ceph type ceph (rw,noatime,name=ceph_user,secret=<hidden>,acl)

Edit: it just happened again now, so I'm adding some more info.

When this happens, this is what appears when I run `ls` on /mnt/:

d????????? ? ?    ?       ?            ? ceph

If I try `mount -a`:

mount error 16 = Device or resource busy

Log from /var/log/messages:

Jul 23 21:48:27 prod7-2 kernel: [28344.425057] libceph: mon2 xx.xx.xxx.xx:6789 session lost, hunting for new mon
Jul 23 21:48:27 prod7-2 kernel: [28344.427340] libceph: mon1 xx.xx.xxx.xx:6789 session established
Jul 23 21:48:54 prod7-2 kernel: [28371.560529] ceph: mds0 caps stale
Jul 23 21:52:53 prod7-2 kernel: [28610.660328] ceph: mds0 hung
Jul 23 21:53:25 prod7-2 kernel: [28642.659775] libceph: mon1 xx.xx.xxx.xx:6789 session lost, hunting for new mon
Jul 23 21:53:25 prod7-2 kernel: [28642.677667] libceph: mon0 xx.xx.xxx.xx:6789 session established
Jul 23 21:53:39 prod7-2 kernel: [28656.231175] libceph: mds0 xx.xx.xxx.xx:6801 socket closed (con state OPEN)
Jul 23 21:53:40 prod7-2 kernel: [28657.459175] libceph: reset on mds0
Jul 23 21:53:40 prod7-2 kernel: [28657.459179] ceph: mds0 closed our session
Jul 23 21:53:40 prod7-2 kernel: [28657.459180] ceph: mds0 reconnect start
Jul 23 21:53:40 prod7-2 kernel: [28657.498027] ceph: mds0 reconnect denied
Jul 23 21:53:40 prod7-2 kernel: [28657.513419] libceph: mds0 xx.xx.xxx.xx:6801 socket closed (con state NEGOTIATING)
Jul 23 21:53:41 prod7-2 kernel: [28658.454421] ceph: mds0 rejected session
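
From what I can tell, the "mds0 reconnect denied" and "mds0 rejected session" lines mean the MDS evicted (blacklisted) this client after the outage. Assuming admin access on the cluster side (which may not apply to a cluster provided by OVH), I believe checking it would look something like this; on Nautilus the commands are still called blacklist rather than blocklist:

# list clients the cluster has currently blacklisted
ceph osd blacklist ls
# list the sessions the active MDS knows about
ceph tell mds.0 client ls
# remove a blacklist entry (address taken from the ls output) so the client can reconnect
ceph osd blacklist rm <addr:port/nonce>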

Am I doing something wrong? Thanks

Romeo Ninov: Is the colon `:` in the `fstab` entry a mistake, or is that the real record?
Tio: It's the real record, it's exactly like that.
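As far as I understand, with nothing before the colon mount.ceph fills in the monitor addresses from mon_host in ceph.conf; the running mount output above shows them expanded. Spelled out, the equivalent entry would be something like this (addresses masked):

xx.xx.xxx.xx:6789,xx.xx.xxx.xx:6789,xx.xx.xxx.xx:6789:/   /mnt/ceph   ceph   name=ceph_user,_netdev,noatime   0   0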
eblock: Unfortunately, there are very few options to bring a stale mount back; usually a reboot really is the best option. You could try `umount -l`, though. Since Ceph is a network storage system, a stale mount won't be the only issue you can expect if you have regular network outages; they can easily lead to corrupt PGs and data loss. I would recommend finding out the root cause of the network issues.
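If you want to try it before rebooting, the usual sequence is roughly the sketch below; `umount -f` can hang or fail on a dead CephFS mount, so a reboot may still end up being necessary:

# try a normal unmount first, fall back to a lazy unmount if the mount is stale
umount /mnt/ceph || umount -l /mnt/ceph
# or force it (may block while the MDS session is gone)
umount -f /mnt/ceph
# then remount from fstab
mount /mnt/ceph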
Tio: @eblock The problem seems to be something on OVH's infrastructure. I guess they do some maintenance or something like that, and the connection simply goes down for some time, which prevents the reconnection.