Is mdadm over InfiniBand a bad idea?
What is the actual trick to getting reasonably performing storage that survives a single machine failure?
We have been running Ceph for a few years now, and it's great for easy (ish) redundancy, but its performance is dismal. Local NVMe drives easily reach 3 GB/s, while our Ceph cluster manages about 100 MB/s over a 50 Gb/s network while burning 64-core CPUs. I just don't think I made the right choice here for the performance we expect.
InfiniBand seems extremely cost-effective in comparison: used previous-generation 100 Gb cards cost less than a 50 Gb Ethernet card. And exposing a local disk over InfiniBand to another host with iSER looks both easy and fast.
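For context, this is roughly the kind of export I mean; the IQN, IP, and device names are placeholders, and the exact way to flip the iSER flag may differ between targetcli versions:

```bash
# --- target node: export a local NVMe namespace via LIO ---
targetcli /backstores/block create name=nvme0 dev=/dev/nvme0n1
targetcli /iscsi create iqn.2024-01.com.example:nvme0
targetcli /iscsi/iqn.2024-01.com.example:nvme0/tpg1/luns create /backstores/block/nvme0
# (a default 0.0.0.0:3260 portal may already exist and need removing first)
targetcli /iscsi/iqn.2024-01.com.example:nvme0/tpg1/portals create 192.168.100.10
# switch the portal from plain iSCSI/TCP to iSER (RDMA) transport
targetcli /iscsi/iqn.2024-01.com.example:nvme0/tpg1/portals/192.168.100.10:3260 enable_iser boolean=true

# --- initiator node: discover and log in over iSER ---
iscsiadm -m discovery -t sendtargets -p 192.168.100.10
iscsiadm -m node -T iqn.2024-01.com.example:nvme0 -p 192.168.100.10 \
         --op update -n iface.transport_name -v iser
iscsiadm -m node -T iqn.2024-01.com.example:nvme0 -p 192.168.100.10 --login
```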
Now the naive solution to making this survive a host failure would be mdraid over multiple remote targets, roughly as sketched below. But I haven't found many people actually doing that, and this answer suggests it might even be a bad idea, since mdraid has no notion of an underlying device being remote. This comment also makes it clear that such a setup will likely run into edge-case bugs.
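Concretely, the naive setup I'm picturing is something like this (the device names are just whatever the two remote iSER LUNs enumerate as locally, so purely illustrative):

```bash
# Mirror two iSER-attached LUNs from two different storage nodes.
mdadm --create /dev/md0 --level=1 --raid-devices=2 \
      --bitmap=internal /dev/sdb /dev/sdc

# The write-intent bitmap should keep the resync after a short node outage
# incremental instead of rewriting the whole mirror.
mkfs.xfs /dev/md0
mount /dev/md0 /mnt/shared

# After a failed node comes back and its LUN reappears, re-adding is manual:
mdadm /dev/md0 --re-add /dev/sdc
```

That last step is exactly where I worry the "md doesn't know the device is remote" problem bites: a dropped iSER session just looks like a failed local disk, and nothing brings the member back automatically.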
But how else would you build an InfiniBand storage network so that it recovers from a node failure unattended?