Score:14

Linux - RAM Disk as part of a Mirrored Logical Volume

US flag

We have a server with 64GB of total RAM; applications typically use at most 30GB of it. One of those applications deals with a lot of flat files, and we're having throughput issues, namely waiting on disk I/O. While exploring possible solutions, the idea of a RAM disk came up. The problem I have with a RAM disk is its inherent volatility.

I've found separate documentation on RAM disks, RAID 1 configuration, and mirrored logical volumes for grouping physical disks, but I can't find anything that says whether either of these replication solutions can be used with a RAM disk. More importantly, since the idea is to have the RAM disk available for read/write and to have the physical disk "shadowing" it, catching up with writes, we would want the RAM disk to be the "primary" disk for all reads and writes.

To note, we would like to avoid merely letting the OS cache the files in RAM, but if we can get the same performance as a stand-alone RAM disk, that could work. We initially ruled this out because certain files often go unaccessed for long periods of time, yet still need the read/write speed on demand.

vidarlo avatar
ar flag
Why do you think this would work better than letting the OS cache whatever it can in RAM?
forest avatar
cn flag
@vidarlo OP pointed out that the native page cache will drop files that have not been accessed for long periods of time, which causes accesses after they are dropped to be too slow. This is why I pointed out vmtouch in my answer, since it can get around that limitation.
Criggie avatar
in flag
There's a half-memory in my head of a Linux filesystem that uses a cache disk, but the details escape me. I will check with a cow orker for a reminder, unless someone else writes an answer along these lines first... No, it's not Ceph, and not a network FS.
forest avatar
cn flag
@Criggie Are you thinking of bcachefs?
Criggie avatar
in flag
@forest yes - looks like it. https://bcachefs.org/ needs a full answer
forest avatar
cn flag
@Criggie Bcachefs works with a caching SSD, not just memory, so it may not be what OP needs (it won't be _as fast_ as memory, even if it is much faster than spinning rust).
de flag
Linux does not care. If it's a block device (which RAM disks are, with the right setup), it'll work. But you *will* lose files on reboot, whether intentional or accidental (such as a kernel panic or a new employee's mistake).
jcaron avatar
co flag
Of course, the real solution is quite probably to avoid "a lot of flat files".
Shayne Fitzgerald avatar
md
Believe me, I wish I had been in the room for the architecture discussion for this particular application. At least then I'd know who to address my swearing to!
Score:26
cn flag

To note, we would like to avoid merely letting the OS cache the files in RAM, but if we can get the same performance as a stand-alone RAM disk, that could work. We initially ruled this out because certain files often go unaccessed for long periods of time, yet still need the read/write speed on demand.

You could use vmtouch to solve your problem. It is a utility that allows you to pin certain files, or even entire directories and everything under them, in the page cache so they do not get evicted even if they are not accessed for long periods of time (which was your initial reason for not simply relying on the page cache). This requires at most the same amount of memory as your RAM disk, and in practice usually less. You'll still be using the page cache, but you'll get performance similar to using a RAM disk for everything (actually superior, since the MD driver will not be involved).
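A minimal sketch, assuming vmtouch is installed and the data lives under /srv/flatfiles (the path is only a placeholder):

vmtouch -v /srv/flatfiles     # report how much of the tree is currently resident in the page cache
vmtouch -t /srv/flatfiles     # touch every page once so the files become resident
vmtouch -dl /srv/flatfiles    # daemonize and mlock the pages so they are never evicted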

Score:13
us flag

This could be hacked together, but it is a bad idea and likely has multiple issues with reliability and maintainability.

I think a RAID1 of RAM disk and physical disk would be limited to the write performance of the physical disk, since part of RAID1's functionality is to ensure that both copies stay in sync.

For reads, there could be some benefit, because the MD driver can distribute reads across the devices.

Possible steps to create this:

  1. Create an empty file the size of the array you want to support.
  2. Use losetup to create a block device out of the file.
  3. Use mdadm to create the array with the newly created block device and corresponding hard disk partition.
  4. Create a filesystem on the new MD array.

I haven't tried this myself, so this is only a theoretical example of how it could be done; a sketch of the commands is below.
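A hedged sketch of those steps; the device names, sizes, and the tmpfs mount used to keep the backing file in RAM are assumptions, not something I have tested:

mount -t tmpfs -o size=31G tmpfs /mnt/ramback       # back the file with RAM rather than disk
truncate -s 30G /mnt/ramback/ramdisk.img            # 1. empty file the size of the array
losetup /dev/loop0 /mnt/ramback/ramdisk.img         # 2. expose the file as a block device
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/loop0 /dev/sdb1   # 3. mirror it with the disk partition
mkfs.ext4 /dev/md0                                  # 4. filesystem on the new MD array
mount /dev/md0 /srv/data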

P.Péter avatar
in flag
Also, use the --write-mostly flag on the physical disk. This will direct most reads to the RAM disk and thus free up the physical disk for more writes.
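For example, in the creation step above the flag would go just before the physical member (device names are illustrative):

mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/loop0 --write-mostly /dev/sdb1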
Score:8
mx flag

First off, a RAM disk is almost never the correct answer on Linux. Because it is a block device, every read has to go through the block layer, the filesystem, and the regular VFS layer, and the data ends up cached in RAM in addition to being stored in the RAM disk. This duplication of data, and the number of extra layers involved, are why tmpfs exists on Linux: instead of involving the block layer, a tmpfs filesystem stores data directly in the page cache, skipping all the extra complexity. It also auto-sizes based on the amount of data stored in it (instead of having the size defined up front), and it can even spill to swap space. If you think you need a RAM disk, then 99% of the time you should really be using tmpfs instead.
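For example, a size-capped tmpfs mount (mount point and size cap are only illustrative) is a one-liner:

mount -t tmpfs -o size=32G tmpfs /mnt/scratch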


Now, as far as actual solutions...

If all of your data actually fits in RAM, you’re much better off just pinning it all in RAM, either by using a tool like vmtouch, or by having the application mmap all the files and then call mlock on all the mapped regions.

If your data does not all fit in RAM, you have two realistic options:

  • Store the data compressed on disk, ideally using a filesystem that provides transparent compression, such as BTRFS, F2FS, or ZFS. Provided you have a reasonably fast CPU, this will usually reduce the time needed to read a large file, at the cost of a bit more CPU time. The improvement is roughly proportional to how well the data compresses, but in many cases it easily translates to a 30% or better improvement (see the mount example after this list).
  • Look into investing in faster storage: either enough to replace your existing storage outright, or some smaller amount that you can use with bcache to speed up your existing storage.
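As a sketch of the first option, transparent compression on BTRFS is just a mount option (device, mount point, and compression level are assumptions):

mount -o compress=zstd:3 /dev/sdb1 /srv/data
# or persistently via /etc/fstab:
# /dev/sdb1  /srv/data  btrfs  compress=zstd:3  0  0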
Score:7
in flag

I have done something like this using AWS ephemeral disks, which are very fast but do not survive a power off/on cycle.

We had a "seed-disk" which was a normal cheap EBS volume of GP2 (GP3 now) and it was in a RAID1 with the fast ephemeral disks

I created a bash script, run from rc.local, that parses the output of nvme list to determine whether an ephemeral disk is present and joins it to the RAID where appropriate.
In your case, something at startup would have to create the RAM disk and join it to the existing degraded array (a rough sketch follows the listing below).

PROD pathservice1.taws ~ $ nvme list
Node             SN                   Model        Namespace Usage                 Format           FW Rev
---------------- --- ----------------------------- --------- -------------------- ---------------- --------
/dev/nvme0n1     123 Amazon Elastic Block Store          1   128.85 GB / 128.85 GB    512   B +  0 B   1.0
/dev/nvme1n1     234 Amazon Elastic Block Store          1   107.37 GB / 107.37 GB    512   B +  0 B   1.0
/dev/nvme2n1     345 Amazon Elastic Block Store          1   2.20   TB /  2.20  TB    512   B +  0 B   1.0
/dev/nvme3n1     456 Amazon EC2 NVMe Instance Storage    1   900.00 GB / 900.00 GB    512   B +  0 B   0
/dev/nvme4n1     567 Amazon EC2 NVMe Instance Storage    1   900.00 GB / 900.00 GB    512   B +  0 B   0

The last two are ephemeral disks of 900G each.
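A rough sketch of the idea, not the original script; the array name, the match on "Instance Storage", and the brd parameters for the RAM-disk variant are assumptions:

EPHEMERAL=$(nvme list | awk '/Instance Storage/ {print $1; exit}')
if [ -n "$EPHEMERAL" ]; then
    mdadm /dev/md0 --add "$EPHEMERAL"    # re-join the fast disk to the degraded mirror
fi
# RAM-disk variant from the question: create the device first, then add it, e.g.
# modprobe brd rd_nr=1 rd_size=$((30*1024*1024))   # one 30 GiB /dev/ram0
# mdadm /dev/md0 --add /dev/ram0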

  • Use the "write-mostly" option on the EBS volume. It will still do reads if the fast disk is absent, or doesn't have those blocks yet. Once the fast disk is populated (or "warmed") then reads will happen there.

The good thing is that writes to the mdX device will persist through orderly reboots and poweroffs. It is possible that unexpected hard power-downs may cause writes to be lost.

So this is a poor substitute for a backup - you should still be doing backups using whatever method works for you.

sa flag
PSA: you might save costs by using AWS EBS st1 or sc1 as your seed disk - however it will take much longer to seed. Calculate whether that tradeoff makes sense for you.
Criggie avatar
in flag
@user253751 excellent point - rightsizing disks is a good idea. However the bill at work is 80% EC2 instances, with about 4% on EBS, and the rest on other amazon services. If I can reduce a single m5.24xlarge then that'll save more than all our EBS costs combined :-\
sa flag
oh you can possibly save *lots* of money by *not* getting those from "cloud" providers. AWS/Azure/Google are all terribly expensive and you pay for the buzzword...
Shayne Fitzgerald avatar
md
Not an AWS environment sadly, but definitely a fascinating solution. Thanks for sharing.
Score:7
ca flag

If you need persistence, a RAM disk is not the correct solution.

I strongly suggest investing in a pair of fast (read: enterprise-grade, with power-loss protection) NVMe disks to put in a classical RAID1 (mirrored) array.

Dai avatar
cn flag
_"a classical RAID1 array"_ - Hasn't btrfs/zfs/etc rendered traditional RAID obsolete?
shodanshok avatar
ca flag
@Dai as "classical" RAID1 I really meant a mirror setup, with your filesystem of choice.
Score:4
ng flag

If you have this much free RAM (enough to hold most of these files and their metadata), chances are they already mostly reside in the page cache, and your limiting factor is not reading them but writing them.

If this is the case, forcibly mirroring this volume in RAM will not bring you any additional performance.

In the case where other I/O activity constantly kicks your files out of RAM, locking this much RAM for your disk-like solution will probably hurt those other I/O processes.

sa flag
OP may want to know how to prioritize certain files in cache
Score:3
km flag

Memcached, Redis

You have basically described Memcached, and to some extent Redis. Both are good at caching; Redis has better support for persistence.

Note that you can only gain "all" the performance if these flat files amount to less than about 30GB in total (on your machine); otherwise some eviction mechanism has to be employed. Even so, if the application uses some files very frequently, a Redis/Memcached solution would improve performance.

These products are well supported by vendors, so you can use externally hosted Memcached/Redis servers to completely isolate your machine from the specifics of caching.

Shayne Fitzgerald avatar
md
I'd absolutely be using a database system given the opportunity; unfortunately, the software that chose to store its data in flat files is kind of a black box.
Score:0
at flag

The question is about RAM-disk-like speed combined with persistence. This is possible as long as one allows asynchronous writes (keeping the disk "catching up with writes").

As long as the application refrains from using sync or fsync, relying on the regular page cache is both faster and easier to configure than a RAM disk plus a mirrored volume configuration.

To keep the application running even when it writes large amounts of data, dirtying large amounts of memory, one needs to allow 32 GB of dirty memory in the case of this question. This keeps all disk writes in the kernel's flusher threads, away from the application process, and is configured by
sysctl vm.dirty_bytes=$((32*1024*1024*1024)) # 32 GB

(The default is sysctl vm.dirty_ratio=20, which allows no more than 20 % of the "available" memory to be dirty, throttling the application once this limit is reached, which happens long before 32 GB of memory are dirty.)
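To keep the setting across reboots, a sysctl drop-in file is the usual approach (the file name is arbitrary):

echo 'vm.dirty_bytes = 34359738368' > /etc/sysctl.d/90-dirty.conf   # 32 GB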

Because the application "deals with a lot of flat files", I suspect it has a linear read pattern, in which case explicitly prefetching the data would not be helpful. But if it has a random read pattern, the cache should be warmed up before starting the application.
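A minimal way to warm the cache (the path is only a placeholder) would be:

find /srv/flatfiles -type f -exec cat {} + > /dev/null
# or, if vmtouch is available:
vmtouch -t /srv/flatfiles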
