Score:4

Why is read faster when using the O_DIRECT flag?


I copied a 10GB file to my SSD, which has a read bandwidth of around 3.3GB/s as benchmarked with fio. Here is the reference: https://cloud.google.com/compute/docs/disks/benchmarking-pd-performance

I cleared the page cache with "sync; echo 3 > /proc/sys/vm/drop_caches". After that I read the file in chunks of 3MB at a time using the open() and read() system calls. If I open the file without O_DIRECT and O_SYNC, I get a bandwidth of around 1.2GB/s. However, if I use O_DIRECT and O_SYNC, I get around 3GB/s. I cleared the cache before both runs, even though O_DIRECT doesn't really use the page cache anyway.
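For concreteness, here is a minimal sketch of the kind of read loop I mean (not my exact code: the file path is a placeholder, and the 4096-byte alignment is an assumption about the drive's logical block size, which O_DIRECT requires the buffer, offset and length to match):

    /* Minimal sketch of the benchmark loop (placeholder file name; compile with gcc -O2). */
    #define _GNU_SOURCE               /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    #define CHUNK (3 * 1024 * 1024)   /* 3MB per read() */
    #define ALIGN 4096                /* assumed logical block size */

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "testfile";  /* placeholder */
        int flags = O_RDONLY;
        if (argc > 2)                 /* any second argument: direct I/O */
            flags |= O_DIRECT | O_SYNC;

        int fd = open(path, flags);
        if (fd < 0) { perror("open"); return 1; }

        void *buf;                    /* O_DIRECT needs an aligned buffer */
        if (posix_memalign(&buf, ALIGN, CHUNK) != 0) {
            fprintf(stderr, "posix_memalign failed\n");
            return 1;
        }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        long long total = 0;
        ssize_t n;
        while ((n = read(fd, buf, CHUNK)) > 0)   /* sequential 3MB reads */
            total += n;
        if (n < 0) perror("read");

        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%lld bytes in %.2f s = %.2f GB/s\n", total, secs, total / secs / 1e9);

        free(buf);
        close(fd);
        return 0;
    }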

My question is why O_DIRECT gives the full I/O bandwidth while I can't get it without O_DIRECT. The data going from the device to the page cache moves at 3.3GB/s, and from the page cache to the user buffer at around 7GB/s, I suppose. That pipeline should also deliver the full 3.3GB/s. Why is it slower?

I am always reading a new 3MB chunk each time and never reusing the data, so the cache isn't really useful. But the pipeline should be bound by I/O, so why isn't it?

The CPU is an Intel(R) Xeon(R) Silver 4214 @ 2.20GHz. I am not sure about the DRAM speed, but if I re-read the same 3MB chunk multiple times I get ~8GB/s, which I suppose should be the DRAM bandwidth, since Linux can use all of the free RAM as page cache.
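The re-read test is just a variation of the same sketch: read the same 3MB region over and over (path and repeat count are again placeholders), so after the first pass every read is served from the page cache:

    /* Sketch of the re-read test: the same 3MB region is read repeatedly,
     * so after the first pass it comes from the page cache and the loop
     * measures memory/copy bandwidth rather than the SSD. */
    #define _POSIX_C_SOURCE 200809L
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    #define CHUNK   (3 * 1024 * 1024)
    #define REPEATS 1000                 /* arbitrary repeat count */

    int main(void)
    {
        int fd = open("testfile", O_RDONLY);   /* placeholder path */
        if (fd < 0) { perror("open"); return 1; }

        char *buf = malloc(CHUNK);
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        long long total = 0;
        for (int i = 0; i < REPEATS; i++) {
            ssize_t n = pread(fd, buf, CHUNK, 0);   /* always the same offset */
            if (n <= 0) { perror("pread"); break; }
            total += n;
        }

        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%lld bytes in %.2f s = %.2f GB/s\n", total, secs, total / secs / 1e9);

        free(buf);
        close(fd);
        return 0;
    }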

Update

I ran fio with and without O_DIRECT enabled and logged iostat output while it ran.

Used this fio command. "fio --name=read_throughput --directory=$TEST_DIR --numjobs=1 --size=10G --time_based --runtime=30s --ramp_time=0s --ioengine=sync --direct=0 --verify=0 --bs=4K --iodepth=1 --rw=read --group_reporting=1 --iodepth_batch_submit=64 --iodepth_batch_complete_max=64"

And this iostat command:

"iostat -j ID nvme0c0n1 -x 1"

My conclusion: a single-threaded read without the O_DIRECT flag is not able to send enough read requests to saturate the SSD and reach 3.3GB/s, regardless of the block size used. With the O_DIRECT flag, a single-threaded read does saturate the device once the block size is 64M or higher; at 3M it reaches around 2.7GB/s.

Now the question is: why, without the O_DIRECT flag, is the CPU unable to send enough read requests to the SSD, and what is limiting them? Does it have to do with a cache-management limitation? If so, which parameter is the limit? Can I change it and see whether it affects the number of read requests sent to the device?

djdomi
If you need support with writing an application, stackoverflow.com is a better place for this, since it is specifically for programming.
I would rather migrate this to the Unix & Linux Stack Exchange.
Abdul Wadood
I just need to understand why it is happening.
Grant Curell
As a guy who does a lot of benchmarking for work, I would submit that this question does belong here. Much of what determines these behaviors is very hardware/server specific. How was the I/O die configured? What drives are being run? How are the PCIe lanes configured? For these reasons the question, at least to me, makes sense on serverfault. If the OP had asked, for example, how do I "optimize JavaScript async calls" - agreed, no hardware factors - give it to SO.
Score:2

O_DIRECT is faster than a generic read because it bypasses the operating system's buffers: you are reading directly from the drive. There are a couple of reasons this could be faster, though keep in mind that at this level things get insanely setup-specific. An example of what I mean: if you have a drive whose NAND is optimized for 8kB writes rather than 4kB chunks and you write/read at the wrong size, you'll see half the performance, but spotting that requires an internal understanding of how drives work. This can even vary within the same model, e.g. the A revision of a drive might have different optimizations than the B revision of the same drive (I have seen this multiple times in the field).

But back to your question:

  1. There is no cache to copy in and out of.
  2. If you're doing something like FIO, you'll get more predictable read behavior.
  3. 1MB is a large block size, so you benefit extra from not dealing with the cache.

Beyond that you have to start getting deeper into benchmarking and that's a pretty complex topic.

My general recommendation is to start with iostat. Is avgqu-sz high? Is %util close to 100%? (It probably will be if you're close to the drive's maximum throughput.) Is await long? Do you have RAID? Which I/O scheduler did you pick? The list of things I've seen cause this sort of behavior is myriad, and figuring out exactly what causes what is going to be very specific to your system.

What I said in the beginning, though, should get you in the ballpark. My best guess is that with big block reads you're saving on some sort of cache inefficiency.

Abdul Wadood
Posting a thread of comments since I'm not sure how else to respond to the answer. I used fio and varied the parameters whose behaviour I am trying to understand. Here is the command: "fio --name=read_throughput --directory=$TEST_DIR --numjobs=1 --size=10G --time_based --runtime=30s --ramp_time=0s --ioengine=sync --direct=0 --verify=0 --bs=1M --iodepth=1 --rw=read --group_reporting=1 --iodepth_batch_submit=64 --iodepth_batch_complete_max=64"
Abdul Wadood
O_DIRECT is disabled, the engine is sync (normal read/write), iodepth is 1 and numjobs is also 1. Initially this gives 1.3GB/s while it is actually sending read requests to the disk; after that it goes to 6~7GB/s and no longer issues any I/O requests. Here is the iostat output while the read requests were being sent and the bandwidth was 1.3GB/s: r/s = 10799, rKB/s = 1382272, r_await = 0.2, rareq-sz = 128, svctm = 0.09, %util = 100.
Abdul Wadood
The rest of the parameters were all 0, checked using the command "iostat -j ID nvme0c0n1 -x 1". But why is it 1.3GB/s rather than 3.3GB/s in this case? If I only enable O_DIRECT with the same command, the bandwidth is 1.7GB/s, which is a bit faster but still nowhere near the raw I/O bandwidth. However, if I change numjobs to 4, both of them start at 3.3GB/s and maintain it while the reads are being requested from the disk. The iostat output looks like this: r/s = 26139, rKB/s = 3345792, r_await = 0.43, rareq-sz = 128, svctm = 0.04, %util = 100.
Abdul Wadood
Why does it behave like this? Essentially "svctm" decreases while the average request size stays at 128KB. Why did it decrease? 128KB serviced in 40us corresponds to 3.3GB/s, versus 90us, which is about 1.45GB/s. As per https://unix.stackexchange.com/questions/104192/iostat-await-vs-svctm, "The svctm column (service time) should display the average time spent servicing the request, i.e. the time spent 'outside' the OS." Why does this svctm change?
Abdul Wadood
Q1: Why, with or without O_DIRECT and numjobs=1, is the bandwidth not 3.3GB/s? Q2: What causes the difference between 1.3GB/s and 1.7GB/s, i.e. what extra work is happening that makes it 1.3 instead of 1.7? Q3: Why does it reach the full I/O bandwidth when numjobs=4?
Abdul Wadood
The only insight I can think of: with numjobs=4, some part of the process overlaps with the others, that time gets hidden, and the overall bandwidth appears to be 3.3GB/s. Otherwise the steps run sequentially and their time adds on top of the 3.3GB/s transfer, making it slower. But what is that part?
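To make the overlap idea concrete, here is a rough sketch (outside fio; path, thread count and block size are placeholders) of four threads reading disjoint ranges of the same file concurrently, analogous to numjobs=4: with several requests in flight, the per-request time of one thread is hidden behind the others.

    /* Rough sketch of the numjobs=4 case: four threads each pread() a
     * disjoint quarter of the file, keeping several requests in flight.
     * Compile with: gcc -O2 -pthread */
    #define _POSIX_C_SOURCE 200809L
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define NTHREADS 4
    #define CHUNK (1024 * 1024)          /* 1MB per pread(), like bs=1M */

    struct job { int fd; off_t start; off_t len; };

    static void *reader(void *arg)
    {
        struct job *j = arg;
        char *buf = malloc(CHUNK);
        for (off_t off = j->start; off < j->start + j->len; off += CHUNK) {
            if (pread(j->fd, buf, CHUNK, off) <= 0)   /* read own range only */
                break;
        }
        free(buf);
        return NULL;
    }

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "testfile";   /* placeholder */
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
        off_t part = st.st_size / NTHREADS;   /* split the file into quarters */

        pthread_t tid[NTHREADS];
        struct job jobs[NTHREADS];
        for (int i = 0; i < NTHREADS; i++) {
            jobs[i].fd = fd;
            jobs[i].start = (off_t)i * part;
            jobs[i].len = part;
            pthread_create(&tid[i], NULL, reader, &jobs[i]);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);

        close(fd);
        return 0;
    }

If this behaves like fio, timing it against a single-threaded version of the same loop should show the same jump to ~3.3GB/s that numjobs=4 gives.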
Abdul Wadood
There is no RAID in the system that I am testing on.
Abdul Wadood
OK, I did some more testing. Increasing the read block size from 1M to 64M or higher with O_DIRECT saturates the SSD with read requests and achieves 3.3GB/s with numjobs=1. Without the O_DIRECT flag, however, it is unable to send enough read requests to saturate the device. Why can't the CPU saturate the device with read requests without O_DIRECT, even with bs = 64M or higher?
Grant Curell
hahaha ok - I'll read through this tomorrow, but the accepted way to do this is to take all these comments, copy pasta them into your OP, and just put them under a section called `## Update` or something.