I copied a 10GB file to my SSD, which has a read bandwidth of around 3.3GB/s, benchmarked using the fio command. Here is the ref: https://cloud.google.com/compute/docs/disks/benchmarking-pd-performance
I cleared the page cache using "sync; echo 3 > /proc/sys/vm/drop_caches". After that I read the file in small chunks of 3MB at a time using the open() and read() system calls. If I open the file without O_DIRECT and O_SYNC, I get a bandwidth of around 1.2GB/s. However, with O_DIRECT and O_SYNC I get around 3GB/s. I cleared the cache before both runs, even though O_DIRECT does not go through the page cache anyway.
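For reference, here is a minimal sketch of the read loop I am describing (the file path comes from argv; the aligned buffer is only strictly required for the O_DIRECT case, since O_DIRECT needs block-aligned buffers and sizes, and 3MB is a multiple of 4096):

    #define _GNU_SOURCE             /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define CHUNK (3UL * 1024 * 1024)   /* 3MB per read() */

    int main(int argc, char **argv)
    {
        /* Toggle O_DIRECT | O_SYNC here to compare the two cases. */
        int flags = O_RDONLY;       /* or O_RDONLY | O_DIRECT | O_SYNC */
        int fd = open(argv[1], flags);
        if (fd < 0) { perror("open"); return 1; }

        /* O_DIRECT requires an aligned buffer; 4096 covers the usual
           logical block size. */
        void *buf;
        if (posix_memalign(&buf, 4096, CHUNK)) { perror("posix_memalign"); return 1; }

        ssize_t n;
        unsigned long long total = 0;
        while ((n = read(fd, buf, CHUNK)) > 0)
            total += n;

        printf("read %llu bytes\n", total);
        free(buf);
        close(fd);
        return 0;
    }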
My question is: why does O_DIRECT give the full IO bandwidth while without O_DIRECT I can't get it? Data going from the device into the page cache should move at 3.3GB/s, and from the page cache into the user buffer at around 7GB/s, I suppose. Even if those two copies ran fully serialized rather than overlapped, that would still be 1/(1/3.3 + 1/7) ≈ 2.2GB/s, and a properly pipelined path should give min(3.3, 7) = 3.3GB/s. Why is it slower than both?
I am always reading a new 3MB each time; I never reuse the data, so caching itself is not useful to me. But the pipeline should still be bound by IO, so why isn't it?
The CPU is an Intel(R) Xeon(R) Silver 4214 @ 2.20GHz. I am not sure about the DRAM speed, but if I re-read the same 3MB multiple times I get ~8GB/s, which I suppose is the DRAM bandwidth, since Linux can use all of the free RAM as page cache.
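The re-read measurement is the same loop, just seeking back to the start on each iteration so every read() hits the page cache (a sketch; the iteration count is arbitrary):

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    #define CHUNK (3UL * 1024 * 1024)
    #define ITERS 1000

    int main(int argc, char **argv)
    {
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        char *buf = malloc(CHUNK);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++) {
            lseek(fd, 0, SEEK_SET);   /* hit the same (now cached) 3MB */
            if (read(fd, buf, CHUNK) != (ssize_t)CHUNK) { perror("read"); return 1; }
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.2f GB/s\n", ITERS * (double)CHUNK / sec / 1e9);
        free(buf);
        close(fd);
        return 0;
    }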
Update
I ran the fio command with and without O_DIRECT enabled and logged iostat while it ran.
Used this fio command:

    fio --name=read_throughput --directory=$TEST_DIR --numjobs=1 --size=10G \
        --time_based --runtime=30s --ramp_time=0s --ioengine=sync --direct=0 \
        --verify=0 --bs=4K --iodepth=1 --rw=read --group_reporting=1 \
        --iodepth_batch_submit=64 --iodepth_batch_complete_max=64
Used this iostat command:

    iostat -j ID nvme0c0n1 -x 1
My conclusion: a single-threaded read without the O_DIRECT flag cannot keep the SSD fed with enough read requests to reach 3.3GB/s, irrespective of the block size used. With the O_DIRECT flag, however, a single-threaded read saturates the device once the block size is 64M or higher; at 3M it is around 2.7GB/s.
Now the question is: why, without the O_DIRECT flag, is the CPU not able to send enough read requests to the SSD? What is limiting them? Does it have to do with a cache-management limitation? If so, which parameter is limiting it, and can I change it to see whether it affects the number of read requests being sent to the device?
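One candidate I suspect (this is just my guess) is the kernel readahead window used by the buffered path, exposed per device in /sys/block/<dev>/queue/read_ahead_kb. From code, a quick experiment would be posix_fadvise(), which on Linux doubles the file's readahead window when asked for sequential access:

    #include <fcntl.h>
    #include <stdio.h>

    /* Hint the kernel that fd will be read sequentially; on Linux this
       doubles the readahead window for the file, which should let the
       buffered path keep more read requests queued at the device.
       posix_fadvise() returns an error number directly, not via errno. */
    static int hint_sequential(int fd)
    {
        int err = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
        if (err) fprintf(stderr, "posix_fadvise: %d\n", err);
        return err;
    }

If readahead really is the limit, calling this before the read loop should raise the buffered bandwidth without touching the global sysfs setting, which would at least confirm which parameter to tune.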