
What do the results of the Linux time wrapper tell me happened to this cp command?


My view on this issue is from the developer side. I write the code that gets placed on a RHEL virtual machine running as one of many in an enterprise system. The filesystem being used is a remote, network attached storage device.

We were seeing high variability in simple commands during batch processing. So we set up a test to get more information, but now I am not sure what we have found.

We ran the following command every 30 minutes and logged the output. It copies a 6 GB file. What I see is the elapsed time jumping from 11 seconds to 190 seconds when the system is busy running lots of jobs and this test command gets a low share of the CPU.

What I can see is that column "I" (filesystem inputs) gets populated when the CPU percentage is low, but not when it is high. Column "w" (which I took to be swaps; the man page below actually describes it as voluntary waits) is also much higher.

My question is: what is happening to this job/command that forces it to run SO MUCH LONGER when its CPU share drops? Does a swap in/out move all of that data to some other, much slower device? Generally, what happens during a swap in/out?

Command being run:

/usr/bin/time -a -o filename.txt cp file.txt fileCopy.txt
Date       Time     e       S      U     P    c     w      I         O
3/14/2022  5:19:02  64.9    16.23  1.03  26%  3005  29210  12000016  12000000
3/14/2022  5:49:02  12.7    11.63  0.79  97%  2069  76     0         12000000
3/14/2022  6:19:02  100.39  14.74  0.78  15%  1034  29925  12000136  12000000
3/14/2022  6:49:24  191.32  18.86  0.94  10%  3374  36164  12001024  12000000
3/14/2022  7:19:02  71.61   15.61  0.88  23%  1610  30316  12000296  12000000
3/14/2022  7:49:02  70.73   17.5   0.91  26%  1408  29540  12000072  12000000
3/14/2022  8:19:02  10.95   9.89   0.7   96%  1709  75     0         12000000
3/14/2022  8:49:02  11.01   10.22  0.73  99%  239   85     0         12000000

The column descriptions from the man page for /usr/bin/time:

e   Elapsed real time (in seconds).
S   Total number of CPU-seconds that the process spent in kernel mode.
U   Total number of CPU-seconds that the process spent in user mode.
P   Percentage of the CPU that this job got, computed as (%U + %S) / %E.
c   Number of times the process was context-switched involuntarily (because the time slice expired).
w   Number of waits: times that the program was context-switched voluntarily, for instance while waiting for an I/O operation to complete.
I   Number of filesystem inputs by the process.
O   Number of filesystem outputs by the process.
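
For reference, those single-letter column names correspond to GNU time's format specifiers, so the column layout can be reproduced with a format string roughly like the following (a sketch; the date/time columns come from our wrapper script, not from time itself):

/usr/bin/time -f "%e %S %U %P %c %w %I %O" -a -o filename.txt cp file.txt fileCopy.txt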
Answer (John Mahowald):

P in this context means the ratio of CPU time this job got to the total elapsed time. Near 100% means it was on the CPU for almost the whole run, so those runs were CPU constrained, in contrast to the other runs where something else was the limiting factor. In every run there is far more system (a.k.a. kernel) time than user time, as is typical of I/O-heavy tasks.
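
As a concrete check against two rows of the table above:

  6:49 run:  (S + U) / e = (18.86 + 0.94) / 191.32 ≈ 0.10, i.e. 10%
  8:49 run:  (S + U) / e = (10.22 + 0.73) / 11.01  ≈ 0.99, i.e. 99%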

Given the workload was copying a 6 GB file, we can infer that the 11-second runs averaged more than 0.5 GB of writes per second. The O column shows the same number of filesystem outputs each time, consistent with a simple one-file copy.
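
As a rough cross-check (assuming these counters are in 512-byte blocks, which is how Linux reports block I/O through getrusage):

  12,000,000 outputs × 512 bytes ≈ 6.1 GB written per run
  6.1 GB / 11 s  ≈ 0.56 GB/s for the fast runs
  6.1 GB / 191 s ≈ 0.03 GB/s for the slowest run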

The input column has major swings, however. The slow runs do about as many reads as writes, but the fast runs do no reads at all! I assume the file is still cached in RAM from when it was last read. DRAM is much faster than even solid-state storage, which is a great speed boost until, under memory pressure, the OS drops the cached data and has to read it from slow storage again.
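
One way to test that hypothesis (a sketch; fincore ships with util-linux but may not be on a minimal image, and the drop_caches step needs root and will slow everything else down while caches refill):

# how much of the source file is currently resident in the page cache
fincore file.txt

# overall memory picture; the buff/cache figure is mostly page cache
free -h

# force a cold-cache run by dropping clean caches, then repeat the timed copy
sync; echo 3 > /proc/sys/vm/drop_caches
/usr/bin/time -a -o filename.txt cp file.txt fileCopy.txt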

So this is a 200-second task that occasionally takes only 12 seconds, most likely thanks to the Linux page cache.


Finding the root cause of a performance problem often requires a deeper understanding of the overall system, beyond any particular set of metrics.

"The filesystem being used is a remote, network attached storage device."

Note that your copy is going over networked storage, so the slowdown could also be anything on the remote system or the network in between: remote storage performance, network (probably IP) speed and utilization, or something local to this VM, where the guest is competing for resources with everything else running on your infrastructure.

It is always possible to go deeper into how things work. Does the network storage (NFS?) matter at all, or do you also see this on local disk? 0.7 seconds of user CPU time is actually quite a bit of work; how much of it is overhead from making so many system calls? What does "CPU busy" even mean when most of the time is spent waiting on slow memory and much slower storage? These are not easy questions to answer, but perhaps there is no need to dig too deeply once the job is performing adequately.
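
A quick way to separate the networked-storage path from a purely local one (a sketch; it assumes some local filesystem such as /tmp has around 6 GB free, and nfsiostat comes from the nfs-utils package):

# confirm what kind of filesystem the file actually lives on (nfs/nfs4 vs xfs/ext4)
df -T file.txt

# time the same copy onto local disk for comparison
/usr/bin/time -a -o filename.txt cp file.txt /tmp/fileCopy.txt

# if it is NFS, watch per-mount latency and throughput during a slow window
nfsiostat 5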

Orion Red: "I assume the file is still cached in RAM from when it was last read." - This makes total sense when I really sit back and look at the situation as a whole. It may also point me toward a way to fix this particular bump in the processing time. It's a pig-in-a-python situation: we receive a massive 16 GB file, encrypt/compress it for later use, then decrypt/expand it when needed, and THEN sort it. The other relevant files are all under 800 MB, so I might be able to run those steps back-to-back and remove the need for so many operations.
Orion Red: How would you like to be credited when I take this up the chain?
John Mahowald: Maybe the decryption and processing can happen in a pipe, if streaming access is possible, but possibly not. Or just throw memory and fast storage at the problem. As for attribution: my name and a link to this answer, as is typical under the CC BY-SA license that covers all Stack Exchange content (https://stackoverflow.com/help/licensing). You may never get the opportunity to adapt and share it, but if, say, you ever write an engineering blog post about interpreting the humble time command, the license requires attribution.
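
What the pipe idea might look like, purely as a sketch (it assumes gpg handles the encrypt/compress step and GNU sort does the sorting; the real tools, file names, and sort keys in the batch will differ):

# decrypt/decompress and sort in one pass, so the 16 GB intermediate never lands on disk
gpg --decrypt big_file.txt.gpg | sort -T /local/fast/tmp -o big_file.sorted.txt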