Score:-1

Dell PE RHEL 7 terrible performance

cn flag

We have an 'old' Dell PowerEdge R740xd server with quite high specs, running RHEL 7 (latest). Running ls -l on / can take minutes.

Some specs:

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                80
On-line CPU(s) list:   0-79
Thread(s) per core:    2
Core(s) per socket:    20
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
Stepping:              4
CPU MHz:               2400.000
BogoMIPS:              4800.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              28160K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear spec_ctrl intel_stibp flush_l1d

# free -h
              total        used        free      shared  buff/cache   available
Mem:           376G        4.5G        371G         10M        342M        370G
Swap:          4.0G          0B        4.0G

# lsblk
NAME                     MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                        8:0    0  17.5T  0 disk
└─sda1                     8:1    0  17.5T  0 part
sdb                        8:16   0 111.7G  0 disk
├─sdb1                     8:17   0     1G  0 part /boot
└─sdb2                     8:18   0 110.7G  0 part
  ├─rhel_lab110--16-root 253:0    0    50G  0 lvm  /
  ├─rhel_lab110--16-swap 253:1    0     4G  0 lvm  [SWAP]
  └─rhel_lab110--16-home 253:2    0  56.7G  0 lvm  /home

Only sdb is being used right now; I have just installed the OS. What could be affecting the performance so dramatically?

cn flag
So, one downvote. Please explain why this deserves that; professional courtesy and all that. If you require more info I'm happy to try to provide it, but it is extremely painful to run any command at all right now, which is why I am looking for additional expertise on Server Fault.
Grant Curell avatar
mx flag
(Legally required note: I work for Dell.) I wasn't the downvote, but I monitor the Dell tag and do a lot of performance tuning for work. The issue with this question is that it's kind of impossible to answer. Performance tuning is an immensely complicated thing, dependent on a close-to-infinite number of factors. The question in its current state is definitely impossible to answer: even though this is one of the specialties I cover at Dell, I wouldn't be able to give you anything knowing only that it's an R740xd and what CPU is in there. Suggestions to follow in the next comment.
Grant Curell avatar
mx flag
It looks like you're really active on the forum, so you are probably familiar with the customary stuff: what have you tried? Lots of things can bring an OS to its knees, ranging from problems with the OS to problems with the hardware. Have you done anything to try narrowing down which it is? For example: if you massively underpower the server, to the point that it has just enough to turn on but nothing more, it will run... but very slowly (not usually this slowly, though). I doubt that's your problem, but without any further information it's impossible to even begin to hazard a guess.
Grant Curell avatar
mx flag
One of the easiest ways to start narrowing this down is to boot it from a live Linux CD. If that runs just fine, you can start narrowing things down: it is probably a software problem OR something that the live CD isn't leveraging, e.g. the drives. Since you would be running things out of RAM, maybe the OS's drives are struggling for some reason. That's easy enough to test with FIO or even some basic DD magic. Hope this helps move the ball for you.
Grant Curell avatar
mx flag
Final note: FIO/DD write tests are destructive, so how you test this depends on how you're storing your data, whether this is production or lab, etc.
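A read-only variant of the "basic DD magic" mentioned above could be sketched like this. The function name read_test and its defaults are illustrative assumptions, not something from the thread; it only reads from the source and discards the data, so it does not modify the disk.

```shell
# read_test SRC [COUNT_MB]: read COUNT_MB MiB from SRC and throw it away.
# Hypothetical helper; it only reads, so nothing on SRC is changed.
read_test() {
    src=$1
    count=${2:-1024}
    # dd prints its statistics on stderr; keep only the final summary line.
    dd if="$src" of=/dev/null bs=1M count="$count" 2>&1 | tail -n 1
}

# Example (as root; device name taken from the lsblk output in the question):
#   read_test /dev/sdb 1024
```

Repeated runs may be served from the page cache; with GNU dd on a real block device, adding iflag=direct to the dd command is one way to bypass it.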
cn flag
This is a newly installed host (deployed from Foreman). The disks are sdb (BOSS card, where the OS has been installed) and a big RAID 6 array of SAS disks that is not even formatted yet. Today I will be deploying a more recent RHEL version (8) to compare, but the application requires 7.
cn flag
Turns out it was a broken uplink on a switch. So no problem with the Dell server ;-)
Grant Curell avatar
mx flag
Hahaha, glad you got it worked out. May the golden rule continue to apply: always blame the distant end.
Score:2
cn flag

As you only mentioned ls -l / taking a long time (and not all directories, for example), one possibility is that your root inode got really large.

You can check this with stat / and look at the reported size. A typical root inode on a filesystem with 4K blocks would be only 4K.
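As a rough sketch of that check, the snippet below prints the size reported for a directory next to the number of names it currently holds; a large size with few names would fit the situation described here. The helper name dir_report is an assumption for illustration, and it relies on GNU stat's -c option.

```shell
# dir_report DIR: show the directory's own reported size and how many
# names it currently contains. Hypothetical helper, assumes GNU stat.
dir_report() {
    dir=$1
    size=$(stat -c '%s' -- "$dir")
    # ls -A lists every name except . and .. (hidden files included).
    # Names containing newlines would be over-counted.
    entries=$(ls -A -- "$dir" | wc -l)
    echo "$dir size=$size entries=$entries"
}

dir_report /
```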

A directory's inode can get really large by creating lots of names in it; it doesn't matter whether those names are files, directories, device nodes, etc. Any time the names don't fit in the inode's current blocks, it has to be expanded.

A directory with a large inode will be slow to enumerate all of the names that it contains, even if most of the names have since been removed. If that's the root inode, it can affect many filesystem operations, such as calls to open(), etc.

Unfortunately, most filesystems won't automatically shrink inodes when names are removed.

For large non-root inodes, you can create a new directory, move everything from the old to the new, remove the old, then rename the new.
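The move-and-rename procedure could look roughly like the sketch below. The function name rebuild_dir and the temporary ".rebuild.PID" suffix are assumptions for illustration; it relies on GNU find, mv -t and chmod --reference, copies only the mode (not ownership or ACLs), and should be run while nothing is actively using the directory.

```shell
# rebuild_dir DIR: recreate DIR so that it gets a fresh, small directory.
# Hypothetical helper; assumes GNU find/mv/chmod and a quiescent DIR.
rebuild_dir() {
    old=${1%/}                       # drop a trailing slash, if any
    new="${old}.rebuild.$$"          # temporary sibling name (assumption)
    mkdir -- "$new" || return 1
    chmod --reference="$old" -- "$new" || return 1
    # Move every entry, dotfiles included, into the new directory.
    find "$old" -mindepth 1 -maxdepth 1 -exec mv -t "$new" -- {} + || return 1
    # rmdir refuses to remove a non-empty directory, which acts as a check.
    rmdir -- "$old" || return 1
    mv -- "$new" "$old"
}

# Example:
#   rebuild_dir /var/spool/someapp
```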

For large root inodes on an ext2/3/4 filesystem, you can run fsck -f -D /dev/... on the block device if you can connect it to another system. If you can't do that, you can try shutdown -r -F now to restart the system and force a fsck on startup; it might optimize and shrink the directory.

For other filesystems, the only sane remedy may be to rebuild the filesystem on a new disk.

To prevent a large root inode in the future, try to identify which program created so many names in / and stop it from doing so. It's likely that a program is storing its temp files there; configure it to use /tmp instead or, even better, a subdirectory of /tmp just for it, so that you don't have to interrupt other programs using /tmp if you want to rebuild the offending program's temp directory again.

While looking for such files, use ls -a / to show hidden files. If that doesn't turn up anything, you might try wading through the output of lsof / | grep -i del; there may be files that had been created in /, opened, then unlinked so the name no longer shows up.
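If lsof is not installed, a similar list can be approximated on Linux by walking /proc, where the kernel appends " (deleted)" to the link target of an unlinked but still open file. This is a sketch under that assumption; the function name list_deleted_open is made up for illustration, and it needs root to see other users' processes.

```shell
# list_deleted_open: print open file descriptors whose file was unlinked.
# Hypothetical helper; Linux-only, relies on /proc/PID/fd symlinks.
list_deleted_open() {
    for fd in /proc/[0-9]*/fd/*; do
        # Processes may exit while we scan; skip links we cannot read.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            *' (deleted)') echo "$fd -> $target" ;;
        esac
    done
}

# Example: show only files that lived directly in /
#   list_deleted_open | grep -E -- '-> /[^/]+ \(deleted\)$'
```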

cn flag
Thanks for a well-thought-out answer. The default file system in RHEL 7 is XFS, so that is what we use. In this case the issue turned out to be a broken switch port, and it has been fixed.
Score:0
cn flag

It turns out this was a broken uplink port on a switch. It has been repaired, and the performance is now what we would expect.
