Score:1

Synchronizing threads in multithreaded applications

PBH

I use the SIESTA DFT package on a CentOS 8 (Core) system with a 16-core, 32-thread Xeon processor and OpenMPI version 4.1.1 for all calculations.

  1. Since I have 32 threads, I use 28 of them for a SIESTA calculation (which consumes about 60% of the memory) and keep the remaining 4 free.

  2. However, if I start using 2 or 3 of the remaining threads for some other application (which has negligible memory usage) while keeping the SIESTA calculation at 28 threads, the speed of the SIESTA calculation drops by around 50-60%.

  3. I have checked the CPU utilization and I see that one thread remains almost idle when using the system in scenario 2.

Is there a way to diagnose and solve this problem? Does this happen because of some process scheduling error? Can some sort of process binding or job scheduling package be used to improve this?

Score:1

CPU utilization as a simple percentage cannot convey the complexity of a CPU with multiple cores, multiple threads, multiple execution units, and a memory hierarchy. Almost certainly the CPU is actually stalled on memory or cache, and the processes that do have their data will be fighting over execution units.


This CPU has only 16 cores. Treating it as if it had 32 will at some point degrade performance severely, as you discovered, even with 2-way SMT. Maybe you can push the number of threads to 125% of cores (20), but 175% (28) is pushing it, especially with other things running. Back down the threads.

Be sure to calculate useful work done per thread per second. Experiment, changing one variable at a time. Maybe try processors with different cache and core count configurations, if you have access to those.
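As a rough sketch of that kind of experiment, the helpers below time the same job at several rank counts and turn a measurement into work per thread per second. The function names `throughput_per_thread` and `sweep_ranks` are hypothetical, and what counts as a unit of useful work (for example, SCF steps completed) is an assumption you would have to choose for your own calculation.

```shell
# throughput_per_thread UNITS SECONDS THREADS
# Prints useful work done per thread per second.
# UNITS is whatever you decide to count as work (an assumption, e.g. SCF steps).
throughput_per_thread() {
    awk -v u="$1" -v s="$2" -v t="$3" 'BEGIN {
        if (s + 0 <= 0 || t + 0 <= 0) { print "n/a"; exit }
        printf "%.4f\n", u / (s * t)
    }'
}

# sweep_ranks JOB N1 [N2 ...]
# Runs JOB once per rank count, passing the count as its only argument,
# and prints "<ranks> <elapsed seconds>" for each run.
# Only the rank count changes between runs, so results stay comparable.
sweep_ranks() {
    job="$1"
    shift
    for n in "$@"; do
        start=$(date +%s)
        "$job" "$n"
        end=$(date +%s)
        echo "$n $((end - start))"
    done
}
```

Here JOB would be your own wrapper script (hypothetical, say `./run_siesta`) that takes the rank count as its argument and launches the calculation, for instance with `mpirun -np "$1"`. Calling `sweep_ranks ./run_siesta 16 20 24 28` then gives one timing per configuration, which you can feed into `throughput_per_thread` together with the amount of work each run completed.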


Measure how stalled you are with performance monitoring counters. These are often unavailable in a VM, but are worth a try on Linux. From Gregg, which I linked earlier:

perf stat -a -- sleep 10

Theoretical top speed on Xeons is 4 or 5 instructions per cycle (IPC). You won't get that, but an IPC below 1.0 suggests the workload is likely stalled on memory.
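If you want the IPC as a single number to compare between runs, a minimal sketch is to divide the instruction count by the cycle count yourself. The helper names `compute_ipc` and `ipc_from_perf_csv` are hypothetical, and the second one assumes the CSV layout that `perf stat -x,` produces on recent perf versions (counter value in field 1, event name in field 3); check that against your own output before relying on it.

```shell
# compute_ipc INSTRUCTIONS CYCLES
# Prints instructions per cycle to two decimals, or "n/a" if CYCLES is not positive.
compute_ipc() {
    awk -v ins="$1" -v cyc="$2" 'BEGIN {
        if (cyc + 0 <= 0) { print "n/a"; exit }
        printf "%.2f\n", ins / cyc
    }'
}

# ipc_from_perf_csv
# Reads perf stat CSV lines on stdin and prints the IPC.
# Assumes field 1 is the counter value and field 3 is the event name.
# Non-numeric values such as "<not counted>" are treated as zero.
ipc_from_perf_csv() {
    awk -F, '
        $3 ~ /^instructions/ { ins = $1 + 0 }
        $3 ~ /^cycles/       { cyc = $1 + 0 }
        END {
            if (cyc > 0) printf "%.2f\n", ins / cyc
            else print "n/a"
        }'
}
```

One way to use it: `perf stat -a -x, -e cycles,instructions -- sleep 10 2>&1 | ipc_from_perf_csv`. perf writes its counters to stderr, hence the `2>&1`.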


Definitely get an understanding of the application's code and hot spots. What functions spend most of the time on CPU? What assembly code gets hit the hardest? Which execution units on your CPU in particular are working the hardest to process these uops?

Flame graphs are nice for visualizing on-CPU functions. You mentioned EL 8, which has packaged flame graph tooling.

yum install perf js-d3-flame-graph
# system wide, 99 Hz, for 60 seconds
perf script flamegraph -a -F 99 sleep 60 

A developer-level understanding of the program is necessary to fully interpret the results. With symbols or source code, perf reports can be annotated in a debugger-like experience.

PBH
Hi, thanks for the reply. I checked and my IPC was 0.97, so it seems the system is stalled on memory. However, I have only this one system and so cannot check elsewhere. I will check how the IPC changes as I vary the number of cores used once the currently running calculation ends (which will probably take over a week).
PBH
What is your opinion of the `taskset` tool? Would it make any difference if I bound the SIESTA calculation to one set of cores and the other program to a separate core? Would that let a given core work on only one type of workload?