CPU utilization as a single percentage cannot convey the complexity of a CPU with multiple cores, multiple hardware threads, and multiple execution units, plus the memory behind it. Almost certainly the CPU is actually stalled on memory or cache. And processes that do have their data will be fighting over execution units.
This CPU only has 16 cores. Treating it as if it has 32 will at some point degrade performance severely, as you discovered, even with 2-way SMT. Maybe you can get the number of threads to 125% of cores (20), but 175% (28) is pushing it, especially with other things running. Back off the thread count.
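To make those percentages concrete, here is a rough sketch that turns a physical core count into a list of thread counts worth benchmarking. The function name `candidate_thread_counts` and the default ratios are my own choices for illustration, not something your application or the OS provides.

```python
def candidate_thread_counts(physical_cores, ratios=(1.0, 1.25, 1.5, 1.75, 2.0)):
    """Return thread counts to benchmark, as multiples of the physical core count."""
    # Round to whole threads. A ratio of 2.0 means one thread per
    # hardware thread on a 2-way SMT part.
    return [round(physical_cores * r) for r in ratios]


# For the 16-core CPU discussed above:
print(candidate_thread_counts(16))
```

Run your workload at each of these counts and keep the one that gives the best throughput, not the one that shows the highest utilization.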
Be sure to calculate useful work done per thread per second. Experiment, changing one variable at a time. Maybe try processors with different cache and core count configurations, if you have access to those.
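A minimal sketch of that bookkeeping, assuming you can count completed units of work (requests, jobs, rows, whatever is meaningful for your application) over a timed run. The function and parameter names are hypothetical.

```python
def work_per_thread_per_second(units_done, threads, elapsed_seconds):
    """Useful work completed per thread per second for one benchmark run."""
    if threads <= 0 or elapsed_seconds <= 0:
        raise ValueError("threads and elapsed_seconds must be positive")
    return units_done / (threads * elapsed_seconds)


def total_per_second(units_done, elapsed_seconds):
    """Total useful work per second, regardless of thread count."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return units_done / elapsed_seconds
```

Record both numbers for every run. If the per-thread figure falls as you add threads while the total stays flat or drops, the extra threads are not buying you anything.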
Measure how stalled you are with performance monitoring counters. This often won't work in a VM, since the hardware counters may not be exposed to the guest, but it is worth a try on Linux. From Gregg, which I linked earlier:
perf stat -a -- sleep 10
Theoretical top speed on Xeons is 4 or 5 instructions per cycle (IPC). You won't get that, but an IPC below 1.0 suggests you are heavily stalled on memory.
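If you want to compute IPC yourself from the instructions and cycles counts that perf stat prints, the arithmetic is just a division. This is a small sketch under that assumption. The function names are mine, and the 1.0 threshold is only the rule of thumb mentioned above, not a hard limit.

```python
def ipc(instructions, cycles):
    """Instructions per cycle from two raw counter values."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


def interpret_ipc(value, threshold=1.0):
    """Apply the rough rule of thumb: below the threshold, suspect memory stalls."""
    if value < threshold:
        return "likely stalled on memory"
    return "not obviously memory stalled"
```

Treat the result as a hint about where to look next, not as a diagnosis.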
Definitely get an understanding of the application's code and hot spots. Which functions spend the most time on CPU? What assembly code gets hit the hardest? Which execution units on your particular CPU are working the hardest to process these uops?
Flame graphs are nice for visualizing on-CPU functions. You mentioned EL 8, which has packaged flame graph tooling:
yum install perf js-d3-flame-graph
# system wide, 99 Hz, for 60 seconds
perf script flamegraph -a -F 99 sleep 60
A developer-level understanding of the program is necessary to fully interpret the results. With symbols or source code, perf reports can be annotated for a debugger-like experience.