Score:3

Dell PowerEdge R7525 + Nvidia A16

We have a PowerEdge R7525 server with an Nvidia A16 graphics card running Debian 11, but we see about 50% lower GPU performance than on our other servers. I suspect the cause is the missing "Above 4G decoding" option in the BIOS. According to Nvidia, this server should handle up to 3 A16 GPU units. Can anyone advise a workaround or anything else that would let us harness the full power of this GPU?

Thank you very much in advance

Score:6
(I work for Dell) - specifically, I do a lot of optimization.

I think you're tracking a bit off course; "Above 4G decoding" is a feature left over from when BIOS PCIe memory enumeration was limited to 32 bits, which is no longer the case and hasn't been for quite some time. The addressing is now native 64-bit.

"But we have about 50% lower gpu performance than other servers."

I'm not sure what you mean by this. I may be reading too much into this, but this statement makes me think this may be your first foray into optimization in which case, awesome! It's a complicated but fascinating world. GPU performance can be measured in myriad different ways so this statement on its own doesn't narrow down what the problem is.

With regard to why you're seeing poor performance, this is an enormously complex question on which people write entire books. Some common mistakes I see people make, particularly on AMD-based servers:

  • Failing to account for PCIe lane / processor alignment. Make sure whatever processes you're running against the GPU are assigned to the processor that owns the GPU's PCIe lanes rather than the distant one.
  • Failing to set NUMA nodes per socket (NPS) appropriately for the workload (this is unique to AMD systems like the R7525).
  • Failing to account for bottlenecks elsewhere. For example: I've had people see poor GPU performance when in reality part of their software was storage IO bound.
  • Maybe this is obvious, but try setting the BIOS profile to performance. A power-saver profile can lead to downclocking when you don't want it.
  • Poorly aligned memory transfers.
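
As a rough illustration of the first point, here is a minimal Python sketch that asks Linux sysfs which NUMA node a GPU hangs off and which CPUs are local to that node, then pins the current process there. The PCI address `0000:21:00.0` and the helper names `parse_cpulist` and `gpu_local_cpus` are made up for the example; look up your real address with `lspci`. Treat it as a starting point, not a tuned solution.

```python
import os
from pathlib import Path


def parse_cpulist(text):
    """Expand a kernel cpulist string such as "0-3,8-11" into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-")
            cpus.extend(range(int(start), int(end) + 1))
        else:
            cpus.append(int(part))
    return cpus


def gpu_local_cpus(pci_addr, sysfs=Path("/sys")):
    """Return (numa_node, cpus) for the PCI device at pci_addr."""
    node_file = sysfs / "bus/pci/devices" / pci_addr / "numa_node"
    node = int(node_file.read_text().strip())
    if node < 0:
        # -1 means the kernel has no NUMA affinity information for this device.
        return node, []
    cpulist_file = sysfs / "devices/system/node" / f"node{node}" / "cpulist"
    return node, parse_cpulist(cpulist_file.read_text())


if __name__ == "__main__":
    GPU_PCI_ADDR = "0000:21:00.0"  # hypothetical address, replace with your own
    try:
        node, cpus = gpu_local_cpus(GPU_PCI_ADDR)
    except FileNotFoundError:
        print(f"No PCI device {GPU_PCI_ADDR} found on this machine")
    else:
        print(f"GPU is attached to NUMA node {node}, local CPUs: {cpus}")
        if cpus:
            # Pin this process to the GPU-local CPUs; child processes
            # started afterwards inherit the same affinity.
            os.sched_setaffinity(0, cpus)
```

Launching the GPU workload from a process pinned this way keeps it on the socket that owns the card's PCIe lanes. Whether that helps depends on the workload.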

Optimization is extremely workload specific. If this is the first time you've gone through it, I would focus my time on really understanding exactly how the data flows and where it might be bottlenecking. Try to identify things that seem out of place. For example: if you think GPU performance is low, what is the GPU's utilization? Is it at 100%? If it is close to 100%, I start to lean towards software problems. If it's not at 100%, why not? Are you not feeding it data fast enough? Is the card underpowered? Is the server overheating? Etc.
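
One way to start answering those questions is to poll `nvidia-smi` while the workload is running. The sketch below is only an assumption about what is useful to look at: it queries utilization, temperature, SM clocks and the PCIe link state, and flags a link that is running below its maximum generation or width. The helper names are invented for the example. Note that the PCIe link can legitimately drop to a lower generation when the card is idle, so check it under load.

```python
import subprocess

# Query fields passed to nvidia-smi --query-gpu.
FIELDS = [
    "utilization.gpu",
    "temperature.gpu",
    "clocks.sm",
    "clocks.max.sm",
    "pcie.link.gen.current",
    "pcie.link.gen.max",
    "pcie.link.width.current",
    "pcie.link.width.max",
]


def parse_gpu_rows(csv_text):
    """Turn nvidia-smi CSV output (no header, no units) into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        values = [value.strip() for value in line.split(",")]
        rows.append(dict(zip(FIELDS, values)))
    return rows


def link_is_degraded(row):
    """True if the PCIe link runs below its maximum generation or width."""
    return (
        row["pcie.link.gen.current"] != row["pcie.link.gen.max"]
        or row["pcie.link.width.current"] != row["pcie.link.width.max"]
    )


def query_gpus():
    """Run nvidia-smi once and return the parsed rows."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_rows(result.stdout)


if __name__ == "__main__":
    try:
        for index, row in enumerate(query_gpus()):
            note = " (PCIe link below maximum)" if link_is_degraded(row) else ""
            print(f"GPU {index}: {row}{note}")
    except (FileNotFoundError, subprocess.CalledProcessError) as error:
        print(f"Could not query nvidia-smi: {error}")
```

Running the same script on a good server and on the slow one, under the same load, gives a side-by-side view of where the two differ.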

Aotor: Hello, and first of all thank you for your time. By 50% lower GPU performance I mean transcoding throughput. We have several other servers with a very similar configuration but on Supermicro hardware (the motherboard in particular): same platform, same installation, same CPU, same GPU. On this Dell we can only transcode 20-24 channels on the GPU without errors, while the other servers have no problem even with 40 channels.
Aotor: Neither the server nor the GPU is overheating. The GPU temperature ranges from 37 °C to 43 °C. Utilization is 97%-100%, whereas on our other servers it is around 65%-75% when running 40 ffmpeg processes.
Aotor: Can you please explain in more detail the point about setting the NUMA configuration appropriately for the workload (the one that is unique to AMD systems like the R7525)?
Aotor: Thank you very much in advance
Grant Curell: I don't know what processor you're running, but if you're marching into optimization territory on Rome/Milan/Genoa or newer, you need to be very familiar with the NUMA topology. I would start here: https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/tuning-guides/redhat-enterprise-linux-tuning-guide-amd-epyc7003-series-processors.pdf
Grant Curell: See here for getting started with understanding the NUMA nodes per socket setting. This has massive performance implications. If you have the same proc on Supermicro, I expect you have tuned it there if it's working well for you, so I would start with whatever you're using in the good setup. If you don't see NUMA nodes per socket on the Supermicro, then you don't have the same CPU or Supermicro is hiding major options from you. https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/white-papers/overview-amd-epyc7003-series-processors-microarchitecture.pdf#page=10
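
To compare what the two systems actually expose to the OS, a small Python sketch along these lines can list the NUMA nodes Linux sees, with the CPUs in each. The function name `numa_nodes` is invented for the example. On a server with both sockets populated, the node count divided by two should roughly match the NUMA nodes per socket value set in the BIOS, so a different count on the Dell and the Supermicro would point at a different setting.

```python
import re
from pathlib import Path


def numa_nodes(node_dir=Path("/sys/devices/system/node")):
    """Return {node_id: cpulist_string} for every NUMA node the kernel exposes."""
    nodes = {}
    for entry in node_dir.glob("node[0-9]*"):
        match = re.fullmatch(r"node(\d+)", entry.name)
        if match:
            cpulist = (entry / "cpulist").read_text().strip()
            nodes[int(match.group(1))] = cpulist
    # Sort by numeric node id so node10 comes after node2.
    return dict(sorted(nodes.items()))


if __name__ == "__main__":
    nodes = numa_nodes()
    print(f"{len(nodes)} NUMA node(s) visible to the OS")
    for node_id, cpulist in nodes.items():
        print(f"  node{node_id}: CPUs {cpulist}")
```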
Aotor: Thank you again for your time. The CPU in this server is an AMD EPYC 75F3.
Aotor: Thanks for the docs, I'll go through them and hope they help.