
Accounting for GPU compute time on HPC clusters


How do you account for GPU compute time on your HPC clusters?

I have a growing and quite heterogeneous GPU partition (SXM4 A100s, PCIe A100s, NVLink-connected V100s, PCIe V100s, T4s, with AMD cards arriving soon) on an HPC cluster of mixed-hardware Debian servers running the OAR scheduler.

Traditionally, we accounted for compute time as core-seconds per job. Despite CPU and memory variability between nodes (fat nodes, high-speed nodes, standard nodes), the differences were small enough that they didn't noticeably affect accounting, especially in a small university setting.
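For concreteness, that model boils down to something like this (a minimal sketch; the job fields are invented for illustration and don't reflect OAR's actual accounting records):

    # Hypothetical job record -- the field names are illustrative, not OAR's schema
    job = {"cores": 16, "walltime_s": 7200}

    # Traditional accounting: charge = cores * wall-clock seconds
    core_seconds = job["cores"] * job["walltime_s"]
    print(core_seconds)  # 115200 core-seconds for this job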

On GPUs, things change quite a bit. The difference in performance and cost between an SXM4 A100 node and a T4 is significant, and our current model probably won't cut it, especially as growing university partnerships mean we host more and more private-sector projects that we will have to account for precisely.
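The obvious extension of our model would be a per-GPU-type charge factor, along these lines (a rough sketch only; the weights below are placeholders I made up, not benchmarked or priced values):

    # Hypothetical normalisation weights per GPU type (illustrative numbers only)
    GPU_WEIGHTS = {
        "A100-SXM4": 4.0,
        "A100-PCIe": 3.5,
        "V100-NVLink": 2.5,
        "V100-PCIe": 2.0,
        "T4": 1.0,
    }

    def weighted_gpu_seconds(gpu_type: str, n_gpus: int, walltime_s: int) -> float:
        """Charge = number of GPUs * wall-clock seconds * per-type weight."""
        return n_gpus * walltime_s * GPU_WEIGHTS[gpu_type]

    # A 2x A100-SXM4 job running one hour is charged 8x the "T4-equivalent" seconds
    print(weighted_gpu_seconds("A100-SXM4", 2, 3600))  # 28800.0

But choosing the weights (list price? peak FLOPS? measured throughput on our own workloads?) is exactly the part I'm unsure about.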

I'm exploring how to do this accounting with our current infrastructure, but I was also wondering what methods are used by other people operating HPC GPU clusters. If you have any advice on how to do this, or strategies/tools you have used, I'd be very glad to hear them!

Thanks!


