NPS4 on a Threadripper 3960x gives two nodes with no memory at all

fr flag

I set my 3960x to NPS4 (Nodes Per Socket: 4) mode to experiment with NUMA on Linux. My system has four 32 GiB DIMMs across four channels, so I expected each of the four nodes to get one. Instead, nodes 1 and 2 get 64 GiB each, and nodes 0 and 3 get none:

tavianator@tachyon $ numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 24 25 26 27 28 29
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 6 7 8 9 10 11 30 31 32 33 34 35
node 1 size: 64342 MB
node 1 free: 4580 MB
node 2 cpus: 12 13 14 15 16 17 36 37 38 39 40 41
node 2 size: 64438 MB
node 2 free: 4276 MB
node 3 cpus: 18 19 20 21 22 23 42 43 44 45 46 47
node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node   0   1   2   3 
  0:  10  12  12  12 
  1:  12  10  12  12 
  2:  12  12  10  12 
  3:  12  12  12  10 
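
For anyone who wants to check the same thing without numactl, here is a minimal sketch that reads the per-node totals from sysfs. It assumes the usual Linux layout, /sys/devices/system/node/nodeN/meminfo with a "Node N MemTotal: ... kB" line; the function names are my own and purely illustrative.

```python
import glob
import re


def mem_total_kb(meminfo_text):
    """Return MemTotal in kB from the text of a nodeN/meminfo file."""
    match = re.search(r"MemTotal:\s+(\d+)\s+kB", meminfo_text)
    if match is None:
        raise ValueError("no MemTotal line found")
    return int(match.group(1))


def memoryless_nodes(sysfs_root="/sys/devices/system/node"):
    """List the NUMA node IDs whose MemTotal is zero."""
    empty = []
    for path in glob.glob(f"{sysfs_root}/node[0-9]*/meminfo"):
        node_id = int(re.search(r"node(\d+)/meminfo$", path).group(1))
        with open(path) as f:
            if mem_total_kb(f.read()) == 0:
                empty.append(node_id)
    return sorted(empty)


if __name__ == "__main__":
    print("memoryless nodes:", memoryless_nodes())
```

Given the numactl output above, I would expect this to report nodes 0 and 3 on my machine.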

Is this expected? Are the node 0/3 cores further away from memory than the node 1/2 cores?

cn flag

The Ryzen Threadripper 3960X is a high-end desktop part, and there are no balanced-memory guides for it of the same quality as those for EPYC server CPUs. On EPYC, memory is organized into quadrants, each a pair of memory channels. I could not find an equivalent guide for Threadripper (Castle Peak), so my guess is that half the channels means half the interleave sets: two.

Even though the topology can get creative, this is still one socket, one hop away from all of its memory. More serious NUMA effects do not appear until multiple sockets need to talk to each other.

To see actual NUMA, get a two-socket server. However, it's possible your workloads do not need that; AMD makes some big single-socket boxes these days.

Two nodes per socket (NPS2) will possibly give a more reasonable topology. I would use it for development purposes only, to see what it looks like; I am skeptical that it will result in noticeable performance improvements.

The default in production should still be NPS1, unless you have data to suggest otherwise.

fr flag

I learned a lot digging into this, which I'll summarize below. TL;DR: yes, it seems like the NPS4 NUMA topology is accurate. Nodes 1/2 do have lower-latency access to memory than nodes 0/3. This is surprising to me because I'd always seen the 3960x/3970x diagrammed like this:

simplified topology diagram for 3960x/3970x

The package has 4 CCDs arranged into quadrants, each with two 3-core (3960x) or 4-core (3970x) CCXs. From this diagram it seems like the two CCDs on the left should have equal access to the memory channels on that side. So, not one channel per CCD like I was thinking, but two channels shared between two CCDs, making NPS2 seem most reasonable.

However, a more detailed diagram from WikiChip shows some asymmetry:

more detailed diagram of Threadripper 3 topology

CCD0 (top right) is connected to the I/O die by a GMI2 link (red lines). Right next to it are two memory controllers (UMC0/1), but these are not connected to any memory channels. In contrast, CCD2 underneath it is right next to UMC2/3, which are connected to memory channels. It's conceivable that CCD2 has lower memory latency than CCD0.

Can we measure it? One tool for this is the Intel Memory Latency Checker. Let's try it!

# tar xf mlc_v3.10.tgz
# sysctl vm.nr_hugepages=4000
vm.nr_hugepages = 4000
# ./Linux/mlc --latency_matrix
Intel(R) Memory Latency Checker - v3.10
malloc(): corrupted top size
[1]    18377 IOT instruction (core dumped)  ./Linux/mlc --latency_matrix

Uh, okay then, let's try the previous version:

# tar xf ~/Downloads/mlc_v3.9a.tgz
# sysctl vm.nr_hugepages=4000
vm.nr_hugepages = 4000
# ./Linux/mlc --latency_matrix
Intel(R) Memory Latency Checker - v3.9a
Command line parameters: --latency_matrix 

Using buffer size of 600.000MiB
Measuring idle latencies (in ns)...
                Numa node
Numa node            0       1       2       3
       0        -         98.8   108.1  -
       1        -         93.3   111.9  -
       2        -        112.3    93.2  -
       3        -        107.6    97.9  -

This confirms it! Same-node latency is ~93ns and node 1↔2 latency is ~112ns, while node 0→1 and 3→2 latency sits in between at ~98ns. Interestingly, the worst-case latency for nodes 0/3 (~108ns) is slightly better than the worst case for nodes 1/2 (~112ns). This makes sense looking at the diagram, as CCD0 is slightly closer to UMC4/5 than CCD2 is. Bandwidth has a similar story:

# ./Linux/mlc --bandwidth_matrix
Intel(R) Memory Latency Checker - v3.9a
Command line parameters: --bandwidth_matrix 

Using buffer size of 100.000MiB/thread for reads and an additional 100.000MiB/thread for writes
Measuring Memory Bandwidths between nodes within system 
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node            0       1       2       3
       0        -       33629.4 31791.2 -
       1        -       34332.5 31419.5 -
       2        -       31193.1 34266.8 -
       3        -       32077.3 33799.3 -
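
To make the two matrices above easier to compare, here is a small sketch that parses them and prints the best and worst figure for each node. The numbers are copied from my MLC runs; the parsing helper and its names are my own, and it assumes rows of the form "cpu_node value value ..." with "-" for nodes that have no memory.

```python
# Rows are CPU nodes, columns are memory nodes 0-3; "-" means no memory there.
LATENCY_NS = """\
0  -   98.8  108.1  -
1  -   93.3  111.9  -
2  -  112.3   93.2  -
3  -  107.6   97.9  -
"""

BANDWIDTH_MBS = """\
0  -  33629.4  31791.2  -
1  -  34332.5  31419.5  -
2  -  31193.1  34266.8  -
3  -  32077.3  33799.3  -
"""


def parse_matrix(text):
    """Turn an MLC-style matrix into {cpu_node: {mem_node: value}}."""
    matrix = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        cpu_node = int(fields[0])
        matrix[cpu_node] = {
            mem_node: float(value)
            for mem_node, value in enumerate(fields[1:])
            if value != "-"
        }
    return matrix


def best_and_worst(row, lower_is_better):
    """Return (best, worst) for one CPU node's row."""
    values = sorted(row.values(), reverse=not lower_is_better)
    return values[0], values[-1]


lat = parse_matrix(LATENCY_NS)
bw = parse_matrix(BANDWIDTH_MBS)

for node in sorted(lat):
    lat_best, lat_worst = best_and_worst(lat[node], lower_is_better=True)
    bw_best, bw_worst = best_and_worst(bw[node], lower_is_better=False)
    print(f"node {node}: latency {lat_best:.1f}-{lat_worst:.1f} ns, "
          f"bandwidth {bw_worst:.0f}-{bw_best:.0f} MB/s")
```

This is only a convenience for reading the tables; it does not add any measurements beyond the ones shown.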

What this seems to mean is that some cores on a 3960x (and presumably 3970x) are slightly privileged with regard to memory latency and bandwidth. I'd be curious to see the results for a 3990x -- does e.g. CCD1 perform similarly to CCD0?
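
If you want a latency-sensitive process to run on the favoured cores, one option is to pin it to the CPUs of node 1 or 2. Below is a minimal sketch using Python's os.sched_setaffinity (Linux only). It assumes the standard sysfs file /sys/devices/system/node/nodeN/cpulist; pin_to_node and parse_cpulist are names I made up for illustration. Note that this only restricts CPUs; I'm relying on the kernel's default local-allocation policy for memory placement, and something like numactl --membind would be needed to enforce it.

```python
import os


def parse_cpulist(text):
    """Expand a sysfs cpulist string such as '6-11,30-35' into a set of CPU IDs."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


def pin_to_node(node, pid=0):
    """Restrict a process (default: the caller) to the CPUs of one NUMA node."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpus = parse_cpulist(f.read())
    os.sched_setaffinity(pid, cpus)
    return cpus
```

On my machine, pin_to_node(1) should correspond to CPUs 6-11 and 30-35, matching the numactl output in the question.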
