This is a fairly specific but basic question about one particular computer system in combination with Linux EDAC. If you're the TL;DR type, please skip to the section labeled Question. Otherwise, please bear with me, as some background information is needed.
Motherboard Docs
The motherboard for the HP ProLiant DL380 G7 lists two ways to identify a memory slot:
- Population order (A through I)
- Slot number (1 through 9)
This motherboard has two CPUs, and each CPU has its own bank of RAM. Each bank has 9 slots and is separated into 3 channels, with up to 3 RAM sticks populating each channel.
The population order is A, B, C and so on until you reach I. This means that when populating RAM sticks, you must insert them into A first, then B, then C and so on. In addition, the slots for A, B and C are white, whereas the rest are not. The white coloring indicates which slots to populate first.
Slot Numbers vs Slot Letters
This motherboard uses the following two labeling conventions per bank and channel to identify a slot:
        Ch1         Ch2         Ch3
Letter  {G, D, A}   {H, E, B}   {I, F, C}
Number  {1, 2, 3}   {4, 5, 6}   {7, 8, 9}
The {brackets} denote the start and end of a channel.
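For anyone who'd rather see the layout as data, here's a rough Python sketch of one CPU's bank, transcribed from the table above. The names `BANK_LAYOUT` and `population_order` are made up for illustration; nothing here comes from HP's docs or from EDAC.

```python
# One CPU's bank of 9 slots, transcribed from the table above.
# Keys are the channel numbers as printed in the motherboard docs (1-based);
# each value lists (population letter, slot number) pairs in printed order.
BANK_LAYOUT = {
    1: [("G", 1), ("D", 2), ("A", 3)],
    2: [("H", 4), ("E", 5), ("B", 6)],
    3: [("I", 7), ("F", 8), ("C", 9)],
}


def population_order(layout):
    """Return slot names such as 'A3', ordered by population letter (A first)."""
    slots = [
        f"{letter}{number}"
        for pairs in layout.values()
        for letter, number in pairs
    ]
    # Each name starts with its single population letter, so a plain
    # string sort yields A, B, C, ... I.
    return sorted(slots)


print(population_order(BANK_LAYOUT))
# ['A3', 'B6', 'C9', 'D2', 'E5', 'F8', 'G1', 'H4', 'I7']
```

The first three entries (A3, B6, C9) are the white slots, one per channel.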
For example, to insert 1 memory stick for each CPU available on this motherboard, the first slot would be A3 (a white slot). This is where EDAC becomes a problem. Because there are two ways to identify a slot (by number and/or by letter), determining the slot via EDAC appears problematic in some cases.
Identification by Population
If identifying by population order, then the slot numbers should look like so:
        Ch1         Ch2         Ch3
Slot    {2, 1, 0}   {2, 1, 0}   {2, 1, 0}
Letter  {G, D, A}   {H, E, B}   {I, F, C}
Count   {1, 2, 3}   {4, 5, 6}   {7, 8, 9}
Logically, A3, B6 and C9 should be identified as Slot#0 (DIMM#0) here because they are the white slots, which are populated first. It makes sense that population order should dictate slot numbering, with the slots filled first receiving the lowest numbers.
EDAC
The difficulty comes in with EDAC error logging. The first problem with EDAC is that it uses a 'begins-at-0' ideology. The motherboard docs don't use this numbering convention and instead begin at 1. I only mention this as it's something to be aware of.
The second and bigger problem is that on this motherboard, EDAC seems to enumerate the memory sticks based on the numerical order (Count) of the slots (i.e., 1-9), with EDAC choosing to ignore the A-I population order.
What this means is that => CPU#1Channel#1_DIMM#0 <= is what you see via EDAC in dmesg. While it's relatively easy to identify the CPU and Channel number, it's more difficult to determine the actual DIMM slot from EDAC.
In the case of this EDAC message above, which stick is referred to as DIMM#0 in Channel#1 on CPU#1?
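As an aside, pulling the three numbers out of that label is the easy part. Here's a minimal Python sketch, assuming the label always has exactly the shape shown above; the function name `parse_edac_label` is hypothetical, and I'm not claiming every EDAC driver formats its labels this way.

```python
import re

# Pattern for a label shaped like "CPU#1Channel#1_DIMM#0".
# Assumption: the label format is exactly what appears in this post.
LABEL_RE = re.compile(r"CPU#(\d+)Channel#(\d+)_DIMM#(\d+)")


def parse_edac_label(text):
    """Return the cpu, channel and dimm numbers found in text, or None."""
    match = LABEL_RE.search(text)
    if match is None:
        return None
    cpu, channel, dimm = (int(group) for group in match.groups())
    return {"cpu": cpu, "channel": channel, "dimm": dimm}


print(parse_edac_label("=> CPU#1Channel#1_DIMM#0 <="))
# {'cpu': 1, 'channel': 1, 'dimm': 0}
```

The numbers come out exactly as EDAC prints them. Which physical slot they point at is the part that remains unclear.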
Enumeration Order
Following the enumeration order that EDAC displays, EDAC appears to number the slots like so (the opposite of population order):
        Ch1         Ch2         Ch3
Slot    {0, 1, 2}   {0, 1, 2}   {0, 1, 2}
Letter  {G, D, A}   {H, E, B}   {I, F, C}
Count   {1, 2, 3}   {4, 5, 6}   {7, 8, 9}
Here, DIMM#0 is so labeled because it has the lowest Count number when enumerating through all of the RAM channels and slots.
Strictly following EDAC logging only and based on the above, I'd conclude that CPU#1Channel#1_DIMM#0 identifies stick H4 (remembering that EDAC begins its counts at 0, so Channel#1 is Ch2). However, there's now doubt about whether DIMM#0 means H4 or B6, and that doubt makes EDAC's obfuscated identification system problematic. I've also found no conclusive way to resolve this discrepancy. The disconnect between EDAC's abstract identification system and the motherboard population guidelines is most definitely problematic.
This means that to be safe, if stick H4 (or B6) went bad, you'd have to replace both H4 and B6 to ensure you've replaced the bad stick identified by EDAC as DIMM#0. That means replacing two sticks all because EDAC failed to choose a more conclusive means to identify the DIMM's true slot.
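To make the ambiguity concrete, here's a rough Python sketch of the two competing interpretations. The layout is transcribed from the tables above. The name `candidate_slots` is hypothetical, and the code assumes that EDAC's channel and DIMM numbers are both zero-based, which is my reading rather than something I've confirmed.

```python
# One CPU's bank, transcribed from the tables above. Channels are listed in
# the order Ch1, Ch2, Ch3; each entry is (population letter, slot number)
# in ascending slot-number (Count) order.
CHANNELS = [
    [("G", 1), ("D", 2), ("A", 3)],
    [("H", 4), ("E", 5), ("B", 6)],
    [("I", 7), ("F", 8), ("C", 9)],
]


def candidate_slots(channel, dimm):
    """Map zero-based EDAC channel/DIMM numbers to both possible slots."""
    slots = CHANNELS[channel]
    # Interpretation 1: DIMM#0 is the lowest slot number in the channel.
    by_enumeration = slots[dimm]
    # Interpretation 2: DIMM#0 is the slot populated first (the white slot),
    # which is the highest slot number in the channel.
    by_population = slots[len(slots) - 1 - dimm]
    return {
        "enumeration": f"{by_enumeration[0]}{by_enumeration[1]}",
        "population": f"{by_population[0]}{by_population[1]}",
    }


print(candidate_slots(channel=1, dimm=0))
# {'enumeration': 'H4', 'population': 'B6'}
```

Note that DIMM#1 maps to the middle slot of the channel under both interpretations, so the doubt only affects DIMM#0 and DIMM#2.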
I don't even really understand the reason the EDAC developers chose to use 0, 1 and 2 for a channel's DIMM / Slot identification. This is a more-or-less made-up convention used only by EDAC, which has no connection to the motherboard docs or the labeling on the motherboard. Thus, EDAC makes slot identification unnecessarily difficult and confusing for no real world benefit.
Question
Does anyone truly understand how to conclusively identify the correct memory stick via EDAC with its needlessly confusing labeling scheme? I'm at a loss.