With EDAC logging, how does one identify the DIMM slot on an HP DL380 G7?

This is a fairly specific but basic question about one particular computer system in combination with Linux EDAC. If you're the TL;DR type, please skip to the section labeled Question. Otherwise, please bear with me, as some background information is needed.

Motherboard Docs

The motherboard for the HP ProLiant DL380 G7 lists two ways to identify a memory slot:

  1. Population order (A through I)
  2. Slot number (1 through 9)

This motherboard has two CPUs, each with its own bank of 9 RAM slots. Each bank is divided into 3 channels, with up to 3 RAM sticks per channel.

The population order is A, B, C and so on until you reach I. This means that when populating RAM sticks, you must insert them A first, then B, then C and so on. In addition, the slots for A, B and C are white while the rest are not. The white coloring marks the slots to populate first.

Slot Numbers vs Slot Letters

This motherboard uses the following two labeling conventions per bank and channel to identify a slot:

   Ch1       Ch2       Ch3
{G, D, A} {H, E, B} {I, F, C}
{1, 2, 3} {4, 5, 6} {7, 8, 9}

The {brackets} denote the start and end of a channel.
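The table above can also be written down as a small data structure, which makes the two labeling schemes easier to compare. This is only a sketch of the layout as transcribed from this post; the names `CHANNELS` and `population_order` are made up for illustration and do not come from HP or EDAC.

```python
# Slot layout for one CPU's bank, transcribed from the table above.
# Each channel lists (letter, number) pairs in the order the table shows them.
CHANNELS = {
    1: [("G", 1), ("D", 2), ("A", 3)],
    2: [("H", 4), ("E", 5), ("B", 6)],
    3: [("I", 7), ("F", 8), ("C", 9)],
}


def population_order():
    """Return combined labels such as 'A3', sorted by population letter."""
    pairs = [pair for slots in CHANNELS.values() for pair in slots]
    return [f"{letter}{number}" for letter, number in sorted(pairs)]


if __name__ == "__main__":
    # The first three entries are the white slots that get populated first.
    print(population_order())
```

Sorting by letter puts A3, B6 and C9 first, which matches the white slots described earlier.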

For example, to insert 1 memory stick for each CPU available on this motherboard, the first slot would be A3 (a white slot). This is where EDAC becomes a problem. Because there are two ways to identify a slot (by number and/or by letter), determining the slot via EDAC appears problematic in some cases.

Identification by Population

If identifying by population order, then the slot numbers should look like so:

          Ch1       Ch2       Ch3
Slot   {2, 1, 0} {2, 1, 0} {2, 1, 0}
Letter {G, D, A} {H, E, B} {I, F, C}
Count  {1, 2, 3} {4, 5, 6} {7, 8, 9}

Logically, A3, B6 and C9 should be identified as Slot#0 (DIMM#0) here because they are the white slots, which are populated first. It makes sense that the slots filled first should receive the lowest slot numbers.

EDAC

The difficulty comes in with EDAC error logging. The first problem with EDAC is that it uses a 'begins-at-0' numbering convention. The motherboard docs don't use this convention and instead begin at 1. I only mention this as it's something to be aware of.

The second and bigger problem is that on this motherboard, EDAC seems to enumerate the memory sticks based on the numerical order (Count) of the slots (i.e., 1-9), with EDAC choosing to ignore the A-I population order.

What this means is that => CPU#1Channel#1_DIMM#0 <= is what you see via EDAC in dmesg. While it's relatively easy to identify the CPU and Channel number, it's more difficult to determine the actual DIMM slot from EDAC.

In the case of this EDAC message above, which stick is referred to as DIMM#0 in Channel#1 on CPU#1?
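For reference, the three numbers in that label can be pulled apart mechanically. The sketch below assumes only the label format quoted above (`CPU#<n>Channel#<n>_DIMM#<n>`); it makes no assumption about the rest of the dmesg line, and `parse_edac_label` is a hypothetical helper name, not part of EDAC.

```python
import re

# Matches only the label format quoted in this post.
LABEL_RE = re.compile(r"CPU#(\d+)Channel#(\d+)_DIMM#(\d+)")


def parse_edac_label(text):
    """Return the cpu, channel and dimm numbers found in text, or None."""
    match = LABEL_RE.search(text)
    if match is None:
        return None
    cpu, channel, dimm = (int(group) for group in match.groups())
    return {"cpu": cpu, "channel": channel, "dimm": dimm}


if __name__ == "__main__":
    print(parse_edac_label("CPU#1Channel#1_DIMM#0"))
```

This only extracts the numbers as EDAC reports them. It does not settle which physical slot the DIMM number refers to.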

Enumeration Order

Following the enumeration order that EDAC displays, the slot numbers appear to be identified by EDAC like so (the opposite of population order):

          Ch1       Ch2       Ch3
Slot   {0, 1, 2} {0, 1, 2} {0, 1, 2}
Letter {G, D, A} {H, E, B} {I, F, C}
Count  {1, 2, 3} {4, 5, 6} {7, 8, 9}

Here, DIMM#0 is so labeled because it has the lowest Count number in its channel as EDAC enumerates through all of the RAM channels and slots.

Strictly following EDAC logging only and based on the above, I'd conclude that CPU#1Channel#1_DIMM#0 identifies stick H4 (remembering that EDAC begins its counts at 0, so Channel#1 is the second channel). However, under the population-order reading the same label would point at B6, so there is now doubt about whether DIMM#0 means H4 or B6. I've found no conclusive way to resolve this discrepancy. The disconnect between EDAC's abstract identification system and the motherboard population guidelines is the core of the problem.

This means that to be safe, if stick H4 (or B6) went bad, you'd have to replace both H4 and B6 to ensure you've replaced the bad stick identified by EDAC as DIMM#0. That means replacing two sticks all because EDAC failed to choose a more conclusive means to identify the DIMM's true slot.
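To make the ambiguity concrete, here is a minimal sketch that returns the candidate slot under each of the two readings described in this post. It assumes EDAC's channel and DIMM numbers are zero-based, as discussed above, and it does not claim to know which reading EDAC actually uses. `CHANNELS` and `candidate_slots` are illustrative names only.

```python
# One CPU's bank, transcribed from the tables in this post.
# Index 0 is Ch1, index 1 is Ch2, index 2 is Ch3 (zero-based, like EDAC).
CHANNELS = [
    [("G", 1), ("D", 2), ("A", 3)],
    [("H", 4), ("E", 5), ("B", 6)],
    [("I", 7), ("F", 8), ("C", 9)],
]


def candidate_slots(channel, dimm):
    """Return the slot label under each of the two readings.

    'enumeration': DIMM#0 is the lowest Count number in the channel.
    'population':  DIMM#0 is the slot populated first in the channel.
    """
    slots = CHANNELS[channel]
    by_enumeration = slots[dimm]
    by_population = slots[len(slots) - 1 - dimm]
    return {
        "enumeration": f"{by_enumeration[0]}{by_enumeration[1]}",
        "population": f"{by_population[0]}{by_population[1]}",
    }


if __name__ == "__main__":
    # Channel#1, DIMM#0 from the EDAC label discussed above.
    print(candidate_slots(1, 0))
```

For Channel#1 and DIMM#0 this yields H4 under the enumeration reading and B6 under the population reading. Note that DIMM#1 maps to the middle slot under both readings, so only DIMM#0 and DIMM#2 are ambiguous.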

I don't even really understand the reason the EDAC developers chose to use 0, 1 and 2 for a channel's DIMM / Slot identification. This is a more-or-less made-up convention used only by EDAC, which has no connection to the motherboard docs or the labeling on the motherboard. Thus, EDAC makes slot identification unnecessarily difficult and confusing for no real world benefit.

Question

Does anyone truly understand how to conclusively identify the correct memory stick via EDAC with its needlessly confusing labeling scheme? I'm at a loss.
