Displaying articles for: 06-28-2009 - 07-04-2009
On Ed Groden's post in the Intel Server Room, Mario Valetti asks about memory configurations for an Intel Xeon 5500-based server running a 32-bit OS. Ed gives a couple of good rules of thumb for what memory to use, including his rule #4, the ultra-important suggestion to populate DIMMs in groups of 3 for each processor. However, the rules break down if you're stuck running Windows Server 2003 Standard Edition.
Mario is confronting this issue on a new Nehalem-based server (a ProLiant DL380 G6 running Intel Xeon 5500 series processors, to be precise), but actually the problem isn't new to the x86 space. Multiprocessor AMD Opteron-based servers have had a NUMA architecture for a few years now, resulting in similar problems on Windows 2003. Nehalem makes it notably worse, though...and it's because of that pesky rule about "groups of 3".
I'll explain the exact problem, then give a couple of solutions.
By default, the BIOS on an HP DL380 G6 server (and any of the 2-processor Nehalem-based server blades) builds a memory map by placing all the memory attached to processor #1 first, followed by all the memory on processor #2. So in our 4-DIMM scenario, the memory would look like this:
Because of the "groups of 3" rule -- which is based on a Nehalem processor's ability to interleave memory across its three memory channels -- memory bandwidth on the green region will be about 3x the bandwidth in the blue region. I'll call this the "bandwidth" problem.
Also, memory accesses from processor #2 to the green region will have about twice the latency of memory accesses to that same region from processor #1. This is the "latency" problem.
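To make the two problems concrete, here's a rough toy model of the default memory map. It is only a sketch: it assumes the 4-DIMM scenario means three 1GB DIMMs on processor #1 (the green region) and one on processor #2 (the blue region), one DIMM per channel, and it treats the "3x" and "twice" figures above as exact ratios. The names `Region`, `build_flat_map`, `relative_bandwidth`, and `relative_latency` are made up for illustration.

```python
from dataclasses import dataclass

CHANNELS_PER_CPU = 3  # each processor interleaves across three memory channels


@dataclass
class Region:
    name: str
    owner_cpu: int  # processor the DIMMs are attached to
    dimms: int      # populated channels (assumes one DIMM per channel)
    size_gb: int


def build_flat_map(regions):
    """Lay regions out back to back, ordered by owning processor."""
    base = 0
    layout = []
    for r in sorted(regions, key=lambda r: r.owner_cpu):
        layout.append((r.name, base, base + r.size_gb))
        base += r.size_gb
    return layout


def relative_bandwidth(region):
    """Fraction of a processor's peak bandwidth this region can deliver."""
    return min(region.dimms, CHANNELS_PER_CPU) / CHANNELS_PER_CPU


def relative_latency(region, cpu, remote_penalty=2.0):
    """1.0 for local access; remote_penalty (assumed ~2x) for the other CPU."""
    return 1.0 if cpu == region.owner_cpu else remote_penalty


# Assumed layout: green = 3 DIMMs on processor #1, blue = 1 DIMM on processor #2.
green = Region("green", owner_cpu=1, dimms=3, size_gb=3)
blue = Region("blue", owner_cpu=2, dimms=1, size_gb=1)

for name, start, end in build_flat_map([blue, green]):
    print(f"{name}: {start}GB-{end}GB")
for cpu in (1, 2):
    for region in (green, blue):
        print(cpu, region.name, relative_bandwidth(region),
              relative_latency(region, cpu))
```

The point of the model is that bandwidth depends only on which region a thread touches, while latency depends on which processor the thread happens to run on. An OS that ignores the second part can't avoid the latency problem.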
These problems stem from two limitations of Windows 2003 Standard: It maxes out at 4GB; and it doesn't grok NUMA. It doesn't take into account memory location when assigning threads to CPUs. What you'll see in our 4-DIMM scenario above is that threads will sometimes have dramatically different memory bandwidth and latency, resulting in very uneven performance.
Two ideas might jump to your mind on how to fix this:
1. You could tell the BIOS to build a memory map by assigning alternate slices of memory to different CPUs. HP BIOS actually lets you do this; it's called “Node Interleaving”, and can be selected in RBSU as an alternative to "NUMA" style assignment. (This "interleaving" shouldn't be confused with the memory channel interleaving done by each processor.) With Node Interleaving enabled, the 4GB of memory will be evenly split across both processors:
Why won't this fix things? Well, you've broken the "groups of 3" rule, so your bandwidth problem becomes...well, broader. 100% of the time, your memory bandwidth will be only 2/3 of what it could be. Plus, you've still got the latency problem, and it's even more unpredictable: about half the time, a thread will get assigned to the "wrong" processor, slowing that thread down.
2. You could install six 1GB DIMMs in the server, like this:
The OS will only see 4GB of memory (actually a little less, because of the so-called "Memory Hole" issue).
Why won't this fix things? Well, it'll fix the bandwidth issue, since both the blue and green regions now have nice, fast bandwidth. However, the latency problem isn't solved: Windows still won't assign threads to the "right" processor. You'll end up with a thread on processor #1 wanting memory that's attached to processor #2, and vice versa. The result? At any given time, about 1/3 of your threads will suffer from poor memory latency.
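The bandwidth side of these two ideas comes down to simple arithmetic on populated channels. The sketch below assumes one DIMM per channel, that bandwidth scales linearly with the number of populated channels, and that the Node Interleaving case has two DIMMs on each processor. The helper `bandwidth_fraction` is hypothetical, not part of any HP or Intel tool.

```python
from fractions import Fraction

CHANNELS_PER_CPU = 3  # three memory channels per processor


def bandwidth_fraction(dimms_on_cpu):
    """Share of peak bandwidth, assuming it scales with populated channels."""
    populated = min(dimms_on_cpu, CHANNELS_PER_CPU)
    return Fraction(populated, CHANNELS_PER_CPU)


# DIMMs per processor in each idea (assumed populations).
configs = {
    "idea 1: 4 DIMMs, node interleaving": [2, 2],
    "idea 2: 6 DIMMs": [3, 3],
}

for label, per_cpu in configs.items():
    shares = [str(bandwidth_fraction(n)) for n in per_cpu]
    print(f"{label}: per-processor bandwidth share = {shares}")
```

Neither number says anything about latency, which is why both ideas still fall short on an OS that doesn't know where memory lives.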
So how can you fix both the bandwidth and latency problems?
You can disable NUMA the old-fashioned way: Go with a 1-processor configuration. Now, all threads will get the same high bandwidth and low latency. You can use three 2GB DIMMs and still get your "full" 4GB. Obviously this only works if your applications aren't CPU bound. (I talked with Mario, and luckily he believes he'll be able to go this route.)
Another way to fix it is to use a NUMA-aware OS. How does that help? When the BIOS builds the memory map, it passes along data structures called Static Resource Affinity Tables (SRAT) to the OS, which describe which processor is attached to which memory region. A NUMA-aware OS can then use that info to decide where to assign threads, helping get rid of the latency problem. Windows 2008 can handle that, as can the Enterprise editions of 2003.
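Here's a minimal sketch of the idea, not how Windows actually implements it. The `AffinityEntry` records are a simplified stand-in for the real SRAT (which is a binary ACPI table), the six-DIMM address ranges are assumed, and `owner_of` and `pick_cpu` are hypothetical helpers that place a thread on whichever processor owns most of the memory it touches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AffinityEntry:
    start_gb: float  # start of the memory range (inclusive)
    end_gb: float    # end of the memory range (exclusive)
    cpu: int         # processor the range is attached to


def owner_of(table, address_gb):
    """Return the processor attached to the memory at address_gb."""
    for entry in table:
        if entry.start_gb <= address_gb < entry.end_gb:
            return entry.cpu
    raise ValueError(f"address {address_gb}GB is not described by the table")


def pick_cpu(table, working_set_gb):
    """Pick the processor that owns most of a thread's working set."""
    counts = {}
    for address in working_set_gb:
        cpu = owner_of(table, address)
        counts[cpu] = counts.get(cpu, 0) + 1
    return max(counts, key=counts.get)


# Assumed layout: 3GB on processor #1 followed by 3GB on processor #2.
table = [
    AffinityEntry(0.0, 3.0, cpu=1),
    AffinityEntry(3.0, 6.0, cpu=2),
]

# A thread that mostly touches memory above 3GB belongs on processor #2.
print(pick_cpu(table, [3.5, 4.2, 0.5]))
```

An OS without this kind of lookup has no basis for the choice, so it places threads without regard to where their memory sits.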
You'll still want to follow the "groups of 3" rule, though!