Eye on Blades Blog: Trends in Infrastructure
Get HP BladeSystem news, upcoming event information, technology trends, and product information to stay up to date with what is happening in the world of blades.

Nehalem and Windows 2003: Why 6 x 1GB = 4GB

On Ed Groden's post in the Intel Server Room, Mario Valetti asks about memory configurations for an Intel Xeon 5500-based server running a 32-bit OS.  Ed gives a couple of good rules of thumb for what memory to use, including his rule #4, the ultra-important suggestion to populate DIMMs in groups of 3 for each processor.  However, the rules break down if you're stuck running Windows Server 2003 Standard Edition.


Mario is confronting this issue on a new Nehalem-based server (a ProLiant DL380 G6 running Intel Xeon 5500 series processors, to be precise), but the problem actually isn't new to the x86 space.  Multiprocessor AMD Opteron-based servers have had a NUMA architecture for a few years now, resulting in similar problems on Windows 2003.  Nehalem makes it notably worse, though...and it's because of that pesky rule about "groups of 3".


I'll explain the exact problem, then give a couple of solutions.


Since Win2003 Standard only addresses 4GB of memory, and the smallest DIMM available on a DL380 G6 is 1GB, your first instinct on a 2-processor server is to use four 1GB DIMMs, like this:


By default, the BIOS on an HP DL380 G6 server (and any of the 2-processor Nehalem-based server blades) builds a memory map by placing all the memory attached to processor #1 first, followed by all the memory on processor #2.  So in our 4-DIMM scenario, the memory would look like this:



Because of the "groups of 3" rule -- which is based on Nehalem processors' ability to interleave memory across their three memory channels -- memory bandwidth in the green region will be about 3x the bandwidth in the blue region.  I'll call this the "bandwidth" problem.
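If you want to put rough numbers on that, here's a quick back-of-the-envelope sketch.  The ~10.6 GB/s per-channel figure assumes DDR3-1333 DIMMs purely for illustration; your actual numbers will depend on the DIMMs and memory speed installed.

    /* Rough illustration of the "bandwidth" problem.
       Assumes DDR3-1333 DIMMs (~10.6 GB/s peak per channel); the exact
       figure depends on the memory actually installed. */
    #include <stdio.h>

    int main(void)
    {
        const double per_channel = 10.6;          /* GB/s, assumed DDR3-1333 peak */
        double green = 3 * per_channel;           /* interleaved across all 3 channels */
        double blue  = 1 * per_channel;           /* stuck on a single channel */

        printf("green region: ~%.1f GB/s\n", green);
        printf("blue region:  ~%.1f GB/s\n", blue);
        printf("ratio: ~%.0fx\n", green / blue);  /* the ~3x gap described above */
        return 0;
    }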


Also, memory accesses from processor #2 to the green region will have about twice the latency of memory accesses to that same region from processor #1.  This is the "latency" problem.


These problems stem from two limitations of Windows 2003 Standard: it maxes out at 4GB, and it doesn't grok NUMA -- it doesn't take memory location into account when assigning threads to CPUs.  What you'll see in our 4-DIMM scenario above is that threads will sometimes have dramatically different memory bandwidth and latency, resulting in very uneven performance.


Two ideas might jump to your mind on how to fix this:


1. You could tell the BIOS to build a memory map by assigning alternate slices of memory to different CPUs.  HP BIOS actually lets you do this; it's called "Node Interleaving", and it can be selected in RBSU as an alternative to the "NUMA" style of assignment.  (This "interleaving" shouldn't be confused with the memory-channel interleaving done by each processor.)  With Node Interleaving enabled, the 4GB of memory will be evenly split across both processors:


 


Why won't this fix things?  Well, you've broken the "groups of 3" rule, so your bandwidth problem becomes...well, broader.  With the four DIMMs split two per processor, only two of each processor's three memory channels are populated, so 100% of the time your memory bandwidth will be only 2/3 of what it could be.  Plus, you've still got the latency problem, and it's even more unpredictable: about half the time, a thread will get assigned to the "wrong" processor, slowing that thread down.


2.  You could install six 1GB DIMMs in the server, like this:



The OS will only see 4GB of memory (actually a little less, because of the so-called "Memory Hole" issue: part of the address space below 4GB is reserved for PCI devices, so a 32-bit OS can't use quite all of it).


Why won't this fix things?  Well, it'll fix the bandwidth issue, since both the blue and green regions now have nice, fast bandwidth.  However, the latency problem isn't solved: Windows still won't assign threads to the "right" processor.  You'll end up with a thread on processor #1 wanting memory that's attached to processor #2, and vice versa.  The result?  At any given time, about 1/3 of your threads will suffer from poor memory latency.


So how can you fix both the bandwidth and latency problems?


You can disable NUMA the old-fashioned way: go with a 1-processor configuration.  Now, all threads will get the same high bandwidth and low latency.  You can use three 2GB DIMMs and still get your "full" 4GB.  Obviously this only works if your applications aren't CPU bound.  (I talked with Mario, and luckily he believes he'll be able to go this route.)


Another way to fix it is to use a NUMA-aware OS.  How does that help?  When the BIOS builds the memory map, it passes the OS a data structure called the Static Resource Affinity Table (SRAT), which describes which processor is attached to which memory region.  A NUMA-aware OS can then use that info to decide where to assign threads, helping get rid of the latency problem.  Windows 2008 can handle that, as can the Enterprise editions of 2003.
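If you're curious what that SRAT-derived information looks like from software, here's a minimal sketch using the standard Win32 NUMA APIs (GetNumaHighestNodeNumber, GetNumaNodeProcessorMask, GetNumaAvailableMemoryNode).  It simply prints which processors and how much memory the OS associates with each node; error handling is trimmed, and the %I64 format specifiers assume the Microsoft compiler.

    /* Minimal sketch: ask a NUMA-aware Windows release which processors
       and how much memory belong to each node -- information the OS
       learned from the BIOS's SRAT.  Error handling trimmed for brevity. */
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        ULONG highest = 0;
        UCHAR node;

        if (!GetNumaHighestNodeNumber(&highest)) {
            printf("No NUMA information available.\n");
            return 1;
        }

        for (node = 0; node <= (UCHAR)highest; node++) {
            ULONGLONG cpu_mask = 0;
            ULONGLONG free_bytes = 0;
            GetNumaNodeProcessorMask(node, &cpu_mask);     /* which CPUs live on this node */
            GetNumaAvailableMemoryNode(node, &free_bytes); /* how much memory it has free */
            printf("node %u: processor mask 0x%I64x, %I64u MB available\n",
                   node, cpu_mask, free_bytes / (1024 * 1024));
        }
        return 0;
    }

A NUMA-aware application can go a step further and keep its threads and allocations on the same node, but for most workloads simply letting a NUMA-aware scheduler do its job is enough.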


You'll still want to follow the "groups of 3" rule, though!


 
