<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>article Nehalem and Windows 2003: Why 6 x 1GB = 4GB in Eye on Blades Blog: Trends in Infrastructure</title>
    <link>http://h30507.www3.hp.com/t5/Eye-on-Blades-Blog-Trends-in/Nehalem-and-Windows-2003-Why-6-x-1GB-4GB/ba-p/77883</link>
    <description>&lt;p&gt;On &lt;a href="http://communities.intel.com/community/openportit/server/blog/2009/05/05/nehalem-memory-help-im-lost" title="Nehalem Memory Help Im Lost"&gt;Ed Groden&amp;#39;s post in the Intel Server Room&lt;/a&gt;, Mario Valetti asks about memory configurations for an Intel Xeon 5500-based server running a 32-bit OS.&amp;nbsp; Ed gives a couple of good rules of thumb for what memory to use, including his rule #4, the ultra-important suggestion to populate DIMMs in groups of 3 for each processor.&amp;nbsp; However, the rules break down if you&amp;#39;re stuck running Windows Server 2003 Standard edition.&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;Mario is confronting this issue on a new Nehalem-based server (a ProLiant DL380 G6 running Intel Xeon 5500 series processors, to be precise), but actually the problem isn&amp;#39;t new to the x86 space.&amp;nbsp; Multiprocessor AMD Opteron-based servers have had a &lt;a target="_self" href="http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access" title="Non-Uniform Memory Access"&gt;NUMA architecture&lt;/a&gt; for a few years now, resulting in similar problems on Windows 2003.&amp;nbsp; Nehalem makes it notably worse, though...and it&amp;#39;s because of that pesky rule about &amp;quot;groups of 3&amp;quot;.&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;I&amp;#39;ll explain the exact problem, then give a couple of solutions.&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;Since Win2003 Standard only addresses 4GB of memory, and the smallest DIMM available on a DL380 G6 is 1GB, your first instinct on a 2-processor server is to use four 1GB DIMMs, like this:&lt;br /&gt;&lt;a href="/legacyfs/online/eyeonblades/NahalemConfig1_2D00_4GB.gif"&gt;&lt;img src="/legacyfs/online/eyeonblades/NahalemConfig1_2D00_4GB.gif" border="0" alt="" /&gt;&lt;/a&gt;&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;By default, the BIOS on an HP DL380 G6 server (and any of the 2-processor Nehalem-based server blades) builds a memory map by placing all the memory attached to processor #1 first, followed by all the memory on processor #2.&amp;nbsp; So in our 4-DIMM scenario, the memory would look like this:&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;&lt;a href="/legacyfs/online/eyeonblades/NahalemConfig1_2D00_Map.gif"&gt;&lt;img src="/legacyfs/online/eyeonblades/NahalemConfig1_2D00_Map.gif" style="border:0;" border="0" alt="" /&gt;&lt;/a&gt;&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;Because of the &amp;quot;groups of 3&amp;quot; rule -- which is based on each Nehalem processor&amp;#39;s ability to interleave memory across its three memory channels -- memory bandwidth in the green region will be about 3x the bandwidth in the blue region.&amp;nbsp; I&amp;#39;ll call this the &amp;quot;bandwidth&amp;quot; problem.&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;Also, memory accesses from processor #2 to the green region will have about twice the latency of memory accesses to that same region from processor #1.&amp;nbsp; This is the &amp;quot;latency&amp;quot; problem.&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;These problems stem from two limitations of Windows 2003 Standard: It maxes out at 4GB; and it doesn&amp;#39;t grok NUMA.&amp;nbsp; It doesn&amp;#39;t take into account memory location when assigning threads to CPUs.&amp;nbsp;&amp;nbsp; What you&amp;#39;ll see in our 4-DIMM scenario above is that threads will sometimes have dramatically different memory bandwidth and latency, resulting in very uneven performance.&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;Two ideas might jump to your mind on how to fix this:&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;1. You could tell the BIOS to build a memory map by assigning alternate slices of memory to different CPUs.&amp;nbsp; HP BIOS actually lets you do this; it&amp;#39;s called &amp;ldquo;&lt;i&gt;Node Interleaving&lt;/i&gt;&amp;rdquo;, and can be selected in &lt;a href="http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00191707/c00191707.pdf" title="HP RBSU User&amp;#39;s Guide"&gt;RBSU&lt;/a&gt; as an alternative to &amp;quot;&lt;i&gt;NUMA&lt;/i&gt;&amp;quot; style assignment.&amp;nbsp; (This &amp;quot;interleaving&amp;quot; shouldn&amp;#39;t be confused with the memory channel&amp;nbsp;interleaving done by each processor).&amp;nbsp; With Node Interleaving enabled, the 4GB of memory will be evenly split across both processors:&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;&amp;nbsp; &lt;a href="/legacyfs/online/eyeonblades/NahalemConfig2_2D00_Node.gif"&gt;&lt;img src="/legacyfs/online/eyeonblades/NahalemConfig2_2D00_Node.gif" border="0" alt="" /&gt;&lt;/a&gt;&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;Why won&amp;#39;t this fix things?&amp;nbsp; Well, you&amp;#39;ve broken the &amp;quot;groups of 3&amp;quot; rule, so your bandwidth problem becomes...well, broader.&amp;nbsp; 100% of the time, your memory bandwidth will be only 2/3 of what it could be.&amp;nbsp; Plus, you&amp;#39;ve still got the latency problem, and it&amp;#39;s even more unpredictable: about half the time, a thread will get assigned to the &amp;quot;wrong&amp;quot; processor, slowing that thread down.&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;2.&amp;nbsp; You could install six 1GB DIMMs in the server, like this:&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;&lt;a href="/legacyfs/online/eyeonblades/NahalemConfig3_2D00_6GB.gif"&gt;&lt;img src="/legacyfs/online/eyeonblades/NahalemConfig3_2D00_6GB.gif" border="0" alt="" /&gt;&lt;/a&gt;&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;The OS will only see 4GB of memory (actually a little less, because of the so-called &lt;a target="_self" href="http://blogs.msdn.com/oldnewthing/archive/2006/08/14/699521.aspx" title="Why can&amp;#39;t I see all of the 4GB of RAM in my machine?"&gt;&amp;quot;Memory Hole&amp;quot; issue&lt;/a&gt;).&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;Why won&amp;#39;t this fix things?&amp;nbsp; Well, it&amp;#39;ll fix the bandwidth issue, since both the blue and green regions now have nice, fast bandwidth.&amp;nbsp; However, the latency problem isn&amp;#39;t solved: Windows still won&amp;#39;t assign threads to the &amp;quot;right&amp;quot; processor.&amp;nbsp; You&amp;#39;ll end up with a thread on processor #1 wanting memory that&amp;#39;s attached to processor #2, and vice versa.&amp;nbsp; The result?&amp;nbsp; At any given time, about 1/3 of your threads will suffer from poor memory latency.&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;So how &lt;b&gt;can&lt;/b&gt; you fix both the bandwidth and latency problems?&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;You can disable NUMA the old-fashioned way: Go with a 1-processor configuration.&amp;nbsp; Now, all threads will get the same high bandwidth and low latency.&amp;nbsp; You can use three 2GB DIMMs and still get your &amp;quot;full&amp;quot; 4GB.&amp;nbsp; Obviously this only works if your applications aren&amp;#39;t CPU-bound.&amp;nbsp; (I talked with Mario, and luckily he believes he&amp;#39;ll be able to go this route.)&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;Another way to fix it is to use a NUMA-aware OS.&amp;nbsp; How does that help?&amp;nbsp; When the BIOS builds the memory map, it passes along data structures called Static Resource Affinity Tables (SRAT) to the OS, which describe which processor is attached to which memory region.&amp;nbsp; A NUMA-aware OS can then use that info to decide where to assign threads, helping get rid of the latency problem.&amp;nbsp; Windows 2008 can handle that, as can the &lt;a href="http://www.microsoft.com/whdc/archive/numa_isv.mspx#ESC" title="SRAT and NUMA in Windows 2003"&gt;Enterprise editions&lt;/a&gt; of 2003.&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;You&amp;#39;ll still want to follow the &amp;quot;groups of 3&amp;quot; rule, though!&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;</description>
    <pubDate>Thu, 02 Jul 2009 03:21:00 GMT</pubDate>
    <dc:creator>Daniel Bowers</dc:creator>
    <dc:date>2009-07-02T03:21:00Z</dc:date>
    <item>
      <title>Nehalem and Windows 2003: Why 6 x 1GB = 4GB</title>
      <link>http://h30507.www3.hp.com/t5/Eye-on-Blades-Blog-Trends-in/Nehalem-and-Windows-2003-Why-6-x-1GB-4GB/ba-p/77883</link>
      <description>&lt;p&gt;On &lt;a href="http://communities.intel.com/community/openportit/server/blog/2009/05/05/nehalem-memory-help-im-lost" title="Nehalem Memory Help Im Lost"&gt;Ed Groden&amp;#39;s post in the Intel Server Room&lt;/a&gt;, Mario Valetti asks about memory configurations for an Intel Xeon 5500-based server running a 32-bit OS.&amp;nbsp; Ed gives a couple of good rules of thumb for what memory to use, including his rule #4, the ultra-important suggestion to populate DIMMs in groups of 3 for each processor.&amp;nbsp; However, the rules break down if you&amp;#39;re stuck running Windows Server 2003 Standard edition.&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;Mario is confronting this issue on a new Nehalem-based server (a ProLiant DL380 G6 running Intel Xeon 5500 series processors, to be precise), but actually the problem isn&amp;#39;t new to the x86 space.&amp;nbsp; Multiprocessor AMD Opteron-based servers have had a &lt;a target="_self" href="http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access" title="Non-Uniform Memory Access"&gt;NUMA architecture&lt;/a&gt; for a few years now, resulting in similar problems on Windows 2003.&amp;nbsp; Nehalem makes it notably worse, though...and it&amp;#39;s because of that pesky rule about &amp;quot;groups of 3&amp;quot;.&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;I&amp;#39;ll explain the exact problem, then give a couple of solutions.&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;Since Win2003 Standard only addresses 4GB of memory, and the smallest DIMM available on a DL380 G6 is 1GB, your first instinct on a 2-processor server is to use four 1GB DIMMs, like this:&lt;br /&gt;&lt;a href="/legacyfs/online/eyeonblades/NahalemConfig1_2D00_4GB.gif"&gt;&lt;img src="/legacyfs/online/eyeonblades/NahalemConfig1_2D00_4GB.gif" border="0" alt="" /&gt;&lt;/a&gt;&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;By default, the BIOS on an HP DL380 G6 server (and any of the 2-processor Nehalem-based server blades) builds a memory map by placing all the memory attached to processor #1 first, followed by all the memory on processor #2.&amp;nbsp; So in our 4-DIMM scenario, the memory would look like this:&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;&lt;a href="/legacyfs/online/eyeonblades/NahalemConfig1_2D00_Map.gif"&gt;&lt;img src="/legacyfs/online/eyeonblades/NahalemConfig1_2D00_Map.gif" style="border:0;" border="0" alt="" /&gt;&lt;/a&gt;&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;Because of the &amp;quot;groups of 3&amp;quot; rule -- which is based on each Nehalem processor&amp;#39;s ability to interleave memory across its three memory channels -- memory bandwidth in the green region will be about 3x the bandwidth in the blue region.&amp;nbsp; I&amp;#39;ll call this the &amp;quot;bandwidth&amp;quot; problem.&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;Also, memory accesses from processor #2 to the green region will have about twice the latency of memory accesses to that same region from processor #1.&amp;nbsp; This is the &amp;quot;latency&amp;quot; problem.&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;These problems stem from two limitations of Windows 2003 Standard: It maxes out at 4GB; and it doesn&amp;#39;t grok NUMA.&amp;nbsp; It doesn&amp;#39;t take into account memory location when assigning threads to CPUs.&amp;nbsp;&amp;nbsp; What you&amp;#39;ll see in our 4-DIMM scenario above is that threads will sometimes have dramatically different memory bandwidth and latency, resulting in very uneven performance.&lt;/p&gt;&lt;BR&gt;
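&lt;p&gt;As a rough illustration of the map above, here is a small Python sketch. It is a toy model rather than anything the BIOS actually runs: the function name &lt;i&gt;build_numa_map&lt;/i&gt; is made up, the split of three DIMMs on processor #1 and one on processor #2 is inferred from the green and blue regions described above, and treating bandwidth as proportional to the number of populated channels is a simplifying assumption.&lt;/p&gt;&lt;BR&gt;
&lt;pre&gt;
```python
def build_numa_map(dimms_per_cpu, dimm_gb=1):
    """Lay out memory processor by processor, lowest addresses first."""
    regions = []
    start = 0
    for cpu, dimm_count in enumerate(dimms_per_cpu, start=1):
        size = dimm_count * dimm_gb
        regions.append({
            "cpu": cpu,
            "start_gb": start,
            "end_gb": start + size,
            # Toy assumption: one DIMM per channel, at most three channels,
            # and bandwidth scales with the number of populated channels.
            "relative_bandwidth": min(dimm_count, 3),
        })
        start += size
    return regions


# Assumed 4-DIMM layout: three DIMMs on processor #1, one on processor #2.
four_dimm_map = build_numa_map([3, 1])
for region in four_dimm_map:
    print(region)
```
&lt;/pre&gt;&lt;BR&gt;
&lt;p&gt;Under those assumptions the sketch reports a 3GB region on processor #1 with three populated channels and a 1GB region on processor #2 with one, which is the lopsided layout behind the bandwidth problem.&lt;/p&gt;&lt;BR&gt;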
&lt;p&gt;Two ideas might jump to your mind on how to fix this:&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;1. You could tell the BIOS to build a memory map by assigning alternate slices of memory to different CPUs.&amp;nbsp; HP BIOS actually lets you do this; it&amp;#39;s called &amp;ldquo;&lt;i&gt;Node Interleaving&lt;/i&gt;&amp;rdquo;, and can be selected in &lt;a href="http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00191707/c00191707.pdf" title="HP RBSU User&amp;#39;s Guide"&gt;RBSU&lt;/a&gt; as an alternative to &amp;quot;&lt;i&gt;NUMA&lt;/i&gt;&amp;quot; style assignment.&amp;nbsp; (This &amp;quot;interleaving&amp;quot; shouldn&amp;#39;t be confused with the memory channel&amp;nbsp;interleaving done by each processor).&amp;nbsp; With Node Interleaving enabled, the 4GB of memory will be evenly split across both processors:&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;&amp;nbsp; &lt;a href="/legacyfs/online/eyeonblades/NahalemConfig2_2D00_Node.gif"&gt;&lt;img src="/legacyfs/online/eyeonblades/NahalemConfig2_2D00_Node.gif" border="0" alt="" /&gt;&lt;/a&gt;&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;Why won&amp;#39;t this fix things?&amp;nbsp; Well, you&amp;#39;ve broken the &amp;quot;groups of 3&amp;quot; rule, so your bandwidth problem becomes...well, broader.&amp;nbsp; 100% of the time, your memory bandwidth will be only 2/3 of what it could be.&amp;nbsp; Plus, you&amp;#39;ve still got the latency problem, and it&amp;#39;s even more unpredictable: about half the time, a thread will get assigned to the &amp;quot;wrong&amp;quot; processor, slowing that thread down.&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;2.&amp;nbsp; You could install six 1GB DIMMs in the server, like this:&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;&lt;a href="/legacyfs/online/eyeonblades/NahalemConfig3_2D00_6GB.gif"&gt;&lt;img src="/legacyfs/online/eyeonblades/NahalemConfig3_2D00_6GB.gif" border="0" alt="" /&gt;&lt;/a&gt;&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;The OS will only see 4GB of memory (actually a little less, because of the so-called &lt;a target="_self" href="http://blogs.msdn.com/oldnewthing/archive/2006/08/14/699521.aspx" title="Why can&amp;#39;t I see all of the 4GB of RAM in my machine?"&gt;&amp;quot;Memory Hole&amp;quot; issue&lt;/a&gt;).&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;Why won&amp;#39;t this fix things?&amp;nbsp; Well, it&amp;#39;ll fix the bandwidth issue, since both the blue and green regions now have nice, fast bandwidth.&amp;nbsp; However, the latency problem isn&amp;#39;t solved: Windows still won&amp;#39;t assign threads to the &amp;quot;right&amp;quot; processor.&amp;nbsp; You&amp;#39;ll end up with a thread on processor #1 wanting memory that&amp;#39;s attached to processor #2, and vice versa.&amp;nbsp; The result?&amp;nbsp; At any given time, about 1/3 of your threads will suffer from poor memory latency.&lt;/p&gt;&lt;BR&gt;
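&lt;p&gt;The arithmetic behind &amp;quot;6 x 1GB = 4GB&amp;quot; can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the helper name &lt;i&gt;visible_memory&lt;/i&gt; is hypothetical, the map is laid out processor #1 first as described earlier, the OS simply ignores everything above its 4GB limit, and the Memory Hole is not modeled.&lt;/p&gt;&lt;BR&gt;
&lt;pre&gt;
```python
OS_LIMIT_GB = 4  # Windows Server 2003 Standard limit cited in the post


def visible_memory(dimms_per_cpu, dimm_gb=1, os_limit_gb=OS_LIMIT_GB):
    """Return the GB each processor contributes below the OS limit."""
    visible = {}
    next_address = 0
    for cpu, dimm_count in enumerate(dimms_per_cpu, start=1):
        size = dimm_count * dimm_gb
        # Only the part of this region below the OS limit is usable.
        usable = max(0, min(size, os_limit_gb - next_address))
        visible[cpu] = usable
        next_address += size
    return visible


# Six 1GB DIMMs, three per processor (the "groups of 3" rule).
six_dimm_view = visible_memory([3, 3])
print(six_dimm_view)                # {1: 3, 2: 1}
print(sum(six_dimm_view.values()))  # 4
```
&lt;/pre&gt;&lt;BR&gt;
&lt;p&gt;In this model all 3GB on processor #1 and only 1GB of the 3GB on processor #2 fall below the limit, so the OS works with 4GB even though 6GB is installed, while every DIMM group still spans three channels.&lt;/p&gt;&lt;BR&gt;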
&lt;p&gt;So how &lt;b&gt;can&lt;/b&gt; you fix both the bandwidth and latency problems?&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;You can disable NUMA the old-fashioned way: Go with a 1-processor configuration.&amp;nbsp; Now, all threads will get the same high bandwidth and low latency.&amp;nbsp; You can use three 2GB DIMMs and still get your &amp;quot;full&amp;quot; 4GB.&amp;nbsp; Obviously this only works if your applications aren&amp;#39;t CPU-bound.&amp;nbsp; (I talked with Mario, and luckily he believes he&amp;#39;ll be able to go this route.)&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;Another way to fix it is to use a NUMA-aware OS.&amp;nbsp; How does that help?&amp;nbsp; When the BIOS builds the memory map, it passes along data structures called Static Resource Affinity Tables (SRAT) to the OS, which describe which processor is attached to which memory region.&amp;nbsp; A NUMA-aware OS can then use that info to decide where to assign threads, helping get rid of the latency problem.&amp;nbsp; Windows 2008 can handle that, as can the &lt;a href="http://www.microsoft.com/whdc/archive/numa_isv.mspx#ESC" title="SRAT and NUMA in Windows 2003"&gt;Enterprise editions&lt;/a&gt; of 2003.&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;You&amp;#39;ll still want to follow the &amp;quot;groups of 3&amp;quot; rule, though!&lt;/p&gt;&lt;BR&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;</description>
      <pubDate>Thu, 02 Jul 2009 03:21:00 GMT</pubDate>
      <guid>http://h30507.www3.hp.com/t5/Eye-on-Blades-Blog-Trends-in/Nehalem-and-Windows-2003-Why-6-x-1GB-4GB/ba-p/77883</guid>
      <dc:creator>Daniel Bowers</dc:creator>
      <dc:date>2009-07-02T03:21:00Z</dc:date>
    </item>
  </channel>
</rss>

