I was recently engaged on a case in which a customer could not achieve the network throughput required to satisfy business requirements, even though the traffic never left a virtual switch (vSwitch) inside a VMware hypervisor and the communication was between two guests on the same hypervisor host.
Due to advancements in technology, computers are becoming significantly denser while shrinking in physical size. Not long ago, a machine with 512 cores required significant real estate within the datacenter, and I'm sure some of us remember a time when computers were the size of entire rooms -- even buildings. The legacy HP Superdome is one example of "big iron," where 64 processors (128 cores) and 1 TB of RAM equated to two refrigerators.
Today, the new Superdome 2 Itanium blades can scale beyond 256 Itanium cores, and the HP ProLiant G7 and G8 blades can hold four 16-core AMD Opteron 6200 series processors per blade, allowing for a density unparalleled in today's enterprise compute -- 512 cores in one c7000 chassis (roughly the size of a college dorm-room refrigerator) with a memory footprint of 2 TB. See http://www.hp.com/go/blades -- Integrity, NonStop, and ProLiant.
Today's density levels are bringing back a problem I have seen many times over the years: CPU-to-memory performance. High core counts, combined with finite bus speeds and vast amounts of memory, lead to page faults that force the CPUs to fetch data from memory. Yes, this is incredibly fast; however, in the field of High Performance Computing, microseconds tend to be a significant cost when you look at every single transaction.
Memory bus saturation and latency:
This case initially presented as an observed delta in network throughput of approximately 50-75% when testing AMD-based and Intel-based VMware hypervisor hosts (Intel being the faster of the two). This initial observation led to hypotheses that further scrutiny debunked, because the testing variables assumed to be of no consequence were not: 48 cores vs. 64 cores, 100 GB of RAM vs. 1 TB of RAM, AMD vs. Intel (core counts were not the same, and the Intel host had Hyper-Threading enabled). Other theories emerged around processor architecture and core count (CPU switches, i.e. kernel scheduling for SMP); however, there was one relatively simple, albeit obscure, difference between the architectures, blades, and testing paradigms: the amount of memory and the bus layout. My testing clearly illustrated a problem with the CPU's ability to fetch data from memory. The memory bus speed varies from 800 MHz to 1600 MHz depending on these options.
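To take the network stack out of the equation entirely, a raw memory-latency probe is useful. The sketch below is illustrative only (the buffer size and build line are my own assumptions, not the harness used in the case); it chases pointers through a randomly permuted buffer so that every load depends on the previous one and has to come from RAM rather than cache:

/* latency.c -- rough memory-latency probe (illustrative sketch).
 * Build: gcc -O2 latency.c -o latency */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64UL * 1024 * 1024)        /* 64M entries (~512 MB), far larger than any CPU cache */

int main(void)
{
    size_t *chain = malloc(N * sizeof(size_t));
    if (!chain)
        return 1;

    /* Sattolo's algorithm: build one big cycle so each load depends on the
     * previous one and the hardware prefetcher cannot hide the latency. */
    for (size_t i = 0; i < N; i++)
        chain[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = chain[i];
        chain[i] = chain[j];
        chain[j] = tmp;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    size_t idx = 0;
    for (size_t i = 0; i < N; i++)
        idx = chain[idx];              /* serialized, dependent loads */

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("average load latency: %.1f ns (checksum %zu)\n", ns / N, idx);
    return 0;
}

A fetch that has to reach memory attached to another socket typically costs noticeably more per load than a local one, which is the effect we were chasing.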
Today there are two main bus architectures for CPU-to-memory transport: the QuickPath Interconnect (QPI) for Intel (G7 and G8), which replaced the legacy Intel Front Side Bus (FSB) of the G5, and the AMD HyperTransport bus (G7 and G8). Both are capable of 6.4 GT/s; however, this customer's server was not achieving that level of performance. The customer's testing suggested that the AMD system had a problem, while the Intel system did not appear to have a network performance problem. Given that the AMD architecture allowed for a higher-density core count and the overall architecture met all business requirements, the challenge for my team was to work with our partners and resolve the network latency issue.
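As a rough sanity check of what 6.4 GT/s means (my arithmetic, based on the published 16-bit data width of both links): 6.4 billion transfers per second x 2 bytes per transfer is about 12.8 GB/s in each direction, or roughly 25.6 GB/s aggregate per link. That is the raw link rate; it says nothing about how long an individual cache-line fetch takes once it has to hop to memory attached to another socket.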
The biggest issue with the test was that the Intel system did not have the same amount of memory as the AMD system.
Identification of the problem:
Further instrumentation was brought in and testing procedures were defined to account for the delays observed at the network packet generator. Network traces showed delays of only microseconds, with a delta of roughly 40 microseconds between the platforms. With this instrumentation and testing we knew the problem was closely linked to the CPU's ability to fetch (load/store) data rather than to execute code; in other words, the problem was page faults, not the CPU's ability to execute instructions or the optimization of instruction sets. Not finding any issue with the execution stack, my focus turned toward the amount of memory on the host and the configuration of that memory.
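A quick way to confirm whether faulting is even in play (a simplified sketch, not the instrumentation used in the actual engagement) is to read the per-process fault counters the kernel already maintains, for example via getrusage() on Linux:

/* faults.c -- print minor/major page-fault counts for this process (illustrative). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

static void report(const char *label)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("%s: minor faults=%ld, major faults=%ld\n",
           label, ru.ru_minflt, ru.ru_majflt);
}

int main(void)
{
    report("before");

    /* Touch 256 MB of freshly allocated memory; the first write to each
     * new page costs a (minor) page fault. */
    size_t bytes = 256UL * 1024 * 1024;
    char *buf = malloc(bytes);
    if (!buf)
        return 1;
    memset(buf, 0xA5, bytes);

    report("after");
    free(buf);
    return 0;
}

Sampling the same counters around the packet-generator run shows whether the delta between the two hosts tracks the fault rate or something else entirely.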
It is not as simple as just throwing memory into a system; several variables come into play -- the frequency at which the memory operates, the size of the DIMMs, which slots are occupied, the type of memory, and so on.
An example chart:
DDR3 memory comparison:
• Maximum DIMM capacity: 4 GB, 8 GB, 16 GB, 32 GB (2 GB, 4 GB, 8 GB for the smaller DIMM types)
• Maximum server capacity: AMD: 1 TB*; Intel: 2 TB; Intel 2-socket: 768 GB; Intel: 48 GB; Intel 2-socket: 384 GB
• Maximum number of DIMMs per channel: 3 dual rank; 3 quad rank (LRDIMM only); 2 dual rank; 3 dual rank
• Low power option
• Address error detection
Using the HP Memory Configuration Tool (http://h18004.www1.hp.com/products/servers/options), we were able to quantify the exact bus speeds of the Intel and AMD systems, given that the Intel host had only a fraction of the memory installed in the AMD host, not to mention different DIMM types. Even after accounting for this, I was still far short of the performance required by the business.
BIOS settings and interleaving
If you assume that the default memory layout is not interleaved at the node level, then you would expect that, as long as a thread does not switch CPUs, its L1 cache remains valid and its memory stays local. When I went into the BIOS, it turned out the customer's machines had node memory interleaving enabled (the setting was not enabled on the Intel host -- another data point that explained the difference in performance). Upon this finding, I immediately disabled the setting, rebooted, and re-ran the test. The application's test results more than doubled, passing the business requirements for performance. This means that every time the CPU took a page fault, the request most likely had to traverse the HyperTransport bus to fetch the page from memory attached to another CPU.
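The effect is easy to reproduce on any multi-socket Linux host with libnuma installed (the node numbers and buffer size below are arbitrary assumptions, and this is a sketch rather than the customer's actual workload): pin the thread to node 0, then time the same walk over memory allocated next to node 0 and memory allocated next to node 1. The remote run pays the HyperTransport (or QPI) hop on every miss.

/* numa_compare.c -- rough local vs. remote NUMA access comparison (illustrative).
 * Build: gcc -O2 numa_compare.c -lnuma -o numa_compare */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <numa.h>

#define BYTES (512UL * 1024 * 1024)      /* 512 MB per buffer */

static double walk(volatile long *p, size_t n)
{
    struct timespec t0, t1;
    long sum = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < n; i += 8)    /* one load per 64-byte cache line */
        sum += p[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    (void)sum;
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this host\n");
        return 1;
    }

    numa_run_on_node(0);                          /* pin execution to node 0 */

    long *local  = numa_alloc_onnode(BYTES, 0);   /* memory attached to node 0 */
    long *remote = numa_alloc_onnode(BYTES, 1);   /* memory attached to node 1 */
    if (!local || !remote)
        return 1;

    memset(local, 1, BYTES);                      /* fault the pages in up front */
    memset(remote, 1, BYTES);

    size_t n = BYTES / sizeof(long);
    printf("local  walk: %.3f s\n", walk(local, n));
    printf("remote walk: %.3f s\n", walk(remote, n));

    numa_free(local, BYTES);
    numa_free(remote, BYTES);
    return 0;
}

With node interleaving enabled, a large fraction of a thread's pages behaves like the remote case, which is consistent with the doubling in application test results once the setting was disabled.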
Though server technology advancements are keeping pace with Big Data analytics requirements, we are not yet at "plug and play" configuration for High Performance Computing clusters. Teams of highly skilled programmers, professional technicians, master technologists, and business leaders are required to define business requirements and design solutions, so that Solution Architects can deliver IT solutions that succeed. HP Technical Services has the expertise to address these challenges -- helping you make the most of today's technology.
Whether or not to use memory interleaving depends on the use case. In this situation, the application required node memory interleaving to be disabled because the application's working set aligned with that mode; however, there are cases in which interleaving is the right choice.
What are the types of memory interleaving?
Memory bank interleaving
When you use memory bank interleaving, data goes alternately to memory banks through the common memory channel connecting the DIMM banks and the integrated memory controller. Memory bank interleaving increases the probability that more DIMMs will remain in an active state (requiring more power) because the memory controller alternates between memory banks and between DIMMs.
Memory bank interleaving is automatically enabled on a processor node under the following conditions:
• Two single-rank DIMMs per channel result in two-way bank interleaving.
• Two dual-rank DIMMs per channel result in four-way bank interleaving.
• Two quad-rank DIMMs per channel result in eight-way bank interleaving.
• Two dual-rank DIMMs and one quad-rank DIMM result in eight-way bank interleaving, in servers using three DIMMs per channel.
Memory channel interleaving
Memory channel interleaving transfers data by alternate routing through the two available memory channels. As a result, when the memory controller must access a block of logically contiguous memory, the requests don’t stack up in the queue of a single channel. Alternate routing decreases memory access latency and increases performance. However, memory channel interleaving increases the probability that more DIMMs must remain in an active state.
Memory channel interleaving is always active on AMD Opteron 6200 Series processors.
Memory node interleaving
Node interleaving can interleave memory across any subset of nodes in the multi-processor system.
Node interleaving breaks memory into 4 KB addressable entities and assigns blocks of addresses to the nodes in the sequence indicated in the following table.
Sequencing of memory node interleaving across multiprocessor systems:
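As a concrete illustration of that mapping (assuming, as is typical, that the 4 KB blocks simply rotate round-robin across the participating nodes; the exact sequence is what the table spells out), the owning node of an address can be thought of like this:

/* Which node owns a given address under node interleaving?  Illustrative
 * only -- assumes a simple round-robin rotation of 4 KB blocks. */
#include <stdio.h>

static int owning_node(unsigned long long addr, int num_nodes)
{
    return (int)((addr >> 12) % (unsigned)num_nodes);   /* 4 KB blocks -> node index */
}

int main(void)
{
    /* With 4 nodes, consecutive 4 KB blocks land on nodes 0, 1, 2, 3, 0, 1, ... */
    for (unsigned long long addr = 0; addr < 8 * 4096ULL; addr += 4096)
        printf("address 0x%06llx -> node %d\n", addr, owning_node(addr, 4));
    return 0;
}

This is why, with node interleaving enabled, even a single thread's sequential working set is spread across every node in the box, and most fetches end up crossing the interconnect.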