When the sun set over Waikiki, HP BladeSystem stood as the victor of InfoWorld's 2010 Hawaii Blade Shoot-Out. Editor Paul Venezia blogged about HP's gear sliding off a truck, but other behind-the-scenes pitfalls meant the world's #1 blade architecture nearly missed the Shoot-Out entirely.
Misunderstandings about the test led us to initially decline the event, but by mid-January we'd signed on. Paul's rules were broad: Bring a config geared toward "virtualization readiness" that included at least 4 blades and either fibre or iSCSI shared storage. Paul also gave us a copy of the tests he would run, which let vendors select and tune their configurations. Each vendor would get a 2- or 3-day timeslot on-site in Hawaii for Paul to run the tests himself, plus play around with the system's management features. HP was scheduled for the first week of March.
In late January we got the OK to bring pre-release equipment. Luckily for Paul, Dell, IBM, and HP all brought similar 2-socket server blades with then-unannounced Intel Xeon 5670 processors ("Westmere"). We scrambled to come up with the CPUs themselves; at the time, HP's limited stocks were all in use to support Intel's March announcement.
HP's final config: One c7000 enclosure; four ProLiant BL460c G6 server blades running VMware ESX, each with 6-core Xeon processors and 8GB LV DIMMs; two additional BL460c G6s with StorageWorks SB40c storage blades for shared storage; a Virtual Connect Flex-10 module; and a 4Gb fibre switch. (We also had a 1U KVM console and an external MSA2000 storage array just in case, but ended up not using them.)
To show off some power-reducing technology, we used solid state drives in the storage blades and low-voltage memory in the server nodes. HP recently added these Samsung-made "Green" DDR3 DIMMs, which use 2Gb-based DRAMs built with 40nm technology. LV DIMMs can run at 1.35 volts (versus the normal 1.5 volts), so they "ditch the unnecessary energy drain" (as Samsung's Sylvie Kadivar put it recently).
Our pre-built system left Houston three days before I did, but it still wasn't there when I landed in Honolulu Sunday afternoon. We had inadvertently put the enclosure into an extra-large Keal case (a hard-walled shipping container) which was too tall to fit in some aircraft. It apparently didn't fit the first cargo flight. Or the second one. Or the third one...
Sunday evening, already stressed about our missing equipment, the four of us from HP met at the home of our Hawaiian host, Brian Chee of the University of Hawaii's Advanced Network Computing Laboratory. Our dinnertime conversation generated additional stress: We realized that I'd misread the lab's specs, and we'd built our c7000 enclosure with 3-phase power inputs that didn't match the lab's PDUs. Crud.
We nevertheless headed to the lab on Monday, where we spotted the rat's nest of cables intended to connect power meters to the equipment. Since our servers still hadn't arrived, two of the HP guys fetched parts from a nearby Home Depot, then built new junction boxes that would both handle the "plug conversion" to the power whips and provide permanent (and much safer) test points for power measurements.
Meanwhile, we let Paul get a true remote management experience on BladeSystem. I VPN'd into HP's corporate network and pointed a browser at the Onboard Administrator of an enclosure back in a Houston lab. Even in Firefox (Paul's browser of choice), controlling an enclosure 3,000 miles away was simple.
Mid-morning on day #2, Paul got a cell call from the lost delivery truck driver. After chasing him down on foot, we hauled the shipping case onto the truck's hydraulic lift...which suddenly lurched under the weight, its wheels slipping off the side and nearly sending the whole thing crashing to the ground. It still took a nasty jolt.
Some pushing and shoving got the gear to the Geophysics building's piston-driven hydraulic elevator, then up to the 5th floor. (I suppose I wouldn't want to be on that elevator when the "Low Oil" light turns on!)
We unpacked and powered up the chassis, but immediately noticed a health warning light on one blade. We quickly spotted the problem: a DIMM had popped partway out. Perhaps not coincidentally, it was the blade that took the greatest shock when the shipping container slipped from the lift.
With everything running (whew), Paul left the lab for his "control station", an Ubuntu-powered notebook in an adjoining room. Just as he sat down to start deploying CentOS images to some of the blades...wham, internet access for the whole campus blinked out. It didn't affect the testing itself, but it caused other network problems in the lab.
An hour later, those problems were solved, and performance tests were underway. They went quickly. Next came some network bandwidth tests. Paul even found time to evaluate Intel's new AES-NI instructions, running timed tests with some OpenSSL tools.
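Out of curiosity, here's roughly what such a check might look like. This is my own illustrative sketch, not Paul's actual test script; the helper name is made up, and the OpenSSL commands in the comments are the standard way to compare the accelerated and legacy code paths.

```python
# Sketch: confirm the CPU advertises AES-NI before timing OpenSSL.
# Linux-specific (reads /proc/cpuinfo); returns False elsewhere.

def has_aesni(cpuinfo_path="/proc/cpuinfo"):
    """True if a 'flags' line in cpuinfo lists the 'aes' feature."""
    try:
        with open(cpuinfo_path) as f:
            return any("aes" in line.split()
                       for line in f if line.startswith("flags"))
    except OSError:
        return False

print("AES-NI available:", has_aesni())

# With AES-NI present, compare throughput from a shell:
#   openssl speed -evp aes-128-cbc   # EVP path: uses AES-NI when available
#   openssl speed aes-128-cbc        # legacy path: plain C implementation
```

The interesting part is the gap between the two `openssl speed` runs: the EVP interface dispatches to the hardware instructions, while the legacy interface doesn't, so the difference is a rough measure of what AES-NI buys you.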
Day #3 brought us a new problem. HP's Onboard Administrator records actual power use, but Paul wanted independent confirmation of the numbers. (Hence, the power meters and junction box test-points.) But the lab's meters couldn't handle redundant three-phase connections. An hour of reconfiguration and recalculation later, we found a way to corroborate measurements. (In the end, I don't think Paul published power numbers, though he may have factored them into his ratings.)
We rapidly re-packed the equipment at midday on day #3 so that IBM could move into the lab. Paul was already drafting his article as we said "Aloha", and headed for the beach -- err, I mean, back to the office.
Here's my countdown of the top ten technologies that will have the most impact on servers in 2010.
10. DDR3L - The JEDEC spec for low-voltage DDR3 memory came out last year, but 2010 should mark significant adoption of these 1.35-volt DIMMs. Since the memory in a modern, high-memory server can consume more power than the processors, DDR3L will play a key role in helping solve data center power consumption and capacity problems.
9. Oracle Fusion Applications - Currently in beta testing, Oracle Fusion Apps is an evolutionary step in Oracle's piecing together of key technologies from its "standard" products with those it recently acquired, like PeopleSoft and Siebel. In some cases, I expect we'll be learning (and managing) applications that are effectively brand-new.
8. Tukwila and Power7 - The UNIX-oriented mission-critical processors grow beyond dual-core, and get hefty caches shared between cores. Intel expects to bring its Itanium into production in the first part of 2010, while published roadmaps from IBM also put Power7 in the 2010 timeframe.
7. RHEL6 - I haven't seen schedules from Red Hat showing RH Enterprise Linux futures, but based on their plan to move RHEL5 into "phase 2" of their lifecycle in early 2011 (that's basically the "no new features, just bug fixes" phase), 2010 would be the logical year for this virtualization-tuned generation of the OS. Fedora 11 and 12 (now released) were the planned "feature previews" for RHEL6, so we'll see.
6. SPEC virtualization benchmark - I'm making another guess at roadmaps to predict the SPEC Virtualization committee might reveal its plans for a benchmark in 2010. (HP is a committee member, though I'm not personally involved in that; as always on this blog, I'm speaking for myself and not for HP.) VMMark is a great tool, but the SPEC benchmark should boost our ability to do vendor-agnostic comparisons of virtualization systems.
5. SAS SSDs - Solid state drives with a SATA interface have been available for a couple of years in servers. (I think IBM was the first to use them as internal drives on blades.) However, servers have traditionally relied on the performance and reliability advantages that the SAS protocol brings, so SAS SSDs are really going to help bring SSDs into everyday use inside servers.
4. Nehalem-EX - The benefits of an integrated memory controller and hyper-threading that emerged with the Intel Xeon 5000 processor will be available to servers with more than 2 processors. Plus, with bigger cache and a beefier memory subsystem, performance will be impressive -- Intel says Nehalem-EX will bring the "largest performance leap in Xeon history".
3. CEE 'ratified' - Converged Enhanced Ethernet (CEE) is the final piece to enable a standardized Fibre Channel over Ethernet (FCoE). This carries the possibility of effectively eliminating an entire fabric from data centers, so there's much-anticipated cost savings and flexibility boosts. Actually, there is no single "CEE" standard; but the key final pieces (the 802.1Qbb and 802.1Qaz standards from IEEE) are targeted for final ratification around mid-2010.
2. Win2003 Server End of Mainstream Support - There are really only two reasons to upgrade an OS: You want some new feature, or the old one can't be patched. For those who are relying on Windows 2003, the chance of the latter happening is about to get larger in 2010, so expect a lot more pressure to upgrade older systems to Server 2008.
1. Magny-Cours processor - Twelve-core x86 processors; enough said. Actually, maybe not: AMD's next-gen Opteron has other performance-boosting features (like additional memory channels), and Magny-Cours will be available for 2-processor as well as 4+ processor servers at the same time. What else? I'm impressed with John Fruehe's comments about AMD's plans to enable 4P performance with 2P economics. I predict Magny-Cours will be the big story in 2010.
Top-ten lists don't seem complete without honorable mentions, so here are my two: Ratification of the PCI Express 3.0 spec, and Microsoft's Madison / Parallel Data Warehouse extension of its SQL server line.
And finally, one new product that almost, but thankfully didn't, appear on this list: The 0.0635 meter hard drive. The EU's Metric Directive, which comes into effect in 2010, originally prohibited publishing specs in anything but metric units. Among other things, that could have led to a renaming of 2.5-inch and 3.5-inch hard drives. Luckily, later modifications to the EU rules mean the "0.0635 meter drive" won't make its appearance -- at least in 2010.
On Ed Groden's post in the Intel Server Room, Mario Valetti asks about memory configurations for an Intel Xeon 5500-based server running a 32-bit OS. Ed gives a couple of good rules of thumb for what memory to use, including his rule #4, the ultra-important suggestion to populate DIMMs in groups of 3 for each processor. However, the rules break down if you're stuck running Windows Server 2003 Standard edition.
Mario is confronting this issue on a new Nehalem-based server (a ProLiant DL380 G6 running Intel Xeon 5500 series processors, to be precise), but actually the problem isn't new to the x86 space. Multiprocessor AMD Opteron-based servers have had a NUMA architecture for a few years now, resulting in similar problems on Windows 2003. Nehalem makes it notably worse, though...and it's because of that pesky rule about "groups of 3".
I'll explain the exact problem, then give a couple of solutions.
By default, the BIOS on an HP DL380 G6 server (and any of the 2-processor Nehalem-based server blades) builds a memory map by placing all the memory attached to processor #1 first, followed by all the memory on processor #2. So in our 4-DIMM scenario, the memory would look like this:
Because of the "groups of 3" rule -- which is based on Nehalem processors' ability to interleave memory across their three memory channels -- memory bandwidth in the green region will be about 3x the bandwidth in the blue region. I'll call this the "bandwidth" problem.
Also, memory accesses from processor #2 to the green region will have about twice the latency as memory accesses to that same region from processor #1. This is the "latency" problem.
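To make both problems concrete, here's a toy model of that 4-DIMM layout. The 3x and 2x ratios are just the rough figures described above, not measurements, and the names are mine:

```python
# Toy model of the 4-DIMM map: 3 DIMMs (interleaved) on processor #1,
# one lone DIMM on processor #2. Ratios are illustrative, not measured.

REGIONS = {
    "green": {"home_cpu": 1, "channels": 3},  # 3 DIMMs, 3-way interleave
    "blue":  {"home_cpu": 2, "channels": 1},  # 1 DIMM, single channel
}

def rel_bandwidth(region):
    # Bandwidth scales with the number of interleaved channels
    return REGIONS[region]["channels"]

def rel_latency(cpu, region):
    # A remote (other-processor) access costs roughly 2x a local one
    return 1 if cpu == REGIONS[region]["home_cpu"] else 2

for cpu in (1, 2):
    for region in ("green", "blue"):
        print(f"CPU{cpu} -> {region}: bandwidth x{rel_bandwidth(region)}, "
              f"latency x{rel_latency(cpu, region)}")
```

Running this prints four very different bandwidth/latency combinations, which is exactly the unevenness a thread experiences depending on where the OS happens to schedule it and where its memory happens to land.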
These problems stem from two limitations of Windows 2003 Standard: it maxes out at 4GB, and it doesn't grok NUMA, so it doesn't take memory location into account when assigning threads to CPUs. What you'll see in our 4-DIMM scenario above is that threads will sometimes have dramatically different memory bandwidth and latency, resulting in very uneven performance.
Two ideas might jump to your mind on how to fix this:
1. You could tell the BIOS to build a memory map by assigning alternate slices of memory to different CPUs. HP BIOS actually lets you do this; it's called “Node Interleaving”, and can be selected in RBSU as an alternative to "NUMA" style assignment. (This "interleaving" shouldn't be confused with the memory channel interleaving done by each processor). With Node Interleaving enabled, the 4GB of memory will be evenly split across both processors:
Why won't this fix things? Well, you've broken the "groups of 3" rule, so your bandwidth problem becomes...well, broader. 100% of the time, your memory bandwidth will be only 2/3 of what it could be. Plus, you've still got the latency problem, and it's even more unpredictable: about half the time, a thread will get assigned to the "wrong" processor, slowing that thread down.
2. You could install six 1GB DIMMs in the server, like this:
The OS will only see 4GB of memory (actually a little less, because of the so-called "Memory Hole" issue).
Why won't this fix things? Well, it'll fix the bandwidth issue, since both the blue and green regions now have nice, fast bandwidth. However, the latency problem isn't solved: Windows still won't assign threads to the "right" processor. You'll end up with a thread on processor #1 wanting memory that's attached to processor #2, and vice versa. The result? At any given time, about 1/3 of your threads will suffer from poor memory latency.
So how can you fix both the bandwidth and latency problems?
You can disable NUMA the old-fashioned way: go with a 1-processor configuration. Now all threads will get the same high bandwidth and low latency. You can use three 2GB DIMMs and still get your "full" 4GB. Obviously, this only works if your applications aren't CPU-bound. (I talked with Mario, and luckily he believes he'll be able to go this route.)
Another way to fix it is to use a NUMA-aware OS. How does that help? When the BIOS builds the memory map, it passes the OS data structures called Static Resource Affinity Tables (SRAT), which describe which processor is attached to which memory region. A NUMA-aware OS can then use that info to decide where to assign threads, helping get rid of the latency problem. Windows 2008 can handle that, as can the Enterprise editions of 2003.
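Here's a toy illustration of what a NUMA-aware OS does with that affinity info. The real SRAT is a binary ACPI table, so the dict below is just a stand-in for the information it conveys, and the region boundaries match our 6-DIMM example:

```python
# Simplified picture of NUMA-aware thread placement. The dict stands in
# for the processor-to-memory affinity the BIOS reports via SRAT.

srat = {
    # memory region (start_gb, end_gb) -> processor it's attached to
    (0, 3): 1,
    (3, 4): 2,
}

def home_cpu(address_gb):
    """Which processor owns the memory at this address?"""
    for (start, end), cpu in srat.items():
        if start <= address_gb < end:
            return cpu
    raise ValueError("address outside known regions")

# A NUMA-aware OS prefers to run a thread on the processor that owns
# the memory that thread uses, avoiding the ~2x remote-access latency.
print(home_cpu(1.5))  # this thread's memory lives on processor #1
print(home_cpu(3.5))  # this thread's memory lives on processor #2
```

The real scheduler, of course, juggles this preference against load balancing and other constraints; the point is simply that with SRAT data available, "which processor is closest to this memory?" becomes an answerable question.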
You'll still want to follow the "groups of 3" rule, though!
Our engineering team has put together a good white paper on how different memory technologies work in computer systems today. It covers the advantages, disadvantages, and architectural models in server design to consider when selecting the best price and performance in memory configurations for your servers.
There have been lots of questions about advanced memory protection and capabilities following the latest HP ProLiant G6 server announcement.
HP BladeSystem announcement page: www.hp.com/go/bladesystem/news
How does DDR3 memory work, and what is the new memory lock-step capability?
Lock-step mode is an advanced memory protection feature supported in many of the G6 servers announced yesterday (3/30/09), including the BL460c G6 and BL490c G6. It takes two of the Xeon 5500 processor's three memory channels and runs them together, which enables 8-bit error correction instead of the 4-bit correction you get in normal Advanced ECC (non-lockstep) mode.

Positives: (1) It achieves the same level of protection as ChipKill*, so there are some additional scenarios in which the system can correct memory errors.

Negatives: (1) You have to leave one of the three memory channels on each processor unpopulated, so you cut your available number of DIMM slots by 1/3. (2) Performance is measurably slower than normal Advanced ECC mode. (3) You can only isolate uncorrectable memory errors to a pair of DIMMs (instead of down to a single DIMM).

Lock-step mode is not the default operation; it must be enabled in RBSU. We don't know how many customers will want to use it.

*"Normal" ECC can correct single-bit errors and detect double-bit errors. HP's term "Advanced ECC" means that the server corrects single-bit errors, detects multi-bit errors, and corrects some multi-bit errors that occur on the same DRAM. Advanced ECC is not exactly the same thing as ChipKill, which is an IBM term. In some but not all scenarios, Advanced ECC offers the same protection as ChipKill.
We are updating the "HP Advanced Memory Protection technologies - Technology Brief" to include info about new features like this: http://h18004.www1.hp.com/products/servers/technology/whitepapers/index.html?jumpid=servers/technology