Over the holidays, I took my daughter to see a movie about a Cajun frog. I admit to a pang of jealousy as I passed crowds lined up for screenings of "Avatar". However, given my situation -- namely a 5-year-old clamoring for popcorn and a Princess movie -- I made the best choice for my needs.
It turns out those Avatar-watching throngs got to see the result of another best choice: one made by a group of IT experts in Miramar, New Zealand.
Weta is an Academy Award-winning studio that did the digital effects for Avatar. The imaginations at Weta Digital have created some incredible virtual realities. Jim Ericson from Information Management quotes Weta's Paul Gunn as explaining that 'if it's something that doesn't exist, we'll make it.' Pretty amazing innovations coming from a relatively small place on the other side of the world from Hollywood and Silicon Valley.
In an article and blog, Jim sketches for us the 4000-server facility Weta used to render the VFX of the blockbuster. One eye-opener: the final output from this behemoth server farm fits on a single hard drive.
Weta's space- and power-constrained facility uses advanced techniques like blades and water cooling. Performance is a paramount need -- so much so that their server clusters account for seven of the 500 fastest supercomputers in the world. But their workloads didn't just need massive scalability; they also required high bandwidth between individual server nodes, and relatively local storage.
As Jim points out, they chose to build their infrastructure with HP BladeSystem, using the double-dense BL2x220c server blade. This very innovative, compact server (shown in the video below) let them achieve, in their words, 'greater processing density than anything else found on the market'.
Actually, any engineer could stick 64 Intel® Xeon® processors into a 17-inch-high box and get it to run. However, very few computer companies have the expertise -- and resources -- to make such a thing affordable and efficient, and to be able to warrant that it will run without pause for 3+ years.
Even more important: Weta possessed something relatively rare when they chose HP BladeSystem -- they were already experts in bladed architectures. Their prior infrastructure was based on IBM blade servers, so they already knew the space- and power-saving benefits of blades. What they were seeking was the best bladed architecture, and they determined that, for them, HP BladeSystem was the best choice.
Last week at IDF, two Intel technologists spoke about two different fixes for the same problem: the typical server's growing inability to keep up with its own compute capacity.
For the past 5 years, x86 CPU makers have boosted performance by adding more cores within the processor. That's enabled servers with ever-increasing CPU horsepower. RK Hiremane (speaking on "I/O Innovations for the Enterprise Cloud") says that I/O subsystems haven't kept pace with this processor capacity, moving the bottleneck for most applications from the CPU to the network and storage subsystems.
He gives the example of virtualized workloads. Quad-core processors can support the compute demands for a bunch of virtual machines. However, the typical server I/O subsystem (based on 1Gb Ethernet and SAS hard drives) gets overburdened by the I/O demand of all those virtual machines. He predicts an imminent evolution (or revolution) in server I/O to fix this problem.
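To put rough numbers on it (mine, not RK's): a two-socket quad-core box can comfortably host a dozen VMs, and if each VM averages just 100Mb/s of network traffic, that's roughly 1.2Gb/s of demand converging on a single 1Gb Ethernet link -- before you count a single storage I/O.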
Among other things, he suggests solid-state drives (SSDs) and 10 gigabit Ethernet will be elements of that (r)evolution. So will new virtualization techniques for network devices. (BTW, some of the changes he predicts are already being adopted on ProLiant server blades, like embedded 10GbE controllers with "carvable" Flex-10 NICs. Others, like solid-state drives, are now being widely adopted by many server makers.)
Hold on, said Anwar Ghuloum. The revolution that's needed is actually in programming, not hardware. There are still processor bottlenecks holding back performance, and they stem from software that hasn't made the shift to the parallelism that x86 multi-core requires.
He cites five challenges to mastering parallel programming for x86 multi-core:
* Learning Curve (programmer skill sets)
* Readability (ability for one programmer to read & maintain another programmer's parallel code)
* Correctness (ability to prove a parallel algorithm generates the right results)
* Scalability (ability to scale beyond 2 and 4 cores to 16+)
* Portability (ability to run code on multiple processor families)
Anwar showed off one upcoming C++ library called Ct from RapidMind (now part of Intel) that's being built to help programmers solve these challenges. (Intel has a Beta program for this software, if you're interested.)
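For a feel of why those challenges exist, here's a plain C++ sketch (my own hand-rolled std::thread code, not Ct) of a parallel sum -- notice how many of the five challenges land squarely on the programmer:

    // A hand-rolled parallel sum in plain C++ (std::thread, not Ct).
    // Partitioning, synchronization, and scaling are all handled manually --
    // exactly the burdens Anwar's list describes.
    #include <algorithm>
    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    double parallel_sum(const std::vector<double>& data) {
        // Scalability: ask the runtime for the core count instead of
        // hard-coding 2 or 4.
        unsigned n = std::max(1u, std::thread::hardware_concurrency());
        std::vector<double> partial(n, 0.0);
        std::vector<std::thread> workers;
        std::size_t chunk = data.size() / n;

        for (unsigned i = 0; i < n; ++i) {
            std::size_t lo = i * chunk;
            std::size_t hi = (i + 1 == n) ? data.size() : lo + chunk;
            // Correctness: this is lock-free only because each worker writes
            // its own slot of 'partial' -- and proving that is the
            // programmer's job, not the compiler's.
            workers.emplace_back([&, i, lo, hi] {
                partial[i] = std::accumulate(data.begin() + lo,
                                             data.begin() + hi, 0.0);
            });
        }
        for (auto& w : workers) w.join();
        return std::accumulate(partial.begin(), partial.end(), 0.0);
    }

    int main() {
        std::vector<double> v(1 << 20, 1.0);
        // Sums 2^20 ones, so this prints 1048576.
        std::cout << static_cast<long>(parallel_sum(v)) << "\n";
    }

A library like Ct aims to take that partitioning and synchronization out of the programmer's hands, which is what makes the readability and correctness items tractable.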
To me, it's obvious that the "solution" is a mix of both. Server I/O subsystems must improve (and they are), and ISVs are getting better at making applications scale with core count.
For some financial and data-acquisition applications, it's more important to finish one calculation super-fast than to finish a bunch of calculations slightly slower. There's a group of HPC apps with a similar requirement: two identical instructions need to have precisely the same latency, every time they're executed.
Real-Time Operating Systems (RTOSes) can help address these two scenarios. These OSes attack latency in a number of ways; for example, by ditching the device-polling and background cleanup tasks that standard OSes normally do.
However, some features of modern industry-standard servers can hurt low- and consistent-latency computing. For example, low-power processor modes might save power, but any such processor throttling can increase latency. Another example would be management routines that consume CPU cycles, such as routines built into the BIOS of ProLiant server blades that occasionally use CPU cycles to track resource utilization and monitor correctable memory errors in the memory controller.
If you face these situations and have already gone with an RTOS, HP's got some settings in our RBSU (ROM BIOS Setup Utility) that can offer additional help.
Load up RBSU (accessed by pressing F9 while the system is booting), and change the following settings:
1) Set "ProLiant Power Regulator Mode" to "Static High Mode".
2) Disable processor c-state support.
3) If you are running an application that is single-threaded, set "Processor Core Disable" to "One Core Enabled".
4) On Intel Xeon 5500-based servers (like the BL460c G6), disable "QPI Power Management", and ensure "Intel Turbo Boost Technology" is set to "Enabled".
If you want to go even further, there's a way to disable some of those periodic BIOS checks on processor utilization and correctable errors. For most G5 and G6 server blades, HP has a tool called conrep (provided with the SmartStart Scripting Toolkit) that lets you control these settings.
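For example, a conrep run typically looks like the sketch below -- treat the file names as placeholders, and check the Toolkit documentation for the exact setting names on your server model:

    rem Capture the server's current BIOS settings into a data file
    conrep -s -f current.dat

    rem Edit the captured file to switch off the periodic checks, then
    rem load the modified settings back onto the server
    conrep -l -f modified.dat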
On the BL280c G6, BL460c G6, and BL490c G6, you can also disable those checks straight from RBSU. Hit "Control-A" within RBSU, and some additional options will appear in the "Service Options" menu.
On Ed Groden's post in the Intel Server Room, Mario Valetti asks about memory configurations for an Intel Xeon 5500-based server running a 32-bit OS. Ed gives a couple of good rules of thumb for what memory to use, including his rule #4, the ultra-important suggestion to populate DIMMs in groups of 3 for each processor. However, the rules break down if you're stuck running Windows Server 2003 Standard edition.
Mario is confronting this issue on a new Nehalem-based server (a ProLiant DL380 G6 running Intel Xeon 5500 series processors, to be precise), but actually the problem isn't new to the x86 space. Multiprocessor AMD Opteron-based servers have had a NUMA architecture for a few years now, resulting in similar problems on Windows 2003. Nehalem makes it notably worse, though...and it's because of that pesky rule about "groups of 3".
I'll explain the exact problem, then give a couple of solutions.
By default, the BIOS on an HP DL380 G6 server (and any of the 2-processor Nehalem-based server blades) builds a memory map by placing all the memory attached to processor #1 first, followed by all the memory on processor #2. So in our 4-DIMM scenario -- say, four 1GB DIMMs, with three on processor #1 and one on processor #2 -- the memory would look like this:
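    0GB to 3GB : three 1GB DIMMs on processor #1   <-- the "green" region
    3GB to 4GB : one 1GB DIMM on processor #2      <-- the "blue" region

(The three-and-one split is my assumption for the scenario; it's the split that produces the 3x bandwidth gap described below.)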
Because of the "groups of 3" rule -- which is based on Nehalem processors' ability to interleave memory across their three memory channels -- memory bandwidth in the green region will be about 3x the bandwidth in the blue region. I'll call this the "bandwidth" problem.
Also, memory accesses from processor #2 to the green region will have about twice the latency of accesses to that same region from processor #1. This is the "latency" problem.
These problems stem from two limitations of Windows 2003 Standard: it maxes out at 4GB, and it doesn't grok NUMA -- it doesn't take memory location into account when assigning threads to CPUs. What you'll see in our 4-DIMM scenario above is that threads will sometimes have dramatically different memory bandwidth and latency, resulting in very uneven performance.
Two ideas might jump to your mind on how to fix this:
1. You could tell the BIOS to build a memory map by assigning alternate slices of memory to different CPUs. HP BIOS actually lets you do this; it's called "Node Interleaving", and can be selected in RBSU as an alternative to "NUMA"-style assignment. (This "interleaving" shouldn't be confused with the memory channel interleaving done by each processor.) With Node Interleaving enabled, the 4GB of memory will be evenly split across both processors:
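    0GB |#1|#2|#1|#2|#1|#2| ... |#1|#2| 4GB

(Each small stripe of addresses maps to the other processor's DIMMs in turn; the exact stripe size is up to the BIOS.)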
Why won't this fix things? Well, you've broken the "groups of 3" rule, so your bandwidth problem becomes...well, broader. 100% of the time, your memory bandwidth will be only 2/3 of what it could be. Plus, you've still got the latency problem, and it's even more unpredictable: about half the time, a thread will get assigned to the "wrong" processor, slowing that thread down.
2. You could install six 1GB DIMMs in the server, like this:
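    processor #1: three 1GB DIMMs, one per channel   <-- "green", 0GB to 3GB
    processor #2: three 1GB DIMMs, one per channel   <-- "blue", 3GB and up

(Only about 1GB of processor #2's DIMMs ends up below the 4GB line that Windows 2003 Standard can see.)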
The OS will only see 4GB of memory (actually a little less, because of the so-called "Memory Hole" issue).
Why won't this fix things? Well, it'll fix the bandwidth issue, since both the blue and green regions now have nice, fast bandwidth. However, the latency problem isn't solved: Windows still won't assign threads to the "right" processor. You'll end up with a thread on processor #1 wanting memory that's attached to processor #2, and vice versa. The result? At any given time, about 1/3 of your threads will suffer from poor memory latency.
So how can you fix both the bandwidth and latency problems?
You can disable NUMA the old-fashioned way: go with a 1-processor configuration. Now, all threads will get the same high bandwidth and low latency. You can use three 2GB DIMMs and still get your "full" 4GB. Obviously, this only works if your applications aren't CPU bound. (I talked with Mario, and luckily he believes he'll be able to go this route.)
Another way to fix it is to use a NUMA-aware OS. How does that help? When the BIOS builds the memory map, it passes data structures called Static Resource Affinity Tables (SRATs) to the OS, which describe which processor is attached to which memory region. A NUMA-aware OS can then use that info to decide where to assign threads, helping get rid of the latency problem. Windows 2008 can handle that, as can the Enterprise editions of 2003.
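To see what "NUMA-aware" buys you, here's a rough C++ sketch of what such an OS (or a NUMA-aware application) can do on Windows Server 2008 with that SRAT-derived topology. The calls are documented Win32 APIs, though a real program would add error handling:

    // Pin a thread and its memory to the same NUMA node.
    // Requires Windows Server 2008 / Vista or later (_WIN32_WINNT >= 0x0600).
    #include <windows.h>
    #include <iostream>

    int main() {
        // How many NUMA nodes did the SRAT describe to the OS?
        ULONG highest = 0;
        GetNumaHighestNodeNumber(&highest);
        std::cout << "NUMA nodes: " << (highest + 1) << "\n";

        // Which logical processors live on node 0?
        ULONGLONG cpus = 0;
        GetNumaNodeProcessorMask(0, &cpus);

        // Keep this thread on node 0's processors...
        SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)cpus);

        // ...and allocate its working memory from node 0 as well, so the
        // thread never pays the remote-access latency penalty.
        void* buf = VirtualAllocExNuma(GetCurrentProcess(), NULL,
                                       64 * 1024 * 1024,
                                       MEM_RESERVE | MEM_COMMIT,
                                       PAGE_READWRITE,
                                       0 /* preferred NUMA node */);
        // ... do the real work in buf ...
        VirtualFree(buf, 0, MEM_RELEASE);
        return 0;
    }

Windows 2003 Standard has no equivalent of this; it hands out threads and pages with no regard for what the SRAT says.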
You'll still want to follow the "groups of 3" rule, though!