Kevin Houston recently asked which Intel Xeon 5500-series processor was most popular for running VMware. My short answer: Probably the 2.26GHz Xeon E5520. Long answer follows...
Intel makes more than a dozen different speeds of the Xeon 5500 series processor. Why so many? One reason is that Intel divides them up into three different groups, based on their features. Within each group, there are 3 or 4 different models, each with a different CPU clock frequency.
The Advanced group of processors includes full-speed links & features; the Standard group has the features but slower links; and the Basic group lacks Hyper-Threading & Turbo Boost:
|Category||QPI throughput||Memory speed||Hyper-Threading||Max Turbo Boost|
|Advanced||6.40 GT/s||1333 MT/s||Yes||+400 MHz|
|Standard||5.86 GT/s||1066 MT/s||Yes||+266 MHz|
|Basic||4.80 GT/s||800 MT/s||No||None|
So which ones are popular for virtualization? I found a good proxy for estimating this: searching for how often each model is mentioned in the VMware community forums. There, four models rise to the top: The E5504, the E5520, the X5550, and the X5570. The E5520 gets the winner's ribbon for most-mentioned.
Why these four? I believe 3 of the 4 are popular because they're the best price/performance within each of Intel's groups. The last gets attention because it's the fastest generally-available Xeon 5500 processor.
To show why the price/performance stands out for the E5504 (a "Basic" processor), E5520 (a "Standard"), and X5550 (an "Advanced"), compare Intel's "1ku" list price (basically their bulk-rate prices) to a processor-oriented benchmark result. I've picked the SPECcpu2006 integer rate benchmark for two reasons: One, it's an established benchmark with lots of published results, and two, the benchmark is so processor-focused that you usually get the same results no matter what server -- or server vendor -- you use.
I created a Yahoo Pipe that scrapes benchmark results from SPEC.org and processor list prices from Intel, then plots them using a Google chart. Just for fun, I've listed the latest 2-processor AMD Opteron results too. Here's a link to the pipe itself (it's actually a series of pipes.) Although the Yahoo pipe is dynamic, on this blog I'm just going to post a static picture of the chart that it produces:
Each processor model is plotted by list price (vertical axis) and cpu2006_rate (horizontal axis). The further right on the chart, the more performance; the further up, the more expensive.
You'll notice that the results fall into three vertical bands, with the dual-core E5502 as an outlier. Within each band, the SPECint results don't change much, but the price does.
That means within these bands, you might as well pick the cheapest processor, because you don't get much extra horsepower by moving to a pricier CPU. What are the cheapest within each band? It's those 3 models that get all the discussion on the VMware forums.
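The "cheapest in each band" rule is easy to sketch in code. This is only an illustration: the function name `cheapest_per_band`, the fixed-width bucketing, and the sample rows are all hypothetical placeholders, not real list prices or SPEC results. On the chart the bands are judged by eye, so bucketing scores by a fixed width is an assumption.

```python
from collections import defaultdict


def cheapest_per_band(processors, band_width):
    """Return the cheapest model in each performance band.

    processors: iterable of (model, list_price, benchmark_score) tuples.
    band_width: scores are bucketed by integer division by this width
                (an assumed stand-in for eyeballing bands on the chart).
    """
    bands = defaultdict(list)
    for model, price, score in processors:
        bands[int(score // band_width)].append((price, model))
    # min() on (price, model) tuples picks the lowest price in each band.
    return {band: min(entries)[1] for band, entries in sorted(bands.items())}


# Placeholder rows only: NOT real prices or benchmark scores.
sample = [
    ("model-a", 200.0, 105.0),
    ("model-b", 350.0, 110.0),
    ("model-c", 400.0, 160.0),
    ("model-d", 900.0, 168.0),
]
picks = cheapest_per_band(sample, band_width=50)
print(picks)
```

With real data you would feed in the scraped price and score for each model and choose a band width that matches the gaps visible on the chart.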
As I said, the fourth popular model is the X5570, which is the fastest "normal" Xeon 5500 series processor. (The two W-model processors are actually faster, but since they're geared toward workstations and not servers, they're not as commonly available.) So what makes the E5520 the apparent king of the popularity contest? Probably because it's the cheapest Xeon 5500 processor that's got Hyper-Threading.
So how does this VMware Communities "proxy" compare to what processors HP ships on its ProLiant servers? Well, it's pretty close. Across all ProLiant servers, six models stand out: the four above, plus the E5540 (a "Standard") and the L5520 (a special Low-Power "Standard").
If you just consider ProLiant server blades, the E5504 drops off that short-list, while the L5520 gets a lot of use, and the X5570 gets heavier representation than my "VMware communities proxy" would suggest. This makes sense to me, since folks deploying blades tend to be more interested in power-reducing and maximum-performance features.
Here's my count-down of the top technologies that will have the most impact on servers in 2010.
10. DDR3L - The JEDEC spec for low-voltage DDR3 memory came out last year, but 2010 should mark significant adoption of these 1.35-volt DIMMs. Since the memory in a modern, high-memory server can consume more power than the processors, DDR3L will play a key role in helping solve data center power consumption and capacity problems.
9. Oracle Fusion Applications - Currently in beta testing, Oracle Fusion Apps is an evolutionary step in Oracle's piecing together of key technologies from its "standard" products with those it recently acquired, like PeopleSoft and Siebel. In some cases, I expect we'll be learning (and managing) applications that are effectively brand-new.
8. Tukwila and Power7 - The UNIX-oriented mission-critical processors grow beyond dual-core, and get hefty caches shared between cores. Intel expects to bring its Itanium into production in the first part of 2010, while published roadmaps from IBM also put Power7 in the 2010 timeframe.
7. RHEL6 - I haven't seen schedules from Red Hat showing RH Enterprise Linux futures, but based on their plan to move RHEL5 into "phase 2" of their lifecycle in early 2011 (that's basically the "no new features, just bug fixes" phase), 2010 would be the logical year for this virtualization-tuned generation of the OS. Fedora 11 and 12 (now released) were the planned "feature previews" for RHEL6, so we'll see.
6. SPEC virtualization benchmark - I'm making another guess at roadmaps to predict the SPEC Virtualization committee might reveal its plans for a benchmark in 2010. (HP is a committee member, though I'm not personally involved in that; as always on this blog, I'm speaking for myself and not for HP.) VMmark is a great tool, but the SPEC benchmark should boost our ability to do vendor-agnostic comparisons of virtualization systems.
5. SAS SSDs - Solid state drives with a SATA interface have been available for a couple of years in servers. (I think IBM was the first to use them as internal drives on blades.) However, servers have traditionally relied on performance & reliability advantages that the SAS protocol brings, and so SAS SSDs are really going to help bring SSDs into everyday use inside servers.
4. Nehalem-EX - The benefits of an integrated memory controller and Hyper-Threading that emerged with the Intel Xeon 5500 processor will be available to servers with more than 2 processors. Plus, with bigger cache and a beefier memory subsystem, performance will be impressive -- Intel says Nehalem-EX will bring the "largest performance leap in Xeon history".
3. CEE 'ratified' - Converged Enhanced Ethernet (CEE) is the final piece to enable a standardized Fibre Channel over Ethernet (FCoE). This carries the possibility of effectively eliminating an entire fabric from data centers, so there are much-anticipated cost savings and flexibility boosts. Actually, there is no single "CEE" standard; but the key final pieces (the 802.1Qbb and 802.1Qaz standards from IEEE) are targeted for final ratification around mid-2010.
2. Win2003 Server End of Mainstream Support - There are really only two reasons to upgrade an OS: You want some new feature, or the old one can't be patched. For those who are relying on Windows 2003, the chance of the latter happening is about to get larger in 2010, so expect a lot more pressure to upgrade older systems to Server 2008.
1. Magny-Cours processor - Twelve-core x86 processors; enough said. Actually, maybe not: AMD's next-gen Opteron has other performance-boosting features (like additional memory channels), and Magny-Cours will be available for 2-processor as well as 4+ processor servers at the same time. What else? I'm impressed with John Fruehe's comments about AMD's plans to enable 4P performance with 2P economics. I predict Magny-Cours will be the big story in 2010.
Top-ten lists don't seem complete without honorable mentions, so here are my two: Ratification of the PCI Express 3.0 spec, and Microsoft's Madison / Parallel Data Warehouse extension of its SQL Server line.
And finally, one new product that almost appeared on this list, but thankfully didn't: The 0.0635 meter hard drive. The EU's Metric Directive, which comes into effect in 2010, originally prohibited publishing specs in anything but metric units. Among other things, that could have led to a renaming of 2.5-inch and 3.5-inch hard drives. Luckily, later modifications to the EU rules mean the "0.0635 meter drive" won't make its appearance -- at least in 2010.
Last week at IDF, two Intel technologists spoke about different fixes to the problem of compute capacity outpacing the typical server's ability to handle it.
For the past 5 years, x86 CPU makers have boosted performance by adding more cores within the processor. That's enabled servers with ever-increasing CPU horsepower. RK Hiremane (speaking on "I/O Innovations for the Enterprise Cloud") says that I/O subsystems haven't kept pace with this processor capacity, moving the bottleneck for most applications from the CPU to the network and storage subsystems.
He gives the example of virtualized workloads. Quad-core processors can support the compute demands for a bunch of virtual machines. However, the typical server I/O subsystem (based on 1Gb Ethernet and SAS hard drives) gets overburdened by the I/O demand of all those virtual machines. He predicts an imminent evolution (or revolution) in server I/O to fix this problem.
Among other things, he suggests solid-state drives (SSDs) and 10 gigabit Ethernet will be elements of that (r)evolution. So will new virtualization techniques for network devices. (BTW, some of the changes he predicts are already being adopted on ProLiant server blades, like embedded 10GbE controllers with "carvable" Flex-10 NICs. Others, like solid-state drives, are now being widely adopted by many server makers.)
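To get a feel for the network side of that bottleneck, here's a back-of-the-envelope sketch. It assumes every VM gets an even share of a single link and ignores protocol overhead; the function name `per_vm_bandwidth_mbps` and the VM count are hypothetical, chosen only for illustration.

```python
def per_vm_bandwidth_mbps(link_gbps, vm_count):
    """Even share of one network link per VM, in Mb/s.

    Assumes a perfectly fair split and no protocol overhead, so this is
    an upper bound rather than a measured figure.
    """
    if vm_count <= 0:
        raise ValueError("vm_count must be positive")
    return link_gbps * 1000.0 / vm_count


vm_count = 20  # hypothetical consolidation ratio, not from the talk
shares = {link: per_vm_bandwidth_mbps(link, vm_count) for link in (1, 10)}
for link, share in shares.items():
    print(f"{link} GbE shared by {vm_count} VMs: {share:.0f} Mb/s each")
```

The point is only the ratio: moving from 1Gb to 10Gb Ethernet gives each VM ten times the headroom at the same consolidation level.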
Hold on, said Anwar Ghuloum. The revolution that's needed is actually in programming, not hardware. There are still processor bottlenecks holding back performance; they stem from not making the shift in software to parallelism that x86 multi-core requires.
He cites five challenges to mastering parallel programming for x86 multi-core:
* Learning Curve (programmer skill sets)
* Readability (ability for one programmer to read & maintain another programmer's parallel code)
* Correctness (ability to prove a parallel algorithm generates the right results)
* Scalability (ability to scale beyond 2 and 4 cores to 16+)
* Portability (ability to run code on multiple processor families)
Anwar showed off one upcoming C++ library called Ct from RapidMind (now part of Intel) that's being built to help programmers solve these challenges. (Intel has a Beta program for this software, if you're interested.)
To me, it's obvious that the "solution" is a mix of both. Server I/O subsystems must improve (and are improving), and ISVs are getting better at porting applications to scale with core count.
On Ed Groden's post in the Intel Server Room, Mario Valetti asks about memory configurations for an Intel Xeon 5500-based server running a 32-bit OS. Ed gives a couple good rules of thumb for what memory to use, including his rule #4, the ultra-important suggestion to populate DIMMs in groups of 3 for each processor. However, the rules break down if you're stuck running Windows Server 2003 Standard edition.
Mario is confronting this issue on a new Nehalem-based server (a ProLiant DL380 G6 running Intel Xeon 5500 series processors, to be precise), but actually the problem isn't new to the x86 space. Multiprocessor AMD Opteron-based servers have had a NUMA architecture for a few years now, resulting in similar problems on Windows 2003. Nehalem makes it notably worse, though...and it's because of that pesky rule about "groups of 3".
I'll explain the exact problem, then give a couple of solutions.
By default, the BIOS on an HP DL380 G6 server (and any of the 2-processor Nehalem-based server blades) builds a memory map by placing all the memory attached to processor #1 first, followed by all the memory on processor #2. So in our 4-DIMM scenario, the memory would look like this:
Because of the "groups of 3" rule -- which is based on Nehalem processors' ability to interleave memory across their three memory channels -- memory bandwidth on the green region will be about 3x the bandwidth in the blue region. I'll call this the "bandwidth" problem.
Also, memory accesses from processor #2 to the green region will have about twice the latency of memory accesses to that same region from processor #1. This is the "latency" problem.
These problems stem from two limitations of Windows 2003 Standard: It maxes out at 4GB; and it doesn't grok NUMA. It doesn't take into account memory location when assigning threads to CPUs. What you'll see in our 4-DIMM scenario above is that threads will sometimes have dramatically different memory bandwidth and latency, resulting in very uneven performance.
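The two problems can be put into rough numbers with a toy model. This is a sketch under stated assumptions, not a measurement: it assumes the 4-DIMM scenario means three 1GB DIMMs on processor #1 (the green region) and one on processor #2 (the blue region), that bandwidth scales with the number of populated channels, and that a remote access costs about twice a local one. `Region` and `build_numa_map` are hypothetical names.

```python
from dataclasses import dataclass

CHANNELS_PER_CPU = 3         # three memory channels per Nehalem processor
REMOTE_LATENCY_FACTOR = 2.0  # "about twice the latency" (rough figure)


@dataclass
class Region:
    name: str
    owner_cpu: int
    dimms: int
    gb_per_dimm: int = 1  # assumption: 1GB DIMMs

    @property
    def size_gb(self):
        return self.dimms * self.gb_per_dimm

    def relative_bandwidth(self):
        # Assumes one DIMM per channel; bandwidth ~ populated channels.
        return min(self.dimms, CHANNELS_PER_CPU)

    def relative_latency(self, cpu):
        # Local access = 1.0; access from the other processor is slower.
        return 1.0 if cpu == self.owner_cpu else REMOTE_LATENCY_FACTOR


def build_numa_map(regions):
    """Default (NUMA-style) layout: processor #1's memory first, then #2's.

    Returns a list of (name, start_gb, end_gb) tuples.
    """
    layout, base = [], 0
    for region in sorted(regions, key=lambda r: r.owner_cpu):
        layout.append((region.name, base, base + region.size_gb))
        base += region.size_gb
    return layout


green = Region("green", owner_cpu=1, dimms=3)
blue = Region("blue", owner_cpu=2, dimms=1)
memory_map = build_numa_map([blue, green])
print(memory_map)
print("bandwidth ratio green/blue:",
      green.relative_bandwidth() / blue.relative_bandwidth())
print("latency of green seen from CPU 2:", green.relative_latency(cpu=2))
```

An OS that ignores this map will land threads on either processor regardless of which region their memory lives in, which is where the uneven performance comes from.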
Two ideas might jump to your mind on how to fix this:
1. You could tell the BIOS to build a memory map by assigning alternate slices of memory to different CPUs. HP BIOS actually lets you do this; it's called "Node Interleaving", and can be selected in RBSU as an alternative to "NUMA" style assignment. (This "interleaving" shouldn't be confused with the memory channel interleaving done by each processor.) With Node Interleaving enabled, the 4GB of memory will be evenly split across both processors:
Why won't this fix things? Well, you've broken the "groups of 3" rule, so your bandwidth problem becomes...well, broader. 100% of the time, your memory bandwidth will be only 2/3 of what it could be. Plus, you've still got the latency problem, and it's even more unpredictable: about half the time, a thread will get assigned to the "wrong" processor, slowing that thread down.
2. You could install six 1GB DIMMs in the server, like this:
The OS will only see 4GB of memory (actually a little less, because of the so-called "Memory Hole" issue).
Why won't this fix things? Well, it'll fix the bandwidth issue, since both the blue and green regions now have nice, fast bandwidth. However, the latency problem isn't solved: Windows still won't assign threads to the "right" processor. You'll end up with a thread on processor #1 wanting memory that's attached to processor #2, and vice versa. The result? At any given time, about 1/3 of your threads will suffer from poor memory latency.
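The bandwidth side of these two ideas can be compared with a few lines of arithmetic. This is a rough sketch that assumes one DIMM per channel and bandwidth proportional to populated channels. The 3 + 1 split for the default map and the 2 + 2 split for Node Interleaving are my reading of the scenarios, and `slice_bytes` is a hypothetical parameter (the actual interleave granularity isn't stated here). It deliberately leaves the latency side out.

```python
from fractions import Fraction

CHANNELS_PER_CPU = 3  # three memory channels per Nehalem processor


def bandwidth_fraction(dimms_on_cpu):
    """Fraction of one processor's peak memory bandwidth that is usable."""
    return Fraction(min(dimms_on_cpu, CHANNELS_PER_CPU), CHANNELS_PER_CPU)


def interleaved_owner(address, slice_bytes, cpu_count=2):
    """Under Node Interleaving, alternate slices belong to alternate CPUs."""
    return (address // slice_bytes) % cpu_count + 1


# (DIMMs on processor #1, DIMMs on processor #2) for each configuration.
configs = {
    "default map, 3 + 1 DIMMs": (3, 1),
    "node interleaving, 2 + 2 DIMMs": (2, 2),
    "six DIMMs, 3 + 3": (3, 3),
}
summary = {
    name: tuple(bandwidth_fraction(n) for n in dimms)
    for name, dimms in configs.items()
}
for name, per_cpu in summary.items():
    print(name, [str(f) for f in per_cpu])

# Slices alternate between processors as the address increases.
print([interleaved_owner(a, slice_bytes=4096) for a in (0, 4096, 8192)])
```

Only the six-DIMM layout reaches full bandwidth on both processors, which matches the reasoning above.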
So how can you fix both the bandwidth and latency problems?
You can disable NUMA the old-fashioned way: Go with a 1-processor configuration. Now, all threads will get the same high bandwidth and low latency. You can use three 2GB DIMMs and still get your "full" 4GB. Obviously this only works if your applications aren't CPU bound. (I talked with Mario, and luckily he believes he'll be able to go this route.)
Another way to fix it is to use a NUMA-aware OS. How does that help? When the BIOS builds the memory map, it passes along data structures called Static Resource Affinity Tables (SRATs) to the OS, which describe which processor is attached to which memory region. A NUMA-aware OS can then use that info to decide where to assign threads, helping get rid of the latency problem. Windows 2008 can handle that, as can the Enterprise editions of 2003.
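Here's a toy illustration of what "using that info" could look like. It is not how Windows actually schedules threads, and the table below is a drastically simplified, hypothetical stand-in for SRAT memory-affinity entries (real entries use byte addresses and proximity domains). The idea is just that once the OS knows which processor owns which address range, it can prefer the processor closest to a thread's memory.

```python
# Hypothetical, simplified affinity table: (start_gb, end_gb, owning_cpu).
AFFINITY = [(0, 3, 1), (3, 4, 2)]


def cpu_for_address(addr_gb, table=AFFINITY):
    """Look up which processor the memory at addr_gb is attached to."""
    for start, end, cpu in table:
        if start <= addr_gb < end:
            return cpu
    raise ValueError(f"address {addr_gb} GB is not in the affinity table")


def place_thread(working_set_gb, table=AFFINITY):
    """Pick the processor that is local to most of a thread's memory.

    working_set_gb: addresses (in GB) the thread is expected to touch.
    """
    counts = {}
    for addr in working_set_gb:
        cpu = cpu_for_address(addr, table)
        counts[cpu] = counts.get(cpu, 0) + 1
    return max(counts, key=counts.get)


print(place_thread([0.5, 1.5, 3.2]))  # mostly in processor #1's range
print(place_thread([3.1, 3.9]))       # entirely in processor #2's range
```

A NUMA-unaware OS has no such table to consult, so thread placement is effectively blind to memory location.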
You'll still want to follow the "groups of 3" rule, though!