Situation: A glitch is causing unexpected system reboots. After much testing, you identify the problem. A firmware patch should prevent it from recurring. Luckily, you've already got the tools that will let you remotely "flash", or update, your firmware.
Complication: If your system glitches while you're remotely updating firmware, you won't be able to connect to it remotely anymore. Oh...and your system is on another planet.
That's the firmware problem facing a NASA team right now. The Mars Reconnaissance Orbiter has seen unexpected reboots, and engineers believe they've got a patch that could fix it. However, they're worried that a mistake or unexpected reboot during the patch process might leave the satellite so confused it will stop transmitting its data.
ProLiant engineers have actually grappled with this very same problem, though a little closer to home.
Before I explain that, an aside: there's a cool connection between HP engineering and Mars spacecraft. Lossless compression technology developed by HP labs and used in HP's RGS software for workstations was used by NASA for transferring images from the Spirit Rover on Mars.
Here are two ways that ProLiant blades -- including the RGS-using ProLiant WS460c G6 workstation blade -- protect you from this "botched update" scenario:
1. Redundant ROMs - There are two ROM images stored on each blade. One is a "primary" image, used to boot. The other is a "backup" image. Here's a screenshot from RBSU showing the version numbers (dates, actually) of the primary and backup images on one blade.
When you flash a ROM, it actually overwrites the backup image, and then makes this image the new primary. The original primary becomes the new backup. This hedges against both a new image being bad, and against the flash process failing to complete or corrupting the image. (One reason a flash might fail: total loss of power during a flash.)
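To make the swap concrete, here is a minimal Python sketch of the scheme described above. It is a toy model, not HP's firmware: the `RomImage` and `RedundantRom` names, the CRC32 check, and the `fail_after` parameter (which simulates losing power partway through a flash) are all assumptions made for illustration.

```python
import zlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class RomImage:
    version: str
    data: bytes
    checksum: int  # CRC32 the complete image is expected to have

    @classmethod
    def build(cls, version: str, data: bytes) -> "RomImage":
        return cls(version, data, zlib.crc32(data))

    def is_valid(self) -> bool:
        return zlib.crc32(self.data) == self.checksum


class RedundantRom:
    """Toy model of two ROM slots (hypothetical, for illustration only)."""

    def __init__(self, primary: RomImage, backup: RomImage):
        self.primary = primary
        self.backup = backup

    def flash(self, version: str, data: bytes,
              fail_after: Optional[int] = None) -> bool:
        """Write the new image over the backup slot, then swap roles.

        fail_after simulates power loss after that many bytes are written.
        """
        expected = zlib.crc32(data)
        written = data if fail_after is None else data[:fail_after]
        # Only the backup slot is overwritten; the primary is never touched.
        self.backup = RomImage(version, written, expected)
        if not self.backup.is_valid():
            return False  # interrupted flash: the old primary still boots
        # A verified image is promoted; the old primary becomes the backup.
        self.primary, self.backup = self.backup, self.primary
        return True
```

In this model a successful flash leaves the new image as primary and the previous primary as backup, while an interrupted flash leaves the primary exactly as it was. That is the property that matters when the machine is somewhere you can't walk up to.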
By the way, if both ROM images are valid, you can select which one you want to use at boot time from RBSU. Here's a short video showing that:
There's also a manual way, described in the Maintenance and Service Guide, to force a boot to the redundant image by setting some physical DIP switches inside the blade itself.
2. Bootblock - There's actually a third, non-flashable section of a ProLiant ROM. This "boot block" section includes a disaster-recovery feature that lets the server flash a new ROM image, even if both of the existing ROM images are corrupted.
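Putting the two protections together, the decision at power-on might look something like the sketch below. This is an assumption-laden illustration, not documented ProLiant behavior: `choose_boot_path` is a hypothetical name, and the automatic fall-through from primary to backup is my simplification (as noted above, you can also pick the image yourself from RBSU or with the DIP switches).

```python
def choose_boot_path(primary_ok: bool, backup_ok: bool) -> str:
    """Return which path a (hypothetical) boot block would take."""
    if primary_ok:
        return "primary"
    if backup_ok:
        return "backup"
    # Neither flashable image is usable, so only the non-flashable
    # boot block can run; it would flash a fresh image and retry.
    return "disaster-recovery"


for state in [(True, True), (False, True), (False, False)]:
    print(state, "->", choose_boot_path(*state))
```

The point of the third branch is that it lives in the part of the ROM that a flash can never overwrite, so no update failure can remove it.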
BIOS & firmware updates are often used to fix glitches, but HP (and presumably NASA) also uses them to add new features and enhancements. We post release notes that describe all the fixes and enhancements added to each version. Here's a recent one added to the BL460c G6.
For example, one enhancement in this latest version is a "boot override menu" (see screenshot below), displayed by hitting F11 during boot. It lets you specify a "one time" override of the RBSU boot order, so you can boot to some other device. After booting that one time, the system will fall back to its original boot order settings.
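The "one time" behavior is easy to picture as a flag that gets cleared the moment it is used. Here is a small Python sketch of that idea; the `BootConfig` class and the device names are hypothetical and are not how the BIOS stores its settings.

```python
class BootConfig:
    """Toy model of a boot order with a one-time override (hypothetical)."""

    def __init__(self, boot_order):
        self.boot_order = list(boot_order)  # persistent RBSU-style order
        self.one_time_override = None       # set from the F11-style menu

    def set_one_time_override(self, device):
        self.one_time_override = device

    def next_boot_device(self):
        if self.one_time_override is not None:
            # Use the override once, then clear it so the next boot
            # falls back to the normal order.
            device, self.one_time_override = self.one_time_override, None
            return device
        return self.boot_order[0]


config = BootConfig(["hard drive", "CD-ROM", "PXE"])
config.set_one_time_override("PXE")
print(config.next_boot_device())  # PXE
print(config.next_boot_device())  # hard drive
```

The stored boot order is never modified, which is why nothing needs to be undone after the one-off boot.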
For some financial and data-acquisition applications, it's more important to finish one calculation super-fast than to finish a bunch of calculations slightly slower. There's a group of HPC apps with a similar requirement: two identical instructions need to have precisely the same latency, every time they're executed.
Real-Time Operating Systems (RTOS) can help address these two scenarios. These OSes address latency in a number of ways; for example, by ditching the device polling and background cleanup tasks that standard OSes normally run.
However, some features of modern industry-standard servers can hurt low- and consistent-latency computing. For example, low-power processor modes might save power, but any such processor throttling can increase latency. Another example would be management routines that consume CPU cycles, such as routines built into the BIOS of ProLiant server blades that occasionally use CPU cycles to track resource utilization and monitor correctable memory errors in the memory controller.
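One rough way to see whether effects like these matter for you is to time the same piece of work many times and look at the spread rather than the average. The Python sketch below does that. The function names are my own, and Python itself adds plenty of jitter, so treat this as a way to illustrate the measurement rather than a real-time benchmark.

```python
import statistics
import time


def measure_latencies(work, repeats=1000):
    """Time `work()` repeatedly; return per-call latencies in nanoseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter_ns()
        work()
        samples.append(time.perf_counter_ns() - start)
    return samples


def summarize(samples_ns):
    """Report the spread: consistent latency means a small max - min."""
    ordered = sorted(samples_ns)
    return {
        "min": ordered[0],
        "median": statistics.median(ordered),
        "max": ordered[-1],
        "jitter": ordered[-1] - ordered[0],
    }


if __name__ == "__main__":
    samples = measure_latencies(lambda: sum(range(1000)))
    print(summarize(samples))
```

A periodic background routine tends to show up as a handful of samples far above the median, so the "max" and "jitter" figures are the ones to watch when you change a setting.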
If you face these situations and have already gone with an RTOS, HP's got some settings in our RBSU (ROM-Based Setup Utility) that can offer additional help.
Load up RBSU (accessed by pressing F9 while the system is booting), and change the following settings:
1) Set "ProLiant Power Regulator Mode" to "Static High Mode".
2) Disable processor C-state support.
3) If you are running an application that is single-threaded, set "Processor Core Disable" to "One Core Enabled".
4) On Intel Xeon 5500-based servers (like the BL460c G6), disable "QPI Power Management", and ensure "Intel Turbo Boost Technology" is set to "Enabled".
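If you manage more than a couple of blades, it can help to keep the list above as a checklist you can diff against. The Python sketch below is only that: it does not read anything from a server, the `current` settings are typed in by hand, and the "Processor C-State Support" label is my own shorthand for step 2 rather than a confirmed RBSU menu name.

```python
def recommended_settings(single_threaded: bool, xeon_5500: bool) -> dict:
    """Build the low-latency checklist from the four steps above."""
    settings = {
        "ProLiant Power Regulator Mode": "Static High Mode",
        "Processor C-State Support": "Disabled",  # label is an assumption
    }
    if single_threaded:
        settings["Processor Core Disable"] = "One Core Enabled"
    if xeon_5500:
        settings["QPI Power Management"] = "Disabled"
        settings["Intel Turbo Boost Technology"] = "Enabled"
    return settings


def find_mismatches(current: dict, recommended: dict) -> dict:
    """Return {setting: (current value, recommended value)} for each difference."""
    return {
        name: (current.get(name), wanted)
        for name, wanted in recommended.items()
        if current.get(name) != wanted
    }


# Hand-entered example values, not read from real hardware.
current = {
    "ProLiant Power Regulator Mode": "Dynamic Power Savings Mode",
    "Processor C-State Support": "Disabled",
}
wanted = recommended_settings(single_threaded=False, xeon_5500=True)
for name, (have, want) in find_mismatches(current, wanted).items():
    print(f"{name}: {have!r} -> {want!r}")
```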
If you want to go even further, there's a way to disable some of those periodic BIOS checks on processor utilization and correctable errors. For most G5 and G6 server blades, HP has a tool called conrep (provided with the SmartStart Scripting Toolkit) that lets you control these settings.
In the BL280c G6, BL460c G6, and BL490c G6, you can also disable those things straight from RBSU. Hit "Control-A" within the RBSU, and some additional options will appear in the "Service Options" menu.