2:37 AM – The Chief Technology Officer's pager, cell, and home phone light up, along with those of the other executives. On the other end of the line is the data center manager, who frantically explains that the business is offline and there is no ETA on a fix.
Sounds like the makings of a good novel, right? Just imagine: compromised data integrity, operating systems unable to boot, hundreds if not thousands of virtual servers and desktops impacted by a physical-layer outage. Not just a single point of failure, but all redundancy rendered useless by the failure mode. Corrupted data, servers offline, a business unit crippled: any one of these would make for a bad day, but having all of them at the same time is indeed a nightmare. As I am sure most of you have heard at one time or another, Murphy's Law holds that what can go wrong, will go wrong. So ask yourself: do you need a higher level of support?
Recently I was engaged on an ongoing problem that involved all three of the aforementioned catastrophic events. To make matters worse, the customer had just purchased a new solution built on a new technology, Fibre Channel over Ethernet (FCoE), which relies on converged network adapters; in this case, integrated Emulex dual-port adapters.
To add complexity, as if that were possible, the customer had opted for a multivendor hardware, software, and support strategy, which, as you can imagine, led to some finger-pointing. In this particular case, only the HP C7000 enclosure with BL620 and BL685 G7 blades was under HP support; the rest of the environment was vendor X for networking and vendor Y for storage. Feel free to fill in X and Y yourself.
Prior to my engagement, the third-party support teams had been at it for more than 12 days. Needless to say, nerves were frayed and the customer's patience was all but non-existent. This was further compounded by the executive staff questioning the IT director's decision to adopt such a vulnerable solution. The technical team, comprised of technologists from several leading companies as well as the customer's own staff, was convinced the problem revolved around the new FCoE technology. In fact, the customer stated on numerous occasions that this bleeding-edge technology was riddled with bugs. Suffice it to say, any bleeding-edge technology will inevitably encounter a defect, and its safety logic will be scrutinized. Yet although the customer and the third-party support teams felt confident that the underlying issue lay solely with FCoE, every trace captured showed no sign of misbehavior by the technology. With that understanding, you can imagine my difficulty in turning around a ship lost in trying seas.
The key to overcoming a complex, business-impacting event such as this is having a technical support partner that understands both the business and the technology involved. Once my team was engaged, we cordially requested that leads from all parties collaborate on the findings thus far and on all troubleshooting conducted, so that we could clearly understand the full gamut of the failure. Hardware protocol analyzers were deployed, software-based network sniffers were engaged, and system memory dumps were captured by the dozens. Although I could easily write a dissertation on the events that transpired over the hours that followed, I will sum it up by saying the troubleshooting efforts had gone awry, chasing the wrong rabbit down a very dark hole. We asked everyone to step back, regroup, and remember Occam's Razor: the simplest hypothesis, the one with the fewest assumptions that explains the behavior, is most likely the correct one. Within hours, a hypothesis was derived and the root cause was isolated to LUN masking on the storage array.

I know what you're thinking: seriously, something as simple as LUN masking took this much time to resolve? Unfortunately, this is all too often what happens in a multivendor solution. When one vendor (using a well-known technology) tells a customer that its product is configured correctly, everyone tends to steer blindly toward the newest technology as the root of all ill behavior. All in all, a simple LUN masking misconfiguration allowed multiple hosts to see the same storage LUN. What transpired next was one host's failure to boot, followed by another host overwriting the boot sector, followed by... well, I do not need to continue that vicious cycle. As most technologists know, it is ill-advised to allow multiple hosts that are not cluster members to access the same storage LUNs.
In this situation, every time a system was rebuilt, another system would once again overwrite the boot sector, making it appear as if the new technology were failing.
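To see why the failure kept recurring, here is a minimal sketch of the mechanism. All names in it (`SharedLun`, `Host`, `install_os`) are hypothetical and purely illustrative, not any real array or operating system API: two non-clustered hosts are accidentally masked to the same LUN, so whichever host is (re)built last claims block 0, and the other host can no longer boot.

```python
# Illustrative model only: a LUN masking error presents one LUN to two
# non-clustered hosts, and each OS install overwrites the shared boot sector.

class SharedLun:
    """A fake LUN; only block 0 (the boot sector) is modeled."""
    def __init__(self):
        self.boot_sector = None  # which host's boot loader occupies block 0

class Host:
    def __init__(self, name, lun):
        self.name = name
        self.lun = lun  # masking error: both hosts receive the SAME LUN object

    def install_os(self):
        # Installing (or rebuilding) the OS writes this host's boot loader
        # to block 0 of whatever LUN the host can see.
        self.lun.boot_sector = self.name

    def boot(self):
        # A host can only boot if block 0 still holds *its own* boot loader.
        return self.lun.boot_sector == self.name

lun = SharedLun()                                  # one physical LUN...
a, b = Host("host-a", lun), Host("host-b", lun)    # ...exposed to two hosts

a.install_os()
assert a.boot()         # host-a boots fine at first
b.install_os()          # host-b is rebuilt onto the "same" disk
assert not a.boot()     # host-a's boot sector is now gone
assert b.boot()         # and the cycle repeats when host-a is rebuilt
```

Each rebuild simply restarts the cycle, which is exactly why the symptom looked like a flaky new technology rather than a one-line masking mistake on the array.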
Take your time when designing your IT infrastructure. Think about who will support it, and whether or not you have the skilled technologists in your organization to handle the most critical situations conceivable. Remember that always-on support from HP is just that: always on. HP has technologists who handle complex technical issues with business acumen on a daily basis, tackling the world's biggest crises with speed and passion for our customers. And lastly, remember to focus on the real technology problem.