February 29 is just another day for most people. For a few, it is their first birthday in 4 years. For some, it is a time to look to the wisdom of a groundhog on how much winter is left. For Microsoft, and Azure customers, it was a day to experience a problem that hasn’t existed for years.
Yes, due to a flaw in a date routine, a date was calculated incorrectly and Azure became unavailable. To make matters worse, procedural and human errors lengthened the outage. Bill Laing and Microsoft did do something right, though. They fully disclosed the problem, its root cause, and the flaw in the fix that actually prolonged the outage. They even described how Azure works (not in extreme detail, but at a level most readers would understand).
I applaud Bill and Microsoft for doing this. Too many times companies dodge the issue and dance around the explanation. Some even deny that the problem exists. For Bill to blog about the outage and disclose the details that he did is very refreshing to me.
On the other hand, in this day and age, how does a date routine not account for leap years? This is an issue that was encountered, and solved, decades ago. Then, when the fix was being rolled out, MS deployed a module that had been tested with an old version of the Host Agent. And, not surprisingly, once everything was “fixed”, there were still corrupted files and servers to repair. Yes, this was a firefight, and I’m sure customers and management were screaming, but how could their change control and QA/promote-to-production processes let mismatched modules slide through? A coding problem, followed by human error and procedural issues, resulted in a lengthy outage on Azure.
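To show how small this class of bug can be, here is a minimal Python sketch. It assumes the faulty routine did something like "take today's date and add one to the year", which is my illustration and not a copy of Microsoft's code. The function names `naive_add_years` and `add_years` are hypothetical.

```python
import calendar
from datetime import date


def naive_add_years(d: date, years: int = 1) -> date:
    """Bump the year and keep month/day as-is.

    Raises ValueError when d is February 29 and the target year
    is not a leap year, because that date does not exist.
    """
    return d.replace(year=d.year + years)


def add_years(d: date, years: int = 1) -> date:
    """Add whole years, clamping the day to the end of the target month.

    February 29 becomes February 28 when the target year is not a leap year.
    """
    target_year = d.year + years
    # monthrange returns (weekday of the 1st, number of days in the month)
    last_day = calendar.monthrange(target_year, d.month)[1]
    return date(target_year, d.month, min(d.day, last_day))


if __name__ == "__main__":
    leap_day = date(2012, 2, 29)
    print(add_years(leap_day))  # 2013-02-28
    try:
        naive_add_years(leap_day)
    except ValueError as err:
        print("naive version failed:", err)
```

Clamping to February 28 is one design choice; rolling forward to March 1 is another. Either way, the point is that the routine makes an explicit decision instead of producing a date that does not exist.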
BTW, MS has offered a service credit to people experiencing the outage. I appreciate the thought, but does the credit really offset the outage and the impact to businesses? Probably not, but again, I do give MS credit for doing the right thing. I’ve never been a big fan of SLA penalties because they never really offset the business impacts and user perception of IT, but this is one of the few levers we have to pressure service providers. At least MS is trying but I think all the users would rather MS kept the money and invested it in their Quality Assurance and Change Management Processes!
So, my final thoughts are that although MS is doing a lot of good things in response to the outage, the outage sounds completely avoidable to me. If you are considering an external Cloud Solution, make sure you select one that is “commercial strength” and ready to support your enterprise and not one that is cashing in on a consumer feeding frenzy of Cloud Solutions.
Here are a couple of great blogs on related topics by HP Support experts:
Flynn Maloy’s thoughts on why modernizing your infrastructure calls for a bold new look at the support environment
and Joe Polino’s intro to personalized, proactive, simplified support
Learn how HP Cloud Computing Solutions can help your business, and read how HP Technology Services helped one company roll out a cloud data center that reduced its cost of infrastructure and increased resource utilization.