How do you measure uptime? Is it becoming more important in your environment? Is downtime costing your company more or less today when compared to a few years ago?
For many customers, the amount of downtime they experience is increasing, often due to the complexity of new systems. The cost of downtime is also rising, usually because the business relies more heavily on IT systems. In short, for many customers, downtime is a bigger issue than it was in the past.
Coming from a business that spends a lot of time working with mission-critical customers, I've seen some interesting changes over the past few years, especially where uptime measurement is concerned.
I've seen that with virtualization, many workloads that each have lower uptime requirements are consolidated onto fewer platforms. Often, this means the uptime requirements for the platform actually increase compared to those of the individual workloads. However, virtualization also provides benefits, such as moving workloads while they stay online, which allows maintenance to be completed without bringing down the application - a great way to reduce planned downtime.
I've also noticed that as systems get more complex and vendors build more availability into their applications, the overall uptime of the application increases. However, the uptime of an individual node in a cluster may not be as high as that of a standalone, single-node deployment of the application. Why? Because the added complexity of the cluster delivers higher overall availability, but it can sacrifice the ease of management, configuration, and maintenance of the single-node version, resulting in more downtime for any given node.
So, how do you measure uptime in your environment? Do you measure it based on the uptime of the server? Does that change if you can move a virtual machine workload from one system to another to handle planned downtime?
Do you measure uptime based on OS availability? I can move my virtualized workload from one server to another, and the OS stays running. This is wonderful, and definitely helps reduce planned downtime. But if you are running a cluster of virtual machines, and the clustering only tracks whether a server is running (for unplanned downtime) or whether an administrator manually started an online migration (for planned downtime), it is hard to get OS-level or application-level availability measurements.
Do you measure uptime based on application availability? This is easy in a clustered environment when the cluster understands the applications, such as with HP Serviceguard. While this works well for mission-critical applications, it does take some effort to get that level of application integration. And then, how do you measure uptime on a multi-node solution, such as Oracle RAC? Do you measure the uptime of each node, any of the nodes, or all of the nodes? The sketch below shows how much the answer can change the resulting number.
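Here is a minimal sketch, using made-up outage numbers for a hypothetical two-node cluster, of how the same month of node-level data can produce three different uptime figures depending on the definition you pick:

```python
# A minimal sketch, using made-up outage numbers, of how the same month of
# node-level data yields different "uptime" figures for a two-node cluster.
# Nothing here comes from a real system or vendor.

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200-minute measurement window

# Hypothetical downtime minutes observed for each node during the month.
node_downtime = {"node1": 120, "node2": 45}

# Minutes during which *both* nodes were down at the same time,
# i.e. the application as a whole was actually unavailable.
overlapping_downtime = 10

def uptime_pct(up_minutes):
    return 100.0 * up_minutes / MINUTES_PER_MONTH

# Definition 1: average uptime of the individual nodes.
avg_node_uptime = sum(
    uptime_pct(MINUTES_PER_MONTH - d) for d in node_downtime.values()
) / len(node_downtime)

# Definition 2: "any node up" -- the service counts as down only when
# every node is down simultaneously.
any_node_uptime = uptime_pct(MINUTES_PER_MONTH - overlapping_downtime)

# Definition 3: "all nodes up" -- the strictest view; the cluster counts as
# degraded whenever any node is down (union of the two outage windows).
union_downtime = sum(node_downtime.values()) - overlapping_downtime
all_nodes_uptime = uptime_pct(MINUTES_PER_MONTH - union_downtime)

print(f"Average per-node uptime: {avg_node_uptime:.3f}%")  # ~99.809%
print(f"Any-node-up uptime:      {any_node_uptime:.3f}%")  # ~99.977%
print(f"All-nodes-up uptime:     {all_nodes_uptime:.3f}%")  # ~99.641%
```

Same cluster, same month, and the reported uptime ranges from roughly 99.6% to 99.98% purely based on which definition is used.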
So, how do you measure the uptime of your environment, or do you use different measurements for different systems or parts of your environment? How do you navigate vendor uptime claims, especially since different solutions may offer similar claims (e.g., 99.9% uptime) but often measure different things (e.g., application uptime versus physical server or virtual machine uptime)? Do your uptime measurements include planned downtime for maintenance, or just unplanned downtime? The rough arithmetic below shows how much that last distinction matters. Comments or thoughts on how this plays out in the real world are always appreciated.
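As an illustration, here is a back-of-the-envelope sketch of what availability percentages imply and of how counting (or not counting) planned maintenance changes the headline number; the maintenance schedule and outage total are assumptions, not any vendor's figures:

```python
# A back-of-the-envelope sketch of availability math. The maintenance and
# outage hours below are assumptions for illustration, not vendor figures.

HOURS_PER_YEAR = 365 * 24  # 8,760

def downtime_budget_hours(availability_pct):
    """Downtime allowed per year by a given availability percentage."""
    return HOURS_PER_YEAR * (1.0 - availability_pct / 100.0)

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct:>7}% availability -> {downtime_budget_hours(pct):6.2f} hours of downtime/year")

# Suppose a system takes a 2-hour planned maintenance window each quarter
# (8 hours/year) plus 1 hour of unplanned downtime over the year.
planned_hours, unplanned_hours = 8.0, 1.0
unplanned_only = 100.0 * (1.0 - unplanned_hours / HOURS_PER_YEAR)
planned_and_unplanned = 100.0 * (1.0 - (planned_hours + unplanned_hours) / HOURS_PER_YEAR)
print(f"Counting unplanned downtime only: {unplanned_only:.4f}%")       # ~99.9886%
print(f"Counting planned + unplanned:     {planned_and_unplanned:.4f}%")  # ~99.8973%
```

The same system can honestly claim either figure, depending on whether maintenance windows fall inside or outside the measurement.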