Today’s computer systems are marvels of engineering. They are constantly evolving, with faster hardware, new technologies such as touch-screen user interfaces, silicon-based storage and cloud computing, and new application features and functions. A typical computer system, whether it is a phone, a PC or a data center server, is often a unique implementation of hardware, vendor applications and the user’s own applications or settings. These systems change regularly with new revisions of software and upgrades or repairs to the hardware. In some companies, different computer systems within the company are incompatible with one another, as always seems to happen when one company acquires another. In these types of environments, the customer has the goal, and even the expectation, that their computer systems will always be on and available.
So, in such an environment, what could possibly go wrong? Plenty. That is why this will probably be the longest article in this blog.
Let’s face it: hardware failures happen. A disk drive crash in a PC usually takes the whole PC down. Data centers can be made more tolerant of hardware failures with redundant or fault-tolerant hardware and software, but they remain susceptible to hardware failures, too. Personal experience has taught me that the repair of a running fault-tolerant machine requires skilled technicians (imagine getting your car fixed while you are still driving it). Hardware monitoring software can identify problems more quickly and sometimes allows hardware problems to be corrected before an outage, but to think that “hardware does not fail” is an illusion. We should be prepared for when it happens, with backups for those PC disks and support plans for the larger systems.
When it comes to vendor software, one user’s feature may be another user’s bug. The majority of problems with vendor software lie in knowing how to use it, spanning everything from software installation to configuration settings to everyday use. All these can be sources of problems. While reading the manual may be a last resort, it is rarely as effective as speaking to a knowledgeable support engineer who is trained and very familiar with the vendor’s products. Such knowledge need not be limited to the vendor of the product. System integrators and third-party software support organizations can be just as qualified to provide such support. HP provides support for dozens, if not hundreds, of third-party applications. Knowledge is power, and nothing replaces a support engineer familiar with the application.
Software failures happen, and I refer to these as bugs, a delightful term popularized in the days when computers were built with electromechanical relays. Most reported problems with vendor software are usage issues and not bugs. A bug requires a modification to the software to correct the problem. The bug fix can come either as a point correction, which I will call a patch, or as part of a new version of the software. Virtually all patches are rolled into new versions of the software. Sometimes the purpose of a new version of a software product is to integrate multiple patches into an easily installable revision. Support agreements sometimes include right-to-new-version licenses and sometimes do not, depending on the licensing practices of the vendor. Not having the license rights to a known correction can be a problem, because some vendors include bug fixes in their support contracts but not the right to new versions of the software package. No customer likes to hear that a known correction is available in the next release, but that they need to buy five thousand upgrade licenses to get it.
Revisions have their own issues. Computer systems sometimes fail when software revisions are made. A patch or revision can have unintended side effects, and because of the uniqueness of the typical computer system, those side effects are difficult to foresee. Configuration testing is only a partial answer, as configurations constantly change. I consider there to be only two options for revision management: detailed revision change control and management, or prayer. If you think that replacing a disk drive does not involve revision change control and management, then you are in the prayer group and may not know it.
There are also interoperability issues in integrating both hardware and software from other vendors. Third-party storage attached to another vendor’s servers, along with special hardware from the customer or a third party, has its own set of issues. Running on these are third-party software applications custom configured for the customer’s business. Most of these computer systems communicate with other computer systems through either private networks or the internet. The interactions between any of these subsystems can create problems.
Even the environment for computer systems can create problems, such as spikes on a power line, overheating computers, fires, floods, storms or other acts of God. Disaster recovery has its place for non-critical systems, but for critical systems the loss of availability may be sufficiently costly to warrant multiple data center sites or an investment in cloud computing solutions. Loss of network connectivity frequently means loss of application availability. Natural disasters can cause loss of power for weeks. Acts of God are not all that rare.
Then there are security issues, which can be physical, software related, or both. Malware varies from nuisance spyware to targeted viruses intended to do major damage to the systems they infect, including hardware and attached equipment or facilities. This is one area where it is appropriate to feel a little paranoid. In a world of interconnected computers, security issues will be increasingly important to those who value system availability.
Finally, there are people issues. People operate these complex systems, and those people can create problems through their absence (vacations, holidays, retirement, illness, resignation and the like), mistakes, lack of training, and misunderstandings. Operator error is one of the more common causes found for reported problems. In my experience, building systems that tolerate human error is nearly impossible. There is no substitute for trained people.
With so many things that could go wrong, it’s a wonder that these systems work as well as they do.
Next: Why Support In Multi-Vendor Environments Is Hard
MrCollaboration (aka Jim Evans) is an HP Global Services Alliance Manager. He has worked in the IT industry for more than 30 years, 22 of which were spent with Digital Equipment Corporation, Compaq and HP. He works with many third-party vendors and partners to develop processes that facilitate excellent support and service for mutual customers. Jim is also HP’s representative to the Technical Support Alliance Network (TSANet).