I was fortunate to attend a Software Reliability lecture presented by Dr. Samuel Keene, past president of the IEEE Reliability Society. The lecture reinforced many of the basic principles we learned as systems engineers over forty years ago. One "Path to Failure" lecture graphic jumped out at me as extremely important, and I hope I've faithfully reproduced it below:
To set the stage and to gain our attention, Dr. Keene recalled several notorious system failures and labeled each with an assignable cause:
- Massive southeast power outage (2008) - administrative and power control system co-located
- Mars Climate Orbiter (1999) - mix of metric and imperial units
- Patriot missile misfire (1991) - operational profile change
- DSC communications failure (1991) - 4 bits changed in 13 lines of code (LOC), not regression tested
- Alleged F-15 equator navigation system error - operational environment change
- Jupiter flyby - power supply switch programmed to trigger if loss of communications exceeds 7 days
Note: Utility companies, space agencies, and military units are normally not forthcoming with the detailed failure reporting required for deep analysis.
The implementation oversight that starts the path to system failure can precede the fault activation by any length of time, from nanoseconds to infinity.
This statement made me think about all the systems I have designed, programmed, tested, installed, and changed over the years.
Is it possible that I have never created a fault-less system?
What can I do to create a fault-less system next time?
Or can I only be expected to create a system with faults that behaves in predictable, safe ways when those faults are activated?
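To make that last question concrete, here is a minimal Python sketch of a latent fault paired with a guard that fails in a predictable way. It is loosely modeled on the widely published account of the Patriot clock drift, not on anything shown in Dr. Keene's lecture. The 23-bit chopped constant, the 8-hour validated envelope, and every name in the code are my own assumptions for illustration.

```python
from fractions import Fraction

# All values and names below are hypothetical, chosen only for illustration.
TICK = Fraction(1, 10)  # intended tick length: 0.1 s

# The implementation oversight: 0.1 has no exact binary representation,
# so storing it with 23 fractional bits chops off a tiny remainder.
STORED_TICK = Fraction(int(TICK * 2**23), 2**23)

# Operational profile the system was actually tested for: 8 hours of uptime.
VALIDATED_UPTIME_TICKS = 8 * 60 * 60 * 10


def clock_error_seconds(ticks: int) -> Fraction:
    """Drift between true elapsed time and time computed from the chopped constant."""
    return ticks * (TICK - STORED_TICK)


def elapsed_seconds(ticks: int) -> Fraction:
    """Return elapsed time, but refuse to answer outside the validated envelope."""
    if ticks < 0 or ticks > VALIDATED_UPTIME_TICKS:
        # Predictable, safe behavior: stop and say so instead of drifting silently.
        raise RuntimeError("uptime outside validated envelope; restart required")
    return ticks * STORED_TICK


if __name__ == "__main__":
    for hours in (1, 8, 100):
        ticks = hours * 60 * 60 * 10
        print(f"{hours:>3} h uptime: drift = {float(clock_error_seconds(ticks)):.4f} s")
```

The oversight is present from the first tick, but the drift only matters once the operational profile changes: at 100 hours of continuous uptime it reaches roughly a third of a second. The guard in `elapsed_seconds` does not remove the fault. It only turns a silent wrong answer into an explicit, predictable refusal.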