Insights
Insights are commentaries on the world around Critical.
Summer 2010: Software in Deep Water
As always happens, news media have lost interest in a story with which they were totally obsessed only a few weeks ago. The catastrophic effects of the BP oil spill in the Gulf of Mexico will be felt for a long time, perhaps for more than a decade, but the media have moved on. Maybe the long-term effects aren’t as obvious and dramatic as a flock of oil-sodden sea birds struggling pathetically to survive in their ruined habitat. They are felt by the proprietors of, and workers in, a devastated tourist industry. They are felt by pensioners whose investments are shrunk by the need to divert billions from what would have been profits into reparations and damages. They are felt by all of us for whom prices will go up as a result of a diminished appetite for deepwater drilling.
Although the media may have moved on, more responsible interested parties will be spending a long time and a lot of effort trying to figure out what caused the Deepwater Horizon explosion in April 2010, an explosion that, lest we forget, not only caused an environmental disaster but also claimed the lives of eleven people. Perhaps, despite their best efforts, investigators will never be able to tell us what happened, in which case we’ll simply have to be satisfied with speculation, or educated guesswork.
Such speculation has started already, and it has struck a chord with Critical Software, a company that specialises in ensuring that software in safety-critical applications doesn’t fail. How many people know what’s involved in drilling the sea bed for oil? Far from being a purely mechanical process, it actually depends on a lot of software-intensive control systems. It’s not widely appreciated, but most of the sophisticated technology that shapes all our lives depends on a lot of software. Sometimes, software failures are an inconvenience. So you had to restart your PC? Big deal. How about if the pilot’s ‘glass cockpit’ packs up in the middle of your holiday flight? That gives a whole new meaning to the ‘blue screen of death’!
In the case of Deepwater Horizon, it’s clear from the Transocean interim report to the Waxman committee that control system software is falling under suspicion [1]. Reports have already surfaced in the Houston Chronicle [2] that “display screens at the primary workstation used to operate drill controls on the Deepwater Horizon, called the A-chair, had locked up more than once before the deadly accident.” Given the amount of embedded software in oil-rig systems, and the dozens of operations that are carried out under software control, it’s no wonder that software is getting the third degree.
Software is relatively easy to write. Reliable, safety-critical software can be complex and challenging; in truth, however, it often isn’t much harder to write than the software powering the browser that you’re probably using to read this. Getting really close to perfection, though, requires independent testing, so that the developers’ assumptions, and even egos, are not allowed to stand in the way of the quest for those last few elusive bugs.
Independent testing of something that’s already been tested in the normal way, by its developer, is undeniably an extra expense. It’s not a prohibitive expense though – just the one that’s most likely to be cut when money’s tight and financial control is wielded by those who don’t really understand the true value of what they’re cutting. Critical Software’s experience is that cost pressures are all too often allowed to bear on the safety-critical part of the software development process. Do we skip physical safety checks on trains and boats and ’planes? Not likely! So how is it OK to let finance directors and others of their ilk cause the cutting of corners when it comes to the more abstract and less tangible factors in the safety equation?
Until independent testing, by truly qualified testers, is recognised as sacrosanct within safety-critical developments, we’ll continue to have aircraft falling out of the sky, runaway cars, space-launch disasters and, yes, oil rig disasters. At the dawn of a new era of nuclear power generation, it’s time to start changing attitudes now.
1. "Deepwater Horizon Incident—Internal Investigation," draft report, Transocean, 8 June 2010, p. 15; http://energycommerce.house.gov/documents/20100614/Transocean.DWH.Internal.Investigation.Update.Interim.Report.June.8.2010.pdf.
2. B. Clanton, "Drilling Rig Had Equipment Issues, Witnesses Say—Irregular Procedures also Noted at Hearing," Houston Chronicle, 19 July 2010; www.chron.com/disp/story.mpl/business/7115524.html.
Spring 2010: Shutting the stable door BEFORE the horse bolts
Every few weeks, we see a story in the press that highlights the incongruity of saving money by skimping on safety - incongruous because the eventual cost, not just in money but in human misery, is often so high.
Much less frequently but still too often there are real horror stories. Bodies piled up behind locked emergency exits in burned out public buildings: "well we didn't want people sneaking in without paying, and we'd never had a fire before". It's always "never again", but then memories fade and history repeats itself.
As providers of services in the area of safety-critical systems, the stories that concern us directly never have such an obvious focal point as a locked emergency exit. What we routinely encounter from our (potential) clients is the question: is it really necessary for these tests to be done independently? Wouldn't it be cheaper and quicker if we did them ourselves (or not at all)?
Independence is the main buffer against management pressures that conflict with safety considerations. One of the best-known illustrations of the effects of those management pressures dates all the way back to the Challenger disaster in 1986, where a weakness that had been known about since 1977 remained hidden, or was at least ignored, until eventually the inevitable happened. (For those who don't remember, Morton Thiokol engineers realised that the O-rings in the shuttle's solid rocket boosters could become dangerously brittle and prone to failure at low launch temperatures, but their warnings were suppressed by a management more concerned with maintaining their delivery programme and protecting their cashflow.)
The Challenger case, when it eventually came out, was pretty much as obvious as a locked emergency exit, but often the scenarios are much more subtle. The end results can be at least as tragic though.
Closer to home, it's taken years for the background to emerge of one of the RAF's worst ever peacetime accidents. The crash of Chinook ZD576 on the Mull of Kintyre in June 1994 was officially blamed on the two pilots after an enquiry that looks more and more like a whitewash. The AAIB and the RAF Board of Inquiry were left unaware of a 1993 report by EDS which found that the software of the Full Authority Digital Engine Control (FADEC) was dangerously flawed in both design and implementation. In fact, EDS had given up its evaluation after inspecting 2,897 of the total 16,254 lines because they were finding so many errors and anomalies. The full murky story can be found here. It's obvious that basic principles of safety systems engineering were squeezed out and that, regardless of the cause of the 1994 crash, the Chinook was not airworthy when it went into RAF service.
Coming right up to date, the latest example of safety systems principles being relegated by (im)pure business considerations, at the cost of many lives, is the Toyota debacle. This story has still to play out, but it is already clear that a company that had previously put such emphasis on quality processes, doing things right and doing things better, fell into the trap of starting to prefer all-out growth. This would have been bad enough in the days when throttles were controlled by means of a cable running between a pedal and a carburettor, but since "fly by wire" came to the automobile, software safety engineering has been in the mix. How many others besides Toyota have fallen into the trap of "penny wise, pound foolish" when it comes to taking the necessary steps to ensure that complex software-intensive systems are fit and safe for purpose?
The moral in all this is simple. If a thing can go wrong, it eventually will. Making sure that things can't go wrong costs money, but it costs a lot less than the consequences when they do. Manufacturers shouldn't gamble, or be allowed to gamble, with lives at stake. They will continue to do so until the complexities of modern systems, and in particular software-intensive systems, are more widely understood by those in authority, including legislators.
