Ariane 501: One of "History's Worst Software Bugs" is not a software bug.

Orcmid posts a welcome and well-researched critique of the widespread software engineering legend regarding the Ariane 501 rocket launch failure. Dennis summarizes:

It is simply not the case that a programming error involving a numeric conversion (some call it an overflow) was the root cause of the failure of Ariane 5 Flight 501. That's not what happened.

Understanding that the failure of Ariane flight 501 is a systems engineering failure, a failure in both process and practice, is crucial in learning to build and deploy software based systems. There are engineering lessons in this worth learning, among them lessons in design, testing, and risk assessment.

Dennis asks why software and technology people keep using the Ariane 501 story as an example of a software bug. I was speculating that the software process explanation is avoided because of a desire to reduce complex things (like, um, human process and system QA practice) to a simpler domain (software construction). (We make a similar mistake when we reduce human behavior to brain biochemistry.)

But perhaps it's more a matter of people accepting what they've heard or read. I want to write about SW bugs, what are the big ones? Oh, look, here's a story about the Ariane 5 rocket launch failure. Cool, I'll use this one. Maybe it's not a surprise that not many people are interested in the full story; they're just looking for an example, and since this one is written up in many places it must be true.

Maybe it's a bit of both, and then some. It would be good to look at this one more closely.


1 Comment

Orcmid's "why it's not a software bug" is convincing -- to a software pro -- but what about the great unwashed? "Well, wasn't it a problem with the program?" You'd have to give a thorough explanation, if not a course, in requirements definition. But I still agree; it speaks more about how we train people in proper expectations.

I do have some data on software failure #3, according to the "History's List": the Jan 15, 1990 AT&T network "earthquake" as I choose to call it. That failure was a software failure, a testing failure, and a management failure. The following week, I happened to be at a Chinese New Year's party in Lincroft. A large portion of the QA team who _fixed_ the bug was there. This is my best recollection of those conversations.

The bug was a result of a shell programmer having been promoted to a C programmer. The programmer used a "break" statement that would have been correct as a shell statement but that, in C, amounted to a wrong or misplaced semicolon. The code compiled cleanly in C despite having the wrong semantics, and it sat in a block not routinely executed. Apparently no volume or load testing had been performed, as it would have revealed the flaw. Sadly, and this wasn't widely reported, a simple "lint" of the code would also have revealed the flaw. It was wide practice in (some parts of) AT&T to hold code reviews and to supply a lint report, if not a "lint-free" bit of code. Any remaining lint would be marked with the available "That's OK"-type of comment. The tiger team said that was not the practice in their lab.
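To make the "compiles fine, wrong semantics" point concrete, here is a small hypothetical sketch in C. It is not the AT&T code, and I am only guessing at the general shape from the recollection above; the function names and the sentinel convention are invented for illustration. It shows how one stray semicolon after an `if` can turn a conditional `break` into an unconditional one while the file still compiles.

```c
#include <stddef.h>

/* Hypothetical illustration only: NOT the actual AT&T switch code.
 * Both functions are meant to count the entries that come before
 * the first negative "sentinel" value in v[0..n-1]. */

/* Intended behavior: leave the loop only when the sentinel is seen. */
size_t count_before_sentinel(const int *v, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] < 0)
            break;          /* conditional: taken only at the sentinel */
        count++;
    }
    return count;
}

/* Same code with one stray semicolon after the if condition. */
size_t count_before_sentinel_stray(const int *v, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] < 0);      /* empty statement: the test guards nothing */
            break;          /* now unconditional: runs on the first pass */
        count++;            /* never reached */
    }
    return count;
}
```

For an array such as {3, 1, 4, -1, 5}, the first function returns 3 and the second returns 0. A compiler is entitled to accept both versions, which is why the comment above leans on lint: an empty `if` body followed by an indented statement is the kind of thing a lint-style checker would typically flag.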

I call it a management failure for two reasons: a programmer who was probably unqualified for the job was assigned to it, and there was no higher-level policy in place, which should have caught the bug at either of two levels.

My estimate of the lost revenue to AT&T was $50M, with good-will costs probably doubling that amount. I was overheard repeatedly saying, "I'd have loved to have a dime on the dollar, for all that cost us, to have dropped 'lint' into a default-settable pre-compiler pass". This had been an active discussion just a few years before at the Unix Systems Lab, where I was a product manager; the proposal was vetoed for lack of available funding.