washingtonpost.com
When Fail-Safe Fails

By Charles B. Perrow
Sunday, June 28, 2009

One subway train in Washington crashes into another, killing nine and injuring scores. A French airliner disappears in a thunderstorm and falls into the Atlantic Ocean; there are no survivors. A suspected cause in both instances? Computer failure.

These two recent accidents have monopolized newspaper front pages and raised concerns about the technology with which we humans have surrounded ourselves. We wonder what went wrong, and what role human error might have played.

For an accident specialist like me, the reaction isn't surprising. Even though riding a subway or flying is safer than driving, I know that a multi-car pileup with nine deaths and dozens injured wouldn't get as much attention as Washington's Metrorail crash. That's because in driving, we feel we have some personal control. On a plane or in a train, we give most of that control over to computers. We are at their mercy -- and when we're reminded that they can be faulty, it scares us.

The ultimate question in these tragedies is: Can we really trust computers as much as we trust ourselves? For some things, perhaps not. But if we want to travel faster and in more comfort, we have to let ever more computerization into our lives. And that means that we have to focus more on the humans who interact with the computers.

It's true that there are many examples of highly computerized trains that are accident free (as far as computer failures go), mindlessly looping around airport terminals with no operator on board. But these are slow, simple, repetitive systems that require only a minimum of human oversight from a distant control room. A transit line such as Metrorail is more complex and requires continual scheduling of shared tracks. So we put a human in charge, in case the computers fail (and perhaps to reassure the riders).

But this human/computer interaction is limited to just two commands -- go and stop. Driving a car to work requires vastly greater skills and attentiveness than operating a train full of people does. The Metro driver apparently had only two tasks: closing the doors and standing over a large stop button that she would press if she saw an obstruction ahead that the computer had failed to sense and stop for.

While it's unclear whether this particular operator could have saved the situation, there is something wrong with a computer/operator interface that demands few skills and leads to inattentiveness. Alertness is always a problem, particularly when there's only one operator, so it's important to give people something to do. A subway operator could have the task of monitoring the automatic crash protection system, for example. There have been examples of airline pilots falling asleep on long flights, but alertness warnings after five minutes of inactivity may have solved that. More important, in the supposedly "normal" six-hour flight I took in a cockpit jump seat when I was working on a book on "normal accidents," there was no time for the pilots to doze off. Two of the three radios went out, unexpected weather fronts required diversions, and some warning lights were misbehaving. There is much to keep pilots alert, including co-pilots, and in emergencies the demands are intense.

The Air France case is much more complicated than the Metro case. Modern airliners simply would not be able to fly if they weren't filled with computers. They are designed more like darts or rockets than older-generation aircraft in order to fly ever faster, higher and cheaper. No human could manage all the control surfaces and thrusts without the aid of massive computers. The human component of a large airliner, the pilot, is extensively trained, has considerable experience with simpler aircraft and spends a good deal of time in huge, expensive simulators practicing all sorts of emergencies. (When I was a member of a National Academy of Sciences team that was studying computer failures, I spent some time in one of these, and failed completely to land my simulated fighter on the deck of a simulated aircraft carrier. It was frighteningly realistic.)

Some experts think that the Air France flight's computer was unable to handle heavy turbulence and the failure of air speed sensors that probably iced up. Those sensors sent incoherent signals to the computer, which may not have been designed to anticipate this particular combination of parameters. It's not unheard of. Last October, as a Qantas Airbus A330 flew through turbulence, its computers mistakenly registered an imminent stall, disconnected the autopilot and commanded a strong downward pitch to pick up speed. The crew was able to recover, though 14 people were seriously injured. But even taking over from the computer and "hand flying" the plane still requires significant computer input, even in emergencies that computers themselves may have created.
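To make the idea of "incoherent signals" concrete, here is a minimal sketch, in Python and entirely hypothetical, of how a control routine might cross-check redundant airspeed sensors and refuse to act on readings that disagree. The three-reading vote, the 20-knot threshold and the function name are assumptions for illustration only, not the actual A330 logic.

```python
# Illustrative only: a toy cross-check of redundant airspeed sensors.
# The voting scheme, threshold and fallback are assumptions, not real avionics.

def vote_airspeed(readings, max_spread_knots=20.0):
    """Return a trusted airspeed in knots, or None if the sensors disagree."""
    valid = [r for r in readings if r is not None]
    if len(valid) < 2:
        return None  # not enough data to cross-check
    if max(valid) - min(valid) > max_spread_knots:
        return None  # incoherent: the computer cannot tell which sensor to believe
    return sorted(valid)[len(valid) // 2]  # median of the agreeing sensors


# An iced-up pitot tube shows up as one reading drifting far from the others.
speed = vote_airspeed([272.0, 268.0, 145.0])
if speed is None:
    # A real system must degrade gracefully rather than act on bad data --
    # e.g., disconnect the autopilot and hand control to the crew.
    print("airspeed disagree: disconnect autopilot, alert crew")
```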

In the case of the Air France plane, recovery was apparently not possible; experts believe that the plane simply broke up in the air. The loss of speed sensors by itself need not cause a crash; skilled pilots can recover, though it's dicey. Aircraft have had safe landings after losing their tails; an F-15 fighter plane landed safely after losing one wing in a midair collision during an Israeli training flight. Or recall pilot Chesley Sullenberger, who successfully ditched his US Airways plane in the Hudson River last January after birds flew into its engines.

So how do we best design this "man/machine" interface? That's a tricky question, and controversial. Airbus aircraft emphasize the machine, while Boeing planes give pilots more authority. It's a dilemma in the design of all complex systems, and especially those with catastrophic potential. Airbus has an "integrated" architectural philosophy, which is like Nobel laureate Herbert Simon's example of a watchmaker who assembles all the components of the complex watch at once. Boeing, on the other hand, uses a "modular" design, where the watchmaker first assembles a number of modules and then fits them together.

If the first watchmaker is interrupted in his work, the whole thing falls apart and he has to start over again; the second need only start over on the one module he was working on. In integrated systems such as Airbus, an "interruption" can be the failure of speed indicators, excessive turbulence, a short in the coffee machine (it happens and nearly crashed one plane) or a thousand other small wounds that can bring the whole system down. In the modular system there is more of a chance that the "interruption" will only disable one module and that backups and redundancies will be called into play, or that the pilot can use "work-arounds," taking command of the system.

If "interruptions" such as faulty code, hostile environments, intruders or equipment failures are rare and the consequences are not catastrophic, integrated systems are preferable, because they're cheaper to build and will run faster. But if interruptions are not rare, and especially if the consequences of failure can be catastrophic -- e.g. loss of many lives, widespread pollution, cascading destruction of adjacent systems -- then modular systems are preferable.

Airliners sit on the edge of this debate. Failures can be catastrophic but are very rare; interruptions do occur, but they are infrequent and the recovery rate is extremely high. There are 1,000 Airbus A330s flying right now, and despite many near-misses, this is their first fatal crash. Perhaps Airbus still has enough modularity to ensure recovery, and Boeing enough integration to fly fast and efficiently.

The problems of subways are not so much technological as financial and managerial: a lack of funds for modernization and maintenance, a failure to act on warnings, and the like. Some of this may have played a role in the Air France crash as well; the airline apparently delayed replacing air speed sensors. But as we move up the ladder of catastrophe, the human/computer interface expands. The computers multiply as we demand more of our systems, and the cognitive load on the humans who have to work them grows commensurately. But cognition is by nature limited. To push the envelope of travel (and so much else in our technological society), we'll have to program more and more of our brain capacities into the computer.

Still, in the end, we can't untangle the human factor from the machine. Even if all the machine parts work perfectly, it's up to humans to keep them functioning that way.

charles.perrow@yale.edu

Charles B. Perrow, emeritus professor of sociology at Yale, is the author of "Normal Accidents" and "The Next Catastrophe."

© 2009 The Washington Post Company