Of course, waivers aren't always a good thing... Constraint waivers are too often misused. A waiver is intended to explain why a certain constraint doesn't hold in a given case. A waiver that passes a 0.51 percent per day leaky tire had better be accompanied by the engineer's computations showing that the tire still satisfies some larger safety requirement.
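To make that concrete, here is a minimal sketch of the kind of computation that ought to ride along with such a waiver. The mission length and the minimum acceptable pressure below are hypothetical figures chosen purely for illustration; the point is that the waiver carries the arithmetic, not that these particular numbers are right.

```python
# A hypothetical back-of-the-envelope check for the leaky-tire waiver.
# MISSION_DAYS and MIN_FRACTION_REQUIRED are invented for illustration.

LEAK_RATE_PER_DAY = 0.0051    # 0.51 percent of remaining pressure lost per day
MISSION_DAYS = 30             # assumed mission duration
MIN_FRACTION_REQUIRED = 0.80  # assumed: must retain 80% of nominal pressure

def remaining_fraction(days: int, leak_rate: float = LEAK_RATE_PER_DAY) -> float:
    """Fraction of nominal pressure left after `days`, assuming the loss
    compounds daily (each day loses a fixed fraction of what remains)."""
    return (1.0 - leak_rate) ** days

frac = remaining_fraction(MISSION_DAYS)
print(f"After {MISSION_DAYS} days the tire holds {frac:.1%} of nominal pressure")
print("waiver defensible" if frac >= MIN_FRACTION_REQUIRED else "waiver not defensible")
```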
In Challenger's case, the SRBs flew under what was effectively a standing waiver. This is because acknowledging a design flaw in a Criticality 1R assembly means grounding the fleet for two years or more while it's fixed. Waivers that simply sidestep safety constraints in order to improve production capacity will eventually bite you.
What happens is a phenomenon called the normalization of risk. It is the bad end of the probability game. It means that if you allow an unsafe condition to occur, and no consequence follows immediately, you wrongly believe that the system remains safe. The shuttle did not explode the first time the SRB field joints failed. Hence there arose the notion that those elastomeric seals were not as critical as originally believed. How wrong they were.
We introduce design margins to accommodate unforeseen conditions. The system operates safely in those cases because although it wanders briefly outside the operational envelope, it does not exceed the physical envelope. Normalization of risk often consumes a design margin to increase production capacity. When that occurs, the system can no longer accommodate momentarily excessive circumstances. Yes, the SRB field joints can accommodate erosion under normal flight conditions. But under cold-weather conditions and excessive wind shear (i.e., excessive bending moments in the casing), the safety that would have been provided by the design margin simply isn't there. And the system fails.
Normalization of risk is a chronic human-nature condition that plagues all engineering.
Although I'm not a forensic engineer, from time to time I do read accident investigation reports. That's the work product of a forensic investigation. It's a great way to get an overview of the field and to get a glimpse into the methods.
They [accidents] invariably seem to happen after a long series of events, any one of which, had it not happened, would have prevented the accident. Those are multiple-mode failures. Several things have to conspire -- often in an improbable combination -- to fail the system. Weather, combined with operator inattention, combined with some key failure, for example. Any one of them alone would be inconsequential. These occur because they are insanely difficult to design against. In any complex system, the number of single-mode failures is already daunting. Two- and three-way combinations are simply too numerous to imagine.
The Apollo 13 accident is a classic example. Had any of a number of steps toward failure not been taken, the accident would have been averted. But each step seems reasonably innocent. Dropping the tank is unfortunate, but recoverable. Running the heater on ground-support equipment (GSE) was a standard procedure. Failure to test the thermostat under high electrical load was not itself seen as disastrous. We accept these momentary risks because we believe the system as a whole to be resilient. If Step G doesn't catch the failure, Step R will.
This brings up a condition we call systemic loafing, or social loafing. It's social loafing when human activity dominates the system, and systemic loafing when automation predominates. We've long known that if you have a process that employs one quality-control inspector, adding another inspector in sequence actually reduces overall system quality. This is because when there is only one inspector, he knows that the buck stops with him. But with two or more inspectors, one will always believe that any mistake he misses will be caught by the other one. So he will tend to be less diligent.
The opposite of a multiple-mode failure is a common-mode failure. That is when two or more components fail because of the failure of some other component to which they are both connected. We like to design systems to reduce criticality. We like to reduce coupling. These design factors affect how the system behaves in the face of component failure.
If you have a fluid reservoir that also functions as a heat exchanger, failure there will produce both thermal and quantity-related effects. The system will run hotter. It may also suffer from over- or under-capacity errors.
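As a toy illustration of that kind of coupling -- all of the component names here are hypothetical -- consider two subsystems that look independent but share the same reservoir, so a single broken component surfaces as two different symptoms at once.

```python
# A toy model of a common-mode failure: two subsystems that look independent
# both depend on the same reservoir, so one broken component shows up as two
# unrelated-looking symptoms. All names are hypothetical.

class Reservoir:
    """Does double duty as fluid supply and heat sink."""
    def __init__(self) -> None:
        self.healthy = True

class CoolingLoop:
    def __init__(self, reservoir: Reservoir) -> None:
        self.reservoir = reservoir
    def status(self) -> str:
        return "nominal" if self.reservoir.healthy else "running hot"

class FluidSupply:
    def __init__(self, reservoir: Reservoir) -> None:
        self.reservoir = reservoir
    def status(self) -> str:
        return "nominal" if self.reservoir.healthy else "quantity error"

shared = Reservoir()
cooling, supply = CoolingLoop(shared), FluidSupply(shared)
shared.healthy = False                         # one failure in the shared component...
print(cooling.status(), "/", supply.status())  # ...two simultaneous symptoms
```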
You see the same thing in shipping. Actually, that's a very instructive arena, because the shipping accident rate has remained largely unimproved for 50 years. Despite huge strides in automation, satellite navigation, and shipbuilding, we still experience a relatively constant rate of accidents.
This is because advances intended to improve safety, such as autopilots, are being used instead to achieve greater production capacity. For example, if a GPS system improves your knowledge of the ship's position, operators realize they can run harbor channels at higher speed because they no longer need the positioning margin. Automation also means you can run a ship with a smaller crew, which leads to more fatigue-related accidents.
What this tells engineers is that humans have an inherent (and largely fixed) notion of acceptable risk. When customers say, "We want the system to be safer," what they are really saying is, "We want to get more out of our system for the same level of safety."
I've also learned that there are definite limits to human reliability. Indeed, and as systems become more complex and harder to understand, the operators of these systems run up against hard and fast limits in human comprehension.
We saw this in the Three Mile Island nuclear power plant accident, and in the Apollo 13 accident. If you look at how operators responded to the problem, you find that they were simply unable to grasp the scope and nature of the failure at the time. It took them a long time to realize that they were experiencing something beyond a simple failure.
Operators tend to adopt a de minimis hypothesis early in the accident sequence and to filter incoming information based on that hypothesis. For nearly an hour, Apollo 13 controllers believed they were looking at a simple failure that was being aggressively misreported in the telemetry. For about that long, Three Mile Island operators failed to consider that their safety systems themselves were malfunctioning.
Humans just aren't well suited to situations where nothing happens for a very long time, and then suddenly and without warning you have to make a crucial decision. Quite true. I had brunch on Sunday with someone who trains engineers (the train-driving kind) for Union Pacific. Fatigue and attention deficit are the biggest problems he faces.
I think there's an unwarranted belief out there that the human is ultimately more reliable than the machine, and that's just not always so. That's very true. The joke goes that a modern airliner should be flown by one man and one dog. The dog is there to bite the man if he tries to touch the controls, and the man is there to feed the dog. Modern flight control systems are exceedingly adept, and are much more capable than a human of flying an airplane safely and efficiently under normal circumstances, and even under many abnormal ones.
As you note, sometimes simple automation is best. In loss-of-control flight accidents, very often you see (i.e., by examining the DFDR) the pilot trying to regain control of an airplane that has gone into spins, dives, or other uncontrolled maneuvers. And very often you can see that the pilot's command inputs are largely ineffective because he lacks an appropriately detailed spatial awareness of his situation. In those same instances the autopilot is shown to have been better at recovering the airplane. This is because autopilots are dumb: they're just simple control systems that map inputs to outputs. The roll-channel controller says, "Hm, my roll attitude is way off and my roll rate is excessive; let me apply my ailerons at the appropriately aggressive deflection." And this works because the controller is paying attention to only two inputs, and has only one output to manipulate. Similarly pig-headed thoughts are going on in the minds of the pitch and yaw controllers. The combined effect is a deliberate application of the right combination of control inputs to correct the overall attitude errors. It isn't a human pilot flailing at the controls. The autopilot isn't awash in adrenalin and pushed beyond its capacity by a survival instinct.
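As a rough sketch of that kind of "dumb" controller -- the gains, limits, and sign conventions here are made-up placeholders, not values from any real autopilot -- a roll channel can be little more than a proportional-derivative loop watching two inputs and driving one output.

```python
# A sketch of the "dumb" roll channel: a proportional-derivative loop that
# watches only roll-attitude error and roll rate and commands only aileron
# deflection. Gains, limits, and sign conventions are invented placeholders.

from dataclasses import dataclass

@dataclass
class RollChannel:
    kp: float = 0.8               # gain on roll-attitude error (deg of aileron per deg of error)
    kd: float = 0.3               # damping gain on roll rate (per deg/s)
    max_deflection: float = 20.0  # aileron travel limit, degrees

    def command(self, roll_error_deg: float, roll_rate_dps: float) -> float:
        """Oppose the attitude error, damp the roll rate, clamp to travel limits."""
        deflection = self.kp * roll_error_deg - self.kd * roll_rate_dps
        return max(-self.max_deflection, min(self.max_deflection, deflection))

# Example: banked 60 degrees right of target and still rolling further right.
roll = RollChannel()
print(roll.command(roll_error_deg=-60.0, roll_rate_dps=40.0))  # -> -20.0, full opposite aileron
# The controller simply holds the limit until error and rate subside;
# it never panics and never flails at the controls.
```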
But there is a similar backlash misconception among some engineers that automation is inherently more reliable. In fact, engineering a system for reliability and self-regulation more often than not requires engineered safety devices (ESDs). And these are themselves engineered components that can go wrong. For example, at Three Mile Island a pressure-operated relief valve opened as designed to relieve pressure in the coolant loop, but then failed to close again. Often we rely on ESDs to sit dormant for many years, untested and untried, and then to function perfectly in the one instance where they are required.
At best, ESDs add to the system's complexity. They will help if they are working properly. They will hurt if they are not themselves well built and maintained.
There is a fine art to designing safety and warning systems. Warning systems that go off for no good reason annoy operators and normalize them to the danger they represent. Often operators will disable a "faulty" warning system, or simply ignore it unless it signals a condition that is visibly harmful.
My theater employs a large mobile stage system designed by Scala in Canada, the same company that automates Cirque du Soleil. It is phenomenally powerful, and rather complex. One of its ESDs is a dead-man's switch that is meant to be operated by a spotter down near the machinery. The spotter inspects the stage in motion and releases the switch (stopping the mechanism) if something goes wrong. The dead-man's switch spends most of its time wedged between a conduit and the wall, held closed and unattended.
Another is a set of astragals that guard pinch and shear hazards. "Astragal" is the technical term for those rubber bumpers on the edges of elevator doors that signal a blockage. The astragals are extremely sensitive, and the normal operation of the stage sometimes bumps them and trips the system. As with most ESDs, the safety interlock cuts power to the actuators and the control relays. Resetting the astragal, rebooting the controller, and advancing to the appropriate part of the program takes, at best, 45 seconds -- an eternity in live theater. Hence there was considerable production pressure to avoid tripping the interlocks.
It took a near-fatal accident, in which a stagehand was dragged gruesomely into a formerly guarded pinch hazard, to shock the operators into restoring the human safety factors and to re-engineer the control and safety system so that it (a) would not be so sensitive to normal operational modes, and (b) could be reset fast enough to maintain normal show operations.
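I won't claim this is how the rebuilt system actually works, but here is one way, in a purely hypothetical sketch, that those two requirements might be reconciled in software: ignore momentary brushes against an astragal during normal motion, still latch hard on sustained contact, and clear the latch without rebooting the controller.

```python
# Hypothetical astragal interlock: ignores brushes shorter than DEBOUNCE_SECONDS,
# latches on sustained contact, and resets without a controller reboot.
# All names and timing constants are assumptions for the sake of the sketch.

DEBOUNCE_SECONDS = 0.25   # assumed: shorter contact is a brush, not a trapped person

class AstragalInterlock:
    def __init__(self) -> None:
        self.tripped = False
        self._contact_start = None

    def sample(self, contact_closed: bool, now: float) -> None:
        """Call on every control-loop tick with the current sensor state and time."""
        if self.tripped:
            return                        # latched until an explicit reset
        if not contact_closed:
            self._contact_start = None    # contact released before the debounce window
            return
        if self._contact_start is None:
            self._contact_start = now     # contact just began
        elif now - self._contact_start >= DEBOUNCE_SECONDS:
            self.tripped = True           # sustained contact: cut actuator power here

    def reset(self) -> None:
        """Clear only the latch after the hazard is inspected; the controller
        keeps its program state, so the show resumes in seconds, not minutes."""
        self.tripped = False
        self._contact_start = None

# Typical use inside the scan loop (assumed):
#   interlock.sample(read_astragal(), time.monotonic())
#   if interlock.tripped: open_safety_relays()
```

Note that even the debounce interval is a safety judgment: it trades a fraction of a second of protection for immunity to nuisance trips, and somebody has to be able to defend that trade.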
"Oh, that alarm always goes off -- it's never right," is one of the big hassles in safety engineering.
Humans are best when it comes to making complex reasoned judgments with plenty of time to do so... Well, reasonably good. There is no machine that can substitute for human judgment, but human judgment is as likely to fail the system as it is to save it. Humans have the ability to think creatively, and that's why operators are still required.