Complexity doesn't always mean unreliability

ka9q
Saturn

Posts: 1,292

Complexity doesn't always mean unreliability Jun 23, 2008 19:29:22 GMT -4

Post by ka9q on Jun 23, 2008 19:29:22 GMT -4

In the discussion of parts defects, it's correctly pointed out that many "defects" were trivial or simple misunderstandings that wouldn't stop a mission or kill the crew. But there's a bigger issue that should be specifically addressed. Many non-engineers assume that the more parts you add to a system, the less reliable it necessarily gets. This simply isn't true. It might get more or less reliable.

The most obvious example is redundancy. If you have three fuel cells when you only need one to power the mission, you have three times the parts. And indeed, the chance that at least one will fail does nearly triple. But the overall system reliability goes UP because only one of the three has to work. If the probability of failure of one unit is p, a number between 0 and 1, then the probability of all three failing (assuming they do so independently) is p^3 -- a much smaller number.
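As a rough sketch of the arithmetic above, assuming the units fail independently and using a purely illustrative per-unit failure probability of 0.01 (a made-up number, not an Apollo figure; the function names are hypothetical too):

```python
def prob_at_least_one_fails(p: float, n: int) -> float:
    """Chance that one or more of n independent units fails."""
    return 1.0 - (1.0 - p) ** n


def prob_all_fail(p: float, n: int) -> float:
    """Chance that all n independent units fail, i.e. the system is lost
    when a single working unit is enough to carry the mission."""
    return p ** n


p = 0.01  # hypothetical per-unit failure probability, for illustration only
print("one unit fails:          ", p)
print("at least one of 3 fails: ", prob_at_least_one_fails(p, 3))
print("all 3 fail (system loss):", prob_all_fail(p, 3))
```

For small p the middle figure comes out just under 3p, which is the "nearly triple" above, while the last figure is p^3.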

The actual reliability of a system can be very complex to determine from the reliability of its parts. One of Apollo's major contributions to engineering was to develop the methods to predict (and maximize) the overall reliability of a large and very complex system built from many components, none of which can ever be totally reliable.

Grand Lunar
Welcome to the moon.

Posts: 868

Complexity doesn't always mean unreliability Jul 9, 2008 21:50:34 GMT -4

Post by Grand Lunar on Jul 9, 2008 21:50:34 GMT -4

Quite true. Nuclear plants on ships are rather complex, with all the redundant systems available. Of course, those reactors are also meant to be self-regulating, to a point.

"You're mistaking our universe for someone else's." - Capt. Archer

JayUtah

Posts: 5,253

Complexity doesn't always mean unreliability Jul 30, 2008 17:47:35 GMT -4

Post by JayUtah on Jul 30, 2008 17:47:35 GMT -4

I meant to address this long ago.

Part count is indeed a very crude measurement of systemic complexity. A machine with one part isn't necessarily simpler than a machine with 10 parts; it really depends on how the 10 parts work together. You can't use part count as a predictor of complex behavior without qualitative knowledge of the design. The real problem with systemic complexity is not Part A and Part B, but the possibly many (and possibly unforeseen) ways in which Part A and Part B might collude to fail the system. That is, complexity is about interactions between elements, not the individual element behaviors.

Naturally if there are few interactions, there will be little increase in complexity as part count increases. A lighted billboard with three light bulbs wired properly in parallel does not decrease in reliability when a fourth is added. However, such a billboard with bulbs wired unwisely in series (think: old Christmas lights) increases systemic complexity by adding a fourth bulb. When wired in series, all the bulbs must close the circuit in order for any of the bulbs to light. Hence the overall reliability of the lighting system is the product of the individual reliability figures for each light; the more you have, the less reliable the system is.
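A minimal sketch of the billboard comparison, assuming independent bulbs with a made-up per-bulb reliability of 0.9, and assuming (my reading, not something stated above) that the parallel billboard counts as working while at least one bulb is lit:

```python
from math import prod


def series_reliability(bulbs: list[float]) -> float:
    """Every bulb must conduct, so the reliabilities multiply."""
    return prod(bulbs)


def parallel_reliability(bulbs: list[float]) -> float:
    """The sign stays lit unless every bulb has failed
    (assumption: one lit bulb is enough)."""
    return 1.0 - prod(1.0 - r for r in bulbs)


r = 0.9  # hypothetical per-bulb reliability, for illustration only
for n in (3, 4):
    bulbs = [r] * n
    print(n, "bulbs: series", series_reliability(bulbs),
          "parallel", parallel_reliability(bulbs))
```

Under these assumptions, going from three bulbs to four lowers the series figure and raises the parallel one.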

In your example of the three fuel cells, you omitted the most important element for assessing systemic complexity: the plumbing and wiring. Yes, if the probability of failure for one fuel cell over 10 days is p, and you need only one to maintain operations, then the overall probability of system failure by that measure, for that time, is properly p³. But by augmenting the system with plumbing containing isolation valves, check valves, manifolds, heaters, sensors, flow regulators, pressure regulators; and with wiring containing isolation circuits, bus regulators, and so forth, you introduce more system elements that may themselves fail.
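One hedged way to put numbers on this: treat all the shared plumbing and wiring as a single element in series with the redundant fuel cells. Both probabilities below are invented for illustration, the function name is hypothetical, and lumping the plumbing into one independent element is a simplifying assumption rather than how the real system would be analyzed.

```python
def system_failure(p_cell: float, n_cells: int, q_shared: float) -> float:
    """Probability of losing power when n redundant cells all depend on one
    shared element (manifold, valves, bus) that fails with probability
    q_shared. Independence of all failures is assumed."""
    all_cells_fail = p_cell ** n_cells
    # The system survives only if the shared element works AND
    # at least one cell works.
    return 1.0 - (1.0 - q_shared) * (1.0 - all_cells_fail)


p = 0.01   # hypothetical per-cell failure probability
q = 0.001  # hypothetical failure probability of the shared plumbing
print("cells alone (p^3):    ", p ** 3)
print("with shared plumbing: ", system_failure(p, 3, q))
```

However many cells are added, the result in this model never drops below q: the shared element sets the floor.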

This is exactly what doomed Apollo 13. The multiple fuel cells shared a necessarily complex reactant supply manifold. The manifold had not been provided with suitable isolation valves to prevent accidental venting of the fuel cell reactants in certain types of failure. The designers had not identified the particular Apollo 13 failure as a potentially hazardous situation, because there are many such modes and foreseeing all of them is hard. With only one fuel cell, the manifold would have been eliminated, the overall system design simpler, and potential problems easier to see and design against. Redundancy didn't help the situation because a criticality had been hidden in the complexity of the machinery required to achieve and implement the redundancy, and it failed all the fuel cells by starving them of reactants. It didn't matter that multiple fuel cells and multiple reactant sources were provided: the failure lay in the way in which those redundant elements were tied together so they could operate. Redundancy doesn't always eliminate criticality; sometimes it just makes it harder to spot.

The real problem in complex systems is the inability of the designers to provide for all patterns of failure -- indeed even to know about all patterns of failure. This is because, as complex systems become more complex, it becomes impossible for any one human to hold in his mind all the component behaviors and the cross-product of potential interactions. Indeed most real-world failures involve significant interactions among and failures of engineered safety devices (ESDs) that lead to unplanned interactions, false indications, or disastrous control system excursions.

The inability of designers and operators to fully conceptualize system behavior, and the failures of ESDs themselves (e.g., the pilot-operated relief valve on the pressurizer), are exactly what failed the control system at Three Mile Island. A simpler system might actually have prevented the accident.

Data Cable

Posts: 1,351

Complexity doesn't always mean unreliability Jul 31, 2008 1:29:44 GMT -4

Post by Data Cable on Jul 31, 2008 1:29:44 GMT -4

Jul 30, 2008 17:47:35 GMT -4 JayUtah said:

A lighted billboard with three light bulbs wired properly in parallel does not decrease in reliability when a fourth is added. However, such a billboard with bulbs wired unwisely in series (think: old Christmas lights)

...Or a <cough> RAID-0 array.

"Earth diameter is 7,900 miles, and Moon diameter is 2,160 miles. It takes on average 90 minutes to complete one Earth orbit, so one Moon orbit should take roughly 25 minutes." - Sam "NasaScam" Colby

"you data is still open for interpretation, after all a NASA employee might of wipe a booger or dropped a hair on it" - showtime

DataCable²⁰¹² A+

ka9q
Saturn

Posts: 1,292

Complexity doesn't always mean unreliability Apr 9, 2010 3:07:49 GMT -4

Post by ka9q on Apr 9, 2010 3:07:49 GMT -4

Jul 30, 2008 17:47:35 GMT -4 JayUtah said:

This is exactly what doomed Apollo 13. The multiple fuel cells shared a necessarily complex reactant supply manifold. The manifold had not been provided with suitable isolation valves to prevent accidental venting of the fuel cell reactants in certain types of failure.

Do we actually know that the oxygen manifold leaked? I thought all we knew is that the complete failure of O₂ tank 2 somehow caused tank 1 to slowly lose its contents as well over the next hour or two; the exact location of the leak could not be determined.

Apollo 14 added a third O₂ tank, and most importantly it was placed in a different sector of the service module to lessen the chance of losing all three tanks to a single-point failure (e.g., the explosion of one tank).

There were many ways that the Apollo 13 accident could have been much worse. Happening after the LM lifeboat was no longer available was just one of them. Because the leak that emptied O₂ tank 1 was relatively slow, some fuel cell power was still available for another hour or so. Had tank 1 also been lost immediately, the CM would have discharged its entry batteries much more deeply. Starting with Apollo 14, the SM carried a 400 amp-hour auxiliary battery (same type as a lunar module descent battery) that, while it would not have been enough to get Apollo 13 home, would have substantially eased the time pressure in shutting down the CM after the explosion.

I don't think the auxiliary battery was ever used during Apollos 14-17, but such batteries were critical to Skylab. The CSM fuel cells continued to operate after docking until their reactants were depleted in a few weeks, so the aux batteries were needed for independent flight after the crew undocked from the station to come home. The tanks, as well insulated as they were, were not able to store reactants for the entire duration of a stay. During the ride home I presume the crew got its breathing O₂ from the CM surge tank.

Last Edit: Apr 9, 2010 3:08:55 GMT -4 by ka9q

ka9q
Saturn

Posts: 1,292

Complexity doesn't always mean unreliability May 10, 2010 11:11:12 GMT -4

Post by ka9q on May 10, 2010 11:11:12 GMT -4

I checked the Apollo 13 review board report. There's no definite conclusion about the location of the slow leak that depleted O₂ tank 1, but they give one intriguing possibility: its relief valve could have been unseated by the physical shock of the rupture of tank 2.

There are check valves that would have kept tank 1 from emptying through the ruptured tank 2. If they worked without leakage, of course.

Last Edit: May 10, 2010 11:12:07 GMT -4 by ka9q
