In the discussion of parts defects, it's correctly pointed out that many "defects" were trivial or simple misunderstandings that wouldn't stop a mission or kill the crew. But there's a bigger issue that should be addressed specifically. Many non-engineers assume that the more parts you add to a system, the less reliable it necessarily gets. That simply isn't true: depending on how the parts are arranged, the system may get more reliable or less.
The most obvious example is redundancy. If you have three fuel cells when you only need one to power the mission, you have three times the parts. And indeed, the chance that at least one will fail does nearly triple. But the overall system reliability goes UP because only one of the three has to work. If the probability of failure of one unit is p, a number between 0 and 1, then the probability of all three failing (assuming they do so independently) is p^3 -- a much smaller number.
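The arithmetic above is easy to check numerically. A minimal sketch, with a hypothetical per-cell failure probability of 0.05 chosen purely for illustration:

```python
# Redundancy arithmetic: hypothetical failure probability for one fuel cell.
p = 0.05

# Chance that AT LEAST one of three cells fails: nearly triple the single-cell figure.
p_any_fail = 1 - (1 - p) ** 3

# But the mission needs only one cell, so the SYSTEM fails only if all three do
# (assuming independent failures).
p_system_fail = p ** 3

print(f"P(at least one cell fails) = {p_any_fail:.6f}")   # ~0.142625
print(f"P(all three cells fail)    = {p_system_fail:.6f}") # 0.000125
```

So the chance of seeing *a* failure almost triples, while the chance of losing the *function* drops by more than two orders of magnitude.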
The actual reliability of a system can be very complex to determine from the reliability of its parts. One of Apollo's major contributions to engineering was to develop the methods to predict (and maximize) the overall reliability of a large and very complex system built from many components, none of which can ever be totally reliable.
Part count is indeed a very crude measurement of systemic complexity. A machine with one part isn't necessarily simpler than a machine with 10 parts; but that really depends on how the 10 parts work together. You can't use part count as a predictor of complex behavior without qualitative knowledge of the design. The real problem with systemic complexity is not Part A and Part B, but the possibly many (and possibly unforeseen) ways in which Part A and Part B might collude to fail the system. That is, complexity is about interactions between elements, not the individual element behaviors.
Naturally if there are few interactions, there will be little increase in complexity as part count increases. A lighted billboard with three light bulbs wired properly in parallel does not decrease in reliability when a fourth is added. However, such a billboard with bulbs wired unwisely in series (think: old Christmas lights) increases systemic complexity by adding a fourth bulb. When wired in series, all the bulbs must close the circuit in order for any of the bulbs to light. Hence the overall reliability of the lighting system is the product of the individual reliability figures for each light; the more you have, the less reliable the system is.
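The billboard comparison can be sketched in a few lines. This assumes a hypothetical bulb reliability of 0.9 and treats "parallel reliability" as the chance that at least one bulb lights:

```python
def parallel_reliability(r, n):
    # Properly wired billboard: something stays lit unless EVERY bulb fails.
    return 1 - (1 - r) ** n

def series_reliability(r, n):
    # Old Christmas lights: every bulb must conduct, so reliabilities multiply.
    return r ** n

r = 0.9  # hypothetical reliability of one bulb
for n in (3, 4):
    print(f"{n} bulbs: parallel={parallel_reliability(r, n):.4f}, "
          f"series={series_reliability(r, n):.4f}")
```

Adding the fourth bulb pushes the parallel figure up (0.999 to 0.9999) and the series figure down (0.729 to 0.6561): same part count, opposite effect, purely because of how the parts interact.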
In your example of the three fuel cells, you omitted the most important element for assessing systemic complexity: the plumbing and wiring. Yes, if the probability of failure for one fuel cell over 10 days is p, and you need only one to maintain operations, then the overall probability of system failure by that measure, for that time, is properly p^3. But by augmenting the system with plumbing containing isolation valves, check valves, manifolds, heaters, sensors, flow regulators, pressure regulators; and with wiring containing isolation circuits, bus regulators, and so forth, you introduce more system elements that may themselves fail.
This is exactly what doomed Apollo 13. The multiple fuel cells shared a necessarily complex reactant supply manifold. The manifold had not been provided with suitable isolation valves to prevent accidental venting of the fuel cell reactants in certain types of failure. The designers had not identified the particular Apollo 13 failure as a potentially hazardous situation, because there are many such modes and foreseeing all of them is hard. With only one fuel cell, the manifold would have been eliminated, the overall system design simpler, and potential problems easier to see and design against. Redundancy didn't help the situation because a criticality had been hidden in the complexity of the machinery required to achieve and implement the redundancy, and it failed all the fuel cells by starving them of reactants. It didn't matter that multiple fuel cells and multiple reactant sources were provided: the failure lay in the way in which those redundant elements were tied together so they could operate. Redundancy doesn't always eliminate criticality; sometimes it just makes it harder to spot.
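The effect of shared plumbing on the p^3 arithmetic can be sketched with made-up numbers. This is a hedged illustration, not a model of the actual spacecraft: assume a hypothetical manifold reliability of 0.995 and treat manifold failure as failing all cells at once:

```python
p_cell = 0.05       # hypothetical failure probability of one fuel cell
r_manifold = 0.995  # hypothetical reliability of the shared reactant manifold

# Three redundant cells, all dependent on the one manifold surviving.
p_redundant = 1 - r_manifold * (1 - p_cell ** 3)

# One cell fed directly, no shared manifold (idealized for comparison).
p_single = p_cell

print(f"Triple-redundant w/ shared manifold: P(fail) = {p_redundant:.6f}")
print(f"Single cell, no manifold:            P(fail) = {p_single:.6f}")
```

With these numbers the redundant system is still better overall, but notice where the risk went: the cells contribute 0.000125 to the failure budget while the shared manifold contributes 0.005. The redundancy machinery itself is now the dominant single-point failure mode, which is the Apollo 13 lesson in miniature.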
The real problem in complex systems is the inability of the designers to provide for all patterns of failure -- indeed, even to know about all patterns of failure. This is because, as systems become more complex, it becomes impossible for any one human to hold in mind all the component behaviors and the cross-product of their potential interactions. Indeed, most real-world failures involve interactions with, and failures of, engineered safety devices (ESDs), leading to unplanned couplings, false indications, or disastrous control-system excursions.
The inability of designers and operators to fully conceptualize system behavior, and the failures of ESDs themselves (e.g., the relief valve on the primary loop accumulator), are exactly what failed the control system at Three Mile Island. A simpler system might actually have prevented the accident.
"A lighted billboard with three light bulbs wired properly in parallel does not decrease in reliability when a fourth is added. However, such a billboard with bulbs wired unwisely in series (think: old Christmas lights)"
...Or a <cough> RAID-0 array.
"Earth diameter is 7,900 miles, and Moon diameter is 2,160 miles. It takes on average 90 minutes to complete one Earth orbit, so one Moon orbit should take roughly 25 minutes." - Sam "NasaScam" Colby
"you data is still open for interpretation, after all a NASA employee might of wipe a booger or dropped a hair on it" - showtime
"This is exactly what doomed Apollo 13. The multiple fuel cells shared a necessarily complex reactant supply manifold. The manifold had not been provided with suitable isolation valves to prevent accidental venting of the fuel cell reactants in certain types of failure."
Do we actually know that the oxygen manifold leaked? I thought all we knew is that the complete failure of O2 tank 2 somehow caused tank 1 to slowly lose its contents as well over the next hour or two; the exact location of the leak could not be determined.
Apollo 14 added a third O2 tank, and most importantly it was placed in a different sector of the service module to lessen the chance of losing all three tanks to a single-point failure (e.g., the explosion of one tank).
There were many ways that the Apollo 13 accident could have been much worse. Had it happened when the LM lifeboat was no longer available, for one. Because the leak that emptied O2 tank 1 was relatively slow, some fuel cell power was still available for another hour or so. Had tank 1 also been lost immediately, the CM would have discharged its entry batteries much more deeply. Starting with Apollo 14, the SM carried a 400 amp-hour auxiliary battery (same type as a lunar module descent battery) that, while it would not have been enough to get Apollo 13 home, would have substantially eased the time pressure in shutting down the CM after the explosion.
I don't think the auxiliary batteries were ever used during Apollos 14-17, but they were critical to Skylab. The CSM fuel cells continued to operate after docking until their reactants were depleted in a few weeks, so the aux batteries were needed for independent flight after the crew undocked from the station to come home. The tanks, as well insulated as they were, could not store reactants for the entire duration of a stay. During the ride home I presume the crew got its breathing O2 from the CM surge tank.
I checked the Apollo 13 review board report. There's no definite conclusion about the location of the slow leak that depleted O2 tank 1, but they give one intriguing possibility: its relief valve could have been unseated by the physical shock of the rupture of tank 2.
There are check valves that would have kept tank 1 from emptying through the ruptured tank 2. If they worked without leakage, of course.