These accidents can resemble Rube Goldberg devices in the way that small errors of judgment, flaws in technology, and insignificant damage combine to form an emergent disaster. System accidents were defined in 1984 by Charles Perrow. James T. Reason extended Perrow's work with the study of human reliability and the Swiss cheese model, which is now widely accepted in aviation safety and healthcare.
Once an enterprise passes a certain point in size, with many employees, specialization, backup systems, double-checking, detailed manuals, and complex communication, employees can resort primarily to following rigid protocol, habit, and "being right." Related phenomena, such as groupthink, can also occur at the same time. Real-world accidents usually have multiple causes, not just the single cause that could have prevented the accident at the very last minute.
What is called "safety" can be pursued in a rigid way, with the feeling that more than enough time and effort has already been spent on it, and therefore with the social consequence that suggestions of any sort will be poorly received. Arguably, this was one of the contributing causes of the Space Shuttle Challenger disaster on January 28, 1986. In that situation, a group of engineers again became worried that the solid rocket booster O-rings would malfunction in the low temperatures forecast for launch day (it was 39 degrees Fahrenheit at launch). They were ignored because NASA officials did not want to further delay the launch (already six days behind schedule) and did not want yet another complication (a very human motive), especially on a matter that had seemingly already been resolved. The managers seemed to take the view that a manager's job is to "know the answer," rather than to know how to find out, or how to coach their people to find the answer. In addition, the fact that something comes up yet again is itself information. In the best-case scenario, the engineers' concern would have been part of a conversation that looked for both interim and longer-term solutions. Challenger exploded 73 seconds after launch. The lives of all seven crew members were lost. The proximate cause was the O-rings malfunctioning in the cold weather.
Often, processes and events are opaque (Perrow terms this 'incomprehensibility') as opposed to transparent. When processes are opaque, there is little genuine communication between the top of the chain of command and the people carrying out orders.
Organizations and technologies prone to system accidents can also be described as "clunky."
A malfunction of the main feedwater pumps in the secondary, non-nuclear cooling circuit led to their shutdown. As a result, water did not flow through the circuit and the reactor was no longer being cooled. The turbine and the nuclear reactor shut down automatically, as they should have. The loss of cooling from the secondary cooling circuit led to a rise in the temperature and pressure of the primary coolant, which was normal and expected in a shutdown. A pilot-operated relief valve (PORV) opened automatically to relieve the excess pressure on the primary coolant side. It should have closed automatically once the excess pressure had been relieved, because maintaining high pressure was important to keep the fluid inside from boiling. However, a faulty sensor told computers and the engineers watching the controls that the PORV was closed, when in fact it was still open. Cooling water began to pour out of the open valve, which quickly caused the reactor core to overheat.
There was no direct way to measure the amount of water in the core. Instead, operators watched the level of core water by looking at the level in the pressurizer. In the closed-circuit loop, the pressurizer is mounted higher than the core, meaning that if there is water in the pressurizer, there is water in the core. Since the PORV was stuck open, the water in the pressurizer was turbulent and the water level indication was not accurate. Control room operators did not realize an accident was imminent because they did not have a direct indication of core water level and because they did not interpret other indications correctly.
Meanwhile, another problem surfaced in the emergency feedwater system. Three backup pumps started immediately after the shutdown of the main feedwater pumps, but the water lines had been closed for testing 42 hours earlier and had not been reopened. This was either an administrative error or a human error, but in either case no water was able to enter the circuit. There was also no sensor for the water line flow in the control room, but the pump sensors reported, incorrectly, that they were functioning properly. The control room operators were given no information on the water level in the reactor. Instead they judged it by pressure, which they believed had stabilized because they thought the malfunctioning valve was closed.
The closed water lines were discovered about eight minutes after the backup pumps came on, and once reopened, water flowed normally into the primary cooling circuit. However, steam voids, or areas where no water was present, had formed in the primary circuit, which prevented the transfer of heat from the nuclear reactor to the secondary cooling circuit. The pressure inside the primary coolant continued to decrease, and voids began to form in other parts of the system, including in the reactor. These voids caused a redistribution of water in the system, and the pressurizer water level rose while the overall water inventory decreased. Because of the voids, the pressurizer water indicator, which displayed the amount of coolant available for heat removal, indicated that the water level was rising, not falling. The operators then turned off the backup water pumps for fear that the system was being overfilled. The control room personnel did not know that this indicator could read falsely, so they did not know that the water level was in fact dropping and the core was becoming exposed.
The water that was gushing out of the still-open PORV was collecting in a quench tank. The tank overfilled, and the containment building sump filled and sounded an alarm at 4:11 am. Operators initially ignored these indicators and the alarm. At 4:15 am, the relief diaphragm on the quench tank ruptured, and radioactive coolant began flowing into the containment building.
Three of these four failures had occurred previously with no harm, and the fourth involved an extra safety mechanism added as the result of a previous incident. However, the coincidental combination of human error and mechanical failure created a loss-of-coolant accident and a near-disaster.
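The arithmetic behind "harmless alone, dangerous together" can be sketched briefly. The example below is only an illustration: the failure names are one reading of the account above, the probabilities are invented for the example and are not figures from the accident record, and the failures are assumed to be independent, which may not hold in a tightly interconnected plant.

```python
from math import prod

# Hypothetical per-demand failure probabilities, chosen only for illustration.
# They are NOT taken from any accident report.
failure_probs = {
    "main_feedwater_pumps_trip": 0.05,
    "emergency_lines_left_closed": 0.01,
    "relief_valve_sticks_open": 0.02,
    "valve_indicator_misleads": 0.05,
}


def p_all(probs):
    """Probability that every failure occurs together (assumes independence)."""
    return prod(probs)


def p_any(probs):
    """Probability that at least one failure occurs (assumes independence)."""
    return 1.0 - prod(1.0 - p for p in probs)


p_joint = p_all(failure_probs.values())
p_some = p_any(failure_probs.values())

print(f"at least one failure on a given demand: {p_some:.4f}")
print(f"all four failures on the same demand:   {p_joint:.2e}")
```

Under these assumed numbers, some single failure is fairly common, so operators see each one repeatedly without harm, while the full combination is rare enough that it may never have been seen before it happens.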
Mechanics removed oxygen canisters from three older aircraft and put in new ones. Most of the emphasis was on installing the new canisters correctly, rather than disposing of the old canisters properly. These were simply put into cardboard boxes and left on a warehouse floor for a number of weeks. The canisters were green-tagged to mean serviceable. A shipping clerk was later instructed to get the warehouse in shape for an inspection. (This very human motive to keep up appearances often plays a contributory role in accidents.) The clerk mistakenly took the green tags to mean non-serviceable and further concluded that the canisters were therefore empty. The safety manual was helpful neither for him nor for the mechanics, using only the technical jargon of "expired" canisters and "expended" canisters. The five boxes of canisters were categorized as "company material" and, along with two large tires and a smaller nose tire, were loaded into the plane's forward cargo hold for a flight on May 11, 1996. A fire broke out minutes after take-off, causing the plane to crash. All five crew members and 105 passengers were killed.
Had the oxygen generators been better labeled, making clear that they generate oxygen through a chemical reaction that produces heat and were therefore unsafe to handle roughly, let alone to fly in a cargo hold, the crash might have been averted.
Many writers on safety have noted that the mechanics did not use the plastic safety caps that protect the fragile nozzles on the canisters and, in fact, were not provided with them. The mechanics, unaware of the hazardous nature of the canisters, simply cut the lanyards and taped them down. ValuJet was in a sense a virtual airline with maintenance contracted out, and the contracting firms often took this a level further by subcontracting the work themselves. The individual mechanics "inhabited a world of boss-men and sudden firings," not an environment where people can confidently raise concerns. The disaster could also have been prevented by proper disposal or labeling of the canisters.
The Columbia disaster began long before Columbia even left the ground. The struts attaching the shuttle to its external fuel tank were covered by ramps of insulating foam, which is lightweight and simple to shape. During launch, a chunk of this foam broke off and hit the shuttle's left wing, damaging the heat shield. A damage assessment program, nicknamed 'Crater', predicted damage to the wing, but its prediction was dismissed as an overestimation by NASA management. In a risk-management scenario similar to the Challenger disaster, NASA management failed to recognize the relevance of engineering concerns for safety. Two examples of this were the failure to honor engineers' requests for imaging to inspect possible damage, and the failure to respond to engineers' requests about the status of an astronaut inspection of the left wing. NASA's chief thermal protection system (TPS) engineer was concerned about left wing TPS damage and asked NASA management whether an astronaut would visually inspect it. NASA managers never responded.
NASA managers felt that a rescue or repair was impossible, so there was no point in trying to inspect the vehicle for damage while on orbit. However, the Columbia Accident Investigation Board determined that either a rescue mission or an on-orbit repair, though risky, might have been possible had NASA verified severe damage within the first five days of the mission.