February 1, 2003. It’s a little before 9 AM and in 20 minutes the Space Shuttle Columbia is scheduled to land in Florida. In Mission Control, mechanical systems officer Jeff Kling notices two sensors on Columbia’s left wing have cut out. A few moments later, two more cut out. He discusses it with flight director LeRoy Cain, but neither can explain the event. How could four independent sensors just suddenly stop working?
But then Cain thinks back to Columbia’s launch and the large chunk of foam that hit the shuttle as it was rocketing up at supersonic speed. The debris had hit Columbia’s left wing. Could the loss of the sensors be related? It couldn’t be, thinks Cain. It has to be a coincidence.
Linda Ham, chairwoman of the Mission Management Team, has the same thought. She had approved the analysis that concluded the foam strike was not a problem. She turns to Ralph Roe, another manager:
“Ralph, it’s the left wing.”
“It’s not that,” he says.
A few minutes later, Kling sees more bad signs. The landing gear sensors report zero tire pressure, but only on the left side. Both Kling and Cain get a horrible feeling in the pit of their stomachs. What if they’re wrong? What if the foam strike damaged Columbia’s wing? They desperately search for any other explanation: Is it an instrumentation problem? Is there any commonality in the systems?
But they don’t get much time. A minute later Mission Control loses all communications with the orbiter. Comm drop-outs are not uncommon, but this time feels different. The laughter and excitement normally present when the orbiter lands and the mission ends have been replaced with silence and dread.
The only voice is from CAPCOM:
“Columbia, Houston, comm check.”
“Columbia, Houston, UHF comm check.”
For me, an engineer, reading about the Columbia disaster was particularly emotional. Reading about the meetings and conversations of the space shuttle engineers, I could think back to my own experiences on software development teams. We’ve all been in triage meetings where we’ve been asked to give a risk assessment. In Comm Check there is a point at which Linda Ham asks, “Can we say … [there is] no safety of flight issue…?” It was easy for me to recall all the times that I’ve been asked, “Is your bug-fix safe?”
How do we answer that question? As engineers, we are constantly asked to predict the future: Will that bridge hold up in rush-hour traffic? How long will it take to evacuate that building in a fire? Will your program crash? Is Columbia going to land safely? The future is never perfectly predictable. There is no known algorithm, process, or invention that can predict the future with certainty. Nevertheless, we are asked to try. Surprisingly, our predictions are often correct.
I read Comm Check wondering if it was going to reveal the cause of the disaster—the one mistake that made their predictions wrong. The Columbia Accident Investigation Board (CAIB) blamed the disaster, in part, on “NASA’s history and the human space flight program’s culture.” In other words, no one person (or mistake) was to blame. The accident was caused by NASA’s “broken safety culture.”
Maybe that’s right. When launch video showed a large piece of foam striking the orbiter, a team was dispatched to analyze the incident and predict whether the orbiter would be able to land safely. But in order to predict the future, the team needed information about the past. Where had the foam hit? Had it hit the leading edge of the wing or had it hit the tiled underbelly? How big was the foam? How much damage had the tiles sustained? And a thousand other questions.
With the best information available, the team believed the foam had hit the heat-shield tiles on the wing. With the best information available, the leading tile expert believed no major damage would result. Nevertheless, there was a nagging uncertainty. The team believed there was no safety issue, but could not prove it. They wanted more information. They asked their managers for more information. In particular, they wanted the Air Force to use its super-secret telescopes to take a picture of the orbiter and see if there was any damage. The managers replied: if you don’t think there’s a safety issue, then more information is not a high priority. Come back to us if you think there’s a safety issue. The team was not happy, but they couldn’t argue with the logic.
It’s easy to blame the team for not pushing harder, or blame managers for not asking more questions (Linda Ham was widely attacked in the media). But the CAIB also leveled criticism at past Presidents and Congress for not adequately funding NASA.
There may be some truth to that. NASA keeps a list of all the issues that could lead to the loss of a mission and the crew—essentially a “bug list” for the space shuttle program. At the time that Columbia launched, more than 4,000 critical items were on that list; more than 3,000 of them were “waived” (that is, deferred). With such a huge backlog of bugs, there was no way that a program with a shrinking budget could cope. NASA would need far more resources and far more time to fix all those problems. No wonder the foam impact was not scrutinized more closely—it was one of thousands of problems. The engineers and managers had become desensitized to serious issues because they had to deal with so many of them.
If all goes well, the shuttle Discovery will launch this Sunday. It has been delayed several weeks because of a new problem. In November, a piece of a hydrogen control valve (used to pressurize the fuel tank) broke free on the shuttle Endeavour during launch. Worse, an inspection revealed cracks in similar valves on other orbiters. Since no damage occurred on the November flight, perhaps the most expedient thing would have been to replace the valves on Discovery and proceed with the launch. Perhaps a pre-Columbia NASA would have done that. But this time things were different. NASA grounded the shuttle until its engineers could get to the root cause. Why were the valves failing? Could a brand-new valve fail? And if one did fail, what would happen to the orbiter and its crew? A team tackled these questions, and after a month of work it produced good answers: the valves were failing because of harmonics in the fuel flow; a brand-new valve would not fail during a single launch; and even if a valve did fail, it would not harm the orbiter.
The engineers made their prediction: if you replace the valves, the shuttle will fly safely. This Sunday I will watch the launch confident (but not absolutely certain) that their predictions will come true.