On December 21, 2022, just as peak holiday travel was getting underway, Southwest Airlines went through a cascading series of scheduling failures, initially triggered by severe winter weather in the Denver area. But the problems spread through the airline’s network, and over the next 10 days the crisis ended up stranding more than 2 million passengers and costing the airline $750 million.
How did a localized weather system end up triggering such a widespread failure? MIT researchers have examined this widely reported breakdown as an example of a system that works smoothly most of the time and then suddenly fails, setting off a domino effect of further failures. They have now developed a computational system that combines sparse data about a rare failure event with much more extensive data about normal operations, then works backward to try to pinpoint the root causes of the failure, with the hope of finding ways to adjust the system to prevent such failures in the future.
The findings were presented at the International Conference on Learning Representations (ICLR), held in Singapore from April 24 to 28, by MIT doctoral student Charles Dawson, professor of aeronautics and astronautics Chuchu Fan, and colleagues from Harvard University and the University of Michigan.
“The motivation behind this work is that it’s really frustrating when we have to interact with these complex systems where it’s hard to understand what’s going on behind the scenes that is causing the problems or failures we’re seeing,” Dawson said.
The new work builds on previous research from Fan’s lab, which looked at hypothetical failure prediction problems, such as groups of robots working together on a task or complex systems such as the power grid, searching for ways to predict how such systems might fail. “The goal of this project is to transform that into a diagnostic tool that we can use on real-world systems,” Fan said.
The idea is to provide a way for someone to “give us data from a time when this real-world system had a problem or a failure,” Dawson said, “and we can try to diagnose the root causes and provide a little bit of a look behind the curtain at this complexity.”
The goal is for the approach to offer “a way of working for general cyber-physical problems,” he said. These are problems, he explained, in which “you have an automated decision-making component that interacts with the chaos of the real world.” There are tools for software systems that operate on their own, but complexity arises when the software has to interact with entities operating in a real physical environment, whether that is the scheduling of aircraft, the movement of autonomous cars, the interactions of a team of robots, or the control of inputs and outputs on a power grid. What often happens in such systems, he said, is that “the software can make a decision that looks pretty good at first, but then it has all these domino, knock-on effects that make things messier and much more uncertain.”
One key difference, though, is that in systems like robot teams, unlike the scheduling of aircraft, “we can use models in the robotics world,” says Fan, a principal investigator at MIT’s Laboratory for Information and Decision Systems (LIDS). “We have a good understanding of the physics behind robotics, and we do have ways to create models” that represent their activities with reasonable accuracy. Airline scheduling, by contrast, involves processes and systems that are proprietary business information, so the researchers had to find ways to infer what lay behind the decisions using only relatively sparse publicly available information, which essentially consisted of just the actual arrival and departure times of each aircraft.
“We have all this flight data, but there is this whole scheduling system behind it, and we don’t know how that system works,” Fan said. And the data relating to the actual failure covers only a few days, compared with years of data on normal flight operations.
The impact of the Denver weather during the week of Southwest’s scheduling crisis showed up clearly in the flight data, simply in the longer-than-normal turnaround times between landing and takeoff at the Denver airport. How that impact cascaded through the system was less obvious, however, and required more analysis. The key turned out to be related to the concept of reserve aircraft.
Airlines usually keep some aircraft in reserve at various airports, so that if a problem is found with a plane scheduled for a flight, another plane can quickly be substituted. Southwest uses only a single type of plane, so they are all interchangeable, which makes such substitutions easier. However, most airlines operate on a hub-and-spoke system, with a few designated hub airports where most of the reserve aircraft are kept, whereas Southwest does not use hubs, so its reserve aircraft are more dispersed throughout the network. The way these aircraft were deployed turned out to play an important role in the unfolding crisis.
“The challenge is that there is no public data available on where the aircraft are located throughout the Southwest network,” Dawson said. “What we were able to find using our method is that, by looking at the public data on arrivals, departures, and delays, we can back out the hidden parameters of these aircraft reserves to explain the observations we are seeing.”
They found that the way the reserves were deployed was a “leading indicator” of the problems that cascaded into a nationwide crisis. Some parts of the network that were directly affected by the weather were able to recover quickly and get back on schedule. “But when we looked at other areas in the network, we saw that these reserves were just not available, and things kept getting worse.”
For example, the data showed that Denver’s reserves dwindled rapidly because of the weather delays, but then “this also allowed us to trace the failure from Denver to Las Vegas,” he said. Although there was no severe weather there, “our approach still showed a steady decline in the number of aircraft that were available to serve flights out of Las Vegas.”
“We found that there are these aircraft cycles in the Southwest network, where a plane might start the day in California, then fly to Denver, and then end the day in Las Vegas,” he said. In the case of this storm, the cycle was interrupted. As a result, “this storm in Denver broke the cycle, and suddenly the reserves in Las Vegas, which was not affected by the weather, began to deteriorate.”
In the end, Southwest was forced to take a drastic step to resolve the problem: it had to do a “hard reset” of the entire system, canceling all flights and flying empty airplanes around the country to rebalance the reserves.
Working with experts in air transportation systems, the researchers developed a model of how the scheduling system is supposed to work. Then, “what our approach does is that we are essentially trying to run the model backward.” Starting from the observed outcomes, the model allows them to work back to see what kinds of initial conditions could have produced those outcomes.
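To make the idea of running a model backward concrete, here is a minimal sketch. It is not the researchers' tool or their actual model. It assumes a deliberately simple, made-up forward model in which each reserve plane at an airport can cover one disrupted flight, and it scores every candidate reserve count by how well it explains the observed cancellations. The function names, the Poisson noise assumption, and all of the numbers are hypothetical.

```python
import math

def poisson_logpmf(k, lam):
    # Log-probability of observing k events when lam are expected.
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def expected_cancellations(reserve, disrupted, baseline=0.1):
    # Hypothetical forward model: each reserve plane covers one disrupted
    # flight; whatever cannot be covered is expected to be cancelled.
    return baseline + max(0, disrupted - reserve)

def posterior_over_reserve(observations, max_reserve=10):
    # "Run the model backward": score every candidate reserve count by how
    # well it explains the observed (disrupted, cancelled) pairs, then
    # normalize the scores into probabilities (uniform prior assumed).
    log_post = [
        sum(poisson_logpmf(c, expected_cancellations(r, d))
            for d, c in observations)
        for r in range(max_reserve + 1)
    ]
    top = max(log_post)
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]

# Made-up daily observations: (disrupted flights, cancelled flights).
observations = [(6, 3), (8, 5), (5, 2), (7, 4)]
posterior = posterior_over_reserve(observations)
most_likely_reserve = max(range(len(posterior)), key=posterior.__getitem__)
print("most likely hidden reserve count:", most_likely_reserve)
```

The reserve count never appears in the data. It is recovered only because some values explain the public observations far better than others, which is the general shape of the inference described above.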
Although data on the actual failure was sparse, the extensive data on typical operations helped teach the computational model “what is feasible, what is possible, what is the realm of physical possibility here,” Dawson said. “That gives us the domain knowledge to then say, in this extreme event, given the space of what’s possible, what is the most likely explanation” for the failure.
This could lead to a real-time monitoring system, he said, in which data on normal operations is constantly compared with current data to determine what the trend looks like. “Are we trending toward normal, or are we trending toward extreme events?” Seeing signs of impending problems could allow for preemptive measures, such as redeploying reserve aircraft in advance to areas where problems are anticipated.
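A monitoring loop of this kind might look roughly like the sketch below. This is an illustration only, not the system under development in Fan's lab. It assumes a simple z-score on a rolling window of turnaround times as the drift measure, and the function name, the threshold, and the sample data are all made up.

```python
from statistics import mean, stdev

def drift_scores(normal, current, window=3):
    # Compare each rolling window of current data with the baseline built
    # from normal operations, measured in baseline standard deviations.
    mu, sigma = mean(normal), stdev(normal)
    scores = []
    for i in range(len(current) - window + 1):
        recent = current[i:i + window]
        scores.append((mean(recent) - mu) / sigma)
    return scores

# Made-up turnaround times in minutes.
normal_turnaround = [42, 45, 40, 44, 43, 41, 46, 44]
current_turnaround = [43, 44, 47, 52, 58, 66]

scores = drift_scores(normal_turnaround, current_turnaround)
# Flag the windows that have drifted far from normal operations.
alerts = [i for i, s in enumerate(scores) if s > 3.0]
print("windows trending toward an extreme event:", alerts)
```

A steadily rising score is the kind of signal that could prompt early redeployment of reserve aircraft, though a real system would need a far richer model than a single threshold.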
Work on developing such a system is underway in her lab, Fan said. In the meantime, the team has produced an open-source tool for analyzing failing systems, called CALNF, which is available for anyone to use. Dawson, who received his PhD last year, is now working as a postdoctoral fellow to apply the methods developed in this work to understanding failures in power networks.
The research team also included Max Li from the University of Michigan and Van Tran from Harvard. This work was supported by the Air Force Office of Scientific Research and the MIT-DSTA program.