It was a harrowing week in the IT world – with severe repercussions that were felt far and wide. The massive global outage – already being dubbed the “world’s biggest IT failure” – affected millions of Microsoft Windows machines running CrowdStrike’s Falcon cybersecurity software.
A faulty CrowdStrike software update induced a critical stop error on Windows machines, leading to the dreaded “blue screen of death.” The outage delayed or canceled thousands of flights, took down 911 emergency call centers and media outlets, and disrupted banking and healthcare services. In all, downtime was measured in hours and days, and financial impacts will likely exceed $1 billion.
CrowdStrike Holdings, Inc. is a 13-year-old public cybersecurity company based in Austin, Texas, that recently posted $3 billion in annual revenue. It provides technology and services to protect against cybersecurity attacks and malware, and it is trusted by more than 30,000 subscribers globally, including many businesses in regulated industries like financial services and healthcare.
Throughout this crisis, CrowdStrike co-founder and CEO George Kurtz provided regular updates on social media, assuring the public that the outage was not the result of a cyber attack but rather a system configuration update that “triggered a logic error resulting in a system crash.” Kurtz also issued an apology to customers and partners. Microsoft CEO Satya Nadella was vocal about the issue and remediation steps. Although a workaround and fix were quickly deployed, it takes time (and sometimes multiple reboots!) to fully restore operations. CrowdStrike’s stock closed the week down 11%.
This is exactly the type of large-scale event that we study in the “Crisis Management in Tech” class that I taught at Cornell Tech last spring and will teach at the Yale School of Management in the fall. By exploring recent crises in tech, we’re able to discover best practices and learn what to do and what not to do to mitigate risks, lead during a crisis, and learn from incidents. It’s instructive to do this ahead of an actual crisis – because when you’re in the heat of the battle, there simply isn’t time to dither. Leaders and organizations need playbooks to refer to, teams that have practiced tabletop drills, and well-honed instincts to thoughtfully respond with speed and rigor.
In the early hours of the CrowdStrike outage, there was public speculation about the cause – poor quality control, lackluster testing, deficient oversight, malicious intent, a cybersecurity attack. There was even inappropriate conjecture that a “DEI engineer” (presumably referring to some imaginary incompetent coder selected for their role because they met a diversity, equity, and inclusion target) was responsible for sloppy coding or testing.
But during a crisis, the focus must be on stabilizing the situation and minimizing damage. As tempting as it is to delve into who or what was responsible, to consider the company’s reputation, or to capture the long-term opportunity, these questions should not get attention until after the smoke clears. Exemplary leaders during a crisis are those who provide purpose and direction, act with urgency and integrity, and proactively engage key stakeholders.
In the aftermath of a crisis – once the acute period has passed – it’s time to shift focus to root cause analysis (RCA) and lessons learned. RCA is a structured process that helps identify underlying factors that led to an incident. A post-mortem deep dive reflects on the crisis and the response, identifying areas for improvement: Why did the incident happen? What impact did it have? What actions were taken to mitigate and resolve it? What should be done to prevent it from happening again? How could the team be better prepared to respond in the future?
“A crisis is a terrible thing to waste,” said economist Paul Romer. Indeed, a crisis can provide an impetus for needed change. When genuine learning occurs, a company and even an industry can emerge stronger, with improved policies, regulations, standards, or technology. Famously, the handling of the 1982 Tylenol tampering crisis was a lesson in leadership. Johnson & Johnson reacted quickly, issuing warning communications and a massive recall, and the FDA mandated tamper-resistant packaging. Similarly, Samsung’s management of its 2016 Galaxy Note 7 product failures – which included phones exploding in people’s pockets – enabled it to improve product reliability, maintain customer loyalty, and grow its business.
But these examples are exceptions, not the rule. Many companies in the crosshairs of a crisis haven’t fared as well. In fact, it takes companies an average of seven months to return to full operations, and one in four businesses permanently closes after a crisis.
And so, as they say, an ounce of prevention is worth a pound of cure. In other words, it’s worthwhile to invest in risk mitigation to reduce the chances of a crisis occurring while also preparing to respond if one does. Enterprise Risk Management (ERM) is the holistic, continuous process that helps organizations identify, assess, prioritize, and mitigate risks that may threaten their business. For 2024, cyber incidents have emerged as the top business risk – hence the important role of technology like CrowdStrike’s Falcon software.
Best practices for designing products to be reliable and resilient include eliminating single points of failure, designing for fault tolerance, and testing at multiple levels. Speculation has already begun about how a code update could have been promoted to large-scale production without proper oversight or testing; no doubt a full investigation will yield new approaches, policies, or controls to prevent another outage of this magnitude.
Another way to improve design robustness is through chaos engineering, where engineers create intentional failures in order to understand their impact, solve problems proactively, and avoid large-scale service disruptions. Having business continuity and disaster recovery plans in place can reduce risks and help businesses recover faster.
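For readers who want to see what an intentional failure can look like in practice, here is a minimal sketch of the chaos engineering idea in Python. It is an illustration only, not a description of how CrowdStrike or any particular tool works: the names `with_chaos`, `fetch_with_fallback`, and `run_experiment` are hypothetical, and the "services" are stand-in functions that simply return a label. The sketch deliberately makes a dependency fail some fraction of the time and checks that a fallback path keeps the system answering.

```python
import random
from typing import Callable, Dict


class InjectedFault(Exception):
    """Raised by the chaos wrapper to simulate a dependency failure."""


def with_chaos(func: Callable[[], str], failure_rate: float,
               rng: random.Random) -> Callable[[], str]:
    """Wrap func so it fails on purpose with probability failure_rate."""
    def wrapper() -> str:
        # rng.random() is in [0, 1), so a rate of 0.0 never fails
        # and a rate of 1.0 always fails.
        if rng.random() < failure_rate:
            raise InjectedFault("simulated outage")
        return func()
    return wrapper


def fetch_with_fallback(primary: Callable[[], str],
                        fallback: Callable[[], str]) -> str:
    """Try the primary dependency; use the fallback if it fails.

    Only the injected fault is caught here to keep the sketch small;
    a real system would handle its own timeout and error types.
    """
    try:
        return primary()
    except InjectedFault:
        return fallback()


def run_experiment(failure_rate: float, trials: int = 1000,
                   seed: int = 0) -> Dict[str, int]:
    """Count how often each path served the request under injected faults."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    primary = with_chaos(lambda: "primary", failure_rate, rng)
    counts = {"primary": 0, "fallback": 0}
    for _ in range(trials):
        counts[fetch_with_fallback(primary, lambda: "fallback")] += 1
    return counts


if __name__ == "__main__":
    print(run_experiment(failure_rate=0.3))
```

The point of the exercise is the question it forces: when the primary dependency is knocked out on purpose, does every request still get an answer? If the counts do not add up to the number of trials, or the fallback itself fails, the team has found a weakness in a controlled setting rather than during a real outage.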
Leadership matters – always – but never more so than in a crisis. A leader sets the tone for an organization – establishing a risk-aware culture, prioritizing quality and reliability, and fostering an environment where employees are encouraged to raise concerns and suggestions. Leading through a crisis is a true test of a leader’s mettle – just as risk mitigation and crisis avoidance create a lasting legacy.



