CrowdStrike outage explained
It’s complicated, but here is an attempt to explain the CrowdStrike outage and its implications in terms that non-techies can follow.
Overview of CrowdStrike Outage
Cause
The recent outage affecting thousands of Windows machines worldwide was caused by a faulty update from the cybersecurity provider CrowdStrike. This update included a defect in a kernel-level driver, leading to severe issues such as the Blue Screen of Death (BSOD) at boot. As a result, affected PCs and servers were forced into a recovery boot loop, with some machines unable to start properly.
Impact
The impact of the outage has been widespread and severe, disrupting operations across various sectors including banks, airlines, TV broadcasters, supermarkets, and many other businesses globally. Notable incidents include:
- Australia: Banks, airlines, and TV broadcasters were among the first to raise alarms.
- United Kingdom: Sky News was unable to broadcast its morning news bulletins for hours.
- Europe: Airlines such as Ryanair reported IT issues affecting flight departures. In the US, the Federal Aviation Administration (FAA) was assisting major airlines such as Delta, United, and American Airlines.
- Germany: Berlin airport warned of travel delays due to technical issues.
- India: An airline resorted to handwritten boarding passes due to system failures.
- United States: Emergency call centers in Alaska were also impacted.
Reports from IT administrators on platforms like Reddit indicate that many companies have seen a significant portion of their machines go offline, with some users reporting up to 70% of their laptops stuck in a boot loop.
CrowdStrike's Response
CrowdStrike's CEO, George Kurtz, acknowledged the issue in a statement on X, clarifying that it was not a security incident or cyber attack. The company has identified the issue and deployed a fix, which involves reverting the problematic update. However, this fix does not help machines that have already crashed: recovering them requires booting into safe mode and manually deleting specific system files. This workaround is especially challenging for cloud-based servers and remotely deployed laptops.
What is CrowdStrike?
CrowdStrike is a leading provider of endpoint detection and response (EDR) software, which offers advanced security measures against ransomware and other hacking threats. Founded in 2011 by former executives of antivirus pioneer McAfee, CrowdStrike has become a prominent player in the cybersecurity market, controlling about 18% of the $8.6 billion global market for modern endpoint protection software, slightly ahead of Microsoft.
Unlike traditional antivirus software that hunts for known malware, CrowdStrike's EDR software continually scans machines for any signs of suspicious activities and automates responses to potential threats. To perform these tasks, the software requires deep access to the core of operating systems, which, as seen in the recent outage, can lead to significant disruptions if an update goes wrong.
Despite its effectiveness in defending against ransomware, CrowdStrike's high costs mean that it is usually installed on the most critical systems within organizations. Consequently, when these systems fail, the impact can be substantial.
Kernel-Level Access in Endpoint Protection Systems: A Double-Edged Sword
The Controversy
Endpoint protection systems, like those developed by CrowdStrike, require deep access to the core of operating systems, often at the kernel level. This access is controversial due to the significant impact that faulty code can have on the stability and functionality of systems. When security software operates at such a fundamental level, any errors or vulnerabilities introduced through updates can cause widespread and severe disruptions, as seen in the recent outage.
The Impact of Bad Code
Kernel-level access means that security software can interact directly with the operating system's most critical functions. This level of access is powerful but risky:
- System Crashes: Faulty code can lead to critical failures, such as the Blue Screen of Death (BSOD), causing machines to become unbootable.
- Recovery Challenges: Fixing issues at the kernel level often requires complex recovery procedures that are not easily automated, making it difficult for IT teams to quickly resolve problems, especially across large, distributed networks.
- Operational Downtime: The severity of such crashes can bring down essential services and operations, as seen with airlines, banks, and broadcasters during the CrowdStrike incident.
The Security Imperative
Despite these risks, accessing the kernel level is crucial for effective security monitoring and defense. Here’s why:
- Deep Inspection: To detect and mitigate sophisticated threats, security software must inspect system activities at the deepest level. This involves monitoring kernel operations, memory usage, and other low-level functions that are often targeted by advanced malware.
- Prevention and Response: Effective endpoint protection systems can prevent attacks by intercepting malicious actions before they affect the broader system. They can also respond swiftly to contain threats, minimizing potential damage.
- Comprehensive Coverage: With access to the kernel, security tools can provide comprehensive protection across the entire operating system, ensuring that no part is left vulnerable to attack.
Update Dynamics and Risks
Testing and Rollout
CrowdStrike, like other responsible cybersecurity providers, rigorously tests its updates before deployment. However, despite extensive testing, some issues may only manifest under specific conditions that are difficult to replicate in a test environment. The recent outage suggests that CrowdStrike's update passed initial tests and that the error surfaced only later, a scenario that is hard to anticipate and test for comprehensively.
Staggered Rollout
Typically, updates are rolled out in a staggered manner to mitigate potential impacts. This approach allows the company to monitor initial deployments and halt further rollouts if issues are detected. However, the delayed onset of the error in this case complicated the situation, as the problem only became apparent after a significant number of systems had already received the update.
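To make the delayed-onset problem concrete, here is a minimal Python sketch of a staged rollout. Everything in it is hypothetical and for illustration only: the ring names, the machine counts, the `soak_hours` wait between rings, and the `failure_delay_hours` parameter are assumptions, not details of how CrowdStrike actually deploys updates.

```python
from dataclasses import dataclass


@dataclass
class Ring:
    """A hypothetical group of machines that receives the update together."""
    name: str
    machines: int


# Illustrative rings only; real deployments define their own groups.
RINGS = [Ring("canary", 100), Ring("early", 1_000), Ring("broad", 100_000)]


def staged_rollout(rings, soak_hours, failure_delay_hours, update_is_faulty):
    """Deploy ring by ring, halting once an earlier ring has visibly failed.

    Returns (names of rings that received the update, total machines updated).
    """
    deployed = []  # list of (ring, time the ring was updated)
    clock = 0.0
    for ring in rings:
        # Health check before each new ring: a faulty update only becomes
        # visible once failure_delay_hours have passed since a ring got it.
        failure_visible = update_is_faulty and any(
            clock - deployed_at >= failure_delay_hours
            for _, deployed_at in deployed
        )
        if failure_visible:
            break  # halt the rollout; later rings are spared
        deployed.append((ring, clock))
        clock += soak_hours  # wait before moving on to the next ring
    names = [ring.name for ring, _ in deployed]
    total = sum(ring.machines for ring, _ in deployed)
    return names, total
```

With a 4-hour wait between rings, a fault that shows up after 1 hour stops at the canary ring (100 machines). A fault that takes 10 hours to show up slips past every check and reaches all 101,100 machines, which is the kind of situation described above.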
Auto-Update Feature
The auto-update feature in endpoint protection systems introduces additional risks. While it ensures that systems receive the latest security enhancements promptly, it also means that any faulty update can quickly propagate across numerous devices. Customers have the option to opt out of automatic updates and test new updates in a controlled environment before deploying them widely. It is unclear whether most customers did not exercise this option or whether the update in question was a policy update that bypassed these controls.
The Challenges with BitLocker
For systems affected by the faulty update, CrowdStrike cannot simply push a new update to solve the issue, because the failure occurs at the kernel level during boot: the machine crashes before the security software can download anything new. CrowdStrike has released a safe update, but for any machine that has already installed the faulty one, the only fix is to manually boot into safe mode, locate the bad file, and delete it. This process is complicated by the fact that not every machine is easily accessible, and some machines are locked by BitLocker.
BitLocker, a drive encryption program by Microsoft, adds another layer of complexity. When a drive is encrypted with BitLocker, it requires a recovery key to unlock the drive, especially in situations where the machine needs to boot into safe mode or after a system crash. If users do not have easy access to their BitLocker recovery keys, they may find themselves unable to boot into safe mode or perform necessary recovery steps. Furthermore, if the BitLocker recovery keys are stored on the affected machines, it can cause further complications as these machines are locked by the outage.
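For readers curious what "locate the bad file and delete it" amounts to, here is a minimal Python sketch of that one step. It is an illustration under stated assumptions, not CrowdStrike's procedure: the function name `remove_faulty_files` is hypothetical, and the directory and file pattern are passed in rather than hard-coded. Public remediation guidance at the time pointed to files matching `C-00000291*.sys` in the `C:\Windows\System32\drivers\CrowdStrike` folder, but that detail does not come from this article. In practice the deletion is done by hand from safe mode or the Windows recovery environment, where Python is usually not available.

```python
from pathlib import Path


def remove_faulty_files(driver_dir, pattern, dry_run=True):
    """Find files in driver_dir matching pattern and optionally delete them.

    Returns the sorted names of the matching files. With dry_run=True
    (the default) nothing is deleted, so the matches can be reviewed first.
    """
    matches = sorted(Path(driver_dir).glob(pattern))
    for path in matches:
        if not dry_run:
            path.unlink()  # delete the file
    return [path.name for path in matches]
```

Defaulting to a dry run is a deliberate choice: deleting the wrong file from a drivers folder can create a new problem while fixing the old one.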
Balancing Security and Stability
The need for security at the kernel level must be balanced with the imperative to maintain system stability. This balance involves:
- Rigorous Testing: Security updates, especially those affecting the kernel, must undergo extensive testing to ensure they do not introduce new vulnerabilities or stability issues.
- Rollback Mechanisms: Implementing robust rollback mechanisms can help quickly revert to previous stable states in case an update causes problems.
- Layered Security: Employing a layered security approach, where multiple security measures are used in tandem, can reduce the reliance on any single point of failure, even at the kernel level.
- Collaboration with OS Vendors: Close collaboration between endpoint protection providers and operating system vendors can ensure that security measures are designed and implemented in ways that are compatible with the core operating system functionalities.
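The rollback idea in the list above can be sketched in a few lines of Python. This is a simplified illustration, not how any real product works: the `UpdateManager` class, its version strings, and the `is_healthy` check are all hypothetical.

```python
from typing import Callable


class UpdateManager:
    """Tracks the active version and the last version known to work."""

    def __init__(self, initial_version: str):
        self.active = initial_version
        self.last_known_good = initial_version

    def apply(self, new_version: str, is_healthy: Callable[[str], bool]) -> bool:
        """Switch to new_version; revert if the health check fails.

        Returns True if the update was kept, False if it was rolled back.
        """
        self.active = new_version
        if is_healthy(new_version):
            self.last_known_good = new_version
            return True
        # Health check failed: return to the previous stable state.
        self.active = self.last_known_good
        return False
```

A rollback like this only helps if the health check gets a chance to run. When a faulty update crashes the machine during boot, as in this outage, the check never executes, which is why the recovery had to be manual.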
The Cost Factor and Its Impact
As a market leader in EDR, CrowdStrike is renowned for its effectiveness at protecting devices. However, this protection comes at a significant cost, with prices around $50 per machine. This high cost means that only the largest companies can afford CrowdStrike, and even then, they typically prioritize deploying it to their most critical systems. Consequently, when these systems go down, as seen in the recent incident, the impact is even greater because it affects the most vital parts of an organization's infrastructure.
Note of Thanks to IT Administrators
A special note of thanks is warranted for IT administrators who, under very stressful circumstances, are now tasked with manually repairing impacted machines one by one. This often involves working over the weekend, dealing with frustrated business leaders, and handling complex recovery processes. Business leaders should understand that these issues are not the fault of the IT teams and cannot be resolved any faster. This situation also highlights why updates should not be pushed at the end of the week: when something goes wrong, IT admins end up working overtime to fix it.
Conclusion
While the deep kernel-level access required by endpoint protection systems is controversial due to the potential for severe disruptions, it remains essential for robust security. Ensuring that these powerful tools are both effective and reliable requires careful design, rigorous testing, and a balanced approach to integrating security with system stability. The recent CrowdStrike outage highlights the critical importance of managing these risks to protect the very systems these tools are designed to secure.