Cybersecurity company CrowdStrike pushed a broken automatic update to ~8.5M Windows computers, BSOD’ing them. This has taken down hospitals, airlines, banks, 911 services, and other critical infrastructure for days.
Software, distributed systems, and an adversarial game make for a tough playing field, but this didn’t have to happen. Staged rollouts and continuous integration have been used for decades to reduce risk. This public postmortem, from a cybersecurity company carrying this much responsibility in 2024, is shameful.
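To make the staged-rollout point concrete, here is a minimal sketch in Python of a release gate that pushes an update to a growing fraction of a fleet and rolls back when crash telemetry crosses a threshold. Everything in it (the stage sizes, the threshold, and the deploy/telemetry/rollback stubs) is made up for illustration; it is not CrowdStrike’s actual pipeline.

```python
import random
import time

# Hypothetical stage sizes: fraction of the fleet that has the update after each step.
STAGES = [0.001, 0.01, 0.10, 0.50, 1.00]
MAX_CRASH_RATE = 0.001  # abort the rollout if >0.1% of updated hosts report a crash


def deploy(hosts: list[str]) -> None:
    """Push the update to the given hosts (stub)."""
    print(f"updating {len(hosts)} hosts")


def crash_rate(hosts: list[str]) -> float:
    """Fraction of updated hosts reporting a crash or boot loop (stub: fake telemetry)."""
    return random.random() * 0.002


def rollback(hosts: list[str]) -> None:
    """Revert the update on the given hosts (stub)."""
    print(f"rolling back {len(hosts)} hosts")


def staged_rollout(fleet: list[str]) -> bool:
    """Update the fleet in stages, aborting and rolling back on bad telemetry."""
    updated: list[str] = []
    for fraction in STAGES:
        target = fleet[: int(len(fleet) * fraction)]
        already = set(updated)
        new_hosts = [h for h in target if h not in already]
        deploy(new_hosts)
        updated.extend(new_hosts)
        time.sleep(1)  # stand-in for hours or days of soak time
        rate = crash_rate(updated)
        if rate > MAX_CRASH_RATE:
            print(f"crash rate {rate:.4%} above threshold at stage {fraction:.1%}, aborting")
            rollback(updated)
            return False
        print(f"stage {fraction:.1%} healthy (crash rate {rate:.4%})")
    return True


if __name__ == "__main__":
    fleet = [f"host-{i}" for i in range(100_000)]
    staged_rollout(fleet)
```

The point is the shape, not the numbers: even a crude 0.1% canary with a crash-rate check would stop an update that BSODs every machine it touches long before it reaches the whole fleet.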
CrowdStrike’s focus on attack coverage and detection latency, instead of higher-level but harder-to-measure metrics like availability and intrusions, shows the problem of optimizing for the wrong metrics.
CIOs of airlines, hospitals, etc. should have been more paranoid about worst-case scenarios, and governments should also have been more proactive about protecting society.
Notes
CrowdStrike’s stock is down ~33% because of the incident. I thought it would put the company out of business, but there might be too much inertia.
CrowdStrike is giving $10 apology gift cards to its clients, while estimated losses to Fortune 500 companies are ~$5.4B.
8.5M computers were affected according to Microsoft, about 1% of all Windows installs.
CrowdStrike depends on Windows for much of its revenue, but its website’s marketing is not friendly to Windows. I wonder what the atmosphere is like inside the company.
The US government has already started taking steps in the right direction.
Connections
- Iatrogenics: when the cure is worse than the illness.
- Artificial Super Intelligence: this is one of the scenarios that worries people concerned about super-human AIs. Instead of Windows BSOD’ing, an ASI (or a malicious actor) could take over the computers of the people with access to automatic iOS, Windows, or Linux updates, inject malicious code, and do many types of nasty things in the span of a few seconds.
- Complex systems: it’s hard to write bug-free software, harder to do so in distributed systems, and even harder when you have adversaries (hackers) actively trying to outsmart you.
- Antifragility: if we never had incidents like this, would we be over-engineering (no liquids allowed onboard planes, the US’s TSA)? Would the cost be worth it? Where’s the right threshold for highly interconnected systems?
- Disincentives: cybersecurity companies that take their job more seriously aren’t in the news; CIOs get punished if they don’t choose “the #1 cybersecurity tool” and then get hacked; CIOs don’t get rewarded for spending company money on insuring against black swan events.
- Black swan events: how can you tell how far you are from extinction-level events? How much worse could it have been?