The CrowdStrike Windows outage that hit the world this week traces back to an EU-Microsoft deal from 2009 that required Microsoft to give antivirus vendors the same Windows API access it had itself.
We all hate Microsoft for turning Windows into an ad platform but they aren't wrong.
They are legally required to give CrowdStrike, or anyone else, complete low-level access to the OS. They are legally required to let CrowdStrike crash your computer, because anything else means Microsoft is in control and not the software you installed.
It's no different from Linux in that way. If you install a buggy device driver on Linux, that's your fault or the driver's fault, not Linux's.
You are not wrong, but people don't want to hear it. Do we want to retain control over what goes into kernel space or not? If so, we have to accept that whatever we stuff in there can crash the entire thing. That's why we have stuff like driver signatures, which CrowdStrike apparently bypassed through a technical loophole, as I understand it.
The thing is, Microsoft's virus-scanning API shouldn't be able to BSOD anything, no matter what third-party software makes calls to it, or the nature of those calls. They should have implemented some kind of error handler for when the calls are malformed.
So this is really a case of both Crowdstrike and Microsoft fucking up. Crowdstrike shoulders most of the blame, of course, but Microsoft really needs to harden their API to appropriately catch errors, or this will happen again.
I'm an idiot. For some reason, I was thinking about the Windows Defender API, which can be called from third-party applications.
I don't believe there was any specific API in use here, for virus scanning or not. I suppose maybe the device driver API? I am not a kernel developer so I don't know if that's the right term for it.
CrowdStrike's driver was loaded at boot and caused a null pointer dereference inside the kernel. In userspace, when this happens, the kernel is there to catch it, so only the application that caused it crashes. In kernelspace, you get a BSOD because there's nothing underneath the kernel to catch the fault, and halting is safer than carrying on with possibly corrupted state.
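If it helps to see the userspace half of that, here's a minimal sketch in Python. The child process deliberately reads from address 0 via `ctypes`, the OS kills it (or, on Windows, `ctypes` turns the fault into an exception), and the parent carries on. The `CHILD_CODE` and `run_crashing_child` names are made up for this example, and this obviously can't show the kernelspace half.

```python
import subprocess
import sys

# Hypothetical child program: read a C string from address 0 (a null pointer).
CHILD_CODE = "import ctypes; ctypes.string_at(0)"


def run_crashing_child() -> int:
    """Run the faulting code in a separate process and return its exit code."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        capture_output=True,  # keep the child's traceback/noise off our console
    )
    return result.returncode


if __name__ == "__main__":
    code = run_crashing_child()
    # On Linux the code is typically the negative signal number for SIGSEGV;
    # the exact value is platform-dependent, but it is non-zero.
    print(f"child exited with {code}; parent is still running")
```

The fault is contained because the child has its own address space and the kernel sits underneath it. A driver has no such layer underneath it.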
I stand corrected. For some reason, I was thinking they used the actual Windows Defender API, which can be called programmatically from third-party applications, but you're correct, it was a driver loaded at boot. Microsoft isn't at all at fault, here.
Nope. It's a lower level kernel API that has to be accessed at boot via a driver. The API I was thinking of - and I use the term "thinking" loosely, here - is an API that userspace applications can take advantage of to scan files after boot is already complete.
I actually agree. I own my computer / OS and I should be able to do what you're saying (install and break things). But Microsoft is a trillion-dollar multinational corporation and I am certainly going to give them grief about this, because I owe them less than nothing, let alone any goodwill.
But what if Windows had something similar to eBPF on Linux, and CS had opted to use it? Would this disaster not have happened at all, or at least have been much smaller in scale and impact?
If you load hacky shit into the kernel it can always find a way to make a nasty surprise. eBPF is a little bit better fence, not some miracle that automatically fixes shitty code.
But in this case Microsoft certified the driver. If they knew the driver included an interpreter that can run arbitrary code, they shouldn't have certified it, because they cannot fully test it. If they didn't know, then their certification tests are inadequate. Most of the blame lies with the security software, but if Microsoft hadn't certified it, they would have had zero fault.
The Windows Hardware Certification program (formerly Windows Hardware Quality Labs Testing, WHQL Testing, or Windows Logo Testing) is Microsoft's testing process which involves running a series of tests on third-party device drivers, and then submitting the log files from these tests to Microsoft for review. The procedure may also include Microsoft running their own tests on a wide range of equipment, such as different hardware and different Microsoft Windows editions.
For the Nth time, CrowdStrike circumvented the testing process.
Edit: this is not to say that CS didn’t have to in order to provide their services, nor that MS didn’t know about the circumvention and/or delegate testing of config files to CS. I’ll take any opportunity to rag on MS, but in this case it is entirely on CS.
I had a read about WHQL (which I assume is what "certified" means). It uses the Windows HLK to perform a series of tests, the results of which are submitted to Microsoft, and only then will the driver be signed.
While certification isn't endorsement, the testing and the resulting certification imply basic compatibility and reliability. And causing bootloops and BSODs is nowhere close to "basic compatibility and reliability."
Crowdstrike bypassed WHQL because the update was not to the driver, it was to a configuration file that then gets ingested by the driver. It's deliberate so they can push out updates for developing threats without being slowed down by the WHQL process.
And that means when they decide to just send it on a Friday with a buggy config file, nobody is responsible but Crowdstrike.
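For anyone wondering what "a config file ingested by the driver" implies for whoever ships it, here's a rough sketch of the kind of check the consuming code has to do itself, since nobody else is testing that file. Everything here is made up for illustration: the `rule_id,field_index,pattern` format, `MAX_FIELDS`, and the function names are assumptions, not CrowdStrike's actual format or code, and the real thing lives in a kernel driver rather than a Python script.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical: number of input fields the engine makes available to rules.
MAX_FIELDS = 4


@dataclass(frozen=True)
class Rule:
    rule_id: int
    field_index: int
    pattern: str


def parse_rules(text: str) -> list[Rule]:
    """Parse 'rule_id,field_index,pattern' lines; raise ValueError on anything odd."""
    rules = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(",", 2)
        if len(parts) != 3:
            raise ValueError(f"line {lineno}: expected 3 fields, got {len(parts)}")
        rule_id = int(parts[0])      # int() raises ValueError on non-numeric input
        field_index = int(parts[1])
        if not 0 <= field_index < MAX_FIELDS:
            raise ValueError(f"line {lineno}: field_index {field_index} out of range")
        if not parts[2]:
            raise ValueError(f"line {lineno}: empty pattern")
        rules.append(Rule(rule_id, field_index, parts[2]))
    if not rules:
        raise ValueError("config contains no rules")
    return rules


def load_or_keep(current: list[Rule], new_text: str) -> list[Rule]:
    """Return the new rules if the whole file validates, else keep the current ones."""
    try:
        return parse_rules(new_text)
    except ValueError:
        return current
```

The point of `load_or_keep` is that a bad update gets rejected as a whole and the last known-good rules stay in place, instead of the consumer trusting whatever it was handed.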