Delta CEO says CrowdStrike-Microsoft outage cost the airline $500 million

MicroWave@lemmy.world · 4 months ago


exanime@lemmy.world · 4 months ago

Don’t worry everyone… Each and every one of the CEOs involved in this debacle will earn millions this year and next and will eventually retire with more money than they could possibly spend in 10 lifetimes

If anything, they’ll continue to fall upwards completely deserving even more money

lennybird@lemmy.world · 3 months ago

Additionally, don’t worry, they’ll just shift more costs onto the consumer and ultimately widen their profit margins in no time.

Perhaps Boeing can save the airline industry a little more by lowering the costs of their planes by removing another bolt and jerry-rigging flight software onto an antiquated platform.

ASDraptor@lemmy.autism.place · 4 months ago

$499,999,990

Remember that you got your $10 gift card for Uber Eats.

Flying Squid@lemmy.world · 4 months ago

Which didn’t work.

Poem_for_your_sprog@lemmy.world · 4 months ago

Why do news outlets keep calling it a Microsoft outage? It’s only a CrowdStrike issue, right? Microsoft doesn’t have anything to do with it?

jmcs@discuss.tchncs.de · 3 months ago

Because Microsoft could have prevented it by introducing proper APIs in the kernel like Linux did when crowdstrike did the same on their Linux solution?

Echo Dot@feddit.uk · edit-2 4 months ago

It’s sort of 90% one and 10% the other. Mostly it’s a CrowdStrike problem, but Microsoft really should make it so their operating system doesn’t get stuck in a boot loop when a driver is failing. It should be able to detect that and disable the affected driver. Of course, equally, the driver shouldn’t crash just because it doesn’t understand some input it’s being fed.

Also, there is an argument to be made that Microsoft should have pushed back harder against allowing CrowdStrike to effectively bypass their kernel testing policies, since that obviously negates the whole point of the tests.

Of course both these issues also exist in Linux so it’s not as if this is a Microsoft unique problem.

smeenz@lemmy.nz · edit-2 3 months ago

The CrowdStrike driver has the boot_critical flag set, which prevents exactly what you describe from happening.

Echo Dot@feddit.uk · 3 months ago

Yeah, I know, but booting in safe mode ignores the flag, so you can boot with the driver disabled even if it is marked critical. The critical flag only applies to normal boots.
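The boot-loop fallback Echo Dot describes, and the way smeenz and Echo Dot say the boot-critical flag interacts with it, can be modelled in a few lines. This is a toy Python sketch under stated assumptions: the `Driver` record, the `max_failures` threshold, the driver names, and the rule that safe mode ignores the boot-critical flag are all hypothetical, taken from the comments above rather than from how Windows actually loads drivers.

```python
from dataclasses import dataclass


@dataclass
class Driver:
    """Hypothetical driver record for this toy model (not a Windows structure)."""
    name: str
    boot_critical: bool = False
    consecutive_crashes: int = 0  # boots in a row that crashed in this driver
    enabled: bool = True


def drivers_to_load(drivers, safe_mode=False, max_failures=3):
    """Return the names of drivers to load on the next boot.

    Hypothetical policy: a driver that crashed max_failures boots in a row
    is disabled, unless it is boot-critical. In safe mode the boot-critical
    flag is ignored, so a crash-looping critical driver can be disabled too.
    """
    loaded = []
    for d in drivers:
        protected = d.boot_critical and not safe_mode
        if not protected and d.consecutive_crashes >= max_failures:
            d.enabled = False  # stays disabled on later boots too
        if d.enabled:
            loaded.append(d.name)
    return loaded


fleet = [
    Driver("disk"),
    Driver("edr_sensor", boot_critical=True, consecutive_crashes=5),
    Driver("printer", consecutive_crashes=4),
]
print(drivers_to_load(fleet))                  # ['disk', 'edr_sensor']
print(drivers_to_load(fleet, safe_mode=True))  # ['disk']
```

With a crash-looping driver marked boot-critical, the normal boot keeps loading it and only the safe-mode pass gets it disabled, which is roughly the trade-off this part of the thread is arguing about.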

cheddar@programming.dev · edit-2 4 months ago

The answer is simple: they have no idea what they are talking about. And that is true for almost every topic they are reporting about.

rekorse@lemmy.world · 3 months ago

It’s sort of like calling the 9/11 terrorist attack “the day the towers fell.”

Although in my opinion, Microsoft does have some blame here, not for this individual outage but for Windows just being a shit system and for tricking people into relying on it.

Rekhyt@lemmy.world · 4 months ago

It was a Crowdstrike-triggered issue that only affected Microsoft Windows machines. Crowdstrike on Linux didn’t have issues and Windows without Crowdstrike didn’t have issues. It’s appropriate to refer to it as a Microsoft-Crowdstrike outage.

ricecake@sh.itjust.works · 4 months ago

Funny enough, crowdstrike on Linux had a very similar issue a few months back.

Poem_for_your_sprog@lemmy.world · 4 months ago

I guess Microsoft-CrowdStrike is fair, since the OS doesn’t have any kind of protection against a shitty antivirus destroying it.

I keep seeing articles that just say “Microsoft outage”, even on major outlets like CNN.

Dran@lemmy.world · 4 months ago

To be clear, an operating system in an enterprise environment should have mechanisms to access and modify core system functions. Guard-railing anything that could cause an outage like this would make Microsoft a monopoly provider in any service category that requires this kind of access to work (antivirus, auditing, etc). That is arguably worse than incompetent IT departments hiring incompetent vendors to install malware across their fleets resulting in mass-downtime.

The key takeaway here isn’t that Microsoft should change Windows to prevent this, it’s that Delta could have spent any number smaller than $500,000,000 on competent IT staffing and prevented this at a lower cost than letting it happen.

Echo Dot@feddit.uk · 4 months ago

Delta could have spent any number smaller than $500,000,000 on competent IT staffing and prevented this at a lower cost than letting it happen.

I guarantee someone in their IT department raised the point of not just downloading updates. I can guarantee they advised testing them first, because any borderline-competent IT professional knows this stuff. I can also guarantee they were ignored.

ricecake@sh.itjust.works · 4 months ago

Also, part of the issue is that the update rolled out in a way that bypassed the update settings, even on deployments that had auto-updates disabled.

You did not have the ability to disable this type of update or control how it rolled out.

https://www.crowdstrike.com/blog/falcon-content-update-preliminary-post-incident-report/

Their fix for the issue includes “slow rolling their updates”, “monitoring the updates”, “letting customers decide if they want to receive updates”, and “telling customers about the updates”.

Delta could have done everything by the book regarding staggered updates and testing before deployment and it wouldn’t have made any difference at all. (They’re an airline so they probably didn’t but it wouldn’t have helped if they had).
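The promised fix ricecake summarizes (slow rolling, monitoring, letting customers decide) maps onto a standard ring-based, or canary, deployment. A minimal Python sketch, assuming hypothetical ring sizes, a hypothetical crash-rate threshold, and a hypothetical `deploy_to` callback standing in for whatever actually pushes content to hosts; this is an illustration of the technique, not CrowdStrike's actual mechanism.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    opted_in: set,
    deploy_to: Callable[[list], int],
    ring_fractions: Sequence[float] = (0.01, 0.10, 1.0),
    max_crash_rate: float = 0.001,
) -> list:
    """Deploy ring by ring and halt as soon as a ring crashes too often.

    deploy_to(batch) is a hypothetical callback that pushes the update to
    the hosts in batch and returns how many of them crashed afterwards.
    Hosts that have not opted in are never touched.
    Returns the hosts that received the update.
    """
    eligible = [h for h in hosts if h in opted_in]
    done = 0
    for fraction in ring_fractions:
        # Cumulative ring boundary; always at least one host in the first ring.
        target = max(1, round(len(eligible) * fraction))
        batch = eligible[done:target]
        if not batch:
            continue
        crashed = deploy_to(batch)
        done = target
        if crashed / len(batch) > max_crash_rate:
            break  # later rings never see the bad update
    return eligible[:done]


fleet = [f"host{i}" for i in range(100)]
print(len(staged_rollout(fleet, set(fleet), lambda batch: len(batch))))  # 1
print(len(staged_rollout(fleet, set(fleet), lambda batch: 0)))           # 100
```

The part that matters for this argument is the first ring: an update that crashes every host it touches is stopped after one machine instead of the whole fleet, and hosts that never opted in are untouched either way.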

corsicanguppy@lemmy.ca · 4 months ago

Delta could have done everything by the book

Except pretty much every paragraph in ISO/IEC 27002.

That book?

Highlights include:

ops procedures and responsibilities
change management (ohh. That’s a good one)
environmental segregation for safety (i.e. don’t test in prod)
controls against malware
INSTALLATION OF SOFTWARE ON OPERATIONAL SYSTEMS
restrictions on software installation (i.e. don’t have random fuckwits updating stuff)

…etc. like, it’s all in there. And I get it’s super-fetch to do the cool stuff that looks great on a resume, but maybe, just fucking maybe, we should be operating like we don’t want to use that resume every 3 months.

External people controlling your software rollout by virtue of locking you into some cloud bullshit for security software, when everyone knows they don’t give a shit about your app’s security nor your SLA?

Glad Skippy’s got a good looking resume.

ricecake@sh.itjust.works · 4 months ago

Yes, that book. Because the software indicated to end users that they had disabled, or otherwise asserted appropriate controls on, the system updating itself and its update process.

That’s sorta the point of why so many people are so shocked and angry about what went wrong, and why I said “could have done everything by the book”.

As far as the software communicated to anyone managing it, it should not have been doing updates, and CrowdStrike didn’t advertise that it updated certain definition files outside of the exposed settings, nor did they communicate that those changes were happening.

Pretend you’ve got a nice little fleet of servers. Let’s pretend they’re running some vaguely responsible Linux distro, like CentOS or Ubuntu.
Pretend that nothing updates without your permission, so everything is properly by the book. You host local repositories that all your servers pull from so you can verify every package change.
Now pretend that, unbeknownst to you, Canonical or Red Hat had added a little thing to dnf or apt to let it install really important updates really fast, and it didn’t pay any attention to any of your configuration files, not even the setting that says “do not under any circumstances install anything without my express direction”.
Now pretend they use this to push out a kernel update that patches your kernel into a bowl of lukewarm oatmeal and reboots your entire fleet into the abyss.
Is it fair to say that the admin of this fleet is a total fuckup for using a vendor that, up until this moment, was generally well regarded and presented no real reason to doubt while being commonly used? Even though they used software that connected to the Internet, and maybe even paid for it?

People use tools that other people build. When the tool does something totally insane that they specifically configured it not to, it’s weird to just keep blaming them for not doing everything in-house. Because what sort of asshole airline doesn’t write their own antivirus?

skuzz@discuss.tchncs.de · 4 months ago

Honestly, with how terrible Windows 11 has been degrading in the last 8 or 9 months, it’s probably good to turn up the heat on MS even if it isn’t completely deserved. They’re pissing away their operating system goodwill so fast.

There have been some discussions on other Lemmy threads, the tl;dr is basically:

Microsoft has a driver certification process called WHQL.
This would have caught the CrowdStrike glitch before it ever went to production, as the process goes through an extreme set of tests and validations.
AV companies get to circumvent this process, even though other driver vendors have to use it.
The part of CrowdStrike that broke Windows, however, likely wouldn’t have been part of the WHQL certification anyways.
Some could argue software like this shouldn’t be kernel drivers, maybe they should be treated like graphics drivers and shunted away from the kernel.
These tech companies are all running too fast and loose with software and it really needs to stop, but they’re all too blinded by the cocaine dreams of AI to care.

corsicanguppy@lemmy.ca · edit-2 4 months ago

They’re pissing away their operating system goodwill so fast.

They pissed it away {checks DoJ v. Microsoft} 25 years ago.

skuzz@discuss.tchncs.de · 3 months ago

Windows 7 and especially 10 started changing the tune. 10: Linux and Android apps running integrated into the OS, huge support for very old PC hardware, support for Android phone integration, stability improvements like moving video drivers out of the kernel, maintaining backwards compatibility with very old apps (1998 Unreal runs fine on it!) by containerizing some to maintain stability while still allowing old code to run. For a commercial OS, it was trending towards something worth paying for.

dhork@lemmy.world · edit-2 4 months ago

Bastian said the figure includes not just lost revenue but “the tens of millions of dollars per day in compensation and hotels” over a period of five days. The amount is roughly in line with analysts’ estimates. Delta didn’t disclose how many customers were affected or how many canceled their flights.

It’s important to note that the DOT recently clarified a rule that reinforced that if an airline cancels a flight, they have to compensate the customer. So that’s the real reason why Delta had to spend so much, they couldn’t ignore their customers and had to pay out for their inconvenience.

https://www.kxan.com/news/can-you-get-compensation-if-your-flight-was-delayed-or-canceled-by-the-crowdstrike-outage/

So think about how much worse it might have been for fliers if a more industry-friendly Transportation Secretary were in charge. The airlines might not have had to pay out nearly as much to stranded customers, and we’d be hearing about how stranded fliers got nothing at all.

Media Bias Fact Checker@lemmy.world · 4 months ago

CNBC Media Bias Fact Check Credibility: [High]

CNBC is rated with High Credibility by Media Bias Fact Check.

Bias: Left-Center
Factual Reporting: Mostly Factual
Country: United States of America
Full Report: https://mediabiasfactcheck.com/cnbc/

Check the bias and credibility of this article on Ground.News

Thanks to Media Bias Fact Check for their access to the API.
Please consider supporting them by donating.


Media Bias Fact Check is a fact-checking website that rates the bias and credibility of news sources. They are known for their comprehensive and detailed reports.

Beep boop. This action was performed automatically. If you don’t like me then please block me.💔
If you have any questions or comments about me, you can make a post to LW Support lemmy community.

JCreazy@midwest.social · 4 months ago

Revenge for the times they messed up my travel plans.

ulkesh@lemmy.world · 4 months ago

Aw that’s a shame. Poor rich company.

corsicanguppy@lemmy.ca · 4 months ago

No, POOR PLANNING and allowing an external entity the ability to take you down, that’s what did it. Pretend you’re pros, Delta, and be adequate.

Holy halfwit projection, batman.

Xanis@lemmy.world · 4 months ago

The stories I could tell about how companies will hire a team to run tests on their digital and physical systems while also limiting access to outside nodes disconnected or screened from their core, primary, IMPORTANT systems.

Kicker is that plenty of people who work for these companies get it. Very rarely does someone in a position to do something about it actually understand. A few thousand dollars and they could have hired a white hat or two to run penetration tests on their systems and fix the vulnerabilities, or at least shore them up so this fucking 000 bug didn’t impact them so harshly.

But naaaaaaah. Gotta cut payroll, brb.

emax_gomax@lemmy.world · 4 months ago

I’m not sure any kind of pentest would prevent CrowdStrike’s backdoor access to release updates at its own discretion and cadence. The only way to avoid that would be blocking CrowdStrike from accessing the Internet, but I’d bet they’d 100% brick the host rather than let that happen. If anything, this is a good lesson in not installing malware to prevent even worse malware. You handed the keys to your security to a party that clearly doesn’t care, and paid the price. My reaction to CrowdStrike’s legal disclaimer stating they take no responsibility for anything they do: responsibility is the only reason anyone would buy anything from them (aside from being forced by legal requirements, which clearly didn’t have anyone who understood them involved in the legislation).

rekorse@lemmy.world · edit-2 3 months ago

I know it seems shocking, but some companies do and did plan for backup systems in the event their entire Windows platform blue-screened. That’s why there were some companies that had a hard time with it and some that didn’t.

The original poster is correct that Delta should shoulder some of the blame. The outage caused a problem, but it was Delta’s response that caused $500 million in damages. I’m sure CrowdStrike didn’t advise Delta to put all their eggs in one basket, did they?

emax_gomax@lemmy.world · 3 months ago

Yeah, I agree. My whole comment was basically crowdstrike is liable but companies should reflect and take some accountability for their overreliance on CS.

Xanis@lemmy.world · edit-2 4 months ago

Bah… you’re right. I’ve just become so disillusioned by the smoke and mirrors. So many critical systems protected by poorly managed file mazes and a prayer that Susan in accounting doesn’t get anything higher than the digital equivalent of a toddler slamming its face onto a keyboard several times email from bos$6&776ggjskbigman@poorlyspelledcompany.bendover because some 13 year old with computer access got clever.

I’m a bit agitated atm, sorry about that.

TheAuthor_13@lemm.ee · 4 months ago

Good. They’ve been stealing from their customers for decades; this is fuckin’ karmic.

themeatbridge@lemmy.world · 4 months ago

Also, maybe don’t put all your eggs into one single basket, from an infrastructure perspective.

stoy@lemmy.zip · 4 months ago

Yeah, I say as I migrate another service to Azure…

2xsaiko@discuss.tchncs.de · 4 months ago

That’s not just putting all your eggs into a single basket, that’s putting all your eggs into a rotting trashcan

stoy@lemmy.zip · 4 months ago

Tell me you haven’t used Azure without telling me you haven’t used Azure.

Azure is fine. It is not amazing, it is not terrible, it is fine.

hydrashok@sh.itjust.works · 4 months ago

Pretty sure their software’s legal agreement, and the corresponding enterprise legal agreement, already cover this.

The update was the first domino, but the real issue was the disarray of Delta’s IT Operations and their inability to adequately recover in a timely fashion. Sounds like a customer skimping on their lifecycle and capacity planning so that Ed can get just a bit bigger bonus for meeting his budget numbers.

Brkdncr@lemmy.world · 4 months ago

Negligence can make contracts a little less permanent.

hydrashok@sh.itjust.works · 4 months ago

Delta was the only airline to suffer a long outage. That’s why I say Crowdstrike is the kickoff, but the poor, drawn-out response and time to resolve it is totally on Delta.

Brkdncr@lemmy.world · 4 months ago

Idk, CrowdStrike had a few screwups in their pocket before this one. They might be on the hook for costs associated with an outage caused by negligence. I’m not a lawyer, but I do stand next to one in the elevator.

rekorse@lemmy.world · 3 months ago

It breaks down once Delta begins arguing costs directly associated with their poor disaster recovery efforts.

Why is CrowdStrike responsible for Delta’s poor practices?

Semi-Hemi-Lemmygod@lemmy.world · 4 months ago

I wasn’t affected by this at all and only followed it on the news and through memes, but I thought this was something that needed hands-on-keyboard to fix, which I could see not being the fault of IT because they stopped planning for issues that couldn’t be handled remotely.

Was there some kind of automated way to fix all the machines remotely? Is there a way Delta could have gotten things working faster? I’m genuinely curious because this is one of those Windows things that I’m too Macintosh to understand.

Shadow@lemmy.ca · 4 months ago

All the servers and infrastructure should have “lights out management”. I can turn on a server, reconfigure the BIOS and install Windows from scratch on the other side of the world.

Potentially all the workstations / end point devices would need to be repaired though.

The initial day or two I’ll happily blame on crowdstrike. After that, it’s on their IT department for not having good DR plans.
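On the “was there an automated way” question: the workaround CrowdStrike published was to boot each affected machine into Safe Mode or the Windows Recovery Environment, go to `C:\Windows\System32\drivers\CrowdStrike`, and delete the channel file matching `C-00000291*.sys`. The Python sketch below only illustrates that file-matching step against a volume root passed in by the caller (for example, a disk mounted on a rescue machine); the function name and demo file names are hypothetical. It does not solve the hard part, getting a shell on a machine that cannot boot and, often, past BitLocker, which is why so much of the fix needed hands on keyboards.

```python
from pathlib import Path
import tempfile

# Directory and file pattern from the published workaround, expressed relative
# to a volume root so the sketch can run against a mounted disk.
CHANNEL_DIR = Path("Windows", "System32", "drivers", "CrowdStrike")
BAD_PATTERN = "C-00000291*.sys"


def remove_bad_channel_files(volume_root, dry_run=True):
    """Find (and, unless dry_run is True, delete) the faulty channel files."""
    target = Path(volume_root) / CHANNEL_DIR
    matches = sorted(target.glob(BAD_PATTERN)) if target.is_dir() else []
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches


# Demo against a throwaway tree that mimics an affected volume
# (hypothetical file names).
with tempfile.TemporaryDirectory() as root:
    d = Path(root) / CHANNEL_DIR
    d.mkdir(parents=True)
    (d / "C-00000291-00000000-00000032.sys").write_bytes(b"\x00" * 16)
    (d / "C-00000290-00000000-00000001.sys").write_bytes(b"ok")
    print([p.name for p in remove_bad_channel_files(root)])
    remove_bad_channel_files(root, dry_run=False)
    print(sorted(p.name for p in d.iterdir()))
```

Defaulting to `dry_run=True` is a deliberate choice: a remediation script pushed across a fleet should report what it would delete before it deletes anything.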

Riskable@programming.dev · 4 months ago

Yeah… Maybe don’t put all your IT eggs in one basket next time.

Delta is the one that chose to use Crowdstrike on so many critical systems therefore the fault still lies with Delta.

Every big company thinks that when they outsource a solution or buy software they’re getting out of some responsibility. They’re not. When that 3rd party causes a critical failure the proverbial finger still points at the company that chose to use the 3rd party.

The shareholders of Delta should hold this guy responsible for this failure. They shouldn’t let him get away with blaming Crowdstrike.

clstrfck@lemdro.id · 4 months ago

So you think Delta should’ve had a different antivirus/EDR running on every computer?

Th4tGuyII@fedia.io · 4 months ago

I think what @riskable@programming.dev was saying is you shouldn’t have multiple mission critical systems all using the same 3rd party services. Have a mix of at least two, so if one 3rd party service goes down not everything goes down with it
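What Th4tGuyII describes is active/standby failover across two independent vendors. A minimal Python sketch, assuming hypothetical provider callables (`vendor_a` and `vendor_b` are placeholders, not any real vendor API): try each provider in order and fall back to the next when the current one raises.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def with_failover(providers: Sequence[Callable[[], T]]) -> T:
    """Call each provider in order and return the first successful result."""
    errors = []
    for provider in providers:
        try:
            return provider()
        except Exception as exc:  # any failure moves on to the next provider
            errors.append(exc)
    raise RuntimeError(f"all {len(providers)} providers failed: {errors!r}")


def vendor_a():  # hypothetical primary service that is currently down
    raise ConnectionError("vendor A unreachable")


def vendor_b():  # hypothetical secondary service
    return "scan complete (vendor B)"


print(with_failover([vendor_a, vendor_b]))  # scan complete (vendor B)
```

The catch is that this only works when the failing dependency returns an error you can catch. A kernel driver that crashes the host takes the failover logic down with it, so two agents on the same machine would not necessarily help; the redundancy has to sit at the level of separate systems, which is where the cost starts to multiply.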

partial_accumen@lemmy.world · 4 months ago

That sounds easy to say, but in execution it would be massively complicated. Modern enterprises are littered with 3rd party services all over the place. The alternative is writing and maintaining your own solution in house, which is an incredibly heavy lift to cover the entirety of all services needed in the enterprise. Most large enterprises are resource-starved as it is, and this suggestion of having redundancy for any 3rd party service that touches mission-critical workloads would probably increase burden and costs by at least 50%. I don’t see that happening in commercial companies.

Th4tGuyII@fedia.io · 4 months ago

As far as the companies go, their lack of resources is an entirely self-inflicted problem, because they won’t invest in increasing those resources, like more IT infrastructure and staff. It’s the same as many companies that keep terrible backups of their data (if any) when they’re not bound to by the law, because they simply don’t want to pay for it, even though it could very well save them from ruin.

The crowdstrike incident was as bad as it was exactly because loads of companies had their eggs in one basket. Those that didn’t recovered much quicker. Redundancy is the lesson to take from this that none of them will learn.

partial_accumen@lemmy.world · 4 months ago

As far as the companies go, their lack of resources is an entirely self-inflicted problem, because they won’t invest in increasing those resources, like more IT infrastructure and staff.

Play that out to its logical conclusion.

Our example airline suddenly doubles or triples its IT budget.
The increased costs don’t actually increase profit; they merely increase resiliency.
Other airlines don’t do this.
Our example airline has to increase ticket prices or fees to cover the increased IT spending.
Other airlines don’t do this.
Customers start predominantly flying the other airlines with their cheaper fares.
Our example airline goes out of business, or gets acquired by one of the other airlines

The end result is all operating airlines are back to the prior stance.

brianary@startrek.website · 4 months ago

Two big assumptions here.

First, multiple business systems are already being supported, and the OS only incidentally. Assuming double or triple IT costs is very unlikely, but feel free to post evidence to the contrary.

Second, a tight coupling between costs and prices. Anyone that’s been paying attention to gouging and shrinkflation of the past few years of record profits, or the doomsaying virtually anywhere the minimum wage has increased and businesses haven’t been annihilated, would know this is nonsense.

partial_accumen@lemmy.world · 4 months ago

First, multiple business systems are already being supported, and the OS only incidentally. Assuming double or triple IT costs is very unlikely, but feel free to post evidence to the contrary.

The suggestion the poster made was that ALL 3rd party services need to have an additional counterpart for redundancy. So we’re not just talking about a second AV vendor. We have to duplicate ALL 3rd party services running on or supporting critical workloads to meet what that poster is suggesting.

inventory agents
OS patching
security vulnerability scanning
file and DB level backup
monitoring and alerting
remote access management
PAM management
secrets management
config management

…the list goes on.

Anyone that’s been paying attention to gouging and shrinkflation of the past few years of record profits, or the doomsaying virtually anywhere the minimum wage has increased and businesses haven’t been annihilated, would know this is nonsense.

You’re suggesting the companies simply take less profit? Those companies’ boards of directors will get annihilated by shareholders. The boards would be voted out along with their IT improvement plans, and replaced with ones that would return to profitability.