Stream: general

Topic: Incident Horror Stories


Thomas Eckert (Oct 08 2024 at 00:34):

On Friday, I brought down a customer's production cluster for 5 hours by accident when doing maintenance in their environment. Everyone was nice about it, even the customer, but man did I feel stupid.

What are your incident horror stories to celebrate this spooky season?

Philip Durbin (Oct 08 2024 at 01:05):

One time I brought down the production email server because I was ssh'ed into it from my Linux laptop and didn't double-check which shell I was in. I was trying to reboot my laptop and suddenly realized something else was rebooting.

Matthew Sanabria (Oct 08 2024 at 04:21):

We sent around 3.8M custom metrics to Datadog for a few hours. Scared to see the bill.

AJ Kerrigan (Oct 09 2024 at 02:46):

I once deleted hundreds of "unused" database instances... but some core internal tooling assumed those instances would always exist :grimacing: . So I left entire development teams unable to deploy their projects. Ended up performing mass restores not so those databases could actually do any work or accept connections, but purely so the AWS APIs could stop returning empty responses. Felt gross about both the problem _and_ the rollback :sob:

Chris Glaubitz (Oct 19 2024 at 12:40):

Back in about 2001, I fat-fingered the outgoing phone number of a dial-back service. We called customer routers with that number; the router was supposed to reject the call and dial back. Because of the typo, customer routers accepted the call for 1 second and then dialed back. Instead of the usual 0€ bill, it was about 20k€ that month. And a 10cm-high stack of paper invoices.

Jamie Tanna (Oct 30 2024 at 07:16):

https://status.elastic.co/incidents/9mmlp98klxm1 :upside_down: (I work at Elastic)

shaun smiley (Oct 31 2024 at 16:55):

While trying to "test automation", I once took down every EC2 instance in the core AWS account. Thankfully we had been thinking about DR and resilience, so this "accidental chaos" engineering showed us where the gaps were and we didn't lose any data. There were a few critical resources offline, though, which meant hundreds of devs sitting around waiting.


Last updated: Dec 12 2024 at 15:17 UTC