On Friday, I accidentally brought down a customer's production cluster for 5 hours while doing maintenance in their environment. Everyone was nice about it, even the customer, but man did I feel stupid.
What are your incident horror stories to celebrate this spooky season?
One time I brought down the production email server because I was ssh'ed into it from my Linux laptop and didn't double-check which shell I was in. I was trying to reboot my laptop and suddenly realized something else was rebooting.
We sent around 3.8M custom metrics to Datadog for a few hours. Scared to see the bill.
I once deleted hundreds of "unused" database instances... but some core internal tooling assumed those instances would always exist :grimacing:. So I left entire development teams unable to deploy their projects. Ended up performing mass restores, not so those databases could actually do any work or accept connections, but purely so the AWS APIs would stop returning empty responses. Felt gross about both the problem _and_ the rollback :sob:
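For what it's worth, the lesson I took away was to make the "is this really unused?" check a dry run that a human reviews. A rough boto3 sketch of that idea (the tag key and 14-day window are made-up placeholders, not what any real org uses), which only flags an RDS instance as a deletion candidate when it has had zero connections recently and nobody has tagged it as keep, and prints candidates instead of deleting anything:

```python
# Dry-run sketch: list RDS instances that look deletable, but never delete.
import datetime
import boto3

LOOKBACK_DAYS = 14               # assumption: how long "unused" must mean unused
KEEP_TAG_KEY = "do-not-delete"   # assumption: placeholder tag key

rds = boto3.client("rds")
cloudwatch = boto3.client("cloudwatch")
now = datetime.datetime.now(datetime.timezone.utc)

for db in rds.describe_db_instances()["DBInstances"]:
    name = db["DBInstanceIdentifier"]

    # Skip anything explicitly tagged as "keep".
    tags = rds.list_tags_for_resource(ResourceName=db["DBInstanceArn"])["TagList"]
    if any(t["Key"] == KEEP_TAG_KEY for t in tags):
        continue

    # Peak connection count over the lookback window.
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="DatabaseConnections",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": name}],
        StartTime=now - datetime.timedelta(days=LOOKBACK_DAYS),
        EndTime=now,
        Period=3600,
        Statistics=["Maximum"],
    )
    peak = max((p["Maximum"] for p in stats["Datapoints"]), default=0)

    if peak == 0:
        # Print instead of calling delete_db_instance -- a human decides.
        print(f"candidate for deletion: {name}")
```

It wouldn't have caught the tooling that only needed the instances to *exist*, but at least the blast radius would have been a reviewed list rather than a mass delete.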
Back in about 2001, I fat-fingered the outgoing phone number of a dial-back service. We would call customer routers with that number, the router would reject the call and dial back. Because of the typo, customer routers accepted the call for a second and then dialed back. Instead of the usual 0€ bill, it was about 20k€ that month. And a 10cm-high stack of paper invoices.
https://status.elastic.co/incidents/9mmlp98klxm1 :upside_down: (I work at Elastic)
While trying to "test automation", I once took down every EC2 instance in the core AWS account. Thankfully we had been thinking about DR and resilience, so this accidental chaos engineering showed us where the gaps were and we didn't lose any data. A few critical resources were offline, though, which meant hundreds of devs sitting around waiting.
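The guard rail we wished we'd had: automation that only touches instances it can positively identify and that defaults to EC2's DryRun mode. A rough sketch along those lines (the tag key/value and region are placeholders, not our real setup):

```python
# Sketch: terminate only instances carrying an explicit tag; dry-run by default.
import botocore.exceptions
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumption: pin the region explicitly


def terminate_tagged(tag_key="chaos-target", tag_value="true", dry_run=True):
    reservations = ec2.describe_instances(
        Filters=[{"Name": f"tag:{tag_key}", "Values": [tag_value]}]
    )["Reservations"]
    instance_ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]

    if not instance_ids:
        print("nothing matched the tag filter; refusing to guess")
        return

    try:
        ec2.terminate_instances(InstanceIds=instance_ids, DryRun=dry_run)
        print(f"terminated: {instance_ids}")
    except botocore.exceptions.ClientError as err:
        # With DryRun=True, EC2 raises DryRunOperation if the call would have succeeded.
        if err.response["Error"]["Code"] == "DryRunOperation":
            print(f"dry run only, would have terminated: {instance_ids}")
        else:
            raise
```

Refusing to act when the filter matches nothing is the part that would have saved us: the original script fell back to "everything" when its scoping was wrong.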