Havoc Ape: Using Chaos Engineering to Build Resilience and Automate Reboots

In a perfect world, servers would have infinite memory and applications would never leak it. Unfortunately, this isn't a perfect world, and we haven't yet figured out a way around the fundamental laws of physics, so if you've been administering Linux servers for a while, you've probably seen something like this:
[11686.043641] Out of memory: Kill process 2603 (applicationWithSlowMemoryLeak) score 761 or sacrifice child
[11686.043647] Killed process 2603 (applicationWithSlowMemoryLeak) total-vm:1498536kB, anon-rss:721784kB, file-rss:4228kB
If not, then you're either really lucky, or you haven't dealt with imperfect applications. Linux tries its best to handle this situation with the OOM (out-of-memory) killer, a relatively simple part of the kernel that deals with memory saturation to prevent a complete system breakdown. It's well-intentioned, and it tries to kill the process it believes is leaking memory, but it can leave servers in an unknown state, often killing the wrong daemon and requiring manual intervention to recover.
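The OOM killer picks its victim by a per-process "badness" score, which Linux exposes through /proc and lets you bias per process. As a minimal sketch (not the fix described later in this post), here is how you can inspect a process's score and nudge the killer toward or away from it; the value 500 is just an illustrative choice:

```shell
#!/bin/sh
# Read the kernel's current OOM "badness" score for this shell.
# Higher scores are killed first when memory runs out.
cat /proc/self/oom_score

# Bias the OOM killer: oom_score_adj ranges from -1000 to 1000.
# Raising it (making a process a preferred victim) is allowed unprivileged;
# lowering it below the current value needs root/CAP_SYS_RESOURCE,
# and -1000 exempts the process from OOM killing entirely.
echo 500 > /proc/self/oom_score_adj
cat /proc/self/oom_score_adj
```

In practice you would lower oom_score_adj for critical daemons and raise it for expendable workers, so that when the killer does fire, it at least picks a victim you chose in advance.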
For the remainder of this blog post, I'll first dive into how we fixed issues with the OOM killer in production across our fleet of servers, and then cover how we use chaos engineering in production to reduce outages and build resilience into our systems.