
Havoc Ape: Using Chaos Engineering to Build Resilience and Automate Reboots

AI-generated image of a monkey pulling apart power cables in a server room, with sparks flying
Keep your data centre under lock and key, or risk the wrath of the ape... (AI-generated image, Havoc Ape project logo)

In a perfect world, servers would have infinite memory and would never have memory leaks. Unfortunately, this isn't a perfect world, and we've not yet figured out a way around the fundamental laws of physics, so it's quite likely you will have seen something like this if you've been administering Linux servers for a while:

[11686.043641] Out of memory: Kill process 2603 (applicationWithSlowMemoryLeak) score 761 or sacrifice child
[11686.043647] Killed process 2603 (applicationWithSlowMemoryLeak) total-vm:1498536kB, anon-rss:721784kB, file-rss:4228kB

If not, then you're either really lucky, or you haven't dealt with applications that aren't perfect. Linux tries its best to sort this using something called the OOM killer. This is a relatively simple part of the kernel that deals with memory saturation issues to prevent a complete system breakdown. It's well-intentioned, and tries to kill the process that it thinks might have a memory leak, but leaves servers in an unknown state, often killing the wrong daemon and requiring manual intervention to fix.

In the remainder of this post, I'll first dive into how we fix issues with the OOM killer across our fleet of production servers, then cover how we use chaos engineering in production to reduce outages and build resilience into our systems.

The Kernel's OOM Killer

At a simple level (as the kernel docs explain it), if a specified number of memory allocation calls fail in quick succession and there's no swap space left, oom_kill() is invoked and kills a process, chosen by a heuristic that favours memory-hungry but short-lived processes.
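
If you're curious how the kernel currently ranks the processes on a machine, each one exposes its score under /proc. A quick way to poke at this (assuming sshd is running; the paths are standard, the choice of process is just an example):

# The score the kernel has calculated for a process (higher = killed first)
cat /proc/$(pidof -s sshd)/oom_score

# The user-tunable adjustment (-1000 to +1000) that feeds into that score
cat /proc/$(pidof -s sshd)/oom_score_adj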

As well-intentioned as this process is, I find it frequently kills the SSH daemon instead of the process with the slow memory leak 🤦‍♂️, which means that the only way to recover access to the machine is either through a KVM interface or by rebooting it through the web UI, which takes time. The obvious fix is to track down the leak and submit an issue or patch to the offending application, but that isn't always a good use of time.

Fortunately, the kernel maintainers have thought this through and established that not everybody wants their systems left in a nondeterministic state. Whilst the default behaviour of the OOM killer may be good for a shared computer or a personal computer, servers (which want to be treated like cattle, not pets) should never reach this state as manual intervention from a system administrator will likely be required to bring the machine back into a good state.

The sysctl Program and Daemon

In Linux, kernel configuration is mostly done by modifying the virtual /proc/sys filesystem, either at boot time through sysctl.d or on the fly with the sysctl command. A few kernel parameters that can't be changed once the system has booted have to be set on the kernel command line through the bootloader configuration instead. If you've ever used WireGuard through wg-quick, you might've noticed sysctl in use, as the script executes the following when bringing up an interface:

sysctl -q net.ipv4.conf.all.src_valid_mark=1

Additionally, if you have ever used your computer or a server as a router, then you will have also configured the net.ipv4.ip_forward variable to allow the kernel to handle packet forwarding.
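
For example, enabling forwarding temporarily, or persistently via a sysctl.d drop-in, looks something like this (the drop-in file name is just a convention, nothing special):

# Enable forwarding until the next reboot
sudo sysctl -w net.ipv4.ip_forward=1

# Or persist it across reboots with a sysctl.d drop-in
echo "net.ipv4.ip_forward = 1" | sudo tee /etc/sysctl.d/99-forwarding.conf
sudo sysctl --system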

We can therefore use sysctl to modify the kernel parameters and change the behaviour of the OOM killer. Of the many kernel variables available, the two of interest to us are vm.panic_on_oom and kernel.panic. If we set these with sysctl and then spawn a process that constantly eats memory, the system should panic, wait a predefined amount of time, and then restart.
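
Before changing anything, it's worth checking the defaults. On a stock install I'd expect both to be 0 (don't panic on OOM, and never reboot automatically after a panic), but it only takes a second to confirm:

# Read the current values (0 means both behaviours are disabled)
sysctl vm.panic_on_oom kernel.panic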

Testing Different Parameters

To play with this a bit more, I spun up a virtual machine using Vagrant, with the hashicorp/bionic64 box (I find that it's a bit easier to test with a VM, rather than rebooting the entire system every time):

# Create directory for the Vagrantfile
mkdir vagrant-testing && cd vagrant-testing

# Use the Ubuntu 18.04 image
vagrant init hashicorp/bionic64

# Initialise the VM and login
vagrant up
vagrant ssh

# Download GCC to compile a memory hog file
sudo apt-get update && sudo apt-get install -y gcc

cat <<EOF > ~/memalloc.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
  printf("Allocating and touching 10MB of memory with each dot.\n");
  for (;;) {
    char *p = malloc(10 * 1000 * 1000);
    if (p != NULL) {
      /* Touch the pages so they are actually backed by physical memory */
      memset(p, 1, 10 * 1000 * 1000);
    }
    printf(".");
    fflush(stdout); /* stdout is buffered, so flush to make the dots appear */
  }
}
EOF

# Compile and run
gcc memalloc.c -o memalloc
./memalloc
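
To watch the OOM killer at work while memalloc runs, it helps to open a second vagrant ssh session into the VM and follow the kernel log; either of the following should do:

# Follow the kernel ring buffer
sudo dmesg --follow

# Or follow kernel messages via the systemd journal
sudo journalctl -k -f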

Without changing any kernel parameters, I was lucky if the memalloc program itself was the one terminated. Several times I was kicked off the VM because oom_kill chose to kill a different process:

[  130.282896] Out of memory: Kill process 13578 (memalloc) score 891 or sacrifice child
[  173.003505] Out of memory: Kill process 13608 (memalloc) score 891 or sacrifice child
[  173.874717] Out of memory: Kill process 624 (networkd-dispat) score 3 or sacrifice child
[  174.224429] Out of memory: Kill process 1538 ((sd-pam)) score 1 or sacrifice child
[  552.821667] Out of memory: Kill process 13643 (memalloc) score 893 or sacrifice child
[  553.734135] Out of memory: Kill process 1612 (sshd) score 0 or sacrifice child
[  576.156102] Out of memory: Kill process 13737 (memalloc) score 893 or sacrifice child
[  576.727195] Out of memory: Kill process 13648 ((sd-pam)) score 1 or sacrifice child
[  576.736310] Out of memory: Kill process 621 (rsyslogd) score 0 or sacrifice child
[  576.751442] Out of memory: Kill process 13726 (sshd) score 0 or sacrifice child
[  576.769174] Out of memory: Kill process 13726 (sshd) score 0 or sacrifice child
[  576.776256] Out of memory: Kill process 13647 (systemd) score 0 or sacrifice child
[  577.050456] Out of memory: Kill process 13645 (sshd) score 0 or sacrifice child
[  597.310468] Out of memory: Kill process 13841 (memalloc) score 893 or sacrifice child
[  673.662908] Out of memory: Kill process 13842 (memalloc) score 892 or sacrifice child
[  674.478830] Out of memory: Kill process 13756 ((sd-pam)) score 1 or sacrifice child
[  674.505089] Out of memory: Kill process 13749 (rsyslogd) score 0 or sacrifice child
[  674.517725] Out of memory: Kill process 13830 (sshd) score 0 or sacrifice child
[  674.526321] Out of memory: Kill process 13830 (sshd) score 0 or sacrifice child
[  674.537569] Out of memory: Kill process 13755 (systemd) score 0 or sacrifice child
[  674.573690] Out of memory: Kill process 13753 (sshd) score 0 or sacrifice child
[  674.645638] Out of memory: Kill process 602 (accounts-daemon) score 0 or sacrifice child

Whilst this kicked me out of the machine, the VM hadn't been rebooted and was left in the same state, with various other programs killed along the way. Let's make it panic instead:

sudo sysctl -w vm.panic_on_oom=1
vm.panic_on_oom = 1

Running the program now simply makes the system unresponsive, with the kernel panic logged only to the allocated tty, hence the following screenshot:

Screenshot from VirtualBox with the kernel log on it
A kernel panic, as seen from the VirtualBox console.

Now we can set the second parameter, kernel.panic, which makes the kernel reboot automatically a given number of seconds after a panic. At the same time, let's persist both settings in /etc/sysctl.conf so they survive reboots, then make sure they're loaded:

cat <<EOF | sudo tee -a /etc/sysctl.conf
vm.panic_on_oom=1
kernel.panic=10
EOF
sudo sysctl --system

Re-running the memory allocation code, we get the same out-of-memory panic, then as if by magic, the system waits 10 seconds and reboots!
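
Once the VM comes back up, it's easy to confirm both that the reboot actually happened and that the settings persisted:

# Back on the host: reconnect, then check the uptime and kernel parameters
vagrant ssh
uptime
sysctl vm.panic_on_oom kernel.panic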

Docker

I also wanted to check this would work with Docker, as we have a few machines running various servers as Docker containers. I installed Docker and made a minimal Dockerfile for an image that simply runs the memalloc binary:

# Install docker
sudo apt-get install -y docker.io

# Set docker permissions, relogin after this
sudo usermod -aG docker vagrant
exit

# Create the Dockerfile, assuming memalloc compiled in last step
cat <<EOF > Dockerfile
FROM ubuntu:latest
COPY memalloc /usr/local/bin
ENTRYPOINT ["/usr/local/bin/memalloc"]
EOF

# Build the image
docker build -t memalloc .

# Run the image and wait for the kernel panic
docker run memalloc

I was expecting Docker to handle this a bit more gracefully and contain the failure to the container rather than taking down the host, but since containers share the host's kernel, a system-wide OOM still panics the whole machine. As of writing, this is an open issue with Docker. In our case, the host going down on OOM is acceptable, as we wouldn't use the kernel's panic_on_oom option on hosts where file corruption could be an issue, but it isn't what you'd expect from a container, so bear this in mind if trying it yourself.
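
One way to limit the blast radius is to give the container an explicit memory limit, so the allocation hits the container's cgroup limit long before the host runs out of memory. As I read the kernel docs, vm.panic_on_oom=1 only panics on a system-wide OOM, not a cgroup one (a value of 2 panics on both), so the cgroup OOM killer should kill memalloc and leave the host alone:

# Constrain the container to 256MB of RAM with no extra swap
docker run --memory=256m --memory-swap=256m memalloc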

Updating Hosts with Ansible

Knowing the setup works, I can then build a simple Ansible playbook which automatically updates selected hosts to panic when out of memory and then reboot:

---
- hosts: auto-reboot-group
  become: true
  tasks:
    # Set panic when out of memory
    - name: Enable panic on OOM
      ansible.posix.sysctl:
        name: vm.panic_on_oom
        value: '1'
        state: present
    # Set auto reboot 10 seconds after a panic
    - name: Enable automatic reboot on kernel panic
      ansible.posix.sysctl:
        name: kernel.panic
        value: '10'
        state: present
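
Running it is the usual ansible-playbook invocation; the inventory and playbook file names here are just placeholders for whatever your setup uses:

# Apply the sysctl settings to every host in auto-reboot-group
ansible-playbook -i inventory.ini panic-on-oom.yml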

With these changes in place, there should be fewer dead machines needing a manual reboot, as in theory the kernel will now sort this out for us.

File Corruption

It is important to note that a panic doesn't perform a filesystem sync, so corruption could occur on a write-heavy system such as a database server. Under the default configuration, the OOM killer shouldn't pick off database servers or their backing stores if their oom_score_adj (formerly oom_adj) is set properly, so this panic-and-reboot setup is only intended for hosts where data corruption won't be an issue.
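
If you do want to shield a specific daemon from the OOM killer on hosts where you keep the default behaviour, the per-service route via systemd is the tidiest. As a sketch, assuming a PostgreSQL unit called postgresql.service (swap in whatever your service is actually called):

# Create a drop-in that lowers the service's OOM score
# (more negative = less likely to be chosen; -1000 exempts it entirely)
sudo mkdir -p /etc/systemd/system/postgresql.service.d
cat <<EOF | sudo tee /etc/systemd/system/postgresql.service.d/oom.conf
[Service]
OOMScoreAdjust=-900
EOF
sudo systemctl daemon-reload
sudo systemctl restart postgresql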

Chaos Engineering and Apes

I remember reading a series of posts by the Netflix engineering team showing how they automatically and randomly terminate instances in their cloud to test the resilience of their infrastructure. They created a project called Chaos Monkey, which has evolved over the years and can randomly reboot parts of their system or even whole AWS availability zones, as happened back in 2017. This practice is known as chaos engineering, and its aim is to increase the resilience of infrastructure.

In a slightly more primitive way than Netflix, many of our machines have scheduled reboots in place, run from a crontab once a week or so. This is a much simpler system to implement and doesn't take much configuration. Unfortunately, it's susceptible to a couple of issues: if the machine is out of memory it can't run the crontab and save itself, and each machine always reboots at the same, predictable time.
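
For reference, the scheduled reboot is nothing more exotic than a root crontab entry along these lines (the exact day and time vary per machine):

# m h dom mon dow  command
0 4 * * 1  /sbin/shutdown -r now "Scheduled weekly reboot"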

Why not just use Chaos Monkey, you ask? Simply because it's built around Spinnaker, which is the wrong tool for the job for our fleet of servers: it doesn't support something as simple as a plain VPS, only the more complicated public cloud services. We therefore decided to create Havoc Ape (repo TBA), which aims to be conceptually similar to Netflix's Chaos Monkey project but is designed to run against a series of OVH VPSes.

This introduces an uncontrolled failure on each machine, and because we call the API method on the VPS rather than running anything on the machine itself, a machine stuck in a state where it can't run its crontab can still be rebooted. The project also includes randomness, meaning each machine is rebooted at some random point in the week rather than on a fixed schedule. In next week's article, I'll go through the full feature specification for the application and its implementation in Go.

Conclusion

Implementing automatic reboots when the kernel runs out of memory is easily done with sysctl, meaning lower administration overhead and services coming back online quicker. In this article, we tested various kernel variable settings to see how the system behaves, learnt how to persist them, and saw how OOM conditions are handled with Docker. We then looked at how the configuration can be applied to a set of servers with an Ansible playbook, and finally dipped into a bit of chaos engineering.

Thanks for taking the time to read this article! Please leave a comment with your thoughts below, and subscribe to my RSS feed to have next week's post automatically hit your feed reader.
