Blog

Havoc Ape: Using Chaos Engineering to Build Resilience and Automate Reboots

AI-generated image of a monkey pulling apart power cables in a server room, with sparks flying
Keep your data centre under lock and key, or risk the wrath of the ape... (AI-generated image, Havoc Ape project logo)

In a perfect world, servers would have infinite memory and would never have memory leaks. Unfortunately, this isn't a perfect world, and we've not yet figured out a way around the fundamental laws of physics, so it's quite likely you will have seen something like this if you've been administering Linux servers for a while:

[11686.043641] Out of memory: Kill process 2603 (applicationWithSlowMemoryLeak) score 761 or sacrifice child
[11686.043647] Killed process 2603 (applicationWithSlowMemoryLeak) total-vm:1498536kB, anon-rss:721784kB, file-rss:4228kB

If not, then you're either really lucky, or you haven't dealt with applications that aren't perfect. Linux tries its best to handle this with the OOM (out-of-memory) killer, a relatively simple part of the kernel that deals with memory exhaustion to prevent a complete system breakdown. It's well-intentioned, and tries to kill the process it thinks is responsible, but it leaves servers in an unknown state, often killing the wrong daemon and requiring manual intervention to fix.
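To get a feel for which process the OOM killer would pick, you can look at the per-process score the kernel exposes in `/proc/<pid>/oom_score` (higher means more likely to be killed). The snippet below is a minimal sketch rather than the tooling described in this post; the function name `rank_oom_candidates` and its parameters are my own invention for illustration.

```python
from pathlib import Path


def rank_oom_candidates(proc_root="/proc", limit=5):
    """Return (score, pid, name) tuples, highest oom_score first.

    Hypothetical helper: proc_root is a parameter only so the function
    can be pointed at a test directory instead of the real /proc.
    """
    candidates = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # only numeric directories are processes
        try:
            score = int((entry / "oom_score").read_text().strip())
            name = (entry / "comm").read_text().strip()
        except (OSError, ValueError):
            continue  # process exited mid-scan, or the file is unreadable
        candidates.append((score, int(entry.name), name))
    # Highest score first; break ties by PID so the order is stable.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    return candidates[:limit]
```

On a Linux box, calling `rank_oom_candidates()` and printing the result shows the likeliest victims at that moment. If an important daemon is near the top of the list, that's a hint the kernel may kill the wrong thing.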

In the rest of this post, I'll first dive into how we fix issues with the OOM killer in production across our fleet of servers, then cover how we use chaos engineering in production to reduce outages and build resilience in our systems.

Serving Plex Clients from a Raspberry Pi

Plex logo, a plus, and Raspberry Pi logo on a blurry background which is the plex home dashboard.
Running a home media server has never been easier!

Plex is the gold standard for a home-hosted media server, allowing you to stream your home movies or (legally) ripped media from any television, computer, or phone either at home or when out and about.

Plex is widely run across many homelabs, as it is easy to set up and, unlike some other projects, actually provides good value to the end user. In the remainder of this post, I'll break down how I have my Plex server set up for 24/7 access on a Raspberry Pi.

4th Year Module Round Up

Icons of a group of people, a padlock, a CCTV camera, and a calendar, with the CCTV camera being a part of the background photo on a brutalist building.
Developing skills in teamwork, security, and project management. (image by author, background photo by Reuben Hustler)

This was the last year of my degree at the University of Southampton. I wish I'd had this blog active from the start of my studies, so that I could both publish my notes freely and share my thoughts on each module for future cohorts.

It's a shame that it's taken me this long to realise that posts like this might be handy. They make good content for the blog for anyone who wants to know what I've been studying without necessarily reading my notes for each module, but here we are.

During the year, I took the following modules:

The Case for a Local Smart Home

Home assistant logo and a selection of local network addresses
Use Home Assistant and keep your smart devices local!

I like to think that I'm quite tech conscious, and that IoT can be a force for good, if installed properly and made easy enough to use. The most important thing, in my opinion, is that the 'smart' devices are just that, and will work reliably not just now, but a few years down the line. They shouldn't be a pain to maintain, or cause problems if the internet dies, or there's a power cut.

Building an Offsite Backup NAS

Disclosure: Some of the links below are affiliate links. This means that, at zero cost to you, I will earn an affiliate commission if you click through the link and finalise a purchase. Learn more.

Hard drive on fire spurting images
Making sure that failures like these don't cause total data loss (AI generated image)

You've likely heard of the 3-2-1 rule for backups. If not, it's really simple:

  • You want 3 copies of your data,
  • with those copies on 2 different media (e.g., HDD/tape),
  • and finally there should be 1 offsite copy.

When forming your backup plan, you should consider each of these requirements, and formulate a plan for how you'll fulfill them. You also want to consider the type of data you'll be storing, how frequently it'll be accessed, and whether the data is truly irreplaceable, or something that can be re-downloaded or imported from another medium.
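As a quick sanity check, the three requirements can be written down as a tiny script. This is only a sketch of the rule itself, not the setup described in this article; the `BackupCopy` class, the `check_3_2_1` function, and the example copies are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    """One copy of the data (hypothetical model for illustration)."""
    name: str
    medium: str    # e.g. "hdd", "tape", "cloud"
    offsite: bool  # True if stored away from the primary location


def check_3_2_1(copies):
    """Return which parts of the 3-2-1 rule a set of copies satisfies."""
    return {
        "three_copies": len(copies) >= 3,
        "two_media": len({c.medium for c in copies}) >= 2,
        "one_offsite": any(c.offsite for c in copies),
    }


# Example plan (made-up names): a live copy, a local tape, a remote NAS.
plan = [
    BackupCopy("desktop", "hdd", offsite=False),
    BackupCopy("local-tape", "tape", offsite=False),
    BackupCopy("offsite-nas", "hdd", offsite=True),
]
report = check_3_2_1(plan)
```

Here `report` comes back with every rule satisfied. Drop the offsite NAS from `plan` and both `three_copies` and `one_offsite` flip to `False`, which is exactly the gap an offsite backup is meant to close.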

In this article, I'll be explaining my 3-2-1 backup solution, including the architecture, costs, and overall performance.

Raspberry Pi Boot Modes and Security

Set of army boots next to a hard drive and a cloud
Different *boot* modes... (AI generated image)

This is just a quick post to highlight the possible boot modes on the Raspberry Pi, how these can be used in conjunction with network booting, and the key differences between the Pi 3 and 4B. I found all of this quite confusing when I first looked at it, so hopefully this post will help it make a bit more sense.

Some of these principles can be applied to other computers, especially PXE boot. This post is, however, more geared towards the specifics encountered with the Raspberry Pi family of single-board computers.

Testing SD Card Failure and Storage Reliability

AI-generated grim reaper trying to take SD card
How long until your storage dies? (AI generated image)

I've been auditing the SD cards I have in use on various devices, checking both the data on them and whether they are still reliable. I've recently freed up some of these SD cards for use in other projects as I move to network booting my Raspberry Pis. Some of them are completely dead, I think because they have seen lots of small writes from things such as databases, which put a lot of stress on the underlying flash and eventually cause it to fail. One example is a previous Pi running my Home Assistant, which had lots of database activity and randomly died one day.

Wales 2022: Lakes, Climbing Snowdon, and Caving

This is the second and final instalment for the Wales 2022 trip. In this one, we go to Llyn Padarn, one of the largest lakes for swimming in Snowdonia, climb Mt. Snowdon (Yr Wyddfa as the Welsh like to call it) on the Ranger Path, then go caving underground at Zip World in Llechwedd, whilst Jack goes off and looks at a castle somewhere.

South West Coast Path: Pentireglaze Mines to Watergate Bay

This video series comes from my January trip with Dad to Cornwall to smash out yet another bit of the coast path. I didn't post the earlier ones here, mainly because I didn't yet have a functional blog, but they are available on the YouTube playlist.

Here's the latest one, where we walk from Watergate Bay to Porthcothan:

Why Here?

We picked this bit of the coast path based on Dad's research into the flatter sections, as at the time he was having issues with his foot and wanted to take it easy.

Wales 2022 Day 1&2: Travelling up and Climbing Moel Hebog

Almost 2 years after the trip, I've finally finished editing the first episode! In this one, we drive up from Salisbury to Wales, stopping off in the Beacons on the way. On the second day, we climb Moel Hebog, which is the mountain overlooking the campsite.