
Testing SD Card Failure and Storage Reliability

Disclosure: Some of the links below are affiliate links. This means that, at zero cost to you, I will earn an affiliate commission if you click through the link and finalise a purchase. Learn more.

[Image: AI-generated grim reaper trying to take an SD card]
How long until your storage dies? (AI-generated image)

I've been auditing the SD cards I have in use across various devices: what data is on them, and whether they're still reliable. As I move my Raspberry Pis over to network booting, I've recently freed up several of these cards for other projects. Some of them are completely dead. My suspicion is that sustained small writes, from things such as databases, put a lot of stress on the underlying flash and eventually cause it to fail. One example was a previous Pi running my Home Assistant install, which had plenty of database activity and randomly died one day.

Wiping the Filesystem

First, I check what files are on the card. Typically it holds a Linux system, and I can identify it from the /etc/hostname file on /dev/sdX2, the main root filesystem. I then run wipefs to clear the filesystem signatures, before either creating a GPT partition table with a single primary partition or, more often than not, just writing a filesystem straight to the device.

Warning

This is a destructive operation. I take no responsibility if you wipe the wrong device!

sudo wipefs --all --force /dev/sdX # Or /dev/disk/by-id/xxx (probably safer)
sudo mkfs.exfat /dev/sdX

[Image: SD card markings]
One such SD card used to run running-lion, and I wanted to repurpose it for my DJI OSMO camera from years ago. The OSMO was complaining about a slow SD card, even after wiping the card and formatting it as exFAT. I checked the specifications of the Micro-SD card to ensure it met the camera's requirements: it needed UHS Speed Class 1, which I confirmed visually from the markings on the front of the card. This should mean a minimum sequential write throughput of 10MB/s. Kingston have published a guide on what the markings mean.

Testing

Sequential vs Random

Video files are written sequentially, meaning each block of data lands on the card immediately after the previous one. On flash-based devices, the gap between sequential and random write speeds is not as big as on magnetic storage, but it's still important to test sequential speed, since that's what video recording exercises.

I mounted the filesystem with mount:

mount /dev/sdc /mnt/testsd && cd /mnt/testsd
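Before reaching for fio, dd gives a quick, rough sequential number. conv=fdatasync forces the data onto the card before dd reports a rate, so the figure isn't inflated by the page cache (the test file name is arbitrary):

```shell
# Write 256MiB sequentially, flushing to the device before reporting a rate
dd if=/dev/zero of=/mnt/testsd/ddtest bs=1M count=256 conv=fdatasync
rm /mnt/testsd/ddtest
```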

Using fio

From there, I invoked fio, passing the following options:

# name:    job name, also used as the file name on the device
# rw:      sequential writes only (use randwrite to test random I/O)
# bs:      4KiB blocks, standard for most filesystems
# size:    size of the test file
# runtime/time_based: keep writing for a minute, to ensure caches are
#          filled and the card's sustained speed shows through
fio \
  --name=seqwrite \
  --rw=write \
  --bs=4k \
  --size=256m \
  --runtime=60 \
  --time_based

On a healthy SD card, we can expect to see performance at or in excess of the specifications. Truncated fio outputs are given below. This was the healthy SD card:

Run status group 0 (all jobs):
  WRITE: bw=21.1MiB/s (22.1MB/s), 21.1MiB/s-21.1MiB/s (22.1MB/s-22.1MB/s), io=1280MiB (1342MB), run=60785-60785msec

Disk stats (read/write):
  sdc: ios=0/1751, sectors=0/2621442, merge=0/416, ticks=0/58019, in_queue=58019, util=95.49%

And this was the broken one:

Run status group 0 (all jobs):
  WRITE: bw=3936KiB/s (4030kB/s), 3936KiB/s-3936KiB/s (4030kB/s-4030kB/s), io=256MiB (268MB), run=66603-66603msec

Disk stats (read/write):
  sdc: ios=65/470, sectors=65/523153, merge=0/212, ticks=84/65807, in_queue=65892, util=99.24%

Results

We can see that across the board, the healthy SD card performs much better. It writes at a steady 22.1MB/s, well above the minimum for the UHS class 1 spec, compared to the failing card's 3936KiB/s, which is about 4MB/s, or roughly 5.5x slower.
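The unit conversion and ratio are easy to sanity-check (fio's KiB is 1024 bytes, its MB is 10^6):

```shell
# 3936 KiB/s in MB/s, and how many times slower than the healthy 22.1 MB/s
awk 'BEGIN { slow = 3936 * 1024 / 1e6; printf "%.1f MB/s, %.1fx slower\n", slow, 22.1 / slow }'
# → 4.0 MB/s, 5.5x slower
```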

Whilst I can still write to the card, it's very slow, and I wouldn't expect it to last much longer. At the time of writing, you can get a 2-pack of 32GB Micro-SD cards for less than a tenner, and given the hassle of replacing a failed card, it's well worth not relying on one which may fail imminently.

Reliability

It's also worth noting that storage in general is quite unreliable, as we're constantly pushing the bounds of what's possible. Hard drives have moving parts, NAND flash wears out, and other forms of storage can degrade if not stored correctly. Where possible, use something like an Industrial Micro-SD card, which can survive many more write cycles and is designed to be maintenance-free. Better still, network boot your computers, or use traditional RAID or mirrored vdevs (if using ZFS), and make sure any faults are reported before both drives die.

Personal Computer

Since having an SSD fail on me a few years ago, costing me a lot of lost productivity, my computer can now boot and run from either of two SSDs, with all filesystem writes synchronised to both. Whilst this isn't a backup, the reliability of the machine I use for day-to-day work is much higher than it was, and a drive failure is now a mild annoyance rather than a show-stopper. ZFS reports drive status with zpool status, and I've also configured a weekly scrub of all data to detect and repair any silent errors:

  pool: rpool_xxxxxx
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:39:15 with 0 errors on Wed Jun 26 20:55:43 2024
config:

        NAME                                                                STATE     READ WRITE CKSUM
        rpool_tyj3nl                                                        ONLINE       0     0     0
          mirror-0                                                          ONLINE       0     0     0
            /dev/disk/by-id/ata-CT1000BX500SSD1_xxxxxxxxxxxx-part3          ONLINE       0     0     0
            /dev/disk/by-id/nvme-WDC_xxxxxxxxxxx-xxxxxx_xxxxxxxxxxxx-part3  ONLINE       0     0     0

errors: No known data errors
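The weekly scrub itself is easy to automate with a systemd timer. A minimal sketch, assuming a pool named rpool; the unit file names are my own choice, and some distros already ship an equivalent zfs-scrub timer or cron job:

```ini
# /etc/systemd/system/zfs-scrub.service (hypothetical unit name)
[Unit]
Description=Scrub the root ZFS pool

[Service]
Type=oneshot
ExecStart=/usr/sbin/zpool scrub rpool

# /etc/systemd/system/zfs-scrub.timer
[Unit]
Description=Weekly ZFS scrub

[Timer]
OnCalendar=weekly
Persistent=true

[Install]
WantedBy=timers.target
```

Enable it with systemctl enable --now zfs-scrub.timer.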

Network-Booted Devices

Devices such as running-lion are now booted and run across the network, using NFS for filesystem access. The 'disk' for each device now lives on an automatically backed-up SSD, which can sit on a mirrored RAID array for maximum uptime. This is again orders of magnitude more reliable than a single SD card, and gives me far fewer headaches and much more confidence in my data.
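For reference, the server side of this needs little more than one export per device. A sketch with assumed paths and addresses (the subnet, server IP, and directory layout are placeholders for whatever your network uses):

```
# /etc/exports on the NFS server
/srv/netboot/running-lion  192.168.1.0/24(rw,sync,no_subtree_check,no_root_squash)

# Kernel command line on the Pi, pointing root at the export
root=/dev/nfs nfsroot=192.168.1.10:/srv/netboot/running-lion,vers=3 ip=dhcp rw
```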
