Saturday, February 24, 2024

Drive Failures - Data Recovery with Open-Source Tools (part 2)

This is part 2 of a multi-part series.  See part 1 for the beginning of the series.

Note that this is material from 2010 and earlier that pre-dates the common availability of solid state drives.

Detecting failures

Mechanical failures

Mechanical drive failure is nearly always accompanied by some sort of audible noise.  One common sound heard from failing hard drives is the so-called "Click of Death", a sound similar to a watch ticking (but much louder).  This can have various causes, but it is commonly caused by the read/write head inside a drive being stuck or possibly trying to repeatedly read a failing block.

Another common noise is a very high-pitched whine.  This is caused by bearings in a drive failing (most likely rubbing metal-on-metal), usually as a result of old age.  Anything that moves inside a computer (fans, for example) can make a noise like this, so always check a suspect drive away from other sources of noise to verify that the sound is indeed coming from the drive.

Drive motors failing and head crashes can cause other distinctive noises.  As a rule, any noise coming from a hard drive that does not seem normal is probably an indicator of imminent failure.

Electronic failures

Failing electronics can cause a drive to act flaky, fail to be detected, or even occasionally catch fire.

Hard drives have electronics on the inside of the drive which are inaccessible without destroying the drive (unless you happen to have a clean room).  Unfortunately, if those fail, there isn't much you can do.

The external electronics on a hard drive are usually a small circuit board that contains the interface connector and is held onto the drive with a few screws.  In many cases, multiple versions of a drive (IDE, SATA, SCSI, SAS, etc.) exist with different controller interface boards.  Generally speaking, it is possible to transplant the external electronics from a good drive onto a drive with failing electronics in order to get data off the failing drive.  Usually the controller board will need to be off an identical drive with similar manufacturing dates.

Dealing with physical failures

In addition to drive electronics transplanting, just about any trick you've heard of (freezing, spinning, smacking, etc.) has probably worked for someone, sometime.  Whether any of these tricks work for you is a matter of trial and error.  Just be careful.

Freezing drives seems to be especially effective.  Unfortunately, as soon as a drive is operating, it will tend to heat up quickly, so some care needs to be taken to keep drives cool without letting them get wet from condensation.

Swapping electronics often works when faced with electronic failure, but only when the donor drive exactly matches the failed drive.

Freezing drives often helps in cases of crashed heads and electronic problems. Sometimes they will need help to stay cold (ice packs, freeze spray, etc.), but often once they start spinning, they'll stay spinning. Turning a drive on its side sometimes helps with physical problems as well.

Unfortunately, we do have to get a drive to spin for any software data recovery techniques to work.

To be continued in part 3.

Sunday, February 18, 2024

Data Recovery with Open-Source Tools (part 1)

This is material from a class I taught a long time ago.  Some of it may still be useful.  🙂

The original copyright notice:

Copyright © 2009-2010 Steven Pritchard / K&S Pritchard Enterprises, Inc.

This work is licensed under the Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.

This is part 1 of a multi-part series.

Identifying drives

An easy way to get a list of drives attached to a system is to run fdisk -l.  The output will look something like this:

# fdisk -l

Disk /dev/sda: 80.0 GB, 80026361856 bytes

255 heads, 63 sectors/track, 9729 cylinders

Units = cylinders of 16065 * 512 = 8225280 bytes

Disk identifier: 0xcab10bee

   Device Boot      Start         End      Blocks   Id  System

/dev/sda1   *           1        8673    69665841    7  HPFS/NTFS

/dev/sda2            8675        9729     8474287+   c  W95 FAT32 (LBA)

In many cases, you'll see a lot of (generally) uninteresting devices that are named /dev/dm-n.  These are devices created by device mapper for everything from software RAID to LVM logical volumes.  If you are primarily interested in the physical drives attached to a system, you can suppress the extra output of fdisk -l with a little bit of sed.  Try the following:

fdisk -l 2>&1 | sed '/\/dev\/dm-/,/^$/d' | uniq
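To see what that sed expression does, here it is applied to some canned, made-up fdisk-style output (the device sizes below are invented for illustration):

```shell
# The sed expression deletes each block starting at a /dev/dm- line
# and running through the following blank line.
sample='Disk /dev/sda: 80.0 GB, 80026361856 bytes
255 heads, 63 sectors/track, 9729 cylinders

Disk /dev/dm-0: 1073 MB, 1073741824 bytes
255 heads, 63 sectors/track, 130 cylinders

Disk /dev/sdb: 40.0 GB, 40020664320 bytes'

echo "$sample" | sed '/\/dev\/dm-/,/^$/d'
```

The /dev/dm-0 block disappears from the output, while the /dev/sda and /dev/sdb lines survive.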

Whole devices generally show up as /dev/sdx (/dev/sda, /dev/sdb, etc.) or /dev/hdx (/dev/hda, /dev/hdb, etc.).  Partitions on the individual devices show up as /dev/sdxn (/dev/sda1, /dev/sda2, etc.), or, in the case of longer device names, the name of the device with pn appended (an example might be /dev/mapper/loop0p1).



The vast majority of hard drives currently in use connect to a computer using either an IDE (or Parallel ATA) interface or a SATA (Serial ATA) interface.  For the most part, SATA is just IDE with a different connector, but when SATA came out, the old Linux IDE driver had accumulated enough cruft that a new SATA driver (libata) was developed to support SATA controller chipsets.  Later, the libata driver had support for most IDE controllers added, obsoleting the old IDE driver.

There are some differences in the two drivers, and often those differences directly impact data recovery.  One difference is device naming.  The old IDE driver named devices /dev/hdx, where x is determined by the position of the drive.

/dev/hda    Master device, primary controller

/dev/hdb    Slave device, primary controller

/dev/hdc    Master device, secondary controller

/dev/hdd    Slave device, secondary controller

And so on.

Unlike the IDE driver, the libata driver uses what was historically SCSI device naming, /dev/sdx, where x starts at "a" and increments upwards as devices are detected, which means that device names are more-or-less random, and won't be consistent across reboots.
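Since the /dev/sdx names aren't stable, it can help to look at the persistent symlinks udev creates under /dev/disk/, which are based on each drive's model and serial number and survive reboots (a quick check, with a fallback message for systems that have no such devices):

```shell
# Persistent device names (by model/serial, path, UUID, etc.) live
# under /dev/disk/; these stay the same even when /dev/sdx
# assignments change between reboots.
ls -l /dev/disk/by-id/ 2>/dev/null || echo "no /dev/disk/by-id on this system"
```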

The other major difference between the old IDE driver and the libata driver that affects data recovery is how the drivers handle DMA (direct memory access).  The ATA specification allows for various PIO (Programmed I/O) and DMA modes.  Both the old IDE driver and the libata driver will determine the best mode, in most cases choosing a DMA mode initially, and falling back to a PIO mode in error conditions.  The old IDE driver would also let you manually toggle DMA off and on for any device using the command hdparm.

hdparm -d /dev/hdx    Query DMA on/off state for /dev/hdx

hdparm -d0 /dev/hdx    Disable DMA on /dev/hdx

hdparm -d1 /dev/hdx    Enable DMA on /dev/hdx

The libata driver currently lacks the ability to toggle DMA on a running system, but DMA can be turned off for all hard drives with the kernel command line option libata.dma=6, or for all devices (including optical drives) with libata.dma=0.  On a running system, the value of libata.dma can be found in /sys/module/libata/parameters/dma.  (The full list of numeric values for this option can be found in the kernel documentation.)  There does not appear to be a way to toggle DMA per device with the libata driver.
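For example, checking the current setting and making the change persistent might look like this (a sketch; the sysfs path is standard, but the grub snippet assumes a GRUB2-based distro):

```shell
# Check the current libata DMA policy on a running system (the file
# won't exist if the libata module isn't loaded):
cat /sys/module/libata/parameters/dma 2>/dev/null

# To change the setting, add the option to the kernel command line,
# e.g. in /etc/default/grub on GRUB2 systems, then regenerate the
# grub configuration:
#   GRUB_CMDLINE_LINUX="... libata.dma=6"
```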

There are several reasons why you might want to toggle DMA on or off for a drive.  Some failing drives simply won't work unless DMA is disabled, while in rare cases others won't work unless DMA is enabled.  The computer itself might also have issues when reading from a failing drive with DMA enabled.  (The libata driver usually handles these situations fairly well.  The old IDE driver only began to handle them well in recent years.)

In addition to those reasons, PIO mode forces a drive to a maximum speed of 25MB/s (PIO Mode 6, others are even slower), while DMA modes can go up to 133MB/s.  Some drives appear to work better at these lower speeds.


While SCSI drives and controllers are less common than they once were, all current hard drive controller interfaces now use the kernel SCSI device layers for device management and such.  For example, all devices that use the SCSI layer will show up in /proc/scsi/scsi.

# cat /proc/scsi/scsi

Attached devices:

Host: scsi0 Channel: 00 Id: 00 Lun: 00

  Vendor: TSSTcorp Model: CD/DVDW TS-L632D Rev: AS05

  Type:   CD-ROM                           ANSI  SCSI revision: 05

Host: scsi1 Channel: 00 Id: 00 Lun: 00

  Vendor: ATA      Model: ST9160821A       Rev: 3.AL

  Type:   Direct-Access                    ANSI  SCSI revision: 05

Host: scsi3 Channel: 00 Id: 00 Lun: 00

  Vendor: ATA      Model: WDC WD10EACS-00Z Rev: 01.0

  Type:   Direct-Access                    ANSI  SCSI revision: 05

In most cases, it is safe to remove a device that isn't currently mounted, but to be absolutely sure it is safe, you can also explicitly tell the kernel to disable a device by writing to /proc/scsi/scsi.  For example, to remove the third device (the Western Digital drive in this example), you could do the following:

echo scsi remove-single-device 3 0 0 0 > /proc/scsi/scsi

Note that the four numbers correspond to the controller, channel, ID, and LUN in the example.

In cases where hot-added devices don't automatically show up, there is also a corresponding add-single-device command.
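If there are several devices to deal with, a small helper can generate the commands from the /proc/scsi/scsi listing.  (This helper is just a sketch; it assumes the "Host:" line format shown above.)

```shell
# Turn "Host: scsiN Channel: NN Id: NN Lun: NN" lines into the
# matching remove-single-device commands.
scsi_remove_cmds() {
    awk '/^Host:/ {
        gsub("scsi", "", $2)    # "scsi3" -> "3"
        printf "echo scsi remove-single-device %d %d %d %d > /proc/scsi/scsi\n", $2, $4, $6, $8
    }'
}

# For example:
scsi_remove_cmds <<'EOF'
Host: scsi3 Channel: 00 Id: 00 Lun: 00
EOF
# -> echo scsi remove-single-device 3 0 0 0 > /proc/scsi/scsi
```

Review the generated commands before actually running them against /proc/scsi/scsi.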

When recovering data from SCSI (and SCSI-like drives such as SAS), there are no special tricks like DMA toggling.

USB, etc.

The Linux USB drivers are rather resilient in the face of errors, so no special consideration needs to be given when recovering data from thumb drives and other flash memory (except that these devices tend to work or not, and, of course, dead shorts across USB ports are a Bad Thing).  USB-to-ATA bridge devices are a different matter entirely though.  They tend to lock up hard or otherwise behave badly when they hit errors on a failing drive.  Generally speaking, they should be avoided for failing drives, but drives that are OK other than a trashed filesystem or partition table should be completely fine on a USB-to-ATA bridge device.

To be continued in part 2.

Wednesday, May 17, 2023

libvirt surprise

I just noticed that some of my libvirt VMs had on_crash set to destroy instead of restart. It looks like there is an easy fix:

for vm in $( virsh list --name ) ; do virt-xml "$vm" --edit --events on_crash=restart ; done

I don't know if something changed in virt-manager/virt-install over the years, or if I ran into this a long time ago and forgot about it.

Now I just need to remember to add that --events option to virt-install in the future... 🙂
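For future reference, a virt-install invocation with that option might look something like this (a sketch; the name, sizes, and other options here are placeholders):

```shell
# Create a new VM with on_crash set to restart from the start,
# instead of fixing it up afterwards with virt-xml.
virt-install \
    --name example-vm \
    --memory 2048 \
    --vcpus 2 \
    --disk size=20 \
    --import \
    --events on_crash=restart
```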

Tuesday, May 9, 2023

Harvester HCI

I have been using libvirt on CentOS + ZFS for my home lab for somewhere around a decade now.  For the last several years, I have been trying off and on to switch to some kind of hyperconverged infrastructure, usually oVirt + a clustered storage solution (Ceph, Gluster).  For various reasons, I've never quite managed to get all the pieces to work together correctly.

So, imagine how happy I was to hear about Harvester a while back.

Harvester is a modern Hyperconverged infrastructure (HCI) solution built for bare metal servers using enterprise-grade open source technologies including Kubernetes, Kubevirt and Longhorn.

I love everything about this!  It's a more modern take on hyperconverged infrastructure than what I was trying to assemble.  Plus, all my problems magically disappear when all the pieces work together out of the box, right?

Well...  Not quite.  I installed version 0.3.0.  It made for a cool demo, but, thanks to stability problems and a whole lot of missing features, it wasn't quite ready for anything resembling production use.  (Granted, this is my home lab, but I still run virtualized firewalls and stuff like that on it, so I need it to work, and work reliably.)

I'll note here that I wrote all of the above over a year ago.  I then closed with a list of reasons why Harvester wasn't good enough for me to actually use it at the time.  It (unintentionally!) sounded negative enough that I decided not to publish the post.

So here we are a year or so later, and after a few more failed attempts I recently tried Harvester again, this time with version 1.1.1.  Everything I need to work seems to, and I'm ready to start migrating some real workloads!

That's not to say that everything is perfect.  There are a few useful features on the roadmap that I could benefit from (like anti-affinity rules, zero-downtime upgrades, ...), and I still have some challenges.  Some examples:

  • Automating node installation is ... let's say difficult?
  • Networking is almost as functional as I want it, but I still haven't been able to figure out how to move storage replication to a network with jumbo frames.
  • I want real certs.  I see how to manually manage the certs, but it's not immediately obvious how I could manage them automatically (for Let's Encrypt).
Thankfully none of those things are keeping me from using Harvester.  They're just things to look forward to in future upgrades. 😀

Saturday, September 10, 2022

Migrating libvirt VMs

I recently moved a bunch of libvirt VMs from a CentOS 7 host to a CentOS Stream 9 host. Normally moving virtual machines from one libvirt host to another is pretty easy. All you need to do is stop the VM on the original host, copy the disk image from host to host (with rsync or whatever is convenient), dump the VM config (with virsh dumpxml guest), and import that config on the new host (with virsh define). It turns out a few things have changed that make that not quite work though...

The first thing (and the easiest to fix) was that a lot of old machine types that worked in CentOS 7's libvirt no longer work. The easy answer is to switch to the generic pc machine type.

The harder one to deal with was that Spice support was dropped. This meant switching the graphics to VNC, using virtio for the virtual video hardware, and removing all the other Spice-related devices.

The libvirt VM configuration is all XML, so I wrote a script that uses xmlstarlet to make all the necessary changes.


#!/bin/bash

set -e

remote="${REMOTE_HOST:?'Set REMOTE_HOST environment variable'}"

for guest in "$@" ; do
    xml="$( mktemp XXXXXXXX.xml )"
    trap "rm -fv '$xml'" EXIT
    ssh root@"$remote" virsh dumpxml "$guest" | \
        xmlstarlet ed \
            -u '/domain/os/type[starts-with(@machine, "pc-i440fx")]/@machine' -v pc \
            -u '/domain/os/type[starts-with(@machine, "rhel")]/@machine' -v pc \
            -u '/domain/devices/video/model[@type="qxl"]/@type' -v virtio \
            -d '/domain/devices/video/model[@type="virtio"]/@ram' \
            -d '/domain/devices/video/model[@type="virtio"]/@vram' \
            -d '/domain/devices/video/model[@type="virtio"]/@vgamem' \
            -d '/domain/devices/graphics[@type="spice"]/@port' \
            -i '/domain/devices/graphics[@type="spice"]' -t attr -n port -v -1 \
            -u '/domain/devices/graphics[@type="spice"]/@type' -v vnc \
            -d '/domain/devices/channel[@type="spicevmc"]' \
            -d '/domain/devices/redirdev[@type="spicevmc"]' \
        > "$xml"
    virsh define "$xml"
    rm -fv "$xml"
    trap - EXIT
    virsh autostart "$guest"
done

(The above script is also available online.)

Monday, February 7, 2022

Recovering a ZFS array

In August of 2016, I lost a (relatively) large (for me at the time) ZFS array. Rather than tell you how it happened to me, watch how it happened to Linus Tech Tips.

My story is almost identical, except the array in question was much smaller, but to make matters worse it was mostly cobbled together with old hardware, including drives, so when the array died, there were a lot of bad drives.

My array started life as 15 750GB drives in a RAID-Z2. As the 750GB drives failed, they were replaced with 1TB drives. Unfortunately, I continued to use a mix of previously-used drives and some Seagate drives that apparently weren't Seagate's best work. The end result was that drives were failing rather often, and due to lack of time, attention, and a ready supply of spare drives, I wasn't great at replacing them when they failed.

The biggest problem with RAID-5/6 and the RAID-Z equivalents is that rebuilds from a drive failure involve a lot of I/O to all of the drives in the array. RAID-Z2 allows you to lose two drives, but if you lose a third from the stress mid-rebuild, your whole array is toast. In my case, I didn't realize that I had a major problem until the third drive started to fail and ZFS took the array offline. A couple of the remaining drives had SMART errors and likely weren't going to survive a rebuild. I was going to have to clone all of the drives with errors before trying to rebuild. If I wanted a non-destructive fall-back plan, I needed to clone every drive, so if all else failed I could go back to the original array members to try again.

So... I didn't want to buy another 15 1TB drives. Where was I going to find enough disks (or raw space for disk images) to make a copy of the array?

My ultimate answer came almost 5.5 years later as I was working on rebuilding my Harvester cluster (more on that some other time). I had several larger drives in the cluster, so while everything was down and disassembled, I put 5 of them in one system and built a ZFS pool. With around 18TB usable, I had more than enough space to store images of all of the drives in the array!

Enough time had passed that I wasn't sure which drives were which, so I wrote a script to examine the metadata on each drive and then clone the drive to a name based on the pool, GUID, and last modified timestamp:


#!/bin/bash

set -e

dd_args=(
    "-b" "1M"
    "-B" "4k"
)

warn() {
    echo "$@" >&2
}

die() {
    warn "$@"
    exit 1
}

usage() {
    warn "$( basename "$0" ) device [device [...]]"
}

get_first() {
    local key="$1"
    local text="$2"
    local value

    value=$( echo "$text" | awk "(\$1 == \"$key:\") { print \$2; exit 0; }" )

    echo "${value//\'/}"
}

get_pool() {
    get_first "name" "$@"
}

get_guid() {
    get_first "guid" "$@"
}

get_timestamp() {
    local text="$1"
    local timestamps

    timestamps=( $( echo "$text" | awk '($1 == "timestamp") { print $3 }' | sort -n ) )

    echo "${timestamps[-1]}"
}

get_output_filename() {
    local base="$1"
    local n=0

    while [ -f "${base}-${n}.img" ] ; do
        warn "${base}-${n}.img exists."
        n=$(( n + 1 ))
    done

    echo "${base}-${n}.img"
}

if [ "$#" -lt 1 ] ; then
    usage
    exit 1
fi

cd /volumes/recovery/disks

for device in "$@" ; do
    if [ ! -e "${device}1" ] ; then
        die "Can't find first partition for device $device"
    fi

    zdb=$( zdb -l -u "${device}1" )

    pool=$( get_pool "$zdb" )
    guid=$( get_guid "$zdb" )
    timestamp=$( get_timestamp "$zdb" )

    echo "Recovering $guid from pool $pool last updated $( date --date="@$timestamp" )..."

    mkdir -pv "${pool}/${guid}"

    filename=$( get_output_filename "${pool}/${guid}/${timestamp}" )

    # Log and bad-block file names alongside the image (these
    # definitions were missing from the post as published):
    logfile="${filename%.img}.log"
    badfile="${filename%.img}.bad"

    echo "Cloning $device to $filename (logging to $logfile)..."

    dd_rescue "${dd_args[@]}" -l "$logfile" -o "$badfile" "$device" "$filename"
done

The script uses zdb to get metadata from the drive, then uses dd_rescue to clone the drive to a file.

Once that finished, I made a snapshot of the entire filesystem (with zfs snapshot), mapped the files to block devices (with losetup), and activated partitions on the loopback devices (with kpartx). Then I was able to import the pool (with zpool import) and find and fix all the errors (with zpool scrub).

Very roughly, the commands I used went something like this:

  • zpool create -f -m /volumes/recovery -o ashift=12 recovery raidz /dev/disk/by-id/ata-TOSHIBA_!(*-part[0-9])
  • zfs create -o compress=zstd-fast recovery/disks
  • Insert the drives and run the script above against each one.
  • zfs snapshot recovery/disks@$( date +%Y%m%d%H%M%S )
  • for file in /volumes/recovery/disks/*/*/*.img ; do losetup -f -v $file; done
  • for loop in $( losetup -a | awk -F: '{print $1}' ) ; do kpartx -a $loop ; done
  • zpool import -d /dev/disk/by-id -f pool_name
  • zpool scrub pool_name

Now I just need to find enough space to rsync or zfs send | zfs receive all that data. 😀
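When I do, the transfer will probably look something like this (a sketch; the pool, snapshot, and host names are all placeholders):

```shell
# Take a recursive snapshot of the recovered pool, then replicate the
# whole thing (all datasets and snapshots) to another machine.
zfs snapshot -r pool_name@copy
zfs send -R pool_name@copy | ssh otherhost zfs receive -F backup/pool_name
```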

Tuesday, February 1, 2022

Video from OLF 2021

I had two talks at OLF in December. I just noticed that videos are up on YouTube for both of them.

I Like GitLab... and So Should You

Infrastructure Prototyping with Bolt and Vagrant