Monday, February 7, 2022

Recovering a ZFS array

In August of 2016, I lost what was, for me at the time, a relatively large ZFS array. Rather than tell you how it happened to me, watch how it happened to Linus Tech Tips.

My story is almost identical, except that my array was much smaller. To make matters worse, it was mostly cobbled together from old hardware, including the drives, so by the time the array died there were a lot of bad drives.

My array started life as fifteen 750GB drives in a RAID-Z2. As the 750GB drives failed, they were replaced with 1TB drives. Unfortunately, I continued to use a mix of previously-used drives and some Seagate drives that apparently weren't Seagate's best work. The end result was that drives were failing rather often, and due to a lack of time, attention, and a ready supply of spare drives, I wasn't great at replacing them when they failed.

The biggest problem with RAID-5/6 and the RAID-Z equivalents is that rebuilding after a drive failure involves a lot of I/O on every drive in the array. RAID-Z2 allows you to lose two drives, but if a third fails from the stress mid-rebuild, your whole array is toast. In my case, I didn't realize that I had a major problem until a third drive started to fail and ZFS took the array offline. A couple of the remaining drives had SMART errors and likely weren't going to survive a rebuild, so I would have to clone all of the drives with errors before trying to rebuild. And if I wanted a non-destructive fall-back plan, I needed to clone every drive, so that if all else failed I could go back to the original array members and try again.

So... I didn't want to buy another fifteen 1TB drives. Where was I going to find enough disks (or raw space for disk images) to make a copy of the array?

My ultimate answer came almost 5.5 years later as I was working on rebuilding my Harvester cluster (more on that some other time). I had several larger drives in the cluster, so while everything was down and disassembled, I put 5 of them in one system and built a ZFS pool. With around 18TB usable, I had more than enough space to store images of all of the drives in the array!
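As a quick sanity check on that claim, here is the arithmetic as a small sketch. It assumes the worst case, where all fifteen drives are 1TB, and takes the 18TB figure at face value; both numbers are round decimal terabytes rather than measured sizes.

```shell
#!/bin/bash
# Assumed figures: 15 source drives of 1 TB each (worst case),
# and 18 TB usable on the new pool.
drives=15
drive_tb=1
usable_tb=18

# Raw space needed to hold one full image per drive.
needed_tb=$(( drives * drive_tb ))
spare_tb=$(( usable_tb - needed_tb ))

echo "Images need up to ${needed_tb} TB, leaving ${spare_tb} TB spare"
```

Any remaining 750GB drives would only make the margin larger.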

Enough time had passed that I wasn't sure which drives were which, so I wrote a script to examine the metadata on each drive and then clone the drive to a name based on the pool, GUID, and last modified timestamp:

#!/bin/bash

set -e

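dd_args=(
    "-d"          # use direct I/O (O_DIRECT) when reading the source
    "-b" "1M"     # soft block size used for normal copying
    "-B" "4k"     # hard block size used as a fallback after read errors
    "-w"          # abort on write errors
    "-A"          # always write blocks, zero-filled where the read failed
)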

warn() {
    echo "$@" >&2
}

die() {
    warn "$@"
    exit 1
}

usage() {
    warn "$( basename "$0" ) device [device [...]]"
}

get_first() {
    key="$1"
    text="$2"

    value=$( echo "$text" | awk "(\$1 == \"$key:\") { print \$2; exit 0; }" )

    echo "${value//\'/}"
}

get_pool() {
    get_first "name" "$@"
}

get_guid() {
    get_first "guid" "$@"
}

get_timestamp() {
    text="$1"

    timestamps=( $( echo "$text" | awk '($1 == "timestamp") { print $3 }' | sort -n ) )

    echo "${timestamps[-1]}"
}

get_output_filename() {
    base="$1"

    n=1
    while [ -f "${base}-${n}.img" ] ; do
        warn "${base}-${n}.img exists." 
        (( n++ ))
    done

    echo "${base}-${n}.img"
}

if [ "$#" -lt 1 ] ; then
    usage
    exit 1
fi

cd /volumes/recovery/disks

for device in "$@" ; do
    if [ ! -e "${device}1" ] ; then
        die "Can't find first partition for device $device"
    fi

    zdb=$( zdb -l -u "${device}1" )

    pool=$( get_pool "$zdb" )
    guid=$( get_guid "$zdb" )
    timestamp=$( get_timestamp "$zdb" )

    echo "Recovering $guid from pool $pool last updated $( date --date="@$timestamp" )..."

    mkdir -pv "${pool}/${guid}"

    filename=$( get_output_filename "${pool}/${guid}/${timestamp}" )
    logfile="${filename%.img}.log"
    badfile="${filename%.img}.bad"

    echo "Cloning $device to $filename (logging to $logfile)..."

    dd_rescue "${dd_args[@]}" -l "$logfile" -o "$badfile" "$device" "$filename"
done

The script uses zdb to get metadata from the drive, then uses dd_rescue to clone the drive to a file.
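To make the metadata step concrete, here is a minimal sketch of the same parsing run against a canned label instead of a real drive. The sample text only imitates the general layout of zdb -l -u output; the pool name, GUIDs, and timestamps are invented, and label_value and latest_timestamp are hypothetical stand-ins for the script's get_first and get_timestamp.

```shell
#!/bin/bash
# Invented sample that imitates the layout of `zdb -l -u` output.
# The pool name, GUIDs, and timestamps are made up for illustration.
sample=$( cat <<'EOF'
    version: 5000
    name: 'tank'
    state: 0
    txg: 4821
    pool_guid: 1111111111111111111
    guid: 2222222222222222222
    vdev_tree:
        type: 'raidz'
        guid: 3333333333333333333
    Uberblock[0]
        txg = 4820
        timestamp = 1470000000 UTC = Sun Jul 31 21:20:00 2016
    Uberblock[1]
        txg = 4821
        timestamp = 1470000300 UTC = Sun Jul 31 21:25:00 2016
EOF
)

# Print the value of the first "key: value" line, without single quotes.
label_value() {
    local key="$1" text="$2"
    awk -v key="${key}:" '$1 == key { print $2; exit }' <<< "$text" | tr -d "'"
}

# Print the newest uberblock timestamp (seconds since the epoch).
latest_timestamp() {
    awk '$1 == "timestamp" { print $3 }' <<< "$1" | sort -n | tail -n 1
}

pool=$( label_value name "$sample" )
guid=$( label_value guid "$sample" )
timestamp=$( latest_timestamp "$sample" )

# The first image for this drive would land at this relative path.
echo "${pool}/${guid}/${timestamp}-1.img"
```

Because only the first matching line is used, the drive's own guid is picked up rather than pool_guid or the guid nested under vdev_tree.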

Once that finished, I made a snapshot of the entire filesystem (with zfs snapshot), mapped the files to block devices (with losetup), and activated partitions on the loopback devices (with kpartx). Then I was able to import the pool (with zpool import) and find and fix all the errors (with zpool scrub).

Very roughly, the commands I used went something like this:

  • zpool create -f -m /volumes/recovery -o ashift=12 recovery raidz /dev/disk/by-id/ata-TOSHIBA_!(*-part[0-9])
  • zfs create -o compress=zstd-fast recovery/disks
  • Insert the drives and run the script above against each one.
  • zfs snapshot recovery/disks@$( date +%Y%m%d%H%M%S )
  • for file in /volumes/recovery/disks/*/*/*.img ; do losetup -f -v "$file" ; done
  • for loop in $( losetup -a | awk -F: '{print $1}' ) ; do kpartx -a "$loop" ; done
  • zpool import -d /dev/disk/by-id -f pool_name
  • zpool scrub pool_name
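
With this many images, it can help to preview the loopback commands before running them. The sketch below is not what I ran: it swaps the separate losetup and kpartx steps for a single losetup call with --partscan, which asks the kernel to create the partition nodes itself (named like /dev/loop0p1). The run and attach_images helpers and the DRY_RUN variable are hypothetical names, and the zpool import -d directory would presumably need adjusting to wherever those nodes appear.

```shell
#!/bin/bash
# Sketch only: attach disk images as loop devices with partition scanning.
# DRY_RUN defaults to 1, so commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode; otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ] ; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Attach each image file given as an argument, skipping anything
# that is not a regular file.
attach_images() {
    local image
    for image in "$@" ; do
        if [ ! -f "$image" ] ; then
            echo "Skipping $image: not a regular file" >&2
            continue
        fi
        run losetup --find --show --partscan "$image"
    done
}
```

Calling attach_images /volumes/recovery/disks/*/*/*.img prints one losetup line per image; setting DRY_RUN=0 and running as root would actually attach them.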

Now I just need to find enough space to rsync or zfs send | zfs receive all that data. 😀
