Monday, February 7, 2022

Recovering a ZFS array

In August of 2016, I lost a (relatively) large (for me at the time) ZFS array. Rather than tell you how it happened to me, watch how it happened to Linus Tech Tips.

My story is almost identical, except that my array was much smaller. To make matters worse, it was mostly cobbled together from old hardware, including the drives, so when the array died, a lot of the drives were bad.

My array started life as 15 750GB drives in a RAID-Z2. As the 750GB drives failed, they were replaced with 1TB drives. Unfortunately, I continued to use a mix of previously-used drives and some Seagate drives that apparently weren't Seagate's best work. The end result was that drives were failing rather often, and due to lack of time, attention, and a ready supply of spare drives, I wasn't great at replacing them when they failed.

The biggest problem with RAID-5/6 and the RAID-Z equivalents is that rebuilding after a drive failure involves a lot of I/O to all of the drives in the array. RAID-Z2 allows you to lose two drives, but if you lose a third from the stress mid-rebuild, your whole array is toast. In my case, I didn't realize that I had a major problem until the third drive started to fail and ZFS took the array offline. A couple of the remaining drives had SMART errors and likely weren't going to survive a rebuild. I was going to have to clone all of the drives with errors before trying to rebuild. If I wanted a non-destructive fall-back plan, I needed to clone every drive, so if all else failed I could go back to the original array members to try again.

So... I didn't want to buy another 15 1TB drives. Where was I going to find enough disks (or raw space for disk images) to make a copy of the array?

My ultimate answer came almost 5.5 years later as I was working on rebuilding my Harvester cluster (more on that some other time). I had several larger drives in the cluster, so while everything was down and disassembled, I put 5 of them in one system and built a ZFS pool. With around 18TB usable, I had more than enough space to store images of all of the drives in the array!

Enough time had passed that I wasn't sure which drives were which, so I wrote a script to examine the metadata on each drive and then clone the drive to a name based on the pool, GUID, and last modified timestamp:


#!/bin/bash

set -e

# dd_rescue options: 1M soft block size, falling back to 4k blocks on errors.
dd_args=(
    "-b" "1M"
    "-B" "4k"
)

warn() {
    echo "$@" >&2
}

die() {
    warn "$@"
    exit 1
}

usage() {
    warn "$( basename "$0" ) device [device [...]]"
}

# Print the first value for "key:" in the zdb label output, with quotes stripped.
get_first() {
    local key="$1"
    local text="$2"
    local value

    value=$( echo "$text" | awk "(\$1 == \"$key:\") { print \$2; exit 0; }" )

    echo "${value//\'/}"
}

get_pool() {
    get_first "name" "$@"
}

get_guid() {
    get_first "guid" "$@"
}

# Print the newest uberblock timestamp found in the zdb output.
get_timestamp() {
    local text="$1"
    local timestamps

    timestamps=( $( echo "$text" | awk '($1 == "timestamp") { print $3 }' | sort -n ) )

    echo "${timestamps[-1]}"
}

# Print the first unused "<base>-<n>.img" name.
get_output_filename() {
    local base="$1"
    local n=0

    while [ -f "${base}-${n}.img" ] ; do
        warn "${base}-${n}.img exists."
        n=$(( n + 1 ))
    done

    echo "${base}-${n}.img"
}

if [ "$#" -lt 1 ] ; then
    usage
    exit 1
fi

cd /volumes/recovery/disks

for device in "$@" ; do
    if [ ! -e "${device}1" ] ; then
        die "Can't find first partition for device $device"
    fi

    zdb=$( zdb -l -u "${device}1" )

    pool=$( get_pool "$zdb" )
    guid=$( get_guid "$zdb" )
    timestamp=$( get_timestamp "$zdb" )

    echo "Recovering $guid from pool $pool last updated $( date --date="@$timestamp" )..."

    mkdir -pv "${pool}/${guid}"

    filename=$( get_output_filename "${pool}/${guid}/${timestamp}" )

    # The lines that set these two names were lost; the suffixes are assumed.
    logfile="${filename%.img}.log"
    badfile="${filename%.img}.bad"

    echo "Cloning $device to $filename (logging to $logfile)..."

    dd_rescue "${dd_args[@]}" -l "$logfile" -o "$badfile" "$device" "$filename"
done

The script uses zdb to get metadata from the drive, then uses dd_rescue to clone the drive to a file.
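To make the metadata step a little more concrete, here is a minimal sketch of the parsing the script relies on. The sample text only mimics the general shape of zdb -l -u output; the pool name, GUIDs, and timestamps are invented, and the helper names label_field and newest_timestamp are mine rather than part of the script.

```shell
#!/bin/bash
# Hypothetical sample shaped like `zdb -l -u` output; every value is invented.
sample_label="    name: 'tank'
    pool_guid: 1111111111
    guid: 2222222222
    Uberblock[0]
        timestamp = 1470000000 UTC
    Uberblock[1]
        timestamp = 1470000300 UTC"

# Print the first value whose field name is exactly "$1:", with quotes removed.
label_field() {
    local key="$1" text="$2" value
    value=$( echo "$text" | awk -v key="$key:" '($1 == key) { print $2; exit }' )
    echo "${value//\'/}"
}

# Print the largest uberblock timestamp in the text.
newest_timestamp() {
    echo "$1" | awk '($1 == "timestamp") { print $3 }' | sort -n | tail -n 1
}

label_field name "$sample_label"
label_field guid "$sample_label"
newest_timestamp "$sample_label"
```

Because the match is on the whole first field, "guid:" does not pick up the "pool_guid:" line, so the name of each image is tied to the individual disk rather than to the pool.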

Once that finished, I made a snapshot of the entire filesystem (with zfs snapshot), mapped the files to block devices (with losetup), and activated partitions on the loopback devices (with kpartx). Then I was able to import the pool (with zpool import) and find and fix all the errors (with zpool scrub).

Very roughly, the commands I used went something like this:

  • zpool create -f -m /volumes/recovery -o ashift=12 recovery raidz /dev/disk/by-id/ata-TOSHIBA_!(*-part[0-9])
  • zfs create -o compress=zstd-fast recovery/disks
  • Insert the drives and run the script above against each one.
  • zfs snapshot recovery/disks@$( date +%Y%m%d%H%M%S )
  • for file in /volumes/recovery/disks/*/*/*.img ; do losetup -f -v "$file" ; done
  • for loop in $( losetup -a | awk -F: '{print $1}' ) ; do kpartx -a "$loop" ; done
  • zpool import -d /dev/disk/by-id -f pool_name
  • zpool scrub pool_name
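As a possible variation on the losetup and kpartx steps above, util-linux's losetup can ask the kernel to scan for partitions itself with --partscan, which may make kpartx unnecessary. I haven't verified this against the recovery described here, so treat it as a sketch. The function name attach_images and the RUN variable are hypothetical; by default the commands are only printed.

```shell
#!/bin/bash
# Attach each image as a loop device with partition scanning enabled.
# RUN defaults to "echo", so the commands are printed rather than executed.
# Set RUN= (empty) to actually run them, which requires root.
attach_images() {
    local run="${RUN-echo}"
    local image
    for image in "$@" ; do
        $run losetup --find --show --partscan "$image"
    done
}

# Hypothetical image path following the pool/guid/timestamp layout.
attach_images /volumes/recovery/disks/pool_name/2222222222/1470000000-0.img
```

When run for real, --show prints the loop device that was allocated, and the partitions should appear next to it with a p1, p2, ... suffix.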

Now I just need to find enough space to rsync or zfs send | zfs receive all that data. 😀

Tuesday, February 1, 2022

Video from OLF 2021

I had two talks at OLF in December. I just noticed that videos are up on YouTube for both of them.

I Like GitLab... and So Should You

Infrastructure Prototyping with Bolt and Vagrant

Tuesday, April 20, 2021

Dealing with old ssh implementations

Over the last several releases, Fedora has removed support for old, broken crypto algorithms.  Unfortunately, this makes it harder to deal with old devices or servers that can't easily be upgraded.  For example, I have a switch that I can't connect to with the ssh on Fedora.

I can connect to it fine with the ssh on CentOS 7 though...  podman/docker to the rescue!


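#!/bin/bash

get_container_runtime() {
    if [ -n "$CONTAINER_RUNTIME" ] ; then
        echo "$CONTAINER_RUNTIME"
        return
    fi

    podman=$( type -p podman )
    if [ -n "$podman" ] ; then
        echo "$podman"
        return
    fi

    docker=$( type -p docker )
    if [ -n "$docker" ] ; then
        echo "$docker"
        return
    fi

    echo 'No container runtime found.' >&2
    exit 1
}

set -e

# The image name was lost from the post; centos:7 is an assumption.
container="centos:7"
container_runtime=$( get_container_runtime )

ssh_cmd=$( mktemp /tmp/ssh.XXXXXX )
chmod 700 "$ssh_cmd"

trap "rm -fv $ssh_cmd" EXIT

cat > "$ssh_cmd" <<END
#!/bin/bash
set -e
yum -y install /usr/bin/ssh
ssh $@
END

# The -it and --rm flags are assumed; only the two -v lines survived.
run_args=(
    -it
    --rm
    -v "$HOME/.ssh:/root/.ssh"
    -v "$ssh_cmd:$ssh_cmd"
)

# Assumed reconstruction: pass the ssh-agent socket through if there is one.
if [ -n "$SSH_AUTH_SOCK" ] ; then
    run_args+=(
        -v "$SSH_AUTH_SOCK:$SSH_AUTH_SOCK"
        -e "SSH_AUTH_SOCK=$SSH_AUTH_SOCK"
    )
fi

"$container_runtime" run "${run_args[@]}" \
    "$container" \
    "$ssh_cmd"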

The script accepts all of the arguments that the container's ssh accepts (because it blindly passes them along). It automatically maps your .ssh directory and your ssh-agent socket. YMMV, but I've tested it on Fedora with podman and a Mac with docker.
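One caveat of passing arguments along blindly: because ssh $@ is expanded while the here-document is written, an argument that contains spaces is split into several words by the time the container runs it. A minimal sketch of one way around that, using bash's printf %q to re-quote each argument, follows. The function name write_wrapper and the host and command in the example are hypothetical.

```shell
#!/bin/bash
# Write a wrapper script whose ssh arguments are shell-quoted with printf %q,
# so an argument such as "show version" survives as a single word.
write_wrapper() {
    local out="$1"
    shift
    {
        echo 'set -e'
        printf 'ssh'
        printf ' %q' "$@"
        printf '\n'
    } > "$out"
}

tmp=$( mktemp )
write_wrapper "$tmp" -p 2222 admin@switch "show version"
cat "$tmp"
rm -f "$tmp"
```

Note that with no arguments at all, printf ' %q' still emits an empty quoted word, so a real script would want to handle that case separately.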

Monday, August 24, 2020

Vagrant + libvirt on CentOS 7

I recently needed to set up vagrant-libvirt on a CentOS 7 VM.  After finding a lot of outdated guides, I decided to write my own and post it on my work blog.

Saturday, April 8, 2017

Delegating domain join privileges in Samba 4 from the command line (or not)

I'm trying to solve a bit of a mystery. I'd like to set up Samba 4 without using Windows. Most things seem to be possible, but I can't figure out how to delegate domain join privileges. Unfortunately, even the official documentation specifically references ADUC.

So I did some digging into what it would take to delegate domain join privileges without a Windows system. After several dead ends, I ran across this page:

The important bit of that page is this script that uses the Windows command-line tool dsacls:

$user = 'gps\SCCM Client Computer Joiners'
$ou = 'OU=SCCM Test Clients,OU=SCCM,OU=Service,OU=Company,DC=gopas,DC=virtual'

DSACLS $ou /R $user

DSACLS $ou /I:S /G "$($user):GR;;computer"
DSACLS $ou /I:S /G "$($user):CA;Reset Password;computer"
DSACLS $ou /I:S /G "$($user):WP;pwdLastSet;computer"
DSACLS $ou /I:S /G "$($user):WP;Logon Information;computer"
DSACLS $ou /I:S /G "$($user):WP;description;computer"
DSACLS $ou /I:S /G "$($user):WP;displayName;computer"
DSACLS $ou /I:S /G "$($user):WP;sAMAccountName;computer"
DSACLS $ou /I:S /G "$($user):WP;DNS Host Name Attributes;computer"
DSACLS $ou /I:S /G "$($user):WP;Account Restrictions;computer"
DSACLS $ou /I:S /G "$($user):WP;servicePrincipalName;computer"
DSACLS $ou /I:S /G "$($user):CC;computer;organizationalUnit"

samba-tool has a subcommand dsacl set that I thought might be able to accomplish the same task. After a lot of work trying to get the arguments correct, I got to this point:
[root@dc1 ~]# samba-tool dsacl set --action=allow --objectdn='cn=Computers,dc=samba4,dc=local' --trusteedn='cn=Domain Join,cn=Users,dc=samba4,dc=local' --sddl='GR;;computer' --realm=SAMBA4.LOCAL -U administrator --password="$( cat /root/.password )"
new descriptor for cn=Computers,dc=samba4,dc=local:
ERROR(<type 'exceptions.TypeError'>): uncaught exception - Unable to parse SDDL
  File "/usr/lib64/python2.7/site-packages/samba/netcmd/", line 176, in _run
    return*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/samba/netcmd/", line 174, in run
    self.add_ace(samdb, objectdn, new_ace)
  File "/usr/lib64/python2.7/site-packages/samba/netcmd/", line 129, in add_ace
    desc = security.descriptor.from_sddl(desc_sddl, self.get_domain_sid(samdb))
So... I think the arguments to dsacls are some kind of "friendly" names that resolve to UUIDs or SIDs or something on the back end, but I can't figure out how to do the mapping.
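For what it's worth, the parse error is at least consistent with 'GR;;computer' not being SDDL at all. An SDDL ACE string has six semicolon-separated fields in parentheses: type, flags, rights, object GUID, inherited object GUID, and trustee SID. Below is a minimal sketch that assembles such a string. The helper name sddl_ace is mine, the trustee SID is a placeholder, and the GUID is the schemaIDGUID commonly documented for the computer class, which should be checked against your own schema. I have not verified that samba-tool accepts this exact ACE.

```shell
#!/bin/bash
# Assemble one SDDL ACE string:
#   (type;flags;rights;object_guid;inherit_object_guid;trustee_sid)
sddl_ace() {
    local type="$1" flags="$2" rights="$3"
    local object_guid="$4" inherit_guid="$5" sid="$6"
    echo "(${type};${flags};${rights};${object_guid};${inherit_guid};${sid})"
}

# schemaIDGUID commonly documented for the "computer" class (verify locally).
computer_guid='bf967a86-0de6-11d0-a285-00aa003049e2'

# Placeholder SID for the delegated group; look up the real one in your domain.
trustee_sid='S-1-5-21-1111111111-2222222222-3333333333-1105'

# Rough analogue of dsacls "GR;;computer": an object ACE granting generic
# read, inherited only by computer objects below the container.
sddl_ace OA CIIO GR '' "$computer_guid" "$trustee_sid"
```

If that reading is right, the "friendly" names in the dsacls arguments would each need to be translated into the matching GUID before they could be passed to --sddl.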

Suggestions welcome.

Saturday, October 27, 2012

Health update

I don't often post anything personal (or really anything at all, for that matter), but I'm going to make an exception today.

Today marks 9 months since I started working on losing weight and generally improving my health.  At the time, I weighed somewhere around twice what I should (maybe more), and I hadn't been at a healthy weight for nearly 20 years.  The scary thing is that I really didn't see myself as that heavy, but obviously my self-image didn't match reality in the slightest...  As a co-worker put it, you don't get to that size without a healthy dose of denial.

It took a weight loss competition organized by another co-worker to get me motivated, but I decided in January that I was going to lose weight and get back to some level of physical fitness.  I've had a lot of success, as anyone who knows me can tell, so I get asked a lot how I lost all the weight.  The short answer that I give is that I lost it the old-fashioned way - diet and exercise.  There is a much longer answer though, so bear with me...

Diet was definitely the biggest change for me.  I've always eaten way too much food.  It wasn't necessarily all bad food, although it often was, but the sheer quantity was what got me in trouble.  I decided to do three things to address that.  First of all (and most importantly), I finally started logging what I was eating (as my wife had been trying to get me to do for years).  I found the Lose It! app, which made this painless.  It was absolutely invaluable since it let me see what foods were OK to eat and which weren't (and to see just how bad those things were).

Second, I tried to eat more of the right foods, like lean proteins and vegetables.  I started to avoid sugar, starches (no pasta, bread, or rice), and high-fat foods (with a few exceptions like almonds, which became one of my favorite snacks).  Conveniently, since I was watching my calorie intake, the things I was trying to eat are low-calorie, which meant I didn't have to starve myself at all.

The third diet change that I made was to start snacking through the day, usually eating something every couple of hours.  This was the weirdest part, focusing on eating regularly and often in order to lose weight, and it was odd never really being full, but at the same time I never really got hungry enough to have impulse control issues.

Note that when I say "diet", I'm trying to avoid the connotation that the world normally holds.  I never meant for any of this to be a short-term change in my eating, but rather I considered this to be a lifestyle change.  I have no intention of going back to anything resembling my old diet, no matter what shape I'm in or how active I am.

Speaking of activity, I struggled a bit to find exercise that I was physically capable of doing for any length of time, without hurting myself.  My friend Artie (who had recently lost a large amount of weight himself, and who was my biggest inspiration for putting in all this effort) worked for a while to convince me to go out for short walks with him.  With my bad knees, walking was extremely uncomfortable.  Eventually I gave in though, and we started walking as often as possible.  At first, a 15-minute walk would nearly kill me.  I kept walking as often as I could though, either at lunch, in the afternoons just to clear my head, with my family in the evenings, you name it.  By May, I walked a 5K with Artie (in just over 50 minutes).  It was looking like I would be able to run a 5K this past month, but unfortunately an injury slowed me down just enough that I wasn't able to.

Somewhere early on, I started riding our stationary recumbent bike (which had sat in our house, collecting dust for around 5 years).  At first, I was lucky to do 5-10 minutes.  After a few weeks, I recall doing an hour, non-stop, and feeling like I wasn't going to be able to walk afterwards.  At some point around then, I started riding my real bike and found that I couldn't climb a hill.  I kept working on it though, and eventually I was able to ride 10 miles, 15 miles, 25 miles, 33 miles, and ultimately 50 miles.  (At some point in the near future, I'd like to try to ride 100 miles, but that's a pretty massive time commitment.)

The most rewarding part of this entire experience has been the lifestyle change that my entire family has gone through.  It's one thing for everyone to diet together, but that's not what we've done.  We're all eating differently, cooking together, and finding ways to be active together.  My wife Kara has been incredibly patient and understanding, even when I've been overly single-minded about trying to hit whatever goal I had on any given day.  She has been on-board since the beginning, and has also managed to lose a significant amount of weight.  (I'll leave it to her to give details.)  All of the changes have been great for our daughter Emma too, who is in better shape now than she has been at any other point in her life.  I know a lot of people who try to lose weight on their own, and I'm sure it can be done, but I certainly wouldn't recommend it.

I'm fortunate to have a great support system.  I mentioned Artie before (thanks, Artie!), but I also have to thank Mike for pushing me to do more, go a little faster, or go a little farther.  There are many others (yes, I'm looking at you, Emma) who have helped, and I apologize for not naming every one of you, but I do appreciate all of the support.

As of this morning, I have lost over 36% of the weight I was carrying at the end of January.  I need to get to 50%, give or take, so I still have quite a bit to lose, but I have complete confidence that it will come off over the next few months.  I have had to replace my wardrobe multiple times now (I'm already wearing shirts 4 sizes smaller than I was wearing when I started), so I'm perfectly OK with the loss leveling off for a while.  :-)