Wednesday, May 17, 2023

libvirt surprise

I just noticed that some of my libvirt VMs had on_crash set to destroy instead of restart. It looks like there is an easy fix:

for vm in $( virsh list --name ) ; do virt-xml "$vm" --edit --events on_crash=restart ; done

I don't know if something changed in virt-manager/virt-install over the years, or if I ran into this a long time ago and forgot about it.

Now I just need to remember to add that --events option to virt-install in the future... 🙂
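For future reference, an invocation with that option in place might look something like this (hypothetical sketch; everything except the --events option is a placeholder):

```shell
# Hypothetical example; the name, memory, disk, and ISO path are
# placeholders -- only the --events option is the point here.
virt-install \
    --name example-vm \
    --memory 2048 \
    --vcpus 2 \
    --disk size=20 \
    --cdrom /path/to/install.iso \
    --events on_crash=restart
```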

Tuesday, May 9, 2023

Harvester HCI

I have been using libvirt on CentOS + ZFS for my home lab for somewhere around a decade now.  For the last several years, I have been trying off and on to switch to some kind of hyperconverged infrastructure, usually oVirt + a clustered storage solution (Ceph, Gluster).  For various reasons, I've never quite managed to get all the pieces to work together correctly.

So, imagine how happy I was to hear about Harvester a while back.

Harvester is a modern Hyperconverged infrastructure (HCI) solution built for bare metal servers using enterprise-grade open source technologies including Kubernetes, Kubevirt and Longhorn.

I love everything about this!  It's a more modern take on hyperconverged infrastructure than what I was trying to assemble.  Plus, all my problems magically disappear when all the pieces work together out of the box, right?

Well...  Not quite.  I installed version 0.3.0.  It made for a cool demo, but, thanks to stability problems and a whole lot of missing features, it wasn't quite ready for anything resembling production use.  (Granted, this is my home lab, but I still run virtualized firewalls and stuff like that on it, so I need it to work, and work reliably.)

I'll note here that I wrote all of the above over a year ago.  I then closed with a list of reasons why Harvester wasn't good enough for me to actually use it at the time.  It (unintentionally!) sounded negative enough that I decided not to publish the post.

So here we are a year or so later, and after a few more failed attempts I recently tried Harvester again, this time with version 1.1.1.  Everything I need to work seems to, and I'm ready to start migrating some real workloads!

That's not to say that everything is perfect.  There are a few useful features on the roadmap that I could benefit from (like anti-affinity rules, zero-downtime upgrades, ...), and I still have some challenges.  Some examples:

  • Automating node installation is ... let's say difficult?
  • Networking is almost as functional as I want it, but I still haven't been able to figure out how to move storage replication to a network with jumbo frames.
  • I want real certs.  I see how to manually manage the certs, but it's not immediately obvious how I could manage them automatically (for Let's Encrypt).

Thankfully none of those things are keeping me from using Harvester.  They're just things to look forward to in future upgrades. 😀

Saturday, September 10, 2022

Migrating libvirt VMs

I recently moved a bunch of libvirt VMs from a CentOS 7 host to a CentOS Stream 9 host. Normally moving virtual machines from one libvirt host to another is pretty easy. All you need to do is stop the VM on the original host, copy the disk image from host to host (with rsync or whatever is convenient), dump the VM config (with virsh dumpxml guest), and import that config on the new host (with virsh define). It turns out a few things have changed that make that not quite work though...
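Those "normal" steps can be sketched in a few commands. This is only an illustration: the hostname and image path are placeholders, and DRY_RUN defaults to on so the sketch prints each step instead of running it.

```shell
#!/bin/bash
# Sketch of the usual libvirt migration steps. oldhost.example.com and
# the image path are placeholders. DRY_RUN defaults to on, so each
# command is printed rather than executed; set DRY_RUN= to run for real.
set -e

guest="${GUEST:-myguest}"
src="root@oldhost.example.com"
images=/var/lib/libvirt/images
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ -n "$DRY_RUN" ] ; then
        echo "+ $*"
    else
        "$@"
    fi
}

# 1. Stop the VM on the original host.
run ssh "$src" virsh shutdown "$guest"

# 2. Copy the disk image to the new host (run this on the new host).
run rsync -avP "$src:$images/$guest.qcow2" "$images/"

# 3. Dump the config from the old host and import it on the new one.
run sh -c "ssh $src virsh dumpxml $guest > /tmp/$guest.xml"
run virsh define "/tmp/$guest.xml"
```

In my case, it was the dumped XML that needed extra massaging before it would define cleanly.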

The first thing (and the easiest to fix) was that a lot of old machine types that worked in CentOS 7's libvirt no longer work. The easy answer is to switch to the generic pc machine type.

The harder one to deal with was that Spice support was dropped. This meant switching the graphics to VNC, using virtio for the virtual video hardware, and removing all the other Spice-related devices.

The libvirt VM configuration is all XML, so I wrote a script that uses xmlstarlet to make all the necessary changes.


#!/bin/bash

set -e

remote="${REMOTE_HOST:?'Set REMOTE_HOST environment variable'}"

for guest in "$@" ; do
    xml="$( mktemp XXXXXXXX.xml )"
    trap "rm -fv '$xml'" EXIT
    ssh root@"$remote" virsh dumpxml "$guest" | \
        xmlstarlet ed \
            -u '/domain/os/type[starts-with(@machine, "pc-i440fx")]/@machine' -v pc \
            -u '/domain/os/type[starts-with(@machine, "rhel")]/@machine' -v pc \
            -u '/domain/devices/video/model[@type="qxl"]/@type' -v virtio \
            -d '/domain/devices/video/model[@type="virtio"]/@ram' \
            -d '/domain/devices/video/model[@type="virtio"]/@vram' \
            -d '/domain/devices/video/model[@type="virtio"]/@vgamem' \
            -d '/domain/devices/graphics[@type="spice"]/@port' \
            -i '/domain/devices/graphics[@type="spice"]' -t attr -n port -v -1 \
            -u '/domain/devices/graphics[@type="spice"]/@type' -v vnc \
            -d '/domain/devices/channel[@type="spicevmc"]' \
            -d '/domain/devices/redirdev[@type="spicevmc"]' \
        > "$xml"
    virsh define "$xml"
    rm -fv "$xml"
    trap - EXIT
    virsh autostart "$guest"
done

(The above script is also available at https://gist.github.com/silug/8c13cdfa0e50ca6e237333a79594be66.)

Monday, February 7, 2022

Recovering a ZFS array

In August of 2016, I lost a (relatively) large (for me at the time) ZFS array. Rather than tell you how it happened to me, watch how it happened to Linus Tech Tips.

My story is almost identical, except that the array in question was much smaller. To make matters worse, it was mostly cobbled together from old hardware, including the drives, so when the array died, there were a lot of bad drives.

My array started life as 15 750GB drives in a RAID-Z2. As the 750GB drives failed, they were replaced with 1TB drives. Unfortunately, I continued to use a mix of previously-used drives and some Seagate drives that apparently weren't Seagate's best work. The end result was that drives were failing rather often, and due to a lack of time, attention, and a ready supply of spare drives, I wasn't great at replacing them when they failed.

The biggest problem with RAID-5/6 and the RAID-Z equivalents is that rebuilding after a drive failure involves a lot of I/O to all of the drives in the array. RAID-Z2 allows you to lose two drives, but if you lose a third from the stress mid-rebuild, your whole array is toast. In my case, I didn't realize that I had a major problem until the third drive started to fail and ZFS took the array offline. A couple of the remaining drives had SMART errors and likely weren't going to survive a rebuild. I was going to have to clone all of the drives with errors before trying to rebuild. If I wanted a non-destructive fall-back plan, I needed to clone every drive, so if all else failed I could go back to the original array members to try again.

So... I didn't want to buy another 15 1TB drives. Where was I going to find enough disks (or raw space for disk images) to make a copy of the array?

My ultimate answer came almost 5.5 years later as I was working on rebuilding my Harvester cluster (more on that some other time). I had several larger drives in the cluster, so while everything was down and disassembled, I put 5 of them in one system and built a ZFS pool. With around 18TB usable, I had more than enough space to store images of all of the drives in the array!

Enough time had passed that I wasn't sure which drives were which, so I wrote a script to examine the metadata on each drive and then clone the drive to a name based on the pool, GUID, and last modified timestamp:

#!/bin/bash

set -e

dd_args=(
    "-d"
    "-b" "1M"
    "-B" "4k"
    "-w"
    "-A"
)

warn() {
    echo "$@" >&2
}

die() {
    warn "$@"
    exit 1
}

usage() {
    warn "$( basename "$0" ) device [device [...]]"
}

get_first() {
    key="$1"
    text="$2"

    value=$( echo "$text" | awk "(\$1 == \"$key:\") { print \$2; exit 0; }" )

    echo "${value//\'/}"
}

get_pool() {
    get_first "name" "$@"
}

get_guid() {
    get_first "guid" "$@"
}

get_timestamp() {
    text="$1"

    timestamps=( $( echo "$text" | awk '($1 == "timestamp") { print $3 }' | sort -n ) )

    echo "${timestamps[-1]}"
}

get_output_filename() {
    base="$1"

    n=1
    while [ -f "${base}-${n}.img" ] ; do
        warn "${base}-${n}.img exists." 
        (( n++ ))
    done

    echo "${base}-${n}.img"
}

if [ "$#" -lt 1 ] ; then
    usage
    exit 1
fi

cd /volumes/recovery/disks

for device in "$@" ; do
    if [ ! -e "${device}1" ] ; then
        die "Can't find first partition for device $device"
    fi

    zdb=$( zdb -l -u "${device}1" )

    pool=$( get_pool "$zdb" )
    guid=$( get_guid "$zdb" )
    timestamp=$( get_timestamp "$zdb" )

    echo "Recovering $guid from pool $pool last updated $( date --date="@$timestamp" )..."

    mkdir -pv "${pool}/${guid}"

    filename=$( get_output_filename "${pool}/${guid}/${timestamp}" )
    logfile="${filename%.img}.log"
    badfile="${filename%.img}.bad"

    echo "Cloning $device to $filename (logging to $logfile)..."

    dd_rescue "${dd_args[@]}" -l "$logfile" -o "$badfile" "$device" "$filename"
done

The script uses zdb to get metadata from the drive, then uses dd_rescue to clone the drive to a file.
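As a toy illustration of the parsing, here is the script's get_first helper run against a made-up fragment of label output (the sample text is invented and heavily abbreviated; real zdb -l -u output has many more fields):

```shell
# Same helper as in the script above: pull the value for a given key
# out of zdb-style "key: value" output.
get_first() {
    key="$1"
    text="$2"

    value=$( echo "$text" | awk "(\$1 == \"$key:\") { print \$2; exit 0; }" )

    # Strip the single quotes zdb puts around string values.
    echo "${value//\'/}"
}

# Invented sample, shaped like a fragment of "zdb -l -u" label output.
zdb_sample="    version: 5000
    name: 'tank'
    guid: 4418978240664281137"

get_first name "$zdb_sample"   # prints: tank
get_first guid "$zdb_sample"   # prints: 4418978240664281137
```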

Once that finished, I made a snapshot of the entire filesystem (with zfs snapshot), mapped the files to block devices (with losetup), and activated partitions on the loopback devices (with kpartx). Then I was able to import the pool (with zpool import) and find and fix all the errors (with zpool scrub).

Very roughly, the commands I used went something like this:

  • zpool create -f -m /volumes/recovery -o ashift=12 recovery raidz /dev/disk/by-id/ata-TOSHIBA_!(*-part[0-9])
  • zfs create -o compress=zstd-fast recovery/disks
  • Insert the drives and run the script above against each one.
  • zfs snapshot recovery/disks@$( date +%Y%m%d%H%M%S )
  • for file in /volumes/recovery/disks/*/*/*.img ; do losetup -f -v $file; done
  • for loop in $( losetup -a | awk -F: '{print $1}' ) ; do kpartx -a $loop ; done
  • zpool import -d /dev/disk/by-id -f pool_name
  • zpool scrub pool_name

Now I just need to find enough space to rsync or zfs send | zfs receive all that data. 😀

Tuesday, February 1, 2022

Video from OLF 2021

I had two talks at OLF in December. I just noticed that videos are up on YouTube for both of them.

I Like GitLab... and So Should You

Infrastructure Prototyping with Bolt and Vagrant

Tuesday, April 20, 2021

Dealing with old ssh implementations

Over the last several releases, Fedora has removed support for old, broken crypto algorithms.  Unfortunately, this makes it harder to deal with old devices or servers that can't easily be upgraded.  For example, I have a switch that I can't connect to with the ssh on Fedora.

I can connect to it fine with the ssh on CentOS 7 though...  podman/docker to the rescue!

#!/bin/bash

get_container_runtime() {
    if [ -n "$CONTAINER_RUNTIME" ] ; then
        container_runtime=$CONTAINER_RUNTIME
        return
    fi

    podman=$( type -p podman )
    if [ -n "$podman" ] ; then
        container_runtime=$podman
        return
    fi

    docker=$( type -p docker )
    if [ -n "$docker" ] ; then
        container_runtime=$docker
        return
    fi

    echo 'No container runtime found.' >&2
    exit 1
}

get_container_runtime

set -e

container=${CONTAINER:-"centos:7"}

ssh_cmd=$( mktemp /tmp/ssh.XXXXXX )
chmod 700 "$ssh_cmd"

trap 'rm -fv "$ssh_cmd"' EXIT

cat > "$ssh_cmd" <<END
#!/bin/sh
set -e
yum -y install /usr/bin/ssh
ssh $@
END

run_args=(
    -it
    --rm
    -v "$HOME/.ssh:/root/.ssh"
    -v "$ssh_cmd:$ssh_cmd"
)

if [ -n "$SSH_AUTH_SOCK" ] ; then
    run_args+=(
        -e=SSH_AUTH_SOCK
        -v "$SSH_AUTH_SOCK:$SSH_AUTH_SOCK"
    )
fi

$container_runtime run "${run_args[@]}" \
    "$container" \
    "$ssh_cmd"

The script accepts all of the arguments that the container's ssh accepts (because it blindly passes them along). It automatically maps your .ssh directory and your ssh-agent socket. YMMV, but I've tested it on Fedora with podman and a Mac with docker.

Monday, August 24, 2020

Vagrant + libvirt on CentOS 7

I recently needed to set up vagrant-libvirt on a CentOS 7 VM.  After finding a lot of outdated guides, I decided to write my own and post it on my work blog.