SteveCo: February 2024

This is part 2 of a multi-part series. See part 1 for the beginning of the series.

Note that this is material from 2010 and earlier that pre-dates the common availability of solid state drives.

Detecting failures

Mechanical failures

Mechanical drive failure is nearly always accompanied by some sort of audible noise. One common sound heard from failing hard drives is the so-called "Click of Death", a sound similar to a watch ticking (but much louder). This can have various causes, but it is commonly caused by the read/write head inside a drive being stuck or possibly trying to repeatedly read a failing block.

Another common noise is a very high-pitched whine. This is caused by bearings in a drive failing (most likely rubbing metal-on-metal), usually as a result of old age. Anything that moves inside a computer (fans, for example) can make a noise like this, so always check a suspect drive away from other sources of noise to verify that the sound is indeed coming from the drive.

Drive motors failing and head crashes can cause other distinctive noises. As a rule, any noise coming from a hard drive that does not seem normal is probably an indicator of imminent failure.

Electronic failures

Failing electronics can cause a drive to act flaky, fail to be detected, or occasionally catch fire.

Hard drives have electronics on the inside of the drive which are inaccessible without destroying the drive (unless you happen to have a clean room). Unfortunately, if those fail, there isn't much you can do.

The external electronics on a hard drive are usually a small circuit board that contains the interface connector and is held onto the drive with a few screws. In many cases, multiple versions of a drive (IDE, SATA, SCSI, SAS, etc.) exist with different controller interface boards. Generally speaking, it is possible to transplant the external electronics from a good drive onto a drive with failing electronics in order to get data off the failing drive. Usually the controller board will need to come from an identical drive with a similar manufacturing date.

Dealing with physical failures

In addition to drive electronics transplanting, just about any trick you've heard of (freezing, spinning, smacking, etc.) has probably worked for someone, sometime. Whether any of these tricks work for you is a matter of trial and error. Just be careful.

Freezing drives seems to be especially effective. Unfortunately, as soon as a drive is operating, it will tend to heat up quickly, so some care needs to be taken to keep drives cool without letting them get wet from condensation.

Swapping electronics often works when faced with electronic failure, but only when the donor drive exactly matches the failed drive.

Freezing drives often helps in cases of crashed heads and electronic problems. Sometimes they will need help to stay cold (ice packs, freeze spray, etc.), but often once they start spinning, they'll stay spinning. Turning a drive on its side sometimes helps with physical problems as well.

Unfortunately, we do have to get a drive to spin for any software data recovery techniques to work.
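A quick way to see whether a drive is responding well enough for software tools to have a chance is to try reading its first sector. The following is only a minimal sketch: responds_to_reads is a made-up helper name and /dev/sdX is a placeholder for the suspect drive. A successful read of one sector says nothing about the rest of the disk, and a sector that was read earlier may be answered from the kernel's cache.

```shell
#!/bin/sh
# Hypothetical helper: succeed if one 512-byte sector can be read from
# the start of the given device (or file).
responds_to_reads() {
    dd if="$1" of=/dev/null bs=512 count=1 2>/dev/null
}

# Example usage (as root); /dev/sdX is a placeholder, not a real name:
#   responds_to_reads /dev/sdX && echo "drive answers reads"
```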

To be continued in part 3.

This is material from a class I taught a long time ago. Some of it may still be useful. 🙂

The original copyright notice:

This work is licensed under the Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.

This is part 1 of a multi-part series.

Identifying drives

An easy way to get a list of drives attached to a system is to run fdisk -l. The output will look something like this:

# fdisk -l

Disk /dev/sda: 80.0 GB, 80026361856 bytes
255 heads, 63 sectors/track, 9729 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0xcab10bee

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1        8673    69665841    7  HPFS/NTFS
/dev/sda2            8675        9729     8474287+   c  W95 FAT32 (LBA)

In many cases, you'll see a lot of (generally) uninteresting devices that are named /dev/dm-n. These are devices created by device mapper for everything from software RAID to LVM logical volumes. If you are primarily interested in the physical drives attached to a system, you can suppress the extra output of fdisk -l with a little bit of sed. Try the following:

fdisk -l 2>&1 | sed '/\/dev\/dm-/,/^$/d' | uniq

Whole devices generally show up as /dev/sdx (/dev/sda, /dev/sdb, etc.) or /dev/hdx (/dev/hda, /dev/hdb, etc.). Partitions on the individual devices show up as /dev/sdxn (/dev/sda1, /dev/sda2, etc.), or, in the case of longer device names, the name of the device with pn appended (an example might be /dev/mapper/loop0p1).
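The naming rules above can be sketched as a small shell function that splits a partition name into its parent device and partition number. This is an illustration only: split_partition is a made-up name, and it assumes it is handed a partition name. It cannot tell a whole device whose name ends in a digit (such as /dev/mapper/loop0) from a partition.

```shell
#!/bin/sh
# Hypothetical helper: split a partition device name into parent device
# and partition number, following the two conventions described above.
split_partition() {
    part=$1
    # Strip the trailing digits; what was stripped is the partition number.
    base=$(printf '%s\n' "$part" | sed 's/[0-9]*$//')
    num=${part#"$base"}
    if [ -z "$num" ]; then
        echo "not a partition name: $part" >&2
        return 1
    fi
    # If the remainder ends in <digit>p, the "p" is only a separator.
    case $base in
        *[0-9]p) base=${base%p} ;;
    esac
    printf '%s %s\n' "$base" "$num"
}

split_partition /dev/sda1            # prints: /dev/sda 1
split_partition /dev/mapper/loop0p1  # prints: /dev/mapper/loop0 1
```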

Hardware

PATA/SATA

The vast majority of hard drives currently in use connect to a computer using either an IDE (or Parallel ATA) interface or a SATA (Serial ATA) interface. For the most part, SATA is just IDE with a different connector, but when SATA came out, the old Linux IDE driver had accumulated enough cruft that a new SATA driver (libata) was developed to support SATA controller chipsets. Later, the libata driver had support for most IDE controllers added, obsoleting the old IDE driver.

There are some differences in the two drivers, and often those differences directly impact data recovery. One difference is device naming. The old IDE driver named devices /dev/hdx, where x is determined by the position of the drive.

/dev/hda    Master device, primary controller
/dev/hdb    Slave device, primary controller
/dev/hdc    Master device, secondary controller
/dev/hdd    Slave device, secondary controller

And so on.
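The pattern in the list above (two devices per controller, master first) can be written out as a short function. This is a sketch with a made-up name, ide_position, and it assumes the pattern simply continues for controllers beyond the second, which it numbers from zero.

```shell
#!/bin/sh
# Hypothetical helper: describe where an old-IDE-driver device sits,
# given a name such as hda or /dev/hdc.
ide_position() {
    letter=${1#/dev/}
    letter=${letter#hd}
    case $letter in
        [a-z]) ;;
        *) echo "not an hdX name: $1" >&2; return 1 ;;
    esac
    # ASCII code of the letter, so that a -> 0, b -> 1, and so on.
    code=$(printf '%d' "'$letter")
    index=$((code - 97))
    if [ $((index % 2)) -eq 0 ]; then role=Master; else role=Slave; fi
    case $((index / 2)) in
        0) ctrl="primary controller" ;;
        1) ctrl="secondary controller" ;;
        *) ctrl="controller $((index / 2)) (counting from 0)" ;;
    esac
    printf '%s device, %s\n' "$role" "$ctrl"
}

ide_position /dev/hdc   # prints: Master device, secondary controller
```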

Unlike the IDE driver, the libata driver uses what was historically SCSI device naming, /dev/sdx, where x starts at "a" and increments as devices are detected. Device names therefore depend on detection order rather than on the physical position of the drive, and are not guaranteed to be consistent across reboots.
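On systems where udev populates /dev/disk/by-id, the symbolic links there are built from model and serial number, so they can help confirm which /dev/sdx name a particular drive received on this boot. The sketch below just prints each link and its target. The function name list_persistent_names is made up, and it assumes a readlink that supports -f (as GNU coreutils does).

```shell
#!/bin/sh
# Hypothetical helper: print "link name -> device" for every symlink in
# a directory (by default /dev/disk/by-id, if udev has created it).
list_persistent_names() {
    dir=${1:-/dev/disk/by-id}
    for link in "$dir"/*; do
        # Skip anything that is not a symlink (including an unmatched glob).
        [ -L "$link" ] || continue
        printf '%s -> %s\n' "${link##*/}" "$(readlink -f "$link")"
    done
}

list_persistent_names
```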

The other major difference between the old IDE driver and the libata driver that affects data recovery is how the drivers handle DMA (direct memory access). The ATA specification allows for various PIO (Programmed I/O) and DMA modes. Both the old IDE driver and the libata driver will determine the best mode, in most cases choosing a DMA mode initially, and falling back to a PIO mode in error conditions. The old IDE driver would also let you manually toggle DMA off and on for any device using the command hdparm.

hdparm -d /dev/hdx     # Query DMA on/off state for /dev/hdx
hdparm -d0 /dev/hdx    # Disable DMA on /dev/hdx
hdparm -d1 /dev/hdx    # Enable DMA on /dev/hdx

The libata driver currently lacks the ability to toggle DMA on a running system, but it can be turned off for all hard drives with the kernel command line option libata.dma=6, or for all devices (including optical drives) with libata.dma=0. On a running system, the value of libata.dma can be found in /sys/module/libata/parameters/dma. (The full list of numeric values for this option can be found in http://www.kernel.org/doc/Documentation/kernel-parameters.txt.) There does not appear to be a way to toggle DMA per device with the libata driver.
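The libata.dma value is a bitmask, which is why 6 turns DMA off for hard drives only. The sketch below decodes a value, assuming the bit meanings listed in kernel-parameters.txt (1 = ATA disks, 2 = ATAPI devices such as optical drives, 4 = CompactFlash). The function name decode_libata_dma is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: show which device classes have DMA enabled for a
# given libata.dma value. Bit meanings are taken from
# kernel-parameters.txt: 1 = disk, 2 = ATAPI, 4 = CompactFlash.
decode_libata_dma() {
    value=$1
    out=
    for pair in 1:disk 2:atapi 4:cf; do
        bit=${pair%%:*}
        name=${pair#*:}
        if [ $((value & bit)) -ne 0 ]; then state=on; else state=off; fi
        out="$out$name=$state "
    done
    printf '%s\n' "${out% }"
}

decode_libata_dma 6   # prints: disk=off atapi=on cf=on

# On a running system with libata loaded, decode the current setting.
param=/sys/module/libata/parameters/dma
if [ -r "$param" ]; then
    decode_libata_dma "$(cat "$param")"
fi
```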

There are several reasons why you might want to toggle DMA on or off for a drive. In some cases, failing drives simply won't work unless DMA is disabled, or even in some rare cases might not work unless DMA is enabled. In some cases the computer might have issues when reading from a failing drive with DMA enabled. (The libata driver usually handles these situations fairly well. The old IDE driver only began to handle these situations well in recent years.)

In addition to those reasons, PIO mode forces a drive to a maximum speed of 25MB/s (PIO Mode 6, others are even slower), while DMA modes can go up to 133MB/s. Some drives appear to work better at these lower speeds.
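To put those speeds in perspective, here is a rough calculation of the best-case time to read the 80 GB drive from the fdisk example at each rate. It assumes decimal megabytes and the interface maximums quoted above. A failing drive will be much slower, so treat the results as lower bounds. The function name transfer_seconds is made up.

```shell
#!/bin/sh
# Hypothetical helper: whole seconds needed to move a number of bytes at
# a given rate in MB/s (1 MB taken as 1,000,000 bytes).
transfer_seconds() {
    bytes=$1
    mb_per_s=$2
    echo $((bytes / (mb_per_s * 1000000)))
}

transfer_seconds 80026361856 25    # prints 3201 (about 53 minutes)
transfer_seconds 80026361856 133   # prints 601 (about 10 minutes)
```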

SCSI

While SCSI drives and controllers are less common than they once were, all current hard drive controller interfaces now use the kernel SCSI device layers for device management. For example, all devices that use the SCSI layer will show up in /proc/scsi/scsi.

# cat /proc/scsi/scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: TSSTcorp Model: CD/DVDW TS-L632D Rev: AS05
  Type:   CD-ROM                           ANSI SCSI revision: 05
Host: scsi1 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: ST9160821A       Rev: 3.AL
  Type:   Direct-Access                    ANSI SCSI revision: 05
Host: scsi3 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: WDC WD10EACS-00Z Rev: 01.0
  Type:   Direct-Access                    ANSI SCSI revision: 05

In most cases, it is safe to remove a device that isn't currently mounted, but to be absolutely sure it is safe, you can also explicitly tell the kernel to disable a device by writing to /proc/scsi/scsi. For example, to remove the third device (the Western Digital drive in this example), you could do the following:

echo scsi remove-single-device 3 0 0 0 > /proc/scsi/scsi

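Note that the four numbers correspond to the host (controller) number, channel, ID, and LUN shown for that device in /proc/scsi/scsi. In this example the Western Digital drive is on host scsi3, so the first number is 3.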

In cases where hot-added devices don't automatically show up, there is also a corresponding add-single-device command.
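Picking the four numbers out of /proc/scsi/scsi by eye is error-prone, so here is a sketch that does it with awk and prints (but does not run) the matching remove-single-device command for each device. The function names scsi_devices and removal_commands are made up, and the parsing assumes the layout shown in the example above.

```shell
#!/bin/sh
# Hypothetical helper: print "host channel id lun model" for each device
# listed in /proc/scsi/scsi (or in a saved copy given as an argument).
scsi_devices() {
    awk '
        /^Host:/ {
            host = $2; sub(/^scsi/, "", host)
            chan = $4 + 0; id = $6 + 0; lun = $8 + 0
        }
        /Vendor:/ {
            model = $0
            sub(/.*Model: */, "", model)   # drop everything up to the model
            sub(/ *Rev:.*/, "", model)     # drop the revision field
            print host, chan, id, lun, model
        }
    ' "${1:-/proc/scsi/scsi}"
}

# Print the command that would remove each device; nothing is executed.
removal_commands() {
    scsi_devices "$@" | while read -r host chan id lun model; do
        printf 'echo scsi remove-single-device %s %s %s %s > /proc/scsi/scsi   # %s\n' \
            "$host" "$chan" "$id" "$lun" "$model"
    done
}

if [ -r /proc/scsi/scsi ]; then
    removal_commands
fi
```

Review the printed lines and run only the one for the drive you actually want to detach.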

When recovering data from SCSI drives (and SCSI-like drives such as SAS), there are no special tricks comparable to toggling DMA.

USB, etc.

The Linux USB drivers are rather resilient in the face of errors, so no special consideration needs to be given when recovering data from thumb drives and other flash memory (except that these devices tend to work or not, and, of course, dead shorts across USB ports are a Bad Thing). USB-to-ATA bridge devices are a different matter entirely though. They tend to lock up hard or otherwise behave badly when they hit errors on a failing drive. Generally speaking, they should be avoided for failing drives, but drives that are OK other than a trashed filesystem or partition table should be completely fine on a USB-to-ATA bridge device.

To be continued in part 2.

SteveCo

Saturday, February 24, 2024

Drive Failures - Data Recovery with Open-Source Tools (part 2)