Friday, March 29, 2024

RAID - Data Recovery with Open-Source Tools (part 7)

This is part 7 of a multi-part series. See part 1 for the beginning of the series.

Software RAID

It's becoming increasingly common (in 2009) on desktop PCs to use some form of BIOS-based software RAID. In most cases, dealing with a single-drive failure in a software RAID isn't terribly difficult. For example, with NVIDIA's software RAID, even when one drive in a stripe (RAID 0) set fails, if the drive is recoverable, you can simply clone it to a new, identically-sized drive and the RAID will just work. Unfortunately, this isn't so simple with Intel's software RAID, which appears to store the serial numbers of the drives in the RAID metadata, meaning an exact clone won't work. While it would most likely be possible to edit the RAID metadata with hexedit to update the drive information, a somewhat simpler solution is to make a backup clone of the drives in the array, re-create the RAID exactly as it was before in the RAID BIOS, then boot into Linux and run testdisk on the RAID device. More on that in part 8.
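
Before editing metadata by hand, it can help to see what is actually stored. mdadm can decode several BIOS RAID formats, including Intel's (a sketch; support depends on your mdadm version, and the device name is an example):

# Show any RAID metadata mdadm recognizes on a member drive
mdadm --examine /dev/sda
# Summarize all arrays detected across attached drives
mdadm --examine --scan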

Most often the RAID metadata for drives in a software RAID volume is stored toward the end of the drive. In some cases, if you are forced to clone a failing RAID drive to a larger drive, you can make Linux (and maybe the BIOS and Windows) see the drive as a RAID device by copying the last few blocks from the failing drive to the last few blocks of the replacement drive.

# Size of each drive in 1kB blocks (blockdev --getsz reports 512-byte sectors)
old_end=$(( $( blockdev --getsz /dev/sda ) / 2 ))
end=$(( $( blockdev --getsz /dev/sdb ) / 2 ))
# Copy the last 1MB of the failing drive (/dev/sda) to the end of the replacement (/dev/sdb)
dd_rescue -d -D -b 4k -B 4k -s $(( $old_end - 1024 ))k -S $(( $end - 1024 ))k /dev/sda /dev/sdb

Hardware RAID

Unfortunately, the ways that hardware RAID controllers store metadata don't tend to be as predictable as software RAID. If you attach a hardware RAID member drive to a non-RAID controller, some of the tricks mentioned above might work, but there are no guarantees.

Also be aware that hardware RAID controllers are very likely to take a drive offline at the first sign of an error rather than report the error and continue as most non-RAID controllers would. While this makes hardware RAID controllers largely unusable for data recovery, it does mean that a failing RAID member drive was probably dropped from the array early, before suffering much additional damage, and is therefore quite likely to be recoverable.

To be continued in part 8.

Friday, March 22, 2024

Wiping Drives - Data Recovery with Open-Source Tools (part 6)

This is part 6 of a multi-part series.  See part 1 for the beginning of the series.

Wiping drives

To properly wipe a drive so it is effectively unrecoverable, the best solution is to use DBAN. It can be downloaded from https://sourceforge.net/projects/dban/.

Note from 2024: The DBAN project is mostly dead. Currently I would recommend nwipe, which is available in the standard package repositories for a number of Linux distributions, from source at https://github.com/martijnvanbrummelen/nwipe, or on bootable media like SystemRescue.  In fact, SystemRescue has a page in their documentation on this very topic.
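
A full zero-fill wipe of a single drive with nwipe might look something like the following (a sketch; verify the options against nwipe --help on your version, and be very sure of the device name before running it):

nwipe --autonuke --nogui --method=zero /dev/sda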

In many cases, it is sufficient to simply zero out the entire drive. This can be done using dd_rescue.

To zero out /dev/sda, you can use the following command:

dd_rescue -D -b 1M -B 4k -m $(( $( blockdev --getsz /dev/sda ) / 2 ))k /dev/zero /dev/sda

This uses a bit of a shell scripting trick to avoid multiple commands and copy & paste, but it is still fairly simple. The output of blockdev --getsz gives us the size of the device in 512-byte blocks, so we divide that number by 2 to get the size in 1kB blocks, which we pass to the -m option (with a trailing k to denote kB) to specify the maximum amount of data to transfer. Using a default block size of 1MB (-b) with a fallback of 4kB (-B, to match the host page size, which is required for direct I/O) should give us decent throughput.

Note that we're using -D to turn on direct I/O to the destination drive (/dev/sda), but we're not using direct I/O (-d) to read /dev/zero since /dev/zero is a character device that does not support direct I/O.

To just clear the MS-DOS partition table (and boot sector) on /dev/sda, you could do the following:

dd if=/dev/zero of=/dev/sda count=1
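
As a quick sanity check, you can read the sector back; with hexdump, a fully-zeroed sector collapses to a single run of zeros:

dd if=/dev/sda count=1 | hexdump -C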

To be continued in part 7.

Friday, March 15, 2024

Cloning Drives - Data Recovery with Open-Source Tools (part 5)

This is part 5 of a multi-part series. See part 1 for the beginning of the series.

Cloning hard drives with dd_rescue

In cases where a hard drive is failing, often simply cloning the drive is all that is required to recover data. There are many other situations where cloning a drive is important though, such as when attempting to recover from a broken partition table or major filesystem corruption.

The primary tool for cloning drives is called dd_rescue. Running dd_rescue -h or simply dd_rescue with no options will give you a summary of the various command-line options:

dd_rescue Version 1.14, garloff@suse.de, GNU GPL
 ($Id: dd_rescue.c,v 1.59 2007/08/26 13:42:44 garloff Exp $)
dd_rescue copies data from one file (or block device) to another.
USAGE: dd_rescue [options] infile outfile
Options: -s ipos start position in input file (default=0),
	     -S opos start position in output file (def=ipos),
	     -b softbs block size for copy operation (def=65536),
	     -B hardbs fallback block size in case of errs (def=512),
	     -e maxerr exit after maxerr errors (def=0=infinite),
	     -m maxxfer maximum amount of data to be transfered (def=0=inf),
	     -y syncfrq frequency of fsync calls on outfile (def=512*softbs),
	     -l logfile name of a file to log errors and summary to (def=""),
	     -o bbfile name of a file to log bad blocks numbers (def=""),
	     -r reverse direction copy (def=forward),
	     -t truncate output file (def=no),
	     -d/D use O_DIRECT for input/output (def=no),
	     -w abort on Write errors (def=no),
	     -a spArse file writing (def=no),
	     -A Always write blocks, zeroed if err (def=no),
	     -i interactive: ask before overwriting data (def=no),
	     -f force: skip some sanity checks (def=no),
	     -p preserve: preserve ownership / perms (def=no),
	     -q quiet operation,
	     -v verbose operation,
	     -V display version and exit,
	     -h display this help and exit.
Note: Sizes may be given in units b(=512), k(=1024), M(=1024^2) or G(1024^3) bytes
This program is useful to rescue data in case of I/O errors, because
 it does not necessarily abort or truncate the output.

Note that there is also a GNU ddrescue with a similar feature set, but with entirely incompatible command-line arguments.

In the simplest of cases, dd_rescue can be used to copy infile (let's say, for example, /dev/sda) to outfile (again, for example, /dev/sdb).

dd_rescue /dev/sda /dev/sdb

In most cases, you'll want a little more control over how dd_rescue behaves though. For example, to clone failing /dev/sda to /dev/sdb:

dd_rescue -d -D -B 4k /dev/sda /dev/sdb

(this uses the default 64k block size) or, for really bad drives, to force only one read attempt per block:

dd_rescue -d -D -B 4k -b 4k /dev/sda /dev/sdb

Adding the -r option to read backwards also helps sometimes.

Changing block sizes

By default, dd_rescue uses a block size of 64k (overridden with -b). In the event of a read error, it tries to read again in 512-byte chunks (overridden with -B). If a drive is good (or only beginning to fail), a larger block size (usually in the 512kB-1MB range) will give you significantly better performance.

If a drive is failing, forcing the default block size to the same value as the fall-back size will keep dd_rescue from re-reading (and therefore possibly damaging) failed blocks.

Direct I/O

The -d and -D options turn on direct I/O for the input and output files respectively. Direct I/O turns off all OS caching, both read-ahead and write-behind. This is much more efficient (and safer) when reading from and writing to hard drives, but should generally be avoided when using regular files.

Other useful options

-r        Read backwards. Sometimes works more reliably. (Very handy trick...)

-s num    Start position in input file.

-S num    Start position in output file. (Defaults to the same as -s.)

-e num    Stop after num errors.

-m num    Maximum amount of data to read.

-l file   Write a log to file.
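
These options combine naturally on a marginal drive. A sketch (the log file names are arbitrary): copy in small blocks, stop after 100 errors, and record both a summary log and a list of bad blocks:

dd_rescue -d -D -b 4k -B 4k -e 100 -l sda.log -o sda.bad /dev/sda /dev/sdb

A second pass with -r added can then approach any remaining bad spots from the other direction.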

Copying partitions

Let's say you have a drive with a MS-DOS partition table.  The drive has two partitions.  The first is a NTFS partition that seems to be intact.  The second partition is an unknown type.  Rather than copying every block using dd_rescue, you want to copy only the blocks that are in use to a drive that is the same size.

To do this, first copy the boot sector and partition table from /dev/sda to /dev/sdb using dd:

dd if=/dev/sda of=/dev/sdb count=1

The default block size of dd is 512 bytes, which, conveniently, is the size of boot sector + partition table at the beginning of the drive.

Note: This trick doesn't quite work on MS-DOS partition tables with extended partitions! In that case, use sfdisk to copy the partition table (after running the above command to pick up the boot sector):

sfdisk -d /dev/sda | sfdisk /dev/sdb

Next, re-read the partition table on /dev/sdb using hdparm:

hdparm -z /dev/sdb

Next we can clone the NTFS filesystem on /dev/sda1 to /dev/sdb1 using the ntfsclone command from ntfsprogs:

ntfsclone --rescue -O /dev/sdb1 /dev/sda1
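
ntfsclone can also save the filesystem to a special-format image file instead of another partition, which is useful when no suitably-sized destination partition is available (a variant of the same command; the file name is an example):

ntfsclone --rescue --save-image -o sda1.img /dev/sda1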

Finally we clone /dev/sda2 to /dev/sdb2 using dd_rescue with a 1MB block size (for speed):

dd_rescue -d -D -B 4k -b 1M /dev/sda2 /dev/sdb2

To be continued in part 6.

Friday, March 8, 2024

Burn-in Testing for Spinning Disks - Data Recovery with Open-Source Tools (part 4)

This is part 4 of a multi-part series.  See part 1 for the beginning of the series.

Note that this was written long before solid state drives were common (or possibly before they existed), so when I say "drive", I mean traditional spinning hard drives.  Burn-in testing like this on SSDs makes a lot less sense and will likely only reduce their useful lifespan.

Burn-in testing

A good way to do a burn-in test on a new drive is to use a combination of SMART self-tests and the badblocks utility.  An example of how to do this can be found at https://github.com/silug/drivetest.

This script does the following:

  1. Enables SMART on the drive
  2. Checks for existing SMART health problems
  3. Runs a SMART conveyance or short test if the drive advertises that capability
  4. Uses badblocks to do a non-destructive read/write test of the whole drive
  5. Checks for resulting SMART errors
  6. Runs an extended SMART test

Depending on the size of the drive, this can take many hours, but the result will be a drive that should be past any early failures.
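
If you'd rather run the steps by hand, the sequence looks roughly like this (a sketch of the same procedure; replace /dev/sdX with the drive under test, and wait for each self-test to finish before checking the results):

smartctl -s on /dev/sdX
smartctl -H /dev/sdX
smartctl -t short /dev/sdX
smartctl -l selftest /dev/sdX
badblocks -n -s /dev/sdX
smartctl -A /dev/sdX
smartctl -t long /dev/sdX
smartctl -l selftest /dev/sdX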

To be continued in part 5.

Friday, March 1, 2024

SMART - Data Recovery with Open-Source Tools (part 3)

This is part 3 of a multi-part series.  See part 1 for the beginning of the series.

SMART

SMART (Self-Monitoring, Analysis, and Reporting Technology) can, in many cases, be used to detect drive failures. The utility smartctl (from the smartmontools package, see https://www.smartmontools.org/) can be used to view SMART data, initiate self-tests, etc.

Specifying device types

Historically, smartctl has guessed that devices named /dev/hdn are ATA (IDE) drives, and devices named /dev/sdn are SCSI drives. Since SATA drives and IDE drives using the libata driver show up as /dev/sdn, recent versions of smartctl have been modified to generally detect ATA drives named /dev/sdn, but to be sure, or in cases where smartctl needs to be told what type of device you're accessing, use the -d option. To verify how smartctl is accessing the drive, use the -i (AKA --info) option.
  • ATA (SATA and IDE drives)
smartctl -d ata -i /dev/sdn
  • SCSI
smartctl -d scsi -i /dev/sdn
  • 3ware controller, port n
smartctl -d 3ware,n -i /dev/twe0 (8000-series and earlier controllers)
smartctl -d 3ware,n -i /dev/twa0 (9000-series controllers)

smartctl supports various other device types (other RAID controllers, some USB-to-ATA bridges, etc.). See the man page or the smartmontools web site for more information.
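
For example, many USB-to-ATA bridges that can pass SMART commands through at all do so with the SAT (SCSI-to-ATA Translation) device type (a hedged example; not every bridge supports this):

smartctl -d sat -i /dev/sdn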

Enabling SMART

If SMART is not enabled on the device (for example, when it is disabled in the BIOS), it can be enabled with smartctl -s on device. There is also a -S option that turns on autosave of vendor-specific attributes. In most cases it shouldn't be necessary, but it can't hurt to turn it on.
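
For example, to enable both on /dev/sda:

smartctl -s on -S on /dev/sda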

Displaying SMART data

If you only remember one option for smartctl, make sure it is -a. That will show you everything smartctl knows about a drive. It is equivalent to -H -i -c -A -l error -l selftest -l selective for ATA drives and -H -i -A -l error -l selftest for SCSI drives.

Health

Drives use a combination of factors to determine their overall health. The drive's determination can be displayed with smartctl -H. For a failing drive, the output might look like this:

# smartctl -d ata -H /dev/sdb
smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
Failed Attributes:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033 033   033   036    Pre-fail Always  FAILING_NOW 2747


For a drive that isn't failing (or, more accurately, that SMART on the drive doesn't think is failing), the output will look like this:

# smartctl -d ata -H /dev/sda
smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED


Please note that a failing health self-assessment should always be taken as a clear indication of a failure, but passing this test should not be used as an indication that a drive is fine. Most actively failing drives do not trip this test.

Information

As previously mentioned, the -i option for smartctl will report drive information, such as model number, serial number, capacity, etc. The output of smartctl -i will look something like this:

# smartctl -d ata -i /dev/sda
smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.12 family
Device Model:     ST31000528AS
Serial Number:    X4JZDJRF
Firmware Version: CC38
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Wed Jul 7 21:01:41 2010 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

In some cases, drives that are known to have firmware bugs will also give output like this:

==> WARNING: There are known problems with these drives,
see the following Seagate web pages:
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207931
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207951
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207957

Capabilities

The -c option for smartctl displays drive capabilities. The most interesting bit of information displayed with this option is the suggested amount of time required for various self-tests. The full output will look like this:

# smartctl -d ata -c /dev/sda
smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
                                       was completed without error.
                                       Auto Offline Data Collection: Enabled.
Self-test execution status:     (   0) The previous self-test routine completed
                                       without error or no self-test has ever
                                       been run.
Total time to complete Offline
data collection:                ( 600) seconds.
Offline data collection
capabilities:                   (0x7b) SMART execute Offline immediate.
                                       Auto Offline data collection on/off support.
                                       Suspend Offline collection upon new
                                       command.
                                       Offline surface scan supported.
                                       Self-test supported.
                                       Conveyance Self-test supported.
                                       Selective Self-test supported.
SMART capabilities:           (0x0003) Saves SMART data before entering
                                       power-saving mode.
                                       Supports SMART auto save timer.
Error logging capability:       (0x01) Error logging supported.
                                       General Purpose Logging supported.
Short self-test routine
recommended polling time:       (   1) minutes.
Extended self-test routine
recommended polling time:       ( 175) minutes.
Conveyance self-test routine
recommended polling time:       (   2) minutes.
SCT capabilities:             (0x103f) SCT Status supported.
                                       SCT Feature Control supported.
                                       SCT Data Table supported.

SMART attributes

The -A option for smartctl displays vendor-specific device attributes that are stored by the device.

# smartctl -d ata -A /dev/sdb
smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f 099   087   006    Pre-fail Always  -           134820080
  3 Spin_Up_Time            0x0003 095   095   000    Pre-fail Always  -           0
  4 Start_Stop_Count        0x0032 100   100   020    Old_age  Always  -           16
  5 Reallocated_Sector_Ct   0x0033 033   033   036    Pre-fail Always  FAILING_NOW 2748
  7 Seek_Error_Rate         0x000f 072   062   030    Pre-fail Always  -           16103679
  9 Power_On_Hours          0x0032 097   097   000    Old_age  Always  -           3165
 10 Spin_Retry_Count        0x0013 100   100   097    Pre-fail Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   020    Old_age  Always  -           8
183 Runtime_Bad_Block       0x0032 100   100   000    Old_age  Always  -           0
184 End-to-End_Error        0x0032 100   100   099    Old_age  Always  -           0
187 Reported_Uncorrect      0x0032 100   100   000    Old_age  Always  -           0
188 Command_Timeout         0x0032 100   099   000    Old_age  Always  -           8590065676
189 High_Fly_Writes         0x003a 100   100   000    Old_age  Always  -           0
190 Airflow_Temperature_Cel 0x0022 071   065   045    Old_age  Always  -           29 (Lifetime Min/Max 27/30)
194 Temperature_Celsius     0x0022 029   040   000    Old_age  Always  -           29 (0 9 0 0)
195 Hardware_ECC_Recovered  0x001a 044   020   000    Old_age  Always  -           134820080
197 Current_Pending_Sector  0x0012 100   100   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0010 100   100   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x003e 200   200   000    Old_age  Always  -           0
240 Head_Flying_Hours       0x0000 100   253   000    Old_age  Offline -           257186936654939
241 Total_LBAs_Written      0x0000 100   253   000    Old_age  Offline -           2601921204
242 Total_LBAs_Read         0x0000 100   253   000    Old_age  Offline -           551656776

Generally speaking, these attributes should be mostly self-explanatory. For example, attribute #9, Power_On_Hours, stores the number of hours that the drive has been powered on. In this example, the drive has been on 3165 hours (seen in the RAW_VALUE column), which is a bit over 4 months.

Drives store a threshold for each attribute indicating when it should be considered failed. In this example, note that attribute 5, Reallocated_Sector_Ct, has a normalized value of 033, below its threshold of 036, so it is flagged FAILING_NOW (the raw value shows 2748 reallocated sectors).

SMART logs

The -l name option for smartctl displays the SMART log name stored on the device. There are several such logs that any given device might support, but the most interesting are the error and selftest logs.

The error log is, like the name suggests, a log of events that are seen as errors by the drive. A device that supports (and stores) a SMART error log, but currently has nothing logged, will look like this:

# smartctl -d ata -l error /dev/sda
smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
No Errors Logged


And here's an example of a device with one error logged:

# smartctl -d ata -l error /dev/sda
smartctl 5.39.1 2010-01-28 r3054 [i386-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
ATA Error Count: 1
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 4775 hours (198 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 aa b9 2f 04

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
  -- -- -- -- -- -- -- -- ---------------- --------------------
  60 00 08 a7 b9 2f 44 00 5d+18:49:37.312 READ FPDMA QUEUED
  61 00 10 87 2e de 44 00 5d+18:49:37.296 WRITE FPDMA QUEUED
  61 00 01 9a 7b 56 40 00 5d+18:49:37.272 WRITE FPDMA QUEUED
  61 00 20 ff ff ff 4f 00 5d+18:49:37.235 WRITE FPDMA QUEUED
  60 00 10 f7 98 59 40 00 5d+18:49:37.212 READ FPDMA QUEUED

The error log will only show the five most recent entries, but that is usually enough context to get an idea of what is wrong.

SMART self-tests

The -t type option tells smartctl to run a self-test of type type on the drive. type can be one of several options, although the most common are short, long, and conveyance. smartctl -t short runs a SMART Short Self Test, which usually finishes in just a couple of minutes. smartctl -t long runs a SMART Extended Self Test, which often will take an hour or more to run. smartctl -t conveyance runs a SMART Conveyance Self Test, which checks for damage sustained during transport (drops and such).

The output will look like this:

# smartctl -t short /dev/sdb
smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Mon Sep 6 20:22:49 2010

Use smartctl -X to abort test.


After waiting the appropriate amount of time (2 minutes, in the previous case, as seen in the smartctl -t short output, but which can also be found with smartctl -c), you can use smartctl -l selftest to view the self-test results.

# smartctl -l selftest /dev/sdb
smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description     Status                  Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline        Completed: read failure       90%    8835         17135
# 2 Short offline        Completed without error       00%    0            -

In the example above, a short test completed successfully at a lifetime of 0 hours, but a later short test failed with a read failure (with 90% of the test remaining) at a lifetime of 8835 hours. (Test results are listed from most recent to oldest.)

More information

Google has done some excellent work in determining how SMART and various other data relates to drive failure. See https://static.googleusercontent.com/media/research.google.com/en//archive/disk_failures.pdf.

To be continued in part 4.

Saturday, February 24, 2024

Drive Failures - Data Recovery with Open-Source Tools (part 2)

This is part 2 of a multi-part series.  See part 1 for the beginning of the series.

Note that this is material from 2010 and earlier that pre-dates the common availability of solid state drives.

Detecting failures

Mechanical failures

Mechanical drive failure is nearly always accompanied by some sort of audible noise.  One common sound heard from failing hard drives is the so-called "Click of Death", a sound similar to a watch ticking (but much louder).  This can have various causes, but it is commonly caused by the read/write head inside a drive being stuck or possibly trying to repeatedly read a failing block.

Another common noise is a very high-pitched whine.  This is caused by bearings in a drive failing (most likely rubbing metal-on-metal), usually as a result of old age.  Anything that moves inside a computer (fans, for example) can make a noise like this, so always check a suspect drive away from other sources of noise to verify that the sound is indeed coming from the drive.

Drive motors failing and head crashes can cause other distinctive noises.  As a rule, any noise coming from a hard drive that does not seem normal is probably an indicator of imminent failure.

Electronic failures

Failing electronics can cause a drive to act flaky, fail to be detected, or occasionally even catch fire.

Hard drives have electronics on the inside of the drive which are inaccessible without destroying the drive (unless you happen to have a clean room).  Unfortunately, if those fail, there isn't much you can do.

The external electronics on a hard drive are usually a small circuit board that contains the interface connector and is held onto the drive with a few screws.  In many cases, multiple versions of a drive (IDE, SATA, SCSI, SAS, etc.) exist with different controller interface boards.  Generally speaking, it is possible to transplant the external electronics from a good drive onto a drive with failing electronics in order to get data off the failing drive.  Usually the controller board will need to come from an identical drive with a similar manufacturing date.

Dealing with physical failures

In addition to drive electronics transplanting, just about any trick you've heard of (freezing, spinning, smacking, etc.) has probably worked for someone, sometime.  Whether any of these tricks work for you is a matter of trial and error.  Just be careful.

Freezing drives seems to be especially effective, and often helps in cases of crashed heads and electronic problems.  Unfortunately, as soon as a drive is operating, it will tend to heat up quickly, so some care needs to be taken to keep it cold (ice packs, freeze spray, etc.) without letting it get wet from condensation.  Often, once a drive starts spinning, it will stay spinning, and turning a drive on its side sometimes helps with physical problems as well.

Swapping electronics often works when faced with electronic failure, but only when the donor drive exactly matches the failed drive.

Unfortunately, we do have to get a drive to spin for any software data recovery techniques to work.

To be continued in part 3.

Sunday, February 18, 2024

Data Recovery with Open-Source Tools (part 1)

This is material from a class I taught a long time ago.  Some of it may still be useful.  🙂

The original copyright notice:

Copyright © 2009-2010 Steven Pritchard / K&S Pritchard Enterprises, Inc.

This work is licensed under the Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.


This is part 1 of a multi-part series.

Identifying drives

An easy way to get a list of drives attached to a system is to run fdisk -l.  The output will look something like this:


# fdisk -l

Disk /dev/sda: 80.0 GB, 80026361856 bytes
255 heads, 63 sectors/track, 9729 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0xcab10bee

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1        8673    69665841    7  HPFS/NTFS
/dev/sda2            8675        9729     8474287+   c  W95 FAT32 (LBA)


In many cases, you'll see a lot of (generally) uninteresting devices that are named /dev/dm-n.  These are devices created by device mapper for everything from software RAID to LVM logical volumes.  If you are primarily interested in the physical drives attached to a system, you can suppress the extra output of fdisk -l with a little bit of sed.  Try the following:


fdisk -l 2>&1 | sed '/\/dev\/dm-/,/^$/d' | uniq


Whole devices generally show up as /dev/sdx (/dev/sda, /dev/sdb, etc.) or /dev/hdx (/dev/hda, /dev/hdb, etc.).  Partitions on the individual devices show up as /dev/sdxn (/dev/sda1, /dev/sda2, etc.), or, in the case of longer device names, the name of the device with pn appended (an example might be /dev/mapper/loop0p1).

Hardware

PATA/SATA

The vast majority of hard drives currently in use connect to a computer using either an IDE (or Parallel ATA) interface or a SATA (Serial ATA) interface.  For the most part, SATA is just IDE with a different connector, but when SATA came out, the old Linux IDE driver had accumulated enough cruft that a new SATA driver (libata) was developed to support SATA controller chipsets.  Later, the libata driver had support for most IDE controllers added, obsoleting the old IDE driver.


There are some differences between the two drivers, and often those differences directly impact data recovery.  One difference is device naming.  The old IDE driver named devices /dev/hdx, where x is determined by the position of the drive.


/dev/hda    Master device, primary controller
/dev/hdb    Slave device, primary controller
/dev/hdc    Master device, secondary controller
/dev/hdd    Slave device, secondary controller


And so on.


Unlike the IDE driver, the libata driver uses what was historically SCSI device naming, /dev/sdx, where x starts at "a" and increments upwards as devices are detected, which means that device names are more-or-less random, and won't be consistent across reboots.
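
If you need to reliably identify a particular drive across reboots, one option is the persistent symlinks udev creates from each drive's model and serial number (an aside; exact names vary by system):

ls -l /dev/disk/by-id/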


The other major difference between the old IDE driver and the libata driver that affects data recovery is how the drivers handle DMA (direct memory access).  The ATA specification allows for various PIO (Programmed I/O) and DMA modes.  Both the old IDE driver and the libata driver will determine the best mode, in most cases choosing a DMA mode initially, and falling back to a PIO mode in error conditions.  The old IDE driver would also let you manually toggle DMA off and on for any device using the command hdparm.


hdparm -d /dev/hdx     Query DMA on/off state for /dev/hdx
hdparm -d0 /dev/hdx    Disable DMA on /dev/hdx
hdparm -d1 /dev/hdx    Enable DMA on /dev/hdx


The libata driver currently lacks the ability to toggle DMA on a running system, but it can be turned off for all hard drives with the kernel command line option libata.dma=6, or for all devices (including optical drives) with libata.dma=0.  On a running system, the value of libata.dma can be found in /sys/module/libata/parameters/dma.  (The full list of numeric values for this option can be found in http://www.kernel.org/doc/Documentation/kernel-parameters.txt.)  There does not appear to be a way to toggle DMA per device with the libata driver.
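
For example, to check the current value on a running system:

cat /sys/module/libata/parameters/dma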


There are several reasons why you might want to toggle DMA on or off for a drive.  Some failing drives simply won't work unless DMA is disabled, or, in rare cases, won't work unless DMA is enabled.  Sometimes the computer itself will have issues reading from a failing drive with DMA enabled.  (The libata driver usually handles these situations fairly well.  The old IDE driver only began to handle them well in recent years.)


In addition to those reasons, PIO mode forces a drive to a maximum speed of 25MB/s (PIO Mode 6, others are even slower), while DMA modes can go up to 133MB/s.  Some drives appear to work better at these lower speeds.

SCSI

While SCSI drives and controllers are less common than they once were, all current hard drive controller interfaces now use the kernel SCSI device layers for device management and such.  For example, all devices that use the SCSI layer will show up in /proc/scsi/scsi.


# cat /proc/scsi/scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: TSSTcorp Model: CD/DVDW TS-L632D Rev: AS05
  Type:   CD-ROM                           ANSI  SCSI revision: 05
Host: scsi1 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: ST9160821A       Rev: 3.AL
  Type:   Direct-Access                    ANSI  SCSI revision: 05
Host: scsi3 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: WDC WD10EACS-00Z Rev: 01.0
  Type:   Direct-Access                    ANSI  SCSI revision: 05


In most cases, it is safe to remove a device that isn't currently mounted, but to be absolutely sure, you can explicitly tell the kernel to disable a device by writing to /proc/scsi/scsi.  For example, to remove the third device (the Western Digital drive in this example), you could do the following:


echo scsi remove-single-device 3 0 0 0 > /proc/scsi/scsi

Note that the four numbers correspond to the controller, channel, ID, and LUN in the example.


In cases where hot-added devices don't automatically show up, there is also a corresponding add-single-device command.
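
For example, to re-add the device removed above:

echo scsi add-single-device 3 0 0 0 > /proc/scsi/scsi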

When recovering data from SCSI (and SCSI-like drives such as SAS), there are no special tricks like toggling DMA.

USB, etc.

The Linux USB drivers are rather resilient in the face of errors, so no special consideration needs to be given when recovering data from thumb drives and other flash memory (except that these devices tend to work or not, and, of course, dead shorts across USB ports are a Bad Thing).  USB-to-ATA bridge devices are a different matter entirely though.  They tend to lock up hard or otherwise behave badly when they hit errors on a failing drive.  Generally speaking, they should be avoided for failing drives, but drives that are OK other than a trashed filesystem or partition table should be completely fine on a USB-to-ATA bridge device.

To be continued in part 2.