Friday, March 1, 2024

SMART - Data Recovery with Open-Source Tools (part 3)

This is part 3 of a multi-part series.  See part 1 for the beginning of the series.

SMART

SMART (Self-Monitoring, Analysis, and Reporting Technology) can, in many cases, be used to detect drive failures. The utility smartctl (from the smartmontools package, see https://www.smartmontools.org/) can be used to view SMART data, initiate self-tests, etc.

Specifying device types

Historically, smartctl has guessed that devices named /dev/hdn are ATA (IDE) drives, and that devices named /dev/sdn are SCSI drives. Because SATA drives, and IDE drives using the libata driver, also show up as /dev/sdn, recent versions of smartctl generally detect ATA drives named /dev/sdn correctly. To be sure, or in cases where smartctl needs to be told what type of device you're accessing, use the -d option. To check how you are accessing the drive, use the -i (AKA --info) option.
  • ATA (SATA and IDE drives)
smartctl -d ata -i /dev/sdn
  • SCSI
smartctl -d scsi -i /dev/sdn
  • 3ware controller, port n
smartctl -d 3ware,n -i /dev/twe0 (8000-series and earlier controllers)
smartctl -d 3ware,n -i /dev/twa0 (9000-series controllers)

smartctl supports various other device types (other RAID controllers, some USB-to-ATA bridges, etc.). See the man page or the smartmontools web site for more information.

Enabling SMART

If SMART is not enabled on the device (for example, when it has been disabled in the BIOS), it can be enabled with smartctl -s on device. There is also a -S on option that turns on autosave of vendor-specific attributes. In most cases it shouldn't be necessary to turn this on, but it can't hurt.

Displaying SMART data

If you only remember one option for smartctl, make sure it is -a. That will show you all of the SMART information for a drive. It is equivalent to -H -i -c -A -l error -l selftest -l selective for ATA drives, and to -H -i -A -l error -l selftest for SCSI drives.

Health

Drives use a combination of factors to determine their overall health. The drive's determination can be displayed with smartctl -H. For a failing drive, the output might look like this:

# smartctl -d ata -H /dev/sdb
smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
Failed Attributes:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 033 033 036 Pre-fail Always FAILING_NOW 2747


For a drive that isn't failing (or, more accurately, that SMART on the drive doesn't think is failing), the output will look like this:

# smartctl -d ata -H /dev/sda
smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED


Please note that a failing health self-assessment should always be taken as a clear indication of a failure, but passing this test should not be used as an indication that a drive is fine. Most actively failing drives do not trip this test.
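
If you want to check the health verdict from a script, the result line shown above can be picked out of the output. This is only a minimal sketch: smart_health_verdict is a hypothetical helper name (not part of smartmontools), and it assumes the ATA-style "SMART overall-health self-assessment test result:" line seen in the examples above.

```shell
# Hypothetical helper (not part of smartmontools): reads `smartctl -H`
# output for an ATA drive on stdin and prints PASSED, FAILED, or UNKNOWN.
smart_health_verdict() {
    awk -F': ' '
        /^SMART overall-health self-assessment test result:/ {
            verdict = $2
            sub(/!$/, "", verdict)   # strip the trailing "!" from "FAILED!"
            print verdict
            found = 1
        }
        END { if (!found) print "UNKNOWN" }
    '
}

# Usage sketch (needs root and a real drive; /dev/sdb is only an example):
#   smartctl -d ata -H /dev/sdb | smart_health_verdict
```

As noted above, a PASSED verdict from this helper means no more than it does when you read it by eye.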

Information

As previously mentioned, the -i option for smartctl will report drive information, such as model number, serial number, capacity, etc. The output of smartctl -i will look something like this:

# smartctl -d ata -i /dev/sda
smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.12 family
Device Model: ST31000528AS
Serial Number: X4JZDJRF
Firmware Version: CC38
User Capacity: 1,000,204,886,016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Wed Jul 7 21:01:41 2010 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

In some cases, drives that are known to have firmware bugs will also give output like this:

==> WARNING: There are known problems with these drives,
see the following Seagate web pages:
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207931
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207951
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207957
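
The "SMART support is:" lines in the -i output can also be used to decide whether SMART needs to be switched on. A rough sketch, assuming the output format shown above; smart_is_enabled is a made-up helper name, not something smartctl provides.

```shell
# Hypothetical helper: reads `smartctl -i` output on stdin and succeeds
# (exit status 0) only if it contains a "SMART support is: Enabled" line.
smart_is_enabled() {
    grep -q '^SMART support is:[[:space:]]*Enabled'
}

# Usage sketch (needs root and a real drive; /dev/sda is only an example):
#   smartctl -d ata -i /dev/sda | smart_is_enabled \
#       || smartctl -d ata -s on /dev/sda
```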

Capabilities

The -c option for smartctl displays drive capabilities. The most interesting bit of information displayed with this option is the suggested amount of time required for various self-tests. The full output will look like this:

# smartctl -d ata -c /dev/sda
smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
                                       was completed without error.
                                       Auto Offline Data Collection: Enabled.
Self-test execution status:     (   0) The previous self-test routine completed
                                       without error or no self-test has ever
                                       been run.
Total time to complete Offline
data collection:                ( 600) seconds.
Offline data collection
capabilities:                   (0x7b) SMART execute Offline immediate.
                                       Auto Offline data collection on/off support.
                                       Suspend Offline collection upon new
                                       command.
                                       Offline surface scan supported.
                                       Self-test supported.
                                       Conveyance Self-test supported.
                                       Selective Self-test supported.
SMART capabilities:           (0x0003) Saves SMART data before entering
                                       power-saving mode.
                                       Supports SMART auto save timer.
Error logging capability:       (0x01) Error logging supported.
                                       General Purpose Logging supported.
Short self-test routine
recommended polling time:       (   1) minutes.
Extended self-test routine
recommended polling time:       ( 175) minutes.
Conveyance self-test routine
recommended polling time:       (   2) minutes.
SCT capabilities:             (0x103f) SCT Status supported.
                                       SCT Feature Control supported.
                                       SCT Data Table supported.
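
Since the polling times are the part of this output you are most likely to want, here is a small sketch that pulls just those out. It assumes the two-line layout shown above (a "... self-test routine" line followed by a "recommended polling time:" line); selftest_polling_times is a hypothetical helper name.

```shell
# Hypothetical helper: reads `smartctl -c` output on stdin and prints one
# line per self-test type, as "<type> <minutes>".
selftest_polling_times() {
    awk '
        # Remember the test type from a line like "Short self-test routine".
        /self-test routine *$/ { kind = $1; next }
        # The very next line carries the time, e.g. "(   1) minutes."
        /^recommended polling time:/ && kind != "" {
            minutes = $0
            gsub(/[^0-9]/, "", minutes)   # keep only the digits
            print kind, minutes
        }
        # Any other line ends the pairing.
        { kind = "" }
    '
}

# Usage sketch (needs root and a real drive; /dev/sda is only an example):
#   smartctl -d ata -c /dev/sda | selftest_polling_times
```

For the output above, this would list Short, Extended, and Conveyance with 1, 175, and 2 minutes respectively.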

SMART attributes

The -A option for smartctl displays vendor-specific device attributes that are stored by the device.

# smartctl -d ata -A /dev/sdb
smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f 099   087   006    Pre-fail Always  -           134820080
  3 Spin_Up_Time            0x0003 095   095   000    Pre-fail Always  -           0
  4 Start_Stop_Count        0x0032 100   100   020    Old_age  Always  -           16
  5 Reallocated_Sector_Ct   0x0033 033   033   036    Pre-fail Always  FAILING_NOW 2748
  7 Seek_Error_Rate         0x000f 072   062   030    Pre-fail Always  -           16103679
  9 Power_On_Hours          0x0032 097   097   000    Old_age  Always  -           3165
 10 Spin_Retry_Count        0x0013 100   100   097    Pre-fail Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   020    Old_age  Always  -           8
183 Runtime_Bad_Block       0x0032 100   100   000    Old_age  Always  -           0
184 End-to-End_Error        0x0032 100   100   099    Old_age  Always  -           0
187 Reported_Uncorrect      0x0032 100   100   000    Old_age  Always  -           0
188 Command_Timeout         0x0032 100   099   000    Old_age  Always  -           8590065676
189 High_Fly_Writes         0x003a 100   100   000    Old_age  Always  -           0
190 Airflow_Temperature_Cel 0x0022 071   065   045    Old_age  Always  -           29 (Lifetime Min/Max 27/30)
194 Temperature_Celsius     0x0022 029   040   000    Old_age  Always  -           29 (0 9 0 0)
195 Hardware_ECC_Recovered  0x001a 044   020   000    Old_age  Always  -           134820080
197 Current_Pending_Sector  0x0012 100   100   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0010 100   100   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x003e 200   200   000    Old_age  Always  -           0
240 Head_Flying_Hours       0x0000 100   253   000    Old_age  Offline -           257186936654939
241 Total_LBAs_Written      0x0000 100   253   000    Old_age  Offline -           2601921204
242 Total_LBAs_Read         0x0000 100   253   000    Old_age  Offline -           551656776

Generally speaking, these attributes should be mostly self-explanatory. For example, attribute #9, Power_On_Hours, stores the number of hours that the drive has been powered on. In this example, the drive has been on 3165 hours (seen in the RAW_VALUE column), which is a bit over 4 months.
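
The hours-to-days arithmetic is easy to do in the shell. A tiny sketch (hours_to_days is just an illustrative name), using the same "days + hours" form that smartctl itself prints in its error log:

```shell
# Convert a Power_On_Hours raw value into whole days plus leftover hours.
hours_to_days() {
    hours=$1
    printf '%s\n' "$((hours / 24)) days + $((hours % 24)) hours"
}

hours_to_days 3165   # prints: 131 days + 21 hours
```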

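Drives store a threshold (the THRESH column) for each attribute, and an attribute is considered failed when its normalized VALUE drops to or below that threshold. In this example, note that attribute 5, Reallocated_Sector_Ct, has a normalized value of 033 against a threshold of 036 (with a raw count of 2748 reallocated sectors), so it is considered FAILING_NOW.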
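
The same comparison can be done mechanically. The sketch below assumes the ten-column table layout shown above and treats an attribute as failed when its VALUE is less than or equal to a non-zero THRESH; failed_attributes is a hypothetical helper name. (The WHEN_FAILED column already tells you this, so this is mainly useful for seeing how the numbers relate.)

```shell
# Hypothetical helper: reads `smartctl -A` output on stdin and prints the
# attributes whose normalized VALUE is at or below a non-zero THRESH.
# Columns: 1=ID 2=NAME 3=FLAG 4=VALUE 5=WORST 6=THRESH ... 10=RAW_VALUE
failed_attributes() {
    awk '
        $1 ~ /^[0-9]+$/ && NF >= 10 && $6 + 0 > 0 && $4 + 0 <= $6 + 0 {
            print $1, $2, "value=" $4, "thresh=" $6, "raw=" $10
        }
    '
}

# Usage sketch (needs root and a real drive; /dev/sdb is only an example):
#   smartctl -d ata -A /dev/sdb | failed_attributes
```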

SMART logs

The -l name option for smartctl displays the SMART log name stored on the device. There are several such logs that any given device might support, but the most interesting are the error and selftest logs.

The error log is, like the name suggests, a log of events that are seen as errors by the drive. A device that supports (and stores) a SMART error log, but currently has nothing logged, will look like this:

# smartctl -d ata -l error /dev/sda
smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
No Errors Logged


And here's an example of a device with one error logged:

# smartctl -d ata -l error /dev/sda
smartctl 5.39.1 2010-01-28 r3054 [i386-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
ATA Error Count: 1
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 4775 hours (198 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 aa b9 2f 04

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
  -- -- -- -- -- -- -- -- ---------------- --------------------
  60 00 08 a7 b9 2f 44 00 5d+18:49:37.312 READ FPDMA QUEUED
  61 00 10 87 2e de 44 00 5d+18:49:37.296 WRITE FPDMA QUEUED
  61 00 01 9a 7b 56 40 00 5d+18:49:37.272 WRITE FPDMA QUEUED
  61 00 20 ff ff ff 4f 00 5d+18:49:37.235 WRITE FPDMA QUEUED
  60 00 10 f7 98 59 40 00 5d+18:49:37.212 READ FPDMA QUEUED

The error log will only show the five most recent errors, each with the commands that led up to it, but that is usually enough context to get an idea of what is wrong.
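
One handy thing to take from the error log is when each error happened, so you can compare it with the drive's current Power_On_Hours and see how recent the trouble is. A minimal sketch, assuming the "Error N occurred at disk power-on lifetime:" line format shown above; error_log_hours is a hypothetical helper name.

```shell
# Hypothetical helper: reads `smartctl -l error` output on stdin and prints
# "<error number> <power-on hours>" for each logged error.
error_log_hours() {
    awk '/^Error [0-9]+ occurred at disk power-on lifetime:/ { print $2, $8 }'
}

# Usage sketch (needs root and a real drive; /dev/sda is only an example):
#   smartctl -d ata -l error /dev/sda | error_log_hours
```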

SMART self-tests

The -t type option tells smartctl to run a self-test of type type on the drive. type can be one of several options, although the most common are short, long, and conveyance. smartctl -t short runs a SMART Short Self Test, which usually finishes in just a couple of minutes. smartctl -t long runs a SMART Extended Self Test, which often will take an hour or more to run. smartctl -t conveyance runs a SMART Conveyance Self Test, which checks for damage sustained during transport (drops and such).

The output will look like this:

# smartctl -t short /dev/sdb
smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Mon Sep 6 20:22:49 2010

Use smartctl -X to abort test.
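
If you are scripting a self-test, the "Please wait" line in the output above tells you how long to sleep before checking the results. A rough sketch, assuming that line keeps the wording shown here; selftest_wait_minutes is a hypothetical helper name.

```shell
# Hypothetical helper: reads `smartctl -t <type>` output on stdin and prints
# the number of minutes from the "Please wait N minutes ..." line.
selftest_wait_minutes() {
    awk '/^Please wait [0-9]+ minutes? for test to complete/ { print $3; exit }'
}

# Usage sketch (needs root and a real drive; /dev/sdb is only an example):
#   mins=$(smartctl -t short /dev/sdb | selftest_wait_minutes)
#   sleep $((mins * 60)) && smartctl -l selftest /dev/sdb
```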


After waiting the appropriate amount of time (2 minutes in this case, as shown in the smartctl -t short output; the recommended polling times are also listed by smartctl -c), you can use smartctl -l selftest to view the self-test results.

# smartctl -l selftest /dev/sdb
smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description     Status                  Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline        Completed: read failure       90%    8835         17135
# 2 Short offline        Completed without error       00%    0            -

In the example above, a short test completed successfully at a lifetime of 0 hours, but another short test failed with a read failure with 90% remaining at a lifetime of 8835 hours. (Test results are listed in order of most recent to oldest.)
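
To pull just the failed tests out of a longer log, something like the following can work. It is a sketch under the assumption that failed tests have a status beginning with "Completed: " (as in "Completed: read failure" above) and that the last two columns are the lifetime hours and the LBA of the first error; failed_selftests is a hypothetical helper name.

```shell
# Hypothetical helper: reads `smartctl -l selftest` output on stdin and
# prints the lifetime hours and first-error LBA of each failed test.
failed_selftests() {
    awk '/^# *[0-9]+ / && /Completed: / {
        print "lifetime_hours=" $(NF - 1), "first_error_lba=" $NF
    }'
}

# Usage sketch (needs root and a real drive; /dev/sdb is only an example):
#   smartctl -l selftest /dev/sdb | failed_selftests
```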

More information

Google has done some excellent work in determining how SMART and various other data relate to drive failure. See https://static.googleusercontent.com/media/research.google.com/en//archive/disk_failures.pdf.

To be continued in part 4.
