This is part 3 of a multi-part series. See part 1 for the beginning of the series.
SMART
SMART (Self-Monitoring, Analysis, and Reporting Technology) can, in many cases, be used to detect drive failures. The utility smartctl (from the smartmontools package, see https://www.smartmontools.org/) can be used to view SMART data, initiate self-tests, etc.Specifying device types
Historically, smartctl has guessed that devices named /dev/hdn are ATA (IDE) drives, and devices named /dev/sdn are SCSI drives. Since SATA drives and IDE drives using the libata driver show up as /dev/sdn, recent versions of smartctl have been modified to generally detect ATA drives named /dev/sdn, but to be sure, or in cases where smartctl needs to be told what type of device you're accessing, use the -t option. To test how you are accessing the drive, use the -i (AKA --info) option.- ATA (SATA and IDE drives)
smartctl -d ata -i /dev/sdn
- SCSI
smartctl -d scsi -i /dev/sdn
- 3ware controller, port n
smartctl -d 3ware,n -i /dev/twe0 (8000-series and earlier controllers)
smartctl -d 3ware,n -i /dev/twa0 (9000-series controllers)
smartctl supports various other device types (other RAID controllers, some USB-to-ATA bridges, etc.). See the man page or the smartmontools web site for more information.
Enabling SMART
If SMART is not enabled on the device (like when it is disabled in the BIOS), it can be enabled with smartctl -s on device. There is also a -S option that turns on autosave of vendor-specific attributes. In most cases, it shouldn't be necessary to turn this on, but it can't hurt to turn it on.Displaying SMART data
If you only remember one option for smartctl, make sure it is -a. That will show you everything smartctl knows about a drive. It is equivalent to -H -i -c -A -l error -l selftest -l selective for ATA drives and -H -i -A -l error -l selftest for SCSI drives.Health
Drives use a combination of factors to determine their overall health. The drive's determination can be displayed with smartctl -H. For a failing drive, the output might look like this:# smartctl -d ata -H /dev/sdb
smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
Failed Attributes:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 033 033 036 Pre-fail Always FAILING_NOW 2747
For a drive that isn't failing (or, more accurately, that SMART on the drive doesn't think is failing), the output will look like this:
# smartctl -d ata -H /dev/sda
smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Please note that a failing health self-assessment should always be taken as a clear indication of a failure, but passing this test should not be used as an indication that a drive is fine. Most actively failing drives do not trip this test.
Information
As previously mentioned, the -i option for smartctl will report drive information, such as model number, serial number, capacity, etc. The output of smartctl -i will look something like this:# smartctl -d ata -i /dev/sda
smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.12 family
Device Model: ST31000528AS
Serial Number: X4JZDJRF
Firmware Version: CC38
User Capacity: 1,000,204,886,016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Wed Jul 7 21:01:41 2010 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
In some cases, drives that are known to have firmware bugs will also give output like this:
==> WARNING: There are known problems with these drives,
see the following Seagate web pages:
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207931
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207951
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207957
Capabilities
The -c option for smartctl displays drive capabilities. The most interesting bit of information displayed with this option is the suggested amount of time required for various self-tests. The full output will look like this:# smartctl -d ata -c /dev/sda
smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 600) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 175) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x103f) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART attributes
The -A option for smartctl displays vendor-specific device attributes that are stored by the device.# smartctl -d ata -A /dev/sdb
smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 099 087 006 Pre-fail Always - 134820080
3 Spin_Up_Time 0x0003 095 095 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 16
5 Reallocated_Sector_Ct 0x0033 033 033 036 Pre-fail Always FAILING_NOW 2748
7 Seek_Error_Rate 0x000f 072 062 030 Pre-fail Always - 16103679
9 Power_On_Hours 0x0032 097 097 000 Old_age Always - 3165
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 8
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 8590065676
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 071 065 045 Old_age Always - 29 (Lifetime Min/Max 27/30)
194 Temperature_Celsius 0x0022 029 040 000 Old_age Always - 29 (0 9 0 0)
195 Hardware_ECC_Recovered 0x001a 044 020 000 Old_age Always - 134820080
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 257186936654939
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 2601921204
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 551656776
Generally speaking, these attributes should be mostly self-explanatory. For example, attribute #9, Power_On_Hours, stores the number of hours that the drive has been powered on. In this example, the drive has been on 3165 hours (seen in the RAW_VALUE column), which is a bit over 4 months.
Drives store thresholds for what value indicates a failure. In this example, note that attribute 5, Reallocated_Sector_Ct, which has a value of 2748, is considered FAILING_NOW.
SMART logs
The -l name option for smartctl displays the SMART log name stored on the device. There are several such logs that any given device might support, but the most interesting are the error and selftest logs.The error log is, like the name suggests, a log of events that are seen as errors by the drive. A device that supports (and stores) a SMART error log, but currently has nothing logged, will look like this:
# smartctl -d ata -l error /dev/sda
smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
No Errors Logged
And here's an example of a device with one error logged:
# smartctl -d ata -l error /dev/sda
smartctl 5.39.1 2010-01-28 r3054 [i386-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
ATA Error Count: 1
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 1 occurred at disk power-on lifetime: 4775 hours (198 days + 23 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 aa b9 2f 04
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 08 a7 b9 2f 44 00 5d+18:49:37.312 READ FPDMA QUEUED
61 00 10 87 2e de 44 00 5d+18:49:37.296 WRITE FPDMA QUEUED
61 00 01 9a 7b 56 40 00 5d+18:49:37.272 WRITE FPDMA QUEUED
61 00 20 ff ff ff 4f 00 5d+18:49:37.235 WRITE FPDMA QUEUED
60 00 10 f7 98 59 40 00 5d+18:49:37.212 READ FPDMA QUEUED
SMART self-tests
The -t type option tells smartctl to run a self-test of type type on the drive. type can be one of several options, although the most common are short, long, and conveyance. smartctl -t short runs a SMART Short Self Test, which usually finishes in just a couple of minutes. smartctl -t long runs a SMART Extended Self Test, which often will take an hour or more to run. smartctl -t conveyance runs a SMART Conveyance Self Test, which checks for damage sustained during transport (drops and such).The output will look like this:
# smartctl -t short /dev/sdb
smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Mon Sep 6 20:22:49 2010
Use smartctl -X to abort test.
After waiting the appropriate amount of time (2 minutes, in the previous case, as seen in the smartctl -t short output, but which can also be found with smartctl -c), you can use smartctl -l selftest to view the self-test results.
# smartctl -l selftest /dev/sdb
smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 90% 8835 17135
# 2 Short offline Completed without error 00% 0 -
In the example above, a short test completed successfully at a lifetime of 0 hours, but another short test failed with a read failure with 90% remaining at a lifetime of 8835 hours. (Test results are listed in order of most recent to oldest.)
More information
Google has done some excellent work in determining how SMART and various other data relates to drive failure. See https://static.googleusercontent.com/media/research.google.com/en//archive/disk_failures.pdf.To be continued in part 4.
No comments:
Post a Comment