Mechanical failures account for about 60 percent of all drive failures. Most mechanical failures result from gradual wear, although an eventual failure may be catastrophic. However, before complete failure occurs, there are usually certain indications that failure is imminent. These may include increased heat output, increased noise level, problems with reading and writing of data, a marked increase in the number of damaged disk sectors, and so on.
The purpose of S.M.A.R.T. is to warn a user or system administrator of impending drive failure while there is still time to take preventative action, such as copying the data to a replacement device. Approximately 30% of failures can be predicted by S.M.A.R.T. Work at Google on over 100,000 drives has shown little overall predictive value of S.M.A.R.T. status as a whole, but suggests that certain sub-categories of information which some S.M.A.R.T. implementations track do correlate with actual failure rates – specifically, in the 60 days following the first scan error on a drive, the drive is, on average, 39 times more likely to fail than it would have been had no such error occurred. Also, first errors in reallocations, offline reallocations and probational counts are strongly correlated to higher probabilities of failure.
PCTechGuide's page on S.M.A.R.T. (2003) comments that the technology has gone through three phases:
Later, another variant, named IntelliSafe, was created by computer manufacturer Compaq and disk drive manufacturers Seagate, Quantum, and Conner. The disk drives would measure the disk's "health parameters", and the values would be transferred to the operating system and user-space monitoring software. Each disk drive vendor was free to decide which parameters to monitor and what their thresholds should be. The unification was at the protocol level with the host.
Compaq submitted its implementation to the Small Form Factor (SFF) Committee for standardization in early 1995. It was supported by IBM, by Compaq's development partners Seagate, Quantum, and Conner, and by Western Digital, which did not have a failure prediction system at the time. The Committee chose IntelliSafe's approach, as it provided more flexibility. The resulting jointly developed standard was named S.M.A.R.T.
The most basic information that SMART provides is the SMART status. It provides only two values: "threshold not exceeded" and "threshold exceeded". Often these are represented as "drive OK" or "drive fail" respectively. A "threshold exceeded" value is intended to indicate that there is a relatively high probability that the drive will not be able to honour its specification in the future – that is, the drive is "about to fail". The predicted failure may be catastrophic or may be something as subtle as the inability to write to certain sectors, or perhaps slower performance than the manufacturer's declared minimum.
The SMART status does not necessarily indicate the drive's past or present reliability. If a drive has already failed catastrophically, the SMART status may be inaccessible. Alternatively, if a drive has experienced problems in the past, but the sensors no longer detect such problems, the SMART status may, depending on the manufacturer's programming, suggest that the drive is now sound.
The inability to read some sectors is not always an indication that a drive is about to fail. One way that unreadable sectors may be created, even when the drive is functioning within specification, is through a sudden power failure while the drive is writing. In order to prevent this problem, modern hard drives will always finish writing at least the current sector immediately after the power fails (typically using rotational energy from the disk). Also, even if the physical disk is damaged at one location, such that a certain sector is unreadable, the disk may be able to use spare space to replace the bad area, so that the sector can be overwritten.
More detail on the health of the drive may be obtained by examining the SMART Attributes. SMART Attributes were included in some drafts of the ATA standard, but were removed before the standard became final. The meaning and interpretation of the attributes vary between manufacturers, and are sometimes treated as a trade secret by a given manufacturer. Attributes are further discussed below.
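As an illustration of examining attributes in practice, the following minimal Python sketch parses an attribute table in the column layout printed by the `smartctl -A` command of smartmontools. The sample text is illustrative rather than captured from a real drive, and the names `AttributeRow` and `parse_attributes` are hypothetical. The raw value is kept as text because its meaning is vendor-specific.

```python
import re
from typing import List, NamedTuple


class AttributeRow(NamedTuple):
    attr_id: int
    name: str
    value: int      # current normalized value
    worst: int      # lowest normalized value recorded so far
    threshold: int  # manufacturer-defined threshold
    raw: str        # vendor-specific raw value, kept as text


# Illustrative text in the layout of `smartctl -A`; not real drive output.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       8123
194 Temperature_Celsius     0x0022   034   045   000    Old_age   Always       -       34 (0 18 0 0 0)
"""

# ID, name, flag, value, worst, threshold, type, updated, when_failed, raw
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+\S+\s+\S+\s+\S+\s+(.*)$"
)


def parse_attributes(text: str) -> List[AttributeRow]:
    """Return one AttributeRow per table line; header and other lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            attr_id, name, value, worst, threshold, raw = match.groups()
            rows.append(
                AttributeRow(int(attr_id), name, int(value), int(worst),
                             int(threshold), raw.strip())
            )
    return rows


if __name__ == "__main__":
    for row in parse_attributes(SAMPLE):
        print(row.attr_id, row.name, row.value, row.threshold, row.raw)
```

In real use the text would come from running `smartctl -A` against a device, which typically requires administrative privileges.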
Drives with SMART may optionally support a number of 'logs'. The error log records information about the most recent errors that the drive has reported back to the host computer. Examining this log may help one to determine whether computer problems are disk-related or caused by something else.
A drive supporting SMART may optionally support a number of self-test or maintenance routines, and the results of the tests are kept in the self-test log. The self-test routines may be used to detect any unreadable sectors on the disk, so that they may be restored from back-up sources (for example, from other disks in a RAID). This helps to reduce the risk of incurring permanent loss of data.
From a legal perspective, the term "S.M.A.R.T." refers only to a signalling method between internal disk drive electromechanical sensors and the host computer. Hence, a drive may be claimed by its manufacturers to include S.M.A.R.T. support even if it does not include, say, a temperature sensor, which the customer might reasonably expect to be present. Moreover, in the most extreme case, a disk manufacturer could, in theory, produce a drive which includes a sensor for just one physical attribute, and then legally advertise the product as "S.M.A.R.T. compatible".
Depending on the type of interface being used, some S.M.A.R.T.-enabled motherboards and related software may not communicate with certain S.M.A.R.T.-capable drives. For example, few external drives connected via USB or FireWire correctly send S.M.A.R.T. data over those interfaces. With so many ways to connect a hard drive (SCSI, Fibre Channel, ATA, SATA, SAS, SSA, and so on), it is difficult to predict whether S.M.A.R.T. reports will function correctly in a given system.
Even on hard drives and interfaces that support it, S.M.A.R.T. information may not be reported correctly to the computer's operating system. Some disk controllers can duplicate all write operations on a secondary "back-up" drive in real time. This feature is known as "RAID mirroring". However, many programs which are designed to analyze changes in drive behaviour and relay S.M.A.R.T. alerts to the operator do not function properly when a computer system is configured for RAID support. Generally this is because, under normal RAID operational conditions, the computer is not permitted by the RAID subsystem to 'see' (or directly access) individual physical drives, but may access only logical volumes instead.
On the Windows platform, many programs designed to monitor and report S.M.A.R.T. information will function only under an administrator account. At present, S.M.A.R.T. is implemented individually by manufacturers, and while some aspects are standardized for compatibility, others are not.
Each drive manufacturer defines a set of attributes, and sets threshold values beyond which attributes should not pass under normal operation. Each attribute has a raw value, whose meaning is entirely up to the drive manufacturer (but often corresponds to counts or a physical unit, such as degrees Celsius or seconds), and a normalized value, which ranges from 1 to 253 (with 1 representing the worst case and 253 representing the best). Depending on the manufacturer, a value of 100 or 200 will often be chosen as the "normal" value.
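The relationship between raw value, normalized value and threshold can be sketched in Python as follows. This is a simplified model, not a vendor implementation. The class and function names are hypothetical, and it assumes the usual convention that a normalized value at or below the threshold counts as "threshold exceeded". It also derives the overall status from every attribute, whereas real drives may consider only a subset.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SmartAttribute:
    attr_id: int
    name: str
    normalized: int  # 1 (worst) to 253 (best)
    threshold: int   # set by the manufacturer
    raw: int         # vendor-specific meaning (count, degrees Celsius, ...)

    def __post_init__(self) -> None:
        if not 1 <= self.normalized <= 253:
            raise ValueError(f"normalized value {self.normalized} outside 1..253")

    def threshold_exceeded(self) -> bool:
        # Assumption: a value at or below the threshold counts as exceeded.
        return self.normalized <= self.threshold


def smart_status(attributes: Iterable[SmartAttribute]) -> str:
    """Collapse all attributes into the two-valued SMART status."""
    if any(attr.threshold_exceeded() for attr in attributes):
        return "threshold exceeded"
    return "threshold not exceeded"


if __name__ == "__main__":
    # Hypothetical readings for illustration only.
    attrs = [
        SmartAttribute(5, "Reallocated Sectors Count", 100, 36, 0),
        SmartAttribute(10, "Spin Retry Count", 97, 97, 4),
    ]
    print(smart_status(attrs))
```

Because higher normalized values are better, a threshold of 0 can never be reached by a valid normalized value, so such an attribute is effectively informational in this model.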
The following chart lists some S.M.A.R.T. attributes and the typical meaning of their raw values. Normalized values are always mapped so that higher values are better (with only very rare exceptions such as the "Temperature" attribute on certain Seagate drives), but higher raw attribute values may be better or worse depending on the attribute and manufacturer. For example, the "Reallocated Sectors Count" attribute's normalized value decreases as the number of reallocated sectors increases. In this case, the attribute's raw value will often indicate the actual number of sectors that were reallocated, although vendors are in no way required to adhere to this convention. As manufacturers do not necessarily agree on precise attribute definitions and measurement units, the following list of attributes should be regarded as a general guide only.
|ID||Hex||Attribute name|| ||Description|
| ||Potential indicators of imminent electromechanical failure|
|01||01||Read Error Rate|| ||Indicates the rate of hardware read errors that occurred when reading data from a disk surface. A non-zero value indicates a problem with either the disk surface or read/write heads. Note that Seagate drives often report a raw value that is very high even on new drives, and does not thereby indicate a failure.|
|02||02||Throughput Performance|| ||Overall (general) throughput performance of a hard disk drive. If the value of this attribute is decreasing there is a high probability that there is a problem with the disk.|
|03||03||Spin-Up Time|| ||Average time of spindle spin-up (from zero RPM to fully operational), in milliseconds.|
|04||04||Start/Stop Count|| ||A tally of spindle start/stop cycles.|
|05||05||Reallocated Sectors Count|| ||Count of reallocated sectors. When the hard drive finds a read/write/verification error, it marks the sector as "reallocated" and transfers the data to a special reserved area (spare area). This process is also known as remapping, and "reallocated" sectors are called remaps. This is why, on modern hard disks, "bad blocks" cannot be found while testing the surface – all bad blocks are hidden in reallocated sectors. However, as the number of reallocated sectors increases, the read/write speed tends to decrease. The raw value normally represents a count of the bad sectors that have been found and remapped. Thus, the higher the raw value, the more sectors the drive has had to reallocate.|
|06||06||Read Channel Margin|| ||Margin of a channel while reading data. The function of this attribute is not specified.|
|07||07||Seek Error Rate|| ||Rate of seek errors of the magnetic heads. If there is a partial failure in the mechanical positioning system, then seek errors will arise. Such a failure may be due to numerous factors, such as damage to a servo, or thermal widening of the hard disk. More seek errors indicate a worsening condition of the disk's surface or the mechanical subsystem, or both. Note that Seagate drives often report a raw value that is very high, even on new drives, and this does not normally indicate a failure.|
|08||08||Seek Time Performance|| ||Average performance of seek operations of the magnetic heads. If this attribute is decreasing, it is a sign of problems in the mechanical subsystem.|
|09||09||Power-On Hours (POH)|| ||Count of hours in power-on state. The raw value of this attribute shows total count of hours (or minutes, or seconds, depending on manufacturer) in power-on state.|
|10||0A||Spin Retry Count|| ||Count of retried spin-start attempts. This attribute stores the total number of spin-start attempts needed to reach the fully operational speed (under the condition that the first attempt was unsuccessful). An increase of this attribute value is a sign of problems in the hard disk mechanical subsystem.|
|11||0B||Recalibration Retries|| ||This attribute indicates the number of times recalibration was requested (under the condition that the first attempt was unsuccessful). An increase of this attribute value is a sign of problems in the hard disk mechanical subsystem.|
|12||0C||Device Power Cycle Count|| ||This attribute indicates the count of full hard disk power on/off cycles.|
|13||0D||Soft Read Error Rate|| ||Uncorrected read errors reported to the operating system. If the value is non-zero, you should back up your data.|
|189||BD||High Fly Writes (WDC)|| ||Western Digital's Fly Height Monitor protects write operations by detecting when a recording head is flying outside its normal operating range. If an unsafe fly height condition is encountered, the write process is stopped, and the information is rewritten or reallocated to a safe region of the hard drive. According to Western Digital, this monitoring increases the reliability of write operations and reduces the probability of read errors. The feature was introduced with the WD Enterprise WDE18300 and WDE9180 Ultra2 SCSI hard drives and was to be included on all subsequent WD Enterprise products. (http://www.wdc.com/en/library/2579-850123.pdf)|
|190||BE||Airflow Temperature (WDC)|| ||Airflow temperature on Western Digital HDs (Same as temp. [C2], but current value is 50 less for some models. Marked as obsolete.)|
|190||BE||Temperature Difference from 100|| || Value is equal to (100 – temp. °C), allowing manufacturer to set a minimum threshold which corresponds to a maximum temperature. (Seagate only?)|
Seagate ST910021AS: Verified Present
Seagate ST9120823ASG: Verified Present under name "Airflow Temperature Cel" 2008-10-06
Seagate ST3802110A: Verified Present 2007-02-13
Seagate ST980825AS: Verified Present 2007-04-05
Seagate ST3320620AS: Verified Present 2007-04-23
Seagate ST3500641AS: Verified Present 2007-06-12
Seagate ST3250824AS: Verified Present 2007-08-07
Seagate ST31000340AS: Verified Present 2008-02-05
Seagate ST3160211AS: Verified Present 2008-06-12
Seagate ST3320620AS: Verified Present 2008-06-12
Seagate ST3400620AS: Verified Present 2008-06-12
Samsung HD501LJ: Verified Present under name "Airflow Temperature" 2008-03-02
Samsung HD753LJ: Verified Present under name "Airflow Temperature" 2008-07-15
|191||BF||G-sense error rate|| ||Frequency of errors resulting from impact loads.|
|192||C0||Power-off Retract Count|| ||Number of times the heads are unloaded from the media. Heads can be unloaded without actually powering off. On Fujitsu drives, this is the Emergency Retract Cycle count.|
|193||C1||Load/Unload Cycle|| ||Count of load/unload cycles into head landing zone position.|
|194||C2||Temperature|| ||Current internal temperature.|
|195||C3||Hardware ECC Recovered|| ||Time between ECC-corrected errors.|
|196||C4||Reallocation Event Count|| ||Count of remap operations. The raw value of this attribute shows the total number of attempts to transfer data from reallocated sectors to a spare area. Both successful & unsuccessful attempts are counted.|
|197||C5||Current Pending Sector Count|| ||Number of "unstable" sectors (waiting to be remapped). If an unstable sector is subsequently written or read successfully, this value is decreased and the sector is not remapped. Read errors on a sector will not remap it; the sector is remapped only on a failed write attempt. This can be problematic to test, because cached writes will not remap the sector; only direct I/O writes to the disk will.|
|198||C6||Uncorrectable Sector Count|| ||The total number of uncorrectable errors when reading/writing a sector. A rise in the value of this attribute indicates defects of the disk surface and/or problems in the mechanical subsystem.|
|199||C7||UltraDMA CRC Error Count|| ||The number of errors in data transfer via the interface cable as determined by ICRC (Interface Cyclic Redundancy Check).|
|200||C8||Write Error Rate / Multi-Zone Error Rate|| ||The total number of errors when writing a sector.|
|201||C9||Soft Read Error Rate|| ||Number of off-track errors. If non-zero, make a backup.|
|202||CA||Data Address Mark errors|| ||Number of Data Address Mark errors (or vendor-specific).|
|203||CB||Run Out Cancel|| ||Number of ECC errors|
|204||CC||Soft ECC Correction|| ||Number of errors corrected by software ECC|
|205||CD||Thermal Asperity Rate (TAR)|| ||Number of thermal asperity errors.|
|206||CE||Flying Height||?||Height of heads above the disk surface.|
|207||CF||Spin High Current||?||Amount of high current used to spin up the drive.|
|208||D0||Spin Buzz||?||Number of buzz routines to spin up the drive|
|209||D1||Offline Seek Performance||?||Drive’s seek performance during offline operations|
|211||D3||Vibration During Write||?||Vibration During Write|
|212||D4||Shock During Write||?||Shock During Write|
|220||DC||Disk Shift|| ||Distance the disk has shifted relative to the spindle (usually due to shock). Unit of measure is unknown.|
|221||DD||G-Sense Error Rate|| ||The number of errors resulting from externally-induced shock & vibration.|
|222||DE||Loaded Hours||?||Time spent operating under data load (movement of magnetic head armature)|
|223||DF||Load/Unload Retry Count||?||Number of times head changes position.|
|224||E0||Load Friction|| ||Resistance caused by friction in mechanical parts while operating.|
|225||E1||Load/Unload Cycle Count|| ||Total number of load cycles|
|226||E2||Load 'In'-time||?||Total time of loading on the magnetic heads actuator (time not spent in parking area).|
|227||E3||Torque Amplification Count|| ||Number of attempts to compensate for platter speed variations|
|228||E4||Power-Off Retract Cycle|| ||The number of times the magnetic armature was retracted automatically as a result of cutting power.|
|230||E6||GMR Head Amplitude||?||Amplitude of "thrashing" (distance of repetitive forward/reverse head motion)|
|231||E7||Temperature|| ||Drive Temperature|
|240||F0||Head Flying Hours||?||Time while head is positioning|
|250||FA||Read Error Retry Rate|| ||Number of errors while reading from a disk|
|254||FE||Free Fall Protection|| ||Number of "Free Fall Events" detected|
The predicted date of a Threshold Exceeds Condition (TEC) is based on the "speed of attribute change" factor: the number of points by which the value decreases or increases each month. This factor is recalculated automatically, for each attribute individually, whenever a S.M.A.R.T. attribute changes. Note that TEC dates are not guarantees; a hard drive may last much longer or fail much sooner than the date given by a TEC.
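The idea of extrapolating from the speed of attribute change can be sketched as a simple linear projection. The Python function below is a hypothetical illustration, not the algorithm of any particular monitoring tool. It assumes two dated readings of one normalized value, measures the rate per day rather than per month for simplicity, and projects forward to the day the value would reach the threshold.

```python
from datetime import date, timedelta
from typing import Optional


def estimate_tec_date(first_date: date, first_value: int,
                      last_date: date, last_value: int,
                      threshold: int) -> Optional[date]:
    """Linearly extrapolate when a normalized value will reach its threshold.

    Returns None if the value is stable or improving, since no date can
    then be projected.
    """
    days = (last_date - first_date).days
    if days <= 0:
        raise ValueError("last_date must be later than first_date")
    if last_value <= threshold:
        return last_date  # threshold already reached
    drop = first_value - last_value  # points lost over the interval
    if drop <= 0:
        return None
    remaining = last_value - threshold
    # Ceiling division in integers: days until the remaining margin is used up.
    days_left = -(-(remaining * days) // drop)
    return last_date + timedelta(days=days_left)


if __name__ == "__main__":
    # Hypothetical readings: 6 points lost in 60 days, threshold at 36.
    print(estimate_tec_date(date(2024, 1, 1), 100, date(2024, 3, 1), 94, 36))
```

A projection from only two readings is sensitive to noise, which is one reason such dates should be read as rough estimates.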
|smartmontools||libatasmart||HDAT2||DriveSitter||HDD Health||Active Smart||SpeedFan||SMARTReporter||HDTune||Norton System Doctor||SMART Utility||DiskCheckup||Hard Disk Sentinel||DiskChecker|
|Operating System||Windows (native or Cygwin), Darwin (Mac OS X)||Linux||DOS||Windows||Windows||Windows||Windows||Mac OS X||Windows||Windows||Mac OS X||Windows||Windows, DOS, Linux||Windows, .NET Framework 3.5|
|Price||Open Source||Open Source||Freeware||29,69 $||Freeware||18,46 €||Freeware||Open Source||Freeware||proprietary||20.00 $||Freeware for personal use, 15.00 $ otherwise||Windows: from 18,00 €, DOS/Linux: Freeware||19,00 $|
|Trial version can be used||-||-||-||30 days||-||21 days||-||-||-||-||30 days/5 launches||-||-||30 days|
|professionals||professionals, programmers||professionals||advanced||beginners to advanced||beginners to advanced||beginners to advanced||beginners||beginners to advanced||beginners||beginners to advanced||beginners to advanced||beginners to advanced||beginners|
optional daemon or service
| Command line,|
|text menu||graphical||graphical||graphical||graphical||graphical||graphical||graphical||graphical||graphical||Windows: graphical, DOS/Linux: commandline||graphical|
|(S)ATA, SCSI, SAT (incl. some USB)||(S)ATA, SAT (incl. some USB)||(S)ATA||(S)ATA||(S)ATA||(S)ATA, SCSI, SAT (incl. some USB)||(S)ATA, SCSI||(S)ATA||(S)ATA||(S)ATA, SCSI, SAT (incl. some USB)||(S)ATA, SCSI, SAT (incl. some USB)||(S)ATA, SCSI||(S)ATA, SCSI, SAT (incl. some USB)||(S)ATA|
|Reads hard discs on RAID controllers||3ware (Linux, FreeBSD, Windows), Compaq/HP (Linux, FreeBSD), and HighPoint and Areca (only Linux)||-||yes||-||-||-||Areca and Software RAID||-||-||?||3ware, OS X Software RAID||Software RAID only||Software RAID only|
|Shows error log||yes (also scheduled)||yes||yes||yes||yes||no||no||no||no||no||announced||yes||yes||yes|
|Prediction of failure||choosable parameter changes, threshold, temperature||-||-||choosable parameter changes, threshold, temperature||every parameter change, temperature||threshold, temperature||choosable parameter changes, threshold, temperature||threshold||-||threshold (for every single medium)||announced||temperature||temperature, parameter changes, threshold, new problem found, low disk space||at startup, low health|
|Notification by||window (only Windows), e-mail, system log, run a certain command||-||-||window, sound, e-mail, network message, system log, run a certain command||window, sound, e-mail, network message, system log||window, sound, e-mail, network message||window, sound, e-mail, run a certain command||window, e-mail, run a certain command||-||taskbar symbol, sound, administrative message||announced||window, e-mail||taskbar, window, e-mail, network message, sound alert (optionally repeating), run a certain command, run automatic backup projects, shutdown/hibernate||window, task bar notification|
|Developer||smartmontools||Lennart Poettering||Lubomir Cabla||Oliver Marr||PANTERASoft||Ariolic ATA / SCSI / USB||Alfredo Milani Comparetti||Julian Mayer||EFD Software||Symantec||Volitans Software||Passmark Software||H.D.S. Hungary||Bapuli Online|
|Notes||Possibility of AAM and further parameter, surface testing||API for other programs||highly scalable, can be set to activate hibernation mode on critical temperature||can be set to activate hibernation mode on critical temperature||offers online drive analysis , monitors PC temperatures||benchmarks and surface testing||individual configuration for every medium, Interface for Disc Doktor/chkdsk: surface testing, complete testing at restart||based on smartmontools||acoustic management, highly scalable scheduled and automatic projects upon failure, offers scheduled and manual disk testing, performance and logical disk information|