"The Google engineers just published a paper on Failure Trends in a Large Disk Drive Population. Based on a study of 100,000 disk drives over 5 years they find some interesting stuff. To quote from the abstract: 'Our analysis identifies several parameters from the drive's self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.'"
* Expensive 'enterprise' drives don't have notably better reliability than their 'consumer' counterparts (consider this conclusion in the context of my past recommendation of Western Digital 10,000 RPM Raptor SATA HDDs as a credible alternative to other manufacturers' much more costly SAS drives)
* S.M.A.R.T. error reporting encompasses only a fraction of the HDD failure mechanisms actually experienced in the field, and, more specifically to this writeup's theme,
* RAID 1 and 5 are less robust than they might appear at first glance...particularly when (as in my case...ahem) all of the drives in the RAID array come from the same manufacturer, and especially when they come from the same manufacturing lot. If one drive fails, the likelihood that a second drive will fail shortly thereafter is uncomfortably high, as the back-of-the-envelope sketch below illustrates.
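Some quick arithmetic makes the correlated-failure point concrete. The sketch below uses made-up parameters (a 5% per-drive annual failure rate, a 24-hour rebuild window, and a 10x hazard multiplier for same-lot drives; none of these numbers come from the Google paper) to compare the naive independent-failure estimate of a second loss during a RAID 5 rebuild against a correlated one:

```python
# Back-of-the-envelope comparison of RAID 5 double-failure risk during a
# rebuild, with and without correlated failures. All numbers below are
# illustrative assumptions, not figures from the paper.

ANNUAL_FAILURE_RATE = 0.05      # 5% AFR per drive (assumed)
REBUILD_HOURS = 24.0            # time the array runs degraded (assumed)
SURVIVING_DRIVES = 5            # drives left in a 6-drive RAID 5 set
CORRELATION_MULTIPLIER = 10.0   # how much a sibling drive's hazard rises
                                # once the first drive has failed (assumed)

hourly_rate = ANNUAL_FAILURE_RATE / (365 * 24)

# Independent model: each survivor fails during the rebuild window with
# probability ~ rate * time (small-probability approximation).
p_independent = 1 - (1 - hourly_rate * REBUILD_HOURS) ** SURVIVING_DRIVES

# Correlated model: same-lot drives share wear and defects, so the hazard
# during the degraded window is scaled up.
p_correlated = 1 - (1 - hourly_rate * CORRELATION_MULTIPLIER *
                    REBUILD_HOURS) ** SURVIVING_DRIVES

print(f"P(second failure during rebuild), independent: {p_independent:.5f}")
print(f"P(second failure during rebuild), correlated:  {p_correlated:.5f}")
```

Even with these modest assumed numbers, the correlated estimate comes out an order of magnitude worse, which is the intuition behind mixing manufacturers and manufacturing lots within a single array.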
All of which undercuts several pieces of long-standing conventional wisdom about disks, namely that:
* Costly FC and SCSI drives are more reliable than cheap SATA drives.
* RAID 5 is safe because the odds of two drives failing in the same RAID set are so low.
* After infant mortality, drives are highly reliable until they reach the end of their useful life.
* Vendor MTBF figures are a useful yardstick for comparing drives (the quick arithmetic below suggests how misleading they can be).
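On the MTBF point, the gap between the datasheet and the field is easy to see with a couple of lines of arithmetic. The "observed" 3% rate in this sketch is purely an illustrative assumption, not a figure quoted from the paper; the conversion itself uses the standard constant-hazard (exponential lifetime) assumption.

```python
# Convert a vendor MTBF spec into the annualized failure rate it implies,
# using AFR = 1 - exp(-hours_per_year / MTBF) under a constant-hazard model.
import math

HOURS_PER_YEAR = 24 * 365

def implied_afr(mtbf_hours: float) -> float:
    """Annualized failure rate implied by a constant-hazard MTBF claim."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

for mtbf in (1_000_000, 1_500_000):
    print(f"MTBF {mtbf:>9,} h  ->  implied AFR {implied_afr(mtbf):.2%}")

# Compare against a field replacement rate of, say, 3% per year (assumed
# here purely for illustration): the datasheet number undersells real-world
# failures by a factor of several.
observed = 0.03
print(f"assumed observed annual replacement rate: {observed:.2%}")
```

A million-hour MTBF works out to an implied AFR of well under 1%, which is why datasheet figures and field replacement rates are hard to reconcile.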