Burn-in HDDs & Servers
When customer satisfaction is the key objective, you need to go the extra mile and make sure nothing breaks early due to bad hardware or bad software. Most HDD failures, if they happen at all, happen early; failures due to old age are easy to handle by monitoring.
Setting up large storage arrays is not complicated, but it takes some time to do it the right way. To avoid the hassle of drives dying early on, it's good practice to burn them in. Why? Most HDD failures happen early, if they happen at all (luckily not that many do these days). According to IDC, stored data will increase 17.8% by 2024, with HDD as the main storage technology.
The benefits
It is a known fact that the failure rate for HDDs is higher in the beginning and trails off rapidly. If you buy refurbished or used disks, you want to know their condition.
This way, buying cheap SATA drives instead of SAS drives is a little safer and makes sense for storage arrays and archiving servers running ZFS.
Failing due to wear and tear or old age is another matter, and that is easy to handle by monitoring.
Burning in servers adds value for the end customer and is the professional thing to do.
Backup vs. Burning in HDDs
You have probably heard "Don't bother with burn-in, just have good backups." And boy, is that the BS of the year.
I did spend 16 hours restoring a broken system at my friend's company from backups (you should always have backups).
There were 12 drives in a RAIDZ2 config (like a RAID-6 on steroids) and 3 drives died over the course of 24 hours on a box that had been running for approximately 35 days.
Extra work, because the system was running with drives that hadn't been burned in. When I asked why, the supplier's answers were: "it's cheaper that way", "I gave you 2 extra drives just in case", and "you could file a warranty claim with the manufacturer and get them fixed".
This supplier lost a customer and that is really expensive.
Basically, ZFS does not trust the disk's own good/bad reporting the way RAID systems do, but if you get a bad batch, neither will help you find the bad drives; only burning them in has a chance of catching them. Most software bugs are on SSDs today, but HDDs have some issues too.
Other tools
To check that a disk's specifications match what the label says, use hdparm. For SCSI info use lsscsi, and with smartmontools you get a lot of information. Here I use nala; if you do not have it, install it with

apt update && apt install nala -y

or replace nala install in the examples with apt update && apt install.
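As a quick illustration, these commands pull the label information, the SCSI listing and the SMART data for one drive (the device name /dev/sda is just an example; substitute your own):

# Identify the drive: model, serial number, capacity and supported features
sudo hdparm -I /dev/sda

# List all SCSI/SATA devices with their block device names
lsscsi

# Dump all available SMART information and attributes for the drive
sudo smartctl -x /dev/sda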
Customer satisfaction
Customer satisfaction is the key objective; you need to go the extra mile and make sure nothing breaks early due to bad hardware or bad software. If you run systems on a global scale, and you set them up in regional centers and ship them a thousand miles or tens of thousands of miles, you want them to work.
Many decades ago, one of the major hard disk suppliers had bugs in their disk firmware and the disks died like flies in the field; the situation quickly became catastrophic. We started testing all disks before sending them out to the field. Fixing a disk at the shop cost about 10 bucks, but fixing it in the field cost anything from 100 to 1,000 depending on the location and the system involved.
When dealing with sensitive customers in finance and government, quality is mandatory.
The investment
Basically, the investment is time: the server stays in the shop a week or two extra and uses some electricity. Not many man-hours are needed, and a junior team member or a trainee can do the job.
A tested and wiped disk is always worth more than just a random disk.
A server with documented testing has more value.
How to do the burn-in?
The traditional way is to test a single drive at a time, one after another. That was fine 30 years ago with disks of 20-40 GB, but not today: it is very, very slow with the huge disk sizes we use now.
The single drive methods
You could use the dd command to write patterns to the disk, but it takes a long time to do all the disks, since you run one disk after another.
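A rough sketch of that single-drive approach (the device name /dev/sdX is a placeholder; this wipes the drive, so triple-check it):

# Write zeros over the entire drive, then read the whole drive back
sudo dd if=/dev/zero of=/dev/sdX bs=1M status=progress
sudo dd if=/dev/sdX of=/dev/null bs=1M status=progress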
Using the SMART self-tests is not really that good: the short self-test gives you no real assurance at all, and the extended self-test has its issues too; it is basically a last-century tool from 1999. It's quick, not good. The criteria for the short self-test are that it has one or more segments and completes in two minutes or less. The criteria for the extended self-test are that it has one or more segments and that the completion time is vendor-specific. Any tests performed in the segments are vendor-specific.
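If you want to run them anyway, a minimal sketch with smartmontools looks like this (the device name is an example):

# Start the short self-test (completes in roughly two minutes)
sudo smartctl -t short /dev/sdX

# Start the extended self-test (completion time is vendor-specific)
sudo smartctl -t long /dev/sdX

# Review the self-test log once the tests have finished
sudo smartctl -l selftest /dev/sdX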
A better solution is to run badblocks. Executing

badblocks -c 2048 -sw /dev/sdg

writes to every block 4 times, each time with a different pattern (0xaa, 0x55, 0xff, 0x00). This test proves that every block can be written to and read back. By writing 10101010, 01010101, 11111111 and 00000000 you test the disk with good patterns, and you end up with a totally wiped disk. By setting up email notification, you will receive an email after the test is finished.
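A hedged sketch of a single-drive run that keeps a log and mails it to you when the test finishes (the device name, log path and address are examples, and it assumes a working mailx):

# Run the four-pattern write test, keep a log, and mail it when done
sudo badblocks -c 2048 -swv /dev/sdg 2>&1 | tee /burnin/sdg.log
mailx -s "badblocks finished: /dev/sdg" admin@example.com < /burnin/sdg.log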
The bulk testing
I recommend using the bht script: it's fast (it runs the disks in parallel), it creates a report, and it uses email to inform you that the test is done.
If you run the test on a new server, you can test that hardware at the same time. With modern large disks, the test will run for a week. At the end of the test, all disks are written full of zeros — an extra benefit.
Benefit of bulk testing
Testing drives sequentially, you may spend 72 days; with bulk testing, about a week. (A dozen large drives at roughly six days each adds up to 72 days, while run in parallel they all finish in roughly the time of the slowest drive.)
Setting up the Customer's Hardware
Install all drives and document the placement of each one. Install the UPS, and test it too. Install the network cards and all other cards as well.
Set up a boot disk with your favorite Linux distro; after the test you can set up the real system on the now totally empty disks.
Run the test, document everything, and give the full documentation to the customer during or after the installation.
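One simple way to capture that documentation is to record every drive's model and serial number, which you can then match against the physical slots (the output file name is just an example):

# Record name, model, serial and size of every disk for the placement documentation
lsblk -d -o NAME,MODEL,SERIAL,SIZE > drive-inventory.txt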
The in-shop solution — a dedicated rig
Use a dedicated rig that is known to work and can take 12, 24 or 48 disks at a time, and test them all by reading and writing every disk. With a documented burn-in, it is safer to sell the disks or use them in a build. And since disks do die, it's good to have tested spares.
A fully tested system is better than a system with tested drives.
Setting up bulk hard drive burn-in testing
bht is a script that helps with bulk HDD testing using badblocks. When you need to fully test 24 or 48 or whatever number of hard drives at the same time with badblocks, bht makes this easy by launching multiple instances of badblocks in the background.
You can periodically check on the status of all running instances of badblocks. You can give an email address so that when an instance of badblocks is done, you can receive an email notice with the results.
Please note that this relies on the mailx command to work, so your local mail relaying capability must be working.
Set up a Live Linux system on your server and add these packages:
sudo apt-get install git smartmontools lsscsi mailutils ksh lvm2
Download the bht script
The script can be found at https://github.com/ezonakiusagi/bht.git. There are also examples and a how-to-use section in the README.
Set up bht
Set up a directory for your reports on the boot drive and download the script.
mkdir /burnin/
cd /burnin/
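The download step itself looks roughly like this (a sketch that assumes the script file is named bht at the top of the repository; check the README if the layout differs):

# Clone the repository and place the script at /burnin/bht
git clone https://github.com/ezonakiusagi/bht.git /burnin/bht-repo
cp /burnin/bht-repo/bht /burnin/bht
chmod +x /burnin/bht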
Set up a symlink in your bin directory (bht -> /burnin/bht); otherwise you need to use the full path.
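For example (assuming /usr/local/bin is in your PATH):

# Make bht callable without the full path
ln -s /burnin/bht /usr/local/bin/bht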
You must run bht as root or with root privileges (sudo/doas).
Performing the burn-in test
Make sure bht is in your path, or you will need to call it with its full path. The block size and block count are high numbers, but they should be: a block size of 32768 bytes and 512 blocks tested at a time. Don't use the badblocks defaults of 1024/64.
The test will check whether any of the drives are part of a ZFS pool, warn you, and ask for permission before wiping the drives. Below, we run a Live Linux from a USB drive and test 12 drives.
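bht handles the parallel launching for you, but as a rough idea of what that amounts to, here is a plain-shell sketch: one write-mode badblocks per drive in the background, with the 32768/512 values and one email per drive (the device list, log directory and address are examples; this is not the bht script itself):

# One badblocks instance per drive, run in parallel, each mailing its log when done
for dev in /dev/sdb /dev/sdc /dev/sdd; do
  name=$(basename "$dev")
  ( badblocks -b 32768 -c 512 -swv "$dev" > "/burnin/$name.log" 2>&1
    mailx -s "badblocks finished: $dev" admin@example.com < "/burnin/$name.log" ) &
done
wait   # returns when every drive has finished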
The test will take about a week for 8-12 TB SATA drives.
You should use a UPS for the server you test.
If you provide an email address with the --email option, you will receive an email notice with the results for each drive when its instance of badblocks is done. Please note that this relies on the mailx command, so your local mail relaying capability must be working.
Because the testing is going to take some time, you can walk away and do other things. You can check the status of the test with the -s or --status flag.
When the test is done, you can see the results with bht -s. Each disk is reported separately. You will notice that not all drives finish at the same time; it is typical for some SATA drives to be slower than others (even 12 hours slower). If there were any errors, they are tallied, and the test will also take longer for that drive.
References
[1] bht: see the GitHub page (https://github.com/ezonakiusagi/bht.git) for information on how to set up and run the utility.
[2] badblocks
[3] dd
[4] SMART: the T10 PDF on HDD self-testing (link to PDF).
[5] Failure Trends in a Large Disk Drive Population, Google Research (PDF).
[6] Using AutoML: Google Cloud and Seagate, "Transforming hard-disk drive maintenance with predictive ML" (see this page).