My ZFS pool has a failing disk

This happens from time to time, especially with older consumer-grade hardware like SATA disks, but enterprise-grade hardware will break too. Never think you don't need backups; you do! ZFS makes all this easier.

Photo by Denny Müller / Unsplash

Proxmox is easy to rebuild, but the VMs and containers are not, and your data is unique.

RAID is not a Backup!

Are the pools OK?

Run zpool status to check your pools and their disks. If a pool is DEGRADED and one of its disks is UNAVAIL, you need to replace that disk NOW. If only the error counts are going up, the disk still works, but plan to replace it soon.
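As a rough sketch of that triage, the helper below reads zpool status output on stdin and prints only the rows of the device table that are not ONLINE or that have non-zero READ, WRITE or CKSUM counters. The function name flag_unhealthy is made up for this post, and it assumes the usual NAME STATE READ WRITE CKSUM table layout, so check it against your own output first.

```shell
# Hypothetical helper: list pool members that need attention.
# Reads `zpool status` output on stdin, prints "name state read write cksum".
flag_unhealthy() {
    awk '
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }  # device table starts
        $1 == "errors:"               { in_config = 0 }        # device table ends
        in_config && NF >= 5 {
            bad_state  = ($2 != "ONLINE")
            has_errors = ($3 + $4 + $5 > 0)
            if (bad_state || has_errors) print $1, $2, $3, $4, $5
        }
    '
}

# Only query the live system when zpool is installed.
if command -v zpool >/dev/null 2>&1; then
    zpool status | flag_unhealthy
fi
```

No output means every device is ONLINE with zero error counters. Because it reads stdin, it also works on a saved copy of the output.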

Replace a bad disk in a degraded pool

It's very easy to do: 1) find the bad disk, 2) swap it physically, 3) add the new disk to the pool, and then just wait for ZFS to do its magic.

1 Find the bad disk

Run zpool status -x, find the failed drive and copy its number (the ID that ZFS shows in place of the missing device name).
Example: 4701626368083175923

2 Swap in a new disk

Physically remove the old disk and replace it with a new one.

Run ls -1 /dev/disk/by-id/ | grep ata, find your new disk and copy its ID.
Example: ata-ST3500820AS_9QM0BA51
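Before wiping anything, it helps to confirm which kernel device that ID points to. The entries under /dev/disk/by-id/ are symlinks, so a minimal sketch is to resolve the link; resolve_disk is a name made up for this post, and the ID is the example from above, so substitute your own.

```shell
# Hypothetical helper: print the real device node behind a by-id symlink.
resolve_disk() {
    readlink -f -- "$1"
}

new_disk=/dev/disk/by-id/ata-ST3500820AS_9QM0BA51   # example ID from this post
if [ -e "$new_disk" ]; then
    resolve_disk "$new_disk"    # prints something like /dev/sdc
fi
```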

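Wipe the disk if it has been used before. Let's assume it's /dev/sdc; double-check the device name first, because wiping is destructive. Below are 2 ways to wipe it.

Option A, erase all filesystem and partition-table signatures:

wipefs -a /dev/sdc

Option B, create a new partition table:

  1. fdisk /dev/sdc
  2. then create a new GPT partition table by hitting g
  3. then write the table to the disk and exit by hitting w.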

3 Add the new disk to the pool

With the new disk installed in the server in place of the old one, issue the command below. The first line shows the syntax, the second a filled-in example.

sudo zpool replace -f <pool> <old failed> /dev/disk/by-id/<new replacement>
sudo zpool replace -f lake 4701626368083175923 /dev/disk/by-id/ata-ST3500820AS_9QM0BA51

Now you have a long, and often very long, wait while the pool resilvers. You can check the progress with zpool status -x.
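If you don't want to keep re-running that by hand, here is a small sketch that polls once a minute and prints the percentage until the resilver is finished. It assumes the scan section contains the text "resilver in progress" and a "% done" figure, which can differ between ZFS versions; resilver_percent is a made-up name and lake is the example pool from above.

```shell
# Hypothetical helper: pull the "NN.NN%" figure out of `zpool status` output (stdin).
resilver_percent() {
    awk '/% done/ { for (i = 1; i <= NF; i++) if ($i ~ /%$/) print $i }'
}

pool=lake   # example pool name from this post
if command -v zpool >/dev/null 2>&1; then
    # Loop ends as soon as the status no longer reports a running resilver.
    while zpool status "$pool" 2>/dev/null | grep -q 'resilver in progress'; do
        zpool status "$pool" | resilver_percent
        sleep 60
    done
fi
```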

Clearing Storage Pool Device Errors

If a device is taken offline due to a failure that causes errors to be listed in the zpool status output, you can clear the error counts with the zpool clear command. If a device within a pool loses connectivity and the connectivity is then restored, you will need to clear these errors as well.

zpool clear tank

If one or more devices are specified, this command only clears errors associated with the specified devices. For example:

zpool clear tank <Disk ID>

Reading S.M.A.R.T. data of the disk

Smartmontools works with most SATA drives, some SSDs and some newer SAS controllers and disks. What the disk reports is only what the manufacturer programmed it to report. ZFS does not rely on this, which is why it needs to talk directly to the disks and not to a RAID controller that believes what the disk is telling it. This is one reason ZFS is better than RAID vis-à-vis bit rot.

To read the data, run smartctl -a /dev/sdX. If you see that SMART support is Disabled and you want it enabled, run smartctl -s on /dev/sdX.
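The full report is long, so here is a hedged sketch that boils it down to the overall verdict and a few counters that are commonly watched on a failing disk. The two function names are made up for this post, /dev/sdX is a placeholder, and the attribute names are the ones smartmontools typically shows for ATA disks; your drive may report different ones.

```shell
# Hypothetical helper: print the verdict from `smartctl -H` output (stdin).
smart_verdict() {
    awk -F': *' '/overall-health self-assessment test result|SMART Health Status/ { print $2 }'
}

# Hypothetical helper: print "attribute raw_value" from `smartctl -A` output (stdin).
smart_warning_counters() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ { print $2, $10 }'
}

disk=/dev/sdX   # placeholder, replace with your disk
if command -v smartctl >/dev/null 2>&1 && [ -b "$disk" ]; then
    smartctl -H "$disk" | smart_verdict
    smartctl -A "$disk" | smart_warning_counters
fi
```

A PASSED verdict with counters that keep rising is still a good reason to have a spare disk ready.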


References

OpenZFS [1] ZFS [2] S.M.A.R.T [3] smartmontools [4]


  1. OpenZFS home page

  2. Wikipedia article on ZFS

  3. Wikipedia article on Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.)

  4. smartmontools home page and wiki