My ZFS pool has a failing disk
This happens from time to time, especially on older consumer-grade hardware such as SATA disks, but enterprise-grade hardware breaks too. Never think you don't need backups: you do! ZFS makes recovery easier, and Proxmox itself is easy to rebuild, but your VMs and CTs are not, and your data is unique.
RAID is not a Backup!
Are the pools OK?
Run zpool status to check all of your pools and their disks. If a pool is DEGRADED and one of its disks is UNAVAIL, you need to replace that disk NOW. If only the error counts are going up, the disk still works, but plan to replace it soon.
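If you want to run this check from a script or cron job, here is a minimal sketch. It assumes that zpool status -x prints the single line "all pools are healthy" when nothing is wrong, which is what OpenZFS does. The function name pools_healthy is made up for this example.

```shell
# pools_healthy: read `zpool status -x` output on stdin.
# Returns 0 if ZFS reports that every pool is healthy, 1 otherwise.
pools_healthy() {
    grep -qx 'all pools are healthy'
}

# Typical use (needs ZFS installed, so it is left commented out here):
#   zpool status -x | pools_healthy || echo "A pool needs attention"
```

Anything other than the healthy message counts as "go and look", so the function stays simple and errs on the side of warning you.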
Replace a bad disk in a degraded pool
It's very easy to do: 1) find the bad disk, 2) physically swap it, 3) add the new disk to the pool, and then just wait for ZFS to do its magic.
1 Find the bad disk
Run zpool status -x, find the failed drive, and copy its number. This is the ID that ZFS shows in place of the device name when the disk is missing.
Example: 4701626368083175923
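Picking the failed drive out of the status table can also be done with a small filter. This is only a sketch: it assumes the usual zpool status layout where the second column is the device state, and the function name failed_devices is made up for this example.

```shell
# failed_devices: read `zpool status` output on stdin and print the name
# (or numeric ID) of every entry whose STATE column is UNAVAIL or FAULTED.
failed_devices() {
    awk '$2 == "UNAVAIL" || $2 == "FAULTED" { print $1 }'
}

# Typical use (needs ZFS installed, so it is left commented out here):
#   zpool status -x | failed_devices
```

Always compare the result with the full zpool status output before acting on it.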
2 Swap in a new disk
Physically remove the old disk and replace it with a new one.
Run ls -1 /dev/disk/by-id/ | grep ata, find your new disk, and copy its ID.
Example: ata-ST3500820AS_9QM0BA51
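Before wiping anything, it helps to confirm which kernel device (for example /dev/sdc) the by-id entry actually points to. A minimal sketch, assuming GNU readlink with the -f option is available; the function name resolve_disk is made up for this example.

```shell
# resolve_disk: print the kernel device node that a /dev/disk/by-id/
# symlink points to. Returns 1 if the given path does not exist.
resolve_disk() {
    [ -e "$1" ] || { echo "no such device: $1" >&2; return 1; }
    readlink -f "$1"
}

# Typical use (left commented out, the disk ID is the example from above):
#   resolve_disk /dev/disk/by-id/ata-ST3500820AS_9QM0BA51
```

If the result is not the disk you expected, stop and check the cabling or bay before you continue.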
Wipe the disk if it has been used before. Let's assume it's /dev/sdc. Below are two ways to wipe it. Double-check the device name first, because both destroy what is on the disk.
Option 1:
wipefs -a /dev/sdc
Option 2:
fdisk /dev/sdc
- then create a new GPT partition table by pressing
g
- then write the table to the disk and exit by pressing
w
3 Add the new disk to the pool
With the new disk installed in the server in place of the old one, run the command below. The first line is the general form, the second is an example.
sudo zpool replace -f <pool> <old failed> /dev/disk/by-id/<new replacement>
sudo zpool replace -f lake 4701626368083175923 /dev/disk/by-id/ata-ST3500820AS_9QM0BA51
Now you have a long, often very long, wait while the pool resilvers. You can check the progress with zpool status -x.
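To pull just the progress figure out of the status output, something like the sketch below can help. It assumes the status text contains a figure followed by "% done", as current OpenZFS prints during a resilver; the wording may differ between versions. The function name resilver_progress is made up for this example.

```shell
# resilver_progress: read `zpool status` output on stdin and print the
# first completion percentage found, for example "24.41%".
# Prints nothing and returns 1 if no "% done" figure is present.
resilver_progress() {
    grep -o '[0-9][0-9.]*% done' | sed 's/ done$//' | head -n 1 | grep .
}

# Typical use (needs ZFS installed, so it is left commented out here):
#   zpool status lake | resilver_progress
```

A scrub reports its progress with the same wording, so the function will pick that up too.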
Clearing Storage Pool Device Errors
If a device is taken offline due to a failure that causes errors to be listed in the zpool status output, you can clear the error counts with the zpool clear command. If a device within a pool loses connectivity and the connectivity is then restored, you will need to clear these errors as well.
zpool clear tank
If one or more devices are specified, this command only clears errors associated with the specified devices. For example:
zpool clear tank <Disk ID>
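Before clearing anything, it is worth noting which devices actually have non-zero counters. A minimal sketch, assuming the usual zpool status table with the columns NAME, STATE, READ, WRITE and CKSUM; the function name devices_with_errors is made up for this example.

```shell
# devices_with_errors: read `zpool status` output on stdin and print every
# table entry (pool, vdev or disk) whose READ, WRITE or CKSUM counter is
# not 0. Counters are compared as text, so values like "1.2K" also count.
devices_with_errors() {
    awk 'NF >= 5 &&
         $2 ~ /^(ONLINE|DEGRADED|FAULTED|UNAVAIL|OFFLINE|REMOVED)$/ &&
         ($3 != "0" || $4 != "0" || $5 != "0") { print $1 }'
}

# Typical use (needs ZFS installed, so it is left commented out here):
#   zpool status tank | devices_with_errors
```

Write the names down first: once zpool clear has run, the counters are back at zero and the history is gone.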
Reading S.M.A.R.T. data of the disk
Smartmontools works with most SATA drives, some SSDs, and some newer SAS controllers and disks. What the disk reports is only what the manufacturer programmed it to report. ZFS does not rely on this, which is why it needs to talk directly to the disks rather than through a RAID controller that believes whatever the disk tells it. This is one reason ZFS is better than RAID vis-à-vis bit rot.
To read the data, run:
smartctl -a /dev/sdX
If the output shows SMART support as Disabled and you want it enabled, run:
smartctl -s on /dev/sdX
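Checking whether SMART is switched on can be scripted as well. This sketch assumes the line "SMART support is: Enabled" that smartctl prints for ATA/SATA disks; SAS and NVMe devices word it differently. The function name smart_enabled is made up for this example.

```shell
# smart_enabled: read `smartctl -i` (or `smartctl -a`) output on stdin.
# Returns 0 if the disk reports SMART support as Enabled, 1 otherwise.
smart_enabled() {
    grep -q '^SMART support is:[[:space:]]*Enabled'
}

# Typical use (needs smartmontools and root, so it is commented out here):
#   smartctl -i /dev/sdX | smart_enabled || smartctl -s on /dev/sdX
```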
References
OpenZFS [1] ZFS [2] S.M.A.R.T [3] smartmontools [4]