Today, I accidentally dd’ed a disk that was part of an active ZFS pool on my test server. I zeroed the beginning and the end of the disk, which is exactly where ZFS keeps its vdev labels. Technically I didn’t lose any data because the pool is RAIDZ, but once I rebooted the machine, ZFS complained:
```
#This is what I did:
sudo dd if=/dev/zero of=/dev/sda bs=512 count=10
sudo dd if=/dev/zero of=/dev/sda bs=512 seek=$(( $(blockdev --getsz /dev/sda) - 4096 )) count=1M
```
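Those two dd runs land right on top of ZFS’s vdev labels: ZFS keeps four copies of the label, two at the front of the partition and two at the end, and the wipe also takes out the GPT. Before deciding how bad the damage is, you can ask zdb which labels are still readable. A minimal sketch, assuming /dev/sdd1 is an intact pool member and /dev/sdf is the wiped disk:

```
# Compare an intact member with the wiped disk (device names are assumptions).
sudo zdb -l /dev/sdd1   # healthy member: prints the vdev label (pool name, guid, ...)
sudo zdb -l /dev/sdf    # wiped disk: expect "failed to unpack label" for each label copy
```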
```
sudo zpool status
  pool: storage
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: resilvered 2.40T in 1 days 00:16:34 with 0 errors on Fri Nov 13 20:05:53 2020
config:

        NAME                                  STATE     READ WRITE CKSUM
        storage                               DEGRADED     0     0     0
          raidz1-0                            DEGRADED     0     0     0
            ata-ST4000DM000-1F2168_S30076XX   ONLINE       0     0     0
            ata-ST4000DX001-1CE168_Z3019CXX   ONLINE       0     0     0
            ata-ST4000DM000-2AE166_WDH0S9YY   ONLINE       0     0     0
            ata-ST4000DM000-2AE166_WDH0SXZZ   ONLINE       0     0     0
            ata-ST4000DM000-2AE166_WDH0SXDD   ONLINE       0     0     0
            412403026512446213                UNAVAIL      0     0     0  was /dev/disk/by-id/ata-ST4000NM0033-9ZM170_Z1Z3RR74-part1
```
So I checked the problematic device, and the problem was obvious:
```
ls /dev/disk/by-id/

#This is a normal disk:
lrwxrwxrwx 1 root root 10 Nov 13 20:58 ata-ST4000DX001-1CE168_Z3019CXX-part1 -> ../../sdd1
lrwxrwxrwx 1 root root 10 Nov 13 20:58 ata-ST4000DX001-1CE168_Z3019CXX-part9 -> ../../sdd9

#This is the problematic disk; part1 and part9 are missing:
lrwxrwxrwx 1 root root  9 Nov 13 20:58 ata-ST4000NM0033-9ZM170_Z1Z3RR74 -> ../../sdf
```
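The part1/part9 symlinks vanished because the wipe also destroyed the partition table (the primary GPT sits at the start of the disk and its backup at the end), so the kernel no longer sees any partitions on that drive. A quick way to confirm, assuming the wiped disk is /dev/sdf:

```
# Device name is an assumption; adjust to your disk.
lsblk /dev/sdf          # the disk shows up with no child partitions
sudo gdisk -l /dev/sdf  # the partition table scan should report that no GPT is present
```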
Fortunately, this problem is easy to fix. All you need to do is take the device offline and bring it back online.
```
#First, offline the problematic device:
sudo zpool offline storage 412403026512446213
```
```
sudo zpool status
  pool: storage
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: resilvered 2.40T in 1 days 00:16:34 with 0 errors on Fri Nov 13 20:05:53 2020
config:

        NAME                                  STATE     READ WRITE CKSUM
        storage                               DEGRADED     0     0     0
          raidz1-0                            DEGRADED     0     0     0
            ata-ST4000DM000-1F2168_S30076XX   ONLINE       0     0     0
            ata-ST4000DX001-1CE168_Z3019CXX   ONLINE       0     0     0
            ata-ST4000DM000-2AE166_WDH0S9YY   ONLINE       0     0     0
            ata-ST4000DM000-2AE166_WDH0SXZZ   ONLINE       0     0     0
            ata-ST4000DM000-2AE166_WDH0SXDD   ONLINE       0     0     0
            412403026512446213                OFFLINE      0     0     0
```
```
#Then bring the device back online:
sudo zpool online storage ata-ST4000NM0033-9ZM170_Z1Z3RR74

#Resilver it:
sudo zpool scrub storage

sudo zpool status
  pool: storage
 state: ONLINE
  scan: resilvered 36K in 0 days 00:00:01 with 0 errors on Fri Nov 13 21:03:01 2020
config:

        NAME                                  STATE     READ WRITE CKSUM
        storage                               ONLINE       0     0     0
          raidz1-0                            ONLINE       0     0     0
            ata-ST4000DM000-1F2168_S30076XX   ONLINE       0     0     0
            ata-ST4000DX001-1CE168_Z3019CXX   ONLINE       0     0     0
            ata-ST4000DM000-2AE166_WDH0S9YY   ONLINE       0     0     0
            ata-ST4000DM000-2AE166_WDH0SXZZ   ONLINE       0     0     0
            ata-ST4000DM000-2AE166_WDH0SXDD   ONLINE       0     0     0
            ata-ST4000NM0033-9ZM170_Z1Z3RR74  ONLINE       0     0     0

errors: No known data errors
```
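In my case only a few kilobytes of metadata had to be resilvered, but on a pool with real damage this can take hours, so it is worth watching it finish. A small sketch, reusing the pool name storage from above:

```
# Keep an eye on the resilver/scrub until it completes.
watch -n 10 zpool status storage

# Once everything is ONLINE, reset any error counters accumulated during the incident.
sudo zpool clear storage
```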
That’s it.
Hi, Derrick! I’m impressed by your analysis in ZFS Performance and here, the repair of damaged devices. Since you have long, successful experience running ZFS for customers, I’m very interested in when and why you decide to replace an old disk in a ZFS pool. It’s also important for home users of ZFS. Any suggestions?
There are many situations, but most of the time it is related to hardware. For example, when a disk reaches the end of its life, it becomes less reliable, and it may be time to replace it. You can monitor this by running `zpool status` and looking at the checksum (CKSUM) column; if the numbers are high, you can try to fix it by resilvering the pool first. Also check the output of `dmesg` and `smartctl -a /dev/sdX`. If you notice hardware-related errors, it is probably time to swap the disk out.
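To make those checks routine instead of ad hoc, the same commands can be wrapped in a small script and run from cron. A rough sketch; the pool name storage and the /dev/sd[a-f] device range are assumptions for illustration:

```
#!/bin/sh
# Rough ZFS/disk health check (pool name and device range are assumptions).
zpool status -x storage                  # prints "pool 'storage' is healthy" or the details
for d in /dev/sd[a-f]; do
    smartctl -H "$d"                     # overall SMART health verdict per disk
done
dmesg | grep -iE 'zio|i/o error' | tail -n 20   # recent kernel-level disk errors, if any
```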
@Derrick Thanks for replying!
Here are some reports from my disk:
First, with `zpool status`, there are 0 cksum errors.
Second, with `dmesg | grep pool | grep error`, I got:
```
[ 1213.664678] zio pool=nas vdev=/dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0XASCEL-part1 error=5 type=1 offset=791195897856 size=1011712 flags=40080ca8
[ 1213.664712] zio pool=nas vdev=/dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0XASCEL-part1 error=5 type=1 offset=2463362502656 size=4096 flags=180880
```
Third, when I ran `smartctl -a /dev/sdX` (and the like), I got information that I uploaded to GitHub: https://github.com/huangwb8/test_file/blob/master/zfs/smart_result.txt
Actually, OpenMediaVault has been reminding me for a long time that the device has a few bad sectors.
Any suggestions? Shall I replace the disk?
It looks like your hard drive has a firmware issue. Personally, if your data is important, I wouldn’t even try to update the firmware; I would just replace the drive. Please see here for details: https://github.com/openzfs/zfs/issues/10214
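For completeness, once a replacement disk is installed, swapping out the failing member is a single zpool replace followed by a resilver. A minimal sketch, using the pool name nas and the failing drive’s by-id name from the dmesg output above; the new disk’s by-id name is a placeholder:

```
# Old device name is taken from the dmesg output above; the new one is a placeholder.
sudo zpool replace nas \
    ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0XASCEL \
    /dev/disk/by-id/ata-NEW_DISK_SERIAL
sudo zpool status nas   # watch the resilver; the old disk drops out when it finishes
```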