If you have set up a ZFS system, you may want to stress test it before putting it into a production environment. There are many ways to stress test a system; the most common is to fill the entire pool using dd. However, I think scrubbing the entire pool is the best approach.
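For reference, the dd approach can be sketched as follows. This is a hedged sketch, not a complete benchmark: the mountpoint variable POOL_MNT and the file name fillfile are assumptions (it defaults to /tmp here so the sketch runs anywhere; point it at your pool's mountpoint, and drop count= to keep writing until the pool is full).

```shell
# Sketch of a dd-based fill test. POOL_MNT and fillfile are hypothetical names;
# set POOL_MNT to your pool's mountpoint (e.g. /mypool) before running.
POOL_MNT="${POOL_MNT:-/tmp}"

# /dev/urandom produces incompressible data, so ZFS compression
# cannot collapse the writes; /dev/zero would compress to almost nothing.
dd if=/dev/urandom of="$POOL_MNT/fillfile" bs=1M count=64
```

Remember to delete the fill file afterward, or the pool stays full.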
In case you are not familiar with scrubbing: it is a ZFS operation that verifies data integrity. The system reads every block in the pool, recomputes checksums, and verifies parity. Scrubbing the entire pool therefore generates a lot of I/O traffic.
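The core idea, stripped of ZFS internals, is just "store a checksum, recompute it later, compare." This is only a conceptual illustration (the file names are made up), not how ZFS implements it:

```shell
# Conceptual illustration of integrity checking, not ZFS internals.
# Write some data and record its checksum (the "stored" checksum).
echo "important data" > /tmp/blockfile
sha256sum /tmp/blockfile > /tmp/blockfile.sum

# Later, recompute and compare -- roughly what a scrub does for every block.
sha256sum -c /tmp/blockfile.sum
```

If the data had silently rotted on disk, the recomputed checksum would no longer match and the check would fail.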
First, please make sure that your ZFS is filled with some data. Then we will scrub the system:
sudo zpool scrub mypool
Afterward, simply run the following command to check the status:
sudo zpool status -v
  pool: storage
 state: ONLINE
  scan: scrub in progress since Sun Jan 26 19:51:03 2014
        36.6G scanned out of 14.4T at 128M/s, 32h38m to go
        0 repaired, 0.25% done
Depending on the size of your pool, the process may take anywhere from a few hours to a few days to finish.
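Since a scrub can run for days, you may not want to poll the status by hand. A small helper like the following can wait for it to finish; this is a sketch under assumptions (the function name, the pool name mypool, and the 10-minute polling interval are all made up, and it matches on the "scrub in progress" text shown in the status output above):

```shell
# Hypothetical helper: block until the named pool's scrub completes.
wait_for_scrub() {
    pool="$1"
    # Poll every 10 minutes while `zpool status` still reports a running scrub.
    while zpool status "$pool" 2>/dev/null | grep -q "scrub in progress"; do
        sleep 600
    done
    echo "scrub of $pool finished"
}

# Usage: wait_for_scrub mypool
```

Parsing human-readable `zpool status` text is fragile across ZFS versions, so treat the grep pattern as an assumption to verify on your own system.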
So how does scrubbing relate to stability? Let's look at the following example. Recently I set up a ZFS system based on six hard drives. During the initial setup, everything was fine: loading data produced no errors at all. However, after I scrubbed the system, something bad happened. During the process, the system disconnected two hard drives, which made the entire pool unreadable (a RAID-Z1 pool can tolerate only one failed disk). I felt very lucky that this didn't happen in a production environment. Here is the result:
sudo zpool status
  pool: storage
 state: UNAVAIL
status: One or more devices could not be opened. There are insufficient
        replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-3C
  scan: none requested
config:

        NAME                      STATE     READ WRITE CKSUM
        storage                   UNAVAIL      0     0     0
          raidz1-0                UNAVAIL      0     0     0
            ada1                  ONLINE       0     0     0
            ada4                  ONLINE       0     0     0
            ada2                  ONLINE       0     0     0
            ada3                  ONLINE       0     0     0
            9977105323546742323   UNAVAIL      0     0     0  was /dev/ada1
            12612291712221009835  UNAVAIL      0     0     0  was /dev/ada0
After some investigation, I found that the errors had nothing to do with the hard drives themselves (bad sectors, bad cables, etc.). It turned out to be bad memory. See? You never know which component of your ZFS system is bad until you stress test it.
Happy stress testing your ZFS system!