How to Stress Test ZFS System

If you have set up a ZFS system, you may want to stress test the system before putting it in a production environment. There are many different ways to stress test the system. The most common way is to fill the entire pool using dd. However, I think scrubbing the entire pool is the best.

In case you are not familiar with scrubbing, basically it is a ZFS tool to test the data integrity. The system will go through every single file and perform checksum calculation, parity check etc. During scrubbing the entire pool, the system will generate a lot of I/O traffic.

First, please make sure that your ZFS is filled with some data. Then we will scrub the system:

sudo zpool scrub mypool

Afterward, simply run the following command to check the status:

#sudo zpool status -v
  pool: storage
 state: ONLINE
  scan: scrub in progress since Sun Jan 26 19:51:03 2014
        36.6G scanned out of 14.4T at 128M/s, 32h38m to go
        0 repaired, 0.25% done

Depending on the size of your pool, it may take few hours to few days to finish the entire process.

So how does the scrubbing related to the stability? Let’s take a look to the following example. Recently I set up a ZFS system which was based on 6 hard drives. During the initial setup, everything was fine. It gives no error or anything when loading the data. However, after I scrubbed the system, something bad happened. During the process, the system disconnected two hard drives, which made the entire pool unreadable (That’s because RAID1 can afford up to one disk fails). I was feeling so lucky because it didn’t happen in a production environment. Here is the result:

sudo zpool status
  pool: storage
 state: UNAVAIL
status: One or more devices could not be opened.  There are insufficient
        replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
  scan: none requested

        NAME                      STATE     READ WRITE CKSUM
        storage                   UNAVAIL      0     0     0
          raidz1-0                UNAVAIL      0     0     0
            ada1                  ONLINE       0     0     0
            ada4                  ONLINE       0     0     0
            ada2                  ONLINE       0     0     0
            ada3                  ONLINE       0     0     0
            9977105323546742323   UNAVAIL      0     0     0  was /dev/ada1
            12612291712221009835  UNAVAIL      0     0     0  was /dev/ada0

After some investigations, I found that the error had nothing to do with the hard drives (such as bad sectors, bad cables etc). Turn out that it was related to bad memory. See? You never know what component in your ZFS is bad until you stress test it.

Happy stress-testing your ZFS system.


Our sponsors:

Leave a Reply

Your email address will not be published. Required fields are marked *