ZFS Performance: Mirror VS RAIDZ VS RAIDZ2 vs RAIDZ3 vs Striped

I always wanted to find out the performance difference among different ZFS layouts, such as mirror, RAIDZ, RAIDZ2, RAIDZ3, striped, two RAIDZ vdevs vs one RAIDZ2 vdev, etc. So I decided to set up an experiment to test them. Before we talk about the test results, let’s go over some background information, such as the hardware and the details of each design.

Background

Here is the machine I used for the experiment. It is a consumer-grade desktop computer manufactured back in 2014:

CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz / quad core / 8 threads
OS: CentOS Linux release 7.3.1611 (Core)
Kernel: Linux 3.10.0-514.6.1.el7.x86_64 #1 SMP Wed Jan 18 13:06:36 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Memory: 20 GB (2GB x 4)
Hard drives: 5 TB x 8 
(Every hard drive has 4k sectors, is non-SSD and consumer grade, and is connected via a PCIe x16 RAID card with a SAS interface)
System Settings: Everything is system default. Nothing has been done to the kernel configuration.

Also, I tried to keep each test simple. Therefore I didn’t do anything special:

zpool create -f myzpool (different settings go here...)
zfs create myzpool/data

To optimize the I/O performance, the block size of the zpool is based on the physical sector size of the hard drive. In my case, all of the hard drives have 4k (4096-byte) sectors. Since 4096 = 2^12, the ashift value of the zpool is 12.

zdb | grep ashift
ashift: 12
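
As a rough illustration of the relationship above, the ashift value is the base-2 logarithm of the sector size. The sketch below is not part of ZFS; ashift_for_sector is a hypothetical helper name used only for this example, and it assumes the sector size is a power of two.

```shell
# Compute log2 of a sector size given in bytes (assumes a power of two).
# ashift_for_sector is a made-up helper name, not a ZFS command.
ashift_for_sector() (
    size=$1
    bits=0
    while [ "$size" -gt 1 ]; do
        size=$((size / 2))
        bits=$((bits + 1))
    done
    echo "$bits"
)

ashift_for_sector 4096   # 4k sectors, as used in this article
ashift_for_sector 512    # older 512-byte sectors
```

For 4096 this prints 12, matching the zdb output above; for 512 it prints 9.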

To measure the write performance, I first generate a zero-filled file of 41GB and write it to the zpool directly. To measure the read performance, I read the file back and send the output to /dev/null. Notice that the file is large enough (41GB) that it does not fit in the ARC, which is limited to 50% of the system memory, i.e., 10GB. Notice also that the dd block size matches the physical sector size of the hard drive.

One of the readers asked me why I use a large file instead of many small files. There are a few reasons:

  • It is very easy to stress test / saturate the bandwidth (the connection between the hard drives, the network, etc.) when working with a large file.
  • The results of testing large files are more consistent.

#To test the write performance:
dd if=/dev/zero of=/myzpool/data/file.out bs=4096 count=10000000

#To test the read performance:
dd if=/myzpool/data/file.out of=/dev/null bs=4096
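
To double check the numbers behind these commands, here is a small sketch of the arithmetic: block size times block count gives the file size, which can then be compared with the 10GB cache mentioned earlier. The function names are hypothetical, and the cache size is written here as 10 x 1024^3 bytes as an assumption.

```shell
# Total bytes written by dd = block size x block count.
dd_total_bytes() (
    echo $(( $1 * $2 ))
)

# Succeeds (exit status 0) when the file is larger than the cache.
larger_than_cache() (
    [ "$1" -gt "$2" ]
)

file_bytes=$(dd_total_bytes 4096 10000000)
arc_bytes=$((10 * 1024 * 1024 * 1024))   # assumed: 10GB cache, in bytes

echo "test file: $file_bytes bytes"
if larger_than_cache "$file_bytes" "$arc_bytes"; then
    echo "the file does not fit in the cache"
fi
```

The file size works out to 40960000000 bytes, the same figure dd reports in the results.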

FYI, if the block size is not specified, dd falls back to its default of 512 bytes, and the result can be very different:

#Using default block size:
dd if=/myzpool/data/file.out of=/dev/null
40960000000 bytes (41 GB) copied, 163.046 s, 251 MB/s

#Using native block size:
dd if=/myzpool/data/file.out of=/dev/null bs=4096
40960000000 bytes (41 GB) copied, 58.111 s, 705 MB/s
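
The MB/s figures that dd prints can be reproduced by hand: bytes divided by seconds, divided by 1,000,000 (dd uses decimal megabytes here). A minimal sketch, with throughput_mbs as a hypothetical helper name:

```shell
# Throughput in decimal MB/s, rounded to the nearest integer.
# $1 = bytes copied, $2 = elapsed seconds
throughput_mbs() {
    awk -v bytes="$1" -v secs="$2" 'BEGIN { printf "%.0f\n", bytes / secs / 1000000 }'
}

throughput_mbs 40960000000 163.046   # default block size
throughput_mbs 40960000000 58.111    # bs=4096
```

The two calls print 251 and 705, which matches the dd output above.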

After each test, I destroyed the zpool and created a different one. This ensures that the environmental factors (such as hardware and OS) stay the same. The test results are below. If you want to learn more about each design, such as the exact command I used for each test, see the corresponding section later in this article.
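
The create / test / destroy cycle described here can be sketched as a dry run that only prints the commands instead of executing them. The function name is hypothetical, and the use of "zpool destroy myzpool" is an assumption, since the article does not show the exact command used to destroy the pool.

```shell
# Print (do not execute) the command sequence for one benchmark run.
# $1 = vdev specification, e.g. "raidz hd1 hd2 hd3 hd4 raidz hd5 hd6 hd7 hd8"
print_benchmark_run() {
    echo "zpool create -f myzpool $1"
    echo "zfs create myzpool/data"
    echo "dd if=/dev/zero of=/myzpool/data/file.out bs=4096 count=10000000"
    echo "dd if=/myzpool/data/file.out of=/dev/null bs=4096"
    echo "zpool destroy myzpool"   # assumed cleanup step
}

print_benchmark_run "mirror hd1 hd2 mirror hd3 hd4 mirror hd5 hd6 mirror hd7 hd8"
```

Printing the commands first makes it easy to review each layout before running anything against real disks.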

Notice that I used eight 5TB hard drives (total: 40TB) in this test. A drive advertised as 5TB typically shows up as about 4.5TB in the OS, around 86%-90% of the advertised number depending on which OS you are using, largely because drives are sold in decimal units (1TB = 10^12 bytes) while tools such as df -h report binary units (2^40 bytes). For example, if we use the striped design, which gives the maximum possible storage capacity in ZFS, the usable space will be 8 x 5TB x 90% = 36TB. Therefore, the following percentages are based on 36TB rather than 40TB.
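
Most of the gap between the advertised size and the size the OS reports comes from decimal versus binary units. The sketch below converts advertised terabytes (10^12 bytes) into binary terabytes (2^40 bytes); tb_to_tib is a hypothetical helper name, and the calculation ignores any filesystem overhead.

```shell
# Convert advertised (decimal) TB to binary TiB, rounded to two decimals.
tb_to_tib() {
    awk -v tb="$1" 'BEGIN { printf "%.2f\n", tb * 1000000000000 / 1099511627776 }'
}

tb_to_tib 5    # one drive
tb_to_tib 40   # all eight drives
```

This prints 4.55 for one drive and 36.38 for eight drives, which is in line with the 36T that df -h reports for the striped pool.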

You may notice that I use 10 disks in each diagram, while I use only 8 disks in the article here. That’s because the diagrams are from my first edit. At that time I used a relatively old machine, which may not reflect the modern ZFS design. The hardware and the test methods I used in the second edit are better, although both edits draw the same conclusion.

Test Result

(Sorted by speed)

Write Time and Read Time are the time spent writing and reading a 41GB file. Parity Disks is the number of disks used for parity or redundancy.

No.  ZFS Type    Write (MB/s)  Write Time  Read (MB/s)  Read Time  Capacity (Max: 36TB)  Parity Disks  Disk Arrangement
1    Striped     705           58.111s     687          59.6386s   36TB (100%)           0             Striped (8)
2    RAIDZ x 2   670           61.1404s    680          60.2457s   26TB (72%)            2             RAIDZ (4) x 2
3    RAIDZ2      608           67.3897s    673          60.8205s   25TB (69%)            2             RAIDZ2 (8)
4    RAIDZ1      604           67.8107s    631          64.8782s   30TB (83%)            1             RAIDZ (8)
5    RAIDZ3      528           77.549s     577          70.9604s   21TB (58%)            3             RAIDZ3 (8)
6    Mirror      473           86.6451s    598          68.4477s   18TB (50%)            4             Mirror (2) x 4
7    RAIDZ2 x 2  414           98.9698s    441          92.963s    18TB (50%)            4             RAIDZ2 (4) x 2
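
For comparison with the capacities reported above, a naive estimate is total x (disks - parity) / disks. The sketch below uses a hypothetical helper name and deliberately ignores any allocation overhead, so it should be read as an upper bound rather than a prediction.

```shell
# Naive usable capacity: total * (disks - parity) / disks, one decimal place.
# $1 = total capacity (TB), $2 = number of disks, $3 = disks used for parity
naive_usable_tb() {
    awk -v total="$1" -v disks="$2" -v parity="$3" \
        'BEGIN { printf "%.1f\n", total * (disks - parity) / disks }'
}

naive_usable_tb 36 8 1   # RAIDZ (8): reported 30TB
naive_usable_tb 36 8 2   # RAIDZ2 (8) and RAIDZ (4) x 2: reported 25TB and 26TB
naive_usable_tb 36 8 3   # RAIDZ3 (8): reported 21TB
naive_usable_tb 36 8 4   # Mirror (2) x 4 and RAIDZ2 (4) x 2: reported 18TB
```

The estimates come out at 31.5, 27.0, 22.5 and 18.0. The RAIDZ figures reported by df are one to two TB lower, presumably because of overhead this simple formula does not model.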

Striped

In this design, we use all disks to store data (i.e., zero data protection), which maxes out our total usable space at 36TB.

#Command
zpool create -f myzpool hd1 hd2 \
                        hd3 hd4 \
                        hd5 hd6 \
                        hd7 hd8

#df -h
Filesystem     Size    Used   Avail Capacity  Mounted on
myzpool         36T      0K     36T       0%  /myzpool 

#zpool status -v
        NAME        STATE     READ WRITE CKSUM
        myzpool     ONLINE       0     0     0
          hd1       ONLINE       0     0     0
          hd2       ONLINE       0     0     0
          hd3       ONLINE       0     0     0
          hd4       ONLINE       0     0     0
          hd5       ONLINE       0     0     0
          hd6       ONLINE       0     0     0
          hd7       ONLINE       0     0     0
          hd8       ONLINE       0     0     0

And here is the test result:

#Write Test
dd if=/dev/zero of=/myzpool/data/file.out bs=4096 count=10000000
40960000000 bytes (41 GB) copied, 58.111 s, 705 MB/s

#Read Test
dd if=/myzpool/data/file.out of=/dev/null bs=4096
40960000000 bytes (41 GB) copied, 59.6386 s, 687 MB/s

RAIDZ x 2

In this design, we split the disks into two groups. In each group, we store the data in a RAIDZ1 structure. This is similar to RAIDZ2 in terms of data protection, except that this design tolerates at most one failed disk in each group (local scale), while RAIDZ2 tolerates ANY two failed disks overall (global scale). Since we use two disks for parity, the usable space drops from 36TB to 26TB.
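
To make the local versus global difference concrete, the sketch below enumerates every possible pair of failed disks among 8 and counts how many pairs each layout survives. It assumes disks 1-4 form the first RAIDZ1 group and disks 5-8 the second; the function name is hypothetical.

```shell
# Count the two-disk failures that each 8-disk layout survives.
count_survivable_pairs() (
    total=0
    split_ok=0
    a=1
    while [ "$a" -le 8 ]; do
        b=$((a + 1))
        while [ "$b" -le 8 ]; do
            total=$((total + 1))
            # RAIDZ1 x 2 survives only when the failures land in different groups.
            if [ $(( (a - 1) / 4 )) -ne $(( (b - 1) / 4 )) ]; then
                split_ok=$((split_ok + 1))
            fi
            b=$((b + 1))
        done
        a=$((a + 1))
    done
    echo "RAIDZ (4) x 2 survives $split_ok of $total two-disk failures"
    echo "RAIDZ2 (8) survives $total of $total two-disk failures"
)

count_survivable_pairs
```

Of the 28 possible pairs, the two-group layout survives the 16 where the failed disks sit in different groups, while RAIDZ2 survives all 28.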

#Command
zpool create -f myzpool raidz hd1 hd2 hd3 hd4 \
                        raidz hd5 hd6 hd7 hd8

#df -h
Filesystem     Size    Used   Avail Capacity  Mounted on
myzpool         26T      0K     26T       0%  /myzpool 

#zpool status -v
        NAME        STATE     READ WRITE CKSUM
        myzpool     ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            hd1     ONLINE       0     0     0
            hd2     ONLINE       0     0     0
            hd3     ONLINE       0     0     0
            hd4     ONLINE       0     0     0
          raidz1-1  ONLINE       0     0     0
            hd5     ONLINE       0     0     0
            hd6     ONLINE       0     0     0
            hd7     ONLINE       0     0     0
            hd8     ONLINE       0     0     0


And here is the test result:

#Write Test
dd if=/dev/zero of=/myzpool/data/file.out bs=4096 count=10000000
40960000000 bytes (41 GB) copied, 61.1401 s, 670 MB/s

#Read Test
dd if=/myzpool/data/file.out of=/dev/null bs=4096
40960000000 bytes (41 GB) copied, 60.2457 s, 680 MB/s


RAIDZ2

In this design, we use two disks for data protection. This allows up to two disks to fail without losing any data. The usable space drops from 36TB to 25TB.

#Command
zpool create -f myzpool raidz2 hd1 hd2 hd3 hd4 \
                               hd5 hd6 hd7 hd8

#df -h
Filesystem     Size    Used   Avail Capacity  Mounted on
myzpool         25T     31K     25T       0%  /myzpool 

#zpool status -v
        NAME        STATE     READ WRITE CKSUM
        myzpool     ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            hd1     ONLINE       0     0     0
            hd2     ONLINE       0     0     0
            hd3     ONLINE       0     0     0
            hd4     ONLINE       0     0     0
            hd5     ONLINE       0     0     0
            hd6     ONLINE       0     0     0
            hd7     ONLINE       0     0     0
            hd8     ONLINE       0     0     0

And here is the test result:

#Write Test
dd if=/dev/zero of=/myzpool/data/file.out bs=4096 count=10000000
40960000000 bytes (41 GB) copied, 67.3897 s, 608 MB/s

#Read Test
dd if=/myzpool/data/file.out of=/dev/null bs=4096
40960000000 bytes (41 GB) copied, 60.8205 s, 673 MB/s

RAIDZ1

In this design, we use one disk for data protection. This allows one disk to fail without losing any data. The usable space drops from 36TB to 30TB.

#Command
zpool create -f myzpool raidz hd1 hd2 hd3 hd4 \
                              hd5 hd6 hd7 hd8

#df -h
Filesystem     Size    Used   Avail Capacity  Mounted on
myzpool         30T      0K     30T       0%  /myzpool 

#zpool status -v
        NAME        STATE     READ WRITE CKSUM
        myzpool     ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            hd1     ONLINE       0     0     0
            hd2     ONLINE       0     0     0
            hd3     ONLINE       0     0     0
            hd4     ONLINE       0     0     0
            hd5     ONLINE       0     0     0
            hd6     ONLINE       0     0     0
            hd7     ONLINE       0     0     0
            hd8     ONLINE       0     0     0

And here is the test result:

#Write Test
dd if=/dev/zero of=/myzpool/data/file.out bs=4096 count=10000000
40960000000 bytes (41 GB) copied, 67.8107 s, 604 MB/s

#Read Test
dd if=/myzpool/data/file.out of=/dev/null bs=4096
40960000000 bytes (41 GB) copied, 64.8782 s, 631 MB/s


RAIDZ3

In this design, we use three disks for data protection. This allows up to three disks to fail without losing any data. The usable space drops from 36TB to 21TB.

#Command
zpool create -f myzpool raidz3 hd1 hd2 hd3 hd4 \
                               hd5 hd6 hd7 hd8

#df -h
Filesystem     Size    Used   Avail Capacity  Mounted on
myzpool         21T     31K     21T     0%    /myzpool 

#zpool status -v
        NAME        STATE     READ WRITE CKSUM
        myzpool     ONLINE       0     0     0
          raidz3-0  ONLINE       0     0     0
            hd1     ONLINE       0     0     0
            hd2     ONLINE       0     0     0
            hd3     ONLINE       0     0     0
            hd4     ONLINE       0     0     0
            hd5     ONLINE       0     0     0
            hd6     ONLINE       0     0     0
            hd7     ONLINE       0     0     0
            hd8     ONLINE       0     0     0


And here is the test result:

#Write Test
dd if=/dev/zero of=/myzpool/data/file.out bs=4096 count=10000000
40960000000 bytes (41 GB) copied, 77.549 s, 528 MB/s

#Read Test
dd if=/myzpool/data/file.out of=/dev/null bs=4096
40960000000 bytes (41 GB) copied, 70.9604 s, 577 MB/s


Mirror

In this design, we use half of our disks for data protection, which makes our total usable space drop from 36TB to 18TB.

#Command
zpool create -f myzpool mirror hd1 hd2 \
                        mirror hd3 hd4 \
                        mirror hd5 hd6 \
                        mirror hd7 hd8

#df -h
Filesystem     Size    Used   Avail Capacity  Mounted on
myzpool         18T     31K     18T       0%  /myzpool 

#zpool status -v
        NAME        STATE     READ WRITE CKSUM
        myzpool     ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            hd1     ONLINE       0     0     0
            hd2     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            hd3     ONLINE       0     0     0
            hd4     ONLINE       0     0     0
          mirror-2  ONLINE       0     0     0
            hd5     ONLINE       0     0     0
            hd6     ONLINE       0     0     0
          mirror-3  ONLINE       0     0     0
            hd7     ONLINE       0     0     0
            hd8     ONLINE       0     0     0
          

And here is the test result:

#Write Test
dd if=/dev/zero of=/myzpool/data/file.out bs=4096 count=10000000
40960000000 bytes (41 GB) copied, 86.6451 s, 473 MB/s

#Read Test
dd if=/myzpool/data/file.out of=/dev/null bs=4096
40960000000 bytes (41 GB) copied, 68.4477 s, 598 MB/s


RAIDZ2 x 2

In this design, we split the disks into two groups. In each group, we store the data in a RAIDZ2 structure. Since we use four disks for parity (two in each group), the usable space drops from 36TB to 18TB.

#Command
zpool create -f myzpool raidz2 hd1 hd2 hd3 hd4 \
                        raidz2 hd5 hd6 hd7 hd8

#df -h
Filesystem     Size    Used   Avail Capacity  Mounted on
myzpool         18T      0K     18T       0%  /myzpool 

#zpool status -v
        NAME        STATE     READ WRITE CKSUM
        myzpool     ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            hd1     ONLINE       0     0     0
            hd2     ONLINE       0     0     0
            hd3     ONLINE       0     0     0
            hd4     ONLINE       0     0     0
          raidz2-1  ONLINE       0     0     0
            hd5     ONLINE       0     0     0
            hd6     ONLINE       0     0     0
            hd7     ONLINE       0     0     0
            hd8     ONLINE       0     0     0


And here is the test result:

#Write Test
dd if=/dev/zero of=/myzpool/data/file.out bs=4096 count=10000000
40960000000 bytes (41 GB) copied, 98.9698 s, 414 MB/s

#Read Test
dd if=/myzpool/data/file.out of=/dev/null bs=4096
40960000000 bytes (41 GB) copied, 92.963 s, 441 MB/s


Summary

I am not surprised that the striped layout offers the fastest write speed and the maximum storage space. The only drawback is zero data protection. Unless you mirror the data at the server level (e.g., Hadoop), or the data is not important, I don’t recommend this design.

Personally, I recommend going with striped RAIDZ, i.e., create multiple RAIDZ vdevs, each with no more than 5 disks. In theory, ZFS recommends no more than 8 to 9 disks in each vdev. Based on my experience, ZFS slows down when it has about 30% free space left if there are too many disks in one single vdev.
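
As a sketch of what a striped RAIDZ layout looks like on the command line, the helper below groups a list of disks into RAIDZ vdevs of a chosen size and prints the resulting zpool create command without running it. The function name is hypothetical, and if the disk count is not a multiple of the vdev size, the last vdev simply ends up smaller.

```shell
# usage: striped_raidz_cmd POOL VDEV_SIZE DISK...
# Prints a "zpool create" command with one raidz vdev per VDEV_SIZE disks.
striped_raidz_cmd() (
    pool=$1
    vdev_size=$2
    shift 2
    cmd="zpool create -f $pool"
    i=0
    for disk in "$@"; do
        # Start a new raidz vdev every vdev_size disks.
        if [ $((i % vdev_size)) -eq 0 ]; then
            cmd="$cmd raidz"
        fi
        cmd="$cmd $disk"
        i=$((i + 1))
    done
    echo "$cmd"
)

striped_raidz_cmd myzpool 4 hd1 hd2 hd3 hd4 hd5 hd6 hd7 hd8
```

With 8 disks and a vdev size of 4, this prints the same command used in the RAIDZ x 2 section.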

So which design should you use? Here is my recommendation:

#Do you care about your data?
No: Go with striped.
Yes: See below:

#How many disks do you have?
1:     ZFS is not for you.
2:     Mirror
3-5:   RAIDZ1
6-10:  RAIDZ1 x 2
11-15: RAIDZ1 x 3
16-20: RAIDZ1 x 4
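
The same decision table can be written as a small lookup function. This is only a sketch: recommend_layout is a hypothetical name, it assumes that exactly 10 disks falls in the "RAIDZ1 x 2" bucket, and it covers only the case where you care about your data.

```shell
# Map a disk count to the layout recommended in the table above.
recommend_layout() (
    n=$1
    if [ "$n" -le 1 ]; then
        echo "ZFS is not for you"
    elif [ "$n" -eq 2 ]; then
        echo "Mirror"
    elif [ "$n" -le 5 ]; then
        echo "RAIDZ1"
    elif [ "$n" -le 10 ]; then
        echo "RAIDZ1 x 2"      # assumption: 10 disks belongs to this bucket
    elif [ "$n" -le 15 ]; then
        echo "RAIDZ1 x 3"
    elif [ "$n" -le 20 ]; then
        echo "RAIDZ1 x 4"
    else
        echo "not covered by the table"
    fi
)

recommend_layout 8
```

For the 8-disk machine used in this article, it prints RAIDZ1 x 2.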

And yes, you can pretty much forget about RAIDZ2, RAIDZ3 and mirror if you need speed and data protection together.

So, you may ask: what should I do if more than one hard drive fails? The answer is: you need to keep an eye on the health of your ZFS pool every day. I have been managing over 60 servers since 2009, and I’ve used only RAIDZ1 with my consumer-level hard drives (most of them were actually taken from external hard drives). So far I haven’t had any data loss.

sudo zpool status -v

or

sudo zpool status -v | grep 'state: ONLINE'

Simply write a program to get the result from this command, and send yourself an email if anything goes wrong. You can run the program from a cron job daily or hourly. This is my version:

#!/bin/bash

# "zpool status -x" prints a single line when every pool is healthy.
result=$(sudo zpool status -x)

if [[ "$result" != 'all pools are healthy' ]]; then
        echo "Something is wrong."
        # Do something here, such as sending an email via HTTP...
        /usr/bin/wget "http://example.com/send_email.php?subject=Alert&body=File%20System%20Has%20Problem" -O /dev/null > /dev/null
        exit 1
fi
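
If you want the alert to say which device is in trouble, one possible extension is to filter the full zpool status output for device lines whose STATE column is not ONLINE. This is a sketch under assumptions: the function name is hypothetical, the list of states is not exhaustive, and the sample text below is hand-written for illustration, not captured from a real system.

```shell
# Reads "zpool status" output on stdin and prints "NAME STATE" for every
# config line whose STATE column is one of the unhealthy states listed below.
# NF >= 5 skips the "state: ..." header line, which has only two fields.
list_unhealthy_devices() {
    awk 'NF >= 5 && $2 ~ /^(DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)$/ { print $1, $2 }'
}

# Hand-written sample input (illustrative only).
sample='  pool: myzpool
 state: DEGRADED
config:
        NAME        STATE     READ WRITE CKSUM
        myzpool     DEGRADED     0     0     0
          raidz1-0  DEGRADED     0     0     0
            hd1     ONLINE       0     0     0
            hd2     FAULTED      0     0     0
'

printf '%s' "$sample" | list_unhealthy_devices
# On a real system: sudo zpool status | list_unhealthy_devices
```

The output can be appended to the body of the alert email, so you know which disk to look at before logging in.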

Enjoy ZFS.

–Derrick

28 Replies to “ZFS Performance: Mirror VS RAIDZ VS RAIDZ2 vs RAIDZ3 vs Striped”

    • Derrick Post author

      The design of data redundancy hasn’t changed for years. Hardware improvements will make the speed higher, but the relative speed / ratio will probably still hold for years, e.g., RAIDZ3 is still slower compared to simple data striping.

      Thanks!

      –Derrick

  1. Kenny

    You used /dev/random as your data source which may block when the entropy runs low. Have you taken this problem into account? Why didn’t you use the non-blocking /dev/urandom equivalent instead?

    • Derrick Post author

      I used FreeBSD to perform the tests. In FreeBSD, /dev/random and /dev/urandom are the same.

      ls -al /dev | grep random
      crw-rw-rw- 1 root wheel 0x20 Oct 14 04:49 random
      lrwxr-xr-x 1 root wheel 6B Oct 14 09:49 urandom -> random

  2. David Barton

    The raidz2 and 2 x raidz performance look very similar, with the latter being slightly faster and what you recommend. However you don’t get any data protection during a disk rebuild, and if any of the remaining drives has a failure on any block during the rebuild then that data is unrecoverable.

    raidz2 seems to offer a very slight performance degradation compared to 2 x raidz for a significant reliability improvement. Did I misread this?

    • Derrick Post author

      Yes and no. It really depends on your hardware configuration and how you set up ZFS. For example, if you are talking about a small RAIDZ2 (e.g., around 5 disks), then the difference will be small. However, if you have a big RAIDZ2 (e.g., 14 disks), then I am pretty sure RAIDZ x 2 (i.e., 7 disks x 2) will perform much faster than RAIDZ2.

  3. Lee Burch

    No need for creating monitoring scripts, zed is packaged with ZFS and monitors for ZFS events such as a disk being taken offline.

  4. Anonymous

    Have only one disk, ZFS is for you.

    I am afraid you do not know about ZFS copies parameter of a pool.

    Having only one disk, ZFS can save you from “silent data corruption” if you activate to have multiple copies on the same pool… it can work with just only one disk.

    Ah, and do not forget about compression… and maybe “de-dup” if have plenty of ram.

    Etc.

    Yes, yes… one disk and one disk fail, all lost.

    But ZFS never ever can be used to not need BackUPs…. you still need them.

    No system can prevent the typical human error: I order to delete folder X but I was trying to delete folder Y… I lost all the important data except the garbage… X was the important data, Y was the garbage in this example.

    I use ZFS on every USB HDD that i use for BackUPs with at least two copies (yes i reduce the size to half as with a mirror)… that way i can ensure (having five or more ZPOOLs, each one on its own USB HDD) my BackUps will not get “silent data corruption”… i manually sync them having only two of them at the same time powered.

    • Derrick Post author

      That’s interesting. I never thought about using single-disk ZFS, mainly because I don’t like the idea of putting all my eggs in one basket. If I have some important data to store, I always go with raidz if possible. I personally don’t use ZFS copies because I don’t feel safe using a single ZFS dataset for backup. I always have two or more servers, where one acts as the master and the others act as slaves/mirrors. That way if one fails, the other one will still be available. Don’t forget that there are many other factors such as the motherboard, power interruption, bad memory, etc.

  5. Greg S

    I studied your article and enjoyed it and learned much. I did my own benchmarks on my new system and after following most “best practices” I’ve read, still managed to only get about 147 MB/sec on writes and only a bit better on reads. I assume the problem is in the tuning. The hardware is ASRock x299 i9 Fatal1ty with Intel Core I7-7820X Extreme 3.6 GHz, memory is 128GB of DDR4-2666 Quad-Channel, disks are 6 Seagate BarraCuda Pro 6TB each. My question is: why do I only get 147 MB/sec when you got between 600 and 700 MB/sec? Please advise. Thank you very much for your time.

    • Derrick Post author

      Intel introduced one new feature in its 7th generation of i7 CPUs: Optane Cache. Personally I don’t have any experience with it. I only know that it’s like a ZFS log and cache device for a regular hard drive. Some other people mentioned that the Optane Cache doesn’t work as expected when used with ZFS. So I would investigate the Optane Cache options on your motherboard/BIOS.

      • Greg S

        I do not know if I have this Optane Cache or not; I vaguely remember reading that it only works on Windows 10. All I know is that with the $8000+ I just spent on this “fast” hardware, I should be getting better performance on ZFS than what I am seeing. I run Fedora Linux on my other machines, and am very new to FreeBSD. I would like this new hardware (and FreeBSD/ZFS) to eventually become my “main” file server, but the learning curve is a bit steeper than I anticipated. What would you suggest I do to resolve the speed (lack of) issue? Thanks. –gs–

        • Derrick Post author

          Optane Cache is at the BIOS level, i.e., when the OS needs to access the hard drive, the BIOS / motherboard place the data in the optane cache device for performance purposes. It’s more like running ZFS on a hardware RAID, which is redundant. I would definitely check the BIOS and disable the Optane Cache.

          My tests are mainly based on FreeBSD and CentOS (Linux kernel v3). As far as I know, ZFS on Linux doesn’t like kernel v4 (which is what Fedora mainly uses). The performance depends on multiple factors:

          – How the hard drives are connected together. I always group the disks in the same vdev in the same raid card. Also I never enable the RAID option in the card BIOS, and I use the SAS cables if possible.
          – I usually turn off the unnecessary features in the BIOS, e.g., Audio, boot from LAN, minimize the graphic memory etc.
          – I will test the IO speed of each hard drive. Note that writing to the raw device destroys any data on it, so only do this on an empty drive. In FreeBSD, you can do something like:

          dd if=/dev/random of=/dev/adX1 bs=16M count=1000

          In Linux, you should not use /dev/random because it is very slow. You can use /dev/zero instead:

          dd if=/dev/zero of=/dev/sdX bs=16M count=1000

          If your hard drives fail to reach the full speed, then I don’t see any reason why your RAID setup will reach the high speed.

          I documented my entire journal here too: How to improve ZFS performance

          –Derrick

          • Greg S

            This FreeBSD box with ZFS is NOT Fedora, as I may have implied. This new hardware has been tested with a bash script I wrote that tries each permutation of: raidz1, raidz2, mirror, for several combinations of drives: ada3, ada3 and 4, ada3 and 4 and 5 … up to ada3 thru ada9. I used /dev/random because I found that the compression was making unfairly easy work of the zeros when I originally did use /dev/zero. Each combination of raidzNNN with each combination of numbers of whole disks ada3 thru ada8 gave similar results, +-10% about 150MB/sec. Can I email you my sysctl.conf and loader.conf and my test script, along with a few other relevant details? Maybe you have some insight. What is your email addr? It is quite a difference between my 150MB/s and your 700+MB/s. Like WOW.

  6. Goran

    It would be interesting to see the same test but instead of one gigantic 40GB file use lots of small files. I’ve read that RaidZ1/Z2 will have better performance with large files over Mirrored pools, but Mirrors will beat RaidZ1/2 in performance if you got lots of small files instead. Thanks for sharing your test results.

    Source: https://forums.freenas.org/index.php?threads/some-differences-between-raidz-and-mirrors-and-why-we-use-mirrors-for-block-storage.44068/

  7. Niek

    I know this is very old and I don’t know if you still monitor this.
    I’m confused about Raidz2 vs Raidz2 x2. Why isn’t the first slower than the latter? I would expect the mirroring between the two raidz2’s give a speed boost?

    • Derrick Post author

      RaidZ2 x 2 is not a mirror. It is a set of two raidz2 vdevs. Raidz2 is always faster than Raidz2 x 2, because the CPU needs to calculate two parities for Raidz2, versus four parities for Raidz2 x 2 per write.

  8. Dwight Turner

    Hello from the future!

    Just found this as I am reconfiguring my home NAS. I am reconfiguring my (10) 3TB WD Reds.
    Mostly used for media etc and have always thought RAID Z2 (10) x 1 using the 10 drives.
    However, after looking at this I may go with RAID Z1 (5) x 2

    good info thanks and be well!

  9. Adrian

    Well, since nobody noticed, I would like to point out a little mistake in this blog.
    In the Striped part, the two dd commands in the test result are the same; the write test is the incorrect one.

    And here is the test result:

    #Write Test
    dd if=/myzpool/data/file.out of=/dev/null bs=4096
    40960000000 bytes (41 GB) copied, 58.111 s, 705 MB/s

    #Read Test
    dd if=/myzpool/data/file.out of=/dev/null bs=4096
    40960000000 bytes (41 GB) copied, 59.6386 s, 687 MB/s
