I have always wanted to find out the performance differences among the various ZFS layouts, such as mirror, RAIDZ, RAIDZ2, RAIDZ3, striped, two RAIDZ vdevs vs. one RAIDZ2 vdev, etc. So I designed an experiment to test these ZFS layouts. Before we talk about the test results, let's go over some background information, such as the details of each design and the hardware.
Background
Here is the machine I used for the experiment. It is a consumer-grade desktop computer manufactured back in 2014:
CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz / quad core / 8 threads
OS: CentOS Linux release 7.3.1611 (Core)
Kernel: Linux 3.10.0-514.6.1.el7.x86_64 #1 SMP Wed Jan 18 13:06:36 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Memory: 20 GB (2GB x 4)
Hard drives: 5 TB x 8 (every hard drive has 4k sectors, non-SSD, consumer grade, connected via a PCI-e x16 RAID card with SAS interface)
System settings: Everything is at the system default. Nothing has been done to the kernel configuration.
Also, I tried to keep each test simple, so I didn't do anything special:
zpool create -f myzpool (different settings go here...)
zfs create myzpool/data
To optimize I/O performance, the block size of the zpool should match the physical sector size of the hard drives. In my case, all of the hard drives have 4k (4096-byte) sectors, i.e., 2^12 bytes; therefore, the ashift value of the zpool is 12.
zdb | grep ashift
    ashift: 12
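The ashift value is simply the base-2 logarithm of the sector size. As a quick sanity check (a sketch, not part of ZFS), you can derive it yourself:

```shell
# Derive ashift = log2(sector size); 4096-byte sectors give ashift=12.
sector=4096
ashift=0
n=$sector
while [ "$n" -gt 1 ]; do
    n=$((n / 2))
    ashift=$((ashift + 1))
done
echo "ashift=$ashift"
```

On ZFS on Linux you can also force the value explicitly at creation time with `zpool create -o ashift=12 ...` instead of relying on sector-size detection.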
To measure the write performance, I generate a 41GB file of zeros and write it to the zpool directly. To measure the read performance, I read the file back and send the output to /dev/null. Notice that the file is large enough (41GB) that it does not fit in the ARC cache (50% of the system memory, i.e., 10GB), and that the block size matches the physical sector size of the hard drives.
One of the readers asked me why I use one large file instead of many small files. There are a few reasons:
- It is very easy to stress test / saturate the bandwidth (the connections between the hard drives, network, etc.) when working with a large file.
- The results of testing large files are more consistent.
#To test the write performance:
dd if=/dev/zero of=/myzpool/data/file.out bs=4096 count=10000000

#To test the read performance:
dd if=/myzpool/data/file.out of=/dev/null bs=4096
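As a sanity check on those dd parameters, bs × count gives the total file size, which comfortably exceeds the 10GB ARC limit mentioned above:

```shell
# 4096-byte blocks x 10,000,000 blocks = 40,960,000,000 bytes (~41 GB)
bytes=$((4096 * 10000000))
echo "$bytes bytes, ~$((bytes / 1000000000)) GB"
```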
FYI, if the block size is not specified, the result can be very different:
#Using default block size:
dd if=/myzpool/data/file.out of=/dev/null
40960000000 bytes (41 GB) copied, 163.046 s, 251 MB/s

#Using native block size:
dd if=/myzpool/data/file.out of=/dev/null bs=4096
40960000000 bytes (41 GB) copied, 58.111 s, 705 MB/s
After each test, I destroyed the zpool and created a different one. This ensures that the environmental factors (such as hardware and OS) stay the same. Here are the test results. If you want to learn more about each design, such as the exact commands I used for each test, the corresponding material is available in the later sections.
Notice that I used eight 5TB hard drives (total: 40TB) in this test. Typically a 5TB hard drive can hold about 4.5TB of data, which is around 86%-90% of the advertised number, depending on which OS you are using. For example, the striped design gives the maximum possible storage capacity in ZFS: 8 x 5TB x 90% = 36TB of usable space. Therefore, the following percentages are based on 36TB rather than 40TB.
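The 36TB figure above comes from a simple back-of-the-envelope calculation (using the 90% upper bound):

```shell
# Eight advertised-5TB drives at ~90% effective capacity:
advertised=$((8 * 5))              # 40 (TB, advertised)
usable=$((advertised * 90 / 100))  # 36 (TB, approximate usable space)
echo "${usable}TB usable out of ${advertised}TB advertised"
```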
You may notice that the diagrams show 10 disks, while I use only 8 disks in this article. That's because the diagrams are from my first edit. At that time I used a relatively old machine, which may not reflect a modern ZFS setup. The hardware and the test methods I used in the second edit are better, although both edits lead to the same conclusion.
Test Result
(Sorted by speed)
(Click to see details)
Striped
In this design, we use all disks to store data (i.e., zero data protection), which maxes out our total usable space at 36TB.
#Command
zpool create -f myzpool hd1 hd2 \
                        hd3 hd4 \
                        hd5 hd6 \
                        hd7 hd8

#df -h
Filesystem  Size  Used  Avail  Capacity  Mounted on
myzpool      36T    0K    36T        0%  /myzpool

#zpool status -v
        NAME        STATE     READ WRITE CKSUM
        myzpool     ONLINE       0     0     0
          hd1       ONLINE       0     0     0
          hd2       ONLINE       0     0     0
          hd3       ONLINE       0     0     0
          hd4       ONLINE       0     0     0
          hd5       ONLINE       0     0     0
          hd6       ONLINE       0     0     0
          hd7       ONLINE       0     0     0
          hd8       ONLINE       0     0     0
And here is the test result:
#Write Test
dd if=/dev/zero of=/myzpool/data/file.out bs=4096 count=10000000
40960000000 bytes (41 GB) copied, 58.111 s, 705 MB/s

#Read Test
dd if=/myzpool/data/file.out of=/dev/null bs=4096
40960000000 bytes (41 GB) copied, 59.6386 s, 687 MB/s
RAIDZ x 2
In this design, we split the data into two groups. In each group, we store the data in a RAIDZ1 structure. This is similar to RAIDZ2 in terms of data protection, except that this design tolerates up to one failed disk in each group (local scale), while RAIDZ2 tolerates ANY two failed disks overall (global scale). Since we use two disks for parity, the usable space drops from 36TB to 26TB.
#Command
zpool create -f myzpool raidz hd1 hd2 hd3 hd4 \
                        raidz hd5 hd6 hd7 hd8

#df -h
Filesystem  Size  Used  Avail  Capacity  Mounted on
myzpool      26T    0K    26T        0%  /myzpool

#zpool status -v
        NAME        STATE     READ WRITE CKSUM
        myzpool     ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            hd1     ONLINE       0     0     0
            hd2     ONLINE       0     0     0
            hd3     ONLINE       0     0     0
            hd4     ONLINE       0     0     0
          raidz1-1  ONLINE       0     0     0
            hd5     ONLINE       0     0     0
            hd6     ONLINE       0     0     0
            hd7     ONLINE       0     0     0
            hd8     ONLINE       0     0     0
And here is the test result:
#Write Test
dd if=/dev/zero of=/storage/data/file.out bs=4096 count=10000000
40960000000 bytes (41 GB) copied, 61.1401 s, 670 MB/s

#Read Test
dd if=/storage/data/file.out of=/dev/null bs=4096
40960000000 bytes (41 GB) copied, 60.2457 s, 680 MB/s
RAIDZ2
In this design, we use two disks for data protection. This allows up to two disks to fail without losing any data. The usable space drops from 36TB to 25TB.
#Command
zpool create -f myzpool raidz2 hd1 hd2 hd3 hd4 \
                               hd5 hd6 hd7 hd8

#df -h
Filesystem  Size  Used  Avail  Capacity  Mounted on
myzpool      25T   31K    25T        0%  /myzpool

#zpool status -v
        NAME        STATE     READ WRITE CKSUM
        myzpool     ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            hd1     ONLINE       0     0     0
            hd2     ONLINE       0     0     0
            hd3     ONLINE       0     0     0
            hd4     ONLINE       0     0     0
            hd5     ONLINE       0     0     0
            hd6     ONLINE       0     0     0
            hd7     ONLINE       0     0     0
            hd8     ONLINE       0     0     0
And here is the test result:
#Write Test
dd if=/dev/zero of=/storage/data/file.out bs=4096 count=10000000
40960000000 bytes (41 GB) copied, 67.3897 s, 608 MB/s

#Read Test
dd if=/storage/data/file.out of=/dev/null bs=4096
40960000000 bytes (41 GB) copied, 60.8205 s, 673 MB/s
RAIDZ1
In this design, we use one disk for data protection. This allows up to one disk to fail without losing any data. The usable space drops from 36TB to 30TB.
#Command
zpool create -f myzpool raidz hd1 hd2 hd3 hd4 \
                              hd5 hd6 hd7 hd8

#df -h
Filesystem  Size  Used  Avail  Capacity  Mounted on
myzpool      30T    0K    30T        0%  /myzpool

#zpool status -v
        NAME        STATE     READ WRITE CKSUM
        myzpool     ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            hd1     ONLINE       0     0     0
            hd2     ONLINE       0     0     0
            hd3     ONLINE       0     0     0
            hd4     ONLINE       0     0     0
            hd5     ONLINE       0     0     0
            hd6     ONLINE       0     0     0
            hd7     ONLINE       0     0     0
            hd8     ONLINE       0     0     0
And here is the test result:
#Write Test
dd if=/dev/zero of=/storage/data/file.out bs=4096 count=10000000
40960000000 bytes (41 GB) copied, 67.8107 s, 604 MB/s

#Read Test
dd if=/storage/data/file.out of=/dev/null bs=4096
40960000000 bytes (41 GB) copied, 64.8782 s, 631 MB/s
RAIDZ3
In this design, we use three disks for data protection. This allows up to three disks to fail without losing any data. The usable space drops from 36TB to 21TB.
#Command
zpool create -f myzpool raidz3 hd1 hd2 hd3 hd4 \
                               hd5 hd6 hd7 hd8

#df -h
Filesystem  Size  Used  Avail  Capacity  Mounted on
myzpool      21T   31K    21T        0%  /myzpool

#zpool status -v
        NAME        STATE     READ WRITE CKSUM
        myzpool     ONLINE       0     0     0
          raidz3-0  ONLINE       0     0     0
            hd1     ONLINE       0     0     0
            hd2     ONLINE       0     0     0
            hd3     ONLINE       0     0     0
            hd4     ONLINE       0     0     0
            hd5     ONLINE       0     0     0
            hd6     ONLINE       0     0     0
            hd7     ONLINE       0     0     0
            hd8     ONLINE       0     0     0
And here is the test result:
#Write Test
dd if=/dev/zero of=/storage/data/file.out bs=4096 count=10000000
40960000000 bytes (41 GB) copied, 77.549 s, 528 MB/s

#Read Test
dd if=/storage/data/file.out of=/dev/null bs=4096
40960000000 bytes (41 GB) copied, 70.9604 s, 577 MB/s
Mirror
In this design, we use half of our disks for data protection, which drops our total usable space from 36TB to 18TB.
#Command
zpool create -f myzpool mirror hd1 hd2 \
                        mirror hd3 hd4 \
                        mirror hd5 hd6 \
                        mirror hd7 hd8

#df -h
Filesystem  Size  Used  Avail  Capacity  Mounted on
myzpool      18T   31K    18T        0%  /myzpool

#zpool status -v
        NAME        STATE     READ WRITE CKSUM
        myzpool     ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            hd1     ONLINE       0     0     0
            hd2     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            hd3     ONLINE       0     0     0
            hd4     ONLINE       0     0     0
          mirror-2  ONLINE       0     0     0
            hd5     ONLINE       0     0     0
            hd6     ONLINE       0     0     0
          mirror-3  ONLINE       0     0     0
            hd7     ONLINE       0     0     0
            hd8     ONLINE       0     0     0
And here is the test result:
#Write Test
dd if=/dev/zero of=/storage/data/file.out bs=4096 count=10000000
40960000000 bytes (41 GB) copied, 86.6451 s, 473 MB/s

#Read Test
dd if=/storage/data/file.out of=/dev/null bs=4096
40960000000 bytes (41 GB) copied, 68.4477 s, 598 MB/s
RAIDZ2 x 2
In this design, we split the data into two groups. In each group, we store the data in a RAIDZ2 structure. Since each group uses two disks for parity, the usable space drops from 36TB to 18TB.
#Command
zpool create -f myzpool raidz2 hd1 hd2 hd3 hd4 \
                        raidz2 hd5 hd6 hd7 hd8

#df -h
Filesystem  Size  Used  Avail  Capacity  Mounted on
myzpool      18T    0K    18T        0%  /myzpool

#zpool status -v
        NAME        STATE     READ WRITE CKSUM
        myzpool     ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            hd1     ONLINE       0     0     0
            hd2     ONLINE       0     0     0
            hd3     ONLINE       0     0     0
            hd4     ONLINE       0     0     0
          raidz2-1  ONLINE       0     0     0
            hd5     ONLINE       0     0     0
            hd6     ONLINE       0     0     0
            hd7     ONLINE       0     0     0
            hd8     ONLINE       0     0     0
And here is the test result:
#Write Test
dd if=/dev/zero of=/storage/data/file.out bs=4096 count=10000000
40960000000 bytes (41 GB) copied, 98.9698 s, 414 MB/s

#Read Test
dd if=/storage/data/file.out of=/dev/null bs=4096
40960000000 bytes (41 GB) copied, 92.963 s, 441 MB/s
Summary
I am not surprised that the striped layout offers the fastest write speed and the maximum storage space. The only drawback is zero data protection. Unless you mirror the data at the server level (e.g., Hadoop) or the data is not important, I would not recommend this design.
Personally, I recommend going with striped RAIDZ, i.e., multiple RAIDZ1 vdevs with no more than 5 disks each. As a rule of thumb, ZFS recommends keeping each vdev to no more than 8 or 9 disks. Based on my experience, ZFS slows down once about 30% free space is left if there are too many disks in a single vdev.
So which design should you use? Here is my recommendation:
#Do you care about your data?
No:  Go with striped.
Yes: See below.

#How many disks do you have?
1:     ZFS is not for you.
2:     Mirror
3-5:   RAIDZ1
6-10:  RAIDZ1 x 2
11-15: RAIDZ1 x 3
16-20: RAIDZ1 x 4
And yes, you can pretty much forget about RAIDZ2, RAIDZ3 and mirror if you need speed and data protection together.
So, you may ask: what should I do if more than one hard drive fails? The answer is: keep an eye on the health of your ZFS pool every day. I have been managing over 60 servers since 2009, and I've used only RAIDZ1 with consumer-level hard drives (most of them actually pulled from external hard drives). So far I haven't had any data loss.
sudo zpool status -v
#or
sudo zpool status -v | grep 'state: ONLINE'
Simply write a program to parse the output of this command and send yourself an email if anything goes wrong. You can include the program in your cron job and have it run daily or hourly. This is my version:
#!/bin/bash
result=`sudo zpool status -x`
if [[ $result != 'all pools are healthy' ]]; then
    echo "Something is wrong."
    #Do something here, such as sending an email via HTTP...
    /usr/bin/wget "http://example.com/send_email.php?subject=Alert&body=File%20System%20Has%20Problem" -O /dev/null > /dev/null
    exit 1;
fi
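To schedule a script like this, a crontab entry along these lines would run it hourly (the path `/root/check_zpool.sh` is just an assumed location; install the entry with `crontab -e`):

```shell
# Hypothetical crontab entry: run the pool health check at the top of every hour.
0 * * * * /root/check_zpool.sh
```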
Enjoy ZFS.
–Derrick
Looks like this was posted over a year ago (Sep 25, 2014), but it is still very helpful. Thanks Derrick
The design of data redundancy hasn't changed for years. Hardware improvements will raise the absolute speeds, but the relative speeds / ratios will probably still hold for years, e.g., RAIDZ3 is still slower than simple data striping.
Thanks!
–Derrick
How about the read speeds in comparison? For example, mirror vdevs should have much faster read speeds than raidz(n), because they are able to read in parallel?
And how come it seems like most documentation says mirrored is always faster than raidz(n), but benchmarks always seem to show the opposite? (https://pthree.org/2012/12/05/zfs-administration-part-ii-raidz/) (https://calomel.org/zfs_raid_speed_capacity.html)
You used /dev/random as your data source which may block when the entropy runs low. Have you taken this problem into account? Why didn’t you use the non-blocking /dev/urandom equivalent instead?
I used FreeBSD to perform the tests. In FreeBSD, /dev/random and /dev/urandom are the same.
ls -al /dev | grep random
crw-rw-rw- 1 root wheel 0x20 Oct 14 04:49 random
lrwxr-xr-x 1 root wheel 6B Oct 14 09:49 urandom -> random
The raidz2 and 2 x raidz performance look very similar, with the latter being slightly faster and what you recommend. However you don’t get any data protection during a disk rebuild, and if any of the remaining drives has a failure on any block during the rebuild then that data is unrecoverable.
raidz2 seems to offer only a very slight performance degradation compared to 2 x raidz, in exchange for a significant reliability improvement. Did I misread this?
Yes and no. It really depends on your hardware configuration and how you set up ZFS. For example, if you are talking about a small RAIDZ2 (e.g., around 5 disks), then the difference will be small. However, if you have a big RAIDZ2 (e.g., 14 disks), then I am pretty sure RAIDZ x 2 (i.e., 7 disks x 2) will perform much faster than RAIDZ2.
No need to create monitoring scripts: zed is packaged with ZFS and monitors ZFS events, such as a disk being taken offline.
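For reference, a minimal sketch of that approach: on ZFS on Linux, zed reads its settings from `/etc/zfs/zed.d/zed.rc` (the path and variable names may differ by platform and version, so check your distribution's copy):

```shell
# Hypothetical excerpt of /etc/zfs/zed.d/zed.rc
ZED_EMAIL_ADDR="admin@example.com"   # where zed sends event notifications
ZED_NOTIFY_INTERVAL_SECS=3600        # throttle repeated notifications
```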
Have only one disk? ZFS is still for you.
I am afraid you do not know about the ZFS copies parameter of a pool.
With only one disk, ZFS can still save you from "silent data corruption" if you activate multiple copies on the same pool... it can work with just one disk.
Ah, and do not forget about compression... and maybe dedup if you have plenty of RAM.
Etc.
Yes, yes... one disk, and if that one disk fails, all is lost.
But ZFS can never be used as a reason to skip backups... you still need them.
No system can prevent typical human error: I meant to delete folder Y but deleted folder X instead... and lost all the important data except the garbage... in this example, X was important data and Y was garbage.
I use ZFS on every USB HDD that I use for backups, with at least two copies (yes, I reduce the usable size to half, as with a mirror)... that way I can ensure (having five or more zpools, each one on its own USB HDD) my backups will not get "silent data corruption"... I manually sync them, having only two of them powered at the same time.
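For reference, the `copies` property mentioned here is set per dataset; a sketch of how it is used (the pool and dataset names are assumptions for illustration):

```shell
# Keep two copies of every block on a single-disk pool (halves usable space):
zfs set copies=2 backuppool/data

# Verify the setting:
zfs get copies backuppool/data
```

Note that `copies` only protects against localized corruption (bad sectors); it cannot protect against a whole-disk failure.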
That's interesting. I never thought about using single-disk ZFS, mainly because I don't like the idea of putting all my eggs in one basket. If I have important data to store, I always go with raidz if possible. I personally don't use ZFS copies because I don't feel safe using a single ZFS dataset for backup. I always have two or more servers, where one acts as a master and the others act as slaves/mirrors. That way, if one fails, the others will still be available. Don't forget that there are many other failure factors, such as motherboard, power interruption, bad memory, etc.
It would have been nice to see IOPS benchmarks since you went through all of this.
I studied your article, enjoyed it, and learned much. I did my own benchmarks on my new system and, after following most of the "best practices" I've read, still managed to get only about 147 MB/sec on writes and only a bit better on reads. I assume the problem is in the tuning. The hardware is an ASRock x299 i9 Fatal1ty with an Intel Core i7-7820X Extreme 3.6 GHz, 128GB of DDR4-2666 quad-channel memory, and 6 Seagate BarraCuda Pro 6TB disks. My question is: why do I only get 147 MB/sec while you got between 600 and 700 MB/sec? Please advise. Thank you very much for your time.
Intel introduced one new feature in its 7th generation of i7 CPUs: Optane Cache. Personally, I don't have any experience with it. I only know that it acts like a ZFS log and cache device for a regular hard drive. Some other people have mentioned that Optane Cache doesn't work as expected when used with ZFS. So I would investigate the Optane Cache options in your motherboard/BIOS.
I do not know if I have this Optane Cache or not; I vaguely remember reading that it only works on Windows 10. All I know is that with the $8000+ I just spent on this "fast" hardware, I should be getting better performance on ZFS than what I am seeing. I run Fedora Linux on my other machines and am very new to FreeBSD. I would like this new hardware (and FreeBSD/ZFS) to eventually become my "main" file server, but the learning curve is a bit steeper than I anticipated. What would you suggest I do to resolve the speed (lack of) issue? Thanks. –gs–
Optane Cache works at the BIOS level, i.e., when the OS needs to access the hard drive, the BIOS/motherboard places the data in the Optane cache device for performance purposes. It's like running ZFS on a hardware RAID, which is redundant. I would definitely check the BIOS and disable Optane Cache.
My tests are mainly based on FreeBSD and CentOS (Linux kernel v3). As far as I know, ZFS on Linux doesn't like kernel v4 (which is what Fedora mainly uses). The performance depends on multiple factors:
– How the hard drives are connected together. I always group the disks of the same vdev on the same RAID card. Also, I never enable the RAID option in the card BIOS, and I use SAS cables if possible.
– I usually turn off the unnecessary features in the BIOS, e.g., Audio, boot from LAN, minimize the graphic memory etc.
– I test the I/O speed of each hard drive first. In FreeBSD, you can do something like:
dd if=/dev/random of=/dev/adX1 bs=16M count=1000
In Linux, you should not use /dev/random because its algorithm is very slow. You can use /dev/zero instead:
dd if=/dev/zero of=/dev/sdX bs=16M count=1000
If your individual hard drives fail to reach full speed, then I don't see any reason why your RAID setup would.
I documented my entire journal here too: How to improve ZFS performance
–Derrick
This FreeBSD box with ZFS is NOT Fedora, as I may have implied. This new hardware has been tested with a bash script I wrote that tries each permutation of raidz1, raidz2, and mirror for several combinations of drives: ada3; ada3 and 4; ada3, 4, and 5 ... up to ada3 through ada9. I used /dev/random because I found that compression was making unfairly easy work of the zeros when I originally used /dev/zero. Each combination of raidzN with each number of whole disks (ada3 through ada8) gave similar results, +-10%, about 150MB/sec. Can I email you my sysctl.conf and loader.conf and my test script, along with a few other relevant details? Maybe you have some insight. What is your email address? It is quite a difference between my 150MB/s and your 700+MB/s. Like WOW.
It would be interesting to see the same test, but with lots of small files instead of one gigantic 40GB file. I've read that RAIDZ1/Z2 will have better performance with large files than mirrored pools, but mirrors will beat RAIDZ1/2 in performance if you have lots of small files instead. Thanks for sharing your test results.
Source: https://forums.freenas.org/index.php?threads/some-differences-between-raidz-and-mirrors-and-why-we-use-mirrors-for-block-storage.44068/
NICE SUMMARY DERRICK.. good work.
Great job thank you.
Thank you for this comparison. This gives a good idea on what to expect.
very useful, thanks a lot !
I know this is very old and I don’t know if you still monitor this.
I'm confused about RAIDZ2 vs RAIDZ2 x 2. Why isn't the first slower than the latter? I would expect the striping across the two RAIDZ2s to give a speed boost?
RAIDZ2 x 2 is not a mirror. It is two sets of RAIDZ2 vdevs. RAIDZ2 is always faster than RAIDZ2 x 2, because the CPU needs to calculate two parities per write for RAIDZ2, versus four parities for RAIDZ2 x 2.
I see, but then where does the speed gain come from with raidz1 x 2 vs raidz1?
Exactly! Though the results are solid, could someone please explain why this is?
Hello from the future!
Just found this as I am reconfiguring my home NAS. I am reconfiguring my (10) 3TB WD Reds.
They are mostly used for media, etc., and I had always planned on RAID Z2 (10) x 1 using the 10 drives.
However, after looking at this I may go with RAID Z1 (5) x 2.
Good info, thanks, and be well!
Well, since nobody noticed, I would like to point out a little mistake in this blog.
In the Striped part, the two dd commands in the test result are the same; the write test is the incorrect one.
```
And here is the test result:
#Write Test
dd if=/myzpool/data/file.out of=/dev/null bs=4096
40960000000 bytes (41 GB) copied, 58.111 s, 705 MB/s
#Read Test
dd if=/myzpool/data/file.out of=/dev/null bs=4096
40960000000 bytes (41 GB) copied, 59.6386 s, 687 MB/s
```
Thanks for the test!
> Typically hard drive of 5TiB of can hold about 4.5 TB of data
You mixed up binary and SI units: https://superuser.com/questions/504/why-are-hard-drives-never-as-large-as-advertised/530#530
The drive is advertised as 5TB and holds 4.5 TiB. If the software shows you 4.5TB it is a bug in that software.