How to improve ZFS performance

First Edition: March 14, 2010
Last Revised: December 15, 2016

In this tutorial, I will show you how to improve the performance of your ZFS using the consumer-grade components (e.g., Gigabit network card, standard SATA non-SSD hard drives etc).

Many people found a problem on their ZFS system. The speed is slow! It is slow to read or write files to the system. In this article, I am going to show you some tips on improving the speed of your ZFS file system.

Linux users: I have wrote an article about the things should be aware of when running ZFS on Linux. Please click here for details.

Table of Content
  1. A 64-bit decent CPU + Lots of memory
  2. Tweaking the Boot Loader
  3. Use disks with the same specifications
  4. Use a powerful power supply
  5. Enable Compression
  6. Identify the bottleneck
  7. Keep your ZFS up to date
  8. Use Cache / Log Devices
  9. Two drives is better than one single drive
  10. Use a combination of Strip and RAIDZ if speed is your first concern.
  11. Distribute your free space evenly
  12. Make your pool expandable
  13. Backup the data on a different machine, not on the same pool
  14. Rsync or ZFS send?
  15. Did you enable dedup? Disable it!
  16. Reinstall Your Old System
  17. Connect your disks via high speed interface
  18. Do not use up all spaces
  19. Use AHCI, not IDE
  20. Refresh Your Pool
  21. My Settings

Improve ZFS Performance: Step 1

A 64-bit decent CPU + Lots of memory

Traditionally, we are told to use a less powerful computer for a file/data server. That’s not true for ZFS. ZFS is more than a file system. It uses a lot of resources to improve the performance of the input/output, such as compressing data on the fly. For example, suppose you need to write a 1GB file. Without enabling the compression, the system will write the entire 1GB file on the disk. With the compression being enabled, the CPU will compress the data first, and write the data on the disk after that. Since the compressed data is smaller, it takes shorter time to write to the disk, which results a higher writing speed. The same thing can be applied for reading. ZFS can cache the file for you in the memory, it will result a higher reading speed.

That’s why a 64-bit CPU and higher amount of memory is recommended. I recommended at least a Quad Core CPU with 4GB of memory (Personally I have: i7 920/24GB & Q6700/8GB).

Please make sure that the memory modules should have the same frequencies/speed. If you mix them with different speed, try to group the memories with same speed together., e.g., Channel 1 and Channel 2: 1333 MHz, Channel 3 and Channel 4: 1600 MHz.

Let’s do a test. Suppose I am going to create a 10GB file with all zero. Let’s see how long does it take to write on the disk:

#CPU: i7 920 (8 cores) + 24GB Memory + FreeBSD 9.3 64-bit
#time dd if=/dev/zero of=./file.out bs=1M count=10k
10240+0 records in
10240+0 records out
10737418240 bytes transferred in 6.138918 secs (1749073364 bytes/sec)

real    0m6.140s
user    0m0.023s
sys     0m3.658s

That’s 1.6GB/s! Why is it so fast? That’s because it is a zero based file. After the compression, a compressed 10GB file may result in several bytes only. Since the performance of the compression is highly depended on the CPU, that’s why a fast CPU matters.

Now, let’s do the same thing on a not-so-fast CPU:

#CPU: AMD 4600 (2 cores) + 5GB Memory + FreeBSD 9.3 64-bit
#time dd if=/dev/zero of=./file.out bs=1M count=10k
10240+0 records in
10240+0 records out
10737418240 bytes transferred in 23.672373 secs (453584362 bytes/sec)

real    0m23.675s
user    0m0.091s
sys     0m22.409s

That’s 434MB/s only. See the difference?

Improve ZFS Performance: Step 2

Tweaking the Boot Loader Parameters

Update: This section was written based on FreeBSD 8. As of today (February 26, 2016), the latest version is FreeBSD 9.3 / FreeBSD 10.2. I noticed that my FreeBSD/ZFS pair work very stable even without any tweaking! In the other words, you may skip this section if you are not using FreeBSD 8.2 or older.

Many people complain about ZFS for its stability issues, such as kernel panic, reboot randomly, crash when copying large files (> 2GB) at full speed etc. It may have something to do with the boot loader settings. By default, ZFS will not work smoothly without tweaking the system parameters system. Even FreeBSD (9.1 or earlier) claims that no tweaking is necessary for 64-bit system, my FreeBSD server crashes very often when writing large files to the pool. After trial and error for many times, I figure out few equations. You can tweak your boot loader (/boot/loader.conf) using the following parameters. Notice that I only tested the following on FreeBSD. Please let me know whether the following tweaks work on other operating systems.

If you experiences kernel panic, crash or something similar, it could be the hardware problem, such as memory. I encourage to test all memory modules by using Memtest86+ first. I wish someone told me about this few years ago. That would make my life a lot easier.

Warning: Make sure that you save a copy before doing anything to the boot loader. Also, if you experience anything unusual, please remove your changes and go back to the original settings.

#Assuming 8GB of memory

#If Ram = 4GB, set the value to 512M
#If Ram = 8GB, set the value to 1024M

#Ram x 0.5 - 512 MB

#Ram x 2

#Ram x 1.5

#The following were copied from FreeBSD ZFS Tunning Guide

# Disable ZFS prefetching
# Increases overall speed of ZFS, but when disk flushing/writes occur,
# system is less responsive (due to extreme disk I/O).
# NOTE: Systems with 4 GB of RAM or more have prefetch enabled by default.

# Decrease ZFS txg timeout value from 30 (default) to 5 seconds.  This
# should increase throughput and decrease the "bursty" stalls that
# happen during immense I/O with ZFS.
# default in FreeBSD since ZFS v28

# Increase number of vnodes; we've seen vfs.numvnodes reach 115,000
# at times.  Default max is a little over 200,000.  Playing it safe...
# If numvnodes reaches maxvnode performance substantially decreases.

# Set TXG write limit to a lower threshold.  This helps "level out"
# the throughput rate (see "zpool iostat").  A value of 256MB works well
# for systems with 4 GB of RAM, while 1 GB works well for us w/ 8 GB on
# disks which have 64 MB cache.

# NOTE: in v27 or below , this tunable is called 'vfs.zfs.txg.write_limit_override'.

Don’t forget to reboot your system after making any changes. After changing to the new settings, the writing speed improves from 60MB/s to 80MB/s, sometimes it even goes above 110MB/s! That’s a 33% improvement!

By the way, if you found that the system still crashes often, the problem could be an uncleaned file system.

After a system crashes, it may cause the file links to be broken (e.g., the system see the file tag, but unable to locate the files). Usually FreeBSD will automatically run fsck after the crash. However it will not fix the problem for you. In fact, there is no way to clean up the file system when the system is running (because the partition is mounted). The only way to clean up the file system is by entering the Single User Mode (a reboot is required).

After you enter the single user mode, make sure that each partition is cleaned. For example, here is my df result:

Filesystem    Size    Used   Avail Capacity  Mounted on
/dev/ad8s1a   989M    418M    491M    46%    /          
devfs         1.0k    1.0k      0B   100%    /dev       
/dev/ad8s1e   989M     23M    887M     3%    /tmp       
/dev/ad8s1f   159G     11G    134G     8%    /usr       
/dev/ad8s1d    15G    1.9G     12G    13%    /var        

Try running the following commands:

fsck -y -f /dev/ad8s1a
fsck -y -f /dev/ad8s1d
fsck -y -f /dev/ad8s1e
fsck -y -f /dev/ad8s1f

These command will clean up the affected file systems. The parameters f means force, and y means yes.

After the clean up is done, type reboot and let the system to boot to the normal mode.

Improve ZFS Performance: Step 3

Use disks with the same specifications

A lot of people may not realize the importance of using exact the same hardware. Mixing different disks of different models/manufacturers can bring performance penalty. For example, if you are mixing a slower disk (e.g., 5900 rpm) and a faster disk(7200 rpm) in the same virtual device (vdev), the overall speed will depend on the slowest disk. Also, different harddrives may have different sector size. For example, Western Digital releases a harddrive with 4k sector, while the general harddrives use 512 byte. Mixing harddrives with different sectors can bring performance penalty too. Here is a quick way to check the model of your harddrive:

sudo dmesg | grep ad

In my cases, I have the following:

ad10: 1907729MB Hitachi HDS722020ALA330 JKAOA20N at ata5-master UDMA100 SATA 3Gb/s
ad11: 1907729MB Seagate ST32000542AS CC34 at ata5-slave UDMA100 SATA 3Gb/s
ad12: 1907729MB WDC WD20EARS-00MVWB0 51.0AB51 at ata6-master UDMA100 SATA 3Gb/s
ad13: 1907729MB Hitachi HDS5C3020ALA632 ML6OA580 at ata6-slave UDMA100 SATA 3Gb/s
ad14: 1907729MB WDC WD20EARS-00MVWB0 50.0AB50 at ata7-master UDMA100 SATA 3Gb/s
ad16: 1907729MB WDC WD20EARS-00MVWB0 51.0AB51 at ata8-master UDMA100 SATA 3Gb/s

Notice that my Western Digital harddrives with 4k sectors all ends with EARS in the model number.

If you don’t have enough budget to replace all disks with the same specifications, try to group the disks with similar specifications in the same vdev.

Improve ZFS Performance: Step 4

Use a Powerful Power Supply

I recently built a Linux-based ZFS file system with 12 hard disks. For some reasons, it was pretty unstable. When I tried filling the pool with 12TB of data, the ZFS system crashed randomly. When I rebooted the machine, the error was gone. However, when I resumed the copy process again, the error happened on a different disk. In short, the disks failed randomly. When I checked the dmesg, I found something like the following. Since I didn’t have the exact copy, I grabbed something similar from the web:

[  412.575724] ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
[  412.576452] ata10.00: BMDMA stat 0x64
[  412.577201] ata10.00: failed command: WRITE DMA EXT
[  412.577897] ata10.00: cmd 35/00:08:97:19:e4/00:00:18:00:00/e0 tag 0 dma 4096 out
[  412.577901]          res 51/84:01:9e:19:e4/84:00:18:00:00/e0 Emask 0x10 (ATA bus error)
[  412.579294] ata10.00: status: { DRDY ERR }
[  412.579996] ata10.00: error: { ICRC ABRT }
[  412.580724] ata10: soft resetting link
[  412.844876] ata10.00: configured for UDMA/133
[  412.844899] ata10: EH complete
[  980.304788] ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
[  980.305311] ata10.00: BMDMA stat 0x64
[  980.305817] ata10.00: failed command: WRITE DMA EXT
[  980.306351] ata10.00: cmd 35/00:08:c7:00:ce/00:00:18:00:00/e0 tag 0 dma 4096 out
[  980.306354]          res 51/84:01:ce:00:ce/84:00:18:00:00/e0 Emask 0x10 (ATA bus error)
[  980.307425] ata10.00: status: { DRDY ERR }
[  980.307948] ata10.00: error: { ICRC ABRT }
[  980.308529] ata10: soft resetting link
[  980.572523] ata10.00: configured for UDMA/133

Basically, this message means the disk was failed during writing the data. Initially I thought it could be the SMART/bad sectors. However since I observed the problem happend randomly on random disks, I think the problem could be something else. I have tried to replacing the SATA cable, power cable etc. None of them worked. Finally I upgraded my power supply (450W to 600W), and the error was gone.

FYI, here is the specs of my affected system. Notice that I didn’t use any component that required high-power such as graphic card etc.

  • CPU: Intel Q6600
  • Standard motherboard
  • PCI RAID Controller Card with 4 SATA ports
  • WD Green Drive x 12
  • CPU fan x 1
  • 12″ case fan x 3

And yes, you will need a 600W power supply for such a simple system. Also, another thing worth to check is the power cable. Sometimes, using 15 pin power cable is better than 4pin to 15 pin converter.

Improve ZFS Performance: Step 5


ZFS supports compressing the data on the fly. This is a nice feature that improves the I/O speed – only if you have a high speed CPU (such as Quad core or higher). If your CPU is not fast enough, I don’t recommend you to turn on the compression feature, because the benefit from reducing the file size is smaller than the time spent on the CPU computation. Also, the compression algorithm plays an important role here. ZFS supports two compression algorithms, LZJB and GZIP. I personally use lz4 (See LZ4 vs LZJB for more information) because it gives a better balance between the compression radio and the performance. You can also use GZIP and specify your own compression ratio (i.e., GZIP-N). FYI, I tried GZIP-9 (The maximum compression ratio available) and I found that the overall performance gets worse even on my i7 with 12GB of memory.

There is no solid answer here because it all depends on what kind of files you store. Different files such as large file, small files, already compressed files (such as Xvid movie) need different compression settings.

If you cannot decide, just go with lz4. It can’t be wrong:

#Try to use lz4 first.
sudo zfs set compression=lz4 mypool

#If you system does not support lz4, try to use lzjb
sudo zfs set compression=lzjb mypool

Improve ZFS Performance: Step 6

Identify the bottle neck

Sometimes, the limit of the ZFS I/O speed is closely related to the hardware. For example, I set up a network file system that is based on ZFS. Most of the time, I mainly transfer the data in between the servers, rather than within the same server. Therefore, the maximum I/O speed I could get is the capacity of my network card, which is 125MB/s. On average, I can reach to 100-110 MB/s, which is pretty good for consumer grade network adapter.

One day, I decide to explore the options on network bonding, which combines multiple network adapters together on the same server. In theory, it multiplies the bandwidth, and it will increase the ZFS I/O speed.

I set up network bonding on a machine with two network cards. It is a CentOS 7 box. The bonding mode is 0 (load balancing/round-robin). I am able to double the I/O speed in both network and ZFS.

sudo cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: load balancing (round-robin)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: em1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Slave queue ID: 0

Slave Interface: em2
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Slave queue ID: 0

Improve ZFS Performance: Step 7

Keep your ZFS up to date

By default, ZFS will not update the file system itself even if a newer version is available on the system. For example, I created a ZFS file system on FreeBSD 8.1 with ZFS version 14. After upgrading to FreeBSD 8.2 (which supports ZFS version 15), my ZFS file system was still on version 14. I needed to upgrade it manually using the following commands:

sudo zfs upgrade my_pool
sudo zpool upgrade my_pool

Improve ZFS Performance: Step 8

Use Cache / Log Devices

Suppose you have a very fast SSD harddrive. You can use it for logging/caching the data for your ZFS pool.

To improve the reading performance:

sudo zpool add myzpool cache 'ssd device name'

To improve the writing performance:

sudo zpool add myzpool log 'ssd device name'

It was impossible to remove the log devices without losing the data until ZFS v.19 (FreeBSD 8.3+/9.0+). I highly recommend to add the log drives as a mirror, i.e.,

sudo zpool add myzpool log mirror /dev/log_drive1 /dev/log_drive2

Now you may ask a question. How about using a ram disk as log / cache devices? First, ZFS already uses your system memory for I/O, so you don’t need to set up a dedicated ram disk by yourself. Also, using a ram disk for log (writing) devices is not a good idea. When somethings go wrong, such as power failure, you will end up losing your data during the writing.

Improve ZFS Performance: Step 9

Two drives is better than one single drive

Do you know ZFS works faster on multiple devices pool than single device pool, even they have the same storage size?

Improve ZFS Performance: Step 10

Use a combination of Striped and RAIDZ if speed is your first concern.

Striped design also gives the best performance. Since it offers no data protection at all, you may want to use RAIDZ (RAIDZ1, RAIDZ2, RAIDZ3) or mirror to handle the data protection. However, there are too many choices and each of them offer different degree of performance and protection level. If you want a quick answer, try to use a combination of striped and RAIDZ. I posted a very detail of comparison among Mirror, RAIDZ, RAIDZ2, RAIDZ3 and Striped here.

Here is an example of striped with RAIDZ:

zpool create -f myzpool raidz hd1 hd2 hd3 hd4 hd5 \
                        raidz hd6 hd7 hd8 hd9 hd10

#zpool status -v
        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            hd1     ONLINE       0     0     0
            hd2     ONLINE       0     0     0
            hd3     ONLINE       0     0     0
            hd4     ONLINE       0     0     0
            hd5     ONLINE       0     0     0
          raidz1-1  ONLINE       0     0     0
            hd6     ONLINE       0     0     0
            hd7     ONLINE       0     0     0
            hd8     ONLINE       0     0     0
            hd9     ONLINE       0     0     0
            hd10    ONLINE       0     0     0

Improve ZFS Performance: Step 11

Distribute your free space evenly

One of the important tricks to improve ZFS performance is to keep the free space evenly distributed across all devices.

You can check it using the following command:

zpool iostat -v

The free space is show on the second column (available capacity)

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
storage     3.23T  1.41T      0      3  49.1K   439K
  ad4        647G   281G      0      0  5.79K  49.2K
  ad8        647G   281G      0      0  5.79K  49.6K
  ad10       647G   281G      0      0  5.82K  49.6K
  ad16       647G   281G      0      0  5.82K  49.6K
  ad18       647G   281G      0      0  5.77K  49.5K

When ZFS writes a new file to replace the old file in the system, it will first write the file in the free space first, then move the file pointer from the old one to the new one. In this case, even there is a power failure during writing the data, no data will be lost because the file pointer is still pointing to the old file. That’s why ZFS does not need fsck (file system check).

In order to keep the performance at a good level, we need to make sure that the free space is available in every device in the pool. Otherwise ZFS can only write the data to some of the devices only (instead of all). In the other words, the higher number of devices ZFS write, the better the performance.

Technically, if the structure of a zpool has not been modified or alternated, you should not need to worry about the free space distribution because ZFS will take care of that for you automatically. However, when you add a new device to an existing pool, that will be a different story, e.g.,

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
storage     3.88T  2.33T      0      3  49.1K   439K
  ad4        647G   281G      0      0  5.79K  49.2K
  ad8        647G   281G      0      0  5.79K  49.6K
  ad10       647G   281G      0      0  5.82K  49.6K
  ad16       647G   281G      0      0  5.82K  49.6K
  ad18       647G   281G      0      0  5.77K  49.5K
  ad20          0   928G      0      0  5.77K  49.5K

In this example, I add a 1TB hard drive (ad20) to my existing pool, which gives about 928GB of free space. Let say I add a 6GB file, the free space will look something like this:

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
storage     4.48T  1.73T      0      3  49.1K   439K
  ad4        648G   280G      0      0  5.79K  49.2K
  ad8        648G   280G      0      0  5.79K  49.6K
  ad10       648G   280G      0      0  5.82K  49.6K
  ad16       648G   280G      0      0  5.82K  49.6K
  ad18       648G   280G      0      0  5.77K  49.5K
  ad20         1G   927G      0      0  5.77K  49.5K

In the other words, ZFS will still divide my 6GB file into six equal pieces and write each piece to each device. Eventually, ZFS will use up the free space in the older devices, and it can write the data to the new devices only (ad20), which will decrease the performance. Unfortunately, there is no way to redistribute the data / free space evenly without destroying the pool, i.e.,

1. Back up your data
2. Destroy the pool
3. Rebuild the pool
4. Put your data back

Depending on how much data do you have, it can take 2 to 3 days to copy 10TB of data from one server to another server over a gigabit network. You don’t want to use scp to do it because you will need to re-do everything again if the process is dropped. In my case, I use rsync:

(One single line)

#Run this command on the production server:
rsync -avzr --delete-before backup_server:/path_to_zpool_in_backup_server /path_to_zpool_in_production_server

Of course, netcat is a faster way if you don’t care about the security. (scp / rsync will encrypt the data during transfer).

See here for further information

Improve ZFS Performance: Step 12

Make your pool expandable

Setting up a ZFS system is more than a one-time job. Unless you take a very good care of your storage like how supermodels monitor their body weights, otherwise you will end up using all of the available space one day. Therefore it is a good idea to come up a good design that it can grow in the future.

Suppose we want to build a server with maximum storage capacity, how will we start? Typically we try to put as many hard drives on a single machine as possible, i.e., it will be be around 12 to 14 hard drives, which is what a typical consumer grade full tower computer case can hold. Let’s say we have 12 disks, here are couple setups which maximize storage capacity with a decent level of data safety:

In this design, we create a giant pool and let the ZFS to take care of the rest. This pool will offer n-2 storage capacity which will allow up to 2 hard drives fail without losing any data.

In the second design, it offers the same level of storage capacity and a similar level of data protection. It allows up to one failure disk in each vdev. Keep in mind that the first design offers a great data protection. However, the second design will offer a better performance and greater flexibility in terms of future upgrade. Check out this article if you want to learn more about the difference in ZFS design.

First, let’s talk about the good and bad of the first design. It offers a great data security because it allows ANY two disks in the zpool to fail. However, it has couple disadvantages. ZFS works great when the number of disk of vdev is small. Ideally, the number should be smaller than 8 (Personally, I will stick with 5). In the first design, we put 12 disks in one single vdev, which will be problematic when the storage is getting full (>90%). Also, when we talks about upgrading the entire zpool, we will need to upgrade each disk one by one first. We won’t be able to use the extra space until we replace all 12 disks. This may be an issue for those who do not have budget to get 12 new disks at a time.

For the second design, it does not have the problem mentioned in the first design. The number of disk in each vdev is small (6 disks in each vdev). For those who don’t have plenty of budgets, it is okay to get six disks at a time to expand the pool.

Here is how to create the second design:

sudo zpool create myzpoolname raidz /dev/ada1 /dev/ada2 ... /dev/ada6 raidz /dev/ada7 /dev/ada8 /dev/ada9 ... /dev/ada12

Here is how to expand the pool by replacing the hard drive one by one without losing any data:

1. Shutdown the computer, replace the hard drive and turn on the computer.

2. Tell ZFS to replace the hard drive. This will force it to fill in the new hard drive with the existing data based on the check sum.
zpool replace mypool /dev/ada1

3. Resilver the pool
zpool scrub mypool 

4 Shutdown the server and replace the second ard drive again. Repeat the steps until everything is done.

5. zpool set autoexpand=on mypool 

6. Resilve the pool if needed.
zpool scrub mypool 

Improve ZFS Performance: Step 13

Backup your data on a different machine, not on the same pool

ZFS comes with a very cool feature. It allows you to save multiple copies of the same data in the same pool. This adds an additional layer on data security. However, I don’t recommend using this feature for backup purpose because it adds more work when writing the data to the disks. Also, I don’t think this is a good way to secure the data. I prefer to set up a mirror on a different server (Master-Slave). Since the chance of two machines fail at the same time is much smaller than one machine fails. Therefore the data is safer in this settings.

Here is how I synchronize two machines together:

(Check out this guide on how to use rsyncd)

Create a script in the slave machine:
(One single line)

rsync -avzr -e ssh --delete-before master:/path/to/zpool/in/master/machine /path/to/zpool/in/slave/machine

And put this file in a cronjob, i.e.,

@daily root /path/to/

Now, you may ask a question. Should I go with strip-only ZFS (i.e., stripping only. No mirror, RAIDZ, RAIDZ2) when I set up my pool? Yes or no. ZFS allows you to mix any size of har ddrive in one single pool. Unlike RAID[0,1,5,10] and concatenation, it can be any size and there is no lost in the disk space, i.e., you can connect 1TB, 2TB, 3TB into one single pool while enjoying the data-stripping (Total usable space = 6TB). It is fast (because there is no overhead such as parity etc) and simple. The only down side is that the entire pool will stop working if at least one device fails.

Let’s come back to the question, should we employ simple stripping in production environment? I prefer not. Strip-only ZFS divides all data into all vdev. If each vdev is simply a hard drive, and if one fails, there is NO WAY to get the original data back. If something screws up in the master machine, the only way is to destroy and rebuild the pool, and restore the data from the backup. (This process can takes hours to days if you have large amount of data, say 6TB.) Therefore, I strongly recommend to use at least RAIDZ in the production environment. If one device fails, the pool will keep working and no data is lost. Simply replace the bad hard drive with a good one and everything is good to go.

To minimize the downtime when something goes wrong, go with at least RAIDZ in a production environment (ideally, RAIDZ or strip-mirror).

For the backup machine, I think using simple stripping is completely fine.

Here is how to build a pool with simple stripping, i.e., no parity, mirror or anything

zpool create mypool /dev/dev1 /dev/dev2 /dev/dev3

And here is how to monitor the health

zpool status

Some websites suggest to use the following command instead:

zpool status -x

Don’t believe it! This command will return “all pools are healthy” even if one device is failed in a RAIDZ pool. In the other words, your data is healthy doesn’t mean all devices in your pool are healthy. So go with “zpool status” at any time.

FYI, it can easily takes few days to copy 10TB of data from one machine to another through a gigabit network. In case you need to restore large amount of data through the network, use rsync, not scp. I found that scp sometimes fail in the middle of transfer. Using rsync allows me to resume it at any time.

Improve ZFS Performance: Step 14

rsync or ZFS send?

So what’s the main difference between rsync and ZFS send? What’s the advantage of one over the other?

Rsync is a file level synchronization tool. It simply goes through the source, find out which files have been changed, and copy the corresponding files to the destination. Also rsync is portable/cross-platform. Unlike ZFS, rsync is available in most Unix platforms. If your backup platform does not support ZFS, you may want to go with rsync.

ZFS send is doing something similar. First, it takes a snapshot on the ZFS pool first:

zfs snapshot mypool/vdev@20120417

After that, you can generate a file that contains the pool and data information, copy to the new server to restore it:

#Method 1: Generate a file first
zfs send mypool/vdev@20120417 > myZFSfile
scp myZFSfile backupServer:~/
zfs receive mypool/vdev@20120417 < ~/myZFSfile

Or you can do everything in one single command line:

#Method 2: Do everything over the pipe (One command)
zfs send pool/vdev@20120417 | ssh backupServer zfs receive pool/vdev@20120417

In general, the preparation time of ZFS send is much shorter than rsync, because ZFS already knows which files have been modified. Unlike rsync, a file-level tool, ZFS send does not need to go though the entire pool and find out such information. In terms of the transfer speed, both of them are similar.

So why do I prefer rsync over ZFS send (both methods)? It’s because the latter one is not practical! In method #1, the obvious issue is the storage space. Since it requires generating a file that contains your entire pool information. For example, suppose your pool is 10TB, and you have 8TB of data (i.e., 2TB of free space), if you go with method #1, you will need another 8TB of free space to store the file. In the other words, you will need to make sure that at least 50% of free space is available all the time. This is a quite expensive way to run ZFS.

What about method #2? Yes, it does not have the storage problem because it copies everything over the pipe line. However, what if the process is interrupted? It is a common thing due to high traffic in the network, high I/O to the disk etc. Worst worst case, you will need to re-do everything again, say, copying 8TB over the network, again.

rsync does not have these two problems. In rsync, it uses relatively small space for temporary storage, and in case the rsync process is interrupted, you can easily resume the process without copying everything again.

Improve ZFS Performance: Step 15

Disable dedup if you don’t have enough memory (5GB memory per 1TB storage)

Deduplication (dedup) is a space-saving technology. It works at the block level (a file can have many blocks). To explain it in simple English, if you have multiple copies of the same file in different places, it will store only one copy instead of multiple copies. Notice that dedup is not the same as compression. Check out this article: ZFS: Compression VS Deduplication(Dedup) in Simple English if you wan to learn more.

The idea of dedup is very simple. ZFS maintains an index of your files. Before writing any incoming files to the pool, it checks whether the storage has a copy of this file or not. If the file already exists, it will skip the file. With dedup enabled, instead of store 10 identical files, it stores one only copy. Unfortunately, the drawback is that it needs to check every incoming file before making any decision.

After upgrading my ZFS pool to version 28, I enabled dedup for testing. I found that it really caused huge performance hit. The writing speed over the network dropped from 80MB/s to 5MB/s!!! After disabling this feature, the speed goes up again.

sudo zfs set dedup=off your-zpool

In general, dedup is an expensive feature that requires a lot of hardware resources. You will need 5GB memory per 1TB of storage (Source). For example, if zpool is 10TB, I will need 50GB of memory! (Which I only have 12GB). Therefore, think twice before enabling dedup!

Notice that it won’t solve all the performance problem by disabling the dedup. For example, if you enable dedup before and disable it afterward, all files stored during this period are dedup dependent, even dedup is disabled. When you need to update these files (e.g., delete), the system still needs to check again the dedup index before any processing your file. Therefore, the performance issue still exists when working with these affected files. For the new files, it should be okay. Unfortunately, there is no way to find out the affected dedup files. The only way is to destroy and re-build the ZFS pool, which will clear the list of dedup files.

Improve ZFS Performance: Step 16

Reinstall Your Old System

Sometimes, reinstalling your old system from scratch may help to improve the performance. Recently, I decided to reinstall my FreeBSD box. It was an old FreeBSD box that was started with FreeBSD 6 (released in 2005, about 8 years ago from today). Although I upgraded the system every release, it already accumulated many junk and unused files. So I decide to reinstall the system from scratch. After the installation, I can tell that the system is more responsive and stable.

Before you wipe out the system, you can export the ZFS tank using the following command:

sudo zpool export mypool

After the work is done, you can import the data back:

sudo zpool import mypool

Improve ZFS Performance: Step 17

Connect your disks via high speed interface

Recently, I found that my overall ZFS system is slow no matter what I have done. After some investigations, I noticed that the bottle neck was my RAID card. Here are my suggestions:

1. Connect your disks to the ports with highest speed. For example, my PCI-e RAID card deliveries higher speed than my PCI RAID card. One way to verify the speed is by using dmesg, e.g.,

dmesg | grep MB

#Connected via PCI card. Speed is 1.5Gb/s
ad4: 953869MB  at ata2-master UDMA100 SATA 1.5Gb/s

#Connected via PCI-e card. Speed is 3.0 Gb/s
ad12: 953869MB  at ata6-master UDMA100 SATA 3Gb/s

In this case, the overall speed limit is based on the slowest one (1.5Gb/s), even the rest of my disks are 3Gb/s.

2. Some RAID cards come with some advanced features such as RAID, linear RAID, compression etc. Make sure that you disable these features first. You want to minimize the workload of the card and maximize the I/O speed. It will only slow down the overall process if you enable these additional features. You can disable the settings in the BIOS of the card. FYI, most of the RAID cards in $100 ranges are “software RAID”, i.e., they are using the system CPU to do the work. Personally, I think these fancy features are designed for Windows users. You really don’t need any of these features in Unix world.

3. Personally, I recommend any brand except Highpoint Rocketraid because of the driver issues. Some of the Highpoint Rocketraid products are not supported by FreeBSD natively. You will need to download the driver from their website first. Their driver is version-specified, e.g., they have two different set of drivers for FreeBSD 7 and 8, and both of them are not compatible with each other. One day if they decide to stop supporting the device, then you either need to stick with the old FreeBSD, or buy a new card. My conclusion: Stay away from Highpoint Rocketraid.

Improve ZFS Performance: Step 18

Do not use up all spaces

Depending on the settings / history of your zpool, you may want to maintain the free space at a certain level to avoid speed-drop issues.

Recently, I found that my ZFS system is very slow in terms of reading and writing. The speed dropped from 60MB/s to 5MB/s over the network. After some investigations, I found that the available space was around 300GB (out of 10TB), which is 3% left. Someone suggest that the safe threshold is about 10%, i.e., the performance won’t be impacted if you have at least 10% of the free space. I would say 5% is the bottom line, because I haven’t noticed any performance issues until it hits 3%.

After I free up some spaces, the speed comes back again.

I think it doesn’t make any sense not to use all of my space. So I decide to find out what caused this problem. The answer is the zpool structure.

In my old setup, I put set up a single RAIDZ vdev with 8 disks. This gives me basic data security (up to one disk fails), and maximum disk spaces (Usable space is 7 disks). However, I notice that the speed drops a lot when the available free space was 5%.

In my experiment setup, I decide to do the same thing with RAIZ2, i.e., it allows up to two disks fail, and the usable space is down to 6 disks. After filling up the pool, I found that it does not have the speed-drop problem. The I/O speed is still fast even the free space is 10GB (That’s 0.09%).

My conclusion: RAIDZ is okay up to 6 devices. If you want to add more devices, either use RAIDZ2 or split them into multiple vdevs:

#Suppose I have 8 disks (/dev/hd1 ... /dev/hd8).

#One vdev
zpool create myzpool raidz2 /dev/hd1 /dev/hd2 ... /dev/hd8

#Two vdevs
zpool create myzpool raidz /dev/hd1 ... /dev/hd4 raidz /dev/hd5 ... /dev/hd8

Improve ZFS Performance: Step 19


Typically there is a setting to control how the motherboard interacts with the hard drives: IDE or AHCI. If your motherboard has IDE ports (or manufactured before 2009), it is likely that the default value is set to IDE. Try to change to AHCI. Believe me, this litter tweak can save you countless of hours on debugging.

FYI, here is my whole story.

Improve ZFS Performance: Step 20

Refresh your pool

I had set up my zpool for five years. Over the past five years, I had performed lots of upgrade and changed a lot of settings. For example, during the initial set up, I didn’t enable the compression. Later, I set the compression to lzjb and changed it to lz4. I also enabled and disabled the dedup. So you can imagine some part of the data is compressed using lzjb, some data has dedup enabled. In short, the data in my zpool has all kind of different settings. That’s dirty.

The only thing I can clean up is to destroy the entire zpool and rebuild the whole thing. Depending on the size of your data, it can take 2-3 days to transfer 10TB of data from one to another server, i.e., 4-6 days round trip. However, you will see the performance gain in long run.

I recommend cleaning the pool every 3 years.

Improve ZFS Performance: Step 21

My Settings – Simple and Clean

Here is the what I have done to my 12-disk ZFS pool (Usable space: 18TB). Notice that I haven’t done much to it. I tried to keep it simple and clean. The result? The top speed is about 80-100MB/s (max speed is 126MB/s) over a consumer-grade gigabit network.

#What I have done
sudo zpool history
History for 'storage':
2014-01-22.20:28:59 zpool create -f storage raidz /dev/ada1 /dev/ada2 /dev/ada3 /dev/ada4 raidz /dev/ada5 /dev/ada6 /dev/ada7 /dev/ada8 raidz /dev/da0 /dev/da1 /dev/da2 /dev/da3
2014-01-22.20:29:07 zfs create storage/data
2014-01-22.20:29:15 zfs set compression=lz4 storage
2014-01-22.20:29:19 zfs set compression=lz4 storage/data
2014-01-22.20:30:19 zfs set atime=off storage

#sudo zfs upgrade
This system is currently running ZFS filesystem version 5.

#sudo zpool upgrade
#Based on version 28
This system supports ZFS pool feature flags.

All pools are formatted using feature flags.

Every feature flags pool has all supported features enabled.

A simple nload when running rsync over a gigabit network:

#nload -u M

Curr: 99.81 MByte/s
Avg: 95.25 MByte/s
Min: 76.89 MByte/s
Max: 102.22 MByte/s
Ttl: 894.14 GByte

Here are the hardware I used:

9.3-RELEASE-p5 FreeBSD 9.3-RELEASE-p5 #0: Mon Nov  3 22:38:58 UTC 2014  amd64

#System Drive
#A PATA 200MB drive bought in 2005 (Some experts suggest to install the OS on a USB drive to save a port)
ada0: maxtor STM3200820A 3.AAE ATA-7 device
ada0: 100.000MB/s transfers (UDMA5, PIO 8192bytes)
ada0: 190782MB (390721968 512 byte sectors: 16H 63S/T 16383C)

#ZFS Drives
#8 x 2TB standard SATA hard drives connected to the motherboard
#4 x 3TB standard SATA hard drives connected to a PCI-e RAID card

da0: 2861588MB (5860533168 512 byte sectors: 255H 63S/T 364801C)
da1: 2861588MB (5860533168 512 byte sectors: 255H 63S/T 364801C)
da2: 2861588MB (5860533168 512 byte sectors: 255H 63S/T 364801C)
da3: 2861588MB (5860533168 512 byte sectors: 255H 63S/T 364801C)

ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 512bytes)
ada1: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
ada2: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada2: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
ada3: 300.000MB/s transfers (SATA 2.x, UDMA5, PIO 8192bytes)
ada3: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
ada4: 300.000MB/s transfers (SATA 2.x, UDMA5, PIO 8192bytes)
ada4: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
ada5: 300.000MB/s transfers (SATA 2.x, UDMA5, PIO 8192bytes)
ada5: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
ada6: 300.000MB/s transfers (SATA 2.x, UDMA5, PIO 8192bytes)
ada6: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
ada7: 300.000MB/s transfers (SATA 2.x, UDMA5, PIO 8192bytes)
ada7: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
ada8: 300.000MB/s transfers (SATA 2.x, UDMA5, PIO 8192bytes)
ada8: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)

hw.machine: amd64
hw.model: Intel(R) Core(TM) i7 CPU         920  @ 2.67GHz
hw.ncpu: 8
hw.machine_arch: amd64

#Memory - 6 x 4GB
real memory  = 25769803776 (24576 MB)
avail memory = 24823390208 (23673 MB)

#Network - Gigabit network on the motherboard
re0: realtek 8168/8111 B/C/CP/D/DP/E/F PCIe Gigabit Ethernet

Enjoy ZFS.


Our sponsors:


  1. “Use disks with the same specifications”.
    “For example, if you are mixing a slower disk (e.g., 5900 rpm) and a faster disk(7200 rpm) in the same ZFS pool, the overall speed will depend on the slowest disk. ”

    It is not true. You can mix different sizes and speeds of devices as top-level devices. ZFS will dynamically redistributed write bandwidth according to estimated speed and latency.

    You hower should not mix different sizes and speeds in single vdev, like mirror or raidz2. For example in mirror, zfs will wait for both devices to complete write. In theory it could wait just for one of them (at least when not using O_DIRECT or O_SYNC or fsync, etc.), but this can be pretty nasty when doing recovery after one of the devices fails.

    1. Hi Witek,

      Thanks for your comment. What I was talking about was the 512 / 4k sector issue that yields performance penalty. For example, if I mixed harddrive with 512 sector and 4k sectors together, ZFS will try to run all disks with 512 sector, which will gives performance penalty.


  2. You said : “Use Mirror, not RAIDZ if speed is your first concern.”

    But RaidZ in vdev as :
    vdev1: RAID-Z2 of 6 disks
    vdev2: RAID-Z2 of 6 disks

    could provide speed and security ?

    1. Hi ST3F,

      No, I don’t think so. RAIDZ or RAIDZ2 is definitely slower than pure striping, i.e.,

      zpool create mypool disk1 disk2 disk3 disk4 disk5 disk6

      And I think this will be faster:
      zpool create mypool mirror disk1 disk2 mirror disk3 disk4 mirror disk5 disk6

      Of course, the trade off for the speed is losing disk space.

  3. So instead of buying/building expensive server (ECC memory, Xeon and so on) the better, cheaper and more reliable alternative is to buy two (or even more) customers-grade computers and do syncronization between it.

    1. Yes. All members in my data server farm are consumer level computers (e.g., i7, Quad-Core, dual-core with non-ECC memory and consumer level hard drives). So far the reliability is pretty good. Actually, the gaming PC has a very high reliability, because they need to operate at high temperature. Using a gaming quality computer for data server is a good choice.

  4. Hi Derrick,

    You have written a very good article for ZFS beginners, but I would like to comment on some of the claims.

    “Since it is impossible to remove the log devices without losing the data.”, that is incorrect. All writes are first written to the ARC (which unlike the L2ARC is not only a read cache), and if ZIL/log is enabled the write is then written to that device, and when confirmed written it’ll acknowledge back to the client. In case the ZIL dies, no data is lost, unless the machine looses it’s power. What will happen however is that the writes will become as slow as the spindles are just as if you had no ZIL/log device, but you will not loose any data.

    That is how I understand it after lurking around and reading about ZFS for a few days.

    I would also recommend to use zfs send instead of rsync to achieve asynchronous replication – this is the preferred method.

    1. Hi Mattias,

      Thanks for catching that. While I wrote this article, ZFS log device removal was not available yet. (I was referencing v. 15, and this feature was introduced in v. 19).

      I just put a new section to talk about rsync vs zfs send.

      Thank again for your comment!


  5. Hi,

    I tested dd+zero on my FreeBSD 9.0-RELEASE , sadly the speed is very slow, only 35MB/s and the disk is keep writing while dd. maybe you disabled ZIL?

    [root@H2 /z]# zfs get compression z/compress
    z/compress compression on local

    [root@H2 /z]# dd if=/dev/zero of=/z/compress/0000 bs=1m count=1k
    1024+0 records in
    1024+0 records out
    1073741824 bytes transferred in 30.203237 secs (35550554 bytes/sec)

    [root@H2 /z]# du -s /z/compress/
    13 /z/compress/

    [root@H2 /z]# zpool status
    pool: z
    state: ONLINE
    scan: scrub canceled on Fri Apr 13 17:06:34 2012

    z ONLINE 0 0 0
    raidz1-0 ONLINE 0 0 0
    ada1 ONLINE 0 0 0
    ada2 ONLINE 0 0 0
    ada3 ONLINE 0 0 0
    ada4 ONLINE 0 0 0
    ada0p4 ONLINE 0 0 0
    da0 ONLINE 0 0 0
    da1 ONLINE 0 0 0

    ada1-ada4 are 2TB SATA disks, ada0p4 is ZIL on Intel SSD, da0 & da1 are USB sticks.

    1. Hi Bob,

      I will try to remove the log and cache devices first. If the speed is good, then the problem is either on the log or cache, otherwise something is not right to your main storage.


  6. Hello,
    You write in #12, rsync or zfs send, about the problem of interruption of the pipeline when doing zfs send/receive to a remote site.
    However, that presumes you are doing a FULL sent every time.
    If instead, you do frequent incrementals (after a first time initial looong sync, this is true), then you almost eliminate this problem.
    I wrote “zrep” to solve this problem.

    It is primarily developed on Solaris: However,it should be functional on freebsd as well, I think.

  7. I really enjoy reading your website. There is one thing I don’t understand. I think that you have a lot of visitors daily but why your website is loading so fast? My website is loading very slowly! What kind of hosting you are using? I heard hostgator is good but its price is quite steep for me. Thanks in advance.

    1. I run my own server because I had a very bad experience with my former web hosting company (, and I am using FreeBSD. It offers much better performance than other operating systems like Windows and Linux.


  8. Disabling checksums is a terribly bad idea. If you do that, then you cannot reliably recover data using the RAID reconstruction techniques… you can only hope that it works, but will never actually know. In other words, the data protection becomes no better than that provided by LVM. For the ZFS community, or anyone who values their data, disabling checksums is bad advice. The imperceptible increase in performance is of no use when your data is lost.

  9. $ dd if=/dev/zero of=./file.out bs=1M count=10k
    10240+0 records in
    10240+0 records out
    10737418240 bytes (11 GB) copied, 130.8 s, 82.1 MB/s

    that’s even much slower… it is on a single core AMD with 8GB of RAM. but I doubt the CPU is the only cause?

    the system runs on a SSD drive (Solaris 11 Express)
    and my main pool is a raidz on 4 Hitachi harddrives.

    your tips don’t give me that much options. #1 will remain the same for now. #2 is FreeBSD related. #3 I’m compliant. #4 not enabled, so compliant again. #5 disabling checksum is not recommended from the ZFS Evil Tunning guide. I disabled atime though, hopefully this will help.

    #6, I am at version 31, much higher than 14,15…. I could try upgrading to 33….

    #7 I will consider it, I thought adding a log/caching device was adding permanent dependency on my pool. but seems like they can be removed from version 19.

    #8 OK, #9 too late and anyway, I have 4 disks, #10 never added a disk, so irrelevant. #11, #12, #13 irrelevant.

    anything else I can try ?

  10. As we know, ZFS works faster on multiple devices pool than single device pool(Step 8)

    However, the above statement is it only apply to local disk? how abt SAN?

    SAN as we know if we allocate (for example) 100GB, this 100GB may makes up by 3 to 4 hard disk on SAN level and present it to Solaris as 1 devices (LUN).

    So according to the “ZFS works faster on multiple devices pool than single device pool”, is it better to create 5 Luns (each for 20GB), then we create a zpool based on 5 Luns to have 100GB?

  11. Jesus… never disable checksums on a zfs pool. That single suggestion alone should cause anyone who reads this post to question everything suggested here.

  12. Hi There,

    Can you help with NFS performance?

    I had a VM Solaris 9 serving NFS share to ESXi (using VTd for access to the RAID controller) This had 4 x 15 K disk via an adaptec 2805.
    Performance was great!

    When I moved to VM Solaris 11 express, the write latency dropped to 800ms.

    I rolled back to 9, all was OK again. I set up ISCSI target in Sol 11 exp, and performance was almost back to normal.

    Any idea why the NFS from VMWARE to Solaris 11 EXP would be so poor?



    1. Hi Shaun,

      Personally I don’t have any experience with Solaris. However, I think the problem might be related to the system settings, which can be one of the following: system settings, kernel version, or your application settings. To identify the problem, I suggest to measure the I/O speed using some basic tools, such as iostat, scp etc. If the speed of two OSes are the same, then you can narrow down the problem to the applications. Hope it helps.


Leave a Reply

Your email address will not be published. Required fields are marked *