ZFS: Compression VS Deduplication (Dedup) in Simple English

Last Edited: Jan 17, 2021

Many people confuse ZFS compression with ZFS deduplication because the two features look so similar: both are designed to reduce the amount of space your data takes up in storage. Let me explain the difference between them in simple English.

1. This is what your data looks like originally (assuming only one unique file):

2. This is what your data looks like after being stored in a ZFS pool with compression enabled.

3. This is what your data looks like after being stored in a ZFS pool with deduplication enabled.

4. Let's say we are storing three identical files, i.e.,

5. ZFS: Compression Only

6. ZFS: Deduplication Only

7. ZFS: Compression + Deduplication

The biggest difference between deduplication and compression is the scope. Compression works at the file level: each file is compressed on its own, so if you have three identical files, ZFS stores the compressed data three times. Deduplication works at the block level. A block is simply the basic unit of ZFS storage (e.g., 512 bytes, 4K, etc.). When ZFS needs to store a big file, it divides the file into multiple chunks and stores each chunk in a block. Deduplication remembers the content of each block (its checksum) and avoids storing the same content again. In other words, deduplication works at a narrower level (think of it as a molecule).
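
To make the block-level idea concrete, here is a minimal sketch in Python (not ZFS's actual on-disk logic) of how checksumming each block lets a store keep only one physical copy of duplicated content. The 4K block size, the sha256 checksum and the file names are assumptions chosen just for illustration.

    import hashlib
    import os

    BLOCK_SIZE = 4096  # assumed block size; ZFS records can be 512 bytes, 4K, etc.

    def split_into_blocks(data):
        # Divide the file into fixed-size chunks, one chunk per block.
        return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

    def store(files):
        # Keep only one physical copy of each unique block; files just hold references.
        unique_blocks = {}   # checksum -> block contents actually written
        references = {}      # file name -> list of checksums ("pointers" to blocks)
        for name, data in files.items():
            checksums = []
            for block in split_into_blocks(data):
                digest = hashlib.sha256(block).hexdigest()
                if digest not in unique_blocks:   # new content: write it once
                    unique_blocks[digest] = block
                checksums.append(digest)          # duplicate content: just reference it
            references[name] = checksums
        return unique_blocks, references

    # Three identical 1MB files: logically 3MB, but only ~1MB of unique blocks is kept.
    payload = os.urandom(1024 * 1024)
    blocks, refs = store({"a.dat": payload, "b.dat": payload, "c.dat": payload})
    print(len(blocks) * BLOCK_SIZE)   # 1048576 bytes stored instead of 3145728

With compression only, each of the three copies would still be written (just smaller); with deduplication, the second and third copies collapse into references to blocks that already exist.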

One of the reasons why drugs are usually tested on mice is that some mouse genes are 99% identical to human genes. Imagine we need to store the mouse genes in a database: we only need to store them once. Later on, if we need to store the human genes in the same database, we can reference the mouse copy for the identical parts rather than storing the same data again.

Of course, enabling both compression and deduplication will save lots of space. However, it comes with a very high price tag. If you would like to enable deduplication, you need to make sure that you have at least 2GB of memory per 1TB of storage to hold the deduplication table. For example, if your ZFS pool is 10TB, you need to have 20GB of memory installed in your system. Otherwise, you will experience a huge performance hit.
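
As a quick sanity check on that rule of thumb, here is a tiny helper (the function name and the sample pool sizes are made up for this article) that applies the 2GB-per-1TB figure quoted above:

    def dedup_ram_gb(pool_tb, gb_per_tb=2):
        # 2GB of RAM per 1TB of pool, per the rule of thumb above.
        return pool_tb * gb_per_tb

    for size_tb in (1, 10, 50):
        print(size_tb, "TB pool ->", dedup_ram_gb(size_tb), "GB of RAM for the dedup table")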

Hope this article helps you understand the difference between compression and deduplication.

–Derrick


8 Replies to “ZFS: Compression VS Deduplication (Dedup) in Simple English”

  1. Jan

    ZFS uses block-level dedup, not file-level dedup.
    As such, it will even deduplicate identical pieces of files that are not identical as a whole.

    Say for example that you have 1000 documents, all with the same header, footer and some more standardized text.

    The pieces of the documents that are the same will be deduplicated, while the other parts of the documents are not.

    Even if you make a 1MB file with random data, then copy it and change a single bit inside the copy, the two files will be deduplicated against each other entirely, except for the block in which the single bit was changed.

  2. Rob

    Note that those 1000 documents will only dedup headers, footers, and standardized text blocks *if they are block aligned with each other* – something that rarely happens in the real world. Unfortunately, ZFS dedup support doesn’t match up well with what other offerings (such as DataDomain and its variable-block dedup support) can provide.

  3. mccs

    point 7 is not completely correct.

    the strange thing with compression is that even if the data is roughly the same, once it is compressed the complete file, and thus its blocks, will change because of the compression ratio and algorithm.

    example 1: 20GB of uncompressed mail files packed into a gzip archive = X GB.
    example 2: the same 20GB of uncompressed mail files, with a few extra mails and changes, compresses to about the same X GB, but the compressed file, and thus its blocks, are completely different. So with dedupe you gain nothing.

  4. Pingback: ~dhe

  5. Shane

    “point 7 is not completely correct.”

    Which is why the underlying order of operations should be and then compress.

    Similar to the issue with encrypted and compressed data, the order of operations should be compress then encrypt, since an encrypted file should not be compressible if the ciphertext is strong.

  6. Shane

    Sorry, “dedupe” was deduped it seems!

    That should have read, “Which is why the underlying order of operations should be dedupe and then compress.”

