ZFS: Compression VS Deduplication (Dedup) in Simple English

Many people confuse compression with deduplication because the two features are so similar. Both are designed to reduce the amount of space your data takes up in storage. Let me explain the difference between them in simple English.

1. This is what your data looks like originally (assuming only one unique file):

2. This is what your data looks like after being stored in a ZFS pool with compression enabled:

3. This is what your data looks like after being stored in a ZFS pool with deduplication enabled:

4. Let's say we are storing three identical files:

5. ZFS: Compression Only

6. ZFS: Deduplication Only

7. ZFS: Compression + Deduplication

Of course, enabling both compression and deduplication will save a lot of space. However, it comes with a very high price tag. If you want to enable deduplication, you need to make sure that you have at least 2GB of memory per 1TB of storage. For example, if your ZFS pool is 10TB, you need to have 20GB of memory installed in your system. Otherwise, you will experience a huge performance hit.
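
If you do decide to turn these features on, here is a rough sketch of the commands involved. The pool name "tank" is just an example, and both settings only affect data written after they are enabled; check what your ZFS version supports before copying this verbatim.

    # Enable compression and deduplication on an example pool called "tank"
    zfs set compression=lz4 tank     # use compression=on if lz4 is not available in your ZFS version
    zfs set dedup=on tank            # remember: roughly 2GB of RAM per 1TB of pool

    # After writing some data, check how much you are actually saving
    zfs get compressratio tank       # compression ratio of the dataset
    zpool get dedupratio tank        # deduplication ratio of the whole pool

    # Optional: simulate deduplication to estimate the ratio before turning it on
    zdb -S tank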

I hope this article helps you understand the difference between compression and deduplication.

–Derrick


6 comments

  1. ZFS uses block-level deduplication, not file-level deduplication.
    As such, it will even deduplicate pieces of files that are not identical as a whole.

    Say for example that you have 1000 documents, all with the same header, footer and some more standardized text.

    The pieces of the documents that are the same will be deduplicated, while the other parts of the documents are not.

    Even if you create a 1MB file with random data, then copy it and change a single bit inside the copy, the two files will be fully deduplicated except for the block in which that single bit was changed.
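
    You can see this block-by-block behaviour with ordinary shell tools, without even touching ZFS: split both files into 128K chunks (the default ZFS recordsize) and hash each chunk. The file names below are just examples.

        # 1MB of random data, plus a copy with a single byte overwritten
        dd if=/dev/urandom of=a.bin bs=1M count=1
        cp a.bin b.bin
        printf '\x00' | dd of=b.bin bs=1 seek=300000 conv=notrunc

        # Split both files into 128K chunks (ZFS default recordsize) and hash them
        split -b 128K -d a.bin a_blk_
        split -b 128K -d b.bin b_blk_
        sha256sum a_blk_* b_blk_*
        # 7 of the 8 chunks have identical digests in both files; only the chunk
        # holding the changed byte differs, so only that block needs extra space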

  2. Note that those 1000 documents will only dedup headers, footers, and standardized text blocks *if they are block aligned with each other* – something that rarely happens in the real world. Unfortunately, ZFS dedup support doesn't match up well with what other offerings (such as Data Domain, with its variable-block dedup support) can provide.
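
    The same chunk-and-hash trick illustrates the alignment problem: shift the content by a single byte and none of the fixed-size blocks line up any more. File names are again just placeholders.

        # Take a file and make a copy with one extra byte at the very start
        dd if=/dev/urandom of=doc.bin bs=1M count=1
        printf 'X' | cat - doc.bin > doc_shifted.bin

        # Hash the 128K chunks of both versions
        split -b 128K -d doc.bin orig_blk_
        split -b 128K -d doc_shifted.bin shift_blk_
        sha256sum orig_blk_* shift_blk_*
        # No chunk digest repeats: the content is almost identical, but every
        # fixed-size block now starts one byte later, so block-level dedup
        # finds nothing to share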

  3. Point 7 is not completely correct.

    The strange thing with compression is that even if the data is roughly the same, once it is compressed the complete file, and therefore its blocks, will change because of the compression ratio and algorithm.

    Example 1: 20GB of uncompressed mail files packed into a gzip archive = X GB.
    Example 2: the same 20GB of mail files, with a few extra mails and changes, results in an archive of about the same size, X GB, but the compressed file, and therefore its blocks, are completely different. So with dedupe you gain nothing.
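
    A quick way to see this effect with plain gzip; the paths and the extra mail file are only illustrative:

        # Archive a mail directory, then archive it again with one extra message
        tar czf mail1.tar.gz Maildir/
        cp some-new-mail.eml Maildir/new/
        tar czf mail2.tar.gz Maildir/

        # The two archives end up about the same size...
        ls -l mail1.tar.gz mail2.tar.gz
        # ...but because gzip is a stream compressor, everything after the first
        # change diverges, so block-level dedup of the two archives saves very little
        cmp mail1.tar.gz mail2.tar.gz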

  4. Pingback: ~dhe
  5. “point 7 is not completely correct.”

    Which is why the underlying order of operations should be and then compress.

    Similar to the issue with encrypting and compressing data, the order of operations should be compress then encrypt, since an encrypted file should not be compressible if the ciphertext is strong.
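
    A quick illustration of why that ordering matters; the file names and passphrase are just placeholders:

        # Highly compressible input
        dd if=/dev/zero of=plain.bin bs=1M count=10

        # Compress first: shrinks to almost nothing
        gzip -k plain.bin                    # produces plain.bin.gz

        # Encrypt first, then try to compress the ciphertext
        openssl enc -aes-256-cbc -pbkdf2 -pass pass:example -in plain.bin -out plain.enc
        gzip -k plain.enc                    # produces plain.enc.gz

        ls -l plain.bin.gz plain.enc.gz
        # plain.bin.gz is tiny; plain.enc.gz is about the same size as plain.enc
        # (or slightly larger), because strong ciphertext looks random and does
        # not compress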

  6. Sorry, “dedupe” was deduped it seems!

    That should have read, “Which is why the underlying order of operations should be dedupe and then compress.”
