I need to transfer 10TB of data from one machine to another. The files live on a large RAID spanning 7 disks, and the target machine has another large RAID spanning 12 disks. Copying the files locally is not practical, so I decided to copy them over the LAN.
Four options came to mind: scp, rsync, rsyncd (rsync as a daemon) and netcat.
scp is handy and easy to use, but it comes with two disadvantages: it is slow and it is not fault-tolerant. scp runs over SSH, so all data is encrypted before the transfer. The encryption adds some protocol overhead and, more importantly, uses extra CPU on both ends, which slows down the overall transfer. If the transfer is interrupted, there is no easy way to resume the process other than transferring everything again. Here are some example commands:
#Source machine
#Typical speed is about 20 to 30MB/s
scp -r /data target_machine:/data
#Or you can enable compression on the fly
#Depending on the type of your data, if your data is already compressed, you may see no or negative speed improvement
scp -rC /data target_machine:/data
rsync is similar to scp. It also runs over SSH, so the data is encrypted in transit. It can transfer only the new or changed files, which reduces the amount of data being sent. However, it comes with a few disadvantages: a long decision time, encryption overhead and extra computation (e.g., file comparison, encryption and decryption). For example, if I use rsync to transfer 10TB of files from one machine to another (where the directory on the target machine is empty), it can easily take 5 hours to determine which files need to be transferred before the actual data transfer starts.
#Run on the target machine
rsync -avz -e ssh --delete-after source_machine:/data/ /data/
#Use a less secure encryption algorithm to speed up the process
rsync -avz --rsh="ssh -c blowfish" --delete-after source_machine:/data/ /data/
#Use an even less secure algorithm to get the top speed
#Note: OpenSSH 7.6 and later no longer support the blowfish and arcfour ciphers
rsync -avz --rsh="ssh -c arcfour" --delete-after source_machine:/data/ /data/
#By default, rsync decides what to transfer by comparing file size and modification time (checksums are compared only with -c).
#--whole-file skips the delta-transfer algorithm and sends each changed file in full, which saves CPU on a fast LAN
rsync -avz --rsh="ssh -c arcfour" --delete-after --whole-file source_machine:/data/ /data/
Anyway, no matter what you do, the top speed of rsync on a consumer-grade gigabit network is around 45MB/s. On average, the speed is around 25-35MB/s. Keep in mind that these numbers do not include the decision time, which can be a few hours.
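To put these speeds in perspective, here is a small sketch that turns a data size and a sustained throughput into a rough transfer time. The function name estimate_hours is made up for this example, and it assumes decimal units (1 TB = 1,000,000 MB) and a constant speed, which a real transfer will not have.

```shell
#!/bin/sh
# estimate_hours SIZE_TB SPEED_MBS
# Rough transfer-time estimate. Assumes 1 TB = 1,000,000 MB and a constant
# throughput. Prints whole hours, rounded down.
estimate_hours() {
    size_tb=$1
    speed_mbs=$2
    total_mb=$((size_tb * 1000000))
    seconds=$((total_mb / speed_mbs))
    echo $((seconds / 3600))
}

# Example with the figures from this post:
#   estimate_hours 10 25   # rsync over SSH, low end of the average
#   estimate_hours 10 45   # rsync over SSH, best case
```

For 10TB this works out to roughly 111 hours at 25MB/s and 61 hours at 45MB/s, so a few hours of decision time is noticeable but not the dominant cost.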
rsyncd (rsync as a daemon)
Thanks to a comment from one of our readers, I got a chance to investigate running rsync as a daemon. The idea is similar to plain rsync: on the server, we run rsync as a service/daemon and specify which directory we want to “export” to the clients (e.g., /usr/ports). The clients then talk to the daemon directly over rsync's own protocol instead of going through SSH, so there is no encryption overhead. Here is how to set up an rsync server on FreeBSD:
sudo nano /usr/local/etc/rsyncd.conf
And this is my configuration file:
pid file = /var/run/rsyncd.pid
#Notice that I use derrick here instead of a system user such as nobody
#That's because nobody does not have permission to access the path, i.e., /data/
#Either you make the source directory available to "nobody", or you change the daemon user.
uid = derrick
gid = derrick
use chroot = no
max connections = 4
syslog facility = local5
[mydata]
path = /data/
comment = data
Don't forget to include the following in /etc/rc.conf, so that the service will be started automatically:
rsyncd_enable="YES"
#Let's start the rsync service:
sudo /usr/local/etc/rc.d/rsyncd start
To pull the files from the server to the clients, run the following:
rsync -av myserver::mydata /data/
#Or you can enable compression
rsync -avz myserver::mydata /data/
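Since the daemon does not encrypt or authenticate anything in the configuration above, it may be worth limiting who can connect. The sketch below appends a module that uses the rsyncd.conf parameters "read only" and "hosts allow". The helper name write_module and the 192.168.1.0/24 subnet are assumptions for illustration; substitute your own LAN range.

```shell
#!/bin/sh
# write_module CONF_FILE MODULE_NAME PATH ALLOWED_HOSTS
# Appends a read-only module that only ALLOWED_HOSTS may connect to.
# Clients that do not match "hosts allow" are rejected by the daemon.
write_module() {
    conf_file=$1
    module=$2
    path=$3
    allowed=$4
    cat >> "$conf_file" <<EOF
[$module]
    path = $path
    comment = $module (read-only export)
    read only = yes
    hosts allow = $allowed
EOF
}

# Example (the subnet is an assumption, not from the original setup):
#   write_module /usr/local/etc/rsyncd.conf mydata /data/ 192.168.1.0/24
```

Modules are read-only by default, so "read only = yes" only makes the intent explicit; "hosts allow" is the line that actually narrows access.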
To my surprise, it works much better than running rsync over SSH. Here are some numbers I collected while transferring 10TB of files from ZFS to ZFS:
Bandwidth measured on the client machine: 70MB/s
zpool IO speed on the client side: 75MB/s
P.S. Initially, the speed was about 45-60MB/s. After I tweaked my zpool, the top speed reached 75-80MB/s. Please check out here for references.
I noticed that the decision time is much shorter than when running rsync over SSH. The process is also much more stable, with zero interruptions of the following kind:
rsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at io.c(521) [receiver=3.1.0]
rsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at rsync.c(632) [generator=3.1.0]
rsync: [receiver] write error: Broken pipe (32)
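If interruptions like these do happen, one way to cope is to wrap the transfer in a retry loop. This is a minimal sketch rather than part of my original setup: the retry function is a hypothetical helper, and it relies on rsync's --partial flag so that partially transferred files are kept between attempts.

```shell
#!/bin/sh
# retry MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits 0 or MAX_TRIES attempts have been made.
retry() {
    max_tries=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        "$@" && return 0
        if [ "$attempt" -ge "$max_tries" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed; retrying in ${delay}s" >&2
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example: up to 5 attempts, 60 seconds apart. --partial keeps partially
# transferred files so a retry does not start them from scratch.
#   retry 5 60 rsync -av --partial myserver::mydata /data/
```

Because rsync skips files that already match on the target, each retry only has to deal with what is still missing.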
NetCat is similar to cat, except that it works at the network level. I decided to use netcat for the initial transfer. If it is interrupted, I will let rsync take over and finish the job. Netcat does not encrypt the data, so the overhead is very small. If you are transferring files within a local network and you don’t care about security, netcat is a perfect choice.
There is only one disadvantage of using netcat: it can only handle one stream of data at a time. That doesn’t mean you need to run netcat for every single file. Instead, we can tar the files before feeding them to netcat, and untar them at the receiving end. As long as we do not compress the files, we can keep the CPU usage low.
#Open two terminals, one for the source and another one for the target machine.
#On the target machine:
#Go to the directory, e.g.,
cd /data
#Run the following:
nc -l 9999 | tar xvfp -
#On the source machine:
#Go to the directory, e.g.,
cd /data
#Pick a port number that is not being used, e.g., 9999
tar -cf - . | nc target_machine 9999
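The tar and netcat pipeline above has no built-in integrity check, so it can be worth comparing the two trees after the transfer. The following is a sketch under assumptions: manifest and same_tree are hypothetical helper names, and the checksum comes from the POSIX cksum utility, which is a CRC that catches accidental corruption but not deliberate tampering.

```shell
#!/bin/sh
# manifest DIR
# Prints "checksum size path" for every regular file under DIR, sorted by
# path. LC_ALL=C keeps the sort order identical across machines.
manifest() {
    ( cd "$1" && find . -type f -exec cksum {} + | LC_ALL=C sort -k 3 )
}

# same_tree DIR_A DIR_B
# Succeeds when both directories exist and their manifests are identical.
same_tree() {
    [ -d "$1" ] && [ -d "$2" ] && [ "$(manifest "$1")" = "$(manifest "$2")" ]
}

# Across two machines, write a manifest on each side and compare the files:
#   manifest /data > /tmp/source_manifest.txt    # on the source machine
#   manifest /data > /tmp/target_manifest.txt    # on the target machine
#   diff /tmp/source_manifest.txt /tmp/target_manifest.txt
```

Reading 10TB back for checksums takes a long time on both ends, so this is a trade-off between confidence and total run time.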
Unlike rsync, the process starts right away, and the maximum speed is around 45 to 60MB/s on a gigabit network.
| Candidates | Top Speed (w/o compression) | Top Speed (w/ compression) | Resume | Stability | Instant Start? |
|---|---|---|---|---|---|
| netcat | 60MB/s (tar w/o -z) | 40MB/s (tar w/ -z) | No | Very High | Instant |
Nice comparison. Have you ever tried rsyncd to transfer data between machines? It’s not encrypted either, has no additional overhead, and I got transfer speeds of about 60-70 MB/s between two FreeBSD-based machines (both RAID-Z1). And btw: the /boot/loader.conf on the target machine (only 4 GiB RAM) has not been tweaked so far.
Maybe for some transfers rsyncd could be an alternative to NetCat for you.
Thanks for your comment. Would you mind sharing your rsyncd.conf settings or the rsync parameters you used? I just tried to transfer 10TB of files between two machines with ZFS on a consumer-level gigabit network. Here are the speeds I got:
ZFS iostat (RAIDZ2): 40MB/s
Bandwidth on the network adapter: 50MB/s
ZFS iostat (RAIDZ): 45MB/s
Bandwidth on the network adapter: 55MB/s
I am very curious to know how you achieved 60-70MB/s on ZFS.
I came to this site because I knew, but didn’t remember, the syntax of --rsh="ssh -c arcfour", so I googled it.
For compressed data (mine are videos), netcat is probably the fastest. However, you should also consider stability when using netcat. What happens if some other computer sends packets to port 9999 on the receiving end? Is it possible that your data will be corrupted?
My “vote” goes for rsync with ssh (with fast encryption).
If two computers send data to a single port via NetCat, I think the data on the receiving end will get corrupted. Personally, I think it is safer to use NetCat in a safe environment, i.e., a network you control and where you know what you are doing. I definitely won’t run NetCat in a public environment. Like you said, someone can scan for your open port (such as the one NetCat listens on) and dump junk into it.