Ubuntu: Sync two largely identical huge Files



Question:

I have two 300 GB files on different volumes:

  • encrypted local backup
  • encrypted ‘remote’ backup (on a NAS, that is).

By design, these two files are identical in size and also mostly (>90%) identical in content...

Is there an efficient tool to „rsync“ these files, and only copy over the differing sections, so the target file becomes identical with the source?

Perhaps something that builds checksums of blocks to figure that out, I don't know... (Anything more efficient than cp -f; as far as I know, rsync would also grab the entire source file to overwrite.)


Solution:1

rsync can be used to do this.

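The --no-whole-file option (short form: --no-W) makes rsync use its block-level delta-transfer algorithm instead of copying whole files. When both source and destination are local paths, rsync defaults to whole-file copying, so the option has to be given explicitly.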
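For the single-file case in the question, the invocation might look like the sketch below. The function name sync_delta and the paths in the usage comment are hypothetical. The --inplace option is an addition not used in the test case that follows: it makes rsync update the destination file directly instead of building a temporary copy, which seems desirable for a 300 GB target, but treat it as an assumption to verify on your own setup.

```shell
#!/bin/sh
# Hypothetical helper: delta-sync one large file onto another.
#   --no-whole-file  force the block-level delta algorithm, even for local paths
#   --inplace        write changed blocks into the existing destination file
#   --stats          print "Literal data" / "Matched data" so the saving is visible
sync_delta() {
    rsync --no-whole-file --inplace --stats "$1" "$2"
}

# Usage (placeholder paths, adjust to your volumes):
#   sync_delta /backup/local.img /mnt/nas/remote.img
```

Note that with --inplace an interrupted run leaves the destination partially updated, so it should simply be re-run until it completes.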


Test case

I generated test files from random text (using /dev/random) and from large chunks of text taken from websites, as follows. The 4 files differ completely in content. tf_2.dat is our target file.

~/logs/rs$ ls -tlh
-rw-rw-r-- 1 vayu vayu 2.1G  二  25 23:11 tf_2.dat
-rw-rw-r-- 1 vayu vayu 978M  二  25 23:11 a.txt
-rw-rw-r-- 1 vayu vayu 556K  二  25 23:10 file2.txt
-rw-rw-r-- 1 vayu vayu 561K  二  25 23:09 nt.txt

Then I copied them to a different hard disk using rsync (the destination was empty):

rsync -r --stats rs/ /mnt/raid0/scratch/t2  

rsync reported the following statistics:

Number of files: 5 (reg: 4, dir: 1)
Number of created files: 0
Number of deleted files: 0
Number of regular files transferred: 4
Total file size: 3,260,939,140 bytes
Total transferred file size: 3,260,939,140 bytes
Literal data: 3,260,939,140 bytes
Matched data: 0 bytes
File list size: 0
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 3,261,735,553
Total bytes received: 92

sent 3,261,735,553 bytes  received 92 bytes  501,805,483.85 bytes/sec
total size is 3,260,939,140  speedup is 1.00

Now I append two of the files to the target, producing a new file that contains approximately 60% old data:

cat file2.txt a.txt >> tf_2.dat  

Now I sync the two folders again, this time using the --no-W option:

rsync -r --no-W --stats rs/ /mnt/raid0/scratch/t2

Number of files: 5 (reg: 4, dir: 1)
Number of created files: 0
Number of deleted files: 0
Number of regular files transferred: 4
Total file size: 4,289,593,685 bytes
Total transferred file size: 4,289,593,685 bytes
Literal data: 1,025,553,047 bytes
Matched data: 3,264,040,638 bytes
File list size: 0
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 1,026,127,265
Total bytes received: 611,604

sent 1,026,127,265 bytes  received 611,604 bytes  21,169,873.59 bytes/sec
total size is 4,289,593,685  speedup is 4.18

You can see that most of the data was matched (3,264,040,638 of 4,289,593,685 bytes) instead of being sent again, for a speedup of 4.18.

Next, I try again. This time I append several shell scripts to the target (tf_2.dat) so that the change is ~2%:

cat *.sh >> rs/tf_2.dat  

Then I sync again using rsync:

rsync -r --no-whole-file --stats rs/ /mnt/raid0/scratch/t2

Number of files: 5 (reg: 4, dir: 1)
Number of created files: 0
Number of deleted files: 0
Number of regular files transferred: 4
Total file size: 4,289,727,173 bytes
Total transferred file size: 4,289,727,173 bytes
Literal data: 178,839 bytes
Matched data: 4,289,548,334 bytes
File list size: 0
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 541,845
Total bytes received: 690,392

sent 541,845 bytes  received 690,392 bytes  43,236.39 bytes/sec
total size is 4,289,727,173  speedup is 3,481.25

Almost everything is matched this time: only 178,839 bytes are sent as literal data, and the speedup is 3,481.25.
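To confirm that the target really is identical to the source after such a sync, a byte-wise comparison can be run afterwards. The sketch below uses cmp. The function name verify_identical and the paths in the usage comment are hypothetical, and this check was not part of the test above.

```shell
#!/bin/sh
# Hypothetical helper: report whether two files are byte-for-byte identical.
# cmp -s prints nothing and only sets the exit status (0 = identical).
verify_identical() {
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "different"
        return 1
    fi
}

# Usage (placeholder paths):
#   verify_identical rs/tf_2.dat /mnt/raid0/scratch/t2/tf_2.dat
```

Keep in mind that cmp has to read both files, stopping at the first difference, so for identical files of several hundred GB this check takes a while.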


Solution:2

You can also try https://bitbucket.org/ppershing/blocksync (disclaimer: I am the author of this particular fork). An advantage over rsync is that it reads each file only once. As far as I know, rsync cannot be convinced to assume that two files differ without first computing checksums before it starts the delta transfer, and needless to say, reading 160 GB hard drives twice isn't a good strategy. A note of caution: the current version of blocksync works well over short-RTT connections (e.g., localhost, LAN and local WiFi) but isn't particularly useful for syncing over long distances.
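The general idea of block-wise syncing can be sketched in plain shell. This is not blocksync's actual implementation, only an illustration under stated assumptions: both files are reachable as local paths, they have the same size (as in the question), and sha256sum from GNU coreutils is available. The function name block_sync and its default block size are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: overwrite only those blocks of DST that differ from SRC.
# Usage: block_sync SRC DST [BLOCK_SIZE_IN_BYTES]
block_sync() {
    src=$1
    dst=$2
    bs=${3:-1048576}                       # default block size: 1 MiB
    size=$(wc -c < "$src")
    blocks=$(( (size + bs - 1) / bs ))     # round up so a partial last block counts
    i=0
    copied=0
    while [ "$i" -lt "$blocks" ]; do
        # Hash block i of each file.
        a=$(dd if="$src" bs="$bs" skip="$i" count=1 2>/dev/null | sha256sum)
        b=$(dd if="$dst" bs="$bs" skip="$i" count=1 2>/dev/null | sha256sum)
        if [ "$a" != "$b" ]; then
            # conv=notrunc keeps the rest of DST intact while block i is rewritten.
            dd if="$src" of="$dst" bs="$bs" skip="$i" seek="$i" count=1 \
               conv=notrunc 2>/dev/null
            copied=$((copied + 1))
        fi
        i=$((i + 1))
    done
    echo "$copied of $blocks blocks copied"
}
```

Both files are still read in full, but only the differing blocks are written. Spawning dd twice per block makes this far too slow for 300 GB files in practice, so it is meant to show the principle rather than to be used as is.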

