Optimize the process of backing up VMs

Please note that this blog has been moved.

Now it has its own domain: mynixworld.info


I have a few KVM virtual machines that, for various reasons, I back up from time to time.

Some of their virtual disks (let’s call them VM-hdd) are stored as plain, uncompressed raw files (unlike, say, qcow2), while others are stored directly on a physical disk partition (the partition has no file system of its own, just the VM-hdd content).

To back up a VM-hdd that lives on a partition, just use the dd tool to dump the partition content to a raw file (i.e. dd if=/dev/sdX of=/tmp/sdX.raw), then create a compressed archive/copy of it that you can store wherever you like. Likewise, to back up a VM-hdd that is stored as a plain/raw file, just compress that file and store it wherever you like.

This post is not about that. This post is about how to optimize the whole process, meaning:

  1. dump the VM-hdd from physical partition to a raw file (if not stored as a file already)
  2. compress the VM-hdd image as much as possible
  3. process all the steps using as few resources as possible (CPU, disk, time) while getting the optimal compressed backup copy
  1. dump the VM-hdd from physical partition to a raw file
dd if=/dev/<partition> of=<dir>/<imagefile>.raw
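
The dump step can be wrapped in a small helper that also checks the copy. This is only a sketch: dump_image is a hypothetical name of mine, the 1M block size is an assumption (dd's default is 512 bytes), and the comparison only makes sense while the VM is shut down so the source does not change.

```shell
# Sketch: dump a block device (or any file) to a raw image and verify the copy.
# dump_image is a hypothetical helper name, not part of any tool.
dump_image() (
    src=$1   # e.g. /dev/<partition>
    dst=$2   # e.g. <dir>/<imagefile>.raw
    # A larger block size than dd's 512-byte default usually speeds up the copy.
    dd if="$src" of="$dst" bs=1M 2>/dev/null || exit 1
    # cmp exits non-zero if source and copy differ at any byte.
    cmp -s "$src" "$dst"
)
```

Usage would look like: dump_image /dev/<partition> <dir>/<imagefile>.raw && echo "dump verified"
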
  2. compress the VM-hdd as much as possible

By just running bzip2 (or another compression tool) against the raw file you will, of course, shrink the file to some extent. But what if you have a 100GB VM-hdd that has 99% free space: how much can your compression tool shrink the file? Common sense demands an archive of at most about 1GB (which is the disk space actually used). But can any compression algorithm do that on its own? I honestly have my doubts about that.

The problem is that when you delete a file from the disk, the space occupied by the file is not automatically zeroed. In fact, the file content remains there until it is overwritten, byte by byte, by other file contents. So if you have used your disk for a while, there is a good chance that your disk, even if it looks empty, still contains leftover data at the lowest level. And such leftover data cannot be compressed nearly as efficiently as, for example, a few billion consecutive zeros. In fact, if you used to have some JPEG, MPEG or MP3 files on your disk before it got (almost) empty (only 1% disk usage), then the chance of compressing the disk down to about 1GB (because 99% is free space) is very, very small, since those formats are already compressed and look like random data. Because, you know, free space does not mean free of data; it only means that it’s free for your OS/applications to use. But, as said before, at a low level it contains non-zeroed data.
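
To see the difference for yourself, you can compress the same amount of random bytes and of zero bytes and compare the results. This is a minimal sketch: compressed_size is a hypothetical helper of mine, 1 MiB is an arbitrary sample size, and the fallback to gzip is only there in case bzip2 is not installed.

```shell
# Sketch: compare how well random bytes and zero bytes compress.
# compressed_size is a hypothetical helper; it prints the byte count
# of its stdin after compression.
compressed_size() {
    # Prefer bzip2 (the tool used in this post); fall back to gzip if it is missing.
    if command -v bzip2 >/dev/null 2>&1; then
        bzip2 -9 -c | wc -c
    else
        gzip -9 -c | wc -c
    fi
}

random_size=$(head -c 1048576 /dev/urandom | compressed_size)
zero_size=$(head -c 1048576 /dev/zero | compressed_size)

echo "1 MiB of random bytes compresses to $random_size bytes"
echo "1 MiB of zero bytes compresses to $zero_size bytes"
```

The random sample should stay at roughly its original size, while the zeros should collapse to a very small output.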

So the trick is to zero the unused disk space and only then try to compress the image. All you have to do is write a file that takes up all the free space of your disk until the disk is full (100%). The file should contain as many bytes as necessary (to fill the whole disk), but they must all have the same value. Like billions of spaces. Or billions of A-s. Or billions of zeros. Then you remove that file from the disk so that you free the space you took. And because the space that was once used (by your file) no longer contains random bytes (like in the JPEG example) but only zeros, a compression tool can shrink it to a tiny fraction of its size. It can do that because it sees that you have, for example, 99GB (~ 10^11 bytes) of zeros, so all it has to do to compress that space is to write (into the archived version of your file) something like “…everything that I wrote before + 10^11 zeros + everything that comes after them”.

It is also a good idea to defragment your disk, and even to consolidate its free space if you have software for that, so that the compression gets even better. Do this before you zero the free space, not after: a defragmenter moves data around, and the old copies it leaves behind in the freed space are not zeros.

If you have a *nix VM guest then zeroing the free space is easy:

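dd if=/dev/zero of=/tmp/zero.file bs=512 count=<free-space>/512   # /tmp must be on the file system you want to zero
sync; rm /tmp/zero.file                                           # then give the space back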
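
If you do not want to work out the free space by hand, the same idea can be written as a small function that simply writes zeros until the file system is full. This is a sketch under my own assumptions: zero_free_space is a hypothetical name, the optional second argument (a cap in MiB) exists only so you can try it safely, and you should run it inside the guest on the file system you want to zero.

```shell
# Sketch: fill the free space of the file system holding "$1" with zeros,
# then release it again. zero_free_space is a hypothetical helper name.
# The optional second argument caps the fill at that many MiB (handy for a dry run).
# Without it, dd runs until the file system is full and stops with an error,
# which is expected here.
zero_free_space() (
    dir=$1
    max_mib=$2
    zero_file="$dir/zero.file"
    if [ -n "$max_mib" ]; then
        dd if=/dev/zero of="$zero_file" bs=1M count="$max_mib" 2>/dev/null
    else
        dd if=/dev/zero of="$zero_file" bs=1M 2>/dev/null
    fi
    # Make sure the zeros really reach the disk before the file is removed.
    sync
    rm -f "$zero_file"
)
```

For a quick try, zero_free_space /tmp 10 writes and removes only 10 MiB; zero_free_space / fills the whole root file system.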

If you have a Windows guest then you can zero the free space with SDelete, a free tool written by Mark Russinovich (Winternals co-founder). Like many other tools written by the same author, it can be downloaded and used freely.

It is worth mentioning that if you want to zero the free space on volume/partition x: then you should run the SDelete command like below:

x:                 # first change the active partition to X
sdelete.exe -z     # execute the tool from that partition

Still on Windows, to consolidate/defragment your disk free space you may use MyDefrag or UltraDefrag (both freeware).

Now, finally, let’s compress that hyper-optimized VM-hdd disk image that you’ve gotten.

If you are on Linux then I would recommend the bzip2 compression filter. Or even better, the parallel version of it, pbzip2. Do you have a cluster? Then even better, you may use the MPI version of bzip2.

Usually I back up both the virtual disk(s) and the VM XML definition.

tar cf - /<dir1>/vm-hdd-image.raw /<dir2>/vm-definition.xml|pbzip2 -9 -m2000 -v -c > /<backup media>/vm-hdd-image.tar.bz2
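
A backup is only useful if it can be read back, so it may be worth listing the archive right after creating it. Below is a minimal sketch of that idea. pick_compressor and backup_vm are hypothetical helper names, and the fallback order (pbzip2, then bzip2, then gzip) is my assumption; choose the archive extension to match whatever compressor is actually found.

```shell
# Sketch: create a compressed tar archive and immediately check that it is readable.
# pick_compressor and backup_vm are hypothetical helper names.
pick_compressor() {
    # Use the first of these tools that is installed.
    for c in pbzip2 bzip2 gzip; do
        if command -v "$c" >/dev/null 2>&1; then
            echo "$c"
            return 0
        fi
    done
    return 1
}

# backup_vm ARCHIVE FILE...: tar the files, compress the stream, then test the archive.
backup_vm() (
    archive=$1
    shift
    comp=$(pick_compressor) || exit 1
    tar cf - "$@" | "$comp" -9 -c > "$archive" || exit 1
    # Decompress and list the archive; this fails if the archive is damaged.
    "$comp" -dc "$archive" | tar tf - >/dev/null
)
```

Usage would look like: backup_vm /<backup media>/vm-hdd-image.tar.bz2 /<dir1>/vm-hdd-image.raw /<dir2>/vm-definition.xml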

If your VM-hdd is stored on a physical partition (so it’s not just a simple raw file) then you can even dump and compress the whole thing in only one line of code:

dd if=/dev/<partition> | pbzip2 -9 -m2000 -v -c > /<backup media>/vm-hdd-image.raw.bz2
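
The streamed dump has a mirror image for restoring, and a round trip is an easy way to convince yourself that nothing gets lost. This is a sketch only: dump_compressed and restore_compressed are hypothetical names, plain bzip2 stands in for pbzip2, and gzip is used if bzip2 is not installed.

```shell
# Sketch: stream a device (or file) into a compressed image and back again.
# dump_compressed and restore_compressed are hypothetical helper names.
comp=bzip2
command -v bzip2 >/dev/null 2>&1 || comp=gzip   # fallback if bzip2 is missing

# dump_compressed SRC ARCHIVE: read SRC with dd and compress the stream into ARCHIVE.
dump_compressed() {
    dd if="$1" bs=1M 2>/dev/null | "$comp" -9 -c > "$2"
}

# restore_compressed ARCHIVE DST: decompress ARCHIVE and write the stream to DST.
restore_compressed() {
    "$comp" -dc "$1" | dd of="$2" bs=1M 2>/dev/null
}
```

For example, dump_compressed /dev/<partition> /<backup media>/vm-hdd-image.raw.bz2 creates the backup and restore_compressed with the arguments swapped writes it back to the partition.
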
  3. process all the steps using as few resources as possible (CPU, disk, time) while getting the optimal compressed backup copy

Overall, it is an easy and straightforward process: defragment the disk, zero the free space, then dump and compress the disk image to your backup media.

To give you a picture of why it is worth all this effort, I will briefly present two experiments I’ve done:

  • an XP Home + SP3 system with 3GB free space on a 10G VM-hdd
    • defrag.+zeroing free space+bzip2 => ~4GB bz2 archive (~40% of VM-hdd size)
  • an XP Home + SP3 system with 8GB free space on a 10G VM-hdd
    • defrag.+zeroing free space+bzip2 => ~0.9GB bz2 archive (~9% of VM-hdd size)
    • defrag.+zeroing free space+qcow2 => ~1.1GB qcow2 disk (~11% of VM-hdd size)
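
The percentages above are simply the archive size divided by the VM-hdd size. If you want the same figure for your own backups, something like the sketch below should do; size_ratio is a hypothetical helper name and it rounds down to a whole percent.

```shell
# Sketch: print the archive size as a whole percentage of the image size.
# size_ratio is a hypothetical helper name.
size_ratio() {
    archive_bytes=$(wc -c < "$1")
    image_bytes=$(wc -c < "$2")
    # Integer arithmetic, so the result is rounded down.
    echo $(( archive_bytes * 100 / image_bytes ))
}
```

Usage would look like: size_ratio /<backup media>/vm-hdd-image.tar.bz2 /<dir1>/vm-hdd-image.raw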

Without the disk defragmentation/free space consolidation this compression rate would be hard to achieve, because the data on the physical disk is spread so randomly, and the leftover information on it is so random, that it is hard to find repetitive patterns to shrink it effectively.

BTW: I am using the last archive of XP Home (the 0.9GB .bz2 file) as a backup of a pre-installed image that I can use directly in case of “emergency” and/or when I want/have to start from scratch with an XP VM.


About Eugen Mihailescu

Always looking to learn more about *nix world, about the fundamental concepts of arithmetic, algebra and geometry. I am also passionate about programming, database and systems administration.

One Response to Optimize the process of backing up VMs

  1. PaulL says:

    Nice suggestion – I did something similar for my vm backups when configuring drbdadm as a backing store. I’d noticed the poor compression and figured it was through not zeroing the file system before starting the build. I hadn’t gotten quite as far as working out how to zero it afterwards.
