ChopZip: a parallel implementation of arbitrary compression algorithms

Remember plzma.py? I made a wrapper script for running [[LZMA]] in parallel. The script could be readily generalized to use any compression algorithm, following the principle of breaking the file into parts (one per CPU), compressing the parts, then [[tar (file format)|tarring]] them together. In other words: chop the file, zip the parts. Hence the name of the program that evolved from plzma.py: ChopZip.

Introduction

Currently ChopZip supports [[LZMA|lzma]], [[XZ Utils|xz]], [[gzip]] and lzip. Of them, lzip deserves a brief comment. It was brought to my attention by a reader of this blog. It is based on the LZMA algorithm, as are lzma and xz. Apparently unlike them, multiple files compressed with lzip can be concatenated to form a single valid lzip-compressed file; uncompressing the latter yields the concatenation of the original files.

To illustrate the point, consider the following shell session:

% echo hello > head
% echo bye > tail
% lzip head
% lzip tail
% cat head.lz tail.lz > all.lz
% lzip -d all.lz
% cat all
hello
bye

However, I just discovered that gzip, bzip2 and xz all do that already! lzma is advertised as capable of doing it too, but it doesn’t work for me: sometimes it uncompresses the concatenated file back to the original just fine, other times it decompresses only the first chunk of the set, and yet other times it complains that the “data is corrupt” and refuses to uncompress. For that reason, ChopZip uses two working modes: simple concatenation (gzip, lzip, xz) and tarring (lzma). The relevant mode is chosen transparently for the user.

Also, if you use Ubuntu, this bug will apply to you, making it impossible to have xz-utils, lzma and lzip installed at the same time.

The really nice thing about concatenability is that it allows for trivial parallelization of the compression while maintaining compatibility with the serial compression tool, which can still uncompress the product of a parallel compression. Unfortunately, for non-concatenatable compression formats the output of ChopZip is a tar file of the compressed chunks, making it impossible to uncompress with the original compressor alone (you would first need to untar, then uncompress each chunk, then concatenate the chunks; or just use ChopZip to decompress).
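
To make the scheme more concrete, here is a minimal Python sketch of the chop-compress-join idea described above. It is only illustrative, not the actual ChopZip code: all the names (split_file, compress_in_parallel, N_CHUNKS and so on) are made up, and option parsing, decompression and cleanup of the intermediate pieces are left out.

import os
import subprocess
import sys
import tarfile

N_CHUNKS = 4                               # typically one chunk per CPU core
CONCATENATABLE = ("gzip", "lzip", "xz")    # formats whose files can simply be concatenated
EXTENSION = {"gzip": ".gz", "lzip": ".lz", "xz": ".xz", "lzma": ".lzma"}

def split_file(path, nchunks):
    # Split 'path' into 'nchunks' roughly equal pieces; return the piece names.
    chunk_size = os.path.getsize(path) // nchunks + 1
    pieces = []
    with open(path, "rb") as f:
        for i in range(nchunks):
            data = f.read(chunk_size)
            if not data:
                break
            piece = "%s.chunk%d" % (path, i)
            with open(piece, "wb") as out:
                out.write(data)
            pieces.append(piece)
    return pieces

def compress_in_parallel(pieces, compressor):
    # One compressor process per piece, all running at the same time.
    procs = [subprocess.Popen([compressor, piece]) for piece in pieces]
    for p in procs:
        p.wait()
    return [piece + EXTENSION[compressor] for piece in pieces]

def join_chunks(compressed, output, compressor):
    # Concatenatable formats: just glue the compressed chunks together.
    # Anything else (lzma): tar the compressed chunks into a single file.
    if compressor in CONCATENATABLE:
        with open(output, "wb") as out:
            for chunk in compressed:
                with open(chunk, "rb") as c:
                    out.write(c.read())
    else:
        with tarfile.open(output, "w") as tar:
            for chunk in compressed:
                tar.add(chunk)

if __name__ == "__main__":
    fname, compressor = sys.argv[1], sys.argv[2]   # e.g.: bigfile xz
    pieces = split_file(fname, N_CHUNKS)
    compressed = compress_in_parallel(pieces, compressor)
    join_chunks(compressed, fname + EXTENSION[compressor], compressor)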

The rationale behind plzma/ChopZip is simple: multi-core computers are commonplace nowadays, but the most common compression programs still do not take advantage of this fact, at least the ones that I know and use. There are at least two initiatives that tackle the issue, but I still think ChopZip has a niche to exploit. The most consolidated one is pbzip2 (which I mention in my plzma post). pbzip2 is great if you want to use bzip2: it scales really nicely (almost linearly), and pbzipped files are valid bzip2 files. The main drawback is that it uses bzip2 as the compression method. bzip2 has always been the “extreme” brother of gzip: it compresses more, but it’s so slow that you would only resort to it if compressed size is vital. LZMA-based programs (lzma, xz, lzip) are both faster and compress even more, so for me bzip2 is out of the equation.

A second contender in parallel compression is pxz. As its name suggests, it compresses using xz. Drawbacks? It’s not in the official repositories yet, and I couldn’t manage to compile it, even though it comprises a single C file and a Makefile. It also lacks the ability to use different compressors (which is not necessarily a bad thing), and it’s a compiled program, whereas ChopZip is a much more portable script.

Scalability benchmark

Anyway, let’s get into ChopZip. I have run a simple test with a moderately large file (a 374MB tar file of the whole /usr/bin dir). The table below gives the speedup results for running ChopZip on that file, using various numbers of chunks (and, consequently, threads). The tests were conducted on a 4GB RAM Intel Core 2 Quad Q8200 computer. Speedups are calculated as how many times faster N chunks performed with respect to a single chunk. It is noteworthy that in every case running ChopZip with a single chunk is virtually identical in performance to running the original compressor directly. Also, decompression times (not shown) were identical, irrespective of the number of chunks. The ChopZip version was r18.

#chunks      xz    gzip    lzma    lzip
      1   1.000   1.000   1.000   1.000
      2   1.862   1.771   1.907   1.906
      4   3.265   1.910   3.262   3.430
      8   3.321   1.680   3.247   3.373
     16   3.248   1.764   3.312   3.451

Note how increasing the number of chunks beyond the number of actual cores (4 in this case) can still have a small benefit. This happens because N equal chunks of a file will not all be compressed at the same speed, so the more chunks there are, the smaller the overall effect of the slowest-compressing chunks.

Conclusion

ChopZip quite noticeably speeds up the compression of arbitrary files, with arbitrary compressors. In the case of concatenatable compressors (see above), the resulting compressed file is an ordinary compressed file, which can be decompressed with the regular tool (xz, lzip, gzip) as well as with ChopZip. This makes ChopZip a valid alternative to them, with the added advantage of parallelization.

6 Comments

  1. Inigo said,

    December 24, 2009 @ 4:36 am

    For trivial parallelization using Python, with some examples that may ring a bell to you, take a look at Parallel Python: http://www.parallelpython.com/

    Iñigo

  2. isilanes said,

    January 5, 2010 @ 13:10 pm

    Thanks, Iñigo! I also experimented with [[pexec]]. I even made a Python function to mimic pexec (and, I guess, Parallel Python): it accepts a list of commands and runs N of them simultaneously, making sure that there are always N processes running (for N cores). When one finishes, a new one is taken from the list of pending commands, until that list is exhausted. In the case of ChopZip, I ended up going back to my original formulation, to avoid unnecessary code complexity for little or no efficiency gain. The pexec-like functionality allows (in my case) generating more file pieces than there are cores, and thus a better load balance (it compensates for chunks that get compressed faster, whose core would otherwise sit idle). However, as I said, I saw no actual performance gain.
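
    In case it helps, here is a rough sketch of what such a pool function looks like. It is illustrative only, with made-up names, not the actual code I used:

    import subprocess
    import time

    def run_in_pool(commands, ncores):
        # Run 'commands' (lists of arguments), keeping at most
        # 'ncores' subprocesses alive at any given time.
        pending = list(commands)
        running = []
        while pending or running:
            # Start new processes while there are free slots and pending commands.
            while pending and len(running) < ncores:
                running.append(subprocess.Popen(pending.pop(0)))
            # Drop finished processes; poll() returns None while still running.
            running = [p for p in running if p.poll() is None]
            time.sleep(0.1)

    # For example: compress 8 chunks, 4 at a time.
    # run_in_pool([["xz", "chunk%d" % i] for i in range(8)], 4)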

  3. Ole Tange said,

    August 18, 2011 @ 14:25 pm

    Could ChopZip be replaced by GNU Parallel:

    ZIPPRG=lzip
    cat bigfile | parallel --pipe -k $ZIPPRG > bigfile.lz

  4. isilanes said,

    August 19, 2011 @ 11:03 am

    Thanks Ole,

    It seems that it certainly could.

    However, I am getting somewhat disturbing results with it. With either xz, xz + parallel, or ChopZip, I compress a given file, then decompress it back and check that the md5sum hasn’t changed, so the compression and decompression were correct. Now the file sizes are:

    Uncompressed: 94 MB (97922080 bytes)
    xz -3: 14399948 bytes
    xz -3 + parallel -j 4: 14767356 bytes
    chopzip.py -n 4 -m xz -l 3: 14413616 bytes

    So far, so good. Parallel compression is slightly less efficient than serial, which was expected. I still don’t know why chopzip beats xz + parallel in size, if they both split the input into 4 equal pieces, but oh well…

    However, the surprise comes when benchmarking the speed. The compression times are:

    xz: 32.8 s
    chopzip.py: 14.6 s
    xz + parallel: 3.77 s

    I am disappointed at how chopzip barely gets a 2x speedup, when 4x is expected. However, I am much, much more surprised at how xz + parallel got almost a 9x speedup!! I just don’t get it…

    I also tried gzip, instead of xz, and with it gzip+parallel was not much faster than serial gzip (1.33 vs 1.83 s). I’m a bit puzzled.

  5. isilanes said,

    August 22, 2011 @ 14:44 pm

    Responding to my own comment above, I have found the reason for the dramatic speedup of parallel. When compressing data streams, compressors look for redundancy in the data (if no redundancy is present, no compression is possible, because by definition the data cannot be conveyed by a smaller set of data; if it could, it would mean some of it is redundant). Modern compressors do not search for such redundancies across the whole input data stream, because that would skyrocket the memory and CPU usage. Instead, they compress the input data in chunks: they take the first MB of data and compress it, then the second MB, etc. (or whatever chunk size they use).

    The catch here is that parallel splits the input stream into 1 MB chunks by default, and each chunk is sent to a single process (xz, in our example). When compressing serially, apparently longer chunks are used (xz knows that the bigger the chunks, the more CPU and memory will be used, but also the better the compression ratio, so it uses as much as it can), so it is slower (but compresses more) than with parallel.

    We can force parallel to split the input data into bigger chunks before sending each chunk to an xz process, as follows:

    $ cat inputfile | parallel --pipe -k --blocksize 2000k xz -3 > inputfile.xz

    Increasing the blocksize to around 3200 kB, we get an 8-second compression, which is about a quarter of the serial compression time. Bigger blocks result in slower compression, and smaller ones in faster compression (up to a limit). Also, increasing the block size to 3200 kB does reduce the size of the compressed file, actually down to the level obtained by ChopZip or serial xz.
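
    A quick way to see this block-size effect for yourself is the following Python snippet, which pipes the same data through xz once as a single stream and once in 1 MB blocks. It is my own illustration, not definitive; "bigfile" stands for any file of a few tens of MB:

    import subprocess

    def xz_size(data):
        # Size of 'data' after piping it through "xz -3".
        p = subprocess.Popen(["xz", "-3", "-c"], stdin=subprocess.PIPE,
                             stdout=subprocess.PIPE)
        out, _ = p.communicate(data)
        return len(out)

    data = open("bigfile", "rb").read()       # any file of a few tens of MB will do
    block = 1024 * 1024                       # 1 MB, parallel's default blocksize

    whole = xz_size(data)                     # compress everything in one go
    pieces = sum(xz_size(data[i:i + block]) for i in range(0, len(data), block))

    print("single stream: %d bytes" % whole)
    print("1 MB blocks  : %d bytes" % pieces)  # expect this one to be somewhat larger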

  6. Nagilum said,

    January 14, 2012 @ 14:37 pm

    For gzip there is pigz.
    To build pxz you need to install liblzma-dev.
