July, 2009 - handyfloss

Archive for July, 2009

plzma.py: a wrapper for parallel implementation of LZMA compression

July 23, 2009 at 14:39 pm · Filed under Free software and related beasts

Update: this script has been superseded by ChopZip

Introduction

I discovered the [[Lempel-Ziv-Markov chain algorithm|LZMA]] compression algorithm some time ago, and have been thrilled by its capacity since. It has higher compression ratios than even [[bzip2]], with a faster decompression time. However, although decompressing is fast, compressing is not: LZMA is even slower than bzip2. On the other hand, [[gzip]] remains blazing fast in comparison, while providing a decent level of compression.

More recently I have discovered the interesting pbzip2, which is a parallel implementation of bzip2. With the increasing popularity of multi-core processors (I have a quad-core at home myself), parallelizing the compression tools is a very good idea. pbzip2 performs really well, producing bzip2-compatible files with near-linear scaling with the number of CPUs.

LZMA being such a high performance compressor, I wondered if its speed could be boosted by using it in parallel. Although the [[Lempel-Ziv-Markov chain algorithm|Wikipedia article]] states that the algorithm can be parallelized, I found no such implementation in Ubuntu 9.04, where the utility provided by the lzma package is exclusively serial. Not finding one, I set out to write it myself.

About plzma.py

Any compression can be parallelized as follows:

Split the original file into as many pieces as CPU cores available
Compress (simultaneously) all the pieces
Create a single file by joining all the compressed pieces, and call the result “the compressed file”

In a Linux environment, these three tasks can be carried out easily by split, lzma itself, and tar, respectively. I just made a [[Python (programming language)|Python]] script to automate these tasks, called it plzma.py, and put it on my web site for anyone to download (it’s GPLed). Please note that plzma.py has been superseded by chopzip, starting with revision 12, whereas the latest plzma is revision 6.

I must remark that, while pbzip2 generates bzip2-compatible compressed files, that is not the case with plzma. The products of plzma compression must be decompressed with plzma as well. The actual format of a plzma file is just a TAR file containing as many LZMA-compressed chunks as CPUs used for compression. These chunks, once decompressed individually, can be concatenated (with the cat command) to form the original file.
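
The three steps and the container format described above can be sketched in a few lines of Python. This is not the plzma.py code: it is a minimal illustration that assumes Python 3 and its standard lzma module instead of the split, lzma and tar commands, and the function and member names (split_bytes, parallel_compress, chunk.NN.lz) are made up for the example. It uses threads on the assumption that CPython's lzma module releases the interpreter lock while compressing.

```python
import io
import lzma
import tarfile
from concurrent.futures import ThreadPoolExecutor


def split_bytes(data, n_chunks):
    """Split data into n_chunks contiguous pieces of near-equal size."""
    size = -(-len(data) // n_chunks)  # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(n_chunks)]


def compress_chunk(chunk):
    # FORMAT_ALONE is the legacy .lzma container; preset 3 is an arbitrary choice
    return lzma.compress(chunk, format=lzma.FORMAT_ALONE, preset=3)


def parallel_compress(data, n_chunks=4):
    """Return the bytes of a tar archive holding n_chunks compressed pieces."""
    chunks = split_bytes(data, n_chunks)
    # Step 2: compress all pieces at the same time, one worker per piece
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        compressed = list(pool.map(compress_chunk, chunks))
    # Step 3: join the compressed pieces into a single tar archive
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for i, blob in enumerate(compressed, start=1):
            info = tarfile.TarInfo(name=f"chunk.{i:02d}.lz")  # hypothetical naming
            info.size = len(blob)
            tar.addfile(info, io.BytesIO(blob))
    return buf.getvalue()
```

Zero-padded member names keep the pieces in order when they are later sorted and concatenated.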

Benchmarks

What review of compression tools lacks benchmarks? No matter how inaccurate or silly, none of them do. And neither does mine :^)

I used three (single) files as reference:

molekel.tar – a 108 MB tar file of the (GPL) [[Molekel]] 5.0 source code
usr.bin.tar – a 309 MB tar file of the contents of my /usr/bin/ dir
hackable.tar – a 782 MB tar file of the hackable:1 [[Debian]]-based distro for the [[Neo FreeRunner]]

The second case is intended as an example of binary file compression, whereas the other two are more of a “real-life” example. I didn’t test text-only files… I might in the future, but don’t expect the conclusions to change much. The testbed was my Frink desktop PC (Intel Q8200 quad-core).

The options for each tool were:

gzip/bzip2/pbzip2: compression level 6
lzma/plzma: compression level 3
pbzip2/plzma: 4 CPUs

Compressed size

The most important feature of a compressor is the size of the resulting file. After all, we use it in the first place to save space. No matter how fast an algorithm is, if the resulting file is bigger than the original file I wouldn’t use it. Would you?

The graph below shows the compressed size ratio for compression of the three test files with each of the five tools considered. The compressed size ratio is defined as the compressed size divided by the original size for each file.

This test holds few surprises: gzip is the least effective and LZMA the most effective. The point to make here is that the parallel implementations perform as well or as badly as their serial counterparts.

If you are unimpressed by the supposedly higher performance of bzip2 and LZMA over gzip, when in the picture all final sizes do not look very different, recall that gzip compressed molekel.tar ~ 3 times (to a 0.329 ratio), whereas LZMA compressed it ~ 4.3 times (to a 0.233 ratio). You could stuff 13 LZMAed files where only 9 gzipped ones fit (and just 3 uncompressed ones).
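
The arithmetic behind those figures is easy to check. The snippet below is only a worked example using the two ratios quoted above; the function names are invented for the illustration.

```python
def compression_factor(ratio):
    """How many times smaller the compressed file is (original / compressed)."""
    return 1.0 / ratio


def files_that_fit(space_in_originals, ratio):
    """Number of compressed files fitting in the space of N uncompressed ones."""
    return space_in_originals / ratio


ratios = {"gzip": 0.329, "lzma": 0.233}  # molekel.tar ratios quoted in the text
for tool, ratio in ratios.items():
    print(f"{tool}: {compression_factor(ratio):.2f}x smaller, "
          f"{files_that_fit(3, ratio):.1f} files in the space of 3 originals")
# gzip: 3.04x smaller, 9.1 files in the space of 3 originals
# lzma: 4.29x smaller, 12.9 files in the space of 3 originals
```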

Compression time

However important the compressed size is, compression time also matters. Actually, that’s the very issue I try to address by parallelizing LZMA: to make it faster while keeping its high compression ratio.

The graph below shows the normalized times for compression of the three test files with each of the five tools considered. The normalized time is taken as the total time divided by the time it took gzip to finish (an arbitrary scale with t(gzip)=1.0).

Roughly speaking, we could say that in my setting pbzip2 makes bzip2 as fast as gzip, and plzma makes LZMA as fast as serial bzip2.

The speedups for bzip2/pbzip2 and LZMA/plzma are given in the following table:

File	pbzip2	plzma
molekel.tar	4.00	2.72
usr.bin.tar	3.61	3.38
hackable.tar	3.80	3.04

The performance of plzma is nowhere near that of pbzip2, but I’d call it acceptable (wouldn’t I? I’m the author!). There are two reasons I can think of to explain the lower-than-linear scalability. The first one is the overhead imposed by cutting the file into pieces and then assembling them back. The second one, maybe more important, is the disk performance. Each core may be able to compress its chunk independently, but the disk I/O for reading the chunks and writing them back compressed is done simultaneously on the same disk, which the four processes share.
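
One way to read the speedup table is as parallel efficiency, that is, the speedup divided by the number of cores (4 in these tests). The snippet below just redoes that division on the figures from the table; the function name is made up for the example.

```python
N_CORES = 4  # both pbzip2 and plzma were run with 4 CPUs

# Speedups copied from the table above
speedups = {
    "molekel.tar": {"pbzip2": 4.00, "plzma": 2.72},
    "usr.bin.tar": {"pbzip2": 3.61, "plzma": 3.38},
    "hackable.tar": {"pbzip2": 3.80, "plzma": 3.04},
}


def parallel_efficiency(speedup, n_cores=N_CORES):
    """Fraction of the ideal (linear) speedup actually achieved."""
    return speedup / n_cores


for name, row in speedups.items():
    cells = ", ".join(f"{tool} {parallel_efficiency(s):.0%}" for tool, s in row.items())
    print(f"{name}: {cells}")
```

By this measure plzma reaches between roughly two thirds and five sixths of linear scaling on these files.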

Update: I think that a good deal of under-linearity comes from the fact that files of equal size will not be compressed in an equal time. Each chunk compression will take a slightly different time to complete, because some will be easier than others to compress. The program waits for the last compression to finish, so it’s as slow as the slowest one. It is also true that pieces of 1/N size might take more than 1/N time to complete, so the more chunks, the slower the compression in total (the opposite could also be true, though).
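
The "as slow as the slowest one" effect can be put in numbers with a toy model. The timings below are invented for illustration, and the model assumes that the serial time is simply the sum of the chunk times, which the last sentence above warns need not hold.

```python
def parallel_wall_time(chunk_times):
    """The run finishes only when the slowest chunk does."""
    return max(chunk_times)


def speedup(chunk_times):
    """Serial time (assumed to be the sum of the chunks) over parallel wall time."""
    return sum(chunk_times) / parallel_wall_time(chunk_times)


even = [10.0, 10.0, 10.0, 10.0]    # four chunks that take equally long
uneven = [8.0, 9.0, 10.0, 13.0]    # same 40 s of total work, unevenly spread

print(speedup(even))    # 4.0, the ideal case
print(speedup(uneven))  # 40/13, a bit above 3
```

With the same total work, a single chunk that takes 30% longer than average is enough to drop the speedup from 4 to about 3.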

Decompression times

Usually we pay less attention to decompression time, because decompression is much faster (and because we often compress things never to open them again, in which case we would have done better to delete them in the first place… but I digress).

The following graph shows the decompression data equivalent to the compression times graph above.

The most noteworthy point is that pbzip2 decompresses pbzip2-compressed files faster than bzip2 does with bzip2-compressed files. That is, both compression and decompression benefit from the parallelization. However, for plzma that is not the case: decompression is slower than with the serial LZMA. This is due to two effects. First, the decompression part is still not parallelized in my script (it soon will be), which by itself would give decompression speeds close to those of the serial LZMA. Second, the overhead of splitting and then joining makes it slower still.

Another result worth noting is that, although LZMA is much slower than even bzip2 to compress, the decompression is actually faster. This is not random. LZMA was designed with fast uncompression time in mind, so that it could be used in, e.g. software distribution, where a single person compresses the original data (however painstakingly), then the users can download the result (the smaller, the faster), and uncompress it to use it.

Conclusions

While there is room for improvement, plzma seems like a viable option to speed up general compression tasks where a high compression ratio (LZMA level) is desired.

I would like to stress the point that plzma files cannot be decompressed with LZMA alone. If you don’t use plzma to decompress, you can follow these steps:

% tar -xf file.plz
% lzma -d file.0[1-4].lz
% cat file.0[1-4] > file
% rm file.0[1-4] file.plz
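
The same manual recovery can be sketched in Python. This is an illustration, not part of plzma: it assumes Python 3 with the standard lzma and tarfile modules, that the archive holds only the LZMA-compressed chunks, and that sorting the member names gives the right order (true for zero-padded names like file.01.lz). The function name join_chunks is made up.

```python
import io
import lzma
import tarfile


def join_chunks(archive_bytes):
    """Untar, decompress each chunk in name order, and concatenate the results."""
    parts = []
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r") as tar:
        members = [m for m in tar.getmembers() if m.isfile()]
        for member in sorted(members, key=lambda m: m.name):
            blob = tar.extractfile(member).read()
            parts.append(lzma.decompress(blob))  # auto-detects .lzma and .xz
    return b"".join(parts)
```

For a real file one would read the archive with open("file.plz", "rb").read() and write the returned bytes back to disk.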


On González-Sinde on Internet downloads

July 16, 2009 at 14:22 pm · Filed under Free software and related beasts

As always, I am on the cutting edge of the news, commenting on stories that are almost a month old. Oh well.

The thing is, I wanted to comment on some gems from this ~~ignorant~~ minister, González-Sinde. Of the many ~~stupidities~~ things she has said, I will refer specifically to those collected in this news piece in El País.

Let’s take it one piece at a time:

The Minister of Culture, Ángeles González Sinde, has pointed out that one should not generalize and accuse all Internet users of making illegal downloads […]

There they go again with “illegal downloads”! When will we get it into our heads that downloading copyrighted material from the Internet IS NOT ILLEGAL in Spain when it is done without a profit motive? And if that material is redistributed commercially, the offense lies in profiting from that redistribution, not in the download itself. Spanish law defends the rights of citizens (as it should), and does not allow a few to control what we can access for non-commercial purposes.

González-Sinde, in statements to RNE, […] has declared herself in favor of the latest proposal from the Coalición de Creadores (Coalition of Creators), which represents the cultural industry, to go after link web sites instead of users.

“Coalition of Creators”? “Cultural industry”? Doesn’t anyone’s stomach turn at such concepts?

As for going after link pages instead of users, that takes the cake. We all know that newspapers, to give one example, draw a substantial part of their income from ads for thinly veiled prostitution services, and yet the Government says nothing about it. I hear no one saying that if prostitution is illegal, ads for it should be too. With downloads, however, the case is the reverse: they are legal (read the Law, Minister), yet the aim is to prosecute not the act itself but its facilitation through ads and information. What could the difference be? The usual one: money. While ads for businesses that exploit women’s sexual freedom fatten the coffers of certain businessmen, downloads that make cultural and leisure resources accessible to millions of citizens drain the coffers of certain other businessmen. This makes me wonder: why can the economic gain or loss of certain businessmen affect the decisions of a Government, which as such owes itself to the citizens and to the application of the Law and Justice? One also wonders why making money cancels out the injustice of prostitution and, conversely, why losing it cancels out the social benefits of more accessible culture, knowledge and leisure. That is, money is the moral measure of whether something is good or bad, isn’t it?

“[…] it is important to apply the laws we already have and close those 200 pages that profit by making available audiovisual material they have obtained illicitly”

I don’t know which pages she means. Perhaps she means pages that rip music from commercial CDs and sell it online as if it were their own? If so, I applaud the decision. It is unacceptable for people to profit from the effort and art of the artists.

Now, apart from the record labels, I don’t know of any sites that do that. There are indeed sites that make copyrighted material accessible through p2p technologies, but all the cases I know of are free of charge. Users upload the material they wish to share, and others download it, with no benefit other than the quid pro quo.

She has clarified that the problem with music on the Internet is the small size of songs and how quickly they can be copied.

This is the gem that unleashed my indignation, the trigger for this post.

First, it is false, since people also shared files back when network speeds were lower. There is no song size so large, or line so slow (within reasonable limits), that people would choose not to download music or films.

But, second, it is an incredibly perverse line of reasoning, all the more so coming from a Minister of Culture. That files with audiovisual content can be compressed while keeping their quality is a technological advance of tremendous value. It lets us store more in less space, lets companies that work with such content make more backups, allows faster and more efficient transmission, allows real-time video streaming over channels that would otherwise be too slow for it… As for the speed of communication networks, that is an even more important advance. It allows real-time communication between any two points on the globe, allows international collaboration (in science, for example), allows vital information to be transmitted and replicated in reasonable time, allows remote backups in reasonable time, and lets me record a video of my son playing with a rattle and get it to his grandparents before the kid ~~goes to university~~ becomes a footballer.

What this dreadful woman is insinuating is that technology lets us do wonderful things, and is therefore bad. She is preaching a covert obscurantism.

For the minister, the criticism she has received over this regulation is telling: “how virulent or passionate those reactions are shows that it is an important issue in people’s lives. The net has changed the way of taking part in society”.

No, madam. The criticism indicates what all criticism indicates: that people do not agree with you. People do not simply “react passionately”. People are outraged at you and at your statements. It is as simple as that.

In the specific case of music piracy, she stressed that “I am very worried about the side effects [of] the investment not being recovered when you invest in culture that can be copied”.

The same old argument: culture is dying because, since access to it is free, no one will want to produce it.

The defenders of such nonsense commit the fallacy of taking for granted that selling pieces of plastic with songs inside is the only way to profit from music production. Just as radio stations pester us with the latest from Loco del Canto, Bisbal and the rest in order to “promote” them, so that people then buy more records and go to more concerts, it remains valid to say that distributing music over the Internet makes many artists more visible (though not necessarily the ones the ~~mafias~~ record labels want) and lets them earn money from concerts nobody would attend had they not downloaded their music from the Internet. I see no one complaining that broadcasting music for free on the radio may hurt record sales. After all, if I can hear the song on the radio (and can even record it from there, if I want), why would I buy the CD? On the contrary, the record labels spend a fortune greasing the palms of radio stations so that they broadcast what they (the labels) want people to hear.

But even if downloads lowered record sales and artists earned less, even if budding artists gave up on that world for lack of economic incentive (another fallacy: assuming that the only motivation for producing culture is economic), even if the production of Culture suffered because of downloads… that would not justify the harm done to the citizenry by unjustly restrictive measures.

By denigrating and trying to prevent downloads of copyrighted material, enormous harm is being done to society. To begin with, it is an attempt to prop up an obsolete business model, which in a capitalist society is unacceptable. Selling physical media for audiovisual material is not an end in itself, but a means of getting the product to consumers as efficiently as possible. A band cannot be made to play for a customer every time the customer wishes, but it can be recorded on a physical medium, which the customer then uses to play the music. Since producing and distributing these physical media costs money, it is logical to charge for them, as for any good (the physical medium) or service (the distribution). But note that what is charged for is the production and transport of the physical medium. As soon as there are other means of closing the gap between musicians and their audience, physical media (CDs, etc.) will become obsolete, and paying for them will be unsustainable. That point has already arrived.

The second harm to society is of a moral nature. We are told that “you can’t have everything for free” (I keep asking myself: why not? Isn’t that the goal of every society, that its citizens be satisfied without having to “pay” for it? Does life have to be a “vale of tears” no matter what?), but we are also told that “sharing is wrong”. This is a disastrous message. Sharing is what makes, for example, Wikipedia what it is. Sharing is what makes it possible for people to watch foreign series subtitled in their own language by selfless third parties. The good thing about p2p and Web 2.0 is that the material we consume is improved (and often created) by the selfless contributions of others. In return, I am that “selfless other” for them, contributing my bandwidth and hard disk space so that they can watch a film I have already seen. Or giving up my time to fix a Wikipedia article, or to comment on a blog and contribute something to its author, or to answer a question in a forum on a subject I know well. From my point of view, it is a pity that not everything in the world works this way; that we cannot selflessly contribute what we know and can do, and benefit from the same contribution from others. And the one refuge where we can, they want to take away from us? Do they want to criminalize being a good neighbor?


Accessing Linux ext2/ext3 partitions from MS Windows

July 2, 2009 at 15:57 pm · Filed under Free software and related beasts

Accessing both Windows [[File Allocation Table|FAT]] and [[NTFS]] file systems from Linux is quite easy, with tools like [[NTFS-3G]]. However (in keeping with the [[shit|MS]] tradition of making itself incompatible with everything else, to thwart competition), doing the opposite (accessing Linux file systems from Windows) is more complicated. One has to wonder why (and how!) [[closed source software|closed]], [[proprietary software|proprietary]] and technically inferior file systems can be read by free software tools, whereas proprietary software with such a big corporation behind it is incapable of interacting (or unwilling to interact) with superior, [[free software]] file systems. Why should Windows users be deprived of the choice of [[JFS (file system)|JFS]], [[XFS]] or [[ReiserFS]], when they are free? Are MS techs too dumb to implement them? Or too evil to give their users the choice? Or, maybe, too scared that if choice is possible, their users will dump NTFS? None of these explanations makes one feel much love for MS, does it?

This stupid inability of Windows to read any of the many formats Linux can use gives rise to problems for not only Windows users, but also Linux users. For example, when I format my external hard disks or pendrives, I end up wondering if I should reserve some space for a FAT partition, so I could put there data to share with hypothetical Windows users I could lend the disk to. And, seriously, I abhor wasting my hardware with such lousy file systems, when I could use Linux ones.

Anyway, there are some third-party tools to help us with such a task. I found at least two:

Ext2IFS
ext2fsd

I have used the first one, but as some blogs point out (e.g. BloggUccio), ext2fsd is required if the [[inode]] size is bigger than 128 B (256 B in some modern Linux distros).

Getting Ext2IFS

It is a simple exe file you can download from fs-driver.org. Installing it consists of the typical Windows next-next-finish click-dance. In principle the defaults are OK. It will ask you about activating “read-only” (which I declined: it’s less safe, but I would like to be able to write too), and something about large file support (which I accepted, because it’s only an issue with Linux kernels older than 2.2… Middle Ages stuff).

Formatting the hard drive

In principle, Ext2IFS can read ext2/ext3 partitions with no problem. In practice, if the partition was created with an [[inode]] size of more than 128 bytes, Ext2IFS won’t read it. To create a “compatible” partition, you can mkfs it with the -I flag, as follows:

# mkfs.ext3 -I 128 /dev/whatever

I found out about the 128 B inode thing from this forum thread [es].
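
To check whether an existing partition already meets the 128-byte requirement, the inode size can be read straight from the superblock. The sketch below is an illustration under stated assumptions, not something taken from the Ext2IFS documentation: it relies on the standard ext2/ext3 on-disk layout as I understand it (superblock at byte 1024, magic number 0xEF53 at offset 56, revision level at offset 76, inode size at offset 88, all little-endian), and the function names are made up. Reading a real device such as /dev/whatever normally requires root.

```python
import struct

SUPERBLOCK_OFFSET = 1024  # the superblock starts 1024 bytes into the partition
EXT_MAGIC = 0xEF53        # magic number shared by ext2, ext3 and ext4


def inode_size(path):
    """Return the inode size in bytes of an ext2/ext3 partition or image file."""
    with open(path, "rb") as dev:
        dev.seek(SUPERBLOCK_OFFSET)
        sb = dev.read(1024)
    if len(sb) < 90:
        raise ValueError("file too short to hold a superblock")
    (magic,) = struct.unpack_from("<H", sb, 56)
    if magic != EXT_MAGIC:
        raise ValueError("not an ext2/ext3 file system")
    (rev_level,) = struct.unpack_from("<I", sb, 76)
    if rev_level == 0:
        return 128  # original revision: inodes are always 128 bytes
    (size,) = struct.unpack_from("<H", sb, 88)
    return size


def readable_by_ext2ifs(path):
    """True if the inode size is within the 128-byte limit described above."""
    return inode_size(path) <= 128
```

Where available, the dumpe2fs or tune2fs tools report the same value without any scripting.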

Practical use

What I have done, and tested, is the following: I format my external drives with almost all of the space as ext3, as described, leaving a couple of gigabytes (you could cut down to a couple of megabytes if you really want to) for a FAT partition. Then I copy the Ext2IFS_1_11a.exe executable to that partition.

Whenever you want to use that drive, Linux will see two partitions (the ext3 and the FAT one), the second one of which you can ignore. From Windows, you will see only a 2GB FAT partition. However, you will be able to open it, find the exe, double-click, and install Ext2IFS. After that, you can unplug the drive and plug it again… et voilà, you will see the ext3 partition just fine.

