Please, choose the right format to send me that text. Thanks.

I just received an e-mail with a very interesting text (recipes for [[Pincho|pintxos]]), and it prompted some experiments. The issue is that the text was inside a [[DOC (computing)|DOC]] file (of course!), which raises some questions and concerns on my side. The size of the file was 471 kB.

I thought that one could make the document more portable by exporting it to [[PDF]] (using [[OpenOffice.org]]). Doing so, the resulting file has a size of 364 kB (1.29 times smaller than the original DOC).

Furthermore, text formatting could be waived, by using a [[plain text]] format. A copy/paste of the contents of the DOC into a TXT file yielded a 186 kB file (2.53x smaller).

Once in the mood, we can go one step further and compress the TXT file: with [[gzip]] we get a 51 kB file (9.24x), and with [[xz]] a 42 kB one (11.2x).

So far, so good. No surprise. The surprise came when, just for fun, I exported the DOC to [[OpenDocument|ODT]]. I obtained a document equivalent to the original one, but with a 75 kB size! (6.28x smaller than the DOC).
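
The ratios quoted above are simply the DOC size divided by the size of each alternative. A minimal shell sketch to recompute them follows; the ratio helper is a name made up for this example, and awk is assumed to be available:

```shell
# ratio BIG SMALL: print how many times smaller SMALL is than BIG, to two decimals.
# (Hypothetical helper, not part of any tool mentioned in the post.)
ratio() {
    awk -v big="$1" -v small="$2" 'BEGIN { printf "%.2f\n", big / small }'
}

doc_kb=471    # size of the original DOC, in kB

# Sizes (kB) reported above for each alternative format.
for entry in PDF:364 TXT:186 TXT.gz:51 TXT.xz:42 ODT:75; do
    name=${entry%%:*}
    size=${entry##*:}
    echo "$name: $(ratio "$doc_kb" "$size")x smaller than the DOC"
done
```

The sizes are hard-coded from the numbers in the text; for real files one could feed the helper the output of wc -c instead.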

So, to summarize:

DOC

Pros

  • Editable.
  • Allows for text formatting.

Cons

  • Proprietary. In principle only MS Office can open it. OpenOffice.org can too, but only thanks to reverse engineering.
  • If opened with OpenOffice.org, or just a different version of MS Office, the reader cannot be sure of seeing the same formatting the writer intended.
  • Size. 6 times bigger than ODT. Even bigger than PDF.
  • MS invented and owns it. You need more reasons?

PDF

Pros

  • Portability. You can open it in any OS (Windows, Linux, Mac, BSD…), on account of there being so many free PDF readers.
  • Smaller than the DOC.
  • Allows for text formatting, and the format the reader sees will be exactly the one the writer intended.

Cons

  • Not editable (I really don’t see the point in editing PDFs. For me the PDF is a product of an underlying format (e.g. LaTeX), as what you see on your browser is the product of some HTML/PHP, or an exe is the product of some source code. But I digress.)
  • Could be smaller

TXT

Pros

  • Portability. You can’t get much more portable than a plain text file. You can edit it anywhere, with your favorite text editor.
  • Size. You can’t get much smaller than a plain text file (as it contains the mere text content), and you can compress it further with ease.

Cons

  • Formatting. If you need text formatting, or including pictures or content other than text, then plain text is not for you.

ODT

Pros

  • Portability. It can be edited with OpenOffice.org (and probably others), which is [[free software]], and has versions for Windows, Linux, and Mac.
  • Editability. Every bit as editable as DOC.
  • Size. 6 times smaller files than DOC.
  • It’s a free standard, not some proprietary rubbish.

Cons

  • None I can think of.

So please, if you send me some text, first consider whether plain text will suffice. If not, and no editing is intended on my side, PDF is fine. If editing is important (or size, because it’s smaller than PDF), then ODT is the way to go.

Save HD space by using compressed files directly

Maybe the constant increases in hard disk capacity give us more space to waste with our files, but there is always a situation in which we would like to squeeze as much data into as little space as possible. Besides, it is always good practice to keep disk usage as low as possible, just for tidiness.

The first and most important advice for saving space: for $GOD’s sake, delete the stuff you don’t need!

Now, assuming you want to keep all you presently have, the second tool is [[data compression]]. Linux users have long-time friends in the [[gzip]] and [[bzip2]] commands. One would use the former for fast (and reasonably good) compression, and the latter when saving space is really vital (although bzip2 is really slow). A more recent entry in the “perfect compression tool” contest would be the [[Lempel-Ziv-Markov chain algorithm]] (LZMA). This one can compress even more than bzip2, while usually being faster (although never as fast as gzip).

One problem with compression is that it is a good way of storing files, but they usually have to be uncompressed to be modified, and then re-compressed, which is very slow. However, we have some tools to interact with the compressed files directly (internally decompressing “on the fly” only the part that we need to edit). I would like to just mention them here:

Shell commands

We can use zcat, zgrep and zdiff as replacements for cat, grep and diff, but for gzipped files. These account for a huge fraction of all the interaction I do with text files from the command line. If you are like me, they can save you tons of time.
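
As a quick illustration of the three commands, here is a small sketch that works on throwaway files in a temporary directory. The file names and contents are invented for the example, and it assumes the z* helpers shipped with gzip are installed:

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
printf 'alpha\nbeta\ngamma\n' > "$workdir/a.txt"
printf 'alpha\nBETA\ngamma\n' > "$workdir/b.txt"
gzip "$workdir/a.txt" "$workdir/b.txt"    # leaves a.txt.gz and b.txt.gz

# zcat: print the decompressed contents without creating a.txt on disk.
first_line=$(zcat "$workdir/a.txt.gz" | head -n 1)

# zgrep: search inside the compressed file; -c counts the matching lines.
matches=$(zgrep -c 'a$' "$workdir/a.txt.gz")

# zdiff: compare two compressed files; it exits non-zero when they differ.
zdiff "$workdir/a.txt.gz" "$workdir/b.txt.gz" || true

echo "first line: $first_line, lines ending in a: $matches"
rm -rf "$workdir"
```

None of these commands leaves an uncompressed copy behind, which is the whole point.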

Vim

[[Vim (text editor)|Vim]] can be instructed to open some files through a decompression tool, showing the contents of the file and letting us work on them transparently. Once we :wq out of the file, we get the original compressed file back. This cycle is incredibly fast: almost as fast as opening the uncompressed file, and nowhere near as slow as gunzipping, viming and gzipping sequentially.

You can add the following to your .vimrc config file for the above:

" Only do this part when compiled with support for autocommands.
if has("autocmd")

 augroup gzip
  " Remove all gzip autocommands
  au!

  " Enable editing of gzipped files
  " set binary mode before reading the file
  autocmd BufReadPre,FileReadPre	*.gz,*.bz2,*.lz set bin

  autocmd BufReadPost,FileReadPost	*.gz call GZIP_read("gunzip")
  autocmd BufReadPost,FileReadPost	*.bz2 call GZIP_read("bunzip2")
  autocmd BufReadPost,FileReadPost	*.lz call GZIP_read("unlzma -S .lz")

  autocmd BufWritePost,FileWritePost	*.gz call GZIP_write("gzip")
  autocmd BufWritePost,FileWritePost	*.bz2 call GZIP_write("bzip2")
  autocmd BufWritePost,FileWritePost	*.lz call GZIP_write("lzma -S .lz")

  autocmd FileAppendPre			*.gz call GZIP_appre("gunzip")
  autocmd FileAppendPre			*.bz2 call GZIP_appre("bunzip2")
  autocmd FileAppendPre			*.lz call GZIP_appre("unlzma -S .lz")

  autocmd FileAppendPost		*.gz call GZIP_write("gzip")
  autocmd FileAppendPost		*.bz2 call GZIP_write("bzip2")
  autocmd FileAppendPost		*.lz call GZIP_write("lzma -S .lz")

  " After reading compressed file: Uncompress text in buffer with "cmd"
  fun! GZIP_read(cmd)
    let ch_save = &ch
    set ch=2
    execute "'[,']!" . a:cmd
    set nobin
    let &ch = ch_save
    execute ":doautocmd BufReadPost " . expand("%:r")
  endfun

  " After writing compressed file: Compress written file with "cmd"
  fun! GZIP_write(cmd)
    if rename(expand("<afile>"), expand("<afile>:r")) == 0
      execute "!" . a:cmd . " <afile>:r"
    endif
  endfun

  " Before appending to compressed file: Uncompress file with "cmd"
  fun! GZIP_appre(cmd)
    execute "!" . a:cmd . " <afile>"
    call rename(expand("<afile>:r"), expand("<afile>"))
  endfun

 augroup END
endif " has("autocmd")

I first found the above in my (default) .vimrc file, allowing gzipped and bzipped files to be edited. I added the “support” for LZMAed files quite trivially, as can be seen in the lines containing “lz” in the code above (I use .lz as the extension for LZMAed files, instead of the default .lzma. See man lzma for more info).

Non-plaintext files

Other files that I have been able to successfully use in compressed form are [[PostScript]] and [[Portable Document Format|PDF]]. Granted, PDFs are already quite compact, but sometimes gzipping them saves space. In general, PS and EPS files save a lot of space by gzipping.

As far as I have tried, the [[Evince]] document viewer can read gzipped PS, EPS and PDF files with no problem (probably [[Device_independent_file_format|DVI]] files as well).

Making a PDF grayscale with ghostscript

A request from a friend made me face the problem of converting a color [[Portable Document Format|PDF]] into a [[grayscale]] one. Searching the web provided some ways of doing so with [[Adobe Acrobat]], via some obscure menu item somewhere.

However, the very same operation could be undertaken with free tools, such as [[ghostscript]]. I found a way to do it in the YANUB blog, and I will copy-paste it here, with a small modification.

Assuming we have a file called color.pdf, and we want to convert it into grayscale.pdf, we could run the following command (either as shown, or all on a single line without the “\” line-continuation marks):

% gs -sOutputFile=grayscale.pdf -sDEVICE=pdfwrite \
-sColorConversionStrategy=Gray -dProcessColorModel=/DeviceGray \
-dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH color.pdf

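I prefer the above to YANUB’s version because a shell operation (the < /dev/null redirection) is replaced by options of the command we are running (-dNOPAUSE -dBATCH). The listing below shows both variants merged: -dNOPAUSE -dBATCH is what his version lacks, and < /dev/null is what mine lacks: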

% gs -sOutputFile=grayscale.pdf -sDEVICE=pdfwrite \
-sColorConversionStrategy=Gray -dProcessColorModel=/DeviceGray \
-dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH color.pdf < /dev/null

A sample [[Perl]] script to alleviate the tedious writing above:

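#!/usr/bin/perl -w
use strict;

# The input PDF is the first command-line argument.
my $infile = $ARGV[0];

# Build the output name: input.pdf -> input_gray.pdf
my $outfile = $infile;
$outfile =~ s/\.pdf$//;
$outfile = $outfile."_gray.pdf";

# Run Ghostscript with the same options as the command above.
system "gs -sOutputFile=$outfile -sDEVICE=pdfwrite -sColorConversionStrategy=Gray -dProcessColorModel=/DeviceGray -dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH $infile";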

Assuming we call the Perl script “togray.pl”, and that we have a color file “input.pdf”, we could just issue the command:

% togray.pl input.pdf

and we would get a grayscale version of it, named “input_gray.pdf”.
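
For those who would rather stay in the shell, the same wrapper can be sketched as a pair of functions. The names gray_name and togray are chosen for this example only; the gs options are exactly the ones used above, and quoting the arguments is meant to keep file names with spaces intact:

```shell
# gray_name FILE.pdf: print the output name, e.g. input.pdf -> input_gray.pdf
gray_name() {
    printf '%s_gray.pdf\n' "${1%.pdf}"
}

# togray FILE.pdf: write a grayscale copy next to the original.
# (Sketch only: it assumes gs is in the PATH and does no error checking.)
togray() {
    gs -sOutputFile="$(gray_name "$1")" -sDEVICE=pdfwrite \
       -sColorConversionStrategy=Gray -dProcessColorModel=/DeviceGray \
       -dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH "$1"
}
```

After sourcing the functions, togray input.pdf would be used just like the Perl script.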

LaTeX programming: how to implement conditionals

I have recently come across a problem while creating a LaTeX style (for making A0-size posters). Maybe it could be avoided or solved more elegantly, but I wanted to solve it with conditionals.

Basically, what I wanted to do was define a command (actually, an environment) that accepted one argument, and make it return different output, depending on the argument:

if (argument equals something) then
  do something
else
  do somethingelse
end if

It gave me some headaches to get it, but I also learned some interesting things on the way. There are at least two ways of playing with conditionals: defining boolean variables or directly using logical comparisons.

Defining logical variables

We can define a logical variable logvar as follows:

\newif\iflogvar

By default, it is set to false. We can set it to true by:

\logvartrue

and back to false by:

\logvarfalse

The variable can be used in a conditional as follows:

\iflogvar
  aaaa
\else
  bbbb
\fi

You can think of the above code as a single object, the output value of which will be “aaaa” if logvar is true, and “bbbb” if false. Thus, the following code will output “Today is great”:

Today

\newif\ifismonday

\ismondayfalse

\ifismonday
  sucks!
\else
  is \textbf{great}
\fi

Direct logic comparison

The example I provide works for numbers, but check this page for more info. Recall that LaTeX works with integers (counters) and text strings. As far as I know, floating point operations are impossible in LaTeX (nothing is actually impossible in LaTeX, just veeery difficult).

For example, defining the following command in the preamble:

\newcommand{\isitthree}[1]
{
  \ifnum#1=3
    number #1 is 3
  \else
    number #1 is not 3
  \fi
}

allows us to call it in the document, so the following outputs “We know that number 33 is not 3”:

We know that \isitthree{33}

Nesting

Obviously the conditionals can be nested (put one inside another), when more than one condition needs to be tested. For example:

Today

\newif\ifismonday
\newif\ifistuesday

\ismondayfalse
\istuesdaytrue

\ifismonday
  sucks!
\else
  \ifistuesday
    almost sucks.
  \else
    is \textbf{great}
  \fi
\fi

PowerDot screen size error in Debian Lenny

I use the PowerDot class to make presentations (such as the one in a previous post), and I have come across a nasty problem in the current testing branch of Debian (Lenny). Obviously it is bound to affect any other distro relying on Debian, such as Ubuntu.

The problem is discussed in this thread in freelists.org, and a solution is given by Hendri Adriaens in the tug.org bug page.

In short, when selecting paper=screen paper size in a PowerDot .tex file, the current dvips (version 5.96.1, provided by the package texlive-bin version 2007-12) generates a PostScript file with a wrong paper size. To fix it, you can get the following file:

% wget http://tug.org/svn/texlive/trunk/Master/texmf/dvips/config/config.ps

and put it where the following command tells you:

% kpsewhich --format='dvips config' config.ps

after backing up the old (buggy) one, just in case. For example:

% mv /etc/texmf/dvips/config/config.ps /etc/texmf/dvips/config/config.ps.backup
% mv config.ps /etc/texmf/dvips/config/

This fixes the problem for me.
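
The backup-and-replace step can also be wrapped in a small function, so that the old file is never overwritten without a copy being kept. This is only a sketch: replace_with_backup is a made-up name, and the target path should be whatever kpsewhich reports on your system:

```shell
# replace_with_backup NEW TARGET: keep TARGET as TARGET.backup, then install NEW.
# (Hypothetical helper; run it as root when TARGET lives under /etc.)
replace_with_backup() {
    new=$1
    target=$2
    [ -f "$new" ] || { echo "missing $new" >&2; return 1; }
    if [ -f "$target" ]; then
        cp -p "$target" "$target.backup" || return 1
    fi
    cp "$new" "$target"
}

# Intended use, with the path reported by kpsewhich:
#   replace_with_backup config.ps "$(kpsewhich --format='dvips config' config.ps)"
```

Copying instead of moving leaves the downloaded config.ps in place, which is handy if the operation has to be repeated.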

App of the week: PDF Cube

I just found this little app browsing for PDF software in my Debian aptitude repository contents.

In short, PDF Cube displays PDFs in full screen, adding Compiz-like cube transitions from slide to slide if we want. The following YouTube video shows how it works:

[youtube=http://youtube.com/watch?v=AscU72HOwgM]

You can notice the mixed regular/cube transitions, as well as the five zooming options used in slide 4.

By the way, I have started the Wikipedia article for PDF Cube. I think this little program deserves to be in Wikipedia.

Incidentally, the above is the first video I upload to YouTube! :^)

PDF exploits for all readers and platforms?

I have read in Kriptopolis some posts about new PDF exploits (in Spanish). The articles say that web browser PDF plugins are vulnerable, dedicated PDF readers are also vulnerable, and new exploits may be created. The Kriptopolis site keeps on talking about new vulnerabilities in PDF documents, and how they affect all platforms. Do they?

If you go to the SecurityFocus site, where they cover the news, you can download an example PDF that exploits this vulnerability. If you open it with any (vulnerable) PDF reader, the program will freeze, and the CPU usage will go through the roof.

Well, bold as I am, I did the test. I opened it with Acroread 7.0 for GNU/Linux and… it froze, and… the CPU usage hit the roof. I could not Ctrl-C the beast, and a kill would not kill it. Fortunately, a kill -9 did the job :^(

Now, I tried Evince:


Heracles[~/Downloads]: evince MOAB-06-01-2007.pdf
Error (3659): Illegal character ')'
Error (0): PDF file is damaged - attempting to reconstruct xref table...
Segmentation fault

and Xpdf:


Heracles[~/Downloads]: xpdf MOAB-06-01-2007.pdf
Error (3659): Illegal character ')'
Error (0): PDF file is damaged - attempting to reconstruct xref table...
Segmentation fault

Ta-da!! Yes, they crash, but they refuse to open the damned thing! They both complain, and don’t fall for it.

Perhaps it’s worth reminding the reader that Evince and Xpdf are free software, whereas Acroread is not. Acroread is merely free of charge, but not free as in freedom.

My public and open University

As the readers of this blog may know, I recently received my doctorate from the University of the Basque Country.

As a follow-up to the Thesis Defense act, there is still some paperwork to be done, such as filling in a datasheet called “Teseo”. Anyway, my comments apply to all the paperwork I did before, during and after the Thesis Defense.

The matter is that this freaking “Teseo” sheet is available online as an RTF or a PDF. An original handwritten copy must be sent to the University, so I used a printed-out PDF for that. No problem.

The problem came when the University requested that an “electronic” form be sent by e-mail (for which a scanned copy of my manuscript would not do). These bright minds surely wanted me to fill in the RTF, and send it. However, not everyone who wants to get a Ph.D. has adhered (or wants to adhere) to any expensive and abusive license for a proprietary product like MS Windows or MS Office. Certainly I haven’t, so I had to make do with GNU/Linux and OpenOffice to fill in the RTF. The result was crappy, due to incompatibilities of the friggin’ RTF proprietary format… but that is what I sent.

Now the point is: does the public University of the Basque Country (or the public Spanish Ministry of Education) have any reason to discriminate in favor of the private and foreign company Microsoft? Do we, the tax-payers who put the money for their salaries, have to put up with being forced to use specific proprietary formats to communicate with the public institutions? It disgusts me to no end.

Picture the following example: I want to attend the University, and they tell me that I have to wear shoes for that (e.g., use a computer). OK, this might be more or less arbitrary, but I can accept it. Now, imagine that they ask me to wear Gucci shoes (e.g., a proprietary file format, such as RTF). That would be unacceptable, because a public institution cannot favor a private company that way, at least not if there is any conceivable substitute (e.g. acceptable shoes of any other brand). And it doesn’t matter if instead of Gucci they require some cheaper shoe brand. The problem is not whether it is expensive, but rather that they are discriminating against other options. And they have no right to do it. They are there to serve us, not the other way around.

Someone could say that they have to use some electronic format, and any would be equally arbitrary. No, not at all. There is something called “open standards”, to which “things” (e.g. electronic document formats) can adhere. One body that publishes such standards is the ISO, and one document format adhering to a standard (the ISO/IEC 26300) is the Open Document Format (ODF), so they can use that.

The basics are simple: readers and editors for open formats can be made by anyone freely. No-one can force me to pay them royalties so that they allow me to make a program that reads these documents. With proprietary formats (such as DOC, RTF and others), the owner of the license (e.g. Microsoft) can ban anyone from making a program that writes documents in that format, or charge royalties as they please. Put bluntly: since the exchange of documents in my University depends on proprietary formats (RTF and DOC), Microsoft could decide tomorrow to disrupt its operations by denying further licenses for e.g. MS Office. Of course, this will not happen, because the University will pay as requested. I call this extortion, because the University cannot afford not to pay, so where do the “free competition” and “open market” ideas fit in here? Moreover, I call the University a bunch of fools, because they put themselves in a position where they can be extorted. None of this is possible if one uses open formats, because free (not free of cost, but free as in freedom) document editors are, and will always be, available.

Default Ghostscript paper size

The three times god-forsaken Ghostscript (I use the Debian package gs-afpl) suite is shipped worldwide with the US letter default paper size. So, when you use it (e.g. to convert PS to PDF), and if the source file does not specify a paper size, the output file will have a letter size, instead of the more sane A4.

You can specify A4 size at runtime, with the -sPAPERSIZE=a4 flag:

ps2pdf -sPAPERSIZE=a4 input.ps

However, if you want to always use A4 as default, you can change the gs_init.ps file (locate gs_init.ps), and uncomment the following line (remove the leading ‘%‘):

% /DEFAULTPAPERSIZE (a4) def

Beware that in Debian you will have to change it to (because the name of the variable is different):

/DEFPAPERSIZE (a4) def

You will only need to edit the gs_init.ps file (as root), make the changes and save the file. Subsequent gs uses (e.g. ps2pdf), will default to A4 page size.
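
The edit itself can be scripted with sed. The sketch below is only an illustration: uncomment_a4 is a made-up helper, it assumes the commented line looks exactly like the ones quoted above, and it prints the result instead of touching the file, so it can be tried safely on a scratch copy first:

```shell
# uncomment_a4 FILE: print FILE with the A4 default line uncommented.
# Matches both variable names mentioned above (DEFAULTPAPERSIZE and DEFPAPERSIZE).
# (Hypothetical helper; it writes to standard output and leaves FILE untouched.)
uncomment_a4() {
    sed 's|^%[[:space:]]*\(/DEF[A-Z]*PAPERSIZE (a4) def\)|\1|' "$1"
}

# Try it on a scratch file rather than on the real gs_init.ps.
scratch=$(mktemp)
printf '%% /DEFAULTPAPERSIZE (a4) def\n' > "$scratch"
uncomment_a4 "$scratch"
rm -f "$scratch"
```

Once the output looks right, it can be redirected to a new file and copied over gs_init.ps as root, keeping a backup of the original.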

Convert PS to PDF

I make extensive use of ps2pdf to convert PostScript files to PDF. Like most GNU/Linux tools, this is a simple and incredibly useful one.

However, sometimes it might give problems. For example, I have at times converted a PS to a PDF that Evince would open fine, but Acrobat Reader would not. I fixed this problem making use of the superb alternatives system present in Debian.

The first thing to know is that most PS and PDF manipulation (including PS-to-PDF conversion) is done by calling a backend application called Ghostscript (GS). A quick search within the Debian packages shows that most (if not all) of the GS versions mentioned in the Wikipedia page are available:

Bart[~/]: aptitude search ^gs-
i   gs-afpl                     - The AFPL Ghostscript PostScript interpreter
p   gs-aladdin                  - Transitional package
p   gs-cjk-resource             - Resource files for gs-cjk, ghostscript CJK-TrueType extension
i A gs-common                   - Common files for different Ghostscript releases
i A gs-esp                      - The Ghostscript PostScript interpreter - ESP version
p   gs-gpl                      - The GPL Ghostscript PostScript interpreter
v   gs-pdfencrypt               -

It turns out I was using gs-esp:

Bart[~/]: which gs
/usr/bin/gs
Bart[~/]: ls -l /usr/bin/gs
lrwxrwxrwx 1 root root 20 Jul  4 09:00 /usr/bin/gs -> /etc/alternatives/gs*
Bart[~/]: ls -l /etc/alternatives/gs
lrwxrwxrwx 1 root root 16 Jul 27 11:26 /etc/alternatives/gs -> /usr/bin/gs-esp

I remember having used different GS versions, and AFPL being the “best”, so I installed it and made the default gs point to it, with the Debian alternatives system (as root):

Bart:~# aptitude install gs-afpl
[...]
Bart:~# update-alternatives --config gs

There are 2 alternatives which provide `gs'.

  Selection    Alternative
  -----------------------------------------------
  * +   1        /usr/bin/gs-esp
        2        /usr/bin/gs-afpl

Press enter to keep the default[*], or type selection number:

There, I just pressed “2”, et voilà! Now my default GS is gs-afpl, and ps2pdf makes use of it. For any other GS version one might want to use, the procedure to change it would be the same.
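
The two ls -l steps used above to follow the symlink chain can be collapsed into one call, and the interactive menu can be skipped. A small sketch, assuming GNU readlink (for the -f option) and the --set action of update-alternatives; resolve_link is a name invented for this example:

```shell
# resolve_link PATH: follow every symlink in the chain and print the final target.
# (Hypothetical helper around GNU readlink -f.)
resolve_link() {
    readlink -f "$1"
}

# Intended use on the system described above:
#   resolve_link /usr/bin/gs
# Non-interactive equivalent of typing "2" in the menu (as root):
#   update-alternatives --set gs /usr/bin/gs-afpl
```

Running resolve_link /usr/bin/gs before and after the switch is a quick way to confirm which GS binary ps2pdf will end up calling.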
