Speeding up file processing with Unix commands
February 17th 2008

In my last post I commented on some changes I made to a Python script that processes a file, reducing the memory overhead caused by reading the whole file into RAM at once.
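
That earlier script isn't reproduced here, but the idea was roughly the following (the file name and keyword are invented for illustration, and the sketch is written in present-day Python): iterate over the gzipped file line by line, so that only the current line needs to be held in memory rather than the whole uncompressed contents.

    import gzip

    # Looping over the open file object yields one line at a time, so memory
    # use stays roughly constant instead of growing with the uncompressed size.
    with gzip.open("output.gz", "rt") as handle:   # hypothetical file name
        for line in handle:
            if "Total energy" in line:             # hypothetical keyword
                print(line.rstrip())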

I realized that the script still needed a lot of optimizing, so I finally read the link a reader (Paddy3118) was kind enough to point me to, and learned that I could save time by compiling my search expressions. Basically, my script opens a gzipped file, searches for lines containing some keywords, and uses the info read from those lines. The original script took 44 seconds to process a 6.9 MB file (49 MB uncompressed). Compiling the search expressions brought this down to 29 s. I also tried using match instead of search, and expressions like "if pattern in line:" instead of re.search(), but these didn't make much of a difference.
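
The script itself isn't shown in the post, so the following is only a sketch of the change, with an invented file name and pattern: the point is to call re.compile() once, outside the loop, rather than handing a raw pattern string to re.search() on every line.

    import gzip
    import re

    # Compile the pattern once; inside the loop we reuse the compiled object.
    energy_re = re.compile(r"Total energy\s*=\s*(-?\d+\.\d+)")   # hypothetical pattern

    energies = []
    with gzip.open("output.gz", "rt") as handle:   # hypothetical file name
        for line in handle:
            match = energy_re.search(line)
            if match:
                energies.append(float(match.group(1)))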

Later I thought that Unix commands such as grep were especially suited for the task, so I gave them a try. I modified my script to run in two steps: in the first one I used zcat and awk (called from within the script) to create a much smaller temporary file containing only the lines with the information I wanted. In the second step, I processed this file with standard Python code. This hybrid approach reduced the processing time to just 12 s. Sometimes using the right tool for the job really makes a difference, and the Unix utilities are hard to come close to in terms of raw text-processing performance.
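
Again, the actual script isn't listed in the post, so this is just one way the two-step approach could look (file names and patterns are invented): let zcat and awk do the coarse filtering into a temporary file, then parse that much smaller file with ordinary Python code.

    import re
    import subprocess
    import tempfile

    # Step 1: zcat + awk extract only the interesting lines into a small
    # temporary file; the shell pipeline does the heavy scanning.
    tmp = tempfile.NamedTemporaryFile(suffix=".txt", delete=False)
    tmp.close()
    subprocess.call(
        "zcat output.gz | awk '/Total energy/' > %s" % tmp.name,   # hypothetical names
        shell=True,
    )

    # Step 2: process the pre-filtered file with standard Python code.
    energy_re = re.compile(r"Total energy\s*=\s*(-?\d+\.\d+)")
    energies = []
    with open(tmp.name) as handle:
        for line in handle:
            match = energy_re.search(line)
            if match:
                energies.append(float(match.group(1)))

On my reading of the post, the win comes from awk scanning the raw text much faster than a Python regex loop can, leaving Python only the handful of lines that actually matter.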

It is only after programming exercises like this one that one realizes how important writing good code is (something I will probably never do well, but I try). For some reason it always makes me think of Windows, and how Microsoft refuses to make its programs efficient, relying on improvements in hardware instead. It's as if I had tried to speed up my first script with a faster computer, instead of fixing the code to be more efficient.



3 Responses to “Speeding up file processing with Unix commands”

  1. Super Coco on 17 Feb 2008 at 17:19

    I usually create shell scripts involving a lot of command-line utilities (awk, grep, sed, wc...) to process very large files (sometimes as large as several GB), and I'd expect the scripts to be veeeery slow, but it often surprises me how fast they can be :-O

  2. Speeding up file processing with Unix commands « handyfloss on 17 Sep 2008 at 10:51

    [...] Entry available at: http://handyfloss.net/2008.02/speeding-up-file-processing-with-unix-commands/ [...]

  3. Summary of my Python optimization adventures « handyfloss on 18 Sep 2008 at 14:43

    [...] the first one I spoke about saving memory by reading line-by-line, instead of all-at-once, and in the second one I recommended using Unix [...]
