Speeding up file processing with Unix commands

In my last post I described some changes I made to a Python script that processes a file, reducing the memory overhead of reading the whole file into RAM at once.

I realized that the script still needed a lot of optimizing, so I turned to the link a reader (Paddy3118) was kind enough to point me to, and learned that I could save time by compiling my search expressions. Basically, my script opens a gzipped file, searches for lines containing certain keywords, and uses the information read from those lines. The original script took 44 seconds to process a 6.9 MB file (49 MB uncompressed). Compiling the search expressions brought this down to 29 s. I also tried using match instead of search, and expressions like “if pattern in line:”, instead of re.search(), but these didn’t make much of a difference.
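A minimal sketch of the idea, with made-up patterns and file names standing in for the real ones (my actual script searches for different keywords):

```python
import gzip
import re

# Hypothetical patterns; the real script looks for other keywords.
PATTERNS = [re.compile(r"ENERGY\s*=\s*(-?\d+\.\d+)"),
            re.compile(r"TEMPERATURE\s*=\s*(\d+\.\d+)")]

def process(filename):
    """Read a gzipped file line by line, keeping data from matching lines."""
    results = []
    with gzip.open(filename, "rt") as fh:
        for line in fh:
            for pattern in PATTERNS:
                match = pattern.search(line)
                if match:
                    results.append(float(match.group(1)))
    return results
```

The point is that each pattern is compiled once, up front, instead of being re-parsed by the re module on every call inside the loop.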

Later I thought that Unix commands such as grep were especially suited for the task, so I gave them a try. I modified my script to run in two steps: in the first one I used zcat and awk (called from within the script) to create a much smaller temporary file with only the lines containing the information I wanted. In the second step, I processed this file with standard Python code. This hybrid approach reduced the processing time to just 12 s. Sometimes using the right tool really makes a difference, and the Unix utilities seem hard to beat in terms of performance.
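Roughly, the two-step version looks like this (the keywords, the awk filter and the parsing pattern are placeholders, not exactly what my script uses):

```python
import re
import subprocess
import tempfile

# Hypothetical keywords; the awk filter in my actual script is different.
KEYWORDS = "ENERGY|TEMPERATURE"
PATTERN = re.compile(r"=\s*(-?\d+\.\d+)")

def extract_and_process(gzfile):
    # Step 1: zcat decompresses to stdout, awk keeps only the matching lines,
    # and the shell redirects them to a small temporary file.
    tmp = tempfile.NamedTemporaryFile(suffix=".txt", delete=False)
    tmp.close()
    cmd = "zcat %s | awk '/%s/' > %s" % (gzfile, KEYWORDS, tmp.name)
    subprocess.call(cmd, shell=True)

    # Step 2: process the (now much smaller) file with plain Python.
    results = []
    with open(tmp.name) as fh:
        for line in fh:
            match = PATTERN.search(line)
            if match:
                results.append(float(match.group(1)))
    return results
```

Because awk throws away the uninteresting lines before Python ever sees them, the Python loop only has to parse a tiny fraction of the original 49 MB.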

It is only after programming exercises like this one that one realizes how important writing good code is (something I will probably never master, but I try). For some reason I always think of Windows, and how Microsoft refuses to make efficient programs, relying on hardware improvements instead. It’s as if I tried to speed up my first script by buying a faster computer, instead of fixing the code to be more efficient.

3 Comments »

  1. Super Coco said,

February 17, 2008 @ 17:19

I usually create shell scripts involving a lot of command-line utilities (awk, grep, sed, wc…) to process very large files (sometimes as large as several GB), and I’d expect the scripts to be veeeery slow, but it often surprises me how fast they can be :-O

  2. Speeding up file processing with Unix commands « handyfloss said,

    September 17, 2008 @ 10:51 am

    […] Entry available at: http://handyfloss.net/2008.02/speeding-up-file-processing-with-unix-commands/ […]

  3. Summary of my Python optimization adventures « handyfloss said,

September 18, 2008 @ 14:43

    […] the first one I spoke about saving memory by reading line-by-line, instead of all-at-once, and in the second one I recommended using Unix […]
