Some more tweaks to my Python script

Update: you can find the outcome of all this in a later post: Project BHS

All the comments on my previous post have provided me with hints to further increase the efficiency of a script I am working on. Here I present the advice I have followed, and the speed gain it provided. I will speak of “speedup” instead of timings, because this second set of tests was made on a different computer. The “base” speed will be the last value of my previous test set (1.5 s on that computer, 1.66 s on this one). A speedup of “2” will thus mean half the execution time (0.83 s on this computer).

Version 6: Andrew Dalke suggested the substitution of:

line = re.sub('>','<',line)

with:

line = line.replace('>','<')

Avoiding the re module seems to speed things up when we are searching for fixed strings, since the additional features of the re module are not needed.

This is true, and I got a speedup of 1.37.

Version 7: Andrew Dalke also suggested substituting:

search_cre = re.compile(r'total_credit').search
if search_cre(line):

with:

if 'total_credit' in line:

This is more readable, more concise, and apparently faster. Doing it increases the speedup to 1.50.
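
Both claims are easy to check with timeit. Here is a minimal sketch (the sample line and the repetition count are made up for illustration):

import timeit

setup = (
    "import re\n"
    "line = '    <total_credit>123.45</total_credit>'\n"
    "search = re.compile('total_credit').search"
)

tests = ["re.sub('>','<',line)",      # regex substitution (before Version 6)
         "line.replace('>','<')",     # str.replace (Version 6)
         "search(line)",              # precompiled search (before Version 7)
         "'total_credit' in line"]    # plain substring test (Version 7)

for stmt in tests:
  print("%-26s %.3f s" % (stmt, timeit.timeit(stmt, setup, number=10**6)))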

Version 8: Andrew Dalke also proposed flattening some variables, and specifically avoiding dictionary lookups inside loops. I went even further than his advice, and substituted:

stat['win'] = [0,0]

loop
  stat['win'][0] = something
  stat['win'][1] = somethingelse

with:

win_stat_0 = 0
win_stat_1 = 0

loop
  win_stat_0 = something
  win_stat_1 = somethingelse

This pushed the speedup further up, to 1.54.
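
For illustration, here is that pattern in runnable form (the credit values are made up):

credits = [10.0, 5.5, 2.0]   # made-up sample values

# Before: a dict lookup plus a list index on every iteration.
stat = {'win': [0, 0]}
for credit in credits:
  stat['win'][0] += 1
  stat['win'][1] += credit

# After: plain local names inside the loop, stored back once at the end.
win_stat_0 = 0
win_stat_1 = 0
for credit in credits:
  win_stat_0 += 1
  win_stat_1 += credit
stat['win'] = [win_stat_0, win_stat_1]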

Version 9: Justin proposed reducing the number of times some patterns were matched, and extracting some info more directly. I attained that by substituting:

loop:
  if 'total_credit' in line:
    line   = line.replace('>','<')
    aline  = line.split('<')
    credit = float(aline[2])

with:

pattern    = r'total_credit>([^<]+)<'
search_cre = re.compile(pattern).search

loop:
  if 'total_credit' in line:
    cre    = search_cre(line)
    credit = float(cre.group(1))

This trick saved enough to increase the speedup to 1.62.
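
As a quick check of what the pattern captures (the sample line is made up to mimic the log format):

import re

search_cre = re.compile(r'total_credit>([^<]+)<').search

line = '    <total_credit>123.45</total_credit>'
cre  = search_cre(line)
if cre:
  print(float(cre.group(1)))   # prints 123.45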

Version 10: The next tweak was an idea of mine. I was digesting a huge log file with zcat and grep, to produce a smaller intermediate file for Python to process. The structure of this intermediate file is one of alternating lines, with “total_credit”, then “os_name”, then “total_credit” again, and so on. When processing this file with Python, I was searching each line for “total_credit” to tell the two kinds of line apart, like this:

for line in f:
  if 'total_credit' in line:
    do something
  else:
    do somethingelse

But the alternating structure of my input would allow me to do:

odd = True
for line in f:
  if odd:
    do something
    odd = False
  else:
    do somethingelse
    odd = True

Presumably, checking the value of a boolean is faster than searching the line for a substring, although in this case the gain was not huge: the speedup went up to 1.63.
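
As a side note, the two assignments to odd can be collapsed into a single toggle at the end of the loop body; a sketch with a made-up two-line input:

lines = ['<total_credit>1.0</total_credit>',
         '<os_name>Linux</os_name>']

odd = True
for line in lines:
  if odd:
    pass   # handle the total_credit line
  else:
    pass   # handle the os_name line
  odd = not odd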

Version 11: Another clever suggestion by Andrew Dalke was to avoid the intermediate file altogether, and use os.popen to read directly from the zcat/grep command. Thus, I substituted:

os.system('zcat host.gz | grep -F -e total_credit -e os_name > '+tmp)

f = open(tmp)
for line in f:
  do something

with:

f = os.popen('zcat host.gz | grep -F -e total_credit -e os_name')

for line in f:
  do something

This saves the disk I/O of the temporary file, and the performance increases accordingly. The speedup goes up to 1.98.
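
For what it's worth, the same pipeline can also be built without a shell string at all, using the subprocess module (a sketch I did not benchmark):

import subprocess

zcat = subprocess.Popen(['zcat', 'host.gz'], stdout=subprocess.PIPE)
grep = subprocess.Popen(['grep', '-F', '-e', 'total_credit', '-e', 'os_name'],
                        stdin=zcat.stdout, stdout=subprocess.PIPE)
zcat.stdout.close()   # let zcat receive SIGPIPE if grep exits first

for line in grep.stdout:
  pass   # process the line, as before

grep.wait()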

All the values I have given are for a sample log (from MalariaControl.net) with 7 MB of gzipped info (49 MB uncompressed). I also tested my scripts with a 267 MB gzipped (1.8 GB uncompressed) log (from SETI@home), and a plot of speedups vs. versions follows:

[Figure: Execution speedup vs. script version (versions2.png)]

Notice how the last modification (avoiding the temporary file) is much more important for the bigger file than for the smaller one. Recall also that the odd/even modification (Version 10) makes very little difference for the small file, but is quite effective for the big one (compare it with Version 9).

The plot doesn’t show it (it compares versions on the same input, not one input against the other), but my eleventh version of the script processes the 267 MB log faster than Version 1 processed the 7 MB one! For the 7 MB input, the overall speedup from Version 1 to Version 11 is above 50.


Summary of my Python optimization adventures

This is a follow-up to two previous posts. In the first one I spoke about saving memory by reading line by line instead of all at once, and in the second one I recommended using Unix commands.

The script reads a host.gz log file from a given BOINC project (more precisely, one I got from MalariaControl.net: it is a small project, so its logs are also small), and extracts how many computers are running the project and how much credit they are getting. The statistics are separated by operating system (Windows, Linux, MacOS and other).

Version 0

Here I read the whole file into RAM, then process it with Python alone. Running time: 34.1s.

#!/usr/bin/python

import os
import re
import gzip

credit  = 0
os_list = ['win','lin','dar','oth']

stat = {}
for osy in os_list:
  stat[osy] = [0,0]

# Process file:
f = gzip.open('host.gz','r')
for line in f.readlines():
  if re.search('total_credit',line):
    credit = float(re.sub('</?total_credit>',' ',line.split()[0]))
  elif re.search('os_name',line):
    if re.search('Windows',line):
      stat['win'][0] += 1
      stat['win'][1] += credit
    elif re.search('Linux',line):
      stat['lin'][0] += 1
      stat['lin'][1] += credit
    elif re.search('Darwin',line):
      stat['dar'][0] += 1
      stat['dar'][1] += credit
    else:
      stat['oth'][0] += 1
      stat['oth'][1] += credit
f.close()

# Return output:
nstring = ''
cstring = ''
for osy in os_list:
  nstring +=   "%15.0f " % (stat[osy][0])
  try:
    cstring += "%15.0f " % (stat[osy][1])
  except:
    print osy,stat[osy]

print nstring
print cstring

Version 1

The only difference is a “for line in f:“, instead of “for line in f.readlines():“. This saves a LOT of memory, but is slower. Running time: 44.3s.

Version 2

In this version I use precompiled regular expressions, and the time saving is noticeable. Running time: 26.2s.

#!/usr/bin/python

import os
import re
import gzip

credit  = 0
os_list = ['win','lin','dar','oth']

stat = {}
for osy in os_list:
  stat[osy] = [0,0]


pattern    = r'total_credit'
match_cre  = re.compile(pattern).match
pattern    = r'os_name'
match_os   = re.compile(pattern).match
pattern    = r'Windows'
search_win = re.compile(pattern).search
pattern    = r'Linux'
search_lin = re.compile(pattern).search
pattern    = r'Darwin'
search_dar = re.compile(pattern).search

# Process file:
f = gzip.open('host.gz','r')

for line in f:
  if match_cre(line,5):   # match at offset 5, just past the line's leading whitespace and '<'
    credit = float(re.sub('</?total_credit>',' ',line.split()[0]))
  elif match_os(line,5):
    if search_win(line):
      stat['win'][0] += 1
      stat['win'][1] += credit
    elif search_lin(line):
      stat['lin'][0] += 1
      stat['lin'][1] += credit
    elif search_dar(line):
      stat['dar'][0] += 1
      stat['dar'][1] += credit
    else:
      stat['oth'][0] += 1
      stat['oth'][1] += credit
f.close()

# etc.

Version 3

Later I decided to use AWK to perform the heaviest part: parsing the big file to produce a second, smaller file that Python then reads. Running time: 14.8s.

#!/usr/bin/python

import os
import re

credit  = 0
os_list = ['win','lin','dar','oth']

stat = {}
for osy in os_list:
  stat[osy] = [0,0]

pattern    = r'Windows'
search_win = re.compile(pattern).search
pattern    = r'Linux'
search_lin = re.compile(pattern).search
pattern    = r'Darwin'
search_dar = re.compile(pattern).search

# Distill file with AWK:
tmp = 'bhs.tmp'
os.system('zcat host.gz | awk \'/total_credit/{printf $0}/os_name/{print}\' > '+tmp)

# Process tmp file:
f = open(tmp)
for line in f:
  line = re.sub('>','<',line)
  aline = line.split('<')
  credit = float(aline[2])
  os_str = aline[6]
  if search_win(os_str):
    stat['win'][0] += 1
    stat['win'][1] += credit
  elif search_lin(os_str):
    stat['lin'][0] += 1
    stat['lin'][1] += credit
  elif search_dar(os_str):
    stat['dar'][0] += 1
    stat['dar'][1] += credit
  else:
    stat['oth'][0] += 1
    stat['oth'][1] += credit
f.close()

# etc

Version 4

Instead of using AWK, I decided to use grep, with the idea that nothing can beat this tool when it comes to pattern matching. I was not disappointed. Running time: 5.4s.

#!/usr/bin/python

import os
import re

credit  = 0
os_list = ['win','lin','dar','oth']

stat = {}
for osy in os_list:
  stat[osy] = [0,0]

pattern    = r'total_credit'
search_cre = re.compile(pattern).search

pattern    = r'Windows'
search_win = re.compile(pattern).search
pattern    = r'Linux'
search_lin = re.compile(pattern).search
pattern    = r'Darwin'
search_dar = re.compile(pattern).search

# Distill file with grep:
tmp = 'bhs.tmp'
os.system('zcat host.gz | grep -e total_credit -e os_name > '+tmp)

# Process tmp file:
f = open(tmp)
for line in f:
  if search_cre(line):
    line = re.sub('>','<',line)
    aline = line.split('<')
    credit = float(aline[2])
  else:
    if search_win(line):
      stat['win'][0] += 1
      stat['win'][1] += credit
    elif search_lin(line):
      stat['lin'][0] += 1
      stat['lin'][1] += credit
    elif search_dar(line):
      stat['dar'][0] += 1
      stat['dar'][1] += credit
    else:
      stat['oth'][0] += 1
      stat['oth'][1] += credit

f.close()

# etc

Version 5

I was not completely happy yet. I discovered the -F flag for grep (in the man page), and decided to use it. This flag tells grep that the pattern is a fixed string, so no regular-expression processing needs to be done. Using the -F flag I further reduced the running time to 1.5s.
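
The only change with respect to Version 4 is the flag itself; the shell command becomes:

os.system('zcat host.gz | grep -F -e total_credit -e os_name > '+tmp)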

[Figure: Running time vs. script version (time_vs_version.png)]


Speeding up file processing with Unix commands

In my last post I commented on some changes I made to a Python script that processes a file, to reduce the memory overhead of reading the whole file directly into RAM.

I realized that the script needed much optimizing, so I read the link a reader (Paddy3118) was kind enough to point me to, and realized I could save time by compiling my search expressions. Basically, my script opens a gzipped file, searches for lines containing some keywords, and uses the info read from those lines. The original script took 44 seconds to process a 6.9 MB file (49 MB uncompressed). Using compile on the search expressions, this time went down to 29 s. I tried using match instead of search, and expressions like “if pattern in line:” instead of re.search(), but these didn’t make much of a difference.

Later I thought that Unix commands such as grep were specially suited for the task, so I gave them a try. I modified my script to run in two steps: in the first one I used zcat and awk (called from within the script) to create a much smaller temporary file with only the lines containing the information I wanted. In the second step, I processed this file with standard Python code. This hybrid approach reduced the processing time to just 12 s. Sometimes using the best tool for each job really makes a difference, and the Unix utilities are hard to match in terms of performance.

It is only after programming exercises like this one that one realizes how important writing good code is (something I will probably never do well, but I try). For some reason I always think of Windows, and how Microsoft refuses to make efficient programs, relying on improvements in the hardware instead. It’s as if I tried to speed up my first script by using a faster computer, instead of fixing the code to be more efficient.


Python: speed vs. memory tradeoff reading files

I was making a script to process some log file, and I basically wanted to go line by line, and act upon each line if some condition was met. For the task of reading files, I generally use readlines(), so my first try was:

f = open(filename,'r')
for line in f.readlines():
  if condition:
    do something
f.close()

However, I realized that as the size of the file read increased, the memory footprint of my script increased too, to the point of almost halting my computer when the size of the file was comparable to the available RAM (1GB).

Of course, Python hackers will frown at me, and say that I was doing something stupid… Probably so. I decided to try a different thing to reduce the memory usage, and did the following:

f = open(filename,'r')
for line in f:
  if condition:
    do something
f.close()

Both pieces of code look very similar, but pay a bit of attention and you’ll see the difference.

The problem with “f.readlines()” is that it reads the whole file at once and assigns its lines to the elements of an (anonymous, in this case) list. Then the for loops through that list, which sits in memory. This leads to faster execution, because the file is read once and then forgotten, but requires more memory, because a list the size of the whole file has to be created in RAM.

[Fig. 1: Memory vs. file size for both methods of reading the file (fileread_memory)]

When you do “for line in f:”, you are effectively reading the lines one by one, on each cycle of the loop. Hence, the memory use is effectively constant and very low, although the disk is accessed more often, and this usually leads to slower execution of the code.

[Fig. 2: Execution time vs. file size for both methods of reading the file (fileread_time.png)]


Labeled breaks in Python

I am a recent fan of Python, a very neat scripting language.

One thing I miss from Perl is the availability of labeled breaks. What are those? Suppose you have two nested loops, and when a condition is met in the inner loop, you want to exit both. In Python there is no straightforward way of doing it. Imagine we are reading an array of data, line by line and column by column, and we want to exit when meeting the first zero value. With Perl:

LINELOOP: foreach my $i (0..$lines)
{
  COLLOOP: foreach my $j (0..$columns)
  {
     last LINELOOP unless $val[$i][$j];
  }
}

A simple “last” (Perl’s equivalent of “break”) exits only the innermost loop, but with a label we can exit a specific outer loop. In Python, however, there is no such thing as a labeled loop, as explained in this PEP.
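
For completeness, a common Python workaround is to let an exception play the role of the label: raise a private exception in the inner loop and catch it outside both. A sketch with made-up data:

val = [[1, 2, 3],
       [4, 0, 6],
       [7, 8, 9]]

class LineLoopBreak(Exception):
  pass

try:
  for i, row in enumerate(val):
    for j, v in enumerate(row):
      if not v:
        raise LineLoopBreak
except LineLoopBreak:
  print("First zero at (%d, %d)" % (i, j))   # prints (1, 1)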

My rant is with the explanations given by van Rossum himself on the Python mailing list to reject the change:

1. The complexity added to the language, permanently.
2. My expectation that the feature will be abused more than it will be used right.

Wow! Incredible reasons!

The first one is silly: other languages have it, and it has worked fine. Adding complexity to a tool for the sake of it is really stupid, I agree. But the fact is labeled breaks would be tremendously useful, so the increase in complexity would be justified. Surely a language that can only print “Hello world” would be less complex, yet of little use.

The second reason is absolutely over-the-shoulder-of-the-users. So now good old Guido must guide his sheep along the “correct” path, lest we get lost! He is punishing the good programmers by not giving them a useful tool, so that bad programmers are protected from their stupidity. It’s like not selling cars at all because some people drive while drunk.

Just my 2 cents…


Why J2EE is complex

I read in O’Reillynet a comment on AurigaLogic’s Blogic.

Blogic comments on why J2EE is so complex and tedious to use. Their main thesis to support that complexity is… hold your breath… fasten your seatbelts…: if it were easier, more stupid people would be using it! Ta-da!!

Amazing, the “blogic” of these people.


Dynamic file read with Perl

GNU/Linux command-line users, programmers and hackers worldwide have probably come to know and love the wonderful tail shell command which, together with cat, head, grep, awk and sed, is easily among the most useful commands.

A killer feature of tail is the -f (--follow) argument, which outputs the last lines of a file and then keeps waiting for new lines that might be appended to it, showing them on the screen as they appear. This is invaluable for keeping track of, e.g., logfiles to which new entries are being added all the time, when one does not want to keep rerunning tail by hand.

Since I am a great fan of Perl, and use Perl scripts for anything short of cooking dinner (but wait…), I have found myself in situations where I had to tail the last lines of a file. This can be done in several ways:

system "tail $file";

or

my $str = `tail $file`;
print $str;

or with an open() statement, then reading the whole file (or a part of it), and printing it. The first example, with system, is the most “direct” one, but reading the file (or a part of it) into a variable is very handy for doing with it all the nifty things Perl does so well with text strings (substituting, deleting, including, reordering, comparing…).

However, when tail -f was needed (i.e., keeping track of the file and operating on the output as it appears), I kept using system calls, and all the formatting had to be done in the shell spawned by the system call, not by Perl. This was sad.

So I was very happy when I discovered a simple trick to make open() read dynamically. There are better ways of doing it, more efficient and correct, but this one works, and is quite simple. If efficiency is vital for you, this is probably not the place to learn about it. Actually, if you look for efficiency, you shouldn’t be using Perl at all :^)

An example of Perl code that dynamically reads a file “$in”:

open(INFILE,"tail -0f $in |") || die "Failed!\n";
while(my $line = <INFILE>)
{ 
  do whatever to $line;
};
close(INFILE);

Update: Explanation to the code above:

The open() call pipes the output of the tail command (notice the -f flag; do a man tail to learn more) into the filehandle INFILE. The “||” is a logical OR, and means “do the thing on my right side if the thing on my left didn’t end successfully (but ONLY in that case!)”.

Next, we perform a while loop over the lines in the pipe. The “<INFILE>” construct reads the next line from INFILE each time it is evaluated. As you can see, each line is assigned to the variable $line, and the loop continues while $line has some non-false value, i.e. while there are lines left in INFILE.

The line inside the curly braces is pseudocode, obviously; you put your own code there. And, for tidiness, once we exit the loop and INFILE is exhausted of lines, we close it.


PHP

I am trying to learn PHP by myself, and so far I have realized that it is Perl in disguise! Variables, control structures… everything is the same, even comments with “#” and line termination with “;”.

Just wanted to share my joy: since I have quite a bit of experience with Perl, my PHP learning will take off much faster!

