Summary of my Python optimization adventures
This is a follow up to two previous posts. In the first one I spoke about saving memory by reading line-by-line, instead of all-at-once, and in the second one I recommended using Unix commands.
The script reads a host.gz log file from a given BOINC project (more precisely one I got from MalariaControl.net, because it is a small project, so its logs are also smaller), and extracts how many computers are running the project, and how much credit they are getting. The statistics are separated by operating system (Windows, Linux, MacOS and other).
Version 0
Here I read the whole file to RAM, then process it with Python alone. Running time: 34.1s.
#!/usr/bin/python
import os
import re
import gzip
credit  = 0
os_list = ['win','lin','dar','oth']
stat = {}
for osy in os_list:
  stat[osy] = [0,0]
# Process file:
f = gzip.open('host.gz','r')
for line in f.readlines():
  if re.search('total_credit',line):
    credit = float(re.sub('/?total_credit>',' ',line.split()[0])
  elif re.search('os_name',line):
    if re.search('Windows',line):
      stat['win'][0] += 1
      stat['win'][1] += credit
    elif re.search('Linux',line):
        stat['lin'][0] += 1
        stat['lin'][1] += credit
    elif re.search('Darwin',line):
      stat['dar'][0] += 1
      stat['dar'][1] += credit
    else:
      stat['oth'][0] += 1
      stat['oth'][1] += credit
f.close()
# Return output:
nstring = ''
cstring = ''
for osy in os_list:
  nstring +=   "%15.0f " % (stat[osy][0])
  try:
    cstring += "%15.0f " % (stat[osy][1])
  except:
    print osy,stat[osy]
print nstring
print cstring
Version 1
The only difference is a “for line in f:“, instead of “for line in f.readlines():“. This saves a LOT of memory, but is slower. Running time: 44.3s.
Version 2
In this version, I use precompiled regular expresions, and the time-saving is noticeable. Running time: 26.2s
#!/usr/bin/python
import os
import re
import gzip
credit  = 0
os_list = ['win','lin','dar','oth']
stat = {}
for osy in os_list:
  stat[osy] = [0,0]
pattern    = r'total_credit'
match_cre  = re.compile(pattern).match
pattern    = r'os_name';
match_os   = re.compile(pattern).match
pattern    = r'Windows';
search_win = re.compile(pattern).search
pattern    = r'Linux';
search_lin = re.compile(pattern).search
pattern    = r'Darwin';
search_dar = re.compile(pattern).search
# Process file:
f = gzip.open('host.gz','r')
for line in f:
  if match_cre(line,5):
    credit = float(re.sub('/?total_credit>',' ',line.split()[0])
  elif match_os(line,5):
    if search_win(line):
      stat['win'][0] += 1
      stat['win'][1] += credit
    elif search_lin(line):
      stat['lin'][0] += 1
      stat['lin'][1] += credit
    elif search_dar(line):
      stat['dar'][0] += 1
      stat['dar'][1] += credit
    else:
      stat['oth'][0] += 1
      stat['oth'][1] += credit
f.close()
# etc.
Version 3
Later I decided to use AWK to perform the heaviest part: parsing the big file, to produce a second, smaller, file that Python will read. Running time: 14.8s.
#!/usr/bin/python
import os
import re
credit  = 0
os_list = ['win','lin','dar','oth']
stat = {}
for osy in os_list:
  stat[osy] = [0,0]
pattern    = r'Windows';
search_win = re.compile(pattern).search
pattern    = r'Linux';
search_lin = re.compile(pattern).search
pattern    = r'Darwin';
search_dar = re.compile(pattern).search
# Distile file with AWK:
tmp = 'bhs.tmp'
os.system('zcat host.gz | awk \'/total_credit/{printf $0}/os_name/{print}\' > '+tmp)
stat = {}
for osy in os_list:
  stat[osy] = [0,0]
# Process tmp file:
f = open(tmp)
for line in f:
  line = re.sub('>','<',line)
  aline = line.split('<')
  credit = float(aline[2])
  os_str = aline[6]
  if search_win(os_str):
    stat['win'][0] += 1
    stat['win'][1] += credit
  elif search_lin(os_str):
    stat['lin'][0] += 1
    stat['lin'][1] += credit
  elif search_dar(os_str):
    stat['dar'][0] += 1
    stat['dar'][1] += credit
  else:
    stat['oth'][0] += 1
    stat['oth'][1] += credit
f.close()
# etc
Version 4
Instead of using AWK, I decided to use grep, with the idea that nothing can beat this tool, when it comes to pattern matching. I was not disappointed. Running time: 5.4s.
#!/usr/bin/python
import os
import re
credit  = 0
os_list = ['win','lin','dar','oth']
stat = {}
for osy in os_list:
  stat[osy] = [0,0]
pattern    = r'total_credit'
search_cre = re.compile(pattern).search
pattern    = r'Windows';
search_win = re.compile(pattern).search
pattern    = r'Linux';
search_lin = re.compile(pattern).search
pattern    = r'Darwin';
search_dar = re.compile(pattern).search
# Distile file with grep:
tmp = 'bhs.tmp'
os.system('zcat host.gz | grep -e total_credit -e os_name > '+tmp)
# Process tmp file:
f = open(tmp)
for line in f:
  if search_cre(line):
    line = re.sub('>','<',line)
    aline = line.split('<')
    credit = float(aline[2])
  else:
    if search_win(line):
      stat['win'][0] += 1
      stat['win'][1] += credit
    elif search_lin(line):
      stat['lin'][0] += 1
      stat['lin'][1] += credit
    elif search_dar(line):
      stat['dar'][0] += 1
      stat['dar'][1] += credit
    else:
      stat['oth'][0] += 1
      stat['oth'][1] += credit
f.close()
# etc
Version 5
I was not completely happy yet. I discovered the -F flag for grep (in the man page), and decided to use it. This flag tells grep that the pattern we are using is a literal, so no expansion of it has to be made. Using the -F flag I further reduced the running time to: 1.5s.


Paolo Bonzini said,
February 18, 2008 @ 10:10 am
What you use Python substring-search functions instead of regex matching?
Andrew Dalke said,
February 18, 2008 @ 10:15 am
When looking for fixed length strings in Python, use “substring in string”. For example, “if ‘total_credit’ in line:’. This is something like 20 times faster than using a regular expression and easier to understand. You could also replace the re.sub with a line=line.replace(“>”, “<“) .
There are a few more subtle ways to make Python code faster. Each stat[os_name] requires a dictionary lookup. You can reduce that by doing “win_stat = stat[‘win’]”, and so on for each OS name, then use ‘winstat[0] += 1’. This works best if the code is moved into a function because there are optimizations for looking up a local variable in a function vs. looking things up module variables like you are using.
You can also rework the code to use readlines(hint). This reads ‘hint’ bytes then enough bytes to get to the next newline. But I thought Python 2.3 changed the way iteration worked to include an internal read-ahead buffer in order to achieve the same performance gain.
isilanes said,
February 18, 2008 @ 10:42 am
Thanks Paolo and Andrew for your recommendations. The fact is that I am by no means a hacker, just a lame hobbyist :^)
I post my experience so that others can benefit, but my ignorance is huuuuge. Thanks to interaction with others, such as your comments, I keep learning every day… thanks!
@Paolo: If you mean using match instead of search, I did it, and saw little or no gain (although I expected to see it).
@Andrew: I have read that “if pattern in line:” is faster than “if re.search(pattern, line):”, but I have tested it, and if “pattern” is pre-compiled, I see no real advantage (actually it was slightly slower).
The dictionary lookup avoidance thing is a good advice, I will try it. Anyway, I am reluctant to make much more optimization here, because most of the computation time is spent in the system call to zcat and grep, so it doesn’t make much sense make a big effort to reduce the time spent in the “Python part”. I will eventually do it, anyway, just for the fun :^)
James said,
February 18, 2008 @ 11:41 am
Is this really Python optimization? Haven’t you just offloaded most of the program to grep, effectively “rewriting” it in C?
isilanes said,
February 18, 2008 @ 12:01 pm
James, you’re absolutely right. What I meant was “optimizing a script”, by using the best tools I could get access to (plus my limited knowledge).
Anyway, my first attempt was to make it all in Python, and as effectively as possible, and I give some hints of what to do and what not to do… so this qualifies as “Python optimization”? :^)
Andrew Dalke said,
February 18, 2008 @ 13:21 pm
When I benchmark with
python -m timeit -s ‘import re’ -s ‘search = re.compile(“Linux”).match’ ‘search(“Uses Linux”)’
I get “0.413 usec per loop”. When I benchmark with
python -m timeit ‘”Linux” in “Uses Linux”‘
I get “0.141 usec per loop”. Not quite 3 times faster than using search. (My 20-fold case was when I tested re.search(), which has a couple extra function calls overhead and a cache check.)
Are you sure you timed what you think you timed? Every time I’ve done the comparison the “in” test is faster, and I know the underlying implementation well enough that I can’t think of how it can be slower than the re code.
Also, instead of writing to a temporary file and reading from that, use either the subprocess module or the older and harder to use os.popen call. (Harder to use because it’s harder to deal with errors.) That should also give you some performance increase because you aren’t doing a full read/write through the disk.
isilanes said,
February 18, 2008 @ 13:34 pm
Andrew, I am not doubting that the “in” construct is faster (I repeated your test, and here it’s 3 times faster, as well). The problem is that maybe that part of the script is not the bottleneck, and the uncertainty in measured time (I always measured walltime with /usr/bin/time -f %e command is of the order of the difference in using one or the other, so I can’t diferentiate. I’ll keep testing… (the faster the script, the more noticeable the subtle differences).
Justin said,
February 18, 2008 @ 15:07 pm
If you are going to do this:
if search_cre(line):
line = re.sub('>','<',line)
aline = line.split('<')
credit = float(aline[2])
you should just change the regular expression to to be
"total_credit >(?P[^<]+)<"
or such, and then just pull out the credit if it matched. The way you’re doing it, you are processing the same line 3 times. From the looks of it, you could change the program to use 1 regular expression for everything, instead of 4.
f = os.popen('zcat host.gz')
Will be a lot faster than the gzip module though.
Andrew Dalke said,
February 18, 2008 @ 17:23 pm
While I find it annoying to use, try the “profile” module to get an idea of where time is spent in your program. Because your code isn’t written as a module, the easiest way to do the profiling is using the command “execfile ‘filename'”.
This should tell you which line consumes the most time.
Andrew Dalke said,
February 18, 2008 @ 17:39 pm
Oops, missed the parens: execfile(“your_filename.py”)
Lorenzo E. Danielsson said,
February 18, 2008 @ 20:43 pm
@James: what is wrong about calling an external tool to perform a job that it was designed to do? It doesn’t make the program itself any less “python”. That’s the UNIX way, let each tool do what it’s best at.
Top Posts « WordPress.com said,
February 18, 2008 @ 23:59 pm
[…] Summary of my Python optimization adventures This is a follow up to two previous posts. In the first one I spoke about saving memory by reading line-by-line, […] […]
Summary of my Python optimization adventures « handyfloss said,
September 18, 2008 @ 14:42 pm
[…] Entry available at: http://handyfloss.net/2008.02/summary-of-my-python-optimization-adventures/ […]