Disabling autoscale in a Xmgrace agr file

I am a heavy user of the Xmgrace plotting program, and I love it. A very common operation is to scale the X and Y axes to our liking, to show different parts of the data in the resulting plot. You can do that from the command line by setting the "world" of the graph, providing four numbers as X,Y boundaries:

% xmgrace -world xmin ymin xmax ymax file.dat

Apart from setting the maximum and minimum values for X and Y, we can make use of the autoscale option to selectively show some ranges. The four options to autoscale are:

  • none - show the X,Y ranges defined by the "world" variable (if not set, the default is "0 0 1 1").
  • xy - forget about "world" data, make plot range in X and Y enough to plot all data in input.
  • x - autoscale X to show all data, but respect Y given by "world". This means that if a point is not shown because it lies outside the Y range, then it doesn't count to force X autoscale. This is a wee bit trickier than it sounds.
  • y - see previous point, with X and Y swapped.
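
The four modes can be summarized in a few lines of Python. This is only a sketch of the behaviour described above, not Grace's actual code: the function name autoscale_world and the tuple layout are made up for illustration, and the real program may additionally round the limits to "nice" tick values.

```python
def autoscale_world(points, world=(0.0, 0.0, 1.0, 1.0), mode="xy"):
    """Return (xmin, ymin, xmax, ymax) for a list of (x, y) points.

    Hypothetical helper mimicking the modes described above:
    'none', 'x', 'y' or 'xy'.
    """
    xmin, ymin, xmax, ymax = world
    if mode == "none" or not points:
        return world
    if mode == "xy":
        xs = [x for x, _ in points]
        ys = [y for _, y in points]
        return (min(xs), min(ys), max(xs), max(ys))
    if mode == "x":
        # Only points visible within the fixed Y range count for X.
        xs = [x for x, y in points if ymin <= y <= ymax]
        return (min(xs), ymin, max(xs), ymax) if xs else world
    if mode == "y":
        # Same idea with X and Y swapped.
        ys = [y for x, y in points if xmin <= x <= xmax]
        return (xmin, min(ys), xmax, max(ys)) if ys else world
    raise ValueError("unknown autoscale mode: %r" % mode)


data = [(0, 0), (1, 2), (5, 50), (9, 3)]   # made-up data points
print(autoscale_world(data, world=(0, 0, 1, 5), mode="x"))
```

With the sample data this prints (0, 0, 9, 5): the point (5, 50) lies outside the Y range, so it is ignored when choosing the X limits. That is the tricky part of the "x" mode.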

But Xmgrace is not only about command line, or even GUI. You can write a .agr file (for example by saving a plot from the Xmgrace GUI), and manipulate it so that the following command:

% xmgrace file.agr

will bring up a plot with all the data and formatting we have put into the .agr file. It's really handy to save a file as-is.

Now, the syntax for inputting the world in the .agr is well known:

@ world xmin, ymin, xmax, ymax

where xmin etc. are floating point numbers.

The problem is how to hardcode the autoscale feature into the .agr. I had always been forced to do:

% xmgrace -autoscale none file.agr

from the command line, because I couldn't find out how to include it in the .agr. Finally I did find it, and that's the main reason for this post. The syntax is explained in the manual at the Xmgrace site, but I found it after googling for agr files containing "autoscale" in them. The line to include seems to be:

@ autoscale onread none

A .agr containing the above line will produce, when called as follows:

% xmgrace file.agr

the same output as a file not containing it, when called as follows:

% xmgrace -autoscale none file.agr
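
If you have many .agr files to patch, the line can be added with a small script. The following Python 3 sketch is not part of Xmgrace: the function name add_autoscale_none is made up, the sample header is invented, and I am assuming that it is enough for the line to appear near the top of the file, after any leading comments and the "@version" line.

```python
DIRECTIVE = "@ autoscale onread none"


def add_autoscale_none(agr_text):
    """Return agr_text with the autoscale directive added (if missing)."""
    lines = agr_text.splitlines()
    # Do nothing if some "autoscale onread" line is already there.
    if any(l.replace(" ", "").lower().startswith("@autoscaleonread")
           for l in lines):
        return agr_text
    # Skip leading comment lines and the @version line, if any.
    i = 0
    while i < len(lines) and (lines[i].startswith("#") or
                              lines[i].replace(" ", "").startswith("@version")):
        i += 1
    lines.insert(i, DIRECTIVE)
    return "\n".join(lines) + "\n"


# Made-up, minimal header just to show where the line ends up.
sample = "# Grace project file\n@version 50122\n@ world 0, 0, 10, 5\n"
print(add_autoscale_none(sample), end="")
```

Running the function twice on the same text leaves it unchanged, so it should be safe to apply to files that already contain the line.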


Making a PDF grayscale with ghostscript

A request from a friend made me face the problem of converting a color PDF into a grayscale one. Searching the web provided some ways of doing so with Adobe Acrobat, via some obscure menu item somewhere.

However, the very same operation could be undertaken with free tools, such as ghostscript. I found a way to do it in the YANUB blog, and I will copy-paste it here, with a small modification.

Assuming we have a file called color.pdf, and we want to convert it into grayscale.pdf, we could run the following command (all in a single line, and omitting the "\" line continuation marks):

% gs -sOutputFile=grayscale.pdf -sDEVICE=pdfwrite \
-sColorConversionStrategy=Gray -dProcessColorModel=/DeviceGray \
-dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH color.pdf

I prefer the above to YANUB's version below, because a shell operation (the "< /dev/null" redirection) is replaced by option(s) of the command we are running:

% gs -sOutputFile=grayscale.pdf -sDEVICE=pdfwrite \
-sColorConversionStrategy=Gray -dProcessColorModel=/DeviceGray \
-dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH color.pdf < /dev/null

A sample Perl script to alleviate the tedious writing above:

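#!/usr/bin/perl
use strict;
use warnings;
# Usage: togray.pl input.pdf   (writes input_gray.pdf)
my $infile = $ARGV[0] or die "Usage: $0 input.pdf\n";
# Output name: strip the .pdf extension and append _gray.pdf
(my $outfile = $infile) =~ s/\.pdf$//;
$outfile .= "_gray.pdf";
# List form of system: no shell involved, so odd file names are safe
system("gs", "-sOutputFile=$outfile", "-sDEVICE=pdfwrite",
       "-sColorConversionStrategy=Gray", "-dProcessColorModel=/DeviceGray",
       "-dCompatibilityLevel=1.4", "-dNOPAUSE", "-dBATCH", $infile) == 0
    or die "gs failed with status $?\n";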

Assuming we call the Perl script "togray.pl", and that we have a color file "input.pdf", we could just issue the command:

% togray.pl input.pdf

and we would get a grayscale version of it, named "input_gray.pdf".


Installation of simyo Huawei E220 under MacOSX

I recently subscribed to simyo's mobile internet service. I was considering also Orange, as explained in a previous post (es), but simyo's offer is better.

I am writing about how to make the modem simyo provides (the commonplace Huawei E220) work on MacOSX first, because apparently the PIN has to be deactivated for the modem to work in Linux. I have to admit that in MacOSX installation was a breeze.

Software installation

Start MacOS, then plug in the USB modem. A window will open automatically, with two objects inside: "MobileConnect" and "User Manual". The former is the installer binary, and the latter is a folder with the manuals in PDF format (for me, they were in English and Spanish).

Clicking on the "MobileConnect" icon starts the installer. It asks you to accept an EULA, enter the admin password, and choose a location for the files (actually just a hard disk, not a specific directory), and then it does its thing.

Profile setting

After that, we only need to configure a connection in the "Mobile Connect" window that opens automatically after installation. For that, click on "Setting..." and create a new profile. If you read the manual (see above), it is easy to fill in the blanks. In short:

  • Profile name: whatever you want
  • Access Point Name: this is the APN value that simyo gives you in its paperwork (gprs-service.com)
  • Telephone number: *99#
  • Account name: irrelevant
  • Password: irrelevant

Save the above, then choose the profile you just created in the drop-down list in "Profile name", then hit the "Connect" button. If after saying "Dialing up, please wait", it tells you "Connection succesfull!", then everything is fine!

PIN deactivation

Apparently using the modem under Linux requires that the PIN is deactivated. Doing that under MacOSX is easy: when the "Mobile Connect" window is active, go to the "Manage PIN" drop-down menu in the top bar. There you can find "Activate", "Deactivate" and "Modify". Self-explanatory, ain't it?


DreamHost MediaWiki update problem

I recently updated the MediaWiki installation in one of my DreamHost domains from 1.12 to 1.13, and I started to see the following error messages when trying to edit/save pages (the capital letter triplets used for privacy):

Database error

A database query syntax error has occurred. This may indicate a bug in the software. The last attempted database query was:

(SQL query hidden)

from within function "Article::getHiddenCategories". MySQL returned error "1146: Table 'XXX.YYY_page_props' doesn't exist (mysql.ZZZ.AAA)".

After a Google search that yielded only two results, I checked a mediawiki.org page talking about the subject. The (maybe obvious) reason for my error was that I hadn't run the MediaWiki update script, as one should after any upgrade.

The procedure is outlined in this other mediawiki.org page. However, there is a little catch: the PHP version available in the shell of the server that hosts my wiki is (at least in my case) 4.4.8, but the MediaWiki update script needs PHP5. No problem: I checked the DreamHost wiki, and found out that for PHP5 I could use the following executable: /usr/local/dh/cgi-system/php5.cgi.

Running that executable on the corresponding update.php script (after setting up AdminSettings.php as told to), everything was OK again.


My music collection surpasses 9000 songs

Following the "report" series started with my first summary of info about the music collection I listen to, I will update that info in this post.

The data (in parentheses the difference with respect to last report, 5 months ago).

Files

Total files        9512 (+1439)
  - Commercial     6161 (+1174)
  - Jamendo        3226 (+225)
  - Other CC       71 (+40)
  - Other          54 (+0)
Total playtime     25d (+4d)
Disk usage         45GB (+7GB)
Artist count       1270 (+236)
Album count        847 (+109)
MP3 count          0 (+0)
OGG count          9512 (+1439)

Last.fm

Playcount           41534 (+5255)

Most played artists Joaquín Sabina - 2711 (+195)
                    The Beatles - 1346 (+118)
                    David TMX - 853 (+82)
                    Silvio Rodríguez - 782 (+37)
                    Extremoduro - 694 (+251)
                    Fito & Fitipaldis - 675 (+53)
                    Siniestro Total - 650 (+39)
                    Bad Religion - 632 (+59)
                    La Polla Records - 565 (+28)
                    Ismael Serrano - 478
                    Ska-P - 440 (+20)

Most played songs   Cuando aparezca el petróleo (E. Sánchez) - 66 (+10)
                    La del pirata cojo (J. Sabina) - 55 (+3)
                    Tirado en la calle (E. Sánchez) - 53 (+7)
                    Conductores suicidas (J. Sabina) - 51 (+3)
                    Y sin embargo (J. Sabina) - 49 (+4)
                    Pacto entre caballeros (J. Sabina) - 47 (+2)

Amarok

Playcount         30392 (+4796)

Favorite artists  NanowaR - 96.16% (+2.14)
                  ABBA - 95.85%
                  Erick Sánchez - 95.19%
                  Rafael Caballero - 94.73% (+0.43)
                  Peiremans - 94.68% (+1.20)
                  Leihotikan - 94.53% (+0.14)
                  Su ta Gar - 94.44% (+0.34)
                  Simon and Garfunkel - 94.26% (+0.42)
                  La Caja Negra - 94.22% (+0.65)
                  Antarhes - 94.18%
                  Ska-P - 94.12% (-0.96)
                  Eskorbuto - 94.06%
                  Fito & Fitipaldis - 93.87%
                  Juan Luis Guerra - 93.75% (+0.10)

Favorite songs    Salir - Extremoduro
                  You shook me all night long - AC/DC
                  Km 0 - Ismael Serrano
                  Golfa - Extremoduro
                  Todos los segundos cuentan - La Caja Negra
                  Vértigo - Ismael Serrano
                  1st movement of Winter - Antonio Vivaldi
                  Total eclipse of the heart - Bonnie Tyler
                  New America - Bad Religion
                  Caperucita - Ismael Serrano
                  Fiesta pagana (Mägo de Oz) - Mägo de Oz
                  Cuando aparezca el petróleo - Erick Sánchez
                  La extraña pareja - Ismael Serrano
                  Highway to hell - AC/DC
                  Uno - dos - tres - cuatro - Javier Álvarez
                  El roce de tu cuerpo - Platero y Tú
                  Torn - Natalie Imbruglia
                  Un muerto encierras - Ismael Serrano
                  Chop suey - System of a Down
                  Tirado en la calle - Erick Sánchez

Making iSight camera work in Ubuntu

As I said in a previous post, I bought a MacBook, and I am making all bits work correctly. Out-of-the-box support from Ubuntu (the only GNU/Linux I tried on the MacBook so far) is excellent, but some things (camera, WiFi...) need proprietary drivers, so some more tweaks are needed.

I have followed the instructions in the Ubuntu community site, as with the procedures detailed in the previous post.

Basically, it all boils down to:

Fetch the Apple drivers for the camera

As root (if, unlike me, you like sudo, then run the following as user, but prepended with sudo), mount the Mac OSX partition (you didn't delete it, right?) and copy the relevant file somewhere else (the cp command should be all in one line):

# cd
# mkdir /mnt/macosx
# mount /dev/sda2 /mnt/macosx
# cp /mnt/macosx/System/Library/Extensions/
     IOUSBFamily.kext/Contents/PlugIns/AppleUSBVideoSupport.kext/
     Contents/MacOS/AppleUSBVideoSupport .
# umount /mnt/macosx

You might have noticed that the Mac OSX partition is not sda1, but sda2. Don't ask me. It turns out like this after following my own installation instructions. Apple must have decided to install the OS in the second partition for some reason.

Install the required packages

We need a package called isight-firmware-tools. Unfortunately it is not present in the Hardy repos at the moment (it was in the Gutsy ones, I think). You can add a Launchpad repo, editing /etc/apt/sources.list to add:

deb http://ppa.launchpad.net/mactel-support/ubuntu hardy main
deb-src http://ppa.launchpad.net/mactel-support/ubuntu hardy main

Then, as root:

# aptitude update
# aptitude install isight-firmware-tools

You will be prompted for a path to the driver you copied before. You can press Enter without paying much attention, then execute (assuming you copied the driver to your root home):

# cd
# ift-extract -a ./AppleUSBVideoSupport

To activate the driver, restart HAL:

# /etc/init.d/hal restart

Test it with Ekiga

As explained in the Ubuntu community site, you can run Ekiga as user (after installing the ekiga package). Choose V4L2 as video plugin, and Built-in iSight should appear among the Input device list. If it does, the process worked.


Installing Ubuntu Hardy Heron on a MacBook

Yes, dear reader, I committed the heresy of purchasing an Apple MacBook. I obviously didn't do it for MacOS X, for which I couldn't care less, but for the hardware, which is quite good. I was looking for a laptop as small as possible, with a low price (it cost 799 eur) and a screen that is not too small (this one is 13"; maybe even 12" would be acceptable, but 13" sure is).

You can see some pictures of it at my MacBook gallery.

If you, like me, are used to PCs, then there are a few things to note:

  • It has a different layout in the keyboard. Most prominently, some keys are missing: Del, PgUp, PgDn, Home, End. Some others (Win key, AltGr) have substitutes that can be mapped. Also the equivalent to AltGr and right Ctrl are kind of swapped: the key closest to the SpaceBar is right "cmd" (could be right Ctrl), and the farthest one is left "alt" (could be AltGr)
  • The touchpad has a single button, and tapping on it won't click. There is no zone on it to use as vertical scroll, either. Luckily the latter can be fixed via software, so that in Ubuntu the touchpad does behave correctly: you can tap-click, and you can scroll with a smooth movement of a finger. The single-button issue is not present in USB mice: they work "normally".

I would like to outline here the process of installing Ubuntu (Hardy Heron) in this machine. For that, I recommend reading (as I did), the following links:

Repartition of the hard disk

My Mac came with 120 GB (109 real) of HD, all of it devoted to OS X. Unfortunately, the Ubuntu installer can not cope with resizing of HFS+ partitions. Fortunately, OS X itself can. You can make use of Boot Camp as follows: go to Go->Utilities->Boot Camp Assistant. There you can (should) reduce the existing HFS+ partition to the bare minimum (in my machine it was 22GB, because OSX already uses 17GB, and it won't accept less than 5GB of free disk). Leave the rest unassigned, and quit.

Installation of multi-boot system

The first hurdle in our Linux installation is that the Mac machines do not have a "normal" BIOS. The BIOS is important for Linux/Windows installations, so this is a drawback. Macs come with a thingie called Extensible Firmware Interface (EFI), instead. However, there is a nice little tool called rEFIt that can help us with it.

To install rEFIt, you can follow the instructions at its Sourceforge site. I followed the Automatic Installation with the Installer Package instructions. Basically I downloaded the Mac disk image from the download page, opened in the Mac OSX file browser, double-clicked it to open it, then double-clicked on the rEFIt.mpkg file inside, and followed the instructions.

This will make the rEFIt menu appear in the next reboot, but only if you hold some key while booting (I think it's "C"). If you want the menu to always appear, do the following in a terminal, inside Mac OSX:

% cd /efi/refit
% ./enable-always.sh

Installation of Linux OS

After doing the above, you should reboot with an Ubuntu installation CD inserted. If the EFI installation was correct, you will be presented with the rEFIt menu, in which you will have two big icons (OSX and the Linux CD), and five small ones below ("Start EFI Shell", "Start Partitioning Tool", "About rEFIt", "Shut down computer" and "Restart computer").

Use the left-right arrow keys to select the Ubuntu CD, and press Enter. At that moment, or after installing Ubuntu (I don't recall), the computer could complain saying: "No bootable device -- insert boot disk and press any key". If so, reboot and, in the aforementioned rEFIt menu, choose the second small icon, "Start Partitioning Tool". This tool will prompt you to update the MBR. Accept, and let it do its magic.

When booting with the CD, you will have the option to make an absolutely normal Ubuntu installation. The Ubuntu MacBook page says that Boot Camp will complain if you make more than two partitions in total. It will, but for me this is ridiculous, since OSX is already eating up one. There's no way I will install any Linux in a single partition (without even swap!). If you do not care about opening Boot Camp ever again (I don't), do a totally normal install. I created two 8.5GB partitions for / (one for Ubuntu, another one unused for the future), a 750MB swap partition, and the rest (73GB) as /home (potentially shared between the two Linux installations I could have).

After the installation, reboot and you will find the aforementioned rEFIt menu. Choosing the penguin icon on the right side will take you to the GRUB screen you probably are accustomed to. What this means is that you have to go through two boot menus when booting, but that's a minor issue, I think. The first menu is an EFI menu, in which you choose OSX or GRUB. The second one is the GRUB menu that lets you choose among different installed kernels.

And I think that's it...

I will keep on writing when I have time, at least about how to make WiFi work, and also how to configure Compiz Fusion. Yes, the X3100 graphics chip that the MacBooks carry is blacklisted, as not working with CF. But, believe me, it does work!


This blog is my OpenID provider

I really like the idea behind OpenID, and I already have an account at Weblogs SL. Of course, my WordPress.com blog was also a valid OpenID provider. Moreover, my isilanes.org site (and before that my EHU page) was turned into an OpenID provider by adding the following lines (extra blank added before "link", to make text visible):

< link rel="openid.server" href="http://openid.blogs.es/index.php/serve" />
< link rel="openid.delegate" href="http://openid.blogs.es/isilanes" />

But I was not completely happy with that. When signing a comment in a blog (for example) with my WP blog URL, my nickname would appear as "handyfloss" (the name of the blog), not "isilanes" (my nick). If I used the Weblog URL (or that of www.ehu.es/isilanes), my nick would be "isilanes", but clicking on my nick would take the reader to that URL, instead of to my blog.

With this WordPress.org blog these issues are gone. I have installed the Yadis plugin, and now I can sign with the "isilanes" nick, and give a link to this blog.

The configuration of the plugin is really simple: go to Options->Yadis->Add New Service, and select "Other...". You will be asked for two data: "OpenID Server" and "OpenID Delegate" (both provided by your OpenID account, with Weblog or whoever). Fill in the requests, click "submit", and you're done!


Some more tweaks to my Python script

Update: you can find the outcome of all this in a later post: Project BHS

All the comments to my previous post have given me hints to further increase the efficiency of a script I am working on. Here I present the advice I have followed, and the speed gain each piece provided. I will speak of "speedup", instead of timing, because this second set of tests has been made on a different computer. The "base" speed will be the last value of my previous test set (1.5 s on that computer, 1.66 s on this one). A speedup of "2" will thus mean half the execution time (0.83 s on this computer).

Version 6: Andrew Dalke suggested the substitution of:

line = re.sub('>','<',line)

with:

line = line.replace('>','<')

Avoiding the re module seems to speed up things, if we are searching for fixed strings, so the additional features of the re module are not needed.

This is true, and I got a speedup of 1.37.
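
If you want to check this on your own machine, a tiny timeit comparison is enough. This is just a Python 3 sketch: the sample line is made up (it only mimics the layout my script expects), and the timings will of course differ from mine.

```python
import re
import timeit

line = "    <total_credit>1234.56</total_credit>\n"  # made-up sample line


def with_re():
    return re.sub('>', '<', line)


def with_replace():
    return line.replace('>', '<')


# Both approaches must give the same string.
assert with_re() == with_replace()

n = 200000
t_re = timeit.timeit(with_re, number=n)
t_replace = timeit.timeit(with_replace, number=n)
print("re.sub:      %.3f s" % t_re)
print("str.replace: %.3f s" % t_replace)
```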

Version 7: Andrew Dalke also suggested substituting:

search_cre = re.compile(r'total_credit').search
if search_cre(line):

with:

if 'total_credit' in line:

This is more readable, more concise, and apparently faster. Doing it increases the speedup to 1.50.

Version 8: Andrew Dalke also proposed flattening some variables, and specifically avoiding dictionary search inside loops. I went further than his advice, even, and substituted:

stat['win'] = [0,0]

loop
  stat['win'][0] = something
  stat['win'][1] = somethingelse

with:

win_stat_0 = 0
win_stat_1 = 0

loop
  win_stat_0 = something
  win_stat_1 = somethingelse

This pushed the speedup further up, to 1.54.
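
A runnable version of the same idea could look as follows. It is only a Python 3 sketch with made-up credit values and hypothetical function names; the point is that the dictionary is touched once, after the loop, instead of twice per iteration.

```python
def count_with_dict(values):
    stat = {'win': [0, 0.0]}
    for v in values:
        stat['win'][0] += 1      # dict lookup plus list lookup per update
        stat['win'][1] += v
    return stat


def count_with_locals(values):
    win_stat_0 = 0
    win_stat_1 = 0.0
    for v in values:
        win_stat_0 += 1          # plain local variables inside the loop
        win_stat_1 += v
    # Store the result in the dictionary once, after the loop.
    return {'win': [win_stat_0, win_stat_1]}


credits = [10.0, 2.5, 7.5]       # made-up credit values
assert count_with_dict(credits) == count_with_locals(credits)
print(count_with_locals(credits))
```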

Version 9: Justin proposed reducing the number of times some patterns were matched, and extract some info more directly. I attained that by substituting:

loop:
  if 'total_credit' in line:
    line   = line.replace('>','<')
    aline  = line.split('<')
    credit = float(aline[2])

with:

pattern    = r'total_credit>([^<]+)<';
search_cre = re.compile(pattern).search

loop:
  if 'total_credit' in line:
    cre    = search_cre(line)
    credit = float(cre.group(1))

This trick saved enough to increase the speedup to 1.62.
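
To convince yourself that both ways extract the same number, you can try them on a single line. The Python 3 sketch below assumes a line laid out like the ones in my log (tag, value, closing tag); the sample value and the function names are made up.

```python
import re

search_cre = re.compile(r'total_credit>([^<]+)<').search


def credit_by_split(line):
    # Versions up to 8: turn every '>' into '<', then split on '<'.
    return float(line.replace('>', '<').split('<')[2])


def credit_by_group(line):
    # Version 9: capture the number directly with one regex group.
    return float(search_cre(line).group(1))


line = "    <total_credit>1234.56</total_credit>\n"  # assumed line layout
print(credit_by_split(line), credit_by_group(line))
```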

Version 10: The next tweak was an idea of mine. I was digesting a huge log file with zcat and grep, to produce a smaller intermediate file, which Python would process. The structure of this intermediate file is of alternating lines with "total_credit" then "os_name" then "total_credit", and so on. When processing this file with Python, I was searching the line for "total_credit" to differentiate between these two lines, like this:

for line in f:
  if 'total_credit' in line:
    do something
  else:
    do somethingelse

But the alternating structure of my input would allow me to do:

odd = True
for line in f:
  if odd:
    do something
    odd = False
  else:
    do somethingelse
    odd = True

Presumably, checking falsity of a boolean is faster than matching a pattern, although in this case the gain was not huge: the speedup went up to 1.63.
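
Here is a self-contained Python 3 sketch of the odd/even trick. The function name, the way the values are cut out of the lines, and the sample lines are all made up for illustration; only the toggling logic is the one described above.

```python
def sum_credit_by_os(lines):
    """Pair each 'total_credit' line with the 'os_name' line after it."""
    totals = {}
    credit = 0.0
    odd = True
    for line in lines:
        value = line.split('>')[1].split('<')[0]
        if odd:
            # Odd lines: credit value (no substring test needed).
            credit = float(value)
        else:
            # Even lines: operating system name.
            totals[value] = totals.get(value, 0.0) + credit
        odd = not odd
    return totals


sample = [
    "<total_credit>10.0</total_credit>\n",
    "<os_name>Linux</os_name>\n",
    "<total_credit>5.0</total_credit>\n",
    "<os_name>Darwin</os_name>\n",
    "<total_credit>2.5</total_credit>\n",
    "<os_name>Linux</os_name>\n",
]
print(sum_credit_by_os(sample))
```

Keep in mind that this relies on the input strictly alternating: a single missing line would shift every pair after it.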

Version 11: Another clever suggestion by Andrew Dalke was to avoid using the intermediate file, and use os.popen to connect to and read from the zcat/grep command directly. Thus, I substituted:

os.system('zcat host.gz | grep -F -e total_credit -e os_name > '+tmp)

f = open(tmp)
for line in f:
  do something

with:

f = os.popen('zcat host.gz | grep -F -e total_credit -e os_name')

for line in f:
  do something

This saves disk I/O time, and the performance is increased accordingly. The speedup goes up to 1.98.
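
In current Python 3 the same thing is usually written with the subprocess module instead of os.popen. The following is a sketch, not what my script does: stream_lines is a made-up helper name, and a harmless printf pipeline stands in for the zcat/grep command so that it runs without a host.gz file (it does assume a Unix shell with printf and grep).

```python
import subprocess


def stream_lines(command):
    """Yield the output lines of a shell pipeline, one at a time."""
    proc = subprocess.Popen(command, shell=True,
                            stdout=subprocess.PIPE, text=True)
    try:
        for line in proc.stdout:
            yield line
    finally:
        proc.stdout.close()
        proc.wait()


# Stand-in pipeline; the real command would be the zcat | grep one above.
for line in stream_lines("printf 'one\\ntwo\\n' | grep -F -e two"):
    print(line, end="")
```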

All the values I have given are for a sample log (from MalariaControl.net) with 7 MB of gzipped info (49 MB uncompressed). I also tested my scripts with a 267 MB gzipped (1.8 GB uncompressed) log (from SETI@home), and a plot of speedups vs. versions follows:

[Figure versions2.png: execution speedup vs. version]

Notice how the last modification (avoiding the temporary file) is of much more importance for the bigger file than for the smaller one. Recall also that the odd/even modification (version 10) is of very little importance for the small file, but quite efficient for the big file (compare it with Version 9).

The plot doesn't tell (it compares versions with the same input, not one input with the other), but my eleventh version of the script runs the 267 MB log faster than the 7 MB one with Version 1! For the 7 MB input, the overall speedup from Version 1 to Version 11 is above 50.
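
The "above 50" figure can be roughly reconstructed from the numbers in these two posts. The small Python 3 sketch below combines the running times quoted in the previous post (44.3 s for Version 1 and 1.5 s for Version 5, on the first computer) with the 1.98 speedup of Version 11 over Version 5. Chaining ratios measured on two different computers is an approximation, and the function name speedup is just for illustration.

```python
def speedup(base_time, new_time):
    """Speedup as defined above: how many times faster than the base."""
    return base_time / new_time


# Running times (seconds) quoted in the previous post, first computer.
t_version1 = 44.3
t_version5 = 1.5

# Version 5 is the "base" of this post; Version 11 reached 1.98 over it.
overall = speedup(t_version1, t_version5) * 1.98
print("Version 1 -> Version 11: about %.0fx" % overall)
```

This prints "about 58x", consistent with the overall speedup being above 50.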


Summary of my Python optimization adventures

This is a follow up to two previous posts. In the first one I spoke about saving memory by reading line-by-line, instead of all-at-once, and in the second one I recommended using Unix commands.

The script reads a host.gz log file from a given BOINC project (more precisely one I got from MalariaControl.net, because it is a small project, so its logs are also smaller), and extracts how many computers are running the project, and how much credit they are getting. The statistics are separated by operating system (Windows, Linux, MacOS and other).

Version 0

Here I read the whole file to RAM, then process it with Python alone. Running time: 34.1s.

#!/usr/bin/python

import os
import re
import gzip

credit  = 0
os_list = ['win','lin','dar','oth']

stat = {}
for osy in os_list:
  stat[osy] = [0,0]

# Process file:
f = gzip.open('host.gz','r')
for line in f.readlines():
  if re.search('total_credit',line):
    # Strip the opening and closing tags, leaving only the number
    credit = float(re.sub('</?total_credit>',' ',line.split()[0]))
  elif re.search('os_name',line):
    if re.search('Windows',line):
      stat['win'][0] += 1
      stat['win'][1] += credit
    elif re.search('Linux',line):
      stat['lin'][0] += 1
      stat['lin'][1] += credit
    elif re.search('Darwin',line):
      stat['dar'][0] += 1
      stat['dar'][1] += credit
    else:
      stat['oth'][0] += 1
      stat['oth'][1] += credit
f.close()

# Return output:
nstring = ''
cstring = ''
for osy in os_list:
  nstring +=   "%15.0f " % (stat[osy][0])
  try:
    cstring += "%15.0f " % (stat[osy][1])
  except:
    print osy,stat[osy]

print nstring
print cstring

Version 1

The only difference is a "for line in f:", instead of "for line in f.readlines():". This saves a LOT of memory, but is slower. Running time: 44.3s.
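
The two reading styles can be compared with a toy file. This is a Python 3 sketch (my scripts here are Python 2), so the file is opened in text mode with 'rt'. The tiny stand-in for host.gz and the function names are made up.

```python
import gzip
import os
import tempfile


def count_all_at_once(path):
    # Version 0 style: readlines() builds a list with every line in RAM.
    with gzip.open(path, 'rt') as f:
        return sum(1 for line in f.readlines() if 'total_credit' in line)


def count_line_by_line(path):
    # Version 1 style: the file object yields one line at a time.
    with gzip.open(path, 'rt') as f:
        return sum(1 for line in f if 'total_credit' in line)


# Build a tiny made-up stand-in for host.gz so the sketch is runnable.
path = os.path.join(tempfile.mkdtemp(), 'host.gz')
with gzip.open(path, 'wt') as f:
    for i in range(1000):
        f.write("    <total_credit>%d</total_credit>\n" % i)
        f.write("    <os_name>Linux</os_name>\n")

print(count_all_at_once(path), count_line_by_line(path))
```

Both functions give the same count; the difference is only in how much of the file is held in memory at once.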

Version 2

In this version, I use precompiled regular expressions, and the time-saving is noticeable. Running time: 26.2s

#!/usr/bin/python

import os
import re
import gzip

credit  = 0
os_list = ['win','lin','dar','oth']

stat = {}
for osy in os_list:
  stat[osy] = [0,0]


pattern    = r'total_credit'
match_cre  = re.compile(pattern).match
pattern    = r'os_name';
match_os   = re.compile(pattern).match
pattern    = r'Windows';
search_win = re.compile(pattern).search
pattern    = r'Linux';
search_lin = re.compile(pattern).search
pattern    = r'Darwin';
search_dar = re.compile(pattern).search

# Process file:
f = gzip.open('host.gz','r')

for line in f:
  if match_cre(line,5):
    # Strip the opening and closing tags, leaving only the number
    credit = float(re.sub('</?total_credit>',' ',line.split()[0]))
  elif match_os(line,5):
    if search_win(line):
      stat['win'][0] += 1
      stat['win'][1] += credit
    elif search_lin(line):
      stat['lin'][0] += 1
      stat['lin'][1] += credit
    elif search_dar(line):
      stat['dar'][0] += 1
      stat['dar'][1] += credit
    else:
      stat['oth'][0] += 1
      stat['oth'][1] += credit
f.close()

# etc.

Version 3

Later I decided to use AWK to perform the heaviest part: parsing the big file, to produce a second, smaller, file that Python will read. Running time: 14.8s.

#!/usr/bin/python

import os
import re

credit  = 0
os_list = ['win','lin','dar','oth']

stat = {}
for osy in os_list:
  stat[osy] = [0,0]

pattern    = r'Windows';
search_win = re.compile(pattern).search
pattern    = r'Linux';
search_lin = re.compile(pattern).search
pattern    = r'Darwin';
search_dar = re.compile(pattern).search

# Distill file with AWK:
tmp = 'bhs.tmp'
os.system('zcat host.gz | awk \'/total_credit/{printf $0}/os_name/{print}\' > '+tmp)

# Process tmp file:
f = open(tmp)
for line in f:
  line = re.sub('>','<',line)
  aline = line.split('<')
  credit = float(aline[2])
  os_str = aline[6]
  if search_win(os_str):
    stat['win'][0] += 1
    stat['win'][1] += credit
  elif search_lin(os_str):
    stat['lin'][0] += 1
    stat['lin'][1] += credit
  elif search_dar(os_str):
    stat['dar'][0] += 1
    stat['dar'][1] += credit
  else:
    stat['oth'][0] += 1
    stat['oth'][1] += credit
f.close()

# etc

Version 4

Instead of using AWK, I decided to use grep, with the idea that nothing can beat this tool, when it comes to pattern matching. I was not disappointed. Running time: 5.4s.

#!/usr/bin/python

import os
import re

credit  = 0
os_list = ['win','lin','dar','oth']

stat = {}
for osy in os_list:
  stat[osy] = [0,0]

pattern    = r'total_credit'
search_cre = re.compile(pattern).search

pattern    = r'Windows';
search_win = re.compile(pattern).search
pattern    = r'Linux';
search_lin = re.compile(pattern).search
pattern    = r'Darwin';
search_dar = re.compile(pattern).search

# Distill file with grep:
tmp = 'bhs.tmp'
os.system('zcat host.gz | grep -e total_credit -e os_name > '+tmp)

# Process tmp file:
f = open(tmp)
for line in f:
  if search_cre(line):
    line = re.sub('>','<',line)
    aline = line.split('<')
    credit = float(aline[2])
  else:
    if search_win(line):
      stat['win'][0] += 1
      stat['win'][1] += credit
    elif search_lin(line):
      stat['lin'][0] += 1
      stat['lin'][1] += credit
    elif search_dar(line):
      stat['dar'][0] += 1
      stat['dar'][1] += credit
    else:
      stat['oth'][0] += 1
      stat['oth'][1] += credit

f.close()

# etc

Version 5

I was not completely happy yet. I discovered the -F flag for grep (in the man page), and decided to use it. This flag tells grep that the pattern we are using is a literal, so no expansion of it has to be made. Using the -F flag I further reduced the running time to: 1.5s.

[Figure time_vs_version.png: running time vs. script version]
