Is my theory bullshit?

This post tries to sketch a rule of thumb for quickly checking whether an idea/theory/belief is utterly useless or not. I have admittedly adapted it from the [[Bertrand Russell|russellian]] definition of [[Science]]. Recall that utterly useless ideas are not necessarily wrong. They are just that: utterly useless.

There is a single basic question you have to ask yourself when you invent/encounter a flashy new theory or idea like [[Psychokinesis|telekinesis]] or [[homeopathy]]:

Can I imagine any conceivable way of refuting this theory?

If the answer is “no”, then the theory is bullshit.

If you accept this, you are bound to abandon the theory if someone comes up with a valid experiment at which your theory fails (if someone challenges your telekinetic powers and you cannot prove her wrong, you must accept that you don’t have telekinetic powers).

On the other hand, if you don’t accept the above premise, you must, without excuse, believe in any other theory that can not be proved wrong, such as the [[Invisible Pink Unicorn]] or the [[Flying Spaghetti Monster]]. Failing to do so will undoubtedly qualify you as an absolute hypocrite.

Now, the long explanation…

Proving something true is theoretically impossible, but proving something wrong is trivial: if I say that all swans are white, no matter how many white swans I see, I will never be sure that the theory is true. On the other hand, after the first black swan I see, I will conclude without doubt that the theory was wrong.

Thus, “proving” some theory is usually equated with designing an experiment in controlled conditions, where a result is expected from the theory, and we get precisely that result in the experiment. Obviously, we could have obtained a different result, and our theory would have been proved wrong. It is precisely the fact that a different result could potentially refute our theory that makes the desired result confirm it. It follows that, if there is no conceivable circumstance under which the experiment could have failed, our theory cannot be disproved, and therefore cannot be “proved” through absence of refutation.

Take for example a [[precognition|seer]] who claims to be able to see the future. Her theory is not necessarily bullshit: one can devise a test, failing which would mean that she is wrong. For example, one can ask her to “see” something that she can not access by normal means, and that she can not guess by chance, for example the next lottery winning number. If she guesses correctly, the theory is temporarily accepted. If she fails, the theory is dropped.

Now comes the funny twist: any argument that tries to make the precognition theory above survive after a failure (e.g. “I do not control when I can see the future”, “I only see abstract visions that I have to interpret afterwards”, and so on… you know the thing), automatically turns it into bullshit. Directly. And that is because of the little rule of thumb I present above.


Project BHS

As outlined in some previous posts[1,2,3,4], I have been playing around with a piece of Python code to process some log files. The log files to process were actually host.gz files from some [[BOINC]] projects, and the data I want to extract from them is quite simple: the Windows, Linux and Mac shares in the number of computers contributing to them (and the [[BOINC Credit System|work they do]]). By logging this processed data myself, I can see the time evolution of this share, and hopefully show the slow but steady rise of GNU/Linux :^)

I figured that the contribution to distributed computing projects could be a reasonable indicator of Windows’s predominance. There are many other indicators (for example, the number of visits to a web site, e.g. this very one), and I don’t claim that this one is “better”. I just want to add it to the list of references for the reader.

There is a problem with “Windows vs. Linux” figures: they are not really “competing” products. When cars or soft drinks are the subject, one can figure out the [[market share]] by looking at the number of items sold. Linux being [[free software]], one can hardly measure the number of “sold copies”, and with Windows pre-installed on most new computers, one cannot really trust “number of computers sold = number of Windows copies sold”, because some users remove the Windows partition and install Linux on top of it.

Counting the visits to some sites is not without problems, either. Any web site has a particular audience, and the result will be biased by that fact. When my blog was on WordPress.com, I had roughly as many visits from Windows users as from Linux users, and almost all of them used Firefox as a browser. Obviously this data is not an accurate reflection of the world at large. It just happened that free software users were more likely to surf to sites like mine; hence the bias.

So, without further ado, let me introduce the “BOINC Host Statistics” program (BHS). Here is a link to its home page. You can find the results I have harvested so far in the Screenshots section. For example, here are the SETI@home credit generation rate statistics:

What the plot tells us is that (at the time of writing this) 500 million [[BOINC Credit System|cobblestones]] are being granted to contributors each day. Of these, around 82% go to Windows computers, 9-10% to Mac, 8% to GNU/Linux, and the rest to computers running other OSs.


New version of Sociable WP plugin

Another reason to love FLOSS: developers are close to the users, and they LISTEN.

I recently started using the Sociable WordPress plugin on this blog. This wonderful plugin by Joost de Valk lets you put links to social bookmarking/news/recommendation sites at the bottom of each post, so a reader can send your post to such a site with a single click.

There are many WP plugins that do this, but I liked the looks of Joost’s, and the pleasant way of managing it. I chose Digg, Reddit, del.icio.us, Technorati and Slashdot, but I felt that at least two sites that I liked were missing from the available sites list: Menéame and Barrapunto.

So I boldly decided to contact the developer, Joost de Valk, and ask for them:

Hi Joost,

I have just discovered your “Sociable” WordPress plugin, and I like it a lot.

However, there is always room for improvement, and as such I would like to suggest you to add links to the following sites:

Menéame (http://meneame.net/)
Barrapunto (http://barrapunto.com/)

Both are Spanish “versions” of popular sites: Digg and Slashdot, respectively.

I mainly write in English, but I think that blogs with a Spanish audience could benefit a lot from these links.

Now I realize I even forgot to say “thanks in advance” or anything… I was a bit impolite, I fear. Anyway, his answer came a couple of days later:

I’ll add them in the next version, coming out… tonight I guess :)

Can I trust upon you to promote it a bit there? :)

Cheers,
Joost

It is actually true that a new version of Sociable has been released, and it includes Menéame and Barrapunto as available sites. So here goes your promotion, Joost ;^)

Isn’t it great when people collaborate and are generally nice to each other? Isn’t everyone tired of a society where people don’t do anything unless they get money or power in return?

Thanks Joost and other bona fide developers for your great work.


Blackout summary X

Last week a new power failure affected the Campus. At least the PCs at the DIPC were reset around midnight. So here goes the updated list of blackouts I have been able to compile, with comments where applicable:

  1. 2008-Mar-05
  2. 2007-Dec-10 (I used the reboot of my computer to install kernel 2.6.22-3)
  3. 2007-Oct-16
  4. 2007-Aug-27 (at least three short power failures, 5-10 minutes apart)
  5. 2007-May-19
  6. 2006-Oct-21 (they warned beforehand)
  7. 2006-Sep-14 (Orpheus fell, the DNSs fell, the DHCP servers fell)
  8. 2006-Jul-04 (Orpheus didn’t fall)
  9. 2006-Jun-16
  10. 2006-Jun-13
  11. 2006-Jun-08
  12. 2006-Jun-04
  13. 2006-May-26 (The card-based automated access to the Faculty broke down)
  14. 2005-Dec-21
  15. 2005-Dec-13

Summary: 15 blackouts in 813 days, or 54.2 dpb (days per blackout). 86 days since last blackout. Average dpb went up by 2.2.
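The summary figures above can be checked with a short Python snippet, using only the first and last dates of the list (the variable names are mine):

```python
from datetime import date

# First and most recent blackouts in the list above:
first = date(2005, 12, 13)
last  = date(2008, 3, 5)

ndays = (last - first).days   # 813 days between them
dpb   = ndays / 15.0          # 15 blackouts -> 54.2 days per blackout
```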

First post in the series: here


This blog is my OpenID provider

I really like the idea behind OpenID, and I already have an account at Weblogs SL. Of course, my WordPress.com blog was also a valid OpenID provider. Moreover, my isilanes.org site (and before that my EHU page) was turned into an OpenID provider by adding the following lines (an extra blank was added before “link” to make the text visible):

< link rel="openid.server" href="http://openid.blogs.es/index.php/serve" />
< link rel="openid.delegate" href="http://openid.blogs.es/isilanes" />

But I was not completely happy with that. When signing a comment in a blog (for example) with my WP blog URL, my nickname would appear as “handyfloss” (the name of the blog), not “isilanes” (my nick). If I used the Weblog URL (or that of www.ehu.es/isilanes), my nick would be “isilanes”, but clicking on it would take the reader to that URL, instead of to my blog.

With this WordPress.org blog these issues are gone. I have installed the Yadis plugin, and now I can sign with the “isilanes” nick, and give a link to this blog.

The configuration of the plugin is really simple: go to Options->Yadis->Add New Service, and select “Other…“. You will be asked for two pieces of data: “OpenID Server” and “OpenID Delegate” (both provided by your OpenID account, with Weblogs SL or whoever). Fill in the fields, click “submit”, and you’re done!


EU fines MS 899M euros over non-compliance

The European Union decided last Wednesday to impose an 899M euro penalty on Microsoft for not providing the information it had been asked to release in 2004.

The short story goes like this: the EU decided that MS was to make public the specifications of some protocols and formats to allow interoperability of Windows with other OSs. MS decided that this would be bad for its monopoly, so it refused. Later, it pretended to comply by sending the EU 30k pages of basically bullshit (for practical purposes, that documentation was useless). Now the EU has decided it will have none of it, and has fined MS for not complying.

Go, EU, go!


Gmail CAPTCHA broken?

I just read, through a link provided by Julen, that apparently Gmail CAPTCHA has been broken (referred to at Slashdot).

This CAPTCHA in particular is the one Google asks a new user to identify correctly to create a Gmail account. If a robot, or any other automated process, is able to make the correct guess and pretend to be a legit user, this opens the door to massive numbers of new Gmail accounts for spammers. We’ll see what comes of it (more spam, probably).


More on the Euskaltel bandwidth increase

As the reader may know, Euskaltel has doubled (and tripled) the bandwidth of all, or almost all, of its Internet connection plans. And it has done so while keeping prices unchanged, which is appreciated (although not entirely surprising, given that they had gone more than 2 years without changing their offerings).

Well, the 300 kb line my parents have contracted is now supposedly offered at 1 Mb. Great, triple the speed for the same price! Except that it isn’t quite true. It seems that updating the web page to offer higher speeds is easier than actually delivering them, so there is a small discrepancy between what is offered and what is served: my parents are still on 300 kb.

I decided to wait until February, to give them a “grace” period to bring the service in line with the offer, but since the month is almost over, I have decided to complain through their customer area.

Since I admire and respect Euskaltel for their fine customer care and efficient service, I wanted to pay homage to them by publishing on this blog the electronic conversation I am having with them. This way, my readers will see how good Euskaltel is (or how bad: it is in their hands). My experience is a drop in the ocean, but if even a single reader decides to sign up with Euskaltel after reading this, I will feel I have done something for a company that bends over backwards to give me the best possible service.

More posts about Euskaltel:

My original complaint (18-02-2008):

I see on your page (euskaltel.es) that the Despega 300 plan has become Despega 1M, at the same rate. My parents have that service contracted, but their connection speed is still 300 kb. I would like to know what kind of error you have made: either misleading advertising (if the error is on the web page) or deficient service (if you are giving us less bandwidth than contracted). Of course, I would also like this error to be corrected as soon as possible.

Thanks in advance.

Euskaltel’s reply (20-02-2008):

Dear customer:

In reply to the inquiry you sent us by e-mail: indeed, the Despega 300 Kbps Internet service is no longer marketed, and has become Despega 1 Mb for the same monthly fee.

Customers who have Despega 300 Kbps contracted will have their speed raised to 1 Mb at no additional cost. These speed increases are being carried out gradually, and it is expected that by summer all our customers will be on the updated speeds. In any case, when your speed increase is about to take place, you will hear from us informing you of the change.

Hoping to have cleared up your doubts,

Best regards,

Euskaltel, S.A.

My reply (21-02-2008):

Dear Euskaltel,

I understand and respect Euskaltel’s reasons (even though they are not explained to me) for increasing current customers’ speeds gradually (even if this is unfair compared to new customers, who get the higher speed immediately).

Since I take it that Euskaltel is at least as understanding as I am, I assume you will not mind if I, in return, pay one third of my usual monthly fee, given that I am being given one third of the contracted speed (I get the bandwidth from the moment I signed up, but not the CURRENT bandwidth of the service I contracted). Of course, just as Euskaltel does with me, I will increase my monthly payment “gradually”, and I expect (barring the unforeseen) to be paying 100% of my fee “by summer”, when you will presumably be giving me 100% of the contracted service.

Iñaki

P.S.: you can follow this conversation, as can all my readers, on my blog: http://handyfloss.wordpress.com/2008/02/21/a-vueltas-con-el-incremento-de-ancho-de-banda-de-euskaltel/

Euskaltel’s reply (22-02-2008):

Dear customer:

In reply to the inquiry in your message: when Euskaltel announced the speed increase it would apply to services already contracted by its customers without modifying the fees, it also announced that the change would be applied in stages over the following months. We also inform you that for this type of change the law stipulates a period of 6 months.

Likewise, Euskaltel also announced that from that moment on, the minimum speed it would offer would be 1M.

Customers who have the Despega 300 kb service contracted will have their connection speed increased by Euskaltel without any increase in the monthly fee, which in no way harms the customer. Nor will the fee be reduced to one third, since Euskaltel at no point announced that it would modify the monthly fee of contracted broadband services while keeping the soon-to-be-obsolete speed; rather, that it would increase the speed while keeping the monthly fee.

We also remind you that customers are being given the bandwidth they contracted, as you yourself say, until Euskaltel applies the speed increase when the time comes; and they will be notified of that change.

Best regards,

Euskaltel, S.A.

My reply (22-02-2008):

Dear Euskaltel,

At no point did I doubt that you had the law on your side. In fact, I was absolutely certain that if the law allowed you to delay the promised speed increase by 6 months, you would take the full 6 months, as you admit you will. Had it allowed you 12 months, you would obviously have taken 12. All in order to give your customers the best possible service, of course!

I am reluctant to regard my speed increase “keeping the fee” (as you repeat so often) as a gift that Euskaltel grants me out of its infinite goodness. I rather regard it as a legal obligation not to discriminate between customers, since (for business reasons) you have updated your obsolete rates (frozen for more than 2 years) for new customers, and (like it or not) you cannot run different rates for new and old customers. You are therefore forced to increase my bandwidth, and you are going to do it as late as the law allows. So excuse me if I do not thank you.

The only doubt I have left is the moral justification (the legal one apparently exists) for offering a better service to new customers, with the resulting unfairness to current ones. It seems that instead of rewarding loyalty, you prefer to insult it.

Given your policy, the wisest thing for me to do would be to cancel my contract and immediately sign up again, so as to benefit from your current rates. Although I have no doubt you have countless legal safeguards in place to obstruct that operation as much as possible, delaying the cancellation as long as the law allows, so that doing so would not pay off for me.

Reading my arguments, does it seem to you that you are working to keep your customers happy?

My humble advice, for next time: if you are going to change rates or services, do it for ALL customers simultaneously (if you cannot, wait until you can), and announce the change 1 minute AFTER making it. Believe me, nobody will sue you for having doubled their bandwidth without warning. Announcing without delivering, on the other hand, may well constitute an offense (or at least a serious fault in the eyes of your customers).

Yours sincerely,

Iñaki

Update:

Euskaltel’s reply (25-02-2008):

Dear customer:

In reply to your message, we confirm its reception.

Thank you very much for your cooperation.

Best regards,

Euskaltel, S.A.

Update:

As of April 11, 2008, they have raised my speed to 1 Mb, as I explain in this post, published two days after the fact.


Some more tweaks to my Python script

Update: you can find the outcome of all this in a later post: Project BHS

All the comments on my previous post have provided me with hints to further increase the efficiency of a script I am working on. Here I present the advice I have followed, and the speed gain it provided. I will speak of “speedup” instead of timing, because this second set of tests was made on a different computer. The “base” speed will be the last value of my previous test set (1.5 s on that computer, 1.66 s on this one). A speedup of “2” will thus mean half the execution time (0.83 s on this computer).

Version 6: Andrew Dalke suggested the substitution of:

line = re.sub('>','<',line)

with:

line = line.replace('>','<')

Avoiding the re module speeds things up when searching for fixed strings, since the additional features of the re module are not needed.

This is true, and I got a speedup of 1.37.
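As a quick sanity check (on a made-up sample line), both calls produce the same result for a fixed-string pattern:

```python
import re

line = '<total_credit>123.5</total_credit>'  # made-up sample line

# For a fixed (non-regex) pattern, str.replace does the same job as re.sub,
# without the regular-expression machinery:
assert line.replace('>', '<') == re.sub('>', '<', line)
```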

Version 7: Andrew Dalke also suggested substituting:

search_cre = re.compile(r'total_credit').search
if search_cre(line):

with:

if 'total_credit' in line:

This is more readable, more concise, and apparently faster. Doing it increases the speedup to 1.50.
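A small check (again on a made-up sample line) showing that the substring test agrees with the compiled-regex search:

```python
import re

line = '  <total_credit>123.5</total_credit>'  # made-up sample line

search_cre = re.compile(r'total_credit').search

# Both tests agree; the "in" operator skips the regex machinery entirely:
assert bool(search_cre(line)) == ('total_credit' in line)
```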

Version 8: Andrew Dalke also proposed flattening some variables, specifically avoiding dictionary lookups inside loops. I went even further than his advice, and substituted:

stat['win'] = [0,0]

loop
  stat['win'][0] = something
  stat['win'][1] = somethingelse

with:

win_stat_0 = 0
win_stat_1 = 0

loop
  win_stat_0 = something
  win_stat_1 = somethingelse

This pushed the speedup further up, to 1.54.
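The transformation can be sketched as two equivalent functions (the function names and loop body are mine, for illustration; timing them with the timeit module would show the difference):

```python
def with_dict(n):
    # Dict + list lookup on every iteration:
    stat = {'win': [0, 0]}
    for _ in range(n):
        stat['win'][0] += 1
        stat['win'][1] += 2.0
    return stat['win']

def with_locals(n):
    # Plain local-name lookups only:
    win_stat_0 = 0
    win_stat_1 = 0.0
    for _ in range(n):
        win_stat_0 += 1
        win_stat_1 += 2.0
    return [win_stat_0, win_stat_1]

# Same result, fewer lookups per iteration:
assert with_dict(1000) == with_locals(1000)
```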

Version 9: Justin proposed reducing the number of times some patterns were matched, and extracting some info more directly. I achieved that by substituting:

loop:
  if 'total_credit' in line:
    line   = line.replace('>','<')
    aline  = line.split('<')
    credit = float(aline[2])

with:

pattern    = r'total_credit>([^<]+)<'
search_cre = re.compile(pattern).search

loop:
  if 'total_credit' in line:
    cre    = search_cre(line)
    credit = float(cre.group(1))

This trick saved enough to increase the speedup to 1.62.
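The capture-group extraction can be seen in isolation on a made-up sample line:

```python
import re

# One pass: find the marker and capture the number between '>' and '<':
search_cre = re.compile(r'total_credit>([^<]+)<').search

line = '  <total_credit>123.5</total_credit>'  # made-up sample line
credit = float(search_cre(line).group(1))      # extracted in a single match
```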

Version 10: The next tweak was an idea of mine. I was digesting a huge log file with zcat and grep to produce a smaller intermediate file, which Python would then process. The structure of this intermediate file is alternating lines: “total_credit”, then “os_name”, then “total_credit”, and so on. When processing this file with Python, I was searching each line for “total_credit” to differentiate between the two kinds of line, like this:

for line in f:
  if 'total_credit' in line:
    do something
  else:
    do somethingelse

But the alternating structure of my input would allow me to do:

odd = True
for line in f:
  if odd:
    do something
    odd = False
  else:
    do somethingelse
    odd = True

Presumably, checking a boolean is faster than searching for a substring, although in this case the gain was not huge: the speedup went up to 1.63.
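The odd/even logic can be sketched on a made-up alternating input (the list contents are illustrative, not real log data):

```python
# Alternating credit / os_name lines, as in the intermediate file:
lines = ['<total_credit>1.5</total_credit>',
         '<os_name>Linux</os_name>',
         '<total_credit>2.5</total_credit>',
         '<os_name>Windows</os_name>']

credits = []
names = []
odd = True
for line in lines:
    if odd:                   # credit line: no substring test needed
        credits.append(line)
        odd = False
    else:                     # os_name line
        names.append(line)
        odd = True
```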

Version 11: Another clever suggestion by Andrew Dalke was to avoid using the intermediate file, and use os.popen to connect to and read from the zcat/grep command directly. Thus, I substituted:

os.system('zcat host.gz | grep -F -e total_credit -e os_name > '+tmp)

f = open(tmp)
for line in f:
  do something

with:

f = os.popen('zcat host.gz | grep -F -e total_credit -e os_name')

for line in f:
  do something

This saves disk I/O time, and the performance is increased accordingly. The speedup goes up to 1.98.
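os.popen still works, but a sketch with the subprocess module (its modern replacement) looks like this; a portable stand-in command is used here instead of the real zcat/grep pipeline:

```python
import subprocess
import sys

# Stream a child process's stdout without a temporary file. In the real
# script the command would be "zcat host.gz | grep -F -e total_credit -e os_name".
cmd = [sys.executable, '-c', "print('total_credit'); print('os_name')"]
proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                        universal_newlines=True)
lines = [line.strip() for line in proc.stdout]
proc.wait()
```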

All the values I have given are for a sample log (from MalariaControl.net) with 7 MB of gzipped info (49 MB uncompressed). I also tested my scripts with a 267 MB gzipped (1.8 GB uncompressed) log (from SETI@home), and a plot of speedups vs. versions follows:

[Figure versions2.png: execution speedup vs. version]

Notice how the last modification (avoiding the temporary file) is of much more importance for the bigger file than for the smaller one. Recall also that the odd/even modification (version 10) is of very little importance for the small file, but quite efficient for the big file (compare it with Version 9).

The plot doesn’t show it (it compares versions with the same input, not one input with the other), but my eleventh version processes the 267 MB log faster than Version 1 processed the 7 MB one! For the 7 MB input, the overall speedup from Version 1 to Version 11 is above 50.


Summary of my Python optimization adventures

This is a follow up to two previous posts. In the first one I spoke about saving memory by reading line-by-line, instead of all-at-once, and in the second one I recommended using Unix commands.

The script reads a host.gz log file from a given BOINC project (more precisely one I got from MalariaControl.net, because it is a small project, so its logs are also smaller), and extracts how many computers are running the project, and how much credit they are getting. The statistics are separated by operating system (Windows, Linux, MacOS and other).

Version 0

Here I read the whole file to RAM, then process it with Python alone. Running time: 34.1s.

#!/usr/bin/python

import os
import re
import gzip

credit  = 0
os_list = ['win','lin','dar','oth']

stat = {}
for osy in os_list:
  stat[osy] = [0,0]

# Process file:
f = gzip.open('host.gz','r')
for line in f.readlines():
  if re.search('total_credit',line):
    credit = float(re.sub('</?total_credit>',' ',line.split()[0]))
  elif re.search('os_name',line):
    if re.search('Windows',line):
      stat['win'][0] += 1
      stat['win'][1] += credit
    elif re.search('Linux',line):
        stat['lin'][0] += 1
        stat['lin'][1] += credit
    elif re.search('Darwin',line):
      stat['dar'][0] += 1
      stat['dar'][1] += credit
    else:
      stat['oth'][0] += 1
      stat['oth'][1] += credit
f.close()

# Return output:
nstring = ''
cstring = ''
for osy in os_list:
  nstring +=   "%15.0f " % (stat[osy][0])
  try:
    cstring += "%15.0f " % (stat[osy][1])
  except:
    print osy,stat[osy]

print nstring
print cstring

Version 1

The only difference is a “for line in f:“, instead of “for line in f.readlines():“. This saves a LOT of memory, but is slower. Running time: 44.3s.

Version 2

In this version, I use precompiled regular expressions, and the time saving is noticeable. Running time: 26.2s

#!/usr/bin/python

import os
import re
import gzip

credit  = 0
os_list = ['win','lin','dar','oth']

stat = {}
for osy in os_list:
  stat[osy] = [0,0]


pattern    = r'total_credit'
match_cre  = re.compile(pattern).match
pattern    = r'os_name'
match_os   = re.compile(pattern).match
pattern    = r'Windows'
search_win = re.compile(pattern).search
pattern    = r'Linux'
search_lin = re.compile(pattern).search
pattern    = r'Darwin'
search_dar = re.compile(pattern).search

# Process file:
f = gzip.open('host.gz','r')

for line in f:
  if match_cre(line,5):
    credit = float(re.sub('</?total_credit>',' ',line.split()[0]))
  elif match_os(line,5):
    if search_win(line):
      stat['win'][0] += 1
      stat['win'][1] += credit
    elif search_lin(line):
      stat['lin'][0] += 1
      stat['lin'][1] += credit
    elif search_dar(line):
      stat['dar'][0] += 1
      stat['dar'][1] += credit
    else:
      stat['oth'][0] += 1
      stat['oth'][1] += credit
f.close()

# etc.

Version 3

Later I decided to use AWK to perform the heaviest part: parsing the big file to produce a second, smaller file that Python will read. Running time: 14.8s.

#!/usr/bin/python

import os
import re

credit  = 0
os_list = ['win','lin','dar','oth']

stat = {}
for osy in os_list:
  stat[osy] = [0,0]

pattern    = r'Windows'
search_win = re.compile(pattern).search
pattern    = r'Linux'
search_lin = re.compile(pattern).search
pattern    = r'Darwin'
search_dar = re.compile(pattern).search

# Distill file with AWK:
tmp = 'bhs.tmp'
os.system('zcat host.gz | awk \'/total_credit/{printf $0}/os_name/{print}\' > '+tmp)

stat = {}
for osy in os_list:
  stat[osy] = [0,0]
# Process tmp file:
f = open(tmp)
for line in f:
  line = re.sub('>','<',line)
  aline = line.split('<')
  credit = float(aline[2])
  os_str = aline[6]
  if search_win(os_str):
    stat['win'][0] += 1
    stat['win'][1] += credit
  elif search_lin(os_str):
    stat['lin'][0] += 1
    stat['lin'][1] += credit
  elif search_dar(os_str):
    stat['dar'][0] += 1
    stat['dar'][1] += credit
  else:
    stat['oth'][0] += 1
    stat['oth'][1] += credit
f.close()

# etc

Version 4

Instead of using AWK, I decided to use grep, with the idea that nothing can beat this tool when it comes to pattern matching. I was not disappointed. Running time: 5.4s.

#!/usr/bin/python

import os
import re

credit  = 0
os_list = ['win','lin','dar','oth']

stat = {}
for osy in os_list:
  stat[osy] = [0,0]

pattern    = r'total_credit'
search_cre = re.compile(pattern).search

pattern    = r'Windows'
search_win = re.compile(pattern).search
pattern    = r'Linux'
search_lin = re.compile(pattern).search
pattern    = r'Darwin'
search_dar = re.compile(pattern).search

# Distill file with grep:
tmp = 'bhs.tmp'
os.system('zcat host.gz | grep -e total_credit -e os_name > '+tmp)

# Process tmp file:
f = open(tmp)
for line in f:
  if search_cre(line):
    line = re.sub('>','<',line)
    aline = line.split('<')
    credit = float(aline[2])
  else:
    if search_win(line):
      stat['win'][0] += 1
      stat['win'][1] += credit
    elif search_lin(line):
      stat['lin'][0] += 1
      stat['lin'][1] += credit
    elif search_dar(line):
      stat['dar'][0] += 1
      stat['dar'][1] += credit
    else:
      stat['oth'][0] += 1
      stat['oth'][1] += credit

f.close()

# etc

Version 5

I was not completely happy yet. I discovered the -F flag of grep (in the man page), and decided to use it. This flag tells grep that the pattern is a fixed string, so no regular-expression processing has to be done. Using the -F flag I further reduced the running time to 1.5s.

[Figure time_vs_version.png: running time vs. script version]

