Project BHS

As outlined in some previous posts[1,2,3,4], I have been playing around with a piece of Python code to process some log files. The logs in question were host.gz files from some [[BOINC]] projects, and the data I want to extract from them is quite simple: the Windows, Linux and Mac shares in the number of computers contributing to them (and in the [[BOINC Credit System|work they do]]). By logging this processed data myself, I can follow the time evolution of these shares, and hopefully show the slow but steady rise of GNU/Linux :^)

I figured that contributions to distributed computing projects could be a reasonable indicator of how predominant Windows actually is. There are many other indicators (for example, the number of visits to a web site, e.g. this very one), and I don’t claim that this one is “better”. I just want to add it to the pool of references available to the reader.

There is a problem with “Windows vs. Linux” figures, and it is that they are not really “competing” products. When cars or soft drinks are the subject, one can figure out the [[market share]] by looking at the number of items sold. Linux being [[free software]], one can hardly measure the number of “copies sold”, and with Windows being pre-installed on most new computers, one cannot simply assume that “number of computers sold = number of Windows copies sold”, because some users go as far as removing the Windows partition and installing Linux in its place.

Counting the visits to some sites is not without problems either. Any web site has a particular audience, and the results will be biased by that fact. When my blog was hosted at WordPress.com, I had roughly as many visits from Windows users as from Linux users, and almost all of them used Firefox as a browser. Obviously this data is not an accurate reflection of the world at large: it just so happens that free software users are more likely to surf to sites like mine, hence the bias.

So, without further ado, let me introduce the “BOINC Host Statistics” program (BHS). Here is a link to its home page. You can find the results I have harvested so far in the Screenshots section. For example, the SETI@home credit generation rate statistics follow:

What the plot tells us is that (at the time of writing this) 500 million [[BOINC Credit System|cobblestones]] are being granted to contributors each day. Of them, around 82% are being given to Windows computers, 9-10% to Mac, 8% to GNU/Linux, and the rest to computers running other OSs.


New version of Sociable WP plugin

Another reason to love FLOSS: developers are close to the users, and they LISTEN.

I recently started using the Sociable WordPress plugin for this blog. This wonderful plugin by Joost de Valk lets you put links to social bookmarking/news/recommendation sites at the bottom of each post, so a reader can send your post to such a site with a single click.

There are many WP plugins that do this, but I liked the looks of Joost’s, and the pleasant way it is managed. I chose Digg, Reddit, del.icio.us, Technorati and Slashdot, but I felt that at least two sites I liked were missing from the list of available sites: Menéame and Barrapunto.

So I boldly decided to contact the developer and ask him to add them:

Hi Joost,

I have just discovered your “Sociable” WordPress plugin, and I like it a lot.

However, there is always room for improvement, and as such I would like to suggest you to add links to the following sites:

Menéame (http://meneame.net/)
Barrapunto (http://barrapunto.com/)

Both are Spanish “versions” of popular sites: Digg and Slashdot, respectively.

I mainly write in English, but I think that blogs with a Spanish audience could benefit a lot from these links.

Now I realize I even forgot to say “thanks in advance” or anything… I was a bit impolite, I fear. Anyway, his answer came a couple of days later:

I’ll add them in the next version, coming out… tonight I guess :)

Can I trust upon you to promote it a bit there? :)

Cheers,
Joost

It is actually true: a new version of Sociable has been released, and it includes Menéame and Barrapunto among the available sites. So here goes your promotion, Joost ;^)

Isn’t it great when people collaborate and are generally nice to each other? Isn’t everyone tired of a society where people don’t do anything unless they get money or power in return?

Thanks Joost and other bona fide developers for your great work.


Some more tweaks to my Python script

Update: you can find the outcome of all this in a later post: Project BHS

All the comments on my previous post have provided me with hints to further increase the efficiency of the script I am working on. Here I present the advice I have followed, and the speed gain it provided. I will speak of “speedup” instead of timing, because this second set of tests was made on a different computer. The “base” speed will be the last value of my previous test set (1.5 s on that computer, 1.66 s on this one). A speedup of “2” will thus mean half the execution time (0.83 s on this computer).

Version 6: Andrew Dalke suggested the substitution of:

line = re.sub('>','<',line)

with:

line = line.replace('>','<')

Avoiding the re module speeds things up when we are searching for a fixed string, since the additional features of the re module are not needed.

This is true, and I got a speedup of 1.37.
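
If you want to check this kind of claim on your own data, a quick timeit micro-benchmark does the trick (just a sketch; the sample line is made up):

import timeit

setup = "import re; line = '<total_credit>123.45</total_credit>'"

t_re  = timeit.Timer("re.sub('>', '<', line)", setup).timeit(100000)
t_str = timeit.Timer("line.replace('>', '<')", setup).timeit(100000)

print 're.sub : %.3f s' % t_re
print 'replace: %.3f s' % t_str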

Version 7: Andrew Dalke also suggested substituting:

search_cre = re.compile(r'total_credit').search
if search_cre(line):

with:

if 'total_credit' in line:

This is more readable, more concise, and apparently faster. Doing it increases the speedup to 1.50.

Version 8: Andrew Dalke also proposed flattening some variables, and specifically avoiding dictionary lookups inside loops. I went even further than his advice, and substituted:

stat['win'] = [0,0]

loop:
  stat['win'][0] = something
  stat['win'][1] = somethingelse

with:

win_stat_0 = 0
win_stat_1 = 0

loop:
  win_stat_0 = something
  win_stat_1 = somethingelse

This pushed the speedup further up, to 1.54.

Version 9: Justin proposed reducing the number of times some patterns were matched, and extracting some info more directly. I attained that by substituting:

loop:
  if 'total_credit' in line:
    line   = line.replace('>','<')
    aline  = line.split('<')
    credit = float(aline[2])

with:

pattern    = r'total_credit>([^<]+)<'
search_cre = re.compile(pattern).search

loop:
  if 'total_credit' in line:
    cre    = search_cre(line)
    credit = float(cre.group(1))

This trick saved enough to increase the speedup to 1.62.

Version 10: The next tweak was an idea of mine. I was digesting a huge log file with zcat and grep to produce a smaller intermediate file, which Python would then process. The structure of this intermediate file is alternating lines of “total_credit”, then “os_name”, then “total_credit” again, and so on. When processing this file with Python, I was searching each line for “total_credit” to differentiate between the two kinds of lines, like this:

for line in f:
  if 'total_credit' in line:
    do something
  else:
    do somethingelse

But the alternating structure of my input would allow me to do:

odd = True
for line in f:
  if odd:
    do something
    odd = False
  else:
    do somethingelse
    odd = True

Presumably, checking the truth value of a boolean is faster than searching a string, although in this case the gain was not huge: the speedup went up to 1.63.
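
Incidentally, the same idea can be written a bit more compactly, toggling the flag once per iteration (equivalent to the version above):

odd = True
for line in f:
  if odd:
    do something
  else:
    do somethingelse
  odd = not odd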

Version 11: Another clever suggestion by Andrew Dalke was to avoid the intermediate file altogether, and to use os.popen to read from the zcat/grep command directly. Thus, I substituted:

os.system('zcat host.gz | grep -F -e total_credit -e os_name > '+tmp)

f = open(tmp)
for line in f:
  do something

with:

f = os.popen('zcat host.gz | grep -F -e total_credit -e os_name')

for line in f:
  do something

This saves disk I/O time, and the performance increases accordingly: the speedup goes up to 1.98.
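
Putting all the tweaks together, the core of the final version would look roughly like this (only a sketch: the os_name handling and the final bookkeeping are illustrative guesses, not the actual BHS code):

import os
import re

search_cre = re.compile(r'total_credit>([^<]+)<').search
search_os  = re.compile(r'os_name>([^<]+)<').search

win_credit = 0.0
lin_credit = 0.0
credit     = 0.0

# Read directly from the zcat/grep pipe, no intermediate file
f = os.popen('zcat host.gz | grep -F -e total_credit -e os_name')

odd = True  # lines alternate: total_credit, os_name, total_credit, ...
for line in f:
  if odd:
    credit = float(search_cre(line).group(1))
  else:
    os_name = search_os(line).group(1)
    if 'Windows' in os_name:
      win_credit += credit
    elif 'Linux' in os_name:
      lin_credit += credit
  odd = not odd

f.close()
print 'Windows: %.0f, GNU/Linux: %.0f' % (win_credit, lin_credit)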

All the values I have given are for a sample log (from MalariaControl.net) with 7 MB of gzipped info (49 MB uncompressed). I also tested my scripts with a 267 MB gzipped (1.8 GB uncompressed) log (from SETI@home), and a plot of speedups vs. versions follows:

versions2.png

Execution speedup vs. version
(click to enlarge)

Notice how the last modification (avoiding the temporary file) matters much more for the bigger file than for the smaller one. Recall also that the odd/even modification (Version 10) makes very little difference for the small file, but is quite effective for the big one (compare it with Version 9).

The plot doesn’t show it (it compares versions with the same input, not one input with the other), but my eleventh version of the script processes the 267 MB log faster than Version 1 processed the 7 MB one! For the 7 MB input, the overall speedup from Version 1 to Version 11 is above 50.


Python: speed vs. memory tradeoff reading files

I was writing a script to process some log file, where I basically wanted to go line by line, and act upon each line if some condition was met. For the task of reading files, I generally use readlines(), so my first try was:

f = open(filename,'r')
for line in f.readlines():
  if condition:
    do something
f.close()

However, I realized that as the size of the file read increased, the memory footprint of my script increased too, to the point of almost halting my computer when the size of the file was comparable to the available RAM (1GB).

Of course, Python hackers will frown at me, and say that I was doing something stupid… Probably so. I decided to try a different thing to reduce the memory usage, and did the following:

f = open(filename,'r')
for line in f:
  if condition:
    do something
f.close()

Both pieces of code look very similar, but pay a bit of attention and you’ll see the difference.

The problem with “f.readlines()” is that it reads the whole file at once and assigns its lines to the elements of an (in this case anonymous) list. Then the for loop runs through that list, which sits entirely in memory. This leads to faster execution, because the file is read once and then forgotten, but requires more memory, because a list the size of the whole file has to be held in RAM.

fileread_memory

Fig. 1: Memory vs file size for both methods of reading the file

When you do “for line in f:”, you are effectively reading the lines one by one, on each cycle of the loop. Hence, the memory use is essentially constant, and very low, although the disk is accessed more often, which usually leads to slower execution of the code.

fileread_time.png

Fig. 2: Execution time vs file size for both methods of reading the file
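
For anyone wanting to reproduce the timing part of this comparison, a minimal harness could look like the following (the file name is a placeholder; the memory footprint has to be watched externally, e.g. with top):

import time

def read_all(filename):
  # readlines(): whole file into a list, then loop over the list
  f = open(filename, 'r')
  for line in f.readlines():
    pass
  f.close()

def read_iter(filename):
  # iterate over the file object: lines are read one by one
  f = open(filename, 'r')
  for line in f:
    pass
  f.close()

for func in (read_all, read_iter):
  t0 = time.time()
  func('some_big.log')  # placeholder file name
  print '%s: %.2f s' % (func.__name__, time.time() - t0)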


Filelight makes my day

First of all: yes, this could have been done with du. Filelight is just more visual.

The thing is that yesterday I noticed that my root partition was a bit on the crowded side (90+%). I thought it could be because of /var/cache/apt/archives/, where all the installed .deb files reside, and started purging some unneeded installed packages (very few… I only install what I need). However, I decided to double-check, and Filelight gave me the clue:

Filelight_root

(click to enlarge)

Some utter disaster in a printing job had filled /var/spool/cups/tmp/ with 1.5 GB of crap! After deleting it, my root partition is back to 69% full, which is normal (I partitioned my disk with 3 roots of 7.5 GB each (for three simultaneous OS installations, if need be), a /home of 55 GB, and a secondary disk of 250 GB).

Simple problem, simple solution.


App of the week: digiKam

As digital cameras get more and more common, and personal photo collections grow bigger, solutions for managing all these images become more and more necessary.

I bought my first digital camera (a Nikon CoolPix 2500) almost 4 years ago (now I see the model was already 1 year old when I bought my unit), and I now own a Panasonic Lumix DMC-FX10 that I’m very happy with. I obviously have the need outlined above, plus the desire to occasionally share some pictures over the web. I didn’t want to go for something like Picasa, so I made a lengthy Perl/Tk script to generate HTML albums from some info I would introduce by hand.

When I later discovered digiKam, I realized it had all the features I wanted. It is incredibly useful to tag your pictures, so that you can later retrieve, say, “all the pictures in which my father appears”. It also has many other features, like easy access to image manipulation (of which I only use rotation, for photos requiring it), or ordering of the pictures by date, so you can see how many pictures were taken each month. The humble, but for me killer, feature is that you can automatically generate HTML albums from a list of pictures, which can be selected e.g. by their tags.

Give it a try, and you’ll love it.


Sudden and random reflection

There are very few problems that computers cannot solve. And almost none that they cannot create.


SpyPig: another annoyance against your privacy

I’ve read in a post on Genbeta [es] about a “service” for e-mail senders called SpyPig. It basically boils down to sending a notification to the sender of an e-mail when the recipient opens it. This way, the recipient cannot claim that she hasn’t read it.

I will deal with two issues: one moral, one technological. Morally, I think this kind of thing sucks. I have received e-mails asking for confirmation that they had been read, and I never felt like answering. But at least I was asked politely. What these pigs at SpyPig provide is a sneaky way of doing it without the recipient even knowing. Would you consider someone who did that to you a friend? Not me.

Now, technologically, the system is more than simple, and anyone with access to a web server could do it. The idea is that the sender writes the e-mail in HTML mode, and inserts a picture (it can even be a blank image) hosted at some SpyPig server. When the recipient opens the HTML message, the image is loaded from the server, and the server logs will reflect when the image was loaded, and hence when the e-mail was opened. When this happens, the server notifies the sender.
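
Just to show how simple “more than simple” is, here is a hypothetical sketch of such a beacon e-mail (all addresses and the tracker URL are made up; the “server” part is nothing but reading the web server access log for hits on the image):

import smtplib
from email.mime.text import MIMEText

# The 'beacon': a tiny image hosted where the sender can read the access log
html = """<html><body>
<p>Hello there!</p>
<img src="http://tracker.example.com/beacon-12345.png" width="1" height="1">
</body></html>"""

msg = MIMEText(html, 'html')
msg['Subject'] = 'Innocent-looking message'
msg['From'] = 'sender@example.com'
msg['To'] = 'recipient@example.com'

server = smtplib.SMTP('localhost')
server.sendmail(msg['From'], [msg['To']], msg.as_string())
server.quit()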

The bottom line of this story is that HTML IS BAD for e-mails. My e-mail readers never display HTML messages, and show me the HTML source instead (of course, I can allow HTML, but why would I?). So this SpyPig thing will never work against me. And this SpyPig story is just one more reason not to allow HTML display in the messages you read. As for the e-mails you send, consider sending them in plain text. Your recipients will be a bit happier.

For more tips on what NOT to do on web/e-mail issues, check the e-mail/web tips section in this blog.


German government caught buying malware to intercept Skype calls

I’ll parrot here the news shaking the blogosphere today: apparently the German government intended to obtain software to spy on Skype users. What’s next?



Windows 2000 Server on a NAS? No, thanks.

You would think that, as a researcher in a serious center like the DIPC, one would hardly ever encounter an MS product, at least on the server/cluster side (more than one fellow here has Windows on his/her computer, but don’t tell anyone, it’s a secret).

However, we do have a server running Windows, and its presence is almost transparent to the user (which is good). And I say “almost” because I stumbled upon one of its “features”, and the sysadmin ended up confessing :^)

The thing is that I happened to try to create a directory with “CO” (carbon monoxide) in its name (it was a dir for a SIESTA calculation), when a dir with the same name already existed, except that it had “Co” (cobalt) in the name. Well, the filesystem complained that a dir by that name already existed! I could not believe my eyes!!

Basically, the filesystem makes no distinction at all between different capitalizations. For example:

% mkdir testdir
% mkdir TESTDIR
mkdir: cannot create directory `TESTDIR': File exists

% touch testfile
% rm TESTfile
rm: remove regular empty file `TESTfile'?

The explanation? The directory I was in was exported from a NAS running… ta-da: Windows 2000 Server!
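
By the way, if you want to check whether a mount you use behaves like this, a trivial probe is enough (a sketch; run it with a directory on the suspect mount as its argument):

import os
import sys

path = sys.argv[1]  # a directory on the mount you want to test
a = os.path.join(path, 'CaseProbe')
b = os.path.join(path, 'caseprobe')

os.mkdir(a)
# On a case-insensitive filesystem, 'caseprobe' "exists" because 'CaseProbe' does
if os.path.exists(b):
  print 'case-insensitive filesystem'
else:
  print 'case-sensitive filesystem'
os.rmdir(a)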

How incredibly stupid and annoying is it to have a filesystem that ignores character case altogether? And how error prone? If you are not aware of it, you might delete a file you didn’t intend to!! Someone could try to excuse MS by saying that, OK, that was in 2000. But look: Linux could tell upper case from lower case since its inception in 1991, and Unix since the seventies! The root of the problem is, of course, the filesystem used by the OS. But it so happens that the filesystems Linux has used since 1991 (beginning with ext, and then many others) have this capability (and many more), and are free. All MS had to do was use them, instead of FAT or NTFS. Instead, they chose to develop those (inferior) filesystems in parallel for almost 20 years now. I call that stupidity.

Needless to say, the sysadmin of the DIPC absolutely regrets having been naive enough to ever buy that MS crap.

