Free software woes

Yes, [[FLOSS]] also has its quirks and problems, and I am going to rant about some of them, which I ran into during the last week.

Problem 1: fsck on laptops

The reader might know that Linux comes with a collection of file system checkers/fixers, under the name fsck.* (where * = ext2/3, reiserfs, jfs, xfs…). When one formats a new partition (or tunes an existing one), some parameters are set, such as the circumstances under which fsck should be run automatically (you can always run it by hand). The typical setting is to run the check on each partition (just before mounting it) every N times it is mounted, or every M days.
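
Those parameters live in the filesystem superblock, and for ext2/3 you can inspect (or change) them with tune2fs from e2fsprogs. Here is a minimal sketch of doing that from Python, where /dev/sda1 is just an example partition and reading the superblock normally requires root:

```python
#!/usr/bin/env python3
"""Show when an ext2/3 partition will next be checked automatically.

Sketch only: /dev/sda1 is an example device, and tune2fs (from e2fsprogs)
usually needs root to read the superblock.
"""
import subprocess

DEV = "/dev/sda1"  # example partition; adjust to your system

# 'tune2fs -l' dumps the superblock; these fields govern automatic fsck runs
info = subprocess.run(["tune2fs", "-l", DEV],
                      capture_output=True, text=True, check=True).stdout
for line in info.splitlines():
    if line.startswith(("Mount count", "Maximum mount count",
                        "Last checked", "Check interval")):
        print(line)

# To request a check every 30 mounts or every 180 days, whichever comes first:
# subprocess.run(["tune2fs", "-c", "30", "-i", "180d", DEV], check=True)
```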

It is also arranged so that if a filesystem is not shut down cleanly (e.g., because the computer crashed or was directly unplugged), fsck will be run automatically on the next boot (hey, that’s so nice!).

However, here’s the catch: on laptops, and with the aim of saving power, fsck will (typically) not run automatically when on batteries. This seems a great idea, but you can imagine a scenario where it fails: shut down the laptop uncleanly, then power it up on batteries, and… voilà, you are presented with a system that seems to boot, but gives a lot of problems, X doesn’t work… because the disk was corrupt, and wasn’t fixed on boot.

When this happened to me, I fixed it by booting while plugged in. In principle you could also boot in single-user mode, then choose “Check the filesystem” in the menu you are presented with (I’m talking about Ubuntu here), and fix the problem, even on batteries. But still, it’s annoying. IMHO fsck should run after unclean shutdowns, no matter whether the machine is plugged in or on batteries.
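
For what it’s worth, the decision boils down to a simple “are we on AC power?” test at boot (the Debian/Ubuntu boot scripts rely on the on_ac_power helper for this, if I’m not mistaken). The sketch below is not the actual init-script code, just an equivalent check done by hand through sysfs; the exact paths vary from machine to machine:

```python
#!/usr/bin/env python3
"""Roughly the "on AC power?" test that decides whether the boot-time fsck is skipped.

Illustrative only: the layout under /sys/class/power_supply varies between
machines and kernel versions.
"""
import glob

def on_ac_power() -> bool:
    """Return True if any mains adapter reports itself as online."""
    for supply in glob.glob("/sys/class/power_supply/*"):
        try:
            with open(supply + "/type") as f:
                if f.read().strip() != "Mains":
                    continue
            with open(supply + "/online") as f:
                if f.read().strip() == "1":
                    return True
        except OSError:
            continue
    return False

if __name__ == "__main__":
    if on_ac_power():
        print("Plugged in: the automatic fsck would run")
    else:
        print("On batteries: the automatic fsck would be skipped")
```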

Problem 2: failed hibernate can seriously screw your system

I tried [[Hibernate (OS feature)|hibernating]] my laptop (a feature I keep finding problems with), but it was taking too long, and I was forced to shut it down using the power button. This, in itself, is a serious issue, but I could live with it.

But what I can’t live with is that after that event, I had no way of booting it back up! I tried everything I could, and finally had to reinstall the OS. I am the one it happened to, and I still find it hard to believe: Linux so fucked up that you have to reinstall. I thought reinstalling belonged to the Windows Dark Ages!

Problem 3: faulty SD card

Since problems tend to come together, it’s no surprise that I came across this error when trying to reinstall the machine borked by the previous problem. The thing is that I was using an SD card as installation media, burning the ISO onto it with [[UNetbootin]]. The burning didn’t burp any error, but the installation failed, usually (but not always) at the same point.

After minutes (hours?) of going crazy, I burned the ISO onto another SD card, and it worked like a charm.

My complaint is not that the SD card was faulty, which I can understand (hardware fails). What I am angry at is the fact that I checked the FS on the card many times (with the aforementioned fsck command), and I reformatted it (with mkfs) many more times, and Linux would always say that the formatting had been correct, and that all checks were fine. I understand that things are sometimes OK, sometimes KO. I just want to know which is which!
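
One way to catch this kind of silent failure would be to verify the copy instead of trusting the write: checksum the original file and the copy on the card (after ejecting and reinserting it, so you read the card and not the cache) and compare. UNetbootin actually unpacks the ISO into several files, but the idea is the same for each of them. A minimal sketch, with both paths being just examples:

```python
#!/usr/bin/env python3
"""Compare an original file with its copy on the SD card, chunk by chunk.

Sketch only: both paths are examples; eject and reinsert (or unmount/remount)
the card first, so the read-back really comes from the card.
"""
import hashlib

def md5_of(path, chunk=1 << 20):
    """MD5 of a file, read in 1 MB chunks so big ISOs don't fill the RAM."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

original = "ubuntu.iso"            # example: the file you copied
copy = "/media/sdcard/ubuntu.iso"  # example: the same file on the card

if md5_of(original) == md5_of(copy):
    print("The copy matches the original")
else:
    print("The copy is corrupted: try another card (or another reader)")
```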

5 Comments

  1. Super Jamie said,

    March 11, 2009 @ 23:42

    Your SD problem is not restricted to Free Software. Even when performing exhaustive hardware tests with proprietary (and expensive!) diagnostic software, a component can “test” as okay when in reality it doesn’t work, and swapping it out for a new one fixes the issue.

    This stems from the fact that software is an inferior way to test hardware. What you really need is something that can test the SD card at a low, electrical level. However, such a device would probably cost more than we’ll both earn together this year, so we make do with what we have.

    I agree that fsck-on-battery shouldn’t be automatically skipped; however, I think an interactive option (perhaps with a short timer to skip by default) would be ideal.

  2. isilanes said,

    March 12, 2009 @ 11:08 am

    Thanks for your comment, Super Jamie. I understand that testing hardware with software can be unreliable, but… if the hw error is noticeable by sw, by definition it is also checkable by sw, and vice versa: if it is not possible to check it with sw, then it should also be unnoticeable by sw. It doesn’t make much sense that a hw error that makes a file transfer fail goes unnoticed by a dedicated tool such as fsck. If that’s the case, then replace the fsck check with regular file transfers, which are apparently more subtle tests!!

    OK, OK, I know I’m being naive. Any test of a given system is a multi-variable problem, with many different things to check, values to measure and inferences to make from the collected data. I just wanted to protest a bit.

  3. EdorFaus said,

    June 25, 2009 @ 23:04

    (Maybe a bit late, but…) Checking the SD card with fsck is not even close to a complete test of the card.

    See, fsck only checks the filesystem itself – in other words, the structures that let you store files on there and then find/recognize them afterwards – not the actual data in the files. There’s a crucial difference there, though I suppose it might not be obvious to a layman.

    What it means is that, most likely, the data sectors that make up the filesystem structures are OK, but there’s one (or more) sector(s) containing file data that isn’t – in other words, the filesystem (which is what fsck checks) is OK, but some of the data that the installer requires to function is not.

    While I can only guess, I figure that the write probably didn’t return an error, but that the data didn’t actually get written properly – either that, or the read fails.

    The only real way I know of to check for the former kind of error is to write the data you need, and then read it back to compare with the original (probably via a checksum like e.g. MD5).
    This takes time though, so most copy operations don’t do it; you usually have to ask for it explicitly or do it manually afterwards. Also, to be really sure, you’ll have to flush any caches first (at minimum unmount and then mount again; in this case it’s probably a good idea to eject the card and then reinsert it).
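
    Something along these lines, roughly (an untested sketch; I’m assuming that asking the kernel to drop its cached pages with posix_fadvise is enough to force a real read from the card, otherwise unmount and remount as said above, and the file paths are just examples):

    ```python
    import hashlib
    import os
    import shutil

    SRC = "some-file"                # example: the data you need on the card
    DST = "/media/sdcard/some-file"  # example: where it ends up on the card

    def md5(path):
        h = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(1 << 20), b""):
                h.update(block)
        return h.hexdigest()

    # Write the data and make sure it really reaches the card, not just the cache
    with open(SRC, "rb") as src, open(DST, "wb") as dst:
        shutil.copyfileobj(src, dst)
        dst.flush()
        os.fsync(dst.fileno())

    # Throw away the cached pages so the read-back below hits the card itself
    fd = os.open(DST, os.O_RDONLY)
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    os.close(fd)

    # Read it back and compare with the original
    print("OK" if md5(SRC) == md5(DST) else "Mismatch: the write (or read) failed")
    ```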

    A further issue, which compounds the problem, is that such write errors can be data dependent. In other words, the sector might work perfectly for one set of bytes, but not for a different set.
    This makes testing difficult, as it’s practically impossible to test all sets of bytes (block sizes being generally 512 bytes minimum, often more) in any kind of reasonable time. Because of this, even comprehensive tests only test certain patterns.
    Note that this is something I know from RAM tests – while I believe it to be true for flash as well, I technically don’t really know that.
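
    To give an idea, a quick pattern test on a scratch file could look like the sketch below (untested; the four byte patterns are the classic ones that tools like badblocks -w use, the file path is just an example, and a serious test would want far more data and more patterns):

    ```python
    import os

    TESTFILE = "/media/sdcard/pattern-test.bin"  # example scratch file on the card
    SIZE = 512 * 2048                            # about 1 MB per pattern

    for pattern in (0x00, 0xFF, 0x55, 0xAA):     # classic memtest/badblocks patterns
        data = bytes([pattern]) * SIZE

        with open(TESTFILE, "wb") as f:          # write the pattern...
            f.write(data)
            f.flush()
            os.fsync(f.fileno())                 # ...and push it out to the card

        fd = os.open(TESTFILE, os.O_RDONLY)      # drop cached pages so we re-read
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)  # from the card itself
        os.close(fd)

        with open(TESTFILE, "rb") as f:          # read back and compare
            ok = f.read() == data
        print(f"Pattern 0x{pattern:02X}: {'OK' if ok else 'came back different!'}")

    os.remove(TESTFILE)                          # clean up the scratch file
    ```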

  4. Super Jamie said,

    June 25, 2009 @ 23:28

    Most definitely true. Faulty RAM does display the characteristics you speak of, and by extension a magnetic hard drive would suffer the same errors, due to having the cache on its controller card.

    However, doing such exhaustive read-writes of an SD card would probably reduce its service life significantly, especially if left running overnight or for a week! SD cards (and flash devices in general) are so cheap these days, why would you bother? Even a 16GB microSD is “only” US$50.

    SanDisk announced several years ago that their company vision was to make SD so ubiquitous it would be like film: something so cheap and available that you write a card once and then keep it forever.

  5. isilanes said,

    June 26, 2009 @ 9:37 am

    Hey, thanks EdorFaus and Super Jamie!

    I am (at a superficial level) aware of the issues you pose. I understand that the underlying physical medium could be OK, yet the actual data (the 0s and 1s forming the file) could be corrupted. For example, a 0 could be written instead of a 1 somewhere. In that case, any tool like fsck would say the card is OK (and it is), but the file on the card would not match the original file (which I could check with a checksum, or even diff). I was just assuming that the only reason for a file being corrupted was a failure of the underlying medium, which of course is not true.
