Archive for the 'Uncategorized' Category

Federating search through open protocols

Cory Doctorow wrote a Guardian column the other week that draws attention to the dangers of having one or a few big companies in charge of Search Services for the internet:

It’s a terrible idea to vest this much power with one company, even one as fun, user-centered and technologically excellent as Google. It’s too much power for a handful of companies to wield.

The question of what we can and can’t see when we go hunting for answers demands a transparent, participatory solution. […]

I completely agree with him that there’s a problem here – in fact for at least one other reason besides the ones he mentions. That reason also invalidates the solution he seems to propose: a sort of non-profit search giant under public control. Scroll down a few sections if you want to hear an alternative proposal…

Search giants slow innovation

Monopolists kill innovation even when they’re trying hard not to be evil, simply because that’s what monopolies do. Search has a specific problem on top of that: it costs a boatload of money just to start doing it, let alone to improve on anything. You’ll always have to index the whole internet, for example – no matter how good your algorithms are, nobody will use your service if you don’t have good coverage. After Cuil, venture capitalists may hesitate to cough up that sort of money.

Only a handful of companies have the means to put up a hundred thousand servers and compete with Google. After more than half a decade, Microsoft has now managed to produce Bing, which from my impressions so far is on par with Google Search. Read that again: half a decade – on par. What about innovation? Where’s the PageRank killer? What happened to those big leaps of progress that led to Google?

This is not Microsoft’s failure. The person with the hypothetical breakthrough idea may well have happened to work at another cool company, one that didn’t have the money to dive into Search. I’d say this is rather a failure of the free market (but see my About page: I’m not an economist – I really have no idea what I’m talking about :)). Every hypothetical insurgent has to overcome a multi-million dollar hurdle just to take a shot at the problem. That means there will always be too few candidates.

Paul Graham thinks it takes a different kind of investor to tackle the problem – one with the guts to throw money at this. I think we’d do better to find a way to bring the cost down. But let’s quickly shoot down the idea of a non-profit first.

A non-profit would kill innovation

As in completely, totally kill it. A public, participatory system is what you settle for when you want stability: it thus necessarily opposes innovation. You want a stable government, so you build a democracy. But you leave innovation to the free market, because innovating under parliamentary oversight would take forever.

Just imagine what would happen: we’d settle on, say, Nutch, throw a huge amount of public money at it, and then end up spending that money on endless bureaucracy – some users want this innovation, some that, others want to try something totally different instead, academics get to write papers about how it could all be better, the steering committee gets to debate it too, and then when a decision is near, there will be endless rounds of appeal…

(Doctorow realises this, as he writes “But can an ad-hoc group of net-heads marshall the server resources to store copies of the entire Internet?”)

Federation

We want to achieve two goals: the one that Doctorow outlined, which I will rephrase as “Search services that transparently serve the interests of all those who search as well as all those who want to be found” (with some legal limits of course), and the fast-innovation goal, which I think boils down to this: start-ups shouldn’t need to build every aspect of a search engine just to improve one aspect of it. The following is a rough outline of a crazy idea, and again: I have no idea what I’m talking about. Here we go…

Let’s call the people who search consumers, and the ones who want to be found providers. If you look at how the Google platform works internally, you’ll see there’s roughly a separation that reflects the presence of these two parties: there are index and document servers (let’s call them the back-end) that represent the providers, and there’s the front-end that handles a consumer’s query, talks to the index/document servers, and compiles a priority list for the consumer.

There’s a massive amount of communication between the back-end and front-end servers, so in the age of dial-up connections all of that had to happen within the data center – it had to be designed the way it was. Now that there’s fat bandwidth all over, couldn’t the front-end servers be separated from the back-end servers?

As a consumer, I’d get to deal with a front-end-providing company that would serve my interests, and my interests only. A natural choice would be my ISP, but as a more extreme solution the front-end could run on my desktop machine – the details don’t matter for now. The point is, there could be many of these front-ends, and I could switch to a different solution if I wanted more transparency (in that case I’d get an open-source solution, I guess) or if I wanted the latest and greatest.

All these front-ends would deal with many back-end servers – just like it is now, because the internet simply cannot be indexed on just a few machines. But they wouldn’t have to be owned by one company: there could be many. As a provider, then, I’d also have a choice of companies competing to serve my interests – they certainly wouldn’t drop me from their index (as in Doctorow’s problem outline), because I’m paying them. A natural choice for this would be my hosting company, but if they do a bad job (too slow, wrong keywords, whatever), I could fire them and go somewhere else.

(Big parties like Akamai or Amazon would be at a small advantage here, having a lot of server power to handle many index queries, but small parties could cut deals with other small parties to mirror each others’ servers – heck, I’m thinking about details again!)

Note that in addition, providers are in a much better position to index their documents than search-engine crawlers currently are. They could index information that crawlers may not get to – this is the main goal of the more narrowly defined federated search that Wikipedia currently serves up for that term. What’s proposed here is bigger – all-inclusive.

So who does the PageRanking?

There’s a little problem of course, in that the above is not an accurate picture of how stuff works. At Google, the back-end servers have to also store each site’s PageRank, and the front-ends rely on that for their ordering work. In the federated model, there would be some conflict of interest there: wouldn’t the providers bribe their back-end companies to game the system?

If all the companies involved were small enough, then no. If one back-end returned dishonest rankings, that would quickly become known among the front-ends, and they would drop this back-end from their lists. That’s similar to what Google does and what Doctorow is worried about, but there’s a big difference: if your back-end company behaves this way and you suffer as a provider, you can leave them and find a more respectable back-end. Honest providers would not have to suffer.

What about innovation? For one scenario, let’s say I’m a new front-end company and I want to replace PageRank with my innovation called RankPage. I’d have to get all the back-end guys to give me some sort of access to their servers so I can calculate RankPage. But that should (in theory, at least) be relatively easy: they don’t stand to lose anything, except maybe some compute time and sysadmin hours. If I turn out to be onto something, I’ll become a big front-end, driving a lot of consumers to them – that is, helping me try my innovation is ultimately in the best interest of the providers they serve. Note that nobody incurs high costs in this model.

(I’m having a really hard time stopping myself from thinking about details here, but let’s say a good front-end in this federated-search world would be able to deal with heterogeneity, where some back-ends respond with PageRank, some also provide RankPage, and some do yet something else…)

(And for more irrelevant details: we would also see many more specialist front-ends appear, that serve consumers with very specific interests. Could be cool!)

Why it won’t happen anytime soon

While the front-ends and back-ends could have many different implementations, they would have to be able to talk to each other in a very extensible language (we don’t want to end up with something like email: built on a hugely successful protocol which, however, doesn’t even let you verify the originator of a message!). That extensibility is pretty difficult to design, I imagine.

(Perhaps superfluously noted: it’s crucially important to establish a protocol, and not an implementation. If we’d settle for a federated version of Nutch, however good it may be, there’s no way to innovate afterwards.)
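
To make that a bit more concrete, here’s a purely hypothetical sketch of what one exchange in such a protocol might look like – the host name, endpoint and field names are all made up, and JSON-over-HTTP is just one arbitrary choice of encoding:

# a front-end asks a back-end for results, listing the ranking signals it understands
curl -s -X POST https://backend.example.net/search \
     -H 'Content-Type: application/json' \
     -d '{"query": "federated search", "want": ["url", "snippet", "pagerank", "rankpage"]}'
# a back-end that only knows PageRank might answer with just that signal:
#   {"results": [{"url": "http://example.org/", "snippet": "...", "pagerank": 0.42}]}
# another might add "rankpage": 0.37 – the front-end merges whatever signals it gets back.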

What’s also difficult to deal with is the chicken-and-egg problem: no consumers will come unless all providers are on board with this, and why would the providers participate? I could see a few big parties driving the process though – parties that want to become less dependent on Google (and Bing, and Yahoo Search).

Looking at how long it’s taken to establish OAuth (and that still has the job of conquering the world ahead of it), this might really take a while to come together.

But wouldn’t it be cool…


OpenSolaris’ ARM port

I’m usually too slow to catch onto news items like this. This time ’round, Sybreon dropped it onto my Google Reader home page – thanks dude :)

Two things I thought:

  • It’s worth mentioning that Ian Murdock has said this will form the basis for “Solaris 11”.
  • The ARM port makes a lot of sense to me. I can imagine companies being interested in having all their employees’ smartphones become first-class members of their company computing ecosystem (did I just write that monster sentence??). I’m sort of thinking: MacOS will be able to offer this, but in a more closed flavour; Linux will be able to offer it, but in a more heterogeneous flavour; and Solaris could offer the middle ground between those.

I know, I probably sound as “head in the clouds” as Jonathan Schwartz right now. Anyway, having multiple solutions can only be good: some companies will be looking for the flexibility of Linux offerings, but others may like the fact they can get it all from one vendor, who not only provides support but also holds the reins on development. A winner in any case will be ARM…

Reusing existing encrypted logical volumes while installing Ubuntu 8.10

…I couldn’t think of a longer title :)

Here’s the situation: I have a desktop which ran Debian Etch and later Lenny, and now I want to run Ubuntu Intrepid on it. Some might say you could use the wonders of APT to dist-upgrade the system, but that seemed a bit of a stretch to me. In any case, a fresh installation would be a lot easier.

However, I wanted to keep the partitions which had been carefully laid out when I installed Etch: I mostly followed the recipe in this earlier post of mine, which produced an encrypted volume with a few LVM volumes inside it. Keeping this structure saves you

  • moving the data in /home back and forth (actually, the forth part is still necessary, because you wouldn’t want to do this without backups, but at least you save yourself the back part)
  • going through the whole encrypted/LVM partitioning-shebang again (although you could reasonably opt out of filling the disk with random bits since that’s happened before)
  • uhm, I can’t remember point three…

Here’s the little problem: the Intrepid alternate installer doesn’t give you the option of opening existing LUKS volumes or activating LVM volumes. Luckily, I found some hints in this Debian bug report. In fact, going by the pointers that FJP gives there, you don’t really need me to tell you anything more – but I’ll do it anyway, to document that (and how) it works with the Intrepid alternate installer.

At some point before you enter the partitioner, switch to another console (e.g. Ctrl-Alt-F2) and type

modprobe dm-mod # device-mapper, needed for both dm-crypt and LVM
modprobe aes # the cipher used by the LUKS volume
cryptsetup luksOpen /dev/sdx2 sdx2_crypt # replace x and 2
# enter the passphrase...
vgchange -a y group_name # replace group_name with your volume group

After that, you can go into the partitioner, and your LVM volumes will appear. If you do the above after entering the partitioner, it doesn’t recognise them correctly for some reason that’s too deep for me to grasp. Now you’ll still have to set the mount points, and you need to be careful when choosing which volumes to format (not /home, for example). The installation then proceeds as usual. Read on before you reboot though:

I rebooted straight after the install finished and ran into the problem that the installer hadn’t written /etc/crypttab, so the encrypted volume did not get unlocked and booting failed. It was easily fixed using the install CD in rescue mode. For some reason, rescue mode asks the same questions as the installer does, but I ignored that and asked for a command prompt (it’s in the menu – sorry, I didn’t take screenshots…):

modprobe dm-mod
modprobe aes
cryptsetup luksOpen /dev/sdx2 sdx2_crypt # replace as before
# enter passphrase
vgchange -a y group_name
mkdir /target # don't worry, this is in temporary space
mount -t ext3 /dev/group_name/root_vol /target # mount your root dir ("/")
mount -t ext3 /dev/group_name/home_vol /target/home # optional?
mount -t ext3 /dev/sdx1 /target/boot # replace x and 1
mount -t proc /proc /target/proc
mount -t sysfs /sys /target/sys
chroot /target

Now you’re not fiddling in temporary space anymore – just thought I’d mention it. Oh, and for some reason mount complained when I tried this without specifying the ext3 filesystem type; I don’t see why. Let’s continue: we’re going to add an entry to /etc/crypttab and then rebuild the boot image.

echo "sdx2_crypt /dev/sdx2 none luks" >> /etc/crypttab
update-initramfs -u all

This rewrites the initrd images in your /boot, so that next time they’ll ask you to unlock the cryptodisk. I’d use an editor rather than the echo, but you get the idea. Most probably, you can also do all this before rebooting, right after the installer has finished its work – something like the sketch below – which would save you some hassle (let me know if that works for you, thanks!).
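
To be clear, I haven’t tried that path myself – so treat the following as an untested sketch of what it would look like from the installer’s second console (Ctrl-Alt-F2), where the freshly installed system should still be mounted at /target:

echo "sdx2_crypt /dev/sdx2 none luks" >> /target/etc/crypttab # replace sdx2 as before
mount -t proc /proc /target/proc   # probably needed inside the chroot, as in the rescue steps
mount -t sysfs /sys /target/sys
chroot /target update-initramfs -u -k all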

Finally, in case you’re curious: Intrepid Ibex is quite neat. I’ll be a frequent user of the on-the-fly guest account feature.

Amazon MP3 album downloads

The Amazon MP3 store has come to Europe – well, to the UK actually: logging into the site from a Dutch IP address lets you browse but not buy… nothing an ssh tunnel to a UK machine can’t fix ;). The cool thing is, as soon as I got to the site, it offered me an Amazon MP3 Downloader deb-package built for Ubuntu 8.10, and there are also packages for Debian 4, Fedora 9, and OpenSuse 11. It’s nice being treated as a first-class citizen. US users knew all of this already I guess, having had Amazon’s MP3 store around for a while.
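
In case you’re wondering what such a tunnel looks like: a minimal sketch, assuming you have shell access to some machine in the UK (the host name here is made up) and a browser that can use a SOCKS proxy.

ssh -D 1080 me@some-box-in-the-uk.example.org   # dynamic (SOCKS) forwarding on localhost:1080
# then point the browser's SOCKS proxy setting at localhost:1080 and reload the Amazon MP3 page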

It’s not all roses and other good stuff though (whatever the saying is, I don’t recall!): there are no source packages, so we have to make do with the provided binaries. The Intrepid package has dependencies that are also satisfied on Hardy, but it’s been built against i386 libraries. To get the required libraries set up on my amd64 system, I resorted to a script called getlibs. Note that I wouldn’t usually endorse using scripts found on a forum, and I won’t do it now either. I’ll just say that I scrolled through the script (you’ll find it uses some sudo calls – I rather like that they are inside the script, since it makes for tighter log-keeping) and decided it looked like good work to me. You’ll have to decide for yourself.

(To be totally honest, I didn’t completely trust my own judgment, so I also checked after running getlibs whether it really only did what I thought it did, using primitive checks like find with the -mtime switch…)
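
For reference, the installation step boils down to roughly the following – the package file name and the binary’s path are assumptions from memory, so treat this as a sketch rather than a recipe:

sudo dpkg -i --force-architecture amazonmp3.deb   # install the i386 package on the amd64 system
getlibs /usr/bin/amazonmp3                        # fetch missing i386 libraries (getlibs calls sudo itself); run again if it finds more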

Well, after installing the deb with dpkg using the --force-architecture switch, and running getlibs on the amazonmp3 binary twice, things worked like a charm. I bought a Coldplay album for 3 pounds, a price at which I’m more than happy to accept lossily compressed (but DRM-free!) files.

Of course I can’t say anything about the other couple of million tracks on Amazon MP3, but the ones I got seem to be archival-quality MP3s. Here’s EncSpot’s output (EncSpot is a tool that prints encoder settings from the file headers; I’m not sure it’s still current – I remember it from years ago, when I spent a lot more time on the HydrogenAudio forums – but there’s a deb package available at RareWares):

me@laptop:~/Amazon MP3/Coldplay/Viva La Vida Or Death And All His Friends$ encspot 01\ -\ Life\ In\ Technicolor.mp3
01 - Life In Technicolor.mp3
----------------------------

Bitrates:
------------------------------------------------------------
 32                                                     0.0%
 80                                                     0.0%
112                                                     0.0%
128                                                     0.1%
160                                                     0.3%
192     ||||||||||||||||                               13.9%
224     ||||||||||||||||||||||||||||||||||||||||       33.3%
256     ||||||||||||||||||||||||                       20.2%
320     ||||||||||||||||||||||||||||||||||||||         32.2%
------------------------------------------------------------

Type                : mpeg 1 layer III
Bitrate             : 256
Mode                : joint stereo
Frequency           : 44100 Hz
Frames              : 5707
Length              : 00:02:29
Av. Reservoir       : 90
Emphasis            : none
Scalefac            : 6.3%
Bad Last Frame      : no
Sync Errors         : 0
Encoder             : Lame 3.97

Lame Header:

Quality                       : 97
Version String                : Lame 3.97
Tag Revision                  : 0
VBR Method                    : vbr-old / vbr-rh
Lowpass Filter                : 19500
Psycho-acoustic Model         : nspsytune
Safe Joint Stereo             : yes
nogap (continued)             : no
nogap (continuation)          : no
ATH Type                      : 4
ABR Bitrate                   : 32
Noise Shaping                 : 1
Stereo Mode                   : Joint Stereo
Unwise Settings Used          : no
Input Frequency               : 44.1kHz

--[ EncSpot Console 2.0 ]--[ http://www.guerillasoft.com ]--

So that’s Lame at almost ridiculously high-quality settings.

I hope Amazon decide to open the source to their download tool, and thus get it included in the repositories of the various distros. Better yet, it could then be integrated as a plugin for, say, Quod Libet or Banshee or [insert your media player]. The other minor issue is pricing: except for a few “loss leaders” that go for 3 pounds, most download-albums I came across cost about as much as their CD-versions. Given that Amazon will ship those CDs to your door for free if you don’t mind a few days of waiting, the download option doesn’t sound too attractive for the money.

I don’t think you can download your files again if you lose them, but since they’re DRM-free, you can always just reload backups without nasty surprises.

Anyway, I’ll stick around for the three-pound specials… :)

Shrinking a Reiser file system

Most likely there are integrated, graphical tools that can do this quite nicely, but for tricky operations I’ve always preferred the bare and basic tools. Happily, I found a nice walkthrough of the procedure using only those. To state the obvious: back up before you resize.

Here’s a summary, just in case that page disappears: you use resize_reiserfs to shrink the file system (it needs to be offline – I used a USB stick with SystemRescueCD, which has all the reiserfsprogs on board). You can use something like “-s-20G” (note the minus sign) to shrink by 20GB. Shrink the file system a little more than you actually need to (this is the smart, lazy, brilliant trick in the procedure – read on).

In the second step, you use fdisk to shrink the partition containing the file system – that is, you delete it and recreate it smaller, making sure you don’t alter the starting cylinder (important!). By shrinking the partition a bit less than you shrank the file system, you save yourself the worry of getting the end cylinder exactly right.

Finally, run resize_reiserfs again, without any switches: this grows the file system to take all available space in the partition.
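
As a concrete sketch – the device name and sizes are examples only, and as always: back up first.

resize_reiserfs -s-22G /dev/sdb1   # step 1: shrink the file system a bit more (here 22GB) than you need
fdisk /dev/sdb                     # step 2: delete /dev/sdb1 and recreate it ~20GB smaller, keeping the same starting cylinder
resize_reiserfs /dev/sdb1          # step 3: with no size given, grow the file system to fill the partition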

Systematically backing up a wordpress blog

I’ve been meaning to write about this forever, but now someone else has done all the work already! The script presented there works wonderfully. I’ve just made a tiny addition, because I don’t like ending up with a directory full of huge XML files. Those of you who have been here before will have probably guessed already: the addition is to check the backup in to a revision control system.

To begin with, this is what my patch to the script looks like:

--- wordpress_backup.perl	2008-11-12 22:24:57 +0000
+++ wordpress_backup.perl	2008-11-12 23:05:32 +0000
@@ -13,14 +13,13 @@
 my $path_to_file="/path/to/file/";
 my $url="https://$username.wordpress.com/";
 my $author="all";
-#Filename format: fileprefix.date.xml
+#Filename format: fileprefix.xml
 my $fileprefix="wordpress";
 ###############################################  

 #Change that if you want
 my $agent="unixwayoflife/1.0";  

-my $date=((localtime)[5] +1900)."-".((localtime)[4] +1)."-".(localtime)[3];
 my $mech = WWW::Mechanize->new( agent => $agent );
 $mech->get( $url."wp-login.php" );  

@@ -38,7 +37,7 @@

 ##Download the file
 $mech->get($url."wp-admin/export.php?author=$author&submit=Download+Export+File&download=true");
-$file_name="$fileprefix.$date.xml";
+$file_name="$fileprefix.xml";
 $mech->save_content( $path_to_file.$file_name );
 print ("Download ttt[OK]n");  

@@ -50,4 +49,5 @@
 print ("Login out ttt[OK]n");  

 print("File successfully saved in $path_to_file$file_namen");
+system("bzr commit -m "* wordpress_backup.perl was here"");
 exit 0;

As you can see, only four lines changed – the maximum I could manage, since I haven’t ever edited a perl script before :P

All you then need to do is put in your blog’s details at the top and run it. The last step – the “bzr commit” – will fail, but running the script once gets you the initial XML file to check in to revision control (if you happen to be a native English speaker, can you tell me if that should be “check-in to” or “check into”, or something else yet?). Once you have the file, say “wordpress.xml”, and assuming you have Bazaar installed, you can bring the backups under version control with a simple

bzr init #this will create a .bzr tree under your current working directory
bzr add wordpress.xml
bzr commit -m "some happy message about the first commit"

The next time you run the perl script it will overwrite wordpress.xml and commit changes to the repository. If you have a big weblog, this could save you quite some disk space…
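
And to make it truly systematic, you can let cron run the script for you, say nightly. A minimal crontab sketch, with made-up paths for the script and the bzr working tree (adjust both; the cd matters because the script’s “bzr commit” runs in the current directory):

# run the backup (and the bzr commit inside it) every night at 03:30
30 3 * * * cd /home/me/blog-backup && perl /home/me/bin/wordpress_backup.perl >> backup.log 2>&1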

Confused thoughts about Digital Rights Management

I was reading this blog post the other day, and it inspired me to rethink the whole digital rights management issue too.

I used to voice a strong opinion whenever DRM came up in a conversation. Mostly, this opinion was based on some very bad experiences with early-days DRM: given that the red book audio CD standard did not provide any “content protection” features, all the tricks the music labels pulled to hinder CD-copying also made such discs incompatible with the standard. As a result, these “protected” discs would not be read by some transports, or cause lots of skipping, glitches, and clipping. So for me, DRM mostly meant “broken product”.

The other sentiment I had was that since I had bought the CD, it was now my music, so how dare these people decide for me that I couldn’t listen to it from my PC’s hard disk? Of course this instinctive “ownership sentiment” was somewhat uninformed, in the sense that it ignored the very essence of copyright. As I was talking to myself about before, we (as in “we the people”) have basically decided collectively to give up (at least in part) our right to copy.

With that sentiment out of the way, and DRM being a built-in feature of many newer data carriers and formats – and thus not as broken a product as it used to be – what is it that really bugs me about it? Well, it simply doesn’t work. It’s a broken design, because to let people use the “protected” content you have to give them the means to decode/unlock it. That’s a bit like jailing someone with the door key hidden inside the cell. Effectively, then, it only gets to annoy and hamper all of us legitimate users (as for me, I can’t even play most such “protected” content, being a free software user), while those who are really out to breach copyright will find a way to do so anyway – by design. DRM makes us suffer for no good reason.

Now, the problem with that is that we do need some effective kind of copyright enforcement, or so I believe. Perhaps not for music, as technological progress has reduced the cost of producing music to the point where artists can consider giving their work away, for example as promotional material (and many do). But what about movies, or computer games?

Perhaps I should just pretend DRM is not broken, and get a PlayStation 3. I’ll also pretend “they” didn’t just cheat me out of watching those Blu-ray movies and playing those funky games on my laptop. At least that way DRM won’t be cumbersome and annoying, but will come in pretty packaging with exemplary ease of use. It takes a lot of pretending though…