Federating search through open protocols

Cory Doctorow wrote a Guardian column the other week that draws attention to the dangers of having one or a few big companies in charge of Search services for the internet:

It’s a terrible idea to vest this much power with one company, even one as fun, user-centered and technologically excellent as Google. It’s too much power for a handful of companies to wield.

The question of what we can and can’t see when we go hunting for answers demands a transparent, participatory solution. […]

I completely agree with him that there’s a problem here – in fact, for at least one more reason, which he didn’t mention. That reason also invalidates the solution he seems to propose: a sort of non-profit search giant under public control. Scroll down a few sections if you want to hear an alternate proposal…

Search giants slow innovation

Monopolists kill innovation even if they’re trying hard not to be evil, simply because monopolies kill innovation. There’s a specific problem with Search, in that it costs a boat-load of money just to start out, let alone to improve on anything. You’ll always have to index the whole internet, for example – no matter how good your algorithms are, nobody will use your service if you don’t have good coverage. After Cuil, venture capitalists may hesitate to cough up that sort of money.

Only a handful of companies have the means to put up a hundred thousand servers and compete with Google. After more than half a decade, Microsoft has now managed to produce Bing, which from my impressions so far is on par with Google Search. Read that again: half a decade – on par. What about innovation? Where’s the PageRank killer? What happened to those big leaps of progress that led to Google?

This is not Microsoft’s failure. The person who might have had the hypothetical breakthrough idea may well have happened to work at another cool company, one that didn’t have the money to dive into Search. I’d say this is rather a failure of the free market (but see my About page: I’m not an economist – I really have no idea what I’m talking about :)). Every hypothetical insurgent has to overcome a multi-million dollar hurdle just to take a shot at the problem. That means there will always be too few candidates.

Paul Graham thinks it takes a different kind of investor to tackle the problem – one with the guts to throw money at it. I think we’d do better to find a way to bring the cost down. But let’s quickly shoot down the idea of a non-profit first.

A non-profit would kill innovation

As in completely, totally kill it. A public, participatory system is what you settle for when you want stability: it thus necessarily opposes innovation. You want a stable government, so you build a democracy. But you leave innovation to the free market, because innovating under parliamentary oversight would take forever.

Just imagine what would happen: we’d settle on, say, Nutch, throw a huge amount of public money at it, and then end up spending that money on endless bureaucracy – some users want this innovation, some that, others want to try something totally different instead, academics get to write papers about how it could all be better, the steering committee gets to debate it too, and then when a decision is near, there will be endless rounds of appeal…

(Doctorow realises this, as he writes “But can an ad-hoc group of net-heads marshall the server resources to store copies of the entire Internet?”)


We want to achieve two goals: the one that Doctorow outlined, which I will rephrase as “Search services that transparently serve the interests of all those who search as well as all those who want to be found” (with some legal limits, of course), and the fast-innovation goal, which I think boils down to this: start-ups shouldn’t need to build every aspect of a search engine just to improve one aspect of it. The following is a rough outline of a crazy idea, and again: I have no idea what I’m talking about. Here we go…

Let’s call the people who search consumers, and the ones who want to be found providers. If you look at how the Google platform works internally, you’ll see there’s roughly a separation that reflects the presence of these two parties: there are index and document servers (let’s call them the back-end) that represent the providers, and there’s the front-end that handles a consumer’s query, talks to the index/document servers, and compiles a priority list for the consumer.
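To make that separation concrete, here is a toy sketch in Python. The data structures and names are my own illustration, not Google’s actual design: a back-end holds a term index plus a document store, and the front-end fans the query out per term, merges the scores, and compiles the priority list.

```python
# A minimal sketch of the back-end/front-end split described above.
# All names and scores are invented for illustration.

# An index server maps terms to posting lists: (document id, term score) pairs.
INDEX_SHARD = {
    "federated": [("doc1", 0.9), ("doc3", 0.4)],
    "search":    [("doc1", 0.7), ("doc2", 0.8)],
}

# Document servers hold the actual content/snippets.
DOC_STORE = {
    "doc1": "Federating search through open protocols...",
    "doc2": "How search engines rank pages...",
    "doc3": "A history of federated databases...",
}

def backend_lookup(term):
    """An index server returns the posting list for one term."""
    return INDEX_SHARD.get(term, [])

def frontend_query(terms):
    """The front-end fans the query out to the back-end, merges the
    per-term scores, and compiles a priority list for the consumer."""
    scores = {}
    for term in terms:
        for doc_id, score in backend_lookup(term):
            scores[doc_id] = scores.get(doc_id, 0.0) + score
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(doc_id, DOC_STORE[doc_id]) for doc_id, _ in ranked]

print(frontend_query(["federated", "search"]))
```

The point of the sketch is only that the two halves communicate through a narrow interface (a posting list per term), which is exactly the seam where the federation proposal below wants to cut.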

In the age of dial-up connections, all of that had to happen within the data center: there’s a massive amount of communication between the back-end and the front-end servers, so it had to be designed the way it was. Now that there’s fat bandwidth all over, couldn’t the front-end servers be separated from the back-end servers?

As a consumer, I’d get to deal with a front-end-providing company that would serve my interests, and my interests only. A natural choice would be my ISP, but as a more extreme solution the front-end could run on my desktop machine – the details don’t matter for now. The point is, there could be many of these front-ends, and I could switch to a different solution if I wanted more transparency (in that case I’d get an open-source solution, I guess) or if I wanted the latest and greatest.

All these front-ends would deal with many back-end servers – just as now, because the internet simply cannot be indexed on only a few machines. But those servers wouldn’t all have to be owned by one company: there could be many. As a provider, then, I’d also have a choice of companies that would compete to serve my interests – they certainly wouldn’t drop me from their index (as in Doctorow’s problem outline), because I’m paying them. A natural choice for this would be my hosting company, but if they did a bad job (too slow, wrong keywords, whatever), I could fire them and go somewhere else.
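A toy sketch of that fan-out, with invented back-end names: the front-end asks every back-end it knows about, merges the answers into one ranked list, and remembers which back-end vouched for each result (which matters later, when a front-end wants to drop a misbehaving back-end).

```python
# Sketch of a front-end querying several independently owned back-ends.
# Back-end names, URLs, and scores are all hypothetical.

BACKENDS = {
    "hosting-co-a": {"search": [("a.example/page", 0.8)]},
    "hosting-co-b": {"search": [("b.example/page", 0.6)],
                     "protocols": [("b.example/spec", 0.9)]},
}

def query_backend(backend_name, term):
    """Each back-end indexes only the providers that pay it."""
    return BACKENDS[backend_name].get(term, [])

def federated_query(term):
    """Fan out to all known back-ends and merge into one ranked list,
    keeping the provenance of each result."""
    merged = []
    for name in BACKENDS:
        for url, score in query_backend(name, term):
            merged.append((url, score, name))
    merged.sort(key=lambda t: t[1], reverse=True)
    return merged

print(federated_query("search"))
```

Note that no single party needs to index the whole internet here: coverage comes from the union of many small indexes.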

(Big parties like Akamai or Amazon would be at a small advantage here, having a lot of server power to handle many index queries, but small parties could cut deals with other small parties to mirror each other’s servers – heck, I’m thinking about details again!)

Note that in addition, providers are in a much better position to index their documents than search-engine crawlers currently are. They could index information that crawlers may not get to – this is the main goal of the more narrowly defined federated search that Wikipedia currently serves up for that term. What’s proposed here is bigger – all-inclusive.

So who does the PageRanking?

There’s a little problem of course, in that the above is not an accurate picture of how stuff works. At Google, the back-end servers have to also store each site’s PageRank, and the front-ends rely on that for their ordering work. In the federated model, there would be some conflict of interest there: wouldn’t the providers bribe their back-end companies to game the system?

If all the companies involved were small enough, then no. If a back-end returned dishonest rankings, that would quickly become known among the front-ends, and they would drop that back-end from their lists. That’s similar to what Google does and what Doctorow is worried about, but there’s a big difference: if your back-end company behaves this way, and you suffer as a provider, you can leave it and find a more respectable back-end. Honest providers would not have to suffer.
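One way front-ends could police this is by spot-checking: compare a back-end’s claimed scores against scores the front-end recomputes independently (from a mirror, or from overlapping back-ends), and drop back-ends that deviate too often. The tolerance and threshold below are arbitrary placeholders, not a real protocol:

```python
# Toy sketch: a front-end auditing a back-end for score inflation.
# Thresholds and the recomputation source are hypothetical.

def audit_backend(reported_scores, recompute, tolerance=0.1, max_violations=2):
    """reported_scores: {doc_id: score the back-end claimed}.
    recompute: a function doc_id -> score obtained independently.
    Returns False when the back-end should be dropped from the federation."""
    violations = sum(
        1 for doc_id, claimed in reported_scores.items()
        if abs(claimed - recompute(doc_id)) > tolerance
    )
    return violations <= max_violations

# An honest back-end passes; one that inflates its providers' scores does not.
honest = {"doc1": 0.50, "doc2": 0.30}
inflated = {"doc1": 0.95, "doc2": 0.90, "doc3": 0.99}
true_score = {"doc1": 0.50, "doc2": 0.30, "doc3": 0.20}.get

print(audit_backend(honest, true_score))
print(audit_backend(inflated, true_score))
```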

What about innovation? For one scenario, let’s say I’m a new front-end company and I want to replace PageRank with my innovation called RankPage. I’d have to get all the back-end guys to give me some sort of access to their servers so I could calculate RankPage. But that should (in theory, at least) be relatively easy: they don’t stand to lose anything, except maybe some compute time and sysadmin hours. If I turn out to be onto something, I’ll become a big front-end, driving a lot of consumers to them – that is, helping me try my innovation is ultimately in the best interest of the providers they serve. Note that nobody incurs high costs in this model.

(I’m having a really hard time stopping myself from thinking about details here, but let’s say a good front-end in this federated-search world would be able to deal with heterogeneity, where some back-ends respond with PageRank, some also provide RankPage, and some do yet something else…)
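Since I can’t resist the details anyway, here is what that heterogeneity handling might look like: each result carries whichever signal its back-end happened to compute, and the front-end picks a fallback order. The fallback order and the discount factor are arbitrary illustrations, as is the “RankPage” signal itself:

```python
# Sketch: a front-end coping with heterogeneous back-ends, where a result
# may carry a PageRank score, a hypothetical RankPage score, or neither.

def effective_score(result):
    """Pick whichever ranking signal the back-end happened to provide."""
    if "rankpage" in result:
        return result["rankpage"]
    if "pagerank" in result:
        return result["pagerank"] * 0.8  # discount the older signal (arbitrary)
    return 0.1  # unranked results still get listed, just near the bottom

results = [
    {"url": "a.example", "pagerank": 0.9},
    {"url": "b.example", "rankpage": 0.8},
    {"url": "c.example"},  # this back-end provides no ranking signal at all
]
results.sort(key=effective_score, reverse=True)
print([r["url"] for r in results])
```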

(And for more irrelevant details: we would also see many more specialist front-ends appear, that serve consumers with very specific interests. Could be cool!)

Why it won’t happen anytime soon

While the front-ends and back-ends could have many different implementations, they would have to somehow be able to speak to each other in a very extensible language (we don’t want to end up with something like email – built on a hugely successful protocol, that however doesn’t even facilitate verifying the originator of a message!). That extensibility is pretty difficult to design, I imagine.
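One common pattern for that kind of extensibility is a versioned envelope with namespaced extension fields, where peers silently ignore fields they don’t understand. Everything below – the field names, the reverse-DNS namespacing, JSON itself – is invented for illustration; no such protocol exists:

```python
# Sketch of an extensible back-end response message: a versioned envelope
# whose ranking signals live in a namespaced "extensions" map, so new
# signals can be added without breaking older front-ends.
import json

def make_response(doc_id, extensions):
    return json.dumps({
        "version": 1,
        "doc": doc_id,
        "extensions": extensions,  # e.g. {"org.example.pagerank": 0.7}
    })

def parse_response(raw, known_extensions):
    msg = json.loads(raw)
    # Keep only the signals this front-end understands; ignore the rest.
    signals = {k: v for k, v in msg["extensions"].items()
               if k in known_extensions}
    return msg["doc"], signals

raw = make_response("doc1", {"org.example.pagerank": 0.7,
                             "com.newco.rankpage": 0.9})
doc, signals = parse_response(raw, known_extensions={"org.example.pagerank"})
print(doc, signals)
```

An origin-verification story (signing responses, say) would have to be designed in from day one – that is the email mistake the parenthesis above warns about.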

(Perhaps superfluously noted: it’s crucially important to establish a protocol, not an implementation. If we settled on a federated version of Nutch, however good it may be, there would be no way to innovate afterwards.)

What’s also difficult to deal with is the chicken-and-egg problem: no consumers will come unless all providers are on board, and why would the providers participate? I could see a few big parties driving this process, though – parties that want to become less dependent on Google (and Bing, and Yahoo Search).

Looking at how long it’s taken to establish OAuth (and that still has the job of conquering the world ahead of it), this might really take a while to come together.

But wouldn’t it be cool…


8 Responses to “Federating search through open protocols”

  1. 1 jmount 15 June 2009 at 2:34

    Good article. The situation is even worse than you describe – even if somebody were willing to pay to build a crawl-farm, most sites lock out spiders other than the top few. So a new player would have a lot of trouble indexing the internet before they were famous.

  2. 2 Jacob 15 June 2009 at 18:40

    There’s another problem: search isn’t just about absolute PageRank, it’s also about the relevance of a page for a particular query. Calculating the weight of a document for a search term can be quite expensive, and some weighting methods require calculating a global weight for every possible term. This quickly gets into the problem of heterogeneity that you glossed over above.

  3. 3 Mirek Sopek aka 1K-1 16 June 2009 at 6:05

    The relevance of your thoughts is obvious. Particularly, if you look at forthcoming Semantic Web and its search/reasoning needs.

    It is quite paradoxical and contrary to Tim Berners-Lee’s ideas, but for the Semantic Web to succeed, the world needs something I call Google^2 (Google raised to the power of 2 – at least).

    A close look at RDF, ontologies, and reasoning shows that.
    And this is even more dangerous.

    Cory Doctorow recently called for the public sector to step into the search area. Maybe that’s the idea. On reflection, I find it quite dangerous, too.

  4. 4 yungchin 17 June 2009 at 22:47

    jmount: Thanks, I didn’t realise that! I guess it’s understandable that they don’t want to waste too many cycles serving crawlers (and I take it there’s some fear of malicious crawlers that are out to copy whole sites), but it’s otherwise counter-intuitive: if you want to be found, let in the search engines… this could be a tough issue though.

    Jacob: You’re right, indeed, it’s trickier than I described. It seems to me though that by the nature of the problem, weighting needn’t necessarily involve global results – the relevance of a hit shouldn’t depend so much on the relevance of another, independently found hit, should it?

    Mirek Sopek: Yes, a Google^2 would be very worrying! :)

  5. 5 Jacob 19 June 2009 at 1:15

    Weighting doesn’t have to be global, it just depends on the statistics you want to use. If all you want to know is the relevance of a term on a page, you don’t need global stats. But if you want to know the relevance of a term in general, maybe to compare 2 different terms, then you’re getting into global stats. What I’m saying is that 2 words are not necessarily equal, and knowing the number of pages a word has occurred on can be very valuable for search.

  6. 7 Shane 30 August 2009 at 6:42

    I think your federated search concept is an intriguing idea. Obviously you’re right that the details need to be worked out, and it certainly sounds like a daunting project.

    One thing I wanted to take exception to, though – while it’s true that governments and nonprofits often are not innovative, it’s not a universal truth that nonprofit/public != innovative. The BBC has put out innovative concepts throughout its history. Here in America, we can still rely on some pretty incredible technical innovation from certain parts of our Department of Defense (see ARPANet). Similarly, nonprofit NGOs and universities often come up with groundbreaking ideas.

    • 8 yungchin 9 November 2009 at 1:16

      Hi, thanks! I’m sorry I didn’t react any sooner – I haven’t been taking care of these pages too well for the last few months…

      Indeed I should apologise for some carelessly-written statements in the above; however if you read it again, it’s not really the non-profit aspect I’m objecting to. Rather, it’s the public-as-in-participatory aspect; none of the examples you mention are very democratically-run institutions.

