Saturday, October 27, 2007

Death of Google

Unarguably, Google is the most phenomenal breakthrough to happen on the WWW scene. Tim Berners-Lee's invention could not have been this efficacious had Google not been founded on September 7, 1998. Without Google, the web would just be a huge archipelago of resources, with no way of telling the significant ones from the insignificant ones. Undeniably, Google has become imperative for bringing out the intrinsic value of the web.

What is so unique about Google that makes it the entryway (sorry, Yahoo) to the web? It's not that Google was the first search engine to come into existence; there were others (a few still are), like AltaVista, Yahoo, etc., but none could reach where Google is today. What makes Google special is its approach of ranking search results with an ingenious algorithm called PageRank. PageRank attaches a numeric relevance to a resource (web page) broadly based on two parameters (a rough sketch of the formula follows this list):
  1. The number of resources referring (hyperlinking) to the given resource.

  2. The relevance of the resources referring to the given resource.
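In the original Brin and Page paper, the rank of a page A that is linked to by pages T1...Tn is, roughly:

    PR(A) = (1 - d) + d * ( PR(T1)/C(T1) + ... + PR(Tn)/C(Tn) )

where C(Ti) is the number of outgoing links on page Ti and d is a damping factor (typically around 0.85). In plain words, every page that links to A passes on a share of its own rank, and that share counts for more when the linking page is itself highly ranked.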
The basic (and quite powerful) idea behind this algorithm is to find an intermediate path between the traditional approaches, removing their shortcomings while incorporating their advantages. There were two predominant approaches in the pre-Google era:
  1. Web directory: Yahoo started off as a web directory, where flesh-and-blood humans manually index the web, listing the most relevant sites under various categories. To find information one can either browse individual categories or search for a page containing a particular keyword within the indexed sites. This approach, though scoring high on the relevance scale, suffers from low scalability; the rapid growth of the web rendered it unusable on account of the incompleteness of the finished (if it ever can be finished) product.

  2. Relevance based on keyword usage: This approach to automatic indexing of the web involves sending software agents, called web-bots or spiders, across the web and determining the relevance of a web page for a particular keyword based on the frequency and location (title/body) of that keyword in the page. This approach, though scalable, suffers from a low-relevance problem, as heavy usage of a keyword does not guarantee relevance. Also, since relevance is assigned to a web page solely on the basis of keyword usage, it is pretty easy for a rogue webmaster to spoof the keyword usage in a page and deceive the spiders into treating it as highly relevant, irrespective of the actual content, so that it shows up higher in the search results (a toy sketch of this kind of scoring follows the list).
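To make that weakness concrete, here is a toy Python sketch of keyword-frequency scoring; the 10x title weighting is entirely made up for illustration and is not any real engine's rule:

    # Toy keyword-frequency relevance, in the spirit of pre-PageRank engines.
    # The 10x title weighting below is a made-up illustration.
    def keyword_relevance(title, body, keyword):
        keyword = keyword.lower()
        return 10 * title.lower().count(keyword) + body.lower().count(keyword)

    # A rogue webmaster simply repeats the keyword to look "relevant".
    spam_body = "cheap flights " * 500
    print(keyword_relevance("Cheap flights", spam_body, "cheap flights"))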

The Google Approach
The PageRank approach used by Google is an elegant intermediary between the two approaches given above. It resembles the second approach in that it sends spiders across the web to automatically index web pages, and hence enjoys high scalability; but it does not use keyword usage in a page to determine its relevance. Instead it exploits the structure of the web, determining the popularity/relevance of a web page from the number and relevance of the other web pages referring to it.
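A minimal sketch of the idea (my own toy implementation over a made-up four-page link graph, not Google's actual code) fits in a few lines of Python:

    # Toy PageRank by repeatedly propagating votes over a tiny link graph.
    # The page names and link structure are invented for illustration.
    damping = 0.85
    links = {            # links[page] = pages that `page` links to
        "A": ["B", "C"],
        "B": ["C"],
        "C": ["A"],
        "D": ["C"],
    }
    rank = {page: 1.0 for page in links}

    for _ in range(50):  # iterate until the ranks roughly stabilise
        rank = {
            page: (1 - damping) + damping * sum(
                rank[src] / len(out)           # each voter's rank, diluted by
                for src, out in links.items()  # how many pages it votes for
                if page in out
            )
            for page in links
        }

    for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
        print(page, round(score, 3))

Run it and C, the most linked-to page, comes out on top while D, which nobody links to, stays at the bottom; no keyword is ever inspected.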

It is remarkable how similar this approach is to the web directory approach, in the sense that both consider the involvement of flesh-and-blood human beings important in determining the relevance of a resource. But there is an important difference: while web directories explicitly rank relevant web pages, PageRank treats a reference (a hyperlink) as an implicit vote of relevance. The vote is also given higher weight if it comes from a page that is itself highly relevant. This way it saves mankind the impossible feat of manually ranking each web page.

Google and Web 2.0

Web 2.0 is here. The web as a network is being utilized more than it has ever been. Today everything is about user-generated content, to the extent that Time magazine named You as its Person of the Year for 2006. Be it YouTube, Blogger, Wikipedia, MySpace, eBay, Flickr, Second Life, del.icio.us, Twitter and so on, it's all about user-generated content.

How does Google fit into this new form of the web, where user-generated content heavily outstrips all other forms of content? Is there still a need for Google? The first reason these questions arise is that most Web 2.0 applications have a search mechanism of their own: YouTube, Flickr and del.icio.us use tagging of all the content entered by their users to determine what that content is about, while Blogger, Wikipedia, MySpace and eBay have search engines of their own, as they are the sole owners of their content. The argument simply is that if one wants to find a particular article on Wikipedia, one will prefer Wikipedia's built-in search engine to Google.

The second and more powerful reason, which is the main topic of this post, is that Google's democratic model will have a hard time dealing with the authoritarian nature of the user-generated-content service providers. One shocking example, which has gone largely unnoticed till now, relates to how Wikipedia works behind the scenes.

Wikipedia and Google
Wikipedia is another great thing to happen to the web after Google. For the uninitiated, Wikipedia is an online encyclopedia with over two million articles in its English version. Its power lies in the fact that all of its content is user-generated (non-experts included) and is continuously reviewed and commented upon by Wikipedia editors for the various factors that make an article great. This way Wikipedia ensures that it can keep growing without limit, which is impossible for any other encyclopedia authored by a limited set of experts, while still maintaining a more than reasonable standard of quality for its articles.

So far so good. From the last paragraph it looks as though Wikipedia is the epitome of democracy. Sadly, that's not true. Though the content on Wikipedia is generated by its users, control of that information lies in the hands of Wikipedia, and when a single authority controls information on this scale, the results can be devastating if it is not cautious.

A careful observation of search results obtained from Google, the search term being immaterial, reveals a pattern: almost every time, the first page will contain a link to some article on Wikipedia. A striking example: search for Sergey Brin and the first result that pops up links to the Wikipedia entry, superseding the Google Corporate Information page itself!

I am by no means stating that the page from Google is more relevant to the search term than the Wikipedia one; in fact, the Wikipedia page provides much more information than its Google counterpart. All I am trying to say is that it's quite possible for a relatively irrelevant page on Wikipedia to show up higher than a more relevant page elsewhere, owing to the way Wikipedia has structured its content. Let me explain. An article on Wikipedia (as it's manifested on the web) is simply an HTML page with external links to pages outside Wikipedia and internal links to pages inside Wikipedia. Wikipedia also provides a WYSIWYG (What You See Is What You Get) editor for easy editing of its pages. While this takes the burden of encoding the page in HTML off the author, it also takes away a lot of control. Interestingly, Wikipedia's editor has two different hypertext buttons, one for linking to pages inside Wikipedia and the other for creating an external link. Why such differential treatment? Why two buttons where one could have sufficed?

The answer to these questions is a shocking revelation of how irresponsible control of information on this enormous scale in the WWW era can lead to authoritarian behavior. A look at the HTML source of any Wikipedia article will reveal that while all external links carry an extra attribute, rel='nofollow', in their anchor tag, the same is missing from all the internal links. The presence of the rel attribute with the value nofollow discounts the vote of relevance from the source page to the target page, a vote that would otherwise have helped increase the relevance of the target page in Google's (and a few other search engines') database.
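For the curious, here is a small self-contained Python sketch (standard library only; the two anchor tags are made up to mimic Wikipedia-style markup, not copied from a real article) that separates the links a nofollow-honouring crawler will count as votes from the ones it is asked to ignore:

    # Classify anchor tags by the presence of rel='nofollow'.
    # The sample markup is hypothetical, in the style of a Wikipedia article.
    from html.parser import HTMLParser

    sample_html = """
    <a href="/wiki/PageRank" title="PageRank">PageRank</a>
    <a rel="nofollow" class="external text" href="http://example.com/source">a cited source</a>
    """

    class LinkClassifier(HTMLParser):
        def __init__(self):
            super().__init__()
            self.counted, self.discounted = [], []

        def handle_starttag(self, tag, attrs):
            if tag != "a":
                return
            attrs = dict(attrs)
            # A nofollow link still works for readers, but search engines are
            # asked not to treat it as a vote of relevance for the target page.
            if "nofollow" in (attrs.get("rel") or ""):
                self.discounted.append(attrs.get("href"))
            else:
                self.counted.append(attrs.get("href"))

    parser = LinkClassifier()
    parser.feed(sample_html)
    print("counted as votes:", parser.counted)    # the internal /wiki/ link
    print("vote discounted:", parser.discounted)  # the external, nofollow link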

As noted on this page, rel='nofollow' was originally suggested by Google to avoid spamdexing, which is nothing but stealing the high relevance or PageRank of a site by linking from an original page to an irrelevant spam page. The problem is more prominent on Web 2.0 sites, where it's pretty easy to author content (and hence create links) on a highly relevant site (say, a forum). Some Web 2.0 sites have circumvented the problem by enabling rel='nofollow' by default for user-generated content, with no way of disabling it. Wikipedia is one of them.

Sadly, Wikipedia has enabled rel='nofollow' only for external links, i.e. links pointing to pages outside Wikipedia, and not for its internal links, i.e. links pointing to pages within Wikipedia. What this necessarily means is that the author of an article on Wikipedia might have painstakingly gone through hundreds of web pages while creating the article, and would have happily returned the credit by linking to all the references he found highly relevant; yet because the links to his references are external, the rel='nofollow' attribute means he has not helped their page rank at all by linking back to them. At the same time, what he might have done inadvertently, while linking to some Wikipedia pages from his article, just to adhere to the convention of linking terms to existing articles or while creating the See Also section, is pass on the relevance of the current page to all those internal pages he linked to!

This type of structuring leads to a viral effect, where one highly ranked page on Wikipedia ends up recursively contributing to the page ranks of all the pages it links to, and so on. This ensures that all pages on Wikipedia have a reasonably good page rank irrespective of their actual content. A slap in Google's face. Besides degrading search results, it also discredits the references used for creating the page by not contributing to their rank. A slap in the original content providers' face. What this also means is that, in times to come, Wikipedia will invariably be one of the top rank holders for any search term, and Google will simply be a menu card with just a single item on offer. External pages, if required, will be accessed from the references section of the Wikipedia article that the search turns up. In other words, Google will be dead.
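To see that viral effect in the toy model from earlier (again with entirely hypothetical numbers, not real Wikipedia data), let W1, W2 and W3 be articles that link freely to one another, and E an external reference cited from W1 with nofollow, so a crawler honouring the attribute drops that vote:

    # Effective link graph as seen by a crawler that honours rel='nofollow':
    # the internal article-to-article links remain, W1's citation of E is dropped.
    damping = 0.85
    links = {"W1": ["W2", "W3"], "W2": ["W1", "W3"], "W3": ["W1"], "E": []}

    rank = {page: 1.0 for page in links}
    for _ in range(50):
        rank = {
            page: (1 - damping) + damping * sum(
                rank[src] / len(out) for src, out in links.items() if page in out
            )
            for page in links
        }
    print({page: round(score, 2) for page, score in rank.items()})

The internal cluster keeps recycling rank among its own pages, while E, the page the article actually drew on, is left with nothing but the baseline score.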

Corroborations
Here are a few instances corroborating this authoritarian behaviour (knowing or unknowing) on Wikipedia's part and the viral effect it leads to.
  • Brion Vibber's (Wikimedia) email about turning rel='nofollow' back on at Wikipedia
  • A comparison of the third- and fourth-ranked search results for the search term Yahoo Pipes, with a much tauter description on ReadWriteWeb than in the corresponding Wikipedia entry.
  • Wikipedia's What links here section, which lists all internal articles linking to the current page and hence contributing to its page rank. This section invariably contains a bloated list, with many relatively unrelated pages linking to the current page out of context.

Conclusion
This article by no means should be treated as an attempt to bring ignominy on Wikipedia's name. Wikipedia is a great humanitarian project, and I personally refer to it regularly for all my doubts. The purpose of this article is just to bring out the fact that Google will need a renovation in the changing WWW scene, and that Wikipedia could have been a bit more responsible by disclosing the differential treatment of internal and external links to its editor-users and by cautioning them about using internal links. Another solution could be to allow users to enable/disable rel='nofollow' for all links, to differentiate the two classes (enabled/disabled) of links by an easily visible feature (say, different colors) on the article page, and to trust that spamdexing will be curbed by the same good faith that drives Wikipedia. But that's just a suggestion; being an outsider, I don't have even the faintest idea of how to manage the world's largest encyclopedia, the people at Wikimedia know a lot better, and I leave the solution to them. I also hope that resolving this is high on Wikipedia's priority list and that the Wikimedia people are being held back only by the technical difficulties of implementing an ingenious solution, like all their solutions in the past.