The Maths of Google

Did you know that every search on Google has a little maths behind it? Without the work of algorithms, there would be no Google, or Yahoo for that matter. Google runs on the PageRank algorithm.

You might imagine that this algorithm works on the principle of counting how many of the key words appear in the text of the webpage. The pages that would be ranked the highest would then be the pages with the highest number of occurrences of the key words. This used to be the way that these search engines worked back in the 90s; they used text based ranking systems to decide which pages should be ranked the highest, however there were many problems with this. For example, a search about a common word such as ‘Internet’ was problematic as the first pages shown could be a page written in Chinese with repeated use of the word ‘Internet’ but containing no other information about the subject desired. To illustrate this better, just image I was to write a web page with just the word ‘Maths’ repeated over and over again. This would be the highest ranked page, but would be of no use to the searcher.

The usefulness of a search engine depends on the relevance of the results it gives back; there are millions of web pages but some will be more relevant, popular or authoritative than others. Enter Larry Page and Sergey Brin. In 1998 they developed one of the most influential algorithms: the PageRank algorithm, which became the Google trademark catapulting them as the best and most widely used search engine. But how does it work?

Put simply, say we have webpage x which has a hyperlink to webpage y, this means that y is relevant and important to the topic on webpage x. If there are lots of pages that link to y, then page is considered important. If, for example, page z only has one link, but it comes from an authoritative website v (for example http://www.bbc.co.uk) then v transfers its authority to z, asserting that z is important.

Let me show you some diagrams to hopefully make this simpler. In this situation there are 4 pages of equal importance.

Firstly, a bit of terminology: this is called a graph, each ‘page’ is called a node and each ‘arrow’ is called an edge. So, node 1 has 3 outgoing edges, so it will pass a third of its importance to the 3 other nodes. Node 3 only has one outgoing edge, so will pass all of its importance to node 1. If we continue this process we can see how much importance is transferred to each node:

To find the PageRank of each note the algorithm uses this equation:

pagerank.jpg

In the second equation the ‘d’ represents the damping factor which can be set between 0 and 1 (I believe that Page and Brin have said that they use a factor of 0.85). This will form a set of simultaneous equations, which must be solved to find the PageRank for each webpage.

As there are billions of webpages on the internet, the Google search engine uses an iterative computation of PageRank values. This means that each page is assigned an initial starting value and the PageRanks of all pages are then calculated in several computation circles based on the equations determined by the PageRank algorithm shown above. According to publications of Lawrence Page and Sergey Brin, about 100 iterations are necessary to get a good approximation of the PageRank values of the whole web.

I love this as an example of how maths can be employed to create innovative and efficient solutions to problems! Let me know what you think! M x

Advertisements

11 comments

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s