Inventing Google: "The Anatomy of a Large-Scale Hypertextual Web Search Engine", Sergey Brin and Lawrence Page (founders) / 1998
Cluster Anatomy: "Web Search for a Planet: The Google Cluster Architecture", Luiz André Barroso, Jeffrey Dean and Urs Hoelzle / 2003
"Google's secret of success? Dealing with failure", Urs Hoelzle (Vice President of Engineering and Operations) / 2004
Programming for Google Cluster: "MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat (Google staff) / 2004
We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d ... Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
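The formula can be evaluated by repeated substitution. The following is a minimal sketch, not the authors' implementation: the function name `pagerank`, the four-page example graph, the starting value of 1.0 per page, and the fixed iteration count are all assumptions made for illustration. It applies the formula exactly as written, so on a graph where every page has at least one outgoing link the values sum to the number of pages; dividing each value by that number gives a probability distribution.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1-d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)).

    `links` maps each page to the list of pages it links to (hypothetical
    input format). Every page is assumed to have at least one outgoing link.
    """
    pages = list(links)
    ranks = {page: 1.0 for page in pages}  # arbitrary starting values
    for _ in range(iterations):
        new_ranks = {}
        for page in pages:
            # Sum PR(T)/C(T) over every page T that links to `page`.
            incoming = sum(
                ranks[src] / len(links[src])
                for src in pages
                if page in links[src]
            )
            new_ranks[page] = (1 - d) + d * incoming
        ranks = new_ranks
    return ranks


# Hypothetical example graph: C is pointed to by A, B and D.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
ranks = pagerank(links)
normalized = {page: value / len(links) for page, value in ranks.items()}
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4), round(normalized[page], 4))
```

In this example page D has no incoming links, so its value stays at the minimum (1-d), while C, which collects links from three pages, ends up ranked highest.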
PageRank can be thought of as a model of user behavior.
We assume there is a "random surfer" who is given a web page at random and keeps clicking on links, never hitting "back" but eventually gets bored and starts on another random page.
The probability that the random surfer visits a page is its PageRank.
A page has a high PageRank if:
there are many pages that point to it,
or there are some pages that point to it and themselves have a high PageRank.
Note that weight propagates recursively through the link structure of the web.
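Note that the PageRanks are meant to form a probability distribution over web pages, so the sum of all web pages’ PageRanks will be one; for the formula as written above, this requires dividing the (1-d) term by the total number of pages.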
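The damping factor d governs when the "random surfer" gets bored: in the formula above, the surfer follows a link from the current page with probability d, and gets bored and requests another random page with probability 1-d.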
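The random-surfer reading can be checked with a small simulation. This is a sketch under stated assumptions: the surfer follows a link with probability d and jumps to a uniformly random page with probability 1-d (the reading that matches the (1-d) term in the formula), and the function name `simulate_surfer`, the example graph, the step count, and the seed are hypothetical.

```python
import random


def simulate_surfer(links, d=0.85, steps=200_000, seed=0):
    """Estimate how often a random surfer visits each page.

    `links` maps each page to the list of pages it links to (hypothetical
    input format); every page is assumed to have at least one outgoing link.
    """
    rng = random.Random(seed)
    pages = list(links)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)  # the surfer starts on a random page
    for _ in range(steps):
        visits[current] += 1
        if rng.random() < d:
            # Keep clicking: follow one of the current page's links.
            current = rng.choice(links[current])
        else:
            # Bored: start again on a random page.
            current = rng.choice(pages)
    return {page: count / steps for page, count in visits.items()}


links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
frequencies = simulate_surfer(links)
for page, share in sorted(frequencies.items()):
    print(page, round(share, 3))
```

The visit frequencies sum to one, and a page with no incoming links (D here) is reached only through random jumps, so its share stays close to (1-d) divided by the number of pages.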
Google's MapReduce is a programming model and an associated implementation for processing and generating large data sets.
It automates the task of recovering a program in case of a failure.
It is critical to keeping the company's costs down.
MR in brief:
Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
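The division of labor between map and reduce can be illustrated with a single-process sketch. It is not Google's implementation or API: the driver `run_mapreduce`, the word-count functions, and the sample documents are hypothetical, and the real system distributes the same steps across many machines.

```python
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Toy in-memory driver: map, group by intermediate key, then reduce."""
    intermediate = defaultdict(list)
    for key, value in inputs:
        # The map function turns one input pair into intermediate pairs.
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)
    # The reduce function merges all values that share an intermediate key.
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}


def word_count_map(doc_name, text):
    for word in text.split():
        yield word.lower(), 1


def word_count_reduce(word, counts):
    return sum(counts)


documents = [
    ("doc1", "the quick fox"),
    ("doc2", "the lazy dog the end"),
]
counts = run_mapreduce(documents, word_count_map, word_count_reduce)
print(counts)
```

The user writes only `word_count_map` and `word_count_reduce`; grouping the intermediate values by key is the job of the framework, which the toy driver stands in for here.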
A term vector summarizes the most important words that occur in a document or a set of documents as a list of (word; frequency) pairs.
The map function determines the hostname of each input document and emits a (hostname; term vector) pair for it, so there are multiple entries per host; the reduce function then merges all term vectors associated with a particular host and throws away infrequent terms.
Map output: (hostname; term vector); Reduce output: (hostname; term vector)
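The term-vector-per-host example can be sketched in the same way. The names `term_vector_map`, `term_vector_reduce`, `run`, the threshold `MIN_FREQUENCY`, and the sample pages are assumptions for illustration; the source does not say how "infrequent" is defined, so a simple count cutoff is used here.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse

MIN_FREQUENCY = 2  # hypothetical cutoff for "infrequent" terms


def term_vector_map(url, text):
    # Emit one (hostname; term vector) pair per input document.
    hostname = urlparse(url).hostname
    yield hostname, Counter(text.lower().split())


def term_vector_reduce(hostname, term_vectors):
    # Add up the per-document vectors for one host, then drop rare terms.
    total = Counter()
    for vector in term_vectors:
        total.update(vector)
    return {word: freq for word, freq in total.items() if freq >= MIN_FREQUENCY}


def run(inputs):
    """Toy in-memory driver standing in for the MapReduce framework."""
    grouped = defaultdict(list)
    for url, text in inputs:
        for host, vector in term_vector_map(url, text):
            grouped[host].append(vector)
    return {host: term_vector_reduce(host, vectors) for host, vectors in grouped.items()}


pages = [
    ("http://example.com/a", "search engine search index"),
    ("http://example.com/b", "search cluster index"),
    ("http://example.org/x", "cluster cluster failure"),
]
host_vectors = run(pages)
print(host_vectors)
```

Both documents from example.com contribute to a single merged vector, and terms that occur fewer than `MIN_FREQUENCY` times on a host are discarded in the reduce step.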