
Google Cluster Innards


Anatomy of Google cluster & MapReduce programming ...

Published in: Technology, News & Politics
  • The price example...

    $287000 - 176 x 2GHz Xeon, 176Gb Ram, 7TB HDD

    The number of PCs and the RAM size do not seem right to me.

Google Cluster Innards

  1. Google Cluster Innards Martin Dvorak [email_address]
  2. Agenda <ul><li>Inventing Google The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page (founders) / 1998 </li></ul><ul><li>Cluster Anatomy Web Search for a Planet: The Google Cluster Architecture Luiz André Barroso, Jeffrey Dean and Urs Hoelzle / 2003 Google's secret of success? Dealing with failure Urs Hoelzle (Vice President of Engineering and Operations) / 2004 </li></ul><ul><li>Programming for Google Cluster MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat (Google staff) / 2004 </li></ul>
  3. Inventing Google
  4. Inventing Google <ul><li>Sergey & Larry - Ph.D. students at Stanford University </li></ul><ul><li>Prototype (1998) </li></ul><ul><ul><li> </li></ul></ul><ul><ul><li>24,000,000 pages (8,058,044,651 today) </li></ul></ul><ul><li>Google </li></ul><ul><ul><li>“We chose our system name, Google, because it is a common spelling of googol, or 10^100 and fits well with our goal of building very large-scale search engines.” </li></ul></ul><ul><li>PageRank </li></ul><ul><ul><li>An objective measure of its citation importance that corresponds well with people’s subjective idea of importance. </li></ul></ul>
  5. Inventing Google: Foundation <ul><li>PageRank*: </li></ul><ul><ul><li>We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d ... Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows: PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) </li></ul></ul>*) Named after Larry Page. [Figure: pages T1 … Tn, with outgoing-link counts C1 … Cn, pointing to page A]
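The formula on this slide can be evaluated by simple repeated substitution. The following is a minimal sketch, assuming a toy link graph in which every page has at least one outgoing link and no duplicate links; the function name `pagerank`, the `links` dictionary layout and the iteration count are illustrative choices, not anything taken from the paper.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over all pages T linking to A.

    links maps each page to the list of pages it links to (hypothetical layout).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}  # arbitrary starting values
    for _ in range(iterations):
        updated = {}
        for page in pages:
            # Each source T passes PR(T)/C(T) to every page it links to.
            incoming = sum(
                pr[source] / len(targets)
                for source, targets in links.items()
                if page in targets
            )
            updated[page] = (1 - d) + d * incoming
        pr = updated
    return pr


toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
```

In this toy graph page C collects links from both A and B, so it ends up with the highest value. With the formula exactly as written, the values add up to the number of pages rather than to one.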
  6. Inventing Google: Foundation <ul><li>PageRank formula, informally </li></ul><ul><ul><li>PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) </li></ul></ul><ul><ul><li>PageRank can be thought of as a model of user behavior. </li></ul></ul><ul><ul><li>We assume there is a &quot;random surfer&quot; who is given a web page at random and keeps clicking on links, never hitting &quot;back&quot; but eventually gets bored and starts on another random page. </li></ul></ul><ul><ul><li>The probability that the random surfer visits a page is its PageRank. </li></ul></ul><ul><ul><ul><li>A page has a high PR if… </li></ul></ul></ul><ul><ul><ul><ul><li>there are many pages that point to it </li></ul></ul></ul></ul><ul><ul><ul><ul><li>or if there are some pages that point to it and have a high PR </li></ul></ul></ul></ul><ul><ul><li>Note the recursive weight propagation through the web link structure. </li></ul></ul><ul><ul><li>Note that the PageRanks are meant to form a probability distribution over web pages, so that the sum of all web pages’ PageRanks is one (with the formula exactly as written above the sum equals the number of pages; dividing the (1-d) term by the page count normalizes it to one). </li></ul></ul><ul><ul><li>The damping factor d is the probability that, at each page, the &quot;random surfer&quot; keeps clicking on links; with probability 1-d the surfer gets bored and requests another random page. </li></ul></ul><ul><ul><ul><li>Personalization  </li></ul></ul></ul>
  7. Inventing Google: Foundation <ul><li>PageRank relevancy tuning </li></ul><ul><ul><li>Page title </li></ul></ul><ul><ul><li>Anchor text </li></ul></ul><ul><ul><li>Meta </li></ul></ul><ul><ul><li>Font </li></ul></ul><ul><ul><ul><li>Size </li></ul></ul></ul><ul><ul><ul><li>Weight </li></ul></ul></ul><ul><ul><li>Capitalization </li></ul></ul><ul><ul><li>… </li></ul></ul>
  8. Inventing Google: Anatomy
  9. Inventing Google: Anatomy <ul><li>URL Server </li></ul><ul><ul><li>Provides the list of URLs to be fetched to the crawlers </li></ul></ul><ul><li>Google Crawlers (GoogleBot) </li></ul><ul><ul><li>Multiple distributed crawlers </li></ul></ul><ul><ul><ul><li>Own DNS cache </li></ul></ul></ul><ul><ul><ul><li>300 connections open at once </li></ul></ul></ul><ul><ul><li>Send fetched pages to Store Server </li></ul></ul><ul><ul><li>Originally written in Python </li></ul></ul><ul><li>Store Server </li></ul><ul><ul><li>Compresses and stores files to repository. </li></ul></ul><ul><ul><li>DOCID is created for each page. </li></ul></ul><ul><li>Repository </li></ul><ul><ul><li>Stores fetched pages for further processing by Indexer </li></ul></ul>
  10. Inventing Google: Anatomy <ul><li>Indexer </li></ul><ul><ul><li>Reads pages from Repository (uncompresses them) </li></ul></ul><ul><ul><li>Parses each document (Flex on top of own stack): </li></ul></ul><ul><ul><ul><li>Page converted to set of Hits (position, font, capitalization, title/anchor/meta), 2 bytes per Hit </li></ul></ul></ul><ul><ul><ul><li>Added to Document Index </li></ul></ul></ul><ul><ul><li>Hits are distributed to Barrels (i.e. one document to multiple barrels) </li></ul></ul><ul><ul><li>Every link found in a page is stored in the Anchors file </li></ul></ul><ul><li>Forward and Inverted Barrels (2*64) </li></ul><ul><ul><li>Forward Index </li></ul></ul><ul><ul><ul><li>Barrel keeps a range of Hits sorted by DOCIDs </li></ul></ul></ul><ul><ul><ul><li>(DOCID, (WORDID, word’s Hit reference+)+) </li></ul></ul></ul><ul><ul><li>Processed by Sorter: </li></ul></ul><ul><ul><ul><li>Generates inverted index from forward index – sorts Hits by WORDIDs </li></ul></ul></ul><ul><ul><ul><li>Creates (WORDID, offsets) used by Lexicon </li></ul></ul></ul><ul><ul><li>Inverted Index (short/full) </li></ul></ul><ul><ul><ul><li>(WORDID, (DOCID reference, Hit list reference)+) </li></ul></ul></ul><ul><ul><ul><li>Short: contains just quality Hits (word in title, anchor, ...); optimal for single-word search </li></ul></ul></ul><ul><ul><ul><li>Full: DOCIDs sorted by DOCID; optimal for merging Hit lists, i.e. multi-word search </li></ul></ul></ul><ul><li>Anchors file </li></ul><ul><ul><li>Anchor (from, to, text) </li></ul></ul><ul><li>URL Resolver </li></ul><ul><ul><li>Reads anchors file: </li></ul></ul><ul><ul><ul><li>Relative-to-absolute URL conversion + DOCID assignment </li></ul></ul></ul><ul><ul><ul><li>Creates links file </li></ul></ul></ul><ul><li>Links file </li></ul><ul><ul><li>(url, target: DOCID) </li></ul></ul>
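The Sorter step on this slide (turning a forward index keyed by DOCID into an inverted index keyed by WORDID) can be sketched in a few lines. This is only an illustration under simplifying assumptions: plain in-memory dictionaries stand in for the barrels, Hits are reduced to word positions, and the names `invert` and `forward_index` are hypothetical.

```python
from collections import defaultdict


def invert(forward_index):
    """Turn {DOCID: {WORDID: [hits]}} into {WORDID: [(DOCID, [hits]), ...]}.

    Posting lists come out ordered by DOCID, which is what makes
    merging several lists for a multi-word query cheap.
    """
    inverted = defaultdict(list)
    for doc_id in sorted(forward_index):
        for word_id, hits in forward_index[doc_id].items():
            inverted[word_id].append((doc_id, hits))
    # Order the result by WORDID, as the Sorter does with the Hits.
    return dict(sorted(inverted.items()))


forward_index = {
    2: {10: [0], 11: [3]},   # document 2 contains words 10 and 11
    1: {10: [5, 7]},         # document 1 contains word 10 twice
}
inverted_index = invert(forward_index)
```

Here word 10 ends up with a posting list covering documents 1 and 2, in that order.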
  11. Inventing Google: Anatomy <ul><li>Searcher uses… </li></ul><ul><ul><li>Lexicon </li></ul></ul><ul><ul><ul><li>Keeps a map saying which Barrel to use. </li></ul></ul></ul><ul><ul><ul><li>Originally kept in memory (256MB). </li></ul></ul></ul><ul><ul><ul><ul><li>IMHO something like a multi-level VM page table must be used now </li></ul></ul></ul></ul><ul><ul><ul><ul><li>It is/was of fixed size (14,000,000 words) </li></ul></ul></ul></ul><ul><ul><li>Barrels </li></ul></ul><ul><ul><ul><li>Each barrel keeps a range of WORDIDs </li></ul></ul></ul><ul><ul><ul><li>WORDID-to-DOCID map </li></ul></ul></ul><ul><ul><li>PageRank pool </li></ul></ul><ul><ul><ul><li>Keeps the computed PageRank for each DOCID </li></ul></ul></ul><ul><ul><li>Doc Index </li></ul></ul><ul><ul><ul><li>DOCID ordered information about each document </li></ul></ul></ul><ul><ul><ul><ul><li>(DOCID, status, repository pointer, checksum, stat, URL, title) </li></ul></ul></ul></ul>
  12. Cluster Innards
  13. Cluster Innards: Global Google <ul><li>Over 30 Google clusters around the world. </li></ul><ul><ul><li>DNS based & geo location driven load-balancing : </li></ul></ul><ul><ul><ul><li>Domain Name: GOOGLE.COM Registrar: ALLDOMAINS.COM INC. Whois Server: Referral URL: Name Server: NS2.GOOGLE.COM Name Server: NS1.GOOGLE.COM Name Server: NS3.GOOGLE.COM Name Server: NS4.GOOGLE.COM Status: REGISTRAR-LOCK Updated Date: 03-oct-2002 Creation Date: 15-sep-1997 Expiration Date: 14-sep-2011 </li></ul></ul></ul><ul><ul><ul><li>2005, May 7: Google DNS hack speculations </li></ul></ul></ul><ul><li>Total PCs </li></ul><ul><ul><ul><li>> 5,000 in 2000 </li></ul></ul></ul><ul><ul><ul><li>> 15,000 in 2003 </li></ul></ul></ul><ul><ul><ul><li>>79,000* in 2004 </li></ul></ul></ul>*) I’m not sure about this number, it was taken from an external resource.
  14. Cluster Innards: HW <ul><li>Basic cluster design insights </li></ul><ul><ul><li>Reliability in SW rather than server-class HW. </li></ul></ul><ul><ul><ul><li>Commodity PCs used to build a high-end computing cluster at low-end prices. </li></ul></ul></ul><ul><ul><ul><li>Example: </li></ul></ul></ul><ul><ul><ul><ul><li>$287,000 – 176x 2GHz Xeon, 176GB RAM, 7TB HDD </li></ul></ul></ul></ul><ul><ul><ul><ul><li>$758,000 – 8x 2GHz Xeon, 64GB RAM, 8TB HDD </li></ul></ul></ul></ul><ul><ul><li>Design is tailored for best aggregate request throughput, not peak server response time – individual request parallelization. </li></ul></ul><ul><li>Google has inexpensively built out its computing infrastructure by using thousands of &quot;commodity&quot; servers </li></ul><ul><ul><li>Fewer than 2,000 servers in a single cluster. </li></ul></ul><ul><ul><li>Dual-processor x86 servers (starting at 533MHz Celeron) with 2-4 GB of memory per machine, 1+ 80GB IDE drive. </li></ul></ul><ul><ul><li>Rack: 40-80 x86-based servers. </li></ul></ul>
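The price example is easier to read as unit costs. The sketch below only divides the figures quoted on this slide; the labels "commodity rack" and "high-end server" and the helper name `unit_costs` are made up for illustration.

```python
# Figures exactly as quoted on the slide; the two labels are illustrative.
CONFIGS = {
    "commodity rack": {"price_usd": 287_000, "cpus": 176, "ram_gb": 176, "disk_tb": 7},
    "high-end server": {"price_usd": 758_000, "cpus": 8, "ram_gb": 64, "disk_tb": 8},
}


def unit_costs(config):
    """Return the price per CPU, per GB of RAM and per TB of disk."""
    price = config["price_usd"]
    return {
        "usd_per_cpu": price / config["cpus"],
        "usd_per_gb_ram": price / config["ram_gb"],
        "usd_per_tb_disk": price / config["disk_tb"],
    }


if __name__ == "__main__":
    for name, config in CONFIGS.items():
        costs = unit_costs(config)
        print(name, {key: round(value) for key, value in costs.items()})
```

On these numbers the commodity configuration costs roughly $1,600 per CPU against roughly $95,000 per CPU for the high-end one, which is the point of the slide.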
  15. Cluster Innards: HW <ul><li>Optimistically, a consumer PC might crash once in three years from a software glitch or hardware problem. </li></ul><ul><ul><li>&quot;At Google scale...if you have thousands of PCs, you can expect one (failure) a day ,…&quot; </li></ul></ul><ul><li>1,000,000s not 1,000,000,000s of dollars. </li></ul><ul><ul><li>“ The trick is to make these racks of hardware work together and to ensure that the failure of one machine doesn't derail an operation.” </li></ul></ul><ul><li>Switched Ethernet </li></ul><ul><ul><li>Commodity networking hardware is used - typically either 100 megabits/second or 1 gigabit/second at the machine level, but averaging considerably less in overall bisection bandwidth. </li></ul></ul><ul><ul><li>Locality optimizations (GFS) </li></ul></ul>
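The "one failure a day" claim follows from simple arithmetic. A rough sketch, assuming failures are independent and spread evenly over time (a simplification that is not stated on the slide); the function name is hypothetical.

```python
def expected_failures_per_day(machines, years_between_failures=3.0):
    """Average daily failures if each machine fails once per given period."""
    return machines / (years_between_failures * 365)


# A machine that fails once in three years fails once in 1,095 days,
# so about a thousand such machines yield roughly one failure per day.
per_day_small = expected_failures_per_day(1_000)
per_day_large = expected_failures_per_day(15_000)
```

For 1,000 machines this gives a little under one failure per day; for the 15,000 machines mentioned for 2003 it gives more than a dozen per day.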
  16. Cluster Innards: SW <ul><li>Stripped-down version of Linux, which is based on the Red Hat distribution but is really just the operating system kernel modified for Google. </li></ul><ul><li>Google File System is optimized for handling large blocks of data. </li></ul><ul><ul><li>64MB block </li></ul></ul><ul><ul><li>The file system was designed to assume that a failure, such as a failed disk or unplugged network cable, can happen at any time. </li></ul></ul><ul><ul><li>Data is replicated in three places, and there is a &quot;master&quot; machine that can locate copies of a piece of data, such as a keyword index, if the original is out of commission. </li></ul></ul><ul><li>Google has created &quot;batch&quot; job scheduling software called the Global Work Queue, which acts as a sort of taskmaster for millions of operations. </li></ul><ul><li>Another important engineering feat by Google is to make writing programs that run across thousands of servers very straightforward… </li></ul>
  17. Programming for Cluster
  18. Programming For Cluster <ul><li>Google's MapReduce is a programming model and an associated implementation for processing and generating large data sets. </li></ul><ul><ul><li>Automates the task of recovering a program in case of a failure. </li></ul></ul><ul><ul><li>It is critical to keeping the company's costs down. </li></ul></ul><ul><li>MR in brief: </li></ul><ul><ul><li>Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. </li></ul></ul><ul><li>Features </li></ul><ul><ul><li>Functional style programming. </li></ul></ul><ul><ul><li>Automatic parallelization. </li></ul></ul>
  19. Programming For Cluster <ul><li>Map Reduce runtime… </li></ul><ul><ul><li>takes care of the details of partitioning the input data </li></ul></ul><ul><ul><li>scheduling the program's execution across a set of machines, handling machine failures </li></ul></ul><ul><ul><li>managing the required inter-machine communication. </li></ul></ul><ul><ul><li>… and more. </li></ul></ul><ul><li>MR hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. </li></ul><ul><ul><li>Therefore even programmers without any experience with parallel and distributed systems can easily utilize the resources of a large distributed system. </li></ul></ul><ul><li>Numbers … </li></ul><ul><ul><li>TB of data processed (>20) </li></ul></ul><ul><ul><li>On 1,000s of machines </li></ul></ul><ul><ul><li>100s of MR programs in place </li></ul></ul>
  20. Programming For Cluster <ul><li>Special purpose computations examples: </li></ul><ul><ul><li>Inputs: </li></ul></ul><ul><ul><ul><li>crawled documents </li></ul></ul></ul><ul><ul><ul><li>web request logs </li></ul></ul></ul><ul><ul><ul><li>… </li></ul></ul></ul><ul><ul><li>Outputs: </li></ul></ul><ul><ul><ul><li>inverted indices </li></ul></ul></ul><ul><ul><ul><li>various representations of the graph structure of web documents </li></ul></ul></ul><ul><ul><ul><li>summaries of the number of pages crawled per host (dumper) </li></ul></ul></ul><ul><ul><ul><li>the set of most frequent queries in a given day, etc. </li></ul></ul></ul>
  21. Programming For Cluster <ul><li>LISP roots </li></ul><ul><ul><li>Recall the map and reduce primitives: </li></ul></ul><ul><ul><ul><li>map (func, list, ...) </li></ul></ul></ul><ul><ul><ul><ul><li>Creates a new list from the results of applying func to each element of each list. There must be one list per argument to the function. </li></ul></ul></ul></ul><ul><ul><ul><ul><li>map (lambda x, y: x+y, [1,2],[3,4]) --> [4,6] </li></ul></ul></ul></ul><ul><ul><ul><ul><li>map (None, [1,2],[3,4]) --> [[1,3],[2,4]] </li></ul></ul></ul></ul><ul><ul><ul><li>reduce (func, list {,init}) </li></ul></ul></ul><ul><ul><ul><ul><li>Applies func to each pair of items in turn. The results are accumulated. </li></ul></ul></ul></ul><ul><ul><ul><ul><li>reduce (lambda x, y: x+y, [1,2,3,4],5) --> 15 </li></ul></ul></ul></ul><ul><ul><ul><ul><li>reduce (lambda x, y: x&y, [1,0,1]) --> 0 </li></ul></ul></ul></ul><ul><ul><ul><ul><li>reduce (None, [], 1) --> 1 </li></ul></ul></ul></ul><ul><li>Programming model </li></ul><ul><ul><li>Key/values to key/values </li></ul></ul><ul><ul><li>Map & reduce functions written by the user are linked with the MR library. </li></ul></ul><ul><ul><ul><li>map (k1,v1) → list(k2,v2) </li></ul></ul></ul><ul><ul><ul><li>reduce (k2,list(v2)) → list(v2) </li></ul></ul></ul><ul><ul><li>Input/output file, tuning parameters, … </li></ul></ul>
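The map and reduce snippets on this slide use Python 2 syntax. As a sketch, the same examples in current Python 3 look as follows, where `map` returns an iterator, `reduce` lives in `functools`, and `zip` takes over the role of `map(None, ...)`; the variable names are illustrative.

```python
from functools import reduce
from operator import add, and_

# map applies a function element-wise across the given lists.
pairwise_sums = list(map(add, [1, 2], [3, 4]))

# Python 3 no longer accepts None as the function; zip pairs the items instead.
paired = [list(pair) for pair in zip([1, 2], [3, 4])]

# reduce folds the list into one value, starting from the optional initial value.
total_with_init = reduce(add, [1, 2, 3, 4], 5)
all_bits_set = reduce(and_, [1, 0, 1])

print(pairwise_sums, paired, total_with_init, all_bits_set)
```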
  22. Programming For Cluster <ul><li>Example (pseudocode): </li></ul><ul><ul><li>map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
          EmitIntermediate(w, &quot;1&quot;); </li></ul></ul><ul><ul><li>reduce(String key, Iterator values):
        // key: a word
        // values: a list of counts
        int result = 0;
        for each v in values:
          result += ParseInt(v);
        Emit(AsString(result)); </li></ul></ul>
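The pseudocode above can be tried out with a tiny single-process stand-in for the runtime. This is only a sketch: `run_mapreduce` is a hypothetical helper that groups intermediate pairs by key in memory, whereas the real library partitions, distributes and re-executes work across machines. The user-supplied `map_fn` and `reduce_fn` mirror the slide, with generators replacing EmitIntermediate and Emit.

```python
from collections import defaultdict


def map_fn(key, value):
    # key: document name, value: document contents
    for word in value.split():
        yield word, "1"


def reduce_fn(key, values):
    # key: a word, values: a list of counts (as strings, like in the slide)
    yield str(sum(int(v) for v in values))


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical single-process driver: map, group by key, then reduce."""
    intermediate = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)
    return {
        out_key: list(reduce_fn(out_key, values))
        for out_key, values in sorted(intermediate.items())
    }


documents = [("doc1", "the quick the"), ("doc2", "quick fox")]
word_counts = run_mapreduce(documents, map_fn, reduce_fn)
```

Keeping the grouping step inside the driver is what lets the two user functions stay free of any distribution logic.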
  23. Programming For Cluster <ul><li>The example solves the problem of counting the number of occurrences of each word in a large collection of documents. </li></ul><ul><li>More examples: </li></ul><ul><ul><li>Distributed Grep </li></ul></ul><ul><ul><ul><li>Map: (URL, true); Reduce: id </li></ul></ul></ul><ul><ul><li>Count of URL Access Frequency </li></ul></ul><ul><ul><ul><li>Access page log is used as input for map. </li></ul></ul></ul><ul><ul><ul><li>Map: (URL,1); Reduce: (URL; total count) </li></ul></ul></ul><ul><ul><li>Reverse Web-Link Graph </li></ul></ul><ul><ul><ul><li>Link database is processed by map. </li></ul></ul></ul><ul><ul><ul><li>Map: (target; source); Reduce: (target; list(source)) </li></ul></ul></ul><ul><ul><li>Term-Vector per Host </li></ul></ul><ul><ul><ul><li>A term vector summarizes the most important words that occur in a document or a set of documents as a list of (word; frequency) pairs. </li></ul></ul></ul><ul><ul><ul><li>Map determines the hostname for each document and creates a term vector (so there are multiple entries per host); the reduce function then merges all entries associated with a particular host and throws away infrequent terms. </li></ul></ul></ul><ul><ul><ul><li>Map: (hostname; term vector); Reduce: (hostname; term vector) </li></ul></ul></ul>
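As one worked instance of these examples, the Reverse Web-Link Graph can be sketched as below. The link database is assumed to be a plain dictionary from source URL to target URLs, the grouping is done in memory instead of by the MR library, and all names (`map_links`, `reduce_links`, `reverse_link_graph`) are illustrative.

```python
from collections import defaultdict


def map_links(source, targets):
    # Map: emit (target; source) for every link found on the source page.
    for target in targets:
        yield target, source


def reduce_links(target, sources):
    # Reduce: collect all sources pointing at one target.
    return sorted(sources)


def reverse_link_graph(link_db):
    """In-memory stand-in for the map, group-by-key and reduce phases."""
    grouped = defaultdict(list)
    for source, targets in link_db.items():
        for target, src in map_links(source, targets):
            grouped[target].append(src)
    return {target: reduce_links(target, srcs) for target, srcs in grouped.items()}


link_db = {"a.html": ["b.html", "c.html"], "b.html": ["c.html"]}
backlinks = reverse_link_graph(link_db)
```

The result lists, for each target page, which pages link to it, which is the input PageRank-style computations need.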
  24. Programming For Cluster
  25. Programming For Cluster <ul><li>Also… </li></ul><ul><ul><li>Multiple tasks performed by a single worker ( load-balancing ) </li></ul></ul><ul><ul><li>Master </li></ul></ul><ul><ul><ul><li>idle, in-progress, completed (workers) </li></ul></ul></ul><ul><ul><ul><li>worker failure (re-execution) </li></ul></ul></ul><ul><ul><ul><li>master failure (rare, restart) </li></ul></ul></ul><ul><ul><li>MR locality optimization </li></ul></ul><ul><ul><ul><li>Save network bandwidth </li></ul></ul></ul><ul><ul><ul><li>GFS 64MB block replication </li></ul></ul></ul><ul><ul><ul><li>MR scheduler takes the location of GFS replicas into account </li></ul></ul></ul><ul><ul><li>Stragglers </li></ul></ul><ul><ul><ul><li>Overload, HW problems, etc. </li></ul></ul></ul><ul><ul><ul><li>Backup task – fork twin execution for straggler </li></ul></ul></ul><ul><ul><li>Task granularity </li></ul></ul><ul><ul><ul><li>Number of workers is driven by the number of map (M) and reduce (R) tasks </li></ul></ul></ul><ul><ul><ul><li>For example: M=200,000; R=5,000 using 2,000 workers </li></ul></ul></ul>
  26. Putting Things Together
  27. I’m Feeling Lucky <ul><li>Pre-phase: </li></ul><ul><ul><li>Browser requests e.g. </li></ul></ul><ul><ul><li>DNS-based load-balancing selects cluster according to the geographical location of the user & actual cluster utilization </li></ul></ul><ul><ul><li>The rest of the evaluation is entirely local to that cluster </li></ul></ul><ul><li>Phase 1: Index servers... </li></ul><ul><ul><li>Parse the query </li></ul></ul><ul><ul><ul><li>Perform spell-check and fork Ad task </li></ul></ul></ul><ul><ul><ul><li>Convert words into WORDIDs </li></ul></ul></ul><ul><ul><li>Choose inverted Barrel(s) using Lexicon </li></ul></ul><ul><ul><ul><li>Barrel index is formed by a number of servers whose data are randomly distributed and replicated (full index/index shards) so search is highly parallelizable </li></ul></ul></ul><ul><ul><li>Inverted barrel maps each query word to a matching list of documents (Hit list) </li></ul></ul><ul><ul><ul><li>Seek to the start of the document list in the short barrel for every word (multiple tasks) </li></ul></ul></ul><ul><ul><ul><li>Scan through the document lists until there is a document that matches all search terms </li></ul></ul></ul><ul><ul><ul><li>If we are in the short barrels and at the end of any document list, seek to the start of the document list in the full barrel for every word and continue scanning </li></ul></ul></ul><ul><ul><ul><li>If we are not at the end of any document list, continue scanning </li></ul></ul></ul><ul><ul><li>Sort the DOCIDs that have matched </li></ul></ul><ul><li>Phase 2: Document servers... </li></ul><ul><ul><li>For each DOCID compute the actual title, URL and query-specific document summary (matched words context). </li></ul></ul><ul><ul><li>Document servers are used to dispatch this completion – also documents are randomly distributed and replicated, so the completion is highly parallelizable </li></ul></ul>
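The Phase 1 lookup can be sketched as a merge of DOCID lists. The code below is a simplification under stated assumptions: the lexicon and barrels are plain dictionaries, every document list is sorted by DOCID, and the search falls back from the short barrel to the full barrel only when the short barrel yields no match at all. The names `intersect_sorted` and `search` are hypothetical.

```python
def intersect_sorted(doc_lists):
    """Return the DOCIDs present in every list; each list is sorted ascending."""
    if not doc_lists:
        return []
    result = doc_lists[0]
    for other in doc_lists[1:]:
        merged, i, j = [], 0, 0
        # Advance two cursors in step, keeping only DOCIDs found in both lists.
        while i < len(result) and j < len(other):
            if result[i] == other[j]:
                merged.append(result[i])
                i += 1
                j += 1
            elif result[i] < other[j]:
                i += 1
            else:
                j += 1
        result = merged
    return result


def search(query_words, lexicon, short_barrel, full_barrel):
    """Convert words to WORDIDs, try the short barrel, then the full one."""
    if any(word not in lexicon for word in query_words):
        return []  # an unknown word cannot match any document
    word_ids = [lexicon[word] for word in query_words]
    matches = intersect_sorted([short_barrel.get(w, []) for w in word_ids])
    if not matches:
        matches = intersect_sorted([full_barrel.get(w, []) for w in word_ids])
    return matches


lexicon = {"google": 1, "cluster": 2}
short_barrel = {1: [3, 8], 2: [5]}
full_barrel = {1: [3, 5, 8, 9], 2: [5, 9, 12]}
both_words = search(["google", "cluster"], lexicon, short_barrel, full_barrel)
one_word = search(["google"], lexicon, short_barrel, full_barrel)
```

Because the lists are kept in DOCID order, each merge is a single linear pass, which is why the slides describe the full barrel as optimal for multi-word search.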
  28. Bonus
  29. Stanford lab (around 1996)
  30. The Original Google Storage: 10x4GB (1996)
  31. Google San Francisco (2004)
  32. A cluster of coolness Google History
  33. Google Results Page Per Day
  34. References <ul><li>Sergey Brin, Lawrence Page; The Anatomy of a Large-Scale Hypertextual Web Search Engine; 1998 </li></ul><ul><li>Luiz André Barroso, Jeffrey Dean and Urs Hoelzle: Web Search for a Planet: The Google Cluster Architecture ; 2003 </li></ul><ul><li>Urs Hoelzle: Google's secret of success? Dealing with failure ; 2004 </li></ul><ul><li>Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters; 2004 </li></ul><ul><li>Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung: The Google File System ; 2003 </li></ul><ul><li> </li></ul><ul><ul><li>GoogleBot bot.html </li></ul></ul><ul><li> </li></ul><ul><li>http://www. </li></ul><ul><li> </li></ul><ul><li>Uniquely Google ™ </li></ul><ul><li>Stanford Gadgets </li></ul><ul><li>Google hacked? </li></ul>