Cloud infrastructure. Google File System and MapReduce - Andrii Vozniuk


My presentation for the Cloud Data Management course at EPFL by Anastasia Ailamaki and Christoph Koch.

It is mainly based on the following two papers:
1) S. Ghemawat, H. Gobioff, S. Leung. The Google File System. SOSP, 2003
2) J. Dean, S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI, 2004


  1. Cloud Infrastructure: GFS & MapReduce
     Andrii Vozniuk
     Uses slides by Jeff Dean and Ed Austin
     Data Management in the Cloud, EPFL, February 27, 2012
  2. Outline
     • Motivation
     • Problem Statement
     • Storage: Google File System (GFS)
     • Processing: MapReduce
     • Benchmarks
     • Conclusions
  3. Motivation
     • Huge amounts of data to store and process
     • Example @2004:
       – 20+ billion web pages × 20 KB/page = 400+ TB
       – Reading from one disk at 30–35 MB/s:
         • four months just to read the web
         • 1000 hard drives just to store the web
       – Even more complicated if we want to process the data
     • Growth is exponential, so the solution must be scalable
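The slide's back-of-envelope numbers check out; a minimal sketch, using the slide's own estimates (they are rough figures, not measurements):

```python
# Back-of-envelope check of the slide's 2004 numbers.
pages = 20e9          # 20+ billion web pages
page_size = 20e3      # 20 KB per page
disk_rate = 35e6      # ~30-35 MB/s sequential read from one disk

total_bytes = pages * page_size                    # 4e14 bytes = 400 TB
months = total_bytes / disk_rate / (60 * 60 * 24 * 30)

print(f"{total_bytes / 1e12:.0f} TB")              # 400 TB
print(f"{months:.1f} months to read on one disk")  # ~4.4 months
```

At the slower 30 MB/s end of the range the single-disk read time stretches past five months, which is why the slide rounds to "four months".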
  4. Motivation
     • Buy super-fast, ultra-reliable hardware?
       – Ultra expensive
       – Controlled by a third party
       – Internals can be hidden and proprietary
       – Hard to predict scalability
       – Fails less often, but still fails!
       – No suitable solution on the market
  5. Motivation
     • Use commodity hardware? Benefits:
       – Commodity machines offer much better performance per dollar
       – Full control over, and understanding of, the internals
       – Can be highly optimized for their workloads
       – Really smart people can do really smart things
     • Not that easy:
       – Fault tolerance: something breaks all the time
       – Application development
       – Debugging, optimization, locality
       – Communication and coordination
       – Status reporting, monitoring
     • All of these issues must be handled for every problem you want to solve
  6. Problem Statement
     • Develop a scalable distributed file system for large data-intensive applications running on inexpensive commodity hardware
     • Develop a tool for processing large data sets in parallel on inexpensive commodity hardware
     • Develop both in coordination for optimal performance
  7. Google Cluster Environment
     (figure: servers → racks → clusters → datacenters → cloud)
  8. Google Cluster Environment
     • @2009:
       – 200+ clusters
       – 1000+ machines in many of them
       – 4+ PB file systems
       – 40 GB/s read/write load
       – Frequent hardware failures
       – 100s to 1000s of active jobs (1 to 1000 tasks each)
     • A cluster is 1000s of machines
       – Stuff breaks: with 1000 machines, about one failure per day; with 10,000, about ten per day
       – How to store data reliably with high throughput?
       – How to make it easy to develop distributed applications?
  9. Google Technology Stack
  10. Google Technology Stack (focus of this talk)
  11. GFS: Google File System*
     • Inexpensive commodity hardware
     • High throughput valued over low latency
     • Large files (multiple GB)
     • Multiple clients
     • Workload:
       – Large streaming reads
       – Small random writes
       – Concurrent appends to the same file
     * S. Ghemawat, H. Gobioff, S. Leung. The Google File System. SOSP, 2003
  12. GFS Architecture
     • User-level process running on commodity Linux machines
     • Consists of a Master Server and Chunk Servers
     • Files are broken into chunks (typically 64 MB), stored with 3× redundancy (across clusters and datacenters)
     • Data transfers happen directly between clients and Chunk Servers
  13. GFS Master Node
     • Centralized for simplicity
     • Namespace and metadata management
     • Manages chunks:
       – Where they are (file → chunks → replicas)
       – Where to put new ones
       – When to re-replicate (failure, load balancing)
       – When and what to delete (garbage collection)
     • Fault tolerance:
       – Shadow masters
       – Monitoring infrastructure outside of GFS
       – Periodic snapshots
       – Mirrored operations log
  14. GFS Master Node
     • Metadata is kept in memory, so it's fast!
       – A 64 MB chunk needs less than 64 B of metadata, so 640 TB of data needs less than 640 MB
     • Polls the Chunk Servers when:
       – the Master starts
       – a Chunk Server joins the cluster
     • Operation log:
       – Used for serialization of concurrent operations
       – Replicated
       – The client gets a response only after the log is flushed both locally and remotely
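The slide's memory estimate is easy to verify; a quick sketch using the slide's own upper bound of 64 B of metadata per chunk:

```python
# Checking the master-memory estimate: < 64 B of metadata per 64 MB
# chunk means 640 TB of file data fits in < 640 MB of master RAM.
CHUNK_SIZE = 64 * 2**20       # 64 MB per chunk
META_PER_CHUNK = 64           # upper bound from the slide, in bytes

data_size = 640 * 2**40       # 640 TB of file data
num_chunks = data_size // CHUNK_SIZE
metadata = num_chunks * META_PER_CHUNK

print(num_chunks)             # 10485760 chunks
print(metadata / 2**20, "MB") # 640.0 MB
```

Roughly ten million chunks of metadata is why a single in-memory master was feasible at this scale.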
  15. GFS Chunk Servers
     • 64 MB chunks stored as Linux files
       – Reduces the size of the Master's data structures
       – Reduces client–Master interaction
       – Internal fragmentation → allocate space lazily
       – Possible hotspots → re-replicate
     • Fault tolerance:
       – Heartbeat messages to the Master
       – If something is wrong, the Master initiates re-replication
  16. GFS Mutation Order
     (figure: the client asks the Master for the current lease holder (the Primary) and the locations of the replicas, and caches them; the client pushes the data to all replicas; the Primary assigns serial numbers to the mutations, applies them, and forwards the write request to the Secondaries; "operation completed" or an error report propagates back to the client)
  17. GFS Control & Data Flow
     • Control flow and data flow are decoupled
     • Control flow:
       – Master → Primary → Secondaries
     • Data flow:
       – Carefully picked chain of Chunk Servers
         • Forward to the closest server first
         • Distance estimated from IP addresses
       – Fully utilizes each machine's outbound bandwidth
       – Pipelining exploits full-duplex links
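The GFS paper quantifies why this pipelined chain wins: pushing B bytes through R replicas takes about B/T + R·L, versus R·(B/T) for naive store-and-forward. A small sketch of that estimate (the link numbers below are illustrative assumptions, not figures from the talk):

```python
# Ideal pipelined transfer time vs. store-and-forward, per the GFS
# paper's back-of-envelope analysis: B/T + R*L vs. R * (B/T).
def pipelined_time(b, throughput, replicas, latency):
    """Ideal elapsed time in seconds with full pipelining."""
    return b / throughput + replicas * latency

B = 2**20            # 1 MB payload
T = 12.5e6           # assumed 100 Mbps link -> 12.5 MB/s
R = 3                # replication factor
L = 1e-3             # assumed ~1 ms per-hop latency

print(f"pipelined:         {pipelined_time(B, T, R, L) * 1000:.0f} ms")  # ~87 ms
print(f"store-and-forward: {R * B / T * 1000:.0f} ms")                   # ~252 ms
```

Because each Chunk Server forwards bytes downstream as soon as they arrive, the per-hop cost is only latency, not a full retransmission of the payload.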
  18. GFS: Other Important Things
     • Snapshot operation: make a copy very fast
     • Smart chunk-creation policy:
       – Prefer servers with below-average disk utilization; limit the number of recent creations per server
     • Smart re-replication policy:
       – Under-replicated chunks first
       – Chunks that are blocking clients
       – Live files first (rather than deleted ones)
     • Rebalance and garbage-collect periodically
     How do we process the data stored in GFS?
  19. MapReduce*
     • A simple programming model applicable to many large-scale computing problems
     • Divide-and-conquer strategy
     • Messy details are hidden in the MapReduce runtime library:
       – Automatic parallelization
       – Load balancing
       – Network and disk transfer optimizations
       – Fault tolerance
       – Part of the stack: improvements to the core library benefit all of its users
     * J. Dean, S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI, 2004
  20. Typical Problem
     • Read a lot of data and break it into parts
     • Map: extract something important from each part
     • Shuffle and sort
     • Reduce: aggregate, summarize, filter, or transform the Map results
     • Write the results
     • Jobs can be chained or cascaded
     • Implement Map and Reduce to fit the problem
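The pipeline above can be sketched in a few lines of single-machine Python; this is illustrative only (the real runtime distributes each phase across thousands of workers):

```python
from collections import defaultdict
from itertools import chain

def map_reduce(inputs, mapper, reducer):
    """Minimal in-memory sketch of map -> shuffle/sort -> reduce."""
    # Map: each input record yields (key, value) pairs.
    pairs = chain.from_iterable(mapper(x) for x in inputs)
    # Shuffle and sort: group values by key, keys in sorted order.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    # Reduce: aggregate each key's values.
    return {k: reducer(k, vs) for k, vs in sorted(groups.items())}

# Usage: word count, the canonical example.
docs = ["to be or not to be"]
counts = map_reduce(
    docs,
    mapper=lambda doc: [(w, 1) for w in doc.split()],
    reducer=lambda word, ones: sum(ones),
)
print(counts)  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

The user writes only the two lambdas; everything else (grouping, ordering, and in the real system partitioning, scheduling, and fault tolerance) belongs to the runtime.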
  21. Nice Example
  22. Other Suitable Examples
     • Distributed grep
     • Distributed sort (Hadoop MapReduce won TeraSort)
     • Term vector per host
     • Document clustering
     • Machine learning
     • Web access log statistics
     • Web link-graph reversal
     • Inverted index construction
     • Statistical machine translation
  23. Model
     • The programmer specifies two primary methods:
       – Map(k, v) → <k', v'>*
       – Reduce(k', <v'>*) → <k', v'>*
     • All v' with the same k' are reduced together, in order
     • Usually the programmer also specifies how to partition k':
       – Partition(k', total partitions) → partition for k'
         • Often a simple hash of the key
         • Allows reduce operations for different k' to be parallelized
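A sketch of that partition step: a deterministic hash of the intermediate key modulo the number of reduce tasks, so every value for a given key lands in the same reduce partition (`crc32` is a stand-in choice here; the model only requires some stable hash):

```python
import zlib

def partition(key: str, num_partitions: int) -> int:
    """Map an intermediate key to one of num_partitions reduce tasks."""
    # Deterministic: the same key always goes to the same partition.
    return zlib.crc32(key.encode()) % num_partitions

R = 4  # number of reduce tasks
words = ["apple", "banana", "cherry", "apple"]
print([partition(w, R) for w in words])  # both "apple"s share an index
```

Determinism is what lets different reduce tasks run in parallel with no coordination: each one owns a disjoint slice of the key space.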
  24. Code
     Map:
       map(String key, String value):
         // key: document name
         // value: document contents
         for each word w in value:
           EmitIntermediate(w, "1");

     Reduce:
       reduce(String key, Iterator values):
         // key: a word
         // values: a list of counts
         int result = 0;
         for each v in values:
           result += ParseInt(v);
         Emit(AsString(result));
  25. Architecture
     • One master, many workers
     • The infrastructure manages scheduling and distribution
     • The user implements Map and Reduce
  26. Architecture (figure)
  27. Architecture (figure)
  28. Architecture (figure: a Combiner is a local Reduce)
  29. Important Things
     • Mappers are scheduled close to the data
     • Chunk replication improves locality
     • Reducers often run on the same machines as mappers
     • Fault tolerance:
       – Map worker crash: re-launch all tasks of that machine
       – Reduce worker crash: repeat only the crashed task
       – Master crash: repeat the whole job
       – Skip bad records
     • Fight "stragglers" by launching backup tasks
     • Proven scalability: in September 2009 Google ran 3,467,000 MapReduce jobs, averaging 488 machines per job
     • Extensively used at Yahoo and Facebook via Hadoop
  30. Benchmarks: GFS
  31. Benchmarks: MapReduce
  32. Conclusions
     "We believe we get tremendous competitive advantage by essentially building our own infrastructure."
     -- Eric Schmidt
     • GFS & MapReduce:
       – Google achieved their goals
       – A fundamental part of their stack
     • Open-source implementations:
       – GFS → Hadoop Distributed File System (HDFS)
       – MapReduce → Hadoop MapReduce
  33. Thank you for your attention!