
GFS and MapReduce



  1. Google Base Infrastructure: GFS and MapReduce (Johannes Passing, 13.04.2008)
  2. Agenda
     - Context
     - Google File System: aims, design, walkthrough
     - MapReduce: aims, design, walkthrough
     - Conclusion
  3. Context
     - Google's scalability strategy: cheap hardware, and lots of it; more than 450,000 servers (NYT, 2006)
     - Basic problems: faults are the norm; software must be massively parallelized; writing distributed software is hard
  4. GFS: Aims
     - Aims: must scale to thousands of servers; fault tolerance and data redundancy; optimized for large WORM (write once, read many) files; optimized for large reads; throughput favored over latency
     - Non-aims: space efficiency; client-side caching (futile at this scale); POSIX compliance
     - A significant deviation from standard distributed file systems (NFS, AFS, DFS, ...)
  5. Design Choices
     - A file system cluster: spread data over thousands of machines
     - Provide safety through redundant storage
     - Master/slave architecture
  6. Design Choices
     - Files are split into chunks, the unit of management and distribution
     - Chunks are replicated across servers and subdivided into blocks for integrity checks
     - Internal fragmentation is accepted in exchange for better throughput
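As a concrete illustration of the chunk scheme, the sketch below translates a byte offset within a file into a (chunk index, offset within chunk) pair, the calculation a client library performs before asking the master where that chunk lives. The 64 MB chunk size comes from the GFS paper, not from this slide.

```python
CHUNK_SIZE = 64 * 2**20  # 64 MB, the chunk size from the GFS paper

def locate(offset):
    """Translate a byte offset within a file into (chunk index, offset
    within that chunk), as a client library would before asking the
    master where that chunk's replicas live."""
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE

# A read at byte 200,000,000 falls into the third chunk (index 2):
assert locate(200_000_000) == (2, 65_782_272)
```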
  7. Architecture: Chunkserver
     - Hosts chunks, storing each chunk as a file in the Linux file system
     - Verifies data integrity
     - Implemented entirely in user mode
     - Fault tolerance: requests fail when an integrity violation is detected; every chunk is stored on at least one other server; re-replication restores the replica count after a server's downtime
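The per-block integrity checking described above can be sketched as follows. GFS keeps a small checksum per 64 KB block; the MD5 digests and in-memory lists here are illustrative stand-ins, not the real on-disk format.

```python
import hashlib

BLOCK_SIZE = 64 * 1024  # GFS checksums 64 KB blocks; MD5 stands in for its checksum

def checksum_blocks(chunk: bytes) -> list:
    """Compute one checksum per block, as done when a chunk is written."""
    return [hashlib.md5(chunk[i:i + BLOCK_SIZE]).digest()
            for i in range(0, len(chunk), BLOCK_SIZE)]

def read_block(chunk: bytes, checksums: list, block_index: int) -> bytes:
    """Serve a block only if its checksum still matches; otherwise fail the
    request so that the client falls back to another replica."""
    block = chunk[block_index * BLOCK_SIZE:(block_index + 1) * BLOCK_SIZE]
    if hashlib.md5(block).digest() != checksums[block_index]:
        raise IOError("integrity violation in block %d" % block_index)
    return block
```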
  8. Architecture: Master
     - Namespace and metadata management
     - Cluster management: chunk location registry (transient), chunk placement decisions, re-replication, garbage collection, health monitoring
     - Fault tolerance: backed by shadow masters, a mirrored operations log, and periodic snapshots
  9. Reading a Chunk
     - Client sends the file name and offset to the master
     - Master determines the file and chunk and returns the chunk locations and version
     - Client chooses the closest chunkserver and requests the chunk
     - Chunkserver verifies the chunk version and block checksums, returning the data if valid and failing the request otherwise
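A minimal in-memory model of this read path; the class names and data structures are invented for illustration, and the version check shows why a stale replica fails the request.

```python
CHUNK_SIZE = 64 * 2**20  # 64 MB, per the GFS paper

class Master:
    """Holds only metadata: which chunkservers store which chunk version."""
    def __init__(self):
        self.chunks = {}  # (filename, chunk index) -> (version, [server names])

    def lookup(self, filename, offset):
        index = offset // CHUNK_SIZE
        version, locations = self.chunks[(filename, index)]
        return index, version, locations

class Chunkserver:
    def __init__(self):
        self.store = {}  # (filename, chunk index) -> (version, data)

    def read_chunk(self, filename, index, version):
        stored_version, data = self.store[(filename, index)]
        if stored_version != version:       # stale replica: fail the request
            raise IOError("stale chunk version")
        return data

def read(master, servers, filename, offset):
    index, version, locations = master.lookup(filename, offset)
    closest = locations[0]                  # a real client picks the nearest replica
    return servers[closest].read_chunk(filename, index, version)
```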
  10. Writing a Chunk
     - Client sends the file name and offset; the master determines the lease owner (granting a lease if none exists) and returns the chunk locations and lease owner
     - Client pushes the data to any chunkserver; the data is forwarded along the replica chain
     - Client sends the write request to the primary, which creates a mutation order, applies it, and forwards the write request to the replicas
     - Replicas validate the blocks, apply the modifications, increment the version, and acknowledge; the primary replies to the client
  11. Appending a Record (1/2)
     - Client sends an append request; the master chooses the offset and returns the chunk locations and lease owner
     - Client pushes the data to any chunkserver; the data is forwarded along the replica chain
     - Client sends the append request to the primary, which checks the space left in the chunk
     - If the record does not fit, the primary pads the chunk and requests a retry; the client requests a chunk allocation and receives new chunk locations and a lease owner
  12. Appending a Record (2/2)
     - Client sends the append request; the primary allocates the chunk, writes the data, and forwards the request to the replicas
     - If a replica fails while writing, the request fails: the record now exists on some replicas while others hold garbage
     - The client retries; the affected region is padded on all replicas, and the record is written again
     - Once every replica has written the record, the append succeeds
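The pad-and-retry behaviour from the two append slides can be sketched as a toy single-machine model; the chunk capacity and data structures are invented for illustration.

```python
CHUNK_CAPACITY = 10  # toy capacity; a real chunk is 64 MB

class Chunk:
    def __init__(self):
        self.used = 0
        self.records = []

def append(chunks, data):
    """Append a record to the last chunk; if it does not fit, pad the
    remainder, allocate a fresh chunk, and retry, as the primary and
    client do in GFS. Returns the index of the chunk the record landed in."""
    primary = chunks[-1]
    if primary.used + len(data) > CHUNK_CAPACITY:
        primary.used = CHUNK_CAPACITY        # pad the rest of the chunk
        chunks.append(Chunk())               # client requests a new chunk...
        return append(chunks, data)          # ...and retries the append
    primary.records.append(data)
    primary.used += len(data)
    return len(chunks) - 1
```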
  13. File Region Consistency
     - File region states:
       - Consistent: all clients see the same data
       - Defined: consistent, and clients see the change in its entirety
       - Inconsistent: clients may see differing data
     - Consequences: unique record IDs are needed to identify duplicates; records should be self-identifying and self-validating
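One way to make records self-identifying and self-validating, as the slide prescribes, is to frame each payload with a unique ID and a checksum so a reader can drop duplicates from append retries and skip padding or garbage regions. The framing format below is an invented example, not GFS's actual layout.

```python
import hashlib
import itertools
import struct

_ids = itertools.count()  # unique record IDs (per writer, for illustration)

def make_record(payload: bytes) -> bytes:
    """Frame a payload as: 8-byte ID | 4-byte length | payload | MD5 digest."""
    header = struct.pack(">QI", next(_ids), len(payload))
    return header + payload + hashlib.md5(header + payload).digest()

def parse_valid(records, seen=None):
    """Keep only payloads that validate and whose ID was not seen before."""
    seen = set() if seen is None else seen
    out = []
    for rec in records:
        header, payload, digest = rec[:12], rec[12:-16], rec[-16:]
        rid, length = struct.unpack(">QI", header)
        valid = (len(payload) == length and
                 hashlib.md5(header + payload).digest() == digest)
        if valid and rid not in seen:
            seen.add(rid)
            out.append(payload)
    return out
```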
  14. Fault Tolerance: GFS
     - Automatic re-replication; replicas are spread across racks
     - Fault mitigations: on data corruption, choose a different replica; on a machine crash, choose a replica on a different machine; on a network outage, choose a replica in a different network; on a master crash, shadow masters take over
  15. Wrap-up: GFS
     - Proven scalability, fault tolerance, and performance
     - Highly specialized
     - Distribution is transparent to clients, but the abstraction is only partial: clients must cooperate
     - The network is the bottleneck
  16. MapReduce: Aims
     - A unified model for large-scale distributed data processing
     - Massively parallelizable
     - Distribution transparent to the developer
     - Fault tolerance
     - Allows moving computation close to the data
  17. Higher-Order Functions
     - Functional programming: map(f, [a, b, c, ...]) applies f to each element independently, yielding [f(a), f(b), f(c), ...]
     - Google: map(k, v) takes a single key/value pair and emits a list of intermediate key/value pairs
  18. Higher-Order Functions
     - Functional programming: foldl/reduce(f, [a, b, c, ...], i) folds the list from the left, threading an accumulator that starts at i
     - Google: reduce(k, [a, b, c, d, ...]) takes a key and the list of all intermediate values emitted for that key and merges them into a result
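The two functional building blocks on these slides can be written directly in Python:

```python
from functools import reduce

# map applies f to each element independently -- trivially parallel:
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))

# foldl/reduce threads an accumulator through the list, starting at i = 0:
total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4], 0)

assert squares == [1, 4, 9, 16] and total == 10
```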
  19. Idea
     - Observation: map can be easily parallelized; reduce might be parallelized
     - Idea: design the application around a map and reduce scheme; the infrastructure manages scheduling and distribution; the user implements only map and reduce
     - Constraint: all data is coerced into key/value pairs
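A minimal sequential sketch of the scheme, using the canonical word-count example; the function names are illustrative, and a real implementation distributes the map and reduce calls across machines.

```python
from collections import defaultdict

def map_fn(key, value):
    """key: document name (unused here); value: document contents."""
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    return key, sum(values)

def map_reduce(inputs, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for k, v in inputs:                     # the map phase
        for k2, v2 in map_fn(k, v):
            intermediate[k2].append(v2)     # shuffle: group by intermediate key
    return dict(reduce_fn(k, vs)            # the reduce phase
                for k, vs in intermediate.items())

counts = map_reduce([("d1", "a b a"), ("d2", "b c")], map_fn, reduce_fn)
assert counts == {"a": 2, "b": 2, "c": 1}
```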
  20. Walkthrough
     - Spawn the master
     - Provide the data, e.g. from GFS or BigTable, and create M splits
     - Spawn M mappers; map runs once per key/value pair and partitions its output into R buckets
     - Spawn up to R*M combiners and combine locally
     - After a barrier, spawn up to R reducers and reduce
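The partitioning of mapper output into R buckets can be sketched with the default hash partitioning from the MapReduce paper. Note that Python's built-in string hash is randomized across runs, so a real system would use a deterministic hash; the sketch only relies on the hash being stable within one run.

```python
def partition(key, R):
    """Hash the intermediate key into one of R buckets, so that reducer r
    receives every pair whose key hashes to r."""
    return hash(key) % R

def bucketize(pairs, R):
    """Split a mapper's output into the R buckets the reducers will fetch."""
    buckets = [[] for _ in range(R)]
    for k, v in pairs:
        buckets[partition(k, R)].append((k, v))
    return buckets
```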
  21. Scheduling
     - Locality: mappers are scheduled close to their data; chunk replication improves locality; reducers run on the same machines as mappers
     - Choosing M and R: load balancing and fast recovery versus the number of output files; M >> R, with R a small multiple of the number of machines
     - Backup tasks are scheduled to avoid stragglers
  22. Fault Tolerance: MapReduce
     - Map crash: all intermediate data on the machine is in a questionable state, so all of its map tasks are repeated (the rationale for the barrier)
     - Reduce crash: output data is global and completed stages are marked, so only the crashed reduce task is repeated
     - Master crash: start over
     - Repeated crashes on certain records: skip those records
  23. Wrap-up: MapReduce
     - Proven scalability
     - Restricts the programming model in exchange for better runtime support
     - Tailored to Google's needs
     - The programming model is fairly low-level
  24. Conclusion
     - Rethinking infrastructure: rigorous design for scalability and performance
     - Highly specialized solutions, not generally applicable
     - Solutions are more low-level than usual; maintenance efforts may be significant
     - "We believe we get tremendous competitive advantage by essentially building our own infrastructures" -- Eric Schmidt
  25. References
     - Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", 2004
     - Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System", 2003
     - Ralf Lämmel, "Google's MapReduce Programming Model – Revisited", SCP journal, 2006
     - Google, "Cluster Computing and MapReduce", minilecture/listing.html, 2007
     - David F. Carr, "How Google Works", Storage/How-Google-Works/, 2006
     - Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, and Christos Kozyrakis, "Evaluating MapReduce for Multi-core and Multiprocessor Systems", 2007