Transcript

  • 1. Google Base Infrastructure: GFS and MapReduce (Johannes Passing)
  • 2. Agenda
    - Context
    - Google File System: aims, design, walkthrough
    - MapReduce: aims, design, walkthrough
    - Conclusion
  • 3. Context
    - Google's scalability strategy: cheap hardware, and lots of it; more than 450,000 servers (NYT, 2006)
    - Basic problems: faults are the norm, software must be massively parallelized, and writing distributed software is hard
  • 4. GFS: Aims
    - Aims: must scale to thousands of servers; fault tolerance and data redundancy; optimize for large WORM files; optimize for large reads; favor throughput over latency
    - Non-aims: space efficiency; client-side caching (futile); POSIX compliance
    - Significant deviation from standard DFS designs (NFS, AFS, DFS, ...)
  • 5. Design Choices
    - A file system cluster: spread data over thousands of machines
    - Provide safety through redundant storage
    - Master/slave architecture
  • 6. Design Choices
    - Files are split into chunks, the unit of management and distribution
    - Chunks are replicated across servers
    - Chunks are subdivided into blocks for integrity checks (see the checksum sketch below)
    - Internal fragmentation is accepted in exchange for better throughput
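A minimal sketch of block-granularity integrity checking inside a chunk, assuming the 64 MB chunk size and 64 KB checksum block size described in the GFS paper; the function names and the use of CRC32 via zlib are illustrative, not the actual GFS implementation.

```python
import zlib

CHUNK_SIZE = 64 * 1024 * 1024   # chunk size used by GFS (per the paper)
BLOCK_SIZE = 64 * 1024          # checksum granularity within a chunk

def block_checksums(chunk_data: bytes) -> list[int]:
    """Compute one CRC32 per 64 KB block of a chunk (illustrative only)."""
    return [zlib.crc32(chunk_data[off:off + BLOCK_SIZE])
            for off in range(0, len(chunk_data), BLOCK_SIZE)]

def verify_block(chunk_data: bytes, checksums: list[int], block_index: int) -> bool:
    """Re-checksum a single block before serving it to a client."""
    off = block_index * BLOCK_SIZE
    return zlib.crc32(chunk_data[off:off + BLOCK_SIZE]) == checksums[block_index]
```

Because checksums are kept per block, a chunkserver can verify only the blocks covering a requested range instead of re-reading the whole chunk.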
  • 7. Architecture: Chunkserver
    - Hosts chunks: stores chunks as files in the Linux file system and verifies data integrity
    - Implemented entirely in user mode
    - Fault tolerance: requests are failed when an integrity violation is detected; every chunk is stored on at least one other server; re-replication restores lost replicas after a server's downtime
  • 8. Architecture: Master
    - Namespace and metadata management
    - Cluster management: chunk location registry (transient), chunk placement decisions, re-replication, garbage collection, health monitoring
    - Fault tolerance: backed by shadow masters, a mirrored operations log, and periodic snapshots
  • 9. Reading a chunk
    1. The client sends the filename and offset
    2. The master determines the file and chunk and returns the chunk locations and version
    3. The client chooses the closest chunkserver and requests the chunk
    4. The chunkserver verifies the chunk version and block checksums
    5. The chunkserver returns the data if valid, otherwise it fails the request
    (A client-side sketch of this flow follows below.)
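The read walkthrough can be condensed into client-side pseudocode. This is a sketch only: `master.lookup`, `server.read`, and the `distance` ranking are hypothetical stand-ins, not GFS's published interface.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # chunk size from the GFS paper

class IntegrityError(Exception):
    """Raised by a chunkserver stub when version or checksum verification fails."""

def gfs_read(master, filename, offset, length, distance):
    """Illustrative client read path: translate (file, offset) into a chunk read."""
    chunk_index = offset // CHUNK_SIZE                          # which chunk holds the offset
    locations, version = master.lookup(filename, chunk_index)   # steps 1-2: ask the master
    for server in sorted(locations, key=distance):              # step 3: prefer the closest replica
        try:
            # Step 4: the chunkserver checks its chunk version and the block
            # checksums covering the requested range before returning data.
            return server.read(chunk_index, offset % CHUNK_SIZE, length, version)
        except IntegrityError:
            continue                                            # step 5: fall back to another replica
    raise IOError("no replica returned valid data")
```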
  • 10. Writing a chunk
    1. The client sends the filename and offset
    2. The master determines the lease owner (granting a lease if necessary) and returns the chunk locations and lease owner
    3. The client pushes the data to any chunkserver; the data is forwarded along the replica chain
    4. The client sends the write request to the primary
    5. The primary creates a mutation order, applies it, and forwards the write request to the replicas
    6. The replicas validate the blocks, apply the modifications, increment the version, and acknowledge
    7. The primary replies to the client
    (A client-side sketch of this flow follows below.)
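The write path can be sketched the same way; again, the RPC calls (`get_lease`, `push_data`, `write`) are hypothetical stubs chosen to mirror the steps above, not GFS's real interface.

```python
def gfs_write(master, filename, chunk_index, data):
    """Illustrative client write path following the slide's step order."""
    # Steps 1-2: the master determines the lease owner (granting a lease if
    # needed) and returns the replica locations and the primary.
    replicas, primary = master.get_lease(filename, chunk_index)
    # Step 3: data is pushed to any replica and forwarded along the replica
    # chain, decoupling the data flow from the control flow.
    handle = replicas[0].push_data(data)
    # Steps 4-6: the write request goes to the primary, which assigns a mutation
    # order, applies it locally, and forwards the request to the replicas; they
    # validate blocks, apply the mutation, bump the version, and acknowledge.
    if not primary.write(chunk_index, handle):
        raise IOError("write failed on one or more replicas")   # step 7: reply to the client
```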
  • 11. Appending a record (1/2)
    1. The client sends an append request
    2. The master chooses the offset and returns the chunk locations and lease owner
    3. The client pushes the data to any chunkserver; the data is forwarded along the replica chain
    4. The client sends the append request to the primary
    5. The primary checks whether enough space is left in the chunk; if not, it pads the chunk and asks the client to retry
    6. On retry, a new chunk is allocated and the master returns the new chunk locations and lease owner
  • 12. Appending a record (2/2)
    - The client sends the append request; the primary allocates space in the chunk, writes the data, and forwards the request to the replicas
    - If writing fails on a replica, the request fails: replicas that succeeded now hold the record while the failed one holds garbage
    - The client retries; the primary and all replicas write the record again and the append succeeds
    - Result: each record appears at least once on every replica, while failed attempts leave padding or garbage regions behind (see the retry sketch below)
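A sketch of the client-side retry loop that produces these at-least-once semantics; the RPC stubs and status codes are illustrative assumptions, not GFS's actual API.

```python
def record_append(master, filename, record):
    """Illustrative at-least-once record append with client-driven retry."""
    while True:
        replicas, primary = master.get_lease_for_last_chunk(filename)  # hypothetical RPC
        handle = replicas[0].push_data(record)     # data flows along the replica chain
        status, offset = primary.append(handle)    # the primary chooses the offset
        if status == "OK":
            return offset                          # record sits at this offset on every replica
        if status == "CHUNK_FULL":
            continue                               # primary padded the chunk; retry on a new chunk
        # Any replica failure leaves garbage on the replicas that did succeed;
        # the client simply retries, so the record appears at least once everywhere.
```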
  • 13. File Region Consistency
    - File region states:
      Consistent: all clients see the same data
      Defined: consistent, and clients see the mutation in its entirety
      Inconsistent: otherwise (e.g. regions left behind by failed appends)
    - Consequences: records need unique IDs so duplicates can be identified, and records should be self-identifying and self-validating (see the reader sketch below)
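A minimal reader-side sketch of these consequences, using a record format invented here for illustration (neither the slides nor the GFS paper prescribe one): each record carries a unique ID and a checksum, so readers can skip garbage and drop duplicates.

```python
import base64
import json
import uuid
import zlib

def make_record(payload: bytes) -> bytes:
    """Wrap a payload so it is self-identifying (unique ID) and self-validating (CRC)."""
    return json.dumps({"id": uuid.uuid4().hex,
                       "crc": zlib.crc32(payload),
                       "data": base64.b64encode(payload).decode()}).encode()

def read_records(raw_records):
    """Skip padding/garbage regions and drop duplicates left by retried appends."""
    seen = set()
    for raw in raw_records:
        try:
            rec = json.loads(raw)
            payload = base64.b64decode(rec["data"])
            crc, rec_id = rec["crc"], rec["id"]
        except (ValueError, KeyError, TypeError):
            continue                    # padding or garbage: not a parsable record
        if zlib.crc32(payload) != crc:
            continue                    # corrupt record: read it from another replica instead
        if rec_id in seen:
            continue                    # duplicate produced by an at-least-once retry
        seen.add(rec_id)
        yield payload
```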
  • 14. Fault tolerance GFS
    - Automatic re-replication; replicas are spread across racks
    - Fault mitigations:
      Data corruption: choose a different replica
      Machine crash: choose a replica on a different machine
      Network outage: choose a replica in a different network
      Master crash: shadow masters
  • 15. Wrap-up GFS
    - Proven scalability, fault tolerance and performance
    - Highly specialized
    - Distribution is transparent to clients, but the abstraction is only partial: clients must cooperate
    - The network is the bottleneck
  • 16. MapReduce: Aims
    - A unified model for large-scale distributed data processing
    - Massively parallelizable
    - Distribution transparent to the developer
    - Fault tolerance
    - Allow moving computation close to the data
  • 17. Higher order functions
    - Functional: map(f, [a, b, c, ...]) applies f to every element independently, yielding [f(a), f(b), f(c), ...]
    - Google: map(k, v) takes a single key/value pair and emits a list of intermediate key/value pairs (see the example below)
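The contrast can be shown in a few lines of Python; the word-splitting map is an illustrative choice, not something prescribed by the slides.

```python
# Functional map: apply f to every element of a list independently.
assert list(map(lambda x: x * x, [1, 2, 3])) == [1, 4, 9]

# Google's map: take one input (key, value) pair and emit a list of
# intermediate (key, value) pairs; keys and values may change type.
def map_fn(key: str, value: str) -> list[tuple[str, int]]:
    # Illustrative: key is a document name, value is its contents.
    return [(word, 1) for word in value.split()]

assert map_fn("doc1", "a b a") == [("a", 1), ("b", 1), ("a", 1)]
```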
  • 18. Higher order functions
    - Functional: foldl/reduce(f, [a, b, c, ...], i) folds the list into a single value by repeatedly applying f, starting from the initial value i
    - Google: reduce(k, [a, b, c, d, ...]) takes a key and the list of all intermediate values for that key and produces the output for that key (see the example below)
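The corresponding contrast for reduce, again with an illustrative summing reducer:

```python
from functools import reduce

# Functional foldl/reduce: collapse a list into one value, starting from an initial value.
assert reduce(lambda acc, x: acc + x, [1, 2, 3], 0) == 6

# Google's reduce: receive one key together with *all* intermediate values
# collected for that key, and emit the output value(s) for that key.
def reduce_fn(key: str, values: list[int]) -> int:
    # Illustrative: sum the per-occurrence counts emitted by a word-count map.
    return sum(values)

assert reduce_fn("a", [1, 1, 1]) == 3
```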
  • 19. Idea
    - Observation: map can be easily parallelized; reduce might be parallelized
    - Idea: design the application around a map and reduce scheme; the infrastructure manages scheduling and distribution; the user implements only map and reduce (word-count sketch below)
    - Constraint: all data is coerced into key/value pairs
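The canonical word-count example from the MapReduce paper shows the whole scheme; the single-process driver below is only a stand-in for the real runtime, which handles scheduling and distribution.

```python
from collections import defaultdict

def map_fn(name, text):                  # user code: emit (word, 1) per occurrence
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):             # user code: sum the counts for one word
    return sum(counts)

def word_count(documents):
    """Single-process stand-in for the MapReduce runtime (illustrative)."""
    intermediate = defaultdict(list)
    for name, text in documents.items():             # "map phase"
        for key, value in map_fn(name, text):
            intermediate[key].append(value)           # group values by intermediate key
    return {key: reduce_fn(key, values)               # "reduce phase"
            for key, values in intermediate.items()}

print(word_count({"doc1": "the quick fox", "doc2": "the lazy dog"}))
# {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```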
  • 20. Walkthrough
    - Spawn the master; provide input data, e.g. from GFS or BigTable
    - Create M splits and spawn M mappers; map is invoked once per key/value pair
    - Partition intermediate results into R buckets; spawn up to R*M combiners and combine
    - Barrier: wait until all map work is finished
    - Spawn up to R reducers and reduce (a single-process sketch of this pipeline follows below)
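A single-process sketch of this pipeline, with M splits, R hash-partitioned buckets, an optional combiner, and an explicit barrier; the structure follows the walkthrough above, while the names and defaults are illustrative.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn, M=4, R=2, combine_fn=None):
    """Toy driver: records is a list of (key, value) pairs."""
    splits = [records[i::M] for i in range(M)]                # create M input splits

    buckets = [defaultdict(list) for _ in range(R)]           # R partitions of intermediate data
    for split in splits:                                      # one "mapper" per split
        local = defaultdict(list)
        for key, value in split:
            for k, v in map_fn(key, value):                   # map once per key/value pair
                local[k].append(v)
        for k, vs in local.items():
            if combine_fn:                                    # combiner runs on the mapper side
                vs = [combine_fn(k, vs)]
            buckets[hash(k) % R][k].append(vs)                # partition into R buckets

    # Barrier: every mapper must finish before any reducer starts.
    output = {}
    for bucket in buckets:                                    # one "reducer" per bucket
        for k, groups in bucket.items():
            values = [v for group in groups for v in group]   # flatten per-mapper groups
            output[k] = reduce_fn(k, values)
    return output
```

With the word-count `map_fn`/`reduce_fn` from the previous sketch and `combine_fn=reduce_fn`, this reproduces the partitioning, combining, and barrier described above in miniature.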
  • 21. Scheduling
    - Locality: mappers are scheduled close to their data; chunk replication improves locality; reducers run on the same machines as mappers
    - Choosing M and R: load balancing and fast recovery vs. the number of output files; M >> R, with R a small multiple of the number of machines (sizing example below)
    - Backup tasks are scheduled to avoid stragglers
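For scale, the MapReduce paper reports typical jobs with M = 200,000 and R = 5,000 on about 2,000 worker machines. The helper below turns the rule of thumb into numbers; the split size and multiplier defaults are illustrative, not values from the slides.

```python
def choose_m_r(num_machines: int, input_bytes: int,
               split_bytes: int = 64 * 1024 * 1024, r_per_machine: int = 3):
    """Rule-of-thumb sizing: M follows the input split size, R is a small
    multiple of the machine count, and M should be much larger than R."""
    M = max(1, input_bytes // split_bytes)    # roughly one map task per input split
    R = r_per_machine * num_machines          # small multiple of the number of machines
    return M, R

print(choose_m_r(num_machines=2000, input_bytes=10 * 1024**4))  # ~10 TiB input -> (163840, 6000)
```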
  • 22. Fault tolerance MapReduce
    - Fault mitigations:
      Map crash: all intermediate data on that machine is in a questionable state, so all of its map tasks are repeated (the rationale for the barrier)
      Reduce crash: output data is global and completed stages are marked, so only the crashed reduce task is repeated
      Master crash: start over
      Repeated crashes on certain records: skip those records
  • 23. Wrap-up MapReduce
    - Proven scalability
    - Restricts the programming model in exchange for better runtime support
    - Tailored to Google's needs
    - The programming model is fairly low-level
  • 24. Conclusion
    - Rethinking infrastructure: consistent design for scalability and performance
    - Highly specialized solutions, not generally applicable
    - Solutions are more low-level than usual; maintenance efforts may be significant
    - "We believe we get tremendous competitive advantage by essentially building our own infrastructures" -- Eric Schmidt
  • 25. References
    - Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", 2004
    - Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System", 2003
    - Ralf Lämmel, "Google's MapReduce Programming Model – Revisited", SCP journal, 2006
    - Google, "Cluster Computing and MapReduce", http://code.google.com/edu/content/submissions/mapreduce-minilecture/listing.html, 2007
    - David F. Carr, "How Google Works", http://www.baselinemag.com/c/a/Projects-Networks-and-Storage/How-Google-Works/, 2006
    - Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, and Christos Kozyrakis, "Evaluating MapReduce for Multi-core and Multiprocessor Systems", 2007