MapReduce and HDFS

Transcript

  • 1. MapReduce and HDFS This presentation includes course content © University of Washington Redistributed under the Creative Commons Attribution 3.0 license. All other contents: © 2009 Cloudera, Inc.
  • 2. Overview • Why MapReduce? • What is MapReduce? • The Hadoop Distributed File System © 2009 Cloudera, Inc.
  • 3. How MapReduce is Structured • Functional programming meets distributed computing • A batch data processing system • Factors out many reliability concerns from application logic © 2009 Cloudera, Inc.
  • 4. MapReduce Provides: • Automatic parallelization & distribution • Fault-tolerance • Status and monitoring tools • A clean abstraction for programmers © 2009 Cloudera, Inc.
  • 5. Programming Model • Borrows from functional programming • Users implement interface of two functions: – map (in_key, in_value) -> (out_key, intermediate_value) list – reduce (out_key, intermediate_value list) -> out_value list © 2009 Cloudera, Inc.
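
To make that two-function interface concrete, here is a minimal single-machine sketch in plain Python (no Hadoop involved; the map_reduce driver and the demo data are purely illustrative). Everything the framework adds beyond this glue, parallelism, distribution, and fault tolerance, is invisible to the two user-supplied functions:

    from collections import defaultdict

    def map_reduce(records, mapper, reducer):
        """Toy driver: map every record, group by output key, reduce each key."""
        grouped = defaultdict(list)
        for in_key, in_value in records:
            # Map phase: each record may emit any number of
            # (out_key, intermediate_value) pairs.
            for out_key, out_value in mapper(in_key, in_value):
                grouped[out_key].append(out_value)
        output = []
        for out_key, values in grouped.items():
            # Reduce phase: the reducer sees each out_key exactly once,
            # with the list of every intermediate value emitted for it.
            output.extend(reducer(out_key, values))
        return output

    # Tiny demo: count records per key.
    records = [("doc1", "x"), ("doc2", "y"), ("doc1", "z")]
    print(map_reduce(records,
                     lambda k, v: [(k, 1)],
                     lambda k, vals: [(k, sum(vals))]))
    # [('doc1', 2), ('doc2', 1)]
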
  • 6. map • Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key/value pairs: e.g., (filename, line). • map() produces one or more intermediate values along with an output key from the input. © 2009 Cloudera, Inc.
  • 7. map map (in_key, in_value) -> (out_key, intermediate_value) list © 2009 Cloudera, Inc.
  • 8. Example: Upper-case Mapper let map(k, v) = emit(k.toUpper(), v.toUpper()) (“foo”, “bar”) → (“FOO”, “BAR”) (“Foo”, “other”) → (“FOO”, “OTHER”) (“key2”, “data”) → (“KEY2”, “DATA”) © 2009 Cloudera, Inc.
  • 9. Example: Explode Mapper let map(k, v) = foreach char c in v: emit(k, c) (“A”, “cats”) → (“A”, “c”), (“A”, “a”), (“A”, “t”), (“A”, “s”) (“B”, “hi”) → (“B”, “h”), (“B”, “i”) © 2009 Cloudera, Inc.
  • 10. Example: Filter Mapper let map(k, v) = if (isPrime(v)) then emit(k, v) (“foo”, 7) → (“foo”, 7) (“test”, 10) → (nothing) © 2009 Cloudera, Inc.
  • 11. Example: Changing Keyspaces let map(k, v) = emit(v.length(), v) (“hi”, “test”) → (4, “test”) (“x”, “quux”) → (4, “quux”) (“y”, “abracadabra”) → (11, “abracadabra”) © 2009 Cloudera, Inc.
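
The four mappers above translate directly into ordinary Python generator functions (a standalone sketch, not Hadoop API code; isPrime is filled in with a simple trial-division helper):

    def upper_case_mapper(k, v):
        yield (k.upper(), v.upper())

    def explode_mapper(k, v):
        for c in v:                      # one output pair per character
            yield (k, c)

    def is_prime(n):
        return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

    def filter_mapper(k, v):
        if is_prime(v):                  # otherwise the record is dropped entirely
            yield (k, v)

    def keyspace_mapper(k, v):
        yield (len(v), v)                # output key need not resemble input key

    print(list(upper_case_mapper("Foo", "other")))    # [('FOO', 'OTHER')]
    print(list(explode_mapper("B", "hi")))            # [('B', 'h'), ('B', 'i')]
    print(list(filter_mapper("test", 10)))            # []  (10 is not prime)
    print(list(keyspace_mapper("y", "abracadabra")))  # [(11, 'abracadabra')]
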
  • 12. reduce • After the map phase is over, all the intermediate values for a given output key are combined together into a list • reduce() combines those intermediate values into one or more final values for that same output key • (in practice, usually only one final value per key) © 2009 Cloudera, Inc.
  • 13. reduce reduce (out_key, intermediate_value list) -> out_value list © 2009 Cloudera, Inc.
  • 14. Example: Sum Reducer
    let reduce(k, vals) =
      sum = 0
      foreach int v in vals:
        sum += v
      emit(k, sum)
    (“A”, [42, 100, 312]) → (“A”, 454)
    (“B”, [12, 6, -2]) → (“B”, 16)
    © 2009 Cloudera, Inc.
  • 15. Example: Identity Reducer let reduce(k, vals) = foreach v in vals: emit(k, v) (“A”, [42, 100, 312]) → (“A”, 42), (“A”, 100), (“A”, 312) (“B”, [12, 6, -2]) → (“B”, 12), (“B”, 6), (“B”, -2) © 2009 Cloudera, Inc.
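
And the two reducers, again as standalone Python generators over a key and its full list of intermediate values (illustrative code, not the Hadoop Reducer API):

    def sum_reducer(k, vals):
        yield (k, sum(vals))             # one final value per key

    def identity_reducer(k, vals):
        for v in vals:                   # pass every value through unchanged
            yield (k, v)

    print(list(sum_reducer("A", [42, 100, 312])))       # [('A', 454)]
    print(list(identity_reducer("B", [12, 6, -2])))
    # [('B', 12), ('B', 6), ('B', -2)]
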
  • 16. (diagram-only slide) © 2009 Cloudera, Inc.
  • 17. Parallelism • map() functions run in parallel, creating different intermediate values from different input data sets • reduce() functions also run in parallel, each working on a different output key • All values are processed independently • Bottleneck: reduce phase can’t start until map phase is completely finished. © 2009 Cloudera, Inc.
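
A toy way to see both the parallelism and the bottleneck noted above: the map tasks below run concurrently over different inputs, but the reduce step cannot begin until every one of them has finished (standard-library Python only; nothing here reflects Hadoop's actual scheduler):

    from collections import defaultdict
    from concurrent.futures import ProcessPoolExecutor

    def count_words(document):
        # One map task: emit (word, 1) for every word in one input split.
        return [(word, 1) for word in document.split()]

    def run_job(documents):
        grouped = defaultdict(list)
        with ProcessPoolExecutor() as pool:
            # Map tasks execute in parallel, each over its own input...
            for pairs in pool.map(count_words, documents):
                for key, value in pairs:
                    grouped[key].append(value)
        # ...and only after all of them (and the grouping by key) complete
        # does the reduce phase run, one key at a time, independently.
        return {key: sum(values) for key, values in grouped.items()}

    if __name__ == "__main__":
        print(run_job(["a b a", "b c", "c c c"]))   # {'a': 2, 'b': 2, 'c': 4}
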
  • 18. Example: Count word occurrences
    map(String input_key, String input_value):
      // input_key: document name
      // input_value: document contents
      for each word w in input_value:
        emit(w, 1);
    reduce(String output_key, Iterator<int> intermediate_values):
      // output_key: a word
      // output_values: a list of counts
      int result = 0;
      for each v in intermediate_values:
        result += v;
      emit(output_key, result);
    © 2009 Cloudera, Inc.
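
One common way to run exactly this job is Hadoop Streaming, where the mapper and reducer are ordinary scripts wired together through stdin/stdout. The sketch below assumes that setup; the file names wordcount_mapper.py and wordcount_reducer.py and the tab-separated key/value lines are conventions of that style, not anything prescribed by the slide.

    # wordcount_mapper.py  (hypothetical file name)
    import sys

    for line in sys.stdin:
        # input value: one line of the document; emit (word, 1) per word
        for word in line.split():
            print(f"{word}\t1")

The reducer relies on the framework sorting the map output by key, so all counts for one word arrive on consecutive lines:

    # wordcount_reducer.py  (hypothetical file name)
    import sys

    current_word, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current_word and current_word is not None:
            print(f"{current_word}\t{total}")   # emit(word, result)
            total = 0
        current_word = word
        total += int(count)
    if current_word is not None:
        print(f"{current_word}\t{total}")

The pipeline can be simulated locally with: cat doc.txt | python wordcount_mapper.py | sort | python wordcount_reducer.py
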
  • 19. Combining Phase • Run on mapper nodes after map phase • “Mini-reduce,” only on local map output • Used to save bandwidth before sending data to full reducer • Reducer can be combiner if commutative & associative – e.g., SumReducer © 2009 Cloudera, Inc.
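
A sketch of what the combiner buys (toy code, not Hadoop's Combiner interface): applying the sum locally to one mapper's output collapses its (word, 1) pairs into per-word partial counts before anything is shuffled across the network.

    from collections import Counter

    def map_task(document):
        # Raw map output for one input split: one (word, 1) pair per word.
        return [(word, 1) for word in document.split()]

    def combine(pairs):
        # "Mini-reduce" on the mapper node; safe to reuse the sum logic
        # because addition is commutative and associative.
        counts = Counter()
        for word, n in pairs:
            counts[word] += n
        return list(counts.items())

    doc = "to be or not to be"
    print(len(map_task(doc)))            # 6 pairs shuffled without a combiner
    print(len(combine(map_task(doc))))   # 4 pairs shuffled with one
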
  • 20. Combiner, graphically © 2009 Cloudera, Inc.
  • 21. WordCount Redux
    map(String input_key, String input_value):
      // input_key: document name
      // input_value: document contents
      for each word w in input_value:
        emit(w, 1);
    reduce(String output_key, Iterator<int> intermediate_values):
      // output_key: a word
      // output_values: a list of counts
      int result = 0;
      for each v in intermediate_values:
        result += v;
      emit(output_key, result);
    © 2009 Cloudera, Inc.
  • 22. MapReduce Conclusions • MapReduce has proven to be a useful abstraction in many areas • Greatly simplifies large-scale computations • Functional programming paradigm can be applied to large-scale applications • You focus on the “real” problem, library deals with messy details © 2009 Cloudera, Inc.
  • 23. HDFS Some slides designed by Alex Moschuk, University of Washington Redistributed under the Creative Commons Attribution 3.0 license
  • 24. HDFS: Motivation • Based on Google’s GFS • Redundant storage of massive amounts of data on cheap and unreliable computers • Why not use an existing file system? – Different workload and design priorities – Handles much bigger dataset sizes than other filesystems © 2009 Cloudera, Inc.
  • 25. Assumptions • High component failure rates – Inexpensive commodity components fail all the time • “Modest” number of HUGE files – Just a few million – Each is 100MB or larger; multi-GB files typical • Files are write-once, mostly appended to – Perhaps concurrently • Large streaming reads • High sustained throughput favored over low latency © 2009 Cloudera, Inc.
  • 26. HDFS Design Decisions • Files stored as blocks – Much larger size than most filesystems (default is 64MB) • Reliability through replication – Each block replicated across 3+ DataNodes • Single master (NameNode) coordinates access, metadata – Simple centralized management • No data caching – Little benefit due to large data sets, streaming reads • Familiar interface, but customize the API – Simplify the problem; focus on distributed apps © 2009 Cloudera, Inc.
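
For example, the 64MB block size and 3+ replication mentioned above are ordinary per-cluster settings in hdfs-site.xml. The values below are illustrative only, and the block-size property name has varied across Hadoop releases (dfs.block.size in older versions, dfs.blocksize later):

    <!-- hdfs-site.xml (illustrative values only) -->
    <configuration>
      <property>
        <name>dfs.block.size</name>
        <value>67108864</value>  <!-- 64 MB, the default cited on the slide -->
      </property>
      <property>
        <name>dfs.replication</name>
        <value>3</value>         <!-- each block kept on 3 DataNodes -->
      </property>
    </configuration>
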
  • 27. HDFS Client Block Diagram (diagram slide; labels not recoverable from the transcript) © 2009 Cloudera, Inc.
  • 28. Based on GFS Architecture Figure from “The Google File System,” Ghemawat et al., SOSP 2003 © 2009 Cloudera, Inc.
  • 29. Metadata • Single NameNode stores all metadata – Filenames, locations on DataNodes of each file • Maintained entirely in RAM for fast lookup • DataNodes store opaque file contents in “block” objects on underlying local filesystem © 2009 Cloudera, Inc.
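
Conceptually, and only as a sketch (the real NameNode keeps inodes, block objects, and an edit log, none of which are shown; all names and IDs below are made up), the in-RAM metadata amounts to two mappings: file path to its ordered blocks, and each block to the DataNodes holding its replicas.

    # Illustrative shape of NameNode metadata, not actual data structures.
    namespace = {
        "/logs/2009/03/01.log": ["blk_1001", "blk_1002"],   # file -> ordered block list
    }
    block_locations = {
        "blk_1001": ["datanode-a", "datanode-b", "datanode-c"],  # 3 replicas
        "blk_1002": ["datanode-b", "datanode-c", "datanode-d"],
    }
    # DataNodes store only the opaque block contents on their local
    # filesystems; all naming lives on the NameNode.
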
  • 30. HDFS Conclusions • HDFS supports large-scale processing workloads on commodity hardware – designed to tolerate frequent component failures – optimized for huge files that are mostly appended and read – filesystem interface is customized for the job, but still retains familiarity for developers – simple solutions can work (e.g., single master) • Reliably stores several TB in individual clusters © 2009 Cloudera, Inc.
