MapReduce: A Gentle Introduction, In Four Acts
Talk given by Michael Bevilaqua-Linn at Philly.rb on March 9th, 2010

Act I: Introduction

What Is Map
- Map is a higher order procedure that takes as its arguments a procedure of one argument and a list.

>> l = (1..10)
=> 1..10
>> l.map { |i| i + 1 }
=> [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

What Is Reduce
- Reduce is a higher order procedure that takes as its arguments a procedure of two arguments and a list.
- Has some other names. In Ruby, it’s inject.

>> l = (1..10)
=> 1..10
>> l.inject { |i, j| i + j }
=> 55

What Is MapReduce
- An algorithm inspired by map and reduce, used to perform ‘embarrassingly parallel’ computations.
- A framework based on that algorithm, used inside Google.
- A handy way to deal with large (like, really, really large) amounts of semi-structured data.

Semi-Structured Data?

The Web Is Kind Of A Mess

But There Is Some Order

<html>
  <head>
    <title> Marmots I’ve Loved </title>
  </head>
  <body>
    <h1> Marmot List </h1>
    <ul>
      <li> Marcy </li>
      <li> Stacy </li>
    </ul>
  </body>
</html>

12:00:23 GET /marmots/index.html
12:00:55 GET /marmots/stacy.jpg
12:00:67 GET /marmots/marcy.jpg

But What To Do With It?
- So, what do you do if you’ve got gigabytes (or terabytes) of this sort of data, and you want to analyze it?
- You could buy a distributed data warehouse. Pricey!
- And you still need to do ETL for everything.
- And you’ve got nulls all over the place.
- And maybe your schema changes. A lot.

Act II: Enter Stage Left – MapReduce

What Is Map, Part Deux
- Conceptually, it’s easy to make map parallel.
- If you have 10 million records and 10 nodes, send 1 million records to each node along with the map code (see the sketch below).
- That’s it!
- Well, not really. It’s a hard engineering problem. (Need a distributed data store to store results, nodes fail, and on and on…)

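A scaled-down, single-machine sketch of that split (my own illustration, not from the talk). Ruby threads stand in for the worker nodes, and the names records, chunks, and num_nodes are made up:

# Toy version of "split the records across the nodes and run the same map code on each".
# With MRI's GIL the threads are only illustrative, not real parallelism.
records   = (1..10_000).to_a          # pretend these are the 10 million records
num_nodes = 10

# Partition the records and "ship" one chunk to each node.
chunks = records.each_slice(records.size / num_nodes).to_a

threads = chunks.map do |chunk|
  Thread.new { chunk.map { |record| record + 1 } }   # every "node" runs the same map code
end

# Gather the per-node results back into one list.
mapped = threads.flat_map(&:value)
puts mapped.first(3).inspect   # => [2, 3, 4]
puts mapped.size               # => 10000
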
What Is Reduce, Part Deux
- Reduce is harder: in general you can’t split the list up among nodes and recombine the results. Evaluation order matters!
- (1 / 2 / 3 / 4) != (1 / 2) / (3 / 4)
- But what if we constrain ourselves to work only on key-value pairs?
- Then we can distribute all the records that correspond to a particular key to the same node, and get an answer for that key (see the sketch below).

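A toy Ruby sketch of that key-value trick (my own illustration, not from the talk): route every record to a "node" by hashing its key, so all the records for a given key land in the same place and each node can reduce its own keys independently.

# Toy "cluster": the partition function sends the same key to the same node every time.
NUM_NODES = 3

records = [["marmot", 1], ["badger", 1], ["marmot", 1], ["stoat", 1], ["marmot", 1]]

nodes = records.group_by { |key, _value| key.hash % NUM_NODES }

# Each node reduces its own keys without talking to any other node.
results = nodes.flat_map do |_node_id, node_records|
  node_records.group_by { |key, _value| key }
              .map { |key, pairs| [key, pairs.map(&:last).inject(:+)] }
end

results.each { |key, total| puts "#{key} #{total}" }
# Prints marmot 3, badger 1, stoat 1 (order depends on the hash).
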
What Is Reduce, Part Deux, Part Deux
- Now we’re back in the same place we were with Map: conceptually easy to make parallel, still a hard engineering problem.
- But how useful is it?

MapReduce Pseudocode: Distributed Word Count*
(* This example is legally required to be in all introductions to MapReduce.)

map(record)
  words = split(record, ' ')
  for word in words
    emit(word, 1)

reduce(key, values)
  int count = 0
  for value in values
    count += 1
  emit(key, count)

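As a bridge to the real scripts in Act III, here is one way to transcribe that pseudocode into plain, in-memory Ruby (my own transcription; records is just an array of strings standing in for the input):

# Map phase: emit a (word, 1) pair for every word in every record.
def map_record(record)
  record.split(' ').map { |word| [word, 1] }
end

# Reduce phase: given one key and all of its values, emit (key, count).
def reduce_word(key, values)
  [key, values.length]
end

records = ["a marmot ate my homework", "a marmot would never"]

pairs   = records.flat_map { |record| map_record(record) }               # map
grouped = pairs.group_by { |word, _one| word }                           # shuffle
counts  = grouped.map { |word, ps| reduce_word(word, ps.map(&:last)) }   # reduce

counts.each { |word, count| puts "#{word} #{count}" }
# Prints (one per line): a 2, marmot 2, ate 1, my 1, homework 1, would 1, never 1
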
Act III: Hadoop (Streaming Mode)

Hadoop!
- Apache umbrella project (what isn’t, nowadays?)
- Open source MapReduce implementation, distributed filesystem (HDFS), non-relational data store (HBase), declarative language for processing semi-structured data (Pig).
- I’ve really only used the MapReduce implementation, in ‘Streaming Mode’.

MapReduce Mapper: Distributed Word Count*
(* This example is legally required to be in all introductions to MapReduce.)

#!/usr/bin/ruby
STDIN.each_line do |line|
  words = line.split(' ')
  words.each { |word| puts "#{word} 1" }
end

MapReduce Reducer: Distributed Word Count*
(* This example is legally required to be in all introductions to MapReduce.)

#!/usr/bin/ruby
# Input arrives sorted by key, so all the lines for a given word are adjacent.
count = 0
current_word = nil

STDIN.each_line do |line|
  key, value = line.split(" ")
  current_word = key if nil == current_word
  if key != current_word
    # We have seen every line for current_word; emit its total and reset.
    puts "#{current_word} #{count}"
    count = 0
    current_word = key
  end
  count += value.to_i
end

# Emit the final word.
puts "#{current_word} #{count}"

Streaming Mode
- Jobs read from STDIN, write to STDOUT (see the local simulation sketched below).
- The framework guarantees that a given reduce job will process an entire set of keys (i.e. the key ‘marmot’ will not be split across two nodes).
- Can use any language you want.
- Probably pretty slow, with all the STDIN/STDOUTing going on.
- Probably should use Pig instead.

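Because streaming jobs are just programs reading STDIN and writing STDOUT, you can dry-run them on one box by reproducing what the framework does between map and reduce: sort the mapper’s output by key before handing it to the reducer. A minimal sketch, assuming the two scripts above are saved as mapper.rb and reducer.rb and that input.txt is some local text file (both file names are my own):

#!/usr/bin/ruby
# Local dry run of the streaming word count: mapper -> sort by key -> reducer.

# Run the mapper over the input file and capture its "<word> 1" lines.
mapped = `ruby mapper.rb < input.txt`.lines

# Stand in for the framework's shuffle/sort: put identical keys next to each other.
sorted = mapped.sort.join

# Pipe the sorted pairs into the reducer and print whatever it emits.
IO.popen("ruby reducer.rb", "r+") do |reducer|
  reducer.write(sorted)
  reducer.close_write
  puts reducer.read
end
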
Act IV: Amazon Elastic Map Reduce

So I’ve Got This Pile Of Data, Now What?

Buy A Bunch Of Servers?

Elastic Map Reduce
- Cloudy Hadoop.
- Pay for processing time by the hour.
- Works with streaming mode, regular mode, Pig.
- Kinda sorta demonstration!

Tips!
- Make sure to turn debugging on! Seriously, otherwise, world of pain.
- Don’t use the console for anything complicated. Use the Ruby client (just google it).
- For multi-step MRing, don’t write out to S3.