MapReduce
Talk given by Michael Bevilaqua-Linn at Philly.rb on March 9th, 2010

Transcript

  • 1. MapReduce: A Gentle Introduction, In Four Acts
  • 2. Act I Introduction
  • 3.
    • Map is a higher order procedure that takes as its arguments a procedure of one argument and a list (a toy version is sketched after this slide).
    What is Map
    >> l = (1..10)
    => 1..10
    >> l.map { |i| i + 1 }
    => [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
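    Not from the slides, but to make ‘a procedure of one argument and a list’ concrete, here is a minimal hand-rolled map in Ruby (the name my_map is made up for illustration):

    # Toy re-implementation of map: walk the list, apply the one-argument
    # block to each element, and collect the results in order.
    def my_map(list, &block)
      result = []
      list.each { |element| result << block.call(element) }
      result
    end

    my_map(1..10) { |i| i + 1 }  # => [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]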
  • 4.
    • Reduce is a higher order procedure that takes as its arguments a procedure of two arguments and a list (a toy version is sketched after this slide).
    • It goes by some other names; in Ruby, it’s inject.
    What is Reduce
    >> l = (1..10)
    => 1..10
    >> l.inject { |i, j| i + j }
    => 55
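    Similarly (again not from the slides, and my_inject is a made-up name), a minimal hand-rolled reduce that folds the list with a two-argument block:

    # Toy re-implementation of reduce/inject: fold the list into a single
    # value by repeatedly applying the two-argument block to an accumulator.
    def my_inject(list, &block)
      accumulator = nil
      list.each do |element|
        accumulator = accumulator.nil? ? element : block.call(accumulator, element)
      end
      accumulator
    end

    my_inject(1..10) { |i, j| i + j }  # => 55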
  • 5.
    • An algorithm inspired by map and reduce, used to perform ‘embarrassingly parallel’ computations.
    • A framework based on that algorithm, used inside Google.
    • A handy way to deal with large (like, really, really large) amounts of semi-structured data.
    What Is MapReduce
  • 6. Semi-Structured Data?
  • 7. The Web Is Kind Of A Mess
  • 8. But There Is Some Order
    <html>
      <head>
        <title> Marmots I’ve Loved </title>
      </head>
      <body>
        <h1> Marmot List </h1>
        <ul>
          <li> Marcy </li>
          <li> Stacy </li>
        </ul>
      </body>
    </html>

    12:00:23 GET /marmots/index.html
    12:00:55 GET /marmots/stacy.jpg
    12:00:67 GET /marmots/marcy.jpg
  • 9.
    • So, what do you do if you’ve got gigabytes (or terabytes) of this sort of data and you want to analyze it?
    • You could buy a distributed data warehouse. Pricy!
    • And you still need to do ETL for everything.
    • And you’ve got nulls all over the place.
    • And maybe your schema changes. A lot.
    But What To Do With It?
  • 10. Act II Enter Stage Left – MapReduce
  • 11.
    • Conceptually, it’s easy to make map parallel.
    • If you have 10 million records and 10 nodes, send 1 million records to each node along with the map code.
    • That’s it!
    • Well, not really. It’s a hard engineering problem (you need a distributed data store to hold results, nodes fail, and on and on…). A rough single-machine sketch of the conceptual part follows this slide.
    What Is Map, Part Deux
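    As promised above, a rough single-machine sketch of the conceptual part (not from the talk): chop the records into chunks and map each chunk on its own thread, standing in for shipping chunks to separate nodes. All of the genuinely hard parts (distributed storage, node failure, retries) are ignored.

    # Toy 'parallel map': split the records into chunks and map each chunk on
    # its own thread (a stand-in for sending work to separate nodes).
    def parallel_map(records, chunks, &block)
      slice_size = (records.size / chunks.to_f).ceil
      threads = records.each_slice(slice_size).map do |chunk|
        Thread.new { chunk.map(&block) }
      end
      threads.flat_map(&:value)  # Thread#value waits for the thread and returns its result
    end

    parallel_map((1..10).to_a, 3) { |i| i + 1 }
    # => [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]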
  • 12.
    • Reduce is harder: you can’t, in general, split the list up among nodes and recombine the results. Evaluation order matters!
    • (1 / 2 / 3 / 4) != (1 / 2) / (3 / 4)
    • But what if we constrain ourselves to work only on key-value pairs?
    • Then we can distribute all the records that correspond to a particular key to the same node, and get an answer for that key (a grouping sketch follows this slide).
    What Is Reduce, Part Deux
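    A local sketch of that key-partitioning idea (not from the slides): group the (key, value) pairs by key, then each group can be reduced on its own, which is exactly what lets different nodes handle different keys.

    # Toy 'shuffle' plus per-key reduce: group key/value pairs by key, then
    # reduce each group independently. In a real system each key's group
    # could be handed to a different node.
    pairs = [['marmot', 1], ['weasel', 1], ['marmot', 1]]

    grouped = pairs.group_by { |key, _value| key }
    counts  = grouped.map do |key, key_value_pairs|
      [key, key_value_pairs.map { |_key, value| value }.inject(:+)]
    end

    counts  # => [["marmot", 2], ["weasel", 1]]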
  • 13.
    • Now we’re back in the same place we were with map: conceptually easy to make parallel, still a hard engineering problem.
    • But how useful is it?
    What Is Reduce, Part Deux, Part Deux
  • 14. MapReduce Pseudocode: Distributed Word Count*
    *This example is legally required to be in all introductions to MapReduce

    map(record)
      words = split(record, ‘ ‘)
      for word in words
        emit(word, 1)

    reduce(key, values)
      count = 0
      for value in values
        count += 1
      emit(key, count)
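    The pseudocode above translates almost directly into plain Ruby; here is a single-machine walk-through (no Hadoop involved) of the map, shuffle, and reduce steps:

    # Single-machine word count: map emits (word, 1) pairs, the 'shuffle'
    # groups them by word, and reduce counts each group.
    records = ['a marmot is a marmot', 'a weasel is not']

    mapped   = records.flat_map { |record| record.split(' ').map { |word| [word, 1] } }
    shuffled = mapped.group_by { |word, _one| word }
    reduced  = shuffled.map { |word, pairs| [word, pairs.size] }

    reduced.sort
    # => [["a", 3], ["is", 2], ["marmot", 2], ["not", 1], ["weasel", 1]]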
  • 15. Act III Hadoop (Streaming Mode)
  • 16. Hadoop!
    • Apache umbrella project (what isn’t, nowadays?)
    • An open source MapReduce implementation, a distributed filesystem (HDFS), a non-relational data store (HBase), and a declarative language for processing semi-structured data (Pig).
    • I’ve really only used the MapReduce implementation, in ‘Streaming Mode’.
  • 17. MapReduce Mapper: Distributed Word Count*
    *This example is legally required to be in all introductions to MapReduce

    #!/usr/bin/ruby
    # Mapper: emit "<word> 1" for every word on every input line.
    STDIN.each_line do |line|
      words = line.split(' ')
      words.each { |word| puts "#{word} 1" }
    end
  • 18. MapReduce Reducer: Distributed Word Count*
    *This example is legally required to be in all introductions to MapReduce

    #!/usr/bin/ruby
    # Reducer: input arrives sorted by key, so keep a running count and
    # flush it whenever the key changes.
    count = 0
    current_word = nil
    STDIN.each_line do |line|
      key, value = line.split(" ")
      current_word = key if current_word.nil?
      if key != current_word
        puts "#{current_word} #{count}"
        count = 0
        current_word = key
      end
      count += value.to_i
    end
    puts "#{current_word} #{count}"
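    Since streaming jobs just read STDIN and write STDOUT, you can smoke-test the mapper/reducer pair locally with ordinary Unix pipes, with sort standing in for Hadoop’s shuffle (the file names here are made up):

    cat input.txt | ruby mapper.rb | sort | ruby reducer.rb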
  • 19. Streaming Mode
    • Jobs read from STDIN and write to STDOUT (an example invocation follows this slide).
    • The framework guarantees that a given reduce job will process an entire set of keys (i.e., the key ‘marmot’ will not be split across two nodes).
    • Can use any language you want
    • Probably pretty slow, with all the STDIN/STDOUTing going on
    • Probably should use Pig instead
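    For reference, a typical streaming invocation looks roughly like this; the jar location and the HDFS paths are placeholders and vary by Hadoop install and version:

    hadoop jar hadoop-streaming.jar \
      -input   /wordcount/input \
      -output  /wordcount/output \
      -mapper  mapper.rb \
      -reducer reducer.rb \
      -file    mapper.rb \
      -file    reducer.rb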
  • 20. Act IV Amazon Elastic Map Reduce
  • 21. So I’ve Got This Pile Of Data, Now What?
  • 22. Buy A Bunch Of Servers?
  • 23.  
  • 24. Elastic Map Reduce
    • Cloudy Hadoop
    • Pay for processing time by the hour.
    • Works with streaming mode, regular mode, and Pig.
    • Kinda sorta demonstration!
  • 25. Tips!
    • Make sure to turn debugging on! Seriously, otherwise, world of pain.
    • Don’t use the console for anything complicated. Use the Ruby client (just Google it); an example invocation follows below.
    • For multi-step MapReduce runs, don’t write out to S3 between steps.
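    For the Ruby client mentioned above (Amazon’s old elastic-mapreduce command-line tool), launching a streaming job looked roughly like this; the flags are from memory and the bucket paths are placeholders, so check the client’s own help output:

    elastic-mapreduce --create --stream \
      --mapper  s3://my-bucket/wordcount/mapper.rb \
      --reducer s3://my-bucket/wordcount/reducer.rb \
      --input   s3://my-bucket/wordcount/input \
      --output  s3://my-bucket/wordcount/output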