Map Reduce
Talk given by Michael Bevilaqua-Linn at Philly.rb on March 9th, 2010

    Presentation Transcript

    • MapReduce: A Gentle Introduction, In Four Acts
    • Act I Introduction
      • Map is a higher order procedure that takes as its arguments a procedure of one argument and a list.
      What is Map
        >> l = (1..10)
        => 1..10
        >> l.map { |i| i + 1 }
        => [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
      • Reduce is a higher order procedure that takes as its arguments a procedure of two arguments and a list.
      • It goes by other names, such as fold and accumulate. In Ruby, it’s inject.
      What is Reduce
        >> l = (1..10)
        => 1..10
        >> l.inject { |i, j| i + j }
        => 55
      • An algorithm inspired by map and reduce, used to perform ‘embarrassingly parallel’ computations.
      • A framework based on that algorithm, used inside Google.
      • A handy way to deal with large (like, really, really large) amounts of semi-structured data.
      What Is MapReduce
    • Semi-Structured Data?
    • The Web Is Kind Of A Mess
    • But There Is Some Order
      <html>
        <head>
          <title> Marmots I’ve Loved </title>
        </head>
        <body>
          <h1> Marmot List </h1>
          <ul>
            <li> Marcy </li>
            <li> Stacy </li>
          </ul>
        </body>
      </html>

      12:00:23 GET /marmots/index.html
      12:00:55 GET /marmots/stacy.jpg
      12:01:07 GET /marmots/marcy.jpg
      • So, what do you do if you’ve got gigabytes (or terabytes) of this sort of data, and you want to analyze it?
      • You could buy a distributed data warehouse. Pricy!
      • And you still need to do ETL for everything.
      • And you’ve got nulls all over the place.
      • And maybe your schema changes. A lot.
      But What To Do With It?
    • Act II Enter Stage Left – MapReduce
      • Conceptually, it’s easy to make map parallel.
      • If you have 10 million records and 10 nodes, send 1 million records to each node along with the map code.
      • That’s it!
      • Well, not really. It’s a hard engineering problem. (You need a distributed data store for the results, nodes fail, and on and on…)
      What Is Map, Part Deux
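The split-the-records-among-nodes idea above can be sketched in a few lines of Ruby. This is a toy illustration only, with threads standing in for nodes (assumption: Hadoop's actual distribution machinery is far more involved, as the slide says):

```ruby
# Toy parallel map: split the records into chunks, hand each chunk
# (plus the map code) to a worker, run the maps concurrently, and
# gather the partial results back in order.
records = (1..10_000).to_a
chunks  = records.each_slice(1_000).to_a     # one chunk per "node"

threads = chunks.map do |chunk|
  Thread.new { chunk.map { |i| i + 1 } }     # each "node" runs the same map code
end
result = threads.flat_map(&:value)           # recombine: order is preserved

puts result.size
```

Because map processes each record independently, the recombined result is identical to mapping the whole list on one machine; that independence is exactly what makes the parallel split trivial in concept.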
      • Reduce is harder: in general, you can’t split the list up among nodes and recombine the results. Evaluation order matters!
      • (1 / 2 / 3 / 4) != (1 / 2) / (3 / 4)
      • But what if we constrain ourselves to work only on key-value pairs?
      • Then we can distribute all the records that correspond to a particular key to the same node, and get an answer for that key.
      What Is Reduce, Part Deux
      • Now we’re back in the same place that we are with Map, conceptually easy to make parallel, still a hard engineering problem.
      • But how useful is it?
      What Is Reduce, Part Deux, Part Deux
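The key-value constraint can be illustrated in plain Ruby: group every record by key (the "shuffle"), then reduce each key's values independently. Each group could live on a different node, since no reduction ever needs values from two keys (the sample pairs below are made up for illustration):

```ruby
# Records as key-value pairs, e.g. words mapped to counts of 1.
pairs = [['marmot', 1], ['otter', 1], ['marmot', 1], ['marmot', 1]]

# Shuffle step: all records for a given key end up together.
by_key = pairs.group_by { |key, _| key }

# Reduce step: each key's values are summed independently.
counts = by_key.map { |key, kvs| [key, kvs.map(&:last).inject(:+)] }.to_h

p counts
```

Summing per key is order-insensitive within each group, so unlike the `(1 / 2 / 3 / 4)` example, distributing the groups cannot change the answer.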
    • MapReduce Pseudocode: Distributed Word Count*
      *This example is legally required to be in all introductions to MapReduce

      map(record)
        words = split(record, ' ')
        for word in words
          emit(word, 1)

      reduce(key, values)
        count = 0
        for value in values
          count += value
        emit(key, count)
    • Act III Hadoop (Streaming Mode)
    • Hadoop!
      • Apache umbrella project (what isn’t, nowadays?)
      • Open source MapReduce implementation, distributed filesystem (HDFS), non-relational data store (HBase), declarative language for processing semi-structured data (Pig).
      • I’ve really only used the MapReduce implementation, in ‘Streaming Mode’
    • MapReduce Mapper: Distributed Word Count*
      *This example is legally required to be in all introductions to MapReduce

      #!/usr/bin/ruby
      STDIN.each_line do |line|
        words = line.split(' ')
        words.each { |word| puts "#{word} 1" }
      end
    • MapReduce Reducer: Distributed Word Count*
      *This example is legally required to be in all introductions to MapReduce

      #!/usr/bin/ruby
      count = 0
      current_word = nil
      STDIN.each_line do |line|
        key, value = line.split(' ')
        current_word = key if current_word.nil?
        if key != current_word
          puts "#{current_word} #{count}"
          count = 0
          current_word = key
        end
        count += value.to_i
      end
      puts "#{current_word} #{count}"
    • Streaming Mode
      • Jobs read from STDIN, write to STDOUT.
      • The framework guarantees that a given reduce job will process an entire set of keys (i.e. the key ‘marmot’ will not be split across two nodes)
      • Can use any language you want
      • Probably pretty slow, with all the STDIN/STDOUTing going on
      • Probably should use Pig instead
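The streaming data flow above can be simulated locally in a few lines of Ruby. This is an assumption-laden sketch, not how Hadoop actually runs jobs: the sort in the middle stands in for the framework's shuffle, which is what guarantees a reducer sees all lines for a key together:

```ruby
# Local simulation of streaming mode: map each input line to
# "word 1" pairs, sort (Hadoop's shuffle sorts by key between the
# map and reduce phases), then count each run of identical keys.
input = ['the quick brown fox', 'the lazy dog']

mapped   = input.flat_map { |line| line.split(' ').map { |w| "#{w} 1" } }
shuffled = mapped.sort   # equal keys become adjacent, as the reducer expects

counts = shuffled
  .chunk_while { |a, b| a.split.first == b.split.first }  # runs of one key
  .map { |run| [run.first.split.first, run.size] }        # run length = sum of the 1s
  .to_h

p counts
```

This mirrors the classic local smoke test for a streaming job (`cat input | mapper | sort | reducer`): if the pipeline gives the right counts here, the same scripts should behave the same under the framework's shuffle.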
    • Act IV Amazon Elastic Map Reduce
    • So I’ve Got This Pile Of Data, Now What?
    • Buy A Bunch Of Servers?
    • Elastic Map Reduce
      • Cloudy Hadoop
      • Pay for processing time by the hour.
      • Works with streaming mode, regular mode, pig.
      • Kinda sorta demonstration!
    • Tips!
      • Make sure to turn debugging on! Seriously, otherwise, world of pain.
      • Don’t use the console for anything complicated. Use the Ruby client (just Google it).
      • For multi-step MapReduce jobs, don’t write intermediate output to S3