Using Ruby to do Map/Reduce with Hadoop

Ruby user group presentation introducing Map/Reduce and how to use Ruby to process data with Map/Reduce on Hadoop clusters.

Usage Rights

© All Rights Reserved

    Presentation Transcript

    • Using Ruby to do Map/Reduce with Hadoop (5/12/11, James Kebinger)
    • Agenda
      • Introduction/Who am I?
      • Map/Reduce basics
      • Wukong
      • Hadoop and Amazon Elastic Map Reduce
      • Real Examples
      • Pig, Other Tools
    • Introduction
      • James Kebinger
      • Software Engineer/Data Analyst
      • Data Team at PatientsLikeMe
      • Ruby, SQL, R, Hadoop/Pig
      • jkebinger@gmail.com
      • @monkeyatlarge on twitter
      • Blogs at Monkeyatlarge.com
    • Big Data?
      • Flexibility is key
      • Keep the whole haystack, figure out the needles later
      • No need to plan which fields to keep ahead of time
      • Store everything you can afford to, on cheap storage
      • Be able to get answers before you forget the question
    • Examples
      • Extract session summaries from archived weblogs
      • Match patients to treatment centers en masse to get an estimate of the distribution
    • Map/Reduce Ruby Style

        nums = [1, 2, 4, 5, 6]
        by_2 = nums.map { |num| num * 2 }
        sums_by_2 = by_2.reduce { |memo, obj| memo + obj }
    • Map/Reduce  Mapper transforms or filters input   Emits 0 to n outputs for each line/record input   Each output starts with the Key, followed by 0 or more values Monkey Monkey 1 foo bar Zebra 1 zebra dog Dog 1 Map Sort Cat dog Cat 1 baz foo Zebra 1 Dog 15/12/11 James Kebinger
    • Map/Reduce
      [Diagram: the same map-then-sort flow, produced by the mapper below]

        STOP_WORDS = %w{foo bar baz}
        ARGF.each_line do |line|
          line.split(/\s+/).each do |word|
            puts [word, 1].join("\t") unless STOP_WORDS.include? word
          end
        end
    • Map/Reduce  Reducer receives input sorted by key(s)   Applies an operation to all data with the same key   Emits the result Cat 1 Dog 1 Cat 1 Dog 1 Dog 2 Sort Reduce Monkey 1 Monkey 1 Zebra 1 Zebra 2 Zebra 15/12/11 James Kebinger
    • Map/Reduce

        last_word = nil; current_count = 0
        ARGF.each do |line|
          word, count = line.split("\t")
          if word != last_word
            puts [last_word, current_count].join("\t") unless last_word.nil?
            last_word = word
            current_count = 0
          end
          current_count += count.to_i
        end
        puts [last_word, current_count].join("\t") unless last_word.nil?
    • With Unix Pipes

        cat SOMEFILE | ruby wordcount-simple-mapper.rb | sort | ruby wordcount-simple-reducer.rb
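    The same map, sort, reduce flow can be simulated in-process with plain Ruby; a minimal sketch (the input lines here are illustrative, not from the slides):

    ```ruby
    # Simulate the shell pipeline (map | sort | reduce) in one Ruby script.
    STOP_WORDS = %w{foo bar baz}

    lines = ["monkey foo bar", "zebra dog", "cat dog", "baz monkey"]

    # Map: emit a [word, 1] pair for every non-stop word
    pairs = lines.flat_map { |line| line.split(/\s+/) }
                 .reject { |word| STOP_WORDS.include?(word) }
                 .map { |word| [word, 1] }

    # Sort then reduce: group pairs by key (as `sort` would arrange them
    # before the reducer) and sum the counts for each word
    counts = pairs.sort.group_by(&:first)
                  .map { |word, ps| [word, ps.map(&:last).sum] }

    counts.each { |word, count| puts [word, count].join("\t") }
    # => cat 1, dog 2, monkey 2, zebra 1
    ```

    The point of writing the mapper and reducer as separate scripts, rather than one program like this, is that Hadoop Streaming can then run each half in parallel across a cluster.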
    • Enter Wukong
      • Project out of InfoChimps
      • Deals with much of the minutiae
      • Base classes for mappers and reducers
      • Lets your script deal with lines, records or objects as needed
      • Nice logging showing data flow
      • Support for launching locally, on Amazon EMR or your own Hadoop cluster
      • https://github.com/mrflip/wukong
    • Wukong Word Count

        module WordCount
          class Mapper < Wukong::Streamer::LineStreamer
            def process line
              words = line.strip.split(/\W+/).reject(&:blank?)
              words.each { |word| yield [word, 1] }
            end
          end

          class Reducer < Wukong::Streamer::ListReducer
            def finalize
              yield [key, values.map(&:last).map(&:to_i).sum]
            end
          end
        end

        Wukong::Script.new(WordCount::Mapper, WordCount::Reducer).run
    • Data Accumulation

        class Reducer < Wukong::Streamer::AccumulatingReducer
          def start! word, count
            @word = word
            @count = 0
          end

          def accumulate word, count
            @count += count.to_i
          end

          def finalize
            yield [@word, @count]
          end
        end
    • Super Size Me
    • Hadoop   Hadoop is the Java implementation of Map/Reduce   Industrial scale   Tools to manage clusters of machines   Distributed File System   Move data around   Move your scripts to the data   Used to need Java   Hadoop Streaming allows data to be piped through scripts in any language   Map-reduce becoming a little bit like assembly5/12/11 James Kebinger
    • Hadoop Data Flow
    • Amazon EMR
      • Amazon Elastic MapReduce
      • Stick your data and code in S3
      • Pick how many machines and how big
      • Light up a cluster, process data and shut it down
    • Same Code, Two Speeds

        ruby wukong_test.rb --run=local input_file output

      • Changing the run argument from local* to:
        • "hadoop" to run on a Hadoop cluster
        • "emr" to run on Amazon Elastic Map Reduce

      * And a few more arguments for EMR:

        ruby wukong_test.rb --run=emr --key_pair=jamesk-mbp --emr_root=s3://jkk-plm \
          --instance-type=c1.medium --num-instances=4 \
          s3://jkk-plm/testing/input/* s3://jkk-plm/testing/output
    • For Comparison

        elastic-mapreduce --create --name=myjob --key-pair=jamesk-mbp \
          --slave-instance-type=c1.medium --num-instances 3 \
          --bootstrap-action=s3://MYBUCKET/emr_bootstrap.sh \
          --credentials LOCAL_PATH_TO/credentials.json \
          --log-uri=s3://MYBUCKET/logs --stream \
          --cache s3n://MYBUCKET/code/my_mapper.rb \
          --cache s3n://MYBUCKET/code/my_reducer.rb \
          --mapper=/usr/bin/ruby1.8 my_mapper.rb \
          --reducer=/usr/bin/ruby1.8 my_reducer.rb \
          --input=s3n://MYBUCKET/input/*.bz2 \
          --output=s3n://MYBUCKET
    • Closest Location (map)

        class Mapper < Wukong::Streamer::RecordStreamer
          def process person_id, person_lat, person_lon, location_id, location_lat, location_lon
            distance = haversine_distance(person_lat.to_f, person_lon.to_f,
                                          location_lat.to_f, location_lon.to_f)
            yield [person_id, distance, location_id] if distance < 3000
          end
        end
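    The haversine_distance helper the mapper calls is not shown in the transcript; a standard great-circle implementation looks like the sketch below (the method name matches the slide, but the kilometre units and earth-radius constant are assumptions):

    ```ruby
    include Math  # expose sin, cos, asin, sqrt, PI without the Math. prefix

    EARTH_RADIUS_KM = 6371.0

    # Great-circle distance between two (lat, lon) points in degrees,
    # returned in kilometres.
    def haversine_distance(lat1, lon1, lat2, lon2)
      dlat = (lat2 - lat1) * PI / 180
      dlon = (lon2 - lon1) * PI / 180
      a = sin(dlat / 2)**2 +
          cos(lat1 * PI / 180) * cos(lat2 * PI / 180) * sin(dlon / 2)**2
      2 * EARTH_RADIUS_KM * asin(sqrt(a))
    end
    ```

    For example, haversine_distance(42.36, -71.06, 40.71, -74.01) comes out to roughly 306 km (Boston to New York).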
    • Closest Location (reduce)

        class Reducer < Wukong::Streamer::AccumulatingReducer
          def start! person_id, distance, location_id
            @person_id = person_id
            @closest_distance = distance
            @closest_location_id = location_id
          end

          def accumulate person_id, distance, location_id
            if @closest_distance > distance
              @closest_distance = distance
              @closest_location_id = location_id
            end
          end

          def finalize
            yield [@person_id, @closest_location_id, @closest_distance]
          end
        end
    • Sort and Secondary Sort
      • Data is sorted before the reduce phase by key, then partitioned
      • Take advantage of that to get sorted input to each reducer
      • Split and sort by different keys using a couple of arguments
      • Add args to launch:
        • num.key.fields.for.partition=1
        • stream.num.map.output.key.fields=2
        • partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

        Input:            Sorted & partitioned:
        A12 45.6 B12      A12 3.4  B13
        A12 3.4  B13      A12 34.3 B10
        A12 34.3 B10      A12 45.6 B12
        A27 20.1 B11      A27 20.1 B11
        A99 23.0 B10      A99 20.1 B11
        A99 20.1 B11      A99 23.0 B10
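    The effect of those streaming options (partition on the first key field, sort on the first two) can be sketched in plain Ruby using the rows from the slide:

    ```ruby
    rows = [
      %w{A12 45.6 B12},
      %w{A12 3.4  B13},
      %w{A12 34.3 B10},
      %w{A27 20.1 B11},
      %w{A99 23.0 B10},
      %w{A99 20.1 B11},
    ]

    # stream.num.map.output.key.fields=2: sort on the first two fields
    sorted = rows.sort_by { |id, dist, loc| [id, dist.to_f] }

    # num.key.fields.for.partition=1: each reducer gets one first-field key,
    # with its rows already arriving in distance order
    partitions = sorted.group_by(&:first)

    partitions.each do |key, part|
      puts "#{key}: #{part.map { |r| r.join(' ') }.join(' | ')}"
    end
    ```

    In real Hadoop Streaming the sort and partition happen in the shuffle, not in your script; this sketch only shows what ordering the reducer can rely on.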
    • Session Splitting Example
      • Get smarter later
      • How long do logged-in users of a certain class spend on the site?
      • Scan weblogs in time order for each user, splitting when clicks are at least N minutes apart
      • With a little Ruby and a credit card, you can know the answer in a few hours
    •   class Reducer < Wukong::Streamer::AccumulatingReducer
          GAP = 0.5 # hours
          attr_accessor :sessions

          def start! *args
            @sessions = []
          end

          def accumulate user_id, date_str, page_name
            date = DateTime.parse(date_str)
            curr_session = sessions.last
            if curr_session.nil? || (date - curr_session[:end]) * 24 > GAP
              curr_session = { :start => date, :count => 0, :pages => Hash.new(0) }
              sessions << curr_session
            end
            curr_session[:end] = date
            curr_session[:count] += 1
            curr_session[:pages][page_name] += 1
          end

          def finalize
            sessions.each do |s|
              yield [key, s[:start].s[:end] ….
            end
          end
        end
    • Gotchas
      • Mappers run one per file, so upload data in chunks
      • Except LZO, which needs an index. Bzip splitting around the corner?
      • Getting dependencies onto all the machines
        • Bootstrap scripts
        • Package files into the cache; they will be symlinked into the working directory
      • If your job crashes, you pay for the whole hour
    • Bootstrap Script

        sudo apt-get update
        sudo apt-get -y install ruby1.8-dev
        wget https://jkk-plm.s3.amazonaws.com/bootstrap-deps/rubygems-1.7.2.tgz
        tar xvfz rubygems-1.7.2.tgz
        cd rubygems-1.7.2 && sudo ruby setup.rb
        sudo gem install wukong json --no-rdoc --no-ri
    • Orchestration
      • Examples are still at the level of map/reduce
      • Sophisticated workflows have many maps and reduces
      • Growing selection of tools to move up a level of abstraction
      • Write a data flow, compile it down into a series of maps and reduces
    • Apache Pig
      • Data types: Int/Float/Date/String, Tuple, Bag
      • High-level operations to join, filter, sort and group data
      • Min/Max/Count
      • And you can still call out to scripting languages…
      • Handy tool even running on one machine
    • Haversine Distance, Pig
    • Other High-Level Tools
      • Hive: like SQL
      • Cascading: pipe metaphor. Java, but works with JRuby
    • Further Reading
      • Data Recipes Blog: thedatachef.blogspot.com
      • Other tools for Ruby, Python
      • MRTool
      • List on wukong github
    • Thank You
      • James Kebinger
      • jkebinger@gmail.com