Using Ruby to do Map/Reduce with Hadoop


Ruby user group presentation introducing map/reduce and how to use Ruby to process data with map/reduce on Hadoop clusters.


  • 1. Using Ruby to do Map/Reduce with Hadoop (James Kebinger, 5/12/11)
  • 2. Agenda
    - Introduction/Who am I?
    - Map/Reduce basics
    - Wukong
    - Hadoop and Amazon Elastic Map Reduce
    - Real Examples
    - Pig, Other Tools
  • 3. Introduction
    - James Kebinger
    - Software Engineer/Data Analyst
    - Data Team at PatientsLikeMe
    - Ruby, SQL, R, Hadoop/Pig
    - @monkeyatlarge on twitter
    - Blogs at Monkeyatlarge.com
  • 4. Big Data?
    - Flexibility is key
    - Keep the whole haystack, figure out the needles later
    - Don't need to plan what fields to keep ahead of time
    - Store everything you can afford to, on cheap storage
    - Be able to get answers before you forget the question
  • 5. Examples
    - Extract session summaries from archived weblogs
    - Match patients to treatment centers en masse to get an estimate of the distribution
  • 6. Map/Reduce Ruby Style

      nums = [1, 2, 4, 5, 6]
      by_2 = nums.map { |num| num * 2 }
      sums_by_2 = by_2.reduce { |memo, obj| memo + obj }  # => 36
  • 7. Map/Reduce
    - Mapper transforms or filters input
    - Emits 0 to n outputs for each line/record of input
    - Each output starts with the key, followed by 0 or more values

    [Diagram: input lines such as "foo bar", "zebra dog", "Cat dog", "baz foo"
    flow through Map, which emits one (word, 1) pair per non-stop word
    (Monkey 1, Zebra 1, Dog 1, Cat 1, Zebra 1, Dog 1); the pairs are then
    sorted by key]
  • 8. Map/Reduce
    [Same word-count diagram as the previous slide, with the mapper code:]

      STOP_WORDS = %w{foo bar baz}
      ARGF.each_line do |line|
        line.split(/\s+/).each do |word|
          # emit "word<TAB>1" for every word that isn't a stop word
          puts [word, 1].join("\t") unless STOP_WORDS.include? word
        end
      end
  • 9. Map/Reduce
    - Reducer receives input sorted by key(s)
    - Applies an operation to all data with the same key
    - Emits the result

    [Diagram: sorted pairs (Cat 1, Dog 1, Dog 1, Monkey 1, Zebra 1, Zebra 1)
    flow through Reduce, which emits (Cat 1, Dog 2, Monkey 1, Zebra 2)]
  • 10. Map/Reduce
    [Same sort/reduce diagram as the previous slide, with the reducer code:]

      last_word = nil; current_count = 0
      ARGF.each do |line|
        word, count = line.split("\t")
        if word != last_word
          # key changed: emit the total for the previous word
          puts [last_word, current_count].join("\t") unless last_word.nil?
          last_word = word
          current_count = 0
        end
        current_count += count.to_i
      end
      # emit the total for the final word
      puts [last_word, current_count].join("\t") unless last_word.nil?
  • 11. With Unix Pipes

      cat SOMEFILE | ruby wordcount-simple-mapper.rb | sort | ruby wordcount-simple-reducer.rb
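    The same pipeline can be simulated in-process to see the mechanics; a
    minimal plain-Ruby sketch (the two sample input lines are made up for
    illustration):

      # map, sort, reduce over two made-up lines, mirroring the pipeline above
      lines  = ["zebra dog", "cat dog"]
      mapped = lines.flat_map { |l| l.split(/\s+/).map { |w| [w, 1] } }
      sorted = mapped.sort_by(&:first)
      counts = sorted.chunk_while { |a, b| a.first == b.first }
                     .map { |run| [run.first.first, run.sum(&:last)] }
      counts.each { |word, n| puts [word, n].join("\t") }
      # prints: cat 1, dog 2, zebra 1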
  • 12. Enter Wukong
    - Project out of InfoChimps
    - Deals with much of the minutiae
    - Base classes for mappers and reducers
    - Lets your script deal with lines, records or objects as needed
    - Nice logging showing data flow
    - Support for launching locally, on Amazon EMR or on your own Hadoop cluster
  • 13. Wukong Word Count

      module WordCount
        class Mapper < Wukong::Streamer::LineStreamer
          def process line
            words = line.strip.split(/\W+/).reject(&:blank?)
            words.each { |word| yield [word, 1] }
          end
        end

        class Reducer < Wukong::Streamer::ListReducer
          def finalize
            # the summed value was cut off on the slide; this line follows
            # the standard Wukong word-count example
            yield [ key, values.map(&:last).map(&:to_i).inject(0, :+) ]
          end
        end
      end

      Wukong::Script.new(WordCount::Mapper, WordCount::Reducer).run
  • 14. Data Accumulation

      class Reducer < Wukong::Streamer::AccumulatingReducer
        # called at the start of each new key
        def start! word, count
          @word = word
          @count = 0
        end

        # called once per record
        def accumulate word, count
          @count += count.to_i
        end

        # called when the key changes or the input ends
        def finalize
          yield [ @word, @count ]
        end
      end
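    The lifecycle those three hooks follow can be shown with a small,
    self-contained driver; this is an assumed illustration of the contract,
    not Wukong's actual internals (TinyAccumulator and the sample records are
    made up):

      # Assumed sketch: start! on each new key, accumulate per record,
      # finalize when the key changes or the input ends
      class TinyAccumulator
        def start!(word, count)
          @word, @count = word, 0
        end
        def accumulate(word, count)
          @count += count.to_i
        end
        def finalize
          puts [@word, @count].join("\t")
        end
      end

      reducer, last_word = TinyAccumulator.new, nil
      [["cat", 1], ["dog", 1], ["dog", 2]].each do |word, count|
        reducer.finalize if last_word && last_word != word
        reducer.start!(word, count) if last_word != word
        reducer.accumulate(word, count)
        last_word = word
      end
      reducer.finalize if last_word   # prints: cat 1, then dog 3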
  • 15. Super Size Me
  • 16. Hadoop
    - Hadoop is the Java implementation of Map/Reduce
    - Industrial scale
    - Tools to manage clusters of machines
    - Distributed file system
      - Move data around
      - Move your scripts to the data
    - Used to need Java
      - Hadoop Streaming allows data to be piped through scripts in any language (see the sketch below)
    - Map/reduce becoming a little bit like assembly
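    The streaming contract itself is tiny: Hadoop writes input records to the
    script's standard input and reads tab-separated key/value lines back from
    its standard output. A minimal sketch (this identity mapper is an
    illustration, not code from the slides):

      # An "identity mapper" under Hadoop Streaming: pass every input
      # line through unchanged from STDIN to STDOUT
      STDIN.each_line { |line| puts line }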
  • 17. Hadoop Data Flow
    [Diagram of data flow through a Hadoop map/reduce job]
  • 18. Amazon EMR
    - Amazon Elastic MapReduce
    - Stick your data and code in S3
    - Pick how many machines and how big
    - Light up a cluster, process data and shut it down
  • 19. Same Code, Two Speeds

      ruby wukong_test.rb --run=local input_file output

    Change the run argument from "local" to:
    - "hadoop" to run on a Hadoop cluster
    - "emr" to run on Amazon Elastic Map Reduce (plus a few more arguments):

      ruby wukong_test.rb --run=emr --key_pair=jamesk-mbp --emr_root=s3://jkk-plm \
        --instance-type=c1.medium --num-instances=4 \
        s3://jkk-plm/testing/input/* s3://jkk-plm/testing/output
  • 20. For Comparison

      elastic-mapreduce --create --name=myjob \
        --key-pair=jamesk-mbp \
        --slave-instance-type=c1.medium --num-instances 3 \
        --bootstrap-action=s3://MYBUCKET/ \
        --credentials LOCAL_PATH_TO/credentials.json \
        --log-uri=s3://MYBUCKET/logs \
        --stream \
        --cache s3n://MYBUCKET/code/my_mapper.rb \
        --cache s3n://MYBUCKET/code/my_reducer.rb \
        --mapper "/usr/bin/ruby1.8 my_mapper.rb" \
        --reducer "/usr/bin/ruby1.8 my_reducer.rb" \
        --input=s3n://MYBUCKET/input/*.bz2 \
        --output=s3n://MYBUCKET
  • 21. Closest Location (map)

      class Mapper < Wukong::Streamer::RecordStreamer
        def process person_id, person_lat, person_lon, location_id, location_lat, location_lon
          distance = haversine_distance(person_lat.to_f, person_lon.to_f,
                                        location_lat.to_f, location_lon.to_f)
          yield [person_id, distance, location_id] if distance < 3000
        end
      end
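    The haversine_distance helper isn't defined anywhere in the deck; here is a
    minimal sketch of the standard great-circle formula it presumably
    implements (the kilometer radius is an assumption):

      # Assumed helper: great-circle distance in kilometers (haversine formula)
      EARTH_RADIUS_KM = 6371.0

      def haversine_distance(lat1, lon1, lat2, lon2)
        to_rad = Math::PI / 180
        dlat = (lat2 - lat1) * to_rad
        dlon = (lon2 - lon1) * to_rad
        a = Math.sin(dlat / 2)**2 +
            Math.cos(lat1 * to_rad) * Math.cos(lat2 * to_rad) * Math.sin(dlon / 2)**2
        EARTH_RADIUS_KM * 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a))
      end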
  • 22. Closest Location (reduce)

      class Reducer < Wukong::Streamer::AccumulatingReducer
        def start! person_id, distance, location_id
          @person_id = person_id
          @closest_distance = distance.to_f  # streamed values arrive as strings
          @closest_location_id = location_id
        end

        def accumulate person_id, distance, location_id
          if @closest_distance > distance.to_f
            @closest_distance = distance.to_f
            @closest_location_id = location_id
          end
        end

        def finalize
          yield [@person_id, @closest_location_id, @closest_distance]
        end
      end
  • 23. Sort and Secondary Sort
    - Data is sorted by key before the reduce phase, then partitioned
    - Take advantage of that to get sorted input to each reducer
    - Split and sort by different keys using a couple of arguments
    - Add args to launch:
      - num.key.fields.for.partition=1
      - partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

    [Diagram: rows such as A12 45.6 B12, A12 3.4 B13, A12 34.3 B10,
    A27 20.1 B11, A99 23.0 B10, A99 20.1 B11 are partitioned on the first
    field and sorted on the first two, so each reducer sees its keys' rows
    in value order, e.g. A12 3.4 B13 / A12 34.3 B10 / A12 45.6 B12]
  • 24. Session Splitting Example
    - How long do logged-in users of a certain class spend on the site?
    - Scan weblogs in time order for each user, splitting when clicks are at least N minutes apart
    - Get smarter later
    - With a little Ruby and a credit card, you can know the answer in a few hours
  • 25.

      require 'date'

      class Reducer < Wukong::Streamer::AccumulatingReducer
        GAP = 0.5 # hours

        attr_accessor :sessions

        def start! *args
          @sessions = []
        end

        def accumulate user_id, date_str, page_name
          date = DateTime.parse(date_str)
          curr_session = sessions.last
          # open a new session when there is none yet, or when the gap
          # since the last click exceeds GAP hours
          if curr_session.nil? || (date - curr_session[:end]) * 24 > GAP
            # Hash.new(0) restored here (cut off on the slide) so the
            # page-count increment below works
            curr_session = { :start => date, :count => 0, :pages => Hash.new(0) }
            sessions << curr_session
          end
          curr_session[:end] = date
          curr_session[:count] += 1
          curr_session[:pages][page_name] += 1
        end

        def finalize
          sessions.each do |s|
            yield [key, s[:start], s[:end]] # … (remaining fields cut off on the slide)
          end
        end
      end
  • 26. Gotchas
    - Mappers run one per file, so upload data in chunks
      - Except LZO, which needs an index. Bzip splitting around the corner?
    - Getting dependencies onto all the machines
      - Bootstrap scripts
      - Package files into the cache; they will be symlinked into the working directory
    - If your job crashes, you pay for the whole hour
  • 27. Bootstrap Script

      sudo apt-get update
      sudo apt-get -y install ruby1.8-dev
      wget …  # rubygems-1.7.2.tgz download URL cut off on the slide
      tar xvfz rubygems-1.7.2.tgz
      cd rubygems-1.7.2 && sudo ruby setup.rb
      sudo gem install wukong json --no-rdoc --no-ri
  • 28. Orchestration
    - Examples so far are still at the level of raw map/reduce
    - Sophisticated workflows have many maps and reduces
    - Growing selection of tools to move up a level of abstraction
    - Write a data flow, compile it down into a series of maps and reduces
  • 29. Apache Pig
    - Data types
      - Int/Float/Date/String
      - Tuple
      - Bag
    - High-level operations to join, filter, sort and group data
      - Min/Max/Count
    - And you can still call out to scripting languages…
    - Handy tool even running on one machine
  • 30. Haversine Distance, Pig
    [Pig script computing haversine distance, shown as an image on the slide]
  • 31. Other High Level Tools
    - Hive: like SQL
    - Cascading: pipe metaphor. Java, but works with JRuby
  • 32. Further Reading
    - Data Recipes blog
    - Other tools for Ruby, Python
      - MRTool
      - List on wukong github
  • 33. Thank You
    - James Kebinger
    - jkebinger@gmail.com