Ruby on Hadoop

Introduction to Hadoop, as well as a brief overview of the Wukong and wukong-hadoop gems

Published in: Technology

Transcript of "Ruby on Hadoop"

  1. Ruby on Hadoop (Tuesday, January 8, 2013)
  2. Introduction: Hi. I’m Ted O’Meara ...and I just quit my job last week. @tomeara, tedomeara.com
  3. MapReduce
  4. History of MapReduce
     • First implemented by Google
     • Used in CouchDB, Hadoop, etc.
     • Helps to “distill” data into a concentrated result set
  5. What is MapReduce?
  6. What is MapReduce?

         input = ["deer", "bear", "river", "car", "car", "river", "deer", "car", "bear"]
         input.map! { |x| [x, 1] }

         sum = 0
         input.each do |x|
           sum += x[1]
         end
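The slide's snippet shows the map step (each word becomes a `[word, 1]` pair) and then adds every count into one total. A minimal sketch of a per-word count, assuming a grouping step between map and reduce, could look like this. The method names `map_words`, `shuffle`, and `reduce_counts` are illustrative and do not come from the talk.

```ruby
# Map: emit a [word, 1] pair for every input word.
def map_words(words)
  words.map { |word| [word, 1] }
end

# Shuffle: group the emitted pairs by their key (the word).
def shuffle(pairs)
  pairs.group_by { |word, _count| word }
end

# Reduce: sum the counts collected for each key.
def reduce_counts(grouped)
  grouped.map { |word, pairs| [word, pairs.sum { |_word, count| count }] }.to_h
end

input = ["deer", "bear", "river", "car", "car", "river", "deer", "car", "bear"]
counts = reduce_counts(shuffle(map_words(input)))
puts counts.inspect
```

With the slide's input, "car" maps to 3 and the other three words map to 2 each.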
  7. Hadoop Breakdown
  8. History of Hadoop
     • Doug Cutting @ Yahoo!
     • It is a Toy Elephant
     • It is also a framework for distributed computing
     • It is a distributed filesystem
  9. Network Topology
  10. Hadoop Cluster: Cluster
      • Commodity hardware
      • Partition tolerant
      • Network-aware (rack-aware)
      [Diagram: subnets 555.555.1.*, 555.555.2.*, and 444.444.1.*, with one JobTracker, one NameNode, and many TaskTracker/DataNode machines]
  11. Hadoop Cluster: NameNode
      • Keeps track of the DataNodes
      • Uses a “heartbeat” to determine a node’s health
      • Most of the resources should be spent here
      [Cluster diagram repeated, with a heartbeat (♥) marked]
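The heartbeat idea can be sketched as a small bookkeeping class. This is only an illustration of the concept, not how Hadoop's NameNode is implemented; the class name `HeartbeatMonitor`, the node names, and the 10-second timeout are all assumptions made for the example.

```ruby
# Tracks the last heartbeat seen from each DataNode and reports which
# nodes have gone quiet. The timeout is an illustrative value.
class HeartbeatMonitor
  def initialize(timeout_seconds)
    @timeout = timeout_seconds
    @last_seen = {}
  end

  # Record a heartbeat from a node at the given time (in seconds).
  def heartbeat(node, now)
    @last_seen[node] = now
  end

  # A node is healthy if its last heartbeat falls within the timeout.
  def healthy?(node, now)
    @last_seen.key?(node) && (now - @last_seen[node]) <= @timeout
  end

  # Known nodes whose last heartbeat is older than the timeout.
  def dead_nodes(now)
    @last_seen.keys.reject { |node| healthy?(node, now) }
  end
end

monitor = HeartbeatMonitor.new(10)
monitor.heartbeat("datanode-1", 0)
monitor.heartbeat("datanode-2", 0)
monitor.heartbeat("datanode-1", 8)
puts monitor.dead_nodes(15).inspect
```

Here "datanode-2" last reported at time 0, so at time 15 it is past the timeout, while "datanode-1" reported again at time 8 and still counts as healthy.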
  12. Hadoop Cluster: DataNode
      • Stores filesystem blocks
      • Can be scaled: spun up or down
      • Blocks are replicated based on a set replication factor
      [Cluster diagram repeated]
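To make the replication factor concrete, here is a toy placement routine that puts each block on a fixed number of distinct nodes in round-robin order. It is a sketch under simplifying assumptions: real HDFS placement also takes racks into account, and the method name `place_blocks` and the block and node names are hypothetical.

```ruby
# Assign each block to `replication` distinct nodes, round-robin.
# This ignores racks and disk usage; it only illustrates the factor.
def place_blocks(block_ids, nodes, replication)
  raise ArgumentError, "replication exceeds node count" if replication > nodes.size

  block_ids.each_with_index.map do |block, index|
    replicas = (0...replication).map { |offset| nodes[(index + offset) % nodes.size] }
    [block, replicas]
  end.to_h
end

placement = place_blocks(%w[blk_0 blk_1 blk_2], %w[node-a node-b node-c node-d], 3)
placement.each { |block, replicas| puts "#{block}: #{replicas.join(', ')}" }
```

With a replication factor of 3, losing any single node still leaves two copies of every block.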
  13. Hadoop Cluster: JobTracker
      • Decides which TaskTrackers should handle a MapReduce job
      • Communicates with the NameNode to assign a TaskTracker close to the DataNode where the source data exists
      [Cluster diagram repeated, with a heartbeat (♥) marked]
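The "close to the data" rule can be sketched as a ranking over free TaskTrackers. This is a simplified model rather than the JobTracker's actual scheduler; the `distance` levels, the hash keys `:host` and `:rack`, and the worker and rack names are assumptions made for illustration.

```ruby
# Rank a candidate TaskTracker by how close it is to a DataNode:
# 0 = same host, 1 = same rack, 2 = elsewhere in the cluster.
def distance(tracker, data_node)
  return 0 if tracker[:host] == data_node[:host]
  return 1 if tracker[:rack] == data_node[:rack]
  2
end

# Pick the free TaskTracker closest to any DataNode holding the block.
def pick_tracker(free_trackers, block_locations)
  free_trackers.min_by do |tracker|
    block_locations.map { |node| distance(tracker, node) }.min
  end
end

trackers = [
  { host: "worker-1", rack: "rack-a" },
  { host: "worker-5", rack: "rack-b" },
  { host: "worker-9", rack: "rack-c" }
]
locations = [{ host: "worker-6", rack: "rack-b" }]
chosen = pick_tracker(trackers, locations)
puts chosen[:host]
```

No free tracker sits on the host that stores the block, so the one sharing its rack ("worker-5") wins over the two in other racks.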
  14. Hadoop Cluster: TaskTracker
      • Worker for MapReduce jobs
      • The closer to the DataNode with the data, the better
      [Cluster diagram repeated]
  15. HDFS
  16. HDFS

          hadoop fs -put localfile /user/hadoop/hadoopfile

      [Cluster diagram repeated]
  17. Hadoop Streaming
  18. Hadoop Streaming

          $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
              -input "/user/me/samples/cachefile/input.txt" \
              -mapper "xargs cat" \
              -reducer "cat" \
              -output "/user/me/samples/cachefile/out" \
              -cacheArchive hdfs://hadoop-nn1.example.com/user/me/samples/cachefile/cachedir.jar#testlink \
              -jobconf mapred.map.tasks=3 \
              -jobconf mapred.reduce.tasks=3 \
              -jobconf mapred.job.name="Experiment"

      [Cluster diagram repeated]
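Hadoop Streaming runs any executable that reads lines on STDIN and writes lines on STDOUT, so the mapper and reducer can be plain Ruby. The following is a minimal word-count sketch, not code from the talk: `run_mapper` and `run_reducer` are hypothetical names, and the sort in the middle is a local stand-in for the sort-by-key that Hadoop performs between the two phases.

```ruby
require "stringio"

# Mapper: read raw text lines and emit one "word<TAB>1" line per word.
def run_mapper(input, output)
  input.each_line do |line|
    line.split.each { |word| output.puts "#{word}\t1" }
  end
end

# Reducer: lines arrive sorted by key, so equal words are adjacent.
# Emit a total whenever the key changes, and once more at the end.
def run_reducer(input, output)
  current = nil
  count = 0
  input.each_line do |line|
    word, value = line.chomp.split("\t")
    if word != current
      output.puts "#{current}\t#{count}" unless current.nil?
      current = word
      count = 0
    end
    count += value.to_i
  end
  output.puts "#{current}\t#{count}" unless current.nil?
end

# Local simulation of the pipeline: map, sort by key, reduce.
mapped = StringIO.new
run_mapper(StringIO.new("deer bear river\ncar car river\n"), mapped)
sorted = mapped.string.lines.sort.join
reduced = StringIO.new
run_reducer(StringIO.new(sorted), reduced)
puts reduced.string
```

In an actual streaming job, each function would live in its own script that calls it with `$stdin` and `$stdout`, and those scripts would be passed to `-mapper` and `-reducer`.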
  19. Hadoop Streaming (Hadoop Ecosystem)
      • Pig: Pig Latin
      • Hive: SQL-ish
      • Wukong: Ruby!
  20. Wukong
      • Infochimps
      • Currently going through heavy development
      • Use the 3.0.0.pre3 gem: https://github.com/infochimps-labs/wukong/tree/3.0.0
      • Model your jobs with wukong-hadoop: https://github.com/infochimps-labs/wukong-hadoop
  21. Wukong
      Wukong
      • Write mappers and reducers using Ruby
      • As of 3.0.0, Wukong uses “Processors”, which are Ruby classes that define map, reduce, and other tasks
      wukong-hadoop
      • A CLI to use with Hadoop
      • Created around building tasks with Wukong
      • Better than piping in the shell (you can see this with --dry_run)
  22. Wukong Processors
      • Fields are accessible through switches in shell
      • Local hand-off is made at STDOUT to STDIN

          Wukong.processor(:mapper) do

            field :min_length, Integer, :default => 1
            field :max_length, Integer, :default => 256
            field :split_on, Regexp, :default => /\s+/
            field :remove, Regexp, :default => /[^a-zA-Z0-9]+/
            field :fold_case, :boolean, :default => false

            # Emit every token of the input line that passes the length check.
            def process string
              tokenize(string).each do |token|
                yield token if acceptable?(token)
              end
            end

            private

            # Split on whitespace, strip unwanted characters, optionally downcase.
            def tokenize string
              string.split(split_on).map do |token|
                stripped = token.gsub(remove, '')
                fold_case ? stripped.downcase : stripped
              end
            end

            def acceptable? token
              (min_length..max_length).include?(token.length)
            end
          end
  23. Wukong Processors

          Wukong.processor(:reducer, Wukong::Processor::Accumulator) do
            attr_accessor :count

            # Called when a new key begins.
            def start record
              self.count = 0
            end

            # Called once for every record with the current key.
            def accumulate record
              self.count += 1
            end

            # Called when the key's group ends; emits "key<TAB>count".
            def finalize
              yield [key, count].join("\t")
            end
          end
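The reducer above defines three callbacks: `start`, `accumulate`, and `finalize`. The sketch below shows one way such callbacks could be driven over sorted records, in plain Ruby without the Wukong gem. It is an assumption about the general accumulator pattern, not Wukong's actual implementation; `CountAccumulator` and `drive` are hypothetical names.

```ruby
# Plain-Ruby stand-in for the slide's reducer; it does not use Wukong.
class CountAccumulator
  attr_accessor :key, :count

  def start(_record)
    self.count = 0
  end

  def accumulate(_record)
    self.count += 1
  end

  def finalize
    yield [key, count].join("\t")
  end
end

# Hypothetical driver: walks sorted records, calling start on each new
# key, accumulate on every record, and finalize when a group ends.
def drive(accumulator, sorted_records)
  results = []
  sorted_records.each do |record|
    if accumulator.key != record
      accumulator.finalize { |line| results << line } unless accumulator.key.nil?
      accumulator.key = record
      accumulator.start(record)
    end
    accumulator.accumulate(record)
  end
  accumulator.finalize { |line| results << line } unless accumulator.key.nil?
  results
end

lines = drive(CountAccumulator.new, %w[dog dog doing dont dont dont])
puts lines
```

Because the records arrive sorted, the accumulator only ever holds the count for one key at a time.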
  24. Wukong Processors

          wu-hadoop /home/hduser/wukong-hadoop/examples/word_count.rb \
              --mode=local \
              --input=/home/hduser/simpsons/simpsonssubs/Simpsons [1.08].sub

      Simpsons - Ep 8

          do        7
          Doctor    1
          Does      2
          doesnt    1
          dog       2
          Doh       1
          doif      1
          doing     2
          done      1
          doneYou   1
          dont      10
          Dont      1
  25. The End: Thank you! @tomeara, ted@tedomeara.com