Ruby on Hadoop
Tuesday, January 8, 13
Introduction




                                      Hi.
                                   I’m Ted O’Meara
                         ...and I just quit my job last week.

                                    @tomeara
                                 tedomeara.com

MapReduce
History of MapReduce



        • First implemented
          by Google
        • Used in CouchDB,
          Hadoop, etc.
        • Helps to “distill” data into
          a concentrated result set




What is MapReduce?




   input = ["deer", "bear", "river", "car", "car", "river",
            "deer", "car", "bear"]

   input.map! { |x| [x, 1] }

   sum = 0
   input.each do |x|
     sum += x[1]
   end
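
The slide's snippet maps every word to a `[word, 1]` pair and then adds up all the ones, which gives the total number of words (9). A per-word count needs a grouping step between map and reduce. The following is a minimal single-process sketch of that idea, not Hadoop code; the method name `word_count` and its local variables are illustrative only.

```ruby
# Toy, single-process walk through the map -> shuffle -> reduce phases.
def word_count(words)
  # Map: emit a [word, 1] pair for every input record.
  pairs = words.map { |word| [word, 1] }

  # Shuffle: group pairs that share a key (a framework does this across nodes).
  grouped = pairs.group_by { |word, _one| word }

  # Reduce: collapse each group into a single [word, total] result.
  grouped.map { |word, group| [word, group.sum { |_word, one| one }] }.to_h
end

input = ["deer", "bear", "river", "car", "car", "river", "deer", "car", "bear"]
word_count(input).sort.each { |word, total| puts "#{word}\t#{total}" }
```

Grouping by key is what lets each reduce call see only the values for one word, so reducers for different words could run independently.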




Hadoop Breakdown
History of Hadoop



        •Doug Cutting @ Yahoo!
        •It is named after a toy elephant
        •It is also a framework for
         distributed computing
        •It is a distributed filesystem (HDFS)




Network Topology


Hadoop Cluster

                         Cluster
                         •Commodity hardware
                         •Partition tolerant
                         •Network-aware (rack-aware)



                          555.555.1.*             555.555.2.*              444.444.1.*
                              JobTracker              NameNode              TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode     TaskTracker/DataNode




Hadoop Cluster

                         NameNode
                         •Keeps track of the DataNodes and the blocks they hold
                         •Uses a “heartbeat” to determine a node’s health
                         •Give this machine the most resources
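
The heartbeat idea can be sketched as a small liveness tracker. This is a toy model rather than Hadoop's implementation: the class name `HeartbeatMonitor`, the node names, and the 30-second timeout are all hypothetical, and times are plain numbers so the example is deterministic.

```ruby
# Toy heartbeat tracker: a node is considered dead once its most recent
# heartbeat is older than the timeout. Names and values are illustrative.
class HeartbeatMonitor
  def initialize(timeout_seconds)
    @timeout = timeout_seconds
    @last_seen = {}
  end

  # Record that a node reported in at time `now` (seconds).
  def heartbeat(node, now)
    @last_seen[node] = now
  end

  # Nodes whose last heartbeat is older than the timeout.
  def dead_nodes(now)
    @last_seen.select { |_node, seen| now - seen > @timeout }.keys
  end
end

monitor = HeartbeatMonitor.new(30)
monitor.heartbeat("datanode-1", 0)
monitor.heartbeat("datanode-2", 0)
monitor.heartbeat("datanode-1", 25)
p monitor.dead_nodes(40)
```

Here `datanode-2` has been silent for 40 seconds and is reported dead, while `datanode-1` checked in 15 seconds ago and is still considered healthy.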



                         [Cluster diagram repeated from the first Hadoop Cluster slide, with a ♥ (heartbeat) marker between the NameNode's rack and a TaskTracker/DataNode in the third rack]

Hadoop Cluster

                         DataNode
                         •Stores filesystem blocks
                         •Can be scaled: nodes can be spun up or down
                         •Blocks are replicated based on a set replication factor
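
To make the replication factor concrete, here is a toy placement sketch. It is an assumption-laden illustration: the method `place_blocks` and the node names are hypothetical, and the simple round-robin rotation stands in for HDFS's actual rack-aware placement policy, which this does not reproduce.

```ruby
# Toy block placement: every block is stored on `replication` distinct nodes.
# Round-robin rotation is a stand-in for HDFS's real, rack-aware policy.
def place_blocks(block_count, nodes, replication)
  raise ArgumentError, "not enough nodes" if replication > nodes.size

  (0...block_count).map do |block|
    # Start at a different node for each block so replicas spread out.
    nodes.rotate(block).first(replication)
  end
end

nodes = %w[dn1 dn2 dn3 dn4]
place_blocks(3, nodes, 3).each_with_index do |holders, block|
  puts "block #{block}: #{holders.join(', ')}"
end
```

With a replication factor of 3, losing any single node still leaves two copies of every block it held.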



Hadoop Cluster

                         JobTracker
                         •Delegates which TaskTrackers should handle a
                          MapReduce job
                         •Communicates with the NameNode to assign a TaskTracker
                          close to the DataNode where the source data lives
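
The "close to the data" preference can be sketched as a three-step fallback: same host, then same rack, then anything free. This is a simplified model, not the JobTracker's actual scheduler; the `Node` struct, the `pick_tracker` method, and the host and rack names are hypothetical.

```ruby
# Toy locality-aware assignment. Each node has a host name and a rack name.
Node = Struct.new(:host, :rack)

# Prefer a free tracker on a host that holds a replica, then one in the
# same rack as a replica, then fall back to any free tracker.
def pick_tracker(free_trackers, replicas)
  hosts = replicas.map(&:host)
  racks = replicas.map(&:rack)

  free_trackers.find { |tracker| hosts.include?(tracker.host) } ||
    free_trackers.find { |tracker| racks.include?(tracker.rack) } ||
    free_trackers.first
end

replicas = [Node.new("a1", "rack-a"), Node.new("b2", "rack-b")]
free     = [Node.new("c1", "rack-c"), Node.new("b3", "rack-b")]
puts pick_tracker(free, replicas).host
```

No free tracker shares a host with a replica here, so the sketch falls back to `b3`, which sits in the same rack as the replica on `b2`.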


                         [Cluster diagram repeated from the first Hadoop Cluster slide, with a ♥ (heartbeat) marker between the JobTracker's rack and a TaskTracker/DataNode in the NameNode's rack]

Hadoop Cluster

                         TaskTracker
                         •Worker for MapReduce jobs
                         •The closer to the DataNode with the data, the better



HDFS


HDFS

                                           hadoop fs -put localfile /user/hadoop/hadoopfile




Hadoop Streaming


Hadoop Streaming
        $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
                          -input "/user/me/samples/cachefile/input.txt" \
                          -mapper "xargs cat" \
                          -reducer "cat" \
                          -output "/user/me/samples/cachefile/out" \
                          -cacheArchive 'hdfs://hadoop-nn1.example.com/user/me/samples/cachefile/cachedir.jar#testlink' \
                          -jobconf mapred.map.tasks=3 \
                          -jobconf mapred.reduce.tasks=3 \
                          -jobconf mapred.job.name="Experiment"
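
Because Hadoop Streaming talks to the mapper and reducer over STDIN and STDOUT, with tab-separated key/value lines by default, both can be written in plain Ruby. The sketch below is one possible word-count pair under that assumption; the method names `map_stream` and `reduce_stream` are hypothetical, and the local dry run at the end only imitates `mapper | sort | reducer` on a single machine.

```ruby
require "stringio"

# Mapper: emit "word<TAB>1" for every whitespace-separated token.
def map_stream(input, output)
  input.each_line do |line|
    line.split.each { |word| output.puts "#{word}\t1" }
  end
end

# Reducer: input arrives sorted by key, so equal keys are adjacent.
def reduce_stream(input, output)
  current = nil
  total = 0
  input.each_line do |line|
    key, value = line.chomp.split("\t", 2)
    if key != current
      # A new key starts: flush the finished one.
      output.puts "#{current}\t#{total}" unless current.nil?
      current = key
      total = 0
    end
    total += value.to_i
  end
  output.puts "#{current}\t#{total}" unless current.nil?
end

# Local dry run imitating `mapper | sort | reducer`.
mapped = StringIO.new
map_stream(StringIO.new("deer bear river\ncar car river\n"), mapped)
sorted = StringIO.new(mapped.string.lines.sort.join)
reduced = StringIO.new
reduce_stream(sorted, reduced)
puts reduced.string
```

In a real job each method would live in its own script that calls it with `STDIN` and `STDOUT`, and those scripts would be passed to the `-mapper` and `-reducer` options.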




Hadoop Streaming




                          Pig        Hive          Wukong
                         Pig Latin   SQL-ish         Ruby!




           Hadoop Ecosystem
Wukong


        •Built by Infochimps
        •Currently going through
         heavy development
        •Use the 3.0.0.pre3 gem
            https://github.com/infochimps-labs/wukong/tree/3.0.0

        •Model your jobs with
         wukong-hadoop
            https://github.com/infochimps-labs/wukong-hadoop




Wukong



            Wukong
            •Write mappers and reducers using Ruby
            •As of 3.0.0, Wukong uses “Processors”, which are Ruby
             classes that define map, reduce, and other tasks

            wukong-hadoop
            •A CLI to use with Hadoop
            •Created around building tasks with Wukong
            •Better than piping in the shell
             (you can see this with --dry_run)




Wukong Processors

        •Fields are accessible through command-line switches in the shell
        •Locally, data is handed off from STDOUT to STDIN

            Wukong.processor(:mapper) do

              field :min_length, Integer,  :default => 1
              field :max_length, Integer,  :default => 256
              field :split_on,   Regexp,   :default => /\s+/
              field :remove,     Regexp,   :default => /[^a-zA-Z0-9']+/
              field :fold_case,  :boolean, :default => false

              def process string
                tokenize(string).each do |token|
                  yield token if acceptable?(token)
                end
              end

              private

              def tokenize string
                string.split(split_on).map do |token|
                  stripped = token.gsub(remove, '')
                  fold_case ? stripped.downcase : stripped
                end
              end

              def acceptable? token
                (min_length..max_length).include?(token.length)
              end
            end
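
To try the mapper's tokenizing logic without installing Wukong, the same steps can be written as a plain Ruby method. This is a sketch, not part of the Wukong API: the method name `tokens` is hypothetical, and the processor's fields become keyword arguments with the same defaults (with whitespace splitting written as `/\s+/`).

```ruby
# Plain-Ruby rendering of the mapper's tokenize-and-filter steps.
# Keyword arguments mirror the processor's fields and their defaults.
def tokens(string, min_length: 1, max_length: 256, split_on: /\s+/,
           remove: /[^a-zA-Z0-9']+/, fold_case: false)
  cleaned = string.split(split_on).map do |token|
    # Strip characters outside the allowed set, then optionally lowercase.
    stripped = token.gsub(remove, '')
    fold_case ? stripped.downcase : stripped
  end
  # Keep only tokens whose length falls inside the accepted range.
  cleaned.select { |token| (min_length..max_length).include?(token.length) }
end

p tokens("D'oh! Don't do that, Doctor.")
p tokens("D'oh! Don't do that, Doctor.", fold_case: true, min_length: 3)
```

The second call shows how overriding two fields changes the output: tokens are lowercased and the two-letter word is filtered out.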




Wukong Processors



                         Wukong.processor(:reducer, Wukong::Processor::Accumulator) do

                           attr_accessor :count
                           
                           def start record
                             self.count = 0
                           end
                           
                           def accumulate record
                             self.count += 1
                           end

                           def finalize
                             yield [key, count].join("\t")
                           end
                         end




Wukong Processors

           wu-hadoop /home/hduser/wukong-hadoop/examples/word_count.rb \
                            --mode=local \
                            --input="/home/hduser/simpsons/simpsonssubs/Simpsons [1.08].sub"




                                      Simpsons - Ep 8

                                      do         7
                                      Doctor     1
                                      Does       2
                                      doesn't    1
                                      dog        2
                                      D'oh       1
                                      doif       1
                                      doing      2
                                      done       1
                                      doneYou    1
                                      don't      10
                                      Don't      1




The End




                         Thank you!
                             @tomeara
                             ted@tedomeara.com



