Ruby on Hadoop
Tuesday, January 8, 13
Introduction




                                      Hi.
                                   I’m Ted O’Meara
                         ...and I just quit my job last week.

                                    @tomeara
                                 tedomeara.com

MapReduce
History of MapReduce



        • First implemented
          by Google
        • Used in CouchDB,
          Hadoop, etc.
        • Helps to “distill” data into
          a concentrated result set




What is MapReduce?




   input = ["deer", "bear", "river", "car", "car", "river",
            "deer", "car", "bear"]

   input.map! { |x| [x, 1] }

   sum = 0
   input.each do |x|
     sum += x[1]
   end
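
The snippet above adds every count into one grand total. A minimal sketch of the per-word variant, assuming plain in-memory Ruby with no Hadoop involved (the names `mapped`, `grouped`, and `counts` are illustrative and not from the talk): map each word to a pair, group the pairs by key, then reduce each group.

```ruby
# Word count as three explicit phases, all in memory (no Hadoop involved).
input = ["deer", "bear", "river", "car", "car", "river",
         "deer", "car", "bear"]

# Map: emit a [key, value] pair for every word.
mapped = input.map { |word| [word, 1] }

# Shuffle: group the pairs by key, as a framework would between phases.
grouped = mapped.group_by { |word, _count| word }

# Reduce: collapse each group's values into one total per key.
counts = grouped.map { |word, pairs| [word, pairs.sum { |_w, n| n }] }.to_h

counts.each { |word, total| puts "#{word}\t#{total}" }
```

The grouping step is the part a cluster does for you: every pair with the same key ends up at the same reducer.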




Hadoop Breakdown
History of Hadoop



        •Doug Cutting @ Yahoo!
        •Named after a toy elephant
        •A framework for distributed computing
        •Also a distributed filesystem (HDFS)




Network Topology
Hadoop Cluster

                         Cluster
                         •Commodity hardware
                         •Partition tolerant
                         •Network-aware (rack-aware)



                          555.555.1.*             555.555.2.*              444.444.1.*
                              JobTracker              NameNode              TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode     TaskTracker/DataNode




Hadoop Cluster

                         NameNode
                         •Keeps track of the DataNodes
                         •Uses “heartbeat” to determine a node’s health
                         •Give this node the most resources
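
A rough sketch of heartbeat-based health tracking, to make the idea concrete. The `HeartbeatMonitor` class, its method names, and the 10-second timeout are all hypothetical and chosen for illustration; they are not the NameNode's actual implementation or settings.

```ruby
# Tracks the last heartbeat seen from each node and flags silent nodes.
# HeartbeatMonitor and its 10-second timeout are hypothetical.
class HeartbeatMonitor
  def initialize(timeout: 10)
    @timeout = timeout
    @last_seen = {}
  end

  # Record a heartbeat from a node at the given time (in seconds).
  def beat(node, at)
    @last_seen[node] = at
  end

  # A node is healthy if it reported within the timeout window.
  def healthy?(node, now)
    seen = @last_seen[node]
    !seen.nil? && (now - seen) <= @timeout
  end

  # Every known node that has been silent for longer than the timeout.
  def dead_nodes(now)
    @last_seen.keys.reject { |node| healthy?(node, now) }
  end
end

monitor = HeartbeatMonitor.new(timeout: 10)
monitor.beat("datanode-1", 0)
monitor.beat("datanode-2", 0)
monitor.beat("datanode-1", 8) # datanode-2 stays silent after time 0

puts monitor.dead_nodes(15).inspect
```

Passing the clock in as an argument keeps the sketch deterministic; a real monitor would read the current time itself.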



                          [Same cluster diagram, with a heartbeat (♥) between the NameNode rack (555.555.2.*) and the 444.444.1.* rack]




Hadoop Cluster

                         DataNode
                         •Stores filesystem blocks
                         •Can be scaled: nodes are spun up or down
                         •Blocks are replicated according to a configured replication factor
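
A small sketch of what a replication factor means for block storage. The `place_blocks` helper and the node names are hypothetical, and round-robin placement is a deliberate simplification: real HDFS placement also takes racks into account.

```ruby
# Assigns each block of a file to `replication` distinct DataNodes.
# Round-robin placement is a simplification; real placement is rack-aware.
def place_blocks(block_count, nodes, replication)
  raise ArgumentError, "not enough nodes" if replication > nodes.size

  (0...block_count).map do |block|
    replicas = (0...replication).map { |i| nodes[(block + i) % nodes.size] }
    [block, replicas]
  end.to_h
end

nodes = %w[dn1 dn2 dn3 dn4]
placement = place_blocks(3, nodes, 3)
placement.each { |block, replicas| puts "block #{block}: #{replicas.join(', ')}" }

# Losing one node still leaves every block with surviving replicas.
survivors = placement.transform_values { |replicas| replicas - ["dn2"] }
survivors.each { |block, replicas| puts "block #{block} after losing dn2: #{replicas.join(', ')}" }
```

With a factor of 3, any single node can disappear and each block still has at least two copies left.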



Hadoop Cluster

                         JobTracker
                         •Decides which TaskTrackers handle the tasks of a
                          MapReduce job
                         •Communicates with the NameNode to assign a TaskTracker
                          close to the DataNode that holds the source data
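
One way to picture the "close to the data" rule is a three-level preference: same node, then same rack, then anywhere. The sketch below assumes exactly that ordering; the `Node` struct, `pick_tracker`, and the rack names are hypothetical and not the JobTracker's real scheduler.

```ruby
# Picks a worker for a task, preferring the node that holds the data,
# then a node in the same rack, then any free node.
# The Node struct and rack names are hypothetical.
Node = Struct.new(:name, :rack)

def pick_tracker(free_trackers, data_nodes)
  data_names = data_nodes.map(&:name)
  data_racks = data_nodes.map(&:rack)

  node_local = free_trackers.find { |t| data_names.include?(t.name) }
  rack_local = free_trackers.find { |t| data_racks.include?(t.rack) }

  node_local || rack_local || free_trackers.first
end

replicas = [Node.new("dn1", "rack-a"), Node.new("dn5", "rack-b")]
free     = [Node.new("dn9", "rack-c"), Node.new("dn2", "rack-a")]

puts pick_tracker(free, replicas).name
```

Here no free tracker sits on a node with the data, so the tracker sharing a rack with a replica wins over the off-rack one.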


                          [Same cluster diagram, with a heartbeat (♥) between the JobTracker rack (555.555.1.*) and the NameNode rack (555.555.2.*)]




Hadoop Cluster

                         TaskTracker
                         •Worker for MapReduce jobs
                         •The closer to the DataNode with the data, the better



HDFS
HDFS

                                           hadoop fs -put localfile /user/hadoop/hadoopfile




Hadoop Streaming
Hadoop Streaming
        $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
                          -input "/user/me/samples/cachefile/input.txt" \
                          -mapper "xargs cat" \
                          -reducer "cat" \
                          -output "/user/me/samples/cachefile/out" \
                          -cacheArchive 'hdfs://hadoop-nn1.example.com/user/me/samples/cachefile/cachedir.jar#testlink' \
                          -jobconf mapred.map.tasks=3 \
                          -jobconf mapred.reduce.tasks=3 \
                          -jobconf mapred.job.name="Experiment"
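
The mapper and reducer in a streaming job are ordinary programs that read lines on STDIN and write tab-separated key/value lines to STDOUT. A minimal Ruby sketch of that contract follows; `map_stream` and `reduce_stream` are hypothetical names, and the sort in the middle is a local stand-in for the shuffle a cluster would perform.

```ruby
require "stringio"

# Mapper: read raw lines, emit "word<TAB>1" for every word.
def map_stream(input, output)
  input.each_line do |line|
    line.split.each { |word| output.puts "#{word}\t1" }
  end
end

# Reducer: input arrives sorted by key, so equal keys are adjacent.
def reduce_stream(input, output)
  current = nil
  total = 0
  input.each_line do |line|
    key, value = line.chomp.split("\t", 2)
    if key != current
      output.puts "#{current}\t#{total}" unless current.nil?
      current = key
      total = 0
    end
    total += value.to_i
  end
  output.puts "#{current}\t#{total}" unless current.nil?
end

# Local stand-in for the cluster: map, sort (the shuffle), then reduce.
mapped = StringIO.new
map_stream(StringIO.new("deer bear river\ncar car river\n"), mapped)
sorted = StringIO.new(mapped.string.lines.sort.join)
reduced = StringIO.new
reduce_stream(sorted, reduced)
puts reduced.string
```

In an actual job each function would presumably live in its own script wired to `$stdin` and `$stdout`, and the script paths would be passed to `-mapper` and `-reducer`.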




Hadoop Streaming




                          Pig        Hive          Wukong
                         Pig Latin   SQL-ish         Ruby!




           Hadoop Ecosystem
Wukong


        •Infochimps
        •Currently going through
         heavy development
        •Use the 3.0.0.pre3 gem
            https://github.com/infochimps-labs/wukong/tree/3.0.0

        •Model your jobs with
         wukong-hadoop
            https://github.com/infochimps-labs/wukong-hadoop




Wukong



            Wukong
            •Write mappers and reducers using Ruby
            •As of 3.0.0, Wukong uses “Processors”, which are Ruby
             classes that define map, reduce, and other tasks

            wukong-hadoop
            •A CLI to use with Hadoop
            •Created around building tasks with Wukong
            •Better than piping in the shell
             (you can see this with --dry_run)




Wukong Processors

        •Fields are accessible through switches in shell
        •Local hand-off is made from STDOUT to STDIN

        Wukong.processor(:mapper) do

          field :min_length, Integer,  :default => 1
          field :max_length, Integer,  :default => 256
          field :split_on,   Regexp,   :default => /\s+/
          field :remove,     Regexp,   :default => /[^a-zA-Z0-9']+/
          field :fold_case,  :boolean, :default => false

          def process string
            tokenize(string).each do |token|
              yield token if acceptable?(token)
            end
          end

          private

          def tokenize string
            string.split(split_on).map do |token|
              stripped = token.gsub(remove, '')
              fold_case ? stripped.downcase : stripped
            end
          end

          def acceptable? token
            (min_length..max_length).include?(token.length)
          end
        end




Wukong Processors



                         Wukong.processor(:reducer, Wukong::Processor::Accumulator) do

                           attr_accessor :count
                           
                           def start record
                             self.count = 0
                           end
                           
                           def accumulate record
                             self.count += 1
                           end

                           def finalize
                             yield [key, count].join("\t")
                           end
                         end
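
To see how `start`, `accumulate`, and `finalize` fit together, here is a plain-Ruby stand-in for that lifecycle. It assumes the accumulator is fed records already sorted by key and is finalized whenever the key changes; `CountAccumulator` and `run_accumulator` are hypothetical names and not part of Wukong's API.

```ruby
# Plain-Ruby stand-in for the start / accumulate / finalize lifecycle.
# CountAccumulator and run_accumulator are hypothetical, not Wukong's API.
class CountAccumulator
  attr_reader :key, :count

  # Called on the first record of each new key.
  def start(record)
    @key = record
    @count = 0
  end

  # Called for every record of the current key, including the first.
  def accumulate(_record)
    @count += 1
  end

  # Called once the key's records are exhausted; emits "key<TAB>count".
  def finalize
    yield [key, count].join("\t")
  end
end

# Feeds sorted records to the accumulator, finalizing when the key changes.
def run_accumulator(records, acc)
  out = []
  current = nil
  records.each do |record|
    if record != current
      acc.finalize { |line| out << line } unless current.nil?
      acc.start(record)
      current = record
    end
    acc.accumulate(record)
  end
  acc.finalize { |line| out << line } unless current.nil?
  out
end

puts run_accumulator(%w[bear car car deer], CountAccumulator.new)
```

The driver only works because equal keys are adjacent, which is why the sort between the map and reduce phases matters.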




Wukong Processors

           wu-hadoop /home/hduser/wukong-hadoop/examples/word_count.rb \
                            --mode=local \
                            --input="/home/hduser/simpsons/simpsonssubs/Simpsons [1.08].sub"




                                      Simpsons - Ep 8
                                      do         7
                                      Doctor     1
                                      Does       2
                                      doesn't    1
                                      dog        2
                                      D'oh       1
                                      doif       1
                                      doing      2
                                      done       1
                                      doneYou    1
                                      don't      10
                                      Don't      1




The End




                         Thank you!
                             @tomeara
                             ted@tedomeara.com



