



      A very short Intro to Hadoop

             Scale Unlimited, Inc
             photo by: i_pinz, flickr
        Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.

Monday, December 19, 2011




      Welcome to Intro to Hadoop



             Usually a 4 hour online class
             Focus on Why, What, How of Hadoop








      Meet Your Instructor


             Developer of Bixo web mining toolkit
             Committer on Apache Tika
             Hadoop developer and trainer
             Solr developer and trainer





      Overview

             photo by: exfordy, flickr




      How to Crunch a Petabyte?


           Lots of disks, spinning all the time
           Redundancy, since disks die
           Lots of CPU cores, working all the time
           Retry, since network errors happen








      Hadoop to the Rescue


           Scalable - many servers with lots of cores and spindles
           Reliable - detect failures, redundant storage
           Fault-tolerant - auto-retry, self-healing
           Simple - use many servers as one really big computer








      Logical Architecture
           [Diagram: a Cluster box containing an Execution layer and a Storage layer]




           Logically, Hadoop is simply a computing cluster that provides:
                   a Storage layer, and
                   an Execution layer






      Storage Layer


           Hadoop Distributed File System (aka HDFS, or Hadoop DFS)
           Runs on top of regular OS file system, typically Linux ext3
           Fixed-size blocks (64MB by default) that are replicated
           Write once, read many; optimized for streaming in and out
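
           The block arithmetic above can be sketched in a few lines. This is only an illustration: the 64MB block size comes from the slide, the replication factor of 3 is the deck's typical value, and the function names are hypothetical.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64MB default block size, from the slide
REPLICATION = 3                # assumed: the deck's typical replication factor


def block_count(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of fixed-size blocks needed to hold file_size bytes."""
    # The final block only holds the leftover bytes, so round up.
    return math.ceil(file_size / block_size)


def raw_bytes_stored(file_size: int, replication: int = REPLICATION) -> int:
    """Raw disk consumed across the cluster once every block is replicated."""
    return file_size * replication


one_gb = 1024 ** 3
print(block_count(one_gb))       # 16
print(raw_bytes_stored(one_gb))  # 3221225472
```

           A 1GB file therefore becomes 16 blocks, and with three copies of each it takes about 3GB of raw disk.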








      Execution Layer


           Hadoop Map-Reduce
           Responsible for running a job in parallel on many servers
           Handles re-trying a task that fails, validating complete results
           Jobs consist of special “map” and “reduce” operations








      Scalable
             [Diagram: a Cluster holds Racks, and each Rack holds Nodes; each Node's cpu belongs to the Execution layer and its disk to the Storage layer]




             Virtual execution and storage layers span many nodes (servers)
             Scales linearly (sort of) with cores and disks.





      Reliable



             Each block is replicated, typically three times.
             Each task must succeed, or the job fails








      Fault-tolerant



             Failed tasks are automatically retried.
             Failed data transfers are automatically retried.
             Servers can join and leave the cluster at any time.
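
             A minimal sketch of the automatic-retry idea, not Hadoop's own scheduler code. The attempt limit and every name here are hypothetical, chosen only for illustration.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

MAX_ATTEMPTS = 4  # hypothetical limit chosen for this sketch


def run_with_retry(task: Callable[[], T], max_attempts: int = MAX_ATTEMPTS) -> T:
    """Run task, retrying on failure; raise once every attempt has failed."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return task()
        except Exception as error:  # a real scheduler would inspect the failure
            last_error = error
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


# Usage: a task that fails twice with a simulated network error, then succeeds.
calls = {"count": 0}


def flaky_task() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise IOError("simulated network error")
    return "done"


result = run_with_retry(flaky_task)
print(result, calls["count"])  # done 3
```

             If every attempt fails, the error propagates to the caller, which mirrors the rule that each task must succeed or the job fails.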








      Simple
             [Diagram: a row of Nodes; their cpus form one Execution layer and their disks form one Storage layer]




             Reduces complexity
             Conceptual “operating system” that spans many CPUs & disks







      Typical Hadoop Cluster

           Has one “master” server - high quality, beefy box.
                   NameNode process - manages file system
                   JobTracker process - manages tasks
           Has multiple “slave” servers - commodity hardware.
                   DataNode process - manages file system blocks on local drives
                   TaskTracker process - runs tasks on server
           Uses high speed network between all servers







      Architectural Components
       [Diagram: the Client submits jobs to the JobTracker, which hands tasks to the TaskTrackers]
       Master: NameNode and JobTracker
       Slaves: DataNodes (holding data blocks) and TaskTrackers (running mappers and reducers in child JVMs)

       Solid boxes are unique applications
       Dashed boxes are child JVM instances (on same node as parent)
       Dotted boxes are blocks of managed files (on same node as parent)




      Distributed File System

             photo by: Graham Racher, flickr




      Virtual File System


             Treats many disks on many servers as one huge, logical volume
             Data is stored in 1...n blocks
             The “DataNode” process manages blocks of data on a slave.
             The “NameNode” process keeps track of file metadata on the master.








      Replication


             Each block is stored on several different disks (default is 3)
             Hadoop tries to copy blocks to different servers and racks.
             Protects data against disk, server, rack failures.
             Reduces the need to move data to code.
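
             One way to picture "different servers and racks" is a round-robin pick across racks. This is a simplified stand-in, not HDFS's actual placement policy; the function name and the cluster layout are hypothetical.

```python
from typing import Dict, List


def place_replicas(racks: Dict[str, List[str]], replication: int = 3) -> List[str]:
    """Pick distinct servers, cycling through racks so copies spread out."""
    remaining = {rack: list(servers) for rack, servers in racks.items()}
    chosen: List[str] = []
    # Keep taking one server per rack until enough copies are placed
    # or the cluster has no unused servers left.
    while len(chosen) < replication and any(remaining.values()):
        for rack in sorted(remaining):
            if remaining[rack] and len(chosen) < replication:
                chosen.append(remaining[rack].pop(0))
    return chosen


cluster = {
    "rack1": ["node1", "node2"],
    "rack2": ["node3", "node4"],
}
placement = place_replicas(cluster)
print(placement)  # ['node1', 'node3', 'node2']
```

             With two racks and three copies, losing either rack still leaves at least one copy of the block reachable.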








      Error Recovery



             Slaves constantly “check in” with the master.
             Data is automatically replicated if a disk or server “goes away”.
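
             A rough sketch of how check-ins could drive re-replication. The timeout value, the data structures, and the function names are all assumptions made for illustration; they are not taken from Hadoop.

```python
from typing import Dict, List, Set

HEARTBEAT_TIMEOUT = 30.0  # seconds; hypothetical value for this sketch
TARGET_REPLICAS = 3       # the default replication mentioned in the deck


def live_nodes(last_heartbeat: Dict[str, float], now: float) -> Set[str]:
    """Slaves whose most recent check-in is within the timeout."""
    return {node for node, seen in last_heartbeat.items()
            if now - seen <= HEARTBEAT_TIMEOUT}


def under_replicated(block_locations: Dict[str, List[str]],
                     alive: Set[str]) -> Dict[str, int]:
    """Map each block to the number of extra copies it needs."""
    needed = {}
    for block, nodes in block_locations.items():
        live_copies = sum(1 for node in nodes if node in alive)
        if live_copies < TARGET_REPLICAS:
            needed[block] = TARGET_REPLICAS - live_copies
    return needed


# node3 last checked in 60 seconds ago, so it is treated as gone.
heartbeats = {"node1": 100.0, "node2": 95.0, "node3": 40.0, "node4": 99.0}
blocks = {
    "blk_a": ["node1", "node2", "node3"],
    "blk_b": ["node1", "node2", "node4"],
}
alive = live_nodes(heartbeats, now=100.0)
print(sorted(alive))                    # ['node1', 'node2', 'node4']
print(under_replicated(blocks, alive))  # {'blk_a': 1}
```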








      Limitations


             Optimized for streaming data in/out
                     So no random access to data in a file
                     Data rates ≈ 30% - 50% of max raw disk rate
             No append support currently
                     Write once, read many








      NameNode


             Runs on master node
                     Is a single point of failure
                     There are no built-in software hot failover mechanisms
             Maintains filesystem metadata, the “Namespace”
                     files and hierarchical directories








      DataNodes

             [Diagram: several DataNodes, each holding data blocks]




             Files stored on HDFS are chunked and stored as blocks on DataNodes
             DataNodes manage the storage attached to the nodes they run on
             Data never flows through the NameNode, only through DataNodes




      Map-Reduce Paradigm

             photo by: Graham Racher, flickr




      Definitions

          Key Value Pair -> two units of data, exchanged between Map & Reduce
          Map -> The ‘map’ function in the MapReduce algorithm
                  user defined
                  converts each input Key Value Pair to 0...n output Key Value Pairs
          Reduce -> The ‘reduce’ function in the MapReduce algorithm
                  user defined
                  converts each input Key + all Values to 0...n output Key Value Pairs
          Group -> A built-in operation that happens between Map and Reduce
                  ensures each Key passed to Reduce includes all Values





      All Together
             [Diagram: five Mappers feed three Shuffle stages, and each Shuffle stage feeds one Reducer]








      How MapReduce Works

             Map translates input keys and values to new keys and values
                     [K1,V1] -> Map -> [K2,V2]

             System Groups each unique key with all its values
                     [K2,V2] -> Group -> [K2,{V2,V2,....}]

             Reduce translates the values of each unique key to new keys and values
                     [K2,{V2,V2,....}] -> Reduce -> [K3,V3]
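
             The three stages can be sketched as one small in-memory driver. This is a single-process illustration, not Hadoop's API; `run_map_reduce` and the two example functions are hypothetical names.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List, Tuple

Pair = Tuple[Any, Any]


def run_map_reduce(
    records: Iterable[Pair],
    map_fn: Callable[[Any, Any], Iterable[Pair]],
    reduce_fn: Callable[[Any, List[Any]], Iterable[Pair]],
) -> List[Pair]:
    # Map: each [K1,V1] yields 0..n [K2,V2] pairs.
    mapped = [pair for k1, v1 in records for pair in map_fn(k1, v1)]
    # Group: collect every V2 under its K2.
    grouped: Dict[Any, List[Any]] = defaultdict(list)
    for k2, v2 in mapped:
        grouped[k2].append(v2)
    # Reduce: each [K2,{V2,...}] yields 0..n [K3,V3] pairs.
    return [pair for k2 in sorted(grouped) for pair in reduce_fn(k2, grouped[k2])]


def words_by_length(line_number, text):
    """Map: emit (word length, word) for every word on the line."""
    for word in text.split():
        yield (len(word), word)


def unique_words(length, words):
    """Reduce: emit the distinct words seen for one length."""
    yield (length, sorted(set(words)))


result = run_map_reduce([(1, "to be or not to be")], words_by_length, unique_words)
print(result)  # [(2, ['be', 'or', 'to']), (3, ['not'])]
```

             Only `map_fn` and `reduce_fn` change from job to job; the grouping in the middle is supplied by the system.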






      Canonical Example - Word Count

             Word Count
                     Read a document, parse out the words, count the frequency of each word


             Specifically, in MapReduce
                     With a document consisting of lines of text
                     Translate each line of text into key = word and value = 1
                             e.g. <“the”,1> <“quick”,1> <“brown”,1> <“fox”,1> <“jumped”,1>
                     Sum the values (ones) for each unique word
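
             Word Count can be tried locally with plain Python. In this sketch, sorting the pairs and then using `groupby` stands in for the Group step; the sample lines and function names are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter


def map_words(line):
    """Map: translate a line of text into (word, 1) pairs."""
    for word in line.split():
        yield (word, 1)


def reduce_counts(word, ones):
    """Reduce: sum the ones for a single unique word."""
    yield (word, sum(ones))


lines = ["the quick brown fox", "jumped over the lazy dog", "the end"]

# Sorting brings equal keys together so groupby can collect their values.
pairs = sorted(pair for line in lines for pair in map_words(line))

counts = []
for word, group in groupby(pairs, key=itemgetter(0)):
    counts.extend(reduce_counts(word, (one for _, one in group)))

print(dict(counts)["the"])  # 3
```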


             [K1,V1]
                     [1, When in the Course of human events, it becomes]
                     [2, necessary for one people to dissolve the political bands]
                     [3, which have connected them with another, and to assume]
                     [n, {k,k,k,k,...}]

             Map (User Defined)

             [K2,V2]
                     [When,1] [in,1] [the,1] [Course,1] [k,1]

             Group

             [K2,{V2,V2,....}]
                     [When,{1,1,1,1}] [people,{1,1,1,1,1,1}] [dissolve,{1,1,1}]
                     [connected,{1,1,1,1,1}] [k,{v,v,...}]

             Reduce (User Defined)

             [K3,V3]
                     [When,4] [people,6] [dissolve,3]
                     [connected,5] [k,sum(v,v,v,..)]







      Divide & Conquer (splitting data)

             Because
                     The Map function only cares about the current key and value, and
                     The Reduce function only cares about the current key and its values
             Then
                     A Mapper can invoke Map on an arbitrary number of input keys and values
                             or just some fraction of the input data set
                     A Reducer can invoke Reduce on an arbitrary number of the unique keys
                             but all the values for that key
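
             The requirement that a Reducer sees all the values for a key can be illustrated with a deterministic partition function. The CRC32-based choice below is an assumption made for this sketch, not Hadoop's partitioner, and the names are hypothetical.

```python
import zlib
from collections import defaultdict


def partition(key: str, num_reducers: int) -> int:
    """Pick a reducer for a key; the same key always maps to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(mapper_outputs, num_reducers):
    """Route every (key, value) pair from every mapper to its key's reducer."""
    reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
    for output in mapper_outputs:
        for key, value in output:
            reducer_inputs[partition(key, num_reducers)][key].append(value)
    return reducer_inputs


# Two mappers each saw a different fraction of the input.
mapper_outputs = [
    [("the", 1), ("fox", 1)],
    [("the", 1), ("dog", 1), ("the", 1)],
]
reducer_inputs = shuffle(mapper_outputs, num_reducers=2)
owner = partition("the", 2)
print(reducer_inputs[owner]["the"])  # [1, 1, 1]
```

             Each reducer handles only some of the unique keys, yet every value for "the" arrives at one reducer even though two mappers emitted it.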



             [Diagram: a file is divided into split1, split2, split3, split4, ...; the splits feed the Mappers; Mapper output passes through Shuffle to the Reducers; the Reducers write part-00000, part-00001, ... part-000N into a directory]

             [K1,V1] -> Mapper -> [K2,V2] -> Shuffle -> [K2,{V2,V2,...}] -> Reducer -> [K3,V3]

             Mappers must complete before Reducers can begin







      Divide & Conquer (parallelizable)

             Because
                     Each Mapper is independent and processes part of the whole, and
                     Each Reducer is independent and processes part of the whole
             Then
                     Any number of Mappers can run on each node, and
                     Any number of Reducers can run on each node, and
                     The cluster can contain any number of nodes
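
             Because each Mapper is independent, the worker count should not change the answer. The sketch below uses threads as a stand-in for nodes, which is a simplification; the sample splits and function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_split(split):
    """Count words in one split, independently of every other split."""
    counts = Counter()
    for line in split:
        counts.update(line.split())
    return counts


def run_parallel(splits, workers):
    """Process the splits with the given number of workers, then merge."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_split, splits))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


splits = [["the quick brown fox"], ["the lazy dog"], ["the end"]]
one_worker = run_parallel(splits, workers=1)
three_workers = run_parallel(splits, workers=3)
print(one_worker == three_workers)  # True
print(three_workers["the"])         # 3
```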







      JobTracker
             [Diagram: the Client submits jobs to the JobTracker]




             Is a single point of failure
             Determines # Mapper Tasks from file splits via InputFormat
             Uses predefined value for # Reducer Tasks
             Client applications use JobClient to submit jobs and query status
             From the command line, use hadoop job <commands>
             For the web status console, use http://jobtracker-server:50030/






      TaskTracker
             [Diagram: a TaskTracker with child JVMs, each running a mapper or a reducer]



             Spawns each Task as a new child JVM
             Max # mapper and reducer tasks set independently
             Can pass child JVM opts via mapred.child.java.opts
             Can re-use JVM to avoid overhead of task initialization





      Hadoop (sort of) Deep Thoughts

             photo by: Graham Racher, flickr




      Avoiding Hadoop


             Hadoop is a big hammer - but not every problem is a nail
                     Small data
                     Real-time data
                     Beware the Hadoopaphile








      Avoiding Map-Reduce



                     Writing Hadoop code is painful and error prone
                     Hive & Pig are good solutions for People Who Like SQL
                     Cascading is a good solution for complex, stable workflows








      Leveraging the Eco-System


                     Many open source projects built on top of Hadoop
                             HBase - scalable NoSQL data store
                             Sqoop - getting SQL data in/out of Hadoop
                     Other projects work well with Hadoop
                             Kafka/Scribe - getting log data in/out of Hadoop
                             Avro - data serialization/storage format








      Get Involved



             Join the mailing list - http://hadoop.apache.org/mailing_lists.html
             Go to user group meetings - e.g. http://hadoop.meetup.com/








      Learn More

             Buy the book - Hadoop: The Definitive Guide, 2nd edition
             Try the tutorials
                     http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
             Get training (danger, personal plug)
                     http://www.scaleunlimited.com/training
                     http://www.cloudera.com/hadoop-training








      Resources

             Scale Unlimited Alumni list - scale-unlimited-alumni@googlegroups.com
             Hadoop mailing lists - http://hadoop.apache.org/mailing_lists.html
             User groups - e.g. http://hadoop.meetup.com/
             Hadoop API - http://hadoop.apache.org/common/docs/current/api/
             Hadoop: The Definitive Guide, 2nd edition by Tom White
             Cascading: http://www.cascading.org
             Datameer: http://www.datameer.com




A (very) short intro to Hadoop

  • 1. 1 Scale Unlimited, Inc A very short Intro to Hadoop photo by: i_pinz, flickr Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden. Monday, December 19, 2011
  • 2. 2 Welcome to Intro to Hadoop Usually a 4 hour online class Focus on Why, What, How of Hadoop Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden. Monday, December 19, 2011
  • 3. 3 Meet Your Instructor Developer of Bixo web mining toolkit Committer on Apache Tika Hadoop developer and trainer Solr developer and trainer Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden. Monday, December 19, 2011
  • 4. 4 Overview A very short Intro to photo by: exfordy, flickr Hadoop Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden. Monday, December 19, 2011
  • 5. 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry, since network errors happen Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden. Monday, December 19, 2011
  • 6. 6 Hadoop to the Rescue Scalable - many servers with lots of cores and spindles Reliable - detect failures, redundant storage Fault-tolerant - auto-retry, self-healing Simple - use many servers as one really big computer Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden. Monday, December 19, 2011
  • 7. 7 Logical Architecture Cluster Execution Storage Logically, Hadoop is simply a computing cluster that provides: a Storage layer, and an Execution layer Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden. Monday, December 19, 2011
  • 8. 8 Storage Layer Hadoop Distributed File System (aka HDFS, or Hadoop DFS) Runs on top of regular OS file system, typically Linux ext3 Fixed-size blocks (64MB by default) that are replicated Write once, read many; optimized for streaming in and out Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden. Monday, December 19, 2011
  • 9. 9 Execution Layer Hadoop Map-Reduce Responsible for running a job in parallel on many servers Handles re-trying a task that fails, validating complete results Jobs consist of special “map” and “reduce” operations Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden. Monday, December 19, 2011
  • 10. 10 Scalable Cluster Rack Rack Rack Node Node Node Node ... cpu Execution disk Storage Virtual execution and storage layers span many nodes (servers) Scales linearly (sort of) with cores and disks. Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden. Monday, December 19, 2011
  • 11. 11 Reliable Each block is replicated, typically three times. Each task must succeed, or the job fails Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden. Monday, December 19, 2011
  • 12. 12 Fault-tolerant Failed tasks are automatically retried. Failed data transfers are automatically retried. Servers can join and leave the cluster at any time. Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden. Monday, December 19, 2011
  • 13. 13 Simple Node Node Node Node ... cpu Execution disk Storage Reduces complexity Conceptual “operating system” that spans many CPUs & disks Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden. Monday, December 19, 2011
  • 14. 14 Typical Hadoop Cluster Has one “master” server - high quality, beefy box. NameNode process - manages file system JobTracker process - manages tasks Has multiple “slave” servers - commodity hardware. DataNode process - manages file system blocks on local drives TaskTracker process - runs tasks on server Uses high speed network between all servers Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden. Monday, December 19, 2011
  • 15. 15 Architectural Components Solid boxes are unique applications NameNode DataNode Dashed boxes are child JVM instances DataNode DataNode DataNode data block (on same node as parent) Dotted boxes are blocks of managed files (on same node as parent) mapper mapper child jvm mapper child jvm child jvm Client JobTracker jobs tasks TaskTracker reducer reducer child jvm reducer child jvm child jvm Master Slaves Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden. Monday, December 19, 2011
  • 16. Distributed File System (section title slide; photo by: Graham Racher, flickr)
  • 17. Virtual File System: Treats many disks on many servers as one huge, logical volume. Data is stored in 1...n blocks. The “DataNode” process manages blocks of data on a slave. The “NameNode” process keeps track of file metadata on the master.
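As a rough illustration of "data is stored in 1...n blocks", the sketch below cuts a file of a given size into fixed-size blocks. The 64 MB block size and the function name `block_layout` are assumptions made for this example; neither appears on the slides.

```python
from typing import List, Tuple

BLOCK_SIZE = 64 * 1024 * 1024  # assumed block size (64 MB); configurable in practice


def block_layout(file_size: int, block_size: int = BLOCK_SIZE) -> List[Tuple[int, int]]:
    """Return (offset, length) for each block of a file of `file_size` bytes."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    blocks = []
    offset = 0
    while offset < file_size:
        # Every block is full-sized except possibly the last one.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


# A 200 MB file under the assumed 64 MB block size: three full blocks plus an 8 MB tail.
for offset, length in block_layout(200 * 1024 * 1024):
    print(offset, length)
```

Each of these blocks is what a DataNode would store; the NameNode would only need to remember which blocks make up the file.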
  • 18. Replication: Each block is stored on several different disks (default is 3). Hadoop tries to copy blocks to different servers and racks. Protects data against disk, server, and rack failures. Reduces the need to move data to code.
  • 19. Error Recovery: Slaves constantly “check in” with the master. Data is automatically replicated if a disk or server “goes away”.
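The replication and error-recovery slides can be pictured with a small sketch: given which nodes hold each block and which nodes are still checking in, work out which blocks need extra copies. This is a simplified model, not HDFS's actual logic; the function name `under_replicated`, the node names, and the block ids are all hypothetical.

```python
from typing import Dict, Set

REPLICATION = 3  # default replication factor mentioned on the slides


def under_replicated(block_locations: Dict[str, Set[str]],
                     live_nodes: Set[str],
                     replication: int = REPLICATION) -> Dict[str, int]:
    """Map each under-replicated block id to the number of extra copies it needs.

    Only replicas on nodes that are still checking in count as alive.
    """
    missing = {}
    for block_id, nodes in block_locations.items():
        alive = len(nodes & live_nodes)
        if alive < replication:
            missing[block_id] = replication - alive
    return missing


# Hypothetical cluster state: node "n2" has stopped checking in.
locations = {
    "blk_1": {"n1", "n2", "n3"},
    "blk_2": {"n1", "n3", "n4"},
}
print(under_replicated(locations, live_nodes={"n1", "n3", "n4"}))  # {'blk_1': 1}
```

In this model the master would then schedule one new copy of `blk_1` on some other live node.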
  • 20. Limitations: Optimized for streaming data in/out, so no random access to data in a file. Data rates ≈ 30% - 50% of max raw disk rate. No append support currently. Write once, read many.
  • 21. NameNode: Runs on the master node. Is a single point of failure; there are no built-in software hot failover mechanisms. Maintains filesystem metadata, the “Namespace”: files and hierarchical directories.
  • 22. DataNodes: Files stored on HDFS are chunked and stored as blocks on DataNodes. Each DataNode manages the storage attached to the node it runs on. Data never flows through the NameNode, only through DataNodes. [Diagram: a row of DataNodes, each holding data blocks.]
  • 23. Map-Reduce Paradigm (section title slide; photo by: Graham Racher, flickr)
  • 24. Definitions: Key Value Pair: two units of data, exchanged between Map & Reduce. Map: the ‘map’ function in the MapReduce algorithm; user defined; converts each input Key Value Pair to 0...n output Key Value Pairs. Reduce: the ‘reduce’ function in the MapReduce algorithm; user defined; converts each input Key + all Values to 0...n output Key Value Pairs. Group: a built-in operation that happens between Map and Reduce; ensures each Key passed to Reduce includes all Values.
  • 25. All Together: [Diagram: several Mappers feed the Shuffle stage, which feeds several Reducers.]
  • 26. How MapReduce Works: Map translates input keys and values to new keys and values: [K1,V1] → Map → [K2,V2]. The system groups each unique key with all its values: [K2,V2] → Group → [K2,{V2,V2,...}]. Reduce translates the values of each unique key to new keys and values: [K2,{V2,V2,...}] → Reduce → [K3,V3].
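The three steps on this slide can be sketched as a single-process function. This is only a toy model of the data flow, not Hadoop's API; `run_mapreduce`, `first_letter_and_length`, `longest`, and the sample records are hypothetical names and data chosen for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")
K3 = TypeVar("K3")
V3 = TypeVar("V3")


def run_mapreduce(
    records: Iterable[Tuple[K1, V1]],
    map_fn: Callable[[K1, V1], Iterable[Tuple[K2, V2]]],
    reduce_fn: Callable[[K2, List[V2]], Iterable[Tuple[K3, V3]]],
) -> List[Tuple[K3, V3]]:
    grouped: Dict[K2, List[V2]] = defaultdict(list)
    for k1, v1 in records:
        # Map: each [K1,V1] yields 0...n [K2,V2] pairs.
        for k2, v2 in map_fn(k1, v1):
            # Group: collect every value under its key.
            grouped[k2].append(v2)
    output: List[Tuple[K3, V3]] = []
    # Reduce: each [K2,{V2,V2,...}] yields 0...n [K3,V3] pairs.
    # Keys are visited in sorted order, so they must be comparable.
    for k2 in sorted(grouped):
        output.extend(reduce_fn(k2, grouped[k2]))
    return output


def first_letter_and_length(record_id: int, word: str):
    yield word[0], len(word)


def longest(letter: str, lengths: List[int]):
    yield letter, max(lengths)


records = [(1, "apple"), (2, "avocado"), (3, "banana")]
print(run_mapreduce(records, first_letter_and_length, longest))  # [('a', 7), ('b', 6)]
```

The user supplies only `map_fn` and `reduce_fn`; the grouping in the middle is the part the system provides.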
  • 27. Canonical Example - Word Count: Read a document, parse out the words, and count the frequency of each word. Specifically, in MapReduce: with a document consisting of lines of text, translate each line of text into key = word and value = 1, e.g. <“the”,1> <“quick”,1> <“brown”,1> <“fox”,1> <“jumped”,1>. Then sum the values (ones) for each unique word.
  • 28. [Diagram: Word Count data flow.] Input [K1,V1] = [n, {k,k,k,k,...}]: [1, When in the Course of human events, it becomes] [2, necessary for one people to dissolve the political bands] [3, which have connected them with another, and to assume]. Map (user defined) emits [K2,V2] = [k,1]: [When,1] [in,1] [the,1] [Course,1]. Group produces [K2,{V2,V2,....}] = [k,{v,v,...}]: [When,{1,1,1,1}] [people,{1,1,1,1,1,1}] [dissolve,{1,1,1}] [connected,{1,1,1,1,1}]. Reduce (user defined) emits [K3,V3] = [k,sum(v,v,v,..)]: [When,4] [people,6] [dissolve,3] [connected,5].
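A minimal Word Count sketch over the three input lines shown on the slide, with Map, Group, and Reduce as separate steps. The function names and the word-splitting regular expression are assumptions made for the example; a real Hadoop job would be written against Hadoop's own API.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_line(line_number: int, line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit <word, 1> for every word on the line (the line number is unused)."""
    for word in re.findall(r"[A-Za-z']+", line):
        yield word, 1


def group(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group: gather all the ones emitted for each unique word."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for word, one in pairs:
        grouped[word].append(one)
    return grouped


def reduce_word(word: str, ones: List[int]) -> Tuple[str, int]:
    """Reduce: sum the ones for a single word."""
    return word, sum(ones)


def word_count(lines: List[str]) -> Dict[str, int]:
    pairs = [pair
             for number, line in enumerate(lines, start=1)
             for pair in map_line(number, line)]
    return dict(reduce_word(word, ones) for word, ones in group(pairs).items())


lines = [
    "When in the Course of human events, it becomes",
    "necessary for one people to dissolve the political bands",
    "which have connected them with another, and to assume",
]
counts = word_count(lines)
print(counts["the"], counts["to"], counts["When"])  # 2 2 1
```

The counts here are smaller than the ones on the slide because only the three lines shown are counted, not a whole document.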
  • 29. Divide & Conquer (splitting data): Because the Map function only cares about the current key and value, and the Reduce function only cares about the current key and its values, then: a Mapper can invoke Map on an arbitrary number of input keys and values, or on just some fraction of the input data set; a Reducer can invoke Reduce on an arbitrary number of the unique keys, but must receive all the values for each of those keys.
  • 30. [Diagram: the input file is divided into split1, split2, split3, split4, ...; each split feeds a Mapper as [K1,V1]; Mappers emit [K2,V2]; the Shuffle delivers [K2,{V2,V2,...}] to the Reducers; Reducers emit [K3,V3] into part-00000, part-00001, ... part-000N in the output directory.] Mappers must complete before Reducers can begin.
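The requirement that a Reducer sees all the values for each of its keys can be met by routing every key to a reducer through a deterministic function of the key. The sketch below assumes CRC32 as a stand-in hash and uses hypothetical names (`partition`, `shuffle`, `part_name`) and sample data; it models the idea only, not Hadoop's implementation.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def partition(key: str, num_reducers: int) -> int:
    """Pick a reducer index for a key; the same key always gets the same index."""
    # CRC32 is an assumed stand-in for whatever hash a real cluster uses.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs: Iterable[Tuple[str, int]], num_reducers: int) -> List[Dict[str, List[int]]]:
    """Route mapper output so each reducer holds every value for its keys."""
    reducers: List[Dict[str, List[int]]] = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        reducers[partition(key, num_reducers)][key].append(value)
    return reducers


def part_name(index: int) -> str:
    """Output file name for reducer `index`, following the slide's part-0000N pattern."""
    return f"part-{index:05d}"


# Output of two independent mappers, each working on its own split.
mapper_outputs = [
    [("the", 1), ("fox", 1)],
    [("the", 1), ("dog", 1)],
]
all_pairs = [pair for output in mapper_outputs for pair in output]
for index, grouped in enumerate(shuffle(all_pairs, num_reducers=3)):
    totals = {key: sum(values) for key, values in grouped.items()}
    print(part_name(index), totals)
```

Both occurrences of "the" land on the same reducer even though they came from different mappers, which is why the mappers have to finish before a reducer can produce its final sum.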
  • 31. Divide & Conquer (parallelizable): Because each Mapper is independent and processes part of the whole, and each Reducer is independent and processes part of the whole, then: any number of Mappers can run on each node, any number of Reducers can run on each node, and the cluster can contain any number of nodes.
  • 32. JobTracker: Is a single point of failure. Determines # Mapper Tasks from file splits via InputFormat. Uses a predefined value for # Reducer Tasks. Client applications use JobClient to submit jobs and query status. Command line: use hadoop job <commands>. Web status console: use http://jobtracker-server:50030/ [Diagram: the Client submits jobs to the JobTracker.]
  • 33. TaskTracker: Spawns each Task as a new child JVM. Max # mapper and reducer tasks are set independently. Can pass child JVM opts via mapred.child.java.opts. Can re-use the JVM to avoid the overhead of task initialization. [Diagram: a TaskTracker with its mapper and reducer child JVMs.]
  • 34. Hadoop (sort of) Deep Thoughts (section title slide; photo by: Graham Racher, flickr)
  • 35. Avoiding Hadoop: Hadoop is a big hammer, but not every problem is a nail: small data; real-time data. Beware the Hadoopaphile.
  • 36. Avoiding Map-Reduce: Writing Hadoop code is painful and error prone. Hive & Pig are good solutions for People Who Like SQL. Cascading is a good solution for complex, stable workflows.
  • 37. Leveraging the Eco-System: Many open source projects are built on top of Hadoop: HBase (scalable NoSQL data store); Sqoop (getting SQL data in/out of Hadoop). Other projects work well with Hadoop: Kafka/Scribe (getting log data in/out of Hadoop); Avro (data serialization/storage format).
  • 38. Get Involved: Join the mailing list - http://hadoop.apache.org/mailing_lists.html ; go to user group meetings - e.g. http://hadoop.meetup.com/
  • 39. Learn More: Buy the book - Hadoop: The Definitive Guide, 2nd edition. Try the tutorials - http://hadoop.apache.org/common/docs/current/mapred_tutorial.html ; get training (danger, personal plug) - http://www.scaleunlimited.com/training and http://www.cloudera.com/hadoop-training
  • 40. Resources: Scale Unlimited Alumni list - scale-unlimited-alumni@googlegroups.com ; Hadoop mailing lists - http://hadoop.apache.org/mailing_lists.html ; Users groups - e.g. http://hadoop.meetup.com/ ; Hadoop API - http://hadoop.apache.org/common/docs/current/api/ ; Hadoop: The Definitive Guide, 2nd edition, by Tom White; Cascading - http://www.cascading.org ; Datameer - http://www.datameer.com