Introduction to HDFS and MapReduce

                           Copyright © 2012-2013, Think Big Analytics, All
                                                         Rights Reserved
Thursday, January 10, 13
Who Am I
             - Ryan Tabora
             - Data Developer at Think
                   Big Analytics

             - Big Data Consulting
             - Experience working with
                   Hadoop, HBase, Hive,
                   Solr, Cassandra, etc.


Think Big is the leading professional services firm purpose-built for Big Data.
      • One of Silicon Valley’s fastest-growing Big Data startups
      • 100% focus on Big Data consulting and data science solution services
      • Management background:
                  - Cambridge Technology, C-bridge, Oracle, Sun Microsystems, Quantcast, Accenture
                  - C-bridge Internet Solutions (CBIS) founder (1996) and executives, IPO 1999
      • Clients: 40+
      • North America Locations
                  • US East: Boston, New York, Washington D.C.
                  • US Central: Chicago, Austin
                  • US West: HQ Mountain View, San Diego, Salt Lake City
      • EMEA & APAC

Think Big Recognized as a Top Pure-Play Big Data Vendor




                                           [Figure not reproduced. Source: Forbes, February 2012]




Agenda
                     - Big Data
                     - Hadoop Ecosystem
                     - HDFS
                     - MapReduce in Hadoop
                     - The Hadoop Java API
                     - Conclusions

Big Data


A Data Shift...




                               Source: EMC Digital Universe Study*
Motivation

             “Simple algorithms and lots of data trump complex models.”
                  Halevy, Norvig, and Pereira (Google), IEEE Intelligent Systems



Pioneers
                • Google and Yahoo:
                     - Index 850+ million websites, over one
                           trillion URLs.

                • Facebook ad targeting:
                     - 840+ million users, > 50% of whom are
                           active daily.



Hadoop Ecosystem


Common Tool?
                     • Hadoop
                           - Cluster: distributed computing
                             platform.

                           - Commodity*, server-class hardware.
                           - Extensible Platform.



Hadoop Origins
                • MapReduce and Google File System (GFS)
                       pioneered at Google.

                • Hadoop is the commercially-supported
                       open-source equivalent.




What Is Hadoop?
                •      Hadoop is a platform.

                •      Distributes and replicates data.

                •      Manages parallel tasks created by users.

                •      Runs as several processes on a cluster.

                •      The term Hadoop generally refers to a toolset, not a
                       single tool.




Why Hadoop?
                • Handles unstructured to semi-structured to
                       structured data.

                • Handles enormous data volumes.
                • Flexible data analysis and machine learning
                       tools.

                • Cost-effective scalability.


The Hadoop Ecosystem
                     • HDFS - Hadoop Distributed File System.
                     • Map/Reduce - A distributed framework for
                           executing work in parallel.

                     • Hive - A SQL-like query language with a metastore that
                           allows SQL manipulation of data stored on HDFS.

                     • Pig - A top-down scripting language for
                           manipulating data.

                     • HBase - A NoSQL data store for non-sequential (random) access.

HDFS


What Is HDFS?
                • Hadoop Distributed File System.
                • Stores files in blocks across many nodes in a
                       cluster.

                • Replicates the blocks across nodes for
                       durability.

                • Master/Slave architecture.


HDFS Traits
                • Not fully POSIX compliant.
                • No file updates.
                • Write once, read many times.
                • Large blocks, sequential read patterns.
                • Designed for batch processing.



HDFS Master
                • NameNode
                           - Runs on a single node as a master process
                             ‣ Holds file metadata (which blocks are where)
                             ‣ Directs client access to files in HDFS
                • SecondaryNameNode
                           - Not a hot failover
                           - Periodically checkpoints the NameNode metadata and keeps a copy
HDFS Slaves
                • DataNode
                           - Generally runs on all nodes in the cluster
                             ‣ Block creation/replication/deletion/reads
                             ‣ Takes orders from the NameNode




HDFS Illustrated
                [Diagram, shown as an animated build: a client puts a file into a cluster of
                 six DataNodes (DataNode 1 through DataNode 6). The file is split into three
                 blocks, and the NameNode records which DataNodes hold each block's replicas.]

                • NameNode block map after the put:
                           - Block 1: DataNodes 1, 4, 6
                           - Block 2: DataNodes 2, 5, 3
                           - Block 3: DataNodes 3, 2, 6
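The bookkeeping in the diagram can be sketched in a few lines of Python. This is a toy model, not HDFS code: the function names, the round-robin placement policy, and the 150 MB file size are assumptions made for illustration, and the 64 MB block size is only a typical default of that era. Real HDFS placement is rack-aware and chooses different nodes.

```python
import math


def count_blocks(file_size_mb, block_size_mb=64):
    """Number of blocks needed for a file; the last block may be partly full."""
    return math.ceil(file_size_mb / block_size_mb)


def place_blocks(num_blocks, datanodes, replication=3):
    """Toy round-robin placement (hypothetical policy, not the HDFS one).

    Block k starts at the k-th DataNode and takes the next `replication`
    nodes, so every replica of a block lands on a different node.
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the replication factor")
    block_map = {}
    for block_id in range(1, num_blocks + 1):
        start = (block_id - 1) % len(datanodes)
        block_map[block_id] = [
            datanodes[(start + i) % len(datanodes)] for i in range(replication)
        ]
    return block_map


num_blocks = count_blocks(150)  # a 150 MB file with 64 MB blocks
block_map = place_blocks(num_blocks, [1, 2, 3, 4, 5, 6])
for block_id, holders in block_map.items():
    print(f"block {block_id}: DataNodes {holders}")
```

The dictionary plays the role of the NameNode metadata: it maps each block to the nodes that hold a replica, while the data itself would live on the DataNodes.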
Power of Hadoop
                [Diagram, shown as an animated build: a client reads the file back. DataNode 1
                 drops out of the cluster, and the replica of block 1 that it held is
                 re-created on DataNode 5, so every block keeps three replicas.]

                • NameNode block map after DataNode 1 is lost:
                           - Block 1: DataNodes 5, 4, 6
                           - Block 2: DataNodes 2, 5, 3
                           - Block 3: DataNodes 3, 2, 6

                • Aggregate read rate = transfer rate x number of machines*
                           - 100 MB/s x 3 = 300 MB/s
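The failure scenario and the read-rate arithmetic can be sketched as follows. This is a minimal model under stated assumptions: the function names are hypothetical, the "least-loaded live node" policy is chosen only because it reproduces the diagram (HDFS uses its own rack-aware policy), and the 3000 MB file size is made up for the timing example.

```python
from collections import Counter


def rereplicate(block_map, failed, live_nodes, replication=3):
    """Return a new block map with the failed node's replicas re-created.

    Assumed toy policy: copy each under-replicated block to the live node
    holding the fewest replicas (ties broken by lowest node number).
    """
    new_map = {
        block: [n for n in holders if n != failed]
        for block, holders in block_map.items()
    }
    load = Counter(n for holders in new_map.values() for n in holders)
    for block in sorted(new_map):
        holders = new_map[block]
        while len(holders) < replication:
            candidates = [n for n in live_nodes if n not in holders]
            if not candidates:
                break  # too few live nodes; the block stays under-replicated
            target = min(candidates, key=lambda n: (load[n], n))
            holders.append(target)
            load[target] += 1
    return new_map


def aggregate_rate_mb_s(per_node_mb_s, num_nodes):
    """Combined read rate when blocks are read from several nodes in parallel."""
    return per_node_mb_s * num_nodes


def read_time_s(file_size_mb, per_node_mb_s, num_nodes):
    """Idealized read time: file size divided by the aggregate rate."""
    return file_size_mb / aggregate_rate_mb_s(per_node_mb_s, num_nodes)


before = {1: [1, 4, 6], 2: [2, 5, 3], 3: [3, 2, 6]}
after = rereplicate(before, failed=1, live_nodes=[2, 3, 4, 5, 6])
print(after)                         # block 1 now lives on nodes 4, 6 and 5
print(aggregate_rate_mb_s(100, 3))   # 300
print(read_time_s(3000, 100, 3))     # 10.0
```

Note that the product of transfer rate and machine count is a rate, not a time. The time to read a file is its size divided by that rate, which is why adding machines shortens reads.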
HDFS Shell
                • Easy to use command line interface.
                • Create, copy, move, and delete files.
                • Administrative duties - chmod, chown, chgrp.
                • Set replication factor for a file.
                • Head, tail, cat to view files.
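
As an illustration of the operations listed above, the sketch below builds `hadoop fs` command lines in Python and prints them without running anything. The helper name, paths, and user name are hypothetical. It sticks to long-standing subcommands (`-mkdir`, `-put`, `-cp`, `-mv`, `-chmod`, `-chown`, `-setrep`, `-cat`, `-tail`, `-rm`); check `hadoop fs -help` on your own installation for what it supports.

```python
import shlex


def hadoop_fs(*args):
    """Build (but do not run) a `hadoop fs` command as an argument list."""
    return ["hadoop", "fs", *args]


# Hypothetical paths and user name, chosen only for the example.
path = "/user/demo/data.txt"
commands = [
    hadoop_fs("-mkdir", "/user/demo"),               # create a directory
    hadoop_fs("-put", "data.txt", path),             # copy a local file into HDFS
    hadoop_fs("-cp", path, "/user/demo/copy.txt"),   # copy within HDFS
    hadoop_fs("-mv", "/user/demo/copy.txt", "/user/demo/moved.txt"),
    hadoop_fs("-chmod", "644", path),                # administrative duties
    hadoop_fs("-chown", "demo", path),
    hadoop_fs("-setrep", "2", path),                 # set the replication factor
    hadoop_fs("-cat", path),                         # view the whole file
    hadoop_fs("-tail", path),                        # view the end of the file
    hadoop_fs("-rm", "/user/demo/moved.txt"),        # delete a file
]

for cmd in commands:
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

To execute a command on a machine with a Hadoop client installed, each list could be passed to `subprocess.run(cmd, check=True)`.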



MapReduce in Hadoop

MapReduce Basics
                • Logical functions: Mappers and Reducers.
                • Developers write map and reduce functions,
                       then submit a jar to the Hadoop cluster.

                • Hadoop handles distributing the Map and
                       Reduce tasks across the cluster.

                • Typically batch oriented.


MapReduce Daemons
           • JobTracker (Master)
               - Manages MapReduce jobs, giving tasks to
                      different nodes, managing task failure

           • TaskTracker (Slave)
               - Creates individual map and reduce tasks
               - Reports task status to JobTracker


MapReduce in Hadoop
                      Let’s look at how MapReduce actually works in Hadoop,
                             using WordCount.


                [Diagram: WordCount data flow, Input -> Mappers -> Sort, Shuffle -> Reducers -> Output.]

                We need to convert the Input into the Output.

                • Input documents:
                           - Hadoop uses MapReduce
                           - There is a Map phase
                           - There is a Reduce phase

                • Output (word and count, sorted by word):
                           - a 2
                           - hadoop 1
                           - is 2
                           - map 1
                           - mapreduce 1
                           - phase 2
                           - reduce 1
                           - there 2
                           - uses 1
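As a reference point for the target result, here is a plain, single-process word count in Python. It is not MapReduce, only a baseline assumed here to show what the distributed job has to reproduce. Lowercasing and splitting on whitespace are assumptions that happen to match the output shown on the slide.

```python
from collections import Counter

documents = [
    "Hadoop uses MapReduce",
    "There is a Map phase",
    "There is a Reduce phase",
]

# Count every lowercased, whitespace-separated word across all documents.
counts = Counter(word.lower() for doc in documents for word in doc.split())

for word in sorted(counts):
    print(word, counts[word])
```

The MapReduce version produces the same counts but splits the work so that no single machine has to see all of the input.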
                [Diagram: Input -> Mappers. Each document reaches a mapper as a (key, value)
                 record, and the mapper emits one (word, 1) pair per word.]

                Input                      Mapper record   Mapper output
                Hadoop uses MapReduce      (doc1, "…")     (hadoop, 1), (uses, 1), (mapreduce, 1)
                There is a Map phase       (doc2, "…")     (there, 1), (is, 1), (a, 1), (map, 1), (phase, 1)
                (empty)                    (doc3, "")      (no output)
                There is a Reduce phase    (doc4, "…")     (there, 1), (is, 1), (a, 1), (reduce, 1), (phase, 1)
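The map step can be sketched as a small Python function. This is an illustration rather than the Hadoop Java API: the function name and the tokenizing regular expression are assumptions, and the pairing of `doc1` to `doc4` with the input sentences follows the layout of the diagram.

```python
import re


def map_word_count(doc_id, text):
    """Emit one (word, 1) pair per word, lowercased, in document order.

    The key `doc_id` is ignored, as is usual for word count.
    """
    return [(word.lower(), 1) for word in re.findall(r"[A-Za-z0-9]+", text)]


documents = {
    "doc1": "Hadoop uses MapReduce",
    "doc2": "There is a Map phase",
    "doc3": "",  # an empty document produces no pairs
    "doc4": "There is a Reduce phase",
}

mapped = {doc_id: map_word_count(doc_id, text) for doc_id, text in documents.items()}
for doc_id, pairs in mapped.items():
    print(doc_id, pairs)
```

Each call depends only on its own record, which is what lets Hadoop run mappers on different nodes at the same time.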
                [Diagram: Input -> Mappers -> Sort, Shuffle -> Reducers. The mapper output is
                 sorted by key and routed to one of three reducers according to the first
                 character of the word.]

                • Reducer key ranges:
                           - 0-9, a-l
                           - m-q
                           - r-z

                • All pairs with the same key go to the same reducer.
Input            Mappers              Sort,          Reducers
                                               Shuffle
                                                                   0-9, a-l
       Hadoop uses                           (hadoop, 1)
       MapReduce
                           (doc1, "…")                          (a, [1,1]),
                                         (mapreduce, 1)       (hadoop, [1]),
                                                                (is, [1,1])
                                         (uses, 1)
                                           (is, 1), (a, 1)
        There is a
        Map phase
                           (doc2, "…")                             m-q
                                            (map, 1),(phase,1)
                                         (there, 1)              (map, [1]),
                                                             (mapreduce, [1]),
                                                               (phase, [1,1])
                            (doc3, "")
                                               (phase,1)            r-z
                                     (is, 1), (a, 1)          (reduce, [1]),
                                           (there, 1),        (there, [1,1]),
        There is a
      Reduce phase
                           (doc4, "…")     (reduce 1)            (uses, 1)



                                                                     Copyright © 2012-2013, Think Big Analytics, All
                                                         34                                        Rights Reserved
Thursday, January 10, 13
Input            Mappers              Sort,          Reducers                         Output
                                               Shuffle
                                                                   0-9, a-l
       Hadoop uses                           (hadoop, 1)
       MapReduce
                           (doc1, "…")                          (a, [1,1]),                 a2
                                         (mapreduce, 1)       (hadoop, [1]),                hadoop 1
                                                                (is, [1,1])                 is 2
                                         (uses, 1)
                                           (is, 1), (a, 1)
        There is a
        Map phase
                           (doc2, "…")                             m-q
                                            (map, 1),(phase,1)
                                         (there, 1)              (map, [1]),                map 1
                                                             (mapreduce, [1]),              mapreduce 1
                                                               (phase, [1,1])               phase 2
                            (doc3, "")
                                               (phase,1)            r-z
                                     (is, 1), (a, 1)          (reduce, [1]),                reduce 1
                                           (there, 1),        (there, [1,1]),               there 2
        There is a
      Reduce phase
                           (doc4, "…")     (reduce 1)            (uses, 1)                  uses 1



                                                                     Copyright © 2012-2013, Think Big Analytics, All
                                                         35                                        Rights Reserved
Thursday, January 10, 13
Input            Mappers              Sort,          Reducers                         Output
                                               Shuffle
                                                                   0-9, a-l
       Hadoop uses                           (hadoop, 1)
       MapReduce           (doc1, "…")                          (a, [1,1]),                 a2
                                         (mapreduce, 1)       (hadoop, [1]),                hadoop 1
                                                                (is, [1,1])                 is 2
                                         (uses, 1)
                                           (is, 1), (a, 1)
        There is a
        Map phase
                           (doc2, "…")                             m-q
                                            (map, 1),(phase,1)
                                         (there, 1)              (map, [1]),                map 1
                                                             (mapreduce, [1]),              mapreduce 1
                                                               (phase, [1,1])               phase 2
                            (doc3, "")
                                               (phase,1)            r-z
                                     (is, 1), (a, 1)          (reduce, [1]),
                                           (there, 1),        (there, [1,1]),
                           (doc4, "…")     (reduce 1)            (uses, 1)



                                                                     Copyright © 2012-2013, Think Big Analytics, All
                                                         36                                        Rights Reserved
Thursday, January 10, 13
Input            Mappers              Sort,          Reducers                         Output
                                               Shuffle
                                                                   0-9, a-l
       Hadoop uses                           (hadoop, 1)
       MapReduce           (doc1, "…")                          (a, [1,1]),                 a2
                                         (mapreduce, 1)       (hadoop, [1]),                hadoop 1
                                                                (is, [1,1])                 is 2
                                         (uses, 1)
                                           (is, 1), (a, 1)
        There is a
        Map phase
                           (doc2, "…")                             m-q
                                            (map, 1),(phase,1)
                                         (there, 1)              (map, [1]),                map 1
                                                             (mapreduce, [1]),              mapreduce 1
                                                               (phase, [1,1])               phase 2
             Map:           (doc3, "")


     •
                                               (phase,1)            r-z
             Transform one input 1), (a, 1)
                              (is,
                                   to 0-N
                                                              (reduce, [1]),
             outputs.               (there, 1),               (there, [1,1]),
                           (doc4, "…")     (reduce 1)            (uses, 1)



                                                                     Copyright © 2012-2013, Think Big Analytics, All
                                                        36                                         Rights Reserved
Thursday, January 10, 13
Input            Mappers              Sort,               Reducers                          Output
                                               Shuffle
                                                                        0-9, a-l
       Hadoop uses                           (hadoop, 1)
       MapReduce           (doc1, "…")                               (a, [1,1]),                  a2
                                         (mapreduce, 1)            (hadoop, [1]),                 hadoop 1
                                                                     (is, [1,1])                  is 2
                                         (uses, 1)
                                           (is, 1), (a, 1)
        There is a
        Map phase
                           (doc2, "…")                                  m-q
                                            (map, 1),(phase,1)
                                         (there, 1)              (map, [1]),                      map 1
                                                             (mapreduce, [1]),                    mapreduce 1
                                                               (phase, [1,1])                     phase 2
             Map:           (doc3, "")                           Reduce:

     •                                                       •
                                               (phase,1)                 r-z
             Transform one input 1), (a, 1)
                              (is,
                                   to 0-N                        Collect multiple           inputs into
                                                                    (reduce, [1]),
             outputs.               (there, 1),                  one output.
                                                                   (there, [1,1]),
                           (doc4, "…")     (reduce 1)                 (uses, 1)



                                                                           Copyright © 2012-2013, Think Big Analytics, All
                                                        36                                               Rights Reserved
Thursday, January 10, 13
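The data flow above can be traced in plain Java. This is a minimal sketch, not Hadoop code: the class name WordCountFlowSketch and its methods are made up for illustration, and a sorted in-memory map stands in for the sort/shuffle step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the WordCount data flow; no Hadoop classes are used.
class WordCountFlowSketch {

    // Map step: one document in, 0-N lowercase words out (each stands for a (word, 1) pair).
    static List<String> map(String document) {
        List<String> words = new ArrayList<>();
        for (String token : document.split("\\s+")) {
            if (token.length() > 0) {
                words.add(token.toLowerCase());
            }
        }
        return words;
    }

    // Sort/shuffle step: group the 1s by word; TreeMap keeps the keys sorted.
    static Map<String, List<Integer>> shuffle(List<String> documents) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String document : documents) {
            for (String word : map(document)) {
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }
        return grouped;
    }

    // Reduce step: all values for one key in, one total out.
    static int reduce(List<Integer> values) {
        int count = 0;
        for (int value : values) {
            count += value;
        }
        return count;
    }

    // Runs map, shuffle and reduce over all documents.
    static Map<String, Integer> wordCount(List<String> documents) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(documents).entrySet()) {
            totals.put(entry.getKey(), reduce(entry.getValue()));
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> documents = Arrays.asList(
            "Hadoop uses MapReduce", "There is a Map phase", "", "There is a Reduce phase");
        // Prints one "word count" line per key, e.g. "a 2", "hadoop 1", "is 2".
        wordCount(documents).forEach((word, count) -> System.out.println(word + " " + count));
    }
}
```

The empty document (doc3) produces no pairs, which is the "0" in "0-N outputs".
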
Cluster View of MapReduce

    Job:          jar containing the map (M) and reduce (R) code
    Master:       NameNode, JobTracker
    Worker nodes: three, each running a TaskTracker and a DataNode

    1. Map Phase: each TaskTracker runs a map task (M) that takes k,v pairs in and emits k,v pairs.
       * Intermediate Data Is Stored Locally
    2. Shuffle/Sort: the intermediate k,v pairs are moved between nodes.
    3. Reduce Phase: each TaskTracker runs a reduce task (R) that takes k,v pairs in and emits k,v pairs.
    4. Job Complete!

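During Shuffle/Sort, something has to decide which reducer receives each key. The sketch below shows two ways to make that choice in plain Java. The class and method names are hypothetical. The range rule mirrors the key ranges drawn in the WordCount diagram (0-9, a-l / m-q / r-z) and assumes non-empty lowercase keys. The hash rule uses the same formula as Hadoop's default HashPartitioner, but is applied here to String.hashCode() rather than to a Hadoop key type, so the actual partition numbers on a cluster would differ.

```java
// Plain-Java sketch of assigning keys to reducers; not Hadoop's Partitioner interface.
class PartitionSketch {

    // Range rule from the diagram: 0 = "0-9, a-l", 1 = "m-q", 2 = "r-z".
    // Assumes a non-empty, lowercase key (digits sort before letters in ASCII).
    static int rangePartition(String key) {
        char first = key.charAt(0);
        if (first <= 'l') {
            return 0;
        }
        if (first <= 'q') {
            return 1;
        }
        return 2;
    }

    // Hash rule: mask off the sign bit, then take the remainder.
    static int hashPartition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        String[] keys = {"hadoop", "mapreduce", "uses"};
        for (String key : keys) {
            System.out.println(key + " -> range partition " + rangePartition(key)
                + ", hash partition " + hashPartition(key, 3));
        }
    }
}
```

Either rule sends every occurrence of a key to the same reducer, which is what lets the reducer see all the values for that key together.
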
The Hadoop Java API

MapReduce in Java

    Let’s look at WordCount written in the MapReduce Java API.

Map Code
Let’s drill into this code...

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Mapper class with 4 type parameters for the input key-value types
// and output types.
public class SimpleWordCountMapper
 extends MapReduceBase implements
 Mapper<LongWritable, Text, Text, IntWritable> {

    // Output key-value objects we'll reuse.
    static final Text word = new Text();
    static final IntWritable one = new IntWritable(1);

    // Map method with input, output "collector", and reporting object.
    @Override
    public void map(LongWritable key, Text documentContents,
        OutputCollector<Text, IntWritable> collector, Reporter reporter)
        throws IOException {
      // Tokenize the line on whitespace, "collect" each (word, 1).
      String[] tokens = documentContents.toString().split("\\s+");
      for (String wordString : tokens) {
        if (wordString.length() > 0) {
          word.set(wordString.toLowerCase());
          collector.collect(word, one);
        }
      }
    }
}

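The tokenizing step of the mapper can be tried on its own without a cluster. The sketch below is plain Java with a hypothetical class name, TokenizeSketch. It splits on runs of whitespace with the regex \s+ and shows why the length check is needed: a line that starts with whitespace yields an empty first token.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java version of the mapper's tokenizing logic; no Hadoop classes are used.
class TokenizeSketch {

    // Splits a line on runs of whitespace and lowercases each non-empty token.
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.split("\\s+")) {
            // Leading whitespace (or an empty line) produces an empty token; skip it.
            if (token.length() > 0) {
                words.add(token.toLowerCase());
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [hadoop, uses, mapreduce]
        System.out.println(tokenize("  Hadoop   uses\tMapReduce"));
        // prints []
        System.out.println(tokenize(""));
    }
}
```
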
Reduce Code
Let’s drill into this code...

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Reducer class with 4 type parameters for the input key-value types
// and output types.
public class SimpleWordCountReducer
 extends MapReduceBase implements
 Reducer<Text, IntWritable, Text, IntWritable> {

    // Reduce method with input, output "collector", and reporting object.
    @Override
    public void reduce(Text key, Iterator<IntWritable> counts,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      // Sum the counts per word and emit (word, N).
      int count = 0;
      while (counts.hasNext()) {
        count += counts.next().get();
      }
      output.collect(key, new IntWritable(count));
    }
}

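The reducer adds up the values it is given rather than counting how many values arrive. A minimal plain-Java sketch of that loop follows; the class name ReduceSketch is hypothetical and Integer stands in for IntWritable. Summing gives the right total even if some of the incoming values are already larger than 1, for example because an earlier step added a few of them together.

```java
import java.util.Arrays;
import java.util.Iterator;

// Plain-Java version of the reducer's summing loop; no Hadoop classes are used.
class ReduceSketch {

    // Adds up every value the iterator yields; the iterator is consumed in the process.
    static int sum(Iterator<Integer> counts) {
        int count = 0;
        while (counts.hasNext()) {
            count += counts.next();
        }
        return count;
    }

    public static void main(String[] args) {
        // (phase, [1,1]) reduces to 2
        System.out.println("phase " + sum(Arrays.asList(1, 1).iterator()));
        // A partially summed input such as [2, 1] reduces to 3
        System.out.println("example " + sum(Arrays.asList(2, 1).iterator()));
    }
}
```
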
Other Options
                     • HDFS - Hadoop Distributed File System.
                     • Map/Reduce - A distributed framework for
                           executing work in parallel.

                     • Hive - A SQL-like syntax with a metastore that
                           allows SQL manipulation of data stored on HDFS.

                     • Pig - A top-down scripting language for
                           manipulating data.

                     • HBase - A NoSQL, non-sequential data store.

Conclusions


Hadoop Benefits

                     • A cost-effective, scalable way to:
                           - Store massive data sets.
                           - Perform arbitrary analyses on
                             those data sets.



Hadoop Tools

                     • Offers a variety of tools for:
                           - Application development.
                           - Integration with other platforms
                             (e.g., databases).



Hadoop Distributions
                     • A rich, open-source ecosystem.
                           - Free to use.
                           - Commercially-supported
                             distributions.



Thank You!
             - Feel free to contact me at
               ‣ ryan.tabora@thinkbiganalytics.com
             - Or our solutions consultant
               ‣ matt.mcdevitt@thinkbiganalytics.com
             - As always, THINK BIG!




Bonus Content


The Hadoop Ecosystem
                     • HDFS - Hadoop Distributed File System.
                     • Map/Reduce - A distributed framework for
                           executing work in parallel.

                     • Hive - A SQL-like syntax with a metastore that
                           allows SQL manipulation of data stored on HDFS.

                     • Pig - A top-down scripting language for
                           manipulating data.

                     • HBase - A NoSQL, non-sequential data store.

Hive: SQL for Hadoop

Hive

                           Let’s look at WordCount
                                written in Hive,
                            the SQL for Hadoop.



CREATE TABLE docs (line STRING);

LOAD DATA INPATH 'docs'
OVERWRITE INTO TABLE docs;

CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\\s+')) AS word
FROM docs) w
GROUP BY word ORDER BY word;


Let’s drill into this code...

-- Create a table to hold the raw text we’re counting.
-- Each line of text becomes a row with a single column, “line”.
CREATE TABLE docs (line STRING);

-- Load the text in the “docs” directory into the table.
LOAD DATA INPATH 'docs'
OVERWRITE INTO TABLE docs;

-- Create the final table and fill it with the results from a nested query
-- of the docs table that performs WordCount on the fly.
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\\s+')) AS word
FROM docs) w
GROUP BY word ORDER BY word;
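
To see what the nested query computes, the same steps can be sketched in plain Java. This is an illustration of the logic only, not of how Hive executes the query; `HiveWordCountSketch` and `wordCounts` are hypothetical names, and the in-memory list stands in for the docs table.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical in-memory analogue of the word_counts query.
class HiveWordCountSketch {

    static Map<String, Long> wordCounts(List<String> docs) {
        return docs.stream()
                // Inner query: split each line on whitespace and "explode"
                // the resulting array into one row per word.
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                // Outer query: GROUP BY word with count(1). The TreeMap keeps
                // keys sorted, which plays the role of ORDER BY word.
                .collect(Collectors.groupingBy(
                        word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("to be or not", "to be");
        wordCounts(docs).forEach((word, count) ->
                System.out.println(word + "\t" + count));
    }
}
```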


Hive

               Because so many Hadoop users
                come from SQL backgrounds,
                   Hive is one of the most
              essential tools in the ecosystem!!


Pig: Data Flow for Hadoop

Pig

                           Let’s look at WordCount
                                 written in Pig,
                           the Data Flow language
                                  for Hadoop.


inpt = LOAD 'docs' using TextLoader
     AS (line:chararray);

  words = FOREACH inpt
     GENERATE flatten(TOKENIZE(line)) AS word;

  grpd = GROUP words BY word;

  cntd = FOREACH grpd
     GENERATE group, COUNT(words);

  STORE cntd INTO 'output';

Let’s drill into this code...

-- Like the Hive example, load the “docs” content.
-- Each line becomes a record with a single field, “line”.
inpt = LOAD 'docs' using TextLoader
     AS (line:chararray);

-- Tokenize into words (a bag) and “flatten” into separate records.
words = FOREACH inpt
     GENERATE flatten(TOKENIZE(line)) AS word;

-- Collect the same words together.
grpd = GROUP words BY word;

-- Count each word.
cntd = FOREACH grpd
     GENERATE group, COUNT(words);

-- Save the results. Profit!
STORE cntd INTO 'output';
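
The data flow can be followed one relation at a time in plain Java. This is a rough sketch, assuming an in-memory list in place of the 'docs' input; `PigDataFlowSketch` and its method names are hypothetical, and splitting on whitespace only is a simplification of what TOKENIZE does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory analogue of the Pig script, one method per relation.
class PigDataFlowSketch {

    // words = FOREACH inpt GENERATE flatten(TOKENIZE(line)) AS word;
    static List<String> tokenizeAndFlatten(List<String> inpt) {
        List<String> words = new ArrayList<>();
        for (String line : inpt) {
            for (String token : line.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    words.add(token); // one record per word
                }
            }
        }
        return words;
    }

    // grpd = GROUP words BY word;
    static Map<String, List<String>> groupByWord(List<String> words) {
        Map<String, List<String>> grpd = new LinkedHashMap<>();
        for (String word : words) {
            grpd.computeIfAbsent(word, k -> new ArrayList<>()).add(word);
        }
        return grpd;
    }

    // cntd = FOREACH grpd GENERATE group, COUNT(words);
    static Map<String, Long> countGroups(Map<String, List<String>> grpd) {
        Map<String, Long> cntd = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : grpd.entrySet()) {
            cntd.put(entry.getKey(), (long) entry.getValue().size());
        }
        return cntd;
    }

    public static void main(String[] args) {
        List<String> inpt = Arrays.asList("pig and hive", "pig for etl");
        List<String> words = tokenizeAndFlatten(inpt);
        Map<String, List<String>> grpd = groupByWord(words);
        Map<String, Long> cntd = countGroups(grpd);
        System.out.println(cntd); // stands in for STORE cntd INTO 'output'
    }
}
```

Keeping each relation as a named intermediate value is the point of the comparison: a Pig script reads as a sequence of transformations, where the Hive version expresses the same result as one nested query.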


Pig

                              Pig and Hive overlap,
                           but Pig is popular for ETL,
                           e.g., data transformation,
                            cleansing, ingestion, etc.


Questions?


                                    Copyright © 2012-2013, Think Big Analytics, All
                               74                                 Rights Reserved
Thursday, January 10, 13

More Related Content

What's hot

Learn Big Data & Hadoop
Learn Big Data & Hadoop Learn Big Data & Hadoop
Learn Big Data & Hadoop Edureka!
 
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...Edureka!
 
Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...
Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...
Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...Ashok Royal
 
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsIntroduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsSkillspeed
 
Why hadoop for data science?
Why hadoop for data science?Why hadoop for data science?
Why hadoop for data science?Hortonworks
 
Big Data and Hadoop Basics
Big Data and Hadoop BasicsBig Data and Hadoop Basics
Big Data and Hadoop BasicsSonal Tiwari
 
Big data introduction, Hadoop in details
Big data introduction, Hadoop in detailsBig data introduction, Hadoop in details
Big data introduction, Hadoop in detailsMahmoud Yassin
 
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...Edureka!
 
Learn Hadoop
Learn HadoopLearn Hadoop
Learn HadoopEdureka!
 
Hadoop for Data Warehousing professionals
Hadoop for Data Warehousing professionalsHadoop for Data Warehousing professionals
Hadoop for Data Warehousing professionalsEdureka!
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceMahantesh Angadi
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Ranjith Sekar
 
Webinar: Big Data & Hadoop - When not to use Hadoop
Webinar: Big Data & Hadoop - When not to use HadoopWebinar: Big Data & Hadoop - When not to use Hadoop
Webinar: Big Data & Hadoop - When not to use HadoopEdureka!
 
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...Edureka!
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideDanairat Thanabodithammachari
 

What's hot (19)

Learn Big Data & Hadoop
Learn Big Data & Hadoop Learn Big Data & Hadoop
Learn Big Data & Hadoop
 
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
 
Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...
Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...
Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...
 
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsIntroduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
 
Why hadoop for data science?
Why hadoop for data science?Why hadoop for data science?
Why hadoop for data science?
 
Big Data and Hadoop Basics
Big Data and Hadoop BasicsBig Data and Hadoop Basics
Big Data and Hadoop Basics
 
Big data introduction, Hadoop in details
Big data introduction, Hadoop in detailsBig data introduction, Hadoop in details
Big data introduction, Hadoop in details
 
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
 
Big data abstract
Big data abstractBig data abstract
Big data abstract
 
Learn Hadoop
Learn HadoopLearn Hadoop
Learn Hadoop
 
Hadoop for Data Warehousing professionals
Hadoop for Data Warehousing professionalsHadoop for Data Warehousing professionals
Hadoop for Data Warehousing professionals
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
 
Webinar: Big Data & Hadoop - When not to use Hadoop
Webinar: Big Data & Hadoop - When not to use HadoopWebinar: Big Data & Hadoop - When not to use Hadoop
Webinar: Big Data & Hadoop - When not to use Hadoop
 
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guide
 

Similar to Intro to HDFS and MapReduce

Big data, map reduce and beyond
Big data, map reduce and beyondBig data, map reduce and beyond
Big data, map reduce and beyonddatasalt
 
What is the Point of Hadoop
What is the Point of HadoopWhat is the Point of Hadoop
What is the Point of HadoopDataWorks Summit
 
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014Josh Patterson
 
UK - Agile Data Applications on Hadoop
UK - Agile Data Applications on HadoopUK - Agile Data Applications on Hadoop
UK - Agile Data Applications on HadoopHortonworks
 
Hadoop hdfs interview questions
Hadoop hdfs interview questionsHadoop hdfs interview questions
Hadoop hdfs interview questionsKalyan Hadoop
 
Create a Smarter Data Lake with HP Haven and Apache Hadoop
Create a Smarter Data Lake with HP Haven and Apache HadoopCreate a Smarter Data Lake with HP Haven and Apache Hadoop
Create a Smarter Data Lake with HP Haven and Apache HadoopHortonworks
 
Rob peglar introduction_analytics _big data_hadoop
Rob peglar introduction_analytics _big data_hadoopRob peglar introduction_analytics _big data_hadoop
Rob peglar introduction_analytics _big data_hadoopGhassan Al-Yafie
 
Hortonworks Big Data & Hadoop
Hortonworks Big Data & HadoopHortonworks Big Data & Hadoop
Hortonworks Big Data & HadoopMark Ginnebaugh
 
Big Data-Survey
Big Data-SurveyBig Data-Survey
Big Data-Surveyijeei-iaes
 
Bay Area Hadoop User Group
Bay Area Hadoop User GroupBay Area Hadoop User Group
Bay Area Hadoop User GroupPentaho
 
Big Data Performance and Capacity Management
Big Data Performance and Capacity ManagementBig Data Performance and Capacity Management
Big Data Performance and Capacity Managementrightsize
 
Big Data LDN 2018: AGILE DATA MASTERING: THE RIGHT APPROACH FOR DATAOPS
Big Data LDN 2018: AGILE DATA MASTERING: THE RIGHT APPROACH FOR DATAOPSBig Data LDN 2018: AGILE DATA MASTERING: THE RIGHT APPROACH FOR DATAOPS
Big Data LDN 2018: AGILE DATA MASTERING: THE RIGHT APPROACH FOR DATAOPSMatt Stubbs
 
Gail Zhou on "Big Data Technology, Strategy, and Applications"
Gail Zhou on "Big Data Technology, Strategy, and Applications"Gail Zhou on "Big Data Technology, Strategy, and Applications"
Gail Zhou on "Big Data Technology, Strategy, and Applications"Gail Zhou, MBA, PhD
 
Significance Of Hadoop For Data Science
Significance Of Hadoop For Data ScienceSignificance Of Hadoop For Data Science
Significance Of Hadoop For Data ScienceRobert Smith
 
Hadoop Overview
Hadoop Overview Hadoop Overview
Hadoop Overview EMC
 
Introduction-to-Big-Data-and-Hadoop.pptx
Introduction-to-Big-Data-and-Hadoop.pptxIntroduction-to-Big-Data-and-Hadoop.pptx
Introduction-to-Big-Data-and-Hadoop.pptxPratimakumari213460
 
Learn About Big Data and Hadoop The Most Significant Resource
Learn About Big Data and Hadoop The Most Significant ResourceLearn About Big Data and Hadoop The Most Significant Resource
Learn About Big Data and Hadoop The Most Significant ResourceAssignment Help
 

Similar to Intro to HDFS and MapReduce (20)

Big data, map reduce and beyond
Big data, map reduce and beyondBig data, map reduce and beyond
Big data, map reduce and beyond
 
What is the Point of Hadoop
What is the Point of HadoopWhat is the Point of Hadoop
What is the Point of Hadoop
 
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
 
UK - Agile Data Applications on Hadoop
UK - Agile Data Applications on HadoopUK - Agile Data Applications on Hadoop
UK - Agile Data Applications on Hadoop
 
Hadoop hdfs interview questions
Hadoop hdfs interview questionsHadoop hdfs interview questions
Hadoop hdfs interview questions
 
Create a Smarter Data Lake with HP Haven and Apache Hadoop
Create a Smarter Data Lake with HP Haven and Apache HadoopCreate a Smarter Data Lake with HP Haven and Apache Hadoop
Create a Smarter Data Lake with HP Haven and Apache Hadoop
 
Big data primer
Big data primerBig data primer
Big data primer
 
Rob peglar introduction_analytics _big data_hadoop
Rob peglar introduction_analytics _big data_hadoopRob peglar introduction_analytics _big data_hadoop
Rob peglar introduction_analytics _big data_hadoop
 
Hortonworks Big Data & Hadoop
Hortonworks Big Data & HadoopHortonworks Big Data & Hadoop
Hortonworks Big Data & Hadoop
 
Big Data-Survey
Big Data-SurveyBig Data-Survey
Big Data-Survey
 
Big Data
Big DataBig Data
Big Data
 
Bay Area Hadoop User Group
Bay Area Hadoop User GroupBay Area Hadoop User Group
Bay Area Hadoop User Group
 
Big Data Performance and Capacity Management
Big Data Performance and Capacity ManagementBig Data Performance and Capacity Management
Big Data Performance and Capacity Management
 
Big Data LDN 2018: AGILE DATA MASTERING: THE RIGHT APPROACH FOR DATAOPS
Big Data LDN 2018: AGILE DATA MASTERING: THE RIGHT APPROACH FOR DATAOPSBig Data LDN 2018: AGILE DATA MASTERING: THE RIGHT APPROACH FOR DATAOPS
Big Data LDN 2018: AGILE DATA MASTERING: THE RIGHT APPROACH FOR DATAOPS
 
Gail Zhou on "Big Data Technology, Strategy, and Applications"
Gail Zhou on "Big Data Technology, Strategy, and Applications"Gail Zhou on "Big Data Technology, Strategy, and Applications"
Gail Zhou on "Big Data Technology, Strategy, and Applications"
 
Significance Of Hadoop For Data Science
Significance Of Hadoop For Data ScienceSignificance Of Hadoop For Data Science
Significance Of Hadoop For Data Science
 
Hadoop Overview
Hadoop Overview Hadoop Overview
Hadoop Overview
 
Big Data
Big DataBig Data
Big Data
 
Introduction-to-Big-Data-and-Hadoop.pptx
Introduction-to-Big-Data-and-Hadoop.pptxIntroduction-to-Big-Data-and-Hadoop.pptx
Introduction-to-Big-Data-and-Hadoop.pptx
 
Learn About Big Data and Hadoop The Most Significant Resource
Learn About Big Data and Hadoop The Most Significant ResourceLearn About Big Data and Hadoop The Most Significant Resource
Learn About Big Data and Hadoop The Most Significant Resource
 

Intro to HDFS and MapReduce

  • 1. Introduction to HDFS and MapReduce Copyright © 2012-2013, Think Big Analytics, All Rights Reserved Thursday, January 10, 13
  • 2. Who Am I - Ryan Tabora - Data Developer at Think Big Analytics - Big Data Consulting - Experience working with Hadoop, HBase, Hive, Solr, Cassandra, etc.
  • 4. Think Big is the leading professional services firm that’s purpose-built for Big Data. • One of Silicon Valley’s fastest-growing Big Data startups • 100% focus on Big Data consulting & Data Science solution services • Management background: Cambridge Technology, C-bridge, Oracle, Sun Microsystems, Quantcast, Accenture; C-bridge Internet Solutions (CBIS) founder 1996 & executives, IPO 1999 • Clients: 40+ • North America locations • US East: Boston, New York, Washington D.C. • US Central: Chicago, Austin • US West: HQ Mountain View, San Diego, Salt Lake City • EMEA & APAC • Confidential Think Big Analytics
  • 5. Think Big Recognized as a Top Pure-Play Big Data Vendor. Source: Forbes, February 2012.
  • 6. Agenda - Big Data - Hadoop Ecosystem - HDFS - MapReduce in Hadoop - The Hadoop Java API - Conclusions
  • 7. Big Data
  • 8. A Data Shift... Source: EMC Digital Universe Study*
  • 9. Motivation “Simple algorithms and lots of data trump complex models.” Halevy, Norvig, and Pereira (Google), IEEE Intelligent Systems
  • 10. Pioneers • Google and Yahoo: - Index 850+ million websites, over one trillion URLs. • Facebook ad targeting: - 840+ million users, > 50% of whom are active daily.
  • 11. Hadoop Ecosystem
  • 12. Common Tool? • Hadoop - Cluster: distributed computing platform. - Commodity*, server-class hardware. - Extensible Platform.
  • 13. Hadoop Origins • MapReduce and Google File System (GFS) pioneered at Google. • Hadoop is the commercially-supported open-source equivalent.
  • 14. What Is Hadoop? • Hadoop is a platform. • Distributes and replicates data. • Manages parallel tasks created by users. • Runs as several processes on a cluster. • The term Hadoop generally refers to a toolset, not a single tool.
  • 15. Why Hadoop? • Handles unstructured to semi-structured to structured data. • Handles enormous data volumes. • Flexible data analysis and machine learning tools. • Cost-effective scalability.
  • 16. The Hadoop Ecosystem • HDFS - Hadoop Distributed File System. • Map/Reduce - A distributed framework for executing work in parallel. • Hive - A SQL-like syntax with a metastore to allow SQL manipulation of data stored on HDFS. • Pig - A top-down scripting language to manipulate data. • HBase - A NoSQL, non-sequential data store.
  • 18. HDFS
  • 19. What Is HDFS? • Hadoop Distributed File System. • Stores files in blocks across many nodes in a cluster. • Replicates the blocks across nodes for durability. • Master/Slave architecture.
  • 20. HDFS Traits • Not fully POSIX compliant. • No file updates. • Write once, read many times. • Large blocks, sequential read patterns. • Designed for batch processing.
  • 21. HDFS Master • NameNode - Runs on a single node as a master process ‣ Holds file metadata (which blocks are where) ‣ Directs client access to files in HDFS • SecondaryNameNode - Not a hot failover - Maintains a copy of the NameNode metadata
  • 22. HDFS Slaves • DataNode - Generally runs on all nodes in the cluster ‣ Block creation/replication/deletion/reads ‣ Takes orders from the NameNode
  • 23-29. HDFS Illustrated (animation builds): a client puts a file into a cluster with one NameNode and DataNodes 1-6. The file is split into three blocks (1, 2, 3), and the NameNode records where each block’s replicas live: block 1 on DataNodes 1, 4, 6; block 2 on DataNodes 2, 5, 3; block 3 on DataNodes 3, 2, 6.
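The block map built up in the HDFS Illustrated slides can be sketched as a small in-memory table. This is a toy model, not Hadoop code: the class `ToyNameNode` and its methods are hypothetical names chosen for illustration, and the replica locations are simply the ones shown on the slides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy stand-in for the NameNode's metadata (hypothetical, for illustration only).
class ToyNameNode {
    // block id -> ids of the DataNodes that hold a replica of that block
    private final Map<Integer, List<Integer>> blockLocations = new TreeMap<>();

    // Record which DataNodes store the replicas of one block.
    void addBlock(int blockId, Integer... dataNodes) {
        blockLocations.put(blockId, new ArrayList<>(Arrays.asList(dataNodes)));
    }

    // Answer a client's question: "where can I read this block?"
    List<Integer> locate(int blockId) {
        return Collections.unmodifiableList(blockLocations.get(blockId));
    }

    // Count how many block replicas a given DataNode stores.
    long replicasOn(int dataNode) {
        return blockLocations.values().stream()
                .filter(nodes -> nodes.contains(dataNode))
                .count();
    }

    public static void main(String[] args) {
        ToyNameNode nameNode = new ToyNameNode();
        nameNode.addBlock(1, 1, 4, 6); // block 1 on DataNodes 1, 4, 6
        nameNode.addBlock(2, 2, 5, 3); // block 2 on DataNodes 2, 5, 3
        nameNode.addBlock(3, 3, 2, 6); // block 3 on DataNodes 3, 2, 6
        for (int block = 1; block <= 3; block++) {
            System.out.println("block " + block + " -> DataNodes " + nameNode.locate(block));
        }
        System.out.println("replicas on DataNode 6: " + nameNode.replicasOn(6));
    }
}
```

The point of the sketch is that the NameNode holds only metadata (which blocks are where); the block contents themselves live on the DataNodes.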
  • 30-37. Power of Hadoop (animation builds): a client reads the file while the NameNode holds the block map 1,4,6 / 2,5,3 / 3,2,6. DataNode 1 then drops out of the diagram, the first entry shrinks to 4,6, and it is restored to 5,4,6 once the block has been copied to DataNode 5. Aggregate read rate = Transfer Rate x Number of Machines*: 100 MB/s x 3 = 300 MB/s.
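The failure scenario and the throughput arithmetic from the Power of Hadoop slides can be replayed in a few lines. This is a sketch under stated assumptions: `ToyReReplication` is a hypothetical class, and the "copy to the least-loaded live node" rule is an assumption that happens to reproduce the slide (block 1 moves to DataNode 5); real HDFS replica placement is more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of re-replication after a DataNode failure (hypothetical, not Hadoop code).
class ToyReReplication {
    static final int REPLICATION = 3; // target number of replicas per block

    // Forget the failed node, then top up every under-replicated block
    // by copying it to the least-loaded live node that lacks it (assumed policy).
    static void handleFailure(Map<Integer, List<Integer>> blocks, int failed, List<Integer> liveNodes) {
        for (List<Integer> holders : blocks.values()) {
            holders.remove(Integer.valueOf(failed));
        }
        for (List<Integer> holders : blocks.values()) {
            while (holders.size() < REPLICATION) {
                Integer target = null;
                long lowestLoad = Long.MAX_VALUE;
                for (Integer node : liveNodes) {
                    if (holders.contains(node)) continue; // already has this block
                    long load = blocks.values().stream().filter(h -> h.contains(node)).count();
                    if (load < lowestLoad) {
                        lowestLoad = load;
                        target = node;
                    }
                }
                if (target == null) break; // not enough live nodes left
                holders.add(target);
            }
        }
    }

    // Aggregate read rate when each block is read from a different machine in parallel.
    static double aggregateReadRate(double perNodeMBps, int machines) {
        return perNodeMBps * machines;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> blocks = new TreeMap<>();
        blocks.put(1, new ArrayList<>(Arrays.asList(1, 4, 6)));
        blocks.put(2, new ArrayList<>(Arrays.asList(2, 5, 3)));
        blocks.put(3, new ArrayList<>(Arrays.asList(3, 2, 6)));

        handleFailure(blocks, 1, Arrays.asList(2, 3, 4, 5, 6)); // DataNode 1 fails
        System.out.println(blocks.get(1)); // prints [4, 6, 5]

        System.out.println(aggregateReadRate(100.0, 3) + " MB/s"); // prints 300.0 MB/s
    }
}
```

Note that the product of a transfer rate and a machine count is a rate (MB/s), which is why three machines at 100 MB/s each give 300 MB/s rather than a time.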
  • 38. HDFS Shell • Easy to use command line interface. • Create, copy, move, and delete files. • Administrative duties - chmod, chown, chgrp. • Set replication factor for a file. • Head, tail, cat to view files.
  • 41. MapReduce in Hadoop
  • 42. MapReduce Basics • Logical functions: Mappers and Reducers. • Developers write map and reduce functions, then submit a jar to the Hadoop cluster. • Hadoop handles distributing the Map and Reduce tasks across the cluster. • Typically batch oriented.
  • 43. MapReduce Daemons • JobTracker (Master) - Manages MapReduce jobs, giving tasks to different nodes, managing task failure • TaskTracker (Slave) - Creates individual map and reduce tasks - Reports task status to JobTracker
  • 45. MapReduce in Hadoop Let’s look at how MapReduce actually works in Hadoop, using WordCount.
  • 46-48. WordCount overview (diagram with the columns Input, Mappers, Sort/Shuffle, Reducers, Output). Input: “Hadoop uses MapReduce”, “There is a Map phase”, “There is a Reduce phase”. Output: a 2, hadoop 1, is 2, map 1, mapreduce 1, phase 2, reduce 1, there 2, uses 1. We need to convert the Input into the Output.
  • 49. Input to Mappers: each document reaches a mapper as a (key, value) record: (doc1, "…") for “Hadoop uses MapReduce”, (doc2, "…") for “There is a Map phase”, (doc3, "") for an empty document, (doc4, "…") for “There is a Reduce phase”.
  • 50. Mappers: each mapper emits one (word, 1) pair per word. doc1: (hadoop, 1), (uses, 1), (mapreduce, 1). doc2: (there, 1), (is, 1), (a, 1), (map, 1), (phase, 1). doc3: nothing. doc4: (there, 1), (is, 1), (a, 1), (reduce, 1), (phase, 1).
  • 51. Sort, Shuffle: the pairs are routed to three reducers by key range: 0-9 and a-l; m-q; r-z.
  • 52. Reducers: after the sort and shuffle, each reducer sees its keys with their values grouped. 0-9, a-l: (a, [1,1]), (hadoop, [1]), (is, [1,1]). m-q: (map, [1]), (mapreduce, [1]), (phase, [1,1]). r-z: (reduce, [1]), (there, [1,1]), (uses, [1]).
  • 53. Output: each reducer writes one count per key. 0-9, a-l: a 2, hadoop 1, is 2. m-q: map 1, mapreduce 1, phase 2. r-z: reduce 1, there 2, uses 1.
  • 54-56. Summary over the same diagram. Map: transform one input to 0-N outputs. Reduce: collect multiple inputs into one output.
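The WordCount data flow on these slides (map, then sort/shuffle by key range, then reduce) can be simulated in plain Java without a cluster. This is a minimal sketch, not Hadoop code: `ToyWordCount` and its method names are hypothetical, and the three key ranges are taken from the diagram.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory simulation of the WordCount data flow (hypothetical, for illustration only).
class ToyWordCount {
    // Map: one input document -> zero or more (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String document) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : document.split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token.toLowerCase(), 1));
            }
        }
        return pairs;
    }

    // Partition: pick a reducer from the key's first character,
    // using the ranges on the slide (0-9 and a-l, m-q, r-z). Keys are lower-cased words.
    static int partition(String key) {
        char first = key.charAt(0);
        if (first <= 'l') return 0; // digits sort before letters, so 0-9 land here too
        if (first <= 'q') return 1;
        return 2;
    }

    // Reduce: all values for one key -> a single total.
    static int reduce(Iterator<Integer> counts) {
        int total = 0;
        while (counts.hasNext()) {
            total += counts.next();
        }
        return total;
    }

    // Run map, shuffle/sort, and reduce; returns one output table per reducer.
    static List<Map<String, Integer>> run(List<String> documents) {
        List<TreeMap<String, List<Integer>>> shuffled = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            shuffled.add(new TreeMap<>()); // TreeMap keeps each reducer's keys sorted
        }
        for (String document : documents) {
            for (Map.Entry<String, Integer> pair : map(document)) {
                shuffled.get(partition(pair.getKey()))
                        .computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                        .add(pair.getValue());
            }
        }
        List<Map<String, Integer>> outputs = new ArrayList<>();
        for (TreeMap<String, List<Integer>> reducerInput : shuffled) {
            Map<String, Integer> output = new LinkedHashMap<>();
            reducerInput.forEach((word, ones) -> output.put(word, reduce(ones.iterator())));
            outputs.add(output);
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<String> documents = Arrays.asList(
                "Hadoop uses MapReduce", "There is a Map phase", "", "There is a Reduce phase");
        // prints [{a=2, hadoop=1, is=2}, {map=1, mapreduce=1, phase=2}, {reduce=1, there=2, uses=1}]
        System.out.println(run(documents));
    }
}
```

The empty third document produces no pairs, which is the "0" case of "one input to 0-N outputs".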
  • 57-67. Cluster View of MapReduce (animation builds): the diagram shows a NameNode, a JobTracker holding the job jar with its Map (M) and Reduce (R) code, and three worker nodes that each run a TaskTracker and a DataNode. Map Phase: a map task (M) runs on each TaskTracker and emits k,v pairs. * Intermediate data is stored locally. Shuffle/Sort: the k,v pairs move between the nodes. Reduce Phase: a reduce task (R) on each TaskTracker consumes the k,v pairs. Job Complete!
  • 68. The Hadoop Java API
  • 70. MapReduce in Java Let’s look at WordCount written in the MapReduce Java API.
  • 71. Map Code public class SimpleWordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { static final Text word = new Text(); static final IntWritable one = new IntWritable(1); @Override public void map(LongWritable key, Text documentContents, OutputCollector<Text, IntWritable> collector, Reporter reporter) throws IOException { String[] tokens = documentContents.toString().split("s+"); for (String wordString : tokens) { if (wordString.length() > 0) { word.set(wordString.toLowerCase()); collector.collect(word, one); } } } } Copyright © 2012-2013, Think Big Analytics, All 40 Rights Reserved Thursday, January 10, 13
  • 72. Map Code public class SimpleWordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { static final Text word = new Text(); static final IntWritable one = new IntWritable(1); @Override public void map(LongWritable key, Text documentContents, OutputCollector<Text, IntWritable> collector, Reporter reporter) throws IOException { String[] tokens = documentContents.toString().split("s+"); for (String wordString : tokens) { if (wordString.length() > 0) { word.set(wordString.toLowerCase()); collector.collect(word, one); } Let’s drill into this code... } } } Copyright © 2012-2013, Think Big Analytics, All 40 Rights Reserved Thursday, January 10, 13
  • 73. Map Code public class SimpleWordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { static final Text word = new Text(); static final IntWritable one = new IntWritable(1); @Override public void map(LongWritable key, Text documentContents, OutputCollector<Text, IntWritable> collector, Reporter reporter) throws IOException { String[] tokens = documentContents.toString().split("s+"); for (String wordString : tokens) { if (wordString.length() > 0) { word.set(wordString.toLowerCase()); collector.collect(word, one); } } } } Copyright © 2012-2013, Think Big Analytics, All 41 Rights Reserved Thursday, January 10, 13
  • 74. Map Code public class SimpleWordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { static final Text word = new Text(); static final IntWritable one = new IntWritable(1); @Override public void map(LongWritable key, Text documentContents, OutputCollector<Text, IntWritable> collector, Reporter reporter) throws IOException { String[] tokens = documentContents.toString().split("\\s+"); for (String wordString : tokens) { if (wordString.length() > 0) { word.set(wordString.toLowerCase()); collector.collect(word, one); } } } } Callout: Mapper class with 4 type parameters for the input key-value types and output types.
  • 75. Map Code public class SimpleWordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { static final Text word = new Text(); static final IntWritable one = new IntWritable(1); @Override public void map(LongWritable key, Text documentContents, OutputCollector<Text, IntWritable> collector, Reporter reporter) throws IOException { String[] tokens = documentContents.toString().split("\\s+"); for (String wordString : tokens) { if (wordString.length() > 0) { word.set(wordString.toLowerCase()); collector.collect(word, one); } } } } Callout: Output key-value objects we’ll reuse.
  • 76. Map Code public class SimpleWordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { static final Text word = new Text(); static final IntWritable one = new IntWritable(1); @Override public void map(LongWritable key, Text documentContents, OutputCollector<Text, IntWritable> collector, Reporter reporter) throws IOException { String[] tokens = documentContents.toString().split("\\s+"); for (String wordString : tokens) { if (wordString.length() > 0) { word.set(wordString.toLowerCase()); collector.collect(word, one); } } } } Callout: Map method with input, output “collector”, and reporting object.
  • 77. Map Code public class SimpleWordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { static final Text word = new Text(); static final IntWritable one = new IntWritable(1); @Override public void map(LongWritable key, Text documentContents, OutputCollector<Text, IntWritable> collector, Reporter reporter) throws IOException { String[] tokens = documentContents.toString().split("\\s+"); for (String wordString : tokens) { if (wordString.length() > 0) { word.set(wordString.toLowerCase()); collector.collect(word, one); } } } } Callout: Tokenize the line, “collect” each (word, 1).
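The tokenizing step of the mapper can be tried without a Hadoop cluster. The following is a minimal sketch, not the slides' code: `MapSketch` and `tokenize` are hypothetical names, the Hadoop types (`Text`, `OutputCollector`) are replaced by plain Java collections, and it assumes the mapper's split pattern is meant to be the whitespace regex `\s+`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not from the slides): the mapper's tokenizing logic
// with the Hadoop types removed so it can run on its own.
class MapSketch {
    // Returns the words for which the mapper would emit (word, 1).
    static List<String> tokenize(String documentContents) {
        List<String> words = new ArrayList<>();
        // "\\s+" is the Java string literal for the regex \s+ (one or more whitespace characters).
        for (String wordString : documentContents.split("\\s+")) {
            // Leading whitespace produces one empty token, which the length check drops.
            if (wordString.length() > 0) {
                words.add(wordString.toLowerCase());
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("  Hadoop uses  HDFS")); // prints [hadoop, uses, hdfs]
    }
}
```

The length check matters because `String.split` keeps a leading empty string when the input starts with a delimiter.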
  • 78. Reduce Code public class SimpleWordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { @Override public void reduce(Text key, Iterator<IntWritable> counts, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int count = 0; while (counts.hasNext()) { count += counts.next().get(); } output.collect(key, new IntWritable(count)); } }
  • 79. Reduce Code public class SimpleWordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { @Override public void reduce(Text key, Iterator<IntWritable> counts, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int count = 0; while (counts.hasNext()) { count += counts.next().get(); } output.collect(key, new IntWritable(count)); } } Let’s drill into this code...
  • 81. Reduce Code public class SimpleWordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { @Override public void reduce(Text key, Iterator<IntWritable> counts, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int count = 0; while (counts.hasNext()) { count += counts.next().get(); } output.collect(key, new IntWritable(count)); } } Callout: Reducer class with 4 type parameters for the input key-value types and output types.
  • 82. Reduce Code public class SimpleWordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { @Override public void reduce(Text key, Iterator<IntWritable> counts, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int count = 0; while (counts.hasNext()) { count += counts.next().get(); } output.collect(key, new IntWritable(count)); } } Callout: Reduce method with input, output “collector”, and reporting object.
  • 83. Reduce Code public class SimpleWordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { @Override public void reduce(Text key, Iterator<IntWritable> counts, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int count = 0; while (counts.hasNext()) { count += counts.next().get(); } output.collect(key, new IntWritable(count)); } } Callout: Count the counts per word and emit (word, N).
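To see what the reducer receives, it may help to simulate the shuffle in plain Java. This is a sketch under stated assumptions rather than Hadoop code: `ReduceSketch` and `shuffleAndReduce` are hypothetical names, `Integer` stands in for `IntWritable`, and a `TreeMap` stands in for the framework's grouping and sorting of keys between the map and reduce phases.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in (not from the slides) for the shuffle and reduce phases.
class ReduceSketch {
    // Same loop as the reducer body: sum every count seen for one key.
    static int reduce(Iterator<Integer> counts) {
        int count = 0;
        while (counts.hasNext()) {
            count += counts.next();
        }
        return count;
    }

    // Groups the mappers' (word, 1) pairs by word, then reduces each group.
    static Map<String, Integer> shuffleAndReduce(List<String> mappedWords) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String word : mappedWords) {
            // Each occurrence of a word contributes one (word, 1) pair.
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            result.put(entry.getKey(), reduce(entry.getValue().iterator()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(shuffleAndReduce(List.of("the", "cat", "the"))); // prints {cat=1, the=2}
    }
}
```

In a real job the grouping happens across machines, so `reduce` is called once per key with an iterator over all values emitted for that key.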
  • 84. Other Options • HDFS - Hadoop Distributed File System. • Map/Reduce - A distributed framework for executing work in parallel. • Hive - A SQL-like syntax with a metastore to allow SQL manipulation of data stored on HDFS. • Pig - A top-down scripting language to manipulate data. • HBase - A NoSQL, non-sequential data store.
  • 87. Conclusions
  • 88. Hadoop Benefits • A cost-effective, scalable way to: - Store massive data sets. - Perform arbitrary analyses on those data sets.
  • 89. Hadoop Tools • Offers a variety of tools for: - Application development. - Integration with other platforms (e.g., databases).
  • 90. Hadoop Distributions • A rich, open-source ecosystem. - Free to use. - Commercially-supported distributions.
  • 91. Thank You! - Feel free to contact me at ‣ ryan.tabora@thinkbiganalytics.com - Or our solutions consultant ‣ matt.mcdevitt@thinkbiganalytics.com - As always, THINK BIG!
  • 92. Bonus Content
  • 93. The Hadoop Ecosystem • HDFS - Hadoop Distributed File System. • Map/Reduce - A distributed framework for executing work in parallel. • Hive - A SQL-like syntax with a metastore to allow SQL manipulation of data stored on HDFS. • Pig - A top-down scripting language to manipulate data. • HBase - A NoSQL, non-sequential data store.
  • 95. Hive: SQL for Hadoop
  • 97. Hive Let’s look at WordCount written in Hive, the SQL for Hadoop.
  • 98. CREATE TABLE docs (line STRING); LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs; CREATE TABLE word_counts AS SELECT word, count(1) AS count FROM (SELECT explode(split(line, '\\s')) AS word FROM docs) w GROUP BY word ORDER BY word;
  • 99. CREATE TABLE docs (line STRING); LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs; CREATE TABLE word_counts AS SELECT word, count(1) AS count FROM (SELECT explode(split(line, '\\s')) AS word FROM docs) w GROUP BY word ORDER BY word; Let’s drill into this code...
  • 101. CREATE TABLE docs (line STRING); LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs; CREATE TABLE word_counts AS SELECT word, count(1) AS count FROM (SELECT explode(split(line, '\\s')) AS word FROM docs) w GROUP BY word ORDER BY word; Callout: Create a table to hold the raw text we’re counting. Each line is a “column”.
  • 102. CREATE TABLE docs (line STRING); LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs; CREATE TABLE word_counts AS SELECT word, count(1) AS count FROM (SELECT explode(split(line, '\\s')) AS word FROM docs) w GROUP BY word ORDER BY word; Callout: Load the text in the “docs” directory into the table.
  • 103. CREATE TABLE docs (line STRING); LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs; CREATE TABLE word_counts AS SELECT word, count(1) AS count FROM (SELECT explode(split(line, '\\s')) AS word FROM docs) w GROUP BY word ORDER BY word; Callout: Create the final table and fill it with the results from a nested query of the docs table that performs WordCount on the fly.
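For readers more at home in Java than SQL, the nested query can be read as a three-stage pipeline. The sketch below is an analogy, not what Hive executes: `HiveQuerySketch` and `wordCounts` are hypothetical names, the `docs` table is modeled as a list of lines, and the split pattern is assumed to be meant as whitespace.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical Java analogy (not from the slides) for the Hive WordCount query.
class HiveQuerySketch {
    static Map<String, Long> wordCounts(List<String> docs) {
        return docs.stream()
                // explode(split(line, ...)): turn each line into one row per token.
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                // GROUP BY word with count(1); the TreeMap plays the role of ORDER BY word.
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(wordCounts(List.of("a b", "b c"))); // prints {a=1, b=2, c=1}
    }
}
```

Like the query, this sketch does not lowercase words or drop empty tokens, so its output can differ from the MapReduce version on the same input.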
  • 105. Hive Because so many Hadoop users come from SQL backgrounds, Hive is one of the most essential tools in the ecosystem!!
  • 106. The Hadoop Ecosystem • HDFS - Hadoop Distributed File System. • Map/Reduce - A distributed framework for executing work in parallel. • Hive - A SQL-like syntax with a metastore to allow SQL manipulation of data stored on HDFS. • Pig - A top-down scripting language to manipulate data. • HBase - A NoSQL, non-sequential data store.
  • 108. Pig: Data Flow for Hadoop
  • 110. Pig Let’s look at WordCount written in Pig, the Data Flow language for Hadoop.
  • 111. inpt = LOAD 'docs' using TextLoader AS (line:chararray); words = FOREACH inpt GENERATE flatten(TOKENIZE(line)) AS word; grpd = GROUP words BY word; cntd = FOREACH grpd GENERATE group, COUNT(words); STORE cntd INTO 'output';
  • 112. inpt = LOAD 'docs' using TextLoader AS (line:chararray); words = FOREACH inpt GENERATE flatten(TOKENIZE(line)) AS word; grpd = GROUP words BY word; cntd = FOREACH grpd GENERATE group, COUNT(words); STORE cntd INTO 'output'; Let’s drill into this code...
  • 114. inpt = LOAD 'docs' using TextLoader AS (line:chararray); words = FOREACH inpt GENERATE flatten(TOKENIZE(line)) AS word; grpd = GROUP words BY word; cntd = FOREACH grpd GENERATE group, COUNT(words); STORE cntd INTO 'output'; Callout: Like the Hive example, load “docs” content, each line is a “field”.
  • 115. inpt = LOAD 'docs' using TextLoader AS (line:chararray); words = FOREACH inpt GENERATE flatten(TOKENIZE(line)) AS word; grpd = GROUP words BY word; cntd = FOREACH grpd GENERATE group, COUNT(words); STORE cntd INTO 'output'; Callout: Tokenize into words (a bag) and “flatten” into separate records.
  • 116. inpt = LOAD 'docs' using TextLoader AS (line:chararray); words = FOREACH inpt GENERATE flatten(TOKENIZE(line)) AS word; grpd = GROUP words BY word; cntd = FOREACH grpd GENERATE group, COUNT(words); STORE cntd INTO 'output'; Callout: Collect the same words together.
  • 117. inpt = LOAD 'docs' using TextLoader AS (line:chararray); words = FOREACH inpt GENERATE flatten(TOKENIZE(line)) AS word; grpd = GROUP words BY word; cntd = FOREACH grpd GENERATE group, COUNT(words); STORE cntd INTO 'output'; Callout: Count each word.
  • 118. inpt = LOAD 'docs' using TextLoader AS (line:chararray); words = FOREACH inpt GENERATE flatten(TOKENIZE(line)) AS word; grpd = GROUP words BY word; cntd = FOREACH grpd GENERATE group, COUNT(words); STORE cntd INTO 'output'; Callout: Save the results. Profit!
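Each Pig statement names an intermediate relation, which maps naturally onto one Java variable per step. The following is a rough sketch, not Pig's execution model: `PigFlowSketch` and `run` are hypothetical names, splitting on whitespace only is a simplification of `TOKENIZE` (which also breaks on some punctuation), and the `TreeMap` is used only to make the printed order deterministic, since Pig does not sort the output of this script.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical Java analogy (not from the slides) for the Pig WordCount data flow.
class PigFlowSketch {
    static Map<String, Long> run(List<String> inpt) {
        // words = FOREACH inpt GENERATE flatten(TOKENIZE(line)) AS word;
        List<String> words = new ArrayList<>();
        for (String line : inpt) {
            for (String token : line.split("\\s+")) {
                if (!token.isEmpty()) {
                    words.add(token);
                }
            }
        }
        // grpd = GROUP words BY word;  each group key holds a bag of the matching records.
        Map<String, List<String>> grpd = new TreeMap<>();
        for (String word : words) {
            grpd.computeIfAbsent(word, k -> new ArrayList<>()).add(word);
        }
        // cntd = FOREACH grpd GENERATE group, COUNT(words);
        Map<String, Long> cntd = new TreeMap<>();
        grpd.forEach((group, bag) -> cntd.put(group, (long) bag.size()));
        return cntd;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("to be or", "not to be"))); // prints {be=2, not=1, or=1, to=2}
    }
}
```

The `grpd` map shows why `COUNT(words)` works in the script: after `GROUP`, each record carries the group key plus a bag named after the grouped relation.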
  • 120. Pig Pig and Hive overlap, but Pig is popular for ETL, e.g., data transformation, cleansing, ingestion, etc.
  • 121. Questions?