Introduction to Hadoop
Zak Stone <zak@eecs.harvard.edu>
PhD candidate, Harvard School of Engineering and Applied Sciences
Advisor: Todd Zickler (Computer Vision)
Hadoop distributes data and computation across a
large number of computers.
Outline

  1. Why should you care about Hadoop?
  2. What exactly is Hadoop?
  3. An overview of Hadoop Map-Reduce
  4. The Hadoop Distributed File System (HDFS)
  5. Hadoop advantages and disadvantages
  6. Getting started with Hadoop
  7. Useful resources
Why should you care? - Lots of Data

   LOTS OF DATA EVERYWHERE
Why should you care? - Even Grocery Stores Care




                      ...
Why Hadoop for big data?

• Most credible open-source toolset for large-scale, general-purpose computing


  • Backed by                 ,


  • Used by                   ,              , many others


  • Increasing support from                          web services


  • Hadoop closely imitates infrastructure developed by Google


  • Hadoop processes petabytes daily, right now
DISCLAIMER
   • Don’t use Hadoop if your data and computation fit on one machine


   • Getting easier to use, but still complicated




http://www.wired.com/gadgetlab/2008/07/patent-crazines/
What exactly is Hadoop?

• Actually a growing collection of subprojects; focus on two right now: Hadoop Map-Reduce and HDFS
An overview of Hadoop Map-Reduce

   Traditional computing: one computer
   Hadoop: many computers

   (Actually more like this: many computers, little communication,
    stragglers and failures)
Map-Reduce: Three phases



              1. Map

              2. Sort

              3. Reduce
Map-Reduce: Map phase


   Only specify operations on key-value pairs!
    INPUT PAIR                    OUTPUT PAIRS
  (key, value)                  (key, value)
                                (key, value)
                                (key, value)
                                (zero or more output pairs)


       (each “elephant” works on an input pair;
         doesn’t know other elephants exist)
Map-Reduce: Map phase, word-count example



   (line1, “Hello there.”)   (“hello”, 1)

                             (“there”, 1)




   (line2, “Why, hello.”)     (“why”, 1)

                              (“hello”, 1)
Map-Reduce: Sort phase

          (key1, value289)
           (key1, value43)
           (key1, value3)
                 ...
          (key2, value512)
           (key2, value11)
           (key2, value67)
                   ...
Map-Reduce: Sort phase, word-count example

                              (“hello”, 1)
                              (“hello”, 1)




                              (“there”, 1)




                               (“why”, 1)
Map-Reduce: Reduce phase




(key1, value289)
(key1, value43)            (key1, output1)
 (key1, value3)

                   ...
Map-Reduce: Reduce phase, word-count example


   (“hello”, 1)
                               (“hello”, 2)
   (“hello”, 1)




   (“there”, 1)                (“there”, 1)




    (“why”, 1)                  (“why”, 1)
Map-Reduce: Code for word-count


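     # Map: emit (word, 1) for every whitespace-separated word in the value
     def mapper(key, value):
       for word in value.split():
         yield word, 1

     # Reduce: all counts for one word arrive together; emit their sum
     def reducer(key, values):
       yield key, sum(values)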
Seems like too much work
   for a word-count!
Map-Reduce: Imagine word-count on the Web
Map-Reduce: The main advantage

With Hadoop, this very same code could run on
      the entire Web! (In theory, at least)
     def mapper(key,value):
       for word in value.split():
         yield word,1

     def reducer(key,values):
       yield key,sum(values)
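
The three phases can be tried on one machine before involving a cluster. The following is a minimal sketch, not how Hadoop itself schedules work: `run_local` is a hypothetical helper that plays the role of the framework, and the mapper here lowercases words and strips punctuation (an assumption, added so the output matches the word-count slides).

```python
import re
from itertools import groupby
from operator import itemgetter

def mapper(key, value):
    # Assumption: lowercase and drop punctuation so "Hello" and "hello." match
    for word in re.findall(r"[a-z']+", value.lower()):
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

def run_local(records):
    # Hypothetical stand-in for the framework, running everything in one process
    # Map phase: each input pair is processed independently
    mapped = [pair for key, value in records for pair in mapper(key, value)]
    # Sort phase: bring all pairs with the same key next to each other
    mapped.sort(key=itemgetter(0))
    # Reduce phase: one reducer call per distinct key
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reducer(key, (value for _, value in group)))
    return output

records = [("line1", "Hello there."), ("line2", "Why, hello.")]
result = run_local(records)
print(result)  # [('hello', 2), ('there', 1), ('why', 1)]
```

On a cluster, the map and reduce calls would run on different machines and the sort would move data between them; the mapper and reducer themselves stay the same.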
HDFS: Hadoop Distributed File System

   Data is split into chunks stored on many computers
   (each chunk replicated more than once for reliability)
HDFS: Hadoop Distributed File System
   (each computer holds its own chunk of key-value pairs:
    (key1, value1), (key2, value2), ...)

         Computation is local to the data
Key-value pairs processed independently in parallel
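
The idea of chunking and replication can be illustrated with a toy model. This is only a sketch under stated assumptions: the function names, the node names, the round-robin placement, and the replication factor of 3 are all chosen for illustration, and real HDFS uses a more careful placement policy.

```python
def split_into_chunks(data, chunk_size):
    # Cut the data into fixed-size chunks; the last one may be shorter
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def place_replicas(num_chunks, nodes, replication=3):
    # Toy round-robin placement: chunk i is stored on `replication`
    # consecutive nodes, so every copy lands on a different node
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        i: [nodes[(i + r) % len(nodes)] for r in range(replication)]
        for i in range(num_chunks)
    }

def surviving_chunks(placement, failed_nodes):
    # A chunk survives if at least one node holding a copy is still up
    return {
        chunk
        for chunk, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    }

data = b"x" * 1000
chunks = split_into_chunks(data, 256)
nodes = ["node0", "node1", "node2", "node3", "node4"]  # hypothetical names
placement = place_replicas(len(chunks), nodes, replication=3)

# With three copies of each chunk, losing any two nodes loses no data
print(surviving_chunks(placement, {"node0", "node1"}))
```

Because each node already holds some chunks, a map task for a chunk can be scheduled on one of the nodes that stores it, which is what "computation is local to the data" refers to.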
HDFS: Inspired by the Google File System
Hadoop Map-Reduce and HDFS: Advantages
• Distribute data and computation

   • Computation local to data avoids network overload

• Tasks are independent

   • Easy to handle partial failures - entire nodes can fail and restart

   • Avoid crawling horrors of failure-tolerant synchronous distributed systems

   • Speculative execution to work around stragglers

• Linear scaling in the ideal case

   • Designed for cheap, commodity hardware

• Simple programming model

   • The “end-user” programmer only writes map-reduce tasks
Hadoop Map-Reduce and HDFS: Disadvantages
• Still rough - software under active development

   • e.g. HDFS only recently added support for append operations

• Programming model is very restrictive

   • Lack of central data can be frustrating

• “Joins” of multiple datasets are tricky and slow

   • No indices! Often, entire dataset gets copied in the process

• Cluster management is hard (debugging, distributing software, collecting logs...)

• Still single master, which requires care and may limit scaling

• Managing job flow isn’t trivial when intermediate data should be kept

• Optimal configuration of nodes not obvious (# mappers, # reducers, mem. limits)
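
To see why joins are awkward in this model, here is a sketch of one common approach, a reduce-side join, simulated locally. The datasets, function names, and tags are hypothetical. Each mapper tags its records with the dataset they came from, and the reducer pairs them up per key; with no index available, every record of both datasets passes through the sort.

```python
from itertools import groupby
from operator import itemgetter

def map_users(user_id, name):
    # Tag each record so the reducer can tell the datasets apart
    yield user_id, ("user", name)

def map_orders(user_id, item):
    yield user_id, ("order", item)

def join_reducer(key, tagged_values):
    # Separate the two sides, then emit every (name, item) combination
    names, items = [], []
    for tag, payload in tagged_values:
        (names if tag == "user" else items).append(payload)
    for name in names:
        for item in items:
            yield key, (name, item)

def run_join(users, orders):
    # Local stand-in for the framework: map both inputs, sort, reduce
    mapped = []
    for key, value in users:
        mapped.extend(map_users(key, value))
    for key, value in orders:
        mapped.extend(map_orders(key, value))
    mapped.sort(key=itemgetter(0))  # both full datasets go through the sort
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        output.extend(join_reducer(key, [value for _, value in group]))
    return output

users = [(1, "ada"), (2, "grace")]
orders = [(2, "keyboard"), (1, "laptop"), (1, "mouse"), (3, "cable")]
joined = run_join(users, orders)
print(joined)
```

The order for user 3 has no matching user record, so it produces no output; this sketch behaves like an inner join.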
Getting started: Installation options

• Cloudera virtual machine

• Your own virtual machine (install Ubuntu in VirtualBox, which is free)

• Elastic MapReduce on EC2

• StarCluster with Hadoop on EC2

• Cloudera’s distribution of Hadoop on EC2

• Install Cloudera’s distribution of Hadoop on your own machine

   • Available for RPM and Debian deployments

• Or download Hadoop directly from http://hadoop.apache.org/
Getting started: Language choices

• Hadoop is written in Java

• However, Hadoop Streaming allows mappers and reducers in any language!

• Binary data is a little tricky with Hadoop Streaming

   • Could use base64 encoding, but TypedBytes are much better

• For Python, try Dumbo: http://wiki.github.com/klbostee/dumbo

   • The Python word-count example and others come with Dumbo

   • Dumbo makes binary data with TypedBytes easy

• Also consider Hadoopy: https://github.com/bwhite/hadoopy
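
With Hadoop Streaming, a mapper or reducer is just a program that reads lines on standard input and writes tab-separated key-value lines on standard output. Below is a minimal word-count sketch in that style, without Dumbo or Hadoopy. The function names and the single-script `map`/`reduce` argument are assumptions made for illustration; the logic is kept in plain functions so it can be tested without Hadoop.

```python
import sys
from itertools import groupby

def map_lines(lines):
    # Emit one "word<TAB>1" line per whitespace-separated word
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reduce_lines(sorted_lines):
    # Input must already be sorted by key, so equal keys are adjacent
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"

# Hypothetical command-line convention: "map" or "reduce" as the only argument
if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in ("map", "reduce"):
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out_line in step(sys.stdin):
        print(out_line)
```

Assuming the script is saved as `wordcount.py` (a hypothetical name), a local dry run on a tiny file could look like `cat tiny.txt | python wordcount.py map | sort | python wordcount.py reduce`, with `sort` standing in for the sort phase.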
Useful resources and tips

• The Hadoop homepage: http://hadoop.apache.org/

• Cloudera: http://cloudera.com/

• Dumbo: http://wiki.github.com/klbostee/dumbo

• Hadoopy: https://github.com/bwhite/hadoopy

• Amazon Elastic Compute Cloud Getting Started Guide:
  http://docs.amazonwebservices.com/AWSEC2/latest/GettingStartedGuide/


• Always test locally on a tiny dataset before running on a cluster!
Thanks for your attention!

[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
