
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

http://cs264.org

Published in: Education, Technology

  1. Introduction to Hadoop. Zak Stone <zak@eecs.harvard.edu>, PhD candidate, Harvard School of Engineering and Applied Sciences. Advisor: Todd Zickler (Computer Vision)
  2. Hadoop distributes data and computation across a large number of computers.
  3. Outline
      1. Why should you care about Hadoop?
      2. What exactly is Hadoop?
      3. An overview of Hadoop Map-Reduce
      4. The Hadoop Distributed File System (HDFS)
      5. Hadoop advantages and disadvantages
      6. Getting started with Hadoop
      7. Useful resources
  5. Why should you care? - Lots of Data: LOTS OF DATA EVERYWHERE
  6. Why should you care? - Lots of Data: LOTS!
  7. Why should you care? - Lots of Data
  8. Why should you care? - Even Grocery Stores Care ...
  9. Why Hadoop for big data?
      - Most credible open-source toolset for large-scale, general-purpose computing
      - Backed by [logos not preserved in transcript]
      - Used by [logos not preserved in transcript], many others
      - Increasing support from web services
      - Hadoop closely imitates infrastructure developed by [logo not preserved in transcript]
      - Hadoop processes petabytes daily, right now
  10. Why Hadoop for big data?
  11. DISCLAIMER
      - Don’t use Hadoop if your data and computation fit on one machine
      - Getting easier to use, but still complicated

      (http://www.wired.com/gadgetlab/2008/07/patent-crazines/)
  13. What exactly is Hadoop?
      - Actually a growing collection of subprojects
  14. What exactly is Hadoop?
      - Actually a growing collection of subprojects; focus on two right now
  16. An overview of Hadoop Map-Reduce: traditional computing (one computer) vs. Hadoop (many computers)
  17. An overview of Hadoop Map-Reduce: (Actually more like this) (many computers, little communication, stragglers and failures)
  18. Map-Reduce: Three phases
      1. Map
      2. Sort
      3. Reduce
  19. Map-Reduce: Map phase. Only specify operations on key-value pairs!
      - Input pair: (key, value)
      - Output pairs: (key, value), (key, value), (key, value) (zero or more output pairs)
      - Each “elephant” works on an input pair; it doesn’t know other elephants exist
  20. Map-Reduce: Map phase, word-count example
      - (line1, “Hello there.”) → (“hello”, 1), (“there”, 1)
      - (line2, “Why, hello.”) → (“why”, 1), (“hello”, 1)
  21. Map-Reduce: Sort phase
      - (key1, value289), (key1, value43), (key1, value3), ...
      - (key2, value512), (key2, value11), (key2, value67), ...
  22. Map-Reduce: Sort phase, word-count example
      - (“hello”, 1), (“hello”, 1)
      - (“there”, 1)
      - (“why”, 1)
  23. Map-Reduce: Reduce phase
      - (key1, value289), (key1, value43), (key1, value3), ... → (key1, output1)
  24. Map-Reduce: Reduce phase, word-count example
      - (“hello”, 1), (“hello”, 1) → (“hello”, 2)
      - (“there”, 1) → (“there”, 1)
      - (“why”, 1) → (“why”, 1)
  25. Map-Reduce: Code for word-count

          def mapper(key, value):
              # Emit (word, 1) for every word in the input line.
              for word in value.split():
                  yield word, 1

          def reducer(key, values):
              # Sum the counts collected for one word.
              yield key, sum(values)
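The three phases can be tried on one machine before involving a cluster. The following is a minimal sketch, not how Hadoop itself schedules work: `run_local` is a hypothetical helper that runs map, sort, and reduce in a single process, and the mapper here additionally lowercases words and strips punctuation (an assumption, added so the output matches the pairs shown in the word-count example slides).

```python
from itertools import groupby
from operator import itemgetter
import string


def mapper(key, value):
    # Assumption: normalize words so "Hello" and "hello." are counted together.
    for word in value.split():
        word = word.strip(string.punctuation).lower()
        if word:
            yield word, 1


def reducer(key, values):
    # Sum the counts collected for one word.
    yield key, sum(values)


def run_local(records):
    """Hypothetical single-process stand-in for the map, sort, and reduce phases."""
    # Map phase: each input pair is processed independently.
    mapped = [pair for key, value in records for pair in mapper(key, value)]
    # Sort phase: bring all pairs that share a key next to each other.
    mapped.sort(key=itemgetter(0))
    # Reduce phase: one reducer call per distinct key.
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        results.extend(reducer(key, (count for _, count in group)))
    return results


counts = run_local([("line1", "Hello there."), ("line2", "Why, hello.")])
print(counts)
```

Because the mapper and reducer only see key-value pairs, the same two functions could in principle be handed to a cluster unchanged; only the driver differs.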
  26. Seems like too much work for a word-count!
  27. Map-Reduce: Imagine word-count on the Web
  28. Map-Reduce: The main advantage. With Hadoop, this very same code could run on the entire Web! (In theory, at least)

          def mapper(key, value):
              for word in value.split():
                  yield word, 1

          def reducer(key, values):
              yield key, sum(values)
  30. HDFS: Hadoop Distributed File System
      - Data is stored as chunks on computers
      - Each chunk is replicated more than once for reliability
  31. HDFS: Hadoop Distributed File System
      - Each computer holds its own key-value pairs: (key1, value1), (key2, value2), ...
      - Computation is local to the data
      - Key-value pairs are processed independently in parallel
  32. HDFS: Inspired by the Google File System
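To make the chunk-and-replicate idea concrete, here is a toy sketch. It is not HDFS's actual placement policy or API: the function names, the node names, the round-robin placement, and the `replication=3` parameter are all assumptions chosen for illustration. It splits data into fixed-size chunks, assigns each chunk to several distinct nodes, and then picks a worker that already holds a copy of the chunk, which is the sense in which computation is local to the data.

```python
def split_into_chunks(data, chunk_size):
    """Cut a byte string into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_replicas(num_chunks, nodes, replication=3):
    """Assign each chunk to `replication` distinct nodes, round-robin (toy policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for chunk_id in range(num_chunks):
        placement[chunk_id] = [
            nodes[(chunk_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


def pick_local_worker(chunk_id, placement, idle_nodes):
    """Prefer an idle node that already stores the chunk; None if there is none."""
    for node in placement[chunk_id]:
        if node in idle_nodes:
            return node
    return None


nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster
chunks = split_into_chunks(b"x" * 10, chunk_size=4)
placement = place_replicas(len(chunks), nodes, replication=3)
worker = pick_local_worker(1, placement, idle_nodes={"node-a", "node-d"})
print(placement, worker)
```

If any single node fails, every chunk in this sketch still has at least two surviving copies, which is the reliability argument the slide makes.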
  34. Hadoop Map-Reduce and HDFS: Advantages
      - Distribute data and computation
          - Computation local to data avoids network overload
      - Tasks are independent
          - Easy to handle partial failures - entire nodes can fail and restart
          - Avoid crawling horrors of failure-tolerant synchronous distributed systems
          - Speculative execution to work around stragglers
      - Linear scaling in the ideal case
          - Designed for cheap, commodity hardware
      - Simple programming model
          - The “end-user” programmer only writes map-reduce tasks
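Speculative execution relies on tasks being independent: a slow task can simply be started a second time elsewhere, and whichever copy finishes first wins. The sketch below only illustrates the "spot the straggler" step. The heuristic (flag tasks whose progress is below half the median) and the names `find_stragglers` and `slowdown` are invented for this example and are not Hadoop's actual rule.

```python
from statistics import median


def find_stragglers(progress, slowdown=0.5):
    """Return task ids whose progress is below `slowdown` times the median.

    `progress` maps a task id to the fraction of work completed (0.0 to 1.0).
    """
    typical = median(progress.values())
    return sorted(task for task, done in progress.items() if done < typical * slowdown)


# Hypothetical snapshot of four map tasks partway through a job.
progress = {"map-0": 0.90, "map-1": 0.85, "map-2": 0.20, "map-3": 0.95}
stragglers = find_stragglers(progress)
print(stragglers)  # candidates for a duplicate (speculative) attempt
```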
  35. Hadoop Map-Reduce and HDFS: Disadvantages
      - Still rough - software under active development
          - e.g. HDFS only recently added support for append operations
      - Programming model is very restrictive
          - Lack of central data can be frustrating
      - “Joins” of multiple datasets are tricky and slow
          - No indices! Often, entire dataset gets copied in the process
      - Cluster management is hard (debugging, distributing software, collecting logs...)
      - Still single master, which requires care and may limit scaling
      - Managing job flow isn’t trivial when intermediate data should be kept
      - Optimal configuration of nodes not obvious (# mappers, # reducers, mem. limits)
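One common way to express a join in the map-reduce model is to tag every record with the dataset it came from and let the reducer pair them up per key. The sketch below simulates that in a single process; the dataset names (`users`, `orders`), the sample records, and the `run_join` driver are hypothetical. It also shows why such joins are slow: with no index to consult, every record of both datasets passes through the sort phase.

```python
from collections import defaultdict


def join_mapper(source, record):
    # Tag each record with its source so the reducer can tell the datasets apart.
    key, payload = record
    yield key, (source, payload)


def join_reducer(key, tagged_values):
    # Separate the two sides, then emit every left/right combination for this key.
    left, right = [], []
    for source, payload in tagged_values:
        (left if source == "users" else right).append(payload)
    for user in left:
        for order in right:
            yield key, (user, order)


def run_join(users, orders):
    """Hypothetical local driver: map both datasets, group by key, reduce."""
    groups = defaultdict(list)
    for source, dataset in (("users", users), ("orders", orders)):
        for record in dataset:
            for key, tagged in join_mapper(source, record):
                groups[key].append(tagged)
    joined = []
    for key in sorted(groups):
        joined.extend(join_reducer(key, groups[key]))
    return joined


users = [(1, "ada"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "ink")]
joined = run_join(users, orders)
print(joined)
```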
  37. Getting started: Installation options
      - Cloudera virtual machine
      - Your own virtual machine (install Ubuntu in VirtualBox, which is free)
      - Elastic MapReduce on EC2
      - StarCluster with Hadoop on EC2
      - Cloudera’s distribution of Hadoop on EC2
      - Install Cloudera’s distribution of Hadoop on your own machine
          - Available for RPM and Debian deployments
      - Or download Hadoop directly from http://hadoop.apache.org/
  38. Getting started: Language choices
      - Hadoop is written in Java
      - However, Hadoop Streaming allows mappers and reducers in any language!
      - Binary data is a little tricky with Hadoop Streaming
          - Could use base64 encoding, but TypedBytes are much better
      - For Python, try Dumbo: http://wiki.github.com/klbostee/dumbo
          - The Python word-count example and others come with Dumbo
          - Dumbo makes binary data with TypedBytes easy
      - Also consider Hadoopy: https://github.com/bwhite/hadoopy
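Hadoop Streaming talks to mappers and reducers through standard input and output, one record per line, with key and value separated by a tab by default. The sketch below shows what a word-count mapper might look like in that style, plus the base64 workaround for binary values mentioned above. The function names are hypothetical, and this is plain Python rather than the Dumbo or Hadoopy APIs.

```python
import base64
import sys


def streaming_mapper(lines):
    """Turn raw text lines into 'word<TAB>1' output lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def encode_binary_record(key, blob):
    # base64 output contains no tabs or newlines, so it is safe in a text line.
    return f"{key}\t{base64.b64encode(blob).decode('ascii')}"


def decode_binary_record(line):
    key, encoded = line.rstrip("\n").split("\t", 1)
    return key, base64.b64decode(encoded)


def main():
    # In a real streaming script this would be the entry point.
    for output_line in streaming_mapper(sys.stdin):
        print(output_line)


record = encode_binary_record("img1", b"\x00\xff\n\t")
key, blob = decode_binary_record(record)
```

The base64 detour inflates the data by roughly a third, which is one reason the slide prefers TypedBytes for binary data.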
  40. Useful resources and tips
      - The Hadoop homepage: http://hadoop.apache.org/
      - Cloudera: http://cloudera.com/
      - Dumbo: http://wiki.github.com/klbostee/dumbo
      - Hadoopy: https://github.com/bwhite/hadoopy
      - Amazon Elastic Compute Cloud Getting Started Guide: http://docs.amazonwebservices.com/AWSEC2/latest/GettingStartedGuide/
      - Always test locally on a tiny dataset before running on a cluster!
  41. ...
  42. Thanks for your attention!
