[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
 


http://cs264.org



Usage Rights

© All Rights Reserved

    [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard): Presentation Transcript

    • Introduction to Hadoop
      • Zak Stone <zak@eecs.harvard.edu>
      • PhD candidate, Harvard School of Engineering and Applied Sciences
      • Advisor: Todd Zickler (Computer Vision)
    • Hadoop distributes data and computation across a large number of computers.
    • Outline 1. Why should you care about Hadoop? 2. What exactly is Hadoop? 3. An overview of Hadoop Map-Reduce 4. The Hadoop Distributed File System (HDFS) 5. Hadoop advantages and disadvantages 6. Getting started with Hadoop 7. Useful resources
    • Why should you care? - Lots of Data: LOTS OF DATA EVERYWHERE. LOTS!
    • Why should you care? - Even Grocery Stores Care ...
    • Why Hadoop for big data?
      • Most credible open-source toolset for large-scale, general-purpose computing
        • Backed by [logos not preserved in transcript]
        • Used by [logos not preserved in transcript], many others
        • Increasing support from web services
      • Hadoop closely imitates infrastructure developed by [logo not preserved in transcript]
      • Hadoop processes petabytes daily, right now
    • DISCLAIMER
      • Don’t use Hadoop if your data and computation fit on one machine
      • Getting easier to use, but still complicated
      (http://www.wired.com/gadgetlab/2008/07/patent-crazines/)
    • What exactly is Hadoop?
      • Actually a growing collection of subprojects; focus on two right now
    • An overview of Hadoop Map-Reduce: traditional computing (one computer) vs. Hadoop (many computers)
    • An overview of Hadoop Map-Reduce: actually more like this (many computers, little communication, stragglers and failures)
    • Map-Reduce: Three phases 1. Map 2. Sort 3. Reduce
    • Map-Reduce: Map phase. Only specify operations on key-value pairs!
      • Input pair: (key, value)
      • Output pairs: (key, value), (key, value), (key, value) (zero or more output pairs)
      • Each “elephant” works on an input pair and doesn’t know other elephants exist
    • Map-Reduce: Map phase, word-count example
      • (line1, “Hello there.”) -> (“hello”, 1), (“there”, 1)
      • (line2, “Why, hello.”) -> (“why”, 1), (“hello”, 1)
    • Map-Reduce: Sort phase (key1, value289) (key1, value43) (key1, value3) ... (key2, value512) (key2, value11) (key2, value67) ...
    • Map-Reduce: Sort phase, word-count example (“hello”, 1) (“hello”, 1) (“there”, 1) (“why”, 1)
    • Map-Reduce: Reduce phase
      • (key1, value289), (key1, value43), (key1, value3), ... -> (key1, output1)
    • Map-Reduce: Reduce phase, word-count example
      • (“hello”, 1), (“hello”, 1) -> (“hello”, 2)
      • (“there”, 1) -> (“there”, 1)
      • (“why”, 1) -> (“why”, 1)
    • Map-Reduce: Code for word-count
          # Map: emit (word, 1) for every word in the input value
          def mapper(key, value):
              for word in value.split():
                  yield word, 1

          # Reduce: add up all the counts collected for one word
          def reducer(key, values):
              yield key, sum(values)
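To try the three phases from the slides above on one machine, the sketch below runs the word-count mapper and reducer through a tiny in-process driver. The `run_local` helper is hypothetical (it is not part of Hadoop), and the lowercasing and punctuation stripping in `mapper` are assumptions added so that the output matches the slide example.

```python
import string
from itertools import groupby
from operator import itemgetter


def mapper(key, value):
    # Emit (word, 1) per word. Lowercasing and punctuation stripping are
    # assumptions added here so the output matches the slide example.
    for word in value.split():
        word = word.strip(string.punctuation).lower()
        if word:
            yield word, 1


def reducer(key, values):
    # Sum the counts collected for a single word.
    yield key, sum(values)


def run_local(records, mapper, reducer):
    """Hypothetical single-process driver: map, then sort, then reduce."""
    # Map phase: each input pair is processed independently.
    mapped = [pair for key, value in records for pair in mapper(key, value)]
    # Sort phase: bring all pairs with the same key next to each other.
    mapped.sort(key=itemgetter(0))
    # Reduce phase: call the reducer once per distinct key.
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reducer(key, (value for _, value in group)))
    return output


records = [("line1", "Hello there."), ("line2", "Why, hello.")]
result = run_local(records, mapper, reducer)
print(result)
```

On a cluster, Hadoop would take the place of `run_local`, spreading the map and reduce calls over many machines while the mapper and reducer stay the same.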
    • Seems like too much work for a word-count!
    • Map-Reduce: Imagine word-count on the Web
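    • Map-Reduce: The main advantage. With Hadoop, this very same code could run on the entire Web! (In theory, at least)
          # Map: emit (word, 1) for every word in the input value
          def mapper(key, value):
              for word in value.split():
                  yield word, 1

          # Reduce: add up all the counts collected for one word
          def reducer(key, values):
              yield key, sum(values)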
    • HDFS: Hadoop Distributed File System
      • Data is split into chunks stored on many computers
      • Each chunk is replicated more than once for reliability
    • HDFS: Hadoop Distributed File System
      • Each computer holds its own key-value pairs: (key1, value1), (key2, value2), ...
      • Computation is local to the data
      • Key-value pairs are processed independently in parallel
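The chunking and replication idea can be pictured with a toy model. This sketch is not HDFS's actual placement logic: the round-robin policy, the node names, the chunk size, and the replication factor of 2 are all assumptions chosen for illustration.

```python
def split_into_chunks(records, chunk_size):
    # Cut the data into fixed-size chunks (the last one may be shorter).
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def place_replicas(num_chunks, nodes, replication=2):
    """Toy policy: put each chunk on `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication cannot exceed the number of nodes")
    return {
        chunk_id: [nodes[(chunk_id + r) % len(nodes)] for r in range(replication)]
        for chunk_id in range(num_chunks)
    }


def surviving_chunks(placement, failed_nodes):
    # A chunk survives if at least one node holding a replica is still up.
    return {
        chunk_id
        for chunk_id, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    }


records = list(range(10))
chunks = split_into_chunks(records, 4)
nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical node names
placement = place_replicas(len(chunks), nodes, replication=2)
print(placement)
print(surviving_chunks(placement, failed_nodes={"node-b"}))
```

Because every chunk lives on more than one node, losing a single node leaves each chunk readable, and a map task can be scheduled on whichever surviving node already holds the chunk it needs.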
    • HDFS: Inspired by the Google File System
    • Hadoop Map-Reduce and HDFS: Advantages
      • Distribute data and computation
        • Computation local to data avoids network overload
      • Tasks are independent
        • Easy to handle partial failures - entire nodes can fail and restart
        • Avoid crawling horrors of failure-tolerant synchronous distributed systems
        • Speculative execution to work around stragglers
      • Linear scaling in the ideal case
        • Designed for cheap, commodity hardware
      • Simple programming model
        • The “end-user” programmer only writes map-reduce tasks
    • Hadoop Map-Reduce and HDFS: Disadvantages
      • Still rough - software under active development
        • e.g. HDFS only recently added support for append operations
      • Programming model is very restrictive
        • Lack of central data can be frustrating
      • “Joins” of multiple datasets are tricky and slow
        • No indices! Often, entire dataset gets copied in the process
      • Cluster management is hard (debugging, distributing software, collecting logs...)
      • Still single master, which requires care and may limit scaling
      • Managing job flow isn’t trivial when intermediate data should be kept
      • Optimal configuration of nodes not obvious (# mappers, # reducers, mem. limits)
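To see why joins are awkward in this model, here is a minimal sketch of one common workaround, often called a reduce-side join. It is an illustration rather than anything taken from the talk: the datasets, the function names, and the in-process `run_join` driver are all hypothetical. Each mapper tags its records with their source, and the reducer pairs up the records that share a key.

```python
from collections import defaultdict


def map_users(user_id, name):
    # Tag user records so the reducer can tell the two sources apart.
    yield user_id, ("user", name)


def map_orders(order_id, order):
    # Re-key each order by the user it belongs to.
    user_id, item = order
    yield user_id, ("order", item)


def reduce_join(user_id, tagged_values):
    # Separate the tagged records, then emit every (name, item) combination.
    names, items = [], []
    for tag, payload in tagged_values:
        (names if tag == "user" else items).append(payload)
    for name in names:
        for item in items:
            yield user_id, (name, item)


def run_join(users, orders):
    """Hypothetical in-process driver standing in for the map, sort, and reduce phases."""
    groups = defaultdict(list)
    for key, value in users:
        for k, v in map_users(key, value):
            groups[k].append(v)
    for key, value in orders:
        for k, v in map_orders(key, value):
            groups[k].append(v)
    output = []
    for k in sorted(groups):
        output.extend(reduce_join(k, groups[k]))
    return output


users = [(1, "ada"), (2, "bob")]
orders = [("o1", (1, "book")), ("o2", (1, "pen")), ("o3", (3, "cup"))]
joined = run_join(users, orders)
print(joined)
```

Every record from both datasets passes through the sort phase even though only user 1 produces output, which is the "entire dataset gets copied" cost noted on the slide.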
    • Getting started: Installation options
      • Cloudera virtual machine
      • Your own virtual machine (install Ubuntu in VirtualBox, which is free)
      • Elastic MapReduce on EC2
      • StarCluster with Hadoop on EC2
      • Cloudera’s distribution of Hadoop on EC2
      • Install Cloudera’s distribution of Hadoop on your own machine
        • Available for RPM and Debian deployments
      • Or download Hadoop directly from http://hadoop.apache.org/
    • Getting started: Language choices
      • Hadoop is written in Java
      • However, Hadoop Streaming allows mappers and reducers in any language!
      • Binary data is a little tricky with Hadoop Streaming
        • Could use base64 encoding, but TypedBytes are much better
      • For Python, try Dumbo: http://wiki.github.com/klbostee/dumbo
        • The Python word-count example and others come with Dumbo
        • Dumbo makes binary data with TypedBytes easy
      • Also consider Hadoopy: https://github.com/bwhite/hadoopy
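As a rough picture of what a Hadoop Streaming word-count looks like in Python: Streaming feeds a mapper or reducer lines of text on standard input and reads tab-separated `key<TAB>value` lines from its standard output, with the reducer's input arriving sorted by key. The sketch below keeps that line-based shape but runs in-process so it can be tested locally. The names `map_lines` and `reduce_lines` are hypothetical, and Python's `sorted` stands in for Hadoop's sort phase.

```python
from itertools import groupby


def map_lines(lines):
    # Mapper: one "word<TAB>1" output line per word of input text.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    # Reducer: input lines arrive sorted by key, so equal words are adjacent.
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local stand-in for the cluster: map, sort, reduce.
mapped = list(map_lines(["hello there", "why hello"]))
reduced = list(reduce_lines(sorted(mapped)))
print(reduced)
```

In a real Streaming job, each function would sit in its own script that iterates over `sys.stdin` and prints each output line. Libraries such as Dumbo hide this plumbing behind the mapper and reducer generator style shown earlier in the deck.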
    • Useful resources and tips
      • The Hadoop homepage: http://hadoop.apache.org/
      • Cloudera: http://cloudera.com/
      • Dumbo: http://wiki.github.com/klbostee/dumbo
      • Hadoopy: https://github.com/bwhite/hadoopy
      • Amazon Elastic Compute Cloud Getting Started Guide: http://docs.amazonwebservices.com/AWSEC2/latest/GettingStartedGuide/
      • Always test locally on a tiny dataset before running on a cluster!
    • Thanks for your attention!