Hadoop
 

    Hadoop Presentation Transcript

    • Knowledge Share: Hadoop
      Yu Zhao
      Platform
    • Hadoop
      What is Hadoop
      The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
      Hadoop Includes
      MapReduce
      HDFS
      Hadoop Common
    • Hadoop
      Position
      [Diagram: layers from top to bottom: APPLICATION, HADOOP, OS (×4), HOST (×4)]
    • MapReduce
      A simple programming model that applies to many large-scale computing problems
      Messy details are hidden in the MapReduce runtime library:
      automatic parallelization
      load balancing
      network and disk transfer optimization
      handling of machine failures
      robustness
    • MapReduce
      Typical problem solved by MapReduce
      Read a lot of data
      Map: extract something you care about from each record
      Shuffle and Sort
      Reduce: aggregate, summarize, filter, or transform
      Write the results
    • MapReduce
      The outline stays the same; map and reduce change to fit the problem
      More specifically…
      Programmer specifies two primary methods:
      map: (K1, V1) -> list(K2, V2)
      reduce: (K2, list(V2)) -> list(K3, V3)
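
The two signatures above can be sketched as a tiny in-memory driver. This is only an illustration of the programming model, not how Hadoop runs jobs: `run_mapreduce`, `max_temp_map`, and `max_temp_reduce` are hypothetical names, and the "year temperature" records are made-up sample data.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List, Tuple

Pair = Tuple[Any, Any]


def run_mapreduce(
    records: Iterable[Pair],
    map_fn: Callable[[Any, Any], List[Pair]],            # (K1, V1) -> list(K2, V2)
    reduce_fn: Callable[[Any, List[Any]], List[Pair]],   # (K2, list(V2)) -> list(K3, V3)
) -> List[Pair]:
    # Map phase, with the shuffle folded in: group every emitted V2 under its K2.
    groups: Dict[Any, List[Any]] = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)

    # Sort phase: visit keys in sorted order, then reduce each group.
    output: List[Pair] = []
    for k2 in sorted(groups):
        output.extend(reduce_fn(k2, groups[k2]))
    return output


def max_temp_map(offset: int, line: str) -> List[Pair]:
    # Hypothetical record format: "<year> <temperature>".
    year, temp = line.split()
    return [(year, int(temp))]


def max_temp_reduce(year: str, temps: List[int]) -> List[Pair]:
    return [(year, max(temps))]


records = [(0, "1950 22"), (1, "1950 31"), (2, "1949 18")]
print(run_mapreduce(records, max_temp_map, max_temp_reduce))
# [('1949', 18), ('1950', 31)]
```

Only `map_fn` and `reduce_fn` change from problem to problem; the driver stays the same.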
    • MapReduce
      Example: word count
      Counting the number of occurrences of each word in a large collection of documents.
      Input
      Key: “document1”
      Value: “to be or not to be”
      MAP output
      Key    Value
      “to”   “1”
      “be”   “1”
      “or”   “1”
      “not”  “1”
      “to”   “1”
      “be”   “1”
      SHUFFLE & SORT output
      Key    Value
      “be”   “1”, “1”
      “not”  “1”
      “or”   “1”
      “to”   “1”, “1”
      REDUCE output
      Key    Value
      “be”   “2”
      “not”  “1”
      “or”   “1”
      “to”   “2”
    • MapReduce
      Example: word count
      Pseudo-code
      map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
          EmitIntermediate(w, "1");
      reduce(String key, Iterator values):
        // key: a word
        // values: a list of counts
        int result = 0;
        for each v in values:
          result += ParseInt(v);
        Emit(AsString(result));
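
For readers who want to run the word count, here is a minimal single-process sketch in Python. It mirrors the pseudo-code (string counts, parse-and-sum in reduce) but is not Hadoop API code: the function names are hypothetical, and the shuffle is simulated with an in-memory sort and `itertools.groupby`. Unlike the pseudo-code, the reduce step returns the word together with its count so the result is readable.

```python
from itertools import groupby
from operator import itemgetter


def map_word_count(key, value):
    """key: document name; value: document contents."""
    for word in value.split():
        yield (word, "1")


def reduce_word_count(key, values):
    """key: a word; values: an iterator of counts (as strings)."""
    result = 0
    for v in values:
        result += int(v)
    return (key, str(result))


def word_count(documents):
    # Map: emit (word, "1") for every word of every document.
    intermediate = []
    for name, contents in documents.items():
        intermediate.extend(map_word_count(name, contents))

    # Shuffle and sort: order by key so equal words become adjacent.
    intermediate.sort(key=itemgetter(0))

    # Reduce: one call per distinct word, with all of its counts.
    return [
        reduce_word_count(word, (count for _, count in group))
        for word, group in groupby(intermediate, key=itemgetter(0))
    ]


print(word_count({"document1": "to be or not to be"}))
# [('be', '2'), ('not', '1'), ('or', '1'), ('to', '2')]
```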
    • MapReduce
      Shuffle and Sort
    • HDFS
      Design
      Very large files
      Streaming data access
      Commodity hardware
      Not a good fit for
      Low-latency data access
      Lots of small files
      Multiple writers, arbitrary file modifications
    • HDFS
      Concepts
      Blocks
      Namenodes and Datanodes
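
A toy model may help make the two concepts concrete: a file is cut into fixed-size blocks, and a namenode-style table records which datanodes hold each block. Everything here is an assumption for illustration: the function names are hypothetical, the block size and replication factor are plain parameters, and the round-robin placement is a stand-in, not HDFS's real placement policy.

```python
from typing import Dict, List

MB = 1024 * 1024


def split_into_blocks(file_size: int, block_size: int) -> List[int]:
    """Return the length of each block; only the last one may be short."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_blocks(num_blocks: int, datanodes: List[str], replication: int) -> Dict[int, List[str]]:
    """Toy namenode metadata: block index -> datanodes holding a replica.

    Round-robin placement is a simplification chosen for this sketch.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many datanodes as replicas")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


blocks = split_into_blocks(200 * MB, 64 * MB)
print([size // MB for size in blocks])
# [64, 64, 64, 8]

block_map = place_blocks(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=3)
print(block_map[0])
# ['dn1', 'dn2', 'dn3']
```

In this model the namenode holds only the `block_map` metadata, while the block contents would live on the datanodes.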
    • HDFS
      Network Topology and Hadoop
      The network is represented as a tree, and the distance between two nodes is the sum of their distances to their closest common ancestor.
      Distance is computed for the following cases, from nearest to farthest:
      Processes on the same node
      Different nodes on the same rack
      Nodes on different racks in the same data center
      Nodes in different data centers
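
The tree-distance rule can be sketched in a few lines. The sketch assumes each node is written as a path of the form `/datacenter/rack/node` (for example `/d1/r1/n1`); that notation and the function name `network_distance` are choices made for this illustration.

```python
def network_distance(a: str, b: str) -> int:
    """Sum of each node's hop count up to the closest common ancestor."""
    path_a = a.strip("/").split("/")
    path_b = b.strip("/").split("/")

    # Count the leading path components the two nodes share.
    common = 0
    for part_a, part_b in zip(path_a, path_b):
        if part_a != part_b:
            break
        common += 1

    return (len(path_a) - common) + (len(path_b) - common)


print(network_distance("/d1/r1/n1", "/d1/r1/n1"))  # 0: processes on the same node
print(network_distance("/d1/r1/n1", "/d1/r1/n2"))  # 2: different nodes on the same rack
print(network_distance("/d1/r1/n1", "/d1/r2/n3"))  # 4: different racks, same data center
print(network_distance("/d1/r1/n1", "/d2/r3/n4"))  # 6: different data centers
```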
    • HDFS
      Network Topology and Hadoop
    • HDFS
      Data Flow: read
    • HDFS
      Data Flow: write
    • Setup
      Required Software
      Java™ 1.6.x, preferably from Sun, must be installed.
      ssh
      Cygwin: required on Windows for shell support, in addition to the required software above.
    • Setup
      Configure
      Set up passphraseless ssh
      Configuration Files
      core-site.xml
      hdfs-site.xml 
      mapred-site.xml
      masters
      slaves
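
The three `*-site.xml` files share one layout: a `<configuration>` element holding `<property>` entries, each with a `<name>` and a `<value>`. The sketch below renders that layout from a dictionary. The property names and values are assumptions taken from a typical single-node (pseudo-distributed) setup of Hadoop releases from the Java 1.6 era; they are not from the slides, and the right values depend on the Hadoop version and cluster. `render_site_xml` and `SITE_FILES` are hypothetical names.

```python
from typing import Dict
from xml.sax.saxutils import escape


def render_site_xml(properties: Dict[str, str]) -> str:
    """Render name/value pairs in the Hadoop *-site.xml layout."""
    lines = ['<?xml version="1.0"?>', "<configuration>"]
    for name, value in properties.items():
        lines += [
            "  <property>",
            f"    <name>{escape(name)}</name>",
            f"    <value>{escape(value)}</value>",
            "  </property>",
        ]
    lines.append("</configuration>")
    return "\n".join(lines) + "\n"


# Assumed single-node values; adjust hosts and ports for a real cluster.
SITE_FILES = {
    "core-site.xml": {"fs.default.name": "hdfs://localhost:9000"},
    "hdfs-site.xml": {"dfs.replication": "1"},
    "mapred-site.xml": {"mapred.job.tracker": "localhost:9001"},
}

for filename, properties in SITE_FILES.items():
    print(f"--- {filename} ---")
    print(render_site_xml(properties))
```

The `masters` and `slaves` files are not XML; they are plain text lists of host names, one per line.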
    • References
      Hadoop: The Definitive Guide, Tom White
      MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat
      Experiences with MapReduce, an Abstraction for Large-Scale Computation, Jeff Dean