Hadoop

Transcript

  • 1. Knowledge Share: Hadoop
    Yu Zhao
    Platform
  • 2. Hadoop
    What is Hadoop
    The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
    Hadoop Includes
    MapReduce
    HDFS
    Hadoop Common
  • 3. Hadoop
    Position
    [Diagram: APPLICATION sits on top of HADOOP, which spans four machines, each a HOST running its own OS]
  • 4. MapReduce
    A simple programming model that applies to many large-scale computing problems
    Hides messy details in the MapReduce runtime library:
    automatic parallelization
    load balancing
    network and disk transfer optimization
    handling of machine failures
    robustness
  • 5. MapReduce
    Typical problem solved by MapReduce
    Read a lot of data
    Map: extract something you care about from each record
    Shuffle and Sort
    Reduce: aggregate, summarize, filter, or transform
    Write the results
  • 6. MapReduce
    The outline stays the same; map and reduce change to fit the problem
    More specifically…
    Programmer specifies two primary methods:
    map: (K1, V1) -> list(K2, V2)
    reduce: (K2, list(V2)) -> list(K3, V3)
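The two signatures above can be sketched as a tiny single-process driver. This is only an illustration of the programming model, not Hadoop's API: `run_mapreduce`, `length_map`, and `count_reduce` are hypothetical names, and everything runs in memory with no parallelism or fault tolerance.

```python
from collections import defaultdict
from typing import Callable, Iterable, List, Tuple


def run_mapreduce(
    records: Iterable[Tuple[object, object]],
    map_fn: Callable[[object, object], Iterable[Tuple[object, object]]],
    reduce_fn: Callable[[object, List[object]], Iterable[Tuple[object, object]]],
) -> List[Tuple[object, object]]:
    """Apply map_fn to every (K1, V1), group by K2, then apply reduce_fn."""
    # Map phase: each input record may emit any number of (K2, V2) pairs.
    intermediate = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            intermediate[k2].append(v2)

    # Shuffle and sort (simplified): visit the groups in key order.
    output = []
    for k2 in sorted(intermediate):
        # Reduce phase: (K2, list(V2)) -> list(K3, V3)
        output.extend(reduce_fn(k2, intermediate[k2]))
    return output


# Hypothetical example: count how many words of each length occur.
def length_map(doc_name, text):
    for word in text.split():
        yield len(word), 1


def count_reduce(length, ones):
    yield length, sum(ones)


result = run_mapreduce([("document1", "to be or not to be")], length_map, count_reduce)
print(result)
```

Only `length_map` and `count_reduce` would change for a different problem; the driver stays the same.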
  • 7. MapReduce
    Example: word count
    Counting the number of occurrences of each word in a large collection of documents.
    Input
    Key: “document1”
    Value: “to be or not to be”
    MAP output
    Key Value
    “to” “1”
    “be” “1”
    “or” “1”
    “not” “1”
    “to” “1”
    “be” “1”
    SHUFFLE & SORT output
    Key Value
    “be” “1”, “1”
    “not” “1”
    “or” “1”
    “to” “1”, “1”
    REDUCE output
    Key Value
    “be” “2”
    “not” “1”
    “or” “1”
    “to” “2”
  • 8. MapReduce
    Example: word count
    Pseudo-code
    map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
            EmitIntermediate(w, "1");
    reduce(String key, Iterator values):
        // key: a word
        // values: a list of counts
        int result = 0;
        for each v in values:
            result += ParseInt(v);
        Emit(AsString(result));
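For readers who want to run the word count, here is a minimal Python sketch of the pseudo-code above. It is not Hadoop code: `map_fn`, `reduce_fn`, and `word_count` are hypothetical names, the counts are kept as strings to mirror the slide, and the shuffle step is simulated with an in-memory sort and `itertools.groupby`.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(key, value):
    # key: document name, value: document contents
    for word in value.split():
        yield word, "1"


def reduce_fn(key, values):
    # key: a word, values: an iterator over string counts
    result = 0
    for v in values:
        result += int(v)
    yield key, str(result)


def word_count(documents):
    # Map: run map_fn over every (name, contents) pair.
    pairs = [kv for name, text in documents.items() for kv in map_fn(name, text)]
    # Shuffle and sort (simulated): order by key so equal keys are adjacent.
    pairs.sort(key=itemgetter(0))
    # Reduce: one call per distinct word.
    counts = {}
    for word, group in groupby(pairs, key=itemgetter(0)):
        for k, v in reduce_fn(word, (value for _, value in group)):
            counts[k] = v
    return counts


counts = word_count({"document1": "to be or not to be"})
print(counts)
```

The result matches the REDUCE output on the previous slide: “be” and “to” map to “2”, “not” and “or” to “1”.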
  • 9. MapReduce
    Shuffle and Sort
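A rough sketch of what shuffle and sort does between the map and reduce phases, assuming the usual scheme in which each key is assigned to one reducer by hashing and each reducer sees its keys in sorted order. The names `partition` and `shuffle_and_sort` are hypothetical, and `zlib.crc32` is used here only as a deterministic stand-in for the key's hash; it is not what Hadoop uses.

```python
import zlib
from collections import defaultdict


def partition(key: str, num_reducers: int) -> int:
    # Deterministic hash so that the same key always goes to the same reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(map_output, num_reducers):
    # One bucket of {key: [values]} per reducer.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in map_output:
        buckets[partition(key, num_reducers)][key].append(value)
    # Each reducer receives its keys in sorted order, with values grouped per key.
    return [sorted(bucket.items()) for bucket in buckets]


map_output = [("to", "1"), ("be", "1"), ("or", "1"),
              ("not", "1"), ("to", "1"), ("be", "1")]
for reducer_id, groups in enumerate(shuffle_and_sort(map_output, 2)):
    print(reducer_id, groups)
```

Which reducer a given word lands on depends on the hash, but every occurrence of a word ends up in the same reducer's input.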
  • 10. HDFS
    Design
    Very large files
    Streaming data access
    Commodity hardware
    Not designed for
    Low-latency data access
    Lots of small files
    Multiple writers, arbitrary file modifications
  • 11. HDFS
    Concepts
    Blocks
    Namenodes and Datanodes
  • 12. HDFS
    Network Topology and Hadoop
    The network is represented as a tree, and the distance between two nodes is the sum of their distances to their closest common ancestor.
    Scenarios, in order of increasing distance:
    Processes on the same node
    Different nodes on the same rack
    Nodes on different racks in the same data center
    Nodes in different data centers
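The distance rule above can be sketched in a few lines. The path notation `/datacenter/rack/node` and the function name `distance` are assumptions made for this illustration; the sketch only implements the "sum of distances to the closest common ancestor" rule and is not Hadoop's own implementation.

```python
def distance(a: str, b: str) -> int:
    """Tree distance between two nodes given as /datacenter/rack/node paths."""
    path_a = a.strip("/").split("/")
    path_b = b.strip("/").split("/")
    # Count the leading path components that both nodes share.
    common = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        common += 1
    # Hops from each node up to the closest common ancestor.
    return (len(path_a) - common) + (len(path_b) - common)


print(distance("/d1/r1/n1", "/d1/r1/n1"))  # processes on the same node
print(distance("/d1/r1/n1", "/d1/r1/n2"))  # different nodes on the same rack
print(distance("/d1/r1/n1", "/d1/r2/n3"))  # different racks, same data center
print(distance("/d1/r1/n1", "/d2/r3/n4"))  # different data centers
```

With three-level paths, the four scenarios listed above come out as 0, 2, 4, and 6 respectively.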
  • 13. HDFS
    Network Topology and Hadoop
  • 14. HDFS
    Data flow: read
  • 15. HDFS
    Data flow: write
  • 16. Setup
    Required Software
    Java™ 1.6.x, preferably from Sun, must be installed.
    ssh
    Cygwin: required on Windows for shell support, in addition to the software above.
  • 17. Setup
    Configure
    Set up passphraseless ssh
    Configuration Files
    core-site.xml
    hdfs-site.xml 
    mapred-site.xml
    masters
    slaves
  • 18. References
    Hadoop: The Definitive Guide, Tom White
    MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat
    Experiences with MapReduce, an Abstraction for Large-Scale Computation, Jeff Dean