Hadoop
Published in: Technology
Transcript

  • 1. Knowledge Share: Hadoop
    Yu Zhao
    Platform
  • 2. Hadoop
    What is Hadoop
    The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
    Hadoop Includes
    MapReduce
    HDFS
    Hadoop Common
  • 3. Hadoop
    Position
    [Diagram: APPLICATION sits on top of HADOOP, which spans the OS of each of four HOSTs.]
  • 4. MapReduce
    A simple programming model that applies to many large-scale computing problems
    Hides messy details in the MapReduce runtime library:
    automatic parallelization
    load balancing
    network and disk transfer optimization
    handling of machine failures
    robustness
  • 5. MapReduce
    Typical problem solved by MapReduce
    Read a lot of data
    Map: extract something you care about from each record
    Shuffle and Sort
    Reduce: aggregate, summarize, filter, or transform
    Write the results
  • 6. MapReduce
    The outline stays the same; map and reduce change to fit the problem
    More specifically…
    Programmer specifies two primary methods:
    map: (K1, V1) -> list(K2, V2)
    reduce: (K2, list(V2)) -> list(K3, V3)
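The outline and the two signatures above can be sketched as a small in-memory driver. This is only an illustration of the programming model, not Hadoop's API: the names `run_mapreduce`, `parity_map`, and `sum_reduce` are hypothetical, and a real runtime distributes each step across machines.

```python
from collections import defaultdict
from typing import Callable, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")
K3 = TypeVar("K3")
V3 = TypeVar("V3")


def run_mapreduce(
    records: Iterable[Tuple[K1, V1]],
    map_fn: Callable[[K1, V1], Iterable[Tuple[K2, V2]]],
    reduce_fn: Callable[[K2, List[V2]], Iterable[Tuple[K3, V3]]],
) -> List[Tuple[K3, V3]]:
    """Hypothetical single-process driver: map, then shuffle and sort, then reduce."""
    # Map: (K1, V1) -> list(K2, V2), grouping values by intermediate key.
    grouped = defaultdict(list)
    for key, value in records:
        for k2, v2 in map_fn(key, value):
            grouped[k2].append(v2)

    # Shuffle and sort: visit intermediate keys in sorted order.
    # Reduce: (K2, list(V2)) -> list(K3, V3).
    output = []
    for k2 in sorted(grouped):
        output.extend(reduce_fn(k2, grouped[k2]))
    return output


def parity_map(key, number):
    # Illustrative mapper: label each number as "even" or "odd".
    return [("even" if number % 2 == 0 else "odd", number)]


def sum_reduce(key, values):
    # Illustrative reducer: sum all values that share a key.
    return [(key, sum(values))]


print(run_mapreduce([("a", 1), ("b", 2), ("c", 3), ("d", 4)], parity_map, sum_reduce))
# [('even', 6), ('odd', 4)]
```

Only `map_fn` and `reduce_fn` change from problem to problem; the driver stays the same.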
  • 7. MapReduce
    Example: word count
    Counting the number of occurrences of each word in a large collection of documents.
    Input
    Key: “document1”
    Value: “to be or not to be”
    MAP output
    Key Value
    “to” “1”
    “be” “1”
    “or” “1”
    “not” “1”
    “to” “1”
    “be” “1”
    SHUFFLE & SORT output
    Key Value
    “be” “1”, “1”
    “not” “1”
    “or” “1”
    “to” “1”, “1”
    REDUCE output
    Key Value
    “be” “2”
    “not” “1”
    “or” “1”
    “to” “2”
  • 8. MapReduce
    Example: word count
    Pseudo-code
    map(String key, String value):
      // key: document name
      // value: document contents
      for each word w in value:
        EmitIntermediate(w, “1”);
    reduce(String key, Iterator values):
      // key: a word
      // values: a list of counts
      int result = 0;
      for each v in values:
        result += ParseInt(v);
      Emit(AsString(result));
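A runnable Python rendering of the pseudo-code above, as a minimal sketch. The helper `word_count` is hypothetical and stands in for the runtime: it sorts and groups the intermediate pairs in memory. Unlike the pseudo-code, the reducer here emits the word together with its count so the result can be returned as a dictionary.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(key, value):
    """key: document name; value: document contents."""
    for word in value.split():
        yield word, "1"


def reduce_fn(key, values):
    """key: a word; values: an iterable of counts (as strings)."""
    result = 0
    for v in values:
        result += int(v)
    yield key, str(result)


def word_count(documents):
    """Hypothetical in-memory stand-in for the MapReduce runtime."""
    # Map every document, collecting all intermediate (word, "1") pairs.
    pairs = [kv for name, text in documents.items() for kv in map_fn(name, text)]
    # Shuffle and sort: order pairs by word so equal words are adjacent.
    pairs.sort(key=itemgetter(0))
    counts = {}
    for word, group in groupby(pairs, key=itemgetter(0)):
        for k, total in reduce_fn(word, (count for _, count in group)):
            counts[k] = total
    return counts


print(word_count({"document1": "to be or not to be"}))
# {'be': '2', 'not': '1', 'or': '1', 'to': '2'}
```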
  • 9. MapReduce
    Shuffle and Sort
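The slide's diagram is not reproduced in this transcript. As a rough sketch of the idea, the shuffle routes every intermediate key to exactly one reducer, and each reducer sees its keys in sorted order with their values grouped. The functions `partition` and `shuffle_and_sort` below are hypothetical names, and the CRC32-based hash is an assumption chosen only because it is stable between runs; it is not Hadoop's own partitioner.

```python
import zlib
from collections import defaultdict


def partition(key: str, num_reducers: int) -> int:
    # Assumed stand-in for a hash partitioner: the same key always
    # maps to the same reducer index.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(map_output, num_reducers):
    """Group map output by key per reducer, then sort each reducer's keys."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in map_output:
        buckets[partition(key, num_reducers)][key].append(value)
    # Each reducer receives a sorted list of (key, [values]) pairs.
    return [sorted(bucket.items()) for bucket in buckets]


map_output = [("to", "1"), ("be", "1"), ("or", "1"),
              ("not", "1"), ("to", "1"), ("be", "1")]
for reducer_id, reducer_input in enumerate(shuffle_and_sort(map_output, 2)):
    print(reducer_id, reducer_input)
```

Which reducer receives which word depends on the hash, so only the grouping and the per-reducer ordering are guaranteed.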
  • 10. HDFS
    Design
    Very large files
    Streaming data access
    Commodity hardware
    Not a good fit for
    Low-latency data access
    Lots of small files
    Multiple writers, arbitrary file modifications
  • 11. HDFS
    Concepts
    Blocks
    Namenodes and Datanodes
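To make the block concept concrete, the sketch below splits a file of a given size into fixed-size blocks, the unit that datanodes store and that the namenode tracks in its metadata. The function name `split_into_blocks` is hypothetical, and the 64 MB block size is an assumed default for illustration; the actual value is configurable per cluster.

```python
MB = 1024 * 1024


def split_into_blocks(file_size: int, block_size: int = 64 * MB):
    """Return (offset, length) pairs for each block of a file.

    block_size defaults to 64 MB here as an assumption, not a fixed rule.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block holds only the remaining bytes.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


for offset, length in split_into_blocks(150 * MB):
    print(offset // MB, length // MB)
# 0 64
# 64 64
# 128 22
```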
  • 12. HDFS
    Network Topology and Hadoop
    The network is represented as a tree, and the distance between two nodes is the sum of their distances to their closest common ancestor.
    Distance is computed for each of these cases, in increasing order:
    Processes on the same node
    Different nodes on the same rack
    Nodes on different racks in the same data center
    Nodes in different data centers
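The tree distance described above can be sketched as follows. The path format `/datacenter/rack/node` and the function name `distance` are assumptions made for this illustration; the calculation itself is just the number of hops from each node up to the closest common ancestor.

```python
def distance(a: str, b: str) -> int:
    """Tree distance between two locations given as '/datacenter/rack/node' paths."""
    path_a = a.strip("/").split("/")
    path_b = b.strip("/").split("/")
    # Count the leading path components the two locations share.
    common = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        common += 1
    # Hops from each node up to the closest common ancestor, summed.
    return (len(path_a) - common) + (len(path_b) - common)


print(distance("/d1/r1/n1", "/d1/r1/n1"))  # 0: processes on the same node
print(distance("/d1/r1/n1", "/d1/r1/n2"))  # 2: different nodes on the same rack
print(distance("/d1/r1/n1", "/d1/r2/n3"))  # 4: different racks, same data center
print(distance("/d1/r1/n1", "/d2/r3/n4"))  # 6: different data centers
```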
  • 13. HDFS
    Network Topology and Hadoop
  • 14. HDFS
    Data Flow: read
  • 15. HDFS
    Data Flow: write
  • 16. Setup
    Required Software
    Java™ 1.6.x, preferably from Sun, must be installed.
    ssh
    Cygwin: required on Windows for shell support, in addition to the required software above.
  • 17. Setup
    Configure
    Set up passphraseless ssh
    Configuration Files
    core-site.xml
    hdfs-site.xml 
    mapred-site.xml
    masters
    slaves
  • 18. References
    Hadoop: The Definitive Guide, Tom White
    MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat
    Experiences with MapReduce, an Abstraction for Large-Scale Computation, Jeff Dean
