Hadoop

  1. Knowledge Share: Hadoop
     Yu Zhao
     Platform
  2. Hadoop: What is Hadoop
     The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
     Hadoop includes:
       • MapReduce
       • HDFS
       • Hadoop Common
  3. Hadoop: Position (layers, top to bottom)
     APPLICATION
     HADOOP
     OS   | OS   | OS   | OS
     HOST | HOST | HOST | HOST
  4. MapReduce
     A simple programming model that applies to many large-scale computing problems.
     The MapReduce runtime library hides the messy details:
       • automatic parallelization
       • load balancing
       • network and disk transfer optimization
       • handling of machine failures
       • robustness
  5. MapReduce: Typical problem solved by MapReduce
       • Read a lot of data
       • Map: extract something you care about from each record
       • Shuffle and Sort
       • Reduce: aggregate, summarize, filter, or transform
       • Write the results
  6. MapReduce
     The outline stays the same; map and reduce change to fit the problem.
     More specifically, the programmer specifies two primary methods:
       map:    (K1, V1) -> list(K2, V2)
       reduce: (K2, list(V2)) -> list(K3, V3)
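To make the two signatures concrete, here is a minimal single-process sketch in Python. It is not Hadoop's API: `run_mapreduce`, `max_map`, `max_reduce`, and the sensor readings are all hypothetical names and data invented for illustration. The driver is the fixed outline; only the two functions passed to it change per problem.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, then shuffle and sort, then reduce."""
    # Map phase: each (K1, V1) record yields zero or more (K2, V2) pairs.
    intermediate = defaultdict(list)
    for key, value in records:
        for k2, v2 in map_fn(key, value):
            # Shuffle: collect every V2 that shares the same K2.
            intermediate[k2].append(v2)

    # Sort: visit the groups in key order, then reduce each group.
    output = []
    for k2 in sorted(intermediate):
        output.extend(reduce_fn(k2, intermediate[k2]))
    return output


# Example problem (made-up data): highest reading per sensor.
def max_map(line_no, line):
    # (K1, V1) = (line number, "sensor,reading") -> list of (sensor, reading)
    sensor, reading = line.split(",")
    yield sensor, int(reading)


def max_reduce(sensor, readings):
    # (K2, list(V2)) -> list of (K3, V3)
    yield sensor, max(readings)


records = [(0, "a,3"), (1, "b,7"), (2, "a,9")]
result = run_mapreduce(records, max_map, max_reduce)
print(result)
```

Swapping in a different pair of map and reduce functions solves a different problem without touching the driver, which is the point of the slide.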
  7. MapReduce: Example, word count
     Counting the number of occurrences of each word in a large collection of documents.
     Input:
       Key: "document1"    Value: "to be or not to be"
     After MAP:
       Key      Value
       "to"     "1"
       "be"     "1"
       "or"     "1"
       "not"    "1"
       "to"     "1"
       "be"     "1"
     After SHUFFLE & SORT:
       Key      Value
       "be"     "1", "1"
       "not"    "1"
       "or"     "1"
       "to"     "1", "1"
     After REDUCE:
       Key      Value
       "be"     "2"
       "not"    "1"
       "or"     "1"
       "to"     "2"
  8. MapReduce: Example, word count
     Pseudo-code
       Map(String key, String value):
         // key: document name
         // value: document contents
         for each word w in value:
           EmitIntermediate(w, "1");

       Reduce(String key, Iterator values):
         // key: a word
         // values: a list of counts
         int result = 0;
         for each v in values:
           result += ParseInt(v);
         Emit(AsString(result));
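The pseudo-code can be turned into a small runnable Python sketch. This is an illustration only, not Hadoop code: the function names are hypothetical, the shuffle is an in-memory dictionary, and the reduce step emits a (word, count) pair rather than the bare count so that the result is readable on its own.

```python
from collections import defaultdict


def map_word_count(key, value):
    # key: document name, value: document contents
    for word in value.split():
        yield word, "1"


def reduce_word_count(key, values):
    # key: a word, values: a list of counts (as strings)
    result = 0
    for v in values:
        result += int(v)
    yield key, str(result)


def word_count(documents):
    # Shuffle: group the intermediate "1" values by word.
    grouped = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_word_count(name, contents):
            grouped[word].append(count)

    # Sort the keys, then reduce each group to a single total.
    counts = {}
    for word in sorted(grouped):
        for k, v in reduce_word_count(word, grouped[word]):
            counts[k] = v
    return counts


counts = word_count({"document1": "to be or not to be"})
print(counts)  # {'be': '2', 'not': '1', 'or': '1', 'to': '2'}
```

The output matches the reduce table from the word count example: "be" and "to" occur twice, "not" and "or" once.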
  9. MapReduce: Shuffle and Sort
  10. HDFS: Design
      Designed for:
        • Very large files
        • Streaming data access
        • Commodity hardware
      Not designed for:
        • Low-latency data access
        • Lots of small files
        • Multiple writers, arbitrary file modifications
  11. HDFS: Concepts
        • Blocks
        • Namenodes and Datanodes
  12. HDFS: Network Topology and Hadoop
      The network is represented as a tree, and the distance between two nodes is the sum of their distances to their closest common ancestor.
      Distances are computed for these cases, from nearest to farthest:
        • Processes on the same node
        • Different nodes on the same rack
        • Nodes on different racks in the same data center
        • Nodes in different data centers
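The distance rule can be sketched in a few lines of Python. The sketch assumes each node is named by a path of the form /datacenter/rack/node; that naming, the function name `network_distance`, and the sample paths are assumptions made for illustration, not taken from the slides.

```python
def network_distance(a: str, b: str) -> int:
    """Sum of each node's distance to the closest common ancestor in the tree."""
    path_a = a.strip("/").split("/")
    path_b = b.strip("/").split("/")

    # Count how many leading path components the two nodes share.
    common = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        common += 1

    # Hops from each node up to the closest common ancestor.
    return (len(path_a) - common) + (len(path_b) - common)


same_node = network_distance("/d1/r1/n1", "/d1/r1/n1")    # same node
same_rack = network_distance("/d1/r1/n1", "/d1/r1/n2")    # same rack
same_dc = network_distance("/d1/r1/n1", "/d1/r2/n3")      # same data center
different_dc = network_distance("/d1/r1/n1", "/d2/r3/n4")  # different data centers
print(same_node, same_rack, same_dc, different_dc)  # 0 2 4 6
```

Under this naming, the four cases listed above give distances 0, 2, 4, and 6, so a smaller number always means a closer node.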
  13. HDFS: Network Topology and Hadoop
  14. HDFS: Data Flow, read
  15. HDFS: Data Flow, write
  16. Setup: Required Software
        • Java™ 1.6.x, preferably from Sun, must be installed.
        • ssh
        • Cygwin: required on Windows for shell support, in addition to the software above.
  17. Setup: Configure
      Set up passphraseless ssh.
      Configuration files:
        • core-site.xml
        • hdfs-site.xml
        • mapred-site.xml
        • masters
        • slaves
  18. References
        • Hadoop: The Definitive Guide, Tom White
        • MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat
        • Experiences with MapReduce, an Abstraction for Large-Scale Computation, Jeff Dean