2. Hadoop
What is Hadoop?
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
Hadoop includes:
- MapReduce
- HDFS
- Hadoop Common
4. MapReduce
A simple programming model that applies to many large-scale computing problems.
The MapReduce runtime library hides the messy details:
- automatic parallelization
- load balancing
- network and disk transfer optimization
- handling of machine failures
- robustness
5. MapReduce
Typical problem solved by MapReduce:
- Read a lot of data
- Map: extract something you care about from each record
- Shuffle and sort
- Reduce: aggregate, summarize, filter, or transform
- Write the results
6. MapReduce
The outline stays the same; map and reduce change to fit the problem.
More specifically, the programmer specifies two primary methods:
    map: (K1, V1) -> list(K2, V2)
    reduce: (K2, list(V2)) -> list(K3, V3)
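The fixed outline with pluggable map and reduce methods can be sketched as a small in-memory driver. This is only an illustration of the programming model, not how Hadoop executes jobs: the names `run_mapreduce`, `length_mapper`, and `count_reducer`, and the sample records, are hypothetical and chosen for this example.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle-and-sort, and reduce over (K1, V1) records in memory."""
    # Map: each (K1, V1) record yields zero or more (K2, V2) pairs.
    intermediate = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            # Shuffle: group every V2 under its K2.
            intermediate[k2].append(v2)
    # Sort: visit the intermediate keys in sorted order.
    output = []
    for k2 in sorted(intermediate):
        # Reduce: (K2, list(V2)) yields zero or more (K3, V3) pairs.
        output.extend(reducer(k2, intermediate[k2]))
    return output

# Hypothetical problem: how many words of each length appear in the input?
def length_mapper(filename, line):
    for word in line.split():
        yield (len(word), 1)

def count_reducer(length, ones):
    yield (length, sum(ones))

records = [("a.txt", "x yy"), ("b.txt", "zz www")]
result = run_mapreduce(records, length_mapper, count_reducer)
print(result)  # [(1, 1), (2, 2), (3, 1)]
```

Only `length_mapper` and `count_reducer` are specific to the problem; the driver stays unchanged when the problem changes.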
7. MapReduce
Example: word count
Counting the number of occurrences of each word in a large collection of documents.

Input:
    Key: "document1"    Value: "to be or not to be"

MAP output:
    Key      Value
    "to"     "1"
    "be"     "1"
    "or"     "1"
    "not"    "1"
    "to"     "1"
    "be"     "1"

SHUFFLE & SORT output:
    Key      Value
    "be"     "1", "1"
    "not"    "1"
    "or"     "1"
    "to"     "1", "1"

REDUCE output:
    Key      Value
    "be"     "2"
    "not"    "1"
    "or"     "1"
    "to"     "2"
8. MapReduce
Example: word count, pseudo-code

    Map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
            EmitIntermediate(w, "1");

    Reduce(String key, Iterator values):
        // key: a word
        // values: a list of counts
        int result = 0;
        for each v in values:
            result += ParseInt(v);
        Emit(AsString(result));
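The word-count pseudo-code can be tried out as a minimal single-process Python sketch. It runs in memory on the slide's "document1" input and does not use Hadoop; the helper names `map_fn`, `shuffle_and_sort`, `reduce_fn`, and `word_count` are assumptions made for this illustration.

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document name, value: document contents
    for w in value.split():
        yield (w, "1")

def shuffle_and_sort(pairs):
    # Group the intermediate values by key, then order the groups by key.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return sorted(groups.items())

def reduce_fn(key, values):
    # key: a word, values: a list of counts encoded as strings
    result = 0
    for v in values:
        result += int(v)
    return (key, str(result))

def word_count(documents):
    intermediate = []
    for name, contents in documents.items():
        intermediate.extend(map_fn(name, contents))
    return [reduce_fn(k, vs) for k, vs in shuffle_and_sort(intermediate)]

counts = word_count({"document1": "to be or not to be"})
print(counts)  # [('be', '2'), ('not', '1'), ('or', '1'), ('to', '2')]
```

Unlike the pseudo-code, which emits only the count, this sketch returns the word together with its count so the printed result is readable on its own.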
10. HDFS
Design. HDFS is built for:
- Very large files
- Streaming data access
- Commodity hardware
HDFS does not target:
- Low-latency data access
- Lots of small files
- Multiple writers, arbitrary file modifications
12. HDFS
Network Topology and Hadoop
The network is represented as a tree, and the distance between two nodes is the sum of their distances to their closest common ancestor.
Distances are computed for the following cases:
- Processes on the same node
- Different nodes on the same rack
- Nodes on different racks in the same data center
- Nodes in different data centers
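The tree-distance rule can be illustrated with a short Python sketch. It assumes each location is written as a path of the form /datacenter/rack/node, for example "/d1/r1/n1"; that encoding and the function name `distance` are choices made for this illustration, not Hadoop's API.

```python
def distance(a: str, b: str) -> int:
    """Sum of the hops from each location up to their closest common ancestor."""
    pa = a.strip("/").split("/")
    pb = b.strip("/").split("/")
    # Count the leading path components the two locations share.
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)

print(distance("/d1/r1/n1", "/d1/r1/n1"))  # 0: processes on the same node
print(distance("/d1/r1/n1", "/d1/r1/n2"))  # 2: different nodes on the same rack
print(distance("/d1/r1/n1", "/d1/r2/n3"))  # 4: different racks, same data center
print(distance("/d1/r1/n1", "/d2/r3/n4"))  # 6: different data centers
```

Under this encoding the four cases listed above come out as 0, 2, 4, and 6, so a smaller number means the two locations are closer in the network.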
16. Setup
Required software:
- Java™ 1.6.x, preferably from Sun, must be installed.
- ssh
- Cygwin: required on Windows for shell support, in addition to the required software above.
18. References
- Tom White, Hadoop: The Definitive Guide
- Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters
- Jeff Dean, Experiences with MapReduce, an Abstraction for Large-Scale Computation