Hadoop

Knowledge Share: Hadoop. Yu Zhao, Platform.

Hadoop: What is Hadoop? The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. Hadoop includes MapReduce, HDFS, and Hadoop Common.

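Hadoop: Position. [Diagram: a layered stack with the APPLICATION on top, HADOOP beneath it, and at the bottom a row of four OS instances, each running on its own HOST.]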

MapReduce: A simple programming model that applies to many large-scale computing problems. The MapReduce runtime library hides the messy details: automatic parallelization, load balancing, network and disk transfer optimization, handling of machine failures, and robustness.

MapReduce: Typical problem solved by MapReduce. (1) Read a lot of data. (2) Map: extract something you care about from each record. (3) Shuffle and sort. (4) Reduce: aggregate, summarize, filter, or transform. (5) Write the results.
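The five steps above can be sketched in plain Python. This is a single-process illustration, not Hadoop code. The log lines, and the choice of the trailing status code as the "something you care about", are hypothetical.

```python
from collections import defaultdict

# Hypothetical input records (assumption: one access-log line per record).
LOG_LINES = [
    "GET /index.html 200",
    "GET /missing 404",
    "POST /login 200",
    "GET /other 404",
    "GET /about 200",
]


def run_pipeline(lines):
    # 1. Read a lot of data (here: a small in-memory list).
    records = list(lines)

    # 2. Map: extract the status code from each record and emit (code, 1).
    pairs = [(line.split()[-1], 1) for line in records]

    # 3. Shuffle and sort: group the values by key, then order the keys.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    sorted_groups = sorted(groups.items())

    # 4. Reduce: aggregate the values collected for each key.
    results = [(key, sum(values)) for key, values in sorted_groups]

    # 5. Write the results (here: return them to the caller).
    return results


if __name__ == "__main__":
    for status, count in run_pipeline(LOG_LINES):
        print(status, count)
```

In a real cluster each step runs on many machines at once. Only the map and reduce logic would be written by the programmer.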

MapReduce: The outline stays the same; map and reduce change to fit the problem. More specifically, the programmer specifies two primary methods: map: (K1, V1) -> list(K2, V2) and reduce: (K2, list(V2)) -> list(K3, V3).
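A minimal sketch of those two signatures as Python type hints, with one fixed driver reused for two different problems. The driver name `map_reduce` and both example problems are assumptions made for illustration. Everything runs in one process.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")
K3 = TypeVar("K3")
V3 = TypeVar("V3")


def map_reduce(
    records: Iterable[Tuple[K1, V1]],
    map_fn: Callable[[K1, V1], List[Tuple[K2, V2]]],
    reduce_fn: Callable[[K2, List[V2]], List[Tuple[K3, V3]]],
) -> List[Tuple[K3, V3]]:
    """The fixed outline: map every record, group by K2, reduce each group."""
    grouped: Dict[K2, List[V2]] = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            grouped[k2].append(v2)
    output: List[Tuple[K3, V3]] = []
    for k2 in sorted(grouped):  # keys are visited in sorted order
        output.extend(reduce_fn(k2, grouped[k2]))
    return output


# Hypothetical input: (document name, document contents).
DOCS = [("a.txt", "x yy zzz"), ("b.txt", "yy")]

# Problem 1: how many words are there of each length?
words_by_length = map_reduce(
    DOCS,
    lambda name, text: [(len(w), 1) for w in text.split()],
    lambda length, ones: [(length, sum(ones))],
)

# Problem 2: what is the longest word in each document?
longest_word = map_reduce(
    DOCS,
    lambda name, text: [(name, w) for w in text.split()],
    lambda name, words: [(name, max(words, key=len))],
)
```

Only the two lambdas differ between the problems. The driver is untouched.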

MapReduce: Example, word count. Counting the number of occurrences of each word in a large collection of documents.
Input (Key, Value): ("document1", "to be or not to be")
After MAP (Key, Value): ("to", "1"), ("be", "1"), ("or", "1"), ("not", "1"), ("to", "1"), ("be", "1")
After SHUFFLE & SORT (Key, Values): ("be", "1" "1"), ("not", "1"), ("or", "1"), ("to", "1" "1")
After REDUCE (Key, Value): ("be", "2"), ("not", "1"), ("or", "1"), ("to", "2")

MapReduce: Example, word count. Pseudo-code:

    Map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
            EmitIntermediate(w, "1");

    Reduce(String key, Iterator values):
        // key: a word
        // values: a list of counts
        int result = 0;
        for each v in values:
            result += ParseInt(v);
        Emit(AsString(result));
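The pseudo-code can be tried out as a runnable Python sketch. The helper `word_count` stands in for the runtime and is an assumption, not part of any Hadoop API. Unlike the pseudo-code, the reducer here emits the word together with its count so the output is readable on its own.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(key, value):
    # key: document name (unused), value: document contents
    for word in value.split():
        yield (word, "1")


def reduce_fn(key, values):
    # key: a word, values: an iterable of counts encoded as strings
    result = 0
    for v in values:
        result += int(v)
    yield (key, str(result))


def word_count(documents):
    # Map phase: run map_fn over every (name, contents) pair.
    intermediate = []
    for name, contents in documents.items():
        intermediate.extend(map_fn(name, contents))

    # Shuffle & sort: order by key so equal words become adjacent.
    intermediate.sort(key=itemgetter(0))

    # Reduce phase: call reduce_fn once per distinct word.
    output = []
    for word, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reduce_fn(word, (count for _, count in group)))
    return output


if __name__ == "__main__":
    for word, count in word_count({"document1": "to be or not to be"}):
        print(word, count)
```

For the input "to be or not to be" this yields be 2, not 1, or 1, to 2.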

HDFS: Design. HDFS is designed for very large files, streaming data access, and commodity hardware. It is not designed for low-latency data access, lots of small files, or multiple writers and arbitrary file modifications.

HDFS: Concepts. Blocks; namenodes and datanodes.

HDFS: Network topology and Hadoop. The network is represented as a tree, and the distance between two nodes is the sum of their distances to their closest common ancestor. Distances are computed for four cases: processes on the same node, different nodes on the same rack, nodes on different racks in the same data center, and nodes in different data centers.
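The distance rule can be checked with a short Python sketch. It assumes each location is written as a path of the form /datacenter/rack/node. The function name and the sample paths are illustrative, not Hadoop's own API.

```python
def distance(a: str, b: str) -> int:
    """Sum of each node's distance to the closest common ancestor."""
    path_a = a.strip("/").split("/")
    path_b = b.strip("/").split("/")

    # Count the leading path components the two locations share.
    common = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        common += 1

    # Hops from each node up to the closest common ancestor.
    return (len(path_a) - common) + (len(path_b) - common)


if __name__ == "__main__":
    print(distance("/d1/r1/n1", "/d1/r1/n1"))  # processes on the same node
    print(distance("/d1/r1/n1", "/d1/r1/n2"))  # different nodes, same rack
    print(distance("/d1/r1/n1", "/d1/r2/n3"))  # different racks, same data center
    print(distance("/d1/r1/n1", "/d2/r3/n4"))  # different data centers
```

With three-level paths the four cases come out as 0, 2, 4, and 6.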

Setup: Required software. Java 1.6.x, preferably from Sun, must be installed. ssh must be installed. On Windows, Cygwin is also required for shell support.

Setup: Configure. Set up passphraseless ssh. Configuration files: core-site.xml, hdfs-site.xml, mapred-site.xml, masters, slaves.
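As a rough illustration of what the three *-site.xml files contain, the Python sketch below renders a minimal single-node configuration. The property names are the ones used by Hadoop releases of the Java 1.6 era as far as I know, and the localhost values are assumptions. Check both against the documentation for the installed version.

```python
import xml.etree.ElementTree as ET

# Assumed single-node values; a real cluster needs its own hosts and ports.
SITE_FILES = {
    "core-site.xml": {"fs.default.name": "hdfs://localhost:9000"},
    "hdfs-site.xml": {"dfs.replication": "1"},
    "mapred-site.xml": {"mapred.job.tracker": "localhost:9001"},
}


def render_site_xml(properties):
    """Build <configuration><property><name/><value/></property>...</configuration>."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Print instead of writing files, so nothing on disk is overwritten.
    for filename, properties in SITE_FILES.items():
        print(filename)
        print(render_site_xml(properties))
```

The masters and slaves files are plain text rather than XML, so they are left out of this sketch.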

References
Hadoop: The Definitive Guide, Tom White
MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat
Experiences with MapReduce, an Abstraction for Large-Scale Computation, Jeff Dean
