Hadoop

Knowledge Share
Hadoop
Yu Zhao
Platform

Hadoop
What is Hadoop
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
Hadoop includes:
- MapReduce
- HDFS
- Hadoop Common

Hadoop
Position
[Diagram: APPLICATION runs on top of HADOOP, which sits above four hosts, each with its own OS.]

MapReduce
A simple programming model that applies to many large-scale computing problems.
The MapReduce runtime library hides the messy details:
- automatic parallelization
- load balancing
- network and disk transfer optimization
- handling of machine failures
- robustness

MapReduce
Typical problem solved by MapReduce
1. Read a lot of data
2. Map: extract something you care about from each record
3. Shuffle and sort
4. Reduce: aggregate, summarize, filter, or transform
5. Write the results

MapReduce
The outline stays the same; map and reduce change to fit the problem.
More specifically, the programmer specifies two primary methods:
  map: (K1, V1) -> list(K2, V2)
  reduce: (K2, list(V2)) -> list(K3, V3)
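The fixed outline with pluggable map and reduce functions can be sketched in a few lines of Python. This is a single-process illustration only, not Hadoop's API: `run_mapreduce`, `max_map`, and `max_reduce` are hypothetical names, and the "name number" record format is an assumption made up for the example.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, shuffle and sort, then reduce, all in memory.

    records:   iterable of (K1, V1) pairs
    map_fn:    (K1, V1) -> iterable of (K2, V2) pairs
    reduce_fn: (K2, list of V2) -> iterable of (K3, V3) pairs
    """
    # Map phase: apply map_fn to every input record.
    grouped = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            # Shuffle: collect all values that share an intermediate key.
            grouped[k2].append(v2)

    # Sort: visit intermediate keys in sorted order, then reduce each group.
    output = []
    for k2 in sorted(grouped):
        output.extend(reduce_fn(k2, grouped[k2]))
    return output


def max_map(key, value):
    # Hypothetical record format: value is "name number".
    name, number = value.split()
    yield name, int(number)


def max_reduce(key, values):
    # Keep only the largest value seen for each key.
    yield key, max(values)


if __name__ == "__main__":
    records = [("file1", "a 3"), ("file2", "b 5"), ("file3", "a 7")]
    print(run_mapreduce(records, max_map, max_reduce))
```

Swapping in a different pair of functions changes the problem being solved without touching `run_mapreduce`, which is the point the slide makes. A real runtime would also distribute the work and handle failures, which this sketch does not attempt.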

MapReduce
Example: word count
Counting the number of occurrences of each word in a large collection of documents.

Input
  Key: "document1"
  Value: "to be or not to be"

After MAP
  Key     Value
  "to"    "1"
  "be"    "1"
  "or"    "1"
  "not"   "1"
  "to"    "1"
  "be"    "1"

After SHUFFLE & SORT
  Key     Value
  "be"    "1", "1"
  "not"   "1"
  "or"    "1"
  "to"    "1", "1"

After REDUCE
  Key     Value
  "be"    "2"
  "not"   "1"
  "or"    "1"
  "to"    "2"

MapReduce
Example: word count
Pseudo-code

  Map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
      EmitIntermediate(w, "1");

  Reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
      result += ParseInt(v);
    Emit(AsString(result));
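For readers who want to run the word count, here is a minimal Python sketch of the same pseudo-code. It runs in one process and stands in for the shuffle and sort step with a dictionary; the function names are hypothetical and none of this is Hadoop's API. Unlike the pseudo-code, the reduce step here emits the word together with its count so the result is self-describing.

```python
from collections import defaultdict


def map_word_count(key, value):
    # key: document name (unused); value: document contents
    for word in value.split():
        yield word, "1"


def reduce_word_count(key, values):
    # key: a word; values: a list of counts, kept as strings as in the pseudo-code
    result = 0
    for v in values:
        result += int(v)
    yield key, str(result)


def word_count(documents):
    """documents: dict mapping document name to document contents."""
    # Map, then group intermediate values by word (the shuffle step).
    grouped = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_word_count(name, contents):
            grouped[word].append(count)

    # Reduce each word's list of counts, visiting words in sorted order.
    counts = {}
    for word in sorted(grouped):
        for out_key, total in reduce_word_count(word, grouped[word]):
            counts[out_key] = total
    return counts


if __name__ == "__main__":
    print(word_count({"document1": "to be or not to be"}))
```

Splitting on whitespace is a simplification; punctuation and letter case would need extra handling on real documents.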

HDFS
Design
- Very large files
- Streaming data access
- Commodity hardware
Ignores
- Low-latency data access
- Lots of small files
- Multiple writers, arbitrary file modifications

HDFS
Concepts
- Blocks
- Namenodes and Datanodes

HDFS
Network Topology and Hadoop
The network is represented as a tree, and the distance between two nodes is the sum of their distances to their closest common ancestor.
Distance is computed for these cases:
- Processes on the same node
- Different nodes on the same rack
- Nodes on different racks in the same data center
- Nodes in different data centers
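The tree distance rule can be illustrated with a short Python sketch. It assumes each location is written as a path of the form /data-center/rack/node, such as "/d1/r1/n1"; that notation and the `distance` helper are assumptions for this example, not Hadoop's own code.

```python
def distance(a: str, b: str) -> int:
    """Sum of each location's distance to the closest common ancestor."""
    path_a = a.strip("/").split("/")
    path_b = b.strip("/").split("/")

    # Count how many leading path components the two locations share.
    common = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        common += 1

    # Each location is (depth - common) hops below the common ancestor.
    return (len(path_a) - common) + (len(path_b) - common)


if __name__ == "__main__":
    print(distance("/d1/r1/n1", "/d1/r1/n1"))  # processes on the same node
    print(distance("/d1/r1/n1", "/d1/r1/n2"))  # different nodes on the same rack
    print(distance("/d1/r1/n1", "/d1/r2/n3"))  # different racks, same data center
    print(distance("/d1/r1/n1", "/d2/r3/n4"))  # different data centers
```

Under this three-level layout the four cases evaluate to 0, 2, 4, and 6, so locations that are closer in the tree get smaller distances.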

HDFS
Network Topology and Hadoop
[Figure: network topology diagram]

Setup
Required Software
- Java™ 1.6.x, preferably from Sun, must be installed.
- ssh
- Cygwin: required on Windows for shell support, in addition to the required software above.

Setup
Configure
- Set up passphraseless ssh
- Configuration files:
  core-site.xml
  hdfs-site.xml
  mapred-site.xml
  masters
  slaves

