Knowledge Share
Hadoop
Yu Zhao
Platform

Hadoop: What is Hadoop
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
Hadoop includes:
- MapReduce
- HDFS
- Hadoop Common

Hadoop: Position
[Diagram: an APPLICATION layer on top of a HADOOP layer, which sits on four OS instances, each running on its own HOST.]

MapReduce
A simple programming model that applies to many large-scale computing problems.
The MapReduce runtime library hides the messy details:
- automatic parallelization
- load balancing
- network and disk transfer optimization
- handling of machine failures
- robustness

MapReduce: Typical problem solved by MapReduce
1. Read a lot of data
2. Map: extract something you care about from each record
3. Shuffle and sort
4. Reduce: aggregate, summarize, filter, or transform
5. Write the results

MapReduce
The outline stays the same; map and reduce change to fit the problem.
More specifically, the programmer specifies two primary methods:
  map:    (K1, V1) -> list(K2, V2)
  reduce: (K2, list(V2)) -> list(K3, V3)

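The fixed outline and the two pluggable methods can be sketched as a small single-process driver. This is only an illustration of the programming model, not how the Hadoop runtime is implemented; `run_mapreduce`, `max_mapper`, `max_reducer`, and the sample readings are hypothetical names and data introduced here.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Toy in-memory driver: map, then shuffle and sort, then reduce.

    records: iterable of (K1, V1) pairs
    mapper:  (K1, V1) -> iterable of (K2, V2)
    reducer: (K2, list of V2) -> iterable of (K3, V3)
    """
    # Map phase: collect every intermediate value under its key.
    intermediate = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            intermediate[k2].append(v2)

    # Shuffle and sort: visit the intermediate keys in sorted order.
    output = []
    for k2 in sorted(intermediate):
        # Reduce phase: one call per key, with all of that key's values.
        output.extend(reducer(k2, intermediate[k2]))
    return output


# Hypothetical problem: maximum reading per year across stations.
def max_mapper(station, readings):
    for year, temp in readings:
        yield year, temp


def max_reducer(year, temps):
    yield year, max(temps)


records = [
    ("s1", [(1949, 111), (1950, 22)]),
    ("s2", [(1949, 78), (1950, 0)]),
]
print(run_mapreduce(records, max_mapper, max_reducer))
# [(1949, 111), (1950, 22)]
```

Only `max_mapper` and `max_reducer` are specific to the problem; the driver stays the same for any other map and reduce pair.
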
MapReduce: Example - word count
Counting the number of occurrences of each word in a large collection of documents.

Input:
  Key:   "document1"
  Value: "to be or not to be"

MAP output:
  Key     Value
  "to"    "1"
  "be"    "1"
  "or"    "1"
  "not"   "1"
  "to"    "1"
  "be"    "1"

SHUFFLE & SORT output:
  Key     Value
  "be"    "1", "1"
  "not"   "1"
  "or"    "1"
  "to"    "1", "1"

REDUCE output:
  Key     Value
  "be"    "2"
  "not"   "1"
  "or"    "1"
  "to"    "2"

MapReduce: Example - word count
Pseudo-code

  Map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
      EmitIntermediate(w, "1");

  Reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
      result += ParseInt(v);
    Emit(AsString(result));

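For readers who want to run the word count, here is a minimal Python rendering of the pseudo-code. It keeps the string-valued counts of the original; the `word_count` helper is a hypothetical stand-in for the runtime, which in a real cluster performs the grouping across machines.

```python
from collections import defaultdict


def map_fn(key, value):
    """key: document name; value: document contents."""
    for word in value.split():
        yield word, "1"


def reduce_fn(key, values):
    """key: a word; values: a list of counts (as strings)."""
    result = 0
    for v in values:
        result += int(v)
    return str(result)


def word_count(documents):
    """Hypothetical single-process stand-in for the MapReduce runtime."""
    grouped = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_fn(name, contents):
            grouped[word].append(count)
    # Keys are reduced in sorted order, mirroring shuffle and sort.
    return {word: reduce_fn(word, grouped[word]) for word in sorted(grouped)}


counts = word_count({"document1": "to be or not to be"})
print(counts)  # {'be': '2', 'not': '1', 'or': '1', 'to': '2'}
```
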
MapReduce: Shuffle and Sort

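As a rough sketch of what shuffle and sort does between the two phases: each map output pair is routed to one reducer by its key, and each reducer sees its keys in sorted order with the values grouped. The function names below are hypothetical, and CRC32 is only an assumed stand-in for whatever key hash a real partitioner uses.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    # CRC32 is an assumed stand-in for a key hash; it is stable across
    # runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(map_outputs, num_reducers):
    """Route (key, value) pairs to reducers, then sort and group by key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in map_outputs:
        partitions[partition(key, num_reducers)][key].append(value)
    # Each reducer receives a list of (key, [values]) sorted by key.
    return [sorted(p.items()) for p in partitions]


map_outputs = [("to", "1"), ("be", "1"), ("or", "1"),
               ("not", "1"), ("to", "1"), ("be", "1")]
print(shuffle_and_sort(map_outputs, 1))
# [[('be', ['1', '1']), ('not', ['1']), ('or', ['1']), ('to', ['1', '1'])]]
```

With more than one reducer, every occurrence of a given key still lands in the same partition, which is what lets each reduce call see all the values for its key.
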
HDFS: Design
Designed for:
- Very large files
- Streaming data access
- Commodity hardware
Ignores:
- Low-latency data access
- Lots of small files
- Multiple writers, arbitrary file modifications

HDFS: Concepts
- Blocks
- Namenodes and Datanodes

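To make the two concepts concrete, the sketch below splits a file into fixed-size blocks and records, namenode-style, which datanodes hold each block. The 64 MB block size and replication factor of 3 are assumed here as the defaults of Hadoop releases from this era, and the round-robin placement is a deliberate simplification, not the actual HDFS placement policy. All function and node names are hypothetical.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # assumed default block size (64 MB)
REPLICATION = 3                # assumed default replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the length in bytes of each block of a file."""
    full, rest = divmod(file_size, block_size)
    # The last block only occupies as many bytes as remain.
    return [block_size] * full + ([rest] if rest else [])


def place_blocks(num_blocks, datanodes, replication=REPLICATION):
    """Toy namenode metadata: block index -> datanodes holding a replica.

    Round-robin placement is a simplification for illustration only.
    """
    return {
        b: [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
        for b in range(num_blocks)
    }


blocks = split_into_blocks(200 * 1024 * 1024)  # a 200 MB file
print(len(blocks))                             # 4 (three full blocks + 8 MB)
print(place_blocks(len(blocks), ["dn1", "dn2", "dn3", "dn4"]))
```

The namenode keeps only this kind of metadata; the block contents themselves live on the datanodes.
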
HDFS: Network Topology and Hadoop
The network is represented as a tree, and the distance between two nodes is the sum of their distances to their closest common ancestor.
Distances are computed for the following cases, from nearest to farthest:
- Processes on the same node
- Different nodes on the same rack
- Nodes on different racks in the same data center
- Nodes in different data centers

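The distance rule can be worked through with a short sketch. It assumes each node is named by a path of the form /datacenter/rack/node; the `distance` function and the d1/r1/n1 style names are illustrative, not a Hadoop API.

```python
def distance(a, b):
    """Tree distance between two nodes given as /datacenter/rack/node paths.

    The result is the number of hops from each node up to their closest
    common ancestor, added together.
    """
    pa = a.strip("/").split("/")
    pb = b.strip("/").split("/")
    # Count how many leading path components the two nodes share.
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)


print(distance("/d1/r1/n1", "/d1/r1/n1"))  # 0: processes on the same node
print(distance("/d1/r1/n1", "/d1/r1/n2"))  # 2: different nodes, same rack
print(distance("/d1/r1/n1", "/d1/r2/n3"))  # 4: different racks, same data center
print(distance("/d1/r1/n1", "/d2/r3/n4"))  # 6: different data centers
```
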
HDFS: Network Topology and Hadoop

HDFS: Data Flow - read

HDFS: Data Flow - write

Setup: Required Software
- Java™ 1.6.x, preferably from Sun, must be installed.
- ssh
- Cygwin: required on Windows for shell support, in addition to the required software above.

Setup: Configure
- Set up passphraseless ssh
- Configuration files:
    core-site.xml
    hdfs-site.xml
    mapred-site.xml
    masters
    slaves

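The three *-site.xml files share one layout: a configuration element holding name/value property pairs. The sketch below generates that layout. The property names are assumed from the single-node setup of Hadoop releases of this era (newer releases rename some of them), and the localhost host and port values are placeholders. `build_site_xml` and `SITE_FILES` are hypothetical names.

```python
import xml.etree.ElementTree as ET


def build_site_xml(properties):
    """Render a dict of settings in the <configuration>/<property> layout."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Assumed single-node values; hosts and ports are placeholders.
SITE_FILES = {
    "core-site.xml": {"fs.default.name": "hdfs://localhost:9000"},
    "hdfs-site.xml": {"dfs.replication": "1"},
    "mapred-site.xml": {"mapred.job.tracker": "localhost:9001"},
}

for filename, properties in SITE_FILES.items():
    print(filename)
    print(build_site_xml(properties))
```

The masters and slaves files are not XML; they are plain text with one hostname per line.
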
References
- Hadoop: The Definitive Guide, Tom White
- MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat
- Experiences with MapReduce, an Abstraction for Large-Scale Computation, Jeff Dean
