1. Hadoop Overview & Architecture
Milind Bhandarkar
Chief Scientist, Machine Learning Platforms, Greenplum, A Division of EMC
(Twitter: @techmilind)
2. About Me
• http://www.linkedin.com/in/milindb
• Founding member of Hadoop team at Yahoo! [2005-2010]
• Contributor to Apache Hadoop since v0.1
• Built and led Grid Solutions Team at Yahoo! [2007-2010]
• Parallel Programming Paradigms [1989-today] (PhD, cs.illinois.edu)
• Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems, Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn, and EMC-Greenplum

3. Agenda
• Motivation
• Hadoop
  • Map-Reduce
  • Distributed File System
  • Hadoop Architecture
  • Next Generation MapReduce
• Q & A

4. Hadoop At Scale (Some Statistics)
• 40,000+ machines in 20+ clusters
• Largest cluster is 4,000 machines
• 170 Petabytes of storage
• 1000+ users
• 1,000,000+ jobs/month
5. BEHIND EVERY CLICK

6. Hadoop Workflow

7. Who Uses Hadoop?

8. Why Hadoop?

9. Big Datasets (Data-Rich Computing theme proposal, J. Campbell, et al, 2007)

10. Cost Per Gigabyte (http://www.mkomo.com/cost-per-gigabyte)

11. Storage Trends (Graph by Adam Leventhal, ACM Queue, Dec 2009)

12. Motivating Examples

13. Yahoo! Search Assist
14. Search Assist
• Insight: Related concepts appear close together in a text corpus
• Input: Web pages
  • 1 Billion pages, 10K bytes each
  • 10 TB of input data
• Output: List(word, List(related words))

15. Search Assist
// Input: List(URL, Text)
foreach URL in Input :
    Words = Tokenize(Text(URL));
    foreach word in Tokens :
        Insert (word, Next(word, Tokens)) in Pairs;
        Insert (word, Previous(word, Tokens)) in Pairs;
// Result: Pairs = List (word, RelatedWord)

Group Pairs by word;
// Result: List (word, List(RelatedWords))

foreach word in Pairs :
    Count RelatedWords in GroupedPairs;
// Result: List (word, List(RelatedWords, count))

foreach word in CountedPairs :
    Sort Pairs(word, *) descending by count;
    choose Top 5 Pairs;
// Result: List (word, Top5(RelatedWords))
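As an aside, not part of the original deck: the pair-emission step above maps naturally onto a Hadoop mapper. Below is a minimal, hypothetical sketch using the classic org.apache.hadoop.mapred API that the later slides use; the class name PairMapper and the whitespace tokenization are illustrative assumptions.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical mapper: for every token, emit (word, previous word) and (word, next word).
public class PairMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output,
                  Reporter reporter) throws IOException {
    String[] tokens = value.toString().trim().split("\\s+");
    for (int i = 0; i < tokens.length; i++) {
      if (i > 0) {
        output.collect(new Text(tokens[i]), new Text(tokens[i - 1]));
      }
      if (i + 1 < tokens.length) {
        output.collect(new Text(tokens[i]), new Text(tokens[i + 1]));
      }
    }
  }
}

Grouping the emitted pairs by word is exactly what the framework's shuffle provides; the counting and top-5 selection would live in the reducer.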
16. People You May Know

17. People You May Know
• Insight: You might also know Joe Smith if a lot of folks you know know Joe Smith
  • if you don't know Joe Smith already
• Numbers:
  • 100 MM users
  • Average connections per user is 100

18. People You May Know
// Input: List(UserName, List(Connections))
foreach u in UserList : // 100 MM
    foreach x in Connections(u) : // 100
        foreach y in Connections(x) : // 100
            if (y not in Connections(u)) :
                Count(u, y)++; // 1 Trillion Iterations
    Sort (u,y) in descending order of Count(u,y);
    Choose Top 3 y;
    Store (u, {y0, y1, y2}) for serving;

19. Performance
• 101 Random accesses for each user
  • Assume 1 ms per random access
  • 100 ms per user
• 100 MM users
  • 100 days on a single machine
20. MapReduce Paradigm

21. Map & Reduce
• Primitives in Lisp (& other functional languages), 1970s
• Google Paper, 2004
  • http://labs.google.com/papers/mapreduce.html

22. Map
Output_List = Map (Input_List)

Square (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) =
    (1, 4, 9, 16, 25, 36, 49, 64, 81, 100)

23. Reduce
Output_Element = Reduce (Input_List)

Sum (1, 4, 9, 16, 25, 36, 49, 64, 81, 100) = 385
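As a side illustration, not in the deck: the same Square and Sum primitives can be written directly with Java streams; a minimal sketch:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MapReducePrimitives {
  public static void main(String[] args) {
    List<Integer> input = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);

    // Map: apply Square to each element independently.
    List<Integer> squares =
        input.stream().map(x -> x * x).collect(Collectors.toList());

    // Reduce: fold the mapped list into a single element with Sum.
    int sum = squares.stream().reduce(0, Integer::sum);

    System.out.println(squares);  // [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
    System.out.println(sum);      // 385
  }
}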
24. Parallelism
• Map is inherently parallel
  • Each list element processed independently
• Reduce is inherently sequential
  • Unless processing multiple lists
• Grouping to produce multiple lists

25. Search Assist Map
// Input: http://hadoop.apache.org
Pairs = Tokenize_And_Pair ( Text ( Input ) )
// Example

26. Search Assist Reduce
// Input: GroupedList (word, GroupedList(words))
CountedPairs = CountOccurrences (word, RelatedWords)

Output = {
  (hadoop, apache, 7)
  (hadoop, DFS, 3)
  (hadoop, streaming, 4)
  (hadoop, mapreduce, 9)
  ...
}

27. Issues with Large Data
• Map Parallelism: Chunking input data
• Reduce Parallelism: Grouping related data
• Dealing with failures & load imbalance

28. Apache Hadoop
• January 2006: Subproject of Lucene
• January 2008: Top-level Apache project
• Stable Version: 1.0.3
• Latest Version: 2.0.0 (Alpha)

29. Apache Hadoop
• Reliable, performant distributed file system
• MapReduce programming framework
• Ecosystem: HBase, Hive, Pig, Howl, Oozie, Zookeeper, Chukwa, Mahout, Cascading, Scribe, Cassandra, Hypertable, Voldemort, Azkaban, Sqoop, Flume, Avro ...
30. Problem: Bandwidth to Data
• Scan 100 TB datasets on a 1000-node cluster (roughly 100 GB per node)
  • Remote storage @ 10 MB/s = 165 mins
  • Local storage @ 50-200 MB/s = 33-8 mins
• Moving computation is more efficient than moving data
  • Need visibility into data placement
31. Problem: Scaling Reliably
• Failure is not an option, it's a rule!
  • 1000 nodes, MTBF < 1 day
  • 4000 disks, 8000 cores, 25 switches, 1000 NICs, 2000 DIMMs (16 TB RAM)
• Need a fault-tolerant store with reasonable availability guarantees
  • Handle hardware faults transparently
32. Hadoop Goals
• Scalable: Petabytes (10^15 bytes) of data on thousands of nodes
• Economical: Commodity components only
• Reliable
  • Engineering reliability into every application is expensive
33. Hadoop MapReduce

34. Think MapReduce
• Record = (Key, Value)
• Key: Comparable, Serializable
• Value: Serializable
• Input, Map, Shuffle, Reduce, Output
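In Hadoop's Java API, "Comparable, Serializable" for keys concretely means implementing WritableComparable (values only need Writable). A minimal sketch of a custom key type follows; it is illustrative, not from the deck, and the name UserIdKey is assumed.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// A custom key serializes itself (write/readFields) and defines a sort order (compareTo).
public class UserIdKey implements WritableComparable<UserIdKey> {
  private long userId;

  public void write(DataOutput out) throws IOException { out.writeLong(userId); }
  public void readFields(DataInput in) throws IOException { userId = in.readLong(); }
  public int compareTo(UserIdKey other) { return Long.compare(userId, other.userId); }

  // hashCode() drives the default hash partitioner; equals() keeps the contract consistent.
  @Override public int hashCode() { return (int) (userId ^ (userId >>> 32)); }
  @Override public boolean equals(Object o) {
    return (o instanceof UserIdKey) && ((UserIdKey) o).userId == userId;
  }
}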
35. Seems Familiar?
cat /var/log/auth.log* |
    grep "session opened" | cut -d' ' -f10 |
    sort |
    uniq -c > ~/userlist
36. Map
• Input: (Key1, Value1)
• Output: List(Key2, Value2)
• Projections, Filtering, Transformation

37. Shuffle
• Input: List(Key2, Value2)
• Output: Sort(Partition(List(Key2, List(Value2))))
• Provided by Hadoop

38. Reduce
• Input: List(Key2, List(Value2))
• Output: List(Key3, Value3)
• Aggregation
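The Partition step of the shuffle is pluggable. The deck does not show one, but a minimal, hypothetical partitioner for the old org.apache.hadoop.mapred API (assuming Text keys and IntWritable values) could look like this:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Routes each (Key2, Value2) pair to a reduce partition; all values for a given key
// land in the same partition, where they are sorted and grouped before reduce().
public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {

  public void configure(JobConf job) { /* no configuration needed */ }

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String s = key.toString();
    int c = s.isEmpty() ? 0 : Character.toLowerCase(s.charAt(0));
    return c % numPartitions;  // c is non-negative because char is unsigned
  }
}

The driver would register it with conf.setPartitionerClass(FirstLetterPartitioner.class); the sort and grouping within each partition are still supplied by Hadoop.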
39. Hadoop Streaming
• Hadoop is written in Java
  • Java MapReduce code is native
• What about non-Java programmers?
  • Perl, Python, Shell, R
  • grep, sed, awk, uniq as Mappers/Reducers
• Text Input and Output

40. Hadoop Streaming
• Thin Java wrapper for Map & Reduce Tasks
• Forks actual Mapper & Reducer
• IPC via stdin, stdout, stderr
• Key.toString() \t Value.toString() \n
• Slower than Java programs
• Allows for quick prototyping / debugging

41. Hadoop Streaming
$ bin/hadoop jar hadoop-streaming.jar \
      -input in-files -output out-dir \
      -mapper mapper.sh -reducer reducer.sh

# mapper.sh
sed -e 's/ /\n/g' | grep .

# reducer.sh
uniq -c | awk '{print $2"\t"$1}'

42. Hadoop Distributed File System (HDFS)

43. HDFS
• Data is organized into files and directories
• Files are divided into uniform-sized blocks (default 128 MB) and distributed across cluster nodes
• HDFS exposes block placement so that computation can be migrated to data

44. HDFS
• Blocks are replicated (default 3) to handle hardware failure
• Replication for performance and fault tolerance (rack-aware placement)
• HDFS keeps checksums of data for corruption detection and recovery
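For reference, not in the original deck: a minimal sketch of a client setting block size and replication explicitly through the HDFS FileSystem API; the path and the specific values are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);

    // Per-file settings: overwrite, 4 KB buffer, 3 replicas, 128 MB blocks.
    Path path = new Path("/tmp/example.txt");
    FSDataOutputStream out =
        fs.create(path, true, 4096, (short) 3, 128L * 1024 * 1024);
    out.writeUTF("hello hdfs");
    out.close();

    System.out.println("replication = " + fs.getFileStatus(path).getReplication());
  }
}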
45. HDFS
• Master-Worker Architecture
• Single NameNode
• Many (thousands of) DataNodes

46. HDFS Master (NameNode)
• Manages filesystem namespace
• File metadata (i.e. "inode")
• Mapping inode to list of blocks + locations
• Authorization & Authentication
• Checkpoint & journal namespace changes

47. Namenode
• Mapping of datanode to list of blocks
• Monitor datanode health
• Replicate missing blocks
• Keeps ALL namespace in memory
• 60M objects (File/Block) in 16 GB

48. Datanodes
• Handle block storage on multiple volumes & block integrity
• Clients access the blocks directly from datanodes
• Periodically send heartbeats and block reports to Namenode
• Blocks are stored as underlying OS's files

49. HDFS Architecture

50. Example: Unigrams
• Input: Huge text corpus
  • Wikipedia Articles (40 GB uncompressed)
• Output: List of words sorted in descending order of frequency
51. Unigrams
$ cat ~/wikipedia.txt | \
      sed -e 's/ /\n/g' | grep . | \
      sort | \
      uniq -c > ~/frequencies.txt

$ cat ~/frequencies.txt | \
      sort -n -k1,1 -r > ~/unigrams.txt
52. MR for Unigrams
mapper (filename, file-contents):
    for each word in file-contents:
        emit (word, 1)

reducer (word, values):
    sum = 0
    for each value in values:
        sum = sum + value
    emit (word, sum)

53. MR for Unigrams
mapper (word, frequency):
    emit (frequency, word)

reducer (frequency, words):
    for each word in words:
        emit (word, frequency)
54. Unigrams: Java Mapper
public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      Text word = new Text(itr.nextToken());
      output.collect(word, new IntWritable(1));
    }
  }
}

55. Unigrams: Java Reducer
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
56. Unigrams: Driver
public void run(String inputPath, String outputPath) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");
  // Declare the (Text, IntWritable) output types so the framework
  // does not fall back to its defaults.
  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);
  conf.setMapperClass(MapClass.class);
  conf.setReducerClass(Reduce.class);
  FileInputFormat.addInputPath(conf, new Path(inputPath));
  FileOutputFormat.setOutputPath(conf, new Path(outputPath));
  JobClient.runJob(conf);
}
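The deck shows Java only for the first pass. For completeness, here is a hypothetical sketch of the second-pass mapper from slide 53, which inverts (word, count) into (count, word) so the shuffle sorts by frequency; it assumes the first pass wrote its output as a SequenceFile of (Text, IntWritable), and the class name is made up.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical second-pass mapper for the unigram example.
public class InvertMapper extends MapReduceBase
    implements Mapper<Text, IntWritable, IntWritable, Text> {

  public void map(Text word, IntWritable count,
                  OutputCollector<IntWritable, Text> output,
                  Reporter reporter) throws IOException {
    output.collect(count, word);  // the shuffle now sorts records by frequency
  }
}

Keys sort in ascending order by default; a descending comparator, registered via conf.setOutputKeyComparatorClass(...), would put the most frequent words first.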
57. Configuration
• Unified mechanism for
  • Configuring Daemons
  • Runtime environment for Jobs/Tasks
• Defaults: *-default.xml
• Site-Specific: *-site.xml
• final parameters

58. Example
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>head.server.node.com:9001</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://head.server.node.com:9000</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
    <final>true</final>
  </property>
  ....
</configuration>
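As a side note, not in the deck: the same keys can be read programmatically through the Configuration mechanism; a minimal sketch, assuming the *-site.xml files are on the classpath:

import org.apache.hadoop.mapred.JobConf;

public class ShowConfig {
  public static void main(String[] args) {
    // JobConf layers the *-default.xml and *-site.xml resources found on the classpath.
    JobConf conf = new JobConf(ShowConfig.class);

    System.out.println("jobtracker = " + conf.get("mapred.job.tracker", "local"));
    System.out.println("filesystem = " + conf.get("fs.default.name", "file:///"));
    System.out.println("child opts = " + conf.get("mapred.child.java.opts"));
  }
}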
59. Running Hadoop Jobs

60. Running a Job
[milindb@gateway ~]$ hadoop jar \
    $HADOOP_HOME/hadoop-examples.jar wordcount \
    /data/newsarchive/20080923 /tmp/newsout
input.FileInputFormat: Total input paths to process : 4
mapred.JobClient: Running job: job_200904270516_5709
mapred.JobClient:  map 0% reduce 0%
mapred.JobClient:  map 3% reduce 0%
mapred.JobClient:  map 7% reduce 0%
....
mapred.JobClient:  map 100% reduce 21%
mapred.JobClient:  map 100% reduce 31%
mapred.JobClient:  map 100% reduce 33%
mapred.JobClient:  map 100% reduce 66%
mapred.JobClient:  map 100% reduce 100%
mapred.JobClient: Job complete: job_200904270516_5709

61. Running a Job
mapred.JobClient: Counters: 18
mapred.JobClient:   Job Counters
mapred.JobClient:     Launched reduce tasks=1
mapred.JobClient:     Rack-local map tasks=10
mapred.JobClient:     Launched map tasks=25
mapred.JobClient:     Data-local map tasks=1
mapred.JobClient:   FileSystemCounters
mapred.JobClient:     FILE_BYTES_READ=491145085
mapred.JobClient:     HDFS_BYTES_READ=3068106537
mapred.JobClient:     FILE_BYTES_WRITTEN=724733409
mapred.JobClient:     HDFS_BYTES_WRITTEN=377464307

62. Running a Job
mapred.JobClient:   Map-Reduce Framework
mapred.JobClient:     Combine output records=73828180
mapred.JobClient:     Map input records=36079096
mapred.JobClient:     Reduce shuffle bytes=233587524
mapred.JobClient:     Spilled Records=78177976
mapred.JobClient:     Map output bytes=4278663275
mapred.JobClient:     Combine input records=371084796
mapred.JobClient:     Map output records=313041519
mapred.JobClient:     Reduce input records=15784903
63. JobTracker WebUI

64. JobTracker Status

65. Jobs Status

66. Job Details

67. Job Counters

68. Job Progress

69. All Tasks

70. Task Details

71. Task Counters

72. Task Logs

73. MapReduce Dataflow

74. MapReduce

75. Job Submission

76. Initialization

77. Scheduling

78. Execution

79. Map Task

80. Sort Buffer

81. Reduce Task
82. Next Generation MapReduce

83. MapReduce Today (Courtesy: Arun Murthy, Hortonworks)

84. Why?
• Scalability limitations today
  • Maximum cluster size: 4,000 nodes
  • Maximum concurrent tasks: 40,000
• JobTracker SPOF
• Fixed map and reduce containers (slots)
  • Punishes pleasantly parallel apps

85. Why? (contd)
• MapReduce is not suitable for every application
• Fine-grained iterative applications
  • HaLoop: Hadoop in a Loop
• Message passing applications
  • Graph Processing
86. Requirements
• Need a scalable cluster resource manager
• Separate scheduling from resource management
• Multi-lingual communication protocols
87. Bottom Line
• @techmilind: #mrng (MapReduce, Next Gen) is, in reality, #rmng (Resource Manager, Next Gen)
• Expect different programming paradigms to be implemented
  • Including MPI (soon)

88. Architecture (Courtesy: Arun Murthy, Hortonworks)

89. The New World
• Resource Manager
  • Allocates resources (containers) to applications
• Node Manager
  • Manages containers on nodes
• Application Master
  • Specific to paradigm, e.g. MapReduce application master, MPI application master, etc.

90. Container
• In current terminology: a Task Slot
• Slice of the node's hardware resources
  • # of cores, virtual memory, disk size, disk and network bandwidth, etc.
• Currently, only memory usage is sliced
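To ground the Resource Manager / Node Manager / Application Master / Container vocabulary, here is a minimal, hypothetical client sketch against the YARN client API as it later stabilized (Hadoop 2.2+, so details may differ from the 2.0.0-alpha release the deck mentions); the application name, command, and resource sizes are illustrative.

import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitToYarn {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // Ask the Resource Manager for a new application id.
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    ctx.setApplicationName("demo-app");

    // The Application Master itself runs in a container: describe its command and size.
    ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(Collections.singletonList("/bin/date"));
    ctx.setAMContainerSpec(amContainer);
    ctx.setResource(Resource.newInstance(512, 1));  // 512 MB, 1 virtual core

    // The Resource Manager schedules the AM container on some Node Manager.
    yarnClient.submitApplication(ctx);
    yarnClient.stop();
  }
}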