SlideShare a Scribd company logo
1 of 46
An Introduction to  MapReduce  Francisco P érez-Sorrosal Distributed Systems Lab (DSL/LSD) Universidad Polit é cnica de Madrid 10/Apr/2008
Outline ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],An Introduction to MapReduce
Motivation ,[object Object],[object Object],[object Object],[object Object],An Introduction to MapReduce
Motivation (II) ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],An Introduction to MapReduce
What is MapReduce? ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],An Introduction to MapReduce
Simple Example An Introduction to MapReduce Input data Mapped data on Node 1 Mapped data on Node 2 Result
What is MapReduce ’ s Main Goal? An Introduction to MapReduce Simplify the parallelization and  distribution of large-scale  computations in clusters
MapReduce Main Features ,[object Object],[object Object],[object Object],[object Object],An Introduction to MapReduce
What does MapReduce solves? ,[object Object],[object Object],[object Object],[object Object],An Introduction to MapReduce
What does MapReduce solves? ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],An Introduction to MapReduce
Programming Model ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],An Introduction to MapReduce
Programming Model: Example ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],An Introduction to MapReduce
Framework Overview An Introduction to MapReduce
Framework Overview An Introduction to MapReduce
Example: Count # of Each Letter in a Big File An Introduction to MapReduce Big File 640MB Master 1) Split File into 10 pieces of 64MB R = 4 output files (Set by the user)‏ a t b o m a p r r e d u c e g o o o g l e a p i m a c a c a b r a a r r o z f e i j a o t o m a t e c r u i m e s s o l Worker Idle Worker Idle Worker Idle (There are 26 different keys letters in the range [a..z]) Worker Idle Worker Idle Worker Idle Worker Idle Worker Idle 1 2 3 4 5 6 7 8 9 10
Example: Count # of Each Letter in a Big File An Introduction to MapReduce Big File 640MB Master 2) Assign map and reduce tasks a t b o m a p r r e d u c e g o o o g l e a p i m a c a c a b r a a r r o z f e i j a o t o m a t e c r u i m e s s o l 1 2 3 4 5 6 7 8 9 10 Worker Idle Worker Idle Worker Idle Worker Idle Worker Idle Worker Idle Worker Idle Worker Idle Mappers Reducers
Example: Count # of Each Letter in a Big File An Introduction to MapReduce Big File 640MB Master 3) Read the split data a t b o m a p r r e d u c e g o o o g l e a p i m a c a c a b r a a r r o z f e i j a o t o m a t e c r u i m e s s o l 1 2 3 4 Map T. In progress Map T. In progress Map T. In progress Reduce T. Idle Reduce T. Idle Reduce T. Idle Map T. In progress Reduce T. Idle
Example: Count # of Each Letter in a Big File An Introduction to MapReduce a   b  c d e f g  h i  j k l  m  n  n  o   p  q  r  s  t  v w x  y  z  Machine 1 Big File 640MB 4) Process data (in memory) a y b o m a p r r e d u c e g o o o g l e a p i m a c a c a b r a a r r o z f e i j a o t o m a t e c r u i m e s s o l R1 Partition Function (used to map the letters in regions): R2 R3 R4 Simulating the execution in memory Map T.1 In-Progress R1 R2 R3 R4 (a,1) (b,1) (a,1) (m1) (o,1) (p,1) (r, 1) (y,1)
Example: Count # of Each Letter in a Big File An Introduction to MapReduce Machine 1 Big File 640MB Master 5) Apply combiner function a t b o m a p r r e d u c e g o o o g l e a p i m a c a c a b r a a r r o z f e i j a o t o m a t e c r u i m e s s o l Simulating the execution in memory R1 R2 R3 R4 Map T.1 In-Progress (a,1) (b,1) (a,1) (m1) (o,1) (p,1) (r, 1) (y,1) (a,2) (b,1)  (m1) (o,1) (p,1) (r, 1) (y,1)
Example: Count # of Each Letter in a Big File An Introduction to MapReduce Machine 1 Big File 640MB Master 6) Store results on disk a t b o m a p r r e d u c e g o o o g l e a p i m a c a c a b r a a r r o z f e i j a o t o m a t e c r u i m e s s o l Memory R1 R2 R3 R4 Disk Map T.1 In-Progress (a,2) (b,1)  (m1) (o,1) (p,1) (r, 1) (y,1)
Example: Count # of Each Letter in a Big File An Introduction to MapReduce Big File 640MB Master 7) Inform the master about the position of the intermediate results in local disk  a t b o m a p r r e d u c e g o o o g l e a p i m a c a c a b r a a r r o z f e i j a o t o m a t e c r u i m e s s o l Machine 1 R1 R2 R3 R4 MT1 Results Location MT1 Results Map T.1 In-Progress (a,2) (b,1)  (m1) (o,1) (p,1) (r, 1) (y,1)
Example: Count # of Each Letter in a Big File An Introduction to MapReduce Big File 640MB Master 8) The Master assigns the next task (Map Task 5) to the Worker recently free a t b o m a p r r e d u c e g o o o g l e a p i m a c a c a b r a a r r o z f e i j a o t o m a t e c r u i m e s s o l Machine 1 R1 R2 R3 R4 T1 Results Worker In-Progress Data for  Map Task 5 (a,2) (b,1)  (m1) (o,1) (p,1) (r, 1) (y,1) Task 5
Example: Count # of Each Letter in a Big File An Introduction to MapReduce Master 9) The Master forwards the location of the intermediate results of Map Task 1 to reducers Machine 1 R1 R2 R3 R4 MT1 Results MT1 Results Location (R1) MT1 Results Location (Rx) Big File 640MB a t b o m a p r r e d u c e g o o o g l e a p i m a c a c a b r a a r r o z f e i j a o t o m a t e c r u i m e s s o l ... Map T.5 In-Progress Reduce T.1 Idle (a,2) (b,1)  (m1) (o,1) (p,1) (r, 1) (y,1)
Example: Count # of Each Letter in a Big File An Introduction to MapReduce Big File 640MB a  t  b  o m  a  p r r  e d  u c e  g o o o  g  l e a  p i m  a c a c a b  r a a  r r o z  f e i j  a  o t o m  a t  e c  r u i m  e s s o l R1 a b c d e f g Letters in  Region 1 : Reduce T.1 Idle (a, 2) (b,1) (e, 1) (d, 1) (c, 1) (e, 1) (g, 1) (e, 1) (a, 3) (c, 1) (c, 1) (a, 1) (b,1) (a, 2) (f, 1) (e, 1) (a, 2) (e, 1)(c, 1) (e, 1)
Example: Count # of Each Letter in a Big File An Introduction to MapReduce Machine N (a, 2) (b,1) (e, 1) (d, 1) (c, 1) (e, 1) (g, 1) (e, 1) (a, 3) (c, 1) (c, 1) (a, 1) (b,1) (a, 2) (f, 1) (e, 1) (a, 2) (e, 1)(c, 1) (e, 1) Data read from  each Map Task stored in region 1 10) The RT 1 reads the data in R=1 from each MT Reduce T.1 In-Progress
Example: Count # of Each Letter in a Big File An Introduction to MapReduce Machine N (a, 2) (a, 3) (a, 1) (a, 2) (a, 2) (b,1) (b,1) (c, 1) (c, 1) (c, 1) (c, 1) (d, 1) (e, 1) (e, 1) (e, 1)  (e, 1) (e, 1) (e, 1) (f, 1) (g, 1) 11) The reduce task 1 sorts the data Reduce T.1 In-Progress
Example: Count # of Each Letter in a Big File An Introduction to MapReduce Machine N (a, 2) (a, 3) (a, 1) (a, 2) (a, 2) (b,1) (b,1) (c, 1) (c, 1) (c, 1) (c, 1) (d, 1) (e, 1) (e, 1) (e, 1)  (e, 1) (e, 1) (e, 1) (f, 1) (g, 1) 12) Then it passes the key and the corresponding set of intermediate data to the user's reduce function (a, {2,3,1,2,2}) (b, {1,1}) (c, {1,1,1,1}) (d,{1}) (e, {1,1,1,1,1,1}) (f, {1}) (g, {1}) Reduce T.1 In-Progress
Example: Count # of Each Letter in a Big File An Introduction to MapReduce Machine N 12)   Finally, generates the output file 1 of R, after executing the user's reduce   (a, {2,3,1,2,2}) (b, {1,1}) (c, {1,1,1,1}) (d,{1}) (e, {1,1,1,1,1,1}) (f, {1}) (g, {1}) (a, 10) (b, 2) (c, 4) (d, 1) (e, 6) (f, 1) (g, 1) Reduce T.1 In-Progress
Other Features: Failures ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],An Introduction to MapReduce
Other Features: Locality ,[object Object],[object Object],[object Object],[object Object],[object Object],An Introduction to MapReduce
Other Features: Backup Tasks ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],An Introduction to MapReduce
Performance ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],An Introduction to MapReduce
Performance: Distributed Grep Program ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],An Introduction to MapReduce
Performance: Grep ,[object Object],[object Object],[object Object],[object Object],[object Object],An Introduction to MapReduce 1764 Workers Maps are starting to finish Scan Rate
Hadoop: A MapReduce Implementation ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],An Introduction to MapReduce ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Hadoop: A MapReduce Implementation ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],An Introduction to MapReduce
Hadoop: HDFS Console An Introduction to MapReduce
Hadoop: JobTracker Console An Introduction to MapReduce
Hadoop: Word Count Example ,[object Object],[object Object],[object Object],[object Object],[object Object],An Introduction to MapReduce
Hadoop: Running the Example ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],An Introduction to MapReduce
Hadoop: Word Count Example An Introduction to MapReduce public class  WordCount  extends  Configured  implements  Tool  { ... public static class  MapClass  extends  MapReduceBase   implements  Mapper < LongWritable ,  Text ,  Text , IntWritable > { ...  // Map Task Definition } public static class  Reduce  extends  MapReduceBase implements  Reducer < Text ,  IntWritable ,  Text , IntWritable > { ...  // Reduce Task Definition } public int  run ( String []  args ) throws  Exception  { ...  // Job Configuration } public static void  main ( String []  args ) throws  Exception  { int  res  =  ToolRunner . run (new  Configuration (),  new  WordCount (),  args ); System . exit ( res );  } }
Hadoop: Job Configuration An Introduction to MapReduce public int  run ( String []  args ) throws  Exception  {  	 JobConf   conf  = new  JobConf ( getConf (),  WordCount .class);  conf . setJobName ( &quot;wordcount&quot;);  // the keys are words (strings)  conf . setOutputKeyClass ( Text . class);  // the values are counts (ints)  conf . setOutputValueClass ( IntWritable . class);  	 conf . setMapperClass ( MapClass .class);  conf . setCombinerClass ( Reduce .class);  conf . setReducerClass ( Reduce .class); 	conf . setInputPath ( new  Path ( args . get (0)));  			conf . setOutputPath (new  Path ( args . get (1)));  JobClient . runJob ( conf ); return 0;  }
Hadoop: Map Class An Introduction to MapReduce public static class  MapClass  extends  MapReduceBase   implements  Mapper < LongWritable ,  Text ,  Text ,  IntWritable > {  private final static  IntWritable   one  = new  IntWritable (1);  private  Text   word  = new  Text ();  // map(WritableComparable, Writable, OutputCollector, Reporter) public void  map ( LongWritable   key ,  Text   value ,  OutputCollector < Text ,  IntWritable >  output ,  Reporter   reporter ) throws  IOException  {  String   line  =  value . toString ();  StringTokenizer   itr  = new  StringTokenizer ( line );  while ( itr . hasMoreTokens ()) {  word . set ( itr . nextToken ());  output . collect ( word ,  one );  }  } }
Hadoop: Reduce Class An Introduction to MapReduce ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
References ,[object Object],[object Object],[object Object],[object Object],An Introduction to MapReduce
Questions? An Introduction to MapReduce

More Related Content

What's hot (20)

Map reduce and Hadoop on windows
Map reduce and Hadoop on windowsMap reduce and Hadoop on windows
Map reduce and Hadoop on windows
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduce
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map Reduce
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduce
 
MapReduce basic
MapReduce basicMapReduce basic
MapReduce basic
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tips
 
The google MapReduce
The google MapReduceThe google MapReduce
The google MapReduce
 
Hadoop Map Reduce Arch
Hadoop Map Reduce ArchHadoop Map Reduce Arch
Hadoop Map Reduce Arch
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Hadoop map reduce concepts
Hadoop map reduce conceptsHadoop map reduce concepts
Hadoop map reduce concepts
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Map Reduce introduction
Map Reduce introductionMap Reduce introduction
Map Reduce introduction
 
Mastering Hadoop Map Reduce - Custom Types and Other Optimizations
Mastering Hadoop Map Reduce - Custom Types and Other OptimizationsMastering Hadoop Map Reduce - Custom Types and Other Optimizations
Mastering Hadoop Map Reduce - Custom Types and Other Optimizations
 
MapReduce
MapReduceMapReduce
MapReduce
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduce
 
Topic 6: MapReduce Applications
Topic 6: MapReduce ApplicationsTopic 6: MapReduce Applications
Topic 6: MapReduce Applications
 

Similar to An Introduction To Map-Reduce

Interactive big data analytics
Interactive big data analyticsInteractive big data analytics
Interactive big data analyticsViet-Trung TRAN
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reducerantav
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduceNewvewm
 
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLabMapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersXiao Qin
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce AlgorithmsAmund Tveit
 
mapreduce.pptx
mapreduce.pptxmapreduce.pptx
mapreduce.pptxShimoFcis
 
Map reduce - simplified data processing on large clusters
Map reduce - simplified data processing on large clustersMap reduce - simplified data processing on large clusters
Map reduce - simplified data processing on large clustersCleverence Kombe
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxHARIKRISHNANU13
 
Intermachine Parallelism
Intermachine ParallelismIntermachine Parallelism
Intermachine ParallelismSri Prasanna
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerankgothicane
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraSomnath Mazumdar
 
Map Reduce Execution Architecture
Map Reduce Execution Architecture Map Reduce Execution Architecture
Map Reduce Execution Architecture Rupak Roy
 
CS 542 -- Query Execution
CS 542 -- Query ExecutionCS 542 -- Query Execution
CS 542 -- Query ExecutionJ Singh
 
Applying stratosphere for big data analytics
Applying stratosphere for big data analyticsApplying stratosphere for big data analytics
Applying stratosphere for big data analyticsAvinash Pandu
 
Comparing Distributed Indexing To Mapreduce or Not?
Comparing Distributed Indexing To Mapreduce or Not?Comparing Distributed Indexing To Mapreduce or Not?
Comparing Distributed Indexing To Mapreduce or Not?TerrierTeam
 
2004 map reduce simplied data processing on large clusters (mapreduce)
2004 map reduce simplied data processing on large clusters (mapreduce)2004 map reduce simplied data processing on large clusters (mapreduce)
2004 map reduce simplied data processing on large clusters (mapreduce)anh tuan
 

Similar to An Introduction To Map-Reduce (20)

Interactive big data analytics
Interactive big data analyticsInteractive big data analytics
Interactive big data analytics
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduce
 
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLabMapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 
mapreduce.pptx
mapreduce.pptxmapreduce.pptx
mapreduce.pptx
 
Map reduce - simplified data processing on large clusters
Map reduce - simplified data processing on large clustersMap reduce - simplified data processing on large clusters
Map reduce - simplified data processing on large clusters
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
 
Intermachine Parallelism
Intermachine ParallelismIntermachine Parallelism
Intermachine Parallelism
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
 
MapReduce-Notes.pdf
MapReduce-Notes.pdfMapReduce-Notes.pdf
MapReduce-Notes.pdf
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
 
Map Reduce Execution Architecture
Map Reduce Execution Architecture Map Reduce Execution Architecture
Map Reduce Execution Architecture
 
CS 542 -- Query Execution
CS 542 -- Query ExecutionCS 542 -- Query Execution
CS 542 -- Query Execution
 
Applying stratosphere for big data analytics
Applying stratosphere for big data analyticsApplying stratosphere for big data analytics
Applying stratosphere for big data analytics
 
Comparing Distributed Indexing To Mapreduce or Not?
Comparing Distributed Indexing To Mapreduce or Not?Comparing Distributed Indexing To Mapreduce or Not?
Comparing Distributed Indexing To Mapreduce or Not?
 
Map reduce
Map reduceMap reduce
Map reduce
 
2004 map reduce simplied data processing on large clusters (mapreduce)
2004 map reduce simplied data processing on large clusters (mapreduce)2004 map reduce simplied data processing on large clusters (mapreduce)
2004 map reduce simplied data processing on large clusters (mapreduce)
 

An Introduction To Map-Reduce

  • 1. An Introduction to MapReduce Francisco P érez-Sorrosal Distributed Systems Lab (DSL/LSD) Universidad Polit é cnica de Madrid 10/Apr/2008
  • 2.
  • 3.
  • 4.
  • 5.
  • 6. Simple Example An Introduction to MapReduce Input data Mapped data on Node 1 Mapped data on Node 2 Result
  • 7. What is MapReduce ’ s Main Goal? An Introduction to MapReduce Simplify the parallelization and distribution of large-scale computations in clusters
  • 8.
  • 9.
  • 10.
  • 11.
  • 12.
  • 13. Framework Overview An Introduction to MapReduce
  • 14. Framework Overview An Introduction to MapReduce
  • 15. Example: Count # of Each Letter in a Big File An Introduction to MapReduce Big File 640MB Master 1) Split File into 10 pieces of 64MB R = 4 output files (Set by the user)‏ a t b o m a p r r e d u c e g o o o g l e a p i m a c a c a b r a a r r o z f e i j a o t o m a t e c r u i m e s s o l Worker Idle Worker Idle Worker Idle (There are 26 different keys letters in the range [a..z]) Worker Idle Worker Idle Worker Idle Worker Idle Worker Idle 1 2 3 4 5 6 7 8 9 10
  • 16. Example: Count # of Each Letter in a Big File An Introduction to MapReduce Big File 640MB Master 2) Assign map and reduce tasks a t b o m a p r r e d u c e g o o o g l e a p i m a c a c a b r a a r r o z f e i j a o t o m a t e c r u i m e s s o l 1 2 3 4 5 6 7 8 9 10 Worker Idle Worker Idle Worker Idle Worker Idle Worker Idle Worker Idle Worker Idle Worker Idle Mappers Reducers
  • 17. Example: Count # of Each Letter in a Big File An Introduction to MapReduce Big File 640MB Master 3) Read the split data a t b o m a p r r e d u c e g o o o g l e a p i m a c a c a b r a a r r o z f e i j a o t o m a t e c r u i m e s s o l 1 2 3 4 Map T. In progress Map T. In progress Map T. In progress Reduce T. Idle Reduce T. Idle Reduce T. Idle Map T. In progress Reduce T. Idle
  • 18. Example: Count # of Each Letter in a Big File An Introduction to MapReduce a b c d e f g h i j k l m n n o p q r s t v w x y z Machine 1 Big File 640MB 4) Process data (in memory) a y b o m a p r r e d u c e g o o o g l e a p i m a c a c a b r a a r r o z f e i j a o t o m a t e c r u i m e s s o l R1 Partition Function (used to map the letters in regions): R2 R3 R4 Simulating the execution in memory Map T.1 In-Progress R1 R2 R3 R4 (a,1) (b,1) (a,1) (m1) (o,1) (p,1) (r, 1) (y,1)
  • 19. Example: Count # of Each Letter in a Big File An Introduction to MapReduce Machine 1 Big File 640MB Master 5) Apply combiner function a t b o m a p r r e d u c e g o o o g l e a p i m a c a c a b r a a r r o z f e i j a o t o m a t e c r u i m e s s o l Simulating the execution in memory R1 R2 R3 R4 Map T.1 In-Progress (a,1) (b,1) (a,1) (m1) (o,1) (p,1) (r, 1) (y,1) (a,2) (b,1) (m1) (o,1) (p,1) (r, 1) (y,1)
  • 20. Example: Count # of Each Letter in a Big File An Introduction to MapReduce Machine 1 Big File 640MB Master 6) Store results on disk a t b o m a p r r e d u c e g o o o g l e a p i m a c a c a b r a a r r o z f e i j a o t o m a t e c r u i m e s s o l Memory R1 R2 R3 R4 Disk Map T.1 In-Progress (a,2) (b,1) (m1) (o,1) (p,1) (r, 1) (y,1)
  • 21. Example: Count # of Each Letter in a Big File An Introduction to MapReduce Big File 640MB Master 7) Inform the master about the position of the intermediate results in local disk a t b o m a p r r e d u c e g o o o g l e a p i m a c a c a b r a a r r o z f e i j a o t o m a t e c r u i m e s s o l Machine 1 R1 R2 R3 R4 MT1 Results Location MT1 Results Map T.1 In-Progress (a,2) (b,1) (m1) (o,1) (p,1) (r, 1) (y,1)
  • 22. Example: Count # of Each Letter in a Big File An Introduction to MapReduce Big File 640MB Master 8) The Master assigns the next task (Map Task 5) to the Worker recently free a t b o m a p r r e d u c e g o o o g l e a p i m a c a c a b r a a r r o z f e i j a o t o m a t e c r u i m e s s o l Machine 1 R1 R2 R3 R4 T1 Results Worker In-Progress Data for Map Task 5 (a,2) (b,1) (m1) (o,1) (p,1) (r, 1) (y,1) Task 5
  • 23. Example: Count # of Each Letter in a Big File An Introduction to MapReduce Master 9) The Master forwards the location of the intermediate results of Map Task 1 to reducers Machine 1 R1 R2 R3 R4 MT1 Results MT1 Results Location (R1) MT1 Results Location (Rx) Big File 640MB a t b o m a p r r e d u c e g o o o g l e a p i m a c a c a b r a a r r o z f e i j a o t o m a t e c r u i m e s s o l ... Map T.5 In-Progress Reduce T.1 Idle (a,2) (b,1) (m1) (o,1) (p,1) (r, 1) (y,1)
  • 24. Example: Count # of Each Letter in a Big File An Introduction to MapReduce Big File 640MB a t b o m a p r r e d u c e g o o o g l e a p i m a c a c a b r a a r r o z f e i j a o t o m a t e c r u i m e s s o l R1 a b c d e f g Letters in Region 1 : Reduce T.1 Idle (a, 2) (b,1) (e, 1) (d, 1) (c, 1) (e, 1) (g, 1) (e, 1) (a, 3) (c, 1) (c, 1) (a, 1) (b,1) (a, 2) (f, 1) (e, 1) (a, 2) (e, 1)(c, 1) (e, 1)
  • 25. Example: Count # of Each Letter in a Big File An Introduction to MapReduce Machine N (a, 2) (b,1) (e, 1) (d, 1) (c, 1) (e, 1) (g, 1) (e, 1) (a, 3) (c, 1) (c, 1) (a, 1) (b,1) (a, 2) (f, 1) (e, 1) (a, 2) (e, 1)(c, 1) (e, 1) Data read from each Map Task stored in region 1 10) The RT 1 reads the data in R=1 from each MT Reduce T.1 In-Progress
  • 26. Example: Count # of Each Letter in a Big File An Introduction to MapReduce Machine N (a, 2) (a, 3) (a, 1) (a, 2) (a, 2) (b,1) (b,1) (c, 1) (c, 1) (c, 1) (c, 1) (d, 1) (e, 1) (e, 1) (e, 1) (e, 1) (e, 1) (e, 1) (f, 1) (g, 1) 11) The reduce task 1 sorts the data Reduce T.1 In-Progress
  • 27. Example: Count # of Each Letter in a Big File An Introduction to MapReduce Machine N (a, 2) (a, 3) (a, 1) (a, 2) (a, 2) (b,1) (b,1) (c, 1) (c, 1) (c, 1) (c, 1) (d, 1) (e, 1) (e, 1) (e, 1) (e, 1) (e, 1) (e, 1) (f, 1) (g, 1) 12) Then it passes the key and the corresponding set of intermediate data to the user's reduce function (a, {2,3,1,2,2}) (b, {1,1}) (c, {1,1,1,1}) (d,{1}) (e, {1,1,1,1,1,1}) (f, {1}) (g, {1}) Reduce T.1 In-Progress
  • 28. Example: Count # of Each Letter in a Big File An Introduction to MapReduce Machine N 12) Finally, generates the output file 1 of R, after executing the user's reduce (a, {2,3,1,2,2}) (b, {1,1}) (c, {1,1,1,1}) (d,{1}) (e, {1,1,1,1,1,1}) (f, {1}) (g, {1}) (a, 10) (b, 2) (c, 4) (d, 1) (e, 6) (f, 1) (g, 1) Reduce T.1 In-Progress
  • 29.
  • 30.
  • 31.
  • 32.
  • 33.
  • 34.
  • 35.
  • 36.
  • 37. Hadoop: HDFS Console An Introduction to MapReduce
  • 38. Hadoop: JobTracker Console An Introduction to MapReduce
  • 39.
  • 40.
  • 41. Hadoop: Word Count Example An Introduction to MapReduce public class WordCount extends Configured implements Tool { ... public static class MapClass extends MapReduceBase implements Mapper < LongWritable , Text , Text , IntWritable > { ... // Map Task Definition } public static class Reduce extends MapReduceBase implements Reducer < Text , IntWritable , Text , IntWritable > { ... // Reduce Task Definition } public int run ( String [] args ) throws Exception { ... // Job Configuration } public static void main ( String [] args ) throws Exception { int res = ToolRunner . run (new Configuration (), new WordCount (), args ); System . exit ( res ); } }
  • 42. Hadoop: Job Configuration An Introduction to MapReduce public int run ( String [] args ) throws Exception { JobConf conf = new JobConf ( getConf (), WordCount .class); conf . setJobName ( &quot;wordcount&quot;); // the keys are words (strings) conf . setOutputKeyClass ( Text . class); // the values are counts (ints) conf . setOutputValueClass ( IntWritable . class); conf . setMapperClass ( MapClass .class); conf . setCombinerClass ( Reduce .class); conf . setReducerClass ( Reduce .class); conf . setInputPath ( new Path ( args . get (0))); conf . setOutputPath (new Path ( args . get (1))); JobClient . runJob ( conf ); return 0; }
  • 43. Hadoop: Map Class An Introduction to MapReduce public static class MapClass extends MapReduceBase implements Mapper < LongWritable , Text , Text , IntWritable > { private final static IntWritable one = new IntWritable (1); private Text word = new Text (); // map(WritableComparable, Writable, OutputCollector, Reporter) public void map ( LongWritable key , Text value , OutputCollector < Text , IntWritable > output , Reporter reporter ) throws IOException { String line = value . toString (); StringTokenizer itr = new StringTokenizer ( line ); while ( itr . hasMoreTokens ()) { word . set ( itr . nextToken ()); output . collect ( word , one ); } } }
  • 44.
  • 45.

Editor's Notes

  1. Existen librerias para programar culsters como PVM (Parallel virtual machine), MPI (Message Passing interface)
  2. Encima de GFS
  3. Given a collection of Shapes we split this collection into 2 parts and send every part to a grid node. Each node will count number of Shapes provided and will return it back to caller. The caller then will add results received from remote nodes and provide the reduced result back to the user (the counts are displayed next to every shape).
  4. Simplificar la paralelizaci ón y distribución de procesamientos masivos de datos.
  5. So, in order to achieve this goal, MR provides...
  6. La entrada es una lista de URLs de p áginas web
  7. Vamos a ver con un ejemplo como funciona el framework de MapReduce. El programa de ejemplo cuenta el n úmero de ocurrencias de cada letra que aparece en un fichero grande y clasifica las letras en 4 ficheros de salida distintos por rangos de siete letras (e.g. De la A a la G, de la H a la N...). Para esto se utiliza una función de particionamiento definida por el usuario, y que suele ser una función hash.
  8. El master asigna a cada worker una tarea espec ífica. En este caso suponemos que los workers de la izquierda van a ser asignados con tareas map y los de la derecha con tareas de reducción.
  9. El siguiente paso asigna a cada tarea map su correspondiente parte de la entrada del fichero
  10. Utilizando la funci ón de particionamiento, la tarea 1 v a clasificando las letras en cada region de la siguiente manera: la a a la region 1, la y a la region 4
  11. Completed map tasks need to be re-executed because their results are stored in local disks and they are inaccessible. Completed reduce tasks don’t need to be re-executed because their results are in the GFS that provides fault-tolerance through data replication.