Map/Reduce intro

Some slides about the Map/Reduce programming model (academic purposes) adapting some examples of the book Map/Reduce design patterns.

Special thanks to the following authors:

-http://shop.oreilly.com/product/0636920025122.do
-http://mapreducepatterns.com/index.php?title=Main_Page
-http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/

Usage Rights

CC Attribution-NonCommercial-ShareAlike License

    Map/Reduce intro: Presentation Transcript

    • MapReduce Intro
      The MapReduce Programming Model: Introduction and Examples
      Dr. Jose María Alvarez-Rodríguez
      “Quality Management in Service-based Systems and Cloud Applications”, FP7 RELATE-ITN
      South East European Research Center
      Thessaloniki, 10th of April, 2013
    • 1. MapReduce in a nutshell
      2. Thinking in MapReduce
      3. Applying MapReduce
      4. Success Stories with MapReduce
      5. Summary and Conclusions
    • MapReduce in a nutshell: Features
      A programming model...
        1. Large-scale distributed data processing
        2. Simple but restricted
        3. Parallel programming
        4. Extensible
    • MapReduce in a nutshell: Antecedents
      Functional programming
        1. Inspired by it
        2. ...but not equivalent
      Example in Python: “Given a list of numbers between 1 and 50, print only the even numbers”
          # Python 3: filter() returns a lazy iterator, so wrap it in list() before printing
          print(list(filter(lambda x: x % 2 == 0, range(1, 51))))
      The example combines:
        A list of numbers (data)
        A condition (even numbers)
        A function, filter, that is applied to the list (map)
    • MapReduce in a nutshell: ...Other examples...
      Example in Python: “Return the sum of the squares of a list of numbers between 1 and 50”
          import operator
          from functools import reduce  # reduce is not a built-in in Python 3
          # Square every number, then fold the squares with addition, starting from 0
          reduce(operator.add, map(lambda x: x ** 2, range(1, 51)), 0)
      “reduce” is equivalent to “foldl” in other functional languages such as Haskell.
      Other mathematical considerations should be taken into account (the kind of operator)...
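One way to read the remark about the kind of operator: a distributed reduce typically folds partial results computed on separate chunks, so the operator should be associative. The sketch below is only an illustration of that idea, not part of the original slides; `fold_in_chunks` is a hypothetical helper name.

```python
from functools import reduce
import operator


def fold_in_chunks(op, values, chunk_size):
    """Reduce each chunk on its own, then reduce the partial results."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    partials = [reduce(op, chunk) for chunk in chunks]
    return reduce(op, partials)


numbers = list(range(1, 9))

# Addition is associative: the chunked fold agrees with the sequential fold.
sequential_sum = reduce(operator.add, numbers)
chunked_sum = fold_in_chunks(operator.add, numbers, 3)

# Subtraction is not associative: the two folds give different answers.
sequential_diff = reduce(operator.sub, numbers)
chunked_diff = fold_in_chunks(operator.sub, numbers, 3)

print(sequential_sum, chunked_sum)
print(sequential_diff, chunked_diff)
```

With addition both folds agree, while with subtraction they diverge, which is why sums, counts, minima and maxima are comfortable fits for a reduce step.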
    • MapReduce in a nutshell: Some interesting points...
      The MapReduce framework...
        1. Is inspired by functional programming concepts (but is not equivalent)
        2. Targets problems that can be parallelized
        3. Sometimes involves recursive solutions
        4. ...
    • MapReduce in a nutshell: Basic Model
      Source: “MapReduce: The Programming Model and Practice”, SIGMETRICS Tutorials 2009, Google.
    • MapReduce in a nutshell: Map Function
      Figure: Mapping creates a new output list by applying a function to individual elements of an input list. “Module 4: MapReduce”, Hadoop Tutorial, Yahoo!.
    • MapReduce in a nutshell: Reduce Function
      Figure: Reducing a list iterates over the input values to produce an aggregate value as output. “Module 4: MapReduce”, Hadoop Tutorial, Yahoo!.
    • MapReduce in a nutshell: MapReduce Flow
      Figure: High-level MapReduce pipeline. “Module 4: MapReduce”, Hadoop Tutorial, Yahoo!.
    • MapReduce in a nutshell: MapReduce Flow
      Figure: Detailed Hadoop MapReduce data flow.
    • MapReduce in a nutshell: Tip
      What is MapReduce? It is a framework inspired by functional programming to tackle problems in which steps can be parallelized by applying a divide-and-conquer approach.
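The map, group-by-key, and reduce stages described above can be sketched in a few lines of single-process Python. This is a toy model under stated assumptions, not how Hadoop is implemented: `map_reduce`, `letter_mapper`, and `count_reducer` are hypothetical names, and a real framework distributes each stage across machines.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Apply mapper to every record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:                      # map phase
        for key, value in mapper(record):
            groups[key].append(value)           # shuffle: collect values per key
    # reduce phase: one call per distinct key
    return {key: reducer(key, values) for key, values in groups.items()}


def letter_mapper(line):
    """Emit (letter, 1) for every alphabetic character in the line."""
    for ch in line.lower():
        if ch.isalpha():
            yield ch, 1


def count_reducer(key, values):
    """Sum the counts emitted for one letter."""
    return sum(values)


letter_counts = map_reduce(["Map", "Reduce"], letter_mapper, count_reducer)
print(letter_counts)
```

Because each mapper call sees one record and each reducer call sees one key, both phases could run in parallel on separate chunks, which is the divide-and-conquer aspect the tip refers to.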
    • Thinking in MapReduce: When should I use MapReduce?
      Query
        Index and search: inverted index
        Filtering
        Classification
        Recommendations: clustering or collaborative filtering
      Analytics
        Summarization and statistics
        Sorting and merging
        Frequency distribution
        SQL-based queries: group-by, having, etc.
        Generation of graphics: histograms, scatter plots
      Others
        Message passing, such as Breadth-First Search or PageRank algorithms
    • Thinking in MapReduce: How Google uses MapReduce (80% of data processing)
        Large-scale web search indexing
        Clustering problems for Google News
        Producing reports for popular queries, e.g. Google Trends
        Processing of satellite imagery data
        Language model processing for statistical machine translation
        Large-scale machine learning problems
        ...
    • Thinking in MapReduce: Comparison of MapReduce and other approaches
      Source: “MapReduce: The Programming Model and Practice”, SIGMETRICS Tutorials 2009, Google.
    • Thinking in MapReduce: Evaluation of MapReduce and other approaches
      Source: “MapReduce: The Programming Model and Practice”, SIGMETRICS Tutorials 2009, Google.
    • Thinking in MapReduce: Apache Hadoop
      MapReduce definition: The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
      Figure: Apache Hadoop Logo.
    • Thinking in MapReduce: Tip
      What can I do in MapReduce? Three main functions:
        1. Querying
        2. Summarizing
        3. Analyzing
      ...large datasets in off-line mode for boosting other on-line processes.
    • Applying MapReduce: MapReduce in Action
      MapReduce Patterns
        1. Summarization
        2. Filtering
        3. Data Organization (sort, merging, etc.)
        4. Relational-based (join, selection, projection, etc.)
        5. Iterative Message Passing (graph processing)
        6. Others (depending on the implementation):
             Simulation of distributed systems
             Cross-correlation
             Metapatterns
             Input-output
             ...
    • Applying MapReduce: Overview (stages) - Counting Letters
    • Applying MapReduce: Summarization
      Types
        1. Numerical summarizations
        2. Inverted index
        3. Counting and counters
    • Applying MapReduce: Numerical Summarization-I
      Description: A general pattern for calculating aggregate statistical values over your data.
      Intent: Group records together by a key field and calculate a numerical aggregate per group to get a top-level view of the larger data set.
    • Applying MapReduce: Numerical Summarization-II
      Applicability
        To deal with numerical data or counting.
        To group data by specific fields.
      Examples
        1. Word count
        2. Record count
        3. Min/Max/Count
        4. Average/Median/Standard deviation
        5. ...
    • Applying MapReduce: Numerical Summarization-Pseudocode
          class Mapper
            method Map(recordid id, record r)
              for all term t in record r do
                Emit(term t, count 1)
          class Reducer
            method Reduce(term t, counts [c1, c2,...])
              sum = 0
              for all count c in [c1, c2,...] do
                sum = sum + c
              Emit(term t, count sum)
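The pseudocode above translates almost line by line into plain Python. The sketch below is an illustration only: `map_record` and `reduce_counts` are hypothetical names, the sample records are invented, and sorting followed by `itertools.groupby` stands in for the framework's sort-and-shuffle step.

```python
from itertools import groupby
from operator import itemgetter


def map_record(record_id, record):
    """Mapper: emit (term, 1) for every term in the record."""
    for term in record.split():
        yield term, 1


def reduce_counts(term, counts):
    """Reducer: sum all the counts emitted for one term."""
    return term, sum(counts)


# Invented sample input: record id -> text
records = {1: "to be or not to be", 2: "to map or to reduce"}

# Map phase: collect every emitted (term, 1) pair.
pairs = [pair for rid, rec in records.items() for pair in map_record(rid, rec)]

# Sorting by term and grouping adjacent pairs mimics the shuffle step.
pairs.sort(key=itemgetter(0))
word_counts = dict(
    reduce_counts(term, (count for _, count in group))
    for term, group in groupby(pairs, key=itemgetter(0))
)
print(word_counts)
```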
    • Applying MapReduce: Overview - Word Counter
    • Applying MapReduce: Numerical Summarization - Word Counter
          public void map(LongWritable key, Text value, Context context) throws Exception {
              String line = value.toString();
              StringTokenizer tokenizer = new StringTokenizer(line);
              // Emit (word, 1) for every token in the line
              while (tokenizer.hasMoreTokens()) {
                  word.set(tokenizer.nextToken());
                  context.write(word, one);
              }
          }

          public void reduce(Text key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              // Sum all the counts received for this word
              int sum = 0;
              for (IntWritable val : values) {
                  sum += val.get();
              }
              context.write(key, new IntWritable(sum));
          }
    • Applying MapReduce: Example-II
      Min/Max: Given a list of tweets (username, date, text), determine the first and last time a user commented and the number of comments.
      Implementation: See https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro
    • Applying MapReduce: Overview - Min/Max
      ∗ The min and max creation dates are the same in the map phase.
    • Applying MapReduce: Example II-Min/Max, function Map
          public void map(Object key, Text value, Context context)
                  throws IOException, InterruptedException, ParseException {
              Map<String, String> parsed = MRDPUtils.parse(value.toString());
              String strDate = parsed.get(MRDPUtils.CREATION_DATE);
              String userId = parsed.get(MRDPUtils.USER_ID);
              // Skip records that lack a date or a user id
              if (strDate == null || userId == null) {
                  return;
              }
              Date creationDate = MRDPUtils.frmt.parse(strDate);
              // A single record is both the earliest and the latest date seen
              outTuple.setMin(creationDate);
              outTuple.setMax(creationDate);
              outTuple.setCount(1);
              outUserId.set(userId);
              context.write(outUserId, outTuple);
          }
    • Applying MapReduce: Example II-Min/Max, function Reduce
          public void reduce(Text key, Iterable<MinMaxCountTuple> values, Context context)
                  throws IOException, InterruptedException {
              result.setMin(null);
              result.setMax(null);
              int sum = 0;
              for (MinMaxCountTuple val : values) {
                  // Keep the earliest date seen so far
                  if (result.getMin() == null || val.getMin().compareTo(result.getMin()) < 0) {
                      result.setMin(val.getMin());
                  }
                  // Keep the latest date seen so far
                  if (result.getMax() == null || val.getMax().compareTo(result.getMax()) > 0) {
                      result.setMax(val.getMax());
                  }
                  sum += val.getCount();
              }
              result.setCount(sum);
              context.write(key, result);
          }
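The same min/max/count logic can be tried out without a Hadoop cluster. The following is a minimal Python sketch, assuming tweets arrive as (user, timestamp) pairs; the function names and the sample data are invented for illustration and do not come from the linked repository.

```python
from datetime import datetime


def map_comment(user_id, created):
    """Mapper: one comment is both the earliest and the latest seen, with count 1."""
    return user_id, (created, created, 1)


def reduce_min_max_count(tuples):
    """Reducer: merge the (min, max, count) tuples of one user."""
    tuples = list(tuples)
    earliest = min(t[0] for t in tuples)
    latest = max(t[1] for t in tuples)
    total = sum(t[2] for t in tuples)
    return earliest, latest, total


# Invented sample data
comments = [
    ("alice", datetime(2013, 4, 10, 9, 0)),
    ("bob", datetime(2013, 4, 9, 18, 30)),
    ("alice", datetime(2013, 4, 8, 7, 15)),
    ("alice", datetime(2013, 4, 12, 22, 45)),
]

# Group mapper output by user, standing in for the shuffle step.
grouped = {}
for user_id, created in comments:
    key, value = map_comment(user_id, created)
    grouped.setdefault(key, []).append(value)

summary = {user: reduce_min_max_count(values) for user, values in grouped.items()}
print(summary)
```

Since the reducer's output has the same shape as its input, the same function could also serve as a combiner on partial results.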
    • Applying MapReduce: Example-III
      Average: Given a list of tweets (username, date, text), determine the average comment length per hour of day.
      Implementation: See https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro
    • Applying MapReduce: Overview - Average
    • Applying MapReduce: Example III-Average, function Map
          public void map(Object key, Text value, Context context)
                  throws IOException, InterruptedException, ParseException {
              Map<String, String> parsed = MRDPUtils.parse(value.toString());
              String strDate = parsed.get(MRDPUtils.CREATION_DATE);
              String text = parsed.get(MRDPUtils.TEXT);
              // Skip records that lack a date or a text
              if (strDate == null || text == null) {
                  return;
              }
              Date creationDate = MRDPUtils.frmt.parse(strDate);
              // Key: hour of day; value: (count = 1, average = length of this comment)
              outHour.set(creationDate.getHours());
              outCountAverage.setCount(1);
              outCountAverage.setAverage(text.length());
              context.write(outHour, outCountAverage);
          }
    • Applying MapReduce: Example III-Average, function Reduce
          public void reduce(IntWritable key, Iterable<CountAverageTuple> values, Context context)
                  throws IOException, InterruptedException {
              float sum = 0;
              float count = 0;
              // Weight each partial average by its count to recover the total length
              for (CountAverageTuple val : values) {
                  sum += val.getCount() * val.getAverage();
                  count += val.getCount();
              }
              result.setCount(count);
              result.setAverage(sum / count);
              context.write(key, result);
          }
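A likely reason the reducer multiplies each average by its count is that plain averages cannot be averaged again, whereas (count, average) pairs can be merged in any grouping. The sketch below illustrates this with invented comment lengths; `merge_count_average` is a hypothetical name, not a function from the slides.

```python
def merge_count_average(tuples):
    """Merge (count, average) pairs, weighting each average by its count."""
    total = 0.0
    count = 0
    for c, avg in tuples:
        total += c * avg
        count += c
    return count, total / count


# Invented comment lengths; each mapper output is (count=1, average=length).
lengths = [10, 20, 30, 100]
mapped = [(1, float(n)) for n in lengths]

# Reducing everything at once...
direct = merge_count_average(mapped)

# ...matches reducing two partial results produced on separate chunks.
partial_a = merge_count_average(mapped[:3])
partial_b = merge_count_average(mapped[3:])
merged = merge_count_average([partial_a, partial_b])

# Averaging the partial averages directly ignores the counts and is wrong.
naive = (partial_a[1] + partial_b[1]) / 2

print(direct, merged, naive)
```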
    • Applying MapReduce: Numerical Summarization - Other approaches
      Relation to SQL
          SELECT MIN(numcol1), MAX(numcol1), COUNT(*) FROM table GROUP BY groupcol2;
      Implementation in PIG
          b = GROUP a BY groupcol2;
          c = FOREACH b GENERATE group, MIN(a.numcol1), MAX(a.numcol1), COUNT_STAR(a);
    • Applying MapReduce: Filtering
      Types
        1. Filtering
        2. Top N records
        3. Bloom filtering
        4. Distinct
    • Applying MapReduce: Filtering-I
      Description: It evaluates each record separately and decides, based on some condition, whether it should stay or go.
      Intent: Filter out records that are not of interest and keep the ones that are.
    • Applying MapReduce: Filtering-II
      Applicability
        To collate data
      Examples
        1. Closer view of a dataset
        2. Data cleansing
        3. Tracking a thread of events
        4. Simple random sampling
        5. Distributed grep
        6. Removing low-scoring data
        7. Log analysis
        8. Data querying
        9. Data validation
        10. ...
    • Applying MapReduce: Filtering-Pseudocode
          class Mapper
            method Map(recordid id, record r)
              field f = extract(r)
              if predicate(f)
                Emit(recordid id, value(r))
          class Reducer
            method Reduce(recordid id, values [r1, r2,...])
              // Whatever
              Emit(recordid id, aggregate(values))
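As a rough Python counterpart to the mapper in the pseudocode above, the sketch below removes low-scoring records. The record layout, the threshold, and the name `filter_mapper` are assumptions made for the illustration; no reducer is needed when the surviving records are simply passed through.

```python
def filter_mapper(record_id, record, extract, predicate):
    """Emit the record only when the extracted field satisfies the predicate."""
    if predicate(extract(record)):
        yield record_id, record


# Invented sample records keyed by record id
records = {
    1: {"user": "ana", "score": 12},
    2: {"user": "ben", "score": 3},
    3: {"user": "cy", "score": 8},
}

kept = dict(
    pair
    for rid, rec in records.items()
    for pair in filter_mapper(
        rid,
        rec,
        extract=lambda r: r["score"],   # field f = extract(r)
        predicate=lambda s: s >= 5,     # keep records scoring at least 5
    )
)
print(kept)
```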
    • Applying MapReduce: Example-IV
      Distributed Grep: Given a list of tweets (username, date, text), determine the tweets that contain a word.
      Implementation: See https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro
    • Applying MapReduce: Overview - Distributed Grep
    • Applying MapReduce: Example IV-Distributed Grep, function Map
          public void map(Object key, Text value, Context context)
                  throws IOException, InterruptedException {
              Map<String, String> parsed = MRDPUtils.parse(value.toString());
              String txt = parsed.get(MRDPUtils.TEXT);
              // Wrap the configured pattern in word boundaries so it matches inside the text
              String mapRegex = ".*\\b" + context.getConfiguration().get("mapregex") + "(.)*\\b.*";
              if (txt.matches(mapRegex)) {
                  context.write(NullWritable.get(), value);
              }
          }
      ...and the Reduce function? In this case it is not necessary: the output values are written directly to the output.
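A map-only grep is easy to try locally. The sketch below is a simplified variant, not the author's implementation: it matches the word only as a whole word and escapes it with `re.escape`, whereas the Java mapper above accepts an arbitrary regular expression from the job configuration. The tweets and the name `grep_mapper` are invented.

```python
import re


def grep_mapper(tweets, word):
    """Yield every tweet whose text contains `word` as a whole word."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    for tweet in tweets:
        if pattern.search(tweet["text"]):
            yield tweet


# Invented sample tweets
tweets = [
    {"user": "u1", "text": "hadoop rocks"},
    {"user": "u2", "text": "I love MapReduce"},
    {"user": "u3", "text": "hadoopify everything"},
]

matches = list(grep_mapper(tweets, "hadoop"))
print(matches)
```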
    • Applying MapReduce: Example-V
      Top 5: Given a list of tweets (username, date, text), determine the 5 users who wrote the longest tweets.
      Implementation: See https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro
    • Applying MapReduce: Overview - Top 5
    • Applying MapReduce: Example V-Top 5, function Map
          private TreeMap<Integer, Text> repToRecordMap = new TreeMap<Integer, Text>();

          public void map(Object key, Text value, Context context)
                  throws IOException, InterruptedException {
              Map<String, String> parsed = MRDPUtils.parse(value.toString());
              if (parsed == null) {
                  return;
              }
              String userId = parsed.get(MRDPUtils.USER_ID);
              // Longer tweets get a higher reputation
              String reputation = String.valueOf(parsed.get(MRDPUtils.TEXT).length());
              if (userId == null || reputation == null) {
                  return;
              }
              repToRecordMap.put(Integer.parseInt(reputation), new Text(value));
              // Keep only the MAX_TOP largest entries by dropping the smallest key
              if (repToRecordMap.size() > MAX_TOP) {
                  repToRecordMap.remove(repToRecordMap.firstKey());
              }
          }
    • Applying MapReduce: Example V-Top 5, function Reduce
          public void reduce(NullWritable key, Iterable<Text> values, Context context)
                  throws IOException, InterruptedException {
              for (Text value : values) {
                  Map<String, String> parsed = MRDPUtils.parse(value.toString());
                  repToRecordMap.put(parsed.get(MRDPUtils.TEXT).length(), new Text(value));
                  // Keep only the MAX_TOP largest entries by dropping the smallest key
                  if (repToRecordMap.size() > MAX_TOP) {
                      repToRecordMap.remove(repToRecordMap.firstKey());
                  }
              }
              // Write the surviving records from longest to shortest
              for (Text t : repToRecordMap.descendingMap().values()) {
                  context.write(NullWritable.get(), t);
              }
          }
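The top-N idea, where every mapper keeps a local top N and a single reducer merges the candidates, can be sketched in Python with `heapq.nlargest`. This is an illustration under assumptions: the function names and sample tweets are invented, and a heap is swapped in for the `TreeMap` used on the slides. One visible difference is that a map keyed by tweet length keeps only one record per length, while the heap-based version keeps ties.

```python
import heapq


def top_n_mapper(tweets, n):
    """Each mapper keeps only its local top n tweets by text length."""
    return heapq.nlargest(n, tweets, key=lambda t: len(t["text"]))


def top_n_reducer(candidate_lists, n):
    """A single reducer merges the local winners into the global top n."""
    merged = [t for candidates in candidate_lists for t in candidates]
    return heapq.nlargest(n, merged, key=lambda t: len(t["text"]))


# Invented input splits, one per mapper
split_a = [
    {"user": "u1", "text": "a" * 10},
    {"user": "u2", "text": "b" * 40},
    {"user": "u3", "text": "c" * 25},
]
split_b = [
    {"user": "u4", "text": "d" * 40},
    {"user": "u5", "text": "e" * 5},
]

top_two = top_n_reducer([top_n_mapper(split_a, 2), top_n_mapper(split_b, 2)], 2)
print([t["user"] for t in top_two])
```

Each mapper forwards at most n records, so the reducer handles at most n times the number of mappers regardless of the input size.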
    • Applying MapReduce: Filtering - Other approaches
      Relation to SQL
          SELECT * FROM table WHERE colvalue < VALUE;
      Implementation in PIG
          b = FILTER a BY colvalue < VALUE;
    • Applying MapReduce: Tip
      How can I use and run a MapReduce framework? You should identify what kind of problem you are addressing and apply a design pattern to be implemented in a framework such as Apache Hadoop.
    • Success Stories with MapReduce: Tip
      Who is using MapReduce? Companies that deal with Big Data problems for analytics, such as:
        Cloudera
        Datasalt
        Elasticsearch
        ...
    • Success Stories with MapReduce: Apache Hadoop-Related Projects
    • Success Stories with MapReduce: More tips
      FAQ
        MapReduce is a framework based on a simple programming model
          ...to deal with large datasets in a distributed fashion
          ...with scalability, replication, fault tolerance, etc.
        Apache Hadoop is not a database
        New frameworks on top of Hadoop for specific tasks: querying, analysis, etc.
        Other similar frameworks: Storm, Signal/Collect, etc.
        ...
    • Summary and Conclusions: Summary
    • Summary and Conclusions: Conclusions
      What is MapReduce? It is a framework inspired by functional programming to tackle problems in which steps can be parallelized by applying a divide-and-conquer approach.
      What can I do in MapReduce? Three main functions:
        1. Querying
        2. Summarizing
        3. Analyzing
      ...large datasets in off-line mode for boosting other on-line processes.
      How can I use and run a MapReduce framework? You should identify what kind of problem you are addressing and apply a design pattern to be implemented in a framework such as Apache Hadoop.
    • Summary and Conclusions: What’s next?
        ...
        Concatenate MapReduce jobs
        Optimization using combiners and setting the parameters (size of partition, etc.)
        Pipelining with other languages such as Python
        Hadoop in Action: more examples, etc.
        New trending problems (image/video processing)
        Real-time processing
        ...
    • References
        J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107–113, Jan. 2008.
        J. R. Owens, J. Lentz, and B. Femiano. Hadoop Real-World Solutions Cookbook. Packt Publishing Ltd, 2013.
        C. Lam. Hadoop in Action. Manning Publications Co., Greenwich, CT, USA, 1st edition, 2010.
        J. Lin and C. Dyer. Data-intensive text processing with MapReduce. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Tutorial Abstracts, NAACL-Tutorials ’09, pages 1–2, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.
        D. Miner and A. Shook. MapReduce Design Patterns. O’Reilly and Associates Inc, 2012.
        S. Perera and T. Gunarathne. Hadoop MapReduce Cookbook. Packt Publishing Ltd, 2013.
        T. White. Hadoop: The Definitive Guide. O’Reilly Media, Inc., 1st edition, 2009.
        I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005.