Map/Reduce intro

MapReduce Intro

The MapReduce Programming Model

Introduction and Examples

Dr. Jose María Alvarez-Rodríguez

“Quality Management in Service-based Systems and Cloud
Applications”

FP7 RELATE-ITN

South East European Research Center

Thessaloniki, 10th of April, 2013


1 MapReduce in a nutshell

2 Thinking in MapReduce

3 Applying MapReduce

4 Success Stories with MapReduce

5 Summary and Conclusions

MapReduce in a nutshell

Features

A programming model...
1 Large-scale distributed data processing
2 Simple but restricted
3 Parallel programming
4 Extensible


Antecedents

Functional programming
1 Inspired
2 ...but not equivalent

Example in Python
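“Given a list of numbers between 1 and 50, print only the even
numbers”

    # Python 3: filter() returns a lazy iterator, so wrap it in list() before printing.
    # range(1, 51) covers 1..50 inclusive.
    print(list(filter(lambda x: x % 2 == 0, range(1, 51))))

A list of numbers (data)
A condition (even numbers)
A function filter that is applied to the list (map)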


...Other examples...

Example in Python
“Return the sum of the squares of a list of numbers between 1 and
50”

    # Python 3: reduce() lives in functools; range(1, 51) covers 1..50 inclusive.
    from functools import reduce
    import operator
    reduce(operator.add, map(lambda x: x ** 2, range(1, 51)), 0)

“reduce” is equivalent to “foldl” in other functional languages such as
Haskell.
Other mathematical properties of the operator (e.g. associativity) should
be taken into account...
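
One way to see why the kind of operator matters: an associative operator such as addition lets the input be split into chunks that are reduced independently and then combined, which is what makes a parallel reduce possible. The following is a minimal sketch in Python 3; the helper name `chunked_reduce` and the chunk sizes are illustrative assumptions, not part of any library.

```python
from functools import reduce
import operator


def chunked_reduce(op, values, initial, chunk_size):
    """Reduce each chunk on its own, then reduce the partial results.

    This matches a sequential reduce only when `op` is associative and
    `initial` is its identity element (0 for addition).
    """
    values = list(values)
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    partials = [reduce(op, chunk, initial) for chunk in chunks]
    return reduce(op, partials, initial)


squares = [x ** 2 for x in range(1, 51)]
sequential = reduce(operator.add, squares, 0)
parallel_style = chunked_reduce(operator.add, squares, 0, chunk_size=10)
print(sequential, parallel_style)  # both values are equal

# Subtraction is not associative, so the chunked result differs.
sub_sequential = reduce(operator.sub, [1, 2, 3, 4], 0)
sub_chunked = chunked_reduce(operator.sub, [1, 2, 3, 4], 0, chunk_size=2)
print(sub_sequential, sub_chunked)
```

With addition both strategies agree; with subtraction they do not, which is why the operator has to be checked before a reduce is distributed.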


Some interesting points...

The MapReduce framework...
1 Inspired by functional programming concepts (but not
equivalent)
2 Problems that can be parallelized
3 Sometimes recursive solutions
4 ...


Basic Model

“MapReduce: The Programming Model and Practice”, SIGMETRICS Tutorials 2009, Google.


Map Function

Figure: Mapping creates a new output list by applying a function to
individual elements of an input list.

“Module 4: MapReduce”, Hadoop Tutorial, Yahoo!.
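
The mapping step described in the figure caption can be sketched in a few lines of Python. The helper name `apply_map` and the sample data are assumptions made for illustration only.

```python
def apply_map(func, items):
    """Return a new list holding func(item) for every item; the input list is not modified."""
    return [func(item) for item in items]


words = ["map", "reduce", "hadoop"]
upper = apply_map(str.upper, words)
print(upper)
print(words)  # the input list is unchanged
```

Because each element is transformed independently of the others, the calls to `func` could run on different machines without coordination.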


Reduce Function

Figure: Reducing a list iterates over the input values to produce an
aggregate value as output.
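
The reducing step in the caption above can be sketched as an explicit loop that carries an accumulator over the input values. The helper name `fold` and the sample values are illustrative assumptions.

```python
def fold(func, items, initial):
    """Iterate over items, combining each one into a single accumulated value."""
    acc = initial
    for item in items:
        acc = func(acc, item)
    return acc


lengths = [3, 6, 6]
total = fold(lambda acc, n: acc + n, lengths, 0)  # sum of the values
longest = fold(max, lengths, 0)                   # largest value seen
print(total, longest)
```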



MapReduce Flow

Figure: High-level MapReduce pipeline.
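
As a rough, single-process illustration of the pipeline (map, then group by key, then reduce), the sketch below counts letters. The function names `run_mapreduce`, `letter_mapper` and `count_reducer` are hypothetical; a real framework such as Hadoop distributes these phases across machines.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    # Map phase: every record yields zero or more (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(mapper(record))
    # Shuffle phase: bring together all values that share a key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: one call per key over its grouped values.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def letter_mapper(line):
    return [(ch, 1) for ch in line if ch.isalpha()]


def count_reducer(key, values):
    return sum(values)


result = run_mapreduce(["abba", "cab"], letter_mapper, count_reducer)
print(result)
```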



MapReduce Flow

Figure: Detailed Hadoop MapReduce data flow.

Tip

What is MapReduce?
It is a framework inspired by functional programming to tackle
problems in which steps can be parallelized by applying a
divide-and-conquer approach.

Thinking in MapReduce

When should I use MapReduce?
Query
Index and Search: inverted index
Filtering
Classification
Recommendations: clustering or collaborative filtering

Analytics
Summarization and statistics
Sorting and merging
Frequency distribution
SQL-based queries: group-by, having, etc.
Generation of graphics: histograms, scatter plots.

Others
Message passing, such as Breadth-First Search or PageRank algorithms.


How Google uses MapReduce (80% of data processing)

Large-scale web search indexing
Clustering problems for Google News
Produce reports for popular queries, e.g. Google Trends
Processing of satellite imagery data
Language model processing for statistical machine translation
Large-scale machine learning problems
...


Comparison of MapReduce and other approaches



Evaluation of MapReduce and other approaches



Apache Hadoop

MapReduce definition
The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers
using simple programming models.

Figure: Apache Hadoop Logo.


Tip

What can I do in MapReduce?
Three main functions:
1 Querying
2 Summarizing
3 Analyzing
. . . large datasets in off-line mode for boosting other on-line
processes.

Applying MapReduce

MapReduce in Action

MapReduce Patterns
1 Summarization
2 Filtering
3 Data Organization (sort, merging, etc.)
4 Relational-based (join, selection, projection, etc.)
5 Iterative Message Passing (graph processing)
6 Others (depending on the implementation):
Simulation of distributed systems
Cross-correlation
Metapatterns
Input-output
...


Overview (stages)-Counting Letters


Summarization

Types
1 Numerical summarizations
2 Inverted index
3 Counting and counters
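
Of the types listed, the inverted index is perhaps the least obvious, so here is a small single-process sketch in Python. The names `index_mapper` and `index_reducer` and the two sample documents are assumptions for illustration.

```python
from collections import defaultdict


def index_mapper(doc_id, text):
    """Emit (term, doc_id) once for each distinct term in the document."""
    return [(term, doc_id) for term in set(text.lower().split())]


def index_reducer(term, doc_ids):
    """Collect the sorted list of documents that contain the term."""
    return sorted(set(doc_ids))


docs = {"d1": "map and reduce", "d2": "reduce the data"}

# Group the mapper output by term (this stands in for the shuffle phase).
groups = defaultdict(list)
for doc_id, text in docs.items():
    for term, value in index_mapper(doc_id, text):
        groups[term].append(value)

inverted = {term: index_reducer(term, ids) for term, ids in groups.items()}
print(inverted["reduce"])
```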


Numerical Summarization-I

Description
A general pattern for calculating aggregate statistical values over
your data.

Intent
Group records together by a key field and calculate a numerical
aggregate per group to get a top-level view of the larger data set.


Numerical Summarization-II

Applicability
To deal with numerical data or counting.
To group data by specific fields.

Examples

1 Word count
2 Record count
3 Min/Max/Count
4 Average/Median/Standard deviation
5 ...


Numerical Summarization-Pseudocode

class Mapper
    method Map(recordid id, record r)
        for all term t in record r do
            Emit(term t, count 1)

class Reducer
    method Reduce(term t, counts [c1, c2,...])
        sum = 0
        for all count c in [c1, c2,...] do
            sum = sum + c
        Emit(term t, count sum)
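
The pseudocode above can be tried out locally with a short Python sketch in which `yield` plays the role of Emit and a sort followed by `itertools.groupby` stands in for the shuffle phase. The function names and the sample records are assumptions for illustration.

```python
from itertools import groupby
from operator import itemgetter


def map_record(record_id, record):
    # Emit(term, 1) for every term in the record.
    for term in record.split():
        yield term, 1


def reduce_term(term, counts):
    # Add up all partial counts received for one term.
    total = 0
    for c in counts:
        total += c
    yield term, total


records = {1: "to be or not to be", 2: "to do"}

pairs = [pair for rid, rec in records.items() for pair in map_record(rid, rec)]
pairs.sort(key=itemgetter(0))  # sorting by key stands in for the shuffle phase

word_counts = {}
for term, group in groupby(pairs, key=itemgetter(0)):
    for t, total in reduce_term(term, (count for _, count in group)):
        word_counts[t] = total

print(word_counts)
```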


Overview-Word Counter


Numerical Summarization-Word Counter

    // 'word' (Text) and 'one' (an IntWritable holding 1) are assumed to be
    // reusable fields of the enclosing Mapper class; they are not shown here.
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one); // emit (term, 1)
        }
    }

    public void reduce(Text key, Iterable<IntWritable> values,
            Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get(); // add up the partial counts for this term
        }
        context.write(key, new IntWritable(sum));
    }


Example-II

Min/Max
Given a list of tweets (username, date, text), determine the first and
last time a user commented and the number of comments.

Implementation

See https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro
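
Before turning to the Hadoop implementation, the idea can be sketched in plain Python: the map step emits, for each tweet, a tuple (min date, max date, count) keyed by user, and the reduce step merges those tuples. The sample tweets and the function names `map_tweet` and `reduce_user` are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime

tweets = [
    ("ana", "2013-04-10T09:15:00", "hello"),
    ("bob", "2013-04-09T18:00:00", "map"),
    ("ana", "2013-04-08T23:30:00", "reduce"),
    ("ana", "2013-04-11T07:45:00", "done"),
]


def map_tweet(tweet):
    user, date, _text = tweet
    when = datetime.fromisoformat(date)
    # In the map phase min and max are both the tweet's own date.
    return user, (when, when, 1)


def reduce_user(values):
    earliest = min(v[0] for v in values)
    latest = max(v[1] for v in values)
    count = sum(v[2] for v in values)
    return earliest, latest, count


groups = defaultdict(list)
for tweet in tweets:
    user, value = map_tweet(tweet)
    groups[user].append(value)

summary = {user: reduce_user(values) for user, values in groups.items()}
print(summary["ana"])
```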


Overview - Min/Max

∗ Min and max creation date are the same in the map phase.

Example II-Min/Max, function Map

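    // outTuple (MinMaxCountTuple) and outUserId (Text) are assumed to be
    // reusable fields of the Mapper class; they are not shown here.
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        Map<String, String> parsed = MRDPUtils.parse(value.toString());
        String strDate = parsed.get(MRDPUtils.CREATION_DATE);
        String userId = parsed.get(MRDPUtils.USER_ID);
        if (strDate == null || userId == null) {
            return; // skip records without a date or a user
        }
        Date creationDate;
        try {
            creationDate = MRDPUtils.frmt.parse(strDate);
        } catch (ParseException e) {
            // A checked ParseException cannot be added to the throws clause
            // of an overridden map(), so it is wrapped in an IOException.
            throw new IOException("Unparseable date: " + strDate, e);
        }
        // In the map phase min and max are both the tweet's own date.
        outTuple.setMin(creationDate);
        outTuple.setMax(creationDate);
        outTuple.setCount(1);
        outUserId.set(userId);
        context.write(outUserId, outTuple);
    }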


Example II-Min/Max, function Reduce

    // 'result' (MinMaxCountTuple) is assumed to be a reusable field of the Reducer class.
    public void reduce(Text key, Iterable<MinMaxCountTuple> values,
            Context context) throws IOException, InterruptedException {
        result.setMin(null);
        result.setMax(null);
        int sum = 0;
        for (MinMaxCountTuple val : values) {
            // Keep the earliest date seen so far.
            if (result.getMin() == null
                    || val.getMin().compareTo(result.getMin()) < 0) {
                result.setMin(val.getMin());
            }
            // Keep the latest date seen so far.
            if (result.getMax() == null
                    || val.getMax().compareTo(result.getMax()) > 0) {
                result.setMax(val.getMax());
            }
            sum += val.getCount();
        }
        result.setCount(sum);
        context.write(key, result);
    }


Example-III

Average
Given a list of tweets (username, date, text) determine the average
comment length per hour of day.
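
A plain-Python sketch of the same idea follows. Each tweet is mapped to a (count, average) pair keyed by hour, and pairs are merged by weighting each average with its count, so partial results can themselves be merged again. The sample data and the names `map_tweet` and `merge` are assumptions for illustration.

```python
from collections import defaultdict


def map_tweet(hour, text):
    # One tweet contributes a count of 1 and an "average" equal to its length.
    return hour, (1, float(len(text)))


def merge(values):
    """Combine (count, average) pairs into a single (count, average) pair."""
    total = 0.0
    count = 0
    for c, avg in values:
        total += c * avg  # weight each average by the number of tweets behind it
        count += c
    return count, total / count


tweets = [(9, "hello"), (9, "hi!"), (18, "mapreduce")]

groups = defaultdict(list)
for hour, text in tweets:
    key, value = map_tweet(hour, text)
    groups[key].append(value)

averages = {hour: merge(values) for hour, values in groups.items()}
print(averages)
```

Carrying the count alongside the average is the design choice that matters here: averaging the averages directly would give wrong results whenever the groups being merged have different sizes.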

Implementation



Overview - Average


Example III-Average, function Map

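    // The signature line is missing in the original listing; it is assumed to
    // match the Min/Max mapper. outHour (IntWritable) and outCountAverage
    // (CountAverageTuple) are assumed to be fields of the Mapper class.
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        Map<String, String> parsed =
                MRDPUtils.parse(value.toString());
        String strDate = parsed.get(MRDPUtils.CREATION_DATE);
        String text = parsed.get(MRDPUtils.TEXT);
        if (strDate == null || text == null) {
            return; // skip records without a date or a text
        }
        Date creationDate;
        try {
            creationDate = MRDPUtils.frmt.parse(strDate);
        } catch (ParseException e) {
            // Wrap the checked ParseException; map() cannot declare it.
            throw new IOException("Unparseable date: " + strDate, e);
        }
        outHour.set(creationDate.getHours()); // hour of day, 0-23
        outCountAverage.setCount(1);
        outCountAverage.setAverage(text.length());
        context.write(outHour, outCountAverage);
    }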


Example III-Average, function Reduce

    // 'result' (CountAverageTuple) is assumed to be a reusable field of the Reducer class.
    public void reduce(IntWritable key, Iterable<CountAverageTuple> values,
            Context context) throws IOException, InterruptedException {
        float sum = 0;
        float count = 0;
        for (CountAverageTuple val : values) {
            // Weight each partial average by its count.
            sum += val.getCount() * val.getAverage();
            count += val.getCount();
        }
        result.setCount(count);
        result.setAverage(sum / count);
        context.write(key, result);
    }


Numerical Summarization-Other approaches

Relation to SQL
    SELECT MIN(numcol1), MAX(numcol1),
           COUNT(*) FROM table GROUP BY groupcol2;

Implementation in PIG

    b = GROUP a BY groupcol2;
    c = FOREACH b GENERATE group, MIN(a.numcol1),
        MAX(a.numcol1), COUNT_STAR(a);


Filtering

Types
1 Filtering
2 Top N records
3 Bloom filtering
4 Distinct


Filtering-I

Description
It evaluates each record separately and decides, based on some
condition, whether it should stay or go.

Intent
Filter out records that are not of interest and keep ones that are.


Filtering-II

Applicability
To collate data

Examples

1 Closer view of dataset
2 Data cleansing
3 Tracking a thread of events
4 Simple random sampling
5 Distributed Grep
6 Removing low-scoring data
7 Log Analysis
8 Data Querying
9 Data Validation
10 . . .


Filtering-Pseudocode

class Mapper
    method Map(recordid id, record r)
        field f = extract(r)
        if predicate(f)
            Emit(recordid id, value(r))

class Reducer
    method Reduce(recordid id, values [r1, r2,...])
        // Whatever
        Emit(recordid id, aggregate(values))
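
The mapper side of this pseudocode can be sketched in Python as a generator that emits a record only when the predicate holds. The record layout, the threshold and the name `filter_mapper` are assumptions for illustration; no reducer is used in this sketch.

```python
def filter_mapper(record_id, record, predicate):
    # Emit the record only if it satisfies the predicate; otherwise emit nothing.
    if predicate(record):
        yield record_id, record


records = {
    1: {"user": "ana", "score": 7},
    2: {"user": "bob", "score": 2},
    3: {"user": "eve", "score": 9},
}


def is_high_score(record):
    return record["score"] >= 5


kept = dict(
    pair
    for record_id, record in records.items()
    for pair in filter_mapper(record_id, record, is_high_score)
)
print(sorted(kept))
```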


Example-IV

Distributed Grep
Given a list of tweets (username, date, text) determine the tweets
that contain a word.
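
A small Python sketch of the matching step, assuming the search word is known up front and should match as a whole word regardless of case. The sample tweets and the name `grep_mapper` are assumptions for illustration.

```python
import re


def grep_mapper(tweet, pattern):
    # Emit the tweet unchanged when its text matches the pattern.
    _user, _date, text = tweet
    if pattern.search(text):
        yield tweet


word = "hadoop"
# \b anchors the match at word boundaries; re.escape guards against regex metacharacters.
pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)

tweets = [
    ("ana", "2013-04-10", "Learning Hadoop today"),
    ("bob", "2013-04-10", "hadooping around"),
    ("eve", "2013-04-11", "I like hadoop."),
]

matches = [t for tweet in tweets for t in grep_mapper(tweet, pattern)]
print([user for user, _, _ in matches])
```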

Implementation



Overview - Distributed Grep


Example IV-Distributed Grep, function Map

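    // The method signature and the parsing step are missing in the original
    // listing; they are assumed to match the other mappers in this deck.
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        Map<String, String> parsed = MRDPUtils.parse(value.toString());
        String txt = parsed.get(MRDPUtils.TEXT);
        // Match the configured word ("mapregex") at a word boundary.
        String mapRegex = ".*\\b" + context.getConfiguration()
                .get("mapregex") + "(.)*\\b.*";
        if (txt.matches(mapRegex)) {
            context.write(NullWritable.get(), value);
        }
    }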

...and the Reduce function?

In this case it is not necessary: the output values of the map phase are written directly to the output.


Example-V

Top 5
Given a list of tweets (username, date, text), determine the 5 users
who wrote the longest tweets.
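
The usual approach is for each mapper to keep only its local top N, with a single reducer then picking the global top N from those candidates. A minimal Python sketch using `heapq.nlargest` follows; the two input splits and the name `top_n` are assumptions for illustration.

```python
import heapq

TOP_N = 5


def top_n(records, n=TOP_N):
    """Return the n (user, text) records with the longest text, longest first."""
    return heapq.nlargest(n, records, key=lambda rec: len(rec[1]))


# Two hypothetical input splits, each handled by its own mapper.
splits = [
    [("u1", "a" * 10), ("u2", "a" * 3), ("u3", "a" * 7)],
    [("u4", "a" * 12), ("u5", "a" * 1), ("u6", "a" * 9), ("u7", "a" * 8)],
]

local_tops = [top_n(split) for split in splits]                     # map side
global_top = top_n([rec for local in local_tops for rec in local])  # reduce side
top_users = [user for user, _ in global_top]
print(top_users)
```

Only N records per mapper reach the reducer, which keeps the shuffle small however large the input is.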

Implementation



Overview - Top 5


Example V-Top 5, function Map

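    // Records kept so far, ordered by tweet length (the "reputation" here).
    private TreeMap<Integer, Text> repToRecordMap =
            new TreeMap<Integer, Text>();

    // The method signature and the parsing step are missing in the original
    // listing; they are assumed to match the other mappers in this deck.
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        Map<String, String> parsed = MRDPUtils.parse(value.toString());
        if (parsed == null) { return; }
        String userId = parsed.get(MRDPUtils.USER_ID);
        String reputation = String.valueOf(
                parsed.get(MRDPUtils.TEXT).length());
        // Max reputation if you write longer tweets
        if (userId == null || reputation == null) { return; }
        repToRecordMap.put(Integer.parseInt(reputation), new Text(value));
        // Drop the shortest tweet once more than MAX_TOP are held.
        if (repToRecordMap.size() > MAX_TOP) {
            repToRecordMap.remove(repToRecordMap.firstKey());
        }
    }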


Example V-Top 5, function Reduce

    public void reduce(NullWritable key, Iterable<Text> values,
            Context context) throws IOException, InterruptedException {
        for (Text value : values) {
            Map<String, String> parsed =
                    MRDPUtils.parse(value.toString());
            repToRecordMap.put(
                    parsed.get(MRDPUtils.TEXT).length(), new Text(value));
            // Drop the shortest tweet once more than MAX_TOP are held.
            if (repToRecordMap.size() > MAX_TOP) {
                repToRecordMap.remove(repToRecordMap.firstKey());
            }
        }
        // Emit the retained records from longest to shortest.
        for (Text t : repToRecordMap.descendingMap().values()) {
            context.write(NullWritable.get(), t);
        }
    }


Filtering-Other approaches

Relation to SQL
    SELECT * FROM table WHERE colvalue < VALUE;

Implementation in PIG

    b = FILTER a BY colvalue < VALUE;


Tip

How can I use and run a MapReduce framework?
You should identify what kind of problem you are addressing and
apply a design pattern to be implemented in a framework such
as Apache Hadoop.

Success Stories with MapReduce

Tip

Who is using MapReduce?
All companies that are dealing with Big Data problems for
analytics such as:
Cloudera
Datasalt
Elasticsearch
...


Apache Hadoop-Related Projects


More tips

FAQ
MapReduce is a framework based on a simple programming
model
...to deal with large datasets in a distributed fashion
...scalability, replication, fault tolerance, etc.
Apache Hadoop is not a database
New frameworks on top of Hadoop for specific tasks:
querying, analysis, etc.
Other similar frameworks: Storm, Signal/Collect, etc.
...

Summary and Conclusions

Summary


Conclusions

What is MapReduce?

It is a framework inspired by functional programming to tackle problems in which steps can be parallelized
by applying a divide-and-conquer approach.

What can I do in MapReduce?

1 Querying
2 Summarizing
3 Analyzing
. . . large datasets in off-line mode for boosting other on-line processes.

How can I use and run a MapReduce framework?

You should identify what kind of problem you are addressing and apply a design pattern to be implemented in a
framework such as Apache Hadoop.


What’s next?

...
Concatenate MapReduce jobs
Optimization using combiners and setting the parameters (size
of partition, etc.)
Pipelining with other languages such as Python
Hadoop in Action: more examples, etc.
New trending problems (image/video processing)
Real-time processing
...

