MapReduce Runtime Environments
Design, Performance, Optimizations

Gilles Fedak (Gilles.Fedak@inria.fr), INRIA / University of Lyon, France
Escola Regional de Alto Desempenho - Região Nordeste, Salvador da Bahia, Brazil, 22/10/2013
Many figures imported from the authors' works.
Outline of Topics
1 What is MapReduce ?
  • The Programming Model
  • Applications and Execution Runtimes
2 Design and Principles of MapReduce Execution Runtime
  • File Systems
  • MapReduce Execution
  • Key Features of MapReduce Runtime Environments
3 Performance Improvements and Optimizations
  • Handling Heterogeneity and Stragglers
  • Extending MapReduce Application Range
4 MapReduce beyond the Data Center
  • Background on BitDew
  • MapReduce for Desktop Grid Computing
5 Conclusion
Section 1
What is MapReduce ?

What is MapReduce ?

• Programming model for data-intensive applications
• Proposed by Google in 2004
• Simple, inspired by functional programming
  • the programmer simply defines Map and Reduce tasks
• Building block for other parallel programming tools
• Strong open-source implementation: Hadoop
• Highly scalable
  • accommodates large-scale clusters with faulty and unreliable resources

MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat. In OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.
Subsection 1
The Programming Model

MapReduce in Lisp

• (map f (list l1 ... ln))
• (map square '(1 2 3 4))
  • (1 4 9 16)
• (+ 1 4 9 16)
  • (+ 1 (+ 4 (+ 9 16)))
  • 30
MapReduce a la Google

Map(k, v) → (k', v')              (Mapper)
Group (k', v') by k'
Reduce(k', {v1, ..., vn}) → v''   (Reducer)

• Map(key, value) is run on each item in the input data set
  • emits new (key, val) pairs
• Reduce(key, vals) is run for each unique key emitted by the Map
  • emits the final output
• (a minimal code sketch of this pipeline follows)
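To make the model concrete, here is a minimal single-process Python sketch of the Map / group-by-key / Reduce pipeline. This is an illustration only, not Google's or Hadoop's implementation; the `map_reduce` helper and its arguments are invented for this example.

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(records, map_fn, reduce_fn):
    """Toy, single-process illustration of the Map / Group-by-key / Reduce steps."""
    # Map: apply map_fn to every input item; it yields (key, value) pairs
    intermediate = [kv for record in records for kv in map_fn(record)]
    # Group: bring together all values emitted for the same key
    intermediate.sort(key=itemgetter(0))
    grouped = ((k, [v for _, v in group])
               for k, group in groupby(intermediate, key=itemgetter(0)))
    # Reduce: apply reduce_fn to each (key, list-of-values) group
    return [reduce_fn(k, vs) for k, vs in grouped]

# Example: word count over two small "documents"
docs = ["the quick brown fox", "the lazy dog"]
wc = map_reduce(docs,
                map_fn=lambda doc: [(w, 1) for w in doc.split()],
                reduce_fn=lambda word, counts: (word, sum(counts)))
print(wc)
```

The same word-count logic reappears on the next slides in the talk's own pseudo-code.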
Example: Word Count

• Input: a large number of text documents
• Task: compute the word count across all the documents
• Solution:
  • Mapper: for every word in a document, emit (word, 1)
  • Reducer: sum all occurrences of a word and emit (word, total_count)
WordCount Example

// Pseudo-code for the Map function
map(String key, String value):
  // key: document name
  // value: document contents
  foreach word in split(value):
    EmitIntermediate(word, "1");

// Pseudo-code for the Reduce function
reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int word_count = 0;
  foreach v in values:
    word_count += ParseInt(v);
  Emit(key, AsString(word_count));
Subsection 2
Applications and Execution Runtimes

MapReduce Applications Skeleton

• Read a large data set
• Map: extract relevant information from each data item
• Shuffle and sort
• Reduce: aggregate, summarize, filter, or transform
• Store the result
• Iterate if necessary
MapReduce Usage at Google

Figure: MapReduce usage statistics at Google, courtesy of Jerry Zhao (MapReduce Group Tech Lead @ Google).
MapReduce + Distributed Storage

• Powerful when associated with a distributed storage system
  • GFS (Google File System), HDFS (Hadoop File System)
The Execution Runtime

• The MapReduce runtime handles transparently:
  • Parallelization
  • Communications
  • Load balancing
  • I/O: network and disk
  • Machine failures
  • Laggers (stragglers)
MapReduce Compared

Figure: comparison of MapReduce with other parallel programming approaches, courtesy of Jerry Zhao.
Existing MapReduce Systems

• Google MapReduce
  • Proprietary and closed source
  • Closely linked to GFS
  • Implemented in C++ with Python and Java bindings
• Hadoop
  • Open source (Apache project)
  • Cascading, Java, Ruby, Perl, Python, PHP, R, C++ bindings (and certainly more)
  • Many side projects: HDFS, Pig, Hive, HBase, ZooKeeper
  • Used by Yahoo!, Facebook, Amazon and the Google-IBM research cloud
Other Specialized Runtime Environments

• Hardware platforms
  • Multicore (Phoenix)
  • FPGA (FPMR)
  • GPU (Mars)
• Frameworks based on MapReduce
  • Iterative MapReduce (Twister)
  • Incremental MapReduce (Nephele, Online MR)
  • Languages (Pig Latin)
  • Log analysis: in-situ MR
• Security
  • Data privacy (Airavat)
• MapReduce and virtualization
  • Amazon EC2
  • CAM
Hadoop Ecosystem

Figure: overview of the Hadoop ecosystem.
Research Areas to Improve MapReduce Performance

• File system
  • Higher throughput, reliability, replication efficiency
• Scheduling
  • Data/task affinity, scheduling multiple MR applications, hybrid distributed computing infrastructures
• Failure detection
  • Failure characterization, failure detectors and prediction
• Security
  • Data privacy, distributed result/error checking
• Greener MapReduce
  • Energy efficiency, renewable energy
Research Areas to Improve MapReduce Applicability

• Scientific computing
  • Iterative MapReduce, numerical processing, algorithmics
• Parallel stream processing
  • Language extensions, runtime optimizations for parallel stream processing, incremental processing
• MapReduce on widely distributed data sets
  • MapReduce on embedded devices (sensors), data integration
CloudBlast: MapReduce and Virtualization for BLAST Computing

NCBI BLAST
• The Basic Local Alignment Search Tool (BLAST)
  • finds regions of local similarity between sequences
  • compares nucleotide or protein sequences to sequence databases and computes the statistical significance of matches

Hadoop + BLAST
• The stream of input sequences is split by Hadoop (chunks are split on newlines; a StreamPatternRecordReader turns chunks into sequences used as mapper keys, with empty values)
• Each Mapper compares its sequences to the gene database; BLAST output is written to a local file and merged through the DFS
• Hadoop provides fault tolerance and load balancing
• The database itself is not chunked

CloudBlast deployment and performance
• ViNe is used to deploy Hadoop across several sites (24 Xen-based VMs at Univ. of Florida plus Xen-based VMs at Univ. of California); each VM contains Hadoop, BLAST and the 3.5 GB database
• Compared with mpiBLAST: the performance difference is more noticeable when sequences are short; virtualization overhead is barely noticeable; both systems scale, with a small advantage to CloudBlast

Cloudblast: Combining MapReduce and Virtualization on Distributed Resources for Bioinformatics Applications. A. Matsunaga, M. Tsugawa and J. Fortes. eScience'08, 2008.
Section 2
Design and Principles of MapReduce Execution Runtime
Subsection 1
File Systems

File System for Data-intensive Computing

MapReduce is associated with parallel file systems
• GFS (Google File System) and HDFS (Hadoop File System)
• Parallel, scalable, fault-tolerant file systems
• Based on inexpensive commodity hardware (failures are expected)
• Allow storage and computation to be mixed on the same nodes

GFS/HDFS specificities
• Many files, large files (> x GB)
• Two types of reads:
  • large streaming reads (MapReduce)
  • small random reads (usually batched together)
• Write once (rarely modified), read many
• High sustained throughput (favored over latency)
• Consistent concurrent execution
Example: the Hadoop Architecture

• Distributed computing layer (MapReduce): data analytics applications are submitted to the JobTracker
• Distributed storage layer (HDFS): managed by the NameNode
• Masters: the JobTracker (MapReduce) and the NameNode (HDFS)
• Slaves: each slave node runs a DataNode (storage) and a TaskTracker (computation)
HDFS Main Principles

HDFS roles
• Application
  • splits a file into chunks (the more chunks, the more parallelism)
  • asks HDFS to store the chunks
• NameNode
  • keeps track of file metadata (the list of chunks of each file)
  • ensures the distribution and replication (3 by default) of chunks on DataNodes
• DataNode
  • stores file chunks
  • replicates file chunks

Example: to write chunks A, B, C of File.txt, the application asks the NameNode, which answers "ok, use DataNodes 2, 4, 6".
Data Locality in HDFS

Rack awareness
• The NameNode groups DataNodes into racks (e.g., rack 1: DN1-DN4, rack 2: DN5-DN8)
• Assumption: network performance differs between in-rack and inter-rack communication
• Replication policy: keep 2 replicas in-rack and 1 in another rack (e.g., chunk A on DN1, DN3, DN5; chunk B on DN2, DN4, DN7), as sketched below
  • minimizes inter-rack communication
  • tolerates a full rack failure
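A minimal sketch of this placement rule, following the slide rather than the full HDFS block placement policy; the function name and the random tie-breaking are my own illustration.

```python
import random

def pick_replica_nodes(racks, local_rack):
    """Toy version of the rule on the slide: 2 replicas in the writer's rack,
    1 replica in a different rack. `racks` maps rack id -> list of DataNodes."""
    in_rack = random.sample(racks[local_rack], 2)
    other_racks = [r for r in racks if r != local_rack]
    off_rack = random.choice(racks[random.choice(other_racks)])
    return in_rack + [off_rack]

racks = {"rack1": ["DN1", "DN2", "DN3", "DN4"],
         "rack2": ["DN5", "DN6", "DN7", "DN8"]}
print(pick_replica_nodes(racks, "rack1"))   # e.g. ['DN1', 'DN3', 'DN5']
```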
Writing to HDFS

Protocol between the Client, the NameNode and the DataNodes (here DataNodes 1, 3 and 5):
• Client → NameNode: "write File.txt in dir d"; the NameNode answers "writeable"
• For each data chunk:
  • Client → NameNode: AddBlock A; the NameNode returns DataNodes 1, 3, 5 (IP, port, rack number)
  • the Client initiates a TCP connection to DataNode 1, which connects to DataNode 3, which connects to DataNode 5
  • "pipeline ready" messages flow back to the Client
• The NameNode checks permissions and selects the list of DataNodes
• A pipeline is opened from the Client through each DataNode
Pipeline Write in HDFS

For each data chunk (a small sketch follows this slide):
• the Client sends block A to DataNode 1, which forwards it to DataNode 3, which forwards it to DataNode 5; each hop acknowledges with "block received A"
• on success, the TCP connections of the pipeline are closed and the Client is notified
• DataNodes send and receive chunks concurrently, but blocks are transferred one by one
• the NameNode updates its metadata table (A → DN1, DN3, DN5)
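A toy sketch of the pipelined forwarding idea. This is not the real HDFS data-transfer protocol; the function and variable names are invented for illustration.

```python
def pipeline_write(block, pipeline):
    """Forward `block` along the ordered list of DataNodes and collect acks.

    Each node "stores" the block, forwards it downstream, and the success ack
    travels back from the tail of the pipeline toward the client.
    """
    storage = {}                      # stands in for each DataNode's local disk

    def forward(nodes):
        if not nodes:
            return True               # past the tail of the pipeline
        head, rest = nodes[0], nodes[1:]
        storage[head] = block         # node receives and stores the block
        downstream_ok = forward(rest) # ... while streaming it downstream
        return downstream_ok          # ack propagates back toward the client

    ok = forward(pipeline)
    return ok, storage

ok, replicas = pipeline_write(b"chunk-A", ["DN1", "DN3", "DN5"])
print(ok, sorted(replicas))           # True ['DN1', 'DN3', 'DN5']
```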
Fault Resilience and Replication

NameNode
• Ensures fault detection (a small sketch follows this slide)
  • a heartbeat is sent by every DataNode to the NameNode every 3 seconds
  • after 10 successful heartbeats, the DataNode sends a block report and the metadata are updated
• When a fault is detected
  • the NameNode lists, from the metadata table, the chunks that are now missing
  • the NameNode instructs DataNodes to re-replicate the chunks that were hosted on the faulty node

Backup NameNode
• updated every hour
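A minimal sketch of heartbeat-based failure detection and re-replication. The class, the timeout value and the method names are assumptions made for illustration, not HDFS source code.

```python
import time

HEARTBEAT_PERIOD = 3.0               # seconds, as on the slide
DEAD_AFTER = 10 * HEARTBEAT_PERIOD   # simplistic timeout, an assumption

class ToyNameNode:
    def __init__(self):
        self.last_heartbeat = {}     # datanode -> timestamp of last heartbeat
        self.block_map = {}          # block id -> set of datanodes holding it

    def heartbeat(self, datanode):
        self.last_heartbeat[datanode] = time.time()

    def dead_datanodes(self, now=None):
        now = now or time.time()
        return [dn for dn, t in self.last_heartbeat.items()
                if now - t > DEAD_AFTER]

    def blocks_to_replicate(self, replication=3):
        """Blocks whose live replica count dropped below the target."""
        dead = set(self.dead_datanodes())
        return [b for b, nodes in self.block_map.items()
                if len(nodes - dead) < replication]

nn = ToyNameNode()
for dn in ("DN1", "DN3", "DN5"):
    nn.heartbeat(dn)
nn.block_map["A"] = {"DN1", "DN3", "DN5"}
print(nn.dead_datanodes(now=time.time() + 60))   # all three after 60 s of silence
```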
Subsection 2
MapReduce Execution

Anatomy of a MapReduce Execution

The execution pipeline goes from the input files, through the data split, the mappers, the intermediate results (IR) and the reducers, to the final output files (optionally combined). Step by step:

• Distribution
  • split the input file into chunks
  • distribute the chunks to the DFS
• Map
  • mappers execute the Map function on each chunk
  • they compute the intermediate results, i.e. list(k, v)
  • optionally, a Combine function is executed to reduce the size of the IR
• Partition
  • intermediate results are sorted
  • IR are partitioned; each partition corresponds to one reducer
  • the partition function can be user-provided
• Shuffle
  • references to IR partitions are passed to their corresponding reducer
  • reducers access the IR partitions using remote-read RPCs
• Reduce
  • IR are sorted and grouped by their intermediate keys
  • reducers execute the Reduce function on the set of IR they have received
  • reducers write the output results
• Combine
  • the output of the MapReduce computation is available as R output files
  • optionally, the outputs are combined into a single result or passed as input to another MapReduce computation
Return to WordCount: Application-Level Optimization

The Combiner optimization (illustrated below)
• How to decrease the amount of information exchanged between mappers and reducers?
• → Combiner: apply the reduce function to the map output before it is sent to the reducer

// Pseudo-code for the Combine function
combine(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int partial_word_count = 0;
  foreach v in values:
    partial_word_count += ParseInt(v);
  Emit(key, AsString(partial_word_count));
Let's go further: Word Frequency

• Input: a large number of text documents
• Task: compute the word frequency across all the documents
  • → the frequency is computed using the total word count
• A naive solution chains two MapReduce programs:
  • MR1 counts the total number of words (using combiners)
  • MR2 counts the occurrences of each word and divides them by the total provided by MR1
• How to turn this into a single MR computation? Use two facts:
  1 reduce keys are processed in order
  2 a function EmitToAllReducers(k, v) is available
• Each map task sends its local total word count to ALL reducers under the key ""
  • the key "" is the first key processed by each reducer
  • the sum of its values gives the total number of words
Word Frequency Mapper and Reducer

// Pseudo-code for the Map function
map(String key, String value):
  // key: document name, value: document contents
  int word_count = 0;
  foreach word in split(value):
    EmitIntermediate(word, "1");
    word_count++;
  EmitIntermediateToAllReducers("", AsString(word_count));

// Pseudo-code for the Reduce function
reduce(String key, Iterator values):
  // key: a word, values: a list of counts
  if (key == ""):
    int total_word_count = 0;
    foreach v in values:
      total_word_count += ParseInt(v);
  else:
    int word_count = 0;
    foreach v in values:
      word_count += ParseInt(v);
    Emit(key, AsString(word_count / total_word_count));
Subsection 3
Key Features of MapReduce Runtime Environments

Handling Failures

On very large clusters made of commodity servers, failures are not a rare event.

Fault-tolerant execution
• Workers' failures are detected by heartbeat
• In case of a machine crash:
  1 in-progress Map and Reduce tasks are marked as idle and re-scheduled to another alive machine
  2 completed Map tasks are also rescheduled, since their output resides on the crashed machine's local disk
  3 all reducers are warned of the mappers' failure
• Result replica "disambiguation": only the first result produced is considered
• A simple fault-tolerance scheme, yet it handles the failure of a large number of nodes
Scheduling

Data-driven scheduling (a locality sketch follows this slide)
• Data-chunk replication
  • GFS divides files into 64 MB chunks and stores 3 replicas of each chunk on different machines
  • a map task is scheduled first to a machine that hosts a replica of its input chunk
  • second choice: the nearest machine, i.e. one on the same network switch
• Over-partitioning of the input file
  • #input chunks >> #worker machines
  • #mappers > #reducers

Speculative execution
• Backup tasks alleviate the slowdown caused by stragglers:
  • straggler: a machine that takes an unusually long time to complete one of the last few map or reduce tasks of a computation
  • backup task: at the end of the computation, the scheduler replicates the remaining in-progress tasks
  • the result is provided by the first task that completes
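A small sketch of the locality preference described above. This is an illustration, not Hadoop's or Google's actual scheduler; the helper names and the rack-naming convention are assumptions.

```python
def pick_map_host(chunk_replicas, idle_hosts, rack_of):
    """Prefer an idle host holding a replica of the chunk, then an idle host
    in the same rack as a replica, then any idle host."""
    replicas = set(chunk_replicas)
    # 1. data-local host
    for host in idle_hosts:
        if host in replicas:
            return host
    # 2. rack-local host (same network switch as a replica)
    replica_racks = {rack_of(h) for h in replicas}
    for host in idle_hosts:
        if rack_of(host) in replica_racks:
            return host
    # 3. fall back to any idle host
    return idle_hosts[0] if idle_hosts else None

rack_of = lambda h: h.split("-")[0]           # e.g. "rack1-DN3" -> "rack1"
print(pick_map_host(["rack1-DN3", "rack2-DN7"],
                    ["rack1-DN4", "rack3-DN9"], rack_of))   # rack1-DN4
```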
Speculative Execution in Hadoop

• When a node is idle, the scheduler selects a task in this order:
  1 failed tasks have the highest priority
  2 then non-running tasks with data local to the node
  3 then speculative tasks (i.e., backup tasks)
• Speculative task selection relies on a progress score between 0 and 1
  • Map task: the progress score is the fraction of the input read
  • Reduce task: three phases are considered (copy, sort, reduce); in each phase, the score is the fraction of the data processed
  • Example: a task halfway through the reduce phase has score 1/3 + 1/3 + (1/3 * 1/2) = 5/6 (see the helper below)
  • straggler: a task whose progress score is more than 0.2 below the average
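A small helper reproducing the scoring rule described on the slide; this is my own sketch of the heuristic, not Hadoop source code.

```python
def reduce_progress_score(copy_frac, sort_frac, reduce_frac):
    """Progress score of a reduce task: each of the three phases
    (copy, sort, reduce) contributes up to 1/3."""
    return (copy_frac + sort_frac + reduce_frac) / 3.0

def is_straggler(score, avg_score, threshold=0.2):
    """A task is flagged as a straggler if it trails the average score by 0.2."""
    return score < avg_score - threshold

# Example from the slide: halfway through the reduce phase
print(reduce_progress_score(1.0, 1.0, 0.5))   # 0.8333... = 5/6
print(is_straggler(0.5, 0.8))                 # True
```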
Section 3
Performance Improvements and Optimizations

Subsection 1
Handling Heterogeneity and Stragglers

Issues with the Hadoop Scheduler in Heterogeneous Environments

Hadoop makes several assumptions:
• nodes are homogeneous
• tasks are homogeneous

which lead to bad decisions:
• too many backups, thrashing shared resources like network bandwidth
• the wrong tasks are backed up
• backups may be placed on slow nodes
• the heuristic breaks when tasks start at different times
The LATE Scheduler

The LATE scheduler (Longest Approximate Time to End)
• estimates task completion times and backs up the task expected to finish latest (see the sketch after this slide)
• throttles backup execution:
  • limits the number of simultaneous backup tasks (SpeculativeCap)
  • the slowest nodes are not given backup tasks (SlowNodeThreshold)
  • only tasks that are sufficiently slow are backed up (SlowTaskThreshold)
• Estimating task completion time:
  • progress rate = progress score / execution time
  • estimated time left = (1 - progress score) / progress rate
• In practice: SpeculativeCap = 10% of the available task slots; SlowNodeThreshold and SlowTaskThreshold = 25th percentile of node progress and task progress rate

Improving MapReduce Performance in Heterogeneous Environments. Zaharia, M., Konwinski, A., Joseph, A. D., Katz, R. and Stoica, I. In OSDI'08: Eighth Symposium on Operating System Design and Implementation, San Diego, CA, December 2008.
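A direct transcription of the two formulas above into Python, using the numbers from the example slides that follow. The function name and the way the running times are supplied are my own.

```python
def estimated_time_left(progress_score, task_elapsed):
    """LATE heuristic for one task: progress_rate = progress_score / elapsed
    running time of the task; time left = (1 - progress_score) / progress_rate."""
    progress_rate = progress_score / task_elapsed
    return (1.0 - progress_score) / progress_rate

# Numbers from the example slides (times in minutes):
# Node 2 is 3x slower: 66% through its task after 2 minutes of execution
print(estimated_time_left(0.66, 2.0))    # ~1.0 -> finishes soon
# Node 3 is 1.9x slower: 5.3% through its second task, started 0.1 min ago
print(estimated_time_left(0.053, 0.1))   # ~1.8 -> finishes last, so back it up
```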
Progress Rate Example

Figure: three nodes running tasks over time, with snapshots at 1 min and 2 min. Node 1 processes 1 task/min; Node 2 is 3x slower; Node 3 is 1.9x slower.
LATE Example

Figure: at the 2-minute mark, Node 2 has progress 66% and estimated time left (1 - 0.66) / (1/3) = 1; Node 3 has progress 5.3% and estimated time left (1 - 0.05) / (1/1.9) = 1.8. LATE correctly picks Node 3 for a backup task.
LATE Performance (EC2 Sort benchmark)

• Heterogeneity experiment: 243 VMs on 99 hosts (1-7 VMs per host); each host sorts 128 MB, for a total of 30 GB of data
  • average 27% speedup over the native Hadoop scheduler, 31% over no backups
• Stragglers experiment: stragglers emulated by running background load on 4 machines
  • average 58% speedup over the native Hadoop scheduler, 220% over no backups
Understanding Outliers

• Stragglers: tasks that take ≥ 1.5x the median task duration in their phase
• Re-computes: when a task's output is lost, dependent tasks must wait until the output is regenerated

Causes of outliers
• Data skew: a few keys correspond to a lot of data; such tasks are well explained by the amount of data they process, so duplicating them would not make them run faster and would waste resources
• Network contention: about 70% of the cross-rack traffic is induced by the reduce phase
• Bad machines: half of the recomputes happen on 5% of the machines

Figure: prevalence of outliers, (a) what fraction of tasks in a phase are outliers, (b) how much longer outliers take to finish than the median task.
The Mantri Scheduler

• Cause-aware and resource-aware scheduling of outliers
• Real-time monitoring of task progress
• By inducing high variability in repeat runs of the same job, outliers make it hard to meet SLAs: jobs have a non-trivial probability of taking twice as long, or finishing half as quickly, as a typical run

Figures: percentage speed-up of job completion time in the ideal case where (some combination of) outliers do not occur; the outlier problem, its causes (unequal work division, input becoming unavailable, network contention) and its solutions (start tasks that do more first, network-aware placement, replicate output, duplicate or kill & restart, pre-compute).

Reining in the Outliers in Map-Reduce Clusters using Mantri. G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha, E. Harris. In OSDI'10: Ninth Symposium on Operating System Design and Implementation, Vancouver, Canada, December 2010.
Resource-Aware Restart

• Restart outlier tasks on better machines
• Consider two options: kill & restart, or duplicate (decision rule sketched below)
  • act only when a fresh copy is likely to be faster: P(t_new < t_rem) is high
  • restart if the remaining time is large: t_rem > E(t_new) + Δ
  • duplicate only if it decreases the total amount of resources used: P(t_rem > t_new * (c+1)/c) > δ, where c is the number of running copies and δ = 0.25
• A subtle point of outlier mitigation: reducing the completion time of one task may in fact increase the job completion time (for example, replicating the output of every task would drastically reduce recomputations, but waste resources)
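A sketch of this decision rule under simplifying assumptions: point estimates are passed in instead of the full runtime distributions that Mantri maintains, and the names are invented for illustration.

```python
def mitigation_action(t_rem, t_new_est, p_new_faster, p_duplicate_pays,
                      delta=0.25, slack=1.0):
    """Toy version of the restart logic above. t_rem: estimated time the outlier
    still needs; t_new_est: expected runtime of a fresh copy; p_new_faster:
    estimate of P(t_new < t_rem); p_duplicate_pays: estimated probability that a
    duplicate reduces the total resources used."""
    if p_new_faster < 0.5 or t_rem <= t_new_est + slack:
        return "leave"            # close to done, or a new copy is unlikely to help
    if p_duplicate_pays > delta:
        return "duplicate"        # run a copy in parallel, keep the original
    return "kill-and-restart"     # reclaim the slot and restart elsewhere

print(mitigation_action(t_rem=10.0, t_new_est=3.0,
                        p_new_faster=0.9, p_duplicate_pays=0.4))   # duplicate
```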
Network-Aware Placement

• Schedule tasks so as to minimize network congestion
• Local approximation of the optimal placement:
  • does not track bandwidth changes
  • does not require global coordination
• Placement of a reduce phase with n tasks running on a cluster with r racks, with I(n, r) the input data matrix:
  • for each rack i, let d_u^i and d_v^i be the data to be uploaded and downloaded, and b_u^i and b_d^i the corresponding available bandwidths
  • the chosen placement is arg min_i ( max( d_u^i / b_u^i , d_v^i / b_d^i ) )  (worked out below)
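A small worked example of the cost above; the numbers are hypothetical and the function name is mine.

```python
def best_rack(upload, download, up_bw, down_bw):
    """Pick the rack whose worst-case transfer time (upload or download,
    data volume divided by available bandwidth) is smallest."""
    def cost(i):
        return max(upload[i] / up_bw[i], download[i] / down_bw[i])
    return min(range(len(upload)), key=cost)

# Hypothetical per-rack numbers: data in GB, bandwidth in GB/s
upload   = [4.0, 1.0, 2.0]
download = [1.0, 3.0, 2.0]
up_bw    = [1.0, 1.0, 1.0]
down_bw  = [1.0, 0.5, 1.0]
print(best_rack(upload, download, up_bw, down_bw))   # rack 2, cost max(2, 2) = 2
```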
Mantri Performance

• Evaluated against the baseline implementation on a production cluster of thousands of servers, for four representative jobs
• Network-aware data placement reduces the completion time of the typical reduce phase by 31%, and speeds up half of the reduce phases by more than 60%
• Compared with existing straggler mitigation strategies, Mantri provides a greater speed-up in completion time while using fewer resources

Figures: changes in completion time and resource usage for Mantri's network-aware spread of tasks versus the baseline; comparison of straggler mitigation strategies.
Subsection 2
Extending MapReduce Application Range

!"#$%&'()(*%&'+,-&(.+/0&123&(

Twister: Iterative MapReduce
!  !"#$"%&$"'%('%(#$)$"&()%*(

+),")-./(*)$)(

!  (0'%123,)-./...
MOON: MapReduce on Opportunistic Environments

• Uses volatile resources during their idle time
• Builds on the Hadoop runtime → Hadoop does not tolerate massive failures
• The MOON approach:
  • hybrid resources: very volatile nodes plus a set of more stable nodes
  • two queues are managed for speculative execution: frozen tasks and slow tasks
  • 2-phase task replication: the job is separated into a normal and a homestretch phase → the job enters the homestretch phase when the number of remaining tasks is below H% of the available execution slots

Lin, Heshan, et al. MOON: MapReduce on Opportunistic Environments. International Symposium on High Performance Distributed Computing (HPDC 2010).
Towards Greener Data Centers

• Sustainable computing infrastructures that minimize the impact on the environment
  • reduction of energy consumption
  • reduction of greenhouse gas emissions
  • better use of renewable energy (smart grids, co-location, etc.)
• Example: the solar array of the Apple Maiden data center (about 400,000 m2, 20 MW peak)
Section 4
MapReduce beyond the Data Center

Introduction to Computational Desktop Grids

High-throughput computing over large sets of desktop computers
• Idea: harvest the computing power of PCs during their idle time (screensaver-style cycle stealing), run large parallel applications, and let users measure their own contribution (an economy of donation)
• Volunteer computing systems (SETI@Home, Distributed.net)
• Open-source desktop grid middleware (BOINC, XtremWeb, Condor, OurGrid)
• Some numbers for BOINC (October 21, 2013):
  • active: 0.5M users, 1M hosts
  • 24-hour average: 12 PetaFLOPS
  • EC2 equivalent: about $1b/year
MapReduce over the Internet

Computing on volunteers' PCs (a la SETI@Home)
• Potential of aggregate computing power and storage
  • average PC: 1 TFlops, 4 GB RAM, 1 TB disk
  • 1 billion PCs, 100 million GPUs, plus tablets, smartphones and set-top boxes
  • hardware and electrical power are paid for by the consumers
  • self-maintained by the volunteers
But ...
• High number of resources, low individual performance
• Volatility; resources are owned by the volunteers
• Scalable, but mainly for embarrassingly parallel applications with few I/O requirements
• The challenge is to broaden the application domain
Subsection 1
Background on BitDew

Large Scale Data Management

BitDew: a programmable environment for large-scale data management
• Key idea 1: provides an API and a runtime environment that integrate several P2P technologies in a consistent way
• Key idea 2: relies on metadata (data attributes) to transparently drive data management operations: replication, fault tolerance, distribution, placement, life cycle

BitDew: A Programmable Environment for Large-Scale Data Management and Distribution. Gilles Fedak, Haiwu He and Franck Cappello. International Conference for High Performance Computing (SC'08), Austin, 2008.
BitDew: the Big Cloudy Picture

• Aggregates storage into a single data space
  • clients put and get data from the data space
  • clients define data attributes (e.g., REPLICA = 3)
• Distinguishes service nodes (stable) from client and worker/reservoir nodes (volatile)
  • services ensure fault tolerance, indexing and scheduling of data onto worker nodes
  • workers store data on desktop PCs
• Push/pull protocol: clients push data to the services, workers pull from them (client → service ← worker)
Data Attributes

• REPLICA: how many occurrences of a datum should be available at the same time in the system
• RESILIENCE: controls the resilience of the datum in presence of machine crashes
• LIFETIME: a duration, absolute or relative to the existence of other data, indicating when a datum becomes obsolete
• AFFINITY: drives the movement of data according to dependency rules
• TRANSFER PROTOCOL: gives the runtime environment hints about the file transfer protocol appropriate to distribute the datum
• DISTRIBUTION: the maximum number of pieces of data with the same attribute that may be sent to a particular node

(A small illustrative sketch of such an attribute set follows.)
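BitDew's actual API is not shown in the slides; the Python mock below simply mirrors the attribute names listed above, so the class and field names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DataAttribute:
    """Illustrative stand-in for a BitDew-style attribute set
    (field names follow the slide, not the real BitDew API)."""
    replica: int = 1                    # simultaneous copies kept in the system
    resilience: bool = False            # re-schedule the datum if its host crashes
    lifetime: Optional[str] = None      # absolute or relative expiry
    affinity: Optional[str] = None      # id of the data it should be placed with
    transfer_protocol: str = "ftp"      # hint: ftp, http, bittorrent, ...
    distribution: Optional[int] = None  # max pieces per node

# Attribute used in the fault-tolerance scenario later in the talk:
ft_attr = DataAttribute(replica=5, resilience=True, transfer_protocol="ftp")
print(ft_attr)
```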
Architecture Overview

BitDew runtime environment (applications and command-line tools sit on top of the BitDew service API; services run in a service container; back-ends sit below)
• Programming APIs to create data and attributes, manage file transfers and program applications
• Services to store, index, distribute, schedule, transfer and provide resiliency to the data: Data Catalog (DC), Data Repository (DR), Data Transfer (DT), Data Scheduler (DS), plus Active Data and the Transfer Manager
• Several information storage back-ends (SQL server, DHT, ...) and file transfer protocols (HTTP, FTP, BitTorrent)
Anatomy of a BitDew Application

• Clients
  1 create Data and Attributes
  2 put a file into the created data slot
  3 schedule the data with its corresponding attribute
• Reservoir (worker) nodes
  • are notified when data are locally scheduled (onDataScheduled) or deleted (onDataDeleted)
  • programmers install callbacks on these events
• Clients and reservoirs can both create data and install callbacks
Examples of BitDew Applications

• Data-intensive applications
  • DI-BOT: data-driven master-worker Arabic character recognition (M. Labidi, University of Sfax)
  • MapReduce (X. Shi, L. Lu, HUST, Wuhan, China)
  • Framework for distributed result checking (M. Moca, G. S. Silaghi, Babes-Bolyai University of Cluj-Napoca, Romania)
• Data management utilities
  • File sharing for social networks (N. Kourtellis, Univ. of Florida)
  • Distributed checkpoint storage (F. Bouabache, Univ. Paris XI)
  • Grid data staging (IEP, Chinese Academy of Sciences)
Experiment Setup

Hardware setup
• Experiments are conducted in a controlled environment (a cluster for the microbenchmarks and Grid5K for the scalability tests) as well as on an Internet platform
• GdX cluster: 310-node cluster of dual-Opteron CPUs (AMD-64 2 GHz, 2 GB SDRAM), 1 Gb/s Ethernet switches
• Grid5K: an experimental grid of 4 clusters, including GdX
• DSL-Lab: a low-power, low-noise platform hosted by real Internet users on DSL broadband networks; 20 Mini-ITX nodes, Pentium-M 1 GHz, 512 MB SDRAM, 2 GB flash storage
• Each experiment is averaged over 30 measures
Fault-Tolerance Scenario

Evaluation of BitDew in the presence of host failures
• Scenario: a datum is created with the attributes REPLICA = 5, FAULT_TOLERANCE = true and PROTOCOL = "ftp". Every 20 seconds, we simulate a machine crash by killing the BitDew process on one machine owning the data.
• The experiment runs in the DSL-Lab environment.

Figure: Gantt chart of the main events on nodes DSL01-DSL10 (observed bandwidths ranging from 53 KB/s to 492 KB/s). Blue boxes show the download duration, red boxes the waiting time, and red stars indicate node crashes.
Collective File Transfer with Replication

Scenario: all-to-all file transfer with replication
• {d1, d2, d3, d4} are created, where each datum has the REPLICA attribute set to 4
• The all-to-all collective is implemented with the AFFINITY attribute; the attributes are modified on the fly so that d1.affinity = d2, d2.affinity = d3, d3.affinity = d4, d4.affinity = d1

Results: experiments on DSL-Lab
• one run every hour during approximately 12 hours
• during the collective, file transfers are performed concurrently; the data size is 5 MB
• Figure: CDF of the achieved bandwidth (KB/s) for the individual data d1-d4 and for the whole all-to-all collective.
Multi-source and Multi-protocol File Distribution

Scenario
• 4 clusters of 50 nodes each; each cluster hosts its own data repository (DC/DR)
• first phase: a client uploads a large datum to the 4 data repositories
• second phase: the cluster nodes get the datum from the data repository of their own cluster
• We vary:
  1 the number of data repositories, from 1 to 4
  2 the protocol used to distribute the data (BitTorrent vs FTP)
Multi-source and Multi-protocol File Distribution: Results

Table: measured bandwidth (MB/s) per site (Orsay, Nancy, Bordeaux, Rennes) for BitTorrent and FTP, in the first and second phases, with 1, 2 or 4 data repositories (1DR/1DC, 2DR/2DC, 4DR/4DC).

• For both protocols, increasing the number of data sources also increases the available bandwidth
• FTP performs better in the first phase, while BitTorrent gives superior performance in the second phase
• The best combination (4DC/4DR, FTP for the first phase and BitTorrent for the second) takes 26.46 seconds and outperforms the best single-source, single-protocol (BitTorrent) configuration by 168.6%
Subsection 2
MapReduce for Desktop Grid Computing

Challenge of MapReduce over the Internet

• Needs data replica management
• Needs result certification of intermediate data
• No shared file system, nor direct communication between hosts
• Faults and host churn
• Collective operations (scatter + gather/reduction)

Towards MapReduce for Desktop Grids. B. Tang, M. Moca, S. Chevalier, H. He, G. Fedak. In Proceedings of the Fifth International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), Fukuoka, Japan, November 2010.
Implementing MapReduce over BitDew (1/2)

Initialisation by the Master (sketched in code below)
1 Data slicing: the Master creates a DataCollection DC = {d1, ..., dn}, the list of data chunks sharing the same MapInput attribute
2 It creates reducer tokens, each with its own TokenReducX attribute (affinity set to the Reducer)
3 It creates and pins collector tokens with the TokenCollector attribute (affinity set to the Collector token)

Execution of Map tasks
• {on the Master node} create a single data MapToken with affinity set to DC
• {on the Worker node} when MapToken is scheduled, for all local chunks d(j,m), execute Map(d(j,m)) and create the intermediate results list(j,m)(k, v)

Shuffling intermediate results
• {on the Master node} for all r in [1, ..., R]: create an attribute ReduceAttr_r with distrib = 1, create data ReduceToken_r and schedule it with that attribute
• {on the Worker node} split list_m(k, v) into intermediate files if(m,1), ..., if(m,R); for each file if(m,r), create the reduce input data ir(m,r) and upload if(m,r)
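To tie this back to the attribute sketch earlier, here is a toy rendering of the master-side initialisation. It is illustrative Python, not the actual MapReduce-over-BitDew implementation; identifiers and the dictionary-based attributes are assumptions.

```python
def master_init(chunks, num_reducers):
    """Toy rendering of the master-side initialisation described above
    (attribute names follow the slide; this is not the real BitDew API)."""
    # 1. Data slicing: every chunk of the collection shares the MapInput attribute
    map_input_attr = {"replica": 1, "resilience": True, "protocol": "bittorrent"}
    data_collection = [{"id": f"d{i}", "payload": c, "attr": map_input_attr}
                       for i, c in enumerate(chunks, start=1)]
    # 2. A single MapToken whose affinity points to the whole DataCollection
    map_token = {"id": "MapToken", "attr": {"affinity": "DC"}}
    # 3. One ReduceToken per reducer, each with distribution = 1
    reduce_tokens = [{"id": f"ReduceToken{r}", "attr": {"distribution": 1}}
                     for r in range(1, num_reducers + 1)]
    return data_collection, map_token, reduce_tokens

dc, map_token, reduce_tokens = master_init([b"chunk-a", b"chunk-b"], num_reducers=2)
print(len(dc), map_token["id"], [t["id"] for t in reduce_tokens])
```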
Handling Communications

Latency hiding
• Multi-threaded workers overlap communication and computation.
• The number of ...
Scheduling and Fault-tolerance

Two-level scheduling
1 Data placement is ensured by the BitDew scheduler, which is mainly guided by the data attributes.
Data Security and Privacy

Distributed result checking
• Traditional Desktop Grid or Volunteer Computing projects implement result checking on the server side.
BitDew Collective Communication

Figure: performance of BitDew collective communications (completion time as a function of the number of nodes).
MapReduce Evaluation

Figure: throughput (MB/s) of the MapReduce-over-BitDew prototype as the number of nodes increases (2.7 GB data set).
1;38CA<
1;3"CA<
(;6#<

(;08<

(;6A<
78

(;0"<

1;
0
1; 8 <
0" <

(;08<

E

%;68<

8EE

1;:8<

!"#"$%
54,/64

8AE

1;38C$<
...
In case of substantial unnecessary in Table
the HDFS needs
execution progress at all.inter-communication between two
On th...
Section 5
Conclusion

Conclusion

• MapReduce runtimes
  • an attractive programming model for data-intensive computing
  • many research issues concerning performance, fault tolerance and the broadening of the application range
Mapreduce Runtime Environments: Design, Performance, Optimizations

  1. 1. MapReduce Runtime Environments Design, Performance, Optimizations Gilles Fedak Many Figures imported from authors’ works Gilles.Fedak@inria.fr INRIA/University of Lyon, France Escola Regional de Alto Desempenho - Região Nordeste G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 1 / 84
  2. 2. Outline of Topics 1 What is MapReduce ? The Programming Model Applications and Execution Runtimes 2 Design and Principles of MapReduce Execution Runtime File Systems MapReduce Execution Key Features of MapReduce Runtime Environments 3 Performance Improvements and Optimizations Handling Heterogeneity and Stragglers Extending MapReduce Application Range 4 MapReduce beyond the Data Center Background on BitDew MapReduce for Desktop Grid Computing 5 Conclusion G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 2 / 84
  3. 3. Section 1 What is MapReduce ? G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 3 / 84
  4. 4. What is MapReduce ? • Programming Model for data-intense applications • Proposed by Google in 2004 • Simple, inspired by functionnal programming • programmer simply defines Map and Reduce tasks • Building block for other parallel programming tools • Strong open-source implementation: Hadoop • Highly scalable • Accommodate large scale clusters: faulty and unreliable resources MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat, in OSDI’04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004. G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 4 / 84
  5. 5. Subsection 1 The Programming Model G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 5 / 84
  6. 6. MapReduce in Lisp • (map f (list l1 , . . . , ln )) • (map square ‘(1 2 3 4)) • (1 4 9 16) • (+ 1 4 9 16) • (+1 (+4 (+9 16))) • 30 G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 6 / 84
  7. 7. MapReduce a la Google Map(k , v ) → (k , v ) Mapper Group(k , v ) by k Reduce(k , {v1 , . . . , vn }) → v Reducer Input Output • Map(key , value) is run on each item in the input data set • Emits a new (key , val) pair • Reduce(key , val) is run for each unique key emitted by the Map • Emits final output G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 7 / 84
  8. 8. Exemple: Word Count • Input: Large number of text documents • Task: Compute word count across all the document • Solution : • Mapper : For every word in document, emits (word, 1) • Reducer : Sum all occurrences of word and emits (word, total_count) G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 8 / 84
  9. 9. WordCount Example / / Pseudo−code f o r t h e Map f u n c t i o n map( S t r i n g key , S t r i n g v a l u e ) : / / key : document name , / / v a l u e : document c o n t e n t s f o r e a c h word i n s p l i t ( v a l u e ) : E m i t I n t e r m e d i a t e ( word , ’ ’ 1 ’ ’ ) ; / / Peudo−code f o r t h e Reduce f u n c t i o n reduce ( S t r i n g key , I t e r a t o r v a l u e ) : / / key : a word / / v a l u e s : a l i s t o f counts i n t word_count = 0 ; foreach v i n values : word_count += P a r s e I n t ( v ) ; Emit ( key , A s S t r i n g ( word_count ) ) ; G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 9 / 84
  10. 10. Subsection 2 Applications and Execution Runtimes G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 10 / 84
  11. 11. MapReduce Applications Skeleton • Read large data set • Map: extract relevant information from each data item • Shuffle and Sort • Reduce : aggregate, summarize, filter, or transform • Store the result • Iterate if necessary G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 11 / 84
  12. 12. MapReduce Usage at Google Jerry Zhao (MapReduce Group Tech lead @ Google) G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 12 / 84
  13. 13. MaReduce + Distributed Storage !  !"#$%&'(()#*$+),--"./,0$1)#/0*),)1/-0%/2'0$1)-0"%,3$) -4-0$5) Powerfull when associated with a distributed storage system ! •678)96""3($)78:;)<=78)9<,1"">)7/($)84-0$5:) • GFS (Google FS), HDFS (Hadoop File System) G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 13 / 84
  14. 14. The Execution Runtime • MapReduce runtime handles transparently: • • • • • • Parallelization Communications Load balancing I/O: network and disk Machine failures Laggers G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 14 / 84
  15. 15. MapReduce Compared Jerry Zhao G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 15 / 84
  16. 16. Existing MapReduce Systems #$%&'()*+,-./0-(12$3-4$(( ""#$%&'()*%+,-%& • Google MapReduce   ./")/0%1(/2&(3+&-$"4%+&4",/-%& • Proprietary and close source   5$"4%$2&$036%+&1"&!78& to GFS • Closely linked   9:)$%:%31%+&03&5;;&&<01=&.21="3&(3+&>(?(&@03+03#4& • Implemented in C++ with Python and Java Bindings   8(<A($$& • Hadoop (+"")& • Open source (Apache projects) • Cascading, Java, Ruby, Perl, Python, PHP, R, C++ bindings (and certainly   C)%3&4",/-%&DE)(-=%&)/"F%-1G& more)   5(4-(+03#H&>(?(H&*,@2H&.%/$H&.21="3H&.B.H&*H&"/&5;;& • Many side projects: HDFS, PIG, Hive, Hbase, Zookeeper   '(32&40+%&)/"F%-14&I&BJ78H&.9!H&B@(4%H&B0?%H&K""6%)%/& • Used by Yahoo!, Facebook, Amazon and the Google IBM research Cloud   L4%+&@2&M(=""NH&7(-%@""6H&E:(A"3&(3+&1=%&!""#$%&9O'& /%4%(/-=&5$",+& G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 16 / 84
  17. 17. Other Specialized Runtime Environments • Hardware Platform • Frameworks based on • Multicore (Phoenix) • FPGA (FPMR) • GPU (Mars) MapReduce • Iterative MapReduce (Twister) • Security • Incremental MapReduce • Data privacy (Airvat) (Nephele, Online MR) • MapReduce/Virtualization • Languages (PigLatin) • Log analysis : In situ MR • EC2 Amazon • CAM G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 17 / 84
  18. 18. Hadoop Ecosystem G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 18 / 84
  19. 19. Research Areas to Improve MapReduce Performance • File System • Higher Throughput, reliability, replication efficiency • Scheduling • data/task affinity, scheduling multiple MR apps, hybrid distributed computing infrastructures • Failure detections • Failure characterization, failure detector and prediction • Security • Data privacy, distributed result/error checking • Greener MapReduce • Energy efficiency, renewable energy G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 19 / 84
  20. 20. Research Areas to Improve MapReduce Applicability • Scientific Computing • Iterative MapReduce, numerical processing, algorithmics • Parallel Stream Processing • language extension, runtime optimization for parallel stream processing, incremental processing • MapReduce on widely distributed data-sets • Mapreduce on embedded devices (sensors), data integration G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 20 / 84
  21. 21. CloudBlast: MapReduce and Virtualization for BLAST Computing NCBI BLAST • The Basic Local Alignment Search Tool (BLAST) • finds regions of local similarity between Split into chunks based on new line. CCAAGGTT >seq1 AACCGGTT >seq2 >seq3 GGTTAACC >seq4 TTAACCGG sequences. StreamPatternRecordReader splits chunks into sequences as keys for the mapper (empty value). • compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. Hadoop + BLAST • stream of sequences are split by Hadoop. • Mapper compares each sequence to the gene database • provides fault-tolerance, load balancing. G. Fedak (INRIA, France) >seq3 GGTTAACC >seq4 TTAACCGG >seq1 AACCGGTT >seq2 CCAAGGTT BLAST Mapper BLAST Mapper BLAST output is written to a local file. part-00000 part-00001 DFS merger • HDFS is used to merge the result • database are not chunked MapReduce 'I'8%1#</ &';<%&8'; 455A#841#< '/:#&</C Y)!,55 P3!#C4$ 02'/! /' C4#/14#/' 1'C5A41'! 4%1<C41#8 &'A#':';! % ;#1';?! =&< 7'5A<FC' 1<! 7'4A! 0 .'C5A41' E#<A<$#;1; #/1'&5&'1#/ 455A#4/8' StreamInputFormat ! "#$%&'!()!! *+,-.#/$!0#12!3456'7%8')!9#:'/!4!;'1!<=!#/5%1!;'>%'/8';?! @47<<5!;5A#1;!12'!#/5%1!=#A'!#/1<!82%/B;!<=!;4C'!;#D'?!5<;;#EAF!;5A#11#/$!#/! 12'!C#77A'!<=!4!;'>%'/8')!-1&'4CG411'&/6'8<&76'47'&!04;!7':'A<5'7!1<! Cloudblast: Combining mapreduce and virtualization on #/1'&5&'1!8<&&'81AF!12'!E<%/74&#';!<=!;'>%'/8';?!02#82!4&'!54;;'7!4;!B'F;!<=! distributed resources for bioinformatics applications. A. 4!B'FH:4A%'!54#&!1<!12'!3455'&!1241!'I'8%1';!*+,-.)!J"-!&'1&#':';!4AA! Matsunaga, M.A<84A!=#A'!<%15%1;!8<CE#/#/$!#/1<!4!;#/$A'!&';%A1)! Tsugawa, & J. Fortes. (eScience’08), 2008. <5'&41#/$!;F;1'C;!4/7!477#1#</4A!;<=104&'!K')$)!1<!;%55<&1!4! 54&1#8%A4&! 1F5'! <=! /'10<&B! 5&<1<8<A! <&! 1'82/<A<$FL)! -#C#A4&AF?!4!@47<<5ME4;'7!455&<482!&'>%#&';!C482#/';!0#12! 12'! @47<<5! &%/1#C'! 4/7! =#A'! ;F;1'C! ;<=104&'?! 02#82! 84/! </AF!'I'8%1'!#=!</'! <=!;':'&4A!488'514EA'!:'&;#</!<=!N4:4!#;! Salvador da Bahia, Brazil, 22/10/2013 !"! #$%&% *#<#/= ;'1;!<=!': 5&<1'#/! ;' YL)! G'&#<7 <=='&! %5M %5741';!#; &';<%&8'; J414! 9P"-!TYc 12'! 7#;1&# @<0':'&? 7';#$/'7! C4/4$'C <E14#/'7! /<7';?! #7 47:4/14$' 21 / 84
  22. 22. ,'&L?&-0.>'! 8#LL'&'.>'! 7'/N''.! =1?%8FGHIJ! 0.8! -,#FGHIJ! #2! -?&'! .?/#>'071'! N@'.! 2'K%'.>'2! 0&'! 2@?&/T! #.8#>0/#.$! 0.! 'LL#>#'./! #-,1'-'./0/#?.! ?L! #.,%/! 2'K%'.>'! 2,1#/!0.8!V?7!>??&8#.0/#?.!7M!C08??,)!JN?62#/'!&'2%1/2!2@?N! 7'//'&!2,''8%,!/@0.!?.'62#/'!8%'!/?!,&?>'22?&2!/@0/!0&'!L02/!0/! ;=!P0,,&?+#-0/'1M!<)3!/#-'2!L02/'&R)!J@'!,%71#>1M!0B0#1071'! -,#FGHIJ! >0.! 7'! L%&/@'&! @0.86/%.'8! L?&! @#$@'&! JHFG*!EE)!! *5D*SE:*AJHG!I*Y;*A=*I!=CHSH=J*SEIJE=I)! • Use Vine to deploy Hadoop on several clusters (24 Xen-based ?,/#-#_'8! #-,1'-'./0/#?.2!12 .?/! ,'&L?&-0.>'T! 7%/! 2%>@! VM @ Univ. Florida + 0&'! ! "#$!%&'(! Calif.) "#$!)*&+,! ,%71#>1M!0>>'22#71'!a4b)! Xen based VMs at Univ WUX! WUX! -!&.!)/01/'2/)! H1/@?%$@! /@'! './#&'! /0&$'/! 80/0702'! L#/2! #./?! -'-?&M! #.! • Each VM contains Hadoop + BLAST + databse2'/%,! %2'8T! &%.2! N#/@! /@'! 80/0702'! 2'$-'./'8! #./?! [X! /@'! (3.5GB) (TZ<[! Z(X! 3&'(/),!)/01/'2/! L&0$-'./2!N'&'!,'&L?&-'8!#.!?&8'&!/?!#.>&'02'!/@'!0-?%./!?L! 3<(! <X<! 4*&+,/),!)/01/'2/! • Compared with MPIBlast .'/N?&O! >?--%.#>0/#?.! L?&! -,#FGHIJ! &%.2! P"#$)! ZR)! (<(! <X! 4,5'65+6!6/785,8&'! I,''8%,! $&0,@2! P"#$)! R! 2@?N! /@0/! /@'&'! N02! .?! #-,0>/! ?.! <TX3T<4[! 43(T[(! 9&,5%!'1:;/+!&.! /@'!,'&L?&-0.>'!8%'!/?!9A)! '12%/&,86/)! ! <<<U)Z3! 44[)4! <7/+5(/!'12%/&,86/)! >?-,%/'68?-#.0/'8! 0,,1#>0/#?.2! ?&! 'B'.! #-,&?B#.$ =/+!)/01/'2/! ,'&L?&-0.>'! 8%'! /?! 0&/#L0>/2! ?L! #-,1'-'./0/#?.T! ')$)! L#1' 4[!@?%&2!0.8! <!@?%&2!0.8! 4/01/',85%! 2M2/'-!8?%71'!>0>@#.$!a4Ub)!! 4U!-#.%/'2! XU!-#.%/'2! />/21,8&'!,8:/! J?! 'B01%0/'! /@'! 2>0107#1#/M! ?L! -,#FGHIJ! 0.8 =1?%8FGHIJT!'+,'&#-'./2!N#/@!8#LL'&'./!.%-7'&2!?L!.?8'2T ,&?>'22?&2T! 80/0702'! L&0$-'./2! 0.8! #.,%/! 80/0! 2'/2! N'&' E9)!*9HG;HJE]A!HA^!*5D*SE:*AJHG!S*I;GJI! '+'>%/'8)! *+0>/1M! 3! ,&?>'22?&2! 0&'! %2'8! #.! '0>@! .?8') J?! 'B01%0/'! /@'! ?B'&@'08! ?L! -0>@#.'! B#&/%01#_0/#?.T! Q@'.'B'&! ,?22#71'T! /@'! .%-7'&2! ?L! .?8'2! #.! '0>@! 2#/'! 0&' '+,'&#-'./2! N#/@! FGHIJ! N'&'! >?.8%>/'8! ?.! ,@M2#>01! O',/! 7010.>'8! P#)')T! L?&! '+,'&#-'./2! N#/@! Z! ,&?>'22?&2T! 3 -0>@#.'2!P,%&'!G#.%+!2M2/'-2!N#/@?%/!5'.!9::R!0.8!10/'&! .?8'2! #.! '0>@! 2#/'! 0&'! %2'8c! L?&! '+,'&#-'./2! N#/@! <U ?.! /@'! 5'.! 9:2! @?2/'8! 7M! /@'! 20-'! -0>@#.'2T! %2#.$! /@'! ,&?>'22?&2T! 4! .?8'2! 0&'! %2'8! #.! '0>@! 2#/'T! 0.8! 2?! ?.R)! "?& ,@M2#>01! .?8'2! 0/! ;")! 9#&/%01#_0/#?.! ?B'&@'08! #2! 70&'1M! '+,'&#-'./2! N#/@! U4! ,&?>'22?&2T! <3! .?8'2! 0/! ;=! 0.8! 3X! 0/ .?/#>'071'!N@'.!'+'>%/#.$!FGHIJ!02!#/!>0.!7'!?72'&B'8!#.! ! ;"! N'&'! %2'8! 8%'! /?! 1#-#/'8! .%-7'&! ?L! 0B0#1071'! .?8'2! 0/ "#$%&'!U)!! =?-,0&#2?.!?L!FGHIJ!2,''86%,2!?.!,@M2#>01!0.8!B#&/%01! "#$)!U)!J@#2!#2!8%'!/?!/@'!-#.#-01!.''8!L?&!E`]!?,'&0/#?.2!02! "#$%&'!()!! *+,'&#-'./01!2'/%,)!34!5'.6702'8!9:2!0/!;"!0.8!<3!5'.6 ;=)! -0>@#.'2)!I,''8%,!#2!>01>%10/'8!&'10/#B'!/?!2'K%'./#01!FGHIJ!?.!B#&/%01! 702'8!9:2!0/!;=!0&'!>?..'>/'8!/@&?%$@!9#A'!/?!,&?B#8'!0!8#2/&#7%/'8! 0! &'2%1/! ?L! &',1#>0/#.$! /@'! 80/0702'! #.! '0>@! .?8'! 0.8! L%11M! S'2%1/2!0&'!2%--0&#_'8!#.!"#$)!)!"?&!011!#.,%/!80/0!2'/2T &'2?%&>')!9:2!8'1#B'&!,'&L?&-0.>'!>?-,0&071'!N#/@!,@M2#>01!-0>@#.'2! '.B#&?.-'./!/?!'+'>%/'!C08??,!0.8!:DE6702'8!FGHIJ!0,,1#>0/#?.2)!H! 011?>0/#.$! #/! /?! -0#.! -'-?&M)! E/! 012?! >?.L#&-2! ?/@'&! =1?%8FGHIJ! 2@?N2! 0! 2-011! 08B0./0$'! ?B'&! -,#FGHIJ) N@'.!'+'>%/#.$!FGHIJ)! %2'&!>0.!&'K%'2/!0!B#&/%01!>1%2/'&!L?&!FGHIJ!'+'>%/#?.!7M!>?./0>/#.$!/@'! ?72'&B0/#?.2! ?L! B#&/%01#_0/#?.! @0B#.$! .'$1#$#71'! #-,0>/! ?.! "?&!#.,%/!80/0!2'/!?L!1?.$!2'K%'.>'2! 0!(6L?18!2,''8%,!>0. 
B#&/%01!N?&O2,0>'2!2'&B'&!P9QIR)! 7'! ?72'&B'8! N#/@! =1?%8FGHIJ! N@'.! %2#.$! U4! ,&?>'22?&2 G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 22 / 84 ?.!3!2#/'2T!N@#1'!-,#FGHIJ!8'1#B'&2!(36L?18!2,''8%,)!J@' ()!H!9S!9:!#2!>1?.'8!0.8!2/0&/'8!#.!'0>@!2#/'T!?LL'&#.$! #.,%/)! J@'! 7102/+! ,&?$&0-! L&?-! /@'! A=FE! FGHIJ! 2%#/'T! N@#>@! >?-,0&'2! .%>1'?/#8'2! 2'K%'.>'2! 0$0#.2/! ,&?/'#.! 2'K%'.>'2! 7M! /&0.210/#.$! /@'! .%>1'?/#8'2! #./?! 0-#.?! 0>#82T! N02! %2'8)! =@0&0>/'&#2/#>2! ?L! /@'! 3! #.,%/! 2'/2! 0&'! 2@?N.! #.! JHFG*!EE)! CloudBlast Deployment and Performance
  23. 23. Section 2 Design and Principles of MapReduce Execution Runtime G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 23 / 84
  24. 24. Subsection 1 File Systems G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 24 / 84
  25. 25. File System for Data-intensive Computing MapReduce is associated with parallel file systems • GFS (Google File System) and HDFS (Hadoop File System) • Parallel, Scalable, Fault tolerant file system • Based on inexpensive commodity hardware (failures) • Allows to mix storage and computation GFS/HDFS specificities • Many files, large files (> x GB) • Two types of reads • Large stream reads (MapReduce) • Small random reads (usually batched together) • Write Once (rarely modified), Read Many • High sustained throughput (favored over latency) • Consistent concurrent execution G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 25 / 84
  26. 26. Example: the Hadoop Architecture Distributed Computing (MapReduce) Data  Analy)cs   Applica)ons   JobTracker   Distributed Storage (HDFS) NameNode   Data  Node   TaskTracker   Data  Node   TaskTracker   Data  Node   TaskTracker   Data  Node   TaskTracker   Data  Node   TaskTracker   Data  Node   TaskTracker   Data  Node   TaskTracker   Data  Node   TaskTracker   Masters Data  Node   TaskTracker   G. Fedak (INRIA, France) MapReduce Slaves Salvador da Bahia, Brazil, 22/10/2013 26 / 84
  27. 27. HDFS Main Principles HDFS Roles File.txt • Application A   • Split a file in chunks • The more chunks, the more B   C   Write  chunks  A,   B,  C  of  file   File.txt   parallelism • Asks HDFS to store the Ok,  use  Data   nodes  2,  4,  6   NameNode   Applica'on   chunks • Name Node • Keeps track of File metadata (list of chunks) Data  Node  1   • Stores file chunks • Replicates file chunks G. Fedak (INRIA, France) MapReduce Data  Node  6   C   Data  Node  4   B   • Data Node Data  Node  5   Data  Node  3   replication (3) of chunks on Data Nodes Data  Node  4   Data  Node  2   A   • Ensures the distribution and Data  Node  7   Salvador da Bahia, Brazil, 22/10/2013 27 / 84
  28. 28. Data locality in HDFS Rack Awareness • Name node groups Data Node in racks rack 1 rack 2 DN1, DN2, DN3, DN4 DN5, DN6, DN7, DN8 • Assumption that network performance differs if the communication is in-rack vs inter-racks A   Data  Node  1   • Replication Data  Node  2   B   • Keep 2 replicas in-rack and 1 in an other rack A B A   Data  Node  3   Data  Node  4   DN1, DN3, DN5 DN2, DN4, DN7 Data  Node  4   B   Data  Node  5   A   Data  Node  6   Data  Node  7   B   • Minimizes inter-rack communications • Tolerates full rack failure G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 28 / 84
  29. 29. Writing to HDFS Client Data Node 1 Data Node 3 Data Node 5 Name Node write File.txt in dir d Writeable For each Data chunk AddBlock A Data nodes 1, 3, 5 (IP, port, rack number) initiate TCP connection Data Nodes 3, 5 initiate TCP connection Data Node 5 initiate TCP connection Pipeline Ready Pipeline Ready Pipeline Ready • Name Nodes checks permissions and select a list of Data Nodes • A pipeline is opened from the Client to each Data Nodes G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 29 / 84
  30. 30. Pipeline Write HDFS Client Data Node 1 Data Node 3 Data Node 5 Name Node For each Data chunk Send Block A Block Received A Send Block A Block Received A Send Block A Block Received A Success Close TCP connection Success Close TCP connection Success Close TCP connection Success • Data nodes send and receive chunk concurrently • Blocks are transferred one by one • Name Node updates the meta data table A G. Fedak (INRIA, France) DN1, DN3, DN5 MapReduce Salvador da Bahia, Brazil, 22/10/2013 30 / 84
  31. 31. Fault Resilience and Replication NameNode • Ensures fault detection • Heartbeat is sent by DataNode to the NameNode every 3 seconds • After 10 successful heartbeats : Block Report • Meta data is updated • Fault is detected • NameNode lists from the meta-data table the missing chunks • Name nodes instruct the Data Nodes to replicate the chunks that were on the faulty nodes Backup NameNode • updated every hour G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 31 / 84
  32. 32. Subsection 2 MapReduce Execution G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 32 / 84
  33. 33. Anatomy of a MapReduce Execution Data Split Intermediate Results Input Files Final Output Output Files Mapper Reducer Output2 Reducer Mapper Output1 Reducer Mapper Output3 Combine Mapper Mapper Distribution • Distribution : • Split the input file in chunks • Distribute the chunks to the DFS G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 33 / 84
  34. 34. Anatomy of a MapReduce Execution Data Split Intermediate Results Input Files Final Output Output Files Mapper Reducer Output2 Reducer Mapper Output1 Reducer Mapper Output3 Combine Mapper Mapper Map • Map : • Mappers execute the Map function on each chunks • Compute the intermediate results (i.e., list(k , v )) • Optionally, a Combine function is executed to reduce the size of the IR G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 33 / 84
  35. 35. Anatomy of a MapReduce Execution Data Split Intermediate Results Input Files Final Output Output Files Mapper Reducer Output2 Reducer Mapper Output1 Reducer Mapper Output3 Combine Mapper Mapper Partition • Partition : • Intermediate results are sorted • IR are partitioned, each partition corresponds to a reducer • Possibly user-provided partition function G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 33 / 84
  36. 36. Anatomy of a MapReduce Execution Data Split Intermediate Results Input Files Final Output Output Files Mapper Reducer Output2 Reducer Mapper Output1 Reducer Mapper Output3 Combine Mapper Mapper Shuffle • Shuffle : • references to IR partitions are passed to their corresponding reducer • Reducers access to the IR partitions using remore-read RPC G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 33 / 84
  37. 37. Anatomy of a MapReduce Execution Data Split Intermediate Results Input Files Final Output Output Files Mapper Reducer Output2 Reducer Mapper Output1 Reducer Mapper Output3 Combine Mapper Mapper Reduce • Reduce: • IR are sorted and grouped by their intermediate keys • Reducers execute the reduce function to the set of IR they have received • Reducers write the Output results G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 33 / 84
  38. 38. Anatomy of a MapReduce Execution Data Split Intermediate Results Input Files Final Output Output Files Mapper Reducer Output2 Reducer Mapper Output1 Reducer Mapper Output3 Combine Mapper Mapper Combine • Combine : • the output of the MapReduce computation is available as R output files • Optionally, output are combined in a single result or passed as a parameter for another MapReduce computation G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 33 / 84
  39. 39. Return to WordCount: Application-Level Optimization The Combiner Optimization • How to decrease the amount of information exchanged between mappers and reducers ? → Combiner : Apply reduce function to map output before it is sent to reducer / / Pseudo−code f o r t h e Combine f u n c t i o n combine ( S t r i n g key , I t e r a t o r v a l u e s ) : / / key : a word , / / v a l u e s : l i s t o f counts i n t partial_word_count = 0; f o r each v i n v a l u e s : p a r t i a l _ w o r d _ c o u n t += p a r s e I n t ( v ) ; Emit ( key , A s S t r i n g ( p a r t i a l _ w o r d _ c o u n t ) ; G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 34 / 84
  40. 40. Let’s go further: Word Frequency • Input : Large number of text documents • Task: Compute word frequency across all the documents → Frequency is computed using the total word count • A naive solution relies on chaining two MapReduce programs • MR1 counts the total number of words (using combiners) • MR2 counts the number of each word and divide it by the total number of words provided by MR1 • How to turn this into a single MR computation ? 1 2 Property : reduce keys are ordered function EmitToAllReducers(k,v) • Map task send their local total word count to ALL the reducers with the key "" • key "" is the first key processed by the reducer • sum of its values gives the total number of words G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 35 / 84
  41. 41. Word Frequency Mapper and Reducer / / Pseudo−code f o r t h e Map f u n c t i o n map( S t r i n g key , S t r i n g v a l u e ) : / / key : document name , v a l u e : document c o n t e n t s i n t word_count = 0 ; f o r e a c h word i n s p l i t ( v a l u e ) : E m i t I n t e r m e d i a t e ( word , ’ ’ 1 ’ ’ ) ; word_count ++; E m i t I n t e r m e d i a t e T o A l l R e d u c e r s ( " " , A s S t r i n g ( word_count ) ) ; / / Peudo−code f o r t h e Reduce f u n c t i o n reduce ( S t r i n g key , I t e r a t o r v a l u e ) : / / key : a word , v a l u e s : a l i s t o f counts i f ( key= " " ) : i n t total_word_count = 0; foreach v i n values : t o t a l _ w o r d _ c o u n t += P a r s e I n t ( v ) ; else : i n t word_count = 0 ; foreach v i n values : word_count += P a r s e I n t ( v ) ; Emit ( key , A s S t r i n g ( word_count / t o t a l _ w o r d _ c o u n t ) ) ; G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 36 / 84
  42. 42. Subsection 3 Key Features of MapReduce Runtime Environments G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 37 / 84
  43. 43. Handling Failures On very large cluster made of commodity servers, failures are not a rare event. Fault-tolerant Execution • Workers’ failures are detected by heartbeat • In case of machine crash : 1 2 3 In progress Map and Reduce task are marked as idle and re-scheduled to another alive machine. Completed Map task are also rescheduled Any reducers are warned of Mappers’ failure • Result replica “disambiguation”: only the first result is considered • Simple FT scheme, yet it handles failure of large number of nodes. G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 38 / 84
  44. 44. Scheduling Data Driven Scheduling • Data-chunks replication • GFS divides files in 64MB chunks and stores 3 replica of each chunk on different machines • map task is scheduled first to a machine that hosts a replica of the input chunk • second, to the nearest machine, i.e. on the same network switch • Over-partitioning of input file • #input chunks >> #workers machines • # mappers > # reducers Speculative Execution • Backup task to alleviate stragglers slowdown: • stragglers : machine that takes unusual long time to complete one of the few last map or reduce task in a computation • backup task : at the end of the computation, the scheduler replicates the remaining in-progress tasks. • the result is provided by the first task that completes G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 39 / 84
  45. 45. Speculative Execution in Hadoop • When a node is idle, the scheduler selects a task 1 2 3 failed tasks have the higher priority non-running tasks with data local to the node speculative tasks (i.e., backup task) • Speculative task selection relies on a progress score between 0 and 1 • Map task : progress score is the fraction of input read • Reduce task : • considers 3 phases : the copy phase, the sort phase, the reduce phase • in each phase, the score is the fraction of the data processed. Example : a task halfway of the reduce phase has score 1/3 + 1/3 + (1/3 ∗ 1/2) = 5/6 • straggler : tasks which are 0.2 less than the average progress score G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 40 / 84
  46. 46. Section 3 Performance Improvements and Optimizations G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 41 / 84
  47. 47. Subsection 1 Handling Heterogeneity and Stragglers G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 42 / 84
  48. 48. Issues with Hadoop Scheduler and heterogeneous environment Hadoop makes several assumptions : • nodes are homogeneous • tasks are homogeneous which leads to bad decisions : • Too many backups, thrashing shared resources like network bandwidth • Wrong tasks backed up • Backups may be placed on slow nodes • Breaks when tasks start at different times G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 43 / 84
  49. 49. The LATE Scheduler The LATE Scheduler (Longest Approximate Time to End) • estimate task completion time and backup LATE task • choke backup execution : • Limit the number of simultaneous backup tasks (SpequlativeCap) • Slowest nodes are not scheduled backup tasks (SlowNodeTreshold) • Only back up tasks that are sufficiently slow (SlowTaskTreshold) • Estimate task completion time progress score execution time score left = 1−progress rate progress • progress rate = • estimated time In practice SpeculativeCap=10% of available task slots, SlowNodeTreshold, SlowTaskTreshold = 25th percentile of node progress and task progress rate Improving MapReduce Performance in Heterogeneous Environments. Zaharia, M., Konwinski, A., Joseph, A. D., Katz, R. and Stoica, I. in OSDI’08: Sixth Symposium on Operating System Design and Implementation, San Diego, CA, December, 2008. G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 44 / 84
  50. 50. Progress RateProgress Example 1 min Rate Example 2 min Node 1 1 task/min Node 2 3x slower Node 3 1.9x slower Time (min) G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 45 / 84
  51. 51. LATE Example LATE Example 2 min Estimated time left: (1-0.66) / (1/3) = 1 Node 1 Node 2 Progress = 66% Estimated time left: (1-0.05) / (1/1.9) = 1.8 Progress = 5.3% Node 3 Time (min) LATE correctly picks Node 3 G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 46 / 84
  52. 52. LATE Performance (EC2 Sort benchmark) Stragglers !"#$%&'($)*(+$,-(-'&.-/-*(0 emulated by running $ 4 stragglers Heterogeneity 243VM/99 Hosts 1-7VM/Host ,36&&5$E382-$ F1G!$%;+-67H-'$ B>?$ B>#$ B$ =>A$ =>@$ =>?$ =>#$ =$ C&'4($ D-4($ !"#$%&'($)*(+$%(',--./ dd Each host sorted 128MB C,4&&3$B,72/$ B&$A,:;530$ with a total of #=>$ 30GB data '()*+,-./0&1/23(42/&5-*/& &'()*+,-./01.23'42.05,).0 E&$D3;<754$ 12-'3.-$ D1E!$%:+/45./'$ #=<$ ?=>$ E 2 o S w a d in ?=<$ <=>$ <=<$ @&'0($ A/0($ 12/',-/$ Results Results 12/',-/$!"#$03//453$&2/'$6,72/8$$$%#&&2/' 12-'3.-$!"#$45--675$&2-'$/382-9$$%#$&2-'$/&$:3;<754$ Average 27% speedup over Average 58% speedup over Hadoop, 31% over no backups Hadoop, 220% over noofbackups UIUC Department Computer Science, Department of Computer Science, UIUC G. Fedak (INRIA, France) MapReduce ?A$ Salvador da Bahia, Brazil, 22/10/2013 47 / 84
  53. 53. Understanding Outliers • Stragglers : Tasks that take ≥ 1.5 the median task in that phase • Re-compute : When a task output is lost, dependant tasks that wait until 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 1 Cumulative problem, Cumulative Fraction of Phases the output is regenerated high runtime recompute 0.8 0.6 high runtime recompute 0.4 0.2 0 0 1 2 0 0.1 0.2 0.3 0.4 Fraction of Outliers 0.5 4 6 8 10 Ratio of Straggler Duration to the Duration of the Median Task tliers and (a) What fraction of tasks in a (b) How much longer do outr resource phase are outliers? liers take to finish? mics, conFigure : Prevalence of Outliers. e to large of outliers Causes ic opera- skew: few keys corresponds to a lot of data • Data dian task) are well explained by the amount of data they ur knowl• Network contention : 70% of the crossrack traffic induced by reduce process the tasks nces from and Busy or move on half network. Duplicating these5% of the • Bad machines : of recomputes happen on would not make them run faster and will waste resources. machines At the other extreme, in  of the phases Salvador da Bahia, Brazil, 22/10/2013 (bottom left), G. Fedak (INRIA, France) MapReduce 48 / 84
  54. 54. 123 !*8%?*& :9 The Mantri Scheduler 9 9 234 DF;CG4E :9 ;9 <9 =9 >99 @A*')#,*ABC#D4E#8B#1"./)*%8"B#78.* 0"))/%1 • cause- and resource- aware scheduling Figure : Percentage tasks : define • real-time monitoring of the speed-up of job completion time in the ideal case when (some combination of) outliers do not occur. >.*6'?3%).$&"#* ?+$5&%) !.+*%* !"#$%#$&"#' (")')%*"+),%* %=%,+$% -.$/*'/.0%' 1&((2',.3.,&$4 that are ev of a task a )%.1'&#3+$ A#3+$''B%,"9%*' +#.0.&5.B5% Figure : ;#%<+.5'8")6' 1&0&*&"# tasks bef ones nea #%$8")6'.8.)%' • )%35&,.$%'"+$3+$ *$.)$'$.*6*'$/.$' @"5+$&"#* • 1+35&,.$% There • 6&557')%*$.)$ 35.,%9%#$ • 3)%:,"93+$% 1"'9")%'(&)*$ ing the c Figure : The Outlier Problem: Causes and Solutions job comp of every t By inducing high variability in repeat runs of the same copies ar job, outliers make it hard to meet SLAs. At median, the Reining in the Outliers in Map-Reduce Clusters using Mantri in G. Ananthanarayanan, S. Kandula, A. slow dow ratio Lu,stdev E. Harris OSDI’10: Sixth Symposium i.e., jobs System Greenberg, I. Stoica, Y.of mean in job completion time is .,on Operatinghave a Design and up f B. Saha, used Implementation,non-trivial probability of taking twice as long or finishing Vancouver, Canada, December, 2010. are waiti half as quickly. a phase v To summarize, we takeMapReduce the following lessons from our22/10/2013 comp G. Fedak (INRIA, France) Salvador da Bahia, Brazil, up 49 / 84
  55. 55. Resource Aware Restart • Restart Outliers tasks on better machines • Consider two options : kill & restart or duplicate + 0%*+ ,-($)".$/% 67*%$-1)8% ()*!(% &% !% '% '!% !% +!% '!% !% !% !"#$% 234)"5-!$% &% !% !% '!% !% !% '!% !% '% !"#$% 0"))/%1$(!-1!% &% !% '!% !% !% '% !% !% '!% !"#$% >99 78.* mpletion time in the tliers do not occur. Figure : A stylized example to illustrate our main ideas. Tasks that are eventually killed are filled with stripes, repeat instances • restart or duplicate iffilled new <lighter)mesh. of a task are P(t with a trem is high • restart "9%*' ;#%<+.5'8")6' if the remaining time is large trem > E(tnew ) + ∆ tasks before the others to avoid being stuck with the large .B5% 1&0&*&"# • duplicates if it decreases the total amount of resources used ones near completion (§.). +$3+$ *$.)$'$.*6*'$/.$' There > subtle point < 3 is the number reduc$% 1"'9")%'(&)*$ P(trem > tnew (c+1) )is a δ, where c with outlier mitigation:of copies and c δ = 0.25 ing the completion time of a task may in fact increase the s and Solutions job completion time. For example, replicating the output G. Fedak (INRIA, France)every task will drastically reduce recomputations–both MapReduce Salvador da Bahia, Brazil, 22/10/2013 of 50 / 84
  56. 56. Network Aware Placement • Schedule tasks so that it minimizes network congestion • Local approximation of optimal placement • do not track bandwidth changes • do not require global co-ordination • Placement of a reduce phase with n tasks, running on a cluster with r racks, In,r the input data matrix. i i i i • Let du , dv the data to be uploaded and downloaded, and bu , bd the corresponding available bandwidth di • the placement of is arg min(max( u , bi v G. Fedak (INRIA, France) MapReduce i dv i bd )) Salvador da Bahia, Brazil, 22/10/2013 51 / 84
  57. 57. Mantri Performances %"#& %& "& & !'#! !"#$ !"#( '()*+,-./01(/1( 203()*40,5-*4 $& (& !& %"" -!./0/12$34/ -!./0/12$34/!5"$67'8 )& "& %& !"#$ • Speed up the job median by 9#6 55% :3;. 9<=853/41 >324 ?;3/7++++?;-7 7#8 !& %#6 & :05+ <=3>* ?50,@((((?5*@ AB @A ;0,1. 20/1 • Network aware data (a) Completion Time (b)  Resource Usage placement reduces the ine implementation on a production cluster of thousands completion time of typical rvers for the four representative jobs. reduce phase of 31% :9 9 ;:9 9 :9 <9 =9 >9 ?99 )*+,-,./"0%, )*+,-,./"0%,*'1"2345 %"" . . tasks speeds up. of the half . . . reduce phases by > 60% e : Comparing Mantri’s network-aware spread of tasks the baseline implementation on a production cluster of sands of servers. G. Fedak (INRIA, France) !" (a) Change in Completion Time  reduction in • Network mincompletion time aware placement of avg max e cluster. Jobs that occupy the cluster for half the time up by at least .. Figure (b) shows that  of see a reduction in resource consumption while the rs use up a few extra resources. These gains are due !"#$% &$%''( )*+, @$(A4%5B4 @$86"7 #" 0/A4%5B67'8/78/-'C(D467'8/+7C4 re : Comparing Mantri’s straggler mitigation with the Phase Job $" $" !"#$% &$%''( )*+, !"#$%&'(% !"5213 #" 86 !" 76 6 986 976 6 76 86 :6 ;6 <66 -,$%&'(2345,35,$%04'1(%,=0">% (b) Change in Resource Usage Figure : Comparing straggler mitigation strategies. Mantri provides a greater speed-up in completion time while using fewer resources than existing schemes. MapReduce Salvador da Bahia, Brazil, 22/10/2013 52 / 84
  58. 58. Subsection 2 Extending MapReduce Application Range G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 53 / 84
  59. 59. !"#$%&'()(*%&'+,-&(.+/0&123&( Twister: Iterative MapReduce !  !"#$"%&$"'%('%(#$)$"&()%*( +),")-./(*)$)( !  (0'%123,)-./(.'%2(,3%%"%2( 4&)&5/)-./6(7)89,/*3&/($)#:#(( !  ;3-9#3-(7/##)2"%2(-)#/*( &'773%"&)$"'%9*)$)($,)%#</,#(( !  =>&"/%$(#388',$(<',(?$/,)$"+/( @)8A/*3&/(&'783$)$"'%#( 4/B$,/7/.C(<)#$/,($5)%(D)*''8( ',(!,C)*9!,C)*E?FG6(( !  0'7-"%/(85)#/($'(&'../&$()..( ,/*3&/('3$83$#(( !  !)$)()&&/##(+")(.'&).(*"#:#( E"25$H/"25$(4IJKLL(."%/#('<( M)+)(&'*/6(( !  N388',$(<',($C8"&).(@)8A/*3&/( &'783$)$"'%#(O''.#($'(7)%)2/( *)$)(( Ekanayake, Jaliya, et al. Twister: a Runtime for Iterative Mapreduce. International Symposium on High Performance Distributed Computing. HPDC 2010. G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 54 / 84
  60. 60. Moon: MapReduce on Opportunistic Environments • Consider volatile resources during their idleness time • Consider the Hadoop runtime → Hadoop does not tolerate massive failures • The Moon approach : • Consider hybrid resources: very volatile nodes and more stable nodes • two queues are managed for speculative execution : frozen and slow tasks • 2-phase task replication : separate the job between normal and homestretch phase → the job enter the homestretch phase when the number of remaining tasks < H% available execution slots Lin, Heshan, et al. Moon: Mapreduce on opportunistic environments International Symposium on High Performance Distributed Computing. HPDC 2010. G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 55 / 84
  61. 61. Towards Greener Data Center • Sustainable Computing Infrastructures which minimizes the impact on the environment • Reduction of energy consumption • Reduction of Green House Gaz emission • Improve the use of renewable energy (Smart Grids, Co-location, etc...) Solar Bay Apple Maiden Data Center (20MW, ) 00 acres : 400 000 m2. 20 MW peak G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 56 / 84
  62. 62. Section 4 MapReduce beyond the Data Center G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 57 / 84
  63. 63. Introduction to Computational Desktop Grids High!"#$$%&'%&()$(*$&+&,)-%&'./"'#0)1%*"-& Throughput Computing over2%"-/00%$-Idle Large Sets of Desktop Computers ! 34"54%"&$%-&2*#--)0(%-&'%& ()$(*$&'%&67-&2%0')01&& $%*"&1%82-&'.#0)(1#9#15 • Volunteer Computing Systems " :/$&'%&(;($%&<&5(/0/8#-%*"& (SETI@Home, Distributed.net) '.5(")0 • Open Source Desktop Grid " )22$#()1#/0-&2)")$$=$%-&'%&4")0'%& (BOINC, XtremWeb, Condor, 1)#$$% OurGrid) " >"4)0#-)1#/0&'%&?(/0(/*"-@&'%& 2*#--)0(% • Some numbers with BOINC " A%-&*1#$#-)1%*"-&'/#9%01&2/*9/#"& (October 21, 2013) 8%-*"%"&$%*"&(/01"#,*1#/0 • Active: 0.5M users, 1M 5(/0/8#%&'*&'/0 hosts. • 24-hour average: 12 PetaFLOPS. • EC2 equivalent : 1b$/year G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 58 / 84
  64. 64. MapReduce over the Internet Computing on Volunteers’ PCs (a la SETI@Home) • Potential of aggregate computing power and storage • Average PC : 1TFlops, 4BG RAM, 1TByte disc • 1 billion PCs, 100 millions GPUs + tablets + smartphone + STB • hardware and electrical power are paid for by consumers • Self-maintained by the volunteers But ... • High number of resources • Low performance • Volatility • Owned by volunteer • Scalable but mainly for embarrassingly parallel applications with few I/O requirements • Challenge is to broaden the application domain G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 59 / 84
  65. 65. MapReduce over the Internet Computing on Volunteers’ PCs (a la SETI@Home) • Potential of aggregate computing power and storage • Average PC : 1TFlops, 4BG RAM, 1TByte disc • 1 billion PCs, 100 millions GPUs + tablets + smartphone + STB • hardware and electrical power are paid for by consumers • Self-maintained by the volunteers But ... • High number of resources • Low performance • Volatility • Owned by volunteer • Scalable but mainly for embarrassingly parallel applications with few I/O requirements • Challenge is to broaden the application domain G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 59 / 84
  66. 66. Subsection 1 Background on BitDew G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 60 / 84
  67. 67. Large Scale Data Management BitDew : a Programmable Environment for Large Scale Data Management Key Idea 1: provides an API and a runtime environment which integrates several P2P technologies in a consistent way Key Idea 2: relies on metadata (Data Attributes) to drive transparently data management operation : replication, fault-tolerance, distribution, placement, life-cycle. BitDew: A Programmable Environment for Large-Scale Data Management and Distribution Gilles Fedak, Haiwu He and Franck Cappello International Conference for High Performnance Computing (SC’08), Austin, 2008 G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 61 / 84
  68. 68. BitDew : the Big Cloudy Picture • Aggregates storage in a single Data Space: Data Space • Clients put and get data from the data space • Clients defines data attributes • Distinguishes service nodes (stable), client and Worker nodes (volatile) • Service : ensure fault tolerance, indexing and scheduling of data to Worker nodes • Worker : stores data on Desktop PCs REPLICAT =3 put get client • push/pull protocol between client → service ← Worker G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 62 / 84
  69. 69. BitDew : the Big Cloudy Picture • Aggregates storage in a single Data Space: Data Space • Clients put and get data from the data space Service Nodes Reservoir Nodes • Clients defines data attributes • Distinguishes service nodes pull pull (stable), client and Worker nodes (volatile) • Service : ensure fault pull tolerance, indexing and scheduling of data to Worker nodes • Worker : stores data on Desktop PCs put get client • push/pull protocol between client → service ← Worker G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 62 / 84
  70. 70. Data Attributes REPLICA : RESILIENCE : LIFETIME : AFFINITY : TRANSFER PROTOCOL : DISTRIBUTION : G. Fedak (INRIA, France) indicates how many occurrences of data should be available at the same time in the system controls the resilience of data in presence of machine crash is a duration, absolute or relative to the existence of other data, which indicates when a datum is obsolete drives movement of data according to dependency rules gives the runtime environment hints about the file transfer protocol appropriate to distribute the data which indicates the maximum number of pieces of Data with the same Attribute should be sent to particular node. MapReduce Salvador da Bahia, Brazil, 22/10/2013 63 / 84
  71. 71. Architecture Overview Applications Service Container Active Data Data Scheduler SQL Server Transfer Manager BitDew Data Catalog File System DHT Master/ Worker Storage Data Repository Http Ftp Data Transfer Bittorrent Ba ck- en ds Se rvic e AP I Command -line Tool BitDew Runtime Environnement • Programming APIs to create Data, Attributes, manage file transfers and program applications • Services (DC, DR, DT, DS) to store, index, distribute, schedule, transfer and provide resiliency to the Data • Several information storage backends and file transfer protocols G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 64 / 84
  72. 72. Anatomy of a BitDew Application • Clients 1 2 3 creates Data and Attributes put file into data slot created schedule the data with their corresponding attribute Data Space Service Nodes Reservoir Nodes dr/dt dc/ds • Reservoir pull • are notified when data are locally scheduled onDataScheduled or deleted onDataDeleted • programmers install callbacks on these events pull create put schedule • Clients and Reservoirs can both create data and install callbacks G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 65 / 84
  73. 73. Anatomy of a BitDew Application • Clients 1 2 3 creates Data and Attributes put file into data slot created schedule the data with their corresponding attribute Data Space Service Nodes Reservoir Nodes dr/dt dc/ds • Reservoir onDataSceduled • are notified when data are locally scheduled onDataScheduled or deleted onDataDeleted • programmers install callbacks on these events onDataScheduled • Clients and Reservoirs can both create data and install callbacks G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 65 / 84
  74. 74. Examples of BitDew Applications • Data-Intensive Application • DI-BOT : data-driven master-worker Arabic characters recognition (M. Labidi, University of Sfax) • MapReduce, (X. Shi, L. Lu HUST, Wuhan China) • Framework for distributed results checking (M. Moca, G. S. Silaghi Babes-Bolyai University of Cluj-Napoca, Romania) • Data Management Utilities • File Sharing for Social Network (N. Kourtellis, Univ. Florida) • Distributed Checkpoint Storage (F. Bouabache, Univ. Paris XI) • Grid Data Stagging (IEP, Chinese Academy of Science) G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 66 / 84
  75. 75. Experiment Setup Hardware Setup • Experiments are conducted in a controlled environment (cluster for the microbenchmark and Grid5K for scalability tests) as well as an Internet platform • GdX Cluster : 310-nodes cluster of dual opteron CPU (AMD-64 2Ghz, 2GB SDRAM), network 1Gbs Ethernet switches • Grid5K : an experimental Grid of 4 clusters including GdX • DSL-lab : Low-power, low-noise cluster hosted by real Internet users on DSL broadband network. 20 Mini-ITX nodes, Pentium-M 1Ghz, 512MB SDRam, with 2GB Flash storage • Each experiments is averaged over 30 measures G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 67 / 84
  76. 76. Fault-Tolerance Scenario 0 20 40 60 80 100 120 492KB/s DSL01 DSL02 211KB/s DSL03 254KB/s DSL04 247KB/s DSL05 384KB/s DSL06 53KB/s DSL07 412KB/s DSL08 332KB/s DSL09 304KB/s DSL10 259KB/s Evaluation of Bitdew in Presence of Host Failures Scenario: a datum is created with the attribute : REPLICA = 5, FAULT _ TOLERANCE = true and PROTOCOL = "ftp". Every 20 seconds, we simulate a machine crash by killing the BitDew process on one machine owning the data. We run the experiment in the DSL-Lab environment Figure: The Gantt chart presents the main events : the blue box shows the download duration, red box is the waiting time, and red star indicates a node crash. G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 68 / 84
Collective File Transfer with Replication

Scenario: all-to-all file transfer with replication on DSL-Lab.
• {d1, d2, d3, d4} is created, where each datum has the REPLICA attribute set to 4.
• The all-to-all collective is implemented with the AFFINITY attribute: each data attribute is modified on the fly so that d1.affinity=d2, d2.affinity=d3, d3.affinity=d4, d4.affinity=d1.
• During the collective, file transfers are performed concurrently and the data size is 5 MB.

Results: one run every hour during approximately 12 hours.
Figure: Cumulative Distribution Function (CDF) of the achieved bandwidth (KB/s) for the all-to-all collective, for d1 to d4 individually and for d1 + d2 + d3 + d4.

G. Fedak (INRIA, France)    MapReduce    Salvador da Bahia, Brazil, 22/10/2013    69 / 84
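The affinity ring used for the all-to-all collective can be sketched as follows. This is a toy illustration of how the affinities are chained (d1 to d2, d2 to d3, and so on), not BitDew code; combined with replication, the chain causes every participating node to end up holding all four data.

import java.util.LinkedHashMap;
import java.util.Map;

public class AffinityRing {
    public static void main(String[] args) {
        String[] data = {"d1", "d2", "d3", "d4"};
        Map<String, String> affinity = new LinkedHashMap<>();
        for (int i = 0; i < data.length; i++) {
            // d1 -> d2, d2 -> d3, d3 -> d4, d4 -> d1
            affinity.put(data[i], data[(i + 1) % data.length]);
        }
        affinity.forEach((d, a) -> System.out.println(d + ".affinity = " + a));
    }
}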
Multi-sources and Multi-protocols File Distribution

Scenario
• 4 clusters of 50 nodes each; each cluster hosts its own data repository (DC/DR).
• First phase: a client uploads a large datum to the 4 data repositories.
• Second phase: the cluster nodes get the datum from the data repository of their own cluster.

We vary:
1. the number of data repositories, from 1 to 4
2. the protocol used to distribute the data (BitTorrent vs FTP)

[Figure: the four clusters C1 to C4 with their repositories dr1/dc1 to dr4/dc4; arrows show the upload during the 1st phase and the downloads during the 2nd phase.]

G. Fedak (INRIA, France)    MapReduce    Salvador da Bahia, Brazil, 22/10/2013    70 / 84
Multi-sources and Multi-protocols File Distribution

[Table: distribution bandwidth (MB/s) measured on the orsay, nancy, bordeaux and rennes clusters, for BitTorrent and FTP, during the 1st and 2nd phases, with 1, 2 and 4 data repositories (1DR/1DC, 2DR/2DC, 4DR/4DC).]

• For both protocols, increasing the number of data sources also increases the available bandwidth.
• FTP performs better in the first phase, while BitTorrent gives superior performance in the second phase.
• The best combination (4DC/4DR, FTP for the first phase and BitTorrent for the second phase) takes 26.46 seconds and outperforms by 168.6% the best single-data-source, single-protocol (BitTorrent) configuration.

G. Fedak (INRIA, France)    MapReduce    Salvador da Bahia, Brazil, 22/10/2013    71 / 84
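One way to read the 168.6% figure (an interpretation, not stated on the slide): if the best combination is 2.686 times faster than the best single-source, single-protocol configuration, then that BitTorrent baseline would take roughly

    26.46 s × (1 + 1.686) ≈ 71 s

for the same distribution.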
Subsection 2
MapReduce for Desktop Grid Computing

G. Fedak (INRIA, France)    MapReduce    Salvador da Bahia, Brazil, 22/10/2013    72 / 84
Challenge of MapReduce over the Internet

[Figure: MapReduce dataflow: the input files are split, mappers produce intermediate results, which are combined and reduced into the final output files.]

• No shared file system nor direct communication between hosts
• Faults and host churn
• Needs:
  • data replica management
  • result certification of intermediate data
  • collective operations (scatter + gather/reduction)

Towards MapReduce for Desktop Grids, B. Tang, M. Moca, S. Chevalier, H. He, G. Fedak, in Proceedings of the Fifth International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), Fukuoka, Japan, November 2010.

G. Fedak (INRIA, France)    MapReduce    Salvador da Bahia, Brazil, 22/10/2013    73 / 84
Implementing MapReduce over BitDew (1/2)

Initialisation by the Master
  1. Data slicing: the Master creates a DataCollection DC = d1, ..., dn, the list of data chunks sharing the same MapInput attribute.
  2. It creates the reducer tokens, each one with its own TokenReducX attribute (affinity set to the Reducer).
  3. It creates and pins the collector tokens with the TokenCollector attribute (affinity set to the Collector token).

Algorithm 2: Execution of Map tasks
  Require: Map, the Map function to execute
  Require: M, the number of mappers, and m, a single mapper
  Require: d_m = d_{1,m}, ..., d_{k,m}, the set of map input data received by worker m
  {on the Master node}
    create a single data MapToken with affinity set to DC
  {on the Worker node}
    if MapToken is scheduled then
      for all data d_{j,m} ∈ d_m do
        execute Map(d_{j,m})
        create list_{j,m}(k, v)
      end for
    end if

Execution of Map tasks
  1. When a Worker receives the MapToken, it launches the corresponding MapTask and computes the intermediate values list_{j,m}(k, v).

G. Fedak (INRIA, France)    MapReduce    Salvador da Bahia, Brazil, 22/10/2013    74 / 84
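A self-contained Java toy of the worker side of Algorithm 2, using the deck's WordCount Map as the map function: once the MapToken is scheduled, the worker runs Map over every locally received chunk and produces per-chunk lists of (key, value) pairs. All names are invented for illustration and are not BitDew code.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MapPhaseToy {
    // WordCount-style Map: emit (word, 1) for every word of the chunk.
    static List<Map.Entry<String, Integer>> map(String chunk) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : chunk.split("\\s+")) {
            if (!word.isEmpty()) out.add(Map.entry(word, 1));
        }
        return out;
    }

    public static void main(String[] args) {
        boolean mapTokenScheduled = true;                                 // the worker received the MapToken
        List<String> localChunks = List.of("to be or not", "not to be");  // d_{1,m} ... d_{k,m}

        if (mapTokenScheduled) {
            for (int j = 0; j < localChunks.size(); j++) {
                List<Map.Entry<String, Integer>> listJM = map(localChunks.get(j));  // list_{j,m}(k,v)
                System.out.println("chunk " + j + " -> " + listJM);
            }
        }
    }
}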
Implementing MapReduce over BitDew (2/2)

Algorithm 3: Shuffling intermediate results
  Require: M, the number of mappers, and m, a single worker
  Require: R, the number of reducers
  Require: list_m(k, v), the set of (key, value) pairs of intermediate results on worker m
  {on the Master node}
    for all r ∈ [1, ..., R] do
      create Attribute ReduceAttr_r with distrib = 1
      create data ReduceToken_r
      schedule data ReduceToken_r with attribute ReduceTokenAttr
    end for
  {on the Worker node}
    split list_m(k, v) into if_{m,1}, ..., if_{m,r} intermediate files
    for all file if_{m,r} do
      create reduce input data ir_{m,r} and upload if_{m,r}
      schedule data ir_{m,r} with affinity = ReduceToken_r
    end for

Algorithm 4: Execution of Reduce tasks
  {on the Worker node}
    if ir_{m,r} is scheduled then
      execute Reduce(ir_{m,r})
      if all ir_r have been processed then
        create o_r with affinity = MasterToken
      end if
    end if
  {on the Master node}
    create a single data MasterToken and pinAndSchedule(MasterToken)
    if all o_r have been received then
      combine the o_r into a single result
    end if

Shuffling intermediate results
  1. The worker partitions its intermediate results according to the partition function and the number of reducers, and creates ReduceInput data with the attribute ReduceToken_r. Because ReduceTokenAttr carries the distrib = 1 flag, the workload is fairly distributed between reducers. The Shuffle phase is simply implemented by scheduling the partitioned intermediate data with an attribute whose affinity tag is equal to the corresponding ReduceToken.

Execution of Reduce tasks
  2. When a reducer (a worker holding a ReduceToken) receives ReduceInput files ir, it launches the corresponding Reduce(ir). When all the ir have been processed, the reducer combines the results into a single final result O_r.
  3. Eventually, the final result can be sent back to the master using the TokenCollector attribute: the MasterToken is pinned on the Master node (known to the BitDew scheduler but never sent to another node), so Reduce results scheduled with affinity to MasterToken flow back to the Master.

Figure 3: the worker runs the following threads: Mapper, Reducer and ReducePutter; Q1, Q2 and Q3 are queues used for communication between the threads.

Collective file operations: MapReduce processing is composed of several collective file operations which, in some respects, are similar to collective communications in parallel programming (MPI): the Distribution, i.e. the initial distribution of file chunks (similar to MPI_Scatter), and the Shuffle, i.e. the redistribution of intermediate results between the Mapper and Reducer nodes (similar to MPI_Alltoall).

G. Fedak (INRIA, France)    MapReduce    Salvador da Bahia, Brazil, 22/10/2013    75 / 84
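The worker side of Algorithm 3 (partitioning the intermediate results among R reducers) can be sketched as follows. The hash-based partition function and all names are illustrative assumptions, not the actual implementation; each partition would then be scheduled with affinity to the matching ReduceToken_r.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ShuffleToy {
    public static void main(String[] args) {
        int R = 2;  // number of reducers
        List<Map.Entry<String, Integer>> listM = List.of(
                Map.entry("to", 1), Map.entry("be", 1), Map.entry("or", 1),
                Map.entry("not", 1), Map.entry("to", 1), Map.entry("be", 1));

        // split list_m(k, v) into if_{m,1}, ..., if_{m,R}
        List<List<Map.Entry<String, Integer>>> intermediateFiles = new ArrayList<>();
        for (int r = 0; r < R; r++) intermediateFiles.add(new ArrayList<>());
        for (Map.Entry<String, Integer> kv : listM) {
            int r = Math.floorMod(kv.getKey().hashCode(), R);   // partition function
            intermediateFiles.get(r).add(kv);
        }

        for (int r = 0; r < R; r++) {
            // In BitDew terms: create ir_{m,r}, upload if_{m,r}, schedule with affinity = ReduceToken_r.
            System.out.println("if_{m," + (r + 1) + "} -> ReduceToken_" + (r + 1)
                    + " : " + intermediateFiles.get(r));
        }
    }
}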
Handling Communications

Latency hiding
• Multi-threaded worker to overlap communication and computation.
• The maximum number of concurrent Map and Reduce tasks can be configured, as well as the minimum number of tasks in the queue before computations start.

Barrier-free computation
• Reducers detect duplicated intermediate results (which happen because of faults and/or lagging nodes).
• Early reduction: intermediate results are processed as they arrive → this allowed us to remove the barrier between the Map and Reduce tasks.
• But... intermediate results are not sorted.

G. Fedak (INRIA, France)    MapReduce    Salvador da Bahia, Brazil, 22/10/2013    76 / 84
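A minimal sketch of the latency-hiding idea: a bounded queue between a downloader thread and a mapper thread lets communication and computation overlap, so a chunk is mapped as soon as its transfer finishes. This illustrates the principle only; it is not the BitDew/MapReduce worker code.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class OverlappingWorkerToy {
    private static final String POISON = "__END__";

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> downloaded = new ArrayBlockingQueue<>(4);   // queue between the two threads

        Thread downloader = new Thread(() -> {
            try {
                for (int i = 1; i <= 8; i++) {
                    Thread.sleep(100);                      // simulate a file transfer
                    downloaded.put("chunk-" + i);           // enqueue as soon as the transfer ends
                }
                downloaded.put(POISON);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread mapper = new Thread(() -> {
            try {
                String chunk;
                while (!(chunk = downloaded.take()).equals(POISON)) {
                    Thread.sleep(150);                      // simulate the Map task
                    System.out.println("mapped " + chunk);
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        downloader.start();
        mapper.start();
        downloader.join();
        mapper.join();
    }
}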
Scheduling and Fault-tolerance

2-level scheduling
  1. Data placement is ensured by the BitDew scheduler, which is mainly guided by the data attributes.
  2. Workers periodically report the state of their ongoing computation to the MR-scheduler running on the master node.
  3. The MR-scheduler detects when there are more available nodes than tasks to execute, which helps avoid the straggler (lagger) effect.

Fault tolerance
• In Desktop Grids, computing resources have high failure rates:
  → during the computation (execution of Map or Reduce tasks)
  → during the communication, i.e. file upload and download.
• MapInput data and the ReduceToken token have the resilient attribute enabled.
• Because the intermediate results ir_{j,r}, j ∈ [1, m], have their affinity set to ReduceToken_r, they are automatically downloaded by the node on which the ReduceToken data is rescheduled.

G. Fedak (INRIA, France)    MapReduce    Salvador da Bahia, Brazil, 22/10/2013    77 / 84
Data Security and Privacy

Distributed result checking
• Traditional Desktop Grid or Volunteer Computing projects implement result checking on the server.
• Intermediate results are too large to be sent back to the server → distributed result checking:
  • replicate the MapInput data and the Reducers
  • select correct results by majority voting: a reducer accepts an intermediate result once it has received a majority of identical versions (rm/2 + 1, where rm is the Map replication factor) from distinct workers, ignoring failed results.
• Model assumption: workers sabotage independently of one another (colluding nodes are not considered).

[Figure 2: dataflows in the MapReduce implementation: (a) no replication, (b) replication of the Map tasks, (c) replication of both Map and Reduce tasks.]

Ensuring data privacy
• Use a hybrid infrastructure composed of private and public resources.
• Use an IDA (Information Dispersal Algorithm) approach to distribute and store the data securely.

G. Fedak (INRIA, France)    MapReduce    Salvador da Bahia, Brazil, 22/10/2013    78 / 84
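A self-contained toy of the majority-voting certification described above: each replicated Map task sends its version of an intermediate result (identified here by a digest string), and the reducer accepts the result as soon as one version has been received rm/2 + 1 times. The digest representation and all names are invented for illustration.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MajorityVotingToy {
    // Returns the accepted digest, or null if no version has reached a majority yet.
    static String certify(List<String> receivedDigests, int rm) {
        int threshold = rm / 2 + 1;
        Map<String, Integer> votes = new HashMap<>();
        for (String digest : receivedDigests) {
            int count = votes.merge(digest, 1, Integer::sum);
            if (count >= threshold) return digest;   // majority reached: accept this version
        }
        return null;                                  // keep waiting for more replicas
    }

    public static void main(String[] args) {
        int rm = 3;  // replication factor of the Map tasks
        // Two honest replicas agree, one saboteur differs: the honest digest wins.
        System.out.println(certify(List.of("sha:aaa", "sha:bbb", "sha:aaa"), rm));  // -> sha:aaa
        System.out.println(certify(List.of("sha:aaa", "sha:bbb"), rm));             // -> null
    }
}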
BitDew Collective Communication

Figure: execution time of the Broadcast, Scatter and Gather collectives for a 2.7 GB file, for a varying number of nodes. To have enough chunks to measure the Scatter/Gather collectives, the chunk size is set to 16 MB.

G. Fedak (INRIA, France)    MapReduce    Salvador da Bahia, Brazil, 22/10/2013    79 / 84
MapReduce Evaluation

Figure: scalability evaluation of the WordCount application; the y axis shows the throughput in MB/s and the x axis the number of nodes, varying from 1 to 512.

[Table II: evaluation of the performance according to the number of Mappers and Reducers.]

G. Fedak (INRIA, France)    MapReduce    Salvador da Bahia, Brazil, 22/10/2013    80 / 84
Fault-Tolerance Scenario

Figure: fault-tolerance scenario (5 mappers, 2 reducers): crashes are injected during the download of a file (F1), during the execution of a map task (F2) and during the execution of the reduce task on the last intermediate result (F3).

G. Fedak (INRIA, France)    MapReduce    Salvador da Bahia, Brazil, 22/10/2013    81 / 84
Hadoop vs MapReduce/BitDew on Extreme Scenarios

Objective: evaluate MapReduce/BitDew against Hadoop under conditions that mimic Internet Desktop Grids: host churn (the independent arrival and departure of thousands or even millions of peers), CPU heterogeneity and stragglers, network failures, and restricted connectivity (firewalls, NAT). The scenarios are emulated on Grid5K (wrekavoc for CPU throttling, killed worker processes for churn, firewall/NAT rules for connectivity), with the same worker heartbeat timeout for both systems.

Host churn: a worker process is periodically killed on one node and a new one is launched on another node.
• Hadoop fails when the churn interval is smaller than 30 seconds: the JobTracker marks all temporarily disconnected nodes as dead, blindly removes their completed map tasks from the successful-task list, and re-executes them, which significantly prolongs the job makespan.
• BitDew-MapReduce lets workers go temporarily offline without penalty, because the intermediate outputs of completed map tasks are safely stored on the stable central storage.

CPU heterogeneity and stragglers: the CPU frequency of half of the machines is decreased with wrekavoc; for the straggler test, one node is throttled to 10% of its frequency. Both systems can launch backup tasks on idle machines when a job is close to completion, but Hadoop's heuristic does not fully alleviate the problem, whereas BitDew-MR's dynamic, pull-based task scheduling keeps the job makespan substantially lower.

Network fault tolerance: 25-second offline windows are injected at crash points ranging from 12.5% to 100% of the job progress. Hadoop must re-execute map tasks and its makespan grows markedly with the crash point, while the BitDew-MR makespan stays almost constant with no re-executed map tasks.

Network connectivity: with firewall and NAT rules on the worker nodes, Hadoop cannot even launch a job, because it relies on direct inter-worker communication; in BitDew-MR all intermediate data are transferred through the central storage.

Related work: MOON (MapReduce On Opportunistic eNvironments) also targets volatile resources, but it limits the system scale to a campus and assumes hybrid resources organized around a fixed fraction of dedicated, stable nodes supplemented by volatile personal computers.

G. Fedak (INRIA, France)    MapReduce    Salvador da Bahia, Brazil, 22/10/2013    82 / 84
Section 5
Conclusion

G. Fedak (INRIA, France)    MapReduce    Salvador da Bahia, Brazil, 22/10/2013    83 / 84
Conclusion

• MapReduce runtimes
  • Attractive programming model for data-intense computing
  • Many research issues concerning execution runtimes (heterogeneity, stragglers, scheduling, ...)
  • Many others: security, QoS, iterative MapReduce, ...
• MapReduce beyond the data center:
  • enable large-scale data-intensive computing: Internet MapReduce
• Want to learn more?
  • Home page for the Mini-Curso: http://graal.ens-lyon.fr/gfedak/pmwiki-lri/pmwiki.php/Main/MarpReduceClass
  • the MapReduce Workshop @ HPDC (http://graal.ens-lyon.fr/mapreduce)
  • Desktop Grid Computing, ed. C. Cerin and G. Fedak, CRC Press
  • our websites: http://www.bitdew.net and http://www.xtremweb-hep.org

G. Fedak (INRIA, France)    MapReduce    Salvador da Bahia, Brazil, 22/10/2013    84 / 84