EFFICIENT PROCESSING OF RANK-AWARE QUERIES IN MAP/REDUCE
OIKONOMAKIS SPYRIDON
SOFTWARE ENGINEER AT PEOPLEPERHOUR
Need for a new model
• Exponential data growth
• Need to analyze, use, and scale to ever-growing volumes of data
• Need for parallel processing
• Need to reduce data read and retrieval time
• Need for programmer convenience
• Cost
What is Map/Reduce?
A programming model and runtime environment for distributed data processing that runs in parallel across large clusters of machines
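To make the model concrete, the following is a minimal single-process sketch of the map, shuffle, and reduce phases, using word count as the example job. It only illustrates the programming model; it is not how a real runtime such as Hadoop is implemented, and all function names here are hypothetical.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    """Apply map_fn to every input record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reduce_fn):
    """Call reduce_fn once per key with the full list of values for that key."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def run_job(records, map_fn, reduce_fn):
    """Run the three phases in sequence (a real cluster runs them in parallel)."""
    return reduce_phase(shuffle(map_phase(records, map_fn)), reduce_fn)


# Example job: word count.
def word_count_map(line):
    return [(word, 1) for word in line.split()]


def word_count_reduce(word, counts):
    return sum(counts)


result = run_job(["a b a", "b c"], word_count_map, word_count_reduce)
```

In a real deployment the map and reduce calls are distributed over many machines; only the two user-supplied functions are written by the programmer.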
Is the Map/Reduce model reliable?
Map/Reduce
Weaknesses in Top-K Join Queries
What is a Top-K Join?
Weaknesses
• All the data must be read to retrieve only K results
• Uneven distribution of workload across Reducers
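The first weakness can be seen in a small sketch of a naive Top-K join. The record layout `(join_key, score)` and the scoring function (sum of the two scores) are assumptions made for illustration; the slides do not specify them. Note that every record of both tables is read and every join result is produced before the best K are kept.

```python
import heapq
from collections import defaultdict


def naive_top_k_join(table_a, table_b, k):
    """Join two tables on their key and return the k highest-scoring results.

    Assumption: each record is (join_key, score) and a join result is
    scored by the sum of the two input scores.
    """
    # Index every record of table_b by join key (reads the whole table).
    scores_by_key = defaultdict(list)
    for key, score in table_b:
        scores_by_key[key].append(score)

    # Produce every join result (reads the whole of table_a as well).
    joined = (
        (score_a + score_b, key)
        for key, score_a in table_a
        for score_b in scores_by_key.get(key, ())
    )

    # Only now are the best k results selected.
    return heapq.nlargest(k, joined)


table_a = [("x", 5), ("y", 9), ("x", 1)]
table_b = [("x", 4), ("y", 2), ("z", 10)]
top_2 = naive_top_k_join(table_a, table_b, 2)
```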
Goals of the experiment
• Efficient implementation of Top-K Join queries in the Map/Reduce model
• Addressing the weaknesses observed in Map/Reduce with:
  • Early Termination
  • Load Balancing
Design
• Comparison of three algorithms (1 default and 2 new)
  • Naive
  • EarlyTermination (using bounds)
  • EarlyTermination & LoadBalancing (using bounds and Longest Processing Time)
• Preprocessing
  • Generation of two data tables with join attributes
  • Statistics on the data in the form of histograms
• Processing
  • Computing bounds from the histograms of each table
  • Running Map/Reduce
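The slides do not say how the bounds are derived from the histograms, so the following is only one possible sketch under stated assumptions: records are `(join_key, score)` with scores in `[0, max_score]`, a join result is scored by the sum of its two scores, and histograms use equal-width score bins per join key. The function names are hypothetical. The idea is to find the highest score bins that already guarantee K join results, and from that derive the lowest input score that could still contribute to the Top-K.

```python
from collections import Counter


def build_histogram(table, num_bins, max_score):
    """Count records per (join_key, score_bin) using equal-width bins."""
    width = max_score / num_bins
    hist = Counter()
    for key, score in table:
        bin_index = min(int(score / width), num_bins - 1)
        hist[(key, bin_index)] += 1
    return hist


def early_termination_bound(hist_a, hist_b, k, num_bins, max_score):
    """Return a score below which no input record can reach the Top-K.

    Bins are scanned from the highest scores downwards. Once the bins
    scanned so far guarantee at least k join results, each of those results
    scores at least 2 * (lower edge of the current bin). A record can only
    beat that threshold if its own score is at least threshold - max_score.
    """
    width = max_score / num_bins
    count_a, count_b = Counter(), Counter()
    for b in range(num_bins - 1, -1, -1):
        for (key, bin_index), n in hist_a.items():
            if bin_index == b:
                count_a[key] += n
        for (key, bin_index), n in hist_b.items():
            if bin_index == b:
                count_b[key] += n
        # Records with the same key in the scanned bins all join with each other.
        guaranteed = sum(n * count_b[key] for key, n in count_a.items())
        if guaranteed >= k:
            threshold = 2 * b * width
            return max(0.0, threshold - max_score)
    return 0.0  # fewer than k join results exist: everything must be read


table_a = [("x", 9500), ("x", 100), ("y", 8200)]
table_b = [("x", 9100), ("y", 8900), ("y", 50)]
hist_a = build_histogram(table_a, 10, 10000)
hist_b = build_histogram(table_b, 10, 10000)
bound = early_termination_bound(hist_a, hist_b, 2, 10, 10000)
```

With 10 bins and a maximum value of 10000 (the values listed in the experiment parameters), only the small histograms need to be read to compute the bound, not the tables themselves.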
Design (2)
Early Termination
[Diagram: HDFS (generated sorted data, histograms), EarlyTermInputFormat, EarlyTermRecordReader (check bounds), Mapper, Reducers (process); arrows labeled "Send Data"]
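The bound check performed by the record reader can be sketched as follows. This is a plain Python illustration, not the actual EarlyTermRecordReader; it assumes records are `(join_key, score)` pairs stored in descending score order and that a score bound has already been computed from the histograms.

```python
def early_term_reader(sorted_records, bound):
    """Yield records in descending score order and stop at the first one below the bound.

    Because the input is sorted by score, every later record is also below
    the bound, so the rest of the input never has to be read.
    """
    for key, score in sorted_records:
        if score < bound:
            break
        yield key, score


# The iterator stands in for a file that is read lazily from storage.
records = iter([("x", 9500), ("y", 8200), ("x", 100), ("y", 50)])
passed_to_mapper = list(early_term_reader(records, 8000))
unread = list(records)  # what the reader never touched
```

Only the records that pass the check are sent on to the Mapper, which reduces both the data read and the data shuffled to the Reducers.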
Early Termination & Load Balancing
[Diagram: HDFS (generated sorted data, histograms), EarlyTermInputFormat, EarlyTermRecordReader (check bounds), Mapper, CustomPartitioner, three Reducers; arrows labeled "Send Data"]
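Longest Processing Time (LPT) is a standard greedy scheduling heuristic: sort the jobs by decreasing cost and always give the next job to the least-loaded worker. The sketch below applies it to join keys and reducers. The per-key load estimates (for example, taken from the histograms) and the function name are assumptions; the slides do not show how the CustomPartitioner obtains them.

```python
import heapq


def lpt_assign(key_loads, num_reducers):
    """Assign each join key to a reducer using the LPT heuristic.

    key_loads maps a join key to its estimated processing cost.
    Returns a dict mapping each join key to a reducer index.
    """
    # Min-heap of (current total load, reducer index).
    heap = [(0, reducer) for reducer in range(num_reducers)]
    heapq.heapify(heap)

    assignment = {}
    # Heaviest keys first, each to the currently least-loaded reducer.
    for key, load in sorted(key_loads.items(), key=lambda kv: kv[1], reverse=True):
        total, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        heapq.heappush(heap, (total + load, reducer))
    return assignment


key_loads = {"a": 7, "b": 5, "c": 4, "d": 3, "e": 1}
assignment = lpt_assign(key_loads, 2)

# Resulting load per reducer.
loads = [0, 0]
for key, reducer in assignment.items():
    loads[reducer] += key_loads[key]
```

A partitioner built on this table would simply look up `assignment[key]` for each record instead of hashing the key, so skewed keys no longer pile up on one Reducer.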
Experiment (1)
Parameter: Value
Data distribution: Zipfian
Number of records: 1,000,000 per table
Number of reducers: 10, 6
Number of results (K): 10
Data skew: 0, 0.5, 1
Number of join attributes: 10
Max value for data: 10000
Sorting: By score
Histograms: 10 bins
Cluster: 8 machines
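A test table with these parameters could be generated roughly as follows. This is a sketch, not the generator used in the experiment: it assumes the skew parameter controls a Zipfian distribution over the join key values (weight proportional to 1/rank^skew, so skew 0 is uniform), that scores are uniform integers up to the maximum value, and that the table is stored sorted by score.

```python
import random


def zipf_weights(num_keys, skew):
    """Normalized Zipfian weights: weight of rank r is proportional to 1 / r**skew."""
    raw = [1 / (rank ** skew) for rank in range(1, num_keys + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def generate_table(num_records, num_keys, skew, max_score, seed=0):
    """Generate (join_key, score) records, sorted by descending score."""
    rng = random.Random(seed)
    keys = rng.choices(range(num_keys), weights=zipf_weights(num_keys, skew), k=num_records)
    table = [(key, rng.randint(0, max_score)) for key in keys]
    table.sort(key=lambda record: record[1], reverse=True)
    return table


# A small table; the experiment used 1,000,000 records per table.
table = generate_table(num_records=1000, num_keys=10, skew=1, max_score=10000)
```

With skew 0 every join key is equally likely; with skew 1 the most frequent key appears about twice as often as the second, which is what produces uneven Reducer workloads.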
Experiment Part – Comparison of algorithms (2)
[Chart, REDUCERS = 10: running time (axis 0:00:00 to 0:50:24) versus skew (0, 0.5, 1) for Naive, Early Termination, and Early Termination & Load Balancing]
Experiment Part – Comparison of algorithms (3)
[Chart, REDUCERS = 10: number of records (axis 0 to 2,500,000) versus skew (0, 0.5, 1) for Naive, Early Termination, and Early Termination & Load Balancing]
Experiment Part – Comparison of algorithms (4)
[Chart, labeled REDUCERS = 6: running time (axis 0:00:00 to 0:17:17) versus number of reducers (6, 10) for Early Termination and Early Termination & Load Balancing]
Conclusion
By using the proposed techniques:
• Early Termination
• Load Balancing
it is possible to implement rank-aware (Top-K) queries in Map/Reduce efficiently and to address the weaknesses of the Map/Reduce model
Questions 
???? 
Thank you.
