Through experiments with three different algorithms, this presentation aims to show the disadvantages of the default operation of the Map/Reduce programming model for Top-K queries, and to present a recommended solution for processing such queries efficiently. Two major shortcomings are addressed: the lack of Early Termination and the lack of Load Balancing. The solution has been implemented in code.
Efficient processing of Rank-aware queries in Map/Reduce
1. EFFICIENT PROCESSING OF RANK-AWARE QUERIES IN MAP/REDUCE
OIKONOMAKIS SPYRIDON
SOFTWARE ENGINEER AT PEOPLEPERHOUR
2. Need for a new model
Exponential data growth
Need for analysis, utilization and scalability of more and more data
Need for parallel processing
Need to reduce data read and retrieval time
Need for programmer convenience
Cost
3. What is Map/Reduce?
A programming model and runtime environment for distributed data processing that runs in parallel on large clusters of machines
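To make the model concrete, here is a minimal single-process sketch of the map, shuffle and reduce phases. It is only an illustration: the names run_map_reduce, count_mapper and sum_reducer are hypothetical, and a real Map/Reduce runtime distributes these phases across many machines.

```python
from collections import defaultdict


def run_map_reduce(records, mapper, reducer):
    """Simulate map, shuffle (group by key) and reduce in one process."""
    groups = defaultdict(list)
    # Map phase: each record may emit any number of (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            # Shuffle phase: values are grouped by key.
            groups[key].append(value)
    # Reduce phase: each key is reduced independently of the others.
    return {key: reducer(key, values) for key, values in groups.items()}


def count_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.split():
        yield word, 1


def sum_reducer(key, values):
    """Add up all values emitted for one key."""
    return sum(values)


word_counts = run_map_reduce(["a b a", "b c"], count_mapper, sum_reducer)
```

Because each key is reduced independently, the reduce phase can run in parallel, one group of keys per Reducer.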
6. Weaknesses in Top-K Join Queries
What is the Top-K Join?
Weaknesses
All the data is read in order to retrieve only K results
Uneven distribution of workload across Reducers
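The first weakness can be illustrated with a small sketch of a naive Top-K join. It assumes rows of the form (join_key, score) and that a joined pair is scored by the sum of its two scores; both are assumptions made for illustration, since the slides do not state the scoring function. The name naive_top_k_join is hypothetical.

```python
import heapq
from collections import defaultdict


def naive_top_k_join(table_a, table_b, k):
    """Join two lists of (join_key, score) rows and return the k best pairs.

    Every row of both tables is read, although only k results are kept.
    Assumption: the score of a joined pair is the sum of the two row scores.
    """
    b_scores_by_key = defaultdict(list)
    for key, score in table_b:
        b_scores_by_key[key].append(score)
    joined = (
        (score_a + score_b, key)
        for key, score_a in table_a
        for score_b in b_scores_by_key.get(key, ())
    )
    return heapq.nlargest(k, joined)


table_a = [(1, 90), (2, 80), (1, 10)]
table_b = [(1, 70), (2, 95), (3, 99)]
top_two = naive_top_k_join(table_a, table_b, 2)
```

The low-scoring row (1, 10) is still read and joined even though it cannot appear in the result, which is the cost that early termination tries to avoid.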
7. Goals of the experiment
Efficient implementation of Top-K Join queries in the Map/Reduce model
Addressing the weaknesses of Map/Reduce with:
Early Termination
Load Balancing
8. Design
Comparison of three algorithms (1 default and 2 new)
Naive
EarlyTermination (using bounds)
EarlyTermination & LoadBalancing (using bounds and Longest Processing Time)
Preprocessing
Generation of two data tables with join attributes
Statistics for the data in the form of histograms
Processing
Calculating bounds from the histograms for each table
Run Map/Reduce
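One possible way to derive score bounds from histograms is sketched below. This is not necessarily the method used in the experiment. It assumes that each histogram bin stores a score range and a count of rows per join value, and that a joined pair is scored by the sum of its two scores. The name early_termination_bounds and the bin layout are hypothetical.

```python
def early_termination_bounds(hist_a, hist_b, k):
    """Return (bound_a, bound_b); rows scoring below these cannot be in the top k.

    Each histogram is a list of bins (low, high, counts), where counts maps a
    join value to the number of rows in that bin with that value.
    Returns None when the histograms cannot guarantee k join results.
    """
    pairs = []
    for low_a, _, counts_a in hist_a:
        for low_b, _, counts_b in hist_b:
            # Join results produced by this pair of bins; each one scores
            # at least low_a + low_b.
            results = sum(n * counts_b.get(v, 0) for v, n in counts_a.items())
            if results:
                pairs.append((low_a + low_b, results))
    pairs.sort(reverse=True)
    guaranteed = 0
    for threshold, results in pairs:
        guaranteed += results
        if guaranteed >= k:
            # At least k results score >= threshold, so a row is useful only
            # if it can reach threshold with the best row of the other table.
            max_a = max(high for _, high, _ in hist_a)
            max_b = max(high for _, high, _ in hist_b)
            return threshold - max_b, threshold - max_a
    return None


hist_a = [(75, 100, {1: 2}), (50, 75, {1: 1, 2: 2}),
          (25, 50, {2: 3}), (0, 25, {1: 4})]
hist_b = [(75, 100, {1: 1, 2: 1}), (50, 75, {1: 2}),
          (25, 50, {2: 2}), (0, 25, {1: 1})]
bounds = early_termination_bounds(hist_a, hist_b, 2)
```

In this example the two top bins already guarantee two join results with a score of at least 150, so rows scoring below 50 in either table can be skipped.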
10. Early Termination
[Diagram of the data flow. Components: HDFS (generated sorted data, histograms), EarlyTermInputFormat, EarlyTermRecordReader (check bounds), Mapper, Reducers (process). Data is sent from stage to stage.]
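The bound check can be sketched as a reader that stops as soon as the sorted input drops below the bound. This is a plain-Python illustration of the idea, not the Hadoop class named on the slide. The function early_term_reader is hypothetical, and it assumes rows of the form (join_key, score) sorted by score in descending order.

```python
def early_term_reader(sorted_rows, bound):
    """Yield rows until the first row whose score falls below the bound.

    Assumption: sorted_rows is sorted by score in descending order, so every
    row after the first one below the bound is below the bound as well.
    """
    for key, score in sorted_rows:
        if score < bound:
            break  # early termination: the rest of the input is never read
        yield key, score


rows = [(1, 95), (2, 80), (1, 50), (3, 49), (2, 10)]
kept = list(early_term_reader(rows, 50))
```

Only the rows that can still contribute to the Top-K result are passed on to the Mapper.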
11. Early Termination & Load Balancing
[Diagram of the data flow. Components: HDFS (generated sorted data, histograms), EarlyTermInputFormat, EarlyTermRecordReader (check bounds), Mapper, CustomPartitioner, three Reducers. Data is sent from stage to stage.]
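Longest Processing Time (LPT) is a standard greedy scheduling rule: take the jobs in order of decreasing cost and give each one to the currently least-loaded worker. The sketch below applies it to the assignment of join values to Reducers. The function lpt_assign and the idea of estimating a cost per join value (for example from the histograms) are assumptions made for illustration.

```python
import heapq


def lpt_assign(costs, num_reducers):
    """Map each join value to a reducer using Longest Processing Time first.

    costs maps a join value to its estimated processing cost.
    """
    # Min-heap of (current load, reducer id): the least-loaded reducer is on top.
    heap = [(0, reducer) for reducer in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    # Most expensive join values first; ties are broken by the value itself.
    for value, cost in sorted(costs.items(), key=lambda kv: (-kv[1], kv[0])):
        load, reducer = heapq.heappop(heap)
        assignment[value] = reducer
        heapq.heappush(heap, (load + cost, reducer))
    return assignment


costs = {"a": 7, "b": 5, "c": 4, "d": 3, "e": 1}
assignment = lpt_assign(costs, 2)
```

In this example both reducers end up with a total cost of 10. A custom partitioner could then look up the reducer for each join value in such a table instead of hashing the key.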
12. Experiment (1)
Parameter: Value
Data Distribution: Zipfian
Number of records: 1,000,000 per table
Number of reducers: 10, 6
Number of K results: 10
Data skew: 0, 0.5, 1
Number of Joining Attributes: 10
Max value for data: 10000
Sorting: By score
Histograms: 10 bins
Cluster: 8 machines
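A table with these parameters could be generated roughly as follows. This is a sketch under stated assumptions: the skew parameter is taken to be the Zipf exponent applied to the join values (0 gives a uniform distribution), scores are drawn uniformly between 0 and the maximum value, and rows are sorted by score in descending order. The functions zipf_weights and generate_table are hypothetical.

```python
import random


def zipf_weights(num_values, skew):
    """Unnormalised Zipf weights 1 / rank**skew; skew 0 is uniform."""
    return [1.0 / (rank ** skew) for rank in range(1, num_values + 1)]


def generate_table(num_rows, num_values, skew, max_score, seed=0):
    """Return (join_value, score) rows sorted by score, highest first."""
    rng = random.Random(seed)
    weights = zipf_weights(num_values, skew)
    keys = rng.choices(range(num_values), weights=weights, k=num_rows)
    rows = [(key, rng.randint(0, max_score)) for key in keys]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# A small table; the experiment used 1,000,000 records per table.
sample = generate_table(num_rows=1000, num_values=10, skew=1, max_score=10000)
```

With a higher skew, a few join values account for most of the rows, which is what makes the workload per Reducer uneven.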
13. Experiment Part – Comparison of algorithms (2)
[Chart: running time (axis from 0:00:00 to 0:50:24) versus skew (0, 0.5, 1), REDUCERS = 10. Series: Naive, Early Termination, Early Termination & Load Balancing.]
14. Experiment Part – Comparison of algorithms (3)
[Chart: number of records (axis from 0 to 2,500,000) versus skew (0, 0.5, 1), REDUCERS = 10. Series: Naive, Early Termination, Early Termination & Load Balancing.]
15. Experiment Part – Comparison of algorithms (4)
[Chart: running time (axis from 0:00:00 to 0:17:17) versus number of Reducers (6, 10), panel title REDUCERS = 6. Series: Early Termination, Early Termination & Load Balancing.]
16. Conclusion
By using the proposed techniques:
Early Termination
Load Balancing
it is possible to implement rank-aware (Top-K) queries in Map/Reduce efficiently and to overcome the disadvantages of the Map/Reduce model