This document summarizes research on optimizing parallel group-by queries using MapReduce. It presents the MapReduce and MapCombineReduce models, discusses cost estimation, and evaluates experiments comparing the two models. The optimized MapCombineReduce model reduces network communication costs by using a combiner to pre-aggregate data locally on worker nodes before transferring results. Experiments show MapCombineReduce provides better speed-up and scalability for queries with reasonable selectivity.
1. Parallelizing Multiple Group-by Queries using MapReduce: Optimization and Cost Estimation
Jie Pan § · Frédéric Magoulès § · Yann Le Biannic † · Christophe Favart †
§ Ecole Centrale Paris · † SAP Research
B99705024 林劭軒
B99705021 李奕德
R00725051 郗昀彥
Telecommunication Systems 2013
8. MapReduce
[Figure: MapReduce data flow between the master node and the worker nodes. Labels: Data, data fragments Di, Map, intermediate results Ii, Reducer, Result.]
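A minimal Python sketch of how a single group-by query might map onto this model, assuming each worker node holds one fragment Di and returns an intermediate result Ii to the master node. The record layout, the predicate, and the function names `map_fragment` and `reduce_results` are illustrative assumptions, not the GridGain API used in the experiments.

```python
from collections import defaultdict

def map_fragment(fragment, predicate, key_col, value_col):
    """Worker side: filter fragment Di, emit one (key, value) pair per matching record."""
    return [(rec[key_col], rec[value_col]) for rec in fragment if predicate(rec)]

def reduce_results(intermediate_results):
    """Master side: merge all intermediate results Ii into per-group sums."""
    totals = defaultdict(int)
    for pairs in intermediate_results:
        for key, value in pairs:
            totals[key] += value
    return dict(totals)

# Hypothetical data: two fragments of records with color, year and amount fields.
fragments = [
    [{"color": "red", "year": 2012, "amount": 5},
     {"color": "blue", "year": 2011, "amount": 7}],
    [{"color": "red", "year": 2012, "amount": 3},
     {"color": "red", "year": 2012, "amount": 4}],
]

def is_2012(rec):
    return rec["year"] == 2012

# SELECT color, SUM(amount) ... WHERE year = 2012 GROUP BY color
intermediate = [map_fragment(f, is_2012, "color", "amount") for f in fragments]
result = reduce_results(intermediate)
pairs_sent = sum(len(pairs) for pairs in intermediate)
```

In this sketch every matching record travels to the master as its own pair, so the intermediate volume grows with the number of records that satisfy the predicate.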
9. Motivation
• Data Analysis (Business Intelligence)
• Task with Predicates
• High Selectivity => High Communication Cost
• Goal: Reduce the Volume of Intermediate Data
[Figure: master node and worker nodes. Labels: data fragments Di, intermediate result Ii.]
Selectivity = (#Data Satisfying Predicates) / (#Data)
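As a small illustration, selectivity can be computed as the fraction of records that satisfy the predicates, which is consistent with the values below 1 reported in the experiments. The function name `selectivity` and the sample data are hypothetical.

```python
def selectivity(records, predicate):
    """Fraction of records that satisfy the predicate (0.0 for an empty input)."""
    if not records:
        return 0.0
    matching = sum(1 for rec in records if predicate(rec))
    return matching / len(records)

# Hypothetical example: 1000 integer records, the predicate keeps multiples of 10.
records = list(range(1000))
s = selectivity(records, lambda x: x % 10 == 0)
```

A higher value means more records survive the filter, and therefore more intermediate data to send from the worker nodes to the master node.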
10. MapCombineReduce (1/2)
[Figure: MapCombineReduce data flow between the master node and the worker nodes. Labels: Data, data fragments Di, Map, intermediate results Ii, signal.]
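The effect of the combiner can be sketched in Python as follows. This is a simplified illustration that assumes a SUM aggregate and one fragment per worker; it does not model the signal exchange shown in the figure, and the names `combine` and `reduce_partials` are hypothetical.

```python
from collections import defaultdict

def map_fragment(fragment, predicate, key_col, value_col):
    """Worker side: filter the fragment, emit one (key, value) pair per matching record."""
    return [(rec[key_col], rec[value_col]) for rec in fragment if predicate(rec)]

def combine(pairs):
    """Worker side: pre-aggregate the mapper output into one partial sum per group."""
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return dict(partial)

def reduce_partials(partials):
    """Master side: merge the partial aggregates received from all workers."""
    totals = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            totals[key] += value
    return dict(totals)

# Hypothetical data: two fragments, all records satisfy the predicate.
fragments = [
    [{"color": "red", "amount": 5}, {"color": "blue", "amount": 7}],
    [{"color": "red", "amount": 3}, {"color": "red", "amount": 4}],
]

def keep_all(rec):
    return True

mapped = [map_fragment(f, keep_all, "color", "amount") for f in fragments]
partials = [combine(pairs) for pairs in mapped]
result = reduce_partials(partials)

sent_without_combiner = sum(len(pairs) for pairs in mapped)
sent_with_combiner = sum(len(partial) for partial in partials)
```

With the combiner, each worker sends at most one entry per distinct group instead of one entry per matching record, which is where the reduction in intermediate data comes from.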
29. Experiments Environment (1/2)
• Running the experiments on Grid'5000 [1]
• 9 sites geographically distributed in France
• featuring 5000 processors
• 1 cluster situated in the Sophia site
• IBM eServer 325
• Total number of nodes in this cluster: 49
[1] https://www.grid5000.fr/
30. Experiments Environment (2/2)
• Each node is composed of
• 2 AMD Opteron 246 CPUs
• 1 MB of cache, 2 GB of memory
• Network: 2 × Gigabit Ethernet
• Java 1.6, GridGain 2.1.1
31. Dataset
• Dataset: 640000 records
• Each record contains 15 columns
• Partitioned with 5 different fragment sizes
• 1000, 2000, 4000, 8000 and 16000
• Queries with selectivity = 0.0106, 0.099 and 0.185
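The figures above imply the following fragment counts and approximate numbers of matching records. This is a back-of-the-envelope sketch that assumes fragment sizes are expressed in records and that selectivity is the fraction of records satisfying the predicates.

```python
TOTAL_RECORDS = 640_000
FRAGMENT_SIZES = [1000, 2000, 4000, 8000, 16000]
SELECTIVITIES = [0.0106, 0.099, 0.185]

# Number of fragments for each fragment size (assumes sizes are in records).
fragments_per_size = {size: TOTAL_RECORDS // size for size in FRAGMENT_SIZES}

# Approximate number of records that satisfy the predicates for each selectivity.
matching_records = {s: round(TOTAL_RECORDS * s) for s in SELECTIVITIES}
```

Under these assumptions the dataset splits into 640, 320, 160, 80 and 40 fragments, and roughly 6784, 63360 and 118400 records pass the filter for the three selectivities.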
32. Experiments
• Run a sequential test on
• 1 machine
• Launch the parallel tests in GridGain on
• 5, 10, 15 and 20 machines
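Speed-up is conventionally the sequential run time divided by the parallel run time. The sketch below shows that calculation for the machine counts listed above; the timings are placeholders made up for illustration, not measurements from the experiments.

```python
def speed_up(t_sequential, t_parallel):
    """Conventional speed-up: sequential time divided by parallel time."""
    if t_parallel <= 0:
        raise ValueError("parallel time must be positive")
    return t_sequential / t_parallel

def efficiency(t_sequential, t_parallel, machines):
    """Speed-up normalized by the number of machines."""
    return speed_up(t_sequential, t_parallel) / machines

# Placeholder timings in seconds (hypothetical, NOT measured results).
t_seq = 100.0
t_par = {5: 25.0, 10: 12.5, 15: 10.0, 20: 8.0}

speed_ups = {n: speed_up(t_seq, t) for n, t in t_par.items()}
efficiencies = {n: efficiency(t_seq, t, n) for n, t in t_par.items()}
```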
36. Result
• When the selectivity is larger, the optimized version (MapCombineReduce) achieves a better speed-up than the initial version (MapReduce).
• When the query's selectivity is small, only a small amount of data needs to be transferred over the network.
• When the query's selectivity is large, the communication cost becomes dominant.
37. Scalability
• Use several datasets having the same columns
• Composed of 640000, 1280000, 1920000 and 2560000 records
• Fragment size: 16000
• Run the queries with the same selectivity
38. Conclusion
• MapReduce Model
• MapCombineReduce Model
• The combiner: a pre-aggregator that aggregates data locally on each worker node
• Reduces the amount of intermediate data transferred over the network
• Cost estimation
• Experimental results
• Better speed-up and scalability for a reasonable selectivity