Parallelizing Multiple Group-by Queries using MapReduce

This document summarizes research on optimizing parallel group-by queries using MapReduce. It presents the MapReduce and MapCombineReduce models, discusses cost estimation, and evaluates experiments comparing the two models. The optimized MapCombineReduce model reduces network communication costs by using a combiner to pre-aggregate data locally on worker nodes before transferring results. Experiments show MapCombineReduce provides better speed-up and scalability for queries with reasonable selectivity.

Parallelizing Multiple Group-by queries using MapReduce: optimization and cost estimation
Jie Pan § · Frédéric Magoulès § · Yann Le Biannic † · Christophe Favart †
§ Ecole Centrale Paris · † SAP Research
Telecommunication Systems 2013
B99705024 林劭軒
B99705021 李奕德
R00725051 郗昀彥

Outline
• MapReduce and Optimized MapReduce
• Cost Estimation
• Experiments and Evaluation

MapReduce
[Figure: MapReduce data flow between the Master Node and the Worker Nodes. Labels: Data, Map, Di, Ii, Reducer, Result.]
serialize :: structured objects → byte stream
de-serialize :: byte stream → structured objects
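The flow on the MapReduce slides can be sketched in a few lines of Python. This is a minimal single-process illustration, not the authors' GridGain implementation: the toy records, the function names (`split`, `map_task`, `reduce_task`, `run_mapreduce`) and the use of `pickle` for serialization are all assumptions made for the example. The mapper filters its fragment Di and returns the selected rows as Ii; the reducer aggregates every Ii on the master.

```python
import pickle
from collections import defaultdict

# Hypothetical toy records (region, product, amount); not the dataset from the slides.
RECORDS = [
    ("north", "apple", 10),
    ("south", "apple", 5),
    ("north", "pear", 7),
    ("south", "pear", 3),
    ("north", "apple", 2),
    ("south", "apple", 8),
]


def split(records, fragment_size):
    """Master side: cut the data into fragments Di of at most fragment_size records."""
    return [records[i:i + fragment_size] for i in range(0, len(records), fragment_size)]


def map_task(serialized_fragment, predicate):
    """Worker side: de-serialize Di, filter it, and serialize the selected rows as Ii."""
    fragment = pickle.loads(serialized_fragment)
    selected = [row for row in fragment if predicate(row)]
    return pickle.dumps(selected)


def reduce_task(serialized_results, key_index, value_index):
    """Master side: de-serialize every Ii and aggregate (SUM) per group-by key."""
    totals = defaultdict(int)
    for blob in serialized_results:
        for row in pickle.loads(blob):
            totals[row[key_index]] += row[value_index]
    return dict(totals)


def run_mapreduce(records, fragment_size, predicate, key_index, value_index):
    """Run the whole job in one process; the list comprehension stands in for the workers."""
    fragments = [pickle.dumps(f) for f in split(records, fragment_size)]
    intermediate = [map_task(f, predicate) for f in fragments]
    return reduce_task(intermediate, key_index, value_index)


# Roughly: SELECT region, SUM(amount) WHERE product = 'apple' GROUP BY region
result = run_mapreduce(RECORDS, 2, lambda r: r[1] == "apple", 0, 2)
print(result)  # {'north': 12, 'south': 13}
```

Every selected row travels back to the master in this version, which is why the volume of intermediate data grows with the selectivity of the query.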

Motivation
• Data Analysis (Business Intelligence)
• Task with Predicates
• High Selectivity => High Communication Cost
• Goal: Reduce the Volume of Intermediate Data
Selectivity = (#Data Satisfying Predicates) / (#Data)
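As a small illustration, selectivity can be computed as the fraction of records that satisfy the predicates, which matches the values below 1 (0.0106, 0.099, 0.185) used in the experiments. The function name and the sample data are hypothetical.

```python
def selectivity(records, predicate):
    """Fraction of records that satisfy the predicate (0.0 for an empty input)."""
    if not records:
        return 0.0
    matching = sum(1 for record in records if predicate(record))
    return matching / len(records)


# Hypothetical data: 1000 integer "records", of which 185 satisfy the predicate.
rows = list(range(1000))
print(selectivity(rows, lambda x: x < 185))  # 0.185
```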

MapCombineReduce (1/2)
[Figure: Map phase between the Master Node and the Worker Nodes. Labels: Data, Map, Di, Ii, signal.]

MapCombineReduce (2/2)
[Figure: Combine and Reduce phases between the Master Node and the Worker Nodes. Labels: Data, Map, Combiner, Ii, Ai, Reducer, Result.]
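The effect of the combiner can be sketched as follows. This is an illustrative single-process example under stated assumptions: the worker names, the sample rows and the helper names (`combine`, `reduce_partials`, `transferred_items`) are hypothetical, and SUM stands in for whatever aggregate the query uses. Each worker pre-aggregates its mapper outputs Ii into one partial aggregate Ai, and only the Ai are sent to the reducer.

```python
from collections import Counter

# Hypothetical selected rows Ii from the mappers on two workers: (group key, measure).
WORKER_OUTPUTS = {
    "worker-1": [[("north", 10), ("south", 5)], [("north", 2), ("north", 4)]],
    "worker-2": [[("south", 8), ("south", 1)], [("north", 3), ("south", 6)]],
}


def combine(mapper_outputs):
    """Worker side: pre-aggregate all local Ii into one partial aggregate Ai."""
    partial = Counter()
    for rows in mapper_outputs:
        for key, value in rows:
            partial[key] += value
    return dict(partial)


def reduce_partials(partials):
    """Master side: merge the partial aggregates Ai into the final result."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values per key
    return dict(total)


def transferred_items(worker_outputs, use_combiner):
    """Count the (key, value) items that would be sent to the master."""
    if use_combiner:
        return sum(len(combine(outputs)) for outputs in worker_outputs.values())
    return sum(len(rows) for outputs in worker_outputs.values() for rows in outputs)


partials = [combine(outputs) for outputs in WORKER_OUTPUTS.values()]
print(reduce_partials(partials))                  # {'north': 19, 'south': 20}
print(transferred_items(WORKER_OUTPUTS, False))   # 8
print(transferred_items(WORKER_OUTPUTS, True))    # 4
```

In this toy case the combiner halves the number of items sent to the master; the saving grows with the number of selected rows per group on each worker.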

Cost
min ∑ Cst + Cw + Ccl + Ccmm

Initial Build (1/4)
Creating a mapping
Serialize Data
For all mappers
Network Factor
Mapper’s Data Transfer Cost
Result Transfer Cost

Initial Build (2/4)
De-serialize Data
Serialize Result
Fragment
Load to Memory
Filter Cost

Initial Build (3/4)
De-serialize All Result
Selected Data
Aggregation Cost

Initial Build (4/4)
• sizem = 0
• Cmpg * nbm is constant

Optimized Build (1/6)
Nodes to be Combined
Size of Combiner’s Object
Does Not Change

Optimized Build (2/6)
Does Not Change
Does Not Serialize Result

Optimized Build (3/6)
Serialize Intermediate Result

Optimized Build (4/6)
De-serialize Intermediate Result

Optimized Build (5/6)
• Network Factor * (Start to Map + Worker to Combiner + Reduce Phase)
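The communication term above can be written as a small function. This is only a sketch of the stated product: the parameter names, the units (a per-byte network factor and transfer sizes in bytes) and the sample numbers are assumptions, since the slide equations themselves are not reproduced in the text.

```python
def optimized_communication_cost(network_factor, start_to_map, worker_to_combiner, reduce_phase):
    """Network Factor * (Start to Map + Worker to Combiner + Reduce Phase).

    Assumed units: network_factor in seconds per byte, the three transfers in bytes.
    """
    return network_factor * (start_to_map + worker_to_combiner + reduce_phase)


# Hypothetical placeholder values, not measurements from the experiments.
print(optimized_communication_cost(2.0, 1, 2, 3))  # 12.0
```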

Optimized Build (6/6)
• sizem = 0
• sizec = 0
• Cmpg * nbm is constant

Experiments Environment (1/2)
• Running the experiments on Grid'5000 [1]
• 9 sites geographically distributed in France
• featuring 5000 processors
• 1 cluster situated in the Sophia site
• IBM eServer 325
• Total number of nodes in this cluster: 49
[1] https://www.grid5000.fr/

Experiments Environment (2/2)
• Each node is composed of
• 2 CPUs of AMD Opteron 246
• 1 MB of cache, 2 GB of memory
• network: 2xGigabit Ethernet
• Java 1.6, GridGain 2.1.1

Dataset
• Dataset: 640000 records
• Each record contains 15 columns
• partition with 5 different fragment sizes
• 1000, 2000, 4000, 8000 and 16000
• with selectivity = 0.0106, 0.099 and 0.185
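The figures above imply the following fragment counts and approximate numbers of selected records. The arithmetic is a sketch: the variable and function names are hypothetical, and the selected-record counts are approximate because the selectivities on the slide are rounded values.

```python
TOTAL_RECORDS = 640_000
FRAGMENT_SIZES = [1000, 2000, 4000, 8000, 16000]
SELECTIVITIES = [0.0106, 0.099, 0.185]


def fragment_count(total_records, fragment_size):
    """Number of fragments needed to hold all records (ceiling division)."""
    return -(-total_records // fragment_size)


fragments = {size: fragment_count(TOTAL_RECORDS, size) for size in FRAGMENT_SIZES}
selected = {s: round(s * TOTAL_RECORDS) for s in SELECTIVITIES}

print(fragments)  # {1000: 640, 2000: 320, 4000: 160, 8000: 80, 16000: 40}
print(selected)   # {0.0106: 6784, 0.099: 63360, 0.185: 118400}
```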

Experiments
• Run a sequential test on
• 1 machine
• Launch the parallel tests in GridGain on
• 5, 10, 15 and 20 machines

Result
• When the selectivity is larger, the optimized version speeds up better than the initial version.
• When the query's selectivity is small, only a small amount of data needs to be transferred over the network.
• When the query's selectivity is large, the communication cost becomes dominant.
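Speed-up is usually computed as the sequential running time divided by the parallel running time, and efficiency as the speed-up divided by the number of machines. The sketch below assumes that standard definition applies to the sequential (1 machine) and parallel (5, 10, 15 and 20 machines) runs described above; the timings are placeholders, not results from the experiments.

```python
def speed_up(sequential_time, parallel_time):
    """Standard speed-up: how many times faster the parallel run is."""
    return sequential_time / parallel_time


def efficiency(sequential_time, parallel_time, machines):
    """Speed-up per machine; 1.0 would be ideal linear scaling."""
    return speed_up(sequential_time, parallel_time) / machines


# Placeholder timings in seconds, chosen only to show the arithmetic.
print(speed_up(100.0, 25.0))       # 4.0
print(efficiency(100.0, 25.0, 5))  # 0.8
```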

Scalability
• use several datasets having the same columns
• composed of 640000, 1280000, 1920000 and 2560000 records
• Fragment: 16000
• Run the queries with the same selectivity

Conclusion
• MapReduce Model
• MapCombineReduce Model
• The combiner: a pre-aggregator that aggregates on the worker node
• Reduces the amount of intermediate data transferred over the network
• Cost estimation
• Experimental results
• Better speed-up and scalability for a reasonable selectivity

14. Cost Estimation

15. Notations – general

27. Compare: the factors have changed!

28. Experiments and Evaluation

33. Results - Query Selectivity 0.0106

34. Results - Query Selectivity 0.099

35. Results - Query Selectivity 0.185
