MRShare: Sharing Across Multiple Queries in MapReduce. Tomasz Nykiel (University of Toronto), Michalis Potamias (Boston University), Chaitanya Mishra (University of Toronto, currently Facebook), George Kollios (Boston University), Nick Koudas (University of Toronto)
Data management landscape: flexibility, large-scale setups, time performance, efficiency. MRShare is a sharing framework for MR.
MRShare – a sharing framework for Map Reduce. The MRShare framework is inspired by sharing primitives from the relational domain, introduces a cost model for Map Reduce jobs, searches for the optimal sharing strategies, and does not change the Map Reduce computational model.
Outline: Introduction; Map Reduce recap; MRShare – sharing primitives in Map-Reduce; MRShare – cost-based approach to sharing; MRShare evaluation; Summary
Map Reduce recap. [Figure: Map and Reduce stages connected by the network, with input read from and output written to HDFS]
Sharing primitives – sharing scans. SQL: SELECT COUNT(*) FROM user GROUP BY hometown; SELECT AVG(age) FROM user GROUP BY hometown. [Figure: both jobs map the same input records (e.g. id1, student, Toronto); the first emits (hometown, 1) pairs, the second emits (hometown, age) pairs, and each job reduces its own pairs per hometown]
MRShare – sharing scans (map). [Figure: the input is read once by a meta-map that runs Map 1 to Map 4 and produces the map output]
MRShare – sharing scans (reduce). [Figure: a meta-reduce that runs Reduce 1 to Reduce 4]
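The scan sharing described on the slides above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the MRShare implementation: the toy USERS table, the JOBS registry, and the idea of tagging each map output with its job name are all hypothetical choices made for the example.

```python
from collections import defaultdict

# Hypothetical toy "user" table: (id, occupation, hometown, age).
USERS = [
    ("id1", "student", "Toronto", 17),
    ("id2", "student", "Toronto", 19),
    ("id3", "teacher", "Toronto", 18),
    ("id4", "student", "Montreal", 20),
    ("id5", "teacher", "Ottawa", 23),
    ("id6", "student", "Ottawa", 24),
    ("id7", "teacher", "Ottawa", 25),
]


def map_count(row):
    # Job 1: SELECT COUNT(*) FROM user GROUP BY hometown
    yield row[2], 1


def map_avg_age(row):
    # Job 2: SELECT AVG(age) FROM user GROUP BY hometown
    yield row[2], row[3]


def reduce_count(values):
    return sum(values)


def reduce_avg(values):
    return sum(values) / len(values)


# Hypothetical registry: job name -> (map function, reduce function).
JOBS = {"count": (map_count, reduce_count), "avg_age": (map_avg_age, reduce_avg)}


def meta_map(row):
    # One pass over the row feeds every job's map function. The job name is
    # attached to the key so the reduce side can tell the outputs apart.
    for job, (map_fn, _) in JOBS.items():
        for key, value in map_fn(row):
            yield (job, key), value


def run_shared(rows):
    groups = defaultdict(list)
    scans = 0
    for row in rows:  # the input is scanned once for all jobs
        scans += 1
        for tagged_key, value in meta_map(row):
            groups[tagged_key].append(value)
    # Meta-reduce: dispatch each group to the reduce function of its job.
    results = {job: {} for job in JOBS}
    for (job, key), values in groups.items():
        results[job][key] = JOBS[job][1](values)
    return results, scans
```

Calling run_shared(USERS) reads each row once and returns one result table per job. Note that the map output is not reduced here: every job still emits its own pairs, which is why only the read cost is shared.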
Sharing primitives – sharing intermediate data. SQL: SELECT COUNT(*) FROM user WHERE occupation = 'student' GROUP BY hometown; SELECT COUNT(*) FROM user WHERE age > 18 GROUP BY hometown. [Figure: both jobs map the same input records (e.g. id1, student, Toronto); one filters on occupation = 'student', the other on age > 18, both emit (hometown, 1) pairs, and each job sums its pairs per hometown]
MRShare – sharing intermediate data (map). [Figure: the input is read once by a meta-map that runs Map 1 to Map 4 and produces the map output]
MRShare – sharing intermediate data (reduce). [Figure: a meta-reduce that runs Reduce 1 to Reduce 4]
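A rough sketch of sharing intermediate data for the two filtered COUNT(*) queries above: when both jobs would emit the same (hometown, 1) pair for a row, the pair is emitted once and tagged with the set of jobs it belongs to. The toy USERS table, the PREDICATES registry, and the frozenset tagging scheme are assumptions made for this example and are not taken from the slides.

```python
from collections import defaultdict

# Hypothetical toy "user" table: (id, occupation, hometown, age).
USERS = [
    ("id1", "student", "Toronto", 17),
    ("id2", "student", "Toronto", 19),
    ("id3", "teacher", "Toronto", 18),
    ("id4", "student", "Montreal", 20),
    ("id5", "teacher", "Ottawa", 23),
    ("id6", "student", "Ottawa", 24),
    ("id7", "teacher", "Ottawa", 25),
]

# Hypothetical registry: job name -> WHERE predicate of that job.
PREDICATES = {
    "students": lambda row: row[1] == "student",  # occupation = 'student'
    "over_18": lambda row: row[3] > 18,           # age > 18
}


def meta_map(row):
    # Both jobs emit (hometown, 1); emit the pair once, tagged with every
    # job whose predicate accepts the row.
    tags = frozenset(job for job, pred in PREDICATES.items() if pred(row))
    if tags:
        yield row[2], tags, 1


def run_shared(rows):
    groups = defaultdict(list)
    emitted = 0
    for row in rows:
        for key, tags, value in meta_map(row):
            groups[key].append((tags, value))
            emitted += 1
    # Meta-reduce: each job sums only the values tagged with its name.
    results = {job: {} for job in PREDICATES}
    for key, tagged_values in groups.items():
        for job in PREDICATES:
            total = sum(v for tags, v in tagged_values if job in tags)
            if total:
                results[job][key] = total
    return results, emitted


def count_unshared_pairs(rows):
    # Number of pairs the two jobs would emit if they ran separately.
    return sum(1 for row in rows for pred in PREDICATES.values() if pred(row))
```

On this toy table the shared run emits fewer pairs than the two jobs would emit separately, which is the source of the sorting and copying savings discussed later in the deck.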
Outline: Introduction; Map Reduce recap; MRShare – sharing primitives in Map-Reduce; MRShare – cost-based approach to sharing (cost model for finding the optimal sharing strategy; SplitJobs – cost-based algorithm for sharing scans; MultiSplitJobs – an improvement of SplitJobs; γ-MultiSplitJobs – the algorithm for sharing intermediate data); MRShare evaluation; Summary
Cost model for Map Reduce (single job). Reading the input: f(input size). Sorting the intermediate data: f(intermediate data size). Copying: f(intermediate data size). Writing the output: f(output size).
Cost of executing a group of jobs. [Figure: Read, Sort, Copy and Write costs of J1, J2 and J3 run separately, compared with the merged job J1+J2+J3; the comparison shows potential costs, potential savings and savings]
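The four cost components above can be turned into a small back-of-the-envelope calculator. The slides only say that each cost is a function of a data size; the linear cost functions, the coefficient values, and the sort_penalty knob below are assumptions chosen to keep the sketch simple, not the cost model of MRShare itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    input_size: float
    intermediate_size: float
    output_size: float


# Hypothetical per-unit cost coefficients (assumed linear cost functions).
READ, SORT, COPY, WRITE = 1.0, 2.0, 1.5, 1.0


def job_cost(job):
    # Read + sort + copy + write for a single job.
    return (READ * job.input_size
            + SORT * job.intermediate_size
            + COPY * job.intermediate_size
            + WRITE * job.output_size)


def no_share_cost(jobs):
    # Every job runs on its own and scans the input separately.
    return sum(job_cost(job) for job in jobs)


def shared_scan_cost(jobs, sort_penalty=1.0):
    # Assumes all jobs read the same input, so it is scanned once.
    # sort_penalty >= 1 is a hypothetical factor modelling the possibility
    # that sorting the combined intermediate data costs more than sorting
    # each job's data separately.
    intermediate = sum(job.intermediate_size for job in jobs)
    output = sum(job.output_size for job in jobs)
    return (READ * jobs[0].input_size
            + sort_penalty * SORT * intermediate
            + COPY * intermediate
            + WRITE * output)
```

With three identical jobs Job(100.0, 10.0, 1.0), running them separately costs 408.0 under these coefficients, sharing the scan costs 208.0, and sharing with sort_penalty=5.0 costs 448.0. The last case is a toy instance of sharing costing more than it saves.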
Finding the optimal sharing strategy. [Figure: alternative groupings of jobs J1 to J5, including "NoShare" and "GreedyShare"]
Sharing scans – cost-based optimization. [Figure: Read and Sort costs of J1, J2 and J3 run separately, compared with the merged job J1+J2+J3, showing potential costs and savings] Savings come from the reduced number of scans. The sorting cost might change. The costs of copying and writing the output do not change.
SplitJobs – a dynamic programming (DP) solution for sharing scans. We reduce the problem of grouping to the problem of splitting a sorted list of jobs, by approximating the cost of sorting. [Figure: SplitJobs splits the sorted list J1 to J6 into groups G1, G2 and G3]
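Splitting a sorted list into contiguous groups is a classic dynamic programming problem, and a generic version can be sketched as follows. The function name split_jobs, the group_savings callback, and the toy savings formula in the usage note are hypothetical; the slides do not give the actual savings function used by SplitJobs.

```python
def split_jobs(n, group_savings):
    """Split jobs 0..n-1 (assumed already sorted) into contiguous groups.

    group_savings(start, end) is a caller-supplied (hypothetical) function
    returning the savings of running jobs start..end-1 as one shared group.
    Returns the best total savings and the chosen groups.
    """
    best = [0.0] * (n + 1)  # best[i]: best total savings for the first i jobs
    cut = [0] * (n + 1)     # cut[i]: start index of the last group in that solution
    for end in range(1, n + 1):
        best[end] = float("-inf")
        for start in range(end):
            candidate = best[start] + group_savings(start, end)
            if candidate > best[end]:
                best[end] = candidate
                cut[end] = start
    # Walk the cut points backwards to recover the groups.
    groups = []
    end = n
    while end > 0:
        start = cut[end]
        groups.append(list(range(start, end)))
        end = start
    groups.reverse()
    return best[n], groups


# Toy savings function, purely illustrative: each extra job in a group saves
# one scan (SCAN), while a wide spread of intermediate data sizes within the
# group is penalised.
SIZES = [1, 1, 2, 8, 9, 30]  # hypothetical intermediate data sizes, sorted
SCAN, PENALTY = 10, 1


def toy_savings(start, end):
    members = end - start
    spread = SIZES[end - 1] - SIZES[start]
    return (members - 1) * SCAN - PENALTY * spread * members
```

With these toy numbers, split_jobs(len(SIZES), toy_savings) groups the three small jobs together, pairs the two medium jobs, and leaves the largest job on its own, for total savings of 25. The two nested loops make the search quadratic in the number of jobs.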
MultiSplitJobs – an improvement of SplitJobs. [Figure: SplitJobs is applied repeatedly to jobs J1 to J8, producing groups G1 to G4]
Sharing intermediate data – cost-based optimization. [Figure: Read, Sort and Copy costs of J1, J2 and J3 run separately, compared with the merged job J1+J2+J3, showing savings, potential savings, and potential costs or savings] The sorting and copying costs change, depending on the size of the intermediate data. We need to estimate the size of the intermediate data of all combinations of jobs, and the cost of maintaining such statistics is prohibitive.
γ-MultiSplitJobs – the solution for sharing intermediate data. Approximate the size of the intermediate data. [Figure: the intermediate data size of the combined jobs J1, J2 and J3 expressed through the sizes of the individual jobs and a factor γ] γ is set heuristically.
Evaluation setup. 40 EC2 small-instance virtual machines. Modified Hadoop engine. 30 GB text dataset consisting of blogs. Multiple grep-wordcount queries: each counts words matching a regular expression, allows for variable intermediate data sizes, and is a generic aggregation Map Reduce job.
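For readers unfamiliar with the workload, a grep-wordcount query can be sketched in plain Python as below. The function names, whitespace tokenisation, and whole-word matching with re.fullmatch are assumptions made for this example; the slides do not specify how words are tokenised or matched.

```python
import re
from collections import Counter


def grep_map(line, regex):
    # Map: emit (word, 1) for every word that matches the regular expression.
    for word in line.split():
        if regex.fullmatch(word):
            yield word, 1


def grep_wordcount(lines, pattern):
    regex = re.compile(pattern)
    counts = Counter()
    for line in lines:
        for word, one in grep_map(line, regex):
            counts[word] += one  # Reduce: sum the ones per word
    return counts
```

A narrow pattern such as r"t.*" lets few words through the map stage, while r".*" lets every word through, which is how a single query template can produce very different intermediate data sizes.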
Evaluation goals. Sharing is not always beneficial: the 'GreedyShare' policy. How much can we save on sharing scans? Evaluation of MRShare MultiSplitJobs. How much can we save on sharing intermediate data? Evaluation of MRShare γ-MultiSplitJobs.
Is sharing always beneficial? – 'GreedyShare' policy
How much do we save on sharing scans? – MRShare MultiSplitJobs
How much do we save on sharing intermediate data? – MRShare γ-MultiSplitJobs
Summary. We introduced MRShare, a framework for automatic work sharing in Map Reduce. We identified sharing primitives and demonstrated their implementation in a Map Reduce engine. We established a cost model and solved several work-sharing optimization problems. We demonstrated substantial savings when using MRShare.
Thank you! Questions?

Mr Share 11 Sep 2010

  • 1. MRShare: Sharing Across Multiple Queries in MapReduce. Tomasz Nykiel (University of Toronto), Michalis Potamias (Boston University), Chaitanya Mishra (University of Toronto, currently Facebook), George Kollios (Boston University), Nick Koudas (University of Toronto)
  • 2.
  • 4. Time performance, σπ, efficiency
  • 5. MRShare – a sharing framework for Map Reduce. The MRShare framework: is inspired by sharing primitives from the relational domain; introduces a cost model for Map Reduce jobs; searches for the optimal sharing strategies; does not change the Map Reduce computational model.
  • 6. Outline: Introduction; Map Reduce recap; MRShare – Sharing primitives in Map-Reduce; MRShare – Cost-based approach to sharing; MRShare Evaluation; Summary
  • 7. Outline: Introduction; Map Reduce recap; MRShare – Sharing primitives in Map-Reduce; MRShare – Cost-based approach to sharing; MRShare Evaluation; Summary
  • 8. Map Reduce recap. [Figure: input (I) is read from HDFS by Map, sent over the network to Reduce, and the output is written to HDFS]
  • 9. Outline: Introduction; Map Reduce recap; MRShare – Sharing primitives in Map-Reduce; MRShare – Cost-based approach to sharing; MRShare Evaluation; Summary
  • 10. Sharing primitives – sharing scans. SQL: SELECT COUNT(*) FROM user GROUP BY hometown; SELECT AVG(age) FROM user GROUP BY hometown. [Figure: both Map Reduce jobs scan the same user records (e.g. id1, student, Toronto); the first Map emits pairs such as (Toronto, 1), the second emits pairs such as (Toronto, 17); each Reduce aggregates per hometown (Toronto, Montreal, Ottawa)]
  • 11. MRShare – sharing scans (map). [Figure: Input → Meta-map (Map 1, Map 2, Map 3, Map 4) → Map output]
  • 12. MRShare – sharing scans (reduce). [Figure: Meta-reduce (Reduce 1, Reduce 2, Reduce 3, Reduce 4)]
  • 13. Outline: Introduction; Map Reduce recap; MRShare – Sharing primitives in Map-Reduce (Sharing scans; Sharing intermediate data); MRShare – Cost-based approach to sharing; MRShare Evaluation; Summary
  • 14. Sharing primitives – sharing intermediate data. SQL: SELECT COUNT(*) FROM user WHERE occupation = 'student' GROUP BY hometown; SELECT COUNT(*) FROM user WHERE age > 18 GROUP BY hometown. [Figure: both Map Reduce jobs scan the same user records (e.g. id1, student, Toronto); one Map tests occupation = 'student', the other tests age > 18; both emit pairs such as (Toronto, 1); each Reduce counts per hometown (Toronto, Ottawa, Montreal)]
  • 15. MRShare – sharing intermediate data (map). [Figure: Input → Meta-map (Map 1, Map 2, Map 3, Map 4) → Map output]
  • 16. MRShare – sharing intermediate data (reduce). [Figure: Meta-reduce (Reduce 1, Reduce 2, Reduce 3, Reduce 4)]
  • 17. Outline: Introduction; Map Reduce recap; MRShare – Sharing primitives in Map-Reduce; MRShare – Cost-based approach to sharing (Cost model for finding the optimal sharing strategy; SplitJobs – cost-based algorithm for sharing scans; MultiSplitJobs – an improvement of SplitJobs; γ-MultiSplitJobs – the algorithm for sharing intermediate data); MRShare Evaluation; Summary
  • 18. Cost model for Map Reduce (single job). Phases: reading input, sorting intermediate data, copying, writing output. Reading – f(input size); Sorting – f(intermediate data size); Copying – f(intermediate data size); Writing – f(output size).
  • 19. Cost of executing a group of jobs. [Figure: jobs J1, J2 and J3 each go through Read, Sort, Copy, Write; the merged job J1+J2+J3 goes through a single Read, Sort, Copy, Write; the phases are marked as savings, potential savings, or potential costs]
  • 20.
  • 21. Outline: Introduction; Map Reduce recap; MRShare – Sharing primitives in Map-Reduce; MRShare – Cost-based approach to sharing (Cost model for finding the optimal sharing strategy; SplitJobs – cost-based algorithm for sharing scans; MultiSplitJobs – an improvement of SplitJobs; γ-MultiSplitJobs – the algorithm for sharing intermediate data); MRShare Evaluation; Summary
  • 22.
  • 23.
  • 24. Outline: Introduction; Map Reduce recap; MRShare – Sharing primitives in Map-Reduce; MRShare – Cost-based approach to sharing (Cost model for finding the optimal sharing strategy; SplitJobs – cost-based algorithm for sharing scans; MultiSplitJobs – an improvement of SplitJobs; γ-MultiSplitJobs – the algorithm for sharing intermediate data); MRShare Evaluation; Summary
  • 25. MultiSplitJobs – an improvement of SplitJobs. [Figure: jobs J1 to J8 are divided into groups G1, G2, G3 and G4 by repeated SplitJobs steps, together labelled MultiSplitJobs]
  • 26. Outline: Introduction; Map Reduce recap; MRShare – Sharing primitives in Map-Reduce; MRShare – Cost-based approach to sharing (Cost model for finding the optimal sharing strategy; SplitJobs – cost-based algorithm for sharing scans; MultiSplitJobs – an improvement of SplitJobs; γ-MultiSplitJobs – the algorithm for sharing intermediate data); MRShare Evaluation; Summary
  • 27. Sharing intermediate data – cost-based optimization. [Figure: jobs J1, J2 and J3 each go through Read, Sort, Copy; the merged job J1+J2+J3 goes through a single Read, Sort, Copy; the phases are marked as savings, potential savings, or potential costs or savings; the intermediate data of J1, J2 and J3 overlap] The sorting and copying costs change depending on the size of the intermediate data. We need to estimate the size of the intermediate data of all combinations of jobs. Prohibitive cost of maintaining statistics.
  • 28.
  • 29.
  • 30. Evaluation setup: 40 EC2 small-instance virtual machines; modified Hadoop engine; 30 GB text dataset consisting of blogs; multiple grep-wordcount queries (counts words matching a regular expression; allows for variable intermediate data sizes; a generic aggregation Map Reduce job).
  • 31. Evaluation goals. Sharing is not always beneficial: the 'GreedyShare' policy. How much can we save on sharing scans? MRShare – MultiSplitJobs evaluation. How much can we save on sharing intermediate data? MRShare – γ-MultiSplitJobs evaluation.
  • 32. Is sharing always beneficial? – the 'GreedyShare' policy
  • 33. How much do we save on sharing scans? – MRShare MultiSplitJobs
  • 34. How much do we save on sharing intermediate data? – MRShare γ-MultiSplitJobs
  • 35. Summary. We introduced MRShare – a framework for automatic work sharing in Map Reduce. We identified sharing primitives and demonstrated their implementation in a Map Reduce engine. We established a cost model and solved several work-sharing optimization problems. We demonstrated vast savings when using MRShare.
  • 37. Ongoing work – sharing expensive computation. Sharing across multiple Map Reduce jobs with expensive predicates. [Figure: Input → Meta-map (Map 1, Map 2, Map 3, Map 4)]
  • 38. Ongoing work – dynamic sharing. [Figure: progress of J1, J2 and the merged J1+J2 over time]
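
The meta-map and meta-reduce idea from the sharing-scans slides can be sketched in a few lines of Python. This is a single-process illustration only, not MRShare's modified Hadoop engine: the record layout, the sample data, and every function name (meta_map, meta_reduce, run_shared and the per-query mappers and reducers) are assumptions made for the example. Each record is read once, every query's map function is applied to it, and each emitted value is tagged with the index of the job that produced it so that the matching reduce function can be applied later.

```python
from collections import defaultdict

# Hypothetical user records: (id, occupation, hometown, age).
USERS = [
    ("id1", "student", "Toronto", 17),
    ("id2", "student", "Toronto", 19),
    ("id3", "engineer", "Ottawa", 23),
]


def map_count(record):
    """Map for: SELECT COUNT(*) FROM user GROUP BY hometown."""
    _, _, hometown, _ = record
    yield hometown, 1


def map_avg_age(record):
    """Map for: SELECT AVG(age) FROM user GROUP BY hometown."""
    _, _, hometown, age = record
    yield hometown, age


def reduce_count(values):
    return sum(values)


def reduce_avg(values):
    return sum(values) / len(values)


def meta_map(record, mappers):
    """Apply every job's map to one record, tagging values by job index."""
    for tag, mapper in enumerate(mappers):
        for key, value in mapper(record):
            yield key, (tag, value)


def meta_reduce(tagged_values, reducers):
    """Split the tagged values by job and apply each job's own reduce."""
    per_job = defaultdict(list)
    for tag, value in tagged_values:
        per_job[tag].append(value)
    return {tag: reducers[tag](vals) for tag, vals in per_job.items()}


def run_shared(records, mappers, reducers):
    """Run all jobs with a single scan over the input."""
    groups = defaultdict(list)
    for record in records:  # the input is read exactly once
        for key, tagged in meta_map(record, mappers):
            groups[key].append(tagged)
    return {key: meta_reduce(vals, reducers) for key, vals in groups.items()}


RESULT = run_shared(USERS, [map_count, map_avg_age], [reduce_count, reduce_avg])
```

In RESULT, each hometown maps to a dictionary keyed by job index, so job 0 holds the count and job 1 holds the average age. The sketch shares only the scan; the intermediate data of the two jobs is still emitted separately.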

Editor's Notes

  1. Talk about the different possible ways of arranging jobs into groups, and the question of which arrangement is optimal.
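
To make the question in this note concrete, here is a rough Python sketch that enumerates every way of grouping a small set of jobs and picks the cheapest one. The slides state only that each phase cost is some function f of a data size, so everything numeric here is an assumption: the constants, the job sizes, and in particular the superlinear sort term, which is chosen only so that merging jobs is not automatically a win. The search is plain brute force over all partitions, not the SplitJobs or MultiSplitJobs algorithm from the talk, and the names Job, group_cost, partitions and best_grouping are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    intermediate: float  # size of the job's map output (assumed known)
    output: float        # size of the job's final output


# Assumed constants; all jobs are taken to scan the same input file.
INPUT_SIZE = 100.0
C_READ, C_SORT, C_COPY, C_WRITE = 1.0, 0.1, 0.5, 1.0


def group_cost(jobs):
    """Cost of running a group of jobs as one merged job (shared scan)."""
    inter = sum(j.intermediate for j in jobs)
    out = sum(j.output for j in jobs)
    read = C_READ * INPUT_SIZE                       # input is read once
    sort = C_SORT * inter * math.log2(1.0 + inter)   # assumed superlinear
    copy = C_COPY * inter
    write = C_WRITE * out
    return read + sort + copy + write


def total_cost(partition):
    return sum(group_cost(group) for group in partition)


def partitions(items):
    """Yield every partition of a list into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Put `first` into each existing group in turn...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ...or into a new group of its own.
        yield [[first]] + part


def best_grouping(jobs):
    """Brute-force search; only feasible for a handful of jobs."""
    return min(partitions(list(jobs)), key=total_cost)


JOBS = [Job("J1", 10.0, 1.0), Job("J2", 400.0, 5.0), Job("J3", 20.0, 2.0)]
BEST = best_grouping(JOBS)
```

The number of partitions grows very quickly with the number of jobs, which is why exhaustive search like this is only useful as an illustration of the problem.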