EFFICIENT PROCESSING OF RANK-AWARE 
QUERIES IN MAP/REDUCE 
OIKONOMAKIS SPYRIDON 
SOFTWARE ENGINEER AT PEOPLEPERHOUR
Need for a new model
• Exponential data growth
• Need to analyze, use, and scale with ever-growing volumes of data
• Need for parallel processing
• Need to reduce data read and retrieval time
• Need for programmer convenience
• Cost
What is Map/Reduce?
A distributed data-processing programming model
and runtime environment that processes data in
parallel on large clusters of machines
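As a rough illustration of the programming model, the following single-process Python sketch imitates the map, shuffle, and reduce steps with a word count. The helper names (`map_phase`, `shuffle`, `reduce_phase`) are hypothetical and are not part of Hadoop or of the presented work; a real job would run these steps across many machines.

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Apply map_fn to every input record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs

def shuffle(pairs):
    """Group values by key, which the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    """Call reduce_fn once per key with all values collected for that key."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}

def count_words(line):
    # Map function: emit (word, 1) for every word in the line.
    return [(word, 1) for word in line.split()]

def sum_counts(key, values):
    # Reduce function: add up the counts emitted for one word.
    return sum(values)

lines = ["map reduce map", "reduce map"]
result = reduce_phase(shuffle(map_phase(lines, count_words)), sum_counts)
print(result)
```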
Is the Map/Reduce model reliable?
Map/Reduce
Weaknesses in Top-K Join Queries
What is a Top-K Join?
Weaknesses
• All the data is read in order to retrieve only K results
• Uneven distribution of workload across Reducers
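To make the first weakness concrete, here is a minimal sketch of a naive Top-K join in Python. It assumes each table holds (join key, score) tuples and that a joined pair is ranked by the sum of its two scores; the slides do not state the scoring function, so that choice and the function name `naive_top_k_join` are illustrative assumptions.

```python
import heapq
from collections import defaultdict

def naive_top_k_join(table_a, table_b, k):
    """Join on key, rank each joined pair by the sum of its scores, keep the K best.

    Every tuple of both tables is read, even though only K results are returned.
    """
    scores_by_key = defaultdict(list)
    for key, score in table_b:
        scores_by_key[key].append(score)
    joined = (
        (score_a + score_b, key)
        for key, score_a in table_a
        for score_b in scores_by_key.get(key, [])
    )
    return heapq.nlargest(k, joined)

table_a = [("x", 9), ("y", 7), ("x", 2)]
table_b = [("x", 8), ("y", 5), ("z", 10)]
top2 = naive_top_k_join(table_a, table_b, 2)
print(top2)
```

The tuple ("z", 10) has the highest score in its table but no join partner, which is why score alone cannot decide what to read.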
Goals of the experiment
• Implement Top-K Join queries efficiently in the
Map/Reduce model
• Address the weaknesses shown in Map/Reduce with:
  • Early Termination
  • Load Balancing
Design
• Comparison of three algorithms (1 default and 2 new)
  • Naive
  • EarlyTermination (using bounds)
  • EarlyTermination & LoadBalancing (using bounds and Longest Processing Time)
• Preprocessing
  • Production of two data tables with Join attributes
  • Statistics for the data in the form of histograms
• Processing
  • Calculating bounds from the histograms for each table
  • Run Map/Reduce
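One possible way to turn histograms into per-table bounds is sketched below. It is a simplified illustration under stated assumptions, not the method used in the experiment: scores are non-negative, a joined pair is ranked by the sum of its scores, and each equi-width score bin stores a count per join key. All function names are hypothetical.

```python
from collections import Counter

def build_histogram(table, num_bins, max_score):
    """Equi-width score histogram; each bin is (low edge, high edge, per-key counts)."""
    width = max_score / num_bins
    counts = [Counter() for _ in range(num_bins)]
    for key, score in table:
        index = min(int(score / width), num_bins - 1)  # top score goes in the last bin
        counts[index][key] += 1
    return [(i * width, (i + 1) * width, counts[i]) for i in range(num_bins)]

def score_threshold(hist_a, hist_b, k):
    """Largest sum of bin low edges for which at least k join results are guaranteed."""
    pairs = sorted(
        ((lo_a + lo_b, counts_a, counts_b)
         for lo_a, _, counts_a in hist_a
         for lo_b, _, counts_b in hist_b),
        key=lambda pair: pair[0],
        reverse=True,
    )
    guaranteed = 0
    for lower_sum, counts_a, counts_b in pairs:
        # Tuples sharing a join key in these two bins all join with each other.
        guaranteed += sum(n * counts_b[key] for key, n in counts_a.items())
        if guaranteed >= k:
            return lower_sum
    return 0.0  # fewer than k join results exist, so nothing can be skipped

def top_edge(hist):
    """Upper edge of the highest non-empty bin, an upper limit on any score."""
    return max((hi for _, hi, counts in hist if counts), default=0.0)

def table_bounds(hist_a, hist_b, k):
    """Scores below these bounds cannot appear in the top-k (assumed sum scoring)."""
    tau = score_threshold(hist_a, hist_b, k)
    return max(tau - top_edge(hist_b), 0.0), max(tau - top_edge(hist_a), 0.0)

table_a = [("x", 9), ("y", 7), ("x", 2)]
table_b = [("x", 8), ("y", 5), ("z", 10)]
hist_a = build_histogram(table_a, num_bins=5, max_score=10)
hist_b = build_histogram(table_b, num_bins=5, max_score=10)
print(table_bounds(hist_a, hist_b, k=1))
```

The reasoning is that a tuple can only reach the threshold if its own score plus the best possible partner score is at least that threshold, so anything lower can be skipped without being read.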
Design(2)
Early Termination
[Diagram. Components: HDFS (generated sorted data, histograms), EarlyTermInputFormat, EarlyTermRecordReader, Mapper, Reducers. Step labels: Check Bounds, Send Data, Send Data, Process.]
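The idea behind reading with early termination can be sketched in a few lines of Python. This is not the Hadoop `EarlyTermRecordReader` from the diagram; it is a hypothetical stand-in that assumes the input is sorted by descending score and that a per-table score bound has already been derived from the histograms.

```python
def early_term_reader(sorted_records, bound):
    """Yield (key, score) records from score-descending input and stop at the bound."""
    for key, score in sorted_records:
        if score < bound:
            break  # every later record scores even lower, so reading can stop
        yield key, score

records = [("x", 9), ("y", 7), ("z", 6), ("x", 2), ("y", 1)]
kept = list(early_term_reader(records, bound=6))
print(kept)
```

Because the input is sorted, the first record below the bound ends the scan, so the remaining records are never read or sent to the Mapper.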
Early Termination & Load Balancing
[Diagram. Components: HDFS (generated sorted data, histograms), EarlyTermInputFormat, EarlyTermRecordReader, Mapper, CustomPartitioner, three Reducers. Step labels: Check Bounds, Send Data, Send Data.]
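The Longest Processing Time rule named in the design can be sketched as a greedy assignment of join keys to reducers. The per-key load estimates below are made-up numbers, and `lpt_assign` is a hypothetical helper, not the `CustomPartitioner` shown in the diagram.

```python
import heapq

def lpt_assign(key_loads, num_reducers):
    """Longest Processing Time first: heaviest keys first, each to the least-loaded reducer."""
    heap = [(0, reducer) for reducer in range(num_reducers)]  # (current load, reducer id)
    heapq.heapify(heap)
    assignment = {}
    for key, load in sorted(key_loads.items(), key=lambda kv: kv[1], reverse=True):
        total, reducer = heapq.heappop(heap)  # reducer with the smallest load so far
        assignment[key] = reducer
        heapq.heappush(heap, (total + load, reducer))
    return assignment

key_loads = {"a": 7, "b": 5, "c": 4, "d": 3, "e": 1}  # illustrative estimated work per join key
assignment = lpt_assign(key_loads, 2)

reducer_loads = [0, 0]
for key, reducer in assignment.items():
    reducer_loads[reducer] += key_loads[key]
print(assignment, reducer_loads)
```

A partitioner built on such a table would send each join key to its assigned reducer instead of hashing the key, which is what evens out the work when some keys are far more frequent than others.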
Experiment (1)
Parameter: Value
Data distribution: Zipfian
Number of records: 1,000,000 per table
Number of reducers: 10, 6
K (number of results): 10
Data skew: 0, 0.5, 1
Number of Joining Attributes: 10
Max value for data: 10000
Sorting: By score
Histograms: 10 bins
Cluster: 8 machines
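A test table with these parameters could be generated roughly as follows. The sketch makes several assumptions the slides do not confirm: that the Zipfian skew applies to the join values, that "Number of Joining Attributes: 10" means ten distinct join values, that scores are uniform integers up to the maximum value, and that sorting is by descending score. The function names are hypothetical.

```python
import random

def zipf_weights(num_values, skew):
    """Weight of the value with rank r is 1 / r**skew; skew 0 gives a uniform distribution."""
    return [1.0 / (rank ** skew) for rank in range(1, num_values + 1)]

def generate_table(num_rows, num_join_values, max_score, skew, seed=0):
    """Return (join value, score) rows sorted by descending score."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    keys = rng.choices(
        range(num_join_values),
        weights=zipf_weights(num_join_values, skew),
        k=num_rows,
    )
    rows = [(key, rng.randint(0, max_score)) for key in keys]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows

# A small sample; the experiment used 1,000,000 records per table.
sample = generate_table(num_rows=1000, num_join_values=10, max_score=10000, skew=1.0)
print(sample[:3])
```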
Experiment Part – Comparison of algorithms (2)
[Chart, REDUCERS = 10. X-axis: Skew (0, 0.5, 1). Y-axis: Running time (0:00:00 to 0:50:24). Series: Naive, Early Termination, Early Termination & Load Balancing.]
Experiment Part – Comparison of algorithms (3)
[Chart, REDUCERS = 10. X-axis: Skew (0, 0.5, 1). Y-axis: Number of records (0 to 2,500,000). Series: Naive, Early Termination, Early Termination & Load Balancing.]
Experiment Part – Comparison of algorithms (4)
[Chart, labeled REDUCERS = 6. X-axis: Number of Reducers (6, 10). Y-axis: Running time (0:00:00 to 0:17:17). Series: Early Termination, Early Termination & Load Balancing.]
Conclusion
By using the proposed techniques:
• Early Termination
• Load Balancing
it is possible to implement rank-aware (Top-K) queries
efficiently in Map/Reduce and to overcome the
disadvantages of the Map/Reduce model
Questions 
???? 
Thank you.
