Efficient processing of Rank-aware queries in Map/Reduce

Through an experimental evaluation of three different algorithms, this presentation aims to show the disadvantages of the default Map/Reduce programming model for Top-K queries, along with a recommended solution for processing such queries efficiently. It addresses two major shortcomings of the default model: the lack of Early Termination and the lack of Load Balancing. The solution has been implemented in code.

EFFICIENT PROCESSING OF RANK-AWARE QUERIES IN MAP/REDUCE
OIKONOMAKIS SPYRIDON
SOFTWARE ENGINEER AT PEOPLEPERHOUR

Need for a new model
• Exponential data growth
• Need to analyze, use and scale to ever larger volumes of data
• Need for parallel processing
• Need to reduce data read and retrieval time
• Need for programmer convenience
• Cost

What is Map/Reduce?
A programming model and runtime environment for distributed data processing that runs in parallel on large clusters of machines
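To make the model concrete, here is a minimal single-process sketch of the map, shuffle and reduce phases. It is not Hadoop code; the names `run_map_reduce`, `emit_key_and_one` and `count_values` are hypothetical and chosen only for illustration, and the records are assumed to be (join key, score) pairs.

```python
from collections import defaultdict

def run_map_reduce(records, mapper, reducer):
    """Run a single-process imitation of the map, shuffle and reduce phases."""
    groups = defaultdict(list)
    for record in records:                 # map phase
        for key, value in mapper(record):
            groups[key].append(value)      # shuffle: group values by key
    # reduce phase: one reducer call per key
    return {key: reducer(key, values) for key, values in groups.items()}

def emit_key_and_one(record):
    """Mapper: emit (join key, 1) for every record."""
    join_key, _score = record
    yield join_key, 1

def count_values(_key, values):
    """Reducer: count how many records share the key."""
    return sum(values)

# Hypothetical (join_key, score) records
records = [(1, 90), (2, 80), (1, 30), (3, 10)]
counts = run_map_reduce(records, emit_key_and_one, count_values)
print(counts)
```

On a real cluster the map and reduce calls run on different machines, and the grouping step is the network shuffle between them.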

Weaknesses in Top-K Join Queries
What is the Top-K Join?
Weaknesses
• All the data must be read to retrieve only K results
• Workload is distributed unevenly across Reducers
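A Top-K join returns only the K join results with the highest combined score. The first weakness can be illustrated with a naive in-memory sketch; it assumes each record is a (join key, score) pair and that the combined score is the sum of the two scores, neither of which is stated in the slides. The function name `naive_top_k_join` is hypothetical.

```python
import heapq
from collections import defaultdict

def naive_top_k_join(left, right, k):
    """Join two lists of (join_key, score) records and return the k best results.

    Assumption: the combined score of a join result is the sum of both scores.
    """
    right_by_key = defaultdict(list)
    for key, score in right:              # every right record is read
        right_by_key[key].append(score)

    joined = []
    for key, left_score in left:          # every left record is read
        for right_score in right_by_key.get(key, ()):
            joined.append((left_score + right_score, key))

    # Only k results are kept, although all join results were produced
    return heapq.nlargest(k, joined)

left = [(1, 90), (2, 80), (1, 30), (3, 10)]
right = [(1, 70), (2, 95), (3, 20)]
top2 = naive_top_k_join(left, right, 2)
print(top2)
```

Both inputs are scanned in full and every join result is materialized, even though only two results are returned.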

Goals of the experiment
• Implement Top-K Join queries efficiently in the Map/Reduce model
• Address the weaknesses of Map/Reduce with:
  • Early Termination
  • Load Balancing

Design
• Comparison of three algorithms (1 default and 2 new)
  • Naive
  • EarlyTermination (using bounds)
  • EarlyTermination & LoadBalancing (using bounds and Longest Processing Time)
• Pre-processing
  • Generation of two data tables with join attributes
  • Statistics on the data in the form of histograms
• Processing
  • Calculation of bounds from the histograms of each table
  • Map/Reduce run
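The slides do not say how the bounds are derived from the histograms, so the following is only one possible sketch. It assumes that each histogram bin covers a score range and stores a count per join key, that the combined score is the sum of the two scores, and that `max_score` is divisible by the number of bins. The names `build_histogram` and `score_bound` are hypothetical.

```python
from collections import Counter

def build_histogram(records, num_bins, max_score):
    """Return one Counter of join keys per score bin (bin 0 holds the lowest scores)."""
    width = max_score // num_bins
    bins = [Counter() for _ in range(num_bins)]
    for key, score in records:
        index = min(score // width, num_bins - 1)
        bins[index][key] += 1
    return bins

def score_bound(hist_left, hist_right, k, max_score):
    """Return a score below which a record cannot take part in a top-k result."""
    num_bins = len(hist_left)
    width = max_score // num_bins
    seen_left, seen_right = Counter(), Counter()
    for cut in range(num_bins - 1, -1, -1):       # walk from the best bin down
        seen_left += hist_left[cut]
        seen_right += hist_right[cut]
        # Join results formed only by records in bins >= cut
        guaranteed = sum(n * seen_right[key] for key, n in seen_left.items())
        if guaranteed >= k:
            # Each of those results scores at least 2 * cut * width
            threshold = 2 * cut * width
            # A record needs score + max_score >= threshold to compete
            return max(0, threshold - max_score)
    return 0

left = [(1, 95), (2, 91), (1, 40), (3, 5)]
right = [(1, 99), (2, 93), (3, 10)]
hist_left = build_histogram(left, 10, 100)
hist_right = build_histogram(right, 10, 100)
bound = score_bound(hist_left, hist_right, k=2, max_score=100)
print(bound)
```

In this example the top bin of each table already guarantees two join results with a combined score of at least 180, so any record scoring below 80 can be skipped without changing the top 2.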

Early Termination
[Diagram. Labels: HDFS (Generated Sorted Data, Histograms), EarlyTermInputFormat, EarlyTermRecordReader (Check Bounds), Send Data, Mapper, Send Data, Reducers, Process]
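The role of the record reader in this diagram can be sketched as a generator that stops reading once a score falls below the bound. This is a plain Python illustration, not the Hadoop `RecordReader` API; it assumes the input is sorted by score in descending order, and the function name `early_term_reader` is hypothetical.

```python
def early_term_reader(sorted_records, bound):
    """Yield (key, score) records until the score drops below the bound.

    Assumes the input is sorted by score in descending order, so the first
    record below the bound means every later record is below it too.
    """
    for key, score in sorted_records:
        if score < bound:
            break                     # early termination: stop reading
        yield key, score

data = [(1, 95), (2, 91), (1, 40), (3, 5)]
source = iter(data)
kept = list(early_term_reader(source, 80))
unread = list(source)                 # records the reader never touched
print(kept, unread)
```

The reader has to look at one record below the bound to detect the cut-off; everything after it stays unread, which is where the savings come from on large sorted inputs.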

Early Termination & Load Balancing
[Diagram. Labels: HDFS (Generated Sorted Data, Histograms), EarlyTermInputFormat, EarlyTermRecordReader (Check Bounds), Send Data, Mapper, Send Data, CustomPartitioner, Reducer, Reducer, Reducer]
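The Design slide names Longest Processing Time (LPT) as the load-balancing rule. A minimal sketch of LPT follows: join keys are sorted by estimated load, largest first, and each key goes to the reducer with the smallest load so far. How the load per key is estimated is not given in the slides; the values below are made up, and the function name `lpt_assign` is hypothetical.

```python
import heapq

def lpt_assign(load_per_key, num_reducers):
    """Map each join key to a reducer using Longest Processing Time first."""
    # Min-heap of (total load so far, reducer id)
    heap = [(0, reducer) for reducer in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    # Heaviest keys first; ties broken by key for a deterministic result
    for key, load in sorted(load_per_key.items(), key=lambda kv: (-kv[1], kv[0])):
        total, reducer = heapq.heappop(heap)      # least loaded reducer
        assignment[key] = reducer
        heapq.heappush(heap, (total + load, reducer))
    return assignment

# Hypothetical load estimates per join key
loads = {"a": 50, "b": 30, "c": 20, "d": 20, "e": 10}
assignment = lpt_assign(loads, 2)

totals = [0, 0]
for key, reducer in assignment.items():
    totals[reducer] += loads[key]
print(assignment, totals)
```

A custom partitioner could look up such a precomputed assignment instead of hashing the key, so that a few heavy keys do not end up on the same reducer.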

Experiment (1)
Parameter: Value
Data distribution: Zipfian
Number of records: 1,000,000 per table
Number of reducers: 10, 6
Number of results (K): 10
Data skew: 0, 0.5, 1
Number of joining attributes: 10
Maximum data value: 10,000
Sorting: by score
Histograms: 10 bins
Cluster: 8 machines
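A test table with these parameters could be generated roughly as follows. This is a sketch under assumptions the slides do not confirm: the skew is applied to the choice of join value (10 distinct values), scores are drawn uniformly up to the maximum value, and the table is sorted by score in descending order. The names `zipf_weights` and `generate_table` are hypothetical.

```python
import random

def zipf_weights(num_values, skew):
    """Return normalized Zipfian weights; skew 0 gives a uniform distribution."""
    raw = [1.0 / (rank ** skew) for rank in range(1, num_values + 1)]
    total = sum(raw)
    return [weight / total for weight in raw]

def generate_table(num_records, num_values, skew, max_score, seed=0):
    """Generate (join_key, score) records sorted by score, highest first."""
    rng = random.Random(seed)
    keys = rng.choices(range(num_values),
                       weights=zipf_weights(num_values, skew),
                       k=num_records)
    records = [(key, rng.randint(0, max_score)) for key in keys]
    records.sort(key=lambda record: record[1], reverse=True)
    return records

# Small table for illustration; the experiment used 1,000,000 records per table
table = generate_table(num_records=1000, num_values=10, skew=1, max_score=10000)
print(table[:3])
```

With skew 0 every join value is equally likely; as the skew grows, the first few values receive most of the records, which is what makes the reducer workload uneven.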

Experiment Part – Comparison of algorithms (2)
[Chart: running time (axis from 0:00:00 to 0:50:24) by skew (0, 0.5, 1), REDUCERS = 10. Series: Naive, Early Termination, Early Termination & Load Balancing]

Experiment Part – Comparison of algorithms (3)
[Chart: number of records (axis from 0 to 2,500,000) by skew (0, 0.5, 1), REDUCERS = 10. Series: Naive, Early Termination, Early Termination & Load Balancing]

Experiment Part – Comparison of algorithms (4)
[Chart: running time (axis from 0:00:00 to 0:17:17) by number of reducers (6, 10), labeled REDUCERS = 6. Series: Early Termination, Early Termination & Load Balancing]

Conclusion
With the proposed techniques:
• Early Termination
• Load Balancing
it is possible to implement rank-aware (Top-K) queries in Map/Reduce efficiently and to overcome the disadvantages of the Map/Reduce model

4. Is the Map/Reduce model reliable?

5. Map/Reduce

9. Design(2)

17. Questions? Thank you.
