Map-Side Merge Joins for Scalable SPARQL BGP Processing

Map-Side Merge Joins for
Scalable SPARQL BGP Processing
IEEE CloudCom 2013
Bristol, UK
Martin Przyjaciel-Zablocki
Alexander Schätzle
Eduard Skaley
Thomas Hornung
Georg Lausen
University of Freiburg
Databases & Information Systems
December 4th, 2013

Map-Side Merge Joins for Scalable SPARQL BGP Processing 2
Motivation
Front-End
Architecture
Storage
Back-End
Architecture

Motivation
Front-End
Architecture
Storage
Back-End
Architecture
Online
 Real-time (~ ms), interactive (~ s)
 Low-latency
 Fast, simple queries
 Accessing precomputed (offline) data
 Avoid Joins (costly)

Motivation
Front-End
Architecture
Storage
Back-End
Architecture
OnlineOffline
 Real-time (~ ms), interactive (~ s)
 Low-latency
 Fast, simple queries
 Accessing precomputed (offline) data
 Avoid Joins (costly)
 Batch processing (≥ h)
 Scalability, Fault Tolerance
 Long-running, complex queries
 Joins between large datasets common

 Used in many DBMS due to stable & efficient performance
 What we present in our paper:
◦ Sort-Merge Join for RDF data with MapReduce
◦ Support for n-way joins
◦ Support for cascaded execution
Sort-Merge Join
L
a a
a b
b a
b b
c a
c b
...
sortedbyjoinkey
R
sortedbyjoinkey
L R
a b
a c
b a
b c
c c
c d
...

 Approach:
◦ Divide Join task into N independent subtasks
◦ Each subtask is processed by exactly one machine
Sort-Merge Joins with MapReduce
L1
[a … c)
L2
[c … e)
…
LN
[x … z]
sortedbyjoinkey
R1
[a … c)
R2
[c … e)
…
RN
[x … z]
sortedbyjoinkey
L1 R1
L2 R2
LN RN

 Challenges:
◦ Datasets must be sorted according to the join key
L1
[a … c)
L2
[c … e)
…
LN
[x … z]
sortedbyjoinkey
R1
[a … c)
R2
[c … e)
…
RN
[x … z]
sortedbyjoinkey
L1 R1
L2 R2
LN RN

 Challenges:
◦ Datasets must be split into N partitions
L1
[a … c)
L2
[c … e)
…
LN
[x … z]
sortedbyjoinkey
R1
[a … c)
R2
[c … e)
…
RN
[x … z]
sortedbyjoinkey
L1 R1
L2 R2
LN RN

 Challenges:
◦ Same join key values must be processed on the same machine
 i.e. they must be in the same partition
L1
L2
[c … e)
…
LN
[x … z]
sortedbyjoinkey
R1
R2
[c … e)
…
RN
[x … z]
sortedbyjoinkey
L1 R1
L2 R2
LN RN
[a … c) [a … c)

 Challenges:
◦ Same join key values must be processed on the same machine
 i.e. they must be in the same partition
◦ Cascaded joins: Output must be sorted & partitioned according to
the requirements of the following join
L1
[a … c)
L2
[c … e)
…
LN
[x … z]
sortedbyjoinkey
R1
[a … c)
R2
[c … e)
…
RN
[x … z]
sortedbyjoinkey
L1 R1
L2 R2
LN RN

 Main Contributions:
(1) RDF data store that guarantees that one side is already presorted
and partitioned  we only have to sort & split the output
(2) Merge Join is processed in the Map phase
 Reduce phase used to postprocess the output for the next join
(3) Dynamic Bloom Filter to remove dangling intermediate results
L1
[a … c)
L2
[c … e)
…
LN
[x … z]
sortedbyjoinkey
R1
[a … c)
R2
[c … e)
…
RN
[x … z]
sortedbyjoinkey
L1 R1
L2 R2
LN RN

 Linear scaling of query runtimes
 Average performance benefit of Merge Joins compared to the
other approaches: 15 – 48 %
Experiments

Map-Side Merge Joins for Scalable SPARQL BGP Processing
Thank you
13
Contact information:
Martin Przyjaciel-Zablocki
Alexander Schätzle
zablocki|schaetzle@informatik.uni-freiburg.de
University of Freiburg

Discussion Slides
1. RDF Data Store Layout
2. Map-Side Merge Join
3. Experiments

RDF Data Store Layout (1)

Map-Side Merge Join (1)

Map-Side Merge Join (2)

Experiments (1)

Experiments (2)

Experiments (3)

Experiments (4)

Map-Side Merge Joins for Scalable SPARQL BGP Processing

More Related Content

What's hot

Similar to Map-Side Merge Joins for Scalable SPARQL BGP Processing

Recently uploaded

Map-Side Merge Joins for Scalable SPARQL BGP Processing