Map-Side Merge Joins for
Scalable SPARQL BGP Processing
IEEE CloudCom 2013
Bristol, UK
Martin Przyjaciel-Zablocki
Alexander Schätzle
Eduard Skaley
Thomas Hornung
Georg Lausen
University of Freiburg
Databases & Information Systems
December 4th, 2013
Motivation
[Diagram: front-end architecture, storage, back-end architecture]
Online
 Real-time (~ ms), interactive (~ s)
 Low-latency
 Fast, simple queries
 Accessing precomputed (offline) data
 Avoid Joins (costly)
Offline
 Batch processing (≥ h)
 Scalability, Fault Tolerance
 Long-running, complex queries
 Joins between large datasets common
Sort-Merge Join
 Used in many DBMS due to stable & efficient performance
 What we present in our paper:
◦ Sort-Merge Join for RDF data with MapReduce
◦ Support for n-way joins
◦ Support for cascaded execution
[Figure: two relations L and R, each sorted by join key, are merged into L ⋈ R. Sample rows shown: first table (a, a), (a, b), (b, a), (b, b), (c, a), (c, b), ...; second table (a, b), (a, c), (b, a), (b, c), (c, c), (c, d), ...]
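The merge step on this slide can be sketched in a few lines of Python. This is only an illustration of a generic sort-merge join, not the authors' MapReduce implementation. It assumes that the first column of each sample table is the join key and that the second sample table is R; the name `merge_join` is hypothetical.

```python
from typing import Iterator, List, Tuple

Row = Tuple[str, str]  # (join key, payload); assumed layout, not from the slides


def merge_join(left: List[Row], right: List[Row]) -> Iterator[Tuple[str, str, str]]:
    """Join two lists that are already sorted by their first field."""
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1          # left key has no partner yet, advance left
        elif lkey > rkey:
            j += 1          # right key has no partner yet, advance right
        else:
            # Find the end of the run of equal keys on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lkey:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            # Emit the cross product of the two runs.
            for _, lval in left[i:i_end]:
                for _, rval in right[j:j_end]:
                    yield (lkey, lval, rval)
            i, j = i_end, j_end


# Sample rows taken from the slide; reading them as (key, value) is an assumption.
L = [("a", "a"), ("a", "b"), ("b", "a"), ("b", "b"), ("c", "a"), ("c", "b")]
R = [("a", "b"), ("a", "c"), ("b", "a"), ("b", "c"), ("c", "c"), ("c", "d")]
result = list(merge_join(L, R))
```

Because both inputs are sorted, each side is scanned once and only the current run of equal keys has to be held in memory, which is why the approach depends on presorted input.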
Sort-Merge Joins with MapReduce
 Approach:
◦ Divide Join task into N independent subtasks
◦ Each subtask is processed by exactly one machine
[Figure: L and R, both sorted by join key, are each split into N key ranges: L1/R1 = [a … c), L2/R2 = [c … e), …, LN/RN = [x … z]. Each pair is joined separately: L1 ⋈ R1, L2 ⋈ R2, …, LN ⋈ RN.]
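The idea of splitting the join into N independent subtasks can be illustrated with a small Python sketch. It assumes range partitioning on the join key with half-open ranges as in the figure; the helper names `partition_by_range` and `join_partition`, the boundaries, and the sample rows are hypothetical. The list comprehension stands in for N machines working in parallel.

```python
from bisect import bisect_right
from typing import List, Tuple

Row = Tuple[str, str]            # (join key, payload)
Joined = Tuple[str, str, str]    # (join key, left payload, right payload)


def partition_by_range(rows: List[Row], boundaries: List[str]) -> List[List[Row]]:
    """Split rows into len(boundaries) + 1 key ranges.

    Partition k holds the keys in [boundaries[k-1], boundaries[k]).
    Input order is preserved, so sorted input yields sorted partitions.
    """
    parts: List[List[Row]] = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect_right(boundaries, row[0])].append(row)
    return parts


def join_partition(left: List[Row], right: List[Row]) -> List[Joined]:
    """Merge join of one partition pair; both inputs are sorted by key."""
    out: List[Joined] = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            key = left[i][0]
            j_start = j
            # Pair the current left row with every right row of the same key.
            while j < len(right) and right[j][0] == key:
                out.append((key, left[i][1], right[j][1]))
                j += 1
            i += 1
            # If the next left row has the same key, rewind the right side.
            if i < len(left) and left[i][0] == key:
                j = j_start
    return out


L = [("a", "1"), ("b", "2"), ("c", "3"), ("d", "4"), ("x", "5")]
R = [("a", "p"), ("c", "q"), ("c", "r"), ("y", "s")]
boundaries = ["c", "e"]          # ranges: [.. c), [c .. e), [e ..]

l_parts = partition_by_range(L, boundaries)
r_parts = partition_by_range(R, boundaries)

# Each (Li, Ri) pair is an independent subtask.
per_partition = [join_partition(l, r) for l, r in zip(l_parts, r_parts)]
combined = [row for part in per_partition for row in part]
```

Since both sides use the same boundaries, equal join keys always land in partitions with the same index, so concatenating the per-partition results gives the same output as joining the unpartitioned inputs.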
 Challenges:
◦ Datasets must be sorted according to the join key
◦ Datasets must be split into N partitions
◦ Same join key values must be processed on the same machine
 i.e. they must be in the same partition
◦ Cascaded joins: Output must be sorted & partitioned according to
the requirements of the following join
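The cascaded-join requirement can be made concrete with a short Python sketch: the output of one join, produced in the order of the old join key, is redistributed and re-sorted on the key of the following join. This is a simplified, single-process stand-in for a shuffle and sort step; the function name `repartition_for_next_join`, the tuple layout, and the boundaries are assumptions made for illustration.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple

JoinedRow = Tuple[str, ...]


def repartition_for_next_join(
    rows: Sequence[JoinedRow], key_index: int, boundaries: List[str]
) -> List[List[JoinedRow]]:
    """Partition rows by the next join key and sort each partition by it."""
    parts: List[List[JoinedRow]] = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        # Same boundaries as the presorted side of the next join (assumed).
        parts[bisect_right(boundaries, row[key_index])].append(row)
    for part in parts:
        part.sort(key=lambda row: row[key_index])
    return parts


# Output of a first join, ordered by column 0 (the old join key).
joined = [("a", "k", "u"), ("a", "c", "v"), ("b", "d", "w"), ("c", "a", "z")]

# The following join uses column 1 as its key.
next_parts = repartition_for_next_join(joined, key_index=1, boundaries=["c", "e"])
```

After this step every partition is sorted by the new key and covers the same key range as its counterpart on the other side, so the next merge join can again run partition by partition.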
 Main Contributions:
(1) RDF data store that guarantees that one side is already presorted and partitioned → we only have to sort & split the output
(2) Merge Join is processed in the Map phase → Reduce phase used to postprocess the output for the next join
(3) Dynamic Bloom Filter to remove dangling intermediate results
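To illustrate contribution (3), the following Python sketch shows how a Bloom filter can discard intermediate rows whose key has no partner in the next join. It is a plain fixed-size Bloom filter, not the dynamic variant named on the slide, and the slides do not say how the filter is built or where it is applied; the class `BloomFilter`, its parameters, and the sample data are assumptions.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Fixed-size Bloom filter (illustrative; not the dynamic variant)."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item: str) -> Iterator[int]:
        # Derive num_hashes positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))


# Keys that occur on the other side of the following join (hypothetical data).
next_join_keys = ["a", "b"]
bf = BloomFilter()
for key in next_join_keys:
    bf.add(key)

# Intermediate results keyed by the next join key; "q" is a dangling candidate.
intermediate = [("a", "x1"), ("b", "x2"), ("q", "x3")]
kept = [row for row in intermediate if bf.might_contain(row[0])]
```

A Bloom filter never drops a row that has a partner, but it may keep some rows that do not (false positives), so it reduces the intermediate data without affecting the correctness of the final join.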
Experiments
 Linear scaling of query runtimes
 Average performance benefit of Merge Joins compared to the other approaches: 15 – 48 %
Thank you
Contact information:
Martin Przyjaciel-Zablocki
Alexander Schätzle
zablocki|schaetzle@informatik.uni-freiburg.de
University of Freiburg
Discussion Slides
1. RDF Data Store Layout
2. Map-Side Merge Join
3. Experiments
RDF Data Store Layout (1)
RDF Data Store Layout (2)
RDF Data Store Layout (3)
Map-Side Merge Join (1)
Map-Side Merge Join (2)
Experiments (1)
Experiments (2)
Experiments (3)
Experiments (4)