Rya: Optimizations to Support Real
Time Graph Queries on Accumulo
Dr. Caleb Meier, Puja Valiyil, Aaron Mihalik, Dr. Adina Crainiceanu
DISTRIBUTION STATEMENT A. Approved for
public release; distribution is unlimited.
ONR Case Number 43-279-15 JB.01.2015
Acknowledgements
 This work is the collective effort of:
 Parsons’ Rya Team, sponsored by the Department of
the Navy, Office of Naval Research
 Rya Founders: Roshan Punnoose, Adina Crainiceanu,
and David Rapp
Overview
 Rya Overview
 Query Execution within Rya
 Query Optimizations
 Results
 Summary
Background: Rya and RDF
 Rya: Resource Description Framework (RDF)
Triplestore built on top of Accumulo
 RDF: W3C standard for representing
linked/graph data
 Represents data as statements (assertions) about
resources
– Serialized as triples in {subject, predicate, object}
form
– Example:
• {Caleb, worksAt, Parsons}
• {Caleb, livesIn, Virginia}
[Figure: graph in which the node Caleb is linked to Parsons by a worksAt edge and to Virginia by a livesIn edge]
Background: SPARQL
 RDF Queries are described using SPARQL
 SPARQL Protocol and RDF Query Language
 SQL-like syntax for finding triples matching
specific patterns
 Look for subgraphs that match triple statement patterns
 Joins are performed when there are variables common
to two or more statement patterns
SELECT ?people WHERE {
?people <worksAt> <Parsons>.
?people <livesIn> <Virginia>.
}
Rya Architecture
 Open RDF Interface for interacting with RDF data
stored on Accumulo
 Open RDF (Sesame): Open
Source Java framework for
storing and querying RDF
data
  Open RDF provides several
interfaces/abstractions
central to interacting with
an RDF datastore
– SAIL interface for interacting with underlying persisted
RDF model
– SAIL: Storage And Inference Layer
[Figure: architecture diagram with the components SPARQL, Rya Open RDF, Rya QueryPlanner, and Accumulo; annotations: "Query processing in SAIL layer" and "Data storage layer"]
Storage: Triple Table Index
 3 Tables
 SPO : subject, predicate, object
 POS : predicate, object, subject
 OSP : object, subject, predicate
 Store triples in the RowID of the table
 Store graph name in the Column Family
 Advantages:
  Native lexicographical sorting of row keys enables fast range queries
 All patterns can be translated into a scan of one of these tables
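The mapping from a triple pattern to one of the three tables can be sketched as follows. This is a minimal illustration, not Rya's code: the function name `choose_table` and the rule (pick the layout whose leading components are bound and give the longest row-key prefix) are assumptions that are merely consistent with the statement above.

```python
from typing import Optional, Tuple

def choose_table(s: Optional[str], p: Optional[str], o: Optional[str]) -> Tuple[str, Tuple[str, ...]]:
    """Return (table name, row-key prefix) for a triple pattern; None marks a variable."""
    bound = {"s": s, "p": p, "o": o}
    best = ("SPO", ())  # with nothing bound, fall back to a full scan of SPO
    for table in ("SPO", "POS", "OSP"):
        prefix = []
        for position in table.lower():
            if bound[position] is None:
                break  # the prefix ends at the first variable
            prefix.append(bound[position])
        if len(prefix) > len(best[1]):
            best = (table, tuple(prefix))
    return best

# ?x <worksAt> <Parsons> becomes a POS range scan over the prefix (worksAt, Parsons).
print(choose_table(None, "worksAt", "Parsons"))
```

Under this rule a pattern with only the subject bound goes to SPO, and a pattern with subject and object bound goes to OSP.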
Overview
 Rya Overview
 Query Execution within Rya
 Query Optimizations
 Results
 Summary
Rya Query Execution
  Implemented OpenRDF Sesame SAIL API
  Parse queries, generate initial query plan, execute plan
  Triple patterns map to range queries in Accumulo
SELECT ?x WHERE { ?x <worksAt> <Parsons>.
?x <livesIn> <Virginia>. }
Step 1: POS Table – scan range
…
worksAt, Netflix, Dan
worksAt, OfficeMax, Zack
worksAt, Parsons, Bob
worksAt, Parsons, Greta
worksAt, Parsons, John
…
Step 2: for each ?x, SPO – index lookup
…
Bob, livesIn, Georgia
…
Greta, livesIn, Virginia
…
John, livesIn, Virginia
…
More Complicated Example of Rya Query
Execution
SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <livesIn> Virginia.
?x <commuteMethod> bike.
}
Step 1: POS Table – scan range for worksAt, Parsons (?x worksAt Parsons)
…
worksAt, Netflix, Dan
worksAt, Parsons, Bob
worksAt, Parsons, Greta
worksAt, Parsons, John
worksAt, PlayStation, Alice
…
Step 2: For each ?x, SPO Table lookup (?x livesIn Virginia)
…
Bob, livesIn, Georgia
…
Greta, livesIn, Virginia
…
John, livesIn, Virginia
…
Step 3: For each remaining ?x, SPO Table lookup (?x commuteMethod bike)
…
Greta, commuteMethod, bike
…
John, commuteMethod, Bus
…
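The scan-then-lookup execution described above can be sketched in a few lines. This is a toy model under stated assumptions: sorted Python lists stand in for the Accumulo POS and SPO tables, `bisect` stands in for a range scan, and the names `scan_prefix` and `run_query` are hypothetical. The data is the sample data from the slides.

```python
from bisect import bisect_left

# Sample triples (subject, predicate, object) taken from the slides.
triples = [
    ("Dan", "worksAt", "Netflix"), ("Zack", "worksAt", "OfficeMax"),
    ("Bob", "worksAt", "Parsons"), ("Greta", "worksAt", "Parsons"),
    ("John", "worksAt", "Parsons"), ("Alice", "worksAt", "PlayStation"),
    ("Bob", "livesIn", "Georgia"), ("Greta", "livesIn", "Virginia"),
    ("John", "livesIn", "Virginia"),
    ("Greta", "commuteMethod", "bike"), ("John", "commuteMethod", "Bus"),
]
spo = sorted(triples)                            # rows keyed subject, predicate, object
pos = sorted((p, o, s) for s, p, o in triples)   # rows keyed predicate, object, subject

def scan_prefix(table, prefix):
    """Yield the rows of a sorted table whose leading components equal prefix."""
    i = bisect_left(table, prefix)
    while i < len(table) and table[i][:len(prefix)] == prefix:
        yield table[i]
        i += 1

def run_query(patterns):
    """patterns: list of (predicate, object) pairs that all share the subject ?x."""
    first, rest = patterns[0], patterns[1:]
    candidates = [row[2] for row in scan_prefix(pos, first)]   # Step 1: POS range scan
    for p, o in rest:                                          # later steps: SPO lookups
        candidates = [x for x in candidates if any(scan_prefix(spo, (x, p, o)))]
    return candidates

result = run_query([("worksAt", "Parsons"), ("livesIn", "Virginia"), ("commuteMethod", "bike")])
print(result)
```

Each later pattern costs one lookup per surviving candidate, which is why the order of the patterns matters so much.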
Challenges in Query Execution
 Scalability and Responsiveness
 Massive amounts of data
  Potentially large numbers of comparisons
 Consider the Previous Example:
 Default query execution: comparing each “?x” returned from first
statement pattern query to all subsequent triple patterns
 There are 8.3 million Virginia residents, about 15,000 Parsons
employees, and 750,000 people who commute via bike.
 Only 100 people who work at Parsons commute via bike while 1000
people who work at Parsons live in Virginia.
Poor query execution plans can result in simple queries
taking minutes as opposed to milliseconds
SELECT ?x WHERE {
?x <livesIn> Virginia.
?x <worksAt> Parsons.
?x <commuteMethod> bike.
}
vs.
SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <livesIn> Virginia.
?x <commuteMethod> bike.
}
vs.
SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <commuteMethod> bike.
?x <livesIn> Virginia.
}
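The difference between the three orderings above can be made concrete with the figures quoted on the slide. The sketch below counts index lookups only, assuming one lookup per candidate ?x for the second and third patterns; the function name `index_lookups` and this simplified cost model are assumptions for illustration.

```python
# Cardinalities quoted on the slide.
CARD = {"livesIn": 8_300_000, "worksAt": 15_000, "commuteMethod": 750_000}
# Sizes of the two-pattern intersections quoted on the slide.
PAIR = {
    frozenset(("worksAt", "commuteMethod")): 100,
    frozenset(("worksAt", "livesIn")): 1_000,
}

def index_lookups(order):
    """Lookups needed after the initial range scan, for a three-pattern ordering."""
    first, second, _third = order
    second_step = CARD[first]                      # one lookup per ?x from the first pattern
    third_step = PAIR[frozenset((first, second))]  # one lookup per ?x surviving the second
    return second_step + third_step

for order in [
    ("livesIn", "worksAt", "commuteMethod"),
    ("worksAt", "livesIn", "commuteMethod"),
    ("worksAt", "commuteMethod", "livesIn"),
]:
    print(order, index_lookups(order))
```

Starting from livesIn requires about 8.3 million lookups, while either ordering that starts from worksAt requires about 15 to 16 thousand.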
Overview
 Rya Overview
 Query Execution within Rya
 Query Optimizations
 Results
 Summary
Rya Query Optimizations
 Goal: Optimize query execution (joins) to better
support real time responsiveness
 Three Approaches:
 Reduce the number of joins: Pattern Based Indices
– Pre-calculate common joins
 Limit data in joins: Use more stats to improve query
planning
– Cardinality estimation on individual statement patterns
– Join selectivity estimation on pairs of statement patterns
 Make joins more efficient: Distribute the Join Processing
– Distribute processing using SPARK SQL or MapReduce
– Use Hash Joins and Intersecting Iterators
– Work in this area is just beginning
Rya Query Optimizations Using Cardinalities
 Goal: Optimize ordering of query execution to
reduce the number of comparison operations
 Order execution based on the number of triples that
match each triple pattern
SELECT ?x WHERE {
?x <worksAt> Parsons. # 15k matches
?x <commuteMethod> bike. # 750k matches
?x <livesIn> Virginia. # 8.3M matches
}
Rya Cardinality Usage
  Maintain cardinalities on the following triple pattern
element combinations:
 Single elements: Subject, Predicate, Object
 Composite elements: Subject-Predicate, Subject-Object,
Predicate-Object
 Computed periodically using MapReduce
 Row ID:
– <CardinalityType><TripleElements>
• OBJECT, Parsons
• PREDICATEOBJECT, worksAt, Parsons
 Cardinality stored in the value
 Sparse table: Only store cardinalities above a threshold
 Only need to recompute cardinalities if the
distribution of the data changes significantly
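A small in-memory stand-in for the cardinality table might look like the sketch below. The row-ID type names OBJECT and PREDICATEOBJECT come from the slide; the other four names, the function name `build_cardinality_table`, and the exact threshold comparison are assumptions. In Rya the counts are produced by MapReduce and stored in Accumulo, not in a Python dictionary.

```python
from collections import Counter

def build_cardinality_table(triples, threshold=2):
    """Count single and composite element combinations and keep only the frequent ones."""
    counts = Counter()
    for s, p, o in triples:
        counts[("SUBJECT", s)] += 1            # type name assumed
        counts[("PREDICATE", p)] += 1          # type name assumed
        counts[("OBJECT", o)] += 1
        counts[("SUBJECTPREDICATE", s, p)] += 1  # type name assumed
        counts[("SUBJECTOBJECT", s, o)] += 1     # type name assumed
        counts[("PREDICATEOBJECT", p, o)] += 1
    # Sparse table: drop entries below the threshold (comparison is an assumption).
    return {row_id: n for row_id, n in counts.items() if n >= threshold}

sample = [
    ("Bob", "worksAt", "Parsons"), ("Greta", "worksAt", "Parsons"),
    ("John", "worksAt", "Parsons"), ("Greta", "livesIn", "Virginia"),
]
table = build_cardinality_table(sample)
print(table[("PREDICATEOBJECT", "worksAt", "Parsons")])
```

A pattern whose row ID is absent from the sparse table can be treated as having a cardinality below the threshold.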
Limitations of Cardinality Approach
 Consider a more complicated query
 Cardinality approach does not take into account
number of results returned by joins
  Solution lies in estimating the “join selectivity” for
each pair of triple patterns
SELECT ?x WHERE {
?x <worksAt> Parsons. # 15k matches
?x <commuteMethod> bike. # 750k matches
?vehicle <vehicleType> SUV. # 2.1M matches
?x <livesIn> Virginia. # 8.3M matches
?x <owns> ?vehicle. # 254M matches
}
Rya Query Optimizations Using Join Selectivity
Query optimized using only Cardinality Info:
SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <commuteMethod> bike.
?vehicle <vehicleType> SUV.
?x <livesIn> Virginia.
?x <owns> ?vehicle.
}
Query optimized using Cardinality and Join Selectivity Info:
SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <commuteMethod> bike.
?x <livesIn> Virginia.
?x <owns> ?vehicle.
?vehicle <vehicleType> SUV.
}
 Join Selectivity measures number of results returned by joining two
triple patterns
 Approach taken from: RDF-3X: a RISC-style Engine for RDF by Thomas
Neumann and Gerhard Weikum in JDMR (formerly Proc. VLDB) 2008
 Due to computational complexity, estimate of join selectivity for triple
patterns is pre-computed and stored in Accumulo
 Join selectivity estimated by computing the number of results obtained
when each triple pattern is joined with the full table
Join Selectivity: General Algorithm
  For statement patterns <?x, p1, o1> and <?x, p2, o2> with ?x a
variable and p1, o1, p2, o2 constants, estimate the number of results of their join
  Sel(<?x, p1, o1> ⋈ <?x, ?y, ?z>) and Sel(<?x, p2, o2> ⋈ <?x, ?y, ?z>)
give the number of results returned by joining a statement pattern with
the full table along the subject component
 Full table join statistics precomputed and stored in index
 Join statistics for each triple pattern computed using following equation:
 Use analogous definition if variables appear in predicate or object position
 Join selectivity statistics used with cardinalities to generate more
efficient query plans
Join Selectivity: Integration into Rya
 Join Selectivity estimates used to optimize Rya queries
through a greedy algorithm approach
 Query constructed starting with first triple pattern to be
evaluated (the pattern with the smallest cardinality) and then
patterns are added based on minimization of a cost function
 Cost function
 C = leftCard + rightCard + leftCard*rightCard*selectivity
 C measures number of entries Accumulo must scan and the
number of comparisons required to perform the join
 Selectivity set to one if two triple patterns share no common
variables, otherwise precomputed estimates used
 Ensures that patterns with common variables are grouped
together
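The greedy ordering above can be sketched as follows. Several things here are assumptions rather than details from the slides: selectivity is treated as a fraction (join result size divided by the product of the two cardinalities), a candidate's cost is the minimum pairwise cost against the patterns already in the plan, and every selectivity value except the two derived from the earlier slide figures (100 and 1,000 results) is made up for illustration. The names `join_cost` and `greedy_plan` are hypothetical.

```python
def variables(pattern):
    return {term for term in pattern if term.startswith("?")}

def join_cost(left, right, card, sel):
    """C = leftCard + rightCard + leftCard * rightCard * selectivity."""
    shared = variables(left) & variables(right)
    # Selectivity is one when the patterns share no variable (cross product).
    s = sel.get(frozenset((left, right)), 1.0) if shared else 1.0
    return card[left] + card[right] + card[left] * card[right] * s

def greedy_plan(patterns, card, sel):
    remaining = sorted(patterns, key=lambda p: card[p])
    plan = [remaining.pop(0)]  # start with the smallest cardinality
    while remaining:
        best = min(remaining, key=lambda r: min(join_cost(l, r, card, sel) for l in plan))
        plan.append(best)
        remaining.remove(best)
    return plan

WORKS = ("?x", "worksAt", "Parsons")
BIKE = ("?x", "commuteMethod", "bike")
SUV = ("?vehicle", "vehicleType", "SUV")
LIVES = ("?x", "livesIn", "Virginia")
OWNS = ("?x", "owns", "?vehicle")
card = {WORKS: 15_000, BIKE: 750_000, SUV: 2_100_000, LIVES: 8_300_000, OWNS: 254_000_000}
sel = {
    frozenset((WORKS, BIKE)): 100 / (15_000 * 750_000),       # 100 results, per the slides
    frozenset((WORKS, LIVES)): 1_000 / (15_000 * 8_300_000),  # 1000 results, per the slides
    frozenset((BIKE, LIVES)): 1e-7,   # hypothetical
    frozenset((WORKS, OWNS)): 1e-8,   # hypothetical
    frozenset((BIKE, OWNS)): 1e-8,    # hypothetical
    frozenset((LIVES, OWNS)): 1e-8,   # hypothetical
    frozenset((SUV, OWNS)): 1e-7,     # hypothetical
}
plan = greedy_plan([WORKS, BIKE, SUV, LIVES, OWNS], card, sel)
for pattern in plan:
    print(pattern)
```

With these inputs the SUV pattern, which shares no variable with the first three patterns, is pushed to the end because its cross-product cost dominates, whereas sorting by cardinality alone would place it third.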
Construction of Selectivity Tables
 For the pattern <?x, p1, o1>, associate each RDF triple of
the form <c, p1, o1> with the cardinality |<c,?y,?z>| and
then sum the results
 Given a triple <c, p1, o1> in the SPO table, Map Job 1 emits
the key-value pair (c, (p1, o1))
  Map Job 2 processes the cardinality table and emits the key-value
pair (c, |<c,?y,?z>|), which consists of the constant c
and its single-component subject cardinality for the table
  Map Job 3 merges the results from jobs 1 and 2 by emitting
the key-value pair ((p1, o1), |<c,?y,?z>|)
 Map Job 4 sums the cardinalities from those key-value pairs
containing (p1, o1) as a key, and the result is written to the
selectivity table
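The four jobs above can be simulated in memory to see what ends up in the selectivity table. This is a sketch only: plain Python collections stand in for MapReduce and Accumulo, the subject cardinalities are recomputed from the triples instead of being read from the cardinality table, and the function name `selectivity_table` is hypothetical.

```python
from collections import Counter, defaultdict

def selectivity_table(triples):
    # Job 1: key each triple <c, p1, o1> by its subject c.
    job1 = [(s, (p, o)) for s, p, o in triples]
    # Job 2: subject cardinalities |<c, ?y, ?z>| (recomputed here for the sketch).
    subject_card = Counter(s for s, _, _ in triples)
    # Job 3: merge on c and re-key by (p1, o1).
    job3 = [(po, subject_card[c]) for c, po in job1]
    # Job 4: sum the cardinalities for each (p1, o1).
    table = defaultdict(int)
    for po, n in job3:
        table[po] += n
    return dict(table)

sample = [
    ("Bob", "worksAt", "Parsons"), ("Bob", "livesIn", "Georgia"),
    ("Greta", "worksAt", "Parsons"), ("Greta", "livesIn", "Virginia"),
    ("Greta", "commuteMethod", "bike"),
    ("John", "worksAt", "Parsons"), ("John", "livesIn", "Virginia"),
    ("John", "commuteMethod", "Bus"),
    ("Dan", "worksAt", "Netflix"),
]
sel_table = selectivity_table(sample)
print(sel_table[("worksAt", "Parsons")])
```

For (worksAt, Parsons) the sum is 2 + 3 + 3 = 8, which is the number of rows produced by joining <?x, worksAt, Parsons> with the full table on the subject: Bob contributes 2 rows, Greta 3, and John 3.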
Query Optimizations Using Pre-Computed Joins
 Reduce joins by pre-computing common joins
 Approach taken from: Heese, Ralf, et al. "Index Support for
SPARQL." European Semantic Web Conference, Innsbruck,
Austria. 2007.
SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <commuteMethod> bike.
?x <livesIn> Virginia.
?x <owns> ?vehicle.
?vehicle <vehicleType> SUV.
}
Pre-compute using
batch processing
and look up during
query execution
Query Optimizations Using Pre-Computed Joins
SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <commuteMethod> bike.
?x <livesIn> Virginia.
?x <owns> ?vehicle.
?vehicle <vehicleType> SUV.
}
1. Pre-compute a portion of the query
using MapReduce
2. Store SPARQL describing the query
along with pre-computed values in
Accumulo
3. Normalize query variables to match
stored SPARQL variables during
query execution
Stored SPARQL
SELECT ?person ?car
WHERE {
?person <livesIn> Virginia.
?person <owns> ?car.
?car <vehicleType> SUV.
}
Index Result Table
…
Aaron, ToyotaRav4
Caleb, JeepCherokee
Puja, HondaCRV
…
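Matching a query against the stored SPARQL requires reconciling variable names (?person and ?car in the stored query versus ?x and ?vehicle in the user query). The sketch below does this by brute force over variable assignments, which is an assumption chosen for brevity and not necessarily how Rya normalizes variables; the function name `match_index` is hypothetical.

```python
from itertools import permutations

def is_var(term):
    return term.startswith("?")

def match_index(index_patterns, query_patterns):
    """Return a mapping from stored-query variables to query variables, or None."""
    index_vars = sorted({t for pat in index_patterns for t in pat if is_var(t)})
    query_vars = sorted({t for pat in query_patterns for t in pat if is_var(t)})
    query_set = set(query_patterns)
    for candidate in permutations(query_vars, len(index_vars)):
        mapping = dict(zip(index_vars, candidate))
        renamed = {tuple(mapping.get(t, t) for t in pat) for pat in index_patterns}
        if renamed <= query_set:  # every stored pattern appears in the query
            return mapping
    return None

stored = [
    ("?person", "livesIn", "Virginia"),
    ("?person", "owns", "?car"),
    ("?car", "vehicleType", "SUV"),
]
query = [
    ("?x", "worksAt", "Parsons"),
    ("?x", "commuteMethod", "bike"),
    ("?x", "livesIn", "Virginia"),
    ("?x", "owns", "?vehicle"),
    ("?vehicle", "vehicleType", "SUV"),
]
mapping = match_index(stored, query)
print(mapping)
```

Once a mapping is found, the three matched patterns can be answered from the Index Result Table, leaving only the worksAt and commuteMethod patterns to be joined at query time.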
Overview
 Rya Overview
 Query Execution within Rya
 Query Optimizations
 Results
 Summary
Query Optimization Results
 Ran 14 queries against the Lehigh University Benchmark (LUBM)
dataset (33.34 million triples)
 LUBM queries 2, 5, 9, and 13 were discarded after 3 runs due to query complexity
– Remaining queries were executed 12 times
 Cluster Specs:
– 8 worker nodes, each has 2 x 6-Core Xeon E5-2440 (2.4GHz) Processors and
48 GB RAM
 Results indicate that cardinality and join selectivity optimizations provide
improved or comparable performance
Summary
 Cardinality estimation and join selectivity can
improve query response times for ad hoc queries
 Effects of join selectivity are more apparent for
complex queries over large datasets
 Pre-computed joins are extremely useful for
optimizing common queries
 Potentially avoid large number of join operations
 Maintaining pre-computed join indices is difficult
Questions?
BACK-UP
Useful Links
 SPARQL
 http://www.w3.org/TR/rdf-sparql-query/
 http://jena.apache.org/tutorials/sparql.html
 RDF
 http://www.w3.org/TR/2014/NOTE-rdf11-primer-20140225/
 Rya
 https://github.com/LAS-NCSU/rya
– Source on github: Provides documentation and sample client code
– Email Aaron Mihalik (aaron.mihalik@parsons.com) for access (US Citizens only)
 Rya Working Group
– Monthly telecon / update on progress, issues, upcoming features
– Email Puja Valiyil puja.valiyil@parsons.com to join (US Citizens only)
  Open RDF Tutorial: http://openrdf.callimachus.net/sesame/tutorials/getting-started.docbook?view
 Open RDF Javadoc: http://openrdf.callimachus.net/sesame/2.7/apidocs/index.html
 Punnoose R., Crainiceanu A., Rapp D. 2012. Rya: a scalable RDF triple store for the
clouds. Proceedings of the 1st International Workshop on Cloud Intelligence.
http://www.usna.edu/Users/cs/adina/research/Rya_CloudI2012.pdf
 Roshan Punnoose, Adina Crainiceanu, David Rapp. SPARQL in the Clouds Using Rya.
Information Systems Journal (2013).
http://www.usna.edu/Users/cs/adina/research/Rya_ISjournal2013.pdf
Next Steps
 Maintaining pre-computed join indices
 Dynamically determining potential pre-computed
joins
 Distributing query planning and execution
 SPARK SQL
 Rya backed by other datastores
 Fully open sourcing Rya
Sample LUBM Queries (1 of 3)
Query 1
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
SELECT ?X WHERE
{ GRAPH <http://LUBM>
{?X rdf:type ub:GraduateStudent .
?X ub:takesCourse <http://www.Department0.University0.edu/GraduateCourse0>}
}
Query 3
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
SELECT ?X WHERE
{ GRAPH <http://LUBM>
{?X rdf:type ub:Publication .
?X ub:publicationAuthor <http://www.Department0.University0.edu/AssistantProfessor0>}
}
Sample LUBM Queries (2 of 3)
Query 7
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
SELECT ?X ?Y WHERE
{ GRAPH <http://LUBM>
{?X rdf:type ub:Student .
?Y rdf:type ub:Course .
?X ub:takesCourse ?Y .
<http://www.Department0.University0.edu/AssociateProfessor0> ub:teacherOf ?Y}
}
Query 8
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
SELECT ?X ?Y ?Z WHERE
{ GRAPH <http://LUBM>
{?X rdf:type ub:Student .
?Y rdf:type ub:Department .
?X ub:memberOf ?Y .
?Y ub:subOrganizationOf <http://www.University0.edu> .
?X ub:emailAddress ?Z}
}
Sample LUBM Queries (3 of 3)
Query 9
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
SELECT ?X ?Y ?Z WHERE
{ GRAPH <http://LUBM>
{?X rdf:type ub:Student .
?Y rdf:type ub:Faculty .
?Z rdf:type ub:Course .
?X ub:advisor ?Y .
?Y ub:teacherOf ?Z .
?X ub:takesCourse ?Z}
}
Query 11
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
SELECT ?X WHERE
{ GRAPH <http://LUBM>
{?X rdf:type ub:ResearchGroup .
?X ub:subOrganizationOf <http://www.University0.edu>}
}

Accumulo Summit 2015: Rya: Optimizations to Support Real Time Graph Queries on Accumulo [Frameworks]


Editor's Notes

  • #2 Abstract: The Resource Description Framework (RDF) is a standard model for expressing graph data for the World Wide Web. Developed by the W3C, RDF and related technologies such as OWL and SKOS provide a rich vocabulary for exchanging graph data in a machine understandable manner. As the size of available data continues to grow, there has been an increased desire for methods of storing very large RDF graphs within big data architectures. Rya is a government open source scalable RDF triple store built on top of Apache Accumulo. Originally developed by the Laboratory for Telecommunication Sciences and US Naval Academy, Rya is currently being used by a number of government agencies for storing, inferencing, and querying large amounts of RDF data. As Rya’s user base has grown, there has been a stronger requirement for near real time query responsiveness over massive RDF graphs. In this talk, we detail several query optimization strategies the Rya team has pursued to better satisfy this requirement. We describe recent work allowing for the use of additional indices to eliminate large common joins within complex SPARQL queries. Additionally, we explain a number of statistics based optimizations to improve query planning. Specifically, we detail extensions to existing methods of estimating the selectivity of individual statement patterns (cardinality) and the selectivity of joining two statement patterns (join selectivity) to better fit a “big data” paradigm and utilize Accumulo. Finally, we share preliminary performance evaluation results for the optimizations that have been pursued.
    Speaker: Dr. Caleb Meier, Engineer/Algorithm Developer, Parsons Corporation. Dr. Meier received a PhD from the University of California San Diego (UCSD) in Mathematics in 2012. For the past two years, he was a postdoctoral fellow at UCSD's Math department specializing in non-linear elliptic systems of partial differential equations. He received his undergraduate degree in Mathematics from Yale University in 2006. Dr. Meier is currently working as an engineer at Parsons Corporation, specializing in query optimization algorithms for large scale RDF graphs. He is an expert in semantic technologies, Accumulo, the Hadoop Ecosystem, and is actually more fun to be around than his bio suggests.
    Schedule: 2:45-3:20 on April 29, 2015
  • #10 Find all US citizens that travel to Iran
  • #17 Triple patterns containing no common variables can be joined together, creating a cross product. Among triple patterns with similar cardinalities and common variables, how should they be joined to obtain the best execution plan?
  • #22 Term “Pattern Based Index” taken from: Heese, Ralf, et al. "Index Support for SPARQL." European Semantic Web Conference, Innsbruck, Austria. 2007. Issues: query planning is difficult; index size can potentially increase exponentially; an external index must be maintained.