Rya: Optimizations to Support Real
Time Graph Queries on Accumulo
Dr. Caleb Meier, Puja Valiyil, Aaron Mihalik, Dr. Adina Crainiceanu
DISTRIBUTION STATEMENT A. Approved for
public release; distribution is unlimited.
ONR Case Number 43-279-15 JB.01.2015
Acknowledgements
 This work is the collective effort of:
 Parsons’ Rya Team, sponsored by the Department of
the Navy, Office of Naval Research
 Rya Founders: Roshan Punnoose, Adina Crainiceanu,
and David Rapp
Overview
 Rya Overview
 Query Execution within Rya
 Query Optimizations
 Results
 Summary
Background: Rya and RDF
 Rya: Resource Description Framework (RDF)
Triplestore built on top of Accumulo
 RDF: W3C standard for representing
linked/graph data
 Represents data as statements (assertions) about
resources
– Serialized as triples in {subject, predicate, object}
form
– Example:
• {Caleb, worksAt, Parsons}
• {Caleb, livesIn, Virginia}
[Figure: RDF graph with the node Caleb linked to Parsons by a worksAt edge and to Virginia by a livesIn edge]
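The two example statements can be held as plain tuples and turned into a labeled graph. This is a minimal sketch in Python; the dictionary layout is an illustration only and is not how Rya stores data.

```python
from collections import defaultdict

# The two example statements from the slide, as (subject, predicate, object) tuples.
triples = [
    ("Caleb", "worksAt", "Parsons"),
    ("Caleb", "livesIn", "Virginia"),
]

# Build a labeled, directed graph: subject -> list of (predicate, object) edges.
graph = defaultdict(list)
for subject, predicate, obj in triples:
    graph[subject].append((predicate, obj))

print(graph["Caleb"])
```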
Background: SPARQL
 RDF Queries are described using SPARQL
 SPARQL Protocol and RDF Query Language
 SQL-like syntax for finding triples matching
specific patterns
 Look for subgraphs that match triple statement patterns
 Joins are performed when there are variables common
to two or more statement patterns
SELECT ?people WHERE {
?people <worksAt> <Parsons>.
?people <livesIn> <Virginia>.
}
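To make the join semantics concrete, the sketch below evaluates the two statement patterns of the query above against a tiny in-memory triple list. It is a naive illustration, not Rya's or Sesame's evaluator; the function names and the extra subject "Dana" are hypothetical.

```python
def is_var(term):
    """SPARQL-style variables start with '?'."""
    return term.startswith("?")

def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`; return None on conflict."""
    new_binding = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # A variable already bound by an earlier pattern must agree (the join).
            if new_binding.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return new_binding

def evaluate(patterns, triples):
    """Evaluate the patterns left to right, keeping only consistent bindings."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return bindings

triples = [
    ("Caleb", "worksAt", "Parsons"),
    ("Caleb", "livesIn", "Virginia"),
    ("Dana", "worksAt", "Parsons"),   # hypothetical extra data
    ("Dana", "livesIn", "Maryland"),  # hypothetical extra data
]
patterns = [("?people", "worksAt", "Parsons"), ("?people", "livesIn", "Virginia")]
results = evaluate(patterns, triples)
print(results)
```

Because ?people appears in both patterns, only subjects that satisfy both statements survive.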
Rya Architecture
 Open RDF Interface for interacting with RDF data
stored on Accumulo
 Open RDF (Sesame): Open
Source Java framework for
storing and querying RDF
data
 Open RDF Provides several
interfaces/abstractions
central for interacting with
a RDF datastore
– SAIL interface for interacting with underlying persisted
RDF model
– SAIL: Storage And Inference Layer
[Figure: architecture stack, top to bottom: SPARQL; Rya Open RDF and Rya QueryPlanner (query processing in the SAIL layer); Accumulo (data storage layer)]
Storage: Triple Table Index
 3 Tables
 SPO : subject, predicate, object
 POS : predicate, object, subject
 OSP : object, subject, predicate
 Store triples in the RowID of the table
 Store graph name in the Column Family
 Advantages:
 Native lexicographical sorting of row keys  fast range queries
 All patterns can be translated into a scan of one of these tables
Rya Query Execution
 Implemented OpenRDF Sesame SAIL API
 Parse queries, generate initial query plan, execute plan
 Triple patterns map to range queries in Accumulo
SELECT ?x WHERE {
?x <worksAt> <Parsons>.
?x <livesIn> <Virginia>.
}
Step 1: POS Table – scan range
…
worksAt, Netflix, Dan
worksAt, OfficeMax, Zack
worksAt, Parsons, Bob
worksAt, Parsons, Greta
worksAt, Parsons, John
…
Step 2: for each ?x, SPO – index lookup
…
Bob, livesIn, Georgia
…
Greta, livesIn, Virginia
…
John, livesIn, Virginia
…
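The two steps can be sketched with sorted in-memory lists standing in for the Accumulo tables. This is an illustration under assumptions: real Rya row IDs are byte strings scanned with Accumulo ranges, and the helper names below (choose_table, build_table, prefix_scan) are hypothetical.

```python
import bisect

# Key order of each index table, as positions in a (subject, predicate, object) triple.
ORDERS = {"SPO": (0, 1, 2), "POS": (1, 2, 0), "OSP": (2, 0, 1)}

def choose_table(pattern):
    """Return (table, prefix) for the table with the longest bound key prefix (None = variable)."""
    best = ("SPO", ())
    for table, order in ORDERS.items():
        prefix = []
        for position in order:
            if pattern[position] is None:
                break
            prefix.append(pattern[position])
        if len(prefix) > len(best[1]):
            best = (table, tuple(prefix))
    return best

def build_table(triples, table):
    """Sorted keys, mimicking lexicographically sorted row IDs."""
    order = ORDERS[table]
    return sorted(tuple(t[i] for i in order) for t in triples)

def prefix_scan(rows, prefix):
    """Range scan: binary-search to the prefix, then read until it no longer matches."""
    start = bisect.bisect_left(rows, prefix)
    out = []
    for row in rows[start:]:
        if row[:len(prefix)] != prefix:
            break
        out.append(row)
    return out

triples = [
    ("Dan", "worksAt", "Netflix"), ("Zack", "worksAt", "OfficeMax"),
    ("Bob", "worksAt", "Parsons"), ("Greta", "worksAt", "Parsons"),
    ("John", "worksAt", "Parsons"), ("Bob", "livesIn", "Georgia"),
    ("Greta", "livesIn", "Virginia"), ("John", "livesIn", "Virginia"),
]
tables = {name: build_table(triples, name) for name in ORDERS}

# Step 1: ?x worksAt Parsons -> range scan on the POS table.
table, prefix = choose_table((None, "worksAt", "Parsons"))
candidates = [row[2] for row in prefix_scan(tables[table], prefix)]

# Step 2: for each ?x, an exact lookup of (?x, livesIn, Virginia) on the SPO table.
answers = [x for x in candidates if prefix_scan(tables["SPO"], (x, "livesIn", "Virginia"))]
print(table, candidates, answers)
```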
More Complicated Example of Rya Query Execution
SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <livesIn> Virginia.
?x <commuteMethod> bike.
}
Step 1: POS Table – scan range for worksAt, Parsons (?x worksAt Parsons)
…
worksAt, Netflix, Dan
worksAt, Parsons, Bob
worksAt, Parsons, Greta
worksAt, Parsons, John
worksAt, PlayStation, Alice
…
Step 2: For each ?x, SPO Table lookup (?x livesIn Virginia)
…
Bob, livesIn, Georgia
…
Greta, livesIn, Virginia
…
John, livesIn, Virginia
…
Step 3: For each remaining ?x, SPO Table lookup (?x commuteMethod bike)
…
Greta, commuteMethod, bike
…
John, commuteMethod, Bus
…
Challenges in Query Execution
 Scalability and Responsiveness
 Massive amounts of data
 Potentially large amounts of comparisons
 Consider the Previous Example:
 Default query execution: comparing each “?x” returned from first
statement pattern query to all subsequent triple patterns
 There are 8.3 million Virginia residents, about 15,000 Parsons
employees, and 750,000 people who commute via bike.
 Only 100 people who work at Parsons commute via bike while 1000
people who work at Parsons live in Virginia.
Poor query execution plans can result in simple queries
taking minutes as opposed to milliseconds
SELECT ?x WHERE {
?x <livesIn> Virginia.
?x <worksAt> Parsons.
?x <commuteMethod> bike.
}
vs.
SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <livesIn> Virginia.
?x <commuteMethod> bike.
}
vs.
SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <commuteMethod> bike.
?x <livesIn> Virginia.
}
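A rough count of index lookups shows why the ordering matters. The sketch below uses only the figures quoted on this slide and assumes the default execution described above: one lookup per binding passed on to the next pattern, ignoring the cost of the initial range scan. The function name is hypothetical.

```python
# Number of triples matching each statement pattern (from the slide).
cardinality = {
    "livesIn Virginia": 8_300_000,
    "worksAt Parsons": 15_000,
    "commuteMethod bike": 750_000,
}
# Size of the intermediate result after joining two patterns (from the slide).
joined = {
    frozenset({"worksAt Parsons", "livesIn Virginia"}): 1_000,
    frozenset({"worksAt Parsons", "commuteMethod bike"}): 100,
}

def lookups(order):
    """Lookups for a three-pattern plan: one per binding fed into patterns 2 and 3."""
    first, second, _third = order
    into_second = cardinality[first]
    into_third = joined[frozenset({first, second})]
    return into_second + into_third

plans = [
    ("livesIn Virginia", "worksAt Parsons", "commuteMethod bike"),
    ("worksAt Parsons", "livesIn Virginia", "commuteMethod bike"),
    ("worksAt Parsons", "commuteMethod bike", "livesIn Virginia"),
]
for plan in plans:
    print(" -> ".join(plan), lookups(plan))
```

Under these assumptions the first ordering needs about 8.3 million lookups, while the other two need roughly 16,000 and 15,100.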
Rya Query Optimizations
 Goal: Optimize query execution (joins) to better
support real time responsiveness
 Three Approaches:
 Reduce the number of joins: Pattern Based Indices
– Pre-calculate common joins
 Limit data in joins: Use more stats to improve query
planning
– Cardinality estimation on individual statement patterns
– Join selectivity estimation on pairs of statement patterns
 Make joins more efficient: Distribute the Join Processing
– Distribute processing using SPARK SQL or MapReduce
– Use Hash Joins and Intersecting Iterators
– Just beginning to start looking at this
Rya Query Optimizations Using Cardinalities
 Goal: Optimize ordering of query execution to
reduce the number of comparison operations
 Order execution based on the number of triples that
match each triple pattern
SELECT ?x WHERE {
?x <worksAt> Parsons.          # 15k matches
?x <commuteMethod> bike.       # 750k matches
?x <livesIn> Virginia.         # 8.3M matches
}
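Ordering by cardinality amounts to a sort on the per-pattern match counts. A minimal sketch, using the counts from the slide and tuples as a stand-in for statement patterns:

```python
# Statement patterns in the order a user might write them.
patterns = [
    ("?x", "livesIn", "Virginia"),
    ("?x", "worksAt", "Parsons"),
    ("?x", "commuteMethod", "bike"),
]
# Number of triples matching each pattern (from the slide).
cardinality = {
    ("?x", "worksAt", "Parsons"): 15_000,
    ("?x", "commuteMethod", "bike"): 750_000,
    ("?x", "livesIn", "Virginia"): 8_300_000,
}

# Evaluate the most selective pattern first.
ordered = sorted(patterns, key=cardinality.__getitem__)
for pattern in ordered:
    print(pattern, cardinality[pattern])
```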
Rya Cardinality Usage
 Maintain cardinalities on the following triple patterns
element combinations:
 Single elements: Subject, Predicate, Object
 Composite elements: Subject-Predicate, Subject-Object,
Predicate-Object
 Computed periodically using MapReduce
 Row ID:
– <CardinalityType><TripleElements>
• OBJECT, Parsons
• PREDICATEOBJECT, worksAt, Parsons
 Cardinality stored in the value
 Sparse table: Only store cardinalities above a threshold
 Only need to recompute cardinalities if the
distribution of the data changes significantly
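The cardinality table layout can be sketched in a single process as follows. Several details are assumptions made for illustration: the delimiter, the type names other than OBJECT and PREDICATEOBJECT (which appear on the slide), and the use of a Counter instead of a MapReduce job.

```python
from collections import Counter

SEP = "\x00"  # assumed delimiter between row ID components

def cardinality_rows(triples, threshold):
    """Map row ID (<CardinalityType><TripleElements>) to cardinality, keeping counts above the threshold."""
    counts = Counter()
    for s, p, o in triples:
        # Single elements.
        counts[("SUBJECT", s)] += 1
        counts[("PREDICATE", p)] += 1
        counts[("OBJECT", o)] += 1
        # Composite elements.
        counts[("SUBJECTPREDICATE", s, p)] += 1
        counts[("SUBJECTOBJECT", s, o)] += 1
        counts[("PREDICATEOBJECT", p, o)] += 1
    # Sparse table: drop anything at or below the threshold.
    return {SEP.join(key): n for key, n in counts.items() if n > threshold}

triples = [
    ("Bob", "worksAt", "Parsons"),
    ("Greta", "worksAt", "Parsons"),
    ("John", "worksAt", "Parsons"),
    ("Dan", "worksAt", "Netflix"),
]
rows = cardinality_rows(triples, threshold=1)
for row_id, n in sorted(rows.items()):
    print(row_id.split(SEP), n)
```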
Limitations of Cardinality Approach
 Consider a more complicated query
 Cardinality approach does not take into account
number of results returned by joins
 Solution lies in estimating the “join selectivity” for
each pair of triple patterns
SELECT ?x WHERE {
?x <worksAt> Parsons.          # 15k matches
?x <commuteMethod> bike.       # 750k matches
?vehicle <vehicleType> SUV.    # 2.1M matches
?x <livesIn> Virginia.         # 8.3M matches
?x <owns> ?vehicle.            # 254M matches
}
Rya Query Optimizations Using Join Selectivity
Query optimized using only Cardinality Info:
SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <commuteMethod> bike.
?vehicle <vehicleType> SUV.
?x <livesIn> Virginia.
?x <owns> ?vehicle.
}
Query optimized using Cardinality and Join Selectivity Info:
SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <commuteMethod> bike.
?x <livesIn> Virginia.
?x <owns> ?vehicle.
?vehicle <vehicleType> SUV.
}
 Join Selectivity measures number of results returned by joining two
triple patterns
 Approach taken from: RDF-3X: a RISC-style Engine for RDF by Thomas
Neumann and Gerhard Weikum in JDMR (formerly Proc. VLDB) 2008
 Due to computational complexity, estimate of join selectivity for triple
patterns is pre-computed and stored in Accumulo
 Join selectivity estimated by computing the number of results obtained
when each triple pattern is joined with the full table
Join Selectivity: General Algorithm
 For statement patterns <?x, p1, o1> and <?x, p2, o2> with ?x a
variable and p1, o1, p2, o2 constant, estimate the number of results
 Sel(<?x, p1, o1> ⋈ <?x, ?y, ?z>) and Sel(<?x, p2, o2> ⋈ <?x, ?y, ?z>)
give the number of results returned by joining a statement pattern with
the full table along the subject component
 Full table join statistics precomputed and stored in index
 Join statistics for each triple pattern computed using following equation:
 Use analogous definition if variables appear in predicate or object position
 Join selectivity statistics used with cardinalities to generate more
efficient query plans
Join Selectivity: Integration into Rya
 Join Selectivity estimates used to optimize Rya queries
through a greedy algorithm approach
 Query constructed starting with first triple pattern to be
evaluated (the pattern with the smallest cardinality) and then
patterns are added based on minimization of a cost function
 Cost function
 C = leftCard + rightCard + leftCard*rightCard*selectivity
 C measures number of entries Accumulo must scan and the
number of comparisons required to perform the join
 Selectivity set to one if two triple patterns share no common
variables, otherwise precomputed estimates used
 Ensures that patterns with common variables are grouped
together
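The greedy construction can be sketched as below. The cost function is the one on the slide; everything else is an assumption made for illustration: the pair selectivity values are hypothetical, the plan's running size is estimated as leftCard*rightCard*selectivity, and when a candidate shares variables with several patterns already in the plan the smallest pair estimate is used.

```python
def variables(pattern):
    return {term for term in pattern if term.startswith("?")}

def join_cost(left_card, right_card, selectivity):
    # C = leftCard + rightCard + leftCard*rightCard*selectivity (from the slide).
    return left_card + right_card + left_card * right_card * selectivity

def greedy_order(cardinality, pair_selectivity):
    remaining = dict(cardinality)
    first = min(remaining, key=remaining.get)  # smallest cardinality goes first
    plan, left_card = [first], remaining.pop(first)
    while remaining:
        best = None
        for pattern, right_card in remaining.items():
            shared = [p for p in plan if variables(p) & variables(pattern)]
            # Selectivity is 1 when no variable is shared (a cross product).
            sel = min((pair_selectivity[frozenset({p, pattern})] for p in shared), default=1.0)
            cost = join_cost(left_card, right_card, sel)
            if best is None or cost < best[0]:
                best = (cost, pattern, left_card * right_card * sel)
        _, chosen, left_card = best
        plan.append(chosen)
        del remaining[chosen]
    return plan

WORKS = ("?x", "worksAt", "Parsons")
SUV = ("?vehicle", "vehicleType", "SUV")
LIVES = ("?x", "livesIn", "Virginia")
OWNS = ("?x", "owns", "?vehicle")

cardinality = {WORKS: 15_000, SUV: 2_100_000, LIVES: 8_300_000, OWNS: 254_000_000}
pair_selectivity = {  # hypothetical precomputed estimates
    frozenset({WORKS, LIVES}): 8e-9,
    frozenset({WORKS, OWNS}): 1e-9,
    frozenset({LIVES, OWNS}): 1e-9,
    frozenset({OWNS, SUV}): 4e-9,
}

by_cardinality = sorted(cardinality, key=cardinality.get)
by_cost = greedy_order(cardinality, pair_selectivity)
print(by_cardinality)
print(by_cost)
```

With these made-up estimates the cost-based plan defers the vehicleType pattern until ?vehicle is bound by the owns pattern, whereas the cardinality-only ordering places it second and forces a cross product.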
Construction of Selectivity Tables
 For the pattern <?x, p1, o1>, associate each RDF triple of
the form <c, p1, o1> with the cardinality |<c,?y,?z>| and
then sum the results
 Given a triple <c, p1, o1> in the SPO table, Map Job 1 emits
the key-value pair (c, (p1, o1))
 Map Job 2 processes the cardinality table and emits the key
value pair (c, |<c,?y,?z>|), which consists of the constant c
and its single-component subject cardinality for the table
 Map Job 3 merges the results from jobs 1 and 2 by emitting
the key-value pair ((p1, o1), |<c,?y,?z>|)
 Map Job 4 sums the cardinalities from those key-value pairs
containing (p1, o1) as a key, and the result is written to the
selectivity table
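The net effect of the four jobs can be reproduced in a single process for a small dataset. This is a sketch of the computation only, not of the MapReduce implementation; the function name and the sample triples are hypothetical.

```python
from collections import Counter, defaultdict

def full_table_selectivity(triples):
    """For every (p, o), the number of results of joining <?x, p, o> with <?x, ?y, ?z> on the subject."""
    # Job 2 equivalent: subject cardinality |<c, ?y, ?z>| for every constant c.
    subject_card = Counter(s for s, _, _ in triples)
    # Jobs 1, 3 and 4 equivalent: attach |<c, ?y, ?z>| to each <c, p, o>, then sum per (p, o).
    selectivity = defaultdict(int)
    for s, p, o in triples:
        selectivity[(p, o)] += subject_card[s]
    return dict(selectivity)

triples = [
    ("Bob", "worksAt", "Parsons"),
    ("Bob", "livesIn", "Georgia"),
    ("Greta", "worksAt", "Parsons"),
    ("Greta", "livesIn", "Virginia"),
    ("Greta", "commuteMethod", "bike"),
    ("Dan", "worksAt", "Netflix"),
]
table = full_table_selectivity(triples)
print(table[("worksAt", "Parsons")])
```

Here Bob contributes 2 triples and Greta 3, so joining <?x, worksAt, Parsons> with the full table yields 5 results.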
Query Optimizations Using Pre-Computed Joins
 Reduce joins by pre-computing common joins
 Approach taken from: Heese, Ralf, et al. "Index Support for
SPARQL." European Semantic Web Conference, Innsbruck,
Austria. 2007.
SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <commuteMethod> bike.
?x <livesIn> Virginia.
?x <owns> ?vehicle.
?vehicle <vehicleType> SUV.
}
Pre-compute using
batch processing
and look up during
query execution
Query Optimizations Using Pre-Computed Joins
SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <commuteMethod> bike.
?x <livesIn> Virginia.
?x <owns> ?vehicle.
?vehicle <vehicleType> SUV.
}
Stored SPARQL
SELECT ?person ?car
WHERE {
?person <livesIn> Virginia.
?person <owns> ?car.
?car <vehicleType> SUV.
}
Index Result Table
…
Aaron, ToyotaRav4
Caleb, JeepCherokee
Puja, HondaCRV
…
1. Pre-compute a portion of the query using MapReduce
2. Store SPARQL describing the query along with pre-computed values in Accumulo
3. Normalize query variables to match stored SPARQL variables during query execution
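Step 3, normalizing variables, can be sketched as a search for a renaming of the stored variables under which every stored pattern appears in the incoming query. This brute-force version is an illustration only; the function names are hypothetical and Rya's actual matching logic is not shown in the slides.

```python
from itertools import permutations

def is_var(term):
    return term.startswith("?")

def rename(pattern, mapping):
    return tuple(mapping.get(term, term) for term in pattern)

def find_mapping(stored, query):
    """Return a stored-variable -> query-variable mapping that embeds `stored` in `query`, else None."""
    stored_vars = sorted({t for p in stored for t in p if is_var(t)})
    query_vars = sorted({t for p in query for t in p if is_var(t)})
    for candidate in permutations(query_vars, len(stored_vars)):
        mapping = dict(zip(stored_vars, candidate))
        if all(rename(p, mapping) in query for p in stored):
            return mapping
    return None

query = [
    ("?x", "worksAt", "Parsons"),
    ("?x", "commuteMethod", "bike"),
    ("?x", "livesIn", "Virginia"),
    ("?x", "owns", "?vehicle"),
    ("?vehicle", "vehicleType", "SUV"),
]
stored = [
    ("?person", "livesIn", "Virginia"),
    ("?person", "owns", "?car"),
    ("?car", "vehicleType", "SUV"),
]
columns = ("?person", "?car")
index_rows = [("Aaron", "ToyotaRav4"), ("Caleb", "JeepCherokee"), ("Puja", "HondaCRV")]

mapping = find_mapping(stored, query)
covered = {rename(p, mapping) for p in stored}
# Patterns still to be evaluated against the triple tables.
remaining = [p for p in query if p not in covered]
# Index rows re-expressed in the query's own variable names.
bindings = [{mapping[c]: v for c, v in zip(columns, row)} for row in index_rows]
print(mapping, remaining, bindings[0])
```

Three of the five statement patterns are answered by the index lookup, leaving only the worksAt and commuteMethod patterns to join.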
Query Optimization Results
 Ran 14 queries against the Lehigh University Benchmark (LUBM)
dataset (33.34 million triples)
 LUBM queries 2, 5, 9, and 13 were discarded after 3 runs due to query complexity
– Remaining queries were executed 12 times
 Cluster Specs:
– 8 worker nodes, each has 2 x 6-Core Xeon E5-2440 (2.4GHz) Processors and
48 GB RAM
 Results indicate that cardinality and join selectivity optimizations provide
improved or comparable performance
Summary
 Cardinality estimation and join selectivity can
improve query response times for ad hoc queries
 Effects of join selectivity are more apparent for
complex queries over large datasets
 Pre-computed joins are extremely useful for
optimizing common queries
 Potentially avoid large number of join operations
 Maintaining pre-computed join indices is difficult
Questions?
BACK-UP
Useful Links
 SPARQL
 http://www.w3.org/TR/rdf-sparql-query/
 http://jena.apache.org/tutorials/sparql.html
 RDF
 http://www.w3.org/TR/2014/NOTE-rdf11-primer-20140225/
 Rya
 https://github.com/LAS-NCSU/rya
– Source on github: Provides documentation and sample client code
– Email Aaron Mihalik (aaron.mihalik@parsons.com) for access (US Citizens only)
 Rya Working Group
– Monthly telecon / update on progress, issues, upcoming features
– Email Puja Valiyil puja.valiyil@parsons.com to join (US Citizens only)
 Open RDF Tutorial: http://openrdf.callimachus.net/sesame/tutorials/getting-
started.docbook?view
 Open RDF Javadoc: http://openrdf.callimachus.net/sesame/2.7/apidocs/index.html
 Punnoose R., Crainiceanu A., Rapp D. 2012. Rya: a scalable RDF triple store for the
clouds. Proceedings of the 1st International Workshop on Cloud Intelligence.
http://www.usna.edu/Users/cs/adina/research/Rya_CloudI2012.pdf
 Roshan Punnoose, Adina Crainiceanu, David Rapp. SPARQL in the Clouds Using Rya.
Information Systems Journal (2013).
http://www.usna.edu/Users/cs/adina/research/Rya_ISjournal2013.pdf
Next Steps
 Maintaining pre-computed join indices
 Dynamically determining potential pre-computed
joins
 Distributing query planning and execution
 SPARK SQL
 Rya backed by other datastores
 Fully open sourcing Rya
Sample LUBM Queries (1 of 3)
Query 1
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
SELECT ?X WHERE
{ GRAPH <http://LUBM>
{?X rdf:type ub:GraduateStudent .
?X ub:takesCourse <http://www.Department0.University0.edu/GraduateCourse0>}
}
Query 3
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
SELECT ?X WHERE
{ GRAPH <http://LUBM>
{?X rdf:type ub:Publication .
?X ub:publicationAuthor <http://www.Department0.University0.edu/AssistantProfessor0>}
}
Sample LUBM Queries (2 of 3)
Query 7
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
SELECT ?X ?Y WHERE
{ GRAPH <http://LUBM>
{?X rdf:type ub:Student .
?Y rdf:type ub:Course .
?X ub:takesCourse ?Y .
<http://www.Department0.University0.edu/AssociateProfessor0> ub:teacherOf ?Y}
}
Query 8
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
SELECT ?X ?Y ?Z WHERE
{ GRAPH <http://LUBM>
{?X rdf:type ub:Student .
?Y rdf:type ub:Department .
?X ub:memberOf ?Y .
?Y ub:subOrganizationOf <http://www.University0.edu> .
?X ub:emailAddress ?Z}
}
Sample LUBM Queries (3 of 3)
Query 9
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
SELECT ?X ?Y ?Z WHERE
{ GRAPH <http://LUBM>
{?X rdf:type ub:Student .
?Y rdf:type ub:Faculty .
?Z rdf:type ub:Course .
?X ub:advisor ?Y .
?Y ub:teacherOf ?Z .
?X ub:takesCourse ?Z}
}
Query 11
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
SELECT ?X WHERE
{ GRAPH <http://LUBM>
{?X rdf:type ub:ResearchGroup .
?X ub:subOrganizationOf <http://www.University0.edu>}
}

More Related Content

What's hot

Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...
Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...
Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...
Databricks
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
Cloudera, Inc.
 
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
Databricks
 
Designing Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache SparkDesigning Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache Spark
Databricks
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Databricks
 
Spark ML with High Dimensional Labels Michael Zargham and Stefan Panayotov
Spark ML with High Dimensional Labels Michael Zargham and Stefan PanayotovSpark ML with High Dimensional Labels Michael Zargham and Stefan Panayotov
Spark ML with High Dimensional Labels Michael Zargham and Stefan Panayotov
Databricks
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
MLconf
 
final_copy_camera_ready_paper (7)
final_copy_camera_ready_paper (7)final_copy_camera_ready_paper (7)
final_copy_camera_ready_paper (7)Ankit Rathi
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
Spark Summit
 
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Spark Summit
 
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
Spark Summit
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
Databricks
 
On-Prem Solution for the Selection of Wind Energy Models
On-Prem Solution for the Selection of Wind Energy ModelsOn-Prem Solution for the Selection of Wind Energy Models
On-Prem Solution for the Selection of Wind Energy Models
Databricks
 
Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™
Databricks
 
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Spark Summit
 
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
DataWorks Summit
 
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Spark Summit
 
Generalized Linear Models in Spark MLlib and SparkR
Generalized Linear Models in Spark MLlib and SparkRGeneralized Linear Models in Spark MLlib and SparkR
Generalized Linear Models in Spark MLlib and SparkR
Databricks
 
Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015
Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015
Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015
Martin Junghanns
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
Databricks
 

What's hot (20)

Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...
Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...
Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
 
Designing Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache SparkDesigning Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache Spark
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
 
Spark ML with High Dimensional Labels Michael Zargham and Stefan Panayotov
Spark ML with High Dimensional Labels Michael Zargham and Stefan PanayotovSpark ML with High Dimensional Labels Michael Zargham and Stefan Panayotov
Spark ML with High Dimensional Labels Michael Zargham and Stefan Panayotov
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
 
final_copy_camera_ready_paper (7)
final_copy_camera_ready_paper (7)final_copy_camera_ready_paper (7)
final_copy_camera_ready_paper (7)
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
 
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
 
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
 
On-Prem Solution for the Selection of Wind Energy Models
On-Prem Solution for the Selection of Wind Energy ModelsOn-Prem Solution for the Selection of Wind Energy Models
On-Prem Solution for the Selection of Wind Energy Models
 
Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™
 
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
 
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
 
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
 
Generalized Linear Models in Spark MLlib and SparkR
Generalized Linear Models in Spark MLlib and SparkRGeneralized Linear Models in Spark MLlib and SparkR
Generalized Linear Models in Spark MLlib and SparkR
 
Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015
Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015
Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 

Similar to Accumulo Summit 2015: Rya: Optimizations to Support Real Time Graph Queries on Accumulo [Frameworks]

Matrix Factorization In Recommender Systems
Matrix Factorization In Recommender SystemsMatrix Factorization In Recommender Systems
Matrix Factorization In Recommender Systems
YONG ZHENG
 
An Introduction to Spark with Scala
An Introduction to Spark with ScalaAn Introduction to Spark with Scala
An Introduction to Spark with Scala
Chetan Khatri
 
Optimization for iterative queries on Mapreduce
Optimization for iterative queries on MapreduceOptimization for iterative queries on Mapreduce
Optimization for iterative queries on Mapreduce
makoto onizuka
 
casestudy_important.pptx
casestudy_important.pptxcasestudy_important.pptx
casestudy_important.pptx
ssuser31398b
 
Building a Scalable Data Science Platform with R
Building a Scalable Data Science Platform with RBuilding a Scalable Data Science Platform with R
Building a Scalable Data Science Platform with R
DataWorks Summit/Hadoop Summit
 
ensembles_emptytemplate_v2
ensembles_emptytemplate_v2ensembles_emptytemplate_v2
ensembles_emptytemplate_v2Shrayes Ramesh
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with R
Revolution Analytics
 
The Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsThe Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore Environments
University of Washington
 
Leveraging R in Big Data of Mobile Ads (R在行動廣告大數據的應用)
Leveraging R in Big Data of Mobile Ads (R在行動廣告大數據的應用)Leveraging R in Big Data of Mobile Ads (R在行動廣告大數據的應用)
Leveraging R in Big Data of Mobile Ads (R在行動廣告大數據的應用)
Craig Chao
 
Predicting SPARQL query execution time and suggesting SPARQL queries based on...
Predicting SPARQL query execution time and suggesting SPARQL queries based on...Predicting SPARQL query execution time and suggesting SPARQL queries based on...
Predicting SPARQL query execution time and suggesting SPARQL queries based on...
Rakebul Hasan
 
Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014
Vijay Srinivas Agneeswaran, Ph.D
 
Start From A MapReduce Graph Pattern-recognize Algorithm
Start From A MapReduce Graph Pattern-recognize AlgorithmStart From A MapReduce Graph Pattern-recognize Algorithm
Start From A MapReduce Graph Pattern-recognize Algorithm
Yu Liu
 
Efficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/ReduceEfficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/Reduce
Spiros Oikonomakis
 
Efficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/ReduceEfficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/Reduce
Spiros Economakis
 
Landmark Retrieval & Recognition
Landmark Retrieval & RecognitionLandmark Retrieval & Recognition
Landmark Retrieval & Recognition
kenluck2001
 
Fossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatriFossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatri
Chetan Khatri
 
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
IRJET Journal
 
Silicon valleycodecamp2013
Silicon valleycodecamp2013Silicon valleycodecamp2013
Silicon valleycodecamp2013
Sanjeev Mishra
 
RichardPughspatial.ppt
RichardPughspatial.pptRichardPughspatial.ppt
RichardPughspatial.ppt
EnnerHereniodeAlcnta
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in production
Chetan Khatri
 

Accumulo Summit 2015: Rya: Optimizations to Support Real Time Graph Queries on Accumulo [Frameworks]

  • 1. Rya: Optimizations to Support Real Time Graph Queries on Accumulo
      Dr. Caleb Meier, Puja Valiyil, Aaron Mihalik, Dr. Adina Crainiceanu
      DISTRIBUTION STATEMENT A. Approved for public release; distribution is unlimited. ONR Case Number 43-279-15 JB.01.2015
  • 2. Acknowledgements
      - This work is the collective effort of:
          - Parsons’ Rya Team, sponsored by the Department of the Navy, Office of Naval Research
          - Rya Founders: Roshan Punnoose, Adina Crainiceanu, and David Rapp
  • 3. Overview
      - Rya Overview
      - Query Execution within Rya
      - Query Optimizations
      - Results
      - Summary
  • 4. Background: Rya and RDF
      - Rya: Resource Description Framework (RDF) Triplestore built on top of Accumulo
      - RDF: W3C standard for representing linked/graph data
      - Represents data as statements (assertions) about resources
          - Serialized as triples in {subject, predicate, object} form
          - Example: {Caleb, worksAt, Parsons} and {Caleb, livesIn, Virginia}
      - [Figure: graph in which node Caleb links to Parsons via worksAt and to Virginia via livesIn]
  • 5. Background: SPARQL
      - RDF Queries are described using SPARQL
          - SPARQL Protocol and RDF Query Language
      - SQL-like syntax for finding triples matching specific patterns
          - Look for subgraphs that match triple statement patterns
          - Joins are performed when there are variables common to two or more statement patterns
      - Example:
          SELECT ?people WHERE {
            ?people <worksAt> <Parsons>.
            ?people <livesIn> <Virginia>.
          }
  • 6. Rya Architecture
      - Open RDF Interface for interacting with RDF data stored on Accumulo
          - Open RDF (Sesame): Open Source Java framework for storing and querying RDF data
          - Open RDF provides several interfaces/abstractions central for interacting with an RDF datastore
              - SAIL interface for interacting with underlying persisted RDF model
              - SAIL: Storage And Inference Layer
      - [Figure: architecture diagram with labels: Data storage layer; Query processing in SAIL layer; SPARQL; Rya Open RDF; Rya QueryPlanner; Accumulo]
  • 7. Storage: Triple Table Index
      - 3 Tables
          - SPO: subject, predicate, object
          - POS: predicate, object, subject
          - OSP: object, subject, predicate
      - Store triples in the RowID of the table
      - Store graph name in the Column Family
      - Advantages:
          - Native lexicographical sorting of row keys, giving fast range queries
          - All patterns can be translated into a scan of one of these tables
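The claim on slide 7 that every triple pattern maps to a scan of one of the three tables can be illustrated with a small sketch. The function name `choose_index` and the exact tie-breaking rules are assumptions made for illustration; they are not taken from the Rya source.

```python
def choose_index(subject=None, predicate=None, obj=None):
    """Pick the table whose row-key order starts with the bound elements.

    None marks a variable. Returns (table_name, row_prefix); the prefix is
    what a range scan would use. Hypothetical helper, not Rya's planner.
    """
    s, p, o = subject is not None, predicate is not None, obj is not None
    if s and p and o:
        return "SPO", (subject, predicate, obj)
    if s and p:
        return "SPO", (subject, predicate)
    if p and o:
        return "POS", (predicate, obj)
    if o and s:
        return "OSP", (obj, subject)
    if s:
        return "SPO", (subject,)
    if p:
        return "POS", (predicate,)
    if o:
        return "OSP", (obj,)
    return "SPO", ()  # nothing bound: full scan of any table


# ?x <worksAt> <Parsons> has predicate and object bound, so it scans POS.
table, prefix = choose_index(predicate="worksAt", obj="Parsons")
```

Each of the seven bound/unbound combinations with at least one constant is a prefix of SPO, POS, or OSP, which is why three tables are enough.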
  • 8. Overview
      - Rya Overview
      - Query Execution within Rya
      - Query Optimizations
      - Results
      - Summary
  • 9. Rya Query Execution
      - Implemented OpenRDF Sesame SAIL API
      - Parse queries, generate initial query plan, execute plan
      - Triple patterns map to range queries in Accumulo
      - Query:
          SELECT ?x WHERE {
            ?x <worksAt> <Parsons>.
            ?x <livesIn> <Virginia>.
          }
      - Step 1: POS Table, scan range
          …
          worksAt, Netflix, Dan
          worksAt, OfficeMax, Zack
          worksAt, Parsons, Bob
          worksAt, Parsons, Greta
          worksAt, Parsons, John
          …
      - Step 2: for each ?x, SPO, index lookup
          … Bob, livesIn, Georgia … Greta, livesIn, Virginia … John, livesIn, Virginia …
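The two-step execution on slide 9 can be sketched with sorted in-memory lists standing in for the Accumulo tables. This is a minimal illustration, assuming tuples in place of row keys; `range_scan` and `who_works_and_lives` are hypothetical names, and the sample rows are the ones shown on the slide.

```python
from bisect import bisect_left


def range_scan(table, prefix):
    """Return every row of a sorted table whose leading elements equal prefix."""
    rows = []
    for row in table[bisect_left(table, prefix):]:
        if row[:len(prefix)] != prefix:
            break  # sorted order: no later row can match
        rows.append(row)
    return rows


triples = [
    ("Dan", "worksAt", "Netflix"), ("Zack", "worksAt", "OfficeMax"),
    ("Bob", "worksAt", "Parsons"), ("Greta", "worksAt", "Parsons"),
    ("John", "worksAt", "Parsons"), ("Bob", "livesIn", "Georgia"),
    ("Greta", "livesIn", "Virginia"), ("John", "livesIn", "Virginia"),
]
spo = sorted(triples)                              # subject, predicate, object
pos = sorted((p, o, s) for s, p, o in triples)     # predicate, object, subject


def who_works_and_lives(employer, state):
    results = []
    # Step 1: one range scan of POS binds ?x.
    for _, _, person in range_scan(pos, ("worksAt", employer)):
        # Step 2: one SPO lookup per binding of ?x.
        if range_scan(spo, (person, "livesIn", state)):
            results.append(person)
    return results


answer = who_works_and_lives("Parsons", "Virginia")  # ["Greta", "John"]
```

The number of Step 2 lookups equals the number of rows returned by Step 1, which is why the choice of the first pattern matters so much in the slides that follow.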
  • 10. More Complicated Example of Rya Query Execution
      - Query:
          SELECT ?x WHERE {
            ?x <worksAt> Parsons.
            ?x <livesIn> Virginia.
            ?x <commuteMethod> bike.
          }
      - Step 1: POS Table, scan range for worksAt, Parsons (pattern: ?x worksAt Parsons)
          …
          worksAt, Netflix, Dan
          worksAt, Parsons, Bob
          worksAt, Parsons, Greta
          worksAt, Parsons, John
          worksAt, PlayStation, Alice
          …
      - Step 2: For each ?x, SPO Table lookup (pattern: ?x livesIn Virginia)
          … Bob, livesIn, Georgia … Greta, livesIn, Virginia … John, livesIn, Virginia …
      - Step 3: For each remaining ?x, SPO Table lookup (pattern: ?x commuteMethod bike)
          … Greta, commuteMethod, bike … John, commuteMethod, Bus …
  • 11. Challenges in Query Execution
      - Scalability and Responsiveness
          - Massive amounts of data
          - Potentially large numbers of comparisons
      - Consider the Previous Example:
          - Default query execution: comparing each “?x” returned from the first statement pattern query to all subsequent triple patterns
          - There are 8.3 million Virginia residents, about 15,000 Parsons employees, and 750,000 people who commute via bike.
          - Only 100 people who work at Parsons commute via bike, while 1000 people who work at Parsons live in Virginia.
      - Poor query execution plans can result in simple queries taking minutes as opposed to milliseconds
      - Three orderings of the same query:
          SELECT ?x WHERE { ?x <livesIn> Virginia. ?x <worksAt> Parsons. ?x <commuteMethod> bike. }
          vs.
          SELECT ?x WHERE { ?x <worksAt> Parsons. ?x <livesIn> Virginia. ?x <commuteMethod> bike. }
          vs.
          SELECT ?x WHERE { ?x <worksAt> Parsons. ?x <commuteMethod> bike. ?x <livesIn> Virginia. }
  • 12. Overview
      - Rya Overview
      - Query Execution within Rya
      - Query Optimizations
      - Results
      - Summary
  • 13. Rya Query Optimizations
      - Goal: Optimize query execution (joins) to better support real time responsiveness
      - Three Approaches:
          - Reduce the number of joins: Pattern Based Indices
              - Pre-calculate common joins
          - Limit data in joins: Use more stats to improve query planning
              - Cardinality estimation on individual statement patterns
              - Join selectivity estimation on pairs of statement patterns
          - Make joins more efficient: Distribute the Join Processing
              - Distribute processing using SPARK SQL or MapReduce
              - Use Hash Joins and Intersecting Iterators
              - Work on this approach is just beginning
  • 14. Rya Query Optimizations Using Cardinalities
      - Goal: Optimize ordering of query execution to reduce the number of comparison operations
      - Order execution based on the number of triples that match each triple pattern
      - Query, annotated with match counts:
          SELECT ?x WHERE {
            ?x <worksAt> Parsons.        (15k matches)
            ?x <commuteMethod> bike.     (750k matches)
            ?x <livesIn> Virginia.       (8.3M matches)
          }
  • 15. Rya Cardinality Usage
      - Maintain cardinalities on the following triple pattern element combinations:
          - Single elements: Subject, Predicate, Object
          - Composite elements: Subject-Predicate, Subject-Object, Predicate-Object
      - Computed periodically using MapReduce
      - Row ID: <CardinalityType><TripleElements>
          - OBJECT, Parsons
          - PREDICATEOBJECT, worksAt, Parsons
      - Cardinality stored in the value
      - Sparse table: Only store cardinalities above a threshold
      - Only need to recompute cardinalities if the distribution of the data changes significantly
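Slides 14 and 15 together describe looking up a stored cardinality for each statement pattern and executing the smallest first. A minimal sketch follows, assuming a dictionary in place of the Accumulo cardinality table; the key layout mirrors the row IDs on slide 15, while `THRESHOLD`, `cardinality`, and `order_by_cardinality` are hypothetical names and values.

```python
# Stand-in for the cardinality table: row ID -> count (figures from the slides).
CARDINALITIES = {
    ("PREDICATEOBJECT", "worksAt", "Parsons"): 15_000,
    ("PREDICATEOBJECT", "commuteMethod", "bike"): 750_000,
    ("PREDICATEOBJECT", "livesIn", "Virginia"): 8_300_000,
}
THRESHOLD = 1_000  # assumed cutoff; the sparse table omits anything smaller

POSITION_NAMES = ("SUBJECT", "PREDICATE", "OBJECT")


def cardinality(pattern):
    """Look up the stored count for the constant elements of a pattern.

    Terms starting with '?' are variables. A missing entry means the count
    fell below the storage threshold, so the threshold is used as an estimate.
    """
    bound = [(name, term) for name, term in zip(POSITION_NAMES, pattern)
             if not term.startswith("?")]
    row_id = ("".join(name for name, _ in bound), *(term for _, term in bound))
    return CARDINALITIES.get(row_id, THRESHOLD)


def order_by_cardinality(patterns):
    """Execute the most selective statement pattern first."""
    return sorted(patterns, key=cardinality)


query = [
    ("?x", "livesIn", "Virginia"),
    ("?x", "worksAt", "Parsons"),
    ("?x", "commuteMethod", "bike"),
]
plan = order_by_cardinality(query)
```

Starting from `worksAt Parsons` means roughly 15,000 follow-up lookups instead of 8.3 million.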
  • 16. Limitations of Cardinality Approach
      - Consider a more complicated query
      - Cardinality approach does not take into account the number of results returned by joins
      - Solution lies in estimating the “join selectivity” for each pair of triple patterns
      - Query, annotated with match counts:
          SELECT ?x WHERE {
            ?x <worksAt> Parsons.          (15k matches)
            ?x <commuteMethod> bike.       (750k matches)
            ?vehicle <vehicleType> SUV.    (2.1M matches)
            ?x <livesIn> Virginia.         (8.3M matches)
            ?x <owns> ?vehicle.            (254M matches)
          }
  • 17. Rya Query Optimizations Using Join Selectivity
      - Query optimized using only Cardinality Info:
          SELECT ?x WHERE {
            ?x <worksAt> Parsons.
            ?x <commuteMethod> bike.
            ?vehicle <vehicleType> SUV.
            ?x <livesIn> Virginia.
            ?x <owns> ?vehicle.
          }
      - Query optimized using Cardinality and Join Selectivity Info:
          SELECT ?x WHERE {
            ?x <worksAt> Parsons.
            ?x <commuteMethod> bike.
            ?x <livesIn> Virginia.
            ?x <owns> ?vehicle.
            ?vehicle <vehicleType> SUV.
          }
      - Join Selectivity measures number of results returned by joining two triple patterns
      - Approach taken from: RDF-3X: a RISC-style Engine for RDF by Thomas Neumann and Gerhard Weikum in JDMR (formerly Proc. VLDB) 2008
      - Due to computational complexity, estimate of join selectivity for triple patterns is pre-computed and stored in Accumulo
      - Join selectivity estimated by computing the number of results obtained when each triple pattern is joined with the full table
  • 18. Join Selectivity: General Algorithm
      - For statement patterns <?x, p1, o1> and <?x, p2, o2> with ?x a variable and p1, o1, p2, o2 constant, estimate the number of results
      - Sel(<?x, p1, o1> ⋈ <?x, ?y, ?z>) and Sel(<?x, p2, o2> ⋈ <?x, ?y, ?z>) give the number of results returned by joining a statement pattern with the full table along the subject component
      - Full table join statistics precomputed and stored in index
      - Join statistics for each triple pattern computed using following equation: [equation image not captured in this transcript]
      - Use analogous definition if variables appear in predicate or object position
      - Join selectivity statistics used with cardinalities to generate more efficient query plans
  • 19. Join Selectivity: Integration into Rya
      - Join Selectivity estimates used to optimize Rya queries through a greedy algorithm approach
      - Query constructed starting with the first triple pattern to be evaluated (the pattern with the smallest cardinality), and then patterns are added based on minimization of a cost function
      - Cost function
          - C = leftCard + rightCard + leftCard*rightCard*selectivity
          - C measures the number of entries Accumulo must scan and the number of comparisons required to perform the join
          - Selectivity set to one if two triple patterns share no common variables, otherwise precomputed estimates used
          - Ensures that patterns with common variables are grouped together
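The greedy planning loop on slide 19 can be sketched as follows. The cost function is the one on the slide; everything else is an assumption made for illustration: the function names, the use of one selectivity number per pattern, the selectivity values themselves, and the rule that the estimated size of the running result (leftCard*rightCard*selectivity) becomes the next leftCard. The cardinalities are the slide 16 figures.

```python
def variables(pattern):
    return {term for term in pattern if term.startswith("?")}


def join_cost(left_card, right_card, selectivity):
    # Cost function from slide 19.
    return left_card + right_card + left_card * right_card * selectivity


def effective_selectivity(pattern, bound_vars, sel):
    # Selectivity is 1 when the pattern shares no variable with the plan so far.
    return sel[pattern] if variables(pattern) & bound_vars else 1.0


def greedy_order(patterns, card, sel):
    remaining = list(patterns)
    first = min(remaining, key=lambda p: card[p])   # smallest cardinality first
    remaining.remove(first)
    plan, bound_vars, left_card = [first], variables(first), card[first]
    while remaining:
        best = min(remaining, key=lambda p: join_cost(
            left_card, card[p], effective_selectivity(p, bound_vars, sel)))
        # Assumed: the estimated join result size feeds the next iteration.
        left_card = left_card * card[best] * effective_selectivity(best, bound_vars, sel)
        bound_vars |= variables(best)
        plan.append(best)
        remaining.remove(best)
    return plan


WORKS = ("?x", "worksAt", "Parsons")
BIKE = ("?x", "commuteMethod", "bike")
SUV = ("?vehicle", "vehicleType", "SUV")
LIVES = ("?x", "livesIn", "Virginia")
OWNS = ("?x", "owns", "?vehicle")

card = {WORKS: 15_000, BIKE: 750_000, SUV: 2_100_000, LIVES: 8_300_000, OWNS: 254_000_000}
sel = {WORKS: 1e-6, BIKE: 1e-6, SUV: 1e-7, LIVES: 1e-7, OWNS: 1e-8}  # made-up values

plan = greedy_order([WORKS, BIKE, SUV, LIVES, OWNS], card, sel)
```

With these inputs the SUV pattern is deferred until `?vehicle` has been bound by the `owns` pattern, because joining it earlier would be costed as a full cross product. That matches the reordering shown on slide 17.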
  • 20. Construction of Selectivity Tables
      - For the pattern <?x, p1, o1>, associate each RDF triple of the form <c, p1, o1> with the cardinality |<c,?y,?z>| and then sum the results
      - Given a triple <c, p1, o1> in the SPO table, Map Job 1 emits the key-value pair (c, (p1, o1))
      - Map Job 2 processes the cardinality table and emits the key-value pair (c, |<c,?y,?z>|), which consists of the constant c and its single-component subject cardinality for the table
      - Map Job 3 merges the results from jobs 1 and 2 by emitting the key-value pair ((p1, o1), |<c,?y,?z>|)
      - Map Job 4 sums the cardinalities from those key-value pairs containing (p1, o1) as a key, and the result is written to the selectivity table
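The four jobs on slide 20 compute, for each (p1, o1), the sum of subject cardinalities over all matching triples. A minimal in-memory sketch of the same computation is shown below, assuming a small list of triples in place of the SPO table and counting subject cardinalities directly rather than reading them from the cardinality table; `build_selectivity_table` is a hypothetical name.

```python
from collections import Counter, defaultdict


def build_selectivity_table(triples):
    """Map (predicate, object) -> sum of |<c, ?y, ?z>| over triples <c, p, o>."""
    # Job 2 equivalent: subject cardinality |<c, ?y, ?z>| for every constant c.
    subject_card = Counter(s for s, _, _ in triples)
    table = defaultdict(int)
    for s, p, o in triples:
        # Job 1 emits (c, (p, o)); job 3 attaches c's cardinality to (p, o);
        # job 4 sums per (p, o). Here the three steps collapse into one line.
        table[(p, o)] += subject_card[s]
    return dict(table)


triples = [
    ("Bob", "worksAt", "Parsons"),
    ("Bob", "livesIn", "Georgia"),
    ("Greta", "worksAt", "Parsons"),
    ("Greta", "livesIn", "Virginia"),
    ("Greta", "commuteMethod", "bike"),
]
selectivity_table = build_selectivity_table(triples)
```

Here Bob is the subject of 2 triples and Greta of 3, so joining `<?x, worksAt, Parsons>` with the full table on the subject yields 2 + 3 = 5 results.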
  • 21. Query Optimizations Using Pre-Computed Joins
      - Reduce joins by pre-computing common joins
      - Approach taken from: Heese, Ralf, et al. "Index Support for SPARQL." European Semantic Web Conference, Innsbruck, Austria. 2007.
      - Query:
          SELECT ?x WHERE {
            ?x <worksAt> Parsons.
            ?x <commuteMethod> bike.
            ?x <livesIn> Virginia.
            ?x <owns> ?vehicle.
            ?vehicle <vehicleType> SUV.
          }
      - Pre-compute using batch processing and look up during query execution
  • 22. Query Optimizations Using Pre-Computed Joins
      - Query:
          SELECT ?x WHERE {
            ?x <worksAt> Parsons.
            ?x <commuteMethod> bike.
            ?x <livesIn> Virginia.
            ?x <owns> ?vehicle.
            ?vehicle <vehicleType> SUV.
          }
      - Stored SPARQL:
          SELECT ?person ?car WHERE {
            ?person <livesIn> Virginia.
            ?person <owns> ?car.
            ?car <vehicleType> SUV.
          }
      - Index Result Table:
          …
          Aaron, ToyotaRav4
          Caleb, JeepCherokee
          Puja, HondaCRV
          …
      - Steps:
          1. Pre-compute a portion of the query using MapReduce
          2. Store SPARQL describing the query along with pre-computed values in Accumulo
          3. Normalize query variables to match stored SPARQL variables during query execution
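Step 3 on slide 22, matching an incoming query against the stored SPARQL despite different variable names, can be sketched as a small backtracking search. This is an illustrative approach under stated assumptions, not Rya's implementation: patterns are plain tuples, `match_index` is a hypothetical name, and the search simply tries to map every stored pattern onto a distinct query pattern with a consistent, one-to-one variable renaming.

```python
def is_var(term):
    return term.startswith("?")


def extend(mapping, index_pattern, query_pattern):
    """Try to extend a variable mapping so index_pattern equals query_pattern."""
    new = dict(mapping)
    for a, b in zip(index_pattern, query_pattern):
        if is_var(a):
            if not is_var(b) or new.setdefault(a, b) != b:
                return None          # variable vs constant, or conflicting binding
        elif a != b:
            return None              # constants differ
    if len(set(new.values())) != len(new):
        return None                  # two index variables mapped to one query variable
    return new


def match_index(index_patterns, query_patterns):
    """Return {index variable: query variable} if the index covers part of the query."""
    def search(i, mapping, used):
        if i == len(index_patterns):
            return mapping
        for j, query_pattern in enumerate(query_patterns):
            if j in used:
                continue
            new = extend(mapping, index_patterns[i], query_pattern)
            if new is not None:
                found = search(i + 1, new, used | {j})
                if found is not None:
                    return found
        return None
    return search(0, {}, frozenset())


stored = [("?person", "livesIn", "Virginia"), ("?person", "owns", "?car"),
          ("?car", "vehicleType", "SUV")]
query = [("?x", "worksAt", "Parsons"), ("?x", "commuteMethod", "bike"),
         ("?x", "livesIn", "Virginia"), ("?x", "owns", "?vehicle"),
         ("?vehicle", "vehicleType", "SUV")]
renaming = match_index(stored, query)
```

Once a renaming is found, the three matched patterns can be replaced by a lookup in the index result table, leaving only the `worksAt` and `commuteMethod` patterns to be joined at query time.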
  • 23. Overview
      - Rya Overview
      - Query Execution within Rya
      - Query Optimizations
      - Results
      - Summary
  • 24. Query Optimization Results
      - Ran 14 queries against the Lehigh University Benchmark (LUBM) dataset (33.34 million triples)
      - LUBM queries 2, 5, 9, and 13 were discarded after 3 runs due to query complexity
          - Remaining queries were executed 12 times
      - Cluster Specs:
          - 8 worker nodes, each has 2 x 6-Core Xeon E5-2440 (2.4GHz) Processors and 48 GB RAM
      - Results indicate that cardinality and join selectivity optimizations provide improved or comparable performance
  • 25. Summary
      - Cardinality estimation and join selectivity can improve query response times for ad hoc queries
          - Effects of join selectivity are more apparent for complex queries over large datasets
      - Pre-computed joins are extremely useful for optimizing common queries
          - Potentially avoid large number of join operations
          - Maintaining pre-computed join indices is difficult
  • 28. Useful Links
      - SPARQL
          - http://www.w3.org/TR/rdf-sparql-query/
          - http://jena.apache.org/tutorials/sparql.html
      - RDF
          - http://www.w3.org/TR/2014/NOTE-rdf11-primer-20140225/
      - Rya
          - https://github.com/LAS-NCSU/rya
              - Source on github: Provides documentation and sample client code
              - Email Aaron Mihalik (aaron.mihalik@parsons.com) for access (US Citizens only)
      - Rya Working Group
          - Monthly telecon / update on progress, issues, upcoming features
          - Email Puja Valiyil (puja.valiyil@parsons.com) to join (US Citizens only)
      - Open RDF Tutorial: http://openrdf.callimachus.net/sesame/tutorials/getting-started.docbook?view
      - Open RDF Javadoc: http://openrdf.callimachus.net/sesame/2.7/apidocs/index.html
      - Punnoose R., Crainiceanu A., Rapp D. 2012. Rya: a scalable RDF triple store for the clouds. Proceedings of the 1st International Workshop on Cloud Intelligence. http://www.usna.edu/Users/cs/adina/research/Rya_CloudI2012.pdf
      - Roshan Punnoose, Adina Crainiceanu, David Rapp. SPARQL in the Clouds Using Rya. Information Systems Journal (2013). http://www.usna.edu/Users/cs/adina/research/Rya_ISjournal2013.pdf
  • 29. Next Steps
      - Maintaining pre-computed join indices
      - Dynamically determining potential pre-computed joins
      - Distributing query planning and execution
          - SPARK SQL
      - Rya backed by other datastores
      - Fully open sourcing Rya
  • 30. Sample LUBM Queries (1 of 3)
      - Query 1
          PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
          PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
          SELECT ?X WHERE { GRAPH <http://LUBM> {
            ?X rdf:type ub:GraduateStudent .
            ?X ub:takesCourse <http://www.Department0.University0.edu/GraduateCourse0>
          } }
      - Query 3
          PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
          PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
          SELECT ?X WHERE { GRAPH <http://LUBM> {
            ?X rdf:type ub:Publication .
            ?X ub:publicationAuthor <http://www.Department0.University0.edu/AssistantProfessor0>
          } }
  • 31. Sample LUBM Queries (2 of 3)
      - Query 7
          PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
          PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
          SELECT ?X ?Y WHERE { GRAPH <http://LUBM> {
            ?X rdf:type ub:Student .
            ?Y rdf:type ub:Course .
            ?X ub:takesCourse ?Y .
            <http://www.Department0.University0.edu/AssociateProfessor0> ub:teacherOf ?Y
          } }
      - Query 8
          PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
          PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
          SELECT ?X ?Y ?Z WHERE { GRAPH <http://LUBM> {
            ?X rdf:type ub:Student .
            ?Y rdf:type ub:Department .
            ?X ub:memberOf ?Y .
            ?Y ub:subOrganizationOf <http://www.University0.edu> .
            ?X ub:emailAddress ?Z
          } }
  • 32. Sample LUBM Queries (3 of 3)
      - Query 9
          PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
          PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
          SELECT ?X ?Y ?Z WHERE { GRAPH <http://LUBM> {
            ?X rdf:type ub:Student .
            ?Y rdf:type ub:Faculty .
            ?Z rdf:type ub:Course .
            ?X ub:advisor ?Y .
            ?Y ub:teacherOf ?Z .
            ?X ub:takesCourse ?Z
          } }
      - Query 11
          PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
          PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
          SELECT ?X WHERE { GRAPH <http://LUBM> {
            ?X rdf:type ub:ResearchGroup .
            ?X ub:subOrganizationOf <http://www.University0.edu>
          } }

Editor's Notes

  1. Abstract: The Resource Description Framework (RDF) is a standard model for expressing graph data for the World Wide Web. Developed by the W3C, RDF and related technologies such as OWL and SKOS provide a rich vocabulary for exchanging graph data in a machine understandable manner. As the size of available data continues to grow, there has been an increased desire for methods of storing very large RDF graphs within big data architectures. Rya is a government open source scalable RDF triple store built on top of Apache Accumulo. Originally developed by the Laboratory for Telecommunication Sciences and US Naval Academy, Rya is currently being used by a number of government agencies for storing, inferencing, and querying large amounts of RDF data. As Rya’s user base has grown, there has been a stronger requirement for near real time query responsiveness over massive RDF graphs. In this talk, we detail several query optimization strategies the Rya team has pursued to better satisfy this requirement. We describe recent work allowing for the use of additional indices to eliminate large common joins within complex SPARQL queries. Additionally, we explain a number of statistics based optimizations to improve query planning. Specifically, we detail extensions to existing methods of estimating the selectivity of individual statement patterns (cardinality) and the selectivity of joining two statement patterns (join selectivity) to better fit a “big data” paradigm and utilize Accumulo. Finally, we share preliminary performance evaluation results for the optimizations that have been pursued.
     Speaker: Dr. Caleb Meier, Engineer/Algorithm Developer, Parsons Corporation. Dr. Meier received a PhD from the University of California San Diego (UCSD) in Mathematics in 2012. For the past two years, he was a postdoctoral fellow at UCSD's Math department specializing in non-linear elliptic systems of partial differential equations. He received his undergraduate degree in Mathematics from Yale University in 2006. Dr. Meier is currently working as an engineer at Parsons Corporation, specializing in query optimization algorithms for large scale RDF graphs. He is an expert in semantic technologies, Accumulo, the Hadoop Ecosystem, and is actually more fun to be around than his bio suggests.
     Schedule: 2:45-3:20 on April 29, 2015
  2. Find all US citizens that travel to Iran
  3. Triple patterns containing no common variables can be joined together, creating a Cartesian (cross) product. Among triple patterns with similar cardinalities and common variables, how should they be joined to obtain the best execution plan?
  4. Term “Pattern Based Index” taken from: Heese, Ralf, et al. "Index support for sparql." European Semantic Web Conference, Innsbruck, Austria. 2007.
     Issues:
       - Query planning is difficult
       - Potentially exponential increase in index size
       - Maintaining an external index