Accumulo Summit 2016: Accumulo Indexing Strategies for Searching Semantic Networks


The rapidly increasing amount of semantic network data today provides a wealth of insight into how entities interact and relate with one another. To tap into this valuable source of information, organizations require a secure and scalable repository in which to store and explore these interactions and relationships. In this talk we will discuss Apache Rya, an Accumulo-based graph store capable of storing billions of Resource Description Framework (RDF) triples and providing a rich SPARQL query endpoint for exploring complex subgraph relationships. We will talk about two indexing strategies that Rya uses to address some of the challenges associated with storing and querying large graph datasets. In particular, we will discuss how our SPARQL (SPARQL Protocol and RDF Query Language) query caching framework allows users to greatly improve query performance by storing and incrementally maintaining query results using Apache Fluo. We will also discuss our Accumulo-based entity-centric index. Inspired by Facebook’s horizontally partitioned graph index, Unicorn, Apache Rya’s entity-centric index is a novel way of storing graphs in Accumulo that draws on document-partitioned indexing techniques. This graph partitioning and indexing strategy limits network traffic and enables distributed join processing by utilizing a variation of Accumulo’s IntersectingIterator framework to perform joins server side.

The work presented herein was funded by the Office of Naval Research, under contract # N00014-12-C-0365, supporting this effort.

– Speaker – 

Dr. Caleb Meier
Software Engineer, Parsons

Caleb Meier has been a Software Engineer at Parsons Government Services for the last two years. Since joining Parsons, he has investigated and implemented a number of features to improve the query performance of Apache Rya. Caleb earned his Ph.D. in Mathematics from the University of California, San Diego and a B.A. in Mathematics from Yale University. In his spare time he enjoys climbing, biking, playing soccer and spending time with his delightful wife Leslie.

— More Information —

For more information see http://www.accumulosummit.com/

Published in: Data & Analytics

  1. Rya: Accumulo Indexing Strategies for Searching Semantic Networks
     Dr. Caleb Meier, Puja Valiyil, David Lotts, Aaron Mihalik, Dr. Adina Crainiceanu
     DISTRIBUTION STATEMENT A. Approved for public release; distribution is unlimited. ONR Case Number 43-2117-16 EXIM APPROVED Parsons #459 8 OCT 16
  2. Acknowledgements
     • The work presented herein was funded by the Office of Naval Research (ONR) and the National Geospatial-Intelligence Agency (NGA) under contract # N00014-12-C-0365
     • This presentation was sponsored by Parsons
     • This work is the collective effort of:
       - Parsons’ Rya Team: Puja Valiyil, Aaron Mihalik, Caleb Meier, David Lotts, Jennifer Brown
       - Rya Founders: Roshan Punnoose, Adina Crainiceanu, and David Rapp
  3. Agenda
     • Rya Background
     • Materialized Views in Rya
     • Entity-Centric Index
  4. Agenda
     • Rya Background
     • Materialized Views in Rya
     • Entity-Centric Index
  5. Rya and RDF Background
     • Apache Rya (incubating): Resource Description Framework (RDF) triplestore built on top of Accumulo or MongoDB
     • RDF: W3C standard for representing linked/graph data
       - Represents data as statements (assertions) about resources
       - Serialized as triples in {subject, predicate, object} form
       - Example:
           {Caleb, worksAt, Parsons}
           {Caleb, livesIn, Virginia}
     • Figure: graph in which the node Caleb is linked to Parsons by the edge worksAt and to Virginia by the edge livesIn
  6. SPARQL Background
     • RDF queries are described using SPARQL
       - SPARQL Protocol and RDF Query Language
     • SQL-like syntax for finding triples matching specific patterns
       - Look for subgraphs that match triple statement patterns
       - Joins are performed when there are variables common to two or more statement patterns
     • Example query:
         SELECT ?people WHERE {
           ?people <worksAt> <Parsons>.
           ?people <livesIn> <Virginia>.
         }
  7. Rya Architecture (Background)
     • RDF4J* interface for interacting with RDF data stored on Accumulo
       - RDF4J: open source Java framework for storing and querying RDF data
       - RDF4J provides several interfaces/abstractions central for interacting with an RDF datastore
       - SAIL interface for interacting with the underlying persisted RDF model
       - SAIL: Storage And Inference Layer
     • Figure: SPARQL enters through Rya and RDF4J; query processing happens in the SAIL layer (Rya QueryPlanner); Accumulo is the data storage layer
     *The RDF4J.org project was previously named “Open RDF”, then “Sesame”.
  8. Storage: Triple Table Index (Accumulo Composite Index and Table Design)
     • 3 tables
       - SPO: subject, predicate, object
       - POS: predicate, object, subject
       - OSP: object, subject, predicate
     • Store triples in the Row ID of the table
     • Store graph name (context) in the Column Family
     • Advantages:
       - Native lexicographical sorting of row keys gives fast range queries
       - All patterns can be translated into a scan of one of these tables
     • Key for the SPO table:
         Row ID            | subject, predicate, object, type
         Column Family     | graph name
         Column Qualifier  | (not used)
         Column Visibility | visibility
         Timestamp         | timestamp
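The claim that every statement pattern maps to a scan of one of the three tables can be illustrated with a small sketch. This is not Rya's code: the `DELIM` separator, the function names, and the omission of the type suffix in the row are assumptions made for illustration. A pattern term of `None` stands for a variable.

```python
from typing import Dict, Optional, Tuple

DELIM = "\x00"  # hypothetical separator between row components

# Position of (subject, predicate, object) terms in each table's row.
ORDERS = {"SPO": (0, 1, 2), "POS": (1, 2, 0), "OSP": (2, 0, 1)}

Pattern = Tuple[Optional[str], Optional[str], Optional[str]]


def index_rows(s: str, p: str, o: str) -> Dict[str, str]:
    """Build the row written to each table for one triple."""
    triple = (s, p, o)
    return {t: DELIM.join(triple[i] for i in order) for t, order in ORDERS.items()}


def choose_table(pattern: Pattern) -> Tuple[str, str]:
    """Pick the table whose row order puts all bound terms first.

    Returns (table name, row prefix to use for the range scan).
    """
    n_bound = sum(term is not None for term in pattern)
    for table, order in ORDERS.items():
        prefix = []
        for i in order:
            if pattern[i] is None:
                break
            prefix.append(pattern[i])
        if len(prefix) == n_bound:  # every bound term is part of the prefix
            return table, DELIM.join(prefix)
    raise AssertionError("unreachable: one of the three orders always fits")
```

For example, `choose_table((None, "worksAt", None))` selects the POS table with the prefix `worksAt`, and a pattern with no bound terms falls back to a full scan of SPO with an empty prefix.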
  9. Anatomy of a Rya SPARQL Query (Query Planning)
     Query:
         SELECT ?x, ?y WHERE {
           ?x <worksAt> ?y.
           ?x <livesIn> Virginia.
           ?x <commuteMethod> bike.
         }
     • Step 1: POS table, scan range for worksAt (?x worksAt ?y)
         Predicate, Object, Subject
         …
         studiesAt, Joe, Georgetown
         talksTo, Joe, Bob
         worksAt, Netflix, Bob
         worksAt, Parsons, Greta
         worksAt, PlayStation, John
     • Step 2: for each ?x, SPO table lookup (?x livesIn Virginia)
         Subject, Predicate, Object
         …
         Bob, livesIn, California
         …
         Greta, livesIn, Virginia
         …
         John, livesIn, Virginia
         …
     • Step 3: for each ?x, SPO table lookup (?x commuteMethod bike)
         Subject, Predicate, Object
         …
         Greta, commuteMethod, bike
         …
         John, commuteMethod, Bus
         …
  10. Costly Joins of Accumulo Scans (Query Evaluation)
     • Out of the box, Rya SPARQL query evaluation requires joining Accumulo scan results client side
     • Joins are performed using a hybrid of nested loop and hash joins
       - Sorted nested loop: range prefixes for scans are determined by the results of the previous scan
       - Hash joins: results of a scan are joined with any values not used to form the range prefix
     • BatchScanner boosts performance
       - Evaluates results in batches
       - Client-side join evaluation is still a bottleneck, especially for queries with:
           large intermediate join results
           a large number of joins
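The nested-loop part of this strategy, where the bindings from one scan fix the range of the next, can be sketched over an in-memory triple list. This is an illustration only: `scan` is a stand-in for an Accumulo range scan, all names are hypothetical, and the hash-join step and batching are left out. The data is taken from the query-planning slide.

```python
from typing import Dict, Iterator, List, Optional, Sequence, Tuple

Triple = Tuple[str, str, str]
Pattern = Tuple[str, str, str]  # terms starting with "?" are variables
Binding = Dict[str, str]

TRIPLES: List[Triple] = [
    ("Bob", "worksAt", "Netflix"), ("Greta", "worksAt", "Parsons"),
    ("John", "worksAt", "PlayStation"), ("Bob", "livesIn", "California"),
    ("Greta", "livesIn", "Virginia"), ("John", "livesIn", "Virginia"),
    ("Greta", "commuteMethod", "bike"), ("John", "commuteMethod", "Bus"),
]


def scan(triples: Sequence[Triple], bound: Sequence[Optional[str]]) -> Iterator[Triple]:
    """Stand-in for a range scan: yield triples that agree with the bound terms."""
    for triple in triples:
        if all(b is None or b == v for b, v in zip(bound, triple)):
            yield triple


def extend(binding: Binding, pattern: Pattern, triple: Triple) -> Optional[Binding]:
    """Add the variable values from one matching triple, or None on a conflict."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?") and new.setdefault(term, value) != value:
            return None
    return new


def evaluate(triples: Sequence[Triple], patterns: Sequence[Pattern]) -> List[Binding]:
    """Client-side nested-loop join: one scan per binding per pattern."""
    solutions: List[Binding] = [{}]
    for pattern in patterns:
        next_solutions: List[Binding] = []
        for binding in solutions:
            # Values bound by earlier scans narrow the range of this scan.
            bound = [binding.get(t) if t.startswith("?") else t for t in pattern]
            for triple in scan(triples, bound):
                new = extend(binding, pattern, triple)
                if new is not None:
                    next_solutions.append(new)
        solutions = next_solutions
    return solutions
```

Evaluating the three patterns `?x worksAt ?y`, `?x livesIn Virginia`, `?x commuteMethod bike` issues one lookup per intermediate binding, which is why large intermediate results make the client-side join expensive.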
  11. Indexing Strategies for Dealing With Costly Joins (Query Evaluation)
     • Materialized Views in Rya: index frequently issued queries or frequent subgraphs
       1. Pre-compute a portion of the query
       2. Store SPARQL describing the query along with pre-computed values in Accumulo
       3. Normalize query variables to match stored SPARQL variables during query execution
     • Entity-Centric Index: add new indices to eliminate the need for hash joins
       1. Apply Document Partitioned Indexing to graph data
          - Design tables so that all properties for each entity appear on a single tablet
       2. Find entities with intersecting properties
          - Use a variation of an Intersecting Iterator
          - Perform merge joins on the server
     • Additional indexing strategies could eliminate the need for client-side joins
       - Clear trade-off between query performance and memory footprint
       - Mitigate “data plume”
  12. Agenda
     • Rya Background
     • Materialized Views in Rya
     • Entity-Centric Index
  13. Determining which Queries to Cache (Materialized Views in Rya)
     • Determine which query results to cache given that capacity is limited
     • Caching strategies
       - Preference: manually submit queries to cache
       - Recent use
       - Complexity: number of joins, estimates of intermediate results
       - Relevance
           Relevance to data (frequent subgraphs)
           Relevance to users (query logs)
     • A comprehensive strategy should use all of the criteria above
     • Terms: Materialized Views and Precomputed Joins
       - A Materialized View is a cache of query results
       - Precomputed Joins in Rya are Materialized Views for SPARQL queries consisting of joins and filters
  14. Query Planning: Anatomy of a Rya SPARQL Query using a Precomputed Join (Materialized Views in Rya)
     • During query planning, statement patterns in a query are grouped according to common variables and common constants
     • Groups can be precomputed and cached, trading faster response times for increased storage
     • Query:
         select ?x ?y where {
           ?y :livesIn :DC.
           ?y :talksTo ?x.
           ?x :livesIn :Arlington.
           ?x :commutesBy :Bike.}
     • Pre-Computed Join Index table for the stored query
         Select ?a, ?b where { ?a :livesIn :DC . ?a :talksTo ?b .}
       with rows:
         ?a=Joe, ?b=Caleb
         ?a=Mike, ?b=Dave
         …
         ?a=Rob, ?b=Aaron
     • Figure: join tree over the four statement patterns, in which the patterns matching the stored query are served by a Pre-Computed Join Index node
  15. Batch Updates (Strategies for Maintaining Materialized Views)
     • Batch update
       - Re-compute precomputed join tables periodically using MapReduce
       - Benefit:
           Minimizes data plume
       - Drawback:
           Large possibility of stale data
           Query plans that use precomputed joins may contain inaccurate results
     • Incrementally update tables as triples are ingested
       - Use some sort of observer framework to update intermediate results as new triples are ingested
       - Benefit:
           No staleness in query results
           Query plans that use precomputed joins are more accurate
           Significantly less latency for updates
       - Drawbacks:
           Data plume
           Intermediate query results have to be stored to incrementally update results
           Observer framework increases the complexity of the system
  16. Incremental Transactions Using Apache Fluo (incubating) (Fluo Background)
     • Address the maintenance problem by incrementally updating cached results using Fluo
       - Fluo was created for Accumulo based on the Percolator paper [1] by Google Inc.
       - Fluo provides additional features that Accumulo does not:
           Multi-row transactions prevent write-write conflicts
           Observer framework (next slide)
     • Use cases
       - Maintain large-scale computation using a series of small transaction updates
       - Join an existing large data cache with new data
       - Formerly done by periodic batch processing jobs recreating the data cache
     [1] Daniel Peng, Frank Dabek. USENIX. 2010. Large-scale Incremental Processing Using Distributed Transactions and Notifications. http://research.google.com/pubs/pub36726.html
  17. Creating an Application using the Observer Framework (Fluo Background)
     • Overview of the Fluo Observer Framework
       - Observers monitor a given Fluo table column
       - When the observed column is updated, a notification is triggered which tells the observer to perform a transaction
       - The transaction is specified by the implementation of the observer’s process method
           Takes in a transaction object, row and column
           Uses the data to perform an action such as writing to another column in the table
     • Perform complex incremental updates by chaining observers
       - Decompose the problem into a collection of interacting observers
       - Observers can write notifications that trigger other observers
  18. Formulating a SPARQL Query as a Chain of Observers (Maintaining Precomputed Joins)
     SPARQL query:
         SELECT ?patron ?employee ?business WHERE {
           ?patron <http://talksTo> ?employee.
           ?employee <http://worksAt> ?business }
     • Statement Pattern Observer: ?patron <http://talksTo> ?employee
       - Input: streamed statements
       - Output: {?patron, ?employee} (pairs of people who talk to each other)
     • Statement Pattern Observer: ?employee <http://worksAt> ?business
       - Input: streamed statements
       - Output: {?employee, ?business} (where people work)
     • Join Observer
       - Output: {?patron, ?employee, ?business} (people who talk to employees and where that employee works)
  19. Incrementally Creating Results Using Query Observers (1 of 2) (Maintaining Precomputed Joins)
     • Streamed statements (sent to both Statement Pattern Observers):
         {Alice, talksTo, Bob}, {Charlie, talksTo, David}, {Bob, worksAt, CoffeeShop}, {Eve, worksAt, PizzaPlace}
     • Statement Pattern Observer: ?patron <http://talksTo> ?employee
         Output: {patron=Alice, employee=Bob}, {patron=Charlie, employee=David}
     • Statement Pattern Observer: ?employee <http://worksAt> ?business
         Output: {employee=Bob, business=CoffeeShop}, {employee=Eve, business=PizzaPlace}
     • Join Observer
         Output: {patron=Alice, employee=Bob, business=CoffeeShop}
  20. Incrementally Creating Results Using Query Observers (2 of 2) (Maintaining Precomputed Joins)
     • New streamed statement (sent to both Statement Pattern Observers):
         {David, worksAt, CoffeeShop}
     • Statement Pattern Observer: ?patron <http://talksTo> ?employee
         Output: {patron=Alice, employee=Bob}, {patron=Charlie, employee=David}
     • Statement Pattern Observer: ?employee <http://worksAt> ?business
         Output: {employee=Bob, business=CoffeeShop}, {employee=Eve, business=PizzaPlace}, {employee=David, business=CoffeeShop}
     • Join Observer
         Output: {patron=Alice, employee=Bob, business=CoffeeShop}, {patron=Charlie, employee=David, business=CoffeeShop}
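The two slides above walk through an incremental join. A minimal in-memory sketch of the same idea follows; it does not use the Fluo API, and the class and function names are hypothetical. Each statement pattern side keeps its intermediate results keyed by the join variable `?employee`, so a new statement only has to be joined against the stored results of the other side.

```python
from collections import defaultdict
from typing import DefaultDict, Set, Tuple

Result = Tuple[str, str, str]  # (patron, employee, business)


class JoinObserver:
    """Keeps both join inputs keyed by ?employee and emits only new results."""

    def __init__(self) -> None:
        self.patrons: DefaultDict[str, Set[str]] = defaultdict(set)     # employee -> patrons
        self.businesses: DefaultDict[str, Set[str]] = defaultdict(set)  # employee -> businesses
        self.results: Set[Result] = set()

    def _emit(self, candidates: Set[Result]) -> Set[Result]:
        new = candidates - self.results
        self.results |= new
        return new

    def on_talks_to(self, patron: str, employee: str) -> Set[Result]:
        self.patrons[employee].add(patron)
        return self._emit({(patron, employee, b) for b in self.businesses[employee]})

    def on_works_at(self, employee: str, business: str) -> Set[Result]:
        self.businesses[employee].add(business)
        return self._emit({(p, employee, business) for p in self.patrons[employee]})


def ingest(observer: JoinObserver, triple: Tuple[str, str, str]) -> Set[Result]:
    """Play the role of the statement pattern observers: route by predicate."""
    subj, pred, obj = triple
    if pred == "talksTo":
        return observer.on_talks_to(subj, obj)
    if pred == "worksAt":
        return observer.on_works_at(subj, obj)
    return set()
```

Feeding the four statements from slide 19 and then `("David", "worksAt", "CoffeeShop")` from slide 20 produces the new result for Charlie without recomputing the existing one. The stored `patrons` and `businesses` maps are the intermediate results that the "data plume" drawback refers to.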
  21. Overview of Rya Fluo Application (Maintaining Precomputed Joins)
     • Adding or deleting statements to the repository requires updating the precomputed join index table
     • This requires updates to intermediate results within the Fluo table
     • Figure: the Rya Client inserts triples; the Fluo Rya Precomputed Join (PCJ) App contains the Triple Observer, Statement Pattern Observer, Filter Observer, Join Observer and Query Result Observer; results go to the PCJ Index Table in Accumulo
  22. System Overview (Maintaining Precomputed Joins)
     • Figure components: PCJ Client, Statement Stream, Rya Client, Fluo Incremental PCJ App, and in Accumulo the Rya Core Tables (SPO, POS, OSP), the Fluo App Table and the Rya PCJ Tables
     • Figure arrow labels (numbered 1 to 3 in the diagram): Inserts Statements; Inserts Historic SP Matches; Processes; Exports Results
  23. Registering a New Query (Maintaining Precomputed Joins)
     Components: PCJ Client, Core Rya Tables, Fluo App, PCJ Index
     1. Register Query
     2. Scan for historic Statement Pattern matches
     3. Insert Statement Pattern matches
     4. Compute Results
     5. Export Results to PCJ Index
  24. Streaming While Registering Query (Maintaining Precomputed Joins)
     Components: PCJ Client, Statement Stream, Core Rya Tables, Fluo App
     1. Register Query
     2. Scan for historic Statement Pattern matches
     3. Insert Statement Pattern matches
     4. Compute Results
     Meanwhile, from the Statement Stream:
     A. Write new Statement
     B. Write new Statement
  25. Benchmark results for Rya with No Precomputed Joins and No Optimizations (Materialized Views in Rya)
     • The results on this slide and the next were obtained using proprietary queries and data
       - Q1 is the most complex and consists of 14 joins, 6 of which are left joins
       - Q2 is obtained from Q1 by removing two left joins and two joins
       - Q3-Q5 decrease in complexity as well and are obtained using a similar process
       - Q5 is similar to Q1, with all left joins replaced by joins
     • 4192 results were obtained by querying a Rya instance with 500,000 triples installed on Parsons’ internal cluster
       - 8 worker nodes, each with 2 x 6 Core Xeon E5-2440 (2.4GHz) processors and 48 GB RAM
     • The table below presents average query time over 10 iterations with standard deviation:
         Q  | Rya with one exact PCJ (s) | Rya with no PCJ (s)
         Q1 | 1.284 ± 0.047              | 516.774 ± 6.265
         Q2 | 0.851 ± 0.042              | 345.606 ± 5.991
         Q3 | 0.598 ± 0.026              | 180.663 ± 3.354
         Q4 | 0.368 ± 0.026              | 63.588 ± 1.527
         Q5 | 1.334 ± 0.074              | 97.101 ± 1.765
  26. Agenda
     • Rya Background
     • Materialized Views in Rya
     • Entity-Centric Index
  27. Facebook’s Unicorn Paper: Motivating Example (Entity-Centric Index)
     How do I find all documents containing “dog” and “bark”?
     1. View docs and terms as a graph, with edges drawn from docs to the terms they contain
     2. Efficiently represent the graph as a collection of adjacency lists
     The problem is reduced to finding the intersection of lists.
     Figure: graph linking the documents doc1 to doc8 to the terms dog and bark, next to the adjacency lists of dog and bark
  28. Facebook’s Unicorn Paper: Distributing the Problem (Entity-Centric Index)
     What if the adjacency lists are really large? The terms “dog” and “bark” could appear in lots of documents!
     • Distribute the problem by partitioning adjacency lists of documents across servers
       - Involves some type of sharding
     • Each server finds intersections of smaller lists
     Figure: the full adjacency lists of dog and bark (1 2 3 4 5 6 and 4 5 … 7 8) are split with ShardID = (doc num) % 3:
         Server 1, ShardID = 0: 3 6 ... and 6 ...
         Server 2, ShardID = 1: 1 4 ... and 4 7 ...
         Server 3, ShardID = 2: 2 5 ... and 5 8 ...
  29. Unicorn Applied to Accumulo (Entity-Centric Index)
     • The Unicorn framework outlines the basis for a distributed document partitioned index
     • Accumulo has a framework [1] in place for creating this index
       - Uses IndexedDocIterator, which is an extension of an IntersectingIterator
     • Uses the following key design:
         Row              | doc shard
         Column Family    | term
         Column Qualifier | document id
     [1] Accumulo: Application Development, Table Design, and Best Practices, Cordova A., Rinaldi B., Wall M., O’Reilly 2015
  30. Unicorn Implemented in Accumulo using Intersecting Iterators (Entity-Centric Index)
     Elements in the adjacency lists of “bark” and “dog” stored in Accumulo in a Document Partitioned Index
     • RowID = shardID (doc num % 3)
     • Column Family = term (bark or dog)
     • Column Qualifier = adjacency element (document number)
     Using this index, “entity-centric queries” can be evaluated entirely on the server
     • On each server,
       - iter1 scans “bark”
       - iter2 scans “dog”
       - Iterators intersect when colQ1 = colQ2, then return the result
     Documents that contain dog and bark:
         Server 1            Server 2            Server 3
         RowID ColF ColQ     RowID ColF ColQ     RowID ColF ColQ
         0     bark 6        1     bark 1        2     bark 2
         0     dog  3        1     bark 4        2     bark 5
         0     dog  6        1     dog  4        2     dog  5
                             1     dog  7        2     dog  8
         Result: 6           Result: 4           Result: 5
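The per-shard intersection shown in the tables can be reproduced with a short sketch. It mimics the idea only and does not call Accumulo's IntersectingIterator; the function names are hypothetical, and the postings are the document numbers from the three tables above.

```python
from collections import defaultdict
from typing import Dict, List

NUM_SHARDS = 3  # ShardID = doc num % 3, as on the slide

Index = Dict[int, Dict[str, List[int]]]  # shard -> term -> sorted doc numbers


def build_index(postings: Dict[str, List[int]]) -> Index:
    """Partition each term's adjacency list by document shard."""
    index: Index = defaultdict(lambda: defaultdict(list))
    for term, docs in postings.items():
        for doc in sorted(docs):
            index[doc % NUM_SHARDS][term].append(doc)
    return index


def intersect_sorted(a: List[int], b: List[int]) -> List[int]:
    """Merge-style intersection: advance whichever cursor is behind."""
    i = j = 0
    out: List[int] = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def query(index: Index, term1: str, term2: str) -> Dict[int, List[int]]:
    """Each shard intersects only its own, smaller lists."""
    return {
        shard: intersect_sorted(terms.get(term1, []), terms.get(term2, []))
        for shard, terms in sorted(index.items())
    }


POSTINGS = {"bark": [1, 2, 4, 5, 6], "dog": [3, 4, 5, 6, 7, 8]}
```

`query(build_index(POSTINGS), "bark", "dog")` yields one small result list per shard, matching the per-server results on the slide. Because a document's entries for every term live in the same shard, no shard ever needs data from another to compute its part of the answer.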
  31. Unicorn Applied to RDF (Entity-Centric Index)
     • Adjacency lists capture the edge label as well as the connection
     • This SPARQL query asks for all people who own a dog, drive an SUV, and work for the U.S. Government:
         SELECT ?person WHERE {
           ?person <hasPet> <dog> .
           ?person <drives> <SUV> .
           <USGovt> <employs> ?person .
         }
     • Adjacency lists of SUV, USGovt and dog:
         hasPet  obj/dog     | pers1 pers2 pers4
         employs subj/USGovt | pers2 pers4
         drives  obj/SUV     | pers2 pers3 pers4
     • Figure: graph linking pers1 to pers4 to dog (hasPet), SUV (drives) and USGovt (employs)
  32. Key Design in Accumulo (Entity-Centric Index)
     • For each triple (subj, pred, obj, graph-name), insert the following two Accumulo keys in the same Entity-Centric Index table:
         Row: <subject>, CF: <predicate>, CQ: <graphName>\x00object\x00<object>
         Row: <object>, CF: <predicate>, CQ: <graphName>\x00subject\x00<subject>
     • The triple (uri:John, uri:worksAt, uri:Parsons, graph context: parsonsEmployees) is added as the following two rows:
         Row: uri:John, CF: uri:worksAt, CQ: parsonsEmployees\x00object\x00uri:Parsons
         Row: uri:Parsons, CF: uri:worksAt, CQ: parsonsEmployees\x00subject\x00uri:John
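The key layout on this slide can be written down as a small helper. This is a sketch, not Rya's implementation: the `Key` type and function name are hypothetical, and it assumes that the `x00` shown in the slide text stands for a null-byte delimiter.

```python
from typing import List, NamedTuple

NULL = "\x00"  # assumed delimiter between column qualifier components


class Key(NamedTuple):
    row: str
    cf: str
    cq: str


def entity_centric_keys(subj: str, pred: str, obj: str, graph: str) -> List[Key]:
    """Return the two keys written for one triple.

    The first key is stored under the subject's row and records the object;
    the second is stored under the object's row and records the subject.
    """
    return [
        Key(row=subj, cf=pred, cq=NULL.join((graph, "object", obj))),
        Key(row=obj, cf=pred, cq=NULL.join((graph, "subject", subj))),
    ]
```

Writing each triple twice is what places every edge of an entity, incoming or outgoing, in that entity's own row, so all of an entity's properties can be read from a single row.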
  33. Query Evaluation: Merge Joins and the Reduction of Network Traffic (Entity-Centric Index)
         SELECT ?person WHERE {
           ?person <hasPet> <dog>.
           ?person <drives> <SUV>.
           <USGovt> <employs> ?person.}
     Using this index, “entity-centric queries” can be evaluated entirely on the server
     • iter1 scans col: employs, colQ: subject USGovt
     • iter2 scans col: drives, colQ: object SUV
     • iter3 scans col: hasPet, colQ: object dog
     • Iterators intersect when rowID 1 = rowID 2 = rowID 3
     Server 1:
         RowID  | ColF    | ColQ
         dog    | hasPet  | S ….
         pers1  | hasPet  | O dog
         pers2  | employs | S USGovt
         pers2  | drives  | O bicycle
         pers2  | drives  | O SUV
         pers2  | hasPet  | O dog
         Result: pers2
     Server 2:
         RowID  | ColF    | ColQ
         pers3  | drives  | O SUV
         pers4  | employs | S USGovt
         pers4  | drives  | O SUV
         pers4  | hasPet  | O dog
         pers5  | drives  | O SUV
         SUV    | drives  | S ….
         USGovt | employs | O pers2
         USGovt | employs | O pers4
         Result: pers4
  34. Which Queries Can We Evaluate? (Entity-Centric Index)
     • Generalize the Document Partitioned Index to accommodate a broad range of SPARQL queries
     • Solve as many Entity-Centric queries server side as possible
       - Entity-Centric means all statement patterns share a common variable or constant
     • Properties for an Entity:
         select ?x ?y ?z where{ A aa ?x A bb ?y A cc ?z }
     • Entity with Properties:
         select ?x where{ ?x aa C ?x bb B ?x cc D }
     • “Friends of Friends”:
         select ?x ?y ?z where{ B aa ?x ?x bb ?y ?x cc ?z }
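The definition of an entity-centric query given above can be turned into a simple check. This sketch makes one assumption that the slide does not state: only subject and object terms are considered as the shared entity, not predicates. The function names are hypothetical.

```python
from typing import Sequence, Set, Tuple

Pattern = Tuple[str, str, str]  # (subject, predicate, object)


def common_entities(patterns: Sequence[Pattern]) -> Set[str]:
    """Terms (variables or constants) in subject or object position of every pattern."""
    shared = None
    for subj, _pred, obj in patterns:
        terms = {subj, obj}
        shared = terms if shared is None else shared & terms
    return shared or set()


def is_entity_centric(patterns: Sequence[Pattern]) -> bool:
    """True when all statement patterns share at least one common term."""
    return bool(common_entities(patterns))


PROPERTIES_FOR_ENTITY = [("A", "aa", "?x"), ("A", "bb", "?y"), ("A", "cc", "?z")]
ENTITY_WITH_PROPERTIES = [("?x", "aa", "C"), ("?x", "bb", "B"), ("?x", "cc", "D")]
FRIENDS_OF_FRIENDS = [("B", "aa", "?x"), ("?x", "bb", "?y"), ("?x", "cc", "?z")]
```

All three query shapes from the slide pass the check, with `A`, `?x` and `?x` as the shared entity respectively. A query whose patterns split into two groups around different variables does not pass as a whole, which is the case where the planner has to group patterns and join the groups.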
  35. Query Planning: Anatomy of a Rya SPARQL Query using the Entity-Centric Index (Entity-Centric Index)
     • During query planning, group statement patterns according to common variables and common constants
     • Groups which have the highest “priority” are consolidated into an Entity-Centric Index node
     • Statement patterns:
         ?x livesIn Arlington
         ?y talksTo ?x
         ?y livesIn D.C.
         ?x commutesBy Bike
     • Entity-Centric Index contents:
         …
         Joe, livesIn, D.C.
         Joe, talksTo, Rob
         …
         Rob, commutesBy, Bike
         Rob, livesIn, Arlington
         …
     • Figure: join tree over the statement patterns, with groups of patterns consolidated into Entity-Centric Index Node 1 and Entity-Centric Index Node 2
  36. Entity Centric Benchmarking Results (Entity-Centric Index)
     • Results were obtained by running 13 queries (see Appendix) against the Lehigh University Benchmark (LUBM) data set consisting of 33.34 million triples
     • The Entity Centric Index table was split into 19 tablets distributed across 8 servers
       - All predicates found in LUBM data were set as locality groups for the Entity Centric Index table
     • Queries were issued using a BatchScanner with 15 threads
         Query        | Entity Total Time (s) | Rya Total Time (s) | Results Ret.
         LUBMStar Q1  | 23.6                  | 624.702            | 1024789
         LUBMStar Q2  | 0.3724                | 0.732              | 7
         LUBMStar Q3  | 0.545                 | 1.221              | 499
         LUBMStar Q4  | 4.37                  | 379.239            | 180002
         LUBMStar Q5  | 1.475                 | 6.072              | 40665
         LUBMStar Q6  | 0.222                 | 6.613              | 5003
         LUBMStar Q7  | 11                    | 0.3258             | 3
         LUBMStar Q8  | 7.2                   | 0.267              | 1
         LUBMStar Q9  | 12.763                | 0.748              | 8
         LUBMStar Q10 | 34.934                | 1929.984           | 1,259,374
         LUBMStar Q11 | 0.0412                | 0.284              | 3
         LUBMStar Q12 | 0.0358                | 0.311              | 2
         LUBMStar Q13 | 0.0291                | 0.137              | 30
  37. Next Steps for Precomputed Joins (Future Research)
     • Implement strategies to determine PCJs dynamically
       - Calculate frequent subgraphs
       - Develop a query logging framework
       - Perform semantic analysis of queries to determine common components to cache
     • Streamline query planning with respect to PCJs
       - Query planning time increases as the number of PCJs increases
       - Explore strategies for pruning the PCJ query plan search space to quickly determine efficient PCJ combinations for query plans
       - Index PCJs using underlying query components so that PCJs can be efficiently discovered using the matching subquery
  38. Next Steps for Entity Centric Index and Server Side Join Evaluation (Future Research)
     • Next steps for the Entity Centric Index
       - Explore ways to improve join performance of Entity Centric Query Nodes
       - Add the capability to explicitly define Entity Types for the index
           Entities implicitly defined as nodes containing a specified combination of properties
           Explicitly register entities with the index and allow users to query by type
           Specify entities using OWL (Web Ontology Language) class and property combinations
           Leverage additional structure using more targeted queries involving identifying features for the given entity type
     • Future research in server-side join evaluation
       - Utilize Spark GraphX or Spark DataFrames to create a distributed query evaluation framework for Rya
       - Joins performed on Rya Resilient Distributed Datasets (RDDs) in a remote SparkContext on the server
  39. 39. Rya: Accumulo Indexing Strategies for Searching Semantic Networks 38 Questions?
  40. 40. Appendix
      • Useful Links
      • Entity Centric LUBM Star Queries
  41. 41. Useful Links
      SPARQL
        - Standard: http://www.w3.org/TR/rdf-sparql-query/
        - Tutorial: http://jena.apache.org/tutorials/sparql.html
      RDF
        - Primer: http://www.w3.org/TR/rdf11-primer/
      Unicorn
        - Paper: Michael Curtiss, et al., Unicorn: A System for Searching the Social Graph, Facebook Inc. https://research.facebook.com/publications/unicorn-a-system-for-searching-the-social-graph/
      Apache Rya (Incubating)
        - Home: http://rya.apache.org/
        - Rya Office Hours: biweekly phone conference covering updates, issues, and upcoming features. Upcoming announcements with dial-in numbers are sent on the dev mailing list.
        - Mailing List: dev@rya.incubator.apache.org is for usage questions, help, and people who want to contribute code to Rya (subscribe, unsubscribe, archives).
        - Javadoc (OpenRDF = Sesame = RDF4J): http://archive.rdf4j.org/javadoc/sesame-2.7.16/
        - Tutorial for RDF4J: http://rdf4j.org/doc/programming-with-rdf4j/
        - Paper: Punnoose R., Crainiceanu A., Rapp D. 2012. Rya: a scalable RDF triple store for the clouds. Proceedings of the 1st International Workshop on Cloud Intelligence. http://www.usna.edu/Users/cs/adina/research/Rya_CloudI2012.pdf
        - Paper: Roshan Punnoose, Adina Crainiceanu, David Rapp. SPARQL in the Clouds Using Rya. Information Systems Journal (2013). http://www.usna.edu/Users/cs/adina/research/Rya_ISjournal2013.pdf
  42. 42. Entity Centric LUBM Star Queries (1 of 5)
      The Entity Centric index was tested by issuing the following queries against the LUBM data set.

      LUBM Star Q1 =
          PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
          PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
          SELECT ?X ?Y1 ?Y2 ?Y3 ?Y4
          WHERE {
            ?X ub:doctoralDegreeFrom ?Y2 .
            ?X ub:undergraduateDegreeFrom ?Y4 .
            ?Y1 ub:advisor ?X .
            ?X ub:emailAddress ?Y3 .
          }

      LUBM Star Q2 =
          PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
          PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
          SELECT ?X ?Y1
          WHERE {
            ?X ub:doctoralDegreeFrom <http://www.University104.edu> .
            ?X ub:headOf ?Y1 .
          }

      LUBM Star Q3 =
          PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
          PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
          SELECT ?X ?Y1
          WHERE {
            ?X ub:doctoralDegreeFrom <http://www.University104.edu> .
            ?X ub:teacherOf ?Y1 .
          }
  43. 43. Entity Centric LUBM Star Queries (2 of 5)

      LUBM Star Q4 =
          PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
          PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
          SELECT ?X ?Y2 ?Y3 ?Y4
          WHERE {
            ?X ub:doctoralDegreeFrom ?Y2 .
            ?X ub:undergraduateDegreeFrom ?Y4 .
            ?Y1 ub:advisor ?X .
            ?X ub:emailAddress ?Y3 .
          }

      LUBM Star Q5 =
          PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
          PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
          SELECT ?X ?Y1 ?Y2
          WHERE {
            ?X ub:headOf ?Y2 .
            ?Y1 ub:advisor ?X .
          }

      LUBM Star Q6 =
          PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
          PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
          SELECT ?X ?Y1 ?Y2
          WHERE {
            ?X ub:headOf ?Y1 .
            ?X ub:doctoralDegreeFrom ?Y2 .
          }
  44. 44. Entity Centric LUBM Star Queries (3 of 5)

      LUBM Star Q7 =
          PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
          PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
          SELECT ?X ?Y1 ?Y2
          WHERE {
            <http://www.Department0.University0.edu/UndergraduateStudent106> ub:takesCourse ?Y1 .
            ?Y2 ub:teacherOf ?Y1 .
          }

      LUBM Star Q8 =
          PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
          PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
          SELECT ?X ?Y
          WHERE {
            <http://www.Department0.University114.edu/UndergraduateStudent168> ub:memberOf ?X .
            ?X ub:subOrganizationOf ?Y .
          }

      LUBM Star Q9 =
          PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
          PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
          SELECT ?X ?Y1 ?Y2 ?Y3 ?Y4
          WHERE {
            ?X ub:takesCourse <http://www.Department0.University101.edu/GraduateCourse31> .
            ?X ub:undergraduateDegreeFrom ?Y1 .
            ?X ub:emailAddress ?Y2 .
            ?X ub:memberOf ?Y3 .
          }
  45. 45. Entity Centric LUBM Star Queries (4 of 5)

      LUBM Star Q10 =
          PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
          PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
          SELECT ?X ?Y1 ?Y2 ?Y3 ?Y4
          WHERE {
            ?X ub:takesCourse ?Y4 .
            ?X ub:undergraduateDegreeFrom ?Y1 .
            ?X ub:emailAddress ?Y2 .
            ?X ub:memberOf ?Y3 .
          }

      LUBM Star Q11 =
          PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
          PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
          SELECT ?X ?Y1 ?Y2 ?Y3 ?Y4
          WHERE {
            <http://www.Department9.University150.edu/GraduateStudent72> ub:takesCourse ?Y4 .
            <http://www.Department9.University150.edu/GraduateStudent72> ub:undergraduateDegreeFrom ?Y1 .
            <http://www.Department9.University150.edu/GraduateStudent72> ub:emailAddress ?Y2 .
            <http://www.Department9.University150.edu/GraduateStudent72> ub:memberOf ?Y3 .
          }

      LUBM Star Q12 =
          PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
          PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
          SELECT ?X ?Y1 ?Y2 ?Y3 ?Y4
          WHERE {
            <http://www.Department17.University156.edu/GraduateStudent21> ub:takesCourse ?Y4 .
            <http://www.Department17.University156.edu/GraduateStudent21> ub:undergraduateDegreeFrom ?Y1 .
            <http://www.Department17.University156.edu/GraduateStudent21> ub:emailAddress ?Y2 .
            <http://www.Department17.University156.edu/GraduateStudent21> ub:memberOf ?Y3 .
          }
  46. 46. Entity Centric LUBM Star Queries (5 of 5)

      LUBM Star Q13 =
          PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
          PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
          SELECT ?Y1 ?Y2
          WHERE {
            ?Y1 ub:takesCourse <http://www.Department0.University0.edu/Course0> .
            ?Y2 ub:teacherOf <http://www.Department0.University0.edu/Course0> .
          }
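What the appendix queries have in common can be checked mechanically. The sketch below assumes that a "star" query is one in which every statement pattern shares a single term (variable or constant) in its subject or object position; that reading is an assumption made for illustration, and the helper `common_terms` is hypothetical rather than part of Rya.

```python
def common_terms(patterns):
    """Return the terms that occur as subject or object in every (s, p, o) pattern."""
    shared = None
    for subject, _predicate, obj in patterns:
        ends = {subject, obj}
        shared = ends if shared is None else shared & ends
    return shared or set()


# Statement patterns of LUBM Star Q5 and Q13, written as (subject, predicate, object).
q5 = [
    ("?X", "ub:headOf", "?Y2"),
    ("?Y1", "ub:advisor", "?X"),
]
course0 = "<http://www.Department0.University0.edu/Course0>"
q13 = [
    ("?Y1", "ub:takesCourse", course0),
    ("?Y2", "ub:teacherOf", course0),
]

print(common_terms(q5))   # the shared variable
print(common_terms(q13))  # the shared constant
```

Under this reading, the center of the star in Q5 is the variable `?X`, while in Q13 it is the constant course URI.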
