Slideshow transcript
Slide 1: MPTStore: A Fast, Scalable, and Stable Resource Index Aaron Birkland and Chris Wilper Open Repositories 2007 San Antonio, TX
Slide 2: Background: RDF in Fedora A natural fit: • Object-object relationships • Object properties • Exposure to services (as a graph) Resource Index introduced: • Fedora 2.0 (January ‘05)
Slide 3: Background: RDF in Fedora Challenges • Scalability: few triplestores designed for 100M+ • Performance: Jena vs. Kowari (Jena: OOM); Kowari vs. Sesame Native (Sesame: slow complex queries) • Stability: frequent “rebuilds”
Slide 4: Motivation: The NSDL Use Case The NSDL has a moderately large repository 4.7 million objects 250 million triples
Slide 5: Motivation: The NSDL Use Case The NSDL has a moderately large repository 4.7 million objects 250 million triples ...and has a large volume of writes Driven by periodic OAI harvests Primarily mixed ingests and datastream mods Highly concurrent reads and writes
Slide 6: Motivation: The NSDL Use Case Additionally, NSDL has data model constraints that must be enforced Existential/referential constraints on objects (e.g. “foreign key” constraints) Uniqueness constraints on some object properties
Slide 7: Motivation: The NSDL Use Case These constraints primarily center around RELS-EXT content: Relationships to other NSDL objects (forming a graph) Literal value properties for a particular object itself
Slide 8: <foxml:datastream ID="RELS-EXT" ...> ... <example:id>PLUGH-XYZZY</example:id> (Must be globally unique) <example:objectType>Resource</example:objectType> <example:memberOf rdf:resource="info:fedora/demo:73" /> (This object... 1) Must exist 2) Must be 'Active' 3) Must be objectType 'Aggregation') ... </foxml:datastream>
Slide 9: Motivation: The NSDL Use Case No suitable constraint enforcement mechanisms exist in Fedora itself Our approach: Enforce content model in middleware Serialize access where we have to Query RI before ingest or modify
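Editor's sketch for Slide 9: a minimal illustration of "query RI before ingest", assuming the resource index can be treated as a plain set of (subject, predicate, object) tuples. The predicate URIs for state, objectType and id, and the function names, are hypothetical; the slides do not show NSDL's actual middleware.

```python
# Hypothetical predicate URIs; only memberOf appears (in this form) on the slides.
MEMBER_OF = "http://ns.example.org/rels#memberOf"
STATE = "urn:example:state"
OBJECT_TYPE = "urn:example:objectType"
EXAMPLE_ID = "urn:example:id"


def objects_of(index, subject, predicate):
    """Return all object values stored for (subject, predicate)."""
    return {o for (s, p, o) in index if s == subject and p == predicate}


def check_ingest(index, new_triples):
    """Return a list of constraint violations for triples about to be written."""
    problems = []
    subjects = {s for (s, _, _) in index}
    for s, p, o in new_triples:
        if p == EXAMPLE_ID:
            # Uniqueness constraint: no other object may already carry this id.
            owners = {owner for (owner, ip, io) in index
                      if ip == EXAMPLE_ID and io == o}
            if owners - {s}:
                problems.append(f"id {o!r} is already in use")
        elif p == MEMBER_OF:
            # Referential constraint: target must exist, be Active,
            # and be of objectType Aggregation.
            if o not in subjects:
                problems.append(f"{o} does not exist")
            elif "Active" not in objects_of(index, o, STATE):
                problems.append(f"{o} is not Active")
            elif "Aggregation" not in objects_of(index, o, OBJECT_TYPE):
                problems.append(f"{o} is not an Aggregation")
    return problems
```

A check like this is only trustworthy if the index reflects every earlier write, which is the difficulty the next slides describe.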
Slide 10: The Challenge Querying the RI to determine correct repository state proved to be the most difficult aspect. To achieve acceptable performance with Kowari, triple writes are buffered and executed in large, infrequent chunks Triples waiting in these buffers are invisible to outside queries
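Editor's sketch for Slide 10: a toy model of write buffering, showing why a triple that has been accepted but not yet flushed cannot be seen by a query. The class and its parameters are hypothetical and say nothing about how Kowari buffers internally.

```python
class BufferedIndex:
    """Toy index: writes sit in a buffer until it fills or flush() is called."""

    def __init__(self, flush_size=1000):
        self.committed = set()   # triples visible to queries
        self.pending = []        # triples accepted but not yet visible
        self.flush_size = flush_size

    def add(self, triple):
        self.pending.append(triple)
        if len(self.pending) >= self.flush_size:
            self.flush()

    def flush(self):
        self.committed.update(self.pending)
        self.pending.clear()

    def contains(self, triple):
        # Queries only ever see flushed data.
        return triple in self.committed
```

With a large `flush_size`, `contains` keeps returning False for a triple that was just added, so a pre-ingest check can approve a write that conflicts with one still waiting in the buffer.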
Slide 11: The Challenge Possible solution: Flush the buffer after every write operation New problem: Flushed updates with Kowari are very expensive -- Multiple seconds per operation. This was incompatible with NSDL processing volume This was a real showstopper...
Slide 12: The Challenge Other difficulties the NSDL had with Kowari: RI corruption under concurrent use RI corruption with abnormal shutdowns Scalability. Performance became noticeably worse with increasing repository size Steep memory requirements
Slide 13: The Challenge Searching for a solution... Other triple stores (e.g. Jena, Sesame) were considered for Fedora in the past, but rejected for various reasons RDBMS seemed attractive – efficient transactions, very stable, generally speedy “One big table” paradigm did not seem to give us desired scalability in initial tests
Slide 14: Our Solution Mapped predicate tables One table per predicate, containing indexed 'subject' and 'object' values Mapping table containing metadata correlating predicate URI to a particular db table
Slide 15: Triples table t1 (columns: s | o): <info:fedora/demo:1> | <info:fedora/demo:3>; <info:fedora/demo:2> | <info:fedora/demo:4> Predicate mapping table tmap (columns: p | pkey): <http://ns.example.org/rels#memberOf> | 1; <info:fedora/fedora-def:model#disseminates> | 2
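Editor's sketch for Slides 14 and 15: one way the mapped predicate table layout could look, using Python's built-in sqlite3 so it runs without a server. The table and column names (tmap, p, pkey, t1, s, o) follow the slide; the function names and the choice of SQLite are assumptions, and this is not MPTStore's actual DDL.

```python
import sqlite3


def create_store(conn):
    """Create the mapping table that correlates a predicate URI with a table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tmap ("
        "pkey INTEGER PRIMARY KEY, p TEXT UNIQUE NOT NULL)"
    )


def table_for(conn, predicate):
    """Return the table name for a predicate, creating table and indexes on first use."""
    row = conn.execute("SELECT pkey FROM tmap WHERE p = ?", (predicate,)).fetchone()
    if row is not None:
        return f"t{row[0]}"
    cur = conn.execute("INSERT INTO tmap (p) VALUES (?)", (predicate,))
    table = f"t{cur.lastrowid}"  # name derives from an integer key, never from user text
    conn.execute(f"CREATE TABLE {table} (s TEXT NOT NULL, o TEXT NOT NULL)")
    conn.execute(f"CREATE INDEX {table}_s ON {table} (s)")
    conn.execute(f"CREATE INDEX {table}_o ON {table} (o)")
    return table


def add_triple(conn, s, p, o):
    """Adding a triple is a single-row insert into that predicate's table."""
    conn.execute(f"INSERT INTO {table_for(conn, p)} (s, o) VALUES (?, ?)", (s, o))
```

Because each predicate has its own small, indexed table, an add or delete touches one table only, which is the basis of the "low cost adds and deletes" claim on the next slide.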
Slide 16: Our Solution Benefits: Low cost adds and deletes Queries with known predicates are very fast Complex queries benefit due to RDBMS planner having finer-grained statistics and query plans Flexible data partitioning
Slide 17: Our Solution Disadvantages: Need to manage predicate to table mapping Complex queries require more effort to formulate With a naïve approach, simple unbound queries scale linearly with the number of predicates
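Editor's sketch for Slide 17: the naïve handling of a query with an unbound predicate, assuming a dictionary that maps each predicate URI to its table name. The function name is hypothetical. The generated SQL has one SELECT per predicate table, which is why the cost grows linearly with the number of predicates.

```python
import sqlite3


def unbound_predicate_query(tmap, subject):
    """Build SQL for the pattern (subject, ?, ?) across every predicate table.

    tmap maps predicate URI -> table name. Assumes tmap is non-empty.
    Returns (sql, params) suitable for a parameterized execute().
    """
    parts, params = [], []
    for predicate, table in tmap.items():
        # The predicate is not stored in the row, so it is supplied as a constant.
        parts.append(f"SELECT s, ? AS p, o FROM {table} WHERE s = ?")
        params.extend([predicate, subject])
    return " UNION ALL ".join(parts), params
```

With roughly 50 predicates, as reported for NSDL on the next slide, this produces a 50-branch UNION ALL, each branch an indexed lookup.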
Slide 18: Our Solution Observations: Total number of distinct predicates is much lower than the number of subjects or objects. NSDL has ~50 Unbound predicate queries are less common NSDL is heavily biased towards a high volume of writes and simple queries
Slide 19: Our Solution Enter MPTStore Java library that handles all mapping and accounting behind the scenes API for performing triple writes and queries Translates queries from a particular language (e.g. SPO, SPARQL) into SQL statements
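Editor's sketch for Slide 19: a simplified translation of an SPO-style pattern with a known predicate into SQL. It assumes three whitespace-separated tokens with `*` as the wildcard and no literals containing spaces; the function name and the dictionary-based mapping are hypothetical and do not reflect MPTStore's real API.

```python
def spo_to_sql(tmap, query):
    """Translate 's p o' (with '*' wildcards for s and o) into (sql, params).

    tmap maps predicate URI -> table name.
    Returns None when the predicate is unknown, since nothing can match.
    """
    s, p, o = query.split()
    if p == "*":
        raise ValueError("unbound predicate is not handled in this sketch")
    table = tmap.get(p)
    if table is None:
        return None
    clauses, params = [], []
    if s != "*":
        clauses.append("s = ?")
        params.append(s)
    if o != "*":
        clauses.append("o = ?")
        params.append(o)
    sql = f"SELECT s, o FROM {table}"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params
```

With the predicate bound, the query resolves to a single indexed table, which matches the "queries with known predicates are very fast" benefit listed earlier.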
Slide 20: Our Solution Designed to expose transaction/connection semantics Calling code has to provide jdbc connection for adding, querying triples Thus, clear path to use advanced transactional capabilities offered by jdbc driver (such as XA)
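Editor's sketch for Slide 20: what caller-supplied connections make possible, shown with sqlite3 standing in for a JDBC connection. The helper never commits; the calling code writes, validates against its own uncommitted data, and then commits or rolls back. All function names and the uniqueness check are hypothetical.

```python
import sqlite3


def add_triple(conn, table, s, o):
    """Write one triple on the caller's connection; the caller owns the transaction."""
    conn.execute(f"INSERT INTO {table} (s, o) VALUES (?, ?)", (s, o))


def unique_objects(conn, table="t1"):
    """Example constraint: every object value in the table is unique."""
    total, distinct = conn.execute(
        f"SELECT COUNT(*), COUNT(DISTINCT o) FROM {table}"
    ).fetchone()
    return total == distinct


def ingest(conn, table, triples, validate):
    """Add triples, validate in the same transaction, then commit or roll back."""
    for s, o in triples:
        add_triple(conn, table, s, o)
    # Uncommitted rows are visible on this connection, so the check is immediate.
    if validate(conn):
        conn.commit()
        return True
    conn.rollback()
    return False
```

The design point is that writes are queryable at once within the transaction, without the buffering delay described on Slide 10.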
Slide 21: Results MPTStore performance well suited to NSDL use case Adds and modifies were significantly faster than with Kowari, and were unaffected by database size SPO queries were on par with Kowari in the unbound (common) case
Slide 22: Results Bonus NSDL team was very familiar with RDBMS administration: performance tuning, backups, etc. Stored data is transparent and “hackable”: Ad-hoc SQL queries and analysis are relatively simple
Slide 23: Results Fedora Bonus Ability to easily analyze the database: helped us track down our own middleware bugs (improved Kowari Performance).
Slide 24: Fast, Immediate Updates Graph shows average ms per datastream modification. MPTStore achieves virtually the same performance whether buffering or not. Complete test detail in Fedora 2.2 docs. [Bar chart: y-axis 0 to 600 ms; Async and Sync bars for Kowari and MPTStore]
Slide 25: RI: Future Directions External Resource Index Event-based (JMS) updates to external triplestore Analogous to GSearch index updates May be asynchronous May index other datastreams Make full use of triplestore capabilities without compromising the core repository Inference (e.g. krule, RACER) Native APIs
Slide 26: RI: Future Directions Internal (Synchronous) Resource Index Assumption: XA Transactions. Option A: MPTStore Only Pro: Simple, synchronous, JDBC (no need for middleware) Con: Basic queries (no iTQL, maybe SPARQL-Lite) Option B: Mulgara or MPTStore Pro: Richer queries when using Mulgara (iTQL) Con: Complexity (need for XA-aware middleware?)
Slide 27: Thank You More Information http://mptstore.sourceforge.net/ http://www.fedora.info/download/2.2/ http://tripletest.sourceforge.net/


