MPTStore: A Fast, Scalable, and Stable Resource Index

From cwilper, 2 years ago

Describes and motivates the creation of MPTStore within the contex more

Slideshow transcript

Slide 1: MPTStore: A Fast, Scalable, and Stable Resource Index. Aaron Birkland and Chris Wilper. Open Repositories 2007, San Antonio, TX

Slide 2: Background: RDF in Fedora A natural fit: • Object-object relationships • Object properties • Exposure to services (as a graph) Resource Index introduced: • Fedora 2.0 (January ‘05)

Slide 3: Background: RDF in Fedora Challenges: • Scalability: few triplestores designed for 100M+ • Performance: Jena vs. Kowari (Jena: OOM); Kowari vs. Sesame Native (Sesame: slow complex queries) • Stability: frequent “rebuilds”

Slide 4: Motivation: The NSDL Use Case The NSDL has a moderately large repository: • 4.7 million objects • 250 million triples

Slide 5: Motivation: The NSDL Use Case The NSDL has a moderately large repository: • 4.7 million objects • 250 million triples ...and has a large volume of writes: • Driven by periodic OAI harvests • Primarily mixed ingests and datastream mods • Highly concurrent reads and writes

Slide 6: Motivation: The NSDL Use Case Additionally, NSDL has data model constraints that must be enforced: • Existential/referential constraints on objects (e.g. “foreign key” constraints) • Uniqueness constraints on some object properties

Slide 7: Motivation: The NSDL Use Case These constraints primarily center around RELS-EXT content: • Relationships to other NSDL objects (forming a graph) • Literal value properties for a particular object itself

Slide 8:
<foxml:datastream ID="RELS-EXT" ...>
  ...
  <example:id>PLUGH-XYZZY</example:id>    (Must be globally unique)
  <example:objectType>Resource</example:objectType>
  <example:memberOf rdf:resource="info:fedora/demo:73" />    (This object... 1) Must exist 2) Must be 'Active' 3) Must be objectType 'Aggregation')
  ...
</foxml:datastream>

Slide 9: Motivation: The NSDL Use Case No suitable constraint enforcement mechanisms exist in Fedora itself. Our approach: • Enforce content model in middleware • Serialize access where we have to • Query RI before ingest or modify
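
The "query RI before ingest or modify" step could be sketched roughly as follows. This is a minimal Python illustration, not the NSDL middleware: the Resource Index is stood in for by a plain set of (subject, predicate, object) tuples, and the `violations` helper, the `state` predicate label, and the `AGG-1` id are all hypothetical. The three checks mirror the constraints annotated on Slide 8.

```python
# Hypothetical stand-in for the Resource Index: a set of (s, p, o) tuples.
RI = {
    ("info:fedora/demo:73", "example:id", "AGG-1"),
    ("info:fedora/demo:73", "example:objectType", "Aggregation"),
    ("info:fedora/demo:73", "state", "Active"),
}


def violations(ri, new_id, member_of):
    """Return the reasons a proposed object would break the data model."""
    problems = []

    # Uniqueness constraint: no existing object may already carry this id.
    if any(p == "example:id" and o == new_id for (_s, p, o) in ri):
        problems.append("id is not unique")

    # Referential constraints: the memberOf target must exist,
    # be Active, and be of objectType Aggregation.
    target = {p: o for (s, p, o) in ri if s == member_of}
    if not target:
        problems.append("memberOf target does not exist")
    else:
        if target.get("state") != "Active":
            problems.append("memberOf target is not Active")
        if target.get("example:objectType") != "Aggregation":
            problems.append("memberOf target is not an Aggregation")
    return problems


ok = violations(RI, "PLUGH-XYZZY", "info:fedora/demo:73")
bad = violations(RI, "AGG-1", "info:fedora/demo:99")
```

A check like this is only trustworthy if the index reflects every earlier write and if the check and the ingest are not interleaved with other writers, which is presumably why the slide pairs it with serialized access.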

Slide 10: The Challenge Querying the RI to determine correct repository state proved to be the most difficult aspect. • To achieve acceptable performance with Kowari, triple writes are buffered and executed in large, infrequent chunks • Triples waiting in these buffers are invisible to outside queries
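
The visibility problem described on this slide can be shown with a toy model. The `BufferedIndex` class below is a hypothetical illustration written for this transcript; it does not reflect how Kowari or Fedora implement buffering, only the general effect of deferring writes.

```python
class BufferedIndex:
    """Toy index: writes sit in a buffer until flush() makes them queryable."""

    def __init__(self):
        self.visible = set()   # triples that queries can see
        self.pending = []      # triples written but not yet flushed

    def add(self, triple):
        self.pending.append(triple)

    def flush(self):
        self.visible.update(self.pending)
        self.pending.clear()

    def contains(self, triple):
        # Queries only consult flushed data, never the buffer.
        return triple in self.visible


index = BufferedIndex()
triple = ("info:fedora/demo:1", "example:memberOf", "info:fedora/demo:3")

index.add(triple)
seen_before_flush = index.contains(triple)   # the write is still buffered
index.flush()
seen_after_flush = index.contains(triple)    # now visible to queries
```

A constraint check that runs between `add` and `flush` would see stale state, which is the situation the slide describes.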

Slide 11: The Challenge Possible solution: • Flush the buffer after every write operation New problem: • Flushed updates with Kowari are very expensive: multiple seconds per operation. This was incompatible with NSDL processing volume. This was a real showstopper...

Slide 12: The Challenge Other difficulties the NSDL had with Kowari: • RI corruption under concurrent use • RI corruption with abnormal shutdowns • Scalability: performance became noticeably worse with increasing repository size • Steep memory requirements

Slide 13: The Challenge Searching for a solution... • Other triple stores (e.g. Jena, Sesame) were considered for Fedora in the past, rejected for various reasons • RDBMS seemed attractive: efficient transactions, very stable, generally speedy • “One big table” paradigm did not seem to give us desired scalability in initial tests

Slide 14: Our Solution Mapped predicate tables: • One table per predicate, containing indexed 'subject' and 'object' values • Mapping table containing metadata correlating predicate URI to a particular db table

Slide 15: Triples table (t1) and predicate mapping table (tmap)

t1:
s                      | o
<info:fedora/demo:1>   | <info:fedora/demo:3>
<info:fedora/demo:2>   | <info:fedora/demo:4>

tmap:
p                                             | pkey
<http://ns.example.org/rels#memberOf>         | 1
<info:fedora/fedora-def:model#disseminates>   | 2
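
The layout on Slides 14 and 15 could be sketched as follows. This is a minimal illustration using Python's built-in `sqlite3` module, not MPTStore's actual schema or DDL: the table and column names `tmap`, `p`, `pkey`, `s`, `o` are taken from the slide, while the helper functions, the index names, and the assumption that `pkey` N maps to table `tN` are made up for this sketch.

```python
import sqlite3

MEMBER_OF = "<http://ns.example.org/rels#memberOf>"
DISSEMINATES = "<info:fedora/fedora-def:model#disseminates>"


def create_store(conn):
    """Create the mapping table that correlates a predicate URI to a table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tmap "
        "(pkey INTEGER PRIMARY KEY, p TEXT UNIQUE NOT NULL)"
    )


def table_for(conn, predicate):
    """Return the table name for a predicate, creating it on first use."""
    row = conn.execute(
        "SELECT pkey FROM tmap WHERE p = ?", (predicate,)
    ).fetchone()
    if row is not None:
        return f"t{row[0]}"
    pkey = conn.execute("INSERT INTO tmap (p) VALUES (?)", (predicate,)).lastrowid
    table = f"t{pkey}"  # assumption: pkey N maps to table tN
    conn.execute(f"CREATE TABLE {table} (s TEXT NOT NULL, o TEXT NOT NULL)")
    conn.execute(f"CREATE INDEX {table}_s ON {table} (s)")
    conn.execute(f"CREATE INDEX {table}_o ON {table} (o)")
    return table


def add_triple(conn, s, p, o):
    """Store one triple in the table that belongs to its predicate."""
    conn.execute(f"INSERT INTO {table_for(conn, p)} (s, o) VALUES (?, ?)", (s, o))


conn = sqlite3.connect(":memory:")
create_store(conn)
add_triple(conn, "<info:fedora/demo:1>", MEMBER_OF, "<info:fedora/demo:3>")
add_triple(conn, "<info:fedora/demo:2>", MEMBER_OF, "<info:fedora/demo:4>")
members = conn.execute("SELECT s, o FROM t1 ORDER BY s").fetchall()
```

Because table names are derived only from integer keys, building them with string formatting is safe here; the subject and object values always go through bound parameters.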

Slide 16: Our Solution Benefits: • Low cost adds and deletes • Queries with known predicates are very fast • Complex queries benefit due to RDBMS planner having finer-grained statistics and query plans • Flexible data partitioning

Slide 17: Our Solution Disadvantages: • Need to manage predicate to table mapping • Complex queries require more effort to formulate • With a naïve approach, simple unbound queries scale linearly with the number of predicates
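
The last disadvantage can be made concrete with a small sketch. It assumes the one-table-per-predicate layout from the earlier slides and uses Python's `sqlite3` module; the `describe` helper and the row in `t2` are hypothetical, added only so that both tables contain a match. When the predicate is unknown, the naïve approach has to probe every predicate table, so the number of SQL statements grows with the number of predicates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tmap (pkey INTEGER PRIMARY KEY, p TEXT UNIQUE NOT NULL);
    CREATE TABLE t1 (s TEXT NOT NULL, o TEXT NOT NULL);
    CREATE TABLE t2 (s TEXT NOT NULL, o TEXT NOT NULL);
    INSERT INTO tmap VALUES (1, '<http://ns.example.org/rels#memberOf>');
    INSERT INTO tmap VALUES (2, '<info:fedora/fedora-def:model#disseminates>');
    INSERT INTO t1 VALUES ('<info:fedora/demo:1>', '<info:fedora/demo:3>');
    INSERT INTO t2 VALUES ('<info:fedora/demo:1>', '<info:fedora/demo:5>');
""")


def describe(conn, subject):
    """Answer the pattern (subject, ?p, ?o) by probing each predicate table."""
    found = []
    statements = 0
    mapping = conn.execute("SELECT pkey, p FROM tmap ORDER BY pkey").fetchall()
    for pkey, predicate in mapping:
        statements += 1  # one SELECT per predicate table
        rows = conn.execute(f"SELECT o FROM t{pkey} WHERE s = ?", (subject,))
        found.extend((predicate, o) for (o,) in rows)
    return found, statements


found, statements = describe(conn, "<info:fedora/demo:1>")
```

With roughly 50 predicates, as reported for NSDL, the same lookup would issue about 50 such statements, or one `UNION ALL` over about 50 tables.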

Slide 18: Our Solution Observations: • Total number of distinct predicates is much lower than the number of subjects or objects; NSDL has ~50 • Unbound predicate queries are less common • NSDL is heavily biased towards a high volume of writes and simple queries

Slide 19: Our Solution Enter MPTStore: • Java library that handles all mapping and accounting behind the scenes • API for performing triple writes and queries • Translates queries from a particular language (e.g. SPO, SPARQL) into SQL statements
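
The translation step might look something like the sketch below for the simplest case, a single triple pattern with a known predicate. This is written in Python for brevity and is not MPTStore's API: the `translate` function, the use of `None` for an unbound position, and the dictionary standing in for the mapping table are all assumptions of this example.

```python
MEMBER_OF = "<http://ns.example.org/rels#memberOf>"

# Hypothetical in-memory copy of the predicate-to-table mapping.
TABLE_MAP = {MEMBER_OF: "t1"}


def translate(table_map, s, p, o):
    """Turn a triple pattern with a bound predicate into (sql, params).

    s and o may be None to mark an unbound position; p must be known.
    """
    table = table_map[p]
    clauses = []
    params = []
    if s is not None:
        clauses.append("s = ?")
        params.append(s)
    if o is not None:
        clauses.append("o = ?")
        params.append(o)
    sql = f"SELECT s, o FROM {table}"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


sql, params = translate(TABLE_MAP, "<info:fedora/demo:1>", MEMBER_OF, None)
```

The predicate never appears in the generated SQL as a value; it only selects which table to read, so the query touches just the rows for that predicate.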

Slide 20: Our Solution Designed to expose transaction/connection semantics: • Calling code has to provide JDBC connection for adding, querying triples • Thus, clear path to use advanced transactional capabilities offered by JDBC driver (such as XA)
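
The design choice of having the caller supply the connection can be illustrated as follows. This sketch uses a Python `sqlite3` connection as a stand-in for a JDBC connection; `add_triples`, `count`, and the `TABLES` map are hypothetical names, and nothing here is MPTStore's real API. The point is that the write helper never commits or rolls back, so the caller decides where the transaction boundary lies.

```python
import sqlite3

MEMBER_OF = "<http://ns.example.org/rels#memberOf>"
TABLES = {MEMBER_OF: "t1"}  # hypothetical predicate-to-table map


def add_triples(conn, triples):
    """Write triples on the caller's connection; never commit or roll back."""
    for s, p, o in triples:
        conn.execute(f"INSERT INTO {TABLES[p]} (s, o) VALUES (?, ?)", (s, o))


def count(conn):
    return conn.execute("SELECT COUNT(*) FROM t1").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (s TEXT NOT NULL, o TEXT NOT NULL)")
conn.commit()

triple = ("<info:fedora/demo:1>", MEMBER_OF, "<info:fedora/demo:3>")

# The caller's unit of work fails after the triple write: everything rolls back.
try:
    with conn:  # commits on success, rolls back if an exception escapes
        add_triples(conn, [triple])
        raise ValueError("simulated failure elsewhere in the unit of work")
except ValueError:
    pass
after_rollback = count(conn)

# The same write inside a successful unit of work is committed.
with conn:
    add_triples(conn, [triple])
after_commit = count(conn)
```

Because the triple write shares the caller's transaction, it is immediately visible to later queries on that connection and is undone together with any other work that fails.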

Slide 21: Results MPTStore performance well suited to NSDL use case: • Adds or modifies were significantly faster than Kowari case, and were unaffected by database size • SPO queries were on-par with Kowari in unbound (common) case

Slide 22: Results Bonus: • NSDL team was very familiar with operation of RDBMS administration: performance tuning, backups, etc. • Stored data is transparent and “hackable”: ad-hoc SQL queries and analysis are relatively simple

Slide 23: Results Fedora Bonus: • Ability to easily analyze the database: helped us track down our own middleware bugs (improved Kowari performance).

Slide 24: Fast, Immediate Updates • Graph shows average ms. per datastream modification • MPTStore achieves virtually same performance whether buffering or not • Complete test detail in Fedora 2.2 docs [Bar chart: y-axis 0 to 600 ms; Kowari vs. MPTStore; series Async and Sync]

Slide 25: RI: Future Directions External Resource Index: • Event-based (JMS) updates to external triplestore: analogous to GSearch index updates; may be asynchronous; may index other datastreams • Make full use of triplestore capabilities without compromising the core repository: inference (e.g. krule, RACER); native APIs

Slide 26: RI: Future Directions Internal (Synchronous) Resource Index: • Assumption: XA Transactions • Option A: MPTStore Only. Pro: Simple, synchronous, JDBC (no need for middleware). Con: Basic queries (no iTQL, maybe SPARQL-Lite) • Option B: Mulgara or MPTStore. Pro: Richer queries when using Mulgara (iTQL). Con: Complexity (need for XA-aware middleware?)

Slide 27: Thank You More Information: • http://mptstore.sourceforge.net/ • http://www.fedora.info/download/2.2/ • http://tripletest.sourceforge.net/