MPTStore: A Fast, Scalable, and Stable Resource Index


Describes and motivates the creation of MPTStore within the context of the NSDL and Fedora's RDF Resource Index.

  1. MPTStore: A Fast, Scalable, and Stable Resource Index
     Aaron Birkland and Chris Wilper
     Open Repositories 2007, San Antonio, TX
  2. Background: RDF in Fedora
     • A natural fit:
     • Object-object relationships
     • Object properties
     • Exposure to services (as a graph)
     • Resource Index introduced:
     • Fedora 2.0 (January '05)
  3. Background: RDF in Fedora
     • Challenges
     • Scalability
       • Few triplestores designed for 100M+
     • Performance
       • Jena vs. Kowari (Jena: OOM)
       • Kowari vs. Sesame Native (Sesame: slow complex queries)
     • Stability
       • Frequent “rebuilds”
  4. Motivation: The NSDL Use Case
     • The NSDL has a moderately large repository
       • 4.7 million objects
       • 250 million triples
  5. Motivation: The NSDL Use Case
     • The NSDL has a moderately large repository
       • 4.7 million objects
       • 250 million triples
     • ...and has a large volume of writes
       • Driven by periodic OAI harvests
       • Primarily mixed ingests and datastream mods
       • Highly concurrent reads and writes
  6. Motivation: The NSDL Use Case
     • Additionally, NSDL has data model constraints that must be enforced
       • Existential/referential constraints on objects (e.g. “foreign key” constraints)
       • Uniqueness constraints on some object properties
  7. Motivation: The NSDL Use Case
     • These constraints primarily center around RELS-EXT content:
       • Relationships to other NSDL objects (forming a graph)
       • Literal value properties for a particular object itself
  8. <foxml:datastream ID="RELS-EXT" ...>
       ...
       <example:id>PLUGH-XYZZY</example:id>
       <example:objectType>Resource</example:objectType>
       <example:memberOf rdf:resource="info:fedora/demo:73" />
     </foxml:datastream>
     ...
     • example:id: Must be globally unique
     • example:memberOf: This object...
       1) Must exist
       2) Must be 'Active'
       3) Must be objectType 'Aggregation'
  9. Motivation: The NSDL Use Case
     • No suitable constraint enforcement mechanisms exist in Fedora itself
     • Our approach:
       • Enforce content model in middleware
       • Serialize access where we have to
       • Query RI before ingest or modify
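The "query RI before ingest or modify" step can be sketched roughly as follows. This is only an illustration of the idea, not the NSDL middleware: the Resource Index is stood in for by an in-memory set of triples, the predicate name `example:state`, the identifier `PLUGH-0001`, and the function `violations` are all hypothetical, and only the two kinds of constraints from the RELS-EXT example are checked.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def violations(index: Set[Triple], new_triples: Iterable[Triple]) -> List[str]:
    """Return the constraint violations that ingesting new_triples would cause."""
    problems = []
    for s, p, o in new_triples:
        if p == "example:id":
            # Uniqueness: no other subject may already hold this id.
            if any(p2 == p and o2 == o and s2 != s for s2, p2, o2 in index):
                problems.append(f"id {o} is already in use")
        elif p == "example:memberOf":
            # Referential: the target must exist, be Active, and be an Aggregation.
            if (o, "example:state", "Active") not in index:
                problems.append(f"{o} is missing or not Active")
            elif (o, "example:objectType", "Aggregation") not in index:
                problems.append(f"{o} is not an Aggregation")
    return problems


# Hypothetical index contents: one active aggregation.
index = {
    ("info:fedora/demo:73", "example:state", "Active"),
    ("info:fedora/demo:73", "example:objectType", "Aggregation"),
    ("info:fedora/demo:73", "example:id", "PLUGH-0001"),
}

ok = violations(index, [
    ("info:fedora/demo:74", "example:id", "PLUGH-XYZZY"),
    ("info:fedora/demo:74", "example:memberOf", "info:fedora/demo:73"),
])
bad = violations(index, [
    ("info:fedora/demo:75", "example:id", "PLUGH-0001"),
    ("info:fedora/demo:75", "example:memberOf", "info:fedora/demo:99"),
])
```

A check like this is only trustworthy if the index reflects every earlier write, which is why the visibility of recent triples matters so much for this use case.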
  10. The Challenge
      • Querying the RI to determine correct repository state proved to be the most difficult aspect.
        • To achieve acceptable performance with Kowari, triple writes are buffered and executed in large, infrequent chunks
        • Triples waiting in these buffers are invisible to outside queries
  11. The Challenge
      • Possible solution:
        • Flush the buffer after every write operation
      • New problem:
        • Flushed updates with Kowari are very expensive: multiple seconds per operation. This was incompatible with NSDL processing volume
      • This was a real showstopper...
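The visibility problem can be illustrated with a toy model. This is a sketch under simplifying assumptions, not Kowari's or Fedora's actual buffering code: the class `BufferedIndex` and its fields are hypothetical, and the cost of a flush is represented only by a counter.

```python
class BufferedIndex:
    """Toy index whose writes become queryable only after a flush."""

    def __init__(self, flush_threshold: int) -> None:
        self.committed = set()   # triples visible to queries
        self.pending = []        # triples waiting in the write buffer
        self.flush_threshold = flush_threshold
        self.flushes = 0         # stand-in for the cost of flushing

    def add(self, triple) -> None:
        self.pending.append(triple)
        if len(self.pending) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        self.committed.update(self.pending)
        self.pending.clear()
        self.flushes += 1

    def contains(self, triple) -> bool:
        # Queries never look at the pending buffer.
        return triple in self.committed


triple = ("info:fedora/demo:74", "example:id", "PLUGH-XYZZY")

# Large buffer: cheap writes, but a just-written triple is invisible.
buffered = BufferedIndex(flush_threshold=1000)
buffered.add(triple)
seen_before_flush = buffered.contains(triple)
buffered.flush()
seen_after_flush = buffered.contains(triple)

# Flush on every write: always visible, but one flush per operation.
eager = BufferedIndex(flush_threshold=1)
for n in range(5):
    eager.add(("info:fedora/demo:%d" % n, "example:objectType", "Resource"))
```

With a flush costing multiple seconds, the eager variant pays that price on every one of its writes, which is the showstopper described above.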
  12. The Challenge
      • Other difficulties the NSDL had with Kowari:
        • RI corruption under concurrent use
        • RI corruption with abnormal shutdowns
        • Scalability: performance became noticeably worse with increasing repository size
        • Steep memory requirements
  13. The Challenge
      • Searching for a solution...
        • Other triple stores (e.g. Jena, Sesame) had been considered for Fedora in the past and rejected for various reasons
        • RDBMS seemed attractive: efficient transactions, very stable, generally speedy
        • “One big table” paradigm did not seem to give us the desired scalability in initial tests
  14. Our Solution
      • Mapped predicate tables
        • One table per predicate, containing indexed 'subject' and 'object' values
        • Mapping table containing metadata correlating each predicate URI to a particular db table
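  15. [Diagram: a "Triples" table t1 with columns s and o, holding the values <info:fedora/demo:1>, <info:fedora/demo:2>, <info:fedora/demo:3>, and <info:fedora/demo:4>; and a "Predicate Mapping" table tmap with columns pkey and p, holding keys 1 and 2 and the predicates <info:fedora/fedora-def:model#disseminates> and <>]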
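A minimal sketch of the mapped-predicate-table layout, assuming SQLite as a stand-in database. MPTStore itself is a Java library working over JDBC; the helper names `table_for` and `add_triple`, the column types, and the index names here are assumptions for illustration, and the pairing of demo:1 with demo:2 is only an example row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Mapping table: one row per predicate, keyed by the number used in its table name.
conn.execute("CREATE TABLE tmap (pkey INTEGER PRIMARY KEY, p TEXT NOT NULL UNIQUE)")


def table_for(conn: sqlite3.Connection, predicate: str) -> str:
    """Return the triples table for a predicate, creating it on first use."""
    row = conn.execute("SELECT pkey FROM tmap WHERE p = ?", (predicate,)).fetchone()
    if row is not None:
        return f"t{row[0]}"
    pkey = conn.execute("INSERT INTO tmap (p) VALUES (?)", (predicate,)).lastrowid
    # Table names cannot be bound as parameters; pkey is an integer generated above.
    conn.execute(f"CREATE TABLE t{pkey} (s TEXT NOT NULL, o TEXT NOT NULL)")
    conn.execute(f"CREATE INDEX t{pkey}_s ON t{pkey} (s)")
    conn.execute(f"CREATE INDEX t{pkey}_o ON t{pkey} (o)")
    return f"t{pkey}"


def add_triple(conn: sqlite3.Connection, s: str, p: str, o: str) -> None:
    """Insert one triple into the table mapped to its predicate."""
    conn.execute(f"INSERT INTO {table_for(conn, p)} (s, o) VALUES (?, ?)", (s, o))


predicate = "<info:fedora/fedora-def:model#disseminates>"
add_triple(conn, "<info:fedora/demo:1>", predicate, "<info:fedora/demo:2>")

table_name = table_for(conn, predicate)
rows = conn.execute(f"SELECT s, o FROM {table_name}").fetchall()
mapping = conn.execute("SELECT pkey, p FROM tmap").fetchall()
```

Because the predicate is implied by the table, each triple row stores only the subject and object.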
  16. Our Solution
      • Benefits:
        • Low-cost adds and deletes
        • Queries with known predicates are very fast
        • Complex queries benefit because the RDBMS planner has finer-grained statistics and query plans
        • Flexible data partitioning
  17. Our Solution
      • Disadvantages:
        • Need to manage predicate-to-table mapping
        • Complex queries require more effort to formulate
        • With a naïve approach, simple unbound queries scale linearly with the number of predicates
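The last disadvantage can be made concrete with a sketch of the naïve approach, again using SQLite as a stand-in. The function `query_subject`, the two predicates, and the sample rows are hypothetical; the point is only that a pattern with an unbound predicate needs one SELECT per predicate table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tmap (pkey INTEGER PRIMARY KEY, p TEXT UNIQUE)")
for pkey, p in [(1, "<example:memberOf>"), (2, "<example:objectType>")]:
    conn.execute("INSERT INTO tmap VALUES (?, ?)", (pkey, p))
    conn.execute(f"CREATE TABLE t{pkey} (s TEXT, o TEXT)")
conn.execute("INSERT INTO t1 VALUES (?, ?)",
             ("<info:fedora/demo:74>", "<info:fedora/demo:73>"))
conn.execute("INSERT INTO t2 VALUES (?, ?)",
             ("<info:fedora/demo:74>", '"Resource"'))


def query_subject(conn: sqlite3.Connection, s: str):
    """Answer the pattern (s, ?, ?) by visiting every predicate table."""
    pkeys = [row[0] for row in conn.execute("SELECT pkey FROM tmap ORDER BY pkey")]
    if not pkeys:
        return []
    # One SELECT per predicate table, so the statement grows with the
    # number of distinct predicates.
    parts = [
        f"SELECT t.s, m.p, t.o FROM t{k} AS t "
        f"JOIN tmap AS m ON m.pkey = {k} WHERE t.s = ?"
        for k in pkeys
    ]
    return conn.execute(" UNION ALL ".join(parts), [s] * len(pkeys)).fetchall()


results = query_subject(conn, "<info:fedora/demo:74>")
```

With a small, fixed set of predicates the generated statement stays short, which is why the number of distinct predicates matters for this design.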
  18. Our Solution
      • Observations:
        • Total number of distinct predicates is much lower than the number of subjects or objects. NSDL has ~50
        • Unbound predicate queries are less common
        • NSDL is heavily biased towards a high volume of writes and simple queries
  19. Our Solution
      • Enter MPTStore
        • Java library that handles all mapping and accounting behind the scenes
        • API for performing triple writes and queries
        • Translates queries from a particular language (e.g. SPO, SPARQL) into SQL statements
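The query translation step might look roughly like this for the simplest case, a single pattern with a bound predicate. This is not MPTStore's API (which is Java); the function `spo_to_sql`, its signature, and the use of `None` as a wildcard are assumptions made for the sketch.

```python
from typing import Dict, List, Optional, Tuple


def spo_to_sql(
    tmap: Dict[str, str],
    s: Optional[str],
    p: str,
    o: Optional[str],
) -> Tuple[str, List[str]]:
    """Translate a pattern with a bound predicate into SQL plus parameters.

    tmap maps predicate URIs to table names; None stands for a wildcard.
    """
    table = tmap[p]  # the predicate selects the table, so it never appears in WHERE
    conditions: List[str] = []
    params: List[str] = []
    for column, value in (("s", s), ("o", o)):
        if value is not None:
            conditions.append(f"{column} = ?")
            params.append(value)
    sql = f"SELECT s, o FROM {table}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql, params


# Hypothetical mapping with a single predicate.
tmap = {"<example:memberOf>": "t1"}

members_sql, members_params = spo_to_sql(
    tmap, None, "<example:memberOf>", "<info:fedora/demo:73>"
)
all_sql, all_params = spo_to_sql(tmap, None, "<example:memberOf>", None)
```

Values are passed as bound parameters rather than spliced into the SQL text, so URIs and literals need no escaping.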
  20. Our Solution
      • Designed to expose transaction/connection semantics
        • Calling code has to provide the JDBC connection for adding and querying triples
        • Thus, there is a clear path to using advanced transactional capabilities offered by the JDBC driver (such as XA)
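The design choice of having the caller own the connection can be sketched as follows, with Python's sqlite3 standing in for JDBC. The function `add_triple` is hypothetical and the table is created by hand; what matters is that the write function never commits, so the caller decides where the transaction begins and ends.

```python
import sqlite3


def add_triple(conn: sqlite3.Connection, table: str, s: str, o: str) -> None:
    """Write one triple on the caller's connection; never commits or rolls back."""
    conn.execute(f"INSERT INTO {table} (s, o) VALUES (?, ?)", (s, o))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (s TEXT, o TEXT)")
conn.commit()


def count(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM t1").fetchone()[0]


# The caller sees its own uncommitted write immediately...
add_triple(conn, "t1", "<info:fedora/demo:74>", "<info:fedora/demo:73>")
visible_in_txn = count(conn)

# ...and can still abandon it, for example if a constraint check fails.
conn.rollback()
after_rollback = count(conn)

# A successful operation is committed by the caller, not by the library.
add_triple(conn, "t1", "<info:fedora/demo:74>", "<info:fedora/demo:73>")
conn.commit()
after_commit = count(conn)
```

The same shape lets the caller group an index write with other work in one transaction.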
  21. Results
      • MPTStore performance is well suited to the NSDL use case
        • Adds and modifies were significantly faster than with Kowari, and were unaffected by database size
        • SPO queries were on par with Kowari in the unbound (common) case
  22. Results
      • Bonus
        • The NSDL team was already very familiar with RDBMS administration: performance tuning, backups, etc.
        • Stored data is transparent and “hackable”: ad-hoc SQL queries and analysis are relatively simple
  23. Results
      • Fedora Bonus
        • The ability to easily analyze the database helped us track down our own middleware bugs (improved Kowari performance).
  24. Fast, Immediate Updates
      • Graph shows average ms per datastream modification
      • MPTStore achieves virtually the same performance whether buffering or not
      • Complete test detail in Fedora 2.2 docs
  25. RI: Future Directions
      • External Resource Index
        • Event-based (JMS) updates to external triplestore
          • Analogous to GSearch index updates
          • May be asynchronous
          • May index other datastreams
        • Make full use of triplestore capabilities without compromising the core repository
          • Inference (e.g. krule, RACER)
          • Native APIs
  26. RI: Future Directions
      • Internal (Synchronous) Resource Index
        • Assumption: XA Transactions.
        • Option A: MPTStore Only
          • Pro: Simple, synchronous, JDBC (no need for middleware)
          • Con: Basic queries (no iTQL, maybe SPARQL-Lite)
        • Option B: Mulgara or MPTStore
          • Pro: Richer queries when using Mulgara (iTQL)
          • Con: Complexity (need for XA-aware middleware?)
  27. Thank You
      • More Information