The return of the hierarchical model

Published in: Technology

  1. The Return of the Hierarchical Model
     Jukka Zitting @ Day Software
  2. /Agenda
     - Part 1: Hierarchy
       - Concepts
       - Benefits
       - Drawbacks
       - Examples
     - Part 2: Case Study
       - JCR
       - Jackrabbit
       - Sling
       - Lessons Learned
     Questions and comments allowed
  3. /Hierarchy/Concepts
     - Every record has a parent record
       - Except the root
       - No cyclical parent relations allowed
       - Referential integrity, but often no other reference types supported
     - A name identifies a record within its parent
       - The name is not necessarily unique (XML, DNS, etc.)
       - Path as an identifier: /path/to/record
     - Record hierarchy is distinct from type hierarchy
       - Structural flexibility, optionally limited by type constraints
     (Diagram: example tree of records labeled A, B, C, D, E, F)
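The concepts on this slide can be sketched in a few lines of Python. This is only an illustration, not code from the talk: the `Node` class and its methods are hypothetical, and it assumes names are unique within a parent, which the slide notes is not always true (XML, DNS).

```python
class Node:
    """A record in a hierarchy: one parent (None for the root), named children."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = {}  # assumption: names are unique within a parent
        if parent is not None:
            parent.children[name] = self

    def path(self):
        """Build the path identifier by walking up to the root."""
        if self.parent is None:
            return "/"
        prefix = self.parent.path()
        return prefix + self.name if prefix == "/" else prefix + "/" + self.name

    def resolve(self, path):
        """Look up a descendant by a path relative to this node (leading
        slashes are ignored); returns None if any segment is missing."""
        node = self
        for name in filter(None, path.split("/")):
            node = node.children.get(name)
            if node is None:
                return None
        return node


root = Node("")
record = Node("record", Node("to", Node("path", root)))
print(record.path())  # /path/to/record
```

Because every record has exactly one parent, the path is derived rather than stored, and resolving it is one dictionary lookup per segment.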
  4. /Hierarchy/Benefits
     - Natural
       - Data in many domains is inherently hierarchical
       - Easy to understand
     - Self-similar
       - Recursive algorithms
       - Incremental map-reduce!
     - Scalable
       - Partitioning
       - Parallel processing
     - Efficient
       - Highly optimized path-based access and "joins" on the parent-child and subtree relationships
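One way to read "recursive algorithms" and "incremental map-reduce" is that each node can cache an aggregate over its subtree, so a change to one record only touches the path up to the root. The sketch below is an assumption about what that could look like, not the speaker's implementation. `SizedNode` and `total` are hypothetical names, and the delta trick shown only works for invertible aggregates such as sums.

```python
class SizedNode:
    """Tree node that caches an aggregate (here: a sum) over its subtree."""

    def __init__(self, name, value=0, parent=None):
        self.name = name
        self.value = value
        self.parent = parent
        self.children = []
        self.subtree_total = value
        if parent is not None:
            parent.children.append(self)
            parent._propagate(value)

    def _propagate(self, delta):
        """Apply a change to the cached totals of this node and its ancestors."""
        node = self
        while node is not None:
            node.subtree_total += delta
            node = node.parent

    def set_value(self, value):
        """Update one record; only the path to the root is adjusted."""
        delta = value - self.value
        self.value = value
        self._propagate(delta)


def total(node):
    """Full recursive reduce over a subtree, for comparison."""
    return node.value + sum(total(child) for child in node.children)


root = SizedNode("root")
a = SizedNode("a", 2, root)
b = SizedNode("b", 3, a)
c = SizedNode("c", 5, root)
b.set_value(10)
print(root.subtree_total, total(root))  # 17 17
```

The recursive `total` visits every node, while `set_value` does work proportional to the depth of the changed record.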
  5. /Hierarchy/Drawbacks
     - Limited support for references
       - Graph databases solve this problem, at a cost
       - DAG a partial solution
     - Handling of flat structures
       - Chronological: blogs, tweets, email, log entries, etc.
       - Sets: wiki pages, user accounts, etc.
       - Often requires an artificial hierarchy, e.g. /blog/2010/06/entry-for-today
     - Standards are domain-specific or limited in scope
       - POSIX, DNS, XPath/XQuery, JCR, etc.
     - Difficulty of organizing things
       - Coming up with good names for records is hard
       - Hierarchy requires maintenance
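The "artificial hierarchy" for flat data can be generated mechanically. The sketch below shows two common approaches under stated assumptions: date-based paths for chronological data, matching the slide's example, and hash buckets for unordered sets. The function names and the hash-bucket scheme are illustrative choices, not something the slides prescribe.

```python
import hashlib
from datetime import date


def chronological_path(collection, day, slug):
    """Spread chronological records over year/month folders."""
    return f"/{collection}/{day.year:04d}/{day.month:02d}/{slug}"


def bucketed_path(collection, name, width=2):
    """Spread an unordered set over buckets taken from a hash prefix."""
    bucket = hashlib.sha1(name.encode("utf-8")).hexdigest()[:width]
    return f"/{collection}/{bucket}/{name}"


print(chronological_path("blog", date(2010, 6, 14), "entry-for-today"))
# /blog/2010/06/entry-for-today
print(bucketed_path("users", "jdoe"))
```

Either scheme keeps the number of children per parent bounded, at the cost of paths that no longer carry much meaning for a human reader.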
  6. /Hierarchy/Examples
     - File system
     - DNS
     - LDAP
     - XML
     - WebDAV
     - RDBMS
  7. /Hierarchy/Examples/File System
     - Universally available
     - Two main types: files and folders
       - Notable extensions: /dev/* and /proc/*
       - Unix philosophy: Everything is a file!
     - Heavily optimized for specific use cases
     - Limited support for fine-grained data
       - Some systems support things like extended attributes
     - Built-in access controls, but usually no query support
     - Major limitations in distributed solutions
       - SAN and NAS solutions reasonably efficient but limited in scope
       - Truly distributed systems like HDFS applicable only for limited use cases
  8. /Hierarchy/Examples/DNS
     - Globally distributed, heterogeneous, eventually consistent
       - In production since 1983!
     - Standardized query and update protocols
     - Domain-specific, highly optimized for scalability
     - Multiple records can have the same name
     - Fine-grained record types: A, NS, MX, TXT, AAAA, etc.
     - Security issues, both in design and implementations
       - Not much impact in practice
  9. /Hierarchy/Examples/LDAP
     - Protocol for accessing X.500-style directories
     - Record names are constructed from selected properties
       - dn: cn=John Doe, dc=example, dc=com
     - Record types defined by extensible schemas
     - Limited form of record references
     - Fairly powerful search
       - Though no aggregate queries or arbitrary joins
     - Optimized for fine-grained data that is mostly read
     - Replication and distributed use widely supported
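A distinguished name lists its components from the leaf up to the root, so reversing it gives a root-first path. The helper below is a deliberately naive sketch: `dn_to_path` is a hypothetical name, and it assumes no escaped commas or multi-valued components, both of which real DNs can contain.

```python
def dn_to_path(dn):
    """Turn a leaf-first DN into a root-first path (naive comma split)."""
    rdns = [part.strip() for part in dn.split(",")]
    return "/" + "/".join(reversed(rdns))


print(dn_to_path("cn=John Doe, dc=example, dc=com"))
# /dc=com/dc=example/cn=John Doe
```

Each path segment keeps the property name and value it was built from, which is what the slide means by record names constructed from selected properties.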
  10. /Hierarchy/Examples/XML
      - Data storage based on the XML DOM
        - Various levels of conformance
      - Highly buzzword compliant in the early 2000s
        - Few of the XML database products are still in active use
      - Inefficient handling of binary data (at all granularities)
      - Powerful query and transformation tooling
        - XPath, XQuery, XSLT, etc.
        - Many implementations not optimized for performance
      - Optional type constraints with XML Schema, etc.
      - The result? XML extensions in SQL
  11. /Hierarchy/Examples/WebDAV
      - Extends HTTP with concepts of collections and properties
        - Also: locking, versioning, search, etc.
      - Often used (only) for HTTP-based access to a file system
        - Also leveraged by fs-like systems like Subversion
      - Limited XML-based query with PROPFIND
        - More query power with DASL
      - Somewhat heavy-weight for fine-grained access
      - Fragmented and often incompatible implementations
        - File system backend as the lowest common denominator
        - cf. AtomPub
  12. /Hierarchy/Examples/RDBMS
      - Various ways of representing hierarchies in an RDBMS
        - Adjacency model: Each row has a reference to the parent
        - Nested sets: Rows numbered in depth-first traversal order
        - etc.
      - Little structural flexibility
      - Expensive parent-child or subtree joins
        - Vendor-specific extensions to address this problem
      - Two words: Impedance mismatch
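The two representations named on this slide can be compared side by side with Python's built-in `sqlite3` module. The table layout and sample rows below are made up for illustration. The adjacency query uses a recursive common table expression, which SQLite supports but which is an assumption about the database in use; the nested-set query needs only a range comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (
        id     INTEGER PRIMARY KEY,
        parent INTEGER REFERENCES node(id),  -- adjacency model
        name   TEXT NOT NULL,
        lft    INTEGER NOT NULL,             -- nested sets: depth-first numbering
        rgt    INTEGER NOT NULL
    );
    INSERT INTO node VALUES
        (1, NULL, 'root', 1, 10),
        (2, 1,    'blog', 2, 7),
        (3, 2,    '2010', 3, 6),
        (4, 3,    '06',   4, 5),
        (5, 1,    'wiki', 8, 9);
""")

# Adjacency model: the subtree is found by repeatedly joining on the parent.
adjacency = [row[0] for row in conn.execute("""
    WITH RECURSIVE subtree(id, name) AS (
        SELECT id, name FROM node WHERE name = 'blog'
        UNION ALL
        SELECT n.id, n.name FROM node n JOIN subtree s ON n.parent = s.id
    )
    SELECT name FROM subtree ORDER BY id
""")]

# Nested sets: the subtree is every row numbered inside the parent's interval.
nested = [row[0] for row in conn.execute("""
    SELECT c.name
    FROM node p JOIN node c ON c.lft BETWEEN p.lft AND p.rgt
    WHERE p.name = 'blog'
    ORDER BY c.lft
""")]

print(adjacency)  # ['blog', '2010', '06']
print(nested)     # ['blog', '2010', '06']
```

The trade-off is visible even at this size: the adjacency model makes inserts cheap but subtree reads iterative, while nested sets make subtree reads a single range scan but force renumbering when the tree changes.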
  13. /Hierarchy/Summary
      - Data storage/management using an explicit tree hierarchy
      - Natural mapping, nice non-functional characteristics
      - Limited functionality, lack of generic standards
      - Widely used, but in domain-specific ways
        - Extremely efficient/scalable for certain data models
      - How about a generic, feature-rich hierarchical database?
  14. /Case/JCR
      - Content Repository for Java Technology API (JCR)
        - JCR 1.0 out in 2005, specified in JSR 170
        - JCR 2.0 out in 2009, specified in JSR 283
        - Work on JCR 2.1 starting
      - A content repository is a hierarchical content store
        - with full text search, observation, versioning, transactions, etc.
        - JCR 2.0 adds retention, type management, join queries, etc.
      - Designed for both structured and unstructured content
        - handling of both finely and coarsely grained data
      - Application platform more than an integration API
  15. /Case/Jackrabbit
      - Reference implementation of both JCR 1.0 and 2.0
        - Primary focus on feature-completeness
      - Apache incubator since 2004, TLP since 2006
      - Internal storage through an abstracted key-value API
        - Tree model implemented on top of that
        - Lucene search index maintained separately
        - Separate journal for cluster deployments
      - Advanced WebDAV support
      - Jackrabbit 3: Focus on scalability, modularity
  16. /Case/Sling
      - Web framework based on the JCR content model
      - Apache incubator since 2007, TLP since 2009
      - Intuitive URL mapping
        - Path selects the underlying content resource
        - Optional selectors and extensions determine representation
      - JSON and POST servlets with JavaScript support
      - OSGi for server-side modularity
      - Everything is content
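The URL mapping described on this slide can be approximated with a purely syntactic split of the last path segment. This is a simplified sketch, not Sling's actual resolution logic, which consults the repository to find the resource and also handles suffixes. The `decompose` function and the example URLs are hypothetical.

```python
def decompose(url_path):
    """Split a request path into (resource path, selectors, extension)."""
    head, _, last = url_path.rpartition("/")
    name, *rest = last.split(".")
    if not rest:
        # No dot in the last segment: the whole path names the resource.
        return url_path, [], None
    *selectors, extension = rest
    return f"{head}/{name}", selectors, extension


print(decompose("/content/blog/entry.print.html"))
# ('/content/blog/entry', ['print'], 'html')
print(decompose("/content/blog/entry.json"))
# ('/content/blog/entry', [], 'json')
```

The path picks the content, and everything after the first dot only influences how that content is rendered.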
  17. /Case/Lessons Learned
      - Content-driven development
        - Data first, structure later
      - Distribute for redundancy
        - Modern hardware goes a long way for scalability/performance
        - For small and medium deployments, distribution matters more for fault tolerance, especially in cloud environments
      - Relationships are important
        - JCR 2.0 is a DAG, plus references for expressing full graphs
        - Referential integrity not so important
      - Notable data sets are flat
      - Don't forget tool support for ad-hoc tasks!
  18. /Questions?
      - http://jackrabbit.apache.org/
      - http://sling.apache.org/
      - http://www.day.com/jsr283
