The Return of the Hierarchical Model
Jukka Zitting @ Day Software

/Agenda
- Part 1: Hierarchy
  - Concepts
  - Benefits
  - Drawbacks
  - Examples
- Part 2: Case Study
  - JCR
  - Jackrabbit
  - Sling
  - Lessons Learned
- Questions and comments allowed

/Hierarchy/Concepts
- Every record has a parent record
  - Except the root
  - No cyclical parent relations allowed
  - Referential integrity, but often no other reference types supported
- A name identifies a record within its parent
  - The name is not necessarily unique (XML, DNS, etc.)
  - Path as an identifier: /path/to/record
- Record hierarchy is distinct from type hierarchy
  - Structural flexibility, optionally limited by type constraints
[Figure: example tree with nodes A, B, C, D, E, F]
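
The concepts above can be sketched in a few lines of Python. This is only an illustration: the `Record` class and its methods are hypothetical names, and the sketch assumes names are unique among siblings so that a path identifies exactly one record (which, as noted above, is not true of every hierarchical system).

```python
from __future__ import annotations


class Record:
    """A named record with at most one parent (hypothetical illustration)."""

    def __init__(self, name: str, parent: Record | None = None) -> None:
        self.name = name
        self.parent: Record | None = None
        self.children: list[Record] = []
        if parent is not None:
            self.move_to(parent)

    def ancestors(self):
        """Yield the parent, grandparent, and so on up to the root."""
        node = self.parent
        while node is not None:
            yield node
            node = node.parent

    def move_to(self, new_parent: Record) -> None:
        """Attach this record under new_parent, rejecting cyclical relations."""
        if new_parent is self or any(a is self for a in new_parent.ancestors()):
            raise ValueError("cyclical parent relation")
        if self.parent is not None:
            self.parent.children.remove(self)
        self.parent = new_parent
        new_parent.children.append(self)

    def path(self) -> str:
        """Build the path identifier from the names between root and record."""
        if self.parent is None:
            return "/"
        # The root itself contributes no name segment to the path.
        names = [self.name] + [a.name for a in self.ancestors() if a.parent is not None]
        return "/" + "/".join(reversed(names))
```

Building root, `path`, `to` and `record` in that order gives a record whose `path()` is `/path/to/record`, while trying to move `path` underneath `record` raises `ValueError`.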

/Hierarchy/Benefits
- Natural
  - Data in many domains is inherently hierarchical
  - Easy to understand
- Self-similar
  - Recursive algorithms
  - Incremental map-reduce!
- Scalable
  - Partitioning
  - Parallel processing
- Efficient
  - Highly optimized path-based access and “joins” on the parent-child and subtree relationships
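
The slide does not spell out what "incremental map-reduce" means. One possible reading, sketched here as an assumption, is that every node caches an aggregate over its subtree, so a change only requires recomputing the aggregates on the path from the changed node to the root. The `Node` class and the choice of a sum as the aggregate are hypothetical.

```python
class Node:
    """Tree node that caches the sum of all values in its subtree."""

    def __init__(self, value=0, parent=None):
        self.value = value
        self.parent = parent
        self.children = []
        self.subtree_sum = value
        if parent is not None:
            parent.children.append(self)
            parent._refresh_upwards()

    def _refresh_upwards(self):
        # Recompute cached aggregates only on the path to the root;
        # sibling subtrees are reused through their cached sums.
        node = self
        while node is not None:
            node.subtree_sum = node.value + sum(c.subtree_sum for c in node.children)
            node = node.parent

    def set_value(self, value):
        self.value = value
        self._refresh_upwards()


def full_sum(node):
    """Plain recursive aggregation over the whole subtree, for comparison."""
    return node.value + sum(full_sum(child) for child in node.children)
```

After an update, `root.subtree_sum` should agree with `full_sum(root)`, but the incremental version touches only as many nodes as the tree is deep.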

/Hierarchy/Drawbacks
- Limited support for references
  - Graph databases solve this problem, at a cost
  - A DAG is a partial solution
- Handling of flat structures
  - Chronological: blogs, tweets, email, log entries, etc.
  - Sets: wiki pages, user accounts, etc.
  - Often requires an artificial hierarchy, e.g. /blog/2010/06/entry-for-today
- Standards are domain-specific or limited in scope
  - POSIX, DNS, XPath/XQuery, JCR, etc.
- Difficulty of organizing things
  - Coming up with good names for records is hard
  - Hierarchy requires maintenance
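
As a small illustration of the artificial hierarchy mentioned above, a chronological entry can be given a path derived from its date. The function name `blog_path` and the `/blog/<year>/<month>/<name>` layout are assumptions modeled on the example path on the slide.

```python
from datetime import date


def blog_path(published: date, name: str) -> str:
    """Place a flat, chronological entry into a year/month hierarchy."""
    return f"/blog/{published.year:04d}/{published.month:02d}/{name}"
```

For example, `blog_path(date(2010, 6, 15), "entry-for-today")` yields `/blog/2010/06/entry-for-today`. The year and month levels carry no meaning of their own; they exist only to keep any single folder from growing without bound.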

/Hierarchy/Examples
- File system
- DNS
- LDAP
- XML
- WebDAV
- RDBMS

/Hierarchy/Examples/File System
- Universally available
  - Two main types: files and folders
  - Notable extensions: /dev/* and /proc/*
  - Unix philosophy: Everything is a file!
- Heavily optimized for specific use cases
  - Limited support for fine-grained data
  - Some systems support things like extended attributes
  - Built-in access controls, but usually no query support
- Major limitations in distributed solutions
  - SAN and NAS solutions reasonably efficient but limited in scope
  - Truly distributed systems like HDFS applicable only for limited use cases

/Hierarchy/Examples/DNS
- Globally distributed, heterogeneous, eventually consistent
  - In production since 1983!
  - Standardized query and update protocols
- Domain-specific, highly optimized for scalability
  - Multiple records can have the same name
  - Fine-grained record types: A, NS, MX, TXT, AAAA, etc.
- Security issues, both in design and implementations
  - Not much impact in practice
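
To make the DNS hierarchy visible, a domain name can be rewritten as a root-first path, since DNS names list their labels from the leaf up to the root. The helper `domain_to_path` is a hypothetical name used only for this sketch.

```python
def domain_to_path(domain: str) -> str:
    """Rewrite a domain name as a root-first path of its labels."""
    # Drop the optional trailing dot (the root) and normalize case,
    # since DNS names are compared case-insensitively.
    labels = [label for label in domain.rstrip(".").lower().split(".") if label]
    return "/" + "/".join(reversed(labels))
```

Here `domain_to_path("www.example.com")` gives `/com/example/www`.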

/Hierarchy/Examples/LDAP
- Protocol for accessing X.500-style directories
- Record names are constructed from selected properties
  - dn: cn=John Doe, dc=example, dc=com
- Record types defined by extensible schemas
  - Limited form of record references
- Fairly powerful search
  - Though no aggregate queries or arbitrary joins
- Optimized for fine-grained data that is mostly read
  - Replication and distributed use widely supported
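
The distinguished name above reads from the record up to the root. A rough sketch of turning it into a root-first list of (property, value) pairs follows. It is deliberately naive: it splits on commas and ignores the escaping and multi-valued components that real DNs allow, and `dn_to_path` is a hypothetical helper, not part of any LDAP library.

```python
from __future__ import annotations


def dn_to_path(dn: str) -> list[tuple[str, str]]:
    """Split a simple DN into root-first (property, value) pairs."""
    components = [part.strip() for part in dn.split(",")]
    pairs = [tuple(component.split("=", 1)) for component in components]
    return list(reversed(pairs))
```

With the DN from the slide, `dn_to_path("cn=John Doe, dc=example, dc=com")` returns `[("dc", "com"), ("dc", "example"), ("cn", "John Doe")]`.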

/Hierarchy/Examples/XML
- Data storage based on the XML DOM
  - Various levels of conformance
- Highly buzzword compliant in the early 2000s
  - Few of the XML database products are still in active use
  - Inefficient handling of binary data (at all granularities)
- Powerful query and transformation tooling
  - XPath, XQuery, XSLT, etc.
  - Many implementations not optimized for performance
  - Optional type constraints with XML Schema, etc.
- The result? XML extensions in SQL

/Hierarchy/Examples/WebDAV
- Extends HTTP with concepts of collections and properties
  - Also: locking, versioning, search, etc.
- Often used (only) for HTTP-based access to a file system
  - Also leveraged by fs-like systems like Subversion
- Limited XML-based query with PROPFIND
  - More query power with DASL
  - Somewhat heavy-weight for fine-grained access
- Fragmented and often incompatible implementations
  - File system backend as the lowest common denominator
  - cf. AtomPub

/Hierarchy/Examples/RDBMS
- Various ways of representing hierarchies in relational database systems
  - Adjacency model: Each row has a reference to the parent
  - Nested sets: Rows numbered in depth-first traversal order
  - etc.
- Little structural flexibility
- Expensive parent-child or subtree joins
  - Vendor-specific extensions to address this problem
- Two words: Impedance mismatch
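
The two representations named above can be compared with a minimal sketch using Python's built-in `sqlite3` module. The table layout, the column names (`parent_id`, `lft`, `rgt`) and the sample rows are assumptions chosen for illustration. The adjacency version needs a recursive query to collect a subtree, while the nested-set version answers the same question with a range check at the cost of renumbering on writes.

```python
import sqlite3

# Sample rows: (id, parent_id, name); the root has no parent.
rows = [(1, None, "root"), (2, 1, "a"), (3, 1, "b"), (4, 2, "c"), (5, 2, "d")]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER, "
    "name TEXT, lft INTEGER, rgt INTEGER)"
)
conn.executemany("INSERT INTO node (id, parent_id, name) VALUES (?, ?, ?)", rows)


def subtree_adjacency(conn, root_id):
    """Adjacency model: follow parent references with a recursive CTE."""
    sql = """
        WITH RECURSIVE sub(id) AS (
            SELECT id FROM node WHERE id = ?
            UNION ALL
            SELECT node.id FROM node JOIN sub ON node.parent_id = sub.id
        )
        SELECT name FROM node WHERE id IN (SELECT id FROM sub) ORDER BY name
    """
    return [name for (name,) in conn.execute(sql, (root_id,))]


def number_nested_sets(rows):
    """Assign (lft, rgt) numbers to each node in depth-first traversal order."""
    children = {}
    for node_id, parent_id, _name in rows:
        children.setdefault(parent_id, []).append(node_id)
    bounds = {}
    counter = 0

    def visit(node_id):
        nonlocal counter
        counter += 1
        left = counter
        for child in children.get(node_id, []):
            visit(child)
        counter += 1
        bounds[node_id] = (left, counter)

    for root in children.get(None, []):
        visit(root)
    return bounds


for node_id, (lft, rgt) in number_nested_sets(rows).items():
    conn.execute("UPDATE node SET lft = ?, rgt = ? WHERE id = ?", (lft, rgt, node_id))


def subtree_nested(conn, root_id):
    """Nested sets: a subtree is every row numbered inside the root's range."""
    sql = """
        SELECT child.name FROM node AS child, node AS top
        WHERE top.id = ? AND child.lft BETWEEN top.lft AND top.rgt
        ORDER BY child.name
    """
    return [name for (name,) in conn.execute(sql, (root_id,))]
```

Both `subtree_adjacency(conn, 2)` and `subtree_nested(conn, 2)` return the names of node `a` and its two children.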

/Hierarchy/Summary
- Data storage/management using an explicit tree hierarchy
  - Natural mapping, nice non-functional characteristics
  - Limited functionality, lack of generic standards
- Widely used, but in domain-specific ways
  - Extremely efficient/scalable for certain data models
- How about a generic, feature-rich hierarchical database?

/Case/JCR
- Content Repository for Java Technology API (JCR)
  - JCR 1.0 out in 2005, specified in JSR 170
  - JCR 2.0 out in 2009, specified in JSR 283
  - Work on JCR 2.1 starting
- A content repository is a hierarchical content store with full text search, observation, versioning, transactions, etc.
  - JCR 2.0 adds retention, type management, join queries, etc.
- Designed for both structured and unstructured content
  - Handling of both finely and coarsely grained data
  - Application platform more than an integration API

/Case/Jackrabbit
- Reference implementation of both JCR 1.0 and 2.0
  - Primary focus on feature-completeness
  - Apache incubator since 2004, TLP since 2006
- Internal storage through an abstracted key-value API
  - Tree model implemented on top of that
  - Lucene search index maintained separately
  - Separate journal for cluster deployments
  - Advanced WebDAV support
- Jackrabbit 3: Focus on scalability, modularity
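
The general idea of layering a tree model on a key-value store can be sketched as follows. This is not Jackrabbit's actual persistence format or API: the `KeyValueTree` class, the use of a plain dictionary as the store, and the shape of the stored values are all assumptions made for illustration.

```python
import uuid


class KeyValueTree:
    """Tree of named nodes kept in a flat key-value store (illustrative only)."""

    def __init__(self):
        # Stand-in for a key-value backend: node id -> node data.
        self.store = {}
        self.root_id = self._put(parent_id=None)

    def _put(self, parent_id):
        node_id = str(uuid.uuid4())
        self.store[node_id] = {"parent": parent_id, "children": {}, "properties": {}}
        return node_id

    def add_child(self, parent_id, name):
        """Create a child and record its id under its name in the parent."""
        parent = self.store[parent_id]
        if name in parent["children"]:
            raise KeyError(f"child already exists: {name}")
        child_id = self._put(parent_id)
        parent["children"][name] = child_id
        return child_id

    def resolve(self, path):
        """Walk from the root, one key-value lookup per path segment."""
        node_id = self.root_id
        for name in filter(None, path.split("/")):
            node_id = self.store[node_id]["children"][name]
        return node_id
```

Resolving a path such as `/content/page` then costs one lookup per segment, which is the kind of path-based access a hierarchical store can optimize.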

/Case/Sling
- Web framework based on the JCR content model
  - Apache incubator since 2007, TLP since 2009
- Intuitive URL mapping
  - Path selects the underlying content resource
  - Optional selectors and extensions determine representation
- JSON and POST servlets with JavaScript support
- OSGi for server-side modularity
- Everything is content
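
The URL mapping described above can be approximated with a small sketch. It is a simplification under stated assumptions: it treats everything after the first dot in the last segment as selectors plus extension, assumes resource names contain no dots, and ignores the fact that a real resolver checks which resource actually exists. The function `decompose` and the example URL are hypothetical.

```python
def decompose(url_path: str) -> dict:
    """Split a URL path into resource path, selectors and extension."""
    head, _, last = url_path.rpartition("/")
    name, *rest = last.split(".")
    return {
        "path": f"{head}/{name}",
        "selectors": rest[:-1],          # everything between name and extension
        "extension": rest[-1] if rest else None,
    }
```

For instance, `decompose("/content/page.print.a4.html")` yields the path `/content/page`, the selectors `print` and `a4`, and the extension `html`, so the same content resource can be rendered in several representations.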

/Case/Lessons Learned
- Content-driven development
  - Data first, structure later
- Distribute for redundancy
  - Modern hardware goes a long way for scalability/performance
  - For small/medium deployments, distribution is more important for fault tolerance, especially in cloud environments
- Relationships are important
  - JCR 2.0 is a DAG, plus references for expressing full graphs
  - Referential integrity not so important
  - Notable data sets are flat
- Don’t forget tool support for ad-hoc tasks!

/Questions?
- http://jackrabbit.apache.org/
- http://sling.apache.org/
- http://www.day.com/jsr283
