The return of the hierarchical model

Jukka Zitting @ Day Software

/Agenda
- Part 1: Hierarchy
  - Concepts
  - Benefits
  - Drawbacks
  - Examples
- Part 2: Case Study
  - JCR
  - Jackrabbit
  - Sling
  - Lessons Learned
- Questions and comments allowed

/Hierarchy/Concepts
- Every record has a parent record
  - Except the root
  - No cyclical parent relations allowed
  - Referential integrity, but often no other reference types supported
- A name identifies a record within its parent
  - The name is not necessarily unique (XML, DNS, etc.)
  - Path as an identifier: /path/to/record
- Record hierarchy is distinct from type hierarchy
  - Structural flexibility, optionally limited by type constraints

[Diagram: example tree with nodes A, B, C, D, E, F]
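
The concepts on this slide can be illustrated with a minimal Python sketch. The `Record` class and its methods are hypothetical names chosen for illustration, not part of any system mentioned in the talk. The sketch assumes names are unique within a parent, which the slide notes is not always the case (XML, DNS). Because a parent is fixed at creation time, cyclical parent relations cannot arise.

```python
from __future__ import annotations


class Record:
    """A named record in a tree; every record except the root has a parent."""

    def __init__(self, name: str, parent: Record | None = None):
        self.name = name
        self.parent = parent
        self.children: dict[str, Record] = {}
        if parent is not None:
            # Assumption: names are unique within a parent (XML and DNS differ).
            if name in parent.children:
                raise ValueError(f"duplicate name {name!r}")
            parent.children[name] = self

    def path(self) -> str:
        """Build the path identifier by walking parent links up to the root."""
        if self.parent is None:
            return "/"
        parts = []
        node = self
        while node.parent is not None:
            parts.append(node.name)
            node = node.parent
        return "/" + "/".join(reversed(parts))

    def resolve(self, path: str) -> Record:
        """Resolve an absolute path, treating this record as the root."""
        node = self
        for name in path.strip("/").split("/"):
            if name:
                node = node.children[name]
        return node


root = Record("")
record = Record("record", Record("to", Record("path", root)))
print(record.path())
```

Here `root.resolve(record.path())` returns the same `record` object, which is what makes the path usable as an identifier.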

/Hierarchy/Benefits
- Natural
  - Data in many domains is inherently hierarchical
  - Easy to understand
- Self-similar
  - Recursive algorithms
  - Incremental map-reduce!
- Scalable
  - Partitioning
  - Parallel processing
- Efficient
  - Highly optimized path-based access and "joins" on the parent-child and subtree relationships
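
One possible reading of "recursive algorithms" and "incremental map-reduce" is sketched below: each node caches a reduced value for its subtree, and a change invalidates only the caches on the path up to the root. The `Node` class and the choice of a sum as the reduce step are assumptions made for illustration; the slides do not prescribe a particular algorithm.

```python
class Node:
    """Tree node that memoizes an aggregate over its subtree."""

    def __init__(self, value=0, parent=None):
        self.value = value
        self.parent = parent
        self.children = []
        self._cached_total = None  # memoized reduce result for this subtree
        if parent is not None:
            parent.children.append(self)
            parent._invalidate()

    def _invalidate(self):
        """Drop cached results on this node and its ancestors only."""
        node = self
        while node is not None:
            node._cached_total = None
            node = node.parent

    def set_value(self, value):
        self.value = value
        self._invalidate()

    def total(self):
        """Recursive reduce: own value plus the totals of all child subtrees."""
        if self._cached_total is None:
            self._cached_total = self.value + sum(c.total() for c in self.children)
        return self._cached_total


root = Node(1)
left = Node(2, root)
right = Node(3, root)
leaf = Node(4, left)
first_total = root.total()

leaf.set_value(10)          # invalidates leaf, left and root, but not right
right_still_cached = right._cached_total is not None
second_total = root.total()
```

After the update, only the changed branch is recomputed; the sibling subtree under `right` is served from its cache.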

/Hierarchy/Drawbacks
- Limited support for references
  - Graph databases solve this problem, at a cost
  - A DAG is a partial solution
- Handling of flat structures
  - Chronological: blogs, tweets, email, log entries, etc.
  - Sets: wiki pages, user accounts, etc.
  - Often requires an artificial hierarchy, e.g. /blog/2010/06/entry-for-today
- Standards are domain-specific or limited in scope
  - POSIX, DNS, XPath/XQuery, JCR, etc.
- Difficulty of organizing things
  - Coming up with good names for records is hard
  - Hierarchy requires maintenance
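
The artificial hierarchy for chronological data can be sketched as a small helper that derives a path from a date. The function name `blog_path` and the `/blog` prefix layout are illustrative assumptions based on the example path on the slide.

```python
from datetime import date


def blog_path(published: date, slug: str) -> str:
    """Place a flat, chronological entry under year and month folders."""
    return f"/blog/{published:%Y}/{published:%m}/{slug}"


example = blog_path(date(2010, 6, 15), "entry-for-today")
print(example)
```

The year and month levels carry no meaning of their own; they exist only to keep any single parent from accumulating too many children.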

/Hierarchy/Examples
- File system
- DNS
- LDAP
- XML
- WebDAV
- RDBMS

/Hierarchy/Examples/File System
- Universally available
  - Two main types: files and folders
  - Notable extensions: /dev/* and /proc/*
  - Unix philosophy: Everything is a file!
- Heavily optimized for specific use cases
  - Limited support for fine-grained data
  - Some systems support things like extended attributes
  - Built-in access controls, but usually no query support
- Major limitations in distributed solutions
  - SAN and NAS solutions reasonably efficient but limited in scope
  - Truly distributed systems like HDFS applicable only for limited use cases

/Hierarchy/Examples/DNS
- Globally distributed, heterogeneous, eventually consistent
  - In production since 1983!
  - Standardized query and update protocols
- Domain-specific, highly optimized for scalability
  - Multiple records can have the same name
  - Fine-grained record types: A, NS, MX, TXT, AAAA, etc.
- Security issues, both in design and implementations
  - Not much impact in practice

/Hierarchy/Examples/LDAP
- Protocol for accessing X.500-style directories
- Record names are constructed from selected properties
  - dn: cn=John Doe, dc=example, dc=com
- Record types defined by extensible schemas
  - Limited form of record references
- Fairly powerful search
  - Though no aggregate queries or arbitrary joins
- Optimized for fine-grained data that is mostly read
  - Replication and distributed use widely supported

/Hierarchy/Examples/XML
- Data storage based on the XML DOM
  - Various levels of conformance
- Highly buzzword compliant in the early 2000s
  - Few of the XML database products are still in active use
  - Inefficient handling of binary data (at all granularities)
- Powerful query and transformation tooling
  - XPath, XQuery, XSLT, etc.
  - Many implementations not optimized for performance
  - Optional type constraints with XML Schema, etc.
- The result? XML extensions in SQL

/Hierarchy/Examples/WebDAV
- Extends HTTP with concepts of collections and properties
  - Also: locking, versioning, search, etc.
- Often used (only) for HTTP-based access to a file system
  - Also leveraged by fs-like systems like Subversion
- Limited XML-based query with PROPFIND
  - More query power with DASL
  - Somewhat heavy-weight for fine-grained access
- Fragmented and often incompatible implementations
  - File system backend as the lowest common denominator
  - cf. AtomPub

/Hierarchy/Examples/RDBMS
- Various ways of representing hierarchies in relational databases
  - Adjacency model: Each row has a reference to the parent
  - Nested sets: Rows numbered in depth-first traversal order
  - etc.
- Little structural flexibility
- Expensive parent-child or subtree joins
  - Vendor-specific extensions to address this problem
- Two words: Impedance mismatch
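
The two representations named above can be compared in a short sketch. It starts from an adjacency mapping (each key references its parent, as a row would) and derives nested-set intervals by numbering nodes in depth-first order. The function names and the shape of the example tree are assumptions for illustration; a real RDBMS would hold these as table columns.

```python
def nested_sets(parents):
    """Convert an adjacency mapping {node: parent} into {node: (left, right)}.

    Numbers are assigned in depth-first traversal order, so a node's
    descendants are exactly the nodes whose interval lies inside its own.
    """
    children = {}
    root = None
    for node, parent in parents.items():
        if parent is None:
            root = node
        else:
            children.setdefault(parent, []).append(node)

    intervals = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        left = counter
        for child in children.get(node, []):
            visit(child)
        counter += 1
        intervals[node] = (left, counter)

    visit(root)
    return intervals


def subtree(intervals, node):
    """Return the descendants of node using only interval comparisons."""
    left, right = intervals[node]
    return sorted(n for n, (l, r) in intervals.items() if left < l and r < right)


# Adjacency model: each entry references its parent (hypothetical tree shape).
parents = {"A": None, "B": "A", "C": "A", "D": "B", "E": "B", "F": "C"}
intervals = nested_sets(parents)
print(subtree(intervals, "B"))
```

The trade-off is visible in the code: the adjacency form makes inserts trivial but needs recursion to find a subtree, while nested sets answer subtree queries with a range comparison but require renumbering when the tree changes.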

/Hierarchy/Summary
- Data storage/management using an explicit tree hierarchy
  - Natural mapping, nice non-functional characteristics
  - Limited functionality, lack of generic standards
- Widely used, but in domain-specific ways
  - Extremely efficient/scalable for certain data models
- How about a generic, feature-rich hierarchical database?

/Case/JCR
- Content Repository for Java Technology API (JCR)
  - JCR 1.0 out in 2005, specified in JSR 170
  - JCR 2.0 out in 2009, specified in JSR 283
  - Work on JCR 2.1 starting
- A content repository is a hierarchical content store with full text search, observation, versioning, transactions, etc.
  - JCR 2.0 adds retention, type management, join queries, etc.
- Designed for both structured and unstructured content
  - Handling of both finely and coarsely grained data
- Application platform more than an integration API

/Case/Jackrabbit
- Reference implementation of both JCR 1.0 and 2.0
  - Primary focus on feature-completeness
  - Apache incubator since 2004, TLP since 2006
- Internal storage through an abstracted key-value API
  - Tree model implemented on top of that
  - Lucene search index maintained separately
  - Separate journal for cluster deployments
- Advanced WebDAV support
- Jackrabbit 3: Focus on scalability, modularity
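
The idea of a tree model layered on a key-value API can be sketched generically. This is not Jackrabbit's actual storage format or API: the `KeyValueTree` class, the entry layout, and the use of a plain dict as the backend are all assumptions made to show the principle that each node is one flat entry holding its properties and a name-to-key map of its children.

```python
import uuid


class KeyValueTree:
    """Tree of named nodes stored as flat entries in a key-value mapping."""

    def __init__(self, store=None):
        # A dict stands in for a real key-value backend (assumption).
        self.store = {} if store is None else store
        self.root_id = "root"
        self.store.setdefault(self.root_id, {"properties": {}, "children": {}})

    def add(self, parent_path, name, **properties):
        """Create a node entry and link it from its parent's child map."""
        parent_id = self._lookup(parent_path)
        node_id = str(uuid.uuid4())
        self.store[node_id] = {"properties": properties, "children": {}}
        self.store[parent_id]["children"][name] = node_id
        return node_id

    def get(self, path):
        """Return the properties of the node at the given path."""
        return self.store[self._lookup(path)]["properties"]

    def _lookup(self, path):
        """Follow child-name entries, one key-value read per path segment."""
        node_id = self.root_id
        for name in filter(None, path.split("/")):
            node_id = self.store[node_id]["children"][name]
        return node_id


tree = KeyValueTree()
tree.add("/", "content")
tree.add("/content", "page", title="Hello")
print(tree.get("/content/page"))
```

Because the backend only sees opaque keys and values, it can be swapped without touching the tree logic, which is one motivation for this kind of layering.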

/Case/Sling
- Web framework based on the JCR content model
  - Apache incubator since 2007, TLP since 2009
- Intuitive URL mapping
  - Path selects the underlying content resource
  - Optional selectors and extensions determine representation
- JSON and POST servlets with JavaScript support
- OSGi for server-side modularity
- Everything is content
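
The URL mapping idea can be approximated in a few lines. This is a simplified sketch, not Sling's actual resolver: the `decompose` function, the set of known resource paths, and the example URL are hypothetical, and details such as suffixes are ignored. It finds the longest dot-delimited prefix that names a known resource and treats the remaining dot-separated parts as selectors followed by an extension.

```python
def decompose(url_path, resources):
    """Split a URL path into (resource path, selectors, extension).

    Tries the full path first, then shorter candidates cut at each "."
    from the right, and picks the longest one that names a known resource.
    """
    candidate = url_path
    while True:
        if candidate in resources:
            rest = url_path[len(candidate):].lstrip(".")
            parts = rest.split(".") if rest else []
            extension = parts[-1] if parts else None
            return candidate, parts[:-1], extension
        head, dot, _ = candidate.rpartition(".")
        if not dot:
            return None, [], None  # no known resource matches
        candidate = head


resources = {"/content/blog/entry"}  # hypothetical content paths
result = decompose("/content/blog/entry.print.a4.html", resources)
print(result)
```

The same content resource can thus be served in several representations, with the choice driven entirely by the trailing part of the URL.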

/Case/Lessons Learned
- Content-driven development
  - Data first, structure later
- Distribute for redundancy
  - Modern hardware goes a long way for scalability/performance
  - For small/medium deployments, distribution is more important for fault tolerance, especially in cloud environments
- Relationships are important
  - JCR 2.0 is a DAG, plus references for expressing full graphs
  - Referential integrity not so important
- Notable data sets are flat
- Don't forget tool support for ad-hoc tasks!

/Questions?
- http://jackrabbit.apache.org/
- http://sling.apache.org/
- http://www.day.com/jsr283
