Consistent NoSQL data storage with ModeShape (NoSQL Matters 2013)


ModeShape 3 is an elastic, strongly consistent hierarchical database that supports queries, full-text search, versioning, events, locking, and schema-rich or schema-less constraints. It is well suited to storing files and hierarchically structured data that will be accessed by navigation or queries. You can choose where (if at all) you want ModeShape to enforce your schema, and your structure and schema can always evolve as your needs change. Sequencers make it easy to extract structure from stored files, and federation can bring information from external systems into your database. It's fast, sits on top of an Infinispan data grid, and is open source. This presentation provides an introduction to how ModeShape 3 works.

Published in: Technology, Education


  1. Elastic consistent NoSQL data storage with ModeShape 3
     NoSQL Matters 2013, Cologne, Germany, April 26, 2013
     Randall Hauch, Principal Software Engineer at Red Hat
     @rhauch, @modeshape
  2. SQL databases
     • BLOB or CLOB
     • recursive JOINs and queries
     • SQL types (CHAR, VARCHAR, etc.)
  3. SQL databases
     • BLOB or CLOB
     • recursive JOINs and queries
     • SQL types (CHAR, VARCHAR, etc.)
  4. NoSQL databases
  5. NoSQL databases
     • Document
     • Key/Value
     • Column-oriented
     • Graph
     • Others, including hierarchical...
  6. ModeShape
     An open source elastic in-memory hierarchical database with queries, transactions, events & more
  7. Hierarchical
     • Organize the data into a tree structure
       – A lot of data has natural hierarchies
       – Conceptually similar to a file system
       – Nodes with properties
       – References enable graphs (not limited to parent/child)
     • Navigate or query
       – Quickly navigate to related (or contained) data
       – Use queries to find data independently of location
  8. Nodes and names
     Each node has a name.
     • Node names
       – consist of a local part and a namespace (like XML names)
       – need not be unique within a parent node (but it is recommended)
     • Namespaces
       – are URIs that are registered and can be assigned a prefix
       – prefixes are repository-wide, but can be permanently changed or overridden locally by clients
     • Examples
       – Namespace prefix: “” (empty string), local part: “equipment”
       – Namespace prefix: “jcr”, local part: “system”
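The prefix and local-part split described on this slide can be sketched in plain Java. This is a toy model, not the `javax.jcr.NamespaceRegistry` API: the class `QualifiedNames` and its methods are hypothetical names chosen for illustration. The only facts assumed beyond the slide are the standard JCR namespace URI for the `jcr` prefix and the `{uri}local` expanded-name notation.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy namespace registry (hypothetical; not the javax.jcr API). */
final class QualifiedNames {
    private final Map<String, String> prefixToUri = new HashMap<>();

    QualifiedNames() {
        // The empty prefix maps to the empty namespace, as in the "equipment" example.
        prefixToUri.put("", "");
        // Standard JCR namespace URI for the "jcr" prefix.
        prefixToUri.put("jcr", "http://www.jcp.org/jcr/1.0");
    }

    /** Registers (or re-maps) a prefix for a namespace URI. */
    void register(String prefix, String uri) {
        prefixToUri.put(prefix, uri);
    }

    /** Expands "prefix:local" (or just "local") to the form "{uri}local". */
    String expand(String name) {
        int colon = name.indexOf(':');
        String prefix = colon < 0 ? "" : name.substring(0, colon);
        String local = colon < 0 ? name : name.substring(colon + 1);
        String uri = prefixToUri.get(prefix);
        if (uri == null) {
            throw new IllegalArgumentException("Unregistered prefix: " + prefix);
        }
        return "{" + uri + "}" + local;
    }
}
```

Because names resolve through the registry, re-mapping a prefix changes how a name is written without changing the namespace it belongs to.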
  9. Node paths
     Each node is identified by a path.
     • Absolute paths
       – the sequence of names from the root to the node in question
       – always start with a ‘/’ signifying the root node
       – may use a 1-based same-name-sibling positional index (which can change if the order of the children changes)
     • These paths are equivalent:
       /facilities/San Francisco/Eastford Plaza
       /facilities[1]/San Francisco[1]/Eastford Plaza[1]
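The equivalence of the two path forms can be illustrated with a small helper that adds the implicit `[1]` index to every segment that lacks one. This is a sketch under the assumption that paths are handled as plain strings; `PathIndexes` is a hypothetical class, not part of JCR or ModeShape.

```java
/** Hypothetical helper that makes same-name-sibling indexes explicit. */
final class PathIndexes {
    /** Appends "[1]" to each segment of an absolute path that has no index. */
    static String withExplicitIndexes(String absolutePath) {
        if (!absolutePath.startsWith("/")) {
            throw new IllegalArgumentException("Not an absolute path: " + absolutePath);
        }
        StringBuilder out = new StringBuilder();
        for (String segment : absolutePath.split("/")) {
            if (segment.isEmpty()) {
                continue; // skip the empty piece before the leading '/'
            }
            out.append('/').append(segment);
            if (!segment.endsWith("]")) {
                out.append("[1]"); // an omitted index means the first sibling
            }
        }
        return out.length() == 0 ? "/" : out.toString();
    }
}
```

Two paths name the same node when their explicit forms are equal, which is why a stored path can silently point elsewhere once siblings are reordered.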
  10. Node paths (cont’d)
      Paths can be relative and can use “.” and “..”.
      • Relative paths
        – the sequence of names from one node to another
        – never start with a ‘/’
        – similar to file system relative paths
      • From the “passenger” node to the “Eastford Plaza” node:
        ../../facilities/San Francisco/Eastford Plaza
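Resolving a relative path against a starting node works the same way as on a file system. The sketch below assumes string paths without same-name-sibling indexes; `RelativePaths` is a hypothetical helper, and the starting path `/vehicles/passenger` used in the usage note is made up, since the slide does not say where the “passenger” node lives.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper that resolves "." and ".." like a file system does. */
final class RelativePaths {
    static String resolve(String absoluteBase, String relative) {
        if (relative.startsWith("/")) {
            throw new IllegalArgumentException("Relative paths never start with '/'");
        }
        Deque<String> segments = new ArrayDeque<>();
        for (String s : absoluteBase.split("/")) {
            if (!s.isEmpty()) {
                segments.addLast(s);
            }
        }
        for (String s : relative.split("/")) {
            if (s.isEmpty() || s.equals(".")) {
                continue; // "." stays on the current node
            }
            if (s.equals("..")) {
                if (segments.isEmpty()) {
                    throw new IllegalArgumentException("Path climbs above the root");
                }
                segments.removeLast(); // ".." moves to the parent
            } else {
                segments.addLast(s);
            }
        }
        return "/" + String.join("/", segments);
    }
}
```

For example, `RelativePaths.resolve("/vehicles/passenger", "../../facilities/San Francisco/Eastford Plaza")` yields `/facilities/San Francisco/Eastford Plaza`.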
  11. Node identifier
      Each node also has an opaque string identifier.
      • Used to look up that node directly
        – no navigation is required
        – will never change after a new node is created, even if the node is moved (unlike paths)
        – behaves as a “unique key” within the workspace (shared nodes behave differently)
        – fast
      • Used within reference properties
        – both REFERENCE and WEAKREFERENCE
      • Can be used by applications
  12. Properties
      The only place to store data on the nodes.
      • Nodes can have 0+ properties
        – each property must have a unique name in a node
      • Properties have values
        – single-valued: exactly 1 non-null value
        – multi-valued: 0 or more possibly null values
      • Values
        – are immutable
        – have an implicit type
        – are accessed by desired type with auto-conversion; e.g., value.getString(), getDate(), value.getNode(), etc.

      Property type    Java type
      STRING           java.lang.String
      NAME             java.lang.String
      PATH             java.lang.String
      BOOLEAN          java.lang.Boolean
      LONG             java.lang.Long
      DOUBLE           java.lang.Double
      DATE             java.util.Calendar
      BINARY           javax.jcr.Binary
      REFERENCE        javax.jcr.Node
      WEAKREFERENCE    javax.jcr.Node
      DECIMAL          java.math.BigDecimal
      URI              java.lang.String
  13. BINARY property values
      • Any size binary content
        – read/written via streams
      • Separate storage
        – content is keyed by its SHA-1 hash
        – the property value stored with the node contains the SHA-1, which is resolved when the stream is read
        – streamed content is always buffered
        – all this is transparent to applications
      • Automatic text extraction
        – text is used for full-text searching
      • Choices for binary storage
        – File, DBMS, MongoDB, data grid (out of the box)
        – Custom
      (Diagram: Binary Storage)
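Keying content by its SHA-1 hash is a form of content-addressed storage. The following is a minimal in-memory sketch of that idea, not ModeShape's binary store: `Sha1BinaryStore` is a hypothetical class, a `HashMap` stands in for the file, DBMS, or data grid backends, and streaming and buffering are left out.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical content-addressed store: the SHA-1 of the bytes is the key. */
final class Sha1BinaryStore {
    private final Map<String, byte[]> contentByKey = new HashMap<>();

    /** Stores the content once and returns the key a node property would hold. */
    String store(byte[] content) {
        String key = sha1Hex(content);
        contentByKey.putIfAbsent(key, content.clone()); // identical content is stored only once
        return key;
    }

    /** Resolves a key back to the content, as happens when the stream is read. */
    byte[] read(String key) {
        byte[] content = contentByKey.get(key);
        if (content == null) {
            throw new IllegalArgumentException("No content for key: " + key);
        }
        return content.clone();
    }

    int storedCount() {
        return contentByKey.size();
    }

    static String sha1Hex(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 should be available on every Java platform", e);
        }
    }
}
```

One consequence of this design is that two nodes holding the same file share a single stored copy, because both property values carry the same key.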
  14. Workspace
      Named segments of a repository.
      • Composed of
        – a single root node
        – the “/jcr:system” branch containing the system-wide information
        – other nodes that have child nodes and properties
  15. Putting the pieces together
      • Repository contains
        – named workspaces
        – namespaces, node types, version storage, etc.
      • Workspaces have
        – hierarchy of nodes
        – access to the shared system area
      • Nodes have
        – name (can change)
        – identifier (doesn’t change)
        – path (can change)
        – properties (can change)
      • Properties have values
        – single-valued: exactly 1 non-null value
        – multi-valued: 0 or more possibly null values
      • Values
        – are immutable & can be reused
        – have an implicit type
        – are accessed by desired type with auto-conversion; e.g., value.getString()
  16. Session
      An authenticated connection to a repository, used to access a single workspace.
      • Authenticated and authorized
        – only sees content authorized by credentials
        – only changes content authorized by credentials
        – use the built-in auth service or integrate with your own
      • Stateful
        – changes are kept in the session’s transient state until the session is saved
        – changes can be dropped without saving (e.g., “refreshing the session”)
      • Lightweight
        – intended to be created, used, then closed
        – pooling sessions is more trouble than it’s worth
      • Self-contained
        – exposed objects are tied to the session; can’t be shared w/ others
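The “stateful” behavior can be pictured with a toy model in which each session buffers its own changes until it is saved. This is only an illustration of transient state, not how ModeShape implements sessions: `ToySession` and its methods are hypothetical, and a shared `Map` stands in for the workspace. In the real JCR API the corresponding calls are `Session.save()` and `Session.refresh(false)`.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a session's transient state (hypothetical; not the JCR Session). */
final class ToySession {
    private final Map<String, String> workspace;                 // persisted, shared content
    private final Map<String, String> pending = new HashMap<>(); // this session's unsaved changes

    ToySession(Map<String, String> workspace) {
        this.workspace = workspace;
    }

    /** Records a change in transient state only. */
    void setProperty(String path, String value) {
        pending.put(path, value);
    }

    /** A session sees its own pending changes layered over the persisted content. */
    String getProperty(String path) {
        return pending.containsKey(path) ? pending.get(path) : workspace.get(path);
    }

    /** Persists all pending changes, making them visible to other sessions. */
    void save() {
        workspace.putAll(pending);
        pending.clear();
    }

    /** Drops pending changes without saving them. */
    void discardChanges() {
        pending.clear();
    }
}
```

Two `ToySession` instances created over the same map do not see each other's changes until one of them calls `save()`.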
  17. With or without schema
      • Choose how much schema is enforced
        – define patterns for values and structure
        – use different patterns for different parts of the database
        – change the patterns over time
        – use the “best” levels of schema validation
        – evolve as necessary
      (Diagram: STRICT ENFORCEMENT / NO ENFORCEMENT)
  18. Queries
      • Find the data independently of the hierarchy
      • SQL-like language (including full-text search)

      SELECT * FROM [car:Car]
      WHERE [car:model] LIKE '%Toyota%' AND [car:year] >= 2006

      SELECT [jcr:primaryType], [jcr:created], [jcr:createdBy] FROM [nt:file]
      WHERE PATH() LIKE $path

      SELECT [jcr:primaryType], [jcr:created], [jcr:createdBy] FROM [nt:file]
      WHERE PATH() IN (SELECT [vdb:originalFile] FROM [vdb:virtualDatabase]
                       WHERE [vdb:version] <= $maxVersion
                       AND CONTAINS([vdb:description], 'xml OR xml maybe'))

      SELECT file.*, content.* FROM [nt:file] AS file
      JOIN [nt:resource] AS content ON ISCHILDNODE(content, file)
      WHERE file.[jcr:path] LIKE '/files/q*.2.vdb'
  19. Sequencing
      • Automatically extract structured content
        – just write BINARY or STRING property values on nodes, then save
        – sequencers run asynchronously based upon path rules & MIME types
        – output stored in repository at configurable location
      • Sequencers
        – DDL (variety)
        – text (fixed width, delimited)
        – Microsoft Office™
        – Java (source & class)
        – ZIP (and JAR/WAR/EAR)
        – XML, XSD, and WSDL
        – Teiid VDBs
        – audio (MP3)
        – images
        – CND
        – custom
      (Diagram: 1) upload, 2) notify sequencers, 3) derive and store, 4) navigate or query)
  20. Federation
      • Access data in external systems
        – external data projected as nodes with properties and node types
        – supports read and optional write with same validation rules
        – transparent to applications
      • Connector options
        – File system
        – Local git
        – CMIS repository
        – custom
        – (more are planned)
      (Diagram: External source A, External source B)
  21. Other features
      • Events
        – register listeners to be notified of changes in content
        – optional criteria limit what listeners are interested in
      • Versioning
        – checkin/checkout nodes & subtrees
        – branch, merge, restore
      • Locking
        – short-lived locks (longer than transaction scope)
      • Namespace management
        – programmatically (un)register namespaces
      • Node type management
        – programmatically/declaratively define or update node types
      • Monitoring
        – statistics for a variety of metrics
  22. Public APIs
  23. Java API
      • Standard Java API (JSR-283)
        – javax.jcr packages
        – programmatically access, find, update, query content
        – commonly needed features: events, versioning, etc.
        – 95% of API
      • ModeShape extensions
        – additional node type management methods
        – additional event types
        – additional Binary value methods (hash)
        – additional JCR-QOM language objects
        – cancel queries
        – sequencer and text extraction SPIs
        – monitoring API
  24. Other APIs
      • JDBC driver
        – connect to local or remote repository
        – execute queries
        – access database metadata
        – enables existing applications to access content
      • RESTful API
        – POST, PUT, GET, DELETE methods
        – JSON representations of one or multiple nodes
        – Streams large binary values
        – Execute queries
      • WebDAV API
        – Exposes content as files and directories
        – Mount repository using file system
  25. ModeShape
      An open source elastic in-memory hierarchical database with queries, transactions, events & more
  26. Elastic
      • Add more processes to increase storage capacity and/or throughput
        – Transparent to applications!
        – No master, no slaves
        – Data is rebalanced as needed
        – Optionally separate database engine from storage processes
      • Fault tolerant
        – Processes can fail without loss of data
        – Cross-data center distribution (in near future)
  27. In-memory
      • Memory is really fast (and cheap)
      • Why not keep all data in application memory?
        – practical limits to memory on particular machines
        – memory isn’t shared between machines
        – data stored in memory isn’t durable
        – no queries, structure, or transactions
      • ModeShape
        – distributes multiple copies of data across the combined memory of many machines
        – persists data to disk or DB (if really needed)
        – transparent to applications
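To make “multiple copies of data across the combined memory of many machines” concrete, here is a deliberately simple placement sketch: hash the key to pick a starting machine, then keep copies on the next machines in the list. This is an assumption made for illustration, not the algorithm Infinispan uses, which the slides do not describe; `ReplicaPlacement` and `ownersFor` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical hash-based placement of copies; not Infinispan's actual algorithm. */
final class ReplicaPlacement {
    /** Returns the machines that would hold copies of the entry with the given key. */
    static List<String> ownersFor(String key, List<String> machines, int copies) {
        if (copies < 1 || copies > machines.size()) {
            throw new IllegalArgumentException("copies must be between 1 and the number of machines");
        }
        // floorMod keeps the index non-negative even when hashCode() is negative.
        int start = Math.floorMod(key.hashCode(), machines.size());
        List<String> owners = new ArrayList<>();
        for (int i = 0; i < copies; i++) {
            owners.add(machines.get((start + i) % machines.size()));
        }
        return owners;
    }
}
```

With two copies per entry, any single machine can fail and every entry still has a surviving owner. A naive modulo scheme like this one moves many entries when the machine list changes, which is the problem real data grids address with more careful rebalancing.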
  28. Large single- or multi-site cluster
      (Diagram: multiple ModeShape processes exchanging events, on top of an Infinispan data grid that holds the data)
  29. Strongly consistent
      • ACID
        – Atomic, Consistent, Isolated, Durable
        – Already familiar to most developers
        – Easy to reason about code
        – Writes don’t block reads (MVCC)
        – Writes to one node don’t block writes to others
      • JTA
        – Will participate in user transactions
        – Works with Java EE
  30. Why not eventually-consistent?
      • In eventually-consistent databases
        – changes made by one client will eventually (but not immediately) be propagated to all processes
        – other clients won’t see latest data right away, yet can still make other changes
        – there may be multiple versions of a particular piece of data
      • Can be ideal for some scenarios
        – read-heavy and/or best-effort
      • Applications that update data may need to
        – expect inconsistencies (and/or multiple versions)
        – specify conflict strategies
        – resolve conflicts (inconsistencies)
  31. Clustering topologies
  32. Single process
      (Diagram: one ModeShape process with a local Infinispan cache, writing data to a persistent store)
  33. Small cluster
      (Diagram: three ModeShape processes, each with a replicated Infinispan cache, exchanging data and events and sharing a persistent store)
  34. Moderate single- or multi-site cluster
      (Diagram: four or more ModeShape processes, each with distributed Infinispan, exchanging data and events)
  35. Best Practices
  36. Best practices (1 of 2)
      • Build structure first, then node types
        – most important to get your node structure right
        – it will change over time anyway, so don’t define the node types too soon
      • Prefer hierarchies
        – moderate numbers of child nodes, use multiple levels if necessary
      • Limit use of same-name-siblings
        – useful when required, but can be expensive and difficult to use (i.e., paths change)
      • Use mixin node types and mixins
        – where possible define sets of properties as mixins
        – use in primary types and dynamically add to nodes
      • Store files and folders with ‘nt:file’ and ‘nt:folder’
        – use them wherever appropriate; not for all binary data, though!
      • Verify which JCR features are enabled
        – improves portability and safety with configuration changes
      • Import and export
        – avoid document view; use system view wherever possible
  37. Best practices (2 of 2)
      • Prefer JCR-SQL2 and JCR-QOM over other query languages
        – by far the richest and most useful
        – do this even when it appears the queries are more complicated
      • Only Repository is thread-safe; no other APIs are
        – don’t share sessions
        – don’t share anything between sessions
      • Register all listeners in special long-lived sessions
        – do nothing else with these sessions, however (Session is not thread-safe)
        – get off the notification thread ASAP, using work queues where necessary
      • Create new sessions rather than reusing a pool of sessions
        – Sessions are intended to be as lightweight as possible
        – Create a session, use it, log out (even web applications and services!)
      • Avoid deprecated APIs
        – they either perform poorly or are a bad idea; besides, they’ll be removed eventually
      • Use not
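The advice to get off the notification thread quickly can be sketched with a standard work queue. The sketch assumes events are reduced to the path of the changed node, represented as a `String`; `EventHandOff` and its methods are hypothetical names. In a real JCR application the hand-off would happen inside `javax.jcr.observation.EventListener.onEvent(EventIterator)`.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical hand-off from a notification thread to worker threads. */
final class EventHandOff {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    /** Called on the notification thread: enqueue the change and return immediately. */
    void onEvent(String changedPath) {
        queue.offer(changedPath); // never blocks on an unbounded queue
    }

    /**
     * Called by a worker thread: waits up to the timeout for the next change.
     * Returns null if nothing arrived in time or the thread was interrupted.
     */
    String takeNext(long timeoutMillis) {
        try {
            return queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return null;
        }
    }
}
```

A worker thread would loop on `takeNext` and do the slow processing with its own session, which also respects the rule that sessions are not shared between threads.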
  38. more ModeShape?
      • Project
      • Blog
      • Twitter: @modeshape
      • IRC: #modeshape (
      • Code
  39. Questions?