Got documents - The Raven Bonus Edition


Slides related to architectural considerations of document databases, with some additional RavenDB information added in.

Published in: Technology

  2. About Me @maggiepint
  4. MongoDB
     • Dominant player in document databases
     • Runs on nearly all platforms
     • Strongly consistent in default configuration
     • Indexes are similar to traditional SQL indexes in nature
     • Stores data in a customized Binary JSON (BSON) format that allows typing
     • No support for cross-collection querying
     • Client APIs available in tons of languages
     • Must use a third-party provider like Solr for advanced search capabilities
  5. CouchDB
     • Stores documents in plain JSON format
     • Eventually consistent
     • Indexes are map-reduce and defined in JavaScript
     • Clients in many languages
     • Runs on Linux, OS X and Windows
     • CouchDB-Lucene provides a Lucene integration for search
  6. RavenDB
     • Stores documents in plain JSON format
     • Eventually consistent
     • Indexes are built on Lucene. Lucene search is native to RavenDB.
     • Server only runs on Windows
     • .NET, Java, and HTTP clients
     • Limited support for cross-collection querying
  7. Other Players
     • Azure DocumentDB
       ◦ Very new product from Microsoft
     • RethinkDB
       ◦ Open source project that integrates push notifications into the database
     • Cloudant
       ◦ IBM proprietary implementation of CouchDB
  8. Architectural Considerations
  9. How do document databases work?
     • Stores related data in a single document
     • Usually uses JSON format for documents
     • Enables the storage of complex object graphs together, instead of normalizing data out into tables
     • Stores documents in collections of the same type
     • Allows querying within collections
     • Does not typically allow querying across collections
     • Offers high availability at the cost of consistency
  10. Consideration: Schema Free
      PROS
      • Easy to add properties
      • Simple migrations
      • Tolerant of differing data
      CONS
      • Have to account for properties being missing
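
The "missing properties" cost can be made concrete with a minimal sketch. The documents and the `display_name` helper below are hypothetical, not taken from the slides; plain dictionaries stand in for documents read from a collection.

```python
# Hypothetical documents from one collection. The second was written
# before a "nickname" property was added to the application model.
users = [
    {"id": "users/1", "name": "Ada", "nickname": "ada99"},
    {"id": "users/2", "name": "Grace"},
]


def display_name(doc):
    """Fall back to "name" when the newer "nickname" property is absent."""
    return doc.get("nickname") or doc["name"]


names = [display_name(u) for u in users]
print(names)
```

Every reader of the collection has to carry this kind of fallback, since nothing in the database forces older documents to gain the new property.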
  11. ACID
      Atomicity
      ◦ Each transaction is all or nothing
      Consistency
      ◦ Any transaction brings the database from one valid state to another
      Isolation
      ◦ The system ensures that transactions executed concurrently bring the database to the same state as if they had been executed serially
      Durability
      ◦ Once a transaction is committed, it remains so even in the event of power loss, etc.
  12. ACID in Document Databases
      • Traditional transaction support is not available in any document database (except Raven)
      • Document databases do support something like transactions within the scope of a single document
      • This makes document databases generally inappropriate for a wide variety of applications
        ◦ Do a Google search for FlexCoin
      • RavenDB is very close to ACID, but the community doesn’t agree on whether it is ACID
  13. Consideration: Non-ACID
      PROS
      • Performance gain
      CONS
      • No isolation means that concurrent operations can affect each other
      • No way to guarantee that operations succeed or fail together
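
The second con can be illustrated with a toy example. This is a sketch under stated assumptions: `InMemoryStore` and `transfer` are hypothetical names, and the store only mimics the "atomic per document, nothing across documents" behavior described above. It is not any real database's API.

```python
class InMemoryStore:
    """Toy document store: each put() is all-or-nothing for ONE document."""

    def __init__(self):
        self.docs = {}

    def put(self, doc_id, doc):
        # Hypothetical validation rule: balances may not go negative.
        if doc.get("balance", 0) < 0:
            raise ValueError(f"invalid document {doc_id}")
        self.docs[doc_id] = dict(doc)


def transfer(store, src, dst, amount):
    """Two independent writes; nothing ties them together."""
    credited = dict(store.docs[dst], balance=store.docs[dst]["balance"] + amount)
    store.put(dst, credited)  # first write succeeds
    debited = dict(store.docs[src], balance=store.docs[src]["balance"] - amount)
    store.put(src, debited)  # second write may fail, leaving the first in place


store = InMemoryStore()
store.put("accounts/a", {"balance": 10})
store.put("accounts/b", {"balance": 0})

try:
    transfer(store, "accounts/a", "accounts/b", 50)
except ValueError:
    pass  # the credit to "accounts/b" is not rolled back

total = sum(d["balance"] for d in store.docs.values())
print(total)
```

The system started with a total balance of 10 and ends with 60, because the credit was applied and the debit was rejected with no rollback linking the two.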
  14. Case Study: Survey System
  15. Requirements
      • An administration area is used to define ‘Surveys’
        ◦ Surveys have Questions
        ◦ Questions have answers
      • Surveys can be administered in sets called workflows
      • When a survey changes, the change can apply only to surveys assigned from then on
        ◦ Because of this, each user must receive a survey ‘instance’ to track the version of the survey he/she got
  16. A Traditional SQL Schema
      • With various other requirements not described here, this schema came out to 83 tables
      • For one of our heaviest usage clients, the average user would have 119 answers in the ‘Saved Answer’ table
      • With over 200,000 users after two years of use, the ‘Saved Answer’ table had 24,014,330 rows
      • This table was both read and write heavy, so it was extremely difficult to define effective SQL indexes
      • The hardware cost for these SQL servers was astronomical
      • This sucked
  17. Designing Documents
      • An aggregate is a collection of objects that can be treated as one
      • An aggregate root is the object that contains all other objects inside of it
      • When designing a document schema, find your aggregates and create documents around them
      • If you have an entity, it should be persisted as its own document because you will likely have to store references to it
  18. Survey System Design
      • A combined SQL and document database design was used
      • Survey Templates (one type of entity) were put into the SQL database
      • When a survey was assigned to a user as part of a workflow (another entity, and also an aggregate), its data at that time was put into the document database
      • The user’s responses were saved as part of the workflow document
      • Reading a user’s application data became as simple as making one request for her workflow document
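
A workflow document along these lines might look like the sketch below. The field names (`templateId`, `templateVersion`, and so on) and the sample questions are assumptions for illustration; the slides do not show the actual document shape. A dictionary of JSON strings stands in for the document collection.

```python
import json

# Hypothetical workflow document: the aggregate root holds the survey
# instances (a snapshot of the template at assignment time) and the answers.
workflow_doc = {
    "id": "workflows/1",
    "userId": "users/42",
    "surveys": [
        {
            "templateId": 7,  # reference back to the SQL template row
            "templateVersion": 3,  # version the user actually received
            "questions": [
                {"text": "Favorite color?", "answer": "blue"},
                {"text": "Years of experience?", "answer": None},
            ],
        }
    ],
}

collection = {workflow_doc["id"]: json.dumps(workflow_doc)}

# One request by id returns the whole aggregate; no joins are needed.
loaded = json.loads(collection["workflows/1"])
answered = [
    q for s in loaded["surveys"] for q in s["questions"] if q["answer"] is not None
]
print(len(answered))
```

Because the survey snapshot lives inside the workflow document, later edits to the template in SQL do not change what this user was asked.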
  19. Consideration: Models Aggregates Well
      PROS
      • Improves performance by reducing lookups
      • Allows for easy persistence of object-oriented designs
      CONS
      • None
  20. Sharding
      • Sharding is the practice of distributing data across multiple servers
      • All major document database providers support sharding natively
      • Document databases are ideal for sharding because document data is self-contained (less need to worry about a query having to run on two servers)
      • Sharding is usually accomplished by selecting a shard key for a collection, and allowing the collection to be distributed to different nodes based on that key
      • Tenant ID and geographic region are typical choices for shard keys
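
Routing by shard key can be sketched as follows. Hashing the key is only one possible strategy and is an assumption here, as are the node names and the `shard_for` helper; real products may use range-based or other schemes.

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2"]  # hypothetical node names


def shard_for(shard_key):
    """Map a shard key to a node using a stable hash of the key."""
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]


docs = [
    {"id": "workflows/1", "tenantId": "tenant-a"},
    {"id": "workflows/2", "tenantId": "tenant-a"},
    {"id": "workflows/3", "tenantId": "tenant-b"},
]

# Documents that share a tenant ID always land on the same node.
placement = {d["id"]: shard_for(d["tenantId"]) for d in docs}
print(placement)
```

With the tenant ID as the key, a query scoped to one tenant only ever needs to visit one node.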
  21. Replication
      • All major document database providers support replication
      • In most replication setups, a primary node takes all write operations, and a secondary node asynchronously replicates these write operations
      • In the event of a failure of the primary, the secondary begins to take write operations
      • MongoDB can be configured to allow reads from secondaries as a performance optimization, resulting in eventual instead of strong consistency
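
Why reading from a secondary gives eventual consistency can be seen in a toy model. `ReplicaPair` is a hypothetical in-memory stand-in, with an explicit `replicate()` call playing the part of the asynchronous background process.

```python
from collections import deque


class ReplicaPair:
    """Toy primary/secondary pair with an explicit replication queue."""

    def __init__(self):
        self.primary = {}
        self.secondary = {}
        self.pending = deque()  # writes not yet applied to the secondary

    def write(self, doc_id, doc):
        self.primary[doc_id] = doc
        self.pending.append((doc_id, doc))

    def replicate(self):
        """Stands in for the background replication process."""
        while self.pending:
            doc_id, doc = self.pending.popleft()
            self.secondary[doc_id] = doc


pair = ReplicaPair()
pair.write("customers/1", {"name": "Acme"})
stale_read = pair.secondary.get("customers/1")  # not replicated yet
pair.replicate()
fresh_read = pair.secondary.get("customers/1")  # now matches the primary
print(stale_read, fresh_read)
```

A read that goes to the primary sees the write immediately; a read that goes to the secondary sees it only after replication has caught up.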
  22. Consideration: Scaling Out
      PROS
      • Allows hardware to be scaled horizontally
      • Ensures very high availability
      CONS
      • Consistency is sacrificed
  23. Survey System: End Result
      • Each user is associated with about 20 documents
      • Documents are distributed across multiple databases using sharding
      • Master/Master replication is used to ensure extremely high availability
      • There have been no database performance issues in the year and a half the app has been in production
      • Because there is no schema migration concern, deploying updates has been drastically simplified
      • Hardware cost is reasonable (but not cheap)
  24. Indexes
      • All document databases support some form of indexing to improve query performance
      • Some document databases do not allow querying without an index
      • In general, you shouldn’t query without an index anyway
  25. Consideration: Indexes
      PROS
      • Improve performance of queries
      CONS
      • Queries cannot reasonably be issued without an index, so indexes must frequently be defined and deployed
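
What an index buys can be shown with a minimal sketch. The `deals` collection and the `build_index` helper are hypothetical; a real database maintains such structures itself, and this only illustrates the lookup-versus-scan difference.

```python
from collections import defaultdict

deals = {
    "deals/1": {"stage": "proposal", "amount": 1200},
    "deals/2": {"stage": "closed", "amount": 800},
    "deals/3": {"stage": "proposal", "amount": 300},
}


def build_index(collection, field):
    """Map each value of `field` to the ids of the documents that hold it."""
    index = defaultdict(list)
    for doc_id, doc in collection.items():
        index[doc[field]].append(doc_id)
    return index


by_stage = build_index(deals, "stage")

# Indexed query: one dictionary lookup instead of scanning every document.
proposal_ids = by_stage["proposal"]
print(proposal_ids)
```

The con on the slide follows directly: each new query shape (by amount, by customer, and so on) needs its own index to be defined and deployed before it performs well.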
  26. Consideration: Eventual Consistency
      PROS
      • Optimizes performance by allowing data transfer to be a background process
      CONS
      • Requires the entire team to be aware of eventual consistency implications
  27. Case Study 2: CRM
  28. CRM Requirements
      • Track customers and basic information about them
      • Track contacts and basic information about them
      • Track sales deals and where they are in the pipeline
      • Track orders generated from sales deals
      • Track user tasks
  29. Customers and Their Deals
      • Customers and Deals are both entities, which is to say that they have distinct identity
      • For this reason, Deals and Customers should be two separate collections
      • There is no native support for cross-collection querying in most document databases
        ◦ The cross-collection querying support in RavenDB can have performance issues
  30. Consideration: One document per interaction
      PROS
      • Improves performance
      • Encourages modeling aggregates well
      CONS
      • Not actually achievable in most cases
  31. Searching Deals by Customer Name
      • The deal document must contain a denormalized customer object with the customer’s ID and name
      • We have a choice to make with this denormalization
        ◦ Allow the denormalization to just be wrong in the event the customer name is changed
        ◦ Maintain the denormalization when the customer name is changed
  32. Denormalization Considerations
      • Is stale data acceptable? Accepting stale data is the best option wherever it is possible.
      • If stale data is unacceptable, how many documents are likely to need updating when a change is made? How many collections? How often are changes going to be made?
      • Using an event bus to move denormalization updates to a background process can be very beneficial if failure of an update isn’t critical for the user to know
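
The "maintain the denormalization" option can be sketched as follows. The collections, field names, and both helper functions are hypothetical; in-memory dictionaries stand in for the Customers and Deals collections.

```python
customers = {"customers/1": {"name": "Acme"}}
deals = {
    "deals/1": {"title": "Renewal", "customer": {"id": "customers/1", "name": "Acme"}},
    "deals/2": {"title": "Upsell", "customer": {"id": "customers/1", "name": "Acme"}},
    "deals/3": {"title": "Pilot", "customer": {"id": "customers/2", "name": "Globex"}},
}


def rename_customer(customer_id, new_name):
    """Update the customer, then fan the new name out to every deal copy."""
    customers[customer_id]["name"] = new_name
    updated = 0
    for deal in deals.values():
        if deal["customer"]["id"] == customer_id:
            deal["customer"]["name"] = new_name
            updated += 1
    return updated


def deals_for_customer_name(name):
    """Search deals by the denormalized name; no cross-collection query."""
    return [deal_id for deal_id, d in deals.items() if d["customer"]["name"] == name]


touched = rename_customer("customers/1", "Acme Corp")
print(touched, deals_for_customer_name("Acme Corp"))
```

The fan-out loop is the boilerplate the questions above are sizing up: its cost grows with the number of documents and collections holding a copy, which is why it is often pushed to a background process.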
  33. Consideration: Models Relationships Poorly
      PROS
      • None
      CONS
      • Stale (out of date) data must be accepted in the system
      • Large amounts of boilerplate code must be written to maintain denormalizations
      • In certain circumstances a queuing/eventing system is unavoidable
  34. Consideration: No Foreign Key Constraints
      PROS
      • Don’t have to define foreign key constraints
      CONS
      • No built-in checks for data consistency
  35. Consideration: Administration
      PROS
      • Generally less involved than SQL
      CONS
      • Server performance must be monitored
      • Hardware must be maintained
      • Index processes must be tuned
      • Settings must be tweaked
  36. Consideration Recap
      • Schema free
      • Non-ACID
      • Models aggregates well
      • Scales out well
      • All queries must be indexed
      • Eventual consistency
      • One document per interaction
      • Models relationships poorly
      • No foreign key constraints
      • Requires administration
  37. RavenDB Bonus Section
  38. ACID
      • RavenDB has a session that allows multiple documents to be written as a transaction
      • Keep in mind, reads from indexes are still eventually consistent
  39. Eventual Consistency
      • Issues with eventual consistency can be circumvented by using the wait for non-stale results functionality
      • Waiting for non-stale results can result in long wait times
      • Waiting for non-stale results with a timeout can result in no results
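
The trade-off between waiting and timing out can be simulated as below. This is not RavenDB's client API: `ToyIndex` and `query_non_stale` are hypothetical, and returning `None` on timeout is only a stand-in for "no results".

```python
import time


class ToyIndex:
    """Index that becomes current only after `lag_polls` status checks."""

    def __init__(self, results, lag_polls):
        self._results = results
        self._remaining = lag_polls

    def is_stale(self):
        if self._remaining > 0:
            self._remaining -= 1
            return True
        return False

    def query(self):
        return list(self._results)


def query_non_stale(index, timeout_s, poll_interval_s=0.01):
    """Poll until the index is current; give up with None after the timeout."""
    deadline = time.monotonic() + timeout_s
    while index.is_stale():
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval_s)
    return index.query()


fast = query_non_stale(ToyIndex(["deals/1"], lag_polls=2), timeout_s=1.0)
slow = query_non_stale(ToyIndex(["deals/1"], lag_polls=10_000), timeout_s=0.05)
print(fast, slow)
```

A short timeout keeps the caller responsive but forces it to handle the empty case; a long or missing timeout trades that for an unbounded wait while indexing catches up.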
  40. Load Document
      • RavenDB has limited support for cross-collection querying in the form of LoadDocument
      • This eliminates some of the concerns with the deals by customer name search example
      • Raven’s website warns that injudicious use of LoadDocument can result in some very expensive computations
  41. Patching
      • Raven supports partial document updates on a collection of documents using the Patching API
      • This can be extremely helpful for maintaining denormalizations
      • Patching is not transactional
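
The idea of a partial update applied across a collection can be sketched generically. This does not use or reproduce RavenDB's Patching API; `patch_collection` and the flat `customerName` field are hypothetical, and a dictionary stands in for the Deals collection.

```python
deals = {
    "deals/1": {"customerId": "customers/1", "customerName": "Acme"},
    "deals/2": {"customerId": "customers/1", "customerName": "Acme"},
    "deals/3": {"customerId": "customers/2", "customerName": "Globex"},
}


def patch_collection(collection, predicate, changes):
    """Apply `changes` to each matching document, one document at a time."""
    patched = []
    for doc_id, doc in collection.items():
        if predicate(doc):
            doc.update(changes)  # only the named properties are overwritten
            patched.append(doc_id)
    return patched


patched_ids = patch_collection(
    deals,
    lambda d: d["customerId"] == "customers/1",
    {"customerName": "Acme Corp"},
)
print(patched_ids)
```

Each document is updated independently in the loop, which mirrors the non-transactional caveat: if the process stopped partway through, some deals would carry the new name and others the old one.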
  42. Lucene Search
      • RavenDB’s indexes are built on Lucene
      • This allows easy full text search with term weighting and proximity searching
  43. …nerds like us are allowed to be unironically enthusiastic about stuff… Nerds are allowed to love stuff, like jump-up-and-down-in-the-chair-can’t-control-yourself love it. -John Green