This document provides an overview of document databases and MongoDB. It discusses key concepts of document databases like dynamic schemas, embedding of related data, and lack of joins. Benefits include scalability, flexibility in data modeling, and performance. The document outlines MongoDB internals such as replication, sharding, and BSON data storage format. It also promotes MongoDB as the most popular open-source document database and provides links for additional .NET resources.
5. Comparison
Relational:
Article
- id
- authorid
- title
- content
Comment
- id
- articleid
- message
Author
- id
- name
- email
Document db:
Article
- _id
- title
- content
- author
  - _id
  - name
  - email
- comments[]
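The comparison above can be sketched with plain Python dicts standing in for rows and documents. This is only an illustration: the sample values and the helper names `load_article_relational` and `load_article_document` are assumptions, not part of the slides.

```python
# Relational shape: three "tables" linked by id columns (sample values are made up).
authors = [{"id": 1, "name": "Ann", "email": "ann@example.com"}]
articles = [{"id": 10, "authorid": 1, "title": "Hello", "content": "..."}]
comments = [
    {"id": 100, "articleid": 10, "message": "Nice"},
    {"id": 101, "articleid": 10, "message": "Thanks"},
]


def load_article_relational(article_id):
    """Reassemble an article by joining the three tables at query time."""
    article = next(a for a in articles if a["id"] == article_id)
    author = next(u for u in authors if u["id"] == article["authorid"])
    messages = [c["message"] for c in comments if c["articleid"] == article_id]
    return {"title": article["title"], "author": author["name"], "comments": messages}


# Document shape: author and comments are embedded, so one lookup is enough.
article_doc = {
    "_id": 10,
    "title": "Hello",
    "content": "...",
    "author": {"_id": 1, "name": "Ann", "email": "ann@example.com"},
    "comments": [{"message": "Nice"}, {"message": "Thanks"}],
}


def load_article_document(doc):
    """Read the same view straight from the single document."""
    return {
        "title": doc["title"],
        "author": doc["author"]["name"],
        "comments": [c["message"] for c in doc["comments"]],
    }
```

Both helpers return the same view of the article; the relational one has to touch three collections to build it, the document one reads a single object.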
6. Terminology
In parallel with SQL:
Relational | Document db
Table      | Collection
Row        | Document
Column     | Field
Index      | Index
Join       | Embedding & linking
Schema     | N/A
7. Data integrity
Shift of responsibilities to the app
Manage data integrity and validity yourself
Database more efficient and more scalable
[Diagram: data integrity & validity checks move from the DB to the APPLICATION]
8. Concepts
Joins
No joins
Joins at "design time", not at "query time"
Due to embedded docs and arrays, fewer joins are needed
Constraints
No foreign key constraints
Unique indexes
Transactions
No multi-document commit/rollback (MongoDB 4.0 and later add multi-document transactions)
Atomic operations
Multiple actions inside the same document
Incl. embedded documents
9. Dynamic schema
No schema
Implied: definition in the app, not the db
A field can exist in certain docs and not in others
When indexed, a missing field is indexed as null
Sparse index: excludes docs without that field
Writing to a non-existent collection or database
Lazy creation
Reading from a non-existent collection
Empty value returned
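The dynamic-schema behaviour listed above can be mimicked in a few lines of Python. This is a toy in-memory model, not MongoDB's implementation; `db`, `insert`, `find` and `build_index` are hypothetical names chosen for the sketch.

```python
from collections import defaultdict

# Hypothetical in-memory "database": a collection appears on its first write.
db = defaultdict(list)


def insert(collection, doc):
    """Lazy creation: writing to an unknown collection creates it."""
    db[collection].append(doc)


def find(collection):
    """Reading an unknown collection yields an empty result and creates nothing."""
    return list(db.get(collection, []))


def build_index(collection, field, sparse=False):
    """Map field value -> list of _id. Missing fields are indexed as None
    unless the index is sparse, in which case those docs are skipped."""
    index = defaultdict(list)
    for doc in find(collection):
        if field not in doc:
            if sparse:
                continue
            index[None].append(doc["_id"])
        else:
            index[doc[field]].append(doc["_id"])
    return dict(index)


# Two docs in the same collection with different fields.
insert("users", {"_id": 1, "name": "Ann", "nickname": "annie"})
insert("users", {"_id": 2, "name": "Bob"})
```

Here `build_index("users", "nickname")` keeps an entry under `None` for the document without a nickname, while the sparse variant leaves that document out of the index entirely.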
10. Relations
Embedded fields
Can be queried, the parent doc is returned
Can be indexed
Can’t be used for ordering
Linking
Get the 2nd doc yourself in the app via a reference
Avoid where possible
Use for:
Many-to-many relations
Subdoc often needs to be modified
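A small sketch of the difference between the two kinds of relation, again with plain Python dicts as stand-ins. The field name `category_id` and both helper names are assumptions made for the example.

```python
# The author is embedded; the category is only linked by reference.
articles = [
    {"_id": 1, "title": "A", "author": {"name": "Ann"}, "category_id": 7},
    {"_id": 2, "title": "B", "author": {"name": "Bob"}, "category_id": 8},
]
categories = {
    7: {"_id": 7, "name": "Databases"},
    8: {"_id": 8, "name": ".NET"},
}


def find_by_author_name(name):
    """Query on an embedded field: the whole parent doc is returned."""
    return [a for a in articles if a["author"]["name"] == name]


def category_of(article):
    """Linking: the app fetches the second doc itself via the reference."""
    return categories[article["category_id"]]
```

The embedded query needs one pass over one collection; the linked lookup always costs a second fetch, which is why the slides suggest avoiding links where embedding is possible.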
11. Benefits
Scalable: good for a lot of data / traffic
Horizontal scaling: to more nodes
Good for web-apps
Performance
No joins and constraints
Dev/user friendly
Data is modeled to how the app is going to use it
No conversion between object-oriented and relational models
No static schema = agile
12. Drawbacks
More mistake-prone
No data integrity checks
Database is app-specific
Less flexibility for shared usage
Data aggregation is harder
Less suitable for reporting
14. Schema design
Start from application-specific queries
“What questions do I have?” vs “What answers”
“Data like the application wants it”
Base parent documents on:
The most common usage
What do I want returned?
15. Schema design
Hybrid embed / link
Changing the author name rarely happens
First update author.name
Then update the articles async
Article
- _id
- author
  - _id
  - name
  - email
- content
Author
- _id
- name
- email
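The hybrid approach can be sketched as follows: the author lives in its own collection and a copy is embedded in each article. The `pending` list is a stand-in for whatever background job mechanism the app uses; it and the function names are assumptions, not something the slides prescribe.

```python
# Author is the source of truth; each article carries an embedded copy.
authors = {1: {"_id": 1, "name": "Ann", "email": "ann@example.com"}}
articles = [
    {"_id": 10, "content": "...",
     "author": {"_id": 1, "name": "Ann", "email": "ann@example.com"}},
    {"_id": 11, "content": "...",
     "author": {"_id": 1, "name": "Ann", "email": "ann@example.com"}},
]
pending = []  # hypothetical queue of deferred propagation jobs


def rename_author(author_id, new_name):
    """Step 1: update author.name right away, then queue the article updates."""
    authors[author_id]["name"] = new_name
    pending.append((author_id, new_name))


def run_pending():
    """Step 2 (async in a real app): refresh the embedded copies."""
    while pending:
        author_id, new_name = pending.pop(0)
        for article in articles:
            if article["author"]["_id"] == author_id:
                article["author"]["name"] = new_name
```

Between the two steps the embedded copies are stale, which is acceptable here because renames are rare and reads of articles never need a second lookup.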
16. Schema design
Data duplication & denormalisation
Pro
simplicity
optimisation (fewer IO operations)
query processing
Con
more disk usage
data integrity
Embedded docs
Recommended < 250 kB
17. Product
Single collection inheritance
Relational:
Product
- _id
- price
Book
- author
- title
Album
- artist
- title
Jeans
- size
- color
Document db:
Book
- _id
- price
- author
- title
Jeans
- _id
- price
- size
- color
18. Product
Single collection inheritance
Relational:
Product
- _id
- price
Book
- author
- title
Album
- artist
- title
Jeans
- size
- color
Document db:
_type: Book
- _id
- price
- author
- title
_type: Jeans
- _id
- price
- size
- color
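A minimal sketch of single collection inheritance: all product kinds sit in one collection and the `_type` field tells the app how to interpret each document. The sample products and the helpers `of_type` and `describe` are illustrative assumptions.

```python
# One "products" collection holding different shapes, told apart by _type.
products = [
    {"_type": "Book", "_id": 1, "price": 12.5, "author": "Ann", "title": "Docs"},
    {"_type": "Jeans", "_id": 2, "price": 40.0, "size": 32, "color": "blue"},
]


def of_type(type_name):
    """Filter the shared collection down to one subtype."""
    return [p for p in products if p["_type"] == type_name]


def describe(doc):
    """Dispatch on the discriminator; the app, not the db, knows the shapes."""
    if doc["_type"] == "Book":
        return f'{doc["title"]} by {doc["author"]}'
    if doc["_type"] == "Jeans":
        return f'{doc["color"]} jeans, size {doc["size"]}'
    raise ValueError(f'unknown product type: {doc["_type"]}')


# Shared fields such as price can be used without caring about the subtype.
total_price = sum(p["price"] for p in products)
```

Queries on the common fields (`_id`, `price`) work across every subtype, while type-specific fields are only read after checking `_type`.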
19. One-to-many
Embedded array / array keys
Some queries get harder
You can index arrays!
Normalized approach
More flexibility
Much lower performance
Article
- _id
- content
- tags: ["foo", "bar"]
- comments: ["id1", "id2"]
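Indexing an array field can be pictured as an inverted index with one entry per array element. The sketch below is a simplified model of that idea, not MongoDB's actual index structure; `build_multikey_index` is a hypothetical helper.

```python
from collections import defaultdict

articles = [
    {"_id": 1, "content": "...", "tags": ["foo", "bar"]},
    {"_id": 2, "content": "...", "tags": ["bar"]},
    {"_id": 3, "content": "..."},  # no tags field at all
]


def build_multikey_index(docs, field):
    """Index every element of the array field: value -> set of _id."""
    index = defaultdict(set)
    for doc in docs:
        for value in doc.get(field, []):
            index[value].add(doc["_id"])
    return index


tag_index = build_multikey_index(articles, "tags")
```

Looking up `tag_index["bar"]` then finds every article carrying that tag without scanning the collection.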
20. Many-to-many
Using array keys
No join table
References on both sides
Advantage: simple queries
articles.Where(p => p.CategoryIds.Contains(categoryId))
categories.Where(c => c.ArticleIds.Contains(articleId))
Disadvantage: duplication, update two docs
Article
- _id
- content
- category_ids: ["id1", "id2"]
Category
- _id
- name
- article_ids: ["id7", "id8"]
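The disadvantage named above (two docs to update) becomes visible as soon as a link is created. A sketch with dicts as collections; the `link` helper and the ids are assumptions for illustration.

```python
articles = {"a1": {"_id": "a1", "content": "...", "category_ids": []}}
categories = {"c1": {"_id": "c1", "name": "Databases", "article_ids": []}}


def link(article_id, category_id):
    """Keep both sides in sync: every link means two document updates."""
    article = articles[article_id]
    category = categories[category_id]
    if category_id not in article["category_ids"]:
        article["category_ids"].append(category_id)
    if article_id not in category["article_ids"]:
        category["article_ids"].append(article_id)


link("a1", "c1")
link("a1", "c1")  # linking twice must not duplicate the references
```

Since the two updates touch different documents, the app has to cope with the case where only one of them succeeds.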
21. Many-to-many
References on one side
Advantage: data in one place
Disadvantage: 2 queries
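// Articles in a category: 1 query
articles.Where(p => p.CategoryIds.Contains(categoryId));
// Categories of an article: 2 queries (load the article, then its categories)
var article = articles.Single(p => p.Id == articleId);
categories.Where(c => c.Id.In(article.CategoryIds));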
Article
- _id
- content
- category_ids: ["id1", "id2"]
Category
- _id
- name
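With references on one side only, the lookup is cheap in one direction and needs two steps in the other. A sketch under the same assumptions as before (dicts as collections, made-up ids and helper names):

```python
# Only the article stores references; categories know nothing about articles.
articles = {"a1": {"_id": "a1", "content": "...", "category_ids": ["c1", "c2"]}}
categories = {
    "c1": {"_id": "c1", "name": "Databases"},
    "c2": {"_id": "c2", "name": ".NET"},
    "c3": {"_id": "c3", "name": "Other"},
}


def articles_in_category(category_id):
    """One query: filter articles on their own array of references."""
    return [a for a in articles.values() if category_id in a["category_ids"]]


def categories_of_article(article_id):
    """Two queries: first the article, then the categories it points to."""
    article = articles[article_id]                            # query 1
    return [categories[c] for c in article["category_ids"]]  # query 2
```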
22. To sum up
A new mindset
Serialize complex .NET objects directly to the db
Data duplication and denormalisation are key
Big shift of responsibilities to the app
No built-in data integrity checks
Database has a single responsibility: storing data
Quicker and easier to scale
24. MongoDB
Why MongoDB?
Largest user base, mature
Platform independent
Open source, free
Source: Google Trends
25. MongoDB: internals
Durability
By default through replication
Single server durability: lower performance
Eventual consistency
Configure fsync: sync between memory and disk, by default every 60 sec.
Configure replicate before return
26. MongoDB: internals
Safe mode
Turn off eventual consistency
sync directly to the disk
sufficiently replicate data, in replica sets
Calls GetLastError to determine whether the action was successful
Applies to actions without a return value
On connection or action level
27. MongoDB: internals
Replica sets
Nodes that are copies of each other
Set-up of a primary (master) node and secondary (slave) nodes
If the primary goes down, a secondary is automatically elected and promoted to primary
28. MongoDB: internals
Sharding
Scale out
Clusters of replica sets
Connected to
a central proxy, used by clients
config servers, which contain meta-data
Write to multiple nodes
29. MongoDB: internals
Sharding
Based on a shard key (= field)
Commands are sent to the shard that includes the relevant range of the data
Data is evenly distributed across the shards
Automatic reallocation of data when adding or removing servers
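Routing on a shard key can be sketched as a lookup in a sorted list of range boundaries. This is a deliberately simplified model: the boundaries, the shard names and `shard_for` are hypothetical, and a real cluster keeps this mapping in the config servers and rebalances it automatically.

```python
import bisect

# Hypothetical split points on the shard key (here: a user name).
# shard0 holds keys < "h", shard1 holds "h" <= key < "p", shard2 holds the rest.
boundaries = ["h", "p"]
shards = ["shard0", "shard1", "shard2"]


def shard_for(key):
    """Pick the shard whose key range contains the given shard key value."""
    return shards[bisect.bisect_right(boundaries, key)]


def route_insert(cluster, doc, key_field="name"):
    """Send a write to the one shard responsible for the doc's shard key."""
    cluster.setdefault(shard_for(doc[key_field]), []).append(doc)


cluster = {}
for name in ["alice", "harry", "zoe"]:
    route_insert(cluster, {"name": name})
```

A query that includes the shard key can be answered by a single shard; a query without it would have to be sent to all of them.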
30. MongoDB: internals
BSON
Data storage and network transfer format
Binary serialized JSON
System collections
db.system.namespaces
db.system.indexes
Geospatial indexing
Find results closest to coordinate
db.places.find({ loc: {$near: [50, 4], $maxDistance: 5} })
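What the `$near` query above asks for can be approximated in plain Python: keep the places within the maximum distance and return them closest first. The sketch uses flat Euclidean distance and a linear scan as simplifying assumptions; it says nothing about how MongoDB's geospatial index works internally, and the sample places are made up.

```python
import math

places = [
    {"name": "A", "loc": [50, 4]},
    {"name": "B", "loc": [51, 5]},
    {"name": "C", "loc": [60, 20]},
]


def near(docs, point, max_distance):
    """Return docs within max_distance of point, ordered by distance."""
    def distance(doc):
        dx = doc["loc"][0] - point[0]
        dy = doc["loc"][1] - point[1]
        return math.hypot(dx, dy)

    hits = [d for d in docs if distance(d) <= max_distance]
    return sorted(hits, key=distance)


closest = near(places, [50, 4], 5)
```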