Back to Basics 1: Thinking in documents

Thinking in Documents
Perl Engineer & Evangelist, MongoDB, Inc
Mike Friedman
#mongodb
@friedo

Agenda
• What is a Record?
• Core Concepts
• What is an Entity?
• Associating Entities
• General Recommendations

All application development is
Schema Design

Success comes from
Proper Data Structure

Key → Value
• One-dimensional storage
• Single value is a blob
• Query on key only
• No schema
• Value cannot be updated, only replaced
Key Blob

Relational
• Two-dimensional storage (tuples)
• Each field contains a single value
• Query on any field
• Very structured schema (table)
• In-place updates
• Normalization requires many tables, joins, and indexes,
and results in poor data locality
Primary
Key

Document
• N-dimensional storage
• Each field can contain 0, 1,
many, or embedded values
• Query on any field & level
• Flexible schema
• Inline updates *
• Embedding related data has optimal data locality,
requires fewer indexes, has better performance
_id

Traditional Schema Design
Focus on data storage

Document Schema Design
Focus on data use

What answers do I have?
What questions do I have?

Flexibility
• Choices for schema design
• Each record can have different fields
• Field names consistent for programming
• Common structure can be enforced by application
• Easy to evolve as needed

Building Blocks of Document Schema Design

1 - Arrays
[
1, 2, 3, "four",
5, "six", [ 7, 8, 9 ]
]

1 – Arrays
Multiple Values per Field
Each field in a document can be:
• Absent
• Set to null
• Set to a single value
• Set to an array of many values

1 – Arrays
Multiple Values per Field
• Query for any matching value
– Can be indexed and each value in the array is in the
index
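
A minimal sketch of the matching rule above, written in plain Python rather than against a MongoDB driver: a query value matches a field when the field equals it or, if the field holds an array, when any element equals it. The `field_matches` helper and the sample documents are hypothetical and exist only for illustration; the sketch also ignores any special handling of null that a real server may apply.

```python
def field_matches(doc, field, value):
    """Return True if doc[field] equals value, or is a list containing value."""
    if field not in doc:           # absent field: nothing to compare
        return False
    stored = doc[field]
    if isinstance(stored, list):   # array: match on any element
        return value in stored
    return stored == value         # single value (or None)


# Hypothetical documents covering the four states a field can be in.
docs = [
    {"name": "a"},                               # field absent
    {"name": "b", "tags": None},                 # set to null
    {"name": "c", "tags": "perl"},               # set to a single value
    {"name": "d", "tags": ["perl", "mongodb"]},  # set to an array of values
]

matched = [d["name"] for d in docs if field_matches(d, "tags", "perl")]
print(matched)  # ['c', 'd']
```

The same predicate works for the single-value and array cases, which is why a field can move from one value to many without changing how it is queried.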

2 – Embedded Documents
{
"foo": 42,
"bar": 43,
"stuff": { ... },
...
}

2 - Embedded Documents
• A value in a document can be another document
• Nested documents provide structure
• Query any field at any level
– Can be indexed
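
To illustrate "query any field at any level", here is a small Python sketch that walks a dotted path through nested documents. The `get_path` helper is hypothetical, and the contents of `"stuff"` are invented for the example (the slide elides them).

```python
def get_path(doc, path):
    """Follow a dotted path such as 'stuff.size.h' through nested dicts."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None  # the path does not exist in this document
        current = current[key]
    return current


# "foo" and "bar" come from the slide; the body of "stuff" is made up.
doc = {
    "foo": 42,
    "bar": 43,
    "stuff": {"color": "red", "size": {"w": 2, "h": 3}},
}

top_level = get_path(doc, "foo")        # a top-level field
nested = get_path(doc, "stuff.size.h")  # a field two levels down
missing = get_path(doc, "stuff.weight") # a field that is not present
```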

An Entity
• Object in your model
• Associations with other entities

Referencing (Relational)    Embedding (Document)
has_one                     embeds_one
belongs_to                  embedded_in
has_many                    embeds_many
has_and_belongs_to_many

Let's model something together
How about a business card?

Referencing
Addresses
{
"_id": ,
"street": ,
"city": ,
"state": ,
"zip_code": ,
"country":
}
Contacts
{
"_id": ,
"name": ,
"title": ,
"company": ,
"phone": ,
"address_id":
}

Embedding
Contacts
{
"_id": ,
"name": ,
"title": ,
"company": ,
"address": {
"street": ,
"city": ,
"state": ,
"zip_code": ,
"country":
},
"phone":
}
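
The practical difference between the two layouts can be sketched in plain Python, with dictionaries standing in for collections. All values, ids, and function names here are invented for illustration; only the field names follow the slides.

```python
# Referencing: two "collections" linked by address_id (values are made up).
addresses = {1: {"_id": 1, "street": "1 Main St", "city": "Springfield"}}
contacts_ref = {10: {"_id": 10, "name": "Pat Example", "address_id": 1}}


def city_by_reference(contact_id):
    contact = contacts_ref[contact_id]          # first lookup: the contact
    address = addresses[contact["address_id"]]  # second lookup: its address
    return address["city"]


# Embedding: the address lives inside the contact document.
contacts_emb = {
    10: {
        "_id": 10,
        "name": "Pat Example",
        "address": {"street": "1 Main St", "city": "Springfield"},
    }
}


def city_by_embedding(contact_id):
    return contacts_emb[contact_id]["address"]["city"]  # single lookup


ref_city = city_by_reference(10)
emb_city = city_by_embedding(10)
```

Both functions return the same answer; the embedded version reaches it with one read instead of two.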

Relational Schema
Contact
• name
• company
• title
• phone
Address
• street
• city
• state
• zip_code

Document Schema
Contact
• name
• company
• title
• phone
• address
– street
– city
– state
– zip_code

How are they different? Why?
Contact
• name
• company
• title
• phone
Address
• street
• city
• state
• zip_code
Contact
• name
• company
• title
• phone
• address
– street
– city
– state
– zip_code

Schema Flexibility
{
"name": ,
"title": ,
"company": ,
"address": {
"street": ,
"city": ,
"state": ,
"zip_code":
},
"phone":
}
{
"name": ,
"url": ,
"title": ,
"company": ,
"email": ,
"address": {
"street": ,
"city": ,
"state": ,
"zip_code":
},
"phone": ,
"fax":
}
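
Because each record can carry different fields, any common structure has to be enforced by the application. A minimal sketch of such a check follows; the `REQUIRED_FIELDS` rule, the `missing_fields` helper, and the sample values are assumptions made for the example, not something the slides prescribe.

```python
# Hypothetical application rule: every contact needs these fields.
REQUIRED_FIELDS = {"name", "company"}


def missing_fields(doc):
    """Return the required fields that this document lacks, sorted by name."""
    return sorted(REQUIRED_FIELDS - set(doc))


# Two documents with different shapes, both acceptable to the application.
basic = {"name": "Pat Example", "title": "Engineer", "company": "Acme"}
extended = {
    "name": "Sam Example",
    "url": "http://example.com",
    "company": "Acme",
    "email": "sam@example.com",
    "fax": "555-0199",
}
# A document the application would reject before inserting.
incomplete = {"name": "Lee Example", "email": "lee@example.com"}

problems = {d["name"]: missing_fields(d) for d in (basic, extended, incomplete)}
```

Extra fields such as `url` or `fax` pass untouched, so the schema can evolve without a migration step for existing records.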

Let’s Look at an
Address Book

Address Book
• What questions do I have?
• What are my entities?
• What are my associations?

Address Book Entity-Relationship
Contacts
• name
• company
• title
Addresses
• type
• street
• city
• state
• zip_code
Phones
• type
• number
Emails
• type
• address
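Thumbnails
• mime_type
• data
Portraits
• mime_type
• data
Groups
• name
Twitters
• name
• location
• web
• bio
Associations
• Contacts 1 : N Addresses
• Contacts 1 : N Phones
• Contacts 1 : N Emails
• Contacts 1 : 1 Thumbnails
• Contacts 1 : 1 Portraits
• Contacts 1 : 1 Twitters
• Contacts N : N Groups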

One to One
Schema Design Choices
• Reference from contact: contact holds twitter_id (contact 1 : 1 twitter)
• Reference from twitter: twitter holds contact_id (contact 1 : 1 twitter)
• Redundant to track relationship on both sides
– Both references must be updated for consistency
– May save a fetch?
• Embed: Contact holds twitter as an embedded document (1)

One to One
General Recommendation
• Full contact info all at once
– Contact embeds twitter
• Parent-child relationship
– "contains"
• No additional data duplication
• Can query or index on embedded field
– e.g., "twitter.name"
• Exceptional cases…
– Reference portrait which has very large data
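
A small Python sketch of this recommendation, with dictionaries standing in for collections: the twitter details are embedded in the contact, while the large portrait is kept separately and referenced by id. The sample values and the `load_portrait` helper are hypothetical.

```python
# Portraits are bulky, so they live in their own "collection" (made-up data).
portraits = {
    1: {"_id": 1, "mime_type": "image/png", "data": b"\x89PNG placeholder bytes"}
}

# The contact embeds twitter (one to one) and only references the portrait.
contact = {
    "name": "Pat Example",
    "twitter": {"name": "pat_example", "location": "Somewhere"},
    "portrait_id": 1,
}

# One read of the contact already gives the twitter details.
twitter_name = contact["twitter"]["name"]


def load_portrait(contact_doc):
    """Second lookup, performed only when the portrait is actually needed."""
    return portraits[contact_doc["portrait_id"]]


portrait = load_portrait(contact)
```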

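One to Many
• Reference from contact: contact holds phone_ids: [ ] (contact 1 : N phone)
• Reference from phone: phone holds contact_id (contact 1 : N phone)
• Redundant to track relationship on both sides
– Both references must be updated for consistency
– Not possible in relational DBs
– Save a fetch?
• Embed: Contact holds phones as an array of embedded phone documents (N)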

One to Many
• Full contact info all at once
– Contact embeds multiple phones
• Parent-children relationship
– "contains"
• No additional data duplication
• Can query or index on any field
– e.g., { "phones.type": "mobile" }
• Exceptional cases…
– Scaling: maximum document size is 16MB
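
The `{ "phones.type": "mobile" }` query above asks for contacts where any embedded phone has type "mobile". The following plain-Python sketch mimics that meaning without a database; the `has_phone_type` helper and the sample contacts are invented for illustration.

```python
def has_phone_type(contact, phone_type):
    """True if any embedded phone document has the given type."""
    return any(p.get("type") == phone_type for p in contact.get("phones", []))


# Hypothetical contacts, each embedding zero or more phones.
contacts = [
    {
        "name": "Pat",
        "phones": [
            {"type": "work", "number": "555-0100"},
            {"type": "mobile", "number": "555-0101"},
        ],
    },
    {"name": "Sam", "phones": [{"type": "home", "number": "555-0102"}]},
    {"name": "Lee"},  # no phones field at all
]

mobile_owners = [c["name"] for c in contacts if has_phone_type(c, "mobile")]
print(mobile_owners)  # ['Pat']
```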

Many to Many
Traditional Relational Association
Contacts
• name
• company
• title
• phone
Groups
• name
Join table: GroupContacts
• group_id
• contact_id
Use arrays instead of a join table

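Many to Many
• Reference from group: group holds contact_ids: [ ] (group N : N contact)
• Reference from contact: contact holds group_ids: [ ] (group N : N contact)
• Redundant to track relationship on both sides
– Both references must be updated for consistency
• Embed in group: group holds contacts as embedded contact documents (N)
• Embed in contact: contact holds groups as embedded group documents (N)
• Redundant to track relationship on both sides
– Duplicated data must be updated for consistency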

Many to Many
• Depends on use case
1. Simple address book
– Contact references groups (contact holds group_ids: [ ])
2. Corporate email groups
– Group embeds contacts for performance
• Exceptional cases
– Scaling: maximum document size is 16MB
– Scaling may affect performance and working set
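
For the simple address book case, where the contact references its groups through a `group_ids` array, both directions of the association can be answered from that one array. This is a plain-Python sketch with made-up data; `groups_of` and `members_of` are hypothetical helper names.

```python
# Groups are stored once; contacts point at them by id (values are made up).
groups = {1: {"_id": 1, "name": "family"}, 2: {"_id": 2, "name": "work"}}
contacts = [
    {"_id": 10, "name": "Pat", "group_ids": [1, 2]},
    {"_id": 11, "name": "Sam", "group_ids": [2]},
]


def groups_of(contact):
    """Resolve a contact's group references to group names."""
    return [groups[group_id]["name"] for group_id in contact["group_ids"]]


def members_of(group_id):
    """Find every contact whose group_ids array contains group_id."""
    return [c["name"] for c in contacts if group_id in c["group_ids"]]


pat_groups = groups_of(contacts[0])
work_members = members_of(2)
```

Only one side stores the relationship, so there is no second set of references to keep consistent and no duplicated group data.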

Document model - holistic and efficient representation
Contacts
• name
• company
• title
Embedded in Contacts
• addresses (N): type, street, city, state, zip_code
• phones (N): type, number
• emails (N): type, address
• thumbnail (1): mime_type, data
• twitter (1): name, location, web, bio
Referenced from Contacts
• Portraits (1 : 1): mime_type, data
• Groups (N : N): name

Contact document example
{
"name" : "Gary J. Murakami, Ph.D.",
"company" : "MongoDB, Inc.",
"title" : "Lead Engineer",
"twitter" : {
"name" : "Gary Murakami",
"location" : "New Providence, NJ",
"web" : "http://www.nobell.org"
},
"portrait_id" : 1,
"addresses" : ,
"phones" : ,
"emails" :
}

Working Set
To reduce the working set, consider…
• Reference bulk data, e.g., portrait
• Reference less-used data instead of embedding
– Extract into referenced child document
Also for performance issues with large documents
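
Extracting bulk or rarely used data into a referenced child document can be sketched as follows. The `extract_to_child` helper, the `<field>_id` naming, and the sample data are assumptions chosen for the example; they mirror the `portrait_id` reference used elsewhere in the deck.

```python
def extract_to_child(doc, field, child_store, child_id):
    """Move doc[field] into child_store and leave a '<field>_id' reference.

    Returns a new, smaller document; the original is not modified.
    """
    slim = {key: value for key, value in doc.items() if key != field}
    child_store[child_id] = doc[field]
    slim[field + "_id"] = child_id
    return slim


# Hypothetical contact that embeds a large portrait.
contact = {
    "name": "Pat Example",
    "portrait": {"mime_type": "image/png", "data": b"x" * 1000},
}
portraits = {}  # stand-in for a separate portraits collection

slim_contact = extract_to_child(contact, "portrait", portraits, 1)
```

After the split, reads that only need the name no longer pull the portrait bytes along with them.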

Legacy Migration
1. Copy existing schema & some data to MongoDB
2. Iterate schema design development
Measure performance, find bottlenecks, and embed
– one to one associations first
– one to many associations next
– many to many associations
3. Migrate full dataset to new schema
New Software Application? Embed by default
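
The embedding step of a migration might look roughly like this in Python: rows copied from relational tables are folded into contact documents, one to one associations first and one to many next. The table layouts, column names, and the `embed` and `strip_fk` helpers are all hypothetical.

```python
# Rows as they might arrive from a legacy relational schema (made-up data).
contact_rows = [{"id": 1, "name": "Pat"}, {"id": 2, "name": "Sam"}]
twitter_rows = [{"contact_id": 1, "name": "pat_example"}]
phone_rows = [
    {"contact_id": 1, "type": "mobile", "number": "555-0101"},
    {"contact_id": 1, "type": "work", "number": "555-0100"},
    {"contact_id": 2, "type": "home", "number": "555-0102"},
]


def strip_fk(row):
    """Drop the foreign key; embedding makes it unnecessary."""
    return {k: v for k, v in row.items() if k != "contact_id"}


def embed(contact_rows, twitter_rows, phone_rows):
    docs = {
        r["id"]: {"_id": r["id"], "name": r["name"], "phones": []}
        for r in contact_rows
    }
    for row in twitter_rows:  # one to one: embed a single sub-document
        docs[row["contact_id"]]["twitter"] = strip_fk(row)
    for row in phone_rows:    # one to many: append to an embedded array
        docs[row["contact_id"]]["phones"].append(strip_fk(row))
    return list(docs.values())


documents = embed(contact_rows, twitter_rows, phone_rows)
```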

Embedding over Referencing
• Embedding is a bit like pre-joined data
– BSON (Binary JSON) document ops are easy for the
server
• Embed (a 90/10 rule of thumb) in the following cases
– When the "one" or "many" objects are viewed in the
context of their parent
– For performance
– For atomicity
• Reference
– When you need more scaling
– For easy consistency with "many to many" associations
without duplicated data

It’s All About Your Application
• Programs + Databases = (Big) Data Applications
• Your schema is the impedance matcher
– Design choices: normalize/denormalize,
reference/embed
– Melds programming with MongoDB for best of both
– Flexible for development and change
• Programs × MongoDB = Great Big Data Applications

Thank You
Perl Engineer & Evangelist, MongoDB
Mike Friedman
#mongodb
@friedo
