Schema Design
Mike Friedman
Perl Engineer & Evangelist, MongoDB
Agenda
• What is a Record?
• Core Concepts
• What is an Entity?

• Associating Entities
• General Recommendations
All application development is Schema Design
Success comes from Proper Data Structure
What is a Record?
Key → Value
• One-dimensional storage: Key → Blob
• Single value is a blob
• Query on key only
• No schema
• Value cannot be updated, only replaced
Relational
• Two-dimensional storage (tuples), addressed by Primary Key
• Each field contains a single value
• Query on any field
• Very structured schema (table)
• In-place updates
• Normalization process requires many tables, joins, and indexes, and results in poor data locality
Document
• N-dimensional storage, addressed by _id
• Each field can contain 0, 1, many, or embedded values
• Query on any field & level
• Flexible schema
• Inline updates *
• Embedding related data has optimal data locality, requires fewer indexes, and has better performance
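A rough sketch of the three record shapes above, using plain Python values as in-memory stand-ins rather than a real database driver. The names `kv_store`, `row`, and `doc`, and all sample values, are illustrative assumptions and do not come from the slides.

```python
import json

# Key -> value: the store sees only an opaque blob, so changing one
# field means decoding, editing, and replacing the whole value.
kv_store = {"contact:1": json.dumps({"name": "Ada", "city": "London"})}
blob = json.loads(kv_store["contact:1"])
blob["city"] = "Paris"
kv_store["contact:1"] = json.dumps(blob)   # replace, not update

# Relational: a flat tuple with exactly one value per column.
columns = ("id", "name", "city")
row = (1, "Ada", "London")

# Document: a field may hold zero, one, many, or embedded values.
doc = {
    "_id": 1,
    "name": "Ada",
    "phones": ["555-0100", "555-0101"],   # many values
    "address": {"city": "London"},        # embedded value
}
doc["address"]["city"] = "Paris"          # change one nested field in place
```

The point of the contrast is only the shape of the data: the blob must be replaced whole, the tuple is flat, and the document can be addressed at any level.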
Core Concepts
Traditional Schema Design

Focus on data storage
Document Schema Design

Focus on data use
Another way to think about it
What answers do I have?
What questions do I have?
Three Building Blocks of Document Schema Design
1 – Flexibility
• Choices for schema design
• Each record can have different fields
• Common structure can be enforced by application

• Easy to evolve as needed
2 – Arrays
Multiple Values per Field
• Each field can be:
– Absent
– Set to null
– Set to a single value
– Set to an array of many values
• Query for any matching value
– Can be indexed; each value in the array is in the index
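The array behaviour described above can be sketched in plain Python. This is an in-memory illustration only, not MongoDB's implementation: `values_of`, `find`, and `build_index` are hypothetical helpers, the sample contacts are invented, and null is simplified here to "no values".

```python
from collections import defaultdict

contacts = [
    {"_id": 1, "name": "Ada", "tags": ["friend", "work"]},  # array of values
    {"_id": 2, "name": "Bob", "tags": "work"},              # single value
    {"_id": 3, "name": "Cy", "tags": None},                 # set to null
    {"_id": 4, "name": "Di"},                               # field absent
]

def values_of(doc, field):
    """Return the field's values as a list; [] if absent or null."""
    value = doc.get(field)
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

def find(docs, field, wanted):
    """Ids of documents where any value of the field equals `wanted`."""
    return [d["_id"] for d in docs if wanted in values_of(d, field)]

def build_index(docs, field):
    """One index entry per array element, pointing back at the document."""
    index = defaultdict(list)
    for d in docs:
        for v in values_of(d, field):
            index[v].append(d["_id"])
    return index

tag_index = build_index(contacts, "tags")
```

Here `find(contacts, "tags", "work")` matches both the array-valued and the single-valued document, and `tag_index` holds a separate entry for each element of every array.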
3 – Embedded Documents
• An acceptable value is a document
• Nested documents provide structure
• Query any field at any level
– Can be indexed
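Querying "any field at any level" is commonly written with a dotted path. The helper below is a minimal sketch of that idea over nested Python dicts; `get_path` is a hypothetical name and the sample document is invented for illustration.

```python
def get_path(doc, path):
    """Follow a dotted path such as "address.city" through nested dicts.

    Returns None when any step of the path is missing.
    """
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current

contact = {
    "name": "Ada",
    "address": {"city": "London", "geo": {"lat": 51.5}},
}
```

`get_path(contact, "address.geo.lat")` reaches a value two levels down, while a path that does not exist simply yields `None`.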
What is an Entity?
An Entity
• Object in your model
• Associations with other entities

Referencing (Relational)       Embedding (Document)
has_one                        embeds_one
belongs_to                     embedded_in
has_many                       embeds_many
has_and_belongs_to_many

MongoDB has both referencing and embedding for universal coverage
Let's model something together
How about a business card?
Business Card
Referencing
Contacts
{
  "_id": …,
  "name": …,
  "title": …,
  "company": …,
  "phone": …,
  "address_id": …
}

Addresses
{
  "_id": …,
  "street": …,
  "city": …,
  "state": …,
  "zip_code": …,
  "country": …
}
Embedding
Contacts
{
  "_id": …,
  "name": …,
  "title": …,
  "company": …,
  "address": {
    "street": …,
    "city": …,
    "state": …,
    "zip_code": …,
    "country": …
  },
  "phone": …
}
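The practical difference between the two layouts is the number of fetches needed to assemble one business card. The sketch below counts them using Python dicts as stand-in collections; `fetch`, `counter`, and all sample values are assumptions made for illustration, not a driver API.

```python
counter = {"fetches": 0}

def fetch(collection, _id):
    """Look up one document by _id and count the round trip."""
    counter["fetches"] += 1
    return collection[_id]

# Referencing: the contact points at a separate address document.
addresses = {10: {"_id": 10, "city": "Springfield"}}
contacts_ref = {1: {"_id": 1, "name": "Example Contact", "address_id": 10}}

contact = fetch(contacts_ref, 1)
address = fetch(addresses, contact["address_id"])
fetches_referencing = counter["fetches"]

# Embedding: the address travels inside the contact document.
counter["fetches"] = 0
contacts_emb = {
    1: {"_id": 1, "name": "Example Contact", "address": {"city": "Springfield"}}
}
contact = fetch(contacts_emb, 1)
fetches_embedding = counter["fetches"]
```

With referencing the application performs two lookups (or the database performs a join); with embedding a single lookup returns the whole card.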
Relational Schema
Contact
• name
• company
• title
• phone
Address
• street
• city
• state
• zip_code
Document Schema
Contact
• name
• company
• address
– street
– city
– state
– zip_code
• title
• phone
Relational Schema (Contact references Address) and Document Schema (Contact embeds address), side by side
How are they different? Why?
Schema Flexibility
{
  "name": …,
  "title": …,
  "company": …,
  "address": {
    "street": …,
    "city": …,
    "state": …,
    "zip_code": …
  },
  "phone": …
}

{
  "name": …,
  "url": …,
  "title": …,
  "company": …,
  "email": …,
  "address": {
    "street": …,
    "city": …,
    "state": …,
    "zip_code": …
  },
  "phone": …,
  "fax": …
}
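Two documents with different fields can live side by side, with the application enforcing whatever common structure it needs. A minimal sketch follows; the `REQUIRED` rule, the `missing_fields` helper, and the sample values are hypothetical and chosen only to illustrate the idea.

```python
# Hypothetical application rule: every contact needs these fields.
REQUIRED = {"name", "company"}

def missing_fields(doc):
    """Return the required fields the document lacks, sorted for stable output."""
    return sorted(REQUIRED - set(doc))

basic = {
    "name": "Ada",
    "title": "Engineer",
    "company": "Example Co",
    "phone": "555-0100",
}
# Same common structure, extended with extra fields on this document only.
extended = dict(basic, url="http://example.com", email="ada@example.com",
                fax="555-0199")
incomplete = {"name": "Bob"}
```

Both `basic` and `extended` pass the application check even though their field sets differ; only `incomplete` is flagged.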
Example
Let’s Look at an Address Book
Address Book
• What questions do I have?
• What are my entities?
• What are my associations?
Address Book Entity-Relationship
Entities
• Contacts: name, company, title
• Groups: name
• Twitters: name, location, web, bio
• Thumbnails: mime_type, data
• Portraits: mime_type, data
• Addresses: type, street, city, state, zip_code
• Phones: type, number
• Emails: type, address
Associations
• Contacts 1 : 1 Twitters
• Twitters 1 : 1 Thumbnails
• Contacts 1 : 1 Portraits
• Contacts 1 : N Addresses, Phones, Emails
• Groups N : N Contacts
Associating Entities
Address Book Entity-Relationship: One to One
One to One
Schema Design Choices
• Reference from contact: contact (• twitter_id) 1 : 1 twitter
• Reference from twitter: contact 1 : 1 twitter (• contact_id)
• Embed: Contact (• twitter) contains twitter
• May save a fetch?
Redundant to track relationship on both sides
• Both references must be updated for consistency
One to One
General Recommendation
• Full contact info all at once
– Contact embeds twitter
• Parent-child relationship
– “contains”
• No additional data duplication
• Can query or index on embedded field
– e.g., “twitter.name”
Contact (• twitter) embeds twitter (1)
Address Book Entity-Relationship: One to Many
One to Many
Schema Design Choices
• Array of references: contact (• phone_ids: [ ]) 1 : N phone
• Reference from phone: contact 1 : N phone (• contact_id)
• Embed: Contact (• phones) contains N phone
• Not possible in relational DBs
• Save a fetch?
Redundant to track relationship on both sides
• Both references must be updated for consistency
One to Many
General Recommendation
Contact (• phones) embeds N phone
• Full contact info all at once
– Contact embeds multiple phones
• Parent-children relationship
– “contains”
• No additional data duplication
• Can query or index on any field
– e.g., { "phones.type": "mobile" }
• Exceptional cases…
– Scaling: maximum document size is 16MB
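The "phones.type" query above asks for contacts with at least one embedded phone of a given type. A plain-Python sketch of that matching rule follows; `with_phone_type` is a hypothetical helper and the contacts are invented sample data.

```python
contacts = [
    {"name": "Ada", "phones": [
        {"type": "mobile", "number": "555-0100"},
        {"type": "work", "number": "555-0101"},
    ]},
    {"name": "Bob", "phones": [{"type": "home", "number": "555-0102"}]},
    {"name": "Cy"},  # no phones field at all
]

def with_phone_type(docs, phone_type):
    """Names of contacts having at least one embedded phone of the given type."""
    return [
        d["name"]
        for d in docs
        if any(p.get("type") == phone_type for p in d.get("phones", []))
    ]
```

Because the phones are embedded, a single pass over the contact documents answers the question; no second collection is consulted.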
Address Book Entity-Relationship: Many to Many
Many to Many
Traditional Relational Association
Join table
Groups
• name
GroupContacts
• group_id
• contact_id
Contacts
• name
• company
• title
• phone
Use arrays instead
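Replacing the join table with arrays can be sketched as follows: each row of a hypothetical `group_contacts` table is folded into a `group_ids` array on the matching contact. The data and the `members` helper are illustrative assumptions, with dicts standing in for collections.

```python
# Rows of the join table: (group_id, contact_id)
group_contacts = [(1, 100), (1, 101), (2, 100)]

contacts = {
    100: {"_id": 100, "name": "Ada"},
    101: {"_id": 101, "name": "Bob"},
}

# Fold each join row into an array of references on the contact.
for doc in contacts.values():
    doc["group_ids"] = []
for group_id, contact_id in group_contacts:
    contacts[contact_id]["group_ids"].append(group_id)

def members(group_id):
    """Ids of contacts whose group_ids array contains the given group."""
    return sorted(c["_id"] for c in contacts.values() if group_id in c["group_ids"])
```

After the fold, the association lives on the contact itself and the separate GroupContacts table is no longer needed.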
Many to Many
Schema Design Choices
Referencing
• group (• contact_ids: [ ]) N : N contact
• group N : N contact (• group_ids: [ ])
Redundant to track relationship on both sides
• Both references must be updated for consistency
Embedding
• group (• contacts) embeds N contact
• contact (• groups) embeds N group
Redundant to track relationship on both sides
• Duplicated data must be updated for consistency
Many to Many
General Recommendation
group N : N contact (• group_ids: [ ])
• Depends on use case
1. Simple address book
• Contact references groups
2. Corporate email groups
• Group embeds contacts for performance
• Exceptional cases
– Scaling: maximum document size is 16MB
– Scaling may affect performance and working set
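When groups embed contacts for performance, the same contact is duplicated in every group it belongs to, so an update has to reach each copy. The sketch below illustrates that cost with invented data; `update_title` is a hypothetical helper, not a MongoDB operation.

```python
groups = [
    {"name": "engineering", "contacts": [
        {"_id": 100, "title": "Engineer"},
    ]},
    {"name": "all-staff", "contacts": [
        {"_id": 100, "title": "Engineer"},
        {"_id": 101, "title": "Analyst"},
    ]},
]

def update_title(groups, contact_id, new_title):
    """Rewrite every embedded copy of the contact; return how many were touched."""
    touched = 0
    for group in groups:
        for contact in group["contacts"]:
            if contact["_id"] == contact_id:
                contact["title"] = new_title
                touched += 1
    return touched

copies_updated = update_title(groups, 100, "Lead Engineer")
```

With references (`group_ids` on the contact) the same change would touch one document; that trade-off is why the recommendation depends on the use case.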
Document model - holistic and efficient representation
Groups
• name
Groups N : N Contacts
Contacts
• name
• company
• title
• twitter (1): name, location, web, bio
– thumbnail (1): mime_type, data
• addresses (N): type, street, city, state, zip_code
• phones (N): type, number
• emails (N): type, address
Contacts 1 : 1 Portraits
Portraits
• mime_type
• data
Contact document example
{
  "name" : "Gary J. Murakami, Ph.D.",
  "company" : "MongoDB, Inc.",
  "title" : "Lead Engineer",
  "twitter" : {
    "name" : "Gary Murakami",
    "location" : "New Providence, NJ",
    "web" : "http://www.nobell.org"
  },
  "portrait_id" : 1,
  "addresses" : …,
  "phones" : …,
  "emails" : …
}
Working Set
To reduce the working set, consider…
• Reference bulk data, e.g., portrait
• Reference less-used data instead of embedding
– Extract into referenced child document

Also for performance issues with large documents
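Extracting bulk data into a referenced child document might look like the sketch below, which moves an embedded portrait into its own collection and leaves a `portrait_id` behind. `extract_portrait` is a hypothetical helper, and dicts stand in for collections.

```python
def extract_portrait(contact, portraits, portrait_id):
    """Store the embedded portrait in `portraits` and return a slimmer
    copy of the contact that keeps only a reference to it."""
    slim = {k: v for k, v in contact.items() if k != "portrait"}
    portraits[portrait_id] = contact["portrait"]
    slim["portrait_id"] = portrait_id
    return slim

portraits = {}
contact = {
    "name": "Ada",
    "portrait": {"mime_type": "image/png", "data": b"\x00" * 100_000},
}
slim = extract_portrait(contact, portraits, 1)
```

Routine reads of `slim` no longer carry the image bytes; the portrait is fetched by its id only when it is actually needed.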
General Recommendations
Legacy Migration
1. Copy existing schema & some data to MongoDB
2. Iterate schema design development
   Measure performance, find bottlenecks, and embed
   a. one to one associations first
   b. one to many associations next
   c. many to many associations
3. Migrate full dataset to new schema

New Software Application? Embed by default
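One iteration of the legacy migration can be sketched as two small transforms: first embed the one-to-one association, then the one-to-many. The table contents and helper names below are hypothetical, and lists of dicts stand in for copied legacy tables.

```python
# Step 1: legacy tables copied as-is (flat rows with foreign keys).
contacts = [{"_id": 1, "name": "Ada", "twitter_id": 7}]
twitters = [{"_id": 7, "name": "ada_l", "bio": "hypothetical"}]
phones = [
    {"_id": 20, "contact_id": 1, "type": "mobile", "number": "555-0100"},
    {"_id": 21, "contact_id": 1, "type": "work", "number": "555-0101"},
]

def embed_one_to_one(contacts, twitters):
    """Replace each contact's twitter_id with the embedded twitter document."""
    by_id = {t["_id"]: t for t in twitters}
    out = []
    for c in contacts:
        doc = {k: v for k, v in c.items() if k != "twitter_id"}
        t = by_id.get(c.get("twitter_id"))
        if t is not None:
            doc["twitter"] = {k: v for k, v in t.items() if k != "_id"}
        out.append(doc)
    return out

def embed_one_to_many(contacts, phones):
    """Attach each contact's phone rows as an embedded array."""
    out = []
    for c in contacts:
        doc = dict(c)
        doc["phones"] = [
            {"type": p["type"], "number": p["number"]}
            for p in phones
            if p["contact_id"] == c["_id"]
        ]
        out.append(doc)
    return out

# Step 2: embed one-to-one first, then one-to-many.
migrated = embed_one_to_many(embed_one_to_one(contacts, twitters), phones)
```

Each transform leaves the input lists untouched, so the steps can be measured and repeated independently before the full dataset is migrated.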
Embedding over Referencing
• Embedding is a bit like pre-joined data
– BSON (Binary JSON) document ops are easy for the server
• Embed (following the 90/10 rule of thumb)
– When the “one” or “many” objects are viewed in the context of their parent
– For performance
– For atomicity
• Reference
– When you need more scaling
– For easy consistency with “many to many” associations without duplicated data
It’s All About Your Application
• Programs+Databases = (Big) Data Applications
• Your schema is the impedance matcher
– Design choices: normalize/denormalize, reference/embed
– Melds programming with MongoDB for best of both
– Flexible for development and change
• Programs MongoDB = Great Big Data Applications
Thank You
Mike Friedman
Perl Engineer & Evangelist, MongoDB


Editor's Notes

  • #5 Schema Design is very important; its impact on your application is pervasive.
  • #6 Wrong data structure will hurt you. Proper data structure can make all the pieces fall into place.
  • #8 One-dimensional storage can be very fast but is relatively limited compared with other DBMSs.
  • #9 Two-dimensional storage of ordered tuples or traditional records. The winning technology is that every field/value is first class. In essence, every field can be addressed in queries and can be indexed for faster processing. Normalization process requires many tables, joins to rehydrate, indexes to make joins faster, and results in poor data locality.
  • #10 The essential capability of the winning technology from relational databases persists and gets even better. The document structure can match your data structures – your schema.
  • #14 What questions do I have? What are my use cases? Does your schema take advantage of your application-specific knowledge of known queries, use cases, and client-program data structures? Traditional DBs make it hard to take advantage of them. Document DBs make it easy to take advantage of them. MongoDB documents can match your application – given good schema design.
  • #16 Not “schema-less” but rather “flexible schema”. Common structure can be enforced by application. While MongoDB does not enforce common structure, neither does it restrict your application. Documents may have a common structure that is optionally extended at the document level. Example problems for traditional DBs: many empty columns instead of subclassing via yet another table; three days for schema migration. Keywords: flexible, choice, evolve, change, modify
  • #17 Concept of arrays incorporates multiple values, associations involving many entities. The lack of multivalued fields is usually the first complaint of programmers who don’t wish to pay the cost for normalization. Keywords: array, multiple, many
  • #18 Documents may have a common structure that is optionally extended at the document level. The application mapping can enforce the required and optional fields.
  • #22 “Vintage” business card
  • #23 Contact and Address entities are associated one to one. Traditional relational association is via referencing. In this example, the contact record for Steve Jobs has a reference to his address via the address_id field.
  • #25 Entity-Relational diagram
  • #26 Entity-Relational diagram for embedding documents
  • #27 Left – relational – requires two fetches/queries (or a join in a relational DB). Right – document – requires only one fetch/query and has data locality.
  • #29 A common example will help us understand the joy of flexible document structure.
  • #32 Left: One to one. We're going to assume users only have one Twitter account. A thumbnail is a small profile image while a portrait is a very large profile image. Right: One to many. Middle: Many to many.
  • #41 Arrays of references are more direct than a join table and save a fetch.
  • #43 Fundamentally not “contains”. Concerns – exceptional cases: exceeding maximum document size due to large data or scaling; transferring very large documents is probably a performance concern; scaling may affect working set size. Schema can be adjusted to improve performance – fetch only the data that you need.
  • #44 Embedding entities in the contact document reduces six fetches to one
  • #45 Embedding is used for both one-to-one and one-to-many associations, resulting in exactly what you expect and want for a contact. (This example has no thumbnail or groups.)
  • #46 $project allows you to select top level fields and can be used to reduce data for a fetch. Note that some ODMs may not allow you to specify $project.
  • #48 For many-to-many associations, eliminate join table using array of references or embedded documents
  • #49 Choose embedding by default as opposed to referencing. Referencing is not just the default for relational DBs; there is no other choice.
  • #50 May you build Great Big Data Applications. Perhaps you can say inspiring quotes like Ken Thompson, “Play chess with God.” Ken and I worked on Perceptual Audio Coding, better known as Advanced Audio Coding or AAC as found in the iPod and iPhone. So I hope that this will inspire you to “Play music with God” to design your killer app.
  • #51 BSON (Binary JSON) is the “magic” or core technology in MongoDB for data structures and performance. BSON does not have to be parsed like JSON, but is rather a format that can be traversed easily. Can choose a language to fit your application, or multiple languages to fit multiple components of your application as appropriate.