Schema Design
  Bernie Hackett
bernie@10gen.com
Topics

Introduction
• Basic Data Modeling
• Manipulating Data
• Evolving a schema
Common patterns
• Single table inheritance
• One-to-Many & Many-to-Many
• Trees
• Queues
So why model data?




    http://www.flickr.com/photos/42304632@N00/493639870/
Benefits of relational

• Before relational
   • Data and Logic combined
• After relational
   • Separation of concerns
   • Data modeled independent of logic
   • Logic freed from concerns of data design

• MongoDB continues this separation
Normalization

Goals
• Avoid anomalies when inserting, updating or
  deleting
• Minimize redesign when extending the schema
• Make the model informative to users
• Avoid bias toward a particular query
In MongoDB
• Similar goals apply
• The rules are different
Relational made normalized
data look like this
Document databases make
normalized data look like this
Terminology
RDBMS                MongoDB
Table                Collection
Row(s)               JSON Document
Index                Index
Join                 Embedding & Linking
Partition            Shard
Partition Key        Shard Key
DB Considerations
How can we manipulate this data?
 • Dynamic Queries
 • Secondary Indexes
 • Atomic Updates
 • Map Reduce
Access Patterns?
 • Read / Write Ratio
 • Types of updates
 • Types of queries
 • Data life-cycle
Further Considerations
 • No Joins
 • Document writes are atomic
So today’s example will use...
Design Session
Design documents that simply map to
your application
> post = { author: "Hergé",
           date: new Date(),
           text: "Destination Moon",
           tags: [ "comic", "adventure" ] }

> db.posts.save(post)
Find the document
> db.posts.find()

  { _id: ObjectId("4c4ba5c0672c685e5e8aabf3"),
    author: "Hergé",
    date: "Sat Jul 24 2010 19:47:11 GMT-0700 (PDT)",
    text: "Destination Moon",
    tags: [ "comic", "adventure" ] }

Notes:
• ID must be unique, but can be anything you’d like
• MongoDB will generate a default ID if one is not
  supplied
Add an index, find via index
Secondary index for "author"

  // 1 means ascending, -1 means descending
  > db.posts.ensureIndex( { author: 1 } )

  > db.posts.find( { author: 'Hergé' } )

  { _id: ObjectId("4c4ba5c0672c685e5e8aabf3"),
    date: "Sat Jul 24 2010 19:47:11 GMT-0700 (PDT)",
    author: "Hergé",
    ... }
Verifying indexes exist
> db.posts.getIndexes()

// Index on ID
  { name: "_id_",
    ns: "test.posts",
    key: { "_id" : 1 } }

// Index on author
  { _id: ObjectId("4c4ba6c5672c685e5e8aabf4"),
    ns: "test.posts",
    key: { "author" : 1 },
    name: "author_1" }
Examine the query plan
> db.posts.find( { author: 'Hergé' } ).explain()
{
    "cursor" : "BtreeCursor author_1",
    "nscanned" : 1,
    "nscannedObjects" : 1,
    "n" : 1,
    "millis" : 5,
    "indexBounds" : {
        "author" : [
            [
                "Hergé",
                "Hergé"
            ]
        ]
    }
}
Query operators
Conditional operators:
 $ne, $in, $nin, $mod, $all, $size, $exists, $type,
 $lt, $lte, $gt, $gte

// find posts with any tags
> db.posts.find( { tags: { $exists: true } } )

Regular expressions:
// posts where author starts with h
> db.posts.find( { author: /^h/i } )

Counting:
// number of posts written by Hergé
> db.posts.find( { author: "Hergé" } ).count()
Extending the Schema





> new_comment = { author: "Bernie",
                  date: new Date(),
                  text: "great book" }

> db.posts.update(
      { text: "Destination Moon" },
      { '$push': { comments: new_comment },
        '$inc':  { comments_count: 1 } } )
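The effect of the two modifiers can be pictured with a plain JavaScript sketch. This is not MongoDB code: applyUpdate is a hypothetical helper that mimics only what $push and $inc do to one in-memory document, assuming a missing array field starts empty and a missing counter starts at 0.

```javascript
// Hypothetical helper (not a MongoDB API): mimics $push and $inc on one object.
function applyUpdate(doc, update) {
  const push = update.$push || {};
  const inc = update.$inc || {};
  for (const field of Object.keys(push)) {
    // $push appends to the array, creating it first if the field is missing.
    if (doc[field] === undefined) doc[field] = [];
    doc[field].push(push[field]);
  }
  for (const field of Object.keys(inc)) {
    // $inc adds to the number, treating a missing field as 0.
    doc[field] = (doc[field] || 0) + inc[field];
  }
  return doc;
}

const post = { author: "Hergé", text: "Destination Moon", tags: ["comic", "adventure"] };
const newComment = { author: "Bernie", date: new Date(), text: "great book" };

applyUpdate(post, { $push: { comments: newComment }, $inc: { comments_count: 1 } });
console.log(post.comments.length, post.comments_count);
```

Keeping comments_count next to the array means the "most commented" sort never has to measure array lengths.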
Extending the Schema




{ _id : ObjectId("4c4ba5c0672c685e5e8aabf3"),
  author : "Hergé",
  date : "Sat Jul 24 2010 19:47:11 GMT-0700 (PDT)",
  text : "Destination Moon",
  tags : [ "comic", "adventure" ],

  comments : [
    {
      author : "Bernie",
      date : "Sat Jul 24 2010 20:51:03 GMT-0700 (PDT)",
      text : "great book"
    }
  ],
  comments_count: 1
}



Extending the Schema
// create index on nested documents:
> db.posts.ensureIndex( { "comments.author": 1 } )

> db.posts.find( { "comments.author": "Bernie" } )

// find last 5 posts:
> db.posts.find().sort( { date: -1 } ).limit(5)

// most commented post:
> db.posts.find().sort( { comments_count: -1 } ).limit(1)



      When sorting, check if you need an index
Watch for full table scans

> db.posts.find( { text: 'Destination Moon' } ).explain()

{
    "cursor" : "BasicCursor",
    "nscanned" : 1,
    "nscannedObjects" : 1,
    "n" : 1,
    "millis" : 0,
    "indexBounds" : {

    }
}
Map Reduce
Map reduce : count tags
mapFunc = function () {
    this.tags.forEach( function( z ) { emit( z, { count: 1 } ); } );
}

reduceFunc = function( k, v ) {
    var total = 0;
    for ( var i = 0; i < v.length; i++ ) {
        total += v[i].count;
    }
    return { count: total };
}

res = db.posts.mapReduce( mapFunc, reduceFunc )

> db[res.result].find()
    { _id : "comic", value : { count : 1 } }
    { _id : "adventure", value : { count : 1 } }
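How the map and reduce functions fit together can be sketched in plain JavaScript, without a server. The runMapReduce driver and the sample posts below are assumptions made for illustration; a real server may call reduce several times for one key, or skip it for keys with a single value, which is why the reduce output has the same shape as the emitted value.

```javascript
// In-memory stand-in for the posts collection (sample data is made up).
const posts = [
  { text: "Destination Moon", tags: ["comic", "adventure"] },
  { text: "Another post", tags: ["comic"] },
];

// Same logic as mapFunc, but emit is passed in instead of being a global.
function mapTags(doc, emit) {
  doc.tags.forEach((tag) => emit(tag, { count: 1 }));
}

// Same logic as reduceFunc: sum the counts emitted for one key.
function reduceCounts(key, values) {
  let total = 0;
  for (const v of values) total += v.count;
  return { count: total };
}

// Hypothetical driver: collect emitted values per key, then reduce each group.
function runMapReduce(docs, map, reduce) {
  const groups = new Map();
  for (const doc of docs) {
    map(doc, (key, value) => {
      if (!groups.has(key)) groups.set(key, []);
      groups.get(key).push(value);
    });
  }
  return [...groups].map(([key, values]) => ({ _id: key, value: reduce(key, values) }));
}

const tagCounts = runMapReduce(posts, mapTags, reduceCounts);
console.log(tagCounts);
```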
Group

• Equivalent to a Group By in SQL

• Specify the attributes to group the data by

• Process the results in a Reduce function
Group - Count posts by author
cmd = { key: { "author": true },
        initial: { count: 0 },
        reduce: function(obj, prev) {
            prev.count++;
        }
      };
result = db.posts.group(cmd);

[
    { "author" : "Hergé", "count" : 1 },
    { "author" : "Kyle", "count" : 3 }
]
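The roles of key, initial, and reduce can be illustrated with a small in-memory version. The group function below is a hypothetical stand-in written for this sketch, not the server's implementation, and the sample posts are made up to match the counts shown above.

```javascript
// Hypothetical in-memory version of group(); not the MongoDB implementation.
function group(docs, cmd) {
  const buckets = new Map();
  for (const doc of docs) {
    // Build the bucket key from the fields named in cmd.key.
    const keyDoc = {};
    for (const field of Object.keys(cmd.key)) keyDoc[field] = doc[field];
    const id = JSON.stringify(keyDoc);
    // Each new bucket starts from the key fields plus a shallow copy of cmd.initial.
    if (!buckets.has(id)) buckets.set(id, Object.assign({}, keyDoc, cmd.initial));
    // reduce mutates the running aggregate for this bucket.
    cmd.reduce(doc, buckets.get(id));
  }
  return [...buckets.values()];
}

const samplePosts = [
  { author: "Hergé" },
  { author: "Kyle" },
  { author: "Kyle" },
  { author: "Kyle" },
];

const counts = group(samplePosts, {
  key: { author: true },
  initial: { count: 0 },
  reduce: function (obj, prev) { prev.count++; },
});
console.log(counts);
```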
Review

So Far:
- Started out with a simple schema
- Queried Data
- Evolved the schema
- Queried / Updated the data some more
Inheritance
Single Table Inheritance - RDBMS
shapes table

  id   type     area   radius   d   length   width

  1    circle   3.14   1
  2    square   4               2
  3    rect     10                  5        2
Single Table Inheritance - MongoDB
> db.shapes.find()
  { _id: "1", type: "circle", area: 3.14, radius: 1 }
  { _id: "2", type: "square", area: 4, d: 2 }
  { _id: "3", type: "rect", area: 10, length: 5, width: 2 }

// find shapes where radius > 0
> db.shapes.find( { radius: { $gt: 0 } } )

// create index
> db.shapes.ensureIndex( { radius: 1 } )
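Because each document carries only its own fields, a comparison on radius simply never matches the shapes that lack it. A minimal plain JavaScript sketch of that behaviour follows; radiusGreaterThan is a hypothetical stand-in for the query, not a MongoDB call.

```javascript
const shapes = [
  { _id: "1", type: "circle", area: 3.14, radius: 1 },
  { _id: "2", type: "square", area: 4, d: 2 },
  { _id: "3", type: "rect", area: 10, length: 5, width: 2 },
];

// Hypothetical stand-in for find({ radius: { $gt: 0 } }):
// a document without the field never satisfies the comparison.
function radiusGreaterThan(docs, min) {
  return docs.filter((doc) => typeof doc.radius === "number" && doc.radius > min);
}

const round = radiusGreaterThan(shapes, 0);
console.log(round.map((s) => s.type));
```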
One to Many
One to Many relationships can specify
• degree of association between objects
• containment
• life-cycle
One to Many
- Embedded Array / Array Keys
  - $slice operator to return a subset of the array
  - some queries are harder,
    e.g. find the latest comments across all documents
blogs:
  { author : "Hergé",
    date : "Sat Jul 24 2010 19:47:11 GMT-0700 (PDT)",
    comments : [
      {
        author : "Bernie",
        date : "Sat Jul 24 2010 20:51:03 GMT-0700 (PDT)",
        text : "great book"
      }
    ]
  }
One to Many
- Embedded tree
  - Single document
  - Natural
  - Hard to query
blogs:
  { author : "Hergé",
    date : "Sat Jul 24 2010 19:47:11 GMT-0700 (PDT)",
    comments : [
      {
        author : "Bernie",
        date : "Sat Jul 24 2010 20:51:03 GMT-0700 (PDT)",
        text : "great book",
        replies: [ { author : "James", ... } ]
      }
    ]
  }
One to Many
- Normalized (2 collections)
  - most flexible
  - more queries
blogs:
  { author : "Hergé",
    date : "Sat Jul 24 2010 19:47:11 GMT-0700 (PDT)",
    comments : [
      { comment : ObjectId("1") }
    ]
  }

comments:
  { _id : "1",
    author : "James",
    date : "Sat Jul 24 2010 20:51:03 ..." }
One to Many - patterns


- Embedded Array / Array Keys
- Embedded tree
- Normalized
Many - Many
Example:

- Product can be in many categories
- Category can have many products
Many - Many

products:
  { _id: ObjectId("10"),
    name: "Destination Moon",
    category_ids: [ ObjectId("20"), ObjectId("30") ] }

categories:
  { _id: ObjectId("20"),
    name: "adventure",
    product_ids: [ ObjectId("10"), ObjectId("11"), ObjectId("12") ] }

// All categories for a given product
> db.categories.find( { product_ids: ObjectId("10") } )
Alternative
products:
  { _id: ObjectId("10"),
    name: "Destination Moon",
    category_ids: [ ObjectId("20"), ObjectId("30") ] }

categories:
  { _id: ObjectId("20"),
    name: "adventure" }

// All products for a given category
> db.products.find( { category_ids: ObjectId("20") } )

// All categories for a given product
> product = db.products.findOne( { _id : some_id } )
> db.categories.find( { _id : { $in : product.category_ids } } )
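The two-step lookup for "all categories for a given product" can be sketched in plain JavaScript over in-memory arrays. The helper names, the string ids, and the extra categories ("comic", "cooking") are assumptions for illustration; each helper stands in for one round trip to the database.

```javascript
// In-memory stand-ins for the two collections (ids simplified to strings).
const products = [
  { _id: "10", name: "Destination Moon", category_ids: ["20", "30"] },
];
const categories = [
  { _id: "20", name: "adventure" },
  { _id: "30", name: "comic" },
  { _id: "40", name: "cooking" },
];

// Step 1: fetch the product (stands in for findOne on products).
function findProduct(id) {
  return products.find((p) => p._id === id);
}

// Step 2: fetch the categories whose _id is in the product's list (stands in for $in).
function categoriesFor(product) {
  return categories.filter((c) => product.category_ids.includes(c._id));
}

const product = findProduct("10");
const names = categoriesFor(product).map((c) => c.name);
console.log(names);
```

The trade-off is the one this slide describes: categories stay small, at the cost of a second query in this direction.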

Trees
Full Tree in Document

{ comments: [
    { author: "Bernie", text: "...",
      replies: [
        { author: "James", text: "...",
          replies: [] }
      ] }
  ]
}

Pros: Single Document, Performance, Intuitive

Cons: Hard to search, Partial Results, 16MB limit




Trees
Parent Links
- Each node is stored as a document
- Contains the id of the parent

Child Links
- Each node contains the ids of its children
- Can support graphs (multiple parents / children)
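With parent links alone, collecting all ancestors of a node means following one link at a time; against a database, each step would be a separate lookup. The sketch below uses a hypothetical in-memory Map of nodes, with made-up ids, to show the walk.

```javascript
// Hypothetical nodes that store only a parent link (null marks the root).
const nodes = new Map([
  ["a", { _id: "a", parent: null }],
  ["b", { _id: "b", parent: "a" }],
  ["c", { _id: "c", parent: "b" }],
]);

// Follow parent links upward; each nodes.get() stands in for one query.
function ancestorsOf(id) {
  const result = [];
  let current = nodes.get(id);
  while (current && current.parent !== null) {
    result.unshift(current.parent); // prepend so the root ends up first
    current = nodes.get(current.parent);
  }
  return result;
}

console.log(ancestorsOf("c"));
```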
Array of Ancestors
- Store all Ancestors of a node


  { _id: "a" }
  { _id: "b", ancestors: [ "a" ], parent: "a" }
  { _id: "c", ancestors: [ "a", "b" ], parent: "b" }
  { _id: "d", ancestors: [ "a", "b" ], parent: "b" }
  { _id: "e", ancestors: [ "a" ], parent: "a" }
  { _id: "f", ancestors: [ "a", "e" ], parent: "e" }

// find all descendants of b:
> db.tree2.find( { ancestors: 'b' } )

// find all direct descendants of b:
> db.tree2.find( { parent: 'b' } )

// find all ancestors of f:
> ancestors = db.tree2.findOne( { _id: 'f' } ).ancestors
> db.tree2.find( { _id: { $in : ancestors } } )
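The ancestors array has to be filled in when a node is added. One way to picture that, assuming the parent is already stored, is to copy the parent's array and append the parent's id. The addChild helper and the in-memory Map below are hypothetical and only illustrate the bookkeeping.

```javascript
// In-memory stand-in for the tree2 collection; the root has no ancestors field.
const tree = new Map([
  ["a", { _id: "a" }],
  ["b", { _id: "b", ancestors: ["a"], parent: "a" }],
]);

// Hypothetical helper: a child's ancestors are the parent's ancestors plus the parent.
function addChild(id, parentId) {
  const parent = tree.get(parentId);
  if (!parent) throw new Error("unknown parent: " + parentId);
  const node = {
    _id: id,
    ancestors: [...(parent.ancestors || []), parentId],
    parent: parentId,
  };
  tree.set(id, node);
  return node;
}

addChild("c", "b");

// Stand-in for find({ ancestors: 'b' }): every node with "b" in its array.
const descendantsOfB = [...tree.values()].filter((n) => (n.ancestors || []).includes("b"));
console.log(descendantsOfB.map((n) => n._id));
```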
Trees as Paths
Store hierarchy as a path expression
- Separate each node by a delimiter, e.g. "/"
- Use text search to find parts of a tree

{ comments: [
    { author: "Bernie", text: "initial post",
      path: "/" },
    { author: "Jim", text: "jim’s comment",
      path: "/jim" },
    { author: "Bernie", text: "Bernie’s reply to Jim",
      path: "/jim/bernie" }
  ]
}

// Find the conversations Jim was a part of
> db.posts.find( { "comments.path": /jim/i } )
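An unanchored pattern such as /jim/i matches "jim" anywhere in the path. To select only the thread rooted at /jim, the pattern can be anchored, as this plain JavaScript sketch shows. The fourth comment is a made-up row added to expose the difference; in MongoDB, an anchored, case-sensitive prefix pattern is also the form that can take advantage of an index on the field.

```javascript
const comments = [
  { author: "Bernie", text: "initial post", path: "/" },
  { author: "Jim", text: "jim's comment", path: "/jim" },
  { author: "Bernie", text: "Bernie's reply to Jim", path: "/jim/bernie" },
  { author: "Ann", text: "unrelated thread", path: "/ann/jimmy" }, // made-up extra row
];

// Unanchored, case-insensitive match: "jim" anywhere in the path.
const anywhere = comments.filter((c) => /jim/i.test(c.path));

// Anchored at the start and closed by "/" or end of string:
// only the thread rooted at /jim.
const subtree = comments.filter((c) => /^\/jim(\/|$)/.test(c.path));

console.log(anywhere.length, subtree.length);
```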
Queue
• Need to maintain order and state
• Ensure that updates to the queue are atomic



  { inprogress: false,
    priority: 1,
    ...
  }

// find highest priority job and mark as in-progress
job = db.jobs.findAndModify( {
          query:  { inprogress: false },
          sort:   { priority: -1 },
          update: { $set: { inprogress: true,
                            started: new Date() } },
          new: true } )
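What findAndModify does here can be pictured as a single select-sort-update step. The sketch below is plain JavaScript over an in-memory array; claimNextJob and the sample jobs are assumptions for illustration. It mirrors the query, sort, update, and new options but does not reproduce the server's atomicity guarantee across concurrent clients.

```javascript
// In-memory stand-in for the jobs collection (sample jobs are made up).
const jobs = [
  { _id: 1, inprogress: false, priority: 1 },
  { _id: 2, inprogress: false, priority: 5 },
  { _id: 3, inprogress: true, priority: 9 },
];

// Hypothetical claimNextJob(): select, sort, and update in one synchronous step.
function claimNextJob(queue) {
  const candidates = queue
    .filter((job) => job.inprogress === false)   // query
    .sort((x, y) => y.priority - x.priority);    // sort: priority descending
  if (candidates.length === 0) return null;
  const job = candidates[0];
  job.inprogress = true;                         // update: $set inprogress
  job.started = new Date();                      // update: $set started
  return job;                                    // new: true returns the modified document
}

const first = claimNextJob(jobs);
const second = claimNextJob(jobs);
const third = claimNextJob(jobs);
console.log(first._id, second._id, third);
```

Because a claimed job no longer matches the query, two workers calling the real command should not receive the same job.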


Summary

Schema design is different in MongoDB

Basic data design principles stay the same

Focus on how the app manipulates data

Rapidly evolve schema to meet your requirements

Enjoy your new freedom, use it wisely :-)
download at mongodb.org

We’re Hiring!
bernie@10gen.com

conferences, appearances, and meetups
http://www.10gen.com/events

Facebook              |  Twitter   |  LinkedIn
http://bit.ly/mongo>  |  @mongodb  |  http://linkd.in/joinmongo

Schema Design (Mongo Austin)

Editor's Notes

  • #27 In 1.6.x, M/R results are stored in a temporary collection until the connection is closed; an output collection could be defined. In 1.8.x, M/R results are stored in a permanent collection unless you specify otherwise.
  • #29 key: the fields to group by. initial: the initial value of the aggregation counter. reduce: the reduce function that aggregates the objects we iterate over.
  • #34 find() will only return documents where the field exists (and, in this case, its value is greater than 0).
  • #35 Indexes can still be created for fields that don't appear in all documents. New in 1.8.x: sparse indexes. A sparse index only includes the documents where the field exists. In a normal index the non-existent fields are treated as null values.
  • #38 The greater the height of the tree, the harder it becomes to query.
  • #39 Normalized: two collections instead of one. More flexible, but requires more queries to retrieve the same data.
  • #40 Strong life-cycle association: use an embedded array. Otherwise you have options: embedded array/tree, or normalize the data.
  • #43 Two collections. One option: arrays of keys (pointers) in each document that point to documents in another collection.
  • #44 Only one query to find the categories for a product given the product id. Only one query to find the products in a category given the category id.
  • #45 Alternative: only store an array of keys in the documents of one collection. Advantage: less storage space required in the categories collection.
  • #46 Finding all the products in a given category is still one query.
  • #47 Disadvantage: finding all the categories for a given product is two queries.
  • #48 4MB limit in 1.6.x, 16MB in 1.8.x.
  • #55 findAndModify returns one result object; the update is atomic. query: the query filter. sort: if multiple documents match, return the first one in the sorted results. update: a modifier object that specifies the modifications to make. new: return the modified object; otherwise return the old object.