MongoDB: Schema Design at Scale


     Rick Copeland
       @rick446
    http://arborian.com
Who am I?

• Now a consultant/trainer, but formerly...
 • Software engineer at SourceForge
 • Author of Essential SQLAlchemy
 • Author of MongoDB with Python and Ming
 • Primarily code Python
The Inspiration

• MongoDB Monitoring Service (MMS)
• Free to all MongoDB users
• Minute-by-minute stats on all your servers
• Hardware cost is important, so use it efficiently (remember, it’s a free service!)
Our Experiment

• Similar to MMS but not identical
• Collection of 100 metrics, each with per-minute values
• “Simulation time” is 300x real time
• Run on 2 AWS small instances
 • one MongoDB server (2.0.2)
 • one “load generator”
Load Generator

• Increment each metric as many times as possible during the course of a simulated minute
• Record the number of updates per second
• Occasionally call getLastError to prevent disconnects
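The 300x time compression above can be sketched as a small pure function (a hypothetical helper for illustration, not code from the talk):

```python
import datetime

# Illustrative sketch: map wall-clock elapsed seconds to a "simulated"
# timestamp running 300x faster than real time, as in the experiment.
SPEEDUP = 300

def simulated_time(start, elapsed_real_seconds):
    """Return the simulated datetime after `elapsed_real_seconds` of wall time."""
    return start + datetime.timedelta(seconds=elapsed_real_seconds * SPEEDUP)

start = datetime.datetime(2010, 10, 10)
# After 0.2 real seconds, one full simulated minute has passed.
print(simulated_time(start, 0.2))  # 2010-10-10 00:01:00
```

At this rate a simulated day takes 288 real seconds, so a full day of per-minute traffic replays in under five minutes.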
Schema v1

{
    _id: "20101010/metric-1",
    metadata: {
        date: ISODate("2000-10-10T00:00:00Z"),
        metric: "metric-1" },
    daily: 5468426,
    hourly: {
        "00": 227850,
        "01": 210231,
        ...
        "23": 20457 },
    minute: {
        "0000": 3612,
        "0001": 3241,
        ...
        "1439": 2819 }
}

• One document per metric (per server) per day
• Per-hour and per-minute statistics stored as subdocuments
Update v1

increment = { daily: 1 }
increment['hourly.' + hour] = 1
increment['minute.' + minute] = 1

db.stats.update(
  { _id: id, metadata: metadata },
  { $inc: increment },
  true) // upsert

• Use $inc to update fields in place
• Use upsert to create the document if it’s missing
• Easy, correct, seems like a good idea....
Performance of v1

[Chart: updates/sec over time, annotated “Experiment startup” and “OUCH!” at the sharp drop-off]
Problems with v1

• The document movement problem
• The midnight problem
• The end-of-the-day problem
• The historical query problem
Document movement problem

• MongoDB in-place updates are fast
 • ... except when they’re not in place
• MongoDB adaptively pads documents
 • ... but it’s better to know your doc size ahead of time
Midnight problem

• Upserts are convenient, but what’s our key?
 • date/metric
• At midnight, you get a huge spike in inserts
Fixing the document movement problem

db.stats.update(
  { _id: id, metadata: metadata },
  { $inc: {
    daily: 0,
    "hourly.00": 0,
    "hourly.01": 0,
    ...
    "minute.0000": 0,
    "minute.0001": 0,
    ... } },
  true) // upsert

• Preallocate documents with zeros, at their final size
• From a crontab? NO! (that makes the midnight problem even worse)
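A Python sketch of building that zeroed $inc document (a hypothetical helper; field names follow the v1 schema above):

```python
# Illustrative sketch: build the zeroed $inc document that preallocates a
# full day's stats document at its final size, so later $inc updates never
# grow (and therefore never move) the document.
def preallocation_inc():
    inc = {"daily": 0}
    for hour in range(24):          # "hourly.00" .. "hourly.23"
        inc["hourly.%02d" % hour] = 0
    for minute in range(1440):      # "minute.0000" .. "minute.1439"
        inc["minute.%04d" % minute] = 0
    return inc

inc = preallocation_inc()
print(len(inc))  # 1465 fields: 1 daily + 24 hourly + 1440 minute
```

Passing this to an upsert with `$inc` creates the document already holding every field it will ever have.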
Fixing the midnight problem

• Could schedule preallocation for different metrics, staggered through the day
• Observation: preallocation isn’t required for correct operation
• Let’s just preallocate tomorrow’s docs randomly (with low probability) as new stats are inserted
Performance with Preallocation

[Chart: updates/sec over time, annotated “Experiment startup”]

• Well, it’s better
• Still have decreasing performance through the day... WTF?
Problems with v1

• The document movement problem
• The midnight problem
• The end-of-the-day problem
• The historical query problem
End-of-day problem

[Diagram: a document laid out as a linear list of key/value pairs — “0000” Value, “0001” Value, ..., “1439” Value]

• BSON stores documents as an association list
• MongoDB must check each key for a match
• Load increases significantly at the end of the day (MongoDB must scan 1439 keys to find the right minute!)
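The scan cost can be made concrete with a toy calculation (hypothetical helpers, matching the numbers on the slides):

```python
# Illustrative sketch of why the flat minute map hurts: count how many keys
# must be scanned past before reaching a given minute of the day.
def flat_keys_scanned(minute_of_day):
    # Flat layout: keys "0000".."1439" are checked one by one in order.
    return minute_of_day

def hierarchical_keys_scanned(minute_of_day):
    # Hierarchical layout: scan to the hour bucket, then scan within it.
    hour, minute_in_hour = divmod(minute_of_day, 60)
    return hour + minute_in_hour

print(flat_keys_scanned(1439), hierarchical_keys_scanned(1439))  # 1439 82
```

Minute 1439 sits in hour bucket 23 at offset 59, so the hierarchical layout scans past only 23 + 59 = 82 keys in the worst case instead of 1439.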
Fixing the end-of-day problem

{
    _id: "20101010/metric-1",
    metadata: {
        date: ISODate("2000-10-10T00:00:00Z"),
        metric: "metric-1" },
    daily: 5468426,
    hourly: {
        "0": 227850,
        "1": 210231,
        ...
        "23": 20457 },
    minute: {
        "00": {
            "0000": 3612,
            "0001": 3241,
            ... },
        ...,
        "23": { ..., "1439": 2819 } }
}

• Split up our ‘minute’ property by hour
• Better worst-case keys scanned:
 • Old: 1439
 • New: 82
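Under this layout, building the $inc keys for one update might look like the following sketch (field names follow the document above; the helper itself is hypothetical):

```python
# Illustrative sketch: build the $inc keys for the hierarchical schema,
# where each minute counter lives under "minute.<hour>.<minute-of-day>".
def hierarchical_inc(minute_of_day):
    hour = minute_of_day // 60
    return {
        "daily": 1,
        "hourly.%d" % hour: 1,
        "minute.%02d.%04d" % (hour, minute_of_day): 1,
    }

print(hierarchical_inc(1439))
# {'daily': 1, 'hourly.23': 1, 'minute.23.1439': 1}
```

The document shape changes, but the update pattern stays a single in-place $inc, so none of the earlier wins are given up.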
“Hierarchical minutes” Performance

[Chart]

Performance Comparison

[Chart]

Performance Comparison (2.2)

[Chart]
Historical Query Problem

• Intra-day queries are great
• What about “performance year to date”?
 • Now you’re hitting a lot of “cold” documents and causing page faults
Fixing the historical query problem

{
    _id: "201010/metric-1",
    metadata: {
        date: ISODate("2000-10-01T00:00:00Z"),
        metric: "metric-1" },
    daily: {
        "0": 5468426,
        "1": ...,
        ...
        "31": ... }
}

• Store multiple levels of granularity in different collections
• 2 updates rather than 1, but historical queries are much faster
• Preallocate along with daily docs (only infrequently upserted)
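The per-month rollup update can be sketched like so (the collection key format and helper name are assumptions based on the document above):

```python
# Illustrative sketch: each stat now triggers a second $inc against the
# per-month rollup document, keyed "YYYYMM/metric", which holds one daily
# counter per day of the month.
def monthly_update(metric, month_id, day_of_month):
    """Return the (query, update) pair for the monthly rollup upsert."""
    spec = {"_id": "%s/%s" % (month_id, metric)}
    inc = {"$inc": {"daily.%d" % day_of_month: 1}}
    return spec, inc

spec, inc = monthly_update("metric-1", "201010", 10)
print(spec, inc)
# {'_id': '201010/metric-1'} {'$inc': {'daily.10': 1}}
```

A year-to-date chart then touches about twelve warm monthly documents instead of hundreds of cold daily ones.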
Queries

db.stats.daily.find(
  { "metadata.date": { $gte: dt1, $lte: dt2 },
    "metadata.metric": "metric-1" },
  { "metadata.date": 1, "hourly": 1 }
).sort({ "metadata.date": 1 })

db.stats.daily.ensureIndex({
  'metadata.metric': 1,
  'metadata.date': 1 })

• Updates are by _id, so no index is needed there
• Chart queries are by metadata
• Your range/sort field should be last in the compound index
Conclusion

• Monitor your performance. Watch out for spikes.
• Preallocate to prevent document copying.
• Pay attention to the number of keys in your documents (hierarchy can help).
• Make sure your index is optimized for your sorts.
Questions?
          MongoDB Monitoring Service
http://www.10gen.com/mongodb-monitoring-service




                Rick Copeland
                   @rick446
              http://arborian.com
         MongoDB Consulting & Training
