MongoDB: Schema Design at Scale


     Rick Copeland
       @rick446
    http://arborian.com
Who am I?

• Now a consultant/trainer, but formerly...
 • Software engineer at SourceForge
 • Author of Essential SQLAlchemy
 • Author of MongoDB with Python and Ming
 • Primarily code Python
The Inspiration

• MongoDB Monitoring Service (MMS)
• Free to all MongoDB users
• Minute-by-minute stats on all your servers
• Hardware cost is important, so use it efficiently (remember, it’s a free service!)
Our Experiment

• Similar to MMS but not identical
• Collection of 100 metrics, each with per-minute values
• “Simulation time” is 300x real time
• Run on 2 AWS small instances
 • one MongoDB server (2.0.2)
 • one “load generator”
Load Generator

• Increment each metric as many times as possible during the course of a simulated minute
• Record the number of updates per second
• Occasionally call getLastError to prevent disconnects
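The 300x time compression above can be sketched as a small pure function (a hypothetical helper for illustration, not code from the talk):

```python
import datetime

# Illustrative sketch: map wall-clock elapsed seconds to a "simulated"
# timestamp running 300x faster than real time, as in the experiment.
SPEEDUP = 300

def simulated_time(start, elapsed_real_seconds):
    """Return the simulated datetime after `elapsed_real_seconds` of wall time."""
    return start + datetime.timedelta(seconds=elapsed_real_seconds * SPEEDUP)

start = datetime.datetime(2010, 10, 10)
# After 0.2 real seconds, one full simulated minute has passed.
print(simulated_time(start, 0.2))  # 2010-10-10 00:01:00
```

At this rate a simulated day takes 288 real seconds, so a full day of per-minute traffic replays in under five minutes.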
Schema v1

{
    _id: "20101010/metric-1",
    metadata: {
        date: ISODate("2000-10-10T00:00:00Z"),
        metric: "metric-1" },
    daily: 5468426,
    hourly: {
        "00": 227850,
        "01": 210231,
        ...
        "23": 20457 },
    minute: {
        "0000": 3612,
        "0001": 3241,
        ...
        "1439": 2819 }
}

• One document per metric (per server) per day
• Per-hour and per-minute statistics stored as subdocuments
Update v1

increment = { daily: 1 }
increment['hourly.' + hour] = 1
increment['minute.' + minute] = 1

db.stats.update(
  { _id: id, metadata: metadata },
  { $inc: increment },
  true) // upsert

• Use $inc to update fields in place
• Use upsert to create the document if it’s missing
• Easy, correct, seems like a good idea....
Performance of v1

[Chart: updates/sec over time, annotated “Experiment startup” and “OUCH!” at the sharp drop-off]
Problems with v1

• The document movement problem
• The midnight problem
• The end-of-the-day problem
• The historical query problem
Document movement problem

• MongoDB in-place updates are fast
 • ... except when they’re not in place
• MongoDB adaptively pads documents
 • ... but it’s better to know your doc size ahead of time
Midnight problem

• Upserts are convenient, but what’s our key?
 • date/metric
• At midnight, you get a huge spike in inserts
Fixing the document movement problem

db.stats.update(
  { _id: id, metadata: metadata },
  { $inc: {
    daily: 0,
    "hourly.00": 0,
    "hourly.01": 0,
    ...
    "minute.0000": 0,
    "minute.0001": 0,
    ... } },
  true) // upsert

• Preallocate documents with zeros, at their final size
• From a crontab? NO! (that makes the midnight problem even worse)
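A Python sketch of building that zeroed $inc document (a hypothetical helper; field names follow the v1 schema above):

```python
# Illustrative sketch: build the zeroed $inc document that preallocates a
# full day's stats document at its final size, so later $inc updates never
# grow (and therefore never move) the document.
def preallocation_inc():
    inc = {"daily": 0}
    for hour in range(24):          # "hourly.00" .. "hourly.23"
        inc["hourly.%02d" % hour] = 0
    for minute in range(1440):      # "minute.0000" .. "minute.1439"
        inc["minute.%04d" % minute] = 0
    return inc

inc = preallocation_inc()
print(len(inc))  # 1465 fields: 1 daily + 24 hourly + 1440 minute
```

Passing this to an upsert with `$inc` creates the document already holding every field it will ever have.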
Fixing the midnight problem

• Could schedule preallocation for different metrics, staggered through the day
• Observation: preallocation isn’t required for correct operation
• Let’s just preallocate tomorrow’s docs randomly (with low probability) as new stats are inserted
Performance with Preallocation

[Chart: updates/sec over time, annotated “Experiment startup”]

• Well, it’s better
• Still have decreasing performance through the day... WTF?
Problems with v1

• The document movement problem
• The midnight problem
• The end-of-the-day problem
• The historical query problem
End-of-day problem

[Diagram: a document laid out as a linear list of key/value pairs — “0000” Value, “0001” Value, ..., “1439” Value]

• BSON stores documents as an association list
• MongoDB must check each key for a match
• Load increases significantly at the end of the day (MongoDB must scan 1439 keys to find the right minute!)
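The scan cost can be made concrete with a toy calculation (hypothetical helpers, matching the numbers on the slides):

```python
# Illustrative sketch of why the flat minute map hurts: count how many keys
# must be scanned past before reaching a given minute of the day.
def flat_keys_scanned(minute_of_day):
    # Flat layout: keys "0000".."1439" are checked one by one in order.
    return minute_of_day

def hierarchical_keys_scanned(minute_of_day):
    # Hierarchical layout: scan to the hour bucket, then scan within it.
    hour, minute_in_hour = divmod(minute_of_day, 60)
    return hour + minute_in_hour

print(flat_keys_scanned(1439), hierarchical_keys_scanned(1439))  # 1439 82
```

Minute 1439 sits in hour bucket 23 at offset 59, so the hierarchical layout scans past only 23 + 59 = 82 keys in the worst case instead of 1439.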
Fixing the end-of-day problem

{
    _id: "20101010/metric-1",
    metadata: {
        date: ISODate("2000-10-10T00:00:00Z"),
        metric: "metric-1" },
    daily: 5468426,
    hourly: {
        "0": 227850,
        "1": 210231,
        ...
        "23": 20457 },
    minute: {
        "00": {
            "0000": 3612,
            "0001": 3241,
            ... },
        ...,
        "23": { ..., "1439": 2819 } }
}

• Split up our ‘minute’ property by hour
• Better worst-case keys scanned:
 • Old: 1439
 • New: 82
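Under this layout, building the $inc keys for one update might look like the following sketch (field names follow the document above; the helper itself is hypothetical):

```python
# Illustrative sketch: build the $inc keys for the hierarchical schema,
# where each minute counter lives under "minute.<hour>.<minute-of-day>".
def hierarchical_inc(minute_of_day):
    hour = minute_of_day // 60
    return {
        "daily": 1,
        "hourly.%d" % hour: 1,
        "minute.%02d.%04d" % (hour, minute_of_day): 1,
    }

print(hierarchical_inc(1439))
# {'daily': 1, 'hourly.23': 1, 'minute.23.1439': 1}
```

The document shape changes, but the update pattern stays a single in-place $inc, so none of the earlier wins are given up.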
“Hierarchical minutes” Performance

[Chart]

Performance Comparison

[Chart]

Performance Comparison (2.2)

[Chart]
Historical Query Problem

• Intra-day queries are great
• What about “performance year to date”?
 • Now you’re hitting a lot of “cold” documents and causing page faults
Fixing the historical query problem

{
    _id: "201010/metric-1",
    metadata: {
        date: ISODate("2000-10-01T00:00:00Z"),
        metric: "metric-1" },
    daily: {
        "0": 5468426,
        "1": ...,
        ...
        "31": ... }
}

• Store multiple levels of granularity in different collections
• 2 updates rather than 1, but historical queries are much faster
• Preallocate along with daily docs (only infrequently upserted)
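The per-month rollup update can be sketched like so (the collection key format and helper name are assumptions based on the document above):

```python
# Illustrative sketch: each stat now triggers a second $inc against the
# per-month rollup document, keyed "YYYYMM/metric", which holds one daily
# counter per day of the month.
def monthly_update(metric, month_id, day_of_month):
    """Return the (query, update) pair for the monthly rollup upsert."""
    spec = {"_id": "%s/%s" % (month_id, metric)}
    inc = {"$inc": {"daily.%d" % day_of_month: 1}}
    return spec, inc

spec, inc = monthly_update("metric-1", "201010", 10)
print(spec, inc)
# {'_id': '201010/metric-1'} {'$inc': {'daily.10': 1}}
```

A year-to-date chart then touches about twelve warm monthly documents instead of hundreds of cold daily ones.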
Queries

db.stats.daily.find(
  { "metadata.date": { $gte: dt1, $lte: dt2 },
    "metadata.metric": "metric-1" },
  { "metadata.date": 1, "hourly": 1 }
).sort({ "metadata.date": 1 })

db.stats.daily.ensureIndex({
  'metadata.metric': 1,
  'metadata.date': 1 })

• Updates are by _id, so no index is needed there
• Chart queries are by metadata
• Your range/sort field should be last in the compound index
Conclusion

• Monitor your performance. Watch out for spikes.
• Preallocate to prevent document copying.
• Pay attention to the number of keys in your documents (hierarchy can help).
• Make sure your index is optimized for your sorts.
Questions?
          MongoDB Monitoring Service
http://www.10gen.com/mongodb-monitoring-service




                Rick Copeland
                   @rick446
              http://arborian.com
         MongoDB Consulting & Training
