Scaling with MongoDB
Jared Rosoff (jsr@10gen.com) - @forjared
How do we do it today?
We use a relational database, but …
We don’t use joins
We don’t use transactions
We add read-only slaves
We add a caching layer
We de-normalize our data
We implement custom sharding
We buy bigger servers
How’s that working out for you?
Costs go up
Productivity goes down
By engineers, for engineers
The landscape
[chart: systems plotted on two axes, scalability & performance vs. depth of functionality; Memcached and key/value stores at the scalability end, RDBMSs at the depth-of-functionality end]
Scaling your app
Use documents
Indexes make me happy
Knowing your working set
Disks are the bottleneck
Replication makes reading fun
Sharding for profit
Scaling your data model
Documents
{
  author : "roger",
  date : "Sat Jul 24 2010 19:47:11 GMT-0700 (PDT)",
  text : "Spirited Away",
  tags : [ "Tezuka", "Manga" ],
  comments : [
    {
      author : "Fred",
      date : "Sat Jul 24 2010 20:51:03 GMT-0700 (PDT)",
      text : "Best Movie Ever"
    }
  ]
}
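A minimal mongo shell sketch of storing a post like the one above (the collection name posts is an assumption; newer shells prefer insertOne over insert):

// Save the post with its tags and comments embedded in one document
db.posts.insert({
  author : "roger",
  date : "Sat Jul 24 2010 19:47:11 GMT-0700 (PDT)",
  text : "Spirited Away",
  tags : [ "Tezuka", "Manga" ],
  comments : [
    { author : "Fred", date : "Sat Jul 24 2010 20:51:03 GMT-0700 (PDT)", text : "Best Movie Ever" }
  ]
})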
Disk Seeks & Data Locality
Read = really, really fast
Seek = 5+ ms
Disk Seeks & Data Locality
[diagram: Post, Comment, and Author records stored separately, scattered across the disk; assembling a page costs a seek per record]
Disk Seeks & Data Locality
[diagram: one Post document with its Author and Comments embedded, stored contiguously; one seek retrieves everything]
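To make the payoff concrete, a hedged sketch against the hypothetical posts collection from above: one query returns the post and every comment from one contiguous region, and dot notation still reaches the embedded fields.

// The whole post, author, tags, and comments come back in a single read
db.posts.findOne({ author : "roger" })

// Dot notation matches against fields inside the embedded comments array
db.posts.find({ "comments.author" : "Fred" })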
Optimized indexes
Table scans
Find where x equals 7
[diagram: linear scan across documents 1 through 7]
Looked at 7 objects
Tree Lookup
Find where x equals 7
[diagram: B-tree traversal, root to leaf]
Looked at 3 objects
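A sketch of building the index behind that lookup (collection and field names are assumptions; ensureIndex is the shell helper of this era, newer shells call it createIndex):

// Build a B-tree index on x
db.foo.ensureIndex({ x : 1 })

// explain() reports how much work the query did; with the index it should
// examine a handful of objects instead of scanning the whole collection
// (field names in the output vary by server version)
db.foo.find({ x : 7 }).explain()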
Random Index
Entire index must fit in RAM
Right Aligned
Only a small portion in RAM
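A right-aligned index falls out of an ever-increasing key. A hedged sketch, assuming a hypothetical events collection with an insertion timestamp: new keys always land at the right edge of the B-tree, and queries for recent data touch only those right-hand pages, so only that slice needs to be in RAM.

// Timestamps only grow, so inserts always hit the rightmost B-tree pages
db.events.ensureIndex({ created_at : 1 })

// Reads of recent data stay on the same hot right-hand pages
db.events.find({ created_at : { $gte : new Date(Date.now() - 3600 * 1000) } })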
Working set size
Working Set
Working set = active documents + used indexes
[diagram: the working set held in RAM, the rest of the data on disk]
Page Fault
1. App requests document
2. Document not in memory
3. Evict a page from memory
4. Read block from disk
5. Return document from memory
[diagram: the five steps flowing between App, RAM, and Disk]
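If you want to watch faulting happen, a hedged sketch: on Linux builds the server reports a cumulative fault counter through serverStatus() (the exact field layout varies by version).

// Cumulative page faults since the server started (Linux)
db.serverStatus().extra_info.page_faults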
Figuring out working set
> db.foo.stats()
{
	"ns" : "test.foo",
	"count" : 1338330,
	"size" : 46915928,                  // size of data
	"avgObjSize" : 35.05557523181876,   // average document size
	"storageSize" : 86092032,           // size on disk (and in memory!)
	"numExtents" : 12,
	"nindexes" : 2,
	"lastExtentSize" : 20872960,
	"paddingFactor" : 1,
	"flags" : 0,
	"totalIndexSize" : 99860480,        // size of all indexes
	"indexSizes" : {                    // size of each index
		"_id_" : 55877632,
		"x_1" : 43982848
	},
	"ok" : 1
}
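A hedged back-of-the-envelope using those fields: a crude upper bound on the working set is everything stored on disk plus every index, and ideally that fits in RAM.

var s = db.foo.stats()
// Upper bound: all mapped data plus all indexes
var upperBound = s.storageSize + s.totalIndexSize
print("working set upper bound: " + (upperBound / 1024 / 1024).toFixed(1) + " MB")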
Disk configurations
Single Disk
~200 seeks / second
RAID0
[diagram: three disks striped together, ~200 seeks / second each]
RAID10
[diagram: three mirrored pairs, ~400 seeks / second each]
Replication
Replica Sets
[diagram: one Primary serving reads and writes, two Secondaries serving reads]
Replica Sets
[diagram: one Primary serving reads and writes, four Secondaries serving reads; adding secondaries adds read capacity]
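A hedged sketch of standing up a small set like the ones pictured (hostnames are placeholders) and then reading from a secondary; rs.slaveOk() is the call from this era, newer shells use setReadPref():

// Run once against the member that should become primary
rs.initiate({
  _id : "rs0",
  members : [
    { _id : 0, host : "db1.example.com:27017" },
    { _id : 1, host : "db2.example.com:27017" },
    { _id : 2, host : "db3.example.com:27017" }
  ]
})

// Allow this connection to read from a secondary
rs.slaveOk()   // newer shells: db.getMongo().setReadPref("secondaryPreferred")
db.foo.find({ x : 7 })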
Sharding
[diagram: two mongos routers in front of four shards; each shard is a replica set with one Primary and two Secondaries; key ranges: Shard 1: 0..10, Shard 2: 10..20, Shard 3: 20..30, Shard 4: 30..40]
400GB Index?
400GB Index?
[diagram: the 400GB index split across four shards (Shard 1: 0..10, Shard 2: 10..20, Shard 3: 20..30, Shard 4: 30..40), each holding a 100GB index]
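A hedged sketch of carving a collection into ranges like those pictured (database, collection, and key names are assumptions; older shells used db.adminCommand with enablesharding / shardcollection instead of the sh helpers):

// Enable sharding on the database, then shard the collection on x
sh.enableSharding("test")
sh.shardCollection("test.foo", { x : 1 })

// mongos splits the key space into chunks (0..10, 10..20, ...) and spreads
// them across the shards, so each shard indexes only its own slice
sh.status()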
Summary
Summary
Use documents to your advantage!
Optimize your indexes
Understand your working set
Use a sane disk configuration
Use replicas to scale reads
Use sharding to scale writes & working RAM

Scaling with MongoDB - SF Mongo User Group 7-19-2011

Editor's Notes

  • #6 Let’s talk about infrastructure costs. You probably started building your application on top of an RDBMS; that is how we have built enterprise and web applications for years. The problem is that an RDBMS doesn’t have a smooth cost curve as you scale up. You start on a small server that is totally adequate for your load. When you exceed its capacity, you have to buy a bigger server; you can’t just add a second small one. The process repeats: you exceed the capacity of the new server and upgrade the hardware again. There are two long-term problems with this. First, you pay more and more for each transaction your system processes: a small server may cost you $1,000 per CPU, but by the time you need 128 processors you might be paying as much as $100,000 per CPU. Each incremental step up in hardware gets more expensive, not cheaper. Second, this scaling approach runs out of road. Once you have scaled up to the biggest hardware platform available on the market, there is no bigger box to buy. At that point you need to change strategies, even if you can afford those ultra-high-end boxes.
  • #7 And while we’ve been spending more and more money on hardware, our developer productivity has gone down too. You will hear this story over and over again from CIOs and architects: “Well, we use <insert RDBMS>, but we don’t use joins or transactions, and we’ve de-normalized our schema.” As our hardware gets more and more expensive, we ask our developers to squeeze more and more performance out of the same box. To achieve this, they go through “herculean efforts” to strip their code of the advanced features that once made them productive: de-normalizing data, eliminating joins and transactions, adding caching and sharding layers. These are risky projects that slow down feature velocity.