- Scaling with MongoDB starts with using documents well, optimizing indexes, and understanding the size of your working set.
- Replica sets can scale reads by adding secondary nodes for load balancing, while sharding scales writes and RAM usage by splitting data across multiple shards.
- Proper disk configuration and replication are important to maximize performance when scaling with MongoDB.
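The split between the two scaling axes above can be illustrated with a toy router. This is only a sketch: `ToyCluster`, its node names, and the string hash are hypothetical and are not MongoDB driver code. It assumes writes go to the primary of whichever shard owns the key, and reads rotate across that shard's secondaries.

```javascript
// Toy router for the two scaling axes: sharding for writes, replicas for reads.
// All names here are hypothetical; this is not how a MongoDB driver is used.
class ToyCluster {
  constructor(shardCount, secondariesPerShard) {
    this.shards = Array.from({ length: shardCount }, (_, i) => ({
      primary: `shard${i}-primary`,
      secondaries: Array.from(
        { length: secondariesPerShard },
        (_, j) => `shard${i}-secondary${j}`
      ),
      nextRead: 0, // round-robin cursor for reads
    }));
  }

  // Assumption: a simple stable string hash picks the shard for a key.
  shardFor(key) {
    let h = 0;
    for (const ch of String(key)) {
      h = (h * 31 + ch.charCodeAt(0)) >>> 0;
    }
    return this.shards[h % this.shards.length];
  }

  // Writes for a key always land on one shard's primary,
  // so adding shards spreads write load and data.
  routeWrite(key) {
    return this.shardFor(key).primary;
  }

  // Reads rotate across the owning shard's secondaries,
  // so adding secondaries spreads read load.
  routeRead(key) {
    const shard = this.shardFor(key);
    const node = shard.secondaries[shard.nextRead % shard.secondaries.length];
    shard.nextRead += 1;
    return node;
  }
}

const cluster = new ToyCluster(3, 2);
console.log(cluster.routeWrite("user:42"));
console.log(cluster.routeRead("user:42"));
console.log(cluster.routeRead("user:42"));
```

One trade-off the sketch leaves out: a secondary can lag behind its primary, so reads routed this way may return slightly stale data.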
3. How do we do it today?
We use a relational database but …
- We don’t use joins
- We don’t use transactions
- We add read-only slaves
- We added a caching layer
- We de-normalized our data
- We implemented custom sharding
- We buy bigger servers
9. Scaling your app
- Use documents
- Indexes make me happy
- Knowing your working set
- Disks are the bottleneck
- Replication makes reading fun
- Sharding for profit
22. Page Fault
1. App requests document
2. Document not in memory
3. Evict a page from memory
4. Read block from disk
5. Return document from memory
(Diagram: App, RAM, and Disk, connected by arrows numbered 1 to 5.)
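The five steps on this slide can be walked through with a small simulation. This is a sketch under assumptions that are not on the slide: `PageCache` is a hypothetical class, eviction is least-recently-used, and the "disk" is a plain `Map`.

```javascript
// Toy model of the five page-fault steps from the slide.
// Assumptions (not from the slides): least-recently-used eviction,
// and a Map standing in for the disk.
class PageCache {
  constructor(capacity, disk) {
    this.capacity = capacity; // number of pages that fit in RAM
    this.disk = disk;         // Map of pageId -> document
    this.ram = new Map();     // Map insertion order doubles as LRU order
    this.faults = 0;
  }

  // Step 1: the app requests a document.
  get(pageId) {
    if (this.ram.has(pageId)) {
      // Hit: refresh recency by moving the entry to the end of the Map.
      const doc = this.ram.get(pageId);
      this.ram.delete(pageId);
      this.ram.set(pageId, doc);
      return doc;
    }
    // Step 2: the document is not in memory, so this is a page fault.
    this.faults += 1;
    // Step 3: evict a page if RAM is full (the oldest entry comes first).
    if (this.ram.size >= this.capacity) {
      const oldest = this.ram.keys().next().value;
      this.ram.delete(oldest);
    }
    // Step 4: read the block from disk into RAM.
    this.ram.set(pageId, this.disk.get(pageId));
    // Step 5: return the document from memory.
    return this.ram.get(pageId);
  }
}

const disk = new Map([
  ["a", { x: 1 }],
  ["b", { x: 2 }],
  ["c", { x: 3 }],
]);
const cache = new PageCache(2, disk);
["a", "b", "a", "c", "b"].forEach((id) => cache.get(id));
console.log(`page faults: ${cache.faults}`);
```

With room for only two pages and three pages in rotation, most requests end up going to disk, which is why the size of the working set relative to RAM matters.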
23. Figuring out working set
> db.foo.stats()
{
    "ns" : "test.foo",
    "count" : 1338330,
    "size" : 46915928,
    "avgObjSize" : 35.05557523181876,
    "storageSize" : 86092032,
    "numExtents" : 12,
    "nindexes" : 2,
    "lastExtentSize" : 20872960,
    "paddingFactor" : 1,
    "flags" : 0,
    "totalIndexSize" : 99860480,
    "indexSizes" : {
        "_id_" : 55877632,
        "x_1" : 43982848
    },
    "ok" : 1
}
- "size": size of data
- "avgObjSize": average document size
- "storageSize": size on disk (and in memory!)
- "totalIndexSize": size of all indexes
- "indexSizes": size of each index
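One way to use these numbers is to compare them against available RAM. The sketch below copies the relevant fields from the slide's output into a plain object and treats `storageSize + totalIndexSize` as an upper bound on the working set, following the slide's note that storage size is also the in-memory size. The helper names and the RAM figures are hypothetical, and the real working set is only the frequently accessed part of the data, so it is often smaller than this bound.

```javascript
// Fields copied from the stats() output on the slide.
// In the mongo shell you would pass db.foo.stats() instead.
const stats = {
  size: 46915928,
  storageSize: 86092032,
  totalIndexSize: 99860480,
  indexSizes: { _id_: 55877632, x_1: 43982848 },
};

// Hypothetical helper: upper bound on the working set, in bytes.
// Assumes every stored byte and every index byte may be touched.
function workingSetUpperBound(s) {
  return s.storageSize + s.totalIndexSize;
}

// Hypothetical helper: does that upper bound fit in the given RAM?
function fitsInRam(s, ramBytes) {
  return workingSetUpperBound(s) <= ramBytes;
}

const MiB = 1024 * 1024;
const bound = workingSetUpperBound(stats);
console.log(`upper bound: ${(bound / MiB).toFixed(1)} MiB`);
console.log(`fits in 128 MiB: ${fitsInRam(stats, 128 * MiB)}`);
console.log(`fits in 256 MiB: ${fitsInRam(stats, 256 * MiB)}`);
```

If the bound does not fit, the per-index sizes in `indexSizes` show which index costs the most memory.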
36. Summary
- Use documents to your advantage!
- Optimize your indexes
- Understand your working set
- Use a sane disk configuration
- Use replicas to scale reads
- Use sharding to scale writes & working RAM
Editor's Notes
Let’s talk about infrastructure costs. You probably started building your application on top of an RDBMS. This is the way we have built enterprise and web applications for years. But the problem is that your RDBMS doesn’t have a smooth cost curve when you scale it up.

When you start off, you may be running on a smaller server, totally adequate for your load. When you exceed the capacity of that small server, you need to buy a bigger server. You can’t add a second small server. This process repeats. You exceed the capacity of your new server, and upgrade your hardware.

There are two long term problems with this:
- As you scale up, you end up paying more and more for each transaction that your system processes. A small server may cost you $1,000 per CPU, but when you need 128 processors, you might be paying as much as $100,000 per CPU. Each incremental step up in hardware gets more and more expensive, not cheaper and cheaper.
- You reach an end of this scaling approach. Once you have scaled up to the biggest hardware platform available on the market, there is nowhere to go; no bigger box to buy. At this point you need to change strategies, even if you can afford those ultra-high-end boxes.
And while we’ve been spending more and more money on hardware, our developer productivity has gone down too. You will hear this story over and over again from CIOs and architects: “Well, we use <insert RDBMS> but we don’t use joins or transactions and we’ve de-normalized our schema.” As our hardware gets more and more expensive, we ask our developers to squeeze more and more performance out of the same box. To achieve this, they go through “herculean efforts” to strip their code of the advanced features that once made them productive. De-normalizing data, eliminating joins and transactions, adding caching and sharding layers… These are risky projects that slow down feature velocity.