Scaling to 30,000 Requests Per Second and Beyond with MongoDB
Mike Chesnut
Director of Operations Engineering
Crittercism
MongoDB World
June 23-25
world.mongodb.com
Code: 25GN for 25% off
What I’ll Talk About
● Crittercism - Overview
● Router (mongos) Architecture
● Sharding Considerations
● The Balancer and Me
● Q&A
How a Startup Gets Started
● Pick something and go with it
● Make mistakes along the way
● Correct the mistakes you can
● Work around the ones you can’t
Critter-What?
A Brief History...
Architecture
[Diagram, built up across several slides: Feedback, App Loads, Crashes, and Handled Exceptions flow through an ingest API into MongoDB; user Metadata goes to DynamoDB; a second API later adds Performance Data and Geo Data.]
Critter-What?
… Which brings us to today.
● feedback widget
● crash reporting
● live stats
● crash grouping
● app performance management
● geo data
● user analytics
● executive dashboard
Architecture
[The same architecture diagram, now annotated with the total ingest rate: 40,000+ req/s.]
Growth
[Chart: request volume over a two-year period, from roughly 700 req/s to 40-45k req/s.]
Router Architecture
[Diagram: a MongoDB cluster of three replica sets, each of three mongod servers; one mongos router runs on every client application server.]
Single mongos per client problems we encountered:
● thousands of connections to config servers
● config server CPU load
● configdb propagation delays
Router Architecture
[Diagram: the same cluster, with the mongos routers moved out of the application servers into a dedicated router tier.]
Router Architecture
Separate mongos tier advantages:
● greatly reduced number of connections to each mongod
● far fewer hosts talking to the config servers
● much faster configdb propagation
Disadvantages:
● additional network hop
● fewer (but more critical) points of failure
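A rough sketch of what the change means for clients (hostnames below are illustrative, not our actual setup):

// Hypothetical: instead of every app server talking to its own localhost
// mongos, point drivers at the shared tier, listing more than one router
// so a dead mongos doesn't take the application down with it:
//   before: mongodb://localhost:27017/<db_name>
//   after:  mongodb://mongos-1a.internal:27017,mongos-1b.internal:27017/<db_name>

// From any mongos in the tier, sanity-check what the routers see:
sh.status()   // shards, chunks per shard, balancer state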
Sharding Considerations
Pick something you want to live with.
[A sequence of slides works through an example: four shards, apps sharded by app_id and distributed evenly; two suddenly-popular apps land on the same shard, that shard runs hot, and adding a fifth shard keeps the cluster balanced for new apps but does nothing for the uneven access pattern.]
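To make the example concrete, sharding a collection by app_id looks roughly like this (database and collection names are illustrative):

// Range-sharding on app_id keeps each app's documents together, which
// makes per-app queries cheap -- and makes a hot app a single-shard problem.
sh.enableSharding("appdata")
sh.shardCollection("appdata.crashes", { "app_id": 1 })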
The Balancer and Me
Why wouldn’t you run the balancer in the first place?
● great question
● for us, it’s because we deleted a ton of data at one point, and left a bunch of holes
  ○ we turned it off while deleting this data
  ○ and then were unable to turn it back on
● but maybe you start without it
● or maybe you need to turn it off for maintenance and forget to turn it back on
Obviously, don’t do this. But if you do, here’s what happens...
Fresh, new, empty cluster… But no balancer running.
[A series of slides shows four replica sets gradually filling with data, shaded from green to red, with the balancer off the whole time.]
Now we’re pretty full, so let’s add another shard...
And keep inserting...
Suddenly we find ourselves with a very unbalanced cluster.
But if we enable the balancer, it will DoS the 5th shard!
The approximate effect looks something like this:
[A sequence of slides sketches the I/O load as the balancer floods the new shard with chunk migrations all at once.]
So what can we do?
1. add IOPS
2. make sure your config servers have plenty of CPU (and IOPS)
3. slowly move chunks manually
4. approach a balanced state
5. hold your breath
6. try re-enabling the balancer
How to manually balance:
1. determine a chunk on a hot shard
2. monitor effects on both the source and target shards
3. move the chunk
4. allow the system to settle
5. repeat

Step 1, determine a chunk on a hot shard:

mongos> db.chunks.find({"shard":"<shard_name>",
        "ns":"<db_name>.<collection>"}).limit(1).pretty()

You’ll get a single chunk (as both min and max); note its shard key and ObjectId. For example, from our “rawcrashlog” collection (hash and _id truncated):

"min" : {
    "unsymbolized_hash" : "1572663b72e87[...]",
    "_id" : ObjectId("50b97db98238[...]")
},

Step 2, monitor effects on both the source and target shards:

iostat -xhm 1
mongostat

Step 3, move the chunk:

mongos> sh.moveChunk("<db_name>.<collection>",
        { "unsymbolized_hash" : "1572663b72e87[...]",
          "_id" : ObjectId("50b97db98238[...]") },
        "<target_shard>")
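Putting steps 1 and 3 together, one iteration of the loop looks roughly like this (shard and namespace names are placeholders):

// Run from a mongos. Pick a chunk currently living on the hot shard...
var configDB = db.getSiblingDB("config")
var chunk = configDB.chunks.findOne({ "shard": "<hot_shard>",
                                      "ns": "<db_name>.<collection>" })

// ...and move it by its min bound; mongos infers the source shard.
sh.moveChunk("<db_name>.<collection>", chunk.min, "<target_shard>")

// Then watch iostat/mongostat on both shards and let the system settle.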
Conclusion here: Run the balancer.
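And for reference, the standard shell helpers for doing exactly that (the trap described earlier is stopping after the third line):

sh.getBalancerState()    // is balancing enabled at all?
sh.isBalancerRunning()   // is a migration round in flight right now?
sh.stopBalancer()        // disable before bulk deletes or maintenance...
sh.startBalancer()       // ...and remember to turn it back on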
Summary
● Design ahead of time
  o “NoSQL” lets you play it by ear
  o but some of these decisions will bite you later
● Be willing to correct past mistakes
  o dedicate time and resources to adapting
  o learn how to live with the mistakes you can’t correct
References
● MongoDB Blog post: http://blog.mongodb.org/post/77278906988/crittercism-scaling-to-billions-of-requests-per-day-on
● MongoDB Documentation on mongos routers: http://docs.mongodb.org/master/core/sharded-cluster-query-routing/
● MongoDB Documentation on the balancer: http://docs.mongodb.org/manual/tutorial/manage-sharded-cluster-balancer/
● MongoDB Documentation on shard keys: http://docs.mongodb.org/manual/core/sharding-shard-key/
● Crittercism: http://www.crittercism.com/
MongoDB World
June 23-25
world.mongodb.com
Code: 25GN for 25% off
Q&A
Thank You!
Speaker Notes
  • I’m going to tell you the story of how we’ve scaled to handle over 30k req/s using a storage strategy based on MongoDB
  • Between proposing this talk and now, we’ve actually grown some more, and now top 40-45k req/s on a daily basis
    At 40k+ req/s × 86,400 seconds per day, that works out to roughly 3.5B requests per day
  • this is a preview of a talk I’ll be giving at MongoDB World, June 23-25 in NYC
    you can still register
  • and of course Crittercism will be there
  • some advice from our experience about things to do and things not to do
  • I’ll be sure to leave time for Q&A
  • I’ll tell you how Crittercism got started, some of the lessons we’ve learned along the way, and some advice we can share based on those experiences
  • September 2010 (from Wayback Machine)
    Started as a “feedback widget”
    Enable mobile app developers to allow their users to provide “criticism” of their apps (outside of the app store)
    Not just a star rating
  • this is pretty easy -
    set up a (mongo) db, put an api in front of it, collect user feedback from our SDK
  • added more types of data we collect
  • volume starts getting large, so let’s count app loads in a memory-based data store (redis), and persist it to mongo
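    a minimal sketch of that pattern (collection and field names are made up, not our actual schema): a periodic job folds the in-memory counters into mongo with an upsert
    db.app_loads.update(
        { "app_id": "abc123", "hour": ISODate("2014-06-23T17:00:00Z") },
        { "$inc": { "loads": 1024 } },
        { "upsert": true }
    )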
  • then we added user metadata as well, but that’s a different kind of data and a different volume and access pattern, so let’s add dynamodb into the mix
  • our volume keeps going up, so let’s cache this app data to make our responses faster
  • then we added APM, which introduced a lot of different data types and structures
    so we added another ingest API and postgres into the mix
    (but obviously we’re not going to talk about that part here…)
  • today (2014) - what it’s evolved into
    collecting tons of detailed analytics data - crash reports, groupings
    Geo data launched in 2013 (just kidding, this is stored in postgres)
    iPad app launched in 2014 - more aggregations of performance data (more ways to view it)
  • lots to deal with...
    so we started as a way for people to “criticize” your apps
    then we helped you catch bugs, so we’re the ones doing the “criticism”
  • so how do we handle 40k/s on mongodb?
  • we don’t, but that’s our ingest rate, and most of it ends up in mongodb
    the takeaway here is to be willing to use whatever works
  • 2-year period
    went from 700 req/s (60M/day)
    to 40-45k req/s (3.8B/day)
  • one of the biggest things we did to help ourselves scale was to consolidate the mongos routers
  • default, first-pass architecture (for a sharded cluster): one mongos per client machine
    each client process connects to a local mongos router
    each mongos routes queries and returns results
  • configdb propagation delays could mean your application is reading stale data, or can’t find the data it needs when it needs it (and maybe it has to retry, which means it’s now slower)
  • move the mongos routers to their own tier
    be smart about how you route to them
    (we use chef to keep it within the same AZ)
  • be aware that this does introduce some disadvantages, too
  • This is a fundamental design decision that will have huge implications for a long time, so think about it carefully.
  • Hard (impossible) to change after the fact!
  • Say you have 4 shards, and each of the NHL teams that made the playoffs this year has an app, and we shard by app_id
    Let’s distribute them evenly, as is likely to be the case (assuming a sufficiently randomly-generated app_id)
  • this looks nice and even, right?
  • So now it’s time for the Western Conference Finals, and the Blackhawks are playing the Kings
  • So those 2 apps are going to get heavy use, but they’re on the same shard, so uh-oh...
  • Now this shard isn’t happy
    Higher load, slower response time for queries to this shard (which are your most common queries due to these apps’ popularity)
  • so let’s add another shard
  • That might help if we have more teams’ apps to add
  • Those new apps had somewhere to go, to keep our cluster balanced
    But this hasn’t helped our uneven access pattern at all
  • Only option now is to vertically scale the problem shard
  • and hopefully that cools it off, but now we have an uneven cluster to manage.
    and what happens next year, when it’s two different teams in the conference finals?
    maybe we get lucky and they’re on different shards… but even then, maybe the access is uneven enough that those 2 shards still get hot.
    so maybe you just live with this and have heterogeneous shard servers. (this is probably a much lesser evil than trying to re-shard.)
    lesson: you’re going to have to live with the shard key you choose, so choose wisely!
    another option might’ve been to spread data for each app_id across all shards (e.g., via a hashed shard key; see the sketch below)--but then your queries will likely be slower (due to having to read from many/all shards).
    it’s a trade-off.
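    for illustration, the hashed alternative looks like this (names illustrative, as above):
    sh.shardCollection("appdata.crashes", { "app_id": "hashed" })
    // spreads any single app across shards, at the cost of scatter-gather reads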
  • The balancer is a super-important part of a sharded mongo cluster… You should love it.
  • Start with an empty cluster, and start filling it with data
    (we’ll denote “fullness” by going from green to red)
    This is an example of what can happen when the balancer is not running
  • Okay, so now we have a very unbalanced cluster. 3 of our replica sets are very full, one is pretty full, and the newest one is hardly in use.
    (remember that the balancer isn’t running in this scenario)
  • The balancer will see the full shards and one near-empty one, and will want to move a ton of chunks all at once, causing severe I/O strain on the system.
    (no way to tell the balancer to chill)
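    one hedged aside from the balancer docs referenced at the end: you can’t throttle how hard it works, but you can restrict when it runs with a balancing window, e.g.
    db.getSiblingDB("config").settings.update(
        { "_id": "balancer" },
        { "$set": { "activeWindow": { "start": "01:00", "stop": "05:00" } } },
        { "upsert": true }
    )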
  • remember that every one of these chunk moves causes updates to your configdb, places load on your config servers, and has to propagate to all mongos routers, too
  • you’re going to be adding a lot of I/O to the system when you move chunks, and it still has to be able to perform its normal functions, so over-provision
    we’re in AWS so we just go for PIOPS… but if you’re on physical hardware, consider RAIDing wider, or upgrading your SAN, or...
  • updating the configdb (when you move chunks) puts load on your config servers, so make sure they’re ready to handle it
  • this is tedious and will take a LONG time (more detail in a minute)
  • gradually you’ll get to a happier place
  • take a deep breath before you...
  • be ready to turn the balancer back off and return to step 3 (slowly moving chunks manually) if needed, then try again
  • here’s an example from our “rawcrashlog” collection (hash and _id truncated)
  • start both commands running on both the source and target
  • don’t need to specify source shard, since your shard key (unsymbolized_hash in our case) and _id are sufficient for mongo to know where it’s coming from
  • watch your monitoring (iostat/mongostat) -- look for spikes in page faults, queued reads/writes, database lock percentages.
    obviously look at your application monitoring too, to ensure no adverse effects.
    use MMS as well (e.g., lock %, page faults)
    if everything looks good, keep going. if not, you need to start over with more IOPS, more config server capacity, etc.
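    alongside iostat and mongostat, the same signals are visible from the shell on each shard primary, e.g.
    db.serverStatus().extra_info.page_faults     // cumulative page faults
    db.serverStatus().globalLock.currentQueue    // queued readers/writers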
  • seems obvious, but not always the case.
    and if you’re not running it, you can embark on this tedious journey to get it running again.
  • best-case scenario is to make all of the right choices up front… but you’re probably not going to do that. (though hopefully you can learn a bit from our experience and minimize the wrong choices you make).
    the good news is MongoDB is still working for us, despite the headaches we’ve had to deal with.
  • reminder that MongoDB World is right around the corner
    along with all of these great presenters, I’ll be giving a version of this talk there, and would love to meet you