Webinar: Best Practices for MongoDB on AWS


In this session we will look at best practices for administering large MongoDB deployments in the cloud. We will discuss tips and tools for capacity planning, fully scripted provisioning using Chef and knife-ec2, and snapshotting your data safely, as well as using replica sets for high availability across AZs. We will cover the good, the bad, and the ugly of disk performance options on EC2, as well as several filesystem tricks for wringing more performance out of your block devices. Finally, we will talk about some ways to prevent Mongo disaster spirals and minimize your downtime. This session is appropriate for anyone who already has experience administering MongoDB. Some experience with AWS or cloud computing is useful, but not required, for following the material.

Published in: Technology

  1. Charity Majors @mipsytipsy
  2. Topics:
     • Replica sets
     • Resources and capacity planning
     • Provisioning with Chef
     • Snapshotting
     • Scaling tips
     • Monitoring
     • Disaster mitigation
  3. Replica sets
     • Always use replica sets
     • Distribute across Availability Zones
     • Avoid situations where you have an even number of voters
       • 50% is not a majority!
     • More votes are better than fewer (max is 7)
     • Add an arbiter for more flexibility
     • Always explicitly set the priority of your nodes. Surprise elections are terrible.
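The "50% is not a majority" point can be checked with a little arithmetic. The sketch below is illustrative only: the function names are made up for this example, and it assumes an election needs strictly more than half of all voting members.

```python
def majority(voters: int) -> int:
    """Votes needed to elect a primary: strictly more than half of all voters."""
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voting members can be lost while a majority is still reachable."""
    return voters - majority(voters)


if __name__ == "__main__":
    # Replica sets on the slide allow at most seven votes.
    for n in range(2, 8):
        print(f"{n} voters: majority={majority(n)}, can lose {tolerated_failures(n)}")
```

Under this assumption, four voters tolerate the same single failure as three, which is why an even voter count adds cost without adding resilience.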
  4. Basic sane replica set config
     • Each node has one vote (default)
     • Snapshot node does not serve read queries, cannot become master
     • This configuration can survive any single node or Availability Zone outage
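The slide's diagram is not part of this transcript, so the following is only a sketch of what such a configuration document might look like. It assumes two data-bearing nodes plus one hidden snapshot node; the host names and the helper name are hypothetical. The field names (`_id`, `members`, `host`, `priority`, `hidden`) follow the replica set configuration document format.

```python
def build_replset_config(name, data_hosts, snapshot_host):
    """Build a replica set config dict with explicit priorities.

    Every member keeps the default of one vote. The snapshot node is hidden
    and has priority 0, so it serves no reads and can never be elected.
    """
    members = []
    for i, host in enumerate(data_hosts):
        # Earlier hosts get a higher priority so elections are predictable.
        members.append({"_id": i, "host": host, "priority": len(data_hosts) - i})
    members.append({
        "_id": len(data_hosts),
        "host": snapshot_host,
        "priority": 0,
        "hidden": True,
    })
    return {"_id": name, "members": members}


if __name__ == "__main__":
    # Hypothetical hosts, one per Availability Zone.
    config = build_replset_config(
        "rs0",
        ["db1.example.com:27017", "db2.example.com:27017"],
        "snap1.example.com:27017",
    )
    for member in config["members"]:
        print(member)
```

A document shaped like this is what you would hand to `rs.initiate()` in the mongo shell; how it is applied in a given environment is left out here.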
  5. Or manage votes with arbiters
     • Three separate arbiter processes on each AZ arbiter node, one per cluster
     • Maximum of seven votes per replica set
     • Now you can survive all secondaries dying, or an AZ outage
     • If you have even one healthy node, you can continue to serve traffic
     • Arbiters tend to be more reliable than nodes because they have less to do.
  6. Provisioning
     • Memory is your primary constraint, spend your money there
       • Especially for read-heavy workloads
     • Your working set should fit into RAM
       • Lots of page faults means it doesn’t fit
       • 2.4 has a working set estimator in db.serverStatus()!
     • Your snapshot host can usually be smaller, if cost is a concern
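As a rough illustration of the "working set should fit into RAM" check, the sketch below works on an already-fetched serverStatus document. It assumes the 2.4 estimator reports a `workingSet` sub-document with a `pagesInMemory` field and that pages are 4 KiB; the helper names and the 80% headroom figure are assumptions for this example, not recommendations from the talk.

```python
PAGE_SIZE_BYTES = 4096  # assumption: 4 KiB memory pages


def working_set_bytes(server_status: dict) -> int:
    """Estimate working set size from the workingSet section of serverStatus."""
    return server_status["workingSet"]["pagesInMemory"] * PAGE_SIZE_BYTES


def fits_in_ram(server_status: dict, ram_bytes: int, headroom: float = 0.8) -> bool:
    """True if the estimated working set fits in a fraction of RAM.

    headroom leaves room for the OS, connections, and everything else
    that competes with the data files for memory.
    """
    return working_set_bytes(server_status) <= ram_bytes * headroom


if __name__ == "__main__":
    # Hand-written sample document, not real server output.
    sample = {"workingSet": {"pagesInMemory": 1_000_000, "overSeconds": 900}}
    gib = 1024 ** 3
    print(working_set_bytes(sample) / gib, "GiB estimated")
    print("fits in 8 GiB:", fits_in_ram(sample, 8 * gib))
```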
  7. Disk options
     • EBS -- just kidding, EBS is not an option
     • EBS with Provisioned IOPS
     • Ephemeral storage
     • SSD
  8. [Figure: EBS classic; EBS with PIOPS]
  9. PIOPS
     • Guaranteed number of IOPS, up to 2000/volume
     • Variability of <0.1%
     • RAID together multiple volumes for higher performance
     • Supports EBS snapshots
     • Costs 2x regular EBS
     • Can only attach to certain instance types
  10. Estimating PIOPS
      • Estimate how many IOPS to provision with the “tps” column of sar -d 1
      • Multiply that by 2-3x depending on your spikiness
      • When you exceed your PIOPS limit, your disk stops for a few seconds. Avoid this.
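The estimate above is simple arithmetic, sketched here under a few assumptions: the tps samples have already been collected from `sar -d 1`, the baseline is their mean (the slide does not say whether to use the mean or the peak), and provisioned IOPS add up across volumes striped together. The function names are hypothetical.

```python
import math
from statistics import mean

MAX_PIOPS_PER_VOLUME = 2000  # per-volume ceiling quoted on the PIOPS slide


def piops_to_provision(tps_samples, spike_multiplier=2):
    """Baseline tps (mean of the samples) scaled by a 2-3x spikiness factor."""
    return math.ceil(mean(tps_samples) * spike_multiplier)


def volumes_needed(total_piops, per_volume=MAX_PIOPS_PER_VOLUME):
    """Number of volumes to stripe together to reach total_piops."""
    return math.ceil(total_piops / per_volume)


if __name__ == "__main__":
    samples = [800, 1000, 1200]  # made-up tps readings
    for multiplier in (2, 3):
        target = piops_to_provision(samples, multiplier)
        print(f"x{multiplier}: provision {target} IOPS on {volumes_needed(target)} volume(s)")
```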
  11. Ephemeral storage
      • Cheap
      • Fast
      • No network latency
      • You can snapshot with LVM + S3
      • Data is lost forever if you stop or resize the instance
      • Can use EBS on your snapshot node to take advantage of EBS tools
        • Makes restore a little more complicated
  12. Filesystem
      • Use ext4
      • Raise file descriptor limits (cat /proc/<mongo pid>/limits to verify)
      • If you’re using Ubuntu, use upstart
      • Set your readahead (blockdev --setra) to something sane, or you won’t use all your RAM
      • If you’re using mdadm, make sure your md device and its volumes have a small enough block size
      • RAID 10 is the safest and best-performing, RAID 0 is fine if you understand the risks
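Verifying the file descriptor limit can be scripted rather than eyeballed. This is a minimal sketch that parses the text of `/proc/<pid>/limits` on Linux; the function names are made up for the example, and reading the file for a real mongod pid is left to the caller.

```python
def _parse_limit(value: str):
    """Convert a limit column to int, or None when it is 'unlimited'."""
    return None if value == "unlimited" else int(value)


def open_file_limits(limits_text: str):
    """Return (soft, hard) 'Max open files' limits from /proc/<pid>/limits text."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns: "Max open files", soft limit, hard limit, units.
            soft, hard = line.split()[3:5]
            return _parse_limit(soft), _parse_limit(hard)
    raise ValueError("no 'Max open files' line found")


if __name__ == "__main__":
    # Sample text in the layout the kernel uses; real usage would read
    # open(f"/proc/{pid}/limits").read() for the mongod process.
    sample = (
        "Limit                     Soft Limit           Hard Limit           Units\n"
        "Max cpu time              unlimited            unlimited            seconds\n"
        "Max open files            1024                 4096                 files\n"
    )
    print(open_file_limits(sample))
```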
  13. Chef everything
      • Role attributes for backup volumes, cluster names
      • Nodes are effectively disposable
      • Provision and attach EBS RAID arrays via AWS cookbook
      • Delete volumes and AWS attributes, run chef-client to re-provision
      • Restore from snapshot automatically with our backup scripts
      Our mongo cookbook and backup scripts: https://github.com/ParsePlatform/Ops/
  14. Bringing up a new node from the most recent mongo snapshot is as simple as this:
      It’s faster for us to re-provision a node from scratch than to repair a RAID array or fix most problems.
  15. Each replica set has its own role, where it sets the cluster name, the snapshot host name, and the EBS volumes to snapshot. When you provision a new node for this role, mongodb::raid_data will build it off the most recent completed set of snapshots for the volumes specified in backups => mongo_volumes.
  16. Snapshots
      • Snapshot often
      • Set snapshot node to priority = 0, hidden = 1
      • Lock Mongo OR stop mongod during snapshot
      • Snapshot all RAID volumes
        • We use ec2-consistent-snapshot: http://eric.lubow.org/2011/databases/mongodb/ec2-consistent-snapshot-with-mongo , with a wrapper script for chef to generate the backup volume ids
      • Always warm up a snapshot before promoting
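The ordering that matters for a consistent snapshot (lock, snapshot every RAID volume, always unlock) can be sketched independently of any particular tool. This is not how ec2-consistent-snapshot is implemented; `lock`, `unlock`, and `snapshot` are hypothetical callables standing in for whatever locks mongod and triggers the volume snapshot in your environment.

```python
from contextlib import contextmanager


@contextmanager
def locked(lock, unlock):
    """Hold the database lock for the duration of the block, then always release it."""
    lock()
    try:
        yield
    finally:
        # Runs even if a snapshot call fails, so mongod is never left locked.
        unlock()


def snapshot_volumes(volume_ids, lock, unlock, snapshot):
    """Snapshot every volume of the RAID set while writes are blocked."""
    with locked(lock, unlock):
        return [snapshot(volume_id) for volume_id in volume_ids]


if __name__ == "__main__":
    # Stand-in callables that only print; the volume ids are made up.
    result = snapshot_volumes(
        ["vol-aaaa", "vol-bbbb"],
        lock=lambda: print("lock mongod"),
        unlock=lambda: print("unlock mongod"),
        snapshot=lambda v: f"snapshot-of-{v}",
    )
    print(result)
```

Taking all volumes inside one lock window is the point: snapshots of a striped array taken at different moments would not restore to a consistent filesystem.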
  17. Warming a secondary
      • Warm up both indexes and data
      • Use dd or vmtouch to load files from S3
      • Scan for most commonly used collections on primary, read those into memory on secondary
      • Read collections into memory
        • Natural sort
        • Full table scan
        • Search for something that doesn’t exist
      http://blog.parse.com/2013/03/07/techniques-for-warming-up-mongodb/
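The dd-style approach amounts to reading each data file from start to end so its blocks are pulled onto the volume and into the page cache. Below is a minimal sketch of that idea in Python rather than dd or vmtouch; the function names are hypothetical and the data directory path is whatever your mongod uses.

```python
import os


def warm_file(path, chunk_size=1024 * 1024):
    """Read a file sequentially and discard the data; returns bytes read."""
    total = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def warm_directory(data_dir):
    """Warm every regular file under data_dir; returns total bytes read."""
    total = 0
    for root, _dirs, files in os.walk(data_dir):
        for name in sorted(files):
            total += warm_file(os.path.join(root, name))
    return total
```

This only touches the files at the block level. Warming specific indexes and collections in the order queries actually use them, as the later bullets describe, still needs queries run through mongod itself.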
  18. Fragmentation
      • Your RAM gets fragmented too!
      • Leads to underuse of memory
      • Deletes are not the only source of fragmentation
      • db.<collection>.stats() to find the padding factor (between 1 and 2; the higher, the more fragmentation)
      • Repair, compact, or reslave regularly (db.printReplicationInfo() to get the length of your oplog to see if repair is a viable option)
  19. Compaction: before and after
  20. Compaction
      • We recommend running a continuous compaction script on your snapshot host
      • Every time you provision a new host, it will be freshly compacted.
      • Plan to rotate in a compacted primary regularly (quarterly, yearly depending on rate of decay)
      • If you also delete a lot of collections, you may need to periodically run db.repairDatabase() on each db
      http://blog.parse.com/2013/03/26/always-be-compacting/
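One pass of a compaction script might look like the sketch below. It is not the script from the linked post. It assumes `run_command` is a caller-supplied function that sends a command document to the database on the snapshot host (for example a thin wrapper around a driver's command call), and it uses the `compact` database command with the collection name as its value. Skipping `system.` collections is an assumption made for this example.

```python
def compact_all(collection_names, run_command, skip_prefixes=("system.",)):
    """Issue one compact command per collection; returns the names compacted.

    run_command is a hypothetical callable that takes a command document
    (a dict) and executes it against the database being compacted.
    """
    compacted = []
    for name in collection_names:
        if name.startswith(skip_prefixes):
            continue
        run_command({"compact": name})
        compacted.append(name)
    return compacted


if __name__ == "__main__":
    # Print the commands instead of running them against a real server.
    done = compact_all(["users", "system.indexes", "events"], run_command=print)
    print("compacted:", done)
```

Running this in a loop on a hidden, priority-0 node keeps the blocking compact operations away from nodes that serve traffic.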
  21. Scaling strategies
      • Horizontal scaling
      • Query optimization, index optimization
      • Throw money at it (hardware)
      • Upgrade to 2.2 or later to get rid of the global lock
      • Read from secondaries
      • Put the journal on a different volume
      • Repair, compact, or reslave
  22. Monitoring
      • MMS
      • Ganglia + nagios
        • correlate graphs with local metrics like disk i/o
        • graph your own index ops
        • graph your own aggregate lock percentages
        • alert on replication lag, replication error
        • alert if the primary changes, connection limit
      • Use chef! Generate all your monitoring from roles
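A replication lag check can be derived from the output of `replSetGetStatus` (what `rs.status()` prints). The sketch below works on an already-fetched status document and assumes each member carries `name`, `stateStr`, and `optimeDate`; the function names and the 60-second threshold are illustrative, and wiring the result into nagios is left out.

```python
from datetime import datetime, timedelta


def replication_lag(status: dict) -> dict:
    """Seconds each secondary's last applied op trails the primary's."""
    members = status["members"]
    primary = next((m for m in members if m["stateStr"] == "PRIMARY"), None)
    if primary is None:
        # No primary is an alert condition in its own right.
        raise ValueError("replica set has no primary")
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


def lagging(status: dict, threshold_seconds: float = 60) -> list:
    """Names of secondaries whose lag exceeds the threshold."""
    lags = replication_lag(status)
    return sorted(name for name, lag in lags.items() if lag > threshold_seconds)


if __name__ == "__main__":
    now = datetime(2013, 1, 1, 12, 0, 0)
    # Hand-written sample, not real server output.
    sample = {"members": [
        {"name": "db1:27017", "stateStr": "PRIMARY", "optimeDate": now},
        {"name": "db2:27017", "stateStr": "SECONDARY", "optimeDate": now - timedelta(seconds=5)},
        {"name": "db3:27017", "stateStr": "SECONDARY", "optimeDate": now - timedelta(seconds=120)},
    ]}
    print(replication_lag(sample))
    print("lagging:", lagging(sample))
```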
  23. fun with MMS
      • opcounters are color-coded by op type!
      • big bgflush spike means there was an EBS event
      • lots of page faults means reading lots of cold data into memory from disk
      • lock percentage is your single best gauge of fragility.
  24. so ... what can go wrong?
      • Your queues are rising and queries are piling up
      • Everything seems to be getting vaguely slower
      • Your secondaries are in a crash loop
      • You run out of available connections
      • You can’t elect a primary
      • You have an AWS or EBS outage or degradation
      • You have terrible latency spikes
      • Replication stops
  25. ... when queries pile up ...
      • Know what your healthy cluster looks like
      • Don’t switch your primary or restart when overloaded
      • Do kill queries before the tipping point
      • Write your kill script before you need it
      • Read your mongodb.log. Enable profiling!
      • Check db.currentOp():
        • check to see if you’re building any indexes
        • check queries with a high numYields
        • check for long running queries
        • use explain() on them, check for full table scans
        • sort by number of queries/write locks per namespace
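The selection half of a kill script can be written and tested ahead of time against saved `db.currentOp()` output. The sketch below assumes the output has an `inprog` list whose entries carry `opid`, `active`, `op`, `ns`, and `secs_running`. The policy it encodes (only reads, never the `local` database or index builds, 30-second cutoff) is an assumption for illustration, not the speaker's script.

```python
def ops_to_kill(current_op: dict, max_secs: int = 30) -> list:
    """Return opids of long-running read queries that are candidates for killOp."""
    victims = []
    for op in current_op.get("inprog", []):
        if not op.get("active"):
            continue
        if op.get("op") != "query":
            # Leave writes, getmores, and replication traffic alone.
            continue
        namespace = op.get("ns", "")
        if namespace.startswith("local.") or namespace.endswith(".system.indexes"):
            continue
        if op.get("secs_running", 0) >= max_secs:
            victims.append(op["opid"])
    return victims


if __name__ == "__main__":
    # Hand-written sample shaped like currentOp output.
    sample = {"inprog": [
        {"opid": 101, "active": True, "op": "query", "ns": "app.users", "secs_running": 45},
        {"opid": 102, "active": True, "op": "query", "ns": "app.users", "secs_running": 2},
        {"opid": 103, "active": True, "op": "insert", "ns": "app.system.indexes", "secs_running": 900},
        {"opid": 104, "active": True, "op": "query", "ns": "local.oplog.rs", "secs_running": 600},
    ]}
    print(ops_to_kill(sample))
```

Each returned opid would then be passed to `db.killOp()` in the mongo shell or the equivalent in your driver.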
  26. ... everything getting slower ...
      • Is your RAID array degraded?
      • Do you need to compact your collections or databases?
      • Are you having EBS problems? Check bgflush
      • Are you reaching your PIOPS limit?
      • Are you snapshotting while serving traffic?
      ... terrible latency spikes ...
  27. mongodb.log is your friend.
  28. ... AWS or EBS outage ...
      • Full outages are often less painful than degradation
      • Take down the degraded nodes
      • Stop mongodb to close all connections
      • Hopefully you have balanced across AZs and are coasting
      • If you are down and can’t elect a primary, bring up a new node with the same hostname and port as a downed node
  29. that’s all folks :) Charity Majors @mipsytipsy