Five MMS Monitoring Alerts to
Keep Your MongoDB Deployment
on Track
Angshuman Bagchi (angshuman@mongodb.com)
Technical Ser...
Agenda
• What is MMS Monitoring?
• What are Alerts?
• How to pick an Alert?
• Five recommended Alerts
• Wrap up
What is MMS Monitoring?
Who uses MMS?
What are MMS alerts?
Source:
http://www.cleanfunnypics.com/no-its-not-empty/#axzz2pqknJJbC
How to pick an Alert?
• Is there an absolute limit to alert on?
• What is normal (baseline)?
• What is worrying (warning)?
• What is a defi...
Five recommended alerts
• Host Recovering (All, but by definition
Secondary)
• Replication Lag (Secondary)
• Connections (...
Host Recovering
• General alert triggered if any instance
enters RECOVERING mode
• Required for all use-cases
• All Replic...
Host Recovering
Replication Lag
• No secondary should be behind
• Secondary reads affected
• All Replica Sets should have this
• Only exc...
Replication Lag
Absolute Limit?

Yes, about 1 or 2s. To prevent false positives absolute
threshold > 240s should be alerte...
Example: replication lag
150,000s of lag ~ almost 2 days of lag!
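To check the "almost 2 days" reading of the 150,000s figure, the conversion can be sketched as below. The function name `lag_in_days` is illustrative and not part of MMS.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a day


def lag_in_days(lag_seconds: float) -> float:
    """Convert a replication lag expressed in seconds into days."""
    return lag_seconds / SECONDS_PER_DAY


# The lag value quoted on the slide.
print(f"{lag_in_days(150_000):.2f} days")  # about 1.74 days
```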
Example: replication lag
• Secondaries under-specified relative to primaries
• Access patterns between primary /
secondaries
• Insu...
Example: replication lag
Example:
• ~1500 ops per minute (opcounters)
• 0.1 MB per object (average object size,
local db)
...
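The bandwidth estimate in this example multiplies the operation rate by the average object size and converts megabytes to megabits. A minimal sketch of that arithmetic follows; the function name and parameter names are assumptions made for illustration, and the inputs are the slide's rough figures rather than measured values.

```python
def required_bandwidth_mbps(ops_per_minute: float, mb_per_op: float) -> float:
    """Estimate the network bandwidth (megabits per second) needed to replicate a workload."""
    ops_per_second = ops_per_minute / 60            # opcounters are per minute here
    megabytes_per_second = ops_per_second * mb_per_op
    return megabytes_per_second * 8                 # 8 bits per byte


# ~1500 ops per minute at ~0.1 MB per object comes to roughly 20 Mbps.
print(required_bandwidth_mbps(1500, 0.1))
```

If the link between primary and secondary offers less than this, replication lag is likely to grow under sustained load.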
Connections
• Each connection consumes ~ 1MB and
a file descriptor
• 5000 connections => 5GB of RAM
• Stability and predic...
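The memory cost of connections can be approximated the same way. This sketch takes the slide's rough figure of about 1 MB per connection and, as an assumption, uses decimal gigabytes (1000 MB per GB) to match the slide's round number; the names are illustrative.

```python
MB_PER_CONNECTION = 1.0  # rough per-connection cost quoted on the slide


def connection_memory_gb(connections: int, mb_per_connection: float = MB_PER_CONNECTION) -> float:
    """Approximate RAM consumed by open connections, in decimal gigabytes."""
    return connections * mb_per_connection / 1000


print(connection_memory_gb(5000))  # 5000 connections at ~1 MB each
```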
Pro-Tip: know thyself
You have to recognize normal to know when it isn’t.

Source: http://www.flickr.com/photos/skippy/685...
Connections
Absolute Limit?

Yes, but this is too high. We need to alert before that

Normal

TBD based on deployment, num...
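The deck's rule of thumb for connections is a baseline of X during peak load, "worrying" at a 50% increase (1.5X) and "critical" at double (2X). A minimal sketch of that classification is shown below. The function name is hypothetical, and treating the thresholds as inclusive is an assumption, since the slides do not say.

```python
def classify_connections(current: int, peak_baseline: int) -> str:
    """Classify a connection count against the peak-load baseline X."""
    if peak_baseline <= 0:
        raise ValueError("peak_baseline must be positive")
    ratio = current / peak_baseline
    if ratio >= 2.0:   # double the baseline
        return "critical"
    if ratio >= 1.5:   # 50% above the baseline
        return "worrying"
    return "normal"


print(classify_connections(1600, 1000))  # worrying
```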
Lock %
• Lock contention degrades performance
• High lock % starves replication, reads.
• Bounds need to be determined
Lock %
Absolute Limit?

Yes, >80% occasional degraded performance, 90% major
impact regularly

Normal

TBD. Write heavy lo...
Replica
• Represents oplog window
• Depends on
– Rate of operations inserted into oplog
– Size of operations
– Size of opl...
Replica
Absolute Limit?

50% below Normal

Normal

TBD. Say X hours during peak

Worrying

25% below Normal

Critical

50%...
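For the oplog window the thresholds run in the opposite direction: the deck treats a window 25% below normal as worrying and 50% below normal as critical. A sketch under those assumptions follows; the function name is illustrative, and the inclusive comparisons are an assumption.

```python
def classify_oplog_window(window_hours: float, normal_hours: float) -> str:
    """Classify the current oplog window against the normal peak-time window."""
    if normal_hours <= 0:
        raise ValueError("normal_hours must be positive")
    if window_hours <= 0.50 * normal_hours:   # 50% below normal
        return "critical"
    if window_hours <= 0.75 * normal_hours:   # 25% below normal
        return "worrying"
    return "normal"


print(classify_oplog_window(36, 48))  # worrying
```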
Summary
• Use similar approach for other metrics
• Different audiences for alerts
– Worrying alerts ops team
– Critical go...
I got alerted … now what?
mms.mongodb.com

angshuman@mongodb.com
MongoDB Management Service (MMS) is a cloud-based suite of services for managing MongoDB deployments, providing both monitoring and backup capabilities. In this webinar we'll outline 5 alerts you should set up in MMS to keep your MongoDB deployment on track. We’ll explore what each alert means for a MongoDB instance, as well as how to calibrate the alert triggers to be relevant to your environment.

Published in: Technology, Business
  • A member of a replica set enters the RECOVERING state when it is not ready to accept reads. The RECOVERING state can occur during normal operation and doesn’t necessarily reflect an error condition. Members in the RECOVERING state are eligible to vote in elections but are not eligible to enter the PRIMARY state.
  • Meeting Notes (1/9/14 11:17): Initial Sync, rollback, stale
  • Get the discount code from Meghan or Rhea
  • Webinar: Five MMS Monitoring Alerts to Keep Your MongoDB Deployment on Track

    1. Five MMS Monitoring Alerts to Keep Your MongoDB Deployment on Track. Angshuman Bagchi (angshuman@mongodb.com), Technical Services Engineer
    2. Agenda • What is MMS Monitoring? • What are Alerts? • How to pick an Alert? • Five recommended Alerts • Wrap up
    3. What is MMS Monitoring?
    4. Who uses MMS?
    5. What are MMS alerts?
    6. Source: http://www.cleanfunnypics.com/no-its-not-empty/#axzz2pqknJJbC
    7. How to pick an Alert?
    8. • Is there an absolute limit to alert on? • What is normal (baseline)? • What is worrying (warning)? • What is a definite problem (critical)? • Likelihood of false positives? ... there is no magic formula
    9. Five recommended alerts • Host Recovering (All, but by definition Secondary) • Replication Lag (Secondary) • Connections (All mongos, mongod) • Lock % (Primary, Secondary) • Replica (Primary, Secondary)
    10. Host Recovering • General alert triggered if any instance enters RECOVERING mode • Required for all use-cases • All Replica Sets should have this • Sometimes this may be expected during maintenance
    11. Host Recovering
    12. Replication Lag • No secondary should be behind • Secondary reads affected • All Replica Sets should have this • Only exception is configured slaveDelay
    13. Replication Lag: Absolute Limit? Yes, about 1 or 2s. To prevent false positives, alert on an absolute threshold of > 240s. Normal: lag is ideally 0s. Worrying: < 60s, some false positives. Critical: > 240s. False positives: above 240s, likelihood is low.
    14. Example: replication lag. 150,000s of lag ~ almost 2 days of lag!
    15. Example: replication lag • Secondaries under-specified relative to primaries • Access patterns between primary / secondaries • Insufficient bandwidth • Foreground index builds on secondaries. “…when you have eliminated the impossible, whatever remains, however improbable, must be the truth…” -- Sherlock Holmes (Sir Arthur Conan Doyle, The Sign of the Four)
    16. Example: replication lag • ~1500 ops per minute (opcounters) • 0.1 MB per object (average object size, local db). ~1500 ops/min / 60 s/min * 0.1 MB/op * 8 b/B ≈ 20 Mbps required bandwidth
    17. Connections • Each connection consumes ~1 MB and a file descriptor • 5000 connections => 5 GB of RAM • Stability and predictability are key
    18. Pro-Tip: know thyself. You have to recognize normal to know when it isn’t. Source: http://www.flickr.com/photos/skippy/6853920/
    19. Connections: Absolute Limit? Yes, but this is too high; we need to alert before that. Normal: TBD based on deployment, number of nodes, connection pool settings, app servers, load, etc. Say, X during peak load. Worrying: 50% increase, so 1.5X. Critical: double, so 2X.
    20. Lock % • Lock contention degrades performance • High lock % starves replication and reads • Bounds need to be determined
    21. Lock %: Absolute Limit? Yes: > 80% occasional degraded performance, 90% major impact regularly. Normal: TBD. Write-heavy loads see higher values. Normal is, say, X% during peak load. Worrying: double, so approximately 2X%. Critical: TBD. For Prod, > 80%.
    22. Replica • Represents oplog window • Depends on – Rate of operations inserted into oplog – Size of operations – Size of oplog capped collection • Normal maintenance window X 3 • Resizing the oplog is non-trivial
    23. Replica: Absolute Limit? 50% below Normal. Normal: TBD. Say, X hours during peak. Worrying: 25% below Normal. Critical: 50% below Normal.
    24. Summary • Use similar approach for other metrics • Different audiences for alerts – Worrying alerts go to the ops team – Critical alerts go out to a wider audience • Get started with MMS Monitoring and alerts!
    25. I got alerted … now what?
    26. mms.mongodb.com angshuman@mongodb.com