2. The Plan
● Review support cases
○ Taken from real issues
○ Names/ips/dates changed to protect identities
● Analyze reported issues
● Distill best practices
● Summarize takeaways
● Repeat...
3. Scenario 1
● Fire, it is on fire!
● Users notice response times of 1-3 sec
● App logs show timeouts
● Server logs show socket exceptions
4. Scenario 1 - Diagnostics
● Logs
● Understanding the timeouts
○ Client read timeout set
○ Connection closed/discarded
○ Symptom not cause
● Server connection exceptions
○ Match timing of client timeouts
○ Symptom not cause
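Matching the timing of client timeouts against server connection exceptions can be sketched roughly as below. This is a minimal illustration only: the timestamp format, the sample events, and the `correlate` helper are all hypothetical and are not taken from the actual support case.

```python
from datetime import datetime, timedelta

# Assumed timestamp format; real client and server logs will likely differ.
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def correlate(client_timeouts, server_exceptions, window_seconds=2):
    """Map each client timeout to the server exceptions logged within the window."""
    window = timedelta(seconds=window_seconds)
    server_times = [datetime.strptime(s, TS_FORMAT) for s in server_exceptions]
    result = {}
    for raw in client_timeouts:
        t = datetime.strptime(raw, TS_FORMAT)
        result[raw] = [
            s.strftime(TS_FORMAT) for s in server_times if abs(s - t) <= window
        ]
    return result


# Hypothetical sample events, for illustration only.
client_timeouts = ["2013-01-01 10:00:05", "2013-01-01 10:07:30"]
server_exceptions = ["2013-01-01 10:00:06", "2013-01-01 10:03:00"]
pairs = correlate(client_timeouts, server_exceptions)
```

A timeout that pairs with a server exception supports the reading that both are symptoms of the same underlying event; a timeout with no match deserves a separate look.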
5. Scenario 1 - Monitoring
Graphs speak a thousand words
6. Scenario 1 - Takeaways
● Monitor Logs
○ Alert, escalate
○ Correlate
● Disk
○ Monitor
○ Moved to RAID (10)
● Instrument/Monitor App
● Know your application and its (write) characteristics
7. Scenario 2
● Alerts warn that server is running hot
● Random (small) slowdowns
● Increased traffic/queries
8. Scenario 2 - Symptoms
High CPU use
Similar query pattern
9. Scenario 2 - Diagnostics
● Turn on DB Profiling
● Look at logs
Identify the query patterns that take longest or run most frequently, then run explain on them
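One way to rank query patterns by total time and frequency is sketched below. The entries are simplified, hypothetical records loosely modeled on profiler output (the `ns`, `query`, and `millis` field names are assumptions here), and `query_shape` and `rank_patterns` are illustrative helpers, not part of any MongoDB tool.

```python
from collections import defaultdict


def query_shape(query):
    """Reduce a query to its shape: the sorted field names, with values dropped."""
    return tuple(sorted(query))


def rank_patterns(entries):
    """Group entries by (namespace, shape) and sort by total time, slowest first."""
    stats = defaultdict(lambda: {"count": 0, "total_ms": 0})
    for entry in entries:
        key = (entry["ns"], query_shape(entry["query"]))
        stats[key]["count"] += 1
        stats[key]["total_ms"] += entry["millis"]
    return sorted(stats.items(), key=lambda kv: kv[1]["total_ms"], reverse=True)


# Hypothetical sample entries, for illustration only.
entries = [
    {"ns": "app.orders", "query": {"status": "open", "user": 1}, "millis": 120},
    {"ns": "app.orders", "query": {"user": 2, "status": "closed"}, "millis": 150},
    {"ns": "app.users", "query": {"email": "a@example.com"}, "millis": 5},
]
ranked = rank_patterns(entries)
```

The patterns at the top of `ranked` are the first candidates for running explain.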
11. Scenario 2 - Diagnostics
● Create a compound index
○ Used for criteria and sort
○ Reduced CPU dramatically
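Why a single compound index can serve both the criteria and the sort can be illustrated with a toy model in plain Python. This is a conceptual sketch, not how MongoDB implements indexes: the `status` and `created` fields, the sample documents, and `find_sorted` are all hypothetical.

```python
import bisect

# Hypothetical documents, for illustration only.
docs = [
    {"_id": 1, "status": "open", "created": 30},
    {"_id": 2, "status": "closed", "created": 10},
    {"_id": 3, "status": "open", "created": 10},
    {"_id": 4, "status": "open", "created": 20},
]

# Toy compound "index" on (status, created): sorted key tuples pointing at _id.
index = sorted((d["status"], d["created"], d["_id"]) for d in docs)


def find_sorted(status):
    """Return the _ids matching status, already ordered by created.

    Both bounds are found by binary search, and the slice between them needs
    no separate sort step because the index is ordered by (status, created).
    """
    lo = bisect.bisect_left(index, (status,))
    # Upper bound: the smallest string greater than status, so that every
    # entry whose first element equals status sorts before it.
    hi = bisect.bisect_left(index, (status + "\x00",))
    return [doc_id for _, _, doc_id in index[lo:hi]]


open_ids = find_sorted("open")
```

Without such an ordering, the matching documents would have to be collected and then sorted on every query, which is the kind of repeated work that shows up as CPU load.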
12. Scenario 2 - Takeaways
● Performance test/analyze system behavior
● Load test before deployment
● Alert on abnormal states
● High CPU is a sign of poorly indexed queries
● Rolling upgrade for indexes
16. Scenario 3 - Takeaways
● Pay attention to disk configurations
● Load testing would have found this early
● MongoDB depends on the OS a lot
● Connect the dots from disproportionate effects
17. Best Practices Learned
● System provisioning
○ Capacity
○ Performance
○ Scale
○ Configuration
● Logs
○ Review
○ Alert
○ Rotate and collect (per cluster)
18. Best Practices Learned
● Query/Index Analysis
○ Database Profiler
○ Run explain periodically (sampled)
○ Instrument code, generate metrics
● Plan/test rollouts
○ Rolling upgrade for Replica Set
○ Generate indexes on secondaries first
○ Name services, use redirection
19. Thanks, more refs
Please take a look at http://mongodb.org (docs)
● Ask on mongodb-user group
● Use MMS or historic monitoring
○ Watch for trends
○ Create alerts
○ Forecast capacity for provisioning
● logrotate unix command
● monitor disk - munin or the like
● iostat, dstat, vmstat, free, netstat
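Forecasting capacity from a monitored trend can be as simple as fitting a straight line to recent samples. The sketch below assumes evenly spaced daily disk-usage readings and linear growth; the `days_until_full` helper and the sample numbers are hypothetical, and real growth is often not linear.

```python
def days_until_full(samples_gb, capacity_gb):
    """Estimate days until capacity is reached, using a least-squares line.

    samples_gb holds one reading per day, oldest first. Returns None when
    there are too few samples or usage is not growing.
    """
    n = len(samples_gb)
    if n < 2:
        return None
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples_gb) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples_gb))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    day_full = (capacity_gb - intercept) / slope
    # Count from the most recent sample, which is day n - 1.
    return max(0.0, day_full - (n - 1))


# Hypothetical readings: 10 GB of growth per day on a 200 GB volume.
remaining = days_until_full([100, 110, 120, 130], capacity_gb=200)
```

An alert on a low `remaining` value gives provisioning some lead time instead of a surprise when the disk fills.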