1JUNE 2014
Performance Tuning
on the Fly at CMP.LY
Michael De Lorenzo
CTO, CMP.LY Inc.
michael@cmp.ly
@mikedelorenzo
2JUNE 2014
Agenda
• CMP.LY and CommandPost
• What is MongoDB Management Service?
• Performance Tuning
• MongoDB Issues we’ve faced
• Slow response times and delayed writes
• Unindexed queries
• Increased Replication Lag and Plummeting oplog Window
• Keep your deployment healthy with MMS
• Using MMS Alerts
• Using MMS Backups
3JUNE 2014
CMP.LY: a venture-funded NYC startup that offers proprietary social media monitoring,
measurement, insight and compliance solutions for Fortune 100 companies
CommandPost: a Monitoring, Measurement & Insights (MMI) tool for managed social
communications.
4JUNE 2014
Use CommandPost to:
• Track and measure cross-platform in real-time
• Identify and attribute high-value engagement
• Analyze and segment engaged audience
• Optimize content and engagement strategies
• Address compliance needs
5JUNE 2014
What is MongoDB
Management Service?
6JUNE 2014
MongoDB Management Service
• Free MongoDB Monitoring
• MongoDB Backup in the Cloud
• Free cloud service, or available to run on-prem for Standard or Enterprise subscriptions
• Automation coming soon—FTW!
Ops
Makes MongoDB easier to use and
manage
7JUNE 2014
Who Is MMS for?
• Developers
• Ops Team
• MongoDB Technical Service Team
8JUNE 2014
Performance Tuning
9JUNE 2014
How To Do Performance Tuning?
• Assess the problem and establish acceptable behavior.
• Measure the performance before modification.
• Identify the bottleneck.
• Remove the bottleneck.
• Measure performance after modification to confirm.
• Keep it or revert it and repeat.
Adapted from [http://en.wikipedia.org/wiki/Performance_tuning]
10JUNE 2014
What We’ve Faced
11JUNE 2014
Issues We’ve Faced
• Concurrency Issues
• Slow response times and delayed writes
• Querying without indexes
• Slow reads, timeouts
• Increasing Replication Lag + Plummeting oplog Window
12JUNE 2014
Concurrency
Slow responses and delayed writes
13JUNE 2014
Concurrency
• What is it?
• How did it affect us?
• How did MMS help identify it?
• How did we diagnose the issue in our app and fix it?
• Today
14JUNE 2014
Concurrency in MongoDB
• MongoDB uses a readers-writer lock
• Many read operations can share a read lock
• A write lock is held exclusively by a single write operation
• No other read or write operations can share the lock while it is held
• Locks are “writer-greedy”: when a read and a write are both waiting, the write gets the lock
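The locking rules above can be sketched as a toy, single-threaded model. This is an illustration only, not MongoDB's implementation: the class name `WriterGreedyLock` and its methods are hypothetical, and the model only tracks the order in which requests are granted.

```javascript
// Toy model of a writer-greedy readers-writer lock (illustration only;
// the class and method names are hypothetical, not MongoDB internals).
class WriterGreedyLock {
  constructor() {
    this.activeReaders = 0;
    this.writerActive = false;
    this.waiting = []; // queued requests: { kind: "read" | "write", id }
    this.granted = []; // ids in the order they received the lock
  }

  request(kind, id) {
    if (kind === "read" && !this.writerActive) {
      this.activeReaders += 1; // reads share the lock
      this.granted.push(id);
    } else if (kind === "write" && !this.writerActive && this.activeReaders === 0) {
      this.writerActive = true; // a write holds the lock exclusively
      this.granted.push(id);
    } else {
      this.waiting.push({ kind, id });
    }
  }

  release(kind) {
    if (kind === "write") this.writerActive = false;
    else this.activeReaders -= 1;
    if (this.writerActive || this.activeReaders > 0) return;

    // Writer-greedy: a waiting write wins over any waiting reads.
    const w = this.waiting.findIndex((r) => r.kind === "write");
    if (w !== -1) {
      const [next] = this.waiting.splice(w, 1);
      this.writerActive = true;
      this.granted.push(next.id);
      return;
    }
    // No write is waiting: admit all queued reads together.
    for (const r of this.waiting.splice(0)) {
      this.activeReaders += 1;
      this.granted.push(r.id);
    }
  }
}

const lock = new WriterGreedyLock();
lock.request("write", "w1"); // granted immediately
lock.request("read", "r1");  // must wait: a write holds the lock
lock.request("write", "w2"); // must wait too
lock.release("write");       // w2 is granted ahead of r1
lock.release("write");       // now r1 is granted
console.log(lock.granted);   // grant order: w1, w2, r1
```

Even though `r1` asked first, `w2` gets the lock before it. Under a steady stream of writes, reads can wait a long time, which is consistent with the slow API responses described on the next slide.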
15JUNE 2014
How Did This Affect Us?
• Slow API response times due to slow database operations
• Delayed writes
• Backed up queues
16JUNE 2014
MMS: Identify Concurrency Issues
17JUNE 2014
Lock % Greater than 100%?!?!?
• Lock % is the time spent in the write-lock state: the sum of the global lock and the
hottest database's lock at that time, which can push the value above 100%
• Global lock percentage is a derived metric:
% of time in global lock (small number)
+
% of time locked by hottest (“most locked”) database
• Because the data is sampled and combined, values over 100% are possible.
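As a rough sketch of the arithmetic described above (not MMS's actual code), the derived value is the global lock percentage plus the percentage of the most-locked database. The function name and the sample numbers below are hypothetical.

```javascript
// Sketch of the derived "lock %" value as the slide describes it:
// global lock % plus the lock % of the hottest (most locked) database.
// The function name and inputs are assumptions for illustration.
function derivedLockPercent(globalPct, perDatabasePct) {
  const hottest = Math.max(0, ...Object.values(perDatabasePct));
  return globalPct + hottest;
}

// Hypothetical sample values, not measurements from the talk.
const sampleLockPct = derivedLockPercent(6, { app: 97, analytics: 40 });
console.log(sampleLockPct); // 103
```

Because the two terms are sampled separately and then added, nothing caps the result at 100.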
18JUNE 2014
Diagnosis
• Identified the write-heavy collections in our applications
• Used application logs to identify slow API responses
• Analyzed MongoDB logs to identify slow database queries
19JUNE 2014
Our Remedies
• Schema changes
• Message queues
• Multiple databases
• Sharding
20JUNE 2014
Schema Changes
• Denormalized our schema
• Allowed for atomic updates
• Customized documents’ _id attribute
• Leveraged existing index on _id attribute
21JUNE 2014
Modeling for Atomic Operations
Document
{
  _id: 123456789,
  title: "MongoDB: The Definitive Guide",
  author: [ "Kristina Chodorow", "Mike Dirolf" ],
  published_date: ISODate("2010-09-24"),
  pages: 216,
  language: "English",
  publisher_id: "oreilly",
  available: 3,
  checkout: [ { by: "joe", date: ISODate("2012-10-15") } ]
}
Update Operation
db.books.update(
  { _id: 123456789, available: { $gt: 0 } },
  {
    $inc: { available: -1 },
    $push: { checkout: { by: "abc", date: new Date() } }
  }
)
Result
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
22JUNE 2014
Message Queues
• Controlled writes to specific collections using Pub/Sub
• We chose Amazon SQS
• Other options include Redis, Beanstalkd, IronMQ or any other message queue
• Created consistent flow of writes versus bursts
• Reduced length and frequency of write locks by controlling flow/speed of writes
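To illustrate the "consistent flow versus bursts" point, here is a toy simulation, assuming a worker that drains a queue at a fixed maximum rate per tick. The function name, the tick model and the numbers are hypothetical and are not taken from our SQS setup.

```javascript
// Toy simulation: writes arrive in bursts, but a queue consumer applies
// at most `maxWritesPerTick` of them per tick. Names are hypothetical.
function smoothWrites(arrivalsPerTick, maxWritesPerTick) {
  let backlog = 0;
  const applied = [];
  for (const arriving of arrivalsPerTick) {
    backlog += arriving;
    const n = Math.min(backlog, maxWritesPerTick);
    applied.push(n);
    backlog -= n;
  }
  // Keep draining after arrivals stop.
  while (backlog > 0) {
    const n = Math.min(backlog, maxWritesPerTick);
    applied.push(n);
    backlog -= n;
  }
  return applied;
}

// Two bursts (10 and 8 writes) become a steady stream of at most 3 per tick.
const appliedPerTick = smoothWrites([10, 0, 0, 8, 0, 0], 3);
console.log(appliedPerTick);
```

The database sees the same total number of writes, but never more than the chosen rate at once, so each write lock is held for a shorter, more predictable stretch.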
23JUNE 2014
Using Multiple Databases
• As of version 2.2, MongoDB implements locks at per-database granularity for
most read and write operations
• Document-level locking is planned for version 2.8
• Moved write-heavy collections to new (separate) databases
24JUNE 2014
Using Sharding
• Improves concurrency by distributing databases across multiple mongod
instances
• Locks are per-mongod instance
25JUNE 2014
Lock %: Today
26JUNE 2014
Queries without Indexes
Slow responses and timeouts
27JUNE 2014
Indexing
• What is it?
• How did it affect us?
• How did MMS help identify it?
• How did we diagnose the issue in our app and fix it?
• Today
28JUNE 2014
Indexing with MongoDB
• Support for efficient execution of queries
• Without indexes, MongoDB must scan every document in the collection
• Example
Wed Jul 17 13:40:14 [conn28600] query x.y [snip] ntoreturn:16 ntoskip:0 nscanned:16779 scanAndOrder:1 keyUpdates:0 numYields: 906 locks(micros) r:46877422 nreturned:16 reslen:6948 38172ms
38 seconds! Scanned 17k documents, returned 16
• Create indexes to support all queries, especially common and user-facing ones
• Collection scans can push the entire working set out of RAM
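The example log line above can be picked apart with a few regular expressions. This is a minimal sketch: the helper name `summarizeSlowQuery` is hypothetical, and the patterns only target the fields visible in that one line, not every MongoDB log format.

```javascript
// The log line from the slide, joined into one string.
const line =
  "Wed Jul 17 13:40:14 [conn28600] query x.y [snip] ntoreturn:16 ntoskip:0 " +
  "nscanned:16779 scanAndOrder:1 keyUpdates:0 numYields: 906 locks(micros) " +
  "r:46877422 nreturned:16 reslen:6948 38172ms";

// Hypothetical helper: extracts the counters that point to a missing index.
function summarizeSlowQuery(logLine) {
  const num = (re) => {
    const m = logLine.match(re);
    return m ? Number(m[1]) : null;
  };
  const nscanned = num(/\bnscanned:\s*(\d+)/);
  const nreturned = num(/\bnreturned:\s*(\d+)/);
  const millis = num(/\s(\d+)ms\s*$/);
  return {
    nscanned,
    nreturned,
    millis,
    // Documents examined per document returned; near 1 is ideal.
    scannedPerReturned:
      nscanned !== null && nreturned ? nscanned / nreturned : null,
    // scanAndOrder:1 means the sort could not use an index.
    inMemorySort: /\bscanAndOrder:1\b/.test(logLine),
  };
}

const slowQuery = summarizeSlowQuery(line);
console.log(slowQuery);
```

A high `scannedPerReturned` ratio together with `inMemorySort` is the signature to look for: the server examined over a thousand documents for each one it returned, then sorted them without index support.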
29JUNE 2014
How Did this Affect Us?
• Our web apps became slow
• Queries began to time out
• Longer operations mean longer lock times
30JUNE 2014
MMS: Identifying Indexing Issues
Page Faults
• The number of times that MongoDB requires data not located in physical
memory and must read it from virtual memory (that is, from disk).
31JUNE 2014
Diagnosis
• Log Analysis
• Use mtools to analyze MongoDB logs
• mlogfilter
• filter logs for slow queries, collection scans, etc.
• mplotqueries
• graph query response times and volumes
• https://github.com/rueckstiess/mtools
32JUNE 2014
Diagnosis
• Monitoring application logs
• Enabling ‘notablescan’ option in development and testing versions of apps
• MongoDB profiling
33JUNE 2014
The MongoDB Profiler
• Collects fine-grained data about MongoDB write operations, cursors and database
commands on a running mongod instance.
• The default slow-operation threshold (slowOpThresholdMs, a.k.a. slowms) is 100 ms; it can be changed from the mongo shell
34JUNE 2014
Our Remedies
• Add indexes!
• Make sure queries are covered
• Utilize the projection specification to limit fields (data) returned
35JUNE 2014
Adding Indexes
• Improved performance for common queries
• Alleviates the need to go to disk for many operations
36JUNE 2014
Projection Specification
Controls the amount of data that needs to be (de-)serialized for use in your app
• We used it to limit data returned in embedded documents and arrays
db.inventory.find( { type: 'food' }, { item: 1, qty: 1 } )
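To show why this reduces the data handed to the app, here is a plain-JavaScript imitation of an inclusion projection on top-level fields. It is not the server's implementation: the `project` helper and the sample document are hypothetical, and only the default inclusion of `_id` mirrors real MongoDB behavior.

```javascript
// Plain-JavaScript imitation of an inclusion projection (top-level fields
// only). `project` is a hypothetical helper, not a MongoDB API.
function project(doc, projection) {
  const out = {};
  // As in MongoDB, _id is included unless explicitly excluded with _id: 0.
  if (projection._id !== 0 && "_id" in doc) out._id = doc._id;
  for (const [field, include] of Object.entries(projection)) {
    if (field !== "_id" && include === 1 && field in doc) {
      out[field] = doc[field];
    }
  }
  return out;
}

// Hypothetical inventory document with bulky embedded data.
const fullDoc = {
  _id: 1,
  type: "food",
  item: "apple",
  qty: 20,
  ratings: [5, 4, 3, 5, 4],
  supplier: { name: "Acme", address: { city: "NYC", zip: "10001" } },
};

const slimDoc = project(fullDoc, { item: 1, qty: 1 });
console.log(slimDoc);
console.log(JSON.stringify(fullDoc).length, "->", JSON.stringify(slimDoc).length);
```

The embedded array and sub-document never reach the client, which is the saving we were after for documents with large embedded structures.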
37JUNE 2014
Page Faults: Today
38JUNE 2014
Increasing Replication Lag +
Plummeting oplog Window
39JUNE 2014
Replication
• What is it?
• How did it affect us?
• How did MMS help identify it?
• How did we diagnose the issue in our app?
• How did we fix it?
• Today
40JUNE 2014
What is Replication?
• A replica set is a group of mongod
processes that maintain the same data
set.
• Replica sets provide redundancy and
high availability, and are the basis for all
production deployments
41JUNE 2014
What Is the Oplog?
• A special capped collection that keeps a rolling record of all operations that
modify the data stored in your databases.
• Operations are first applied on the primary and then recorded to its oplog.
• Secondary members then copy and apply these operations in an asynchronous
process.
42JUNE 2014
What is Replication Lag?
• A delay between an operation on the primary and the application of that
operation from the oplog to the secondary.
• Effects of excessive lag
• “Lagged” members ineligible to quickly become primary
• Increases the possibility that distributed read operations will be inconsistent.
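As a sketch of the definition above, lag can be estimated as the gap between the last operation time on the primary and on each secondary. The member list below is hypothetical sample data, loosely modeled on the `stateStr` and `optimeDate` fields that `rs.status()` reports; the helper name is an assumption.

```javascript
// Hypothetical sample data shaped like members reported by rs.status().
const members = [
  { name: "db1:27017", stateStr: "PRIMARY", optimeDate: new Date("2014-06-01T12:00:30Z") },
  { name: "db2:27017", stateStr: "SECONDARY", optimeDate: new Date("2014-06-01T12:00:28Z") },
  { name: "db3:27017", stateStr: "SECONDARY", optimeDate: new Date("2014-06-01T11:45:30Z") },
];

// Hypothetical helper: seconds each secondary trails the primary.
function replicationLagSeconds(memberList) {
  const primary = memberList.find((m) => m.stateStr === "PRIMARY");
  if (!primary) return {};
  const lag = {};
  for (const m of memberList) {
    if (m.stateStr === "SECONDARY") {
      lag[m.name] = (primary.optimeDate.getTime() - m.optimeDate.getTime()) / 1000;
    }
  }
  return lag;
}

const lagByMember = replicationLagSeconds(members);
console.log(lagByMember); // db2 trails by 2 s, db3 by 900 s
```

A member that is 15 minutes behind, like `db3` here, would serve noticeably stale reads and would be a poor candidate to take over as primary.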
43JUNE 2014
How did this affect us?
• Degraded overall health of our production deployment.
• Distributed reads are no longer eventually consistent.
• Unable to bring new secondary members online.
• Caused MMS Backups to do full re-syncs.
44JUNE 2014
Identifying Replication Lag Issues
with MMS
The Replication Lag chart displays the lag for your deployment
45JUNE 2014
Diagnosis
• Possible causes of replication lag include network latency, disk throughput,
concurrency and/or appropriate write concern
• Size of operations to be replicated
• Confirmed Non-Issues for us
• Network latency
• Disk throughput
• Possible Issues for us
• Concurrency/write concern
• Size of op is an issue because entire document is written to oplog
46JUNE 2014
Concurrency/Write Concern
• Our applications apply many updates very quickly
• All operations need to be replicated to secondary members
• We use the default write concern, Acknowledged
• The mongod confirms receipt of the write operation
• Allows clients to catch network, duplicate key and other errors
47JUNE 2014
Concurrency Wasn’t the Issue
Lock Percentage
48JUNE 2014
Operation Size Was the Issue
Collection A (most active)
Total Updates: 3,373
Total Size of updates: 6.5 GB
Activity accounted for nearly 87% of total traffic
Collection B (next most active)
Total Updates: 85,423
Total Size of updates: 740 MB
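The figures above become more telling once they are divided out per update. A quick sketch, assuming binary units (1 GB = 1024 MB), which the slide does not specify:

```javascript
// Average update size per collection, from the totals on the slide.
// Assumption: binary units (1 GB = 1024 MB); the slide does not say.
const MB = 1024 * 1024;
const GB = 1024 * MB;

const collections = [
  { name: "Collection A", updates: 3373, totalBytes: 6.5 * GB },
  { name: "Collection B", updates: 85423, totalBytes: 740 * MB },
];

const avgUpdateKB = collections.map((c) => ({
  name: c.name,
  avgKB: c.totalBytes / c.updates / 1024,
}));

for (const c of avgUpdateKB) {
  console.log(`${c.name}: ~${c.avgKB.toFixed(1)} KB per update`);
}

const sizeRatio = avgUpdateKB[0].avgKB / avgUpdateKB[1].avgKB;
console.log(`A's average update is ~${Math.round(sizeRatio)}x larger than B's`);
```

Collection A averages roughly 2 MB per update against roughly 9 KB for Collection B, so far fewer updates still produced far more oplog traffic.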
49JUNE 2014
Fast Growing oplog causes issues
Replication oplog Window – approximate hours available in the primary’s oplog
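A back-of-the-envelope way to think about the window, assuming a steady write rate: divide the oplog's fixed size by how fast it fills. The function name and the sizes below are hypothetical, and MMS derives its chart from the actual oplog contents rather than from this formula.

```javascript
// Rough oplog window estimate: capped size divided by fill rate.
// Hypothetical helper and numbers; assumes a steady write rate.
function oplogWindowHours(oplogSizeBytes, oplogBytesWrittenPerHour) {
  if (oplogBytesWrittenPerHour <= 0) return Infinity;
  return oplogSizeBytes / oplogBytesWrittenPerHour;
}

const GiB = 1024 ** 3;
const windowSmallUpdates = oplogWindowHours(10 * GiB, 0.5 * GiB);
const windowLargeUpdates = oplogWindowHours(10 * GiB, 5 * GiB);
console.log(windowSmallUpdates, windowLargeUpdates); // 20 and 2 hours
```

With the oplog size fixed, a tenfold increase in bytes written per hour cuts the window tenfold, leaving a lagging or offline secondary much less time to catch up before it needs a full resync.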
50JUNE 2014
How We Fixed It
• Changed our schema
• Changed the types of updates that were made to documents
• Both allowed us to utilize atomic operations
• Led to smaller updates
• Smaller updates == less oplog space used
51JUNE 2014
Replication Lag: Today
52JUNE 2014
oplog Window: Today
53JUNE 2014
Keeping Your Deployment
Healthy
54JUNE 2014
MMS Alerts
55JUNE 2014
Watch for Warnings
• Be warned if
• You are running outdated versions
• You have startup warnings
• A mongod is publicly visible
• Pay attention to these warnings
56JUNE 2014
MMS Backups
• Engineered by MongoDB
• Continuous backup with point-in-time recovery
• Fully managed backups
57JUNE 2014
Using MMS Backups
• Seeding new secondaries
• Repairing replica set members
• Development and testing databases
• Restores are free!
58JUNE 2014
Summary
• Know what’s expected and “normal” in your systems
• Know when and what changes in your systems
• Utilize MMS alerts, visualizations and warnings to keep things running smoothly
59JUNE 2014
Questions?
Michael De Lorenzo
CTO, CMP.LY Inc.
michael@cmp.ly
@mikedelorenzo


Editor's Notes

  • #7 Free MongoDB Monitoring - MongoDB-specific metrics, visualization of performance, custom alerting. Backup - industrial strength, point-in-time recovery, free usage tier
  • #8 Developers (our focus today): track bottlenecks. Ops team: great for small teams where your developers are also part of your ops team (DevOps); monitor health of clusters, back up databases, automate updates and add capacity. MongoDB technical service team: helps them help you. Important for us because we maintain a small tech team
  • #10 PRO-TIP: Know what is “normal” for your system. Know what changed when something happens, what you expect normal behavior to be, and what your normal MMS metrics are
  • #15 A readers-writer lock allows concurrent read access to the database, but gives exclusive access to a single write operation. “Writer-greedy”: when both a read and a write are waiting for a lock, MongoDB grants the lock to the write. The exclusivity of write locks is one of the keys to why getting our lock % under control is so important.
  • #17 Lock %: time spent in write lock state; the sum of global lock + hottest database at that time can make the value > 100%. Our issue: the primary database was reporting a write lock % of 150-175%
  • #26 Global lock percentage has remained about the same. Primary client-facing database has seen lock % drop
  • #32 Developed by a MongoDB engineer
  • #45 - Purple bar indicates downtime
  • #55 - Alerts for down hosts, down agents and more
  • #56 - According to Technical Services, in many cases fixing warnings will fix issues