Workload isolation sounds like a good idea. But what does that mean, are you currently doing it, and what are the pitfalls of not doing it (or doing it incorrectly)?
In this practical talk [for all levels] we will look at ways to isolate different workloads from each other, as well as look at some disaster stories (both real-life and hypothetical) that were a result of "doing it wrong."
13. These Stories Are True
or they are based on stories that are true
Based* on real cases filed with MongoDB Support
All names changed
Some details* may have been omitted
* some cases may have been combined and/or embellished to make a point
* just boring ones, not the really embarrassing ones
15. simple typo
replica set
had 45 days in the oplog
but oldest backup was 90+ days
but ... saved the DB files (all of them)
We accidentally remove()ed an entire collection.
Is there a way to undo?
16. recovery not for the faint of heart ... all data was recovered
Conclusion:
Replication ≠ Backups
Do regular backups.
Don't do production operations "ad hoc"
noSQL ≠ noDBA
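One way to make a delete less "ad hoc" is to route it through a small guard. The following is a minimal sketch, not a recommended tool: the function name `guarded_delete` and its `max_docs` and `dry_run` parameters are hypothetical, and it assumes a PyMongo-style collection object exposing `count_documents` and `delete_many`.

```python
def guarded_delete(collection, query, max_docs=1000, dry_run=True):
    """Delete matching documents only if the filter is non-empty and narrow.

    `collection` is assumed to behave like a PyMongo Collection.
    """
    if not query:
        # An empty filter matches every document in the collection.
        raise ValueError("refusing to delete with an empty filter")
    matched = collection.count_documents(query)
    if matched > max_docs:
        raise ValueError(f"filter matches {matched} documents, limit is {max_docs}")
    if dry_run:
        # Report what would be deleted without touching any data.
        return matched
    return collection.delete_many(query).deleted_count
```

Defaulting to `dry_run=True` means the operator has to ask for the destructive step explicitly, after seeing how many documents the filter matches.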
52. Query Scaling Rate Comparison
Number of shards vs. number of queries
If your application sends 10,000 queries:
Shards   Each query targets one shard   Each query targets all shards
         (per shard / system total)     (per shard / system total)
1        10,000 / 10K                   10K / 10K
2        5,000 / 10K                    10K / 20K
5        2,000 / 10K                    10K / 50K
10       1,000 / 10K                    10K / 100K
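The arithmetic behind this comparison can be sketched in a few lines of Python. This is only an illustration, assuming targeted queries spread evenly across shards; the function name `per_shard_load` is hypothetical.

```python
def per_shard_load(total_queries, shards, targeted):
    """Return (queries each shard handles, queries the whole system handles)."""
    if targeted:
        # Each query goes to exactly one shard, so the load is divided.
        return total_queries // shards, total_queries
    # Each query is sent to every shard, so every shard sees every query.
    return total_queries, total_queries * shards


for n in (1, 2, 5, 10):
    print(n,
          per_shard_load(10_000, n, targeted=True),
          per_shard_load(10_000, n, targeted=False))
```

With targeted queries, adding shards lowers the work per shard. With broadcast queries, adding shards only raises the total work the system performs.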
53. Query Scaling Rate Comparison
Number of shards vs. total query capacity
If each shard can process 10,000 queries:
Shards   Each query targets one shard   Each query targets all shards
         (per shard / system total)     (per shard / system total)
1        10K / 10K                      10K / 10K
2        10K / 20K                      10K / 10K
5        10K / 50K                      10K / 10K
10       10K / 100K                     10K / 10K
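Seen from the capacity side, the same reasoning reduces to one multiplication. A minimal sketch, assuming every shard has the same capacity and that a broadcast query consumes one unit of capacity on every shard; `system_capacity` is a hypothetical name.

```python
def system_capacity(per_shard_capacity, shards, targeted):
    """Queries the whole cluster can serve when every shard is saturated."""
    if targeted:
        # Shards work on disjoint queries, so their capacities add up.
        return per_shard_capacity * shards
    # Every query occupies every shard, so the cluster is no faster than one shard.
    return per_shard_capacity


for n in (1, 2, 5, 10):
    print(n, system_capacity(10_000, n, True), system_capacity(10_000, n, False))
```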
62. Operational Testing Built-in
User facing latency
Linear scaling of resources
Most important criteria?
• Generates realistic real-life-scale workload
• compared to Twitter, etc.
• Confirms architecture scales linearly
• without loss of responsiveness
72. Accidentally deleted DB directory
Restored recent backup... but mongod won't start
Backups were not good: taken incorrectly
Unable to start mongod process
73. Unable to start mongod process
Conclusion:
Most Important Part of Backups:
RESTORES
Test your backups
noSQL ≠ noDBA
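A restore test can be partly automated. The sketch below is one possible check, not a complete validation: it assumes a `mongod` has already started successfully on the restored files (the step that failed in this story), and that both arguments behave like PyMongo `Database` objects. The name `compare_restore` is hypothetical.

```python
def compare_restore(source_db, restored_db):
    """Return {collection: (source_count, restored_count)} for every mismatch."""
    names = set(source_db.list_collection_names())
    names |= set(restored_db.list_collection_names())
    mismatches = {}
    for name in sorted(names):
        src = source_db[name].count_documents({})
        dst = restored_db[name].count_documents({})
        if src != dst:
            mismatches[name] = (src, dst)
    return mismatches
```

An empty result only says the document counts agree. On a live source the counts drift after the backup is taken, so a comparison like this is most meaningful against a snapshot or a quiesced system.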
76. January 13:
"DBA" adds a new shard
"DBA" observes that data does not seem to be migrating to new shard
"DBA" sets out to "fix" the "problem"
By "re-sharding" the database/collection in question
Which doesn't work (because it's already sharded)
"Simple" solution: remove the config DB metadata for chunks!
Try resharding again!
Force it!
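Whether data is reaching a new shard can be checked without changing anything. The following is a hedged sketch, assuming chunk documents shaped like those in the `config.chunks` collection, each carrying a `shard` field; the helper name `chunks_per_shard` is hypothetical.

```python
from collections import Counter


def chunks_per_shard(chunk_docs):
    """Count chunks per shard from an iterable of chunk documents (read-only)."""
    return Counter(doc["shard"] for doc in chunk_docs)


# Illustrative data only; real input would come from a read-only query
# against the cluster's config database.
sample = [{"shard": "shard0"}, {"shard": "shard0"}, {"shard": "shard1"}]
print(chunks_per_shard(sample))
```

Running the count a few times over an interval shows whether chunks are moving to the new shard, without touching the metadata itself.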
Sequence of Events
77. How did we help them fix it?
My colleague
The Operation
81. At 4:30 PM on Friday, an alpha page comes in
Two senior support engineers work the ticket till 10 PM
The details:
33-node,
17-terabyte sharded cluster (11 shards, 3-node replica set each),
single data center,
no journaling
no backups
Add to that:
power failure in the data center
no UPS
Result:
unreadable data on every node
noSQL ≠ noDBA
Sequence of Events
85. Dec 29 2013 10:35:00 AM: db.stats() is showing dataSize > fileSize
Dec 29 2013 04:44:00 PM: "there are data files viewed as missing by `mongod`"
Dec 30 2013 02:12:00 AM: Seeing incorrect fileSize on numerous servers
you can see a drop in fileSize on 12/28 in MMS with
no corresponding drop in the other size metrics.
Dec 30 2013 02:19:00 AM: "[do] these databases have anything in common,
especially with the xxxx DB from yesterday?"
Dec 30 2013 02:22:00 AM: Nothing comes to mind ... that DBs have in common
Sequence of Events
Dec 30 2013 02:28:00 AM: We notice this all happening at the same time.
We think something might be deleting data files.
Dec 30 2013 02:31:00 AM: "Something is deleting data files outside mongod?"
86. Dec 30 2013 02:57:00 AM: Yes. We deleted actual db files on both the primaries
and secondaries on the 28th.
87. Dec 28 2013: Someone notices that they are low on disk space and as a solution
writes a shell script that finds every file on every disk on every
server that's bigger than 1GB in size and which hasn't been
accessed in >3 days.
And it then deletes it.
This script ran on every server deleting every database file bigger than 1GB
which hasn't been accessed in the previous few days...
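The selection rule in the Dec 28 cleanup script (large files with an old access time) can be sketched as a dry run that only lists candidates and skips protected directories. This is an illustration under assumptions: the function name `stale_large_files` and its `protected` parameter are hypothetical, and access time is an unreliable signal, since mount options such as `noatime` can prevent it from being updated.

```python
import os
import time


def stale_large_files(root, min_bytes, max_idle_days, protected=()):
    """List files under root larger than min_bytes and not accessed recently.

    Directories named in `protected` (for example a database data directory)
    are never descended into. Nothing is deleted.
    """
    cutoff = time.time() - max_idle_days * 86400
    skip = {os.path.abspath(p) for p in protected}
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune protected directories in place so os.walk skips their subtrees.
        dirnames[:] = [d for d in dirnames
                       if os.path.abspath(os.path.join(dirpath, d)) not in skip]
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            if st.st_size > min_bytes and st.st_atime < cutoff:
                found.append(path)
    return sorted(found)
```

Returning a list for a human to review, rather than deleting inside the loop, keeps the destructive step separate from the search.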
88. Ultimately, NO data was lost.
running `mongod` process keeps the "deleted" file from being removed
running `mongod` can recreate all data files via db.repairDatabase()
BUT... there is no disk space for db.repairDatabase()
Luckily, an extra server or two are "found", allowing a rolling re-sync of a new secondary in each
replica set.
Again: no data was lost. All data was fully and successfully recovered.
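The reason a running process keeps a "deleted" file alive is standard behavior on Linux and other POSIX systems: removing a file only removes its name, and the data stays until the last open handle is closed. A minimal sketch of that behavior follows; the function name `read_after_unlink` is hypothetical, and the code is not expected to work on Windows.

```python
import os
import tempfile


def read_after_unlink(payload: bytes) -> bytes:
    """Write payload to a file, delete the file's name, then read it back."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w+b") as handle:
        handle.write(payload)
        handle.flush()
        os.unlink(path)                    # the directory entry is gone
        assert not os.path.exists(path)
        handle.seek(0)
        return handle.read()               # the open handle still reaches the data


print(read_after_unlink(b"still here"))
```

The same mechanism means the disk space is not released until the process closes the file or exits.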
Guess the Outcome!