Workload Isolation - Asya Kamsky

Workload Isolation...
@asya999 #askAsya - ask me what you're doing wrong
(you might be doing it wrong)

7 Deadly Sins
(of bad MongoDB deployments)
5 Stages of Grief
(of bad MongoDB deployments)
10 Commandments
(of good MongoDB deployments)

Human Error
Power Outages
Fire
Server Room Issues
Unscheduled Updates & Patches

HA
Global Write
Clusters
Scale out
Workload
Isolation

Scalability
Availability
Recoverability
Security

These Stories Are True
or they are based on stories that are true
• Based* on real cases filed with MongoDB Support
• All names changed
• Some details* may have been omitted
* some cases may have been combined and/or embellished to make a point
* just boring ones, not the really embarrassing ones

January 9
The Case of Mistaken Delete

simple typo
replica set
had 45 days in the oplog
but oldest backup was 90+ days
but ... saved the DB files (all of them)
We accidentally remove()ed an entire collection.
Is there a way to undo?

recovery not for the faint of heart ... all data was recovered
Conclusion:
Replication ≠ Backups
Do regular backups.
Don't do production operations "ad hoc"
noSQL ≠ noDBA
We accidentally remove()ed an entire collection.
Is there a way to undo?

Application
Driver
mongod
mongod
mongod
Replica set

Application
Driver
mongod
mongod
mongod
Shard 1

Application
Driver
mongos
mongod
mongod
mongod
mongod
mongod
mongod
Shard 1 Shard 2

Application
Driver
mongos
mongod
mongod
mongod
mongod
mongod
mongod
Shard 1 Shard 2
mongod
mongod
mongod
Shard 3

Application
Driver
mongos mongos•••
mongod
mongod
mongod
mongod
mongod
mongod
mongod
mongod
mongod
mongod
mongod
mongod
•••
Shard 1 Shard 2 Shard 3 Shard N

Application
Driver
mongod
mongod
mongod
mongod
mongod
mongod
mongod
mongod
mongod
mongod
mongod
mongod
•••
Application
Driver
mongos

Application
Driver
mongod
mongod
mongod
mongod
mongod
mongod
mongod
mongod
mongod
mongod
mongod
mongod
•••
Application
Driver
Application
Driver
mongos•••mongos
•••

mongod
mongod
mongod
mongod
mongod
mongod
mongod
mongod
mongod
mongod
mongod
mongod
mongos
•••
DATA
DATA
DATA
DATA
mongos mongosmongos mongos

DATA
DATA
DATA
DATA
mongod
mongod
mongod
mongod
mongod
mongod
mongod
mongod
mongod
mongod
mongod
mongod
mongos
•••
mongos mongosmongos mongos

DATA
DATA
DATA
DATA
mongod
mongod
mongod
mongod
mongod
mongod
mongod
mongod
mongod
mongod
mongod
mongod
app 0
•••
Shard 0 Shard 0 Shard 0 Shard 0
app 2 app 3app 1 app 4

www.etiennemansard.com
Horizontal Scaling

Query Scaling Rate Comparison
Number of shards
vs
Number of queries
Each query
target one shard
Per shard/system
TOTAL
Each query
target all shard
Per shard/system
TOTAL
1 10,000/10K 10K/10K
2 5,000/10K 10K/20K
5 2,000/10K 10K/50K
10 1,000/10K 10K/100K
If your application sends 10,000 queries

Query Scaling Rate Comparison
Number of shards
vs
Total query capacity
Each query
target one shard
Per shard/system
TOTAL
Each query
target all shard
Per shard/system
TOTAL
1 10K/10K 10K/10K
2 10K/20K 10K/10K
5 10K/50K 10K/10K
10 10K/100K 10K/10K
If each shard can process 10,000 queries

Can I prove it?
Can I demonstrate it?

PROOF
• Open Source
• Reference Implementation
• Various Fanout Feed Models
• User Graph Implementation
• Content storage
• Configurable models and options
• Built-in benchmarking
https://github.com/mongodb-labs/socialite

Socialite: Architecture
GraphServiceProxy
ContentProxy

Optimized for Performance
7 16
1 2 5 6 9 12 18 21
www.etiennemansard.com
User
timeline
cache
Schema
Indexing Horizontal Scaling

Operational Testing Built-in
User facing latency
Linear scaling of resources
Most important criteria?
• Generates realistic real-life-scale workload
• compared to Twitter, etc.
• Confirms architecture scales linearly
• without loss of responsiveness

Perfect Candidate to Benchmark

Comparison
Shard Everything Separate Clusters
Scalability
Availability
Reliability
Debug-ability

January 18
The case of deleted database

Accidentally deleted DB directory
Restored recent backup... but mongod won't start
Backups were not good: taken incorrectly
Unable to start mongod process

Conclusion:
Most Important Part of Backups:
RESTORES
Test your backups
noSQL ≠ noDBA

January 13
The case of missing metadata

January 13:
"DBA" adds a new shard
"DBA" observes that data does not seem to be migrating to new shard
"DBA" sets out to "fix" the "problem"
By "re-sharding" the database/collection in question
Which doesn't work (because it's already sharded)
"Simple" solution: remove the config DB metadata for chunks!
Try resharding again!
Force it!
Sequence of Events

How did we help them fix it?
My colleague
The Operation

Result
Conclusion
noSQL ≠ noDBA
Everybody has a test environment...
Some people are lucky enough enough
to have a totally separate
production environment.

May 22
The case of the single data center

At 4:30 PM, Friday alpha page comes in
Two senior support engineers work the ticket till 10 PM
The details:
33 node,
17 terabyte sharded cluster (11 shards, 3 node replica each),
single data center,
no journaling
no backups
Add to that:
power failure in the data center
no UPS
Result:
unreadable data on every node
noSQL ≠ noDBA
Sequence of Events

Bad things happen to good data centers

December 30
The case of disappearing data files

Dec 29 2013 10:35:00 AM: db.stats() is showing dataSize > fileSize
Dec 29 2013 04:44:00 PM: "there are data files viewed as missing by `mongod`"
Dec 30 2013 02:12:00 AM: Seeing incorrect fileSize on numerous servers
you can see a drop in fileSize on 12/28 in MMS with
no corresponding drop in the other size metrics.
Dec 30 2013 02:19:00 AM: "[do] these databases have anything in common,
especially with the xxxx DB from yesterday?
Dec 30 2013 02:22:00 AM: Nothing comes to mind ... that DBs have in common
Sequence of Events
Dec 30 2013 02:28:00 AM: We notice this all happening at the same time.
We think something might be deleting data files.
Dec 30 2013 02:31:00 AM: "Something is deleting data files outside mongod?"

Sequence of Events
Dec 30 2013 02:57:00 AM: Yes. We deleted actual db files on both the primaries
and secondaries on the 28th.

Sequence of Events
Dec 28 2013: Someone notices that they are low on disk space and as a solution
writes a shell script that finds every file on every disk on every
server that's bigger than 1GB in size and which hasn't been
accessed in >3 days.
And it then deletes it.
This script ran on every server deleting every database file bigger than 1GB
which hasn't been accessed in the previous few days...
Dec 30 2013 02:57:00 AM: Yes. We deleted actual db files on both the primaries

Ultimately, NO data was lost.
running `mongod` process keeps the "deleted" file from being removed
running `mongod` can recreate all data files via db.repair()
BUT... there is no disk space for db.repair()
Luckily, an extra server or two are "found" and allow rotating re-sync of
new secondary in each replica set.
Again: no data was lost. All data was fully and successfully recovered.
Guess the Outcome!

THANK YOU
Enjoy The Rest of the Day

Workload Isolation - Asya Kamsky

Recommended

Recommended

More Related Content

Similar to Workload Isolation - Asya Kamsky

Similar to Workload Isolation - Asya Kamsky (20)

More from MongoDB

More from MongoDB (20)

Recently uploaded

Recently uploaded (20)

Workload Isolation - Asya Kamsky