Read this presentation to learn lessons from a real MongoDB Technical support story. You’ll see how three issues impacted the performance of a high-volume retail web application.
Learn how we diagnosed a sub-optimal data model (schema), an incorrect storage setting, and an under-tested upgrade to help the customer scale their application.
Fixing Sub-optimal Performance in a Retail Application
1. Ger Hartnett
Director of Technical Services (EMEA), MongoDB @ghartnett #MongoDB
Tales from the Field
Part two: Fixing Sub-optimal Performance in
a Retail Application
3. ●The main talk should take 30-35 minutes
●You can submit questions via the chat box
●We’ll answer as many as possible at the end
●We will send the slides and recording
tomorrow via email
●The final webinar in the series will take place
on Thursday 21st April – 14:00 BST | 15:00
CEST
Before we start
4. ●You work in operations
●You work in development
●You have a MongoDB system in production
●You have contacted MongoDB Technical
Services (support)
●You attended the last webinar (part 1)
A quick poll - add a word to the
chat to let me know your
perspective
5. ●We collect observations about common
mistakes to share the experience of many customers
●Names have been changed to protect the
(mostly) innocent
●No animals were harmed during the making
of this presentation (but maybe some DBAs
and engineers had light emotional scarring)
●While you might be new to MongoDB, we
have deep experience that you can leverage
Stories
6. 1. Discovering a DR flaw during a data
centre outage
2. Complex documents, memory and
an upgrade “surprise”
3. Wild success “uncovers” the wrong
shard key
The Stories (part two today)
8. Story #1: Recovering from a
disaster
●Prospect in the process of signing up for a
subscription
●Called us late on a Friday: a data centre power
outage had left 30+ servers (11 shards) down
●When they started bringing up the first
shard, the nodes crashed with data
corruption
●17TB of data, very little free disk space,
JOURNALLING DISABLED!
9. Recovering each shard
1. Start secondary read only
2. Mount NFS storage for repair
3. Repair former primary node
4. Iterative rsync to seed a secondary
(Diagram: replica set with one primary and two secondaries)
10. Key takeaways for you
●If you are departing significantly from
standard config, check with us (e.g. if you
think journalling is a bad idea)
●Use two DCs in different buildings on different
flood plains, not in the path of the same
storm (e.g. secondaries in AWS)
●DR/backups are useless if you haven’t
tested them
11. Story #2: Complex documents,
memory and an upgrade
“surprise”
●Well established ecommerce site selling
diverse goods in 20+ countries
●After switching to WiredTiger in production,
performance dropped – the opposite of
what they were expecting
12. {
_id: 375,
en_US : { name : ..., description : ..., <etc...> },
en_GB : { name : ..., description : ..., <etc...> },
fr_FR : { name : ..., description : ..., <etc...> },
de_DE : ...,
de_CH : ...,
<... and so on for other locales... >
inventory: 423
}
Product Catalog: Original
Schema
13. What’s good about this schema?
● Each document contains all the data about a given
product, across all languages/locales
● Very efficient way to retrieve the English, French,
German, etc. translations of a single product’s
information in one query
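For example, a single findOne() on this schema returns every locale's translation at once (the collection name db.catalog is an assumption for illustration):

db.catalog.findOne({ _id: 375 })  // returns en_US, en_GB, fr_FR, de_DE, ... in one round trip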
14. However…
That is not how the product data is
actually used
(except perhaps by translation staff)
17. Consequences
●WiredTiger reads/rewrites the whole document
●Each document contained ~20x more data than
any common use case needed
●MongoDB lets you request just a subset of a
document’s contents (using a projection), but…
o Typically the whole document is still loaded into RAM
●There are other overheads (like readahead)
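As a rough sketch of the projection point above (again assuming a db.catalog collection), the client can ask for just one locale, but the storage engine still reads and caches the entire document:

// Only the en_GB subdocument comes back over the wire...
db.catalog.findOne({ _id: 375 }, { en_GB: 1 })
// ...but the whole multi-locale document is typically loaded into RAM first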
18. { _id: 42,
en_US : { name : ..., description : ..., <etc...> },
en_GB : { name : ..., description : ..., <etc...> },
fr_FR : { name : ..., description : ..., <etc...> },
de_DE : ...,
de_CH : ...,
<... and so on for other locales... > }
<READAHEAD OVERHEAD>
{ _id: 709,
en_US : { name : ..., description : ..., <etc...> },
en_GB : { name : ..., description : ..., <etc...> },
fr_FR : { name : ..., description : ..., <etc...> },
de_DE : ...,
de_CH : ...,
<... and so on for other locales... > }
<READAHEAD OVERHEAD>
{ _id: 3600,
en_US : { name : ..., description : ..., <etc...> },
en_GB : { name : ..., description : ..., <etc...> },
fr_FR : { name : ..., description : ..., <etc...> },
de_DE : ...,
de_CH : ...,
<... and so on for other locales... > }
Visualising the read problem
- Data in RED are loaded into RAM
and used.
- Data in BLUE take up memory but
are not required.
- Readahead padding in GREEN
makes things even more inefficient
20. What did we recommend?
● Design for your use case, your most common query
pattern
o In this case: 99.99% of queries want the product
data for exactly one locale at a time
o Move the frequently changing fields to a new
collection
● Eliminate inefficiencies on the system
o Make reading from disk less wasteful, maximise I/O
capabilities by reducing readahead
21. { _id: "375-en_US",
name : ..., description : ..., <etc...> }
{ _id: "375-en_GB",
name : ..., description : ..., <etc...> }
{ _id: "375-fr_FR",
name : ..., description : ..., <etc...> }
... and so on for other locales ...
db.inventory
{ _id: "375", count : NumberLong(1234), <etc...> }
Product Catalog: Eventual
Schema
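With this split, the hot paths become small, targeted operations - a sketch assuming the per-locale documents live in a db.catalog collection (db.inventory is taken from the slide):

// Fetch exactly one locale's product data
db.catalog.findOne({ _id: "375-en_GB" })

// Adjust the frequently changing counter without rewriting any catalog document
db.inventory.update({ _id: "375" }, { $inc: { count: NumberLong(-1) } })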
22. Aftermath & lessons learned
●Faster updates
●Queries induced minimal overhead
●More than 20x as many distinct products fit in
memory at once
●Disk I/O utilization reduced
●UI latency decreased
23. Key Takeaways
●When doing a major version/storage-engine
upgrade, test in staging with some
proportion of production data/workload
●Sometimes putting everything into one
document is counter productive
25. Story #3: Wild success uncovers
the wrong shard key
●Started out as error “[Balancer] caught
exception … tag ranges not valid for: db.coll”
●11 shards, they had added 2 new shards to
keep up with traffic - 400+ databases
●Lots of code changes ahead of the
Super Bowl
●Spotted slow (300+ second) queries, and decided to
build some indexes without telling us
●Production went down
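The slide doesn't say how those indexes were built, but in MongoDB versions of that era a default (foreground) index build blocked operations on the database for its duration, so an unannounced build on a busy cluster could easily stall production. A sketch of the background option that was the usual workaround (collection and field names here are hypothetical):

db.orders.createIndex({ customerId: 1, createdAt: -1 }, { background: true })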
28. ●You can submit questions via the chat box
●We will send the slides and recording
tomorrow via email
●Part 3: the next webinar will take place on
Thursday 21st April – 14:00 BST | 15:00
CEST
www.mongodb.com/webinars
Questions
Some borrowed, some merged into a single narrative
Some of the people that inspired them may well be here in this room today
Bill's Bulk Updates randomly affected an ever larger data set.
In order to cope with the database size, Bill added more shards.
The cluster scaled linearly, as intended.
Well, it might fix things, but it’s expensive and the real problem is the efficiency