Talk at the Open Data Science Conference East 2020.
Abstract:
Database indexes can make or break the performance of a database. Efficient indexes need to be tailored to the specific queries that are sent to a database, but since query patterns can vary a lot and change over time, it is often a painful process to manually manage indexes. In this session, I will talk about our data-driven approach to automatically estimate optimal indexes from log data of our MongoDB databases. You will learn how we use Google Cloud Functions to stream log data from Stackdriver to Google BigQuery, how we use BigQuery to scale our data analysis, and how we use Python’s Jupyter Notebooks to visualize and monitor our results.
Managing Database Indexes: A Data-Driven Approach - Amadeus Magrabi
1. Managing Database Indexes: A Data-Driven Approach
Amadeus Magrabi
Lead Data Scientist, commercetools
Open Data Science Conference East - April 17, 2020
All Rights Reserved @2020 1
2. Plan for today
• Background:
o Me
o Company: commercetools
• Data-Driven Approach to Managing Database Indexes:
o Database Indexes
o MongoDB
o Problem: Managing database indexes manually is painful
o Solution: Data pipeline to automate index management
• Questions
3. Topics I will cover
• Database Indexes
• MongoDB
• Data Science Project Management
• Deep Learning
• Machine Learning
• Data Engineering
• Google Cloud Platform
4. Me
• Studied cognitive science
• Research in neuroscience/machine learning for 3 years
• Working in data science for 4 years
• Based in Berlin
6. Data Science @ commercetools: Image Similarity Search
Figure: Search Image with Prediction 1, Prediction 2, Prediction 3
7. Data Science @ commercetools: Category Recommendations
8. Data Science @ commercetools
Team structure:
• “Vertical” team developing microservices
• Data scientists and software engineers
Team output:
• For merchants:
o APIs that make it easier to manage data and improve data quality.
• For consumers:
o APIs that enable machine learning features like image search.
• For colleagues:
o Make internal company processes more data-driven, more efficient and more accurate.
10. What is the Problem?
• commercetools stores e-commerce data in MongoDB databases.
• Our databases need to support flexible queries on a large scale.
• Databases need good database indexes to perform well.
• Managing indexes manually is very hard on a large scale.
→ Need for a data-driven approach to automate index management.
15. Database Indexes
• Indexes make it faster to read data from a database.
• Analogy: Finding a specific topic in a book with an index.
• Indexes contain a sorted subset of the data, with pointers to the full data.
• Disadvantage: Changing data is slower and indexes require space in memory.
Database for Products (Storage: On Disk)
id   name          price  category  ...
1    Jeans XXL     60     pants     ...
2    T-Shirt Red   30     t-shirts  ...
3    Suit Black    250    suits     ...
4    T-Shirt Blue  30     t-shirts  ...
...  ...           ...    ...       ...
Index (Storage: In Memory)
price
30
60
250
Example Query: Find products with prices > 50
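The idea of a sorted index with pointers back to the full data can be illustrated with a minimal Python sketch. It assumes a toy in-memory list standing in for the on-disk product table. The names `products`, `build_price_index`, and `find_price_greater_than` are made up for this illustration and are not part of any database API.

```python
from bisect import bisect_right

# Toy table standing in for the on-disk product data (rows from the slide).
products = [
    {"id": 1, "name": "Jeans XXL", "price": 60, "category": "pants"},
    {"id": 2, "name": "T-Shirt Red", "price": 30, "category": "t-shirts"},
    {"id": 3, "name": "Suit Black", "price": 250, "category": "suits"},
    {"id": 4, "name": "T-Shirt Blue", "price": 30, "category": "t-shirts"},
]


def build_price_index(rows):
    """Return the sorted prices plus, for each price, the position of its row."""
    pairs = sorted((row["price"], pos) for pos, row in enumerate(rows))
    prices = [price for price, _ in pairs]
    positions = [pos for _, pos in pairs]
    return prices, positions


def find_price_greater_than(rows, index, threshold):
    """Binary-search the sorted prices, then follow the pointers to the rows."""
    prices, positions = index
    start = bisect_right(prices, threshold)  # first entry with price > threshold
    return [rows[pos] for pos in positions[start:]]


price_index = build_price_index(products)
matches = find_price_greater_than(products, price_index, 50)
print([row["name"] for row in matches])  # ['Jeans XXL', 'Suit Black']
```

The query touches only the index entries above the threshold instead of scanning every row. The cost is that every insert or price change would also have to update `price_index`, which is the disadvantage named on the slide.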
17. MongoDB
• Database to store documents in JSON format.
• Dynamic schema, easy to adapt to changing requirements.
• Open source
Database: Example Store
• Collection: Products (Documents)
• Collection: Orders (Documents)
• ...
Document (Fields and Values):
{
  "name": "T-Shirt Blue",
  "price": 30,
  "isSoldOut": false,
  "discounts": null,
  "categories": [
    "t-shirts",
    "summer"
  ]
}
PyMongo API
18. MongoDB Index Types
• Single Field Index
• Compound Index
o Can be used for queries that filter on multiple fields
o The order of the fields matters: queries that filter on a prefix of the index fields can also use the index
• Sparse Index
o Excludes documents that do not have the indexed field
o Useful to reduce index size when documents have missing values
• ...
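The prefix rule for compound indexes and the effect of a sparse index can be sketched in plain Python. This is a simplified model for illustration, not MongoDB's query planner. The helper names `usable_prefix` and `sparse_index_entries` and the field names are assumptions.

```python
def usable_prefix(index_fields, query_fields):
    """Return the leading index fields that the query filters on."""
    prefix = []
    for field in index_fields:
        if field not in query_fields:
            break  # the index is only usable up to the first missing field
        prefix.append(field)
    return prefix


def sparse_index_entries(documents, field):
    """Positions of documents that contain the field; all others are left out."""
    return [pos for pos, doc in enumerate(documents) if field in doc]


# Hypothetical compound index on (category, price, name).
compound_index = ["category", "price", "name"]
print(usable_prefix(compound_index, {"category", "price"}))  # ['category', 'price']
print(usable_prefix(compound_index, {"price"}))              # []

docs = [{"name": "A", "discounts": 5}, {"name": "B"}, {"name": "C", "discounts": None}]
print(sparse_index_entries(docs, "discounts"))  # [0, 2]

# With PyMongo, the corresponding index definitions would look roughly like:
#   collection.create_index([("category", 1), ("price", 1), ("name", 1)])
#   collection.create_index("discounts", sparse=True)
```

A query on `price` alone gets no help from the compound index because `price` is not its first field, which is why the field order has to follow the actual query patterns.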
20. When should an Index be added?
• When databases are large and response times are important
• When the index matches queries that occur frequently
• When data is read more often than written
• When there is enough space in memory
• When an index can significantly reduce the search space
o High cardinality (id-like fields)
o Low-cardinality indexes can still be useful when most queries filter for field values that are rare
o Example: 100k products, of which 95k are t-shirts and 5k are pants
o An index on the category field reduces the search space significantly if users mostly search for pants, but not if they search for t-shirts
• ...
Database for Products
id   name          price  category  ...
1    Jeans XXL     60     pants     ...
2    T-Shirt Red   30     t-shirts  ...
3    Suit Black    250    suits     ...
4    T-Shirt Blue  30     t-shirts  ...
...  ...           ...    ...       ...
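The t-shirts/pants example can be worked through numerically. The sketch below computes, for one field value, the fraction of documents an index would let a query skip. The function name `search_space_reduction` is an assumption made for this illustration.

```python
def search_space_reduction(value_counts, queried_value):
    """Fraction of documents a query can skip when the field is indexed."""
    total = sum(value_counts.values())
    matching = value_counts.get(queried_value, 0)
    return 1 - matching / total


# Category distribution from the slide: 100k products in total.
category_counts = {"t-shirts": 95_000, "pants": 5_000}

pants_reduction = search_space_reduction(category_counts, "pants")
tshirt_reduction = search_space_reduction(category_counts, "t-shirts")
print(f"pants: {pants_reduction:.0%}, t-shirts: {tshirt_reduction:.0%}")
```

A query for pants only has to look at 5k of 100k documents, so 95% of the collection is skipped. A query for t-shirts still has to look at 95k documents, so the same index saves only 5%.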
21. MongoDB Indexes
Common problems:
• It can be hard to predict how a database will be used.
• Query patterns can change over time.
→ Indexes on important fields are often missing, making read queries slow.
→ Indexes are set on fields that are not used, making write queries slow and wasting memory.
22. What is the Problem?
• commercetools stores e-commerce data in MongoDB databases.
• Our databases need to support flexible queries on a large scale.
• Databases need good indexes to perform well.
• Managing indexes manually is very hard on a large scale.
→ Goal: Build a recommendation engine for database indexes.
→ First internal project of our data science team.
Timeline: Manual Index Management → Semi-Automatic Index Management (we are here) → Full Automation
23. Analysis Pipeline
• Google Stackdriver: Logs of slow MongoDB queries
25. Analysis Pipeline
• Google Stackdriver: Logs of slow MongoDB queries
• Google Cloud Storage: Data lake to store logs
o New JSON file created every hour
26. Analysis Pipeline
• Google Stackdriver: Logs of slow MongoDB queries
• Google Cloud Storage: Data lake to store logs
• Google Cloud Functions: Parse logs
o Define functions and requirements
o Function triggered when new file is created in storage bucket
o Deploy function
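The log-parsing step could look roughly like the sketch below. It is not the function from the talk. It assumes that each hourly file holds one JSON log entry per line with `textPayload` and `timestamp` fields, and that the payload is a MongoDB slow-query line containing fragments such as `planSummary:` and `docsExamined:`. The sample line is made up. `handle_new_file` mirrors the `(event, context)` shape of a storage-triggered function, but the `read_text` and `write_rows` parameters are hypothetical stand-ins for the Cloud Storage and BigQuery clients so that the sketch runs without any cloud libraries.

```python
import json
import re

# Assumed fragments of a MongoDB slow-query log line; real lines vary by version.
FIELD_PATTERNS = {
    "namespace": re.compile(r"command (\S+) command:"),
    "plan_summary": re.compile(r"planSummary: (\S+)"),
    "docs_examined": re.compile(r"docsExamined:(\d+)"),
    "n_returned": re.compile(r"nreturned:(\d+)"),
    "duration_ms": re.compile(r"(\d+)ms$"),
}
INTEGER_FIELDS = {"docs_examined", "n_returned", "duration_ms"}

# Made-up example line in the assumed format.
SAMPLE_LINE = (
    '2020-04-01T10:00:00.000+0000 I COMMAND [conn1] command shop.products '
    'command: find { find: "products", filter: { price: { $gt: 50 } } } '
    'planSummary: COLLSCAN keysExamined:0 docsExamined:100000 nreturned:5 1234ms'
)


def parse_line(line):
    """Extract the fields above from one log line, or return None if any is missing."""
    row = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(line.strip())
        if match is None:
            return None
        value = match.group(1)
        row[name] = int(value) if name in INTEGER_FIELDS else value
    return row


def parse_log_file(text):
    """Parse newline-delimited JSON log entries into flat rows."""
    rows = []
    for raw in text.splitlines():
        if not raw.strip():
            continue
        entry = json.loads(raw)
        row = parse_line(entry.get("textPayload", ""))
        if row is not None:
            row["timestamp"] = entry.get("timestamp")
            rows.append(row)
    return rows


def handle_new_file(event, context, read_text, write_rows):
    """Entry point shape for a storage-triggered function (dependencies injected)."""
    text = read_text(event["bucket"], event["name"])
    rows = parse_log_file(text)
    write_rows(rows)  # in a deployed function this would load rows into BigQuery
    return len(rows)


# Local dry run with an in-memory "file" and a list as the sink.
demo_file = json.dumps({"timestamp": "2020-04-01T10:00:00Z", "textPayload": SAMPLE_LINE})
written = []
handle_new_file({"bucket": "logs", "name": "hour.json"}, None,
                lambda bucket, name: demo_file, written.extend)
print(written)
```

Keeping the parsing in pure functions means it can be tested locally before the function is deployed and wired to the storage bucket trigger.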
27. Analysis Pipeline
• Google Stackdriver: Logs of slow MongoDB queries
• Google Cloud Storage: Data lake to store logs
• Google Cloud Functions: Parse logs
• Google BigQuery: Data warehouse to enable analytics on logs
28. Analysis Pipeline
Google Stackdriver → Google Cloud Storage → Google Cloud Functions → Google BigQuery
• Analysis to generate index recommendations from logs
How do we rank the importance of an index recommendation?
1. Frequency of slow logs including the field
2. Free memory
3. Cardinality
4. Average search space reduction: If an index had been set, how many documents could have been skipped (in queries of the last week)?
o This is also used to order fields within compound indexes, so that the fields with the highest search space reduction come first
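Two of these criteria, the frequency of a field in the slow logs and its average search space reduction, can be sketched as follows. The input records, the field names (including `sku`), and the way the reduction is estimated from a hypothetical `matching_docs` count are assumptions for illustration. The talk does not specify how the criteria are combined into a single ranking, so the sketch only computes them and orders the fields of a compound index.

```python
from collections import defaultdict

# Hypothetical records derived from one week of slow-query logs: one entry per
# (query, filtered field), with the number of documents matching that field.
observations = [
    {"field": "category", "collection_size": 100_000, "matching_docs": 5_000},
    {"field": "category", "collection_size": 100_000, "matching_docs": 95_000},
    {"field": "price", "collection_size": 100_000, "matching_docs": 20_000},
    {"field": "sku", "collection_size": 100_000, "matching_docs": 1},
]


def field_statistics(records):
    """Per field: how often it appears and how much an index would have skipped."""
    reductions = defaultdict(list)
    for record in records:
        skipped = 1 - record["matching_docs"] / record["collection_size"]
        reductions[record["field"]].append(skipped)
    return {
        field: {"frequency": len(values), "avg_reduction": sum(values) / len(values)}
        for field, values in reductions.items()
    }


def order_compound_fields(fields, stats):
    """Put the fields with the highest average search space reduction first."""
    return sorted(fields, key=lambda f: stats[f]["avg_reduction"], reverse=True)


stats = field_statistics(observations)
compound_order = order_compound_fields(["category", "price", "sku"], stats)
print(compound_order)  # ['sku', 'price', 'category']
```

In this made-up data the id-like `sku` field comes first because an index on it would skip almost the whole collection, while `category` comes last because half of its observed queries ask for the most common value.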
29. Analysis Pipeline
Google Stackdriver → Google Cloud Storage → Google Cloud Functions → Google BigQuery
• Analysis to generate index recommendations from logs
o Monitoring index performance
Figure: monitoring plot, marked with the time when the index was set
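One simple way to monitor the effect of an index is to compare how often slow queries are logged before and after the index was set. The sketch below is an assumption about what such a check could look like, not the metric used in the talk. The function name, the 7-day window, and the timestamps are made up.

```python
from datetime import datetime, timedelta


def daily_slow_query_rate(timestamps, index_set_at, window_days=7):
    """Average slow queries per day in equal windows before and after the change."""
    window = timedelta(days=window_days)
    before = sum(1 for t in timestamps if index_set_at - window <= t < index_set_at)
    after = sum(1 for t in timestamps if index_set_at <= t < index_set_at + window)
    return before / window_days, after / window_days


# Made-up slow-query timestamps: 14 before the index was set, 2 after.
index_set_at = datetime(2020, 3, 10)
slow_query_times = (
    [datetime(2020, 3, 5) + timedelta(hours=6 * i) for i in range(14)]
    + [datetime(2020, 3, 12), datetime(2020, 3, 14)]
)

rate_before, rate_after = daily_slow_query_rate(slow_query_times, index_set_at)
print(f"before: {rate_before:.2f}/day, after: {rate_after:.2f}/day")
```

A clear drop after the marked time suggests the index is doing its job. No change would be a signal to revisit the recommendation.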
30. Analysis Pipeline
Google Stackdriver → Google Cloud Storage → Google Cloud Functions → Google BigQuery
• Analysis to generate index recommendations from logs
o Convert notebook to HTML to share
o Include details and make recommendations explainable
31. Analysis Pipeline
Google Stackdriver → Google Cloud Storage → Google Cloud Functions → Google BigQuery
• Analysis to generate index recommendations from logs
• Let engineers review index recommendations
32. Analysis Pipeline
Google Stackdriver → Google Cloud Storage → Google Cloud Functions → Google BigQuery
• Analysis to generate index recommendations from logs
o Run analysis as weekly cron job
• Let engineers review index recommendations
33. Analysis Pipeline
Google Stackdriver → Google Cloud Storage → Google Cloud Functions → Google BigQuery
• Analysis to generate index recommendations from logs
o Run analysis as weekly cron job
• Let engineers review index recommendations
• Add index
• Store history of changed indexes
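The last two steps, adding a reviewed index and recording the change, could be sketched as below. The shape of the recommendation and history records, the function name `apply_recommendation`, and the reviewer name are assumptions. `create_index` is called the way PyMongo's `Collection.create_index` is used, but the sketch runs against a small stand-in class so that no MongoDB server is needed.

```python
from datetime import datetime, timezone


class FakeCollection:
    """Stand-in for a PyMongo collection so the sketch runs without a server."""

    def create_index(self, keys, **kwargs):
        # Mimics MongoDB's default index naming, e.g. "category_1_price_1".
        return "_".join(f"{field}_{direction}" for field, direction in keys)


def apply_recommendation(collection, recommendation, history, approved_by):
    """Create the recommended index and append a record to the change history."""
    keys = [(field, 1) for field in recommendation["fields"]]  # 1 = ascending
    index_name = collection.create_index(
        keys, sparse=recommendation.get("sparse", False)
    )
    history.append({
        "collection": recommendation["collection"],
        "fields": list(recommendation["fields"]),
        "index_name": index_name,
        "approved_by": approved_by,
        "changed_at": datetime.now(timezone.utc).isoformat(),
    })
    return index_name


index_history = []
created_name = apply_recommendation(
    FakeCollection(),
    {"collection": "products", "fields": ["category", "price"]},
    index_history,
    approved_by="reviewer",  # hypothetical engineer who reviewed the recommendation
)
print(created_name, index_history[0]["changed_at"])
```

Storing who approved which index and when makes it possible to line up later performance changes with specific index changes.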
34. Future Plans
• Improve index recommendations to make the review step unnecessary and come closer to the goal of full automation.
• Use machine learning to improve accuracy:
o Create a training dataset to train a model to predict optimal indexes from query patterns.
o Maybe experiment with machine-learning-based indexes.
• Open source
35. www.commercetools.com
Office Munich
Adams-Lehmann-Str. 44
80797 Munich
+49 89 99 82 996-0
Office Berlin
Sonnenallee 223
12057 Berlin
+49 30 67 24 21-20
Durham, NC
318 Blackwell St Suite 240
Durham, NC 27701
+1 212 220 3809
Thanks for listening!
Any questions?
Connect:
● twitter.com/AmadeusMagrabi
● linkedin.com/in/amadeusmagrabi
● medium.com/@amadeus.magrabi