Managing Database Indexes:
A Data-Driven Approach
Amadeus Magrabi
Lead Data Scientist, commercetools
Open Data Science Conference East - April 17, 2020
All Rights Reserved @2020 1
Plan for today
• Background:
o Me
o Company: commercetools
• Data-Driven Approach to Managing Database Indexes:
o Database Indexes
o MongoDB
o Problem: Managing database indexes manually is painful
o Solution: Data pipeline to automate index management
• Questions
Topics I will cover
• Database Indexes
• MongoDB
• Data Science
• Project Management
• Deep Learning
• Machine Learning
• Data Engineering
• Google Cloud Platform
Me
• Studied cognitive science
• Research in neuroscience/machine learning for 3 years
• Working in data science for 4 years
• Based in Berlin
Company: commercetools
• Offers e-commerce software via cloud-based APIs
• Founded in 2006
• 200+ employees
• Offices in Munich (HQ), Berlin, Durham (US), Amsterdam, London, Singapore
Example Request:
$ curl https://api.europe-west1.gcp.commercetools.com/example-store/products/e7ba4c75-b1bb-483d-2c4a10f78472
Example Response:
{
  "id": "e7ba4c75-b1bb-483d-2c4a10f78472",
  "name": "Awesome flip-flops | Size 42 | Limited edition!",
  "images": [
    {"url": "https://www.example-store.com/shoe1_front.jpg"},
    {"url": "https://www.example-store.com/shoe1_side.jpg"}
  ],
  "prices": [
    {"centAmount": 4000, "currency": "EUR"},
    {"centAmount": 4350, "currency": "USD"}
  ],
  "categories": [
    {"name": "shoes", "id": "e7ba4c75-b1bb-483d-2c4a10f78473"},
    {"name": "summer", "id": "e7ba4c75-b1bb-483d-2c4a10f78473"}
  ]
}
Data Science @ commercetools:
Image Similarity Search
[Figure: a search image next to the images returned as Prediction 1, Prediction 2 and Prediction 3]
Data Science @ commercetools:
Category Recommendations
Data Science @ commercetools
Team structure:
• “Vertical” team developing microservices
• Data scientists and software engineers
Team output:
• For merchants:
o APIs that make it easier to manage data and improve data quality.
• For consumers:
o APIs that enable machine learning features like image search.
• For colleagues:
o Make internal company processes more data-driven, more efficient and more accurate.
Managing Database Indexes
in a Data-Driven Way
What is the Problem?
• commercetools stores e-commerce data in MongoDB databases.
• Our databases need to support flexible queries on a large scale.
• Databases need good database indexes to perform well.
• Managing indexes manually is very hard on a large scale.
→ Need for a data-driven approach to automate index management.
Database Indexes
• Indexes make it faster to read data from a database.
• Analogy: Finding a specific topic in a book with an index.
• Indexes contain a sorted subset of the data, with pointers to the full data.
• Disadvantage: Changing data is slower and indexes require space in memory.
Database for Products (storage: on disk)
id   name           price   category   ...
1    Jeans XXL      60      pants      ...
2    T-Shirt Red    30      t-shirts   ...
3    Suit Black     250     suits      ...
4    T-Shirt Blue   30      t-shirts   ...
...  ...            ...     ...        ...
Index on price (storage: in memory)
price
30
60
250
Example Query: Find products with prices > 50
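A minimal Python sketch of the idea on this slide, assuming a sorted in-memory list of (price, id) pairs as a stand-in for a real index structure (databases typically use B-trees). The function names are illustrative and not taken from the talk.

```python
from bisect import bisect_right, insort

# Product rows as stored "on disk", keyed by id (data from the example table).
products = {
    1: {"name": "Jeans XXL", "price": 60, "category": "pants"},
    2: {"name": "T-Shirt Red", "price": 30, "category": "t-shirts"},
    3: {"name": "Suit Black", "price": 250, "category": "suits"},
    4: {"name": "T-Shirt Blue", "price": 30, "category": "t-shirts"},
}

# The index: (price, id) pairs kept sorted by price; each id points to a full row.
price_index = sorted((row["price"], pid) for pid, row in products.items())


def find_price_greater_than(threshold):
    """Return names of products with price > threshold, using the index."""
    # Binary search for the first entry whose price exceeds the threshold.
    # (threshold, inf) sorts after every real entry with that exact price.
    start = bisect_right(price_index, (threshold, float("inf")))
    return [products[pid]["name"] for _, pid in price_index[start:]]


def add_product(pid, row):
    """Writes cost more: the index has to be updated along with the data."""
    products[pid] = row
    insort(price_index, (row["price"], pid))


print(find_price_greater_than(50))
```

The read path touches only the index entries above the threshold instead of every row, while every write pays for one extra sorted insert, which mirrors the trade-off named in the last bullet.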
MongoDB
• Database that stores documents in a JSON-like format.
• Dynamic schema, easy to adapt to changing requirements.
• Open source
• Accessible from Python via the PyMongo API
Structure:
Database: Example Store
  Collection: Products
    Documents
  Collection: Orders
    Documents
  ...
Example document (fields on the left, values on the right):
{
  "name": "T-Shirt Blue",
  "price": 30,
  "isSoldOut": false,
  "discounts": null,
  "categories": [
    "t-shirts",
    "summer"
  ]
}
MongoDB Index Types
• Single Field Index
• Compound Index
o Can be used for queries that filter on multiple fields
o The order of the fields matters; queries on a prefix of the index fields can also use the index
• Sparse Index
o Excludes documents that do not have a value for the indexed field
o Useful to reduce index size when documents have missing values
• ...
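The prefix rule for compound indexes can be sketched in a few lines of Python. This is a simplified model, not MongoDB's query planner: it only looks at which fields a query filters on, and the helper name `usable_prefix` is hypothetical. The PyMongo calls in the comments show how such indexes could be created.

```python
def usable_prefix(index_fields, query_fields):
    """Return the leading index fields that a query's filters can use.

    Simplified model: the index helps only for an unbroken run of fields
    starting at the first index field.
    """
    prefix = []
    for field in index_fields:
        if field not in query_fields:
            break
        prefix.append(field)
    return prefix


# Hypothetical compound index on (category, price). With PyMongo it could be
# created with:
#   collection.create_index([("category", 1), ("price", 1)])
# and a sparse single-field index with:
#   collection.create_index([("discounts", 1)], sparse=True)
index = ["category", "price"]

print(usable_prefix(index, {"category", "price"}))  # both index fields are usable
print(usable_prefix(index, {"category"}))           # the first field alone is a prefix
print(usable_prefix(index, {"price"}))              # no prefix matches
```

A query that filters only on price cannot use the prefix of this index, which is why the order of the fields matters.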
When should an Index be added?
• When databases are large and response times are important
• When the index matches queries that occur frequently
• When data is read more often than it is written
• When there is enough space in memory
• When an index can significantly reduce the search space
o High cardinality (id-like fields)
o Low-cardinality indexes can still be useful when most queries filter for field values that are rare
o Example: 100k products, of which 95k are t-shirts and 5k are pants. An index on the category field reduces the search space significantly if users mostly search for pants, but not if they search for t-shirts.
• ...
Database for Products
id   name           price   category   ...
1    Jeans XXL      60      pants      ...
2    T-Shirt Red    30      t-shirts   ...
3    Suit Black     250     suits      ...
4    T-Shirt Blue   30      t-shirts   ...
...  ...            ...     ...        ...
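The t-shirts/pants example can be worked through numerically. The sketch below assumes that an index lets a query skip every document whose value does not match the filter. Both function names and the 90/10 query mix are made up for illustration.

```python
def search_space_reduction(value_counts, queried_value):
    """Fraction of documents an index on the field lets one query skip."""
    total = sum(value_counts.values())
    matching = value_counts.get(queried_value, 0)
    return 1 - matching / total


def average_reduction(value_counts, query_counts):
    """Average reduction, weighted by how often each value is queried."""
    total_queries = sum(query_counts.values())
    weighted = sum(
        n * search_space_reduction(value_counts, value)
        for value, n in query_counts.items()
    )
    return weighted / total_queries


# Numbers from the slide: 100k products, 95k t-shirts and 5k pants.
category_counts = {"t-shirts": 95_000, "pants": 5_000}

print(search_space_reduction(category_counts, "pants"))
print(search_space_reduction(category_counts, "t-shirts"))

# Hypothetical query mix: 90 searches for pants, 10 for t-shirts.
print(average_reduction(category_counts, {"pants": 90, "t-shirts": 10}))
```

A search for pants skips 95% of the documents, a search for t-shirts only 5%, so the value of the index depends on the query mix and not only on the cardinality of the field.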
MongoDB Indexes
Common problems:
• It can be hard to predict how a database will be used.
• Query patterns can change over time.
→ Indexes on important fields are often missing, making read queries slow.
→ Indexes are set on fields that are not used, making write queries slow and wasting memory.
What is the Problem?
• commercetools stores e-commerce data in MongoDB databases.
• Our databases need to support flexible queries on a large scale.
• Databases need good indexes to perform well.
• Managing indexes manually is very hard on a large scale.
→ Goal: Build a recommendation engine for database indexes.
→ First internal project of our data science team.
Timeline:
Manual Index Management → Semi-Automatic Index Management (we are here) → Full Automation
Analysis Pipeline
• Google Stackdriver: Logs of slow MongoDB queries
Analysis Pipeline
Example of a slow MongoDB query logged in Google Stackdriver:
2020-02-12T16:13:06.591+0000 I COMMAND [conn2282323] command example-store.customers command:
find { find: "customers", filter: { custom.fields.customerReference: "1234-abcd-5678-efgh" }, limit: 20, batchSize:
2147483647, maxTimeMS: 70000 } planSummary: COLLSCAN keysExamined:0 docsExamined:1447113
cursorExhausted:1 numYields:11314 nreturned:1 reslen:1265 locks:{ Global: { acquireCount: { r: 22630 } }, Database: {
acquireCount: { r: 11315 } }, Collection: { acquireCount: { r: 11315 } } } protocol:op_query 265ms
[Figure: Stackdriver interface]
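A log line like the one above holds most of what an index analysis needs: the collection, the filtered field, the query plan and the number of documents examined. The following is a rough sketch of extracting those values with regular expressions. The patterns are fitted to this single example, and the function and key names are assumptions, not the parser used in the talk.

```python
import re

# The example log entry from the slide, joined into a single line.
LOG_LINE = (
    '2020-02-12T16:13:06.591+0000 I COMMAND [conn2282323] command '
    'example-store.customers command: find { find: "customers", filter: { '
    'custom.fields.customerReference: "1234-abcd-5678-efgh" }, limit: 20, '
    'batchSize: 2147483647, maxTimeMS: 70000 } planSummary: COLLSCAN '
    'keysExamined:0 docsExamined:1447113 cursorExhausted:1 numYields:11314 '
    'nreturned:1 reslen:1265 locks:{ Global: { acquireCount: { r: 22630 } }, '
    'Database: { acquireCount: { r: 11315 } }, Collection: { acquireCount: '
    '{ r: 11315 } } } protocol:op_query 265ms'
)


def _first(pattern, text):
    """Return the first capture group of the pattern, or None if no match."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


def parse_slow_query(line):
    """Extract the values relevant for index analysis from one log line."""
    line = line.strip()
    counters = dict(
        re.findall(r"\b(keysExamined|docsExamined|nreturned):(\d+)", line)
    )
    duration = _first(r"(\d+)ms$", line)
    return {
        "namespace": _first(r"\] command (\S+) command:", line),
        "plan": _first(r"planSummary: (\S+)", line),
        # Only the first field of the filter is captured in this sketch.
        "filter_field": _first(r"filter: \{ ([\w.]+):", line),
        "keys_examined": int(counters.get("keysExamined", 0)),
        "docs_examined": int(counters.get("docsExamined", 0)),
        "n_returned": int(counters.get("nreturned", 0)),
        "duration_ms": int(duration) if duration else None,
    }


print(parse_slow_query(LOG_LINE))
```

Here the plan is COLLSCAN with 1,447,113 documents examined for a single returned document, which is the signature of a missing index on the filtered field.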
Analysis Pipeline
• Google Cloud Storage: Data lake to store logs
o New JSON file created every hour
Analysis Pipeline
• Google Cloud Functions: Parse logs
o Define functions and requirements
o Function triggered when new file is created in storage bucket
o Deploy function
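A storage-triggered function could look roughly like the sketch below. It assumes that the trigger event carries the bucket and object name of the new file, and that each line of the exported file is one JSON log entry with the raw MongoDB log line under "textPayload". Both function names are hypothetical. Downloading the file and writing rows to BigQuery are left out so the example runs without cloud credentials.

```python
import json


def extract_payloads(ndjson_text):
    """Return the raw log lines from newline-delimited JSON log entries.

    Assumption: the message is stored under "textPayload"; adjust the key
    to match the actual export format.
    """
    payloads = []
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        entry = json.loads(line)
        if "textPayload" in entry:
            payloads.append(entry["textPayload"])
    return payloads


def handle_new_log_file(event, context=None):
    """Entry point called when a new file is created in the storage bucket.

    Returns the location of the new file; a full implementation would
    download it, call extract_payloads() and parse each log line.
    """
    return f"gs://{event['bucket']}/{event['name']}"


# Usage with a made-up event and file content.
print(handle_new_log_file({"bucket": "slow-query-logs", "name": "2020/02/12/16.json"}))
print(extract_payloads('{"textPayload": "first line"}\n{"textPayload": "second line"}'))
```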
Analysis Pipeline
• Google BigQuery: Data warehouse to enable analytics on logs
Analysis Pipeline
• Analysis to generate index recommendations from logs
How do we rank the importance of an index recommendation?
1. Frequency of slow logs including the field
2. Free memory
3. Cardinality
4. Average search space reduction: If an index had been set, how many documents could have been skipped (in queries of the last week)?
○ This is also used to order fields within compound indexes, so fields with the highest search space reduction are in the first position
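Two of the ranking criteria, slow-log frequency and average search space reduction, can be estimated directly from parsed log records. The sketch below is one possible reading, not the scoring used in the talk. It assumes an index would have let a query examine only the documents it returned, which overestimates the benefit for queries with a limit. Free memory and cardinality are not modelled, and the record keys and sample data are made up.

```python
from collections import defaultdict


def rank_fields(slow_queries):
    """Rank filter fields by average search space reduction, then frequency.

    Each record needs "field", "docs_examined" and "n_returned".
    """
    reductions = defaultdict(list)
    for query in slow_queries:
        examined = query["docs_examined"]
        if examined == 0:
            continue  # nothing was scanned, so nothing could be skipped
        skipped = examined - query["n_returned"]
        reductions[query["field"]].append(skipped / examined)

    ranking = [
        {
            "field": field,
            "frequency": len(values),
            "avg_reduction": sum(values) / len(values),
        }
        for field, values in reductions.items()
    ]
    ranking.sort(key=lambda r: (r["avg_reduction"], r["frequency"]), reverse=True)
    return ranking


def compound_index_order(ranking, fields):
    """Order compound-index fields so the highest reduction comes first."""
    by_field = {r["field"]: r["avg_reduction"] for r in ranking}
    return sorted(fields, key=lambda f: by_field.get(f, 0.0), reverse=True)


# Made-up records standing in for one week of parsed slow-query logs.
slow = [
    {"field": "email", "docs_examined": 1000, "n_returned": 1},
    {"field": "email", "docs_examined": 1000, "n_returned": 1},
    {"field": "category", "docs_examined": 1000, "n_returned": 900},
]
ranking = rank_fields(slow)
print(ranking)
print(compound_index_order(ranking, ["category", "email"]))
```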
Analysis Pipeline
• Monitoring index performance
[Figure: index performance over time, marking the time when the index was set]
Analysis Pipeline
• Analysis to generate index recommendations from logs
o Convert notebook to HTML to share
o Include details and make recommendations explainable
Analysis Pipeline
• Let engineers review index recommendations
Analysis Pipeline
• Run analysis as weekly cron job
Analysis Pipeline
• Google Stackdriver: Logs of slow MongoDB queries
• Google Cloud Storage: Data lake to store logs
• Google Cloud Functions: Parse logs
• Google BigQuery: Data warehouse to enable analytics on logs
• Analysis to generate index recommendations from logs (run as weekly cron job)
• Let engineers review index recommendations
• Add index
• Store history of changed indexes
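The last two pipeline steps, adding an approved index and recording the change, might be sketched as follows. The `collection` argument is expected to behave like a PyMongo collection, whose `create_index(keys)` returns the index name. The plain list used as history store and the `FakeCollection` stand-in are assumptions that let the example run without a database server.

```python
from datetime import datetime, timezone


def add_index(collection, keys, history):
    """Create an index and append a record of the change to `history`."""
    index_name = collection.create_index(keys)
    history.append(
        {
            "index": index_name,
            "keys": list(keys),
            "created_at": datetime.now(timezone.utc).isoformat(),
        }
    )
    return index_name


class FakeCollection:
    """Stand-in for a real collection so the sketch runs without a server."""

    def create_index(self, keys):
        # Mimics the default naming scheme, e.g. "price_1".
        return "_".join(f"{field}_{direction}" for field, direction in keys)


history = []
add_index(FakeCollection(), [("custom.fields.customerReference", 1)], history)
print(history[0]["index"])
```

Keeping a timestamp for each change makes it possible to compare query performance before and after the index was set.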
Future Plans
• Improve index recommendations to make the review step unnecessary and come closer to the goal of full automation.
• Use machine learning to improve accuracy:
o Create a training dataset to train a model on predicting optimal indexes from query patterns.
o Maybe experiment with machine-learning-based indexes.
• Open source
www.commercetools.com
Office Munich
Adams-Lehmann-Str. 44
80797 Munich
+49 89 99 82 996-0
Office Berlin
Sonnenallee 223
12057 Berlin
+49 30 67 24 21-20
Durham, NC
318 Blackwell St Suite 240
Durham, NC 27701
+1 212 220 3809
Thanks for listening!
Any questions?
Connect:
● twitter.com/AmadeusMagrabi
● linkedin.com/in/amadeusmagrabi
● medium.com/@amadeus.magrabi

More Related Content

What's hot

AWS Cloud Kata | Taipei - Getting to Profitability
AWS Cloud Kata | Taipei - Getting to ProfitabilityAWS Cloud Kata | Taipei - Getting to Profitability
AWS Cloud Kata | Taipei - Getting to ProfitabilityAmazon Web Services
 
Big Data, Analytics, and Content Recommendations on AWS
Big Data, Analytics, and Content Recommendations on AWSBig Data, Analytics, and Content Recommendations on AWS
Big Data, Analytics, and Content Recommendations on AWSAmazon Web Services
 
Getting Started with Amazon Machine Learning
Getting Started with Amazon Machine LearningGetting Started with Amazon Machine Learning
Getting Started with Amazon Machine LearningAmazon Web Services
 
使用Amazon Machine Learning 建立即時推薦引擎
使用Amazon Machine Learning 建立即時推薦引擎使用Amazon Machine Learning 建立即時推薦引擎
使用Amazon Machine Learning 建立即時推薦引擎Amazon Web Services
 
Top 5 Ways to Optimize for Cost Efficiency with the Cloud
Top 5 Ways to Optimize for Cost Efficiency with the CloudTop 5 Ways to Optimize for Cost Efficiency with the Cloud
Top 5 Ways to Optimize for Cost Efficiency with the CloudAmazon Web Services
 
Simple Cloud with Amazon Lightsail
Simple Cloud with Amazon LightsailSimple Cloud with Amazon Lightsail
Simple Cloud with Amazon LightsailAmazon Web Services
 
AWS Cloud Kata | Taipei - Getting to Scale
AWS Cloud Kata | Taipei - Getting to ScaleAWS Cloud Kata | Taipei - Getting to Scale
AWS Cloud Kata | Taipei - Getting to ScaleAmazon Web Services
 
Simplify Big Data with AWS
Simplify Big Data with AWSSimplify Big Data with AWS
Simplify Big Data with AWSJulien SIMON
 
Optimize Content Processing in the Cloud with GPU and Spot Instances
Optimize Content Processing in the Cloud with GPU and Spot InstancesOptimize Content Processing in the Cloud with GPU and Spot Instances
Optimize Content Processing in the Cloud with GPU and Spot InstancesAmazon Web Services
 
AWS Cloud Kata | Taipei - Getting to MVP
AWS Cloud Kata | Taipei - Getting to MVPAWS Cloud Kata | Taipei - Getting to MVP
AWS Cloud Kata | Taipei - Getting to MVPAmazon Web Services
 
Big Data Architectural Patterns and Best Practices
Big Data Architectural Patterns and Best PracticesBig Data Architectural Patterns and Best Practices
Big Data Architectural Patterns and Best PracticesAmazon Web Services
 
Serverless Architectural Patterns: Collision 2018
Serverless Architectural Patterns: Collision 2018Serverless Architectural Patterns: Collision 2018
Serverless Architectural Patterns: Collision 2018Amazon Web Services
 
Build, train, and deploy ML models at scale.pdf
Build, train, and deploy ML models at scale.pdfBuild, train, and deploy ML models at scale.pdf
Build, train, and deploy ML models at scale.pdfAmazon Web Services
 
AWS Summit Singapore | Webinar Edition | Fast Start with AWS & Migrate to AWS!
AWS Summit Singapore | Webinar Edition | Fast Start with AWS & Migrate to AWS! AWS Summit Singapore | Webinar Edition | Fast Start with AWS & Migrate to AWS!
AWS Summit Singapore | Webinar Edition | Fast Start with AWS & Migrate to AWS! Amazon Web Services
 
Integrating Deep Learning into your Enterprise
Integrating Deep Learning into your EnterpriseIntegrating Deep Learning into your Enterprise
Integrating Deep Learning into your EnterpriseAmazon Web Services
 
Running Lean and Mean: Designing Cost-efficient Architectures on AWS (ARC313)...
Running Lean and Mean: Designing Cost-efficient Architectures on AWS (ARC313)...Running Lean and Mean: Designing Cost-efficient Architectures on AWS (ARC313)...
Running Lean and Mean: Designing Cost-efficient Architectures on AWS (ARC313)...Amazon Web Services
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon AthenaSungmin Kim
 
Amazon Machine Learning: Empowering Developers to Build Smart Applications
Amazon Machine Learning: Empowering Developers to Build Smart ApplicationsAmazon Machine Learning: Empowering Developers to Build Smart Applications
Amazon Machine Learning: Empowering Developers to Build Smart ApplicationsAmazon Web Services
 

What's hot (20)

AWS Cloud Kata | Taipei - Getting to Profitability
AWS Cloud Kata | Taipei - Getting to ProfitabilityAWS Cloud Kata | Taipei - Getting to Profitability
AWS Cloud Kata | Taipei - Getting to Profitability
 
Big Data, Analytics, and Content Recommendations on AWS
Big Data, Analytics, and Content Recommendations on AWSBig Data, Analytics, and Content Recommendations on AWS
Big Data, Analytics, and Content Recommendations on AWS
 
Getting Started with Amazon Machine Learning
Getting Started with Amazon Machine LearningGetting Started with Amazon Machine Learning
Getting Started with Amazon Machine Learning
 
使用Amazon Machine Learning 建立即時推薦引擎
使用Amazon Machine Learning 建立即時推薦引擎使用Amazon Machine Learning 建立即時推薦引擎
使用Amazon Machine Learning 建立即時推薦引擎
 
Machine Learning On AWS
Machine Learning On AWSMachine Learning On AWS
Machine Learning On AWS
 
Top 5 Ways to Optimize for Cost Efficiency with the Cloud
Top 5 Ways to Optimize for Cost Efficiency with the CloudTop 5 Ways to Optimize for Cost Efficiency with the Cloud
Top 5 Ways to Optimize for Cost Efficiency with the Cloud
 
Simple Cloud with Amazon Lightsail
Simple Cloud with Amazon LightsailSimple Cloud with Amazon Lightsail
Simple Cloud with Amazon Lightsail
 
AWS Cloud Kata | Taipei - Getting to Scale
AWS Cloud Kata | Taipei - Getting to ScaleAWS Cloud Kata | Taipei - Getting to Scale
AWS Cloud Kata | Taipei - Getting to Scale
 
Simplify Big Data with AWS
Simplify Big Data with AWSSimplify Big Data with AWS
Simplify Big Data with AWS
 
Optimize Content Processing in the Cloud with GPU and Spot Instances
Optimize Content Processing in the Cloud with GPU and Spot InstancesOptimize Content Processing in the Cloud with GPU and Spot Instances
Optimize Content Processing in the Cloud with GPU and Spot Instances
 
AWS Cloud Kata | Taipei - Getting to MVP
AWS Cloud Kata | Taipei - Getting to MVPAWS Cloud Kata | Taipei - Getting to MVP
AWS Cloud Kata | Taipei - Getting to MVP
 
Big Data Architectural Patterns and Best Practices
Big Data Architectural Patterns and Best PracticesBig Data Architectural Patterns and Best Practices
Big Data Architectural Patterns and Best Practices
 
Serverless Architectural Patterns: Collision 2018
Serverless Architectural Patterns: Collision 2018Serverless Architectural Patterns: Collision 2018
Serverless Architectural Patterns: Collision 2018
 
Build, train, and deploy ML models at scale.pdf
Build, train, and deploy ML models at scale.pdfBuild, train, and deploy ML models at scale.pdf
Build, train, and deploy ML models at scale.pdf
 
AWS Summit Singapore | Webinar Edition | Fast Start with AWS & Migrate to AWS!
AWS Summit Singapore | Webinar Edition | Fast Start with AWS & Migrate to AWS! AWS Summit Singapore | Webinar Edition | Fast Start with AWS & Migrate to AWS!
AWS Summit Singapore | Webinar Edition | Fast Start with AWS & Migrate to AWS!
 
Integrating Deep Learning into your Enterprise
Integrating Deep Learning into your EnterpriseIntegrating Deep Learning into your Enterprise
Integrating Deep Learning into your Enterprise
 
Cloudonomics
CloudonomicsCloudonomics
Cloudonomics
 
Running Lean and Mean: Designing Cost-efficient Architectures on AWS (ARC313)...
Running Lean and Mean: Designing Cost-efficient Architectures on AWS (ARC313)...Running Lean and Mean: Designing Cost-efficient Architectures on AWS (ARC313)...
Running Lean and Mean: Designing Cost-efficient Architectures on AWS (ARC313)...
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
Amazon Machine Learning: Empowering Developers to Build Smart Applications
Amazon Machine Learning: Empowering Developers to Build Smart ApplicationsAmazon Machine Learning: Empowering Developers to Build Smart Applications
Amazon Machine Learning: Empowering Developers to Build Smart Applications
 

Similar to Managing Database Indexes: A Data-Driven Approach - Amadeus Magrabi

Building an enterprise Natural Language Search Engine with ElasticSearch and ...
Building an enterprise Natural Language Search Engine with ElasticSearch and ...Building an enterprise Natural Language Search Engine with ElasticSearch and ...
Building an enterprise Natural Language Search Engine with ElasticSearch and ...Debmalya Biswas
 
L'architettura di classe enterprise di nuova generazione - Massimo Brignoli
L'architettura di classe enterprise di nuova generazione - Massimo BrignoliL'architettura di classe enterprise di nuova generazione - Massimo Brignoli
L'architettura di classe enterprise di nuova generazione - Massimo BrignoliData Driven Innovation
 
Webinar: Scaling MongoDB
Webinar: Scaling MongoDBWebinar: Scaling MongoDB
Webinar: Scaling MongoDBMongoDB
 
Prepare for Peak Holiday Season with MongoDB
Prepare for Peak Holiday Season with MongoDBPrepare for Peak Holiday Season with MongoDB
Prepare for Peak Holiday Season with MongoDBMongoDB
 
Unify Your Selling Channels in One Product Catalog Service
Unify Your Selling Channels in One Product Catalog ServiceUnify Your Selling Channels in One Product Catalog Service
Unify Your Selling Channels in One Product Catalog ServiceMongoDB
 
MongoDB Europe 2016 - The Rise of the Data Lake
MongoDB Europe 2016 - The Rise of the Data LakeMongoDB Europe 2016 - The Rise of the Data Lake
MongoDB Europe 2016 - The Rise of the Data LakeMongoDB
 
MongoDB and Spring - Two leaves of a same tree
MongoDB and Spring - Two leaves of a same treeMongoDB and Spring - Two leaves of a same tree
MongoDB and Spring - Two leaves of a same treeMongoDB
 
MongoDB Tick Data Presentation
MongoDB Tick Data PresentationMongoDB Tick Data Presentation
MongoDB Tick Data PresentationMongoDB
 
MongoDB & Hadoop - Understanding Your Big Data
MongoDB & Hadoop - Understanding Your Big DataMongoDB & Hadoop - Understanding Your Big Data
MongoDB & Hadoop - Understanding Your Big DataMongoDB
 
L’architettura di Classe Enterprise di Nuova Generazione
L’architettura di Classe Enterprise di Nuova GenerazioneL’architettura di Classe Enterprise di Nuova Generazione
L’architettura di Classe Enterprise di Nuova GenerazioneMongoDB
 
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...Mark Rittman
 
MongoDB.local Atlanta: MongoDB @ Sensus: Xylem IoT and MongoDB
MongoDB.local Atlanta: MongoDB @ Sensus: Xylem IoT and MongoDBMongoDB.local Atlanta: MongoDB @ Sensus: Xylem IoT and MongoDB
MongoDB.local Atlanta: MongoDB @ Sensus: Xylem IoT and MongoDBMongoDB
 
Introduction to data mining and data warehousing
Introduction to data mining and data warehousingIntroduction to data mining and data warehousing
Introduction to data mining and data warehousingEr. Nawaraj Bhandari
 
Customer Story: Elastic Stack을 이용한 게임 서비스 통합 로깅 플랫폼
Customer Story: Elastic Stack을 이용한 게임 서비스 통합 로깅 플랫폼Customer Story: Elastic Stack을 이용한 게임 서비스 통합 로깅 플랫폼
Customer Story: Elastic Stack을 이용한 게임 서비스 통합 로깅 플랫폼Elasticsearch
 
NDC Oslo : A Practical Introduction to Data Science
NDC Oslo : A Practical Introduction to Data ScienceNDC Oslo : A Practical Introduction to Data Science
NDC Oslo : A Practical Introduction to Data ScienceMark West
 
Expanding Retail Frontiers with MongoDB
Expanding Retail Frontiers with MongoDBExpanding Retail Frontiers with MongoDB
Expanding Retail Frontiers with MongoDBNorberto Leite
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureDATAVERSITY
 
MongoDB .local Bengaluru 2019: A Complete Methodology to Data Modeling for Mo...
MongoDB .local Bengaluru 2019: A Complete Methodology to Data Modeling for Mo...MongoDB .local Bengaluru 2019: A Complete Methodology to Data Modeling for Mo...
MongoDB .local Bengaluru 2019: A Complete Methodology to Data Modeling for Mo...MongoDB
 

Similar to Managing Database Indexes: A Data-Driven Approach - Amadeus Magrabi (20)

Building an enterprise Natural Language Search Engine with ElasticSearch and ...
Building an enterprise Natural Language Search Engine with ElasticSearch and ...Building an enterprise Natural Language Search Engine with ElasticSearch and ...
Building an enterprise Natural Language Search Engine with ElasticSearch and ...
 
L'architettura di classe enterprise di nuova generazione - Massimo Brignoli
L'architettura di classe enterprise di nuova generazione - Massimo BrignoliL'architettura di classe enterprise di nuova generazione - Massimo Brignoli
L'architettura di classe enterprise di nuova generazione - Massimo Brignoli
 
Webinar: Scaling MongoDB
Webinar: Scaling MongoDBWebinar: Scaling MongoDB
Webinar: Scaling MongoDB
 
Prepare for Peak Holiday Season with MongoDB
Prepare for Peak Holiday Season with MongoDBPrepare for Peak Holiday Season with MongoDB
Prepare for Peak Holiday Season with MongoDB
 
Unify Your Selling Channels in One Product Catalog Service
Unify Your Selling Channels in One Product Catalog ServiceUnify Your Selling Channels in One Product Catalog Service
Unify Your Selling Channels in One Product Catalog Service
 
MongoDB Europe 2016 - The Rise of the Data Lake
MongoDB Europe 2016 - The Rise of the Data LakeMongoDB Europe 2016 - The Rise of the Data Lake
MongoDB Europe 2016 - The Rise of the Data Lake
 
MongoDB and Spring - Two leaves of a same tree
MongoDB and Spring - Two leaves of a same treeMongoDB and Spring - Two leaves of a same tree
MongoDB and Spring - Two leaves of a same tree
 
MongoDB + Spring
MongoDB + SpringMongoDB + Spring
MongoDB + Spring
 
MongoDB Tick Data Presentation
MongoDB Tick Data PresentationMongoDB Tick Data Presentation
MongoDB Tick Data Presentation
 
MongoDB & Hadoop - Understanding Your Big Data
MongoDB & Hadoop - Understanding Your Big DataMongoDB & Hadoop - Understanding Your Big Data
MongoDB & Hadoop - Understanding Your Big Data
 
Compilerpt
CompilerptCompilerpt
Compilerpt
 
L’architettura di Classe Enterprise di Nuova Generazione
L’architettura di Classe Enterprise di Nuova GenerazioneL’architettura di Classe Enterprise di Nuova Generazione
L’architettura di Classe Enterprise di Nuova Generazione
 
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
 
MongoDB.local Atlanta: MongoDB @ Sensus: Xylem IoT and MongoDB
MongoDB.local Atlanta: MongoDB @ Sensus: Xylem IoT and MongoDBMongoDB.local Atlanta: MongoDB @ Sensus: Xylem IoT and MongoDB
MongoDB.local Atlanta: MongoDB @ Sensus: Xylem IoT and MongoDB
 
Introduction to data mining and data warehousing
Introduction to data mining and data warehousingIntroduction to data mining and data warehousing
Introduction to data mining and data warehousing
 
Customer Story: Elastic Stack을 이용한 게임 서비스 통합 로깅 플랫폼
Customer Story: Elastic Stack을 이용한 게임 서비스 통합 로깅 플랫폼Customer Story: Elastic Stack을 이용한 게임 서비스 통합 로깅 플랫폼
Customer Story: Elastic Stack을 이용한 게임 서비스 통합 로깅 플랫폼
 
NDC Oslo : A Practical Introduction to Data Science
NDC Oslo : A Practical Introduction to Data ScienceNDC Oslo : A Practical Introduction to Data Science
NDC Oslo : A Practical Introduction to Data Science
 
Expanding Retail Frontiers with MongoDB
Expanding Retail Frontiers with MongoDBExpanding Retail Frontiers with MongoDB
Expanding Retail Frontiers with MongoDB
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
MongoDB .local Bengaluru 2019: A Complete Methodology to Data Modeling for Mo...
MongoDB .local Bengaluru 2019: A Complete Methodology to Data Modeling for Mo...MongoDB .local Bengaluru 2019: A Complete Methodology to Data Modeling for Mo...
MongoDB .local Bengaluru 2019: A Complete Methodology to Data Modeling for Mo...
 

Recently uploaded

SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
Managing Database Indexes: A Data-Driven Approach - Amadeus Magrabi

  • 1. Managing Database Indexes: A Data-Driven Approach. Amadeus Magrabi, Lead Data Scientist, commercetools. Open Data Science Conference East, April 17, 2020. All Rights Reserved @2020
  • 2. Plan for today
      • Background:
          o Me
          o Company: commercetools
      • Data-Driven Approach to Managing Database Indexes:
          o Database Indexes
          o MongoDB
          o Problem: Managing database indexes manually is painful
          o Solution: Data pipeline to automate index management
      • Questions
  • 3. Topics I will cover: Database Indexes, MongoDB, Data Science Project Management, Deep Learning, Machine Learning, Data Engineering, Google Cloud Platform
  • 4. Me
      • Studied cognitive science
      • Research in neuroscience/machine learning for 3 years
      • Working in data science for 4 years
      • Based in Berlin
  • 5. Company: commercetools
      • Offers e-commerce software via cloud-based APIs
      • Founded in 2006
      • 200+ employees
      • Offices in Munich (HQ), Berlin, Durham (US), Amsterdam, London, Singapore
      Example Request:
          $ curl https://api.europe-west1.gcp.commercetools.com/example-store/products/e7ba4c75-b1bb-483d-2c4a10f78472
      Example Response:
          {
            "id": "e7ba4c75-b1bb-483d-2c4a10f78472",
            "name": "Awesome flip-flops | Size 42 | Limited edition!",
            "images": [
              {"url": "https://www.example-store.com/shoe1_front.jpg"},
              {"url": "https://www.example-store.com/shoe1_side.jpg"}
            ],
            "prices": [
              {"centAmount": 4000, "currency": "EUR"},
              {"centAmount": 4350, "currency": "USD"}
            ],
            "categories": [
              {"name": "shoes", "id": "e7ba4c75-b1bb-483d-2c4a10f78473"},
              {"name": "summer", "id": "e7ba4c75-b1bb-483d-2c4a10f78473"}
            ]
          }
  • 6. Data Science @ commercetools: Image Similarity Search (figure: a search image alongside Prediction 1, Prediction 2, Prediction 3)
  • 7. Data Science @ commercetools: Category Recommendations
  • 8. Data Science @ commercetools
      Team structure:
      • "Vertical" team developing microservices
      • Data scientists and software engineers
      Team output:
      • For merchants: APIs that make it easier to manage data and improve data quality.
      • For consumers: APIs that enable machine learning features like image search.
      • For colleagues: Make internal company processes more data-driven, more efficient and more accurate.
  • 9. Managing Database Indexes in a Data-Driven Way
  • 10. What is the Problem?
      • commercetools stores e-commerce data in MongoDB databases.
      • Our databases need to support flexible queries at large scale.
      • Databases need good indexes to perform well.
      • Managing indexes manually is very hard at large scale.
      → Need for a data-driven approach to automate index management.
  • 15. Database Indexes
      • Indexes make it faster to read data from a database.
      • Analogy: Finding a specific topic in a book with an index.
      • Indexes contain a sorted subset of the data, with pointers to the full data.
      • Disadvantage: Writes are slower because every index must be updated when data changes, and indexes take up space in memory.
      Database for Products (storage: on disk):
          id  | name         | price | category | ...
          1   | Jeans XXL    | 60    | pants    | ...
          2   | T-Shirt Red  | 30    | t-shirts | ...
          3   | Suit Black   | 250   | suits    | ...
          4   | T-Shirt Blue | 30    | t-shirts | ...
          ... | ...          | ...   | ...      | ...
      Index (storage: in memory): price 30, 60, 250
      Example Query: Find products with prices > 50
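To make the "sorted subset with pointers" idea concrete, here is a minimal Python sketch. It is a toy model, not how MongoDB stores indexes internally: the table is a dict keyed by row id, and the index is a sorted list of (price, row id) pairs searched with binary search. The names `products`, `price_index` and `find_price_greater_than` are made up for this illustration.

```python
import bisect

# Toy "table on disk": row id -> full row (values from the slide's example table).
products = {
    1: {"name": "Jeans XXL", "price": 60, "category": "pants"},
    2: {"name": "T-Shirt Red", "price": 30, "category": "t-shirts"},
    3: {"name": "Suit Black", "price": 250, "category": "suits"},
    4: {"name": "T-Shirt Blue", "price": 30, "category": "t-shirts"},
}

# Toy "index in memory": (price, row id) pairs kept sorted by price.
# The row id plays the role of the pointer to the full data.
price_index = sorted((row["price"], row_id) for row_id, row in products.items())


def find_price_greater_than(threshold):
    """Return all rows with price > threshold using the sorted index."""
    # Binary search for the first index entry whose price exceeds the threshold,
    # instead of scanning every row in the table.
    start = bisect.bisect_right(price_index, (threshold, float("inf")))
    return [products[row_id] for _, row_id in price_index[start:]]


expensive = find_price_greater_than(50)
print([row["name"] for row in expensive])
```

The trade-off from the slide shows up here too: every insert or price change would also have to update `price_index`, and the index itself occupies memory.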
  • 17. MongoDB
      • Database to store documents in JSON format.
      • Dynamic schema, easy to adapt to changing requirements.
      • Open source
      • PyMongo API
      Structure: Database (Example Store) → Collections (Products, Orders, ...) → Documents → Fields and Values
      Example document:
          {
            "name": "T-Shirt Blue",
            "price": 30,
            "isSoldOut": false,
            "discounts": null,
            "categories": ["t-shirts", "summer"]
          }
  • 18. MongoDB Index Types
      • Single Field Index
      • Compound Index
          o Can be used for queries that filter on multiple fields
          o Field order matters: a query that filters only on a leading subset (prefix) of the index fields can also use the index
      • Sparse Index
          o Excludes documents that do not have a value for the indexed field
          o Useful to reduce index size when documents have missing values
      • ...
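The prefix rule for compound indexes can be sketched in a few lines of Python. This is a deliberately simplified model that only considers equality filters and is not MongoDB's actual query planner; the function name `usable_index_prefix` is hypothetical. For reference, with PyMongo a compound index is typically created with `collection.create_index([("category", 1), ("price", 1)])`, and a sparse index by passing `sparse=True`.

```python
def usable_index_prefix(index_fields, filter_fields):
    """Return the leading index fields that a filter on `filter_fields` can use.

    Simplified assumption: an index field is usable only if all index fields
    before it are also part of the filter.
    """
    prefix = []
    for field in index_fields:
        if field not in filter_fields:
            break  # the chain of leading fields is broken here
        prefix.append(field)
    return prefix


# Hypothetical compound index on (category, price, name).
index = ["category", "price", "name"]

print(usable_index_prefix(index, {"category", "price"}))  # both leading fields
print(usable_index_prefix(index, {"category", "name"}))   # only the first field
print(usable_index_prefix(index, {"price"}))              # no leading field
```

Under this model a filter on `price` alone gets no help from the index, which is why the order of fields in a compound index matters.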
  • 20. When should an Index be added?
      • When databases are large and response times are important
      • When the index matches queries that occur frequently
      • When data is read more often than it is written
      • When there is enough space in memory
      • When an index can significantly reduce the search space
          o High cardinality (id-like fields)
          o Low-cardinality indexes can still be useful when most queries filter for field values that are rare
          o Example: of 100k products, 95k are t-shirts and 5k are pants. An index on the category field reduces the search space significantly if users mostly search for pants, but not if they search for t-shirts.
      • ...
      Figure: the example "Database for Products" table (id, name, price, category).
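The t-shirts versus pants example can be worked through numerically. The sketch below uses the counts from the slide and defines search space reduction as the fraction of documents a query no longer has to examine when an index on `category` exists; that definition is an assumption made for illustration.

```python
from collections import Counter

# Document counts per category, taken from the slide's example.
category_counts = Counter({"t-shirts": 95_000, "pants": 5_000})
total_documents = sum(category_counts.values())


def search_space_reduction(category):
    """Fraction of documents that can be skipped when filtering on `category`."""
    matching = category_counts[category]
    return 1 - matching / total_documents


for category in category_counts:
    print(category, round(search_space_reduction(category), 2))
```

A search for pants skips roughly 95% of the collection, while a search for t-shirts skips only about 5%, so the same index is valuable for one query pattern and nearly useless for the other.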
  • 21. MongoDB Indexes
      Common problems:
      • It can be hard to predict how a database will be used.
      • Query patterns can change over time.
      → Indexes on important fields are often missing, making read queries slow.
      → Indexes are set on fields that are not used, making write queries slow and wasting memory.
  • 22. What is the Problem?
      • commercetools stores e-commerce data in MongoDB databases.
      • Our databases need to support flexible queries at large scale.
      • Databases need good indexes to perform well.
      • Managing indexes manually is very hard at large scale.
      → Goal: Build a recommendation engine for database indexes.
      → First internal project of our data science team.
      Timeline: Manual Index Management → Semi-Automatic Index Management → Full Automation (with a "we are here" marker)
  • 24. Analysis Pipeline
      Google Stackdriver (interface): Logs of slow MongoDB queries
      Example of a slow MongoDB query:
          2020-02-12T16:13:06.591+0000 I COMMAND [conn2282323] command example-store.customers command: find { find: "customers", filter: { custom.fields.customerReference: "1234-abcd-5678-efgh" }, limit: 20, batchSize: 2147483647, maxTimeMS: 70000 } planSummary: COLLSCAN keysExamined:0 docsExamined:1447113 cursorExhausted:1 numYields:11314 nreturned:1 reslen:1265 locks:{ Global: { acquireCount: { r: 22630 } }, Database: { acquireCount: { r: 11315 } }, Collection: { acquireCount: { r: 11315 } } } protocol:op_query 265ms
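The log line above already contains what an index analysis needs: the collection, the filtered fields, the plan (COLLSCAN means no index was used), and how many documents were examined versus returned. The following Python sketch pulls those values out with regular expressions. It is an illustration written against this one example line, not the parser used in the talk, and it assumes a flat filter whose values do not themselves contain `: ` or ` }`.

```python
import re

# The example log line from the slide, split across string literals for readability.
LOG_LINE = (
    '2020-02-12T16:13:06.591+0000 I COMMAND [conn2282323] command example-store.customers '
    'command: find { find: "customers", filter: { custom.fields.customerReference: '
    '"1234-abcd-5678-efgh" }, limit: 20, batchSize: 2147483647, maxTimeMS: 70000 } '
    'planSummary: COLLSCAN keysExamined:0 docsExamined:1447113 cursorExhausted:1 '
    'numYields:11314 nreturned:1 reslen:1265 locks:{ Global: { acquireCount: { r: 22630 } }, '
    'Database: { acquireCount: { r: 11315 } }, Collection: { acquireCount: { r: 11315 } } } '
    'protocol:op_query 265ms'
)


def parse_slow_query(line):
    """Extract the fields relevant for index analysis from one slow-query log line."""
    namespace = re.search(r"\] command (\S+) command:", line).group(1)
    database, collection = namespace.split(".", 1)
    # Assumption: the filter is flat, so it ends at the first " }".
    filter_body = re.search(r"filter: \{ (.*?) \}", line).group(1)
    return {
        "database": database,
        "collection": collection,
        "fields": re.findall(r"([\w.]+): ", filter_body),
        "plan": re.search(r"planSummary: (\w+)", line).group(1),
        "docs_examined": int(re.search(r"docsExamined:(\d+)", line).group(1)),
        "nreturned": int(re.search(r"nreturned:(\d+)", line).group(1)),
        "duration_ms": int(re.search(r"(\d+)ms$", line).group(1)),
    }


print(parse_slow_query(LOG_LINE))
```

Here the query scanned 1,447,113 documents to return a single one, which is the kind of pattern that suggests a missing index on the filtered field.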
  • 25. Analysis Pipeline
      Google Stackdriver (logs of slow MongoDB queries) → Google Cloud Storage (data lake to store logs)
      • New JSON file created every hour
  • 26. Analysis Pipeline
      Google Stackdriver (logs of slow MongoDB queries) → Google Cloud Storage (data lake to store logs) → Google Cloud Functions (parse logs)
      • Define functions and requirements
      • Function triggered when a new file is created in the storage bucket
      • Deploy function
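A storage-triggered function for this step might look roughly like the sketch below. Several things here are assumptions rather than details from the talk: that each hourly export is newline-delimited JSON with the raw MongoDB log line in a `textPayload` field, the table id `my-project.mongodb_logs.slow_queries`, and the function names. The entry point uses the `(event, context)` signature of background Cloud Functions and needs the `google-cloud-storage` and `google-cloud-bigquery` packages, which are imported lazily so that the parsing logic can be tested on its own.

```python
import json

# Hypothetical BigQuery destination table; replace with a real one.
TABLE_ID = "my-project.mongodb_logs.slow_queries"


def rows_from_export(text):
    """Turn one exported log file (newline-delimited JSON) into table rows."""
    rows = []
    for raw in text.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        entry = json.loads(raw)
        payload = entry.get("textPayload", "")
        if "planSummary" not in payload:
            continue  # keep only entries that look like query logs
        rows.append({"timestamp": entry.get("timestamp"), "raw_log": payload})
    return rows


def on_new_log_file(event, context):
    """Entry point: runs when a new file is created in the storage bucket."""
    # Imported here so the parsing above works without the cloud libraries.
    from google.cloud import bigquery, storage

    blob = storage.Client().bucket(event["bucket"]).blob(event["name"])
    rows = rows_from_export(blob.download_as_text())
    if rows:
        errors = bigquery.Client().insert_rows_json(TABLE_ID, rows)
        if errors:
            raise RuntimeError(f"BigQuery insert failed: {errors}")
```

Keeping the parsing in a plain function separates the logic from the cloud wiring, so it can be checked locally before the function is deployed.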
  • 28. Analysis Pipeline
      Google Stackdriver (logs of slow MongoDB queries) → Google Cloud Storage (data lake to store logs) → Google Cloud Functions (parse logs) → Google BigQuery (data warehouse to enable analytics on logs) → Analysis to generate index recommendations from logs
      How do we rank the importance of an index recommendation?
      1. Frequency of slow logs including the field
      2. Free memory
      3. Cardinality
      4. Average search space reduction: if the index had been set, how many documents could have been skipped (in the queries of the last week)?
          ○ This is also used to order fields within compound indexes, so that fields with the highest search space reduction come first.
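One way to approximate the fourth criterion from parsed slow-query logs is sketched below. The formula is an assumption, not the one from the talk: for each slow query, the share of examined documents that were not returned is treated as what an ideal index could have skipped, and that share is credited to every field in the query's filter. The record layout (`fields`, `docs_examined`, `nreturned`) and the sample values are hypothetical.

```python
from collections import defaultdict


def average_reduction_by_field(slow_queries):
    """Average fraction of examined documents that could have been skipped, per field."""
    per_field = defaultdict(list)
    for query in slow_queries:
        if query["docs_examined"] == 0:
            continue  # nothing was scanned, so there is nothing to skip
        skipped = 1 - query["nreturned"] / query["docs_examined"]
        for field in query["fields"]:
            per_field[field].append(skipped)
    return {field: sum(values) / len(values) for field, values in per_field.items()}


def order_compound_fields(fields, reductions):
    """Put the fields with the highest search space reduction first."""
    return sorted(fields, key=lambda field: reductions.get(field, 0.0), reverse=True)


# Hypothetical parsed log records from the last week.
slow_queries = [
    {"fields": ["customerReference"], "docs_examined": 1000, "nreturned": 1},
    {"fields": ["status"], "docs_examined": 1000, "nreturned": 400},
    {"fields": ["status"], "docs_examined": 1000, "nreturned": 600},
]

reductions = average_reduction_by_field(slow_queries)
print(order_compound_fields(["status", "customerReference"], reductions))
```

With these sample records, `customerReference` would come before `status` in a compound index, because queries on it discard almost everything they scan.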
  • 29. Analysis Pipeline
      Google Stackdriver (logs of slow MongoDB queries) → Google Cloud Storage (data lake to store logs) → Google Cloud Functions (parse logs) → Google BigQuery (data warehouse to enable analytics on logs) → Analysis to generate index recommendations from logs
      • Monitoring index performance (figure annotated with the time when the index was set)
  • 30. Analysis Pipeline
      Google Stackdriver (logs of slow MongoDB queries) → Google Cloud Storage (data lake to store logs) → Google Cloud Functions (parse logs) → Google BigQuery (data warehouse to enable analytics on logs) → Analysis to generate index recommendations from logs
      • Convert notebook to HTML to share
      • Include details and make recommendations explainable
  • 33. Analysis Pipeline
      Google Stackdriver (logs of slow MongoDB queries) → Google Cloud Storage (data lake to store logs) → Google Cloud Functions (parse logs) → Google BigQuery (data warehouse to enable analytics on logs) → Analysis to generate index recommendations from logs (run as a weekly cron job) → Let engineers review index recommendations → Add index
      • Store history of changed indexes
  • 34. Future Plans
      • Improve index recommendations to make the review step unnecessary and come closer to the goal of full automation.
      • Use machine learning to improve accuracy:
          o Create a training dataset to train a model to predict optimal indexes from query patterns.
          o Maybe experiment with machine-learning-based indexes.
      • Open source
  • 35. Thanks for listening! Any questions?
      Connect:
      ● twitter.com/AmadeusMagrabi
      ● linkedin.com/in/amadeusmagrabi
      ● medium.com/@amadeus.magrabi
      www.commercetools.com
      Office Munich: Adams-Lehmann-Str. 44, 80797 Munich, +49 89 99 82 996-0
      Office Berlin: Sonnenallee 223, 12057 Berlin, +49 30 67 24 21-20
      Durham, NC: 318 Blackwell St, Suite 240, Durham, NC 27701, +1 212 220 3809