Managing Database Indexes:
A Data-Driven Approach
Amadeus Magrabi
Lead Data Scientist, commercetools
Open Data Science Conference East - April 17, 2020
All Rights Reserved @2020 1
Plan for today
• Background:
o Me
o Company: commercetools
• Data-Driven Approach to Managing Database Indexes:
o Database Indexes
o MongoDB
o Problem: Managing database indexes manually is painful
o Solution: Data pipeline to automate index management
• Questions
Topics I will cover
• Database Indexes
• MongoDB
• Data Science
• Project Management
• Deep Learning
• Machine Learning
• Data Engineering
• Google Cloud Platform
Me
• Studied cognitive science
• Research in neuroscience/machine learning for 3 years
• Working in data science for 4 years
• Based in Berlin
Company: commercetools
• Offers e-commerce software via cloud-based APIs
• Founded in 2006
• 200+ employees
• Offices in Munich (HQ), Berlin, Durham (US), Amsterdam, London, Singapore
Example Request:
$ curl https://api.europe-west1.gcp.commercetools.com/example-store/products/e7ba4c75-b1bb-483d-2c4a10f78472
Example Response:
{
  "id": "e7ba4c75-b1bb-483d-2c4a10f78472",
  "name": "Awesome flip-flops | Size 42 | Limited edition!",
  "images": [
    {"url": "https://www.example-store.com/shoe1_front.jpg"},
    {"url": "https://www.example-store.com/shoe1_side.jpg"}
  ],
  "prices": [
    {"centAmount": 4000, "currency": "EUR"},
    {"centAmount": 4350, "currency": "USD"}
  ],
  "categories": [
    {"name": "shoes", "id": "e7ba4c75-b1bb-483d-2c4a10f78473"},
    {"name": "summer", "id": "e7ba4c75-b1bb-483d-2c4a10f78473"}
  ]
}
Data Science @ commercetools:
Image Similarity Search
[Figure: a search image next to Prediction 1, Prediction 2 and Prediction 3]
Data Science @ commercetools:
Category Recommendations
Data Science @ commercetools
Team structure:
• “Vertical” team developing microservices
• Data scientists and software engineers
Team output:
• For merchants:
o APIs that make it easier to manage data and improve data quality.
• For consumers:
o APIs that enable machine learning features like image search.
• For colleagues:
o Make internal company processes more data-driven, more efficient and more accurate.
Managing Database Indexes
in a Data-Driven Way
What is the Problem?
• commercetools stores e-commerce data in MongoDB databases.
• Our databases need to support flexible queries on a large scale.
• Databases need good database indexes to perform well.
• Managing indexes manually is very hard on a large scale.
→ Need for a data-driven approach to automate index management.
Database Indexes
• Indexes make reading data from a database faster.
• Analogy: Finding a specific topic in a book with the help of its index.
• Indexes contain a sorted subset of the data, with pointers to the full data.
• Disadvantage: Changing data is slower because each index has to be updated as well, and indexes require space in memory.
Database for Products (Storage: On Disk)
id   name          price  category  ...
1    Jeans XXL     60     pants     ...
2    T-Shirt Red   30     t-shirts  ...
3    Suit Black    250    suits     ...
4    T-Shirt Blue  30     t-shirts  ...
...  ...           ...    ...       ...
Index on price (Storage: In Memory)
price
30
60
250
Example Query: Find products with price > 50
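The idea on this slide can be sketched in a few lines of Python. This is only an illustration, not how MongoDB implements indexes internally (MongoDB uses B-tree structures). The index is a sorted list of (price, id) pairs, and a binary search finds the first entry above the threshold, so rows below it are never touched. The names `products`, `price_index`, `find_price_greater_than` and `insert_product` are made up for this sketch.

```python
import bisect

# Product table from the slide, keyed by id (stands in for the data on disk).
products = {
    1: {"name": "Jeans XXL", "price": 60, "category": "pants"},
    2: {"name": "T-Shirt Red", "price": 30, "category": "t-shirts"},
    3: {"name": "Suit Black", "price": 250, "category": "suits"},
    4: {"name": "T-Shirt Blue", "price": 30, "category": "t-shirts"},
}

# The index: sorted (price, id) pairs; the id is the pointer to the full row.
price_index = sorted((row["price"], pid) for pid, row in products.items())


def find_price_greater_than(threshold):
    """Return product names with price > threshold, using the sorted index."""
    # (threshold, inf) sorts after every entry whose price equals the threshold,
    # so the binary search lands on the first strictly larger price.
    start = bisect.bisect_right(price_index, (threshold, float("inf")))
    return [products[pid]["name"] for _, pid in price_index[start:]]


def insert_product(pid, row):
    """Writes cost more: the index has to be updated along with the data."""
    products[pid] = row
    bisect.insort(price_index, (row["price"], pid))
```

Calling `find_price_greater_than(50)` reads only the index entries from 60 upwards and then follows their ids to the full rows.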
MongoDB
• Database that stores documents in a JSON-like format.
• Dynamic schema, easy to adapt to changing requirements.
• Open source
• Python access via the PyMongo API
Database: Example Store
  Collection: Products
    Documents
  Collection: Orders
    Documents
  ...
Document (fields and values):
{
  "name": "T-Shirt Blue",
  "price": 30,
  "isSoldOut": false,
  "discounts": null,
  "categories": [
    "t-shirts",
    "summer"
  ]
}
MongoDB Index Types
• Single Field Index
• Compound Index
o Can be used for queries that filter on multiple fields
o The order of the fields matters: queries that filter only on the leading fields of the index can also use it
• Sparse Index
o Excludes documents that do not have the indexed field
o Useful to reduce index size when many documents lack the field
• ...
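The index types above could be declared with PyMongo roughly as follows. This is a sketch under assumptions. The field names come from the example product document, while the index names and the helpers `apply_indexes` and `usable_prefix` are hypothetical. The prefix check is a simplification of what MongoDB's query planner does. Keys are written as plain (field, direction) tuples, where 1 means ascending, which is the form PyMongo's `Collection.create_index` accepts.

```python
# Index definitions as (keys, options); direction 1 = ascending.
INDEX_SPECS = {
    "price_single": ([("price", 1)], {}),
    "category_price_compound": ([("categories", 1), ("price", 1)], {}),
    # Sparse: documents without a "discounts" field are left out of the index.
    "discounts_sparse": ([("discounts", 1)], {"sparse": True}),
}


def apply_indexes(collection, specs=INDEX_SPECS):
    """Create each index on a PyMongo collection (any object with create_index)."""
    for name, (keys, options) in specs.items():
        collection.create_index(keys, name=name, **options)


def usable_prefix(index_keys, query_fields):
    """Return the leading index fields that the query filters on.

    Simplified prefix rule: an empty result means the compound index
    cannot support the query, because its first field is not filtered.
    """
    wanted = set(query_fields)
    prefix = []
    for field, _direction in index_keys:
        if field not in wanted:
            break
        prefix.append(field)
    return prefix
```

With a live deployment, `collection` would be a PyMongo `Collection` object. The sketch accepts any object with a `create_index` method so it can be tried without a database.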
When should an Index be added?
• When databases are large and response times are important
• When the index matches queries that occur frequently
• When data is read more often than it is written
• When there is enough space in memory
• When an index can significantly reduce the search space
o High cardinality (id-like fields)
o Low cardinality indexes can still be useful when most queries filter for field values that are rare
o Example: 100k products, of which 95k are t-shirts and 5k are pants
o An index on the category field reduces the search space significantly if users mostly search for pants, but not if they search for t-shirts
• ...
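The arithmetic behind the t-shirts and pants example can be sketched in Python. The product counts are the ones from the slide. The function names and the two query mixes in the usage note are illustrative assumptions.

```python
# Category sizes from the slide: 100k products in total.
CATEGORY_COUNTS = {"t-shirts": 95_000, "pants": 5_000}
TOTAL = sum(CATEGORY_COUNTS.values())


def search_space_reduction(category):
    """Fraction of documents that an index on `category` lets a query skip."""
    return 1 - CATEGORY_COUNTS[category] / TOTAL


def average_reduction(query_mix):
    """Average reduction for a query mix given as {category: share of queries}."""
    return sum(
        share * search_space_reduction(category)
        for category, share in query_mix.items()
    )
```

A query for pants skips about 95% of the documents, while a query for t-shirts skips only about 5%. If 90% of the queries ask for pants, the average reduction is about 0.86. If 90% ask for t-shirts, it drops to about 0.14.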
MongoDB Indexes
Common problems:
• It can be hard to predict how a database will be used.
• Query patterns can change over time.
→ Indexes on important fields are often missing, making read queries slow.
→ Indexes are set on fields that are not used, making write queries slow and wasting memory.
What is the Problem?
• commercetools stores e-commerce data in MongoDB databases.
• Our databases need to support flexible queries on a large scale.
• Databases need good indexes to perform well.
• Managing indexes manually is very hard on a large scale.
→ Goal: Build a recommendation engine for database indexes.
→ First internal project of our data science team.
Timeline: Manual Index Management → Semi-Automatic Index Management (we are here) → Full Automation
Analysis Pipeline:
Google Stackdriver
• Logs of slow MongoDB queries
Example of slow MongoDB query:
2020-02-12T16:13:06.591+0000 I COMMAND [conn2282323] command example-store.customers command: find { find: "customers", filter: { custom.fields.customerReference: "1234-abcd-5678-efgh" }, limit: 20, batchSize: 2147483647, maxTimeMS: 70000 } planSummary: COLLSCAN keysExamined:0 docsExamined:1447113 cursorExhausted:1 numYields:11314 nreturned:1 reslen:1265 locks:{ Global: { acquireCount: { r: 22630 } }, Database: { acquireCount: { r: 11315 } }, Collection: { acquireCount: { r: 11315 } } } protocol:op_query 265ms
[Figure: Google Stackdriver interface]
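A log line like the one above could be turned into a structured record along these lines. This is a minimal sketch, not the parser used in the talk. The regular expressions are written against the single example line from the slide, the record keys are made-up names, and only the first field of the filter is extracted. Other MongoDB versions or commands may log in a different shape.

```python
import re

# The example line from the slide, split across string literals for readability.
LOG_LINE = (
    '2020-02-12T16:13:06.591+0000 I COMMAND [conn2282323] command '
    'example-store.customers command: find { find: "customers", filter: '
    '{ custom.fields.customerReference: "1234-abcd-5678-efgh" }, limit: 20, '
    'batchSize: 2147483647, maxTimeMS: 70000 } planSummary: COLLSCAN '
    'keysExamined:0 docsExamined:1447113 cursorExhausted:1 numYields:11314 '
    'nreturned:1 reslen:1265 locks:{ Global: { acquireCount: { r: 22630 } }, '
    'Database: { acquireCount: { r: 11315 } }, Collection: { acquireCount: '
    '{ r: 11315 } } } protocol:op_query 265ms'
)

# One capture group per value of interest (assumed layout, see lead-in).
PATTERNS = {
    "namespace": r"\] command (\S+) command:",
    "plan": r"planSummary: (\S+)",
    "keys_examined": r"keysExamined:(\d+)",
    "docs_examined": r"docsExamined:(\d+)",
    "n_returned": r"nreturned:(\d+)",
    "duration_ms": r"(\d+)ms$",
}
FILTER_FIELD = re.compile(r"filter: \{ ([\w.]+):")


def parse_slow_query(line):
    """Extract a flat record from one slow-query log line."""
    line = line.strip()
    record = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, line)
        if match is None:
            record[key] = None
            continue
        value = match.group(1)
        record[key] = int(value) if value.isdigit() else value
    field = FILTER_FIELD.search(line)
    record["filter_field"] = field.group(1) if field else None
    return record
```

In the example, `planSummary: COLLSCAN` together with `docsExamined:1447113` and `nreturned:1` shows that about 1.4 million documents were scanned to return a single one. That is the pattern an index on the filtered field would address.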
Analysis Pipeline:
Google Cloud Storage
• Data lake to store logs
• New JSON file created every hour
Analysis Pipeline:
Google Cloud Functions
• Parse logs
• Define functions and requirements
• Deploy function
• Function triggered when new file is created in storage bucket
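The trigger step could look roughly like the following Python sketch. Several things here are assumptions rather than details from the talk: the exported file is taken to hold one JSON log entry per line with the raw MongoDB line under `textPayload`, and the helper names are hypothetical. A deployed storage-triggered function would receive only `event` and `context`. The `read_text` and `write_rows` callables are injected here so the sketch runs without cloud access. In a deployment they would wrap the Cloud Storage and BigQuery clients.

```python
import json


def extract_payloads(file_text):
    """Yield the raw MongoDB log line from each exported log entry.

    Assumes one JSON object per line with the log line under "textPayload".
    """
    for line in file_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        entry = json.loads(line)
        payload = entry.get("textPayload")
        if payload:
            yield payload


def handle_new_log_file(event, context, read_text, write_rows):
    """Process one newly created log file and return the number of rows written.

    `event` carries the bucket and object name of the new file.
    """
    text = read_text(event["bucket"], event["name"])
    rows = [
        {"source_file": event["name"], "raw": payload}
        for payload in extract_payloads(text)
    ]
    write_rows(rows)
    return len(rows)
```

Keeping the parsing separate from the input and output makes the function easy to test locally before deploying it.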
Analysis Pipeline:
Google BigQuery
• Data warehouse to enable analytics on logs
Analysis Pipeline:
Analysis to generate index recommendations from logs
How do we rank the importance of an index recommendation?
1. Frequency of slow logs including the field
2. Free memory
3. Cardinality
4. Average search space reduction: If an index had been set, how many documents could have been skipped (in queries of the last week)?
o This is also used to order fields within compound indexes, so fields with the highest search space reduction are in the first position
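Criteria 1 and 4 can be computed from parsed slow-query logs alone. The Python sketch below shows one possible way. It is not the analysis from the talk. The record layout, the sample numbers, and the definition of search space reduction as (docs examined − docs returned) / docs examined are all assumptions. For queries that filter on several fields, the whole reduction is credited to each field, which is a deliberate simplification. Free memory and cardinality need information from the database itself and are left out.

```python
from collections import defaultdict

# Hypothetical parsed slow-query records from the last week.
SLOW_QUERIES = [
    {"fields": ["customerReference"], "docs_examined": 1_000_000, "n_returned": 1},
    {"fields": ["customerReference"], "docs_examined": 1_000_000, "n_returned": 3},
    {"fields": ["country", "customerReference"], "docs_examined": 1_000_000, "n_returned": 2},
    {"fields": ["country"], "docs_examined": 1_000_000, "n_returned": 400_000},
]


def field_stats(queries):
    """Per field: number of slow queries and average search space reduction."""
    reductions = defaultdict(list)
    for query in queries:
        examined = query["docs_examined"]
        # Assumed definition: share of examined documents that were not needed.
        reduction = (examined - query["n_returned"]) / examined if examined else 0.0
        for field in query["fields"]:
            reductions[field].append(reduction)
    return {
        field: {"frequency": len(values), "avg_reduction": sum(values) / len(values)}
        for field, values in reductions.items()
    }


def order_compound_fields(fields, stats):
    """Put fields with the highest average reduction first."""
    return sorted(fields, key=lambda f: stats[f]["avg_reduction"], reverse=True)


def rank_fields(stats):
    """Illustrative ranking: frequency first, average reduction as tie-breaker."""
    return sorted(
        stats,
        key=lambda f: (stats[f]["frequency"], stats[f]["avg_reduction"]),
        reverse=True,
    )
```

With the sample data, `customerReference` appears in three slow queries and almost always narrows the result to a handful of documents. It therefore ranks above `country` and would come first in a compound index on both fields.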
Analysis Pipeline:
Monitoring index performance
[Figure: index performance over time, with a marker at the time when the index was set]
Analysis Pipeline:
Analysis to generate index recommendations from logs
• Convert notebook to HTML to share
• Include details and make recommendations explainable
Analysis Pipeline
1. Google Stackdriver: Logs of slow MongoDB queries
2. Google Cloud Storage: Data lake to store logs
3. Google Cloud Functions: Parse logs
4. Google BigQuery: Data warehouse to enable analytics on logs
5. Analysis to generate index recommendations from logs (run as weekly cron job)
6. Let engineers review index recommendations
7. Add index
8. Store history of changed indexes
Future Plans
• Improve index recommendations to make the review step unnecessary and come closer to the goal of full automation.
• Use machine learning to improve accuracy:
o Create a training dataset to train a model on predicting optimal indexes from query patterns.
o Maybe experiment with machine-learning-based indexes.
• Open source
www.commercetools.com
Office Munich
Adams-Lehmann-Str. 44
80797 Munich
+49 89 99 82 996-0
Office Berlin
Sonnenallee 223
12057 Berlin
+49 30 67 24 21-20
Durham, NC
318 Blackwell St Suite 240
Durham, NC 27701
+1 212 220 3809
Thanks for listening!
Any questions?
Connect:
• twitter.com/AmadeusMagrabi
• linkedin.com/in/amadeusmagrabi
• medium.com/@amadeus.magrabi

Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 

Managing Database Indexes: A Data-Driven Approach - Amadeus Magrabi

  • 1. Managing Database Indexes: A Data-Driven Approach. Amadeus Magrabi, Lead Data Scientist, commercetools. Open Data Science Conference East, April 17, 2020. All Rights Reserved @2020
  • 2. Plan for today
      • Background:
        o Me
        o Company: commercetools
      • Data-Driven Approach to Managing Database Indexes:
        o Database Indexes
        o MongoDB
        o Problem: Managing database indexes manually is painful
        o Solution: Data pipeline to automate index management
      • Questions
  • 3. Topics I will cover: Database Indexes, MongoDB, Data Science Project Management, Deep Learning, Machine Learning, Data Engineering, Google Cloud Platform
  • 4. Me
      • Studied cognitive science
      • Research in neuroscience/machine learning for 3 years
      • Working in data science for 4 years
      • Based in Berlin
  • 5. Company: commercetools
      • Offers e-commerce software via cloud-based APIs
      • Founded in 2006
      • 200+ employees
      • Offices in Munich (HQ), Berlin, Durham (US), Amsterdam, London, Singapore
      Example Request:
        $ curl https://api.europe-west1.gcp.commercetools.com/example-store/products/e7ba4c75-b1bb-483d-2c4a10f78472
      Example Response:
        {
          "id": "e7ba4c75-b1bb-483d-2c4a10f78472",
          "name": "Awesome flip-flops | Size 42 | Limited edition!",
          "images": [
            {"url": "https://www.example-store.com/shoe1_front.jpg"},
            {"url": "https://www.example-store.com/shoe1_side.jpg"}
          ],
          "prices": [
            {"centAmount": 4000, "currency": "EUR"},
            {"centAmount": 4350, "currency": "USD"}
          ],
          "categories": [
            {"name": "shoes", "id": "e7ba4c75-b1bb-483d-2c4a10f78473"},
            {"name": "summer", "id": "e7ba4c75-b1bb-483d-2c4a10f78473"}
          ]
        }
  • 6. Data Science @ commercetools: Image Similarity Search (figure: a search image next to Prediction 1, Prediction 2, Prediction 3)
  • 7. Data Science @ commercetools: Category Recommendations
  • 8. Data Science @ commercetools
      Team structure:
      • "Vertical" team developing microservices
      • Data scientists and software engineers
      Team output:
      • For merchants:
        o APIs that make it easier to manage data and improve data quality.
      • For consumers:
        o APIs that enable machine learning features like image search.
      • For colleagues:
        o Make internal company processes more data-driven, more efficient and more accurate.
  • 9. Managing Database Indexes in a Data-Driven Way
  • 10. What is the Problem?
      • commercetools stores e-commerce data in MongoDB databases.
      • Our databases need to support flexible queries on a large scale.
      • Databases need good database indexes to perform well.
      • Managing indexes manually is very hard on a large scale.
      → Need for a data-driven approach to automate index management.
  • 15. Database Indexes
      • Indexes make it faster to read data from a database.
      • Analogy: Finding a specific topic in a book with an index.
      • Indexes contain a sorted subset of the data, with pointers to the full data.
      • Disadvantage: Changing data is slower, and indexes require space in memory.
      Database for Products (storage: on disk):
        id | name         | price | category | ...
        1  | Jeans XXL    | 60    | pants    | ...
        2  | T-Shirt Red  | 30    | t-shirts | ...
        3  | Suit Black   | 250   | suits    | ...
        4  | T-Shirt Blue | 30    | t-shirts | ...
        ...
      Index (storage: in memory): price 30, 60, 250
      Example query: Find products with prices > 50
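The idea of a sorted subset with pointers back to the full data can be sketched in a few lines of Python. This is only a toy illustration using the product table from the slide; the dictionary `products` stands in for documents on disk, and the names `price_index` and `find_price_greater_than` are made up for this example, not taken from the talk.

```python
from bisect import bisect_right

# Toy "collection" keyed by id, standing in for the full data stored on disk.
products = {
    1: {"name": "Jeans XXL", "price": 60, "category": "pants"},
    2: {"name": "T-Shirt Red", "price": 30, "category": "t-shirts"},
    3: {"name": "Suit Black", "price": 250, "category": "suits"},
    4: {"name": "T-Shirt Blue", "price": 30, "category": "t-shirts"},
}

# The index: (price, id) pairs kept sorted by price. The id acts as the pointer.
price_index = sorted((doc["price"], doc_id) for doc_id, doc in products.items())


def find_price_greater_than(threshold):
    """Return names of products with price > threshold, using the sorted index."""
    # Binary search for the first entry whose price is strictly above the threshold.
    start = bisect_right(price_index, (threshold, float("inf")))
    # Only the matching entries are followed back to the full documents.
    return [products[doc_id]["name"] for _, doc_id in price_index[start:]]


print(find_price_greater_than(50))
```

Without the index, every product would have to be inspected; with it, the binary search jumps straight to the first qualifying price. The cost named on the slide shows up here too: every insert or price change would also have to update `price_index`.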
  • 17. MongoDB
      • Database to store documents in JSON format.
      • Dynamic schema, easy to adapt to changing requirements.
      • Open source
      Structure:
      • Database: Example Store
        o Collection: Products (contains documents)
        o Collection: Orders (contains documents)
        o ...
      • Each document consists of fields and values.
      Example document:
        {
          "name": "T-Shirt Blue",
          "price": 30,
          "isSoldOut": false,
          "discounts": null,
          "categories": ["t-shirts", "summer"]
        }
      PyMongo API
  • 18. MongoDB Index Types
      • Single Field Index
      • Compound Index
        o Can be used for queries that filter on multiple fields
        o The order of fields matters; queries that filter only on the leading fields of the index can also use it
      • Sparse Index
        o Excludes documents that do not have a value for a field
        o Useful to reduce index size when documents have missing values
      • ...
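The ordering rule for compound indexes can be made concrete with a small helper. The sketch below is a simplification written for this transcript, not MongoDB's query planner: it only checks which leading fields of an index a query filters on, and the function name `usable_prefix` and the field list are hypothetical.

```python
def usable_prefix(index_fields, query_fields):
    """Return the leading index fields that the query filters on.

    An empty result means the query cannot use the index for filtering
    under the simple prefix rule modelled here.
    """
    prefix = []
    for field in index_fields:
        if field not in query_fields:
            break  # the chain of leading fields is interrupted
        prefix.append(field)
    return prefix


# Hypothetical compound index on (category, price, name), in that order.
compound_index = ["category", "price", "name"]

print(usable_prefix(compound_index, {"category"}))           # leading field only
print(usable_prefix(compound_index, {"category", "price"}))  # first two fields
print(usable_prefix(compound_index, {"price"}))              # skips the first field
```

In PyMongo, an index like this would typically be created with `collection.create_index([("category", 1), ("price", 1), ("name", 1)])`, and a sparse index by passing `sparse=True`; the collection and field names here are assumptions for illustration.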
  • 20. When should an Index be added?
      • When databases are large and response times are important
      • When the index matches queries that occur frequently
      • When data is more often read than written
      • When there is enough space in memory
      • When an index can significantly reduce the search space
        o High cardinality (id-like fields)
        o Low cardinality indexes can still be useful when most queries filter for field values that are rare
        o Example:
          o 100k products: 95k are t-shirts, 5k are pants
          o Index on the category field reduces the search space significantly if users mostly search for pants, but not if they search for t-shirts
      • ...
      Database for Products:
        id | name         | price | category | ...
        1  | Jeans XXL    | 60    | pants    | ...
        2  | T-Shirt Red  | 30    | t-shirts | ...
        3  | Suit Black   | 250   | suits    | ...
        4  | T-Shirt Blue | 30    | t-shirts | ...
        ...
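The t-shirts versus pants example can be worked through numerically. The sketch below assumes that "search space reduction" means the share of documents a query no longer has to examine once it can jump to the matching index entries; the function name is made up for this illustration.

```python
def search_space_reduction(total_docs, matching_docs):
    """Fraction of documents an index lookup lets the database skip."""
    return 1 - matching_docs / total_docs


# Numbers from the slide: 100k products, 95k t-shirts and 5k pants.
category_counts = {"t-shirts": 95_000, "pants": 5_000}
total = sum(category_counts.values())

for category, count in category_counts.items():
    reduction = search_space_reduction(total, count)
    print(f"{category}: {reduction:.0%} of documents skipped")
```

A query for pants skips 95% of the collection, while a query for t-shirts skips only 5%, which is why the same low-cardinality index can be valuable or nearly useless depending on which values users actually search for.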
  • 21. MongoDB Indexes
      Common problems:
      • It can be hard to predict how a database will be used.
      • Query patterns can change over time.
      → Indexes on important fields are often missing, making read queries slow.
      → Indexes are set on fields that are not used, making write queries slow and wasting memory.
  • 22. What is the Problem?
      • commercetools stores e-commerce data in MongoDB databases.
      • Our databases need to support flexible queries on a large scale.
      • Databases need good indexes to perform well.
      • Managing indexes manually is very hard on a large scale.
      → Goal: Build a recommendation engine for database indexes.
      → First internal project of our data science team.
      Timeline: Manual Index Management → Semi-Automatic Index Management → Full Automation (with a "we are here" marker on the timeline)
  • 24. Analysis Pipeline
      • Google Stackdriver: Logs of slow MongoDB queries
      Example of slow MongoDB query (Stackdriver interface):
        2020-02-12T16:13:06.591+0000 I COMMAND [conn2282323] command example-store.customers command: find { find: "customers", filter: { custom.fields.customerReference: "1234-abcd-5678-efgh" }, limit: 20, batchSize: 2147483647, maxTimeMS: 70000 } planSummary: COLLSCAN keysExamined:0 docsExamined:1447113 cursorExhausted:1 numYields:11314 nreturned:1 reslen:1265 locks:{ Global: { acquireCount: { r: 22630 } }, Database: { acquireCount: { r: 11315 } }, Collection: { acquireCount: { r: 11315 } } } protocol:op_query 265ms
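A log line like the one on this slide carries most of what an index analysis needs: the collection, the filtered fields, the query plan (`COLLSCAN` means no index was used), how many documents were examined versus returned, and the duration. The parser below is a minimal sketch written for this transcript, not the parser from the talk. The regular expression is tuned to this one example line and assumes a flat filter without nested braces or colons inside values; the names `LOG_PATTERN` and `parse_slow_query` are hypothetical.

```python
import re

# Pattern tailored to the example line on the slide; other log formats
# would need their own patterns.
LOG_PATTERN = re.compile(
    r"command (?P<namespace>\S+) command: (?P<command>\w+) "
    r".*?filter: \{ (?P<filter>.*?) \}"
    r".*?planSummary: (?P<plan>\S+)"
    r".*?docsExamined:(?P<docs_examined>\d+)"
    r".*?nreturned:(?P<nreturned>\d+)"
    r".* (?P<duration_ms>\d+)ms$"
)

EXAMPLE_LINE = (
    '2020-02-12T16:13:06.591+0000 I COMMAND [conn2282323] '
    'command example-store.customers command: find { find: "customers", '
    'filter: { custom.fields.customerReference: "1234-abcd-5678-efgh" }, '
    'limit: 20, batchSize: 2147483647, maxTimeMS: 70000 } '
    'planSummary: COLLSCAN keysExamined:0 docsExamined:1447113 '
    'cursorExhausted:1 numYields:11314 nreturned:1 reslen:1265 '
    'locks:{ Global: { acquireCount: { r: 22630 } }, '
    'Database: { acquireCount: { r: 11315 } }, '
    'Collection: { acquireCount: { r: 11315 } } } '
    'protocol:op_query 265ms'
)


def parse_slow_query(line):
    """Extract the fields relevant for index analysis, or None if no match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return {
        "namespace": match["namespace"],
        "command": match["command"],
        # Field names are the dotted identifiers in front of each colon.
        "fields": re.findall(r"([\w.]+):", match["filter"]),
        "plan": match["plan"],
        "docs_examined": int(match["docs_examined"]),
        "nreturned": int(match["nreturned"]),
        "duration_ms": int(match["duration_ms"]),
    }


parsed = parse_slow_query(EXAMPLE_LINE)
print(parsed)
```

For the example line, the parsed record shows a full collection scan over 1,447,113 documents to return a single customer, which is the kind of query an index on `custom.fields.customerReference` would help.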
  • 25. Analysis Pipeline
      • Google Stackdriver: Logs of slow MongoDB queries
      • Google Cloud Storage: Data lake to store logs
        o New JSON file created every hour
  • 26. Analysis Pipeline
      • Google Stackdriver: Logs of slow MongoDB queries
      • Google Cloud Storage: Data lake to store logs
      • Google Cloud Functions: Parse logs
        o Define functions and requirements
        o Function triggered when new file is created in storage bucket
        o Deploy function
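A storage-triggered function of this kind might be shaped roughly as follows. This is a sketch under several assumptions, not the function from the talk: it assumes the exported files are newline-delimited JSON log entries that carry the raw MongoDB line in a `textPayload` field, and it takes a `read_lines` callable as a hypothetical extra argument so the logic can run without any cloud libraries. A real background function triggered by Cloud Storage receives only `event` and `context`, where `event` includes the `bucket` and `name` of the new file.

```python
import json


def text_payloads(ndjson_lines):
    """Yield the raw log line from each exported log entry that has one."""
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        entry = json.loads(line)
        payload = entry.get("textPayload")
        if payload:
            yield payload


def on_new_log_file(event, context, read_lines):
    """Handle a 'new file created' event for one exported log file.

    read_lines is a hypothetical helper that returns the lines of the
    object at the given gs:// URI; it is injected to keep this sketch
    free of cloud dependencies.
    """
    uri = f"gs://{event['bucket']}/{event['name']}"
    return {"source": uri, "lines": list(text_payloads(read_lines(uri)))}
```

In a deployed version, the returned log lines would be parsed and written to the data warehouse rather than returned to the caller.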
  • 28. Analysis Pipeline
      • Google Stackdriver: Logs of slow MongoDB queries
      • Google Cloud Storage: Data lake to store logs
      • Google Cloud Functions: Parse logs
      • Google BigQuery: Data warehouse to enable analytics on logs
      • Analysis to generate index recommendations from logs
      How do we rank the importance of an index recommendation?
      1. Frequency of slow logs including the field
      2. Free memory
      3. Cardinality
      4. Average search space reduction: If an index had been set, how many documents could have been skipped (in queries of the last week)?
        o This is also used to order fields within compound indexes, so fields with the highest search space reduction are in the first position
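Two of the four ranking criteria, slow-log frequency and average search space reduction, can be computed directly from parsed slow-query records. The sketch below is an assumption-laden illustration rather than the scoring used in the talk: it approximates the documents an index would still need to touch by `nreturned`, combines the two criteria by simple multiplication, and leaves out free memory and cardinality. All function names are hypothetical, and only the first sample record is taken from the log line shown on the slides; the other two are invented.

```python
from collections import defaultdict


def rank_index_candidates(slow_queries):
    """Rank fields by slow-query frequency times mean search space reduction."""
    totals = defaultdict(lambda: {"count": 0, "reduction_sum": 0.0})
    for query in slow_queries:
        examined = query["docs_examined"]
        if examined == 0:
            continue  # nothing was scanned, so nothing could be skipped
        # Share of examined documents that an index could have skipped (approximation).
        reduction = 1 - query["nreturned"] / examined
        for field in query["fields"]:
            totals[field]["count"] += 1
            totals[field]["reduction_sum"] += reduction
    ranked = [
        {
            "field": field,
            "slow_queries": t["count"],
            "mean_reduction": t["reduction_sum"] / t["count"],
        }
        for field, t in totals.items()
    ]
    # Assumed combination of criteria 1 and 4: frequency times mean reduction.
    ranked.sort(key=lambda r: r["slow_queries"] * r["mean_reduction"], reverse=True)
    return ranked


def order_compound_fields(ranked, fields):
    """Put fields with the highest mean search space reduction first."""
    reduction = {r["field"]: r["mean_reduction"] for r in ranked}
    return sorted(fields, key=lambda f: reduction.get(f, 0.0), reverse=True)


sample = [
    {"fields": ["custom.fields.customerReference"], "docs_examined": 1447113, "nreturned": 1},
    {"fields": ["custom.fields.customerReference"], "docs_examined": 1447113, "nreturned": 1},
    {"fields": ["category"], "docs_examined": 100000, "nreturned": 95000},
]
ranked = rank_index_candidates(sample)
for row in ranked:
    print(row)
```

The customer reference field ranks first because it appears in more slow queries and nearly every scanned document could have been skipped, while the category field saves only about 5% of the scan.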
  • 29. Analysis Pipeline
      • Google Stackdriver: Logs of slow MongoDB queries
      • Google Cloud Storage: Data lake to store logs
      • Google Cloud Functions: Parse logs
      • Google BigQuery: Data warehouse to enable analytics on logs
      • Analysis to generate index recommendations from logs
      Monitoring index performance (figure: performance over time, with the time when the index was set marked)
  • 30. Analysis Pipeline
      • Google Stackdriver: Logs of slow MongoDB queries
      • Google Cloud Storage: Data lake to store logs
      • Google Cloud Functions: Parse logs
      • Google BigQuery: Data warehouse to enable analytics on logs
      • Analysis to generate index recommendations from logs
        o Convert notebook to HTML to share
        o Include details and make recommendations explainable
  • 33. Analysis Pipeline
      • Google Stackdriver: Logs of slow MongoDB queries
      • Google Cloud Storage: Data lake to store logs
      • Google Cloud Functions: Parse logs
      • Google BigQuery: Data warehouse to enable analytics on logs
      • Analysis to generate index recommendations from logs
        o Run analysis as weekly cron job
      • Let engineers review index recommendations
      • Add index
      • Store history of changed indexes
  • 34. Future Plans
      • Improve index recommendations to make the review step unnecessary and come closer to the goal of full automation.
      • Use machine learning to improve accuracy:
        o Create a training dataset to train a model to predict optimal indexes from query patterns.
        o Maybe experiment with machine-learning-based indexes.
      • Open source
  • 35. Thanks for listening! Any questions?
      Connect:
      • twitter.com/AmadeusMagrabi
      • linkedin.com/in/amadeusmagrabi
      • medium.com/@amadeus.magrabi
      www.commercetools.com
      • Office Munich: Adams-Lehmann-Str. 44, 80797 Munich, +49 89 99 82 996-0
      • Office Berlin: Sonnenallee 223, 12057 Berlin, +49 30 67 24 21-20
      • Durham, NC: 318 Blackwell St, Suite 240, Durham, NC 27701, +1 212 220 3809