Confidential and Proprietary. © 2018 Bazaarvoice, Inc.
Replatforming: Switching to MongoDB
For Flexibility, Scalability, Performance, and Simplicity
September 26, 2018
Ani Hammond
Sr Staff Software Engineer, Bazaarvoice
whoami
• Senior Staff Software Engineer and Tech Lead at Bazaarvoice
• Currently excited about serverless applications and distributed services
• Always excited about simple, intuitive products with a clear mission
• GitHub: aniham
• Email: ani.popova@gmail.com
What is Bazaarvoice?
• Based in Austin, TX; 700 employees worldwide; recently taken private
• Black Friday: 530M page views
• Cyber Monday: 470M page views
• 6000 page views / sec
What is Curations?
• Social collection
• Content enrichment
• Social outreach
• Targeted display
Legacy Platform
• $60,000/mo; each client adds a few hundred/month
• Monolithic stack: Python/Django, MySQL database
• Single-tenant: cluster per client, ~400 clusters
• Multi-tenant services: social outreach, display
Legacy Platform: Issues
• Maintainability
• Debugging
• Patching
• Releasing
• Managing data
• Cost
• Single-tenant clusters (RDS, EC2)
• Elasticsearch cluster
• ETL and eventual consistency
• Elasticsearch usability
• MySQL usability
Picking a new DB: Considerations
• Support different access patterns
• Able to scale as the client base and content volume grow
• Be our own database administrator
Service Read Volume Write Volume Query Complexity Fault Tolerance
Collect
Enrich
Display
Picking a new DB: Early dev and experiments
• Prototype advantages
  • Easy to use
  • Flexible schema
  • Easy to export and share
• Some early numbers
• A note about indexes
  • No indexes to start
  • Added as needed
New Platform
• $6,500/mo
• All services multi-tenant
• Display Service: constant high reads
• Enrichment Service: constant complex reads, simple updates
• Management Service: low complex reads, simple updates
• Collection Service: bursty high writes
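As a rough check on the two monthly figures quoted for the legacy and new platforms, the arithmetic can be sketched as below. The per-cluster average is only an illustration: it assumes the whole $60,000 legacy bill is spread evenly over the ~400 clusters, which the slides do not state.

```python
# Monthly costs quoted on the Legacy Platform and New Platform slides.
legacy_monthly = 60_000
new_monthly = 6_500

# Absolute and relative savings.
savings = legacy_monthly - new_monthly
reduction_pct = savings / legacy_monthly * 100

# Assumption (not from the slides): the legacy bill is spread evenly
# across the ~400 single-tenant clusters.
legacy_clusters = 400
avg_per_cluster = legacy_monthly / legacy_clusters

print(f"Monthly savings: ${savings:,}")
print(f"Reduction: {reduction_pct:.1f}%")
print(f"Average legacy cost per cluster: ${avg_per_cluster:.0f}/mo")
```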
DevOps
• First iteration: provision by hand
  • Cheap
  • Laborious
  • Not viable long term
• Second iteration: Cloud Manager
  • Cheap
  • Fast
  • Totally reasonable option
• Third (current) iteration: Atlas
  • Cheaper than dedicated DevOps
  • Fast
  • Insights into indexes, long-running queries, performance glitches, and more
  • Push-button upgrades and scaling
Indexing and Optimizations
• Start from zero
• Best guess on what works
• Iterate
• Kill the unused
Problem 1: ...
Manifestation (what happened)
• Database kept failing over
• Not responsive for long periods of time
Detection (how did we find out)
• Board metrics indicated high response time
• Further digging indicated >30K DB connections
Solution (how did we solve things)
• Connection pools
Lesson (what did we learn)
• Failover is expensive
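The "connection pools" fix can be illustrated with a minimal, standard-library sketch. It is not the team's code: `FakeConnection`, `ConnectionPool`, and `run_workers` are hypothetical names, and a real service would normally rely on the pooling built into its MongoDB driver (for example, PyMongo's `MongoClient` accepts a `maxPoolSize` option). The point is only that many concurrent requests share a small, fixed set of connections instead of each opening its own.

```python
import queue
import threading
from contextlib import contextmanager


class FakeConnection:
    """Hypothetical stand-in for a real database connection."""
    opened = 0
    _lock = threading.Lock()

    def __init__(self):
        # Count how many connections are ever opened.
        with FakeConnection._lock:
            FakeConnection.opened += 1

    def query(self, text):
        return f"result of {text}"


class ConnectionPool:
    """Bounded pool: exactly max_size connections are opened and reused."""

    def __init__(self, max_size):
        self._idle = queue.Queue(maxsize=max_size)
        for _ in range(max_size):
            self._idle.put(FakeConnection())

    @contextmanager
    def connection(self, timeout=5.0):
        # Blocks while every connection is checked out.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Return the connection for reuse instead of closing it.
            self._idle.put(conn)


def run_workers(pool, n_requests):
    """Serve n_requests concurrently through the shared pool."""
    results = []
    results_lock = threading.Lock()

    def handle(i):
        with pool.connection() as conn:
            result = conn.query(f"request {i}")
        with results_lock:
            results.append(result)

    threads = [threading.Thread(target=handle, args=(i,)) for i in range(n_requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


pool = ConnectionPool(max_size=4)
results = run_workers(pool, n_requests=100)
print(len(results), "requests served over", FakeConnection.opened, "connections")
```

The pool size caps the load placed on the database no matter how many requests arrive; the trade-off is that requests wait when all connections are busy.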
Problem 2: ...
Manifestation (what happened)
• Display response time > 6 seconds
Detection (how did we find out)
• Board metrics indicated DB queries taking 5 seconds
• Atlas was indicating queries taking < 100ms
Solution (how did we solve things)
• Root cause: the discrepancy was due to the Lambdas’ connections to MongoDB
• Fix: switched from Lambdas to Dockerized services
Lesson (what did we learn)
• Don’t use Lambdas for a constant workload
Problem 3: ...
Manifestation (what happened)
• Rules taking 30 min to execute despite multiple indexes
• DB ops taking minutes to complete
Detection (how did we find out)
• Board metrics indicated poor rule execution time
Solution (how did we solve things)
• Root cause: rules perform actions on matching content, but unmatched content was still scanned in subsequent executions
• Fix: exclude previously unmatched content from scanning
Lesson (what did we learn)
• Don’t rescan if you don’t have to
• Don’t let your DB do all of your work for you
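One possible way to avoid rescanning is sketched below, under the assumption that each content item carries a monotonically increasing sequence number (`seq`) and that each rule stores a watermark of what it has already evaluated. The slides do not say how the exclusion was actually implemented; `Rule`, `run_rule`, and `last_seen_seq` are hypothetical names, and in a real database the filter would be a query predicate rather than a list comprehension.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    name: str
    matches: Callable[[dict], bool]
    # Highest sequence number this rule has already evaluated (hypothetical field).
    last_seen_seq: int = 0


def run_rule(rule: Rule, content: List[dict], action: Callable[[dict], None]) -> int:
    """Evaluate only content newer than the rule's watermark.

    Returns the number of items scanned in this execution.
    """
    pending = [item for item in content if item["seq"] > rule.last_seen_seq]
    for item in pending:
        if rule.matches(item):
            action(item)
    if pending:
        # Matched and unmatched items alike are skipped next time.
        rule.last_seen_seq = max(item["seq"] for item in pending)
    return len(pending)


def tag_positive(item: dict) -> None:
    item["tags"].append("positive")


texts = ["love it", "meh", "love this", "ok", "bad"]
content = [{"seq": i, "text": t, "tags": []} for i, t in enumerate(texts, start=1)]
rule = Rule(name="tag-love", matches=lambda item: "love" in item["text"])

first_scan = run_rule(rule, content, tag_positive)

# New content arrives; only these two items are scanned on the next run.
content.append({"seq": 6, "text": "love again", "tags": []})
content.append({"seq": 7, "text": "nope", "tags": []})
second_scan = run_rule(rule, content, tag_positive)

print(first_scan, second_scan)
```

A watermark like this only works if old content never needs re-evaluation; if a rule or an item changes, the watermark would have to be reset for the affected content.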
Problem 4: ...
Manifestation (what happened)
• Bad code caused data corruption
Detection (how did we find out)
• Client complaints hit us like a wet mop
Solution (how did we solve things)
• Atlas point-in-time recovery
• Cherry-pick client enrichment actions since recovery (~12 hours)
• Aggregations proved helpful to cross-reference what was changed when
Lesson (what did we learn)
• Keep audits
• Have a solid recovery plan
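The recovery described above (restore to a point in time, then re-apply the client actions made since) can be sketched as follows. This is an illustration under assumptions, not the team's procedure: the audit-entry shape (`at`, `actor`, `content_id`, `change`), the `replay_since` helper, and the sample timestamps are all hypothetical.

```python
from datetime import datetime, timedelta


def replay_since(snapshot, audit_log, restored_at, is_client_action):
    """Re-apply audited client actions recorded after the restore point."""
    # Copy the restored snapshot so the input is left untouched.
    state = {content_id: dict(doc) for content_id, doc in snapshot.items()}
    replayed = []
    for entry in sorted(audit_log, key=lambda e: e["at"]):
        if entry["at"] > restored_at and is_client_action(entry):
            state.setdefault(entry["content_id"], {}).update(entry["change"])
            replayed.append(entry)
    return state, replayed


# Hypothetical restore point and audit trail.
restored_at = datetime(2018, 1, 1, 0, 0)
snapshot = {"c1": {"status": "approved"}, "c2": {"status": "pending"}}
audit_log = [
    {"at": restored_at - timedelta(hours=1), "actor": "client",
     "content_id": "c1", "change": {"status": "approved"}},   # already in snapshot
    {"at": restored_at + timedelta(hours=2), "actor": "client",
     "content_id": "c2", "change": {"status": "approved"}},   # replayed
    {"at": restored_at + timedelta(hours=3), "actor": "buggy-job",
     "content_id": "c1", "change": {"status": "rejected"}},   # corruption, skipped
    {"at": restored_at + timedelta(hours=5), "actor": "client",
     "content_id": "c1", "change": {"tags": ["summer"]}},     # replayed
]

state, replayed = replay_since(
    snapshot, audit_log, restored_at, lambda e: e["actor"] == "client"
)
print(state)
```

Without an audit trail that records who changed what and when, there is nothing to replay, which is why "keep audits" and "have a solid recovery plan" go together.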
Anticipated Future Issues
• Scale and size/cost
• How we’ll address
  • Clean up unused content
  • Partial indexes
Nice Side Effects
• Ability to give a read-only view to our services team
• An accidental test case for the rest of the company
• Many teams are using MongoDB that they provision and manage themselves
• No maintenance
Mentions that don’t need a separate slide
• The text index is not for everyone
• Hint is good
  • Even when you think MongoDB will pick the right index to use, it sometimes doesn’t
  • Doesn’t work with updates :(
Final thoughts
• Bottlenecks happen, services break, requirements change, products evolve
• What makes a good datastore is not infallibility, but the tools and ability to
  • Detect issues fast
  • Diagnose
  • Develop fast and recover
• Agility! Iteration!
Thanks
• Sebastian Wong, Kenney Wong, Frank Licea, Paul Durivage
• Praveen Kalamegham
Q & A

MongoDB.local Austin 2018: Replatforming: Switching to MongoDB for Flexibility, Scalability & Performance

  • 1. Confidential and Proprietary. © 2018 Bazaarvoice, Inc.1 Replatforming: Switching to MongoDB For Flexibility, Scalability, Performance, and Simplicity September 26, 2018 Ani Hammond Sr Staff Software Engineer, Bazaarvoice
  • 2. Confidential and Proprietary. © 2018 Bazaarvoice, Inc.2 • Senior Staff Software Engineer and Tech Lead at Bazaarvoice • Currently excited about serverless applications and distributed services • Always excited about simple, intuitive products with a clear mission Github: aniham Email: ani.popova@gmail.com whoami
  • 3. Confidential and Proprietary. © 2018 Bazaarvoice, Inc.3 • Based in Austin, TX; 700 employees worldwide; Recently taken private What is Bazaarvoice? 530M BLACK FRIDAY 470M CYBER MONDAY 6000 PAGEVIEWS / SEC Q A 4.5
  • 4. Confidential and Proprietary. © 2018 Bazaarvoice, Inc.4 What is Curations? • Social collection • Content enrichment • Social outreach • Targeted display
  • 5. Confidential and Proprietary. © 2018 Bazaarvoice, Inc.5 Legacy Platform $60,000/mo Each client adds a few hundred/month Monolithic stack Python/Django MySQL Database Single-tenant Cluster per client ~400 clusters Multi-tenant services Social outreach Display
  • 6. Confidential and Proprietary. © 2018 Bazaarvoice, Inc.6 Legacy Platform: Issues • Maintainability • Debugging • Patching • Releasing • Managing data • Cost • Single-tenant clusters (RDS, EC2) • Elasticsearch cluster • ETL and eventual consistency • Elasticsearch usability • MySQL usability
  • 7. Confidential and Proprietary. © 2018 Bazaarvoice, Inc.7 • Support different access patterns Picking a new DB: Considerations • Able to scale as the client base and content volume grows • Be our own database administrator Service Read Volume Write Volume Query Complexity Fault Tolerance Collect Enrich Display
  • 8. Confidential and Proprietary. © 2018 Bazaarvoice, Inc.8 • Prototype advantages • Easy to use • Flexible schema • Easy to export and share Picking a new DB: Early dev and experiments • Some early numbers • A note about indexes • No indexes to start • Added as needed
  • 9. Confidential and Proprietary. © 2018 Bazaarvoice, Inc.9 New Platform $6,500/mo All services multi-tenant Display Service Constant high reads Enrichment Service Constant complex reads Simple updates Management Service Low complex reads Simple updates Collection Service Bursty high writes
  • 10. Confidential and Proprietary. © 2018 Bazaarvoice, Inc.10 DevOps Cloud Manager SECOND ITERATION Cheap Fast Totally reasonable option Atlas THIRD (CURRENT) ITERATION Cheaper than dedicated DevOps Fast Insights into indexes, long running queries, performance glitches, and more Push button upgrades and scaling Provision by hand FIRST ITERATION Cheap Laborious Not viable long term
  • 11. Confidential and Proprietary. © 2018 Bazaarvoice, Inc.11 • Start from zero • Best guess on what works • Iterate • Kill the unused Indexing and Optimizations
  • 12. Confidential and Proprietary. © 2018 Bazaarvoice, Inc.12 Problem 1: ... Solution HOW DID WE SOLVE THINGS Connection pools Lesson WHAT DID WE LEARN Failover is expensive Detection Board metrics indicated high response time Further digging indicated >30K DB connections HOW DID WE FIND OUT Manifestation Database kept failing over Not responsive for long periods of time WHAT HAPPENED
  • 13. Confidential and Proprietary. © 2018 Bazaarvoice, Inc.13 Problem 2: ... Solution HOW DID WE SOLVE THINGS Discrepancy due to Lambdas’ connections to MongoDB Switched from Lambdas to Dockerized services Lesson WHAT DID WE LEARN Don’t use Lambdas for constant workload Detection Board metrics indicated DB queries taking 5 seconds Atlas was indicating queries taking < 100ms HOW DID WE FIND OUT Manifestation Display response time > 6 seconds WHAT HAPPENED
  • 14. Problem 3: ... Manifestation (what happened): Rules taking 30 min to execute despite multiple indexes; DB ops taking minutes to complete. Detection (how did we find out): Board metrics indicated poor rule execution time. Solution (how did we solve things): Rules perform actions on matching content, but unmatched content was still scanned in subsequent executions; exclude scanning previously unmatched content. Lesson (what did we learn): Don’t rescan if you don’t have to; don’t let your DB do all of your work for you.
  • 15. Problem 4: ... Manifestation (what happened): Bad code caused data corruption. Detection (how did we find out): Client complaints hit us like a wet mop. Solution (how did we solve things): Atlas point in time recovery; cherry pick client enrichment actions since recovery (~12 hours); aggregations proved helpful to cross-reference what was changed when. Lesson (what did we learn): Keep audits; have a solid recovery plan.
  • 16. Anticipated Future Issues • Scale and size/cost • How we’ll address • Cleanup unused content • Partial indexes
  • 17. Nice Side Effects • Ability to give read only view to our services team • An accidental test case for the rest of the company • Many teams are using MongoDB they provision and manage themselves • No maintenance
  • 18. Mentions that don’t need a separate slide • The text index is not for everyone • Hint is good • Even when you think MongoDB will pick the right index to use, it sometimes doesn’t • Doesn’t work with updates :(
  • 19. Final thoughts • Bottlenecks happen, services break, requirements change, products evolve • What makes a good datastore is not infallibility, but the tools and ability to • Detect issues fast • Diagnose • Develop fast and recover • Agility! Iteration!
  • 20. Thanks • Sebastian Wong, Kenney Wong, Frank Licea, Paul Durivage • Praveen Kalamegham
  • 21. Q & A
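
The fix on slide 14 (only scan content updated since the rule last ran) can be sketched roughly as below. This is a toy in-memory simulation, not the team's code: the field names `color` and `updated_at` and the helper `run_rule` are hypothetical, and in MongoDB the same idea would be a filter along the lines of `{"color": "red", "updated_at": {"$gt": last_run_at}}`, presumably backed by an index that covers the timestamp.

```python
from datetime import datetime, timedelta


def run_rule(docs, color, last_run_at):
    """Return (scanned, matched) counts for one rule execution.

    Only documents updated after last_run_at are examined. Passing None
    means the rule has never run, so everything must be scanned once.
    """
    candidates = [
        d for d in docs
        if last_run_at is None or d["updated_at"] > last_run_at
    ]
    matched = [d for d in candidates if d["color"] == color]
    return len(candidates), len(matched)


# 10,000 documents, 1,000 of them red, all written at t0 (hypothetical data).
t0 = datetime(2018, 9, 26, 12, 0)
docs = [
    {"_id": i, "color": "red" if i < 1000 else "blue", "updated_at": t0}
    for i in range(10_000)
]

# First execution: nothing has been scanned before, so all docs are examined.
first_scanned, first_matched = run_rule(docs, "red", last_run_at=None)

# Fifteen minutes later, 50 documents change; only those need a second look.
t1 = t0 + timedelta(minutes=15)
for d in docs[5000:5050]:
    d["color"] = "red"
    d["updated_at"] = t1

second_scanned, second_matched = run_rule(docs, "red", last_run_at=t0)
```

Without the timestamp clause, the second execution would have walked through the 9,000 non-matching documents again; with it, only the 50 changed documents are examined.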

Editor's Notes

  1. Hello everyone, I’m so glad to be here. My name is Ani Hammond. Today I’m going to talk to you about my team's journey replatforming and the important role that MongoDB played in it. I’ll show you guys what our old stack looked like, what our new (and much better) stack looks like now, and obviously how Mongo fits in it. Then I’ll go over some interesting issues we encountered with our new platform and the solutions we came up with.
  2. Who am I? I’m a Software Engineer and Tech Lead at Bazaarvoice. Spoiled Westerner: I like it when things are easy for me and I only get to do things I like to do, so my passions change over time and I get excited about different things. But currently I’m interested in serverless applications and distributed services, and I’m always excited about simple, intuitive products with a clear mission. I know it sounds like a stretch to call a database technology a simple product, but I think MongoDB fits my description perfectly because of how easy it is to develop with. I’m going to make that clearer later on in the presentation. Software engineer at Bazaarvoice… [next slide]
  3. What is Bazaarvoice? Our mission at Bazaarvoice is to connect brands and retailers to consumers. What that means in non-marketing speak is that most of the user-generated content on brand and retailer sites is flowing through our network. By user-generated content, I mean ratings & reviews, Q&A, or social content. Here’s a random collection of logos that Marketing said I could show, but to give you a better idea of our prevalence: if you’re shopping online anywhere other than Amazon and reading a review, it’s probably powered by us. To give you an idea of the scale we deal with, here are some stats from last year's Black Friday and Cyber Monday. On Black Friday we had 530 million total page views on our network, which is over 6000/second. On Cyber Monday we had 470 million total page views, which is just under 5500/second. That’s a total of a billion page views from just those 2 days. What does a pageview imply? Each one implies multiple API calls fanning out to dozens of services. My team, Curations, built and supports some of those services.
  4. What is Curations? In short, the Curations platform allows a brand or retailer to display relevant social content in the path of purchase on their e-commerce site. Let me walk you through the flow. [CLICK] Someone posts a cute picture of their child wearing Gymboree rain boots on their Instagram. Using Curations, Gymboree is watching for content that mentions certain hashtags about their brand. The Curations social collection service picks up that post [CLICK] and shows it to Gymboree in the Curations application. Once in Curations, the post can be enriched in various automatic and manual ways. Enrichment means things like moderation approval and product identification. It can be done manually by the client, or by a set of automatic rules that define their needs. An example of an automatic rule would be to reject all content that includes profanity. [CLICK] Once the content is moderation approved, the Curations platform reaches out on behalf of Gymboree to request permission from the author of the post to use their cute picture on Gymboree’s e-commerce site. You probably can’t make it out, but here you’d see a comment from Gymboree followed by approval from the user. [CLICK] Finally, now that we have the author’s permission, the post is shown in a Curations-powered display on Gymboree’s site. Are there any questions? Ok good. So basically we collect the data, we enrich it, and we display it.
  5. How does this work? All of our infrastructure is in AWS. [CLICK] In our legacy platform, every client (in the previous example, Gymboree) had their own cluster, which consisted of a MySQL RDS instance, one or more EC2 instances running a Python/Django stack, and a load balancer. So for roughly 400 Curations clients, we needed 400 clusters [CLICK]. [CLICK] Outside of these clusters, we had a couple of multi-tenant services: a social outreach service responsible for requesting author permissions, and a display service responsible for returning enriched content to all of our clients’ sites. To meet display-level scale, which we talked about before, we would ETL our enriched content to a Bazaarvoice-wide Cassandra ring before indexing it into an Elasticsearch cluster for efficient querying. Clearly this is a challenging stack to manage. Just look at how many different types of datastores we’re dealing with: we have MySQL, we have Cassandra, we have Elasticsearch. And it’s not cheap either. [CLICK] This came out to $60k/mo, and each additional client would add a few hundred dollars per month. And this doesn’t even include the cost of all the beer we had to drink to be able to put up with this nightmare.
  6. So! Why a nightmare? [CLICK] Well, for one, there’s not much satisfaction in maintaining a platform that’s so obviously ineffective in terms of cost. When your every solution to scale is “throw more hardware slash money at it”, it’s hard to feel innovative, especially when you know better solutions exist. I already mentioned the cost of adding a single client EC2/RDS cluster, and that only becomes more expensive as this data gets ETL’d and re-indexed in Elasticsearch and so on. [CLICK] Then there’s the issue of maintainability. Imagine a scenario where a team member gets paged in the middle of the night because some client’s RDS volume is running out of space. Now, for some of my teammates that meant waking up in the middle of the night and handling it. For me, as a lazier and less conscientious person, it meant turning off my phone, sleeping through the night, and handling it after a leisurely breakfast. Regardless, it had to be handled, and it involved someone logging into AWS and manually resizing a single database instance. Not to bring up this point again too much, but any resize also meant more money spent on a particular database. Debugging was hard, patching and releasing anything to 400 systems was just a nightmare, and managing data (GDPR!) was a huge pain. A lot of effort was spent on maintenance when none of us really wanted to do that. You guys saw our product. I think it’s so cool and it’s great at what it does, and we wanted to work on making it better. But instead we were all dealing with devops AND we had a designated devops engineer who would babysit clusters and run “ansible scripts” (whatever the hell those are). [CLICK] Any system that relies on an ETL is also bound to have lag, so that’s yet another can of worms. [CLICK] And then usability. I know a lot of people love Elasticsearch and it’s really awesome at what it does. But, personally, I find the query language super verbose and non-intuitive. Plenty of this could be lack of experience and expertise in Elasticsearch, but I knew ten times as much in half the time when I started using Mongo, so I think that speaks volumes about its ease of use. [CLICK] And I’m not going to go into SQL; there are probably half a dozen talks about it going on right now (but mine is better!). Knowing what we knew about SQL and Elasticsearch, as we started talking about replatforming, we also started considering different options for our next database.
  7. So what considerations did we have when picking the new database? [CLICK] If you remember from my earlier slide, the Curations platform does three main things: collect, enrich, and display. They each have different access patterns. COLLECT is high volume writes, but it’s more fault tolerant. If you don’t collect for a few minutes or even a couple of hours, it’s not the end of the world and usually nobody’s the wiser. ENRICH is complex querying and moderate volume read/write. And the queries can be as complex as the user chooses to make them. We often see things like “get me all Twitter content from this geolocation mentioning #babyclothes and send it for human moderation”. The third component, DISPLAY, is high volume reads (about 300 requests per second) with no tolerance for latency or outage. Content is displayed on retailer e-commerce sites in the path of purchase. If it doesn’t show up, it can’t influence and less stuff gets sold :) So we needed to be able to support all those different access patterns. [CLICK] Next, we need to be able to support a growing number of clients and volume of content. Not only does each of our clients see organic growth of 10-20%, but since this is a newer product the number of clients is also growing every quarter. [CLICK] Finally, our team must be able to self-manage (i.e., be our own DBA). But honestly, we didn’t want to have to think about this at all (or we wanted to think about it as little as possible). We had already settled on Node JS as our language and we knew we’d be using AWS Lambda, Elastic Beanstalk, and a few other AWS services that the team had previously had positive experience with. We had several options when deciding on a database. Mongo wasn’t very widely used within the company which, like our previous stack, favored a combination of SQL and Cassandra indexed by Elasticsearch. There was also some pushback from the designated DevOps team at the time, indicating they’d have a hard time supporting Mongo. The question of scale also often came up, usually backed by anecdotal evidence. But the dedicated DevOps team essentially said “you’re on your own” if you choose Mongo.
  8. At the time our development started, we were still not decided on a database. There were strong pushes for both Cassandra and Mongo. In retrospect I see that as a positive, as it allowed us to design a fully database-agnostic platform. [CLICK] However, prototyping with Mongo is just very, very easy: it’s easy to boot up a Mongo instance locally, connect to it through a simple Node JS driver, and do anything you need to do for your testing. Without fully having our schema worked out, Mongo’s schema flexibility made it very easy to change things quickly as needed. It’s also easy for someone working on the collection piece to run a Mongo export and airdrop a bunch of data to someone else who is working on the enrichment piece to test with. So even in our proof of concept phases we very naturally gravitated toward using Mongo. [CLICK] Some numbers we tested with initially. Collection: our collection services ran every 15 minutes and would write about 80,000 documents as fast as possible. It usually took a few seconds and the time was limited more by the social APIs than anything else. In production now we write close to a thousand documents every time we collect. Enrichment services, or rule execution: we tested with about 4,000 rules over 7 million documents. Execution took a few minutes with no indexes. In production now we have about 4,000 rules over 20 million documents. Aaaand we did no display testing until later. [CLICK] A quick side note: I’m going to speak more about indexes later on, but I just want to touch on it for a second here. We consciously used no indexes up front and added them as needed. We did this in part because we didn’t know beforehand what indexes would be helpful and in part because we wanted to prove to ourselves that everything would scale.
  9. What did we end up with? [CLICK] Here we have the collection service, which is a bunch of Lambda functions triggered off of a Kinesis stream (Kinesis being the AWS real-time streaming platform). They hit up the social channel APIs every 10-15 minutes. That’s our bursty high write traffic. This service and all the other ones you’re about to see are written in Node JS. [CLICK] Here is our enrichment service, which is a part of an autoscaling group. These are our constant complex reads and simple updates. [CLICK] With the same access pattern as our enrichment service, our management service allows users to directly log in and approve content or identify products. [CLICK] And last but not least, our display autoscaling group, which is obviously constant high volume simple reads. What datastore is in the middle of all these pieces? Well, you guessed it, it’s a convoluted combination of SQL, Cassandra, and Elasticsearch. No, I’m just kidding, all those other conferences turned me down, so we decided to use Mongo instead! And thank god. So here is our new platform and it now comes with a price tag of [CLICK] $6,500/month. If you’ll remember from the earlier slide, our original cost was $60,000/month, so this new platform is running at roughly 10% of our earlier cost. Massive cost savings, huge performance gains, transactional consistency instead of waiting for stuff to propagate to display, a handful of services instead of hundreds of clusters to maintain. Great stuff; not without its challenges, but we’ll get into those next. It’s worth pointing out that this architecture is completely serverless or containerized. We have Lambdas and a few dockerized autoscaling services. I’ll speak more about Atlas in the next few slides, but in terms of its place here, it fits great into this architecture where we just want to code and not worry about infrastructure.
  10. Let’s talk about our DevOps decisions. [CLICK] In its first iteration our cluster was some EC2 instances we provisioned by hand. We put together a few memory-optimized instances as a replica set, figured out which ports needed access to what, and installed Mongo. It’s cheap, but not super easy to set up, and it wasn’t going to be a viable long-term solution. We actually had an old cluster that was set up by hand and running a side job in production. We never saw any issues with it, but we weren’t going to risk it this time. [CLICK] We did much better on the second iteration. We decided to use Cloud Manager, which is still an option I would recommend to people on a budget. It was again cheap, the installation was easy and fast, and it allowed us to upgrade Mongo versions quickly and scale with the push of a button. A few kinks that we saw (and those are somewhat unique to our setup) had to do with dealing with our own VPCs within Amazon (Cloud Manager didn’t have a seamless integration at the time). However, it allowed us to code fast and forget about our database for the most part. Again, a totally reasonable option. Now, our third iteration was Atlas [CLICK]. What was great about it? Well, it was much cheaper than having a dedicated DevOps engineer. It is super fast to set up and can be set up to scale automatically. It gave us insights into our indexes, long running queries, performance glitches, and more. And updates took no time and no stress on our part whatsoever. We recognize that some of the cooler things about Atlas, like performance analytics and such, can be done by hand. But it’s just tedious, less graphable, and of course, for things like showing database load and so on, Atlas just kills it. So, like I said earlier, between serverless and Atlas, our infrastructure basically manages itself and leaves our hands free to make great products, which is what most of us are passionate about.
  11. How did we decide on our indexes? I already mentioned that we started from zero. We really just wanted to see what works and what doesn’t. In our experience, if we could get an index to narrow down the scan size to thousands of documents, then it struck the balance between index size and performance gain. I can obviously create an index that gets us down to a single document, but the cost of doing that is not worth it. Once we started creating indexes we kind of went with our best guesses on what works. For example, for display we knew tags, client name, and timestamp were going to be in every query. Easy! For the more complex enrichment rules, we really just ballparked our guesses. Sometimes we were right, sometimes we weren’t. A big part of our philosophy is to do what makes sense at the time and build stuff that we can easily iterate on. That applied to our indexes as well. Once we started using Atlas, we realized we weren’t using some of our indexes nearly as often as we thought we would be. We were able to make smart decisions on which ones to kill. An index killed is as valuable as an index added. Why? We want all our indexes to fit in memory. Unused indexes obviously work against that goal.
  12. Shifting gears a little bit, I wanted to talk about some of the problems we’ve encountered in our new platform over the last year, and how we tackled them. We all know nothing in life is easy. We have this shiny new product we built, we got this amazing tool (our database), so we can just cruise from here on out, right? Uhhh actually yes, pretty much, but not quite. The first problem had to do with [CLICK] a random day when our database started just failing over again and again. During failover it would be unresponsive for minutes at a time and the pattern would repeat every hour or so. [CLICK] How did we detect it? Our board metrics indicated high response times. Further digging indicated that we had over 30,000 open database connections at the time of failover (for those of you taking notes, how much does it take to bring down the Primary node? About 30,000 open connections). [CLICK] Tools we used to root cause: Datadog and the Mongo console. [CLICK] And the solution? Once we realized each request to our database was opening a new connection and those connections weren’t being closed fast enough, we switched to using connection pools. The lesson? Failover is not seamless and it’s not cheap. It’s great that it’s there, but it’s better when it doesn’t happen.
  13. Another problem happened in the very early days of our new platform launch. [CLICK] Shortly after onboarding our first live client, we realized that the displays on their site were taking around 6 seconds to load. [CLICK] How did we detect it? Well, for this one the Datadog board was obvious. What’s interesting is that it also indicated our database queries were taking more than 5 seconds; at the same time Atlas was telling us that database queries (for the same request) were taking less than 100 milliseconds. So what gives? [CLICK][CLICK] As it turns out, the discrepancy in the request times had to do with our Lambdas connecting to Mongo. On cold start, a Lambda would take about 5 seconds to connect to Mongo, then the Mongo query would take 100 milliseconds, and it would all get recorded in Datadog as a single transaction. Our solution was a quick switch from running display off of Lambda (which is what we were doing at the time) to the dockerized autoscaling service you guys saw in the earlier diagram.
  14. This next problem has to do with the execution of our complex rules. If you’ll remember from earlier, rules are a set of filters coupled with a set of actions. So for example, your filter is “everything that says rain boots and is moderation approved” and the action is “ask author for permission”. [CLICK] As our rules started growing in complexity, we noticed that for all of them to execute it was sometimes taking 30 minutes or more. Individual database operations were taking minutes to complete despite multiple complex indexes. [CLICK] How did we detect it? Our Atlas board metrics indicated poor rule execution time and [CLICK] obviously Atlas was our tool to root cause the issue. [CLICK] And how did we solve it? Well, we realized that our rules were performing actions on matching content, but unmatched content was still being scanned in subsequent executions. Our solution was to exclude scanning of previously unmatched content. And to do that we included a timestamp in our queries so that they only scanned content updated since the last time a rule ran. The lesson I took from this is don’t rescan content you don’t have to. This is a great example of an issue where someone might say our database isn’t scaling and it’s not able to perform complex queries in reasonable time. Well guess what. Fix your code. Hardware is great, tools are great, but they can only carry you so far. I think we sometimes tend to be sloppier than we should be because hardware is so cheap and easy, but we have to write code responsibly too. Example if needed: Say we have 10000 documents in a collection, and each has a color. I run a query every 15 min to find all the red ones and take some action. The first time I run, I find 1000 and take some action. We tag these so they aren’t scanned next time. But the next time I run, the other 9000 docs that aren’t red still need to be scanned.
  15. And speaking of bad code, the last issue I’ll talk about today is when [CLICK] some bad code caused major data corruption in our database across all clients and most of our content. (It wasn’t me!! Actually it was :() How did we detect it? Well, we didn’t need the boards this time because client complaints started pouring in fast. [CLICK] [CLICK] Our solution? An Atlas point in time recovery. Because we depend on social data, we can actually tolerate data loss pretty reasonably. We rolled back our database to a backup less than 12 hours old, and cherry-picked client enrichment actions since recovery. Aggregations proved very helpful to cross-reference what was changed when. This was a very bad day. It was on a Friday of course, because those things always happen on a Friday. Yet, somehow, Atlas made recovery super easy and as pain-free as we could have hoped for, considering. The lesson: keep an audit and have a solid backup path to recovery. Before this happened, we kept talking about how we need to do a dry run on recovery and we kept saying we’ll do it, but we didn’t until we had to. I’m sure many people in the audience are thinking the same thing now. I really encourage you to do it. You don’t want the first time to be during a production escalation.
  16. Some issues we anticipate we’ll encounter in the future: scale, size, and cost, obviously. How do we plan to address these? One is to clean up unused content. I really feel like most people use a lot less data than they think they use. This is a good opportunity to evaluate and clean up. As our dataset grows, I see us utilizing more partial indexes. For us, recent content is valuable; older content, not so much. For better or worse, no one cares what someone posted on Instagram two years ago. If you can reduce the size of your indexes by making them sparser, by all means do it.
  17. Some surprising side effects arose from being our own devops engineers. Our services team can log in and get a read-only view of all kinds of data that’s not available in standard analytics screens. Unlike with a relational SQL database, there is no need for a deep understanding of a complex schema. It just allows for very intuitive querying that doesn’t take very deep domain knowledge to get your work done. According to our product manager, there’s been an 80% reduction in tickets since the switch, definitely in part due to people being able to get the information they need without developers being involved. Another positive (this one specifically has to do with Atlas) is that other teams are now considering going the hosted route. It’s easier for others to walk a beaten path; there’s less uncertainty, and more successful examples to speak of next time someone mentions “scale”. And I can’t stress this enough: we expected to have to do a little bit of maintenance, and we haven’t had to do any. It’s harder to put a number on things like this. You can say our hosting costs went from 60,000 to 6,500, but it’s harder to gauge how much money we’ve saved by not having to worry about our database. Old platform: 60,000, new platform: 6,500, getting to focus all my time on just development: priceless.
  18. A couple of things that I wanted to mention that didn’t really fit in their own slide. Of course it depends on your use case, but we haven’t been impressed by the text index. It’s huge, it can’t be compounded, and the search doesn’t always behave predictably. I would recommend narrowing your queries down and doing a regex search, if you can. Hint is great. Even when you think Mongo will pick the right index to use, it sometimes doesn’t. So, if you can add hint in your code, do it. Unfortunately it doesn’t work with updates, but since I’m speaking here, Mongo, this is an official request: please fix.
  19. Some final thoughts. Bottlenecks happen, services break, requirements change, products evolve. What makes a good datastore is not infallibility, but the tools and ability to detect issues fast, diagnose, develop fast, and recover. I think that the value of a great datastore or any good tool really is that it allows you to be agile and iterate. And really to do what you’re passionate about, which in our case is code.
  20. Why Cassandra? It's a bit of a Bazaarvoice domain requirement, as that is how our single source of truth datastore works at scale. They picked Cassandra years ago to handle the globally fault-tolerant, high write volume access pattern that we see for ratings and reviews across all our clients.
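
The connection pool fix from note 12 can be illustrated with a small, standard-library-only simulation. It is a sketch of the general idea rather than the team's implementation: `FakeConnection`, `ConnectionPool`, and `handle_without_pool` are made-up names, and in practice the MongoDB drivers maintain their own pool, so the real change is usually to create one client and reuse it instead of opening a connection per request.

```python
import queue


class FakeConnection:
    """Stand-in for a database connection; counts how many were opened."""

    opened = 0

    def __init__(self):
        FakeConnection.opened += 1

    def query(self, q):
        return f"result for {q}"


class ConnectionPool:
    """Fixed-size pool: connections are created once and handed out again."""

    def __init__(self, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(FakeConnection())

    def run(self, q):
        conn = self._idle.get()  # blocks if every connection is busy
        try:
            return conn.query(q)
        finally:
            self._idle.put(conn)  # hand the connection back instead of closing


def handle_without_pool(q):
    """The problematic pattern: a brand new connection for every request."""
    return FakeConnection().query(q)


# 1,000 requests without a pool open 1,000 connections.
FakeConnection.opened = 0
for i in range(1000):
    handle_without_pool(f"q{i}")
opened_without_pool = FakeConnection.opened

# The same 1,000 requests through a pool of 10 open only 10 connections.
FakeConnection.opened = 0
pool = ConnectionPool(size=10)
for i in range(1000):
    pool.run(f"q{i}")
opened_with_pool = FakeConnection.opened
```

The pool size of 10 is arbitrary here. The point is that the number of open connections is bounded by the pool rather than by request volume, which is what keeps a primary from accumulating tens of thousands of connections.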