MongoDB.local Austin 2018: Replatforming: Switching to MongoDB for Flexibility, Scalability & Performance

1. Replatforming: Switching to MongoDB for Flexibility, Scalability, Performance, and Simplicity. September 26, 2018. Ani Hammond, Sr Staff Software Engineer, Bazaarvoice. Confidential and Proprietary. © 2018 Bazaarvoice, Inc.

2. whoami
• Senior Staff Software Engineer and Tech Lead at Bazaarvoice
• Currently excited about serverless applications and distributed services
• Always excited about simple, intuitive products with a clear mission
• Github: aniham; Email: ani.popova@gmail.com

3. What is Bazaarvoice?
• Based in Austin, TX; 700 employees worldwide; recently taken private
• 530M pageviews on Black Friday
• 470M pageviews on Cyber Monday
• 6000 pageviews / sec

5. Legacy Platform
• $60,000/mo; each client adds a few hundred/month
• Monolithic stack: Python/Django, MySQL database
• Single-tenant: cluster per client, ~400 clusters
• Multi-tenant services: social outreach, display

6. Legacy Platform: Issues
• Maintainability: debugging, patching, releasing, managing data
• Cost: single-tenant clusters (RDS, EC2), Elasticsearch cluster
• ETL and eventual consistency
• Elasticsearch usability
• MySQL usability

7. Picking a new DB: Considerations
• Support different access patterns
• Able to scale as the client base and content volume grows
• Be our own database administrator
[Table: services (Collect, Enrich, Display) compared by read volume, write volume, query complexity, and fault tolerance]

8. Picking a new DB: Early dev and experiments
• Prototype advantages: easy to use, flexible schema, easy to export and share
• Some early numbers
• A note about indexes: no indexes to start, added as needed

9. New Platform
• $6,500/mo; all services multi-tenant
• Display Service: constant high reads
• Enrichment Service: constant complex reads, simple updates
• Management Service: low complex reads, simple updates
• Collection Service: bursty high writes

10. DevOps
• First iteration: provision by hand. Cheap; laborious; not viable long term
• Second iteration: Cloud Manager. Cheap; fast; totally reasonable option
• Third (current) iteration: Atlas. Cheaper than dedicated DevOps; fast; insights into indexes, long running queries, performance glitches, and more; push button upgrades and scaling

12. Problem 1: ...
• Manifestation (what happened): database kept failing over; not responsive for long periods of time
• Detection (how did we find out): board metrics indicated high response time; further digging indicated >30K DB connections
• Solution (how did we solve things): connection pools
• Lesson (what did we learn): failover is expensive

13. Problem 2: ...
• Manifestation (what happened): display response time > 6 seconds
• Detection (how did we find out): board metrics indicated DB queries taking 5 seconds; Atlas was indicating queries taking < 100ms
• Solution (how did we solve things): discrepancy due to Lambdas’ connections to MongoDB; switched from Lambdas to Dockerized services
• Lesson (what did we learn): don’t use Lambdas for constant workload

14. Problem 3: ...
• Manifestation (what happened): rules taking 30 min to execute despite multiple indexes; DB ops taking minutes to complete
• Detection (how did we find out): board metrics indicated poor rule execution time
• Solution (how did we solve things): rules perform actions on matching content, unmatched content still scanned in subsequent executions; exclude scanning previously unmatched content
• Lesson (what did we learn): don’t rescan if you don’t have to; don’t let your DB do all of your work for you

15. Problem 4: ...
• Manifestation (what happened): bad code caused data corruption
• Detection (how did we find out): client complaints hit us like a wet mop
• Solution (how did we solve things): Atlas point in time recovery; cherry pick client enrichment actions since recovery (~12 hours); aggregations proved helpful to cross-reference what was changed when
• Lesson (what did we learn): keep audits; have a solid recovery plan

17. Nice Side Effects
• Ability to give read only view to our services team
• An accidental test case for the rest of the company
• Many teams are using MongoDB they provision and manage themselves
• No maintenance

18. Mentions that don’t need a separate slide
• The text index is not for everyone
• Hint is good
• Even when you think MongoDB will pick the right index to use, it sometimes doesn’t
• Doesn’t work with updates :(

19. Final thoughts
• Bottlenecks happen, services break, requirements change, products evolve
• What makes a good datastore is not infallibility, but the tools and ability to detect issues fast, diagnose, develop fast, and recover
• Agility! Iteration!

Editor's Notes

Hello everyone, I’m so glad to be here. My name is Ani Hammond. Today I’m going to talk to you about my team's journey replatforming and the important role that MongoDB played in it. I’ll show you what our old stack looked like, what our new (and much better) stack looks like now, and how Mongo fits into it. Then I’ll go over some interesting issues we encountered with our new platform and the solutions we came up with.
Who am I? I’m a Software Engineer and Tech Lead at Bazaarvoice. Spoiled Westerner: I like it when things are easy for me and I only get to do things I like to do, so my passions change over time and I get excited about different things. Currently I’m interested in serverless applications and distributed services, and I’m always excited about simple, intuitive products with a clear mission. I know it sounds like a stretch to call a database technology a simple product, but I think MongoDB fits my description perfectly because of how easy it is to develop with, and I’ll make that clearer later in the presentation. Software engineer at Bazaarvoice… [next slide]
What is Bazaarvoice? Our mission at Bazaarvoice is to connect brands and retailers to consumers. What that means in non-marketing speak is that most of the user-generated content on brand and retailer sites flows through our network. By user-generated content, I mean ratings & reviews, Q&A, or social content. Here’s a random collection of logos that Marketing said I could show, but to give you a better idea of our prevalence: if you’re shopping online anywhere other than Amazon and reading a review, it’s probably powered by us. To give you an idea of the scale we deal with, here are some stats from last year's Black Friday and Cyber Monday. On Black Friday we had 530 million total page views on our network, which averages over 6,000 per second. On Cyber Monday we had 470 million total page views, which averages just under 5,500 per second. That’s a total of a billion page views from just those 2 days. What does a pageview imply? Each one implies multiple API calls fanning out to dozens of services. My team, Curations, built and supports some of those services.
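The per-second figures quoted above are daily totals averaged over a full day. A quick check of that arithmetic, assuming traffic is spread evenly over 24 hours (real peaks would be higher than these averages):

```javascript
// Average request rate implied by a daily pageview total.
const SECONDS_PER_DAY = 24 * 60 * 60; // 86,400 seconds

function averagePerSecond(dailyTotal) {
  return dailyTotal / SECONDS_PER_DAY;
}

const blackFriday = averagePerSecond(530e6); // roughly 6,134 per second
const cyberMonday = averagePerSecond(470e6); // roughly 5,440 per second

console.log(Math.round(blackFriday), Math.round(cyberMonday));
```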
What is Curations? In short, the Curations platform allows a brand or retailer to display relevant social content in the path of purchase on their e-commerce site. Let me walk you through the flow. [CLICK] Someone posts a cute picture of their child wearing Gymboree rain boots on their Instagram. Using Curations, Gymboree is watching for content that mentions certain hashtags about their brand. The Curations social collection service picks up that post [CLICK] and shows it to Gymboree in the Curations application. Once in Curations, the post can be enriched in various automatic and manual ways. Enrichment means things like moderation approval and product identification. It can be done manually by the client, or by a set of automatic rules that define their needs. An example of an automatic rule would be to reject all content that includes profanity. [CLICK] Once the content is moderation approved, the Curations platform reaches out on behalf of Gymboree to request permission from the author of the post to use their cute picture on Gymboree’s e-commerce site. You probably can’t make it out, but here you’d see a comment from Gymboree followed by approval from the user. [CLICK] Finally, now that we have the author’s permission, the post is shown in a Curations-powered display on Gymboree’s site. Are there any questions? Ok, good. So basically we collect the data, we enrich it, and we display it.
How does this work? All of our infrastructure is in AWS. [CLICK] In our legacy platform, every client (in the previous example, Gymboree) had their own cluster, which consisted of a MySQL RDS instance, one or more EC2 instances running a Python/Django stack, and a load balancer. So for roughly 400 Curations clients, we needed 400 clusters. [CLICK] [CLICK] Outside of these clusters, we had a couple of multi-tenant services: a social outreach service responsible for requesting author permissions, and a display service responsible for returning enriched content to all of our clients’ sites. To meet display-level scale, which we talked about before, we would ETL our enriched content to a Bazaarvoice-wide Cassandra ring before indexing it into an Elasticsearch cluster for efficient querying. Clearly this is a challenging stack to manage. Just look at how many different types of datastores we’re dealing with: we have MySQL, we have Cassandra, we have Elasticsearch. And it’s not cheap either. [CLICK] This came out to $60k/mo, and each additional client would add a few hundred dollars per month. And this doesn’t even include the cost of all the beer we had to drink to be able to put up with this nightmare.
So! Why a nightmare? [CLICK] Well, for one, there’s not much satisfaction in maintaining a platform that’s so obviously ineffective in terms of cost. When your every solution to scale is “throw more hardware slash money at it”, it’s hard to feel innovative, especially when you know better solutions exist. I already mentioned the cost of adding a single client EC2/RDS cluster, and that only becomes more expensive as this data gets ETL’d and re-indexed in Elasticsearch and so on. [CLICK] Then there’s the issue of maintainability. Imagine a scenario where a team member gets paged in the middle of the night because some client’s RDS volume is running out of space. Now, for some of my teammates that meant waking up in the middle of the night and handling it. For me, as a lazier and less conscientious person, it meant turning off my phone, sleeping through the night, and handling it after a leisurely breakfast. Regardless, it had to be handled, and it involved someone logging into AWS and manually resizing a single database instance. Not to belabor the point, but any resize also meant more money spent on a particular database. Debugging was hard, patching and releasing anything to 400 systems was just a nightmare, and managing data (GDPR!) was a huge pain. A lot of effort was spent on maintenance when none of us really wanted to do that. You guys saw our product. I think it’s so cool and it’s great at what it does, and we wanted to work on making it better. But instead we were all dealing with devops AND we had a designated devops engineer who would babysit clusters and run “ansible scripts” (whatever the hell those are). [CLICK] Any system that relies on an ETL is also bound to have lag, so that’s yet another can of worms. [CLICK] And then usability. I know a lot of people love Elasticsearch and it’s really awesome at what it does. But, personally, I find the query language super verbose and non-intuitive.
Plenty of this could be lack of experience and expertise in elasticsearch; but I knew ten times as much in half the time when I started using Mongo, so I think that speaks volumes of its ease of use. [CLICK] And I’m not going to go into SQL, there are probably half a dozen talks about it going on right now (but mine is better!) Knowing what we knew about SQL and elasticsearch, as we started talking about replatforming, we also started considering different options for our next database.
So what considerations did we have when picking the new database? [CLICK] If you remember from my earlier slide, the Curations platform does three main things: collect, enrich, and display. They each have different access patterns. COLLECT is high volume writes, but it’s more fault tolerant. If you don’t collect for a few minutes or even a couple of hours, it’s not the end of the world and usually nobody’s the wiser. ENRICH is complex querying and moderate volume read/write, and the queries can be as complex as the user chooses to make them. We often see things like “get me all Twitter content from this geolocation mentioning #babyclothes and send it for human moderation”. The third component, DISPLAY, is high volume reads (about 300 requests per second) with no tolerance for latency or outage. Content is displayed on retailer e-commerce sites in the path of purchase. If it doesn’t show up, it can’t influence and less stuff gets sold :) So we needed to be able to support all those different access patterns. [CLICK] Next, we need to be able to support a growing number of clients and volume of content. Not only does each of our clients see organic growth of 10-20%, but since this is a newer product the number of clients is also growing every quarter. [CLICK] Finally, our team must be able to self-manage (i.e., be our own DBA). But honestly, we didn’t want to have to think about this at all (or we wanted to think about it as little as possible). We had already settled on Node JS as our language and we knew we’d be using AWS Lambda, Elastic Beanstalk, and a few other AWS services that the team had previously had positive experience with. We had several options when deciding on a database. Mongo wasn’t very widely used within the company which, like our previous stack, favored a combination of SQL and Cassandra indexed by Elasticsearch. There was also some pushback from the designated DevOps team at the time, indicating they’d have a hard time supporting Mongo.
The question of scale also often came up usually backed by anecdotal evidence. But, the dedicated DevOps team essentially said “you’re on your own” if you choose mongo
At the time our development started, we had still not decided on a database; there were strong pushes for both Cassandra and Mongo. In retrospect I see that as a positive, as it allowed us to design a fully database-agnostic platform. [CLICK] However, prototyping with Mongo is just very, very easy. It’s easy to boot up a Mongo instance locally, connect to it through a simple Node JS driver, and do anything you need to do for your testing. Without having our schema fully worked out, Mongo’s schema flexibility made it very easy to change things quickly as needed. It’s also easy for someone working on the collection piece to run a mongo export and airdrop a bunch of data for someone else who is working on the enrichment piece to test with. So even in our proof of concept phases we very naturally gravitated toward using Mongo. [CLICK] Some numbers we tested with initially: our collection services ran every 15 minutes and would write about 80,000 documents as fast as possible. It usually took a few seconds and the time was limited more by the social APIs than anything else. In production now we write close to a thousand documents every time we collect. For enrichment services, or rule execution, we tested with about 4,000 rules over 7 million documents. Execution took a few minutes with no indexes. In production now we have about 4,000 rules over 20 million documents. Aaaand we did no display testing until later. [CLICK] A quick side note: I’m going to speak more about indexes later on, but I just want to touch on it for a second here. We consciously used no indexes up front and added them as needed. We did this in part because we didn’t know beforehand which indexes would be helpful, and in part because we wanted to prove to ourselves that everything would scale.
What did we end up with? [CLICK] Here we have the collection service, which is a bunch of Lambda functions triggered off of a Kinesis stream (Kinesis being the AWS real-time streaming platform). They hit up the social channel APIs every 10-15 minutes. That’s our bursty high write traffic. This service and all the other ones you’re about to see are written in Node JS. [CLICK] Here is our enrichment service, which is part of an autoscaling group. These are our constant complex reads and simple updates. [CLICK] With the same access pattern as our enrichment service, our management service allows users to directly log in and approve content or identify products. [CLICK] And last but not least, our display autoscaling group, which is obviously constant high volume simple reads. What datastore is in the middle of all these pieces? Well, you guessed it, it’s a convoluted combination of SQL, Cassandra, and Elasticsearch. No, I’m just kidding, all those other conferences turned me down, so we decided to use Mongo instead! And thank god. So here is our new platform and it now comes with a price tag of [CLICK] $6,500/month. If you’ll remember from the earlier slide, our original cost was $60,000/month, so this new platform is running at roughly 11% of our earlier cost. Massive cost savings, huge performance gains, transactional consistency instead of waiting for stuff to propagate to display, and a handful of services instead of hundreds of clusters to maintain. Great stuff; not without its challenges, but we’ll get into those next. It’s worth pointing out that this architecture is completely serverless or containerized. We have Lambdas and a few dockerized autoscaling services. I’ll speak more about Atlas in the next few slides, but in terms of its place here, it fits great into this architecture where we just want to code and not worry about infrastructure.
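For reference, the cost comparison works out as follows. This only restates the two monthly figures from the slides; nothing else is assumed:

```javascript
// Monthly hosting cost before and after the replatform (figures from the slides).
const legacyMonthly = 60000;
const newMonthly = 6500;

const ratio = newMonthly / legacyMonthly;           // fraction of the old cost
const monthlySavings = legacyMonthly - newMonthly;  // dollars saved per month

console.log(`${(ratio * 100).toFixed(1)}% of legacy cost, $${monthlySavings} saved per month`);
```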
Let’s talk about our DevOps decisions. [CLICK] In its first iteration our cluster was some EC2 instances we provisioned by hand. We put together a few memory optimized instances as a replica set, figured out which ports needed access to what, and installed Mongo. It’s cheap, but not super easy to set up, and it wasn’t going to be a viable long-term solution. We actually had an old cluster that was set up by hand and running a side job in production. We never saw any issues with it, but we weren’t going to risk it this time. [CLICK] We did much better on the second iteration. We decided to use Cloud Manager, which is still an option I would recommend to people on a budget. It was again cheap, the installation was easy and fast, and it allowed us to upgrade Mongo versions quickly and scale with the push of a button. A few kinks that we saw (and those are somewhat unique to our setup) had to do with dealing with our own VPCs within Amazon (Cloud Manager didn’t have a seamless integration at the time). However, it allowed us to code fast and forget about our database for the most part. Again, a totally reasonable option. Now, our third iteration was Atlas. [CLICK] What was great about it? Well, it was much cheaper than having a dedicated DevOps engineer. It is super fast to set up and can be set up to scale automatically. It gave us insights into our indexes, long running queries, performance glitches, and more. And updates took no time and no stress on our part whatsoever. We recognize that some of the cooler things about Atlas, like performance analytics, can be done by hand. But it’s just tedious, less graphable, and of course, for things like showing database load and so on, Atlas just kills it. So, like I said earlier, between serverless and Atlas, our infrastructure basically manages itself and leaves our hands free to make great products, which is what most of us are passionate about.
How did we decide on our indexes? I already mentioned that we started from zero. We really just wanted to see what works and what doesn’t. In our experience, if we could get an index to narrow down the scan size to thousands of documents, then it struck the balance between index size and performance gain. I can obviously create an index that gets us down to a single document, but the cost of doing that is not worth it. Once we started creating indexes we kind of went with our best guesses on what works. For example, for display we knew tags, client name, and timestamp were going to be in every query - easy! For the more complex enrichment rules, we really just ballparked our guesses. Sometimes we were right, sometimes we weren’t. A big part of our philosophy is to do what makes sense at the time and build stuff that we can easily iterate on. That applied to our indexes as well. Once we started using Atlas, we realized we weren’t using some of our indexes nearly as often as we thought we would be, and we were able to make smart decisions on which ones to kill. An index killed is as valuable as an index added. Why? We want all our indexes to fit in memory. Unused indexes obviously work against that goal.
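Outside of the Atlas UI, index usage can also be inspected with MongoDB's `$indexStats` aggregation stage, which reports an access counter per index. The sketch below is not the team's tooling; it only filters documents shaped like `$indexStats` output. The index names, counts, and threshold are made up for illustration, and `accesses.ops` is assumed to have been converted to a plain number.

```javascript
// Made-up sample shaped like $indexStats output: { name, accesses: { ops } }.
const sampleStats = [
  { name: '_id_', accesses: { ops: 0 } },
  { name: 'client_1_tags_1_timestamp_-1', accesses: { ops: 182340 } },
  { name: 'client_1_geo_1', accesses: { ops: 12 } },
];

// Return the names of indexes accessed fewer than minOps times.
// The mandatory _id index is skipped because it cannot be dropped.
function findRarelyUsedIndexes(indexStats, minOps) {
  return indexStats
    .filter((stat) => stat.name !== '_id_' && stat.accesses.ops < minOps)
    .map((stat) => stat.name);
}

const candidates = findRarelyUsedIndexes(sampleStats, 100);
console.log(candidates);
```

The counters reset when the server restarts, so a low number is only meaningful over a long enough observation window.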
Shifting gears a little bit, I wanted to talk about some of the problems we’ve encountered in our new platform over the last year, and how we tackled them. We all know, nothing in life is easy, we have this shiny new product we built, we got this amazing tool (our database), so we can just cruise from here on out, right? Uhhh actually yes, pretty much, but not quite. The first problem had to do with [CLICK] a random day when our database started just failing over again and again. During failover it would be unresponsive for minutes at a time and the pattern would repeat every hour or so [CLICK] How did we detect it? Our board metrics indicated high response times. Further digging indicated that we had over 30,000 open database connections at the time of failover (for those of you taking notes, how much does it take to bring down the Primary node? About 30,000 open connections) [CLICK] Tools we used to root cause. Datadog and the Mongo console. [CLICK] And the solution? Once we realized each request to our database was opening a new connection and those connections weren’t being closed fast enough, we switched to using connection pools. The lesson? Failover is not seamless and it’s not cheap. It’s great that it’s there, but it’s better when it doesn’t happen.
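The fix described above amounts to creating one client per process and letting every request share its pool, instead of connecting per request. A minimal sketch of that pattern follows. `createClientProvider` and `connect` are hypothetical names; in a real service `connect` would wrap the MongoDB Node.js driver's client, which maintains an internal connection pool (the pool-size option name depends on the driver version).

```javascript
// Hypothetical helper: memoize a single connection promise per process so
// that every request reuses the same client (and its connection pool).
function createClientProvider(connect) {
  let clientPromise = null;

  return function getClient() {
    if (clientPromise === null) {
      clientPromise = connect().catch((err) => {
        clientPromise = null; // forget the failed attempt so a later call can retry
        throw err;
      });
    }
    return clientPromise;
  };
}

// Stand-in for a real driver connection, used only to show the call pattern.
let connectCalls = 0;
const getClient = createClientProvider(() => {
  connectCalls += 1;
  return Promise.resolve({ id: connectCalls });
});

const first = getClient();
const second = getClient(); // reuses the pending or resolved connection
```

Request handlers would then `await getClient()` rather than opening their own connection.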
Another problem happened in the very early days of our new platform launch. [CLICK] Shortly after onboarding our first live client, we realized that the displays on their site were taking around 6 seconds to load [CLICK] How did we detect it? Well for this one the datadog board was obvious. What’s interesting is that it also indicated our database queries were taking more than 5 seconds; at the same time Atlas was telling us that database queries (for the same request) were taking less than 100 milliseconds. So what gives? [CLICK][CLICK] As it turns out, the discrepancy in the request times had to do with our Lambdas connecting to mongo. On cold start, a lambda would take about 5 seconds to connect to mongo, then the mongo query would take 100 milliseconds, and all would get recorded in Datadog as a single transaction. Our solution was a quick switch from running display off of lambda (which is what we were doing at the time) to the dockerized autoscaling service you guys saw in the earlier diagram
This next problem has to do with the execution of our complex rules. If you’ll remember from earlier, rules are a set of filters coupled with a set of actions. So for example, your filter is “everything that says rain boots and is moderation approved” and the action is “ask author for permission”. [CLICK] As our rules started growing in complexity, we noticed that for all of them to execute it was sometimes taking 30 minutes or more. Individual database operations were taking minutes to complete despite multiple complex indexes. [CLICK] How did we detect it? Our Atlas board metrics indicated poor rule execution time and [CLICK] obviously Atlas was our tool to root cause the issue. [CLICK] And how did we solve it? Well, we realized that our rules were performing actions on matching content, but unmatched content was still being scanned in subsequent executions. Our solution was to exclude scanning of previously unmatched content. To do that, we included a timestamp in our queries so that they only scanned content updated since the last time a rule ran. The lesson I took from this is: don’t rescan content you don’t have to. This is a great example of an issue where someone might say our database isn’t scaling and it’s not able to perform complex queries in reasonable time. Well, guess what. Fix your code. Hardware is great, tools are great, but they can only carry you so far. I think we sometimes tend to be sloppier than we should be because hardware is so cheap and easy, but we have to write code responsibly too. Example if needed: say we have 10,000 documents in a collection, and each has a color. I run a query every 15 min to find all the red ones and take some action. The first time I run, I find 1,000 and take some action. We tag these so they aren’t scanned next time. But the next time I run, the other 9,000 docs that aren’t red still need to be scanned.
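The timestamp approach can be sketched as a small filter builder. This is an illustration rather than the team's code: `buildIncrementalRuleFilter` and the `updatedAt` field are hypothetical names, and the rule filter shown is invented to mirror the rain boots example.

```javascript
// Combine a rule's own filter with a "changed since the last run" condition,
// so content that was already evaluated and has not changed is not rescanned.
function buildIncrementalRuleFilter(ruleFilter, lastRunAt) {
  if (!lastRunAt) {
    return ruleFilter; // first execution: nothing has been evaluated yet
  }
  return { $and: [ruleFilter, { updatedAt: { $gt: lastRunAt } }] };
}

// Hypothetical rule: approved content that mentions rain boots.
const ruleFilter = { tags: 'rainboots', moderationStatus: 'approved' };
const lastRunAt = new Date('2018-09-26T12:00:00Z');

const incrementalFilter = buildIncrementalRuleFilter(ruleFilter, lastRunAt);
console.log(JSON.stringify(incrementalFilter));
```

For the date condition to reduce work on the database side, the timestamp field would also need to be covered by an index that the query can use.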
And speaking of bad code, the last issue I’ll talk about today is when [CLICK] some bad code caused major data corruption in our database across all clients and most of our content. (It wasn’t me!! Actually it was :() How did we detect it? Well, we didn’t need the boards this time because client complaints started pouring in fast. [CLICK] [CLICK] Our solution? An Atlas point in time recovery. Because we depend on social data, we can actually tolerate data loss pretty reasonably. We rolled back our database to a backup less than 12 hours old and cherry-picked client enrichment actions since the recovery point. Aggregations proved very helpful to cross-reference what was changed when. This was a very bad day. It was on a Friday of course, because those things always happen on a Friday. Yet, somehow, Atlas made recovery super easy and as pain-free as we could have hoped for, considering. The lesson: keep an audit and have a solid backup path to recovery. Before this happened, we kept talking about how we needed to do a dry run on recovery and we kept saying we’d do it, but we didn’t until we had to. I’m sure many people in the audience are thinking the same thing now. I really encourage you to do it. You don’t want the first time to be a production escalation.
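The talk does not show the aggregations used for the cross-referencing, so the following is only a guess at their general shape. It builds a pipeline over a hypothetical audit collection whose documents carry `clientId`, `contentId`, `action`, and `performedAt` fields (all invented names), using the standard `$match`, `$sort`, and `$group` stages.

```javascript
// Build an aggregation pipeline that lists audited enrichment actions
// performed at or after the restore point, grouped per client.
function buildChangesSincePipeline(restorePoint) {
  return [
    { $match: { performedAt: { $gte: restorePoint } } },
    { $sort: { performedAt: 1 } }, // oldest actions first
    {
      $group: {
        _id: '$clientId',
        actionCount: { $sum: 1 },
        actions: {
          $push: { contentId: '$contentId', action: '$action', performedAt: '$performedAt' },
        },
      },
    },
  ];
}

const restorePoint = new Date('2018-06-01T06:00:00Z'); // made-up restore time
const pipeline = buildChangesSincePipeline(restorePoint);
console.log(JSON.stringify(pipeline, null, 2));
```

The resulting array could be passed to a collection's `aggregate` method; replaying the listed actions against the restored data would be a separate step.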
Some issues we anticipate we’ll encounter in the future: scale, size, and cost, obviously. How do we plan to address these? One is to clean up unused content. I really feel like most people use a lot less data than they think they use. This is a good opportunity to evaluate and clean up. As our dataset grows, I see us making more use of sparse and partial indexes, which cover only a subset of the documents. For us, recent content is valuable; older content, not so much. For better or worse, no one cares what someone posted on Instagram two years ago. If you can reduce the size of your indexes by indexing only the documents you actually query, by all means do it.
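One way to keep an index limited to recent content is MongoDB's partial index option, `partialFilterExpression`, which indexes only documents matching a filter. The sketch below just builds the key and option objects. The field names, index name, and cutoff date are assumptions for illustration, not the team's schema.

```javascript
// Describe a compound index that only covers content created on or after a
// fixed cutoff date.
function recentContentIndex(cutoff) {
  return {
    keys: { client: 1, tags: 1, createdAt: -1 },
    options: {
      name: 'recent_content_by_client_tags',
      partialFilterExpression: { createdAt: { $gte: cutoff } },
    },
  };
}

const spec = recentContentIndex(new Date('2018-01-01T00:00:00Z'));
console.log(JSON.stringify(spec, null, 2));
```

With the Node.js driver the two parts would be passed as `collection.createIndex(spec.keys, spec.options)`. The cutoff is fixed when the index is created, so moving the window means rebuilding the index, and a query can only use the index if its own filter guarantees the same `createdAt` condition.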
Some surprising side effects arose from being our own devops engineers. Our services team can log in and get a read-only view of all kinds of data that’s not available in standard analytics screens. Unlike with a relational database like MySQL, there is no need for a deep understanding of a complex schema. It just allows for very intuitive querying that doesn’t take very deep domain knowledge to get your work done. According to our product manager, there’s been an 80% reduction in tickets since the switch, definitely in part due to people being able to get the information they need without developers being involved. Another positive (this one specifically has to do with Atlas) is that other teams are now considering going the hosted route. It’s easier for others to walk a beaten path; there’s less uncertainty, and there are more successful examples to speak of next time someone mentions “scale”. And I can’t stress this enough: we expected to have to do a little bit of maintenance, and we haven’t had to do any. It’s harder to put a number on things like this. You can say our hosting costs went from 60,000 to 6,500, but it’s harder to gauge how much money we’ve saved by not having to worry about our database. Old platform: 60,000, new platform: 6,500, getting to focus all my time on just development: priceless.
A couple of things that I wanted to mention that didn’t really fit in their own slide. Of course it depends on your use case, but we haven’t been impressed by the text index. It’s huge, it can’t be compounded, and the search doesn’t always behave predictably. I would recommend narrowing your queries down and doing a regex search, if you can. Hint is great. Even when you think Mongo will pick the right index to use, it sometimes doesn’t. So, if you can add hint in your code, do it. Unfortunately it doesn’t work with updates, but since I’m speaking here: Mongo, this is an official request, please fix.
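Both suggestions above, narrowing the query before a regex search and passing an explicit hint, can be sketched together. The helper and field names (`buildContentSearch`, `client`, `body`) and the index name are hypothetical; `$regex`, `$options`, and the `hint` find option are standard MongoDB features.

```javascript
// Escape characters that have special meaning in a regular expression so the
// phrase is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Narrow by client first (an equality match an index can serve), then apply a
// case-insensitive regex to the text, and name the index to use via hint.
function buildContentSearch(client, phrase, indexName) {
  return {
    filter: { client, body: { $regex: escapeRegex(phrase), $options: 'i' } },
    options: { hint: indexName },
  };
}

const search = buildContentSearch('gymboree', 'rain boots (kids)', 'client_1');
console.log(JSON.stringify(search));
```

With the Node.js driver these would be passed as `collection.find(search.filter, search.options)`.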
Some final thoughts. Bottlenecks happen, services break, requirements change, products evolve. What makes a good datastore is not infallibility, but the tools and ability to detect issues fast, diagnose, develop fast, and recover. I think that the value of a great datastore or any good tool really is that it allows you to be agile and iterate. And really to do what you’re passionate about, which in our case is code.
Why Cassandra? It's a bit of a Bazaarvoice domain requirement, as that is how our single source of truth datastore works at scale. They picked Cassandra years ago to handle the globally fault-tolerant, high write volume access pattern that we see for ratings and reviews across all our clients.
