Hello everyone, I’m so glad to be here
My name is Ani Hammond
Today I’m going to talk to you about my team's journey replatforming
And the important role that MongoDB played in it
I’ll show you guys what our old stack looked like, what our new (and much better) stack looks like now, and, of course, how Mongo fits into it
And then I’ll go over some interesting issues we encountered with our new platform and the solutions we came up with
Who am I?
I’m a Software Engineer and Tech Lead at Bazaarvoice
Spoiled Westerner, I like it when things are easy for me and I only get to do things I like to do, so my passions change over time and I get excited about different things.
But currently I’m interested in serverless applications and distributed services and I’m always excited about simple intuitive products with a clear mission.
I know it sounds like a stretch to call a database technology a simple product,
But I think MongoDB fits my description perfectly because of how easy it is to develop with. And I’m going to make that more clear later on in the presentation
Software engineer at Bazaarvoice… [next slide]
What is Bazaarvoice? Our mission at Bazaarvoice is to connect brands and retailers to consumers. What that means in non-marketing speak is that most of the user-generated content on brand and retailer sites is flowing through our network. By user-generated content, I mean ratings & reviews or Q&A or social content.
Here’s a random collection of logos that Marketing said I could show, but to give you a better idea of our prevalence: if you’re shopping online anywhere other than Amazon and reading a review, it’s probably powered by us.
To give you an idea of the scale we deal with, here are some stats from last year's Black Friday and Cyber Monday
On Black Friday we had 530 million total page views on our network which is over 6000/second
On Cyber Monday we had 470 million total page views which is just under 5500/second
For a total of a billion page views from just those 2 days.
What does a pageview imply? Each one implies multiple API calls fanning out to dozens of services.
My team, Curations, built and supports some of those services.
What is Curations? In short, the Curations platform allows a brand or retailer to display relevant social content in the path of purchase on their e-commerce site.
Let me walk you through the flow
[CLICK] Someone posts a cute picture of their child wearing Gymboree rain boots on their Instagram
Using Curations, Gymboree is watching for content that mentions certain hashtags about their brand
The Curations Social collection service picks up that post
[CLICK] And shows it to Gymboree in the Curations application.
Once in Curations, the post can be enriched in various automatic and manual ways. Enrichment means things like moderation approval and product identification. It can be done manually by the client, or by a set of automatic rules that define their needs. An example of an automatic rule would be to reject all content that includes profanity.
[CLICK] Once the content is moderation approved, the Curations platform reaches out on behalf of Gymboree to request permission from the author of the post to use their cute picture on Gymboree’s ecommerce site.
You probably can’t make it out here, but you’d see a comment from Gymboree followed by approval from the user.
[CLICK] Finally, now that we have the author’s permission, the post is shown in a Curations powered display on Gymboree’s site.
Are there any questions? Ok good. So basically we collect the data, we enrich it, and we display it.
How does this work?
All of our infrastructure is in AWS.
[CLICK] In our legacy platform, every client (in previous example, Gymboree), had their own cluster which consisted of a MySQL RDS instance, one or more EC2 instances running a Python/Django stack, and a load balancer.
So for roughly 400 Curations clients, we needed 400 clusters [CLICK].
[CLICK] Outside of these clusters, we had a couple of multi-tenant services. A social outreach service responsible for requesting author permissions. And a display service responsible for returning enriched content to all of our clients’ sites. To handle display-level scale, which we talked about before, we would ETL our enriched content into a Bazaarvoice-wide Cassandra ring before indexing it into an Elasticsearch cluster for efficient querying.
Clearly this is a challenging stack to manage. Just look at how many different types of datastores we’re dealing with: we have MySQL, we have Cassandra, we have Elasticsearch. And it’s not cheap either. [CLICK] This came out to $60k/mo. And each additional client would add a few hundred dollars per month. And this doesn’t even include the cost of all the beer we had to drink to be able to put up with this nightmare.
So! Why a nightmare?
[CLICK] Well, for one, there’s not much satisfaction in maintaining a platform that’s so obviously ineffective in terms of cost.
When your every solution to scale is “throw more hardware slash money at it” it’s hard to feel innovative; especially when you know better solutions exist.
I already mentioned the cost of adding a single client EC2/RDS cluster and that only becomes more expensive as this data gets ETL’d and re-indexed in elasticsearch and so on.
[CLICK] Then there’s the issue of maintainability.
Imagine a scenario where a team member gets paged in the middle of the night that some client’s RDS volume is running out of space. Now, for some of my teammates that meant waking up in the middle of the night and handling it.
For me, as a lazier and less conscientious person, it meant turning off my phone, sleeping through the night, and handling it after a leisurely breakfast. Regardless though, it had to be handled and it involved someone logging into AWS and manually resizing a single database instance.
Not to bring up this point again too much, but any resize also meant more money spent on a particular database.
Debugging was hard, patching and releasing anything to 400 systems was just a nightmare, and managing data (GDPR!) was a huge pain. A lot of effort went into maintenance that none of us really wanted to do. You guys saw our product; I think it’s so cool and it’s great at what it does, and we wanted to work on making it better. But instead we were all dealing with devops, AND we had a designated devops engineer who would babysit clusters and run “ansible scripts” (whatever the hell those are).
[CLICK] Any system that relies on an ETL is also bound to have lag, so yet another can of worms
[CLICK] And then usability.
I know a lot of people love elasticsearch and it’s really awesome at what it does.
But, personally, I find the query language super verbose and non-intuitive.
Plenty of this could be lack of experience and expertise in elasticsearch;
but I knew ten times as much in half the time when I started using Mongo, so I think that speaks volumes about its ease of use.
[CLICK] And I’m not going to go into SQL, there are probably half a dozen talks about it going on right now (but mine is better!)
Knowing what we knew about SQL and elasticsearch, as we started talking about replatforming, we also started considering different options for our next database.
So what considerations did we have when picking the new database?
[CLICK] If you remember from my earlier slide, the curations platform does three main things - collect, enrich and display
They each have different access patterns
COLLECT is high volume writes, but it’s more fault tolerant. If you don’t collect for a few minutes or even a couple hours, it’s not the end of the world and usually nobody’s the wiser
ENRICH is complex querying and moderate volume read/write. And the queries can be as complex as the user chooses to make them. We often see things like “get me all Twitter content from this geolocation mentioning #babyclothes and send it for human moderation”
The third component, DISPLAY, is high volume reads (about 300 requests per second) with no tolerance for latency or outages. Content is displayed on retailer ecommerce sites in the path of purchase. If it doesn’t show up, it can’t influence purchases and less stuff gets sold :)
So we needed to be able to support all those different access patterns
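To make the ENRICH access pattern concrete, a rule like the #babyclothes one might translate into a Mongo filter along these lines. This is a sketch, not our actual schema: the field names `network`, `text`, `geo.coordinates`, and `status`, and the rule shape, are all hypothetical.

```javascript
// Sketch: turning a user-defined enrichment rule into a MongoDB filter.
// All field names here are illustrative.
function ruleToFilter(rule) {
  return {
    network: rule.network,                              // e.g. only Twitter content
    text: { $regex: rule.hashtag, $options: 'i' },      // mentions the hashtag
    'geo.coordinates': {                                // from a geolocation
      $geoWithin: { $centerSphere: [rule.near, rule.radiusRadians] }
    },
    status: 'pending'                                   // not yet sent to moderation
  };
}

const filter = ruleToFilter({
  network: 'twitter',
  hashtag: '#babyclothes',
  near: [-97.74, 30.27],   // [lng, lat]
  radiusRadians: 0.01
});
console.log(JSON.stringify(filter));
```

The point is that even a fairly involved rule stays a plain, readable query document.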
[CLICK] Next, we needed to be able to support a growing number of clients and volume of content. Not only does each of our clients see organic growth of 10-20%, but since this is a newer product, the number of clients is also growing every quarter.
[CLICK] Finally, our team had to be able to self-manage (i.e., be our own DBAs). But honestly, we didn’t want to have to think about this at all (or wanted to think about it as little as possible)
We had already settled on Node JS as our language and we knew we’d be using AWS Lambda, Elastic Beanstalk, and a few other AWS services that the team had previously had positive experiences with
We had several options when deciding on a database
Mongo wasn’t very widely used within the company which, like our previous stack, favored a combination of SQL and cassandra indexed by elasticsearch
There was also some pushback from the designated DevOps team at the time, indicating they’d have a hard time supporting Mongo. The question of scale also came up often, usually backed by anecdotal evidence. Essentially, the dedicated DevOps team said “you’re on your own” if you choose Mongo
At the time our development started, we were still undecided on a database
There were strong pushes for both Cassandra and Mongo. In retrospect I see that as a positive, as it allowed us to design a fully database-agnostic platform
[CLICK] however prototyping with mongo is just very very easy - it’s easy to boot up a mongo instance locally, connect to it through a simple Node JS driver and do anything you need to do for your testing
without fully having our schema worked out, mongo’s schema flexibility made it very easy to change things quickly as needed
it’s also easy for someone working on the collection piece to run a mongo export and airdrop a bunch of data for someone else who is working on the enrichment piece to test with
So even in our proof of concept phases we very naturally gravitated toward using mongo
[CLICK] Some numbers we tested with initially
Our collection services ran every 15 minutes and would write about 80,000 documents as fast as possible. It usually took a few seconds and the time was limited more by the social APIs than anything else. In production now we write close to a thousand documents every time we collect
Enrichment services or rule execution. We tested with about 4,000 rules over 7 million documents. Execution took a few minutes with no indexes. In production now we have about 4,000 rules over 20 million documents
Aaaand we did no display testing until later
[CLICK] A quick side note - I’m going to speak more about indexes later on, but I just want to touch on it for a second here - we consciously used no indexes up front and added them as needed. We did this in part because we didn’t know beforehand which indexes would be helpful, and in part because we wanted to prove to ourselves that everything would scale
What did we end up with?
[CLICK] Here we have the collection service which is a bunch of lambda functions triggered off of a kinesis stream (kinesis being the AWS real time streaming platform). They hit up the social channel APIs every 10-15 minutes. That’s our bursty high write traffic. This service and all the other ones you’re about to see are written in Node JS.
[CLICK] Here is our enrichment service which is a part of an autoscaling group. These are our constant complex reads and simple updates.
[CLICK] Same access pattern as our enrichment service, our management service allows users to directly log in and approve content or identify products.
[CLICK] And last but not least, our display autoscaling group which is obviously constant high volume simple reads.
What datastore is in the middle of all these pieces? Well, you guessed it, it’s a convoluted combination of SQL, Cassandra, and Elasticsearch. No, I’m just kidding, all those other conferences turned me down, so we decided to use Mongo instead! And thank god. So here is our new platform and it now comes with a price tag of [CLICK] $6,500/month. If you’ll remember from the earlier slide, our original cost was $60,000/month so this new platform is running at 10% of our earlier cost.
Massive cost savings, huge performance gains, transactional consistency instead of waiting for stuff to propagate to display, a handful of services instead of hundreds of clusters to maintain. Great stuff; not without its challenges - but we’ll get into those next
It’s worth pointing out that this architecture is completely serverless or containerized. We have lambdas and a few dockerized autoscaling services. I’ll speak more about Atlas in the next few slides, but in terms of its place here, it fits great into this architecture where we just want to code and not worry about infrastructure.
Let’s talk about our DevOps decisions.
[CLICK] In its first iteration our cluster was some EC2 instances we provisioned by hand. We put together a few memory optimized instances as a replica set, figured out what ports to have access to what, installed Mongo.
It’s cheap, but not super easy to set up, and it wasn’t going to be a viable long-term solution.
We actually had an old cluster that was set up by hand and running a side job in production. We never saw any issues with it, but we weren’t going to risk it this time.
[CLICK] We did much better on the second iteration. We decided to use Cloud Manager, which is still an option I would recommend to people on a budget
It was again cheap, the installation was easy and fast, it allowed us to upgrade mongo versions quickly and scale with the push of a button
A few kinks that we saw (and those are somewhat unique to our setup) had to do with dealing with our own VPCs within amazon (cloud manager didn’t have a seamless integration at the time)
However, it allowed us to code fast and forget about our database for the most part. Again, a totally reasonable option
Now, our third iteration was Atlas [CLICK]. What was great about it?
Well it was much cheaper than having a dedicated DevOps engineer
It is super fast to set up, can be set up to scale automatically
It gave us insights into our indexes, long running queries, performance glitches, and more
And updates took no time and no stress on our part whatsoever
We recognize that some of the cooler things about Atlas like performance analytics and such can be done by hand. But it’s just tedious, less graphable, and of course, for things like showing database load and so on, Atlas just kills it.
So, like I said earlier, between serverless and Atlas, our infrastructure basically manages itself and leaves our hands free to make great products which is what most of us are passionate about
How did we decide on our indexes?
I already mentioned that we started from zero. We really just wanted to see what works and what doesn’t.
In our experience, if we could get an index to narrow down the scan size to thousands of documents, then it struck the balance between index size and performance gain. I can obviously create an index that gets us down to a single document, but the cost of doing that is not worth it
Once we started creating indexes we kind of went with our best guesses on what works.
For example, for display we knew tags, client name, and timestamp were going to be in every query - easy!
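As a sketch, the display index and a typical query it serves might look something like this. The field names `client`, `tags`, and `createdAt` are illustrative, not our exact schema.

```javascript
// A compound index over the fields that appear in every display query.
// Field names are illustrative.
const displayIndex = { client: 1, tags: 1, createdAt: -1 };

// In the Node.js driver you'd create it with something like:
//   await collection.createIndex(displayIndex);

// A typical display query this index serves: one client, one tag,
// recent content first.
const displayQuery = {
  client: 'gymboree',
  tags: 'rainboots',
  createdAt: { $gte: new Date(Date.now() - 30 * 24 * 3600 * 1000) }
};
console.log(Object.keys(displayQuery).join(','));
```

Because the query's equality fields lead the index and the sort field trails it, the scan stays small.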
For the more complex enrichment rules, we really just ballparked our guesses. Sometimes we were right, sometimes we weren’t
A big part of our philosophy is to do what makes sense at the time and build stuff that we can easily iterate on. That applied to our indexes as well
Once we started using Atlas, we realized we weren’t using some of our indexes nearly as often as we thought we would be
We were able to make smart decisions on which ones to kill
An index killed is as valuable as an index added. Why? We want all our indexes to fit in memory. Unused indexes obviously work against that goal
Shifting gears a little bit, I wanted to talk about some of the problems we’ve encountered in our new platform over the last year, and how we tackled them. We all know, nothing in life is easy, we have this shiny new product we built, we got this amazing tool (our database), so we can just cruise from here on out, right? Uhhh actually yes, pretty much, but not quite.
The first problem had to do with [CLICK] a random day when our database started just failing over again and again. During failover it would be unresponsive for minutes at a time and the pattern would repeat every hour or so
[CLICK] How did we detect it? Our board metrics indicated high response times. Further digging indicated that we had over 30,000 open database connections at the time of failover (for those of you taking notes, how much does it take to bring down the Primary node? About 30,000 open connections)
[CLICK] Tools we used to root cause. Datadog and the Mongo console.
[CLICK] And the solution? Once we realized each request to our database was opening a new connection and those connections weren’t being closed fast enough, we switched to using connection pools.
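The shape of the fix: create one pooled client at startup and reuse it for every request, instead of opening a connection per request. In the real Node.js driver that means constructing a single `MongoClient` (the pool size is controlled by its options) at module scope. The sketch below simulates the pattern with a stand-in factory so it runs without a live database.

```javascript
// Sketch of client reuse: `createClient` stands in for constructing a
// pooled MongoClient once; every request after the first reuses it.
function makeClientCache(createClient) {
  let client = null;
  let created = 0;
  return {
    get() {
      if (!client) {          // first request creates the (pooled) client
        client = createClient();
        created++;
      }
      return client;          // all later requests reuse the same client
    },
    clientsCreated: () => created
  };
}

// Simulate 1000 incoming requests against a fake client factory.
const cache = makeClientCache(() => ({ db: () => 'fake-db-handle' }));
for (let i = 0; i < 1000; i++) cache.get();
console.log(cache.clientsCreated()); // 1, not 1000
```

One client, one pool, and the 30,000-open-connections failure mode goes away.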
The lesson? Failover is not seamless and it’s not cheap. It’s great that it’s there, but it’s better when it doesn’t happen.
Another problem happened in the very early days of our new platform launch. [CLICK] Shortly after onboarding our first live client, we realized that the displays on their site were taking around 6 seconds to load
[CLICK] How did we detect it? Well, for this one the Datadog board made it obvious. What’s interesting is that it also indicated our database queries were taking more than 5 seconds; at the same time Atlas was telling us that database queries (for the same request) were taking less than 100 milliseconds. So what gives?
[CLICK][CLICK] As it turns out, the discrepancy in the request times had to do with our Lambdas connecting to mongo. On cold start, a lambda would take about 5 seconds to connect to mongo, then the mongo query would take 100 milliseconds, and all would get recorded in Datadog as a single transaction.
Our solution was a quick switch from running display off of lambda (which is what we were doing at the time) to the dockerized autoscaling service you guys saw in the earlier diagram
This next problem has to do with the execution of our complex rules. If you’ll remember from earlier, rules are a set of filters coupled with a set of actions. So for example, your filter is “everything that says rain boots and is moderation approved” and the action is “ask author for permission”
[CLICK] As our rules started growing in complexity, we noticed that for all of them to execute it was sometimes taking 30 minutes or more. Individual database operations were taking minutes to complete despite multiple complex indexes.
[CLICK] How did we detect it? Our Atlas board metrics indicated poor rule execution time and [CLICK] obviously Atlas was our tool to root cause the issue
[CLICK] And how did we solve it? Well, we realized that our rules were performing actions on matching content, but unmatched content was still being scanned in subsequent executions. Our solution was to exclude scanning of previously unmatched content. And to do that we included a timestamp in our queries that only scanned content updated since the last time a rule ran.
The lesson I took from this is don’t rescan content you don’t have to. This is a great example of an issue where someone might say, our database isn’t scaling and it’s not able to perform complex queries in reasonable time. Well guess what. Fix your code. Hardware is great, tools are great, but they can only carry you so far. I think we sometimes tend to be sloppier than we should be because hardware is so cheap and easy, but we have to write code responsibly too.
Example if needed:
Say we have 10000 documents in a collection, each has a color
I run a query every 15 min to find all the red ones and take some action
First time I run, I find 1000 and take some action. We tag these so they aren’t scanned next time.
But the next time I run, the other 9,000 docs that aren’t red still need to be scanned.
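The red-documents example as runnable code: add an `updatedAt` watermark to the query so each run only scans documents changed since the last run. The field names and the local `scan` stand-in are illustrative; in production this is just an extra clause in the Mongo query.

```javascript
// Build the rule's query with a timestamp watermark so previously
// scanned, unchanged documents are skipped. Field names are illustrative.
function buildRuleQuery(color, lastRunAt) {
  return { color, updatedAt: { $gt: lastRunAt } };
}

// Local stand-in for a collection scan against that query.
function scan(docs, query) {
  return docs.filter(
    d => d.color === query.color && d.updatedAt > query.updatedAt.$gt
  );
}

const docs = [
  { _id: 1, color: 'red',  updatedAt: 100 },
  { _id: 2, color: 'blue', updatedAt: 100 },
  { _id: 3, color: 'red',  updatedAt: 300 } // changed after the last run
];

// The last run finished at t=200, so only doc 3 is scanned and matched now.
const matched = scan(docs, buildRuleQuery('red', 200));
console.log(matched.length); // 1
```

With an index that includes `updatedAt`, the database never touches the unchanged 9,000.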
And speaking of bad code, the last issue I’ll talk about today is when [CLICK] some bad code caused major data corruption in our database across all clients and most of our content (It wasn’t me!! Actually it was :()
How did we detect it? Well we didn’t need the boards this time because client complaints started pouring in fast [CLICK] [CLICK]
Our solution? An atlas point in time recovery.
Because we depend on social data, we can actually tolerate data loss pretty reasonably
We rolled back our database to a backup from less than 12 hours earlier and cherry-picked the client enrichment actions taken since then
Aggregations proved very helpful to cross-reference what was changed when
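A sketch of the kind of audit aggregation that helped here: group recent changes per client so you can see who changed what after the backup point. The stage operators (`$match`, `$group`, `$sort`) are real MongoDB operators; the field names `clientId` and `updatedAt` are illustrative.

```javascript
// Audit-style aggregation: per-client counts of documents changed
// since the backup point. Field names are illustrative.
const since = new Date('2018-06-01T00:00:00Z'); // hypothetical backup point
const pipeline = [
  { $match: { updatedAt: { $gte: since } } },   // only changes after the backup
  { $group: {
      _id: '$clientId',                         // one bucket per client
      changed: { $sum: 1 },                     // how many docs changed
      latest: { $max: '$updatedAt' }            // most recent change
  } },
  { $sort: { changed: -1 } }                    // worst-hit clients first
];
// Run with: await collection.aggregate(pipeline).toArray()
console.log(pipeline.length);
```

A few minutes with output like this told us exactly which clients needed actions replayed.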
This was a very bad day. It was on a Friday of course, because those things always happen on a Friday. Yet, somehow, Atlas made recovery super easy and as pain-free as we could have hoped for, considering. The lesson - keep an audit and have a solid backup path to recovery.
Before this happened, we kept talking about how we need to do a dry run on recovery and we kept saying we’ll do it, but we didn’t until we had to
I’m sure many people in the audience are thinking the same thing now. I really encourage you to do it. You don’t want it to be the first time when it’s a production escalation
Some issues we anticipate we’ll encounter in the future.
Scale, size, and cost, obviously.
How do we plan to address these?
One is to clean up unused content. I really feel like most people use a lot less data than they think they do. This is a good opportunity to evaluate and clean up
As our dataset grows, I see us utilizing more sparse indexes. For us, recent content is valuable, older content, not so much. For better or worse, no one cares what someone posted on Instagram two years ago. If you can reduce the size of your indexes by making them sparser, by all means do it
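One way to express "only index recent content" is a partial index (MongoDB's generalization of sparse indexes), which only covers documents matching a filter expression. This is a sketch of the idea; the field names and cutoff date are hypothetical.

```javascript
// Sketch: a partial index that only covers recent content, keeping the
// index small enough to stay in memory. Field names are illustrative.
const indexSpec = { client: 1, createdAt: -1 };
const indexOptions = {
  partialFilterExpression: {
    createdAt: { $gt: new Date('2018-01-01T00:00:00Z') } // hypothetical cutoff
  }
};
// Create with: await collection.createIndex(indexSpec, indexOptions)
// Queries must include a matching createdAt predicate to use this index.
console.log(Object.keys(indexOptions).join(','));
```

Older content falls out of the index entirely, which is fine when nobody queries it anyway.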
Some surprising side effects arose from being our own devops engineers.
Our services team can log in and get a read-only view of all kinds of data that’s not available in the standard analytics screens
Unlike with a relational database, there is no need for a deep understanding of a complex schema
It just allows for very intuitive querying that doesn’t take very deep domain knowledge to get your work done
According to our product manager, there’s been an 80% reduction in tickets since the switch, definitely in part due to people being able to get the information they need without developers being involved
Another positive (this one specifically has to do with Atlas) is that other teams are now considering going the hosted route. It’s easier for others to walk a beaten path, there’s less uncertainty, and more successful examples to speak of next time someone mentions “scale”
And I can’t stress this enough. We expected to have to do a little bit of maintenance; we haven’t had to do any. It’s harder to put a number on things like this. You can say our hosting costs went from 60,000 to 6,500, but it’s harder to gauge how much money we’ve saved by not having to worry about our database. Old platform: 60,000, new platform: 6,500, getting to focus all my time on just development: priceless.
A couple things that I wanted to mention that didn’t really fit in their own slide.
Of course it depends on your use case, but we haven’t been impressed by the text index. It’s huge, it can’t be compounded, and the search doesn’t always behave predictably. I would recommend narrowing your queries down and doing a regex search, if you can
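What "narrow down, then regex" looks like in practice: scope the query by indexed fields first, then apply a case-insensitive regex, rather than leaning on a collection-wide text index. The field names here are illustrative.

```javascript
// Sketch: narrow by indexed fields, then regex-match the text.
// Field names are illustrative.
const query = {
  client: 'gymboree',                            // narrowed via compound index
  createdAt: { $gte: new Date('2018-06-01') },   // recent content only
  text: { $regex: 'rain\\s?boots', $options: 'i' }
};

// Locally, the regex clause behaves like a plain JS RegExp:
const re = new RegExp('rain\\s?boots', 'i');
console.log(re.test('Love these RAIN BOOTS!')); // true
console.log(re.test('sunny sandals'));          // false
```

An unanchored regex can't use an index by itself, which is exactly why the indexed fields have to do the narrowing first.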
Hint is great. Even when you think Mongo will pick up the right index to use, it sometimes doesn’t. So, if you can add hint in your code, do it.
Unfortunately it doesn’t work with updates, but since I’m speaking here, Mongo, this is an official request, please fix.
Some final thoughts. Bottlenecks happen, services break, requirements change, products evolve. What makes a good datastore is not infallibility, but the tools and ability to detect issues fast, diagnose, develop fast, and recover. I think that the value of a great datastore or any good tool really is that it allows you to be agile and iterate. And really to do what you’re passionate about, which in our case is code.
Why Cassandra?
It's a bit of a Bazaarvoice domain requirement, as that is how our single-source-of-truth datastore works at scale. They picked Cassandra years ago to handle the globally fault-tolerant, high-write-volume access pattern that we see for ratings and reviews across all our clients