26. “Our approach to managing AWS costs is REACTIVE
and prioritises taking action against the highest
contributors to our costs as observed in production”
27. “The complexity of implementing multi-cloud makes this
a decision we don’t even want to contemplate.”
Here’s what I’m going to cover in my talk.
What principles are and why they’re so valuable to an organisation.
I’ll go through Intercom’s engineering principles
And finally the principles in practice with respect to our use of AWS. This is the technical part of the talk, but the first two parts are really important!
I’m talking here about principles used by your company, team or organisation. There are good generalised software engineering principles around testability, writing simple code, not repeating yourself etc. They’re good but don’t particularly help your teams build the right thing.
Organisation principles are a crucial tool that helps your organisation to learn, grow and establish working patterns for building stuff. Without a common set of principles, how do you know you are building the right thing in the right way?
You can make an educated guess. A lot of the time you might be right.
You can ask whoever is in charge. That can work fine.
You can copy what you’ve done elsewhere, which can also work well in a bunch of cases.
You can do something you saw at a cool tech conference where every presentation is full of unicorns, rainbows and every project is wildly successful. Sometimes this can work.
When you get something right in your own organisation, how do you share that information? Principles are a way of encoding successes, helping to repeat the behaviours that led to positive outcomes and avoid the behaviours that led to mistakes.
This means that everybody in the organisation, even a new joiner, can read the minds of the most experienced and senior engineers.
That’s pretty powerful. It also lets them build things knowing that, as long as their work falls roughly within the accepted guidance, they’re going about it in the right way.
There can be exceptions of course, but with a set of written-down principles you at least have something to work with rather than just making guesses.
Let’s consider a bunch of bad principles and practices. What might a bad principle look like?
Is it a big deal if a bad one exists?
Lol, do you even computer? This is a bad principle because it’s unattainable and unrealistic.
It is ambitious, sure. It sets a tone of a release being a death march of punishment. No company really wants to do the opposite either.
A huge engineering effort might be made to contain bugs and unexpected behaviours, but zero tolerance on bugs sounds like a poor choice. Nobody wants or likes bugs, but they’re unavoidable in the real world.
Ok, so what do good principles look like?
This is simple to understand and opinionated, and there are cases where the opposite is more than appropriate.
Building on AWS may not be appropriate if your business requires sub-millisecond latency to users in one building in Cairo, or you may have invested significantly in local datacentres, or you may require cloud diversity due to business requirements.
A principle like this speeds work up because people know what to do without guessing, asking around or doing the opposite thing and finding out later.
Ok, I think I’ve demonstrated that principles are useful.
Let’s move on to a few of Intercom’s principles that we use to build product.
First, I’m going to explain a little about who Intercom are and what we do.
I work at Intercom as an engineer with the teams that make up the foundational parts of Intercom, such as our use of cloud services, deployments, data storage, security and IT.
Intercom have a suite of messaging-first products for businesses to accelerate growth across the customer lifecycle.
Businesses use Intercom to do sales, marketing, support and product tours, all through a beautiful and industry-leading messenger.
Intercom’s R&D HQ is in Dublin. We’re kind of split-HQed between Dublin and San Francisco.
Here’s our messenger. There are GIFs, emojis and embedded applications such as Stripe and Shopify. There’s a pretty big set of backend applications supporting all this functionality, and we’re all in on AWS, which is why I’m here I guess.
As with everything in the real world, our principles are a work in progress. We revisit them every year or two to make sure they still make sense.
But we have a good set today that is battle-hardened, tested in the real world, and that we’re happy to share with the outside world.
So we have a set of principles across our R&D organisation, which is comprised of engineering, product and design groups.
I’m only going to look at the relevant ones; others apply more clearly to the process of building product.
“Ship to learn” is a universal principle. The sooner we ship, the quicker we learn how our product is used. It also means that shipping a feature is just the start of the process. We believe that great products are built by shipping a feature, understanding how it’s really used, and then iterating. We’d rather get something out quicker that’s functional but incomplete.
This works well for us. I would not apply this if I were building, I dunno, software for a Boeing 737 Max or something.
“Build in small steps” is a direct instruction to our engineers. Make small changes frequently. Break work down into safer, smaller steps.
This doesn’t just refer to changes made via code deploys and pull requests, but all the usual modern “testing in production” techniques such as feature flags. In addition to being iterative and supporting an agile development process, there are secondary benefits: it helps our availability and quality, and again lets us understand what actually happens when we ship what we’re building.
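To make the small-steps idea concrete, here’s a minimal sketch of the kind of deterministic percentage rollout that feature flags typically use. This is not Intercom’s actual flag system; the flag name and rollout mechanics are hypothetical.

```python
import hashlib

def flag_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket a user into a percentage rollout.

    Hashing (flag, user) keeps each user's bucket stable across requests,
    so ramping a flag 1% -> 10% -> 100% only ever adds users.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < rollout_percent

# Ramp a hypothetical flag and watch coverage grow monotonically.
users = [f"user-{i}" for i in range(1000)]
coverage = [sum(flag_enabled("new-inbox", u, p) for u in users)
            for p in (0, 10, 50, 100)]
```

Because the bucketing is stable, ramping a flag up never flips a user off who was already on, which is exactly what you want when shipping a feature in small, observable steps.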
Our environment allows us to frequently and easily ship to production.
We encourage our engineers to be technically conservative.
Maybe we could solve the problem using a graph database deployed using Kubernetes backed up onto the blockchain.
But first we ask ourselves: “can we solve this problem with tools and techniques that we already know well?”
One of our early principles was to “run less software”, meaning we preferred to use a smaller set of software and services, and preferably ones that we didn’t have to operate ourselves. This evolved over time into generally being technically conservative. This doesn’t mean that we don’t build beautiful, functional product, but that our implementation choices are conservative.
We prefer to keep things simple. We will trade off performance, financial cost and perfect abstractions to keep it this way.
For example, at our stage of growth, financial optimisation is not *the* goal of our engineering organisation. We’re a successful startup, but not yet an established company. Saving a few thousand dollars won’t make a meaningful difference to whether Intercom is a long-term success, but building and iterating on new features could.
So this is definitely one that can change over time, and again may not be appropriate for many other companies.
This one isn’t really directly related to the rest of my talk, but I really love it.
We’re deliberately positive, optimistic, eager to teach and learn, and welcoming to everybody. This is part of why I love working at Intercom.
That was Intercom’s set. Now we’re going to see what they look like in practice, specifically around our use of AWS.
How do our principles influence the management of AWS Costs?
As I’m sure many of you are aware, understanding your AWS bill can be pretty complex.
There’s a whole cottage industry of companies and consultants who are more than happy to take money off you in return for the promise of making your bill smaller and life easier.
Knowing what is worthwhile to reduce costs can be difficult and takes time and effort to do well.
So, I am the “costs person” at Intercom. A lot of the time, when people have questions about costs, they ask me. I don’t work alone on this, but I’ve been around for a while and have been working in this area.
In order to scale my function in costs and AWS architecture in general, and ensure I don’t lose my mind, I ended up writing down a load of things that were in my head into a document.
Like our principles, they are guidance that reflect actual usage of AWS. The document is in no way complete as it doesn’t explain AWS from scratch, just mostly how we use it.
I recommend writing something like this as a way of saving loads of time, but also as a way of testing your mental model about how things are actually used in the real world.
Shout out to the open guide to AWS! It’s a community written guide to using AWS in the real world.
AWS’s own documentation is reasonably good.
This guide is by and for engineers who use AWS. It aims to be a useful, living reference that consolidates links, tips, gotchas, and best practices. It arose from discussion and editing over beers by several engineers who have used AWS extensively. It’s concise and readable. There’s also a really good Slack community!
I point folks to this in my doc. I encourage you to read, share and contribute to it! It’s got a load of great cost related info in it too.
Back to my doc, here’s an important quote.
My answer to questions from engineers like “should I worry about costs?” or “how much will this feature cost?” is “ship it and we’ll find out when it’s used”. Don’t worry about it, just build.
There are some exceptions of course, when it’s obvious that there will be a massive infrastructural impact (such as doubling all our Elasticsearch fleets).
But the guidance I give to our product engineers is to build and deploy, and learn by running their feature in production rather than doing a load of upfront estimation or optimisation work. The benefit of optimisation work is much easier to see after something has been running in production; otherwise you’re relying on judgement alone.
This works with “building in small steps” and “ship to learn”. This works for Intercom. It would not work for an environment under strict financial control or with very tight margins.
We don’t want engineers to worry about the existence of other clouds. Just pretend they don’t exist!
The real power of cloud services is tying them all together. They come with a load of overhead, such as relatively complex permission models and network designs.
Smaller SaaS providers such as Honeycomb, Gremlin, Datadog or New Relic don’t require complex configuration to get started and don’t have this barrier, but cloud services do.
It’s less confusing and less work to use one cloud provider well, plus a small selection of single-purpose, world-class SaaS providers, to run our business. We are all in on AWS. These decisions tie back to our engineering principles “be technically conservative” and “keep it simple”.
There are very few user facing products whose customers benefit from being hosted on multiple clouds. We don’t want our engineers worrying about this.
So what’s driving our bill?
We run large numbers of instances to serve our application, and we also self-host Elasticsearch for many large datasets, so EC2 dominates.
Then there’s a load of RDS Aurora MySQL and ElastiCache. Controlling all of this comes down to:
Managing reservations, instance choices and usage of spot.
Rightsizing and optimising.
Managing this centrally rather than getting every team to figure it out themselves.
Here are our EC2 ratios in terms of dollar spend. Not bad overall.
On-demand has grown a little over the last two months; I really need to look at that!
The main tools and features we use to manage and control our costs are:
Tagging infrastructure by product feature and team where possible.
Using Cost Explorer to visualise trends.
This is consistent with our principle of keeping things simple. No complicated software involved here.
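As an illustration of how far simple tooling goes here, this sketch rolls spend up per team tag. The numbers are made up, and the structure is modelled on the grouped output Cost Explorer returns when grouping by a cost-allocation tag (tag group keys come back as `tag$value`).

```python
from collections import defaultdict

# A simplified result shaped like Cost Explorer's grouped output when
# grouping by a "team" cost-allocation tag (the figures are invented).
sample_results = [
    {"Groups": [
        {"Keys": ["team$messenger"], "Metrics": {"UnblendedCost": {"Amount": "1200.50"}}},
        {"Keys": ["team$search"],    "Metrics": {"UnblendedCost": {"Amount": "3400.00"}}},
    ]},
    {"Groups": [
        {"Keys": ["team$messenger"], "Metrics": {"UnblendedCost": {"Amount": "1180.25"}}},
    ]},
]

def spend_by_tag(results_by_time):
    """Sum unblended cost per tag value across all time periods."""
    totals = defaultdict(float)
    for period in results_by_time:
        for group in period["Groups"]:
            tag_value = group["Keys"][0].split("$", 1)[-1]  # "team$search" -> "search"
            totals[tag_value] += float(group["Metrics"]["UnblendedCost"]["Amount"])
    return dict(totals)
```

In practice you’d feed something like this from Cost Explorer’s API grouped by your team tag; the point is that a few lines of aggregation, not a platform, is enough at this scale.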
Understanding the inputs and changes. We’re still small enough that we can avoid the overhead and abstractions of budgets.
Makes reservations a lot easier. Also keeps things simpler. We don’t want perfectly tuned and optimised fleets, we want to spend as little time as possible managing fleets of servers.
Move various workloads to spot.
Keep things simple and be reactive to real-world usage after we have shipped to learn. See how spot performs in the real world. Things have generally stabilised a lot over the last three years, in line with our principles. We never want to do complex stuff to participate in bidding wars; AWS now just takes care of all of that.
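For a sense of what “AWS takes care of it” looks like, this is roughly the shape of an Auto Scaling group’s mixed instances policy, where you declare an on-demand floor, a list of acceptable instance types, and let AWS pick spot pools. The values here are illustrative, not our actual configuration.

```json
{
  "MixedInstancesPolicy": {
    "InstancesDistribution": {
      "OnDemandBaseCapacity": 2,
      "OnDemandPercentageAboveBaseCapacity": 0,
      "SpotAllocationStrategy": "capacity-optimized"
    },
    "LaunchTemplate": {
      "LaunchTemplateSpecification": {
        "LaunchTemplateName": "web-fleet",
        "Version": "$Latest"
      },
      "Overrides": [
        {"InstanceType": "m5.xlarge"},
        {"InstanceType": "m5a.xlarge"},
        {"InstanceType": "m4.xlarge"}
      ]
    }
  }
}
```

No bids, no custom tooling: the allocation strategy picks the spot pools with the deepest capacity from the types you’ve allowed.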
Now I’m going to talk about our Monolith architecture and why it works for us.
We run a Ruby on Rails monolith. This means that we have a single large application with a lot of code that does many, many things. UIs, APIs, helpers, workers, many datastores… it does everything.
When said out loud at a conference, the word monolith is usually followed by phrases like…
This sounds like a fun talk.
All these imaginary talks are well worth going to and no doubt describe real life situations.
But it’s not what we’ve found at Intercom.
Our Ruby on Rails monolith keeps us fast. Yes, we have to invest in it to keep it this way.
Upgrading Ruby on Rails on a monolith is tough if you haven’t done it in a long time.
Refactoring core modules to ensure usable, safe boundaries is tough too.
Deploying a monolith so that it doesn’t break all the time is hard if you do it infrequently.
So we garden our majestic monolith - continually upgrade the version of Rails, such that we’re now just about to have a permanent test branch running against the development version.
We use code owners and well-defined boundaries in the code base to stop people’s work overlapping.
We give a great experience with out-of-the-box patterns and tooling for logs, metrics, scaling, etc.
Our majestic monolith runs the vast majority of our business logic on EC2 instances, a lot of it behind ELBs or running async jobs.
243 Auto Scaling Groups, all running different functions or logical separation of different APIs.
We have written lots of different services in the past, partly because we thought it would make us faster.
But when it comes to day to day operations, maintenance, having great observability, upgrades, updates etc. teams see that they get more done in the monolith.
What we have observed is teams replacing their microservices and folding the functionality back into the monolith, where it’s typically easier, faster and cheaper.
Keep it simple, ship to learn. We revisit our assumptions.
At the same time as reducing our microservices, we have seen increasing use of AWS Lambda functions: not generally to replace parts of the monolith, but to glue different AWS services together and run some simple processing on the data.
Daniel Vassallo, another ex-AWS engineer who is good on Twitter, did a good job of describing this pattern recently.
“Stored procedures for the cloud”. This fits well with a monolith. There’s a danger of important stuff moving out of the monolith and surprising developers, but for simple functions that don’t need deep observability, they’re fast to work with and work well. We haven’t A/B tested this hypothesis, but we’re willing to bet that it will work well over time. We need to invest in the deployment and observability story for running Lambdas alongside our monolith, and make it as good an experience as working with the monolith.
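To make the glue pattern concrete, here’s a minimal sketch of a stream-triggered function. The event shape follows DynamoDB Streams; the table’s key schema and the per-user counting are illustrative, not our actual code.

```python
def handler(event, context=None):
    """A "stored procedure" style Lambda: react to DynamoDB stream records
    and count writes per user, the kind of signal a monolith could use
    for rate limiting. Only INSERT and MODIFY events are counted.
    """
    counts = {}
    for record in event.get("Records", []):
        if record.get("eventName") not in ("INSERT", "MODIFY"):
            continue
        keys = record["dynamodb"]["Keys"]
        user_id = keys["user_id"]["S"]  # DynamoDB attribute-value format
        counts[user_id] = counts.get(user_id, 0) + 1
    # In a real deployment this would increment counters somewhere the
    # monolith can read; here we just return them.
    return counts

# A hand-written sample event in the DynamoDB Streams shape.
sample_event = {
    "Records": [
        {"eventName": "INSERT", "dynamodb": {"Keys": {"user_id": {"S": "u1"}}}},
        {"eventName": "MODIFY", "dynamodb": {"Keys": {"user_id": {"S": "u1"}}}},
        {"eventName": "REMOVE", "dynamodb": {"Keys": {"user_id": {"S": "u2"}}}},
    ]
}
```

The appeal is exactly what the “stored procedure” framing suggests: a few lines of processing living right next to the data, with no service to operate.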
So I think our setup here is consistent with “be technically conservative”. We prefer to reuse than reinvent.
This was a pretty large project, as our biggest dataset, our users’ users, was stored in MongoDB.
MongoDB is not a bad database, in fact it’s excellent and has been doing extremely impressive stuff over the last few years.
However, we were using it badly, with large numbers of individual replica sets, complex indexes and unclean ways of interacting with it from our code base (i.e. direct ORM access).
So we needed to evolve, and we decided to evolve towards DynamoDB, as we only wanted to run a distributed database ourselves if we really, really had to. We thought it would be a similar amount of work to evolve our use of MongoDB vs. replacing it with DynamoDB, and we would gain a lot of “run less software” and general simplicity.
We needed to replace a lot of different functionality, including streaming changes to our Elasticsearch clusters, rate limiting and keeping history of user changes.
We ended up using DynamoDB streams, Lambda functions and other services to get data to the right place for our Rails monolith to do things like rate limiting. (The hilarious thing is that it mostly needs to rate limit itself.)
One of the big differences between MongoDB and DynamoDB was that we could send diffs to MongoDB: if one attribute in an update changed, we only needed to send that along, and it didn’t seem to be that expensive for MongoDB to handle. With DynamoDB, however, writes are charged by the size of the entire item, and some users were huge. To fix this we ended up breaking the user documents down into multiple related documents, giving us very small documents that are updated frequently and larger ones that aren’t. This smoothed out rate limits and hot-spots and improved costs significantly.
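The splitting idea can be sketched like this. The attribute names and key schema here are illustrative, not our actual data model; the point is that frequently updated attributes live in a small item so hot writes stay cheap.

```python
# Attributes that change on nearly every event go in the small "hot" item;
# everything else goes in the large, rarely written "cold" item.
HOT_ATTRIBUTES = {"last_seen_at", "session_count", "current_page"}

def split_user(user_id, attributes):
    """Split one logical user into two DynamoDB-style items that share a
    partition key but have different sort keys."""
    hot = {k: v for k, v in attributes.items() if k in HOT_ATTRIBUTES}
    cold = {k: v for k, v in attributes.items() if k not in HOT_ATTRIBUTES}
    return [
        {"pk": f"user#{user_id}", "sk": "hot", **hot},
        {"pk": f"user#{user_id}", "sk": "cold", **cold},
    ]
```

Because both items share a partition key, the whole user can still be fetched in one query, while a "last seen" update only ever pays for the tiny hot item.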
Overall we wanted to get to a simpler setup, and we got there in the end: we removed all use of MongoDB from our production environment last week. In a major change like this, we applied “ship to learn” and “build in small steps” ruthlessly, which meant moving slowly with the project, dual-writing for long durations and not being surprised by the differences between DynamoDB and MongoDB. We were also helped significantly by working with our Technical Account Management team and support team in AWS.
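Dual-writing during a migration like this can be sketched as follows. The stores here are in-memory stand-ins (the real thing wrote to MongoDB and DynamoDB), and the mismatch counter is the kind of signal you would alert on before flipping reads over.

```python
class DualWriter:
    """Toy stand-in for dual-writing during a datastore migration:
    write to both stores, read from one, and count disagreements."""

    def __init__(self):
        self.old_store = {}   # stand-in for MongoDB
        self.new_store = {}   # stand-in for DynamoDB
        self.read_from = "old"
        self.mismatches = 0

    def write(self, key, value):
        # Every write goes to both stores for the duration of the migration.
        self.old_store[key] = value
        self.new_store[key] = value

    def read(self, key):
        old, new = self.old_store.get(key), self.new_store.get(key)
        if old != new:
            self.mismatches += 1  # in production: log and investigate
        return old if self.read_from == "old" else new

dw = DualWriter()
dw.write("user:1", {"name": "Ada"})
```

Once the mismatch rate sits at zero for long enough, flipping `read_from` to the new store is a small, reversible step, which is the whole point of doing the migration this way.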
Thank you for listening, more than happy to chat to folks during the event, and there’s my twitter handle if you want to continue the discussion online!!