Sensu v1 is a popular open source monitoring framework with a wealth of community contributed and maintained plugins and features. As Sensu does not offer SaaS, the features and direction of our software have been guided by our customers and the experiences of those in our community. As its features and capabilities have matured, we've discovered some operational challenges at scale. In this talk, I'll discuss our v2 rewrite, the challenges of building scalable, feature complete tools for operations folks without running our own infrastructure, and how we included our community in this process. I'll also dive into some of the tools and methods we use for load testing and rapidly releasing new features.
2. ∎ Spent ~6 years in on-call rotations
∎ I now build software that wakes people up in the night
∎ In my spare time I herd chickens #FarmOps
∎ @benzobot on the twitters
4. What is Open Source?
Code that is publicly available, with permission to use, modify, and share.
∎ Has a license
∎ Contributors document
∎ In source control
5. Motivations
Why do we work on open source software?
∎ Ensuring the systems we build are performing as we expect, and that we find out when they aren’t
∎ Build/maintain tools that make our own work easier
∎ Build/maintain services to make others’ work easier
6. What is Sensu?
∎ ~6 year old open source monitoring framework
∎ Service checks, metrics, filtering, auto remediation, etc
∎ V1 - Ruby, Redis, and RabbitMQ
∎ V2 - Go and etcd v3
9. V1 - Operational Challenges
∎ Clustering needed for large installations
∎ Dependent on external processes
∎ Configuration management driven
10. V2 Rewrite - Goals
∎ Reduce operational complexity, increase performance
∎ Backwards compatible with current plugins
∎ API driven
∎ Written in Go on top of etcd v3
Alternate titles include: when to rewrite, how not to leave your community behind, etc
Built data infrastructure and tooling for systems monitoring and performance analysis
I’ve spent some time on call, so I know what it’s like to scramble when things go bump in the night.
As someone who has been there, I am conscientious about building quality software that people delight in using (even if its job is to wake them up).
Be forewarned if you follow me on twitter, you will see pictures of chickens!
Today I’m going to cover:
The motivations behind building open source software
Different types of testing, and what we do to ensure Sensu is feature complete and performant when we can’t drink our own champagne, since we don’t have any infrastructure!
I’ll dive into some of the weird and wonderful bugs we found and fixed!
And finally, I’ll chat about the importance of community and what we’re doing to include and learn from it.
But first, a couple definitions.
I find it useful to put some qualifiers on what OSS really is.
Anyone can put code out in public, but to really be open source, you have to tell people:
How they can use it
How they can contribute
And where to find it.
Much of our industry relies on open source software in their daily work. Even tools that we pay for can be derived from open source.
I want to build and maintain tools that make people’s jobs easier, and I want to ensure those tools are performant.
Fame/glory: there is something cool about seeing people use code you wrote.
Ok so now let’s get into what Sensu is, and I’ll start talking about our journey to an exciting rewrite.
As a monitoring framework, we’re positioned to be the hub in an end-to-end monitoring solution.
Service and system checks are at the heart of Sensu, but you can also use it for metrics collection, sophisticated alerting, auto remediation, etc. There is a lot you can do, and I think this is pretty cool!
Our community is really important - as they’ve used Sensu in their own infrastructure and written plugins and checks, they share that work with others, making it easy for folks new to this solution to get started.
V1 is written in Ruby, using RabbitMQ as a messaging transport and Redis for coordination.
Our learnings from V1 have led us to rewrite Sensu in Go, embedding etcd, which I’ll talk about in a bit.
Community metrics:
Plugins! We have a couple frameworks for writing your own plugins, and over 200 community built, open source plugins.
RabbitMQ serves as our transport layer; Redis holds state.
Clients subscribe to the checks they need to run.
Actions work off of a publish subscribe messaging model.
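To make the subscription idea concrete, here is a minimal in-memory sketch of a publish-subscribe model in Go. It is not Sensu's code and does not use RabbitMQ or Redis; the `Bus` and `CheckRequest` types and the client, subscription, and check names are all hypothetical, chosen only to show how a check request published to a subscription reaches the clients that subscribed to it.

```go
package main

import (
	"fmt"
	"sort"
)

// CheckRequest is a hypothetical message asking subscribed clients to run a check.
type CheckRequest struct {
	Check        string
	Subscription string
}

// Bus is a toy in-memory stand-in for the transport layer.
// It is an illustration only, not RabbitMQ and not Sensu's implementation.
type Bus struct {
	subscribers map[string][]string // subscription name -> client names
}

// NewBus returns an empty bus.
func NewBus() *Bus {
	return &Bus{subscribers: make(map[string][]string)}
}

// Subscribe registers a client for a subscription such as "webserver".
func (b *Bus) Subscribe(client, subscription string) {
	b.subscribers[subscription] = append(b.subscribers[subscription], client)
}

// Publish returns the sorted names of the clients that would receive the request.
func (b *Bus) Publish(req CheckRequest) []string {
	clients := append([]string(nil), b.subscribers[req.Subscription]...)
	sort.Strings(clients)
	return clients
}

func main() {
	bus := NewBus()
	bus.Subscribe("web-01", "webserver")
	bus.Subscribe("web-02", "webserver")
	bus.Subscribe("db-01", "database")

	req := CheckRequest{Check: "check-http", Subscription: "webserver"}
	fmt.Println("clients receiving", req.Check+":", bus.Publish(req))
}
```

Only the two clients subscribed to "webserver" receive the request; the database client never sees it.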
These technologies have been extraordinarily useful and helpful.
However, the infrastructure landscape has changed over time, and it’s time to reevaluate what got us here.
What are the pain points?
Deployment (especially high availability) can be complex.
Our engineering team got a chance to sit in on a training by our Success team on how to install HA Sensu v1 without config management, and it was a time consuming process.
The state of v1 is not easy to containerize or manage without configuration management.
I could probably do a talk on each of these topics, but suffice it to say, we wanted to make it easier to install and get up and running quickly.
The path to doing this was to spend almost a year in a closed source development cycle, rewriting the entire system in Go on top of etcd v3.
Go allows us to ship a couple of binaries that are easily installed via traditional package management, Docker, or within Kubernetes.
Now, I can set up a backend, agent, and have a check running in a handful of minutes.
We still use a publish subscribe model, but now our state and configuration are stored in etcd.
Etcd is embedded in the binary so you don’t have to run a separate instance!
We’ll eventually have clustering aided by etcd as well.
We released it to the wild as an Alpha in February this year.
I’m skipping ahead a bit here, but it was a pretty uninteresting process that was really just writing a bunch of code and closing issues.
Our path was clear as we had a successful project to model, and about 6 years of experience from working with the community and the businesses that used it.
We knew the alpha wasn’t perfect or polished, and we still had a long feature list to get through, but it was important to get it out there early to get user feedback and bug reports.
So we started asking people to use it!
“Hey friends, can you run this binary in your infrastructure and tell us when it falls over?”
And they responded:
“This is cool, but where are the dashboards?!”
“Hey did you know your nightly build is a month old?”
“When will you support clustering/HA?”
It’s sometimes painful getting reports that something isn’t going right for our users, but totally necessary.
Because we are not the main consumers of our software, we rely on user feedback to know what’s important.
How do we know our software works when we are not the primary consumers and, in fact, don’t run any infrastructure at all?
While we did testing during development, it was limited to unit, integration, and end to end testing.
These tests are really useful for determining if your code is working during development, but not as useful for determining feature need, usability, or even system behavior.
In order to get some benchmarks about how our system was performing (or not) and the usability of said system, we had to do some different types of testing.
Once we had an Alpha, and some information from users, and space to breathe from feature development, we started planning out what that testing looked like.
Usability testing is actually kind of tough when you aren’t the core user of your product, and especially when it’s such a malleable framework (everyone has a different use case!)
And for that matter, integration and end to end testing are really hard too.
For code review, we either have a pair and one additional reviewer, or one author and two reviewers. Note that bugs can still slip through!
Every PR runs our full test suite with a linter, and we schedule a nightly build.
We started using Velocity a couple months ago with sensu-go to track PR size, time to merge, and complexity risk.
So far, it’s been mostly a confirmation tool for us that we’re keeping our PR size manageable and getting reviews completed in a timely fashion.
Our PR size goal is 200 lines of code, and we meet it at least 50% of the time.
This is the other interesting metric worth paying attention to: our build success ratio.
Lately we’ve been having a _tough time_ as we’re starting to run into some issues with our testing strategy!
Our end to end testing strategy is proving to be brittle, and we’re having to rework some of our build tests to be more reliable.
You can’t possibly cover everything in software testing. It’s helpful to have actual usage of the software to uncover bugs.
So we got everyone together on a zoom call and came up with a strategy.
We’d each pick a component, write down how that component should behave, and define some acceptance criteria for testing.
Everyone would then pick one or two features to test that they had not worked on.
This was actually pretty fun to do! We found a few bugs, but the ones we found were minor, and we gained confidence in the performance and usability of our software.
Without infrastructure of our own, our process was limited to local development, continuous integration, and package building/deployment, which meant we only had guesses as to the performance of our software.
To be confident in recommending usage and developing new features, we needed some data points!
So we came up with a plan and a goal for load testing. We wanted to see if we could connect 10K sensu-agents to a single sensu-backend, running keepalives and processing checks.
Initially, we planned on doing this by setting up a single VM running sensu-backend in Google Cloud Platform, and then spinning up a Kubernetes cluster with 5 agents per pod.
We expected that we could push a button and have GKE autoscale until we had the number of agents we needed (10K).
In actuality, we ended up effectively load testing GKE - we were never able to scale up to 10K agents! GKE was throttling our API requests, so we needed to rethink our load testing strategy.
So we did something much simpler - we wrote a script to spin up 10K agents, and connected them to our single backend.
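The script itself isn't shown here, so the following is only a rough in-process sketch of the shape of such a test, assuming each agent can be modeled as a goroutine that sends keepalives to one shared backend. `Backend`, `Keepalive`, and `RunAgents` are hypothetical names; a real run would start actual sensu-agent processes and connect them over the network.

```go
package main

import (
	"fmt"
	"sync"
)

// Backend is a hypothetical stand-in that records keepalives per agent.
type Backend struct {
	mu   sync.Mutex
	seen map[int]int // agent ID -> keepalives received
}

// NewBackend returns a backend with no agents seen yet.
func NewBackend() *Backend {
	return &Backend{seen: make(map[int]int)}
}

// Keepalive records one keepalive from the given agent.
func (b *Backend) Keepalive(agentID int) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.seen[agentID]++
}

// ConnectedAgents reports how many distinct agents have sent a keepalive.
func (b *Backend) ConnectedAgents() int {
	b.mu.Lock()
	defer b.mu.Unlock()
	return len(b.seen)
}

// RunAgents starts n simulated agents, each sending the given number of
// keepalives, and waits for all of them to finish.
func RunAgents(b *Backend, n, keepalives int) {
	var wg sync.WaitGroup
	for id := 0; id < n; id++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for i := 0; i < keepalives; i++ {
				b.Keepalive(id)
			}
		}(id)
	}
	wg.Wait()
}

func main() {
	backend := NewBackend()
	RunAgents(backend, 10000, 3)
	fmt.Println("agents seen by backend:", backend.ConnectedAgents())
}
```

Comparing the number of agents started with the number the backend reports is the same kind of sanity check that surfaced the performance issues described next.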
During those tests, we definitely found some performance issues.
So here are some details on the more fun/enraging bugs we uncovered during our testing exploits!
We discovered this about a minute before we released our alpha to the wild.
Etcd is a key-value store written in Go, and we use it to store configuration and data in Sensu-go.
One of our engineers was doing some manual testing in a local vm, and he left it running overnight.
The next day, it was unusable. What happened?
We saw all these errors saying “mvcc: database space exceeded” in the logs. Ruh roh.
You know when you get “database space exceeded” errors it’s bad.
So we dug in and pored over etcd.
Etcd keeps a record of all its keys whenever they are created or altered, including any internal keys. By default, etcd has a 2GB size limit.
We were doing a *lot* of writes, and since we were updating a key anytime a new event came in, we had a huge number of historical keys!
Enter autocompaction - you need to periodically prune the keyspace in order to keep it from maxing out your db.
There are two ways to set up autocompaction: by time (say, run compaction every hour) or by revision (only keep n revisions).
We wanted to keep 1 revision around, since there wasn’t any reason for us to go back in the keyspace history.
But for some reason, we couldn’t get it to work! We set autocompaction to revision mode with a value of 1, and nothing happened. As it turns out, we had uncovered a bug in etcd! It was calculating revisions based on time rather than by value, so we had to set it to 1 nanosecond for it to work. We reported the bug, they fixed it quickly, we upgraded, and now our db is happily autocompacting away.
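To illustrate why history piles up and what revision-based compaction buys you, here is a toy multi-version store in Go. It is a simplified model, not etcd's implementation or API: `Store`, `Put`, `Compact`, and `CompactKeeping` are hypothetical names, and the compaction rule (drop superseded records at or before a revision, keep the latest value of each key) is an assumption made for the sketch.

```go
package main

import "fmt"

// revision is one historical record of a key being written.
type revision struct {
	rev   int64
	key   string
	value string
}

// Store is a toy multi-version key-value store. Every Put appends a new
// record instead of overwriting, so history grows with each write.
type Store struct {
	rev     int64
	history []revision
}

// Put writes a value and bumps the store-wide revision counter.
func (s *Store) Put(key, value string) {
	s.rev++
	s.history = append(s.history, revision{rev: s.rev, key: key, value: value})
}

// Len reports how many records the store is holding.
func (s *Store) Len() int { return len(s.history) }

// Compact discards records at or before atRev that were superseded by a
// newer record for the same key at or before atRev.
func (s *Store) Compact(atRev int64) {
	latest := make(map[string]int64) // key -> newest revision <= atRev
	for _, r := range s.history {    // history is in ascending revision order
		if r.rev <= atRev {
			latest[r.key] = r.rev
		}
	}
	kept := s.history[:0]
	for _, r := range s.history {
		if r.rev > atRev || latest[r.key] == r.rev {
			kept = append(kept, r)
		}
	}
	s.history = kept
}

// CompactKeeping compacts everything older than the last n revisions.
func (s *Store) CompactKeeping(n int64) { s.Compact(s.rev - n) }

func main() {
	s := &Store{}
	for i := 0; i < 1000; i++ {
		s.Put("event", fmt.Sprintf("status-%d", i)) // same key, updated per event
	}
	fmt.Println("records before compaction:", s.Len())
	s.CompactKeeping(1)
	fmt.Println("records after compaction: ", s.Len())
}
```

In this model a thousand updates to one key leave a thousand records until compaction runs, which is the growth pattern that exhausted the database quota overnight.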
We have an end to end test suite that spins up a sensu backend, agent, and command line interface and runs some basic tests against our features.
One of these features is Time Windows, which are used to exclude notifications outside of a particular day and time.
Our e2e test suite ran fine during the daytime, but sometimes tests would intermittently fail around the end of the day. It *seemed* intermittent and random since we didn’t always push code or open pull requests at the end of our day (around 5pm Pacific time).
We had gremlins in our tests!
But hey wait - 5pm Pacific (during daylight saving time) is when UTC rolls over to midnight.
As it turns out, we had a time calculation that derived the y/m/d from the current time, and it wasn’t normalized to UTC.
This was fixed with a 6 character change! The commit message was longer than the fix.
Check scheduling is kind of the bread and butter of what Sensu does.
One of the tests we ran was to see how many events we could process per second before the system started to fall over.
We added a check and scheduled it to be run on a 1-second interval, and then compared the number of requests with the number of events in the system. To our surprise, they were different! It looked like not all of our check requests were being executed on the agents.
We started testing further by doing what you do when you don’t quite know where the failure is in a system: throwing logging everywhere, creating a check to write a unix timestamp to a file, and counting the results.
We were quickly able to narrow it down to somewhere in the transport between the backend and the agent - the agent was executing all the requests it received, but not getting all the requests that were sent from the backend!
Figuring out where in the transport the request was failing was harder. It took three engineers several multi-hour pairing/debugging sessions to narrow down the bug.
Sensu uses a message topic on top of a Go channel for message routing. It turns out we were checking for a topic’s existence, but not checking whether that topic was usable (that is, whether its Go channel was still open).
This was a sneaky bug. It was tough to replicate, and our tests didn’t surface the issue. It was only when we went to test functionality that we uncovered it, and it still took lots of walking through code logic to find out why delivery was sometimes failing.
And yes, we have a test for this now :)
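The difference between "the topic exists" and "the topic is usable" can be sketched in Go as follows. This is not Sensu's message bus; `Router`, `Open`, `Close`, `Publish`, and the error values are hypothetical, and the sketch assumes a closed topic can linger in the lookup table after its channel has been closed.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var (
	ErrNoTopic     = errors.New("no such topic")
	ErrTopicClosed = errors.New("topic exists but is closed")
	ErrTopicFull   = errors.New("topic buffer full")
)

// topic wraps a channel and remembers whether it has been closed.
type topic struct {
	ch     chan string
	closed bool
}

// Router is a toy topic-based message router built on Go channels.
type Router struct {
	mu     sync.Mutex
	topics map[string]*topic
}

// NewRouter returns a router with no topics.
func NewRouter() *Router {
	return &Router{topics: make(map[string]*topic)}
}

// Open creates a topic and returns the channel a subscriber reads from.
func (r *Router) Open(name string, buffer int) <-chan string {
	r.mu.Lock()
	defer r.mu.Unlock()
	t := &topic{ch: make(chan string, buffer)}
	r.topics[name] = t
	return t.ch
}

// Close closes the topic's channel but leaves the entry in the map,
// simulating a stale topic that still "exists".
func (r *Router) Close(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if t, ok := r.topics[name]; ok && !t.closed {
		close(t.ch)
		t.closed = true
	}
}

// Publish checks both that the topic exists and that it is still open.
func (r *Router) Publish(name, msg string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	t, ok := r.topics[name]
	if !ok {
		return ErrNoTopic
	}
	if t.closed { // the existence check alone would miss this case
		return ErrTopicClosed
	}
	select {
	case t.ch <- msg:
		return nil
	default:
		return ErrTopicFull
	}
}

func main() {
	r := NewRouter()
	requests := r.Open("check-requests", 1)
	fmt.Println("publish while open:", r.Publish("check-requests", "run check-http"))
	fmt.Println("agent received:", <-requests)
	r.Close("check-requests")
	fmt.Println("publish after close:", r.Publish("check-requests", "run check-http"))
}
```

Returning a distinct error for the closed case makes the failure visible to the caller, which also makes it straightforward to cover with a test.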
So now we have some confidence and recommendations for operating Sensu. It’s time to tell people how to use it!
I can’t tell you how many times I’ve pored over codebases looking through functions, comments, and system architecture to figure out how something works because it wasn’t well documented.
We have pretty complete documentation for 1.x, and in many cases, the mechanics/how to guides would apply to 2.0.
However, there were several differences in functionality or implementation that we felt necessary to explain in order for new users to get started with Sensu 2.0.
Writing documentation on how to use something is also a great way to ensure that you’ve built a feature correctly - i.e., does this do what we said it was supposed to do?
We didn’t have much guidance when we started out writing documentation for our Alpha release. We had a deadline and needed to get something workable.
So we wrote a bunch of markdown docs and chucked them in a github repo.
Repos are pretty great for collaboration and iterative development, but not necessarily for navigating information, or for formatting docs in a way that is easily accessible.
Our docs weren’t bad, but they weren’t user focused - we had way too many examples that didn’t fit well together, and we spent too much time going into detail about how our APIs worked and not enough time on how to use the product.
They also centered around our alpha program, and not how to use the software.
So we decided to polish them for our Beta release. Make it work and then make it pretty, right?
While we were hard at work on V2, a revamped docs site was also underway.
The main goals for the new docs site were search, better information organization, and easier doc contributions (a static site generator using markdown).
My colleague and I tag-teamed for 3 weeks on the Beta V2 docs after discussing the pain points we saw in using them.
We started by writing a doc and a template for how to write docs!
Our template consisted of a basic guide for how to write how-to guides, and a template for reference (api) documentation to fill in the blanks of where guides leave off.
Guides introduce a feature by explaining what it is, a use case, and then some simple and clear instructions for how to implement that use case.
And this is what our docs site looks like now.
A guide isn’t meant to be complete - it is intended to show how to get up and running quickly with a feature.
For more in-depth explanations of our features within Sensu, we have an API reference; it lists how a feature works, and what its default attributes are.
We’ve released everything to the wild! Hooray! But we’re not done yet.
Since we can’t drink our own champagne, we rely on our community’s experience to drive the work that we’re doing.
This takes the form of:
Accelerated feedback program - work with product and engineering
Test days - this is a new program we’re trying out to introduce features and get community experiences and feedback
Bug Reports (please tell us if something is broken!)
Experience Reports (we often hear if something went poorly, but we want to hear what worked, too!)
Community slack chat (we try to have engineering discussion and decisions out in the open via slack and design proposals in a public github repo)
Feature requests - what should Sensu do that it’s not currently doing? Is this useful to other folks?
Want to learn more? Check out the project, our list of issues, and our community slack!