Building open source monitoring tools

Sensu v1 is a popular open source monitoring framework with a wealth of community contributed and maintained plugins and features. As Sensu does not offer SaaS, the features and direction of our software have been guided by our customers and the experiences of those in our community. As its features and capabilities have matured, we've discovered some operational challenges at scale. In this talk, I'll discuss our v2 rewrite, the challenges of building scalable, feature complete tools for operations folks without running our own infrastructure, and how we included our community in this process. I'll also dive into some of the tools and methods we use for load testing and rapidly releasing new features.

Overview
∎ Motivations
∎ Testing
∎ Bugs
∎ Community

What is Open Source?
Code that is publicly available, with permission to use, modify, and share.
∎ Has a license
∎ Contributors document
∎ In source control

Motivations
Why do we work on open source software?
∎ Ensuring the systems we build are performing as we expect, and that we find out when they aren’t
∎ Build/maintain tools that make our own work easier
∎ Build/maintain services to make others’ work easier

What is Sensu?
∎ ~6 year old open source monitoring framework
∎ Service checks, metrics, filtering, auto remediation, etc.
∎ V1 - Ruby, Redis, and RabbitMQ
∎ V2 - Go and etcd v3

V1 - Operational Challenges
∎ Clustering needed for large installations
∎ Dependent on external processes
∎ Configuration management driven

V2 Rewrite - Goals
∎ Reduce operational complexity, increase performance
∎ Backwards compatible with current plugins
∎ API driven
∎ Written in Go on top of etcd v3

Testing
∎ Does the software behave as we expect?
∎ Does it solve a problem or need?
∎ Can we find the bugs before users do?

etcd Autocompaction bug
https://github.com/sensu/sensu-go/pull/1046

Don’t feed E2E Tests after Midnight
https://github.com/sensu/sensu-go/pull/1019

Check Scheduling Failure
https://github.com/sensu/sensu-go/pull/1424

Write the Docs
“If a new user has a bad time, it’s a bug.” - @jordansissel

Alpha Documentation
Handcrafted, bespoke, artisanal developer documentation, hosted in a GitHub repo

github.com/sensu/sensu-alpha-documentation

Beta Documentation
Simple how-to guides and API reference on our official docs site! *
* docs.sensu.io

Community
∎ Moar Community Engagement
□ Accelerated Feedback Program
□ http://bit.ly/sensu-afp
∎ Test Days
□ github.com/sensu/sensu-test-day

Resources
∎ github.com/sensu/sensu-go/blob/master/CONTRIBUTING.md
∎ github.com/sensu/sensu-go/issues
∎ docs.sensu.io
∎ slack.sensu.io/

1. Building Open Source Monitoring Tools Mercedes Coyle, Software Engineer @benzobot

2. ∎ Spent ~6 years in on-call rotations ∎ I now build software that wakes people up in the night ∎ In my spare time I herd chickens #FarmOps ∎ @benzobot on the twitters

3. Overview ∎ Motivations ∎ Testing ∎ Bugs ∎ Community

4. What is Open Source? Code that is publicly available, with permission to use, modify, and share. ∎ Has a license ∎ Contributors document ∎ In source control

5. Motivations Why do we work on open source software? ∎ Ensuring the systems we build are performing as we expect, and that we find out when they aren’t ∎ Build/maintain tools that make our own work easier ∎ Build/maintain services to make others’ work easier

6. What is Sensu? ∎ ~6 year old open source monitoring framework ∎ Service checks, metrics, filtering, auto remediation, etc ∎ V1 - Ruby, Redis, and RabbitMQ ∎ V2 - Go and etcd v3

7. Community Driven

8. Sensu V1 Architecture

9. V1 - Operational Challenges ∎ Clustering needed for large installations ∎ Dependent on external processes ∎ Configuration management driven

10. V2 Rewrite - Goals ∎ Reduce operational complexity, increase performance ∎ Backwards compatible with current plugins ∎ API driven ∎ Written in Go on top of etcd v3

11. Sensu V2 Architecture

12. Open Sourcing

13. Sensu Alpha

14. Where are your dashboards?

15. Testing ∎ Does the software behave as we expect? ∎ Does it solve a problem or need? ∎ Can we find the bugs before users do?

16. Code Quality

17. Code Analysis

18. Mob QA

19. Load Testing

20. Fun bugs!

21. etcd Autocompaction bug https://github.com/sensu/sensu-go/pull/1046

22. etcd Autocompaction

23. Don’t feed E2E Tests after Midnight https://github.com/sensu/sensu-go/pull/1019

24. UTC 4 lyfe

25. Check Scheduling Failure https://github.com/sensu/sensu-go/pull/1424

26. !ok

27. Write the Docs “If a new user has a bad time, it’s a bug.” - @jordansissel

28. Alpha Documentation Handcrafted, bespoke, artisanal developer documentation, hosted in a github repo

29. github.com/sensu/sensu-alpha-documentation

30. Beta Documentation Simple how-to guides and API reference on our official docs site! * *docs.sensu.io

31. Doc for how to write docs!

32.

33. Next Steps

34. Community ∎ Moar Community Engagement □ Accelerated Feedback Program □ http://bit.ly/sensu-afp ∎ Test Days □ github.com/sensu/sensu-test-day

35. Resources ∎ github.com/sensu/sensu-go/blob/master/CONTRIBUTING.md ∎ github.com/sensu/sensu-go/issues ∎ docs.sensu.io ∎ slack.sensu.io/

36. Thanks! Mercedes Coyle @benzobot

Editor's Notes

Alternate titles include: when to rewrite, how not to leave your community behind, etc
Built data infrastructure and tooling for systems monitoring and performance analysis. I’ve spent some time on call, so I know what it’s like to scramble when things go bump in the night. As someone who has been there, I am conscientious about building quality software that people delight in using (even if its job is to wake them up). Be forewarned: if you follow me on Twitter, you will see pictures of chickens!
Today I’m going to cover: the motivations behind building open source software; different types of testing, and what we do to ensure Sensu is feature complete and performant when we can’t drink our own champagne, since we don’t have any infrastructure; some of the weird and wonderful bugs we found and fixed; and finally, the importance of community and what we’re doing to include and learn from it.
But first, a couple of definitions. I find it useful to put some qualifiers on what OSS really is. Anyone can put code out in public, but to really be open source, you have to tell people how they can use it, how they can contribute, and where to find it.
Much of our industry relies on open source software in their daily work. Even tools that we pay for can be derived from open source. I want to build and maintain tools that make people’s jobs easier, and I want to ensure those tools are performant. Fame/glory: there is something cool about seeing people use code you wrote.
Ok so now let’s get into what Sensu is, and I’ll start talking about our journey to an exciting rewrite. As a monitoring framework, we’re positioned to be the hub in an end-to-end monitoring solution. Service and system checks are at the heart of Sensu, but you can also use it for metrics collection, sophisticated alerting, auto remediation, etc. There is a lot you can do, and I think this is pretty cool! Our community is really important - as they’ve used Sensu in their own infrastructure and written plugins and checks, they share that work with others, making it easy for folks new to this solution to get started. V1 is written in Ruby, using RabbitMQ as a messaging transport and Redis for coordination. Our learnings from V1 have led us to rewrite Sensu in Go, embedding etcd, which I’ll talk about in a bit.
Community metrics: Plugins! We have a couple frameworks for writing your own plugins, and over 200 community built, open source plugins.
RabbitMQ serves as our transport layer; redis holds state. Clients subscribe to the checks they need to run. Actions work off of a publish subscribe messaging model.
These technologies have been extraordinarily useful and helpful. However, the infrastructure landscape has changed over time, and it’s time to reevaluate what got us here. What are the pain points? Deployment (especially high availability) can be complex. Our engineering team got a chance to sit in on a training by our Success team on how to install HA Sensu v1 without config management, and it was a time consuming process. The state of v1 is not easy to containerize or manage without configuration management.
I could probably do a talk on each of these topics, but suffice it to say, we wanted to make it easier to install and get up and running quickly. The path to doing this was to spend almost a year in a closed source development cycle, rewriting the entire system in Go on top of etcd v3. Go allows us to ship a couple of binaries that are easily installed via traditional package management, Docker, or within Kubernetes. Now, I can set up a backend and an agent, and have a check running, in a handful of minutes.
We still use a publish subscribe model, but now our state and configuration are stored in etcd. Etcd is embedded in the binary so you don’t have to run a separate instance! We’ll eventually have clustering aided by etcd as well.
We released it to the wild as an Alpha in February this year. I’m skipping ahead a bit here, but it was a pretty uninteresting process that was really just writing a bunch of code and closing issues. Our path was clear as we had a successful project to model, and about 6 years of experience from working with the community and businesses that used it.
We knew the alpha wasn’t perfect or polished, and we still had a long feature list to get through, but it was important to get it out there early to get user feedback and bug reports. So we started asking people to use it! “Hey friends, can you run this binary in your infrastructure and tell us when it falls over?”
And they responded: “This is cool, but where are the dashboards?!” “Hey did you know your nightly build is a month old?” “When will you support clustering/HA?” It’s sometimes painful getting reports that something isn’t going right for our users, but totally necessary. Because we are not the main consumers of our software, we rely on user feedback to know what’s important.
How do we know our software works when we are not the primary consumers? And actually - we don’t run any infrastructure at all? While we did testing during development, it was limited to unit, integration, and end to end testing. These tests are really useful for determining if your code is working during development, but not as useful for determining feature need, usability, or even system behavior. In order to get some benchmarks about how our system was performing (or not) and the usability of said system, we had to do some different types of testing. Once we had an Alpha, and some information from users, and space to breathe from feature development, we started planning out what that testing looked like. Usability testing is actually kind of tough when you aren’t the core user of your product, and especially when it’s such a malleable framework (everyone has a different use case!) And for that matter, Integration and end to end testing is really hard too.
We either have a pair and one additional reviewer, or one author and two reviewers. Note that bugs can still slip! Every PR runs our full test suite with linter, and we schedule a nightly build. We started using Velocity a couple months ago with sensu-go to track PR size, time to merge, and complexity risk. So far, it’s been mostly a confirmation tool for us that we’re keeping our PR size manageable and getting reviews completed in a timely fashion. Our PR goal is 200 lines of code, and we do that at least 50% of the time.
This is the other interesting metric worth paying attention to: our build success ratio. Lately we’ve been having a _tough time_ as we’re starting to run into some issues with our testing strategy! Our end to end testing strategy is proving to be brittle, and we’re having to rework some of our build tests to be more reliable.
You can’t possibly cover everything in software testing. It’s helpful to have actual usage with the software to uncover bugs. So we got everyone together on a zoom call and came up with a strategy. We’d each pick a component, write down how that component should behave and some acceptance criteria for testing. Everyone would then pick one or two features to test that they had not worked on. This was actually pretty fun to do! We found a few bugs, but the ones we found were minor, and we gained confidence in the performance and usability of our software.
Not having infrastructure, and working through a process of local development, continuous integration, and package building/deployment, meant that we only had guesses as to the performance of our software. To be confident in recommending usage and developing new features, we needed some data points! So we came up with a plan and a goal for load testing. We wanted to see if we could connect 10K sensu-agents to a single sensu-backend, running keepalives and processing checks. Initially, we planned on doing this by setting up a single VM running sensu-backend in Google Cloud Platform, and then spinning up a Kubernetes cluster with 5 agents per pod. We expected that we could push a button and have GKE autoscale until we had the number of agents we needed (10K). In actuality, we ended up effectively load testing GKE - we were never able to scale up to 10K agents! What happened was API throttling on the part of GKE, so we needed to rethink our load testing strategy. So we did something much simpler - we wrote a script to spin up 10K agents, and connected that to our single backend. During those tests, we definitely found some performance issues.
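The "many agents, one backend" shape of that test can be sketched in-process. This is only an illustration, not the script we used: goroutines stand in for agents, a channel stands in for the backend's connection handling, and the names `keepalive` and `runLoadTest` are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// keepalive is a hypothetical stand-in for the message an agent sends.
type keepalive struct {
	agentID int
}

// runLoadTest starts `agents` goroutines, each sending `perAgent` keepalives
// to a single in-process "backend", and returns how many the backend received.
func runLoadTest(agents, perAgent int) int {
	inbox := make(chan keepalive, 1024)
	var wg sync.WaitGroup

	for id := 0; id < agents; id++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for i := 0; i < perAgent; i++ {
				inbox <- keepalive{agentID: id}
			}
		}(id)
	}

	// Close the inbox once every simulated agent has finished sending.
	go func() {
		wg.Wait()
		close(inbox)
	}()

	// The "backend": drain the inbox and count what arrived.
	received := 0
	for range inbox {
		received++
	}
	return received
}

func main() {
	const agents, perAgent = 10000, 3
	got := runLoadTest(agents, perAgent)
	fmt.Printf("sent=%d received=%d\n", agents*perAgent, got)
}
```

Comparing what was sent with what was received is the useful part: a mismatch between those two numbers is the kind of signal that points at lost messages.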
So here are some details on the more fun/enraging bugs we uncovered during our testing exploits!
We discovered this about a minute before we released our alpha to the wild. Etcd is a key-value store written in Go, and we use it to store configuration and data in Sensu-go. One of our engineers was doing some manual testing in a local vm, and he left it running overnight. The next day, it was unusable. What happened? We saw all these errors saying “mvcc: database space exceeded” in the logs. Ruh ruh. You know when you get “database space exceeded” errors it’s bad. So we dug in and pored over etcd.
Etcd keeps a record of all its keys whenever they are created or altered, including any internal keys. By default, etcd has a 2GB size limit. We were doing a *lot* of writes, and since we were updating a key anytime a new event came in, we had a huge number of historical keys! Enter autocompaction - you need to periodically prune the keyspace in order to keep it from maxing out your db. There are two ways to set up autocompaction: by time (say, run compaction every hour) or by revision (only keep n revisions). We wanted to keep 1 revision around, since there wasn’t any reason for us to go back in the keyspace history. But for some reason, we couldn’t get it to work! We set autocompaction to revision with an int of 1, and nothing happened. As it turns out, we uncovered a bug in etcd! It was calculating revisions based on time and not by value, so we had to set it to 1 nanosecond for it to work. We reported the bug, they fixed it quickly, we upgraded, and now our db is happily autocompacting away.
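To see why history piles up and what compaction buys you, here is a toy multi-version store. It is a sketch of the idea only, not etcd's API or implementation: the `store` type, its methods, and the key name `events/agent-1` are all made up for illustration.

```go
package main

import "fmt"

// revision is one historical write to a key.
type revision struct {
	rev   int64
	value string
}

// store is a toy multi-version key-value store (hypothetical, not etcd).
type store struct {
	current int64
	history map[string][]revision
}

func newStore() *store {
	return &store{history: make(map[string][]revision)}
}

// put appends a new revision instead of overwriting the previous value.
func (s *store) put(key, value string) {
	s.current++
	s.history[key] = append(s.history[key], revision{rev: s.current, value: value})
}

// compact discards history older than atRev. For each key it keeps the
// newest revision at or before atRev, plus everything written after it.
func (s *store) compact(atRev int64) {
	for key, revs := range s.history {
		keepFrom := 0
		for i, r := range revs {
			if r.rev <= atRev {
				keepFrom = i
			}
		}
		s.history[key] = append([]revision(nil), revs[keepFrom:]...)
	}
}

// size returns the total number of stored revisions across all keys.
func (s *store) size() int {
	n := 0
	for _, revs := range s.history {
		n += len(revs)
	}
	return n
}

func main() {
	s := newStore()
	// Updating the same key on every event grows the history without bound.
	for i := 0; i < 1000; i++ {
		s.put("events/agent-1", fmt.Sprintf("event %d", i))
	}
	fmt.Println("revisions before compaction:", s.size())
	s.compact(s.current)
	fmt.Println("revisions after compaction:", s.size())
}
```

The current value of every key survives compaction; only the superseded revisions go away, which is why keeping a single revision was enough for our use.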
We have an end to end test suite that spins up a sensu backend, agent, and command line interface and runs some basic tests against our features. One of these features is Time Windows, which are used to exclude notifications outside of a particular day and time. Our e2e test suite ran fine during the daytime, but sometimes tests would intermittently fail around the end of the day. It *seemed* intermittent and random since we didn’t always push code or open pull requests at the end of our day (around 5pm Pacific time). We had gremlins in our tests!
But hey wait - 5pm Pacific daylight time is when UTC rolls over to midnight. As it turns out, we had a time calculation that was calculating the y/m/d from the current time, and wasn’t localized to UTC. This was fixed with a 6 character change! The commit message was longer than the fix.
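The class of bug is easy to reproduce with Go's `time` package. The sketch below is an assumption about the general shape of the problem, not the actual patch (that lives in the linked pull request); `dayStartNaive`, `dayStartUTC`, and `sampleTime` are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// dayStartNaive reads the calendar date in t's own location, then builds
// midnight UTC from it. Late in the Pacific day this is the wrong UTC date.
func dayStartNaive(t time.Time) time.Time {
	y, m, d := t.Date()
	return time.Date(y, m, d, 0, 0, 0, 0, time.UTC)
}

// dayStartUTC converts to UTC before reading the calendar date.
func dayStartUTC(t time.Time) time.Time {
	y, m, d := t.UTC().Date()
	return time.Date(y, m, d, 0, 0, 0, 0, time.UTC)
}

// sampleTime returns 17:30 on 1 June 2018 at UTC-7, a fixed offset standing
// in for US Pacific daylight time. That instant is 00:30 UTC on 2 June.
func sampleTime() time.Time {
	pacific := time.FixedZone("PDT", -7*60*60)
	return time.Date(2018, time.June, 1, 17, 30, 0, 0, pacific)
}

func main() {
	t := sampleTime()
	fmt.Println("naive:", dayStartNaive(t).Format("2006-01-02"))
	fmt.Println("utc:  ", dayStartUTC(t).Format("2006-01-02"))
}
```

The two functions agree for most of the day and only disagree in the hours after UTC midnight, which is what makes this kind of failure look random.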
Check scheduling is kind of the bread and butter of what Sensu does. One of the tests we ran was to see how many events we could process per second before the system started to fall over. We added a check and scheduled it to be run on a 1-second interval, and then attempted to verify the number of requests + the number of events in the system. To our surprise, they were different! It looked like not all of our check requests were being executed on the agents. We started testing further by doing what you do when you don’t quite know where the failure is in a system: throwing logging everywhere, creating a check to write a unix timestamp to a file, and counting the results. We were quickly able to narrow it down to somewhere in the transport between the backend and the agent - the agent was executing all the requests it received, but not getting all the requests that were sent from the backend!
Figuring out where in the transport the request was failing was harder. It took 3 engineers multiple multi-hour pairing/debugging sessions to narrow down the bug. Sensu uses a message topic on top of a Go channel for message routing. It turns out we were checking for a topic’s existence, but not checking if that topic was usable (Go channel open). This was a sneaky bug. It was tough to replicate, and our tests didn’t surface the issue. It was only when we went to test functionality that we uncovered it, and it still took lots of walking through code logic to find out why delivery was sometimes failing. And yes, we have a test for this now :)
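Here is a minimal sketch of "exists, but not usable". It is not Sensu's message bus: the `topic` and `bus` types, the topic name `check-requests`, and the choice to recreate a closed topic are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// topic is a hypothetical named queue backed by a buffered Go channel.
type topic struct {
	mu     sync.Mutex
	ch     chan string
	closed bool
}

func newTopic() *topic { return &topic{ch: make(chan string, 16)} }

// shutdown closes the channel; the topic may still be listed on the bus.
func (t *topic) shutdown() {
	t.mu.Lock()
	defer t.mu.Unlock()
	if !t.closed {
		t.closed = true
		close(t.ch)
	}
}

func (t *topic) isClosed() bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.closed
}

// send queues msg and reports whether it was accepted.
func (t *topic) send(msg string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.closed {
		return false
	}
	select {
	case t.ch <- msg:
		return true
	default:
		return false // buffer full
	}
}

type bus struct {
	mu     sync.Mutex
	topics map[string]*topic
}

func newBus() *bus { return &bus{topics: make(map[string]*topic)} }

// publishNaive only checks that the topic exists, so a closed topic
// silently drops the message while the caller is told it succeeded.
func (b *bus) publishNaive(name, msg string) bool {
	b.mu.Lock()
	t, ok := b.topics[name]
	b.mu.Unlock()
	if !ok {
		return false
	}
	t.send(msg) // result ignored
	return true
}

// publishChecked also verifies the topic is open, replacing it if not.
func (b *bus) publishChecked(name, msg string) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	t, ok := b.topics[name]
	if !ok || t.isClosed() {
		t = newTopic()
		b.topics[name] = t
	}
	return t.send(msg)
}

// pending reports how many messages are queued on a topic.
func (b *bus) pending(name string) int {
	b.mu.Lock()
	defer b.mu.Unlock()
	return len(b.topics[name].ch)
}

// demo publishes to a closed topic both ways and returns the queued counts.
func demo() (naive, checked int) {
	b := newBus()
	b.topics["check-requests"] = newTopic()
	b.topics["check-requests"].shutdown()
	b.publishNaive("check-requests", "run check")
	naive = b.pending("check-requests")
	b.publishChecked("check-requests", "run check")
	checked = b.pending("check-requests")
	return naive, checked
}

func main() {
	naive, checked := demo()
	fmt.Printf("queued after naive publish: %d, after checked publish: %d\n", naive, checked)
}
```

The naive path reports success for a message nobody will ever receive, which matches the symptom we saw: the sender and receiver counts disagree and no error explains why.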
So now we have some confidence and recommendations for operating Sensu. It’s time to tell people how to use it! I can’t tell you how many times I’ve pored over codebases looking through functions, comments, and system architecture to figure out how something works because it wasn’t well documented. We have pretty complete documentation for 1.x, and in many cases, the mechanics/how to guides would apply to 2.0. However, there were several differences in functionality or implementation that we felt necessary to explain in order for new users to get started with Sensu 2.0. Writing documentation on how to use something is also a great way to ensure that you’ve built a feature correctly - ie, does this do what we said it was supposed to do.
We didn’t have much guidance when we started out writing documentation for our Alpha release. We had a deadline and needed to get something workable. So we wrote a bunch of markdown docs and chucked them in a github repo.
Repos are pretty great for collaboration and iterative development, but not necessarily for navigating information, or formatting docs in a way that was easily accessible. Our docs weren’t bad, but they weren’t user focused - we had way too many examples that didn’t fit well together, and we spent too much time going into detail about how our APIs worked and not enough time on how to use the product. They also centered around our alpha program, and not how to use the software.
So we decided to polish them for our Beta release. Make it work and then make it pretty, right? While we were hard at work on V2, a revamped docs site was also underway. The main goal for the new docs site was introducing the ability to search, better information organization, and ease of new doc contributions (static site generator using markdown). My colleague and I tag-teamed for 3 weeks on the Beta V2 docs after discussing the pain points we saw in using them.
We started by writing a doc and a template for how to write docs! Our template consisted of a basic guide for how to write how-to guides, and a template for reference (api) documentation to fill in the blanks of where guides leave off. Guides introduce a feature by explaining what it is, a use case, and then some simple and clear instructions for how to implement that use case.
And this is what our docs site looks like now. A guide isn’t meant to be complete - it is intended to show how to get up and running quickly with a feature. For more in-depth explanations of our features within Sensu, we have an API reference; it lists how a feature works, and what its default attributes are.
We’ve released everything to the wild! Hooray! But we’re not done yet.
Since we can’t drink our own champagne, we rely on our community’s experience to drive the work that we’re doing. This takes the form of: the Accelerated Feedback Program (work with product and engineering); Test Days (a new program we’re trying out to introduce features and get community experiences and feedback); bug reports (please tell us if something is broken!); experience reports (we often hear if something went poorly, but we want to hear what worked, too!); community Slack chat (we try to have engineering discussion and decisions out in the open via Slack, and design proposals in a public GitHub repo); and feature requests (what should Sensu do that it’s not currently doing? Is this useful to other folks?).
Want to learn more? Check out the project, our list of issues, and our community slack!
