Building Open Source
Monitoring Tools
Mercedes Coyle, Software Engineer
@benzobot
∎ Spent ~6 years in on-call rotations
∎ I now build software that wakes people up in
the night
∎ In my spare time I herd chickens #FarmOps
∎ @benzobot on the twitters
∎ Motivations
∎ Testing
∎ Bugs
∎ Community
Overview
What is Open
Source?
Code that is publicly
available, with
permission to use,
modify, and share.
∎ Has a license
∎ Contributors
document
∎ In source control
Motivations
∎ Ensuring the systems we build
are performing as we expect, and
that we find out when they aren’t
∎ Build/maintain tools that make
our own work easier
∎ Build/maintain services to make
others’ work easier
Why do we work on open source software?
What is Sensu?
∎ ~6 year old open source
monitoring framework
∎ Service checks, metrics,
filtering, auto remediation, etc
∎ V1 - Ruby, Redis, and
RabbitMQ
∎ V2 - Go and etcd v3
Community Driven
Sensu V1 Architecture
V1 - Operational Challenges
∎ Clustering needed for large
installations
∎ Dependent on external processes
∎ Configuration management driven
∎ Reduce operational complexity,
increase performance
∎ Backwards compatible with
current plugins
∎ API driven
∎ Written in Go on top of etcd v3
V2 Rewrite - Goals
Sensu V2 Architecture
Open Sourcing
Sensu Alpha
Where are your dashboards?
Testing
∎ Does the software behave as we
expect?
∎ Does it solve a problem or need?
∎ Can we find the bugs before
users do?
Code Quality
Code Analysis
Mob QA
Load Testing
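The notes for this slide describe the final load test as a script that started 10K agents against a single backend. A minimal sketch of that shape, using goroutines as stand-ins for agents and a channel as a stand-in for the backend connection, might look like the following. The `keepalive` type and `runAgents` function are hypothetical names for illustration; this is not the script the team used and it does not exercise real sensu-agent or sensu-backend processes.

```go
package main

import (
	"fmt"
	"sync"
)

// keepalive is a hypothetical stand-in for the message an agent
// sends to tell the backend it is still alive.
type keepalive struct {
	agentID int
}

// runAgents starts n simulated agents, each sending perAgent keepalives
// to one simulated backend, and returns how many the backend received.
func runAgents(n, perAgent int) int {
	inbox := make(chan keepalive, 1024)
	var agents sync.WaitGroup

	for id := 0; id < n; id++ {
		agents.Add(1)
		go func(id int) {
			defer agents.Done()
			for i := 0; i < perAgent; i++ {
				inbox <- keepalive{agentID: id}
			}
		}(id)
	}

	// Close the inbox once every agent has finished sending.
	go func() {
		agents.Wait()
		close(inbox)
	}()

	// The simulated backend counts keepalives until the inbox is closed.
	received := 0
	for range inbox {
		received++
	}
	return received
}

func main() {
	const agents, perAgent = 10000, 3
	got := runAgents(agents, perAgent)
	fmt.Printf("sent=%d received=%d\n", agents*perAgent, got)
}
```

Comparing the number sent with the number received is the same kind of bookkeeping that later exposed the check scheduling bug: if the two counts differ, something between sender and receiver is dropping messages.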
Fun bugs!
etcd Autocompaction
bug
https://github.com/sensu/sensu-go/pull/1046
etcd Autocompaction
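To see why compaction matters, it can help to model a store that keeps every revision of a key. The toy below is only an illustration of the idea described in the notes for this slide (every write adds a revision, compaction keeps the newest n). It is not etcd's implementation or API, and the `store`, `put`, and `compact` names are made up for this sketch.

```go
package main

import "fmt"

// revision is one historical value of a key.
type revision struct {
	rev   int64
	value string
}

// store is a toy multi-version key-value store. It only models
// "every write adds a revision"; it is not how etcd works internally.
type store struct {
	nextRev int64
	history map[string][]revision
}

func newStore() *store {
	return &store{nextRev: 1, history: make(map[string][]revision)}
}

// put appends a new revision instead of overwriting the old value.
func (s *store) put(key, value string) {
	s.history[key] = append(s.history[key], revision{rev: s.nextRev, value: value})
	s.nextRev++
}

// compact keeps only the newest keep revisions of every key.
// keep is assumed to be zero or greater.
func (s *store) compact(keep int) {
	for key, revs := range s.history {
		if len(revs) > keep {
			s.history[key] = append([]revision(nil), revs[len(revs)-keep:]...)
		}
	}
}

// size returns the total number of stored revisions.
func (s *store) size() int {
	total := 0
	for _, revs := range s.history {
		total += len(revs)
	}
	return total
}

func main() {
	s := newStore()
	// One key updated on every incoming event, as in the notes.
	for i := 0; i < 1000; i++ {
		s.put("events/agent-1/check-cpu", fmt.Sprintf("event %d", i))
	}
	fmt.Println("revisions before compaction:", s.size())
	s.compact(1)
	fmt.Println("revisions after compaction:", s.size())
}
```

A single key updated a thousand times holds a thousand revisions until compaction runs, which is how a write-heavy workload can reach a storage limit even when the number of live keys is small.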
Don’t feed E2E Tests after Midnight
https://github.com/sensu/sensu-go/pull/1019
UTC 4 lyfe
Check Scheduling Failure
https://github.com/sensu/sensu-go/pull/1424
!ok
Write the Docs
“If a new user has a bad time, it’s a
bug.”
- @jordansissel
Alpha Documentation
Handcrafted,
bespoke,
artisanal
developer
documentation,
hosted in a
github repo
github.com/sensu/sensu-alpha-documentation
Beta Documentation
Simple how-to
guides and API
reference on our
official docs site! *
*docs.sensu.io
Doc for how to write docs!
Next Steps
Community
∎ Moar Community Engagement
□ Accelerated Feedback Program
□ http://bit.ly/sensu-afp
∎ Test Days
□ github.com/sensu/sensu-test-day
Resources
∎ github.com/sensu/sensu-go/blob/master/CONTRIBUTING.md
∎ github.com/sensu/sensu-go/issues
∎ docs.sensu.io
∎ slack.sensu.io/
Thanks!
Mercedes Coyle
@benzobot

Editor's Notes

  1. Alternate titles include: when to rewrite, how not to leave your community behind, etc
  2. I’ve built data infrastructure and tooling for systems monitoring and performance analysis. I’ve spent some time on call, so I know what it’s like to scramble when things go bump in the night. As someone who has been there, I am conscientious about building quality software that people delight in using (even if its job is to wake them up). Be forewarned: if you follow me on twitter, you will see pictures of chickens!
  3. Today I’m going to cover: the motivations behind building open source software; different types of testing, and what we do to ensure Sensu is feature complete and performant when we can’t drink our own champagne, since we don’t have any infrastructure; some of the weird and wonderful bugs we found and fixed; and finally, the importance of community and what we’re doing to include and learn from it.
  4. But first, a couple definitions. I find it useful to put some qualifiers on what OSS really is. Anyone can put code out in public, but to really be open source, you have to tell people how they can use it, how they can contribute, and where to find it.
  5. Much of our industry relies on open source software in their daily work. Even tools that we pay for can be derived from open source. I want to build and maintain tools that make people’s jobs easier, and I want to ensure those tools are performant. Fame/glory: there is something cool about seeing people use code you wrote.
  6. Ok so now let’s get into what Sensu is, and I’ll start talking about our journey to an exciting rewrite. As a monitoring framework, we’re positioned to be the hub in an end-to-end monitoring solution. Service and system checks are at the heart of Sensu, but you can also use it for metrics collection, sophisticated alerting, auto remediation, etc. There is a lot you can do, and I think this is pretty cool! Our community is really important - as they’ve used Sensu in their own infrastructure and written plugins and checks, they share that work with others, making it easy for folks new to this solution to get started. V1 is written in Ruby, using RabbitMQ as a messaging transport and Redis for coordination. Our learnings from V1 have led us to rewrite Sensu in Go, embedding etcd, which I’ll talk about in a bit.
  7. Community metrics: Plugins! We have a couple frameworks for writing your own plugins, and over 200 community built, open source plugins.
  8. RabbitMQ serves as our transport layer; redis holds state. Clients subscribe to the checks they need to run. Actions work off of a publish subscribe messaging model.
  9. These technologies have been extraordinarily useful and helpful. However, the infrastructure landscape has changed over time, and it’s time to reevaluate what got us here. What are the pain points? Deployment (especially high availability) can be complex. Our engineering team got a chance to sit in on a training by our Success team on how to install HA Sensu v1 without config management, and it was a time consuming process. The state of v1 is not easy to containerize or manage without configuration management.
  10. I could probably do a talk on each of these topics, but suffice to say, we wanted to make it easier to install and get up and running quickly. The path to doing this was to spend almost a year in a closed source development cycle, rewriting the entire system in Go on top of etcd v3. Go allows us to ship a couple of binaries that are easily installed via traditional package management, docker, or within kubernetes. Now, I can set up a backend, agent, and have a check running in a handful of minutes.
  11. We still use a publish subscribe model, but now our state and configuration are stored in etcd. Etcd is embedded in the binary so you don’t have to run a separate instance! We’ll eventually have clustering aided by etcd as well.
  12. We released it to the wild as an Alpha in February this year. I’m skipping ahead a bit here, but it was a pretty uninteresting process that was really just writing a bunch of code and closing issues. Our path was clear as we had a successful project to model, and about 6 years of experience from working with community and business that used it.
  13. We knew the alpha wasn’t perfect or polished, and we still had a long feature list to get through, but it was important to get it out there early to get user feedback and bug reports. So we started asking people to use it! “Hey friends, can you run this binary in your infrastructure and tell us when it falls over?”
  14. And they responded: “This is cool, but where are the dashboards?!” “Hey did you know your nightly build is a month old?” “When will you support clustering/HA?” It’s sometimes painful getting reports that something isn’t going right for our users, but totally necessary. Because we are not the main consumers of our software, we rely on user feedback to know what’s important.
  15. How do we know our software works when we are not the primary consumers? And actually - we don’t run any infrastructure at all? While we did testing during development, it was limited to unit, integration, and end to end testing. These tests are really useful for determining if your code is working during development, but not as useful for determining feature need, usability, or even system behavior. In order to get some benchmarks about how our system was performing (or not) and the usability of said system, we had to do some different types of testing. Once we had an Alpha, and some information from users, and space to breathe from feature development, we started planning out what that testing looked like. Usability testing is actually kind of tough when you aren’t the core user of your product, and especially when it’s such a malleable framework (everyone has a different use case!) And for that matter, Integration and end to end testing is really hard too.
  16. We either have a pair and one additional reviewer, or one author and two reviewers. Note that bugs can still slip! Every PR runs our full test suite with linter, and we schedule a nightly build. We started using Velocity a couple months ago with sensu-go to track PR size, time to merge, and complexity risk. So far, it’s been mostly a confirmation tool for us that we’re keeping our PR size manageable and getting reviews completed in a timely fashion. Our PR goal is 200 lines of code, and we do that at least 50% of the time.
  17. This is the other interesting metric worth paying attention to: our build success ratio. Lately we’ve been having a _tough time_ as we’re starting to run into some issues with our testing strategy! Our end to end testing strategy is proving to be brittle, and we’re having to rework some of our build tests to be more reliable.
  18. You can’t possibly cover everything in software testing. It’s helpful to have actual usage with the software to uncover bugs. So we got everyone together on a zoom call and came up with a strategy. We’d each pick a component, write down how that component should behave and some acceptance criteria for testing. Everyone would then pick one or two features to test that they had not worked on. This was actually pretty fun to do! We found a few bugs, but the ones we found were minor, and we gained confidence in the performance and usability of our software.
  19. Because we had no infrastructure of our own, and only a process of local development, continuous integration, and package building/deployment, we had only guesses as to the performance of our software. To be confident in recommending usage and developing new features, we needed some data points! So we came up with a plan and a goal for load testing. We wanted to see if we could connect 10K sensu-agents to a single sensu-backend, running keepalives and processing checks. Initially, we planned on doing this by setting up a single VM running sensu-backend in Google Cloud Platform, and then spinning up a Kubernetes cluster with 5 agents per pod. We expected that we could push a button and have GKE autoscale until we had the number of agents we needed (10K). In actuality, we ended up effectively load testing GKE - we were never able to scale up to 10K agents! What happened was API throttling on the part of GKE, so we needed to rethink our load testing strategy. So we did something much simpler - we wrote a script to spin up 10K agents, and connected that to our single backend. During those tests, we definitely found some performance issues.
  20. So here are some details on the more fun/enraging bugs we uncovered during our testing exploits!
  21. We discovered this about a minute before we released our alpha to the wild. Etcd is a key-value store written in Go, and we use it to store configuration and data in Sensu-go. One of our engineers was doing some manual testing in a local vm, and he left it running overnight. The next day, it was unusable. What happened? We saw all these errors saying “mvcc: database space exceeded” in the logs. Ruh ruh. You know when you get “database space exceeded” errors it’s bad. So we dug in and pored over etcd.
  22. Etcd keeps a record of all its keys whenever they are created or altered, including any internal keys. By default, etcd has a 2GB size limit. We were doing a *lot* of writes, and since we were updating a key anytime a new event came in, we had a huge number of historical keys! Enter autocompaction - you need to periodically prune the keyspace in order to keep it from maxing out your db. There are two ways to set up autocompaction: by time (say, run compaction every hour) or by revision (only keep n revisions). We wanted to keep 1 revision around, since there wasn’t any reason for us to go back in the keyspace history. But for some reason, we couldn’t get it to work! We set autocompaction to revision with an int of 1, and nothing happened. As it turns out, we uncovered a bug in etcd! It was calculating revisions based on time and not by value, so we had to set it to 1 nanosecond for it to work. We reported the bug, they fixed it quickly, we upgraded, and now our db is happily autocompacting away.
  23. We have an end to end test suite that spins up a sensu backend, agent, and command line interface and runs some basic tests against our features. One of these features is Time Windows, which are used to exclude notifications outside of a particular day and time. Our e2e test suite ran fine during the daytime, but tests would sometimes fail around the end of the day. It *seemed* intermittent and random since we didn’t always push code or open pull requests at the end of our day (around 5pm PDT). We had gremlins in our tests!
  24. But hey wait - 5pm PDT is when UTC rolls over to midnight. As it turns out, we had a time calculation that was calculating the y/m/d from the current time, and wasn’t normalized to UTC. This was fixed with a 6 character change! The commit message was longer than the fix.
  25. Check scheduling is kind of the bread and butter of what Sensu does. One of the tests we ran was to see how many events we could process per second before the system started to fall over. We added a check and scheduled it to be run on a 1-second interval, and then attempted to verify the number of requests + the number of events in the system. To our surprise, they were different! It looked like not all of our check requests were being executed on the agents. We started testing further by doing what you do when you don’t quite know where the failure is in a system: throwing logging everywhere, creating a check to write a unix timestamp to a file, and counting the results. We were quickly able to narrow it down to somewhere in the transport between the backend and the agent - the agent was executing all the requests it received, but not getting all the requests that were sent from the backend!
  26. Figuring out where in the transport the request was failing was harder. It took 3 engineers multiple many-hour pairing/debugging sessions to narrow down the bug. Sensu uses a message topic on top of a Go channel for message routing. It turns out we were checking for a topic’s existence, but not checking if that topic was usable (Go channel open). This was a sneaky bug. It was tough to replicate, and our tests didn’t surface the issue. It was only when we went to test functionality that we uncovered it, and it still took lots of walking through code logic to find out why requests were only sometimes failing. And yes, we have a test for this now :)
  27. So now we have some confidence and recommendations for operating Sensu. It’s time to tell people how to use it! I can’t tell you how many times I’ve pored over codebases looking through functions, comments, and system architecture to figure out how something works because it wasn’t well documented. We have pretty complete documentation for 1.x, and in many cases, the mechanics/how-to guides would apply to 2.0. However, there were several differences in functionality or implementation that we felt necessary to explain in order for new users to get started with Sensu 2.0. Writing documentation on how to use something is also a great way to ensure that you’ve built a feature correctly - i.e., does this do what we said it was supposed to do?
  28. We didn’t have much guidance when we started out writing documentation for our Alpha release. We had a deadline and needed to get something workable. So we wrote a bunch of markdown docs and chucked them in a github repo.
  29. Repos are pretty great for collaboration and iterative development, but not necessarily for navigating information, or formatting docs in a way that was easily accessible. Our docs weren’t bad, but they weren’t user focused - we had way too many examples that didn’t fit well together, and we spent too much time going into detail about how our APIs worked and not enough time on how to use the product. They also centered around our alpha program, and not how to use the software.
  30. So we decided to polish them for our Beta release. Make it work and then make it pretty, right? While we were hard at work on V2, a revamped docs site was also underway. The main goal for the new docs site was introducing the ability to search, better information organization, and ease of new doc contributions (static site generator using markdown). My colleague and I tag-teamed for 3 weeks on the Beta V2 docs after discussing the pain points we saw in using them.
  31. We started by writing a doc and a template for how to write docs! Our template consisted of a basic guide for how to write how-to guides, and a template for reference (api) documentation to fill in the blanks of where guides leave off. Guides introduce a feature by explaining what it is, a use case, and then some simple and clear instructions for how to implement that use case.
  32. And this is what our docs site looks like now. A guide isn’t meant to be complete - it is intended to show how to get up and running quickly with a feature. For more in-depth explanations of our features within Sensu, we have an API reference; it lists how a feature works, and what its default attributes are.
  33. We’ve released everything to the wild! Hooray! But we’re not done yet.
  34. Since we can’t drink our own champagne, we rely on our community’s experience to drive the work that we’re doing. This takes the form of: the Accelerated Feedback Program (work with product and engineering); test days (a new program we’re trying out to introduce features and get community experiences and feedback); bug reports (please tell us if something is broken!); experience reports (we often hear if something went poorly, but we want to hear what worked, too!); community slack chat (we try to have engineering discussion and decisions out in the open via slack and design proposals in a public github repo); and feature requests (what should Sensu do that it’s not currently doing? Is this useful to other folks?).
  35. Want to learn more? Check out the project, our list of issues, and our community slack!