The document summarizes Netflix's approach to automating operational tasks to reduce the burden on on-call engineers. It describes how Netflix built the Winston platform using tools like Stackstorm to create event-driven automation of common issues like replacing unhealthy Cassandra instances. This automation handles alerts, applies diagnostic rules and runbooks, and can auto-remediate issues. The benefits are reduced mean time to recovery, increased safety by reducing human errors, capturing operational knowledge, increased productivity for engineers, and improved morale by reducing pager fatigue.
Netflix Winston meetup presentation 2015-11-18
9. Healthcheck Script
[Flowchart; the script runs every 30 min]
Checks: Disappearing instance? / Is the C* ring healthy? / Are all instances healthy? / Can we fix automatically? / First failure? / Is there an offline maintenance?
Actions: Launch new instance / Replace bad instance / Sleep for X minutes and retry
11. How Did The Healthcheck Script Handle It
[Same flowchart as slide 9: Healthcheck Script]
14. On-call, Without Automation
[Timeline with marks at 2:00 AM, 2:02 AM, 2:07 AM, 2:10 AM, 2:15 AM, 2:20 AM and 2:30 AM]
PagerDuty Alert
Engineer wakes up
Logs in and ACK
Studies the alert
Checks runbook
Runs diagnostics
Fixes the problem
17. Event-driven Automation Platform
Failure / Alert Automation
Automation using Building Blocks
Integrations with Netflix Ecosystem
Platform as a Service
23. On-call, Without Automation
[Same timeline as slide 14]
29. Impact
● Product
○ Reduced MTTR (Mean Time To Recover)
○ Safety - Reduce risk of human errors
○ Capture operational knowledge as code
● People
○ Reduced pager fatigue for developers
○ Increase in productivity
○ Morale
30. Stackstorm Docs - http://docs.stackstorm.com/
Stackstorm Slack Channel - https://stackstorm-community.slack.com/
Netflix OpenSource: https://netflix.github.io/
Check out our https://jobs.netflix.com page for current openings
We focus on providing a common automation platform for Netflix Teams.
Who runs a service on AWS?
Amazon EC2 is hosted in multiple locations world-wide. These locations are composed of regions and Availability Zones. Each region is a separate geographic area. Each region has multiple, isolated locations known as Availability Zones.
Why Re:Boot? A Xen security issue: AWS had to reboot a lot of instances in all the Availability Zones.
Why is it a big deal? For stateless services, it’s not. But for stateful services it is, C* (Cassandra) for example.
Missing the 50M Party in L.A.
Denial: That can’t be true …
Anger: Yep, it’s confirmed
Bargaining: Tried convincing AWS to delay
Depression: They said no. Risk is too high.
Acceptance: What now …
Actually it’s easy to accept, because of the Simian Army.
-
Anyone heard about the Simian Army?
The Simian Army is a suite of tools for keeping your cloud operating in top form.
Janitor Monkey, Security Monkey, Coffee Monkey
Chaos Monkey, the first member, is a resiliency tool that helps ensure that your applications can tolerate random instance failures
Netflix EMBRACES chaos. We love it so much that we generate it. In PROD.
We run it on most Netflix services, and even on C*
CDE has Chaos Monkey enabled on our C* clusters
Maximum 1 node per day, during business hours
Cassandra Team Health Check system detects the missing instance and replaces it
Going back to our stages of grief, this made acceptance easier.
We test for this
Our automation can take it.
What did our stack look like at the time?
Bunch of Python/Shell scripts
Jenkins as job scheduler (healthchecks, node replacements, repairs, upgrades, etc.)
On C* nodes: C* + Priam
Atlas is already a very powerful metrics and alerting tool, and our metric systems add non-C* related metrics (app metrics, for example) that help in correlation.
- Atlas is a very powerful time series metrics and alerting tool
- Atlas is Open Source
Simplified view of Healthcheck flow
Assisted Diagnostics
Auto-Remediation
Auto-Remediation supported:
Disappearing instances
Replace instance with bad I/O
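A minimal Python sketch of the simplified healthcheck flow, not the actual CDE script. The `Instance` fields, the action names and the rule "retry on the first failure, replace on the second" are assumptions drawn from the flowchart labels (disappearing instance, first failure, sleep and retry, replace bad instance).

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Instance:
    """Hypothetical view of one C* node as seen by the healthcheck."""
    instance_id: str
    present: bool   # False when the instance has disappeared
    healthy: bool   # False when, for example, disk I/O is bad


def healthcheck_pass(instances: List[Instance], failed_before: Set[str]) -> Dict[str, str]:
    """Run one healthcheck pass and return the action chosen per instance.

    failed_before holds the ids that were already unhealthy on the previous
    pass, so a first failure is retried before anything is replaced.
    """
    actions: Dict[str, str] = {}
    for inst in instances:
        if not inst.present:
            # Disappearing instance: launch a new one.
            actions[inst.instance_id] = "launch_new_instance"
        elif inst.healthy:
            failed_before.discard(inst.instance_id)
            actions[inst.instance_id] = "none"
        elif inst.instance_id not in failed_before:
            # First failure: may be transient, so wait and check again.
            failed_before.add(inst.instance_id)
            actions[inst.instance_id] = "sleep_and_retry"
        else:
            # Still unhealthy on the next pass: replace the bad instance.
            actions[inst.instance_id] = "replace_bad_instance"
    return actions
```

Calling `healthcheck_pass` twice with the same `failed_before` set shows the intent: an unhealthy node is retried on the first pass and replaced on the second, while a missing node is relaunched straight away.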
How did the healthcheck behave during Re:Boot?
Two behaviours:
Instance rebooting:
False positive (transient issue)
Instance rebooting, but failing AWS healthcheck and being terminated:
Auto-remediation
218 C* nodes rebooted
22 nodes didn’t start and were automatically terminated by AWS internal healthcheck
Our healthcheck identified the missing nodes and automatically remediated the issue
0 downtime
L.A. Party was awesome
Take the learnings from CDE, abstract them and see how to apply them to other teams
Increase in scope - How can we maximize impact?
So our main focus was to apply the learnings:
False Positive, Assisted Diagnostics, Auto-Remediation
Help on-call engineers sleep at night (improve on-call automation)
Why is it such a big deal?
First, you need to understand the DevOps Model at Netflix
Everyone is on-call at Netflix; every team manages its own service
This means a lot of on-call engineers doing on-call operations.
So what does it look like to be on-call when there is no or limited automation?
On-call before Winston. Long MTTR (Mean Time to Recover).
Operational knowledge in documents - hard to maintain
Risk of human errors
Pager fatigue - Morale
High MTTR (Mean Time To Recover)
Impact on productivity
JS --
Hand it over to Sayli, who will cover how the new system helps alleviate these pain points …
Sayli --
To reduce the pain points that we were facing, we started thinking of a new approach.
A quality of a good engineer: learn not only from failures but from successes as well, like our AWS reboot success story.
We survived the reboot using a system which automatically diagnosed and fixed a known failure scenario.
The problem was that it wasn’t designed to extend to other failure scenarios or to be used by other teams.
Proactive automation.
The idea of reactive automatic troubleshooting and remediation is highly useful, especially in operations.
With this expanded charter for our team, we focused on what the key features would be for a system that solves these problems for us, and the answer was event-driven automation.
2. Instead of autonomous systems, the ability to share building blocks within a single service or even across multiple services
3. e.g. the sophisticated telemetry system Atlas, the CI platform (Spinnaker), Jenkins
4. Last but not least, service owners can focus on the automation and not the platform - make it self-serve
Problem space not unique to Netflix
We started working on an initial design of our own in-house, internal POC
Looked at Facebook (FBAR) / LinkedIn (Nurse) / Dropbox (Naoru). This helped us see how they approached the problem
Also came across this meetup group .. 400 auto remediators
Now that we knew WHAT the requirements were, we worked on figuring out HOW.
Evaluated building a platform from scratch, adopting an existing solution, or mix and match - using some existing components and building some.
After the POC: Stackstorm.
Stackstorm: a platform for integration and automation across services and tools
The use cases that Stackstorm was targeting (facilitated troubleshooting and auto-remediation) fitted right into what we were looking for. Open source. Quality of the code. Great to collaborate and code with.
Great discussions with respect to their usecases, approach and adoption challenges. Helped us validate benefits.
Do our own or adopt existing solution?
We started with our own POC, then we decided to go with Stackstorm, an event-driven automation platform
Facilitated Troubleshooting/Event handling
Automated remediation
Stackstorm gave us an event-driven automation platform and building blocks. What about integration with the Netflix ecosystem?
Pulp fiction fans in the audience?
On-call before Winston. Long MTTR (Mean Time to Recover).
And now with Winston.
Winston gets the alert. Using its rule engine, it decides what the right action is. The action then analyses the issue and, if it’s identified as a False Positive, there is no need to page the on-call.
Another use case is that Winston identifies that it can fix the issue. When it does, again, no need to page the on-call.
The last use case, the one we want you to focus on, is Assisted Diagnostics. While the on-call is being paged, Winston runs a series of pre-defined diagnostics and prepares a report for the on-call, so that when they log in to the system they have comprehensive information, like the Discovery status, a list of recent exceptions or errors, or any other relevant context, to help them make a decision faster.
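The three outcomes described above (False Positive, Auto-Remediation, Assisted Diagnostics) can be sketched in a few lines of Python. This is an illustration only: `Alert`, `Rule`, `Outcome` and `handle_alert` are hypothetical names, and the sketch does not use the Stackstorm or Winston APIs.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Tuple


class Outcome(Enum):
    FALSE_POSITIVE = "false_positive"        # alert dropped, nobody paged
    AUTO_REMEDIATED = "auto_remediated"      # fixed automatically, nobody paged
    PAGED_WITH_REPORT = "paged_with_report"  # on-call paged, diagnostics attached


@dataclass
class Alert:
    name: str
    context: dict = field(default_factory=dict)


@dataclass
class Rule:
    """Hypothetical rule: which alert it matches and what to run for it."""
    alert_name: str
    is_false_positive: Callable[[Alert], bool]
    try_remediate: Callable[[Alert], bool]
    diagnostics: List[Callable[[Alert], str]]


def handle_alert(alert: Alert, rules: List[Rule]) -> Tuple[Outcome, List[str]]:
    """Pick the first matching rule and walk through the three outcomes."""
    for rule in rules:
        if rule.alert_name != alert.name:
            continue
        if rule.is_false_positive(alert):
            return Outcome.FALSE_POSITIVE, []
        if rule.try_remediate(alert):
            return Outcome.AUTO_REMEDIATED, []
        # Neither ignorable nor fixable: page, but attach a diagnostics report.
        report = [check(alert) for check in rule.diagnostics]
        return Outcome.PAGED_WITH_REPORT, report
    # No rule knows this alert: page the on-call with an empty report.
    return Outcome.PAGED_WITH_REPORT, []
```

Keeping the checks as plain callables on a rule is one way to let teams share building blocks; the real platform may structure this differently.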
Let’s look at some of the real-life scenarios
Anybody who doesn’t know what a runbook is?
A 'runbook' is a routine compilation of procedures and steps that a sysadmin or a person on-call goes through to diagnose and remediate a failure. Generally, runbooks have 3 broadly classified steps:
Real-life scenarios
Remove False Positive - expected scenario, can safely be ignored
Diagnostics - collect troubleshooting information
Remediation - fix the problem
Now let’s see some examples of how Winston can assist in these steps ...
First example is for False Positive: Data Pipeline Team, Broker Offline. But instance was terminated by AWS, so it’s expected that the broker is offline. Issue resolved. No need to Page on-call for that.
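The false-positive check in this example could look roughly like the sketch below. The function name and the `instance_states` mapping are hypothetical stand-ins for a real EC2 lookup; `shutting-down` and `terminated` are EC2 instance state names. Treating an instance that is missing from the lookup as terminated is an assumption of the sketch.

```python
from typing import Mapping


def broker_offline_is_false_positive(instance_id: str,
                                     instance_states: Mapping[str, str]) -> bool:
    """Return True when the broker is offline only because its instance is gone.

    instance_states maps instance id to EC2 state name. An id that is not in
    the mapping is treated as terminated (assumption for this sketch).
    """
    state = instance_states.get(instance_id, "terminated")
    return state in ("shutting-down", "terminated")
```

When this returns True the alert can be resolved without paging; when the instance is still running, the alert needs a real diagnosis.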
Another Assisted Diagnostics example for Cassandra: Disk Space issue
Gives context around the size of the actual C* data
Checks if there is any Repair or Compaction running which temporarily increases disk usage
Try some auto-remediation: Clean-up old snapshots
Still above disk usage threshold, Paging On-call
In this case, the on-call doesn’t have to try to clean up snapshots since it was already done by Winston, and can now focus on other unknown root causes. Faster TTR.
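The disk space steps above can be sketched as one function that gathers context, tries the snapshot cleanup and then decides whether to page. Everything here is illustrative: the function and parameter names are hypothetical, the 85% threshold is an assumed value, and the cleanup and usage readings are passed in as callables so the sketch stays independent of any real Cassandra tooling.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DiskReport:
    notes: List[str]     # context handed to the on-call engineer
    page_on_call: bool   # True when usage is still above the threshold


def diagnose_disk_usage(usage_pct: Callable[[], float],
                        data_size_gb: float,
                        repair_or_compaction_running: bool,
                        clear_old_snapshots: Callable[[], None],
                        threshold_pct: float = 85.0) -> DiskReport:
    """Collect context, try the snapshot cleanup, then decide whether to page."""
    notes = [f"C* data size: {data_size_gb:.1f} GB"]
    if repair_or_compaction_running:
        notes.append("Repair or compaction running; disk usage may be temporarily elevated")
    # Auto-remediation attempt: remove old snapshots before involving a human.
    clear_old_snapshots()
    notes.append("Old snapshots cleared")
    after = usage_pct()
    notes.append(f"Disk usage after cleanup: {after:.0f}%")
    return DiskReport(notes=notes, page_on_call=after > threshold_pct)
```

If `page_on_call` comes back True, the notes already tell the engineer that the snapshot cleanup was tried, which is the time saving the example describes.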
Last example: Auto-Remediation. For Data Pipeline team: Broker is offline again, like in the first example, but this time, the EC2 instance is still running. So it’s not a false positive.
Check if there is any disk failure
If not, tries to restart the Kafka broker.
Succeeded. Broker is back online. Resolved. Not paging on-call.
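A small Python sketch of this auto-remediation path, assuming the instance has already been confirmed as running. The three callables are hypothetical hooks (disk check, broker restart, broker status); none of them is a real Kafka or Winston API.

```python
from typing import Callable


def remediate_offline_broker(disk_failed: Callable[[], bool],
                             restart_broker: Callable[[], None],
                             broker_online: Callable[[], bool]) -> str:
    """Try to bring an offline broker back; return 'resolved' or 'page_on_call'."""
    if disk_failed():
        # A restart will not fix failing hardware, so hand over to a human.
        return "page_on_call"
    restart_broker()
    if broker_online():
        return "resolved"
    # The restart did not help: escalate.
    return "page_on_call"
```

The restart is attempted only after the disk check passes, matching the order of the steps above.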
Add resources for stackstorm, slack channel and happy hour
Stackstorm guys are here