Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Winston: Helping Netflix Engineers Sleep at Night
Our journey… assisting engineers reduce operational load and MTTR
InfoQ.com: News & Community Site
• Over 1,000,000 software developers, architects and CTOs read the site world-
wide every...
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practi...
On-Call !
Sayli Karmarkar
Senior Software Engineer
Diagnostics and Remediation Engineering (DaRE)
skarmarkar@netflix.com
@HikerTechy...
Engineer
Wakes up
PagerDuty
Alert
2:02 AM
Logs in
and ACK
2:07 AM
Checks
runbook
2:15 AM
Studies
the alert
2:10 AM
Fixes/M...
Works, but does it scale?!
Scale and Growth Availability
MTTR
Productivity
Traditional On-Call Pain Points
Solution?
Automate
● Removing False Positives
● Collecting Diagnostic Information
● Mitigating the problem to reduce impac...
Unique Problem? Not really ..
Define
● Business Goals
● Use-cases
● Customers
● Constraints
● Interactions with other services
Winston’s Goals
● Assist engineers in reducing MTTR and pager fatigue by
providing a platform to automate their runbooks
●...
What is Winston?
Winston is an event driven runbook automation
platform. It is designed to host and execute runbooks in
re...
Engineer
Wakes up
Logs in
and ACK
Checks
runbook
Studies
the alert
Fixes the
problem
Runs
diagnostics
PagerDuty
Alert
2:02...
False
Positive
Winston
2:00 AM
2:05 AM
2:05 AM
2:15 AM
Assisted
Diagnostics
Mitigates the
problem
On-Call With Winston
Evaluation - Build / Reuse / Buy
Stackstorm
+
● A generic pluggable Event-Driven Automation Platform
● Designed with availability and reliability in mind
●...
+
Inbound Integrations
(through SQS)
+
Outbound Integrations
Bolt ...
As a Service (High Availability and Reliability )
Go...
A closer look at a Winston Instance
V1.0 Winston HA Deployment
Challenges
● Added cognitive load resulting in less adoption
● How to help engineers choose operational
efficiency over ne...
● One stop portal for all things Winston
● Supports Create, Read, Update, Delete, Execute and Diagnose
functionality
● Imp...
Winston Studio
Runbook
View
Executions
Execution
Details
Current Winston Deployment
Use-case Patterns
Sample Use-cases
False Positives
● Broker reporting offline when AWS maintenance takes
down an instance
● Cassandra ring h...
The Road Ahead
● Adoption / Usability
○ Find common operational use-cases and allow them to be re-used
○ Improve discovera...
Key Takeaways
● Don’t re-invent the wheel
● Start simple and iterate. Have some room for experimentation.
● Start with use...
Resources
● Winston Tech Blog link
http://techblog.netflix.com/2016/08/introducing-winston-event-driven.html
● Stackstorm ...
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
netflix-winston
Winston: Helping Netflix Engineers Sleep at Night
Winston: Helping Netflix Engineers Sleep at Night
Winston: Helping Netflix Engineers Sleep at Night
Winston: Helping Netflix Engineers Sleep at Night
Upcoming SlideShare
Loading in …5
×

Winston: Helping Netflix Engineers Sleep at Night

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2lLZ7M8.

Sayli Karmarkar discusses Winston, a monitoring and remediation platform built for Netflix engineers. Winston lets engineers go beyond the current DevOps model of monitoring the system and getting alerted when a failure condition occurs. He talks about the types of use cases that this type of platform targets, adoption challenges and how the architecture of the platform evolved over time. Filmed at qconsf.com.

Sayli Karmarkar is a software engineer with 8+ years of experience in developing platforms and services to configure and manage thousands of systems in an enterprise. At Netflix, she has been focusing on building platforms to help developers in diagnosing and remediating their operational failures automatically in response to events.

Related Books

Free with a 30 day trial from Scribd

See all

Related Audiobooks

Free with a 30 day trial from Scribd

See all
  • Be the first to comment

Winston: Helping Netflix Engineers Sleep at Night

  1. 1. Winston: Helping Netflix Engineers Sleep at Night Our journey… assisting engineers reduce operational load and MTTR
  2. 2. InfoQ.com: News & Community Site • Over 1,000,000 software developers, architects and CTOs read the site world- wide every month • 250,000 senior developers subscribe to our weekly newsletter • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture and The Engineering Culture Podcast, with a focus on building • 96 deep dives on innovative topics packed as downloadable emags and minibooks • Over 40 new content items per week Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ netflix-winston
  3. 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon San Francisco www.qconsf.com
  4. 4. On-Call !
  5. 5. Sayli Karmarkar Senior Software Engineer Diagnostics and Remediation Engineering (DaRE) skarmarkar@netflix.com @HikerTechy https://www.linkedin.com/in/saylikarmarkar DaRE Team’s Focus Build platforms, tools and libraries to help teams reduce MTTR for operational issues.
  6. 6. Engineer Wakes up PagerDuty Alert 2:02 AM Logs in and ACK 2:07 AM Checks runbook 2:15 AM Studies the alert 2:10 AM Fixes/Mitigates the problem 2:30 AM Runs diagnostics 2:20 AM2:00 AM Traditional On-Call Timeline
  7. 7. Works, but does it scale?! Scale and Growth Availability
  8. 8. MTTR Productivity Traditional On-Call Pain Points
  9. 9. Solution? Automate ● Removing False Positives ● Collecting Diagnostic Information ● Mitigating the problem to reduce impact on the customers Hands-free Feed the runbooks to an event-driven automation platform and have them executed in response to operational events
  10. 10. Unique Problem? Not really ..
  11. 11. Define ● Business Goals ● Use-cases ● Customers ● Constraints ● Interactions with other services
  12. 12. Winston’s Goals ● Assist engineers in reducing MTTR and pager fatigue by providing a platform to automate their runbooks ● Provide an easy way to connect automated runbooks to an event ● Let engineers focus on the business logic of runbooks rather than infrastructure aka PaaS. ● Provide appropriate wrappers and libraries to interact with other services ● Ensure best practices for automations and runbook lifecycle management
  13. 13. What is Winston? Winston is an event driven runbook automation platform. It is designed to host and execute runbooks in response to operational events.
  14. 14. Engineer Wakes up Logs in and ACK Checks runbook Studies the alert Fixes the problem Runs diagnostics PagerDuty Alert 2:02 AM 2:07 AM 2:15 AM2:10 AM 2:30 AM2:20 AM2:00 AM Traditional On-Call Timeline
  15. 15. False Positive Winston 2:00 AM 2:05 AM 2:05 AM 2:15 AM Assisted Diagnostics Mitigates the problem On-Call With Winston
  16. 16. Evaluation - Build / Reuse / Buy
  17. 17. Stackstorm + ● A generic pluggable Event-Driven Automation Platform ● Designed with availability and reliability in mind ● Open source + Code following good design practices ● Good alignment with respect to goals and future direction - ● High availability and reliability not exercised a lot ● Dependency on MongoDB and RabbitMQ ● No easy way of adding and updating automation
  18. 18. + Inbound Integrations (through SQS) + Outbound Integrations Bolt ... As a Service (High Availability and Reliability ) Good Starting Point .. Iterate and Evaluate Regularly
  19. 19. A closer look at a Winston Instance
  20. 20. V1.0 Winston HA Deployment
  21. 21. Challenges ● Added cognitive load resulting in less adoption ● How to help engineers choose operational efficiency over new features? ● Recommended and safe automation and lifecycle management practices are often not followed ● Simple use-cases are not trivial to on-board
  22. 22. ● One stop portal for all things Winston ● Supports Create, Read, Update, Delete, Execute and Diagnose functionality ● Implements best practises ○ Compliance/Auditing ○ Persistence ○ Security (Authentication/Authorization) ● Self serve & scalable Winston Studio
  23. 23. Winston Studio
  24. 24. Runbook View
  25. 25. Executions
  26. 26. Execution Details
  27. 27. Current Winston Deployment
  28. 28. Use-case Patterns
  29. 29. Sample Use-cases False Positives ● Broker reporting offline when AWS maintenance takes down an instance ● Cassandra ring health Diagnostics - Correlation could point towards the root cause ● Checking current maintenance jobs running on a cluster when an issue occurs ● Querying dependencies upstream and downstream for anomalous behavior ● Capture current system state and logs to analyze failures and reach the root cause quicker Mitigation ● Restart kafka process ● Clean up disk space
  30. 30. The Road Ahead ● Adoption / Usability ○ Find common operational use-cases and allow them to be re-used ○ Improve discoverability of Winston by integrating into existing alerting systems ○ Polyglot support (Groovy based runbooks) ● Safety ○ Resource isolation using containers ○ Rate limiting capability ● Stronger analytics ○ Provide aggregate visualization of runbook executions
  31. 31. Key Takeaways ● Don’t re-invent the wheel ● Start simple and iterate. Have some room for experimentation. ● Start with use-cases where there is more pain and less control over the source of the problem ● Pay special attention to usability of your product ● Push for changing the culture -- Usage will follow ● Find sponsors for features that are much more involved ● Implement best practices through carefully designed user interface instead of documentation. ● Discourage anti-patterns that focus on long term mitigation rather than fixing the root-cause ● Talk to us and others to share insights and learnings
  32. 32. Resources ● Winston Tech Blog link http://techblog.netflix.com/2016/08/introducing-winston-event-driven.html ● Stackstorm documentation https://docs.stackstorm.com/ ● Reach out skarmarkar@netflix.com @HikerTechy We are hiring Senior Software Engineer - https://jobs.netflix.com/jobs/860752
  33. 33. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ netflix-winston

×