SlideShare a Scribd company logo
Winston: Helping Netflix Engineers Sleep at Night
Our journey… assisting engineers reduce operational load and MTTR
InfoQ.com: News & Community Site
• Over 1,000,000 software developers, architects and CTOs read the site world-
wide every month
• 250,000 senior developers subscribe to our weekly newsletter
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• 2 dedicated podcast channels: The InfoQ Podcast, with a focus on
Architecture and The Engineering Culture Podcast, with a focus on building
• 96 deep dives on innovative topics packed as downloadable emags and
minibooks
• Over 40 new content items per week
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
netflix-winston
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon San Francisco
www.qconsf.com
On-Call !
Sayli Karmarkar
Senior Software Engineer
Diagnostics and Remediation Engineering (DaRE)
skarmarkar@netflix.com
@HikerTechy
https://www.linkedin.com/in/saylikarmarkar
DaRE Team’s Focus
Build platforms, tools and libraries
to help teams reduce MTTR for operational issues.
Engineer
Wakes up
PagerDuty
Alert
2:02 AM
Logs in
and ACK
2:07 AM
Checks
runbook
2:15 AM
Studies
the alert
2:10 AM
Fixes/Mitigates
the
problem
2:30 AM
Runs
diagnostics
2:20 AM2:00 AM
Traditional On-Call Timeline
Works, but does it scale?!
Scale and Growth Availability
MTTR
Productivity
Traditional On-Call Pain Points
Solution?
Automate
● Removing False Positives
● Collecting Diagnostic Information
● Mitigating the problem to reduce impact on
the customers
Hands-free
Feed the runbooks to an event-driven
automation platform and have them executed in
response to operational events
Unique Problem? Not really ..
Define
● Business Goals
● Use-cases
● Customers
● Constraints
● Interactions with other services
Winston’s Goals
● Assist engineers in reducing MTTR and pager fatigue by
providing a platform to automate their runbooks
● Provide an easy way to connect automated runbooks to an
event
● Let engineers focus on the business logic of runbooks rather
than infrastructure aka PaaS.
● Provide appropriate wrappers and libraries to interact with other
services
● Ensure best practices for automations and runbook lifecycle
management
What is Winston?
Winston is an event driven runbook automation
platform. It is designed to host and execute runbooks in
response to operational events.
Engineer
Wakes up
Logs in
and ACK
Checks
runbook
Studies
the alert
Fixes the
problem
Runs
diagnostics
PagerDuty
Alert
2:02 AM 2:07 AM 2:15 AM2:10 AM 2:30 AM2:20 AM2:00 AM
Traditional On-Call Timeline
False
Positive
Winston
2:00 AM
2:05 AM
2:05 AM
2:15 AM
Assisted
Diagnostics
Mitigates the
problem
On-Call With Winston
Evaluation - Build / Reuse / Buy
Stackstorm
+
● A generic pluggable Event-Driven Automation Platform
● Designed with availability and reliability in mind
● Open source + Code following good design practices
● Good alignment with respect to goals and future direction
-
● High availability and reliability not exercised a lot
● Dependency on MongoDB and RabbitMQ
● No easy way of adding and updating automation
+
Inbound Integrations
(through SQS)
+
Outbound Integrations
Bolt ...
As a Service (High Availability and Reliability )
Good Starting Point ..
Iterate and Evaluate Regularly
A closer look at a Winston Instance
V1.0 Winston HA Deployment
Challenges
● Added cognitive load resulting in less adoption
● How to help engineers choose operational
efficiency over new features?
● Recommended and safe automation and lifecycle
management practices are often not followed
● Simple use-cases are not trivial to on-board
● One stop portal for all things Winston
● Supports Create, Read, Update, Delete, Execute and Diagnose
functionality
● Implements best practises
○ Compliance/Auditing
○ Persistence
○ Security (Authentication/Authorization)
● Self serve & scalable
Winston Studio
Winston Studio
Runbook
View
Executions
Execution
Details
Current Winston Deployment
Use-case Patterns
Sample Use-cases
False Positives
● Broker reporting offline when AWS maintenance takes
down an instance
● Cassandra ring health
Diagnostics - Correlation could point towards the root cause
● Checking current maintenance jobs running on a cluster
when an issue occurs
● Querying dependencies upstream and downstream for
anomalous behavior
● Capture current system state and logs to analyze
failures and reach the root cause quicker
Mitigation
● Restart kafka process
● Clean up disk space
The Road Ahead
● Adoption / Usability
○ Find common operational use-cases and allow them to be re-used
○ Improve discoverability of Winston by integrating into existing alerting
systems
○ Polyglot support (Groovy based runbooks)
● Safety
○ Resource isolation using containers
○ Rate limiting capability
● Stronger analytics
○ Provide aggregate visualization of runbook executions
Key Takeaways
● Don’t re-invent the wheel
● Start simple and iterate. Have some room for experimentation.
● Start with use-cases where there is more pain and less control
over the source of the problem
● Pay special attention to usability of your product
● Push for changing the culture -- Usage will follow
● Find sponsors for features that are much more involved
● Implement best practices through carefully designed user interface
instead of documentation.
● Discourage anti-patterns that focus on long term mitigation rather
than fixing the root-cause
● Talk to us and others to share insights and learnings
Resources
● Winston Tech Blog link
http://techblog.netflix.com/2016/08/introducing-winston-event-driven.html
● Stackstorm documentation
https://docs.stackstorm.com/
● Reach out
skarmarkar@netflix.com
@HikerTechy
We are hiring
Senior Software Engineer - https://jobs.netflix.com/jobs/860752
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
netflix-winston

More Related Content

More from C4Media

Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
C4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
C4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
C4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
C4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
C4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
C4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
C4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
C4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
C4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
C4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
C4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
C4Media
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery Teams
C4Media
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in Adtech
C4Media
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/await
C4Media
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven Utopia
C4Media
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
C4Media
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?
C4Media
 
CockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseCockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL Database
C4Media
 
A Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinA Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with Brooklin
C4Media
 

More from C4Media (20)

Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery Teams
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in Adtech
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/await
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven Utopia
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?
 
CockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseCockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL Database
 
A Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinA Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with Brooklin
 

Recently uploaded

GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
ThomasParaiso2
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 

Recently uploaded (20)

GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 

Winston: Helping Netflix Engineers Sleep at Night

  • 1. Winston: Helping Netflix Engineers Sleep at Night Our journey… assisting engineers reduce operational load and MTTR
  • 2. InfoQ.com: News & Community Site • Over 1,000,000 software developers, architects and CTOs read the site world- wide every month • 250,000 senior developers subscribe to our weekly newsletter • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture and The Engineering Culture Podcast, with a focus on building • 96 deep dives on innovative topics packed as downloadable emags and minibooks • Over 40 new content items per week Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ netflix-winston
  • 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon San Francisco www.qconsf.com
  • 5. Sayli Karmarkar Senior Software Engineer Diagnostics and Remediation Engineering (DaRE) skarmarkar@netflix.com @HikerTechy https://www.linkedin.com/in/saylikarmarkar DaRE Team’s Focus Build platforms, tools and libraries to help teams reduce MTTR for operational issues.
  • 6. Engineer Wakes up PagerDuty Alert 2:02 AM Logs in and ACK 2:07 AM Checks runbook 2:15 AM Studies the alert 2:10 AM Fixes/Mitigates the problem 2:30 AM Runs diagnostics 2:20 AM2:00 AM Traditional On-Call Timeline
  • 7. Works, but does it scale?! Scale and Growth Availability
  • 8.
  • 10. Solution? Automate ● Removing False Positives ● Collecting Diagnostic Information ● Mitigating the problem to reduce impact on the customers Hands-free Feed the runbooks to an event-driven automation platform and have them executed in response to operational events
  • 11. Unique Problem? Not really ..
  • 12. Define ● Business Goals ● Use-cases ● Customers ● Constraints ● Interactions with other services
  • 13. Winston’s Goals ● Assist engineers in reducing MTTR and pager fatigue by providing a platform to automate their runbooks ● Provide an easy way to connect automated runbooks to an event ● Let engineers focus on the business logic of runbooks rather than infrastructure aka PaaS. ● Provide appropriate wrappers and libraries to interact with other services ● Ensure best practices for automations and runbook lifecycle management
  • 14. What is Winston? Winston is an event driven runbook automation platform. It is designed to host and execute runbooks in response to operational events.
  • 15. Engineer Wakes up Logs in and ACK Checks runbook Studies the alert Fixes the problem Runs diagnostics PagerDuty Alert 2:02 AM 2:07 AM 2:15 AM2:10 AM 2:30 AM2:20 AM2:00 AM Traditional On-Call Timeline
  • 16. False Positive Winston 2:00 AM 2:05 AM 2:05 AM 2:15 AM Assisted Diagnostics Mitigates the problem On-Call With Winston
  • 17. Evaluation - Build / Reuse / Buy
  • 18. Stackstorm + ● A generic pluggable Event-Driven Automation Platform ● Designed with availability and reliability in mind ● Open source + Code following good design practices ● Good alignment with respect to goals and future direction - ● High availability and reliability not exercised a lot ● Dependency on MongoDB and RabbitMQ ● No easy way of adding and updating automation
  • 19. + Inbound Integrations (through SQS) + Outbound Integrations Bolt ... As a Service (High Availability and Reliability ) Good Starting Point .. Iterate and Evaluate Regularly
  • 20. A closer look at a Winston Instance
  • 21. V1.0 Winston HA Deployment
  • 22. Challenges ● Added cognitive load resulting in less adoption ● How to help engineers choose operational efficiency over new features? ● Recommended and safe automation and lifecycle management practices are often not followed ● Simple use-cases are not trivial to on-board
  • 23. ● One stop portal for all things Winston ● Supports Create, Read, Update, Delete, Execute and Diagnose functionality ● Implements best practises ○ Compliance/Auditing ○ Persistence ○ Security (Authentication/Authorization) ● Self serve & scalable Winston Studio
  • 30. Sample Use-cases False Positives ● Broker reporting offline when AWS maintenance takes down an instance ● Cassandra ring health Diagnostics - Correlation could point towards the root cause ● Checking current maintenance jobs running on a cluster when an issue occurs ● Querying dependencies upstream and downstream for anomalous behavior ● Capture current system state and logs to analyze failures and reach the root cause quicker Mitigation ● Restart kafka process ● Clean up disk space
  • 31.
  • 32.
  • 33.
  • 34. The Road Ahead ● Adoption / Usability ○ Find common operational use-cases and allow them to be re-used ○ Improve discoverability of Winston by integrating into existing alerting systems ○ Polyglot support (Groovy based runbooks) ● Safety ○ Resource isolation using containers ○ Rate limiting capability ● Stronger analytics ○ Provide aggregate visualization of runbook executions
  • 35. Key Takeaways ● Don’t re-invent the wheel ● Start simple and iterate. Have some room for experimentation. ● Start with use-cases where there is more pain and less control over the source of the problem ● Pay special attention to usability of your product ● Push for changing the culture -- Usage will follow ● Find sponsors for features that are much more involved ● Implement best practices through carefully designed user interface instead of documentation. ● Discourage anti-patterns that focus on long term mitigation rather than fixing the root-cause ● Talk to us and others to share insights and learnings
  • 36. Resources ● Winston Tech Blog link http://techblog.netflix.com/2016/08/introducing-winston-event-driven.html ● Stackstorm documentation https://docs.stackstorm.com/ ● Reach out skarmarkar@netflix.com @HikerTechy We are hiring Senior Software Engineer - https://jobs.netflix.com/jobs/860752
  • 37. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ netflix-winston