Kevin Li, SVP of Engineering
2018.11.07
Solving Reliability Fears
with SRE at 17 Media
About 17 Media
#1 in Live Streaming
10 million downloads
across Asia - Japan,
Hong Kong, and Taiwan
Why SRE is important at 17 Media
As a fast-growing app startup, we need SRE
practices to keep our product reliable and
robust and our team agile
SRE at 17 Media
Guarantees the service availability
Minimizes release risk and
enables our team to iterate fast
Consists of 5 people, while we have
around 70 engineers.
Our Practices
Define expectations (Error budget)
Make sure reliability and
innovation are balanced
The hardest thing is to get
organizational buy-in
Our expectations currently
99.95% availability: around 1 hour of downtime per quarter
What's unacceptable:
● Data corruption
What’s acceptable:
● A little more downtime during non-peak hours
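The "around 1 hour per quarter" figure follows directly from the 99.95% target. A minimal sketch of the arithmetic, assuming a 91-day quarter (the function names here are illustrative, not from the talk):

```python
def downtime_budget_minutes(availability_pct: float, period_days: float) -> float:
    """Minutes of downtime allowed in the period at the given availability target."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - availability_pct / 100)


def budget_remaining_minutes(availability_pct: float, period_days: float,
                             downtime_so_far_minutes: float) -> float:
    """Error budget left after subtracting the downtime already spent."""
    return downtime_budget_minutes(availability_pct, period_days) - downtime_so_far_minutes


# Assumption: a quarter is taken as 91 days.
budget = downtime_budget_minutes(99.95, 91)
print(f"{budget:.1f} minutes per quarter")  # prints "65.5 minutes per quarter"
```

A negative result from budget_remaining_minutes would mean the budget for the period has been overspent.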
Monitors, alerts and on-call policy
Latency,
Traffic,
Errors and Exceptions,
Saturation
For every component,
you’ll need to
monitor…
And set up the alerts
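One way to picture per-component alerting on these four signals is a simple threshold check. This is only a sketch: the field names, the choice of p99 latency, and every threshold value are assumptions for illustration, not 17 Media's actual configuration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Signals:
    latency_p99_ms: float       # latency
    requests_per_second: float  # traffic
    error_rate: float           # errors and exceptions, as a fraction of requests
    saturation: float           # fraction of capacity in use (0.0 to 1.0)


@dataclass
class Thresholds:
    max_latency_p99_ms: float
    min_requests_per_second: float  # an unexpected traffic drop is treated as an alert
    max_error_rate: float
    max_saturation: float


def breached_signals(s: Signals, t: Thresholds) -> List[str]:
    """Return the names of the signals that are outside their thresholds."""
    alerts = []
    if s.latency_p99_ms > t.max_latency_p99_ms:
        alerts.append("latency")
    if s.requests_per_second < t.min_requests_per_second:
        alerts.append("traffic")
    if s.error_rate > t.max_error_rate:
        alerts.append("errors")
    if s.saturation > t.max_saturation:
        alerts.append("saturation")
    return alerts


# Hypothetical thresholds and readings for one component.
limits = Thresholds(max_latency_p99_ms=500, min_requests_per_second=10,
                    max_error_rate=0.01, max_saturation=0.8)
reading = Signals(latency_p99_ms=750, requests_per_second=120,
                  error_rate=0.002, saturation=0.93)
print(breached_signals(reading, limits))  # ['latency', 'saturation']
```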
Monitors, alerts and on-call policy
For every important alert,
the system first calls the on-call engineer.
If no reply, it calls the next one…
If still no reply, it calls everyone
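The escalation order above can be sketched as a small function. The call parameter is a hypothetical hook standing in for whatever paging system is used; it is assumed to return True when the person acknowledges.

```python
from typing import Callable, Optional, Sequence


def escalate(chain: Sequence[str], everyone: Sequence[str],
             call: Callable[[str], bool]) -> Optional[str]:
    """Call people in chain one at a time; if nobody replies, call everyone.

    Returns the first person who acknowledged, or None if nobody did.
    """
    for person in chain:
        if call(person):
            return person
    # Nobody in the chain replied: call every person, without stopping early.
    acknowledged = [person for person in everyone if call(person)]
    return acknowledged[0] if acknowledged else None


# Hypothetical rotation in which only "carol" picks up.
team = ["alice", "bob", "carol", "dave"]
responder = escalate(chain=["alice", "bob"], everyone=team,
                     call=lambda person: person == "carol")
print(responder)  # carol
```

A real setup would also need timeouts between steps; they are left out here to keep the ordering visible.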
Maximize dev speed while minimizing risk
Implement continuous integration
Automate the release process
Release frequently (every day), in the morning
Canary release
Shorten the time to deploy
Rollback effortlessly
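The slides do not say how a canary is judged, so the following is only one plausible sketch: compare the canary's error rate with the baseline's and roll back if it is clearly worse. The ratio, floor, and minimum-traffic values are assumptions chosen for illustration.

```python
def canary_decision(baseline_errors: int, baseline_requests: int,
                    canary_errors: int, canary_requests: int,
                    max_ratio: float = 2.0, floor: float = 0.001,
                    min_requests: int = 100) -> str:
    """Return "wait", "promote", or "rollback" for a canary release.

    The canary is promoted when its error rate stays within max_ratio times
    the baseline error rate (never stricter than floor).
    """
    if canary_requests < min_requests:
        return "wait"  # not enough canary traffic to judge yet
    baseline_rate = baseline_errors / baseline_requests if baseline_requests else 0.0
    canary_rate = canary_errors / canary_requests
    allowed = max(baseline_rate * max_ratio, floor)
    return "promote" if canary_rate <= allowed else "rollback"


print(canary_decision(10, 10_000, 1, 1_000))   # promote
print(canary_decision(10, 10_000, 50, 1_000))  # rollback
print(canary_decision(10, 10_000, 0, 10))      # wait
```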
Learn from failures
Blameless Postmortem
Post-mortems should be blameless
Gather all facts
Create the timeline
Focus on fixing the process
Action items to improve the process
Postmortem Review Example
Root Cause Analysis
Timeline (Facts)
Action Items
Track, classify, and review outages
Classified as P0 for causing more than $X
in lost revenue
Classified as P1 for causing some revenue loss
Classified as P2 for not causing any revenue loss
Aggregate and classify outages
Conduct monthly or quarterly analysis
Regularly share the lessons learned
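The P0/P1/P2 rule and the periodic aggregation can be sketched as below. The slide leaves $X unspecified, so the threshold is a required parameter, and the figure used in the example is a made-up placeholder, as are the outage amounts.

```python
from collections import Counter
from typing import Iterable


def classify_outage(revenue_lost: float, p0_threshold: float) -> str:
    """Classify an outage by revenue impact, following the slide's three levels."""
    if revenue_lost > p0_threshold:
        return "P0"  # more than $X in lost revenue
    if revenue_lost > 0:
        return "P1"  # some revenue loss
    return "P2"      # no revenue loss


def summarize_outages(losses: Iterable[float], p0_threshold: float) -> Counter:
    """Count outages per severity for a monthly or quarterly review."""
    return Counter(classify_outage(loss, p0_threshold) for loss in losses)


# Placeholder threshold and losses, for illustration only.
summary = summarize_outages([0, 250.0, 0, 12_000.0], p0_threshold=10_000.0)
print(dict(summary))  # {'P2': 2, 'P1': 1, 'P0': 1}
```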
Recap
Define error budget
Monitors, alerts and on-call policy
Maximize dev speed and minimize risk
Learn from failures
Conclusion
Adopting SRE practices not only ensures the
availability of our service, but also enables our
tech team to develop and release faster.
Questions?
