Site Reliability Engineering
Presenter Name: Keet Malin Sugathadasa
Designation: Associate Technical Lead
Presented By
Keet Malin Sugathadasa
Associate Tech Lead at Cognite
More than 3 years of experience in
various roles related to Software
Engineering
Contributor to npm and
Stack Overflow
Research interests: Cyber
Security, Cloud Computing,
Distributed Computing.
AGENDA
• What is Site Reliability Engineering (SRE)
• The 5 Pillars of SRE
• SLOs, SLIs, SLAs
• Error Budgets
• Toil
• Ensuring Successful operations of a
production system
What is DevOps
Just as Agile closed the gap between business analysts (BA)
and Dev, DevOps closes the gap between Dev and Ops
What is SRE?
• DevOps is a community-built set of practices, a culture;
• while SRE was developed inside Google as a secret sauce.
Reduce Organizational Silos
• SRE teams share ownership of production with
developers
• SRE teams get involved in development at very early
stages
• But products may not start with SRE support.
When a product is onboarded, the following items are checked:
• System architecture and interservice dependencies
• Instrumentation, metrics, and monitoring
• Emergency response
• Capacity planning
• Change management
• Performance: availability, latency, and efficiency
Reduce Silos
Accept Failure as Normal
Blameless Postmortems
• When things have actually gone bazooka,
whose fault is it?
• Answer: Nobody’s. It's the system’s fault.
It allowed people to act that way!
• Ask WHY not WHO!
If nobody is blamed, people open up, and
then the root cause cascade opens up.
Agility[Devs] vs Stability[Ops]
• What is availability?
• Clear definitions
• How available do you want to be?
• Clear numerical indicators
• What to do when availability is
not met?
SLI - SLO - SLA : Service Level what?
Service Level Indicator: A metric aggregated over time (e.g., 90th percentile, median)
• Batch throughput
• Failures per request
• Is the ratio of errors to the total number of requests received in the last 5 minutes < 1%?
• Request latency
• Is the average latency of requests in the last 5 minutes < 300ms?
• Is the 90th percentile of request latency in the last 5 minutes < 300ms?
Service Level Objective: The target the SLI needs to meet
• Is the answer to the indicator above YES 99.9% of the time?
• Monitor the SLIs over a long time and decide this
Service Level Agreement: A legal agreement
• The level of reliability I promise, and what I will do if I do not deliver it
• Usually based on SLOs, but it is a business agreement
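The SLI and SLO questions above can be sketched in a few lines of Python. This is a minimal illustration, not a monitoring implementation: the `Request` record, the function names, and the nearest-rank percentile method are assumptions made for the example, while the thresholds (1% errors, 300ms, 99.9%) are the ones from the slide. Each window stands for one 5-minute batch of requests and is assumed to be non-empty.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical per-request record; real systems would read this from metrics.
    latency_ms: float
    ok: bool


def error_ratio(window):
    """SLI: share of failed requests in one window."""
    return sum(1 for r in window if not r.ok) / len(window)


def percentile_latency(window, pct=90):
    """SLI: nearest-rank percentile of latency in one window."""
    latencies = sorted(r.latency_ms for r in window)
    rank = (pct * len(latencies) + 99) // 100  # integer ceiling of pct% of n
    return latencies[max(rank, 1) - 1]


def window_is_good(window, max_error_ratio=0.01, max_p90_ms=300):
    """Answer the slide's yes/no questions for a single window."""
    return (error_ratio(window) < max_error_ratio
            and percentile_latency(window, 90) < max_p90_ms)


def slo_met(windows, target=0.999):
    """SLO: was the indicator YES for at least `target` of all windows?"""
    good = sum(1 for w in windows if window_is_good(w))
    return good / len(windows) >= target
```

The SLI is the measurement, the SLO is the target applied to it over a long period, and an SLA would add the business consequences on top, which is why it does not appear in the code.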
Risk and availability
• 100% availability is impossible.
• Each 9 you add to the SLO
increases your cost
• With each 9 you add, you lose
some comfort
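To make "each 9" concrete, the sketch below converts a number of nines into the downtime it allows per year. It assumes a 365-day year and a hypothetical helper name; it says nothing about cost itself, only about how quickly the margin shrinks.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(nines):
    """Yearly downtime allowed by an availability with this many nines."""
    unavailability = 10 ** -nines  # 2 nines -> 0.01, 3 nines -> 0.001, ...
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for nines in range(1, 6):
        availability = 100 * (1 - 10 ** -nines)
        minutes = allowed_downtime_minutes(nines)
        print(f"{availability:.3f}% -> {minutes:.1f} minutes per year")
```

Every added nine divides the allowed downtime by ten: three nines leaves roughly 525.6 minutes a year, five nines only about 5.3.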
Error Budgets
• Once you decide the SLO, you get X minutes during which the service may be unavailable.
• X is your Error Budget
• If you reach that budget, you cannot release new features anymore
• Both under- and overspending are bad.
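A minimal sketch of the error budget arithmetic, assuming a 30-day window and hypothetical function names. The release rule is the one stated above: once the budget is used up, feature releases stop.

```python
def error_budget_minutes(slo, window_days=30):
    """X: minutes of unavailability the SLO allows in the window."""
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo, downtime_minutes, window_days=30):
    """Budget left after the downtime already spent in this window."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


def can_release(slo, downtime_minutes, window_days=30):
    """Feature releases are allowed only while some budget remains."""
    return budget_remaining(slo, downtime_minutes, window_days) > 0
```

For example, a 99.9% SLO over 30 days gives a budget of about 43.2 minutes, so `can_release(0.999, 10)` is true while `can_release(0.999, 50)` is false. A budget that is never touched is also a signal, since it suggests the team could be shipping faster.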
Implement Gradual Change
Gradual change
• Updates should be pushed as canaries, not as bulk version changes
• Smaller code changes mean a shorter mean time to recover on failure
• The rate of change depends on the chosen SLO
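One way to picture a canary rollout is sketched below. Everything here is an assumption made for illustration: the hash-based user bucketing, the doubling schedule, the `tolerance` value, and the function names are not from the talk. The idea it shows is only that a small share of traffic sees the new version first, and the share grows while the canary stays healthy.

```python
import hashlib


def routes_to_canary(user_id, canary_percent):
    """Deterministically send a fixed share of users to the canary."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket 0..99
    return bucket < canary_percent


def next_canary_percent(canary_error_ratio, baseline_error_ratio,
                        canary_percent, tolerance=0.005):
    """Return the next traffic share for the canary; 0 means roll back."""
    if canary_error_ratio > baseline_error_ratio + tolerance:
        return 0  # canary is clearly worse than the current version
    return min(100, canary_percent * 2)  # healthy: widen the rollout
```

A stricter SLO would argue for a smaller `tolerance` and a slower schedule, which is one reading of the point that the rate of change depends on the SLO.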
Tooling & Automation
Toil
Toil is the manual, repetitive work tied to running in PROD (which can be
automated)
Toil & Toil budget
SREs actively measure toil. The toil budget should be
around 30% to 50%
If toil is not kept within its budget, it easily grows
to 100%
But a small amount of toil is not harmful.
• Automation might be harder than the manual
work
• Helps newcomers to orient themselves
Measuring
Service reliability needs to be measured
• Uptime
• Mean time to failure
• Mean time to recover
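The three measures are related, and a short sketch shows how they can be derived from the same incident log. The input format (an observation period and a list of outage durations, both in hours) and the function name are assumptions for the example.

```python
def reliability_summary(period_hours, outage_hours):
    """Derive uptime, MTTF and MTTR from outage durations in one period."""
    failures = len(outage_hours)
    downtime = sum(outage_hours)
    uptime = period_hours - downtime
    summary = {"uptime_fraction": uptime / period_hours,
               "mttf_hours": None,
               "mttr_hours": None}
    if failures > 0:
        summary["mttf_hours"] = uptime / failures    # mean time to failure
        summary["mttr_hours"] = downtime / failures  # mean time to recover
    return summary
```

With these definitions the uptime fraction equals MTTF / (MTTF + MTTR), so reliability improves either by failing less often or by recovering faster.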
WhatsApp (Example Use Case)
• Message Delivery Time
• Message Throughput
• Image Resolution (Compression Algorithm)
• Video Compression Quality
• Etc.
Hope is not a
Strategy!
Thank you

Site Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
