A quick summary of
SRE – Site Reliability Engineering
Yogesh shah
Agenda
• What is SRE & its background
• Before going to SRE
• SRE and DevOps
• Components of SRE
• Reliability
• SLA
• SLO
• SLI
• Error budget
• Toil
• Things we did not cover
• References
What is SRE, History & Background
SRE = Site Reliability Engineering
The term SRE originated at Google more than a decade ago and it
has been the backbone of Google’s highly reliable & valuable
suite of products & services
Google didn’t make the details of SRE public as it considered
them the secret sauce of its success
When the DevOps movement started, Google could see that
there was a lot of interest in implementing DevOps but
no clear path, and people were struggling to implement
it
Scrum, SAFe, Lean, DevOps … now SRE
• Framework direction: Dev → Ops
• Flexibility: Rigid → open for interpretation
• Ease of implementation: Easy → very hard
• Fit for market demand: Less → high
Software delivery mechanisms compared: what is at the center, type, advantages, difficulties
Waterfall / Project management
• Centers around: Plan
• Outcome: Fixed target
• Type: Process
• Advantages: Easy to implement; scope, time, cost fixed
• Difficulties: Changing requirements; too heavy, complex & costly
ITIL
• Centers around: SLA
• Outcome: Predefined service quality
• Type: Framework
• Advantages: Easy to implement; clear accountability; predictable service quality
• Difficulties: Meeting the SLA != customer satisfaction; too heavy & complex
Scrum / SAFe
• Centers around: Timebox, focus
• Outcome: Delivery of changing requirements
• Type: Framework
• Advantages: Simple to understand
• Difficulties: Difficult to implement; works best in pockets but consistency is hard to achieve
Lean
• Centers around: Flow of work
• Outcome: Removal of waste
• Type: Methodology
• Advantages: Easy to implement; clear accountability; predictable service quality
• Difficulties: Meeting the SLA != customer satisfaction; too heavy & complex
DevOps
• Centers around: Unifying Dev & Ops
• Outcome: End-to-end accountability for Dev & Ops
• Type: Philosophy
• Advantages: Great vision
• Difficulties: Open to interpretation
What is SRE in comparison with the others
• Centers around: Reliability
• Outcome: Customer satisfaction with control over the balance of
enhancement & reliability
• Type: Implementation pattern
• Advantage: Implements DevOps
• Disadvantage: None
• Addresses the so far neglected question “is the system ready to handle change
without impacting customer experience?”
• SRE happens when a software engineer is tasked with what used to be
called operations.
SRE and DevOps
But what is DevOps?
DevOps is about a combined team (Dev & Ops)
using a common set of tools & processes to deliver
any software change
SRE is an implementation of DevOps.
DevOps
Reduce organization silos
Accept failure as normal
Implement gradual change
Leverage tooling & automation
Measure everything
SRE
Share ownership with developers by using the same tools and techniques across the stack
Have a formula for balancing accidents and failures against new releases
Encourage moving quickly by reducing costs of failure
Encourage "automating this year's job away" and minimizing manual systems work to
focus on efforts that bring long-term value to the system
Believe that operations is a software problem, and define prescriptive ways for
measuring availability, uptime, outages, toil, etc.
Components of SRE
Reliability
SLA
SLO
SLI
Error Budget
Toil
Defining Reliability
The most important feature of any system is its reliability
• A clunky system with great features doesn’t work
• 100% reliability is most often the wrong target as it slows down velocity
• Reliability beyond a certain point has diminishing returns
• Each additional 9 after the decimal point makes the system 10 times more reliable but costs 10 times more
User experience decides reliability
• Users, not monitoring metrics, decide reliability; hence, to say a system is reliable, one
needs to measure user experience
Engineering and talented developers alone are not enough for highly reliable systems; a well-trained incident response team is a must
• To achieve highly reliable (99.999…) systems, a well-trained incident response team
(proactive & reactive) is required. Talented developers and a well-engineered system alone are
not enough
Reliability
• SRE helps define reliability in a clear way using the concept of an error
budget
• Thanks to the error budget, reliability is understood
consistently across the organization
• 100% reliability is the wrong target as it slows down velocity
• User happiness and reliability are directly proportional up to a point;
beyond that the user doesn’t care
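The "each additional 9" point can be made concrete with a small calculation. The sketch below is illustrative only: the function names and the per-million-requests framing are assumptions, not something taken from the slides. It shows that every extra nine cuts the number of requests allowed to fail by a factor of 10.

```python
def availability_for_nines(nines: int) -> float:
    """Availability as a fraction for a count of nines, e.g. 3 -> 0.999."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return 1.0 - 1.0 / 10 ** nines


def allowed_failures(nines: int, total_requests: int) -> float:
    """Requests that may fail out of total_requests at the given number of nines."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return total_requests / 10 ** nines


if __name__ == "__main__":
    # Hypothetical traffic volume, chosen only to make the numbers easy to read.
    total = 1_000_000
    for n in range(2, 6):
        pct = 100.0 * availability_for_nines(n)
        print(f"{n} nines ({pct:.3f}%): up to {allowed_failures(n, total):.0f} failed requests per {total}")
```

Each step down the list leaves one tenth of the previous allowance, which is why the last few nines are the expensive ones.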
SLA – Service Level Agreements
• These are the agreements you make with your customers about
the reliability of your service. An SLA has to have consequences if it's
violated
• Violating an SLA is costly in many respects; hence getting an
informative warning with enough time to react is a must to prevent
an SLA violation
SLO – Service Level Objectives
• Reliability is a feature, hence it is prioritized against other functional features. However,
prioritizing reliability is challenging, and SLOs are key to prioritizing reliability
along with other features
• An SLO is the target for a specified level of reliability; in other words, reliability is measured against the SLO
• SLOs should always be stricter than your SLAs because customers are usually impacted
before the SLA is actually breached
• An SLO is effectively an internal promise to meet customer expectations. Violating an SLO
is a serious issue: you can no longer afford more outages, so you'll
want to take steps to remove risks from your service by devoting engineering
and automation efforts to reducing and eliminating areas of risk
• A good rule of thumb for setting SLO targets is the “happiness test”: the threshold beyond which
users tend to become grumpy due to degraded service performance
• Identifying SLOs and setting their targets is an important but tough task, and SRE has
clear guidelines to identify SLOs, set targets and revise SLOs, targets or both
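The idea that the SLO sits above the SLA and acts as an early warning can be sketched as follows. This is a minimal illustration under assumed names and numbers (Targets, status, 99.9% and 99.5% are all hypothetical), not a prescribed SRE implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Targets:
    slo: float  # internal objective in percent, e.g. 99.9 (assumed value)
    sla: float  # contractual commitment in percent, e.g. 99.5 (assumed value)

    def __post_init__(self) -> None:
        # The SLO must be stricter than the SLA so it is missed first.
        if self.slo <= self.sla:
            raise ValueError("SLO should be stricter (higher) than the SLA")


def status(measured: float, targets: Targets) -> str:
    """Classify a measured reliability percentage against the SLO and SLA."""
    if measured >= targets.slo:
        return "ok"
    if measured >= targets.sla:
        # Internal promise missed: time to react before the SLA is breached.
        return "slo_violated"
    return "sla_breached"


if __name__ == "__main__":
    targets = Targets(slo=99.9, sla=99.5)
    for measured in (99.95, 99.7, 99.0):
        print(measured, status(measured, targets))
```

The gap between the two thresholds is the time to react: the middle state is where engineering effort gets redirected toward reducing risk.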
SLI – Service Level Indicators
What is an SLI
• Now we understand what reliability is, but how do we measure it?
• The reliability of a service should be a quantitative measure of customer experience. SRE helps you
find a suitable metric based on the characteristics of your service
• The metric chosen to measure the level of service provided to the user is called an SLI. In simple words, it is a
quantitative measure of user experience
• How an SLI is measured varies with the implementation and the environment
in which the service operates
Relationship between SLI & SLO
• The SLI is how the service is performing against the target at a given point in time
• The SLO is the target we choose, and we measure the SLI against it over a period of time (e.g. 99% of requests are served within 2 seconds over the last 4 weeks)
• The SLI tells us whether a given period was good or bad, based on the measured SLI against the SLO target
• SLOs can differ for different times, different customer types, frequency of SLO misses, etc.; the concept of the error budget
helps you manage this
How SRE helps
• SRE provides an SLI menu for typical user journeys (system characteristics)
• SRE provides a simple formula to measure SLIs. It is always a ratio (good events / valid events)
• SRE provides blueprints to implement the SLI capture mechanism, along with the tradeoffs
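The ratio formula (good events / valid events) can be sketched for the latency example used above, "99% of requests are served within 2 seconds". The Request type, its valid flag and the function names are assumptions made for illustration; a real system would read these events from its own logs or metrics.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Request:
    latency_seconds: float
    valid: bool = True  # hypothetical flag, e.g. False for health checks or test traffic


def latency_sli(requests: Iterable[Request], threshold_seconds: float = 2.0) -> Optional[float]:
    """Return good events / valid events as a percentage, or None if nothing was valid."""
    valid = [r for r in requests if r.valid]
    if not valid:
        return None  # no valid events, so the ratio is undefined
    good = sum(1 for r in valid if r.latency_seconds <= threshold_seconds)
    return 100.0 * good / len(valid)


def meets_slo(sli_percent: float, slo_percent: float = 99.0) -> bool:
    """Compare the measured SLI with the SLO target."""
    return sli_percent >= slo_percent


if __name__ == "__main__":
    sample = [Request(0.5), Request(0.8), Request(1.9), Request(3.0), Request(10.0, valid=False)]
    sli = latency_sli(sample)
    print(f"SLI = {sli:.1f}%, meets 99% SLO: {meets_slo(sli)}")
```

Excluding invalid events from the denominator matters: traffic the user never experiences should not improve or worsen the measured experience.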
Error Budget
• Identifying, documenting and agreeing on SLOs and SLIs can be great progress, but how can
we make all this work?
• An error budget is useful to
• actively balance the reliability of the system against the progress of other features in a coherent manner
• inform everyone how much headroom is available before customer experience is impacted
• state quantitatively how much failure or unreliability is allowed
• E.g.
• If the intended reliability is 99.9%, the error budget is 0.1%
• A 0.1% error budget = 40.32 minutes of downtime over 28 days
• These 40.32 minutes are the error budget implied by the SLO we agree with all stakeholders. That means we have 40.32 minutes for
recovering from any failure. A failure can have any cause: HDD failure, bad code,
maintenance error, etc.
• It prompts a lot of useful thinking
• Assume that the reliability target for your platform is 95% over 28 days. That means you are allowed to have
1.4 days of downtime. Now do you really need CI/CD, blue-green deployment, test automation,
etc.?
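The arithmetic in the example above can be reproduced with a short sketch. The function names and the incident durations are hypothetical; only the 99.9%, 95% and 28-day figures come from the slide.

```python
from typing import Iterable


def error_budget_minutes(slo_percent: float, window_days: int = 28) -> float:
    """Downtime allowed in the window: (100% - SLO) of the window length."""
    window_minutes = window_days * 24 * 60  # 28 days = 40,320 minutes
    return window_minutes * (100.0 - slo_percent) / 100.0


def remaining_budget_minutes(
    slo_percent: float, downtime_minutes: Iterable[float], window_days: int = 28
) -> float:
    """Budget left after subtracting the downtime already spent in the window."""
    return error_budget_minutes(slo_percent, window_days) - sum(downtime_minutes)


if __name__ == "__main__":
    print(f"99.9% over 28 days: {error_budget_minutes(99.9):.2f} minutes")
    print(f"95% over 28 days: {error_budget_minutes(95.0) / (24 * 60):.1f} days")
    # Hypothetical incidents in the current window, in minutes of downtime.
    incidents = [12.0, 20.5]
    print(f"Remaining at 99.9%: {remaining_budget_minutes(99.9, incidents):.2f} minutes")
```

A remaining budget near zero is the signal to slow feature releases in favor of reliability work; a large unused budget suggests there is room to move faster.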
Toil
• Toil is work related to running a production system/service
• Toil satisfies the following conditions
• Manual
• Repetitive
• Automatable
• Tactical
• Devoid of long-term value
• Overhead (attending meetings, responding to email, etc.) is not toil
Not covered
• Detailed steps and workshops for developing SLOs and SLIs
• Setting achievable SLO targets
• Define SLIs
• Manage growth of SLI parameter
• SLI menu, implementation patterns, tradeoffs and cost analysis
• Define and analyze error budget
• Error budget policy, thresholds and scenarios
• Identify and address SLO risks
• Consequences of missing SLO
• There is much more
References
• SRE Introduction – Set of videos about SRE introduction
• SRE – How Google Runs Production Systems
• SRE Workbook – Practical ways to implement SRE
Thank you

UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 

SRE summary

  • 1. A quick summary of SRE – Site Reliability Engineering, by Yogesh Shah
  • 2. Agenda
      • What is SRE & its background
      • Before going to SRE
      • SRE and DevOps
      • Components of SRE
      • Reliability
      • SLA
      • SLO
      • SLI
      • Error budget
      • Toil
      • Things we did not cover
      • References
  • 3. What is SRE, History & Background
      • SRE = Site Reliability Engineering
      • The term SRE originated at Google more than a decade ago, and it has been the backbone of Google’s highly reliable and valuable suite of products and services.
      • Google did not make the details of SRE public at first, as it considered SRE the secret sauce of its success.
      • When the DevOps movement started, Google saw a lot of interest in implementing DevOps, but there was no clear path and people were struggling to implement it.
  • 4. Scrum, SAFe, Lean, DevOps ... now SRE
      • Framework direction: Dev → Ops
      • Flexibility: Rigid → open to interpretation
      • Ease of implementation: Easy → very hard
      • Fit for market demand: Less → high
      Software delivery mechanisms compared (what is at the center, type, advantages, difficulties):
      • Waterfall / Project management
          • Centers around: Plan
          • Outcome: Fixed target
          • Type: Process
          • Advantages: Easy to implement; scope, time and cost fixed
          • Difficulties: Changing requirements; too heavy, complex and costly
      • ITIL
          • Centers around: SLA
          • Outcome: Predefined service quality
          • Type: Framework
          • Advantages: Easy to implement; clear accountability; predictable service quality
          • Difficulties: Meeting the SLA != customer satisfaction; too heavy and complex
      • Scrum / SAFe
          • Centers around: Timebox, focus
          • Outcome: Delivery of changing requirements
          • Type: Framework
          • Advantages: Simple to understand
          • Difficulties: Difficult to implement; works best in pockets, but consistency is hard to achieve
      • Lean
          • Centers around: Flow of work
          • Outcome: Removal of waste
          • Type: Methodology
          • Advantages: Easy to implement; clear accountability; predictable service quality
          • Difficulties: Meeting the SLA != customer satisfaction; too heavy and complex
      • DevOps
          • Centers around: Unifying Dev & Ops
          • Outcome: End-to-end accountability for Dev & Ops
          • Type: Philosophy
          • Advantages: Great vision
          • Difficulties: Open to interpretation
  • 5. What is SRE in comparison to the others
      • Centers around: Reliability
      • Outcome: Customer satisfaction, with control over the balance between enhancement and reliability
      • Type: Implementation pattern
      • Advantage: Implements DevOps
      • Disadvantage: None
      • Addresses a so far neglected question: “Is the system ready to handle change without impacting customer experience?”
      • SRE happens when a software engineer is tasked with what used to be called operations.
  • 6. SRE and DevOps
      But what is DevOps?
      • DevOps is about a combined team (Dev & Ops) using a common set of tools and processes to deliver any software change.
      • SRE is an implementation of DevOps.
      How each DevOps principle maps to an SRE practice:
      • DevOps: Reduce organizational silos
          • SRE: Share ownership with developers by using the same tools and techniques across the stack
      • DevOps: Accept failure as normal
          • SRE: Have a formula for balancing accidents and failures against new releases
      • DevOps: Implement gradual change
          • SRE: Encourage moving quickly by reducing the cost of failure
      • DevOps: Leverage tooling & automation
          • SRE: Encourage "automating this year's job away" and minimizing manual systems work, to focus on efforts that bring long-term value to the system
      • DevOps: Measure everything
          • SRE: Believe that operations is a software problem, and define prescriptive ways of measuring availability, uptime, outages, toil, etc.
  • 8. Defining Reliability
      The most important feature of any system is its reliability
      • A clunky system with great features does not work.
      • 100% reliability is most often the wrong target, as it slows down velocity.
      • Reliability beyond a certain point has diminishing returns.
      • Each additional 9 after the decimal point makes the system 10 times more reliable, but it costs 10 times more.
      User experience decides reliability
      • Users, not monitoring metrics, decide reliability. To say that a system is reliable, one therefore needs to measure user experience.
      Engineering and talented developers alone are not enough for highly reliable systems; a well-trained incident response team is a must
      • Achieving highly reliable (99.999…) systems requires a well-trained incident response team (proactive and reactive). Talented developers and a well-engineered system are not enough on their own.
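The "each extra 9" rule of thumb on this slide can be checked with a little arithmetic. The sketch below is illustrative only: the helper name `downtime_for_nines` is hypothetical, and the 28-day window is an assumption borrowed from the error budget slide. It models only the reliability side of the trade-off, not the cost side.

```python
# Illustrative sketch: how allowed downtime shrinks with each additional nine.
# Assumption: a 28-day measurement window, as on the error budget slide.
WINDOW_MINUTES = 28 * 24 * 60  # 40320 minutes


def downtime_for_nines(nines: int, window_minutes: int = WINDOW_MINUTES) -> float:
    """Allowed downtime in minutes for `nines` nines of availability (3 -> 99.9%)."""
    unavailability = 10.0 ** (-nines)  # 3 nines -> 0.001
    return window_minutes * unavailability


for nines in range(2, 6):
    availability = 100.0 * (1 - 10.0 ** (-nines))
    print(f"{nines} nines ({availability:.3f}%): "
          f"{downtime_for_nines(nines):.3f} min per 28 days")
```

Each step from one row to the next divides the allowed downtime by ten, which is the sense in which the system becomes "10 times more reliable".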
  • 9. Reliability
      • SRE helps define reliability clearly through the concept of an error budget.
      • Thanks to the error budget, reliability is understood consistently across the organization.
      • 100% reliability is the wrong target, as it slows down velocity.
      • User happiness and reliability are directly proportional up to a point; beyond that point, users do not care.
  • 10. SLA
      • An SLA (Service Level Agreement) is the agreement you make with your customers about the reliability of your service. An SLA has to have consequences if it is violated.
      • Violating an SLA is costly in many respects, so an informative warning with enough time to react is a must to prevent a violation.
  • 11. SLO – Service Level Objectives
      • Reliability is a feature, so it is prioritized against other functional features. Prioritizing reliability is challenging, however, and SLOs are the key to prioritizing it alongside other features.
      • An SLO is the target for a specified level of reliability. In other words, the SLO is what reliability is measured against.
      • An SLO should always be stricter than your SLA, because customers are usually impacted before the SLA is actually breached.
      • An SLO is effectively an internal promise to meet customer expectations. Violating an SLO is a serious issue because you can no longer afford more outages, so you will want to remove risks from your service by devoting engineering and automation effort to reducing and eliminating areas of risk.
      • A good rule of thumb for setting SLO targets is the “happiness test”: the threshold beyond which users tend to become grumpy because of degraded service performance.
      • Identifying and selecting SLO targets is an important but tough task. SRE has clear guidelines for identifying SLOs, setting targets, and revising SLOs, targets, or both.
  • 12. SLI – Service Level Indicators
      What is an SLI
      • Now we understand what reliability is, but how do we measure it?
      • The reliability of a service should be a quantitative measure of customer experience. SRE helps you find a suitable metric based on the characteristics of your service.
      • The metric chosen to measure the level of service provided to the user is called the SLI. In simple words, it is a quantitative measure of user experience.
      • How an SLI is measured varies with the implementation and the environment in which the service operates.
      Relationship between SLI & SLO
      • The SLI shows how the service is performing against the target at a given point in time.
      • The SLO is the target we choose, against which the SLI is measured over a period of time (e.g. 99% of requests served within 2 seconds over the last 4 weeks).
      • The SLI tells us whether a given period was good or bad, based on how the measured SLI compares with the SLO target.
      • SLOs can differ by time, customer type, frequency of SLO misses, etc.; the concept of an error budget helps you manage this.
      How SRE helps
      • SRE provides an SLI menu for typical user journeys (system characteristics).
      • SRE provides a simple formula for measuring SLIs: it is always a ratio (good events / valid events).
      • SRE provides blueprints for implementing SLI capture mechanisms, along with their tradeoffs.
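The "good events / valid events" ratio can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the `Request` record, its `valid` flag, the `latency_sli` helper and the sample data are all hypothetical. Only the 2-second threshold and the 99% target come from the slide's example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    latency_seconds: float
    valid: bool = True  # hypothetical flag, e.g. False for traffic excluded from the SLI


def latency_sli(requests: List[Request], threshold_seconds: float = 2.0) -> Optional[float]:
    """Return good events / valid events, or None when there are no valid events."""
    valid = [r for r in requests if r.valid]
    if not valid:
        return None
    good = sum(1 for r in valid if r.latency_seconds <= threshold_seconds)
    return good / len(valid)


SLO_TARGET = 0.99  # "99% of requests served within 2 seconds"

# Made-up sample: 98 fast requests, 2 slow ones, and 1 excluded request.
sample = [Request(0.3)] * 98 + [Request(2.5), Request(4.0)] + [Request(9.0, valid=False)]
sli = latency_sli(sample)
print(f"SLI = {sli:.3f}, meets SLO: {sli >= SLO_TARGET}")
```

Keeping invalid events out of the denominator is what distinguishes "valid events" from "all events"; which events count as valid is a per-service decision.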
  • 13. Error Budget
      • Identifying, documenting and agreeing on SLOs and SLIs is great progress, but how do we make all this work?
      • The error budget is useful to:
          • actively balance the reliability of the system against progress on other features in a coherent manner
          • inform everyone how much headroom is available before customer experience is impacted
      • It states quantitatively how much failure or unreliability is allowed.
      • E.g.
          • If the intended reliability is 99.9%, the error budget is 0.1%.
          • A 0.1% error budget = 40.32 minutes of downtime over 28 days.
          • These 40.32 minutes are the error budget implied by the SLO we agree with all stakeholders. That means we have 40.32 minutes in total to recover from failures in that window, whatever their cause: HDD failure, bad code, maintenance error, etc.
      • It prompts a lot of useful thinking.
          • Assume the reliability target for your platform is 95% over 28 days. That means you are allowed 1.4 days of downtime. Now, do you really need CI/CD, blue-green deployment, test automation, etc.?
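The figures on this slide can be reproduced, and extended to show how much budget is left, with a short sketch. The function names and the incident durations below are made up for illustration. The 28-day window and the 99.9% and 95% targets are the ones used in the slide.

```python
from typing import Iterable

WINDOW_DAYS = 28  # window used in the slide's example


def error_budget_minutes(slo_percent: float, window_days: int = WINDOW_DAYS) -> float:
    """Total allowed downtime in minutes for a time-based SLO over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


def remaining_budget_minutes(slo_percent: float,
                             downtime_minutes: Iterable[float],
                             window_days: int = WINDOW_DAYS) -> float:
    """Budget left after subtracting the downtime already spent in the window."""
    return error_budget_minutes(slo_percent, window_days) - sum(downtime_minutes)


budget = error_budget_minutes(99.9)
incidents = [12.0, 7.5]  # hypothetical outages, in minutes
left = remaining_budget_minutes(99.9, incidents)
print(f"budget: {budget:.2f} min, spent: {sum(incidents):.2f} min, left: {left:.2f} min")
print(f"95% over 28 days allows {error_budget_minutes(95.0) / (24 * 60):.1f} days of downtime")
```

A negative remaining budget would mean the SLO has been missed for the window; what to do in that case is a matter for an error budget policy, which these slides do not cover.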
  • 14. Toil
      • Toil is work related to running a production system or service.
      • Toil satisfies the following conditions:
          • Manual
          • Repetitive
          • Automatable
          • Tactical
          • Devoid of long-term value
      • Overhead (attending meetings, responding to email, etc.) is not toil.
  • 15. Not covered
      • Detailed steps and workshops for developing SLOs and SLIs
      • Setting achievable SLO targets
      • Defining SLIs
      • Managing growth of SLI parameters
      • SLI menu, implementation patterns, tradeoffs and cost analysis
      • Defining and analyzing the error budget
      • Error budget policy, thresholds and scenarios
      • Identifying and addressing SLO risks
      • Consequences of missing an SLO
      • There is much more
  • 16. References
      • SRE Introduction: a set of videos introducing SRE
      • SRE: How Google Runs Production Systems
      • SRE Workbook: Practical Ways to Implement SRE