Site-Reliability-Engineering-v2[6241].pdf

Site Reliability
Engineering
(SRE)

2
Introduction
• Name
• Total Experience
• Background – Development /
Infrastructure /
Management
• Experience on DevOps Tools, Cloud
• Your expectations from this training
2

3
Few Pointers
3
• SRE is more of a concept to implement (vs a tool). So theoretical aspects will be more.
• Many pointers will be familiar. It’s just the right use of same.
• Few best practices we already know but couldn’t implement. Now it’s the right time to
implement same. It’s a do or die situation now.
• Try to co-relate the topics with your domain and prepare notes with pointers.
• Implementation discussion is welcome for your existing teams and domains.
• Learn with - “What behavior will I change?” Learning isn’t collecting information. Learning
is changing behavior.

4
Module 1:
Principles & Practices

6
Digital Transformation
A way Forward
Security, Legal, Compliance must
be in center in all designs
Security
Integration of digital technology into all areas of a business, fundamentally
changing how you operate and deliver value to customers.
Concept
Use technologies which
suits you the best
Tools
Cultural Change
Process
The most important part
and the barrier too
People
SECURITY
PEOPLE
PROCESS
TOOLS
Successful
Digital
Transformation

7
Emerging Technologies
Which are helping industries in Digital Transformation
2
3
4
5
1
6
DATA
ML/AI
Cloud
Security
Automation
Blockchain
IOT
Security toolsets
Remain secure to avoid Financial,
legal and compliance issues
Automation
Automation with DevOps Toolsets
– CICD, SRE, CM, Containers etc.
Cloud
Keeping infra off-ground to third
party and move all Opex
ML/AI
Better decision Making (well
predicted, informed) and autonomy
IOT
Connect and Integrate whatever
possible to automate
Blockchain
For secure transactions over
distributed public networks

9
9
• A systems development life cycle is composed of several clearly defined and distinct work phases which
are used by systems engineers and systems developers to plan for, design, build, test, and deliver
information systems
Require-
ment
Analysis
Design
Implementa
tion
Testing
Evaluation
SDLC – Life Cycle
SDLC Model

10
10
- long release cycle
- A lot of WIP
- Functional silos
- Incredibly rigid for developing
1. Determine the Requirements
2. Complete the design
3. Do the coding and testing (unit
tests)
4. Perform other tests (functional
tests, non-functional tests,
Performance testing, bug fixes
etc.)
5. At last deploy and maintain
Waterfall Model

11
11
- Shorter release cycle
- Small batch sizes (MVP)
- Cross-functional teams
- Incredibly agile
Agile

12
12
- Suddenly ops was the bottleneck (more release less people), again WIP is more!
Lean Development

13
Software Development
Infrastructure, Operations and
Support
Build & Release, Testing Teams
DevOps
- Break the Silos
- Communication (not only with
emails)
- Collaboration
- Trust
- Involvement in the early
development stages
- Automation is the key
- Continuous Integration
- Continuous Deployments in
the lower environments
- Fail fast and fail often
DevOps

14
Development QA Testing Implementation & Release InfraManagement
- name:
Playbook
for
webserver
setup
hosts:
all
tasks:
- name:
package
installatio
n
yum:
name=yum
state=prese
nt
....
....
....
....
- name:
Playbook
for
webserver
setup
t
....
....
....
....
APP APP APP
Waiting Waiting Waiting
Waiting
APP
DevOps

15
15
DevOps
• DevOps is a loose set of practices, guidelines, and culture designed to break down silos in IT development,
operations, networking, and security.
• In a DevOps approach, you improve something (often by automating it), measure the results, and share
those results with colleagues so the whole organization can improve.
• DevOps, Agile, and a variety of other business and software reengineering techniques are all examples of a
general worldview on how best to do business in the modern world. None of the elements in the DevOps
philosophy are easily separable from each other, and this is essentially by design.
• DevOps is a broad set of principles about whole-lifecycle collaboration between operations and product
development.

16
Extreme siloization of knowledge, incentives for
purely local optimization, and lack of collaboration
have in many cases been actively bad for business
16
DevOps Principles
No More Silos
Accidents Are Normal
Change Should Be Gradual
Tooling and Culture Are Interrelated
Measurement Is Crucial
Accidents are not just a result of the isolated
actions of an individual, but rather result from
missing safeguards for when things inevitably go
wrong. I.e. Misconfigured System, broken
monitoring, under pressure wrong actions etc.
Rooting out the Mistake makers and punishing
them creates mess, like incentives to confuse
issues, hide the truth, and blame others, all of which
are ultimately unprofitable distractions.
Change is best when it is small and frequent.
Change is risky, true, but the correct response is to
split up your changes into smaller subcomponents
where possible. Then you build a steady CICD
pipeline of low-risk change out of regular output
from product, design, and infrastructure changes
with Automated testing and improvements.
A good culture can work around broken tooling, but
the opposite rarely holds true. Promoters of
DevOps strongly emphasize organizational
culture—rather than tooling—as the key to success
in adopting a new way of working.
Measure your outcomes time to time. Its can be in
the form of Number of incidents, faster time to
market, MTTR, SLA etc.

17
17
Site Reliability Engineering

18
Site Reliability Engineering (SRE) is a term (and associated job role) coined by Ben Treynor Sloss, a VP of engineering at
Google.
SRE is a job role, a set of practices which are known to work at ground, and some beliefs that animate those practices.
SRE is coined around Reliability of the system.
In general, an SRE has particular expertise around the availability, latency, performance, efficiency, change management,
monitoring, emergency response, and capacity planning of the service(s) they are looking after.
SRE implements interface DevOps.
SRE is hiring software engineers to run products and to create systems to accomplish the work that would otherwise be
performed, often manually, by sysadmins.
Common to all SREs is the belief in and aptitude for developing software systems to solve complex problems.
Way to SRE

19
SRE is a team of people who (a) will quickly become bored by performing tasks by hand, and (b) have the skill set necessary
to write software to replace their previously manual work, even when the solution is complicated.
SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software
expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, design
and implement automation with software to replace human labor.
By design, it is crucial that SRE teams are focused on engineering. Without constant engineering, operations load increases
and teams will need more people just to keep pace with the workload
Google places a 50% cap on the aggregate "ops" work for all SREs—tickets, on-call, manual tasks, etc. This cap ensures that
the SRE team has enough time in their schedule to make the service stable and operable via engineering tasks of
automation.
In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management,
monitoring, emergency response, and capacity planning of their service(s).
Way to SRE

20
SRE should therefore use software engineering
approaches to solve that problem.
20
SRE Principles
Operations Is a Software Problem
Manage by Service Level Objectives
Work to Minimize Toil
Automate - What you can
Move Fast by Reducing the Cost of Failure
Share Ownership with Developers
Use the Same Tooling, Regardless of Function
Define SLOs and work around same
Toils should be reduced to minimum and
automation to be done to the extent possible.
Automation is the key. The real work in this area is
determining what to automate, under what
conditions, and how to automate it. As per Google
max 50% work can be toil and rest 50% time should
be given to SRE Engineering tasks or something
new.
Cost of failure is directly proportional to Mean
Time to Repair (MTTR) which effects product
developer velocity. This follows from the well-
known fact that the later in the product lifecycle a
problem is discovered, the more expensive it is to
fix.
Ideally, both product development and SRE teams
should have a holistic view of the stack—the
frontend, backend, libraries, storage, kernels, and
physical machine—and no team should jealously
own single components. It turns out that you can
get a lot more done if you “blur the lines” and have
SREs instrument JavaScript, or product developers
qualify kernels configurations.
Having similar qualified tools across organizations
will help the easy process understanding and
SRE/DevOps Culture adoption.
Way to SRE

21
21
DevOps vs SRE
• DevOps a loose generic set of principles (philosophy and culture) and SRE an advanced explicit implementation.
• Site Reliability Engineering, like DevOps, should not just be changing titles, but making definitive behavior changes,
focusing on outcomes and obviously reliability.
• Collaboration is front and center for DevOps work. An effective shared ownership model and partner team relationships
are necessary for SRE to function.
• Change management is best pursued as small, continual actions, the majority of which are ideally both automatically
tested and applied. The critical interaction between change and reliability makes this especially important for SRE.
• Measurement is absolutely key to how both DevOps and SRE work. For SRE, SLOs are dominant in determining the
actions taken to improve the service. For DevOps, the act of measurement is often used to understand what the outputs
of a process are, what the duration of feedback loops is, and so on.
• DevOps is relatively silent on how to run operations at a detailed level. While SRE talks about detailed steps of
implementations and deployments.
Way to SRE

22
22
DevOps vs SRE
• DevOps is more context-sensitive and works organization wide. SRE, on the other hand, has relatively narrowly defined
responsibilities and its remit is generally service-oriented (and end-user-oriented) rather than whole-business-oriented.
• Ultimately, implementing DevOps or SRE is a holistic act; both hope to make the whole of the team (or unit, or
organization) better, as a function of working together in a highly specific way. For both DevOps and SRE, better velocity
should be the outcome.
Way to SRE

23
SRE Context and Successful Adoption
Narrow, Rigid (launch-related or reliability-related) Incentives, Narrow Your Success.
A system with early SRE engagement (ideally, at design time) typically works better in production after deployment,
regardless of who is responsible for managing the service.
Don’t just allow, but actively encourage, engineers to change code and configuration when required for the product.
Support blameless postmortems. Doing so eliminates incentives to downplay or cover up a problem.
Allow support to move away from products that are irredeemably operationally difficult. The threat of support withdrawal
motivates product development to fix issues both in the run-up to support and once the product is itself supported, saving
everyone time.
Always remember – Good people will quit if they’re tasked with too much operational work and aren’t given the opportunity
to use their engineering skill set.
Consider Reliability Work as a Specialized Role.
Strive for Parity of Esteem: Career and Financial.
Way to SRE

25
Brain - Storming
DevOps vs SRE
An x company wants to reduce the time to market for its new software product releases and facing below issues:
• Hardware capacity planning is a challenge
• Infra is new, yet hardware failures are more
• Lots of bugs are being identified in the products
• Releases fails on production days
• Huge Incident tickets post new releases for next few days.
Where DevOps can help in this area?
Where SRE can help in this segment?
Understand how SRE heals DevOps Failures…

26
Case Study – French Telecom
DevOps vs SRE
Identifying the DevOps Work and building DevOps Team
Identifying Reliability needs and building SRE Engineering Team
Continuous Enhancement…

27
Video
DevOps vs SRE
DevOps vs SRE (Google)
https://www.youtube.com/watch?v=uTEL8Ff1Zvk

28
Exercise
DevOps vs SRE
What we do all day? Is there a way to automate?
Is there any way to make the systemmore reliable?
Factored ROI?

29
Q & A
DevOps vs SRE
Q & A
Questionnaire

30
30
Module 2:
Service Level Objectives (SLOs) &
Error Budgets

31
Important terms
Way to SRE
Availability=The ability of less downtime, or the fraction of the time that a service is usable. Although 100% availability is
impossible, near-100% availability is often readily achievable
Reliability=The ability to work properly (even if some parts/components failed).
Durability=The ability of not losing data. Or the likelihood that data will be retained over a long period of time—is equally
important (alike Availability) for data storage systems.
SLA (Promise) = Service-Level Agreement is a commitment between a service provider and a client, regarding particular
aspects of the service – quality, availability, responsibilities etc. It is an explicit or implicit contract with your users that
includes consequences of meeting (or missing) the SLOs they contain.
SLO (Goal) = Service Level Objective – The Objectives (within SLA, i.e. uptime, response time) which your team must hit to
meet the SLA.
SLI (How and What) = Service Level Indicators – the Real numbers to measure your compliance against SLO. In Specific – its a
carefully defined quantitative measure of some aspect of the level of service that is provided. i.e. request latency, error rate
etc.

33
Service Level Objectives
Way to SRE
It’s impossible to manage a service correctly, let alone well, without understanding which behaviors really matter for that
service and how to measure and evaluate those behaviors.
We use intuition, experience, and an understanding of what users want to define service level indicators (SLIs), objectives
(SLOs), and agreements (SLAs). These measurements describe basic properties of metrics that matter, what values we
want those metrics to have, and how we’ll react if we can’t provide the expected service. Ultimately, choosing appropriate
metrics helps to drive the right action if something goes wrong, and also gives an SRE team confidence that a service is
healthy.
An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI. A natural
structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound.
Choosing an appropriate SLO is usually complex (i.e. QPS, Network Bandwidth etc.) , but sometime its straightforward too
(i.e. setting low-latency).
Choosing and publishing SLOs to users sets expectations about how a service will perform. Without an explicit SLO, users
often develop their own beliefs about desired performance, which may be unrelated to the beliefs held by the people
designing and operating the service.

34
SLI – Indicators in Practice
Way to SRE
SRE doesn’t typically get involved in constructing SLAs, however, get involved in helping to avoid triggering the
consequences of missed SLOs.
What Do You and Your Users Care About?
• User-facing serving systems, such as the Shakespeare search frontends, generally care about availability,
latency, and throughput. In other words: Could we respond to the request? How long did it take to respond?
How many requests could be handled?
• Storage systems often emphasize latency, throughputs, IOPS, availability, and durability.
• Big data systems, such as data processing pipelines, tend to care about throughput and end-to-end
latency (How much data is being processed?, and how long it takes to process?).
• All systems should care about correctness: was the right answer returned, the right data retrieved, the right
analysis done?
Collecting Indicators
Aggregate the metric for better usage (Average out, Instantaneous usages)
Standardize Indicators (Over a period of time i.e. average packets per minute)
Collect Indicators at server as well as Client end
Objectives in Practice (SLO)
Define Objectives (99% (averaged over 1 minute) of Get RPC calls will complete in less than 100 ms)
Choose realistic targets, which are simple & minimum-required and always keep a refine strategy/scope.

35
Control Measures
SLIs and SLOs are crucial elements in the control loops used to manage systems:
• Monitor and measure the system’s SLIs.
• Compare the SLIs to the SLOs, and decide whether or not action is needed.
• If action is needed, figure out what needs to happen in order to meet the target.
• Take that action.
Always remember:
• Publishing SLOs sets expectations for system behavior
• Keep margins - Using a tighter internal SLO than the SLO advertised to users gives you room to respond to
chronic problems before they become visible externally.
• If your service’s actual performance is much better than its stated SLO, users will come to rely on its current
performance. You can avoid over-dependence by deliberately taking the system offline occasionally (Google’s
Chubby service introduced planned outages in response to being overly available).
Understanding how well a system is meeting its expectations helps decide whether to invest in making the system faster,
more available, and more resilient. Alternatively, if the service is doing fine, perhaps staff time should be spent on other
priorities, such as paying off technical debt, adding new features, or introducing other products.
Way to SRE

36
Monitoring in place
Way to SRE

37
Error Budget and policies
The SLO is a target percentage, and the error budget is 100% minus the SLO. For example, if you have a 99.9% success
ratio SLO, then a service that receives 3 million requests over a four-week period had a budget of 3,000 (0.1%) errors over
that period. If a single outage is responsible for 1,500 errors, that error costs 50% of the error budget.
Once you have an SLO, you can use the SLO to derive an error budget. In order to use this error budget, you need a policy
outlining what to do when your service runs out of budget.
When we talk about enforcing an error budget policy, we mean that once you exhaust your error budget (or come close to
exhausting it), you should do something in order to restore stability to your system
Common owners and actions might include:
• The development team gives top priority to bugs relating to reliability issues over the past four weeks.
• The development team focuses exclusively on reliability issues until the system is within SLO. This responsibility comes
with high-level approval to push back on external feature requests and mandates.
• To reduce the risk of more outages, a production freeze halts certain changes to the system until there is sufficient error
budget to resume changes.
Way to SRE

39
Case Study – Genpact
SLO & Error Budget
Pain Areas…
Penalties due to SLA Miss…
Setting SLO for Application uptime and performance.
Tracking SLIs

40
Video
SLO & Error Budget
SLA, SLO and SLI (Google)
https://www.youtube.com/watch?v=tEylFyxbDLE
Risks and Error Budgets (Google)
https://www.youtube.com/watch?v=y2ILKr8kCJU

41
Exercise
SLO & Error Budget
Define 3 SLI/SLOfor your current Application contract.
Have We Defined Right SLOs and Monitoring right SLIs?
Do we Just work with Availability Monitoring or Performance Monitoring too?

42
Q & A
SLO & Error Budget
Q & A
Questionnaire

43
43
Module 3:
Reducing Toils

44
Toils
Way to SRE
“If a human operator needs to touch your system during normal operations, you have a bug.”
Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of
enduring value, and that scales linearly as a service grows.
Google’s SRE organization has an advertised goal of keeping operational work (i.e., toil) below 50% of each SRE’s time.
At least 50% of each SRE’s time should be spent on engineering project work that will either reduce future toil or add service
features. Feature development typically focuses on improving reliability, performance, or utilization, which often reduces toil
as a second-order effect.
It is equally important to calculate toils and time spent on same, over a given period and keep aligning the team towards
engineering tasks.
Engineering work is novel and essentially requires human judgment. It produces a permanent improvement in your service
and is guided by a strategy. It is frequently creative and innovative, taking a design-driven approach to solving a problem—the
more generalized, the better.

45
Toils
A Must know
Manual
This includes work such as manually running a script that automates some task. Running a script may be quicker than
manually executing each step in the script, but the hands-on time a human spends running that script (not the elapsed time) is
still toil time.
Repetitive
If you’re performing a task for the first time ever, or even the second time, this work is not toil. Toil is work you do over and
over. If you’re solving a novel problem or inventing a new solution, this work is not toil.
Automatable
If a machine could accomplish the task just as well as a human, or the need for the task could be designed away, that task is
toil. If human judgment is essential for the task, there’s a good chance it’s not toil.
Tactical
Toil is interrupt-driven and reactive, rather than strategy-driven and proactive. Handling pager alerts is toil. We may never be
able to eliminate this type of work completely, but we have to continually work toward minimizing it.
No enduring value
If your service remains in the same state after you have finished a task, the task was probably toil. If the task produced a
permanent improvement in your service, it probably wasn’t toil, even if some amount of grunt work—such as digging into
legacy code and configurations and straightening them out—was involved.

46
SRE tasks
Way to SRE
Engineering work helps your team or the SRE organization handle a larger service, or more services, with the same level of
staffing.
Software engineering
Involves writing or modifying code, in addition to any associated design and documentation work. Examples include writing
automation scripts, creating tools or frameworks, adding service features for scalability and reliability, or modifying
infrastructure code to make it more robust.
Systems engineering
Involves configuring production systems, modifying configurations, or documenting systems in a way that produces lasting
improvements from a one-time effort. Examples include monitoring setup and updates, load balancing configuration, server
configuration, tuning of OS parameters, and load balancer setup. Systems engineering also includes consulting on
architecture, design, and productionization for developer teams.
Toil
Work directly tied to running a service that is repetitive, manual, etc.
Overhead
Administrative work not tied directly to running a service. Examples include hiring, HR paperwork, team/company meetings,
bug queue hygiene, snippets, peer reviews and self-assessments, and training courses.

47
Why Toils are bad?
Way to SRE
Career stagnation
Your career progress will slow down or grind to a halt if you spend too little time on projects.
Low morale
People have different limits for how much toil they can tolerate, but everyone has a limit. Too much toil leads to burnout,
boredom, and discontent.
Slows progress
Excessive toil makes a team less productive. A product’s feature velocity will slow if the SRE team is too busy with manual
work and firefighting to roll out new features promptly.
Sets precedent
If you’re too willing to take on toil, your Dev counterparts will have incentives to load you down with even more toil,
sometimes shifting operational tasks that should rightfully be performed by Devs to SRE.
Promotes attrition
Even if you’re not personally unhappy with toil, your current or future teammates might like it much less. If you build too much
toil into your team’s procedures, you motivate the team’s best engineers to start looking elsewhere for a more rewarding job.

48
How to Reduce Toils?
Identify Toils and try to reduce same to the level you can. Let's take an example:
Identify what your team members are involved into at 80% of the on-job time.
Check if same can be automated.
If yes, then automate same with some tools, else if not identify what else can be done to improve the process.
Keep improving the existing state and service.
Set goals for Engineering tasks too, for e.g. increasing Internal SLO from 99.9 to 99.95%. Identifying the ways and
implementing the procedure for same.
Way to SRE

50
Case Study – Reducing Toils
One of my client “X” work in Contact Center field, where they deploy the Contact Center services for end clients and manage
it for them.
Now for every new client deploying and building the infrastructure was a very hectic task (contains 15+ Servers with 10+
microservice, multiple LBs, Cache servers, DBs, Security Implementations). Similarly increasing the existing client
environment was very difficult and time-consuming tasks. Even the hardware capacity planning started becoming challenging.
80-90% of the teams (including Developers) were involved in deployment of new services or expansion of environment and
issues handling for existing clients was becoming difficult. Teams were in pain with repeated tasks and pressure they were
going through.
• Company took the hard decision and migrated to Cloud services to avoid hardware bottlenecks.
• Terraform (DevOps IaC) tool was used to automate the deployment. Now the same deployment of infra, which was taking
1 month to design and get ready, is getting up and running in less than an hour.
• Same team members are free from pressure and happy investing their time to enhance the features, reducing bugs and
automating the environment further to next levels.
• Even during pandemic, they thrive with 200% increase in customer on-boarding, without any hastle.
Reducing Toils

51
Video
Reducing Toils
Pragmatic Automation
https://www.youtube.com/watch?v=oDcjAcFTFC0

52
Exercise
Reducing Toils
Do you foresee any toils in your team?
If yes, benefits of Reducing Toil?
How same can be reduced?
Is daily mails, project Reports toil?
Considered ROI ???

53
Q & A
Reducing Toils
Q & A
Questionnaire

54
54
Module 4:
Monitoring & Service Level Indicators

55
Monitoring
Way to SRE
Monitoring
Collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as query counts and
types, error counts and types, processing times, and server lifetimes.
White-box monitoring
Monitoring based on metrics exposed by the internals of the system, including logs, interfaces like the Java Virtual Machine
Profiling Interface, or an HTTP handler that emits internal statistics.
Black-box monitoring
Testing externally visible behavior as a user would see it.
Alert
A notification intended to be read by a human and that is pushed to a system such as a bug or ticket queue, an email alias, or a
pager. Respectively, these alerts are classified as tickets, email alerts,22 and pages.
Root cause
A defect in a software or human system that, if repaired, instills confidence that this event won’t happen again in the same
way.

56
SLI’s -Service Level Indicators
Monitoring
SRE doesn’t typically get involved in constructing SLAs, however, get involved in helping to avoid triggering the
consequences of missed SLOs. And to achieve same having proper monitoring with right SLIs is very important.
What Do You and Your Users Care About?
• User-facing serving systems, Could we respond to the request? How long did it take to respond? How many
requests could be handled?
• Storage systems often emphasize latency, availability, and durability.
• Big data systems, such as data processing pipelines, tend to care about throughput and end-to-end
latency (How much data is being processed?, and how long it takes to process?).
• All systems should care about correctness: was the right answer returned, the right data retrieved, the right
analysis done?
Collecting Indicators
Aggregate the metric for better usage (Average out, Instantaneous usages)
Standardize Indicators (Over a period of time i.e. average packets per minute)
Collect Indicators at server as well as Client end
E.g. Application latency for 99.9% users in last 5 min should be less than 100ms.

57
Why Monitor?
Way to SRE
Analyzing long-term trends
Comparing over time
Alerting
Building dashboards
Conducting ad hoc retrospective analysis

58
Monitoring with proper SLI Metric
Way to SRE
Always make sure to select the right Service Level Indicator Metric to track and alert.
Set alerts (with respective criticality, pager, Email etc) for all your SLO targets and also creating simple and meaning full
dashboards for a higher visibility.
Four golden signals to track:
Latency
Traffic
Errors
Saturation (IO, Memory, CPU etc)

59
SLO Improvements
Way to SRE
VALET
Google summed up our new SLOs into a handy acronym: VALET.
Volume (traffic)
How much business volume can my service handle?
Availability
Is the service up when I need it?
Latency
Does the service respond fast when I use it?
Errors
Does the service throw an error when I use it?
Tickets
Does the service require manual intervention to complete my request?
Use Telemetry tools to collect SLI metrics from remote servers into a centralized Monitoring server to generate graphs.

60
Case Study - Genpact
Monitoring and SLI
At Client “X”, we configured autoscaling with value CPU percentage > 80% and made the system up and running in
production. Even the performance/Load Test was done and successful.
But after 6 month, during a heavy peak load, system didn’t autoscaling and got crashed. During RCA identification, we found
that it was a DISK IO full issue, for which we have not monitored the systems and no alert/autoscaling setup on same.
We modified the HDD to SSD on the server for fixing the issues and also enabled monitoring and autoscaling for the DISK
environment.

61
Video
Monitoring and SLI
SLI/SLO and reliability Deep Dive
https://www.youtube.com/watch?v=dplGoewF4DA

62
62
High Availability and Capacity Planning

63
High Availability
A Must know
Serving millions of request by a single server is not possible, even it is a supercomputer. Hence, we need Horizontal Scaling
(adding more servers to handle the requests).
Traffic load balancing is the solution to heavy traffic management, which is distributing traffic across multiple network links,
datacenters, and machines in an "optimal" fashion.
Multiple factors, which affects HA:
• The hierarchical level at which we evaluate the problem (global versus local)
• The technical level at which we evaluate the problem (hardware versus software)
• The nature of the traffic we’re dealing with
Nature of requests and handling techniques plays a useful role here.

64
High Availability techniques
A Must know
Techniques to handle the High availability/DR:
• Clustering
• Load Balancing with VIP
• Microservice architectures ( with Containerization Approach)
• Passive DR Sites
• Load Balancing Using DNS
• Content Delivery Networks (for low latency)

65
65
Thinknyx Technologies
High Availability & Burst handling
Application
Server
Cloud Premises
Application
Server
Autoscaling Group
Application
Server
Way to SRE

66
66
LB for High Availability
Way to SRE

67
67
HA in AWS
Way to SRE

68
68
Disaster Recovery
X – Cloud
DC 1 Operations On-Prem, Cloud as DR
Application is hosted in self-managedDatacenter and
Backup is hosted on Cloud
Operations in Cloud, Third-party DR
Application is hosted in Cloud Datacenter and Backup is
hosted on Third party cloud service provider or may be
on third party backup service provider.
Operations in Cloud, Cloud as DR
Application is hosted in Cloud Datacenter and Backup is
hosted in Second Region/datacenter in same cloud.
Y – Cloud
DC 1
APP
Backup
X – Cloud
DC 2
On-Prem
APP Backup
Backup
Way to SRE

69
69
Business Continuity
1
3
5
2
4
Consider all natural
disasters and their
range, before finalizing
a DR location.
Physical Location
Never put al your eggs
in single bucket.
Logical Location
Declarations of Role &
Responsibilities,
Emergency process,
Backup ready
Who’ll do what
Test BC/DR at least
annually.
DR Drill
Define application criticality and failover
priorities in advance. Health and Human safety
should be primary concern.
Priorities
Way to SRE

70
Exercise
Monitoring and SLI
What do you monitor now and what all reliability aspects you considered?
Performance Monitoring in place?
DR Monitoring /Activation in place?
What we can monitor, where and how?
Risk Factors

71
Q & A
Monitoring and SLI
Q & A
Did you consider Monitoring importance in HA/DR?

72
72
Module 5:
SRE Tools & Automation

73
Automation
Way to SRE
Automation is the key for any organization to thrill.
For SRE, automation is a force multiplier, not a panacea. Of course, just multiplying force does not naturally change the
accuracy of where that force is applied: doing automation thoughtlessly can create as many problems as it solves.
Consistency A Platform
Faster Repairs Faster Actions
Time Saving Ease and Effectiveness

74
Automation Focus
Way to SRE
SRE has a number of philosophies and products in the domain of automation, some of which look more like generic rollout
tools without particularly detailed modeling of higher-level entities, and some of which look more like languages for describing
service deployment (and so on) at a very abstract level.
Some use cases:
• User account creation
• Cluster turnup and turndown for services
• Software or hardware installation preparation and decommissioning
• Rollouts of new software versions
Count is endless, its just identifying the priority and keep automating tasks one by one.
End Goal is to create Autonomous system, which runs and manages on its own. For e.g. A system should not just trigger alerts
and try to make the services up on same system, it should do the failover on its own to another better available system, if
services are not coming up on same server.
Automationsystem must be secure and reliable. Automation works at scale, so destruction will also be at scale, if something
goes wrong.

75
Automation Hierarchy
Way to SRE
Now a days, tools are available in market to automate majorities of tasks and events what we want to manage; yet there can
be few things which needs customized automation. For same we can follow custom paths. For example a database failover
automation evaluation path for Autonomous environment:
1) No automation
Database master is failed over manually between locations.
2) Externally maintained system-specific automation
An SRE has a failover script in his or her home directory.
3) Externally maintained generic automation
The SRE adds database support to a "generic failover" script that everyone uses.
4) Internally maintained system-specific automation
The database ships with its own failover script.
5) Systems that don’t need any automation (autonomoussystem)
The database notices problems, and automatically fails over without human intervention.
SRE hates manual operations, so they obviously try to create systems that don’t require them. However, sometimes manual
operations are unavoidable (DR activation, Production push etc).

76
Secure Automation
Way to SRE
Unsecure Automation can be dangerous too. Learn from an example of “CODESPACES” and other clients where AWS AK/SK
was leaked, and disaster happened.
Think about keeping Username and password in your container Images? Who all in your organization do have access to these
image?
Keeping API keys in Code and code on Github/bitbucket?
Zero touch automation is the final goal for an SRE. We have to consider Security also into it at every layer, as multiple tools get
involved.
We have to replace our use of sshd with an authenticated, ACL-driven, RPC-based Local Admin Daemon, also known as
Admin Servers, which had permissions to perform those local changes. As a result, no one could install or modify a server
without an audit trail.
CIA terms are important to implement in automation too.

77
Automation Tools
Way to SRE
Though there are not defined set of tools for any SRE, but it is always better to have universal tools in the basket. Few
Categories and tools which are well known to the market for their well-known results in the area, are as below:
Version Control System: TFVC, Git (Gitlab, Github, Bitbucket, Azure DevOps)
Pipelining for CICD: Jenkins, Azure DevOps, TeamCity, bamboo
Automated Deployment: Octopus Deploy, UrbanDeploy
Configuration Management: Ansible, Chef, puppet, Saltstack
Container and Orchestration: Docker (Kubernetes, Docker Swarm, Openshift)
Automation oriented languages: Python, Java
DevOps Infrastructure as Code: Terraform, Cloud Formation, Azure Templates
Continuous Testing: Jmeter, Sonarqube, Selenium

78
Video
SRE tools and Automation
SREcon19 Asia/Pacific - Ironies of Automation (Microsoft)
https://www.youtube.com/watch?v=U3ubcoNzx9k

79
Case Study
Amazon has done the automation for Leasing servers on Rent and metering same for usage and gradually over a period of
time, it became Cloud with 100s of service provisioned/Released via self-service portal.
50 million changes into AWS Cloud happened in 2016 in 1 year, which is 1 change in production per second.
Distributed system with APIs and Queues are the best way to scale with automation.

80
Exercise
Automation “Greatest Hits” – Uber, Airbnb, Ola, Olx, AWS …
How much automation you have and what can be automated?
Its as simple as:
“Anything that you do more than twice has to be automated.”
-Adam Stone, CEO, D-Tools

81
Q & A
Q & A
Questionnaire

82
82
Module 6:
Anti-Fragility & Learning from Failure

83
Anti-fragility: Learning from Failure
Way to SRE
“The cost of failure is education.” Devin Carraway
Anti-fragility is all about understanding disorder and using it to your advantage. It is a property of systems in which they
increase in capability to thrive as a result of stressors, shocks, volatility, noise, mistakes, faults, attacks, or failures.
Postmortems are an essential tool for SRE (to make the system resilient and reliable).
When an incident occurs, we fix the underlying issue, and services return to their normal operating conditions. Unless we
have some formalized process of learning from these incidents in place, they may reoccur.
A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and
the follow-up actions to prevent the incident from recurring.
The primary goals of writing a postmortem are to ensure that the incident is documented, that all contributing root cause(s)
are well understood, and, especially, that effective preventive actions are put in place to reduce the likelihood and/or impact
of recurrence.

84
Anti-fragility: Shifting the organizational balance
Way to SRE
The antifragile loves randomness and uncertainty.
Anti-fragility is a concept which encompasses the idea that things need chaos and disorder in order to thrive and flourish.
Whatever doesn’t kill us makes us stronger, pushing the notion that we shouldn’t construct our lives or our plans against
randomness and misfortune, rather, we should adopt anti-fragility as a means of maneuvering through disorder.
Similarly, we should plan our system to withstand failures (planned/unplanned). We can even plan/plot the failures in our
system to understand the withstanding capability/Anti-fragility of our system.
We can have unplanned downtimes and activities, which can simulate failure to learn from it and make our system more
robust. For example – pulling a network cable of server or shutting down the UPS to understand the impact, etc… But we
must first understand the error budget and failure cost before we plan for such failure activities. Such activities definitely
add learning and robustness to our system, but we must keep a balance between error budget and enhancements.

85
Postmortem Culture : Learning from Failure
Way to SRE
The postmortem process does present an inherent cost in terms of time or effort, so we can be deliberate in choosing when
to write one. Teams have some internal flexibility, but common postmortem triggers include:
• User-visible downtime or degradation beyond a certain threshold
• Data loss of any kind
• On-call engineer intervention (release rollback, rerouting of traffic, etc.)
• A resolution time above some threshold
• A monitoring failure (which usually implies manual incident discovery)
• Stakeholder request a postmortem for an event

86
Blameless Postmortems
Way to SRE
“Blameless postmortems” is a Principle of SRE culture.
For a postmortem to be truly blameless, it must focus on identifying the contributing causes of the incident without
indicting any individual or team for bad or inappropriate behavior.
A blamelessly written postmortem assumes that everyone involved in an incident had good intentions and did the right
thing with the information they had.
If a culture of finger pointing and shaming individuals or teams for doing the "wrong" thing prevails, people will not bring
issues to light for fear of punishment.
When postmortems shift from allocating blame to investigating the systematic reasons why an individual or team had
incomplete or incorrect information, effective prevention plans can be put in place.
You can’t "fix" people, but you can fix systems and processes to better support people making the right choices when
designing and maintaining complex systems.

87
Best Practice
Way to SRE
Avoid Blame and Keep It Constructive.
Collaborate and Share Knowledge
No Postmortem Left Unreviewed
Introduce a Postmortem Culture
Visibly Reward People for Doing the Right Thing
Ask for Feedback on Postmortem Effectiveness
Continuous improvement
Postmortem should have clearly defined ownership, priority, preventive actions and Action taken

88
Case Study
Anti-Fragility
Netflix Simian Army
https://netflixtechblog.com/the-netflix-simian-army-16e57fbab116
https://github.com/netflix/chaosmonkey

89
Failures
Anti-Fragility
Do Failure is really bad – for organization and individuals ?
Consider Elon Musk and Colonel Sanders ☺

90
Exercise
Anti-Fragility
Share some example of problem ticket within your team where you were involved and had lot of incidents be
of a single root cause.
Share the Incidents, Business Impact, Root Cause, Course of Actions done to resolve same and whose mist
was this, if it was a configuration issue.

91
Q & A
Anti-Fragility
Q & A
Questionnaire

92
92
Module 7:
Organizational Impact of SRE

93
Organizations Embracing SRE
Way to SRE
Availability
Reliability
Capacity Planning
Happy Customers
Cost Effectiveness
due to less failure
Velocity
Continuous
Improvements

94
Typical ORG Chart
Way to SRE
Specialized
Reliability
Engineers
Specialized
Reliability
Engineers
Specialized
Reliability
Engineers
Site Reliability
Engineers
(TL)
Site Reliability
Engineers
(TL)
Manager
SRE Team
DB
Admins
DB
Manager
DB
Admins
Systems
Manager
OS
Admins
OS
Admins
Operations
Dev
Dev
Manager
Dev
Q&A
Manager
QA
Admin
QA
Admin
Dev/Prod Team

95
SRE Responsibilities
Way to SRE
Tasks SRE Team
Architecture Design Approvals and Consultaning RC
Instrumentation, Metrics, and Monitoring CI
Maintaining SLI CI
SLO /SLI track and management CR
Handling Incidents CI
Repeated Incidents and Problem Management RA
Capacity Planning CR
Change Management CI
Critical/Large Scale Changes CR
Performance: availability, latency, and efficiency R
Automation RA
Innovation RA
Release Management C
Release Repeated failures R
Test Management C
Test Repeated Failures R
Standardization of Tools/Softwares/Process/Technologies RA
Supporting Presales and Sales CR
Training other Team Members R

96
SRE Engagement and Adoption
Way to SRE
SRE seeks production responsibility for important services for which it can make concrete contributions to reliability. SRE is
concerned with several aspects of a service, which are collectively referred to as production. These aspects include the
following:
• System architecture and interservice dependencies
• Instrumentation, metrics, and monitoring
• Emergency response
• Capacity planning
• Change management
• Performance: availability, latency, and efficiency
When SREs engage with a service, they aim to improve it along all of these axes, which makes managing production for the
service easier.

97
SRE & Scale
Way to SRE
The bigger the Operations/Systems, the more autonomous systems it should be.
If 10 engineers handles 100 hundred servers, we shouldn’t need 100 engineers to handle 1000 servers.
SRE is all about automation, improvements and reliability in the system.
Bigger environment means more effectiveness from SRE.
As the need for manual tasks reduces over time due to automation, yet enhancement in autonomous system and further
improvement in same is a continuous process.

98
Testing
Way to SRE
“If you haven't tried it, assume it's broken.”
One key responsibility of Site Reliability Engineers is to quantify confidence in the systems they maintain. SREs perform this
task by adapting classical software testing techniques to systems at scale.
Testing is the mechanism we use to demonstrate specific areas of equivalence when changes occur. Each test that passes
both before and after a change reduces the uncertainty for which the analysis needs to allow. Thorough testing helps us
predict the future reliability of a given site with enough detail to be practically useful.
Passing a test or a series of tests doesn’t necessarily prove reliability. However, tests that are failing generally prove the
absence of reliability.
How failures are being measured:
Mean Time to Repair (MTTR) measures how long it takes the operations team to fix the bug, either through a rollback or
another action.
Mean Time Between Failures (MTBF) measures time - for how long the service worked well post a failure condition.

99
99
SW Testing Classification
Manual
Automated
Testing Type
Static
Dynamic
Testing Methods
Unit Testing
Integration Testing
System Testing
Acceptance testing
Testing Levels
Black Box
White Box
Grey Box
Testing Approach

100
Managing Incidents
Way to SRE
Effective incident management is key to limiting the disruption caused by an incident and restoring normal business
operations as quickly as possible.
As SRE you are also supposed to be on-call (limited efforts again) and handle the incidents.
When on-call, an engineer is available to perform operations on production systems within minutes, according to the paging
response times agreed to by the team and the business system owners. Typical values are 5 minutes for user-facing or
otherwise highly time-critical services, and 30 minutes for less time-sensitive systems.
Google strongly believe that invest at least 50% of SRE time into engineering: of the remainder, no more than 25% can be
spent on-call, leaving up to another 25% on other types of operational, nonproject work.
The most important on-call resources are:
• Clear escalation paths
• Well-defined incident-management procedures
• A blameless postmortem culture

101
Emergency Response
Way to SRE
“Things break; that’s life.”
How employees responds to an emergency, show the process and long-term health of the organization. Organization long-
run depends on this one factor very well in IT industry.
What to Do When Systems Break
First of all, don’t panic!
If you feel overwhelmed, pull in more people.
Follow the Incident response process.
Take a deep breath and try to understand the situation, failure cause or relate sources in case of multiple failures.
Test-Induced Emergency
Change-Induced Emergency
Process-Induced Emergency
Some important pointers:
• Keep a History of Outages
• Ask the Big, Even Improbable, Questions: What If…?
• Encourage Proactive Testing

102
Videos
Organizational Impact
A history of SRE at Uber:
https://www.youtube.com/watch?v=qJnS-EfIIIE

103
Case Study - OBS
Orange Business Services – The Flexible Engine

104
Exercise
Why do you want to adopt SRE? Who in your organization currently provides SRE?
Your organizational plan for SRE?

105
Q & A
Q & A
Questionnaire

106
106
Module 8:
SRE, Other Frameworks,
& The Future

107
Transforming Culture
Way to SRE
Site Reliability Engineering (SRE) proclaims many advantages for distributed systems. It improves infrastructure
automation, increases reliability, and transforms incident management.
Instead of taking individual at centre, we have a specialized team in centre which is a centre of collaboration and
communication in the organization.
Embracing Risk
Learning From Failure
Better collaboration and communication
Automation in centre – which benefits complete organization
Standardization of tools / technologies / process
Centralize documentation
Consultation and Trainings

108
SRE with Other frameworks
Way to SRE
SRE works well with all major existing process and culture concepts, like:
DevOps
Agile
Scrum
Lean
ITIL
PMP
Its while many of above are majorly conceptual stuff, SRE is having those concepts implemented on ground with practical
work.
It’s a path to create a stress-free autonoums reliable environment with tremendous velocity.

109
SRE Evolution
Way to SRE
Google coined the term “site reliability engineer” in 2003, but it certainly has existed for decades more in different forms —
disaster recovery and production testers.
Ways the SRE Approach is Evolving:
1. Increased Adoption
2. Larger, Diversified SRE Departments
3. New Testing Tactics emerge – e.g., Chaos Monkey
4. Businesses Rely on SREs to Mitigate Risk
Currently SRE approach is widely being adopted by organizations to achieve high uptime and stability for the application, as
even 1 minute of downtime costs millions of $ to many MNCs.

110
Videos
SRE and other frameworks
A Look at ITIL4 & SRE
https://www.youtube.com/watch?v=vFyPXIsUEhE

111
Case Study – VictorOps
Victor Ops

112
Exercise
Where do you see SRE future heading?
Sketch board your understanding of SRE and Requirements for the job role.

113
Q & A
Q & A
Questionnaire

Site-Reliability-Engineering-v2[6241].pdf

Recommended

Recommended

More Related Content

Similar to Site-Reliability-Engineering-v2[6241].pdf

Similar to Site-Reliability-Engineering-v2[6241].pdf (20)

Recently uploaded

Recently uploaded (20)

Site-Reliability-Engineering-v2[6241].pdf