Site Reliability Engineering (SRE)
Introduction
• Name
• Total Experience
• Background – Development / Infrastructure / Management
• Experience on DevOps Tools, Cloud
• Your expectations from this training
Few Pointers
• SRE is a concept to implement rather than a tool, so this training leans toward the theoretical aspects.
• Many of the pointers will be familiar; what matters is using them the right way.
• There are a few best practices we already know but never managed to implement. Now is the right time to implement them. It’s a do-or-die situation now.
• Try to correlate the topics with your domain and prepare notes with pointers.
• Implementation discussions for your existing teams and domains are welcome.
• Learn by asking, “What behavior will I change?” Learning isn’t collecting information. Learning is changing behavior.
Module 1: Principles & Practices
What is SRE?
Digital Transformation
A Way Forward
Concept: Integration of digital technology into all areas of a business, fundamentally changing how you operate and deliver value to customers.
• Security: Security, legal, and compliance must be at the center of all designs.
• Tools: Use the technologies that suit you best.
• Process: Cultural change.
• People: The most important part, and the barrier too.
Security, People, Process, and Tools together make a Successful Digital Transformation.
Emerging Technologies
Technologies that are helping industries with digital transformation: Data, ML/AI, Cloud, Security, Automation, Blockchain, and IoT.
• Security toolsets: Remain secure to avoid financial, legal, and compliance issues.
• Automation: Automation with DevOps toolsets (CI/CD, SRE, CM, containers, etc.).
• Cloud: Keep infrastructure off-premises with a third party and move costs to Opex.
• ML/AI: Better decision making (well predicted, informed) and autonomy.
• IoT: Connect and integrate whatever can be automated.
• Blockchain: For secure transactions over distributed public networks.
DevOps
SDLC – Life Cycle
• A systems development life cycle is composed of several clearly defined and distinct work phases, which systems engineers and systems developers use to plan for, design, build, test, and deliver information systems.
SDLC Model: Requirement Analysis, Design, Implementation, Testing, Evaluation
Waterfall Model
1. Determine the requirements
2. Complete the design
3. Do the coding and testing (unit tests)
4. Perform other tests (functional tests, non-functional tests, performance testing, bug fixes, etc.)
5. Finally, deploy and maintain
- Long release cycle
- A lot of WIP
- Functional silos
- Incredibly rigid for development
Agile
- Shorter release cycle
- Small batch sizes (MVP)
- Cross-functional teams
- Incredibly agile
Lean Development
- Suddenly ops became the bottleneck (more releases, fewer people), so WIP piled up again.
DevOps
Teams involved: Software Development; Infrastructure, Operations and Support; Build & Release and Testing teams.
- Break the silos
- Communication (not only by email)
- Collaboration
- Trust
- Involvement in the early development stages
- Automation is the key
- Continuous Integration
- Continuous Deployment in the lower environments
- Fail fast and fail often
[Figure: DevOps pipeline with the stages Development, QA Testing, Implementation & Release, and Infra Management; "APP" and "Waiting" labels appear between stages, alongside a playbook snippet ("Playbook for webserver setup", hosts: all, a "package installation" task using yum with state=present).]
DevOps
• DevOps is a loose set of practices, guidelines, and culture designed to break down silos in IT development,
operations, networking, and security.
• In a DevOps approach, you improve something (often by automating it), measure the results, and share
those results with colleagues so the whole organization can improve.
• DevOps, Agile, and a variety of other business and software reengineering techniques are all examples of a
general worldview on how best to do business in the modern world. None of the elements in the DevOps
philosophy are easily separable from each other, and this is essentially by design.
• DevOps is a broad set of principles about whole-lifecycle collaboration between operations and product
development.
DevOps Principles
No More Silos
Extreme siloization of knowledge, incentives for purely local optimization, and lack of collaboration have in many cases been actively bad for business.
Accidents Are Normal
Accidents are not just a result of the isolated actions of an individual, but rather result from missing safeguards for when things inevitably go wrong, e.g. a misconfigured system, broken monitoring, or wrong actions taken under pressure. Rooting out the mistake makers and punishing them creates a mess: incentives to confuse issues, hide the truth, and blame others, all of which are ultimately unprofitable distractions.
Change Should Be Gradual
Change is best when it is small and frequent. Change is risky, true, but the correct response is to split up your changes into smaller subcomponents where possible. Then you build a steady CI/CD pipeline of low-risk change out of regular output from product, design, and infrastructure changes, with automated testing and improvements.
Tooling and Culture Are Interrelated
A good culture can work around broken tooling, but the opposite rarely holds true. Promoters of DevOps strongly emphasize organizational culture, rather than tooling, as the key to success in adopting a new way of working.
Measurement Is Crucial
Measure your outcomes from time to time. They can take the form of number of incidents, time to market, MTTR, SLA, etc.
Site Reliability Engineering
Site Reliability Engineering (SRE) is a term (and associated job role) coined by Ben Treynor Sloss, a VP of engineering at
Google.
SRE is a job role, a set of practices that are known to work in the field, and some beliefs that animate those practices.
SRE is built around the reliability of the system.
In general, an SRE has particular expertise around the availability, latency, performance, efficiency, change management,
monitoring, emergency response, and capacity planning of the service(s) they are looking after.
As Google puts it, "class SRE implements interface DevOps": SRE is one concrete implementation of the DevOps philosophy.
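The "SRE implements interface DevOps" line can be read quite literally. The sketch below is only an illustration of that reading: the method names come from the DevOps principles listed earlier in this deck, and the pairing of each one with an SRE practice is an assumption made for this example, not an official mapping.

```python
from abc import ABC, abstractmethod


class DevOps(ABC):
    """The 'interface': broad principles with no prescribed implementation."""

    @abstractmethod
    def no_more_silos(self) -> str: ...

    @abstractmethod
    def accidents_are_normal(self) -> str: ...

    @abstractmethod
    def change_should_be_gradual(self) -> str: ...

    @abstractmethod
    def tooling_and_culture(self) -> str: ...

    @abstractmethod
    def measurement_is_crucial(self) -> str: ...


class SRE(DevOps):
    """One concrete implementation; the pairings below are illustrative."""

    def no_more_silos(self) -> str:
        return "Share ownership with developers"

    def accidents_are_normal(self) -> str:
        return "Support blameless postmortems"

    def change_should_be_gradual(self) -> str:
        return "Move fast by reducing the cost of failure"

    def tooling_and_culture(self) -> str:
        return "Use the same tooling, regardless of function"

    def measurement_is_crucial(self) -> str:
        return "Manage by service level objectives"


if __name__ == "__main__":
    sre = SRE()
    for principle in (
        "no_more_silos",
        "accidents_are_normal",
        "change_should_be_gradual",
        "tooling_and_culture",
        "measurement_is_crucial",
    ):
        print(f"{principle}: {getattr(sre, principle)()}")
```

The abstract class cannot be instantiated on its own, which mirrors the point that DevOps states principles while SRE supplies specific practices.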
SRE means hiring software engineers to run products and to create systems that accomplish the work that would otherwise be performed, often manually, by sysadmins.
Common to all SREs is the belief in and aptitude for developing software systems to solve complex problems.
Way to SRE
SRE is a team of people who (a) will quickly become bored by performing tasks by hand, and (b) have the skill set necessary
to write software to replace their previously manual work, even when the solution is complicated.
SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software
expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, design
and implement automation with software to replace human labor.
By design, it is crucial that SRE teams are focused on engineering. Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload.
Google places a 50% cap on the aggregate "ops" work for all SREs—tickets, on-call, manual tasks, etc. This cap ensures that
the SRE team has enough time in their schedule to make the service stable and operable via engineering tasks of
automation.
In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management,
monitoring, emergency response, and capacity planning of their service(s).
SRE Principles
Operations Is a Software Problem
SRE should therefore use software engineering approaches to solve that problem.
Manage by Service Level Objectives
Define SLOs and work toward them.
Work to Minimize Toil
Toil should be reduced to a minimum, with automation applied to the extent possible.
Automate What You Can
Automation is the key. The real work in this area is determining what to automate, under what conditions, and how to automate it. Per Google, at most 50% of the work can be toil; the remaining 50% of the time should go to SRE engineering tasks or something new.
Move Fast by Reducing the Cost of Failure
The cost of failure rises with Mean Time to Repair (MTTR), which in turn affects product developer velocity. This follows from the well-known fact that the later in the product lifecycle a problem is discovered, the more expensive it is to fix.
Share Ownership with Developers
Ideally, both product development and SRE teams should have a holistic view of the stack (the frontend, backend, libraries, storage, kernels, and physical machine), and no team should jealously own single components. It turns out that you can get a lot more done if you “blur the lines” and have SREs instrument JavaScript, or product developers qualify kernel configurations.
Use the Same Tooling, Regardless of Function
Using the same qualified tools across the organization makes processes easier to understand and eases adoption of the SRE/DevOps culture.
DevOps vs SRE
• DevOps is a loose, generic set of principles (philosophy and culture), and SRE is an advanced, explicit implementation of them.
• Site Reliability Engineering, like DevOps, should not just be changing titles, but making definitive behavior changes,
focusing on outcomes and obviously reliability.
• Collaboration is front and center for DevOps work. An effective shared ownership model and partner team relationships
are necessary for SRE to function.
• Change management is best pursued as small, continual actions, the majority of which are ideally both automatically
tested and applied. The critical interaction between change and reliability makes this especially important for SRE.
• Measurement is absolutely key to how both DevOps and SRE work. For SRE, SLOs are dominant in determining the
actions taken to improve the service. For DevOps, the act of measurement is often used to understand what the outputs
of a process are, what the duration of feedback loops is, and so on.
• DevOps is relatively silent on how to run operations at a detailed level, while SRE spells out detailed steps for implementation and deployment.
• DevOps is more context-sensitive and works organization wide. SRE, on the other hand, has relatively narrowly defined
responsibilities and its remit is generally service-oriented (and end-user-oriented) rather than whole-business-oriented.
• Ultimately, implementing DevOps or SRE is a holistic act; both hope to make the whole of the team (or unit, or
organization) better, as a function of working together in a highly specific way. For both DevOps and SRE, better velocity
should be the outcome.
SRE Context and Successful Adoption
Narrow, rigid incentives (launch-related or reliability-related) narrow your success.
A system with early SRE engagement (ideally, at design time) typically works better in production after deployment,
regardless of who is responsible for managing the service.
Don’t just allow, but actively encourage, engineers to change code and configuration when required for the product.
Support blameless postmortems. Doing so eliminates incentives to downplay or cover up a problem.
Allow support to move away from products that are irredeemably operationally difficult. The threat of support withdrawal
motivates product development to fix issues both in the run-up to support and once the product is itself supported, saving
everyone time.
Always remember – Good people will quit if they’re tasked with too much operational work and aren’t given the opportunity
to use their engineering skill set.
Consider Reliability Work as a Specialized Role.
Strive for Parity of Esteem: Career and Financial.
Brainstorming
DevOps vs SRE
Company X wants to reduce the time to market for its new software product releases and is facing the following issues:
• Hardware capacity planning is a challenge
• Infrastructure is new, yet hardware failures are frequent
• Lots of bugs are being identified in the products
• Releases fail on production days
• A large number of incident tickets are raised for several days after each new release
Where can DevOps help in this area?
Where can SRE help in this segment?
Understand how SRE heals DevOps failures…
Case Study – French Telecom
DevOps vs SRE
Identifying the DevOps Work and building DevOps Team
Identifying Reliability needs and building SRE Engineering Team
Continuous Enhancement…
Video
DevOps vs SRE
DevOps vs SRE (Google)
https://www.youtube.com/watch?v=uTEL8Ff1Zvk
Exercise
DevOps vs SRE
What do we do all day? Is there a way to automate it?
Is there any way to make the system more reliable?
Have we factored in ROI?
Q & A
DevOps vs SRE
Questionnaire
Module 2: Service Level Objectives (SLOs) & Error Budgets
Important terms
Availability = The fraction of the time that a service is usable, i.e. how little downtime it has. Although 100% availability is impossible, near-100% availability is often readily achievable.
Reliability = The ability to work properly (even if some parts/components have failed).
Durability = The ability to not lose data, or the likelihood that data will be retained over a long period of time. For data storage systems it is as important as availability.
SLA (Promise) = A Service Level Agreement is a commitment between a service provider and a client regarding particular aspects of the service: quality, availability, responsibilities, etc. It is an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs it contains.
SLO (Goal) = Service Level Objective: the objectives (within the SLA, e.g. uptime, response time) that your team must hit to meet the SLA.
SLI (How and What) = Service Level Indicator: the real numbers that measure your compliance against the SLO. Specifically, it is a carefully defined quantitative measure of some aspect of the level of service that is provided, e.g. request latency or error rate.
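To make the availability definition concrete, here is a minimal sketch that converts an availability target into the downtime it allows, and computes measured availability from uptime. The function names and the 30-day default period are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float = 30) -> float:
    """Minutes of downtime a target allows over the period (30 days assumed by default)."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - availability_pct / 100)


def measured_availability_pct(uptime_minutes: float, total_minutes: float) -> float:
    """Availability as the fraction of time the service was usable, in percent."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return 100 * uptime_minutes / total_minutes


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows about {minutes:.1f} minutes of downtime")
```

Each extra "nine" cuts the allowed downtime by a factor of ten, which is why the target should be chosen deliberately rather than defaulting to the highest number.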
[Figure: Important terms]
Service Level Objectives
It’s impossible to manage a service correctly, let alone well, without understanding which behaviors really matter for that
service and how to measure and evaluate those behaviors.
We use intuition, experience, and an understanding of what users want to define service level indicators (SLIs), objectives
(SLOs), and agreements (SLAs). These measurements describe basic properties of metrics that matter, what values we
want those metrics to have, and how we’ll react if we can’t provide the expected service. Ultimately, choosing appropriate
metrics helps to drive the right action if something goes wrong, and also gives an SRE team confidence that a service is
healthy.
An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI. A natural
structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound.
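The "SLI ≤ target" and "lower bound ≤ SLI ≤ upper bound" structures can be sketched as a small data type. The class name, field names, and sample values below are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SLO:
    """An objective expressed as optional lower and upper bounds on an SLI."""
    name: str
    lower: Optional[float] = None  # None means no lower bound
    upper: Optional[float] = None  # None means no upper bound

    def is_met(self, sli: float) -> bool:
        """True when lower <= sli <= upper, for whichever bounds are set."""
        if self.lower is not None and sli < self.lower:
            return False
        if self.upper is not None and sli > self.upper:
            return False
        return True


if __name__ == "__main__":
    # "SLI <= target" form: a latency SLI with an upper bound only.
    latency = SLO(name="average request latency (ms)", upper=100.0)
    # "lower <= SLI <= upper" form: a utilization SLI kept inside a band.
    utilization = SLO(name="cache utilization", lower=0.2, upper=0.8)
    print(latency.is_met(80.0), latency.is_met(120.0))
    print(utilization.is_met(0.5), utilization.is_met(0.9))
```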
Choosing an appropriate SLO is usually complex (e.g. for QPS or network bandwidth), but sometimes it is straightforward (e.g. setting a low-latency target).
Choosing and publishing SLOs to users sets expectations about how a service will perform. Without an explicit SLO, users
often develop their own beliefs about desired performance, which may be unrelated to the beliefs held by the people
designing and operating the service.
SLI – Indicators in Practice
SRE doesn’t typically get involved in constructing SLAs; it does, however, get involved in helping to avoid triggering the consequences of missed SLOs.
What Do You and Your Users Care About?
• User-facing serving systems, such as the Shakespeare search frontends, generally care about availability, latency, and throughput. In other words: Could we respond to the request? How long did it take to respond? How many requests could be handled?
• Storage systems often emphasize latency, throughput, IOPS, availability, and durability.
• Big data systems, such as data processing pipelines, tend to care about throughput and end-to-end latency (How much data is being processed? How long does it take to process?).
• All systems should care about correctness: was the right answer returned, the right data retrieved, the right analysis done?
Collecting Indicators
Aggregate metrics to make them easier to use (averages, instantaneous values)
Standardize indicators (e.g. averaged over a fixed period, such as average packets per minute)
Collect indicators at the server as well as at the client end
Objectives in Practice (SLO)
Define objectives (e.g. 99% of Get RPC calls, averaged over 1 minute, will complete in less than 100 ms)
Choose realistic targets that are simple and kept to the minimum required, and always keep scope to refine them.
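The example objective (99% of Get RPC calls in a one-minute window completing in less than 100 ms) can be evaluated with a few lines of code. This is a sketch under stated assumptions: the function names are hypothetical, the input is the list of latencies observed in one window, and an empty window is treated as compliant.

```python
from typing import Sequence


def fraction_within(latencies_ms: Sequence[float], threshold_ms: float = 100.0) -> float:
    """Fraction of calls in the window that finished in under threshold_ms."""
    if not latencies_ms:
        return 1.0  # assumption: no traffic in the window counts as compliant
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return fast / len(latencies_ms)


def meets_latency_slo(
    latencies_ms: Sequence[float],
    target: float = 0.99,
    threshold_ms: float = 100.0,
) -> bool:
    """True when at least `target` of the calls beat the latency threshold."""
    return fraction_within(latencies_ms, threshold_ms) >= target


if __name__ == "__main__":
    # One minute of hypothetical Get RPC latencies: 995 fast calls, 5 slow ones.
    window = [40.0] * 995 + [250.0] * 5
    print(fraction_within(window), meets_latency_slo(window))
```

Counting the share of requests under a threshold, rather than averaging latency, keeps a handful of very slow requests from hiding behind a good mean.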
Control Measures
SLIs and SLOs are crucial elements in the control loops used to manage systems:
• Monitor and measure the system’s SLIs.
• Compare the SLIs to the SLOs, and decide whether or not action is needed.
• If action is needed, figure out what needs to happen in order to meet the target.
• Take that action.
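The four steps above form a control loop, and one pass of it can be sketched as follows. Everything here is illustrative: `read_slis` and `remediate` are hypothetical callbacks, and the sketch assumes every SLI is a "higher is better" ratio compared against a minimum target.

```python
from typing import Callable, Dict, List


def control_loop_step(
    read_slis: Callable[[], Dict[str, float]],
    slo_targets: Dict[str, float],
    remediate: Callable[[str, float, float], None],
) -> List[str]:
    """One pass: measure SLIs, compare to SLOs, and act on any shortfall."""
    slis = read_slis()  # 1. monitor and measure
    breached = []
    for name, target in slo_targets.items():
        value = slis.get(name)
        if value is None:
            continue  # no measurement available for this SLI in this pass
        if value < target:  # 2. compare and decide
            remediate(name, value, target)  # 3 and 4. work out and take the action
            breached.append(name)
    return breached


if __name__ == "__main__":
    def fake_read() -> Dict[str, float]:
        return {"availability": 0.998, "fast_requests": 0.995}

    def fake_remediate(name: str, value: float, target: float) -> None:
        print(f"{name} is at {value}, below target {target}: open a reliability task")

    targets = {"availability": 0.999, "fast_requests": 0.99}
    print(control_loop_step(fake_read, targets, fake_remediate))
```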
Always remember:
• Publishing SLOs sets expectations for system behavior
• Keep margins - Using a tighter internal SLO than the SLO advertised to users gives you room to respond to
chronic problems before they become visible externally.
• If your service’s actual performance is much better than its stated SLO, users will come to rely on its current
performance. You can avoid over-dependence by deliberately taking the system offline occasionally (Google’s
Chubby service introduced planned outages in response to being overly available).
Understanding how well a system is meeting its expectations helps decide whether to invest in making the system faster,
more available, and more resilient. Alternatively, if the service is doing fine, perhaps staff time should be spent on other
priorities, such as paying off technical debt, adding new features, or introducing other products.
[Figure: Monitoring in place]
Error Budget and policies
The SLO is a target percentage, and the error budget is 100% minus the SLO. For example, if you have a 99.9% success ratio SLO, then a service that receives 3 million requests over a four-week period has a budget of 3,000 (0.1%) errors over that period. If a single outage is responsible for 1,500 errors, that outage costs 50% of the error budget.
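The arithmetic in this example can be checked with a short sketch. The function names are hypothetical; the numbers are the ones from the slide (a 99.9% SLO, 3 million requests, an outage causing 1,500 errors).

```python
def error_budget(total_requests: int, slo_pct: float) -> float:
    """Number of failed requests the SLO allows: (100% - SLO) of all requests."""
    return total_requests * (100 - slo_pct) / 100


def budget_consumed(errors: int, total_requests: int, slo_pct: float) -> float:
    """Fraction of the error budget used up by the observed errors."""
    budget = error_budget(total_requests, slo_pct)
    if budget == 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return errors / budget


if __name__ == "__main__":
    budget = error_budget(3_000_000, 99.9)
    used = budget_consumed(1_500, 3_000_000, 99.9)
    # Values are formatted because floating-point results are approximate.
    print(f"budget: about {budget:.0f} errors; one outage used {used:.0%} of it")
```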
Once you have an SLO, you can use the SLO to derive an error budget. In order to use this error budget, you need a policy
outlining what to do when your service runs out of budget.
When we talk about enforcing an error budget policy, we mean that once you exhaust your error budget (or come close to exhausting it), you should do something in order to restore stability to your system.
Common owners and actions might include:
• The development team gives top priority to bugs relating to reliability issues over the past four weeks.
• The development team focuses exclusively on reliability issues until the system is within SLO. This responsibility comes
with high-level approval to push back on external feature requests and mandates.
• To reduce the risk of more outages, a production freeze halts certain changes to the system until there is sufficient error
budget to resume changes.
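An error budget policy like the one outlined above can be written down as a simple decision function. This is only a sketch: the action names and the 75% early-warning threshold are assumptions for illustration, and a real policy is agreed between the teams involved.

```python
from enum import Enum


class Action(Enum):
    SHIP_NORMALLY = "continue normal feature releases"
    PRIORITIZE_RELIABILITY = "give top priority to reliability bugs"
    FREEZE_CHANGES = "freeze production changes until budget is restored"


def policy_action(budget_consumed: float, warn_at: float = 0.75) -> Action:
    """Map the consumed fraction of the error budget to a policy action.

    warn_at is a hypothetical early-warning threshold, not a standard value.
    """
    if budget_consumed >= 1.0:
        return Action.FREEZE_CHANGES
    if budget_consumed >= warn_at:
        return Action.PRIORITIZE_RELIABILITY
    return Action.SHIP_NORMALLY


if __name__ == "__main__":
    for consumed in (0.5, 0.8, 1.2):
        print(f"{consumed:.0%} consumed: {policy_action(consumed).value}")
```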
Case Study – Genpact
SLO & Error Budget
Pain Areas…
Penalties due to SLA Miss…
Setting SLO for Application uptime and performance.
Tracking SLIs
Video
SLO & Error Budget
SLA, SLO and SLI (Google)
https://www.youtube.com/watch?v=tEylFyxbDLE
Risks and Error Budgets (Google)
https://www.youtube.com/watch?v=y2ILKr8kCJU
Exercise
SLO & Error Budget
Define 3 SLIs/SLOs for your current application contract.
Have we defined the right SLOs, and are we monitoring the right SLIs?
Do we work only with availability monitoring, or with performance monitoring too?
Q & A
SLO & Error Budget
Questionnaire
Module 3: Reducing Toil
Toil
“If a human operator needs to touch your system during normal operations, you have a bug.”
Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of
enduring value, and that scales linearly as a service grows.
Google’s SRE organization has an advertised goal of keeping operational work (i.e., toil) below 50% of each SRE’s time.
At least 50% of each SRE’s time should be spent on engineering project work that will either reduce future toil or add service
features. Feature development typically focuses on improving reliability, performance, or utilization, which often reduces toil
as a second-order effect.
It is equally important to measure toil and the time spent on it over a given period, and to keep steering the team toward engineering tasks.
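Measuring toil over a period can be as simple as tagging time entries and comparing the toil share against the 50% cap. The sketch below assumes a hypothetical log of (category, hours) pairs; the category names are illustrative.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def toil_fraction(entries: Iterable[Tuple[str, float]]) -> float:
    """Share of all logged hours that were tagged as 'toil'."""
    totals = defaultdict(float)
    for category, hours in entries:
        totals[category] += hours
    total = sum(totals.values())
    return totals["toil"] / total if total else 0.0


def exceeds_toil_cap(entries: Iterable[Tuple[str, float]], cap: float = 0.5) -> bool:
    """True when toil takes more than the cap (50% by default)."""
    return toil_fraction(entries) > cap


if __name__ == "__main__":
    # A hypothetical week for one engineer, in hours.
    week = [("toil", 16), ("engineering", 12), ("toil", 8), ("overhead", 4)]
    print(f"toil share: {toil_fraction(week):.0%}, over cap: {exceeds_toil_cap(week)}")
```

Tracking this per person and per quarter makes it visible when a team is drifting away from engineering work.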
Engineering work is novel and essentially requires human judgment. It produces a permanent improvement in your service
and is guided by a strategy. It is frequently creative and innovative, taking a design-driven approach to solving a problem—the
more generalized, the better.
Toil
A Must Know
Manual
This includes work such as manually running a script that automates some task. Running a script may be quicker than
manually executing each step in the script, but the hands-on time a human spends running that script (not the elapsed time) is
still toil time.
Repetitive
If you’re performing a task for the first time ever, or even the second time, this work is not toil. Toil is work you do over and
over. If you’re solving a novel problem or inventing a new solution, this work is not toil.
Automatable
If a machine could accomplish the task just as well as a human, or the need for the task could be designed away, that task is
toil. If human judgment is essential for the task, there’s a good chance it’s not toil.
Tactical
Toil is interrupt-driven and reactive, rather than strategy-driven and proactive. Handling pager alerts is toil. We may never be
able to eliminate this type of work completely, but we have to continually work toward minimizing it.
No enduring value
If your service remains in the same state after you have finished a task, the task was probably toil. If the task produced a
permanent improvement in your service, it probably wasn’t toil, even if some amount of grunt work—such as digging into
legacy code and configurations and straightening them out—was involved.
SRE tasks
Engineering work helps your team or the SRE organization handle a larger service, or more services, with the same level of
staffing.
Software engineering
Involves writing or modifying code, in addition to any associated design and documentation work. Examples include writing
automation scripts, creating tools or frameworks, adding service features for scalability and reliability, or modifying
infrastructure code to make it more robust.
Systems engineering
Involves configuring production systems, modifying configurations, or documenting systems in a way that produces lasting
improvements from a one-time effort. Examples include monitoring setup and updates, load balancing configuration, server
configuration, tuning of OS parameters, and load balancer setup. Systems engineering also includes consulting on
architecture, design, and productionization for developer teams.
Toil
Work directly tied to running a service that is repetitive, manual, etc.
Overhead
Administrative work not tied directly to running a service. Examples include hiring, HR paperwork, team/company meetings,
bug queue hygiene, snippets, peer reviews and self-assessments, and training courses.
Why Is Toil Bad?
Career stagnation
Your career progress will slow down or grind to a halt if you spend too little time on projects.
Low morale
People have different limits for how much toil they can tolerate, but everyone has a limit. Too much toil leads to burnout,
boredom, and discontent.
Slows progress
Excessive toil makes a team less productive. A product’s feature velocity will slow if the SRE team is too busy with manual
work and firefighting to roll out new features promptly.
Sets precedent
If you’re too willing to take on toil, your Dev counterparts will have incentives to load you down with even more toil,
sometimes shifting operational tasks that should rightfully be performed by Devs to SRE.
Promotes attrition
Even if you’re not personally unhappy with toil, your current or future teammates might like it much less. If you build too much
toil into your team’s procedures, you motivate the team’s best engineers to start looking elsewhere for a more rewarding job.
How to Reduce Toil?
Identify toil and reduce it as far as you can. For example:
Identify the tasks that take up 80% of your team members’ on-the-job time.
Check whether they can be automated.
If yes, automate them with suitable tools; if not, identify what else can be done to improve the process.
Keep improving the existing state and service.
Set goals for engineering tasks too, e.g. raising the internal SLO from 99.9% to 99.95%, then identify the ways to achieve it and implement them.
Which toil is worth automating?
Case Study – Reducing Toil
One of my clients, “X”, works in the contact center field, deploying contact center services for end clients and managing those services for them.
For every new client, deploying and building the infrastructure was a very hectic task (15+ servers with 10+ microservices, multiple LBs, cache servers, DBs, and security implementations). Expanding an existing client environment was similarly difficult and time-consuming. Even hardware capacity planning was becoming challenging.
80-90% of the teams (including developers) were involved in deploying new services or expanding environments, and handling issues for existing clients was becoming difficult. Teams were suffering from the repeated tasks and the pressure they were under.
• The company took the hard decision to migrate to cloud services to avoid hardware bottlenecks.
• Terraform (a DevOps IaC tool) was used to automate the deployment. The same infrastructure deployment that used to take 1 month to design and get ready is now up and running in less than an hour.
• The same team members are free from that pressure and happily invest their time in enhancing features, reducing bugs, and automating the environment further.
• Even during the pandemic, they thrived with a 200% increase in customer onboarding, without any hassle.
Video
Reducing Toils
Pragmatic Automation
https://www.youtube.com/watch?v=oDcjAcFTFC0
Exercise
Reducing Toil
Do you foresee any toil in your team?
If yes, what are the benefits of reducing it?
How can it be reduced?
Are daily mails and project reports toil?
Have you considered ROI?
Q & A
Reducing Toil
Questionnaire
Module 4: Monitoring & Service Level Indicators
Monitoring
Monitoring
Collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as query counts and
types, error counts and types, processing times, and server lifetimes.
White-box monitoring
Monitoring based on metrics exposed by the internals of the system, including logs, interfaces like the Java Virtual Machine
Profiling Interface, or an HTTP handler that emits internal statistics.
Black-box monitoring
Testing externally visible behavior as a user would see it.
Alert
A notification intended to be read by a human and that is pushed to a system such as a bug or ticket queue, an email alias, or a pager. Respectively, these alerts are classified as tickets, email alerts, and pages.
Root cause
A defect in a software or human system that, if repaired, instills confidence that this event won’t happen again in the same
way.
SLIs – Service Level Indicators
Monitoring
SRE doesn’t typically get involved in constructing SLAs; it does, however, get involved in helping to avoid triggering the consequences of missed SLOs. To achieve that, proper monitoring with the right SLIs is very important.
What Do You and Your Users Care About?
• User-facing serving systems: Could we respond to the request? How long did it take to respond? How many requests could be handled?
• Storage systems often emphasize latency, availability, and durability.
• Big data systems, such as data processing pipelines, tend to care about throughput and end-to-end latency (How much data is being processed? How long does it take to process?).
• All systems should care about correctness: was the right answer returned, the right data retrieved, the right analysis done?
Collecting Indicators
Aggregate metrics to make them easier to use (averages, instantaneous values)
Standardize indicators (e.g. averaged over a fixed period, such as average packets per minute)
Collect indicators at the server as well as at the client end
E.g. application latency for 99.9% of users in the last 5 minutes should be less than 100 ms.
Why Monitor?
Analyzing long-term trends
Comparing over time
Alerting
Building dashboards
Conducting ad hoc retrospective analysis
Monitoring with Proper SLI Metrics
Always make sure to select the right Service Level Indicator metric to track and alert on.
Set alerts (with the appropriate criticality and channel: pager, email, etc.) for all your SLO targets, and create simple, meaningful dashboards for better visibility.
Four golden signals to track:
Latency
Traffic
Errors
Saturation (IO, Memory, CPU etc)
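As a rough illustration, the four golden signals can be derived from a window of request records plus one resource reading. The `Request` record, the function name, and the choice of a nearest-rank 99th percentile for latency are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Request:
    latency_ms: float
    ok: bool  # False when the request ended in an error


def golden_signals(requests: List[Request], window_s: float, saturation: float) -> Dict[str, float]:
    """Summarize one monitoring window into the four golden signals."""
    n = len(requests)
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank p99, computed with integers to avoid float rounding surprises.
    p99 = latencies[(99 * n + 99) // 100 - 1] if n else 0.0
    return {
        "latency_p99_ms": p99,
        "traffic_rps": n / window_s,
        "error_rate": sum(1 for r in requests if not r.ok) / n if n else 0.0,
        "saturation": saturation,  # e.g. the busiest of CPU, memory, or I/O utilization
    }


if __name__ == "__main__":
    sample = [Request(50.0, True)] * 98 + [Request(500.0, False)] * 2
    print(golden_signals(sample, window_s=10.0, saturation=0.6))
```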
SLO Improvements
VALET
In a case study in Google’s Site Reliability Workbook, The Home Depot summed up its new SLOs in a handy acronym: VALET.
Volume (traffic)
How much business volume can my service handle?
Availability
Is the service up when I need it?
Latency
Does the service respond fast when I use it?
Errors
Does the service throw an error when I use it?
Tickets
Does the service require manual intervention to complete my request?
Use Telemetry tools to collect SLI metrics from remote servers into a centralized Monitoring server to generate graphs.
Case Study - Genpact
Monitoring and SLI
At client “X”, we configured autoscaling to trigger when CPU utilization exceeded 80% and took the system live in production. Performance/load testing was also completed successfully.
But after 6 months, during a heavy peak load, the system did not autoscale and crashed. During root cause analysis (RCA), we found that disk I/O was saturated, a resource we had not been monitoring and for which no alert or autoscaling was set up.
We moved the server from HDD to SSD to fix the issue, and also enabled monitoring and autoscaling on disk metrics.
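The lesson of this case study is that a scaling decision based on one resource misses saturation elsewhere. A minimal sketch of a multi-resource check follows; the function name, metric names, and all thresholds other than the 80% CPU value are hypothetical.

```python
from typing import Dict, List, Tuple


def evaluate_scaling(
    utilization: Dict[str, float],
    thresholds: Dict[str, float],
) -> Tuple[List[str], List[str]]:
    """Return (resources over their threshold, resources with no metric at all)."""
    breached = []
    unmonitored = []
    for resource, limit in thresholds.items():
        if resource not in utilization:
            unmonitored.append(resource)  # a blind spot, like disk I/O in the case study
        elif utilization[resource] > limit:
            breached.append(resource)
    return breached, unmonitored


if __name__ == "__main__":
    thresholds = {"cpu": 0.80, "memory": 0.85, "disk_io": 0.80}
    # CPU looks healthy, but disk I/O is saturated.
    metrics = {"cpu": 0.45, "memory": 0.60, "disk_io": 0.97}
    breached, unmonitored = evaluate_scaling(metrics, thresholds)
    if breached:
        print("scale out, triggered by:", breached)
    if unmonitored:
        print("no metrics for:", unmonitored)
```

Reporting unmonitored resources separately surfaces the gap before a peak load does.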
Video
Monitoring and SLI
SLI/SLO and reliability Deep Dive
https://www.youtube.com/watch?v=dplGoewF4DA
High Availability and Capacity Planning
High Availability
A Must know
Serving millions of requests from a single server is not practical, even if it is a supercomputer. Hence, we need horizontal scaling (adding more servers to handle the requests).
Traffic load balancing is the solution for managing heavy traffic: distributing traffic across multiple network links, datacenters, and machines in an "optimal" fashion.
Multiple factors affect HA:
• The hierarchical level at which we evaluate the problem (global versus local)
• The technical level at which we evaluate the problem (hardware versus software)
• The nature of the traffic we’re dealing with
The nature of the requests and the handling techniques play an important role here.
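To illustrate horizontal scaling with load balancing, here is a minimal round-robin balancer that skips backends marked unhealthy. It is a teaching sketch, not how any particular load balancer is implemented; the class and backend names are hypothetical.

```python
from typing import List


class RoundRobinBalancer:
    """Hand out backends in turn, skipping any that are marked down."""

    def __init__(self, backends: List[str]) -> None:
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)
        self._healthy = set(self._backends)
        self._next = 0

    def mark_down(self, backend: str) -> None:
        self._healthy.discard(backend)

    def mark_up(self, backend: str) -> None:
        if backend in self._backends:
            self._healthy.add(backend)

    def pick(self) -> str:
        """Return the next healthy backend, trying each backend at most once."""
        for _ in range(len(self._backends)):
            candidate = self._backends[self._next]
            self._next = (self._next + 1) % len(self._backends)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy backends available")


if __name__ == "__main__":
    lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
    print([lb.pick() for _ in range(4)])
    lb.mark_down("app-2")  # e.g. a failed health check
    print([lb.pick() for _ in range(3)])
```

Adding a server to the list is the horizontal-scaling step; the health marking is what turns plain distribution into high availability.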
High Availability techniques
A Must know
Techniques to handle the High availability/DR:
• Clustering
• Load Balancing with VIP
• Microservice architectures (with a containerization approach)
• Passive DR Sites
• Load Balancing Using DNS
• Content Delivery Networks (for low latency)
Thinknyx Technologies
High Availability & Burst Handling
[Figure labels: Application Server, Cloud, Premises, Autoscaling Group]
[Figure: LB for High Availability]
[Figure: HA in AWS]
Disaster Recovery
Operations On-Prem, Cloud as DR
Application is hosted in a self-managed datacenter and the backup is hosted on cloud.
Operations in Cloud, Third-party DR
Application is hosted in a cloud datacenter and the backup is hosted on a third-party cloud service provider or
on a third-party backup service provider.
Operations in Cloud, Cloud as DR
Application is hosted in a cloud datacenter and the backup is hosted in a second region/datacenter in the same cloud.
[Diagram: APP and Backup placement across On-Prem, X – Cloud (DC 1, DC 2) and Y – Cloud (DC 1)]
Way to SRE
69
Business Continuity
• Physical Location: Consider all natural disasters and their range before finalizing a DR location.
• Logical Location: Never put all your eggs in a single basket.
• Who’ll do what: Declarations of roles & responsibilities, emergency process, backup ready.
• DR Drill: Test BC/DR at least annually.
• Priorities: Define application criticality and failover priorities in advance. Health and human safety
should be the primary concern.
Way to SRE
70
Exercise
Monitoring and SLI
What do you monitor now, and which reliability aspects have you considered?
Is performance monitoring in place?
Is DR monitoring/activation in place?
What can we monitor, where, and how?
Risk factors
71
Q & A
Monitoring and SLI
Q & A
Did you consider the importance of monitoring in HA/DR?
72
Module 5:
SRE Tools & Automation
73
Automation
Way to SRE
Automation is key for any organization to thrive.
For SRE, automation is a force multiplier, not a panacea. Of course, just multiplying force does not naturally change the
accuracy of where that force is applied: doing automation thoughtlessly can create as many problems as it solves.
• Consistency
• A Platform
• Faster Repairs
• Faster Actions
• Time Saving
• Ease and Effectiveness
74
Automation Focus
Way to SRE
SRE has a number of philosophies and products in the domain of automation, some of which look more like generic rollout
tools without particularly detailed modeling of higher-level entities, and some of which look more like languages for describing
service deployment (and so on) at a very abstract level.
Some use cases:
• User account creation
• Cluster turnup and turndown for services
• Software or hardware installation preparation and decommissioning
• Rollouts of new software versions
The list is endless; it is just a matter of identifying priorities and automating tasks one by one.
The end goal is to create an autonomous system, which runs and manages itself. For example, a system should not just trigger alerts
and try to bring the services up on the same system; if the services do not come up on the same server, it should
fail over on its own to another, healthier available system.
The automation system must be secure and reliable. Automation works at scale, so if something goes wrong, the destruction will
also be at scale.
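The "restart locally, then fail over" behaviour described above can be sketched in a few lines of Python. The `restart` hook and the node names are hypothetical stand-ins for whatever actually starts the service in your environment.

```python
from typing import Callable


def recover_service(
    nodes: list[str],
    restart: Callable[[str], bool],
    max_local_retries: int = 2,
) -> str:
    """Try to restart on each node in turn; move to the next node when local retries fail.

    `nodes` is ordered by preference, with the current node first.
    `restart` is a hypothetical hook that returns True when the service came up.
    """
    for node in nodes:
        for _ in range(max_local_retries):
            if restart(node):
                return node  # service is running here
    raise RuntimeError("service could not be started on any node")
```

Bounding the local retries is the important design choice: without a limit, the automation keeps restarting on a broken server and never reaches the failover step.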
75
Automation Hierarchy
Way to SRE
Nowadays, tools are available in the market to automate the majority of the tasks and events we want to manage; yet there can
be a few things that need customized automation. For these we can follow custom paths. For example, the evolution path of
database failover automation toward an autonomous environment:
1) No automation
Database master is failed over manually between locations.
2) Externally maintained system-specific automation
An SRE has a failover script in his or her home directory.
3) Externally maintained generic automation
The SRE adds database support to a "generic failover" script that everyone uses.
4) Internally maintained system-specific automation
The database ships with its own failover script.
5) Systems that don’t need any automation (autonomous system)
The database notices problems, and automatically fails over without human intervention.
SREs dislike manual operations, so they try to create systems that don’t require them. However, sometimes manual
operations are unavoidable (DR activation, production push, etc.).
76
Secure Automation
Way to SRE
Insecure automation can be dangerous too. Learn from the example of “CODESPACES” and other clients where an AWS access key/secret key (AK/SK)
was leaked, and disaster followed.
Do you keep usernames and passwords in your container images? Who in your organization has access to these
images?
Do you keep API keys in code, and that code on GitHub/Bitbucket?
Zero-touch automation is the final goal for an SRE. Security has to be considered at every layer of it, as multiple tools get
involved.
We have to replace our use of sshd with an authenticated, ACL-driven, RPC-based Local Admin Daemon, also known as
Admin Servers, which has permissions to perform those local changes. As a result, no one can install or modify a server
without an audit trail.
The CIA triad (confidentiality, integrity, availability) is important to implement in automation too.
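One simple habit that addresses the questions above is to keep credentials out of code and images and inject them at runtime. A minimal sketch in Python, assuming the secret is supplied as an environment variable (the variable names are hypothetical; a secrets manager would be the next step up):

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is not present in the environment."""


def load_secret(name: str) -> str:
    """Read a secret from the process environment instead of hardcoding it."""
    value = os.environ.get(name)
    if not value:
        raise MissingSecretError(f"secret {name!r} is not set")
    return value


def mask(secret: str, visible: int = 4) -> str:
    """Return a log-safe form that shows only the last few characters."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]
```

Failing loudly when the secret is missing, and masking it whenever it is logged, keeps the key out of the repository, the image and the log files.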
77
Automation Tools
Way to SRE
Though there is no defined set of tools for an SRE, it is always better to have widely used tools in the basket. A few
categories, and tools in each that are well known in the market, are listed below:
Version Control System: TFVC, Git (GitLab, GitHub, Bitbucket, Azure DevOps)
Pipelining for CI/CD: Jenkins, Azure DevOps, TeamCity, Bamboo
Automated Deployment: Octopus Deploy, UrbanCode Deploy
Configuration Management: Ansible, Chef, Puppet, SaltStack
Container and Orchestration: Docker (Kubernetes, Docker Swarm, OpenShift)
Automation-oriented languages: Python, Java
DevOps Infrastructure as Code: Terraform, CloudFormation, Azure Templates
Continuous Testing: JMeter, SonarQube, Selenium
78
Video
SRE tools and Automation
SREcon19 Asia/Pacific - Ironies of Automation (Microsoft)
https://www.youtube.com/watch?v=U3ubcoNzx9k
79
Case Study
SRE Tools & Automation
Amazon automated the leasing of servers for rent and the metering of their usage; gradually, over a period of
time, this became a cloud with hundreds of services provisioned and released via a self-service portal.
50 million changes were made to the AWS cloud in 2016 alone, which is more than 1 change in production per second on average.
Distributed systems with APIs and queues are the best way to scale with automation.
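The "APIs and queues" pattern can be illustrated with a small in-process sketch in Python: requests are placed on a queue and a pool of workers drains it. This is only an illustration with hypothetical task names; in a real system the queue would be a managed message service and each worker would call a provisioning API.

```python
import queue
import threading


def run_workers(tasks: list[str], worker_count: int = 3) -> list[str]:
    """Process provisioning requests from a shared queue with several workers."""
    pending: queue.Queue = queue.Queue()
    done: list[str] = []
    lock = threading.Lock()

    for task in tasks:
        pending.put(task)

    def worker() -> None:
        while True:
            try:
                task = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            with lock:
                done.append(f"provisioned:{task}")  # stand-in for a real API call
            pending.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

Because producers and workers only share the queue, throughput can be raised by adding workers without changing the code that submits requests.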
80
Exercise
SRE Tools & Automation
Automation “Greatest Hits” – Uber, Airbnb, Ola, Olx, AWS …
How much automation do you have, and what more can be automated?
It’s as simple as:
“Anything that you do more than twice has to be automated.”
-Adam Stone, CEO, D-Tools
81
Q & A
SRE Tools & Automation
Q & A
Questionnaire
82
Module 6:
Anti-Fragility & Learning from Failure
83
Anti-fragility: Learning from Failure
Way to SRE
“The cost of failure is education.” Devin Carraway
Anti-fragility is all about understanding disorder and using it to your advantage. It is a property of systems in which they
increase in capability to thrive as a result of stressors, shocks, volatility, noise, mistakes, faults, attacks, or failures.
Postmortems are an essential tool for SRE (to make the system resilient and reliable).
When an incident occurs, we fix the underlying issue, and services return to their normal operating conditions. Unless we
have some formalized process of learning from these incidents in place, they may reoccur.
A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and
the follow-up actions to prevent the incident from recurring.
The primary goals of writing a postmortem are to ensure that the incident is documented, that all contributing root cause(s)
are well understood, and, especially, that effective preventive actions are put in place to reduce the likelihood and/or impact
of recurrence.
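The elements of a postmortem listed above map naturally onto a small record type. A minimal sketch in Python, with hypothetical field names, that also reports which sections are still empty:

```python
from dataclasses import dataclass, field


@dataclass
class Postmortem:
    """Written record of one incident, mirroring the elements listed above."""
    incident: str
    impact: str
    mitigation_actions: list[str] = field(default_factory=list)
    root_causes: list[str] = field(default_factory=list)
    follow_up_actions: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Name the sections that are still empty, so the review can block on them."""
        sections = {
            "impact": self.impact,
            "mitigation_actions": self.mitigation_actions,
            "root_causes": self.root_causes,
            "follow_up_actions": self.follow_up_actions,
        }
        return [name for name, value in sections.items() if not value]
```

A check like `missing_sections()` is one way to enforce that no postmortem is closed without preventive actions.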
84
Anti-fragility: Shifting the organizational balance
Way to SRE
The antifragile loves randomness and uncertainty.
Anti-fragility is a concept which encompasses the idea that things need chaos and disorder in order to thrive and flourish.
Whatever doesn’t kill us makes us stronger, pushing the notion that we shouldn’t construct our lives or our plans against
randomness and misfortune, rather, we should adopt anti-fragility as a means of maneuvering through disorder.
Similarly, we should plan our systems to withstand failures (planned or unplanned). We can even deliberately inject failures into our
system to understand its capacity to withstand them, that is, its anti-fragility.
We can schedule planned downtimes and activities that simulate failure, to learn from them and make our system more
robust: for example, pulling the network cable of a server or shutting down the UPS to understand the impact. But we
must first understand the error budget and the cost of failure before we plan such activities. They definitely
add learning and robustness to our system, but we must keep a balance between error budget and enhancements.
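The balance between error budget and failure experiments can be expressed as a simple gate. A minimal sketch in Python, assuming an availability SLO over a 30-day window; the `reserve_fraction` policy knob and all numbers are illustrative assumptions:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO such as 0.999."""
    return (1.0 - slo) * window_days * 24 * 60


def can_run_experiment(
    slo: float,
    downtime_so_far_min: float,
    expected_experiment_cost_min: float,
    reserve_fraction: float = 0.5,
) -> bool:
    """Allow a planned failure only if it leaves a reserve of the budget untouched.

    `reserve_fraction` is an assumed policy knob: the share of the total budget
    that is kept for unplanned incidents.
    """
    budget = error_budget_minutes(slo)
    remaining = budget - downtime_so_far_min
    return remaining - expected_experiment_cost_min >= budget * reserve_fraction
```

For a 99.9% SLO the 30-day budget is about 43.2 minutes, so with the default reserve an experiment expected to cost 10 minutes is allowed after 5 minutes of downtime but not after 20.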
85
Postmortem Culture : Learning from Failure
Way to SRE
The postmortem process does present an inherent cost in terms of time or effort, so we can be deliberate in choosing when
to write one. Teams have some internal flexibility, but common postmortem triggers include:
• User-visible downtime or degradation beyond a certain threshold
• Data loss of any kind
• On-call engineer intervention (release rollback, rerouting of traffic, etc.)
• A resolution time above some threshold
• A monitoring failure (which usually implies manual incident discovery)
• A stakeholder request for a postmortem of an event
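The triggers listed above can be checked automatically when an incident is closed. A minimal sketch in Python; the field names and the two thresholds are hypothetical and would be agreed per team:

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Facts about one incident; field names and thresholds are hypothetical."""
    user_visible_downtime_min: float = 0.0
    data_loss: bool = False
    on_call_intervened: bool = False
    resolution_time_min: float = 0.0
    detected_by_monitoring: bool = True
    stakeholder_requested: bool = False


def postmortem_triggers(
    incident: Incident,
    downtime_threshold_min: float = 5.0,
    resolution_threshold_min: float = 60.0,
) -> list[str]:
    """Return every trigger from the list above that this incident meets."""
    checks = {
        "user-visible downtime": incident.user_visible_downtime_min > downtime_threshold_min,
        "data loss": incident.data_loss,
        "on-call intervention": incident.on_call_intervened,
        "slow resolution": incident.resolution_time_min > resolution_threshold_min,
        "monitoring failure": not incident.detected_by_monitoring,
        "stakeholder request": incident.stakeholder_requested,
    }
    return [name for name, hit in checks.items() if hit]
```

An empty result means no postmortem is required; any non-empty result names the reasons, which can go straight into the postmortem header.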
86
Blameless Postmortems
Way to SRE
Blameless postmortems are a principle of SRE culture.
For a postmortem to be truly blameless, it must focus on identifying the contributing causes of the incident without
indicting any individual or team for bad or inappropriate behavior.
A blamelessly written postmortem assumes that everyone involved in an incident had good intentions and did the right
thing with the information they had.
If a culture of finger pointing and shaming individuals or teams for doing the "wrong" thing prevails, people will not bring
issues to light for fear of punishment.
When postmortems shift from allocating blame to investigating the systematic reasons why an individual or team had
incomplete or incorrect information, effective prevention plans can be put in place.
You can’t "fix" people, but you can fix systems and processes to better support people making the right choices when
designing and maintaining complex systems.
87
Best Practice
Way to SRE
Avoid Blame and Keep It Constructive.
Collaborate and Share Knowledge
No Postmortem Left Unreviewed
Introduce a Postmortem Culture
Visibly Reward People for Doing the Right Thing
Ask for Feedback on Postmortem Effectiveness
Continuous improvement
A postmortem should have clearly defined ownership, priority, preventive actions and actions taken.
88
Case Study
Anti-Fragility
Netflix Simian Army
https://netflixtechblog.com/the-netflix-simian-army-16e57fbab116
https://github.com/netflix/chaosmonkey
89
Failures
Anti-Fragility
Is failure really bad, for organizations and for individuals?
Consider Elon Musk and Colonel Sanders ☺
90
Exercise
Anti-Fragility
Share an example of a problem ticket within your team in which you were involved and that had a lot of incidents because
of a single root cause.
Share the incidents, business impact, root cause, course of action taken to resolve it, and whose mistake
this was, if it was a configuration issue.
91
Q & A
Anti-Fragility
Q & A
Questionnaire
92
Module 7:
Organizational Impact of SRE
93
Organizations Embracing SRE
Way to SRE
• Availability
• Reliability
• Capacity Planning
• Happy Customers
• Cost effectiveness due to fewer failures
• Velocity
• Continuous Improvements
94
Typical ORG Chart
Way to SRE
[Org chart, three groups:
SRE Team: Manager; Site Reliability Engineers (TL); Specialized Reliability Engineers
Operations: DB Manager with DB Admins; Systems Manager with OS Admins
Dev/Prod Team: Dev Manager with Devs; QA Manager with QA Admins]
95
SRE Responsibilities
Way to SRE
Tasks                                                       SRE Team
Architecture Design Approvals and Consulting                RC
Instrumentation, Metrics, and Monitoring                    CI
Maintaining SLI                                             CI
SLO/SLI tracking and management                             CR
Handling Incidents                                          CI
Repeated Incidents and Problem Management                   RA
Capacity Planning                                           CR
Change Management                                           CI
Critical/Large Scale Changes                                CR
Performance: availability, latency, and efficiency          R
Automation                                                  RA
Innovation                                                  RA
Release Management                                          C
Release Repeated Failures                                   R
Test Management                                             C
Test Repeated Failures                                      R
Standardization of Tools/Software/Process/Technologies      RA
Supporting Presales and Sales                               CR
Training other Team Members                                 R
96
SRE Engagement and Adoption
Way to SRE
SRE seeks production responsibility for important services for which it can make concrete contributions to reliability. SRE is
concerned with several aspects of a service, which are collectively referred to as production. These aspects include the
following:
• System architecture and interservice dependencies
• Instrumentation, metrics, and monitoring
• Emergency response
• Capacity planning
• Change management
• Performance: availability, latency, and efficiency
When SREs engage with a service, they aim to improve it along all of these axes, which makes managing production for the
service easier.
97
SRE & Scale
Way to SRE
The bigger the operations/systems, the more autonomous they should be.
If 10 engineers handle 100 servers, we shouldn’t need 100 engineers to handle 1000 servers.
SRE is all about automation, improvements and reliability in the system.
A bigger environment means more effectiveness from SRE.
The need for manual tasks reduces over time due to automation, yet enhancing the autonomous system and improving
it further is a continuous process.
98
Testing
Way to SRE
“If you haven't tried it, assume it's broken.”
One key responsibility of Site Reliability Engineers is to quantify confidence in the systems they maintain. SREs perform this
task by adapting classical software testing techniques to systems at scale.
Testing is the mechanism we use to demonstrate specific areas of equivalence when changes occur. Each test that passes
both before and after a change reduces the uncertainty for which the analysis needs to allow. Thorough testing helps us
predict the future reliability of a given site with enough detail to be practically useful.
Passing a test or a series of tests doesn’t necessarily prove reliability. However, tests that are failing generally prove the
absence of reliability.
How failures are measured:
Mean Time to Repair (MTTR) measures how long it takes the operations team to fix the bug, either through a rollback or
another action.
Mean Time Between Failures (MTBF) measures how long the service runs correctly between one failure and the next.
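Both measures can be computed directly from an outage log. A minimal sketch in Python, assuming each outage is recorded as a hypothetical `(failed_at, recovered_at)` pair in hours and that MTBF is taken as the uptime between one recovery and the next failure (at least two outages are needed for that):

```python
from statistics import mean


def mttr(outages: list[tuple[float, float]]) -> float:
    """Mean Time to Repair: average of (recovered - failed) over all outages."""
    return mean(end - start for start, end in outages)


def mtbf(outages: list[tuple[float, float]]) -> float:
    """Mean Time Between Failures: average uptime from one recovery to the next failure."""
    ordered = sorted(outages)
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(ordered, ordered[1:])]
    return mean(gaps)


def availability(outages: list[tuple[float, float]]) -> float:
    """Steady-state availability estimate: MTBF / (MTBF + MTTR)."""
    up, down = mtbf(outages), mttr(outages)
    return up / (up + down)
```

For example, three outages lasting 1, 2 and 3 hours, each separated by 100 hours of healthy operation, give an MTTR of 2 hours, an MTBF of 100 hours and an availability of about 98%.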
99
SW Testing Classification
• Testing Type: Manual, Automated
• Testing Methods: Static, Dynamic
• Testing Levels: Unit Testing, Integration Testing, System Testing, Acceptance Testing
• Testing Approach: Black Box, White Box, Grey Box
100
Managing Incidents
Way to SRE
Effective incident management is key to limiting the disruption caused by an incident and restoring normal business
operations as quickly as possible.
As an SRE, you are also expected to be on-call (again, for a limited share of your time) and to handle incidents.
When on-call, an engineer is available to perform operations on production systems within minutes, according to the paging
response times agreed to by the team and the business system owners. Typical values are 5 minutes for user-facing or
otherwise highly time-critical services, and 30 minutes for less time-sensitive systems.
Google strongly believes in investing at least 50% of SRE time in engineering: of the remainder, no more than 25% can be
spent on-call, leaving up to another 25% for other types of operational, nonproject work.
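The 50/25/25 guideline above is easy to check against logged hours. A minimal sketch in Python; the function name and the way hours are categorized are assumptions for illustration:

```python
def check_time_split(engineering_h: float, on_call_h: float, other_ops_h: float) -> list[str]:
    """Compare a team's logged hours with the 50/25/25 guideline and list violations."""
    total = engineering_h + on_call_h + other_ops_h
    if total <= 0:
        raise ValueError("no hours logged")
    problems = []
    if engineering_h / total < 0.50:
        problems.append("engineering below 50%")
    if on_call_h / total > 0.25:
        problems.append("on-call above 25%")
    if other_ops_h / total > 0.25:
        problems.append("other operational work above 25%")
    return problems
```

Running this per quarter gives an early signal that operational load is crowding out engineering work.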
The most important on-call resources are:
• Clear escalation paths
• Well-defined incident-management procedures
• A blameless postmortem culture
101
Emergency Response
Way to SRE
“Things break; that’s life.”
How employees respond to an emergency shows the maturity of the process and the long-term health of the organization. In the IT
industry, an organization’s long-term success depends heavily on this one factor.
What to Do When Systems Break
First of all, don’t panic!
If you feel overwhelmed, pull in more people.
Follow the incident response process.
Take a deep breath and try to understand the situation and the cause of the failure, or correlate the sources in case of multiple failures.
Test-Induced Emergency
Change-Induced Emergency
Process-Induced Emergency
Some important pointers:
• Keep a History of Outages
• Ask the Big, Even Improbable, Questions: What If…?
• Encourage Proactive Testing
102
Videos
Organizational Impact
A history of SRE at Uber:
https://www.youtube.com/watch?v=qJnS-EfIIIE
103
Case Study - OBS
Organizational Impact
Orange Business Services – The Flexible Engine
104
Exercise
Organizational Impact
Why do you want to adopt SRE? Who in your organization currently provides SRE?
Your organizational plan for SRE?
105
Q & A
Organizational Impact
Q & A
Questionnaire
106
Module 8:
SRE, Other Frameworks,
& The Future
107
Transforming Culture
Way to SRE
Site Reliability Engineering (SRE) offers many advantages for distributed systems. It improves infrastructure
automation, increases reliability, and transforms incident management.
Instead of placing an individual at the centre, we have a specialized team at the centre, acting as the hub of collaboration and
communication in the organization.
Embracing Risk
Learning From Failure
Better collaboration and communication
Automation at the centre, which benefits the whole organization
Standardization of tools / technologies / processes
Centralized documentation
Consultation and training
108
SRE with Other frameworks
Way to SRE
SRE works well with all major existing process and culture concepts, like:
DevOps
Agile
Scrum
Lean
ITIL
PMP
While many of the above are largely conceptual, SRE puts those concepts into practice on the ground.
It is a path to a stress-free, autonomous, reliable environment with tremendous velocity.
109
SRE Evolution
Way to SRE
Google coined the term “site reliability engineer” in 2003, but the role had certainly existed for decades before that in different forms,
such as disaster recovery and production testers.
Ways the SRE Approach is Evolving:
1. Increased Adoption
2. Larger, Diversified SRE Departments
3. New Testing Tactics emerge – e.g., Chaos Monkey
4. Businesses Rely on SREs to Mitigate Risk
Currently, the SRE approach is being widely adopted by organizations to achieve high uptime and stability for their applications, as
even 1 minute of downtime can cost many MNCs millions of dollars.
110
Videos
SRE and other frameworks
A Look at ITIL4 & SRE
https://www.youtube.com/watch?v=vFyPXIsUEhE
111
Case Study – VictorOps
SRE and other frameworks
Victor Ops
112
Exercise
SRE and other frameworks
Where do you see the future of SRE heading?
Sketch out your understanding of SRE and the requirements for the job role.
113
Q & A
SRE and other frameworks
Q & A
Questionnaire
114
THANK YOU

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 

Site-Reliability-Engineering-v2[6241].pdf

  • 2. 2 Introduction • Name • Total Experience • Background – Development / Infrastructure / Management • Experience on DevOps Tools, Cloud • Your expectations from this training 2
  • 3. 3 Few Pointers 3 • SRE is more of a concept to implement than a tool, so this training leans toward theory. • Many pointers will be familiar; what matters is using them correctly. • We already know a few best practices but could not implement them. Now is the right time to implement them. It’s a do or die situation now. • Try to correlate the topics with your domain and prepare notes with pointers. • Implementation discussion is welcome for your existing teams and domains. • Learn with - “What behavior will I change?” Learning isn’t collecting information. Learning is changing behavior.
  • 6. 6 Digital Transformation A way Forward Security, Legal, Compliance must be in center in all designs Security Integration of digital technology into all areas of a business, fundamentally changing how you operate and deliver value to customers. Concept Use technologies which suits you the best Tools Cultural Change Process The most important part and the barrier too People SECURITY PEOPLE PROCESS TOOLS Successful Digital Transformation
  • 7. 7 Emerging Technologies Which are helping industries in Digital Transformation: Data, ML/AI, Cloud, Security, Automation, Blockchain, IoT • Security toolsets: Remain secure to avoid financial, legal and compliance issues • Automation: Automation with DevOps toolsets – CI/CD, SRE, CM, containers etc. • Cloud: Keep infra off-premises with a third party and move costs to Opex • ML/AI: Better decision making (well predicted, informed) and autonomy • IoT: Connect and integrate whatever possible to automate • Blockchain: For secure transactions over distributed public networks
  • 9. 9 9 • A systems development life cycle is composed of several clearly defined and distinct work phases which are used by systems engineers and systems developers to plan for, design, build, test, and deliver information systems. SDLC Model – Life Cycle: Requirement Analysis → Design → Implementation → Testing → Evaluation
  • 10. 10 10 - long release cycle - A lot of WIP - Functional silos - Incredibly rigid for developing 1. Determine the Requirements 2. Complete the design 3. Do the coding and testing (unit tests) 4. Perform other tests (functional tests, non-functional tests, Performance testing, bug fixes etc.) 5. At last deploy and maintain Waterfall Model
  • 11. 11 11 - Shorter release cycle - Small batch sizes (MVP) - Cross-functional teams - Incredibly agile Agile
  • 12. 12 12 - Suddenly ops was the bottleneck (more releases, fewer people), so WIP piled up again! Lean Development
  • 13. 13 Software Development Infrastructure, Operations and Support Build & Release, Testing Teams DevOps - Break the Silos - Communication (not only with emails) - Collaboration - Trust - Involvement in the early development stages - Automation is the key - Continuous Integration - Continuous Deployments in the lower environments - Fail fast and fail often DevOps
  • 14. 14 Development QA Testing Implementation & Release Infra Management - name: Playbook for webserver setup hosts: all tasks: - name: package installation yum: name=yum state=present .... .... .... .... APP APP APP Waiting Waiting Waiting Waiting APP DevOps
  • 15. 15 15 DevOps • DevOps is a loose set of practices, guidelines, and culture designed to break down silos in IT development, operations, networking, and security. • In a DevOps approach, you improve something (often by automating it), measure the results, and share those results with colleagues so the whole organization can improve. • DevOps, Agile, and a variety of other business and software reengineering techniques are all examples of a general worldview on how best to do business in the modern world. None of the elements in the DevOps philosophy are easily separable from each other, and this is essentially by design. • DevOps is a broad set of principles about whole-lifecycle collaboration between operations and product development.
  • 16. 16 16 DevOps Principles • No More Silos: Extreme siloization of knowledge, incentives for purely local optimization, and lack of collaboration have in many cases been actively bad for business. • Accidents Are Normal: Accidents are not just a result of the isolated actions of an individual, but rather result from missing safeguards for when things inevitably go wrong, e.g. a misconfigured system, broken monitoring, or wrong actions taken under pressure. Rooting out the mistake makers and punishing them creates a mess: incentives to confuse issues, hide the truth, and blame others, all of which are ultimately unprofitable distractions. • Change Should Be Gradual: Change is best when it is small and frequent. Change is risky, true, but the correct response is to split up your changes into smaller subcomponents where possible. Then you build a steady CI/CD pipeline of low-risk change out of regular output from product, design, and infrastructure changes, with automated testing and improvements. • Tooling and Culture Are Interrelated: A good culture can work around broken tooling, but the opposite rarely holds true. Promoters of DevOps strongly emphasize organizational culture, rather than tooling, as the key to success in adopting a new way of working. • Measurement Is Crucial: Measure your outcomes from time to time. This can be in the form of number of incidents, time to market, MTTR, SLA compliance etc.
  • 18. 18 Site Reliability Engineering Site Reliability Engineering (SRE) is a term (and associated job role) coined by Ben Treynor Sloss, a VP of engineering at Google. SRE is a job role, a set of practices known to work on the ground, and some beliefs that animate those practices. SRE is built around the reliability of the system. In general, an SRE has particular expertise around the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of the service(s) they are looking after. “class SRE implements interface DevOps”: SRE can be seen as one concrete implementation of the DevOps principles. SRE means hiring software engineers to run products and to create systems to accomplish the work that would otherwise be performed, often manually, by sysadmins. Common to all SREs is the belief in and aptitude for developing software systems to solve complex problems. Way to SRE
  • 19. 19 Site Reliability Engineering SRE is a team of people who (a) will quickly become bored by performing tasks by hand, and (b) have the skill set necessary to write software to replace their previously manual work, even when the solution is complicated. SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, design and implement automation with software to replace human labor. By design, it is crucial that SRE teams are focused on engineering. Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload. Google places a 50% cap on the aggregate "ops" work for all SREs: tickets, on-call, manual tasks, etc. This cap ensures that the SRE team has enough time in their schedule to make the service stable and operable via engineering tasks of automation. In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s). Way to SRE
  • 20. 20 20 SRE Principles • Operations Is a Software Problem: SRE should therefore use software engineering approaches to solve that problem. • Manage by Service Level Objectives: Define SLOs and work toward them. • Work to Minimize Toil: Toil should be reduced to a minimum and automated to the extent possible. As per Google, at most 50% of the work can be toil; the remaining 50% of the time should go to SRE engineering tasks or something new. • Automate What You Can: Automation is the key. The real work in this area is determining what to automate, under what conditions, and how to automate it. • Move Fast by Reducing the Cost of Failure: Cost of failure is directly proportional to Mean Time to Repair (MTTR), which affects product developer velocity. This follows from the well-known fact that the later in the product lifecycle a problem is discovered, the more expensive it is to fix. • Share Ownership with Developers: Ideally, both product development and SRE teams should have a holistic view of the stack (the frontend, backend, libraries, storage, kernels, and physical machine) and no team should jealously own single components. It turns out that you can get a lot more done if you “blur the lines” and have SREs instrument JavaScript, or product developers qualify kernel configurations. • Use the Same Tooling, Regardless of Function: Having the same qualified tools across organizations makes processes easier to understand and eases SRE/DevOps culture adoption. Way to SRE
  • 21. 21 21 DevOps vs SRE • DevOps is a loose, generic set of principles (philosophy and culture), and SRE is an advanced, explicit implementation of them. • Site Reliability Engineering, like DevOps, should not just be changing titles, but making definitive behavior changes, focusing on outcomes and obviously reliability. • Collaboration is front and center for DevOps work. An effective shared ownership model and partner team relationships are necessary for SRE to function. • Change management is best pursued as small, continual actions, the majority of which are ideally both automatically tested and applied. The critical interaction between change and reliability makes this especially important for SRE. • Measurement is absolutely key to how both DevOps and SRE work. For SRE, SLOs are dominant in determining the actions taken to improve the service. For DevOps, the act of measurement is often used to understand what the outputs of a process are, what the duration of feedback loops is, and so on. • DevOps is relatively silent on how to run operations at a detailed level, while SRE talks about detailed steps of implementation and deployment. Way to SRE
  • 22. 22 22 DevOps vs SRE • DevOps is more context-sensitive and works organization wide. SRE, on the other hand, has relatively narrowly defined responsibilities and its remit is generally service-oriented (and end-user-oriented) rather than whole-business-oriented. • Ultimately, implementing DevOps or SRE is a holistic act; both hope to make the whole of the team (or unit, or organization) better, as a function of working together in a highly specific way. For both DevOps and SRE, better velocity should be the outcome. Way to SRE
  • 23. 23 SRE Context and Successful Adoption Narrow, Rigid (launch-related or reliability-related) Incentives, Narrow Your Success. A system with early SRE engagement (ideally, at design time) typically works better in production after deployment, regardless of who is responsible for managing the service. Don’t just allow, but actively encourage, engineers to change code and configuration when required for the product. Support blameless postmortems. Doing so eliminates incentives to downplay or cover up a problem. Allow support to move away from products that are irredeemably operationally difficult. The threat of support withdrawal motivates product development to fix issues both in the run-up to support and once the product is itself supported, saving everyone time. Always remember – Good people will quit if they’re tasked with too much operational work and aren’t given the opportunity to use their engineering skill set. Consider Reliability Work as a Specialized Role. Strive for Parity of Esteem: Career and Financial. Way to SRE
  • 25. 25 Brain - Storming DevOps vs SRE Company X wants to reduce the time to market for its new software product releases and is facing the issues below: • Hardware capacity planning is a challenge • Infra is new, yet hardware failures are frequent • Lots of bugs are being identified in the products • Releases fail on production days • A large number of incident tickets for a few days after each new release. Where can DevOps help in this area? Where can SRE help in this segment? Understand how SRE heals DevOps failures…
  • 26. 26 Case Study – French Telecom DevOps vs SRE Identifying the DevOps Work and building DevOps Team Identifying Reliability needs and building SRE Engineering Team Continuous Enhancement…
  • 27. 27 Video DevOps vs SRE DevOps vs SRE (Google) https://www.youtube.com/watch?v=uTEL8Ff1Zvk
  • 28. 28 Exercise DevOps vs SRE What do we do all day? Is there a way to automate it? Is there any way to make the system more reliable? Have we factored in ROI?
  • 29. 29 Q & A DevOps vs SRE Q & A Questionnaire
  • 30. 30 30 Module 2: Service Level Objectives (SLOs) & Error Budgets
  • 31. 31 Important terms Way to SRE • Availability = The fraction of the time that a service is usable (i.e. how little downtime it has). Although 100% availability is impossible, near-100% availability is often readily achievable. • Reliability = The ability to work properly (even if some parts/components have failed). • Durability = The ability to not lose data, i.e. the likelihood that data will be retained over a long period of time. For data storage systems it is as important as availability. • SLA (Promise) = Service Level Agreement: a commitment between a service provider and a client regarding particular aspects of the service (quality, availability, responsibilities etc.). It is an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs it contains. • SLO (Goal) = Service Level Objective: the objectives within the SLA (e.g. uptime, response time) which your team must hit to meet the SLA. • SLI (How and What) = Service Level Indicator: the real numbers used to measure your compliance against the SLO. Specifically, it is a carefully defined quantitative measure of some aspect of the level of service that is provided, e.g. request latency, error rate etc.
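To make the availability definition concrete, here is a minimal sketch that converts an availability target into the downtime it permits, and measured uptime back into a percentage. The function names and the 30-day period are assumptions chosen for illustration; they do not come from the slides.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Minutes of downtime a target permits over the period (30 days is an assumed default)."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


def availability_pct(uptime_minutes: float, total_minutes: float) -> float:
    """Availability as the fraction of time the service was usable, in percent."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return 100 * uptime_minutes / total_minutes


if __name__ == "__main__":
    # Compare how much downtime a few common targets leave room for.
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% -> {allowed_downtime_minutes(target):.1f} min / 30 days")
```

Under these assumptions a 99.9% target leaves roughly 43 minutes of downtime per 30 days, which is why each extra "nine" is so much harder to hold.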
  • 33. 33 Service Level Objectives Way to SRE It’s impossible to manage a service correctly, let alone well, without understanding which behaviors really matter for that service and how to measure and evaluate those behaviors. We use intuition, experience, and an understanding of what users want to define service level indicators (SLIs), objectives (SLOs), and agreements (SLAs). These measurements describe basic properties of metrics that matter, what values we want those metrics to have, and how we’ll react if we can’t provide the expected service. Ultimately, choosing appropriate metrics helps to drive the right action if something goes wrong, and also gives an SRE team confidence that a service is healthy. An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI. A natural structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound. Choosing an appropriate SLO is usually complex (e.g. for QPS or network bandwidth), but sometimes it is straightforward too (e.g. setting a low-latency target). Choosing and publishing SLOs to users sets expectations about how a service will perform. Without an explicit SLO, users often develop their own beliefs about desired performance, which may be unrelated to the beliefs held by the people designing and operating the service.
  • 34. 34 SLI – Indicators in Practice Way to SRE SRE doesn’t typically get involved in constructing SLAs; however, it does get involved in helping to avoid triggering the consequences of missed SLOs. What Do You and Your Users Care About? • User-facing serving systems, such as the Shakespeare search frontends, generally care about availability, latency, and throughput. In other words: Could we respond to the request? How long did it take to respond? How many requests could be handled? • Storage systems often emphasize latency, throughput, IOPS, availability, and durability. • Big data systems, such as data processing pipelines, tend to care about throughput and end-to-end latency (how much data is being processed, and how long does it take to process?). • All systems should care about correctness: was the right answer returned, the right data retrieved, the right analysis done? Collecting Indicators • Aggregate the metrics for better usage (averages as well as instantaneous values) • Standardize indicators (over a period of time, e.g. average packets per minute) • Collect indicators at the server as well as the client end Objectives in Practice (SLO) • Define objectives (e.g. 99% (averaged over 1 minute) of Get RPC calls will complete in less than 100 ms) • Choose realistic targets that are simple and the minimum required, and always plan to refine them.
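The latency objective quoted above (99% of Get RPC calls complete in less than 100 ms) can be checked with a very small calculation. The sketch below is only an illustration: the function names `latency_sli` and `meets_slo` are hypothetical, and it assumes the latencies for one measurement window have already been collected into a list.

```python
from typing import Sequence


def latency_sli(latencies_ms: Sequence[float], threshold_ms: float = 100.0) -> float:
    """SLI: fraction of requests in the window that finished under the threshold."""
    if not latencies_ms:
        raise ValueError("no requests observed in this window")
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good / len(latencies_ms)


def meets_slo(latencies_ms: Sequence[float], target: float = 0.99,
              threshold_ms: float = 100.0) -> bool:
    """SLO check: is the measured SLI at or above the target ratio?"""
    return latency_sli(latencies_ms, threshold_ms) >= target


if __name__ == "__main__":
    # Hypothetical one-minute window: 99 fast calls and 1 slow call.
    window = [20.0] * 99 + [250.0]
    print(latency_sli(window), meets_slo(window))
```

Counting "good" requests against a threshold, rather than averaging latencies, keeps a handful of very slow calls from being hidden by many fast ones.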
  • 35. 35 Control Measures SLIs and SLOs are crucial elements in the control loops used to manage systems: • Monitor and measure the system’s SLIs. • Compare the SLIs to the SLOs, and decide whether or not action is needed. • If action is needed, figure out what needs to happen in order to meet the target. • Take that action. Always remember: • Publishing SLOs sets expectations for system behavior • Keep margins - Using a tighter internal SLO than the SLO advertised to users gives you room to respond to chronic problems before they become visible externally. • If your service’s actual performance is much better than its stated SLO, users will come to rely on its current performance. You can avoid over-dependence by deliberately taking the system offline occasionally (Google’s Chubby service introduced planned outages in response to being overly available). Understanding how well a system is meeting its expectations helps decide whether to invest in making the system faster, more available, and more resilient. Alternatively, if the service is doing fine, perhaps staff time should be spent on other priorities, such as paying off technical debt, adding new features, or introducing other products. Way to SRE
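The control loop and the "keep margins" advice above can be sketched as a single comparison step. This is a simplified illustration that assumes a "higher is better" SLI such as a success ratio; the function name and the returned messages are made up for this example.

```python
def evaluate_slo(sli: float, external_slo: float, internal_slo: float) -> str:
    """Compare a measured SLI with the advertised SLO and a tighter internal SLO."""
    if internal_slo < external_slo:
        raise ValueError("the internal SLO should be at least as strict as the external one")
    if sli < external_slo:
        return "external SLO missed: act now"
    if sli < internal_slo:
        return "inside the safety margin: investigate before users notice"
    return "healthy: no action needed"


if __name__ == "__main__":
    # Assumed targets: 99.9% advertised to users, 99.95% used internally.
    for measured in (0.9996, 0.9992, 0.9980):
        print(measured, "->", evaluate_slo(measured, 0.999, 0.9995))
```

The gap between the two targets is the room the team has to respond to a chronic problem before it becomes visible externally.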
  • 37. 37 Error Budget and policies The SLO is a target percentage, and the error budget is 100% minus the SLO. For example, if you have a 99.9% success ratio SLO, then a service that receives 3 million requests over a four-week period has a budget of 3,000 (0.1%) errors over that period. If a single outage is responsible for 1,500 errors, that outage costs 50% of the error budget. Once you have an SLO, you can use the SLO to derive an error budget. In order to use this error budget, you need a policy outlining what to do when your service runs out of budget. When we talk about enforcing an error budget policy, we mean that once you exhaust your error budget (or come close to exhausting it), you should do something in order to restore stability to your system. Common owners and actions might include: • The development team gives top priority to bugs relating to reliability issues over the past four weeks. • The development team focuses exclusively on reliability issues until the system is within SLO. This responsibility comes with high-level approval to push back on external feature requests and mandates. • To reduce the risk of more outages, a production freeze halts certain changes to the system until there is sufficient error budget to resume changes. Way to SRE
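The arithmetic in the example above (3 million requests, a 99.9% SLO, one outage of 1,500 errors) can be reproduced in a few lines. The sketch below is illustrative only: the function names are hypothetical, and the 75% warning threshold in `policy_action` is an assumption, not something the slides prescribe.

```python
def error_budget(total_requests: int, slo: float) -> int:
    """Number of failed requests the SLO tolerates over the period."""
    return round(total_requests * (1 - slo))


def budget_consumed(errors: int, budget: int) -> float:
    """Fraction of the error budget used so far."""
    if budget <= 0:
        raise ValueError("a 100% SLO leaves no error budget to spend")
    return errors / budget


def policy_action(consumed: float) -> str:
    """Map budget consumption to a policy step (thresholds are assumed for illustration)."""
    if consumed >= 1.0:
        return "freeze changes and focus on reliability"
    if consumed >= 0.75:
        return "prioritize reliability bugs"
    return "continue normal releases"


if __name__ == "__main__":
    budget = error_budget(3_000_000, 0.999)
    used = budget_consumed(1_500, budget)
    print(budget, used, policy_action(used))
```

Writing the policy down as explicit thresholds makes the agreed response unambiguous before the budget is actually exhausted.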
  • 39. 39 Case Study – Genpact SLO & Error Budget Pain Areas… Penalties due to SLA Miss… Setting SLO for Application uptime and performance. Tracking SLIs
  • 40. 40 Video SLO & Error Budget SLA, SLO and SLI (Google) https://www.youtube.com/watch?v=tEylFyxbDLE Risks and Error Budgets (Google) https://www.youtube.com/watch?v=y2ILKr8kCJU
  • 41. 41 Exercise SLO & Error Budget Define 3 SLIs/SLOs for your current Application contract. Have we defined the right SLOs, and are we monitoring the right SLIs? Do we work only with availability monitoring, or with performance monitoring too?
  • 42. 42 Q & A SLO & Error Budget Q & A Questionnaire
  • 44. 44 Toils Way to SRE “If a human operator needs to touch your system during normal operations, you have a bug.” Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows. Google’s SRE organization has an advertised goal of keeping operational work (i.e., toil) below 50% of each SRE’s time. At least 50% of each SRE’s time should be spent on engineering project work that will either reduce future toil or add service features. Feature development typically focuses on improving reliability, performance, or utilization, which often reduces toil as a second-order effect. It is equally important to calculate toils and time spent on same, over a given period and keep aligning the team towards engineering tasks. Engineering work is novel and essentially requires human judgment. It produces a permanent improvement in your service and is guided by a strategy. It is frequently creative and innovative, taking a design-driven approach to solving a problem—the more generalized, the better.
  • 45. 45 Toils A Must know Manual This includes work such as manually running a script that automates some task. Running a script may be quicker than manually executing each step in the script, but the hands-on time a human spends running that script (not the elapsed time) is still toil time. Repetitive If you’re performing a task for the first time ever, or even the second time, this work is not toil. Toil is work you do over and over. If you’re solving a novel problem or inventing a new solution, this work is not toil. Automatable If a machine could accomplish the task just as well as a human, or the need for the task could be designed away, that task is toil. If human judgment is essential for the task, there’s a good chance it’s not toil. Tactical Toil is interrupt-driven and reactive, rather than strategy-driven and proactive. Handling pager alerts is toil. We may never be able to eliminate this type of work completely, but we have to continually work toward minimizing it. No enduring value If your service remains in the same state after you have finished a task, the task was probably toil. If the task produced a permanent improvement in your service, it probably wasn’t toil, even if some amount of grunt work—such as digging into legacy code and configurations and straightening them out—was involved.
  • 46. 46 SRE tasks Way to SRE Engineering work helps your team or the SRE organization handle a larger service, or more services, with the same level of staffing. Software engineering Involves writing or modifying code, in addition to any associated design and documentation work. Examples include writing automation scripts, creating tools or frameworks, adding service features for scalability and reliability, or modifying infrastructure code to make it more robust. Systems engineering Involves configuring production systems, modifying configurations, or documenting systems in a way that produces lasting improvements from a one-time effort. Examples include monitoring setup and updates, load balancing configuration, server configuration, tuning of OS parameters, and load balancer setup. Systems engineering also includes consulting on architecture, design, and productionization for developer teams. Toil Work directly tied to running a service that is repetitive, manual, etc. Overhead Administrative work not tied directly to running a service. Examples include hiring, HR paperwork, team/company meetings, bug queue hygiene, snippets, peer reviews and self-assessments, and training courses.
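One simple way to track the four work categories above against the 50% toil target is to total the hours logged per category. The sketch below assumes a hypothetical time log of `(category, hours)` pairs; the category labels and function names are chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def time_breakdown(entries: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Share of total logged hours spent in each work category."""
    totals: Dict[str, float] = defaultdict(float)
    for category, hours in entries:
        totals[category] += hours
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("no hours logged")
    return {category: hours / grand_total for category, hours in totals.items()}


def toil_within_cap(entries: Iterable[Tuple[str, float]], cap: float = 0.5) -> bool:
    """True if the toil share stays at or below the cap (50% by default)."""
    return time_breakdown(entries).get("toil", 0.0) <= cap


if __name__ == "__main__":
    # Hypothetical week of logged work for one engineer.
    week = [("toil", 12), ("software", 16), ("systems", 8), ("overhead", 4), ("toil", 10)]
    print(time_breakdown(week), toil_within_cap(week))
```

Reviewing this breakdown over a given period shows whether the team is drifting away from engineering work.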
  • 47. 47 Why Toils are bad? Way to SRE Career stagnation Your career progress will slow down or grind to a halt if you spend too little time on projects. Low morale People have different limits for how much toil they can tolerate, but everyone has a limit. Too much toil leads to burnout, boredom, and discontent. Slows progress Excessive toil makes a team less productive. A product’s feature velocity will slow if the SRE team is too busy with manual work and firefighting to roll out new features promptly. Sets precedent If you’re too willing to take on toil, your Dev counterparts will have incentives to load you down with even more toil, sometimes shifting operational tasks that should rightfully be performed by Devs to SRE. Promotes attrition Even if you’re not personally unhappy with toil, your current or future teammates might like it much less. If you build too much toil into your team’s procedures, you motivate the team’s best engineers to start looking elsewhere for a more rewarding job.
  • 48. 48 How to Reduce Toils? Identify toil and reduce it as far as you can. Let's take an example: Identify what your team members spend 80% of their on-the-job time on. Check whether it can be automated. If yes, automate it with suitable tools; if not, identify what else can be done to improve the process. Keep improving the existing state and service. Set goals for engineering tasks too, e.g. raising the internal SLO from 99.9% to 99.95%, then identifying the ways to get there and implementing the procedure for it. Way to SRE
  • 50. 50 Case Study – Reducing Toils One of my clients, “X”, works in the contact center field: they deploy contact center services for end clients and manage them on the clients' behalf. For every new client, deploying and building the infrastructure was a very hectic task (15+ servers with 10+ microservices, multiple LBs, cache servers, DBs, security implementations). Similarly, expanding an existing client environment was difficult and time-consuming. Even hardware capacity planning was becoming challenging. 80-90% of the teams (including developers) were involved in deploying new services or expanding environments, and handling issues for existing clients was becoming difficult. Teams were suffering from the repeated tasks and the pressure they were under. • The company took the hard decision and migrated to cloud services to avoid hardware bottlenecks. • Terraform (a DevOps IaC tool) was used to automate the deployment. The same infra deployment, which used to take 1 month to design and get ready, is now up and running in less than an hour. • The same team members are free from pressure and happy to invest their time in enhancing features, reducing bugs and automating the environment further. • Even during the pandemic, they thrived with a 200% increase in customer onboarding, without any hassle. Reducing Toils
  • 52. 52 Exercise Reducing Toils Do you foresee any toil in your team? If yes, what are the benefits of reducing it? How can it be reduced? Are daily mails and project reports toil? Have you considered the ROI?
  • 53. 53 Q & A Reducing Toils Q & A Questionnaire
  • 54. 54 54 Module 4: Monitoring & Service Level Indicators
  • 55. 55 Monitoring Way to SRE • Monitoring: Collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as query counts and types, error counts and types, processing times, and server lifetimes. • White-box monitoring: Monitoring based on metrics exposed by the internals of the system, including logs, interfaces like the Java Virtual Machine Profiling Interface, or an HTTP handler that emits internal statistics. • Black-box monitoring: Testing externally visible behavior as a user would see it. • Alert: A notification intended to be read by a human and that is pushed to a system such as a bug or ticket queue, an email alias, or a pager. Respectively, these alerts are classified as tickets, email alerts, and pages. • Root cause: A defect in a software or human system that, if repaired, instills confidence that this event won’t happen again in the same way.
  • 56. 56 SLIs - Service Level Indicators Monitoring SRE doesn’t typically get involved in constructing SLAs; however, SRE does get involved in helping to avoid triggering the consequences of missed SLOs. To achieve that, proper monitoring with the right SLIs is very important. What Do You and Your Users Care About? • User-facing serving systems generally care about availability, latency, and throughput: Could we respond to the request? How long did it take to respond? How many requests could be handled? • Storage systems often emphasize latency, availability, and durability. • Big data systems, such as data processing pipelines, tend to care about throughput and end-to-end latency (how much data is being processed, and how long does it take to process?). • All systems should care about correctness: was the right answer returned, the right data retrieved, the right analysis done? Collecting Indicators • Aggregate the metric for better usage (averages vs. instantaneous values). • Standardize indicators (over a period of time, e.g. average packets per minute). • Collect indicators at the server as well as the client end. E.g. application latency for 99.9% of users in the last 5 min should be less than 100 ms.
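The example SLI on this slide (latency under 100 ms for 99.9% of requests in the last 5 minutes) can be sketched in a few lines. This is only an illustration: the `Request` record, the function names, and the per-request measurement (rather than per-user) are assumptions, not something taken from the training material.

```python
from dataclasses import dataclass


@dataclass
class Request:
    timestamp: float   # seconds since epoch (hypothetical record layout)
    latency_ms: float  # observed response time


def latency_sli(requests, now, window_s=300.0, threshold_ms=100.0):
    """Fraction of requests in the last `window_s` seconds faster than `threshold_ms`."""
    recent = [r for r in requests if now - window_s <= r.timestamp <= now]
    if not recent:
        return None  # no traffic in the window, so the SLI is undefined
    good = sum(1 for r in recent if r.latency_ms < threshold_ms)
    return good / len(recent)


def meets_slo(sli, target=0.999):
    """True when the measured SLI reaches the SLO target."""
    return sli is not None and sli >= target
```

Returning `None` for an empty window is a design choice: reporting 100% or 0% for a period with no traffic would distort the trend.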
  • 57. 57 Why Monitor? Way to SRE Analyzing long-term trends Comparing over time Alerting Building dashboards Conducting ad hoc retrospective analysis
  • 58. 58 Monitoring with proper SLI Metric Way to SRE Always make sure to select the right Service Level Indicator metric to track and alert on. Set alerts (with the respective criticality: pager, email, etc.) for all your SLO targets, and create simple and meaningful dashboards for higher visibility. Four golden signals to track: Latency, Traffic, Errors, Saturation (IO, memory, CPU, etc.)
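As a rough sketch of how the four golden signals might be derived from raw request samples: the `Sample` record, the choice of the 99th percentile for latency, the treatment of HTTP 5xx as errors, and the resource names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    latency_ms: float
    status: int  # HTTP status code


def percentile(values, pct):
    """Nearest-rank percentile; `pct` is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct * n / 100
    return ordered[rank - 1]


def golden_signals(samples, window_s, utilization):
    """Summarize latency, traffic, errors, and saturation for one time window.

    `samples` must be non-empty; `utilization` maps a resource name
    (e.g. "cpu", "disk_io") to a 0..1 utilization value.
    """
    errors = sum(1 for s in samples if s.status >= 500)
    busiest = max(utilization, key=utilization.get)
    return {
        "latency_p99_ms": percentile([s.latency_ms for s in samples], 99),
        "traffic_rps": len(samples) / window_s,
        "error_rate": errors / len(samples),
        "saturation": (busiest, utilization[busiest]),
    }
```

Reporting saturation as the most utilized resource, rather than CPU alone, keeps resources such as disk I/O visible on the dashboard.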
  • 59. 59 SLO Improvements Way to SRE VALET The Home Depot, in a case study published in Google's Site Reliability Workbook, summed up its new SLOs into a handy acronym: VALET. Volume (traffic): How much business volume can my service handle? Availability: Is the service up when I need it? Latency: Does the service respond fast when I use it? Errors: Does the service throw an error when I use it? Tickets: Does the service require manual intervention to complete my request? Use telemetry tools to collect SLI metrics from remote servers into a centralized monitoring server to generate graphs.
  • 60. 60 Case Study - Genpact Monitoring and SLI At client “X”, we configured autoscaling to trigger at CPU utilization > 80% and took the system live in production. Performance/load testing was also done and passed. But after 6 months, during a heavy peak load, the system didn’t autoscale and crashed. During the RCA, we found that disk I/O was saturated, which we had not been monitoring and had no alerts or autoscaling for. We replaced the HDDs with SSDs on the servers to fix the issue and also enabled monitoring and autoscaling on disk metrics.
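The lesson of this case study, that a scaling rule watching only CPU misses other saturated resources, can be illustrated with a small sketch. The function name, resource names, and threshold values are hypothetical; a real setup would express the same rule in the autoscaler of the cloud platform in use.

```python
def breached_resources(utilization, thresholds):
    """Return the resources whose utilization (0..1) exceeds their threshold.

    A non-empty result means "scale out (or alert) now".
    Resources missing from `utilization` are treated as idle.
    """
    return [
        name
        for name, limit in thresholds.items()
        if utilization.get(name, 0.0) > limit
    ]


# Hypothetical reading taken during the peak described above.
peak = {"cpu": 0.35, "disk_io": 0.97}

cpu_only_rule = {"cpu": 0.80}
multi_signal_rule = {"cpu": 0.80, "disk_io": 0.85, "memory": 0.90}
```

With `cpu_only_rule`, the peak reading triggers nothing; with `multi_signal_rule`, `disk_io` is flagged and scaling can start before the system falls over.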
  • 61. 61 Video Monitoring and SLI SLI/SLO and reliability Deep Dive https://www.youtube.com/watch?v=dplGoewF4DA
  • 62. 62 High Availability and Capacity Planning
  • 63. 63 High Availability A Must know Serving millions of requests from a single server is not possible, even if it is a supercomputer. Hence, we need horizontal scaling (adding more servers to handle the requests). Traffic load balancing is the solution for managing heavy traffic: distributing traffic across multiple network links, datacenters, and machines in an "optimal" fashion. Multiple factors affect HA: • The hierarchical level at which we evaluate the problem (global versus local) • The technical level at which we evaluate the problem (hardware versus software) • The nature of the traffic we’re dealing with The nature of the requests and the handling techniques play an important role here.
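To make the idea of distributing traffic across machines concrete, here is a minimal round-robin sketch that skips backends marked unhealthy. It is one simple policy among many (not the "optimal" distribution the slide refers to), and the class and method names are assumptions for illustration.

```python
class RoundRobinBalancer:
    """Hand out backends in turn, skipping any that are marked down."""

    def __init__(self, backends):
        self._backends = list(backends)
        self._healthy = set(self._backends)
        self._next = 0  # index of the next backend to try

    def mark_down(self, backend):
        self._healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self._backends:
            self._healthy.add(backend)

    def pick(self):
        # Try each backend at most once per call.
        for _ in range(len(self._backends)):
            candidate = self._backends[self._next]
            self._next = (self._next + 1) % len(self._backends)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy backends available")
```

In production this role is usually played by a dedicated load balancer or DNS-based routing; the sketch only shows why adding servers plus health checks keeps the service available when one machine fails.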
  • 64. 64 High Availability techniques A Must know Techniques to handle the High availability/DR: • Clustering • Load Balancing with VIP • Microservice architectures ( with Containerization Approach) • Passive DR Sites • Load Balancing Using DNS • Content Delivery Networks (for low latency)
  • 65. 65 Thinknyx Technologies High Availability & Burst handling [Diagram labels: Premises (Application Server); Cloud (Autoscaling Group of Application Servers)] Way to SRE
  • 66. 66 66 Thinknyx Technologies LB for High Availability Way to SRE
  • 68. 68 Disaster Recovery Operations On-Prem, Cloud as DR: The application is hosted in a self-managed datacenter and the backup is hosted in the cloud. Operations in Cloud, Third-party DR: The application is hosted in a cloud datacenter and the backup is hosted with a third-party cloud service provider or a third-party backup service provider. Operations in Cloud, Cloud as DR: The application is hosted in a cloud datacenter and the backup is hosted in a second region/datacenter of the same cloud. [Diagram labels: On-Prem, X – Cloud DC 1, X – Cloud DC 2, Y – Cloud DC 1, APP, Backup] Way to SRE
  • 69. 69 Business Continuity 1. Physical Location: Consider all natural disasters and their range before finalizing a DR location. 2. Logical Location: Never put all your eggs in a single basket. 3. Who’ll do what: Declaration of roles & responsibilities, emergency process, backup ready. 4. DR Drill: Test BC/DR at least annually. 5. Priorities: Define application criticality and failover priorities in advance. Health and human safety should be the primary concern. Way to SRE
  • 70. 70 Exercise Monitoring and SLI What do you monitor now, and which reliability aspects have you considered? Is performance monitoring in place? Is DR monitoring/activation in place? What can we monitor, where, and how? Risk factors?
  • 71. 71 Q & A Monitoring and SLI Q & A Did you consider the importance of monitoring in HA/DR?
  • 73. 73 Automation Way to SRE Automation is the key for any organization to thrive. For SRE, automation is a force multiplier, not a panacea. Of course, just multiplying force does not naturally change the accuracy of where that force is applied: doing automation thoughtlessly can create as many problems as it solves. Consistency A Platform Faster Repairs Faster Actions Time Saving Ease and Effectiveness
  • 74. 74 Automation Focus Way to SRE SRE has a number of philosophies and products in the domain of automation, some of which look more like generic rollout tools without particularly detailed modeling of higher-level entities, and some of which look more like languages for describing service deployment (and so on) at a very abstract level. Some use cases: • User account creation • Cluster turnup and turndown for services • Software or hardware installation preparation and decommissioning • Rollouts of new software versions The list is endless; it is just a matter of identifying priorities and automating tasks one by one. The end goal is to create an autonomous system, which runs and manages itself. For example, a system should not just trigger alerts and try to bring the services up on the same server; if the services do not come up there, it should fail over on its own to another, healthier system. The automation system must be secure and reliable. Automation works at scale, so if something goes wrong, the destruction will also be at scale.
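The autonomous behavior described on this slide (try to recover in place, then fail over without waiting for a human) could look roughly like the sketch below. The `restart`, `start`, and `is_healthy` callables are hypothetical hooks that a real system would back with its orchestrator or cluster manager; nothing here refers to a specific tool.

```python
def recover(service, primary, standbys, restart, is_healthy, start, max_restarts=2):
    """Try to heal `service` on `primary`; if that fails, fail over to a standby.

    Returns the host the service ends up running on.
    Raises RuntimeError when no host can run the service (time to page a human).
    """
    # Step 1: attempt in-place recovery a limited number of times.
    for _ in range(max_restarts):
        restart(service, primary)
        if is_healthy(service, primary):
            return primary

    # Step 2: in-place recovery failed, so fail over to the first healthy standby.
    for host in standbys:
        start(service, host)
        if is_healthy(service, host):
            return host

    raise RuntimeError(f"{service}: recovery and failover both failed")
```

Capping the restart attempts matters: since automation works at scale, an unbounded restart loop would repeat the same failure at scale too.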
  • 75. 75 Automation Hierarchy Way to SRE Nowadays, tools are available in the market to automate the majority of the tasks and events we want to manage; yet a few things may still need customized automation, and for those we can follow custom paths. For example, the evolution path of database failover automation towards an autonomous environment: 1) No automation: The database master is failed over manually between locations. 2) Externally maintained system-specific automation: An SRE has a failover script in his or her home directory. 3) Externally maintained generic automation: The SRE adds database support to a "generic failover" script that everyone uses. 4) Internally maintained system-specific automation: The database ships with its own failover script. 5) Systems that don’t need any automation (autonomous system): The database notices problems and automatically fails over without human intervention. SREs hate manual operations, so they obviously try to create systems that don’t require them. However, sometimes manual operations are unavoidable (DR activation, production push, etc.).
  • 76. 76 Secure Automation Way to SRE Insecure automation can be dangerous too. Learn from the example of “Code Spaces” and other companies where AWS access keys/secret keys (AK/SK) were leaked and disaster followed. Are you keeping usernames and passwords in your container images? Who in your organization has access to these images? Are you keeping API keys in code, and that code on GitHub/Bitbucket? Zero-touch automation is the final goal for an SRE. We have to build security into it at every layer, as multiple tools get involved. For example, Google replaced its use of sshd with an authenticated, ACL-driven, RPC-based Local Admin Daemon, also known as Admin Servers, which had permissions to perform those local changes. As a result, no one could install or modify a server without an audit trail. The CIA principles (confidentiality, integrity, availability) are important to implement in automation too.
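One small, concrete step away from credentials baked into code or images is to read them from the runtime environment (populated by a secret manager or the orchestrator) and to keep them out of logs. The sketch below assumes a hypothetical variable name `DEPLOY_API_KEY`; the helper names are illustrative as well.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is not provided at runtime."""


def load_secret(name, environ=os.environ):
    """Fetch a secret from the environment instead of hardcoding it in code or images."""
    value = environ.get(name)
    if not value:
        raise MissingSecretError(f"secret {name!r} is not set")
    return value


def redact(secret, visible=4):
    """Mask a secret for logging, keeping only the last `visible` characters."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


# Usage (hypothetical variable name):
#   api_key = load_secret("DEPLOY_API_KEY")
#   print("using key", redact(api_key))
```

Failing fast on a missing secret is deliberate: an automation job that silently continues with an empty credential is harder to audit than one that stops.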
  • 77. 77 Automation Tools Way to SRE Though there is no defined set of tools for an SRE, it is always better to have widely used tools in the basket. A few categories, and tools that are well known in the market for their results in each area, are listed below: Version Control System: TFVC, Git (GitLab, GitHub, Bitbucket, Azure DevOps) Pipelining for CI/CD: Jenkins, Azure DevOps, TeamCity, Bamboo Automated Deployment: Octopus Deploy, UrbanCode Deploy Configuration Management: Ansible, Chef, Puppet, SaltStack Container and Orchestration: Docker (Kubernetes, Docker Swarm, OpenShift) Automation-oriented languages: Python, Java DevOps Infrastructure as Code: Terraform, CloudFormation, Azure Templates Continuous Testing: JMeter, SonarQube, Selenium
  • 78. 78 Video SRE tools and Automation SREcon19 Asia/Pacific - Ironies of Automation (Microsoft) https://www.youtube.com/watch?v=U3ubcoNzx9k
  • 79. 79 Case Study SRE Tools & Automation Amazon automated the leasing of servers for rent and the metering of their usage; gradually, over a period of time, this became a cloud with hundreds of services provisioned/released via a self-service portal. 50 million changes were made to the AWS cloud in 2016, which is more than 1 change in production per second (50 million changes / ~31.6 million seconds ≈ 1.6 per second). Distributed systems with APIs and queues are the best way to scale with automation.
  • 80. 80 Exercise SRE Tools & Automation Automation “Greatest Hits” – Uber, Airbnb, Ola, Olx, AWS … How much automation do you have, and what can be automated? It’s as simple as: “Anything that you do more than twice has to be automated.” -Adam Stone, CEO, D-Tools
  • 81. 81 Q & A SRE Tools & Automation Q & A Questionnaire
  • 82. 82 Module 6: Anti-Fragility & Learning from Failure
  • 83. 83 Anti-fragility: Learning from Failure Way to SRE “The cost of failure is education.” Devin Carraway Anti-fragility is all about understanding disorder and using it to your advantage. It is a property of systems in which they increase in capability to thrive as a result of stressors, shocks, volatility, noise, mistakes, faults, attacks, or failures. Postmortems are an essential tool for SRE (to make the system resilient and reliable). When an incident occurs, we fix the underlying issue, and services return to their normal operating conditions. Unless we have some formalized process of learning from these incidents in place, they may reoccur. A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring. The primary goals of writing a postmortem are to ensure that the incident is documented, that all contributing root cause(s) are well understood, and, especially, that effective preventive actions are put in place to reduce the likelihood and/or impact of recurrence.
  • 84. 84 Anti-fragility: Shifting the organizational balance Way to SRE The antifragile loves randomness and uncertainty. Anti-fragility is a concept which encompasses the idea that things need chaos and disorder in order to thrive and flourish. Whatever doesn’t kill us makes us stronger: the notion is that we shouldn’t construct our lives or our plans against randomness and misfortune; rather, we should adopt anti-fragility as a means of maneuvering through disorder. Similarly, we should plan our systems to withstand failures (planned or unplanned). We can even inject failures into our system deliberately to understand how well it withstands them, i.e. how anti-fragile it is. We can schedule planned downtimes and activities that simulate failure, learn from them, and make our system more robust. For example: pulling the network cable of a server or shutting down the UPS to understand the impact. But we must first understand the error budget and the cost of failure before we plan such failure activities. Such activities definitely add learning and robustness to our system, but we must keep a balance between the error budget and enhancements.
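The balance between error budget and failure experiments can be made explicit with a small calculation. The sketch below is an assumption-laden illustration: it measures the budget in failed requests, and the `reserve` fraction and the estimated experiment cost are hypothetical policy knobs rather than values from the training.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent (1.0 = untouched, <= 0 = exhausted)."""
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0  # a 100% target (or no traffic) leaves no budget to spend
    return 1 - failed_requests / allowed_failures


def may_run_experiment(remaining, expected_cost, reserve=0.2):
    """Allow a planned failure only if a safety reserve of budget survives it.

    `expected_cost` is the fraction of the budget the experiment is estimated to burn.
    """
    return remaining - expected_cost >= reserve
```

For a 99.9% SLO over 1,000,000 requests, about 1,000 failures are allowed; with 250 already spent, roughly 75% of the budget remains, so an experiment expected to cost 10% would pass this check.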
  • 85. 85 Postmortem Culture : Learning from Failure Way to SRE The postmortem process does present an inherent cost in terms of time or effort, so we can be deliberate in choosing when to write one. Teams have some internal flexibility, but common postmortem triggers include: • User-visible downtime or degradation beyond a certain threshold • Data loss of any kind • On-call engineer intervention (release rollback, rerouting of traffic, etc.) • A resolution time above some threshold • A monitoring failure (which usually implies manual incident discovery) • A stakeholder request for a postmortem of an event
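The trigger list above can be encoded as a simple check so that the decision to write a postmortem is consistent across incidents. The `Incident` fields and the two threshold values are hypothetical; each team would choose its own.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    user_visible_minutes: float = 0.0
    data_lost: bool = False
    oncall_intervened: bool = False
    resolution_minutes: float = 0.0
    found_by_monitoring: bool = True
    stakeholder_requested: bool = False


def postmortem_triggers(incident, max_visible_minutes=15.0, max_resolution_minutes=60.0):
    """Return the reasons (possibly none) why this incident needs a postmortem."""
    reasons = []
    if incident.user_visible_minutes > max_visible_minutes:
        reasons.append("user-visible downtime or degradation")
    if incident.data_lost:
        reasons.append("data loss")
    if incident.oncall_intervened:
        reasons.append("on-call intervention")
    if incident.resolution_minutes > max_resolution_minutes:
        reasons.append("slow resolution")
    if not incident.found_by_monitoring:
        reasons.append("monitoring failure")
    if incident.stakeholder_requested:
        reasons.append("stakeholder request")
    return reasons
```

An empty list means no trigger fired; returning the reasons instead of a plain yes/no gives the postmortem author a starting point.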
  • 86. 86 Blameless Postmortems Way to SRE “Blameless postmortems” is a Principle of SRE culture. For a postmortem to be truly blameless, it must focus on identifying the contributing causes of the incident without indicting any individual or team for bad or inappropriate behavior. A blamelessly written postmortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had. If a culture of finger pointing and shaming individuals or teams for doing the "wrong" thing prevails, people will not bring issues to light for fear of punishment. When postmortems shift from allocating blame to investigating the systematic reasons why an individual or team had incomplete or incorrect information, effective prevention plans can be put in place. You can’t "fix" people, but you can fix systems and processes to better support people making the right choices when designing and maintaining complex systems.
  • 87. 87 Best Practice Way to SRE Avoid Blame and Keep It Constructive. Collaborate and Share Knowledge No Postmortem Left Unreviewed Introduce a Postmortem Culture Visibly Reward People for Doing the Right Thing Ask for Feedback on Postmortem Effectiveness Continuous improvement Postmortem should have clearly defined ownership, priority, preventive actions and Action taken
  • 88. 88 Case Study Anti-Fragility Netflix Simian Army https://netflixtechblog.com/the-netflix-simian-army-16e57fbab116 https://github.com/netflix/chaosmonkey
  • 89. 89 Failures Anti-Fragility Is failure really bad, for organizations and individuals? Consider Elon Musk and Colonel Sanders ☺
  • 90. 90 Exercise Anti-Fragility Share an example of a problem ticket within your team in which you were involved and where many incidents came from a single root cause. Share the incidents, the business impact, the root cause, the course of action taken to resolve it, and, if it was a configuration issue, whose mistake it was.
  • 91. 91 Q & A Anti-Fragility Q & A Questionnaire
  • 93. 93 Organizations Embracing SRE Way to SRE Availability Reliability Capacity Planning Happy Customers Cost Effectiveness due to less failure Velocity Continuous Improvements
  • 94. 94 Typical ORG Chart Way to SRE Specialized Reliability Engineers Specialized Reliability Engineers Specialized Reliability Engineers Site Reliability Engineers (TL) Site Reliability Engineers (TL) Manager SRE Team DB Admins DB Manager DB Admins Systems Manager OS Admins OS Admins Operations Dev Dev Manager Dev Q&A Manager QA Admin QA Admin Dev/Prod Team
  • 95. 95 SRE Responsibilities Way to SRE Tasks (SRE Team): Architecture Design Approvals and Consulting (RC); Instrumentation, Metrics, and Monitoring (CI); Maintaining SLI (CI); SLO/SLI tracking and management (CR); Handling Incidents (CI); Repeated Incidents and Problem Management (RA); Capacity Planning (CR); Change Management (CI); Critical/Large Scale Changes (CR); Performance: availability, latency, and efficiency (R); Automation (RA); Innovation (RA); Release Management (C); Release Repeated Failures (R); Test Management (C); Test Repeated Failures (R); Standardization of Tools/Software/Process/Technologies (RA); Supporting Presales and Sales (CR); Training other Team Members (R)
  • 96. 96 SRE Engagement and Adoption Way to SRE SRE seeks production responsibility for important services for which it can make concrete contributions to reliability. SRE is concerned with several aspects of a service, which are collectively referred to as production. These aspects include the following: • System architecture and interservice dependencies • Instrumentation, metrics, and monitoring • Emergency response • Capacity planning • Change management • Performance: availability, latency, and efficiency When SREs engage with a service, they aim to improve it along all of these axes, which makes managing production for the service easier.
  • 97. 97 SRE & Scale Way to SRE The bigger the operations/systems, the more autonomous they should be. If 10 engineers handle 100 servers, we shouldn’t need 100 engineers to handle 1000 servers. SRE is all about automation, improvements and reliability in the system. A bigger environment means more effectiveness from SRE. The need for manual tasks reduces over time due to automation, yet enhancing the autonomous system and improving it further is a continuous process.
  • 98. 98 Testing Way to SRE “If you haven't tried it, assume it's broken.” One key responsibility of Site Reliability Engineers is to quantify confidence in the systems they maintain. SREs perform this task by adapting classical software testing techniques to systems at scale. Testing is the mechanism we use to demonstrate specific areas of equivalence when changes occur. Each test that passes both before and after a change reduces the uncertainty for which the analysis needs to allow. Thorough testing helps us predict the future reliability of a given site with enough detail to be practically useful. Passing a test or a series of tests doesn’t necessarily prove reliability. However, tests that are failing generally prove the absence of reliability. How failures are measured: Mean Time to Repair (MTTR) measures how long it takes the operations team to fix the bug, either through a rollback or another action. Mean Time Between Failures (MTBF) measures how long the service runs correctly between one failure and the next.
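A worked sketch of the two measures, assuming outages are recorded as hypothetical `(start_hour, end_hour)` pairs within an observation period. Here MTBF is taken as total uptime divided by the number of failures, and availability is derived as MTBF / (MTBF + MTTR); other conventions exist.

```python
def mttr(outages):
    """Mean Time to Repair: average outage duration, in hours."""
    return sum(end - start for start, end in outages) / len(outages)


def mtbf(outages, period_hours):
    """Mean Time Between Failures: total uptime divided by the number of failures."""
    downtime = sum(end - start for start, end in outages)
    return (period_hours - downtime) / len(outages)


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability implied by MTBF and MTTR."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical 30-day (720 h) period with three outages of 2 h, 1 h and 3 h.
outages = [(100, 102), (400, 401), (600, 603)]
```

For this sample, MTTR is 2 hours, MTBF is 238 hours, and the implied availability is 238 / 240, about 99.17%.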
  • 99. 99 99 SW Testing Classification Manual Automated Testing Type Static Dynamic Testing Methods Unit Testing Integration Testing System Testing Acceptance testing Testing Levels Black Box White Box Grey Box Testing Approach
  • 100. 100 Managing Incidents Way to SRE Effective incident management is key to limiting the disruption caused by an incident and restoring normal business operations as quickly as possible. As an SRE you are also expected to be on-call (again, for a limited share of your time) and to handle incidents. When on-call, an engineer is available to perform operations on production systems within minutes, according to the paging response times agreed to by the team and the business system owners. Typical values are 5 minutes for user-facing or otherwise highly time-critical services, and 30 minutes for less time-sensitive systems. Google strongly believes in investing at least 50% of SRE time in engineering; of the remainder, no more than 25% can be spent on-call, leaving up to another 25% for other types of operational, non-project work. The most important on-call resources are: • Clear escalation paths • Well-defined incident-management procedures • A blameless postmortem culture
  • 101. 101 Emergency Response Way to SRE “Things break; that’s life.” How employees respond to an emergency shows the maturity of the process and the long-term health of the organization. In the IT industry, an organization’s long-term success depends heavily on this one factor. What to Do When Systems Break First of all, don’t panic! If you feel overwhelmed, pull in more people. Follow the incident response process. Take a deep breath and try to understand the situation and the cause of the failure, or correlate the sources in case of multiple failures. Types of emergency: Test-Induced Emergency, Change-Induced Emergency, Process-Induced Emergency. Some important pointers: • Keep a History of Outages • Ask the Big, Even Improbable, Questions: What If…? • Encourage Proactive Testing
  • 102. 102 Videos Organizational Impact A history of SRE at Uber: https://www.youtube.com/watch?v=qJnS-EfIIIE
  • 103. 103 Case Study - OBS Organizational Impact Orange Business Services – The Flexible Engine
  • 104. 104 Exercise Organizational Impact Why do you want to adopt SRE? Who in your organization currently provides SRE? Your organizational plan for SRE?
  • 105. 105 Q & A Organizational Impact Q & A Questionnaire
  • 106. 106 Module 8: SRE, Other Frameworks, & The Future
  • 107. 107 Transforming Culture Way to SRE Site Reliability Engineering (SRE) proclaims many advantages for distributed systems. It improves infrastructure automation, increases reliability, and transforms incident management. Instead of putting an individual at the centre, we have a specialized team at the centre, which acts as a hub of collaboration and communication in the organization. Embracing Risk Learning From Failure Better collaboration and communication Automation at the centre, which benefits the complete organization Standardization of tools / technologies / process Centralized documentation Consultation and Trainings
  • 108. 108 SRE with Other frameworks Way to SRE SRE works well with all major existing process and culture concepts, like: DevOps Agile Scrum Lean ITIL PMP While many of the above are largely conceptual, SRE puts those concepts into practice on the ground. It’s a path to create a stress-free, autonomous, reliable environment with tremendous velocity.
  • 109. 109 SRE Evolution Way to SRE Google coined the term “site reliability engineer” in 2003, but the role has certainly existed for decades in different forms, such as disaster recovery and production testers. Ways the SRE Approach is Evolving: 1. Increased Adoption 2. Larger, Diversified SRE Departments 3. New Testing Tactics emerge, e.g. Chaos Monkey 4. Businesses Rely on SREs to Mitigate Risk Currently, the SRE approach is being widely adopted by organizations to achieve high uptime and stability for their applications, as even 1 minute of downtime can cost many MNCs millions of dollars.
  • 110. 110 Videos SRE and other frameworks A Look at ITIL4 & SRE https://www.youtube.com/watch?v=vFyPXIsUEhE
  • 111. 111 Case Study – VictorOps SRE and other frameworks Victor Ops
  • 112. 112 Exercise SRE and other frameworks Where do you see the future of SRE heading? Sketch out your understanding of SRE and the requirements for the job role.
  • 113. 113 Q & A SRE and other frameworks Q & A Questionnaire
  • 114. 114