Independently of the DevOps movement, but starting from the same problems, Google developed its own strategy and defined a new, specific role: the SRE (Site Reliability Engineer). This introduction explains the history and concepts of this methodology and compares it with the DevOps manifesto, to clarify what it means to adopt DevOps, what it means to be an SRE, what the two share, and where they diverge.
Site Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
When it comes to Site Reliability Engineering (SRE), the resources available online are largely limited to the books published by Google itself. These share useful case studies that help us understand what SRE is and the concepts behind it, but they do not clearly explain how to build an SRE team for your own organization. The concept of SRE was cooked fresh within the walls of Google and later released to the general public as a practice for anyone to follow.
In this presentation I would like to give a brief introduction to SRE and why it is important to any Software Engineering organization. This is based on my experiences and learnings from leading a Site Reliability Engineering team for leading organizations in the US and Norway.
I delivered this presentation as a Tech Talk while working as an Associate Technical Lead at Creative Software Sri Lanka.
An overview of Google's Site Reliability Engineering with a view toward possible incorporation in the IEEE P2675 DevOps security standard. (Creative Commons with credit.)
This document discusses site reliability engineering (SRE) for growing organizations. SRE focuses on production automation, resiliency and scalability, similar to devops but with more emphasis on keeping systems running. As companies grow, expectations often outpace capacity and complexity increases, requiring more automation rather than personnel to maintain high uptime levels. A dedicated SRE team can improve reaction times, learn from incidents, raise awareness of system behaviors, and focus on forward-looking improvements rather than just keeping existing systems running. Key SRE practices include automated monitoring, log indexing, health checks, establishing service level objectives and agreements, and implementing self-healing systems and runbooks.
SRE (Site Reliability Engineering) is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of services. An SRE team uses an "error budget" approach where new features can be launched if the service is within its agreed SLA, but launches are frozen if the SLA is not being met until enough of the error budget is earned back. SRE teams hire only coders who can speak the same language as developers and rotate developers into operations work. The goal of SRE is to minimize impact and prevent recurrence of outages through practices like post-mortem analysis and constant improvement of processes.
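As a rough illustration of the error budget approach described above, the following Python sketch decides whether launches may proceed in a measurement window. The class name, the field names, and the request-based way of counting failures are assumptions made for this example and are not taken from the summarized presentation.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Error budget for one measurement window (all names are illustrative)."""

    target: float          # agreed availability target, e.g. 0.999
    total_requests: int    # requests served in the window
    failed_requests: int   # requests that violated the target

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests allowed to fail under the target.
        return (1.0 - self.target) * self.total_requests

    @property
    def remaining(self) -> float:
        # Budget left after subtracting the failures already observed.
        return self.allowed_failures - self.failed_requests

    def launches_allowed(self) -> bool:
        # Launches proceed while budget remains; otherwise they are frozen.
        return self.remaining > 0


if __name__ == "__main__":
    healthy = ErrorBudget(target=0.999, total_requests=1_000_000, failed_requests=400)
    print("launches allowed:", healthy.launches_allowed())
```

A real policy would also define the window length and how budget is earned back, which this sketch leaves out.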
SRE-iously! Defining the Principles, Habits, and Practices of Site Reliabilit... by Tori Wieldt
The document discusses the principles, habits, and practices of site reliability engineering (SRE) at New Relic. It describes New Relic's transition from a monolithic architecture with siloed teams to a microservices architecture with 200+ services and embedded SREs on engineering teams. The goals of SREs at New Relic are to continuously improve the reliability of their platform through two main roles: "pure" SREs who build core platforms and embedded SREs who partner with engineering teams. SREs focus on three spheres: stability, reliability, and engineering.
Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W... by ITSM Academy, Inc.
Presenter: Perry Statham
SRE Squad Leader with IBM Cloud DevOps Services
In this presentation, the IBM DevOps Services SRE team will give a brief introduction to Site Reliability Engineering, then show how they adopted its principles in their existing enterprise organization.
How Small Team Get Ready for SRE (public version) by Setyo Legowo
This document discusses how small teams can get ready for Site Reliability Engineering (SRE). It describes the challenges faced by a small engineering team at a company with around 100 employees and 10 engineers. To address issues with productivity, reliability, and deployment speed, the team implemented several initiatives including adopting SCRUM, adding automated testing, simplifying deployments, and creating easy-to-use development environments. While these changes helped, the team knows there is still work needed in areas like data center operations and establishing formal SLAs and incident management processes as the company and services grow. The presentation concludes by discussing why SRE is preferable to just DevOps and provides resources for further learning.
This document provides an introduction to Site Reliability Engineering (SRE). It lists the credentials and background of Diego Pacheco, including his roles as a cat's father, principal software architect, agile coach, and expert in SOA/microservices, DevOps, and observability. The document then defines SRE as "what happens when you ask a software engineer to design an operations function" and outlines some key aspects of SRE culture, including MTTD, MTTR, error budgets, jitter retries, exponential back-off, the "You build it you run it" mindset, and production readiness.
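The retry techniques named above, exponential back-off and jitter, can be sketched in a few lines of Python. This is a minimal sketch and not code from the summarized presentation: the function names, the default base and cap values, and the choice of "full jitter" (a uniformly random delay below an exponentially growing ceiling) are all assumptions for illustration.

```python
import random
import time


def backoff_delays(retries, base=0.1, cap=10.0, rng=random.random):
    """Yield one randomized delay in seconds per retry (hypothetical helper)."""
    for attempt in range(retries):
        # Exponential growth, capped so delays stay bounded.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: scale the ceiling by a random factor to spread out retries.
        yield rng() * ceiling


def call_with_retries(operation, attempts=5, sleep=time.sleep):
    """Call operation(), retrying on any exception with jittered back-off."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delays = backoff_delays(attempts - 1)
    while True:
        try:
            return operation()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                raise  # out of retries: surface the last error
            sleep(delay)
```

The `sleep` and `rng` parameters are injectable so the behavior can be checked without real waiting. Production code would normally retry only on errors known to be transient rather than on every exception.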
This document provides an introduction to Site Reliability Engineering (SRE). It discusses DevOps principles and how SRE relates to and implements DevOps. Key aspects of SRE covered include guiding principles like eliminating toil, embracing risk, and measuring services through SLIs, SLOs, and error budgets. Specific SRE practices mentioned are removing toil, defining system criticalities, designing for availability, observability, chaos engineering, restricting production access, and focusing on metrics like MTTR and MTBF.
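The MTTR and MTBF metrics mentioned above can be computed from a list of incidents. The sketch below assumes each incident is a `(start, end)` pair of `datetime` values and uses one common definition of MTBF (operating time divided by number of failures). Both the data shape and the function names are assumptions for this example, and other definitions of these metrics exist.

```python
from datetime import datetime, timedelta


def total_downtime(incidents):
    """Sum the durations of (start, end) incident pairs."""
    return sum((end - start for start, end in incidents), timedelta())


def mttr(incidents):
    """Mean time to repair: average incident duration."""
    if not incidents:
        raise ValueError("at least one incident is required")
    return total_downtime(incidents) / len(incidents)


def mtbf(incidents, window):
    """Mean time between failures: uptime in the window per incident."""
    if not incidents:
        raise ValueError("at least one incident is required")
    return (window - total_downtime(incidents)) / len(incidents)


if __name__ == "__main__":
    # Hypothetical incident log for a ten-day observation window.
    log = [
        (datetime(2024, 1, 2, 10, 0), datetime(2024, 1, 2, 10, 30)),
        (datetime(2024, 1, 7, 14, 0), datetime(2024, 1, 7, 15, 30)),
    ]
    print("MTTR:", mttr(log))
    print("MTBF:", mtbf(log, timedelta(days=10)))
```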
The document discusses how to implement site reliability engineering (SRE) practices without having an SRE on the team. It recommends starting with setting service level indicators (SLIs) and service level objectives (SLOs), establishing rules for how the system should function and an incident response process, and always performing postmortems after incidents to document failures and fixes. The document also provides some best practices for code deployment, managing on-call responsibilities, and finding automation opportunities. Useful resources for learning more about SRE are also listed.
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD... by DevOpsDays Tel Aviv
This document discusses best practices for site reliability engineering (SRE). It recommends hiring only coders, establishing service level agreements (SLAs) and measuring performance against them. It also suggests using error budgets, maintaining a common staffing pool for SRE and development teams, ensuring on-call teams have at least 8 people, and conducting post-mortems after every incident. Key reliability metrics like availability, latency, throughput and quality are identified. Objectives, service level objectives (SLOs) and responses if the error budget is exceeded or exhausted are outlined.
<p>From <a href="https://en.wikipedia.org/wiki/Site_reliability_engineering" target="_blank">Wikipedia</a>: Site reliability engineering (SRE) is a discipline that incorporates aspects of software engineering and applies that to operations whose goals are to create ultra-scalable and highly reliable software systems.</p>
<p>Over the past year Acquia has built their own SRE team to help their products and services scale with the demand of our growing number of customers. We wish to share our experience so that others are enabled to do the same and reap the rewards.</p>
<p>This presentation will discuss how the SRE team came about at Acquia, what achievements we have made so far, and the lessons we have learned along the way. We will then show the steps on how to introduce SRE to your workplace so you can deliver more reliable and scalable services to your customers! We will specifically cover:</p>
<ul>
<li>SRE's basic concepts and history from Google</li>
<li>The management support you will need to get started</li>
<li>Introducing the idea of service level objectives and error budgets</li>
<li>Operational Responsibility Assessments as a tool to measure risk</li>
<li>Creating a Launch Readiness Checklist to standardize and improve product launches</li>
<li>Finding ideal candidates for your SRE team</li></ul>
<p>The intended audience is software engineers, system administrators, and managers who want to improve how they do their work and how their products and services perform.</p>
Overview of Site Reliability Engineering (SRE) & best practices by Ashutosh Agarwal
In any software organization, stability & innovation are always at loggerheads - the faster you move, the more things will break. This talk defines what SRE org looks like at high-tech organizations (Google, Uber).
The document discusses the growth of Site Reliability Engineering (SRE) at Squarespace from a team of 2 people in New York to a global organization with teams in New York, Portland, and Dublin. It describes how the initial SRE team focused on three pillars: monitoring and alerting, configuration management, and builds and deploys. It then explains how the SRE organization expanded to include additional teams focused on areas like provisioning, release engineering, developer productivity, and observability while also embedding SREs within product teams.
According to Google, SRE is what you get when you treat operations as if it’s a software problem. In this video, I briefly explain the term SRE (Site Reliability Engineering) and introduce the key metrics for an SRE team: SLI, SLO, and SLA.
YouTube channel here: https://www.youtube.com/playlist?list=PLm_COkBtXzFq5uxmamT0tqXo-aKftLC1U
SRE (service reliability engineer) on big DevOps platform running on the clou... by DevClub_lv
SRE (service reliability engineer). The talk is to explain the SRE philosophy and the principles of production engineering and operations in clouds.
(Language – English)
Pavlo is the ADOP (Accenture DevOps Platform) Service Reliability Team Lead and an SRE practitioner. He has more than 18 years of IT experience in Ops and Dev.
SRE stands for Site Reliability Engineering. It originated at Google over a decade ago as a way to ensure their products and services were highly reliable. SRE implements DevOps principles through components like reliability, service level agreements (SLAs), service level objectives (SLOs), service level indicators (SLIs), and error budgets. Reliability is measured through SLOs and SLIs to quantify user experience. Error budgets allow teams to balance new features against reliability by quantifying how much downtime is acceptable. SRE aims to reduce "toil", or unnecessary repetitive manual work, through automation.
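To make "how much downtime is acceptable" concrete, the short Python sketch below converts an availability target into a downtime allowance for a time window. The function name and the 30-day default window are assumptions for illustration; the summarized document does not prescribe either.

```python
from datetime import timedelta


def allowed_downtime(slo, window=timedelta(days=30)):
    """Downtime permitted by an availability target over a window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    # The error budget is the fraction of the window not covered by the target.
    return window * (1.0 - slo)


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999):
        print(f"{target:.2%} over 30 days allows {allowed_downtime(target)}")
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43 minutes of downtime, which is the budget a team can spend on risky changes.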
Service Level Terminology: SLA, SLO & SLI by Knoldus Inc.
1. SRE is the discipline of applying software engineering practices to solve operations problems to build reliable systems.
2. Service level terminology includes Service Level Indicators (SLIs), which are quantitative measures of service aspects such as latency or error rates; Service Level Objectives (SLOs), which are goals for specific metrics; and Service Level Agreements (SLAs), which are agreements with customers about the service level they can expect.
3. Choosing the right SLIs, crafting meaningful SLOs, collecting indicator data, and meeting customer expectations through SLAs are important for building reliable services.
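The relationship between an SLI and an SLO in the list above can be shown with a small Python sketch: the SLI is measured from data, and the SLO is the goal it is compared against. The function names, the millisecond unit, and the choice of a latency-based indicator are assumptions for this example.

```python
def latency_sli(latencies_ms, threshold_ms):
    """Latency SLI: fraction of requests served within threshold_ms."""
    if not latencies_ms:
        raise ValueError("no measurements in the window")
    good = sum(1 for value in latencies_ms if value <= threshold_ms)
    return good / len(latencies_ms)


def meets_slo(sli, objective):
    """An SLO is met when the measured SLI reaches the objective."""
    return sli >= objective


if __name__ == "__main__":
    # Hypothetical request latencies in milliseconds.
    samples = [120, 80, 300, 95]
    sli = latency_sli(samples, threshold_ms=200)
    print("SLI:", sli, "meets 90% objective:", meets_slo(sli, 0.90))
```

An SLA would sit on top of this as a commitment to customers, typically set looser than the internal objective.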
In this presentation I will talk about how SRE and DevOps relate and what reliability is. I will also cover the reliability approach used for Competitive Gaming at Wargaming and show a few cases.
The document discusses Site Reliability Engineering (SRE) practices at New Relic. It summarizes that New Relic has transitioned from a monolithic architecture run by siloed teams to over 200 microservices run by many engineering teams with embedded SREs. SREs aim to continuously improve reliability by reducing toil, encouraging best practices, automating operations, and supporting engineering teams. SREs focus on stability, reliability engineering, and reducing operations toil. The document provides a template for other companies to establish SRE roles, focus areas, and details in the SRE book.
Getting started with Site Reliability Engineering (SRE) by Abeer R
"Getting started with Site Reliability Engineering (SRE): A guide to improving systems reliability at production"
This is an introductory guide that shares some common SRE concepts with a non-technical audience. We will look at the technical and organizational changes that should be adopted to increase operational efficiency and that ultimately benefit global optimizations, such as minimizing downtime and improving systems architecture and infrastructure:
- Improving incident response
- Defining error budgets
- Better monitoring of systems
- Getting the best out of systems alerting
- Eliminating manual, repetitive actions (toils) by automation
- Designing better on-call shifts/rotations
How to design the role of the Site Reliability Engineer (who effectively works between application development teams and operations support teams)
Site Reliability Engineer (SRE), We Keep The Lights On 24/7 by NUS-ISS
There are many phases in the software development cycle, from requirements to development and testing, but at the tail of the process, is an often overlooked aspect: deployment and delivery. With the paradigm shift of delivering on-site software to offering software-as-a-service, Site Reliability Engineering is beginning to take a greater role in product delivery.
This session aims to give a glimpse of the work that goes into site reliability engineering (SRE) and effort that goes into keeping a service going 24/7.
Site reliability engineering (SRE) is a set of principles that applies software engineering practices to infrastructure and operations. SRE teams use automation and software development skills to manage systems and solve problems in order to create highly reliable and scalable software systems. SRE teams are responsible for availability, performance, monitoring, change management, emergency response, and capacity planning within an engineering organization. SRE focuses on automation, system design, and improvements to system resilience.
This document provides a summary of chapters 1 and 2 from the SRE book. Chapter 1 discusses the sysadmin approach versus Google's SRE approach. The key aspects of SRE include focusing on software engineering to automate tasks, maintaining a 50% cap on operational work, and using an error budget to balance change velocity and reliability. Chapter 2 describes Google's production environment, including the use of Borg for resource management, Colossus for storage, Chubby for locking services, and gRPC for RPC communication. It also discusses development practices like code reviews and shared code repositories.
This talk explains a proven approach to assessing SRE practices for an organization. The approach uses a 9-pillar model and a 7-step transformation blueprint to determine the current state of SRE practices and to set a roadmap for improving them towards industry best practices.
According to Google, SRE is what you get when you treat operations as if it’s a software problem. In this video, I briefly explain what is and isn't toil, and how to identify, measure, and eliminate it.
YouTube channel here: https://youtu.be/EgpCw15fIK8
The document discusses implementing a DevOps culture at an organization. It covers defining standard tools and processes, educating employees, and establishing continuous integration and delivery (CI/CD) pipelines. The key steps are to start with test-driven development, implement version control and code reviews, define roles and responsibilities, and set up build, deployment, and automated testing processes for development, QA, and production environments. Infrastructure should also be managed as code. Implementing these changes will help transition the organization to more agile, collaborative ways of working.
Site Reliability Engineering (SRE) is a concept where software engineers are hired to manage the reliability of products and services. SRE implements DevOps principles by automating manual system administration tasks and focusing on availability, performance, change management and monitoring through software. The key aspects of SRE include treating operations as a software problem to be solved through automation; managing services based on service level objectives; minimizing manual toil through automation; and sharing ownership of systems between SRE and product development teams.
DevOps Vs SRE Major Differences That You Need To Know - Hidden Brains Infotech by Rosalie Lauren
DevOps vs. SRE: which option should you choose to manage your IT infrastructure? Having a mobile app has become a crucial business need in the age of digitalization. Two key methodologies that help you improve the product lifecycle and accelerate app development are DevOps and Site Reliability Engineering (SRE).
DevOps, the fusing of software development (Dev) with IT operations (Ops) is growing in popularity. A maturing of the agile software development methodology, DevOps unites developers and IT operations to release high quality code into solidly performing environments more rapidly than is possible with traditional developer-to-ops handoffs. It solves a basic problem that arises with agile methodology, namely that quickly producing new code is of little use if it cannot be deployed on reliable infrastructure.
We nvestigate the ways that DevOps can generate a return on investment (ROI) for an organization that makes DevOps part of its IT strategy. DevOps certainly has great potential for business impact, with beneficial effects reaching far beyond the IT department. The ability to release high quality code efficiently confers benefits on both the income and expense sides of a business, measurable in hard dollars as well as intangible advantages such as increased brand equity.
Getting DevOps to pay off is far from a push-button process, however. CloudMunch offers a number of suggested practices based on its experience in DevOps with large enterprises. Business success with DevOps involves choreographing between people, organizational culture and the DevOps platform and tools. The paper explores practices related to setting up DevOps so that everyone on both Dev and Ops teams can get early, instant feedback on project work. In addition, it looks at practices to ensure that DevOps tools and processes can access the entire application lifecycle, which is critical to DevOps work.
So now you know what DevOps is and why your business should invest in DevOps Consulting Services for better business growth.
You might be wondering how to implement a DevOps program in your organization. Here are some tips that can help you do just that:
DevOps is a combination of cultural philosophies, practices, and tools that increases an organization's ability to deliver applications and services at high velocity. The DevOps lifecycle includes seven phases: continuous development, continuous integration, continuous testing, continuous delivery, continuous deployment, continuous monitoring, and continuous feedback. Continuous integration involves committing code changes frequently and building and testing the code continuously to identify problems early.
A Pattern-Language-for-software-DevelopmentShiraz316
The document discusses the Scrum framework for agile software development. It notes that traditional defined process approaches make incorrect assumptions that requirements, solutions, developers, and environments can be fully defined and repeated. Scrum addresses this by dividing projects into short "Sprints" of fixed time periods, usually 1 month or less. Each Sprint pulls tasks from a prioritized backlog and aims to deliver working software. Daily Scrum meetings help teams self-organize and resolve issues. At the end of each Sprint, teams demonstrate progress to customers and prioritize new tasks for the next Sprint. By continually adapting requirements and quickly delivering working software, Scrum allows for the uncertainties of software development.
Scrum an extension pattern language for hyperproductive software developmentShiraz316
Scrum is an agile software development framework that utilizes daily stand-up meetings called Scrum Meetings to manage unpredictable processes. During short, 15-minute Scrum Meetings, team members report on tasks completed since the previous meeting, any issues encountered, and their plan for the next 24 hours. This allows for continuous monitoring and adjustment of small, flexible assignments. Scrum Meetings foster transparency, knowledge sharing, and a collaborative culture within self-organizing teams. By frequently inspecting and adapting their process, teams can respond effectively to unpredictability and complexity inherent in software development.
Rapid customer response and efficient development are critical to creating a positive customer experience and remaining relevant in today’s marketplace.
The business puts significant pressure on development and operations teams to achieve these business objectives. DevOps is a popular approach to alleviate this pressure.
Critical Insight
Development and operations are traditionally siloed departments. Communication is limited to hand-offs and optimization efforts are focused locally. Cross-department collaboration is often discouraged.
Operational concerns are not accommodated early in the development lifecycle, leading to costly post-release fixes and poorly performing systems.
Focus only on your most impactful development and operations pain points. Trust your empowered teams to leverage the right DevOps practices to achieve quicker deployment throughput and better customer service.
Collaborate on the development and implementation of DevOps practices. Merging development and operations perspectives can help alleviate the tension between these two groups and initiate regular communication.
Impact and Result
There are many factors that may derail your DevOps initiatives, such as lack of management buy-in. Understand the current state of your development and operations teams to determine the focus of your efforts:
Conduct retrospectives: discuss experiences with previous projects amongst development and operations managers to identify issues that need to be addressed.
Complete a root cause analysis: review your development and operations metrics and processes to pinpoint the cause of your throughput inefficiencies, execution inconsistencies, and other pain points.
Develop customized solutions: there is no silver bullet. Each team and project may encounter different problems during their iterations. Tailor your solutions to the context of each team and regularly review their effectiveness.
Greens Technology provides DevOps training and certification in Chennai to professionals and corporates on Deployment and automation using devops tools - Chef, Docker, Puppet, Ansible, Nagios, Git, TestNG, SonarQube, Jenkins, and Project Object Model (POM) in Maven.
Le cloudvupardesexperts 9pov-curationparloicsimon-clubclouddespartenairesClub Alliances
9 points de vue d'experts sur le cloud. Sélection d'articles issus de la veille et de la curation faire par Loic Simon pour le Compte des membres du Club Cloud des Partenaires
A high level introduction to DevOps. Explains what it is, how popular DevOps has become, why DevOps is popular, how DevOps differs from traditional approaches and some next steps to implementation.
Different Methodologies Used By Programming TeamsNicole Gomez
The document discusses different programming team methodologies including:
- System development life cycle (SDLC), which is used for large projects and includes waterfall models. It takes time but ensures high quality.
- Agile methodology, designed for small projects, combines methods for faster development that changes with customer needs.
- Extreme programming allows close communication between developers and customers so the software can change rapidly based on customer feedback.
Overall agile methodologies seem to have advantages over SDLC and extreme programming by allowing faster development that can change with customer desires.
Releasing Software Without Testing TeamAkshay Mathur
The document discusses the challenges with the traditional software development model of having separate development and testing teams. It argues that having a single team responsible for both development and testing has several advantages, including encouraging developers to adopt a "test-first" approach, eliminating overhead from inter-team coordination, and reducing bug resolution times. The document also examines how requirements are often unclear and changing, which disrupts the linear waterfall model and makes it difficult for separate teams to work together effectively.
Slides from "Taking an Holistic Approach to Product Quality"Peter Marshall
This is the base material used during a half day workshop at expoQA 17 June 2019. Peter Marshall runs over the necessary technical, organisational, and improvement practices required to deliver high quality software. Deep dives into Continuous delivery, devops, organisational structures, agile and digital transformation.
The document discusses the history and principles of Extreme Programming (XP), an agile software development framework. It notes that the first XP project started in 1996 and that XP emphasizes customer satisfaction, delivering working software frequently, communication between developers and customers, and responding to changing requirements. The document also discusses other agile principles like accepting changing requirements, prioritizing working software over documentation, and designing software for simplicity and maintainability.
DevOps is a practice that unifies development and IT operations teams. It aims to reduce the time between implementing a change and releasing it to production through five core practices: planning and tracking, development, building and testing, monitoring and operations. This allows for shorter development cycles, faster innovation, reduced failures and recovery times, better communication between teams, and lower costs. Key roles in DevOps include developers, testers, engineers, administrators and architects.
This document provides information about the role of a Development Director in an enterprise IT organization. It begins with an introduction to the role and then discusses what the Development Director is responsible for, including all in-house and packaged applications and systems used across the business. It describes typical backgrounds of Development Directors and discusses who they report to and manage. It also provides a day-in-the-life example and discusses interactions with other IT roles. The document concludes with a glossary to help understand terminology relevant to the Development Director role.
Software Engineering in a Quick and Easy way - v1.pdfKAJAL MANDAL
The Most Common must know Software Development life cycle Models. As we discussed in our earlier article on Software Engineering, we have learned about the aspects of Software Engineering and the qualities that it should possess. Now let us move ahead and learn about the models of the software development life cycle. What is a software development life cycle? A software development life cycle, sometimes also called the SDLC life cycle, represents and describes the various activities that are to be performed to build a software product. These activities are grouped into several phases and sequentially linked in order. Hence we can also say, that a software development life cycle is a structured list of activities that are followed to develop software, from the inception to the delivery of the final product. During any phase of the life cycle of development, one or more activities might have to be carried out to start or finish that phase. For example, in the inception phase of actual coding, it is expected that the architectural designing phase is completed. Why software development life cycle model is required? In every model of SDLC, every phase may have its own child life cycle, for every team of a specific skill set. So in an environment of complicated projects and a variety of skill-based teams, it is vital to follow a pre-defined structured process. This creates discipline and maintains decorum in the working culture. All team members are interdependent. Failure of any one team will affect the deliverables of other teams. And all together it might lead to project failures. SDLC also defines entry and exit criteria for every phase. For example, say, if a team member starts coding, assuming that pro-activeness will help finish the project much earlier. This would be the perfect recipe for disaster and project failure. Why? Because, after putting down a month of effort they might realize that the project needs a roving vehicle on Mars to collect data. 
Unfortunately, the team doesn’t have that with them. So they can not proceed further. That means a feasibility study was not performed before the team started working on deliverables. Which in technical terms, is a breach of SDLC, and hence the loss of effort, or project failure. The team should have done a feasibility study before jumping straight into deliverables. Then they would have realized that the project is not doable, many days in advance. As so, they could have saved some unnecessary effort. Hence it is strongly suggested to follow a methodology, or process while working on complex and team-based projects. It becomes easier for the entire team to work together, support each other, manage, and track the progress of the development. Regardless of the model you follow, SDLC models always ensure smooth delivery, reporting, and chaos-free delivery of the project. Classic Waterfall Model. Prototyping Model. Iterative Waterfall Model. Rapid Action Development. Spiral Model.
3. Is SRE an alternative to DevOps?
Google created and evolved the concept
independently from the DevOps movement.
“If you think of DevOps like an interface in a programming language,
class SRE implements DevOps. SRE includes additional practices and
recommendations that are not necessarily part of the DevOps
interface. DevOps and SRE are not two competing methods for
software development and operations, but rather close friends
designed to break down organizational barriers to deliver better
software faster.
DevOps emerged as a culture and a set of practices
that aims to reduce the gaps between software
development and software operation.
The DevOps movement does not explicitly define how
to succeed.
SRE prescribes how to succeed in the various DevOps
areas.” (Liz Fong-Jones, Seth Vargo)
DEVOPS VS SRE (SITE
RELIABILITY ENGINEERING)
6. Site reliability engineering (SRE) was born at Google in 2003, prior to the
DevOps movement.
“SRE is what happens when you ask a software engineer to design an
operations team.” - Ben Treynor, VP of engineering at Google and founder of
Google SRE
“SRE is fundamentally doing work that has historically been done by an
operations team but using engineers with software expertise and banking on
the fact that these engineers are inherently both predisposed to, and have
the ability to, substitute automation for human labour. In general, an SRE
team is responsible for availability, latency, performance, efficiency, change
management, monitoring, emergency response, and capacity planning.” -
Ben Treynor, VP of engineering at Google and founder of Google SRE
Site reliability engineers create a bridge between development and
operations by applying a software engineering mindset to system
administration topics.
They split their time between operations/on-call duties and developing
systems and software that help increase site reliability and performance.
Google puts a lot of emphasis on SREs not spending more than 50% of their
time on operations and considers any violation of this rule a sign of system
ill-health.
The ideal site reliability engineer candidate is either a software engineer with
a good administration background or a highly skilled system administrator
with knowledge of coding and automation.
“SRE teams are characterized by both rapid innovation and a large
acceptance of change” - Ben Treynor, VP of engineering at Google and
founder of Google SRE
THE ORIGIN
Ben Treynor, VP of engineering at Google
7. “Software engineering has this in common with having
children: the labour before the birth is painful and difficult, but
the labour after the birth is where you actually spend most of
your effort. Yet software engineering as a discipline spends
much more time talking about the first period as opposed to
the second [...] there must be another discipline that
focuses on the whole lifecycle of software objects, from
inception, through deployment and operation, refinement,
and eventual peaceful decommissioning.”
“Running a service with a team that relies on manual
intervention for both change management and event handling
becomes expensive as the service and/or traffic to the service
grows”
“Traditional operations teams and their counterparts in
product development often end up in conflict, most visibly
over how quickly software can be released to production. At
their core, the development teams want to launch new
features and see them adopted by users. At their core, the ops
teams want to make sure the service doesn’t break while they
are holding the pager. Because most outages are caused by
some kind of change (a new configuration, a new feature
launch, or a new type of user traffic), the two teams’ goals are
fundamentally in tension.”
THE PROBLEM TO
SOLVE
8. Site: the term “site” comes from the original duty of
the newly created role, which focused on the google.com
website.
Reliability: “reliability is the most fundamental
feature of any product: a system isn’t very useful if
nobody can use it!” (Ben Treynor)
Engineering: “we apply the principles of computer
science and engineering to the design and
development of computing systems”; “an SRE team
must spend 50% of its time actually doing
development.” (Ben Treynor)
THE NAME
10. The DevOps movement in the community and the SRE initiative
at Google started from the same problem: the inefficiency
of having developers and operators working on opposite
sides of a wall, the former looking for features and the
latter for stability.
“One could view DevOps as a generalization of several
core SRE principles to a wider range of organizations,
management structures, and personnel. One could
equivalently view SRE as a specific implementation of
DevOps with some idiosyncratic extensions.” (Ben Treynor)
A DevOps Engineer is someone who understands the full
SDLC (Software Development Life Cycle).
DevOps focuses more on the automation part.
SRE focuses more on aspects like system availability,
observability, and scale.
“The basic tenet of SRE is that doing operations well is a
software problem.”
SRE VS DEVOPS
11. Reduce organisational silos: SREs share the ownership
of production with development teams and use the
same tools
Accept failure as normal: SRE’s concept of Error
Budget, avoiding a 100% SLO
Implement gradual change: SRE is prescriptive about
the use of Canary Deployment
Leverage tooling and automation: SRE concept of
Toil and the guideline of “automate this year’s job”
Measure everything: SRE is prescriptive about
measuring Error Budget and Toil
SRE VS DEVOPS
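The "class SRE implements DevOps" metaphor quoted in the talk can be written down almost literally. The sketch below is only an illustration of that metaphor: the method names and the returned strings are paraphrases of the five areas listed on this slide, not an API defined by Google or by the speakers.

```python
from abc import ABC, abstractmethod


class DevOps(ABC):
    """The five DevOps areas, modelled as an abstract interface (illustrative)."""

    @abstractmethod
    def reduce_organisational_silos(self) -> str: ...

    @abstractmethod
    def accept_failure_as_normal(self) -> str: ...

    @abstractmethod
    def implement_gradual_change(self) -> str: ...

    @abstractmethod
    def leverage_tooling_and_automation(self) -> str: ...

    @abstractmethod
    def measure_everything(self) -> str: ...


class SRE(DevOps):
    """One concrete implementation: each method names the SRE practice."""

    def reduce_organisational_silos(self) -> str:
        return "shared ownership of production with development teams"

    def accept_failure_as_normal(self) -> str:
        return "error budgets instead of a 100% SLO"

    def implement_gradual_change(self) -> str:
        return "canary deployments"

    def leverage_tooling_and_automation(self) -> str:
        return "toil reduction through automation"

    def measure_everything(self) -> str:
        return "error budget and toil measurements"
```

The abstract class cannot be instantiated on its own, which mirrors the point that DevOps names the areas without prescribing how to succeed in them.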
12. DEVOPS VS. SRE: COMPETING STANDARDS OR FRIENDS?
(CLOUD NEXT ‘19) (@SETHVARGO)
14. Service Level Indicators (SLIs): metrics over time such
as request latency, throughput of requests per
second, or failures per request
Service Level Objectives (SLOs): targets for the
cumulative success of SLIs over a window of time
Service Level Agreements (SLAs): a promise by a
service provider, to a service consumer, about the
availability of a service
SLI, SLO, SLA – A
DEFINITION
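As a minimal sketch of how the first two definitions relate, the code below computes an availability SLI from request counts over a window and compares it with an SLO target. The names `WindowCounts`, `availability_sli` and `meets_slo` are hypothetical, and treating "successful requests / total requests" as the SLI is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowCounts:
    """Request counts observed over one measurement window (hypothetical type)."""
    total_requests: int
    failed_requests: int


def availability_sli(counts: WindowCounts) -> float:
    """SLI: fraction of requests in the window that succeeded."""
    if counts.total_requests == 0:
        return 1.0  # assumption: no traffic means nothing failed
    return 1 - counts.failed_requests / counts.total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """SLO: the SLI must stay at or above the target over the window."""
    return sli >= slo_target
```

For example, 700 failures out of 1,000,000 requests give an SLI of about 0.9993, which meets a 99.9% SLO but not a 99.95% one. An SLA would then be the external promise built on top of such an objective.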
15. “In general, for any software service or system, 100% is
not the right reliability target because no user can tell the
difference between a system being 100% available and
99.999% available.”
100% is not a viable availability target. Having a 100%
availability requirement severely limits a team or
developer’s ability to deliver updates and improvements
to a system.
Setting the target isn’t a technical question at all; it’s a
product question. The business lead and product
management must establish the system’s availability
target:
• What level of availability will the users be happy with,
given how they use the product?
• What alternatives are available to users who are
dissatisfied with the product’s availability?
• What happens to users’ usage of the product at
different availability levels?
SLO - TARGET
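To make the cost of each extra "nine" concrete, the following sketch converts an availability target into the downtime it allows. The function name and the 30-day default window are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(availability_target: float, window_days: int = 30) -> float:
    """Minutes of full unavailability that still satisfy the target."""
    if not 0 <= availability_target <= 1:
        raise ValueError("availability_target must be between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1 - availability_target) * window_minutes
```

Over 30 days, a 99.9% target allows about 43.2 minutes of downtime, 99.999% allows less than half a minute, and 100% allows none at all, which is why the last option leaves no room for updates.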
16. SLIs must be selected on the basis of business value considerations.
Valuable SLIs:
• Request latency
• Batch throughput
• Failures per request (error rate)
Non-valuable SLIs:
• CPU time
• Memory usage
• Operating system uptime
Metrics must be aggregated over time (for example “last 5 minutes”),
and a function such as a percentile must be applied (for example “99th
percentile”).
Metrics must be defined in advance and be known and accepted
across the whole organisation.
Measurement must be reliable: not only the definition but also the
implementation of the measurement must be clearly defined and
approved.
SLI - SELECTION
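The aggregation rule above (a time window plus a percentile) could be sketched as follows. This is an assumption-laden example: it uses the nearest-rank percentile definition, represents samples as `(timestamp_seconds, latency_ms)` tuples, and the function names are hypothetical.

```python
import math
from typing import List, Tuple


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples to aggregate")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def latency_sli(samples: List[Tuple[float, float]], now: float,
                window_seconds: float = 300, pct: int = 99) -> float:
    """Latency percentile over the samples whose timestamp falls in the last window."""
    recent = [latency for ts, latency in samples if now - window_seconds < ts <= now]
    return percentile(recent, pct)
```

Agreeing on details such as the percentile definition and the window boundaries is part of what the slide means by defining the implementation of the measurement, not only its name.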
18. The error budget is a quantitative
measure used to set the balance
between work on implementing new
features and work on improving stability
“The error budget provides a clear,
objective metric that determines how
unreliable the service is allowed to be
within a single quarter. This metric
removes the politics from negotiations
between the SREs and the product
developers when deciding how much
risk to allow”
ERROR BUDGET AND
RISK
19. “If SLO violations occur frequently
enough to expend the error budget,
releases are temporarily halted while
additional resources are invested in
system testing and development to
make the system more resilient,
improve its performance, and so on.”
ERROR BUDGET AND
RISK
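A minimal sketch of the error budget arithmetic and of the release-halt policy described on the two slides above. It assumes a request-based SLO (the budget is a number of failed events); the function names are hypothetical.

```python
def error_budget(slo_target: float, total_events: int) -> float:
    """Number of failed events the SLO tolerates over the window."""
    if not 0 < slo_target < 1:
        # A 100% target would leave no budget at all.
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1 - slo_target) * total_events


def budget_remaining(slo_target: float, total_events: int, failed_events: int) -> float:
    """Fraction of the budget still unspent; negative when the budget is overspent."""
    budget = error_budget(slo_target, total_events)
    if budget == 0:
        return 1.0  # assumption: no traffic, so nothing has been spent
    return 1 - failed_events / budget


def releases_allowed(slo_target: float, total_events: int, failed_events: int) -> bool:
    """Policy from the slide: halt releases once the budget is exhausted."""
    return budget_remaining(slo_target, total_events, failed_events) > 0
```

With a 99.9% SLO and 1,000,000 requests in the window, the budget is about 1,000 failed requests: 400 failures leave roughly 60% of it, while 1,200 failures exhaust it and releases would be halted.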
20. Use measurements to calculate the expected
cost of each risk in terms of error
budget
Prioritise improvement work on the
basis of the computed impact of each
risk in terms of error budget
The cost of mitigating the risks could
require a review of the SLO due to
business value considerations
RISK MANAGEMENT AND
PRIORITISATION
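One possible way to express a risk's expected cost in error-budget terms is sketched below. The formula (incidents per year × minutes to detect and repair × fraction of users affected) and every name in the code are assumptions made for illustration; the slide does not prescribe a specific calculation.

```python
from dataclasses import dataclass
from typing import List

MINUTES_PER_YEAR = 365 * 24 * 60


@dataclass(frozen=True)
class Risk:
    """A hypothetical risk entry; all figures are estimates supplied by the team."""
    name: str
    incidents_per_year: float
    minutes_to_detect: float
    minutes_to_repair: float
    fraction_of_users_affected: float

    def expected_bad_minutes_per_year(self) -> float:
        outage_minutes = self.minutes_to_detect + self.minutes_to_repair
        return self.incidents_per_year * outage_minutes * self.fraction_of_users_affected


def annual_error_budget_minutes(slo_target: float) -> float:
    """Minutes of unavailability per year that the availability SLO tolerates."""
    return (1 - slo_target) * MINUTES_PER_YEAR


def prioritise(risks: List[Risk]) -> List[Risk]:
    """Most expensive risk first, measured in expected bad minutes per year."""
    return sorted(risks, key=lambda r: r.expected_bad_minutes_per_year(), reverse=True)
```

If the summed expected cost of all risks exceeds the annual budget, either some risks must be mitigated or the SLO itself is too ambitious for the business case, which is the review the slide refers to.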
22. Toil is not simply “work I don't like to do.”
Toil is not overhead (commuting, expense reports,
meetings, …)
Toil is specifically tied to the running of a production
service.
It is work that tends to be manual, repetitive,
automatable, tactical and devoid of long-term value.
Toil is also reactive, such as an intervention triggered by an alert.
When SREs find tasks that can be automated, they work to
engineer a solution to prevent that toil in the future.
Toil is not always bad. Predictable, repetitive tasks are
great ways to onboard a new team member and often
produce an immediate sense of accomplishment and
satisfaction with low risk and low stress.
TOIL AND TOIL
BUDGET
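To measure toil, a team could tag its recurring tasks with the traits listed above and compute what share of its time they consume. The sketch below is one hypothetical way to do so; treating a task as toil only when all four flags are true is an assumption, since the slide says toil "tends to" have these traits.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Task:
    """A recurring task with the traits used to recognise toil (hypothetical type)."""
    name: str
    hours_per_week: float
    production_related: bool
    manual: bool
    repetitive: bool
    automatable: bool


def is_toil(task: Task) -> bool:
    # Assumption: all four traits must hold; overhead such as meetings
    # is excluded because it is not tied to running a production service.
    return (task.production_related and task.manual
            and task.repetitive and task.automatable)


def toil_share(tasks: List[Task]) -> float:
    """Fraction of the logged weekly hours that is spent on toil."""
    total = sum(t.hours_per_week for t in tasks)
    if total == 0:
        return 0.0
    return sum(t.hours_per_week for t in tasks if is_toil(t)) / total


def automation_candidates(tasks: List[Task]) -> List[Task]:
    """Toil tasks ordered by weekly cost, most expensive first."""
    return sorted((t for t in tasks if is_toil(t)),
                  key=lambda t: t.hours_per_week, reverse=True)
```

The ordered list gives a simple starting point for deciding which toil to engineer away first.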
23. “We ensure that the teams consistently spending less than
50% of their time on development work change their
practices. Often this means shifting some of the
operations burden back to the development team, or
adding staff to the team without assigning that team
additional operational responsibilities.”
SREs must not spend more than 50% of their time
on support/operations activities.
Developers should be involved in support regularly, but
their involvement becomes mandatory if a product
requires SREs to spend more than 50% of their time on
operations.
At least eight people need to be part of the on-call team to
correctly balance the load.
Each on-call person must handle no more than two events
per on-call shift to ensure the right quality.
To preserve readiness for effective on-call support,
practice handling hypothetical outages.
ON CALL SUPPORT
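The numeric guidelines on this slide lend themselves to an automated check. The sketch below encodes them as constants and reports which ones a team currently violates; the `TeamSnapshot` type and the wording of the findings are hypothetical.

```python
from dataclasses import dataclass
from typing import List

MIN_ON_CALL_ENGINEERS = 8    # "at least eight people"
MAX_EVENTS_PER_SHIFT = 2     # "no more than two events per on-call shift"
MAX_OPERATIONS_SHARE = 0.5   # "not more than 50% of their time"


@dataclass(frozen=True)
class TeamSnapshot:
    """Figures for one review period (hypothetical type)."""
    on_call_engineers: int
    events_per_shift: List[int]
    operations_hours: float
    total_hours: float


def on_call_findings(team: TeamSnapshot) -> List[str]:
    """Return one message per violated guideline; an empty list means healthy."""
    findings = []
    if team.on_call_engineers < MIN_ON_CALL_ENGINEERS:
        findings.append("on-call rotation is too small")
    overloaded = sum(1 for n in team.events_per_shift if n > MAX_EVENTS_PER_SHIFT)
    if overloaded:
        findings.append(f"{overloaded} shift(s) exceeded the event limit")
    if team.total_hours > 0 and team.operations_hours / team.total_hours > MAX_OPERATIONS_SHARE:
        findings.append("operations work exceeds the 50% cap")
    return findings
```

A finding about the 50% cap is the trigger the slide describes for shifting operational burden back to the development team.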
24. “If you currently assign tickets randomly to victims on
your team, stop. Doing so is extremely disrespectful
of your team’s time, and works completely counter to
the principle of not being interruptible as much as
possible. Tickets should be a full-time role, for an
amount of time that’s manageable for a person. If you
happen to be in the unenviable position of having
more tickets than can be closed by the primary and
secondary on-call engineers combined, then structure
your ticket rotation to have two people handling
tickets at any given time. Don’t spread the load across
the entire team.”
SUPPORT TICKET
25. GOOGLE SRE – CONCEPTS –
OBSERVABILITY AND MONITORING
26. Each monitoring system should address
two questions: what’s broken
(symptom) and why (cause)
Two types of monitoring can be defined:
• White-box monitoring: it inspects the
internal state of the target service
(application components metrics,
traces, logs). Focus on causes.
• Black-box monitoring: it accesses the
systems from outside, as a real user would
(HTTP/TCP probes, DNS resolution,
network ping). Symptom-oriented.
Active recognition of error conditions.
WHITE-BOX AND BLACK-
BOX MONITORING
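A black-box check can be as small as a TCP connection attempt made from outside the service. The sketch below uses only the Python standard library; the function name `tcp_probe` and the shape of its result are illustrative choices, not part of any monitoring product.

```python
import socket
import time


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> dict:
    """Try to open a TCP connection, as an external client would.

    Returns whether the connection succeeded and how long the attempt took.
    """
    started = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            ok = True
    except OSError:
        # Refused, unreachable, or timed out: the symptom is "cannot connect"
        ok = False
    return {"ok": ok, "seconds": time.monotonic() - started}
```

The probe reports only the symptom (the port does or does not accept connections); finding the cause is left to white-box data such as logs and internal metrics.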
27. Monitoring may have only three output types:
• Pages - A human must do something now
• Tickets - A human must do something within a few
days
• Logging - No one need look at this output
immediately, but it’s available for later analysis if
needed
“Putting alerts into email and hoping that someone
will read all of them and notice the important ones is
the moral equivalent of piping them to /dev/null:
they will eventually be ignored.”
Every alert must be analysed; if an alert is ignored,
remove the alerting rule.
MONITORING AND
ALERTING
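The three output types can be expressed as a small routing rule. This is a minimal sketch: the two inputs (`needs_human`, `urgent`) are an assumed simplification of the slide's wording, and the names are hypothetical.

```python
from enum import Enum


class MonitoringOutput(Enum):
    PAGE = "page"      # a human must do something now
    TICKET = "ticket"  # a human must do something within a few days
    LOG = "log"        # kept for later analysis, nobody is notified


def route_alert(needs_human: bool, urgent: bool) -> MonitoringOutput:
    """Pick exactly one of the three outputs for a monitoring event."""
    if needs_human and urgent:
        return MonitoringOutput.PAGE
    if needs_human:
        return MonitoringOutput.TICKET
    return MonitoringOutput.LOG
```

Note that there is no "email" branch: anything that does not need a human ends up in the log rather than in someone's inbox.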
28. Collecting metrics and alerting are
business-driven activities.
Collecting metrics and managing alerts
are different things. You need to collect
metrics to have visibility on your
systems, but you do not have to use all
metrics to generate alerts.
Measurement, monitoring and alerting
must be related to SLOs.
Do not alert on everything: alerts must
report a service impact and be
actionable.
OBSERVABILITY
29. Latency
The time it takes to service a request. It’s important to
distinguish between the latency of successful requests and the
latency of failed requests. For example, an HTTP 500 error
triggered due to loss of connection to a database or other critical
backend might be served very quickly; however, as an HTTP 500
error indicates a failed request, factoring 500s into your overall
latency might result in misleading calculations. On the other
hand, a slow error is even worse than a fast error! Therefore, it’s
important to track error latency, as opposed to just filtering out
errors.
Traffic
A measure of how much demand is being placed on your system,
measured in a high-level system-specific metric. For a web
service, this measurement is usually HTTP requests per second,
perhaps broken out by the nature of the requests (e.g., static
versus dynamic content). For an audio streaming system, this
measurement might focus on network I/O rate or concurrent
sessions. For a key-value storage system, this measurement
might be transactions and retrievals per second.
THE FOUR GOLDEN
SIGNALS (1/2)
30. Errors
The rate of requests that fail, either explicitly (e.g., HTTP 500s),
implicitly (for example, an HTTP 200 success response, but
coupled with the wrong content), or by policy (for example, "If
you committed to one-second response times, any request over
one second is an error"). Where protocol response codes are
insufficient to express all failure conditions, secondary (internal)
protocols may be necessary to track partial failure modes.
Monitoring these cases can be drastically different: catching
HTTP 500s at your load balancer can do a decent job of catching
all completely failed requests, while only end-to-end system
tests can detect that you’re serving the wrong content.
Saturation
How "full" your service is. A measure of your system fraction,
emphasizing the resources that are most constrained (e.g., in a
memory-constrained system, show memory; in an I/O-
constrained system, show I/O). Note that many systems degrade
in performance before they achieve 100% utilization, so having a
utilization target is essential.
THE FOUR GOLDEN
SIGNALS (2/2)
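To make the four signals concrete, here is a minimal sketch that derives them from a list of request records. The `Request` type, the one-second policy threshold (`slow_after`) and the `used`/`capacity` inputs are assumptions for illustration; implicit errors (an HTTP 200 with wrong content) cannot be seen from these fields and would need end-to-end tests.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    status: int     # HTTP status code
    seconds: float  # time taken to serve the request


def golden_signals(requests, window_seconds, used, capacity, slow_after=1.0):
    """Compute latency, traffic, errors and saturation for one time window."""

    def failed(request):
        # Explicit failure (5xx) or failure by policy (slower than the commitment)
        return request.status >= 500 or request.seconds > slow_after

    ok = [r.seconds for r in requests if not failed(r)]
    bad = [r.seconds for r in requests if failed(r)]
    return {
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": len(bad) / len(requests) if requests else 0.0,
        # Latency is tracked separately for successful and failed requests
        "latency_ok_median": median(ok) if ok else None,
        "latency_failed_median": median(bad) if bad else None,
        "saturation": used / capacity,
    }
```

Keeping the two latency series apart avoids the distortion described above, where fast HTTP 500 responses make overall latency look better than it is.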
32. Effective incident management is key to limiting the
disruption caused by an incident and restoring normal
business operations as quickly as possible.
A well-designed incident management process has
the following features:
• Recursive separation of responsibilities
• Incident command
• Operational work
• Communication
• Planning
• A recognised command post
• Live incident state document
• Clear handoff
INCIDENT
MANAGEMENT
33. “It is better to declare an incident early and then find
a simple fix and close out the incident than to have to
spin up the incident management framework hours
into a burgeoning problem.”
If any of the following is true, the event is an incident:
• Do you need to involve a second team in fixing the
problem?
• Is the outage visible to customers?
• Is the issue unsolved even after an hour’s
concentrated analysis?
WHEN TO DECLARE AN
INCIDENT
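The three questions above combine with a logical "or": a single "yes" is enough to declare an incident. A minimal sketch, with hypothetical parameter names:

```python
def is_incident(needs_second_team: bool, customer_visible: bool,
                hours_unsolved: float) -> bool:
    """Declare an incident as soon as any one criterion is met."""
    return needs_second_team or customer_visible or hours_unsolved >= 1.0
```

For example, an internal-only problem handled by one team still becomes an incident once it has resisted an hour of concentrated analysis.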
34. Incident Management Roles:
• Incident Commander
• The incident commander holds the high-level state about the
incident. They structure the incident response task force,
assigning responsibilities according to need and priority.
• Operations lead
• The Ops lead works with the incident commander to respond
to the incident by applying operational tools to the task at
hand. The operations team should be the only group
modifying the system during an incident.
• Communication lead
• This person is the public face of the incident response task
force.
• Planning lead
• The planning role supports Ops by dealing with longer-term
issues, such as filing bugs, arranging handoffs, and tracking
how the system has diverged from the norm so it can be
reverted once the incident is resolved.
• Logistics lead
• The logistics role supports Ops by dealing with things such as
ordering dinner or dealing with vendors for spare parts.
INCIDENT
MANAGEMENT
36. “Postmortems should be blameless and
focus on process and technology, not
people. Assume the people involved in
an incident are intelligent, are well
intentioned, and were making the best
choices they could given the
information they had available at the
time.”
Ask which data were
available, how the system was
behaving, which actions were taken and what their
effect was. Avoid asking why an
action was or was not taken.
POSTMORTEMS AND
RETROSPECTIVES
37. Content of a postmortem:
• Incident summary
• Detailed timeline
• Detection
• Impact
• Root Causes
• Triggers
• Mitigation and Resolution
• Lessons learned
• What went well
• What went wrong
• Where we got lucky
• Action Items (with clear owner)
POSTMORTEMS AND
RETROSPECTIVES
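The content list above lends itself to a simple completeness check before a postmortem is published. The sketch below assumes the postmortem is held as a dictionary; the key names are hypothetical and only mirror the slide's sections.

```python
# Hypothetical keys mirroring the sections listed on the slide
REQUIRED_SECTIONS = [
    "summary", "timeline", "detection", "impact", "root_causes",
    "triggers", "mitigation_and_resolution", "lessons_learned", "action_items",
]


def review_postmortem(doc: dict) -> list:
    """Return a list of problems: empty sections and action items without an owner."""
    problems = [f"missing section: {name}"
                for name in REQUIRED_SECTIONS if not doc.get(name)]
    for item in doc.get("action_items") or []:
        if not item.get("owner"):
            problems.append(f"action item without owner: {item.get('title', '?')}")
    return problems
```

An empty result means every section is filled in and every action item has a clear owner.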
39. Not all Google services receive close SRE
engagement.
Google defines three different
engagement models
• Simple PRR (Production Readiness
Review) Model
• Early Engagement Model
• Frameworks and SRE Platform
SRE ENGAGEMENT
MODEL
40. When a development team requests that SRE take over
production management of a service, one to three SREs
are selected to conduct the PRR process.
The discussion covers matters such as:
• Establishing an SLO/SLA for the service
• Planning for potentially disruptive design changes
required to improve reliability
• Planning and training schedules
The Training phase unblocks onboarding of the service by
the SRE team. It involves a progressive transfer of
responsibilities and ownership of various production
aspects of the service, including parts of operations, the
change management process, access rights, and so forth.
To complete the transition, the development team must
be available to back up and advise the SRE team for a
period of time as it settles in managing production for
the service. This relationship becomes the basis for the
ongoing work between the teams.
SIMPLE PRR MODEL
41. The Early Engagement Model introduces
SRE earlier in the development lifecycle.
SRE participates in Design and later
phases, eventually taking over the
service any time during or after the
Build phase.
This model is based on active
collaboration between the development
and SRE teams.
EARLY ENGAGEMENT
MODEL
42. SRE builds framework modules to implement canonical
solutions for the concerned production area. As a result,
development teams can focus on the business logic,
because the framework already takes care of correct
infrastructure use. A framework essentially is a
prescriptive implementation for using a set of software
components and a canonical way of combining these
components.
The service frameworks implement infrastructure code
in a standardized fashion and address various production
concerns. Each concern is encapsulated in one or more
framework modules, each of which provides a cohesive
solution for a problem domain or infrastructure
dependency.
Framework modules address the various SRE concerns
enumerated earlier, such as:
• Instrumentation and metrics
• Request logging
• Control systems involving traffic and load
management
FRAMEWORKS AND
SRE PLATFORM
44. How can I know if I’m doing DevOps?
• Do I consider Developers, Test Engineers and Sys Admins as
members of the same team having one common goal?
• Do I know and care about end-user features and
functionalities (business value)?
• Do I automate infrastructure and code deployment? Do I
understand and manage SDLC?
• Do I focus on collecting as many metrics as possible, from
the build process to production runtime?
How can I know if I’m doing SRE?
• Do I protect my 50% of the time for engineering activities?
• Do I define SLO and manage Error Budget?
• Do I continuously work to improve self-healing of
production solutions? Do I target “autonomous” solutions
and not only “automated”?
• Do I use a structured Incident Management process with
clearly defined roles?
AM I DOING SRE OR
DEVOPS?
Text from the preface and introduction of “Site Reliability Engineering”, authors Murphy, Niall Richard; Beyer, Betsy; Jones, Chris; Petoff, Jennifer; published by O'Reilly Media.
No more than 25% can be spent on-call, leaving up to another 25% on other types of operational, nonproject work. Assuming that there are always two people on-call (primary and secondary, with different duties), the minimum number of engineers needed for on-call duty from a single-site team is eight (each engineer is on-call, as primary or secondary, for one week every month).
For each on-call shift, an engineer should have sufficient time to deal with any incidents and follow-up activities such as writing postmortems. We’ve found that on average, dealing with the tasks involved in an on-call incident (root-cause analysis, remediation, and follow-up activities like writing a postmortem and fixing bugs) takes 6 hours. It follows that the maximum number of incidents per day is 2 per 12-hour on-call shift.
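The two figures in these notes (eight engineers, two incidents per shift) follow from simple arithmetic, sketched below. The function names are illustrative; the inputs are the values quoted in the notes.

```python
import math


def min_oncall_engineers(concurrent_oncall: int, max_oncall_fraction: float) -> int:
    """Smallest team that keeps each engineer's on-call share under the cap."""
    return math.ceil(concurrent_oncall / max_oncall_fraction)


def max_incidents_per_shift(shift_hours: float, hours_per_incident: float) -> int:
    """How many incidents fit in a shift if each one needs full follow-up."""
    return int(shift_hours // hours_per_incident)


if __name__ == "__main__":
    # Two people on call at any time, each engineer on call at most 25% of the time
    print(min_oncall_engineers(concurrent_oncall=2, max_oncall_fraction=0.25))
    # 12-hour shift, about 6 hours of work per incident
    print(max_incidents_per_shift(shift_hours=12, hours_per_incident=6))
```

With two people on call and a 25% cap, 2 / 0.25 gives eight engineers; with 6 hours of work per incident, a 12-hour shift has room for two incidents.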
When we mention “same team” it does not necessarily mean a single concrete team; developers, test engineers and sys admins should consider themselves part of a common virtual team, regardless of the actual organisational structure.