TRACK: SITE RELIABILITY ENGINEERING
NOVEMBER 10, 2022
Architectural Patterns for
DevOps and SRE Teams
Shikha Srivastava, Marc Velasco - IBM
Shikha Srivastava
Distinguished Engineer and Master Inventor
Role: Chief technical architect responsible for multiple SaaS offerings across multiple clouds

Marc Velasco
Senior Technical Staff Member
Role: Chief SRE responsible for operationalizing SRE for SaaS launches across multiple clouds
Our Challenges
• Deliver multiple SaaS offerings to multiple clouds
• Legacy environments and/or systems
• Majority of systems have low DevOps CI/CD maturity
• Siloed organization structure and siloed culture
• Workforce upskilling / repurposing (e.g., operations adopting engineering)
• Existing teams with existing processes
• Disparate teams and siloed processes
• Different release timelines, and even different support levels
• Need for SRE expertise everywhere at once
• Balance of on-call/toil with engineering work
• And more
So, what did we do?
1. Platform approach, with multiple services leveraging the platform
2. Applied an architectural approach to SRE across platform and service lines
[Diagram: a multi-tenant SaaS offering (e.g., APIC) on top of shared platform services]
Platform services labeled in the diagram:
• Onboarding – marketplace integration, subscription and tenant management, customer nurture (emails/notifications)
• Service telemetry – service growth
• Billing
• Metering
• Security and compliance
• Runtime services: ROSA/EKS, storage, data stores, networking
• Service reliability – SRE setup and operations
• Support
Our Goal
Overall Goal
[Diagram: Plan, Build, Test, Secure, Release, Maintain stages, with Ideas and Availability test labels, and feedback from Customers and Operations]
Goal for customer feedback: feedback on feature function only, and none on service availability or stability
Goal for operations feedback: feedback on missed failure points, and on improvements in architecture and design to better manage the service
SaaS Life Cycle from a Reliability Lens
[Diagram: Plan, Build, Test, Secure, Release, Maintain stages, with Ideas and Availability test labels]
Top priority for SRE:
- Availability
- Resiliency
- Stability
- Reliability
SaaS readiness with focus on:
- Availability
- Resiliency
- Stability
- Reliability
Typical SaaS Life Cycle from a Reliability Lens
[Diagram: Plan, Build, Test, Secure, Release, Maintain stages, with Ideas and Availability test labels]
• Maintaining service availability (the SLO) is the top priority; service recovery in case of an outage is priority 1
• Optimize incident routing so that incidents reach the right team
• The same incidents should not recur
• Blameless postmortems of incidents and RCAs
• Identify and create missing monitors
• Identify and create missing runbooks
• Eliminate toil; automate runbooks and procedures
• Identify and capture missed failure points; create and automate runbooks
• Feed back to dev on failure points caused by architecture and design issues
• Robust pipeline for continuous delivery of patches, updates, CVE fixes, bug fixes, etc.
• Enable SLIs for key user journeys, and establish SLOs for the service
• Identify all possible failure points
• Runbooks for each identified failure point – manual or automated
• Automate test, provision, deployment, and operations
• Disaster recovery procedures
• Robust pipeline for canary testing
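The SLI and SLO item above can be made concrete with a small calculation. The following is a minimal Python sketch, not the presenters' implementation: it assumes a request-based availability SLI (good events divided by total events) and an error budget derived from the SLO target. The names `Slo`, `availability_sli`, and `error_budget_remaining`, the "login" journey, and the 99.9% target are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Availability objective for one user journey (illustrative values only)."""
    journey: str
    target: float  # e.g. 0.999 means 99.9% of events should be good

    def __post_init__(self) -> None:
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")


def availability_sli(good_events: int, total_events: int) -> float:
    """Fraction of good events; a window with no traffic counts as fully available."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def error_budget_remaining(slo: Slo, good_events: int, total_events: int) -> float:
    """Share of the error budget still unspent; negative once the SLO is breached."""
    if total_events == 0:
        return 1.0
    allowed_bad = (1.0 - slo.target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    login = Slo(journey="login", target=0.999)  # hypothetical journey and target
    print(f"SLI: {availability_sli(99_950, 100_000):.4f}")
    print(f"Budget left: {error_budget_remaining(login, 99_950, 100_000):.2f}")
```

A team could evaluate something like this per user journey and per time window, and treat a shrinking budget as a signal to favor reliability work over new releases.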
Why DevSecOps?
• We all know DevOps as the integration of development and operations, although there is no single answer to “what is DevOps?”
• DevOps is more a mentality and an intention, supported by tooling, practices, and processes
• More academically: “a set of practices intended to reduce the time between committing a change to a system and the change being placed into normal production, while ensuring high quality” [Bass, Weber, Zhu 2015]
• Pros:
  • Breaks down the wall between development and operations
  • Empowers teams to make changes
• Cons:
  • Easy to overburden the DevOps roles
  • Lots of areas for technical debt
[Diagram label: IT Operations]
Why SRE?
• SRE is the evolution of development for production workloads
• Google coined the term and established the first set of practices
• But many other companies were doing this as a natural evolution of their business and products
  • Spotify
  • Uber
• Pros:
  • SRE addresses a number of challenges that most production workloads experience with scalability
    • Velocity of changes
    • The amount of manual work required (toil)
    • Constantly requiring more humans to scale the workload
• Cons:
  • Requires a non-trivial investment to staff for on-call and engineering work
[Diagram labels: IT Operations, DevSecOps]
SRE patterns
• Embedded SRE – SREs embed into a team as it develops a service and build reliability into the service first-hand
• Platform SRE – SREs responsible for a platform's set of services; scope is limited to the platform
• Service SRE – SREs responsible for the reliability engineering of an application or service deployed on a platform
[Diagram: a Platform SRE team covers the platform (hyperscaler services, SaaS platform services); each SaaS offering on top of it has its own Service SRE team]
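One way to picture the platform/service split is as an ownership lookup used when routing an incident to the right team. The sketch below is only an illustration under assumptions: the component names, team names, and the choice to fall back to the platform team for unknown components are all hypothetical and are not taken from the talk.

```python
from typing import Dict

# Hypothetical ownership map: component name -> owning SRE team.
OWNERS: Dict[str, str] = {
    "runtime": "platform-sre",
    "billing": "platform-sre",
    "offering-a": "service-sre-a",
    "offering-b": "service-sre-b",
}


def route_incident(component: str, owners: Dict[str, str],
                   default_team: str = "platform-sre") -> str:
    """Return the team that owns the component.

    Falling back to a default team for unknown components is an assumption
    of this sketch, not a rule stated in the slides.
    """
    return owners.get(component, default_team)


if __name__ == "__main__":
    for name in ("offering-a", "runtime", "unknown-component"):
        print(name, "->", route_incident(name, OWNERS))
```

Keeping the map in one place makes scope boundaries explicit: a Service SRE team only appears against its own offering, and everything shared resolves to Platform SRE.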
Our Approach
Bringing it all together
• Embedded SRE/DevOps engineers work as part of the development team
• They build in SRE/DevOps practices and requirements
• Service teams are responsible for readying their service for production
• Production SREs run the service in production and take over engineering of the service for reliability while the development team works on a new service
• Platform SRE is responsible for the platform
• A collaboration model, process flow, and guardrails are defined for the SREs
[Diagram: Services A to D, each with a Dev and Embedded SRE (or DevOps) team and a Service SRE team, all on top of Platform SRE]
Life of an SRE
Rotating responsibilities of:
• Toil/On-call
• Engineering/Backup
[Diagram: US, EU, and AP SRE squads. For the US SRE squad on Day 1, Day 2, and Day 3, the on-call SREs handle outages, alerts, and customer events, while SRE work covers automation, reliability, and observability]
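The rotation between toil/on-call and engineering/backup can be sketched as a simple round-robin over a squad's members. This is a minimal illustration, assuming a daily rotation and a fixed number of on-call engineers; the function name `roles_for_day`, the member names, and the round-robin rule itself are hypothetical, since the slides do not say how the rotation is computed.

```python
from typing import Dict, List


def roles_for_day(members: List[str], day: int,
                  oncall_size: int = 1) -> Dict[str, List[str]]:
    """Round-robin split of one squad into on-call and engineering/backup.

    `day` is a 0-based day index; the rotation rule is an assumption.
    """
    if not members:
        raise ValueError("squad has no members")
    if not 0 < oncall_size <= len(members):
        raise ValueError("oncall_size must be between 1 and the squad size")
    start = (day * oncall_size) % len(members)
    rotated = members[start:] + members[:start]
    return {
        "oncall": rotated[:oncall_size],
        "engineering_backup": rotated[oncall_size:],
    }


if __name__ == "__main__":
    us_squad = ["sre-1", "sre-2", "sre-3"]  # hypothetical squad members
    for day in range(3):
        print(f"Day {day + 1}: {roles_for_day(us_squad, day)}")
```

The same function could be applied independently to the EU and AP squads, so that each region keeps its own rotation.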
What does SRE do in this model?
Embedded SRE Teams
Fanatical about engineering
• Develop automation and observability
Platform and Service SRE Teams
Manage toil (max 50% of SRE time)
• Respond to outages, alerts, and manual work (toil)
Fanatical about engineering
• The rest of SRE time is all about engineering for reliability
• Automation, upgrades, maintenance
• Postmortems for outages and alerts
SRE Guild
• Common skill and practice sharing
• Blameless postmortems and experience sharing
• SRE development and mentoring
• SRE certification support
What the model aims to achieve:
• Maintain momentum of existing product teams
• Allow separate teams to evolve their SRE skill set
• Encourage and foster SRE as a profession
• Be able to scale to more services with fixed resources
• Enable the rapid development of services
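The 50% cap on toil for Platform and Service SRE teams can be checked with simple arithmetic over tracked hours. The sketch below is an assumption-laden illustration: it presumes the team records toil hours and engineering hours per period, and the function names and the default `cap=0.5` parameter are hypothetical.

```python
def toil_fraction(toil_hours: float, engineering_hours: float) -> float:
    """Share of total tracked time spent on toil (on-call, alerts, manual work)."""
    total = toil_hours + engineering_hours
    if toil_hours < 0 or engineering_hours < 0 or total == 0:
        raise ValueError("hours must be non-negative and not both zero")
    return toil_hours / total


def exceeds_toil_cap(toil_hours: float, engineering_hours: float,
                     cap: float = 0.5) -> bool:
    """True when toil takes more than the allowed share of time."""
    return toil_fraction(toil_hours, engineering_hours) > cap


if __name__ == "__main__":
    # Hypothetical week: 25 hours of toil and 15 hours of engineering work.
    print(f"toil share: {toil_fraction(25, 15):.1%}")
    print(f"over cap: {exceeds_toil_cap(25, 15)}")
```

A result over the cap would be a prompt to automate runbooks or rebalance the rotation, in line with the "fanatical about engineering" goal.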
Some references
Speed and resiliency: two sides of the same coin
Google SRE practices https://sre.google/workbook/table-of-contents/
Questions