Architecture is often defined as the layout and usage of components based on external and internal constraints. DevOps and SRE teams are built around any number of services, but as the services and the organizational landscape change, team architecture can and should change to find the right fit for those external and internal forces. This is a conceptual session on architectural patterns for teams, but it also covers technical topics spanning DevOps and SRE responsibilities and tooling.
ADDO_2022_SRE Architectural Patterns_Nov10.pptx
1. TRACK: SITE RELIABILITY ENGINEERING
NOVEMBER 10, 2022
Architectural Patterns for DevOps and SRE Teams
Shikha Srivastava, Marc Velasco - IBM
2. Shikha Srivastava: Distinguished Engineer and Master Inventor
Role: Chief Technical Architect responsible for multiple SaaS offerings across multiple clouds
Marc Velasco: Senior Technical Staff Member
Role: Chief SRE responsible for operationalizing SRE for SaaS launch across multiple clouds
3. Our Challenges
• Deliver multiple SaaS offerings to multiple clouds
• Legacy environments and/or systems
• Majority of systems have low DevOps CI/CD maturity
• Siloed organization structure and siloed culture
• Workforce upskilling and repurposing (e.g., operations adopting engineering)
• Existing teams with existing processes
• Disparate teams and siloed processes
• Different release timelines, and even different support levels
• Need for SRE expertise everywhere at once
• Balance of oncall/toil with engineering work
• And more
4. So, what did we do?
1. Platform approach, with multiple services leveraging the platform
2. Applied an architectural approach to SRE across platform and service lines
[Diagram: multi-tenant SaaS offerings (e.g., APIC) on a shared platform. Platform capabilities: Onboarding (marketplace integration, subscription and tenant management, customer nurture via emails/notifications); Service Telemetry; Service Growth; Billing; Metering; Security and Compliance; Runtime Services (ROSA/EKS, storage, data stores, N/W); Service Reliability; SRE Setup and Operations; Support]
5. Our Goal
6. Overall Goal
[Diagram: Ideas feed a Plan, Build, Test, Secure, Release, Maintain life cycle with an availability test; feedback flows back from Customers and from Operations]
Goal (Customers): customer feedback is about feature function, and none of it is about service availability and stability
Goal (Operations): feedback on missed failure points, and on improvements in architecture and design to better manage the service
7. SaaS Life Cycle from Reliability Lens
[Diagram: Ideas feed a Plan, Build, Test, Secure, Release, Maintain life cycle with an availability test]
Top priority for SRE:
- Availability
- Resiliency
- Stability
- Reliability
SaaS readiness with focus on:
- Availability
- Resiliency
- Stability
- Reliability
8. Typical SaaS Life Cycle from Reliability Lens
Maintaining service availability (the SLO) is the top priority. Service recovery in case of an outage is priority 1.
Optimize incident routing so that incidents reach the right team
The same incidents should not recur
Blameless postmortems of incidents, with RCAs
Identify and create missing monitors
Identify and create missing runbooks
Eliminate toil; automate runbooks and procedures
Identify and capture missed failure points; create and automate runbooks
Give feedback to dev on failure points caused by architecture and design issues
Robust pipeline for continuous delivery of patches, updates, CVEs, bug fixes, etc.
Enable SLIs for key user journeys, and establish SLOs for the service
Identify all possible failure points
Runbooks for each identified failure point, manual or automated
Automate test, provisioning, deployment and operations
Disaster recovery procedures
Robust pipeline for canary testing
[Diagram: Ideas feed a Plan, Build, Test, Secure, Release, Maintain life cycle with an availability test]
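The slide asks for SLIs on key user journeys and SLOs for the service. As a minimal sketch of how the two relate, the Python below computes an availability SLI and the remaining error budget for one journey. The class name `Slo`, the function names, the "checkout" journey and the 99% target are illustrative assumptions, not values from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    journey: str   # hypothetical user-journey name
    target: float  # e.g. 0.99 means 99% of events must be good


def availability_sli(good_events: int, total_events: int) -> float:
    """SLI: fraction of events that were good (1.0 when there was no traffic)."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def error_budget_remaining(slo: Slo, good_events: int, total_events: int) -> float:
    """Fraction of the error budget left: 1.0 untouched, 0.0 spent, negative overspent."""
    allowed_bad = (1.0 - slo.target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        # No budget at all (100% target or no traffic): any bad event overspends it.
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad
```

With a 99% target and 1000 requests, 10 failures are allowed, so 5 failures leave roughly half of the budget. A team might use a signal like this to decide when to pause feature releases in favor of reliability work.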
9. Why DevSecOps?
• We all know DevOps as the integration of development and operations
• There is, though, no one answer to “what is DevOps?”
• DevOps is more a mentality and intention, supported by tooling, practices, and processes
• More academically: “a set of practices intended to reduce the time between committing a change to a system and the change being placed into normal production, while ensuring high quality” [Bass, Weber, Zhu 2015]
• Pros:
  • Breaks down the wall between development and operations
  • Empowers the team to make changes
• Cons:
  • Easy to overburden the DevOps roles
  • Lots of areas for technical debt
[Diagram: IT Operations]
10. Why SRE?
• SRE is the evolution of development for production workloads
• Google coined the term and established the first set of practices
• But many other companies were doing this as a natural evolution of their business and products
  • Spotify
  • Uber
• Pros:
  • SRE addresses a number of challenges that most production workloads experience with scalability
    • Velocity of changes
    • The amount of manual work required (toil)
    • Constantly requiring more humans to scale the workload
• Cons:
  • Requires non-trivial investment to staff for oncall and engineering work
[Diagram: IT Operations, DevSecOps]
11. SRE patterns
12. SRE patterns
• Embedded: SREs embed into a team as it develops a service; SRE builds reliability into the service first-hand
13. SRE patterns
• Platform SRE: SREs responsible for a platform set of services; scope is limited to the platform
• Service SRE: SREs responsible for SRE of an application or service deployed on a platform
[Diagram: a Platform layer (hyperscaler services, SaaS platform services) owned by Platform SRE, with three SaaS Offerings on top, each owned by a Service SRE]
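The split between Platform SRE and Service SRE implies an ownership rule that incident routing can follow. The Python sketch below is one hypothetical way to express it: the component names, the team labels and the function `route_incident` are assumptions made for illustration, not the presenters' tooling.

```python
# Hypothetical set of components owned by the platform; a real list would come
# from a service catalog rather than being hard-coded.
PLATFORM_COMPONENTS = {"onboarding", "billing", "metering", "telemetry", "runtime"}


def route_incident(component: str, service: str) -> str:
    """Return the team label that should receive an incident.

    Incidents on platform-owned components go to Platform SRE; everything else
    goes to the Service SRE team of the affected SaaS offering.
    """
    if component.strip().lower() in PLATFORM_COMPONENTS:
        return "platform-sre"
    return f"service-sre:{service}"
```

For example, `route_incident("Billing", "apic")` resolves to the platform team, while an incident on a component that only one offering uses resolves to that offering's Service SRE.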
14. Our Approach
15. Bringing it all together
• Embedded SRE/DevOps work as part of the development team
• Build in SRE/DevOps practices and requirements
• Service teams are responsible for readying the service for production
• Production SREs run the service in production and take over engineering of the service for reliability while the development team works on a new service
• Platform SRE is responsible for the platform
• Collaboration model, process flow and guardrails are defined for the SREs
[Diagram: Services A, B, C and D, each with a Dev and Embedded SRE team (Embedded SRE or DevOps Teams) and a Service SRE, all on top of a Platform SRE]
16. Life of an SRE
Rotating responsibilities of:
• Toil/Oncall
• Engineering/Backup
[Diagram: US, EU and AP SRE squads. Within the US SRE squad on Day 1, Day 2 and Day 3, the oncall SREs handle outages, alerts and customer events, while the rest of the squad does SRE work (automation, reliability, observability)]
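The rotation between toil/oncall and engineering/backup within a squad can be sketched as a simple round-robin. This is an assumed scheduling rule written in Python for illustration; the slide does not say how the rotation is actually computed, and the function name, parameters and member names are hypothetical.

```python
from typing import List, Tuple


def split_squad_for_day(
    squad: List[str], day: int, oncall_size: int = 1
) -> Tuple[List[str], List[str]]:
    """Return (oncall, engineering) members for a 0-indexed day.

    The oncall window advances by `oncall_size` members each day and wraps
    around, so everyone alternates between oncall and engineering work.
    """
    if not squad:
        raise ValueError("squad must have at least one member")
    size = min(oncall_size, len(squad))
    start = (day * size) % len(squad)
    oncall = [squad[(start + i) % len(squad)] for i in range(size)]
    engineering = [member for member in squad if member not in oncall]
    return oncall, engineering
```

With a three-person squad and one oncall slot, each member is oncall every third day and spends the other two days on automation, reliability and observability work.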
17. What does SRE do in this model?
Embedded SRE Teams
Fanatical about engineering
• Develop automation and observability
Platform and Service SRE Teams
Max 50% of SRE: manage toil
• Responding to outages, alerts, manual work (toil)
Fanatical about engineering
• Rest of SRE is all about engineering for reliability
• Automation, upgrades, maintenance
• Postmortems for outages and alerts
SRE Guild
• Common skill and practice sharing
• Blameless postmortems and experience sharing
• SRE development and mentoring
• SRE certification support
[Callout boxes:]
• Maintain momentum of existing product teams
• Allow separate teams to evolve SRE skill set
• Encourage and foster SRE as a profession
• Be able to scale more services with fixed resources
• Enable the rapid development of services
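The 50% ceiling on toil can be checked mechanically once time is tracked. The Python sketch below assumes the cap applies to the share of hours spent on toil versus engineering, which is one reading of the slide; the names `TOIL_CAP`, `toil_fraction` and `over_toil_cap` are hypothetical.

```python
TOIL_CAP = 0.5  # assumed interpretation: at most 50% of tracked hours go to toil


def toil_fraction(toil_hours: float, engineering_hours: float) -> float:
    """Share of tracked hours spent on toil (0.0 when nothing was tracked)."""
    total = toil_hours + engineering_hours
    if total <= 0:
        return 0.0
    return toil_hours / total


def over_toil_cap(
    toil_hours: float, engineering_hours: float, cap: float = TOIL_CAP
) -> bool:
    """True when toil exceeds the cap and engineering time needs protecting."""
    return toil_fraction(toil_hours, engineering_hours) > cap
```

A squad that logs 25 hours of toil against 15 hours of engineering in a week would be over the cap, which is a cue to automate runbooks or rebalance the rotation.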
18. Some references
Speed and resiliency: two sides of the same coin
Google SRE practices: https://sre.google/workbook/table-of-contents/
19. Questions