3. Document Classification
For you today . . .
• SRE refresher
• Is this SRE?
• Consider this
• SRE model
• How Kanban practices enable SRE
The acronym ‘SRE’ is used for both Site Reliability Engineering (the discipline) and Site Reliability Engineer (the role)
Site Reliability Engineering – definitions
• Doing work that has historically been done by an operations team, but using engineers with software expertise
• The main goals are to create scalable and highly reliable software systems
• Created by Ben Treynor of Google, who described it as “what happens when you ask a software engineer to design an operations team”
• The Site Reliability Engineer role is a hybrid of dev and ops roles. It balances developing new features with ensuring that production systems run smoothly and reliably.
• SREs often use Kanban practices such as flow, visualization and WIP management
What it is
An SRE spends 50% of their time coding; the other 50% goes to taking care of existing applications (operations).
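The 50/50 split between coding and operations can be checked against logged hours. The following is a minimal sketch, assuming a hypothetical time log keyed by work category; the function name, the category names and the sample week are illustrative and do not come from the slides.

```python
def ops_share(hours_by_category: dict[str, float], ops_categories: set[str]) -> float:
    """Return the fraction of logged hours spent on operations work."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0  # nothing logged, so no operations load to report
    ops_hours = sum(h for cat, h in hours_by_category.items() if cat in ops_categories)
    return ops_hours / total


# Hypothetical week for one SRE (assumed figures, for illustration only).
week = {"coding": 18.0, "tickets": 14.0, "on_call": 8.0}
share = ops_share(week, {"tickets", "on_call"})

# Flag the week if operations took more than the 50% described above.
over_cap = share > 0.5
print(f"ops share: {share:.0%}, over 50%: {over_cap}")
```

A team could run a check like this per sprint or per month to see whether operations work is crowding out coding time.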
What it is not
• Not a replacement of DevOps
• Not a department in the IT organization
• Not just a set of tools
Did you know?
One of the co-editors of the SRE book, Dr Jennifer Petoff, holds a PhD in Synthetic Chemistry!
SRE, DevOps and Kanban
SRE and DevOps are two sides of the same coin
Both aim to bridge the gap between dev and ops teams
While DevOps is about ‘what’ needs to be done, SRE is about ‘how’ it can be done
SRE can be considered a specific implementation of DevOps
While DevOps sends problems to Dev to solve, the SRE approach is to find problems and solve some of them themselves
While DevOps teams usually choose the more conservative approach, leaving the prod environment untouched unless absolutely necessary, SREs are more confident in their ability to maintain a stable prod environment and push for rapid changes and updates
Kanban provides additional rigour to the SREs’ workflow management
Is this SRE – 1
• Fix low priority bugs
• Solve minor design issues
• Conduct Performance Testing
• Analyze Security Vulnerability
• Conduct Pen Testing
• And so on
We do those things (a.k.a. ‘miscellaneous’) that the Dev team cannot do
• Demand from multiple sources
• No specific accountability
• Lack of clarity on the skills required
• Typically a floating team
• Output not aligned with dev or release cadence
• Low motivation and hence attrition
• Vague growth path
• Quality Issues
• Delays
Is this SRE – 2
• Identify architecture flaws and fix them
• Make Design changes in common applications
• Clear piled up coding, testing and configuration issues
• Provide Migration support – database, applications,
We take care of all the architecture and code issues, fix them and hand them back to Dev
• Long backlog of items
• Majority are high priority items
• Effort intensive
• No direct development / operations support
• Limited connect with core business
• Challenges in measuring benefits
• Team burnout
• Lack of expertise hampering progress
• Probable lack of alignment with business priorities
What are the possible Kanban practices in this scenario?
Is this SRE – 3
• Play the role of custodian of enterprise level tools
• Provide automation and tooling support to Dev, QA and Ops
• Create and maintain frameworks for app development, testing and hosting
• Provide some environment support
• Monitor tool usage and manage licenses
• Maintain common API libraries
We provide tooling, platform and framework support to all our tech teams
• Mixed bag of backlog of items
• Work prioritization challenges
• Vague responsibility for tooling support, e.g. the ‘DevOps’ team
• Challenges in maintaining the balance between usage needs and cost of tools/licenses
• Challenges in integrating third-party solutions
• Delays in servicing team requests
Is this SRE – an illustration from a Financial Institution
‘In charge’ of
• Database migration – on premise
• Dev Platform migration – on premise
• Cloud migration support
• Tech debt reduction
• API creation and support
Called ‘SRE’
[Diagram: Program with boxes for Architecture; Database migration – on premise; Dev Platform migration – on premise; Cloud migration support; Tech debt reduction; Tooling and Platform; Scrum Team 1, Scrum Team 2, ... Scrum Team n]
• Rotating team members – borrowed from Scrum teams
• Team too thin and too many items under their belt
• Siloed within a program
• No direct interactions with enterprise architecture
• Low alignment with business priorities
• Low team motivation
• Major initiatives behind schedule
• Low business satisfaction score
1. SREs can influence architecture decisions
2. Error budget is the cost of defects
3. SREs need dev skills
4. The SRE team is a single integrated team across the IT organization
5. Where there is DevOps there is no SRE
6. SREs have deep insight into a set of applications
Activity
Consulting & Services Integration
A better way to organize SRE – an Illustrative model
• Persistent SRE team per product team
• Integrated team with Dev and Ops skills
• Single backlog with all stories / work items
• Kanban used for managing workflow
• Closely working with Scrum teams
• Connected with Enterprise Architecture and Business
Team structure and responsibilities
• Time to Market
• Deployment Frequency
• Change Failure Rate
• Application Uptime
• Mean Time to Restore
• SLA
• Error Budget
Metrics
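Several of the metrics listed above reduce to simple arithmetic. The following is a minimal sketch, assuming an availability SLO expressed as a fraction and outage durations recorded as time spans; the function names, the 99.9% target and the sample figures are assumptions for illustration and are not taken from the slides.

```python
from datetime import timedelta


def error_budget_minutes(slo: float, window: timedelta) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    return (1.0 - slo) * window.total_seconds() / 60.0


def change_failure_rate(deployments: int, failed: int) -> float:
    """Fraction of deployments that caused a failure in production."""
    return failed / deployments if deployments else 0.0


def mean_time_to_restore(outages: list[timedelta]) -> timedelta:
    """Average duration of the recorded outages."""
    if not outages:
        return timedelta(0)
    return sum(outages, timedelta(0)) / len(outages)


def uptime(outages: list[timedelta], window: timedelta) -> float:
    """Fraction of the window during which the application was up."""
    return 1.0 - sum(outages, timedelta(0)) / window


# Hypothetical 30-day window with an assumed 99.9% availability target.
window = timedelta(days=30)
outages = [timedelta(minutes=12), timedelta(minutes=18)]

budget = error_budget_minutes(0.999, window)               # about 43.2 minutes
spent = sum(outages, timedelta(0)).total_seconds() / 60.0  # 30 minutes
remaining = budget - spent                                 # about 13.2 minutes
mttr = mean_time_to_restore(outages)                       # 15 minutes
cfr = change_failure_rate(deployments=20, failed=3)        # 0.15
```

Here the error budget is treated as the downtime an SLO permits over the window, so the remaining budget shrinks as outages accumulate.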
• Development stories
• L1 and L2 tickets
• Automation stories
• Retrospective action items
• Infra / DevOps related stories
Backlog items
[Diagram: Enterprise Architecture and Business Stakeholders connected to Product Team 1 and Product Team 2. Each product team contains an SRE Team (Application Monitoring; Operations Support L1, L2; Development L2+, L3) alongside Scrum Team 1, Scrum Team 2, Scrum Team 3, ... Scrum Team n]
Key Kanban principles for successful SRE ways of working
Backlog Management – SRE workflow can be managed using Kanban, especially for the service tickets. These can co-exist on the Kanban board with the Dev stories.
Visualizing Work – SREs maintain full transparency of their work and collaborate closely with Dev, Infrastructure, Architecture and other groups. They maintain visual dashboards of the status of their work.
Empowerment – The SRE approach encourages ‘acts of leadership’, a key Kanban principle. SREs are empowered to make decisions within their context, e.g. when to switch gears from dev work to ops tickets.
Stop starting and start finishing – By adopting this philosophy, SREs resolve the critical tickets to restore applications before starting development work.
WIP Limits – WIP limits can be used to make effective use of the Error Budget, e.g. calibrate the WIP limit against the Error Budget threshold: if Ops WIP > 3, stop taking on dev work?
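The WIP rule suggested above can be written as a small pull policy. The following is a minimal sketch: the threshold of 3 comes from the slide's example, while the function name, the `budget_remaining` parameter and the rule that an exhausted error budget also blocks dev work are assumptions added for illustration.

```python
OPS_WIP_THRESHOLD = 3  # from the slide's example: "If Ops WIP > 3 then stop taking dev work"


def can_pull_dev_work(ops_wip: int, budget_remaining: float,
                      ops_wip_threshold: int = OPS_WIP_THRESHOLD) -> bool:
    """Decide whether the SRE team may start a new dev story.

    ops_wip: number of ops tickets currently in progress.
    budget_remaining: fraction of the error budget still unspent (0.0 to 1.0).
    """
    # Assumption (not from the slides): with no error budget left,
    # reliability work takes priority over new development.
    if budget_remaining <= 0.0:
        return False
    # Slide rule: too many ops tickets in flight means stop starting dev work.
    return ops_wip <= ops_wip_threshold


# Hypothetical board states.
print(can_pull_dev_work(ops_wip=2, budget_remaining=0.4))  # dev work allowed
print(can_pull_dev_work(ops_wip=4, budget_remaining=0.4))  # blocked by Ops WIP
print(can_pull_dev_work(ops_wip=1, budget_remaining=0.0))  # blocked by error budget
```

A team could tune the threshold so that it tightens as the error budget is spent, which is one way to read "calibrate the WIP limit against the Error Budget threshold".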
1. Don’t establish a single enterprise-wide SRE team
2. Don’t set them up as a DevOps team, Tech Debt team or other specialized common team
3. Don’t dump miscellaneous work on them
4. Don’t use them as buffer capacity for the Dev team
5. Don’t measure them using ‘standard’ productivity metrics
In summary
1. Take full advantage of Kanban practices to manage their work
2. Provide them with architecture support
3. Give them access to code, applying security principles: Role Based Access Control (RBAC), need-to-know, least privilege
4. Involve them in innovation and product evolution discussions
5. Engage them with Product Owners and business stakeholders
1. Who should own SRE – Change or Run? Does it matter?
2. Can SREs do the Dev part of their work using Agile principles and use Kanban for their ops-related services?
3. What are the top 3 attributes of an SRE?
Food for thought