Site Reliability Engineering
Organizational & Operational Framework
(a.k.a. “Work Mode”)
May 2019
Olaf Reitmaier Veracierta
Agenda
Introduction
Weaknesses / Strengths
• Culture of Service
• Bi-Modality Awareness
• Ways of Working
• Organization
• Innovation
• Communication
• Planning
Changes
Trust & Visibility
Round Robin (Q&A)
Introduction
Purpose
• Clarify expectations from company leaders
• Align people first, then address technology issues
• Allow collaborators to share concerns (at the end, take notes)
• Propose a set of criteria to follow up on teamwork
• Opportunity to be honest, positive and share ideas
• Put everyone on the same (new) page, no excuses
• Avoid repeating this information several times
Site Reliability Engineering
Vision
• Be recognized as an expert and reliable group of people that
offers best-in-class knowledge and support for the core IT
infrastructure.
Mission:
• Ensure the continuous and secure operation and support of the
network, compute, storage, backup, messaging, logging and
monitoring platforms of the core IT infrastructure that enable
critical business applications and services.
Site Reliability Engineering
https://sre.google/books/
“Reliability”
Weakness
& Strengths
Strengths
KNOWLEDGE
PASSION
COMMITMENT
++ Assertiveness
++ Resolution
Weakness
ACCOUNTABILITY
(TRACEABILITY)
COMMUNICATION
(WORKFLOWS)
VERSATILITY
(STRATEGY)
What
When
Whom
Where
Why
Who
-- Scalability
Changes
The Company / Business Culture
Vicious cycle: I would really like a different future, but
why change if everything is OK the way it is? The
(other) people have to change.
Micro
Macro
Safety
Circle
Blindness?
Start with You
Every:
• Business
• Customer
• Organization
• Employee
Has:
1. Opportunities
2. Values
3. Risks
4. Rules / Standards
• Everybody has a movie
• Criteria (dis-)alignment
Change
CULTURE
(OF SERVICE)
BI-MODAL
(VERSATILITY)
WAYS OF WORKING
(ME / YOU / ALL)
LEGACY
(HR / IT)
• Incremental
• Iterative
• Agreed
Change
|
Culture
of Service
User Experience => Every Interaction Counts
Customers ⬄ Team / Group / Member / Partner (Internal & External)
Service Delivery Orientation
Efficient
Kind
Inefficient
Unkind
Inefficient
Kind
Efficient
Unkind
1.Developers
2.DevOps
3.SysOps
Emotional
Intelligence
KEY FOR PEOPLE
RELATIONS
KEY FOR WORK
RELATIONS
TRIUNE BRAIN / SYSTEM 1 & 2
Conflict <=> Negotiation
Tension <=> Nothing
Conflict != Fight
Triune Brain
(Thinking/(Re-)Acting)
With time, habits (good or bad) become reptilian
bias; without noticing, you lose reasoning skills
and the ability to change at all.
System 1 and 2
(Thinking/(Re-)Acting)
Stimulation -> Time -> Reaction
“People gather together with people
that make the work easier”.
Empathy
“Put yourself in the other person’s shoes”
“Ask for reciprocity about it”
Resilience
Self-Esteem / Empathy / Positiveness
Goals / Challenges / Opportunities
Capacity to recover or thrive
from any kind of difficulties
in work and personal life
Difficulties
Change
|
Bi-Modality
Awareness
Code
Infrastructure
Sys/Ops Dev
Business
IT
Maintenance,
Support
IaaCode, Cloud, DevOps
Innovation
Mr. No Mr. Yes
Infrastructure vs. Code
Infrastructure | Code
Hardware / Software | Software
Static / Immutable / Un-versioned | Dynamic / Mutable / Versioned
Moore’s Law (Faster) | Wirth’s Law (Slower)
Extrinsic Documentation | Intrinsic Documentation
Administrator Driven (Indirect) | User Driven (Direct)
Software Defined Anything
Virtualization Cloud Native Solutions Serverless
Still slow, really slow,
transition…
Changes
|
Way of Work
Changes
|
Way of Work
|
Organization
Technician vs.
Engineer vs.
Manager
Solves different problems
From technical to the business language
Doing, thinking or organizing people
(Hardly done in parallel by the same person)
From nothing to a book of knowledge
Managers
(a.k.a. Coaches)
• Help team members
• Drive the team to improve
• Support the team in good and bad times
• Organize, plan and audit
work progress
Deliver:
• Proactive progress reporting
• Active participation in solutions
• Answers to follow-up
questions
• Ensure the job is done on time
Expects:
Performance
Classification
Autonomy
Seniority
Delegation
Operation
Infrastructure
Code
Legacy
Documentation
Training
(Re-)usability
Auditability
Autonomy
(Freedom to innovate)
You work within a team
(not alone) for good & bad
You implement and
support highly available
solutions (whenever
possible)
You foster automation
instead of manual tasks
(wherever possible)
You look for strategic approval
(especially for new projects /
scope changes)
Seniority
Years doing the same stuff → Changing on a daily basis
Just being different → Making the difference
Smart person → Smart team
A lot of Power → Responsibilities
Learning → Teaching (New senior engineers will need it!)
Delegation
Knowing what your
responsibilities are is important
We are here because
somebody let us make
mistakes
Having time for innovation
requires delegating to and teaching
someone, so the team can scale
Senior
Complex Tasks
Semi-Senior
Medium Tasks
Entry
Basic Tasks
2nd in Command (Backup)
RESPONSIBILITIES MATRIX COUPLE OF YOUR SELECTION
(HIRE A NEW MEMBER?)
PROMOTES SHARING OF
CRITICAL DOCUMENTATION
AND KNOWLEDGE
Update Responsibility Matrix
Infrastructure:
• Network
• Compute
• Storage
• Management Systems
• Security/Identity Devices/Systems
• Backup System and Disaster Recovery
• Monitoring / Alerting (NPM, ITM, APM)
• Supporting Business Applications
Documentation
(The Critical One)
Alive (always changing)
Challenging for everyone (for many reasons)
Boring because it is not for you but for others
Needed for delegation to others (e.g. newcomers,
or colleagues covering for vacation, sickness or resignation).
Important as companies
grow over time
Reduce tribal knowledge
Raise quality levels
Allow auditing
Documentation: Layered / Trimmed / Useful
Abstraction
Update -> Critical
Archive -> Obsolete
Focused on Training
Discoverable (Searchable)
KPI
Documentation
(Workshops)
Summary: purpose, technology/vendor websites,
external articles/references.
Architecture: a visual representation (high level at least)
of the platform or system (e.g. draw.io, dot/graphviz).
Assets: resource inventory labeling (not naming), links
(URLs) and credentials (e.g. Vault, SSO).
How-To’s: (re-)installation and configuration, with a focus
on critical and tricky in-house customizations.
Basic administration and troubleshooting: standard
procedures, known errors with a brief explanation of the solution,
references to articles, tickets and similar.
Changes
|
Way of Work
|
Innovation
Legacy
• There was, there is and there will be
• It’s a matter of time, but it is important
• It is neither bad nor good; it is just legacy
• Innovation is needed, but so is maintenance
99% of things are legacy from the moment they go live
Innovation
(& Investigation)
Must be focused
Value
(Customer)
Should be planned
(End/Start)
Trackable
(Timeframe)
Must be a process
(Success/Failure)
Measured
(Deliverable)
Changes
|
Way of Work
|
Communication
Communication
• Notify:
• Live changes to customers, with enough notice
• Absences in advance (book and/or block your calendar)
• Late arrival at or early departure from the office, to the team members
• If you are working and applying changes over the weekend
• Avoid:
• Making unplanned changes not related to live issues on “Fridays”
• Implementing new features close to your vacation days (freeze)
• Mixing informal/formal communication (e.g. me/x-team, misunderstanding, rumors)
• Overlapping vacations with your 2nd in command (a.k.a. my backup)
Estimated Time Ahead (ETA)
BROADCAST
INFORMATION
START, PROGRESS,
END, EVIDENCE
REPORT
CONTINUOUSLY
DON’T EXPECT TO
BE FOLLOWED UP
ANSWER
QUESTIONS
TO EASILY CLARIFY
THE FOLLOW-UP
APPLY TO
EVERYTHING:
INCIDENT,
PROBLEMS,
REQUIREMENTS.
Communication
(Too many
channels?)
• Many formal communications and calendaring
• Requests made in chats will be converted to tickets (as needed)
@Email
• Live Issues and News
• Daily/Weekly Slack-Up Trial (instead of Daily Stand-Up)
#Slack
• Kanban Board: Task/Requirements (Internal/External)
• Wiki: For publishing critical infrastructure information
JIRA (Tickets/Wiki)
• Only for emergencies
Phone Calls / WhatsApp
Mute Generation / Work Comm. Channel
Favorite Comm.
Channel (IM)
Daily Comm.
Channel (IM)
Family Comm.
Channel (IM)
So let’s Slack-Up!
• (Trial) Broadcast to slack channel instead of verbal stand-up.
• Daily/Weekly basis at the start of the day before 10 A.M.
• Reminder for achievements/contributions to OKR’s:
• Maintenance
• Innovation & Optimization
• Support (Only if not already a JIRA Ticket)
• Lowest, Low, Medium, High, Highest.
• Undesired/Unplanned requests (e.g. C-Level Urgencies, Last-Minute-Request)
• Helps you easily track your progress and risks (red flags).
• Promotes team awareness of issues and progress.
What does it look like? https://slack.com/intl/en-de/slack-tips/run-daily-standups-or-check-ins
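A Slack-Up post could be assembled from the OKR categories listed above. The following is only a minimal sketch: the function name `format_slack_up`, the section labels, and the sample entries are hypothetical, and nothing is actually sent to Slack (the text would be pasted or posted by whatever integration the team already uses).

```python
from typing import Dict, List

# Section labels follow the Slack-Up slide; "Unplanned" is a shortened,
# assumed label for undesired/unplanned requests.
SECTIONS = ("Maintenance", "Innovation & Optimization", "Support", "Unplanned")


def format_slack_up(author: str, items: Dict[str, List[str]]) -> str:
    """Build the text of a Slack-Up post, one block per non-empty section."""
    lines = [f"Slack-Up from {author}"]
    for section in SECTIONS:
        entries = items.get(section, [])
        if not entries:
            continue  # skip sections with nothing to report
        lines.append(f"{section}:")
        lines.extend(f"- {entry}" for entry in entries)
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_slack_up("ana", {
        "Maintenance": ["Patched backup server"],
        "Support": ["Reset VPN access"],
    }))
```

Keeping a fixed section order makes posts from different team members easy to scan side by side.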
Changes
|
Way of Work
|
Planning
Planning
Objectives and
Key Results (OKRs)
Planning
(OKR)
Maintenance
Projects (30%)
Maintenance plans
Preventive monitoring
Optimization
& Innovation
Projects (20%)
Cost-efficiency driven plans
Security-driven plans
New technologies / features
try-out / deployment plans
Support
Projects (50%)
New requirements
Reactive incident response
• Known errors handling
• Problems troubleshooting
Initiatives: Pick up selection to improve operation?
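The 30% / 20% / 50% split above can be turned into a concrete time budget. This is a minimal sketch: the shares come from the slide, while the 40-hour week and the names `OKR_SHARES` and `split_hours` are assumptions made for illustration.

```python
from typing import Dict

# Shares taken from the planning slide.
OKR_SHARES: Dict[str, float] = {
    "maintenance": 0.30,
    "optimization_innovation": 0.20,
    "support": 0.50,
}


def split_hours(total_hours: float, shares: Dict[str, float] = OKR_SHARES) -> Dict[str, float]:
    """Split a time budget across OKR categories according to their shares."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must add up to 100%")
    return {name: round(total_hours * share, 2) for name, share in shares.items()}


if __name__ == "__main__":
    # Assumed 40-hour working week.
    print(split_hours(40))
```

For a 40-hour week this gives 12 hours of maintenance, 8 hours of optimization and innovation, and 20 hours of support.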
URGENT NOT URGENT
IMPORTANT
DO
• Live Issue
• Slack until fixed
• Do the post-mortem
documentation
• OKR’s
• Slack-Up w/Manager
DECIDE
• Non-Live Issue
• Ask-for/Open ticket
• Slack-Up w/Manager
• Clarify expectations with third
parties (e.g. ETA, deadline)
• Move to Initiatives
NOT IMPORTANT
DELEGATE
• Teach them to fish
• Assign to a less senior member
• 2nd in Cmd. (Backup)?
• Manager
DELETE
• Don’t confuse it with
Innovation in the OKRs
• Stop thinking about it
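The urgent/important matrix above reduces to a two-question decision. A minimal sketch, where the function name `eisenhower_action` is hypothetical and the four labels are the ones used on the slide:

```python
def eisenhower_action(urgent: bool, important: bool) -> str:
    """Map a task's urgency and importance to the quadrant label from the matrix."""
    if important:
        # Important work is either handled now or consciously scheduled.
        return "DO" if urgent else "DECIDE"
    # Unimportant work is handed over if it cannot wait, otherwise dropped.
    return "DELEGATE" if urgent else "DELETE"


if __name__ == "__main__":
    # A live issue is both urgent and important.
    print(eisenhower_action(urgent=True, important=True))
```

Deciding importance first mirrors the rows of the matrix: only important work stays with the engineer.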
Individual Planning / Report-Up
Estimated Time Ahead (ETA)
Never Ending Story
Unexpected
Additions/Drops
Important for
critical
infrastructure
Challenge for most
of the tech teams
Different for
innovation,
maintenance
and support
IT Incident Response Plan (Overview)
Alerts: Person, Call, E-Mail,
#Slack, SMS, without
standard classification
(severity).
Follow-up on slack
Operational (OPS) channel
(chat).
Directly Responsible Individual
(DRI) takes care of his/her
alerts.
Meet with everyone involved in
the Situation Room.
SLA defined in terms of
maximum downtime in hours
per system / application.
Spiral Escalation to Eng. Mgr.,
Head/Director, CTO, CEO.
<45 minutes
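Since the SLA above is defined as maximum downtime in hours per system or application, it can be translated into an availability percentage for comparison with vendor figures. A minimal sketch, assuming a 30-day measurement period (the slides do not state one) and a hypothetical function name `availability_percent`:

```python
def availability_percent(max_downtime_hours: float, period_hours: float = 30 * 24) -> float:
    """Convert an allowed downtime (hours per period) into an availability percentage."""
    if not 0 <= max_downtime_hours <= period_hours:
        raise ValueError("downtime must be between 0 and the period length")
    return 100.0 * (1 - max_downtime_hours / period_hours)


if __name__ == "__main__":
    # Assumed example: 7.2 hours of allowed downtime in a 30-day month.
    print(f"{availability_percent(7.2):.2f}%")
```

With these assumptions, 7.2 hours of allowed downtime per 30-day month corresponds to roughly 99% availability.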
IT Incident Response Insights
24/7 Emergency Handling - Contacts
Technology Team Cloud Support - Contacts
On-Premise Datacenter Emergency - Contacts
Formal Mailing Lists for the Business / Technology Teams
Report +
Escalate
Pro vs.
Re-Act
Account
+ Adjust
Incident Response Plan - Responsibilities
DevOps & Site Reliability Teams – Responsibilities
Physical / IT Security & Legal Teams - Responsibilities
(Local) IT Support Teams - Responsibilities
Calls, Chats, Alerts, Tickets
Incident / Problems / Changes (ITIL’s way)
Stability & Post-Mortem Meetings (Agile way)
SLI
(Observe)
SLO
(Oversee)
SLA
(Own)
Progress (Follow-Up/Report-Up)
Tickets New/Completed (JIRA -> slack channel)
Daily/Weekly Slack-Up (You -> slack channel)
Individual Daily Pin Pointing (On Your Desk)
Bi-Weekly One-To-One (Room / Walk / Lunch)
Monthly Retrospective (Last Friday of the Month Afternoon)
Trust &
Visibility
Trust &
Visibility
Trust ring mitigates business hijacking
Critical credentials and access levels should be
shared (e.g. via OneLogin) with key team
members and C-Level (“break glass” access)
Access should cover all infrastructure assets,
platforms and systems
Alerts from monitoring tools and central consoles
for infrastructure must be broadcast to
communication channels (e.g. Slack, e-mail)

SRE Organizational Framework

  • 1. Site Reliability Engineering Organizational & Operational Framework (a.k.a. “Work Mode”) May 2019 Olaf Reitmaier Veracierta
  • 2. Agenda Introduction Weakness / Strengths • Culture of Service • Bi-Modality Awareness • Ways of Working • Organization • Innovation • Communication • Planning Changes Trust & Visibility Round Robin (Q&A)
  • 4. Purpose • Clarify expectations from company leaders • Align all the people then address technology issues • Allow collaborators to share concerns (at the end, take notes) • Propose a set of criteria used to follow-up teamwork • Opportunity to be honest, positive and share ideas • Put everyone in the same (new) page, no excuses • Avoid repeating this information several times
  • 5. Site Reliability Engineering Vision • Be recognized as an expert and reliable group of people which offers the best in class knowledge and support over the core IT infrastructure. Mission: • Ensure the continuous and secure operation and support of the network, compute, storage, backup, messaging, logging and monitoring platforms of the core IT infrastructure enabling the critical business application and services.
  • 11. The Company / Business Culture Vicious Cycle: I really like a different future, but why change if everything is ok the way it is. The (other) people has to change. Micro Macro Safety Circle Blindness? Start with You
  • 12. Every: • Business • Customer • Organization • Employee Has: 1. Opportunities 2. Values 3. Risks 4. Rules Standards • Everybody has a movie • Criteria (dis-)alignment
  • 13. Change CULTURE (OF SERVICE) BI-MODAL (VERSATILITY) WAYS OF WORKING (ME / YOU / ALL) LEGACY (HR / IT) • Incremental • Iterative • Agreed
  • 15. User Experience => Every Interaction Counts Customers ⬄ Team / Group / Member / Partner (Internal & External)
  • 17. Emotional Intelligence KEY FOR PEOPLE RELATIONS KEY FOR WORK RELATIONS TRIUNE BRAIN SYSTEM 1/2 Conflict <=> Negotiation Tension <=> Nothing Conflict != Fight
  • 18. Triune Brain (Thinking/(Re-)Acting) With time, habits (good/bad) become reptilian bias; without noticing, you lose reasoning skills and the ability to change at all.
  • 19. System 1 and 2 (Thinking/(Re-)Acting) Stimulation -> Time -> Reaction “People gather together with people that make the work easier”.
  • 20. Empathy “Put yourself in the other person’s shoes” “Ask for reciprocity about it”
  • 21. Resilience Self-Esteem / Empathy / Positiveness Goals / Challenges / Opportunities Capacity to recover from, or thrive despite, any kind of difficulty in work and personal life Difficulties
  • 24. Infrastructure vs. Code (Infrastructure | Code): Hardware / Software | Software; Static / Immutable / Un-versioned | Dynamic / Mutable / Versioned; Moore’s Law (Faster) | Wirth’s Law (Slower); Extrinsic Documentation | Intrinsic Documentation; Administrator Driven (Indirect) | User Driven (Direct). Software Defined Anything, Virtualization, Cloud Native Solutions, Serverless: still a slow, really slow, transition…
  • 27. Technician vs. Engineer vs. Manager Solves different problems From technical to the business language Doing, thinking or organizing people (Hardly done in parallel by the same person) From nothing to a book of knowledge
  • 28. Managers (a.k.a. Coaches) Deliver: • Help for team members • Drive for the team to improve • Support for the team in good/bad times • Organization, planning and auditing of the work progress Expect: • Proactive progress reporting • Active participation in solutions • Answers to follow-up questions • Assurance that the job is done in time
  • 30. Autonomy (Freedom to innovate) You work within a team (not alone) for good & bad You implement and support highly available solutions (whenever possible) You foster automation instead of manual tasks (wherever possible) You look for strategic approval (especially for new projects / scope changes)
  • 31. Seniority Years doing same stuff → Changing on a daily basis Just being different → Making the difference Smart person → Smart team A lot of Power → Responsibilities Learning → Teaching (New senior engineers will need it!)
  • 32. Delegation Knowing what your responsibilities are is important We are here because somebody let us make mistakes Having time for innovation requires delegating to and teaching someone else, to be able to scale Senior: Complex Tasks Semi-Senior: Medium Tasks Entry: Basic Tasks
  • 33. 2nd in Command (Backup) RESPONSIBILITIES MATRIX COUPLE OF YOUR SELECTION (HIRE A NEW MEMBER?) PROMOTES SHARING OF CRITICAL DOCUMENTATION AND KNOWLEDGE
  • 34. Update Responsibility Matrix Infrastructure: • Network • Compute • Storage • Management Systems • Security/Identity Devices/Systems • Backup System and Disaster Recovery • Monitoring / Alerting (NPM, ITM, APM) • Supporting Business Applications
  • 35. Documentation (The Critical One) Alive (always changing) Challenging for everyone (for a lot of reasons) Boring because it is not for you but for others Needed for delegation to others (e.g. newcomers, those covering during vacation, sickness, resignation). Important as companies grow over time Reduces tribal knowledge Raises quality levels Allows auditing
  • 36. Documentation: Layered / Trimmed / Useful Abstraction Update -> Critical Archive -> Obsolete Focused on Training Discoverable (Searchable) KPI
  • 37. Documentation (Workshops) Summary: purpose, technology/vendor websites, external articles/references. Architecture: high-level (at least) visual representation of the platform or system (e.g. draw.io, dot/graphviz). Assets: resources inventory labeling (not naming), links (URL) and credentials (e.g. Vault, SSO). How-To’s: about (re-)installing and configuring, with focus on critical and tricky in-house customizations. Basic administration and troubleshooting: standard procedures, known errors with a brief explanation of the solution, references to articles/tickets and similar.
  • 39. Legacy • There was, there is and there will be • It’s a matter of time, but it is important • It is neither bad nor good, it is just legacy • Innovation is needed, but so is maintenance 99% of things are legacy from the moment they go live
  • 40. Innovation (& Investigation) Must be focused Value (Customer) Should be planned (End/Start) Trackable (Timeframe) Must be a process (Success/Failure) Measured (Deliverable)
  • 42. Communication • Notify: • Live changes to customers with enough notice • Absences in advance (book and/or block your calendar) • Late arrival at / early departure from the office to the team members • If you are working and applying changes on the weekend • Avoid: • Doing unplanned changes not related to live issues on “Fridays” • Implementing new features close to your vacation days (freeze) • Mixing informal/formal communication (e.g. me/x-team, misunderstanding, rumors) • Overlapping vacations with your 2nd in command (a.k.a. my backup)
  • 43. Estimated Time Ahead (ETA) BROADCAST INFORMATION: START, PROGRESS, END, EVIDENCE REPORT CONTINUOUSLY DON’T EXPECT TO BE FOLLOWED UP ANSWER QUESTIONS TO EASE AND CLARIFY FOLLOW-UP APPLIES TO EVERYTHING: INCIDENTS, PROBLEMS, REQUIREMENTS.
  • 44. Communication (Too many channels?) @Email: • Many formal communications and calendaring • Chat requests will be converted to tickets (as needed) #Slack: • Live Issues and News • Daily/Weekly Slack-Up Trial (instead of Daily Stand-Up) JIRA (Tickets/Wiki): • Kanban Board: Tasks/Requirements (Internal/External) • Wiki: For publishing critical infrastructure information Phone Calls / WhatsApp: • Only for emergencies
  • 45. Mute Generation / Work Comm. Channel Favorite Comm. Channel (IM) Daily Comm. Channel (IM) Family Comm. Channel (IM)
  • 46. So let’s Slack-Up! • (Trial) Broadcast to the Slack channel instead of a verbal stand-up. • Daily/Weekly basis at the start of the day, before 10 A.M. • Reminder for achievements/contributions to OKRs: • Maintenance • Innovation & Optimization • Support (only if not already a JIRA ticket) • Lowest, Low, Medium, High, Highest. • Undesired/Unplanned requests (e.g. C-Level urgencies, last-minute requests) • Helps you easily track your progress and risks (red flags). • Promotes team awareness of issues and progress. How does it look? https://slack.com/intl/en-de/slack-tips/run-daily-standups-or-check-ins
  • 49. Planning (OKR) Maintenance Projects (30%): Maintenance plans, Preventive monitoring. Optimization & Innovation Projects (20%): Cost-efficiency-driven plans, Security-driven plans, New technologies / features try-out / deployment plans. Support Projects (50%): New requirements, Reactive incident response • Known errors handling • Problems troubleshooting
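
The 30% / 20% / 50% split on this slide can be turned into hours with simple arithmetic. The following is a minimal Python sketch; the names `OKR_SHARES_PCT` and `split_capacity`, and the idea of applying the split to a weekly number of hours, are assumptions made for illustration and do not come from the deck.

```python
# Percentages taken from the slide; the bucket keys are hypothetical names.
OKR_SHARES_PCT = {
    "maintenance": 30,
    "optimization_innovation": 20,
    "support": 50,
}


def split_capacity(hours: float) -> dict:
    """Split a number of available hours across the three OKR buckets."""
    if hours < 0:
        raise ValueError("hours must be non-negative")
    # Multiply before dividing so whole-hour inputs stay exact.
    return {bucket: hours * pct / 100 for bucket, pct in OKR_SHARES_PCT.items()}


# Assumption: a 40-hour week is used only as an example input.
plan = split_capacity(40)
for bucket, bucket_hours in plan.items():
    print(f"{bucket}: {bucket_hours} h")
```

Comparing such a plan against the hours actually reported in the Slack-Up gives a quick signal when support work is crowding out maintenance or innovation.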
  • 50. Initiatives: Pick a selection to improve operations?
  • 51. Individual Planning / Report-Up (urgent vs. important matrix) IMPORTANT & URGENT: DO • Live Issue • Slack until fixed • Do the post-mortem documentation • OKRs • Slack-Up w/Manager IMPORTANT & NOT URGENT: DECIDE • Non-Live Issue • Ask for / open a ticket • Slack-Up w/Manager • Clarify expectations with third parties (e.g. ETA, deadline) • Move to Initiatives NOT IMPORTANT & URGENT: DELEGATE • Teach to fish • Assign to a less senior member • 2nd in Cmd. (Backup)? • Manager NOT IMPORTANT & NOT URGENT: DELETE • Don’t confuse it with Innovation in the OKRs • Stop thinking about it
  • 52. Estimated Time Ahead (ETA) Never Ending Story Unexpected Additions/Drops Important for critical infrastructure A challenge for most tech teams Different for innovation, maintenance and support
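
The urgent/important matrix on this slide is a two-question decision rule, which can be sketched in Python as follows. The `Action` enum and the `triage` function are hypothetical names introduced for illustration; only the four quadrant labels (DO, DECIDE, DELEGATE, DELETE) come from the slide.

```python
from enum import Enum


class Action(Enum):
    DO = "do"              # important and urgent (e.g. a live issue)
    DECIDE = "decide"      # important, not urgent (e.g. open a ticket)
    DELEGATE = "delegate"  # urgent, not important (e.g. 2nd in command)
    DELETE = "delete"      # neither important nor urgent


def triage(important: bool, urgent: bool) -> Action:
    """Map a task onto one quadrant of the urgent/important matrix."""
    if important and urgent:
        return Action.DO
    if important:
        return Action.DECIDE
    if urgent:
        return Action.DELEGATE
    return Action.DELETE


# Example: a live issue is both important and urgent.
print(triage(important=True, urgent=True).name)
```

Keeping the rule this small makes the hard part visible: the real work is agreeing with the manager on what counts as important and urgent.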
  • 53. IT Incident Response Plan (Overview) Alerts: Person, Call, E-Mail, #Slack, SMS, without standard classification (severity). Follow-up on the Slack Operational (OPS) channel (chat). Directly Responsible Individual (DRI) takes care of his/her alerts. Meeting of everyone involved in the Situation Room. SLA defined in terms of maximum downtime in hours per system / application. Spiral Escalation to Eng. Mgr., Head/Director, CTO, CEO. <45 minutes
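
The spiral escalation and the downtime-based SLA on this slide can be sketched in Python. The slide gives only the chain and the "<45 minutes" figure. Reading that figure as the window for the whole spiral and splitting it evenly across the four levels are assumptions of this sketch, as are the names `notified_levels` and `sla_breached`.

```python
# Escalation chain as listed on the slide.
ESCALATION_CHAIN = ["Eng. Mgr.", "Head/Director", "CTO", "CEO"]

# Assumption: "<45 minutes" is the window for the whole spiral.
ESCALATION_WINDOW_MIN = 45.0


def notified_levels(elapsed_min: float) -> list:
    """Return the levels that should have been notified by elapsed_min.

    Assumption: the window is split evenly, so the first level is
    notified at minute 0 and one more level every 45 / 4 = 11.25 minutes.
    """
    if elapsed_min < 0:
        raise ValueError("elapsed_min must be non-negative")
    step = ESCALATION_WINDOW_MIN / len(ESCALATION_CHAIN)
    count = min(len(ESCALATION_CHAIN), int(elapsed_min // step) + 1)
    return ESCALATION_CHAIN[:count]


def sla_breached(downtime_hours: float, max_downtime_hours: float) -> bool:
    """True when downtime exceeds the maximum allowed for a system."""
    return downtime_hours > max_downtime_hours


# Example: twelve minutes into an unresolved incident.
print(notified_levels(12))
```

If the team instead means 45 minutes per level, only `ESCALATION_WINDOW_MIN` and the step calculation need to change.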
  • 54. IT Incident Response Insights 24/7 Emergency Handling - Contacts Technology Team Cloud Support - Contacts On-Premise Datacenter Emergency - Contacts Formal Mailing Lists for the Business / Technology Teams Report + Escalate Pro vs. Re-Act Account + Adjust Incident Response Plan - Responsibilities DevOps & Site Reliability Teams – Responsibilities Physical / IT Security & Legal Teams - Responsibilities (Local) IT Support Teams - Responsibilities Calls, Chats, Alerts, Tickets Incident / Problems / Changes (ITIL’s way) Stability & Post-Mortem Meetings (Agile way) SLI (Observe) SLO (Oversee) SLA (Own)
  • 55. Progress (Follow-Up/Report-Up) Tickets New/Completed (JIRA -> slack channel) Daily/Weekly Slack-Up (You -> slack channel) Individual Daily Pin Pointing (On Your Desk) Bi-Weekly One-To-One (Room / Walk / Lunch) Monthly Retrospective (Last Friday of the Month Afternoon)
  • 57. Trust & Visibility Trust ring mitigates business hijacking Critical credentials and access levels should be shared (e.g. OneLogin) with key team members and C-Level (Breaking the glass) Access should cover all infrastructure assets, platforms and systems Monitoring tools and central console alerts for infrastructure must be broadcast in communication channels (e.g. Slack, e-mail)