SlideShare a Scribd company logo
1 of 25
Download to read offline
TRACK: SITE RELIABILTY ENGINEERING
NOVEMBER 12, 2020
Deepak Ramchandani, Contino
Driving Digital
Transformation
through CloudOps
and SRE
TRACK: SITE RELIABILTY ENGINEERING
Deepak Vensi
Account Principal @ Contino
linkedin.com/in/deepakrv/
● Multi-Cloud adoption within Regulated Industries
● Advocate & build in-house mature Engineering practices with a focus on
SRE
● Changing the Operating Model with a focus on FinOps, GitOps &
DevSecOps
● Help develop Cloud Native Products and Services for Enterprises
● Build sustainable in-house digital capabilities for long term adoption
TRACK: SITE RELIABILTY ENGINEERING
• Core functions part of IT Operations
• What does Operations mean in Cloud
• Why SRE?
• How can you bring Reliability / Docs / Code
/ Controls together?
• How to upskills a team towards SRE
Agenda
TRACK: SITE RELIABILTY ENGINEERING
• Incident Management
• Change Management
• Release Management
• Configuration Management
• Capacity Management
• Business Continuity & Backup
• Asset Management
• Demand Management
• Knowledge Management
• Risk Management
Core Functions of Traditional IT
Operations
TRACK: SITE RELIABILTY ENGINEERING
What does a “Pipeline” look like?
Prereqs Request Define Qualify
Materiality &
Risk
Data
Assessment
BC/ Resilience Architecture Build in DEV
Secure @
Design
Service
Transition
Build in PROD
Queue Queue Queue Queue Queue Queue Queue Queue Queue Queue Queue Queue
Secure @
Build
Endorse Promote
Queue Queue
TRACK: SITE RELIABILTY ENGINEERING
• Reliability
• Security
• Operability
• Predictability
• “Control”
What do IT Ops really want?
TRACK: SITE RELIABILTY ENGINEERING
The Gap to Bridge
Traditional IT
Ops
Public Cloud &
Digital
New Modes of
Operations
TRACK: SITE RELIABILTY ENGINEERING
Site reliability engineering (SRE) is a discipline
that incorporates aspects of software engineering
and applies them to infrastructure and operations
problems.[1] The main goals are to create
scalable and highly reliable software systems.
According to Ben Treynor, founder of Google's
Site Reliability Team, SRE is "what happens when
a software engineer is tasked with what used to
be called operations."
SRE to the Rescue!
TRACK: SITE RELIABILTY ENGINEERING
The four key pillars to CloudOps
SRE FinOps GitOps DevSecOps
TRACK: SITE RELIABILTY ENGINEERING
SRE to the Rescue!
SRE Stepping stones:
lay the foundation first
Reliability Hierarchy
of Needs
Review architecture for reliability
SLx
Monitoring & Alerting
Testing & Release
Capacity planning
Post Mortem/RCA
Development
Product
TRACK: SITE RELIABILTY ENGINEERING
Fear not!
You Build it you Run it
Design Run
Build
Act like a
Developer, Think
like a Systems
Operator
TRACK: SITE RELIABILTY ENGINEERING
Roles and Responsibilities
Site Reliability Engineer
“Act like a Developer, think
like a Systems Operator”.
4. Eliminate Toil via
Automation
2. Create Service Level
Indicators & Objectives
1. Consult with
Consumers
3. Design, Build & Run
Platforms / Services / Apps
5. Monitor Distributed
Systems & Develop Products
9. Emergency Response
& On-call Support
6. Postmortems & Learn
from Failure
8. Track Platform
Outages
7. Manage Platform &
Consumer Incidents
TRACK: SITE RELIABILTY ENGINEERING
How to get started?
TRACK: SITE RELIABILTY ENGINEERING
Team Topology
Cloud Platforms
Landing Zones | API Gateway |
Patterns | Shared Services
Platform Team(s)
Team 1
Product 3
Application 2
Service 4
Stream Aligned Team(s)
Developer
Experience
Enablement
Team
Customers: Internal Product/App
teams
Reliability: Quite high!
Customers: External facing services
Reliability: Service dependant
TRACK: SITE RELIABILTY ENGINEERING
Team Topology
Cloud Platforms
Landing Zones | API Gateway |
Patterns | Shared Services
Platform Team(s)
Team 1
Product 3
Application 2
Service 4
Stream Aligned Team(s)
Mandates for Success
● Defining the “Team API” for Platform Teams
● Cloud Platforms enable the demand from
the stream aligned teams, not dictate them
● Platforms Teams roadmap and priorities
are public and open
● Platforms Teams have wiki based
documentation available for consumption
● Team dependencies should be reduced to
enable Flow
● Developer experience metrics should be
tracked for Platform Teams, along with the
4 key metrics
● Technical Decisions have to be made based
on Team Cognitive Load
● The Three key Interaction Models:
Facilitate, Collaborate, X as a Service
Developer
Experience
Enablement
Team
TRACK: SITE RELIABILTY ENGINEERING
Code-Docs-Controls-Reliability
Architectural Decision
Record (ADR)
The good stuff Documentation as code
Dashboards
TRACK: SITE RELIABILTY ENGINEERING
Platform
Operations
Cloud Platform Team
Applications
Engineering
Application
STRATEGIC
Platform
Operations
Cloud Platform Team
Applications
Engineering
Application
STRATEGIC
Platform
Operations
Cloud Platform Team
Applications
Engineering
Applications
Engineering
Applications Operations
TRANSITIONAL
Cloud Managed Services
Mode 1
“Traditional Operations”
Mode 2
“Distributed Ops”
Mode 3
“Decentralised Ops”
TRACK: SITE RELIABILTY ENGINEERING
• Kitchen Sink, a.k.a. “Everything SRE”
• Infrastructure / Platforms
• Tools
• Product/application
• Embedded
• Consulting
How get organised and started?
https://cloud.google.com/blog/products/devops-sre/how-sre-teams-are-organized-and-how-to-get-started
TRACK: SITE RELIABILTY ENGINEERING
• Infrastructure / Platforms
• Developer Experience / Developer Velocity
• Function - Stability - Ease of Use - Clarity -
• Product/application
• DORA Metrics:
• Lead Time for Changes - Change Failure Rate - Time
to Restore - Deployment Frequency Availability
What to Measure
TRACK: SITE RELIABILTY ENGINEERING
How to upskills a team towards
SRE
TRACK: SITE RELIABILTY ENGINEERING
Get the basics right
• Find out what type of SRE team you are
• Start with a small product / platform
• Identify your customers
• Figure out what you need to measure
• Agree the metrics WITH your customers
• Mesure - Improve - Fail - Improve - Measure
TRACK: SITE RELIABILTY ENGINEERING
How to scale
• Align engineering OKRs to the organisation
metrics
• Start running chaos engineering practices
across teams for product improvements
• Start benchmarking your SRE maturity
• Use engineering mitosis to scale SRE
practices
TRACK: SITE RELIABILTY ENGINEERING
A game day simulates a failure or event to test systems,
processes, and team responses. The purpose is to actually
perform the actions the team would perform as if an exceptional
event happened.
● To test whether or not the current systems are more or less resilient; with
the adequate processes to support it
● To build the "muscle memory" of the team(s) on how to respond if an
exceptional event happened.
● To evaluate a team's ability to design-build-run systems taking into
consideration the well architected pillars of operations, security,
reliability, performance, and cost.
● To test all aspects of one’s business for Reliability Readiness; specifically
operations, test, development, security, business operations, and
business leaders.
● Game Days are run in order to improve the availability of a system with
the goal of increasing reliability by purposefully creating major failures on
a regular basis.
“
What is a Game Day?
TRACK: SITE RELIABILTY ENGINEERING
● https://landing.google.com/sre/
● https://www.gremlin.com/blog
● Reliability Engineering at the Core of
Continuous Innovation
● Boost Your Apps With An SRE Approach
to Development
TRACK: SITE RELIABILTY ENGINEERING
THANK YOU TO OUR SPONSORS

More Related Content

Similar to ADDO_2020-Driving-Digital-Transformation-through-CloudOps-and-SRE.pdf

(ENT306) Application Portfolio Migration | AWS re:Invent 2014
(ENT306) Application Portfolio Migration | AWS re:Invent 2014(ENT306) Application Portfolio Migration | AWS re:Invent 2014
(ENT306) Application Portfolio Migration | AWS re:Invent 2014Amazon Web Services
 
Clover Infotech Corporate PPT
Clover Infotech Corporate PPTClover Infotech Corporate PPT
Clover Infotech Corporate PPTSwetha Elias
 
Raghu VM_Cloud Resume
Raghu VM_Cloud ResumeRaghu VM_Cloud Resume
Raghu VM_Cloud ResumeRaghu Ravi
 
How to consolidate Citrix Monitoring in a Single Pane of Glass
How to consolidate Citrix Monitoring in a Single Pane of GlassHow to consolidate Citrix Monitoring in a Single Pane of Glass
How to consolidate Citrix Monitoring in a Single Pane of GlasseG Innovations
 
QConSF 2018: Building Production-Ready Applications
QConSF 2018: Building Production-Ready ApplicationsQConSF 2018: Building Production-Ready Applications
QConSF 2018: Building Production-Ready ApplicationsMichael Kehoe
 
Cloud Adoption in the Enterprise: Industry Perspective IP Expo 2013
Cloud Adoption in the Enterprise: Industry Perspective IP Expo 2013Cloud Adoption in the Enterprise: Industry Perspective IP Expo 2013
Cloud Adoption in the Enterprise: Industry Perspective IP Expo 2013Amazon Web Services
 
5 steps to Network Reliability Engineering and Automated Network Operations
5 steps to Network Reliability Engineering and Automated Network Operations5 steps to Network Reliability Engineering and Automated Network Operations
5 steps to Network Reliability Engineering and Automated Network OperationsJames Kelly
 
Measure and increase developer productivity with help of Severless by Kazulki...
Measure and increase developer productivity with help of Severless by Kazulki...Measure and increase developer productivity with help of Severless by Kazulki...
Measure and increase developer productivity with help of Severless by Kazulki...Vadym Kazulkin
 
Transforming to OpenStack: a sample roadmap to DevOps
Transforming to OpenStack: a sample roadmap to DevOpsTransforming to OpenStack: a sample roadmap to DevOps
Transforming to OpenStack: a sample roadmap to DevOpsNicolas (Nick) Barcet
 
Microdeployments for microservices dev ops nashville
Microdeployments for microservices   dev ops nashvilleMicrodeployments for microservices   dev ops nashville
Microdeployments for microservices dev ops nashvilleNathaniel (Ned) Bauerle
 
Software Product Engineering Services | Digital Transformation
Software Product Engineering  Services | Digital TransformationSoftware Product Engineering  Services | Digital Transformation
Software Product Engineering Services | Digital TransformationSkizzle Technolabs
 
Pivotal korea transformation_strategy_seminar_enterprise_dev_ops_20160630_v1.0
Pivotal korea transformation_strategy_seminar_enterprise_dev_ops_20160630_v1.0Pivotal korea transformation_strategy_seminar_enterprise_dev_ops_20160630_v1.0
Pivotal korea transformation_strategy_seminar_enterprise_dev_ops_20160630_v1.0minseok kim
 
VLS_Capability_Presentation
VLS_Capability_PresentationVLS_Capability_Presentation
VLS_Capability_PresentationBill Nelson
 
AWS OpsWorks for Chef Automate
AWS OpsWorks for Chef AutomateAWS OpsWorks for Chef Automate
AWS OpsWorks for Chef AutomatePolarSeven Pty Ltd
 
Cloud Based Cognitive Learning & IT Project Performance Platform (CLIPP Platf...
Cloud Based Cognitive Learning & IT Project Performance Platform (CLIPP Platf...Cloud Based Cognitive Learning & IT Project Performance Platform (CLIPP Platf...
Cloud Based Cognitive Learning & IT Project Performance Platform (CLIPP Platf...Ed Sattar
 
SRE (service reliability engineer) on big DevOps platform running on the clou...
SRE (service reliability engineer) on big DevOps platform running on the clou...SRE (service reliability engineer) on big DevOps platform running on the clou...
SRE (service reliability engineer) on big DevOps platform running on the clou...DevClub_lv
 
Migrating Thousands of Workloads to AWS at Enterprise Scale – Chris Wegmann, ...
Migrating Thousands of Workloads to AWS at Enterprise Scale – Chris Wegmann, ...Migrating Thousands of Workloads to AWS at Enterprise Scale – Chris Wegmann, ...
Migrating Thousands of Workloads to AWS at Enterprise Scale – Chris Wegmann, ...Amazon Web Services
 
Unit 9 and Unit 10.pptx
Unit 9 and Unit 10.pptxUnit 9 and Unit 10.pptx
Unit 9 and Unit 10.pptxReshmaGummadi1
 

Similar to ADDO_2020-Driving-Digital-Transformation-through-CloudOps-and-SRE.pdf (20)

(ENT306) Application Portfolio Migration | AWS re:Invent 2014
(ENT306) Application Portfolio Migration | AWS re:Invent 2014(ENT306) Application Portfolio Migration | AWS re:Invent 2014
(ENT306) Application Portfolio Migration | AWS re:Invent 2014
 
Clover Infotech Corporate PPT
Clover Infotech Corporate PPTClover Infotech Corporate PPT
Clover Infotech Corporate PPT
 
Raghu VM_Cloud Resume
Raghu VM_Cloud ResumeRaghu VM_Cloud Resume
Raghu VM_Cloud Resume
 
How to consolidate Citrix Monitoring in a Single Pane of Glass
How to consolidate Citrix Monitoring in a Single Pane of GlassHow to consolidate Citrix Monitoring in a Single Pane of Glass
How to consolidate Citrix Monitoring in a Single Pane of Glass
 
QConSF 2018: Building Production-Ready Applications
QConSF 2018: Building Production-Ready ApplicationsQConSF 2018: Building Production-Ready Applications
QConSF 2018: Building Production-Ready Applications
 
What is DevOps?
What is DevOps?What is DevOps?
What is DevOps?
 
Cloud Adoption in the Enterprise: Industry Perspective IP Expo 2013
Cloud Adoption in the Enterprise: Industry Perspective IP Expo 2013Cloud Adoption in the Enterprise: Industry Perspective IP Expo 2013
Cloud Adoption in the Enterprise: Industry Perspective IP Expo 2013
 
5 steps to Network Reliability Engineering and Automated Network Operations
5 steps to Network Reliability Engineering and Automated Network Operations5 steps to Network Reliability Engineering and Automated Network Operations
5 steps to Network Reliability Engineering and Automated Network Operations
 
Measure and increase developer productivity with help of Severless by Kazulki...
Measure and increase developer productivity with help of Severless by Kazulki...Measure and increase developer productivity with help of Severless by Kazulki...
Measure and increase developer productivity with help of Severless by Kazulki...
 
Transforming to OpenStack: a sample roadmap to DevOps
Transforming to OpenStack: a sample roadmap to DevOpsTransforming to OpenStack: a sample roadmap to DevOps
Transforming to OpenStack: a sample roadmap to DevOps
 
Microdeployments for microservices dev ops nashville
Microdeployments for microservices   dev ops nashvilleMicrodeployments for microservices   dev ops nashville
Microdeployments for microservices dev ops nashville
 
Software Product Engineering Services | Digital Transformation
Software Product Engineering  Services | Digital TransformationSoftware Product Engineering  Services | Digital Transformation
Software Product Engineering Services | Digital Transformation
 
Pivotal korea transformation_strategy_seminar_enterprise_dev_ops_20160630_v1.0
Pivotal korea transformation_strategy_seminar_enterprise_dev_ops_20160630_v1.0Pivotal korea transformation_strategy_seminar_enterprise_dev_ops_20160630_v1.0
Pivotal korea transformation_strategy_seminar_enterprise_dev_ops_20160630_v1.0
 
VLS_Capability_Presentation
VLS_Capability_PresentationVLS_Capability_Presentation
VLS_Capability_Presentation
 
AWS OpsWorks for Chef Automate
AWS OpsWorks for Chef AutomateAWS OpsWorks for Chef Automate
AWS OpsWorks for Chef Automate
 
Cloud Based Cognitive Learning & IT Project Performance Platform (CLIPP Platf...
Cloud Based Cognitive Learning & IT Project Performance Platform (CLIPP Platf...Cloud Based Cognitive Learning & IT Project Performance Platform (CLIPP Platf...
Cloud Based Cognitive Learning & IT Project Performance Platform (CLIPP Platf...
 
Elastic-Engineering
Elastic-EngineeringElastic-Engineering
Elastic-Engineering
 
SRE (service reliability engineer) on big DevOps platform running on the clou...
SRE (service reliability engineer) on big DevOps platform running on the clou...SRE (service reliability engineer) on big DevOps platform running on the clou...
SRE (service reliability engineer) on big DevOps platform running on the clou...
 
Migrating Thousands of Workloads to AWS at Enterprise Scale – Chris Wegmann, ...
Migrating Thousands of Workloads to AWS at Enterprise Scale – Chris Wegmann, ...Migrating Thousands of Workloads to AWS at Enterprise Scale – Chris Wegmann, ...
Migrating Thousands of Workloads to AWS at Enterprise Scale – Chris Wegmann, ...
 
Unit 9 and Unit 10.pptx
Unit 9 and Unit 10.pptxUnit 9 and Unit 10.pptx
Unit 9 and Unit 10.pptx
 

Recently uploaded

Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 

Recently uploaded (20)

E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 

ADDO_2020-Driving-Digital-Transformation-through-CloudOps-and-SRE.pdf

  • 1. TRACK: SITE RELIABILTY ENGINEERING NOVEMBER 12, 2020 Deepak Ramchandani, Contino Driving Digital Transformation through CloudOps and SRE
  • 2. TRACK: SITE RELIABILTY ENGINEERING Deepak Vensi Account Principal @ Contino linkedin.com/in/deepakrv/ ● Multi-Cloud adoption within Regulated Industries ● Advocate & build in-house mature Engineering practices with a focus on SRE ● Changing the Operating Model with a focus on FinOps, GitOps & DevSecOps ● Help develop Cloud Native Products and Services for Enterprises ● Build sustainable in-house digital capabilities for long term adoption
  • 3. TRACK: SITE RELIABILTY ENGINEERING • Core functions part of IT Operations • What does Operations mean in Cloud • Why SRE? • How can you bring Reliability / Docs / Code / Controls together? • How to upskills a team towards SRE Agenda
  • 4. TRACK: SITE RELIABILTY ENGINEERING • Incident Management • Change Management • Release Management • Configuration Management • Capacity Management • Business Continuity & Backup • Asset Management • Demand Management • Knowledge Management • Risk Management Core Functions of Traditional IT Operations
  • 5. TRACK: SITE RELIABILTY ENGINEERING What does a “Pipeline” look like? Prereqs Request Define Qualify Materiality & Risk Data Assessment BC/ Resilience Architecture Build in DEV Secure @ Design Service Transition Build in PROD Queue Queue Queue Queue Queue Queue Queue Queue Queue Queue Queue Queue Secure @ Build Endorse Promote Queue Queue
  • 6. TRACK: SITE RELIABILTY ENGINEERING • Reliability • Security • Operability • Predictability • “Control” What do IT Ops really want?
  • 7. TRACK: SITE RELIABILTY ENGINEERING The Gap to Bridge Traditional IT Ops Public Cloud & Digital New Modes of Operations
  • 8. TRACK: SITE RELIABILTY ENGINEERING Site reliability engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems.[1] The main goals are to create scalable and highly reliable software systems. According to Ben Treynor, founder of Google's Site Reliability Team, SRE is "what happens when a software engineer is tasked with what used to be called operations." SRE to the Rescue!
  • 9. TRACK: SITE RELIABILTY ENGINEERING The four key pillars to CloudOps SRE FinOps GitOps DevSecOps
  • 10. TRACK: SITE RELIABILTY ENGINEERING SRE to the Rescue! SRE Stepping stones: lay the foundation first Reliability Hierarchy of Needs Review architecture for reliability SLx Monitoring & Alerting Testing & Release Capacity planning Post Mortem/RCA Development Product
  • 11. TRACK: SITE RELIABILTY ENGINEERING Fear not! You Build it you Run it Design Run Build Act like a Developer, Think like a Systems Operator
  • 12. TRACK: SITE RELIABILTY ENGINEERING Roles and Responsibilities Site Reliability Engineer “Act like a Developer, think like a Systems Operator”. 4. Eliminate Toil via Automation 2. Create Service Level Indicators & Objectives 1. Consult with Consumers 3. Design, Build & Run Platforms / Services / Apps 5. Monitor Distributed Systems & Develop Products 9. Emergency Response & On-call Support 6. Postmortems & Learn from Failure 8. Track Platform Outages 7. Manage Platform & Consumer Incidents
  • 13. TRACK: SITE RELIABILTY ENGINEERING How to get started?
  • 14. TRACK: SITE RELIABILTY ENGINEERING Team Topology Cloud Platforms Landing Zones | API Gateway | Patterns | Shared Services Platform Team(s) Team 1 Product 3 Application 2 Service 4 Stream Aligned Team(s) Developer Experience Enablement Team Customers: Internal Product/App teams Reliability: Quite high! Customers: External facing services Reliability: Service dependant
  • 15. TRACK: SITE RELIABILTY ENGINEERING Team Topology Cloud Platforms Landing Zones | API Gateway | Patterns | Shared Services Platform Team(s) Team 1 Product 3 Application 2 Service 4 Stream Aligned Team(s) Mandates for Success ● Defining the “Team API” for Platform Teams ● Cloud Platforms enable the demand from the stream aligned teams, not dictate them ● Platforms Teams roadmap and priorities are public and open ● Platforms Teams have wiki based documentation available for consumption ● Team dependencies should be reduced to enable Flow ● Developer experience metrics should be tracked for Platform Teams, along with the 4 key metrics ● Technical Decisions have to be made based on Team Cognitive Load ● The Three key Interaction Models: Facilitate, Collaborate, X as a Service Developer Experience Enablement Team
  • 16. TRACK: SITE RELIABILTY ENGINEERING Code-Docs-Controls-Reliability Architectural Decision Record (ADR) The good stuff Documentation as code Dashboards
  • 17. TRACK: SITE RELIABILTY ENGINEERING Platform Operations Cloud Platform Team Applications Engineering Application STRATEGIC Platform Operations Cloud Platform Team Applications Engineering Application STRATEGIC Platform Operations Cloud Platform Team Applications Engineering Applications Engineering Applications Operations TRANSITIONAL Cloud Managed Services Mode 1 “Traditional Operations” Mode 2 “Distributed Ops” Mode 3 “Decentralised Ops”
  • 18. TRACK: SITE RELIABILTY ENGINEERING • Kitchen Sink, a.k.a. “Everything SRE” • Infrastructure / Platforms • Tools • Product/application • Embedded • Consulting How get organised and started? https://cloud.google.com/blog/products/devops-sre/how-sre-teams-are-organized-and-how-to-get-started
  • 19. TRACK: SITE RELIABILTY ENGINEERING • Infrastructure / Platforms • Developer Experience / Developer Velocity • Function - Stability - Ease of Use - Clarity - • Product/application • DORA Metrics: • Lead Time for Changes - Change Failure Rate - Time to Restore - Deployment Frequency Availability What to Measure
  • 20. TRACK: SITE RELIABILTY ENGINEERING How to upskills a team towards SRE
  • 21. TRACK: SITE RELIABILTY ENGINEERING Get the basics right • Find out what type of SRE team you are • Start with a small product / platform • Identify your customers • Figure out what you need to measure • Agree the metrics WITH your customers • Mesure - Improve - Fail - Improve - Measure
  • 22. TRACK: SITE RELIABILTY ENGINEERING How to scale • Align engineering OKRs to the organisation metrics • Start running chaos engineering practices across teams for product improvements • Start benchmarking your SRE maturity • Use engineering mitosis to scale SRE practices
  • 23. TRACK: SITE RELIABILTY ENGINEERING A game day simulates a failure or event to test systems, processes, and team responses. The purpose is to actually perform the actions the team would perform as if an exceptional event happened. ● To test whether or not the current systems are more or less resilient; with the adequate processes to support it ● To build the "muscle memory" of the team(s) on how to respond if an exceptional event happened. ● To evaluate a team's ability to design-build-run systems taking into consideration the well architected pillars of operations, security, reliability, performance, and cost. ● To test all aspects of one’s business for Reliability Readiness; specifically operations, test, development, security, business operations, and business leaders. ● Game Days are run in order to improve the availability of a system with the goal of increasing reliability by purposefully creating major failures on a regular basis. “ What is a Game Day?
  • 24. TRACK: SITE RELIABILTY ENGINEERING ● https://landing.google.com/sre/ ● https://www.gremlin.com/blog ● Reliability Engineering at the Core of Continuous Innovation ● Boost Your Apps With An SRE Approach to Development
  • 25. TRACK: SITE RELIABILTY ENGINEERING THANK YOU TO OUR SPONSORS