3. Challenges with Traditional NOC
High Alert Noise
✓ No alert prioritization (all alerts are converted directly into incidents)
✓ High volume of incidents due to lack of event prioritization
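The missing prioritization step can be as simple as a severity filter in front of the ticketing system, so only high-severity alerts become incidents. A minimal sketch (the severity names and threshold are illustrative assumptions, not from a specific tool):

```python
# Alert-prioritization sketch: only alerts at or above a severity threshold
# become incidents; the rest are kept as noise for trend analysis.
# Severity names and the default threshold are illustrative assumptions.
SEVERITY_RANK = {"info": 0, "warning": 1, "minor": 2, "major": 3, "critical": 4}

def triage(alerts, threshold="major"):
    """Split raw alerts into incident candidates and noise."""
    cutoff = SEVERITY_RANK[threshold]
    incidents = [a for a in alerts if SEVERITY_RANK[a["severity"]] >= cutoff]
    noise = [a for a in alerts if SEVERITY_RANK[a["severity"]] < cutoff]
    return incidents, noise

alerts = [
    {"ci": "db01", "severity": "critical"},
    {"ci": "web03", "severity": "info"},
    {"ci": "app02", "severity": "major"},
]
incidents, noise = triage(alerts)
```

In practice the threshold would differ per service class, but even this single gate stops every informational alert from becoming a ticket.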
Underutilized NOC
✓ NOC engineers play only an alert escalation & follow-up role (purely L1)
✓ No technical input during high-severity incidents (P1 outages), resulting in high MTTR (Mean Time To Resolution)
✓ 80% of the work involves manually monitoring alerts and watching rows of graphs on screen
Problem Management
✓ Mindset is focused only on alert resolution rather than problem management
✓ Lack of RCA & CAPA (Corrective Action & Preventive Action) practice for repetitive high-severity incidents
Scalability Issues
✓ Unable to scale rapidly during infrastructure expansion due to multiple manual processes
✓ High chance of missed monitoring coverage due to manual processes & lack of a feedback loop
SLA Issues
✓ Service Level Agreements (SLAs) are not business-aligned; the focus is only on infrastructure availability
✓ Lack of SLIs (Service Level Indicators) & SLOs (Service Level Objectives), resulting in inefficient SLA tracking
✓ SLIs are the best way to ensure availability & performance, rather than tracking the SLA alone
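The SLI/SLO relationship above can be made concrete: an availability SLI is the ratio of good events to total events, the SLO is the target for that ratio, and the gap between them is the error budget. A minimal sketch (the event counts and the 99.9% target are illustrative assumptions):

```python
# SLI/SLO sketch: availability SLI = good events / total events, compared
# against an SLO target; the unspent error budget drives SLA tracking.
# All figures here are illustrative assumptions.
def availability_sli(good_events, total_events):
    return good_events / total_events

def error_budget_remaining(sli, slo_target):
    """Fraction of the error budget still unspent (negative = overspent)."""
    allowed_failure = 1.0 - slo_target
    actual_failure = 1.0 - sli
    return (allowed_failure - actual_failure) / allowed_failure

sli = availability_sli(998_500, 1_000_000)   # 99.85% measured availability
budget = error_budget_remaining(sli, 0.999)  # against a 99.9% SLO
```

Here the measured SLI (99.85%) misses the 99.9% SLO, so the budget goes negative: a concrete, trackable signal, unlike an SLA stated only as "the infrastructure was up".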
4. SRE Roadmap
Roadmap stages: Collect Data → Correlate and Triage → Identify Trends → Predict, Notify & Act
SRE Golden Signals (Alerting, Troubleshooting, Tuning & Capacity Planning)
Monitoring, Auditing, Troubleshooting & Security (Compute | Storage | Network | Application)
Start Monitoring CIs
Work closely toward 100% monitoring coverage using continuous monitoring (immutable Infrastructure as Code)
Monitoring Data Sources
▪ SolarWinds (Compute, Storage & Network)
▪ Dynatrace (APM)
▪ Synthetic Monitoring
Design & implement a CMDB (single source of truth) for the entire infrastructure
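With the CMDB as the single source of truth, monitoring coverage stops being a guess: the coverage gap is simply the set difference between the CMDB inventory and what the monitoring tools actually report on. A sketch (CI names are hypothetical):

```python
# Coverage-gap sketch: diff the CMDB CI inventory (single source of truth)
# against the set of CIs the monitoring tools actually report on.
# CI names are hypothetical.
def coverage_report(cmdb_cis, monitored_cis):
    cmdb, monitored = set(cmdb_cis), set(monitored_cis)
    gaps = sorted(cmdb - monitored)    # CIs with no monitoring at all
    stale = sorted(monitored - cmdb)   # monitored hosts missing from the CMDB
    pct = 100.0 * len(cmdb & monitored) / len(cmdb)
    return {"coverage_pct": pct, "gaps": gaps, "stale": stale}

report = coverage_report(
    cmdb_cis=["db01", "web01", "web02", "app01"],
    monitored_cis=["db01", "web01", "app01", "legacy09"],
)
```

Run continuously, this is the feedback loop the traditional NOC lacked: every infra expansion either raises coverage toward 100% or immediately surfaces in the gap list.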
Trends & Anomalies
▪ Capacity planning
▪ Cost recommendations
▪ Continuous compliance (detect deviations from a “golden baseline”)
▪ Release-to-release benchmarks
▪ Toil reduction – automate repetitive tasks
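Continuous compliance against a golden baseline reduces, in the simplest case, to diffing each CI's current configuration snapshot against the approved baseline and flagging the keys that drifted. A minimal sketch (the configuration keys and values are illustrative assumptions):

```python
# Continuous-compliance sketch: flag settings whose current value deviates
# from the "golden baseline" snapshot. Keys and values are illustrative.
def detect_deviations(baseline, current):
    deviations = {}
    for key, expected in baseline.items():
        actual = current.get(key)
        if actual != expected:
            deviations[key] = {"expected": expected, "actual": actual}
    return deviations

golden = {"ntp": "enabled", "selinux": "enforcing", "ssh_root_login": "no"}
snapshot = {"ntp": "enabled", "selinux": "permissive", "ssh_root_login": "no"}
drift = detect_deviations(golden, snapshot)
```

Each flagged deviation can feed either an automated remediation (restoring the baseline) or a compliance report, rather than waiting for the drift to cause an incident.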
Problem Management
▪ Publish Top N noise-maker CIs
▪ Post-mortem culture using problem management (learning from failure)
▪ Implement custom self-healing for IT infrastructure & services
▪ Publish SLI, SLO & SLM reports
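The "Top N noise makers" report is a straightforward aggregation over the alert history: count alerts per CI and publish the worst offenders as problem-management candidates. A sketch (the alert data is illustrative):

```python
# "Top N noise makers" sketch: count alerts per CI and surface the worst
# offenders as problem-management candidates. Alert data is illustrative.
from collections import Counter

def top_noise_makers(alerts, n=3):
    return Counter(a["ci"] for a in alerts).most_common(n)

alerts = [{"ci": c} for c in
          ["db01", "db01", "web02", "db01", "web02", "app05"]]
top = top_noise_makers(alerts, n=2)
```

Published weekly, this list turns the "alert resolution only" mindset around: the CIs at the top are exactly where RCA & CAPA effort pays off most.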
Event Management
▪ Design & implement an AIOps-based layer that collects data (metrics/events) from multiple data sources & presents it in a single pane of glass
▪ Design & build service models
▪ Build event correlation (topology/stream) to reduce alert noise
▪ Consolidate monitoring tools
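A stream-correlation layer in its simplest form folds alerts that share a CI and check, arriving within a short time window, into one correlated event, so the single pane of glass shows one flapping database instead of dozens of duplicates. A sketch (the window size and alert fields are assumptions):

```python
# Stream-correlation sketch: alerts sharing (ci, check) within a time window
# collapse into one correlated event, reducing alert noise before display.
# The 300-second window and the alert fields are assumptions.
def correlate(alerts, window_s=300):
    """Group alerts by (ci, check); alerts within window_s of a group's
    first alert are folded into it as duplicates."""
    groups = {}
    for a in sorted(alerts, key=lambda a: a["ts"]):
        key = (a["ci"], a["check"])
        g = groups.get(key)
        if g and a["ts"] - g["first_ts"] <= window_s:
            g["count"] += 1
        else:
            groups[key] = {"ci": a["ci"], "check": a["check"],
                           "first_ts": a["ts"], "count": 1}
    return list(groups.values())

raw = [
    {"ci": "db01", "check": "cpu", "ts": 0},
    {"ci": "db01", "check": "cpu", "ts": 60},
    {"ci": "db01", "check": "cpu", "ts": 120},
    {"ci": "web02", "check": "disk", "ts": 90},
]
events = correlate(raw)
```

Topology-based correlation extends the same idea by grouping on the service model (e.g. all alerts behind a failed switch) rather than on identical keys.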
Incident Management
▪ Integrate monitoring events with ITSM ticketing
▪ Robust automated alert notification (PagerDuty | AlarmPoint)
▪ Define SLIs, SLOs & SLMs
▪ Make monitoring data available during production outages
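The monitoring-to-ITSM integration is, at its core, a mapping from a monitoring event to an incident payload, with ticket priority derived from event severity. A sketch (the field names and priority mapping are hypothetical, not a specific ITSM product's API):

```python
# Event-to-ticket sketch: map a monitoring event onto an ITSM incident
# payload, deriving ticket priority from event severity. Field names and
# the priority mapping are hypothetical, not any specific product's API.
PRIORITY_MAP = {"critical": "P1", "major": "P2", "minor": "P3"}

def to_incident(event):
    return {
        "short_description": f'{event["ci"]}: {event["summary"]}',
        "priority": PRIORITY_MAP.get(event["severity"], "P4"),
        "ci": event["ci"],
        "source": "monitoring",
    }

ticket = to_incident(
    {"ci": "db01", "summary": "replication lag high", "severity": "critical"}
)
```

The same payload can then fan out to the notification layer (PagerDuty, AlarmPoint) so paging and ticketing stay consistent with each other.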
5. SRE Team Structure: SRE Level 1 | SRE Level 2 | SRE (Tools & Automation SMEs)
Improve MTTD (SRE Level 1)
▪ Virtual team for live 24x7 monitoring (availability & performance)
▪ Automated alert escalation to the L2 NOC support team (P1 | P2 | P3 incidents)
▪ Track escalated alerts until alert resolution
▪ Engage Incident Management for P1 & P2 incidents
▪ Engage the NOC Dev team for missed monitoring opportunities
▪ Perform scheduled health check-ups
▪ Daily scheduled reports (availability | performance | outage etc.)
▪ Other BAU activities
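Tracking escalated alerts until resolution implies an escalation policy: an alert that sits unacknowledged past a timeout moves up a tier. A minimal sketch (the tier names and the 15-minute timeout are illustrative assumptions):

```python
# Escalation-tracking sketch: alerts unacknowledged past a timeout escalate
# one tier per interval. Tier names and the timeout are assumptions.
TIERS = ["L1", "L2", "L3/Product SME"]

def escalation_tier(age_minutes, ack_timeout=15):
    """Return the tier that should own an alert left unacknowledged
    for age_minutes, escalating one tier per timeout interval."""
    tier = min(age_minutes // ack_timeout, len(TIERS) - 1)
    return TIERS[tier]
```

Real escalation tools (e.g. PagerDuty escalation policies) implement this per schedule and per service; the point is that the policy is declarative, not a manual follow-up task.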
Improve MTTR (SRE Level 2)
▪ Provide L2 analysis for all incidents
▪ Escalate open incidents to L3/Product SMEs
▪ Analyse & fix monitoring alerts
▪ Runbooks - step-by-step guides for resolving an incident
▪ Incident response reports
▪ Post-mortem reports (RCA and tasks to be performed to avoid future outages)
▪ Engage the NOC Dev team for repetitive tasks
Note: this team has L2 engineers/SMEs from the OS, App, DB, Middleware & Network domains
Improve MTBF (SRE Tools & Automation SMEs)
▪ Monitor every possible metric in the environment
▪ Design & configure a robust monitoring system (continuous monitoring)
▪ Work on new monitoring opportunities
▪ Automate runbooks (self-healing)
▪ Toil reduction - shift repetitive tasks from a manual to an automated approach
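Automating a runbook into self-healing means treating it as an ordered list of check/remediate pairs: each failing check triggers its remediation and is re-checked, and anything still failing escalates to a human. A sketch (the checks and remediations here are stand-in functions, not real probes):

```python
# Self-healing sketch: a runbook as ordered (check, remediate) steps; each
# failing check triggers its remediation and is re-checked, and anything
# still failing is marked for escalation. Steps here are stand-ins.
def run_runbook(steps):
    """steps: list of (name, check_fn, remediate_fn). Returns per-step status."""
    results = {}
    for name, check, remediate in steps:
        if check():
            results[name] = "ok"
            continue
        remediate()
        results[name] = "healed" if check() else "escalate"
    return results

state = {"service_up": False}
steps = [
    ("disk_space", lambda: True, lambda: None),
    ("service_up", lambda: state["service_up"],
     lambda: state.update(service_up=True)),
]
status = run_runbook(steps)
```

The same structure works whether the remediation is a service restart, a disk cleanup, or a failover; the key design choice is that every automated step re-verifies before declaring success.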
Site Reliability Engineering - Landscape
SRE/DevOps Team Structure