Unified Operations Vision

By Steve Mushero
March 2019

Overview
● Where IT Operations is Today
● Integrated Monitoring
● Automated Operations
● AI-Ops

Where We Are Now - Diversity
● Lots of IT Systems
● Physical Servers, Disks, RAID
● Network Devices of Many Types
● SANs & Storage Systems

Where We Are Now - Diversity
● Private Clouds - VMware & OpenStack
● Public Clouds - Many
● Hybrid Clouds & Everything Else
● and more ...

Where We Are Now - Many Monitoring Systems
● Each Piece Monitored a Different Way
● Zabbix, Prometheus, Cacti
● Networks by SNMP, Cacti
● Commercial via BMC, etc.
● APM & Tracing, too
● Logs to ELK, or nowhere

Where We Are Now - Resource Monitoring
● 75% of Monitoring is Resources
○ CPU, RAM, Network
● 20% for Services
○ Performance
● 5% for URLs
● Very Little For:
○ Apps & Customer Experience
○ Internal Services & Golden Signals
○ Architecture, Topology
○ Configuration

Challenges I
● Alarm Overload & Fatigue
● Hard to Set Thresholds
● Hard to Know What’s Wrong
● Getting Worse
○ DevOps & Faster Releases
○ Dynamic Systems
○ Microservices
○ Clouds & Cloud Services
● More Players - Ops, Developers, DevOps, SRE

Challenges II
● Systems Not Connected
● Collection Methods Vary
● Metric Definitions Vary
● Alerts Vary
● Can’t Unify Anything
○ Metrics
○ Alerts
○ Incidents
○ Understanding

Goals - Single System
● Single Monitoring System
● Single Metric System
● Single Source of Truth
● Anomaly Detection
● Resource Metrics
● Golden Signals
● Logs & Events
● APM & Tracing

Goals - Rich Data & Information
● Add Context & Details
● Discovery - Metrics on All
○ Hosts & Nodes
○ Services
○ Connections
○ Dependencies
● CMDB
○ Deep Service Configuration
○ Security & Governance

How to Get There - Re-Thinking Monitoring
● Monitoring Strategy
● What to Monitor
● How to Monitor
● How to Alert
● How to Troubleshoot
● How to Manage Incidents

Monitoring Strategy
● Business Level - Key KPIs, Drivers of the Levels Below
● User Level - What the User Experiences
● App Level - Engineers, Managers, Users think in apps
● Service Level - The Real Work & Real Problems
● Resource Level - Underlying Everything
● Security - Important Everywhere

Monitoring Sources - Need Them All
● OS Metrics & Logs
● Service Metrics & Logs
● App Metrics & Logs
● Cloud Metrics & Logs
● APM Data & Tracing
● CMDB Configs
● Architecture & Dependencies
● Auto-Discovery of Everything

What to Monitor
● Focus on User & App Level (Results)
○ User & Browser via RUM (Real User Monitoring)
● Modern Golden Signals
● System Health (Health & Status Endpoints)
● Hard Errors (Only Alert if User Impact)
○ Disk full, service/server dead, etc.
● Background Useful Data (no alerts)

Golden Signals - Key to SRE
● Modern Golden Signals
○ USE - Resources (Utilization, Saturation, Errors)
○ RED - Results (Rate, Errors, Duration/Latency)
● Use Specialized Agents
● For Every Service
● At Every System Level
● Down to Disks, Networks, etc.
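
As a rough illustration of the RED signals above, the sketch below derives rate, error ratio, and a typical duration from one window of request records. The `Request` record and the `red_signals` helper are hypothetical names for this example, not part of any particular monitoring agent.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    """One observed request (hypothetical record shape)."""
    duration_ms: float
    ok: bool


def red_signals(requests, window_seconds):
    """Return Rate (req/s), Errors (ratio), and Duration (median ms) for one window."""
    total = len(requests)
    if total == 0:
        return {"rate": 0.0, "error_ratio": 0.0, "median_ms": None}
    errors = sum(1 for r in requests if not r.ok)
    return {
        "rate": total / window_seconds,
        "error_ratio": errors / total,
        "median_ms": median(r.duration_ms for r in requests),
    }


sample = [Request(120, True), Request(80, True), Request(900, False), Request(100, True)]
signals = red_signals(sample, window_seconds=60)
print(signals)
```

USE signals would come from resource counters instead (utilization, queue depth, error counts), which this sketch does not cover.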

Getting Data & Metrics
● Many Collection Methods
○ Agents, SNMP, SSH, Cloud APIs
○ Defined vs. Ad Hoc Metrics
● Focus on Golden Signals
○ Need Special Tools for RED
● Use Good Statistics
○ Medians, Percentiles, Sampling
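
To show why medians and percentiles matter, here is a minimal sketch with made-up latency data in which a slow tail distorts the mean. The `percentile` helper uses the nearest-rank method, which is one of several common definitions and is an assumption of this example.

```python
import math
from statistics import mean, median


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up data: 90 fast requests and 10 very slow ones.
latencies_ms = [20] * 90 + [2000] * 10

print("mean  ", mean(latencies_ms))            # pulled up by the slow tail
print("median", median(latencies_ms))          # the typical request
print("p95   ", percentile(latencies_ms, 95))  # what the slowest users see
```

The mean describes no real request here, the median hides the tail, and the 95th percentile exposes it, so the two are best read together.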

CMDB
● Key to Many Processes
● Service Discovery
● Security & Compliance
● Change Tracking
● Faster Troubleshooting
● Expert System Source
● Needs Specialized Collector

Architecture & Topology
● Use Special Agents to Discover
● Key to Understanding
● Drives System Diagram
● Drives Understanding
● Drives Dependencies
● Often Changing

Dependencies
● What Depends on What
● Critical for Microservices
● Key to Alert Impact Analyses
● Key to Alert Consolidation
● Key to AIOps Root Cause
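
A minimal sketch of alert impact analysis, assuming dependencies are available as a simple map from each service to what it depends on. The service names and the `impacted_by` helper are hypothetical; a real system would fill the map from discovery.

```python
from collections import deque

# Hypothetical dependency map: each service lists what it depends on.
DEPENDS_ON = {
    "web": ["api"],
    "api": ["mysql", "cache"],
    "reports": ["mysql"],
    "mysql": [],
    "cache": [],
}


def impacted_by(failed, depends_on):
    """Return every service that directly or indirectly depends on `failed`."""
    # Invert the map so each dependency points at its dependents.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)
    # Breadth-first walk outward from the failed service.
    seen, queue = set(), deque([failed])
    while queue:
        for parent in dependents.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


impact = impacted_by("mysql", DEPENDS_ON)
print(sorted(impact))
```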

Observability - For Developers & DevOps
● Add Metrics to Application Code
● Uses Structured Logs & Emitted Events
○ With Metrics, Latency, and Errors
● Lots of Tags
○ User, Customer, Browser, Product, OS, much more
● Canned & Ad Hoc Analytics & Exploration
● Should be Integrated into Single System
○ Correlate USE/RED Metrics with APM Data
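
One possible shape for an emitted structured event is sketched below: a single JSON line carrying latency, error, and free-form tags. The `emit_event` function and its field names are assumptions for illustration, not a specific logging library's API.

```python
import json
import time


def emit_event(name, duration_ms, error=None, **tags):
    """Emit one structured event as a JSON line and return that line."""
    event = {
        "event": name,
        "ts": time.time(),
        "duration_ms": duration_ms,
        "error": error,
        **tags,  # user, customer, browser, product, os, and so on
    }
    line = json.dumps(event, sort_keys=True)
    print(line)  # in practice this would go to the unified log platform
    return line


line = emit_event("checkout", 182.5, customer="acme", browser="firefox", os="linux")
```

Because every event carries the same fields plus tags, it can be sliced by any tag later without changing the application code.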

Visualization
● Dashboards
● Diagrams - Layered Arch, Dependencies
● Graphs - Summary, System, Service, Deeper
● Advanced - Heatmaps, Box Plots, Histograms
● Analytics - Cluster, Deviance, Cycles

How to Alert
● Two Types of Alerts
○ Alert to Wake Someone Up/Urgent
○ Alert as Information (FYI)
● Alert on User Impact
● All the Rest is Background Info
● Smart Alert Strategy
○ Thresholds where it makes sense
○ Anomalies are Key, but noisy, too

Anomaly Alerting
● Many Types
● Historical Checks
○ Univariate, Multi-variate, Neural Networks, Seasons
● Cluster Checks
○ Different from Peers
● Ratio Checks
○ Metrics don’t Match
○ e.g. Requests vs. Queries
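
The three check types above can be sketched very simply, as below. The function names, the 3-sigma rule, and the 50% tolerances are illustrative assumptions; production systems typically use more robust models, including seasonality.

```python
from statistics import mean, median, stdev


def historical_anomaly(history, current, max_sigma=3.0):
    """Historical check: is `current` far from this metric's own past?"""
    mu, sigma = mean(history), stdev(history)
    return sigma > 0 and abs(current - mu) > max_sigma * sigma


def peer_outliers(values_by_host, tolerance=0.5):
    """Cluster check: hosts whose value differs from the peer median by more than `tolerance`."""
    mid = median(values_by_host.values())
    return [h for h, v in values_by_host.items() if abs(v - mid) > tolerance * abs(mid)]


def ratio_anomaly(requests, queries, expected_ratio, tolerance=0.5):
    """Ratio check: do two related metrics still move together?"""
    if requests == 0:
        return queries > 0
    return abs(queries / requests - expected_ratio) > tolerance * expected_ratio


print(historical_anomaly([100, 102, 98, 101, 99], 130))
print(peer_outliers({"web1": 40, "web2": 42, "web3": 95}))
print(ratio_anomaly(requests=1000, queries=9000, expected_ratio=3.0))
```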

How to Manage Incidents
● Incidents are Real Issues (ITIL)
○ Something Broke
○ Combine Many Alerts
● Categorize & Document
● Troubleshoot, Fix, Document
● Communicate
● Review & Report (Post Mortem)

How to Troubleshoot
● Train People in Troubleshooting
● Defined Processes - Especially for Emergencies
● Use All the Data
○ Alerts & Incidents
○ Metrics
○ Logs & Events
○ Topology & Dependencies
● Root Cause Analyses Critical
● Runbooks & Team Communications

How to Manage Problems
● Problems are Recurring Incidents (ITIL)
● Key to Reducing False Alerts & Fatigue
● Key to Improving Alert Thresholds
● Key to Improving Systems
● Ideally Dedicated Team or Resources
● Needs Monitoring System Support

Getting To Integrated Monitoring

Getting to Integrated Monitoring
● Big Project
● Multiple Phases
● Lots of Details
● Involves Integration
● And New Strategies

Getting to Integrated Monitoring
● Upgrade & Integrate Monitoring
● Add Golden Signals, App/User Focus
● Add Discovery & Dependencies
● Add APM & Tracing
● Upgrade & Integrate Logging
● Add Observability in Code
● Upgrade Alerting, Incidents, Problems
● Train Teams

Upgrade & Integrate Monitoring
● Plan What to Monitor
● Plan How to Monitor
● Single Unified System
● Multi-Phase Process
● Temporary Integrations

Add Golden Signals, App/User Focus
● Set Up App & Service Structures
● Plan Golden Signals
○ Needs Special Agents & Techniques
○ Varies a Lot by Service
● Set Baselines & Anomalies

Add Discovery & Dependencies
● Driven by Monitoring System
● Get Data
● Build Diagrams
● Verify Dependencies

Add APM & Tracing
● Part of Monitoring
● Initial Setup & Test
● Tune for Transactions
● Extract User/RUM Metrics
● Set Alerts as Needed
● Train Developers on Tracing
● Correlate USE/RED Metrics with APM Data

Upgrade & Integrate Logging
● Logs are Key Part of Troubleshooting
○ OS Level - Linux & Windows
○ Services - Web, Java/Tomcat, MySQL, etc.
● Send All to Unified Platform
● Get Metrics & Analytics
● Add Alerting on Logs - Errors & Metrics

Add Observability in Code
● Structured Logs are Best - Events
● Emitted by App Code with Tags & Dimensions
● Include Metrics - Ideally Latency & Errors
● Build Analyses Dashboards

Upgrade Alerting, Incidents, Problems
● Move to ITIL Naming & Processes
● Build Procedures
● Dedicated Problem Team

Training Teams
● General Training
● ITIL & ITOP Training
● Golden Signals Thinking
● APM, RUM, Tracing Usage

Automated Operations - Goals
● Automate Things
● Build Things Faster
● Change Things Faster
● Fix Things Faster
● Reduce Manual Mistakes
● Improve Consistency
● Support Large Scale Systems

Key Components
● Clouds & Dynamic Systems
● Infrastructure as Code
● Config Management Systems
○ Ansible, Puppet, Chef, SaltStack
● Automated Troubleshooting
● Auto-Healing
● Auto-Governance

Clouds & Dynamic Systems
● Cloud APIs Support Automation
● Clouds Can Change Themselves
○ Auto-Scaling
● CI/CD Systems Can Change Them
● Core of Lots of Automated Processes

Infrastructure as Code
● Define Infrastructure Programmatically
● CloudFormation, Terraform, etc.
● Build & Change Continually
● Continually Updated
● Versioned & Can Diff
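
The core idea of defining infrastructure as data and diffing it can be sketched as below: compare a desired definition against the actual state and list the changes needed. The resource names and the `plan` function are hypothetical and far simpler than what CloudFormation or Terraform do.

```python
def plan(desired, actual):
    """Return the (action, resource) changes needed to move `actual` to `desired`."""
    changes = []
    for name in sorted(desired.keys() | actual.keys()):
        if name not in actual:
            changes.append(("create", name))
        elif name not in desired:
            changes.append(("delete", name))
        elif desired[name] != actual[name]:
            changes.append(("update", name))
    return changes


# Hypothetical definitions: desired comes from version control, actual from the cloud API.
desired = {"web": {"size": "small", "count": 4}, "db": {"size": "large", "count": 1}}
actual = {"web": {"size": "small", "count": 2}, "old-batch": {"size": "small", "count": 1}}

changes = plan(desired, actual)
print(changes)
```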

Config Management Systems
● Support Auto Deployments
● Leverage Infra-as-Code
● Ansible, Puppet, Chef, SaltStack
● Build Large, Reproducible Systems

Automated Troubleshooting
● Built on Data & Rule Systems
● Auto-Gather More Details
● Help Find Root Causes
● Use Automation System
● Advanced Use Needs AI-Ops

Auto-Healing
● Automatically Fix Things
● Driven by Rule Engines
● Fix Things Faster
● Rapid Response 7x24
● Use Automation System
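
A rule-driven healer can be sketched as a list of (condition, action) pairs, as below. The alert fields and the action functions are hypothetical, and the actions only return a description of what they would do; a real setup would call the automation system instead.

```python
def restart_service(alert):
    """Hypothetical action: describe a service restart."""
    return f"restart {alert['service']} on {alert['host']}"


def rotate_logs(alert):
    """Hypothetical action: describe a log rotation to free disk space."""
    return f"rotate logs on {alert['host']}"


# Each rule pairs a condition on the alert with the action to run.
RULES = [
    (lambda a: a["check"] == "service_dead", restart_service),
    (lambda a: a["check"] == "disk_full", rotate_logs),
]


def heal(alert, rules=RULES):
    """Run the first matching rule's action; return None if no rule applies."""
    for matches, action in rules:
        if matches(alert):
            return action(alert)
    return None


result = heal({"check": "service_dead", "host": "web1", "service": "nginx"})
print(result)
```

Alerts that match no rule return `None`, which is the natural place to hand off to a person.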

Auto-Governance
● Continual Compliance
● Guardrails to Prevent Risks
● Systems Auto-Correct
○ Remove Bad Security
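
As one example of a guardrail that auto-corrects, the sketch below drops firewall rules that expose non-approved ports to the whole internet. The rule format and the allowed-port list are assumptions for illustration only.

```python
# Assumed policy: only these ports may be open to everyone.
ALLOWED_PUBLIC_PORTS = {80, 443}


def enforce(rules):
    """Split firewall rules into (kept, removed) according to the policy above."""
    kept, removed = [], []
    for rule in rules:
        if rule["cidr"] == "0.0.0.0/0" and rule["port"] not in ALLOWED_PUBLIC_PORTS:
            removed.append(rule)
        else:
            kept.append(rule)
    return kept, removed


firewall = [
    {"port": 443, "cidr": "0.0.0.0/0"},
    {"port": 22, "cidr": "0.0.0.0/0"},   # SSH open to the world: violates policy
    {"port": 22, "cidr": "10.0.0.0/8"},  # SSH from the internal network: fine
]
kept, removed = enforce(firewall)
print("removed:", removed)
```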

AI-Ops - What is it?
● Analytical IT Operations
● Artificial Intelligence (AI) Operations
● Machine Learning for Operations

Really, What is It?
● Synthesis of Many Sources of Information
● Better Understanding of Problems & Situation
● Predictions for the Future
● Usually with Machine Learning / Big Data

Goals
● Impact Analysis
● Alert Reduction
● Alert Consolidation
● Root Cause Analysis
● Auto-Healing
● Prediction

Method - Combine Everything
● Alerts
● Events
● Metrics
● History
● Topology
● Dependencies

Many Vendors
● Most Focus on Alert Consolidation
● Some on Root Alert/Issue
● Few on Real Root Cause
● Need Deep Data for RCA

Alert Reduction & Consolidation
● Merge Related & Duplicate Alerts
○ By Time, Dependencies, History, etc.
● Helps Ops Teams Focus
● Avoids Missing Key Things in Noise
● First Phase of Any AI-Ops
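
Merging duplicate alerts by time can be sketched as below: alerts from the same source and check are grouped while they keep arriving within a window. The `Alert` record and the 300-second window are assumptions; merging by dependencies or history would build on the same grouping idea.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """One raw alert (hypothetical record shape)."""
    ts: float
    source: str
    check: str


def consolidate(alerts, window_seconds=300):
    """Group alerts with the same source and check that arrive within the window of each other."""
    groups = []
    latest_group = {}  # (source, check) -> most recent group for that key
    for alert in sorted(alerts, key=lambda a: a.ts):
        key = (alert.source, alert.check)
        group = latest_group.get(key)
        if group and alert.ts - group[-1].ts <= window_seconds:
            group.append(alert)
        else:
            group = [alert]
            latest_group[key] = group
            groups.append(group)
    return groups


raw = [
    Alert(0, "db1", "disk"),
    Alert(60, "db1", "disk"),
    Alert(90, "web1", "cpu"),
    Alert(1000, "db1", "disk"),
]
groups = consolidate(raw)
print(len(raw), "alerts ->", len(groups), "groups")
```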

Root Cause Analysis
● Multi-Mode & Method
● Expert System Uses Everything
● Use Dependencies from Discovery
● Use History with Feedback
● Sort & Prioritize
● Enrich with Additional Data Collection
○ Including Automated Actions
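
One simple way to use dependencies for root cause is sketched below: among the services currently alerting, keep only those with no alerting dependency beneath them. The dependency map and function names are hypothetical, and this covers only the topology step, not history, feedback, or enrichment.

```python
# Hypothetical dependency map: each service lists what it depends on.
DEPENDS_ON = {
    "web": ["api"],
    "api": ["mysql", "cache"],
    "reports": ["mysql"],
    "mysql": [],
    "cache": [],
}


def has_alerting_dependency(service, alerting, depends_on):
    """True if anything `service` depends on, directly or indirectly, is alerting."""
    stack, seen = list(depends_on.get(service, [])), set()
    while stack:
        dep = stack.pop()
        if dep in seen:
            continue
        seen.add(dep)
        if dep in alerting:
            return True
        stack.extend(depends_on.get(dep, []))
    return False


def root_candidates(alerting, depends_on):
    """Alerting services whose own dependencies are all quiet."""
    alerting = set(alerting)
    return sorted(
        s for s in alerting if not has_alerting_dependency(s, alerting, depends_on)
    )


candidates = root_candidates(["web", "api", "mysql"], DEPENDS_ON)
print(candidates)
```

The result is a short list of candidates to sort and prioritize, not a confirmed root cause.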

Manual-Healing
● One-Click Fixing
● Procedure/Runbooks
● For high-risk, complex solutions
● Automatic Real-Time
● Uses Automation Platform

Auto-Healing
● Automatic Real-Time Fixing
● Uses Automation Platform
● Helps 7x24, Reducing On-Call
● Responds in Seconds, not Hours

Prediction
● See Problems in Advance
● Solve Problems in Advance
● Capacity Planning - Resources
● Elevated Errors & Pending Failures
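
For capacity planning, a basic prediction is to fit a straight line to recent usage and estimate when it reaches a limit, as sketched below. The `days_until_full` helper and the sample numbers are illustrative, and a linear trend is an assumption that will not hold for seasonal or bursty workloads.

```python
def days_until_full(daily_usage_pct, limit_pct=100.0):
    """Estimate days from the last sample until usage reaches the limit (least-squares line)."""
    n = len(daily_usage_pct)
    if n < 2:
        return None  # not enough history to fit a trend
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(daily_usage_pct) / n
    slope = sum(
        (x - x_mean) * (y - y_mean) for x, y in zip(xs, daily_usage_pct)
    ) / sum((x - x_mean) ** 2 for x in xs)
    if slope <= 0:
        return None  # flat or shrinking usage never reaches the limit
    intercept = y_mean - slope * x_mean
    return (limit_pct - intercept) / slope - (n - 1)


disk_pct = [50, 52, 54, 56, 58]  # made-up daily disk usage, growing 2 points per day
remaining = days_until_full(disk_pct)
print(remaining)
```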

Summary
● IT Operations is Manual & Messy
● Monitoring is Diverse & Distributed
● Automation Helps Take Action
● AI-Ops Helps Fix Things Faster
