Unified Operations Vision
By Steve Mushero
March, 2019
Overview
● Where IT Operations is Today
● Integrated Monitoring
● Automated Operations
● AI-Ops
Where We Are Now - Diversity
● Lots of IT Systems
● Physical Servers, Disks, RAID
● Network Devices of Many Types
● SANs & Storage Systems
Where We Are Now - Diversity
● Private Clouds - VMware & OpenStack
● Public Clouds - Many
● Hybrid Clouds & Everything Else
● and more ...
Where We Are Now - Many Monitoring Systems
● Each Piece Monitored a Different Way
● Zabbix, Prometheus, Cacti
● Networks by SNMP, Cacti
● Commercial via BMC, etc.
● APM & Tracing, too
● Logs to ELK, or nowhere
So Many Tools ...
Where We Are Now - Resource Monitoring
● 75% of Monitoring is Resources
○ CPU, RAM, Network
● 20% for Services
○ Performance
● 5% for URLs
● Very Little For:
○ Apps & Customer Experience
○ Internal Services & Golden Signals
○ Architecture, Topology
○ Configuration
Challenges I
● Alarm Overload & Fatigue
● Hard to Set Thresholds
● Hard to Know What’s Wrong
● Getting Worse
○ DevOps & Faster Releases
○ Dynamic Systems
○ Microservices
○ Clouds & Cloud Services
● More Players - Ops, Developers, DevOps, SRE
Challenges II
● Systems Not Connected
● Collection Methods Vary
● Metric Definitions Vary
● Alerts Vary
● Can’t Unify Anything
○ Metrics
○ Alerts
○ Incidents
○ Understanding
Integrated Monitoring
Goals - Single System
● Single Monitoring System
● Single Metric System
● Single Source of Truth
● Anomaly Detection
● Resource Metrics
● Golden Signals
● Logs & Events
● APM & Tracing
Goals - Rich Data & Information
● Add Context & Details
● Discovery - Metrics on All
○ Hosts & Nodes
○ Services
○ Connections
○ Dependencies
● CMDB
○ Deep Service Configuration
○ Security & Governance
How to Get There - Re-Thinking Monitoring
● Monitoring Strategy
● What to Monitor
● How to Monitor
● How to Alert
● How to Troubleshoot
● How to Manage Incidents
What to Monitor
Monitoring Strategy
● Business Level - Key KPIs, drivers of below items
● User Level - What the User Experiences
● App Level - Engineers, Managers, Users think in apps
● Service Level - The Real Work & Real Problems
● Resource Level - Underlying Everything
● Security - Important Everywhere
Monitoring Sources - Need Them All
● OS Metrics & Logs
● Service Metrics & Logs
● App Metrics & Logs
● Cloud Metrics & Logs
● APM Data & Tracing
● CMDB Configs
● Architecture & Dependencies
● Auto-Discovery of Everything
What to Monitor
● Focus on User & App Level (Results)
○ User & Browser via RUM (Real User Monitoring)
● Modern Golden Signals
● System Health (Health & Status Endpoints)
● Hard Errors (Only Alert if User Impact)
○ Disk full, service/server dead, etc.
● Background Useful Data (no alerts)
Golden Signals - Key to SRE
● Modern Golden Signals
○ USE - Resources (Utilization, Saturation, Errors)
○ RED - Results (Rate, Errors, Duration/Latency)
● Use Specialized Agents
● For Every Service
● At Every System Level
● Down to Disks, Networks, etc.
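To make the RED signals concrete, here is a minimal sketch that derives Rate, Errors, and Duration for one service from raw request records. The Request type, the red_signals function, and the 60-second window are hypothetical names and values chosen for illustration; a real agent would read these from service logs or instrumentation.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    """One observed request (hypothetical record shape)."""
    duration_ms: float
    ok: bool


def red_signals(requests, window_seconds):
    """Return Rate, Errors, and Duration for one service over one window."""
    count = len(requests)
    errors = sum(1 for r in requests if not r.ok)
    durations = [r.duration_ms for r in requests]
    return {
        "rate_per_s": count / window_seconds,
        "error_ratio": errors / count if count else 0.0,
        "median_duration_ms": median(durations) if durations else 0.0,
    }


sample = [
    Request(12.0, True),
    Request(15.0, True),
    Request(240.0, False),
    Request(18.0, True),
]
signals = red_signals(sample, window_seconds=60)
print(signals)
```

The same three numbers can be produced per service and per level, which is what makes them comparable across very different systems.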
Getting Data & Metrics
● Many Collection Methods
○ Agents, SNMP, SSH, Cloud APIs
○ Defined vs. Ad Hoc Metrics
● Focus on Golden Signals
○ Need Special Tools for RED
● Use Good Statistics
○ Medians, Percentiles, Sampling
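A short sketch of why medians and percentiles matter: with a few slow outliers, the mean describes almost no real request. The sample data and the latency_summary helper are made up for illustration; the percentile cut points come from Python's standard statistics.quantiles.

```python
from statistics import mean, median, quantiles


def latency_summary(samples_ms):
    """Summarize latencies with mean, median, and high percentiles."""
    cuts = quantiles(samples_ms, n=100)  # 99 cut points: p1 .. p99
    return {
        "mean": mean(samples_ms),
        "median": median(samples_ms),
        "p95": cuts[94],
        "p99": cuts[98],
    }


# Illustrative data: 95 fast requests and 5 very slow ones.
latencies_ms = [10.0] * 95 + [2000.0] * 5
summary = latency_summary(latencies_ms)
print(summary)
```

Here the median stays at the typical request, the high percentiles expose the slow tail, and the mean lands on a value that no request actually experienced.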
CMDB
● Key to Many Processes
● Service Discovery
● Security & Compliance
● Change Tracking
● Faster Troubleshooting
● Expert System Source
● Needs Specialized Collector
Architecture & Topology
● Use Special Agents to Discover
● Key to Understanding
● Drives System Diagram
● Drives Understanding
● Drives Dependencies
● Often Changing
Dependencies
● What Depends on What
● Critical for Microservices
● Key to Alert Impact Analyses
● Key to Alert Consolidation
● Key to AIOps Root Cause
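As a sketch of alert impact analysis, the dependency map can be walked in reverse to find everything affected when one service fails. The service names and the DEPENDS_ON map below are hypothetical; in practice the map would come from discovery rather than being written by hand.

```python
from collections import deque

# Hypothetical map: service -> services it depends on.
DEPENDS_ON = {
    "web": ["api"],
    "api": ["mysql", "cache"],
    "reports": ["mysql"],
    "mysql": [],
    "cache": [],
}


def impacted_by(failed, depends_on):
    """Return every service that directly or indirectly depends on `failed`."""
    # Invert the map: service -> services that depend on it.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first walk up the dependency chain.
    seen, queue = set(), deque([failed])
    while queue:
        node = queue.popleft()
        for parent in dependents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(impacted_by("mysql", DEPENDS_ON)))
```

The same traversal supports consolidation: alerts on the impacted services can be attached to the alert on the failed dependency instead of paging separately.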
Observability - For Developers & DevOps
● Add Metrics to Application Code
● Uses Structured Logs & Emitted Events
○ With Metrics, Latency, and Errors
● Lots of Tags
○ User, Customer, Browser, Product, OS, much more
● Canned & Ad Hoc Analytics & Exploration
● Should be Integrated into Single System
○ Correlate USE/RED Metrics with APM Data
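A minimal sketch of a structured event emitted from application code, carrying latency, an error field, and free-form tags. The function names (make_event, emit, timed), the field names, and the tag values are assumptions for illustration, not a specific vendor's format; the sink defaults to print but could be any log shipper.

```python
import json
import time


def make_event(name, duration_ms, error=None, **tags):
    """Build one structured event with latency, error, and tags."""
    return {
        "event": name,
        "ts": time.time(),
        "duration_ms": round(duration_ms, 3),
        "error": error,
        "tags": tags,
    }


def emit(event, sink=print):
    """Serialize the event as one JSON line and hand it to the sink."""
    sink(json.dumps(event, sort_keys=True))


def timed(name, func, sink=print, **tags):
    """Run func, then emit an event with its latency and any error type."""
    start = time.perf_counter()
    error = None
    try:
        return func()
    except Exception as exc:
        error = type(exc).__name__
        raise
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        emit(make_event(name, elapsed_ms, error, **tags), sink)


lines = []
result = timed("checkout", lambda: 42, sink=lines.append,
               customer="acme", browser="firefox")
print(lines[0])
```

Because every event carries the same tags, ad hoc questions such as "which customers see the slowest checkouts" become simple group-by queries later.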
Visualization
● Dashboards
● Diagrams - Layered Arch, Dependencies
● Graphs - Summary, System, Service, Deeper
● Advanced - Heatmaps, Box Plots, Histograms
● Analytics - Cluster, Deviance, Cycles
How to Alert
● Two Types of Alerts
○ Alert to Wake Someone Up/Urgent
○ Alert as Information (FYI)
● Alert on User Impact
● All the Rest is Background Info
● Smart Alert Strategy
○ Thresholds where it makes sense
○ Anomalies are Key, but noisy, too
Anomaly Alerting
● Many Types
● Historical Checks
○ Univariate, Multi-variate, Neural Networks, Seasons
● Cluster Checks
○ Different from Peers
● Ratio Checks
○ Metrics don’t Match
○ e.g. Requests vs. Queries
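The three check types above can be sketched in a few lines each. This is a simplified illustration: the historical check is a plain univariate z-score (no seasons or neural networks), and the function names, limits, and tolerances are assumptions, not recommended values.

```python
from statistics import mean, median, stdev


def historical_anomaly(history, current, z_limit=3.0):
    """Historical check: is the current value far from its own past?"""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > z_limit


def peer_anomalies(values_by_host, tolerance=0.5):
    """Cluster check: hosts that differ from the peer median by more
    than `tolerance`, expressed as a fraction of the median."""
    mid = median(values_by_host.values())
    return [
        host for host, value in values_by_host.items()
        if mid and abs(value - mid) / mid > tolerance
    ]


def ratio_anomaly(requests, queries, expected_ratio, tolerance=0.5):
    """Ratio check: queries per request should stay near the expected ratio."""
    if requests == 0:
        return queries > 0
    actual = queries / requests
    return abs(actual - expected_ratio) / expected_ratio > tolerance


print(historical_anomaly([100, 102, 98, 101, 99], 150))
print(peer_anomalies({"web1": 40, "web2": 42, "web3": 95}))
print(ratio_anomaly(requests=1000, queries=9000, expected_ratio=3.0))
```

Each check answers a different question, so a value can pass one and fail another; combining them is one way to keep anomaly alerts from becoming noise.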
How to Manage Incidents
● Incidents are Real Issues (ITIL)
○ Something Broke
○ Combine Many Alerts
● Categorize & Document
● Troubleshoot, Fix, Document
● Communicate
● Review & Report (Post Mortem)
How to Troubleshoot
● Train People in Troubleshooting
● Defined Processes - Especially for Emergencies
● Use All the Data
○ Alerts & Incidents
○ Metrics
○ Logs & Events
○ Topology & Dependencies
● Root Cause Analyses Critical
● Runbooks & Team Communications
How to Manage Problems
● Problems are Recurring Incidents (ITIL)
● Key to Reducing False Alerts & Fatigue
● Key to Improving Alert Thresholds
● Key to Improving Systems
● Ideally Dedicated Team or Resources
● Needs Monitoring System Support
Getting To Integrated Monitoring
Getting to Integrated Monitoring
● Big Project
● Multiple Phases
● Lots of Details
● Involves Integration
● And New Strategies
Getting to Integrated Monitoring
● Upgrade & Integrate Monitoring
● Add Golden Signals, App/User Focus
● Add Discovery & Dependencies
● Add APM & Tracing
● Upgrade & Integrate Logging
● Add Observability in Code
● Upgraded Alerting, Incident, Problems
● Train Teams
Upgrade & Integrate Monitoring
● Plan What to Monitor
● Plan How to Monitor
● Single Unified System
● Multi-Phase Process
● Temporary Integrations
Add Golden Signals, App/User Focus
● Setup App & Service Structures
● Plan Golden Signals
○ Needs Special Agents & Techniques
○ Varies a Lot by Service
● Set Baselines & Anomalies
Add Discovery & Dependencies
● Driven by Monitoring System
● Get Data
● Build Diagrams
● Verify Dependencies
Add APM & Tracing
● Part of Monitoring
● Initial Setup & Test
● Tune for Transactions
● Extract User/RUM Metrics
● Set Alerts as Needed
● Train Developers on Tracing
● Correlate USE/RED Metrics with APM Data
Tracing Example
Upgrade & Integrate Logging
● Logs are Key Part of Troubleshooting
○ OS Level - Linux & Windows
○ Services - Web, Java/Tomcat, MySQL, etc.
● Send All to Unified Platform
● Get Metrics & Analytics
● Add Alerting on Logs - Errors & Metrics
Add Observability in Code
● Structured Logs are Best - Events
● Emitted by App Code with Tags & Dimensions
● Include Metrics - Ideally Latency & Errors
● Build Analyses Dashboards
Upgraded Alerting, Incident, Problems
● Move to ITIL Naming & Processes
● Build Procedures
● Dedicated Problem Team
Training Teams
● General Training
● ITIL & ITOP Training
● Golden Signals Thinking
● APM, RUM, Tracing Usage
Automated Operations
Automated Operations - Goals
● Automate Things
● Build Things Faster
● Change Things Faster
● Fix Things Faster
● Reduce Manual Mistakes
● Improve Consistency
● Support Large Scale Systems
Key Components
● Clouds & Dynamic Systems
● Infrastructure as Code
● Config Management Systems
○ Ansible, Puppet, Chef, SaltStack
● Automated Troubleshooting
● Auto-Healing
● Auto-Governance
Clouds & Dynamic Systems
● Cloud APIs Support Automation
● Clouds Can Change Themselves
○ Auto-Scaling
● CI/CD Systems Can Change Them
● Core of Lots of Automated Processes
Infrastructure as Code
● Define Infrastructure Programmatically
● CloudFormation, Terraform, etc.
● Build & Change Continually
● Continually Updated
● Versioned & Can Diff
Config Management Systems
● Support Auto Deployments
● Leverage Infra-as-Code
● Ansible, Puppet, Chef, SaltStack
● Build Large, Reproducible Systems
Automated Troubleshooting
● Built on Data & Rule Systems
● Auto-Gather More Details
● Help Find Root Causes
● Use Automation System
● Advanced Use Needs AI-Ops
Auto-Healing
● Automatically Fix Things
● Driven by Rule Engines
● Fix Things Faster
● Rapid Response 7x24
● Use Automation System
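A rule engine for auto-healing can be as simple as an ordered list of conditions mapped to actions, with escalation as the fallback. The alert fields, the action names, and the retry limit of three are hypothetical; the chosen action would be handed to the automation system rather than run here.

```python
# Each rule: (condition on the alert, name of the action to run).
# Alert fields and action names are illustrative assumptions.
RULES = [
    (lambda a: a["type"] == "disk_full", "clean_tmp_files"),
    (lambda a: a["type"] == "service_down" and a.get("restarts", 0) < 3,
     "restart_service"),
]


def choose_action(alert, rules=RULES):
    """Return the first matching action, or escalate if no rule applies."""
    for condition, action in rules:
        if condition(alert):
            return action
    return "escalate_to_human"


print(choose_action({"type": "disk_full"}))
print(choose_action({"type": "service_down", "restarts": 5}))
```

The restart limit shows a guardrail: once automatic fixes have been tried a few times, the rule stops matching and a person is brought in.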
Auto-Governance
● Continual Compliance
● Guardrails to Prevent Risks
● Systems Auto-Correct
○ Remove Bad Security
AI-Ops
AI-Ops - What is it?
● Analytical IT Operations
● Artificial Intelligence (AI) Operations
● Machine Learning for Operations
Really, What is It?
● Synthesis of Many Sources of Information
● Better Understanding of Problems & Situation
● Predictions for the Future
● Usually with Machine Learning / Big Data
From Gartner
Goals
● Impact Analysis
● Alert Reduction
● Alert Consolidation
● Root Cause Analysis
● Auto-Healing
● Prediction
Method - Combine Everything
● Alerts
● Events
● Metrics
● History
● Topology
● Dependencies
Many Vendors
● Most Focus on Alert Consolidation
● Some on Root Alert/Issue
● Few on Real Root Cause
● Need Deep Data for RCA
Alert Reduction & Consolidation
● Merge Related & Duplicate Alerts
○ By Time, Dependencies, History, etc.
● Helps Ops Teams Focus
● Avoids Missing Key Things in Noise
● First Phase of any AI-Ops
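A minimal sketch of consolidation by time only: alerts for the same service that arrive within a window of each other are merged into one group. The Alert fields, the consolidate function, and the 300-second window are assumptions for illustration; merging by dependencies and history would build on the same grouping idea.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """One raw alert (hypothetical shape)."""
    ts: float       # seconds since some epoch
    service: str
    message: str


def consolidate(alerts, window_seconds=300):
    """Group alerts per service when each arrives within the window
    of the previous alert in that service's current group."""
    groups = []
    latest_group = {}  # service -> its most recent group
    for alert in sorted(alerts, key=lambda a: a.ts):
        group = latest_group.get(alert.service)
        if group and alert.ts - group[-1].ts <= window_seconds:
            group.append(alert)
        else:
            group = [alert]
            groups.append(group)
            latest_group[alert.service] = group
    return groups


raw = [
    Alert(0, "mysql", "slow queries"),
    Alert(60, "mysql", "slow queries"),
    Alert(90, "web", "5xx errors"),
    Alert(1000, "mysql", "slow queries"),
]
groups = consolidate(raw)
print(len(raw), "alerts ->", len(groups), "groups")
```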
Root Cause Analysis
● Multi-Mode & Method
● Expert System Uses Everything
● Use Dependencies from Discovery
● Use History with Feedback
● Sort & Prioritize
● Enrich with Additional Data Collection
○ Including Automated Actions
Manual-Healing
● One-Click Fixing
● Procedure/Runbooks
● For high-risk, complex solutions
● Automatic Real-Time
● Uses Automation Platform
Auto-Healing
● Automatic Real-Time Fixing
● Uses Automation Platform
● Helps 7x24, Reducing On-Call
● Responds in Seconds, not Hours
Prediction
● See Problems in Advance
● Solve Problems in Advance
● Capacity Planning - Resources
● Elevated Errors & Pending Failures
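For capacity planning, a simple sketch is to fit a straight line to recent usage and estimate when it reaches the limit. This assumes roughly linear growth and one sample per day, and it measures the remaining headroom from the last observed value; the function name and the sample numbers are illustrative. Real systems usually need seasonality and confidence ranges as well.

```python
def days_until_full(daily_usage_pct, capacity_pct=100.0):
    """Estimate days until capacity from a least-squares linear trend.

    Returns None when there is too little data or usage is not growing.
    """
    n = len(daily_usage_pct)
    if n < 2:
        return None
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_usage_pct) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(xs, daily_usage_pct))
    slope = sxy / sxx  # growth in percentage points per day
    if slope <= 0:
        return None
    return (capacity_pct - daily_usage_pct[-1]) / slope


disk_usage = [50.0, 52.0, 54.0, 56.0, 58.0]
print(days_until_full(disk_usage))
```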
Summary
● IT Operations is Manual & Messy
● Monitoring is Diverse & Distributed
● Automation helps take Action
● AI-Ops helps Fix Stuff Faster
Thank You
www.Siglos.io
