Monitoring
Far Beyond the
Operating System
WeOp 2014
Marcus Vechiato - @vechiato
http://weop.com.br
Agenda
⦿ Goal
⦿ How do we envision a monitoring system?
⦿ From simple to complex
⦿ What to monitor?
⦿ What to track?
⦿ Locaweb numbers
⦿ Where some get lost
⦿ Configuration automation
⦿ ITIL and ITSM Tools
⦿ Automatic Incident Creation
⦿ Tools already being used
⦿ Challenges
Goal
The objective of this presentation is to explore monitoring
implementations without focusing on tools.
Best practices highlighting what worked well and the
lessons learned from mistakes made over the years.
How do we envision a monitoring system?
⦿ It's not just a tool.
⦿ The monitoring tool is one of the components of the
process.
⦿ Process: if it is not effective, it only adds bureaucracy.
Locaweb numbers
⦿ Network
⚫ Brocade / Cisco / Force10 and others
⦿ ~21k servers (physical and virtual)
⚫ Windows (2003/2008/2012)
⚫ Linux (CentOS/Red Hat/Debian)
⚫ Oracle/MySQL/PostgreSQL/MSSQL/MongoDB
⚫ VMware/Xen
⦿ ~500 thousand items/services monitored every minute
⦿ ~17 thousand incidents handled per month
From simple to complex
⦿ Have a clear understanding of your biggest challenges to define your
objectives.
⦿ Do not hold out for a perfect system that covers every gap; it does not
exist.
⦿ Remember to account for your available resources and the real skills of the
team.
⦿ Prefer a gradual implementation with well-defined deliverables.
What to monitor?
⦿ Core Services and Infrastructure - network/uninterruptible power
supply/temperature/DNS
⦿ Operating System (memory/CPU/local network/disk) where applicable
⦿ Applications
⚫ User perspective (HTTP/TCP requests)
⚫ Local (memory usage/threads/processes/etc.)
⦿ Business Indicators/errors
⚫ Example: Sales per hour
⚫ Example: Authentication failures per minute
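A business indicator such as authentication failures per minute can be derived from application events. The following is a minimal sketch, assuming events arrive as (ISO timestamp, outcome) pairs; the event format, the "auth_failure" label, and the threshold are hypothetical and not taken from the presentation.

```python
from collections import Counter
from datetime import datetime


def failures_per_minute(events):
    """Count authentication failures, bucketed by minute.

    `events` is an iterable of (iso_timestamp, outcome) pairs.
    This input format is an assumption made for illustration.
    """
    counts = Counter()
    for timestamp, outcome in events:
        if outcome == "auth_failure":
            minute = datetime.fromisoformat(timestamp).replace(second=0, microsecond=0)
            counts[minute] += 1
    return counts


def minutes_over_threshold(counts, threshold):
    """Return the minutes whose failure count exceeds the threshold."""
    return sorted(minute for minute, n in counts.items() if n > threshold)


events = [
    ("2014-05-01T10:00:05", "auth_failure"),
    ("2014-05-01T10:00:20", "auth_failure"),
    ("2014-05-01T10:00:41", "auth_failure"),
    ("2014-05-01T10:00:50", "auth_ok"),
    ("2014-05-01T10:01:10", "auth_failure"),
]
counts = failures_per_minute(events)
noisy = minutes_over_threshold(counts, threshold=2)
```

Here only the 10:00 minute exceeds the hypothetical threshold of 2 failures, so it is the only minute that would raise an alarm.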
What to track?
Translate infrastructure indicators into views organized by product, component, and team
⦿ Dashboards for different audiences
⚫ Operations
○ KPI view by teams/infrastructure
⚫ Ex.: MTTR of N1 incidents by priority
⚫ Ex.: SLA and MTTR of storage abc
⚫ Products/Business
○ Common and specific indicator view
⚫ Ex.: SLA of product xyz 99.89%
⚫ Ex.: MTTR of product xyz 0h45m
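Both indicators in these examples can be computed from incident open and resolve times. The sketch below assumes each incident is an (opened, resolved) pair, that incidents do not overlap, and that the SLA figure is measured as the share of the period without an open incident; these are illustrative assumptions, not the method used at Locaweb.

```python
from datetime import datetime, timedelta


def mttr(incidents):
    """Mean time to restore: average of (resolved - opened)."""
    durations = [resolved - opened for opened, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


def availability_pct(incidents, period):
    """Percentage of `period` not covered by incidents (assumes no overlap)."""
    downtime = sum((resolved - opened for opened, resolved in incidents), timedelta())
    return 100.0 * (1 - downtime / period)


def format_hm(delta):
    """Render a timedelta in the 0h45m style used on the slide."""
    minutes = int(delta.total_seconds() // 60)
    return f"{minutes // 60}h{minutes % 60:02d}m"


incidents = [
    (datetime(2014, 5, 1, 10, 0), datetime(2014, 5, 1, 10, 30)),
    (datetime(2014, 5, 9, 22, 0), datetime(2014, 5, 9, 23, 0)),
]
print(format_hm(mttr(incidents)))                                  # 0h45m
print(round(availability_pct(incidents, timedelta(days=30)), 2))   # 99.79
```

Grouping the incident list by team, storage system, or product before calling these functions gives the per-audience views described above.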
Where some get lost
⦿ It is a misdiagnosis to conclude: "the xyz tool doesn't work, we need a new one."
⦿ Monitoring probe intervals are too short.
⦿ Retries are important to reduce false positives.
⦿ From my experience:
⚫ Standard probe intervals range from 1 to 5 minutes
⚫ Retries:
○ 5 minutes during deployment/with known instabilities.
○ 3 minutes in stable environments.
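One common way to apply retries is to raise an alert only after several consecutive failed probes. The sketch below assumes that interpretation; the slide states its retry figures in minutes, so mapping them to a number of attempts (the hypothetical `max_attempts` parameter) is an assumption.

```python
class RetryingCheck:
    """Alert only after `max_attempts` consecutive failed probes."""

    def __init__(self, max_attempts):
        self.max_attempts = max_attempts
        self.failures = 0
        self.alerting = False

    def record(self, ok):
        """Feed one probe result; return True only when an alert should fire."""
        if ok:
            # Any success resets the counter, so short blips never alert.
            self.failures = 0
            self.alerting = False
            return False
        self.failures += 1
        if self.failures >= self.max_attempts and not self.alerting:
            self.alerting = True
            return True
        return False


check = RetryingCheck(max_attempts=3)
results = [check.record(ok) for ok in (False, True, False, False, False, False)]
```

In this run the isolated first failure is absorbed, the alert fires on the third consecutive failure, and it is not repeated while the check stays down.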
Configuration Automation
⦿ Monitoring is the best place to start managing component installation and
configurations.
⚫ Start with the monitoring agent (if available).
⚫ Monitoring server
○ Via API where possible
○ Configuration files
⦿ Which tool to use for automation?
⚫ It depends on your environment and the team's knowledge. Chef and
Puppet are good options to start with.
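When the monitoring server is configured through files, those files can be generated from an inventory instead of being edited by hand. The sketch below renders Nagios-style `define host` objects; the inventory structure, host names, addresses, and the `generic-host` template name are assumptions for illustration, and a tool such as Chef or Puppet would normally drive this step.

```python
def render_host(name, address, template="generic-host"):
    """Render one Nagios-style host object definition."""
    return (
        "define host {\n"
        f"    use        {template}\n"
        f"    host_name  {name}\n"
        f"    address    {address}\n"
        "}\n"
    )


def render_inventory(inventory):
    """Render all hosts in a stable (sorted) order so diffs stay small."""
    hosts = sorted(inventory, key=lambda host: host["name"])
    return "\n".join(render_host(h["name"], h["address"]) for h in hosts)


# Hypothetical inventory; in practice it would come from the CMDB or the
# configuration management tool.
inventory = [
    {"name": "web02", "address": "10.0.0.12"},
    {"name": "web01", "address": "10.0.0.11"},
]
config_text = render_inventory(inventory)
```

Generating the file from a single source means a new server is monitored as soon as it is registered, without a manual step.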
ITIL and ITSM Tools
⦿ ITSM Tools
⚫ I strongly recommend using one
⚫ If you intend to manage incidents automatically, spend more time
evaluating which tool will be used
⦿ Processes are the backbone
⚫ Incident Management
⚫ Problem Management
⚫ Change Management
⦿ CMDB - registration/control is mandatory
⚫ In small installations, your monitoring tool is your CMDB
⚫ In larger environments, you will need to synchronize it with the ITSM
tool
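Synchronizing the monitoring tool with the CMDB starts with finding where the two disagree. A minimal sketch, assuming both systems can export a list of host names (the names below are hypothetical):

```python
def cmdb_drift(monitored_hosts, cmdb_hosts):
    """Compare the hosts known to monitoring with those registered in the CMDB."""
    monitored = set(monitored_hosts)
    registered = set(cmdb_hosts)
    return {
        # Registered in the CMDB but not monitored: a coverage gap.
        "unmonitored": sorted(registered - monitored),
        # Monitored but not registered: the CMDB is out of date.
        "unregistered": sorted(monitored - registered),
    }


drift = cmdb_drift(
    monitored_hosts=["web01", "web02", "db01"],
    cmdb_hosts=["web01", "db01", "db02"],
)
```

Both lists should be empty once the two systems are in sync.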
Automatic Incident Creation
Some benefits of automatic incident creation in larger environments:
⦿ Addresses the inefficiency of manual incident logging
⦿ Registers failures exactly when they occur
⦿ Allows predefining the importance of each component/service and
prioritizing its resolution in case of failure
⦿ Reduces informal incident resolution without logging
⦿ Provides insight for in-depth analysis of the environment
⦿ Integrated with crisis management, reduces resolution time and improves
related communication
⦿ Enables realistic calculation of OLAs and SLAs
Automatic Incident Creation
⦿ Integration via:
⚫ API preferably (REST/SOAP)
⚫ Email with templates: most tools support it, but use it only as a last resort
⦿ Set the priority when opening the incident so the resolving team can
prioritize. According to ITIL, on a scale of 1-5:
⚫ Priorities (think of a pyramid):
○ 1 and 2: should be less than 5% of incidents
○ 3: 20%
○ 4: 30%
○ 5: 45%
⦿ For each priority, define different resolution OLAs. Remember that this will
directly affect the size of the team.
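The pyramid can be checked against real incident data. The sketch below computes the share of incidents per priority and flags a distribution where priorities 1 and 2 reach the 5% cap from the slide; the function names and sample data are hypothetical.

```python
from collections import Counter


def priority_shares(priorities):
    """Return the percentage of incidents in each pyramid band."""
    counts = Counter(priorities)
    total = len(priorities)
    return {
        "1-2": 100.0 * (counts[1] + counts[2]) / total,
        "3": 100.0 * counts[3] / total,
        "4": 100.0 * counts[4] / total,
        "5": 100.0 * counts[5] / total,
    }


def is_top_heavy(priorities, cap=5.0):
    """True when priorities 1 and 2 together reach or exceed the cap."""
    return priority_shares(priorities)["1-2"] >= cap


# Hypothetical month of incidents, one priority value per incident.
month = [1] * 1 + [2] * 2 + [3] * 20 + [4] * 30 + [5] * 47
shares = priority_shares(month)
```

A top-heavy result usually means priorities are being assigned too generously, which defeats their purpose for the resolving team.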
Automatic Incident Creation
⦿ Automatically reopen a resolved incident if the check
keeps failing in monitoring or fails again within 30
minutes.
⦿ Open a new incident if a new alarm arrives more than
30 minutes after the last incident was resolved.
⦿ Suppress incident creation during scheduled
maintenance.
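These three rules can be expressed as one decision function. This is a minimal sketch, assuming the integration knows when the last incident for the check was resolved and has a list of maintenance windows; the function name, return values, and the handling of the exact 30-minute boundary are assumptions.

```python
from datetime import datetime, timedelta

REOPEN_WINDOW = timedelta(minutes=30)  # window taken from the slide


def action_for_alarm(now, last_resolved_at, maintenance_windows=()):
    """Decide what to do with a new alarm: 'suppress', 'reopen', or 'create'."""
    # Scheduled maintenance wins over everything else.
    for start, end in maintenance_windows:
        if start <= now <= end:
            return "suppress"
    # A failure shortly after resolution is treated as the same incident.
    if last_resolved_at is not None and now - last_resolved_at <= REOPEN_WINDOW:
        return "reopen"
    return "create"


resolved = datetime(2014, 5, 1, 10, 0)
window = (datetime(2014, 5, 1, 2, 0), datetime(2014, 5, 1, 4, 0))

early = action_for_alarm(datetime(2014, 5, 1, 10, 20), resolved)
late = action_for_alarm(datetime(2014, 5, 1, 10, 45), resolved)
during = action_for_alarm(datetime(2014, 5, 1, 3, 0), None, [window])
```

With the sample values, the alarm at 10:20 reopens the incident, the alarm at 10:45 creates a new one, and the alarm at 03:00 is suppressed.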
Automatic Incident Creation
⦿ Automatically closing incidents with status "no intervention" when
monitoring normalizes before the team intervenes allows:
⚫ Refinement of the solution and its efficiency
⚫ Adjustment of very tight thresholds
⚫ Information for opening Problems
⚫ Identification of failures in the planning/execution of changes
⚫ Quickly resuming incident handling after events with
hundreds/thousands of incidents opened in a short period of time
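Automatic closure can be sketched as a handler that runs when a check recovers. The `Incident` fields below are hypothetical; only the "no intervention" status comes from the slide.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    check: str
    status: str = "open"
    touched: bool = False  # True once someone on the team has acted on it


def on_recovery(incidents, check):
    """Close untouched open incidents for a check that has recovered."""
    closed = []
    for incident in incidents:
        if incident.check == check and incident.status == "open" and not incident.touched:
            incident.status = "no intervention"
            closed.append(incident)
    return closed


queue = [
    Incident(check="web01/http"),
    Incident(check="web01/http", touched=True),
    Incident(check="db01/disk"),
]
closed = on_recovery(queue, "web01/http")
```

Incidents that someone has already started working on stay open, so the team keeps ownership of anything it touched, while counting the "no intervention" closures per check points to thresholds that are too tight.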
Tools already being used
⦿ Monitoring (open source):
⚫ Nagios
⚫ Check_mk – Locaweb
⚫ Zabbix
⦿ ITSM:
⚫ ServiceNow (API) – Locaweb
⚫ CA – Service Desk Manager (API) – Locaweb
⚫ HP – Service Center (API)
⚫ OTRS – (API)
Challenges
⦿ Golden Rule: "Every alarm must have a corrective action" even if it's just
adjusting the thresholds in case of false positives.
⦿ Don't be fooled - in the beginning, you will have many false positives.
Persistence is key.
⦿ If you don't close incidents automatically during instabilities, typically
network-related, you will be buried in incidents and will miss important
alarms when the instability ceases.
Challenges
⦿ Who implements the solution and who administers day-to-day operations?
⚫ Implementation of the solution: naturally the most Senior team/person.
⚫ Who should enable monitoring in new systems? If you thought of
the intern or the junior members of the team, you're mistaken. It is also
the responsibility of the most senior members. It should be automated.
Challenges
More important than the tools are the people and adherence to the defined
processes, end-to-end.
Periodically revisit the processes to adjust and evolve according to current
needs.
If any process is not working, change it. Do not allow it to be abandoned or
circumvented.
Q&A ?
