Uptime Day #1
Stream your monitoring
Stanislav Osipov
2017.04.08
Stanislav Osipov
• DevOps Architect at ECommPay
• Sr. DevOps Engineer at CityAds Media
• CISO & CIO at Payler, Runet Award 2014
• H. SysOps Engineer at Mainpeople Worldwide
• Sr. Deployment Engineer & Project Manager at Mirantis IT
• Sr. DevOps at Undev (Digital October)
• … and 10 more companies …
Topics
• Context
• Global problem statistics
• Tools
• 3 metric layers for Zabbix
• Channels
• Escalations
• Additional sources
• Receive it together
• Help your managers
Online advertising industry
CPA, RTB, etc.
• A lot of traffic
• A lot of buzz and big data
• A lot of short-term initiatives
• A lot of IT support
Online advertising industry
CPA, RTB, etc.
• A lot of madness & fuckups
• A lot of fun & lulz!
Classify problems
1. Datacenter failures
2. Connectivity failures
3. /dev/hands
ENISA Annual Incident Report 2015
(issued on October 05, 2016)
https://www.enisa.europa.eu/topics/incident-reporting/for-telcos/annual-reports
Problems distribution
[Pie chart; legend: Connectivity failures, Datacenter failures, Failures overhead, /dev/hands; labeled value: 98.9]
1. Datacenter failures
2. Connectivity failures
3. /dev/hands – up to 99% of failures
Tools
External:
• New Relic Synthetics
• Pingdom
Internal:
• Zabbix
• Munin
• New Relic APM (runtime context)
• Custom scripts
Concept of 3 layers for Zabbix
• Base/System: OS metrics
- Disk space, RAM, CPU, load average (LA), network
• Components: Daemon metrics
- Nginx, PHP-FPM, memcached, MySQL, PgSQL, etc.
• Advanced: Services & Apps metrics, DBA metrics, etc.
- Money earned ;-)
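A minimal sketch of the three layers as a lookup, assuming metrics are named with a recognizable key prefix. The prefixes below are illustrative assumptions, not keys taken from the talk; adapt them to whatever your templates actually use.

```python
from enum import Enum
from typing import Optional


class Layer(Enum):
    BASE = "Base/System"
    COMPONENTS = "Components"
    ADVANCED = "Advanced"


# Illustrative key prefixes only (assumption); replace with your own item keys.
PREFIX_TO_LAYER = {
    "system.": Layer.BASE,
    "vm.": Layer.BASE,
    "vfs.": Layer.BASE,
    "net.": Layer.BASE,
    "nginx.": Layer.COMPONENTS,
    "php-fpm.": Layer.COMPONENTS,
    "memcached.": Layer.COMPONENTS,
    "mysql.": Layer.COMPONENTS,
    "pgsql.": Layer.COMPONENTS,
    "app.": Layer.ADVANCED,
    "dba.": Layer.ADVANCED,
}


def layer_of(item_key: str) -> Optional[Layer]:
    """Return the layer for an item key, or None if no prefix matches."""
    for prefix, layer in PREFIX_TO_LAYER.items():
        if item_key.startswith(prefix):
            return layer
    return None
```

Returning None for unknown keys makes unclassified metrics visible instead of silently filing them under a layer.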
Group or tag hosts in Zabbix
• By envs: Production, Staging, Testing, Development
• By datacenters or locations
• By VM guest type
• By operating system or type
• By projects, services & components
• By teams & people, if you cannot overcome the chaos
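A small sketch of the grouping idea, assuming each host carries a flat set of tags. The host names and tag values are made up for illustration, and this is plain Python, not a call into the Zabbix API.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical inventory: host name -> tags along the dimensions listed above.
HOSTS = {
    "web-01": {"env": "Production", "location": "dc-a", "os": "linux", "project": "shop"},
    "web-02": {"env": "Production", "location": "dc-b", "os": "linux", "project": "shop"},
    "db-stg-01": {"env": "Staging", "location": "dc-a", "os": "linux", "project": "shop"},
}


def group_by(hosts: Dict[str, Dict[str, str]], dimension: str) -> Dict[str, List[str]]:
    """Group host names by one tag dimension; hosts without the tag go to 'untagged'."""
    groups = defaultdict(list)
    for name, tags in sorted(hosts.items()):
        groups[tags.get(dimension, "untagged")].append(name)
    return dict(groups)
```

Grouping by "env" is what makes a "Production only" escalation policy possible later on.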
Channels
• Call/SMS
• E-mail
• New Relic mobile app
• Slack
• Telegram
• … and screens!
Escalations – Production only
• >= HIGH: notify Ops on duty (all channels, incl. SMS)
• >= CRITICAL: notify all Ops; add a 15-minute delay outside work hours (all channels)
• >= CRITICAL for 1 hour: notify Ops management (any way, any time, all channels)
• >= HIGH: notify SDEs during the daytime (all channels)
• >= CRITICAL: notify the SD TL of the affected system; add a 15-minute delay outside work hours (all channels)
• = DISASTER: notify IT top management & the SD team of the affected systems; add a 5-minute delay outside work hours
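The escalation rules above can be sketched as a single function. This is one reading of the slide, under assumptions: severities are ordered HIGH < CRITICAL < DISASTER, "daylight time" means daytime, and the audience names are hypothetical labels.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Severity(IntEnum):
    # Ordering is an assumption; only the relative order matters here.
    HIGH = 1
    CRITICAL = 2
    DISASTER = 3


@dataclass(frozen=True)
class Notification:
    audience: str        # hypothetical audience label
    delay_minutes: int   # wait this long before notifying


def escalate(severity: Severity, minutes_open: int,
             work_hours: bool, daytime: bool) -> List[Notification]:
    """Return who to notify for a production problem, and after what delay."""
    plan: List[Notification] = []
    if severity >= Severity.HIGH:
        plan.append(Notification("ops-on-duty", 0))
        if daytime:
            plan.append(Notification("sde", 0))
    if severity >= Severity.CRITICAL:
        delay = 0 if work_hours else 15
        plan.append(Notification("all-ops", delay))
        plan.append(Notification("sd-team-lead", delay))
        if minutes_open >= 60:
            # Management is notified at any time, with no extra delay.
            plan.append(Notification("ops-management", 0))
    if severity == Severity.DISASTER:
        delay = 0 if work_hours else 5
        plan.append(Notification("it-top-management", delay))
        plan.append(Notification("sd-team", delay))
    return plan
```

Keeping the policy in one pure function makes it easy to review and to test before wiring it to real channels.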
What more?
New Relic Alerts on top of New Relic APM:
• Detect the % of fatal errors in code at runtime. Create an NR Alerts policy: “If count is more than threshold”. If it fires, NR Alerts sends an e-mail to bugs@company.tld or whatever address you enter.
• The e-mail must trigger a “Bug in production” ticket to IT support.
• IT support assigns the ticket to the responsible team.
• Every such ticket must decrease the KPI of the responsible team.
• PROFIT: nobody wants bugs in production!
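A rough sketch of the alert-to-ticket flow above, with no real New Relic or ticketing API involved. The class, the owner mapping, and the "one KPI point per ticket" rule are all hypothetical stand-ins for whatever your tooling provides.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Ticket:
    title: str
    team: str


@dataclass
class BugDesk:
    owners: Dict[str, str]                                # app -> responsible team (assumed mapping)
    kpi: Dict[str, int] = field(default_factory=dict)     # team -> KPI score
    tickets: List[Ticket] = field(default_factory=list)

    def on_alert(self, app: str, fatal_count: int, threshold: int) -> Optional[Ticket]:
        """Open a ticket only when the fatal error count is more than the threshold."""
        if fatal_count <= threshold:
            return None
        # Unowned apps stay with IT support until someone claims them.
        team = self.owners.get(app, "it-support")
        ticket = Ticket(f"Bug in production: {app}", team)
        self.tickets.append(ticket)
        # Every such ticket decreases the KPI of the responsible team.
        self.kpi[team] = self.kpi.get(team, 0) - 1
        return ticket
```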
What more?
Should Ops know about the deployments?
• Add hook notifications from Jenkins jobs (production deployments) to IMs (Slack, Telegram, etc.)
• Add the Pingdom bot to IMs
• Add New Relic Synthetics notifications to IMs
• Add anything else that happens in your SD & IT …
Too many chats!!!
Stream this.
One stream chat per IM.
For everything.
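A minimal sketch of "one stream chat per IM", assuming each messenger is wrapped in a send function you supply. The class name, the sender signature, and the chat names are assumptions; no real Slack or Telegram API is called here.

```python
from typing import Callable, Dict

# A sender takes (chat, text) and delivers it; real ones would wrap an IM API.
Sender = Callable[[str, str], None]


class MonitoringStream:
    """Fan every event, from any source, into the single stream chat of each IM."""

    def __init__(self, stream_chats: Dict[str, str], senders: Dict[str, Sender]) -> None:
        self.stream_chats = stream_chats   # IM name -> its one stream chat
        self.senders = senders             # IM name -> send function

    def publish(self, source: str, text: str) -> None:
        # Prefix with the source so one chat can carry everything.
        line = f"[{source}] {text}"
        for im, chat in self.stream_chats.items():
            self.senders[im](chat, line)


if __name__ == "__main__":
    def show(chat: str, text: str) -> None:
        print(chat, text)

    stream = MonitoringStream(
        {"slack": "#stream", "telegram": "stream-chat"},
        {"slack": show, "telegram": show},
    )
    stream.publish("jenkins", "production deployment finished")
    stream.publish("pingdom", "site is DOWN")
```

Tagging each line with its source is what keeps a single chat readable when deployments, synthetic checks, and alerts all land in it.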
Stream the monitoring
Help your manager
bonus track
Incident management
Custom report for Zabbix
Count incidents and stats every month
Implement CI/CD and monitor the uptime
PROFIT: Uptime with observable result.
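A sketch of the monthly count, assuming incidents are available as (start, end) timestamps. Two simplifications are assumed here: an incident is attributed to the month it started in, and overlapping incidents are not merged.

```python
import calendar
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Tuple

Month = Tuple[int, int]  # (year, month)


def monthly_report(incidents: Iterable[Tuple[datetime, datetime]]) -> Dict[Month, Dict[str, float]]:
    """Count incidents, total downtime, and uptime percentage per calendar month."""
    count: Dict[Month, int] = defaultdict(int)
    downtime: Dict[Month, float] = defaultdict(float)
    for start, end in incidents:
        month = (start.year, start.month)
        count[month] += 1
        downtime[month] += (end - start).total_seconds() / 60
    report: Dict[Month, Dict[str, float]] = {}
    for month in sorted(count):
        days_in_month = calendar.monthrange(*month)[1]
        total_minutes = days_in_month * 24 * 60
        report[month] = {
            "incidents": count[month],
            "downtime_minutes": downtime[month],
            "uptime_percent": 100 * (1 - downtime[month] / total_minutes),
        }
    return report
```

Running this every month gives a trend line that can be compared before and after CI/CD is introduced.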
Questions?
Stanislav S. Osipov
oss+uptime@gkos.name
Thanks!
