Uptime Day #1
Stream your monitoring
Stanislav Osipov
2017.04.08
Stanislav Osipov
• DevOps Architect at ECommPay
• Sr. DevOps Engineer at CityAds Media
• CISO & CIO at Payler, Runet Award 2014
• H. SysOps Engineer at Mainpeople Worldwide
• Sr. Deployment Engineer & Project Manager at Mirantis IT
• Sr. DevOps at Undev (Digital October)
• … and 10 more companies …
Topics
• Context
• Global problem statistics
• Tools
• 3 metric layers for Zabbix
• Channels
• Escalations
• Additional sources
• Receive it together
• Help your managers
Online advertising industry
CPA, RTB, etc.
• A lot of traffic
• A lot of buzz and big data
• A lot of short-term initiatives
• A lot of IT support
Online advertising industry
CPA, RTB, etc.
• A lot of madness & fuckups
• A lot of fun & lulz!
Classify problems
1. Datacenter failures
2. Connectivity failures
3. /dev/hands
ENISA Annual Incident Report 2015
(issued on October 05, 2016)
https://www.enisa.europa.eu/topics/incident-reporting/for-telcos/annual-reports
Problems distribution
[Pie chart; legend: Connectivity failures, Datacenter failures, Failures overhead, /dev/hands; labeled value: 98.9]
1. Datacenter failures
2. Connectivity failures
3. /dev/hands – up to 99% of failures
Tools
External:
• New Relic Synthetics
• Pingdom
Internal:
• Zabbix
• Munin
• New Relic APM (runtime context)
• Custom in-house scripts
Concept of 3 layers for Zabbix
• Base/System: OS metrics
- Disk space, RAM, CPU, LA, net
• Components: Daemon metrics
- Nginx, PHP-FPM, memcached, MySQL, PgSQL, etc.
• Advanced: Services & Apps metrics, DBA metrics, etc
- Money earned ;-)
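A minimal sketch of how the three layers could be told apart by item-key prefix. The prefix table is an assumption made for illustration (the slides do not say how items are named); anything that is neither an OS metric nor a daemon metric falls into the advanced layer.

```python
# Hypothetical prefix table: which item-key prefixes belong to which layer.
# These prefixes are illustrative assumptions, not taken from the talk.
LAYER_PREFIXES = {
    "base": ("vfs.", "vm.", "system.", "net."),
    "components": ("nginx.", "php-fpm.", "memcached.", "mysql.", "pgsql."),
}


def layer_of(item_key: str) -> str:
    """Return the layer name for an item key; unknown keys are 'advanced'."""
    for layer, prefixes in LAYER_PREFIXES.items():
        # str.startswith accepts a tuple of candidate prefixes.
        if item_key.startswith(prefixes):
            return layer
    return "advanced"


if __name__ == "__main__":
    for key in ("system.cpu.load", "nginx.active_connections", "shop.money_earned"):
        print(key, "->", layer_of(key))
```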
Group or tag hosts in Zabbix
• By envs: Production, Staging, Testing, Development
• By datacenters or locations
• By VM guest type
• By operating systems or type
• By projects, services & components
• By teams & people, if you cannot overcome the chaos
Channels
• Call/SMS
• E-mail
• New Relic mobile app
• Slack
• Telegram
• … and screens!
Escalations – Production only
• >= HIGH: notify Ops on duty (all channels, incl. SMS)
• >= CRITICAL: notify all Ops (all channels); outside work hours, wait 15 minutes first
• >= CRITICAL for 1 hour: notify Ops management (any way, any time, all channels)
• >= HIGH: notify SDEs during daytime (all channels)
• >= CRITICAL: notify the SD TL of the affected system (all channels); outside work hours, wait 15 minutes first
• = DISASTER: notify IT top management & the SD team of the affected systems; outside work hours, wait 5 minutes first
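The rules on this slide can be sketched as a small function. This is an illustration under assumptions, not the author's configuration: the severity ordering HIGH < CRITICAL < DISASTER, the recipient labels, and treating "daytime" as the same flag as "work hours" are all choices made for the example.

```python
from enum import IntEnum
from typing import List, Tuple


class Severity(IntEnum):
    # Assumed ordering of the three levels named on the slide.
    HIGH = 1
    CRITICAL = 2
    DISASTER = 3


def escalation_plan(
    severity: Severity, minutes_open: int, work_hours: bool
) -> List[Tuple[str, int]]:
    """Return (recipient, delay_in_minutes) pairs for a production problem.

    Recipient names are hypothetical labels for the groups on the slide.
    """
    off_hours = not work_hours
    plan: List[Tuple[str, int]] = []
    if severity >= Severity.HIGH:
        plan.append(("ops-on-duty", 0))
        if work_hours:  # SDEs are only notified during daytime
            plan.append(("sde", 0))
    if severity >= Severity.CRITICAL:
        plan.append(("all-ops", 15 if off_hours else 0))
        plan.append(("sd-team-lead", 15 if off_hours else 0))
        if minutes_open >= 60:  # still open after an hour
            plan.append(("ops-management", 0))
    if severity == Severity.DISASTER:
        plan.append(("it-top-management", 5 if off_hours else 0))
        plan.append(("sd-team", 5 if off_hours else 0))
    return plan


if __name__ == "__main__":
    for who, delay in escalation_plan(Severity.CRITICAL, 90, work_hours=False):
        print(f"notify {who} after {delay} min")
```

Keeping the rules in one pure function makes them easy to review with managers before they are clicked into the monitoring system's action settings.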
What more?
New Relic Alerts on top of New Relic APM:
• Detect the % of fatal errors in the code at runtime. Create an NR Alerts policy: “If count is more than threshold”. If so, NR Alerts sends an e-mail to bugs@company.tld or whatever address you enter.
• The e-mail must trigger a “Bug on the production” ticket to IT support.
• IT support assigns the ticket to the responsible team.
• Every such ticket must decrease the KPI of the responsible team.
• PROFIT: nobody wants bugs on production!
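The flow above (threshold, ticket, KPI penalty) can be sketched in a few lines. This is a simplified model, not the New Relic API: `BugDesk`, `Ticket`, and the per-team penalty counter are hypothetical names, and the team is passed in directly instead of being assigned by IT support.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Ticket:
    title: str
    team: str
    fatal_count: int


class BugDesk:
    """Hypothetical stand-in for the alert -> ticket -> KPI chain."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.tickets: List[Ticket] = []
        self.kpi_penalties: Counter = Counter()

    def on_fatal_errors(self, fatal_count: int, team: str) -> Optional[Ticket]:
        # "If count is more than threshold": equal to the threshold is still fine.
        if fatal_count <= self.threshold:
            return None
        ticket = Ticket("Bug on the production", team, fatal_count)
        self.tickets.append(ticket)
        self.kpi_penalties[team] += 1  # every ticket counts against the team
        return ticket


if __name__ == "__main__":
    desk = BugDesk(threshold=10)
    desk.on_fatal_errors(25, team="billing")
    print(desk.tickets, dict(desk.kpi_penalties))
```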
What more?
Should Ops know about the deployments?
• Add notification hooks from Jenkins jobs (production deployments) to IMs (Slack, Telegram, etc.)
• Add Pingdom bot to IMs
• Add New Relic Synthetics notifications to IMs
• Add anything else that happens in your SD & IT …
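A deployment hook can be as small as one HTTP POST per IM. The sketch below only builds the requests, so it runs without network access; it assumes a Slack incoming webhook (JSON body with a `text` field) and the Telegram Bot API `sendMessage` method. The webhook URL, bot token, chat id, and message format are placeholders.

```python
import json
import urllib.parse
import urllib.request


def deploy_message(job: str, build: int, status: str) -> str:
    """Hypothetical one-line format for a production deployment notice."""
    return f"[deploy] {job} #{build}: {status}"


def slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build a POST for a Slack incoming webhook (JSON body with 'text')."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def telegram_request(bot_token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Build a POST for the Telegram Bot API sendMessage method."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    body = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


if __name__ == "__main__":
    msg = deploy_message("shop-backend", 42, "SUCCESS")
    req = slack_request("https://hooks.example.invalid/placeholder", msg)
    print(req.get_method(), req.full_url, req.data)
    # To actually send it: urllib.request.urlopen(req, timeout=10)
```

A Jenkins job could call such a script as its last build step, passing the job name, build number, and result.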
Too many chats!!!
Stream this.
One stream chat per IM.
For everything.
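One way to picture "one stream chat per IM": every source publishes into the same stream, and the stream forwards each line to exactly one chat in each messenger. The class and field names here are illustrative assumptions; `send` can be any callable that posts a line of text to a chat.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Event:
    source: str  # e.g. "zabbix", "jenkins", "pingdom"
    text: str


class MonitoringStream:
    """Fan every event out to the single stream chat of each IM."""

    def __init__(self, chats: Dict[str, Callable[[str], None]]) -> None:
        # One entry per IM: name -> function that posts a line to its stream chat.
        self._chats = chats

    def publish(self, event: Event) -> None:
        line = f"[{event.source}] {event.text}"  # tag the line with its origin
        for send in self._chats.values():
            send(line)


if __name__ == "__main__":
    stream = MonitoringStream({"slack": print, "telegram": print})
    stream.publish(Event("jenkins", "deploy #42 to production: SUCCESS"))
    stream.publish(Event("zabbix", "HIGH: disk space low on db1"))
```

Tagging each line with its source keeps a single chat readable even when deployments, synthetic checks, and alerts interleave.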
Stream the monitoring
Help your manager
bonus track
Incident management
Custom report for Zabbix
Count incidents and stats every month
Implement CI/CD and monitor the uptime
PROFIT: Uptime with observable result.
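The monthly numbers can be derived from a plain list of incidents. A minimal sketch, assuming each incident is a (start, end) pair of datetimes and that incidents do not overlap; the function name and the input shape are assumptions, and in practice the pairs would come from the monitoring system's problem history.

```python
import calendar
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def monthly_stats(
    year: int, month: int, incidents: Iterable[Tuple[datetime, datetime]]
) -> Tuple[int, float]:
    """Return (incident_count, uptime_percent) for one calendar month.

    Incidents are clipped to the month boundaries; overlapping incidents
    are not merged (assumption: they do not overlap).
    """
    days = calendar.monthrange(year, month)[1]
    month_start = datetime(year, month, 1)
    month_end = month_start + timedelta(days=days)

    count = 0
    downtime = timedelta()
    for began, resolved in incidents:
        lo, hi = max(began, month_start), min(resolved, month_end)
        if hi > lo:  # the incident touches this month
            count += 1
            downtime += hi - lo

    uptime = 100.0 * (1 - downtime / (month_end - month_start))
    return count, uptime


if __name__ == "__main__":
    april = [(datetime(2017, 4, 10, 0, 0), datetime(2017, 4, 10, 7, 12))]
    count, uptime = monthly_stats(2017, 4, april)
    print(f"{count} incident(s), uptime {uptime:.2f}%")
```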
Questions?
Stanislav S. Osipov
oss+uptime@gkos.name
Thanks!
