Streaming monitoring

Stanislav Osipov
• Advertising platforms; the specifics of R&D and Ops in advertising.
• Three pillars for turning Zabbix into a tool that is actually readable.
• A table, a patch, and reporting: sending Ops managers back into orbit.
• Not how everyone else does it: streaming the system's state of health.
• Channels (SMS, Tg, Sl, Ml), streams/groups.
• And now all together: Zabbix, New Relic, Jenkins, and others.

Uptime Day #1
Stream your monitoring
Stanislav Osipov
2017.04.08

Stanislav Osipov
• DevOps Architect at ECommPay
• Sr. DevOps Engineer at CityAds Media
• CISO & CIO at Payler, Runet Award 2014
• H. SysOps Engineer at Mainpeople Worldwide
• Sr. Deployment Engineer & Project Manager at Mirantis IT
• Sr. DevOps at Undev (Digital October)
• … and 10 more companies …

Topics
• Context
• Global problem statistics
• Tools
• 3 metric layers for Zabbix
• Channels
• Escalations
• Additional sources
• Receiving it all together
• Help for your managers

Online advertising industry
CPA, RTB, etc.
• A lot of traffic
• A lot of buzz and big data
• A lot of short-term initiatives
• A lot of IT support
• A lot of madness & fuckups
• A lot of fun & lulz!

Classify problems
1. Datacenter failures
2. Connectivity failures
3. /dev/hands

ENISA Annual Incident Report 2015
(issued on October 05, 2016)
https://www.enisa.europa.eu/topics/incident-reporting/for-telcos/annual-reports

Problems distribution
[Chart: Connectivity failures, Datacenter failures, Failures overhead, /dev/hands; value label: 98,9]
1. Datacenter failures
2. Connectivity failures
3. /dev/hands
– up to 99% of failures

Tools
External:
• New Relic Synthetics
• Pingdom
Internal:
• Zabbix
• Munin
• New Relic APM (runtime context)
• Custom scripts

Concept of 3 layers for Zabbix
• Base/System: OS metrics
- Disk space, RAM, CPU, load average (LA), network
• Components: Daemon metrics
- Nginx, PHP-FPM, memcached, MySQL, PgSQL, etc.
• Advanced: Services & Apps metrics, DBA metrics, etc.
- Money earned ;-)
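
As a rough illustration of the three layers, the sketch below assigns a layer to an item by its key prefix. The `system.`, `vfs.`, `vm.` and `net.if.` prefixes are standard Zabbix agent key families; every other prefix (`nginx.`, `php-fpm.`, `memcached.`, `mysql.`, `pgsql.`, `app.`) is a hypothetical name for user-defined items, and the prefix-to-layer mapping itself is an assumption rather than something the slides specify.

```python
# Layer names follow the slide; the prefix-to-layer mapping is an assumption.
LAYER_BY_PREFIX = {
    "system.": "base",
    "vfs.": "base",
    "vm.": "base",
    "net.if.": "base",
    "nginx.": "components",      # hypothetical user-defined keys
    "php-fpm.": "components",
    "memcached.": "components",
    "mysql.": "components",
    "pgsql.": "components",
    "app.": "advanced",          # hypothetical business metrics
}


def layer_of(item_key: str) -> str:
    """Return the metric layer for a Zabbix item key, or 'unknown'."""
    for prefix, layer in LAYER_BY_PREFIX.items():
        if item_key.startswith(prefix):
            return layer
    return "unknown"
```

For example, `layer_of("vfs.fs.size[/,pfree]")` lands in the base layer, while a made-up `app.money.earned` item lands in the advanced one.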

Group or tag hosts in Zabbix
• By envs: Production, Staging, Testing, Development
• By datacenters or locations
• By VM guest type
• By operating system or type
• By projects, services & components
• By teams & people, if you cannot overcome the chaos
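
One way to keep such grouping consistent is to derive group names from host attributes. This is only a sketch: the `Category/Value` naming scheme and the inventory field names (`env`, `datacenter`, `guest_type`, `os`, `project`, `team`) are illustrative assumptions, not taken from the talk.

```python
from typing import Dict, List

# Grouping dimensions from the slide, mapped to hypothetical inventory fields.
DIMENSIONS = [
    ("Env", "env"),
    ("DC", "datacenter"),
    ("Guest", "guest_type"),
    ("OS", "os"),
    ("Project", "project"),
    ("Team", "team"),
]


def host_groups(host: Dict[str, str]) -> List[str]:
    """Build one group name per dimension the host has a value for."""
    return [
        f"{label}/{host[field]}"
        for label, field in DIMENSIONS
        if host.get(field)
    ]
```

A host described as `{"env": "Production", "os": "Linux"}` would get the groups `Env/Production` and `OS/Linux`.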

Channels
• Call/SMS
• E-mail
• New Relic mobile app
• Slack
• Telegram
• … and screens!

Escalations – Production only
• >= HIGH: notify Ops on duty (all channels, incl. SMS)
• >= CRITICAL: notify all Ops; outside work hours, add a 15-minute delay first (all channels)
• >= CRITICAL for 1 hour: notify Ops management (any way, any time, all channels)
• >= HIGH: notify SDEs during daytime (all channels)
• >= CRITICAL: notify the SD TL of the affected system; outside work hours, add a 15-minute delay first (all channels)
• = DISASTER: notify IT top management & the SD team of the affected systems; outside work hours, add a 5-minute delay
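
The escalation rules above can be sketched as a small function that returns who gets notified and after how many minutes. Several things here are assumptions: the severity ordering HIGH < CRITICAL < DISASTER is read off the slide, "daylight time" is treated as work hours, delays are taken as zero during work hours, and the recipient labels are hypothetical placeholders.

```python
from enum import IntEnum
from typing import List, NamedTuple


class Severity(IntEnum):
    # Ordering as read from the slide (an assumption): HIGH < CRITICAL < DISASTER.
    HIGH = 1
    CRITICAL = 2
    DISASTER = 3


class Step(NamedTuple):
    delay_min: int   # minutes after the problem started
    recipient: str   # hypothetical recipient label


def escalation_plan(severity: Severity, work_hours: bool) -> List[Step]:
    """Return the notification steps for a production problem."""
    steps: List[Step] = []
    if severity >= Severity.HIGH:
        steps.append(Step(0, "ops-on-duty"))
        if work_hours:  # SDEs are notified only during the day
            steps.append(Step(0, "sde"))
    if severity >= Severity.CRITICAL:
        delay = 0 if work_hours else 15
        steps.append(Step(delay, "all-ops"))
        steps.append(Step(delay, "sd-team-lead"))
        steps.append(Step(60, "ops-management"))  # after 1 hour, any time
    if severity == Severity.DISASTER:
        delay = 0 if work_hours else 5
        steps.append(Step(delay, "it-top-management"))
        steps.append(Step(delay, "sd-team"))
    return sorted(steps)  # ordered by delay, then recipient
```

Calling `escalation_plan(Severity.DISASTER, work_hours=False)` yields six steps, starting with the Ops engineer on duty at minute 0 and ending with Ops management at minute 60.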

What else?
New Relic Alerts on top of New Relic APM:
• Detect the % of fatal errors at code runtime. Create an NR Alerts policy: "If count is more than threshold". When it fires, NR Alerts sends an e-mail to bugs@company.tld or whatever address you enter.
• The e-mail must trigger a "Bug on the production" ticket to IT support.
• IT support assigns the ticket to the appropriate responsible team.
• Every such ticket must lower the KPI of the responsible team.
• PROFIT: nobody wants bugs on production!
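
The threshold check and the per-team ticket count behind the KPI can be sketched as follows. This does not call New Relic; the function names, the percentage threshold, and the team labels are all hypothetical and only illustrate the flow described above.

```python
from collections import Counter
from typing import Dict, List


def needs_ticket(fatal_errors: int, total_requests: int, threshold_pct: float) -> bool:
    """True when the share of fatal errors exceeds the threshold (in percent)."""
    if total_requests == 0:
        return False
    return 100.0 * fatal_errors / total_requests > threshold_pct


def kpi_penalties(ticket_teams: List[str]) -> Dict[str, int]:
    """Count 'Bug on the production' tickets per responsible team."""
    return dict(Counter(ticket_teams))
```

With a 0.5% threshold, 6 fatal errors in 1,000 requests would open a ticket, while exactly 5 would not.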

What else?
Should Ops know about the deployments?
• Add notification hooks from Jenkins jobs (production deployments) to IMs (Slack, Telegram, etc.)
• Add the Pingdom bot to IMs
• Add New Relic Synthetics notifications to IMs
• Add anything else that happens in your SD & IT …
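
A deployment hook could push a message to both messengers roughly like this. The sketch only builds the HTTP requests: a Slack incoming webhook takes a JSON body with a `text` field, and the Telegram Bot API exposes a `sendMessage` method. The message format, job name, token, and chat ID are hypothetical; how the hook is wired into Jenkins is left out.

```python
import json
import urllib.parse
import urllib.request


def deployment_message(job: str, build: int, status: str) -> str:
    """Format a one-line deployment notice (the format is an assumption)."""
    return f"[jenkins] {job} #{build}: {status}"


def slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build a POST for a Slack incoming webhook (JSON body with 'text')."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def telegram_request(bot_token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Build a POST for the Telegram Bot API sendMessage method."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    body = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")
```

Passing either request to `urllib.request.urlopen` would actually send it; keeping construction separate makes the hook easy to check without network access.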

Stream this.
One stream chat per IM.
For everything.
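
A minimal sketch of the single-stream idea: events from several sources are merged by time into one feed that can be posted to one chat. The `Event` shape, the source names, and the output line format are assumptions; each per-source feed is assumed to be already sorted by timestamp.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Event(NamedTuple):
    ts: int        # Unix timestamp, seconds
    source: str    # e.g. "zabbix", "jenkins", "pingdom"
    text: str


def single_stream(*sources: Iterable[Event]) -> Iterator[str]:
    """Merge per-source feeds (each sorted by ts) into one ordered feed."""
    for ev in heapq.merge(*sources, key=lambda e: e.ts):
        yield f"{ev.ts} [{ev.source}] {ev.text}"
```

Read top to bottom, the merged feed places a deployment next to the alerts that followed it, which is the point of keeping everything in one chat.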

Implement CI/CD and monitor the uptime
PROFIT: uptime with an observable result.

Questions?
Stanislav S. Osipov
oss+uptime@gkos.name
Thanks!

Streaming monitoring

18. Too many chats!!!

20. Stream the monitoring

21. Help for your manager (bonus track)

22. Incident management

23. Custom report for Zabbix

24. Count incidents and stats every month
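
A monthly report of this kind might be computed as in the sketch below. The input shapes are assumptions: incident start times as `datetime` values and total downtime in minutes. The slides do not say how the actual Zabbix report is built.

```python
import calendar
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable


def incidents_per_month(started: Iterable[datetime]) -> Dict[str, int]:
    """Count incidents by the 'YYYY-MM' month in which they started."""
    return dict(Counter(dt.strftime("%Y-%m") for dt in started))


def uptime_percent(year: int, month: int, downtime_minutes: float) -> float:
    """Uptime for a calendar month, given total downtime in minutes."""
    days = calendar.monthrange(year, month)[1]
    total_minutes = days * 24 * 60
    return 100.0 * (1 - downtime_minutes / total_minutes)
```

For April 2017 (30 days, 43,200 minutes), 43.2 minutes of downtime works out to roughly 99.9% uptime.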
