Streaming Monitoring

Uptime Day #1
Stream your monitoring
Stanislav Osipov
2017.04.08

Stanislav Osipov
• DevOps Architect at ECommPay
• Sr. DevOps Engineer at CityAds Media
• CISO & CIO at Payler, Runet Award 2014
• H. SysOps Engineer at Mainpeople Worldwide
• Sr. Deployment Engineer & Project Manager at Mirantis IT
• Sr. DevOps at Undev (Digital October)
• … and 10 more companies …

Topics
• Context
• Global problem statistics
• Tools
• 3 metric layers for Zabbix
• Channels
• Escalations
• Additional sources
• Receive it together
• Help for your managers

Online advertising industry
CPA, RTB, etc.
• A lot of traffic
• A lot of buzz and big data
• A lot of short-term initiatives
• A lot of IT support
• A lot of madness & fuckups
• A lot of fun & lulz!

Classify problems
1. Datacenter failures
2. Connectivity failures
3. /dev/hands

ENISA Annual Incident Report 2015
(issued on October 05, 2016)
https://www.enisa.europa.eu/topics/incident-reporting/for-telcos/annual-reports

Problems distribution
[Chart: Datacenter failures, Connectivity failures, Failures overhead, /dev/hands; value label: 98,9]
1. Datacenter failures
2. Connectivity failures
3. /dev/hands – up to 99% of failures

Tools
External:
•New Relic Synthetics
•Pingdom
Internal:
•Zabbix
•Munin
•New Relic APM (runtime context)
•Custom scripts

Concept of 3 layers for Zabbix
• Base/System: OS metrics
- Disk space, RAM, CPU, LA, net
• Components: Daemon metrics
- Nginx, PHP-FPM, memcached, MySQL, PgSQL, etc.
• Advanced: Services & Apps metrics, DBA metrics, etc
- Money earned ;-)

Group or tag hosts in Zabbix
• By envs: Production, Staging, Testing, Development
• By datacenters or locations
• By VM guest type
• By operating systems or type
• By projects, services & components
• By teams & ppl, if you cannot override the chaos
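The grouping idea above can be sketched as plain tags on hosts plus a selector. This is a minimal illustration in Python, not Zabbix's own API; the host names, tag keys, and the `select` helper are all hypothetical.

```python
# Hypothetical inventory: every host carries one tag per grouping axis
# named on the slide (environment, datacenter, OS, project).
HOSTS = {
    "web-01": {"env": "Production", "dc": "dc-a", "os": "Linux", "project": "shop"},
    "web-02": {"env": "Staging", "dc": "dc-a", "os": "Linux", "project": "shop"},
    "db-01": {"env": "Production", "dc": "dc-b", "os": "Linux", "project": "billing"},
}


def select(hosts, **tags):
    """Return sorted names of hosts whose tags match every given key=value."""
    return sorted(
        name
        for name, host_tags in hosts.items()
        if all(host_tags.get(key) == value for key, value in tags.items())
    )


production = select(HOSTS, env="Production")
production_shop = select(HOSTS, env="Production", project="shop")
```

With tags like these, a rule such as "escalate for Production only" becomes a simple filter instead of a hand-maintained host list.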

Channels
• Call/SMS
• E-mail
• New Relic mobile app
• Slack
• Telegram
• … and screens!

Escalations – Production only
• >= HIGH: notify Ops on duty (all channels, incl. SMS)
• >= CRITICAL: notify all Ops; add a 15-min delay outside work hours (all channels)
• >= CRITICAL for 1 hour: notify Ops management (any way, any time, all channels)
• >= HIGH: notify SDEs during daytime (all channels)
• >= CRITICAL: notify the SD TL of the affected system; add a 15-min delay outside work hours (all channels)
• = DISASTER: notify IT top management & the SD team of the affected systems; add a 5-min delay outside work hours
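The escalation rules above can be read as a small rule table. The sketch below is one possible encoding in Python, not a Zabbix action export. It assumes the severity ordering HIGH < CRITICAL < DISASTER, treats "daytime" as work hours, and uses hypothetical recipient names.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Ordering assumed from the slide: HIGH < CRITICAL < DISASTER.
    HIGH = 1
    CRITICAL = 2
    DISASTER = 3


@dataclass(frozen=True)
class Rule:
    target: str                   # hypothetical recipient group name
    min_severity: Severity        # rule applies at this severity and above
    after_min: int = 0            # problem age (minutes) before the rule fires
    off_hours_delay: int = 0      # extra minutes added outside work hours
    work_hours_only: bool = False


RULES = [
    Rule("ops-on-duty", Severity.HIGH),
    Rule("all-ops", Severity.CRITICAL, off_hours_delay=15),
    Rule("ops-management", Severity.CRITICAL, after_min=60),
    Rule("sde", Severity.HIGH, work_hours_only=True),
    Rule("sd-team-lead", Severity.CRITICAL, off_hours_delay=15),
    Rule("it-top-management", Severity.DISASTER, off_hours_delay=5),
    Rule("sd-team", Severity.DISASTER, off_hours_delay=5),
]


def recipients(severity, age_min, work_hours):
    """List who should have been notified for a problem of this age."""
    notified = []
    for rule in RULES:
        if severity < rule.min_severity:
            continue
        if rule.work_hours_only and not work_hours:
            continue
        delay = rule.after_min + (0 if work_hours else rule.off_hours_delay)
        if age_min >= delay:
            notified.append(rule.target)
    return notified
```

For example, a CRITICAL problem at night reaches only the Ops on duty at first; after 15 minutes it also reaches all Ops and the team lead of the affected system.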

What more?
New Relic Alerts on top of New Relic APM:
• Detect the % of fatal errors in code runtime. Create an NR Alerts
policy: “If count is more than threshold”. If so, NR Alerts sends an
e-mail to bugs@company.tld or any address you enter.
• The e-mail must trigger a “Bug on the production” ticket to IT
support.
• IT support assigns the ticket to the appropriate responsible
team.
• Every such ticket must lower the KPI of the responsible team.
• PROFIT: nobody wants bugs on production!
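The threshold-to-ticket flow above can be imitated in a few lines. The sketch below is a plain-Python stand-in, not New Relic's API: the function names, the application name, and the threshold value are hypothetical, and only the address and ticket title come from the slide.

```python
from email.message import EmailMessage


def fatal_error_percent(fatal, total):
    """Share of fatal errors among all requests, in percent."""
    return 0.0 if total == 0 else 100.0 * fatal / total


def bug_ticket_email(app, fatal, total, threshold_percent, to="bugs@company.tld"):
    """Build the ticket e-mail when the threshold is exceeded, else return None."""
    percent = fatal_error_percent(fatal, total)
    if percent <= threshold_percent:
        return None
    message = EmailMessage()
    message["To"] = to
    message["Subject"] = f"Bug on the production: {app} ({percent:.1f}% fatal errors)"
    message.set_content(
        f"{fatal} of {total} requests ended with a fatal error "
        f"(threshold: {threshold_percent}%)."
    )
    return message


# Hypothetical numbers: 5 fatal errors in 100 requests against a 2% threshold.
ticket = bug_ticket_email("shop", fatal=5, total=100, threshold_percent=2.0)
```

Sending the message (for example with `smtplib`) is left out on purpose; the point is that the alert produces a routable ticket rather than a chat message.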

What more?
Should Ops know about the deployments?
• Hook Jenkins job notifications (for production deployments)
into IMs (Slack, Telegram, etc.)
• Add the Pingdom bot to IMs
• Add New Relic Synthetics notifications to IMs
• Add anything else that happens in your SD & IT …

Stream this.
One stream chat per IM.
For everything.
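One way to picture "one stream for everything" is a shared event shape that every source is normalized into before posting. The sketch below is an assumption-laden illustration: the `Event` fields and the line format are hypothetical, while the payload shapes follow the public Slack incoming-webhook body (`text`) and Telegram Bot API `sendMessage` parameters (`chat_id`, `text`).

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str  # e.g. "zabbix", "jenkins", "pingdom" (illustrative values)
    env: str     # e.g. "production"
    text: str    # human-readable message


def to_stream_line(event):
    """Render every event the same way so the chat reads as one stream."""
    return f"[{event.source}] [{event.env}] {event.text}"


def slack_payload(event):
    """JSON body for a Slack incoming webhook."""
    return json.dumps({"text": to_stream_line(event)})


def telegram_params(event, chat_id):
    """Parameters for the Telegram Bot API sendMessage method."""
    return {"chat_id": chat_id, "text": to_stream_line(event)}


deploy = Event("jenkins", "production", "deployment finished")
line = to_stream_line(deploy)
```

The HTTP call itself is omitted; any client can post these payloads to the webhook URL or bot endpoint configured for the single stream chat.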

Help for your manager
bonus track

Count incidents and stats every month
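A monthly report of this kind can be computed from a list of incident start and end times. The sketch below is a minimal version under stated assumptions: incidents do not overlap, each one is attributed to the month it started in, and the sample dates are invented for illustration.

```python
import calendar
from collections import defaultdict
from datetime import datetime


def monthly_stats(incidents):
    """Count incidents, downtime minutes, and uptime % per (year, month).

    `incidents` is an iterable of (start, end) datetimes. Incidents are
    assumed not to overlap and are counted in the month they started.
    """
    stats = defaultdict(lambda: {"count": 0, "downtime_min": 0.0})
    for start, end in incidents:
        month = stats[(start.year, start.month)]
        month["count"] += 1
        month["downtime_min"] += (end - start).total_seconds() / 60
    for (year, month_number), month in stats.items():
        minutes_in_month = calendar.monthrange(year, month_number)[1] * 24 * 60
        month["uptime_percent"] = round(
            100 * (1 - month["downtime_min"] / minutes_in_month), 3
        )
    return dict(stats)


# Invented sample data: two incidents in one month, 30 and 60 minutes long.
sample = [
    (datetime(2017, 3, 1, 10, 0), datetime(2017, 3, 1, 10, 30)),
    (datetime(2017, 3, 15, 2, 0), datetime(2017, 3, 15, 3, 0)),
]
report = monthly_stats(sample)
```

Tracking the same three numbers every month gives managers a trend line rather than a collection of anecdotes.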

Implement CI/CD and monitor the uptime
PROFIT: Uptime with observable result.

Questions?
Stanislav S. Osipov
oss+uptime@gkos.name
Thanks!
