Uptime Day #1
Stream your monitoring
Stanislav Osipov
2017.04.08
Stanislav Osipov
• DevOps Architect at ECommPay
• Sr. DevOps Engineer at CityAds Media
• CISO & CIO at Payler, Runet Award 2014
• H. SysOps Engineer at Mainpeople Worldwide
• Sr. Deployment Engineer & Project Manager at Mirantis IT
• Sr. DevOps at Undev (Digital October)
• … and 10 more companies …
Topics
• Context
• Global problem statistics
• Tools
• 3 metric layers for Zabbix
• Channels
• Escalations
• Additional sources
• Receive it together
• Help your managers
Online advertising industry
CPA, RTB, etc.
• A lot of traffic
• A lot of buzz and big data
• A lot of short-term initiatives
• A lot of IT support
Online advertising industry
CPA, RTB, etc.
• A lot of madness & fuckups
• A lot of fun & lulz!
Classify problems
1. Datacenter failures
2. Connectivity failures
3. /dev/hands
ENISA Annual Incident Report 2015
(issued on October 05, 2016)
https://www.enisa.europa.eu/topics/incident-reporting/for-telcos/annual-reports
Problems distribution
[Chart: failure categories (connectivity failures, datacenter failures, failures overhead, /dev/hands); labelled value: 98,9]
1. Datacenter failures
2. Connectivity failures
3. /dev/hands – up to 99% of failures
Tools
External:
• New Relic Synthetics
• Pingdom
Internal:
• Zabbix
• Munin
• New Relic APM (runtime context)
• In-house scripts
Concept of 3 layers for Zabbix
• Base/System: OS metrics
- Disk space, RAM, CPU, LA, net
• Components: Daemon metrics
- Nginx, PHP-FPM, memcached, MySQL, PgSQL, etc.
• Advanced: Services & Apps metrics, DBA metrics, etc.
- Money earned ;-)
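A minimal sketch of how the three layers could be expressed as a lookup, so that every item key lands in exactly one layer. The base prefixes follow common Zabbix agent item keys; the component prefixes and the `billing.money_earned` key are hypothetical user parameters invented for illustration.

```python
# Layer names follow the slide; the key prefixes are illustrative assumptions.
LAYERS = {
    # Base/System: OS metrics (disk space, RAM, CPU, load average, network)
    "base": ("vfs.fs.", "vm.memory.", "system.cpu.", "net.if."),
    # Components: daemon metrics (hypothetical user-parameter prefixes)
    "components": ("nginx.", "php-fpm.", "memcached.", "mysql.", "pgsql."),
}


def layer_of(item_key: str) -> str:
    """Return the layer an item key belongs to; unknown keys count as 'advanced'."""
    for layer, prefixes in LAYERS.items():
        if item_key.startswith(prefixes):  # str.startswith accepts a tuple
            return layer
    # Everything else is a service, application or business metric.
    return "advanced"


if __name__ == "__main__":
    for key in ("vfs.fs.size[/,pfree]", "nginx.active_connections", "billing.money_earned"):
        print(key, "->", layer_of(key))
```

Treating "advanced" as the fallback keeps the mapping short: only the two lower layers need explicit prefixes.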
Group or tag hosts in Zabbix
• By envs: Production, Staging, Testing, Development
• By datacenters or locations
• By VM guest type
• By operating system or type
• By projects, services & components
• By teams & ppl, if you cannot override the chaos
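The grouping dimensions above can be sketched as a function that derives group names from host attributes. The `Host` fields, the `Env/`, `DC/`, `Guest/`, `OS/` and `Project/` prefixes, and the sample host names are all assumptions for illustration, not a naming scheme from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    """Hypothetical host record; field names are illustrative."""
    name: str
    env: str
    datacenter: str
    guest_type: str
    os: str
    project: str


def groups_for(host: Host) -> list:
    """Build one group name per grouping dimension from the slide."""
    return [
        f"Env/{host.env}",
        f"DC/{host.datacenter}",
        f"Guest/{host.guest_type}",
        f"OS/{host.os}",
        f"Project/{host.project}",
    ]


def index_by_group(hosts) -> dict:
    """Map each group name to the hosts in it, preserving input order."""
    index = {}
    for host in hosts:
        for group in groups_for(host):
            index.setdefault(group, []).append(host.name)
    return index


if __name__ == "__main__":
    hosts = [
        Host("web1", "Production", "dc-a", "kvm", "Linux", "shop"),
        Host("db1", "Production", "dc-b", "bare-metal", "Linux", "shop"),
    ]
    for group, members in index_by_group(hosts).items():
        print(group, members)
```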
Channels
• Call/SMS
• E-mail
• New Relic mobile
• Slack
• Telegram
Channels
• Call/SMS
• E-mail
• New Relic mobile app
• Slack
• Telegram
• … and screens!
Escalations – Production only
• >= HIGH: notify Ops on duty (all channels, incl. SMS)
• >= CRITICAL: notify all Ops (all channels); add a 15-minute delay outside working hours
• >= CRITICAL for 1 hour: notify Ops management (any way, any time, all channels)
• >= HIGH: notify SDEs during daytime (all channels)
• >= CRITICAL: notify the SD TL of the affected system (all channels); add a 15-minute delay outside working hours
• = DISASTER: notify IT top management & the SD team of the affected systems; add a 5-minute delay outside working hours
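The escalation rules above can be written down as data plus one small function. This is only a sketch under assumptions: the ordering HIGH < CRITICAL < DISASTER, reading "daylight time" as working hours, and all class and field names are illustrative rather than taken from a real Zabbix configuration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Assumed ordering of the severities named on the slide.
    HIGH = 1
    CRITICAL = 2
    DISASTER = 3


@dataclass(frozen=True)
class Rule:
    severity: Severity
    recipients: str
    delay_min: int = 0            # delay applied at any time
    off_hours_extra_min: int = 0  # extra delay outside working hours
    work_hours_only: bool = False
    exact: bool = False           # "=" instead of ">="


RULES = [
    Rule(Severity.HIGH, "Ops on duty"),
    Rule(Severity.CRITICAL, "all Ops", off_hours_extra_min=15),
    Rule(Severity.CRITICAL, "Ops management", delay_min=60),
    Rule(Severity.HIGH, "SDEs", work_hours_only=True),
    Rule(Severity.CRITICAL, "SD TL of the affected system", off_hours_extra_min=15),
    Rule(Severity.DISASTER, "IT top management & SD team", off_hours_extra_min=5, exact=True),
]


def escalation_plan(severity: Severity, work_hours: bool) -> list:
    """Return (delay in minutes, recipients) steps, earliest first."""
    steps = []
    for rule in RULES:
        matches = severity == rule.severity if rule.exact else severity >= rule.severity
        if not matches or (rule.work_hours_only and not work_hours):
            continue
        delay = rule.delay_min + (0 if work_hours else rule.off_hours_extra_min)
        steps.append((delay, rule.recipients))
    # Stable sort: rules with equal delay keep their order from RULES.
    return sorted(steps, key=lambda step: step[0])


if __name__ == "__main__":
    for delay, who in escalation_plan(Severity.DISASTER, work_hours=False):
        print(f"+{delay} min: {who}")
```

Keeping the rules in one table makes it easy to review who gets woken up at night before the policy is entered into the monitoring system.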
What more?
New Relic Alerts on top of New Relic APM:
• Detect the % of fatal errors in code runtime. Create an NR Alerts policy: “If count is more than threshold”. If so, NR Alerts sends an e-mail to bugs@company.tld or whatever address you enter.
• The e-mail must trigger a “Bug on the production” ticket to IT support.
• IT support assigns the ticket to the responsible team.
• Every such ticket must decrease the KPI of the responsible team.
• PROFIT: nobody wants bugs on production!
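The flow above (alert, ticket, assignment, KPI) can be modelled in a few lines to make the bookkeeping explicit. This is a hypothetical sketch: the `Tracker` class, its fields and the component-to-team mapping are assumptions for illustration, and it does not call New Relic or any ticketing system.

```python
from dataclasses import dataclass, field


@dataclass
class Tracker:
    """Toy model of the ticket flow; all names are illustrative."""
    owners: dict                      # component -> responsible team
    kpi: dict                         # team -> KPI score
    tickets: list = field(default_factory=list)

    def on_fatal_errors(self, component: str, count: int, threshold: int):
        """Open a ticket and decrease the team's KPI when count exceeds threshold."""
        if count <= threshold:
            return None               # policy not violated, no e-mail, no ticket
        team = self.owners[component]  # IT support's assignment step
        ticket = {"title": "Bug on the production", "component": component, "team": team}
        self.tickets.append(ticket)
        self.kpi[team] = self.kpi.get(team, 0) - 1
        return ticket


if __name__ == "__main__":
    tracker = Tracker(owners={"checkout": "payments"}, kpi={"payments": 10})
    print(tracker.on_fatal_errors("checkout", count=12, threshold=5))
    print(tracker.kpi)
```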
What more?
Should Ops know about the deployments?
What more?
Should Ops know about the deployments?
• Add Jenkins job notification hooks (for production deployments) to IMs (Slack, Telegram, etc.)
• Add the Pingdom bot to IMs
• Add New Relic Synthetics notifications to IMs
• Add anything else that happens in your SD & IT …
Too many chats!!!
Stream this.
One stream chat per IM.
For everything.
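One way to picture a single stream chat is a formatter that turns events from every source into the same one-line shape before posting. The sketch below only builds payloads and sends nothing; the source tags, the event fields and the sample chat id are assumptions. The payload shapes follow the usual form of a Slack incoming webhook body (`text`) and a Telegram Bot API `sendMessage` request (`chat_id`, `text`).

```python
import json

# Hypothetical tags so that mixed events stay readable in one chat.
TAGS = {
    "jenkins": "[deploy]",
    "pingdom": "[external]",
    "newrelic": "[synthetics]",
    "zabbix": "[internal]",
}


def stream_line(source: str, env: str, text: str) -> str:
    """Format any event as a single line for the stream chat."""
    tag = TAGS.get(source, "[other]")
    return f"{tag} {env}: {text}"


def slack_payload(line: str) -> str:
    """JSON body for a Slack incoming webhook."""
    return json.dumps({"text": line})


def telegram_payload(chat_id: int, line: str) -> str:
    """JSON body for a Telegram Bot API sendMessage call."""
    return json.dumps({"chat_id": chat_id, "text": line})


if __name__ == "__main__":
    line = stream_line("jenkins", "production", "deploy of shop started")
    print(slack_payload(line))
    print(telegram_payload(-1001, line))
```

With one formatter in front of every channel, adding a new source means adding one tag, not one more chat.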
Stream the monitoring
Help your manager
bonus track
Incident management
Custom report for Zabbix
Count incidents and stats every month
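A monthly count can be sketched as a small aggregation over incident start and end times. This is not the custom Zabbix report from the talk, just an illustration with assumptions: incidents are given as `(start, end)` datetime pairs, do not overlap, and are attributed to the month in which they start.

```python
import calendar
from collections import defaultdict
from datetime import datetime


def monthly_stats(incidents) -> dict:
    """Aggregate (start, end) incident pairs into per-month counts and uptime."""
    totals = defaultdict(lambda: [0, 0.0])  # (year, month) -> [count, downtime minutes]
    for start, end in incidents:
        key = (start.year, start.month)
        totals[key][0] += 1
        totals[key][1] += (end - start).total_seconds() / 60

    report = {}
    for (year, month), (count, down_min) in sorted(totals.items()):
        # Total minutes in the calendar month.
        month_min = calendar.monthrange(year, month)[1] * 24 * 60
        report[(year, month)] = {
            "incidents": count,
            "downtime_min": down_min,
            "uptime_pct": round(100 * (1 - down_min / month_min), 3),
        }
    return report


if __name__ == "__main__":
    sample = [
        (datetime(2017, 4, 3, 10, 0), datetime(2017, 4, 3, 10, 30)),
        (datetime(2017, 4, 20, 2, 0), datetime(2017, 4, 20, 2, 13)),
    ]
    for month, stats in monthly_stats(sample).items():
        print(month, stats)
```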
Implement CI/CD and monitor the uptime
PROFIT: Uptime with observable result.
Questions?
Stanislav S. Osipov
oss+uptime@gkos.name
Thanks!
