Stanislav Osipov
• Advertising platforms; the specifics of R&D and Ops in advertising.
• Three pillars on which Zabbix can be turned into a tool that is easy to read.
• A table, a patch, and reporting: sending Ops managers back into orbit.
• Not the way everyone else does it: streaming the system's health.
• Channels (SMS, Tg, Sl, Ml), streams/groups.
• And now all together: Zabbix, New Relic, Jenkins, and others.
2. Stanislav Osipov
• DevOps Architect at ECommPay
• Sr. DevOps Engineer at CityAds Media
• CISO & CIO at Payler, Runet Award 2014
• H. SysOps Engineer at Mainpeople Worldwide
• Sr. Deployment Engineer & Project Manager at Mirantis IT
• Sr. DevOps at Undev (Digital October)
• … and 10 more companies …
3. Topics
• Context
• Global problem statistics
• Tools
• 3 metric layers for Zabbix
• Channels
• Escalations
• Additional sources
• Receive it together
• Help for your managers
4. Online advertising industry
CPA, RTB, etc.
• A lot of traffic
• A lot of buzz and big data
• A lot of short-term initiatives
• A lot of IT support
10. Concept of 3 layers for Zabbix
• Base/System: OS metrics
- Disk space, RAM, CPU, LA, net
• Components: Daemon metrics
- Nginx, PHP-FPM, memcached, MySQL, PgSQL, etc.
• Advanced: Services & Apps metrics, DBA metrics, etc.
- Money earned ;-)
11. Group or tag hosts in Zabbix
• By envs: Production, Staging, Testing, Development
• By datacenters or locations
• By VM guest type
• By operating system or type
• By projects, services & components
• By teams & people, if you cannot overcome the chaos
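
One way to keep such a grouping scheme reproducible is to generate the host groups rather than click them together. The sketch below only builds JSON-RPC 2.0 request bodies for the Zabbix API method `hostgroup.create`; it does not send them, and authentication is left out. The "Dimension/Value" naming scheme and the datacenter and project names are hypothetical placeholders, not something the slides prescribe.

```python
import json
from itertools import count

# Hypothetical grouping dimensions; only the environment names come from the slide.
GROUP_DIMENSIONS = {
    "Env": ["Production", "Staging", "Testing", "Development"],
    "DC": ["dc1", "dc2"],        # placeholder datacenter names
    "Project": ["ads-api"],      # placeholder project name
}


def group_names(dimensions):
    """Return flat 'Dimension/Value' names, e.g. 'Env/Production'."""
    return [
        f"{dim}/{value}"
        for dim, values in dimensions.items()
        for value in values
    ]


def hostgroup_create_requests(names):
    """Build one JSON-RPC 2.0 'hostgroup.create' request body per group name."""
    ids = count(1)
    return [
        {
            "jsonrpc": "2.0",
            "method": "hostgroup.create",
            "params": {"name": name},
            "id": next(ids),
        }
        for name in names
    ]


if __name__ == "__main__":
    # Print the request bodies; sending them to api_jsonrpc.php is out of scope here.
    for request in hostgroup_create_requests(group_names(GROUP_DIMENSIONS)):
        print(json.dumps(request))
```

Keeping the dimensions in one dictionary makes it cheap to add a new environment or location and regenerate the whole set.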
14. Escalations – Production only
• >= HIGH: notify Ops on duty (all channels, incl. SMS)
• >= CRITICAL: notify all Ops (all channels);
add a 15-minute delay first outside work hours
• >= CRITICAL for 1 hour: notify Ops management (any way, any time, all channels)
• >= HIGH: notify SDEs during daytime (all channels)
• >= CRITICAL: notify the SD TL of the affected system (all channels);
add a 15-minute delay first outside work hours
• = DISASTER: notify IT top management & the SD team of the affected
systems;
add a 5-minute delay outside work hours
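
The escalation rules above can be read as a small decision table. The sketch below is one possible encoding, not the actual Zabbix action configuration from the talk. It assumes the severity order HIGH < CRITICAL < DISASTER, treats "daytime" and "work hours" as a single `work_hours` flag, and uses hypothetical recipient names. Each delay is counted in minutes from the start of the problem.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Ordering assumed from the slide: HIGH < CRITICAL < DISASTER.
    HIGH = 1
    CRITICAL = 2
    DISASTER = 3


def escalation_plan(severity, work_hours):
    """Return sorted (delay_minutes, recipient) pairs for a production problem."""
    off_hours = not work_hours
    plan = []
    if severity >= Severity.HIGH:
        plan.append((0, "ops-on-duty"))
        if work_hours:
            # Assumption: SDEs are simply not notified outside daytime.
            plan.append((0, "sde"))
    if severity >= Severity.CRITICAL:
        delay = 15 if off_hours else 0
        plan.append((delay, "all-ops"))
        plan.append((delay, "sd-team-lead"))
        # Fires only if the problem is still open after one hour.
        plan.append((60, "ops-management"))
    if severity == Severity.DISASTER:
        delay = 5 if off_hours else 0
        plan.append((delay, "it-top-management"))
        plan.append((delay, "sd-team"))
    return sorted(plan)


if __name__ == "__main__":
    for step in escalation_plan(Severity.DISASTER, work_hours=False):
        print(step)
```

Returning a plain sorted list keeps the rules easy to review with the people who will actually receive the notifications.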
15. What more?
New Relic Alerts on top of New Relic APM:
• Detect the % of fatal errors at code runtime. Create an NR Alerts
policy: “If count is more than threshold”. If so, NR Alerts sends
an e-mail to bugs@company.tld or whatever address you enter.
• E-mail must trigger a “Bug on the production” ticket to IT
support.
• IT support assigns the ticket to the appropriate responsible
team.
• Every such ticket must decrease the KPI of the responsible team.
• PROFIT: nobody wants bugs on production!
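
The flow above (threshold, ticket, KPI) can be modelled in a few lines. This is a plain illustration of the process and does not call any New Relic or ticketing API. The `Team` class, the 1% threshold, and the one-point KPI penalty are hypothetical values chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Team:
    name: str
    kpi: int = 100                      # hypothetical starting KPI score
    tickets: list = field(default_factory=list)


def fatal_error_percent(fatal_count, total_count):
    """Share of fatal errors among all transactions, in percent."""
    if total_count == 0:
        return 0.0
    return 100.0 * fatal_count / total_count


def handle_alert(team, fatal_count, total_count, threshold_percent=1.0, kpi_penalty=1):
    """Open a ticket and lower the team's KPI when the threshold is exceeded."""
    percent = fatal_error_percent(fatal_count, total_count)
    if percent <= threshold_percent:
        return None
    ticket = {
        "title": "Bug on the production",
        "assignee": team.name,
        "fatal_error_percent": percent,
    }
    team.tickets.append(ticket)
    team.kpi -= kpi_penalty
    return ticket


if __name__ == "__main__":
    team = Team("checkout")
    print(handle_alert(team, fatal_count=30, total_count=1000))
    print(team.kpi)
```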
17. What more?
Should Ops know about the deployments?
• Add Jenkins job hook notifications (for production deployments)
to IMs (Slack, Telegram, etc.)
• Add Pingdom bot to IMs
• Add New Relic Synthetics notifications to IMs
• Add anything else that happens in your SD & IT …
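
As an illustration of the Jenkins-to-IM idea, the sketch below formats a deployment notice from the standard Jenkins build variables (`JOB_NAME`, `BUILD_NUMBER`, `BUILD_URL`) and builds request bodies for a Slack incoming webhook and for the Telegram Bot API `sendMessage` method. It only constructs the JSON and sends nothing. The message wording, the function names, and the example values are assumptions made for this example.

```python
import json
import os


def deployment_message(env):
    """Format a one-line deployment notice from Jenkins build variables."""
    job = env.get("JOB_NAME", "unknown-job")
    build = env.get("BUILD_NUMBER", "?")
    url = env.get("BUILD_URL", "")
    return f"Production deployment: {job} #{build} {url}".strip()


def slack_payload(text):
    """JSON body for a Slack incoming webhook (a 'text' field)."""
    return json.dumps({"text": text})


def telegram_payload(chat_id, text):
    """JSON body for the Telegram Bot API 'sendMessage' method."""
    return json.dumps({"chat_id": chat_id, "text": text})


if __name__ == "__main__":
    # Inside a Jenkins job the variables come from the process environment.
    message = deployment_message(os.environ)
    print(slack_payload(message))
    print(telegram_payload(chat_id=0, text=message))  # chat_id 0 is a placeholder
```

Posting the same message to every channel keeps Ops and SD looking at one timeline of deployments.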