SlideShare a Scribd company logo
1 of 34
Download to read offline
Bringing Business Awareness
to Your Operation Team
Nagios World Conference 2013
Nicolas Brousse
Director, Operations Engineering
About TubeMogul...
● Founded in 2006
● Formerly a video distribution and analytics platform
● TubeMogul is a Brand-Focused Video Marketing
Company
○ Build for Branding
○ Integrate real-time media buying, ad serving, targeting,
optimization and brand measurement

TubeMogul simplifies the delivery of video ads and
maximizes the impact of every dollar spent by brand
marketers
http://www.tubemogul.com
About TubeMogul...
● Monitoring between 800 and 1000 servers
● Servers spread across 6 different locations
○ 4 Amazon EC2 Regions
○ 1 Hosted (Liquidweb) & 1 VPS (Linode)

● Little monitoring resources
○ Collecting over 120,000 metrics
○ Monitoring over 20,000 services with Nagios

● Multiple billions of HTTP requests a day
○ Most of it must be served in less than 100ms
○ Lost of traffic could mean lost of business opportunity
○ Or worst, over-spending…
Our Environment
● Over 80 different server profiles
● Our stack:
○
○
○
○
○
○

Java (Embedded Jetty, Tomcat)
PHP, RoR
Hadoop: HDFS, M/R, Hbase, Hive
Couchbase
MySQL, Vertica
ElasticSearch

● Monitoring: Nagios, NSCA
● Graphing: Ganglia, sFlow, Graphite
● Configuration Management: Puppet
Amazon Cloud Environment
Amazon Cloud Environment
● We use EC2, SDB, SQS, EMR, S3, etc.
● We don’t use ELB
● We heavily use EC2 Tags
ec2-describe-instances -F tag:hostname=dev-build01
Automated Monitoring
Automated Monitoring
Process of event when starting a new host and add it to our monitoring:
1.

We start a new instance using Cerveza and Cloud-init

2.

Puppet configure Gmond or Host sFlow on the instance

3.

Our monitoring server running Gmond and Gmetad get data from the new
instance

4.

A Nagios check run every minute and check for new hosts
○
○
○

5.

Look for new hosts using EC2 API
Look for EC2 tag “hostname” to confirm it’s a legit host, not a zombie / fail start
Look for EC2 tag “nagios_host” to see if the host belong to this monitoring instance

If a new host is found:
○
○

We build a config for the host based on a template file and doing some string replace
Once all config have been generated, we rebuild pre-cache objects and reload Nagios

6.

If we find “Zombie” host, we generate a Warning alert

7.

If the config is corrupt, we send a Critical alert
Efficient Monitoring
We reduce noise by disabling most notifications and using our "cluster check"
Efficient Monitoring
Efficient Monitoring
Efficient on-call rotation
● Follow the sun
○

OPS team is in Ukraine, no more Tier 1 night on-call for US Team

● Timeperiod and escalation are a pain to maintain
○ Nagios notification plugged to Google Calendar
■
■
■
■

Using our own notification script for email and paging
Google Calendar make it easy for each team to manage their
own on-call calendar
Support for multiple Tier and complex schedules
Caching Google Calendar info locally every hour

○ Simpler definitions and rules in Nagios contacts
○ Notify only people on-call, unless they asked for “off
call” emails
Efficient on-call rotation
Efficient on-call rotation
●
●
●
●

Simple contact definitions
Google Calendar info
Tier Filter (Regex)
Tier Interval (time to wait before escalating alert since
last tier)
● Off call email
Efficient on-call rotation
● Centralized dashboard!
○ Plugged to Google Apps
○ List all on-call contacts
Efficient on-call rotation
● Centralized View of Multiple Nagios!
Efficient on-call rotation
Efficient on-call rotation
Business Awareness!

Now that I know what is
breaking…
Which one should I fix first?
Business Awareness!

Service Health Dashboard!
Business Awareness!
● A service represent a Business Critical function
● A service can be global or limited to a region
● A service is defined by multiple Service Component
○ We use Nagios Event Handler to update services
component status
○ REST API dashboard allow easy update from third
party monitoring, QA test, scripts, crons, Nagios
● We can define service SLA and quickly see SLA
breaking (based on OLA)
● OPS Team can perform more actions, post comments,
link to Jira tickets
Business Awareness!
Business Awareness!
Business Awareness!
Business Awareness!
Business Awareness!
Business Awareness!
Business Awareness!
Business Awareness!
How do we make sure we answer the business
needs?
● Review SLAs and monitoring configurations
monthly/quarterly
● Have a checklist when launching new
product or features
○ We now have a SRE Hand-off Checklist with
detailed questions
○ How is capacity planning done?
○ Who can have an impact on traffic, storage, etc?
○ Make sure to ask about expected OLAs or SLAs
To summarize...
● We easily automate our deployment across
multiple geos
● We control the noise
● We easily schedule our on-call rotations
● In one location we know everything
● We know what is impacting the business
and how we should prioritize our actions
Team Process - Daily Kanban!
Request based on
Dashboards,
Monitoring, Paging or
Engineers
Ticket categorized in two Swimlane:
● Production Support
○ High Priority: Top to Bottom
○ On-call 24/7, OLAs, SLAs...
○ Incident are handled 1st
○ Maintenance are handled 2nd
● Developer Support
○ Best Effort: Top to Bottom
○ Long effort request moved to
Infrastructure pipeline
Team Process - Long Term Agile!
Request moved from OPS to INF
SRE Hand-off Checklist used to
define Epic, Story and Tasks

Check

Plan
Do

Act
OPS @ TubeMogul
All this wouldn’t be possible without a strong
SRE Operation Engineering team:
Aleksey Mykhailov
Marylene Tanfin
Mykola Mogylenko
Nicolas Brousse
Pierre Gohon
Pierre Grandin
Stan Rudenko
We are Hiring !

http://www.tubemogul.com/jobs

More Related Content

Viewers also liked

Viewers also liked (13)

Sales Operations Insights v1.0
Sales Operations Insights v1.0Sales Operations Insights v1.0
Sales Operations Insights v1.0
 
A New Perspective on Operational Excellence
A New Perspective on Operational ExcellenceA New Perspective on Operational Excellence
A New Perspective on Operational Excellence
 
Sales training: program, execution and evaluation
Sales training: program, execution and evaluationSales training: program, execution and evaluation
Sales training: program, execution and evaluation
 
The Next Generation Sales Operations Team
The Next Generation Sales Operations TeamThe Next Generation Sales Operations Team
The Next Generation Sales Operations Team
 
Designing Training Programs
Designing Training ProgramsDesigning Training Programs
Designing Training Programs
 
Designing an IT Solution
Designing an IT SolutionDesigning an IT Solution
Designing an IT Solution
 
operations management
operations managementoperations management
operations management
 
Sales Training
Sales TrainingSales Training
Sales Training
 
Basic sales training
Basic sales trainingBasic sales training
Basic sales training
 
Structured Approach to Solution Architecture
Structured Approach to Solution ArchitectureStructured Approach to Solution Architecture
Structured Approach to Solution Architecture
 
Production & operations management
Production & operations managementProduction & operations management
Production & operations management
 
Beyond the Gig Economy
Beyond the Gig EconomyBeyond the Gig Economy
Beyond the Gig Economy
 
Professional basic selling skills
Professional basic selling skillsProfessional basic selling skills
Professional basic selling skills
 

Similar to Bringing Business Awareness to Your Operation Team (Nagios World Conference 2013)

Nagios Conference 2012 - Nicolas Brousse - Optimizing your Monitoring and Tre...
Nagios Conference 2012 - Nicolas Brousse - Optimizing your Monitoring and Tre...Nagios Conference 2012 - Nicolas Brousse - Optimizing your Monitoring and Tre...
Nagios Conference 2012 - Nicolas Brousse - Optimizing your Monitoring and Tre...Nagios
 
Ridwan Fadjar Septian PyCon ID 2021 Regular Talk - django application monitor...
Ridwan Fadjar Septian PyCon ID 2021 Regular Talk - django application monitor...Ridwan Fadjar Septian PyCon ID 2021 Regular Talk - django application monitor...
Ridwan Fadjar Septian PyCon ID 2021 Regular Talk - django application monitor...Ridwan Fadjar
 
[WSO2Con EU 2018] Implementing a Zero Downtime WSO2 API Manager with an API C...
[WSO2Con EU 2018] Implementing a Zero Downtime WSO2 API Manager with an API C...[WSO2Con EU 2018] Implementing a Zero Downtime WSO2 API Manager with an API C...
[WSO2Con EU 2018] Implementing a Zero Downtime WSO2 API Manager with an API C...WSO2
 
InterCon 2016 - SLA vs Agilidade: uso de microserviços e monitoramento de cloud
InterCon 2016 - SLA vs Agilidade: uso de microserviços e monitoramento de cloudInterCon 2016 - SLA vs Agilidade: uso de microserviços e monitoramento de cloud
InterCon 2016 - SLA vs Agilidade: uso de microserviços e monitoramento de cloudiMasters
 
QueueMetrics - Tips and Tricks
QueueMetrics - Tips and TricksQueueMetrics - Tips and Tricks
QueueMetrics - Tips and TricksClarotech_Events
 
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...Nicolas Brousse
 
JUST EAT: Embracing DevOps
JUST EAT: Embracing DevOpsJUST EAT: Embracing DevOps
JUST EAT: Embracing DevOpsPeter Mounce
 
Mulesoft Meetup Milano #9 - Batch Processing and CI/CD
Mulesoft Meetup Milano #9 - Batch Processing and CI/CDMulesoft Meetup Milano #9 - Batch Processing and CI/CD
Mulesoft Meetup Milano #9 - Batch Processing and CI/CDGonzalo Marcos Ansoain
 
Webinar slides: Backup Management for MySQL, MariaDB, PostgreSQL & MongoDB wi...
Webinar slides: Backup Management for MySQL, MariaDB, PostgreSQL & MongoDB wi...Webinar slides: Backup Management for MySQL, MariaDB, PostgreSQL & MongoDB wi...
Webinar slides: Backup Management for MySQL, MariaDB, PostgreSQL & MongoDB wi...Severalnines
 
Mds cloud saturday 2015 how to heroku
Mds cloud saturday 2015 how to herokuMds cloud saturday 2015 how to heroku
Mds cloud saturday 2015 how to herokuDavid Scruggs
 
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Hernan Costante
 
Aws uk ug #8 not everything that happens in vegas stay in vegas
Aws uk ug #8   not everything that happens in vegas stay in vegasAws uk ug #8   not everything that happens in vegas stay in vegas
Aws uk ug #8 not everything that happens in vegas stay in vegasPeter Mounce
 
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ UberKafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uberconfluent
 
Splunk in Nordstrom: IT Operations
Splunk in Nordstrom: IT OperationsSplunk in Nordstrom: IT Operations
Splunk in Nordstrom: IT OperationsTimur Bagirov
 
Nagios Conference 2013 - Sam Lansing - Getting Started With Nagios XI, Core, ...
Nagios Conference 2013 - Sam Lansing - Getting Started With Nagios XI, Core, ...Nagios Conference 2013 - Sam Lansing - Getting Started With Nagios XI, Core, ...
Nagios Conference 2013 - Sam Lansing - Getting Started With Nagios XI, Core, ...Nagios
 
Reactive Cloud Security | AWS Public Sector Summit 2016
Reactive Cloud Security | AWS Public Sector Summit 2016Reactive Cloud Security | AWS Public Sector Summit 2016
Reactive Cloud Security | AWS Public Sector Summit 2016Amazon Web Services
 
Using SaltStack to DevOps the enterprise
Using SaltStack to DevOps the enterpriseUsing SaltStack to DevOps the enterprise
Using SaltStack to DevOps the enterpriseChristian McHugh
 
Lesson_08_Continuous_Monitoring.pdf
Lesson_08_Continuous_Monitoring.pdfLesson_08_Continuous_Monitoring.pdf
Lesson_08_Continuous_Monitoring.pdfMinh Quân Đoàn
 
How we leveraged Drupal to build a leading SaaS product
How we leveraged Drupal to build a leading SaaS product How we leveraged Drupal to build a leading SaaS product
How we leveraged Drupal to build a leading SaaS product Invotra
 
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthUSENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthNicolas Brousse
 

Similar to Bringing Business Awareness to Your Operation Team (Nagios World Conference 2013) (20)

Nagios Conference 2012 - Nicolas Brousse - Optimizing your Monitoring and Tre...
Nagios Conference 2012 - Nicolas Brousse - Optimizing your Monitoring and Tre...Nagios Conference 2012 - Nicolas Brousse - Optimizing your Monitoring and Tre...
Nagios Conference 2012 - Nicolas Brousse - Optimizing your Monitoring and Tre...
 
Ridwan Fadjar Septian PyCon ID 2021 Regular Talk - django application monitor...
Ridwan Fadjar Septian PyCon ID 2021 Regular Talk - django application monitor...Ridwan Fadjar Septian PyCon ID 2021 Regular Talk - django application monitor...
Ridwan Fadjar Septian PyCon ID 2021 Regular Talk - django application monitor...
 
[WSO2Con EU 2018] Implementing a Zero Downtime WSO2 API Manager with an API C...
[WSO2Con EU 2018] Implementing a Zero Downtime WSO2 API Manager with an API C...[WSO2Con EU 2018] Implementing a Zero Downtime WSO2 API Manager with an API C...
[WSO2Con EU 2018] Implementing a Zero Downtime WSO2 API Manager with an API C...
 
InterCon 2016 - SLA vs Agilidade: uso de microserviços e monitoramento de cloud
InterCon 2016 - SLA vs Agilidade: uso de microserviços e monitoramento de cloudInterCon 2016 - SLA vs Agilidade: uso de microserviços e monitoramento de cloud
InterCon 2016 - SLA vs Agilidade: uso de microserviços e monitoramento de cloud
 
QueueMetrics - Tips and Tricks
QueueMetrics - Tips and TricksQueueMetrics - Tips and Tricks
QueueMetrics - Tips and Tricks
 
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...
 
JUST EAT: Embracing DevOps
JUST EAT: Embracing DevOpsJUST EAT: Embracing DevOps
JUST EAT: Embracing DevOps
 
Mulesoft Meetup Milano #9 - Batch Processing and CI/CD
Mulesoft Meetup Milano #9 - Batch Processing and CI/CDMulesoft Meetup Milano #9 - Batch Processing and CI/CD
Mulesoft Meetup Milano #9 - Batch Processing and CI/CD
 
Webinar slides: Backup Management for MySQL, MariaDB, PostgreSQL & MongoDB wi...
Webinar slides: Backup Management for MySQL, MariaDB, PostgreSQL & MongoDB wi...Webinar slides: Backup Management for MySQL, MariaDB, PostgreSQL & MongoDB wi...
Webinar slides: Backup Management for MySQL, MariaDB, PostgreSQL & MongoDB wi...
 
Mds cloud saturday 2015 how to heroku
Mds cloud saturday 2015 how to herokuMds cloud saturday 2015 how to heroku
Mds cloud saturday 2015 how to heroku
 
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
 
Aws uk ug #8 not everything that happens in vegas stay in vegas
Aws uk ug #8   not everything that happens in vegas stay in vegasAws uk ug #8   not everything that happens in vegas stay in vegas
Aws uk ug #8 not everything that happens in vegas stay in vegas
 
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ UberKafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
 
Splunk in Nordstrom: IT Operations
Splunk in Nordstrom: IT OperationsSplunk in Nordstrom: IT Operations
Splunk in Nordstrom: IT Operations
 
Nagios Conference 2013 - Sam Lansing - Getting Started With Nagios XI, Core, ...
Nagios Conference 2013 - Sam Lansing - Getting Started With Nagios XI, Core, ...Nagios Conference 2013 - Sam Lansing - Getting Started With Nagios XI, Core, ...
Nagios Conference 2013 - Sam Lansing - Getting Started With Nagios XI, Core, ...
 
Reactive Cloud Security | AWS Public Sector Summit 2016
Reactive Cloud Security | AWS Public Sector Summit 2016Reactive Cloud Security | AWS Public Sector Summit 2016
Reactive Cloud Security | AWS Public Sector Summit 2016
 
Using SaltStack to DevOps the enterprise
Using SaltStack to DevOps the enterpriseUsing SaltStack to DevOps the enterprise
Using SaltStack to DevOps the enterprise
 
Lesson_08_Continuous_Monitoring.pdf
Lesson_08_Continuous_Monitoring.pdfLesson_08_Continuous_Monitoring.pdf
Lesson_08_Continuous_Monitoring.pdf
 
How we leveraged Drupal to build a leading SaaS product
How we leveraged Drupal to build a leading SaaS product How we leveraged Drupal to build a leading SaaS product
How we leveraged Drupal to build a leading SaaS product
 
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthUSENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
 

More from Nicolas Brousse

<Programming> 2019 - ICW'19: The Issue of Monorepo and Polyrepo In Large Ente...
<Programming> 2019 - ICW'19: The Issue of Monorepo and Polyrepo In Large Ente...<Programming> 2019 - ICW'19: The Issue of Monorepo and Polyrepo In Large Ente...
<Programming> 2019 - ICW'19: The Issue of Monorepo and Polyrepo In Large Ente...Nicolas Brousse
 
Improving Adobe Experience Cloud Services Dependability with Machine Learning
Improving Adobe Experience Cloud Services Dependability with Machine LearningImproving Adobe Experience Cloud Services Dependability with Machine Learning
Improving Adobe Experience Cloud Services Dependability with Machine LearningNicolas Brousse
 
IEEE ISSRE 2018 - Use of Self-Healing Techniques to Improve the Reliability o...
IEEE ISSRE 2018 - Use of Self-Healing Techniques to Improve the Reliability o...IEEE ISSRE 2018 - Use of Self-Healing Techniques to Improve the Reliability o...
IEEE ISSRE 2018 - Use of Self-Healing Techniques to Improve the Reliability o...Nicolas Brousse
 
PuppetConf 2017 | Adobe Advertising Cloud: A Lean Puppet Workflow to Support ...
PuppetConf 2017 | Adobe Advertising Cloud: A Lean Puppet Workflow to Support ...PuppetConf 2017 | Adobe Advertising Cloud: A Lean Puppet Workflow to Support ...
PuppetConf 2017 | Adobe Advertising Cloud: A Lean Puppet Workflow to Support ...Nicolas Brousse
 
Adobe Advertising Cloud: The Reality of Cloud Bursting with OpenStack
Adobe Advertising Cloud: The Reality of Cloud Bursting with OpenStackAdobe Advertising Cloud: The Reality of Cloud Bursting with OpenStack
Adobe Advertising Cloud: The Reality of Cloud Bursting with OpenStackNicolas Brousse
 
SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuite
SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuiteSuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuite
SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuiteNicolas Brousse
 
SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...
SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...
SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...Nicolas Brousse
 
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...Nicolas Brousse
 
Improving Operations Efficiency with Puppet
Improving Operations Efficiency with PuppetImproving Operations Efficiency with Puppet
Improving Operations Efficiency with PuppetNicolas Brousse
 
Scaling Bleeding Edge Technology in a Fast-paced Environment
Scaling Bleeding Edge Technology in a Fast-paced EnvironmentScaling Bleeding Edge Technology in a Fast-paced Environment
Scaling Bleeding Edge Technology in a Fast-paced EnvironmentNicolas Brousse
 
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)Nicolas Brousse
 

More from Nicolas Brousse (11)

<Programming> 2019 - ICW'19: The Issue of Monorepo and Polyrepo In Large Ente...
<Programming> 2019 - ICW'19: The Issue of Monorepo and Polyrepo In Large Ente...<Programming> 2019 - ICW'19: The Issue of Monorepo and Polyrepo In Large Ente...
<Programming> 2019 - ICW'19: The Issue of Monorepo and Polyrepo In Large Ente...
 
Improving Adobe Experience Cloud Services Dependability with Machine Learning
Improving Adobe Experience Cloud Services Dependability with Machine LearningImproving Adobe Experience Cloud Services Dependability with Machine Learning
Improving Adobe Experience Cloud Services Dependability with Machine Learning
 
IEEE ISSRE 2018 - Use of Self-Healing Techniques to Improve the Reliability o...
IEEE ISSRE 2018 - Use of Self-Healing Techniques to Improve the Reliability o...IEEE ISSRE 2018 - Use of Self-Healing Techniques to Improve the Reliability o...
IEEE ISSRE 2018 - Use of Self-Healing Techniques to Improve the Reliability o...
 
PuppetConf 2017 | Adobe Advertising Cloud: A Lean Puppet Workflow to Support ...
PuppetConf 2017 | Adobe Advertising Cloud: A Lean Puppet Workflow to Support ...PuppetConf 2017 | Adobe Advertising Cloud: A Lean Puppet Workflow to Support ...
PuppetConf 2017 | Adobe Advertising Cloud: A Lean Puppet Workflow to Support ...
 
Adobe Advertising Cloud: The Reality of Cloud Bursting with OpenStack
Adobe Advertising Cloud: The Reality of Cloud Bursting with OpenStackAdobe Advertising Cloud: The Reality of Cloud Bursting with OpenStack
Adobe Advertising Cloud: The Reality of Cloud Bursting with OpenStack
 
SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuite
SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuiteSuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuite
SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuite
 
SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...
SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...
SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...
 
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...
 
Improving Operations Efficiency with Puppet
Improving Operations Efficiency with PuppetImproving Operations Efficiency with Puppet
Improving Operations Efficiency with Puppet
 
Scaling Bleeding Edge Technology in a Fast-paced Environment
Scaling Bleeding Edge Technology in a Fast-paced EnvironmentScaling Bleeding Edge Technology in a Fast-paced Environment
Scaling Bleeding Edge Technology in a Fast-paced Environment
 
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)
 

Recently uploaded

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 

Recently uploaded (20)

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Bringing Business Awareness to Your Operation Team (Nagios World Conference 2013)

  • 1. Bringing Business Awareness to Your Operation Team Nagios World Conference 2013 Nicolas Brousse Director, Operations Engineering
  • 2. About TubeMogul... ● Founded in 2006 ● Formerly a video distribution and analytics platform ● TubeMogul is a Brand-Focused Video Marketing Company ○ Build for Branding ○ Integrate real-time media buying, ad serving, targeting, optimization and brand measurement TubeMogul simplifies the delivery of video ads and maximizes the impact of every dollar spent by brand marketers http://www.tubemogul.com
  • 3. About TubeMogul... ● Monitoring between 800 and 1000 servers ● Servers spread across 6 different locations ○ 4 Amazon EC2 Regions ○ 1 Hosted (Liquidweb) & 1 VPS (Linode) ● Little monitoring resources ○ Collecting over 120,000 metrics ○ Monitoring over 20,000 services with Nagios ● Multiple billions of HTTP requests a day ○ Most of it must be served in less than 100ms ○ Lost of traffic could mean lost of business opportunity ○ Or worst, over-spending…
  • 4. Our Environment ● Over 80 different server profiles ● Our stack: ○ ○ ○ ○ ○ ○ Java (Embedded Jetty, Tomcat) PHP, RoR Hadoop: HDFS, M/R, Hbase, Hive Couchbase MySQL, Vertica ElasticSearch ● Monitoring: Nagios, NSCA ● Graphing: Ganglia, sFlow, Graphite ● Configuration Management: Puppet
  • 6. Amazon Cloud Environment ● We use EC2, SDB, SQS, EMR, S3, etc. ● We don’t use ELB ● We heavily use EC2 Tags ec2-describe-instances -F tag:hostname=dev-build01
  • 8. Automated Monitoring Process of event when starting a new host and add it to our monitoring: 1. We start a new instance using Cerveza and Cloud-init 2. Puppet configure Gmond or Host sFlow on the instance 3. Our monitoring server running Gmond and Gmetad get data from the new instance 4. A Nagios check run every minute and check for new hosts ○ ○ ○ 5. Look for new hosts using EC2 API Look for EC2 tag “hostname” to confirm it’s a legit host, not a zombie / fail start Look for EC2 tag “nagios_host” to see if the host belong to this monitoring instance If a new host is found: ○ ○ We build a config for the host based on a template file and doing some string replace Once all config have been generated, we rebuild pre-cache objects and reload Nagios 6. If we find “Zombie” host, we generate a Warning alert 7. If the config is corrupt, we send a Critical alert
  • 9. Efficient Monitoring We reduce noise by disabling most notifications and using our "cluster check"
  • 12. Efficient on-call rotation ● Follow the sun ○ OPS team is in Ukraine, no more Tier 1 night on-call for US Team ● Timeperiod and escalation are a pain to maintain ○ Nagios notification plugged to Google Calendar ■ ■ ■ ■ Using our own notification script for email and paging Google Calendar make it easy for each team to manage their own on-call calendar Support for multiple Tier and complex schedules Caching Google Calendar info locally every hour ○ Simpler definitions and rules in Nagios contacts ○ Notify only people on-call, unless they asked for “off call” emails
  • 14. Efficient on-call rotation ● ● ● ● Simple contact definitions Google Calendar info Tier Filter (Regex) Tier Interval (time to wait before escalating alert since last tier) ● Off call email
  • 15. Efficient on-call rotation ● Centralized dashboard! ○ Plugged to Google Apps ○ List all on-call contacts
  • 16. Efficient on-call rotation ● Centralized View of Multiple Nagios!
  • 19. Business Awareness! Now that I know what is breaking… Which one should I fix first?
  • 21. Business Awareness! ● A service represent a Business Critical function ● A service can be global or limited to a region ● A service is defined by multiple Service Component ○ We use Nagios Event Handler to update services component status ○ REST API dashboard allow easy update from third party monitoring, QA test, scripts, crons, Nagios ● We can define service SLA and quickly see SLA breaking (based on OLA) ● OPS Team can perform more actions, post comments, link to Jira tickets
  • 29. Business Awareness! How do we make sure we answer the business needs? ● Review SLAs and monitoring configurations monthly/quarterly ● Have a checklist when launching new product or features ○ We now have a SRE Hand-off Checklist with detailed questions ○ How is capacity planning done? ○ Who can have an impact on traffic, storage, etc? ○ Make sure to ask about expected OLAs or SLAs
  • 30. To summarize... ● We easily automate our deployment across multiple geos ● We control the noise ● We easily schedule our on-call rotations ● In one location we know everything ● We know what is impacting the business and how we should prioritize our actions
  • 31. Team Process - Daily Kanban! Request based on Dashboards, Monitoring, Paging or Engineers Ticket categorized in two Swimlane: ● Production Support ○ High Priority: Top to Bottom ○ On-call 24/7, OLAs, SLAs... ○ Incident are handled 1st ○ Maintenance are handled 2nd ● Developer Support ○ Best Effort: Top to Bottom ○ Long effort request moved to Infrastructure pipeline
  • 32. Team Process - Long Term Agile! Request moved from OPS to INF SRE Hand-off Checklist used to define Epic, Story and Tasks Check Plan Do Act
  • 33. OPS @ TubeMogul All this wouldn’t be possible without a strong SRE Operation Engineering team: Aleksey Mykhailov Marylene Tanfin Mykola Mogylenko Nicolas Brousse Pierre Gohon Pierre Grandin Stan Rudenko
  • 34. We are Hiring ! http://www.tubemogul.com/jobs