Monitoring and Tuning your Chef Server
Andrew DuFour and Nathan Cerny
Andrew DuFour
adufour@chef.io
Success Engineer
Chef Software
@andrewdufour
Nathan Cerny
ncerny@chef.io
Success Team Manager
Chef Software
@ndcerny
The Art of Monitoring
“There is no instance of a nation benefitting
from prolonged warfare.”
― Sun Tzu, The Art of War
Problem Statement
To make effective decisions and to
effectively respond to incidents, we must
have visibility into our systems.
Start Small
Simplicity > Perfection
“Everything should be made as simple as
possible. But not simpler.”
― Albert Einstein
Continuous Improvement
kaizen
改善
Alert Fatigue
Monitor Everything
Monitoring just your Chef Server is low
value.
The Science of Monitoring
“He who wishes to fight must first count the
cost”
― Sun Tzu, The Art of War
What should you monitor?
Supporting Services
RabbitMQ
Solr
PostgreSQL
Application Services
Bifrost
Application Logs
Tools 101
• StatsD – A network daemon that runs on the Node.js platform and listens for
statistics, like counters and timers.
https://github.com/etsy/statsd
• Grafana - Beautiful dashboards
• TICK Stack – A series of tools that comprise the ‘Influx Data Platform’, including
an easily scalable time series database.
https://influxdata.com/time-series-platform/
• Sensu - Monitoring that doesn't suck.
https://sensuapp.org/
• Splunk – centralized logging, operational intelligence, big machine data tool
http://www.splunk.com/
Instrumenting our Erlang Based Services
Bifrost
Instrumenting our Erlang Based Services - StatsHero
• Example metric emitted in StatsD format:
test_hero.upstreamRequests.rdbms:1200|h
• Format: <namespace>.<category>.<metric>:<measurement>|<metric type> (h = histogram)
• Enabling StatsHero in your chef-server.rb:
estatsd['enabled'] = true
estatsd['protocol'] = 'statsd'
estatsd['vip'] = '<statsd server>'
estatsd['port'] = '<statsd port>'
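The example metric above splits into namespace, category, metric, measurement and metric type. The following is a minimal sketch of that split. `StatsLine` and `parse_stats_line` are hypothetical names, not part of StatsHero, and the sketch assumes an integer measurement and exactly three dot-separated name parts.

```ruby
# One parsed StatsD-format line:
#   <namespace>.<category>.<metric>:<measurement>|<type>
StatsLine = Struct.new(:namespace, :category, :metric, :measurement, :type)

# Split a single line into its five fields.
# Assumes an integer measurement, as in the example from the slide.
def parse_stats_line(line)
  path, rest = line.strip.split(':', 2)
  value, type = rest.split('|', 2)
  namespace, category, metric = path.split('.', 3)
  StatsLine.new(namespace, category, metric, Integer(value), type)
end

parsed = parse_stats_line('test_hero.upstreamRequests.rdbms:1200|h')
puts parsed.to_h
```

Here `test_hero` is the namespace, `upstreamRequests` the category, `rdbms` the metric, `1200` the measurement and `h` the histogram type.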
Instrumenting our Erlang Based Services
Bifrost
Graphite
Instrumenting our Erlang Based Services - Folsom Metrics
• Example metrics:
pooler.chef_depsolver.in_use_count
pooler.chef_depsolver.free_count
pooler.sqerl.in_use_count
pooler.sqerl.free_count
• Enabling folsom metrics in your chef-server.rb
folsom_graphite['enabled'] = true
folsom_graphite['host'] = '<your graphite host>'
folsom_graphite['port'] = '<your graphite port>'
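One way to use the pooler gauges is to turn each in_use/free pair into a utilization figure for a dashboard or alert. This is a sketch only: `pool_utilization` is a hypothetical helper, the sample values are made up, and it assumes that in_use_count plus free_count equals the current pool size.

```ruby
# Gauge names follow the pooler.<pool>.in_use_count / free_count pattern
# from the slide. The numbers in `sample` are invented for illustration.
def pool_utilization(gauges, pool)
  in_use = gauges.fetch("pooler.#{pool}.in_use_count")
  free   = gauges.fetch("pooler.#{pool}.free_count")
  total  = in_use + free
  return 0.0 if total.zero?

  in_use.to_f / total
end

sample = {
  'pooler.sqerl.in_use_count' => 15,
  'pooler.sqerl.free_count' => 5,
  'pooler.chef_depsolver.in_use_count' => 1,
  'pooler.chef_depsolver.free_count' => 4
}

%w[sqerl chef_depsolver].each do |pool|
  puts format('%-15s %.0f%% in use', pool, pool_utilization(sample, pool) * 100)
end
```

A pool that stays close to fully in use is a candidate for the pool size and queue settings covered in the tuning slides.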
Instrumenting our Erlang Based Services – Collecting Logs
• Use a full featured log collector like Splunk to centralize logs.
• All of our services log into a common directory structure:
/var/log/opscode/<service name>
• The two most important files within that directory are:
current
error
• There are also request logs which repeat information available elsewhere
• All services shipped with the omnibus package, not just Erlang services, log
here
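The directory layout above makes it easy to locate the key files for any service. The following sketch assumes the /var/log/opscode/<service name> layout and the `current` and `error` file names from the slide. The helper names are hypothetical, and `opscode-erchef` is used only as an example service name.

```ruby
# Build the paths of the two key log files for one service.
def service_log_paths(service, root: '/var/log/opscode')
  %w[current error].map { |file| File.join(root, service, file) }
end

# Return the lines of a log file that match a pattern.
# A missing file yields an empty array instead of raising.
def matching_lines(path, pattern)
  return [] unless File.file?(path)

  File.foreach(path).select { |line| line =~ pattern }
end

puts service_log_paths('opscode-erchef')
```

A log collector does this continuously and centrally. The sketch is only useful for a quick look on a single server.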
Tuning
Client Side Tuning
USE THE SPLAY, LUKE!
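To see why splay matters, here is a small simulation rather than real chef-client code. It assumes every node shares the same run interval and adds a random delay of up to `splay` seconds. The function names, node count and values are illustrative.

```ruby
# Start time (in seconds) of the next run for each node.
# Each node adds a random delay between 0 and `splay` to the shared interval.
def start_times(node_count, interval:, splay:, rng: Random.new(42))
  Array.new(node_count) { interval + rng.rand(0..splay) }
end

# Largest number of nodes that start in the same second.
def peak_concurrent_starts(times)
  times.group_by { |t| t }.values.map(&:size).max
end

no_splay   = start_times(1000, interval: 1800, splay: 0)
with_splay = start_times(1000, interval: 1800, splay: 300)

puts "peak starts per second without splay: #{peak_concurrent_starts(no_splay)}"
puts "peak starts per second with splay:    #{peak_concurrent_starts(with_splay)}"
```

Without splay, all 1000 simulated nodes hit the server in the same second. With splay, the same load is spread across the splay window.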
Sometimes Ohai tuning is needed
(e.g., Centrify)
ALWAYS USE PARTIAL SEARCH!
(and look at SafeSearch)
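The idea behind partial search is to return only the few attributes a recipe needs instead of whole node objects. The sketch below shows that reduction locally on a plain Hash. `pick_attributes` is a hypothetical helper and not Chef's implementation, and the node data is made up.

```ruby
# Reduce a full node object to a few named values.
# `wanted` maps an output name to the path of keys that leads to the value.
def pick_attributes(node, wanted)
  wanted.each_with_object({}) do |(name, path), result|
    result[name] = node.dig(*path)
  end
end

full_node = {
  'name' => 'web01',
  'ipaddress' => '10.0.0.5',
  'kernel' => { 'release' => '3.10.0', 'modules' => { 'ext4' => {}, 'xfs' => {} } },
  'packages' => { 'openssl' => { 'version' => '1.0.2' } }
}

wanted = {
  'name' => ['name'],
  'ip' => ['ipaddress'],
  'kernel_release' => %w[kernel release]
}

puts pick_attributes(full_node, wanted)
```

The payoff comes when the server does this filtering for every node in the result set, so far less data has to be serialized and sent to each client.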
Know what a dependency graph is
… and manage it.
Server Side Tuning
Almost Everything is Tunable
chef-server.rb
• https://docs.chef.io/config_rb_server.html
• https://docs.chef.io/config_rb_server_optional_settings.html
• https://github.com/chef/chef-server/blob/master/omnibus/files/private-chef-cookbooks/private-chef/attributes/default.rb
• How does chef-server.rb work?
The Chef server's reconfigure is driven by a cookbook called PrivateChef.
PrivateChef is a cookbook like any other, with some helper libraries that read your
chef-server.rb and make sense of it.
• Actually tuning a setting:
opscode_erchef['db_pool_size'] = "20"
A quick look at PrivateChef
As you can see, we're creating a new
module called PrivateChef.
The configuration attributes are
defined as new Mashes. When you write
opscode_erchef['key'] = value, you're
just assigning a value to the Mash
created in the PrivateChef module.
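The mechanism can be sketched in a few lines of plain Ruby. This is a simplified illustration, not the real PrivateChef code: it uses an ordinary Hash in place of a Mash, and the module, class and method names are invented for the example.

```ruby
module PrivateChefSketch
  # Services that get a settings hash. The real list is much longer.
  SERVICES = %i[opscode_erchef postgresql].freeze

  class Config
    def initialize
      @settings = SERVICES.each_with_object({}) { |name, all| all[name] = {} }
    end

    # One reader per service, so that opscode_erchef['key'] = value
    # inside the evaluated file lands in the matching hash.
    SERVICES.each do |name|
      define_method(name) { @settings[name] }
    end

    # Evaluate configuration text in the context of this object.
    def from_string(text)
      instance_eval(text)
      self
    end

    def to_h
      @settings
    end
  end
end

config = PrivateChefSketch::Config.new.from_string(
  "opscode_erchef['db_pool_size'] = 20\n"
)
puts config.to_h[:opscode_erchef]
```

Because the file is evaluated as Ruby, each line in chef-server.rb is an ordinary assignment into a per-service hash, which the reconfigure cookbook can then read.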
Looking at the Low Hanging Fruit
Bifrost
Erchef
Nginx
Enable cookbook cache
S3 URL Expiry
Bifrost
Db pooler timeout
Db pooler queue size
Authz
Db pool size
Authz
Initial Pool Count
Max Pool Count
Max Queue Size
Bifrost
Erchef
Nginx
Depsolver workers
Depsolver timeout
Authz
Db pooler timeout
Db pooler queue size
Db pool size
Keygen_cache_size
RabbitMQ
PostgreSQL
PostgreSQL
Checkpoint Segments
Checkpoint completion target
Log min duration statement
Solr
Heap size
New size
RabbitMQ
Analytics max length
Dark launch
Max connections
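When raising pool sizes, it helps to check them against the database connection limit. The sketch assumes that the "Max connections" item refers to PostgreSQL's connection limit and that every pooled connection holds one PostgreSQL connection. The helper name, the reserve and all numbers are illustrative, not recommended values.

```ruby
# Compare the connections all pools could open with the database limit.
# `reserve` leaves room for maintenance and monitoring connections.
def connection_budget(pool_sizes, frontends:, max_connections:, reserve: 10)
  needed = pool_sizes.values.reduce(0, :+) * frontends + reserve
  { needed: needed, max_connections: max_connections, ok: needed <= max_connections }
end

budget = connection_budget(
  { erchef: 20, bifrost: 20 },
  frontends: 4,
  max_connections: 350
)
puts budget
```

If `ok` comes back false, either the pools are too large for the database or the connection limit needs to grow with them.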
Helpful Links
Sensu:
• https://sensuapp.org/
• https://github.com/sensu-plugins/sensu-plugins-postgres
• https://github.com/sensu-plugins/sensu-plugins-rabbitmq
• https://github.com/sensu-plugins/sensu-plugins-solr
• https://github.com/sensu-plugins/sensu-plugins-nginx
• https://github.com/sensu-plugins/sensu-plugins-filesystem-checks
Statsd: https://github.com/etsy/statsd
InfluxDB: https://influxdata.com/
Splunk: http://www.splunk.com/
More Useful Tools
• PGBadger - https://github.com/dalibo/pgbadger
• Monitor Postgresql: https://wiki.postgresql.org/wiki/Monitoring
• How to Monitor Nginx: https://www.scalyr.com/community/guides/how-to-monitor-nginx-the-essential-guide
• Pgtune - http://pgfoundry.org/projects/pgtune
pgtune takes the wimpy default postgresql.conf and expands the database server to be as
powerful as the hardware it's being deployed on
Be careful about shared resources: pgtune assumes you have a dedicated Postgres server.
• GCViewer
Helps you analyze your GC activity, so you can make decisions on tuning.
http://www.tagtraum.com/gcviewer.html
Alternative Tools
• ELK: https://www.elastic.co/webinars/introduction-elk-stack
• Graylog: https://www.graylog.org/
• Loggly: https://www.loggly.com/
• Graphite: https://github.com/graphite-project/
• Datadog - https://www.datadoghq.com/
• So many more….
Special Thanks
• Irving Popovetsky and his tuning the chef server for scale blog:
http://irvingpop.github.io/blog/2015/04/20/tuning-the-chef-server-for-scale/
• Mark Harrison, Paul Mooring and the Chef server team. The dashboards are
heavily based on their dashboards for hosted Chef.
• Phil Dibowitz and Facebook for teaching Andrew a lot about tuning the Chef
server at a scale that almost none of our other customers hit.
Live Demo
• Link to GitHub: https://github.com/andy-dufour/chef-server-monitoring/
Monitoring and tuning your chef server - chef conf talk

More Related Content

What's hot

LAD - GroundBreakers - Jul 2019 - Using Oracle Autonomous Health Framework to...
LAD - GroundBreakers - Jul 2019 - Using Oracle Autonomous Health Framework to...LAD - GroundBreakers - Jul 2019 - Using Oracle Autonomous Health Framework to...
LAD - GroundBreakers - Jul 2019 - Using Oracle Autonomous Health Framework to...Sandesh Rao
 
BKK16-312 Integrating and controlling embedded devices in LAVA
BKK16-312 Integrating and controlling embedded devices in LAVABKK16-312 Integrating and controlling embedded devices in LAVA
BKK16-312 Integrating and controlling embedded devices in LAVALinaro
 
Linux interview questions and answers
Linux interview questions and answersLinux interview questions and answers
Linux interview questions and answersGanapathi Raju
 
Oracle RAC on Engineered Systems
Oracle RAC on Engineered SystemsOracle RAC on Engineered Systems
Oracle RAC on Engineered SystemsMarkus Michalewicz
 
Process and Threads in Linux - PPT
Process and Threads in Linux - PPTProcess and Threads in Linux - PPT
Process and Threads in Linux - PPTQUONTRASOLUTIONS
 
The Future of Archives is Participatory: A New Mission for Archives
The Future of Archives is Participatory: A New Mission for ArchivesThe Future of Archives is Participatory: A New Mission for Archives
The Future of Archives is Participatory: A New Mission for ArchivesKate Theimer
 
Oracle ACFS High Availability NFS Services (HANFS) Part-I
Oracle ACFS High Availability NFS Services (HANFS) Part-IOracle ACFS High Availability NFS Services (HANFS) Part-I
Oracle ACFS High Availability NFS Services (HANFS) Part-IAnju Garg
 
Oracle Database Introduction
Oracle Database IntroductionOracle Database Introduction
Oracle Database IntroductionChhom Karath
 
From DTrace to Linux
From DTrace to LinuxFrom DTrace to Linux
From DTrace to LinuxBrendan Gregg
 
Velocity 2015 linux perf tools
Velocity 2015 linux perf toolsVelocity 2015 linux perf tools
Velocity 2015 linux perf toolsBrendan Gregg
 
Basic oracle-database-administration
Basic oracle-database-administrationBasic oracle-database-administration
Basic oracle-database-administrationsreehari orienit
 
Windows Registry Forensics - Artifacts
Windows Registry Forensics - Artifacts Windows Registry Forensics - Artifacts
Windows Registry Forensics - Artifacts MD SAQUIB KHAN
 
Open Source Software Presentation
Open Source Software PresentationOpen Source Software Presentation
Open Source Software PresentationHenry Briggs
 
Codetainer: a Docker-based browser code 'sandbox'
Codetainer: a Docker-based browser code 'sandbox'Codetainer: a Docker-based browser code 'sandbox'
Codetainer: a Docker-based browser code 'sandbox'Jen Andre
 
Using open source software to build an industrial grade embedded linux platfo...
Using open source software to build an industrial grade embedded linux platfo...Using open source software to build an industrial grade embedded linux platfo...
Using open source software to build an industrial grade embedded linux platfo...SZ Lin
 
Les16[1]Declaring Variables
Les16[1]Declaring VariablesLes16[1]Declaring Variables
Les16[1]Declaring Variablessiavosh kaviani
 

What's hot (20)

LAD - GroundBreakers - Jul 2019 - Using Oracle Autonomous Health Framework to...
LAD - GroundBreakers - Jul 2019 - Using Oracle Autonomous Health Framework to...LAD - GroundBreakers - Jul 2019 - Using Oracle Autonomous Health Framework to...
LAD - GroundBreakers - Jul 2019 - Using Oracle Autonomous Health Framework to...
 
BKK16-312 Integrating and controlling embedded devices in LAVA
BKK16-312 Integrating and controlling embedded devices in LAVABKK16-312 Integrating and controlling embedded devices in LAVA
BKK16-312 Integrating and controlling embedded devices in LAVA
 
Linux interview questions and answers
Linux interview questions and answersLinux interview questions and answers
Linux interview questions and answers
 
Oracle RAC on Engineered Systems
Oracle RAC on Engineered SystemsOracle RAC on Engineered Systems
Oracle RAC on Engineered Systems
 
Alfresco Certificates
Alfresco Certificates Alfresco Certificates
Alfresco Certificates
 
Process and Threads in Linux - PPT
Process and Threads in Linux - PPTProcess and Threads in Linux - PPT
Process and Threads in Linux - PPT
 
The Future of Archives is Participatory: A New Mission for Archives
The Future of Archives is Participatory: A New Mission for ArchivesThe Future of Archives is Participatory: A New Mission for Archives
The Future of Archives is Participatory: A New Mission for Archives
 
Oracle ACFS High Availability NFS Services (HANFS) Part-I
Oracle ACFS High Availability NFS Services (HANFS) Part-IOracle ACFS High Availability NFS Services (HANFS) Part-I
Oracle ACFS High Availability NFS Services (HANFS) Part-I
 
BSOD Presentation
BSOD PresentationBSOD Presentation
BSOD Presentation
 
Oracle Database Introduction
Oracle Database IntroductionOracle Database Introduction
Oracle Database Introduction
 
Video Drivers
Video DriversVideo Drivers
Video Drivers
 
From DTrace to Linux
From DTrace to LinuxFrom DTrace to Linux
From DTrace to Linux
 
Velocity 2015 linux perf tools
Velocity 2015 linux perf toolsVelocity 2015 linux perf tools
Velocity 2015 linux perf tools
 
Introduction to Datastore
Introduction to DatastoreIntroduction to Datastore
Introduction to Datastore
 
Basic oracle-database-administration
Basic oracle-database-administrationBasic oracle-database-administration
Basic oracle-database-administration
 
Windows Registry Forensics - Artifacts
Windows Registry Forensics - Artifacts Windows Registry Forensics - Artifacts
Windows Registry Forensics - Artifacts
 
Open Source Software Presentation
Open Source Software PresentationOpen Source Software Presentation
Open Source Software Presentation
 
Codetainer: a Docker-based browser code 'sandbox'
Codetainer: a Docker-based browser code 'sandbox'Codetainer: a Docker-based browser code 'sandbox'
Codetainer: a Docker-based browser code 'sandbox'
 
Using open source software to build an industrial grade embedded linux platfo...
Using open source software to build an industrial grade embedded linux platfo...Using open source software to build an industrial grade embedded linux platfo...
Using open source software to build an industrial grade embedded linux platfo...
 
Les16[1]Declaring Variables
Les16[1]Declaring VariablesLes16[1]Declaring Variables
Les16[1]Declaring Variables
 

Viewers also liked

NSM (Network Security Monitoring) - Tecland Chapeco
NSM (Network Security Monitoring) - Tecland ChapecoNSM (Network Security Monitoring) - Tecland Chapeco
NSM (Network Security Monitoring) - Tecland ChapecoRodrigo Montoro
 
Security For Humans
Security For HumansSecurity For Humans
Security For Humansconjur_inc
 
Expect the unexpected: Anticipate and prepare for failures in microservices b...
Expect the unexpected: Anticipate and prepare for failures in microservices b...Expect the unexpected: Anticipate and prepare for failures in microservices b...
Expect the unexpected: Anticipate and prepare for failures in microservices b...Bhakti Mehta
 
Een Gezond Gebit2
Een Gezond Gebit2Een Gezond Gebit2
Een Gezond Gebit2guest031320
 
AWS re:Invent 2016: Deploying and Managing .NET Pipelines and Microsoft Workl...
AWS re:Invent 2016: Deploying and Managing .NET Pipelines and Microsoft Workl...AWS re:Invent 2016: Deploying and Managing .NET Pipelines and Microsoft Workl...
AWS re:Invent 2016: Deploying and Managing .NET Pipelines and Microsoft Workl...Amazon Web Services
 
Neuigkeiten von DEPAROM & Co
Neuigkeiten von DEPAROM & CoNeuigkeiten von DEPAROM & Co
Neuigkeiten von DEPAROM & CoArne Krueger
 
(SEC313) Security & Compliance at the Petabyte Scale
(SEC313) Security & Compliance at the Petabyte Scale(SEC313) Security & Compliance at the Petabyte Scale
(SEC313) Security & Compliance at the Petabyte ScaleAmazon Web Services
 
Reversing malware analysis training part3 windows pefile formatbasics
Reversing malware analysis training part3 windows pefile formatbasicsReversing malware analysis training part3 windows pefile formatbasics
Reversing malware analysis training part3 windows pefile formatbasicsCysinfo Cyber Security Community
 
Persistence in the cloud with bosh
Persistence in the cloud with boshPersistence in the cloud with bosh
Persistence in the cloud with boshm_richardson
 
Setting up a Digital Business on Cloud
Setting up a Digital Business on CloudSetting up a Digital Business on Cloud
Setting up a Digital Business on CloudAmazon Web Services
 
API Management - Practical Enterprise Implementation Experience
API Management - Practical Enterprise Implementation ExperienceAPI Management - Practical Enterprise Implementation Experience
API Management - Practical Enterprise Implementation ExperienceCapgemini
 
Writing New Relic Plugins: NSQ
Writing New Relic Plugins: NSQWriting New Relic Plugins: NSQ
Writing New Relic Plugins: NSQlxfontes
 
The Lost Tales of Platform Design (February 2017)
The Lost Tales of Platform Design (February 2017)The Lost Tales of Platform Design (February 2017)
The Lost Tales of Platform Design (February 2017)Julien SIMON
 
Hunting powerpoint
Hunting powerpointHunting powerpoint
Hunting powerpointKJRoss9
 
Mobile and Serverless : an Untold Story
Mobile and Serverless : an Untold StoryMobile and Serverless : an Untold Story
Mobile and Serverless : an Untold StoryVidyasagar Machupalli
 

Viewers also liked (20)

NSM (Network Security Monitoring) - Tecland Chapeco
NSM (Network Security Monitoring) - Tecland ChapecoNSM (Network Security Monitoring) - Tecland Chapeco
NSM (Network Security Monitoring) - Tecland Chapeco
 
AWS + Puppet = Dynamic Scale
AWS + Puppet = Dynamic ScaleAWS + Puppet = Dynamic Scale
AWS + Puppet = Dynamic Scale
 
Introduction to smpc
Introduction to smpc Introduction to smpc
Introduction to smpc
 
Security For Humans
Security For HumansSecurity For Humans
Security For Humans
 
Expect the unexpected: Anticipate and prepare for failures in microservices b...
Expect the unexpected: Anticipate and prepare for failures in microservices b...Expect the unexpected: Anticipate and prepare for failures in microservices b...
Expect the unexpected: Anticipate and prepare for failures in microservices b...
 
Een Gezond Gebit2
Een Gezond Gebit2Een Gezond Gebit2
Een Gezond Gebit2
 
AWS re:Invent 2016: Deploying and Managing .NET Pipelines and Microsoft Workl...
AWS re:Invent 2016: Deploying and Managing .NET Pipelines and Microsoft Workl...AWS re:Invent 2016: Deploying and Managing .NET Pipelines and Microsoft Workl...
AWS re:Invent 2016: Deploying and Managing .NET Pipelines and Microsoft Workl...
 
Neuigkeiten von DEPAROM & Co
Neuigkeiten von DEPAROM & CoNeuigkeiten von DEPAROM & Co
Neuigkeiten von DEPAROM & Co
 
(SEC313) Security & Compliance at the Petabyte Scale
(SEC313) Security & Compliance at the Petabyte Scale(SEC313) Security & Compliance at the Petabyte Scale
(SEC313) Security & Compliance at the Petabyte Scale
 
Reversing malware analysis training part3 windows pefile formatbasics
Reversing malware analysis training part3 windows pefile formatbasicsReversing malware analysis training part3 windows pefile formatbasics
Reversing malware analysis training part3 windows pefile formatbasics
 
Persistence in the cloud with bosh
Persistence in the cloud with boshPersistence in the cloud with bosh
Persistence in the cloud with bosh
 
You know, for search
You know, for searchYou know, for search
You know, for search
 
Analyze, Influence and Engage Your Customer - v1.7
Analyze, Influence and Engage Your Customer - v1.7Analyze, Influence and Engage Your Customer - v1.7
Analyze, Influence and Engage Your Customer - v1.7
 
Setting up a Digital Business on Cloud
Setting up a Digital Business on CloudSetting up a Digital Business on Cloud
Setting up a Digital Business on Cloud
 
API Management - Practical Enterprise Implementation Experience
API Management - Practical Enterprise Implementation ExperienceAPI Management - Practical Enterprise Implementation Experience
API Management - Practical Enterprise Implementation Experience
 
Writing New Relic Plugins: NSQ
Writing New Relic Plugins: NSQWriting New Relic Plugins: NSQ
Writing New Relic Plugins: NSQ
 
The Lost Tales of Platform Design (February 2017)
The Lost Tales of Platform Design (February 2017)The Lost Tales of Platform Design (February 2017)
The Lost Tales of Platform Design (February 2017)
 
Hunting powerpoint
Hunting powerpointHunting powerpoint
Hunting powerpoint
 
Heelal
HeelalHeelal
Heelal
 
Mobile and Serverless : an Untold Story
Mobile and Serverless : an Untold StoryMobile and Serverless : an Untold Story
Mobile and Serverless : an Untold Story
 

Similar to Monitoring and tuning your chef server - chef conf talk

(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...
(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...
(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...Amazon Web Services
 
Treasure Data Summer Internship 2016
Treasure Data Summer Internship 2016Treasure Data Summer Internship 2016
Treasure Data Summer Internship 2016Yuta Iwama
 
Monitorama 2015 Netflix Instance Analysis
Monitorama 2015 Netflix Instance AnalysisMonitorama 2015 Netflix Instance Analysis
Monitorama 2015 Netflix Instance AnalysisBrendan Gregg
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsDatabricks
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Landon Robinson
 
Splunk: Forward me the REST of those shells
Splunk: Forward me the REST of those shellsSplunk: Forward me the REST of those shells
Splunk: Forward me the REST of those shellsAnthony D Hendricks
 
Linux Server Deep Dives (DrupalCon Amsterdam)
Linux Server Deep Dives (DrupalCon Amsterdam)Linux Server Deep Dives (DrupalCon Amsterdam)
Linux Server Deep Dives (DrupalCon Amsterdam)Amin Astaneh
 
16aug06.ppt
16aug06.ppt16aug06.ppt
16aug06.pptzagreb2
 
Spinnaker Summit 2018: CI/CD Patterns for Kubernetes with Spinnaker
Spinnaker Summit 2018: CI/CD Patterns for Kubernetes with SpinnakerSpinnaker Summit 2018: CI/CD Patterns for Kubernetes with Spinnaker
Spinnaker Summit 2018: CI/CD Patterns for Kubernetes with SpinnakerAndrew Phillips
 
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & DataductAmazon Web Services
 
Capacity Planning
Capacity PlanningCapacity Planning
Capacity PlanningMongoDB
 
Amazon RDS for MySQL – Diagnostics, Security, and Data Migration (DAT302) | A...
Amazon RDS for MySQL – Diagnostics, Security, and Data Migration (DAT302) | A...Amazon RDS for MySQL – Diagnostics, Security, and Data Migration (DAT302) | A...
Amazon RDS for MySQL – Diagnostics, Security, and Data Migration (DAT302) | A...Amazon Web Services
 
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...GetInData
 
DockerCon Europe 2018 Monitoring & Logging Workshop
DockerCon Europe 2018 Monitoring & Logging WorkshopDockerCon Europe 2018 Monitoring & Logging Workshop
DockerCon Europe 2018 Monitoring & Logging WorkshopBrian Christner
 
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly SolarWinds Loggly
 
PyCon AU 2012 - Debugging Live Python Web Applications
PyCon AU 2012 - Debugging Live Python Web ApplicationsPyCon AU 2012 - Debugging Live Python Web Applications
PyCon AU 2012 - Debugging Live Python Web ApplicationsGraham Dumpleton
 
Writing and deploying serverless python applications
Writing and deploying serverless python applicationsWriting and deploying serverless python applications
Writing and deploying serverless python applicationsCesar Cardenas Desales
 
Lotuscript for large systems
Lotuscript for large systemsLotuscript for large systems
Lotuscript for large systemsBill Buchan
 
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...Landon Robinson
 

Similar to Monitoring and tuning your chef server - chef conf talk (20)

(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...
(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...
(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...
 
Treasure Data Summer Internship 2016
Treasure Data Summer Internship 2016Treasure Data Summer Internship 2016
Treasure Data Summer Internship 2016
 
Monitorama 2015 Netflix Instance Analysis
Monitorama 2015 Netflix Instance AnalysisMonitorama 2015 Netflix Instance Analysis
Monitorama 2015 Netflix Instance Analysis
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
 
Splunk: Forward me the REST of those shells
Splunk: Forward me the REST of those shellsSplunk: Forward me the REST of those shells
Splunk: Forward me the REST of those shells
 
Linux Server Deep Dives (DrupalCon Amsterdam)
Linux Server Deep Dives (DrupalCon Amsterdam)Linux Server Deep Dives (DrupalCon Amsterdam)
Linux Server Deep Dives (DrupalCon Amsterdam)
 
16aug06.ppt
16aug06.ppt16aug06.ppt
16aug06.ppt
 
Spinnaker Summit 2018: CI/CD Patterns for Kubernetes with Spinnaker
Spinnaker Summit 2018: CI/CD Patterns for Kubernetes with SpinnakerSpinnaker Summit 2018: CI/CD Patterns for Kubernetes with Spinnaker
Spinnaker Summit 2018: CI/CD Patterns for Kubernetes with Spinnaker
 
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
 
Capacity Planning
Capacity PlanningCapacity Planning
Capacity Planning
 
Amazon RDS for MySQL – Diagnostics, Security, and Data Migration (DAT302) | A...
Amazon RDS for MySQL – Diagnostics, Security, and Data Migration (DAT302) | A...Amazon RDS for MySQL – Diagnostics, Security, and Data Migration (DAT302) | A...
Amazon RDS for MySQL – Diagnostics, Security, and Data Migration (DAT302) | A...
 
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
 
DockerCon Europe 2018 Monitoring & Logging Workshop
DockerCon Europe 2018 Monitoring & Logging WorkshopDockerCon Europe 2018 Monitoring & Logging Workshop
DockerCon Europe 2018 Monitoring & Logging Workshop
 
Logstash
LogstashLogstash
Logstash
 
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
 
PyCon AU 2012 - Debugging Live Python Web Applications
PyCon AU 2012 - Debugging Live Python Web ApplicationsPyCon AU 2012 - Debugging Live Python Web Applications
PyCon AU 2012 - Debugging Live Python Web Applications
 
Writing and deploying serverless python applications
Writing and deploying serverless python applicationsWriting and deploying serverless python applications
Writing and deploying serverless python applications
 
Lotuscript for large systems
Lotuscript for large systemsLotuscript for large systems
Lotuscript for large systems
 
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
 

Recently uploaded

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Monitoring and tuning your chef server - chef conf talk

  • 1. Monitoring and Tuning your Chef Server Andrew DuFour and Nathan Cerny
  • 2. Andrew DuFour adufour@chef.io Success Engineer Chef Software @andrewdufour Nathan Cerny ncerny@chef.io Success Team Manager Chef Software @ndcerny
  • 3. The Art of Monitoring “There is no instance of a nation benefitting from prolonged warfare.” ― Sun Tzu, The Art of War
  • 4. Problem Statement To make effective decisions and to effectively respond to incidents, we must have visibility into our systems.
  • 6. Simplicity > Perfection “Everything should be made as simple as possible. But not simpler.” ― Albert Einstein
  • 10. Monitor Everything Monitoring just your Chef Server is low value.
  • 11. The Science of Monitoring “who wishes to fight must first count the cost” ― Sun Tzu, The Art of War
  • 12. What should you monitor? Supporting Services: RabbitMQ, Solr, PostgreSQL. Application Services: Bifrost. Application Logs.
  • 13. Tools 101
      • StatsD: a network daemon that runs on the Node.js platform and listens for statistics, like counters and timers. https://github.com/etsy/statsd
      • Grafana: beautiful dashboards.
      • TICK Stack: a series of tools that comprise the ‘Influx Data Platform’, including an easily scalable time series database. https://influxdata.com/time-series-platform/
      • Sensu: monitoring that doesn't suck. https://sensuapp.org/
      • Splunk: centralized logging, operational intelligence, big machine data tool. http://www.splunk.com/
  • 14. Instrumenting our Erlang Based Services Bifrost
  • 15. Instrumenting our Erlang Based Services - StatsHero
      • Example metric emitted in StatsD format: test_hero.upstreamRequests.rdbms:1200|h
        Reading left to right: Namespace (test_hero), Category (upstreamRequests), Metric (rdbms), Measurement (1200), Metric Type (h = histogram).
      • Enabling StatsHero in your chef-server.rb:
        estatsd['enabled'] = true
        estatsd['protocol'] = 'statsd'
        estatsd['vip'] = '<statsd server>'
        estatsd['port'] = '<statsd port>'
  • 16. Instrumenting our Erlang Based Services Bifrost Graphite
  • 17. Instrumenting our Erlang Based Services - Folsom Metrics
      • Example metrics:
        pooler.chef_depsolver.in_use_count
        pooler.chef_depsolver.free_count
        pooler.sqerl.in_use_count
        pooler.sqerl.free_count
      • Enabling Folsom metrics in your chef-server.rb:
        folsom_graphite['enabled'] = true
        folsom_graphite['host'] = '<your graphite host>'
        folsom_graphite['port'] = '<your graphite port>'
  • 18. Instrumenting our Erlang Based Services Bifrost Graphite
  • 19. Instrumenting our Erlang Based Services – Collecting Logs
      • Use a full-featured log collector like Splunk to centralize logs.
      • All of our services log into a common directory structure: /var/log/opscode/<service name>
      • The two most important files within that directory are: current and error.
      • There are also request logs, which repeat information available elsewhere.
      • All services shipped with the omnibus package, not just Erlang services, log here.
  • 21. Client Side Tuning USE THE SPLAY, LUKE!
  • 22. Sometimes Ohai tuning is needed (e.g., Centrify). ALWAYS USE PARTIAL SEARCH! (And look at SafeSearch.) Know what a dependency graph is … and manage it.
  • 25. Chef-server.rb
      • https://docs.chef.io/config_rb_server.html
      • https://docs.chef.io/config_rb_server_optional_settings.html
      • https://github.com/chef/chef-server/blob/master/omnibus/files/private-chef-cookbooks/private-chef/attributes/default.rb
      • How does chef-server.rb work? The Chef server’s reconfigure is driven by a cookbook called PrivateChef. PrivateChef is a cookbook that’s just like any other, with some helper libraries to read your chef-server.rb and make sense of it.
      • Actually tuning a setting: opscode_erchef['db_pool_size'] = "20"
  • 26. A quick look at PrivateChef You can see, we’re creating a new Module called PrivateChef. The Configuration attributes are defined as new Mashes. When you say opscode_erchef[‘key’] = value, you’re truly just assigning a value to the Mash created in the PrivateChef module.
  • 27. Looking at the Low Hanging Fruit
  • 28. Bifrost Erchef Nginx Enable cookbook cache S3 URL Expiry Bifrost Db pooler timeout Db pooler queue size Authz Db pool size Authz Initial Pool Count Max Pool Count Max Queue Size
  • 29. Bifrost Erchef Nginx Depsolver workers Depsolver timeout Authz Db pooler timeout Db pooler queue size Db pool size Keygen_cache_size
  • 30. RabbitMQ PostgreSQL PostgreSQL Checkpoint Segments Checkpoint completion target Log min duration statement Solr Heap size New size RabbitMQ Analytics max length Dark launch Max connections
  • 31. Helpful Links
      • Sensu: https://sensuapp.org/
      • https://github.com/sensu-plugins/sensu-plugins-postgres
      • https://github.com/sensu-plugins/sensu-plugins-rabbitmq
      • https://github.com/sensu-plugins/sensu-plugins-solr
      • https://github.com/sensu-plugins/sensu-plugins-nginx
      • https://github.com/sensu-plugins/sensu-plugins-filesystem-checks
      • Statsd: https://github.com/etsy/statsd
      • InfluxDB: https://influxdata.com/
      • Splunk: http://www.splunk.com/
  • 32. More Useful Tools
      • PGBadger: https://github.com/dalibo/pgbadger
      • Monitor PostgreSQL: https://wiki.postgresql.org/wiki/Monitoring
      • How to Monitor Nginx: https://www.scalyr.com/community/guides/how-to-monitor-nginx-the-essential-guide
      • Pgtune: http://pgfoundry.org/projects/pgtune pgtune takes the wimpy default postgresql.conf and expands the database server to be as powerful as the hardware it's being deployed on. Be careful about shared resources: Pgtune assumes you have a dedicated Postgres server.
      • GCViewer: helps you analyze your GC activity, so you can make decisions on tuning. http://www.tagtraum.com/gcviewer.html
  • 33. Alternative Tools
      • ELK: https://www.elastic.co/webinars/introduction-elk-stack
      • Graylog: https://www.graylog.org/
      • Loggly: https://www.loggly.com/
      • Graphite: https://github.com/graphite-project/
      • Datadog: https://www.datadoghq.com/
      • So many more…
  • 34. Special Thanks • Irving Popovetsky and his tuning the chef server for scale blog: http://irvingpop.github.io/blog/2015/04/20/tuning-the-chef-server-for-scale/ • Mark Harrison, Paul Mooring and the Chef server team. The dashboards are heavily based on their dashboards for hosted Chef. • Phil Dibowitz and Facebook for teaching Andrew a lot about tuning the Chef server for scale that almost none of our other customers hit.
  • 35. Live Demo • Link to GitHub: https://github.com/andy-dufour/chef-server-monitoring/
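
Slide 15 labels the parts of a StatsD-style metric line. As a rough illustration of that layout, the Ruby sketch below splits the example line into those parts. The struct and method names are hypothetical, and the three-part key (namespace, category, metric) is assumed from the slide's labels rather than taken from StatsHero itself.

```ruby
# Split a StatsD-style line such as "test_hero.upstreamRequests.rdbms:1200|h"
# into the parts labelled on slide 15. StatsdSample and parse_statsd_line are
# hypothetical names used only for this illustration.
StatsdSample = Struct.new(:namespace, :category, :metric, :measurement, :type)

def parse_statsd_line(line)
  key, rest = line.strip.split(":", 2)
  raise ArgumentError, "missing ':' in #{line.inspect}" if rest.nil?

  value, type = rest.split("|", 2)
  raise ArgumentError, "missing '|' in #{line.inspect}" if type.nil?

  # Assumption: the key has the form namespace.category.metric.
  namespace, category, metric = key.split(".", 3)
  StatsdSample.new(namespace, category, metric, Float(value), type)
end

sample = parse_statsd_line("test_hero.upstreamRequests.rdbms:1200|h")
puts sample.to_h
```

A parser like this is mostly useful for spot-checking what a service emits before pointing it at a real StatsD server.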

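The "use the splay" advice on slide 21 can be illustrated with a small simulation. This is a simplified model, not chef-client's actual scheduling code: it assumes every client starts its next run after the interval plus a uniformly random delay of up to `splay` seconds (the `interval` and `splay` settings in client.rb). The method names and the fleet size are made up for the example.

```ruby
# Simplified model: each client starts its next run after `interval` seconds
# plus a uniform random delay in [0, splay). Method names are hypothetical.
def next_run_offsets(clients:, interval:, splay:, rng: Random.new(42))
  Array.new(clients) { interval + (splay > 0 ? rng.rand(splay) : 0) }
end

# Largest number of runs that start within any one bucket of `bucket` seconds.
def peak_per_bucket(offsets, bucket: 60)
  offsets.group_by { |t| t / bucket }.values.map(&:size).max
end

interval = 1800
no_splay   = next_run_offsets(clients: 1000, interval: interval, splay: 0)
full_splay = next_run_offsets(clients: 1000, interval: interval, splay: interval)

puts "peak runs per minute without splay: #{peak_per_bucket(no_splay)}"
puts "peak runs per minute with splay:    #{peak_per_bucket(full_splay)}"
```

Without splay, clients that converged together keep arriving together; with splay close to the interval, the same fleet is spread across the whole window, which is the load the Chef server actually sees.
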
Editor's Notes

  1. When you don’t have proper monitoring in place, you are constantly fighting a war against incidents and service interruptions. We believe that monitoring is an art, that is fed and nourished by science. We wanted to kick off today by talking about the art of monitoring. We’ll then get into the science and details of what you should be monitoring, and wrap up with a demo specially prepared for you by Andrew.
  2. But before we get going too fast, we want to define the problem we’re looking to solve. We need to be able to make effective decisions and to effectively respond to incidents. We believe that visibility into our systems is necessary to solve this problem, and monitoring provides that visibility. There are two types of monitoring. Reactive alerting: you are paged because some conditions were met (usually at 3 am for some reason). Business intelligence: data and metrics displayed in a consumable way that helps drive tuning, prioritize work, identify trends, and proactively prevent issues.
  3. Now that we know what problem we’re solving, what approach should we take? Let’s start small, and get moving quickly. What is the most important thing to know when we’re monitoring? Is the application up or down? Next we should build out the smallest useful monitoring profile: follow the 5 minute rule. What are the things you would check in the first 5 minutes of logging into the system to see if the application is healthy or unhealthy? Those are the things you should be monitoring for at first. The next level of importance is to get instrumentation in place to provide the business intelligence that we’ll need in the future. In short: the first rule should be a simple up/down rule. Build out the smallest possible monitoring profile based on real experience. Resist the urge to build out everything you can think of; follow the 5 minute rule.
  4. A very common pitfall is to attempt to build the perfect system. Spoiler alert: it doesn’t exist. There is a reason that alongside the DevOps movement, micro-services have become a fad: simple systems are easier to implement, less fault prone, and easier to reason about as a human. For these reasons, they tend to be much more stable. Especially in a monitoring system, stability is a good thing. So try to keep your monitoring rules as simple as possible while covering all of your important use-cases. The best way to do this is to start by asking yourselves the question “What is really important to our application and end-users?” Why would we write a monitor for network bandwidth, when our application is only latency-sensitive? Simple systems are easier to implement. Simple systems are less fault prone. In a monitoring system, stability is a good thing. Figure out what you care about, and start there. Is there a reason we should monitor bandwidth when our service is only latency sensitive?
  5. You don’t have a scale problem, until you do, but you probably don’t. Don’t over-architect your systems or monitors for problems you don’t yet have. Be aware of the real things that are causing issues in your application (through business-intelligence), and monitor for those things. You don’t have a scale problem until you have a scale problem.
  6. We firmly believe that continuous improvement is essential in almost every process. When you come across a real issue that you’re currently not monitoring for, add a monitor for it. The system doesn’t have to be perfect, it just has to be good enough. Once something has an alert, you should use the metrics from your business intelligence to prioritize resolution. This could be a newer version of the application, tuning the system, or some form of automation. In a perfect world, you would never see the same alert twice. However, the world is not perfect, and none of us have unlimited time. So use your monitoring tools to prioritize fixes in the way that gets you the most sleep. Continuously work to improve your systems: the more you invest back into your applications and infrastructure, the better your returns.
  7. Can I see a show of hands? How many of you get more than 20 emails in a day? Keep them up if you read and action each of those emails. Now what if you’re getting 50? 100? 200 emails a day? If no one is reading the alerts, is there still an issue? So if you see an alert that is firing frequently, it should probably be your top priority to resolve. If the alert is just spam, get rid of it. And remember: alert fatigue is real. Don’t drown in a sea of numbers!
  8. Monitor everything: why do we care that our Chef Server is up, when our application is down? Likewise, while Lean tells us to use the best tool for the job, it’s unlikely that your infrastructure and your applications are different enough to warrant different tools, or artisanal, custom designed tools. Avoid the temptation to write something special: use what’s already in place, or choose the thing that allows you to move most freely. DevOps isn’t just a movement about people, processes and technology. The motivations that are driving those things are about providing value to your business. It’s also a movement about metrics. Who cares if your Chef server is up if your eCommerce site is down and no one can buy your product? Having instrumentation and metrics for more apps than just your Chef server is essential. You should either reuse the monitoring stack you build for your Chef server to also monitor your applications, or use the monitoring tools you already have for your applications to monitor the Chef server. Say NO to artisanal hand crafting of application stacks. There are, perhaps, some cold hard truths on this slide. Hammer home no artisanal monitoring stacks, and monitor your other apps.
  9. Hardware/OS: CPU (user, system, idle, iowait, irq, steal, load average); memory (free, used, swap); disk (space, utilization). Centralized logging (Splunk, ELK) for syslog. You should be monitoring the applications we bundle into the chef-server omnibus packages: Postgres, Solr, RabbitMQ, Nginx. For the Chef Server itself, we’ll talk about instrumenting our Erlang services.
  10. StatsD: receives stats sent over UDP or TCP and sends aggregates to one or more pluggable backend services (e.g., Graphite). Grafana: aggregates multiple data sources into dashboards. TSDB: time-series data is nothing more than a sequence of data points, typically consisting of successive measurements made from the same source over a time interval. Put another way, if you were to plot your points on a graph, one of your axes would always be time. We actually started building the cookbook with Graphite, but switched to Influx because of its ease of use over Carbon and Graphite. Sensu: why do legacy monitoring tools suck in the cloud?
  11. There are some issues with Folsom (rant about histograms), but it will give you some useful statistics such as each of the pools instrumented by our Erlang pooler software.
  12. Instrument 3 things: StatsHero, Folsom Graphite, and logs.
  13. Using a central logging framework like an ELK stack or Splunk, you should collect your application logs in a central place. Logs are located under /var/log/opscode/. There are subdirectories for each service (e.g., RabbitMQ, PostgreSQL). You should at least collect the current and error logs for each service, from each node in your chef-server cluster. All logs on the Chef server are rotated frequently. By shipping logs you’re both making them easier to access and preserving them in the event of an incident that isn’t detected right away.
  14. Let’s find a common language to talk about Chef server load. Talking about number of nodes is almost useless when discussing Chef server scale. How often do your nodes converge? What’s their splay? Adam don't have a scale problem!
  15. Set your splay to almost the same duration as your interval for client runs. This gives you the maximum randomization of your runs. Look at how splay actually works…
  16. Add a couple words. If you’re dealing with an extremely high load system you should consider limiting the Ohai data you collect, and store only the Ohai data you need. Get it, little Ohai? Hah. I kill me. Especially at 2AM. Eliminate redundant and unnecessary search use; ALWAYS use partial search. Set a policy that only the last N (let’s say 5) versions of your cookbook will be kept on the Chef server. The rest can stay in git history if you really need them. Having 200 versions of your application cookbook on the Chef server when only two versions are ever in use is useless and complicates your dependency graph. Alternatively, ensure you use environments and environment cookbooks with tight dependency constraints.
  17. Don’t use DRBD. Look at our new HA model.
  18. Don’t turn into Homer Simpson - everything is tunable, stay focused on what matters.
  19. NGINX: The cookbook cache is important to keep load off your Bookshelf service. Your cookbooks are cached on disk on the front-end server instead of requiring an API call to Bookshelf. This is even more important if you’re storing your Bookshelf data in PGSQL. Extending the S3 URL expiry window delays when Erchef will need to fetch fresh cookbooks. Bifrost (also applies to Erchef): Starting in Chef server v12.2 we implemented bounded pools for our database connections and some of our HTTP connections. Prior to this we just kept opening connections until we simply couldn’t. In a high load environment it’s extremely important to take advantage of these bounded pools and their respective queues. For your Chef server we recommend 20-50 configured pool connections per service per front-end, and 1-2x that number available in queue slots. The Authz service is another bounded queue. When you increase your db pool size, it’s important that you also increase your authz pool size in order to minimize the overhead of spawning and killing authz processes.
  20. Depsolver workers are single threaded workers that determine your dependency graph. Our recommendation is to have 1 depsolver per CPU on your server if running in a tiered infrastructure, or number of CPUs-1 if running in a standalone infrastructure. The bounded DB queues have the same rules as Bifrost. Along with managing a pool of depsolvers, Erchef has another CPU intensive task, which is generating keys to be provided to Chef clients. If you run in an environment that is constantly registering Chef clients, or that has Chef clients register in waves (e.g., when a new application environment is launched), you may want to increase the number of keys that are pre-generated. Note that starting in chef-client 12 our default is to generate the keys on the client side, so this setting is becoming less important. Unless you are explicitly telling chef-client 12 to get keys from the server, or have a large fleet of Chef 11 clients, this setting may not need to be tuned anymore.
  21. PostgreSQL writes new transactions to the database in files called WAL segments that are 16MB in size. Every time checkpoint_segments worth of these files have been written (by default 3), a checkpoint occurs. Checkpoints can be resource intensive, and on a modern system doing one every 48MB (3 x 16MB) will be a serious performance bottleneck. Setting checkpoint_segments to a much larger value improves that. Unless you're running on a very small configuration, you'll almost certainly be better off setting this to at least 10, which also allows usefully increasing the completion target. We recommend setting checkpoint segments to somewhere in the 32 to 64 range unless you have a smaller back-end server. We recommend setting the completion target to 0.9, meaning that the writes for a checkpoint are spread out so that they complete by the time we are 90% of the way to the next checkpoint. Solr has two settings that we commonly tune: heap size and new size. Heap size commonly needs to be tuned because the logic in the PrivateChef cookbook limits us to 1GB of max heap. It’s common to need to push this to 4GB of total heap, and if you have 16GB of memory available on your back-end I’d recommend using 4GB of heap. Since we frequently write new objects into Solr, the second setting is new size. The JVM sets new size to be 1/16 of total memory by default, and sometimes this needs to be boosted. The maximum you should set your new size to with a 4GB heap is 512MB. Finally we have RabbitMQ. There really isn’t much to tune here. We recommend setting a maximum length for your analytics queue. If you’re not using analytics it may be worthwhile to explicitly disable your analytics queue
  22. Links to useful Sensu plugins
  23. Rename from helpful links to alternative technologies?
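
The rules of thumb in notes 19 to 21 can be collected into a small helper. The Ruby sketch below is only a starting-point calculator under stated assumptions: the hash keys are descriptive labels invented for this example, not chef-server.rb attribute names, so each value still has to be mapped to the matching setting for your Chef server version (see the attributes file linked on slide 25).

```ruby
# Turn the speaker-note rules of thumb into starting values.
# The keys are hypothetical labels, NOT real chef-server.rb attribute names.
def suggested_settings(cpus:, tiered:, db_pool_size: 20)
  raise ArgumentError, "cpus must be >= 1" if cpus < 1

  {
    # Note 20: one depsolver per CPU when tiered, CPUs - 1 when standalone.
    depsolver_workers: tiered ? cpus : [cpus - 1, 1].max,
    # Note 19: 20-50 pool connections per service per front-end ...
    db_pool_size: db_pool_size,
    # ... and 1-2x that number available as queue slots.
    db_queue_slots: (db_pool_size..db_pool_size * 2),
    # Note 21: 32 to 64 checkpoint segments, completion target of 0.9.
    checkpoint_segments: (32..64),
    checkpoint_completion_target: 0.9
  }
end

suggested_settings(cpus: 8, tiered: false).each do |name, value|
  puts "#{name}: #{value}"
end
```

These are starting points to validate against your own metrics, in keeping with the talk's advice to tune only what your monitoring shows is a problem.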