11. The Science of Monitoring
“He who wishes to fight must first count the cost.”
― Sun Tzu, The Art of War
12. What should you monitor?
Supporting Services
RabbitMQ
Solr
PostgreSQL
Application Services
Bifrost
Application Logs
13. Tools 101
• StatsD – A network daemon that runs on the Node.js platform and listens for
statistics, like counters and timers.
https://github.com/etsy/statsd
• Grafana - Beautiful dashboards
• TICK Stack – A series of tools that comprise the ‘Influx Data Platform’, including
an easily scalable time series database.
https://influxdata.com/time-series-platform/
• Sensu - Monitoring that doesn't suck.
https://sensuapp.org/
• Splunk – centralized logging, operational intelligence, and machine data analytics
http://www.splunk.com/
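• As a quick sketch of what feeding StatsD looks like from Ruby, here is a minimal example using the statsd-ruby gem as the client (any StatsD client works; the metric names and the localhost:8125 endpoint are placeholders for your own setup):

    require 'statsd'   # gem install statsd-ruby

    statsd = Statsd.new('localhost', 8125)       # host/port of your StatsD daemon

    statsd.increment('chef.api.requests')        # a counter: one more request seen
    statsd.timing('chef.api.request_ms', 320)    # a timer, in milliseconds
    statsd.gauge('chef.api.queue_depth', 12)     # a point-in-time gauge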
19. Instrumenting our Erlang Based Services – Collecting Logs
• Use a full featured log collector like Splunk to centralize logs.
• All of our services log into a common directory structure:
/var/log/opscode/<service name>
• The two most important files within that directory are:
current
error
• There are also request logs which repeat information available elsewhere
• All services shipped with the omnibus package, not just Erlang services, log
here
22. Sometimes Ohai tuning is needed (e.g., Centrify)
ALWAYS USE PARTIAL SEARCH!
(and look at SafeSearch)
Know what a dependency graph is
… and manage it.
25. Chef-server.rb
• https://docs.chef.io/config_rb_server.html
• https://docs.chef.io/config_rb_server_optional_settings.html
• https://github.com/chef/chef-server/blob/master/omnibus/files/private-chef-cookbooks/private-chef/attributes/default.rb
• How does chef-server.rb work?
The Chef server's reconfigure is driven by a cookbook called PrivateChef.
PrivateChef is a cookbook that's just like any other, with some helper libraries to read your chef-server.rb and make sense of it.
• Actually tuning a setting:
opscode_erchef['db_pool_size'] = "20"
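• A minimal chef-server.rb sketch of applying that override, assuming the default omnibus paths (the value simply mirrors the example above and is not a recommendation on its own):

    # /etc/opscode/chef-server.rb
    opscode_erchef['db_pool_size'] = "20"

    # settings only take effect after reconfiguring each affected node:
    #   chef-server-ctl reconfigure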
26. A quick look at PrivateChef
You can see we're creating a new module called PrivateChef.
The configuration attributes are defined as new Mashes, so when you write opscode_erchef['key'] = value, you're really just assigning a value to the Mash created in the PrivateChef module.
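• An illustrative sketch of that pattern, assuming a chef-server.rb at the default path (this is not the actual PrivateChef source; the real cookbook uses Mash rather than plain Hash and handles many more services):

    module PrivateChef
      extend self

      # one settings collection per tunable service
      def opscode_erchef
        @opscode_erchef ||= {}
      end

      # chef-server.rb is evaluated against this module, so a line like
      #   opscode_erchef['db_pool_size'] = "20"
      # simply assigns into the collection returned above
      def from_file(path)
        instance_eval(IO.read(path), path, 1)
      end
    end

    PrivateChef.from_file('/etc/opscode/chef-server.rb')
    p PrivateChef.opscode_erchef['db_pool_size']   # => "20"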
32. More Useful Tools
• PGBadger - https://github.com/dalibo/pgbadger
• Monitor Postgresql: https://wiki.postgresql.org/wiki/Monitoring
• How to Monitor Nginx: https://www.scalyr.com/community/guides/how-to-monitor-nginx-the-essential-guide
• Pgtune - http://pgfoundry.org/projects/pgtune
pgtune takes the wimpy default postgresql.conf and expands the database server to be as
powerful as the hardware it's being deployed on
Be careful about shared resources; pgtune assumes you have a dedicated PostgreSQL server.
• GCViewer
Helps you analyze your GC activity, so you can make decisions on tuning.
http://www.tagtraum.com/gcviewer.html
34. Special Thanks
• Irving Popovetsky and his “Tuning the Chef Server for Scale” blog post:
http://irvingpop.github.io/blog/2015/04/20/tuning-the-chef-server-for-scale/
• Mark Harrison, Paul Mooring and the Chef server team. The dashboards are
heavily based on their dashboards for hosted Chef.
• Phil Dibowitz and Facebook, for teaching Andrew a lot about tuning the Chef server at a scale that almost none of our other customers hit.
35. Live Demo
• Link to GitHub: https://github.com/andy-dufour/chef-server-monitoring/
Editor's Notes
When you don’t have proper monitoring in place, you are constantly fighting a war against incidents and service interruptions.
We believe that monitoring is an art that is fed and nourished by science. We wanted to kick off today by talking about the art of monitoring. We'll then get into the science and details of what you should be monitoring, and wrap up with a demo specially prepared for you by Andrew.
But before we get going too fast, we want to define what the problem is that we’re looking to solve. We need to be able to make effective decisions and to effectively respond to incidents. We believe that visibility into our systems is necessary to solve this problem. And monitoring provides that visibility.
2 types of monitoring
Reactive alerting – when you are paged out because some conditions were met (usually at 3 am for some reason)
Business Intelligence – display of data and metrics in a consumable way that helps drive tuning, prioritize work, identify trends, and proactively prevent issues.
Now that we know what the problem is that we’re solving, what approach should we take?
Let's start small, and get moving quickly. What is the most important thing to know when we're monitoring? Whether the application is up or down!
Next we should build out the smallest useful monitoring profile – follow the 5 minute rule. What are the things you would check in the first 5 minutes of logging into the system to see if the application is healthy or unhealthy? Those are the things you should be monitoring for at first.
Next level of importance is to get instrumentation in place to provide the business intelligence that we’ll need in the future.
First rule should be a simple up/down rule
Build out the smallest possible monitoring profile based on real experience
Resist the urge to build out everything you can think of – 5 minute rule.
A very common pitfall is to attempt to build the perfect system. Spoiler alert: it doesn’t exist.
There is a reason that alongside the DevOps movement, micro-services have become a fad – simple systems are easier to implement, less fault prone, and easier to reason about as a human. For these reasons, they tend to be much more stable.
Especially in a monitoring system, stability is a good thing.
So try to keep your monitoring rules as simple as possible while covering all of your important use-cases. The best way to do this, is to start by asking yourselves the question “What is really important to our application and end-users?” Why would we write a monitor for network bandwidth, when our application is only latency-sensitive?
Simple systems are easier to implement
Simple systems are less fault prone
In a monitoring system, stability is a good thing
Figure out what you care about, and start there. Is there a reason we should monitor bandwidth when our service is only latency sensitive?
You don't have a scale problem until you do, and you probably don't yet. Don't over-architect your systems or monitors for problems you don't yet have. Be aware of the real things that are causing issues in your application (through business intelligence), and monitor for those things.
You don’t have a scale problem until you have a scale problem.
We firmly believe that continuous improvement is essential in almost all processes that exist.
When you come across a real issue that you’re currently not monitoring for – add in the monitor for it. The system doesn’t have to be perfect, it just has to be good enough.
Once something has an alert, then you should use the metrics from your business intelligence to prioritize resolution. This could be a newer version of the application, tuning the system, or some form of automation. In a perfect world, you would never see the same alert twice. However, the world is not perfect, and none of us have unlimited time. So use your monitoring tools, to prioritize fixes in the way that gets you the most sleep.
Continuously work to improve your systems – the more you invest back into your applications and infrastructure, the better your returns.
Can I see a show of hands? How many of you get more than 20 emails in a day? Keep them up if you read and action each of those emails. Now what if you’re getting 50? 100? 200 emails a day? If no one is reading the alerts, is there still an issue?
So if you see an alert that is firing frequently – it should probably be your top priority to resolve. If the alert is just spam, get rid of it.
And remember: alert fatigue is real – don't drown in a sea of numbers!
Monitor everything – why do we care that our Chef Server is up, when our application is down?
Likewise, while Lean tells us to use the best tool for the job, it's unlikely that your infrastructure and your applications are different enough to warrant different tools, or artisanal, custom-designed tools. Avoid the temptation to write something special – use what's already in place, or choose the thing that allows you to move most freely.
DevOps isn’t just a movement about people, processes and technology. The motivations that are driving those things are about providing value to your business. It’s also a movement about metrics.
Who cares if your Chef server is up if your eCommerce site is down and no one can buy your product?
Having instrumentation and metrics for more apps than just your Chef server is essential.
You should either reuse the monitoring stack you build for your Chef server to also monitor your applications, or use the monitoring tools you already have for your applications to monitor the Chef server.
Say NO to artisanal hand crafting of application stacks.
There are – perhaps - some cold hard truths on this slide.
Hammer home: no artisanal monitoring stacks, and monitor your other apps.
Hardware/OS:
CPU – user, system, idle, iowait, irq, steal, load average
Memory – free, used, swap
Disk – space, utilization
Centralized logging (Splunk, ELK) for syslog
You should be monitoring the applications we bundle into the chef-server omnibus packages – Postgres, Solr, RabbitMQ, Nginx.
For the Chef server itself, we'll talk about instrumenting our Erlang services.
StatsD – stats are sent over UDP or TCP, and StatsD sends aggregates to one or more pluggable backend services (e.g., Graphite).
Grafana – aggregates multiple data sources into dashboards.
TSDB - Time-series data is nothing more than a sequence of data points, typically consisting of successive measurements made from the same source over a time interval. Put another way, if you were to plot your points on a graph, one of your axes would always be time.
- We actually started building the cookbook with Graphite, but switched to InfluxDB because it is easier to use than Carbon and Graphite.
Sensu – Why do legacy monitoring tools suck in the cloud?
There are some issues with Folsom (rant about histograms), but it will give you some useful statistics, such as stats for each of the pools instrumented by our Erlang pooler library.
Instrument three things: stats_hero, folsom_graphite, and logs.
Using a central logging framework like an ELK stack, or Splunk, you should collect your application logs in a central place.
Logs are located under /var/log/opscode/
There are subdirectories for each service (e.g., RabbitMQ, PostgreSQL)
You should at least collect the current and error logs for each service, from each node in your chef-server cluster.
All logs on the Chef server are rotated frequently; by shipping logs you're both making them easier to access and preserving them in the event of an incident that isn't detected right away.
Let’s find a common language to talk about Chef server load
Talking about number of nodes is almost useless when discussing Chef server scale.
How often do your nodes converge?
What’s their splay?
Adam doesn't have a scale problem!
Set your splay to almost the same duration as your interval for client runs. This allows for maximum randomization of your runs.
Look at how splay actually works…
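A minimal client.rb sketch of that interval/splay pairing (the numbers are only an example; pick values that match how often your nodes actually need to converge):

    # /etc/chef/client.rb
    interval 1800   # seconds between chef-client runs (30 minutes)
    splay    1500   # random 0-1500 second delay per run, so nodes don't converge in lockstep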
If you’re dealing with an extremely high load system you should consider limiting the Ohai data you collect and store only the ohai data you need. Get it, little Ohai? Hah. I kill me. Especially at 2AM.
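A hedged client.rb sketch of limiting Ohai collection; the plugin names are only examples of plugins you might never query, and the ohai.disabled_plugins form assumes a reasonably recent chef-client (older clients used Ohai::Config instead):

    # /etc/chef/client.rb
    ohai.disabled_plugins = [:Passwd, :Sessions]   # skip Ohai data you never use
    # older chef-clients: Ohai::Config[:disabled_plugins] = [:Passwd, :Sessions]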
Eliminate redundant and unnecessary search use; ALWAYS use partial search.
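For example, a recipe snippet that pulls back only the attributes it needs, using Chef 12's filter_result form (older clients got the equivalent from the partial_search cookbook; the role and attribute names below are hypothetical):

    # fetch just two attributes per node instead of entire node objects
    web_nodes = search(:node, 'role:web',
                       filter_result: { 'ip' => ['ipaddress'], 'name' => ['fqdn'] })
    web_nodes.each { |n| puts "#{n['name']}: #{n['ip']}" }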
Set a policy that only the last N (let's say 5) versions of your cookbook will be kept on the Chef server.
The rest can stay in git history if you really need them.
200 versions of your application cookbook on the Chef server when only two versions are ever in use is useless and complicates your dependency graph.
Alternatively, ensure you use environments and environment cookbooks with tight dependency constraints.
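A sketch of what those tight constraints can look like in an environment file (the cookbook names and versions here are made up):

    # environments/production.rb
    name 'production'
    description 'Production systems'
    cookbook 'my_app', '= 2.1.3'     # pin exactly what production runs
    cookbook 'base',   '~> 1.4.0'    # allow patch releases of 1.4 only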
Don’t use DRBD.
Look at our new HA model.
Don’t turn into Homer Simpson - everything is tunable, stay focused on what matters.
NGINX:
Cookbook cache is important to keep load off your Bookshelf service. Your cookbooks are cached on disk on the front-end server instead of requiring an API call to Bookshelf. This is even more important if you’re storing your Bookshelf data in PGSQL.
Extending the S3 URL expiry window delays when Erchef will need to fetch fresh cookbooks.
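A chef-server.rb sketch of those two knobs; the attribute names are taken from the optional-settings docs and should be verified against your Chef server release:

    opscode_erchef['nginx_bookshelf_caching']   = :on     # cache cookbook files on the front end
    opscode_erchef['s3_url_expiry_window_size'] = '100%'  # keep cached URLs valid for their full lifetime
    opscode_erchef['s3_url_ttl']                = 3600    # seconds a signed cookbook URL stays valid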
Bifrost (also applies to Erchef)
Starting in Chef server v12.2 we implemented bounded pools for our database connections and some of our http connections. Prior to this we just kept opening connections till we simply couldn’t.
In a high load environment it’s extremely important to take advantage of these bounded pools and their respective queues. Having 20-50 configured pool connections per service per front-end and 1-2x that available in queue slots is what we recommend for your Chef server.
The Authz service is another bounded queue; it's important when you increase your db pool size that you also increase your authz pool size in order to minimize the overhead of spawning/killing authz processes.
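A rough chef-server.rb sketch of that guidance for Chef server 12.2+; the queue-max attribute names in particular are my reading of the optional-settings docs, so verify them for your version before relying on them:

    opscode_erchef['db_pool_size']      = 40   # per front end, within the 20-50 guidance above
    opscode_erchef['db_pool_queue_max'] = 80   # roughly 1-2x the pool size
    oc_bifrost['db_pool_size']          = 40   # scale the authz (bifrost) pool alongside erchef
    oc_bifrost['db_pool_queue_max']     = 80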
Depsolver workers are single threaded workers that determine your dependency graph. Our recommendation is to have 1 depsolver per CPU on your server if running in a tiered infrastructure, or number of CPUs-1 if running in a standalone infrastructure.
The bounded DB queues have the same rules as bifrost.
Along with managing a pool of depsolvers, Erchef has another CPU-intensive task: generating keys to be provided to Chef clients. If you run in an environment that is constantly registering Chef clients, or that has Chef clients register in waves (e.g., when a new application environment is launched), you may want to increase the number of keys that are pre-generated. Note that starting in chef-client 12 our default is to generate the keys on the client side, so this setting is becoming less important. Unless you are explicitly telling chef-client 12 to get keys from the server, or have a large fleet of Chef 11 clients, this setting may not need to be tuned anymore.
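A chef-server.rb sketch of those two CPU-bound knobs; the keygen attribute name is an assumption from the optional-settings docs, and the numbers assume an 8-CPU standalone server:

    opscode_erchef['depsolver_worker_count'] = 7     # CPUs - 1 on a standalone server; CPUs if tiered
    opscode_erchef['keygen_cache_size']      = 100   # pre-generated client keys, mainly for Chef 11 fleets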
PostgreSQL writes new transactions to the database in files called WAL segments that are 16MB in size. Every time checkpoint_segments worth of these files have been written, by default 3, a checkpoint occurs. Checkpoints can be resource intensive, and on a modern system doing one every 48MB will be a serious performance bottleneck. Setting checkpoint_segments to a much larger value improves that. Unless you're running on a very small configuration, you'll almost certainly be better setting this to at least 10, which also allows usefully increasing the completion target.
We recommend setting checkpoint segments to at least 32 – 64 unless you have a smaller back-end server.
We recommend setting the completion target to 0.9, meaning that checkpoint writes should be spread out to finish by the time we are 90% of the way to the next checkpoint.
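In chef-server.rb terms, that guidance looks roughly like this (PostgreSQL 9.x era settings):

    postgresql['checkpoint_segments']          = 32    # 32-64 unless the back end is small
    postgresql['checkpoint_completion_target'] = 0.9   # spread checkpoint writes over 90% of the interval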
Solr has two settings that we commonly tune – Heap size and new size. Heap size commonly needs to be tuned because the logic in the PrivateChef Cookbook limits us to 1GB of max heap. It’s common to need to push this to 4GB of total heap, and if you have 16GB of memory available on your back-end I’d recommend using 4GB of heap. Since we frequently write new objects into Solr the second setting is new size. The JVM sets new size to be 1/16 of total memory by default, sometimes this needs to be boosted. The maximum you should set your new size to with a 4GB heap is 512MB.
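A chef-server.rb sketch of those Solr JVM settings; the sizes are in megabytes and assume a back end with 16GB of memory:

    opscode_solr4['heap_size'] = 4096   # 4GB max heap
    opscode_solr4['new_size']  = 512    # upper bound for new size with a 4GB heap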
Finally we have RabbitMQ. There really isn’t much to tune here.
We recommend setting a maximum length for your analytics queue.
If you're not using analytics, it may be worthwhile to explicitly disable your analytics queue.
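A hedged chef-server.rb sketch of both options; the analytics_max_length attribute name is from memory, so confirm it in the optional-settings docs before using it:

    rabbitmq['analytics_max_length'] = 10000   # cap the analytics queue so it can't grow without bound
    # if you are not running Chef Analytics at all, skip queueing actions entirely:
    dark_launch['actions'] = false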
Links to useful Sensu plugins