3. Configuration
● Minimise it
● What should live where?
● Defaults
max_mispellings = 100
● Intolerance
4. Logging
● Think about levels
● More useful if accessible
5. Monitoring
● be engaged
● think of useful metrics
● expose data if you need
● help to specify trigger levels
6. Access to the live system
you want to be able to:
● see the logs, in real time
● view the current config
● access the monitoring
7. Dying (with dignity)
● don't be shy about dying
● explain why
● use a restart mechanism
8. The End
● Thank you!
Anthony Kirby
anthony@anthony.org
Editor's Notes
*** 1 Introduction ***
- This talk is aimed at developers. I'm assuming that you're working in an environment with an infrastructure/SA team etc – but I think some of this applies if you're doing that role yourself.
- some of this is going to be obvious, but stuff often is in retrospect. I'm hoping there'll be something that you find unintuitive & makes you think
*** my background ***
Until recently I was designing/running on-line registry systems for a large DNS provider; I've changed role, so this is a good point to reflect on what has worked for me.
*** anecdote ***
Here's an anecdote – illustrating the motivation behind the talk.
An infrastructure/sysadmin team was doing some routine maintenance; they announced downtime, made changes, brought systems back online, checked everything, and went home. But (you've guessed it) some services had failed. You could say that the infrastructure team should have checked more carefully (and they should), but I suggest that the developers were also at fault, because they didn't think about one of their customers: the SA team. The devs couldn't have prevented the original misconfiguration, but they could have written their applications to detect what was wrong and flag it up. It would have been quick to fix, and downtime would have been minimal.
Packaging
*** yes, do this ***
- understand how to make packages for your target [talk in itself]
- even if you only have a single file, and it'll never change, it's still worth putting it in a package. You know it won't be overwritten by another file (the pkg manager would stop it), & you can ask the pkg manager what the source package is for any file
*** understand your target ***
Linux - Filesystem Hierarchy Standard; people will find things in places they expect, and you'll not be bitten when e.g. files in /tmp are cleared, or /var/run is cleared on reboot
*** minimise it ***
[murphy's law] - for a boolean option, there's a 50% chance of it being set wrong (and a disciple of Murphy would have it that it's more than 50%)
So only expose things that you want someone to fiddle with!
*** what should live where? ***
- "per server" (like SSL cert) - file on server, via SA config management (e.g. puppet)
- "property of complete system (with many servers/processes) like a feature that could be enabled, and you have an app DB" -> put it there
- if it'll only change when you build a new package -> fixed, in code (header file, java XML)
- if it'll need tweaking, by whom?
*** defaults ***
[pet hate]
- "max_mispellings=100# defaults to 1000"
*** intolerance ***
for critical apps, write your own fascist config parser
- the app refuses to start if there are any unexpected config entries, or any missing (sketch below)
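A minimal sketch of that kind of parser, assuming the config is a plain java.util.Properties file (the class name and keys are invented for illustration):

    import java.io.FileReader;
    import java.util.Properties;
    import java.util.Set;

    public class StrictConfig {
        // the complete set of keys we expect: nothing more, nothing less
        private static final Set<String> EXPECTED =
                Set.of("listen_port", "db_url", "max_mispellings");

        public static Properties load(String path) throws Exception {
            Properties props = new Properties();
            try (FileReader in = new FileReader(path)) {
                props.load(in);
            }
            for (String key : props.stringPropertyNames()) {
                if (!EXPECTED.contains(key)) {
                    throw new IllegalStateException("unexpected config entry: " + key);
                }
            }
            for (String key : EXPECTED) {
                if (!props.containsKey(key)) {
                    throw new IllegalStateException("missing config entry: " + key);
                }
            }
            return props;
        }
    }

If load() throws, refuse to start (exit, loudly).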
*** think about levels ***
- think what you mean by debug/info/warning/error & be consistent
idea:
Error == "service died because of something" [not: user entered something silly]
Warning == "internal/programmer error but we recovered
[set a job to grep the logs for either of them & email you!]
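A sketch of that convention using java.util.logging, with SEVERE playing the role of "error" (the class and method names are invented):

    import java.util.logging.Level;
    import java.util.logging.Logger;

    public class Levels {
        private static final Logger LOG = Logger.getLogger("myapp");

        static void handleRequest(String input) {
            if (input == null || input.isEmpty()) {
                // user entered something silly: the service is fine, so not an error
                LOG.info("rejected bad input");
                return;
            }
            try {
                process(input);
            } catch (RuntimeException e) {
                // internal/programmer error, but we recovered
                LOG.log(Level.WARNING, "recovered from internal error", e);
            }
        }

        static void fatal(Exception e) {
            // the service is dying because of something: the real "error"
            LOG.log(Level.SEVERE, "service dying", e);
            System.exit(1);
        }

        static void process(String s) { /* may throw RuntimeException */ }
    }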
*** more useful if accessible ***
- make sure the logs go somewhere you can get to them easily
*** be engaged ***
it's not just the sysadmins' problem :-)
*** think of useful metrics ***
You are uniquely placed to do this!
- try to find an honest metric
- be as creative as you would be for the choice of an algorithm or data structure
e.g.
OK (better than nothing): "size of queue"
... but is 100,000 bad?
Better: "age of oldest item in queue"
... relates directly to your SLA (maybe?)
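A sketch of what that could look like (the queue class and names are invented):

    import java.time.Duration;
    import java.time.Instant;
    import java.util.ArrayDeque;
    import java.util.Deque;

    public class WorkQueue {
        // each queued item remembers when it arrived
        record Item(Instant enqueuedAt, String payload) {}

        private final Deque<Item> queue = new ArrayDeque<>();

        synchronized void add(String payload) {
            queue.addLast(new Item(Instant.now(), payload));
        }

        // the honest metric: how long has the oldest item been waiting?
        synchronized Duration oldestItemAge() {
            Item head = queue.peekFirst();
            return head == null ? Duration.ZERO
                                : Duration.between(head.enqueuedAt(), Instant.now());
        }
    }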
*** expose data if you need ***
- JMX (for Java) [obvious]
- have a thread write values (e.g. "timestamp key=value key2=value2 ...") every second
- daily job: append "job succeeded @ <timestamp>" to a file => then monitor age of file
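A sketch of the middle idea: a background task appending one metrics line per second to a file the monitoring can read (the path, the metric name and the WorkQueue type from the earlier sketch are all illustrative):

    import java.io.FileWriter;
    import java.io.PrintWriter;
    import java.time.Instant;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class MetricsWriter {
        public static void start(WorkQueue queue) {
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleAtFixedRate(() -> {
                try (PrintWriter out = new PrintWriter(new FileWriter("/var/tmp/myapp.metrics", true))) {
                    out.printf("%s queue_oldest_age_secs=%d%n",
                            Instant.now(), queue.oldestItemAge().toSeconds());
                } catch (Exception e) {
                    // metrics must never take the service down
                }
            }, 1, 1, TimeUnit.SECONDS);
        }
    }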
*** help to specify trigger levels ***
- aspect of “be engaged”
Access to the live system
Depending on the type of environment, and whether it's possible you'll be phoned at 3 in the morning, you need to be able to:
- see the logs
- view the current config (redacted if necessary)
- view the monitoring
May need to make a business case for this!
Dying (with dignity)
*** 1. don't be shy about dying ***
Don't be reticent about exiting if your service has failed :-)
[ SAs will notice this & try to help. ]
e.g. unable to find/read SSL cert; no point in the app running if socket set-up failed [seen this one]
*** 2. explain why ***
Make it obvious why the application exited
Make sure you write something in the log:
1. explain what's wrong
2. put it at the bottom, not obscured by too much noise
[ a stack trace may be great for you (and worth including) but not sufficient by itself ]
Think about where the message goes; if you're writing to a logfile but your error message goes to stdout/stderr, it's lost forever
Patterns:
- catch-all exceptions, but print & exit (sketch below). Would this catch the message if your JVM runs out of memory?
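A sketch of that pattern (the Server class is a stand-in):

    public class Main {
        static class Server {
            void run() {
                // stand-in for real start-up work
                throw new IllegalStateException("cannot read SSL cert");
            }
        }

        public static void main(String[] args) {
            try {
                new Server().run();
            } catch (Throwable t) {
                // explain what's wrong as the very last thing written, then die visibly;
                // a stack trace alone isn't enough
                System.err.println("FATAL: service exiting: " + t.getMessage());
                t.printStackTrace();
                System.exit(1);
            }
        }
    }

Catching Throwable will usually see an OutOfMemoryError too, but there's no guarantee the logging itself still works at that point.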
*** 3. use a restart mechanism ***
OSes often provide a process monitor – Solaris SMF, or on Linux upstart or systemd. This will check that the process is still running, and restart it if it crashes. They're all different & YMMV with different tools - OTOH I've written & used a simple app that'll run & monitor a child process.
(Monitors can sometimes have problems with processes that do “clever” things like forking & dropping privileges)
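For the systemd case, a minimal unit file might look something like this (the name, path and limits are illustrative; the "restart 5 times then give up" behaviour described below comes from the start-limit settings):

    [Unit]
    Description=myapp (illustrative)
    # give up after 5 rapid failures
    StartLimitIntervalSec=60
    StartLimitBurst=5

    [Service]
    ExecStart=/usr/local/bin/myapp
    # restart on crash or non-zero exit
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target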
That's it. I died.
- The restarter kicked in, restarted me 5 times and gave up.
- The monitoring noticed the restarts and sent an alert; when the restarter gave up, it sent a page.
- Monitoring also noticed that it couldn't connect to the service and sent an alert (and it noticed that the process isn't running, too).
The program left a message at the very end of the logfile, describing the problem. So hopefully it'll be fixed before we breach our SLA (or lose customers) etc.