Monitor All the Things!
Félix-Étienne Trépanier
Call me Félix
- Software Engineer - Independent
- Code Java / Scala (mainly)
- Build Distributed Systems
- Organizer of Scala-Montreal Meetups
scope
motivations
Facets
- Logs
- Metrics
- Alerts
- Traces
logs
Typically event-based,
unstructured messages
written to files.
Often the key for figuring
out what happened.
- timestamp
- severity
- thread
- source
- clear complete message
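As an illustration of the fields listed above, here is a minimal sketch using Python's standard logging module. The logger name "demo.orders" and the message content are hypothetical; the format string is one possible layout, not the one used in the talk.

```python
import io
import logging

# One line per record: timestamp, severity, thread, source, message.
LOG_FORMAT = "%(asctime)s %(levelname)s [%(threadName)s] %(name)s - %(message)s"


def build_logger(name, stream):
    """Return a logger that writes formatted records to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # keep the record out of the root logger
    return logger


buffer = io.StringIO()
log = build_logger("demo.orders", buffer)  # hypothetical source name
# A clear, complete message carries the identifiers needed to investigate later.
log.info("order accepted id=%s items=%d", "A-17", 3)
line = buffer.getvalue().strip()
print(line)
```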
Tips
- single line *
- log configuration
- log state/event on error
- avoid log flooding
- local log rotation!
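A possible way to apply the single-line and rotation tips, sketched with Python's standard library. The file name, size limit and backup count are arbitrary assumptions chosen so the rotation is visible in a short run.

```python
import logging
import logging.handlers
import os
import tempfile


class SingleLineFormatter(logging.Formatter):
    """Collapse newlines (for example stack traces) so a record stays on one line."""

    def format(self, record):
        return super().format(record).replace("\n", " | ")


def build_rotating_logger(path, max_bytes, backups):
    """Return a logger whose file is rotated locally once it reaches max_bytes."""
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backups
    )
    handler.setFormatter(SingleLineFormatter("%(asctime)s %(levelname)s %(message)s"))
    logger = logging.getLogger("demo.rotating")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False
    return logger, handler


log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "demo.log")
log, handler = build_rotating_logger(log_path, max_bytes=200, backups=2)

try:
    raise ValueError("bad input")
except ValueError:
    # Log the state that led to the error, with the stack trace on the same line.
    log.exception("request failed user=%s", "u42")

for i in range(20):
    log.info("heartbeat %d", i)
handler.close()
```

With these settings, at most three files (demo.log, demo.log.1, demo.log.2) remain on disk, so the log cannot fill the local disk.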
Tips
Think about what you
would like to know when
a failure happens.
Tips
In distributed systems,
accessing all the log files
can be painful.
A few options
syslog
Example
input {
  file {
    path => "/var/log/demo/demo.log"
  }
}
filter {
  grok {
    match => [ "message", "%{DATESTAMP} %{LOGLEVEL:log_level} %{GREEDYDATA}" ]
  }
  mutate {
    add_field => {
      "service" => "demo"
    }
  }
}
output {
  elasticsearch {
    host => mysearchserver
  }
}
Example
{
  "message":"2015-02-01 21:31:02,076 INFO [dispatcher-29] awesome log",
  "@version":"1",
  "@timestamp":"2015-02-02T02:31:02.860Z",
  "host":"mammouth",
  "path":"/var/log/demo/demo.log",
  "log_level":"INFO",
  "service":"demo"
}
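To make the filter step more concrete, here is a rough Python approximation of what the grok match and the add_field mutation do to the example line. The regular expression is a simplified stand-in for the grok patterns, and the "parse_failure" tag is a hypothetical name used only in this sketch.

```python
import re

# Simplified stand-in for "%{DATESTAMP} %{LOGLEVEL:log_level} %{GREEDYDATA}".
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<log_level>[A-Z]+) "
    r"(?P<rest>.*)$"
)


def parse_line(line, service="demo"):
    """Turn a raw log line into a structured event, as the filter block does."""
    event = {"message": line, "service": service}  # add_field equivalent
    match = LINE_PATTERN.match(line)
    if match is None:
        event["tags"] = ["parse_failure"]  # hypothetical tag for unmatched lines
    else:
        event["log_level"] = match.group("log_level")  # only the named capture is kept
    return event


event = parse_line("2015-02-01 21:31:02,076 INFO [dispatcher-29] awesome log")
print(event["log_level"], event["service"])
```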
Persistence?
metrics
Measurement points
in the system.
- gauges
- counters
- histograms
- meters
- timers
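The five metric types can be sketched in a few lines of Python. This is a toy, in-memory illustration under simplifying assumptions (no thread safety, histogram keeps every value, integer percentiles with the nearest-rank method); a real metrics library handles those concerns.

```python
import time


class Counter:
    """A value that only goes up (or down) by increments."""
    def __init__(self):
        self.count = 0

    def inc(self, n=1):
        self.count += n


class Gauge:
    """An instantaneous value, read on demand through a callback."""
    def __init__(self, read):
        self.read = read

    def value(self):
        return self.read()


class Histogram:
    """Distribution of recorded values."""
    def __init__(self):
        self.values = []

    def update(self, value):
        self.values.append(value)

    def percentile(self, p):
        ordered = sorted(self.values)
        rank = -(-p * len(ordered) // 100)  # ceil(p * n / 100), nearest-rank
        return ordered[max(rank, 1) - 1]


class Meter:
    """Rate of events over time (mean rate since creation)."""
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.start = clock()
        self.count = 0

    def mark(self, n=1):
        self.count += n

    def mean_rate(self):
        elapsed = self.clock() - self.start
        return self.count / elapsed if elapsed > 0 else 0.0


class Timer:
    """A histogram of durations combined with a meter of calls."""
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.durations = Histogram()
        self.calls = Meter(clock)

    def time(self, fn, *args):
        start = self.clock()
        try:
            return fn(*args)
        finally:
            self.durations.update(self.clock() - start)
            self.calls.mark()


queue = [1, 2, 3]
queue_size = Gauge(lambda: len(queue))
print(queue_size.value())
```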
Tips
- calls per second (for each endpoint)
- call latency (for each endpoint)
- downstream dependency latency
- size of limited resources
  - cache
  - heap
  - thread pool
  - connection pool
- error count/rate
- unexpected conditions
- feature metrics
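As an example of the per-endpoint measurements, the following sketch wraps a handler to record call count, error count and latency. The endpoint name "GET /orders" and the handler are hypothetical, and the in-memory dictionaries stand in for a real metrics registry.

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory stand-ins for a metrics registry, keyed by endpoint name.
calls = defaultdict(int)
errors = defaultdict(int)
latencies = defaultdict(list)


def instrumented(endpoint):
    """Decorator recording calls, errors and latency for one endpoint."""
    def decorate(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            calls[endpoint] += 1
            try:
                return handler(*args, **kwargs)
            except Exception:
                errors[endpoint] += 1
                raise
            finally:
                # Latency is recorded for successes and failures alike.
                latencies[endpoint].append(time.perf_counter() - start)
        return wrapper
    return decorate


@instrumented("GET /orders")  # hypothetical endpoint
def list_orders(fail=False):
    if fail:
        raise RuntimeError("database unavailable")
    return ["A-17"]


list_orders()
try:
    list_orders(fail=True)
except RuntimeError:
    pass
print(calls["GET /orders"], errors["GET /orders"])
```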
History
alerts
Notification sent
when the state of a
component is
critical.
(usually during the night)
Tips
- simple triggers/checks
- limit false positives
- no false negatives
- alert message should
give some context
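A simple trigger along these lines could look like the sketch below. Requiring several consecutive samples above the threshold is one common way to limit false positives; the threshold, window size and service name are assumptions made for the example.

```python
def check_error_rate(samples, service="demo", threshold=0.05, consecutive=3):
    """Return an alert message if the last `consecutive` samples all exceed
    the threshold, otherwise None."""
    recent = samples[-consecutive:]
    if len(recent) < consecutive:
        return None  # not enough data to decide
    if any(rate <= threshold for rate in recent):
        return None  # a single spike does not raise an alert
    values = ", ".join(f"{rate:.1%}" for rate in recent)
    # The message carries context: which service, which check, which values.
    return (
        f"CRITICAL service={service} check=error_rate "
        f"threshold={threshold:.1%} last_values=[{values}]"
    )


print(check_error_rate([0.01, 0.20, 0.01]))
print(check_error_rate([0.10, 0.20, 0.30]))
```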
Tips
For services:
- healthchecks
- test query
- measure around alerts
- log around alerts
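A healthcheck can be as small as a function that runs a few named test queries and reports the result. In this sketch the two checks are hypothetical stand-ins for a real database query and a real cache ping.

```python
def run_healthchecks(checks):
    """Run each named check and report per-check and overall status."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as error:
            results[name] = f"failed: {error}"
    healthy = all(status == "ok" for status in results.values())
    return {"healthy": healthy, "checks": results}


def database_test_query():
    # Hypothetical stand-in for a cheap test query such as "SELECT 1".
    return 1


def cache_ping():
    # Hypothetical failing dependency, to show how a failure is reported.
    raise ConnectionError("cache unreachable")


status = run_healthchecks({"database": database_test_query, "cache": cache_ping})
print(status)
```

A service would typically expose this result on an HTTP endpoint so that an external check can poll it.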
Some options
traces
A record of the
paths a request
took in the system.
Why is this request
slow?
Trace Aggregation
- request sampling
- annotations logged locally
- published to a ‘trace’ server
- trace reconstruction
- trace indexing
- trace service UI
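The sampling and reconstruction steps can be sketched as follows. The annotation fields (trace_id, service, event, ts) and the hash-based sampling rule are assumptions made for illustration, not the format of any particular tracing system.

```python
import zlib
from collections import defaultdict


def is_sampled(trace_id, rate=0.1):
    """Deterministic sampling: every service reaches the same decision
    for a given trace id, so a trace is either fully kept or fully dropped."""
    return zlib.crc32(trace_id.encode()) % 1000 < rate * 1000


def reconstruct(annotations):
    """Group annotations by trace id and order each trace by timestamp."""
    traces = defaultdict(list)
    for annotation in annotations:
        traces[annotation["trace_id"]].append(annotation)
    return {
        trace_id: sorted(items, key=lambda a: a["ts"])
        for trace_id, items in traces.items()
    }


# Annotations as they might arrive at the trace server, out of order.
annotations = [
    {"trace_id": "t1", "service": "db", "event": "query", "ts": 12},
    {"trace_id": "t2", "service": "web", "event": "request received", "ts": 11},
    {"trace_id": "t1", "service": "web", "event": "request received", "ts": 10},
    {"trace_id": "t1", "service": "web", "event": "response sent", "ts": 25},
]

traces = reconstruct(annotations)
for step in traces["t1"]:
    print(step["ts"], step["service"], step["event"])
```

Reading the ordered timestamps of one trace shows where the time was spent, which is the question a slow request raises.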
summary
Craft your logs.
Centralize them.
Store metrics.
Use the graphs and
create a dashboard.
Services should
publish their health
status.
Monitor metrics and
health status with
simple checks to
raise alerts on
failure.
Build your services
for production.
Contact and Links
Félix-Étienne Trépanier
Twitter - @felixtrepanier
Github - @coderunner
Links and References: https://github.com/coderunner/monitoring-references

Monitor all the things - Confoo