Monitor all the things - Confoo

Monitor All the Things!
Félix-Étienne Trépanier

Call me Félix
- Software Engineer - Independent
- Code Java / Scala (mainly)
- Build Distributed Systems
- Organizer of Scala-Montreal Meetups

Facets
- Logs
- Metrics
- Alerts
- Traces

Logs

Typically event-based, unstructured messages written to files.

Often the key to figuring out what happened.

- timestamp
- severity
- thread
- source
- clear complete message
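
A minimal sketch of a log format carrying the five fields above, using Python's standard logging module. The `build_logger` helper and the exact field order are illustrative assumptions, not something from the talk.

```python
import logging

# timestamp, severity, thread, source (logger name), then the message.
LOG_FORMAT = "%(asctime)s %(levelname)s [%(threadName)s] %(name)s - %(message)s"


def build_logger(name, handler):
    """Return a logger that writes records in LOG_FORMAT to the given handler."""
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate lines through the root logger
    return logger
```

Usage: `build_logger("demo.dispatcher", logging.StreamHandler()).info("awesome log")` writes one line with all five fields.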

Tips

- single line *
- log configuration
- log state/event on error
- avoid log flooding
- local log rotation!
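
Two of these tips (single line, local rotation) can be sketched with the standard library. The class and function names, the size limit, and the backup count are assumptions chosen for illustration.

```python
import logging
from logging.handlers import RotatingFileHandler


class SingleLineFormatter(logging.Formatter):
    """Escape line breaks so that each log entry stays on one line."""

    def format(self, record):
        text = super().format(record)
        return text.replace("\r", "\\r").replace("\n", "\\n")


def rotating_handler(path, max_bytes=10 * 1024 * 1024, backups=5):
    """File handler that rotates locally once the file reaches max_bytes."""
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(
        SingleLineFormatter("%(asctime)s %(levelname)s %(message)s")
    )
    return handler
```

Keeping entries on a single line makes them easier to grep and to parse with line-oriented shippers.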

Tips

Think about what you would want to know when something goes wrong.

Tips

In distributed systems,
accessing all the log files
can be painful.

Example
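input {
  # Tail the application log file.
  file {
    path => "/var/log/demo/demo.log"
  }
}
filter {
  # Extract the log level from each line.
  grok {
    match => [ "message", "%{DATESTAMP} %{LOGLEVEL:log_level} %{GREEDYDATA}" ]
  }
  # Tag every event with the service name.
  mutate {
    add_field => {
      "service" => "demo"
    }
  }
}
output {
  # Ship events to Elasticsearch for centralized search.
  elasticsearch {
    host => mysearchserver
  }
}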

Example
{
  "message":"2015-02-01 21:31:02,076 INFO [dispatcher-29] awesome log",
  "@version":"1",
  "@timestamp":"2015-02-02T02:31:02.860Z",
  "host":"mammouth",
  "path":"/var/log/demo/demo.log",
  "log_level":"INFO",
  "service":"demo"
}
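
To make the transformation concrete, here is a rough Python sketch of what the filter stage does to one line: extract the level, add the service field. The regex and the `to_event` helper are assumptions that only approximate the grok pattern; the failure tag imitates the one Logstash adds when grok does not match.

```python
import re

# Approximates "%{DATESTAMP} %{LOGLEVEL:log_level} %{GREEDYDATA}" for the
# timestamp layout used in the example line (assumption, not the grok source).
LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<log_level>[A-Z]+) (?P<rest>.*)$"
)


def to_event(line, path, service="demo"):
    """Build an event dict shaped like the JSON example above."""
    event = {"message": line, "path": path, "service": service}
    match = LINE.match(line)
    if match:
        event["log_level"] = match.group("log_level")
    else:
        event["tags"] = ["_grokparsefailure"]
    return event
```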

Metrics

Measurement points in the system.

- gauges
- counters
- histograms
- meters
- timers
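
A minimal in-memory sketch of the five metric types, to show how they differ. The class and method names are assumptions loosely modeled on common metrics libraries; a real library would also bound memory and handle concurrency.

```python
import math
import time


class Counter:
    """Running count, e.g. requests served."""
    def __init__(self):
        self.count = 0

    def inc(self, n=1):
        self.count += n


class Gauge:
    """Instantaneous value read on demand, e.g. queue size."""
    def __init__(self, read):
        self._read = read

    def value(self):
        return self._read()


class Histogram:
    """Distribution of recorded values (all kept in memory here)."""
    def __init__(self):
        self.values = []

    def update(self, value):
        self.values.append(value)

    def percentile(self, p):
        """Nearest-rank percentile, or None when nothing was recorded."""
        if not self.values:
            return None
        ordered = sorted(self.values)
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]


class Meter:
    """Mean rate of events per second since creation."""
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = clock()
        self.count = 0

    def mark(self, n=1):
        self.count += n

    def mean_rate(self):
        elapsed = self._clock() - self._start
        return self.count / elapsed if elapsed > 0 else 0.0


class Timer:
    """Context manager that records elapsed seconds into a histogram."""
    def __init__(self):
        self.durations = Histogram()

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, *exc_info):
        self.durations.update(time.perf_counter() - self._start)
        return False  # never swallow exceptions
```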

Tips
- calls per second (for each endpoint)
- call latency (for each endpoint)
- latency of downstream dependencies
- size of limited resources
  - cache
  - heap
  - thread pool
  - connection pool
- error count/rate
- unexpected conditions
- feature metrics
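
As a sketch of the first few tips, a decorator can record call count, latency, and error count per endpoint. The `measured` decorator and the module-level dictionaries are hypothetical; in practice these values would be reported to a metrics library or backend.

```python
import time
from collections import defaultdict
from functools import wraps

# Per-endpoint measurements, keyed by endpoint name.
calls = defaultdict(int)
errors = defaultdict(int)
latencies = defaultdict(list)


def measured(endpoint):
    """Record call count, error count, and latency for the wrapped function."""
    def decorate(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            calls[endpoint] += 1
            try:
                return fn(*args, **kwargs)
            except Exception:
                errors[endpoint] += 1
                raise  # the caller still sees the failure
            finally:
                latencies[endpoint].append(time.perf_counter() - start)
        return wrapper
    return decorate
```

Calls per second and error rate can then be derived by sampling these counters at a fixed interval.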

Alerts

Notification sent when the state of a component is critical.
(usually during the night)

Tips
- simple triggers/checks
- limit false positives
- no false negatives
- alert message should give some context
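
A simple trigger with a contextual message might look like the sketch below. The `Alert` type, the `check_error_rate` name, and the 5% default threshold are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    check: str
    message: str


def check_error_rate(service, errors, calls, threshold=0.05):
    """Return an Alert when the error rate exceeds the threshold, else None."""
    if calls == 0:
        return None  # no traffic in this window, nothing to compute
    rate = errors / calls
    if rate <= threshold:
        return None
    # Put the numbers in the message so whoever is paged has context.
    return Alert(
        check="error_rate",
        message=(
            f"{service}: error rate {rate:.1%} "
            f"({errors}/{calls} calls), threshold {threshold:.0%}"
        ),
    )
```

The check is a single comparison, which keeps it easy to reason about when it fires in the middle of the night.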

Tips
For services:
- healthchecks
- test query
- measure around alerts
- log around alerts
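
A healthcheck built on a test query could be sketched as follows. The `health_check` function, the status names, and the one-second latency budget are hypothetical; `test_query` stands for any cheap call against a dependency.

```python
import logging
import time

log = logging.getLogger("healthcheck")


def health_check(test_query, max_latency_s=1.0):
    """Run a cheap test query and report status, latency, and any error."""
    start = time.perf_counter()
    try:
        test_query()
    except Exception as exc:
        elapsed = time.perf_counter() - start
        # Log around the failure so the alert can be investigated later.
        log.warning("healthcheck failed after %.3fs: %r", elapsed, exc)
        return {"status": "unhealthy", "latency_s": elapsed, "error": repr(exc)}
    elapsed = time.perf_counter() - start
    status = "healthy" if elapsed <= max_latency_s else "degraded"
    return {"status": status, "latency_s": elapsed, "error": None}
```

A service would typically expose this result on an endpoint that the alerting system polls.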

Traces

A record of the path a request took through the system.

Trace Aggregation
- request sampling
- annotations logged locally
- published to a ‘trace’ server
- trace reconstruction
- trace indexing
- trace service UI
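
The sampling, annotation, and reconstruction steps can be sketched in a few lines. Everything here (function names, the annotation fields, the in-memory store) is an assumption for illustration; a real tracing system publishes annotations to a separate server and indexes them there.

```python
import random
from collections import defaultdict


def should_sample(rate, rng=random.random):
    """Decide whether to trace a request; rate is a fraction in [0, 1]."""
    return rng() < rate


def annotate(store, trace_id, service, event, timestamp):
    """Record one annotation locally for a sampled request."""
    store.append(
        {"trace_id": trace_id, "service": service,
         "event": event, "timestamp": timestamp}
    )


def reconstruct(annotations):
    """Group annotations by trace id and order each trace by time."""
    traces = defaultdict(list)
    for annotation in annotations:
        traces[annotation["trace_id"]].append(annotation)
    return {
        trace_id: sorted(items, key=lambda a: a["timestamp"])
        for trace_id, items in traces.items()
    }
```

The trace id has to travel with the request across services so that annotations logged on different machines can be joined later.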

Craft your logs.
Centralize them.

Store metrics.
Use the graphs and
create a dashboard.

Services should
publish their health
status.

Monitor metrics and
health status with
simple checks to
raise alerts on
failure.

Build your services
for production.

Contact and Links
Félix-Étienne Trépanier
Twitter - @felixtrepanier
Github - @coderunner
Links and References: https://github.com/coderunner/monitoring-references
