Practical logstash - beyond the basics.
Tomas Doran (t0m) <bobtfish@bobtfish.net>
Who am I

• Sysadmin at TIM Group
• t0m on irc.freenode.net
• twitter.com/bobtfish
• github.com/bobtfish
• slideshare.net/bobtfish
Logstash
• I hope you already know what logstash is?
• I’m going to talk about our implementation.
 • Elasticsearch
 • Metrics
 • Nagios
 • Riemann
> 55 million messages a day

• Now ~30GB of indexed data per day
• All our applications
• All of syslog
• Used by developers and product managers
• 2 × DL360s with 8 × 600GB discs, which
  also host our graphite install
About 4 months old

• Almost all apps onboard to various levels
• All of syslog was easy
• Still haven’t done Apache logs
• Haven’t comprehensively done routers/
  switches
• Lots of apps still emit directly to graphite
Java

• All our apps are Java / Scala / Clojure
• https://github.com/tlrx/slf4j-logback-zeromq
• Our own layer (×2: one Java, one Scala) for
  sending structured events as JSON
• Java developers hate native code
On host log collector

• Need a lightweight log shipper.
• VMs with 1GB of RAM..

• Message::Passing - a Perl library I wrote.
• Small, light, pluggable
On host log collector
• Application to logcollector is ZMQ
 • Small amount of buffering (1000
    messages)
• logcollector to logstash is ZMQ
 • Large amount of buffering (disc offload,
    100s of thousands of messages)
ZeroMQ has the correct semantics
• Pub/Sub sockets
• Never, ever blocking
• Lossy! (If needed)
• Buffer sizes / locations configurable
• Arbitrary message size
• IO done in a background thread (nice in
  interpreted languages - ruby/perl/python)
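On the receiving side, logstash can subscribe to the collectors with its zeromq input. A minimal sketch of what that might look like - the port is hypothetical and option names vary between logstash versions, so treat this as illustrative rather than our exact config:

```
input {
  zeromq {
    topology => "pubsub"        # collectors PUB, logstash SUBs
    address  => "tcp://*:5558"  # hypothetical port; logstash binds, collectors connect
    type     => "app-events"
  }
}
```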
What, no AMQP?

• Could go logcollector => AMQP =>
  logstash for extra durability
• ZMQ buffering ‘good enough’
• logstash uses a pure ruby AMQP decoder
• Slooooowwwwww
Reliability

• Multiple Elasticsearch servers (obvious)!
• Due to ZMQ buffering, you can:
 • restart logstash, messages just buffer on
    hosts whilst it’s unavailable
  • restart logcollector, messages from apps
    buffer (lose some syslog)
Reliability: TODO

• The Elasticsearch cluster does get sick occasionally
• When it does, in-flight messages in logstash are lost :(
• Solution - elasticsearch_river output
 • logstash => durable RabbitMQ queue
 • ES reads from queue
 • Also faster - uses bulk API
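A sketch of what the elasticsearch_river output could look like; option names are illustrative and version-dependent, and the host names are hypothetical:

```
output {
  elasticsearch_river {
    es_host       => "es1.example.com"     # ES node used to set up the river
    rabbitmq_host => "rabbit.example.com"  # durable queue between logstash and ES
  }
}
```

The RabbitMQ river plugin has to be installed on the Elasticsearch side; ES then drains the queue itself via the bulk API, which is where the speedup comes from.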
Redundancy
• Add a UUID to each message at emission
  point.
• Index in elasticsearch by UUID
• Emit to two backend logstash instances
  (TODO)
• Index everything twice! (TODO)
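Why the UUID matters: if the same event arrives twice (two backend logstash instances, or double indexing), keying the index by UUID makes the second write an overwrite rather than a duplicate. A toy model in Python - the field names are made up, and a dict stands in for an Elasticsearch index:

```python
import uuid

def new_event(message, **fields):
    # Attach a UUID at the emission point; every copy of this event
    # carries the same id regardless of which path it takes.
    event = {"@uuid": str(uuid.uuid4()), "@message": message}
    event.update(fields)
    return event

def index(store, event):
    # Indexing keyed by UUID makes duplicate delivery idempotent:
    # the second copy overwrites the first, like an ES index-by-id.
    store[event["@uuid"]] = event

store = {}
event = new_event("POST /login", status=200)
index(store, event)
index(store, event)  # duplicate delivery via a second logstash backend
```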
Elasticsearch optimisation
• You need a template
 • compress source
 • disable _all
 • discard unwanted fields from source /
    indexing
 • tweak shards and replicas
• compact yesterday’s index at the end of
  the day!
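For illustration, a template along those lines for the 0.x-era Elasticsearch described here might look like this (the shard/replica numbers are examples, not our production settings):

```json
{
  "template": "logstash-*",
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  },
  "mappings": {
    "_default_": {
      "_all": { "enabled": false },
      "_source": { "compress": true }
    }
  }
}
```

End-of-day compaction is then a call to the optimize API on yesterday's index, e.g. `_optimize?max_num_segments=1` (renamed `_forcemerge` in later ES versions).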
Elasticsearch size
• 87 daily indexes
• 800GB of data (per instance)
• Just bumped the ES heap to 22GB
 • Just writing data - 2GB
 • Query over all indexes - 17GB!
• Hang on - 800/87 doesn’t equal 33GB/day!
Rate has increased!





We may have problems fitting
  onto 5 × 600GB discs!
Standard log message
Standard event message
TimedWebRequest
• Most obvious example of a standard event
 • App name
 • Environment
 • HTTP status
 • Page generation time
 • Request / Response size
• Can derive loads of metrics from this!
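As a sketch, a TimedWebRequest event in the logstash 1.x JSON event format might look like this - the field names and values are illustrative, not our exact schema:

```json
{
  "@timestamp": "2013-03-20T10:15:00.000Z",
  "@type": "TimedWebRequest",
  "@fields": {
    "application": "order-entry",
    "environment": "production",
    "http_status": 200,
    "page_time_ms": 142,
    "request_bytes": 512,
    "response_bytes": 20480
  }
}
```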
statsd
• Rolls up counters and timers into metrics
• One bucket per stat, emits values every 10
  seconds
• Counters: Request rate, HTTP status rate
• Timers: Total page time, mean page time,
  min/max page times
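A toy model of what statsd does with these events each flush interval - counters become rates, timers become mean/min/max. This sketches the behaviour only; it is not statsd's actual code:

```python
from collections import defaultdict

class StatsdSketch:
    """Toy statsd: accumulate counters and timers, roll them up
    into metrics once per flush interval."""

    def __init__(self, flush_interval=10):
        self.flush_interval = flush_interval  # seconds between emits
        self.counters = defaultdict(int)
        self.timers = defaultdict(list)

    def incr(self, stat, n=1):
        self.counters[stat] += n

    def timing(self, stat, ms):
        self.timers[stat].append(ms)

    def flush(self):
        # Emit one set of values per bucket, then reset for the next window.
        out = {}
        for stat, count in self.counters.items():
            out["stats.%s.rate" % stat] = count / self.flush_interval
        for stat, values in self.timers.items():
            out["stats.timers.%s.mean" % stat] = sum(values) / len(values)
            out["stats.timers.%s.min" % stat] = min(values)
            out["stats.timers.%s.max" % stat] = max(values)
        self.counters.clear()
        self.timers.clear()
        return out
```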
statsd
JSON everywhere

• Legacy shell ftp mirror scripts
• gitolite hooks for deployments
• keepalived health checks
JSON everywhere
echo "JSON:{\"nagios_service\":\"${SERVICE}\",
\"nagios_status\":\"${STATUS_CODE}\",
\"message\":\"${STATUS_TEXT}\"}" |
 logger -t nagios
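On the logstash side, events tagged this way can be pulled back out of the syslog stream with a grok + json filter pair, roughly like this (syntax from memory for 1.x-era logstash - check your version's filter docs):

```
filter {
  grok {
    # Strip the "JSON:" marker, capture the payload
    match => ["@message", "JSON:%{GREEDYDATA:json_payload}"]
  }
  json {
    source => "json_payload"
  }
}
```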
Alerting use cases:

• Replaced the NSCA client with a standardised
  log pipeline
• Developers log an event and get (one!)
  email warning of client side exceptions
• Passive health monitoring - ‘did we log
  something recently’
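The "did we log something recently" check can be modelled very simply: track the last time each source was seen, and go critical when that goes stale. A sketch in Python - the names and threshold are made up, and in reality last_seen would be fed by the log pipeline:

```python
import time

# Last time we saw a log event from each source (epoch seconds).
last_seen = {"order-entry": 1000.0}

def recently_logged(last_seen, name, max_age_s=300, now=None):
    """Passive check: CRITICAL if `name` hasn't logged within max_age_s."""
    if now is None:
        now = time.time()
    seen = last_seen.get(name)
    if seen is None or now - seen > max_age_s:
        return "CRITICAL"
    return "OK"
```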
Riemann

• Using for some simple health checking
 • logcollector health
 • Load balancer instance health
Riemann
• Ambitious plans to do more
 • Web pool health (>= n nodes)
 • Replace statsd
 • Transit collectd data via logstash and
    use to emit to graphite
  • disc usage trending / prediction
Metadata

• It’s all about the metadata
• Structured events are describable
• Common patterns to give standard
  metrics / alerting for free
• Dashboards!
Dashboard love/hate
• Riemann x 2
• Graphite dashboards x 2
• Nagios x 3
• CI radiator

• Information overload!
Thanks!

• Questions?

• slides with more detail about my log
  collector code:
  • http://slideshare.net/bobtfish/

London devops logging
