London devops logging

Speaker notes: The last point here is most important - ZMQ networking works entirely in a background thread perl knows nothing about, which means that you can asynchronously ship messages with no changes to your existing codebase.

Presentation Transcript

  • Practical logstash - beyond the basics. Tomas Doran (t0m) <bobtfish@bobtfish.net>
  • Who are you
    • Sysadmin at TIM Group
    • t0m on irc.freenode.net
    • twitter.com/bobtfish
    • github.com/bobtfish
    • slideshare.com/bobtfish
  • Logstash
    • I hope you already know what logstash is?
    • I’m going to talk about our implementation:
      • Elasticsearch
      • Metrics
      • Nagios
      • Riemann
  • > 55 million messages a day
    • Now ~30Gb of indexed data per day
    • All our applications
    • All of syslog
    • Used by developers and product managers
    • 2 x DL360s with 8x600Gb discs, also graphite install
  • About 4 months old
    • Almost all apps onboard to various levels
    • All of syslog was easy
    • Still haven’t done apache logs
    • Haven’t comprehensively done routers/switches
    • Lots of apps still emit directly to graphite
  • Java
    • All our apps are Java / Scala / Clojure
    • https://github.com/tlrx/slf4j-logback-zeromq
    • Own layer (x2: 1 Java, 1 Scala) for sending structured events as JSON
    • Java developers hate native code
  • On host log collector
    • Need a lightweight log shipper
    • VMs with 1Gb of RAM..
    • Message::Passing - perl library I wrote
    • Small, light, pluggable
    • Application to logcollector is ZMQ
      • Small amount of buffering (1000 messages)
    • logcollector to logstash is ZMQ (receiving side sketched below)
      • Large amount of buffering (disc offload, 100s of thousands of messages)
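
On the logstash side this is just a ZeroMQ input feeding the usual outputs. A minimal sketch of the indexer end, assuming the 1.x-era zeromq and elasticsearch plugin option names (these vary between logstash versions; the hostnames and port are made up):

    input {
      # Subscribe to the on-host logcollectors over ZeroMQ. They buffer
      # locally, so this process can be restarted without losing messages.
      zeromq {
        topology => "pubsub"
        address  => ["tcp://logcollector01:5558", "tcp://logcollector02:5558"]
        mode     => "client"
        type     => "app_event"
      }
    }
    output {
      # Index everything into Elasticsearch for searching.
      elasticsearch {
        host => "es01.example.com"
      }
    }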
  • ZeroMQ has the correct semantics
    • Pub/Sub sockets
    • Never, ever blocking
    • Lossy! (If needed)
    • Buffer sizes / locations configurable
    • Arbitrary message size
    • IO done in a background thread (nice in interpreted languages - ruby/perl/python)
  • What, no AMQP?
    • Could go logcollector => AMQP => logstash for extra durability
    • ZMQ buffering ‘good enough’
    • logstash uses a pure ruby AMQP decoder
    • Slooooowwwwww
  • Reliability
    • Multiple Elasticsearch servers (obvious)!
    • Due to ZMQ buffering, you can:
      • restart logstash - messages just buffer on hosts whilst it’s unavailable
      • restart logcollector - messages from apps buffer (lose some syslog)
  • Reliability: TODO
    • Elasticsearch cluster getting sick happens
    • In-flight messages in logstash lost :(
    • Solution - elasticsearch_river output (see the river sketch below)
      • logstash => durable RabbitMQ queue
      • ES reads from queue
      • Also faster - uses bulk API
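
The river half of that plan lives on the Elasticsearch side: register a RabbitMQ river and ES itself consumes the queue via the bulk API. A rough sketch against a 0.19/0.20-era cluster (the _river API and rabbitmq plugin are long gone from modern ES; the host, queue name and bulk size below are illustrative, not our real values):

    # Tell Elasticsearch to pull bulk indexing requests off a durable
    # RabbitMQ queue that logstash publishes into.
    curl -XPUT 'http://localhost:9200/_river/logstash/_meta' -d '{
      "type": "rabbitmq",
      "rabbitmq": {
        "host":  "rabbit01.example.com",
        "queue": "logstash"
      },
      "index": {
        "bulk_size": 5000
      }
    }'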
  • Redundancy
    • Add a UUID to each message at emission point
    • Index in elasticsearch by UUID (see sketch below)
    • Emit to two backend logstash instances (TODO)
    • Index everything twice! (TODO)
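
The reason for the UUID is that double delivery becomes harmless: the same event indexed twice under the same id simply overwrites itself, so you can ship to two logstash instances and index everything twice without creating duplicates. We add the UUID at the emission point; purely to illustrate the indexing side, here is a sketch using the uuid filter and the elasticsearch output's document_id setting from later logstash releases (option names may differ in older versions):

    filter {
      # Only generate a UUID if the emitter didn't already supply one.
      uuid {
        target    => "uuid"
        overwrite => false
      }
    }
    output {
      elasticsearch {
        host        => "es01.example.com"
        # Use the event's UUID as the document id so re-shipping the same
        # event can never create a second copy.
        document_id => "%{uuid}"
      }
    }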
  • Elasticsearch optimisation
    • You need a template (sketch below):
      • compress source
      • disable _all
      • discard unwanted fields from source / indexing
      • tweak shards and replicas
    • compact yesterday’s index at end of day!
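
A minimal sketch of such a template for a 0.x-era Elasticsearch (where _source compression and _all were still mapping options); the shard and replica numbers are illustrative, not our production values:

    # PUT as an index template so every new daily logstash-* index picks it up.
    curl -XPUT 'http://localhost:9200/_template/logstash' -d '{
      "template": "logstash-*",
      "settings": {
        "number_of_shards":   5,
        "number_of_replicas": 1
      },
      "mappings": {
        "_default_": {
          "_all":    { "enabled": false },
          "_source": { "compress": true }
        }
      }
    }'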
  • Elasticsearch size
    • 87 daily indexes
    • 800Gb of data (per instance)
    • Just bumped ES heap to 22G
      • Just writing data - 2Gb
      • Query over all indexes - 17Gb!
    • Hang on - 800/87 does not = 33Gb/day!
  • Rate has increased! We may have problems fitting onto 5 x 600Gb discs!
  • Standard log message
  • Standard event message
  • TimedWebRequest
    • Most obvious example of a standard event (sample below):
      • App name
      • Environment
      • HTTP status
      • Page generation time
      • Request / Response size
    • Can derive loads of metrics from this!
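
For illustration only, a made-up example of what a TimedWebRequest event might look like as JSON on the wire - the field names and values here are invented, not our real schema:

    {
      "uuid":           "5f2b6c2e-9d7a-4c1e-8a43-1f0c2d9e7a11",
      "@timestamp":     "2013-03-25T09:14:07.123Z",
      "type":           "TimedWebRequest",
      "application":    "example-web",
      "environment":    "production",
      "http_status":    200,
      "page_time_ms":   142,
      "request_bytes":  431,
      "response_bytes": 18233
    }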
  • statsd (fed from logstash - sketch below)
    • Rolls up counters and timers into metrics
    • One bucket per stat, emits values every 10 seconds
    • Counters: Request rate, HTTP status rate
    • Timers: Total page time, mean page time, min/max page times
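
A sketch of the logstash side of that, using the statsd output to derive counters and timers from TimedWebRequest events. Option names are from the statsd output plugin as I remember it (syntax differs slightly between logstash versions), and the %{...} fields match the invented sample event above; scope the output to TimedWebRequest events however your logstash version allows:

    output {
      # Turn each TimedWebRequest into statsd counters and timers; statsd
      # rolls them up and flushes the derived metrics to graphite every 10s.
      statsd {
        host      => "statsd01.example.com"
        namespace => "app"
        sender    => "%{application}.%{environment}"
        # Counters: overall request rate and per-HTTP-status rate.
        increment => ["requests", "status.%{http_status}"]
        # Timers: statsd derives mean/min/max/percentiles for us.
        timing    => { "page_time" => "%{page_time_ms}" }
      }
    }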
  • JSON everywhere
    • Legacy shell ftp mirror scripts
    • gitolite hooks for deployments
    • keepalived health checks
  • JSON everywhere - e.g. from a shell script, one line gets a structured event into the pipeline via syslog (parsing sketched below):

    echo "JSON:{\"nagios_service\":\"${SERVICE}\",\"nagios_status\":\"${STATUS_CODE}\",\"message\":\"${STATUS_TEXT}\"}" | logger -t nagios
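
On the logstash side a pair of filters pulls that JSON back out of the syslog line and merges the fields into the event; a sketch assuming the grok and json filter plugins (the json_payload field name is just for illustration):

    filter {
      # Extract the JSON document from syslog lines carrying the JSON: marker.
      grok {
        match => ["message", "JSON:%{GREEDYDATA:json_payload}"]
      }
      # Parse it, merging nagios_service, nagios_status, message etc.
      # into the event as first-class fields.
      json {
        source => "json_payload"
      }
    }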
  • Alerting use cases:
    • Replaced nsca client with standardised log pipeline (nagios output sketched below)
    • Developers log an event and get (one!) email warning of client side exceptions
    • Passive health monitoring - ‘did we log something recently’
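
For the nsca replacement, the logstash nagios output writes passive check results straight into the Nagios command file; as I recall the plugin, it expects nagios_host and nagios_service fields on the event, which is why the shell one-liner above sets nagios_service. The command file path below is an assumption about your Nagios install:

    output {
      # Submit events carrying nagios_* fields as passive check results.
      nagios {
        commandfile => "/var/lib/nagios3/rw/nagios.cmd"
      }
    }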
  • Riemann
    • Using for some simple health checking
      • logcollector health
      • Load balancer instance health
  • Riemann
    • Ambitious plans to do more
      • Web pool health (>= n nodes)
      • Replace statsd
      • Transit collectd data via logstash and use to emit to graphite
      • disc usage trending / prediction
  • Metadata
    • It’s all about the metadata
    • Structured events are describable
    • Common patterns to give standard metrics / alerting for free
    • Dashboards!
  • Dashboard love/hate
    • Riemann x 2
    • Graphite dashboards x 2
    • Nagios x 3
    • CI radiator
    • Information overload!
  • Thanks!
    • Questions?
    • Slides with more detail about my log collector code: http://slideshare.net/bobtfish/