12. > 55 million messages a day
• Now ~30Gb of indexed data per day
• All our applications
• All of syslog
• Used by developers and product managers
• 2 x DL360s with 8 x 600Gb discs, also the graphite install
14. About 4 months old
• Almost all apps onboarded, to various levels
• All of syslog was easy
• Still haven’t done apache logs
• Haven’t comprehensively done routers/switches
• Lots of apps still emit directly to graphite
21. On host log collector
• Need a lightweight log shipper.
• VMs with only 1Gb of RAM…
• Message::Passing - a Perl library I wrote.
• Small, light, pluggable
24. On host log collector
• Application to logcollector is ZMQ
• Small amount of buffering (1000 messages)
• logcollector to logstash is ZMQ
• Large amount of buffering (disc offload, 100s of thousands of messages)
36. What, no AMQP?
• Could go logcollector => AMQP => logstash for extra durability
• ZMQ buffering ‘good enough’
• logstash uses a pure Ruby AMQP decoder
• Slooooowwwwww
39. Reliability
• Multiple Elasticsearch servers (obvious!)
• Due to ZMQ buffering, you can:
• restart logstash - messages just buffer on hosts whilst it’s unavailable
• restart logcollector - messages from apps buffer (lose some syslog)
48. Redundancy
• Add a UUID to each message at emission point.
• Index in Elasticsearch by UUID
• Emit to two backend logstash instances (TODO)
• Index everything twice! (TODO)
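The reason the UUID makes double-emission safe: indexing by UUID is idempotent, so two logstash backends writing the same message produce one document. A minimal sketch (a dict stands in for the Elasticsearch index; field names are illustrative):

```python
import uuid

def emit(message):
    """Attach a UUID at the emission point, so every downstream
    copy of this message carries the same identity."""
    message = dict(message)
    message["uuid"] = str(uuid.uuid4())
    return message

def index(store, message):
    """Index by UUID: writing the same message twice overwrites
    the same document instead of creating a duplicate."""
    store[message["uuid"]] = message

store = {}
msg = emit({"app": "web", "status": 200})
index(store, msg)   # delivered via the first logstash backend
index(store, msg)   # the second backend delivers the same message
assert len(store) == 1   # still exactly one document
```

With the real cluster the same trick works by using the UUID as the Elasticsearch document `_id`.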
49. Elasticsearch optimisation
• You need a template
• compress _source
• disable _all
• discard unwanted fields from _source / indexing
• tweak shards and replicas
• compact yesterday’s index at end of day!
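A template covering the first few bullets might look roughly like this (a sketch; exact option names depend on the Elasticsearch version of the era, where `_source` compression and `_all` were per-mapping options, and the shard/replica counts here are placeholders):

```json
{
  "template": "logstash-*",
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  },
  "mappings": {
    "_default_": {
      "_all":    { "enabled": false },
      "_source": { "compress": true }
    }
  }
}
```

Applied as an index template, this takes effect automatically for each new daily `logstash-*` index.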
55. Elasticsearch size
• 87 daily indexes
• 800Gb of data (per instance)
• Just bumped ES heap to 22G
• Just writing data - 2Gb
• Query over all indexes - 17Gb!
• Hang on - 800/87 does not = 33Gb/day!
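The mismatch is easy to check: 800Gb across 87 daily indexes averages only ~9.2Gb/day, far below the ~30Gb/day quoted at the start, so the ingest rate must have grown substantially over those months:

```python
# Quick arithmetic behind the slide (figures taken from the slides)
total_gb, days = 800, 87
average_per_day = total_gb / days

print(round(average_per_day, 1))   # 9.2 Gb/day historical average
# ...versus ~30Gb/day arriving now: the daily rate has more than tripled
```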
56. Rate has increased!
We may have problems fitting onto 5 x 600Gb discs!
61. TimedWebRequest
• Most obvious example of a standard event
• App name
• Environment
• HTTP status
• Page generation time
• Request / Response size
• Can derive loads of metrics from this!
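The event above could be modelled something like this (field and metric names are illustrative, not the talk's actual schema):

```python
from dataclasses import dataclass

@dataclass
class TimedWebRequest:
    """Hypothetical shape for the standard event described above."""
    app: str              # App name
    environment: str      # Environment
    status: int           # HTTP status
    time_ms: float        # Page generation time
    request_bytes: int    # Request size
    response_bytes: int   # Response size

def metric_names(ev: TimedWebRequest) -> list[str]:
    """Derive graphite-style metric names from a single event."""
    base = f"{ev.environment}.{ev.app}"
    return [
        f"{base}.requests",             # counter: request rate
        f"{base}.status.{ev.status}",   # counter: HTTP status rate
        f"{base}.page_time",            # timer: page generation time
    ]

ev = TimedWebRequest("shop", "prod", 200, 41.3, 512, 20480)
print(metric_names(ev))
# ['prod.shop.requests', 'prod.shop.status.200', 'prod.shop.page_time']
```

Because every app emits the same event shape, these derivations only have to be written once.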
67. statsd
• Rolls up counters and timers into metrics
• One bucket per stat, emits values every 10 seconds
• Counters: Request rate, HTTP status rate
• Timers: Total page time, mean page time, min/max page times
74. Alerting use cases:
• Replaced nsca client with standardised log pipeline
• Developers log an event and get (one!) email warning of client side exceptions
• Passive health monitoring - ‘did we log something recently?’
84. Riemann
• Ambitious plans to do more
• Web pool health (>= n nodes)
• Replace statsd
• Transit collectd data via logstash and use it to emit to graphite
• disc usage trending / prediction
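The web pool health idea is a simple predicate over per-node state, here sketched in Python rather than Riemann's Clojure config (the node names and `minimum` value are illustrative):

```python
def pool_healthy(node_states: dict[str, str], minimum: int = 3) -> bool:
    """Web pool health as on the slide: the pool is OK while at
    least `minimum` nodes report a healthy state."""
    up = sum(1 for state in node_states.values() if state == "ok")
    return up >= minimum

assert pool_healthy({"web1": "ok", "web2": "ok", "web3": "ok", "web4": "down"})
assert not pool_healthy({"web1": "ok", "web2": "down", "web3": "down", "web4": "down"})
```

Riemann suits this because it holds a rolling index of recent events per host, so the aggregate check runs against live state rather than polling each node.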
89. Metadata
• It’s all about the metadata
• Structured events are describable
• Common patterns to give standard metrics / alerting for free
• Dashboards!
The last point here is most important - ZMQ networking works entirely in a background thread that Perl knows nothing about, which means you can asynchronously ship messages with no changes to your existing codebase.