Building any moderately complex system in the cloud requires telemetry to be a first-class concern: you would probably even plan and build it first (or perhaps you wish you had!). As quoted by Werner Vogels, "Netflix is a log-generating application that happens to stream video" - logging, monitoring and alerting have been central to the success of Netflix.
At ASOS, we currently generate more than 1TB of logs daily, which get stored and analysed in our Elasticsearch cluster for monitoring and alerting purposes. The ELK stack (Elasticsearch, Logstash and Kibana) is a very popular logging and monitoring toolset, but tuning Elasticsearch to handle such a load is an art form in itself.
In this talk, we start with an overview of the ELK stack (at ASOS we use ConveyorBelt instead of Logstash, so it is ECK for us) and then share what we have learned from scaling Elasticsearch to this load: from tuning various configuration parameters to planning your shard and mapping strategy, this talk will equip you to build or tune an ELK stack in your own company.
16. @aliostad
/// aggregations
1. At source (perf counters)
2. At the storage (Circonus)
3. In the visualisation tool (Kibana)
4. In the pipeline (Riemann)
/// use cases
• Metrics (Visualisation)
• CPU, number of errors
• Response time percentiles
• Full-text search capability (logs and errors)
• Correlating across services
• Alerting when there is an SLO breach
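As a sketch of the metrics use case above, response-time percentiles can be computed server-side with a percentiles aggregation; the index pattern and the `response_time_ms` field here are illustrative assumptions, not names from the talk:

```json
POST logs-*/_search
{
  "size": 0,
  "aggs": {
    "response_time": {
      "percentiles": {
        "field": "response_time_ms",
        "percents": [50, 95, 99]
      }
    }
  }
}
```

Kibana issues essentially this kind of query under the hood when drawing percentile visualisations.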
/// elasticsearch
• Linearly-scalable and HA* search (and visualisation)
• ELK Stack
• Open Source (enterprise features require license)
• Speaks JSON
• REST API and very developer-friendly
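To illustrate the "speaks JSON" and REST-friendliness points, indexing and searching a document is a single HTTP call each. Index name, type and fields are hypothetical, and the type-in-the-path form matches the 2.x-era API this talk is from:

```
PUT logs-2016.05.01/log/1
{ "level": "ERROR", "message": "payment timeout" }

GET logs-2016.05.01/_search?q=message:timeout
```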
/// bulk api
• Always use Bulk API to index documents
• Batches of 1K-5K documents
• Watch out for error 429 and use a back-off pattern
• Check bulk rejects [change bulk queue length]
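The batching and back-off advice above can be sketched as follows. `send_bulk` is a hypothetical caller-supplied callable standing in for whatever client call performs the Bulk API request and returns its HTTP status code; it is not a real client API:

```python
import time

def index_in_batches(docs, send_bulk, batch_size=2000,
                     max_retries=5, initial_delay=1.0):
    """Index docs in 1K-5K batches, backing off on HTTP 429 (bulk rejects)."""
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        delay = initial_delay
        for _ in range(max_retries):
            status = send_bulk(batch)
            if status != 429:      # accepted (or failed for another reason)
                break
            time.sleep(delay)      # bulk queue full: back off before retrying
            delay *= 2             # exponential back-off
        else:
            raise RuntimeError("bulk queue still rejecting after retries")
```

On a 429 the batch is retried with exponentially growing delays, which gives the cluster's bulk queue time to drain instead of amplifying the overload.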
[Diagram: cluster topology - client, kibana, master and data node roles, with traffic flowing through the client nodes]
/// our setup
• 3x master
• 3x client
• 2x kibana
• 20x data (hot)
• 10x data (warm)
• Deployed via ARM templates (Desired State Configuration)
/// hot/warm
• Hot => CPU, Warm => Memory
• Index Allocation/Routing
• At the index:
"index.routing.allocation.require.box_type" : "warm"
• At the node (elasticsearch.yml)
box_type: warm
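Putting the two settings together, migrating an aging index from the hot to the warm tier is a single settings update (the index name is illustrative):

```
PUT logs-2016.04.28/_settings
{
  "index.routing.allocation.require.box_type": "warm"
}
```

Elasticsearch then relocates that index's shards onto nodes whose elasticsearch.yml declares `box_type: warm`.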
/// administration
• As with any system: logs, slow query logs, etc
• top, htop, iostat
• collectd + local logstash
• two clusters, each watching the other
• curator for hot/cold and deleting old indices
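A sketch of what the curator step could look like as a Curator 4.x-style action file; the ages, the index-name time pattern and the exact option names are assumptions and should be checked against your Curator version:

```yaml
actions:
  1:
    action: allocation
    description: Move indices older than 2 days to warm nodes
    options:
      key: box_type
      value: warm
      allocation_type: require
    filters:
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'
        unit: days
        unit_count: 2
  2:
    action: delete_indices
    description: Delete indices older than 30 days
    filters:
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'
        unit: days
        unit_count: 30
```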
/// watcher notes
• All watches get executed on the active master
• Use Action Throttling to limit alerts
• Use watch templates when you see common patterns
• Use transforms and metadata to include context in actions/emails
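As a sketch of these points, a watch with a throttled email action might look like this on 2.x-era Watcher; the threshold, index pattern and addresses are made up for illustration:

```json
PUT _watcher/watch/error_rate
{
  "trigger": { "schedule": { "interval": "5m" } },
  "input": {
    "search": {
      "request": {
        "indices": ["logs-*"],
        "body": { "query": { "match": { "level": "ERROR" } } }
      }
    }
  },
  "condition": {
    "compare": { "ctx.payload.hits.total": { "gt": 100 } }
  },
  "actions": {
    "email_ops": {
      "throttle_period": "30m",
      "email": {
        "to": "ops@example.com",
        "subject": "SLO breach: {{ctx.payload.hits.total}} errors in 5m"
      }
    }
  }
}
```

`throttle_period` is the action throttling mentioned above: even if the condition keeps firing every 5 minutes, the email goes out at most once per 30 minutes.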
/// Do you speak CAP?
• Consistency? Treat all data as dispensable; back up any data that is mastered in Elasticsearch. It is not a document DB.
• Highly available? For >99.9% availability, use redundancy
• Partition-intolerant? Node intercommunication is highly chatty; ideally keep nodes in the same data centre and even the same VPC (AWS)/VNet (Azure)
/// Beware
• Split brain common
• Data corruption possible
• Backup data that gets mastered in ES (kibana indices)
• It seems the safest route to high availability is redundancy (expensive)
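On the split-brain point, the standard mitigation in the 2.x-era clusters this talk describes is requiring a majority quorum of master-eligible nodes in elasticsearch.yml (this setting was removed in 7.x, where quorum handling became automatic):

```yaml
# With 3 master-eligible nodes, quorum = (3 / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2
```

With fewer than two reachable master-eligible nodes, a partitioned side refuses to elect a master rather than forming a second, divergent cluster.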