Building any moderately complex system in the cloud requires telemetry to be a first-class concern: you would probably even plan and build it first (or perhaps you wish you had!). As quoted by Werner Vogels, "Netflix is a log-generating application that happens to stream video" - logging, monitoring and alerting have been central to the success of Netflix.
At ASOS, we currently generate more than 1TB of logs daily, which get stored and analysed in our Elasticsearch cluster for monitoring and alerting purposes. The ELK stack (Elasticsearch, Logstash and Kibana) is a very popular logging and monitoring toolset, but tuning Elasticsearch to handle such a load is an art form in itself.
In this talk, we start with an overview of the ELK stack (at ASOS we use ConveyorBelt instead of Logstash, so it is ECK for us) and then share what we have learned from scaling Elasticsearch to this load: from tuning various configuration parameters to planning your shard and mapping strategy, this talk will equip you to build or tune an ELK stack in your own company.
16. @aliostad
/// aggregations
1. At source (perf counters)
2. At the storage (Circonus)
3. In the visualisation tool (Kibana)
4. In the pipeline (Riemann)
/// use cases
• Metrics (Visualisation)
• CPU, number of errors
• Response time percentiles
• Full-text search capability (logs and errors)
• Correlating across services
• Alerting when there is an SLO breach
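As a sketch of the metrics use case above, response-time percentiles can be computed server-side with a percentiles aggregation; the index pattern and the `response_time_ms` field here are illustrative assumptions, not names from the talk:

```json
POST logs-*/_search
{
  "size": 0,
  "aggs": {
    "response_time": {
      "percentiles": {
        "field": "response_time_ms",
        "percents": [50, 95, 99]
      }
    }
  }
}
```

Kibana issues essentially this kind of query under the hood when drawing percentile visualisations.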
/// elasticsearch
• Linearly-scalable and HA* search (and visualisation)
• ELK Stack
• Open Source (enterprise features require license)
• Speaks JSON
• REST API and very developer-friendly
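To illustrate the "speaks JSON" and REST-friendliness points, indexing and searching a document is a single HTTP call each. Index name, type and fields are hypothetical, and the type-in-the-path form matches the 2.x-era API this talk is from:

```
PUT logs-2016.05.01/log/1
{ "level": "ERROR", "message": "payment timeout" }

GET logs-2016.05.01/_search?q=message:timeout
```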
/// bulk api
• Always use Bulk API to index documents
• Batches of 1K-5K documents
• Watch out for error 429 and use a back-off pattern
• Check bulk rejects [change bulk queue length]
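The batching and back-off advice above can be sketched as follows. `send_bulk` is a hypothetical caller-supplied callable standing in for whatever client call performs the Bulk API request and returns its HTTP status code; it is not a real client API:

```python
import time

def index_in_batches(docs, send_bulk, batch_size=2000,
                     max_retries=5, initial_delay=1.0):
    """Index docs in 1K-5K batches, backing off on HTTP 429 (bulk rejects)."""
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        delay = initial_delay
        for _ in range(max_retries):
            status = send_bulk(batch)
            if status != 429:      # accepted (or failed for another reason)
                break
            time.sleep(delay)      # bulk queue full: back off before retrying
            delay *= 2             # exponential back-off
        else:
            raise RuntimeError("bulk queue still rejecting after retries")
```

On a 429 the batch is retried with exponentially growing delays, which gives the cluster's bulk queue time to drain instead of amplifying the overload.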
[Diagram: cluster topology - client, kibana, master and data node roles, with traffic flowing through the client nodes]
/// our setup
• 3x master
• 3x client
• 2x kibana
• 20x data (hot)
• 10x data (warm)
• Deployed via ARM templates (Desired State Configuration)
/// hot/warm
• Hot => CPU, Warm => Memory
• Index Allocation/Routing
• At the index:
"index.routing.allocation.require.box_type" : "warm"
• At the node (elasticsearch.yml)
box_type: warm
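Putting the two settings together, migrating an aging index from the hot to the warm tier is a single settings update (the index name is illustrative):

```
PUT logs-2016.04.28/_settings
{
  "index.routing.allocation.require.box_type": "warm"
}
```

Elasticsearch then relocates that index's shards onto nodes whose elasticsearch.yml declares `box_type: warm`.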
/// administration
• As with any system: logs, slow query logs, etc
• top, htop, iostat
• collectd + local logstash
• two clusters, each watching the other
• curator for hot/cold and deleting old indices
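A sketch of what the curator step could look like as a Curator 4.x-style action file; the ages, the index-name time pattern and the exact option names are assumptions and should be checked against your Curator version:

```yaml
actions:
  1:
    action: allocation
    description: Move indices older than 2 days to warm nodes
    options:
      key: box_type
      value: warm
      allocation_type: require
    filters:
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'
        unit: days
        unit_count: 2
  2:
    action: delete_indices
    description: Delete indices older than 30 days
    filters:
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'
        unit: days
        unit_count: 30
```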
/// watcher notes
• All watches get executed on the active master
• Use Action Throttling to limit alerts
• Use watch templates when you see common patterns
• Use transforms and metadata to include context in actions/emails
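As a sketch of these points, a watch with a throttled email action might look like this on 2.x-era Watcher; the threshold, index pattern and addresses are made up for illustration:

```json
PUT _watcher/watch/error_rate
{
  "trigger": { "schedule": { "interval": "5m" } },
  "input": {
    "search": {
      "request": {
        "indices": ["logs-*"],
        "body": { "query": { "match": { "level": "ERROR" } } }
      }
    }
  },
  "condition": {
    "compare": { "ctx.payload.hits.total": { "gt": 100 } }
  },
  "actions": {
    "email_ops": {
      "throttle_period": "30m",
      "email": {
        "to": "ops@example.com",
        "subject": "SLO breach: {{ctx.payload.hits.total}} errors in 5m"
      }
    }
  }
}
```

`throttle_period` is the action throttling mentioned above: even if the condition keeps firing every 5 minutes, the email goes out at most once per 30 minutes.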
/// Do you speak CAP?
• Consistency? Treat all data as dispensable; back up any data that is mastered in Elasticsearch. It is not a document DB.
• Highly available? For >99.9% availability, use redundancy
• Partition-intolerant? Node intercommunication is highly chatty; ideally keep nodes in the same data centre and even the same VPC (AWS)/VNet (Azure)
/// Beware
• Split brain common
• Data corruption possible
• Backup data that gets mastered in ES (kibana indices)
• It seems the safest route to high availability is redundancy (expensive)
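On the split-brain point, the standard mitigation in the 2.x-era clusters this talk describes is requiring a majority quorum of master-eligible nodes in elasticsearch.yml (this setting was removed in 7.x, where quorum handling became automatic):

```yaml
# With 3 master-eligible nodes, quorum = (3 / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2
```

With fewer than two reachable master-eligible nodes, a partitioned side refuses to elect a master rather than forming a second, divergent cluster.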