From our Feb 25, 2016 webcast on operating Elasticsearch at scale, the metrics to monitor, and how to create low-noise, meaningful alerts on Elasticsearch performance.
SignalFx Elasticsearch Metrics Monitoring and Alerting
1. Monitoring Elasticsearch Performance and Capacity
Aneel Lakhani
Mahdi Ben Hamida
2. WHY SIGNALFX: MONITORING FOR MODERN INFRASTRUCTURE
SignalFlow™ Streaming & Historical Analytics
• Real-time visibility and correlation across the stack
• Compare incoming patterns against historical patterns in real-time
• No query language needed
• Intelligent & dynamic alerting
• Resolution down to 1s
• Use existing investments in metrics, events and logs
• Prebuilt integrations and content
SYSTEM METRICS & EVENTS
APP METRICS & EVENTS
USER METRICS & EVENTS
BUSINESS METRICS & EVENTS
3. Elasticsearch at SignalFx
• Used for storing metadata about metrics, events, and other
objects in the system
• Source of truth is Cassandra. Elasticsearch allows us to do
ad-hoc queries and full-text search
• 4 clusters in production (+more in testing/staging)
• Biggest cluster has 75 nodes (72 data nodes + 3 dedicated
master nodes)
• ~20TB of data, half a billion documents and growing!
• 24 shards with 2 replicas (moving to 168 shards as we speak)
• Running in EC2 across 3 availability zones
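A quick back-of-the-envelope check of the shard layout above, as a minimal sketch. It assumes "2 replicas" means two replica copies of every primary shard and that shard copies are spread evenly over the 72 data nodes; neither assumption is stated on the slide, and the function name is hypothetical.

```python
def shard_copies_per_node(primaries: int, replicas: int, data_nodes: int) -> float:
    """Average number of shard copies (primaries plus replicas) per data node."""
    total_copies = primaries * (1 + replicas)  # each primary plus its replica copies
    return total_copies / data_nodes


# Figures taken from the slide: 24 shards today, 168 planned, 72 data nodes.
current = shard_copies_per_node(24, 2, 72)
planned = shard_copies_per_node(168, 2, 72)
print(current, planned)
```

Under those assumptions the current layout works out to one shard copy per data node and the planned layout to seven, which would give the cluster more room to rebalance.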
4. Monitoring Elasticsearch
• Metrics are collected from ES nodes using the open source
collectd agent
• collectd uses the ES REST API to fetch metrics at a fixed,
configurable interval
• Metrics are sent to SignalFx
• By default, SignalFx will create dashboards showing the most
important metrics of Elasticsearch
• We monitor infrastructure, cluster, node and index level
metrics
• We have alerts set up to notify us when something is wrong
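The collection step above can be sketched in a few lines of Python. This is not the collectd plugin itself, only an illustration of polling the Elasticsearch node stats endpoint and reducing the response to a few gauges. The URL, the function names `fetch_node_stats` and `summarize_nodes`, and the sample document are assumptions made for the example.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node on the default HTTP port


def fetch_node_stats(base_url: str = ES_URL) -> dict:
    """GET /_nodes/stats and decode the JSON body (needs a reachable cluster)."""
    with urllib.request.urlopen(f"{base_url}/_nodes/stats", timeout=10) as resp:
        return json.load(resp)


def summarize_nodes(stats: dict) -> dict:
    """Reduce a node-stats response to a few per-node values keyed by node name."""
    summary = {}
    for node in stats.get("nodes", {}).values():
        pools = node.get("thread_pool", {})
        summary[node["name"]] = {
            "heap_used_percent": node["jvm"]["mem"]["heap_used_percent"],
            "queued": sum(p.get("queue", 0) for p in pools.values()),
            "rejected": sum(p.get("rejected", 0) for p in pools.values()),
        }
    return summary


# Hand-written sample in the shape of a node-stats response (not real data).
sample = {
    "nodes": {
        "abc123": {
            "name": "data-1",
            "jvm": {"mem": {"heap_used_percent": 62}},
            "thread_pool": {
                "search": {"queue": 4, "rejected": 0},
                "bulk": {"queue": 10, "rejected": 3},
            },
        }
    }
}
print(summarize_nodes(sample))
```

In a real poller, `summarize_nodes(fetch_node_stats())` would run on the fixed interval mentioned above and the resulting values would be forwarded to the metrics backend.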
5. Key Performance Metrics
• CPU load
• JVM heap, garbage collection
• Indexing and query rates and their respective latencies
• Segment merges
• Thread pool queues and rejections
• Filter and field data cache sizes
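Rates and latencies such as those listed above are usually derived from cumulative counters rather than read directly. The sketch below assumes two successive samples of a count and a cumulative time in milliseconds (for example `index_total` and `index_time_in_millis` from the node stats API); the function name and the sample numbers are made up for illustration.

```python
from typing import Optional, Tuple


def rate_and_latency(
    prev_total: int,
    prev_millis: int,
    curr_total: int,
    curr_millis: int,
    interval_s: float,
) -> Optional[Tuple[float, float]]:
    """Return (operations per second, average ms per operation) between two samples.

    Returns None when the counter went backwards (e.g. a node restart)
    or the interval is not positive.
    """
    ops = curr_total - prev_total
    if ops < 0 or interval_s <= 0:
        return None
    rate = ops / interval_s
    latency_ms = (curr_millis - prev_millis) / ops if ops else 0.0
    return rate, latency_ms


# Two invented samples taken 10 seconds apart.
result = rate_and_latency(1000, 5000, 1600, 6200, 10.0)
print(result)
```

The same calculation applies to query counters; handling the counter reset explicitly avoids a large negative spike after a node restarts.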
6. Key Alerts
• High CPU load, low disk storage
• Master node availability
• Cluster state (green/yellow/red)
• Unassigned shards
• Sustained thread pool rejections
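One way to read "sustained" in the last bullet is that the condition must hold for several consecutive samples before an alert fires, so a single burst of rejections stays quiet. The sketch below illustrates that idea only; it is not SignalFx's detector logic, and the function name, threshold and window are assumptions.

```python
from typing import Sequence


def sustained_breach(values: Sequence[float], threshold: float, window: int) -> bool:
    """True if each of the last `window` samples is strictly above `threshold`."""
    recent = list(values)[-window:]
    return len(recent) == window and all(v > threshold for v in recent)


# Invented rejections-per-interval series.
burst = [0, 12, 0, 0, 3]   # isolated spikes
steady = [0, 4, 9, 7]      # rejections in each of the last three intervals

print(sustained_breach(burst, 0, 3), sustained_breach(steady, 0, 3))
```

A longer window trades alert latency for less noise; the right value depends on the collection interval.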
7. DEMO
8. THANK YOU!
SIGN UP FOR A TRIAL AT:
signalfx.com
9. APPENDIX
10. MODERN APPS ARE FUNDAMENTALLY DIFFERENT
More scale-out, more open-source, and more ephemeral infrastructure
LEGACY APPS
Monolithic, scale-up, running on enterprise-grade infrastructure
MODERN APPS
Elastic, scale-out, running on ephemeral infrastructure
[Diagram labels: Apps, VM, Checkout Service, IT, Public/Private Cloud (w/ Self-Service APIs)]
11. HOST SPECIFIC ALERTS GENERATE NOISE
Noisy, reactive monitoring
CHALLENGE
• Too many alerts fire at once for a cluster-wide problem
• Is the machine down because we scaled
down the cluster or because we had a real
problem?
• Do we even care if a single node is down?
• Very high overhead to set up and reconfigure
monitoring every time you add/remove
nodes in a cluster
What matters? Where to start?
12. BUT A CENTRALIZED VIEW IS CRITICAL
2/3 OF MACHINES DOWN
(CAPACITY DOWN TO 1/3)
LOAD INCREASED BY 2X
YOU WANT TO BE ALERTED!
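The contrast between host-specific alerts and a centralized view can be sketched as a single cluster-level check: alert on the fraction of expected nodes still reporting instead of on each host. The 0.8 threshold and the function names are assumptions, and the 75-node figure is borrowed from the cluster described earlier purely for illustration.

```python
def capacity_fraction(expected_nodes: int, reporting_nodes: int) -> float:
    """Fraction of the expected cluster that is currently reporting metrics."""
    if expected_nodes <= 0:
        raise ValueError("expected_nodes must be positive")
    return reporting_nodes / expected_nodes


def should_alert(expected_nodes: int, reporting_nodes: int, min_fraction: float = 0.8) -> bool:
    """Alert only when reporting capacity drops below `min_fraction` (assumed 0.8)."""
    return capacity_fraction(expected_nodes, reporting_nodes) < min_fraction


# One node missing out of 75 stays quiet; two thirds missing fires.
print(should_alert(75, 74), should_alert(75, 25))
```

A check like this does not need reconfiguring when nodes are added or removed, as long as the expected node count tracks the intended cluster size.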
13. BUILD ACTIONABLE & TIMELY ALERTS
It is the only way to do quality alerting
USE ANALYTICS TO CALCULATE THE NUMBER OF DAYS OF
DISK CAPACITY YOU HAVE LEFT ACROSS A SHARDED DATA
STORE – ALERT WHEN YOU HAVE < 7 DAYS
[Chart: disk usage (0%, 83%, 100%) over time t, with an "Alert here!" marker]
PROACTIVELY DISCOVER A DISK ISSUE BEFORE IT CRIPPLES
YOUR SYSTEM
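The "days of disk capacity left" calculation can be approximated with a linear trend. The sketch below fits a least-squares slope to recent usage samples and extrapolates from the latest reading to 100%. It is one simple way to do it, not necessarily the analytics used in the product; the function name and the sample data are invented, and only the 7-day threshold comes from the slide.

```python
from typing import List, Optional, Tuple


def days_until_full(samples: List[Tuple[float, float]], capacity: float = 100.0) -> Optional[float]:
    """Estimate days until `capacity` percent is reached.

    `samples` is a list of (day, used_percent) pairs. Returns None when there
    are too few samples or usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Least-squares slope: percent of disk consumed per day.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    _, last_used = samples[-1]
    return (capacity - last_used) / slope


# Invented daily readings growing by 3 points a day.
usage = [(0, 71.0), (1, 74.0), (2, 77.0), (3, 80.0), (4, 83.0)]
remaining = days_until_full(usage)
alert = remaining is not None and remaining < 7  # threshold from the slide
print(remaining, alert)
```

For a sharded store, the same estimate would be computed per node (or on the fullest node), since a single full disk can cause trouble before the cluster average looks alarming.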
14. GET STARTED QUICKLY WITH INTEGRATIONS
For platforms, technologies and 3rd party business processes
GROWING AND VIBRANT ECOSYSTEM, PRE-BUILT CONTENT USING ANALYTICS