Aneel Lakhani
Mahdi Ben Hamida
Monitoring Elasticsearch
Performance and Capacity
WHY SIGNALFX: MONITORING FOR MODERN INFRASTRUCTURE
SignalFlow™ Streaming & Historical Analytics
Real-time visibility and correlation across the stack
Compare incoming patterns against historical patterns in real time
No query language needed
Intelligent & dynamic alerting
Resolution down to 1s
Use existing investments in metrics, events and logs
Prebuilt integrations and content
SYSTEM METRICS & EVENTS
APP METRICS & EVENTS
USER METRICS & EVENTS
BUSINESS METRICS & EVENTS
Elasticsearch at SignalFx
• Used for storing metadata about metrics, events, and other objects in the system
• Source of truth is Cassandra. Elasticsearch allows us to do ad-hoc queries and full-text search
• 4 clusters in production (+ more in testing/staging)
• Biggest cluster has 75 nodes (72 data nodes + 3 dedicated master nodes)
• ~20 TB of data, half a billion documents and growing!
• 24 shards with 2 replicas (moving to 168 shards as we speak)
• Running in EC2 across 3 availability zones
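As a rough illustration of what the shard change means per node, the sketch below counts shard copies before and after the move. It assumes the "24 shards" are primaries, that each has 2 replicas, and that copies spread evenly over the 72 data nodes. The slide does not state the placement, so treat the per-node figures as back-of-the-envelope only.

```python
def shard_copies(primaries: int, replicas: int) -> int:
    """Total shard copies: every primary plus its replicas."""
    return primaries * (1 + replicas)


def copies_per_node(primaries: int, replicas: int, data_nodes: int) -> float:
    """Average shard copies per data node, assuming an even spread."""
    return shard_copies(primaries, replicas) / data_nodes


DATA_NODES = 72  # from the slide: 72 data nodes in the biggest cluster
REPLICAS = 2     # from the slide: 2 replicas

for primaries in (24, 168):
    total = shard_copies(primaries, REPLICAS)
    per_node = copies_per_node(primaries, REPLICAS, DATA_NODES)
    print(f"{primaries} primaries -> {total} copies, {per_node:.1f} per data node")
```

Under these assumptions, 24 primaries give one shard copy per data node, and 168 primaries give seven.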
Monitoring Elasticsearch
• Metrics are collected from ES nodes using the open-source collectd agent
• collectd uses the ES REST API to fetch metrics at a fixed, configurable interval
• Metrics are sent to SignalFx
• By default, SignalFx creates dashboards showing the most important Elasticsearch metrics
• We monitor infrastructure-, cluster-, node- and index-level metrics
• We have alerts set up to notify us when something is wrong
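The collection step can be sketched in a few lines of Python. This is not the collectd plugin itself, only a minimal illustration of polling the Elasticsearch REST API (`GET /_nodes/stats`) and flattening a few fields. The base URL, the lack of authentication, and the helper names `fetch_json` and `flatten_node_stats` are assumptions for illustration.

```python
import json
from urllib.request import urlopen

ES_URL = "http://localhost:9200"  # assumption: a local node without auth


def fetch_json(path, base_url=ES_URL, timeout=5.0):
    """GET a JSON document from the Elasticsearch REST API."""
    with urlopen(base_url + path, timeout=timeout) as resp:
        return json.load(resp)


def flatten_node_stats(stats):
    """Turn a /_nodes/stats response into {node_name: {metric: value}}."""
    result = {}
    for node in stats.get("nodes", {}).values():
        indices = node.get("indices", {})
        metrics = {
            "jvm.heap_used_percent": node.get("jvm", {}).get("mem", {}).get("heap_used_percent"),
            "indexing.index_total": indices.get("indexing", {}).get("index_total"),
            "search.query_total": indices.get("search", {}).get("query_total"),
        }
        # Thread pool names vary by Elasticsearch version, so report them all.
        for pool, values in node.get("thread_pool", {}).items():
            metrics[f"thread_pool.{pool}.queue"] = values.get("queue")
            metrics[f"thread_pool.{pool}.rejected"] = values.get("rejected")
        result[node.get("name", "unknown")] = metrics
    return result


# Usage against a live node (run on a fixed interval, as collectd does):
#   metrics = flatten_node_stats(fetch_json("/_nodes/stats"))
```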
Key Performance Metrics
• CPU load
• JVM heap, garbage collection
• Indexing, query rates and respective latencies
• Segment merges
• Thread pool queues and rejections
• Filter and field data cache sizes
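Elasticsearch reports indexing and query activity as cumulative counters (for example `index_total` with `index_time_in_millis`, and `query_total` with `query_time_in_millis`), so rates and latencies have to be derived from two successive samples. The sketch below shows one way to do that. The function name and the dictionary layout are illustrative assumptions, not part of any SignalFx or collectd API.

```python
def rate_and_latency(prev, curr, interval_s):
    """Derive ops/second and average ms/op from two cumulative samples.

    Each sample is a dict with "total" (operation count) and
    "time_in_millis" (cumulative time spent), mirroring counter pairs
    such as index_total / index_time_in_millis.
    Returns None when a counter went backwards (e.g. a node restart).
    """
    delta_ops = curr["total"] - prev["total"]
    delta_ms = curr["time_in_millis"] - prev["time_in_millis"]
    if delta_ops < 0 or delta_ms < 0:
        return None
    rate = delta_ops / interval_s
    latency_ms = delta_ms / delta_ops if delta_ops else 0.0
    return rate, latency_ms


previous = {"total": 1000, "time_in_millis": 5000}
current = {"total": 1600, "time_in_millis": 8000}
print(rate_and_latency(previous, current, interval_s=10))
```

With a 10-second interval, 600 new operations taking 3000 ms in total work out to 60 operations per second at 5 ms each.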
Key Alerts
• High CPU load, low disk storage
• Master nodes availability
• Cluster state (green/yellow/red)
• Unassigned shards
• Sustained thread pool rejections
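Two of these alerts can be sketched as plain Python checks. The first reads the `status` and `unassigned_shards` fields returned by `GET /_cluster/health`. The second treats rejections as "sustained" when the cumulative rejected counter grows in every one of the last few polling intervals. The window size and the class and function names are assumptions for illustration, not how SignalFx detectors are defined.

```python
from collections import deque


def cluster_health_alerts(health):
    """Return alert messages for a /_cluster/health response dict."""
    alerts = []
    status = health.get("status")
    if status in ("yellow", "red"):
        alerts.append(f"cluster status is {status}")
    unassigned = health.get("unassigned_shards", 0)
    if unassigned > 0:
        alerts.append(f"{unassigned} unassigned shards")
    return alerts


class SustainedRejections:
    """Fires when the rejected counter grew in each of the last `window` intervals."""

    def __init__(self, window=5):
        self.window = window
        self.prev = None
        self.recent = deque(maxlen=window)

    def observe(self, rejected_total):
        # Record whether the cumulative counter increased since the last poll.
        if self.prev is not None:
            self.recent.append(rejected_total > self.prev)
        self.prev = rejected_total
        return len(self.recent) == self.window and all(self.recent)
```

Requiring growth across a whole window keeps a single burst of rejections from paging anyone, which matches the "sustained" wording of the alert.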
DEMO
THANK YOU!
SIGN UP FOR A TRIAL AT:
signalfx.com
APPENDIX
MODERN APPS ARE FUNDAMENTALLY DIFFERENT
More scale-out, more open-source, and more ephemeral infrastructure
LEGACY APPS: Monolithic, scale-up, running on enterprise-grade infrastructure
MODERN APPS: Elastic, scale-out, running on ephemeral infrastructure
[Diagram labels: Apps, VM, Checkout Service, VM (×7), IT, Public/Private Cloud (w/ Self-Service APIs)]
HOST SPECIFIC ALERTS GENERATE NOISE
Noisy, reactive monitoring
CHALLENGE
• Too many alerts fire at once for a cluster-wide problem
• Is the machine down because we scaled down the cluster or because we had a real problem?
• Do we even care if a single node is down?
• Very high overhead to set up and reconfigure monitoring every time you add/remove nodes in a cluster
What matters? Where to start?
BUT A CENTRALIZED VIEW IS CRITICAL
2/3 OF MACHINES DOWN (CAPACITY DOWN TO 1/3)
LOAD INCREASED BY 2X
YOU WANT TO BE ALERTED!
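The arithmetic behind this slide can be sketched as a cluster-level check instead of per-host alerts: with capacity down to 1/3 and load up 2x, each surviving machine carries 2 / (1/3) = 6 times its normal load. The function names and the alert threshold below are illustrative assumptions.

```python
def per_node_load_factor(nodes_expected, nodes_up, load_multiplier=1.0):
    """How much more load each surviving node carries versus normal."""
    if nodes_up <= 0:
        return float("inf")
    return load_multiplier * nodes_expected / nodes_up


def should_alert(nodes_expected, nodes_up, load_multiplier=1.0, max_factor=2.0):
    """Alert on the aggregate view, not on any single host being down."""
    return per_node_load_factor(nodes_expected, nodes_up, load_multiplier) > max_factor


# 2/3 of 72 machines down while load doubles: every node does 6x the work.
print(per_node_load_factor(72, 24, load_multiplier=2.0))
# One node out of 72 down at normal load: no alert.
print(should_alert(72, 71))
```

A single lost node barely moves the factor, so it stays quiet, while the cluster-wide case trips the threshold.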
USE ANALYTICS TO CALCULATE THE NUMBER OF DAYS OF DISK CAPACITY YOU HAVE LEFT ACROSS A SHARDED DATA STORE – ALERT WHEN YOU HAVE < 7 DAYS
[Chart: DISK USAGE over time t, with 0%, 83% and 100% marked and an "Alert here!" callout]
BUILD ACTIONABLE & TIMELY ALERTS
It is the only way to do quality alerting
PROACTIVELY DISCOVER A DISK ISSUE BEFORE IT CRIPPLES YOUR SYSTEM
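One simple way to estimate "days of disk left" is to fit a straight line to recent usage samples and extrapolate to 100%. The sketch below does this with a least-squares slope in plain Python. It is not SignalFlow and not necessarily the method used in the demo; the sample data, the linear-growth assumption and the function names are illustrative only.

```python
def slope_per_day(samples):
    """Least-squares slope of (day, used_fraction) samples."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    num = sum((t - mean_t) * (u - mean_u) for t, u in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den if den else 0.0


def days_left(samples, capacity=1.0):
    """Days until usage reaches capacity if the current trend continues."""
    growth = slope_per_day(samples)
    if growth <= 0:
        return float("inf")  # flat or shrinking usage never fills the disk
    return (capacity - samples[-1][1]) / growth


def needs_alert(samples, threshold_days=7.0):
    return days_left(samples) < threshold_days


# Hypothetical usage: 74% -> 83% over three days, i.e. about 3% per day.
usage = [(0, 0.74), (1, 0.77), (2, 0.80), (3, 0.83)]
print(round(days_left(usage), 1), needs_alert(usage))
```

For a sharded store, the same estimate could be run per node and the alert raised on the smallest result, since the fullest node runs out first.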
GET STARTED QUICKLY WITH INTEGRATIONS
For platforms, technologies and 3rd party business processes
GROWING AND VIBRANT ECOSYSTEM, PRE-BUILT CONTENT USING ANALYTICS

SignalFx Elasticsearch Metrics Monitoring and Alerting
