Metrics Aggregation &
Monitoring System at Exotel
Vishnu Gajendran
SDE 3
Exotel
● Provide voice & SMS APIs
● IVR and last-mile delivery for e-commerce and other online services are
some classic use cases
● nOTP, a missed call based OTP verification service
● Other enterprise products tailor-made for specific customer use-cases
Metrics Aggregation
Need
● Add instrumentation for services that run on hundreds of servers in the cloud
● Collect stats about the servers where the applications run
● Visualize & monitor the metrics
Goal
Build a reliable, scalable metrics aggregation system
that supports rich query capabilities
Sample Metric Request
{
  "metric": "make_call_api",
  "tags": {
    "tenant_id": "tenant1",
    "http_code": "200"
  },
  "fields": {
    "latency_ms": "10"
  }
}
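A minimal sketch of how an application might build such a datapoint, in Python. The build_metric helper and the one-JSON-line-per-datapoint serialization are assumptions for illustration, not Exotel's client code. The latency is passed as a number here (also an assumption) so that it could be aggregated numerically.

```python
import json


def build_metric(metric, tags, fields):
    """Return one metric datapoint serialized as a single JSON line.

    Hypothetical helper: tags are arbitrary key/value labels,
    fields carry the measured values.
    """
    if not metric:
        raise ValueError("metric name is required")
    doc = {"metric": metric, "tags": dict(tags), "fields": dict(fields)}
    # sort_keys keeps the output stable, which helps when diffing or testing
    return json.dumps(doc, sort_keys=True)


line = build_metric(
    "make_call_api",
    tags={"tenant_id": "tenant1", "http_code": "200"},
    fields={"latency_ms": 10},
)
print(line)
```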
Functional requirements
● The metric document should be JSON. Users should be able to add any number of
arbitrary tags (with high cardinality) to a metric datapoint
● Rich query capabilities
Non-Functional requirements
1. Reliable
2. Scalable
3. Low end to end latency
4. Flexible
Options explored
1. InfluxDB
2. Elasticsearch (with X-Pack license)
3. Prometheus
Downsides
1. Expensive
2. Does not support tags with high cardinality (e.g. InfluxDB free version)
3. Single point of failure (e.g. Prometheus, InfluxDB free version)
4. Not scalable (e.g. Prometheus, InfluxDB free version)
Design
Rsyslog
● Default system logging/shipping service on most Linux distributions
Pros
● Robust, with very low CPU & memory consumption compared to other shipper
services
● Metrics are batched & compressed on the local host before being sent to Kafka
Cons
● Configuring Rsyslog is a bit painful
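The effect of batching and compressing before shipping can be illustrated with a small Python sketch. This is not Rsyslog configuration; the helper names and the choice of gzip are assumptions made only to show why a batch of similar JSON lines compresses well.

```python
import gzip
import json


def compress_batch(docs):
    """Join datapoints as newline-delimited JSON and gzip the whole batch."""
    payload = "\n".join(json.dumps(d) for d in docs).encode("utf-8")
    return gzip.compress(payload)


def decompress_batch(blob):
    """Inverse of compress_batch: return the list of datapoints."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines()]


# Hypothetical batch: 200 datapoints that differ only in latency
batch = [
    {"metric": "make_call_api",
     "tags": {"tenant_id": "tenant1", "http_code": "200"},
     "fields": {"latency_ms": i % 50}}
    for i in range(200)
]
raw_size = len("\n".join(json.dumps(d) for d in batch).encode("utf-8"))
blob = compress_batch(batch)
print(raw_size, len(blob))
```

Because metric names and tag keys repeat across datapoints, the compressed batch is much smaller than the raw lines, which is the saving the slide refers to.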
Telegraf
● Metrics collector agent
● Robust, with very low CPU & memory consumption
● Can collect metrics from 80+ systems such as MySQL, httpd, etc.
● Easy to configure
Kafka
● Highly reliable & scalable message broker service
Pros
● Reliable
● Can be scaled up to handle millions of writes per sec
● Decouple producers & consumers
● Enable fault tolerant processing
Cons
● May not be trivial to operate
Kafka optimizations
● Configure number of partitions based on read/write throughput requirements
● Define multiple message queues and categorize/prioritize metrics
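Both optimizations can be sketched in Python. The sizing function follows a common rule of thumb (enough partitions to satisfy whichever of the producer or consumer side is slower per partition); the throughput numbers, topic names and the "priority" tag are hypothetical and do not come from the slides.

```python
import math


def partitions_needed(target_mb_s, write_mb_s_per_partition, read_mb_s_per_partition):
    """Estimate a partition count from a target throughput.

    Takes the larger of the counts required by the write side and the
    read side, with a floor of one partition.
    """
    by_write = math.ceil(target_mb_s / write_mb_s_per_partition)
    by_read = math.ceil(target_mb_s / read_mb_s_per_partition)
    return max(by_write, by_read, 1)


# Hypothetical topic names, one per priority class
TOPIC_BY_PRIORITY = {"high": "metrics-high", "default": "metrics-default"}


def topic_for(metric_doc):
    """Pick the topic for a datapoint from an assumed 'priority' tag."""
    priority = metric_doc.get("tags", {}).get("priority", "default")
    return TOPIC_BY_PRIORITY.get(priority, TOPIC_BY_PRIORITY["default"])


print(partitions_needed(30, 10, 5))
print(topic_for({"metric": "make_call_api", "tags": {"priority": "high"}}))
```

Separate topics let consumers of high-priority metrics scale and fail independently of the bulk traffic.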
Logstash
● Data processor that can read data from various sources, process and write it
to various data stores
Pros
● Stateless service and we can scale up/down data processing based on the
metrics ingestion rate
Cons
● Very heavy on memory & CPU; sometimes crashes with out-of-memory (OOM) errors
Elasticsearch
● JSON datastore that supports rich query capabilities
Pros
● Reliable
● Can be scaled up to handle millions of writes per sec
● Supports tags with high cardinality
● Rich query capabilities with support for complex aggregations and nested
bucketing
Cons
● The storage is not well optimized for the metrics aggregation use case (no
support for metrics downsampling, etc.)
● Not trivial to operate
ES optimizations
1. Do not store the raw JSON document, to save disk space
2. Enable indexing (searching) only on tags, not on fields
3. Force merge index segments (compaction) during off-peak hours
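A sketch of what these three optimizations could look like as Elasticsearch settings, written as Python dictionaries. The template names and the "@timestamp" field are assumptions, and the mapping uses the typeless style of Elasticsearch 7 and later (older versions nest it under a type name). It is not the mapping used at Exotel.

```python
import json

# Hypothetical mapping body; the tags.* / fields.* paths follow the sample metric
METRICS_MAPPING = {
    "mappings": {
        # 1. Do not keep the raw JSON document on disk
        "_source": {"enabled": False},
        "dynamic_templates": [
            # 2a. Every tag becomes a searchable keyword
            {"tags_as_keywords": {
                "path_match": "tags.*",
                "mapping": {"type": "keyword"},
            }},
            # 2b. Field values are not indexed for search; doc values
            #     remain on by default, so aggregations still work
            {"fields_not_indexed": {
                "path_match": "fields.*",
                "mapping": {"type": "double", "index": False},
            }},
        ],
        "properties": {
            "metric": {"type": "keyword"},
            "@timestamp": {"type": "date"},
        },
    }
}


def force_merge_path(index, max_num_segments=1):
    """3. Request path for the force merge API, to be POSTed off-peak."""
    return f"/{index}/_forcemerge?max_num_segments={max_num_segments}"


print(json.dumps(METRICS_MAPPING, indent=2))
print(force_merge_path("metrics-2019.01.01"))
```

Disabling _source saves disk but has a cost: documents can no longer be reindexed or updated from Elasticsearch itself, so the trade-off suits write-once metric data.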
Visualizing metrics
● Use Grafana
● Grafana supports Elasticsearch as a datasource
● But no support for monitoring & alerting with Elasticsearch as datastore
Monitoring
Monitoring requirements
● Query Elasticsearch at a predefined time interval and send alerts if the metric
breaches the threshold
● The monitoring system should be reliable & scalable
Example
Every 5 mins, run the following rule:
If avg CPU load on server X >= 5, send an alert
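The rule above can be sketched as an Elasticsearch query body plus a threshold check, in Python. The metric name "cpu" and the paths "tags.host", "fields.load" and "@timestamp" are hypothetical; the slides do not show the actual document layout or the rule definition.

```python
def avg_cpu_query(server, minutes=5):
    """Search body: average of fields.load for one server over the last N minutes."""
    return {
        "size": 0,  # only the aggregation is needed, not the matching documents
        "query": {"bool": {"filter": [
            {"term": {"metric": "cpu"}},         # hypothetical metric name
            {"term": {"tags.host": server}},     # hypothetical tag
            {"range": {"@timestamp": {"gte": f"now-{minutes}m"}}},
        ]}},
        "aggs": {"avg_load": {"avg": {"field": "fields.load"}}},
    }


def should_alert(response, threshold=5.0):
    """Return True when the averaged value breaches the threshold."""
    value = response["aggregations"]["avg_load"]["value"]
    # An empty time window yields a null average; treated as "no alert" here
    return value is not None and value >= threshold


# Example with a hand-written response in the shape Elasticsearch returns
sample_response = {"aggregations": {"avg_load": {"value": 6.2}}}
print(should_alert(sample_response))
```

Whether an empty window should stay silent or raise its own alert (no data usually means a broken pipeline) is a design choice this sketch leaves open.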
ElastAlert
● An open source monitoring tool implemented by Yelp
(https://github.com/Yelp/elastalert)
● It queries Elasticsearch based on rule definition and can send alert(s) to
various alerting systems (like PagerDuty, email, HTTP endpoints, etc.)
● ElastAlert was originally implemented to query logs and send alerts; we
extended it to support metrics aggregations
Scheduling ElastAlert
● Schedule & run the ElastAlert application as an AWS Lambda function with a
cron trigger
● Lambda makes the scheduling more reliable and scalable
● AWS Lambda also stores logs for each function invocation, so issues are easy
to debug
Pros
● Stateless service; it stores all the alert information in Elasticsearch
Cons
● Deploying new rules to AWS Lambda is not straightforward
Monitoring the metrics pipeline
Stats
Resource requirements
● Kafka - 3 node cluster, 2 core machine with instance store SSD
● Logstash - 2 node Auto Scaling group, 2 core machine
● Elasticsearch - 4 node cluster, 2 core machine with instance store SSD
Current traffic
● Metrics ingestion rate: 150,000 per minute / 22 GB per day
● Number of search queries at shard level: 15,000 per minute
● Data retention: 70 days
● Disk utilization: 1.3 TB across 4 Elasticsearch nodes
● Number of metric datapoints stored at any point in time: 7 billion
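A back-of-the-envelope check of these figures in Python, assuming 1 GB = 10^9 bytes and that the ingestion rate is sustained around the clock. Both are assumptions; the slides do not say whether the rate is a peak or an average.

```python
INGEST_PER_MIN = 150_000        # datapoints per minute (from the slide)
BYTES_PER_DAY = 22 * 10**9      # 22 GB per day, assuming decimal gigabytes
RETENTION_DAYS = 70

per_day = INGEST_PER_MIN * 60 * 24                     # datapoints per day
bytes_per_point = BYTES_PER_DAY / per_day              # average raw size
points_retained = per_day * RETENTION_DAYS             # if the rate were sustained
raw_retained_tb = BYTES_PER_DAY * RETENTION_DAYS / 10**12

print(f"{per_day:,} datapoints/day")
print(f"{bytes_per_point:.0f} bytes per datapoint")
print(f"{points_retained:,} datapoints over {RETENTION_DAYS} days")
print(f"{raw_retained_tb:.2f} TB of raw input over {RETENTION_DAYS} days")
```

Under these assumptions a datapoint averages roughly 100 bytes, and 70 days of raw input come to about 1.5 TB, the same order as the reported 1.3 TB on disk. The sustained-rate product (about 15 billion datapoints) is higher than the reported 7 billion, which would be consistent with 150,000 per minute being a peak rather than an average, though the slides do not confirm this.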
Deployments
● Terraform for bringing up resources in AWS
● Ansible playbooks for deployment
● Ansible playbooks for Kafka, Elasticsearch and Logstash are downloaded from
Ansible Galaxy
Future improvements
● Replace Logstash with Rsyslog for better reliability
● Enrich metric datapoint by adding more metadata just before ingesting into
Elasticsearch
● Anomaly detection support in ElastAlert
Q&A
