Scalable Time Series
Monitoring and Analysis
https://NationalSecurityAgency.github.io/timely
Who are we?
Drew Farris
Chief Technologist
Booz | Allen | Hamilton
Bill Oley
Senior Lead Technologist
Booz | Allen | Hamilton
Timely Refresher
— Time Series Database (TSDB) built on Accumulo
— SSL/TLS access
— Metric-level access control
— Supports UDP, TCP, HTTPS, Websocket
— Collectd plugins
— Grafana app (datasource, built-in dashboards)
— Query API
— Subscription API
Scaling Timely Ingest and Query
[Architecture diagram. Monitored nodes: Accumulo Master, Tablet Servers, and Datanodes, each emitting Hadoop2 Metrics and running Collectd. Ingest path: Collectd -> HAProxy -> NSQD -> NSQ Pipe -> HAProxy -> Timely (write) -> Accumulo (Master, Tablet Servers). Query path: Browser -> HAProxy -> Timely (read) -> Accumulo.]
Architectural Components
— Collectd deployed on every node to gather application,
OS, hardware metrics.
— NSQ “fan-in” collects messages from many collectd
instances and routes them to a relatively small number
of Timely servers.
— NSQ Pipe consumes data from the queue and writes to
the Timely socket
— HAProxy Plays Multiple Roles:
— Distributes connections from collectd to NSQ
— Distributes write connections from NSQ to Timely
— Distributes read connections into Timely from browsers.
Scaling Concerns
— Deployment ratios that work for us:
— 15-30 clients per NSQ broker
— 8-15 brokers per Timely server
— Single HAProxy for them all
— Keeping up versus catching up
— The bottleneck is the insert rate into Accumulo
— Running with the safeties disabled
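As a rough worked example of the ratios above, the sketch below estimates how many NSQ brokers and Timely servers a given number of collectd clients would need. The function name, the dictionary keys, and the choice to default to the low (most conservative) end of each range are assumptions made for illustration; they are not sizing guidance from the talk.

```python
import math


def estimate_tiers(clients, clients_per_broker=15, brokers_per_timely=8):
    """Estimate NSQ broker and Timely server counts for a client count.

    Defaults use the low end of the quoted ranges (15-30 clients per
    NSQ broker, 8-15 brokers per Timely server), which gives the
    largest deployment. Pass 30 and 15 to see the smallest one.
    """
    if clients < 0:
        raise ValueError("clients must be non-negative")
    brokers = math.ceil(clients / clients_per_broker)
    timely_servers = math.ceil(brokers / brokers_per_timely)
    # The talk reports a single HAProxy in front of all tiers.
    return {"nsq_brokers": brokers, "timely_servers": timely_servers, "haproxy": 1}


if __name__ == "__main__":
    print(estimate_tiers(600))          # conservative end of the ranges
    print(estimate_tiers(600, 30, 15))  # dense end of the ranges
```

Running both ends of the range gives a bracket for planning; the actual limit, as noted above, is the insert rate into Accumulo rather than the broker count.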
Grafana Alerts
— Will be implemented for Timely datasource - issue
#152
— OpenTSDB data source will also work for most
Timely queries
— Alerts executed on Grafana back-end
— Can alert when Min, Max, Sum, Count, Last, Median
is Above, Below, Outside Range, Inside Range
— Alerts can be sent to email, slack, custom web hook,
etc with graphs attached or linked
Subscription API
— User can subscribe to metrics using websockets
— Unique subscription id allows multiple subscriptions
in the same websocket
— Each response includes the subscription id
— Retrieve raw metrics in a time range or create an
ongoing subscription
— Tags supported, but no downsampling (use the
query API for this)
Subscription Sequence
— create - assign a subscription uniqueId
— add - call one or more times to assign metrics to a
subscription
— Read responses from websocket
— remove - delete metrics from a subscription
— close - remove subscription uniqueId from Timely
Websocket API – create/close
{
  "operation" : "create",
  "subscriptionId" : "<unique id>"
}
{
  "operation" : "close",
  "subscriptionId" : "<unique id>"
}
Websocket API – add/remove
{
  "operation" : "add",
  "subscriptionId" : "<unique id>",
  "metric" : "sys.cpu.user",
  "tags" : { // optional key/value pairs
    "tag1" : "value1",
    "tag2" : "value2"
  },
  "startTime" : null, // optional start time as long
  "endTime" : null, // optional end time as long
  "delayTime" : 1000 // wait time for new data
}
{
  "operation" : "remove",
  "subscriptionId" : "<unique id>",
  "metric" : "sys.cpu.user"
}
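The four operations in the subscription sequence can be built as JSON strings with a few helpers. This is a minimal sketch: the field names come from the slides above, while the helper names and the use of a UUID as the unique id are assumptions. Sending the strings over a websocket (for example with tornado, as in the Python client described later) is left out.

```python
import json
import uuid


def new_subscription_id():
    """Generate a unique id; using a UUID here is an assumption."""
    return str(uuid.uuid4())


def create_msg(subscription_id):
    return json.dumps({"operation": "create", "subscriptionId": subscription_id})


def add_msg(subscription_id, metric, tags=None, start_time=None,
            end_time=None, delay_time=1000):
    """Build an 'add' request; times are longs or None, as on the slide."""
    msg = {
        "operation": "add",
        "subscriptionId": subscription_id,
        "metric": metric,
        "startTime": start_time,
        "endTime": end_time,
        "delayTime": delay_time,
    }
    if tags:  # tags are optional key/value pairs
        msg["tags"] = dict(tags)
    return json.dumps(msg)


def remove_msg(subscription_id, metric):
    return json.dumps({"operation": "remove",
                       "subscriptionId": subscription_id,
                       "metric": metric})


def close_msg(subscription_id):
    return json.dumps({"operation": "close", "subscriptionId": subscription_id})


if __name__ == "__main__":
    sid = new_subscription_id()
    for text in (create_msg(sid),
                 add_msg(sid, "sys.cpu.user", tags={"rack": "r1"}),
                 remove_msg(sid, "sys.cpu.user"),
                 close_msg(sid)):
        print(text)
```

Because every message carries the subscription id, several subscriptions can share one websocket, each with its own id.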
Websocket API – response
{
  "responses" :
  [
    {
      "metric" : "sys.cpu.user",
      "timestamp" : 1469028728091,
      "value" : 1.0,
      "tags" :
      [
        {
          "key" : "rack",
          "value" : "r1"
        }
      ],
      "subscriptionId" : "<unique id>",
      "complete" : false
    }
  ]
}
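A response of this shape can be flattened into one row per data point, with each tag promoted to its own column. The sketch below uses only the standard library. It assumes the timestamp is milliseconds since the Unix epoch, and the function name and row keys are illustrative.

```python
import json
from datetime import datetime, timezone


def flatten_responses(payload):
    """Turn one websocket message (a JSON string) into flat row dicts."""
    rows = []
    for item in json.loads(payload).get("responses", []):
        row = {
            # Assumption: timestamp is epoch milliseconds.
            "time": datetime.fromtimestamp(item["timestamp"] / 1000, tz=timezone.utc),
            "subscriptionId": item.get("subscriptionId"),
            "metric": item["metric"],
            "value": item["value"],
            "complete": item.get("complete", False),
        }
        # Tags arrive as a list of {"key": ..., "value": ...} objects.
        # A tag named like a fixed column would overwrite that column.
        for tag in item.get("tags", []):
            row[tag["key"]] = tag["value"]
        rows.append(row)
    return rows


if __name__ == "__main__":
    sample = json.dumps({"responses": [{
        "metric": "sys.cpu.user", "timestamp": 1469028728091, "value": 1.0,
        "tags": [{"key": "rack", "value": "r1"}],
        "subscriptionId": "abc", "complete": False}]})
    print(flatten_responses(sample))
```

A list of such rows could be handed to a pandas DataFrame and indexed on the `time` column, which matches the DatetimeIndex layout the talk describes for its Python client.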
Python Websocket
— Created base websocket class with callbacks using
tornado library
— TimelyMetric class uses Timely’s websocket API and
implements callback methods for handling
asynchronous websocket responses
— Results are assembled in a pandas DataFrame using
a DatetimeIndex
— Columns are metric name and tag names
— Values are corresponding metric/tag values
Python Analytics
— Pandas supports data pivoting, resampling, rolling
averages, and more
— Graph using plot.ly offline methods. HTML/JavaScript
page allows post-analytic data exploration
— Allows isolation of discrete anomalies hiding in a stream
of metric data (a series is plotted only if it is alerting)
— Challenges
— Inconsistent data / duplicate metric issues (tags)
— Writing analytic methods for reuse across many different
types of metrics
— Determining how to isolate trigger events for each type of
metric – i.e. what’s normal / abnormal
Python Analytic Example
TimelyMetric Parameters
— hostport – hostname:port
— metric – metric name
— tags – comma separated key=value
— begin – yyMMdd HHmmss
— end – yyMMdd HHmmss
— duration – after begin or before end
— sample – resample period
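The sketch below shows one way the tags, begin, end, and duration parameters could be interpreted. It is not the TimelyMetric implementation: the function names are hypothetical, `duration` is assumed to be a `timedelta`, and the `yyMMdd HHmmss` pattern is translated to the Python format `%y%m%d %H%M%S`.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%y%m%d %H%M%S"  # Python equivalent of yyMMdd HHmmss


def parse_tags(tags):
    """Parse 'k1=v1,k2=v2' into a dict; an empty string gives {}."""
    pairs = (item.split("=", 1) for item in tags.split(",") if item.strip())
    return {key.strip(): value.strip() for key, value in pairs}


def resolve_range(begin=None, end=None, duration=None):
    """Return (start, stop) datetimes.

    With both begin and end, duration is ignored. With only one of
    them, duration extends forward from begin or backward from end.
    """
    start = datetime.strptime(begin, TIME_FORMAT) if begin else None
    stop = datetime.strptime(end, TIME_FORMAT) if end else None
    if start and stop:
        return start, stop
    if duration is None or (start is None and stop is None):
        raise ValueError("need begin and end, or one of them plus duration")
    if start:
        return start, start + duration
    return stop - duration, stop


if __name__ == "__main__":
    # Illustrative values only.
    print(parse_tags("rack=r1,host=h1"))
    print(resolve_range(begin="160905 000000", duration=timedelta(hours=24)))
```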
Normal Variance in QueuedMajC
Sept 5, 0000-2359
for one rack of servers
Sept 6, 0000-2359
for one rack of servers
Tools for Assessing Normality
— Maximum, minimum – but we may need to rule out
transient spikes to minimize false positives
— Maximum, minimum of rolling average (configurable) –
dampening effect
— Percentage above or below rolling average (configurable)
– useful if the level moves around, but you need to detect
sudden changes
— Minimum alert period – how long is too long?
— Window – only alert if anomaly detected in the last N
minutes, hours, etc. – useful for continuous monitoring
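Three of these tools (percentage deviation from a rolling average, a minimum alert period, and a recency window) can be combined in a few lines. The sketch below is an illustration in plain Python, not the analytic code from the talk. The function name, parameters, and default thresholds are assumptions, and the baseline is taken as the mean of the `window` samples preceding each point.

```python
def find_alerts(values, window=5, pct=50.0, min_run=3, last_n=None):
    """Return (start, end) index pairs of sustained deviations.

    A sample deviates when it differs from the mean of the previous
    `window` samples by more than `pct` percent. Only runs of at least
    `min_run` consecutive deviating samples are reported, which filters
    out transient spikes. If `last_n` is given, only runs ending within
    the last `last_n` samples are kept.
    """
    runs, start = [], None
    # Iterate one step past the end so an open run gets closed.
    for i in range(len(values) + 1):
        deviating = False
        if window <= i < len(values):
            baseline = sum(values[i - window:i]) / window
            if baseline != 0:
                change = abs(values[i] - baseline) / abs(baseline) * 100
                deviating = change > pct
        if deviating and start is None:
            start = i
        elif not deviating and start is not None:
            if i - start >= min_run:
                runs.append((start, i - 1))
            start = None
    if last_n is not None:
        cutoff = len(values) - last_n
        runs = [run for run in runs if run[1] >= cutoff]
    return runs


if __name__ == "__main__":
    step = [10] * 10 + [30] * 5 + [10] * 5   # sustained level change
    spike = [10] * 10 + [30] + [10] * 9      # single transient spike
    print(find_alerts(step))
    print(find_alerts(spike))
```

Because the baseline follows the data, a sustained shift stops alerting once the rolling average catches up. That suits the "level moves around" case above, but it means this check flags the change rather than the new level.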
Queued MajC Anomaly
Queued MajC – Early Detection
Roll-ups with Apache Flink
— Consolidate data across different time windows
— Roll up raw data to variable time resolution.
— Aggregate using a number of functions.
— Difficulties with Accumulo Aggregators
— Ingest, query, aggregation resource contention
— Evaluated a number of streaming frameworks
— Storm, Kafka, etc.
— Ultimately settled on Flink Streaming API.
— (a rich understanding of event time vs. processing time,
watermarks, etc.)
Timely Analytics SummaryJob
— SubscriptionSource
— A Flink RichSourceFunction
— Select Start, End, Specific Metrics & Window
— Bounded Execution
[Architecture diagram: Flink Workers connect through HAProxy to the Timely (read) and Timely (write) servers, which are backed by Accumulo (Master and Tablet Servers).]
Timely Analytics SummaryJob
— SummaryJob
— WebSocket Subscription API Source
— Summarizes multiple metrics simultaneously
— Collect windows in Flink
— Flush windows back to Timely
— Metric Aggregations
— Average, Count, Max, Min, Sum
— Percentiles: 50, 75, 90, 99
— Metrics are in Double.
— Probably not large enough.
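The roll-up itself is easy to state outside Flink. The sketch below groups raw points into fixed (tumbling) windows by their event timestamps and computes the aggregations listed above. It is plain Python for illustration, not the SummaryJob code. The function names are assumptions, and percentiles use the nearest-rank definition, which may differ from what the job uses.

```python
from collections import defaultdict


def percentile(sorted_vals, p):
    """Nearest-rank percentile of a non-empty sorted list, integer p."""
    rank = max(1, (p * len(sorted_vals) + 99) // 100)  # integer ceiling
    return sorted_vals[rank - 1]


def summarize(points, window_ms):
    """Roll up (timestamp_ms, value) pairs into tumbling windows.

    Returns {window_start_ms: {aggregate_name: value}} keyed by the
    start of each window, based on event time rather than arrival order.
    """
    windows = defaultdict(list)
    for timestamp_ms, value in points:
        windows[timestamp_ms - timestamp_ms % window_ms].append(value)
    summary = {}
    for start in sorted(windows):
        vals = sorted(windows[start])
        total = sum(vals)
        summary[start] = {
            "avg": total / len(vals),
            "count": len(vals),
            "max": vals[-1],
            "min": vals[0],
            "sum": total,
            "p50": percentile(vals, 50),
            "p75": percentile(vals, 75),
            "p90": percentile(vals, 90),
            "p99": percentile(vals, 99),
        }
    return summary


if __name__ == "__main__":
    raw = [(0, 1.0), (10, 2.0), (20, 3.0), (30, 4.0), (60000, 10.0)]
    for start, stats in summarize(raw, 60000).items():
        print(start, stats)
```

A streaming engine adds what this batch version lacks: deciding when a window is complete in the presence of late data, which is where the watermarks mentioned above come in.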
Lessons Learned : Architecture
(don’t repeat our mistakes)
— If Collectd metrics are not being reported often enough
— Insufficient ReadThreads – too many source plugins
— Insufficient WriteThreads – queue building up, metrics
dropped at random to compensate
— CollectdParentPlugin uses a single synchronized socket
connection. Increasing write threads is inconsequential.
— (See Issue #156)
— HAProxy does not handle UDP connections
— (nginx is a possible solution)
— Make sure that your NSQPipe process reconnects to Timely
on communication error
— Monitor ingest / query performance using Timely’s internal
metrics
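The reconnect lesson can be sketched as a small sender that drops a broken connection and retries with exponential backoff. This is not the NSQ Pipe code. The class name and parameters are hypothetical, and the `connect` and `sleep` arguments exist only so the behavior can be exercised without a network.

```python
import socket
import time


class ReconnectingSender:
    """Send lines to a TCP endpoint, reconnecting after any socket error."""

    def __init__(self, host, port, connect=socket.create_connection,
                 max_delay=30.0, sleep=time.sleep):
        self._addr = (host, port)
        self._connect = connect
        self._max_delay = max_delay
        self._sleep = sleep
        self._sock = None

    def _ensure_connected(self):
        delay = 1.0
        while self._sock is None:
            try:
                self._sock = self._connect(self._addr)
            except OSError:
                # Back off before retrying, doubling up to max_delay.
                self._sleep(delay)
                delay = min(delay * 2, self._max_delay)

    def send(self, line):
        """Send one line of bytes, retrying until the write succeeds."""
        while True:
            self._ensure_connected()
            try:
                self._sock.sendall(line)
                return
            except OSError:
                # Drop the broken socket; the next pass reconnects.
                try:
                    self._sock.close()
                except OSError:
                    pass
                self._sock = None
```

Retrying the whole line after a failed write gives at-least-once delivery: a line that was partly written before the error may reach the server twice or in fragments, so a real forwarder would also want a retry limit and some visibility into how often it reconnects.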
Lessons Learned : Compactions
(don’t repeat our mistakes)
— Watch your compactions / compaction ratio
— HDFS ran out of space multiple times
— Accumulo killed with too many compactions
— Compaction Ratio was tricky to get right
— Full table compactions worked well
Lessons Learned : HDFS
(don’t repeat our mistakes)
— iostat is your friend.
— Multiple spindles often necessary
— Understand your configuration
Lessons Learned: Metrics
(don’t repeat our mistakes)
— Review and Cull Metrics
— You can track everything…
— ...but it doesn’t mean you should
— (do you really need stats for lo0)?
— I heard you like metrics...
— ...so I put some metrics on your metrics.
Summary & Questions

Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic System

  • 1. Scalable Time Series Monitoring and Analysis https://NationalSecurityAgency.github.io/timely
  • 2. Who are we? Drew Farris Chief Technologist Booz | Allen | Hamilton Bill Oley Senior Lead Technologist Booz | Allen | Hamilton
  • 3. Timely Refresher — Time Series Database (TSDB) built on Accumulo — SSL/TLS access — Metric-level access control — Supports UDP, TCP, HTTPS, Websocket — Collectd plugins — Grafana app (datasource, built-in dashboards) — Query API — Subscription API
  • 4. Scaling Timely Ingest and Query [Architecture diagram; labels: Collectd and Hadoop2 Metrics on the Accumulo master, tablet servers and datanodes; HAProxy; NSQD; NSQ Pipe; Timely (write); Timely (read); Accumulo master and tablet servers; browsers]
  • 5. Architectural Components — Collectd is deployed on every node to gather application, OS, and hardware metrics. — NSQ “fan-in” collects messages from many collectd instances and routes them to a relatively small number of Timely servers. — NSQ Pipe consumes data from the queue and writes to the Timely socket. — HAProxy plays multiple roles: — Distributes connections from collectd to NSQ — Distributes write connections from NSQ to Timely — Distributes read connections into Timely from browsers.
  • 6. Scaling Concerns — Deployment ratios that work for us: — 15-30 clients per NSQ broker — 8-15 brokers per Timely server — Single HAProxy for them all — Keeping up versus catching up — The bottleneck is the insert rate into Accumulo — Running with the safeties disabled
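The deployment ratios on slide 6 can be turned into a rough sizing estimate. The sketch below is only arithmetic on those ratios. The function name `plan_deployment` is hypothetical, and treating each collectd node as one NSQ client is an assumption, not something the slides state.

```python
import math


def plan_deployment(nodes, clients_per_broker, brokers_per_timely):
    """Return (nsq_brokers, timely_servers) for a given node count.

    Assumes one collectd instance (one NSQ client) per node.
    """
    brokers = math.ceil(nodes / clients_per_broker)
    timely_servers = math.ceil(brokers / brokers_per_timely)
    return brokers, timely_servers


# Conservative end of the quoted ranges: 15 clients/broker, 8 brokers/Timely.
print(plan_deployment(900, clients_per_broker=15, brokers_per_timely=8))
# Aggressive end of the quoted ranges: 30 clients/broker, 15 brokers/Timely.
print(plan_deployment(900, clients_per_broker=30, brokers_per_timely=15))
```

For a hypothetical 900-node cluster this gives between 30 and 60 brokers and between 2 and 8 Timely write servers, depending on which end of each range is used.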
  • 7. Grafana Alerts — Will be implemented for Timely datasource - issue #152 — OpenTSDB data source will also work for most Timely queries — Alerts executed on Grafana back-end — Can alert when Min, Max, Sum, Count, Last, Median is Above, Below, Outside Range, Inside Range — Alerts can be sent to email, slack, custom web hook, etc with graphs attached or linked
  • 8. Subscription API — Users can subscribe to metrics using websockets — A unique subscription id allows multiple subscriptions in the same websocket — Each response includes the subscription id — Retrieve raw metrics in a time range or create an ongoing subscription — Tags supported, but no downsampling (use the query API for this)
  • 9. Subscription Sequence — create - assign a subscription uniqueId — add - call one or more times to assign metrics to a subscription — Read responses from websocket — remove - delete metrics from a subscription — close - remove subscription uniqueId from Timely
  • 10. Websocket API – create/close { "operation" : "create", "subscriptionId" : "<unique id>" } { "operation" : "close", "subscriptionId" : "<unique id>" }
  • 11. Websocket API – add/remove { "operation" : "add", "subscriptionId" : "<unique id>", "metric" : "sys.cpu.user", "tags" : { // optional key/value pairs "tag1" : "value1", "tag2" : "value2" }, "startTime" : null, // optional start time as long "endTime" : null, // optional end time as long "delayTime" : 1000 // wait time for new data } { "operation" : "remove", "subscriptionId" : "<unique id>", "metric" : "sys.cpu.user" }
  • 12. Websocket API – response { "responses" : [ { "metric" : "sys.cpu.user", "timestamp" : 1469028728091, "value" : 1.0, "tags" : [ { "key" : "rack", "value" : "r1" } ], "subscriptionId" : "<unique id>", "complete" : false } ] }
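The subscription sequence and message shapes on slides 9 to 12 can be sketched as a few helpers that build the request payloads and unpack a response. This is a minimal sketch using only the field names shown on the slides. The helper names (`create_msg`, `add_msg`, `remove_msg`, `close_msg`, `parse_responses`) are hypothetical, and no websocket connection is opened here.

```python
import json


def create_msg(subscription_id):
    """Payload that creates a subscription with the given unique id."""
    return json.dumps({"operation": "create", "subscriptionId": subscription_id})


def add_msg(subscription_id, metric, tags=None,
            start_time=None, end_time=None, delay_time=1000):
    """Payload that adds one metric to an existing subscription."""
    msg = {
        "operation": "add",
        "subscriptionId": subscription_id,
        "metric": metric,
        "startTime": start_time,   # optional start time as a long
        "endTime": end_time,       # optional end time as a long
        "delayTime": delay_time,   # wait time for new data
    }
    if tags:
        msg["tags"] = tags         # optional key/value pairs
    return json.dumps(msg)


def remove_msg(subscription_id, metric):
    """Payload that removes one metric from a subscription."""
    return json.dumps({"operation": "remove",
                       "subscriptionId": subscription_id,
                       "metric": metric})


def close_msg(subscription_id):
    """Payload that closes the subscription."""
    return json.dumps({"operation": "close", "subscriptionId": subscription_id})


def parse_responses(text):
    """Return (subscriptionId, metric, timestamp, value, tags) tuples."""
    rows = []
    for item in json.loads(text).get("responses", []):
        tags = {t["key"]: t["value"] for t in item.get("tags", [])}
        rows.append((item.get("subscriptionId"), item["metric"],
                     item["timestamp"], item["value"], tags))
    return rows
```

A client would send `create_msg`, then one `add_msg` per metric, read frames and pass each to `parse_responses`, and finish with `remove_msg` and `close_msg`.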
  • 13. Python Websocket — Created base websocket class with callbacks using tornado library — TimelyMetric class uses Timely’s websocket API and implements callback methods for handling asynchronous websocket responses — Results are assembled in a pandas DataFrame using a DatetimeIndex — Columns are metric name and tag names — Values are corresponding metric/tag values
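The DataFrame layout described on slide 13 might look like the sketch below. The TimelyMetric class itself is not shown in the slides, so `responses_to_frame` is a hypothetical helper, and the column layout (one column named after the metric, one column per tag key) is one reading of "columns are metric name and tag names". Timestamps are assumed to be epoch milliseconds, as in the response example on slide 12.

```python
import pandas as pd


def responses_to_frame(responses):
    """Build a DataFrame with a DatetimeIndex from decoded response dicts."""
    rows = []
    timestamps = []
    for item in responses:
        # One column named after the metric holds the value ...
        row = {item["metric"]: item["value"]}
        # ... and each tag key becomes its own column.
        for tag in item.get("tags", []):
            row[tag["key"]] = tag["value"]
        rows.append(row)
        timestamps.append(item["timestamp"])
    index = pd.to_datetime(timestamps, unit="ms")  # epoch millis -> DatetimeIndex
    return pd.DataFrame(rows, index=index).sort_index()


sample = [
    {"metric": "sys.cpu.user", "timestamp": 1469028729091, "value": 2.0,
     "tags": [{"key": "rack", "value": "r1"}]},
    {"metric": "sys.cpu.user", "timestamp": 1469028728091, "value": 1.0,
     "tags": [{"key": "rack", "value": "r1"}]},
]
frame = responses_to_frame(sample)
print(frame)
```

With a DatetimeIndex in place, pandas time-based operations such as `resample` and `rolling` can be applied directly to the metric column.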
  • 14. Python Analytics — Pandas supports data pivoting, resampling, rolling averages, and more — Graph using plot.ly offline methods. HTML/JavaScript page allows post-analytic data exploration — Allow isolation of discrete anomalies hiding in a stream of metric data (Series only plotted if alerting) — Challenges — Inconsistent data / duplicate metric issues (tags) — Writing analytic methods for reuse across many different types of metrics — Determining how to isolate trigger events for each type of metric – i.e. what’s normal / abnormal
  • 16. TimelyMetric Parameters — hostport – hostname:port — metric – metric name — tags – comma separated key=value — begin – yyMMdd HHmmss — end – yyMMdd HHmmss — duration – after begin or before end — sample – resample period
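The parameter conventions on slide 16 can be illustrated with a small parser. This is a sketch, not TimelyMetric's own code. The helper names are hypothetical, the `strptime` pattern assumes the slide's `yyMMdd HHmmss` follows Java-style pattern letters (two-digit year), and the duration is passed as a `timedelta` because the slides do not show its textual format.

```python
from datetime import datetime, timedelta

# Assumption: Python equivalent of the slide's "yyMMdd HHmmss" pattern.
TIME_FORMAT = "%y%m%d %H%M%S"


def parse_tags(text):
    """Turn 'k1=v1,k2=v2' into a dict, ignoring empty entries."""
    tags = {}
    for pair in text.split(","):
        pair = pair.strip()
        if not pair:
            continue
        key, _, value = pair.partition("=")
        tags[key.strip()] = value.strip()
    return tags


def resolve_range(begin=None, end=None, duration=None):
    """Resolve (start, stop); duration extends after begin or before end."""
    start = datetime.strptime(begin, TIME_FORMAT) if begin else None
    stop = datetime.strptime(end, TIME_FORMAT) if end else None
    if duration is not None:
        if start is not None and stop is None:
            stop = start + duration
        elif stop is not None and start is None:
            start = stop - duration
    return start, stop


print(parse_tags("rack=r1, host=h1"))
print(resolve_range(begin="160905 000000", duration=timedelta(hours=24)))
```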
  • 17. Normal Variance in QueuedMajC Sept 5, 0000-2359 for one rack of servers Sept 6, 0000-2359 for one rack of servers
  • 18. Tools for Assessing Normality — Maximum, minimum – but we may need to rule out transient spikes to minimize false positives — Maximum, minimum of rolling average (configurable) – dampening effect — Percentage above or below rolling average (configurable) – useful if the level moves around, but you need to detect sudden changes — Minimum alert period – how long is too long? — Window – only alert if anomaly detected in the last N minutes, hours, etc. – useful for continuous monitoring
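The checks listed on slide 18 can be combined in a few lines of pandas. The sketch below is an illustration under stated assumptions, not the presenters' analytic code. The function names and parameters (`sustained`, `assess`, `rolling_window`, `max_rolling`, `pct_above`, `min_samples`, `recent`) are hypothetical, the rolling window is counted in samples, and the minimum alert period is expressed as a number of consecutive samples.

```python
import pandas as pd


def sustained(condition, min_samples):
    """True where `condition` has held for at least `min_samples` samples in a row."""
    # A new run starts whenever the boolean value changes.
    run_id = condition.ne(condition.shift(fill_value=False)).cumsum()
    run_length = condition.groupby(run_id).cumcount() + 1
    return condition & (run_length >= min_samples)


def assess(series, rolling_window=5, max_rolling=None, pct_above=None,
           min_samples=1, recent=None):
    """Return a boolean Series marking the samples that should alert."""
    rolling = series.rolling(rolling_window, min_periods=1).mean()
    alert = pd.Series(False, index=series.index)
    if max_rolling is not None:
        # Maximum of the rolling average: dampens transient spikes.
        alert = alert | (rolling > max_rolling)
    if pct_above is not None:
        # Percentage above the rolling average: catches sudden changes.
        alert = alert | (series > rolling * (1 + pct_above / 100.0))
    # Minimum alert period: the condition must persist.
    alert = sustained(alert, min_samples)
    if recent is not None:
        # Window: only alert on anomalies in the most recent period.
        cutoff = series.index.max() - pd.Timedelta(recent)
        alert = alert & pd.Series(series.index >= cutoff, index=series.index)
    return alert


index = pd.date_range("2016-09-05", periods=10, freq="min")
queued = pd.Series([1, 1, 1, 1, 1, 10, 10, 10, 1, 1], index=index, dtype=float)
print(assess(queued, rolling_window=3, max_rolling=5, min_samples=2))
```

Requiring the rolling average to stay above the threshold for two samples suppresses the first spike sample, which is the trade-off the slide describes between false positives and detection delay.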
  • 20. Queued MajC – Early Detection
  • 21. Roll-ups with Apache Flink — Consolidate data across different time windows — Roll up raw data to variable time resolution. — Aggregate using a number of functions. — Difficulties with Accumulo Aggregators — Ingest, query, aggregation resource contention — Evaluated a number of streaming frameworks — Storm, Kafka, etc. — Ultimately settled on Flink Streaming API. — (a rich understanding of event time vs. processing time, watermarks, etc.)
  • 22. Timely Analytics SummaryJob — SubscriptionSource — A Flink RichSourceFunction — Select Start, End, Specific Metrics & Window — Bounded Execution [Diagram; labels: Flink workers, HAProxy, Timely (read), Timely (write), Accumulo master and tablet servers]
  • 23. Timely Analytics SummaryJob — SummaryJob — WebSocket Subscription API Source — Summarizes multiple metrics simultaneously — Collect windows in Flink — Flush windows back to Timely — Metric Aggregations — Average, Count, Max, Min, Sum — Percentiles: 50, 75, 90, 99 — Metrics are in Double. — Probably not large enough.
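The aggregations listed on slide 23 can be illustrated without Flink. The sketch below groups points into fixed (tumbling) windows in plain Python and computes the same statistics. It is not the SummaryJob implementation: the function names are hypothetical, and the nearest-rank percentile method is an assumption, since the slides do not say how percentiles are computed.

```python
import math
from collections import defaultdict


def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(p / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(points, window_ms):
    """Aggregate (timestamp_ms, value) pairs into fixed windows.

    Returns {window_start_ms: {"avg", "count", "max", "min", "sum", "p50", ...}}.
    """
    windows = defaultdict(list)
    for timestamp, value in points:
        windows[timestamp - timestamp % window_ms].append(float(value))

    summaries = {}
    for start in sorted(windows):
        values = sorted(windows[start])
        total = sum(values)
        summary = {
            "avg": total / len(values),
            "count": len(values),
            "max": values[-1],
            "min": values[0],
            "sum": total,
        }
        for p in (50, 75, 90, 99):
            summary["p%d" % p] = percentile(values, p)
        summaries[start] = summary
    return summaries


points = [(0, 1), (10, 2), (20, 3), (30, 4), (1000, 10)]
for start, summary in summarize(points, window_ms=1000).items():
    print(start, summary)
```

A streaming job differs mainly in when a window is considered complete, which is where event time and watermarks come in.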
  • 24. Lessons Learned : Architecture (don’t repeat our mistakes) — If Collectd metrics are not being reported often enough — Insufficient ReadThreads – too many source plugins — Insufficient WriteThreads – queue building up, metrics dropped at random to compensate — CollectdParentPlugin uses a single synchronized socket connection. Increasing write threads is inconsequential. — (See Issue #156) — HAProxy does not handle UDP connections — (nginx is a possible solution) — Make sure that your NSQPipe process reconnects to Timely on communication error — Monitor ingest / query performance using Timely’s internal metrics
  • 25. Lessons Learned : Compactions (don’t repeat our mistakes) — Watch your compactions / compaction ratio — HDFS ran out of space multiple times — Accumulo killed with too many compactions — Compaction Ratio was tricky to get right — Full table compactions worked well
  • 26. Lessons Learned : HDFS (don’t repeat our mistakes) — iostat is your friend. — Multiple spindles often necessary — Understand your configuration
  • 27. Lessons Learned: Metrics (don’t repeat our mistakes) — Review and Cull Metrics — You can track everything… — ...but it doesn’t mean you should — (do you really need stats for lo0)? — I heard you like metrics... — ...so I put some metrics on your metrics.