7/29/2020
Platform Monitoring and Alert
Completeness and Consistency: Application and Data Services
Stability: IoT Hub, Kafka, Cassandra Monitoring
Resources: Kubernetes Resource Monitoring
Braja Das
BDAS@STARBUCKS.COM, BKD_108@YAHOO.COM
Release Version: Summer 2019
Table of Contents
1.0 OBJECTIVE
2.0 PRODUCT SCOPE
3.0 MONITORING ARCHITECTURE AT SCALE
4.0 COMPONENTS OF MONITORING ARCHITECTURE
4.1 Kubernetes Application and Data plane
4.2 Kubernetes Control and Monitor plane
4.3 Azure Service Bus
4.4 Cloud Data Store
4.5 Power BI Visual Presentation
4.6 MS Flow
4.7 Notification Channel
5.0 MONITORING AGENT AND EXCEPTION HANDLING
5.1 Exception Alerting
6.0 MONITORING AGENT MICROSERVICES
7.0 KUBERNETES APPLICATION AND DATA SERVICES MONITORING
8.0 KAFKA OFFSET DELAY ALERT
9.0 VISUAL PRESENTATION OF KAFKA OFFSET DELAY
9.1 Kafka offset delay card
9.2 Real Time Kafka offset statistics
10.0 VISUAL PRESENTATION OF IOT HUB, EVENT HUB OFFSET DELAY
10.1 Event Hub offset delay card
10.2 Real Time Event Hub offset statistics
11.0 CASSANDRA METRICS MONITORING
12.0 AI-OPS ALERT: CASSANDRA CLIENT REQUEST OUTAGE ALERT
13.0 VISUAL PRESENTATION OF CASSANDRA METRICS MONITORING
13.1 Latest Cassandra Client Request Outage
13.2 Latest Cassandra Table Latency
14.0 KUBERNETES RESOURCE MONITORING AND ALERT
14.1 Kubernetes cAdvisor
14.2 Prometheus time series data model
14.3 Prometheus Exporters and Integrations
15.0 AI OPS ALERT: ABNORMAL RESOURCE (CPU OR MEMORY) USAGE IN LAST HOUR
15.1 Issues found
15.2 Root Cause Analysis
15.3 Recommendation
16.0 VISUAL PRESENTATION OF KUBERNETES RESOURCE USAGE AND MONITORING
16.1 Latest CPU and Memory Usage Statistics
16.2 Kubernetes high CPU or Memory Usage Alerts from alert card
16.3 CPU and Memory Load Distribution Profile
16.4 Container CPU and Memory Request VS. Limit
16.5 Persistent Volume Claim Usage Statistics (PVC)
17.0 MONITORING AGGREGATE AND ALERT SUMMARY REPORT
17.1 Application and data streaming service stability report
17.2 Kafka offset Aggregate Hist
17.3 IoT Hub, Event Hub offset Aggregate Hist
17.4 Cassandra Client Request Outage Aggregate Hist
17.5 Cassandra Table Latency Aggregate Hist
17.6 Kubernetes nodes resource (CPU, Memory) usage contributions to Agent pool Hist
17.7 Last week’s Kubernetes abnormal resource (CPU or Memory) usage summary
17.8 Batch and Streaming Job Audit Log Statistics
APPENDIX
Prometheus web site
1.0 Objective:
Applications and data services are critical to business success. In a microservices-oriented architecture,
each microservice or application is accountable for a distinct task. It is therefore critical to bring the
control plane up front, with 360-degree monitoring of infrastructure, microservices and apps. Such
monitoring confirms the completeness, consistency and stability of application and data services and
supports business continuity with a high degree of trust.
The key objective of this monitoring is the continuous completeness, consistency and stability of system,
application and data services. Because application and data services access different control points,
monitoring these endpoints directly benefits both system stability and the consistency of application
services. Kafka offset and Event Hub offset delay monitoring helps identify application service
consistency by detecting when a service (producer) stops producing data at a control point. Reconciliation
among different control points helps identify data loss, while spatio-temporal data aggregation and
statistical modeling at a control point help identify data usage and its variation over time. Abnormal
pattern recognition confirms data incompleteness and inconsistency.
This product handbook highlights the system stability monitoring and alerting features for Kubernetes
infrastructure and resource usage, Kubernetes job monitoring using Kafka offset and Event Hub offset
delay, and Cassandra health monitoring by object latency and by client request failures, timeouts and
unavailable errors, all within the Kubernetes container orchestration framework.
2.0 Product scope
The release-1 product scope of the monitoring solution includes, but is not limited to, the following topics.
a. Define and design the monitoring and alerting architecture.
b. Identify infrastructure monitoring control points.
c. Define the scope of Kubernetes resource and app monitoring.
d. Define the scope of Kafka and Event Hub app monitoring.
e. Define monitoring and alert business rules.
f. Define alerts for L1, L2 and L3 support.
g. Develop a configurable monitoring and alert framework.
h. Develop configurable monitoring microservices for data services.
i. Store and aggregate monitoring metrics for future model generation.
j. Develop visual presentations of monitoring metric trends, summaries and aggregates.
k. Send alert push notifications via Slack, SendGrid (email), PagerDuty, Remedy and SMS.
3.0 Monitoring Architecture at Scale
4.0 Components of Monitoring Architecture
4.1 Kubernetes Application and Data plane:
Application and data microservices perform distinct tasks for business processes and run inside
Kubernetes. It is important that these services run and deliver output as expected, so in practice
monitoring them is critical for business continuity and success.
Helm charts are used to install the different app clusters inside Kubernetes. Open source Apache
Cassandra as the data store and open source Apache Kafka as the messaging bus are two important
clusters, used for processing both streaming and batch jobs.
4.2 Kubernetes Control and Monitor plane
The control plane includes different agents that work as watchdogs and ensure the health of the overall
system. The agents that work in the Kubernetes control plane are:
a. CCS agent: CCS agents (microservices) check the Completeness, Consistency and Stability
of application and data services. These services can run in batch or stream mode.
b. Prometheus agent: Collects Kubernetes metrics from the Kubernetes metrics server and stores them
in the Prometheus server (Influx-DB time series DB). Prometheus node exporters, Kafka exporters and
Cassandra exporters stream Prometheus metrics from the Prometheus server to Kafka.
c. Monitor agent: Monitor agent microservices monitor Kubernetes, Kafka, Cassandra and Azure
resources (IoT Hub, Event Hub, Service Bus, Blob Store). These services also compute monitoring
metrics and statistics for modeling and machine learning.
d. Notification agent: These agent microservices collect actionable metric statistics and notify the
different notification channels. The notification agent runs in Kubernetes. Notifications can be time
triggered or event triggered. Notification channels include Slack, SendGrid (email), PagerDuty, Remedy
and SMS.
4.3 Azure Service Bus
Azure Service Bus Event Hub is used to send real-time monitoring statistics for display in Power BI visual
presentations. Event Hub is also used for real-time alerting.
4.4 Cloud Data Store
Monitoring statistics are stored in a data store; Azure SQL DB and Azure Blob store are used. Monitoring
stats are read from Kafka and written to the Azure data store using the Microsoft-provided Spark SQL DB
connector and the Apache hadoop-azure connector:
"com.microsoft.azure" % "azure-sqldb-spark" % "1.0.2",
"org.apache.hadoop" % "hadoop-azure" % "2.7.3"
4.5 Power BI Visual Presentation
Power BI is used for the visual presentation of monitoring statistics and alerts. These monitoring
statistics include, but are not limited to, trend graphs, frequency distributions, probabilities of failure
and set analysis among different metrics.
4.6 MS Flow
MS Flow has different connectors, including one for Power BI. When a dataset refreshes or a new dataset
shows up in Power BI, the MS Flow connectors pick it up and send alerts according to the workflow.
SendGrid email, PagerDuty and Slack are the connectors currently used as part of the alerting and
notification framework.
4.7 Notification Channel
Slack, PagerDuty, email, SMS and Remedy are used as notification channels. Warning, low and high
severity alerts are posted to a Slack channel.
Alerts are also chained across channels: email → PagerDuty → Remedy → SMS.
5.0 Monitoring Agent and Exception Handling
Monitoring microservices monitor application and data services using CCS agents and Monitor agents.
Exceptions are caught in real time and alerted to the application owner by email. Depending on the
assessed severity, alerts are also sent to Slack and PagerDuty. Exceptions are captured in Power BI in
real time: CCS and monitoring agents send the exception string to Event Hub, and Azure Stream Analytics
streams the data from Event Hub so that downstream processes can pick it up in real time.
5.1 Exception Alerting
Application, data and monitoring services have exception alerting in place. This helps in understanding
the severity of exceptions and the time at which they happen. The same alert arriving repeatedly within a
short time indicates an issue in the application, data or monitoring services that needs immediate
attention. An example of exception alerting is shown below.
6.0 Monitoring Agent Microservices
Monitoring agent microservices are mostly written in Scala, using Spark Streaming and Akka Streams.
Time-triggered alerts and monitoring aggregates run as Kubernetes cron jobs. The monitoring agents
mainly use the following libraries.
"org.apache.kafka" %% "kafka" % kafkaVersion,
"org.apache.spark" %% "spark-streaming" % sparkVersion,
"org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion,
"com.microsoft.azure" % "azure-eventhubs" % "0.7.5",
"com.microsoft.azure" % "azure-sqldb-spark" % "1.0.2",
"org.apache.hadoop" % "hadoop-azure" % "2.7.3",
"org.apache.commons" % "commons-email" % "1.4",
"com.typesafe" % "config" % "1.2.1"
The Microsoft-provided Spark SQL DB connector helps store data in SQL DB asynchronously. Transformation
routines have also been written inside SQL DB; the monitoring agent calls these routines to transform
the data into the required format.
The org.apache.commons commons-email library is used for most email alerts, with SendGrid as the email
carrier. A SendGrid API key is used for secure communication.
A Slack app token is used for Slack communication. Messages are published to a Slack channel through
REST calls to the Slack chat message endpoint, "https://slack.com/api/chat.postMessage".
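The Slack call described above could look roughly like the sketch below. It is written in Python for illustration only (the handbook's agents are written in Scala) and uses only the standard library. The function names, the channel name and the token value are hypothetical; the endpoint, the bearer-token header and the `channel`/`text` fields follow Slack's public Web API.

```python
import json
import urllib.request

SLACK_POST_MESSAGE_URL = "https://slack.com/api/chat.postMessage"


def build_slack_request(token: str, channel: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a chat.postMessage request."""
    payload = json.dumps({"channel": channel, "text": text}).encode("utf-8")
    return urllib.request.Request(
        SLACK_POST_MESSAGE_URL,
        data=payload,
        headers={
            # The Slack app token is passed as a bearer token.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


def send_slack_alert(token: str, channel: str, text: str) -> bool:
    """Send the alert; Slack reports success in the "ok" field of its JSON reply."""
    request = build_slack_request(token, channel, text)
    with urllib.request.urlopen(request, timeout=10) as response:
        return bool(json.load(response).get("ok"))
```

Separating request construction from sending keeps the payload easy to inspect without network access; a real agent would read the token from its configuration rather than from code.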
Alert EPML Configuration: The monitoring framework is fully configurable. Most monitoring job
configuration is based on Typesafe config files. In addition, an epmlconfig (Event Processing Markup
Language) JSON file is used; microservices read this file to obtain the alerting conditions and alerting
rules. An example epmlconfig record looks like this:
{"type":"alert", "subject":"IoT Kubernetes CPU Alert: Last Hour CPU Usage exceeds 75%",
"fields":["eventdate", "eventhour", "node", "core_bucket", "avg_coreused", "max_coreused",
"confidence_level"], "fieldsheader":"Eventdate, Eventhour, Node, Core Bucket, Avg. core usage,
Max. core usage, Confidence level", "filter":"max_coreused <> '0'"}
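As a rough illustration of how a microservice might consume such a record, the Python sketch below (illustration only; the actual agents are Scala) parses the example and renders an alert body from metric rows. The filter grammar is an assumption: only a single `field <> 'literal'` or `field = 'literal'` comparison on string values is handled, and `parse_filter` and `render_alert` are hypothetical names.

```python
import json
import re

# The example record from the text, with each JSON string kept on one line.
EPML_RECORD = """
{"type": "alert",
 "subject": "IoT Kubernetes CPU Alert: Last Hour CPU Usage exceeds 75%",
 "fields": ["eventdate", "eventhour", "node", "core_bucket", "avg_coreused",
            "max_coreused", "confidence_level"],
 "fieldsheader": "Eventdate, Eventhour, Node, Core Bucket, Avg. core usage, Max. core usage, Confidence level",
 "filter": "max_coreused <> '0'"}
"""

# Assumed filter grammar: <field> <op> '<literal>' with op in {<>, =}.
FILTER_PATTERN = re.compile(r"^\s*(\w+)\s*(<>|=)\s*'([^']*)'\s*$")


def parse_filter(expression):
    """Turn a filter expression into a predicate over a row (a dict of strings)."""
    match = FILTER_PATTERN.match(expression)
    if match is None:
        raise ValueError(f"unsupported filter: {expression!r}")
    field, operator, literal = match.groups()
    if operator == "<>":
        return lambda row: str(row.get(field)) != literal
    return lambda row: str(row.get(field)) == literal


def render_alert(config, rows):
    """Build the alert text: subject, header line, then one line per kept row."""
    keep = parse_filter(config["filter"])
    lines = [config["subject"], config["fieldsheader"]]
    for row in rows:
        if keep(row):
            lines.append(", ".join(str(row.get(name, "")) for name in config["fields"]))
    return "\n".join(lines)
```

For example, `render_alert(json.loads(EPML_RECORD), rows)` drops every row whose `max_coreused` equals `'0'` and lists the remaining rows under the configured header.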
7.0 Kubernetes Application and Data Services Monitoring
Different apps and clusters are installed in Kubernetes. Monitoring of Kubernetes microservices follows
these principles:
a. Exception alerting from application, data and monitoring microservices.
b. Microservice monitoring using control point delay.
I. Kafka (control point) topic data arrival (offset change) delay.
II. Event Hub data arrival (offset change) delay.
c. Cassandra client request outage (timeouts, failures, unavailable) and latency monitoring.
Control point (CP) monitoring in the control plane gives a direct benefit wherever services access control
points. Monitoring the offset delay of control points such as Kafka and Event Hub helps detect stoppages
of application and data services, and the appropriate delay alerts are triggered.
8.0 Kafka offset delay alert:
Kafka delay alert notifications are configurable. The currently supported notification methods are email
(SendGrid), Slack, PagerDuty, SMS and Remedy (ITSM). Notifications are generated from Kubernetes cron
jobs and from Power BI alert cards; a Power BI alert card triggers a notification to MS Flow for the
connectors to pick up. When the Kafka offset delay exceeds a configured threshold (in minutes), an alert
is sent to the appropriate notification channels.
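The threshold rule can be sketched as follows, in Python for illustration only. The sketch assumes the delay is measured as the time since the topic's offset last changed; the topic names, the 15-minute threshold and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def offset_delay_minutes(last_offset_change: datetime, now: datetime) -> float:
    """Minutes elapsed since the topic's offset last advanced."""
    return (now - last_offset_change).total_seconds() / 60.0


def delayed_topics(last_change_by_topic, now, threshold_minutes):
    """Return {topic: delay in minutes} for topics whose delay exceeds the threshold."""
    alerts = {}
    for topic, last_change in last_change_by_topic.items():
        delay = offset_delay_minutes(last_change, now)
        if delay > threshold_minutes:
            alerts[topic] = delay
    return alerts


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    last_change = {
        "telemetry": now - timedelta(minutes=3),   # producer is healthy
        "orders": now - timedelta(minutes=42),     # producer has stopped
    }
    print(delayed_topics(last_change, now, threshold_minutes=15))
```

The same comparison applies to Event Hub offsets; only the source of the last-change timestamp differs.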
9.0 Visual Presentation of Kafka Offset Delay
9.1 Kafka offset delay card
A Kafka topic data arrival delay card is configured for each Kafka topic. When the data is refreshed, an
alert is triggered only if the card value exceeds a configured threshold. The latest offset processing
time is also displayed and updates in real time, so operations support can also consult this timestamp
when acting on Kafka delay alerts.
9.2 Real Time Kafka offset statistics
Kafka offset statistics are captured in real time. These statistics not only give a historical snapshot
but also give operations support a visualization tool to follow in real time. When offset processing
stops, or application and data services stop producing traffic to a Kafka topic, operations support can
look at these real-time statistics, together with the latest offset processing timestamp, to confirm the
problem.
10.0 Visual Presentation of IoT Hub, Event Hub Offset Delay
10.1 Event Hub offset delay card
An Event Hub data arrival delay card is configured for the Event Hub topic exposed by IoT Hub's built-in
endpoint. When the data is refreshed, an alert is triggered if the Event Hub data arrival delay exceeds a
configured threshold. The latest offset processing time is also displayed and updates in real time, so
operations support can also consult this timestamp when acting on Event Hub data arrival delay alerts.
10.2 Real Time Event Hub offset statistics
Event Hub offset statistics are captured in real time. These statistics not only give a historical
snapshot but also give operations support a visualization tool to follow. When Event Hub has an outage,
offset processing stops, or application and data services stop producing traffic to Event Hub, operations
support can look at these real-time statistics, together with the latest processing timestamp, to confirm
the problem.
11.0 Cassandra Metrics Monitoring
When application and data services request Cassandra client connections, it is important to track the
connection status. A client request failure is raised when an app or query connection request is unable
to access Cassandra. Cassandra may also be unavailable when it cannot accept excess client connection
requests. Even after a successful connection, application reads or writes sometimes have very large
latency, which can cause timeouts. The Cassandra client request failure, timeout and unavailable metrics
are therefore important indicators of the stability of application and data services while accessing
Cassandra instances. The metrics listed below are important for Cassandra client request outage and
client request latency. It is also important to capture a few other Cassandra performance metrics, such
as keyspace and table latency and keycachehitrate. LiveDiskSpaceUsed and LiveDiskSpaceAvailable
are two important metrics for Cassandra PVC disk space monitoring.
The Kubernetes Cassandra exporter exports Cassandra metrics from the Prometheus server to a Kafka topic.
https://github.com/criteo/cassandra_exporter
a. Disk space metrics: label values contain totaldiskspaceused:value,
livediskspaceused:value
b. Client request metrics: label values contain clientrequest: oneminuterate
c. Table and keyspace latency metrics: label values contain rangelatency:max, writelatency:max,
coordinatorscanlatency:max, coordinatorreadlatency:max, readlatency:max
d. Key cache hit rate metrics: label values contain keycachehitrate:value
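One way to route exported metrics into these four groups is to match on the label value, as in the Python sketch below (illustration only). The group names and `classify_metric` are hypothetical, and item b is interpreted here as "contains clientrequest and ends with oneminuterate", which is an assumption about the label layout.

```python
# Suffixes taken from the list above; everything else in this sketch is assumed.
DISK_SUFFIXES = ("totaldiskspaceused:value", "livediskspaceused:value")
LATENCY_SUFFIXES = (
    "rangelatency:max",
    "writelatency:max",
    "coordinatorscanlatency:max",
    "coordinatorreadlatency:max",
    "readlatency:max",
)


def classify_metric(label_value: str):
    """Return the monitoring group for a metric label value, or None if unmonitored."""
    name = label_value.lower()
    if name.endswith(DISK_SUFFIXES):
        return "disk_space"
    if "clientrequest" in name and name.endswith("oneminuterate"):
        return "client_request"
    if name.endswith(LATENCY_SUFFIXES):
        return "latency"
    if name.endswith("keycachehitrate:value"):
        return "key_cache"
    return None
```

Label values that match none of the groups return `None`, so the agent can drop them before any aggregation.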
12.0 AI-Ops Alert: Cassandra Client Request Outage alert
Cassandra client request outage metrics indicate that apps or queries could not establish a connection,
had a connection request fail, or lost the connection while accessing Cassandra objects or metadata.
Such outages need immediate attention.
The Cassandra client request outage alert includes the following.
12.1 Issues Found: frequency of client request failures, timeouts and unavailable errors in the last
hour. Attributes include:
i. Eventtime: timestamp at which the Cassandra client request outage was triggered in the last hour.
ii. Metricgroup: timeouts, failure or unavailable.
iii. Pods: Cassandra instance.
12.2 Root Cause Analysis: top 10 table latencies in the last hour. Attributes include:
i. Latency type: coordinator read latency, read latency, coordinator scan latency, write latency,
coordinator write latency.
ii. Keyspace: keyspace name.
iii. Table name: table name.
iv. Pods: Cassandra instances.
v. Last hour's total latency (seconds): total latency in seconds in the last hour.
vi. Last hour's latency frequency: number of times latency above 400 ms was triggered in the last hour.
vii. Rank of latency duration: ranking by latency duration.
The figure below is a snapshot of the Cassandra client request outage AI-Ops alert.
Figure: Cassandra Client Request Outage Alert: Last hour Failure, Timeout, Unavailable
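The root cause ranking in 12.2 can be approximated with the Python sketch below (illustration only). It assumes that both the total latency and the frequency count only samples above the 400 ms threshold, and that ranking is by total latency; the sample field names and `top_table_latency` are hypothetical.

```python
from collections import defaultdict


def top_table_latency(samples, threshold_ms=400.0, top_n=10):
    """Rank (latency type, keyspace, table, pod) groups by total slow latency."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [total seconds, frequency]
    for sample in samples:
        if sample["latency_ms"] > threshold_ms:
            key = (sample["latency_type"], sample["keyspace"],
                   sample["table"], sample["pod"])
            totals[key][0] += sample["latency_ms"] / 1000.0
            totals[key][1] += 1
    # Highest total latency first; keep only the top N groups.
    ranked = sorted(totals.items(), key=lambda item: item[1][0], reverse=True)[:top_n]
    return [
        {
            "rank": rank,
            "latency_type": key[0],
            "keyspace": key[1],
            "table": key[2],
            "pod": key[3],
            "total_latency_s": total,
            "frequency": count,
        }
        for rank, (key, (total, count)) in enumerate(ranked, start=1)
    ]
```

Each returned row carries the attributes listed in 12.2, so it can be rendered directly into the alert table.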
13.0 Visual Presentation of Cassandra Metrics Monitoring
13.1 Latest Cassandra Client Request Outage
Cassandra client request outage includes client request timeouts, failures and unavailable errors. When
an alert is raised, operations support can view the temporal aggregate frequency (by hour or by minute)
of Cassandra outages. This trend graph confirms the severity of problems in recent hours.
13.2 Latest Cassandra Table Latency
High latency (read, write, coordinator read, coordinator write, range scan, coordinator scan) is the
primary cause of Cassandra client request outage when accessing Cassandra objects. The first graph
displays the latest latency trend by keyspace. The second graph shows latency by latency type and by
table. The third and fourth graphs are hourly and per-minute drill-downs of table latency. Operations
support uses this view to confirm problem severity.
14.0 Kubernetes Resource Monitoring and Alert
CPU, memory and volume usage are important in infrastructure monitoring. Proactive failure alerts help
operators act in time. Kubernetes resource monitoring and notification includes the following steps.
Capture Kubernetes resource utilization metrics from Kubernetes cAdvisor and transfer them to the
Prometheus time series influx-db server. The Prometheus agent does this job.
Export node, Kafka and Cassandra metrics from the influx-db time series database to Kafka in near real
time. The Prometheus agent does this through exporter jobs.
Apply business transformation rules to find resource usage. The following KPIs, exposed by cAdvisor,
the kubelet and kube-state-metrics, are used:
i. Volume metrics: kubelet_volume_stats_available_bytes, kubelet_volume_stats_used_bytes
ii. Container resource requests and limits: kube_pod_container_resource_limits,
kube_pod_container_resource_requests, kube_node_status_allocatable
iii. Container resource usage metrics: container_memory_working_set_bytes,
container_cpu_usage_seconds_total, container_fs_writes_bytes_total,
container_fs_reads_bytes_total, container_fs_io_time_seconds_total
Build aggregates for trend graphs, modeling and machine learning.
Apply set analysis for AIOps alerts.
Visualize resource utilization KPIs in Power BI.
Develop AIOps alerts for CPU, memory and volume usage.
The AIOps alert covers the following topics.
High resource utilization: high resource (CPU, memory, volume) utilization in nodes, the agent pool and
the cluster.
Root Cause Analysis: top apps by resource usage in nodes, the agent pool and across clusters.
Recommendation: app load balancing recommendation from models or set analysis. The least used nodes
are candidates for app load balancing.
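As an example of a transformation rule for these KPIs: container_cpu_usage_seconds_total is a cumulative counter of CPU seconds, so the cores in use over an interval are the counter increase divided by the elapsed time. The Python sketch below (illustration only) shows this and the percentage against the allocatable capacity reported by kube_node_status_allocatable. The function names and the counter-reset handling are assumptions, not the handbook's actual rules.

```python
def cores_used(counter_start: float, counter_end: float, elapsed_seconds: float) -> float:
    """Average CPU cores used between two samples of a CPU-seconds counter."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    delta = counter_end - counter_start
    if delta < 0:
        # Assumed handling of a counter reset (for example a container restart):
        # count only what accumulated since the reset.
        delta = counter_end
    return delta / elapsed_seconds


def percent_of_allocatable(used: float, allocatable: float) -> float:
    """Usage as a percentage of the node's allocatable capacity."""
    if allocatable <= 0:
        raise ValueError("allocatable must be positive")
    return 100.0 * used / allocatable
```

Memory needs no rate calculation, because container_memory_working_set_bytes is a gauge; its value can be passed to `percent_of_allocatable` directly.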
14.1 Kubernetes cAdvisor
cAdvisor is an open source container resource usage and performance analysis agent. It is purpose-
built for containers and supports Docker containers natively. In Kubernetes, cAdvisor is integrated into
the Kubelet binary. cAdvisor auto-discovers all containers in the machine and collects CPU, memory,
filesystem, and network usage statistics. cAdvisor also provides the overall machine usage by analyzing
the ‘root’ container on the machine.
14.2 Prometheus time series data model
Prometheus is an open-source systems monitoring and alerting toolkit. It is now a standalone open source
project and maintained independently of any company. To emphasize this, and to clarify the project's
governance structure, Prometheus joined the Cloud Native Computing Foundation in 2016 as the second
hosted project, after Kubernetes.
Prometheus's main features are:
• a multi-dimensional data model with time series data identified by metric name and key/value pairs
• time series collection happens via a pull model over HTTP
• pushing time series is supported via an intermediary gateway
Prometheus scrapes metrics from instrumented jobs, either directly or via an intermediary push gateway
for short-lived jobs. It stores all scraped samples locally and runs rules over this data to either aggregate
and record new time series from existing data or generate alerts.
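To make the data model concrete, the Python sketch below parses one sample line of the Prometheus text exposition format into its metric name, key/value label pairs and value. It is a simplified illustration: escape sequences inside label values are left as-is, an optional trailing timestamp is ignored, and comment lines are not handled. `parse_sample` is a hypothetical helper, not part of any Prometheus client library.

```python
import re

# metric_name{label="value",...} value [timestamp]
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line: str):
    """Return (metric name, labels dict, float value) for one sample line."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


if __name__ == "__main__":
    # The label values here are made up for the example.
    print(parse_sample('container_cpu_usage_seconds_total{namespace="iot",pod="agent-1"} 1234.5'))
```

The metric name together with the full label set identifies one time series; two samples with the same name but different labels belong to different series.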
14.3 Prometheus Exporters and Integrations
A number of libraries and servers help export existing metrics from third-party systems as Prometheus
metrics. This is useful when it is not feasible to instrument a given system with Prometheus metrics
directly. Kafka exporters export Prometheus metrics from the Prometheus server to Kafka. Node exporters
and Cassandra exporters are used to export node and Cassandra metrics to Kafka for further
transformation and analysis.
The GitHub links for these exporters are:
https://github.com/prometheus/node_exporter
https://github.com/danielqsj/kafka_exporter
https://github.com/criteo/cassandra_exporter
15.0 AI Ops Alert: Abnormal resource (CPU or Memory) usage in last hour
Application and data services need resources (CPU and memory) from the Kubernetes cluster. A high
resource usage alert helps avoid node restarts and keeps apps running without issues. When a node
restarts, its apps also restart and must go through service discovery and find resources to run on other
or new nodes.
This AI-Ops alert has three parts.
15.1 Issues found: nodes whose CPU or memory usage exceeds the threshold (75%). Attributes used are:
I. Node: nodes with CPU or memory usage > 75%.
II. Core Bucket: attribute values are 75%-90%, 90%-100% or 100%.
III. Avg. core or memory usage: last hour's average core or memory usage.
IV. Max. core or memory usage: last hour's maximum core or memory usage.
V. Confidence level: the confidence level is P(n|N), where
n = count of CPU or memory usage events from a resource bucket in the given time window,
N = total events triggered in the same time window.
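The bucket and confidence attributes can be sketched in Python as below (illustration only). The sketch reads the confidence level as the ratio n / N, which is an interpretation of P(n|N), and the treatment of bucket boundaries (75 and 90 inclusive at the lower end) is also an assumption. `core_bucket` and `bucket_confidence` are hypothetical names.

```python
from collections import Counter


def core_bucket(usage_pct: float):
    """Map a usage percentage to its alert bucket, or None below the 75% threshold."""
    if usage_pct >= 100.0:
        return "100%"
    if usage_pct >= 90.0:
        return "90%-100%"
    if usage_pct >= 75.0:
        return "75%-90%"
    return None


def bucket_confidence(usage_samples_pct):
    """Confidence per bucket: n (events in the bucket) / N (all events in the window)."""
    total_events = len(usage_samples_pct)
    if total_events == 0:
        return {}
    counts = Counter(
        bucket
        for bucket in map(core_bucket, usage_samples_pct)
        if bucket is not None
    )
    return {bucket: n / total_events for bucket, n in counts.items()}
```

With four samples in the window at 50%, 80%, 85% and 95%, half of the events fall in the 75%-90% bucket and a quarter in the 90%-100% bucket.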
15.2 Root Cause Analysis: top 3 apps with high CPU or memory utilization in the alerted nodes. Attributes
used are:
i. Node: name of the alerted node.
ii. Container: app container name.
iii. Pod: app pod name.
iv. Container's rank by core or memory usage (highest first).
v. % resource used (core or memory).
15.3 Recommendation: App load balancing recommendation.
The nodes with the lowest core or memory usage are strong candidates for app load rebalancing. This AI-Ops
alert finds the 5 least used nodes with less than 20% core or memory usage in the last hour. Operations
support needs to restart the job and relabel it so that app services run on these least used nodes.
Attributes used here are:
i. Node: recommended nodes.
ii. Avg. core or memory used: last hour's avg. core or memory usage of the node.
iii. Max. core or memory used: last hour's max. core or memory usage of the node.
iv. Least core or memory used rank: node ranking based on least core or memory usage.
A snapshot of the AI-Ops alert is given below.
Figure: AI-Ops Alert - Kubernetes abnormal resource usage in last hour.
16.0 Visual Presentation of Kubernetes Resource Usage and Monitoring
The Kubernetes resource usage presentation includes, but is not limited to, the following visuals.
16.1 Latest CPU and Memory Usage Statistics
This visual shows the cluster's CPU and memory utilization statistics for the last n hours, which gives the
CPU and memory usage trends of the Kubernetes nodes. The latest memory and CPU usage statistics help
visualize problem severity, and operations support uses this tool to decide on the proper action. The
container_memory_working_set_bytes, container_cpu_usage_seconds_total and kube_node_status_allocatable
metrics are used for this computation.
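As an illustration of how these metrics can be turned into utilization percentages, the sketch below relies on container_cpu_usage_seconds_total being a cumulative counter of CPU-seconds, so the increase over a window divided by the window length gives the average number of cores used. The function names are hypothetical, the inputs are assumed to be already summed per node, and counter resets are not handled.

```python
def cpu_utilization_pct(cpu_seconds_start, cpu_seconds_end, window_seconds, allocatable_cores):
    """Average CPU utilization of a node over a window, as a percentage.

    cpu_seconds_start / cpu_seconds_end: summed container_cpu_usage_seconds_total
    for the node at the start and end of the window.
    allocatable_cores: kube_node_status_allocatable for the cpu resource.
    """
    cores_used = (cpu_seconds_end - cpu_seconds_start) / window_seconds
    return 100.0 * cores_used / allocatable_cores


def memory_utilization_pct(working_set_bytes, allocatable_bytes):
    """Memory utilization of a node from the summed working set, as a percentage."""
    return 100.0 * working_set_bytes / allocatable_bytes
```

For example, a node with 4 allocatable cores whose containers consumed 3600 CPU-seconds in one hour averaged one core, which is 25% utilization.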
16.2 Kubernetes high CPU or Memory Usage Alerts from alert card.
Configure a CPU and memory utilization card for each node. When a card value exceeds the configured
threshold (> 75%), an alert is raised and sent to the configured notification channels. Slack, email,
PagerDuty, Remedy and SMS can be used as notification channels.
16.3 CPU and Memory Load Distribution Profile
Apps go through service discovery and failover when a Kubernetes node restarts. Tracking core (CPU) and
memory utilization helps avoid unwanted node and cluster restarts.
The Kubernetes CPU or memory load distribution indicates the % of core (CPU) or memory used by each node in
the agent pool. An uneven or highly skewed CPU or memory contribution indicates an opportunity to rebalance
the apps' CPU or memory load. Optimized cluster utilization helps reduce costs and allows better app
performance.
The apps' CPU or memory distribution within a node indicates the % of core or memory used by each app on
that node. When a node has high CPU or memory utilization (i.e. > 75%), it is important to redistribute app
load to the least used nodes. The app load distribution profile of a node helps identify its most
resource-intensive apps, ranked by core or memory usage.
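Both distribution profiles described above reduce to the same calculation: each member's share of the total usage. The following sketch assumes usage values in a common unit (cores or bytes); the name share_pct is hypothetical.

```python
def share_pct(usage_by_name):
    """Return each entry's share of total usage as a percentage.

    usage_by_name: {name: usage}, where names are nodes in an agent pool
    or apps on a node, and usage is in cores or bytes.
    """
    total = sum(usage_by_name.values())
    if total == 0:
        return {name: 0.0 for name in usage_by_name}
    return {name: 100.0 * used / total for name, used in usage_by_name.items()}
```

Applied to an agent pool where node1 uses 6 cores and node2 uses 2 cores, the result is 75% and 25%, which is the kind of skew that suggests moving app load from node1 to node2.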
Figure: Node1 distribution in agentpool1 and apps distribution in node 1.
16.4 Container CPU and Memory Request VS. Limit
The resource request and resource limit settings of app containers control the apps' resource usage in the
cluster. The figure below indicates that the kuberesources-reader-controller container requests 49.9% of
the total memory available for containers to run in the Kubernetes cluster. Likewise, 10% of the total
available CPU is requested to run the kafkaoffset-monitor-controller container. When a container's memory
usage exceeds its resource limit, the container is terminated and restarted, which frees its resources in
the cluster.
kube_pod_container_resource_limits and kube_pod_container_resource_requests are the two important
metrics for this transformation and computation. For safety, all pods need resource requests and limits
set.
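The two checks implied above, the share of cluster resources a container requests and whether every container has both settings, can be sketched as follows. The dictionary layout and the function names are hypothetical; in practice the values would come from the two kube-state-metrics series named above.

```python
def request_share_pct(requested, total_available):
    """Percentage of the total available resource that a container requests."""
    return 100.0 * requested / total_available


def containers_missing_settings(containers):
    """Return the names of containers that lack a resource request or a limit.

    containers: list of dicts with a "name" key and optional
    "request" and "limit" keys (same unit for both).
    """
    return sorted(
        c["name"]
        for c in containers
        if c.get("request") is None or c.get("limit") is None
    )
```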
16.5 Persistent Volume Claim Usage Statistics (PVC)
Persistent volumes can be configured in Kubernetes. PVC usage statistics show persistent storage usage in
Kubernetes. Cassandra, Kafka and other apps use persistent storage for their operations. Monitoring PVC
usage makes it possible to alert on high PVC utilization. The figure below indicates each agent pool's
contribution to PVC usage, along with the daily avg. % of PVC used by each node. An alert is configured for
when PVC usage exceeds a certain threshold.
kubelet_volume_stats_available_bytes and kubelet_volume_stats_used_bytes are the two important
metrics for the transformation and further computation.
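A minimal sketch of the PVC calculation follows. It assumes the volume size can be approximated as used plus available bytes, which ignores any space the filesystem reserves. The 80% default threshold is a placeholder chosen for illustration only, since the source does not state the configured value, and the function names are hypothetical.

```python
def pvc_used_pct(used_bytes, available_bytes):
    """Approximate % of a persistent volume in use from the two kubelet metrics."""
    total = used_bytes + available_bytes
    return 100.0 * used_bytes / total if total else 0.0


def pvcs_over_threshold(stats, threshold_pct=80.0):
    """Return the PVC names whose usage exceeds threshold_pct.

    stats: {pvc_name: (used_bytes, available_bytes)}.
    threshold_pct is an illustrative default, not the platform's setting.
    """
    return sorted(
        name
        for name, (used, available) in stats.items()
        if pvc_used_pct(used, available) > threshold_pct
    )
```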
17.0 Monitoring Aggregate and Alert Summary Report
17.1 Application and data streaming service stability report
Weekly stability trends of streaming microservices are captured. The hours in which data services have
missing Kafka offsets are aggregated per Kafka topic. This weekly report indicates application service
stability trends for the last 7 days.
17.2 Kafka offset Aggregate Hist
The Kafka topic offset aggregate history indicates the stability of the microservices that produce Kafka
traffic, along with data quality and partition utilization over time. An equal distribution across time
scales indicates that the producer writes data to the Kafka topic evenly over time. The pattern might be
skewed when data is not processed on time and needs replay. The rollout of new resources (devices) is
another possible cause. A uniform data pattern across time scales indicates stable producer microservices.
17.3 IoT Hub, Event Hub offset Aggregate Hist
The IoT Hub and Event Hub offset aggregate history indicates IoT Hub and Event Hub data processing quality,
partition utilization and job stability over time. Missing days, hours or minutes indicate IoT Hub or Event
Hub outages, or the inability of a stream producer (device) to send traffic to IoT Hub or Event Hub.
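Finding the missing hours mentioned above amounts to comparing the hours in which offsets were observed against every hour in the reporting window. The sketch below assumes event timestamps are available as naive datetime values in a single time zone; the name missing_hours is hypothetical.

```python
from datetime import datetime, timedelta


def missing_hours(observed, start, end):
    """Return the hourly slots in [start, end) that contain no observed event.

    observed: iterable of datetime values at which offsets were recorded.
    start, end: datetime bounds of the reporting window.
    """
    seen = {ts.replace(minute=0, second=0, microsecond=0) for ts in observed}
    gaps = []
    current = start.replace(minute=0, second=0, microsecond=0)
    while current < end:
        if current not in seen:
            gaps.append(current)
        current += timedelta(hours=1)
    return gaps
```

The same approach works for days or minutes by changing the truncation and the step size.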
17.4 Cassandra Client Request Outage Aggregate Hist
The Cassandra client request outage aggregate history indicates the stability of microservices while
accessing the Cassandra database. The outage metric distribution shows the contribution of each Cassandra
outage type (read, write, scan, coordinator). The temporal drilldown (day, hour) provides pattern detail and
helps app owners troubleshoot queries and microservices.
17.5 Cassandra Table Latency Aggregate Hist
The Cassandra table latency aggregate history helps in app performance tuning. Cassandra objects might
need maintenance, or a data retention and cleanup policy might need to be enforced for high-latency objects.
Proper use of partition and clustering keys helps reduce latency. High rangelatency, coordinatorscanlatency
and coordinatorreadlatency can be avoided by using appropriate primary keys in queries. The temporal
drilldown by latency type, keyspace and table helps developers pinpoint the root cause of a problem and
supports performance tuning and decision making.
17.6 Kubernetes nodes resource (CPU, Memory) usage contributions to Agent pool Hist
Kubernetes capacity planning is easier when node utilization in the cluster is visible over a period of
time. An even CPU and memory distribution across the nodes of an agent pool is desired for optimum resource
utilization. This temporal aggregate view shows the resource contribution of each node to its agent pool.
In the figure below, contribution values close to 1 show optimum CPU or memory distribution, whereas a
large deviation from this number in either direction shows scope for resource optimization. A value of 2
indicates abnormally high resource usage, and a value close to zero shows that the node contributes the
least resource utilization to the agent pool. This figure also supports recommendations for node load
balancing in the Kubernetes cluster.
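The source does not state how the contribution value is calculated. One definition that matches the description (values near 1 for an even distribution, around 2 for heavy nodes, near zero for idle nodes) is each node's usage divided by the average usage of the nodes in its agent pool. The sketch below uses that assumed definition; the name contribution_ratio is hypothetical.

```python
def contribution_ratio(usage_by_node):
    """Return each node's usage relative to the agent pool average.

    Assumed definition: ratio = node usage / mean usage across the pool,
    so 1.0 means the node carries exactly its even share of the load.
    """
    if not usage_by_node:
        return {}
    mean = sum(usage_by_node.values()) / len(usage_by_node)
    if mean == 0:
        return {node: 0.0 for node in usage_by_node}
    return {node: used / mean for node, used in usage_by_node.items()}
```

Under this definition, a pool of three nodes using 2, 1 and 0 cores yields ratios of 2.0, 1.0 and 0.0, flagging the first node as overloaded and the third as a rebalancing target.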
17.7 Last week’s Kubernetes abnormal resource (CPU or Memory) usage summary
A summary of last week's abnormal CPU or memory usage alerts is captured here. An alert is raised when the
core or memory bucket exceeds 75%. This summary report shows the resource usage risk for future weeks and
helps in the planning process.
17.8 Batch and Streaming Job Audit Log Statistics
Job stability is captured in these statistics. The batch job audit log history shows the success history
of batch jobs, and the streaming job audit log shows the restart history of streaming jobs.
Appendix
Prometheus web site: https://prometheus.io/
Prometheus Exporters and Integrations: https://prometheus.io/docs/instrumenting/exporters/
Node Exporter: https://github.com/prometheus/node_exporter
Cassandra Exporter: https://github.com/criteo/cassandra_exporter

More Related Content

What's hot

California enterprise architecture_framework_2_0
California enterprise architecture_framework_2_0California enterprise architecture_framework_2_0
California enterprise architecture_framework_2_0ppalacz
 
SafeDNS Content Filtering Service Guide
SafeDNS Content Filtering Service GuideSafeDNS Content Filtering Service Guide
SafeDNS Content Filtering Service GuideSafeDNS
 
Mobile Marketing Association Best Practices
Mobile Marketing Association Best PracticesMobile Marketing Association Best Practices
Mobile Marketing Association Best PracticesSellPhone Marketing
 
Ar smartphones
Ar smartphonesAr smartphones
Ar smartphonesaxiuluo
 
Gaia-X, le projet de cloud européen
Gaia-X, le projet de cloud européenGaia-X, le projet de cloud européen
Gaia-X, le projet de cloud européenPaperjam_redaction
 
Handbook all eng
Handbook all engHandbook all eng
Handbook all enganiqa7
 
The Endpoint Security Paradox
The Endpoint Security ParadoxThe Endpoint Security Paradox
The Endpoint Security ParadoxSymantec
 
Intrusion Detection on Public IaaS - Kevin L. Jackson
Intrusion Detection on Public IaaS  - Kevin L. JacksonIntrusion Detection on Public IaaS  - Kevin L. Jackson
Intrusion Detection on Public IaaS - Kevin L. JacksonGovCloud Network
 
THESEUS Usability Guidelines for Usecase Applications
THESEUS Usability Guidelines for Usecase ApplicationsTHESEUS Usability Guidelines for Usecase Applications
THESEUS Usability Guidelines for Usecase ApplicationsDaniel Sonntag
 
Open payments user guide [august-2014]
Open payments user guide [august-2014]Open payments user guide [august-2014]
Open payments user guide [august-2014]Market iT
 
MS SSAS 2008 & MDX Reports
MS SSAS 2008 &  MDX Reports MS SSAS 2008 &  MDX Reports
MS SSAS 2008 & MDX Reports Sunny U Okoro
 
Sample global forest wildfire detection system market research report 2020
Sample global forest wildfire detection system market research report 2020Sample global forest wildfire detection system market research report 2020
Sample global forest wildfire detection system market research report 2020Cognitive Market Research
 
CMS | Open payments user guide
CMS | Open payments user guideCMS | Open payments user guide
CMS | Open payments user guideMarket iT
 
Sample global digital x ray system market research report 2020
Sample global digital x ray system market research report 2020 Sample global digital x ray system market research report 2020
Sample global digital x ray system market research report 2020 Cognitive Market Research
 
Wireshark user's guide
Wireshark user's guideWireshark user's guide
Wireshark user's guideGió Lào
 

What's hot (20)

California enterprise architecture_framework_2_0
California enterprise architecture_framework_2_0California enterprise architecture_framework_2_0
California enterprise architecture_framework_2_0
 
SafeDNS Content Filtering Service Guide
SafeDNS Content Filtering Service GuideSafeDNS Content Filtering Service Guide
SafeDNS Content Filtering Service Guide
 
Amdin iws7 817-2179-10
Amdin iws7 817-2179-10Amdin iws7 817-2179-10
Amdin iws7 817-2179-10
 
Mobile Marketing Association Best Practices
Mobile Marketing Association Best PracticesMobile Marketing Association Best Practices
Mobile Marketing Association Best Practices
 
Ar smartphones
Ar smartphonesAr smartphones
Ar smartphones
 
Gaia-X, le projet de cloud européen
Gaia-X, le projet de cloud européenGaia-X, le projet de cloud européen
Gaia-X, le projet de cloud européen
 
Handbook all eng
Handbook all engHandbook all eng
Handbook all eng
 
Why You Aren't Eligible for Social Security, Form #06.001
Why You Aren't Eligible for Social Security, Form #06.001Why You Aren't Eligible for Social Security, Form #06.001
Why You Aren't Eligible for Social Security, Form #06.001
 
The Endpoint Security Paradox
The Endpoint Security ParadoxThe Endpoint Security Paradox
The Endpoint Security Paradox
 
Intrusion Detection on Public IaaS - Kevin L. Jackson
Intrusion Detection on Public IaaS  - Kevin L. JacksonIntrusion Detection on Public IaaS  - Kevin L. Jackson
Intrusion Detection on Public IaaS - Kevin L. Jackson
 
THESEUS Usability Guidelines for Usecase Applications
THESEUS Usability Guidelines for Usecase ApplicationsTHESEUS Usability Guidelines for Usecase Applications
THESEUS Usability Guidelines for Usecase Applications
 
Open payments user guide [august-2014]
Open payments user guide [august-2014]Open payments user guide [august-2014]
Open payments user guide [august-2014]
 
Jobseeker (1)(1)(1)(1)
Jobseeker (1)(1)(1)(1)Jobseeker (1)(1)(1)(1)
Jobseeker (1)(1)(1)(1)
 
MS SSAS 2008 & MDX Reports
MS SSAS 2008 &  MDX Reports MS SSAS 2008 &  MDX Reports
MS SSAS 2008 & MDX Reports
 
Sample global forest wildfire detection system market research report 2020
Sample global forest wildfire detection system market research report 2020Sample global forest wildfire detection system market research report 2020
Sample global forest wildfire detection system market research report 2020
 
CMS | Open payments user guide
CMS | Open payments user guideCMS | Open payments user guide
CMS | Open payments user guide
 
Android
AndroidAndroid
Android
 
Sample global digital x ray system market research report 2020
Sample global digital x ray system market research report 2020 Sample global digital x ray system market research report 2020
Sample global digital x ray system market research report 2020
 
Wireshark user's guide
Wireshark user's guideWireshark user's guide
Wireshark user's guide
 
Stopping Malware
Stopping MalwareStopping Malware
Stopping Malware
 

Similar to Platform Monitoring and Alert

SMA - SUNNY DESIGN 3 and SUNNY DESIGN WEB
SMA - SUNNY DESIGN 3 and SUNNY DESIGN WEBSMA - SUNNY DESIGN 3 and SUNNY DESIGN WEB
SMA - SUNNY DESIGN 3 and SUNNY DESIGN WEBHossam Zein
 
Codendi 4.0 User Guide
Codendi 4.0 User GuideCodendi 4.0 User Guide
Codendi 4.0 User GuideCodendi
 
iPlanet to HP Apache Migration Plan
iPlanet to HP Apache Migration PlaniPlanet to HP Apache Migration Plan
iPlanet to HP Apache Migration Planwebhostingguy
 
software-eng.pdf
software-eng.pdfsoftware-eng.pdf
software-eng.pdffellahi1
 
VeraCode State of software security report volume5 2013
VeraCode State of software security report volume5 2013VeraCode State of software security report volume5 2013
VeraCode State of software security report volume5 2013Cristiano Caetano
 
Benefits of Modern Cloud Data Lake Platform Qubole GCP - Whitepaper
Benefits of Modern Cloud Data Lake Platform Qubole GCP - WhitepaperBenefits of Modern Cloud Data Lake Platform Qubole GCP - Whitepaper
Benefits of Modern Cloud Data Lake Platform Qubole GCP - WhitepaperVasu S
 
(Deprecated) Slicing the Gordian Knot of SOA Governance
(Deprecated) Slicing the Gordian Knot of SOA Governance(Deprecated) Slicing the Gordian Knot of SOA Governance
(Deprecated) Slicing the Gordian Knot of SOA GovernanceGanesh Prasad
 
Kindsight security labs malware report - Q4 2013
Kindsight security labs malware report - Q4 2013Kindsight security labs malware report - Q4 2013
Kindsight security labs malware report - Q4 2013Bee_Ware
 
Final 2016 cyber captive survey
Final 2016 cyber captive surveyFinal 2016 cyber captive survey
Final 2016 cyber captive surveyGraeme Cross
 
Palo alto-3.1 administrators-guide
Palo alto-3.1 administrators-guidePalo alto-3.1 administrators-guide
Palo alto-3.1 administrators-guideSornchai Saen
 
The ARJEL-compliant Trusted Solution For Online Gambling And Betting Operators
The ARJEL-compliant Trusted Solution For Online Gambling And Betting OperatorsThe ARJEL-compliant Trusted Solution For Online Gambling And Betting Operators
The ARJEL-compliant Trusted Solution For Online Gambling And Betting OperatorsMarket Engel SAS
 
Sap system-measurement-guide
Sap system-measurement-guideSap system-measurement-guide
Sap system-measurement-guideotchmarz
 
White Paper Indoor Positioning in Healthcare
White Paper Indoor Positioning in HealthcareWhite Paper Indoor Positioning in Healthcare
White Paper Indoor Positioning in Healthcareinfsoft GmbH
 
Industry 4.0 Market & Technologies. Focus on the U.S. - 2018-2023 - Report ToC
Industry 4.0 Market & Technologies. Focus on the U.S. - 2018-2023 - Report ToCIndustry 4.0 Market & Technologies. Focus on the U.S. - 2018-2023 - Report ToC
Industry 4.0 Market & Technologies. Focus on the U.S. - 2018-2023 - Report ToCHomeland Security Research Corp.
 
PANOS 4.1 Administrators Guide
PANOS 4.1 Administrators GuidePANOS 4.1 Administrators Guide
PANOS 4.1 Administrators GuideAltaware, Inc.
 

Similar to Platform Monitoring and Alert (20)

SMA - SUNNY DESIGN 3 and SUNNY DESIGN WEB
SMA - SUNNY DESIGN 3 and SUNNY DESIGN WEBSMA - SUNNY DESIGN 3 and SUNNY DESIGN WEB
SMA - SUNNY DESIGN 3 and SUNNY DESIGN WEB
 
Wisr2011 en
Wisr2011 enWisr2011 en
Wisr2011 en
 
Codendi 4.0 User Guide
Codendi 4.0 User GuideCodendi 4.0 User Guide
Codendi 4.0 User Guide
 
iPlanet to HP Apache Migration Plan
iPlanet to HP Apache Migration PlaniPlanet to HP Apache Migration Plan
iPlanet to HP Apache Migration Plan
 
software-eng.pdf
software-eng.pdfsoftware-eng.pdf
software-eng.pdf
 
VeraCode State of software security report volume5 2013
VeraCode State of software security report volume5 2013VeraCode State of software security report volume5 2013
VeraCode State of software security report volume5 2013
 
Benefits of Modern Cloud Data Lake Platform Qubole GCP - Whitepaper
Benefits of Modern Cloud Data Lake Platform Qubole GCP - WhitepaperBenefits of Modern Cloud Data Lake Platform Qubole GCP - Whitepaper
Benefits of Modern Cloud Data Lake Platform Qubole GCP - Whitepaper
 
Health Care Cyberthreat Report
Health Care Cyberthreat ReportHealth Care Cyberthreat Report
Health Care Cyberthreat Report
 
(Deprecated) Slicing the Gordian Knot of SOA Governance
(Deprecated) Slicing the Gordian Knot of SOA Governance(Deprecated) Slicing the Gordian Knot of SOA Governance
(Deprecated) Slicing the Gordian Knot of SOA Governance
 
Hfm install
Hfm installHfm install
Hfm install
 
Kindsight security labs malware report - Q4 2013
Kindsight security labs malware report - Q4 2013Kindsight security labs malware report - Q4 2013
Kindsight security labs malware report - Q4 2013
 
Final 2016 cyber captive survey
Final 2016 cyber captive surveyFinal 2016 cyber captive survey
Final 2016 cyber captive survey
 
Palo alto-3.1 administrators-guide
Palo alto-3.1 administrators-guidePalo alto-3.1 administrators-guide
Palo alto-3.1 administrators-guide
 
The ARJEL-compliant Trusted Solution For Online Gambling And Betting Operators
The ARJEL-compliant Trusted Solution For Online Gambling And Betting OperatorsThe ARJEL-compliant Trusted Solution For Online Gambling And Betting Operators
The ARJEL-compliant Trusted Solution For Online Gambling And Betting Operators
 
This is
This is This is
This is
 
Sap system-measurement-guide
Sap system-measurement-guideSap system-measurement-guide
Sap system-measurement-guide
 
Begining j2 me
Begining j2 meBegining j2 me
Begining j2 me
 
White Paper Indoor Positioning in Healthcare
White Paper Indoor Positioning in HealthcareWhite Paper Indoor Positioning in Healthcare
White Paper Indoor Positioning in Healthcare
 
Industry 4.0 Market & Technologies. Focus on the U.S. - 2018-2023 - Report ToC
Industry 4.0 Market & Technologies. Focus on the U.S. - 2018-2023 - Report ToCIndustry 4.0 Market & Technologies. Focus on the U.S. - 2018-2023 - Report ToC
Industry 4.0 Market & Technologies. Focus on the U.S. - 2018-2023 - Report ToC
 
PANOS 4.1 Administrators Guide
PANOS 4.1 Administrators GuidePANOS 4.1 Administrators Guide
PANOS 4.1 Administrators Guide
 

More from Braja Krishna Das

Netezza TwinFin12 Architecture Administration
Netezza TwinFin12 Architecture AdministrationNetezza TwinFin12 Architecture Administration
Netezza TwinFin12 Architecture AdministrationBraja Krishna Das
 
IoT Device Intelligence & Real Time Anomaly Detection
IoT Device Intelligence & Real Time Anomaly DetectionIoT Device Intelligence & Real Time Anomaly Detection
IoT Device Intelligence & Real Time Anomaly DetectionBraja Krishna Das
 
Real Time IoT Device Intelligence & Anomaly detection
Real Time IoT Device Intelligence & Anomaly detectionReal Time IoT Device Intelligence & Anomaly detection
Real Time IoT Device Intelligence & Anomaly detectionBraja Krishna Das
 
Cassandra Security Configuration
Cassandra Security ConfigurationCassandra Security Configuration
Cassandra Security ConfigurationBraja Krishna Das
 
Scala API - Azure Event Hub Integration
Scala API - Azure Event Hub IntegrationScala API - Azure Event Hub Integration
Scala API - Azure Event Hub IntegrationBraja Krishna Das
 
Azure Service Bus Queue Scala API
Azure Service Bus Queue Scala APIAzure Service Bus Queue Scala API
Azure Service Bus Queue Scala APIBraja Krishna Das
 
Azure Service Bus Queue API for Scala
Azure Service Bus Queue API for ScalaAzure Service Bus Queue API for Scala
Azure Service Bus Queue API for ScalaBraja Krishna Das
 
Azure Blob Storage API for Scala and Spark
Azure Blob Storage API for Scala and SparkAzure Blob Storage API for Scala and Spark
Azure Blob Storage API for Scala and SparkBraja Krishna Das
 
Azure Key Vault Integration in Scala
Azure Key Vault Integration in ScalaAzure Key Vault Integration in Scala
Azure Key Vault Integration in ScalaBraja Krishna Das
 
Netezza Architecture and Administration
Netezza Architecture and AdministrationNetezza Architecture and Administration
Netezza Architecture and AdministrationBraja Krishna Das
 

More from Braja Krishna Das (10)

Netezza TwinFin12 Architecture Administration
Netezza TwinFin12 Architecture AdministrationNetezza TwinFin12 Architecture Administration
Netezza TwinFin12 Architecture Administration
 
IoT Device Intelligence & Real Time Anomaly Detection
IoT Device Intelligence & Real Time Anomaly DetectionIoT Device Intelligence & Real Time Anomaly Detection
IoT Device Intelligence & Real Time Anomaly Detection
 
Real Time IoT Device Intelligence & Anomaly detection
Real Time IoT Device Intelligence & Anomaly detectionReal Time IoT Device Intelligence & Anomaly detection
Real Time IoT Device Intelligence & Anomaly detection
 
Cassandra Security Configuration
Cassandra Security ConfigurationCassandra Security Configuration
Cassandra Security Configuration
 
Scala API - Azure Event Hub Integration
Scala API - Azure Event Hub IntegrationScala API - Azure Event Hub Integration
Scala API - Azure Event Hub Integration
 
Azure Service Bus Queue Scala API
Azure Service Bus Queue Scala APIAzure Service Bus Queue Scala API
Azure Service Bus Queue Scala API
 
Azure Service Bus Queue API for Scala
Azure Service Bus Queue API for ScalaAzure Service Bus Queue API for Scala
Azure Service Bus Queue API for Scala
 
Azure Blob Storage API for Scala and Spark
Azure Blob Storage API for Scala and SparkAzure Blob Storage API for Scala and Spark
Azure Blob Storage API for Scala and Spark
 
Azure Key Vault Integration in Scala
Azure Key Vault Integration in ScalaAzure Key Vault Integration in Scala
Azure Key Vault Integration in Scala
 
Netezza Architecture and Administration
Netezza Architecture and AdministrationNetezza Architecture and Administration
Netezza Architecture and Administration
 

Recently uploaded

SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 

Recently uploaded (20)

SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 

Platform Monitoring and Alert

  • 1. 7/29/2020 Platform Monitoring and Alert Completeness and Consistency: Application and Data Services Stability: IoT Hub, Kafka, Cassandra Monitoring Resources: Kubernetes Resource Monitoring Braja Das BDAS@STARBUCKS.COM, BKD_108@YAHOO.COM Platform Monitoring and Alert Completeness and Consistency: Application and Data Services Stability: IoT Hub, Kafka, Cassandra Monitoring Resources: Kubernetes Resource Monitoring Release Version: Summer 2019
  • 2. Table of Contents
    1.0 OBJECTIVE
    2.0 PRODUCT SCOPE
    3.0 MONITORING ARCHITECTURE AT SCALE
    4.0 COMPONENTS OF MONITORING ARCHITECTURE
      4.1 Kubernetes Application and Data plane
      4.2 Kubernetes Control and Monitor plane
      4.3 Azure Service Bus
      4.4 Cloud Data Store
      4.5 Power BI Visual Presentation
      4.6 MS Flow
      4.7 Notification Channel
    5.0 MONITORING AGENT AND EXCEPTION HANDLING
      5.1 Exception Alerting
    6.0 MONITORING AGENT MICROSERVICES
    7.0 KUBERNETES APPLICATION AND DATA SERVICES MONITORING
    8.0 KAFKA OFFSET DELAY ALERT
    9.0 VISUAL PRESENTATION OF KAFKA OFFSET DELAY
      9.1 Kafka offset delay card
      9.2 Real Time Kafka offset statistics
    10.0 VISUAL PRESENTATION OF IOT HUB, EVENT HUB OFFSET DELAY
      10.1 Event Hub offset delay card
      10.2 Real Time Event Hub offset statistics
    11.0 CASSANDRA METRICS MONITORING
    12.0 AI-OPS ALERT: CASSANDRA CLIENT REQUEST OUTAGE ALERT
    13.0 VISUAL PRESENTATION OF CASSANDRA METRICS MONITORING
      13.1 Latest Cassandra Client Request Outage
      13.2 Latest Cassandra Table Latency
      14.1 Kubernetes cAdvisor
      14.2 Prometheus time series data model
      14.3 Prometheus Exporters and Integrations
    15.0 AI OPS ALERT: ABNORMAL RESOURCE (CPU OR MEMORY) USAGE IN LAST HOUR
      15.1 Issues found
      15.2 Root Cause Analysis
  • 3. 15.3 Recommendation
    16.0 VISUAL PRESENTATION OF KUBERNETES RESOURCE USAGE AND MONITORING
      16.1 Latest CPU and Memory Usage Statistics
      16.2 Kubernetes high CPU or Memory Usage Alerts from alert card
      16.3 CPU and Memory Load Distribution Profile
      16.4 Container CPU and Memory Request VS. Limit
      16.5 Persistent Volume Claim Usage Statistics (PVC)
    17.0 MONITORING AGGREGATE AND ALERT SUMMARY REPORT
      17.1 Application and data streaming service stability report
      17.2 Kafka offset Aggregate Hist
      17.3 IoT Hub, Event Hub offset Aggregate Hist
      17.4 Cassandra Client Request Outage Aggregate Hist
      17.5 Cassandra Table Latency Aggregate Hist
      17.6 Kubernetes nodes resource (CPU, Memory) usage contributions to Agent pool Hist
      17.7 Last week’s Kubernetes abnormal resource (CPU or Memory) usage summary
      17.8 Batch and Streaming Job Audit Log Statistics
    APPENDIX
      Prometheus web site
  • 4. 1.0 Objective:
    Applications and data services are critical for business success. In a microservices-oriented architecture, each microservice or application is accountable for a distinct task. It is therefore critical to bring the control plane up front, with 360-degree monitoring of infrastructure, microservices and apps. This confirms the completeness, consistency and stability of application and data services and supports business continuity with a high level of trust.
    The key objective of this monitoring is the continuous completeness, consistency and stability of the system, application and data services. Because application and data services access different control points, monitoring these end points directly benefits both system stability and the consistency of application services. Kafka offset and Event Hub offset delay monitoring helps identify application service consistency by detecting when a service (producer) stops producing data at a control point. Reconciliation among control points helps identify data loss, while spatio-temporal data aggregation and statistical modeling at a control point help identify data usage and its variation over time. Abnormal pattern recognition flags data incompleteness and inconsistency.
    This product handbook highlights the system stability monitoring and alerting features for Kubernetes infrastructure and resource usage, Kubernetes job monitoring using Kafka offset and Event Hub offset delay, and Cassandra health monitoring by object latency and by client request failures, timeouts and unavailability, all within the Kubernetes container orchestration framework.
  • 5. 2.0 Product scope
    The release-1 product scope of the monitoring solution includes, but is not limited to, the following topics.
    a. Define and design the monitoring and alerting architecture.
    b. Identify infrastructure monitoring control points.
    c. Define the scope of Kubernetes resource and app monitoring.
    d. Define the scope of Kafka and Event Hub app monitoring.
    e. Define monitoring and alerting business rules.
    f. Define alerts for L1, L2 and L3 support.
    g. Develop a configurable monitoring and alerting framework.
    h. Develop a configurable monitoring microservice for data services.
    i. Store and aggregate monitoring metrics for future model generation.
    j. Develop visual presentations for monitoring metric trends, summaries and aggregates.
    k. Send alert push notifications via Slack, SendGrid (Email), PagerDuty, Remedy, SMS.
  • 7. 4.0 Components of Monitoring Architecture
    4.1 Kubernetes Application and Data plane:
    Application and data microservices perform distinct tasks for business processes and run inside Kubernetes. It is important that these services run and deliver output as expected, so in practice monitoring them is critical for business continuity and success. Helm charts help install the different app clusters inside Kubernetes. Open source Apache Cassandra as the data store and open source Apache Kafka as the messaging bus are two important clusters that are used for processing both streaming and batch jobs.
    4.2 Kubernetes Control and Monitor plane
    The control plane includes different agents that work as watchdogs and ensure the health of the overall system. The agents that work in the Kubernetes control plane are:
    a. CCS agent: CCS agents or microservices perform Completeness, Consistency and Stability checks of application and data services. These services can run in batch or stream mode.
    b. Prometheus agent: Collects Kubernetes metrics from the Kubernetes metric server and stores them in the Prometheus server (Influx-DB time series DB). Prometheus node exporters, Kafka exporters and Cassandra exporters stream Prometheus metrics from the Prometheus server to Kafka.
    c. Monitor agent: Monitor agent microservices monitor Kubernetes, Kafka, Cassandra and Azure resources (IoT Hub, Event Hub, Service Bus, Blob Store). These services also compute monitoring metrics and statistics for modeling and machine learning.
  • 8. d. Notification agent: These agent microservices collect actionable metric statistics and send them to the different notification channels. The notification agent runs in Kubernetes. Notifications can be time triggered or event triggered. Notification channels include Slack, SendGrid (Email), PagerDuty, Remedy, SMS.
    4.3 Azure Service Bus
    Azure Service Bus Event Hub is used for sending real-time monitoring statistics for display in Power BI visual presentations. Event Hub is also used for real-time alerting.
    4.4 Cloud Data Store
    Monitoring statistics are stored in a data store. Azure SQL DB and Azure Blob store are used as data stores. The Microsoft-provided Spark-SQLDB connector and the Apache hadoop-azure connector are used. Monitoring stats are read from Kafka and stored in the Azure data store.
      "com.microsoft.azure" % "azure-sqldb-spark" % "1.0.2",
      "org.apache.hadoop" % "hadoop-azure" % "2.7.3"
    4.5 Power BI Visual Presentation
    Power BI is used for visual presentation of monitoring statistics and alerts. These monitoring statistics include, but are not limited to, trend graphs, frequency distributions, probability of failures, and set analysis among different metrics.
  • 9. 4.6 MS Flow
    MS Flow has different connectors, including one for Power BI. When a dataset refreshes or a new dataset shows up in Power BI, the MS Flow connectors pick it up and send alerts based on the configured workflow. SendGrid email, PagerDuty and Slack are the connectors currently used as part of the alerting and notification framework.
    4.7 Notification Channel
    Slack, PagerDuty, Email, SMS and Remedy are used as notification channels. Warning, Low and High severity alerts are sent to the Slack channel. Integration is also in place along the chain email → PagerDuty → Remedy → SMS.
  • 10. 5.0 Monitoring Agent and Exception Handling
    Monitoring microservices monitor application and data services using CCS agents and Monitor agents. Exceptions are caught in real time and the application owner is alerted by email. Depending on the assessed severity, alerts are also sent to Slack and PagerDuty. Exceptions are captured in Power BI in real time. Azure real-time Stream Analytics is used to stream data from Event Hub. CCS and monitoring agents send the exception string to Event Hub so that downstream processes can pick it up in real time.
  • 11. 5.1 Exception Alerting
    Application, data and monitoring services have exception alerting in place. This helps in understanding the severity of exceptions and the time at which they happen. The same alert arriving frequently within a short time indicates an issue in the application, data or monitoring services that needs immediate attention. Example of exception alerting is shown below.
  • 12. 6.0 Monitoring Agent Microservices
    Monitoring agent microservices are mostly written in Scala with spark-streaming and akka-streaming. Time-triggered alerts and monitoring aggregates run from Kubernetes cron jobs. Monitoring agents mostly use the following libraries.
      "org.apache.kafka" %% "kafka" % kafkaVersion,
      "org.apache.spark" %% "spark-streaming" % sparkVersion,
      "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion,
      "com.microsoft.azure" % "azure-eventhubs" % "0.7.5",
      "com.microsoft.azure" % "azure-sqldb-spark" % "1.0.2",
      "org.apache.hadoop" % "hadoop-azure" % "2.7.3",
      "org.apache.commons" % "commons-email" % "1.4",
      "com.typesafe" % "config" % "1.2.1"
    The Microsoft-provided Spark SQL DB connector helps store data in SQL DB asynchronously. Transformation routines have also been written inside SQL DB. The monitoring agent calls these routines to transform data into the required format.
    The org.apache.commons commons-email library is used for most email alerts. SendGrid is used as the email carrier, and a SendGrid API key is used for secure communication.
    A Slack app token is used for Slack communication. Slack chat message REST calls are used to publish messages in the Slack channel.
      "https://slack.com/api/chat.postMessage"
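The Slack call mentioned above can be sketched as follows. This is a minimal illustration in Python rather than the Scala the agents are written in; the token value, the channel name and the helper name `build_slack_request` are assumptions for illustration and are not taken from the handbook. The request is only built, not sent.

```python
import json
import urllib.request

SLACK_POST_MESSAGE_URL = "https://slack.com/api/chat.postMessage"


def build_slack_request(token: str, channel: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a chat.postMessage request.

    `token` is the Slack app token; `channel` and `text` are the target
    channel and the alert message. All three are caller-supplied.
    """
    body = json.dumps({"channel": channel, "text": text}).encode("utf-8")
    return urllib.request.Request(
        SLACK_POST_MESSAGE_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder token and channel, for illustration only.
    req = build_slack_request("xoxb-placeholder", "#iot-alerts",
                              "Kafka offset delay exceeded threshold")
    print(req.get_method(), req.full_url)
    # Sending would be urllib.request.urlopen(req); the JSON reply carries
    # an "ok" field that should be checked before treating the alert as sent.
```

Keeping request construction separate from sending makes the alert payload easy to inspect without network access.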
  • 13. Alert EPML Configuration:
    The monitoring framework is fully configurable. Most monitoring job configuration is based on Typesafe config files. In addition to the Typesafe configuration files, an epmlconfig (Event Processing Markup Language) JSON file is used. Microservices read this epmlconfig file for the different alerting conditions and alerting rules. An example epmlconfig record looks like this:
      {"type":"alert", "subject":"IoT Kubernetes CPU Alert: Last Hour CPU Usage exceeds 75%", "fields":["eventdate", "eventhour", "node", "core_bucket", "avg_coreused", "max_coreused", "confidence_level"], "fieldsheader":"Eventdate, Eventhour, Node, Core Bucket, Avg. core usage, Max. core usage, Confidence level", "filter":"max_coreused <> '0'"}
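One way an agent might consume such a record is sketched below in Python (the agents themselves are described as Scala services). The handbook does not specify how the `filter` expression is evaluated, so the tiny parser here is an assumption that handles only the `field <> 'value'` form seen in the example; the sample rows and the function names are hypothetical.

```python
import json

# The example record from the handbook, reflowed for readability.
EPML_RECORD = '''
{"type": "alert",
 "subject": "IoT Kubernetes CPU Alert: Last Hour CPU Usage exceeds 75%",
 "fields": ["eventdate", "eventhour", "node", "core_bucket",
            "avg_coreused", "max_coreused", "confidence_level"],
 "fieldsheader": "Eventdate, Eventhour, Node, Core Bucket, Avg. core usage, Max. core usage, Confidence level",
 "filter": "max_coreused <> '0'"}
'''


def parse_not_equal_filter(expr):
    """Split a filter of the form  field <> 'value'  into (field, value).

    Assumption: only this single not-equal form is supported here.
    """
    field, _, value = expr.partition("<>")
    return field.strip(), value.strip().strip("'")


def render_alert(config, rows):
    """Return the alert text: subject, header line, then one line per kept row."""
    field, excluded = parse_not_equal_filter(config["filter"])
    kept = [r for r in rows if str(r[field]) != excluded]
    lines = [config["subject"], config["fieldsheader"]]
    for row in kept:
        lines.append(", ".join(str(row[name]) for name in config["fields"]))
    return "\n".join(lines)


if __name__ == "__main__":
    config = json.loads(EPML_RECORD)
    # Hypothetical metric rows; the second is dropped by the filter.
    rows = [
        {"eventdate": "2019-07-25", "eventhour": 10, "node": "node-1",
         "core_bucket": "75%-90%", "avg_coreused": "3.1",
         "max_coreused": "3.4", "confidence_level": 0.5},
        {"eventdate": "2019-07-25", "eventhour": 10, "node": "node-2",
         "core_bucket": "75%-90%", "avg_coreused": "0",
         "max_coreused": "0", "confidence_level": 0.0},
    ]
    print(render_alert(config, rows))
```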
  • 14. 7.0 Kubernetes Application and Data Services Monitoring
    Different apps or clusters are installed in Kubernetes. Monitoring of Kubernetes microservices follows these principles.
    a. Exception alerting from application, data and monitoring microservices.
    b. Microservices monitoring using control point delay.
       I. Kafka (control point) topic data arrival (offset change) delay.
       II. Event Hub data arrival (offset change) delay.
    c. Cassandra client request outage (timeouts, failures, unavailable) and latency monitoring.
    Control point (CP) monitoring in the control plane gives a direct view of the services that access those control points. Monitoring the offset delay of control points such as Kafka and Event Hub helps detect stoppage of application and data services, and the appropriate delay alerts are triggered.
  • 15. 8.0 Kafka offset delay alert:
    Kafka delay alert notifications are configurable. The currently supported notification methods are email (SendGrid), Slack, PagerDuty, SMS and Remedy (ITSM). Notifications are generated from Kubernetes cron jobs and Power BI alert cards. A Power BI alert card triggers a notification to MS Flow for the connectors to pick up. When the Kafka offset delay exceeds a certain threshold (in minutes), an alert is sent to the appropriate notification channels.
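The delay rule described above could be expressed roughly as follows. This is a sketch under the assumption that the agent tracks, per topic, the time at which the latest offset last changed; the function names, topic names and the 30-minute threshold are illustrative and not taken from the handbook.

```python
from datetime import datetime, timedelta


def offset_delay_minutes(last_offset_change: datetime, now: datetime) -> float:
    """Minutes elapsed since the topic's latest offset last moved."""
    return (now - last_offset_change).total_seconds() / 60.0


def delayed_topics(last_change_by_topic, now, threshold_min):
    """Return {topic: delay in minutes} for topics whose delay exceeds the threshold."""
    alerts = {}
    for topic, changed_at in last_change_by_topic.items():
        delay = offset_delay_minutes(changed_at, now)
        if delay > threshold_min:
            alerts[topic] = round(delay, 1)
    return alerts


if __name__ == "__main__":
    now = datetime(2019, 7, 25, 12, 0)
    # Hypothetical topics and last-offset-change timestamps.
    last_change = {
        "telemetry": now - timedelta(minutes=45),
        "orders": now - timedelta(minutes=5),
    }
    print(delayed_topics(last_change, now, threshold_min=30))
```

The same comparison applies to Event Hub offsets, since only the last-change timestamp is needed.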
  • 16. 9.0 Visual Presentation of Kafka Offset Delay
    9.1 Kafka offset delay card
    A data arrival delay card is configured for each Kafka topic. When the data is refreshed, an alert is triggered only if the card value exceeds a certain threshold. The latest offset processing time is also displayed and updates in real time. Operations support can act on Kafka delay alerts by also checking this latest offset processing timestamp.
  • 17. 9.2 Real Time Kafka offset statistics
    Kafka offset statistics are captured in real time. These statistics give not only a historical snapshot but also a visualization tool that operations support can follow in real time. When offset processing stops, or when application and data services stop producing traffic to a Kafka topic, operations support can look at these real-time statistics together with the latest offset processing timestamp and confirm the problem.
  • 18. 10.0 Visual Presentation of IoT Hub, Event Hub Offset Delay
    10.1 Event Hub offset delay card
    A data arrival delay card is configured for the Event Hub topic behind IoT Hub’s built-in endpoint. When the data is refreshed, an alert is triggered if the Event Hub data arrival delay exceeds a certain threshold. The latest offset processing time is also displayed and updates in real time. Operations support can act on Event Hub data arrival delay alerts by also checking this latest offset processing timestamp.
  • 19. 10.2 Real Time Event Hub offset statistics
    Event Hub offset statistics are captured in real time. These statistics give not only a historical snapshot but also a visualization tool for operations support to follow. When Event Hub has an outage, when offset processing stops, or when application and data services stop producing traffic to Event Hub, operations support can look at these real-time statistics together with the latest processing timestamp and confirm the problem.
  • 20. 11.0 Cassandra Metrics Monitoring
    When application and data services request Cassandra client connections, it is important to track the connection status. A client request failure is raised when an app or query connection request is unable to access Cassandra. Cassandra can also be unavailable when it cannot accept excess client connection requests. After a successful Cassandra connection, application reads or writes sometimes have very large latency, which can cause timeouts. Cassandra client request failure, timeout and unavailable metrics are important indicators of the stability of application and data services while they access Cassandra instances.
    The following metrics are important for Cassandra client request outage and client request latency. It is also important to capture a few other Cassandra performance metrics, such as keyspace and table latency metrics and keycachehitrate metrics. LiveDiskSpaceUsed and LiveDiskSpaceAvailable are two important metrics for Cassandra PVC disk space monitoring. The Kubernetes Cassandra exporter exports Cassandra metrics from the Prometheus server to a Kafka topic.
    https://github.com/criteo/cassandra_exporter
    a. Disk space related metric label values contain totaldiskspaceused:value, livediskspaceused:value
    b. Client request related metric labels contain clientrequest: oneminuterate
    c. Table and keyspace latency metric label values contain rangelatency:max, writelatency:max, coordinatorscanlatency:max, coordinatorreadlatency:max, readlatency:max
    d. keycachehitrate metric label values contain keycachehitrate:value
  • 21. 12.0 AI-Ops Alert: Cassandra Client Request Outage alert
    Cassandra client request outage metrics indicate that apps or queries could not establish a connection, failed in a connection request, or lost the connection while accessing Cassandra objects or metadata, and that this needs immediate attention. The Cassandra client request outage alert includes the following.
    12.1 Issues Found: frequency of client request failure, timeout and unavailable events in the last hour. Attributes include
       i. Eventtime: timestamp at which the Cassandra client request outage was triggered in the last hour.
       ii. Metricgroup: timeouts, failure or unavailable.
       iii. Pods: Cassandra instance.
    12.2 Root Cause Analysis: Top 10 Table Latency in the last hour:
       i. Latency type: coordinator read latency, read latency, coordinator scan latency, write latency, coordinator write latency.
       ii. Keyspace: keyspace name.
       iii. Table name: table name.
       iv. Pods: Cassandra instances.
       v. Last hour’s total latency (seconds): total latency in seconds in the last hour.
       vi. Last hour’s latency frequency: number of times latency (> 400 ms) was triggered in the last hour.
       vii. Rank of latency duration: ranking by latency duration.
    The figure below is a snapshot of the Cassandra client request outage AI-Ops alert.
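The root cause table in 12.2 can be approximated with a small aggregation. The sketch below assumes that latency samples arrive as (keyspace, table, latency type, pod, latency in ms) tuples and that only samples above the 400 ms threshold count toward both the total and the frequency; the handbook does not state either detail, and the function and field names are hypothetical.

```python
from collections import defaultdict

LATENCY_THRESHOLD_MS = 400  # threshold quoted in the alert definition


def top_table_latency(samples, top_n=10):
    """Rank (keyspace, table, latency type, pod) groups by total high latency.

    samples: iterable of (keyspace, table, latency_type, pod, latency_ms).
    Assumption: only samples above the threshold are aggregated.
    """
    agg = defaultdict(lambda: {"total_s": 0.0, "freq": 0})
    for keyspace, table, latency_type, pod, latency_ms in samples:
        if latency_ms > LATENCY_THRESHOLD_MS:
            entry = agg[(keyspace, table, latency_type, pod)]
            entry["total_s"] += latency_ms / 1000.0
            entry["freq"] += 1
    ranked = sorted(agg.items(), key=lambda kv: kv[1]["total_s"], reverse=True)
    return [
        {
            "rank": rank,
            "keyspace": key[0],
            "table": key[1],
            "latency_type": key[2],
            "pod": key[3],
            "total_latency_s": round(value["total_s"], 3),
            "frequency": value["freq"],
        }
        for rank, (key, value) in enumerate(ranked[:top_n], start=1)
    ]


if __name__ == "__main__":
    # Hypothetical last-hour samples.
    samples = [
        ("iot", "events", "readlatency", "cassandra-0", 900),
        ("iot", "events", "readlatency", "cassandra-0", 600),
        ("iot", "devices", "writelatency", "cassandra-1", 500),
        ("iot", "devices", "writelatency", "cassandra-1", 100),
    ]
    for row in top_table_latency(samples):
        print(row)
```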
  • 22. Figure: Cassandra Client Request Outage Alert: Last hour Failure, Timeout, Unavailable
  • 23. 13.0 Visual Presentation of Cassandra Metrics Monitoring
    13.1 Latest Cassandra Client Request Outage
    Cassandra client request outage includes client request timeouts, failures and unavailable events. When an alert is raised, operations support can look at temporal aggregate frequency views (by hour or by minute) of the Cassandra outage. This trend graph confirms the severity of problems in recent hours.
  • 24. 13.2 Latest Cassandra Table Latency
    High latency (read, write, coordinator read, coordinator write, range scan, coordinator scan) on Cassandra objects is the primary cause of Cassandra client request outage. The first graph displays the latest latency trend by keyspace. The second graph shows latency by latency type and by table. The third and fourth graphs give hourly and minute-level drill-downs of table latency. Operations support has this view for confirming problem severity.
  • 25. 14.0 Kubernetes Resource Monitoring and Alert
    CPU, memory and volume usage are important in infrastructure monitoring. Proactive failure alerts help in taking action on time. Kubernetes resource monitoring and notification includes the following steps.
    Capture Kubernetes resource utilization metrics from Kubernetes cAdvisor and transfer them to the Prometheus time series (Influx-DB) server. The Prometheus agent does this job.
    The Prometheus agent exports node, Kafka and Cassandra metrics through exporter jobs from the Influx-DB time series database to Kafka in near real time.
    Apply business transformation rules to find resource usage. The following KPIs are used in this case.
       i. Volume Metrics: kubelet_volume_stats_available_bytes, kubelet_volume_stats_used_bytes
       ii. Container Resource Requests and Limits: kube_pod_container_resource_limits, kube_pod_container_resource_requests, kube_node_status_allocatable
       iii. Container Resource Usage Metrics: container_memory_working_set_bytes, container_cpu_usage_seconds_total, container_fs_writes_bytes_total, container_fs_reads_bytes_total, container_fs_io_time_seconds_total
    Build aggregates for trend graphs, modeling and machine learning.
    Apply set analysis for AIOps alerts.
    Visualize resource utilization KPIs in Power BI.
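As an illustration of the transformation step, CPU utilization can be derived from two samples of the cumulative `container_cpu_usage_seconds_total` counter and the node's allocatable cores (as reported by `kube_node_status_allocatable`). The handbook does not give its exact transformation rules, so the formula below is a common approach stated as an assumption, and the function names are hypothetical.

```python
def cores_used(first, last):
    """Average number of cores used between two counter samples.

    first, last: (unix_timestamp_seconds, cumulative_cpu_seconds) pairs.
    Counter resets (e.g. after a container restart) are not handled here.
    """
    (t0, v0), (t1, v1) = first, last
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    return (v1 - v0) / (t1 - t0)


def cpu_percent(first, last, allocatable_cores):
    """CPU usage over the interval as a percentage of allocatable cores."""
    if allocatable_cores <= 0:
        raise ValueError("allocatable_cores must be positive")
    return 100.0 * cores_used(first, last) / allocatable_cores


if __name__ == "__main__":
    # Hypothetical samples one hour apart on a node with 4 allocatable cores:
    # 10800 CPU-seconds consumed in 3600 s is 3 cores on average.
    start = (0, 100.0)
    end = (3600, 10900.0)
    print(cpu_percent(start, end, allocatable_cores=4))
```

Memory needs no rate computation: `container_memory_working_set_bytes` is a gauge, so it can be divided by allocatable memory directly.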
  • 26. Develop an AIOps alert for CPU, memory and volume usage. This AIOps alert includes the following topics.
    High resource utilization: High resource (CPU, memory, volume) utilization in nodes, agent pool and cluster.
    Root Cause Analysis: Top apps by resource usage in nodes, agent pool and across clusters.
    Recommendation: App load balancing recommendation from models or set analysis. The least used nodes are candidates for app load balancing.
    14.1 Kubernetes cAdvisor
    cAdvisor is an open source container resource usage and performance analysis agent. It is purpose-built for containers and supports Docker containers natively. In Kubernetes, cAdvisor is integrated into the Kubelet binary. cAdvisor auto-discovers all containers in the machine and collects CPU, memory, filesystem, and network usage statistics. cAdvisor also provides the overall machine usage by analyzing the ‘root’ container on the machine.
    14.2 Prometheus time series data model
    Prometheus is an open-source systems monitoring and alerting toolkit. It is now a standalone open source project and maintained independently of any company. To emphasize this, and to clarify the project's governance structure, Prometheus joined the Cloud Native Computing Foundation in 2016 as the second hosted project, after Kubernetes.
  • 27. Prometheus's main features are:
    • a multi-dimensional data model with time series data identified by metric name and key/value pairs
    • time series collection happens via a pull model over HTTP
    • pushing time series is supported via an intermediary gateway
    Prometheus scrapes metrics from instrumented jobs, either directly or via an intermediary push gateway for short-lived jobs. It stores all scraped samples locally and runs rules over this data to either aggregate and record new time series from existing data or generate alerts.
    14.3 Prometheus Exporters and Integrations
    A number of libraries and servers help export existing metrics from third-party systems as Prometheus metrics. This is useful for cases where it is not feasible to instrument a given system with Prometheus metrics directly. Kafka exporters export Prometheus metrics from the Prometheus server to Kafka. Node exporters and Cassandra exporters are used for exporting node and Cassandra metrics to Kafka for further transformation and analysis. The GitHub links for these exporters are:
    https://github.com/prometheus/node_exporter
    https://github.com/danielqsj/kafka_exporter
    https://github.com/criteo/cassandra_exporter
  • 28. 15.0 AI Ops Alert: Abnormal resource (CPU or Memory) usage in last hour
    Application and data services need resources (CPU and memory) from the Kubernetes cluster. A high resource usage alert helps avoid node restarts and keeps apps running without issues. When a node restarts, its apps also restart; they have to go through the service discovery process and find resources to run on other nodes or on new nodes. There are three parts to this AI-Ops alert.
    15.1 Issues found: Nodes whose CPU or memory usage exceeds the threshold (75%). Attributes used are
       I. Node: nodes with CPU or memory usage > 75%.
       II. Core Bucket: attribute values are 75%-90%, 90%-100% or 100%.
       III. Avg. core or memory usage: last hour’s average core or memory usage.
       IV. Max. core or memory usage: last hour’s maximum core or memory usage.
       V. Confidence level: the confidence level is P(n|N), where n = the event count of memory or CPU usage in the given time scale from a resource bucket and N = the total events triggered in the same time frame.
    15.2 Root Cause Analysis: Top 3 apps with high CPU or memory utilization in the nodes. Attributes used are
       i. Node: name of the alerting node.
       ii. Container: app container name.
       iii. Pod: app pod name.
       iv. Container’s rank by high core or memory usage.
       v. % resource used (core or memory).
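The bucket and confidence level attributes in 15.1 can be illustrated with a short computation. The sketch below reads P(n|N) as the plain ratio n / N and computes the average and maximum within each bucket; both readings are assumptions, as are the function names and the sample values.

```python
def core_bucket(usage_pct):
    """Map a usage percentage to the alert bucket, or None below 75%."""
    if usage_pct >= 100:
        return "100%"
    if usage_pct >= 90:
        return "90%-100%"
    if usage_pct >= 75:
        return "75%-90%"
    return None


def bucket_stats(usage_pct_samples):
    """Per-bucket average, maximum and confidence level for one node.

    usage_pct_samples: the node's usage samples (percent) for the last hour.
    confidence_level = n / N, with n the samples in the bucket and N all samples.
    """
    total = len(usage_pct_samples)
    grouped = {}
    for pct in usage_pct_samples:
        bucket = core_bucket(pct)
        if bucket is not None:
            grouped.setdefault(bucket, []).append(pct)
    return {
        bucket: {
            "avg": sum(values) / len(values),
            "max": max(values),
            "confidence_level": len(values) / total,
        }
        for bucket, values in grouped.items()
    }


if __name__ == "__main__":
    # Hypothetical last-hour samples for one node.
    print(bucket_stats([50, 80, 85, 95]))
```

A low confidence level for a high bucket suggests a short spike; a value near 1 means the node spent most of the hour in that bucket.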
  • 29. 15.3 Recommendation: App load balancing recommendation. The nodes with the lowest core or memory usage are strong candidates for app load rebalancing. This AI-Ops alert finds the top 5 least used nodes, with less than 20% core or memory usage in the last hour, for app load rebalancing. Operations support needs to restart the job and relabel it so that the app services run on these least used nodes. Attributes used here are:
       i. Node: recommended nodes.
       ii. Avg. core or memory used: last hour’s average core or memory used on the node.
       iii. Max. core or memory used: last hour’s maximum core or memory used on the node.
       iv. Least core or memory used rank: node ranking based on least core or memory usage.
    A snapshot of the AI Ops alert is given below.
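The recommendation rule reads naturally as a filter followed by a ranking. The sketch below assumes the 20% cut-off is applied to the last hour's average usage, which the handbook does not spell out; node names, sample values and the function name are hypothetical.

```python
def rebalance_candidates(node_usage, max_pct=20.0, top_n=5):
    """Return up to top_n least used nodes, ranked by average usage.

    node_usage: {node name: list of last-hour usage samples in percent}.
    Assumption: a node qualifies when its average usage is below max_pct.
    """
    stats = []
    for node, samples in node_usage.items():
        avg = sum(samples) / len(samples)
        if avg < max_pct:
            stats.append({"node": node, "avg": avg, "max": max(samples)})
    stats.sort(key=lambda entry: entry["avg"])
    return [dict(entry, rank=rank) for rank, entry in enumerate(stats[:top_n], start=1)]


if __name__ == "__main__":
    # Hypothetical last-hour usage samples per node.
    usage = {"node-1": [10, 20], "node-2": [5, 5], "node-3": [80, 90]}
    for candidate in rebalance_candidates(usage):
        print(candidate)
```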
  • 30. Figure: AI-Ops Alert - Kubernetes abnormal resource usage in last hour.
  • 31. 16.0 Visual Presentation of Kubernetes Resource Usage and Monitoring
    The Kubernetes resource usage presentation includes, but is not limited to, the following visuals.
    16.1 Latest CPU and Memory Usage Statistics
    The cluster’s CPU and memory utilization statistics for the last n hours. This gives the CPU and memory usage trends of the Kubernetes nodes. The latest memory and CPU usage statistics help visualize problem severity, and operations support uses this tool to take appropriate action. The container_memory_working_set_bytes, container_cpu_usage_seconds_total and kube_node_status_allocatable metrics are used to compute these statistics.
  • 32. 16.2 Kubernetes high CPU or Memory Usage Alerts from alert card
    Configure a CPU and memory utilization card for each node. When a card value exceeds the configured threshold (> 75%), an alert is raised and sent to the configured notification channels. Slack, email, PagerDuty, Remedy and SMS can be used as notification channels.
  • 33. 16.3 CPU and Memory Load Distribution Profile
    Apps go through service discovery and failover when a Kubernetes node restarts. Tracking core (CPU) and memory utilization helps avoid unwanted node and cluster restarts.
    The Kubernetes CPU or memory load distribution shows the % of core (CPU) or memory used by each node in an agent pool. An uneven or highly skewed CPU or memory contribution indicates an opportunity to balance the apps' CPU or memory load. Optimized cluster utilization helps reduce costs and allows better app performance.
    The apps' CPU or memory distribution within a node shows the % of core or memory used by each app on that node. When a node has high CPU or memory utilization (i.e. > 75%), it is important to redistribute app load to the least-used nodes. The app load distribution profile of a node helps identify the heaviest apps, ranked by core or memory usage.
  • 34. Figure: Node1 distribution in agentpool1 and apps distribution in node 1.
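The two distribution profiles in 16.3 share the same arithmetic: each member's usage is divided by the group total. The sketch below applies to nodes within an agent pool as well as to apps within a node; the function names are hypothetical and the input is assumed to be a mapping from name to core or memory usage.

```python
def share_pct(usage_by_name):
    """Return each member's share of the group's total usage, in percent."""
    total = sum(usage_by_name.values())
    if total == 0:
        return {name: 0.0 for name in usage_by_name}
    return {name: 100.0 * value / total for name, value in usage_by_name.items()}


def top_apps(usage_by_app, n=3):
    """Return the names of the n heaviest apps, highest usage first."""
    return sorted(usage_by_app, key=usage_by_app.get, reverse=True)[:n]
```

A pool where one node uses 6 cores and another uses 2 yields shares of 75% and 25%, the kind of skew the text describes as a rebalancing opportunity.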
  • 35. 16.4 Container CPU and Memory Request vs. Limit
    The resource request and resource limit settings of an app container control the app's resource usage in the cluster. The figure below shows that the kuberesources-reader-controller container requests 49.9% of the total memory available to containers in the Kubernetes cluster, and that 10% of the total available CPU is requested for the kafkaoffset-monitor-controller container. When a container's memory usage exceeds its memory limit, the container is terminated and restarted, which releases its resources; CPU usage above the CPU limit is throttled instead. kube_pod_container_resource_limits and kube_pod_container_resource_requests are the two important metrics for this transformation and computation. For safety, every pod should have resource requests and limits set.
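The percentages in 16.4 and the closing safety rule can be sketched as follows. The input shapes and function names are assumptions; in practice the values would come from kube_pod_container_resource_requests and kube_pod_container_resource_limits.

```python
def request_share_pct(requests_by_container, total_allocatable):
    """Return each container's request as a percentage of the cluster total."""
    return {
        name: 100.0 * request / total_allocatable
        for name, request in requests_by_container.items()
    }


def missing_settings(settings_by_container):
    """Return the containers that lack a request or a limit, sorted by name.

    settings_by_container maps a container name to a dict with optional
    "request" and "limit" entries (hypothetical input shape).
    """
    return sorted(
        name
        for name, settings in settings_by_container.items()
        if settings.get("request") is None or settings.get("limit") is None
    )
```

Containers returned by missing_settings are the ones that violate the rule that every pod should carry both a request and a limit.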
  • 36. 16.5 Persistent Volume Claim (PVC) Usage Statistics
    Persistent volumes can be configured in Kubernetes. PVC usage statistics show persistent storage usage in Kubernetes. Cassandra, Kafka and other apps use persistent storage for their operations. Monitoring PVC usage makes it possible to alert on high PVC utilization. The figure below shows the PVC contribution of the different agent pools, along with the daily average % of PVC used by each node. An alert is raised when PVC usage exceeds a configured threshold. kubelet_volume_stats_available_bytes and kubelet_volume_stats_used_bytes are the two important metrics for the transformation and further computation.
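A possible computation behind the PVC usage percentage in 16.5 is shown below, using only the two metrics named in the text. Treating used + available as the volume capacity is an approximation assumed here, and the function names and threshold parameter are hypothetical.

```python
def pvc_used_pct(used_bytes, available_bytes):
    """Return the used share of a volume in percent.

    Assumption: used + available approximates the capacity of the volume.
    """
    total = used_bytes + available_bytes
    if total == 0:
        return 0.0
    return 100.0 * used_bytes / total


def pvcs_over_threshold(stats_by_pvc, threshold_pct):
    """Return the names of PVCs whose usage exceeds threshold_pct, sorted.

    stats_by_pvc maps a PVC name to a (used_bytes, available_bytes) pair.
    """
    return sorted(
        name
        for name, (used, available) in stats_by_pvc.items()
        if pvc_used_pct(used, available) > threshold_pct
    )
```

The threshold is left as a required argument because the text only says that it is configured, without giving a value.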
  • 37. 17.0 Monitoring Aggregate and Alert Summary Report
    17.1 Application and data streaming service stability report
    Weekly stability trends of the streaming microservices are captured. The hours in which data services are missing Kafka offsets are aggregated per Kafka topic. This weekly report shows the stability trend of the application services over the last 7 days.
  • 38. 17.2 Kafka offset Aggregate Hist
    The Kafka topic offset aggregate history reports on the stability of the microservices that produce Kafka traffic, on data quality and on partition utilization over time. An even distribution across time scales indicates that the producer writes data to the Kafka topic evenly over time. The pattern may be skewed when data is not processed on time and needs to be replayed. The rollout of new resources (devices) is another possible cause. A uniform data pattern across time scales indicates stable producer microservices.
  • 39. 17.3 IoT Hub, Event Hub offset Aggregate Hist
    The IoT Hub and Event Hub offset aggregate history reports on IoT Hub and Event Hub data processing quality, partition utilization and job stability over time. Missing days, hours or minutes indicate IoT Hub or Event Hub outages, or that the stream producers (devices) were unable to send traffic to IoT Hub or Event Hub.
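Detecting missing hours, as used in the stability report (17.1) and the offset aggregate histories (17.2, 17.3), can be sketched as a gap search over hourly buckets. The function name and the input, a list of timestamps at which offsets were recorded, are assumptions.

```python
from datetime import datetime, timedelta


def missing_hours(seen_timestamps, start, end):
    """Return every hour in [start, end) that has no offset record."""
    seen = {
        ts.replace(minute=0, second=0, microsecond=0) for ts in seen_timestamps
    }
    missing = []
    current = start.replace(minute=0, second=0, microsecond=0)
    while current < end:
        if current not in seen:
            missing.append(current)
        current += timedelta(hours=1)
    return missing
```

The same search works at day or minute granularity by changing the truncation and the step size.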
  • 40. 17.4 Cassandra Client Request Outage Aggregate Hist
    The Cassandra client request outage aggregate history reports on microservice stability when accessing the Cassandra database. The outage metric distribution shows the contribution of each Cassandra outage type (read, write, scan, coordinator). The temporal drilldown (day, hour) gives the pattern detail and helps app owners troubleshoot queries and microservices.
  • 41. 17.5 Cassandra Table Latency Aggregate Hist
    The Cassandra table latency aggregate history helps with app performance tuning. Cassandra objects may need maintenance, or a data retention and cleanup policy may need to be enforced for high-latency objects. Using partition and clustering keys helps reduce latency. High rangelatency, coordinatorscanlatency and coordinatorreadlatency can be avoided by using appropriate primary keys in queries. The temporal drilldown by latency type, keyspace and table helps developers pinpoint the root cause of a problem and supports performance tuning and decision making.
  • 42. 17.6 Kubernetes nodes resource (CPU, Memory) usage contributions to Agent pool Hist
    Kubernetes capacity planning is easier when node utilization in the cluster is visible over a period of time. An even CPU and memory distribution across the nodes of an agent pool is desired for optimum resource utilization. This temporal aggregate view shows the resource contribution of each node to its agent pool. In the figure below, contribution values close to 1 show an optimum CPU or memory distribution, whereas a large deviation from 1 in either direction shows scope for resource optimization. A value of 2 indicates abnormally high resource usage, and a value close to zero indicates a node that contributes very little to the agent pool. The figure also suggests nodes for load balancing in the Kubernetes cluster.
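The text does not define how the contribution value in 17.6 is computed. One reading that matches the described scale (1 is balanced, 2 is abnormally high, 0 is idle) is each node's usage divided by the mean usage of its agent pool. The sketch below illustrates that assumed definition only; the function name is hypothetical.

```python
def contribution_ratios(usage_by_node):
    """Return each node's usage relative to the agent pool's mean usage.

    Assumption: contribution = node usage / mean usage across the pool, so a
    perfectly balanced pool gives 1.0 for every node.
    """
    if not usage_by_node:
        return {}
    mean = sum(usage_by_node.values()) / len(usage_by_node)
    if mean == 0:
        return {node: 0.0 for node in usage_by_node}
    return {node: value / mean for node, value in usage_by_node.items()}
```

Under this definition, a three-node pool with usages of 40, 20 and 0 gives ratios of 2.0, 1.0 and 0.0.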
  • 43. 17.7 Last week’s Kubernetes abnormal resource (CPU or Memory) usage summary
    Last week’s abnormal CPU or memory usage alerts are summarized here. An alert is raised when the core or memory bucket exceeds 75%. This summary report indicates resource usage risk for the coming weeks and helps in the planning process.
  • 44. 17.8 Batch and Streaming Job Audit Log Statistics
    Job stability is captured in these statistics. The batch job audit log history shows the success history of batch jobs, and the streaming job audit log shows the restart history of streaming jobs.
  • 45. Appendix
    Prometheus web site: https://prometheus.io/
    Prometheus Exporters and Integrations: https://prometheus.io/docs/instrumenting/exporters/
    Node Exporter: https://github.com/prometheus/node_exporter
    Cassandra Exporter: https://github.com/criteo/cassandra_exporter