Platform Monitoring and Alert

7/29/2020
Completeness and Consistency: Application and Data Services
Stability: IoT Hub, Kafka, Cassandra Monitoring
Resources: Kubernetes Resource Monitoring
Braja Das
BDAS@STARBUCKS.COM, BKD_108@YAHOO.COM
Release Version: Summer 2019

Table of Contents
1.0 OBJECTIVE:
2.0 PRODUCT SCOPE
3.0 MONITORING ARCHITECTURE AT SCALE
4.0 COMPONENTS OF MONITORING ARCHITECTURE
4.1 Kubernetes Application and Data plane:
4.2 Kubernetes Control and Monitor plane
4.3 Azure Service Bus
4.4 Cloud Data Store
4.5 Power BI Visual Presentation
4.6 MS Flow
4.7 Notification Channel
5.0 MONITORING AGENT AND EXCEPTION HANDLING
5.1 Exception Alerting
6.0 MONITORING AGENT MICROSERVICES
7.0 KUBERNETES APPLICATION AND DATA SERVICES MONITORING
8.0 KAFKA OFFSET DELAY ALERT:
9.0 VISUAL PRESENTATION OF KAFKA OFFSET DELAY
9.1 Kafka offset delay card
9.2 Real Time Kafka offset statistics
10.0 VISUAL PRESENTATION OF IOT HUB, EVENT HUB OFFSET DELAY
10.1 Event Hub offset delay card
10.2 Real Time Event Hub offset statistics
11.0 CASSANDRA METRICS MONITORING
12.0 AI-OPS ALERT: CASSANDRA CLIENT REQUEST OUTAGE ALERT
13.0 VISUAL PRESENTATION OF CASSANDRA METRICS MONITORING
13.1 Latest Cassandra Client Request Outage
13.2 Latest Cassandra Table Latency
14.0 KUBERNETES RESOURCE MONITORING AND ALERT
14.1 Kubernetes cAdvisor
14.2 Prometheus time series data model
14.3 Prometheus Exporters and Integrations
15.0 AI OPS ALERT: ABNORMAL RESOURCE (CPU OR MEMORY) USAGE IN LAST HOUR
15.1 Issues found:
15.2 Root Cause Analysis:
15.3 Recommendation:
16.0 VISUAL PRESENTATION OF KUBERNETES RESOURCE USAGE AND MONITORING
16.1 Latest CPU and Memory Usage Statistics
16.2 Kubernetes high CPU or Memory Usage Alerts from alert card
16.3 CPU and Memory Load Distribution Profile
16.4 Container CPU and Memory Request VS. Limit
16.5 Persistent Volume Claim Usage Statistics (PVC)
17.0 MONITORING AGGREGATE AND ALERT SUMMARY REPORT
17.1 Application and data streaming service stability report
17.2 Kafka offset Aggregate Hist
17.3 IoT Hub, Event Hub offset Aggregate Hist
17.4 Cassandra Client Request Outage Aggregate Hist
17.5 Cassandra Table Latency Aggregate Hist
17.6 Kubernetes nodes resource (CPU, Memory) usage contributions to Agent pool Hist
17.7 Last week’s Kubernetes abnormal resource (CPU or Memory) usage summary
17.8 Batch and Streaming Job Audit Log Statistics
APPENDIX
Prometheus web site:

1.0 Objective:
Applications and data services are critical to business success. In a microservices-oriented architecture,
each microservice or application is accountable for a distinct task. It is therefore critical to bring the
control plane to the forefront, with 360-degree monitoring of infrastructure, microservices and apps. Such
monitoring confirms the completeness, consistency and stability of application and data services and
supports business continuity with a high degree of trust.
The key objective of this monitoring is to continuously verify the completeness, consistency and stability
of the system, application and data services. Because application and data services access different control
points, monitoring these control points directly benefits not only system stability but also the consistency
of application services. Kafka offset and Event Hub offset delay monitoring helps identify inconsistency in
application services by detecting when a service (producer) stops producing data at a control point.
Reconciliation among different control points helps identify data loss, while spatio-temporal data
aggregation and statistical modeling at a control point helps identify data usage and its variation over
time. Abnormal pattern recognition flags data incompleteness and inconsistency.
This product handbook highlights the system stability monitoring and alerting features: Kubernetes
infrastructure and resource usage monitoring, Kubernetes job monitoring using Kafka offset and Event Hub
offset delay, and Cassandra health monitoring by object latency and by client request failures, timeouts
and unavailability, all within the Kubernetes container orchestration framework.

2.0 Product scope
The release-1 product scope of the monitoring solution includes, but is not limited to, the following topics.
a. Define and design the monitoring and alerting architecture.
b. Identify infrastructure monitoring control points.
c. Define the scope of Kubernetes resource and app monitoring.
d. Define the scope of Kafka and Event Hub app monitoring.
e. Define monitoring and alerting business rules.
f. Define alerts for L1, L2 and L3 support.
g. Develop a configurable monitoring and alerting framework.
h. Develop configurable monitoring microservices for data services.
i. Store and aggregate monitoring metrics for future model generation.
j. Develop visual presentations of monitoring metric trends, summaries and aggregates.
k. Send alert push notifications via Slack, SendGrid (Email), PagerDuty, Remedy and SMS.

3.0 Monitoring Architecture at Scale

4.0 Components of Monitoring Architecture
4.1 Kubernetes Application and Data plane:
Application and data microservices perform distinct tasks for business processes and run inside
Kubernetes. It is important that these services run and deliver output as expected, so in practice
monitoring these microservices is critical for business continuity and success.
Helm charts are used to install the different app clusters inside Kubernetes. Open source Apache
Cassandra as the data store and open source Apache Kafka as the messaging bus are two important
clusters that are used for processing both streaming and batch jobs.
4.2 Kubernetes Control and Monitor plane
The control plane includes different agents that work as watchdogs and ensure the health of the overall
system. The agents that work in the Kubernetes control plane are:
a. CCS agent: CCS agents or microservices perform Completeness, Consistency and Stability checks
of application and data services. These services can run in batch or stream mode.
b. Prometheus agent: Collects Kubernetes metrics from the Kubernetes metrics server and stores them in
the Prometheus server (Influx-DB time series DB). Prometheus node exporters, Kafka exporters and
Cassandra exporters stream Prometheus metrics from the Prometheus server to Kafka.
c. Monitor agent: Monitor agent microservices monitor Kubernetes, Kafka, Cassandra and Azure
resources (IoT Hub, Event Hub, Service Bus, Blob Store). These services also compute monitoring
metrics and statistics for modeling and machine learning.
d. Notification agent: These agent microservices collect actionable metric statistics and send them to
the different notification channels. The notification agent runs in Kubernetes. Notifications can be
time triggered or event triggered. Notification channels include Slack, SendGrid (Email), PagerDuty,
Remedy and SMS.
4.3 Azure Service Bus
The Azure Service Bus Event Hub is used to send real-time monitoring statistics for display in Power BI
visual presentations. The Event Hub is also used for real-time alerting.
4.4 Cloud Data Store
Monitoring statistics are stored in a cloud data store. Azure SQL DB and Azure Blob Store are used as data
stores, accessed through the Microsoft-provided Spark SQL DB connector and the Apache hadoop-azure
connector. Monitoring statistics are read from Kafka and written to the Azure data store.
"com.microsoft.azure" % "azure-sqldb-spark" % "1.0.2",
"org.apache.hadoop" % "hadoop-azure" % "2.7.3"
4.5 Power BI Visual Presentation
Power BI is used for visual presentation of monitoring statistics and alerts. These monitoring statistics
include, but are not limited to, trend graphs, frequency distributions, probabilities of failure and set
analysis among different metrics.
4.6 MS Flow
MS Flow offers different connectors, including one for Power BI. When a dataset refreshes or a new dataset
appears in Power BI, the MS Flow connectors pick it up and send alerts according to the configured
workflow. SendGrid email, PagerDuty and Slack are the connectors currently used as part of the alerting
and notification framework.
4.7 Notification Channel
Slack, PagerDuty, Email, SMS and Remedy are used as notification channels. Warning, Low and High
severity alerts are posted to a Slack channel.
Integration is also in place along the chain email → PagerDuty → Remedy → SMS.
5.0 Monitoring Agent and Exception Handling
Monitoring microservices monitor application and data services using CCS agents and Monitor agents.
Exceptions are caught in real time and the application owner is alerted by email. Depending on the
assessed severity, alerts are also sent to Slack and PagerDuty. Exceptions are captured in Power BI in
real time as well: the CCS and monitoring agents send the exception string to the Event Hub, and Azure
Stream Analytics streams the data from the Event Hub so that downstream processes can pick it up in
real time.
Figure: SIOT Platform Monitoring (7/25/19).
5.1 Exception Alerting
Application, data and monitoring services all have exception alerting in place. This helps in understanding
the severity of an exception and the time at which it happened. The same alert arriving repeatedly within a
short time indicates an issue in the application, data or monitoring services that needs immediate
attention. An example of exception alerting is shown below.
6.0 Monitoring Agent Microservices
Monitoring agent microservices are mostly written in Scala using Spark Streaming and Akka Streams.
Time-triggered alerts and monitoring aggregates run as Kubernetes cron jobs. The monitoring agents
mostly use the following libraries.
"org.apache.kafka" %% "kafka" % kafkaVersion,
"org.apache.spark" %% "spark-streaming" % sparkVersion,
"org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion,
"com.microsoft.azure" % "azure-eventhubs" % "0.7.5",
"com.microsoft.azure" % "azure-sqldb-spark" % "1.0.2",
"org.apache.hadoop" % "hadoop-azure" % "2.7.3",
"org.apache.commons" % "commons-email" % "1.4",
"com.typesafe" % "config" % "1.2.1"
The Microsoft-provided Spark SQL DB connector helps store data in SQL DB asynchronously. Transformation
routines have also been written inside SQL DB. The monitoring agent calls these transformation routines
to transform the data into the required format.
The org.apache.commons commons-email library is used for most email alerts. SendGrid is used as the
email carrier, and a SendGrid API key is used for secure communication.
A Slack app token is used for Slack communication. Messages are published to a Slack channel with REST
calls to the Slack chat.postMessage method: "https://slack.com/api/chat.postMessage"
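The Slack call described above can be sketched as follows. This is a minimal illustration in Python rather than the agents' own Scala code; the function names, the token value and the channel name are assumptions made for the example. Only the endpoint URL comes from this handbook. The request is built separately from the network call so that it can be inspected without contacting Slack.

```python
import json
import urllib.request

SLACK_POST_MESSAGE_URL = "https://slack.com/api/chat.postMessage"


def build_slack_request(token: str, channel: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a chat.postMessage request.

    `token` is the Slack app token; `channel` and `text` are the
    target channel and the alert message (hypothetical values in tests).
    """
    payload = json.dumps({"channel": channel, "text": text}).encode("utf-8")
    return urllib.request.Request(
        SLACK_POST_MESSAGE_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


def post_alert(token: str, channel: str, text: str) -> dict:
    """Send the alert and return Slack's decoded JSON reply."""
    request = build_slack_request(token, channel, text)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

A caller would typically check the `ok` field of the returned dictionary and fall back to another notification channel when it is false.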

Alert EPML Configuration: The monitoring framework is fully configurable. Most monitoring job
configuration is based on Typesafe Config files. In addition to the Typesafe configuration files, an
epmlconfig (Event Processing Markup Language) JSON file is used. Microservices read this epmlconfig file
to obtain the alerting conditions and alerting rules. An example epmlconfig record is shown below.
{"type":"alert",
 "subject":"IoT Kubernetes CPU Alert: Last Hour CPU Usage exceeds 75%",
 "fields":["eventdate", "eventhour", "node", "core_bucket", "avg_coreused", "max_coreused", "confidence_level"],
 "fieldsheader":"Eventdate, Eventhour, Node, Core Bucket, Avg. core usage, Max. core usage, Confidence level",
 "filter":"max_coreused <> '0'"}
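How a microservice might apply such a record can be sketched as below. This is an illustration only: the handbook does not describe the real epmlconfig reader, so the function names and the very small filter grammar (a single `<field> <> '<value>'` comparison) are assumptions. The record itself is the example above.

```python
import json

# The example epmlconfig record from this section.
EPML_RECORD = json.loads("""
{"type": "alert",
 "subject": "IoT Kubernetes CPU Alert: Last Hour CPU Usage exceeds 75%",
 "fields": ["eventdate", "eventhour", "node", "core_bucket", "avg_coreused",
            "max_coreused", "confidence_level"],
 "fieldsheader": "Eventdate, Eventhour, Node, Core Bucket, Avg. core usage, Max. core usage, Confidence level",
 "filter": "max_coreused <> '0'"}
""")


def parse_not_equal_filter(expression):
    """Parse the single supported form "<field> <> '<value>'" (assumed grammar)."""
    field, separator, literal = expression.partition("<>")
    if not separator:
        raise ValueError(f"unsupported filter: {expression!r}")
    return field.strip(), literal.strip().strip("'")


def render_alert(record, rows):
    """Return the alert text: subject, header line, then one line per kept row."""
    field, excluded = parse_not_equal_filter(record["filter"])
    kept = [row for row in rows if str(row[field]) != excluded]
    lines = [record["subject"], record["fieldsheader"]]
    for row in kept:
        # Project each row onto the configured field order.
        lines.append(", ".join(str(row[name]) for name in record["fields"]))
    return "\n".join(lines)
```

Keeping the subject, column order and filter in the record means a new alert can be added by editing configuration rather than code, which matches the "fully configurable" goal stated above.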

7.0 Kubernetes Application and Data Services Monitoring
Different apps and clusters are installed in Kubernetes. Monitoring of Kubernetes microservices follows
these principles.
a. Exception alerting from application, data and monitoring microservices.
b. Microservice monitoring using control point delay.
I. Kafka (control point) topic data arrival (offset change) delay.
II. Event Hub data arrival (offset change) delay.
c. Cassandra client request outage (timeouts, failures, unavailable) and latency monitoring.
Control point (CP) monitoring in the control plane directly benefits the services that access those control
points. Monitoring offset delay at control points such as Kafka and Event Hub helps detect stoppage of
application and data services, and the appropriate delay alerts are triggered.
8.0 Kafka offset delay alert:
Kafka delay alert notifications are configurable. Currently supported notification methods are email
(SendGrid), Slack, PagerDuty, SMS and Remedy (ITSM). Notifications are generated from Kubernetes cron
jobs and Power BI alert cards. A Power BI alert card triggers a notification to MS Flow for the connectors
to pick up. When the Kafka offset delay exceeds a configured threshold (in minutes), an alert is sent to the
appropriate notification channels.
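The threshold rule can be sketched as follows. This is a minimal illustration that assumes the agent already knows, for each topic, the time at which the latest offset last changed; the function names, topic names and the 30-minute threshold in the usage note are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def offset_delay_minutes(last_offset_change: datetime, now: datetime) -> float:
    """Minutes elapsed since the topic's latest offset last advanced."""
    return (now - last_offset_change).total_seconds() / 60.0


def delayed_topics(last_changes: dict, threshold_minutes: float, now: datetime) -> dict:
    """Return {topic: delay_in_minutes} for topics whose delay exceeds the threshold."""
    alerts = {}
    for topic, changed_at in last_changes.items():
        delay = offset_delay_minutes(changed_at, now)
        if delay > threshold_minutes:
            alerts[topic] = round(delay, 1)
    return alerts
```

For example, with a 30-minute threshold a topic whose offset last moved 45 minutes ago is reported, while one that moved 5 minutes ago is not. The same check applies unchanged to Event Hub offsets.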

9.0 Visual Presentation of Kafka Offset Delay
9.1 Kafka offset delay card
A data arrival delay card is configured for each Kafka topic. When the data is refreshed, an alert is
triggered only if the card value exceeds a configured threshold. The latest offset processing time is also
displayed and updates in real time. Operations support can also use this latest offset processing
timestamp when acting on Kafka delay alerts.
9.2 Real Time Kafka offset statistics
Kafka offset statistics are captured in real time. These statistics not only give a historical snapshot but
also give operations support a visualization tool to follow in real time. When offset processing stops, or
application and data services stop producing traffic to a Kafka topic, operations support can look at these
real-time statistics together with the latest offset processing timestamp and confirm the problem.
10.0 Visual Presentation of IoT Hub, Event Hub Offset Delay
10.1 Event Hub offset delay card
A data arrival delay card is configured for the Event Hub topic exposed by the IoT Hub’s built-in endpoint.
When the data is refreshed, an alert is triggered if the Event Hub data arrival delay exceeds a configured
threshold. The latest offset processing time is also displayed and updates in real time. Operations support
can also use this latest offset processing timestamp when acting on Event Hub data arrival delay alerts.
10.2 Real Time Event Hub offset statistics
Event Hub offset statistics are captured in real time. These statistics not only give a historical snapshot
but also give operations support a visualization tool to follow. When the Event Hub has an outage, offset
processing stops, or application and data services stop producing traffic to the Event Hub, operations
support can look at these real-time statistics together with the latest processing timestamp and confirm
the problem.
11.0 Cassandra Metrics Monitoring
When application and data services request Cassandra client connections, it is important to track the
connection status. A client request failure is triggered when an app or query connection request is unable
to access Cassandra. There is also the scenario in which Cassandra is unavailable because it cannot accept
excess client connection requests. Even after a successful Cassandra connection, application reads or
writes sometimes have very large latency, which can cause timeouts. The Cassandra client request failure,
timeout and unavailable metrics are therefore important indicators of the stability of application and data
services while they access Cassandra instances. The metrics listed below are important for tracking
Cassandra client request outage and client request latency. It is also important to capture a few other
Cassandra performance metrics, such as keyspace and table latency metrics and the keycachehitrate
metric. LiveDiskSpaceUsed and LiveDiskSpaceAvailable are two important metrics for Cassandra PVC disk
space monitoring.
The Kubernetes Cassandra exporter exports Cassandra metrics from the Prometheus server to a Kafka topic.
https://github.com/criteo/cassandra_exporter
a. Disk space metrics: label values contain totaldiskspaceused:value, livediskspaceused:value.
b. Client request metrics: label values contain clientrequest: oneminuterate.
c. Table and keyspace latency metrics: label values contain rangelatency:max, writelatency:max,
coordinatorscanlatency:max, coordinatorreadlatency:max, readlatency:max.
d. Key cache hit rate metrics: label values contain keycachehitrate:value.
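The grouping in the list above can be sketched as a small classifier over the metric label value. This is an illustration only: the function name and group names are assumptions, and the full label strings used in the usage note are hypothetical examples, not values taken from the platform. Only the substrings being matched come from the list.

```python
DISK_SUFFIXES = ("totaldiskspaceused:value", "livediskspaceused:value")
LATENCY_SUFFIXES = (
    "rangelatency:max",
    "writelatency:max",
    "coordinatorscanlatency:max",
    "coordinatorreadlatency:max",
    "readlatency:max",
)


def classify_cassandra_metric(label_value: str) -> str:
    """Map an exported metric label value to one of the monitoring groups."""
    name = label_value.lower()
    if name.endswith(DISK_SUFFIXES):
        return "disk_space"
    if "clientrequest" in name and name.endswith("oneminuterate"):
        return "client_request"
    if name.endswith(LATENCY_SUFFIXES):
        return "latency"
    if name.endswith("keycachehitrate:value"):
        return "key_cache"
    return "other"
```

A label value such as `org:apache:cassandra:metrics:clientrequest:read:timeouts:oneminuterate` (hypothetical) would fall into the client request group and feed the outage alert, while anything unrecognized is kept as `other` rather than dropped.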

12.0 AI-Ops Alert: Cassandra Client Request Outage alert
Cassandra client request outage metrics indicate that apps or queries were unable to establish a
connection, failed a connection request, or lost the connection while accessing Cassandra objects or
metadata, and that the situation needs immediate attention.
The Cassandra client request outage alert includes the following.
12.1 Issues Found: frequency of client request failures, timeouts and unavailable errors in the last hour.
Attributes include:
i. Eventtime: timestamp at which the Cassandra client request outage was triggered in the last hour.
ii. Metricgroup: timeouts, failure or unavailable.
iii. Pods: Cassandra instance.
12.2 Root Cause Analysis: top 10 table latencies in the last hour. Attributes include:
i. Latency type: coordinator read latency, read latency, coordinator scan latency, write latency,
coordinator write latency.
ii. Keyspace: keyspace name.
iii. Table name: table name.
iv. Pods: Cassandra instances.
v. Last hour’s total latency (seconds): total latency in seconds in the last hour.
vi. Last hour’s latency frequency: number of times latency above 400 ms was triggered in the last hour.
vii. Rank of latency duration: ranking by latency duration.
The figure below is a snapshot of the Cassandra client request outage AI-Ops alert.
Figure: Cassandra Client Request Outage Alert: Last hour Failure, Timeout, Unavailable
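The root cause ranking in 12.2 can be sketched as follows. This is a minimal illustration under stated assumptions: each sample is a dictionary with hypothetical keys (`keyspace`, `table`, `latency_type`, `pod`, `latency_ms`), total latency is summed over all samples in the hour, and only samples above 400 ms count toward the frequency. The handbook does not say how the production job defines these aggregates.

```python
from collections import defaultdict

LATENCY_THRESHOLD_MS = 400.0  # threshold quoted in attribute vi of 12.2


def rank_table_latency(samples, top_n=10):
    """Rank (keyspace, table, latency type, pod) groups by total latency."""
    total_ms = defaultdict(float)
    slow_count = defaultdict(int)
    for sample in samples:
        key = (sample["keyspace"], sample["table"],
               sample["latency_type"], sample["pod"])
        total_ms[key] += sample["latency_ms"]
        if sample["latency_ms"] > LATENCY_THRESHOLD_MS:
            slow_count[key] += 1
    # Highest total latency first; keep only the top_n groups.
    ranked = sorted(total_ms, key=lambda key: total_ms[key], reverse=True)[:top_n]
    return [
        {
            "rank": position,
            "keyspace": key[0],
            "table": key[1],
            "latency_type": key[2],
            "pod": key[3],
            "total_latency_s": total_ms[key] / 1000.0,
            "latency_frequency": slow_count[key],
        }
        for position, key in enumerate(ranked, start=1)
    ]
```

Each returned row carries the seven attributes listed in 12.2, so it can be written directly into the alert body.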

13.0 Visual Presentation of Cassandra Metrics Monitoring
13.1 Latest Cassandra Client Request Outage
Cassandra client request outage covers client request timeouts, failures and unavailable errors. When an
alert is raised, operations support can look at temporal aggregate frequency views (by hour and by minute)
of the Cassandra outage. This trend graph confirms the severity of the problem in recent hours.
13.2 Latest Cassandra Table Latency
High latency (read, write, coordinator read, coordinator write, range scan, coordinator scan) on Cassandra
objects is the primary cause of Cassandra client request outage. The first graph displays the latest latency
trend by keyspace. The second graph shows latency by latency type and by table. The third and fourth
graphs give hourly and minute-level drill-downs of table latency. Operations support uses this view to
confirm problem severity.
14.0 Kubernetes Resource Monitoring and Alert
CPU, memory and volume usage are important in infrastructure monitoring. Proactive failure alerts help in
taking action in time. Kubernetes resource monitoring and notification includes the following steps.
a. Capture Kubernetes resource utilization metrics from Kubernetes cAdvisor and transfer them to the
Prometheus time series server (Influx-DB). The Prometheus agent does this job.
b. Export node, Kafka and Cassandra metrics from the time series database to Kafka in near real time.
The Prometheus agent does this through exporter jobs.
c. Apply business transformation rules to find resource usage. The following KPIs, exposed as
Prometheus metrics by cAdvisor, the kubelet and kube-state-metrics, are used in this case.
i. Volume Metrics: kubelet_volume_stats_available_bytes, kubelet_volume_stats_used_bytes
ii. Container Resource Requests and Limits: kube_pod_container_resource_limits,
kube_pod_container_resource_requests, kube_node_status_allocatable
iii. Container Resource Usage Metrics: container_memory_working_set_bytes,
container_cpu_usage_seconds_total, container_fs_writes_bytes_total,
container_fs_reads_bytes_total, container_fs_io_time_seconds_total
d. Build aggregates for trend graphs, modeling and machine learning.
e. Apply set analysis for AIOps alerts.
f. Visualize resource utilization KPIs in Power BI.
g. Develop AIOps alerts for CPU, memory and volume usage.
The AIOps alert includes the following topics.
High resource utilization: High resource (CPU, memory, volume) utilization in nodes, the agent pool and
the cluster.
Root Cause Analysis: Top apps by resource usage in nodes, the agent pool and across clusters.
Recommendation: App load balancing recommendation from models or set analysis. The least used nodes
are candidates for app load balancing.
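The transformation from raw counters to a usage percentage can be sketched as follows. container_cpu_usage_seconds_total is a cumulative counter of CPU seconds, so the cores in use over an interval are the counter increase divided by the elapsed time. This is a minimal sketch; the function names and sample values are assumptions, and the allocatable core count would come from kube_node_status_allocatable.

```python
def cpu_cores_used(first, second):
    """Average cores used between two (timestamp_seconds, cpu_seconds_total) samples."""
    (t0, v0), (t1, v1) = first, second
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    if v1 < v0:
        # A cumulative counter only decreases when the container restarts.
        raise ValueError("counter reset between samples")
    return (v1 - v0) / (t1 - t0)


def usage_percent(used: float, allocatable: float) -> float:
    """Usage as a percentage of the node's allocatable capacity."""
    if allocatable <= 0:
        raise ValueError("allocatable capacity must be positive")
    return 100.0 * used / allocatable
```

For example, a counter that grows from 100 to 280 CPU seconds over 60 seconds corresponds to 3 cores in use, which is 75% of a node with 4 allocatable cores. Memory needs no rate step, because container_memory_working_set_bytes is a gauge.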
14.1 Kubernetes cAdvisor
cAdvisor is an open source container resource usage and performance analysis agent. It is purpose-
built for containers and supports Docker containers natively. In Kubernetes, cAdvisor is integrated into
the Kubelet binary. cAdvisor auto-discovers all containers in the machine and collects CPU, memory,
filesystem, and network usage statistics. cAdvisor also provides the overall machine usage by analyzing
the ‘root’ container on the machine.
14.2 Prometheus time series data model
Prometheus is an open-source systems monitoring and alerting toolkit. It is now a standalone open source
project and maintained independently of any company. To emphasize this, and to clarify the project's
governance structure, Prometheus joined the Cloud Native Computing Foundation in 2016 as the second
hosted project, after Kubernetes.

Prometheus's main features are:
• a multi-dimensional data model with time series data identified by metric name and key/value pairs
• time series collection happens via a pull model over HTTP
• pushing time series is supported via an intermediary gateway
Prometheus scrapes metrics from instrumented jobs, either directly or via an intermediary push gateway
for short-lived jobs. It stores all scraped samples locally and runs rules over this data to either aggregate
and record new time series from existing data or generate alerts.
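The data model described above (a metric name plus key/value label pairs identifying each time series) can be illustrated by parsing one sample line in the Prometheus text exposition style. This is a simplified sketch: the function name is an assumption, and it does not handle escaped quotes inside label values or a trailing timestamp.

```python
import re

SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'   # metric name
    r'(?:\{(?P<labels>.*)\})?'               # optional {key="value",...}
    r'\s+(?P<value>\S+)$'                    # sample value
)
LABEL_RE = re.compile(r'(?P<key>[a-zA-Z_][a-zA-Z0-9_]*)="(?P<value>[^"]*)"')


def parse_sample(line: str):
    """Split one sample line into (metric name, labels dict, float value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = {
        found.group("key"): found.group("value")
        for found in LABEL_RE.finditer(match.group("labels") or "")
    }
    return match.group("name"), labels, float(match.group("value"))
```

Two samples belong to the same time series only when both the metric name and every label pair match, which is why the node and pod labels are enough to separate per-container usage.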
14.3 Prometheus Exporters and Integrations
A number of libraries and servers help export existing metrics from third-party systems as Prometheus
metrics. This is useful for cases where it is not feasible to instrument a given system with Prometheus
metrics directly. Kafka exporters export Prometheus metrics from the Prometheus server to Kafka. Node
exporters and Cassandra exporters are used for exporting node and Cassandra metrics to Kafka for further
transformation and analysis.
The GitHub links for these exporters are:
https://github.com/prometheus/node_exporter
https://github.com/danielqsj/kafka_exporter
https://github.com/criteo/cassandra_exporter

15.0 AI Ops Alert: Abnormal resource (CPU or Memory) usage in last hour
Application and data services need resources (CPU and memory) from the Kubernetes cluster. A high
resource usage alert helps avoid node restarts and keeps apps running without issues. When a node
restarts, its apps also restart and have to go through the service discovery process and find resources to
run on other nodes or new nodes.
There are three parts to this AI-Ops alert.
15.1 Issues found: Nodes whose CPU or memory usage exceeds the threshold (75%). Attributes used are:
I. Node: nodes with CPU or memory usage > 75%.
II. Core Bucket: attribute values are 75%-90%, 90%-100% or 100%.
III. Avg. core or memory usage: last hour’s average core or memory usage.
IV. Max. core or memory usage: last hour’s maximum core or memory usage.
V. Confidence level: confidence level is P(n|N), where
n = event count of memory or CPU usage in the given time scale from a resource bucket,
N = total events triggered in the same time frame.
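The bucket and confidence level attributes above can be sketched as follows. This sketch reads P(n|N) as the ratio n/N and takes N to be all usage samples for the node in the time frame; both are assumptions, since the handbook does not spell out the formula. The function names are hypothetical.

```python
from collections import Counter


def core_bucket(usage_percent: float):
    """Return the bucket label for a usage sample, or None below 75%."""
    if usage_percent >= 100.0:
        return "100%"
    if usage_percent >= 90.0:
        return "90%-100%"
    if usage_percent >= 75.0:
        return "75%-90%"
    return None


def bucket_confidence(usage_percents):
    """Confidence per bucket, assumed to be n / N over the given samples."""
    total = len(usage_percents)
    counts = Counter(
        bucket for bucket in map(core_bucket, usage_percents) if bucket is not None
    )
    return {bucket: count / total for bucket, count in counts.items()}
```

With four samples in the hour, two of them in the 75%-90% bucket, that bucket gets a confidence level of 0.5. A bucket that appears in only one isolated sample scores low, which helps separate a brief spike from sustained pressure.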
15.2 Root Cause Analysis: Top 3 apps with high CPU or memory utilization on the alerting nodes.
Attributes used are:
i. Node: name of the alerting node.
ii. Container: app container name.
iii. Pod: app pod name.
iv. Container rank: the container’s rank by core or memory usage.
v. % resource used (core or memory).
15.3 Recommendation: Apps load balancing recommendation.
Lowest core or memory used nodes are strong candidates of app load rebalancing. This AI-Ops alert finds
top 5 least used core or memory nodes with less than 20% of core or memory usage in last hour for app
load rebalancing. Operation support need to restart job and relabel to run app services in these least used
nodes. Attributes used here are –
i. Node: recommender nodes.
ii. Avg core or memory used: last hour’s avg. core or memory used nodes.
iii. Max. core or memory used: last hour’s max. core or memory used nodes.
iv. Least core or memory used rank: nodes ranking based on least core or memory usage.
A snapshot of the AI-Ops alert is given below.

Figure: AI-Ops Alert - Kubernetes abnormal resource usage in last hour.

16.0 Visual Presentation of Kubernetes Resource Usage and Monitoring
Kubernetes resource usage presentation includes, but is not limited to, the following visuals.
16.1 Latest CPU and Memory Usage Statistics
Cluster CPU and memory utilization statistics for the last n hours. This shows the CPU and memory usage
trends of the Kubernetes nodes. The latest memory and CPU usage statistics help visualize problem severity
and give operations support a tool for taking the proper action. The metrics
container_memory_working_set_bytes, container_cpu_usage_seconds_total, and kube_node_status_allocatable
are used for this computation.
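One way these three metrics could be combined into utilization percentages is sketched below. container_cpu_usage_seconds_total is a cumulative counter of CPU seconds, so the difference between two samples divided by the elapsed time gives the average number of cores in use. The function names and the two-sample approach are assumptions, and counter resets are not handled.

```python
def cpu_utilization_pct(cpu_seconds_start, cpu_seconds_end, window_seconds, allocatable_cores):
    """Average CPU utilization over a window, as a percentage of allocatable cores."""
    cores_used = (cpu_seconds_end - cpu_seconds_start) / window_seconds
    return 100.0 * cores_used / allocatable_cores


def memory_utilization_pct(working_set_bytes, allocatable_bytes):
    """Working-set memory as a percentage of allocatable memory."""
    return 100.0 * working_set_bytes / allocatable_bytes
```

For example, a node whose containers consumed 3600 CPU seconds in one hour used one core on average, which is 25% of a node with 4 allocatable cores.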

16.2 Kubernetes high CPU or Memory Usage Alerts from alert card.
Configure a CPU and memory utilization card for each node. When a card value exceeds the configured
threshold (> 75%), an alert is raised and sent to the configured notification channels. Slack, Email,
PagerDuty, Remedy, and SMS can be used as notification channels.

16.3 CPU and Memory Load Distribution Profile
Apps undergo service discovery and failover when a Kubernetes node restarts. Tracking core (CPU) and
memory utilization helps avoid unwanted node and cluster restarts.
The Kubernetes CPU or memory load distribution shows each node’s share (%) of the core (CPU) or memory
used in the agent pool. An uneven or highly skewed CPU or memory contribution indicates an opportunity to
rebalance app CPU or memory load. Optimized cluster utilization helps reduce costs and allows better app
performance.
The app CPU or memory distribution on a node shows each app’s share (%) of the core or memory used on
that node. When a node has high CPU or memory utilization (i.e. > 75%), it is important to redistribute
app load to the least used nodes. The app load distribution profile of a node helps identify the most
used apps, ranked by core or memory usage.
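The two distribution profiles described above reduce to the same calculation: each member's share of a total. A minimal sketch, where the function name and the input mapping (node or app name to used cores or bytes) are assumptions:

```python
def load_share_pct(usage_by_name):
    """Return each member's share (%) of the total usage.

    Works for nodes within an agent pool or for apps within a node.
    """
    total = sum(usage_by_name.values())
    if total == 0:
        return {name: 0.0 for name in usage_by_name}
    return {name: 100.0 * used / total for name, used in usage_by_name.items()}
```

A pool where one node carries 6 of 12 used cores yields a 50% share for that node, which is the kind of skew the profile is meant to expose.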

Figure: Node1 distribution in agentpool1 and apps distribution in node 1.

16.4 Container CPU and Memory Request VS. Limit
App container resource request and resource limit settings control app resource usage in the cluster.
The figure below indicates that the kuberesources-reader-controller container requests 49.9% of the total
memory available for containers to run in the Kubernetes cluster. Also, 10% of the total CPU available is
requested to run the kafkaoffset-monitor-controller container. When a container’s memory usage exceeds
its resource limit, the container is terminated and restarted, which releases its resources in the
cluster; CPU usage above the limit is throttled instead.
kube_pod_container_resource_limits and kube_pod_container_resource_requests are the two important
metrics for this transformation and computation. For safety, all pods need resource requests and limits
set.
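The safety rule above (every pod should have requests and limits set) can be checked with a sketch like the following. It assumes the two metrics have already been collected into mappings from container name to value; the function name and input shapes are hypothetical.

```python
def missing_resource_settings(containers, requests, limits):
    """Return {container: [missing settings]} for containers lacking a request or a limit.

    `requests` and `limits` map container names to values taken from
    kube_pod_container_resource_requests and kube_pod_container_resource_limits.
    """
    missing = {}
    for name in containers:
        gaps = []
        if name not in requests:
            gaps.append("request")
        if name not in limits:
            gaps.append("limit")
        if gaps:
            missing[name] = gaps
    return missing
```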

16.5 Persistent Volume Claim Usage Statistics (PVC)
Persistent volumes can be configured in Kubernetes. PVC usage statistics show persistent store usage in
Kubernetes. Cassandra, Kafka, and other apps use persistent storage for their operations. Monitoring PVC
usage helps in alerting on high PVC utilization. The figure below shows the PVC contribution of the
different agent pools, and the daily avg. % of PVC used by each node. An alert is configured for when PVC
usage exceeds a certain threshold.
kubelet_volume_stats_available_bytes and kubelet_volume_stats_used_bytes are the two important
metrics for the transformation and further computation.
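A rough sketch of the PVC usage percentage and threshold check, using only the two metrics named above. Treating used + available as the volume's total size is an approximation (filesystems may reserve some space), and the threshold is left as a required parameter because the document does not state its value. The function names and the volume names in the example are hypothetical.

```python
def pvc_used_pct(used_bytes, available_bytes):
    """Approximate % of a volume in use, taking used + available as its size."""
    total = used_bytes + available_bytes
    if total == 0:
        return 0.0
    return 100.0 * used_bytes / total


def pvcs_over_threshold(volume_stats, threshold_pct):
    """Return the names of volumes whose usage exceeds threshold_pct.

    `volume_stats` maps a PVC name to a (used_bytes, available_bytes) pair.
    """
    return sorted(
        name
        for name, (used, available) in volume_stats.items()
        if pvc_used_pct(used, available) > threshold_pct
    )
```

For instance, `pvcs_over_threshold({"cassandra-data": (90, 10), "kafka-data": (40, 60)}, 80.0)` flags only the first volume.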

17.0 Monitoring Aggregate and Alert Summary Report
17.1 Application and data streaming service stability report
Weekly stability trends of the streaming microservices are captured. The hours with missing Kafka offsets
are aggregated for each Kafka topic. This weekly report shows application service stability trends for the
last 7 days.
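Finding the hours with missing Kafka offsets could look like the sketch below, applied per topic. It assumes the offset observations for a topic are available as a list of timestamps; the function name and the half-open window [start, end) are illustrative choices.

```python
from datetime import datetime, timedelta


def missing_hours(observed_times, start, end):
    """Return the hours in [start, end) that have no offset observation."""
    seen = {t.replace(minute=0, second=0, microsecond=0) for t in observed_times}
    missing = []
    current = start.replace(minute=0, second=0, microsecond=0)
    while current < end:
        if current not in seen:
            missing.append(current)
        current += timedelta(hours=1)
    return missing
```

Counting the returned hours per topic over the last 7 days gives the aggregate that the weekly report describes.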

17.2 Kafka offset Aggregate Hist
Kafka topic offset aggregate history shows the stability, data quality, and partition utilization over
time of the microservices that produce Kafka traffic. An even distribution across time scales indicates
that the producer writes data to the Kafka topic evenly over time. The pattern might be skewed when data
is not processed on time and needs replay. Rollout of new resources (devices) is another possible cause.
A uniform data pattern across time scales indicates stable producer microservices.

17.3 IoT Hub, Event Hub offset Aggregate Hist
IoT Hub and Event Hub offset aggregate history shows IoT Hub and Event Hub data processing quality,
partition utilization, and job stability over time. Missing days, hours, or minutes indicate IoT Hub or
Event Hub outages, or the inability of the stream producer (device) to send traffic to IoT Hub or Event Hub.

17.4 Cassandra Client Request Outage Aggregate Hist
Cassandra client request outage aggregate history shows microservice stability while accessing the
Cassandra database. The outage metric distribution shows the contribution of each Cassandra outage type
(read, write, scan, coordinator). Temporal drilldown (day, hour) gives pattern detail and helps the app
owner troubleshoot queries and microservices.

17.5 Cassandra Table Latency Aggregate Hist
Cassandra table latency aggregate history helps in app performance tuning. Cassandra objects might need
maintenance, or a data retention and cleanup policy might need to be enforced for high-latency objects.
Using partition and clustering keys helps reduce latency. High rangelatency, coordinatorscanlatency, and
coordinatorreadlatency can be avoided by using appropriate primary keys in queries. Temporal drilldown by
latency type, keyspace, and table helps developers pinpoint the root cause of a problem and helps in
performance tuning and decision making.

17.6 Kubernetes nodes resource (CPU, Memory) usage contributions to Agent pool Hist
Kubernetes capacity planning is easier when node utilization in the cluster is visible over a period of
time. An even CPU and memory distribution across the nodes in an agent pool is desired for optimum
resource utilization. This temporal aggregate view shows each node’s resource contribution to the agent
pool. In the figure below, contribution values close to 1 show optimum CPU or memory distribution, whereas
high variation from this number on either side shows scope for resource optimization. A value of 2
indicates abnormally high resource usage, and a value close to zero shows the least resource utilization
by a node in the agent pool. This figure also gives recommendations for node load balancing in the
Kubernetes cluster.
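The document does not define how the contribution value is computed. One reading consistent with the description (1 for an even share, 2 for abnormally high usage, near 0 for an idle node) is a node's usage divided by the mean usage of the nodes in its agent pool. The sketch below illustrates that assumed definition only; the function name is hypothetical.

```python
def pool_contribution(usage_by_node):
    """Return each node's usage divided by the pool's mean usage.

    Assumed definition: 1.0 means the node carries exactly an even share,
    2.0 means twice the even share, and 0.0 means the node is idle.
    """
    if not usage_by_node:
        return {}
    mean = sum(usage_by_node.values()) / len(usage_by_node)
    if mean == 0:
        return {node: 0.0 for node in usage_by_node}
    return {node: used / mean for node, used in usage_by_node.items()}
```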

17.7 Last week’s Kubernetes abnormal resource (CPU or Memory) usage summary
Last week’s abnormal CPU or memory usage alert summary is captured here. An alert is raised when the core
or memory bucket exceeds 75%. This summary report shows the resource usage risk for future weeks and
helps in the planning process.

17.8 Batch and Streaming Job Audit Log Statistics
Job stability is captured in these statistics. The batch job audit log history shows the success history
of batch jobs, and the streaming job audit log shows the restart history of streaming jobs.

Appendix
Prometheus web site: https://prometheus.io/
Prometheus Exporters and Integrations: https://prometheus.io/docs/instrumenting/exporters/
Node Exporter: https://github.com/prometheus/node_exporter
Cassandra Exporter: https://github.com/criteo/cassandra_exporter
