Monitoring in
Big Data Platform
Author: Albert Lewandowski
Who am I?
•Big Data DevOps Engineer at GetInData
•Focused on infrastructure, cloud, the Internet of Things and
Big Data
Agenda
1. Monitoring overview
2. Metrics
a. Infrastructure
b. Big Data components
c. Applications
3. Logs analysis
a. Overview
b. Logging solutions
c. Why and how to implement?
4. Q&A
Monitoring
Overview
Push vs. Pull model
Push:
• Agents push metrics to the collector.
• The polling task is fully distributed among agents, resulting in linear scalability.
• Push agents are inherently secure against remote attacks, since they do not listen for network connections.
• Relatively inflexible: a pre-determined, fixed set of measurements is periodically exported.
Pull:
• The collector pulls metrics from targets.
• The workload on the central poller increases with the number of devices polled.
• The polling protocol can potentially open the system up to remote access and denial-of-service attacks.
• Flexible: the poller can ask for any metric at any time.
Monitoring Overview
Prometheus Architecture
Prometheus is an open-source systems monitoring and alerting toolkit
originally built at SoundCloud. It joined the Cloud Native
Computing Foundation in 2016 as the second hosted project,
after Kubernetes.
Components:
• server
• client libraries for instrumenting application code
• a push gateway for supporting short-lived jobs
• special-purpose exporters for services
• an AlertManager to handle alerts
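A minimal prometheus.yml tying these components together might look like this sketch (hostnames and ports are illustrative, not from the talk):

```yaml
# Minimal Prometheus configuration sketch; targets are examples.
global:
  scrape_interval: 15s            # how often targets are scraped

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]   # the AlertManager instance

scrape_configs:
  - job_name: "prometheus"        # Prometheus scrapes itself
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "node"              # a Node Exporter target
    static_configs:
      - targets: ["node1:9100"]
```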
Prometheus’ stories
•Use service discovery, it’s great
• Discover where Flink JobManagers and
TaskManagers expose their metrics.
•How to provide HA?
• Think of using long-term storage such as
Thanos, M3 or Cortex.
•Do you need archived data?
•Monitor Prometheus, even if it is itself a
monitoring tool.
Service Discovery
It is used for discovering scrape targets.
We can use Kubernetes, Consul, file-based
service discovery and many others.
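As a sketch, Kubernetes-based service discovery with relabelling to keep only annotated pods could look like this (the prometheus.io/scrape annotation is a common convention, not a fixed standard):

```yaml
scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod                 # discover all pods via the Kubernetes API
    relabel_configs:
      # keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # carry the pod's namespace over as a regular label
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
```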
Prometheus Security
Remember:
■ It should not by default be possible for a target to expose data
that impersonates a different target. The honor_labels option
removes this protection, as can certain relabelling setups.
■ The --web.enable-admin-api flag controls access to the
administrative HTTP API which includes functionality such as
deleting time series. This is disabled by default. If enabled,
administrative and mutating functionality will be accessible
under the /api/*/admin/ paths.
■ Any secrets stored in template files could be exfiltrated by
anyone able to configure receivers in the Alertmanager
configuration file.
Prometheus Security
Remember:
■ Prometheus and its components do not provide any
server-side authentication, authorization or encryption. If
you require this, it is recommended to use a reverse proxy.
■ As administrative and mutating endpoints are intended to
be accessed via simple tools such as cURL, there is no built
in CSRF protection.
In the future, server-side TLS support will be rolled out to the different
Prometheus projects. Those projects include Prometheus,
Alertmanager, Pushgateway and the official exporters. Authentication
of clients by TLS client certs will also be supported.
HA for Prometheus
Simple: two Prometheus instances
HA for Prometheus
HA with cold storage
● Cortex
● M3DB
● Thanos
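A common pattern for the HA pair is to run two identical Prometheus instances that differ only in an external replica label, which the long-term storage layer (Thanos, Cortex, M3) deduplicates on. A sketch, with label names chosen by convention:

```yaml
# prometheus.yml for replica A; replica B is identical except replica: "B".
global:
  external_labels:
    cluster: "prod"
    replica: "A"    # long-term storage deduplicates on this label
```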
Monitoring
Infrastructure
Servers Monitoring
■ Using Node Exporter or any custom exporter
■ Predict disk usage
■ Send alerts in case of any issue or any dangerous
situation
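Disk-usage prediction is typically done with predict_linear() over Node Exporter's filesystem metrics. A sketch of such an alerting rule (the threshold and window are examples):

```yaml
groups:
  - name: disk
    rules:
      - alert: DiskWillFillIn4Hours
        # linear extrapolation of the last hour predicts free space in 4h
        expr: predict_linear(node_filesystem_avail_bytes{fstype!="tmpfs"}[1h], 4 * 3600) < 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Disk on {{ $labels.instance }} is predicted to fill within 4 hours"
```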
Kubernetes Monitoring
■ Nodes monitoring
■ Pods monitoring
■ Namespace monitoring
Check:
- resource limits and resource requests
- resource utilization
- Node Exporter
- Kube State Metrics: a simple service that listens to the Kubernetes
API server and generates metrics about the state of the objects
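With kube-state-metrics and cAdvisor metrics in place, utilization versus limits can be checked with queries along these lines (metric names follow recent kube-state-metrics versions and may differ in older releases):

```promql
# memory working set per pod as a fraction of the configured limit
sum(container_memory_working_set_bytes) by (namespace, pod)
  /
sum(kube_pod_container_resource_limits{resource="memory"}) by (namespace, pod)
```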
Monitoring
Components
Hadoop Stack Overview
Kafka exporter
We can scrape and expose the
MBeans of a JMX target.
Add to KAFKA_OPTS:
export KAFKA_OPTS='-javaagent:/opt/jmx-exporter/jmx-exporter.jar=9101:/etc/jmx-exporter/kafka.yml'
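The /etc/jmx-exporter/kafka.yml referenced above is a JMX Exporter configuration; a minimal sketch that exposes one important broker MBean could look like this (the resulting metric name depends on your rules):

```yaml
lowercaseOutputName: true
rules:
  # expose the under-replicated partition count per broker
  - pattern: "kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value"
    name: kafka_server_replicamanager_underreplicatedpartitions
    type: GAUGE
```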
Kafka exporter
Metrics example:
• Number of replicas
• Under replicated partitions
• State of each broker
• JVM metrics
What about Kafka?
• Critical component in many systems
• A lot of data, so any issue can cause
business loss
• What about adding more brokers?
What about Kafka?
• Keep it simple with the JMX exporter.
• Monitor partition state and the number of
replicas.
• You must have alerts for when the number
of under-replicated partitions grows fast.
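Assuming the JMX Exporter exposes the metric as kafka_server_replicamanager_underreplicatedpartitions (the exact name depends on your rules file), such an alert could be sketched as:

```yaml
groups:
  - name: kafka
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: sum(kafka_server_replicamanager_underreplicatedpartitions) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Kafka has under-replicated partitions"
```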
Monitoring
Applications
When should I use PushGateway?
• When monitoring multiple instances through a single
Pushgateway, the Pushgateway becomes both a
single point of failure and a potential bottleneck.
• You lose Prometheus's automatic instance health
monitoring via the up metric (generated on every
scrape).
• The Pushgateway never forgets series pushed to it
and will expose them to Prometheus forever unless
those series are manually deleted via the
Pushgateway's API.
Blackbox Exporter
● Official Prometheus Exporter, written in Go
● Single binary installed on monitoring server
● Allows probing API endpoints
https://github.com/prometheus/blackbox_exporter
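A typical setup defines a probe module in blackbox.yml and relabels the scrape target so the exporter probes the given URL; the pattern below follows the project README (the probed endpoint is an example):

```yaml
# blackbox.yml: an HTTP 2xx probe module
modules:
  http_2xx:
    prober: http
    timeout: 5s

# prometheus.yml: scrape job using that module
scrape_configs:
  - job_name: "blackbox"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets: ["https://example.org/health"]   # endpoint to probe
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target        # probed URL becomes ?target=...
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115 # address of the exporter itself
```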
Airflow Exporter
● Provides metrics for Airflow:
○ dag_status
○ task_status
○ run_duration
● Simple install via pip
https://github.com/epoch8/airflow-exporter
Knox Exporter
● Provides metrics about access to WebHDFS and Hive
● Written in Java, manual installation as a service, needs
configuration file
https://github.com/marcelmay/apache-knox-exporter
NiFi
We use the PrometheusReportingTask to report metrics in
Prometheus format.
It creates a metrics HTTP endpoint with information about
the JVM and the NiFi instance.
For flows, we need custom reporters, such as our own NAR
processor, that can count the number of flow files or any
other metric.
Bash Exporter
● Written in Go, manual installation as a service, needs
additional bash scripts to run
● Can create metrics from the output of a bash command
● Status for Hadoop components:
○ Zeppelin
○ Ranger
○ HiveServer
https://github.com/gree-gorey/bash-exporter
Hive Query
● Running a Hive query via crontab and writing the output to
a file
● Running the bash exporter to read the file and expose the
metrics to Prometheus
● Reading the metrics in Grafana
Flink monitoring
Often the most important processing tool:
Flink jobs must run all the time.
Flink monitoring
Processing metrics
• Number of processed events
• Processing lag, which may trigger
alerts
• JVM parameters
Flink stability
• How many restarts does Flink have?
• Can it finish making checkpoints
and savepoints?
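With Flink's PrometheusReporter enabled, both questions can be turned into alerts. The metric names below assume the default reporter naming and may differ per setup:

```yaml
groups:
  - name: flink
    rules:
      - alert: FlinkJobRestarting
        # the job restarted within the last 15 minutes
        expr: increase(flink_jobmanager_job_numRestarts[15m]) > 0
        labels:
          severity: warning
      - alert: FlinkCheckpointsFailing
        # checkpoints failed within the last 30 minutes
        expr: increase(flink_jobmanager_job_numberOfFailedCheckpoints[30m]) > 0
        labels:
          severity: critical
```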
Before Spark 3.0
Spark & Prometheus
What about Spark jobs?
• Let’s connect Prometheus with
statsd_exporter
• Use the built-in StatsD sink in Spark
Spark & Prometheus
The most reliable way to get a stable metrics system
with correct timestamps is to use the StatsD sink.
The metrics are sent to statsd_exporter
(https://github.com/prometheus/statsd_exporter)
and then exposed to Prometheus.
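The StatsD sink is configured in Spark's metrics.properties; a minimal sketch pointing at a statsd_exporter instance (host and port are examples):

```properties
# metrics.properties: send all Spark metrics to statsd_exporter
*.sink.statsd.class=org.apache.spark.metrics.sink.StatsdSink
*.sink.statsd.host=statsd-exporter.example.com
*.sink.statsd.port=9125
*.sink.statsd.period=10
*.sink.statsd.unit=seconds
```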
Spark & Prometheus
A different approach is based on the JMX Exporter
and Spark’s JMXSink.
The next option uses Spark’s GraphiteSink together
with the Prometheus Graphite Exporter.
The last is to build a custom exporter of
metrics.
From Spark 3.0
Spark & Prometheus
Spark 3.0 introduces the following:
- PrometheusServlet, which makes the
Master/Worker/Driver nodes expose metrics in
Prometheus format.
- PrometheusResource, which exports the metrics of
all executors at the driver.
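In Spark 3.0 this is driven by configuration only; a sketch of the relevant settings, with paths following the Spark monitoring documentation:

```properties
# metrics.properties: expose driver/master/worker metrics
# at /metrics/prometheus on each component's UI port
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus

# spark-defaults.conf: expose executor metrics at the driver
# under /metrics/executors/prometheus
spark.ui.prometheus.enabled=true
```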
Monitoring
Cloud
AWS
- Cloudwatch
- Cloudwatch Logs
Google Cloud
- Operations metrics
- Operations Logging
Formerly Stackdriver
Microsoft Azure
- Azure Monitor:
- Logs
- Metrics
Monitoring
Demo
Logs analysis
Overview
Logs analysis
Logging solutions
Log analytics
ELK
Elasticsearch
Elasticsearch is a distributed, open source
search and analytics engine for all types of
data, including textual, numerical,
geospatial, structured, and unstructured.
Elasticsearch is built on Apache Lucene
and was first released in 2010.
Elasticsearch Use cases
• Logging and log analytics
• Infrastructure metrics and container
monitoring
• Application performance monitoring
• Geospatial data analysis and
visualization
• Security analytics
• Business analytics
Logs & Metrics
Log analytics
Loki
Loki
Loki is a horizontally-scalable,
highly-available, multi-tenant log
aggregation system inspired by
Prometheus. It is designed to be very cost
effective and easy to operate. It does not
index the contents of the logs, but rather a
set of labels for each log stream.
Loki Overview
Loki receives logs in separate streams, where each
stream is uniquely identified by its tenant ID
and its set of labels.
As log entries from a stream arrive, they are
compressed into "chunks" and saved in the chunk
store.
The index stores each stream's label set and links
them to the individual chunks.
Log analytics with Loki
LogQL
This example counts the entries for each log stream of the
MySQL job within the last five minutes:
count_over_time({job="mysql"}[5m])
This one calculates the per-second rate of error lines
(excluding timeouts) over the last ten seconds:
rate(({job="mysql"} |= "error" != "timeout")[10s])
Promtail
Promtail is an agent that ships the
contents of local logs to a private Loki
instance. It is deployed to every machine
that has applications that need to be
monitored.
Currently, Promtail can tail logs from two
sources: local log files and the systemd
journal (on AMD64 machines only).
Pipelines in Promtail
A pipeline is used to transform a single
log line, its labels, and its timestamp.
A pipeline is composed of a set of
stages.
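A sketch of a Promtail scrape config with a small pipeline: a regex stage parses the line, a labels stage promotes a parsed field to a label, and a timestamp stage overrides the entry's time. The log format and field names are illustrative:

```yaml
scrape_configs:
  - job_name: app-logs
    static_configs:
      - targets: [localhost]
        labels:
          job: app
          __path__: /var/log/app/*.log    # files to tail
    pipeline_stages:
      - regex:
          # parse "<ts> <level> <message>" lines
          expression: '^(?P<ts>\S+) (?P<level>\w+) (?P<message>.*)$'
      - labels:
          level:                          # promote parsed level to a label
      - timestamp:
          source: ts
          format: RFC3339
```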
Loki - HA
ELK vs. Loki+Promtail
• Data visualisation: Kibana (ELK) vs. Grafana (Loki + Promtail)
• Query performance: ELK is faster because all the data is indexed;
Loki is slower because it indexes only labels
• Resource consumption: higher for ELK due to the need for full
indexing; lower for Loki, which indexes only labels
ELK vs. Loki+Promtail
• Indexing: ELK indexes keys and the content of each key;
Loki indexes only labels
• Query language: Query DSL or Lucene QL (ELK) vs. LogQL (Loki)
• Log pipeline: ELK needs more components
(Fluentd/Fluentbit -> Logstash -> Elasticsearch, or
Filebeat -> Logstash -> Elasticsearch);
Loki needs fewer (Promtail -> Loki)
Logs analysis
Why and how to implement?
Article
https://getindata.com/blog/why-log-analytics-important-monitoring-system
Whitepaper
Monitoring and Observability for Data Platform
https://getindata.com/blog/white-paper-big-data-monitoring-observability-data-platform
Logs analysis
Demo
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Q&A
