Did you like it? Check out our blog to stay up to date: https://getindata.com/blog
The webinar was organized by GetinData in 2020. During the webinar we explained the concept of monitoring and observability with a focus on data analytics platforms.
Watch more here: https://www.youtube.com/watch?v=qSOlEN5XBQc
Whitepaper - Monitoring and Observability for Data Platform: https://getindata.com/blog/white-paper-big-data-monitoring-observability-data-platform/
Speaker: Albert Lewandowski
Linkedin: https://www.linkedin.com/in/albert-lewandowski/
___
GetinData is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of the best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets.
Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies, including Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries.
https://getindata.com
2. Who am I?
•Big Data DevOps Engineer at GetinData
•Focused on infrastructure, cloud, Internet of Things and Big Data
3. Agenda
1. Monitoring overview
2. Metrics
a. Infrastructure
b. Big Data components
c. Applications
3. Logs analysis
a. Overview
b. Logging solutions
c. Why and how to implement?
4. Q&A
5. Push vs. Pull model
Push:
• Agents push metrics to the collector.
• Polling work is fully distributed among the agents, resulting in linear scalability.
• Push agents are inherently secure against remote attacks, since they do not listen for network connections.
• Relatively inflexible: a pre-determined, fixed set of measurements is exported periodically.
Pull:
• The collector pulls metrics from the agents.
• Workload on the central poller increases with the number of devices polled.
• The polling protocol can potentially open the system up to remote-access and denial-of-service attacks.
• Flexible: the poller can ask for any metric at any time.
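The contrast above can be sketched in a few lines of plain Python (no real monitoring library; the class and metric names are invented for illustration):

```python
# Sketch of the two collection models: a Collector that *pulls* from
# agents vs. agents that *push* to the collector.

class Agent:
    """An instrumented service exposing one counter."""
    def __init__(self, name):
        self.name = name
        self.requests = 0

    def read_metrics(self):
        # Pull model: the collector calls this on its own schedule.
        return {f"{self.name}_requests_total": self.requests}

    def push_metrics(self, collector):
        # Push model: the agent decides when to send.
        collector.receive(self.read_metrics())

class Collector:
    def __init__(self):
        self.store = {}

    def receive(self, metrics):
        # Push: the polling work stays distributed among the agents.
        self.store.update(metrics)

    def scrape(self, agents):
        # Pull: the central poller's workload grows with the agent count.
        for agent in agents:
            self.store.update(agent.read_metrics())

agents = [Agent("web"), Agent("db")]
agents[0].requests = 5

pull_collector = Collector()
pull_collector.scrape(agents)          # collector decides when to poll

push_collector = Collector()
for a in agents:
    a.push_metrics(push_collector)     # each agent pushes on its own

print(pull_collector.store)  # {'web_requests_total': 5, 'db_requests_total': 0}
```

Both models end up with the same data; the difference is who initiates the transfer and where the scheduling workload sits.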
8. Prometheus Architecture
Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. It joined the Cloud Native Computing Foundation in 2016 as the second hosted project, after Kubernetes.
Components:
• server
• client libraries for instrumenting application code
• a push gateway for supporting short-lived jobs
• special-purpose exporters for services
• an AlertManager to handle alerts
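A minimal sketch of how these components are wired together in prometheus.yml (hostnames and ports below are illustrative, not from the talk):

```yaml
# prometheus.yml -- minimal illustrative configuration
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["node-exporter:9100"]

  - job_name: "pushgateway"
    honor_labels: true        # keep job/instance labels set by the pushing batch jobs
    static_configs:
      - targets: ["pushgateway:9091"]

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
```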
9. Prometheus’ stories
• Use service discovery, it’s great
  • Discover where Flink JM and TM expose their metrics
• How to provide HA?
  • Think of using long-term storage like Thanos, M3 or Cortex
• Do you need archived data?
• Monitor Prometheus itself, even though it is a monitoring tool
10. Service Discovery
It is used for discovering scrape targets. We can use Kubernetes, Consul or file-based service discovery, among many others.
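For example, Kubernetes service discovery with a relabel rule that keeps only annotated pods might look like the snippet below (the prometheus.io/scrape annotation is a common convention, not a built-in default):

```yaml
# prometheus.yml excerpt: discover pods via the Kubernetes API
scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Scrape only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```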
11. Prometheus Security
Remember:
■ It should not by default be possible for a target to expose data
that impersonates a different target. The honor_labels option
removes this protection, as can certain relabelling setups.
■ --web.enable-admin-api flag controls access to the
administrative HTTP API which includes functionality such as
deleting time series. This is disabled by default. If enabled,
administrative and mutating functionality will be accessible
under the /api/*/admin/ paths.
■ Any secrets stored in template files could be exfiltrated by
anyone able to configure receivers in the Alertmanager
configuration file.
12. Prometheus Security
Remember:
■ Prometheus and its components do not provide any
server-side authentication, authorization or encryption. If
you require this, it is recommended to use a reverse proxy.
■ As administrative and mutating endpoints are intended to
be accessed via simple tools such as cURL, there is no built
in CSRF protection.
In the future, server-side TLS support will be rolled out to the different
Prometheus projects. Those projects include Prometheus,
Alertmanager, Pushgateway and the official exporters. Authentication
of clients by TLS client certs will also be supported.
16. Servers Monitoring
■ Using Node Exporter or any custom exporter
■ Predict disk usage
■ Send alerts in case of any issue or any dangerous
situation
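Disk-usage prediction can be expressed as an alerting rule using predict_linear over Node Exporter data; the windows and thresholds below are illustrative choices:

```yaml
# Alerting rule sketch: fire when linear extrapolation of the last 6h
# of data predicts the filesystem will be full within 4 hours.
groups:
  - name: disk
    rules:
      - alert: DiskWillFillIn4Hours
        expr: predict_linear(node_filesystem_avail_bytes{fstype!="tmpfs"}[6h], 4 * 3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk on {{ $labels.instance }} predicted to fill within 4h"
```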
17. Kubernetes Monitoring
■ Nodes monitoring
■ Pods monitoring
■ Namespace monitoring
Check:
- resources limits and resources requests
- resource utilization
- Node Exporter
- kube-state-metrics - a simple service that listens to the Kubernetes API server and generates metrics about the state of the objects
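A few PromQL queries that tie these checks together (metric names come from kube-state-metrics and cAdvisor; the resource label form assumes kube-state-metrics v2):

```promql
# CPU actually used vs. CPU requested, per namespace
sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)
sum(kube_pod_container_resource_requests{resource="cpu"}) by (namespace)

# Pods stuck in a non-running phase
sum(kube_pod_status_phase{phase=~"Pending|Failed"}) by (namespace)
```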
20. Kafka exporter
We can scrape and expose MBeans of a JMX target with the JMX exporter, which runs as a Java agent. Add it to KAFKA_OPTS:
export KAFKA_OPTS='-javaagent:/opt/jmx-exporter/jmx-exporter.jar=9101:/etc/jmx-exporter/kafka.yml'
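A sketch of the /etc/jmx-exporter/kafka.yml referenced above (the rule shown is one common mapping; real configs list many more patterns):

```yaml
# JMX exporter config sketch for Kafka brokers
lowercaseOutputName: true
rules:
  # Map the under-replicated-partitions MBean to a stable metric name
  - pattern: "kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value"
    name: kafka_server_replicamanager_underreplicatedpartitions
    type: GAUGE
  # Catch-all for the remaining kafka.* MBeans
  - pattern: ".*"
```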
22. What about Kafka?
• A critical component in many systems
• It handles a lot of data, so any issue can cause business loss
• What about adding more brokers?
23. What about Kafka?
• Keep it simple with the JMX exporter.
• Monitor partition state and the number of replicas.
• You must have alerts for when the number of under-replicated partitions grows fast.
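The under-replicated-partitions alert can be sketched as a Prometheus rule; the exact metric name depends on how your JMX exporter rules rewrite the MBean, so treat the name below as an assumption:

```yaml
groups:
  - name: kafka
    rules:
      - alert: KafkaUnderReplicatedPartitions
        # Metric name assumed from a typical JMX exporter mapping
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Broker {{ $labels.instance }} has under-replicated partitions"
```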
25. When should I use PushGateway?
• When monitoring multiple instances through a single
Pushgateway, the Pushgateway becomes both a
single point of failure and a potential bottleneck.
• You lose Prometheus's automatic instance health
monitoring via the up metric (generated on every
scrape).
• The Pushgateway never forgets series pushed to it
and will expose them to Prometheus forever unless
those series are manually deleted via the
Pushgateway's API.
26. Blackbox Exporter
● Official Prometheus Exporter, written in Go
● Single binary installed on monitoring server
● Allows probing API endpoints
https://github.com/prometheus/blackbox_exporter
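A typical setup pairs a blackbox.yml probe module with a Prometheus scrape job that rewrites each target into a /probe parameter (the endpoint URL is an example):

```yaml
# blackbox.yml -- define an HTTP probe module
modules:
  http_2xx:
    prober: http
    timeout: 5s

# prometheus.yml (scrape_configs excerpt) -- turn each target into a /probe call
scrape_configs:
  - job_name: "blackbox"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets: ["https://api.example.com/health"]   # endpoint to probe (example)
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target                  # becomes ?target=<url>
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115           # actually scrape the exporter
```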
27. Airflow Exporter
● Provides metrics for Airflow:
○ dag_status
○ task_status
○ run_duration
● Simple install via pip
https://github.com/epoch8/airflow-exporter
28. Knox Exporter
● Provides metrics about access to WebHDFS and Hive
● Written in Java, manual installation as a service, needs
configuration file
https://github.com/marcelmay/apache-knox-exporter
29. NiFi
We use the PrometheusReportingTask to report metrics in Prometheus format. It creates a metrics HTTP endpoint with information about the JVM and the NiFi instance.
For flows, we need custom reporters, such as our own NAR processor that can count the number of flow files or compute any other metric.
30. Bash Exporter
● Written in Go, manual installation as a service, needs
additional bash scripts to run
● Can turn the output of a bash command into a metric
● Status for Hadoop components:
○ Zeppelin
○ Ranger
○ HiveServer
https://github.com/gree-gorey/bash-exporter
31. Hive Query
● Run a Hive query via crontab and write the output to a file
● Run the bash exporter to read the file and expose the metrics for Prometheus
● Read the metrics in Grafana
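The cron side of this setup might look like the fragment below (the connection string, query and file paths are assumptions for illustration):

```cron
# Run a Hive health-check query every 5 minutes; the bash exporter
# script reads the output file and exposes it as a metric.
*/5 * * * * beeline -u "jdbc:hive2://hive-server:10000" --silent=true -e "SELECT COUNT(*) FROM monitoring.heartbeat" > /var/lib/metrics/hive_check.out 2>&1
```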
33. Flink monitoring
Processing metrics
• Number of processed events
• Processing lag, which may trigger alerts
• JVM parameters
Flink stability
• How many restarts does Flink have?
• Can it finish its checkpoints and savepoints?
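Sketch of alert rules on Flink's Prometheus metrics (the metric names follow the PrometheusReporter's usual naming; verify them against your reporter configuration):

```yaml
groups:
  - name: flink
    rules:
      - alert: FlinkJobRestarting
        # More than 2 restarts within 15 minutes suggests an unstable job
        expr: increase(flink_jobmanager_job_numRestarts[15m]) > 2
        for: 5m
      - alert: FlinkCheckpointsFailing
        # Any failed checkpoint in the window is worth a look
        expr: increase(flink_jobmanager_job_numberOfFailedCheckpoints[15m]) > 0
        for: 5m
```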
36. What about Spark jobs?
• Let’s connect Prometheus with
statsd_exporter
• Use Spark’s built-in StatsD sink
37. Spark & Prometheus
The most reliable way to build a stable metrics pipeline with correct timestamps is to use the StatsD sink. These metrics can be sent to statsd_exporter (https://github.com/prometheus/statsd_exporter) and then exposed to Prometheus.
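The sink is configured in Spark's metrics.properties; the host and port below are assumptions (statsd_exporter listens on 9125 by default, and StatsdSink ships with Spark 2.3+):

```properties
# metrics.properties -- route Spark's internal metrics to statsd_exporter
*.sink.statsd.class=org.apache.spark.metrics.sink.StatsdSink
*.sink.statsd.host=statsd-exporter.example.com
*.sink.statsd.port=9125
*.sink.statsd.prefix=spark
```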
38. Spark & Prometheus
A different approach is based on the JMX Exporter and Spark’s JMXSink.
Another option is Spark’s GraphiteSink combined with the Prometheus Graphite Exporter.
The last is building a custom metrics exporter.
40. Spark & Prometheus
Spark 3.0 introduces the following:
- PrometheusServlet, which makes the Master/Worker/Driver nodes expose metrics in Prometheus format.
- PrometheusResource, which exports the metrics of all executors at the driver.
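Enabling both is a matter of configuration; the properties below match Spark 3.0's documented names:

```properties
# metrics.properties -- PrometheusServlet on Master/Worker/Driver UIs
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus

# spark-defaults.conf -- PrometheusResource: executor metrics aggregated
# at the driver, exposed under /metrics/executors/prometheus
# spark.ui.prometheus.enabled=true
```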
49. Elasticsearch
Elasticsearch is a distributed, open source
search and analytics engine for all types of
data, including textual, numerical,
geospatial, structured, and unstructured.
Elasticsearch is built on Apache Lucene and was first released in 2010.
51. Elasticsearch Use cases
• Logging and log analytics
• Infrastructure metrics and container
monitoring
• Application performance monitoring
• Geospatial data analysis and
visualization
• Security analytics
• Business analytics
55. Loki
Loki is a horizontally-scalable,
highly-available, multi-tenant log
aggregation system inspired by
Prometheus. It is designed to be very cost
effective and easy to operate. It does not
index the contents of the logs, but rather a
set of labels for each log stream.
56. Loki Overview
Loki receives logs in separate streams, where each
stream is uniquely identified by its tenant ID
and its set of labels.
As log entries from a stream arrive, they are
GZipped as "chunks" and saved in the chunks
store.
The index stores each stream's label set and links it to the individual chunks.
58. LogQL
This example counts the entries for each log stream within the last five minutes for the MySQL job:
count_over_time({job="mysql"}[5m])
This one calculates the per-second rate of "error" lines (excluding timeouts) over the last ten seconds:
rate(({job="mysql"} |= "error" != "timeout")[10s])
59. Promtail
Promtail is an agent which ships the contents of local logs to a private Loki instance. It is deployed to every machine that runs applications which need to be monitored.
Currently, Promtail can tail logs from two sources: local log files and the systemd journal (on AMD64 machines only).
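A minimal Promtail configuration for shipping local log files (the Loki URL and labels are illustrative):

```yaml
# promtail-config.yml sketch: ship /var/log/*.log to a Loki instance
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml   # where Promtail remembers how far it has read

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          __path__: /var/log/*.log   # glob of files to tail
```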
60. Pipelines in Promtail
A pipeline is used to transform a single
log line, its labels, and its timestamp.
A pipeline is comprised of a set of
stages.
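For example, a pipeline that parses JSON log lines, promotes the level field to a label and uses the log's own timestamp might look like this (field names are assumptions about the log format):

```yaml
# Promtail pipeline sketch for JSON-formatted application logs
pipeline_stages:
  - json:
      expressions:
        level: level       # extract "level" from the JSON body
        ts: time           # extract the log's own timestamp
  - labels:
      level:               # promote "level" to a Loki label
  - timestamp:
      source: ts
      format: RFC3339      # use the extracted time instead of ingestion time
```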
62. ELK vs. Loki+Promtail
Data visualisation: Kibana (ELK) vs. Grafana (Loki + Promtail)
Query performance: ELK is faster, since all the data is indexed; Loki is slower, since only labels are indexed
Resource consumption: ELK's is higher, due to full indexing; Loki's is lower, since it indexes only labels
63. ELK vs. Loki+Promtail
Indexing: ELK indexes keys and the content of each key; Loki indexes only labels
Query language: Query DSL or Lucene QL (ELK) vs. LogQL (Loki)
Log pipeline: ELK needs more components (Fluentd/Fluentbit -> Logstash -> Elasticsearch, or Filebeat -> Logstash -> Elasticsearch); Loki needs fewer (Promtail -> Loki)