Continuous Operation
of a Data Science Platform
with Kubeflow
Author: Albert Lewandowski
• Big Data DevOps Engineer at GetInData
• DevOps and Cloud consultant
• Focused on infrastructure, cloud, Internet of Things and
Big Data
Who am I?
Agenda
1. Platform overview.
2. Observability.
a. Metrics.
b. Logs.
c. Actions.
3. Best practices.
a. CI/CD.
b. How to work with a modern Data
Science platform?
4. Q&A
Data Science
Kubeflow
Kubeflow Overview
Goals of Kubeflow:
• End-to-end orchestration: enabling and
simplifying the orchestration of machine learning
pipelines.
• Easy experimentation: making it easy for you to
try numerous ideas and techniques and manage
your various trials/experiments.
• Easy re-use: enabling you to re-use components
and pipelines to quickly create end-to-end
solutions without having to rebuild each time.
What is a pipeline in Kubeflow?
A pipeline is a description of an ML
workflow, including all of the components
in the workflow and how they combine in
the form of a graph.
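As an illustration, here is a minimal sketch of a pipeline definition using the KFP v1 SDK; the step names, image and commands are placeholders, not part of the original talk:

```python
import kfp
from kfp import dsl


@dsl.pipeline(name="demo-pipeline",
              description="Two-step sketch of a Kubeflow pipeline")
def demo_pipeline():
    # Each step is a containerized component; the DAG is built from
    # the dependencies declared between steps.
    extract = dsl.ContainerOp(
        name="extract",
        image="python:3.9",  # placeholder image
        command=["python", "-c", "print('extract')"],
    )
    train = dsl.ContainerOp(
        name="train",
        image="python:3.9",  # placeholder image
        command=["python", "-c", "print('train')"],
    )
    train.after(extract)  # explicit ordering edge in the graph


if __name__ == "__main__":
    # Compile to a workflow spec that Kubeflow Pipelines can run.
    kfp.compiler.Compiler().compile(demo_pipeline, "demo_pipeline.yaml")
```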
Observability
Observability
Monitoring describes the process of gathering metrics about the IT
environment and running applications, and observing system
performance.
Observability is about measuring how well the internal states of a
system can be inferred from knowledge of its external outputs
(according to control theory).
Observability
Example:
- a data processing job written in Spark or Flink that rewrites
data from location A to B. Gathering its metrics and setting
up alerts or creating a dashboard with a simple runtime
visualization are quite simple tasks. However, to achieve
observability we should collect metrics about the amount of
processed data, JVM statistics and some metrics about the
infrastructure under the hood.
Metrics
Kubeflow Pipeline Monitoring
■ Write a JSON file specifying metrics to render.
■ Export a file output artifact with an artifact name.
■ Done :)
Expose metrics and use Prometheus-based exporters to
detect any issue or problem with the released application.
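For example, a component's entrypoint can emit a metric like this; a minimal sketch, assuming the KFP v1 convention of writing the artifact to /mlpipeline-metrics.json (the metric name and value are illustrative):

```python
import json

# Assume the step has just evaluated a model; the value is illustrative.
accuracy = 0.95

metrics = {
    "metrics": [
        {
            "name": "accuracy-score",  # lowercase letters, digits, dashes
            "numberValue": accuracy,   # the numeric value to render
            "format": "PERCENTAGE",    # how the UI formats it (or "RAW")
        }
    ]
}

# KFP v1 picks up the metrics artifact from this well-known path.
with open("/mlpipeline-metrics.json", "w") as f:
    json.dump(metrics, f)
```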
Kubernetes Monitoring
■ Node monitoring
■ Pod monitoring
■ Namespace monitoring
Check:
- resource limits and resource requests
- resource utilization (see the sketch after this list)
- Node Exporter
- kube-state-metrics: a simple service that listens to the Kubernetes
API server and generates metrics about the state of its objects.
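As a sketch of the requests-vs-utilization check, here is a query against the Prometheus HTTP API; the metric names assume standard cAdvisor and kube-state-metrics deployments, and the service URL and namespace are hypothetical:

```python
import requests

PROMETHEUS = "http://prometheus.monitoring.svc:9090"  # assumed in-cluster URL

# Per-pod CPU usage as a fraction of the pod's CPU requests.
query = (
    'sum(rate(container_cpu_usage_seconds_total{namespace="kubeflow"}[5m])) by (pod)'
    " / "
    'sum(kube_pod_container_resource_requests{resource="cpu",namespace="kubeflow"}) by (pod)'
)

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": query})
for series in resp.json()["data"]["result"]:
    pod, (_, value) = series["metric"].get("pod"), series["value"]
    print(f"{pod}: {float(value):.0%} of requested CPU")
```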
Servers Monitoring
■ Use Node Exporter or any custom exporter
■ Predict disk usage (see the sketch after this list)
■ Send alerts in case of any issue or dangerous
situation
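Disk-usage prediction is typically done with PromQL's predict_linear over Node Exporter series; a hedged sketch, where the mount point, lookback and lookahead windows are illustrative:

```python
import requests

PROMETHEUS = "http://prometheus.monitoring.svc:9090"  # assumed in-cluster URL

# Linear extrapolation of free bytes 24h ahead, based on the last 6h of data.
# A negative prediction means the filesystem is on track to fill up.
query = 'predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24 * 3600)'

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": query})
for series in resp.json()["data"]["result"]:
    instance = series["metric"].get("instance")
    predicted = float(series["value"][1])
    if predicted < 0:
        print(f"{instance}: disk predicted to be full within 24h")
```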
Blackbox Exporter
● Official Prometheus Exporter, written in Go
● Single binary installed on monitoring server
● Allows probing API endpoints
https://github.com/prometheus/blackbox_exporter
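The exporter probes a target on demand via its /probe endpoint; a quick sketch, where the exporter address and target URL are placeholders and "http_2xx" is the probe module from the exporter's example configuration:

```python
import requests

# blackbox_exporter listens on port 9115 by default.
resp = requests.get(
    "http://blackbox-exporter:9115/probe",
    params={"target": "https://kubeflow.example.com/healthz",
            "module": "http_2xx"},
)

# The response is Prometheus text format; probe_success is 1 on success.
for line in resp.text.splitlines():
    if line.startswith(("probe_success", "probe_duration_seconds")):
        print(line)
```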
Don’t forget about alerts
Alerts signify that a human needs to take
action immediately in response to
something that is either happening or
about to happen, in order to improve the
situation.
Flink monitoring
Often the most important processing tool:
Flink jobs must run all the time.
Flink monitoring
Processing metrics
• Number of processed events
• Processing lag, which may trigger
alerts
• JVM parameters
Flink stability
• How many restarts does Flink have?
• Can it finish making checkpoints
and savepoints?
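Both stability questions can be answered from Flink's REST API; a minimal sketch, where the JobManager service name is an assumption and 8081 is Flink's default REST port:

```python
import requests

FLINK = "http://flink-jobmanager:8081"  # assumed service name

for job in requests.get(f"{FLINK}/jobs").json()["jobs"]:
    job_id, status = job["id"], job["status"]

    # Checkpoint statistics: failed checkpoints are an early stability signal.
    counts = requests.get(f"{FLINK}/jobs/{job_id}/checkpoints").json()["counts"]
    print(f"{job_id} [{status}]: "
          f"completed={counts['completed']} failed={counts['failed']}")
```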
What about Spark jobs?
• Let’s connect Prometheus with
statsd_exporter
• Use Spark’s built-in StatsD sink
Spark & Prometheus
The most reliable way to get a stable metrics system
with correct timestamps is to use the StatsD sink.
These metrics can be sent to statsd_exporter
(https://github.com/prometheus/statsd_exporter)
and then exposed to Prometheus.
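A minimal sketch of wiring this up from the Spark side; the host assumes a statsd_exporter service, and 9125 is that exporter's default ingest port:

```python
from pyspark.sql import SparkSession

# Route Spark's built-in StatsD sink at statsd_exporter, which translates
# the metrics into a format Prometheus can scrape (on its port 9102).
spark = (
    SparkSession.builder
    .appName("statsd-metrics-demo")
    .config("spark.metrics.conf.*.sink.statsd.class",
            "org.apache.spark.metrics.sink.StatsdSink")
    .config("spark.metrics.conf.*.sink.statsd.host", "statsd-exporter")  # assumed service name
    .config("spark.metrics.conf.*.sink.statsd.port", "9125")
    .getOrCreate()
)
```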
Spark & Prometheus
A different approach is based on the JMX Exporter
and Spark’s JMXSink.
The next uses Spark’s GraphiteSink together with
the Prometheus Graphite Exporter.
The last is building a custom metrics
exporter.
Spark & Prometheus
Spark 3.0 introduces native Prometheus support:
- PrometheusServlet, which makes the
Master/Worker/Driver nodes expose metrics in
Prometheus format.
- PrometheusResource, which exposes the metrics of
all executors at the driver.
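A sketch of enabling both, with configuration keys following the Spark 3.0 monitoring docs; the resulting endpoints are /metrics/prometheus on the driver UI port and /metrics/executors/prometheus at the driver:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("native-prometheus-demo")
    # PrometheusServlet: expose driver metrics in Prometheus format.
    .config("spark.metrics.conf.*.sink.prometheusServlet.class",
            "org.apache.spark.metrics.sink.PrometheusServlet")
    .config("spark.metrics.conf.*.sink.prometheusServlet.path",
            "/metrics/prometheus")
    # PrometheusResource: aggregate executor metrics at the driver.
    .config("spark.ui.prometheus.enabled", "true")
    .getOrCreate()
)
```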
Don’t forget about alerts
Do not overuse alerts; some
issues should be fixed by
automation scripts.
Prometheus Security
Remember:
■ It should not by default be possible for a target to expose data
that impersonates a different target. The honor_labels option
removes this protection, as can certain relabelling setups.
■ The --web.enable-admin-api flag controls access to the
administrative HTTP API which includes functionality such as
deleting time series. This is disabled by default. If enabled,
administrative and mutating functionality will be accessible
under the /api/*/admin/ paths.
■ Any secrets stored in template files could be exfiltrated by
anyone able to configure receivers in the Alertmanager
configuration file.
Prometheus Security
Remember:
■ Prometheus and its components do not provide any
server-side authentication, authorization or encryption. If
you require this, it is recommended to use a reverse proxy.
■ As administrative and mutating endpoints are intended to
be accessed via simple tools such as cURL, there is no built-in
CSRF protection.
In the future, server-side TLS support will be rolled out to the different
Prometheus projects. Those projects include Prometheus,
Alertmanager, Pushgateway and the official exporters. Authentication
of clients by TLS client certs will also be supported.
Logs
Logs & Metrics
Log analytics
ELK
Log analytics
Loki
ELK vs. Loki + Promtail
- Data visualisation: ELK uses Kibana; Loki + Promtail uses Grafana.
- Query performance: ELK is faster because all the data is indexed;
Loki is slower because only labels are indexed.
- Resource consumption: higher for ELK due to full indexing;
lower for Loki because it indexes only labels.
ELK vs. Loki + Promtail
- Indexing: ELK indexes keys and the content of each key;
Loki indexes only labels.
- Query language: ELK uses Query DSL or Lucene QL; Loki uses LogQL.
- Log pipeline: ELK needs more components
(Fluentd/Fluentbit -> Logstash -> Elasticsearch, or
Filebeat -> Logstash -> Elasticsearch);
Loki needs fewer (Promtail -> Loki).
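To make LogQL concrete, a small sketch querying Loki's HTTP API; the service URL and label selector are illustrative:

```python
import requests

LOKI = "http://loki:3100"  # assumed in-cluster service

# LogQL: select streams by label, then filter lines containing "error".
resp = requests.get(
    f"{LOKI}/loki/api/v1/query_range",
    params={"query": '{namespace="kubeflow"} |= "error"', "limit": 100},
)

for stream in resp.json()["data"]["result"]:
    for timestamp, line in stream["values"]:
        print(timestamp, line)
```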
Maintaining and
updating
CI/CD
CI/CD pipelines
Besides black art, there is only automation and mechanization.
Federico García Lorca (1898–1936), Spanish poet and playwright
Source: AWS
Continuous Integration
Do not forget about tests at each step.
Automate deployment to the development environment
to validate your application quickly.
Define a strategy for branch names and how they
affect deployment.
Continuous Deployment
Deploy to development after a merge to branch x.
Evaluate metrics, for example whether the accuracy of the
ML model is acceptable, and then trigger the next step in
the CI/CD pipeline.
Example pipeline:
merge to branch dev -> build an image -> deploy to
dev -> test, validate -> deploy to qa -> test, validate
-> deploy to prod
Infrastructure as Code
Infrastructure as Code helps us maintain the
platform in the Kubernetes world.
We can simply update, move or redeploy the platform
and be sure that everything will be as we expect.
Code + Git is better than manual actions.
Infrastructure as Code
Spotify has open-sourced their Terraform module for
running Kubeflow on Google Kubernetes Engine:
https://github.com/spotify/terraform-gke-kubeflow-cluster
You can easily automate the installation in any
Kubernetes cluster.
How to work with a modern
Data Science platform?
SLI, SLA & SLO
• SLA - Service Level Agreements
• SLI - Service Level Indicators
• SLO - Service Level Objectives
SLI, SLA & SLO
An SLA is a contract in which the service provider makes
promises to customers about service availability and performance.
An SLO is a goal that the service provider wants to reach.
An SLI is a measurement the service provider uses to track
that goal.
Measuring Service Risk
Time-based availability is traditionally calculated as the
proportion of system uptime.
Aggregate availability is a yield-based metric calculated
over a rolling window (e.g., the proportion of successful
requests over a one-day window), as in the sketch below.
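A quick worked example of the two calculations; all numbers are illustrative:

```python
# Time-based availability: uptime / (uptime + downtime).
uptime_min, downtime_min = 43_157, 43  # one 43-minute outage in a 30-day month
time_based = uptime_min / (uptime_min + downtime_min)
print(f"time-based availability: {time_based:.4%}")  # ~99.90%

# Aggregate (request-based) availability: successful / total requests
# over a rolling window, e.g. one day.
successful, total = 999_001, 1_000_000
aggregate = successful / total
print(f"aggregate availability:  {aggregate:.4%}")   # 99.9001%
```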
Postmortems
Postmortems should be written for all significant incidents,
regardless of whether or not they paged; postmortems for incidents
that did not trigger a page are even more valuable.
Key features:
- real-time collaboration
- an open commenting/annotation system
- email notification
Automate platform
- Trigger action based on metrics
- Deploy automatically
- Do not waste time on repetitive tasks
Automate platform
If a human operator needs to touch your system
during normal operations, you have a bug. The
definition of normal changes as your systems
grow.
- Carla Geisser, Google SRE
Q&A
Thank you
for your attention!
