Kubernetes as data platform

Riga DevOpsDays 2018-09-28
Eric Skoglund, Bonnier News
Lars Albertsson, Mimeria

Scoping the platform
Brand Scope
Data Scope
➔ Behavioral Data
➔ Technical Data
No Content Data

Cloud Selection
The Pragmatic Choice
➔ Known to people in the dev teams
➔ New base platform for all other applications within Bonnier News

Use Case Driven Development
➔ Use cases drive the development of the platform
➔ Focus on value and quality, not on slurping in all the data in the company
➔ Start with simple use cases!

Use Case Driven Development
[Cycle diagram: Find use case that provides value → New data into the platform → Evolve the platform based on requirements]

Data-centric innovation
● Need data from teams
○ willing?
○ backlog?
○ collected?
○ useful?
○ extraction?
○ data governance?
○ history?

A collaboration paradigm
[Diagram: Stream storage; Data lake]
Data democratised

Onboard driven by use case
[Diagram: Data lake]

Data platform == collaboration platform
[Diagram: Data lake]

Data platform overview
[Diagram: Online services (Service, Service); Offline data platform (Data lake, Cold store, Batch processing)]

[Diagram build: Online services (Service, Service); Offline data platform (Data lake, Cold store, Batch processing); new labels: Dataset, Job]

[Diagram build: Online services (Service, Service); Offline data platform (Data lake, Cold store, Batch processing, Dataset, Job); new labels: Pipeline, Workflow orchestration]

[Diagram build: Online services (Service, Service, Service); Offline data platform (Data lake, Cold store, Batch processing, Dataset, Job, Pipeline); new labels: Data feature, Internal services]

Life of a change, batch pipelines
● My pipeline, version 2!
○ Dual datasets during transition
● Run downstream parallel pipelines
○ Cheap
○ Low risk
○ Easy rollback
● Easy to test end-to-end
○ Upstream team can do the change
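One way to picture "dual datasets during transition" is to put the pipeline version into the dataset path, so version 1 and version 2 of the same dataset can coexist and downstream pipelines can pick either. This is a minimal sketch; the `/lake` root, the path layout and the function names are assumptions for illustration, not taken from the talk.

```python
from datetime import date


def dataset_path(name: str, version: int, day: date, root: str = "/lake") -> str:
    """Return the (hypothetical) location of one daily partition of a dataset version."""
    return f"{root}/{name}/v{version}/{day:%Y/%m/%d}"


def transition_outputs(name: str, old_version: int, new_version: int, day: date) -> list:
    """During a transition both versions are produced side by side."""
    return [dataset_path(name, v, day) for v in (old_version, new_version)]


# Downstream pipelines can run against v1 and v2 in parallel and compare results.
print(transition_outputs("clicks", 1, 2, date(2018, 9, 28)))
```

Rolling back then amounts to pointing consumers at the old version again; nothing has to be rewritten in place.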

Egress target change
● Need output in different storage!
○ Adding egress target is easy
○ Egress target backfill is easy
● Facilitates cost limitation
○ Partially aggregate → BigQuery / Redshift
○ Limited retention in egress storage
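The cost-limitation bullets can be sketched as two small steps before egress: aggregate raw events down to counts, then keep only rows inside the retention window of the egress store. The event fields (`day`, `ad_id`) and the function names are hypothetical; a real pipeline would do this in its batch engine rather than in plain Python.

```python
from collections import Counter
from datetime import date, timedelta


def aggregate_daily(events):
    """Collapse raw events into one row per (day, ad_id) with a view count."""
    counts = Counter((e["day"], e["ad_id"]) for e in events)
    return [
        {"day": day, "ad_id": ad_id, "views": n}
        for (day, ad_id), n in sorted(counts.items())
    ]


def within_retention(rows, today: date, retention_days: int):
    """Keep only rows young enough for the (more expensive) egress storage."""
    cutoff = today - timedelta(days=retention_days)
    return [r for r in rows if r["day"] >= cutoff]
```

Because the full history stays in the data lake, the egress store can hold a short window and still be backfilled later if the window needs to grow.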

Life of an error, batch pipelines
● My dataset, bad version!
1. Revert serving datasets to the old version
2. Fix bug
3. Remove faulty datasets
4. Backfill is automatic (Luigi)
Done!
● Low cost of error
○ Reactive QA
○ Production environment sufficient
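Why backfill can be automatic: a workflow orchestrator such as Luigi treats a task as done when its output exists, so removing faulty datasets simply makes those dates "not done" again. The sketch below models that idea without Luigi itself; `completed` stands in for "the output exists", and the function name is an assumption.

```python
from datetime import date, timedelta


def missing_days(completed: set, start: date, end: date) -> list:
    """Return the days in [start, end] whose dataset is absent and must be rebuilt."""
    span = (end - start).days + 1
    days = (start + timedelta(days=i) for i in range(span))
    return [d for d in days if d not in completed]


# Three days were built; the faulty one is removed after the bug fix.
completed = {date(2018, 9, 26), date(2018, 9, 27), date(2018, 9, 28)}
completed.discard(date(2018, 9, 27))
print(missing_days(completed, date(2018, 9, 26), date(2018, 9, 28)))
```

The next scheduled run picks up exactly the removed dates, which is what keeps the cost of an error low.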

Deployment example, on-premise
[Diagram: source repo (Luigi DSL, jars, config); my-pipe-7.tar.gz; Luigi daemon; workers]
Standard deployment artifact, standard artifact store
> pip install my-pipe-7.tar.gz
All that a pipeline needs, installed atomically
Redundant cron schedule, higher frequency
10 * * * * luigi --module mymodule MyDaily
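A redundant, higher-frequency cron schedule is only safe if each invocation is idempotent: a run that finds its output already present should do nothing. The following sketch shows that contract with a plain dictionary as the output store; the names `run_if_needed` and `store` are hypothetical, and Luigi provides the equivalent check through task outputs.

```python
def run_if_needed(store: dict, key: str, build) -> str:
    """Build the dataset for `key` unless it already exists; report what happened."""
    if key in store:
        return "skipped"  # an earlier cron invocation already produced it
    store[key] = build()
    return "built"


store = {}
first = run_if_needed(store, "MyDaily/2018-09-28", lambda: "rows")
second = run_if_needed(store, "MyDaily/2018-09-28", lambda: "rows")
print(first, second)
```

With this property, a failed or missed run is repaired by the next trigger without any manual step.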

Deployment example, cloud native
[Diagram: source; my-pipe:7; Luigi daemon; workers; S3 / GCS; Dataproc / EMR]
Docker image, Docker registry
Redundant cron schedule, higher frequency
kind: CronJob
spec:
  schedule: "10 * * * *"
  command: "luigi --module mymodule MyDaily"
Deployment, one cluster less
[Diagram: source; my-pipe:7; Luigi daemon; workers; S3 / GCS]
Docker image, Docker registry
Worker runs: spark-submit --master=local
Redundant cron schedule, higher frequency
kind: CronJob
spec:
  schedule: "10 * * * *"
Continuous deployment
[Diagram: mono-repo; PR build, affected CI tests; master branch, pipeline tests, doc build; mymodule/mypipe:revtag; Openshift registry; Luigi daemon; workers; S3]
Worker runs: spark-submit --master=local
kind: CronJob
spec:
  schedule: "10 * * * *"
Some pipelines are straightforward

GDPR
Article 17.
“The data subject shall have the right to obtain from the controller the erasure of personal data concerning him or her without undue delay and the controller shall have the obligation to erase personal data without undue delay where one of the following grounds applies:”
➔ the personal data are no longer necessary in relation to the purposes for which they were collected or otherwise processed - Data Retention
➔ the data subject withdraws consent on which the processing is based - Data Deletion Requests

GDPR
{
  id: ….
  pii: [...]
}
Create key for ID → Encrypt personal data with key
GDPR - Retention
{
  id: ….
  pii: [...]
}
Create key for ID → Encrypt personal data with key
➔ Each dataset has a retention time set by the owners of the data
➔ Create new keys every 30 days
➔ Destroy keys older than the retention time
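The retention rule can be read as date arithmetic on key periods: every 30 days a new key period starts, and a key may be destroyed once the whole period it covers is older than the dataset's retention time. This is a sketch under that reading; the period numbering (days since ordinal zero divided by 30) and the function names are assumptions.

```python
from datetime import date, timedelta

ROTATION_DAYS = 30  # from the slide: new keys every 30 days


def key_period(day: date) -> int:
    """Number of the 30-day key period that a given day falls into."""
    return day.toordinal() // ROTATION_DAYS


def period_end(period: int) -> date:
    """Last day covered by a key period."""
    return date.fromordinal((period + 1) * ROTATION_DAYS - 1)


def expired_periods(periods, today: date, retention_days: int) -> list:
    """Key periods whose newest data is already older than the retention time."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(p for p in periods if period_end(p) < cutoff)
```

Destroying the keys for the returned periods enforces retention across all copies of the data at once.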

GDPR - Right to be forgotten
List of users that have requested deletion → Find keys for those users → Destroy keys
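The three boxes above translate into a short routine over the key store. The sketch assumes keys are indexed by `(user_id, period)` pairs held in a dictionary; that layout and the function name are hypothetical.

```python
def forget_users(key_store: dict, deletion_requests: set) -> list:
    """Destroy every key belonging to a user who requested deletion."""
    destroyed = []
    for key_id in list(key_store):  # copy the ids so we can delete while iterating
        user_id, _period = key_id
        if user_id in deletion_requests:
            del key_store[key_id]
            destroyed.append(key_id)
    return sorted(destroyed)


key_store = {("u1", 1): b"k1", ("u1", 2): b"k2", ("u2", 1): b"k3"}
print(forget_users(key_store, {"u1"}))
```

After the call, data encrypted for the forgotten users can no longer be decrypted, while other users' data is unaffected.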

Use Cases in Use
➔ Machine Learning
◆ Built a system that tries to predict if a visitor will watch an ad in a video or not
➔ Creating Reports
◆ Daily reporting data for ad team
◆ Weekly report of ad viewing data for site team
➔ GDPR Registry Extract
◆ Collect data from multiple sources
◆ Merge the data
◆ Send data to be viewed by the user

Lessons Learned
Cloud selection is influenced by data location
Most data for the use cases we started with was on Google Cloud Storage / BigQuery, so exfiltrating it incurred extra development time and cost.
Kubernetes?
Same platform as other teams + great support from infrastructure platform team.
No Spark cluster maintenance, tweaking, debugging.
Autoscaling works, but poses some challenges for batch jobs.

Summary
Use case driven development == Short Time to Production
First pipeline in 3 weeks
Small team 2-4 People
Keep it simple
10-15 Pipelines
