Top 10 data engineering mistakes

Lars Albertsson
Founder & Data Engineer
www.mapflat.com
Berlin Buzzwords, 2018-06-11
www.mapflat.com, www.mimeria.com
Why this talk?
● Sharing what I have learnt. Please do the same.
● Mistakes mentioned are directly or indirectly experienced.
● "Do this" / "Don't do that" == "Based on what I have seen, I have come to the conclusion that ..."
Learning practices → Mastering practices → Bending practices
Who’s talking?
● KTH-PDC Center for High Performance Computing (MSc thesis)
● Swedish Institute of Computer Science (distributed system test+debug tools)
● Sun Microsystems (building very large machines)
● Google (Hangouts, productivity)
● Recorded Future (natural language processing startup)
● Cinnober Financial Tech. (trading systems)
● Spotify (data processing & modelling)
● Schibsted Media Group (data processing & modelling)
● Mapflat - independent data engineering consultant
○ ~15 clients: Spotify, 3 banks, 3 conglomerates, 4 startups, 5 *tech, misc
● Founder @ Mimeria - data-value-as-a-service
● What is the goal?
Mistake == obstacle towards end goal
Blueprint
[Figure: blueprint of a data platform. Services and DBs feed the data lake (cluster storage, with a cold store) through ingress / import; pipelines of jobs and datasets process it; egress / export delivers to services, DBs and business intelligence.]
A first pipeline
● Input: Page views from www (hourly)
● Output: key performance indicators (daily)
○ Number of orders
○ Web session length
Cron schedule: 10 * * * *
#!/bin/bash
# Runs hourly; the date parts select the input and output partitions.
y=$(date +%Y)
m=$(date +%m)
d=$(date +%d)
h=$(date +%H)  # %H is the hour (00-23); %h would be the abbreviated month name
page_view="/cold/red/page/v1/$y/$m/$d/$h"
order="/prod/red/order/v1/$y/$m/$d"
order_hour="$order/$h"
session="/prod/red/session/v1/$y/$m/$d"
kpi="/prod/green/kpi/v1/$y/$m/$d"
ss="spark-submit --master spark://spark.acme.ai:7077"
# Every hour: extract orders from this hour's page views.
$ss --class ai.acme.FilterOrders kpi.jar $page_view $order_hour
# Last run of the day: form sessions and compute the daily KPIs.
if [ "$h" = 23 ]; then
  $ss --class ai.acme.FormSession kpi.jar $order $session
  $ss --class ai.acme.Kpi kpi.jar $session $kpi
fi
A first pipeline
Cron schedule: 10 * * * * (the hourly script, unchanged)
Cron schedule: 15 1 * * * (a second script, run daily at 01:15)
#!/bin/bash
y=$(date +%Y)
m=$(date +%m)
d=$(date +%d)
order="/prod/red/order/v1/$y/$m/$d"
campaign="/cold/green/campaign/v1/$y/$m/$d"
impact="/cold/green/impact/v1/$y/$m/$d"
session="/prod/red/session/v1/$y/$m/$d"
ss="spark-submit --master spark://spark.acme.ai:7077"
# Combine orders, campaigns and sessions into a campaign impact dataset.
$ss --class ai.acme.CampaignImpact kpi.jar $order $campaign $session $impact
● Works great day 1,2,3,4
A brittle pipeline
● Works great day 1,2,3,4
● Day 5: FormSession not done by 1:15 → job fails
A corrupt pipeline
● Works great day 1,2,3,4
● Day 5: FormSession not done by 1:15 → silent data corruption
● Day 8: FilterOrders fails for hour 6
● Day 13: Page views for hour 19 only partially done at 20:10
Mistake 1: No/weak workflow orchestration
● A workflow orchestrator manages the dependencies
● Weak orchestrators use graphical / XML dependency language
○ Not expressive enough
● Hadoop vendors + cloud vendors ship weak orchestrators
○ Except Google
Remedy 1: Python-DSL orchestrators
Luigi
● Simple, stable
● Easy to run
● Slim DataOps
● Does one thing well
Airflow
● Good momentum
○ Google managed service
● More moving parts
● Complex DataOps
● Includes scheduler & monitoring UI
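What these orchestrators add over cron can be sketched in a few lines of plain Python. This is not Luigi's or Airflow's API: `Task`, `build`, the dataset names and the in-memory `done` set are assumptions made for illustration. Each task declares what it requires, and a task runs only when its output is missing and its dependencies are complete, so nothing depends on clock times.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Task:
    """One unit of work producing one named dataset (hypothetical, not a real orchestrator API)."""
    output: str
    requires: List["Task"] = field(default_factory=list)
    run: Callable[[], None] = lambda: None


def build(task: Task, done: Set[str], log: List[str]) -> None:
    """Build dependencies first, then the task itself; skip anything already complete."""
    if task.output in done:
        return
    for dep in task.requires:
        build(dep, done, log)
    task.run()
    done.add(task.output)  # marked complete only after run() returned without error
    log.append(task.output)


# The datasets from the cron example, expressed as dependencies instead of schedules.
orders = Task("order/2018-06-11")
campaign = Task("campaign/2018-06-11")
sessions = Task("session/2018-06-11", requires=[orders])
kpi = Task("kpi/2018-06-11", requires=[sessions])
impact = Task("impact/2018-06-11", requires=[orders, campaign, sessions])

done: Set[str] = set()
log: List[str] = []
build(kpi, done, log)
build(impact, done, log)  # orders and sessions are already complete and are not rebuilt
```

In this sketch, if FormSession is late or fails, CampaignImpact simply does not start; rerunning `build` later picks up where it stopped.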
Hadoop has a SQL interface, right?
● Let's pretend it is an RDBMS
[Figure: two services run transactions against Hadoop (the data lake)]
Hadoop for everything
● Or a NoSQL database
[Figure: three services run queries against Hadoop (the data lake)]
Mistake 2: New technology, same old patterns
● Distributed technology is single-purpose
○ RDBMS == multi-purpose
● Hadoop + ecosystem = offline
Big data components + small data patterns → worst of both worlds
Remedy 2: Work patterns >> technology
● New technology is not the key to success
● Benefits of big data == collaboration and pipelines
Data democratisation
Offline, forgiving environment → iteration speed
[Figure labels: Stream storage, Data lake]
We need all the data
● Wait for all service nodes to send data
○ Then start processing
● For how long?
○ Dashboards?
○ Reports?
[Figure: six service nodes]
We need it now
● Start right away
● Hope we have most data
● If we get x% more, rerun
[Figure: six service nodes]
Let's improve the events
● We have an IP → geo service
● Decorate incoming events!
● Works great until
○ service fails
○ has low quality geo data
○ has a bug
○ we overload it
Lazy data retrieval
● DB has data. We wants it.
● Just read it from production?
● Works until
○ You overload the database
○ You need to rerun the job later
Mistake 3: Deviate from functional principles
● Immutability
● Idempotency
○ Jobs must be able to rerun without harmful effects
● Reproducibility
○ Necessary for data science
○ Many exceptions - be conscious
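One common way to get immutable outputs and idempotent jobs is to publish each dataset exactly once, through a temporary file and a rename. The sketch below uses only the Python standard library on a local file system; `write_dataset` and the paths are hypothetical names, and a real data lake would use its own storage API.

```python
import os
import tempfile


def write_dataset(path: str, lines: list) -> bool:
    """Write a dataset once. Returns False if it already exists, so a rerun is a no-op."""
    if os.path.exists(path):
        return False  # immutability: never overwrite a published dataset
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    # Write to a temporary file in the same directory, then rename it into place,
    # so readers never see a half-written dataset at the final path.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.writelines(line + "\n" for line in lines)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
    return True


base = tempfile.mkdtemp()
target = os.path.join(base, "order", "2018-06-11", "part-0.txt")
first = write_dataset(target, ["order-1", "order-2"])
second = write_dataset(target, ["different content"])  # rerun: no harmful effect
```

The sketch does not guard against two writers racing for the same path; an orchestrator normally prevents that.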
Knock on the back door
● Let's dump the DB to lake
● Spark/Sqoop can speak JDBC
● The more mappers, the merrier?
[Figure: data lake reading straight from the service's DB ("Our preciousss")]
Knock on the front door
● We can dump from the REST interface
○ Unavailability by elephant
[Figure: data lake reading through the service in front of the DB ("Our preciousss")]
Here is a new ML model for you
● New ML algorithm is much better
● Let's broadcast new recommendation indexes
● Why is the Atlantic link saturated?
[Figure: data centres DC1, DC2 and DC3. Credits: Tuplejump]
Mistake 4: Weak online / offline separation
Risk = probability * impact
● Online: 10000s of customers, imprecise feedback
○ Need low probability => Proactive prevention => Low ROI
● Offline: 10s of employees, precise feedback
○ Ok with medium probability => Reactive repair => High ROI
Remedy 4: Careful handover
[Figure: orders are copied from the online side via replication / backup into the offline environment, which produces a fraud model for the online fraud service. Standard procedures apply online, lightweight procedures offline, with a careful handover at each boundary.]
Bring on the clusters
● "Our clients have petabytes, and they want to do machine learning." - Ali Ghodsi, Databricks CEO
● Why not use scalable components?
○ In case we get 100M users?
[Figure labels: 100s TBs/day, KBs/day]
Distributed processing is difficult
● Data skew
● More parameters to tune
● Logs are distributed
● Problem not parallelisable
● More sysadmin
● Errors not reproducible
● Serialisation problems
● Noisy neighbours
● Excessive shuffling
● Limited speedup
● Mysteriously slow nodes
● Immature frameworks
● Difficult to reuse old code
Mistake 5: Premature scaling
● Most companies have GBs, not PBs
○ Limited by human users
● Largest cloud instances ~4 TB memory.
○ Your data fits
○ Or: One time period of your data fits
Primary value of "Big Data" is not data size, but data sharing and data agility.
Avoid clusters until necessary - use single nodes. E.g. "spark-submit --master=local"
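A back-of-envelope check makes the "your data fits" argument concrete. The ~4 TB figure comes from the slide; the daily volumes and the in-memory overhead factor of 3 below are made-up assumptions, and `fits_in_memory` is a hypothetical helper.

```python
TB = 10 ** 12
GB = 10 ** 9


def fits_in_memory(bytes_per_day: int, days: int,
                   memory_bytes: int = 4 * TB, overhead: float = 3.0) -> bool:
    """True if `days` of data, inflated by an assumed in-memory overhead factor, fits on one node."""
    return bytes_per_day * days * overhead <= memory_bytes


one_year_small = fits_in_memory(bytes_per_day=2 * GB, days=365)    # about 2.2 TB with overhead
one_day_large = fits_in_memory(bytes_per_day=500 * GB, days=1)     # about 1.5 TB with overhead
one_year_large = fits_in_memory(bytes_per_day=500 * GB, days=365)  # hundreds of TB: does not fit
```

In the second case only one time period fits, which is the slide's fallback: process one period at a time on a single node.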
Slower by the day
● What if we want top list of last 365 days, every day?
● Why is this team always using 20% of the cluster, every day?
Mistake 6: Read all the data
● Read it all: Order + User → CountryTopList
● Incremental counting: Order + User → OrderCountryCount → CountryTopList
Strive to parallelise workflows, not jobs
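The incremental variant can be sketched as two small jobs. A daily job reduces one day of orders to a per-country count, and the top list sums those small aggregates instead of re-reading a year of raw orders. The dataset names follow the slide; the function signatures, the in-memory dictionaries and the sample data are assumptions for illustration.

```python
from collections import Counter
from datetime import date, timedelta
from typing import Dict, List, Tuple

Order = Tuple[str, str]  # (item id, user id)


def order_country_count(orders: List[Order], user_country: Dict[str, str]) -> Counter:
    """Daily job: join one day's orders with users and count orders per country."""
    return Counter(user_country[user] for _, user in orders if user in user_country)


def country_top_list(daily_counts: Dict[date, Counter], end: date, days: int, n: int):
    """Top list over a window, summing daily aggregates rather than raw orders."""
    total = Counter()
    for offset in range(days):
        total += daily_counts.get(end - timedelta(days=offset), Counter())
    return total.most_common(n)


users = {"u1": "SE", "u2": "DE", "u3": "SE"}
daily = {
    date(2018, 6, 10): order_country_count([("i1", "u1"), ("i2", "u2")], users),
    date(2018, 6, 11): order_country_count([("i3", "u3"), ("i4", "u1"), ("i5", "u2")], users),
}
top = country_top_list(daily, end=date(2018, 6, 11), days=365, n=2)
```

Each day adds one small aggregate, so the daily cost of the top list stays roughly constant instead of growing with the raw history.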
Context, streaming
● Data lake has a real-time cousin: the unified log
● Stream storage bus
○ Kafka / cloud service
● Pipelines of stream jobs
[Figure: apps (Ads, Search, Feed) write to streams; jobs read streams and write new streams, which feed further jobs, the data lake and business intelligence]
We need real-time everything
● Result now >> result in two hours?
● Today's numbers >> yesterday's?
● Fraud analysis must be quick?
● The demos look great!
"Because life doesn't happen in batch."
- Audi @ Kafka Summit
Streaming operations
● Streaming ops is difficult
○ Schema change needed here
○ Three days of invalid output here
[Figure: pipeline of streams and jobs; each "here" points at a spot in the pipeline]
Invalid output, batch operations
1. Revert serving datasets to the old version
2. Fix bug
3. Remove faulty datasets
4. Done!
Backfill is automatic (Luigi)
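Why backfill can be automatic: if every dataset is addressed by date, the orchestrator only has to find the dates in a lookback window whose output is missing and build them. The sketch below is in the spirit of Luigi's `RangeDaily`, not its implementation; `missing_dates` and the sample state are assumptions.

```python
from datetime import date, timedelta
from typing import Callable, List


def missing_dates(today: date, days_back: int, exists: Callable[[date], bool]) -> List[date]:
    """Return the dates in the lookback window whose output dataset is missing, oldest first."""
    window = [today - timedelta(days=n) for n in range(days_back, 0, -1)]
    return [day for day in window if not exists(day)]


# Hypothetical state after step 3: three days of faulty datasets have been removed.
removed = {date(2018, 6, 5), date(2018, 6, 6), date(2018, 6, 7)}
present = {date(2018, 6, d) for d in range(1, 11)} - removed
to_rebuild = missing_dates(date(2018, 6, 11), days_back=10, exists=lambda day: day in present)
```

Removing a faulty dataset is therefore all it takes to schedule its rebuild.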
Streaming operations
● Reprocessing in Kafka Streams
○ Works for a single job, not pipeline. :-(
[Figure: pipeline of streams and jobs]
Mistake 7: Premature streaming
● Stream processing reasonable if one of:
○ Strong use case - worth the cost
○ Single job pipeline
○ Low data quality ok
● Stream tooling is getting better
○ Thanks Confluent, Google!
"Because humans operate on a batch time scale."
- Me @ Berlin Buzzwords.
We see a discrepancy
● These numbers don't add up
○ Have you seen anything suspicious in your pipeline?
○ How many records are invalid?
○ Do these datasets agree?
○ What is your data quality, BTW?
● "Who put 'content-type: bullshit' in these messages?"
"We are the sewer of the company. All garbage dumped upstream ends up here."
Data quality dimensions
● Timeliness
○ E.g. the customer engagement report was produced at the expected time
● Correctness
○ The numbers in the reports were calculated correctly
● Completeness
○ The report includes information on all customers, using all information from the whole time period
● Consistency
○ The customer summaries are all based on the same time period
Mistake 8: Crunching in the dark
● Data products are organic
● Measure odd and unexpected values
○ Spark, Flink, Beam: counters
○ Difficult with SQL processing
● Measure consistency
○ Within records
○ Between records
○ Between datasets
● Be proactive, display in office!
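The three consistency levels can be measured with plain counters while the data is processed. This is a minimal sketch with hypothetical record fields and counter names; in Spark, Flink or Beam the same counts would go through the framework's own counter mechanism.

```python
from collections import Counter
from typing import Dict, List


def quality_counters(orders: List[Dict], users: List[Dict]) -> Counter:
    """Count inconsistent records instead of failing on them or dropping them silently."""
    counters = Counter()
    known_users = {user["id"] for user in users}
    seen_orders = set()
    for order in orders:
        counters["orders"] += 1
        if order["total"] != order["quantity"] * order["unit_price"]:
            counters["total-mismatch"] += 1       # within a record
        if order["id"] in seen_orders:
            counters["duplicate-order-id"] += 1   # between records
        seen_orders.add(order["id"])
        if order["user_id"] not in known_users:
            counters["order-no-user"] += 1        # between datasets
    return counters


counters = quality_counters(
    orders=[
        {"id": 1, "user_id": "u1", "quantity": 2, "unit_price": 5, "total": 10},
        {"id": 1, "user_id": "u1", "quantity": 2, "unit_price": 5, "total": 10},
        {"id": 2, "user_id": "u9", "quantity": 1, "unit_price": 5, "total": 7},
    ],
    users=[{"id": "u1"}],
)
no_user_ratio = counters["order-no-user"] / counters["orders"]
```

Ratios such as `no_user_ratio`, tracked per day, are the kind of number that can go on an office dashboard.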
Something broke our pipeline
● "Invalid UTF-8 encoding"
○ It worked yesterday!
● "Field not found"
○ Who changed the schema?
Pipeline fossil
● "Yes, we should change that field"
○ But it is too costly
● "We built a new pipeline in a few weeks"
○ And removed the old after 18 months
Mistake 9: Weak end-to-end agility
Data agility measure: Time from client logging field → pipeline → use in feature / BI
● Siloed company: ~6 months
● Agile company: 1 month
● Coordinated company: days
Remedy: End-to-end testing. Rare, often in conflict with company culture.
Users everywhere
Functional architecture:
● Event-oriented - append only
● Immutability
● At-least-once semantics
● Reproducibility
○ Through 1000s of copies
● Redundancy
GDPR came along
- Please delete me?
- What data have you got on me?
- Please correct this data
- Hold on a second...
Negative value heterogeneity
● We use Avro everywhere, why is this one in Parquet?
○ Beam did not support Parquet. :-(
● Why do we have 25 time formats?
○ ISO 8601, UTC assumed
○ ISO 8601 + timezone
○ Millis since epoch, UTC
○ Nanos since epoch, UTC
○ Millis since epoch, user local time
○ ...
○ ...
○ Float of seconds since epoch, as string. WTF?!?
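One remedy is to agree on a single representation and convert at ingestion. The sketch below normalises four of the formats listed above to ISO 8601 in UTC using the Python standard library; `to_utc_iso` and its format tags are hypothetical names chosen for illustration.

```python
from datetime import datetime, timezone


def to_utc_iso(value, fmt: str) -> str:
    """Normalise one timestamp to ISO 8601 in UTC. The fmt tags are hypothetical labels."""
    if fmt == "iso_tz":                # ISO 8601 + timezone
        parsed = datetime.fromisoformat(value)
    elif fmt == "iso_utc_assumed":     # ISO 8601, UTC assumed
        parsed = datetime.fromisoformat(value).replace(tzinfo=timezone.utc)
    elif fmt == "epoch_millis":        # millis since epoch, UTC
        parsed = datetime.fromtimestamp(value / 1000, tz=timezone.utc)
    elif fmt == "epoch_seconds_str":   # float of seconds since epoch, as string
        parsed = datetime.fromtimestamp(float(value), tz=timezone.utc)
    else:
        raise ValueError(f"unknown timestamp format: {fmt}")
    return parsed.astimezone(timezone.utc).isoformat()


normalised = [
    to_utc_iso("2018-06-11T12:00:00+02:00", "iso_tz"),
    to_utc_iso("2018-06-11T10:00:00", "iso_utc_assumed"),
    to_utc_iso(1528711200000, "epoch_millis"),
    to_utc_iso("1528711200.0", "epoch_seconds_str"),
]
```

"Millis since epoch, user local time" is deliberately left out: it cannot be converted unless the user's timezone was recorded as well, which is the kind of problem that is cheap to prevent early and costly to repair late.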
Mistake 10: Late governance
Data governance for techies:
Proactive things you do in order to avoid later WTF.
GDPR compliance
● Erasure
● Anonymisation & pseudonymisation
● Enforced retention
● Access limitations
Security
● Blast radius
● Disaster recovery
Development speed
● Technical processes
● Domain definitions
● Schema evolution
● End-to-end testing
Data democracy
● Data lake rules
● Data quality communication
Financial compliance
● Reproducibility
● Data completeness
● Data consistency
Common themes
● Technology over principles
● Complexity over business value
● Simplicity
● Driven by value
● Scope down
● Hype xor profit
[Figure: the ten mistakes arranged around these themes: No workflow orchestration; New tech, old patterns; Functional principles; Offline / online; Premature scaling; Read all the data; Premature streaming; Crunching in the dark; End-to-end agility; Late governance]
Related work
● https://tinyurl.com/mapflat-10-stumble
● Avoiding big data anti-patterns - https://youtu.be/t63SF2UkD0A
● https://www.slideshare.net/JesseAnderson/the-five-dysfunctions-of-a-data-engineering-team
● https://tinyurl.com/mapflat-reading-list
● http://www.mapflat.com/presentations/
Bonus slides
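Dynamic DAGs
from datetime import timedelta
from luigi import DateParameter
from luigi.contrib.spark import SparkSubmitTask

# OrderDaily and CredibilityScore are tasks defined elsewhere in the pipeline.
class FraudCandidate(SparkSubmitTask):
    date = DateParameter()

    def requires(self):
        return [OrderDaily(date=self.date), self._credibility()]

    def _credibility(self):
        # Use the most recent CredibilityScore that already exists, if any.
        max_age_days = 3
        for lookback in range(max_age_days):
            candidate = CredibilityScore(self.date - timedelta(days=lookback))
            if candidate.output().exists():
                return candidate
        # None exists yet: depend on the oldest acceptable one.
        return CredibilityScore(self.date - timedelta(days=max_age_days))
[Figure: FraudCandidate depends on Order and CredibilityScore]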
Luigi or Airflow?
Luigi & Kubernetes:
● 1 hour to production
myjob.yaml:
# batch/v1beta1 on clusters older than Kubernetes 1.21
apiVersion: batch/v1
kind: CronJob
metadata:
  name: myjob
spec:
  schedule: "17 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: myjob
            image: myregistry/mypipeline:latest
            command: ["luigi"]
            args: ["--module", "mypipeline",
                   "RangeDaily", "--of", "MyJob",
                   "--days-back", "14"]
Airflow & Kubernetes:
● AIRFLOW-1314, created 2017-06-16
○ 9/12 sub-tasks resolved
○ 6 PRs
○ 28 commits
○ ~30 files touched
○ ~2000 lines of code
● Airflow monitoring has value, however
Counters
● User-defined
● Technical from framework
○ Execution time
○ Memory consumption
○ Data volumes
○ ...
// Sketch: C[_] stands in for the processing framework's collection type.
case class Order(item: ItemId, userId: UserId)
case class User(id: UserId, country: String)

val orders = read(orderPath)
val users = read(userPath)
// User-defined counter for orders whose user is missing.
val orderNoUserCounter = longAccumulator("order-no-user")

val joined: C[(Order, Option[User])] = orders
  .groupBy(_.userId)
  .leftJoin(users.groupBy(_.id))
  .values

val orderWithUser: C[(Order, User)] = joined
  .flatMap {
    case (order, Some(user)) => Some((order, user))
    case (order, None) =>
      // Count the odd record instead of dropping it silently.
      orderNoUserCounter.add(1)
      None
  }
SQL: Nope
1 of 53

Recommended

Crossing the data divide by
Crossing the data divideCrossing the data divide
Crossing the data divideLars Albertsson
3 views31 slides
Schema management with Scalameta by
Schema management with ScalametaSchema management with Scalameta
Schema management with ScalametaLars Albertsson
7 views50 slides
How to not kill people - Berlin Buzzwords 2023.pdf by
How to not kill people - Berlin Buzzwords 2023.pdfHow to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdfLars Albertsson
34 views51 slides
Data engineering in 10 years.pdf by
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdfLars Albertsson
841 views52 slides
The 7 habits of data effective companies.pdf by
The 7 habits of data effective companies.pdfThe 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdfLars Albertsson
252 views44 slides
Holistic data application quality by
Holistic data application qualityHolistic data application quality
Holistic data application qualityLars Albertsson
396 views30 slides

More Related Content

More from Lars Albertsson

Ai legal and ethics by
Ai   legal and ethicsAi   legal and ethics
Ai legal and ethicsLars Albertsson
200 views6 slides
The right side of speed - learning to shift left by
The right side of speed - learning to shift leftThe right side of speed - learning to shift left
The right side of speed - learning to shift leftLars Albertsson
202 views44 slides
Mortal analytics - Covid-19 and the problem of data quality by
Mortal analytics - Covid-19 and the problem of data qualityMortal analytics - Covid-19 and the problem of data quality
Mortal analytics - Covid-19 and the problem of data qualityLars Albertsson
416 views43 slides
Data ops in practice - Swedish style by
Data ops in practice - Swedish styleData ops in practice - Swedish style
Data ops in practice - Swedish styleLars Albertsson
408 views59 slides
The lean principles of data ops by
The lean principles of data opsThe lean principles of data ops
The lean principles of data opsLars Albertsson
410 views38 slides
Data democratised by
Data democratisedData democratised
Data democratisedLars Albertsson
307 views32 slides

More from Lars Albertsson(20)

The right side of speed - learning to shift left by Lars Albertsson
The right side of speed - learning to shift leftThe right side of speed - learning to shift left
The right side of speed - learning to shift left
Lars Albertsson202 views
Mortal analytics - Covid-19 and the problem of data quality by Lars Albertsson
Mortal analytics - Covid-19 and the problem of data qualityMortal analytics - Covid-19 and the problem of data quality
Mortal analytics - Covid-19 and the problem of data quality
Lars Albertsson416 views
Data ops in practice - Swedish style by Lars Albertsson
Data ops in practice - Swedish styleData ops in practice - Swedish style
Data ops in practice - Swedish style
Lars Albertsson408 views
Eventually, time will kill your data processing by Lars Albertsson
Eventually, time will kill your data processingEventually, time will kill your data processing
Eventually, time will kill your data processing
Lars Albertsson413 views
Taming the reproducibility crisis by Lars Albertsson
Taming the reproducibility crisisTaming the reproducibility crisis
Taming the reproducibility crisis
Lars Albertsson521 views
Eventually, time will kill your data pipeline by Lars Albertsson
Eventually, time will kill your data pipelineEventually, time will kill your data pipeline
Eventually, time will kill your data pipeline
Lars Albertsson936 views
Test strategies for data processing pipelines, v2.0 by Lars Albertsson
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0
Lars Albertsson2.7K views
10 ways to stumble with big data by Lars Albertsson
10 ways to stumble with big data10 ways to stumble with big data
10 ways to stumble with big data
Lars Albertsson1.4K views
Protecting privacy in practice by Lars Albertsson
Protecting privacy in practiceProtecting privacy in practice
Protecting privacy in practice
Lars Albertsson9.8K views
Testing data streaming applications by Lars Albertsson
Testing data streaming applicationsTesting data streaming applications
Testing data streaming applications
Lars Albertsson4K views
A primer on building real time data-driven products by Lars Albertsson
A primer on building real time data-driven productsA primer on building real time data-driven products
A primer on building real time data-driven products
Lars Albertsson951 views

Recently uploaded

Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava... by
Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava...Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava...
Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava...ShapeBlue
101 views17 slides
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T by
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&TCloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&TShapeBlue
112 views34 slides
DRBD Deep Dive - Philipp Reisner - LINBIT by
DRBD Deep Dive - Philipp Reisner - LINBITDRBD Deep Dive - Philipp Reisner - LINBIT
DRBD Deep Dive - Philipp Reisner - LINBITShapeBlue
140 views21 slides
Extending KVM Host HA for Non-NFS Storage - Alex Ivanov - StorPool by
Extending KVM Host HA for Non-NFS Storage -  Alex Ivanov - StorPoolExtending KVM Host HA for Non-NFS Storage -  Alex Ivanov - StorPool
Extending KVM Host HA for Non-NFS Storage - Alex Ivanov - StorPoolShapeBlue
84 views10 slides
The Power of Heat Decarbonisation Plans in the Built Environment by
The Power of Heat Decarbonisation Plans in the Built EnvironmentThe Power of Heat Decarbonisation Plans in the Built Environment
The Power of Heat Decarbonisation Plans in the Built EnvironmentIES VE
69 views20 slides
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti... by
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...ShapeBlue
98 views29 slides

Recently uploaded(20)

Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava... by ShapeBlue
Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava...Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava...
Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava...
ShapeBlue101 views
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T by ShapeBlue
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&TCloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T
ShapeBlue112 views
DRBD Deep Dive - Philipp Reisner - LINBIT by ShapeBlue
DRBD Deep Dive - Philipp Reisner - LINBITDRBD Deep Dive - Philipp Reisner - LINBIT
DRBD Deep Dive - Philipp Reisner - LINBIT
ShapeBlue140 views
Extending KVM Host HA for Non-NFS Storage - Alex Ivanov - StorPool by ShapeBlue
Extending KVM Host HA for Non-NFS Storage -  Alex Ivanov - StorPoolExtending KVM Host HA for Non-NFS Storage -  Alex Ivanov - StorPool
Extending KVM Host HA for Non-NFS Storage - Alex Ivanov - StorPool
ShapeBlue84 views
The Power of Heat Decarbonisation Plans in the Built Environment by IES VE
The Power of Heat Decarbonisation Plans in the Built EnvironmentThe Power of Heat Decarbonisation Plans in the Built Environment
The Power of Heat Decarbonisation Plans in the Built Environment
IES VE69 views
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti... by ShapeBlue
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...
ShapeBlue98 views
Migrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlue by ShapeBlue
Migrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlueMigrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlue
Migrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlue
ShapeBlue176 views
NTGapps NTG LowCode Platform by Mustafa Kuğu
NTGapps NTG LowCode Platform NTGapps NTG LowCode Platform
NTGapps NTG LowCode Platform
Mustafa Kuğu365 views
KVM Security Groups Under the Hood - Wido den Hollander - Your.Online by ShapeBlue
KVM Security Groups Under the Hood - Wido den Hollander - Your.OnlineKVM Security Groups Under the Hood - Wido den Hollander - Your.Online
KVM Security Groups Under the Hood - Wido den Hollander - Your.Online
ShapeBlue181 views
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue by ShapeBlue
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue
ShapeBlue103 views
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas... by Bernd Ruecker
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
Bernd Ruecker50 views
Igniting Next Level Productivity with AI-Infused Data Integration Workflows by Safe Software
Igniting Next Level Productivity with AI-Infused Data Integration Workflows Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Safe Software385 views
Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ... by ShapeBlue
Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ...Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ...
Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ...
ShapeBlue144 views
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or... by ShapeBlue
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...
ShapeBlue158 views
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f... by TrustArc
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...
TrustArc160 views
Future of AR - Facebook Presentation by Rob McCarty
Future of AR - Facebook PresentationFuture of AR - Facebook Presentation
Future of AR - Facebook Presentation
Rob McCarty62 views
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue by ShapeBlue
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlueVNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue
ShapeBlue163 views

Top 10 data engineering mistakes

  • 1. www.mapflat.com Top 10 data engineering mistakes Berlin Buzzwords, 2018-06-11 Lars Albertsson www.mapflat.com, www.mimeria.com 1
  • 2. www.mapflat.com Why this talk? ● Sharing what I have learnt. Please do the same. ● Mistakes mentioned are directly or indirectly experienced. ● "Do this" / "Don't do that" == "Based on what I have seen, I have come to the conclusion that ..." 2 Learning practices Mastering practices Bending practices
  • 3. www.mapflat.com Who’s talking? ● KTH-PDC Center for High Performance Computing (MSc thesis) ● Swedish Institute of Computer Science (distributed system test+debug tools) ● Sun Microsystems (building very large machines) ● Google (Hangouts, productivity) ● Recorded Future (natural language processing startup) ● Cinnober Financial Tech. (trading systems) ● Spotify (data processing & modelling) ● Schibsted Media Group (data processing & modelling) ● Mapflat - independent data engineering consultant ○ ~15 clients: Spotify, 3 banks, 3 conglomerates, 4 startups, 5 *tech, misc ● Founder @ Mimeria - data-value-as-a-service 3
  • 4. www.mapflat.com ● What is the goal? Mistake == obstacle towards end goal 4
  • 6. www.mapflat.com Cron schedule: 10 * * * * A first pipeline 6 #! /bin/bash y=$(date +%Y) m=$(date +%m) d=$(date +%d) h=$(date +%h) page_view="/cold/red/page/v1/$y/$m/$d/$h" order="/prod/red/order/v1/$y/$m/$d" order_hour="$order/$h" session="/prod/red/session/v1/$y/$m/$d" kpi="/prod/green/kpi/v1/$y/$m/$d" ss="spark-submit --master spark://spark.acme.ai:7077" $ss --class ai.acme.FilterOrders kpi.jar $page_view $order_hour if [ $h == 23 ]; then $ss --class ai.acme.FormSession kpi.jar $order $session $ss --class ai.acme.Kpi kpi.jar $session $kpi fi ● Input: Page views from www (hourly) ● Output: key performance indexes (daily) ○ Number of orders ○ Web session length
  • 7. www.mapflat.com Cron schedule: 10 * * * * Cron schedule: 15 1 * * * ● Works great day 1,2,3,4 A first pipeline 7 #! /bin/bash y=$(date +%Y) m=$(date +%m) d=$(date +%d) h=$(date +%h) page_view="/cold/red/page/v1/$y/$m/$d/$h" order="/prod/red/order/v1/$y/$m/$d" order_hour="$order/$h" session="/prod/red/session/v1/$y/$m/$d" kpi="/prod/green/kpi/v1/$y/$m/$d" ss="spark-submit --master spark://spark.acme.ai:7077" $ss --class ai.acme.FilterOrders kpi.jar $page_view $order_hour if [ $h == 23 ]; then $ss --class ai.acme.FormSession kpi.jar $order $session $ss --class ai.acme.Kpi kpi.jar $session $kpi fi #! /bin/bash y=$(date +%Y) m=$(date +%m) d=$(date +%d) order="/prod/red/order/v1/$y/$m/$d" campaign="/cold/green/campaign/v1/$y/$m/$d" impact="/cold/green/impact/v1/$y/$m/$d" session="/prod/red/session/v1/$y/$m/$d" ss="spark-submit --master spark://spark.acme.ai:7077" $ss --class ai.acme.CampaignImpact kpi.jar $order $campaign $session $impact
  • 8. www.mapflat.com Cron schedule: 10 * * * * Cron schedule: 15 1 * * * ● Works great day 1,2,3,4 ● Day 5: FormSession not done by 1:15 Job fails A brittle pipeline 8 #! /bin/bash y=$(date +%Y) m=$(date +%m) d=$(date +%d) h=$(date +%h) page_view="/cold/red/page/v1/$y/$m/$d/$h" order="/prod/red/order/v1/$y/$m/$d" order_hour="$order/$h" session="/prod/red/session/v1/$y/$m/$d" kpi="/prod/green/kpi/v1/$y/$m/$d" ss="spark-submit --master spark://spark.acme.ai:7077" $ss --class ai.acme.FilterOrders kpi.jar $page_view $order_hour if [ $h == 23 ]; then $ss --class ai.acme.FormSession kpi.jar $order $session $ss --class ai.acme.Kpi kpi.jar $session $kpi fi #! /bin/bash y=$(date +%Y) m=$(date +%m) d=$(date +%d) order="/prod/red/order/v1/$y/$m/$d" campaign="/cold/green/campaign/v1/$y/$m/$d" impact="/cold/green/impact/v1/$y/$m/$d" session="/prod/red/session/v1/$y/$m/$d" ss="spark-submit --master spark://spark.acme.ai:7077" $ss --class ai.acme.CampaignImpact kpi.jar $order $campaign $session $impact
  • 9. A corrupt pipeline
    Cron schedules: 10 * * * * and 15 1 * * * (the two scripts from slide 7)
    ● Works great day 1,2,3,4
    ● Day 5: FormSession not done by 1:15
    ● Day 8: FilterOrders fails for hour 6
    ● Day 13: Page views for hour 19 only partially done at 20:10
    Silent data corruption
  • 10. Mistake 1: No/weak workflow orchestration
    ● A workflow orchestrator manages the dependencies
    ● Weak orchestrators use graphical / XML dependency language
      ○ Not expressive enough
    ● Hadoop vendors + cloud vendors ship weak orchestrators
      ○ Except Google
  • 11. Remedy 1: Python-DSL orchestrators
    Luigi
    ● Simple, stable
    ● Easy to run
    ● Slim DataOps
    ● Does one thing well
    Airflow
    ● Good momentum
      ○ Google managed service
    ● More moving parts
    ● Complex DataOps
    ● Includes scheduler & monitoring UI
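To make "manages the dependencies" concrete, here is a minimal sketch of the idea behind Python-DSL orchestrators. It is plain Python, not Luigi or Airflow code: the `Task` base class, the `build` function and the file layout are hypothetical names invented for illustration. A task declares what it requires and where its output goes, and the orchestrator runs only what is missing, upstream first.

```python
from datetime import date
from pathlib import Path
import tempfile


class Task:
    """Hypothetical task: produces one dataset per day, stored as one file."""
    name = "task"

    def __init__(self, root: Path, day: date):
        self.root, self.day = root, day

    def output(self) -> Path:
        return self.root / self.name / self.day.isoformat()

    def requires(self) -> list:
        return []

    def run(self) -> None:
        raise NotImplementedError


class Order(Task):
    name = "order"

    def run(self):
        self.output().write_text("order data\n")


class Session(Task):
    name = "session"

    def requires(self):
        return [Order(self.root, self.day)]

    def run(self):
        orders = self.requires()[0].output().read_text()
        self.output().write_text("sessions from " + orders)


def build(task: Task, log: list) -> None:
    """Run missing upstream tasks first, then the task itself."""
    if task.output().exists():
        return  # dataset already present, nothing to do
    for dep in task.requires():
        build(dep, log)
    task.output().parent.mkdir(parents=True, exist_ok=True)
    task.run()
    log.append(f"{task.name}/{task.day}")


root = Path(tempfile.mkdtemp())
log = []
build(Session(root, date(2018, 6, 11)), log)
build(Session(root, date(2018, 6, 11)), log)  # second call finds the output and does nothing
print(log)
```

Unlike the cron scripts, nothing here depends on what time a job happens to start: a downstream task runs only once its inputs exist, and a rerun skips work that is already done.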
  • 12. Hadoop has a SQL interface, right?
    ● Let's pretend it is an RDBMS
    [Diagram: two services, transactions, Hadoop data lake]
  • 13. Hadoop for everything
    ● Or a NoSQL database
    [Diagram: three services, queries, Hadoop data lake]
  • 14. Mistake 2: New technology, same old patterns
    ● Distributed technology is single-purpose
      ○ RDBMS == multi-purpose
    ● Hadoop + ecosystem = offline
    Big data components + small data patterns → worst of both worlds
  • 15. Remedy 2: Work patterns >> technology
    ● New technology is not the key to success
    ● Benefits of big data == collaboration and pipelines
    Data democratisation
    Offline, forgiving environment → iteration speed
    [Diagram: stream storage, data lake]
  • 16. We need all the data
    ● Wait for all service nodes to send data
      ○ Then start processing
    ● For how long?
      ○ Dashboards?
      ○ Reports?
    [Diagram: six service nodes]
  • 17. We need it now
    ● Start right away
    ● Hope we have most data
    ● If we get x% more, rerun
    [Diagram: six service nodes]
  • 18. Let's improve the events
    ● We have an IP → geo service
    ● Decorate incoming events!
    ● Works great until
      ○ service fails
      ○ has low quality geo data
      ○ has a bug
      ○ we overload it
    [Diagram: service]
  • 19. Lazy data retrieval
    ● DB has data. We wants it.
    ● Just read it from production?
    ● Works until
      ○ You overload the database
      ○ You need to rerun the job later
    [Diagram: DB]
  • 20. Mistake 3: Deviate from functional principles
    ● Immutability
    ● Idempotency
      ○ Jobs must be able to rerun without harmful effects
    ● Reproducibility
      ○ Necessary for data science
      ○ Many exceptions - be conscious
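One way to get immutability and idempotency at the dataset level is to publish each dataset exactly once, atomically, and to treat a rerun as a no-op. The sketch below assumes datasets are files on a local filesystem; the function name `write_dataset` and the path layout are made up for illustration.

```python
import os
import tempfile
from pathlib import Path


def write_dataset(path: Path, lines: list) -> bool:
    """Publish a dataset once. Returns False if it already existed."""
    if path.exists():
        return False  # immutable: never overwrite a published dataset
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file first so readers never see partial output.
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=".tmp-")
    with os.fdopen(fd, "w") as f:
        f.writelines(line + "\n" for line in lines)
    os.replace(tmp, path)  # atomic rename within the same filesystem
    return True


root = Path(tempfile.mkdtemp())
target = root / "order" / "v1" / "2018" / "06" / "11"
first = write_dataset(target, ["order-1", "order-2"])
second = write_dataset(target, ["something else"])  # rerun: harmless no-op
print(first, second, target.read_text().splitlines())
```

With this shape, a half-written hour of data is never visible under its final name, and rerunning a job cannot damage data that downstream jobs have already consumed.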
  • 21. Knock on the back door
    ● Let's dump the DB to lake
    ● Spark/Sqoop can speak JDBC
    ● The more mappers, the merrier?
    [Diagram: service, DB ("Our preciousss"), data lake]
  • 22. Knock on the front door
    ● We can dump from the REST interface
      ○ Unavailability by elephant
    [Diagram: service, DB ("Our preciousss"), data lake]
  • 23. Here is a new ML model for you
    ● New ML algorithm is much better
    ● Let's broadcast new recommendation indexes
    ● Why is the Atlantic link saturated?
    [Diagram: DC1, DC2, DC3. Credits: Tuplejump]
  • 24. Mistake 4: Weak online / offline separation
    Risk = probability * impact
    Online: 10000s of customers, imprecise feedback
      Need low probability => Proactive prevention => Low ROI
  • 25. Mistake 4: Weak online / offline separation
    Risk = probability * impact
    Online: 10000s of customers, imprecise feedback
      Need low probability => Proactive prevention => Low ROI
    Offline: 10s of employees, precise feedback
      Ok with medium probability => Reactive repair => High ROI
  • 26. Remedy 4: Careful handover
    [Diagram labels: Orders, Replication / Backup, Orders, Fraud model, Fraud service; Standard procedures (twice), Lightweight procedures; Careful handover (twice)]
  • 27. Bring on the clusters
    ● "Our clients have petabytes, and they want to do machine learning." Ali Ghodsi, Databricks CEO
    [Scale labels: 100s TBs/day, KBs/day]
    ● Why not use scalable components?
      ○ In case we get 100M users?
  • 28. Distributed processing is difficult
    ● Data skew
    ● More parameters to tune
    ● Logs are distributed
    ● Problem not parallelisable
    ● More sysadmin
    ● Errors not reproducible
    ● Serialisation problems
    ● Noisy neighbours
    ● Excessive shuffling
    ● Limited speedup
    ● Mysteriously slow nodes
    ● Immature frameworks
    ● Difficult to reuse old code
  • 29. Mistake 5: Premature scaling
    ● Most companies have GBs, not PBs
      ○ Limited by human users
    ● Largest cloud instances ~4 TB memory.
      ○ Your data fits
      ○ Or: One time period of your data fits
    Primary value of "Big Data" is not data size, but data sharing and data agility.
    Avoid clusters until necessary - use single nodes. E.g. "spark-submit --master=local"
  • 30. Slower by the day
    ● What if we want top list of last 365 days, every day?
    ● Why is this team always using 20% of the cluster, every day?
  • 31. Mistake 6: Read all the data
    ● Read it all: [Diagram: Order, User → CountryTopList]
    ● Incremental counting: [Diagram: Order, User → OrderCountryCount → CountryTopList]
    Strive to parallelise workflows, not jobs
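The incremental variant can be sketched in a few lines of Python. The dataset names follow the slide (OrderCountryCount, CountryTopList); the in-memory dictionaries and sample data are assumptions that stand in for stored datasets. Each day is aggregated once into a small per-country count, and the top list sums those aggregates instead of rereading every raw order in the window.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical raw input: for each day, the country of each order.
orders_by_day = {
    date(2018, 6, 9): ["SE", "DE", "SE"],
    date(2018, 6, 10): ["DE", "SE"],
    date(2018, 6, 11): ["SE", "NO"],
}

daily_counts = {}  # stands in for the stored OrderCountryCount datasets


def order_country_count(day):
    """Aggregate one day of orders; computed once, then reused."""
    if day not in daily_counts:
        daily_counts[day] = Counter(orders_by_day.get(day, []))
    return daily_counts[day]


def country_top_list(end_day, window_days, n):
    """Sum the small daily aggregates instead of rereading all orders."""
    total = Counter()
    for back in range(window_days):
        total += order_country_count(end_day - timedelta(days=back))
    return total.most_common(n)


print(country_top_list(date(2018, 6, 11), window_days=3, n=2))
```

For a 365-day window, tomorrow's run then needs one new daily aggregate plus 364 existing ones, so the daily cost stays roughly constant instead of growing with the raw data.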
  • 32. Context, streaming
    ● Data lake has a real-time cousin - the unified log
    ● Stream storage bus
      ○ Kafka / cloud service
    ● Pipelines of stream jobs
    [Diagram: apps (Ads, Search, Feed), streams, jobs, data lake, business intelligence]
  • 33. We need real-time everything
    ● Result now >> result in two hours?
    ● Today's numbers >> yesterday's?
    ● Fraud analysis must be quick?
    ● The demos look great!
    "Because life doesn't happen in batch." - Audi @ Kafka Summit
  • 34. Streaming operations
    ● Streaming ops is difficult
      ○ Schema change needed here
    [Diagram: pipeline of jobs connected by streams]
  • 35. Streaming operations
    ● Streaming ops is difficult
      ○ Three days of invalid output here
    [Diagram: pipeline of jobs connected by streams]
  • 36. Invalid output, batch operations
    1. Revert serving datasets to old
    2. Fix bug
    3. Remove faulty datasets
    4. Done!
    Backfill is automatic (Luigi)
  • 37. Streaming operations
    Reprocessing in Kafka Streams
    ● Works for a single job, not pipeline. :-(
    [Diagram: pipeline of jobs connected by streams]
  • 38. Mistake 7: Premature streaming
    ● Stream processing reasonable if one of:
      ○ Strong use case - worth the cost
      ○ Single job pipeline
      ○ Low data quality ok
    ● Stream tooling is getting better
      ○ Thanks Confluent, Google!
    "Because humans operate on a batch time scale." - Me @ Berlin Buzzwords.
  • 39. We see a discrepancy
    ● These numbers don't add up
      ○ Have you seen anything suspicious in your pipeline?
      ○ How many records are invalid?
      ○ Do these datasets agree?
      ○ What is your data quality, BTW?
    ● "Who put 'content-type: bullshit' in these messages?"
    "We are the sewer of the company. All garbage dumped upstream ends up here."
  • 40. Data quality dimensions
    ● Timeliness
      ○ E.g. the customer engagement report was produced at the expected time
    ● Correctness
      ○ The numbers in the reports were calculated correctly
    ● Completeness
      ○ The report includes information on all customers, using all information from the whole time period
    ● Consistency
      ○ The customer summaries are all based on the same time period
  • 41. Mistake 8: Crunching in the dark
    ● Data products are organic
    ● Measure odd and unexpected values
      ○ Spark, Flink, Beam: counters
      ○ Difficult with SQL processing
    ● Measure consistency
      ○ Within records
      ○ Between records
      ○ Between datasets
    ● Be proactive, display in office!
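As a framework-free illustration of measuring odd values and consistency, the sketch below counts every record that is dropped and then checks that input and output counts agree. The record fields, the counter names and the rule that negative amounts are "odd" are assumptions for this example; in Spark, Flink or Beam the same role would be played by the framework's own counters.

```python
from collections import Counter

# Hypothetical records; field names are assumptions for illustration.
orders = [
    {"order_id": 1, "user_id": "u1", "amount": 10.0},
    {"order_id": 2, "user_id": "u9", "amount": 5.0},   # unknown user
    {"order_id": 3, "user_id": "u2", "amount": -1.0},  # odd value
]
users = {"u1": "SE", "u2": "DE"}

counters = Counter()
valid = []
for order in orders:
    counters["orders-in"] += 1
    if order["amount"] < 0:
        counters["negative-amount"] += 1  # consistency within the record
        continue
    if order["user_id"] not in users:
        counters["order-no-user"] += 1    # consistency between datasets
        continue
    valid.append(order)
counters["orders-out"] = len(valid)

# Every input record must be accounted for as either kept or counted as dropped.
dropped = counters["negative-amount"] + counters["order-no-user"]
assert counters["orders-in"] == counters["orders-out"] + dropped
print(dict(counters))
```

Emitting these counters with every run gives a time series, so a sudden jump in dropped records is visible before anyone asks why the numbers don't add up.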
  • 42. Something broke our pipeline
    ● "Invalid UTF-8 encoding"
      ○ It worked yesterday!
    ● "Field not found"
      ○ Who changed the schema?
  • 43. Pipeline fossil
    ● "Invalid UTF-8 encoding"
      ○ It worked yesterday!
    ● "Field not found"
      ○ Who changed the schema?
    ● "Yes, we should change that field"
      ○ But it is too costly
    ● "We built a new pipeline in a few weeks"
      ○ And removed the old after 18 months
  • 44. Mistake 9: Weak end-to-end agility
    Data agility measure: Time from client logging field → pipeline → use in feature / BI
    ● Siloed company: ~6 months
    ● Agile company: 1 month
    ● Coordinated company: days
    Remedy: End-to-end testing. Rare, often in conflict with company culture.
  • 45. Users everywhere
    Functional architecture:
    ● Event-oriented - append only
    ● Immutability
    ● At-least-once semantics
    ● Reproducibility
      ○ Through 1000s of copies
    ● Redundancy
    GDPR came along
    - Please delete me?
    - What data have you got on me?
    - Please correct this data
    - Hold on a second...
  • 46. Negative value heterogeneity
    ● We use Avro everywhere, why is this one in Parquet?
      ○ Beam did not support Parquet. :-(
    ● Why do we have 25 time formats?
      ○ ISO 8601, UTC assumed
      ○ ISO 8601 + timezone
      ○ Millis since epoch, UTC
      ○ Nanos since epoch, UTC
      ○ Millis since epoch, user local time
      ○ ...
      ○ ...
      ○ Float of seconds since epoch, as string. WTF?!?
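A sketch of what converging on one time representation could look like: convert each known format to a timezone-aware UTC datetime at the edge of the pipeline. The format labels (`iso8601`, `epoch_millis`, `epoch_seconds_string`) are hypothetical names for this example, and only some of the formats listed above are covered. "User local time" values cannot be converted this way without also knowing the user's time zone.

```python
from datetime import datetime, timezone


def to_utc(value, fmt):
    """Convert one timestamp to an aware UTC datetime.

    The `fmt` labels are hypothetical names used only in this sketch.
    """
    if fmt == "iso8601":
        parsed = datetime.fromisoformat(value)
        if parsed.tzinfo is None:
            # No offset given: UTC is assumed, as in the first format above.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc)
    if fmt == "epoch_millis":
        return datetime.fromtimestamp(value / 1000, tz=timezone.utc)
    if fmt == "epoch_seconds_string":
        return datetime.fromtimestamp(float(value), tz=timezone.utc)
    raise ValueError(f"unknown timestamp format: {fmt}")


# Four spellings of the same instant.
samples = [
    ("2018-06-11T12:00:00", "iso8601"),
    ("2018-06-11T14:00:00+02:00", "iso8601"),
    (1528718400000, "epoch_millis"),
    ("1528718400.0", "epoch_seconds_string"),
]
normalised = [to_utc(value, fmt) for value, fmt in samples]
print([ts.isoformat() for ts in normalised])
```

The point is less the parsing than the explicit format label: once a dataset's schema states which format a field uses, the conversion is mechanical, whereas guessing the format from the value is where the WTFs come from.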
  • 47. Mistake 10: Late governance
    Data governance for techies: Proactive things you do in order to avoid later WTF.
    GDPR compliance
    ● Erasure
    ● Anonymisation & pseudonymisation
    ● Enforced retention
    ● Access limitations
    Security
    ● Blast radius
    ● Disaster recovery
    Development speed
    ● Technical processes
    ● Domain definitions
    ● Schema evolution
    ● End-to-end testing
    Data democracy
    ● Data lake rules
    ● Data quality communication
    Financial compliance
    ● Reproducibility
    ● Data completeness
    ● Data consistency
  • 48. Common themes
    Themes: Technology over principles; Complexity over business value
    Mistakes: Premature scaling; Premature streaming; Late governance; No workflow orchestration; End-to-end agility; Crunching in the dark; Read all the data; New tech, old patterns; Functional principles; Offline / online
    Simplicity; Driven by value; Scope down; Hype xor profit
  • 49. Related work
    ● https://tinyurl.com/mapflat-10-stumble
    ● Avoiding big data anti-patterns - https://youtu.be/t63SF2UkD0A
    ● https://www.slideshare.net/JesseAnderson/the-five-dysfunctions-of-a-data-engineering-team
    ● https://tinyurl.com/mapflat-reading-list
    ● http://www.mapflat.com/presentations/
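  • 51. Dynamic DAGs
        from datetime import timedelta

        from luigi import DateParameter
        from luigi.contrib.spark import SparkSubmitTask

        # OrderDaily and CredibilityScore are tasks defined elsewhere in the pipeline.
        class FraudCandidate(SparkSubmitTask):
            date = DateParameter()

            def requires(self):
                return [OrderDaily(date=self.date), self._credibility()]

            def _credibility(self):
                # Depend on the most recent CredibilityScore that already exists,
                # looking back at most max_age_days.
                max_age_days = 3
                for lookback in range(max_age_days):
                    candidate = CredibilityScore(self.date - timedelta(days=lookback))
                    if candidate.output().exists():
                        return candidate
                # None found: fall back to the oldest acceptable date.
                return CredibilityScore(self.date - timedelta(days=max_age_days))
    [Diagram: Order, CredibilityScore, FraudCandidate]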
  • 52. Luigi or Airflow?
    Luigi & Kubernetes:
    ● 1 hour to production
        myjob.yaml:
        kind: CronJob
        spec:
          schedule: "17 * * * *"
          jobTemplate:
            spec:
              containers:
              - name: "MyJob"
                image: myregistry/mypipeline:latest
                command: ["luigi"]
                args: ["--module", "mypipeline", "RangeDaily", "--of", "MyJob", "--days-back", "14"]
    Airflow & Kubernetes:
    ● AIRFLOW-1314, created 2017-06-16
      ○ 9/12 sub-tasks resolved
      ○ 6 PRs
      ○ 28 commits
      ○ ~30 files touched
      ○ ~2000 lines of code
    ● Airflow monitoring has value, however
  • 53. Counters
    ● User-defined
    ● Technical from framework
      ○ Execution time
      ○ Memory consumption
      ○ Data volumes
      ○ ...
        case class Order(item: ItemId, userId: UserId)
        case class User(id: UserId, country: String)

        val orders = read(orderPath)
        val users = read(userPath)
        // User-defined counter for orders whose user is missing.
        val orderNoUserCounter = longAccumulator("order-no-user")

        val joined: C[(Order, Option[User])] = orders
          .groupBy(_.userId)
          .leftJoin(users.groupBy(_.id))
          .values

        val orderWithUser: C[(Order, User)] = joined
          .flatMap {
            case (order, Some(user)) => Some((order, user))
            case (order, None) =>
              orderNoUserCounter.add(1)  // count the record, then drop it
              None
          }
    SQL: Nope