Crossing the data divide

www.scling.com
Lars Albertsson, Founder, Scling
Data Innovation Summit, 2021-10-14

The great capability divide
1000x span in availability metrics
Started 2002 / 2006, launched 2010, killed 2012
1000 person years, cost $125M
Started 2009-05-10, launched 2009-05-16
$80M revenue in 15 months
https://www.flickr.com/photos/downloadsourcefr/15944373702, CC BY 2.0
Pirate Bay founders' picture used without permission

Efficiency gap, data cost & value
● Data processing produces datasets
○ Each dataset has business value
● Proxy value/cost metric: datasets / day
○ S-M traditional: < 10
○ Bank, telecom, media: 10-1000
2014: 6500 datasets / day
2016: 20000 datasets / day
2017: 100B events collected / day
2018: 100000+ datasets / day, 25% of staff use BigQuery
2016: 1 600 000 000 datasets / day
[Figure: value versus effort. Labels: financial, reporting; insights, data-fed features; disruptive value of data, machine learning]

Manual, mechanised, industrialised
Manual
● Muscle-powered
● Few tools
● Human touch for every step
Mechanised
● Direct human control
● Machine tools
● Low investment, direct return
Industrialised
● Scaled processes
● Machine tools
● Challenges: scale, logistics, legal, organisation, faults, ...

Manual, mechanised, industrialised
Manual
● Hand-built models
● Manual deployment
● Spreadsheets
Mechanised
● Automated training
● Semi-automated deployment
● Data warehouses, notebooks
Industrialised
● Automated QA, monitoring
● Continuous deployment
● Hadoop ecosystem
Data artifacts: 100x 1000x

Road towards industrialisation
LAMP stack age - manual analytics
Data warehouse age - mechanised analytics
Hadoop age - industrialised analytics, data-fed features, machine learning
Significant change in workflows
Early Hadoop:
● Weak indexing
● No transactions
● Weak security
● Batch transformations

Road back again
Enterprise big data failures
Post-Hadoop "data engineering" - traditional workflows, new technology

Gap is still there
Enterprise big data failures
Post-Hadoop "data engineering" - traditional workflows, new technology
~10 year capability gap
"data factory engineering"
Current data eng focus - narrative, tools, vendors

What conclusion from this graph?
COVID-19 fatalities / day in Sweden
Fatalities collected during 2 days
Fatalities collected during 4 days
Fatalities collected during 10 days

Normalise data collection to compare
Graph by Adam Altmejd, @adamaltmejd
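A minimal sketch of the normalisation idea, assuming each record carries both the date of the event and the date it was reported. The records, field layout and function names below are hypothetical and the numbers are made up; this is not the method behind the graph.

```python
from datetime import date, timedelta

# Hypothetical records: (date_of_death, date_reported). Values are made up.
reports = [
    (date(2020, 4, 1), date(2020, 4, 2)),
    (date(2020, 4, 1), date(2020, 4, 4)),
    (date(2020, 4, 1), date(2020, 4, 9)),
    (date(2020, 4, 8), date(2020, 4, 9)),
    (date(2020, 4, 8), date(2020, 4, 10)),
]


def raw_count(reports, day, as_of):
    """Count events on `day` using everything reported up to `as_of`."""
    return sum(1 for event_day, reported in reports
               if event_day == day and reported <= as_of)


def count_within_lag(reports, day, max_lag_days):
    """Count events on `day` reported at most `max_lag_days` after the event."""
    cutoff = day + timedelta(days=max_lag_days)
    return sum(1 for event_day, reported in reports
               if event_day == day and reported <= cutoff)


today = date(2020, 4, 10)
old_day, recent_day = date(2020, 4, 1), date(2020, 4, 8)

# Raw counts compare 9 days of collection with 2 days of collection.
raw = (raw_count(reports, old_day, today), raw_count(reports, recent_day, today))
# Normalised counts give both days the same 2-day collection window.
normalised = (count_within_lag(reports, old_day, 2),
              count_within_lag(reports, recent_day, 2))
```

With this made-up data the raw counts suggest a decline, while the counts normalised to the same collection window suggest an increase.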

Forecast for analytics with fresh data
Graph by Adam Altmejd, @adamaltmejd

From craft to process
Multiple time windows
Assess ingress data quality
Repair broken data from complementary source
Forecast based on history, multiple parameter settings
Assess outcome data quality
Assess forecast success, adapt parameters
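Two of the steps above, assessing ingress data quality and repairing broken data from a complementary source, can be sketched as follows. This is an illustration under assumptions: the record layout, field names and functions are hypothetical, and a real pipeline would measure more than missing values.

```python
def assess_quality(records, required_fields):
    """Return the fraction of records that have every required field set."""
    if not records:
        return 0.0
    complete = sum(1 for r in records
                   if all(r.get(f) is not None for f in required_fields))
    return complete / len(records)


def repair(primary, complementary, key, field):
    """Fill missing `field` values in primary records from a second source."""
    lookup = {r[key]: r[field] for r in complementary
              if r.get(field) is not None}
    repaired = []
    for record in primary:
        if record.get(field) is None and record[key] in lookup:
            # Build a new record; the input dataset is left untouched.
            record = {**record, field: lookup[record[key]]}
        repaired.append(record)
    return repaired


# Made-up example data.
primary = [
    {"day": "2020-04-01", "count": 5},
    {"day": "2020-04-02", "count": None},
    {"day": "2020-04-03", "count": 7},
]
complementary = [{"day": "2020-04-02", "count": 6}]

quality_before = assess_quality(primary, ["day", "count"])
repaired = repair(primary, complementary, key="day", field="count")
quality_after = assess_quality(repaired, ["day", "count"])
```

Measuring quality before and after the repair step turns the repair into something that can be monitored over time rather than a one-off manual fix.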

Sustainable production ML
Multiple models, parameters, features
Assess ingress data quality
Repair broken data from complementary source
Choose model and parameters based on performance and input data
Benchmark models
Try multiple models, measure, A/B test
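As a rough sketch of "benchmark models" and "choose model and parameters based on performance", the snippet below scores a few candidate forecasters on history and picks the one with the lowest error. The two toy models, the error metric and the data are assumptions for illustration only.

```python
def last_value(window):
    """Naive forecast: tomorrow equals today."""
    return window[-1]


def moving_average(n):
    """Forecast the mean of the last `n` observations."""
    def model(window):
        tail = window[-n:]
        return sum(tail) / len(tail)
    return model


def mean_abs_error(model, history):
    """Replay history: forecast each point from the points before it."""
    errors = [abs(model(history[:i]) - history[i])
              for i in range(1, len(history))]
    return sum(errors) / len(errors)


def choose_model(candidates, history):
    """Benchmark every candidate and return the best name plus all scores."""
    scores = {name: mean_abs_error(model, history)
              for name, model in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores


candidates = {"last_value": last_value, "mean_3": moving_average(3)}
history = [10, 12, 10, 12, 10, 12]  # made-up, oscillating series
best, scores = choose_model(candidates, history)
```

Rerunning the benchmark on every new batch of data lets the choice of model and parameters adapt automatically as the input changes.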

Data engineering vs data factory engineering
How to organise
How to work
How to build

Data factory engineering principles - technology
Centralised, homogeneous data platform
Functional architecture
Simple technology, simple rituals
● Minimal experiment friction
○ Centralise first to establish homogeneity
● Democratised functional data processing
○ Raw data + transforms
○ Immutable datasets!
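A minimal sketch of functional data processing, assuming a dataset is an immutable value and every derived dataset is produced by a pure transform of raw data. The `Dataset` class, the `derive` helper and the order schema are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple

Row = Tuple[str, int]  # (user, amount); a made-up schema for illustration


@dataclass(frozen=True)
class Dataset:
    """An immutable dataset: once written, it is never modified."""
    name: str
    partition: str
    rows: Tuple[Row, ...]


def derive(name: str, source: Dataset,
           fn: Callable[[Tuple[Row, ...]], Iterable[Row]]) -> Dataset:
    """Apply a pure function to a dataset and return a new dataset."""
    return Dataset(name=name, partition=source.partition,
                   rows=tuple(fn(source.rows)))


def keep_valid(rows):
    """Drop rows with negative amounts."""
    return (row for row in rows if row[1] >= 0)


def total_per_user(rows):
    """Sum amounts per user, sorted by user for a deterministic result."""
    totals = {}
    for user, amount in rows:
        totals[user] = totals.get(user, 0) + amount
    return sorted(totals.items())


raw = Dataset("raw_orders", "2021-10-14",
              (("alice", 10), ("bob", -1), ("alice", 5)))
valid = derive("valid_orders", raw, keep_valid)
totals = derive("order_totals", valid, total_per_user)
```

Because the raw dataset is never changed, a bug in a transform can be fixed and every downstream dataset recomputed from the same input.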

Data-centric innovation
● Need data from teams
○ willing?
○ backlog?
○ collected?
○ useful?
○ quality?
○ extraction?
○ data governance?
○ history?

Big data - a collaboration paradigm
Data platform
Stream storage
Data lake
Data democratised

Data factory engineering principles - architecture
Failure-driven design
What happens, happens in production
Fast feedback cycle, slow integration
● Batch processing is self healing
○ If you master workflow orchestration
● Low failure impact → high risk tolerance → fast cycle
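One way to read "self healing" is sketched below, assuming one output dataset per day and a scheduler that repeatedly runs whatever is missing. The function names and the in-memory `outputs` dictionary are hypothetical stand-ins for a real orchestrator and real storage.

```python
from datetime import date, timedelta


def missing_partitions(start, end, existing):
    """Days in [start, end] that have no output dataset yet."""
    days = (end - start).days + 1
    return [start + timedelta(days=i) for i in range(days)
            if start + timedelta(days=i) not in existing]


def run_pending(start, end, outputs, job):
    """Run `job` for every missing day; completed days are not recomputed."""
    for day in missing_partitions(start, end, outputs.keys()):
        try:
            outputs[day] = job(day)  # output is recorded only on success
        except Exception:
            # The partition stays missing and is retried on the next run.
            pass
    return outputs
```

If a job fails for one day, nothing needs manual repair: the next scheduled call to `run_pending` sees the gap and fills it, once the bug or outage is fixed.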

Cost of a software error
Online
● User impact
● Data corruption
● Cascading corruption
● Unbounded recovery
Nearline
● Data corruption
● Downstream impact
● Bounded recovery
Offline
● Temporary data corruption
● Downstream impact
● Easy recovery

Data processing tradeoff
Online, nearline, offline
Many nines uptime (99.99.. %) vs. a couple of sevens
Data speed vs. innovation speed

Eliminate infrastructure waste
● Production environment only
○ Dev, test, staging lack production data
● Dark pipelines
○ Run in parallel
○ Monitor diff vs production
○ Roll out slowly?
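A dark pipeline comparison might look roughly like this, assuming both pipelines emit keyed records for the same input. The `diff_outputs` function and the record layout are hypothetical and only illustrate the "monitor diff vs production" step.

```python
def diff_outputs(production, candidate, key):
    """Compare a dark pipeline's output with production output, by key."""
    prod = {record[key]: record for record in production}
    cand = {record[key]: record for record in candidate}
    return {
        "missing": sorted(prod.keys() - cand.keys()),
        "extra": sorted(cand.keys() - prod.keys()),
        "changed": sorted(k for k in prod.keys() & cand.keys()
                          if prod[k] != cand[k]),
    }


# Made-up outputs from the production pipeline and a candidate version.
production = [
    {"user": "alice", "total": 15},
    {"user": "bob", "total": 7},
    {"user": "carol", "total": 3},
]
candidate = [
    {"user": "alice", "total": 15},
    {"user": "bob", "total": 8},
    {"user": "dave", "total": 1},
]

report = diff_outputs(production, candidate, key="user")
```

An empty report over a number of runs is evidence that the new version can replace the old one; a non-empty report shows exactly which records to investigate.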

Data factory engineering principles - engineering
It's a software engineering problem
Continuous process improvement
● Quality, reproducibility, versioning, deployment, monitoring, rapid change?
○ Solved software engineering problems!
● Capable, unpolished components
○ Designed for strong processes, CI/CD, testing, observability
○ Ugly interfaces
● Statistical process control, engineered
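Statistical process control applied to a pipeline could be sketched as a control chart check on a daily metric such as record count. The three-sigma limits, the metric and the numbers are assumptions for this example, not a prescription from the talk.

```python
from statistics import mean, stdev


def control_limits(history, sigmas=3.0):
    """Lower and upper control limits from historical measurements."""
    centre = mean(history)
    spread = stdev(history)  # sample standard deviation
    return centre - sigmas * spread, centre + sigmas * spread


def in_control(value, history, sigmas=3.0):
    """True if a new measurement falls within the control limits."""
    low, high = control_limits(history, sigmas)
    return low <= value <= high


# Made-up daily record counts for one dataset.
daily_record_counts = [1000, 1020, 980, 1010, 990]
low, high = control_limits(daily_record_counts)
```

A pipeline can run such a check after each job and raise an alert, or withhold the dataset, when a measurement falls outside the limits.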

SQL is a power tool, not an industrial robot
● No composition & abstractions
○ Hostile to testing
● Not expressive enough for mature data processing
● Hostile to data quality measurements and repair
○ Hadoop/Spark/Flink have quality primitives built in
https://threadreaderapp.com/thread/1353832649664692225.html
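To illustrate what composition, testability and quality measurement can look like in code, here is a plain Python sketch. It deliberately uses no framework API; the counters only imitate the idea of job counters, and all names and the input format are hypothetical.

```python
from collections import Counter


def parse(lines, counters):
    """Parse 'user,amount' lines, counting malformed input instead of failing."""
    for line in lines:
        parts = line.split(",")
        if len(parts) != 2 or not parts[1].strip().isdigit():
            counters["malformed"] += 1
            continue
        counters["parsed"] += 1
        yield parts[0].strip(), int(parts[1])


def total_per_user(pairs):
    """Sum amounts per user. A small function that is easy to unit test."""
    totals = Counter()
    for user, amount in pairs:
        totals[user] += amount
    return dict(totals)


def pipeline(lines):
    """Compose the steps and return the result with quality measurements."""
    counters = Counter()
    totals = total_per_user(parse(lines, counters))
    return totals, dict(counters)


totals, quality = pipeline(["alice,10", "bob,oops", "alice,5", "garbage"])
```

Each step is an ordinary function that can be tested in isolation, and the quality counters travel with the result so that a downstream check can act on them.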

Data factory engineering principles - value iteration
Pull-driven work, initiated by business value needs
Products, not projects
Align along value flows
● Only business value counts
○ Drives work
○ Few teams along path
● Data is organic
○ Never done, always iterate

Data factory engineering principles
How to organise
How to work
How to build
Centralised, homogeneous data platform
Functional architecture
Simple technology, simple rituals
Failure-driven design
What happens, happens in production
Fast feedback cycle, slow integration
It's a software engineering problem
Continuous process improvement
Pull-driven work, initiated by business value needs
Products, not projects
Align along value flows

Software factory engineering principles
Immutable images vs. Puppet, Ansible
Agile vs. Waterfall
Statistical process control vs. In prod debugging
Products vs. Projects
DevOps vs. Dev + Ops
High code vs. Low code

What should a company do?
● Everything in-house
○ Works only for big tech
● Vendors - build, not buy
○ Works for families of use cases
○ So far a 10 year gap to tech elite
● Get consultants
○ No competence flow from European big tech to consultants
○ Products, not projects
● Long-term partnerships?
○ Common outside IT
○ Unfamiliar model in IT - cf. cloud resistance
Autoliv general presentation 2017

Scling - data-value-as-a-service
Data value through collaboration
Customer
Data factory
Data platform & lake
data
domain expertise
Value from data!
Rapid data innovation
Learning by doing, in collaboration
