www.scling.com
Crossing the data divide
Lars Albertsson, Founder, Scling
Data Innovation Summit, 2021-10-14
The great capability divide
1000x span in availability metrics
Started 2002 / 2006, launched 2010, killed 2012
1000 person years, cost $125M
Started 2009-05-10, launched 2009-05-16
$80M revenue in 15 months
https://www.flickr.com/photos/downloadsourcefr/15944373702, CC BY 2.0
Pirate Bay founders' picture used without permission
Efficiency gap, data cost & value
● Data processing produces datasets
○ Each dataset has business value
● Proxy value/cost metric: datasets / day
○ S-M traditional: < 10
○ Bank, telecom, media: 10-1000
2014: 6500 datasets / day
2016: 20000 datasets / day
2017: 100B events collected / day
2018: 100000+ datasets / day, 25% of staff use BigQuery
2016: 1 600 000 000 datasets / day
[Figure: axes effort and value; labels: Financial, reporting; Insights, data-fed features; Disruptive value of data, machine learning]
Manual, mechanised, industrialised
Manual:
● Muscle-powered
● Few tools
● Human touch for every step
Mechanised:
● Direct human control
● Machine tools
● Low investment, direct return
Industrialised:
● Scaled processes
● Machine tools
● Challenges: scale, logistics, legal, organisation, faults, ...
Manual, mechanised, industrialised
Manual:
● Hand-built models
● Manual deployment
● Spreadsheets
Mechanised:
● Automated training
● Semi-automated deployment
● Data warehouses, notebooks
Industrialised:
● Automated QA, monitoring
● Continuous deployment
● Hadoop ecosystem
Data artifacts: 100x 1000x
Road towards industrialisation
LAMP stack age - manual analytics
Data warehouse (DW) age - mechanised analytics
Hadoop age - industrialised analytics, data-fed features, machine learning
Significant change in workflows
Early Hadoop:
● Weak indexing
● No transactions
● Weak security
● Batch transformations
Road back again
DW
Enterprise big data failures
Post-Hadoop "data engineering" - traditional workflows, new technology
Gap is still there
DW
Enterprise big data failures
Post-Hadoop "data engineering" - traditional workflows, new technology
~10 year capability gap
"data factory engineering"
Current data eng focus - narrative, tools, vendors
What conclusion from this graph?
COVID-19 fatalities / day in Sweden
Fatalities collected during 2 days
Fatalities collected during 4 days
Fatalities collected during 10 days
Normalise data collection to compare
Graph by Adam Altmejd, @adamaltmejd
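One way to read the normalisation idea is to count, for every date of death, only the reports that arrived within the same number of days, so that recent and older dates are compared on equal terms. The sketch below assumes a hypothetical list of (death date, report date) pairs; the records and the function name `counts_within_lag` are illustrative and not taken from the talk.

```python
from collections import Counter
from datetime import date

# Hypothetical records: (date of death, date the death was reported).
reports = [
    (date(2020, 4, 1), date(2020, 4, 2)),
    (date(2020, 4, 1), date(2020, 4, 3)),
    (date(2020, 4, 1), date(2020, 4, 9)),
    (date(2020, 4, 2), date(2020, 4, 3)),
    (date(2020, 4, 2), date(2020, 4, 4)),
]

def counts_within_lag(reports, max_lag_days):
    """Count fatalities per death date, using only reports that
    arrived within max_lag_days of the death."""
    counts = Counter()
    for died, reported in reports:
        if (reported - died).days <= max_lag_days:
            counts[died] += 1
    return dict(counts)

# Same collection window for every date, so the dates are comparable.
two_day = counts_within_lag(reports, 2)
ten_day = counts_within_lag(reports, 10)
```

With a 2-day window both dates show two fatalities; the third report for the first date only appears once the window is widened.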
Forecast for analytics with fresh data
Graph by Adam Altmejd, @adamaltmejd
From craft to process
Multiple time windows
Assess ingress data quality
Repair broken data from complementary source
Forecast based on history, multiple parameter settings
Assess outcome data quality
Assess forecast success, adapt parameters
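The steps listed on this slide can be sketched as a small pipeline. This is a minimal illustration under assumptions: the moving-average forecast, the sample series, and all function names are hypothetical stand-ins, not the method used for the graphs in the talk.

```python
def assess_quality(series):
    """Fraction of values that are present (not None)."""
    return sum(v is not None for v in series) / len(series)

def repair(primary, complementary):
    """Fill gaps in the primary series from a complementary source."""
    return [c if p is None else p for p, c in zip(primary, complementary)]

def forecast(history, window):
    """Forecast the next value as the mean of the last `window` values."""
    recent = history[-window:]
    return sum(recent) / len(recent)

def backtest_error(series, window):
    """Mean absolute error of one-step forecasts over the series."""
    errors = [abs(forecast(series[:i], window) - series[i])
              for i in range(window, len(series))]
    return sum(errors) / len(errors)

# Hypothetical input: the primary source has a gap.
primary = [10, 12, None, 16, 18, 20]
complementary = [10, 12, 14, 16, 18, 21]

ingress_quality = assess_quality(primary)        # assess ingress data quality
clean = repair(primary, complementary)           # repair from complementary source
# Try multiple parameter settings and keep the one that forecast best on history.
best_window = min((1, 2, 3), key=lambda w: backtest_error(clean, w))
next_value = forecast(clean, best_window)
```

Each step is an ordinary function, so it can be tested and rerun automatically instead of being performed by hand.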
Naive ML
Sustainable production ML
Multiple models, parameters, features
Assess ingress data quality
Repair broken data from complementary source
Choose model and parameters based on performance and input data
Benchmark models
Try multiple models, measure, A/B test
Data engineering vs data factory engineering
How to organise
How to work
How to build
Data factory engineering principles - technology
Centralised, homogeneous data platform
Functional architecture
Simple technology, simple rituals
● Minimal experiment friction
○ Centralise first to establish homogeneity
● Democratised functional data processing
○ Raw data + transforms
○ Immutable datasets!
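A minimal sketch of "raw data + transforms" with immutable datasets, assuming a hypothetical `Dataset` type and purchase records that are not part of the talk: a transform is a pure function that derives a new dataset and never modifies its input.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Dataset:
    """An immutable dataset: once written, never changed in place."""
    name: str
    rows: Tuple[Tuple[str, int], ...]

def total_per_user(raw: Dataset) -> Dataset:
    """Pure transform: derives a new dataset, leaves the input untouched."""
    totals = {}
    for user, amount in raw.rows:
        totals[user] = totals.get(user, 0) + amount
    return Dataset("total_per_user", tuple(sorted(totals.items())))

# Hypothetical raw data.
raw = Dataset("purchases_raw", (("ann", 5), ("bob", 3), ("ann", 2)))
derived = total_per_user(raw)
```

Because the raw dataset is never changed, rerunning the transform reproduces the same derived dataset, which is what makes experiments cheap to repeat or discard.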
Data-centric innovation
● Need data from teams
○ willing?
○ backlog?
○ collected?
○ useful?
○ quality?
○ extraction?
○ data governance?
○ history?
Big data - a collaboration paradigm
Data platform
Stream storage
Data lake
Data democratised
Data factory engineering principles - architecture
Failure-driven design
What happens, happens in production
Fast feedback cycle, slow integration
● Batch processing is self healing
○ If you master workflow orchestration
● Low failure impact → high risk → fast cycle
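The self-healing claim can be illustrated with a toy scheduler pass, assuming jobs are keyed by date partition and a failed job leaves no output behind. The function `run_pending` and the flaky job are hypothetical; a real workflow orchestrator adds dependencies, scheduling, and persistent storage.

```python
def run_pending(partitions, outputs, job):
    """One scheduler pass: run the job for every partition lacking output.
    A job that raises leaves no output, so a later pass retries it."""
    for partition in partitions:
        if partition in outputs:
            continue  # already complete, nothing to do
        try:
            outputs[partition] = job(partition)
        except Exception:
            pass  # leave the gap; the next pass heals it

attempts = {"count": 0}

def flaky_job(partition):
    """Hypothetical job that fails on its first invocation only."""
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise RuntimeError("transient failure")
    return f"result for {partition}"

partitions = ["2021-10-12", "2021-10-13", "2021-10-14"]
outputs = {}
run_pending(partitions, outputs, flaky_job)   # first pass: one partition fails
missing_after_first = [p for p in partitions if p not in outputs]
run_pending(partitions, outputs, flaky_job)   # second pass fills the gap
```

The failed partition is simply picked up on the next pass, with no manual intervention and no partial output to clean up.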
Cost of a software error
Nearline
● Data corruption
● Downstream impact
● Bounded recovery
Offline
● Temporary data corruption
● Downstream impact
● Easy recovery
Online
● User impact
● Data corruption
● Cascading corruption
● Unbounded recovery
[Diagram labels: Job, Stream]
Data processing tradeoff
Online, Nearline, Offline
Many nines uptime (99.99.. %) vs a couple of sevens
Data speed vs innovation speed
[Diagram labels: Job, Stream]
Eliminate infrastructure waste
● Production environment only
○ Dev, test, staging lack production data
● Dark pipelines
○ Run in parallel
○ Monitor diff vs production
○ Roll out slowly?
∆?
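A dark pipeline can be sketched as a candidate transform that runs on the same production input as the current one, with only the difference between the two outputs being monitored. Both transforms and the event records below are hypothetical examples, not pipelines described in the talk.

```python
def production_transform(events):
    """Current production logic (hypothetical): count events per country."""
    counts = {}
    for event in events:
        counts[event["country"]] = counts.get(event["country"], 0) + 1
    return counts

def candidate_transform(events):
    """Dark candidate (hypothetical): same count, but normalises case."""
    counts = {}
    for event in events:
        key = event["country"].upper()
        counts[key] = counts.get(key, 0) + 1
    return counts

def diff(prod, dark):
    """Keys whose values differ, as (production value, dark value) pairs."""
    keys = set(prod) | set(dark)
    return {k: (prod.get(k), dark.get(k))
            for k in keys if prod.get(k) != dark.get(k)}

# Both pipelines read the same production input; only the diff is inspected.
events = [{"country": "SE"}, {"country": "se"}, {"country": "NO"}]
delta = diff(production_transform(events), candidate_transform(events))
```

An empty diff suggests the candidate is safe to promote; a non-empty one shows exactly where the behaviour changed, on real data, without affecting consumers.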
Data factory engineering principles - engineering
It's a software engineering problem
Continuous process improvement
● Quality, reproducibility, versioning, deployment, monitoring, rapid change?
○ Solved software engineering problems!
● Capable, unpolished components
○ Designed for strong processes, CI/CD, testing, observability
○ Ugly interfaces
● Statistical process control, engineered
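As one possible reading of statistical process control applied to data pipelines, a metric such as daily row count can be checked against control limits derived from its own history. The three-sigma rule, the metric, and the numbers below are assumptions chosen for illustration.

```python
from statistics import mean, stdev

def control_limits(history, sigmas=3.0):
    """Lower and upper control limits: mean +/- sigmas * sample std dev."""
    centre = mean(history)
    spread = stdev(history)
    return centre - sigmas * spread, centre + sigmas * spread

def in_control(value, history, sigmas=3.0):
    """True if the new measurement falls within the control limits."""
    lower, upper = control_limits(history, sigmas)
    return lower <= value <= upper

# Hypothetical history of rows produced per day by one pipeline.
daily_row_counts = [1000, 1020, 980, 1010, 990, 1005, 995]

normal_day = in_control(1015, daily_row_counts)
broken_day = in_control(400, daily_row_counts)
```

A value outside the limits does not say what went wrong, only that the process deviated from its normal variation and deserves attention.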
SQL is a power tool, not an industrial robot
● No composition & abstractions
○ Hostile to testing
● Not expressive enough for mature data processing
● Hostile to data quality measurements and repair
○ Hadoop/Spark/Flink have quality primitives built in
https://threadreaderapp.com/thread/1353832649664692225.html
Data factory engineering principles - value iteration
Pull-driven work, initiated by business value needs
Products, not projects
Align along value flows
● Only business value counts
○ Drives work
○ Few teams along path
● Data is organic
○ Never done, always iterate
Data factory engineering principles
How to organise, how to work, how to build:
Centralised, homogeneous data platform
Functional architecture
It's a software engineering problem
Pull-driven work, initiated by business value needs
Failure-driven design
Simple technology, simple rituals
What happens, happens in production
Fast feedback cycle, slow integration
Continuous process improvement
Products, not projects
Align along value flows
Software factory engineering principles
Immutable images vs Puppet, Ansible
Agile vs Waterfall
Statistical process control vs In prod debugging
Products vs Projects
DevOps vs Dev + Ops
High code vs Low code
What should a company do?
● Everything in-house
○ Works only for big tech
● Vendors - build, not buy
○ Works for families of use cases
○ So far a 10 year gap to tech elite
● Get consultants
○ No competence flow from European big tech to consultants
○ Products, not projects
● Long-term partnerships?
○ Common outside IT
○ Unfamiliar model in IT - cf. cloud resistance
Autoliv general presentation 2017
Scling - data-value-as-a-service
Data value through collaboration
Customer
Data factory
Data platform & lake
Data, domain expertise
Value from data!
Rapid data innovation
Learning by doing, in collaboration
