Did you know that the tech elite does not work at all like you do? Most people don't know, and don't want to know. The State of DevOps report found a span of 1000x in delivery time and reliability between elite and low performers. There is a similar gap in the time it takes to deliver data or ML pipelines to production. The gap in the ability to compute datasets is even larger, somewhere around a million times. We call this the data divide or the AI divide. It is widening over time, since most companies are not aware of how wide it is.
We will share the principles we applied in the most successful Scandinavian crossing of the data divide. At the time, we never explicitly shared or described the principles, nor did we fully understand them, but it is long overdue to enumerate them.
The presentation will likely be uncomfortable and surprising, because it matches neither what you do nor what your vendors say. You will have no practical use for the information: the principles contradict many contemporary trends and popular technologies on the market, and you will be unable to overcome the forces of trends, popularity, and vendor messaging in order to apply them. The principles worked beautifully for us at the time.
2. www.scling.com
The great capability divide
1000x span in availability metrics
Started 2002 / 2006, launched 2010, killed 2012
1000 person years, cost $125M
Started 2009-05-10, launched 2009-05-16
$80M revenue in 15 months
https://www.flickr.com/photos/downloadsourcefr/15944373702, CC BY 2.0
Pirate Bay founders' picture used without permission
3. www.scling.com
Efficiency gap, data cost & value
● Data processing produces datasets
○ Each dataset has business value
● Proxy value/cost metric: datasets / day
○ Small to medium traditional companies: < 10
○ Bank, telecom, media: 10-1000
2014: 6500 datasets / day
2016: 20000 datasets / day
2017: 100B events collected / day
2018: 100000+ datasets / day, 25% of staff use BigQuery
2016: 1 600 000 000 datasets / day
[Figure: value plotted against effort, with regions labelled "Financial, reporting", "Insights, data-fed features", and "Disruptive value of data, machine learning"]
4. www.scling.com
Manual, mechanised, industrialised
● Manual
○ Muscle-powered
○ Few tools
○ Human touch for every step
● Mechanised
○ Direct human control
○ Machine tools
○ Low investment, direct return
● Industrialised
○ Scaled processes
○ Machine tools
○ Challenges: scale, logistics, legal, organisation, faults, ...
8. www.scling.com
Gap is still there
DW
Enterprise big data failures
Post-Hadoop "data engineering" - traditional workflows, new technology
~10 year capability gap
"data factory engineering"
Current data eng focus - narrative, tools, vendors
10. www.scling.com
What conclusion from this graph?
COVID-19 fatalities / day in Sweden
Fatalities collected during 2 days
Fatalities collected during 4 days
Fatalities collected during 10 days
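The three series in the graph differ only in how long reports were collected for each day. A minimal sketch of that effect, assuming hypothetical records of (date of death, date reported); the data and the function name `counts_within` are illustrative and not taken from the slides:

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical records: (date of death, date the death was reported).
reports = [
    (date(2020, 4, 1), date(2020, 4, 2)),
    (date(2020, 4, 1), date(2020, 4, 4)),
    (date(2020, 4, 1), date(2020, 4, 9)),
    (date(2020, 4, 2), date(2020, 4, 3)),
    (date(2020, 4, 2), date(2020, 4, 10)),
]


def counts_within(reports, window_days):
    """Count fatalities per day, using only reports that arrived within window_days."""
    window = timedelta(days=window_days)
    return Counter(died for died, reported in reports if reported - died <= window)


# The same days look different depending on how long reports were collected.
for window_days in (2, 4, 10):
    print(window_days, dict(counts_within(reports, window_days)))
```

Under these assumptions, the most recent days always look lower than they will once late reports arrive, so an apparent decline at the right edge of such a graph may be a collection artefact.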
14. www.scling.com
From craft to process
Multiple time windows
Assess ingress data quality
Repair broken data from complementary source
Forecast based on history, multiple parameter settings
Assess outcome data quality
Assess forecast success, adapt parameters
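The "forecast with multiple parameter settings, assess forecast success, adapt parameters" loop could be sketched as follows. Exponential smoothing is used here only as a stand-in forecaster; the function names and the candidate parameter values are assumptions, not something the slides specify:

```python
def smooth_forecast(history, alpha):
    """One-step-ahead forecast by simple exponential smoothing."""
    level = history[0]
    for value in history[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


def backtest_error(series, alpha):
    """Mean absolute error of one-step forecasts replayed over the series."""
    errors = [
        abs(smooth_forecast(series[:i], alpha) - series[i])
        for i in range(1, len(series))
    ]
    return sum(errors) / len(errors)


def best_alpha(series, candidates=(0.1, 0.5, 0.9)):
    """Pick the parameter setting with the lowest historical forecast error."""
    return min(candidates, key=lambda alpha: backtest_error(series, alpha))


# On a steadily rising series, the most responsive setting wins.
print(best_alpha([1, 2, 3, 4, 5, 6]))
```

Running the selection as a scheduled pipeline step, rather than tuning by hand, is one way to read the move from craft to process.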
16. www.scling.com
Sustainable production ML
Multiple models, parameters, features
Assess ingress data quality
Repair broken data from complementary source
Choose model and parameters based on performance and input data
Benchmark models
Try multiple models, measure, A/B test
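The "assess ingress data quality" and "repair broken data from complementary source" steps might look like the sketch below. It assumes, purely for illustration, that datasets are dictionaries keyed by record id and that a broken value is represented as `None`; `assess_and_repair` is a hypothetical helper:

```python
def assess_and_repair(primary, complementary):
    """Return (repaired records, fraction of primary values that were missing)."""
    repaired = {}
    missing = 0
    for key, value in primary.items():
        if value is None:
            missing += 1
            # Fall back to the complementary source; may still be None.
            value = complementary.get(key)
        repaired[key] = value
    ratio = missing / len(primary) if primary else 0.0
    return repaired, ratio


repaired, missing_ratio = assess_and_repair(
    {"a": 1, "b": None}, {"b": 2}
)
print(repaired, missing_ratio)
```

The missing ratio is the quality measurement; a downstream step could use it to decide which model or parameters to trust for that day's input.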
21. www.scling.com
Data factory engineering principles - architecture
Failure-driven design
What happens, happens in production
Fast feedback cycle, slow integration
● Batch processing is self healing
○ If you master workflow orchestration
● Low failure impact → high risk → fast cycle
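One way to read "batch processing is self healing" is that an orchestrator reruns whatever output is missing. A minimal sketch, assuming daily partitions and an in-memory dictionary standing in for dataset storage; the function names are hypothetical and no particular orchestration tool is implied:

```python
from datetime import date, timedelta


def missing_partitions(existing, start, end):
    """Dates in [start, end] that have no output dataset yet."""
    days = (end - start).days + 1
    wanted = [start + timedelta(days=i) for i in range(days)]
    return [day for day in wanted if day not in existing]


def run_until_complete(job, existing, start, end):
    """Run job for every missing date; failures are retried on the next call."""
    for day in missing_partitions(existing, start, end):
        try:
            existing[day] = job(day)
        except Exception:
            # Output stays missing, so the next scheduled run picks it up again.
            pass
    return existing
```

Because each run only fills gaps, a failed or buggy job costs a rerun rather than a manual recovery, which is what keeps failure impact low.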
22. www.scling.com
Cost of a software error
Nearline
● Data corruption
● Downstream impact
● Bounded recovery
Offline
● Temporary data corruption
● Downstream impact
● Easy recovery
Online
● User impact
● Data corruption
● Cascading corruption
● Unbounded recovery
23. www.scling.com
Data processing tradeoff
● Many nines uptime (99.99.. %) vs. a couple of sevens
● Data speed vs. innovation speed
● Processing modes: online, nearline, offline
24. www.scling.com
Eliminate infrastructure waste
● Production environment only
○ Dev, test, staging lack production data
● Dark pipelines
○ Run in parallel
○ Monitor diff vs production
○ Roll out slowly?
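Monitoring the diff between a dark pipeline and production could be sketched like this, assuming both pipelines emit keyed datasets that fit in dictionaries. The names `diff_outputs` and `diff_ratio` are hypothetical:

```python
def diff_outputs(production, dark):
    """Compare two keyed datasets produced from the same input."""
    shared = production.keys() & dark.keys()
    return {
        "changed": sorted(k for k in shared if production[k] != dark[k]),
        "missing": sorted(production.keys() - dark.keys()),
        "extra": sorted(dark.keys() - production.keys()),
    }


def diff_ratio(production, dark):
    """Fraction of all keys that differ in any way between the two outputs."""
    diff = diff_outputs(production, dark)
    total = len(production.keys() | dark.keys())
    differing = len(diff["changed"]) + len(diff["missing"]) + len(diff["extra"])
    return differing / total if total else 0.0


prod = {"a": 1, "b": 2, "c": 3}
dark = {"a": 1, "b": 5, "d": 4}
print(diff_outputs(prod, dark), diff_ratio(prod, dark))
```

A ratio tracked per run gives a single number to watch while the dark pipeline runs in parallel, before deciding whether to roll it out.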
25. www.scling.com
Data factory engineering principles - engineering
It's a software engineering problem
Continuous process improvement
● Quality, reproducibility, versioning, deployment, monitoring, rapid change?
○ Solved software engineering problems!
● Capable, unpolished components
○ Designed for strong processes, CI/CD, testing, observability
○ Ugly interfaces
● Statistical process control, engineered
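Statistical process control applied to a pipeline metric could be sketched as below. It assumes a per-dataset metric such as row count and uses the conventional three-sigma control limits; the function names and sample numbers are illustrative only:

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Lower and upper control limits from a baseline of past measurements."""
    centre = mean(baseline)
    spread = stdev(baseline)
    return centre - sigmas * spread, centre + sigmas * spread


def out_of_control(baseline, value, sigmas=3.0):
    """True if a new measurement falls outside the control limits."""
    low, high = control_limits(baseline, sigmas)
    return not (low <= value <= high)


# Hypothetical daily row counts for one dataset.
row_counts = [100, 102, 98, 101, 99]
print(out_of_control(row_counts, 103), out_of_control(row_counts, 120))
```

"Engineered" here would mean the check runs as code in the pipeline for every dataset, instead of someone inspecting a dashboard.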
26. www.scling.com
SQL is a power tool, not an industrial robot
● No composition & abstractions
○ Hostile to testing
● Not expressive enough for mature data processing
● Hostile to data quality measurements and repair
○ Hadoop/Spark/Flink have quality primitives built in
https://threadreaderapp.com/thread/1353832649664692225.html
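To illustrate what composition and testability mean in this context, here is a sketch of a small transformation built from functions that can each be unit tested in isolation. The record layout and function names are assumptions made for the example:

```python
def parse_event(line):
    """Parse a 'user,country,amount' line into a record."""
    user, country, amount = line.split(",")
    return {"user": user, "country": country.upper(), "amount": float(amount)}


def is_valid(event):
    """A simple quality rule: amounts must not be negative."""
    return event["amount"] >= 0


def revenue_by_country(lines):
    """Compose parsing, validation, and aggregation into one step."""
    totals = {}
    for event in filter(is_valid, map(parse_event, lines)):
        totals[event["country"]] = totals.get(event["country"], 0.0) + event["amount"]
    return totals


print(revenue_by_country(["u1,se,10.0", "u2,SE,5.0", "u3,no,-1.0"]))
```

Each function can be reused in other pipelines and covered by its own test, and the validation rule is a natural place to count rejected records for quality measurement.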
27. www.scling.com
Data factory engineering principles - value iteration
Pull-driven work, initiated by business value needs
Products, not projects
Align along value flows
● Only business value counts
○ Drives work
○ Few teams along path
● Data is organic
○ Never done, always iterate
28. www.scling.com
Data factory engineering principles
How to organise, how to work, how to build
Centralised, homogeneous data platform
Functional architecture
It's a software engineering problem
Pull-driven work, initiated by business value needs
Failure-driven design
Simple technology, simple rituals
What happens, happens in production
Fast feedback cycle, slow integration
Continuous process improvement
Products, not projects
Align along value flows
29. www.scling.com
Software factory engineering principles
Immutable images vs. Puppet, Ansible
Agile vs. Waterfall
Statistical process control vs. In prod debugging
Products vs. Projects
DevOps vs. Dev + Ops
High code vs. Low code
30. www.scling.com
What should a company do?
● Everything in-house
○ Works only for big tech
● Vendors - build, not buy
○ Works for families of use cases
○ So far a 10 year gap to tech elite
● Get consultants
○ No competence flow from European big tech to consultants
○ Products, not projects
● Long-term partnerships?
○ Common outside IT
○ Unfamiliar model in IT - cf. cloud resistance
Autoliv general presentation 2017
31. www.scling.com
Scling - data-value-as-a-service
Data value through collaboration
Customer
Data factory
Data platform & lake
Data, domain expertise
Value from data!
Rapid data innovation
Learning by doing, in collaboration