www.scling.com
Crossing the data divide
Lars Albertsson, Founder, Scling
Data Innovation Summit, 2021-10-14
The great capability divide
1000x span in availability metrics
Started 2002 / 2006, launched 2010, killed 2012; 1000 person years, cost $125M
Started 2009-05-10, launched 2009-05-16; $80M revenue in 15 months
https://www.flickr.com/photos/downloadsourcefr/15944373702, CC BY 2.0
Pirate Bay founders' picture used without permission
Efficiency gap, data cost & value
● Data processing produces datasets
○ Each dataset has business value
● Proxy value/cost metric: datasets / day
○ S-M traditional: < 10
○ Bank, telecom, media: 10-1000
2014: 6500 datasets / day
2016: 20000 datasets / day
2017: 100B events collected / day
2018: 100000+ datasets / day, 25% of staff use BigQuery
2016: 1 600 000 000 datasets / day
Value vs effort (chart): financial, reporting; insights, data-fed features; disruptive value of data, machine learning
Manual, mechanised, industrialised
Manual:
● Muscle-powered
● Few tools
● Human touch for every step
Mechanised:
● Direct human control
● Machine tools
● Low investment, direct return
Industrialised:
● Scaled processes
● Machine tools
● Challenges: scale, logistics, legal, organisation, faults, ...
Manual, mechanised, industrialised
Manual:
● Hand-built models
● Manual deployment
● Spreadsheets
Mechanised:
● Automated training
● Semi-automated deployment
● Data warehouses, notebooks
Industrialised:
● Automated QA, monitoring
● Continuous deployment
● Hadoop ecosystem
Data artifacts: 100x 1000x
Road towards industrialisation
LAMP stack age - manual analytics
Data warehouse (DW) age - mechanised analytics
Hadoop age - industrialised analytics, data-fed features, machine learning
Significant change in workflows
Early Hadoop:
● Weak indexing
● No transactions
● Weak security
● Batch transformations
Road back again
Enterprise big data failures
Post-Hadoop "data engineering" - traditional workflows, new technology
Gap is still there
Enterprise big data failures
Post-Hadoop "data engineering" - traditional workflows, new technology
~10 year capability gap
"data factory engineering"
Current data eng focus - narrative, tools, vendors
What conclusion from this graph?
COVID-19 fatalities / day in Sweden
Fatalities collected during 2 days
Fatalities collected during 4 days
Fatalities collected during 10 days
Normalise data collection to compare
Graph by Adam Altmejd, @adamaltmejd
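One way to read "normalise data collection" is to count, for every day, only the fatalities reported within the same number of days after the date of death, and to leave out days whose collection window has not closed yet. The sketch below assumes that reading. The function name, the record layout and the sample dates are made up for illustration and are not taken from the graphs.

```python
from collections import Counter
from datetime import date


def normalised_counts(records, lag_days, as_of):
    """Count fatalities per death date with an equal collection window.

    records: iterable of (death_date, report_date) pairs (hypothetical layout).
    A death is counted only if it was reported within lag_days of the death
    date, and a death date is included only if lag_days have passed as of as_of.
    """
    counts = Counter()
    for death_date, report_date in records:
        window_closed = (as_of - death_date).days >= lag_days
        reported_in_window = (report_date - death_date).days <= lag_days
        if window_closed and reported_in_window:
            counts[death_date] += 1
    return dict(counts)


# Made-up sample: older days have had more time to collect late reports.
records = [
    (date(2020, 4, 1), date(2020, 4, 2)),
    (date(2020, 4, 1), date(2020, 4, 3)),
    (date(2020, 4, 1), date(2020, 4, 9)),   # late report
    (date(2020, 4, 8), date(2020, 4, 9)),
    (date(2020, 4, 8), date(2020, 4, 10)),
    (date(2020, 4, 10), date(2020, 4, 10)),  # window still open
]

raw = dict(Counter(death for death, _ in records))
normalised = normalised_counts(records, lag_days=2, as_of=date(2020, 4, 10))
```

In this sample the raw counts fall from 3 to 2 to 1, which looks like a decline. With a 2 day window both closed days show 2, and the newest day is withheld until its window closes.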
Forecast for analytics with fresh data
Graph by Adam Altmejd, @adamaltmejd
From craft to process
Multiple time windows
Assess ingress data quality
Repair broken data from complementary source
Forecast based on history, multiple parameter settings
Assess outcome data quality
Assess forecast success, adapt parameters
Naive ML
Sustainable production ML
Multiple models, parameters, features
Assess ingress data quality
Repair broken data from complementary source
Choose model and parameters based on performance and input data
Benchmark models
Try multiple models, measure, A/B test
Data engineering vs data factory engineering
How to organise
How to work
How to build
Data factory engineering principles - technology
Centralised, homogeneous data platform
Functional architecture
Simple technology, simple rituals
● Minimal experiment friction
○ Centralise first to establish homogeneity
● Democratised functional data processing
○ Raw data + transforms
○ Immutable datasets!
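"Raw data + transforms" with immutable datasets can be sketched as pure functions that derive new datasets and never touch their input. This is a minimal illustration, not a description of any particular platform. The `Dataset` class, the `transform` helper and the sample rows are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Dataset:
    """A named dataset; frozen, so its fields cannot be reassigned."""
    name: str
    rows: Tuple[dict, ...]  # a tuple cannot be appended to or reordered


def transform(source: Dataset, name: str, fn: Callable[[dict], dict]) -> Dataset:
    """Derive a new dataset from source; the source is never modified.

    Rows are dicts and therefore mutable, so each row is copied before
    it is handed to fn.
    """
    return Dataset(name=name, rows=tuple(fn(dict(row)) for row in source.rows))


# Hypothetical raw data and one derived dataset.
raw = Dataset("raw_events", ({"user": "a", "ms": 1200}, {"user": "b", "ms": 300}))
seconds = transform(raw, "events_seconds", lambda r: {**r, "s": r["ms"] / 1000})
```

Because the raw dataset is kept unchanged, a faulty transform can be fixed and rerun, and every derived dataset can be rebuilt from the same input.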
Data-centric innovation
● Need data from teams
○ willing?
○ backlog?
○ collected?
○ useful?
○ quality?
○ extraction?
○ data governance?
○ history?
Big data - a collaboration paradigm
Data platform
Stream storage
Data lake
Data democratised
Data factory engineering principles - architecture
Failure-driven design
What happens, happens in production
Fast feedback cycle, slow integration
● Batch processing is self healing
○ If you master workflow orchestration
● Low failure impact → high risk → fast cycle
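The self-healing property can be illustrated with a scheduler loop that, on every run, computes whichever daily outputs are still missing. This is a simplified sketch of that orchestration idea, not the API of any specific workflow tool. The in-memory `store`, the `run_missing` function and the flaky job are assumptions for the example.

```python
from datetime import date, timedelta


def run_missing(store, first_day, last_day, job):
    """Compute every daily output that is absent from the store.

    Failed days are left as gaps, so the next scheduled run retries them.
    Returns the list of days produced in this run.
    """
    produced = []
    day = first_day
    while day <= last_day:
        if day not in store:
            try:
                store[day] = job(day)
                produced.append(day)
            except Exception:
                pass  # keep the gap; a later run fills it
        day += timedelta(days=1)
    return produced


# Hypothetical transient outage: the job for this day fails once.
failures = {date(2021, 10, 2)}


def flaky_job(day):
    if day in failures:
        failures.remove(day)
        raise RuntimeError("upstream data not available")
    return f"dataset-{day.isoformat()}"


store = {}
first_run = run_missing(store, date(2021, 10, 1), date(2021, 10, 3), flaky_job)
second_run = run_missing(store, date(2021, 10, 1), date(2021, 10, 3), flaky_job)
```

The first run produces the outputs for 1 and 3 October. The second run only fills the gap for 2 October, without manual intervention.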
Cost of a software error
Nearline
● Data corruption
● Downstream impact
● Bounded recovery
Offline
● Temporary data corruption
● Downstream impact
● Easy recovery
Online
● User impact
● Data corruption
● Cascading corruption
● Unbounded recovery
Data processing tradeoff
Online: many nines uptime (99.99.. %), data speed
Nearline
Offline: a couple of sevens, innovation speed
Eliminate infrastructure waste
● Production environment only
○ Dev, test, staging lack production data
● Dark pipelines
○ Run in parallel
○ Monitor diff vs production
○ Roll out slowly?
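Monitoring the diff between a dark pipeline and production can be as simple as comparing the two outputs key by key. The sketch below assumes both pipelines emit a mapping from key to value. The function name and the sample figures are invented for illustration.

```python
def diff_outputs(production, dark):
    """Report where the dark pipeline output deviates from production."""
    missing = sorted(production.keys() - dark.keys())   # only in production
    extra = sorted(dark.keys() - production.keys())     # only in dark
    changed = sorted(
        key for key in production.keys() & dark.keys()
        if production[key] != dark[key]
    )
    return {"missing": missing, "extra": extra, "changed": changed}


# Made-up outputs from the same input data.
production = {"se": 120, "no": 80, "dk": 95}
dark = {"se": 120, "no": 82, "fi": 40}
report = diff_outputs(production, dark)
```

An empty report over a period of real production data is one possible signal that the dark pipeline is ready to replace the production one.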
Data factory engineering principles - engineering
It's a software engineering problem
Continuous process improvement
● Quality, reproducibility, versioning, deployment, monitoring, rapid change?
○ Solved software engineering problems!
● Capable, unpolished components
○ Designed for strong processes, CI/CD, testing, observability
○ Ugly interfaces
● Statistical process control, engineered
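Statistical process control applied to data could, for example, mean tracking a metric per dataset and flagging values outside control limits. The sketch below uses the common mean plus or minus three standard deviations rule on daily row counts. The choice of metric, the limits and the numbers are assumptions, not something stated on the slide.

```python
from statistics import mean, stdev


def control_limits(history, sigmas=3.0):
    """Lower and upper control limits from historical metric values."""
    centre = mean(history)
    spread = stdev(history)  # sample standard deviation
    return centre - sigmas * spread, centre + sigmas * spread


def in_control(value, history, sigmas=3.0):
    """True if the new value lies within the control limits."""
    low, high = control_limits(history, sigmas)
    return low <= value <= high


# Hypothetical row counts of a daily dataset.
row_counts = [1000, 1010, 990, 1005, 995]
ok_today = in_control(1002, row_counts)
broken_today = in_control(400, row_counts)
```

A value outside the limits, such as 400 rows here, would stop downstream jobs or raise an alert instead of silently propagating.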
SQL is a power tool, not an industrial robot
● No composition & abstractions
○ Hostile to testing
● Not expressive enough for mature data processing
● Hostile to data quality measurements and repair
○ Hadoop/Spark/Flink have quality primitives built in
https://threadreaderapp.com/thread/1353832649664692225.html
Data factory engineering principles - value iteration
Pull-driven work, initiated by business value needs
Products, not projects
Align along value flows
● Only business value counts
○ Drives work
○ Few teams along path
● Data is organic
○ Never done, always iterate
Data factory engineering principles
How to organise, how to work, how to build
Technology:
● Centralised, homogeneous data platform
● Functional architecture
● Simple technology, simple rituals
Architecture:
● Failure-driven design
● What happens, happens in production
● Fast feedback cycle, slow integration
Engineering:
● It's a software engineering problem
● Continuous process improvement
Value iteration:
● Pull-driven work, initiated by business value needs
● Products, not projects
● Align along value flows
Software factory engineering principles
Immutable images vs Puppet, Ansible
Agile vs Waterfall
Statistical process control vs In prod debugging
Products vs Projects
DevOps vs Dev + Ops
High code vs Low code
What should a company do?
● Everything in-house
○ Works only for big tech
● Vendors - build, not buy
○ Works for families of use cases
○ So far a 10 year gap to tech elite
● Get consultants
○ No competence flow from European big tech to consultants
○ Products, not projects
● Long-term partnerships?
○ Common outside IT
○ Unfamiliar model in IT - cf. cloud resistance
Autoliv general presentation 2017
Scling - data-value-as-a-service
Data value through collaboration
Customer
Data factory
Data platform & lake
data, domain expertise
Value from data!
Rapid data innovation
Learning by doing, in collaboration

More Related Content

Similar to Crossing the data divide

DataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesDataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesLars Albertsson
 
OpenWorld: 4 Real-world Cloud Migration Case Studies
OpenWorld: 4 Real-world Cloud Migration Case StudiesOpenWorld: 4 Real-world Cloud Migration Case Studies
OpenWorld: 4 Real-world Cloud Migration Case StudiesDatavail
 
An Analytics Engineer’s Guide to Streaming With Amy Chen | Current 2022
An Analytics Engineer’s Guide to Streaming With Amy Chen | Current 2022An Analytics Engineer’s Guide to Streaming With Amy Chen | Current 2022
An Analytics Engineer’s Guide to Streaming With Amy Chen | Current 2022HostedbyConfluent
 
Secure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetSecure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetLars Albertsson
 
Driving Business Transformation with Real-Time Analytics Using Apache Kafka a...
Driving Business Transformation with Real-Time Analytics Using Apache Kafka a...Driving Business Transformation with Real-Time Analytics Using Apache Kafka a...
Driving Business Transformation with Real-Time Analytics Using Apache Kafka a...confluent
 
Data kitchen 7 agile steps - big data fest 9-18-2015
Data kitchen   7 agile steps - big data fest 9-18-2015Data kitchen   7 agile steps - big data fest 9-18-2015
Data kitchen 7 agile steps - big data fest 9-18-2015DataKitchen
 
Reducing Cost of Production ML: Feature Engineering Case Study
Reducing Cost of Production ML: Feature Engineering Case StudyReducing Cost of Production ML: Feature Engineering Case Study
Reducing Cost of Production ML: Feature Engineering Case StudyVenkata Pingali
 
OSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdf
OSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdfOSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdf
OSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdfAltinity Ltd
 
MLSEV. Use Case: The Data-Driven Factory
MLSEV. Use Case: The Data-Driven FactoryMLSEV. Use Case: The Data-Driven Factory
MLSEV. Use Case: The Data-Driven FactoryBigML, Inc
 
Dynniq & GoDataDriven - Shaping the future of traffic with IoT and AI
Dynniq & GoDataDriven - Shaping the future of traffic with IoT and AIDynniq & GoDataDriven - Shaping the future of traffic with IoT and AI
Dynniq & GoDataDriven - Shaping the future of traffic with IoT and AIBigDataExpo
 
DN 2017 | Hardware Failure Prediction at Dell-EMC | Ran Taig | Dell
DN 2017 |  Hardware Failure Prediction at Dell-EMC | Ran Taig | DellDN 2017 |  Hardware Failure Prediction at Dell-EMC | Ran Taig | Dell
DN 2017 | Hardware Failure Prediction at Dell-EMC | Ran Taig | DellDataconomy Media
 
Talend 6.1 - What's New in Talend?
Talend 6.1 - What's New in Talend?Talend 6.1 - What's New in Talend?
Talend 6.1 - What's New in Talend?Talend
 
Data Engineer's Lunch #60: Series - Developing Enterprise Consciousness
Data Engineer's Lunch #60: Series - Developing Enterprise ConsciousnessData Engineer's Lunch #60: Series - Developing Enterprise Consciousness
Data Engineer's Lunch #60: Series - Developing Enterprise ConsciousnessAnant Corporation
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...Denodo
 
Reproducibility and experiments management in Machine Learning
Reproducibility and experiments management in Machine Learning Reproducibility and experiments management in Machine Learning
Reproducibility and experiments management in Machine Learning Mikhail Rozhkov
 
Lunch and Learn: You have the data, now what?
Lunch and Learn: You have the data, now what?Lunch and Learn: You have the data, now what?
Lunch and Learn: You have the data, now what?DiUS
 
SOP Planning and Optimization Solution-as-a-Service.pdf
SOP Planning and Optimization Solution-as-a-Service.pdfSOP Planning and Optimization Solution-as-a-Service.pdf
SOP Planning and Optimization Solution-as-a-Service.pdfDavid Barbieri Kennedy
 
About The Event-Driven Data Layer & Adobe Analytics
About The Event-Driven Data Layer & Adobe AnalyticsAbout The Event-Driven Data Layer & Adobe Analytics
About The Event-Driven Data Layer & Adobe AnalyticsKevin Haag
 
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...DATAVERSITY
 
Accelerating ML using Production Feature Engineering
Accelerating ML using Production Feature EngineeringAccelerating ML using Production Feature Engineering
Accelerating ML using Production Feature EngineeringVenkata Pingali
 

Similar to Crossing the data divide (20)

DataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesDataOps - Lean principles and lean practices
DataOps - Lean principles and lean practices
 
OpenWorld: 4 Real-world Cloud Migration Case Studies
OpenWorld: 4 Real-world Cloud Migration Case StudiesOpenWorld: 4 Real-world Cloud Migration Case Studies
OpenWorld: 4 Real-world Cloud Migration Case Studies
 
An Analytics Engineer’s Guide to Streaming With Amy Chen | Current 2022
An Analytics Engineer’s Guide to Streaming With Amy Chen | Current 2022An Analytics Engineer’s Guide to Streaming With Amy Chen | Current 2022
An Analytics Engineer’s Guide to Streaming With Amy Chen | Current 2022
 
Secure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetSecure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budget
 
Driving Business Transformation with Real-Time Analytics Using Apache Kafka a...
Driving Business Transformation with Real-Time Analytics Using Apache Kafka a...Driving Business Transformation with Real-Time Analytics Using Apache Kafka a...
Driving Business Transformation with Real-Time Analytics Using Apache Kafka a...
 
Data kitchen 7 agile steps - big data fest 9-18-2015
Data kitchen   7 agile steps - big data fest 9-18-2015Data kitchen   7 agile steps - big data fest 9-18-2015
Data kitchen 7 agile steps - big data fest 9-18-2015
 
Reducing Cost of Production ML: Feature Engineering Case Study
Reducing Cost of Production ML: Feature Engineering Case StudyReducing Cost of Production ML: Feature Engineering Case Study
Reducing Cost of Production ML: Feature Engineering Case Study
 
OSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdf
OSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdfOSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdf
OSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdf
 
MLSEV. Use Case: The Data-Driven Factory
MLSEV. Use Case: The Data-Driven FactoryMLSEV. Use Case: The Data-Driven Factory
MLSEV. Use Case: The Data-Driven Factory
 
Dynniq & GoDataDriven - Shaping the future of traffic with IoT and AI
Dynniq & GoDataDriven - Shaping the future of traffic with IoT and AIDynniq & GoDataDriven - Shaping the future of traffic with IoT and AI
Dynniq & GoDataDriven - Shaping the future of traffic with IoT and AI
 
DN 2017 | Hardware Failure Prediction at Dell-EMC | Ran Taig | Dell
DN 2017 |  Hardware Failure Prediction at Dell-EMC | Ran Taig | DellDN 2017 |  Hardware Failure Prediction at Dell-EMC | Ran Taig | Dell
DN 2017 | Hardware Failure Prediction at Dell-EMC | Ran Taig | Dell
 
Talend 6.1 - What's New in Talend?
Talend 6.1 - What's New in Talend?Talend 6.1 - What's New in Talend?
Talend 6.1 - What's New in Talend?
 
Data Engineer's Lunch #60: Series - Developing Enterprise Consciousness
Data Engineer's Lunch #60: Series - Developing Enterprise ConsciousnessData Engineer's Lunch #60: Series - Developing Enterprise Consciousness
Data Engineer's Lunch #60: Series - Developing Enterprise Consciousness
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
 
Reproducibility and experiments management in Machine Learning
Reproducibility and experiments management in Machine Learning Reproducibility and experiments management in Machine Learning
Reproducibility and experiments management in Machine Learning
 
Lunch and Learn: You have the data, now what?
Lunch and Learn: You have the data, now what?Lunch and Learn: You have the data, now what?
Lunch and Learn: You have the data, now what?
 
SOP Planning and Optimization Solution-as-a-Service.pdf
SOP Planning and Optimization Solution-as-a-Service.pdfSOP Planning and Optimization Solution-as-a-Service.pdf
SOP Planning and Optimization Solution-as-a-Service.pdf
 
About The Event-Driven Data Layer & Adobe Analytics
About The Event-Driven Data Layer & Adobe AnalyticsAbout The Event-Driven Data Layer & Adobe Analytics
About The Event-Driven Data Layer & Adobe Analytics
 
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
 
Accelerating ML using Production Feature Engineering
Accelerating ML using Production Feature EngineeringAccelerating ML using Production Feature Engineering
Accelerating ML using Production Feature Engineering
 

More from Lars Albertsson

Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Schema management with Scalameta
Schema management with ScalametaSchema management with Scalameta
Schema management with ScalametaLars Albertsson
 
How to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdfHow to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdfLars Albertsson
 
The right side of speed - learning to shift left
The right side of speed - learning to shift leftThe right side of speed - learning to shift left
The right side of speed - learning to shift leftLars Albertsson
 
Eventually, time will kill your data processing
Eventually, time will kill your data processingEventually, time will kill your data processing
Eventually, time will kill your data processingLars Albertsson
 
Eventually, time will kill your data pipeline
Eventually, time will kill your data pipelineEventually, time will kill your data pipeline
Eventually, time will kill your data pipelineLars Albertsson
 
Kubernetes as data platform
Kubernetes as data platformKubernetes as data platform
Kubernetes as data platformLars Albertsson
 
Don't build a data science team
Don't build a data science teamDon't build a data science team
Don't build a data science teamLars Albertsson
 
Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Lars Albertsson
 
10 ways to stumble with big data
10 ways to stumble with big data10 ways to stumble with big data
10 ways to stumble with big dataLars Albertsson
 
Protecting privacy in practice
Protecting privacy in practiceProtecting privacy in practice
Protecting privacy in practiceLars Albertsson
 
Testing data streaming applications
Testing data streaming applicationsTesting data streaming applications
Testing data streaming applicationsLars Albertsson
 
A primer on building real time data-driven products
A primer on building real time data-driven productsA primer on building real time data-driven products
A primer on building real time data-driven productsLars Albertsson
 
Test strategies for data processing pipelines
Test strategies for data processing pipelinesTest strategies for data processing pipelines
Test strategies for data processing pipelinesLars Albertsson
 

More from Lars Albertsson (20)

Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Schema management with Scalameta
Schema management with ScalametaSchema management with Scalameta
Schema management with Scalameta
 
How to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdfHow to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdf
 
Ai legal and ethics
Ai   legal and ethicsAi   legal and ethics
Ai legal and ethics
 
The right side of speed - learning to shift left
The right side of speed - learning to shift leftThe right side of speed - learning to shift left
The right side of speed - learning to shift left
 
Data democratised
Data democratisedData democratised
Data democratised
 
Eventually, time will kill your data processing
Eventually, time will kill your data processingEventually, time will kill your data processing
Eventually, time will kill your data processing
 
Eventually, time will kill your data pipeline
Eventually, time will kill your data pipelineEventually, time will kill your data pipeline
Eventually, time will kill your data pipeline
 
Data ops in practice
Data ops in practiceData ops in practice
Data ops in practice
 
Kubernetes as data platform
Kubernetes as data platformKubernetes as data platform
Kubernetes as data platform
 
Don't build a data science team
Don't build a data science teamDon't build a data science team
Don't build a data science team
 
Big data == lean data
Big data == lean dataBig data == lean data
Big data == lean data
 
Privacy by design
Privacy by designPrivacy by design
Privacy by design
 
Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0
 
10 ways to stumble with big data
10 ways to stumble with big data10 ways to stumble with big data
10 ways to stumble with big data
 
Protecting privacy in practice
Protecting privacy in practiceProtecting privacy in practice
Protecting privacy in practice
 
Testing data streaming applications
Testing data streaming applicationsTesting data streaming applications
Testing data streaming applications
 
A primer on building real time data-driven products
A primer on building real time data-driven productsA primer on building real time data-driven products
A primer on building real time data-driven products
 
Test strategies for data processing pipelines
Test strategies for data processing pipelinesTest strategies for data processing pipelines
Test strategies for data processing pipelines
 

Recently uploaded

RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Bookvip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Bookmanojkuma9823
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxFurkanTasci3
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...ThinkInnovation
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAbdelrhman abooda
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 

Recently uploaded (20)

RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Bookvip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一

Crossing the data divide

  • 1. www.scling.com Crossing the data divide. Lars Albertsson, Founder, Scling. Data Innovation Summit, 2021-10-14
  • 2. The great capability divide. 1000x span in availability metrics. Started 2002 / 2006, launched 2010, killed 2012. 1000 person years, cost $125M. Started 2009-05-10, launched 2009-05-16. $80M revenue in 15 months. https://www.flickr.com/photos/downloadsourcefr/15944373702, CC BY 2.0. Pirate Bay founders' picture used without permission.
  • 3. Efficiency gap, data cost & value ● Data processing produces datasets ○ Each dataset has business value ● Proxy value/cost metric: datasets / day ○ S-M traditional: < 10 ○ Bank, telecom, media: 10-1000. 2014: 6500 datasets / day. 2016: 20000 datasets / day. 2017: 100B events collected / day. 2018: 100000+ datasets / day, 25% of staff use BigQuery. 2016: 1600 000 000 datasets / day. Disruptive value of data, machine learning. Financial, reporting. Insights, data-fed features. Chart labels: effort, value.
  • 4. Manual, mechanised, industrialised ● Muscle-powered ● Few tools ● Human touch for every step ● Direct human control ● Machine tools ● Low investment, direct return ● Scaled processes ● Machine tools ● Challenges: scale, logistics, legal, organisation, faults, ...
  • 5. Manual, mechanised, industrialised ● Hand-built models ● Manual deployment ● Spreadsheets ● Automated training ● Semi-automated deployment ● Data warehouses, notebooks ● Automated QA, monitoring ● Continuous deployment ● Hadoop ecosystem. Data artifacts: 100x 1000x
  • 6. Road towards industrialisation. LAMP stack age - manual analytics. Data warehouse age (DW) - mechanised analytics. Hadoop age - industrialised analytics, data-fed features, machine learning. Significant change in workflows. Early Hadoop: ● Weak indexing ● No transactions ● Weak security ● Batch transformations
  • 7. Road back again. DW. Enterprise big data failures. Post-Hadoop "data engineering" - traditional workflows, new technology
  • 8. Gap is still there. DW. Enterprise big data failures. Post-Hadoop "data engineering" - traditional workflows, new technology. ~10 year capability gap. "data factory engineering". Current data eng focus - narrative, tools, vendors
  • 9. What conclusion from this graph? COVID-19 fatalities / day in Sweden
  • 10. What conclusion from this graph? COVID-19 fatalities / day in Sweden. Fatalities collected during 2 days. Fatalities collected during 4 days. Fatalities collected during 10 days.
  • 11. Normalise data collection to compare. Graph by Adam Altmejd, @adamaltmejd
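A minimal sketch of the normalisation idea on slides 10 and 11, assuming each report arrives as an (event date, report date, count) tuple. The function name `count_at_lag` and the sample numbers are hypothetical and purely illustrative; they are not taken from the talk or from the Swedish data. The point is that days are only comparable when each is counted with the same collection window.

```python
from datetime import date


def count_at_lag(reports, max_lag_days):
    """Sum counts per event date, keeping only reports that arrived
    within max_lag_days of the event. This gives every day the same
    collection window, so recent days are not under-counted relative
    to older ones."""
    totals = {}
    for event_date, report_date, count in reports:
        if (report_date - event_date).days <= max_lag_days:
            totals[event_date] = totals.get(event_date, 0) + count
    return totals


# Made-up illustrative reports: (event date, report date, count).
reports = [
    (date(2020, 4, 1), date(2020, 4, 2), 30),
    (date(2020, 4, 1), date(2020, 4, 5), 25),
    (date(2020, 4, 1), date(2020, 4, 10), 10),
    (date(2020, 4, 8), date(2020, 4, 9), 28),
    (date(2020, 4, 8), date(2020, 4, 10), 12),
]

same_window = count_at_lag(reports, max_lag_days=2)
everything_so_far = count_at_lag(reports, max_lag_days=10)
```

With a 2 day window both dates are counted on equal terms; with a 10 day window the older date has simply had more time to collect reports.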
  • 12. Forecast for analytics with fresh data. Graph by Adam Altmejd, @adamaltmejd
  • 14. From craft to process. Multiple time windows. Assess ingress data quality. Repair broken data from complementary source. Forecast based on history, multiple parameter settings. Assess outcome data quality. Assess forecast success, adapt parameters.
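One way to read "forecast based on history, multiple parameter settings" and "assess forecast success, adapt parameters" on slide 14 is as a backtesting loop. The sketch below assumes a moving-average forecast as a stand-in; the talk does not say which forecasting method was used, and all function names here are hypothetical.

```python
def moving_average_forecast(history, window):
    """Forecast the next value as the mean of the last `window` values."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def backtest_error(series, window):
    """Mean absolute error when each point is forecast from the points before it."""
    errors = [
        abs(moving_average_forecast(series[:i], window) - series[i])
        for i in range(window, len(series))
    ]
    return sum(errors) / len(errors)


def best_window(series, windows):
    """Pick the parameter setting with the lowest historical error."""
    return min(windows, key=lambda w: backtest_error(series, w))


# Made-up, steadily rising series for illustration.
series = [10, 12, 14, 16, 18, 20]
chosen = best_window(series, windows=[1, 3])
```

Rerunning `best_window` as new data arrives is the "adapt parameters" step: the choice is made by measurement rather than by hand.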
  • 16. Sustainable production ML. Multiple models, parameters, features. Assess ingress data quality. Repair broken data from complementary source. Choose model and parameters based on performance and input data. Benchmark models. Try multiple models, measure, A/B test.
  • 17. Data engineering vs data factory engineering. How to organise. How to work. How to build.
  • 18. Data factory engineering principles - technology. Centralised, homogeneous data platform. Functional architecture. Simple technology, simple rituals. ● Minimal experiment friction ○ Centralise first to establish homogeneity ● Democratised functional data processing ○ Raw data + transforms ○ Immutable datasets!
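A small sketch of "raw data + transforms" with immutable datasets from slide 18, assuming datasets are JSON files on a local file system. The helper names (`write_once`, `read_dataset`, `active_users`) and the file layout are hypothetical. The transform is a pure function of its input, and a dataset path can be written only once.

```python
import json
import tempfile
from pathlib import Path


def write_once(path, records):
    """Write a dataset exactly once. Mode "x" makes open() raise
    FileExistsError if the file already exists, so datasets stay immutable."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "x") as f:
        json.dump(records, f)


def read_dataset(path):
    """Load a previously written dataset."""
    with open(path) as f:
        return json.load(f)


def active_users(raw_events):
    """Pure transform: the output depends only on the input records."""
    return sorted({e["user"] for e in raw_events if e["type"] == "login"})


with tempfile.TemporaryDirectory() as tmp:
    raw_path = Path(tmp) / "raw" / "2021-10-14.json"
    out_path = Path(tmp) / "active_users" / "2021-10-14.json"

    # Made-up raw events for illustration.
    write_once(raw_path, [
        {"user": "bob", "type": "login"},
        {"user": "ann", "type": "login"},
        {"user": "bob", "type": "logout"},
    ])
    derived = active_users(read_dataset(raw_path))
    write_once(out_path, derived)

    # A second write to the same dataset is refused rather than overwriting.
    try:
        write_once(out_path, [])
        overwritten = True
    except FileExistsError:
        overwritten = False
```

Because raw data is never modified and transforms are pure, a derived dataset can be rebuilt from scratch under a new path whenever the transform changes.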
  • 19. Data-centric innovation ● Need data from teams ○ willing? ○ backlog? ○ collected? ○ useful? ○ quality? ○ extraction? ○ data governance? ○ history?
  • 20. Big data - a collaboration paradigm. Data platform. Stream storage. Data lake. Data democratised.
  • 21. Data factory engineering principles - architecture. Failure-driven design. What happens, happens in production. Fast feedback cycle, slow integration. ● Batch processing is self healing ○ If you master workflow orchestration ● Low failure impact → high risk → fast cycle
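The "batch processing is self healing" point on slide 21 can be illustrated with a toy orchestrator, assuming one output partition per day and an idempotent `build` function. This is a simplified sketch with hypothetical names, not the API of any particular workflow tool: each scheduled run rebuilds whatever is missing, so a failed day is repaired by the next run.

```python
from datetime import date, timedelta


def missing_partitions(existing, start, end):
    """List the days in [start, end] that have no output partition yet."""
    missing = []
    day = start
    while day <= end:
        if day not in existing:
            missing.append(day)
        day += timedelta(days=1)
    return missing


def heal(existing, start, end, build):
    """Build every missing partition. A failed build leaves its day
    missing, so the next scheduled run picks it up again."""
    failed = []
    for day in missing_partitions(existing, start, end):
        try:
            existing[day] = build(day)
        except Exception:
            failed.append(day)  # deliberately not fatal: retried on the next run
    return failed
```

The design choice is that the orchestrator looks at which outputs exist rather than at which jobs ran, so recovery needs no manual bookkeeping.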
  • 22. Cost of a software error. Online ● User impact ● Data corruption ● Cascading corruption ● Unbounded recovery. Nearline ● Data corruption ● Downstream impact ● Bounded recovery. Offline ● Temporary data corruption ● Downstream impact ● Easy recovery. Diagram labels: Job, Stream, Stream, Job, Stream.
  • 23. Data processing tradeoff. Online, Nearline, Offline. Many nines uptime (99.99.. %). A couple of sevens. Data speed. Innovation speed. Diagram labels: Job, Stream, Stream, Job, Stream.
  • 24. Eliminate infrastructure waste ● Production environment only ○ Dev, test, staging lack production data ● Dark pipelines ○ Run in parallel ○ Monitor diff vs production ○ Roll out slowly? ∆?
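A sketch of the "monitor diff vs production" step for dark pipelines on slide 24, assuming both pipelines emit keyed records that fit in a dictionary. The function names and the idea of a single diff ratio are assumptions for illustration; a real comparison would run over full datasets.

```python
def diff_outputs(production, candidate):
    """Compare the keyed output of the production pipeline with the
    output of a dark (candidate) pipeline run on the same input."""
    keys = production.keys() | candidate.keys()
    return {
        "only_in_production": sorted(k for k in keys if k not in candidate),
        "only_in_candidate": sorted(k for k in keys if k not in production),
        "changed": sorted(
            k for k in keys
            if k in production and k in candidate and production[k] != candidate[k]
        ),
    }


def diff_ratio(production, candidate):
    """Fraction of all keys that differ in any way; a value to track over time."""
    diff = diff_outputs(production, candidate)
    differing = sum(len(v) for v in diff.values())
    total = len(production.keys() | candidate.keys())
    return differing / total if total else 0.0


# Made-up outputs for illustration.
production = {"a": 1, "b": 2, "c": 3}
candidate = {"a": 1, "b": 5, "d": 4}
report = diff_outputs(production, candidate)
ratio = diff_ratio(production, candidate)
```

Tracking the ratio over several runs gives a basis for deciding when the candidate is safe to promote.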
  • 25. Data factory engineering principles - engineering. It's a software engineering problem. Continuous process improvement. ● Quality, reproducibility, versioning, deployment, monitoring, rapid change? ○ Solved software engineering problems! ● Capable, unpolished components ○ Designed for strong processes, CI/CD, testing, observability ○ Ugly interfaces ● Statistical process control, engineered
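As an illustration of "statistical process control, engineered" on slide 25, the sketch below applies classic three-sigma control limits to a pipeline metric such as daily record count. The choice of metric, the three-sigma rule, and the function names are assumptions; the slide does not specify a method.

```python
from statistics import mean, stdev


def control_limits(history, sigmas=3.0):
    """Lower and upper control limits: mean plus or minus `sigmas`
    sample standard deviations of the historical values."""
    centre = mean(history)
    spread = stdev(history)
    return centre - sigmas * spread, centre + sigmas * spread


def in_control(value, history, sigmas=3.0):
    """True if today's value falls inside the control limits."""
    low, high = control_limits(history, sigmas)
    return low <= value <= high


# Made-up daily record counts for illustration.
history = [100, 102, 98, 101, 99]
normal_day = in_control(103, history)
suspicious_day = in_control(150, history)
```

A value outside the limits would typically trigger an alert or hold back downstream jobs rather than silently propagating.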
  • 26. SQL is a power tool, not an industrial robot ● No composition & abstractions ○ Hostile to testing ● Not expressive enough for mature data processing ● Hostile to data quality measurements and repair ○ Hadoop/Spark/Flink have quality primitives built in https://threadreaderapp.com/thread/1353832649664692225.html
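To make the composition and testing argument on slide 26 concrete, here is a sketch of what composable, individually testable transform steps can look like in a general-purpose language. The step names and record shape are hypothetical, and plain Python lists stand in for the distributed collections a framework would provide.

```python
from functools import reduce


def compose(*steps):
    """Chain transform steps left to right into one pipeline function."""
    return lambda records: reduce(lambda acc, step: step(acc), steps, records)


def drop_invalid(records):
    """Quality step: discard records that lack an amount."""
    return [r for r in records if r.get("amount") is not None]


def to_cents(records):
    """Unit conversion step, returning new records instead of mutating."""
    return [{**r, "amount": round(r["amount"] * 100)} for r in records]


pipeline = compose(drop_invalid, to_cents)

# Made-up records for illustration.
result = pipeline([
    {"id": 1, "amount": 1.5},
    {"id": 2, "amount": None},
    {"id": 3, "amount": 2.25},
])
```

Each step is an ordinary function, so it can be unit tested in isolation and reused in other pipelines.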
  • 27. Data factory engineering principles - value iteration. Pull-driven work, initiated by business value needs. Products, not projects. Align along value flows. ● Only business value counts ○ Drives work ○ Few teams along path ● Data is organic ○ Never done, always iterate
  • 28. Data factory engineering principles. How to organise. How to work. How to build. Centralised, homogeneous data platform. Functional architecture. It's a software engineering problem. Pull-driven work, initiated by business value needs. Failure-driven design. Simple technology, simple rituals. What happens, happens in production. Fast feedback cycle, slow integration. Continuous process improvement. Products, not projects. Align along value flows.
  • 29. Software factory engineering principles. Immutable images. Agile. Statistical process control. Products. DevOps. Puppet, Ansible. Waterfall. In prod debugging. Projects. Dev + Ops. High code. Low code.
  • 30. What should a company do? ● Everything in-house ○ Works only for big tech ● Vendors - build, not buy ○ Works for families of use cases ○ So far a 10 year gap to tech elite ● Get consultants ○ No competence flow from European big tech to consultants ○ Products, not projects ● Long-term partnerships? ○ Common outside IT ○ Unfamiliar model in IT - cf. cloud resistance. Autoliv general presentation 2017
  • 31. Scling - data-value-as-a-service. Data value through collaboration. Customer. Data factory. Data platform & lake. data domain expertise. Value from data! Rapid data innovation. Learning by doing, in collaboration.