Data democratised

Lars Albertsson (@lalleal), Founder & Data Engineer, Scling
www.scling.com
Next data analytics & protection, 2019-12-11
Big data adoption
● 2003-2007: Only Google
● 2007-2014: Hadoop era (Europe). Highly technical companies succeed and disrupt.
● 2015-2019: Enterprise adoption (Europe). Big data gone from Gartner hype cycle. “New normal”
● 2019: Many enterprises in production, but big data and machine learning ROI still confined to high-tech.
Data value efficiency gap, aka disrupted or disruptor
Early Spotify recommendations
Creator of Luigi, Annoy
Efficiency gap, latency
“We just took a machine learning pipeline in production after 8 months. Great success!”
Scandinavian retail (pycon.se, 2019)
“Document similarity pipeline finally in production. Estimated 3 months, took 8 months.”
Scandinavian telecom (NDSML Summit 2019)
2016: Data platform approval
2018: Pipeline in production
Dutch bank (Dataworks Summit 2018)
Bonnier News (Riga DevOpsDays 2018)
Platform + 1st pipeline in production. Seven weeks, 1 person.
Scandinavian retail 2018
New pipeline: < 1 day
Mend pipeline: < 1 hour
Spotify DataOps transform, 2013
Platform + 1st pipeline in production. Three weeks, 4 persons. 20 pipelines in 8 months.
Efficiency gap, data cost & value
● Data processing produces datasets
● Each dataset has business value
○ Financial, sales, forecasting reports
○ A/B test, auto completion, insights
○ Recommendations, fraud
● Proxy metric: datasets / day
○ S-M traditional: < 10
○ Bank, telecom, media: 10-1000
○ Spotify: 20 000 datasets / day (2016), 100B events collected / day (2017)
○ Google: 1 600 000 000 datasets / day (2016)
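The datasets / day proxy metric can be computed from whatever record of dataset production a platform keeps. A minimal sketch, assuming a hypothetical log of (day, dataset path) pairs; the function name, log format and paths are illustrative and not from the talk:

```python
from collections import defaultdict
from datetime import date


def datasets_per_day(produced):
    """Count distinct datasets produced on each day.

    `produced` is an iterable of (day, dataset_path) pairs, a hypothetical
    log format assumed for this sketch.
    """
    per_day = defaultdict(set)
    for day, path in produced:
        per_day[day].add(path)  # a rerun of the same dataset counts once
    return {day: len(paths) for day, paths in per_day.items()}


log = [
    (date(2019, 12, 10), "lake/orders/2019-12-10"),
    (date(2019, 12, 10), "lake/sales_report/2019-12-10"),
    (date(2019, 12, 10), "lake/orders/2019-12-10"),  # rerun of the same dataset
    (date(2019, 12, 11), "lake/orders/2019-12-11"),
]
counts = datasets_per_day(log)
```

Counting distinct paths rather than job runs keeps retries from inflating the number.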
Data efficiency key factors
Data democratisation
● Making data available, usable, accessible
DataOps
● Short path from idea to production
● Cross-functional teams
○ Data engineering, domain experts, product, (data science)
○ Aligned with value, not function
● Low cost of failure
○ Machine and human failure
○ Risks ok → move fast
● Engineered operations
Service-oriented organisations
● Teams own services
● Teams own data
Data-centric innovation
● Need data from teams
○ willing?
○ backlog?
○ collected?
○ useful?
○ quality?
○ extraction?
○ data governance?
○ history?
● Innovation friction
Value adding Waste
Centralising data
Data lake
More data - decreased friction
Data lake
Stream storage
Hadoop is dead?
Traditional systems
Mutation
Data pipelines at a glance
[Diagram: Mutation, Data lake, Transformation, Cold store; Immutable, shareable]
Early Hadoop:
● Weak indexing
● No transactions
● Weak security
● Batch transformations
DataOps workflows:
● Immutable, shared data
● Resilient to failure
● Quick error recovery
● Low-risk experiments
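The DataOps workflow properties listed above can be illustrated with a small batch job. This is a minimal sketch under stated assumptions, not the speaker's implementation: `run_job`, `add_total`, the JSON-lines file layout and the file names are all hypothetical. The job only reads its input, writes its output once via a temporary file and rename, and treats an existing output as already done.

```python
import json
import os
import tempfile
from pathlib import Path


def run_job(input_path: Path, output_path: Path, transform) -> bool:
    """Derive one immutable dataset from another.

    Returns True if the job ran, False if the output already existed.
    """
    if output_path.exists():
        return False  # datasets are never overwritten, so reruns are no-ops
    records = [json.loads(line) for line in input_path.read_text().splitlines()]
    result = [transform(r) for r in records]
    # Write to a temporary file, then rename: a crash mid-write leaves no
    # half-finished dataset behind.
    fd, tmp_name = tempfile.mkstemp(dir=str(output_path.parent))
    with os.fdopen(fd, "w") as tmp:
        for r in result:
            tmp.write(json.dumps(r) + "\n")
    os.replace(tmp_name, output_path)
    return True


def add_total(order):
    """Hypothetical business logic for the example."""
    return {**order, "total": order["price"] * order["quantity"]}


lake = Path(tempfile.mkdtemp())
raw = lake / "orders_2019-12-11.jsonl"
raw.write_text('{"price": 10, "quantity": 3}\n')
derived = lake / "order_totals_2019-12-11.jsonl"

first = run_job(raw, derived, add_total)
second = run_job(raw, derived, add_total)  # rerun, e.g. after a failure elsewhere
```

Because the raw input is never modified, recovering from a bug in `add_total` amounts to fixing the code, deleting the derived file and running the job again.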
Late Hadoop adoption
Mutation
“Can you please implement mutability, transactions, SQL, etc?”
“We would like to keep our workflows.”
“Anything, as long as you are buying.”
DataOps workflows:
● Immutable, shared data
● Resilient to failure
● Quick error recovery
● Low-risk experiments
Complex business logic - MDM @ Spotify ~2014
● 10 pipelines like this
● Pipeline dev environment
● Pipeline continuous deployment infrastructure
One team of five engineers
Data value = data + domain expertise + data practices
● Disrupt? https://xkcd.com/1831/
● Adapt?
● Collaborate? Data-value-as-a-service
○ Client data + domain expertise
○ Practices from data leaders
○ Data lake, stream storage
+ 1000s of failures...
Factors of democratisation
Siloed → Shared
Organic → Coordinated
● Distributed storage → Homogeneous storage
● Need-to-know basis → Documentation read+write access
● Closed code ownership → Code read+write access
● Local rituals → Coordinated data governance
● Tribal knowledge → Common glossary, semantics
● Unclear data origin → Common data provenance
● Lay-on-hands deployment → Common DataOps procedures
An e-shopping tale
1. Log in, search for product X
○ X + 100s of accessories, random order
2. Find X in product catalog
○ No link to web shop
3. Put in cart, delivery?
○ Ask for address, customer club number
4. …
Full story: “Avoid artificial stupidity” blog post
1. Log in, search for product X
○ Popular items first
2. Find X in product catalog
○ Take me to shop
3. Put in cart, delivery?
○ I am logged in
4. ...
Document a clean architecture
● Include minimal governance, security, privacy
[Diagram: Mutation, Data lake, Transformation, Cold store; Immutable, shareable]
A lean start
● Align team with use case
○ Zero budget
● Ingest only necessary data
● Key technical component: Workflow orchestrator (Luigi / Airflow)
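To show what a workflow orchestrator contributes, here is a minimal standard-library sketch of the core idea: tasks declare their dependencies and their output, and a runner executes only what is missing. It does not use or reproduce the Luigi or Airflow API; `Task`, `build` and the file names are hypothetical, and the sketch has no cycle detection, scheduling or retries.

```python
import tempfile
from pathlib import Path


class Task:
    """A unit of work that produces exactly one output file."""

    def __init__(self, name, output, requires=(), action=None):
        self.name = name
        self.output = Path(output)
        self.requires = list(requires)
        self.action = action

    def complete(self):
        # A task counts as done when its output exists.
        return self.output.exists()


def build(task, ran=None):
    """Run `task` after its dependencies, skipping anything already complete."""
    ran = [] if ran is None else ran
    if task.complete():
        return ran
    for dep in task.requires:
        build(dep, ran)
    task.action(task)
    ran.append(task.name)
    return ran


workdir = Path(tempfile.mkdtemp())

ingest = Task(
    "ingest",
    workdir / "orders.csv",
    action=lambda t: t.output.write_text("a,2\nb,3\n"),
)


def count_rows(t):
    """Hypothetical transformation: count rows in the ingested file."""
    rows = t.requires[0].output.read_text().splitlines()
    t.output.write_text(str(len(rows)))


report = Task("report", workdir / "row_count.txt", requires=[ingest], action=count_rows)

first_run = build(report)   # runs ingest, then report
second_run = build(report)  # nothing to do, outputs already exist
```

Using output existence as the completion check means a failed or interrupted run can be resumed by invoking the same target again.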
An MVP is minimal
In scope
● Minimal privacy - limiting access
● Security
● One DB source
● One use case
Out of scope
● Data scalability
● High availability
● Durability
● Most privacy
● Self service
● Data quality
● Automation
● Clusters
● Auditability
● Scalable BI
● Fill lake
● Real-time
● Lineage
Journey towards data value
● Remove complexity wherever possible
○ Unfamiliar tools may be less complex
● Pay attention to human and social factors
“Five dysfunctions of a data engineering team” - Jesse Anderson
● Only database admins
● Set up for failure
● No one understands schema
● No veterans
● Too ambitious
“Avoiding big data antipatterns” - Alex Holmes
● Big data tech for small data
● Point-to-point data integration
● Single tool for the job
● Excess volume or precision
● Lack of security