SlideShare a Scribd company logo
www.scling.com
Holistic data application quality
Lars Albertsson, Founder, Scling
Data innovation summit 2022-05-06
1
www.scling.com
Data product quality assurance
● Product quality = f(code, data)
○ Cannot do full QA on code only
○ Only real data is production data
● Test code in CI/CD process
○ Quick QA cycle = quick production deployment
● Test data in production
○ Measure, monitor, validate
2
www.scling.com
Observations on data application quality
● Data application development is unpredictable
○ Need to assess improvement with precision
● Quality is a means to (iteration) speed
○ Only in some contexts
○ Iteration speed is a means to quality
● Local quality effort easy, holistic effort hard
● Companies that focus on end-to-end delivery value are more successful with data
○ Align along value delivery flows
○ Measure and act on holistic properties
3
www.scling.com
Themes of good presentations, IMHO
● We have seen lots of X. Here are the patterns.
● We have context Y. Here is how we work.
● We did a thing Z. Here is what we learnt.
4
www.scling.com
100x 100x
Data artifacts produced
Manual, mechanised, industrialised
5
Spotify's pipelines ~2013
www.scling.com
Craft vs industry
6
● Each step steered by human
○ Or primitive automation
● Improving artifacts
● Craft is primary competence
● Components made for humans
○ Look nice, "easy to use"
● Autonomous processes
● Improving process that creates artifacts
● Multitude of competences
● Some components unusable by humans
○ Hard, greasy
○ Made for integration
● Processes often include craft steps
www.scling.com
Road towards industrialisation
7
Data warehouse age -
mechanised analytics
DW
LAMP stack age -
manual analytics
Hadoop age -
industrialised analytics,
data-fed features,
machine learning
Significant change in workflows
Early Hadoop:
● Weak indexing
● No transactions
● Weak security
● Batch transformations
www.scling.com
A temporary regression
8
DW
Enterprise big data failures
"Modern data stack" -
traditional workflows, new technology
4GL / UML phase of data engineering
www.scling.com
Gap is still there
9
DW
~10 year capability gap
"data factory engineering"
Enterprise big data failures
"Modern data stack" -
traditional workflows, new technology
4GL / UML phase of data engineering
Data engineering education
www.scling.com
From craft to process
10
www.scling.com
From craft to process
11
Multiple time windows
Assess ingress data quality
Repair broken data from
complementary source
Forecast based on history,
multiple parameter settings
Assess outcome data quality
Assess forecast success,
adapt parameters
www.scling.com
● Toyota: Low defect rates AND high margins per vehicle
● State of DevOps report: High reliability AND high deployment rate
○ We have industrialised software engineering
Quality, speed - choose two
12
Quality
vs
Speed
Quality
AND
Speed
1000x span in
availability metrics
www.scling.com
How we work, briefly
● Industrialised context
○ Old school big data - minus Hadoop + MapReduce
○ No craft adaptations - mutability, lakehouse, dbt, warehouses, …
○ High code - Scala+Python over SQL+Yaml
● Strong data refinement process
○ Immutability
○ Functional architectural principles
● Strong CI/CD process
○ Local execution, no cloud resources
○ Strong enough for automated patch management
● Data quality evaluated in production
○ Dark pipelines hidden from consumption
13
www.scling.com
Measuring correctness: counters
● User-defined
● Technical from framework
○ Execution time
○ Memory consumption
○ Data volumes
○ ...
14
case class Order(item: ItemId, userId: UserId)
case class User(id: UserId, country: String)
val orders = read(orderPath)
val users = read(userPath)
val orderNoUserCounter = longAccumulator("order-no-user")
val joined: C[(Order, Option[User])] = orders
.groupBy(_.userId)
.leftJoin(users.groupBy(_.id))
.values
val orderWithUser: C[(Order, User)] = joined
.flatMap( orderUser match
case (order, Some(user)) => Some((order, user))
case (order, None) => {
orderNoUserCounter.add(1)
None
})
SQL: Nope
www.scling.com 15
Measuring correctness: counters
● Processing tool (Spark/Hadoop) counters
○ Odd code path => bump counter
○ System metrics
Hadoop / Spark counters DB
Standard graphing tools
Standard
alerting
service
www.scling.com 16
Measuring consistency: pipelines
● Processing tool (Spark/Hadoop) counters
○ Odd code path => bump counter
○ System metrics
● Dedicated quality assessment pipelines
DB
Quality assessment job
Quality metadataset (tiny)
Standard graphing tools
Standard
alerting
service
www.scling.com 17
Production critical upgrade
● Dual datasets during transition
● Run downstream parallel pipelines
○ Cheap
○ Low risk
○ Easy rollback
● Testable end-to-end
No dev & staging environment needed!
∆?
www.scling.com 18
The unknown unknowns
● Measure user behaviour
○ E.g. session length, engagement, funnel
● Revert to old egress dataset if necessary
Measure interactions
DB
Standard
alerting
service
Stream Job
www.scling.com 19
Machine learning operations, simplified
● Multiple trained models
○ Select at run time
● Measure user behaviour
○ E.g. session length, engagement, funnel
● Ready to revert to
○ old models
○ simpler models
Measure interactions
Rendez-
vous
DB
Standard
alerting
service
Stream Job
"The required surrounding
infrastructure is vast and
complex."
- Google
www.scling.com 20
Conditional consumption
● Conditional consumption
○ Express in workflow orchestration
○ Read metrics DB, quality dataset
○ Producer can recommend, not decide
● Insufficient quality?
○ Wait for bug fix
○ Use older/newer input dataset
Recommendation
metrics
Report
www.scling.com
Data agility
21
● Siloed: 6+ months
Cultural work
● Autonomous: 1 month
Technical work
● Coordinated: days
Data lake
∆
∆
Latency?
www.scling.com
Cognitive waste
● Why do we have 25 time formats?
○ ISO 8601, UTC assumed
○ ISO 8601 + timezone
○ Millis since epoch, UTC
○ Nanos since epoch, UTC
○ Millis since epoch, user local time
○ …
○ Float of seconds since epoch, as string.
WTF?!?
● my-kafka-topic-name, your_topic_name
22
"I don't know what will break downstream"
paralyses many data organisations.
www.scling.com
Full pipeline testing
● Test downstream impact of change
● From change to egress point
○ Possibly include also presentation layer
● Test across team boundaries
● Difficult for organisations with weak coordination / strong cultural autonomy
23
www.scling.com
What about a layer of indirection?
24
file://test_data/
Testing?
Prod
Data access layer
- In conflict with autonomous culture
- Adds complexity to experimental cycle
- In conflict with data component ecosystems
+ Explicit
+ Debuggable
www.scling.com
Warped pipeline testing
25
file://test_input/ file://test_output/
Workflow manager DAG "warp" rewrite
1. Generate input 3. Verify output
f() p()
Stream
Stream
Production
www.scling.com
Warped pipeline testing
26
Prod cloud → test cloud rewrite
1. Generate input 3. Verify output
f() p()
Stream
Stream
Production
www.scling.com
Aspect-oriented testing
Aka hideous monkeypatching.
Rules for DAG transformation
27
www.scling.com
Monkeypatching exacts a price
Better than indirection? Depends on cultural cost of coordination.
28
- Implicit
- Framework difficult
to debug
+ Minimal code adaptation
+ Can test pipelines owned by
other teams
+ Goes through third-party
components
Great value from full pipeline testing!
www.scling.com
Why do we share this?
● Speed up transition out of data 4GL
● Data factory engineering is learnt inside factories.
○ Few in Europe
● Build factories for clients? Tried that.
● Sell factory machines? Skills & workflows matter more.
● Other ways of collaborating. Partnership innovation?
○ Common outside IT
○ Unfamiliar model in IT - cf. cloud resistance
29
Autoliv general presentation 2017
www.scling.com
Scling - data-factory-as-a-service
30
Data value through collaboration
Customer
Data factory
Data platform & lake
data
domain
expertise
Value from data!
Rapid data
innovation
Learning by doing,
in collaboration

More Related Content

What's hot

ETL Testing Training Presentation
ETL Testing Training PresentationETL Testing Training Presentation
ETL Testing Training Presentation
Apurba Biswas
 
SQL Server AlwaysOn for Dummies SQLSaturday #202 Edition
SQL Server AlwaysOn for Dummies SQLSaturday #202 EditionSQL Server AlwaysOn for Dummies SQLSaturday #202 Edition
SQL Server AlwaysOn for Dummies SQLSaturday #202 Edition
Mark Broadbent
 
Building a Winning Roadmap for Analytics
Building a Winning Roadmap for AnalyticsBuilding a Winning Roadmap for Analytics
Building a Winning Roadmap for Analytics
Ironside
 
Data Vault Vs Data Lake
Data Vault Vs Data LakeData Vault Vs Data Lake
Data Vault Vs Data Lake
Calum Miller
 
Data Governance Best Practices
Data Governance Best PracticesData Governance Best Practices
Data Governance Best Practices
Boris Otto
 
Data Modeling Fundamentals
Data Modeling FundamentalsData Modeling Fundamentals
Data Modeling Fundamentals
DATAVERSITY
 
Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...
Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...
Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...
Databricks
 
Fog computing
Fog computingFog computing
Fog computing
Abdul Qadir
 
Migrating your traditional Data Warehouse to a Modern Data Lake
Migrating your traditional Data Warehouse to a Modern Data LakeMigrating your traditional Data Warehouse to a Modern Data Lake
Migrating your traditional Data Warehouse to a Modern Data Lake
Amazon Web Services
 
Emerging Trends in Data Architecture – What’s the Next Big Thing
Emerging Trends in Data Architecture – What’s the Next Big ThingEmerging Trends in Data Architecture – What’s the Next Big Thing
Emerging Trends in Data Architecture – What’s the Next Big Thing
DATAVERSITY
 
Datacenter overview
Datacenter overviewDatacenter overview
Datacenter overview
Heather Brotherton
 
Elastic Data Warehousing
Elastic Data WarehousingElastic Data Warehousing
Elastic Data Warehousing
Snowflake Computing
 
Data Strategy - Executive MBA Class, IE Business School
Data Strategy - Executive MBA Class, IE Business SchoolData Strategy - Executive MBA Class, IE Business School
Data Strategy - Executive MBA Class, IE Business School
Gam Dias
 
Creating Your Data Governance Dashboard
Creating Your Data Governance DashboardCreating Your Data Governance Dashboard
Creating Your Data Governance Dashboard
Trillium Software
 
Enterprise Architecture Frameworks
Enterprise Architecture FrameworksEnterprise Architecture Frameworks
Enterprise Architecture Frameworks
Stephen Lahanas
 
Is Enterprise Data Literacy Possible?
Is Enterprise Data Literacy Possible?Is Enterprise Data Literacy Possible?
Is Enterprise Data Literacy Possible?
DATAVERSITY
 
Data-Ed Webinar: Data Quality Success Stories
Data-Ed Webinar: Data Quality Success StoriesData-Ed Webinar: Data Quality Success Stories
Data-Ed Webinar: Data Quality Success Stories
DATAVERSITY
 
Data Quality Strategy: A Step-by-Step Approach
Data Quality Strategy: A Step-by-Step ApproachData Quality Strategy: A Step-by-Step Approach
Data Quality Strategy: A Step-by-Step Approach
FindWhitePapers
 
Artifacts to Enable Data Goverance
Artifacts to Enable Data GoveranceArtifacts to Enable Data Goverance
Artifacts to Enable Data Goverance
DATAVERSITY
 
POWER POINT PRESENTATION ON DATA CENTER
POWER POINT PRESENTATION ON DATA CENTERPOWER POINT PRESENTATION ON DATA CENTER
POWER POINT PRESENTATION ON DATA CENTER
vivekprajapatiankur
 

What's hot (20)

ETL Testing Training Presentation
ETL Testing Training PresentationETL Testing Training Presentation
ETL Testing Training Presentation
 
SQL Server AlwaysOn for Dummies SQLSaturday #202 Edition
SQL Server AlwaysOn for Dummies SQLSaturday #202 EditionSQL Server AlwaysOn for Dummies SQLSaturday #202 Edition
SQL Server AlwaysOn for Dummies SQLSaturday #202 Edition
 
Building a Winning Roadmap for Analytics
Building a Winning Roadmap for AnalyticsBuilding a Winning Roadmap for Analytics
Building a Winning Roadmap for Analytics
 
Data Vault Vs Data Lake
Data Vault Vs Data LakeData Vault Vs Data Lake
Data Vault Vs Data Lake
 
Data Governance Best Practices
Data Governance Best PracticesData Governance Best Practices
Data Governance Best Practices
 
Data Modeling Fundamentals
Data Modeling FundamentalsData Modeling Fundamentals
Data Modeling Fundamentals
 
Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...
Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...
Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...
 
Fog computing
Fog computingFog computing
Fog computing
 
Migrating your traditional Data Warehouse to a Modern Data Lake
Migrating your traditional Data Warehouse to a Modern Data LakeMigrating your traditional Data Warehouse to a Modern Data Lake
Migrating your traditional Data Warehouse to a Modern Data Lake
 
Emerging Trends in Data Architecture – What’s the Next Big Thing
Emerging Trends in Data Architecture – What’s the Next Big ThingEmerging Trends in Data Architecture – What’s the Next Big Thing
Emerging Trends in Data Architecture – What’s the Next Big Thing
 
Datacenter overview
Datacenter overviewDatacenter overview
Datacenter overview
 
Elastic Data Warehousing
Elastic Data WarehousingElastic Data Warehousing
Elastic Data Warehousing
 
Data Strategy - Executive MBA Class, IE Business School
Data Strategy - Executive MBA Class, IE Business SchoolData Strategy - Executive MBA Class, IE Business School
Data Strategy - Executive MBA Class, IE Business School
 
Creating Your Data Governance Dashboard
Creating Your Data Governance DashboardCreating Your Data Governance Dashboard
Creating Your Data Governance Dashboard
 
Enterprise Architecture Frameworks
Enterprise Architecture FrameworksEnterprise Architecture Frameworks
Enterprise Architecture Frameworks
 
Is Enterprise Data Literacy Possible?
Is Enterprise Data Literacy Possible?Is Enterprise Data Literacy Possible?
Is Enterprise Data Literacy Possible?
 
Data-Ed Webinar: Data Quality Success Stories
Data-Ed Webinar: Data Quality Success StoriesData-Ed Webinar: Data Quality Success Stories
Data-Ed Webinar: Data Quality Success Stories
 
Data Quality Strategy: A Step-by-Step Approach
Data Quality Strategy: A Step-by-Step ApproachData Quality Strategy: A Step-by-Step Approach
Data Quality Strategy: A Step-by-Step Approach
 
Artifacts to Enable Data Goverance
Artifacts to Enable Data GoveranceArtifacts to Enable Data Goverance
Artifacts to Enable Data Goverance
 
POWER POINT PRESENTATION ON DATA CENTER
POWER POINT PRESENTATION ON DATA CENTERPOWER POINT PRESENTATION ON DATA CENTER
POWER POINT PRESENTATION ON DATA CENTER
 

Similar to Holistic data application quality

Data ops in practice - Swedish style
Data ops in practice - Swedish styleData ops in practice - Swedish style
Data ops in practice - Swedish style
Lars Albertsson
 
Engineering data quality
Engineering data qualityEngineering data quality
Engineering data quality
Lars Albertsson
 
Test strategies for data processing pipelines
Test strategies for data processing pipelinesTest strategies for data processing pipelines
Test strategies for data processing pipelines
Lars Albertsson
 
Crossing the data divide
Crossing the data divideCrossing the data divide
Crossing the data divide
Lars Albertsson
 
Data engineering in 10 years.pdf
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdf
Lars Albertsson
 
DataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesDataOps - Lean principles and lean practices
DataOps - Lean principles and lean practices
Lars Albertsson
 
Secure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetSecure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budget
Lars Albertsson
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 
The lean principles of data ops
The lean principles of data opsThe lean principles of data ops
The lean principles of data ops
Lars Albertsson
 
Testing data streaming applications
Testing data streaming applicationsTesting data streaming applications
Testing data streaming applications
Lars Albertsson
 
Data Pipline Observability meetup
Data Pipline Observability meetup Data Pipline Observability meetup
Data Pipline Observability meetup
Omid Vahdaty
 
Schema management with Scalameta
Schema management with ScalametaSchema management with Scalameta
Schema management with Scalameta
Lars Albertsson
 
Scaling up uber's real time data analytics
Scaling up uber's real time data analyticsScaling up uber's real time data analytics
Scaling up uber's real time data analytics
Xiang Fu
 
Taming the reproducibility crisis
Taming the reproducibility crisisTaming the reproducibility crisis
Taming the reproducibility crisis
Lars Albertsson
 
Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0
Lars Albertsson
 
Structured Streaming in Spark
Structured Streaming in SparkStructured Streaming in Spark
Structured Streaming in Spark
Digital Vidya
 
Scaling Ride-Hailing with Machine Learning on MLflow
Scaling Ride-Hailing with Machine Learning on MLflowScaling Ride-Hailing with Machine Learning on MLflow
Scaling Ride-Hailing with Machine Learning on MLflow
Databricks
 
StarWest 2019 - End to end testing: Stupid or Legit?
StarWest 2019 - End to end testing: Stupid or Legit?StarWest 2019 - End to end testing: Stupid or Legit?
StarWest 2019 - End to end testing: Stupid or Legit?
mabl
 
Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...
Omid Vahdaty
 
Production ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeProduction ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ waze
Ido Shilon
 

Similar to Holistic data application quality (20)

Data ops in practice - Swedish style
Data ops in practice - Swedish styleData ops in practice - Swedish style
Data ops in practice - Swedish style
 
Engineering data quality
Engineering data qualityEngineering data quality
Engineering data quality
 
Test strategies for data processing pipelines
Test strategies for data processing pipelinesTest strategies for data processing pipelines
Test strategies for data processing pipelines
 
Crossing the data divide
Crossing the data divideCrossing the data divide
Crossing the data divide
 
Data engineering in 10 years.pdf
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdf
 
DataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesDataOps - Lean principles and lean practices
DataOps - Lean principles and lean practices
 
Secure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetSecure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budget
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
 
The lean principles of data ops
The lean principles of data opsThe lean principles of data ops
The lean principles of data ops
 
Testing data streaming applications
Testing data streaming applicationsTesting data streaming applications
Testing data streaming applications
 
Data Pipline Observability meetup
Data Pipline Observability meetup Data Pipline Observability meetup
Data Pipline Observability meetup
 
Schema management with Scalameta
Schema management with ScalametaSchema management with Scalameta
Schema management with Scalameta
 
Scaling up uber's real time data analytics
Scaling up uber's real time data analyticsScaling up uber's real time data analytics
Scaling up uber's real time data analytics
 
Taming the reproducibility crisis
Taming the reproducibility crisisTaming the reproducibility crisis
Taming the reproducibility crisis
 
Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0
 
Structured Streaming in Spark
Structured Streaming in SparkStructured Streaming in Spark
Structured Streaming in Spark
 
Scaling Ride-Hailing with Machine Learning on MLflow
Scaling Ride-Hailing with Machine Learning on MLflowScaling Ride-Hailing with Machine Learning on MLflow
Scaling Ride-Hailing with Machine Learning on MLflow
 
StarWest 2019 - End to end testing: Stupid or Legit?
StarWest 2019 - End to end testing: Stupid or Legit?StarWest 2019 - End to end testing: Stupid or Legit?
StarWest 2019 - End to end testing: Stupid or Legit?
 
Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...
 
Production ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeProduction ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ waze
 

More from Lars Albertsson

Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
Lars Albertsson
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
Lars Albertsson
 
How to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdfHow to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdf
Lars Albertsson
 
The 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdfThe 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdf
Lars Albertsson
 
Ai legal and ethics
Ai   legal and ethicsAi   legal and ethics
Ai legal and ethics
Lars Albertsson
 
The right side of speed - learning to shift left
The right side of speed - learning to shift leftThe right side of speed - learning to shift left
The right side of speed - learning to shift left
Lars Albertsson
 
Mortal analytics - Covid-19 and the problem of data quality
Mortal analytics - Covid-19 and the problem of data qualityMortal analytics - Covid-19 and the problem of data quality
Mortal analytics - Covid-19 and the problem of data quality
Lars Albertsson
 
Data democratised
Data democratisedData democratised
Data democratised
Lars Albertsson
 
Eventually, time will kill your data processing
Eventually, time will kill your data processingEventually, time will kill your data processing
Eventually, time will kill your data processing
Lars Albertsson
 
Eventually, time will kill your data pipeline
Eventually, time will kill your data pipelineEventually, time will kill your data pipeline
Eventually, time will kill your data pipeline
Lars Albertsson
 
Data ops in practice
Data ops in practiceData ops in practice
Data ops in practice
Lars Albertsson
 
Kubernetes as data platform
Kubernetes as data platformKubernetes as data platform
Kubernetes as data platform
Lars Albertsson
 
Don't build a data science team
Don't build a data science teamDon't build a data science team
Don't build a data science team
Lars Albertsson
 
Big data == lean data
Big data == lean dataBig data == lean data
Big data == lean data
Lars Albertsson
 
Privacy by design
Privacy by designPrivacy by design
Privacy by design
Lars Albertsson
 
10 ways to stumble with big data
10 ways to stumble with big data10 ways to stumble with big data
10 ways to stumble with big data
Lars Albertsson
 
Protecting privacy in practice
Protecting privacy in practiceProtecting privacy in practice
Protecting privacy in practice
Lars Albertsson
 
A primer on building real time data-driven products
A primer on building real time data-driven productsA primer on building real time data-driven products
A primer on building real time data-driven products
Lars Albertsson
 

More from Lars Albertsson (18)

Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
How to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdfHow to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdf
 
The 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdfThe 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdf
 
Ai legal and ethics
Ai   legal and ethicsAi   legal and ethics
Ai legal and ethics
 
The right side of speed - learning to shift left
The right side of speed - learning to shift leftThe right side of speed - learning to shift left
The right side of speed - learning to shift left
 
Mortal analytics - Covid-19 and the problem of data quality
Mortal analytics - Covid-19 and the problem of data qualityMortal analytics - Covid-19 and the problem of data quality
Mortal analytics - Covid-19 and the problem of data quality
 
Data democratised
Data democratisedData democratised
Data democratised
 
Eventually, time will kill your data processing
Eventually, time will kill your data processingEventually, time will kill your data processing
Eventually, time will kill your data processing
 
Eventually, time will kill your data pipeline
Eventually, time will kill your data pipelineEventually, time will kill your data pipeline
Eventually, time will kill your data pipeline
 
Data ops in practice
Data ops in practiceData ops in practice
Data ops in practice
 
Kubernetes as data platform
Kubernetes as data platformKubernetes as data platform
Kubernetes as data platform
 
Don't build a data science team
Don't build a data science teamDon't build a data science team
Don't build a data science team
 
Big data == lean data
Big data == lean dataBig data == lean data
Big data == lean data
 
Privacy by design
Privacy by designPrivacy by design
Privacy by design
 
10 ways to stumble with big data
10 ways to stumble with big data10 ways to stumble with big data
10 ways to stumble with big data
 
Protecting privacy in practice
Protecting privacy in practiceProtecting privacy in practice
Protecting privacy in practice
 
A primer on building real time data-driven products
A primer on building real time data-driven productsA primer on building real time data-driven products
A primer on building real time data-driven products
 

Recently uploaded

Data Preprocessing Cheatsheet for learners
Data Preprocessing Cheatsheet for learnersData Preprocessing Cheatsheet for learners
Data Preprocessing Cheatsheet for learners
mohamed Ibrahim
 
the potential of the development of the Ford–Fulkerson algorithm to solve the...
the potential of the development of the Ford–Fulkerson algorithm to solve the...the potential of the development of the Ford–Fulkerson algorithm to solve the...
the potential of the development of the Ford–Fulkerson algorithm to solve the...
huseindihon
 
Female Service Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Se...
Female Service Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Se...Female Service Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Se...
Female Service Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Se...
dizzycaye
 
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in CityGirls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
gargnatasha985
 
New Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And N...
New Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And N...New Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And N...
New Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And N...
tanupasswan6
 
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
kinni singh$A17
 
Harendra Singh, AI Strategy and Consulting Portfolio
Harendra Singh, AI Strategy and Consulting PortfolioHarendra Singh, AI Strategy and Consulting Portfolio
Harendra Singh, AI Strategy and Consulting Portfolio
harendmgr
 
Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...
Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...
Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...
birajmohan012
 
DataScienceConcept_Kanchana_Weerasinghe.pptx
DataScienceConcept_Kanchana_Weerasinghe.pptxDataScienceConcept_Kanchana_Weerasinghe.pptx
DataScienceConcept_Kanchana_Weerasinghe.pptx
Kanchana Weerasinghe
 
Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
revolutionary575
 
Experience, Excellence & Commitment are the characteristics that describe Fla...
Experience, Excellence & Commitment are the characteristics that describe Fla...Experience, Excellence & Commitment are the characteristics that describe Fla...
Experience, Excellence & Commitment are the characteristics that describe Fla...
kittycrispy617
 
Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...
Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...
Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...
sharonblush
 
Female Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service A...
Female Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service A...Female Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service A...
Female Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service A...
45unexpected
 
potential development of the A* search algorithm specifically
potential development of the A* search algorithm specificallypotential development of the A* search algorithm specifically
potential development of the A* search algorithm specifically
huseindihon
 
Research proposal seminar ,Research Methodology
Research proposal seminar ,Research MethodologyResearch proposal seminar ,Research Methodology
Research proposal seminar ,Research Methodology
doctorzlife786
 
potential usefulness of multi-agent maze-solving in general
potential usefulness of multi-agent maze-solving in generalpotential usefulness of multi-agent maze-solving in general
potential usefulness of multi-agent maze-solving in general
huseindihon
 
🚂🚘 Premium Girls Call Bangalore 🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
🚂🚘 Premium Girls Call Bangalore  🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...🚂🚘 Premium Girls Call Bangalore  🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
🚂🚘 Premium Girls Call Bangalore 🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
bhupeshkumar0889
 
transgenders community data in india by govt
transgenders community data in india by govttransgenders community data in india by govt
transgenders community data in india by govt
palanisamyiiiier
 
Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...
Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...
Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...
tanupasswan6
 
CHAPTER-1-Introduction-to-Marketing.pptx
CHAPTER-1-Introduction-to-Marketing.pptxCHAPTER-1-Introduction-to-Marketing.pptx
CHAPTER-1-Introduction-to-Marketing.pptx
girewiy968
 

Recently uploaded (20)

Data Preprocessing Cheatsheet for learners
Data Preprocessing Cheatsheet for learnersData Preprocessing Cheatsheet for learners
Data Preprocessing Cheatsheet for learners
 
the potential of the development of the Ford–Fulkerson algorithm to solve the...
the potential of the development of the Ford–Fulkerson algorithm to solve the...the potential of the development of the Ford–Fulkerson algorithm to solve the...
the potential of the development of the Ford–Fulkerson algorithm to solve the...
 
Female Service Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Se...
Female Service Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Se...Female Service Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Se...
Female Service Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Se...
 
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in CityGirls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
 
New Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And N...
New Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And N...New Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And N...
New Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And N...
 
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
 
Harendra Singh, AI Strategy and Consulting Portfolio
Harendra Singh, AI Strategy and Consulting PortfolioHarendra Singh, AI Strategy and Consulting Portfolio
Harendra Singh, AI Strategy and Consulting Portfolio
 
Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...
Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...
Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...
 
DataScienceConcept_Kanchana_Weerasinghe.pptx
DataScienceConcept_Kanchana_Weerasinghe.pptxDataScienceConcept_Kanchana_Weerasinghe.pptx
DataScienceConcept_Kanchana_Weerasinghe.pptx
 
Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
 
Experience, Excellence & Commitment are the characteristics that describe Fla...
Experience, Excellence & Commitment are the characteristics that describe Fla...Experience, Excellence & Commitment are the characteristics that describe Fla...
Experience, Excellence & Commitment are the characteristics that describe Fla...
 
Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...
Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...
Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...
 
Female Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service A...
Female Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service A...Female Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service A...
Female Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service A...
 
potential development of the A* search algorithm specifically
potential development of the A* search algorithm specificallypotential development of the A* search algorithm specifically
potential development of the A* search algorithm specifically
 
Research proposal seminar ,Research Methodology
Research proposal seminar ,Research MethodologyResearch proposal seminar ,Research Methodology
Research proposal seminar ,Research Methodology
 
potential usefulness of multi-agent maze-solving in general
potential usefulness of multi-agent maze-solving in generalpotential usefulness of multi-agent maze-solving in general
potential usefulness of multi-agent maze-solving in general
 
🚂🚘 Premium Girls Call Bangalore 🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
🚂🚘 Premium Girls Call Bangalore  🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...🚂🚘 Premium Girls Call Bangalore  🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
🚂🚘 Premium Girls Call Bangalore 🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
 
transgenders community data in india by govt
transgenders community data in india by govttransgenders community data in india by govt
transgenders community data in india by govt
 
Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...
Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...
Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...
 
CHAPTER-1-Introduction-to-Marketing.pptx
CHAPTER-1-Introduction-to-Marketing.pptxCHAPTER-1-Introduction-to-Marketing.pptx
CHAPTER-1-Introduction-to-Marketing.pptx
 

Holistic data application quality

  • 1. www.scling.com Holistic data application quality Lars Albertsson, Founder, Scling Data innovation summit 2022-05-06 1
  • 2. www.scling.com Data product quality assurance ● Product quality = f(code, data) ○ Cannot do full QA on code only ○ Only real data is production data ● Test code in CI/CD process ○ Quick QA cycle = quick production deployment ● Test data in production ○ Measure, monitor, validate 2
  • 3. www.scling.com Observations on data application quality ● Data application development is unpredictable ○ Need to assess improvement with precision ● Quality is a means to (iteration) speed ○ Only in some contexts ○ Iteration speed is a means to quality ● Local quality effort easy, holistic effort hard ● Companies that focus on end-to-end delivery value are more successful with data ○ Align along value delivery flows ○ Measure and act on holistic properties 3
  • 4. www.scling.com Themes of good presentations, IMHO ● We have seen lots of X. Here are the patterns. ● We have context Y. Here is how we work. ● We did a thing Z. Here is what we learnt. 4
  • 5. www.scling.com 100x 100x Data artifacts produced Manual, mechanised, industrialised 5 Spotify's pipelines ~2013
  • 6. www.scling.com Craft vs industry 6 ● Each step steered by human ○ Or primitive automation ● Improving artifacts ● Craft is primary competence ● Components made for humans ○ Look nice, "easy to use" ● Autonomous processes ● Improving process that creates artifacts ● Multitude of competences ● Some components unusable by humans ○ Hard, greasy ○ Made for integration ● Processes often include craft steps
  • 7. www.scling.com Road towards industrialisation 7 Data warehouse age - mechanised analytics DW LAMP stack age - manual analytics Hadoop age - industrialised analytics, data-fed features, machine learning Significant change in workflows Early Hadoop: ● Weak indexing ● No transactions ● Weak security ● Batch transformations
  • 8. www.scling.com A temporary regression 8 DW Enterprise big data failures "Modern data stack" - traditional workflows, new technology 4GL / UML phase of data engineering
  • 9. www.scling.com Gap is still there 9 DW ~10 year capability gap "data factory engineering" Enterprise big data failures "Modern data stack" - traditional workflows, new technology 4GL / UML phase of data engineering Data engineering education
  • 11. www.scling.com From craft to process 11 Multiple time windows Assess ingress data quality Repair broken data from complementary source Forecast based on history, multiple parameter settings Assess outcome data quality Assess forecast success, adapt parameters
  • 12. www.scling.com ● Toyota: Low defect rates AND high margins per vehicle ● State of DevOps report: High reliability AND high deployment rate ○ We have industrialised software engineering Quality, speed - choose two 12 Quality vs Speed Quality AND Speed 1000x span in availability metrics
  • 13. www.scling.com How we work, briefly ● Industrialised context ○ Old school big data - minus Hadoop + MapReduce ○ No craft adaptations - mutability, lakehouse, dbt, warehouses, … ○ High code - Scala+Python over SQL+Yaml ● Strong data refinement process ○ Immutability ○ Functional architectural principles ● Strong CI/CD process ○ Local execution, no cloud resources ○ Strong enough for automated patch management ● Data quality evaluated in production ○ Dark pipelines hidden from consumption 13
  • 14. www.scling.com Measuring correctness: counters ● User-defined ● Technical from framework ○ Execution time ○ Memory consumption ○ Data volumes ○ ... 14 case class Order(item: ItemId, userId: UserId) case class User(id: UserId, country: String) val orders = read(orderPath) val users = read(userPath) val orderNoUserCounter = longAccumulator("order-no-user") val joined: C[(Order, Option[User])] = orders .groupBy(_.userId) .leftJoin(users.groupBy(_.id)) .values val orderWithUser: C[(Order, User)] = joined .flatMap( orderUser match case (order, Some(user)) => Some((order, user)) case (order, None) => { orderNoUserCounter.add(1) None }) SQL: Nope
  • 15. www.scling.com 15 Measuring correctness: counters ● Processing tool (Spark/Hadoop) counters ○ Odd code path => bump counter ○ System metrics Hadoop / Spark counters DB Standard graphing tools Standard alerting service
  • 16. www.scling.com 16 Measuring consistency: pipelines ● Processing tool (Spark/Hadoop) counters ○ Odd code path => bump counter ○ System metrics ● Dedicated quality assessment pipelines DB Quality assessment job Quality metadataset (tiny) Standard graphing tools Standard alerting service
  • 17. www.scling.com 17 Production critical upgrade ● Dual datasets during transition ● Run downstream parallel pipelines ○ Cheap ○ Low risk ○ Easy rollback ● Testable end-to-end No dev & staging environment needed! ∆?
  • 18. www.scling.com 18 The unknown unknowns ● Measure user behaviour ○ E.g. session length, engagement, funnel ● Revert to old egress dataset if necessary Measure interactions DB Standard alerting service Stream Job
  • 19. www.scling.com 19 Machine learning operations, simplified ● Multiple trained models ○ Select at run time ● Measure user behaviour ○ E.g. session length, engagement, funnel ● Ready to revert to ○ old models ○ simpler models Measure interactions Rendez- vous DB Standard alerting service Stream Job "The required surrounding infrastructure is vast and complex." - Google
  • 20. www.scling.com 20 Conditional consumption ● Conditional consumption ○ Express in workflow orchestration ○ Read metrics DB, quality dataset ○ Producer can recommend, not decide ● Insufficient quality? ○ Wait for bug fix ○ Use older/newer input dataset Recommendation metrics Report
  • 21. www.scling.com Data agility 21 ● Siloed: 6+ months Cultural work ● Autonomous: 1 month Technical work ● Coordinated: days Data lake ∆ ∆ Latency?
  • 22. www.scling.com Cognitive waste ● Why do we have 25 time formats? ○ ISO 8601, UTC assumed ○ ISO 8601 + timezone ○ Millis since epoch, UTC ○ Nanos since epoch, UTC ○ Millis since epoch, user local time ○ … ○ Float of seconds since epoch, as string. WTF?!? ● my-kafka-topic-name, your_topic_name 22 "I don't know what will break downstream" paralyses many data organisations.
  • 23. www.scling.com Full pipeline testing ● Test downstream impact of change ● From change to egress point ○ Possibly include also presentation layer ● Test across team boundaries ● Difficult for organisations with weak coordination / strong cultural autonomy 23
  • 24. www.scling.com What about a layer of indirection? 24 file://test_data/ Testing? Prod Data access layer - In conflict with autonomous culture - Adds complexity to experimental cycle - In conflict with data component ecosystems + Explicit + Debuggable
  • 25. www.scling.com Warped pipeline testing 25 file://test_input/ file://test_output/ Workflow manager DAG "warp" rewrite 1. Generate input 3. Verify output f() p() Stream Stream Production
  • 26. www.scling.com Warped pipeline testing 26 Prod cloud → test cloud rewrite 1. Generate input 3. Verify output f() p() Stream Stream Production
  • 27. www.scling.com Aspect-oriented testing Aka hideous monkeypatching. Rules for DAG transformation 27
  • 28. www.scling.com Monkeypatching exacts a price Better than indirection? Depends on cultural cost of coordination. 28 - Implicit - Framework difficult to debug + Minimal code adaptation + Can test pipelines owned by other teams + Goes through third-party components Great value from full pipeline testing!
  • 29. www.scling.com Why do we share this? ● Speed up transition out of data 4GL ● Data factory engineering is learnt inside factories. ○ Few in Europe ● Build factories for clients? Tried that. ● Sell factory machines? Skills & workflows matter more. ● Other ways of collaborating. Partnership innovation? ○ Common outside IT ○ Unfamiliar model in IT - cf. cloud resistance 29 Autoliv general presentation 2017
  • 30. www.scling.com Scling - data-factory-as-a-service 30 Data value through collaboration Customer Data factory Data platform & lake data domain expertise Value from data! Rapid data innovation Learning by doing, in collaboration