Modern data processing environments resemble factory lines, transforming raw data into valuable data products. The lean principles that have successfully transformed manufacturing are equally applicable to data processing, and are well aligned with the new trend known as DataOps. In this presentation, we explain how lean and DataOps principles can be implemented as technical data processing solutions and processes that eliminate waste and speed up data innovation. We go through how to eliminate the following types of waste in data processing systems:
* Cognitive waste - unclear source of truth, dependency sprawl, duplication, ambiguity.
* Operational waste - overhead for deployment, upgrades, and incident recovery.
* Delivery waste - friction and delay in development, testing, and deployment.
* Product waste - misalignment with business value, detachment from use cases, push-driven development, vanity quality assurance.
We will primarily focus on technical solutions, but some of the waste mentioned requires organisational refactoring to eliminate.
The lean principles of data ops
1. www.scling.com
The lean principles of DataOps
Berlin Buzzwords, 2020-06-08
Lars Albertsson, Founder, Scling
Christopher Bergh, CEO & Head Chef, DataKitchen
2. www.scling.com
Scling - data-value-as-a-service
Data lake
Stream storage
● Extract value from your data
● Data platform + custom data pipelines
● Imitate data leaders:
○ Quick idea-to-production
○ Operational efficiency
Our marketing strategy:
● Promiscuously share knowledge
○ On slides devoid of glossy polish
4. www.scling.com
IT craft to factory
● Security → DevSecOps
● Waterfall → Agile
● Application delivery → Containers
● Traditional operations → DevOps
● Traditional QA → CI/CD
● Infrastructure → Infrastructure as code
6. www.scling.com
The Toyota Way
Selected lean principles:
● Long-term over short-term
● The right process will produce the right results
● Eliminate waste (muda)
● Continuous improvement (kaizen)
● Use pull systems to avoid unnecessary production
● Quality takes precedence (jidoka)
○ Stop to fix problems
● Standardised tasks and processes
● Reliable technology that serves people and process
● Develop your people
● Decisions slowly by consensus
● Relentless reflection (hansei), organisational learning
8. www.scling.com
Cognitive waste
● Why do we have 25 time formats?
○ ISO 8601, UTC assumed
○ ISO 8601 + timezone
○ Millis since epoch, UTC
○ Nanos since epoch, UTC
○ Millis since epoch, user local time
○ …
○ Float of seconds since epoch, as string.
WTF?!?
● my-kafka-topic-name, your_topic_name
● Definition of an order:
○ Abandoned cart?
○ Payment refused?
○ Returned goods?
○ Free promotion?
● Data entity source of truth
○ MySQL, Kafka, data lake?
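One way to contain the time format sprawl listed above is to convert every encoding to a single canonical form at the pipeline edge. The following is a minimal sketch, not something shown in the talk: the function names from_iso and from_epoch are hypothetical, and the canonical form (timezone-aware UTC datetime) is an assumption. "Millis since epoch, user local time" cannot be normalised this way unless the user's UTC offset is also recorded.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_iso(text):
    """Parse ISO 8601; a value without an offset is assumed to be UTC."""
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        return parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def from_epoch(value, units_per_second):
    """Convert an epoch count (int, float, or numeric string) to UTC."""
    if isinstance(value, int):
        # Integer arithmetic keeps millis and nanos exact.
        micros = value * 1_000_000 // units_per_second
    else:
        # Covers the "float of seconds since epoch, as string" variant.
        micros = round(float(value) * 1_000_000 / units_per_second)
    return EPOCH + timedelta(microseconds=micros)


# The five encodings below all describe the same instant.
samples = [
    from_iso("2020-06-08T08:00:00"),                 # ISO 8601, UTC assumed
    from_iso("2020-06-08T10:00:00+02:00"),           # ISO 8601 + timezone
    from_epoch(1591603200000, 1_000),                # millis since epoch
    from_epoch(1591603200000000000, 1_000_000_000),  # nanos since epoch
    from_epoch("1591603200.0", 1),                   # float seconds, as string
]
assert len(set(samples)) == 1
```

Downstream code then only ever sees one representation, so the question "which of the 25 formats is this?" is answered once, at ingestion.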
9. www.scling.com
What causes cognitive waste?
● We are autonomous!
○ Teams can choose technology, format, process, ...
● Cognitive debt
○ Short-term over long-term
○ Decisions without consensus
● Recognition and rewards
○ "You have made a similar independent pipeline, great work!"
10. www.scling.com
Avoiding cognitive waste
● Reusing semantic definitions
● Reusing code & technical definitions
○ Code transparency & sharing
○ Standardised technology
○ Document decisions & consensus process
● Read-only sharing not enough
○ Must be empowered to change for reuse and to improve quality
○ Standardised processes
11. www.scling.com
Eliminating cognitive waste
● Refactoring code, semantics, docs
● Low risk - what will I break downstream?
○ Standardised, automated, trusted QA process
○ End-to-end pipeline testing
● "Creating a pipeline - one day! Replace old pipeline - 18 months."
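End-to-end pipeline testing, as mentioned above, can be sketched as feeding a small raw input through every step and asserting only on the final output. This is an illustrative sketch under assumptions: the two steps (parse_orders, revenue_per_user), the "user,amount" line format, and run_pipeline are hypothetical and stand in for real pipeline jobs.

```python
def parse_orders(lines):
    """Step 1: parse raw "user,amount" lines, skipping malformed ones."""
    orders = []
    for line in lines:
        parts = line.strip().split(",")
        if len(parts) != 2:
            continue
        user, amount = parts
        try:
            orders.append({"user": user, "amount": float(amount)})
        except ValueError:
            continue
    return orders


def revenue_per_user(orders):
    """Step 2: aggregate order amounts per user."""
    totals = {}
    for order in orders:
        totals[order["user"]] = totals.get(order["user"], 0.0) + order["amount"]
    return totals


def run_pipeline(raw_lines):
    """The whole pipeline: raw input in, data product out."""
    return revenue_per_user(parse_orders(raw_lines))


def test_pipeline_end_to_end():
    # The test knows only the pipeline's input and output, not its steps,
    # so the steps in between can be refactored without touching the test.
    raw = ["alice,10.0", "bob,5.5", "alice,2.5", "garbage", "bob,not-a-number"]
    assert run_pipeline(raw) == {"alice": 12.5, "bob": 5.5}


test_pipeline_end_to_end()
```

Because the assertion sits at the pipeline boundary, a refactoring that breaks downstream output fails the test, which is what makes the change low risk.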
12. www.scling.com
Delivery waste
● Friction from code to production
○ Ideal: Idea, research, write code+tests, done. Everything else is friction.
● Code inventory
○ Code not yet fully utilised
● Data inventory
○ Data not yet fully processed
13. www.scling.com
Data product quality assurance
● Product quality = f(code, data)
○ Cannot do full QA on code only
○ Only real data is production data
● Test in production
○ Quick QA cycle = quick production deployment
○ Measure, monitor, validate
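Since product quality depends on both code and data, "measure, monitor, validate" can be sketched as computing metrics on a produced dataset and checking them against thresholds. This is a minimal sketch under assumptions: the metric names, the thresholds, and the functions measure and validate are hypothetical, and a real setup would send the metrics to a monitoring system rather than return a list.

```python
def measure(records, required_fields):
    """Measure: compute simple quality metrics on an output dataset."""
    total = len(records)
    complete = sum(
        1 for record in records
        if all(record.get(field) is not None for field in required_fields)
    )
    return {
        "row_count": total,
        "complete_ratio": complete / total if total else 0.0,
    }


def validate(metrics, min_rows, min_complete_ratio):
    """Validate: return a list of problems; an empty list means the data passes."""
    problems = []
    if metrics["row_count"] < min_rows:
        problems.append("too few rows")
    if metrics["complete_ratio"] < min_complete_ratio:
        problems.append("too many incomplete rows")
    return problems


output = [
    {"user": "alice", "country": "SE"},
    {"user": "bob", "country": "DE"},
    {"user": "carol", "country": None},  # incomplete record
    {"user": "dave", "country": "US"},
]
metrics = measure(output, required_fields=["user", "country"])
problems = validate(metrics, min_rows=3, min_complete_ratio=0.9)
```

Here three of four records are complete, so the completeness check fails and problems contains one entry, while the row count check passes.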
14. www.scling.com
Eliminating delivery friction
● In theory simple - scrutinise everything
○ Positive engineering: writing code, tests, docs, refactor, improve
○ All else is negative
● You are limited by your assumptions
○ State of practice far from state of art
"But the test suite takes 3 hours."
"We have this checklist."
"Security must approve."
"X must be released before Y."
"That is another team's job."
"We don't have access."
"We must test in staging first."
"We haven't performance tested yet."
16. www.scling.com
Code inventory
● Code not yet fully utilised
● Code on its way to production
○ In a notebook
○ Waiting for approval
○ Waiting for release
○ Internally released, waiting for dependants to upgrade
● Tests not fully used
○ Cover code (shared component), but not yet executed
17. www.scling.com
Data inventory
● Data collected, but not yet fully processed
○ Traditional lazy joins & SQL processing at runtime
● Eliminate with eager processing = pipeline
○ Process, join, denormalise
● Fatal problems → offline crash
○ "Andon" cord - stop and fix before significant harm is done
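Eager processing with an "Andon" cord, as described above, can be sketched as a batch job that joins and denormalises up front and crashes offline when the data is fatally bad. This is an illustrative sketch under assumptions: the record fields, the 1% threshold, and the names AndonError and denormalise_orders are hypothetical.

```python
class AndonError(Exception):
    """Raised to stop the pipeline before bad data reaches consumers."""


def denormalise_orders(orders, users, max_missing_ratio=0.01):
    """Eagerly join orders with users so that readers need no runtime join."""
    users_by_id = {user["user_id"]: user for user in users}
    joined = []
    missing = 0
    for order in orders:
        user = users_by_id.get(order["user_id"])
        if user is None:
            missing += 1
            continue
        joined.append({**order, "country": user["country"]})
    # Pull the cord: too many unmatched orders is treated as a fatal problem,
    # so the job crashes offline instead of publishing a broken dataset.
    if orders and missing / len(orders) > max_missing_ratio:
        raise AndonError(f"{missing} of {len(orders)} orders have no user")
    return joined


users = [{"user_id": 1, "country": "SE"}]
good_orders = [{"order_id": 10, "user_id": 1}]
result = denormalise_orders(good_orders, users)
```

With an unknown user_id in the input, the same call raises AndonError, and nothing downstream runs until the problem is fixed.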
18. www.scling.com
Operational waste
● Friction in operational manoeuvres
○ Fear of mistakes
● Cost of incidents
○ Time to recovery
○ Impact of incident
○ Frequency of incidents
19. www.scling.com
Separating offline and online
Diagram labels: Orders, Replication / Backup, Raw, Fraud model, Fraud service
Zone labels: Standard procedures (twice), Lightweight procedures, Careful handover (twice)
● QA driven by internal efficiency
● Continuous deployment
● New pipeline < 1 day
● Upgrade < 1 hour
● Bug recovery < 1 hour
22. www.scling.com
Cost of a software error
Online
● User impact
● Data corruption
● Cascading corruption
● Unbounded recovery
Nearline
● Data corruption
● Downstream impact
● Bounded recovery
Offline
● Temporary data corruption
● Downstream impact
● Easy recovery
Diagram labels: Job, Stream
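The slide does not say why offline recovery is easy. One common explanation, stated here as an assumption rather than as the speakers' method, is that offline jobs derive their output from raw data that is never modified, so corrupt output can be rebuilt by rerunning a fixed job. A minimal sketch with hypothetical names, using in-memory dicts keyed by day in place of real storage:

```python
def recompute(raw_by_day, derived_by_day, days, job):
    """Overwrite derived partitions by rerunning the job on unmodified raw data."""
    for day in days:
        derived_by_day[day] = job(raw_by_day[day])


def buggy_count(events):
    return len(events) + 1  # hypothetical off-by-one bug


def fixed_count(events):
    return len(events)


raw = {"2020-06-07": ["a", "b"], "2020-06-08": ["c"]}  # raw data is only read
derived = {}

recompute(raw, derived, list(raw), buggy_count)  # temporary data corruption
recompute(raw, derived, list(raw), fixed_count)  # bug fixed, partitions rebuilt
```

The corruption is temporary because the derived partitions are simply overwritten. Downstream datasets built from them would need the same rerun, which matches the "Downstream impact" bullet.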
24. www.scling.com
Product waste
● Work not driven by use case
● Unrealised data potential due to friction
○ Unawareness of data
○ Difficulty to use data
● Hidden quality problems
● Collaboration and communication overhead
Data democratisation - making data accessible and usable
25. Copyright 2020 by DataKitchen, Inc. All Rights Reserved.
Waste: Your Team’s Time Not Well Spent
Chart: percentage of time the team spends per week (current), split into:
• Errors & Operational Tasks
• New Features & Data For Customers
• Improvements & Debt
Challenges:
• Complex roles
• Complex organizations
• Complex toolchains
• Complex data
• Complex collaboration
26. Copyright 2020 DataKitchen, Inc.
Waste: Data Analytics is like the US Auto Industry in the 1970s
Current state of the data analytics team (diagram):
• Production errors: high
• Deployment latency, dev to prod: weeks, months
Challenges:
• Slow to add new features, rapidly address consumer requests, changing data sets
• Lack of trust by data consumers
• Slow model deployment, slow to move to cloud
• Team morale
27. Copyright 2020 by DataKitchen, Inc. All Rights Reserved.
Waste: Conway’s Law and Data Pipelines
Data Analytics Follows Conway's Law
The structure of how teams are organized to do Data Science, Data Engineering, Analytics, and Production is reflected in their data pipelines.
28. Copyright 2020 by DataKitchen, Inc. All Rights Reserved.
Waste: A cornucopia of collaboration complexity
Legend: D = Development - Data Analytic Team, P = Production - Data Analytic Team
• Centralized Dev: How do we create together without conflicts? (Data Engineer & Data Scientist)
• Centralized Dev & Prod: How do we deploy safely and rapidly? (Data Team and Production Team)
• Decentralized Dev: How to balance centralized control vs self service freedom? (Home Office Data Team and Line of Business Analysts)
• Decentralized Dev & Prod: How to reuse/incorporate what another team deployed? (Multiple Data & Production Teams in Many Orgs)
Roles shown: DE, DS, BI
29. Copyright 2020 by DataKitchen, Inc. All Rights Reserved.
Why? Data Teams Are Suffering
Data teams are caught between three competing forces:
• Unaware Data Providers – unaware that they send crappy, late, and error prone data sets
• Demanding Data Consumers – demand trusted, original insight at the speed of Amazon delivery
• Critical Supporting Teams – need flawless ongoing production and collaboration with other teams/people
Make for:
• A beaten down, distraught, disempowered work environment
• Teams that cannot create and innovate
• Lack of trust all around
30. Copyright 2020 by DataKitchen, Inc. All Rights Reserved.
DataOps – Solution To That Suffering
DataOps – The technical practices, cultural norms, and architecture that enable:
• Rapid cycles of experimentation and innovation to deliver new insights to our customers
• Low error rates
• Collaboration across complex sets of people, technology, and environments
• Clear measurement and monitoring of results
“Organizations that adopt a DevOps- and DataOps-based approach are more successful in implementing end-to-end, reliable, robust, scalable and repeatable solutions.”
Sumit Pal, Gartner, November 2018
Source: Gartner
Diagram labels: People, Process, Organization; Technical Environment
31. Copyright 2020 by DataKitchen, Inc. All Rights Reserved.
DataOps Benefit: Lower Cost, More Insight
Chart: percentage of time the team spends per week, before and after DataOps.
Categories shown:
• Errors & Operational Tasks
• New Features & Data For Customers
• Improvements & Debt (before); Process Improvements & Tech Debt Reduction (after)
32. Copyright 2020 by DataKitchen, Inc. All Rights Reserved.
DataOps Benefit: Faster, Better & Happier
Before DataOps:
• Production errors: high
• Deployment latency, dev to prod: weeks, months
After DataOps:
• Production errors: low
• Deployment latency, dev to prod: hours & mins
33. Copyright 2020 by DataKitchen, Inc. All Rights Reserved.
DevOps vs DataOps (and all those *Opses)
Business management concept: Lean, Learning Organization, and W. Edwards Deming principles. Focus on low errors, cycle time, collaboration, and measurement.
Organization: Industrial manufacturing teams; IT and software teams; Data science, engineering and analytics teams
Organizational management method (team management): Six Sigma, Total Quality Management; Agile, Kanban, Scrum, DA, etc.
Technical environment and process: DevOps, AIOps, DevSecOps, DataOps, ModelOps, MLOps, GitOps, …
37. Copyright 2020 by DataKitchen, Inc. All Rights Reserved.
What You Do Is Much Less Important Than How You Do It
“We realized that the true problem, the true difficulty, and where the greatest potential is – is building the machine that makes the machine. It’s building the factory.” – Elon Musk
94% of causes were common cause. We often attribute problems to a specific case, and look for a person to blame, rather than focusing on the underlying process – Dr Deming