www.scling.com
DataOps - Lean principles and practices
Data 2030 Summit, 2021-02-11
Lars Albertsson, Founder, Scling
Ask not what, but how
Ideas << execution
DataOps is the "how" of data & ML
2013: Transform @ Spotify
2014: "DataOps" term first seen
2018: Conference talk rejected
2019: Most watched recording @ Data Innovation Summit
2021: DataOps day @ Data 2030 Summit
Enabling innovation
"The actual work that went into Discover Weekly was very little, because we're reusing things we already had."
https://youtu.be/A259Yo8hBRs
https://youtu.be/ZcmJxli8WS8
https://musically.com/2018/08/08/daniel-ek-would-have-killed-discover-weekly-before-launch/
"Discover Weekly wasn't a great strategic plan and 100 engineers. It was 3 engineers that decided to build something."
"I would have killed it. All of a sudden, they shipped it. It’s one of the most loved product features that we have."
- Daniel Ek, CEO
IT craft to factory
Security → DevSecOps
Waterfall → Agile
Application delivery → Containers
Traditional operations → DevOps
Traditional QA → CI/CD
Infrastructure → Infrastructure as code
Data factories
Security → DevSecOps
Waterfall → Agile
Application delivery → Containers
Traditional operations → DevOps
Traditional QA → CI/CD
Infrastructure → Infrastructure as code
DB-oriented architecture → Data factories, data pipelines, DataOps
From craft to process
Multiple time windows
Assess ingress data quality
Repair broken data from complementary source
Forecast based on history, multiple parameter settings
Assess outcome data quality
Assess forecast success, adapt parameters
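The "forecast, assess, adapt parameters" loop on this slide could be sketched roughly as follows. This is only an illustration under assumptions: the forecaster is a plain moving average, the "parameter setting" is its window length, and success is measured as mean absolute error on the known history. None of these names or choices come from the talk.

```python
def moving_average_forecast(history, window):
    """Forecast the next value as the mean of the last `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def backtest_error(history, window):
    """Mean absolute error of one-step-ahead forecasts over the known history.

    Assumes the history is longer than the window.
    """
    errors = [
        abs(moving_average_forecast(history[:i], window) - history[i])
        for i in range(window, len(history))
    ]
    return sum(errors) / len(errors)


def choose_window(history, candidate_windows):
    """Assess forecast success per parameter setting and keep the best one."""
    return min(candidate_windows, key=lambda w: backtest_error(history, w))


# Hypothetical history; try several parameter settings and adapt.
history = [10, 12, 11, 13, 12, 14, 13, 15]
best = choose_window(history, candidate_windows=[1, 2, 4])
forecast = moving_average_forecast(history, best)
```

Running the selection on every new batch of history is what turns a one-off craft forecast into a repeatable process.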
Naive ML
Towards sustainable production ML
Multiple models, parameters, features
Assess ingress data quality
Repair broken data from complementary source
Choose model and parameters based on performance and input data
Benchmark models
Try multiple models, measure, A/B test
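The "assess ingress data quality" and "repair broken data from complementary source" steps might look something like this sketch. The record layout, the field names, and the rule that a field is "broken" when it is missing or None are all assumptions made for illustration.

```python
def assess_quality(records, required_fields):
    """Return the fraction of records where every required field is present."""
    if not records:
        return 0.0
    complete = sum(
        all(r.get(f) is not None for f in required_fields) for r in records
    )
    return complete / len(records)


def repair(primary, complementary, key, required_fields):
    """Fill missing fields in primary records from a complementary source."""
    backup = {r[key]: r for r in complementary}
    repaired = []
    for record in primary:
        fixed = dict(record)  # never mutate the input dataset
        fallback = backup.get(record[key], {})
        for field in required_fields:
            if fixed.get(field) is None and fallback.get(field) is not None:
                fixed[field] = fallback[field]
        repaired.append(fixed)
    return repaired


# Hypothetical datasets: record 2 can be repaired, record 3 cannot.
primary = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": None}, {"id": 3}]
complementary = [{"id": 2, "amount": 7.5}]

quality_before = assess_quality(primary, ["amount"])
repaired = repair(primary, complementary, key="id", required_fields=["amount"])
quality_after = assess_quality(repaired, ["amount"])
```

The quality score after repair is the kind of input a later step could use when choosing a model and parameters.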
The Toyota Way
Selected lean principles:
● Long-term over short-term
● The right process will produce the right results
● Eliminate waste (muda)
● Continuous improvement (kaizen)
● Use pull systems to avoid unnecessary production
● Quality takes precedence (jidoka)
○ Stop to fix problems
● Standardised tasks and processes
● Reliable technology that serves people and process
● Develop your people
● Decisions slowly by consensus
● Relentless reflection (hansei), organisational learning
Common waste species
● Cognitive waste
● Technology waste
● Delivery waste
● Operational waste
● Product waste
Companies are generally good at handling some waste forms, and blind to others.
Your blindness is your potential.
Cognitive waste
● Why do we have 25 time formats?
○ ISO 8601, UTC assumed
○ ISO 8601 + timezone
○ Millis since epoch, UTC
○ Nanos since epoch, UTC
○ Millis since epoch, user local time
○ …
○ Float of seconds since epoch, as string.
WTF?!?
● my-kafka-topic-name, your_topic_name
● Definition of an order:
○ Abandoned cart?
○ Payment refused?
○ Returned goods?
○ Free promotion?
● Data entity source of truth
○ MySQL, Kafka, data lake?
● Code and documentation sprawl
○ Repositories & branches
○ Wikis
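One way to contain the time format sprawl listed above is to convert every encoding to a single canonical form at ingestion. The sketch below assumes the canonical form is a timezone-aware UTC `datetime` and that each source declares its format explicitly; the `to_utc` function and the format tags are hypothetical names. "Millis since epoch, user local time" is left out on purpose, since it cannot be converted without knowing the user's time zone.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_utc(value, fmt):
    """Convert one of several timestamp encodings to an aware UTC datetime."""
    if fmt == "iso":
        # Older Python versions do not accept a trailing "Z" in fromisoformat.
        text = value[:-1] + "+00:00" if value.endswith("Z") else value
        parsed = datetime.fromisoformat(text)
        if parsed.tzinfo is None:
            # ISO 8601 without offset: UTC assumed.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc)
    if fmt == "epoch_millis":
        return EPOCH + timedelta(milliseconds=int(value))
    if fmt == "epoch_nanos":
        # datetime has microsecond resolution, so sub-microsecond digits are dropped.
        return EPOCH + timedelta(microseconds=int(value) // 1000)
    if fmt == "epoch_seconds_string":
        return EPOCH + timedelta(seconds=float(value))
    raise ValueError(f"unknown timestamp format: {fmt}")


# Five encodings of the same instant all map to one canonical value.
canonical = {
    to_utc("2021-02-11T00:00:00", "iso"),
    to_utc("2021-02-11T01:00:00+01:00", "iso"),
    to_utc(1613001600000, "epoch_millis"),
    to_utc(1613001600000000000, "epoch_nanos"),
    to_utc("1613001600.0", "epoch_seconds_string"),
}
```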
What causes cognitive waste?
● We are autonomous!
○ Teams can choose technology, format, process, ...
● Cognitive debt
○ Short-term over long-term
○ Decisions without consensus
● Recognition and rewards
○ "You have made a similar independent pipeline, great work!"
Avoiding cognitive waste
● Reusing semantic definitions
● Reusing code & technical definitions
○ Code transparency & sharing
○ Standardised technology
○ Document decisions & consensus process
● Read-only sharing not enough
○ Must be empowered to
■ change for reuse
■ improve quality
■ delete unused
○ Low risk - what will I break downstream?
○ Standardised, end-to-end QA processes
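Reusing a semantic definition can be as simple as keeping the answer to "what is an order?" in one shared piece of code that every pipeline imports. The sketch below is hypothetical: the status values, the `OrderEvent` type, and the rule itself (paid orders count, including returned goods and free promotions; abandoned carts and refused payments do not) are assumptions chosen only to make the example concrete.

```python
from dataclasses import dataclass
from enum import Enum


class OrderStatus(Enum):
    CART_ABANDONED = "cart_abandoned"
    PAYMENT_REFUSED = "payment_refused"
    PAID = "paid"
    RETURNED = "returned"


@dataclass(frozen=True)
class OrderEvent:
    order_id: str
    status: OrderStatus
    amount: float  # 0.0 for a free promotion


def is_order(event: OrderEvent) -> bool:
    """Single shared answer to "does this count as an order?".

    The rule is an assumed example; the point is that it lives in one place
    and can be changed there for every consumer at once.
    """
    return event.status in {OrderStatus.PAID, OrderStatus.RETURNED}


events = [
    OrderEvent("a", OrderStatus.PAID, 20.0),
    OrderEvent("b", OrderStatus.CART_ABANDONED, 15.0),
    OrderEvent("c", OrderStatus.PAID, 0.0),
    OrderEvent("d", OrderStatus.RETURNED, 30.0),
    OrderEvent("e", OrderStatus.PAYMENT_REFUSED, 12.0),
]
order_ids = [e.order_id for e in events if is_order(e)]
```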
Delivery waste - code inventory
● Code not yet fully utilised
● Code on its way to production
○ In a notebook
○ Waiting for approval
○ Waiting for release
○ Internally released, waiting for dependants to upgrade
● Tests not fully used
○ Tests that cover code (shared component), but are not yet executed
Eliminating delivery waste
● Friction from code to production
○ Positive engineering: research, writing code, tests, docs, refactor, improve
○ All else is negative
● You are limited by your assumptions
○ State of practice is far from state of the art
But the test suite takes 3 hours.
We have this checklist.
Security must approve.
X must be released before Y.
That is another team's job.
We don't have access.
We must test in staging first.
We haven't performance tested yet.
So get rid of the waste. Resources:
No tradeoff between speed and quality!
Data inventory
● Data collected, but not yet fully processed
○ Traditional lazy joins & SQL processing at runtime
○ Extract-load-transform (ELT)
● Eliminate with eager processing = pipeline
○ Process, join, denormalise
○ Extract-transform-load (ETL)
● Fatal problems → offline crash
○ "Andon" cord - stop and fix before significant harm is done
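The "Andon" cord idea can be illustrated with a validation step that crashes an offline batch job instead of passing bad data downstream. This is a minimal sketch under assumptions: `DataQualityError`, `validate_batch`, the validity rule, and the threshold are all hypothetical names and values, not part of the talk.

```python
class DataQualityError(Exception):
    """Raised to stop the pipeline before bad data spreads downstream."""


def validate_batch(records, is_valid, max_invalid_fraction=0.01):
    """Return the valid records, or pull the cord if too many are invalid."""
    if not records:
        raise DataQualityError("empty input batch")
    valid = [r for r in records if is_valid(r)]
    invalid_count = len(records) - len(valid)
    if invalid_count > max_invalid_fraction * len(records):
        raise DataQualityError(
            f"{invalid_count} of {len(records)} records invalid, "
            f"limit is {max_invalid_fraction:.0%}"
        )
    return valid


def has_positive_amount(record):
    """Assumed validity rule for the example."""
    return record.get("amount", 0) > 0


good_batch = [{"amount": a} for a in (5, 7, 9, 11)]
bad_batch = [{"amount": a} for a in (5, -1, 0, -3)]

clean = validate_batch(good_batch, has_positive_amount, max_invalid_fraction=0.1)
try:
    validate_batch(bad_batch, has_positive_amount, max_invalid_fraction=0.1)
    stopped = False
except DataQualityError:
    # In a real batch job the exception would fail the run and alert someone.
    stopped = True
```

Because the job is offline, a crash costs a delayed dataset rather than a user-facing outage, which is what makes stopping to fix affordable.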
Technology waste
NoSQL
Stream processing
Spark/Flink
Hadoop
In-memory databases
Schema registry
Data catalogue
Feature store
Change data capture
Data versioning
Governance system
Data warehouse
Lakehouse
Scaled out compute
Kubernetes
Essential
Compute machines
Workflow orchestration
RDBMS
File storage
Code version control
Visualisation
Graph processing
Deep learning
Operational waste
● Friction in operational manoeuvres
○ Fear of mistakes
○ Application-specific tooling
● Cost of incidents
○ Time to recovery
○ Impact of incident
○ Frequency of incidents
Separating offline and online
[Diagram: Orders, Replication / Backup, Raw, Fraud model, Fraud service]
Prudent procedures
Careful handover
Lightweight procedures
● QA driven by internal efficiency
● Continuous deployment
● New pipeline < 1 day
● Upgrade < 1 hour
● Bug recovery < 1 hour
Data processing tradeoff
[Diagram: Online, Nearline, Offline; Job, Stream]
Many nines uptime (99.99.. %) vs a couple of sevens
Data speed vs innovation speed
Product waste
● Work not driven by use case
● Unrealised data potential due to friction
○ Unawareness of data
○ Difficulty to use data
● Collaboration and communication
○ Connection
○ Overhead
Data democratisation - making data accessible and usable
Form teams aligned to value flows.
Continuous improvement & learning
● Products, not projects
○ Owned, never done, always improving
● To production early
○ Minimal fear
○ Measure and monitor to learn
● Fail & iterate
○ No blame, no penalties
● Communication across organisation essential
○ Data source team - data processing team - stakeholders
Data product quality assurance
● Product quality = f(code, data)
○ Cannot do full QA on code only
○ Only real data is production data
● Test in production
○ Quick QA cycle = quick production deployment
○ Measure, monitor, validate
Infrastructure waste
● Production environment only
○ Dev, test, staging lack production data
● Dark pipelines
○ Run in parallel
○ Monitor diff vs production
○ Roll out slowly?
∆?
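"Monitor diff vs production" for a dark pipeline could be approximated by comparing the two outputs record by record. The sketch below assumes both pipelines emit lists of dictionaries that share a unique key; `diff_outputs` and the example records are hypothetical.

```python
def diff_outputs(production, dark, key):
    """Compare two pipeline outputs record by record, matched on `key`."""
    prod_by_key = {r[key]: r for r in production}
    dark_by_key = {r[key]: r for r in dark}
    shared = prod_by_key.keys() & dark_by_key.keys()
    return {
        "missing_in_dark": sorted(prod_by_key.keys() - dark_by_key.keys()),
        "extra_in_dark": sorted(dark_by_key.keys() - prod_by_key.keys()),
        "changed": sorted(k for k in shared if prod_by_key[k] != dark_by_key[k]),
    }


# Hypothetical outputs from the production pipeline and its dark sibling.
production = [
    {"id": 1, "score": 0.2},
    {"id": 2, "score": 0.9},
    {"id": 3, "score": 0.5},
]
dark = [
    {"id": 1, "score": 0.2},
    {"id": 2, "score": 0.8},
    {"id": 4, "score": 0.1},
]

delta = diff_outputs(production, dark, key="id")
```

An empty or explainable delta over several runs is one possible signal that the dark pipeline is ready to be rolled out.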
Slow cycle - slow learning
Learning more about Lean & DataOps
Scling - data-value-as-a-service
Data value through collaboration
Customer
Data factory
Data platform & lake
data
domain expertise
Value from data!
Rapid data innovation
Learning by doing, in collaboration
