Data ops in practice - Swedish style

Lars Albertsson, Founder & Data Engineer
www.scling.com
DataOps in practice - Swedish style
Lars Albertsson (@lalleal)
Scling
Who’s talking?
...
Google - video conference, engineering productivity
...
Spotify - data engineering
...
Independent data engineering consultant
Banks, media, startups, heavy industry, telco
Founder @ Scling - data-value-as-a-service
Contents
Journey to DataOps
Experiences that shaped my data engineering
IMHO principles of successful DataOps
Toolbox
● Spotify information is old history
● Previously published
● Today is very different
Spotify data 2007-2013
● Hadoop installed 2007
● Use cases: reporting, insights, recommendations
● Cultural aspects:
○ Autonomous teams
○ Eliminate waste
○ Learn and adapt
Traditional systems
Mutation
Early Hadoop:
● Weak indexing
● No transactions
● Weak security
● Batch transformations
Data lake
Transformation
Cold store
Mutation
Immutable, shareable
DataOps workflows:
● Immutable, shared data
● Resilient to failure
● Quick error recovery
● Low-risk experiments
Data factories
What conclusion from this graph?
COVID-19 fatalities / day in Sweden
Wrong conclusion, every day
● Downward trend every day!
Normalise data collection to compare
Forecast for analytics with fresh data
Graphs by Adam Altmejd, @adamaltmejd
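One possible reading of the slides above: the latest days always look like a downward trend because their reports are still arriving. A minimal sketch, assuming each record carries an event date and a report date (the field layout and the numbers are made up for illustration, not real data), compares days at the same reporting lag:

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical records: (event_date, report_date). Illustration only.
reports = [
    (date(2020, 4, 1), date(2020, 4, 2)),
    (date(2020, 4, 1), date(2020, 4, 3)),
    (date(2020, 4, 1), date(2020, 4, 6)),
    (date(2020, 4, 5), date(2020, 4, 6)),
    (date(2020, 4, 5), date(2020, 4, 7)),
]


def counts_as_of(records, snapshot_date):
    """Naive view: count everything reported up to snapshot_date."""
    counts = Counter()
    for event_date, report_date in records:
        if report_date <= snapshot_date:
            counts[event_date] += 1
    return dict(counts)


def counts_at_lag(records, max_lag_days):
    """Normalised view: count only reports that arrived within max_lag_days."""
    counts = Counter()
    for event_date, report_date in records:
        if report_date - event_date <= timedelta(days=max_lag_days):
            counts[event_date] += 1
    return dict(counts)


naive = counts_as_of(reports, date(2020, 4, 7))
normalised = counts_at_lag(reports, max_lag_days=2)
```

In this toy data the naive view suggests a decline from 1 April to 5 April, while the lag-normalised view shows the same count for both days.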
From craft to process
Multiple time windows
Assess ingress data quality
Repair broken data from complementary source
Intermediate datasets, reusable between pipelines
Forecast based on history, multiple parameter settings
Assess outcome data quality
Assess forecast success, adapt parameters
Towards sustainable production ML
Multiple models, parameters, features
Assess ingress data quality
Repair broken data from complementary source
Choose model and parameters based on performance and input data
Benchmark models
Try multiple models, measure, A/B test
Risky operations
How do I test the pipeline?
You temporarily change the output path and run manually.
Don’t do that.
What if I forget to change the path?
2013
● Teams: Analytics computation (AC), data collection (DC), recommend, reporting (1)
● Folklore development cycle & operations
● Unsatisfied needs in other teams
On-prem Hadoop production
Redundant "edge nodes" with Luigi workers, scheduled with cron. Compute + data in Hadoop.
luigid
Worker
10 * * * * luigi --module mymodule MyDaily
23 * * * * luigi --module other OtherDaily
Master
Executor
Worker
HDFS metadata
Data
Control
(+data)
Submit job
10 * ...
23 * ...
Ghost in the cluster
● Jobs were deployed with Debian packages + Puppet on pet machines.
○ Multiple pets for redundancy. Race to run job.
● "This monitor daemon is at 100%. Has been for 6 months. I'll kill it."
● "Data is wrong. But we fixed this bug 6 months ago?!?"
Start of a DataOps journey
Stateful → Stateless
Pets → Cattle
Folklore → Golden path
Test in prod → Local test, CI/CD
New pipeline: weeks to learn → < 1 day
Bug fix: days to mend → < 1 hour
On-prem pipeline deployment pipeline
Source repo: Luigi DSL, jars, config
Standard deployment artifact: my-pipe-7.tar.gz
Standard artifact store
All that a pipeline needs, installed atomically:
> pip install my-pipe-7.tar.gz
Luigi daemon
Worker (×8)
Redundant cron schedule, higher frequency:
10 * * * * luigi --module mymodule MyDaily
Principle: Functional pipelines
● Raw source of truth + data refinement factory
● Immutable datasets & artifacts
● Deterministic, idempotent, reproducible deployment & processing
● Key success factor: workflow orchestration
○ Oozie, Rambo, Builder, Builder2, Luigi
○ Key properties:
1. Pure Python
2. Simplicity
3. All the features it lacks
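The properties above can be illustrated with a small sketch. It is not Luigi code; it is a stdlib-only approximation in which the dataset name `orders_clean`, the path layout and the function names are assumptions made for the example. The output location is a pure function of the logical date, an existing dataset is never rewritten, and the write is atomic:

```python
import json
import os
import tempfile
from datetime import date


def output_path(root, dataset, day):
    """Deterministic location: one immutable dataset per logical date."""
    return os.path.join(root, dataset, f"date={day.isoformat()}", "part.json")


def run_daily(root, day, transform, records):
    """Produce the dataset for `day` unless it already exists (idempotent)."""
    target = output_path(root, "orders_clean", day)
    if os.path.exists(target):
        return False  # already complete, nothing to do
    os.makedirs(os.path.dirname(target), exist_ok=True)
    # Write to a temporary file, then rename, so readers never see partial output.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(target))
    with os.fdopen(fd, "w") as f:
        json.dump([transform(r) for r in records], f)
    os.replace(tmp, target)
    return True


root = tempfile.mkdtemp()
day = date(2021, 3, 1)
first = run_daily(root, day, str.upper, ["a", "b"])   # produces the dataset
second = run_daily(root, day, str.upper, ["a", "b"])  # no-op, dataset exists
```

Running the same task twice is harmless, which is what makes redundant cron schedules on several machines safe.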
Big data - a collaboration paradigm
Stream storage
Data lake
Data democratised
Enabling teams
● Technically
○ Data available
○ Reusable QA
● Operationally
○ Continuous deployment
○ Hands off operations
○ Monitoring, debugging
● Bottom-up innovation
"The actual work that went into Discover Weekly was very little, because we're reusing things we already had."
https://youtu.be/A259Yo8hBRs
https://youtu.be/ZcmJxli8WS8
Principle: Small scope components
● Do one thing well. Less is more.
● Complex systems from replaceable bricks
○ Cloud/OSS over enterprise vendors
○ Simplicity over features
Solvable challenge
~2000 lines of code
Perpetual complexity
Cloud native deployment
Source repo: Luigi DSL, jars, config
Docker image: my-pipe:7
Docker registry
Luigi daemon
Worker (×8)
Redundant cron schedule, higher frequency:
kind: CronJob
spec:
  schedule: "10 * * * *"
  command: "luigi --module mymodule MyDaily"
S3 / GCS
Dataproc / EMR
Data platform gravitation
● Hadoop all the things.
● Data is there. Simple test, simple deploy, simple ops.
● Autonomous teams - no mandate. Natural gravity.
Data integration timescales
Online
● SOA / microservices
● Synchronous RPC
● 1-100 ms
Nearline
● Stream storage
● Asynchronous event processing
● 10 ms - 1 hour
Offline
● File storage
● Asynchronous batch processing
● 1 minute -
Operational manoeuvres - online
Upgrade
● Careful rollout
● Risk of user impact
● Proactive QA
Service failure
● User impact
● Data loss
● Cascading outage
Bug
● User impact
● Data corruption
● Cascading corruption
Operational manoeuvres - offline
Upgrade
● Instant rollout
● No user impact
● Reactive QA
Service failure
● Pipeline delay
● No data loss
● No downstream impact
Bug
● Temporary data corruption
● Downstream impact
Life of an error, batch pipelines
● Faulty job, emits bad data
1. Revert serving datasets to old
2. Fix bug
3. Remove faulty datasets
4. Backfill is automatic (Luigi)
Done!
● Low cost of error
○ Reactive QA
○ Production environment sufficient
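The recovery steps above can be sketched with a toy workflow engine. This is not Luigi; `build`, `pipeline` and the dataset names are hypothetical, and step 1 (reverting serving datasets) is left out. The point is that a task runs only when its output is missing, so removing the faulty datasets is enough to trigger the backfill:

```python
def build(store, tasks, name, day):
    """Return dataset (name, day), producing it and missing upstream data first."""
    key = (name, day)
    if key not in store:
        func, upstream = tasks[name]
        inputs = [build(store, tasks, dep, day) for dep in upstream]
        store[key] = func(*inputs)
    return store[key]


def buggy_clean(rows):
    return list(rows)  # bug: negative amounts should have been dropped


def fixed_clean(rows):
    return [r for r in rows if r >= 0]


def pipeline(clean):
    # dataset name -> (function, upstream dataset names)
    return {
        "clean": (clean, ["raw"]),
        "report": (sum, ["clean"]),
    }


day = "2021-03-01"
store = {("raw", day): [3, -1, 4]}  # raw data is the source of truth, never removed

bad_report = build(store, pipeline(buggy_clean), "report", day)

# Step 2: fix the bug (fixed_clean). Step 3: remove the faulty datasets.
for name in ("clean", "report"):
    del store[(name, day)]

# Step 4: the next scheduled run finds the gaps and backfills them.
good_report = build(store, pipeline(fixed_clean), "report", day)
```

Nothing in the raw data changes during the repair, which is why the cost of an error stays low.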
Production critical upgrade
● Dual datasets during transition
● Run downstream parallel pipelines
○ Cheap
○ Low risk
○ Easy rollback
● Testable end-to-end
No dev & staging environment needed!
∆?
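The ∆ between the dual datasets could be computed along these lines. A minimal sketch, assuming both pipeline versions emit records with a shared key; the record shape and the name `dataset_delta` are made up for the example:

```python
def dataset_delta(old_rows, new_rows, key):
    """Compare the outputs of the old and new pipeline versions by record key."""
    old = {key(r): r for r in old_rows}
    new = {key(r): r for r in new_rows}
    return {
        "only_old": sorted(old.keys() - new.keys()),
        "only_new": sorted(new.keys() - old.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


# Hypothetical outputs of version 1 and version 2 for the same day.
v1 = [{"user": "a", "score": 1}, {"user": "b", "score": 2}]
v2 = [{"user": "a", "score": 1}, {"user": "b", "score": 3}, {"user": "c", "score": 5}]

delta = dataset_delta(v1, v2, key=lambda r: r["user"])
```

An empty or explainable delta is the signal to switch downstream consumers to the new dataset; otherwise the old one is still there for rollback.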
Operational manoeuvres - nearline
Upgrade
● Swift rollout
● Parallel pipelines
● User impact, QA?
Service failure
● Pipeline delay
● No data loss
● Downstream impact?
Bug
● Data corruption
● Downstream impact
Life of an error, streaming
Reprocessing in Kafka Streams
● Works for a single job, not pipeline. :-(
Data processing tradeoff
Online, Nearline, Offline
Data speed vs innovation speed
Separating online & offline
● Daily user DB dump. Cassandra can handle the load.
○ Load spike became 25 h long…
● New recommendation model! Cassandra can replicate to all regions.
○ Who saturated the Atlantic link?
● Batch jobs saturate one resource.
○ Bad neighbours.
Batch offline vs online
Orders
Replication / Backup
Raw
Fraud model
Fraud service
Standard procedures (online), lightweight procedures (offline)
● QA driven by internal efficiency
● Continuous deployment
● New pipeline < 1 day
● Upgrade < 1 hour
● Bug recovery < 1 hour
Careful handover
Data quality dimensions
● Timeliness
○ E.g. the customer engagement report was produced at the expected time
● Correctness
○ The numbers in the reports were calculated correctly
● Completeness
○ The report includes information on all customers, using all information from the whole time period
● Consistency
○ The customer summaries are all based on the same time period
Testing single batch job
Standard Scalatest harness
1. Generate input: f() writes file://test_input/
2. Run job in local mode
3. Verify output: p() checks file://test_output/
Runs well in CI / from IDE
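The slide names a Scalatest harness; the same three steps can be sketched in Python for illustration. The job `count_by_country` and the file names are hypothetical stand-ins for a real batch job:

```python
import csv
import os
import tempfile


def count_by_country(input_path, output_path):
    """The job under test: reads user rows, writes user counts per country."""
    counts = {}
    with open(input_path, newline="") as f:
        for row in csv.DictReader(f):
            counts[row["country"]] = counts.get(row["country"], 0) + 1
    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f)
        for country in sorted(counts):
            writer.writerow([country, counts[country]])


def run_job_test():
    with tempfile.TemporaryDirectory() as tmp:
        input_path = os.path.join(tmp, "test_input.csv")
        output_path = os.path.join(tmp, "test_output.csv")
        # 1. Generate input
        with open(input_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["id", "country"])
            writer.writerows([["1", "SE"], ["2", "SE"], ["3", "NO"]])
        # 2. Run in local mode: plain function call, no cluster involved
        count_by_country(input_path, output_path)
        # 3. Verify output
        with open(output_path, newline="") as f:
            return list(csv.reader(f))


result = run_job_test()
```

The test touches only local files, so it runs equally well in CI and from an IDE.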
Testing batch pipelines - two options
A: Standard Scalatest harness, test job with sequence of jobs
1. Generate input: f() writes file://test_input/
2. Run custom multi-job
3. Verify output: p() checks file://test_output/
B: Customised workflow manager setup, with the same f() and p()
Monitoring timeliness, examples
● Datamon - Spotify internal
● Twitter Ambrose (dead?)
● Airflow
Measuring correctness: counters
● Processing tool (Spark/Hadoop) counters
○ Odd code path => bump counter
○ System metrics
Hadoop / Spark counters
DB
Standard graphing tools
Standard alerting service
Measuring correctness: counters
● User-defined
● Technical from framework
○ Execution time
○ Memory consumption
○ Data volumes
○ ...
case class Order(item: ItemId, userId: UserId)
case class User(id: UserId, country: String)

val orders = read(orderPath)
val users = read(userPath)
// User-defined counter, bumped on the odd code path below
val orderNoUserCounter = longAccumulator("order-no-user")

// Left join keeps orders that have no matching user
val joined: C[(Order, Option[User])] = orders
  .groupBy(_.userId)
  .leftJoin(users.groupBy(_.id))
  .values

val orderWithUser: C[(Order, User)] = joined
  .flatMap {
    case (order, Some(user)) => Some((order, user))
    case (order, None) =>
      // Count the order instead of dropping it silently
      orderNoUserCounter.add(1)
      None
  }
SQL: Nope
Data quality - high code vs low code
● 2013: Python MapReduce outdated
● Hive/SQL?
○ Not expressive enough
○ Data quality challenging
● Technical platform + multi-skilled teams!
○ Strong development processes
Low code / no code platform? Technical platform?
Measuring consistency: pipelines
● Processing tool (Spark/Hadoop) counters
○ Odd code path => bump counter
○ System metrics
● Dedicated quality assessment pipelines
Quality assessment job
Quality metadataset (tiny)
DB
Standard graphing tools
Standard alerting service
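A dedicated quality assessment job can be very small. The sketch below is an assumption about what such a job might compute; the metric names and record fields are made up. It reduces a dataset to a tiny quality record that a standard graphing or alerting tool could consume:

```python
def assess_quality(rows, required_fields):
    """Summarise a dataset into a tiny quality record for graphing and alerting."""
    missing = {
        field: sum(1 for r in rows if r.get(field) in (None, ""))
        for field in required_fields
    }
    days = sorted({r["date"] for r in rows if r.get("date")})
    return {
        "row_count": len(rows),
        "missing": missing,
        "first_date": days[0] if days else None,
        "last_date": days[-1] if days else None,
    }


# Hypothetical customer summary records.
rows = [
    {"user": "a", "date": "2021-03-01", "country": "SE"},
    {"user": "b", "date": "2021-03-01", "country": ""},
    {"user": "c", "date": "2021-03-02", "country": None},
]

quality = assess_quality(rows, ["user", "country"])
# Consistency check: are all records based on the same time period?
consistent = quality["first_date"] == quality["last_date"]
```

Here the records span two dates, so the consistency check fails and an alert would be warranted.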
Machine learning operations, simplified
● Multiple trained models
○ Select at run time
● Measure user behaviour
○ E.g. session length, engagement, funnel
● Ready to revert to
○ old models
○ simpler models
Measure interactions
Rendezvous
DB
Standard alerting service
"The required surrounding infrastructure is vast and complex." - Google
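Selecting a model at run time, with a simpler model to fall back on, might look like the sketch below. The model names, scores and threshold are invented for illustration; in practice the metrics would come from measured user behaviour:

```python
def choose_model(candidates, metrics, baseline, min_score):
    """Pick the best-scoring candidate; revert to the baseline if none is good enough."""
    scored = [(metrics[name], name) for name in candidates if name in metrics]
    if scored:
        best_score, best_name = max(scored)
        if best_score >= min_score:
            return best_name
    return baseline


# Hypothetical engagement scores per trained model.
metrics = {"deep_v2": 0.61, "deep_v1": 0.72, "popularity": 0.55}

chosen = choose_model(
    ["deep_v2", "deep_v1"], metrics, baseline="popularity", min_score=0.6
)
fallback = choose_model(
    ["deep_v2"], {"deep_v2": 0.3}, baseline="popularity", min_score=0.6
)
```

The older model wins in the first call because it scores better, and the second call reverts to the simple baseline because the only candidate is below the threshold.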
Not all things went well
● Autonomy → excessive heterogeneity
○ 25 ways to store a timestamp?
● Pipeline end-to-end tests
○ Culturally challenging
○ → difficult to change & retire pipelines
● Trial and error to learn
Data engineering in Scandinavia
● Stockholm region ranks 2nd in unicorns / capita
○ Media, games, fintech
● Critical mass of world class data engineering
○ Limited to a few companies
Mission: Spread data & AI superpowers
● There are companies to help
● Data & AI capabilities require culture & process change
○ Slow, very slow
Scandinavian minimalist design
● Lean, simple technology - focus on flow and business value
● Bonnier News data platform, 4-5 persons:
○ Zero to happy customer in 3 weeks.
○ Dozens of ROI pipelines in 8 months.
● Scling retail client, 1-3 persons, after 1 year:
○ 40 sources, 70 pipelines, 200 egress points
○ 3,400 datasets / day
● Typical enterprise numbers
○ Big data project: 6-24 months
○ Analytics department: 100-1000 datasets / day
○ Spotify: 100,000+ datasets / day
○ Google: 1.6B datasets / day (2016)
Scling - data-value-as-a-service
Data value through collaboration
Customer
Data factory
Data platform & lake
Data, domain expertise
Value from data!
www.scling.com/reading-list
www.scling.com/presentations
www.scling.com/courses
1 of 59

Recommended

Data democratised by
Data democratisedData democratised
Data democratisedLars Albertsson
307 views32 slides
Taming the reproducibility crisis by
Taming the reproducibility crisisTaming the reproducibility crisis
Taming the reproducibility crisisLars Albertsson
521 views26 slides
Mortal analytics - Covid-19 and the problem of data quality by
Mortal analytics - Covid-19 and the problem of data qualityMortal analytics - Covid-19 and the problem of data quality
Mortal analytics - Covid-19 and the problem of data qualityLars Albertsson
416 views43 slides
Don't build a data science team by
Don't build a data science teamDon't build a data science team
Don't build a data science teamLars Albertsson
883 views35 slides
Data ops in practice by
Data ops in practiceData ops in practice
Data ops in practiceLars Albertsson
3K views26 slides
DataOps - Lean principles and lean practices by
DataOps - Lean principles and lean practicesDataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesLars Albertsson
787 views29 slides

More Related Content

What's hot

Kubernetes as data platform by
Kubernetes as data platformKubernetes as data platform
Kubernetes as data platformLars Albertsson
884 views34 slides
Engineering data quality by
Engineering data qualityEngineering data quality
Engineering data qualityLars Albertsson
1.3K views50 slides
The lean principles of data ops by
The lean principles of data opsThe lean principles of data ops
The lean principles of data opsLars Albertsson
410 views38 slides
10 ways to stumble with big data by
10 ways to stumble with big data10 ways to stumble with big data
10 ways to stumble with big dataLars Albertsson
1.4K views18 slides
Protecting privacy in practice by
Protecting privacy in practiceProtecting privacy in practice
Protecting privacy in practiceLars Albertsson
9.8K views36 slides
Data pipelines from zero to solid by
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solidLars Albertsson
10.7K views58 slides

What's hot(20)

10 ways to stumble with big data by Lars Albertsson
10 ways to stumble with big data10 ways to stumble with big data
10 ways to stumble with big data
Lars Albertsson1.4K views
Protecting privacy in practice by Lars Albertsson
Protecting privacy in practiceProtecting privacy in practice
Protecting privacy in practice
Lars Albertsson9.8K views
Data pipelines from zero to solid by Lars Albertsson
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solid
Lars Albertsson10.7K views
Open Data Science Conference Agile Data by DataKitchen
Open Data Science Conference Agile DataOpen Data Science Conference Agile Data
Open Data Science Conference Agile Data
DataKitchen1.5K views
Offload, Transform, and Present - The New World of Data Integration by gluent.
Offload, Transform, and Present - The New World of Data IntegrationOffload, Transform, and Present - The New World of Data Integration
Offload, Transform, and Present - The New World of Data Integration
gluent.595 views
Building Reactive Real-time Data Pipeline by Trieu Nguyen
Building Reactive Real-time Data PipelineBuilding Reactive Real-time Data Pipeline
Building Reactive Real-time Data Pipeline
Trieu Nguyen6K views
Testing the Data Warehouse—Big Data, Big Problems by TechWell
Testing the Data Warehouse—Big Data, Big ProblemsTesting the Data Warehouse—Big Data, Big Problems
Testing the Data Warehouse—Big Data, Big Problems
TechWell1.8K views
Data Science and Enterprise Engineering with Michael Finger and Chris Robison by Databricks
Data Science and Enterprise Engineering with Michael Finger and Chris RobisonData Science and Enterprise Engineering with Michael Finger and Chris Robison
Data Science and Enterprise Engineering with Michael Finger and Chris Robison
Databricks480 views
Testing data streaming applications by Lars Albertsson
Testing data streaming applicationsTesting data streaming applications
Testing data streaming applications
Lars Albertsson4K views
How to design and implement a data ops architecture with sdc and gcp by Joseph Arriola
How to design and implement a data ops architecture with sdc and gcpHow to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcp
Joseph Arriola394 views
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ... by smallerror
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
smallerror1.4K views
Neo4j-Databridge: Enterprise-scale ETL for Neo4j by Neo4j
Neo4j-Databridge: Enterprise-scale ETL for Neo4jNeo4j-Databridge: Enterprise-scale ETL for Neo4j
Neo4j-Databridge: Enterprise-scale ETL for Neo4j
Neo4j424 views
H2O AutoML roadmap - Ray Peck by Sri Ambati
H2O AutoML roadmap - Ray PeckH2O AutoML roadmap - Ray Peck
H2O AutoML roadmap - Ray Peck
Sri Ambati2.1K views
Continuous delivery for machine learning by Rajesh Muppalla
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learning
Rajesh Muppalla2.9K views
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop... by DataKitchen
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
DataKitchen2.1K views

Similar to Data ops in practice - Swedish style

Holistic data application quality by
Holistic data application qualityHolistic data application quality
Holistic data application qualityLars Albertsson
396 views30 slides
Secure software supply chain on a shoestring budget by
Secure software supply chain on a shoestring budgetSecure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetLars Albertsson
268 views49 slides
Data engineering in 10 years.pdf by
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdfLars Albertsson
842 views52 slides
Crossing the data divide by
Crossing the data divideCrossing the data divide
Crossing the data divideLars Albertsson
3 views31 slides
About VisualDNA Architecture @ Rubyslava 2014 by
About VisualDNA Architecture @ Rubyslava 2014About VisualDNA Architecture @ Rubyslava 2014
About VisualDNA Architecture @ Rubyslava 2014Michal Harish
1.6K views13 slides
Schema management with Scalameta by
Schema management with ScalametaSchema management with Scalameta
Schema management with ScalametaLars Albertsson
7 views50 slides

Similar to Data ops in practice - Swedish style(20)

Holistic data application quality by Lars Albertsson
Holistic data application qualityHolistic data application quality
Holistic data application quality
Lars Albertsson396 views
Secure software supply chain on a shoestring budget by Lars Albertsson
Secure software supply chain on a shoestring budgetSecure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budget
Lars Albertsson268 views
Data engineering in 10 years.pdf by Lars Albertsson
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdf
Lars Albertsson842 views
About VisualDNA Architecture @ Rubyslava 2014 by Michal Harish
About VisualDNA Architecture @ Rubyslava 2014About VisualDNA Architecture @ Rubyslava 2014
About VisualDNA Architecture @ Rubyslava 2014
Michal Harish1.6K views
Test strategies for data processing pipelines by Lars Albertsson
Test strategies for data processing pipelinesTest strategies for data processing pipelines
Test strategies for data processing pipelines
Lars Albertsson5.2K views
Thinking DevOps in the Era of the Cloud - Demi Ben-Ari by Demi Ben-Ari
Thinking DevOps in the Era of the Cloud - Demi Ben-AriThinking DevOps in the Era of the Cloud - Demi Ben-Ari
Thinking DevOps in the Era of the Cloud - Demi Ben-Ari
Demi Ben-Ari67 views
An Analytics Engineer’s Guide to Streaming With Amy Chen | Current 2022 by HostedbyConfluent
An Analytics Engineer’s Guide to Streaming With Amy Chen | Current 2022An Analytics Engineer’s Guide to Streaming With Amy Chen | Current 2022
An Analytics Engineer’s Guide to Streaming With Amy Chen | Current 2022
HostedbyConfluent458 views
Data Science in the Cloud @StitchFix by C4Media
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
C4Media952 views
The of Operational Analytics Data Store by Rommel Garcia
The of Operational Analytics Data StoreThe of Operational Analytics Data Store
The of Operational Analytics Data Store
Rommel Garcia288 views
Introduction to Data Engineer and Data Pipeline at Credit OK by Kriangkrai Chaonithi
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OK
Data platform architecture principles - ieee infrastructure 2020 by Julien Le Dem
Data platform architecture principles - ieee infrastructure 2020Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020
Julien Le Dem824 views
Managing Apache Spark Workload and Automatic Optimizing by Databricks
Managing Apache Spark Workload and Automatic OptimizingManaging Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic Optimizing
Databricks1.1K views
H2O at Poznan R Meetup by Jo-fai Chow
H2O at Poznan R MeetupH2O at Poznan R Meetup
H2O at Poznan R Meetup
Jo-fai Chow755 views

More from Lars Albertsson

How to not kill people - Berlin Buzzwords 2023.pdf by
How to not kill people - Berlin Buzzwords 2023.pdfHow to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdfLars Albertsson
34 views51 slides
The 7 habits of data effective companies.pdf by
The 7 habits of data effective companies.pdfThe 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdfLars Albertsson
252 views44 slides
Ai legal and ethics by
Ai   legal and ethicsAi   legal and ethics
Ai legal and ethicsLars Albertsson
200 views6 slides
Eventually, time will kill your data pipeline by
Eventually, time will kill your data pipelineEventually, time will kill your data pipeline
Eventually, time will kill your data pipelineLars Albertsson
936 views54 slides
Big data == lean data by
Big data == lean dataBig data == lean data
Big data == lean dataLars Albertsson
226 views17 slides
Privacy by design by
Privacy by designPrivacy by design
Privacy by designLars Albertsson
1.9K views44 slides

More from Lars Albertsson(9)

How to not kill people - Berlin Buzzwords 2023.pdf by Lars Albertsson
How to not kill people - Berlin Buzzwords 2023.pdfHow to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdf
Lars Albertsson34 views
The 7 habits of data effective companies.pdf by Lars Albertsson
The 7 habits of data effective companies.pdfThe 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdf
Lars Albertsson252 views
Eventually, time will kill your data pipeline by Lars Albertsson
Eventually, time will kill your data pipelineEventually, time will kill your data pipeline
Eventually, time will kill your data pipeline
Lars Albertsson936 views
Test strategies for data processing pipelines, v2.0 by Lars Albertsson
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0
Lars Albertsson2.7K views
A primer on building real time data-driven products by Lars Albertsson
A primer on building real time data-driven productsA primer on building real time data-driven products
A primer on building real time data-driven products
Lars Albertsson951 views
Building real time data-driven products by Lars Albertsson
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven products
Lars Albertsson2.8K views

Recently uploaded

PCCC23:日本AMD株式会社 テーマ2「AMD EPYC™ プロセッサーを用いたAIソリューション」 by
PCCC23:日本AMD株式会社 テーマ2「AMD EPYC™ プロセッサーを用いたAIソリューション」PCCC23:日本AMD株式会社 テーマ2「AMD EPYC™ プロセッサーを用いたAIソリューション」
PCCC23:日本AMD株式会社 テーマ2「AMD EPYC™ プロセッサーを用いたAIソリューション」PC Cluster Consortium
25 views12 slides
Innovation & Entrepreneurship strategies in Dairy Industry by
Innovation & Entrepreneurship strategies in Dairy IndustryInnovation & Entrepreneurship strategies in Dairy Industry
Innovation & Entrepreneurship strategies in Dairy IndustryPervaizDar1
35 views26 slides
Bronack Skills - Risk Management and SRE v1.0 12-3-2023.pdf by
Bronack Skills - Risk Management and SRE v1.0 12-3-2023.pdfBronack Skills - Risk Management and SRE v1.0 12-3-2023.pdf
Bronack Skills - Risk Management and SRE v1.0 12-3-2023.pdfThomasBronack
31 views31 slides
Business Analyst Series 2023 - Week 4 Session 7 by
Business Analyst Series 2023 -  Week 4 Session 7Business Analyst Series 2023 -  Week 4 Session 7
Business Analyst Series 2023 - Week 4 Session 7DianaGray10
146 views31 slides
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or... by
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...ShapeBlue
199 views20 slides
AI + Memoori = AIM by
AI + Memoori = AIMAI + Memoori = AIM
AI + Memoori = AIMMemoori
14 views9 slides

Recently uploaded(20)

PCCC23:日本AMD株式会社 テーマ2「AMD EPYC™ プロセッサーを用いたAIソリューション」 by PC Cluster Consortium
PCCC23:日本AMD株式会社 テーマ2「AMD EPYC™ プロセッサーを用いたAIソリューション」PCCC23:日本AMD株式会社 テーマ2「AMD EPYC™ プロセッサーを用いたAIソリューション」
PCCC23:日本AMD株式会社 テーマ2「AMD EPYC™ プロセッサーを用いたAIソリューション」
Innovation & Entrepreneurship strategies in Dairy Industry by PervaizDar1
Innovation & Entrepreneurship strategies in Dairy IndustryInnovation & Entrepreneurship strategies in Dairy Industry
Innovation & Entrepreneurship strategies in Dairy Industry
PervaizDar135 views
Bronack Skills - Risk Management and SRE v1.0 12-3-2023.pdf by ThomasBronack
Bronack Skills - Risk Management and SRE v1.0 12-3-2023.pdfBronack Skills - Risk Management and SRE v1.0 12-3-2023.pdf
Bronack Skills - Risk Management and SRE v1.0 12-3-2023.pdf
ThomasBronack31 views
Business Analyst Series 2023 - Week 4 Session 7 by DianaGray10
Business Analyst Series 2023 -  Week 4 Session 7Business Analyst Series 2023 -  Week 4 Session 7
Business Analyst Series 2023 - Week 4 Session 7
DianaGray10146 views
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or... by ShapeBlue
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...
ShapeBlue199 views
AI + Memoori = AIM by Memoori
AI + Memoori = AIMAI + Memoori = AIM
AI + Memoori = AIM
Memoori14 views
What is Authentication Active Directory_.pptx by HeenaMehta35
What is Authentication Active Directory_.pptxWhat is Authentication Active Directory_.pptx
What is Authentication Active Directory_.pptx
HeenaMehta3515 views
Business Analyst Series 2023 - Week 4 Session 8 by DianaGray10
Business Analyst Series 2023 -  Week 4 Session 8Business Analyst Series 2023 -  Week 4 Session 8
Business Analyst Series 2023 - Week 4 Session 8
DianaGray10145 views
Deep Tech and the Amplified Organisation: Core Concepts by Holonomics
Deep Tech and the Amplified Organisation: Core ConceptsDeep Tech and the Amplified Organisation: Core Concepts
Deep Tech and the Amplified Organisation: Core Concepts
Holonomics17 views
Webinar : Desperately Seeking Transformation - Part 2: Insights from leading... by The Digital Insurer
Webinar : Desperately Seeking Transformation - Part 2:  Insights from leading...Webinar : Desperately Seeking Transformation - Part 2:  Insights from leading...
Webinar : Desperately Seeking Transformation - Part 2: Insights from leading...
"Node.js Development in 2024: trends and tools", Nikita Galkin by Fwdays
"Node.js Development in 2024: trends and tools", Nikita Galkin "Node.js Development in 2024: trends and tools", Nikita Galkin
"Node.js Development in 2024: trends and tools", Nikita Galkin
Fwdays33 views
Discover Aura Workshop (12.5.23).pdf by Neo4j
Discover Aura Workshop (12.5.23).pdfDiscover Aura Workshop (12.5.23).pdf
Discover Aura Workshop (12.5.23).pdf
Neo4j15 views
Measurecamp Brussels - Synthetic data.pdf by Human37
Measurecamp Brussels - Synthetic data.pdfMeasurecamp Brussels - Synthetic data.pdf
Measurecamp Brussels - Synthetic data.pdf
Human37 26 views
GDSC GLAU Info Session.pptx by gauriverrma4
GDSC GLAU Info Session.pptxGDSC GLAU Info Session.pptx
GDSC GLAU Info Session.pptx
gauriverrma415 views
This talk was not generated with ChatGPT: how AI is changing science by Elena Simperl
This talk was not generated with ChatGPT: how AI is changing scienceThis talk was not generated with ChatGPT: how AI is changing science
This talk was not generated with ChatGPT: how AI is changing science
Elena Simperl32 views
"Running students' code in isolation. The hard way", Yurii Holiuk by Fwdays
"Running students' code in isolation. The hard way", Yurii Holiuk "Running students' code in isolation. The hard way", Yurii Holiuk
"Running students' code in isolation. The hard way", Yurii Holiuk
Fwdays36 views
AIM102-S_Cognizant_CognizantCognitive by PhilipBasford
AIM102-S_Cognizant_CognizantCognitiveAIM102-S_Cognizant_CognizantCognitive
AIM102-S_Cognizant_CognizantCognitive
PhilipBasford21 views
"Package management in monorepos", Zoltan Kochan by Fwdays
"Package management in monorepos", Zoltan Kochan"Package management in monorepos", Zoltan Kochan
"Package management in monorepos", Zoltan Kochan
Fwdays34 views

Data ops in practice - Swedish style

  • 1. www.scling.com DataOps in practice - Swedish style Lars Albertsson (@lalleal) Scling 1
  • 2. Who’s talking? ... Google - video conference, engineering productivity ... Spotify - data engineering ... Independent data engineering consultant: banks, media, startups, heavy industry, telco. Founder @ Scling - data-value-as-a-service
  • 3. Contents: Journey to DataOps; Experiences that shaped my data engineering; IMHO principles of successful DataOps; Toolbox ● Spotify information is old history ● Previously published ● Today is very different
  • 4. Spotify data 2007-2013 ● Hadoop installed 2007 ● Use cases: reporting, insights, recommendations ● Cultural aspects: ○ Autonomous teams ○ Eliminate waste ○ Learn and adapt
  • 5. Traditional systems. Diagram label: Mutation. Early Hadoop: ● Weak indexing ● No transactions ● Weak security ● Batch transformations
  • 6. Diagram labels: Data lake; Transformation; Cold store; Mutation; Immutable, shareable. Early Hadoop: ● Weak indexing ● No transactions ● Weak security ● Batch transformations. DataOps workflows: ● Immutable, shared data ● Resilient to failure ● Quick error recovery ● Low-risk experiments. Data factories
  • 7. What conclusion from this graph? COVID-19 fatalities / day in Sweden
  • 8. Wrong conclusion, every day ● Downward trend every day!
  • 9. Normalise data collection to compare. Graph by Adam Altmejd, @adamaltmejd
  • 10. Normalise data collection to compare. Graph by Adam Altmejd, @adamaltmejd
  • 11. Forecast for analytics with fresh data. Graph by Adam Altmejd, @adamaltmejd
  • 13. From craft to process: Multiple time windows
  • 14. From craft to process: Multiple time windows; Assess ingress data quality
  • 15. From craft to process: Multiple time windows; Assess ingress data quality; Assess outcome data quality
  • 16. From craft to process: Multiple time windows; Assess ingress data quality; Assess outcome data quality; Repair broken data; Intermediate datasets, reusable between pipelines
  • 17. From craft to process: Multiple time windows; Assess ingress data quality; Repair broken data from complementary source; Assess outcome data quality
  • 18. From craft to process: Multiple time windows; Assess ingress data quality; Repair broken data from complementary source; Forecast based on history; Assess outcome data quality
  • 19. From craft to process: Multiple time windows; Assess ingress data quality; Repair broken data from complementary source; Forecast based on history, multiple parameter settings; Assess outcome data quality
  • 20. From craft to process: Multiple time windows; Assess ingress data quality; Repair broken data from complementary source; Forecast based on history, multiple parameter settings; Assess outcome data quality; Assess forecast success, adapt parameters
  • 21. Towards sustainable production ML: Multiple models, parameters, features; Assess ingress data quality; Repair broken data from complementary source; Choose model and parameters based on performance and input data; Benchmark models; Try multiple models, measure, A/B test
  • 22. Risky operations. "How do I test the pipeline?" "You temporarily change the output path and run manually." "What if I forget to change the path?" "Don’t do that."
  • 23. 2013 ● Teams: Analytics computation (AC), data collection (DC), recommend, reporting (1) ● Folklore development cycle & operations ● Unsatisfied needs in other teams
  • 24. On-prem Hadoop production. Redundant "edge nodes" with Luigi workers, scheduled with cron. Compute + data in Hadoop. Crontab entries on each worker: 10 * * * * luigi --module mymodule MyDaily; 23 * * * * luigi --module other OtherDaily. Diagram labels: luigid, Worker, Master, Executor, HDFS metadata, Data, Control (+data), Submit job
  • 25. Ghost in the cluster ● Jobs were deployed with Debian packages + Puppet on pet machines. ○ Multiple pets for redundancy. Race to run job. ● "This monitor daemon is at 100%. Since 6 months. I'll kill it." ● "Data is wrong. But we fixed this bug 6 months ago?!?"
  • 26. Start of a DataOps journey: Stateful → Stateless; Pets → Cattle; Folklore → Golden path; Test in prod → Local test, CI/CD; Weeks to learn → New pipeline < 1 day; Days to mend → Bug fix < 1 hour
  • 27. On-prem pipeline deployment pipeline: source repo (Luigi DSL, jars, config); my-pipe-7.tar.gz (standard deployment artifact, standard artifact store); > pip install my-pipe-7.tar.gz (all that a pipeline needs, installed atomically); redundant cron schedule, higher frequency: 10 * * * * luigi --module mymodule MyDaily. Diagram labels: Luigi daemon, Worker (x8)
  • 28. Principle: Functional pipelines ● Raw source of truth + data refinement factory ● Immutable datasets & artifacts ● Deterministic, idempotent, reproducible deployment & processing ● Key success factor: workflow orchestration ○ Oozie, Rambo, Builder, Builder2, Luigi ○ Key properties: 1. Pure Python 2. Simplicity 3. All the features it lacks
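
The idempotency that makes a redundant cron schedule safe can be illustrated with a minimal sketch. It is not Luigi code: `run_task` and `daily_report` are hypothetical names, and the local file system stands in for HDFS. The assumption is that a task is complete exactly when its output exists, and that output appears atomically.

```python
import os
import tempfile
from pathlib import Path


def run_task(output: Path, compute) -> bool:
    """Run `compute` unless `output` already exists. Return True if work was done."""
    if output.exists():
        return False  # Already produced, possibly by a redundant worker.
    output.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file in the same directory, then rename it into
    # place, so that readers never observe a half-written dataset.
    fd, tmp_name = tempfile.mkstemp(dir=str(output.parent), suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        tmp.write(compute())
    os.replace(tmp_name, output)
    return True


def daily_report() -> str:
    """Hypothetical deterministic job body."""
    return "orders=3\n"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as lake:
        target = Path(lake) / "report" / "2020-05-01.txt"
        print(run_task(target, daily_report))  # first invocation does the work
        print(run_task(target, daily_report))  # second invocation is a no-op
```

Because the job body is deterministic, two workers that race past the existence check would write identical content, and the rename keeps the visible dataset whole either way.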
  • 29. Big data - a collaboration paradigm: Stream storage; Data lake; Data democratised
  • 30. Enabling teams ● Technically ○ Data available ○ Reusable QA ● Operationally ○ Continuous deployment ○ Hands off operations ○ Monitoring, debugging ● Bottom-up innovation. "The actual work that went into Discover Weekly was very little, because we're reusing things we already had." https://youtu.be/A259Yo8hBRs https://youtu.be/ZcmJxli8WS8
  • 31. Principle: Small scope components ● Do one thing well. Less is more. ● Complex systems from replaceable bricks ○ Cloud/OSS over enterprise vendors ○ Simplicity over features. Diagram labels: Solvable challenge; ~2000 lines of code; Perpetual complexity
  • 32. Cloud native deployment: source repo (Luigi DSL, jars, config); my-pipe:7 (Docker image, Docker registry); redundant cron schedule, higher frequency: kind: CronJob spec: schedule: "10 * * * *" command: "luigi --module mymodule MyDaily". Diagram labels: Luigi daemon, Worker (x8), S3 / GCS, Dataproc / EMR
  • 33. Data platform gravitation ● Hadoop all the things. ● Data is there. Simple test, simple deploy, simple ops. ● Autonomous teams - no mandate. Natural gravity.
  • 34. Data integration timescales. Nearline ● Stream storage ● Asynchronous event processing ● 10 ms - 1 hour. Offline ● File storage ● Asynchronous batch processing ● 1 minute - Online ● SOA / microservices ● Synchronous RPC ● 1-100 ms. Diagram labels: Job, Stream
  • 35. Operational manoeuvres - online. Upgrade ● Careful rollout ● Risk of user impact ● Proactive QA
  • 36. Operational manoeuvres - online. Upgrade ● Careful rollout ● Risk of user impact ● Proactive QA. Service failure ● User impact ● Data loss ● Cascading outage
  • 37. Operational manoeuvres - online. Upgrade ● Careful rollout ● Risk of user impact ● Proactive QA. Service failure ● User impact ● Data loss ● Cascading outage. Bug ● User impact ● Data corruption ● Cascading corruption
  • 38. Operational manoeuvres - offline. Upgrade ● Instant rollout ● No user impact ● Reactive QA. Service failure ● Pipeline delay ● No data loss ● No downstream impact. Bug ● Temporary data corruption ● Downstream impact
  • 39. Life of an error, batch pipelines ● Faulty job, emits bad data 1. Revert serving datasets to old 2. Fix bug 3. Remove faulty datasets 4. Backfill is automatic (Luigi) Done! ● Low cost of error ○ Reactive QA ○ Production environment sufficient
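
A minimal sketch of why removing faulty datasets is enough to trigger a backfill, assuming the scheduler recomputes any daily dataset that is missing. The `backfill` function, the in-memory `store` dict and the job lambdas are hypothetical stand-ins; Luigi's own mechanism is not reproduced here.

```python
from datetime import date, timedelta


def date_range(first: date, last: date):
    """Yield every day from first to last, inclusive."""
    day = first
    while day <= last:
        yield day
        day += timedelta(days=1)


def backfill(store: dict, first: date, last: date, job) -> list:
    """Compute every missing daily dataset in [first, last]; return the rebuilt dates."""
    rebuilt = []
    for day in date_range(first, last):
        if day not in store:  # existing datasets are never recomputed
            store[day] = job(day)
            rebuilt.append(day)
    return rebuilt


if __name__ == "__main__":
    store = {}
    first, last = date(2020, 5, 1), date(2020, 5, 4)
    backfill(store, first, last, lambda day: "output of buggy job")
    # Steps 2 and 3 from the slide: fix the bug, then remove the faulty datasets.
    del store[date(2020, 5, 2)]
    del store[date(2020, 5, 3)]
    # Step 4: the next scheduled run fills the holes with the fixed job.
    print(backfill(store, first, last, lambda day: "output of fixed job"))
```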
  • 40. Production critical upgrade ● Dual datasets during transition ● Run downstream parallel pipelines ○ Cheap ○ Low risk ○ Easy rollback ● Testable end-to-end. No dev & staging environment needed! Diagram label: ∆?
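
One way to read the "∆?" in the diagram is as a comparison between the datasets produced by the old and the new pipeline version. A minimal sketch, assuming both datasets fit in memory as dicts keyed by record id; `diff_datasets` is a hypothetical helper, not something named in the talk.

```python
def diff_datasets(old: dict, new: dict) -> dict:
    """Compare two keyed datasets produced by parallel pipeline versions."""
    shared = old.keys() & new.keys()
    return {
        "only_old": sorted(old.keys() - new.keys()),
        "only_new": sorted(new.keys() - old.keys()),
        "changed": sorted(key for key in shared if old[key] != new[key]),
    }


if __name__ == "__main__":
    old_output = {"u1": 10, "u2": 7, "u3": 1}
    new_output = {"u1": 10, "u2": 8, "u4": 2}
    print(diff_datasets(old_output, new_output))
```

An empty diff, or one that contains only the intended changes, is the signal to switch consumers to the new dataset; rollback amounts to keeping the old one.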
  • 41. Operational manoeuvres - nearline. Upgrade ● Swift rollout ● Parallel pipelines ● User impact, QA? Service failure ● Pipeline delay ● No data loss ● Downstream impact? Bug ● Data corruption ● Downstream impact. Diagram labels: Job, Stream
  • 42. Life of an error, streaming ● Works for a single job, not pipeline. :-( Diagram: Reprocessing in Kafka Streams (labels: Job, Stream)
  • 43. Data processing tradeoff. Diagram labels: Data speed, Innovation speed, Nearline, Offline, Online, Job, Stream
  • 44. Separating online & offline ● Daily user DB dump. Cassandra can handle the load. ○ Load spike became 25 h long… ● New recommendation model! Cassandra can replicate to all regions. ○ Who saturated the Atlantic link? ● Batch jobs saturate one resource. ○ Bad neighbours.
  • 45. Batch offline vs online. Diagram labels: Raw, Fraud service, Fraud model, Orders, Replication / Backup, Standard procedures, Lightweight procedures, Careful handover ● QA driven by internal efficiency ● Continuous deployment ● New pipeline < 1 day ● Upgrade < 1 hour ● Bug recovery < 1 hour
  • 46. Data quality dimensions ● Timeliness ○ E.g. the customer engagement report was produced at the expected time ● Correctness ○ The numbers in the reports were calculated correctly ● Completeness ○ The report includes information on all customers, using all information from the whole time period ● Consistency ○ The customer summaries are all based on the same time period
  • 47. Testing single batch job: Standard Scalatest harness; file://test_input/; file://test_output/; 1. Generate input 2. Run in local mode 3. Verify output. Runs well in CI / from IDE. Diagram labels: Job, f(), p()
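
The three steps of the single-job test can be sketched as follows. The slide uses a Scalatest harness; this sketch swaps in plain Python with temporary local files, and both `count_orders_per_user` (the job under test) and `run_job_on` (the harness) are hypothetical names.

```python
import json
import tempfile
from pathlib import Path


def count_orders_per_user(input_path: Path, output_path: Path) -> None:
    """Hypothetical batch job: read JSON-lines orders, write order counts per user."""
    counts = {}
    with input_path.open() as lines:
        for line in lines:
            order = json.loads(line)
            counts[order["user_id"]] = counts.get(order["user_id"], 0) + 1
    with output_path.open("w") as out:
        for user_id in sorted(counts):
            out.write(json.dumps({"user_id": user_id, "orders": counts[user_id]}) + "\n")


def run_job_on(records: list) -> list:
    """Harness: 1. generate input, 2. run the job locally, 3. return output to verify."""
    with tempfile.TemporaryDirectory() as workdir:
        input_path = Path(workdir) / "test_input.jsonl"
        output_path = Path(workdir) / "test_output.jsonl"
        input_path.write_text("".join(json.dumps(r) + "\n" for r in records))
        count_orders_per_user(input_path, output_path)
        return [json.loads(line) for line in output_path.read_text().splitlines()]


if __name__ == "__main__":
    result = run_job_on([
        {"user_id": "u1", "item": "a"},
        {"user_id": "u2", "item": "b"},
        {"user_id": "u1", "item": "c"},
    ])
    assert result == [{"user_id": "u1", "orders": 2}, {"user_id": "u2", "orders": 1}]
```

The job only sees file paths, so the same code runs against test files in CI and against the data lake in production.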
  • 48. Testing batch pipelines - two options: Standard Scalatest harness; file://test_input/; file://test_output/; 1. Generate input 2. Run custom multi-job 3. Verify output. Options labelled A and B in the diagram: Test job with sequence of jobs; Customised workflow manager setup. Diagram labels: f(), p()
  • 49. Monitoring timeliness, examples ● Datamon - Spotify internal ● Twitter Ambrose (dead?) ● Airflow
  • 50. Measuring correctness: counters ● Processing tool (Spark/Hadoop) counters ○ Odd code path => bump counter ○ System metrics. Diagram labels: Hadoop / Spark counters, DB, Standard graphing tools, Standard alerting service
  • 51. Measuring correctness: counters ● User-defined ● Technical from framework ○ Execution time ○ Memory consumption ○ Data volumes ○ ... SQL: Nope. Code on slide:
        case class Order(item: ItemId, userId: UserId)
        case class User(id: UserId, country: String)

        val orders = read(orderPath)
        val users = read(userPath)
        // User-defined counter for orders that reference an unknown user.
        val orderNoUserCounter = longAccumulator("order-no-user")

        val joined: C[(Order, Option[User])] = orders
          .groupBy(_.userId)
          .leftJoin(users.groupBy(_.id))
          .values

        val orderWithUser: C[(Order, User)] = joined
          .flatMap {
            case (order, Some(user)) => Some((order, user))
            case (order, None) =>
              // Odd code path: bump the counter and drop the record.
              orderNoUserCounter.add(1)
              None
          }
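
The same "odd code path => bump counter" idea, as a minimal in-memory Python sketch. A `collections.Counter` stands in for the framework accumulator, and `join_orders_with_users` is a hypothetical name; field types are simplified to strings.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    item: str
    user_id: str


@dataclass(frozen=True)
class User:
    id: str
    country: str


def join_orders_with_users(orders, users, counters: Counter) -> list:
    """Inner-join orders with users, counting the orders that have no matching user."""
    users_by_id = {user.id: user for user in users}
    joined = []
    for order in orders:
        user = users_by_id.get(order.user_id)
        if user is None:
            counters["order-no-user"] += 1  # odd code path: count it, drop the record
            continue
        joined.append((order, user))
    return joined


if __name__ == "__main__":
    counters = Counter()
    orders = [Order("book", "u1"), Order("pen", "u9")]
    users = [User("u1", "SE")]
    print(join_orders_with_users(orders, users, counters))
    print(counters["order-no-user"])
```

Exporting the counter value after each run gives a time series that standard graphing and alerting tools can watch.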
  • 52. Data quality - high code vs low code ● 2013: Python MapReduce outdated ● Hive/SQL? ○ Not expressive enough ○ Data quality challenging ● Technical platform + multi-skilled teams! ○ Strong development processes. Low code / no code platform? Technical platform?
  • 53. Measuring consistency: pipelines ● Processing tool (Spark/Hadoop) counters ○ Odd code path => bump counter ○ System metrics ● Dedicated quality assessment pipelines. Diagram labels: DB, Quality assessment job, Quality metadataset (tiny), Standard graphing tools, Standard alerting service
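
A dedicated quality assessment job can be sketched as a function that reads a dataset and emits one tiny record of quality metrics. The metric names, the `date` field and the alert threshold below are assumptions chosen for illustration, not taken from the talk.

```python
def assess_quality(records: list, expected_date: str, required_fields: tuple) -> dict:
    """Summarise one dataset as a single row of a (tiny) quality metadataset."""
    missing = sum(
        1 for record in records
        if any(record.get(field) is None for field in required_fields)
    )
    wrong_date = sum(1 for record in records if record.get("date") != expected_date)
    return {
        "date": expected_date,
        "records": len(records),
        "missing_required_field": missing,  # completeness
        "wrong_date": wrong_date,           # consistency: all rows from the same period
    }


def should_alert(metrics: dict, max_bad_ratio: float = 0.01) -> bool:
    """Hypothetical alert rule: too many bad records, or no records at all."""
    if metrics["records"] == 0:
        return True
    bad = metrics["missing_required_field"] + metrics["wrong_date"]
    return bad / metrics["records"] > max_bad_ratio


if __name__ == "__main__":
    dataset = [
        {"user_id": "u1", "date": "2020-05-01"},
        {"user_id": None, "date": "2020-05-01"},
        {"user_id": "u3", "date": "2020-04-30"},
    ]
    metrics = assess_quality(dataset, "2020-05-01", ("user_id",))
    print(metrics, should_alert(metrics))
```

Because the metrics are themselves a dataset, they can be stored, graphed and alerted on with the same tools as any other pipeline output.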
  • 54. Machine learning operations, simplified ● Multiple trained models ○ Select at run time ● Measure user behaviour ○ E.g. session length, engagement, funnel ● Ready to revert to ○ old models ○ simpler models. "The required surrounding infrastructure is vast and complex." - Google. Diagram labels: Measure interactions, Rendezvous DB, Standard alerting service, Stream, Job
  • 55. Not all things went well ● Autonomy → excessive heterogeneity ○ 25 ways to store a timestamp? ● Pipeline end-to-end tests ○ Culturally challenging ○ → difficult to change & retire pipelines ● Trial and error to learn
  • 56. Data engineering in Scandinavia ● Stockholm region ranks 2nd in unicorns / capita ○ Media, games, fintech ● Critical mass of world class data engineering ○ Limited to a few companies
  • 57. Mission: Spread data & AI superpowers ● There are companies to help ● Data & AI capabilities require culture & process change ○ Slow, very slow
  • 58. Scandinavian minimalist design ● Lean, simple technology - focus on flow and business value ● Bonnier News data platform, 4-5 persons: ○ Zero to happy customer in 3 weeks. ○ Dozens of ROI pipelines in 8 months. ● Scling retail client, 1-3 persons, after 1 year: ○ 40 sources, 70 pipelines, 200 egress points ○ 3,400 datasets / day ● Typical enterprise numbers ○ Big data project: 6-24 months ○ Analytics department: 100-1000 datasets / day ○ Spotify: 100,000+ datasets / day ○ Google: 1.6B datasets / day (2016)
  • 59. Scling - data-value-as-a-service. Data value through collaboration. Diagram labels: Customer, Data factory, Data platform & lake, data domain expertise, Value from data! www.scling.com/reading-list www.scling.com/presentations www.scling.com/courses