www.scling.com
DataOps in practice - Swedish style
Lars Albertsson (@lalleal)
Scling
Who’s talking?
...
Google - video conference, engineering productivity
...
Spotify - data engineering
...
Independent data engineering consultant
Banks, media, startups, heavy industry, telco
Founder @ Scling - data-value-as-a-service
Contents
Journey to DataOps
Experiences that shaped my data engineering
IMHO principles of successful DataOps
Toolbox
● Spotify information is old history
● Previously published
● Today is very different
Spotify data 2007-2013
● Hadoop installed 2007
● Use cases: reporting, insights, recommendations
● Cultural aspects:
○ Autonomous teams
○ Eliminate waste
○ Learn and adapt
Traditional systems
Mutation
Early Hadoop:
● Weak indexing
● No transactions
● Weak security
● Batch transformations
Data lake
Transformation
Cold store
Mutation
Immutable, shareable
DataOps workflows:
● Immutable, shared data
● Resilient to failure
● Quick error recovery
● Low-risk experiments
Data factories
What conclusion from this graph?
COVID-19 fatalities / day in Sweden
Wrong conclusion, every day
● Downward trend every day!
Normalise data collection to compare
Graph by Adam Altmejd, @adamaltmejd
Forecast for analytics with fresh data
Graph by Adam Altmejd, @adamaltmejd
From craft to process
● Multiple time windows
● Assess ingress data quality
● Repair broken data from complementary source
● Intermediate datasets, reusable between pipelines
● Forecast based on history, multiple parameter settings
● Assess outcome data quality
● Assess forecast success, adapt parameters
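The last two steps, forecasting with multiple parameter settings and assessing forecast success, might be sketched as below. This is a minimal Python illustration, not the pipeline behind the graphs: the moving-average forecaster, the function names, and the sample numbers are all assumptions made for the example.

```python
from statistics import mean


def forecast(history, window):
    """Forecast the next value as the mean of the last `window` observations."""
    return mean(history[-window:])


def backtest_error(history, window):
    """Mean absolute error of one-step-ahead forecasts replayed over the history."""
    errors = [
        abs(forecast(history[:i], window) - history[i])
        for i in range(window, len(history))
    ]
    return mean(errors)


def best_window(history, windows):
    """Adapt parameters: keep the setting with the lowest backtest error."""
    return min(windows, key=lambda w: backtest_error(history, w))


daily_counts = [10, 12, 11, 13, 30, 31, 29, 32]  # made-up numbers
window = best_window(daily_counts, windows=[1, 2, 4])
next_value = forecast(daily_counts, window)
```

Because every parameter setting is evaluated against the same immutable history, the assessment can be rerun whenever new data arrives.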
Towards sustainable production ML
● Multiple models, parameters, features
● Assess ingress data quality
● Repair broken data from complementary source
● Choose model and parameters based on performance and input data
● Benchmark models
● Try multiple models, measure, A/B test
Risky operations
How do I test the pipeline?
You temporarily change the
output path and run manually.
Don’t do that.
What if I forget to change path?
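One way to avoid hand-editing the output path is to inject the root location, so a test run cannot write to production by accident. The sketch below is an assumption-laden illustration: `PROD_ROOT`, the `PIPELINE_ROOT` environment variable, and the toy job are hypothetical names, not taken from the talk.

```python
import os
import tempfile
from pathlib import Path

PROD_ROOT = Path("/data/prod")  # hypothetical production location


def output_path(dataset, date, root=None):
    """Build the dataset path from an injected root instead of editing code."""
    root = Path(root or os.environ.get("PIPELINE_ROOT", PROD_ROOT))
    return root / dataset / date


def run_job(date, root=None):
    """Toy job: writes one file to the dated output directory, returns its path."""
    target = output_path("daily_report", date, root)
    target.mkdir(parents=True, exist_ok=True)
    out_file = target / "part-00000.txt"
    out_file.write_text("ok\n")
    return out_file


# In a test, the root is a throwaway directory, so production is never touched.
with tempfile.TemporaryDirectory() as tmp:
    written = run_job("2020-05-01", root=tmp)
    assert str(written).startswith(tmp)
```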
2013
● Teams: Analytics computation (AC), data collection (DC), recommend, reporting (1)
● Folklore development cycle & operations
● Unsatisfied needs in other teams
On-prem Hadoop production
Redundant "edge nodes" with Luigi workers, scheduled with cron. Compute + data in Hadoop.
luigid
Worker
10 * * * * luigi --module mymodule MyDaily
23 * * * * luigi --module other OtherDaily
Master
Executor
Worker
HDFS metadata
Data
Control (+data)
Submit job
10 * ...
23 * ...
Ghost in the cluster
● Jobs were deployed with Debian packages + Puppet on pet machines.
○ Multiple pets for redundancy. Race to run job.
● "This monitor daemon is at 100%. Since 6 months. I'll kill it."
● "Data is wrong. But we fixed this bug 6 months ago?!?"
Start of a DataOps journey
Stateful → Stateless
Pets → Cattle
Folklore → Golden path
Test in prod → Local test, CI/CD
New pipeline: weeks to learn → < 1 day
Bug fix: days to mend → < 1 hour
On-prem pipeline deployment pipeline
source repo: Luigi DSL, jars, config
Standard deployment artifact: my-pipe-7.tar.gz
Standard artifact store
> pip install my-pipe-7.tar.gz
All that a pipeline needs, installed atomically
Luigi daemon
Worker (×8)
Redundant cron schedule, higher frequency
10 * * * * luigi --module mymodule MyDaily
Principle: Functional pipelines
● Raw source of truth + data refinement factory
● Immutable datasets & artifacts
● Deterministic, idempotent, reproducible deployment & processing
● Key success factor: workflow orchestration
○ Oozie, Rambo, Builder, Builder2, Luigi
○ Key properties:
1. Pure Python
2. Simplicity
3. All the features it lacks
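The combination of immutable datasets and idempotent processing could look roughly like this. It is a standard-library sketch under stated assumptions, not Luigi code: `write_dataset_once` and `clean_orders` are hypothetical helpers, and a real setup would also have to handle two writers racing between the existence check and the rename.

```python
import json
import os
import tempfile
from pathlib import Path


def write_dataset_once(path, records):
    """Write `records` to `path` atomically; do nothing if it already exists."""
    path = Path(path)
    if path.exists():
        return False  # immutable dataset: never overwrite
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file first so readers never see a partial dataset.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp_file:
        for record in records:
            tmp_file.write(json.dumps(record, sort_keys=True) + "\n")
    os.replace(tmp_name, path)  # atomic rename within one POSIX filesystem
    return True


def clean_orders(raw_orders):
    """Pure, deterministic transformation: same input always gives same output."""
    return sorted(
        (o for o in raw_orders if o["amount"] > 0),
        key=lambda o: o["id"],
    )
```

Running the same job twice for the same date leaves the first output untouched, which is what makes reruns and retries safe.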
Big data - a collaboration paradigm
Stream storage
Data lake
Data democratised
Enabling teams
● Technically
○ Data available
○ Reusable QA
● Operationally
○ Continuous deployment
○ Hands off operations
○ Monitoring, debugging
● Bottom-up innovation
"The actual work that went into
Discover Weekly was very little,
because we're reusing things we
already had."
https://youtu.be/A259Yo8hBRs
https://youtu.be/ZcmJxli8WS8
Principle: Small scope components
● Do one thing well. Less is more.
● Complex systems from replaceable bricks
○ Cloud/OSS over enterprise vendors
○ Simplicity over features
Solvable challenge
~2000 lines of code
Perpetual complexity
Cloud native deployment
source repo: Luigi DSL, jars, config
Docker image: my-pipe:7
Docker registry
Luigi daemon
Worker (×8)
Redundant cron schedule, higher frequency
kind: CronJob
spec:
  schedule: "10 * * * *"
  command: "luigi --module mymodule MyDaily"
S3 / GCS
Dataproc / EMR
Data platform gravitation
● Hadoop all the things.
● Data is there. Simple test, simple deploy, simple ops.
● Autonomous teams - no mandate. Natural gravity.
Data integration timescales
Online
● SOA / microservices
● Synchronous RPC
● 1-100 ms
Nearline
● Stream storage
● Asynchronous event processing
● 10 ms - 1 hour
Offline
● File storage
● Asynchronous batch processing
● 1 minute -
Operational manoeuvres - online
Upgrade
● Careful rollout
● Risk of user impact
● Proactive QA
Service failure
● User impact
● Data loss
● Cascading outage
Bug
● User impact
● Data corruption
● Cascading corruption
Operational manoeuvres - offline
Upgrade
● Instant rollout
● No user impact
● Reactive QA
Service failure
● Pipeline delay
● No data loss
● No downstream impact
Bug
● Temporary data corruption
● Downstream impact
Life of an error, batch pipelines
● Faulty job, emits bad data
1. Revert serving datasets to old
2. Fix bug
3. Remove faulty datasets
4. Backfill is automatic (Luigi)
Done!
● Low cost of error
○ Reactive QA
○ Production environment sufficient
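Steps 3 and 4 rely on the workflow tool treating a missing dataset as work to be done. The sketch below mimics that idea with the standard library only; it is not Luigi's implementation, and the `_SUCCESS` marker layout and function names are assumptions for illustration.

```python
from datetime import date, timedelta
from pathlib import Path
import tempfile


def dataset_path(root, name, day):
    """Marker file that signals a complete dataset for one day."""
    return Path(root) / name / day.isoformat() / "_SUCCESS"


def missing_days(root, name, start, end):
    """Days in [start, end] whose dataset marker file does not exist."""
    days = (start + timedelta(n) for n in range((end - start).days + 1))
    return [d for d in days if not dataset_path(root, name, d).exists()]


def backfill(root, name, start, end, build):
    """Rebuild every missing day; existing datasets are left untouched."""
    rebuilt = missing_days(root, name, start, end)
    for day in rebuilt:
        build(day)  # run the (fixed) job for that day
        marker = dataset_path(root, name, day)
        marker.parent.mkdir(parents=True, exist_ok=True)
        marker.touch()
    return rebuilt
```

After the faulty datasets are removed, the next scheduled run finds the gaps and fills them without any manual bookkeeping.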
Production critical upgrade
● Dual datasets during transition
● Run downstream parallel pipelines
○ Cheap
○ Low risk
○ Easy rollback
● Testable end-to-end
No dev & staging environment needed!
∆?
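With dual datasets produced side by side, the old and new outputs can be compared directly before switching over. A minimal sketch of such a comparison, assuming each dataset is a list of records with a unique key (the function and field names are hypothetical):

```python
def diff_datasets(old_rows, new_rows, key):
    """Summarise how the new pipeline's output differs from the old one."""
    old = {row[key]: row for row in old_rows}
    new = {row[key]: row for row in new_rows}
    return {
        "only_in_old": sorted(old.keys() - new.keys()),
        "only_in_new": sorted(new.keys() - old.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


old_output = [{"id": 1, "total": 10}, {"id": 2, "total": 20}]
new_output = [{"id": 2, "total": 21}, {"id": 3, "total": 30}]
delta = diff_datasets(old_output, new_output, key="id")
```

An empty or explainable delta is the signal that the downstream pipelines can be pointed at the new dataset; otherwise the old one simply stays in service.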
Operational manoeuvres - nearline
Upgrade
● Swift rollout
● Parallel pipelines
● User impact, QA?
Service failure
● Pipeline delay
● No data loss
● Downstream impact?
Bug
● Data corruption
● Downstream impact
Life of an error, streaming
Reprocessing in Kafka Streams
● Works for a single job, not pipeline. :-(
Data processing tradeoff
Data speed vs. innovation speed
Online, Nearline, Offline
Separating online & offline
● Daily user DB dump. Cassandra can handle the load.
○ Load spike became 25 h long…
● New recommendation model! Cassandra can replicate to all regions.
○ Who saturated the Atlantic link?
● Batch jobs saturate one resource.
○ Bad neighbours.
Batch offline vs online
Raw
Fraud service
Fraud model
Orders
Replication / Backup
Standard procedures
Lightweight procedures
● QA driven by internal efficiency
● Continuous deployment
● New pipeline < 1 day
● Upgrade < 1 hour
● Bug recovery < 1 hour
Careful handover
Data quality dimensions
● Timeliness
○ E.g. the customer engagement report was produced at the expected time
● Correctness
○ The numbers in the reports were calculated correctly
● Completeness
○ The report includes information on all customers, using all information from the whole time period
● Consistency
○ The customer summaries are all based on the same time period
Testing single batch job
Job
Standard Scalatest harness
1. Generate input (file://test_input/)
2. Run in local mode
3. Verify output (file://test_output/)
f() p()
Runs well in CI / from IDE
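The three steps can be illustrated with a toy job. The slide names a Scalatest harness; the sketch below swaps in plain Python with temporary files to stay self-contained, and the job, file names, and sample rows are all invented for the example.

```python
import csv
import tempfile
from pathlib import Path


def count_orders_per_user(input_file, output_file):
    """Toy batch job: reads (user, item) rows, writes (user, order_count)."""
    counts = {}
    with open(input_file, newline="") as f:
        for user, _item in csv.reader(f):
            counts[user] = counts.get(user, 0) + 1
    with open(output_file, "w", newline="") as f:
        csv.writer(f).writerows(sorted(counts.items()))


def run_job_test():
    with tempfile.TemporaryDirectory() as tmp:
        input_file = Path(tmp) / "test_input.csv"
        output_file = Path(tmp) / "test_output.csv"
        # 1. Generate input
        input_file.write_text("alice,book\nbob,pen\nalice,lamp\n")
        # 2. Run the job in local mode
        count_orders_per_user(input_file, output_file)
        # 3. Verify output (returned here so the caller can assert on it)
        with open(output_file, newline="") as f:
            return list(csv.reader(f))
```

The job only sees file paths, so the same code runs unchanged against local test files and production storage.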
Testing batch pipelines - two options
Standard Scalatest harness
1. Generate input (file://test_input/)
2. Run custom multi-job
3. Verify output (file://test_output/)
f() p()
A: Test job with sequence of jobs
B: Customised workflow manager setup
Monitoring timeliness, examples
● Datamon - Spotify internal
● Twitter Ambrose (dead?)
● Airflow
Measuring correctness: counters
● Processing tool (Spark/Hadoop) counters
○ Odd code path => bump counter
○ System metrics
Hadoop / Spark counters
DB
Standard graphing tools
Standard alerting service
Measuring correctness: counters
● User-defined
● Technical from framework
○ Execution time
○ Memory consumption
○ Data volumes
○ ...
case class Order(item: ItemId, userId: UserId)
case class User(id: UserId, country: String)

val orders = read(orderPath)
val users = read(userPath)

// Counts orders whose user is missing from the user dataset
val orderNoUserCounter = longAccumulator("order-no-user")

val joined: C[(Order, Option[User])] = orders
  .groupBy(_.userId)
  .leftJoin(users.groupBy(_.id))
  .values

val orderWithUser: C[(Order, User)] = joined
  .flatMap {
    case (order, Some(user)) => Some((order, user))
    case (order, None) =>
      // Odd code path: bump the counter and drop the record
      orderNoUserCounter.add(1)
      None
  }
SQL: Nope
Data quality - high code vs low code
● 2013: Python MapReduce outdated
● Hive/SQL?
○ Not expressive enough
○ Data quality challenging
● Technical platform + multi-skilled teams!
○ Strong development processes
Low code / no code platform? Technical platform?
Measuring consistency: pipelines
● Processing tool (Spark/Hadoop) counters
○ Odd code path => bump counter
○ System metrics
● Dedicated quality assessment pipelines
DB
Quality assessment job
Quality metadataset (tiny)
Standard graphing tools
Standard alerting service
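A dedicated quality assessment job reads the produced datasets and emits a tiny metadataset that graphing and alerting tools can consume. The following is a rough sketch only: the metric names, the record fields, and the 1% threshold are assumptions, not values from the talk.

```python
def assess_quality(orders, users):
    """Quality assessment job: reduce two datasets to a few quality metrics."""
    user_ids = {u["id"] for u in users}
    total = len(orders)
    orphans = sum(1 for o in orders if o["user_id"] not in user_ids)
    return {
        "order_count": total,
        "orders_without_user": orphans,
        "orphan_ratio": orphans / total if total else 0.0,
    }


def should_alert(metrics, max_orphan_ratio=0.01):
    """Alerting rule applied to the quality metadataset."""
    return metrics["orphan_ratio"] > max_orphan_ratio
```

Keeping the assessment in its own job means it can compare several datasets for consistency without slowing down or complicating the jobs that produce them.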
Machine learning operations, simplified
● Multiple trained models
○ Select at run time
● Measure user behaviour
○ E.g. session length, engagement, funnel
● Ready to revert to
○ old models
○ simpler models
Measure interactions
Rendezvous
DB
Standard alerting service
Stream Job
"The required surrounding
infrastructure is vast and
complex."
- Google
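Selecting a model at run time while staying ready to revert might be sketched as follows. The model names, the metric, and the threshold are hypothetical; the point is only that the fallback to an old or simpler model is an ordinary code path rather than an emergency procedure.

```python
def select_model(candidates, metrics, fallback, min_score):
    """Pick the best-scoring candidate, or revert to the fallback model."""
    scored = [(metrics[name], name) for name in candidates if name in metrics]
    if not scored:
        return fallback  # no measurements yet: stay on the known-good model
    best_score, best_name = max(scored)
    return best_name if best_score >= min_score else fallback


# Hypothetical engagement scores measured from user behaviour.
engagement = {"deep": 0.8, "linear": 0.6}
chosen = select_model(["deep", "linear"], engagement, fallback="popularity", min_score=0.5)
```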
Not all things went well
● Autonomy → excessive heterogeneity
○ 25 ways to store a timestamp?
● Pipeline end-to-end tests
○ Culturally challenging
○ → difficult to change & retire pipelines
● Trial and error to learn
Data engineering in Scandinavia
● Stockholm region ranks 2nd in unicorns / capita
○ Media, games, fintech
● Critical mass of world class data engineering
○ Limited to a few companies
Mission: Spread data & AI superpowers
● There are companies to help
● Data & AI capabilities require culture & process change
○ Slow, very slow
Scandinavian minimalist design
● Lean, simple technology - focus on flow and business value
● Bonnier News data platform, 4-5 persons:
○ Zero to happy customer in 3 weeks.
○ Dozens of ROI pipelines in 8 months.
● Scling retail client, 1-3 persons, after 1 year:
○ 40 sources, 70 pipelines, 200 egress points
○ 3,400 datasets / day
● Typical enterprise numbers
○ Big data project: 6-24 months
○ Analytics department: 100-1000 datasets / day
○ Spotify: 100,000+ datasets / day
○ Google: 1.6B datasets / day (2016)
Scling - data-value-as-a-service
Data value through collaboration
Customer
Data factory
Data platform & lake
data
domain expertise
Value from data!
www.scling.com/reading-list
www.scling.com/presentations
www.scling.com/courses

More Related Content

What's hot

Kubernetes as data platform
Kubernetes as data platformKubernetes as data platform
Kubernetes as data platform
Lars Albertsson
 
Engineering data quality
Engineering data qualityEngineering data quality
Engineering data quality
Lars Albertsson
 
The lean principles of data ops
The lean principles of data opsThe lean principles of data ops
The lean principles of data ops
Lars Albertsson
 
10 ways to stumble with big data
10 ways to stumble with big data10 ways to stumble with big data
10 ways to stumble with big data
Lars Albertsson
 
Protecting privacy in practice
Protecting privacy in practiceProtecting privacy in practice
Protecting privacy in practice
Lars Albertsson
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solid
Lars Albertsson
 
Big Data Monitoring Cockpit
Big Data Monitoring CockpitBig Data Monitoring Cockpit
Big Data Monitoring Cockpit
Stefan Bergstein
 
Open Data Science Conference Agile Data
Open Data Science Conference Agile DataOpen Data Science Conference Agile Data
Open Data Science Conference Agile Data
DataKitchen
 
Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015
Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015
Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015
Institute e-Austria Timisoara
 
Offload, Transform, and Present - The New World of Data Integration
Offload, Transform, and Present - The New World of Data IntegrationOffload, Transform, and Present - The New World of Data Integration
Offload, Transform, and Present - The New World of Data Integration
gluent.
 
Building Reactive Real-time Data Pipeline
Building Reactive Real-time Data PipelineBuilding Reactive Real-time Data Pipeline
Building Reactive Real-time Data Pipeline
Trieu Nguyen
 
Testing the Data Warehouse—Big Data, Big Problems
Testing the Data Warehouse—Big Data, Big ProblemsTesting the Data Warehouse—Big Data, Big Problems
Testing the Data Warehouse—Big Data, Big Problems
TechWell
 
Data Science and Enterprise Engineering with Michael Finger and Chris Robison
Data Science and Enterprise Engineering with Michael Finger and Chris RobisonData Science and Enterprise Engineering with Michael Finger and Chris Robison
Data Science and Enterprise Engineering with Michael Finger and Chris Robison
Databricks
 
Testing data streaming applications
Testing data streaming applicationsTesting data streaming applications
Testing data streaming applications
Lars Albertsson
 
How to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcpHow to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcp
Joseph Arriola
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
smallerror
 
Neo4j-Databridge: Enterprise-scale ETL for Neo4j
Neo4j-Databridge: Enterprise-scale ETL for Neo4jNeo4j-Databridge: Enterprise-scale ETL for Neo4j
Neo4j-Databridge: Enterprise-scale ETL for Neo4j
Neo4j
 
H2O AutoML roadmap - Ray Peck
H2O AutoML roadmap - Ray PeckH2O AutoML roadmap - Ray Peck
H2O AutoML roadmap - Ray Peck
Sri Ambati
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learning
Rajesh Muppalla
 
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
DataKitchen
 

What's hot (20)

Kubernetes as data platform
Kubernetes as data platformKubernetes as data platform
Kubernetes as data platform
 
Engineering data quality
Engineering data qualityEngineering data quality
Engineering data quality
 
The lean principles of data ops
The lean principles of data opsThe lean principles of data ops
The lean principles of data ops
 
10 ways to stumble with big data
10 ways to stumble with big data10 ways to stumble with big data
10 ways to stumble with big data
 
Protecting privacy in practice
Protecting privacy in practiceProtecting privacy in practice
Protecting privacy in practice
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solid
 
Big Data Monitoring Cockpit
Big Data Monitoring CockpitBig Data Monitoring Cockpit
Big Data Monitoring Cockpit
 
Open Data Science Conference Agile Data
Open Data Science Conference Agile DataOpen Data Science Conference Agile Data
Open Data Science Conference Agile Data
 
Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015
Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015
Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015
 
Offload, Transform, and Present - The New World of Data Integration
Offload, Transform, and Present - The New World of Data IntegrationOffload, Transform, and Present - The New World of Data Integration
Offload, Transform, and Present - The New World of Data Integration
 
Building Reactive Real-time Data Pipeline
Building Reactive Real-time Data PipelineBuilding Reactive Real-time Data Pipeline
Building Reactive Real-time Data Pipeline
 
Testing the Data Warehouse—Big Data, Big Problems
Testing the Data Warehouse—Big Data, Big ProblemsTesting the Data Warehouse—Big Data, Big Problems
Testing the Data Warehouse—Big Data, Big Problems
 
Data Science and Enterprise Engineering with Michael Finger and Chris Robison
Data Science and Enterprise Engineering with Michael Finger and Chris RobisonData Science and Enterprise Engineering with Michael Finger and Chris Robison
Data Science and Enterprise Engineering with Michael Finger and Chris Robison
 
Testing data streaming applications
Testing data streaming applicationsTesting data streaming applications
Testing data streaming applications
 
How to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcpHow to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcp
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
 
Neo4j-Databridge: Enterprise-scale ETL for Neo4j
Neo4j-Databridge: Enterprise-scale ETL for Neo4jNeo4j-Databridge: Enterprise-scale ETL for Neo4j
Neo4j-Databridge: Enterprise-scale ETL for Neo4j
 
H2O AutoML roadmap - Ray Peck
H2O AutoML roadmap - Ray PeckH2O AutoML roadmap - Ray Peck
H2O AutoML roadmap - Ray Peck
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learning
 
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
 

Similar to Data ops in practice - Swedish style

Holistic data application quality
Holistic data application qualityHolistic data application quality
Holistic data application quality
Lars Albertsson
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 
Secure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetSecure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budget
Lars Albertsson
 
Data engineering in 10 years.pdf
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdf
Lars Albertsson
 
Crossing the data divide
Crossing the data divideCrossing the data divide
Crossing the data divide
Lars Albertsson
 
About VisualDNA Architecture @ Rubyslava 2014
About VisualDNA Architecture @ Rubyslava 2014About VisualDNA Architecture @ Rubyslava 2014
About VisualDNA Architecture @ Rubyslava 2014
Michal Harish
 
Schema management with Scalameta
Schema management with ScalametaSchema management with Scalameta
Schema management with Scalameta
Lars Albertsson
 
Test strategies for data processing pipelines
Test strategies for data processing pipelinesTest strategies for data processing pipelines
Test strategies for data processing pipelines
Lars Albertsson
 
Workflow Engines + Luigi
Workflow Engines + LuigiWorkflow Engines + Luigi
Workflow Engines + Luigi
Vladislav Supalov
 
Thinking DevOps in the Era of the Cloud - Demi Ben-Ari
Thinking DevOps in the Era of the Cloud - Demi Ben-AriThinking DevOps in the Era of the Cloud - Demi Ben-Ari
Thinking DevOps in the Era of the Cloud - Demi Ben-Ari
Demi Ben-Ari
 
An Analytics Engineer’s Guide to Streaming With Amy Chen | Current 2022
An Analytics Engineer’s Guide to Streaming With Amy Chen | Current 2022An Analytics Engineer’s Guide to Streaming With Amy Chen | Current 2022
An Analytics Engineer’s Guide to Streaming With Amy Chen | Current 2022
HostedbyConfluent
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
C4Media
 
The of Operational Analytics Data Store
The of Operational Analytics Data StoreThe of Operational Analytics Data Store
The of Operational Analytics Data Store
Rommel Garcia
 
Introduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OK
Kriangkrai Chaonithi
 
Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020
Julien Le Dem
 
Managing Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingManaging Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic Optimizing
Databricks
 
Clearing Airflow Obstructions
Clearing Airflow ObstructionsClearing Airflow Obstructions
Clearing Airflow Obstructions
Tatiana Al-Chueyr
 
Data Platform in the Cloud
Data Platform in the CloudData Platform in the Cloud
Data Platform in the Cloud
Amihay Zer-Kavod
 
Cognos Performance Tuning Tips & Tricks
Cognos Performance Tuning Tips & TricksCognos Performance Tuning Tips & Tricks
Cognos Performance Tuning Tips & Tricks
Senturus
 
OpenFlow @ Google
OpenFlow @ GoogleOpenFlow @ Google
OpenFlow @ Google
Open Networking Summits
 

Similar to Data ops in practice - Swedish style (20)

Holistic data application quality
Holistic data application qualityHolistic data application quality
Holistic data application quality
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
 
Secure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetSecure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budget
 
Data engineering in 10 years.pdf
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdf
 
Crossing the data divide
Crossing the data divideCrossing the data divide
Crossing the data divide
 
About VisualDNA Architecture @ Rubyslava 2014
About VisualDNA Architecture @ Rubyslava 2014About VisualDNA Architecture @ Rubyslava 2014
About VisualDNA Architecture @ Rubyslava 2014
 
Schema management with Scalameta
Schema management with ScalametaSchema management with Scalameta
Schema management with Scalameta
 
Test strategies for data processing pipelines
Test strategies for data processing pipelinesTest strategies for data processing pipelines
Test strategies for data processing pipelines
 
Workflow Engines + Luigi
Workflow Engines + LuigiWorkflow Engines + Luigi
Workflow Engines + Luigi
 
Thinking DevOps in the Era of the Cloud - Demi Ben-Ari
Thinking DevOps in the Era of the Cloud - Demi Ben-AriThinking DevOps in the Era of the Cloud - Demi Ben-Ari
Thinking DevOps in the Era of the Cloud - Demi Ben-Ari
 
An Analytics Engineer’s Guide to Streaming With Amy Chen | Current 2022
An Analytics Engineer’s Guide to Streaming With Amy Chen | Current 2022An Analytics Engineer’s Guide to Streaming With Amy Chen | Current 2022
An Analytics Engineer’s Guide to Streaming With Amy Chen | Current 2022
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
 
The of Operational Analytics Data Store
The of Operational Analytics Data StoreThe of Operational Analytics Data Store
The of Operational Analytics Data Store
 
Introduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OK
 
Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020
 
Managing Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingManaging Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic Optimizing
 
Clearing Airflow Obstructions
Clearing Airflow ObstructionsClearing Airflow Obstructions
Clearing Airflow Obstructions
 
Data Platform in the Cloud
Data Platform in the CloudData Platform in the Cloud
Data Platform in the Cloud
 
Cognos Performance Tuning Tips & Tricks
Cognos Performance Tuning Tips & TricksCognos Performance Tuning Tips & Tricks
Cognos Performance Tuning Tips & Tricks
 
OpenFlow @ Google
OpenFlow @ GoogleOpenFlow @ Google
OpenFlow @ Google
 

More from Lars Albertsson

Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
Lars Albertsson
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
Lars Albertsson
 
How to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdfHow to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdf
Lars Albertsson
 
The 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdfThe 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdf
Lars Albertsson
 
Ai legal and ethics
Ai   legal and ethicsAi   legal and ethics
Ai legal and ethics
Lars Albertsson
 
Eventually, time will kill your data pipeline
Eventually, time will kill your data pipelineEventually, time will kill your data pipeline
Eventually, time will kill your data pipeline
Lars Albertsson
 
Big data == lean data
Big data == lean dataBig data == lean data
Big data == lean data
Lars Albertsson
 
Privacy by design
Privacy by designPrivacy by design
Privacy by design
Lars Albertsson
 
Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0
Lars Albertsson
 
A primer on building real time data-driven products
A primer on building real time data-driven productsA primer on building real time data-driven products
A primer on building real time data-driven products
Lars Albertsson
 


Data ops in practice - Swedish style

  • 1. www.scling.com DataOps in practice - Swedish style. Lars Albertsson (@lalleal), Scling
  • 2. Who’s talking? ... Google - video conference, engineering productivity ... Spotify - data engineering ... Independent data engineering consultant: banks, media, startups, heavy industry, telco. Founder @ Scling - data-value-as-a-service
  • 3. Contents: Journey to DataOps. Experiences that shaped my data engineering. IMHO principles of successful DataOps. Toolbox. ● Spotify information is old history ● Previously published ● Today is very different
  • 4. Spotify data 2007-2013 ● Hadoop installed 2007 ● Use cases: reporting, insights, recommendations ● Cultural aspects: ○ Autonomous teams ○ Eliminate waste ○ Learn and adapt
  • 5. Traditional systems. Mutation. Early Hadoop: ● Weak indexing ● No transactions ● Weak security ● Batch transformations
  • 6. Data lake. Transformation. Cold store. Mutation. Immutable, shareable. Early Hadoop: ● Weak indexing ● No transactions ● Weak security ● Batch transformations. DataOps workflows: ● Immutable, shared data ● Resilient to failure ● Quick error recovery ● Low-risk experiments. Data factories
  • 7. What conclusion from this graph? COVID-19 fatalities / day in Sweden
  • 8. Wrong conclusion, every day ● Downward trend every day!
  • 9. Normalise data collection to compare. Graph by Adam Altmejd, @adamaltmejd
  • 10. Normalise data collection to compare. Graph by Adam Altmejd, @adamaltmejd
  • 11. Forecast for analytics with fresh data. Graph by Adam Altmejd, @adamaltmejd
  • 13. From craft to process: Multiple time windows
  • 14. From craft to process: Multiple time windows; Assess ingress data quality
  • 15. From craft to process: Multiple time windows; Assess ingress data quality; Assess outcome data quality
  • 16. From craft to process: Multiple time windows; Assess ingress data quality; Assess outcome data quality; Repair broken data; Intermediate datasets, reusable between pipelines
  • 17. From craft to process: Multiple time windows; Assess ingress data quality; Repair broken data from complementary source; Assess outcome data quality
  • 18. From craft to process: Multiple time windows; Assess ingress data quality; Repair broken data from complementary source; Forecast based on history; Assess outcome data quality
  • 19. From craft to process: Multiple time windows; Assess ingress data quality; Repair broken data from complementary source; Forecast based on history, multiple parameter settings; Assess outcome data quality
  • 20. From craft to process: Multiple time windows; Assess ingress data quality; Repair broken data from complementary source; Forecast based on history, multiple parameter settings; Assess outcome data quality; Assess forecast success, adapt parameters
  • 21. Towards sustainable production ML: Multiple models, parameters, features; Assess ingress data quality; Repair broken data from complementary source; Choose model and parameters based on performance and input data; Benchmark models; Try multiple models, measure, A/B test
  • 22. Risky operations. How do I test the pipeline? You temporarily change the output path and run manually. Don’t do that. What if I forget to change path?
  • 23. 2013 ● Teams: Analytics computation (AC), data collection (DC), recommend, reporting (1) ● Folklore development cycle & operations ● Unsatisfied needs in other teams
  • 24. On-prem Hadoop production. Redundant "edge nodes" with Luigi workers, scheduled with cron. Compute + data in Hadoop. Cron entries on the workers: 10 * * * * luigi --module mymodule MyDaily; 23 * * * * luigi --module other OtherDaily. Diagram labels: luigid, Worker, Master, Executor, HDFS metadata, Data, Control (+data), Submit job
  • 25. Ghost in the cluster ● Jobs were deployed with Debian packages + Puppet on pet machines. ○ Multiple pets for redundancy. Race to run job. ● "This monitor daemon is at 100%. Since 6 months. I'll kill it." ● "Data is wrong. But we fixed this bug 6 months ago?!?"
  • 26. Start of a DataOps journey: Stateful → Stateless; Pets → Cattle; Folklore → Golden path; Test in prod → Local test, CI/CD; Weeks to learn → New pipeline < 1 day; Days to mend → Bug fix < 1 hour
  • 27. On-prem pipeline deployment pipeline. Source repo: Luigi DSL, jars, config. Standard deployment artifact: my-pipe-7.tar.gz, kept in a standard artifact store. Workers run "> pip install my-pipe-7.tar.gz": all that a pipeline needs, installed atomically. Redundant cron schedule, higher frequency: 10 * * * * luigi --module mymodule MyDaily. Diagram labels: Luigi daemon, Worker
  • 28. Principle: Functional pipelines ● Raw source of truth + data refinement factory ● Immutable datasets & artifacts ● Deterministic, idempotent, reproducible deployment & processing ● Key success factor: workflow orchestration ○ Oozie, Rambo, Builder, Builder2, Luigi ○ Key properties: 1. Pure Python 2. Simplicity 3. All the features it lacks
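The "deterministic, idempotent" property on slide 28 can be illustrated with a small sketch. This is not Luigi's API; `run_task` and its arguments are hypothetical names, and the example assumes datasets are plain JSON files on a local file system. The task does nothing if its output already exists, and it writes through a temporary file so a crash never leaves a partial dataset at the final path.

```python
import json
import os
import tempfile
from pathlib import Path


def run_task(output: Path, compute) -> bool:
    """Produce `output` unless it already exists. Returns True if work was done."""
    if output.exists():
        # Idempotent: an existing dataset is never recomputed or mutated.
        return False
    output.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file in the same directory, then rename, so a
    # crashed run never leaves a half-written dataset at the final path.
    fd, tmp_name = tempfile.mkstemp(dir=output.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(compute(), tmp, sort_keys=True)
    os.replace(tmp_name, output)
    return True
```

Because a second invocation is a no-op, several redundant schedulers can trigger the same task without corrupting the dataset, which is roughly the situation the redundant cron schedule creates.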
  • 29. Big data - a collaboration paradigm. Stream storage. Data lake. Data democratised
  • 30. Enabling teams ● Technically ○ Data available ○ Reusable QA ● Operationally ○ Continuous deployment ○ Hands off operations ○ Monitoring, debugging ● Bottom-up innovation. "The actual work that went into Discover Weekly was very little, because we're reusing things we already had." https://youtu.be/A259Yo8hBRs https://youtu.be/ZcmJxli8WS8
  • 31. Principle: Small scope components ● Do one thing well. Less is more. ● Complex systems from replaceable bricks ○ Cloud/OSS over enterprise vendors ○ Simplicity over features. Diagram labels: Solvable challenge, ~2000 lines of code, Perpetual complexity
  • 32. Cloud native deployment. Source repo: Luigi DSL, jars, config. Docker image: my-pipe:7, kept in a Docker registry. Redundant cron schedule, higher frequency: kind: CronJob spec: schedule: "10 * * * *" command: "luigi --module mymodule MyDaily". Diagram labels: Luigi daemon, Worker, S3 / GCS, Dataproc / EMR
  • 33. Data platform gravitation ● Hadoop all the things. ● Data is there. Simple test, simple deploy, simple ops. ● Autonomous teams - no mandate. Natural gravity.
  • 34. Data integration timescales. Online ● SOA / microservices ● Synchronous RPC ● 1-100 ms. Nearline ● Stream storage ● Asynchronous event processing ● 10 ms - 1 hour. Offline ● File storage ● Asynchronous batch processing ● 1 minute -
  • 35. Operational manoeuvres - online. Upgrade ● Careful rollout ● Risk of user impact ● Proactive QA
  • 36. Operational manoeuvres - online. Upgrade ● Careful rollout ● Risk of user impact ● Proactive QA. Service failure ● User impact ● Data loss ● Cascading outage
  • 37. Operational manoeuvres - online. Upgrade ● Careful rollout ● Risk of user impact ● Proactive QA. Service failure ● User impact ● Data loss ● Cascading outage. Bug ● User impact ● Data corruption ● Cascading corruption
  • 38. Operational manoeuvres - offline. Upgrade ● Instant rollout ● No user impact ● Reactive QA. Service failure ● Pipeline delay ● No data loss ● No downstream impact. Bug ● Temporary data corruption ● Downstream impact
  • 39. Life of an error, batch pipelines ● Faulty job, emits bad data 1. Revert serving datasets to old 2. Fix bug 3. Remove faulty datasets 4. Backfill is automatic (Luigi) Done! ● Low cost of error ○ Reactive QA ○ Production environment sufficient
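Steps 3 and 4 on slide 39 rely on the scheduler rebuilding whatever dated dataset is missing. The sketch below imitates that behaviour with the standard library only; it is an assumption-laden illustration, not how Luigi is implemented. The names `missing_dates` and `backfill`, and the one-file-per-day layout, are hypothetical.

```python
from datetime import date, timedelta
from pathlib import Path


def missing_dates(out_dir: Path, first: date, last: date) -> list[date]:
    """Dates in [first, last] that have no dataset file in out_dir."""
    days = (last - first).days + 1
    wanted = [first + timedelta(days=i) for i in range(days)]
    return [d for d in wanted if not (out_dir / f"{d.isoformat()}.txt").exists()]


def backfill(out_dir: Path, first: date, last: date, job) -> list[date]:
    """Run job(day) for every missing date; returns the dates that were built."""
    out_dir.mkdir(parents=True, exist_ok=True)
    built = []
    for day in missing_dates(out_dir, first, last):
        (out_dir / f"{day.isoformat()}.txt").write_text(job(day))
        built.append(day)
    return built
```

With this shape, recovering from a bad run amounts to deleting the faulty files and letting the next scheduled invocation fill the gaps with the fixed job; untouched datasets are left alone.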
  • 40. Production critical upgrade ● Dual datasets during transition ● Run downstream parallel pipelines ○ Cheap ○ Low risk ○ Easy rollback ● Testable end-to-end. No dev & staging environment needed! ∆?
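The "∆?" on slide 40 suggests comparing the output of the old and the new pipeline while both run in parallel. One possible shape for such a comparison is sketched below, assuming each dataset fits in memory as a mapping from record key to record; `dataset_delta` is a hypothetical helper, not something named in the talk.

```python
def dataset_delta(old: dict, new: dict) -> dict:
    """Summarise differences between two keyed datasets (key -> record)."""
    shared = old.keys() & new.keys()
    return {
        "only_old": sorted(old.keys() - new.keys()),
        "only_new": sorted(new.keys() - old.keys()),
        "changed": sorted(k for k in shared if old[k] != new[k]),
    }
```

An empty delta, or one that contains only the intended changes, is the signal to switch consumers over to the new dataset; otherwise the old one keeps serving and rollback costs nothing.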
  • 41. Operational manoeuvres - nearline. Upgrade ● Swift rollout ● Parallel pipelines ● User impact, QA? Service failure ● Pipeline delay ● No data loss ● Downstream impact? Bug ● Data corruption ● Downstream impact
  • 42. Life of an error, streaming. Reprocessing in Kafka Streams ● Works for a single job, not pipeline. :-(
  • 43. Data processing tradeoff. Diagram labels: Data speed, Innovation speed, Online, Nearline, Offline
  • 44. Separating online & offline ● Daily user DB dump. Cassandra can handle the load. ○ Load spike became 25 h long… ● New recommendation model! Cassandra can replicate to all regions. ○ Who saturated the Atlantic link? ● Batch jobs saturate one resource. ○ Bad neighbours.
  • 45. Batch offline vs online. Diagram labels: Raw, Orders, Fraud model, Fraud service, Replication / Backup, Careful handover, Standard procedures. Lightweight procedures ● QA driven by internal efficiency ● Continuous deployment ● New pipeline < 1 day ● Upgrade < 1 hour ● Bug recovery < 1 hour
  • 46. Data quality dimensions ● Timeliness ○ E.g. the customer engagement report was produced at the expected time ● Correctness ○ The numbers in the reports were calculated correctly ● Completeness ○ The report includes information on all customers, using all information from the whole time period ● Consistency ○ The customer summaries are all based on the same time period
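Three of the four dimensions on slide 46 can be checked mechanically from dataset metadata. The sketch below is one way to do that, under the assumption that a report is a list of per-customer summaries tagged with the period they cover; `Summary` and `quality_report` are hypothetical names. Correctness is left out on purpose, since it usually needs tests or counters rather than a metadata check.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Summary:
    customer_id: str
    period: str  # e.g. "2020-05"


def quality_report(summaries, expected_customers, produced_at, deadline):
    """Evaluate timeliness, completeness and consistency of one report."""
    present = {s.customer_id for s in summaries}
    periods = {s.period for s in summaries}
    return {
        # Timeliness: the report was produced at the expected time.
        "timely": produced_at <= deadline,
        # Completeness: every expected customer is covered.
        "complete": set(expected_customers) <= present,
        # Consistency: all summaries are based on the same time period.
        "consistent": len(periods) <= 1,
    }
```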
  • 47. Testing single batch job. Standard Scalatest harness around the job, with file://test_input/ and file://test_output/: 1. Generate input 2. Run in local mode 3. Verify output. Diagram labels: Job, f(), p(). Runs well in CI / from IDE
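The slide uses a Scalatest harness; the same three steps translate directly to other languages. Below is a minimal Python sketch of the pattern, with a made-up job (`count_by_country`) standing in for the real batch job and temporary local files standing in for file://test_input/ and file://test_output/.

```python
import tempfile
from pathlib import Path


def count_by_country(input_path: Path, output_path: Path) -> None:
    """Hypothetical job under test: counts input lines per country code."""
    counts = {}
    for line in input_path.read_text().splitlines():
        _user, country = line.split(",")
        counts[country] = counts.get(country, 0) + 1
    rows = [f"{c},{n}" for c, n in sorted(counts.items())]
    output_path.write_text("\n".join(rows) + "\n")


def test_count_by_country() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        test_input = Path(tmp) / "test_input.csv"
        test_output = Path(tmp) / "test_output.csv"
        # 1. Generate input
        test_input.write_text("u1,SE\nu2,SE\nu3,NO\n")
        # 2. Run the job in local mode
        count_by_country(test_input, test_output)
        # 3. Verify output
        assert test_output.read_text() == "NO,1\nSE,2\n"
```

The job only sees paths, so the same code runs unchanged against test files in CI and against production storage.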
  • 48. Testing batch pipelines - two options, labelled A and B in the diagram: a standard Scalatest harness (file://test_input/, file://test_output/; 1. Generate input 2. Run custom multi-job: test job with sequence of jobs 3. Verify output) and a customised workflow manager setup. Diagram labels: f(), p()
  • 49. Monitoring timeliness, examples ● Datamon - Spotify internal ● Twitter Ambrose (dead?) ● Airflow
  • 50. Measuring correctness: counters ● Processing tool (Spark/Hadoop) counters ○ Odd code path => bump counter ○ System metrics. Diagram labels: Hadoop / Spark counters, DB, Standard graphing tools, Standard alerting service
  • 51. Measuring correctness: counters ● User-defined ● Technical from framework ○ Execution time ○ Memory consumption ○ Data volumes ○ ... SQL: Nope. Slide code (Scala-style pseudocode):
      case class Order(item: ItemId, userId: UserId)
      case class User(id: UserId, country: String)
      val orders = read(orderPath)
      val users = read(userPath)
      // Counts orders whose user is missing from the user dataset
      val orderNoUserCounter = longAccumulator("order-no-user")
      val joined: C[(Order, Option[User])] = orders
        .groupBy(_.userId)
        .leftJoin(users.groupBy(_.id))
        .values
      val orderWithUser: C[(Order, User)] = joined
        .flatMap {
          case (order, Some(user)) => Some((order, user))
          case (order, None) =>
            orderNoUserCounter.add(1) // odd code path => bump counter
            None
        }
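For readers who do not use Scala, the counter pattern on slide 51 can be sketched in plain Python. This is an in-memory illustration, not Spark code: records are assumed to be dictionaries with `user_id` and `id` fields, and `collections.Counter` stands in for the framework's accumulator.

```python
from collections import Counter


def join_orders_with_users(orders, users, counters):
    """Inner-join orders to users, counting orders that have no matching user."""
    users_by_id = {u["id"]: u for u in users}
    joined = []
    for order in orders:
        user = users_by_id.get(order["user_id"])
        if user is None:
            # Odd code path: bump a counter instead of failing or dropping silently.
            counters["order-no-user"] += 1
            continue
        joined.append((order, user))
    return joined
```

The counter value would then be exported to a metrics store after the job finishes, where standard graphing and alerting tools can watch it.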
  • 52. Data quality - high code vs low code ● 2013: Python MapReduce outdated ● Hive/SQL? ○ Not expressive enough ○ Data quality challenging ● Technical platform + multi-skilled teams! ○ Strong development processes. Low code / no code platform? Technical platform?
  • 53. Measuring consistency: pipelines ● Processing tool (Spark/Hadoop) counters ○ Odd code path => bump counter ○ System metrics ● Dedicated quality assessment pipelines. Diagram labels: DB, Quality assessment job, Quality metadataset (tiny), Standard graphing tools, Standard alerting service
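A dedicated quality assessment job, as on slide 53, reads a dataset and emits a tiny metadataset that alerting can act on. The sketch below shows one possible form, assuming records are dictionaries; the function names, the choice of "missing field" as the metric, and the threshold parameter are all illustrative assumptions.

```python
def assess_quality(records, required_fields):
    """Quality assessment job: reduce a dataset to a tiny quality metadataset."""
    missing = {
        field: sum(1 for r in records if r.get(field) in (None, ""))
        for field in required_fields
    }
    return {"record_count": len(records), "missing": missing}


def alerts(metrics, max_missing_ratio):
    """Turn the quality metadataset into alert messages for an alerting service."""
    total = metrics["record_count"]
    if total == 0:
        return ["empty dataset"]
    return [
        f"{field}: {n}/{total} missing"
        for field, n in sorted(metrics["missing"].items())
        if n / total > max_missing_ratio
    ]
```

Keeping the assessment in its own job means the metrics are themselves an immutable dataset that can be graphed over time and recomputed if the rules change.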
  • 54. Machine learning operations, simplified ● Multiple trained models ○ Select at run time ● Measure user behaviour ○ E.g. session length, engagement, funnel ● Ready to revert to ○ old models ○ simpler models. Diagram labels: Measure interactions, Rendezvous, DB, Standard alerting service, Stream, Job. "The required surrounding infrastructure is vast and complex." - Google
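Selecting a model at run time while staying ready to revert can be reduced to a small decision function. The sketch below assumes each candidate model has a single measured score (for example an engagement metric), which is a simplification; `select_model`, its parameters and the threshold are hypothetical.

```python
def select_model(candidates, metrics, baseline, min_score):
    """Pick the best-scoring candidate, or fall back to the baseline model."""
    scored = [(metrics[name], name) for name in candidates if name in metrics]
    if not scored:
        # No measurements yet: stay on the old or simpler model.
        return baseline
    best_score, best_name = max(scored)
    return best_name if best_score >= min_score else baseline
```

Running this on every scoring batch keeps the revert path exercised all the time, rather than only during incidents.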
  • 55. Not all things went well ● Autonomy → excessive heterogeneity ○ 25 ways to store a timestamp? ● Pipeline end-to-end tests ○ Culturally challenging ○ → difficult to change & retire pipelines ● Trial and error to learn
  • 56. Data engineering in Scandinavia ● Stockholm region ranks 2nd in unicorns / capita ○ Media, games, fintech ● Critical mass of world class data engineering ○ Limited to a few companies
  • 57. Mission: Spread data & AI superpowers ● There are companies to help ● Data & AI capabilities require culture & process change ○ Slow, very slow
  • 58. Scandinavian minimalist design ● Lean, simple technology - focus on flow and business value ● Bonnier News data platform, 4-5 persons: ○ Zero to happy customer in 3 weeks. ○ Dozens of ROI pipelines in 8 months. ● Scling retail client, 1-3 persons, after 1 year: ○ 40 sources, 70 pipelines, 200 egress points ○ 3,400 datasets / day ● Typical enterprise numbers ○ Big data project: 6-24 months ○ Analytics department: 100-1000 datasets / day ○ Spotify: 100,000+ datasets / day ○ Google: 1.6B datasets / day (2016)
  • 59. Scling - data-value-as-a-service. Data value through collaboration. Diagram labels: Customer, Data factory, Data platform & lake, data domain expertise, Value from data! www.scling.com/reading-list www.scling.com/presentations www.scling.com/courses