All the DataOps, all the paradigms

www.scling.com
All the DataOps,
all the paradigms
Lars Albertsson,
independent data engineer
Berlin Buzzwords, 2025-06-17

Berlin Buzzwords 2014
● "Cutting Hadoop developer cycle time"
○ → "Democratising data @ Spotify"

Enabling innovation
"The actual work that went into Discover Weekly was very little, because we're reusing things we already had."
https://youtu.be/A259Yo8hBRs
https://youtu.be/ZcmJxli8WS8
https://musically.com/2018/08/08/daniel-ek-would-have-killed-discover-weekly-before-launch/
"Discover Weekly wasn't a great strategic plan and 100 engineers. It was 3 engineers that decided to build something."
"I would have killed it. All of a sudden, they shipped it. It’s one of the most loved product features that we have."
- Daniel Ek, CEO

Berlin Buzzwords 201x
~30 developers

The great capability divide
Myth:
● We are all doing quite ok
● 2-10x leader-to-rear span
Reality:
● Few leaders in each area
● 100-10000x leader-to-rear span
(Charts for myth and reality: # orgs vs capability in X)

The Model company
● The Spotify Model
● Thanks to Henrik Kniberg, Joakim Sundén, Viktor Cessan, …
● The Spotify Data Model
● Lacked great communicators
● This is not the Spotify Data Model, this is just a tribute.
https://youtu.be/Yvfz4HGtoPc

One-directional | Bidirectional
Offline, async integration | Online, sync integration
Immutable | Mutable
Explicit metadata | Implicit metadata
Native, expressive components | Specialised, limited components
Unix philosophy | Comprehensive components
Reactive, pull-based | Proactive, push-based
Functional data engineering
While there is value in the items on the right, we value the items on the left more
Scandinavian style
O F I M N U R = UNIFORM

All the data paradigms
Data warehousing
Data dump
Modern data warehouse
Data vault
Kimball
Modern data stack
Spreadsheet
Data scientist in the corner
Lakehouse
Live database
Functional data engineering
Data lake
Frozen lake
Data products
Data mesh
Data hub
Data contracts
Notebooks
Stored procedures
Deployed notebooks
Medallion
Stream processing
PubSub
Unified log
Service-oriented
RPC
Enterprise service bus
Data fabric
Data access layer
Lambda
Beam
Paradigm != product

(Diagram: paradigms placed along the axes below: Live database, Data warehousing, Data dump, Stream processing, Data products, Service-oriented)
Offline, async integration | Online, sync integration
Immutable | Mutable
Native, expressive components | Specialised, limited components
Unix philosophy | Comprehensive components
Reactive, pull-based | Proactive, push-based

All the DataOps
● Create new job / service
○ Trivial in every blog post
● Roll out new version
○ Relate to state
● Recover from crash
○ Unavailable
○ No bad data produced
● Recover from faulty logic
○ Available
○ Bad data produced

Bidirectional vs unidirectional upgrade
Microservices
● Careful rollout
● Risk of user impact
● Proactive QA

Microservices
● Careful rollout
● Proactive QA
Streaming
● Swift rollout
● Parallel pipelines
● User impact, QA?
(Diagram: jobs connected by streams)

Microservices
● Careful rollout
● Proactive QA
Data lake
● Instant rollout
● User impact later
● Reactive QA
Streaming
● Swift rollout
● Parallel pipelines
● User impact, QA?
(Diagram: jobs connected by streams)

Bidirectional vs unidirectional error recovery
Microservices
● User impact
● Data corruption
● Cascading corruption
● Unbounded recovery

Streaming
● Data corruption
● Downstream impact
● Bounded recovery
Microservices
● User impact
● Data corruption
(Diagram: jobs connected by streams)

Streaming
● Data corruption
● Bounded recovery
Data lake
● Temporary data corruption
● Easy recovery
Microservices
● User impact
● Data corruption
(Diagram: jobs connected by streams)

Offline, async integration
● Asynchronous operational dependencies
● Precompute - discover failures early
Online, sync integration
● Strong operational dependencies
● Failure scenarios discovered late
(Diagram: paradigms placed along the axes below: Service-oriented architectures, Stream processing, Data warehousing, Functional data engineering, Data dump, Live database)
Offline, async integration | Online, sync integration
Immutable | Mutable
Native, expressive components | Specialised, limited components
Unix philosophy | Comprehensive components
Reactive, pull-based | Proactive, push-based

Separating offline and online
(Diagram labels: Raw, Fraud service, Fraud model, Orders, Orders, Replication / Backup)
Prudent procedures (x2)
Lightweight procedures
● QA driven by internal efficiency
● Continuous deployment
● New pipeline < 1 day
● Upgrade < 1 hour
● Bug recovery < 1 hour
Careful handover (x2)

Mixing paradigms
● Tradeoff
○ No single perfect paradigm
○ Borders pose operational risks
● Organic growth → accidental heterogeneity
● Early Hadoop adoption → accidental homogeneity
(Diagram labels: Service, App, DB, Poll, Queue, Aggregate logs, NFS, Hourly dump, Data warehouse, ETL, Kafka, scp, HTTP)

Life of an error, data lake
● My processing job, bad code!
1. Revert serving datasets to old
2. Fix bug
3. Remove faulty datasets
4. Deploy
5. Backfill is automatic (Luigi)
Done!
● Low cost of error
○ Reactive QA
○ Production environment sufficient
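The recovery steps can be sketched as a toy model. It assumes datasets are hourly partition directories marked complete by a `_SUCCESS` file, and that the orchestrator (Luigi in the talk) rebuilds any partition whose output is missing. The directory layout and all function names are hypothetical, not taken from the talk.

```python
import tempfile
from datetime import datetime, timedelta
from pathlib import Path


def partition_path(root: Path, dataset: str, hour: datetime) -> Path:
    """Hypothetical layout: one directory per dataset and hour."""
    return root / dataset / hour.strftime("year=%Y/month=%m/day=%d/hour=%H")


def is_complete(path: Path) -> bool:
    # A partition counts as existing only if its marker file is present.
    return (path / "_SUCCESS").exists()


def build(path: Path, payload: str) -> None:
    """Stand-in for a processing job writing one immutable partition."""
    path.mkdir(parents=True, exist_ok=True)
    (path / "part-0000").write_text(payload)
    (path / "_SUCCESS").write_text("")


def remove_partition(path: Path) -> None:
    """Step 3: remove a faulty dataset partition."""
    for f in path.iterdir():
        f.unlink()
    path.rmdir()


def backfill(root, dataset, hours, payload):
    """Step 5: rebuild whatever is missing, as an orchestrator would."""
    rebuilt = []
    for hour in hours:
        path = partition_path(root, dataset, hour)
        if not is_complete(path):
            build(path, payload)
            rebuilt.append(hour)
    return rebuilt


root = Path(tempfile.mkdtemp())
start = datetime(2025, 6, 17, 0)
hours = [start + timedelta(hours=h) for h in range(6)]

backfill(root, "session", hours, "v1 output")   # normal operation
bad_hours = hours[3:]                           # pretend these came from buggy code
for hour in bad_hours:
    remove_partition(partition_path(root, "session", hour))
rebuilt = backfill(root, "session", hours, "fixed output")  # after deploying the fix
```

Only the removed partitions are rebuilt; the untouched ones keep their original content, which is what keeps the cost of an error low.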

Life of an error, frozen lake
● My processing job, bad code!
1. Revert serving datasets to old
2. Fix bug
3. Bump pipeline version
4. Deploy
5. Backfill is automatic (Luigi)
Done!
● Low cost of error
○ Reactive QA
○ Production environment sufficient
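In the frozen lake variant nothing is deleted. A minimal sketch, assuming the pipeline version is part of the output path (as in the `.../v1/` path of the Luigi example in this deck): bumping the version makes every partition of the new version count as missing, so the backfill writes a parallel tree. The path layout, the in-memory `lake` set and the helper names are illustrative assumptions.

```python
from datetime import datetime, timedelta

PIPELINE_VERSION = "v2"  # step 3: bumped from "v1" after fixing the bug


def output_path(dataset: str, version: str, hour: datetime) -> str:
    """Hypothetical layout with the pipeline version in the path."""
    return f"/lake/{dataset}/{version}/" + hour.strftime(
        "year=%Y/month=%m/day=%d/hour=%H")


def missing_outputs(existing: set, dataset: str, version: str, hours) -> list:
    """Paths an orchestrator would need to build for this version."""
    candidates = (output_path(dataset, version, h) for h in hours)
    return [p for p in candidates if p not in existing]


start = datetime(2025, 6, 17, 0)
hours = [start + timedelta(hours=h) for h in range(3)]

# The lake already holds all v1 partitions; they are never deleted.
lake = {output_path("session", "v1", h) for h in hours}

todo = missing_outputs(lake, "session", PIPELINE_VERSION, hours)
lake.update(todo)  # the backfill writes the new version next to the old one
```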

Life of an error, streaming
● Works for a single job, not pipeline. :-(
(Diagram: pipeline of jobs connected by streams)
Reprocessing in Kafka Streams

Immutable
● Immutable entities
○ Partitioned by time
● One entity = immutable facts
○ collected during a period
○ state snapshot
Mutable
● Entities (tables) updated by flows
● One entity = unbounded container of all similar records
(Diagram: paradigms placed along the axes below: Service-oriented architectures, Stream processing, Data warehousing, Functional data engineering, Data dump, Live database)
Offline, async integration | Online, sync integration
Immutable | Mutable
Native, expressive components | Specialised, limited components
Unix philosophy | Comprehensive components
Reactive, pull-based | Proactive, push-based

Data warehousing
● Extract - transform - load (ETL) write path
○ Updates to mutable tables
■ Not easily shared
○ Normalised model
■ Expensive to change
■ Carefully crafted, future-proof
● Denormalising read path
○ Interactive exploration
○ Dashboards
○ BI tools
○ (End user applications)
○ (Machine learning)
(Diagram: Data warehouse)

Mutable vs immutable ETL
Data warehouse
● Mutable tables
● Share & reuse?
○ Semantically challenging
■ Updates, partitions?
○ Human sync needed
Data lake
● Immutable partitions
● Share & reuse
○ Semantics manageable by consumer
○ No human sync
● Partition addressing needed

Mutable vs immutable ETL - bug recovery
Data warehouse
● Mutable tables
● Entire tables become tainted
○ Recompute all history?
○ Case-specific partial recompute
Data lake
● Immutable partitions
● Time partitions after bug become tainted
○ Traverse time-aware DAG and recompute
○ Toolable
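"Traverse time-aware DAG and recompute" can be sketched as follows. This assumes each dataset declares which upstream dataset it reads and over how many hours (like `window_size` in the Luigi example in this deck). The `DEPENDENCIES` table and the function names are made up for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical DAG: dataset -> list of (upstream dataset, window size in hours).
# A partition for hour H reads upstream hours H, H-1, ..., H-(window-1).
DEPENDENCIES = {
    "click": [],
    "session": [("click", 4)],
    "report": [("session", 1)],
}


def hours_between(first, last):
    hour = first
    while hour <= last:
        yield hour
        hour += timedelta(hours=1)


def tainted_partitions(bad_dataset, first_bad, last_bad, horizon):
    """Return {dataset: sorted hours} that must be recomputed.

    Partitions of bad_dataset in [first_bad, last_bad] are faulty;
    horizon is the latest hour that has been produced so far.
    """
    tainted = {bad_dataset: set(hours_between(first_bad, last_bad))}
    changed = True
    while changed:  # iterate to a fixed point over the (acyclic) graph
        changed = False
        for dataset, upstreams in DEPENDENCIES.items():
            for upstream, window in upstreams:
                for bad_hour in list(tainted.get(upstream, ())):
                    # Upstream hour U is read by downstream hours U..U+window-1.
                    for offset in range(window):
                        hour = bad_hour + timedelta(hours=offset)
                        if hour > horizon:
                            continue
                        hours = tainted.setdefault(dataset, set())
                        if hour not in hours:
                            hours.add(hour)
                            changed = True
    return {d: sorted(h) for d, h in tainted.items()}


t0 = datetime(2025, 6, 17, 10)
result = tainted_partitions("click", t0, t0, horizon=datetime(2025, 6, 17, 23))
```

One faulty `click` hour taints four `session` hours and the four matching `report` hours. The set is bounded and can be computed by a tool, because the partitions are addressed by time.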

Modern data warehousing
● Dremel / BigQuery ~2010
● Extract - load - transform (ELT) read path
○ From (mutable?) raw tables
○ Use case decides model
● Low human ops cost
○ Fast iterations
○ No mutable intermediate tables
● High compute cost
(Diagram: Modern data warehouse)

Explicit metadata
● Metadata defined as explicit, separate code
○ Dependencies
○ Schemas
○ Management
○ Governance
● Tooling feasible
Implicit metadata
● Metadata generated by business logic
○ DB tables
○ JSON
● Or policy docs
(Diagram: paradigms placed along the axes below: Service-oriented architectures, Stream processing, Data warehousing, Functional data engineering, Data dump, Live database)
Offline, async integration | Online, sync integration
Immutable | Mutable
Native, expressive components | Specialised, limited components
Unix philosophy | Comprehensive components
Reactive, pull-based | Proactive, push-based

Separating fundamental & superficial challenges
Fundamental challenges = your business
● Click-through rate
● Sensor anomalies
● User registrations
● …
Superficial challenges = your system
● Data collection delay
● Stream join sync mismatch
● Technical failures
● …

Workflow orchestration - addressing data time space
Data warehousing: dependencies between tasks
Functional data engineering: dependencies between time partitions

DAG example, window (simplified)
from datetime import timedelta

from luigi import DateHourParameter, IntParameter
from luigi.contrib.gcs import GCSTarget
from luigi.contrib.spark import SparkSubmitTask

class Session(SparkSubmitTask):
    """Sessions ending or active during a particular hour."""
    hour = DateHourParameter()
    window_size = IntParameter(default=4)
    jar = 'orderpipeline.jar'
    entry_class = 'com.example.shop.SessionJob'

    def requires(self):
        # One Click partition per hour in the window (Click is defined elsewhere).
        return [Click(hour=self.hour - timedelta(hours=offset))
                for offset in range(self.window_size)]

    def output(self):
        return GCSTarget("gs://mybucket/prod/red/order_user/v1/" +
                         f"{self.hour:year=%Y/month=%m/day=%d/hour=%H}")

    def app_options(self):
        return ["--clicks", ",".join(
                    [req.output().path for req in self.requires()]),
                "--output", self.output().path]
(Diagram: Click → Session)
● Immutable, reproducible
● Free to consume by downstream
○ Without ops risk
○ Without human sync

Flowing data time partition management
Functional data engineering:
● Partitions defined in workflow
● Reproducible
● Addressable
● Predictable resources
Data warehousing:
● All data?
● Arrival time field?
● Watermark table?
● Joins?
Stream processing:
● Joins?
○ Other streams?
○ Tables?
● Resources determined by jitter
Single data dump:
● Flow? Nah.
Beam:
● What does Google do?

History of workflow orchestration
First orchestrator scalable in:
● Logic complexity
● Parameter management
● DAG size
● Ops cost
● Domain-specific abstractions
https://www.pracdata.io/p/state-of-workflow-orchestration-ecosystem-2025

Schema definitions
● RDBMS: Table metadata
● Avro format: JSON/DSL definition
○ Definition is bundled with avro data files
○ Reused by Parquet format
{
  "type": "record",
  "namespace": "com.mapflat.example",
  "name": "User",
  "fields": [
    { "name": "id", "type": "int" },
    { "name": "name", "type": "string" },
    { "name": "age", "type": "int" },
    { "name": "phone", "type": ["null", "string"], "default": null }
  ]
}
● pyschema / dataclass
● Scala case classes
case class User(id: String, name: String, age: Int,
                phone: Option[String] = None)
val users = Seq(User("1", "Alice", 32),
                User("2", "Bob", 43, Some("08-123456")))
● JSON-schema
● JSON: Each record
○ One record insufficient to deduce schema
{ "id": 1, "name": "Alice", "age": "34" }
{ "id": 1, "name": "Bob", "age": "42", "phone": "08-123456" }
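The point that one JSON record is insufficient to deduce a schema can be illustrated with the two JSON sample records on this slide. The sketch below infers field types from the records alone; `infer_schema` is a hypothetical helper, not an existing library function.

```python
import json

records = [
    '{ "id": 1, "name": "Alice", "age": "34" }',
    '{ "id": 1, "name": "Bob", "age": "42", "phone": "08-123456" }',
]


def infer_schema(lines):
    """Map each field name to the set of value types observed."""
    schema = {}
    for line in lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


from_first = infer_schema(records[:1])   # no trace of "phone"
from_both = infer_schema(records)        # "phone" appears, "age" is a string
```

Even with both records, the inferred schema cannot say whether `phone` is optional by design, and `age` arrives as a string although the Avro definition declares it as `int`. An explicit definition carries that information; the records do not.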

Schema definition choice
● RDBMS: Table metadata
● Avro: JSON/DSL definition
○ Definition is bundled with avro data files
○ Reused by Parquet format
● pyschema / dataclass
● Scala case classes
● JSON-schema
● JSON: Each record
○ One record insufficient to deduce schema
Scala case classes:
● Expressive
● Custom types
● Scalameta
● IDE support
● Avro for data lake storage
case class User(id: String, name: String, age: Int,
                phone: Option[String] = None)
val users = Seq(User("1", "Alice", 32),
                User("2", "Bob", 43, Some("08-123456")))

Schema offspring
(Diagram: case classes at the centre, connected to the items below)
● Test record difference render type classes
● Test equality type classes
● Avro definitions
● Java Avro codec classes
● Java <-> Scala converters
● Avro type annotations
● MySQL schemas
● CSV codecs
● Privacy by design machinery
● Python
● Logical types
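The idea of deriving other schema artifacts from a single definition can be sketched in Python with a dataclass. The talk itself uses Scala case classes, so this is a swapped-in illustration and not the author's machinery. The type mapping and the `avro_schema` helper are assumptions, and only a few primitive types are handled.

```python
import dataclasses
import typing
from typing import Optional

# Assumed mapping from Python types to Avro primitive type names.
AVRO_PRIMITIVES = {int: "int", str: "string", float: "double", bool: "boolean"}


@dataclasses.dataclass
class User:
    id: int
    name: str
    age: int
    phone: Optional[str] = None


def avro_schema(cls, namespace):
    """Derive an Avro record definition (as a dict) from a dataclass."""
    hints = typing.get_type_hints(cls)
    fields = []
    for field in dataclasses.fields(cls):
        tp = hints[field.name]
        args = typing.get_args(tp)
        if typing.get_origin(tp) is typing.Union and type(None) in args:
            # Optional[X] becomes the Avro union ["null", X] with default null.
            inner = next(a for a in args if a is not type(None))
            fields.append({"name": field.name,
                           "type": ["null", AVRO_PRIMITIVES[inner]],
                           "default": None})
        else:
            fields.append({"name": field.name, "type": AVRO_PRIMITIVES[tp]})
    return {"type": "record", "namespace": namespace,
            "name": cls.__name__, "fields": fields}


schema = avro_schema(User, "com.mapflat.example")
```

Serialising `schema` with `json.dumps` gives a record definition of the same shape as the Avro example earlier in the deck, with `phone` as a nullable union.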

Native, expressive components
● High-code - 3GL
○ Python, Scala, Java
● Embedded DSLs
○ Spark, Flink, ..
● Built for production
○ QA
○ DevEx
○ Quality mgmt
● "What can I do with data?"
Specialised, limited components
● Special tools - data 4GL
○ Low code (SQL)
○ No code
● "What can I do with tool X?"
(Diagram: paradigms placed along the axes below: Service-oriented architectures, Stream processing, Data warehousing, Functional data engineering, Live database)
Offline, async integration | Online, sync integration
Immutable | Mutable
Native, expressive components | Specialised, limited components
Unix philosophy | Comprehensive components
Reactive, pull-based | Proactive, push-based

SQL for data processing
● SQL used in 3 distinct contexts
○ Backend data record retrieval
■ 25 years of injections
○ ETL data processing?
Important data language features:
● Can express (complex) business logic
● Composability
● Reusability
● Testability
● Seamless integration with external logic
● Tools to guide towards good path
○ Type system
○ Inspection tools
● IDE experience
● Debuggability
● Data quality measurement support
● Data quality improvement support
● Learning curve
https://threadreaderapp.com/thread/1353832649664692225.html
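Several of the listed features (composability, reusability, testability, data quality measurement) can be illustrated with a transformation written as plain functions in a general-purpose language. This is a minimal sketch; the `Order` type, the validation rule and the function names are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    user_id: str
    amount_cents: int
    country: str


def is_valid(order: Order) -> bool:
    """Reusable business rule, testable in isolation."""
    return order.amount_cents > 0 and len(order.country) == 2


def revenue_by_country(orders):
    """Composes the rule above and counts rejected records as a quality measure."""
    totals, rejected = {}, 0
    for order in orders:
        if is_valid(order):
            totals[order.country] = totals.get(order.country, 0) + order.amount_cents
        else:
            rejected += 1
    return totals, rejected


totals, rejected = revenue_by_country([
    Order("u1", 1000, "SE"),
    Order("u2", 250, "SE"),
    Order("u3", -5, "DE"),
    Order("u4", 700, "Germany"),
])
```

The `rejected` counter falls out of the same pass over the data, and `is_valid` can be unit tested or reused by another job without copying logic.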

Reporting master data management → SQL
2013:
2025:
● "MasterUser" - MDM of users
● "ReportingUser" - MasterUser + fiscal
● Convert ReportingUser to Hive?
○ Business logic too complex
○ No code reuse
○ Normalisation forced on consumers
○ No counters - sacrifice data quality
○ 3-5x performance loss
"We seem to be the largest company using Python for big data. That's a risky position. Let's find alternatives."
That did not age well.

Comprehensive components
● Wide scope components / assets
● Good interoperability
● Less control → ops risk
Unix philosophy
● Do one thing well - small scope
● Enables evolution
● Some features not available in OSS
○ Data monitoring
(Diagram: paradigms and products placed along the axes below: Data vendor products, Cloud (IaaS, PaaS), Data products, Data lake, Frozen lake, Modern data warehouse, Data vault, Kimball, Data fabric, Data access layer, Data mesh, Data hub, Data contracts)
Offline, async integration | Online, sync integration
Immutable | Mutable
Native, expressive components | Specialised, limited components
Unix philosophy | Comprehensive components
Reactive, pull-based | Proactive, push-based

Unix philosophy example
● Small programs that do one thing well.
● Architecture for two-way decisions
● Data pipeline deployment evolution
○ Spotify 2014 - 2018
1. Self-contained jar file. Ad-hoc continuous deploy flow.
2. Docker container on VM pool
3. Docker container on Kubernetes
(Diagram labels: logs, CI, dev env, o11y)

Separation of computation and integration
Computation
● Fails on
○ New data + code combination
○ Static resources
● Deterministic, reproducible
○ No side effects
Integration
● Fails on
○ Configuration
○ Dynamic resources
○ Bad cloud weather
● Non-deterministic
○ Side effects
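The split can be sketched as a pure function that holds the business logic and a thin wrapper that owns all I/O. The sessionisation rule, file format and function names below are illustrative assumptions.

```python
import json
import tempfile
from pathlib import Path


def sessionize(clicks, gap_seconds=1800):
    """Computation: pure and deterministic, no I/O or side effects.

    Counts sessions per user from (user, timestamp) clicks, starting a
    new session after gap_seconds of inactivity.
    """
    sessions, last_seen = {}, {}
    for user, ts in sorted(clicks):
        if user not in last_seen or ts - last_seen[user] > gap_seconds:
            sessions[user] = sessions.get(user, 0) + 1
        last_seen[user] = ts
    return sessions


def run_job(input_path: Path, output_path: Path) -> None:
    """Integration: thin wrapper holding all side effects (reads, writes)."""
    clicks = [tuple(c) for c in json.loads(input_path.read_text())]
    output_path.write_text(json.dumps(sessionize(clicks), sort_keys=True))


workdir = Path(tempfile.mkdtemp())
(workdir / "clicks.json").write_text(
    json.dumps([["alice", 0], ["alice", 100], ["alice", 5000], ["bob", 50]]))
run_job(workdir / "clicks.json", workdir / "sessions.json")
result = json.loads((workdir / "sessions.json").read_text())
```

`sessionize` can only fail on a new combination of data and code, and it is reproducible in a unit test. `run_job` is where configuration and environment failures end up.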

Proactive, push-based
● One size fits all
● Data producer in control
Reactive, pull-based
● Driven by use cases
● Data consumer in control
(Diagram: paradigms placed along the axes below: Data products, Modern data warehouse, Data vault, Kimball, Data fabric, Data access layer, Data mesh, Data hub, Data contracts, Stream processing, Functional data engineering, Bronze & gold layers, Silver layer)
Offline, async integration | Online, sync integration
Immutable | Mutable
Native, expressive components | Specialised, limited components
Unix philosophy | Comprehensive components
Reactive, pull-based | Proactive, push-based

Downstream consumer choice
Delay: 0
Delay: 4
Delay: 12

Artisanal vs industrialised knowledge graphs
Artisanal:
● Create single shared graph
● Used for many use cases
● Innovate fast graph → use case
Industrial:
● Create graph for each use case
● Reuse code that produces graph
● Each graph may be unique
● Innovate fast raw → graph → use case

Artisanal vs industrialised machine learning models
Google MLOps maturity model:
● MLOps level 0: Manual process
● MLOps level 1: ML pipeline automation
● MLOps level 2: CI/CD pipeline automation
https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning

Premature modelling is waste
● Power: Recompute model quickly
● Lifted limitation: Expensive to compute model
● Old rule: Careful manual modelling work
● New rules: Guard rails preventing model iteration from breaking downstream
○ Code QA = testing
○ Code + data QA = monitoring
Yes, on purpose!
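The two kinds of guard rail can be sketched side by side. A unit test checks the code against fixed input before deployment. A monitor checks what the code does to real data on every run. The `dedupe` job, the `monitor` function and its threshold are hypothetical.

```python
def dedupe(records):
    """Transformation under test: keep the first record per id."""
    seen, result = set(), []
    for record in records:
        if record["id"] not in seen:
            seen.add(record["id"])
            result.append(record)
    return result


def test_dedupe():
    """Code QA: fixed input, exact expected output, runs before deploy."""
    assert dedupe([{"id": 1}, {"id": 1}, {"id": 2}]) == [{"id": 1}, {"id": 2}]


def monitor(input_count, output_count, max_drop_ratio=0.5):
    """Code + data QA: runs on each production dataset, flags anomalies."""
    dropped = input_count - output_count
    return "ok" if dropped <= max_drop_ratio * input_count else "alert"


test_dedupe()
production_input = [{"id": i % 3} for i in range(30)]  # unexpectedly many duplicates
status = monitor(len(production_input), len(dedupe(production_input)))
```

The unit test passes, yet the monitor flags the run, because the problem is in the combination of code and data rather than in the code alone.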

All the data paradigms
Data warehousing
Data dump
Modern data warehouse
Data vault
Kimball
Modern data stack
Spreadsheet
Data scientist in the corner
Lakehouse
Live database
Data lake
Frozen lake
Data products
Data mesh
Data hub
Data contracts
Notebooks
Stored procedures
Deployed notebooks
Medallion
Stream processing
PubSub
Unified log
Service-oriented
RPC
Enterprise service bus
Data fabric
Data access layer
Lambda
Beam
Paradigm != product

Functional data engineering @ enterprise context
1.5 persons, 3 years
● 162 pipelines
● 700 datasets / day*
● 4 new pipelines / month
● 80 commits / month*
● 35 deployments / month*
● 40 KLOC pipeline code
● 20 KLOC tests
● 1.5 KLOC Terraform
● 8 Kubernetes clusters
● 10K pods / day
● 4 regions AWS / Azure
● Cloud: 17 KEUR / month*
● Cloud + dev + ops (TCO): 300 EUR / pipeline / month
*As-a-service (2 devs):
● 3700 datasets / day
● 275 commits / month
● 173 deployments / month
● Cloud 2.5 KEUR / month
Consumer products
O(10M) units
O(10G) / day
50K employees
10 BEUR revenue
All user-related operational data flows

The next 100x?
(Chart: # orgs vs capability in X)
2016: 1 600 000 000 datasets / day
There are companies 100x ahead on these KPIs. Don't you want that?
I don't believe you. We are great by definition.
But we follow the vendor's advice.
How hard can it be?
No, we need a data mesh and a silver layer.
Oh.
We prefer detailed control.

The first high-level, scalable orchestrator… has yet to be created
● Higher abstraction layers
○ Beyond datasets, jobs, pipelines
● It's a software engineering problem
○ Convenience blocks abstraction stacking
(Chart axes: ← Similar capabilities →, More convenience →)
https://www.pracdata.io/p/state-of-workflow-orchestration-ecosystem-2025

I hope that I have contributed to
● Insights into paradigms' practical aspects
○ Latency / ops+productivity tradeoff
■ Microservices
■ Streaming
■ Functional data engineering
○ No software engineers: Data warehousing
● Awareness of data engineering subfields
○ Functional (Hadoop ecosys, software eng)
○ Data warehousing (ex BI development)
● Courage to
○ follow your own path
■ We need innovation
■ Vendors seek revenue
○ use your skills wisely
■ Tech has major impact
■ European sovereignty?
■ Democracy?
Want to
● Adopt functional data engineering?
● Aim for the next 100x?
Ping me.
