www.scling.com
All the DataOps,
all the paradigms
Lars Albertsson,
independent data engineer
Berlin Buzzwords, 2025-06-17
1
Berlin Buzzwords 2014
2
● "Cutting Hadoop developer cycle time"
○ → "Democratising data @ Spotify"
Enabling innovation
3
"The actual work that went into
Discover Weekly was very little,
because we're reusing things we
already had."
https://youtu.be/A259Yo8hBRs
https://youtu.be/ZcmJxli8WS8
https://musically.com/2018/08/08/daniel-ek-would-have-killed-discover-weekly-before-launch/
"Discover Weekly wasn't a great
strategic plan and 100 engineers.
It was 3 engineers that decided to
build something."
"I would have killed it. All of a sudden,
they shipped it. It’s one of the most
loved product features that we have."
- Daniel Ek, CEO
Berlin Buzzwords 201x
4
~30 developers
Myth:
● We are all doing quite ok
● 2-10x leader-to-rear span
The great capability divide
5
[Two charts: # orgs plotted against capability in X, one for the myth and one for the reality]
Reality:
● Few leaders in each area
● 100-10000x leader-to-rear span
The Model company
● The Spotify Model
● Thanks to Henrik Kniberg, Joakim
Sundén, Viktor Cessan, …
● The Spotify Data Model
● Lacked great communicators
● This is not the Spotify Data Model,
this is just a tribute.
6
https://youtu.be/Yvfz4HGtoPc
7
One-directional | Bidirectional
Offline, async integration | Online, sync integration
Immutable | Mutable
Explicit metadata | Implicit metadata
Native, expressive components | Specialised, limited components
Unix philosophy | Comprehensive components
Reactive, pull-based | Proactive, push-based
Functional data engineering
While there is value in the items on the right, we value the items on the left more
Scandinavian style
O F I M N U R = UNIFORM
All the data paradigms
8
Data warehousing
Data dump
Modern data
warehouse
Data vault
Kimball
Modern
data stack
Spreadsheet
Data scientist in the corner
Lakehouse
Live database
Functional data engineering
Data lake
Frozen lake
Data products
Data mesh
Data hub
Data
contracts
Notebooks
Stored
procedures
Deployed
notebooks
Medallion
Stream processing
PubSub
Unified
log
Service-oriented
RPC
Enterprise
service bus
Data
fabric
Data
access layer
Paradigm != product
Lambda
Beam
9
Live
database
Data warehousing
Functional data engineering
Data dump
Stream
processing
Data products
Service-oriented
All the DataOps
● Create new job / service
○ Trivial in every blog post
● Roll out new version
○ Relate to state
● Recover from crash
○ Unavailable
○ No bad data produced
● Recover from faulty logic
○ Available
○ Bad data produced
10
13
Microservices
● Careful rollout
● Risk of user impact
● Proactive QA
Bidirectional vs unidirectional upgrade
Data lake
● Instant rollout
● User impact later
● Reactive QA
Streaming
● Swift rollout
● Parallel pipelines
● User impact, QA?
[Diagram: jobs connected by streams]
16
Bidirectional vs unidirectional error recovery
Streaming
● Data corruption
● Downstream impact
● Bounded recovery
Data lake
● Temporary data
corruption
● Downstream impact
● Easy recovery
Microservices
● User impact
● Data corruption
● Cascading corruption
● Unbounded recovery
[Diagram: jobs connected by streams]
● Asynchronous
operational
dependencies
● Precompute - discover
failures early
17
Service-oriented
architectures
Stream
processing
● Strong operational
dependencies
● Failure scenarios
discovered late
Data
warehousing
Functional data
engineering
Data dump
Live
database
Separating offline and online
18
Raw
Fraud
service
Fraud
model
Orders Orders
Replication /
Backup
Prudent procedures Prudent procedures
Lightweight procedures
● QA driven by internal efficiency
● Continuous deployment
● New pipeline < 1 day
● Upgrade < 1 hour
● Bug recovery < 1 hour
Careful handover Careful handover
Mixing paradigms
● Tradeoff
○ No single perfect paradigm
○ Borders pose operational risks
● Organic growth → accidental heterogeneity
● Early Hadoop adoption → accidental homogeneity
19
[Diagram: services, apps and DBs integrated through polling, a queue, aggregated logs, NFS, hourly dumps, ETL into a data warehouse, Kafka, scp and HTTP]
Life of an error, data lake
20
● My processing job, bad code!
1. Revert serving datasets to old
2. Fix bug
3. Remove faulty datasets
4. Deploy
5. Backfill is automatic (Luigi)
Done!
● Low cost of error
○ Reactive QA
○ Production environment sufficient
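The recovery steps above can be sketched in a few lines. This is a minimal illustration, not Luigi itself: the in-memory `lake` dict, `compute_partition`, and `backfill` are hypothetical names standing in for dataset storage, a processing job, and an orchestrator that only runs tasks whose output is missing.

```python
from datetime import datetime, timedelta

# Hypothetical in-memory "lake": partition hour -> dataset content.
lake = {}


def compute_partition(hour, buggy=False):
    """Stand-in for a processing job; the buggy version emits bad data."""
    return "bad" if buggy else f"sessions for {hour:%Y-%m-%dT%H}"


def backfill(hours, buggy=False):
    """Compute only the partitions whose output is missing."""
    produced = []
    for hour in hours:
        if hour not in lake:
            lake[hour] = compute_partition(hour, buggy)
            produced.append(hour)
    return produced


start = datetime(2025, 6, 17, 0)
hours = [start + timedelta(hours=h) for h in range(4)]

backfill(hours[:2])           # two good partitions
backfill(hours, buggy=True)   # a bad deploy taints the last two

# Step 3: remove faulty datasets. Steps 4-5: deploy the fix, then backfill.
for hour in [h for h, data in lake.items() if data == "bad"]:
    del lake[hour]
recomputed = backfill(hours)
```

Because existing outputs are never overwritten, only the removed partitions are recomputed; the good ones are left alone.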
Life of an error, frozen lake
21
● My processing job, bad code!
1. Revert serving datasets to old
2. Fix bug
3. Bump pipeline version
4. Deploy
5. Backfill is automatic (Luigi)
Done!
● Low cost of error
○ Reactive QA
○ Production environment sufficient
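Bumping the pipeline version instead of deleting data could look roughly like this. It is a sketch under assumptions: the `existing` set stands in for an object store listing, and `partition_path` and `missing_partitions` are hypothetical helpers; the path layout mirrors the `v1` segment used in the deck's Luigi example.

```python
from datetime import datetime

existing = set()  # hypothetical stand-in for an object store listing


def partition_path(dataset, version, hour):
    """Address one immutable partition by dataset, version, and hour."""
    return f"{dataset}/v{version}/" + hour.strftime(
        "year=%Y/month=%m/day=%d/hour=%H")


def missing_partitions(dataset, version, hours):
    """Partitions an orchestrator would need to (re)compute."""
    return [h for h in hours
            if partition_path(dataset, version, h) not in existing]


hours = [datetime(2025, 6, 17, h) for h in range(3)]
for h in hours:
    existing.add(partition_path("order_user", 1, h))  # output of buggy v1

# After the version bump, every v2 partition is missing, so the backfill
# covers all of history while the v1 data stays untouched.
to_backfill = missing_partitions("order_user", 2, hours)
```

Nothing is ever deleted or mutated, which is what makes the lake "frozen": the faulty v1 partitions remain available for comparison or audit.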
22
Life of an error, streaming
● Works for a single job, not pipeline. :-(
[Diagram: a pipeline of several jobs connected by streams]
Reprocessing in Kafka Streams
23
Service-oriented
architectures
Stream
processing
Data
warehousing
Functional data
engineering
Data dump
Live
database
● Immutable entities
○ Partitioned by time
● One entity =
immutable facts
○ collected during a period
○ state snapshot
● Entities (tables)
updated by flows
● One entity =
unbounded container of
all similar records
● Extract - transform - load (ETL) write path
○ Updates to mutable tables
■ Not easily shared
○ Normalised model
■ Expensive to change
■ Carefully crafted, future-proof
● Denormalising read path
○ Interactive exploration
○ Dashboards
○ BI tools
○ (End user applications)
○ (Machine learning)
Data warehousing
24
Data warehouse
Mutable vs immutable ETL
● Mutable tables
● Share & reuse?
○ Semantically challenging
■ Updates, partitions?
○ Human sync needed
● Immutable partitions
● Share & reuse
○ Semantics manageable by consumer
○ No human sync
● Partition addressing needed
25
Data warehouse Data lake
Mutable vs immutable ETL - bug recovery
● Mutable tables
● Entire tables become tainted
○ Recompute all history?
○ Case-specific partial recompute
● Immutable partitions
● Time partitions after bug become tainted
○ Traverse time-aware DAG and recompute
○ Toolable
26
Data warehouse Data lake
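"Traverse time-aware DAG and recompute" can be illustrated with a small sketch. The `DEPENDENCIES` table, the dataset names, and `tainted_partitions` are hypothetical; the assumption is that a partition for hour h reads the upstream hours h, h-1, ..., h-(window-1).

```python
from datetime import datetime, timedelta

# Hypothetical DAG: dataset -> list of (upstream dataset, window in hours).
DEPENDENCIES = {
    "session": [("click", 4)],
    "report": [("session", 1)],
}


def tainted_partitions(dataset, bad_hours, horizon):
    """Return {dataset: hours to recompute} given bad source partitions.

    A bad upstream hour b taints downstream hours b .. b+window-1,
    limited to partitions up to and including `horizon`.
    """
    tainted = {dataset: set(bad_hours)}
    frontier = [dataset]
    while frontier:
        upstream = frontier.pop()
        for downstream, deps in DEPENDENCIES.items():
            for dep, window in deps:
                if dep != upstream:
                    continue
                hours = {
                    bad + timedelta(hours=offset)
                    for bad in tainted[upstream]
                    for offset in range(window)
                    if bad + timedelta(hours=offset) <= horizon
                }
                if not hours <= tainted.get(downstream, set()):
                    tainted.setdefault(downstream, set()).update(hours)
                    frontier.append(downstream)
    return tainted


bad = {datetime(2025, 6, 17, 10)}
result = tainted_partitions("click", bad, horizon=datetime(2025, 6, 17, 12))
```

The result is a bounded, computable set of partitions, which is why this kind of recovery can be handled by tooling rather than case by case.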
● Dremel / BigQuery ~2010
● Extract - load - transform (ELT) read path
○ From (mutable?) raw tables
○ Use case decides model
Modern data warehousing
27
Modern data warehouse
● Low human ops cost
○ Fast iterations
○ No mutable intermediate tables
● High compute cost
28
Service-oriented
architectures
Stream
processing
Data
warehousing
Functional data
engineering
Data dump
Live
database
● Metadata defined as
explicit, separate code
○ Dependencies
○ Schemas
○ Management
○ Governance
● Tooling feasible
● Metadata generated by
business logic
○ DB tables
○ JSON
● Or policy docs
Separating fundamental & superficial challenges
29
Fundamental challenges = your business
● Click-through rate
● Sensor anomalies
● User registrations
● …
Superficial challenges = your system
● Data collection delay
● Stream join sync mismatch
● Technical failures
● …
Workflow orchestration - addressing data time space
30
Data warehousing:
dependencies between tasks
Functional data engineering:
dependencies between time partitions
from datetime import timedelta

from luigi import DateHourParameter, IntParameter
from luigi.contrib.gcs import GCSTarget
from luigi.contrib.spark import SparkSubmitTask


class Session(SparkSubmitTask):
    """Sessions ending or active during a particular hour."""
    hour = DateHourParameter()
    window_size = IntParameter(default=4)
    jar = 'orderpipeline.jar'
    entry_class = 'com.example.shop.SessionJob'

    def requires(self):
        # One Click partition (task defined elsewhere) per hour in the window.
        return [Click(hour=self.hour - timedelta(hours=offset))
                for offset in range(self.window_size)]

    def output(self):
        return GCSTarget("gs://mybucket/prod/red/order_user/v1/" +
                         f"{self.hour:year=%Y/month=%m/day=%d/hour=%H}")

    def app_options(self):
        return ["--clicks", ",".join(
                    [req.output().path for req in self.requires()]),
                "--output", self.output().path]
DAG example, window (simplified)
Click
Session
31
● Immutable, reproducible
● Free to consume by downstream
○ Without ops risk
○ Without human sync
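The window dependency in the task above can be checked without Luigi or Spark. In this small sketch, `required_click_hours` is a hypothetical helper that lists the Click partitions a Session partition would read, assuming the window counts backwards from the session hour.

```python
from datetime import datetime, timedelta


def required_click_hours(hour, window_size=4):
    """Hours of Click data feeding the Session partition for `hour`."""
    return [hour - timedelta(hours=offset) for offset in range(window_size)]


# A window that crosses midnight resolves to partitions on two dates.
window = required_click_hours(datetime(2025, 6, 17, 2))
```

Expressing the dependency as a function of the time parameter is what lets an orchestrator decide which partitions are missing and in which order to build them.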
Flowing data time partition management
32
Functional data engineering:
● Partitions defined in workflow
● Reproducible
● Addressable
● Predictable resources
Data warehousing:
● All data?
● Arrival time field?
● Watermark table?
● Joins?
Stream processing:
● Joins?
○ Other streams?
○ Tables?
● Resources determined by jitter
Single data dump:
● Flow? Nah.
Beam:
● What does Google do?
History of workflow orchestration
33
First orchestrator scalable in:
● Logic complexity
● Parameter management
● DAG size
● Ops cost
● Domain-specific abstractions
https://www.pracdata.io/p/state-of-workflow-orchestration-ecosystem-2025
Schema definitions
34
{
  "type": "record",
  "namespace": "com.mapflat.example",
  "name": "User",
  "fields": [
    { "name": "id", "type": "int" },
    { "name": "name", "type": "string" },
    { "name": "age", "type": "int" },
    { "name": "phone", "type": ["null", "string"], "default": null }
  ]
}
● RDBMS: Table metadata
● Avro format: JSON/DSL definition
○ Definition is bundled with avro data files
○ Reused by Parquet format
● pyschema / dataclass
● Scala case classes
● JSON-schema
● JSON: Each record
○ One record insufficient to deduce schema
{ "id": 1, "name": "Alice", "age": "34" }
{ "id": 1, "name": "Bob", "age": "42", "phone": "08-123456" }
case class User(id: String, name: String, age: Int,
phone: Option[String] = None)
val users = Seq( User("1", "Alice", 32),
User("2", "Bob", 43, Some("08-123456")))
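For the "pyschema / dataclass" option, a Python counterpart of the case class might look like this. It is a sketch, not part of the talk: `parse_user` is a hypothetical helper, and the coercions (`str` for `id`, `int` for `age`) are assumptions about how one would reconcile the loosely typed JSON records above with an explicit schema.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class User:
    """Explicit schema: field names, types, and one optional field."""
    id: str
    name: str
    age: int
    phone: Optional[str] = None


def parse_user(line):
    """Parse one JSON record, coercing fields to the declared types."""
    raw = json.loads(line)
    return User(id=str(raw["id"]), name=raw["name"],
                age=int(raw["age"]), phone=raw.get("phone"))


records = [
    '{ "id": 1, "name": "Alice", "age": "34" }',
    '{ "id": 1, "name": "Bob", "age": "42", "phone": "08-123456" }',
]
users = [parse_user(record) for record in records]
```

The first record alone reveals neither the `phone` field nor that `age` is meant to be a number, which is the point of declaring the schema separately from the data.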
● Expressive
● Custom types
● Scalameta
● IDE support
● Avro for data lake storage
Schema definition choice
35
● RDBMS: Table metadata
● Avro: JSON/DSL definition
○ Definition is bundled with avro data files
○ Reused by Parquet format
● pyschema / dataclass
● Scala case classes
● JSON-schema
● JSON: Each record
○ One record insufficient to deduce schema
case class User(id: String, name: String, age: Int,
phone: Option[String] = None)
val users = Seq( User("1", "Alice", 32),
User("2", "Bob", 43, Some("08-123456")))
Schema offspring Test record
difference render
type classes
36
case classes
test equality
type classes
Avro
definitions
Java Avro
codec classes
Java <-> Scala
converters
Avro type
annotations
MySQL
schemas
CSV codecs
Privacy by
design
machinery
Python
Logical types
37
Service-oriented
architectures
Stream
processing
Data
warehousing
Functional data
engineering
Live
database
● High-code - 3GL
○ Python, Scala, Java
● Embedded DSLs
○ Spark, Flink, ..
● Built for production
○ QA
○ DevEx
○ Quality mgmt
● "What can I do with data?"
● Special tools - data 4GL
○ Low code (SQL)
○ No code
● "What can I do with tool X?"
SQL for data processing
● SQL used in 3 distinct contexts
○ Interactive exploration
○ Backend data record retrieval
■ 25 years of injections
○ ETL data processing?
38
Important data language features:
● Can express (complex) business logic
● Composability
● Reusability
● Testability
● Seamless integration with external logic
● Tools to guide towards good path
○ Type system
○ Inspection tools
● IDE experience
● Debuggability
● Data quality measurement support
● Data quality improvement support
● Learning curve
https://threadreaderapp.com/thread/1353832649664692225.html
Reporting master data management → SQL
2013:
2025:
● "MasterUser" - MDM of users
● "ReportingUser" - MasterUser + fiscal
● Convert ReportingUser to Hive?
○ Business logic too complex
○ No code reuse
○ Normalisation forced on consumers
○ No counters - sacrifice data quality
○ 3-5x performance loss
40
"We seem to be the largest
company using Python for big
data. That's a risky position.
Let's find alternatives."
That did not
age well.
● Wide scope
components / assets
● Good interoperability
● Less control → ops risk
41
Data vendor
products
Cloud
(IaaS, PaaS)
Data products
Data lake
Frozen lake
Modern data
warehouse
Data vault Kimball
Data
fabric
Data
access layer
Data mesh Data hub
Data
contracts
● Do one thing well -
small scope
● Enables evolution
● Some features not
available in OSS
○ Data monitoring
Unix philosophy example
42
● Small programs that do one thing well.
● Architecture for two-way decisions
● Data pipeline deployment evolution
○ Spotify 2014 - 2018
1. Self-contained jar file. Ad-hoc continuous deploy flow.
2. Docker container on VM pool
3. Docker container on Kubernetes
logs
CI
dev
env
o11y
Separation of computation and integration
Computation
● Fails on
○ New data + code combination
○ Static resources
● Deterministic, reproducible
○ No side effects
Integration
● Fails on
○ Configuration
○ Dynamic resources
○ Bad cloud weather
● Non-deterministic
○ Side effects
43
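One way to picture the split is a pure function wrapped by a thin I/O layer. The names `count_clicks_per_user` and `run_job`, and the JSON-lines file layout, are hypothetical choices for this sketch.

```python
import json
import tempfile
from pathlib import Path


def count_clicks_per_user(clicks):
    """Computation: deterministic, no side effects, easy to unit test."""
    counts = {}
    for click in clicks:
        counts[click["user"]] = counts.get(click["user"], 0) + 1
    return counts


def run_job(input_path, output_path):
    """Integration: reads and writes storage; can fail on configuration
    or on the environment, but contains no business logic."""
    lines = Path(input_path).read_text().splitlines()
    clicks = [json.loads(line) for line in lines if line]
    Path(output_path).write_text(
        json.dumps(count_clicks_per_user(clicks), sort_keys=True))


with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "clicks.jsonl", Path(tmp) / "counts.json"
    src.write_text('{"user": "a"}\n{"user": "b"}\n{"user": "a"}\n')
    run_job(src, dst)
    result = json.loads(dst.read_text())
```

With this split, a failure in `count_clicks_per_user` points to a new data and code combination, while a failure in `run_job` points to configuration or the environment.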
● One size fits all
● Data producer in
control
44
Data products
Modern data
warehouse
Data vault Kimball
Data
fabric
Data
access layer
Data mesh Data hub
Data
contracts
● Driven by use cases
● Data consumer in
control
Stream
processing
Functional data
engineering
Bronze & gold
layers
Silver layer
Downstream consumer choice
45
Delay: 0
Delay: 4
Delay: 12
Artisanal vs industrialised knowledge graphs
Artisanal:
● Create single shared graph
● Used for many use cases
● Innovate fast graph → use case
Industrial:
● Create graph for each use case
● Reuse code that produces graph
● Each graph may be unique
● Innovate fast raw → graph → use case
46
Artisanal vs industrialised machine learning models
Google MLOps maturity model:
● MLOps level 0: Manual process
● MLOps level 1: ML pipeline automation
● MLOps level 2: CI/CD pipeline automation
https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning
47
Premature modelling is waste
● Power: Recompute model quickly
● Lifted limitation: Expensive to compute model
● Old rule: Careful manual modelling work
● New rules: Guard rails preventing model iteration from breaking downstream
○ Code QA = testing
○ Code + data QA = monitoring
Yes, on purpose!
48
Functional data engineering @ enterprise context
50
1.5 persons, 3 years
● 162 pipelines
● 700 datasets / day*
● 4 new pipelines / month
● 80 commits / month*
● 35 deployments / month*
● 40 KLOC pipeline code
● 20 KLOC tests
● 1.5 KLOC Terraform
● 8 Kubernetes clusters
● 10K pods / day
● 4 regions AWS / Azure
● Cloud: 17 KEUR / month*
● Cloud + dev + ops (TCO):
300 EUR / pipeline / month
*As-a-service (2 devs):
● 3700 datasets / day
● 275 commits / month
● 173 deployments / month
● Cloud 2.5 KEUR / month
Consumer products
O(10M) units
O(10G) / day
50K employees
10 BEUR revenue
All user-related
operational data
flows
The next 100x?
51
capability in X
# orgs
2016: 1600 000 000
datasets / day
There are companies 100x
ahead on these KPIs. Don't
you want that?
I don't believe you. We
are great by definition.
But we follow the
vendor's advice.
How hard can it be?
No, we need a data
mesh and a silver layer.
Oh.
We prefer
detailed control.
The first high-level, scalable orchestrator…
52
…has yet to be created
● Higher abstraction layers
○ Beyond datasets, jobs, pipelines
● It's a software engineering problem
○ Convenience blocks abstraction stacking
← Similar capabilities →
More convenience →
https://www.pracdata.io/p/state-of-workflow-orchestration-ecosystem-2025
I hope that I have contributed to
53
● Insights into paradigms' practical aspects
○ Latency / ops+productivity tradeoff
■ Microservices
■ Streaming
■ Functional data engineering
○ No software engineers: Data warehousing
● Awareness of data engineering subfields
○ Functional (Hadoop ecosys, software eng)
○ Data warehousing (ex BI development)
Want to
● Adopt functional data engineering?
● Aim for the next 100x?
Ping me.
● Courage to
○ follow your own path
■ We need innovation
■ Vendors seek revenue
○ use your skills wisely
■ Tech has major impact
■ European sovereignty?
■ Democracy?

All the DataOps, all the paradigms .

  • 1.
    www.scling.com All the DataOps, allthe paradigms Lars Albertsson, independent data engineer Berlin Buzzwords, 2025-06-17 1
  • 2.
    www.scling.com Berlin Buzzwords 2014 2 ●"Cutting Hadoop developer cycle time" ○ → "Democratising data @ Spotify"
  • 3.
    www.scling.com Enabling innovation 3 "The actualwork that went into Discover Weekly was very little, because we're reusing things we already had." https://youtu.be/A259Yo8hBRs https://youtu.be/ZcmJxli8WS8 https://musically.com/2018/08/08/daniel-ek-would-have-killed-discover-weekly-before-launch/ "Discover Weekly wasn't a great strategic plan and 100 engineers. It was 3 engineers that decided to build something." "I would have killed it. All of a sudden, they shipped it. It’s one of the most loved product features that we have." - Daniel Ek, CEO
  • 4.
  • 5.
    www.scling.com Myth: ● We areall doing quite ok ● 2-10x leader-to-rear span The great capability divide 5 capability in X # orgs capability in X # orgs Reality: ● Few leaders in each area ● 100-10000x leader-to-rear span
  • 6.
    www.scling.com The Model company ●The Spotify Model ● Thanks to Henrik Kniberg, Joakim Sundén, Viktor Cessan, … ● The Spotify Data Model ● Lacked great communicators ● This is not the Spotify Data Model, this is just a tribute. 6 https://youtu.be/Yvfz4HGtoPc
  • 7.
    www.scling.com 7 One-directional Bidirectional Offline, asyncintegration Online, sync integration Immutable Mutable Explicit metadata Implicit metadata Native, expressive components Specialised, limited components Unix philosophy Comprehensive components Reactive, pull-based Proactive, push-based Functional data engineering While there is value in the items on the right, we value the items on the left more Scandinavian style O F I M N U R = UNIFORM
  • 8.
    www.scling.com All the dataparadigms 8 Data warehousing Data dump Modern data warehouse Data vault Kimball Modern data stack Spreadsheet Data scientist in the corner Lakehouse Live database Functional data engineering Data lake Frozen lake Data products Data mesh Data hub Data contracts Notebooks Stored procedures Deployed notebooks Medallion Stream processing PubSub Unified log Service-oriented RPC Enterprise service bus Data fabric Data access layer Paradigm != product Lambda Beam
  • 9.
    www.scling.com 9 Live database Data warehousing Functionaldata engineering Data dump Stream processing Data products Service-oriented One-directional Bidirectional Offline, async integration Online, sync integration Immutable Mutable Explicit metadata Implicit metadata Native, expressive components Specialised, limited components Unix philosophy Comprehensive components Reactive, pull-based Proactive, push-based
  • 10.
    www.scling.com All the DataOps ●Create new job / service ○ Trivial in every blog post ● Roll out new version ○ Relate to state ● Recover from crash ○ Unavailable ○ No bad data produced ● Recover from faulty logic ○ Available ○ Bad data produced 10
  • 11.
    www.scling.com 11 Microservices ● Carefulrollout ● Risk of user impact ● Proactive QA Bidirectional vs unidirectional upgrade
  • 12.
    www.scling.com 12 Microservices ● Carefulrollout ● Risk of user impact ● Proactive QA Bidirectional vs unidirectional upgrade Streaming ● Swift rollout ● Parallel pipelines ● User impact, QA? Job Stream Stream Job Stream
  • 13.
    www.scling.com 13 Microservices ● Carefulrollout ● Risk of user impact ● Proactive QA Bidirectional vs unidirectional upgrade Data lake ● Instant rollout ● User impact later ● Reactive QA Streaming ● Swift rollout ● Parallel pipelines ● User impact, QA? Job Stream Stream Job Stream
  • 14.
    www.scling.com 14 Bidirectional vsunidirectional error recovery Microservices ● User impact ● Data corruption ● Cascading corruption ● Unbounded recovery
  • 15.
    www.scling.com 15 Bidirectional vsunidirectional error recovery Streaming ● Data corruption ● Downstream impact ● Bounded recovery Microservices ● User impact ● Data corruption ● Cascading corruption ● Unbounded recovery Job Stream Stream Job Stream
  • 16.
    www.scling.com 16 Bidirectional vsunidirectional error recovery Streaming ● Data corruption ● Downstream impact ● Bounded recovery Data lake ● Temporary data corruption ● Downstream impact ● Easy recovery Microservices ● User impact ● Data corruption ● Cascading corruption ● Unbounded recovery Job Stream Stream Job Stream
  • 17.
    www.scling.com ● Asynchronous operational dependencies ● Precompute- discover failures early 17 Service-oriented architectures Stream processing ● Strong operational dependencies ● Failure scenarios discovered late Data warehousing Functional data engineering Data dump Live database One-directional Bidirectional Offline, async integration Online, sync integration Immutable Mutable Explicit metadata Implicit metadata Native, expressive components Specialised, limited components Unix philosophy Comprehensive components Reactive, pull-based Proactive, push-based
  • 18.
    www.scling.com Separating offline andonline 18 Raw Fraud service Fraud model Orders Orders Replication / Backup Prudent procedures Prudent procedures Lightweight procedures ● QA driven by internal efficiency ● Continuous deployment ● New pipeline < 1 day ● Upgrade < 1 hour ● Bug recovery < 1 hour Careful handover Careful handover
  • 19.
    www.scling.com Mixing paradigms ● Tradeoff ○No single perfect paradigm ○ Borders pose operational risks ● Organic growth → accidental heterogeneity ● Early Hadoop adoption → accidental homogeneity 19 Service Service Service App App App DB Poll Queue Aggregate logs NFS Hourly dump Data warehouse ETL Kafka NFS scp DB HTTP
  • 20.
    www.scling.com Life of anerror, data lake 20 ● My processing job, bad code! 1. Revert serving datasets to old 2. Fix bug 3. Remove faulty datasets 4. Deploy 5. Backfill is automatic (Luigi) Done! ● Low cost of error ○ Reactive QA ○ Production environment sufficient
  • 21.
    www.scling.com Life of anerror, frozen lake 21 ● My processing job, bad code! 1. Revert serving datasets to old 2. Fix bug 3. Bump pipeline version 4. Deploy 5. Backfill is automatic (Luigi) Done! ● Low cost of error ○ Reactive QA ○ Production environment sufficient
  • 22.
    www.scling.com 22 Life ofan error, streaming ● Works for a single job, not pipeline. :-( Job Stream Stream Stream Stream Stream Stream Job Job Stream Stream Stream Job Job Job Reprocessing in Kafka Streams
  • 23.
    www.scling.com 23 Service-oriented architectures Stream processing Data warehousing Functional data engineering Datadump Live database ● Immutable entities ○ Partitioned by time ● One entity = immutable facts ○ collected during a period ○ state snapshot ● Entities (tables) updated by flows ● One entity = unbounded container of all similar records One-directional Bidirectional Offline, async integration Online, sync integration Immutable Mutable Explicit metadata Implicit metadata Native, expressive components Specialised, limited components Unix philosophy Comprehensive components Reactive, pull-based Proactive, push-based
  • 24.
    www.scling.com ● Extract -transform - load (ETL) write path ○ Updates to mutable tables ■ Not easily shared ○ Normalised model ■ Expensive to change ■ Carefully crafted, future-proof ● Denormalising read path ○ Interactive exploration ○ Dashboards ○ BI tools ○ (End user applications) ○ (Machine learning) Data warehousing 24 Data warehouse
  • 25.
    www.scling.com Mutable vs immutableETL ● Mutable tables ● Share & reuse? ○ Semantically challenging ■ Updates, partitions? ○ Human sync needed ● Immutable partitions ● Share & reuse ○ Semantics manageable by consumer ○ No human sync ● Partition addressing needed 25 Data warehouse Data lake
  • 26.
    www.scling.com Mutable vs immutableETL - bug recovery ● Mutable tables ● Entire tables become tainted ○ Recompute all history? ○ Case-specific partial recompute ● Immutable partitions ● Time partitions after bug become tainted ○ Traverse time-aware DAG and recompute ○ Toolable 26 Data warehouse Data lake
  • 27.
    www.scling.com ● Dremel /BigQuery ~2010 ● Extract - load - transform (ELT) read path ○ From (mutable?) raw tables ○ Use case decides model Modern data warehousing 27 Modern data warehouse ● Low human ops cost ○ Fast iterations ○ No mutable intermediate tables ● High compute cost
  • 28.
    www.scling.com 28 Service-oriented architectures Stream processing Data warehousing Functional data engineering Datadump Live database ● Metadata defined as explicit, separate code ○ Dependencies ○ Schemas ○ Management ○ Governance ● Tooling feasible ● Metadata generated by business logic ○ DB tables ○ JSON ● Or policy docs One-directional Bidirectional Offline, async integration Online, sync integration Immutable Mutable Explicit metadata Implicit metadata Native, expressive components Specialised, limited components Unix philosophy Comprehensive components Reactive, pull-based Proactive, push-based
  • 29.
    www.scling.com Separating fundamental &superficial challenges 29 Fundamental challenges = your business ● Click-through rate ● Sensor anomalies ● User registrations ● … Superficial challenges = your system ● Data collection delay ● Stream join sync mismatch ● Technical failures ● …
  • 30.
    www.scling.com Workflow orchestration -addressing data time space 30 Data warehousing: dependencies between tasks Functional data engineering; Dependencies between time partitions
  • 31.
    www.scling.com class Session(SparkSubmitTask): """Sessions endingor active during a particular hour.""" hour = DateHourParameter() window_size = IntParameter(default=4) jar = 'orderpipeline.jar' entry_class = 'com.example.shop.SessionJob' def requires(self): return [Click(hour=self.hour - offset)) for offset in range(self.window_size)] def output(self): return GCSTarget("gs://mybucket/prod/red/order_user/v1/" + f"{self.hour:year=%Y/month=%m/day=%d/hour=%H}") def app_options(self): return ["--clicks", ",".join( [req.output().path for req in self.requires]), "--output", self.output().path] DAG example, window (simplified) Click Session 31 ● Immutable, reproducible ● Free to consume by downstream ○ Without ops risk ○ Without human sync
  • 32.
    www.scling.com Flowing data timepartition management 32 Functional data engineering: ● Partitions defined in workflow ● Reproducible ● Addressable ● Predictable resources Data warehousing: ● All data? ● Arrival time field? ● Watermark table? ● Joins? Stream processing: ● Joins? ○ Other streams? ○ Tables? ● Resources determined by jitter Single data dump: ● Flow? Nah. Beam: ● What does Google do?
  • 33.
    www.scling.com History of workfloworchestration 33 First orchestrator scalable in: ● Logic complexity ● Parameter management ● DAG size ● Ops cost ● Domain-specific abstractions https://www.pracdata.io/p/state-of-workflow-orchestration-ecosystem-2025
  • 34.
    www.scling.com Schema definitions 34 { "type" :"record", "namespace" : "com.mapflat.example", "name" : "User", "fields" : [ { "name" : "id" , "type" : "int" }, { "name" : "name" , "type" : "string" }, { "name" : "age" , "type" : "int" }, { "name" : "phone" , "type" : ["null", "string"], "default": null } ] } ● RDBMS: Table metadata ● Avro format: JSON/DSL definition ○ Definition is bundled with avro data files ○ Reused by Parquet format ● pyschema / dataclass ● Scala case classes ● JSON-schema ● JSON: Each record ○ One record insufficient to deduce schema { "id": 1, "name": "Alice", "age": "34" } { "id": 1, "name": "Bob", "age": "42", "phone": "08-123456" } case class User(id: String, name: String, age: Int, phone: Option[String] = None) val users = Seq( User("1", "Alice", 32), User("2", "Bob", 43, Some("08-123456")))
  • 35.
    www.scling.com ● Expressive ● Customtypes ● Scalameta ● IDE support ● Avro for data lake storage Schema definition choice 35 ● RDBMS: Table metadata ● Avro: JSON/DSL definition ○ Definition is bundled with avro data files ○ Reused by Parquet format ● pyschema / dataclass ● Scala case classes ● JSON-schema ● JSON: Each record ○ One record insufficient to deduce schema case class User(id: String, name: String, age: Int, phone: Option[String] = None) val users = Seq( User("1", "Alice", 32), User("2", "Bob", 43, Some("08-123456")))
  • 36.
    Schema offspring
    case classes
    ● Test record difference render type classes
    ● Test equality type classes
    ● Avro definitions
    ● Java Avro codec classes
    ● Java <-> Scala converters
    ● Avro type annotations
    ● MySQL schemas
    ● CSV codecs
    ● Privacy by design machinery
    ● Python
    ● Logical types
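As an illustration of deriving "offspring" from a single schema definition, the sketch below generates an Avro record definition from a Python dataclass. This is not the Scala machinery the slide refers to; the helpers `avro_type` and `avro_schema` and the small type mapping are assumptions made for this example, and only a handful of primitive types are handled.

```python
from dataclasses import dataclass, fields
from typing import Optional, Union, get_args, get_origin

# Hypothetical mapping from Python types to Avro primitive type names.
AVRO_PRIMITIVES = {int: "int", str: "string", float: "double", bool: "boolean"}


@dataclass
class User:
    id: int
    name: str
    age: int
    phone: Optional[str] = None


def avro_type(py_type):
    """Translate a Python type annotation to an Avro type (primitives only)."""
    if get_origin(py_type) is Union:
        # Optional[X] is Union[X, None]; Avro expresses it as ["null", X].
        members = [a for a in get_args(py_type) if a is not type(None)]
        return ["null"] + [AVRO_PRIMITIVES[a] for a in members]
    return AVRO_PRIMITIVES[py_type]


def avro_schema(cls, namespace):
    """Build an Avro record definition (as a dict) from a dataclass."""
    avro_fields = []
    for f in fields(cls):
        entry = {"name": f.name, "type": avro_type(f.type)}
        if isinstance(entry["type"], list):
            entry["default"] = None  # nullable fields default to null
        avro_fields.append(entry)
    return {
        "type": "record",
        "namespace": namespace,
        "name": cls.__name__,
        "fields": avro_fields,
    }
```

Calling `avro_schema(User, "com.mapflat.example")` yields a dict with the same content as the Avro definition on the "Schema definitions" slide; other offspring such as CSV codecs could be generated from the same class in a similar way.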
  • 37.
    Service-oriented architectures
    Stream processing
    Data warehousing
    Functional data engineering
    Live database
    ● High-code - 3GL
      ○ Python, Scala, Java
    ● Embedded DSLs
      ○ Spark, Flink, ..
    ● Built for production
      ○ QA
      ○ DevEx
      ○ Quality mgmt
    ● "What can I do with data?"
    ● Special tools - data 4GL
      ○ Low code (SQL)
      ○ No code
    ● "What can I do with tool X?"
    One-directional | Bidirectional
    Offline, async integration | Online, sync integration
    Immutable | Mutable
    Explicit metadata | Implicit metadata
    Native, expressive components | Specialised, limited components
    Unix philosophy | Comprehensive components
    Reactive, pull-based | Proactive, push-based
  • 39.
    SQL for data processing
    ● SQL used in 3 distinct contexts
      ○ Interactive exploration
      ○ Backend data record retrieval
        ■ 25 years of injections
      ○ ETL data processing?
    Important data language features:
    ● Can express (complex) business logic
    ● Composability
    ● Reusability
    ● Testability
    ● Seamless integration with external logic
    ● Tools to guide towards good path
      ○ Type system
      ○ Inspection tools
    ● IDE experience
    ● Debuggability
    ● Data quality measurement support
    ● Data quality improvement support
    ● Learning curve
    https://threadreaderapp.com/thread/1353832649664692225.html
  • 40.
    Reporting master data management → SQL
    ● "MasterUser" - MDM of users
    ● "ReportingUser" - MasterUser + fiscal
    ● Convert ReportingUser to Hive?
      ○ Business logic too complex
      ○ No code reuse
      ○ Normalisation forced on consumers
      ○ No counters - sacrifice data quality
      ○ 3-5x performance loss
    2013: "We seem to be the largest company using Python for big data. That's a risky position. Let's find alternatives."
    2025: That did not age well.
  • 41.
    ● Wide scope components / assets
    ● Good interoperability
    ● Less control → ops risk
    Data vendor products
    Cloud (IaaS, PaaS)
    Data products
    Data lake
    Frozen lake
    Modern data warehouse
    Data vault
    Kimball
    Data fabric
    Data access layer
    Data mesh
    Data hub
    Data contracts
    One-directional | Bidirectional
    Offline, async integration | Online, sync integration
    Immutable | Mutable
    Explicit metadata | Implicit metadata
    Native, expressive components | Specialised, limited components
    Unix philosophy | Comprehensive components
    Reactive, pull-based | Proactive, push-based
    ● Do one thing well - small scope
    ● Enables evolution
    ● Some features not available in OSS
      ○ Data monitoring
  • 42.
    Unix philosophy example
    ● Small programs that do one thing well.
    ● Architecture for two-way decisions
    ● Data pipeline deployment evolution
      ○ Spotify 2014 - 2018
        1. Self-contained jar file. Ad-hoc continuous deploy flow.
        2. Docker container on VM pool
        3. Docker container on Kubernetes
    [Diagram labels: logs, CI, dev env, o11y]
  • 43.
    Separation of computation and integration
    Computation
    ● Fails on
      ○ New data + code combination
      ○ Static resources
    ● Deterministic, reproducible
      ○ No side effects
    Integration
    ● Fails on
      ○ Configuration
      ○ Dynamic resources
      ○ Bad cloud weather
    ● Non-deterministic
      ○ Side effects
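One way to picture this separation is a sketch in which the computation is a pure function and a thin wrapper handles all I/O. The names `count_by_country` and `run_job`, the `country` field, and the JSON-lines file layout are hypothetical and chosen only for illustration.

```python
import json
from pathlib import Path


def count_by_country(users):
    """Computation: pure and deterministic, no side effects.

    Can only fail on a new combination of data and code.
    """
    counts = {}
    for user in users:
        country = user["country"]
        counts[country] = counts.get(country, 0) + 1
    return counts


def run_job(input_path: Path, output_path: Path) -> None:
    """Integration: all I/O lives here.

    Can fail on configuration, missing files, or unavailable storage.
    """
    lines = input_path.read_text().splitlines()
    users = [json.loads(line) for line in lines if line.strip()]
    output_path.write_text(json.dumps(count_by_country(users), sort_keys=True))
```

With this split, `count_by_country` can be unit tested with in-memory records, while `run_job` is the only part that needs retries or environment-specific handling.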
  • 44.
    ● One size fits all
    ● Data producer in control
    Data products
    Modern data warehouse
    Data vault
    Kimball
    Data fabric
    Data access layer
    Data mesh
    Data hub
    Data contracts
    One-directional | Bidirectional
    Offline, async integration | Online, sync integration
    Immutable | Mutable
    Explicit metadata | Implicit metadata
    Native, expressive components | Specialised, limited components
    Unix philosophy | Comprehensive components
    Reactive, pull-based | Proactive, push-based
    ● Driven by use cases
    ● Data consumer in control
    Stream processing
    Functional data engineering
    Bronze & gold layers
    Silver layer
  • 45.
  • 46.
    Artisanal vs industrialised knowledge graphs
    Artisanal:
    ● Create single shared graph
    ● Used for many use cases
    ● Innovate fast graph → use case
    Industrial:
    ● Create graph for each use case
    ● Reuse code that produces graph
    ● Each graph may be unique
    ● Innovate fast raw → graph → use case
  • 47.
    Artisanal vs industrialised machine learning models
    Google MLOps maturity model:
    ● MLOps level 0: Manual process
    ● MLOps level 1: ML pipeline automation
    ● MLOps level 2: CI/CD pipeline automation
    https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning
  • 48.
    Premature modelling is waste
    Yes, on purpose!
    ● Power: Recompute model quickly
    ● Lifted limitation: Expensive to compute model
    ● Old rule: Careful manual modelling work
    ● New rules: Guard rails preventing model iteration from breaking downstream
      ○ Code QA = testing
      ○ Code + data QA = monitoring
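A minimal sketch of a "code + data QA = monitoring" guard rail, assuming the job emits counters and a check compares them with a threshold. The function names, counter names, and the 1% default threshold are hypothetical.

```python
def parse_ages(records):
    """Parse the age field, counting valid and invalid records."""
    counters = {"read": 0, "valid": 0, "invalid_age": 0}
    valid = []
    for record in records:
        counters["read"] += 1
        try:
            age = int(record["age"])
        except (KeyError, TypeError, ValueError):
            # Missing, null, or non-numeric age: count it and skip the record.
            counters["invalid_age"] += 1
            continue
        counters["valid"] += 1
        valid.append({**record, "age": age})
    return valid, counters


def within_guard_rail(counters, max_invalid_ratio=0.01):
    """Return True if the share of invalid records is acceptable."""
    if counters["read"] == 0:
        return False  # an empty input is treated as suspicious
    return counters["invalid_age"] / counters["read"] <= max_invalid_ratio
```

A unit test of `parse_ages` covers the code QA side, while evaluating `within_guard_rail` on the counters from each production run covers the combination of code and data.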
  • 49.
    All the data paradigms
    Paradigm != product
    Data warehousing
    Data dump
    Modern data warehouse
    Data vault
    Kimball
    Modern data stack
    Spreadsheet
    Data scientist in the corner
    Lakehouse
    Live database
    Functional data engineering
    Data lake
    Frozen lake
    Data products
    Data mesh
    Data hub
    Data contracts
    Notebooks
    Stored procedures
    Deployed notebooks
    Medallion
    Stream processing
    PubSub
    Unified log
    Service-oriented RPC
    Enterprise service bus
    Data fabric
    Data access layer
    Lambda
    Beam
  • 50.
    Functional data engineering @ enterprise context
    Consumer products
    ● O(10M) units
    ● O(10G) / day
    ● 50K employees
    ● 10 BEUR revenue
    ● All user-related operational data flows
    1.5 persons, 3 years
    ● 162 pipelines
    ● 700 datasets / day*
    ● 4 new pipelines / month
    ● 80 commits / month*
    ● 35 deployments / month*
    ● 40 KLOC pipeline code
    ● 20 KLOC tests
    ● 1.5 KLOC Terraform
    ● 8 Kubernetes clusters
    ● 10K pods / day
    ● 4 regions AWS / Azure
    ● Cloud: 17 KEUR / month*
    ● Cloud + dev + ops (TCO): 300 EUR / pipeline / month
    *As-a-service (2 devs):
    ● 3700 datasets / day
    ● 275 commits / month
    ● 173 deployments / month
    ● Cloud 2.5 KEUR / month
  • 51.
    The next 100x?
    [Chart: # orgs vs capability in X]
    2016: 1 600 000 000 datasets / day
    There are companies 100x ahead on these KPIs. Don't you want that?
    ● I don't believe you.
    ● We are great by definition.
    ● But we follow the vendor's advice.
    ● How hard can it be?
    ● No, we need a data mesh and a silver layer.
    ● Oh. We prefer detailed control.
  • 52.
    The first high-level, scalable orchestrator…
    …has yet to be created
    ● Higher abstraction layers
      ○ Beyond datasets, jobs, pipelines
    ● It's a software engineering problem
      ○ Convenience blocks abstraction stacking
    [Chart axes: ← Similar capabilities →, More convenience →]
    https://www.pracdata.io/p/state-of-workflow-orchestration-ecosystem-2025
  • 54.
    I hope that I have contributed to
    ● Insights into paradigms' practical aspects
      ○ Latency / ops+productivity tradeoff
        ■ Microservices
        ■ Streaming
        ■ Functional data engineering
      ○ No software engineers: Data warehousing
    ● Awareness of data engineering subfields
      ○ Functional (Hadoop ecosys, software eng)
      ○ Data warehousing (ex BI development)
    ● Courage to
      ○ follow your own path
        ■ We need innovation
        ■ Vendors seek revenue
      ○ use your skills wisely
        ■ Tech has major impact
        ■ European sovereignty?
        ■ Democracy?
    Want to
    ● Adopt functional data engineering?
    ● Aim for the next 100x?
    Ping me.