www.scling.com
End-to-end pipeline agility
Berlin Buzzwords, 2024-06-10
Lars Albertsson
Scling - data factory as a service
The great capability divide
Myth:
● We are all doing quite ok
● 2-10x leader-to-rear span
Reality:
● Few leaders in each area
● 100-10000x leader-to-rear span
[Figure: distribution of # orgs over capability in X, myth vs reality]
Capability KPIs
DORA research / State of DevOps report:
● Deployment frequency
● Lead time for changes
● Change failure rate
● Time to restore service
Small elite
~1000x span
Observed differences in data organisations:
● Lead time from idea to production
● Time to mend / change pipeline
● Number of pipelines / developer
● Number of datasets / day / developer
Small elite
100 - 10000x span (or more)
Efficiency gap, data cost & value
● Data processing produces datasets
○ Each dataset has business value
● Proxy value/cost metric: datasets / day
○ S-M traditional: < 10
○ Bank, telecom, media: 100-1000
2014: 6500 datasets / day
2016: 20000 datasets / day
2018: 100000+ datasets / day, 25% of staff use BigQuery
2021: 500B events collected / day
2016: 1 600 000 000 datasets / day
[Figure: value vs effort. Financial, reporting; Insights, data-fed features; Disruptive value of data, machine learning]
Enabling innovation
"The actual work that went into Discover Weekly was very little, because we're reusing things we already had."
https://youtu.be/A259Yo8hBRs
https://youtu.be/ZcmJxli8WS8
https://musically.com/2018/08/08/daniel-ek-would-have-killed-discover-weekly-before-launch/
"Discover Weekly wasn't a great strategic plan and 100 engineers. It was 3 engineers that decided to build something."
"I would have killed it. All of a sudden, they shipped it. It’s one of the most loved product features that we have."
- Daniel Ek, CEO
Enterprise innovation
Swedish BigCorp 1:
Before we could build this internal data tool, we had to submit an application and get it approved by a committee at the headquarters.
Swedish BigCorp 2:
Only one committee? Hold my beer.
Swedish municipality:
We started the AI project with a design phase for a few months, where we did not write any code!
Data factory track record
Organisation | Time to first flow | Staff size | 1st flow effort, weeks | 1st flow cost SEK | Time to innovation | # Datasets / day after 1y | # Flows after 1y
Spotify (new gen) | weeks | ~30-50 | 60? | 2M | - | 10000s | 100s
Media | 1+ years | 10-30 | 1500? | 100M (0.5-1B) | 1+ year | ~100 | ~10
Finance | 2 years | 10-50 | 2000? | 100M? | Years | < 100 | < 10
Media | 3 weeks | 4.5 - 8 | 15 | 750K | 3 months | ~2000 | 30
Retail | 7 weeks | 1-3 | 7 | 500K | 6 months | ~3500 | 70
Telecom | 12 weeks | 2-5 | 30 | 1500K | 6 months | ~500 | 50
Consumer products | 20+ weeks | 1.5 | 30+ | 1200+K | 6+ months | ~200 | 20
Construction | 8 weeks | 0.5 | 4 | 150K | 7 months | 10-100 * | 10
Manufacturing | 8 weeks | 0.5 | 4 | 200K | 6 months | ? | ?
* External bottlenecks
Data agility
● Siloed: 6+ months
Cultural work
● Autonomous: 1 month
Technical work
● Coordinated: days
[Figure: data lake with changes (∆). Latency?]
Workflow orchestration - key to DataOps success
● Oozie (~2007)
● Luigi (2010 / 2012)
○ Asset-based
● Airflow, Pinball (2015)
○ Task-based
● Dagster (asset-based), Prefect, Argo, …
● Data assets (Target)
● Jobs that build assets (Task)
● Data sensors (ExternalTask)
● Asset & job parameters
● Job dependencies (requires())
● (State)
● Data management as code
○ No ClickOps
● Simple, debuggable tools
● Industrial, not craft. Work with process, not with data.
● Build higher abstraction layers
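The asset-based concepts listed above (assets, jobs that build them, parameters, dependencies) can be sketched without any orchestration library. This is a minimal illustration in plain Python, not Luigi's implementation; the class names `RawOrders` and `OrderCount`, the helper `write`, and the `build` function are hypothetical.

```python
import os
import tempfile
from datetime import date


def write(path, text):
    """Write via a temp file and rename, so an asset is either absent or complete."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(text)
    os.replace(tmp, path)


class Task:
    """A job that builds exactly one data asset, identified by its parameters."""

    def __init__(self, root, day):
        self.root = root
        self.day = day

    def output(self):
        # The asset location encodes the job name and its parameters.
        return os.path.join(self.root, type(self).__name__, f"{self.day.isoformat()}.txt")

    def requires(self):
        return []

    def run(self):
        raise NotImplementedError

    def complete(self):
        return os.path.exists(self.output())


class RawOrders(Task):
    """Stand-in for ingested data; a real pipeline would use a data sensor here."""

    def run(self):
        write(self.output(), "order-1\norder-2\n")


class OrderCount(Task):
    def requires(self):
        return [RawOrders(self.root, self.day)]

    def run(self):
        with open(self.requires()[0].output()) as f:
            count = sum(1 for _ in f)
        write(self.output(), f"{count}\n")


def build(task, log):
    """Build missing upstream assets first, then the task itself. Existing assets are skipped."""
    if task.complete():
        return
    for dep in task.requires():
        build(dep, log)
    task.run()
    log.append(f"{type(task).__name__}({task.day})")


root = tempfile.mkdtemp()
day = date(2024, 6, 10)
first, second = [], []
build(OrderCount(root, day), first)   # builds RawOrders, then OrderCount
build(OrderCount(root, day), second)  # no-op: the asset already exists
```

Because completeness is derived from the existence of the asset, rerunning the same build is harmless, which is what makes retries and overlapping schedules safe in this style.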
On-prem pipeline deployment pipeline
source repo: Luigi DSL, jars, config
Standard deployment artifact: my-pipe-7.tar.gz
Standard artifact store
All that a pipeline needs, installed atomically
> pip install my-pipe-7.tar.gz
Redundant cron schedule, higher frequency
10 * * * * luigi --module mymodule MyDaily
[Figure: Luigi daemon and workers]
Cloud native deployment
source repo: Luigi DSL, jars, config
Docker image: my-pipe:7
Docker registry
Redundant cron schedule, higher frequency
kind: CronJob
spec:
  schedule: "10 * * * *"
  command: "luigi --module mymodule MyDaily"
[Figure: Luigi daemon and workers, S3 / GCS, Dataproc / EMR]
Simpler cloud native deployment
source repo: Luigi DSL, jars, config
Docker image: my-pipe:7
Docker registry
Redundant cron schedule, higher frequency
kind: CronJob
spec:
  schedule: "10 * * * *"
  command: "luigi --module mymodule MyDaily"
spark-submit --master=local
[Figure: Luigi daemon and workers, S3 / GCS]
Potential test scopes
● Unit/component
● Single job
● Multiple jobs
● Pipeline, including service
● Full system, including client
Choose stable interfaces
Each scope has a cost
[Figure: pipeline diagram with jobs, storage, a service, and an app]
Recommended test scopes
● Single job
● Multiple jobs
● Pipeline, including service
[Figure: pipeline diagram with jobs, storage, a service, and an app]
Testing single batch job
Standard Scalatest/JUnit harness
1. Generate input: f() → file://test_input/
2. Run job in local mode
3. Verify output: file://test_output/ → p()
Runs well in CI / from IDE
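The three steps (generate input, run in local mode, verify output) can be sketched as follows. The slide refers to a Scalatest/JUnit harness; this is a plain Python stand-in, and the job `dedup_job` plus the helpers `write_input` and `read_output` are hypothetical examples rather than code from the talk.

```python
import json
import os
import tempfile


def dedup_job(input_dir, output_dir):
    """Hypothetical batch job: keep the first record seen for each id."""
    seen = set()
    os.makedirs(output_dir, exist_ok=True)
    with open(os.path.join(output_dir, "part-0.json"), "w") as out:
        for name in sorted(os.listdir(input_dir)):
            with open(os.path.join(input_dir, name)) as f:
                for line in f:
                    record = json.loads(line)
                    if record["id"] not in seen:
                        seen.add(record["id"])
                        out.write(json.dumps(record) + "\n")


def write_input(input_dir, records):
    """Step 1: generate input files from in-memory records."""
    os.makedirs(input_dir, exist_ok=True)
    with open(os.path.join(input_dir, "part-0.json"), "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_output(output_dir):
    """Step 3 helper: load all output records for verification."""
    records = []
    for name in sorted(os.listdir(output_dir)):
        with open(os.path.join(output_dir, name)) as f:
            records.extend(json.loads(line) for line in f)
    return records


def run_dedup_test():
    with tempfile.TemporaryDirectory() as tmp:
        input_dir = os.path.join(tmp, "test_input")
        output_dir = os.path.join(tmp, "test_output")
        write_input(input_dir, [{"id": "a", "v": 1}, {"id": "a", "v": 2}, {"id": "b", "v": 3}])
        dedup_job(input_dir, output_dir)  # Step 2: run the job locally on files
        return read_output(output_dir)


result = run_dedup_test()
assert result == [{"id": "a", "v": 1}, {"id": "b", "v": 3}]
```

The job only sees directories, so the same test runs unchanged in CI and from an IDE.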
Cognitive waste
● Why do we have 25 time formats?
○ ISO 8601, UTC assumed
○ ISO 8601 + timezone
○ Millis since epoch, UTC
○ Nanos since epoch, UTC
○ Millis since epoch, user local time
○ …
○ Float of seconds since epoch, as string.
WTF?!?
● my-kafka-topic-name, your_topic_name
"I don't know what will break downstream"
paralyses many data organisations.
Testing batch pipelines - two options
A: Test job with sequence of jobs
Standard Scalatest harness
1. Generate input: f() → file://test_input/
2. Run custom multi-job
3. Verify output: file://test_output/ → p()
+ Runs in CI
+ Runs in IDE
+ Quick setup
- Multi-job maintenance
B: Customised workflow manager setup
+ Tests workflow logic
+ More authentic
- Workflow mgr setup for testability
- Difficult to debug
- Dataset handling with Python
● Both can be extended with ingress (Kafka), egress DBs
Warped pipeline testing
file://test_input/ file://test_output/
Workflow manager DAG "warp" rewrite
1. Generate input 3. Verify output
f() p()
Stream
Stream
Production
What about a layer of indirection?
file://test_data/
Testing?
Prod
Data access layer
- In conflict with autonomous culture
- Adds complexity to experimental cycle
- In conflict with data component ecosystems
+ Explicit
+ Debuggable
Aspect-oriented testing
Aka hideous monkey patching.
Packaged in prod image.
Triggered by environment variable.
Rules for DAG transformation
Monkey patching extracts a price
Better than indirection? Depends on cultural cost of coordination.
Our plans ahead: Keep warp zone, replace monkey patching with layer of indirection
- Implicit
- Difficult to debug framework
- Disharmony with higher workflow abstractions
+ Minimal code adaptation
+ Can test pipelines owned by other teams
+ Seams in third-party components
Great value from full pipeline testing!
Performance is essential. No, not that performance.
● Bitten by bug in org.apache.spark.sql.KeyValueGroupedDataset.flatMapGroups().
○ Method is covered by one test case. WTF?
● Spark startup is slow. 10-30 seconds.
● What about plain Scala collections + minimal data DSL wrapper?
○ Up to 10x faster on join benchmark for < 10 GB data.
○ 10x faster in production on small datasets.
○ 10x faster tests.
○ 90% saved CPU resources.
○ Easier to program + debug.
● Performance dimensions that actually matter for agility & value creation.
What conclusion from this graph?
COVID-19 fatalities / day in Sweden
Fatalities collected during 2 days
Fatalities collected during 4 days
Fatalities collected during 10 days
Normalise data collection to compare
Graph by Adam Altmejd, @adamaltmejd
Data flow / ops paradigms
[Figure: paradigms placed between "Mutable, Object-oriented, Exclusive" and "Immutable, Functional, Democratised": Microservices, Shared DBs, Data warehousing, Modern data warehousing, Data lake, Frozen lake, Stream processing, Lakehouse]
Mutate current state
Build current state from immutable events + dumps
Address cumulative state at arbitrary time
Incompleteness recovery
Fast data or complete data
Delay: 0
Delay: 4
Delay: 12
Life of an error, batch pipelines
● My processing job, bad code!
1. Revert serving datasets to old
2. Fix bug
3. Remove faulty datasets
4. Deploy
5. Backfill is automatic (Luigi)
Done!
● Low cost of error
○ Reactive QA
○ Production environment sufficient
Life of an error, frozen lake
● My processing job, bad code!
1. Revert serving datasets to old
2. Fix bug
3. Bump pipeline version
4. Deploy
5. Backfill is automatic (Luigi)
Done!
● Low cost of error
○ Reactive QA
○ Production environment sufficient
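One way to read "Bump pipeline version" followed by "Backfill is automatic" is that the version is part of each dataset's location, so a new version makes every dataset look missing while the old ones stay untouched. The sketch below assumes that layout; `dataset_path`, `missing_days`, and `backfill` are hypothetical names, and the orchestrator's backfill is reduced to a simple loop.

```python
import os
import tempfile
from datetime import date, timedelta


def dataset_path(root, name, version, day):
    # Assumption: the pipeline version is encoded in the dataset path.
    return os.path.join(root, name, f"v{version}", f"{day.isoformat()}.txt")


def missing_days(root, name, version, start, end):
    """List the days in [start, end] that have no dataset for this version."""
    days = []
    day = start
    while day <= end:
        if not os.path.exists(dataset_path(root, name, version, day)):
            days.append(day)
        day += timedelta(days=1)
    return days


def backfill(root, name, version, start, end, compute):
    """Build every missing dataset; existing (immutable) datasets are never rewritten."""
    built = missing_days(root, name, version, start, end)
    for day in built:
        path = dataset_path(root, name, version, day)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            f.write(compute(day))
    return built


root = tempfile.mkdtemp()
start, end = date(2024, 6, 1), date(2024, 6, 3)

v1_built = backfill(root, "score", 1, start, end, lambda d: "buggy\n")
v1_again = backfill(root, "score", 1, start, end, lambda d: "buggy\n")  # nothing to do
v2_built = backfill(root, "score", 2, start, end, lambda d: "fixed\n")  # version bump rebuilds all
```

Nothing is deleted in this variant: the faulty version 1 datasets remain available for comparison while consumers switch to version 2.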
Life of a change, batch pipelines
● Forgiving environment
○ Machine errors
○ Human errors
● Friendly to experiments
○ "Dark pipelines" run in parallel
● Operationally efficient
○ Separate dev, test, staging environments not necessary
○ Self-healing
∆?
Dynamic version rollout
CredibilityScore
FraudCandidate
Order
Dynamic DAGs
CredibilityScore
FraudCandidate
Order
Schema on write
● Schema defined by writer
● Destination (table / dataset / stream topic) has defined schema
○ Technical definition with metadata (e.g. RDBMS, Kafka + registry)
○ By convention
● Writes not in compliance are not accepted
○ Technically aborted (e.g. RDBMS)
○ In violation of intent (e.g. HDFS datasets)
● Can be technically enforced by producer driver
○ Through ORM / code generation
○ Schema registry lookup
Strict checking philosophy
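A minimal sketch of the strict philosophy: the destination owns the schema and rejects any write that does not comply. The `Destination` class and the `orders` example are hypothetical; a real system would enforce this in the database, a schema registry, or generated producer code.

```python
class SchemaViolation(ValueError):
    """Raised when a record does not comply with the destination schema."""


class Destination:
    """A table / dataset / topic with a defined schema (field name -> required type)."""

    def __init__(self, name, schema):
        self.name = name
        self.schema = schema
        self.records = []

    def write(self, record):
        missing = self.schema.keys() - record.keys()
        unknown = record.keys() - self.schema.keys()
        if missing or unknown:
            raise SchemaViolation(
                f"{self.name}: missing={sorted(missing)} unknown={sorted(unknown)}")
        for field, expected in self.schema.items():
            if not isinstance(record[field], expected):
                raise SchemaViolation(
                    f"{self.name}.{field}: expected {expected.__name__}, "
                    f"got {type(record[field]).__name__}")
        self.records.append(record)  # only compliant records are accepted


orders = Destination("orders", {"order_id": str, "amount_cents": int})
orders.write({"order_id": "o1", "amount_cents": 1250})

try:
    orders.write({"order_id": "o2", "amount_cents": "12.50"})  # wrong type
    rejected = False
except SchemaViolation:
    rejected = True
```

The violation surfaces at the producer, at write time, so the team that owns the producing code is the one that sees the error.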
Schema on read
● Anything (technically) accepted when writing
● Schema defined by reader, at consumption
○ Reader may impose requirements on type & value
● In dynamic languages, fields propagate implicitly
○ E-shopping example:
i. Join order + customer.
ii. Add device_type to order schema
iii. device_type becomes available in downstream datasets
● Violations of constraints are detected at read
○ Perhaps long after production?
○ By team not owning producer?
Loose checking philosophy
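The e-shopping example can be sketched with plain dictionaries. The join never names `device_type`, yet the field reaches downstream datasets once the producer adds it, and a reader that requires it only fails when it consumes older data. The function and field names other than `device_type` are hypothetical.

```python
def join_orders_customers(orders, customers):
    """Join order + customer; every field of both records propagates implicitly."""
    by_id = {c["customer_id"]: c for c in customers}
    return [{**o, **by_id[o["customer_id"]]}
            for o in orders if o["customer_id"] in by_id]


def device_report(rows):
    """The reader imposes its requirement at consumption: device_type must be a string."""
    return [row["device_type"].upper() for row in rows]


customers = [{"customer_id": "c1", "country": "SE"}]
orders_v1 = [{"order_id": "o1", "customer_id": "c1"}]
orders_v2 = [{"order_id": "o2", "customer_id": "c1", "device_type": "mobile"}]

joined_v1 = join_orders_customers(orders_v1, customers)
joined_v2 = join_orders_customers(orders_v2, customers)  # device_type appears downstream

report = device_report(joined_v2)

try:
    device_report(joined_v1)  # older data lacks the field
    failed_at_read = False
except KeyError:
    failed_at_read = True  # detected at read, possibly by a team not owning the producer
```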
Schema on read or write?
[Figure: services with DBs, export, business intelligence]
Change agility important here
Production stability important here
Schema definition choice
● Expressive
● Custom types
● IDE support
● Avro for data lake storage
● RDBMS: Table metadata
● Avro: JSON/DSL definition
○ Definition is bundled with avro data files
● Parquet
● pyschema / dataclass
● Scala case classes
● JSON-schema
● JSON: Each record
○ One record insufficient to deduce schema
case class User(id: String, name: String, age: Int,
phone: Option[String] = None)
val users = Seq( User("1", "Alice", 32),
User("2", "Bob", 43, Some("08-123456")))
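For comparison with the "pyschema / dataclass" option in the list, the same schema could be expressed as a Python dataclass. This is an illustrative counterpart to the Scala case class, not code from the talk.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class User:
    id: str
    name: str
    age: int
    phone: Optional[str] = None  # optional field with a default, like Option[String] = None


users = [User("1", "Alice", 32), User("2", "Bob", 43, "08-123456")]

# The schema is available for introspection, e.g. to derive other representations.
field_names = [f.name for f in fields(User)]
```

Note that, unlike Scala case classes, dataclass type annotations are not checked at runtime, so a type checker or explicit validation is needed to get comparable guarantees.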
Schema offspring
case classes →
● Test record difference render type classes
● test equality type classes
● Avro definitions
● Java Avro codec classes
● Java <-> Scala converters
● Avro type annotations
● MySQL schemas
● CSV codecs
● Privacy by design machinery
● Python
● Logical types
Scalameta: schema → syntax tree
@PrivacyShielded
case class SaleTransaction(
  @PersonalId customerClubId: Option[String],
  @PersonalData storeId: Option[String],
  item: Option[String],
  timestamp: String
)
Defn.Class(
  List(Mod.Annot(Init(Type.Name("PrivacyShielded"), , List())), case),
  Type.Name("SaleTransaction"),
  List(),
  Ctor.Primary(
    List(),
    ,
    List(
      List(
        Term.Param(
          List(Mod.Annot(Init(Type.Name("PersonalId"), , List()))),
          Term.Name("customerClubId"),
          Some(Type.Apply(Type.Name("Option"), List(Type.Name("String")))),
          None
        ),
        Term.Param(
          List(Mod.Annot(Init(Type.Name("PersonalData"), , List()))),
          Term.Name("storeId"),
          Some(Type.Apply(Type.Name("Option"), List(Type.Name("String")))),
          None
        ),
        Term.Param(
          List(),
          Term.Name("item"),
          Some(Type.Apply(Type.Name("Option"), List(Type.Name("String")))),
          None
        ),
        Term.Param(List(), Term.Name("timestamp"), Some(Type.Name("String")), None)
      )
    )
  ),
  Template(List(), List(), Self(, None), List()))
Scalameta use cases
● Scalafmt
● Scalafix
○ Static analysis
○ Code transformation
● Online code generation - macros
● Offline code generation
// Example from scio 0.7 -> 0.8 upgrade rules
final class FixTensorflow extends SemanticRule("FixTensorflow") {
  override def fix(implicit doc: SemanticDocument): Patch =
    doc.tree.collect {
      case t @ Term.Select(s, Term.Name("saveAsTfExampleFile")) =>
        Patch.replaceTree(t, q"$s.saveAsTfRecordFile".syntax)
    }.asPatch
}
Test equality
case classes →
● Test record difference render type classes
● test equality type classes
Python + RDBMS
[Figure: case classes, Avro type annotations, Avro definitions, Python, MySQL schemas]
case class User(
  age: Int,
  @AvroProp("sqlType", "varchar(1012)")
  phone: Option[String] = None)
{
  "type": "record",
  "name": "User",
  "fields": [
    { "name": "age", "type": "int" },
    { "name": "phone",
      "type": [ "null", "string" ],
      "sqlType": "varchar(1012)" }
  ]
}
class UserEgressJob(CopyToTable):
    columns = [
        ("age", "int"),
        ("phone", "varchar(1012)"),
    ]
    ...
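The step from an Avro definition to SQL columns can be sketched as a small function that reads the custom `sqlType` property when present. The fallback mapping `DEFAULT_SQL_TYPES` and the function `sql_columns` are assumptions for illustration; the talk does not show how its generator is implemented.

```python
import json

AVRO_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "age", "type": "int"},
    {"name": "phone", "type": ["null", "string"], "sqlType": "varchar(1012)"}
  ]
}
""")

# Assumed fallback mapping for fields without an explicit sqlType property.
DEFAULT_SQL_TYPES = {
    "int": "int",
    "long": "bigint",
    "string": "text",
    "boolean": "boolean",
    "double": "double",
}


def sql_columns(schema):
    """Derive (column name, SQL type) pairs from an Avro record schema."""
    columns = []
    for field in schema["fields"]:
        avro_type = field["type"]
        if isinstance(avro_type, list):
            # Union such as ["null", "string"]: use the first non-null branch.
            avro_type = [t for t in avro_type if t != "null"][0]
        sql_type = field.get("sqlType") or DEFAULT_SQL_TYPES[avro_type]
        columns.append((field["name"], sql_type))
    return columns


columns = sql_columns(AVRO_SCHEMA)
```

The resulting list has the same shape as the `columns` attribute in the egress job, so it could be generated rather than maintained by hand.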
Logical types
case classes → Logical types
case t"""Instant""" =>
  JObject(List(JField("type", JString("long")),
    JField("logicalType", JString("timestamp-micros"))))
case t"""LocalDate""" =>
  JObject(List(JField("type", JString("int")),
    JField("logicalType", JString("date"))))
case t"""YearMonth""" => JObject(List(JField("type", JString("int"))))
case t"""JObject""" => JString("string")
● Avro logical types
○ E.g. date → int, timestamp → long
○ Default is timestamp-millis
■ Great for year > 294441 (!)
● Custom logical types
○ Time
○ Collections
○ Physical
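The physical encodings behind the two standard logical types used above are simple: Avro `date` is an int counting days since 1970-01-01, and `timestamp-micros` is a long counting microseconds since the epoch in UTC. A small sketch of both conversions, with hypothetical function names:

```python
from datetime import date, datetime, timedelta, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH_INSTANT = datetime(1970, 1, 1, tzinfo=timezone.utc)


def date_to_avro(d):
    """Avro logical type 'date': days since the epoch, stored as int."""
    return (d - EPOCH_DATE).days


def avro_to_date(days):
    return EPOCH_DATE + timedelta(days=days)


def instant_to_avro_micros(ts):
    """Avro logical type 'timestamp-micros': microseconds since the epoch, stored as long.

    Expects a timezone-aware datetime. Integer arithmetic avoids float rounding.
    """
    delta = ts - EPOCH_INSTANT
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


def avro_micros_to_instant(micros):
    return EPOCH_INSTANT + timedelta(microseconds=micros)


sample_date = date(2024, 6, 10)
sample_instant = datetime(2024, 6, 10, 12, 30, 0, 123456, tzinfo=timezone.utc)

encoded_date = date_to_avro(sample_date)
encoded_instant = instant_to_avro_micros(sample_instant)
```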
Schema on read or write?
[Figure: services with DBs, export, business intelligence]
Change agility important here
Production stability important here
Hydration boilerplate
Chimney - case class transformer
● Commonality in schema classes
○ Copy + a few more fields
○ Drop fields
● Statically typed
○ Forgot a field - error
○ Wrong type - error
Chimney in real code
Ingest prepared for deletion
[Figure: Landing pond, Private pond, Data lake, Cold store. Labels: Mutation; Append + delete; Immutable, limited retention]
Lost key pattern
● PII fields encrypted
● Per-user decryption key table
● Clear single user key => oblivion
- Extra join + decrypt
- Decryption (user) id needed
+ Multi-field oblivion
+ Single dataset leak → no PII leak
+ Handles transformed PII fields
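The lost key pattern can be sketched in a few lines: PII fields are encrypted with a per-user key, and clearing that key makes every copy of the user's PII unreadable. The cipher below is a toy XOR keystream built from SHA-256, used only to keep the sketch dependency-free; it is not secure, and a real implementation would use an authenticated cipher from a vetted library. All function names and the in-memory `key_table` are hypothetical.

```python
import hashlib
import secrets


def _keystream(key, nonce, length):
    """TOY keystream for illustration only. Do not use for real data."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def encrypt(key, plaintext):
    nonce = secrets.token_bytes(16)
    data = plaintext.encode("utf-8")
    stream = _keystream(key, nonce, len(data))
    return nonce + bytes(a ^ b for a, b in zip(data, stream))


def decrypt(key, blob):
    nonce, data = blob[:16], blob[16:]
    stream = _keystream(key, nonce, len(data))
    return bytes(a ^ b for a, b in zip(data, stream)).decode("utf-8")


key_table = {}  # per-user decryption key table: user id -> key


def shield(user_id, record, pii_fields):
    """Encrypt the PII fields of a record with the user's key, creating the key if needed."""
    key = key_table.setdefault(user_id, secrets.token_bytes(32))
    return {k: (encrypt(key, v) if k in pii_fields else v) for k, v in record.items()}


def expose(user_id, shielded, pii_fields):
    """Decrypt PII fields; returns None if the user's key has been cleared."""
    key = key_table.get(user_id)
    if key is None:
        return None
    return {k: (decrypt(key, v) if k in pii_fields else v) for k, v in shielded.items()}


def forget(user_id):
    """Clear a single user key => oblivion for all of that user's encrypted fields."""
    key_table.pop(user_id, None)


sale = {"customer_email": "alice@example.com", "item": "book"}
shielded = shield("user-1", sale, {"customer_email"})
before = expose("user-1", shielded, {"customer_email"})
forget("user-1")
after = expose("user-1", shielded, {"customer_email"})
```

The shielded record still carries the non-PII field `item` in clear text, so anonymous statistics keep working after the key is gone.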
Shieldformation
@PrivacyShielded
case class Sale(
  @PersonalId customerClubId: Option[String],
  @PersonalData storeId: Option[String],
  item: Option[String],
  timestamp: String
)
ShieldForm
case class SaleShielded(
  shieldId: Option[String],
  customerClubIdEncrypted: Option[String],
  storeIdEncrypted: Option[String],
  item: Option[String],
  timestamp: String
)
case class SaleAnonymous(
  item: Option[String],
  timestamp: String
)
case class Shield(
  shieldId: String,
  personId: Option[String],
  keyStr: Option[String],
  encounterDate: String
)
object SaleShield extends SparkJob {
  ...
}
object SaleExpose extends SparkJob {
  ...
}
object SaleAnonymize extends SparkJob {
  ...
}
Shieldformation & lost key
[Figure: data flow with datasets Sale, Sale Shielded, Shield, Sale Anonymous, Sale Stats, Customer History; jobs SaleShield, SaleExpose, SaleAnonymize; input Deletion requests; annotations Exposed egress, Limited retention]
Schema on write!
[Figure: services with DBs, export, business intelligence]
Change agility important here
Production stability important here
Agility is essential
● Innovation, cost of operations
● Cultural & technical challenges
● Leave data warehousing behind
● Always improve - never be content
○ Weeks → hours → seconds
Code & continuous improvement
Josh Baer: Powering Spotify's audio personalization platform
How long from idea to pipeline in production?
6-12 weeks
A manager heard that some companies do it in hours. What would you do to get there?
I push back on unrealistic management requirements.

Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...
Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...
sharonblush
 
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in CityGirls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
gargnatasha985
 
Universidad de Barcelona degree offer diploma Transcript
Universidad de Barcelona  degree offer diploma TranscriptUniversidad de Barcelona  degree offer diploma Transcript
Universidad de Barcelona degree offer diploma Transcript
taqyea
 
University of Wollongong degree offer diploma Transcript
University of Wollongong  degree offer diploma TranscriptUniversity of Wollongong  degree offer diploma Transcript
University of Wollongong degree offer diploma Transcript
taqyea
 
Artificial Intelligence (AI) Technology Project Proposal _ by Slidesgo.pptx
Artificial Intelligence (AI) Technology Project Proposal _ by Slidesgo.pptxArtificial Intelligence (AI) Technology Project Proposal _ by Slidesgo.pptx
Artificial Intelligence (AI) Technology Project Proposal _ by Slidesgo.pptx
vaishnavisharma877623
 
Contemporary Islamic Finance Practices_2022.pdf
Contemporary Islamic Finance Practices_2022.pdfContemporary Islamic Finance Practices_2022.pdf
Contemporary Islamic Finance Practices_2022.pdf
DngQuct12A1
 
From Clues to Connections: How Social Media Investigators Expose Hidden Networks
From Clues to Connections: How Social Media Investigators Expose Hidden NetworksFrom Clues to Connections: How Social Media Investigators Expose Hidden Networks
From Clues to Connections: How Social Media Investigators Expose Hidden Networks
Milind Agarwal
 
Maruti Wagon R on road price in Faridabad - CarDekho
Maruti Wagon R on road price in Faridabad - CarDekhoMaruti Wagon R on road price in Faridabad - CarDekho
Maruti Wagon R on road price in Faridabad - CarDekho
kamli sharma#S10
 

Recently uploaded (20)

Machine learning _new.pptx for a presentation
Machine learning _new.pptx for a presentationMachine learning _new.pptx for a presentation
Machine learning _new.pptx for a presentation
 
Oracle Database Desupported Features on 23ai (Part A)
Oracle Database Desupported Features on 23ai (Part A)Oracle Database Desupported Features on 23ai (Part A)
Oracle Database Desupported Features on 23ai (Part A)
 
Introduction to the Red Hat Portfolio.pdf
Introduction to the Red Hat Portfolio.pdfIntroduction to the Red Hat Portfolio.pdf
Introduction to the Red Hat Portfolio.pdf
 
Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...
Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...
Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...
 
VIP Girls Call Mumbai 9910780858 Provide Best And Top Girl Service And No1 in...
VIP Girls Call Mumbai 9910780858 Provide Best And Top Girl Service And No1 in...VIP Girls Call Mumbai 9910780858 Provide Best And Top Girl Service And No1 in...
VIP Girls Call Mumbai 9910780858 Provide Best And Top Girl Service And No1 in...
 
all about the data science process, covering the steps present in almost ever...
all about the data science process, covering the steps present in almost ever...all about the data science process, covering the steps present in almost ever...
all about the data science process, covering the steps present in almost ever...
 
Ahrefs SEO Report Template for Marketer.pptx
Ahrefs SEO Report Template for Marketer.pptxAhrefs SEO Report Template for Marketer.pptx
Ahrefs SEO Report Template for Marketer.pptx
 
Amul goes international: Desi dairy giant to launch fresh ...
Amul goes international: Desi dairy giant to launch fresh ...Amul goes international: Desi dairy giant to launch fresh ...
Amul goes international: Desi dairy giant to launch fresh ...
 
Nipissing University degree offer Nipissing diploma Transcript
Nipissing University degree offer Nipissing diploma TranscriptNipissing University degree offer Nipissing diploma Transcript
Nipissing University degree offer Nipissing diploma Transcript
 
PHENOMENOLOGY and Interpretive phenomenological analysis
PHENOMENOLOGY and Interpretive phenomenological analysisPHENOMENOLOGY and Interpretive phenomenological analysis
PHENOMENOLOGY and Interpretive phenomenological analysis
 
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECTMUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
 
The University of New England degree offer diploma Transcript
The University of New England  degree offer diploma TranscriptThe University of New England  degree offer diploma Transcript
The University of New England degree offer diploma Transcript
 
Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...
Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...
Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...
 
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in CityGirls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
 
Universidad de Barcelona degree offer diploma Transcript
Universidad de Barcelona  degree offer diploma TranscriptUniversidad de Barcelona  degree offer diploma Transcript
Universidad de Barcelona degree offer diploma Transcript
 
University of Wollongong degree offer diploma Transcript
University of Wollongong  degree offer diploma TranscriptUniversity of Wollongong  degree offer diploma Transcript
University of Wollongong degree offer diploma Transcript
 
Artificial Intelligence (AI) Technology Project Proposal _ by Slidesgo.pptx
Artificial Intelligence (AI) Technology Project Proposal _ by Slidesgo.pptxArtificial Intelligence (AI) Technology Project Proposal _ by Slidesgo.pptx
Artificial Intelligence (AI) Technology Project Proposal _ by Slidesgo.pptx
 
Contemporary Islamic Finance Practices_2022.pdf
Contemporary Islamic Finance Practices_2022.pdfContemporary Islamic Finance Practices_2022.pdf
Contemporary Islamic Finance Practices_2022.pdf
 
From Clues to Connections: How Social Media Investigators Expose Hidden Networks
From Clues to Connections: How Social Media Investigators Expose Hidden NetworksFrom Clues to Connections: How Social Media Investigators Expose Hidden Networks
From Clues to Connections: How Social Media Investigators Expose Hidden Networks
 
Maruti Wagon R on road price in Faridabad - CarDekho
Maruti Wagon R on road price in Faridabad - CarDekhoMaruti Wagon R on road price in Faridabad - CarDekho
Maruti Wagon R on road price in Faridabad - CarDekho
 

End-to-end pipeline agility - Berlin Buzzwords 2024

  • 1. www.scling.com End-to-end pipeline agility Berlin Buzzwords, 2024-06-10 Lars Albertsson Scling - data factory as a service 1
  • 2. www.scling.com Myth: ● We are all doing quite ok ● 2-10x leader-to-rear span The great capability divide 2 capability in X # orgs
  • 3. www.scling.com Myth: ● We are all doing quite ok ● 2-10x leader-to-rear span The great capability divide 3 capability in X # orgs capability in X # orgs Reality: ● Few leaders in each area ● 100-10000x leader-to-rear span
  • 4. www.scling.com Capability KPIs DORA research / State of DevOps report: ● Deployment frequency ● Lead time for changes ● Change failure rate ● Time to restore service Small elite ~1000x span 4 Observed differences in data organisations: ● Lead time from idea to production ● Time to mend / change pipeline ● Number of pipelines / developer ● Number of datasets / day / developer Small elite 100 - 10000x span (or more)
  • 5. www.scling.com Efficiency gap, data cost & value ● Data processing produces datasets ○ Each dataset has business value ● Proxy value/cost metric: datasets / day ○ S-M traditional: < 10 ○ Bank, telecom, media: 100-1000 5 2014: 6500 datasets / day 2016: 20000 datasets / day 2018: 100000+ datasets / day, 25% of staff use BigQuery 2021: 500B events collected / day 2016: 1600 000 000 datasets / day Disruptive value of data, machine learning Financial, reporting Insights, data-fed features effort value
  • 6. www.scling.com Enabling innovation 6 "The actual work that went into Discover Weekly was very little, because we're reusing things we already had." https://youtu.be/A259Yo8hBRs https://youtu.be/ZcmJxli8WS8 https://musically.com/2018/08/08/daniel-ek-would-have-killed-discover-weekly-before-launch/ "Discover Weekly wasn't a great strategic plan and 100 engineers. It was 3 engineers that decided to build something." "I would have killed it. All of a sudden, they shipped it. It’s one of the most loved product features that we have." - Daniel Ek, CEO
  • 7. www.scling.com Enterprise innovation 7 Swedish BigCorp 1: Before we could build this internal data tool, we had to submit an application and get it approved by a committee at the headquarters. Swedish BigCorp 2: Only one committee? Hold my beer. Swedish municipality: We started the AI project with a design phase for a few months, where we did not write any code!
  • 8. www.scling.com Data factory track record 8
      (sector) | Time to first flow | Staff size | 1st flow effort, weeks | 1st flow cost SEK | Time to innovation | # Datasets / day after 1y | # Flows after 1y
      Spotify (new gen) | weeks | ~30-50 | 60? | 2M | - | 10000s | 100s
      Media | 1+ years | 10-30 | 1500? | 100M (0.5-1B) | 1+ year | ~100 | ~10
      Finance | 2 years | 10-50 | 2000? | 100M? | Years | < 100 | < 10
      Media | 3 weeks | 4.5 - 8 | 15 | 750K | 3 months | ~2000 | 30
      Retail | 7 weeks | 1-3 | 7 | 500K | 6 months | ~3500 | 70
      Telecom | 12 weeks | 2-5 | 30 | 1500K | 6 months | ~500 | 50
      Consumer products | 20+ weeks | 1.5 | 30+ | 1200+K | 6+ months | ~200 | 20
      Construction | 8 weeks | 0.5 | 4 | 150K | 7 months | 10-100 * | 10
      Manufacturing | 8 weeks | 0.5 | 4 | 200K | 6 months | ? | ?
      * External bottlenecks
  • 9. www.scling.com Data agility 9 ● Siloed: 6+ months Cultural work ● Autonomous: 1 month Technical work ● Coordinated: days Data lake ∆ ∆ Latency?
  • 10. www.scling.com ● Oozie (~2007) ● Luigi (2010 / 2012) ○ Asset-based ● Airflow, Pinball (2015) ○ Task-based ● Dagster (asset-based), Prefect, Argo, … Workflow orchestration - key to DataOps success 10 ● Data assets (Target) ● Jobs that build assets (Task) ● Data sensors (ExternalTask) ● Asset & job parameters ● Job dependencies (requires()) ● (State) ● Data management as code ○ No ClickOps ● Simple, debuggable tools ● Industrial, not craft. Work with process, not with data. ● Build higher abstraction layers
  • 11. www.scling.com On-prem pipeline deployment pipeline 11 source repo Luigi DSL, jars, config my-pipe-7.tar.gz Luigi daemon > pip install my-pipe-7.tar.gz Worker Worker Worker Worker Worker Worker Worker Worker Redundant cron schedule, higher frequency All that a pipeline needs, installed atomically 10 * * * * luigi --module mymodule MyDaily Standard deployment artifact Standard artifact store /
  • 12. www.scling.com Cloud native deployment 12 source repo Luigi DSL, jars, config my-pipe:7 Luigi daemon Worker Worker Worker Worker Worker Worker Worker Worker Redundant cron schedule, higher frequency kind: CronJob spec: schedule: "10 * * * *" command: "luigi --module mymodule MyDaily" Docker image Docker registry S3 / GCS Dataproc / EMR /
  • 13. www.scling.com Simpler cloud native deployment 13 source repo Luigi DSL, jars, config my-pipe:7 Luigi daemon Worker Worker Worker Worker Worker Worker Worker spark-submit --master=local Redundant cron schedule, higher frequency kind: CronJob spec: schedule: "10 * * * *" command: "luigi --module mymodule MyDaily" Docker image Docker registry S3 / GCS /
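The asset-based model from the orchestration slide, and the reason a redundant, higher-frequency cron schedule is harmless, can be sketched in a few lines. This is a toy in plain Python, not Luigi's API: `Task`, `build`, and `demo` are hypothetical names, and assets are simply files in a temporary directory. A task counts as complete when its output exists, so a repeated trigger does nothing.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Task:
    """A job that builds one data asset (hypothetical toy, not Luigi's API)."""
    name: str
    date: str              # asset / job parameter
    root: Path             # where assets are stored
    upstream: tuple = ()   # tasks whose assets this one requires

    def output(self) -> Path:
        # The asset (Luigi calls it a Target): one file per task and date.
        return self.root / self.name / f"{self.date}.txt"

    def complete(self) -> bool:
        return self.output().exists()

    def run(self) -> None:
        inputs = [dep.output().read_text() for dep in self.upstream]
        self.output().parent.mkdir(parents=True, exist_ok=True)
        self.output().write_text(f"{self.name}({','.join(inputs)})")


def build(task: Task, log: list) -> None:
    """Build missing upstream assets first, then the task itself."""
    if task.complete():
        return  # asset already exists: nothing to do
    for dep in task.upstream:
        build(dep, log)
    task.run()
    log.append(task.name)


def demo() -> list:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        raw = Task("raw", "2024-06-10", root)
        clean = Task("clean", "2024-06-10", root, (raw,))
        report = Task("report", "2024-06-10", root, (clean,))
        log = []
        build(report, log)  # first trigger builds the whole chain
        build(report, log)  # redundant trigger is a no-op
        return log
```

`demo()` returns the build order of the first run only; the second `build` call finds the asset in place and adds nothing to the log.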
  • 14. www.scling.com 14 Potential test scopes ● Unit/component ● Single job ● Multiple jobs ● Pipeline, including service ● Full system, including client Choose stable interfaces Each scope has a cost Job Service App Storage Storage Job Storage Job
  • 15. www.scling.com 15 Recommended test scopes ● Single job ● Multiple jobs ● Pipeline, including service Job Service App Storage Storage Job Storage Job
  • 16. www.scling.com Testing single batch job 16 Job Standard Scalatest/JUnit harness file://test_input/ file://test_output/ 1. Generate input 2. Run in local mode 3. Verify output f() p() Runs well in CI / from IDE
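The three steps on this slide translate directly into a test function. The slide uses a Scalatest/JUnit harness; the sketch below shows the same shape in plain Python, with a made-up job (`count_orders_job`) and made-up records, so every name here is an assumption for illustration only.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def count_orders_job(input_path: Path, output_path: Path) -> None:
    """Hypothetical batch job: count orders per customer from JSON lines."""
    counts = Counter()
    with input_path.open() as lines:
        for line in lines:
            counts[json.loads(line)["customer"]] += 1
    with output_path.open("w") as out:
        for customer, n in sorted(counts.items()):
            out.write(json.dumps({"customer": customer, "orders": n}) + "\n")


def run_job_test() -> list:
    with tempfile.TemporaryDirectory() as tmp:
        input_path = Path(tmp) / "test_input.jsonl"
        output_path = Path(tmp) / "test_output.jsonl"
        # 1. Generate input
        orders = [{"customer": "alice"}, {"customer": "bob"}, {"customer": "alice"}]
        input_path.write_text("".join(json.dumps(o) + "\n" for o in orders))
        # 2. Run the job in local mode, against local files
        count_orders_job(input_path, output_path)
        # 3. Verify output (here: return it to the caller for assertions)
        return [json.loads(line) for line in output_path.read_text().splitlines()]
```

Because the job only sees file paths, the same function runs unchanged in CI and from an IDE.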
  • 17. www.scling.com Cognitive waste ● Why do we have 25 time formats? ○ ISO 8601, UTC assumed ○ ISO 8601 + timezone ○ Millis since epoch, UTC ○ Nanos since epoch, UTC ○ Millis since epoch, user local time ○ … ○ Float of seconds since epoch, as string. WTF?!? ● my-kafka-topic-name, your_topic_name 17 "I don't know what will break downstream" paralyses many data organisations.
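One way to contain the time format sprawl is to convert every variant to a single representation at ingestion. The sketch below is an assumption about how that could look, not something shown in the talk: it maps four of the listed formats to timezone-aware UTC `datetime` values, and all function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_epoch_millis(millis: int) -> datetime:
    return EPOCH + timedelta(milliseconds=millis)


def from_epoch_nanos(nanos: int) -> datetime:
    # datetime has microsecond resolution, so sub-microsecond digits are dropped.
    return EPOCH + timedelta(microseconds=nanos // 1000)


def from_float_seconds_string(text: str) -> datetime:
    # "Float of seconds since epoch, as string"
    return EPOCH + timedelta(seconds=float(text))


def from_iso8601(text: str) -> datetime:
    """ISO 8601 with an explicit offset, or without one (UTC assumed)."""
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)
```

"Millis since epoch, user local time" cannot be normalised this way without knowing the user's timezone, which is part of the cost the slide points at.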
  • 18. www.scling.com Testing batch pipelines - two options 18 A: Standard Scalatest harness with a custom multi-job test job that runs a sequence of jobs. 1. Generate input (file://test_input/) 2. Run custom multi-job 3. Verify output (file://test_output/) + Runs in CI + Runs in IDE + Quick setup - Multi-job maintenance B: Customised workflow manager setup + Tests workflow logic + More authentic - Workflow mgr setup for testability - Difficult to debug - Dataset handling with Python ● Both can be extended with ingress (Kafka), egress DBs
  • 19. www.scling.com Warped pipeline testing 19 file://test_input/ file://test_output/ Workflow manager DAG "warp" rewrite 1. Generate input 3. Verify output f() p() Stream Stream Production
  • 20. www.scling.com What about a layer of indirection? 20 file://test_data/ Testing? Prod Data access layer - In conflict with autonomous culture - Adds complexity to experimental cycle - In conflict with data component ecosystems + Explicit + Debuggable
  • 21. www.scling.com Aspect-oriented testing Aka hideous monkey patching. Packaged in prod image. Triggered by environment variable. Rules for DAG transformation 21
  • 22. www.scling.com Monkey patching extracts a price Better than indirection? Depends on cultural cost of coordination. Our plans ahead: Keep warp zone, replace monkey patching with layer of indirection 22 - Implicit - Difficult to debug framework - Disharmony with higher workflow abstractions + Minimal code adaptation + Can test pipelines owned by other teams + Seams in third-party components Great value from full pipeline testing!
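The talk does not show how the DAG "warp" rewrite is implemented. As a rough mental model only, it can be pictured as a function that takes the production DAG and returns a copy in which every dataset location points into a local test directory, while the jobs and their wiring stay untouched. `Job`, `warp_uri`, and `warp` below are hypothetical names, and the URIs are made up.

```python
from dataclasses import dataclass, replace
from urllib.parse import urlparse


@dataclass(frozen=True)
class Job:
    name: str
    inputs: tuple
    outputs: tuple


def warp_uri(uri: str, test_root: str) -> str:
    """Map a production URI to a file path under a local test directory."""
    parsed = urlparse(uri)
    return f"file://{test_root}/{parsed.scheme}/{parsed.netloc}{parsed.path}"


def warp(dag: list, test_root: str) -> list:
    """Return a copy of the DAG with every dataset location redirected."""
    return [
        replace(
            job,
            inputs=tuple(warp_uri(u, test_root) for u in job.inputs),
            outputs=tuple(warp_uri(u, test_root) for u in job.outputs),
        )
        for job in dag
    ]


PRODUCTION_DAG = [
    Job("clean", ("s3://lake/raw/orders",), ("s3://lake/clean/orders",)),
    Job("report", ("s3://lake/clean/orders",), ("gs://egress/report",)),
]
```

The rewrite is deterministic, so an output of one job still matches the input of the next after warping, which is what lets a whole pipeline run against generated test input.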
  • 23. www.scling.com Performance is essential. No, not that performance. ● Bitten by bug in org.apache.spark.sql.KeyValueGroupedDataset.flatMapGroups(). ○ Method is covered by one test case. WTF? ● Spark startup is slow. 10-30 seconds. ● What about plain Scala collections + minimal data DSL wrapper? ○ Up to 10x faster on join benchmark for < 10 GB data. ○ 10x faster in production on small datasets. ○ 10x faster tests. ○ 90% saved CPU resources. ○ Easier to program + debug. ● Performance dimensions that actually matter for agility & value creation. 23
  • 24. www.scling.com What conclusion from this graph? COVID-19 fatalities / day in Sweden 24 Fatalities collected during 2 days Fatalities collected during 4 days Fatalities collected during 10 days
  • 25. www.scling.com Normalise data collection to compare 25 Graph by Adam Altmejd, @adamaltmejd
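The normalisation idea can be made concrete with a small calculation: count each day's events using only the reports that arrived within the same number of days, so that recent and older days are measured on equal terms. The records below are synthetic and chosen purely to show the effect; they are not the Swedish figures, and the function names are hypothetical.

```python
from collections import Counter
from datetime import date

# Synthetic (event_day, reported_day) pairs, for illustration only.
SAMPLE_REPORTS = [
    (date(2020, 4, 1), date(2020, 4, 2)),
    (date(2020, 4, 1), date(2020, 4, 3)),
    (date(2020, 4, 1), date(2020, 4, 9)),   # late report
    (date(2020, 4, 8), date(2020, 4, 9)),
    (date(2020, 4, 8), date(2020, 4, 10)),
]


def raw_counts(reports) -> dict:
    """Everything reported so far: older days have had more time to fill in."""
    return dict(Counter(event_day for event_day, _ in reports))


def counts_within_delay(reports, max_delay_days: int) -> dict:
    """Only reports that arrived within the same delay for every day."""
    counts = Counter()
    for event_day, reported_day in reports:
        if (reported_day - event_day).days <= max_delay_days:
            counts[event_day] += 1
    return dict(counts)
```

With this sample, the raw counts suggest a decline from 3 to 2, while the counts normalised to a two-day delay are flat at 2 and 2.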
  • 26. www.scling.com Build current state from immutable events + dumps Address cumulative state at arbitrary time Data flow / ops paradigms 26 Immutable Functional Democratised Mutable Object-oriented Exclusive Microservices Shared DBs Data warehousing Modern data warehousing Data lake Frozen lake Mutate current state Stream processing Lakehouse
  • 28. www.scling.com Fast data or complete data 28 Delay: 0 Delay: 4 Delay: 12
  • 29. www.scling.com Life of an error, batch pipelines 29 ● My processing job, bad code! 1. Revert serving datasets to old 2. Fix bug 3. Remove faulty datasets 4. Deploy 5. Backfill is automatic (Luigi) Done! ● Low cost of error ○ Reactive QA ○ Production environment sufficient
  • 30. www.scling.com Life of an error, frozen lake 30 ● My processing job, bad code! 1. Revert serving datasets to old 2. Fix bug 3. Bump pipeline version 4. Deploy 5. Backfill is automatic (Luigi) Done! ● Low cost of error ○ Reactive QA ○ Production environment sufficient
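The difference between the two recovery procedures is step 3: in a frozen lake, faulty datasets are not removed; the pipeline version is bumped instead. A minimal sketch of why backfill then becomes automatic, assuming the version is part of the dataset path (the path layout and function names are assumptions, not taken from the talk):

```python
from datetime import date, timedelta


def dataset_path(name: str, version: int, day: date) -> str:
    return f"lake/{name}/v{version}/{day.isoformat()}"


def missing_datasets(existing: set, name: str, version: int,
                     first_day: date, last_day: date) -> list:
    """Datasets the workflow manager would have to (re)build."""
    n_days = (last_day - first_day).days + 1
    wanted = [
        dataset_path(name, version, first_day + timedelta(days=i))
        for i in range(n_days)
    ]
    return [path for path in wanted if path not in existing]
```

After a bump from version 1 to version 2, every day in the range is missing under the new version and gets rebuilt, while the version 1 datasets stay in place, untouched.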
  • 31. www.scling.com Life of a change, batch pipelines 31 ● Forgiving environment ○ Machine errors ○ Human errors ● Friendly to experiments ○ "Dark pipelines" run in parallel ● Operationally efficient ○ Separate dev, test, staging environments not necessary ○ Self-healing ∆?
  • 34. www.scling.com Schema on write 34 ● Schema defined by writer ● Destination (table / dataset / stream topic) has defined schema ○ Technical definition with metadata (e.g. RDBMS, Kafka + registry) ○ By convention ● Writes not in compliance are not accepted ○ Technically aborted (e.g. RDBMS) ○ In violation of intent (e.g. HDFS datasets) ● Can be technically enforced by producer driver ○ Through ORM / code generation ○ Schema registry lookup Strict checking philosophy
  • 35. www.scling.com Schema on read 35 ● Anything (technically) accepted when writing ● Schema defined by reader, at consumption ○ Reader may impose requirements on type & value ● In dynamic languages, fields propagate implicitly ○ E-shopping example: i. Join order + customer. ii. Add device_type to order schema iii. device_type becomes available in downstream datasets ● Violations of constraints are detected at read ○ Perhaps long after production? ○ By team not owning producer? Loose checking philosophy
  • 36. www.scling.com Schema on read or write? 36 DB DB DB Service Service Export Business intelligence Change agility important here Production stability important here
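The two philosophies differ mainly in when a violation surfaces. The sketch below contrasts them with one hypothetical schema (`Order`) and an in-memory list standing in for a dataset; it is an illustration under those assumptions, not a description of any particular storage system.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Order:
    order_id: str
    amount_cents: int


def violations(record: dict, allow_extra: bool) -> list:
    """List the ways a record deviates from the Order schema."""
    expected = {f.name: f.type for f in fields(Order)}
    problems = []
    for name, tp in expected.items():
        if name not in record:
            problems.append(f"missing field {name}")
        elif not isinstance(record[name], tp):
            problems.append(f"{name} is not {tp.__name__}")
    if not allow_extra:
        problems += [f"unexpected field {n}" for n in record if n not in expected]
    return problems


def write_strict(dataset: list, record: dict) -> None:
    """Schema on write: a non-compliant record never reaches the dataset."""
    problems = violations(record, allow_extra=False)
    if problems:
        raise ValueError("; ".join(problems))
    dataset.append(record)


def write_loose(dataset: list, record: dict) -> None:
    """Schema on read: anything is accepted at write time."""
    dataset.append(record)


def read_orders(dataset: list) -> list:
    """The reader imposes its requirements, so violations surface here."""
    names = [f.name for f in fields(Order)]
    orders = []
    for record in dataset:
        problems = violations(record, allow_extra=True)
        if problems:
            raise ValueError("; ".join(problems))
        orders.append(Order(**{name: record[name] for name in names}))
    return orders
```

With `write_loose`, a record with a string amount is stored without complaint and only fails when `read_orders` runs, possibly much later and in another team's job; an extra field such as `device_type` passes through to readers that do not care about it.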
  • 37. www.scling.com ● Expressive ● Custom types ● IDE support ● Avro for data lake storage Schema definition choice 37 ● RDBMS: Table metadata ● Avro: JSON/DSL definition ○ Definition is bundled with avro data files ● Parquet ● pyschema / dataclass ● Scala case classes ● JSON-schema ● JSON: Each record ○ One record insufficient to deduce schema case class User(id: String, name: String, age: Int, phone: Option[String] = None) val users = Seq( User("1", "Alice", 32), User("2", "Bob", 43, Some("08-123456")))
  • 38. www.scling.com Schema offspring Test record difference render type classes 38 case classes test equality type classes Avro definitions Java Avro codec classes Java <-> Scala converters Avro type annotations MySQL schemas CSV codecs Privacy by design machinery Python Logical types
  • 39. www.scling.com Scalameta: schema → syntax tree 39 Defn.Class( List(Mod.Annot(Init(Type.Name("PrivacyShielded"), , List())), case), Type.Name("SaleTransaction"), List(), Ctor.Primary( List(), , List( List( Term.Param( List(Mod.Annot(Init(Type.Name("PersonalId"), , List()))), Term.Name("customerClubId"), Some(Type.Apply(Type.Name("Option"), List(Type.Name("String")))), None ), Term.Param( List(Mod.Annot(Init(Type.Name("PersonalData"), , List()))), Term.Name("storeId"), Some(Type.Apply(Type.Name("Option"), List(Type.Name("String")))), None ), Term.Param( List(), Term.Name("item"), Some(Type.Apply(Type.Name("Option"), List(Type.Name("String")))), None ), Term.Param(List(), Term.Name("timestamp"), Some(Type.Name("String")), None) ) ) ), Template(List(), List(), Self(, None), List())) @PrivacyShielded case class SaleTransaction( @PersonalId customerClubId: Option[String], @PersonalData storeId: Option[String], item: Option[String], timestamp: String )
  • 40. www.scling.com Scalameta use cases ● Scalafmt ● Scalafix ○ Static analysis ○ Code transformation ● Online code generation - macros ● Offline code generation 40 // Example from scio 0.7 -> 0.8 upgrade rules final class FixTensorflow extends SemanticRule("FixTensorflow") { override def fix(implicit doc: SemanticDocument): Patch = doc.tree.collect { case t @ Term.Select(s, Term.Name( "saveAsTfExampleFile")) => Patch.replaceTree(t, q"$s.saveAsTfRecordFile".syntax) }.asPatch }
  • 41. www.scling.com Test equality Test record difference render type classes 41 case classes test equality type classes
  • 42. www.scling.com Python + RDBMS 42 case classes Avro definitions Avro type annotations MySQL schemas Python case class User( age: Int, @AvroProp("sqlType", "varchar(1012)") phone: Option[String] = None) { "name": "User", "type": "record", "fields": [ { "name": "age", "type": "int" }, { "name": "phone", "type": [ "null", "string" ], "sqlType": "varchar(1012)" } ] } class UserEgressJob(CopyToTable): columns = [ ("age", "int"), ("phone", "varchar(1012)"), ] ...
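The point of this slide is that one schema definition, annotated once, can yield both the Avro definition and the SQL column list. The talk does this from Scala case classes with Scalameta; the sketch below shows the same derivation in plain Python using dataclass field metadata. The type mappings and the function names `avro_schema` and `sql_columns` are assumptions made for the example.

```python
import typing
from dataclasses import dataclass, field, fields
from typing import Optional

AVRO_TYPES = {int: "int", str: "string"}
DEFAULT_SQL_TYPES = {int: "int", str: "text"}  # assumed defaults


@dataclass
class User:
    age: int
    phone: Optional[str] = field(default=None, metadata={"sqlType": "varchar(1012)"})


def unwrap_optional(tp):
    """Return (base_type, is_optional) for T or Optional[T]."""
    if typing.get_origin(tp) is typing.Union:
        inner = [arg for arg in typing.get_args(tp) if arg is not type(None)]
        return inner[0], True
    return tp, False


def avro_schema(cls) -> dict:
    avro_fields = []
    for f in fields(cls):
        base, optional = unwrap_optional(f.type)
        avro_type = AVRO_TYPES[base]
        entry = {"name": f.name, "type": ["null", avro_type] if optional else avro_type}
        entry.update(f.metadata)  # custom attributes such as sqlType ride along
        avro_fields.append(entry)
    return {"type": "record", "name": cls.__name__, "fields": avro_fields}


def sql_columns(cls) -> list:
    columns = []
    for f in fields(cls):
        base, _ = unwrap_optional(f.type)
        columns.append((f.name, f.metadata.get("sqlType", DEFAULT_SQL_TYPES[base])))
    return columns
```

`sql_columns(User)` produces a list in the shape that the `columns` attribute of the egress job expects, so the column definition no longer has to be maintained by hand.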
  • 43. www.scling.com Logical types 43 case classes Logical types case t"""Instant""" => JObject(List(JField("type", JString("long")), JField("logicalType", JString("timestamp-micros")))) case t"""LocalDate""" => JObject(List(JField("type", JString("int")), JField("logicalType", JString("date")))) case t"""YearMonth""" => JObject(List(JField("type", JString("int")))) case t"""JObject""" => JString("string") ● Avro logical types ○ E.g. date → int, timestamp → long ○ Default is timestamp-millis ■ Great for year > 294441 (!) ● Custom logical types ○ Time ○ Collections ○ Physical
  • 44. www.scling.com Schema on read or write? 44 DB DB DB Service Service Export Business intelligence Change agility important here Production stability important here
  • 46. www.scling.com Chimney - case class transformer 46 ● Commonality in schema classes ○ Copy + a few more fields ○ Drop fields ● Statically typed ○ Forgot a field - error ○ Wrong type - error
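Chimney performs these checks at compile time in Scala. Python has no equivalent static guarantee, so the sketch below is only a runtime analogue of the two patterns on the slide (copy plus a few more fields, and drop fields), with hypothetical classes and a hypothetical helper `transform_into`. A forgotten field or a wrong type raises an error instead of failing compilation.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Sale:
    customer_id: str
    item: str
    timestamp: str


@dataclass(frozen=True)
class SaleWithStore:  # copy + one more field
    customer_id: str
    item: str
    timestamp: str
    store_id: str


@dataclass(frozen=True)
class SaleAnonymous:  # drop fields
    item: str
    timestamp: str


def transform_into(source, target_cls, **extra):
    """Build target_cls from matching fields of source plus explicit extras."""
    available = {f.name: getattr(source, f.name) for f in fields(source)}
    available.update(extra)
    values = {}
    for f in fields(target_cls):
        if f.name not in available:
            raise TypeError(f"{target_cls.__name__}.{f.name} has no source value")
        if not isinstance(available[f.name], f.type):
            raise TypeError(f"{target_cls.__name__}.{f.name} expects {f.type.__name__}")
        values[f.name] = available[f.name]
    return target_cls(**values)
```

For example, `transform_into(sale, SaleAnonymous)` drops `customer_id`, and `transform_into(sale, SaleWithStore)` without a `store_id` argument raises `TypeError`.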
  • 48. www.scling.com Data lake Private pond Cold store Ingest prepared for deletion 48 Mutation Landing pond Append + delete Immutable, limited retention
  • 49. www.scling.com ● PII fields encrypted ● Per-user decryption key table ● Clear single user key => oblivion - Extra join + decrypt - Decryption (user) id needed + Multi-field oblivion + Single dataset leak → no PII leak + Handles transformed PII fields Lost key pattern 49
  • 50. www.scling.com Shieldformation 50 @PrivacyShielded case class Sale( @PersonalId customerClubId: Option[String], @PersonalData storeId: Option[String], item: Option[String], timestamp: String ) case class SaleShielded( shieldId: Option[String], customerClubIdEncrypted: Option[String], storeIdEncrypted: Option[String], item: Option[String], timestamp: String ) case class SaleAnonymous( item: Option[String], timestamp: String ) object SaleAnonymize extends SparkJob { ... } ShieldForm object SaleExpose extends SparkJob { ... } object SaleShield extends SparkJob { ... } case class Shield( shieldId: String, personId: Option[String], keyStr: Option[String], encounterDate: String )
  • 51. www.scling.com Shield Shieldformation & lost key 51 SaleShield Sale Sale Shielded Shield Deletion requests Customer History Exposed egress SaleExpose Limited retention SaleAnonymize Sale Anonymous Sale Stats
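The lost key pattern can be demonstrated end to end in a few lines: PII fields are stored encrypted, the key lives in a separate per-user table, and deleting the key makes every copy of the ciphertext unreadable. The sketch below is an illustration only. The cipher is a toy SHA-256 keystream built from the standard library so that the example is self-contained; it is not secure and a real system would use a vetted encryption library. `KeyTable`, `encrypt`, `decrypt`, and `expose` are hypothetical names.

```python
import hashlib
import secrets
from typing import Optional


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    # Toy keystream for illustration. NOT a secure cipher.
    stream = b""
    counter = 0
    while len(stream) < length:
        stream += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return stream[:length]


def encrypt(key: bytes, plaintext: str) -> bytes:
    nonce = secrets.token_bytes(16)
    data = plaintext.encode("utf-8")
    stream = _keystream(key, nonce, len(data))
    return nonce + bytes(a ^ b for a, b in zip(data, stream))


def decrypt(key: bytes, ciphertext: bytes) -> str:
    nonce, data = ciphertext[:16], ciphertext[16:]
    stream = _keystream(key, nonce, len(data))
    return bytes(a ^ b for a, b in zip(data, stream)).decode("utf-8")


class KeyTable:
    """Per-user decryption keys, kept apart from the datasets."""

    def __init__(self) -> None:
        self._keys = {}

    def key_for(self, user_id: str) -> bytes:
        return self._keys.setdefault(user_id, secrets.token_bytes(32))

    def get(self, user_id: str) -> Optional[bytes]:
        return self._keys.get(user_id)

    def forget(self, user_id: str) -> None:
        # Clearing the single user key => oblivion for all encrypted fields.
        self._keys.pop(user_id, None)


def expose(keys: KeyTable, user_id: str, ciphertext: bytes) -> Optional[str]:
    """The extra join + decrypt step; yields None once the key is gone."""
    key = keys.get(user_id)
    return None if key is None else decrypt(key, ciphertext)
```

The encrypted datasets themselves never change when a user is forgotten, which is what allows them to stay immutable.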
  • 53. www.scling.com 53 Agility is essential ● Innovation, cost of operations ● Cultural & technical challenges ● Leave data warehousing behind Josh Baer: Powering Spotify's audio personalization platform
  • 54. www.scling.com 54 How long from idea to pipeline in production? 6-12 weeks A manager heard that some companies do it in hours. What would you do to get there? I push back on unrealistic management requirements. Agility is essential ● Innovation, cost of operations ● Cultural & technical challenges ● Leave data warehousing behind Josh Baer: Powering Spotify's audio personalization platform
  • 55. www.scling.com 55 How long from idea to pipeline in production? 6-12 weeks A manager heard that some companies do it in hours. What would you do to get there? I push back on unrealistic management requirements. Agility is essential ● Innovation, cost of operations ● Cultural & technical challenges ● Always improve - never be content ○ Weeks → hours → seconds Josh Baer: Powering Spotify's audio personalization platform Code & continuous improvement