Testing data streaming applications

Lars Albertsson, Founder & Data Engineer
www.mapflat.com
Lars Albertsson, independent consultant
Øyvind Løkling, Schibsted Media Group
Who’s talking?
● Swedish Institute of Computer Science (distributed system test+debug tools)
● Sun Microsystems (very large machines)
● Google (Hangouts, productivity)
● Recorded Future (NLP startup)
● Cinnober Financial Tech. (trading systems)
● Spotify (data processing & modelling)
● Schibsted Media Group (data processing & modelling)
● Mapflat - independent data engineering consultant
Why stream processing?
● Increasing number of data-driven features
● 90+% fed by batch processing
○ Simpler, better tooling
○ 1+ hour data reaction time
● Stream processing for
○ 100ms - 1 hour reaction
○ Decoupled, asynchronous microservices
[Diagram labels: User content; Professional content; Ads / partners; User behaviour; Systems; Ads; System diagnostics; Recommendations; Data-based features; Curated content; Pushing; Business intelligence; Experiments; Exploration]
The organic past
● Many paths
● Synchronous
● Link failure -> chain failure
● Heterogeneous
● Difficult to recover from transformation bugs
[Diagram labels: Service (x3); App (x3); DB; Poll; Queue; Aggregate logs; NFS; Hourly dump; Data warehouse; ETL; scp; HTTP]
The unified log
● Publish data in streams
● Replicated, sharded append-only log
● Pub / sub with history
○ Kafka, Google Pub/Sub, AWS Kinesis
● Tap to data lake for batch processing
[Diagram labels: Ads, Search, Feed; App (x3); Stream (x3); Data lake]
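As a rough illustration of "pub / sub with history", the sketch below models a single-shard, in-memory append-only log. The names `UnifiedLog`, `append`, and `read_from` are hypothetical and chosen for this example only; they are not the API of Kafka, Google Pub/Sub, or AWS Kinesis, and replication and sharding are left out.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class UnifiedLog:
    """Toy single-shard append-only log (hypothetical, in-memory only)."""
    _records: List[Any] = field(default_factory=list)

    def append(self, record: Any) -> int:
        """Append a record and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset: int) -> List[Any]:
        """Return all records at or after the given offset."""
        return list(self._records[offset:])


log = UnifiedLog()
for event in ["ad_click", "search", "feed_view"]:
    log.append(event)

# A consumer that has already processed two records resumes at offset 2.
live_tail = log.read_from(2)
# A new consumer, or one redeployed after a bug fix, replays from offset 0.
replay = log.read_from(0)
```

Because records are never modified in place, each consumer only needs to remember its own offset, which is what lets producers and consumers stay decoupled in time.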
Stream processing
● Decoupled producers/consumers
○ In source/deployment
○ In space
○ In time
● Publish results to log
● Recovers from link failures
● Replay on job bug fix
[Diagram labels: Ads, Search, Feed; App (x3); Stream; Job; Data lake; Business intelligence]
Stream processing building blocks
● Aggregate
○ Calculate time windows
○ Aggregate state (in memory / local database / shared database)
● Filter
○ Slim down stream
○ Privacy, security concerns
● Join
○ Enrich by joining with datasets, e.g. geo IP lookup, demographics
○ Join streams within time windows, e.g. click-through rate
● Transform
○ Bring data into same “shape”, schema
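The building blocks above can be sketched as plain functions over lists of events. This is a minimal illustration under stated assumptions: the event fields (`ts`, `ip`, `opted_out`), the `GEO_BY_IP` table, and all function names are invented for the example, and a real job would use the windowing and join operators of its stream framework. The transform step is omitted for brevity.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Tuple

Event = Dict[str, Any]

# Hypothetical lookup table standing in for a geo IP dataset.
GEO_BY_IP = {"10.0.0.1": "SE", "10.0.0.2": "NO"}


def filter_events(events: Iterable[Event]) -> List[Event]:
    """Filter: drop events from users who opted out of tracking."""
    return [e for e in events if not e.get("opted_out", False)]


def enrich(events: Iterable[Event]) -> List[Event]:
    """Join: enrich each event with a country from the lookup table."""
    return [{**e, "country": GEO_BY_IP.get(e["ip"], "unknown")} for e in events]


def count_per_window(events: Iterable[Event], window_s: int) -> Dict[Tuple[int, str], int]:
    """Aggregate: count events per (window start, country), using event time."""
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for e in events:
        window_start = (e["ts"] // window_s) * window_s
        counts[(window_start, e["country"])] += 1
    return dict(counts)


events = [
    {"ts": 5, "ip": "10.0.0.1"},
    {"ts": 50, "ip": "10.0.0.2", "opted_out": True},
    {"ts": 65, "ip": "10.0.0.1"},
    {"ts": 70, "ip": "10.0.0.9"},
]
result = count_per_window(enrich(filter_events(events)), window_s=60)
```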
Stream processing technologies
● Spark Streaming
○ Ideal if you are already using Spark, same model
○ Bridges gap between data science / data engineers, batch and stream
● Kafka Streams
○ Library - new, positions itself as a lightweight alternative
○ Tightly coupled to Kafka
● Others
○ Storm, Heron, Flink, Samza, Google Dataflow, AWS Lambda
Egress
● Update database table, e.g. for polling dashboard
● Create service index table n+1. Notify service to switch.
● Post to external web service
● Push stream to client
[Diagram labels: Stream; Job; Service; App]
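The "create index table n+1, notify service to switch" pattern might look roughly like the sketch below. `VersionedIndex` and its methods are hypothetical names, and an in-memory dictionary stands in for the real table store. The point is only that readers keep seeing version n until version n+1 is completely written.

```python
from typing import Dict, Optional


class VersionedIndex:
    """Toy store holding several index versions, one of them active."""

    def __init__(self) -> None:
        self._tables: Dict[int, Dict[str, str]] = {}
        self._active: Optional[int] = None

    def publish(self, version: int, table: Dict[str, str]) -> None:
        """Job side: write table n+1 next to the one being served."""
        self._tables[version] = dict(table)

    def switch_to(self, version: int) -> None:
        """Service side: flip to a fully written version."""
        if version not in self._tables:
            raise KeyError(f"index version {version} has not been published")
        self._active = version

    def lookup(self, key: str) -> Optional[str]:
        """Read from the currently active version, if any."""
        if self._active is None:
            return None
        return self._tables[self._active].get(key)


index = VersionedIndex()
index.publish(1, {"user42": "rock"})
index.switch_to(1)
index.publish(2, {"user42": "jazz"})  # readers still see version 1
before_switch = index.lookup("user42")
index.switch_to(2)                    # the "notify service to switch" step
after_switch = index.lookup("user42")
```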
Test concepts
[Diagram labels: Test harness; Test fixture; System under test (SUT); 3rd party components (e.g. DB); Test input; Test oracle; Test framework (e.g. JUnit, Scalatest); Seam; IDEs; Build tools]
Potential test scopes
● Unit
● Single job
● Multiple jobs
● Pipeline, including service
● Full system, including client
Choose stable interfaces
Each scope has a cost
[Diagram labels: Job; Stream; Service; App]
Stream application properties
● Output = function(input, code)
○ Perfect for testing!
○ Avoid: non-deterministic processing, reading the wall clock
● Pipeline and job endpoints are stable
○ Correspond to business value
● Internal abstractions are volatile
○ Reslicing in different dimensions is common
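To make "output = function(input, code)" concrete, here is a small sketch of a job that derives everything from event timestamps and never reads the wall clock. The session logic, the 30-minute gap, and the field names `user` and `ts` are assumptions made for the example.

```python
from typing import Dict, List

SESSION_TIMEOUT_S = 30 * 60  # assumed session gap for this example


def count_sessions(events: List[Dict[str, int]]) -> Dict[int, int]:
    """Count sessions per user from event timestamps only.

    There is no call to time.time(): the result depends on the input
    alone, so replaying the same events always gives the same output.
    """
    last_seen: Dict[int, int] = {}
    sessions: Dict[int, int] = {}
    for e in sorted(events, key=lambda e: e["ts"]):
        user, ts = e["user"], e["ts"]
        if user not in last_seen or ts - last_seen[user] > SESSION_TIMEOUT_S:
            sessions[user] = sessions.get(user, 0) + 1
        last_seen[user] = ts
    return sessions


events = [
    {"user": 1, "ts": 0},
    {"user": 1, "ts": 600},
    {"user": 1, "ts": 4000},
    {"user": 2, "ts": 100},
]
first_run = count_sessions(events)
second_run = count_sessions(list(reversed(events)))  # arrival order differs
```

A test can then feed a fixed list of events and compare the result directly, without sleeping or faking a clock.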
Recommended scopes
● Single job
● Multiple jobs
● Pipeline, including service
[Diagram labels: Job; Stream; Service; App]
Scopes to avoid
● Unit
○ Few stable interfaces
○ Not necessary
○ Avoid mocks, DI rituals
● Full system, including client
○ Client automation fragile
“Focus on functional system tests, complement with smaller where you cannot get coverage.” - Henrik Kniberg
[Diagram labels: Job; Stream; Service; App]
Stream application, example harness
IDE, CI, debug integration
[Diagram labels: Scalatest; Spark Streaming jobs; Topic; Kafka; DB; Test input; Test oracle; Polling; Docker; IDE / Gradle]
Test lifecycle
1. Start fixture containers
2. Await fixture ready
3. Allocate test case resources
4. Start jobs
5. Push input data to Kafka
6. While (!done && !timeout) { pollDatabase(); sleep(1ms) }
7. While (moreTests) { Goto 3 }
8. Tear down fixture
For absence tests (asserting that no output appears), send dummy sync messages at the end of the input.
[Diagram: lifecycle steps mapped onto the harness: 1. Docker; 2, 7. Scalatest; 4. Spark; 5; 6; IDE / Gradle]
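Steps 4 to 6 of the lifecycle can be sketched as below. Everything here is a stand-in: plain lists replace the Kafka topic and the database, a thread replaces the streaming job, and `poll_until` is a hypothetical helper name. The harness may use a real clock for its timeout; only the job under test should avoid reading one.

```python
import threading
import time
from typing import Callable, List


def poll_until(done: Callable[[], bool], timeout_s: float, interval_s: float = 0.001) -> bool:
    """Step 6: poll until done() is true or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if done():
            return True
        time.sleep(interval_s)
    return done()


# In-memory stand-ins for the Kafka topic and the database (assumptions).
topic: List[int] = []
database: List[int] = []


def job() -> None:
    """Stand-in for the job under test: doubles each input record."""
    seen = 0
    while seen < 3:
        if len(topic) > seen:
            database.append(topic[seen] * 2)
            seen += 1
        else:
            time.sleep(0.001)


worker = threading.Thread(target=job, daemon=True)
worker.start()                    # step 4: start jobs
topic.extend([1, 2, 3])           # step 5: push input data
finished = poll_until(lambda: len(database) == 3, timeout_s=5.0)  # step 6
worker.join(timeout=5.0)
```

Polling with a short interval keeps a passing test fast, while the timeout bounds how long a failing test can hang.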
Input generation
● Input & output is denormalised & wide
● Fields are frequently changed
○ Additions are compatible
○ Modifications are incompatible => new, similar data type
● Static test input, e.g. JSON files
○ Unmaintainable
● Input generation routines
○ Robust to changes, reusable
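An input generation routine could be as small as the sketch below: it returns a valid, wide event with defaults, and each test overrides only the fields it cares about. The function name `make_page_view` and all field names are hypothetical.

```python
from typing import Any, Dict


def make_page_view(**overrides: Any) -> Dict[str, Any]:
    """Build a valid, wide page-view event; tests override only what matters.

    Field names here are hypothetical examples.
    """
    event: Dict[str, Any] = {
        "user_id": "u-1",
        "url": "https://example.com/",
        "ts": 0,
        "country": "SE",
        "device": "mobile",
        "referrer": None,
    }
    unknown = set(overrides) - set(event)
    if unknown:
        # Fail loudly if a test refers to a field that no longer exists.
        raise KeyError(f"unknown fields: {sorted(unknown)}")
    event.update(overrides)
    return event


# A test about country handling states only the country.
norwegian_view = make_page_view(country="NO")
```

When a field is added to the data type, only the defaults in the routine change, whereas static JSON files would each need editing.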
Test oracles
● Compare with expected output
● Check fields relevant for test
○ Robust to field changes
○ Reusable for new, similar types
● Tip: Use lenses
○ JSON: JsonPath (Java), Play JSON (Scala)
○ Case classes: Monocle
● Express invariants for each data type
○ Reuse for production data quality monitoring
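The sketch below shows one way to check only the fields relevant to a test and to express invariants per data type. It uses a hand-written dotted-path getter as a very small stand-in for a lens library such as JsonPath or Monocle. The helper names and the 'page view' fields are assumptions for the example.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def get_path(record: Record, path: str) -> Any:
    """Minimal lens: read a nested field by dotted path, e.g. 'geo.country'."""
    value: Any = record
    for key in path.split("."):
        value = value[key]
    return value


def assert_fields(record: Record, expected: Dict[str, Any]) -> None:
    """Oracle: compare only the fields the test cares about."""
    for path, want in expected.items():
        got = get_path(record, path)
        assert got == want, f"{path}: expected {want!r}, got {got!r}"


# Invariants for a hypothetical 'page view' type; also usable in monitoring.
PAGE_VIEW_INVARIANTS: List[Callable[[Record], bool]] = [
    lambda r: r["ts"] >= 0,
    lambda r: len(get_path(r, "geo.country")) == 2,
]


def violations(record: Record) -> int:
    """Count the invariants that a record breaks."""
    return sum(1 for check in PAGE_VIEW_INVARIANTS if not check(record))


output = {"ts": 12, "geo": {"country": "SE", "city": "Stockholm"}, "extra": 1}
assert_fields(output, {"geo.country": "SE"})  # ignores 'city' and 'extra'
broken = violations({"ts": -1, "geo": {"country": "Sweden"}})
```

Adding a new field to the output does not break `assert_fields`, and the same `violations` function could be run over production records to feed a quality metric.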
Data pipeline = yet another program
Don’t veer from best practices
● Regression testing
● Design: Separation of concerns, modularity, etc
● Process: CI/CD, code review, static analysis tools
● Avoid anti-patterns: Global state, hard-coding location, duplication, ...
In data engineering, slipping is in the culture... :-(
● Mix in solid backend engineers
● Document “golden path”
Testing with cloud services
● PaaS components do not work locally
○ Cloud providers should provide fake implementations
○ Exceptions: Kubernetes, Cloud SQL, Relational Database Service, (S3)
● Integrating a PaaS service as a fixture component is challenging
○ Distribute access tokens, etc
○ Pay $ or $$$
Top anti-patterns
1. Test as afterthought or in production
Data processing applications are suited for test!
2. Static test input in version control
3. Exact expected output test oracle
4. Unit testing volatile interfaces
5. Using mocks & dependency injection
6. Tool-specific test framework - vendor lock-in
7. Using wall clock time
8. Embedded fixture components
Thank you. Questions?
Credits:
Øyvind Løkling, Schibsted Media Group
● Content inspiration
Confluent, LinkedIn, Google, Netflix, Apache Samza
● Images
Tracey Saxby, Integration and Application Network, University of Maryland Center for Environmental Science (ian.umces.edu/imagelibrary/).
Bonus slides
Quality testing variants
● Functional regression
○ Binary, key to productivity
● Golden set
○ Extreme inputs => obvious output
○ No regressions tolerated
● (Saved) production data input
○ Individual regressions ok
○ Weighted sum must not decline
○ Beware of privacy
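For the saved production data variant, "individual regressions ok, weighted sum must not decline" could be checked along these lines. The per-case scores, the weights, and the function names are made up for illustration.

```python
from typing import Dict


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-case quality scores (missing weights default to 1)."""
    return sum(score * weights.get(case, 1.0) for case, score in scores.items())


def accept_change(old: Dict[str, float], new: Dict[str, float], weights: Dict[str, float]) -> bool:
    """Accept if the weighted total does not decline, even if single cases do."""
    return weighted_score(new, weights) >= weighted_score(old, weights)


weights = {"frequent_query": 3.0, "rare_query": 1.0}     # hypothetical weights
old_scores = {"frequent_query": 0.5, "rare_query": 1.0}
new_scores = {"frequent_query": 1.0, "rare_query": 0.5}  # one case regressed

accepted = accept_change(old_scores, new_scores, weights)
```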
Obtaining quality metrics
● Processing tool (Spark/Hadoop) counters
○ Odd code path => bump counter
● Dedicated quality assessment pipelines
○ Reuse test oracle invariants in production
[Diagram labels: Hadoop / Spark counters; Quality assessment job; DB]
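"Odd code path => bump counter" might look like the sketch below. A `collections.Counter` stands in for the processing tool's counters (in Spark these would typically be accumulators). The `age` field and the counter names are invented for the example.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional

counters: Counter = Counter()  # stand-in for Spark/Hadoop counters


def parse_age(record: Dict[str, str]) -> Optional[int]:
    """Parse an age field, bumping a counter on each odd code path."""
    raw = record.get("age")
    if raw is None:
        counters["age_missing"] += 1
        return None
    try:
        age = int(raw)
    except ValueError:
        counters["age_not_numeric"] += 1
        return None
    if not 0 <= age <= 130:
        counters["age_out_of_range"] += 1
        return None
    return age


def run(records: Iterable[Dict[str, str]]) -> List[int]:
    """Keep only the ages that parsed cleanly."""
    return [a for a in map(parse_age, records) if a is not None]


ages = run([{"age": "31"}, {"age": "abc"}, {}, {"age": "999"}, {"age": "7"}])
```

The job keeps running on bad records, and the counters record how often each odd path was taken so that the rates can be monitored.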
Quality testing in the process
● Binary self-contained
○ Validate in CI
● Relative vs history
○ E.g. large drops
○ Precondition for publishing dataset
● Push aggregates to DB
○ Standard ops: monitor, alert
[Diagram labels: DB; ∆?; Code ∆!]
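A "relative vs history" check used as a precondition for publishing a dataset could be sketched as follows. The function name, the use of a simple mean, and the 50% threshold are assumptions for the example, not a recommendation from the talk.

```python
from statistics import mean
from typing import Sequence


def ok_to_publish(todays_count: int, history: Sequence[int], max_drop: float = 0.5) -> bool:
    """Return False if today's record count dropped too far below the recent mean.

    max_drop=0.5 (an assumed threshold) blocks publishing when the count
    falls below 50% of the historical mean.
    """
    if not history:
        return True  # nothing to compare against yet
    return todays_count >= (1.0 - max_drop) * mean(history)


history = [1000, 1100, 900]
normal_day = ok_to_publish(950, history)
large_drop = ok_to_publish(300, history)
```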
1 of 26

Recommended

Apache Kafka at LinkedIn by
Apache Kafka at LinkedInApache Kafka at LinkedIn
Apache Kafka at LinkedInDiscover Pinterest
4.6K views37 slides
Kafka Tutorial - Introduction to Apache Kafka (Part 1) by
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Jean-Paul Azar
8.9K views250 slides
What's the time? ...and why? (Mattias Sax, Confluent) Kafka Summit SF 2019 by
What's the time? ...and why? (Mattias Sax, Confluent) Kafka Summit SF 2019What's the time? ...and why? (Mattias Sax, Confluent) Kafka Summit SF 2019
What's the time? ...and why? (Mattias Sax, Confluent) Kafka Summit SF 2019confluent
5.3K views42 slides
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac... by
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...HostedbyConfluent
4.3K views38 slides
Kappa vs Lambda Architectures and Technology Comparison by
Kappa vs Lambda Architectures and Technology ComparisonKappa vs Lambda Architectures and Technology Comparison
Kappa vs Lambda Architectures and Technology ComparisonKai Wähner
1.2K views26 slides
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring by
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringDatabricks
3.5K views42 slides

More Related Content

What's hot

Flexible and Real-Time Stream Processing with Apache Flink by
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
2.2K views38 slides
From Zero to Hero with Kafka Connect by
From Zero to Hero with Kafka ConnectFrom Zero to Hero with Kafka Connect
From Zero to Hero with Kafka ConnectDatabricks
430 views72 slides
Stream processing using Kafka by
Stream processing using KafkaStream processing using Kafka
Stream processing using KafkaKnoldus Inc.
1.6K views44 slides
Handle Large Messages In Apache Kafka by
Handle Large Messages In Apache KafkaHandle Large Messages In Apache Kafka
Handle Large Messages In Apache KafkaJiangjie Qin
46.7K views59 slides
Grafana introduction by
Grafana introductionGrafana introduction
Grafana introductionRico Chen
8.8K views11 slides
Introduction to Apache Flink - Fast and reliable big data processing by
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingTill Rohrmann
7.2K views23 slides

What's hot(20)

Flexible and Real-Time Stream Processing with Apache Flink by DataWorks Summit
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
DataWorks Summit2.2K views
From Zero to Hero with Kafka Connect by Databricks
From Zero to Hero with Kafka ConnectFrom Zero to Hero with Kafka Connect
From Zero to Hero with Kafka Connect
Databricks430 views
Stream processing using Kafka by Knoldus Inc.
Stream processing using KafkaStream processing using Kafka
Stream processing using Kafka
Knoldus Inc.1.6K views
Handle Large Messages In Apache Kafka by Jiangjie Qin
Handle Large Messages In Apache KafkaHandle Large Messages In Apache Kafka
Handle Large Messages In Apache Kafka
Jiangjie Qin46.7K views
Grafana introduction by Rico Chen
Grafana introductionGrafana introduction
Grafana introduction
Rico Chen8.8K views
Introduction to Apache Flink - Fast and reliable big data processing by Till Rohrmann
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
Till Rohrmann7.2K views
Monitoring and Resiliency Testing our Apache Kafka Clusters at Goldman Sachs ... by HostedbyConfluent
Monitoring and Resiliency Testing our Apache Kafka Clusters at Goldman Sachs ...Monitoring and Resiliency Testing our Apache Kafka Clusters at Goldman Sachs ...
Monitoring and Resiliency Testing our Apache Kafka Clusters at Goldman Sachs ...
HostedbyConfluent1.2K views
Scylla Summit 2022: Scylla 5.0 New Features, Part 1 by ScyllaDB
Scylla Summit 2022: Scylla 5.0 New Features, Part 1Scylla Summit 2022: Scylla 5.0 New Features, Part 1
Scylla Summit 2022: Scylla 5.0 New Features, Part 1
ScyllaDB852 views
Hudi architecture, fundamentals and capabilities by Nishith Agarwal
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
Nishith Agarwal2.8K views
From Zero to Hero with Kafka Connect by confluent
From Zero to Hero with Kafka ConnectFrom Zero to Hero with Kafka Connect
From Zero to Hero with Kafka Connect
confluent9.1K views
Scalability, Availability & Stability Patterns by Jonas Bonér
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability Patterns
Jonas Bonér516.6K views
Confluent Workshop Series: ksqlDB로 스트리밍 앱 빌드 by confluent
Confluent Workshop Series: ksqlDB로 스트리밍 앱 빌드Confluent Workshop Series: ksqlDB로 스트리밍 앱 빌드
Confluent Workshop Series: ksqlDB로 스트리밍 앱 빌드
confluent640 views
Stephan Ewen - Experiences running Flink at Very Large Scale by Ververica
Stephan Ewen -  Experiences running Flink at Very Large ScaleStephan Ewen -  Experiences running Flink at Very Large Scale
Stephan Ewen - Experiences running Flink at Very Large Scale
Ververica 3.5K views
ksqlDB: A Stream-Relational Database System by confluent
ksqlDB: A Stream-Relational Database SystemksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database System
confluent1.4K views
Kafka Security 101 and Real-World Tips by confluent
Kafka Security 101 and Real-World Tips Kafka Security 101 and Real-World Tips
Kafka Security 101 and Real-World Tips
confluent6.6K views
Kafka Streams: What it is, and how to use it? by confluent
Kafka Streams: What it is, and how to use it?Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?
confluent2K views
Apache Iceberg - A Table Format for Hige Analytic Datasets by Alluxio, Inc.
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
Alluxio, Inc.6.6K views
Beautiful Monitoring With Grafana and InfluxDB by leesjensen
Beautiful Monitoring With Grafana and InfluxDBBeautiful Monitoring With Grafana and InfluxDB
Beautiful Monitoring With Grafana and InfluxDB
leesjensen17.8K views
An Introduction to Apache Kafka by Amir Sedighi
An Introduction to Apache KafkaAn Introduction to Apache Kafka
An Introduction to Apache Kafka
Amir Sedighi2.7K views
Large Scale Lakehouse Implementation Using Structured Streaming by Databricks
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured Streaming
Databricks490 views

Viewers also liked

Test strategies for data processing pipelines, v2.0 by
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Lars Albertsson
2.7K views36 slides
Apache Flink Crash Course by Slim Baltagi and Srini Palthepu by
Apache Flink Crash Course by Slim Baltagi and Srini PalthepuApache Flink Crash Course by Slim Baltagi and Srini Palthepu
Apache Flink Crash Course by Slim Baltagi and Srini PalthepuSlim Baltagi
11.7K views111 slides
Testing distributed, complex web applications by
Testing distributed, complex web applicationsTesting distributed, complex web applications
Testing distributed, complex web applicationsJens-Christian Fischer
3K views82 slides
Step-by-Step Introduction to Apache Flink by
Step-by-Step Introduction to Apache Flink Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink Slim Baltagi
22.7K views64 slides
10 ways to stumble with big data by
10 ways to stumble with big data10 ways to stumble with big data
10 ways to stumble with big dataLars Albertsson
1.4K views18 slides
A Beginner's Guide to Building Data Pipelines with Luigi by
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiGrowth Intelligence
57K views26 slides

Viewers also liked(12)

Test strategies for data processing pipelines, v2.0 by Lars Albertsson
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0
Lars Albertsson2.7K views
Apache Flink Crash Course by Slim Baltagi and Srini Palthepu by Slim Baltagi
Apache Flink Crash Course by Slim Baltagi and Srini PalthepuApache Flink Crash Course by Slim Baltagi and Srini Palthepu
Apache Flink Crash Course by Slim Baltagi and Srini Palthepu
Slim Baltagi11.7K views
Step-by-Step Introduction to Apache Flink by Slim Baltagi
Step-by-Step Introduction to Apache Flink Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink
Slim Baltagi22.7K views
10 ways to stumble with big data by Lars Albertsson
10 ways to stumble with big data10 ways to stumble with big data
10 ways to stumble with big data
Lars Albertsson1.4K views
A Beginner's Guide to Building Data Pipelines with Luigi by Growth Intelligence
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with Luigi
The Mechanics of Testing Large Data Pipelines (QCon London 2016) by Mathieu Bastian
The Mechanics of Testing Large Data Pipelines (QCon London 2016)The Mechanics of Testing Large Data Pipelines (QCon London 2016)
The Mechanics of Testing Large Data Pipelines (QCon London 2016)
Mathieu Bastian7.7K views
Data pipelines from zero to solid by Lars Albertsson
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solid
Lars Albertsson10.7K views
Building Scalable Data Pipelines - 2016 DataPalooza Seattle by Evan Chan
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Evan Chan5.7K views
Overview of Apache Flink: Next-Gen Big Data Analytics Framework by Slim Baltagi
Overview of Apache Flink: Next-Gen Big Data Analytics FrameworkOverview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics Framework
Slim Baltagi32.5K views
Building a unified data pipeline in Apache Spark by DataWorks Summit
Building a unified data pipeline in Apache SparkBuilding a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache Spark
DataWorks Summit26.2K views
Building a Data Pipeline from Scratch - Joe Crobak by Hakka Labs
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe Crobak
Hakka Labs38.6K views

Similar to Testing data streaming applications

Test strategies for data processing pipelines by
Test strategies for data processing pipelinesTest strategies for data processing pipelines
Test strategies for data processing pipelinesLars Albertsson
5.2K views43 slides
A primer on building real time data-driven products by
A primer on building real time data-driven productsA primer on building real time data-driven products
A primer on building real time data-driven productsLars Albertsson
951 views17 slides
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W... by
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...StampedeCon
553 views43 slides
Scaling up uber's real time data analytics by
Scaling up uber's real time data analyticsScaling up uber's real time data analytics
Scaling up uber's real time data analyticsXiang Fu
758 views49 slides
Machine learning and big data @ uber a tale of two systems by
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsZhenxiao Luo
2.2K views31 slides
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma by
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSpark Summit
1.1K views17 slides

Similar to Testing data streaming applications(20)

Test strategies for data processing pipelines by Lars Albertsson
Test strategies for data processing pipelinesTest strategies for data processing pipelines
Test strategies for data processing pipelines
Lars Albertsson5.2K views
A primer on building real time data-driven products by Lars Albertsson
A primer on building real time data-driven productsA primer on building real time data-driven products
A primer on building real time data-driven products
Lars Albertsson951 views
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W... by StampedeCon
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
StampedeCon553 views
Scaling up uber's real time data analytics by Xiang Fu
Scaling up uber's real time data analyticsScaling up uber's real time data analytics
Scaling up uber's real time data analytics
Xiang Fu758 views
Machine learning and big data @ uber a tale of two systems by Zhenxiao Luo
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systems
Zhenxiao Luo2.2K views
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma by Spark Summit
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Spark Summit1.1K views
Holistic data application quality by Lars Albertsson
Holistic data application qualityHolistic data application quality
Holistic data application quality
Lars Albertsson396 views
Structured Streaming in Spark by Digital Vidya
Structured Streaming in SparkStructured Streaming in Spark
Structured Streaming in Spark
Digital Vidya935 views
Near real-time anomaly detection at Lyft by markgrover
Near real-time anomaly detection at LyftNear real-time anomaly detection at Lyft
Near real-time anomaly detection at Lyft
markgrover1.9K views
[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin... by Anna Ossowski
[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...
[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...
Anna Ossowski115 views
Distributed real time stream processing- why and how by Petr Zapletal
Distributed real time stream processing- why and howDistributed real time stream processing- why and how
Distributed real time stream processing- why and how
Petr Zapletal20.2K views
Intro to Apache Apex - Next Gen Platform for Ingest and Transform by Apache Apex
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Apache Apex1.2K views
Druid Optimizations for Scaling Customer Facing Analytics by Amir Youssefi
Druid Optimizations for Scaling Customer Facing AnalyticsDruid Optimizations for Scaling Customer Facing Analytics
Druid Optimizations for Scaling Customer Facing Analytics
Amir Youssefi20 views
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by... by NETWAYS
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
NETWAYS20 views
Agile Oracle to PostgreSQL migrations (PGConf.EU 2013) by Gabriele Bartolini
Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)
Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)
Gabriele Bartolini5.1K views
Data ops in practice - Swedish style by Lars Albertsson
Data ops in practice - Swedish styleData ops in practice - Swedish style
Data ops in practice - Swedish style
Lars Albertsson408 views
Webinar september 2013 by Marc Gille
Webinar september 2013Webinar september 2013
Webinar september 2013
Marc Gille1.7K views
The Lyft data platform: Now and in the future by markgrover
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the future
markgrover3.8K views

More from Lars Albertsson

Crossing the data divide by
Crossing the data divideCrossing the data divide
Crossing the data divideLars Albertsson
3 views31 slides
Schema management with Scalameta by
Schema management with ScalametaSchema management with Scalameta
Schema management with ScalametaLars Albertsson
7 views50 slides
How to not kill people - Berlin Buzzwords 2023.pdf by
How to not kill people - Berlin Buzzwords 2023.pdfHow to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdfLars Albertsson
34 views51 slides
Data engineering in 10 years.pdf by
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdfLars Albertsson
842 views52 slides
The 7 habits of data effective companies.pdf by
The 7 habits of data effective companies.pdfThe 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdfLars Albertsson
252 views44 slides
Secure software supply chain on a shoestring budget by
Secure software supply chain on a shoestring budgetSecure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetLars Albertsson
268 views49 slides

More from Lars Albertsson(20)

How to not kill people - Berlin Buzzwords 2023.pdf by Lars Albertsson
How to not kill people - Berlin Buzzwords 2023.pdfHow to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdf
Lars Albertsson34 views
Data engineering in 10 years.pdf by Lars Albertsson
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdf
Lars Albertsson842 views
The 7 habits of data effective companies.pdf by Lars Albertsson
The 7 habits of data effective companies.pdfThe 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdf
Lars Albertsson252 views
Secure software supply chain on a shoestring budget by Lars Albertsson
Secure software supply chain on a shoestring budgetSecure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budget
Lars Albertsson268 views
DataOps - Lean principles and lean practices by Lars Albertsson
DataOps - Lean principles and lean practicesDataOps - Lean principles and lean practices
DataOps - Lean principles and lean practices
Lars Albertsson787 views
The right side of speed - learning to shift left by Lars Albertsson
The right side of speed - learning to shift leftThe right side of speed - learning to shift left
The right side of speed - learning to shift left
Lars Albertsson202 views
Mortal analytics - Covid-19 and the problem of data quality by Lars Albertsson
Mortal analytics - Covid-19 and the problem of data qualityMortal analytics - Covid-19 and the problem of data quality
Mortal analytics - Covid-19 and the problem of data quality
Lars Albertsson416 views
Eventually, time will kill your data processing by Lars Albertsson
Eventually, time will kill your data processingEventually, time will kill your data processing
Eventually, time will kill your data processing
Lars Albertsson413 views
Taming the reproducibility crisis by Lars Albertsson
Taming the reproducibility crisisTaming the reproducibility crisis
Taming the reproducibility crisis
Lars Albertsson521 views
Eventually, time will kill your data pipeline by Lars Albertsson
Eventually, time will kill your data pipelineEventually, time will kill your data pipeline
Eventually, time will kill your data pipeline
Lars Albertsson936 views

Recently uploaded

DGST Methodology Presentation.pdf by
DGST Methodology Presentation.pdfDGST Methodology Presentation.pdf
DGST Methodology Presentation.pdfmaddierlegum
7 views9 slides
4_4_WP_4_06_ND_Model.pptx by
4_4_WP_4_06_ND_Model.pptx4_4_WP_4_06_ND_Model.pptx
4_4_WP_4_06_ND_Model.pptxd6fmc6kwd4
7 views13 slides
Construction Accidents & Injuries by
Construction Accidents & InjuriesConstruction Accidents & Injuries
Construction Accidents & InjuriesBisnar Chase Personal Injury Attorneys
6 views5 slides
AvizoImageSegmentation.pptx by
AvizoImageSegmentation.pptxAvizoImageSegmentation.pptx
AvizoImageSegmentation.pptxnathanielbutterworth1
7 views14 slides
[DSC Europe 23] Branka Panic - Peace in the age of artificial intelligence.pptx by
[DSC Europe 23] Branka Panic - Peace in the age of artificial intelligence.pptx[DSC Europe 23] Branka Panic - Peace in the age of artificial intelligence.pptx
[DSC Europe 23] Branka Panic - Peace in the age of artificial intelligence.pptxDataScienceConferenc1
5 views15 slides
AIMS-EREA.pdf by
AIMS-EREA.pdfAIMS-EREA.pdf

Testing data streaming applications

  • 1. Testing data streaming applications
    Lars Albertsson, independent consultant
    Øyvind Løkling, Schibsted Media Group
    www.mapflat.com
  • 2. Who’s talking?
    ● Swedish Institute of Computer Science (distributed system test+debug tools)
    ● Sun Microsystems (very large machines)
    ● Google (Hangouts, productivity)
    ● Recorded Future (NLP startup)
    ● Cinnober Financial Tech. (trading systems)
    ● Spotify (data processing & modelling)
    ● Schibsted Media Group (data processing & modelling)
    ● Mapflat - independent data engineering consultant
  • 3. Why stream processing?
    ● Increasing number of data-driven features
    ● 90+% fed by batch processing
      ○ Simpler, better tooling
      ○ 1+ hour data reaction time
    ● Stream processing for
      ○ 100ms - 1 hour reaction
      ○ Decoupled, asynchronous microservices
    [Diagram labels: User content, Professional content, Ads / partners, User behaviour, Systems, Ads, System diagnostics, Recommendations, Data-based features, Curated content, Pushing, Business intelligence, Experiments, Exploration]
  • 4. The organic past
    ● Many paths
    ● Synchronous
    ● Link failure -> chain failure
    ● Heterogeneous
    ● Difficult to recover from transformation bugs
    [Diagram labels: Service, App, DB, Poll, Queue, Aggregate logs, NFS, Hourly dump, Data warehouse, ETL, scp, HTTP]
  • 5. The unified log
    ● Publish data in streams
    ● Replicated, sharded append-only log
    ● Pub / sub with history
      ○ Kafka, Google Pub/Sub, AWS Kinesis
    ● Tap to data lake for batch processing
    [Diagram labels: Unified log, Ads, Search, Feed, App, Stream, Data lake]
  • 6. Stream processing
    ● Decoupled producers/consumers
      ○ In source/deployment
      ○ In space
      ○ In time
    ● Publish results to log
    ● Recovers from link failures
    ● Replay on job bug fix
    [Diagram labels: Ads, Search, Feed, App, Stream, Job, Data lake, Business intelligence]
  • 7. Stream processing building blocks
    ● Aggregate
      ○ Calculate time windows
      ○ Aggregate state (in memory / local database / shared database)
    ● Filter
      ○ Slim down stream
      ○ Privacy, security concerns
    ● Join
      ○ Enrich by joining with datasets, e.g. geo IP lookup, demographics
      ○ Join streams within time windows, e.g. click-through rate
    ● Transform
      ○ Bring data into same “shape”, schema
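As an illustration of the "Aggregate" building block, here is a minimal sketch of a tumbling time window count in plain Python. It is not tied to any of the stream processors named in this deck; the event fields `ts` (event time in seconds) and `key` are assumptions made for the example.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key), based on event time only."""
    counts = defaultdict(int)
    for event in events:
        # Align the event timestamp to the start of its window.
        window_start = event["ts"] - event["ts"] % window_seconds
        counts[(window_start, event["key"])] += 1
    return dict(counts)


# Hypothetical input: "ts" and "key" are illustrative field names.
events = [
    {"ts": 0, "key": "click"},
    {"ts": 59, "key": "click"},
    {"ts": 60, "key": "click"},
    {"ts": 61, "key": "view"},
]
counts = tumbling_window_counts(events, window_seconds=60)
```

Because the windows are derived from the timestamps carried by the events, the same input always yields the same counts, which is what makes such a job easy to test.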
  • 8. Stream processing technologies
    ● Spark Streaming
      ○ Ideal if you are already using Spark, same model
      ○ Bridges gap between data science / data engineers, batch and stream
    ● Kafka Streams
      ○ Library - new, positions itself as a lightweight alternative
      ○ Tightly coupled to Kafka
    ● Others
      ○ Storm, Heron, Flink, Samza, Google Dataflow, AWS Lambda
  • 9. Egress
    ● Update database table, e.g. for polling dashboard
    ● Create service index table n+1. Notify service to switch.
    ● Post to external web service
    ● Push stream to client
    [Diagram labels: Service, Stream, Job, App]
  • 10. Test concepts
    [Diagram labels: Test harness, Test fixture, System under test (SUT), 3rd party component (e.g. DB), Test input, Test oracle, Test framework (e.g. JUnit, Scalatest), Seam, IDEs, Build tools]
  • 11. Potential test scopes
    ● Unit
    ● Single job
    ● Multiple jobs
    ● Pipeline, including service
    ● Full system, including client
    Choose stable interfaces
    Each scope has a cost
    [Diagram labels: Job, Service, App, Stream]
  • 12. Stream application properties
    ● Output = function(input, code)
      ○ Perfect for testing!
      ○ Avoid: nondeterministic processing, reading wall clock
    ● Pipeline and job endpoints are stable
      ○ Correspond to business value
    ● Internal abstractions are volatile
      ○ Reslicing in different dimensions is common
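The "Output = function(input, code)" property can be sketched as a transformation that derives everything from the event itself and never reads the wall clock. The function name `enrich` and the fields `ts`, `hour` and `is_weekend` are hypothetical, chosen only for this example.

```python
from datetime import datetime, timezone


def enrich(event):
    """Pure transformation: the output depends only on the input event."""
    # Use the event's own timestamp (seconds since epoch), not datetime.now().
    event_time = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
    return {
        **event,
        "hour": event_time.hour,
        "is_weekend": event_time.weekday() >= 5,
    }


# Hypothetical sample event; ts=0 is 1970-01-01 00:00 UTC, a Thursday.
sample = {"user": "u1", "ts": 0}
first = enrich(sample)
second = enrich(sample)
```

Running the function twice on the same input gives identical output, so a test can compare results directly without tolerances or retries.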
  • 13. Recommended scopes
    ● Single job
    ● Multiple jobs
    ● Pipeline, including service
    [Diagram labels: Job, Service, App, Stream]
  • 14. Scopes to avoid
    ● Unit
      ○ Few stable interfaces
      ○ Not necessary
      ○ Avoid mocks, DI rituals
    ● Full system, including client
      ○ Client automation fragile
    “Focus on functional system tests, complement with smaller where you cannot get coverage.” - Henrik Kniberg
    [Diagram labels: Job, Service, App, Stream]
  • 15. Stream application, example harness
    [Diagram labels: Scalatest; Spark Streaming jobs; IDE, CI, debug integration; DB; Topic; Kafka; Test input; Test oracle; Docker; IDE / Gradle; Polling]
  • 16. Test lifecycle
    1. Start fixture containers
    2. Await fixture ready
    3. Allocate test case resources
    4. Start jobs
    5. Push input data to Kafka
    6. While (!done && !timeout) { pollDatabase(); sleep(1ms) }
    7. While (moreTests) { Goto 3 }
    8. Tear down fixture
    For absence test, send dummy sync messages at end.
    [Diagram annotations: 1. Docker; 2, 7. Scalatest; 4. Spark; 5; 6; IDE / Gradle]
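The polling loop in step 6 of the test lifecycle might look roughly like the following Python sketch. The deck's harness uses Scalatest and a real database; here `await_result`, `poll_database` and `fake_db_rows` are hypothetical stand-ins so that the example runs on its own.

```python
import time


def await_result(poll, is_done, timeout_s=5.0, interval_s=0.001):
    """Poll until is_done(result) holds, or raise once the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = poll()
        if is_done(result):
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"condition not met within {timeout_s}s; last result: {result!r}"
            )
        time.sleep(interval_s)


# Hypothetical stand-in for the database that the job under test writes to.
fake_db_rows = []
polls = {"count": 0}


def poll_database():
    polls["count"] += 1
    if polls["count"] == 3:  # simulate job output arriving after a short delay
        fake_db_rows.append({"user": "u1", "clicks": 2})
    return list(fake_db_rows)


rows = await_result(poll_database, is_done=lambda r: len(r) >= 1)
```

A short polling interval keeps the test fast when the output arrives early, while the timeout bounds how long a failing test can hang.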
  • 17. Input generation
    ● Input & output are denormalised & wide
    ● Fields are frequently changed
      ○ Additions are compatible
      ○ Modifications are incompatible => new, similar data type
    ● Static test input, e.g. JSON files
      ○ Unmaintainable
    ● Input generation routines
      ○ Robust to changes, reusable
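An input generation routine, as opposed to static JSON files, could be sketched like this. The event type and its fields (`event_id`, `user`, `ts`, `url`, `country`) are invented for illustration; the point is that each test states only the fields it cares about.

```python
import itertools

_ids = itertools.count(1)


def make_click_event(**overrides):
    """Build a valid event from defaults; tests override only relevant fields."""
    # Hypothetical schema: all field names and defaults are illustrative.
    event = {
        "event_id": next(_ids),
        "user": "test-user",
        "ts": 1_000,
        "url": "https://example.com/",
        "country": "SE",
    }
    unknown = set(overrides) - set(event)
    if unknown:
        raise KeyError(f"unknown fields: {sorted(unknown)}")
    event.update(overrides)
    return event


events = [
    make_click_event(user="u1"),
    make_click_event(user="u2", country="NO"),
]
```

When a field is added to the data type, only the defaults in the generator change, and existing tests keep working unmodified.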
  • 18. Test oracles
    ● Compare with expected output
    ● Check fields relevant for test
      ○ Robust to field changes
      ○ Reusable for new, similar types
    ● Tip: Use lenses
      ○ JSON: JsonPath (Java), Play JSON (Scala)
      ○ Case classes: Monocle
    ● Express invariants for each data type
      ○ Reuse for production data quality monitoring
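A test oracle that checks only the relevant fields, plus per-type invariants, might be sketched as follows. The dotted-path helper `get_path` is a simple stand-in for a lens library such as those named on the slide, and the record layout is an assumption made for the example.

```python
def get_path(record, path):
    """Follow a dotted path such as 'user.id' through nested dicts."""
    value = record
    for part in path.split("."):
        value = value[part]
    return value


def check_fields(record, expected):
    """Return {path: (actual, expected)} for mismatches; ignore other fields."""
    return {
        path: (get_path(record, path), want)
        for path, want in expected.items()
        if get_path(record, path) != want
    }


def click_count_invariants(record):
    """Names of violated invariants for a hypothetical click-count record."""
    checks = [
        ("clicks_non_negative", record["clicks"] >= 0),
        ("user_id_present", bool(record["user"]["id"])),
    ]
    return [name for name, ok in checks if not ok]


# Hypothetical job output; "new_field" stands for a later schema addition.
output = {"user": {"id": "u1", "country": "SE"}, "clicks": 2, "new_field": "x"}
mismatches = check_fields(output, {"user.id": "u1", "clicks": 2})
violations = click_count_invariants(output)
```

The oracle stays green when unrelated fields are added, and the invariant function can be run unchanged against production records for data quality monitoring.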
  • 19. Data pipeline = yet another program
    Don’t veer from best practices
    ● Regression testing
    ● Design: Separation of concerns, modularity, etc
    ● Process: CI/CD, code review, static analysis tools
    ● Avoid anti-patterns: Global state, hard-coding location, duplication, ...
    In data engineering, slipping is in the culture... :-(
    ● Mix in solid backend engineers
    ● Document “golden path”
  • 20. Testing with cloud services
    ● PaaS components do not work locally
      ○ Cloud providers should provide fake implementations
      ○ Exceptions: Kubernetes, Cloud SQL, Relational Database Service, (S3)
    ● Integrating a PaaS service as a fixture component is challenging
      ○ Distribute access tokens, etc
      ○ Pay $ or $$$
  • 21. Top anti-patterns
    1. Test as afterthought or in production
       Data processing applications are suited for test!
    2. Static test input in version control
    3. Exact expected output test oracle
    4. Unit testing volatile interfaces
    5. Using mocks & dependency injection
    6. Tool-specific test framework - vendor lock-in
    7. Using wall clock time
    8. Embedded fixture components
  • 22. Thank you. Questions?
    Credits: Øyvind Løkling, Schibsted Media Group
    ● Content inspiration: Confluent, LinkedIn, Google, Netflix, Apache Samza
    ● Images: Tracey Saxby, Integration and Application Network, University of Maryland Center for Environmental Science (ian.umces.edu/imagelibrary/)
  • 24. Quality testing variants
    ● Functional regression
      ○ Binary, key to productivity
    ● Golden set
      ○ Extreme inputs => obvious output
      ○ No regressions tolerated
    ● (Saved) production data input
      ○ Individual regressions ok
      ○ Weighted sum must not decline
      ○ Beware of privacy
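The "weighted sum must not decline" rule for saved production input can be sketched as below. The case names, weights and scores are made up for the example, and how a per-case quality score is computed is left open.

```python
def weighted_quality(scores, weights):
    """Weighted sum of per-case quality scores, each assumed to be in [0, 1]."""
    return sum(weights[case] * score for case, score in scores.items())


# Hypothetical cases and weights; real ones would come from saved production data.
weights = {"case_a": 3.0, "case_b": 1.0, "case_c": 1.0}
baseline = {"case_a": 0.5, "case_b": 1.0, "case_c": 1.0}
candidate = {"case_a": 1.0, "case_b": 0.5, "case_c": 1.0}  # case_b regressed

baseline_total = weighted_quality(baseline, weights)
candidate_total = weighted_quality(candidate, weights)
accepted = candidate_total >= baseline_total
```

Here one case regresses, but the candidate is still accepted because the overall weighted quality improves. A golden set would instead reject any single regression.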
  • 25. Obtaining quality metrics
    ● Processing tool (Spark/Hadoop) counters
      ○ Odd code path => bump counter
    ● Dedicated quality assessment pipelines
      ○ Reuse test oracle invariants in production
    [Diagram labels: Hadoop / Spark counters, DB, Quality assessment job]
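The "odd code path => bump counter" idea can be shown with a plain Python `Counter` standing in for the processing tool's counters. The input format (`user,clicks` lines) and the counter names are assumptions made for this sketch.

```python
from collections import Counter


def parse_events(raw_lines):
    """Parse 'user,clicks' lines; count odd code paths instead of failing."""
    counters = Counter()  # stand-in for the processing tool's counters
    parsed = []
    for line in raw_lines:
        parts = line.split(",")
        if len(parts) != 2:
            counters["malformed_line"] += 1
            continue
        user, clicks = parts
        if not clicks.isdigit():
            counters["non_numeric_clicks"] += 1
            continue
        counters["ok"] += 1
        parsed.append({"user": user, "clicks": int(clicks)})
    return parsed, counters


# Hypothetical raw input containing two odd records.
parsed, counters = parse_events(["u1,2", "garbage", "u2,many", "u3,5"])
```

The job keeps running on bad records, and the counter values become quality metrics that can be asserted on in tests and monitored in production.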
  • 26. Quality testing in the process
    ● Binary self-contained
      ○ Validate in CI
    ● Relative vs history
      ○ E.g. large drops
      ○ Precondition for publishing dataset
    ● Push aggregates to DB
      ○ Standard ops: monitor, alert
    [Diagram labels: DB, ∆?, Code, ∆!]
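A "relative vs history" check used as a precondition for publishing a dataset might look like this sketch. The function name `safe_to_publish`, the 50% threshold and the use of a simple mean over recent counts are all assumptions, not something the slides specify.

```python
def safe_to_publish(todays_count, recent_counts, max_drop=0.5):
    """Return False if today's count dropped more than max_drop vs recent mean."""
    if not recent_counts:
        return True  # no history to compare with yet
    baseline = sum(recent_counts) / len(recent_counts)
    return todays_count >= baseline * (1 - max_drop)


# Hypothetical record counts from previous runs of the same job.
history = [1000, 1100, 900]
normal_day = safe_to_publish(950, history)
large_drop = safe_to_publish(400, history)
```

A dataset that fails the check would be held back for inspection rather than published, since a large drop more often indicates an upstream problem than a real change in the data.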