www.mapflat.com
Test strategies for data processing pipelines
Lars Albertsson, independent consultant (Mapflat)
www.mapflat.com
Who's talking
● Swedish Institute of Computer Science (test & debug tools)
● Sun Microsystems (large machine verification)
● Google (Hangouts, productivity)
● Recorded Future (NLP startup) (data integrations)
● Cinnober Financial Tech. (trading systems)
● Spotify (data processing & modelling, productivity)
● Schibsted Products & Tech (data processing & modelling)
● Mapflat (independent data engineering consultant)
Agenda
● Data applications from a test perspective
● Testing batch processing products
● Testing stream processing products
● Data quality testing
Main focus is functional, regression testing
Prerequisites: Backend dev testing, basic data experience, reading Scala
Batch pipeline anatomy
[Diagram labels: Service, DB, Import, Ingress, Data lake (cluster storage, unified log), ETL, Pipeline, Job, Dataset, Egress, Export, Service, DB, Business intelligence]
Workflow orchestrator
● Dataset “build tool”
● Run job instance when
○ input is available
○ output missing
○ resources are available
● Backfill for previous failures
● DSL describes DAG
○ Includes ingress & egress
Luigi / Airflow
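The scheduling rule above (run a job instance when its input is available and its output is missing) can be sketched in a few lines of Scala. This is only an illustration, not how Luigi or Airflow are implemented: the `Job` model, `runnable`, and `build` are hypothetical names, resource availability is ignored, and "running" a job is simulated by marking its output as existing.

```scala
object OrchestratorSketch {
  // Hypothetical model: a job reads some datasets and writes exactly one.
  final case class Job(name: String, inputs: Set[String], output: String)

  // A job instance is runnable when all inputs exist and its output is missing.
  def runnable(jobs: Seq[Job], existing: Set[String]): Seq[Job] =
    jobs.filter(j => j.inputs.subsetOf(existing) && !existing.contains(j.output))

  // Repeatedly "run" runnable jobs until nothing more can be built.
  // Returns the job names in the order they were run.
  @annotation.tailrec
  def build(jobs: Seq[Job], existing: Set[String], order: Vector[String] = Vector.empty): Vector[String] = {
    val ready = runnable(jobs, existing)
    if (ready.isEmpty) order
    else build(jobs, existing ++ ready.map(_.output), order ++ ready.map(_.name))
  }
}
```

Because the rule only looks at which datasets exist, rerunning `build` after a failure fills in whatever is still missing, which is the essence of backfill.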
Stream pipeline anatomy
● Unified log - bus of all business events
● Pub/sub with history
○ Kafka
○ AWS Kinesis, Google Pub/Sub, Azure Event Hubs
● Decoupled producers/consumers
○ In source/deployment
○ In space
○ In time
● Publish results to log
● Recovery from link failures
● Replay on job bug fix
[Diagram labels: Apps (Ads, Search, Feed), Streams, Jobs, Data lake, Business intelligence]
Online vs offline
[Diagram labels: Online, Offline, Service, ETL, Cold store, AI feature, Dataset, Job, Pipeline]
Online failure vs Offline failure
Online failure:
● 10000s of customers, imprecise feedback
● Need low probability => proactive prevention => low ROI
Offline failure:
● 10s of employees, precise feedback
● OK with medium probability => reactive repair => high ROI
Risk = probability * impact
Value of testing
For data-centric (batch) applications, in this order:
● Productivity
○ Move fast without breaking things
● Fast experimentation
○ 10% good ideas, 90% neutral or bad
● Data quality
○ Challenging, more important than
● Technical quality
○ Technical failure =>
■ Operations hassle.
■ Stale data. Often acceptable.
■ No customers were harmed. Usually.
Significant harm to external customer is rare enough to be reactive - data-driven by bug frequency.
Test concepts
[Diagram labels: Test harness, Test fixture, System under test (SUT), 3rd party components (e.g. DB), Test input, Test oracle, Seam, Test framework (e.g. JUnit, Scalatest), IDEs, Build tools]
Data pipeline properties
● Output = function(input, code)
○ No external factors => deterministic
○ Easy to craft input, even in large tests
○ Perfect for test!
● Pipeline and job endpoints are stable
○ Correspond to business value
● Internal abstractions are volatile
○ Reslicing in different dimensions is common
Potential test scopes
● Unit/component
● Single job
● Multiple jobs
● Pipeline, including service
● Full system, including client
Choose stable interfaces
Each scope has a cost
Recommended scopes
● Single job
● Multiple jobs
● Pipeline, including service
Scopes to avoid
● Unit/Component
○ Few stable interfaces
○ Not necessary
○ Avoid mocks, DI rituals
● Full system, including client
○ Client automation fragile
“Focus on functional system tests, complement
with smaller where you cannot get coverage.”
- Henrik Kniberg
Testing single batch job
[Diagram: Job inside a standard Scalatest harness]
1. Generate input: f() writes file://test_input/
2. Run in local mode
3. Verify output: p() checks file://test_output/
Runs well in CI / from IDE
Testing batch pipelines - two options
A: Standard Scalatest harness
1. Generate input: f() writes file://test_input/
2. Run custom multi-job: a test job with a sequence of jobs
3. Verify output: p() checks file://test_output/
+ Runs in CI
+ Runs in IDE
+ Quick setup
- Multi-job maintenance
B: Customised workflow manager setup, between f() and p()
+ Tests workflow logic
+ More authentic
- Workflow mgr setup for testability
- Difficult to debug
- Dataset handling with Python
● Both can be extended with ingress (Kafka), egress DBs
Test input data is code
● Tied to production code
● Should live in version control
● Duplication to eliminate
● Generation gives more power
○ Randomness can be desirable
○ Larger scale
○ Statistical distribution
● Maintainable
> git add src/resources/test-data.json
userInputFile.overwrite(Json.toJson(
  userTemplate.copy(
    age = 27,
    product = "premium",
    country = "NO"))
  .toString)
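To illustrate "generation gives more power", here is a minimal sketch of deriving many test records from one template with seeded randomness. The `User` fields, the country list, and the `randomUsers` helper are assumptions made for this example; the slides do not show the full record type. A fixed seed keeps a failing run reproducible.

```scala
import scala.util.Random

object TestDataSketch {
  // Hypothetical record type; only age, product, and country appear in the slides.
  final case class User(id: Int, age: Int, product: String, country: String)

  val userTemplate: User = User(id = 0, age = 27, product = "premium", country = "NO")

  // Derive n users from the template. The same seed always yields the same users.
  def randomUsers(n: Int, seed: Long): Seq[User] = {
    val rnd = new Random(seed)
    val countries = Vector("NO", "SE", "DK", "FI")
    (1 to n).map { i =>
      userTemplate.copy(
        id = i,
        age = 18 + rnd.nextInt(60), // ages in [18, 78)
        country = countries(rnd.nextInt(countries.size)))
    }
  }
}
```

Fields that a test does not care about stay at the template values, so each test only states what is special about its input.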
[Diagram: prod data used as test data: Privacy!]
Putting batch tests together
// Assumed libraries: better-files (file operations), Play JSON, Scalatest
import better.files._
import org.scalatest.FlatSpec
import play.api.libs.json.Json

class CleanUserTest extends FlatSpec {
  val inFile = (tmpDir / "test_user.json")
  val outFile = (tmpDir / "test_output.json")

  // One JSON record per line
  def writeInput(users: Seq[User]) =
    inFile.overwrite(users.map(u => Json.toJson(u).toString).mkString("\n"))

  def readOutput() =
    outFile.contentAsString.split("\n").map(line => Json.fromJson[User](Json.parse(line)).get).toSeq

  def runJob(input: Seq[User]): Seq[User] = {
    writeInput(input)
    val args = Seq(
      "--user", inFile.path.toString,
      "--output", outFile.path.toString)
    CleanUserJob.main(args.toArray) // Works for some processing frameworks, e.g. Spark
    readOutput()
  }

  "Clean users" should "remain untouched" in {
    val output = runJob(Seq(TestInput.userTemplate))
    assert(output === Seq(TestInput.userTemplate))
  }
}
Test oracle
class CleanUserTest extends FlatSpec {
  // ...
  "Lower case country" should "translate to upper case" in {
    val input = Seq(userTemplate.copy(country = "se"))
    val output = runJob(input)
    assert(output.size === 1)
    assert(output.head.country === "SE")
    // Same check through a Monocle lens
    val lens = GenLens[User](_.country)
    assert(lens.get(output.head) === "SE")
  }
}
● Avoid full record comparisons
○ Except for a few tests
● Examine only fields relevant to test
○ Or tests break for unrelated changes
● Lenses are your friend
○ JSON: JsonPath (Java), Play (Scala), ...
○ Case classes: Monocle, Shapeless, Scalaz, Sauron, Quicklens
Inspecting counters
class CleanUserTest extends FlatSpec {
  def runJob(input: Seq[User]): (Seq[User], Map[String, Int]) = {
    writeInput(input)
    val args = Seq("--user", inFile.path.toString,
      "--output", outFile.path.toString)
    CleanUserJob.main(args.toArray)
    // Ugly way to expose counters
    val counters = Map(
      "invalid" -> CleanUserJob.invalidCount,
      "upper-cased" -> CleanUserJob.upperCased)
    (readOutput(), counters)
  }
  "Lower case country" should "translate to upper case" in {
    val input = Seq(userTemplate.copy(country = "se"))
    val (output, counters) = runJob(input)
    assert(output.size === 1)
    assert(output.head.country === "SE")
    assert(counters("upper-cased") === 1)
    // No other counter may have been bumped
    assert((counters - "upper-cased").filter(_._2 != 0).isEmpty)
  }
}
● Counters (accumulators in Spark) are critical for monitoring and quality
● Test that expected counters are bumped
○ But no other counters
Invariants
class CleanUserTest extends FlatSpec {
  def validateInvariants(
      input: Seq[User],
      output: Seq[User],
      counters: Map[String, Int]) = {
    output.foreach(recordInvariant)
    // Dataset invariants
    assert(input.size === output.size)
    assert(input.size >= counters("upper-cased"))
  }
  def recordInvariant(u: User) =
    assert(u.country.size === 2)
  def runJob(input: Seq[User]): (Seq[User], Map[String, Int]) = {
    // Same as before
    ...
    validateInvariants(input, output, counters)
    (output, counters)
  }
  // Test case is the same
}
● Some things are true
○ For every record
○ For every job invocation
● Not necessarily in production
○ Reuse invariant predicates as quality probes
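Reusing invariant predicates as quality probes could look roughly like the sketch below: the same named predicates fail fast in tests but only count violations in production. The `User` type, the invariant names, and the `probe` helper are hypothetical and chosen for this example.

```scala
object InvariantProbeSketch {
  // Hypothetical record type for illustration.
  final case class User(name: String, country: String)

  // Named record invariants: each predicate must hold for a valid record.
  val recordInvariants: Map[String, User => Boolean] = Map(
    "country-two-letters" -> ((u: User) => u.country.length == 2),
    "country-upper-case" -> ((u: User) => u.country == u.country.toUpperCase))

  // In tests: fail on the first violation.
  def assertInvariants(users: Seq[User]): Unit =
    for (u <- users; (name, holds) <- recordInvariants)
      assert(holds(u), s"invariant $name violated by $u")

  // In production: count violations per invariant instead of failing,
  // so the numbers can be exported as counters or quality metrics.
  def probe(users: Seq[User]): Map[String, Int] =
    recordInvariants.map { case (name, holds) => name -> users.count(u => !holds(u)) }
}
```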
Streaming SUT, example harness
[Diagram labels: Scalatest, Test input, Test oracle (polling), IDE / Gradle, Spark Streaming jobs, Service, JVM monolith for test (IDE, CI, debug integration), Docker, Kafka topic, DB]
Test lifecycle
1. Initiate from IDE / build system
2. Start fixture containers
3. Await fixture ready
4. Allocate test case resources
5. Start jobs
6. Push input data
7. While (!done && !timeout) {
pollDatabase()
sleep(1ms)
}
8. While (moreTests) { Goto 4 }
9. Tear down fixture
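Step 7 of the lifecycle above, polling until done or timeout, can be captured in a small reusable helper. This is a sketch under assumptions: `awaitCondition` is a hypothetical name, and the `done` function stands in for whatever check the test performs (for example a database query).

```scala
object PollingSketch {
  // Poll `done` until it returns true or the timeout expires.
  // Returns true if the condition was met in time, false on timeout.
  def awaitCondition(timeoutMs: Long, intervalMs: Long = 1L)(done: () => Boolean): Boolean = {
    val deadline = System.nanoTime() + timeoutMs * 1000000L
    @annotation.tailrec
    def loop(): Boolean =
      if (done()) true
      else if (System.nanoTime() >= deadline) false
      else { Thread.sleep(intervalMs); loop() }
    loop()
  }
}
```

A short polling interval keeps the suite fast when the job finishes quickly, while the timeout bounds how long a failing test case can take.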
Testing for absence
1. Send test input
2. Send dummy input
3. Await effect of dummy input
4. Verify test output absence
Assumes in-order semantics
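The four steps above can be sketched against a stand-in job. Everything here is hypothetical: `FilterJob` is a toy in-order consumer on a background thread that drops events starting with "invalid", used in place of a real streaming job and its Kafka/DB fixture.

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, LinkedBlockingQueue}

object AbsenceSketch {
  // Toy streaming job: consumes events in order, drops those starting with "invalid".
  final class FilterJob {
    private val in = new LinkedBlockingQueue[String]()
    private val out = new ConcurrentLinkedQueue[String]()
    private val worker = new Thread(new Runnable {
      def run(): Unit = while (true) {
        val event = in.take()
        if (!event.startsWith("invalid")) out.add(event)
      }
    })
    worker.setDaemon(true)
    worker.start()

    def send(event: String): Unit = in.put(event)
    def hasOutput(event: String): Boolean = out.contains(event)
  }

  // Returns true if the dummy came through and the test input did not.
  def verifyAbsent(job: FilterJob, testInput: String, timeoutMs: Long = 5000L): Boolean = {
    val dummy = "dummy-" + System.nanoTime()
    job.send(testInput) // 1. Send test input
    job.send(dummy)     // 2. Send dummy input
    val deadline = System.nanoTime() + timeoutMs * 1000000L
    // 3. Await effect of dummy input
    while (!job.hasOutput(dummy) && System.nanoTime() < deadline) Thread.sleep(1L)
    // 4. Verify test output absence
    job.hasOutput(dummy) && !job.hasOutput(testInput)
  }
}
```

Because events are processed in order, seeing the dummy in the output implies the test input has already been handled, so its absence is meaningful rather than a matter of timing.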
Testing in the process
1. Integration tests for happy paths.
2. Likely odd cases, e.g.
○ Empty inputs
○ Missing matching records, e.g. in joins
○ Odd encodings
3. Dark corners
○ Motivated for externally generated input
● Minimal test cases triggering desired paths
● On production failure:
○ Debug in production
○ Create minimal test case
○ Add to regression suite
● Cannot trigger bug in test?
○ Consider new scope? Significantly
different from old scopes.
○ Staging pipelines?
○ Automatic code inspection?
Quality testing variants
● Functional regression
○ Binary, key to productivity
● Golden set
○ Extreme inputs => obvious output
○ No regressions tolerated
● (Saved) production data input
○ Individual regressions ok
○ Weighted sum must not decline
○ Beware of privacy
Quality metrics
● Processing tool (Spark/Hadoop) counters
○ Odd code path => bump counter
● Dedicated quality assessment pipelines
● Workflow quality predicate for consumption
○ Depends on consumer use case
[Diagram labels: Hadoop / Spark counters, DB, Quality assessment job, Tiny quality metadataset]
Quality testing in the process
● Binary self-contained
○ Validate in CI
● Relative vs history
○ E.g. large drops
○ Precondition for consuming dataset?
● Push aggregates to DB
○ Standard ops: monitor, alert
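A "relative vs history" check for large drops might be sketched as follows. The function name, the use of the median as baseline, and the threshold semantics are assumptions for illustration; a real check would depend on the dataset and on the consumer's needs.

```scala
object HistoryCheckSketch {
  // Returns true when `current` has dropped by more than `maxDrop`
  // (a fraction, e.g. 0.5 for 50%) relative to the median of recent values.
  // An empty history never raises an alarm.
  def largeDrop(history: Seq[Long], current: Long, maxDrop: Double): Boolean =
    if (history.isEmpty) false
    else {
      val sorted = history.sorted
      val median = sorted(sorted.size / 2) // upper median for even sizes
      current < median * (1.0 - maxDrop)
    }
}
```

Such a predicate could serve as a precondition for consuming a dataset, for example by comparing today's record count with the counts of previous days.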
Data pipeline = yet another program
Don’t veer from best practices
● Regression testing
● Design: Separation of concerns, modularity, etc
● Process: CI/CD, code review, static analysis tools
● Avoid anti-patterns: Global state, hard-coding location, duplication, ...
In data engineering, slipping is in the culture... :-(
● Mix in solid backend engineers
● Document “golden path”
Test code = not yet another program
● Shared between data engineers & QA engineers
○ Best result with mutual respect and collaboration
● Readability >> abstractions
○ Create (Scala, Python) DSLs
● Some software bad practices are benign
○ Duplication
○ Inconsistencies
○ Lazy error handling
Honey traps
The cloud
● PaaS components do not work locally
○ Cloud providers should provide fake implementations
○ Exceptions: Kubernetes, Cloud SQL, Relational Database Service, (S3)
● Integrating a PaaS service as a fixture component is challenging
○ Distribute access tokens, etc
○ Pay $ or $$$. Much better with per-second billing.
Vendor batch test frameworks
● Spark, Scalding, Crunch variants
● Seam == internal data structure
○ Omits I/O - common bug source
● Vendor lock-in when switching batch framework:
○ Need tests for protection
○ Test rewrite is unnecessary burden
Top anti-patterns
1. Test as afterthought or in production. Data processing applications are suited for test!
2. Developer testing requires cluster
3. Static test input in version control
4. Exact expected output test oracle
5. Unit testing volatile interfaces
6. Using mocks & dependency injection
7. Tool-specific test framework - vendor lock-in
8. Using wall clock time
9. Embedded fixture components, e.g. in-memory Kafka/Cassandra/RDBMS
10. Performance testing (low ROI for offline)
11. Data quality not measured, not monitored
Thank you. Questions?
Credits:
Øyvind Løkling, Schibsted Media Group
Images:
Tracey Saxby, Integration and Application Network, University of Maryland Center for Environmental
Science (ian.umces.edu/imagelibrary/).
Bonus slides
Deployment, example
[Diagram labels: Hg/git repo (Luigi DSL, jars, config), my-pipe-7.tar.gz, HDFS, Luigi daemon, Workers]
> pip install my-pipe-7.tar.gz
All that a pipeline needs, installed atomically
Redundant cron schedule, higher frequency + backfill (Luigi range tools):
* 10 * * * bin/my_pipe_daily --backfill 14
Continuous deployment, example
● Poll and pull latest on worker nodes
○ virtualenv package/version
■ No need to sync environment & versions
○ Cron package/latest/bin/*
■ Old versions run pipelines to completion, then exit
[Diagram labels: Hg/git repo (Luigi DSL, jars, config), my-pipe-7.tar.gz, HDFS, Worker]
my_cd.py hdfs://pipelines/
> virtualenv my_pipe/7
> pip install my-pipe-7.tar.gz
* 10 * * * my_pipe/7/bin/*

Using PDB Relocation to Move a Single PDB to Another Existing CDB
 
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictSupply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
 

Test strategies for data processing pipelines, v2.0

  • 1. www.mapflat.com Test strategies for data processing pipelines. Lars Albertsson, independent consultant (Mapflat)
  • 2. Who's talking ● Swedish Institute of Computer Science (test & debug tools) ● Sun Microsystems (large machine verification) ● Google (Hangouts, productivity) ● Recorded Future (NLP startup) (data integrations) ● Cinnober Financial Tech. (trading systems) ● Spotify (data processing & modelling, productivity) ● Schibsted Products & Tech (data processing & modelling) ● Mapflat (independent data engineering consultant)
  • 3. Agenda ● Data applications from a test perspective ● Testing batch processing products ● Testing stream processing products ● Data quality testing. Main focus is functional regression testing. Prerequisites: backend dev testing, basic data experience, reading Scala.
  • 4. Batch pipeline anatomy. Diagram labels: Data lake, Cluster storage, Unified log, Ingress, ETL, Egress, DB, Service, Dataset, Job, Pipeline, Export, Import, Business intelligence.
  • 5. Workflow orchestrator ● Dataset “build tool” ● Run job instance when ○ input is available ○ output is missing ○ resources are available ● Backfill for previous failures ● DSL describes DAG ○ Includes ingress & egress ● Examples: Luigi / Airflow
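The scheduling rule on the workflow orchestrator slide (run a job instance when its input is available, its output is missing, and resources are available) can be sketched as a plain predicate. This is a minimal illustration, not the Luigi or Airflow API: `JobInstance`, `shouldRun` and `backfill` are hypothetical names, and datasets are modelled as simple path strings.

```scala
// Hypothetical model of an orchestrator's scheduling decision; not Luigi or Airflow API.
object OrchestratorSketch {
  final case class JobInstance(name: String, inputs: Set[String], output: String)

  // Runnable when all inputs exist, the output does not, and a worker slot is free.
  def shouldRun(job: JobInstance, existingDatasets: Set[String], freeSlots: Int): Boolean =
    job.inputs.subsetOf(existingDatasets) &&
      !existingDatasets.contains(job.output) &&
      freeSlots > 0

  // Backfill: applying the same rule to past instances picks up those whose
  // output is still missing, limited by the number of free slots.
  def backfill(jobs: Seq[JobInstance], existingDatasets: Set[String], freeSlots: Int): Seq[JobInstance] =
    jobs.filter(shouldRun(_, existingDatasets, freeSlots)).take(freeSlots)
}
```

Because the rule depends only on which datasets exist, rerunning it after a failure naturally retries the instances that did not produce output.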
  • 6. Stream pipeline anatomy ● Unified log - bus of all business events ● Pub/sub with history ○ Kafka ○ AWS Kinesis, Google Pub/Sub, Azure Event Hub ● Decoupled producers/consumers ○ In source/deployment ○ In space ○ In time ● Publish results to log ● Recovery from link failures ● Replay on job bug fix. Diagram labels: App (Ads, Search, Feed), Stream, Job, Data lake, Business intelligence.
  • 7. Online vs offline. Diagram labels: Online, Offline, ETL, Cold store, AI feature, Dataset, Job, Pipeline, Service.
  • 8. Online failure vs offline failure ● Online: 10000s of customers, imprecise feedback. Need low probability => Proactive prevention => Low ROI ● Offline: 10s of employees, precise feedback. OK with medium probability => Reactive repair => High ROI ● Risk = probability * impact
  • 9. Value of testing. For data-centric (batch) applications, in this order: ● Productivity ○ Move fast without breaking things ● Fast experimentation ○ 10% good ideas, 90% neutral or bad ● Data quality ○ Challenging, more important than technical quality ● Technical quality ○ Technical failure => ■ Operations hassle. ■ Stale data. Often acceptable. ■ No customers were harmed. Usually. Significant harm to external customers is rare enough to be handled reactively, data-driven by bug frequency.
  • 10. Test concepts. Diagram labels: Test harness, Test fixture, System under test (SUT), 3rd party components (e.g. DB), Test input, Test oracle, Test framework (e.g. JUnit, Scalatest), Seam, IDEs, Build tools.
  • 11. Data pipeline properties ● Output = function(input, code) ○ No external factors => deterministic ○ Easy to craft input, even in large tests ○ Perfect for test! ● Pipeline and job endpoints are stable ○ Correspond to business value ● Internal abstractions are volatile ○ Reslicing in different dimensions is common
  • 12. Potential test scopes ● Unit/component ● Single job ● Multiple jobs ● Pipeline, including service ● Full system, including client. Choose stable interfaces. Each scope has a cost.
  • 13. Recommended scopes ● Single job ● Multiple jobs ● Pipeline, including service
  • 14. Scopes to avoid ● Unit/Component ○ Few stable interfaces ○ Not necessary ○ Avoid mocks, DI rituals ● Full system, including client ○ Client automation fragile. “Focus on functional system tests, complement with smaller where you cannot get coverage.” - Henrik Kniberg
  • 15. Testing single batch job. Standard Scalatest harness: 1. Generate input (file://test_input/) 2. Run job in local mode 3. Verify output (file://test_output/). Runs well in CI / from IDE.
  • 16. Testing batch pipelines - two options. Both use a standard Scalatest harness: 1. Generate input (file://test_input/) 2. Run the pipeline 3. Verify output (file://test_output/). ● A: Custom multi-job test job running a sequence of jobs + Runs in CI + Runs in IDE + Quick setup - Multi-job maintenance ● B: Customised workflow manager setup + Tests workflow logic + More authentic - Workflow mgr setup for testability - Difficult to debug - Dataset handling with Python ● Both can be extended with ingress (Kafka), egress DBs
  • 17. Test data. Test input data is code ● Tied to production code ● Should live in version control ● Duplication to eliminate ● Generation gives more power ○ Randomness can be desirable ○ Larger scale ○ Statistical distribution ● Maintainable ● Prod data: privacy!
      > git add src/resources/test-data.json
      userInputFile.overwrite(Json.toJson(
        userTemplate.copy(age = 27, product = "premium", country = "NO")).toString)
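The "generation gives more power" bullets (randomness, larger scale, statistical distribution) could look roughly like the sketch below. The `User` fields mirror the template on the slide; the country list, the age range and the 20% premium share are made-up assumptions chosen only for illustration.

```scala
import scala.util.Random

object TestDataSketch {
  // Hypothetical record type; the slide only shows age, product and country fields.
  final case class User(age: Int, product: String, country: String)

  val userTemplate: User = User(age = 27, product = "premium", country = "NO")

  // A fixed seed keeps the "random" input reproducible between test runs.
  def generateUsers(n: Int, seed: Long): Seq[User] = {
    val rnd = new Random(seed)
    val countries = Vector("NO", "SE", "DK", "FI")
    Seq.fill(n) {
      userTemplate.copy(
        age = 18 + rnd.nextInt(60),
        // Roughly 20% premium users: a crude stand-in for a production-like distribution.
        product = if (rnd.nextDouble() < 0.2) "premium" else "free",
        country = countries(rnd.nextInt(countries.size)))
    }
  }
}
```

Seeding the generator is one way to get randomness without flaky tests: a failing seed can be logged and replayed.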
  • 18. Putting batch tests together
      import org.scalatest.FlatSpec
      import play.api.libs.json.Json

      // tmpDir, User, TestInput and CleanUserJob are defined elsewhere
      class CleanUserTest extends FlatSpec {
        val inFile = (tmpDir / "test_user.json")
        val outFile = (tmpDir / "test_output.json")

        // One JSON document per line
        def writeInput(users: Seq[User]) =
          inFile.overwrite(users.map(u => Json.toJson(u).toString).mkString("\n"))

        def readOutput(): Seq[User] =
          outFile.contentAsString.split("\n").map(line => Json.parse(line).as[User]).toSeq

        def runJob(input: Seq[User]): Seq[User] = {
          writeInput(input)
          val args = Seq(
            "--user", inFile.path.toString,
            "--output", outFile.path.toString)
          CleanUserJob.main(args.toArray) // Works for some processing frameworks, e.g. Spark
          readOutput()
        }

        "Clean users" should "remain untouched" in {
          val output = runJob(Seq(TestInput.userTemplate))
          assert(output === Seq(TestInput.userTemplate))
        }
      }
  • 19. Test oracle ● Avoid full record comparisons ○ Except for a few tests ● Examine only fields relevant to test ○ Or tests break for unrelated changes ● Lenses are your friend ○ JSON: JsonPath (Java), Play (Scala), ... ○ Case classes: Monocle, Shapeless, Scalaz, Sauron, Quicklens
      class CleanUserTest extends FlatSpec {
        // ...
        "Lower case country" should "translate to upper case" in {
          val input = Seq(userTemplate.copy(country = "se"))
          val output = runJob(input)
          assert(output.size === 1)
          assert(output.head.country === "SE")
          // Same check through a Monocle lens
          val lens = GenLens[User](_.country)
          assert(lens.get(output.head) === "SE")
        }
      }
  • 20. Inspecting counters ● Counters (accumulators in Spark) are critical for monitoring and quality ● Test that expected counters are bumped ○ But no other counters
      class CleanUserTest extends FlatSpec {
        def runJob(input: Seq[User]): (Seq[User], Map[String, Int]) = {
          writeInput(input)
          val args = Seq("--user", inFile.path.toString, "--output", outFile.path.toString)
          CleanUserJob.main(args.toArray)
          // Ugly way to expose counters
          val counters = Map(
            "invalid" -> CleanUserJob.invalidCount,
            "upper-cased" -> CleanUserJob.upperCased)
          (readOutput(), counters)
        }

        "Lower case country" should "translate to upper case" in {
          val input = Seq(userTemplate.copy(country = "se"))
          val (output, counters) = runJob(input)
          assert(output.size === 1)
          assert(output.head.country === "SE")
          assert(counters("upper-cased") === 1)
          // No other counter may be bumped
          assert((counters - "upper-cased").forall(_._2 == 0))
        }
      }
  • 21. Invariants ● Some things are true ○ For every record ○ For every job invocation ● Not necessarily in production ○ Reuse invariant predicates as quality probes
      class CleanUserTest extends FlatSpec {
        def validateInvariants(
            input: Seq[User], output: Seq[User], counters: Map[String, Int]) = {
          // Record invariants
          output.foreach(recordInvariant)
          // Dataset invariants
          assert(input.size === output.size)
          assert(input.size >= counters("upper-cased"))
        }

        def recordInvariant(u: User) = assert(u.country.size === 2)

        def runJob(input: Seq[User]): (Seq[User], Map[String, Int]) = {
          // Same as before
          ...
          validateInvariants(input, output, counters)
          (output, counters)
        }

        // Test case is the same
      }
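One way to read "reuse invariant predicates as quality probes" is to keep the predicate separate from the assertion, so that tests can fail hard while production only counts violations. The sketch below assumes this reading; `hasValidCountry`, `assertInvariant` and `violationCount` are hypothetical names, and `User` is a stand-in for the record type in the slides.

```scala
object InvariantProbeSketch {
  final case class User(age: Int, product: String, country: String)

  // The predicate itself: shared between tests and production probes.
  def hasValidCountry(u: User): Boolean = u.country.length == 2

  // In a test: fail hard on the first violation.
  def assertInvariant(output: Seq[User]): Unit =
    output.foreach(u => assert(hasValidCountry(u), s"invalid country in $u"))

  // In production: count violations and report them as a metric instead of failing.
  def violationCount(output: Seq[User]): Int =
    output.count(u => !hasValidCountry(u))
}
```

In a Spark job the count would typically feed a counter (accumulator) rather than be computed on a collected sequence; that wiring is omitted here.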
  • 22. Streaming SUT, example harness. Diagram labels: Scalatest, Spark Streaming jobs, IDE / CI / debug integration, DB, Topic, Kafka, Test input, Test oracle, Docker, IDE / Gradle, Polling, Service, JVM monolith for test.
  • 23. Test lifecycle 1. Initiate from IDE / build system 2. Start fixture containers 3. Await fixture ready 4. Allocate test case resources 5. Start jobs 6. Push input data 7. While (!done && !timeout) { pollDatabase(); sleep(1ms) } 8. While (moreTests) { Goto 4 } 9. Tear down fixture
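Step 7 of the test lifecycle, polling until the expected effect shows up or a timeout expires, can be factored into a small helper. This is a sketch with hypothetical names (`awaitResult`, `probe`); in a real harness the probe would query the database that the streaming job writes to.

```scala
object PollingSketch {
  // Polls `probe` until it yields a result or the timeout expires.
  // Returns None on timeout so the caller can fail the test with a clear message.
  def awaitResult[T](timeoutMs: Long, intervalMs: Long = 1)(probe: () => Option[T]): Option[T] = {
    val deadline = System.nanoTime() + timeoutMs * 1000000L
    var result = probe()
    while (result.isEmpty && System.nanoTime() < deadline) {
      Thread.sleep(intervalMs)
      result = probe()
    }
    result
  }
}
```

A short poll interval keeps the happy path fast, while the timeout only costs time when a test is about to fail anyway.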
  • 24. Testing for absence 1. Send test input 2. Send dummy input 3. Await effect of dummy input 4. Verify test output absence. Assumes in-order semantics.
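The four steps for testing absence can be illustrated with an in-memory stand-in for a streaming job. Everything here is hypothetical (`FilterJob`, the `"invalid"` prefix, the `"sentinel"` event); a real harness would push to Kafka and poll a database, but the reasoning is the same: with in-order processing, once the dummy input's effect is visible, any effect of the earlier test input would already be visible too.

```scala
import scala.collection.mutable

object AbsenceSketch {
  // Stand-in for a streaming job: consumes events in order and drops those marked invalid.
  final class FilterJob {
    private val input = mutable.Queue.empty[String]
    val output = mutable.ArrayBuffer.empty[String]

    def send(event: String): Unit = input.enqueue(event)

    // Processes one pending event per call, simulating asynchronous progress.
    def step(): Unit =
      if (input.nonEmpty) {
        val e = input.dequeue()
        if (!e.startsWith("invalid")) output += e
      }
  }

  def invalidEventIsDropped(): Boolean = {
    val job = new FilterJob
    job.send("invalid-user-1") // 1. test input, expected to produce no output
    job.send("sentinel")       // 2. dummy input, expected to pass through
    var steps = 0
    // 3. await the effect of the dummy input (bounded, to avoid hanging the test)
    while (!job.output.contains("sentinel") && steps < 100) {
      job.step()
      steps += 1
    }
    // 4. verify that the test input left no trace
    job.output.contains("sentinel") && !job.output.contains("invalid-user-1")
  }
}
```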
  • 25. Testing in the process 1. Integration tests for happy paths. 2. Likely odd cases, e.g. ○ Empty inputs ○ Missing matching records, e.g. in joins ○ Odd encodings 3. Dark corners ○ Motivated for externally generated input ● Minimal test cases triggering desired paths ● On production failure: ○ Debug in production ○ Create minimal test case ○ Add to regression suite ● Cannot trigger bug in test? ○ Consider new scope? Significantly different from old scopes. ○ Staging pipelines? ○ Automatic code inspection?
  • 26. Quality testing variants ● Functional regression ○ Binary, key to productivity ● Golden set ○ Extreme inputs => obvious output ○ No regressions tolerated ● (Saved) production data input ○ Individual regressions ok ○ Weighted sum must not decline ○ Beware of privacy
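For the saved production data variant, "individual regressions ok, weighted sum must not decline" can be expressed as a comparison of weighted scores between a baseline run and a candidate run. The sketch assumes each case has a weight and a score in [0, 1]; `CaseResult`, `weightedScore` and `acceptable` are hypothetical names, and how scores and weights are assigned is left open.

```scala
object QualityScoreSketch {
  // Hypothetical per-case result: a weight for business importance and a score in [0, 1].
  final case class CaseResult(id: String, weight: Double, score: Double)

  def weightedScore(results: Seq[CaseResult]): Double =
    results.map(r => r.weight * r.score).sum

  // Individual cases may regress as long as the weighted total does not decline.
  def acceptable(baseline: Seq[CaseResult], candidate: Seq[CaseResult], tolerance: Double = 1e-9): Boolean =
    weightedScore(candidate) >= weightedScore(baseline) - tolerance
}
```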
  • 27. Quality metrics ● Processing tool (Spark/Hadoop) counters ○ Odd code path => bump counter ● Dedicated quality assessment pipelines ● Workflow quality predicate for consumption ○ Depends on consumer use case. Diagram labels: Hadoop / Spark counters, DB, Quality assessment job, Tiny quality metadataset.
  • 28. Quality testing in the process ● Binary self-contained ○ Validate in CI ● Relative vs history ○ E.g. large drops ○ Precondition for consuming dataset? ● Push aggregates to DB ○ Standard ops: monitor, alert. Diagram labels: DB ∆? Code ∆!
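A "relative vs history" check for large drops might compare today's record count with recent runs and serve as a precondition before a downstream job consumes the dataset. The sketch below is one simple way to do that; the mean-based comparison and the 0.5 default threshold are arbitrary illustrative choices, not recommendations from the talk.

```scala
object HistoryCheckSketch {
  // Compares today's record count with the mean of recent runs.
  // maxDropFraction = 0.5 means: reject if today's count is below half the recent mean.
  def countLooksSane(history: Seq[Long], today: Long, maxDropFraction: Double = 0.5): Boolean =
    if (history.isEmpty) true // nothing to compare against yet
    else {
      val mean = history.sum.toDouble / history.size
      today >= mean * (1.0 - maxDropFraction)
    }
}
```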
  • 29. Data pipeline = yet another program. Don’t veer from best practices ● Regression testing ● Design: Separation of concerns, modularity, etc ● Process: CI/CD, code review, static analysis tools ● Avoid anti-patterns: Global state, hard-coding location, duplication, ... In data engineering, slipping is in the culture... :-( ● Mix in solid backend engineers ● Document “golden path”
  • 30. Test code = not yet another program ● Shared between data engineers & QA engineers ○ Best result with mutual respect and collaboration ● Readability >> abstractions ○ Create (Scala, Python) DSLs ● Some software bad practices are benign ○ Duplication ○ Inconsistencies ○ Lazy error handling
  • 31. Honey traps. The cloud ● PaaS components do not work locally ○ Cloud providers should provide fake implementations ○ Exceptions: Kubernetes, Cloud SQL, Relational Database Service, (S3) ● Integrating a PaaS service as a fixture component is challenging ○ Distribute access tokens, etc ○ Pay $ or $$$. Much better with per-second billing. Vendor batch test frameworks ● Spark, Scalding, Crunch variants ● Seam == internal data structure ○ Omits I/O - common bug source ● Vendor lock-in - when switching batch framework: ○ Need tests for protection ○ Test rewrite is unnecessary burden
  • 32. Top anti-patterns 1. Test as afterthought or in production. Data processing applications are suited for test! 2. Developer testing requires cluster 3. Static test input in version control 4. Exact expected output test oracle 5. Unit testing volatile interfaces 6. Using mocks & dependency injection 7. Tool-specific test framework - vendor lock-in 8. Using wall clock time 9. Embedded fixture components, e.g. in-memory Kafka/Cassandra/RDBMS 10. Performance testing (low ROI for offline) 11. Data quality not measured, not monitored
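Anti-pattern 8 (using wall clock time) is not elaborated in the slides. One plausible remedy, shown here as an assumption rather than the speaker's prescription, is to make the processing date an explicit job argument so that output stays a function of input and code. `Event` and `eventsForDay` are hypothetical names.

```scala
import java.time.LocalDate

object ClockFreeJobSketch {
  final case class Event(userId: String, date: LocalDate)

  // The processing date is an explicit argument (e.g. supplied by the workflow
  // orchestrator), never read from the system clock inside the job, so a rerun
  // or a test for a given date always yields the same result.
  def eventsForDay(events: Seq[Event], processingDate: LocalDate): Seq[Event] =
    events.filter(_.date == processingDate)
}
```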
  • 33. Thank you. Questions? Credits: Øyvind Løkling, Schibsted Media Group. Images: Tracey Saxby, Integration and Application Network, University of Maryland Center for Environmental Science (ian.umces.edu/imagelibrary/).
  • 35. Deployment, example ● Hg/git repo (Luigi DSL, jars, config) is packaged as my-pipe-7.tar.gz and stored on HDFS ● Workers, coordinated by the Luigi daemon, install it: > pip install my-pipe-7.tar.gz ● Redundant cron schedule, higher frequency + backfill (Luigi range tools): * 10 * * * bin/my_pipe_daily --backfill 14 ● All that a pipeline needs, installed atomically
  • 36. Continuous deployment, example ● Poll and pull latest on worker nodes ○ virtualenv package/version ■ No need to sync environment & versions ○ Cron package/latest/bin/* ■ Old versions run pipelines to completion, then exit ● Hg/git repo (Luigi DSL, jars, config) is packaged as my-pipe-7.tar.gz and stored on HDFS (hdfs://pipelines/) ● my_cd.py on each worker: > virtualenv my_pipe/7 > pip install my-pipe-7.tar.gz ● Cron: * 10 * * * my_pipe/7/bin/*