Test strategies for data processing pipelines, v2.0

www.mapflat.com
Lars Albertsson, independent consultant (Mapflat)

Who's talking
● Swedish Institute of Computer Science (test & debug tools)
● Sun Microsystems (large machine verification)
● Google (Hangouts, productivity)
● Recorded Future (NLP startup) (data integrations)
● Cinnober Financial Tech. (trading systems)
● Spotify (data processing & modelling, productivity)
● Schibsted Products & Tech (data processing & modelling)
● Mapflat (independent data engineering consultant)

Agenda
● Data applications from a test perspective
● Testing batch processing products
● Testing stream processing products
● Data quality testing
Main focus is functional, regression testing
Prerequisites: Backend dev testing, basic data experience, reading Scala

Batch pipeline anatomy
[Figure: batch pipeline. Stages: Ingress, ETL, Egress. Labels: Service, DB, Import, Unified log, Data lake, Cluster storage, Pipeline, Job, Dataset, Export, Business intelligence]

Workflow orchestrator
● Dataset “build tool”
● Run job instance when
○ input is available
○ output missing
○ resources are available
● Backfill for previous failures
● DSL describes DAG
○ Includes ingress & egress
Luigi / Airflow
[Figure labels: Orchestrator, DB]
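The "build tool" rule above can be sketched in a few lines. This is a simplified, hypothetical model and not the Luigi or Airflow API: `Task`, `Orchestrator.runnable` and `Orchestrator.build` are names invented for illustration, datasets are plain strings, and resource availability is left out.

```scala
import scala.annotation.tailrec

// Hypothetical, simplified model of an orchestrator's scheduling rule.
// Real tools (Luigi, Airflow) have their own APIs; none are used here.
object Orchestrator {
  final case class Task(name: String, inputs: Set[String], output: String)

  /** Tasks whose inputs all exist and whose output is still missing. */
  def runnable(tasks: Seq[Task], existing: Set[String]): Seq[Task] =
    tasks.filter(t => t.inputs.subsetOf(existing) && !existing.contains(t.output))

  /** Runs whatever is runnable until nothing more can be built.
    * "Running" a task is simulated by adding its output to the dataset set. */
  @tailrec
  def build(tasks: Seq[Task], existing: Set[String]): Set[String] = {
    val ready = runnable(tasks, existing)
    if (ready.isEmpty) existing
    else build(tasks, existing ++ ready.map(_.output))
  }
}

object OrchestratorExample {
  import Orchestrator._

  val tasks = Seq(
    Task("clean_users", Set("raw_users"), "clean_users"),
    Task("user_report", Set("clean_users", "raw_orders"), "user_report"))

  // Only raw_users has arrived, so only clean_users can be built.
  val afterFirstRun: Set[String] = build(tasks, Set("raw_users"))
  // Once raw_orders arrives, a later run backfills the missing report.
  val afterSecondRun: Set[String] = build(tasks, afterFirstRun + "raw_orders")
}
```

Because the rule only looks at which datasets exist, rerunning after a failure or a late input fills in exactly what is missing, which is the backfill behaviour listed above.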

Stream pipeline anatomy
● Unified log - bus of all business events
● Pub/sub with history
○ Kafka
○ AWS Kinesis, Google Pub/Sub, Azure Event Hub
● Decoupled producers/consumers
○ In source/deployment
○ In space
○ In time
● Publish results to log
● Recovery from link failures
● Replay on job bug fix
[Figure: stream pipeline. Labels: App (Ads, Search, Feed), Stream, Job, Data lake, Business intelligence]

Online vs offline
[Figure: Online side: Service, Online. Offline side: ETL, Cold store, Pipeline, Job, Dataset, AI feature]

Online failure vs Offline failure
Risk = probability * impact
Online failure:
● 10000s of customers, imprecise feedback
● Need low probability => proactive prevention => low ROI
Offline failure:
● 10s of employees, precise feedback
● Ok with medium probability => reactive repair => high ROI

Value of testing
For data-centric (batch) applications, in this order:
● Productivity
○ Move fast without breaking things
● Fast experimentation
○ 10% good ideas, 90% neutral or bad
● Data quality
○ Challenging, and more important than technical quality
● Technical quality
○ Technical failure =>
■ Operations hassle.
■ Stale data. Often acceptable.
■ No customers were harmed. Usually.
Significant harm to external customer is rare enough to be reactive - data-driven by bug frequency.

Test concepts
[Figure: a test harness wraps the system under test (SUT). Labels: Test fixture, 3rd party components (e.g. DB), Test input, Test oracle, Seam, Test framework (e.g. JUnit, Scalatest), IDEs, Build tools]

Data pipeline properties
● Output = function(input, code)
○ No external factors => deterministic
○ Easy to craft input, even in large tests
○ Perfect for test!
● Pipeline and job endpoints are stable
○ Correspond to business value
● Internal abstractions are volatile
○ Reslicing in different dimensions is common

Potential test scopes
● Unit/component
● Single job
● Multiple jobs
● Pipeline, including service
● Full system, including client
Choose stable interfaces
Each scope has a cost
[Figure: example system of App, Service, Stream and Job components]

Recommended scopes
● Single job
● Multiple jobs
● Pipeline, including service

Scopes to avoid
● Unit/Component
○ Few stable interfaces
○ Not necessary
○ Avoid mocks, DI rituals
● Full system, including client
○ Client automation fragile
“Focus on functional system tests, complement with smaller where you cannot get coverage.” - Henrik Kniberg

Testing single batch job
Standard Scalatest harness:
1. Generate input, f(), into file://test_input/
2. Run the job in local mode
3. Verify output, p(), from file://test_output/
Runs well in CI / from IDE

Testing batch pipelines - two options
A: Standard Scalatest harness
1. Generate input, f(), into file://test_input/
2. Run custom multi-job: a test job with a sequence of jobs
3. Verify output, p(), from file://test_output/
+ Runs in CI
+ Runs in IDE
+ Quick setup
- Multi-job maintenance
B: Customised workflow manager setup, between f() and p()
+ Tests workflow logic
+ More authentic
- Workflow mgr setup for testability
- Difficult to debug
- Dataset handling with Python
● Both can be extended with ingress (Kafka), egress DBs

Test input data is code
● Tied to production code
● Should live in version control
● Duplication to eliminate
● Generation gives more power
○ Randomness can be desirable
○ Larger scale
○ Statistical distribution
● Maintainable
> git add src/resources/test-data.json
userInputFile.overwrite(Json.toJson(
  userTemplate.copy(
    age = 27,
    product = "premium",
    country = "NO"))
  .toString)
Prod data: privacy!
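As a sketch of what "generation gives more power" can look like, the block below derives many records from one template using a seeded random generator, so runs stay reproducible. The `id` field, the country list, the age range and the default seed are assumptions made for illustration; only `age`, `product` and `country` appear in the slide.

```scala
import scala.util.Random

object TestInput {
  // `id` is a hypothetical extra field; the slide shows age, product and country.
  final case class User(id: Long, age: Int, product: String, country: String)

  // One valid record that individual tests tweak with copy(...)
  val userTemplate: User = User(id = 0L, age = 27, product = "premium", country = "NO")

  // A fixed seed keeps the "random" data identical between runs.
  def randomUsers(n: Int, seed: Long = 42L): Seq[User] = {
    val rnd = new Random(seed)
    val countries = Vector("NO", "SE", "DK", "FI") // assumed value set
    (1 to n).map { i =>
      userTemplate.copy(
        id = i.toLong,
        age = 18 + rnd.nextInt(60), // ages 18 to 77
        country = countries(rnd.nextInt(countries.size)))
    }
  }
}
```

A test that needs one specific record still uses `userTemplate.copy(...)`; the generator is for larger inputs where scale or distribution matters.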

Putting batch tests together
// Assumed libraries: better-files (File, /), play-json (Json), Scalatest
import better.files._
import org.scalatest.FlatSpec
import play.api.libs.json.Json

class CleanUserTest extends FlatSpec {
  val inFile = (tmpDir / "test_user.json")
  val outFile = (tmpDir / "test_output.json")

  // One JSON document per line
  def writeInput(users: Seq[User]) =
    inFile.overwrite(users.map(u => Json.toJson(u).toString).mkString("\n"))

  def readOutput() =
    outFile.contentAsString.split("\n").map(line => Json.fromJson[User](Json.parse(line)).get).toSeq

  def runJob(input: Seq[User]): Seq[User] = {
    writeInput(input)
    val args = Seq(
      "--user", inFile.path.toString,
      "--output", outFile.path.toString)
    CleanUserJob.main(args.toArray) // Works for some processing frameworks, e.g. Spark
    readOutput()
  }

  "Clean users" should "remain untouched" in {
    val output = runJob(Seq(TestInput.userTemplate))
    assert(output === Seq(TestInput.userTemplate))
  }
}

Test oracle
// ...
  "Lower case country" should "translate to upper case" in {
    val input = Seq(userTemplate.copy(country = "se"))
    val output = runJob(input)
    assert(output.size === 1)
    assert(output.head.country === "SE")
    // Same check through a lens (Monocle's GenLens macro)
    val lens = GenLens[User](_.country)
    assert(lens.get(output.head) === "SE")
  }
}
● Avoid full record comparisons
○ Except for a few tests
● Examine only fields relevant to test
○ Or tests break for unrelated changes
● Lenses are your friend
○ JSON: JsonPath (Java), Play (Scala), ...
○ Case classes: Monocle, Shapeless, Scalaz, Sauron, Quicklens
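To show why lenses help when records are nested, here is a minimal hand-rolled lens. It is not Monocle or any of the libraries listed above; `Lens`, `Address` and the field names are invented for illustration, and a real project would more likely use one of those libraries.

```scala
object LensSketch {
  final case class Address(country: String, city: String)
  final case class User(name: String, age: Int, address: Address)

  // A tiny hand-rolled lens: a getter and a copy-based setter for one field.
  final case class Lens[S, A](get: S => A, set: (S, A) => S) {
    // Compose with a lens on the inner value to reach a nested field.
    def andThen[B](next: Lens[A, B]): Lens[S, B] =
      Lens[S, B](s => next.get(get(s)), (s, b) => set(s, next.set(get(s), b)))
  }

  val address: Lens[User, Address] =
    Lens[User, Address](_.address, (u, a) => u.copy(address = a))
  val country: Lens[Address, String] =
    Lens[Address, String](_.country, (a, c) => a.copy(country = c))
  val userCountry: Lens[User, String] = address.andThen(country)

  // Build test input by changing one nested field of a template,
  // and inspect only that field in the oracle.
  val template: User = User("test", 27, Address("NO", "Oslo"))
  val input: User = userCountry.set(template, "se")
}
```

The same lens serves both sides of a test: it crafts the input from a template and reads the single field the oracle cares about, so unrelated schema changes do not break the test.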

Inspecting counters
// ...
def runJob(input: Seq[User]) = {
  writeInput(input)
  val args = Seq("--user", inFile.path.toString,
    "--output", outFile.path.toString)
  CleanUserJob.main(args.toArray)
  // Ugly way to expose counters
  val counters = Map(
    "invalid" -> CleanUserJob.invalidCount,
    "upper-cased" -> CleanUserJob.upperCased)
  (readOutput(), counters)
}

"Lower case country" should "translate to upper case" in {
  val input = Seq(userTemplate.copy(country = "se"))
  val (output, counters) = runJob(input)
  assert(output.size === 1)
  assert(output.head.country === "SE")
  assert(counters("upper-cased") === 1)
  // No other counter may be bumped
  assert((counters - "upper-cased").forall(_._2 == 0))
}
}
● Counters (accumulators in Spark) are critical for monitoring and quality
● Test that expected counters are bumped
○ But no other counters
Invariants
def validateInvariants(
    input: Seq[User],
    output: Seq[User],
    counters: Map[String, Int]) = {
  output.foreach(recordInvariant)
  // Dataset invariants
  assert(input.size === output.size)
  assert(input.size >= counters("upper-cased"))
}

def recordInvariant(u: User) =
  assert(u.country.size === 2)

  // Same as before
  ...
  validateInvariants(input, output, counters)
  (output, counters)
}
// Test case is the same
}
● Some things are true
○ For every record
○ For every job invocation
● Not necessarily in production
○ Reuse invariant predicates as quality probes
Streaming SUT, example harness
IDE, CI, debug integration
[Figure labels: Scalatest (Test input, Test oracle, Polling), IDE / Gradle, Spark Streaming jobs (JVM monolith for test), Docker, Kafka, Topic, DB, Service]
Test lifecycle
1. Initiate from IDE / build system
2. Start fixture containers
3. Await fixture ready
4. Allocate test case resources
5. Start jobs
6. Push input data
7. While (!done && !timeout) {
pollDatabase()
sleep(1ms)
}
8. While (moreTests) { Goto 4 }
9. Tear down fixture
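Step 7 is the part that tends to be hand-written in every streaming test. A minimal polling helper could look as follows; `awaitResult`, its parameters and the 10 ms default interval are assumptions for illustration, and `pollDatabase` is a stand-in for a real fixture query.

```scala
import scala.annotation.tailrec
import scala.concurrent.duration._

object Polling {
  /** Polls `probe` until `done` accepts its result or `timeout` expires.
    * Returns Some(result) on success and None on timeout. */
  def awaitResult[A](timeout: FiniteDuration, interval: FiniteDuration = 10.millis)
                    (probe: () => A)(done: A => Boolean): Option[A] = {
    val deadline = timeout.fromNow
    @tailrec
    def loop(): Option[A] = {
      val result = probe()
      if (done(result)) Some(result)
      else if (deadline.isOverdue()) None
      else { Thread.sleep(interval.toMillis); loop() }
    }
    loop()
  }
}

object PollingExample {
  private var polls = 0
  // Stand-in for pollDatabase(): returns rows only from the third poll on.
  def pollDatabase(): Seq[String] = {
    polls += 1
    if (polls >= 3) Seq("row") else Seq.empty
  }

  // Succeeds once the simulated database has content.
  val found: Option[Seq[String]] =
    Polling.awaitResult(1.second)(() => pollDatabase())(_.nonEmpty)
  // Times out because the probe never returns anything.
  val missing: Option[Seq[String]] =
    Polling.awaitResult(50.millis)(() => Seq.empty[String])(_.nonEmpty)
}
```

Returning `None` on timeout rather than throwing lets the test decide how to report the failure, for example by dumping fixture state first.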

Testing for absence
1. Send test input
2. Send dummy input
3. Await effect of dummy input
4. Verify test output absence
Assumes in-order semantics
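The four steps can be illustrated with a toy in-order job. Everything here is a simulation made up for the example: `FilterJob` stands in for a streaming job that drops invalid events, and `step()` mimics asynchronous progress. No Kafka or Spark API is involved.

```scala
import scala.collection.mutable

object AbsenceTest {
  final case class Event(id: String, valid: Boolean)

  /** Toy in-order stream job: forwards valid events, drops invalid ones.
    * Processes one event per step() to mimic asynchronous progress. */
  final class FilterJob {
    private val in = mutable.Queue.empty[Event]
    val out: mutable.ArrayBuffer[Event] = mutable.ArrayBuffer.empty[Event]
    def send(e: Event): Unit = in.enqueue(e)
    def step(): Unit =
      if (in.nonEmpty) {
        val e = in.dequeue()
        if (e.valid) out += e
      }
  }

  /** Returns (dummy was seen, test input is absent). */
  def run(): (Boolean, Boolean) = {
    val job = new FilterJob
    job.send(Event("test-invalid", valid = false)) // 1. test input, expected to be dropped
    job.send(Event("dummy", valid = true))         // 2. dummy input, expected to pass
    var steps = 0                                  // 3. await the effect of the dummy
    while (!job.out.exists(_.id == "dummy") && steps < 100) {
      job.step()
      steps += 1
    }
    val dummySeen = job.out.exists(_.id == "dummy")
    val testAbsent = !job.out.exists(_.id == "test-invalid") // 4. verify absence
    (dummySeen, testAbsent)
  }
}
```

Because events are processed in order, seeing the dummy in the output proves the test input has already been handled, so its absence is a real result and not a timing accident.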

Testing in the process
1. Integration tests for happy paths.
2. Likely odd cases, e.g.
○ Empty inputs
○ Missing matching records, e.g. in joins
○ Odd encodings
3. Dark corners
○ Motivated for externally generated input
● Minimal test cases triggering desired paths
● On production failure:
○ Debug in production
○ Create minimal test case
○ Add to regression suite
● Cannot trigger bug in test?
○ Consider a new scope, significantly different from the old scopes?
○ Staging pipelines?
○ Automatic code inspection?

Quality testing variants
● Functional regression
○ Binary, key to productivity
● Golden set
○ Extreme inputs => obvious output
○ No regressions tolerated
● (Saved) production data input
○ Individual regressions ok
○ Weighted sum must not decline
○ Beware of privacy
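For the saved production data variant, the acceptance rule "individual regressions ok, weighted sum must not decline" can be written down directly. The sketch assumes each input gets a numeric quality score and a weight; `Scored`, the scoring scale and the example values are hypothetical.

```scala
object QualityRegression {
  // Per-input quality score (e.g. 1.0 = correct, 0.0 = wrong) and its weight.
  final case class Scored(inputId: String, weight: Double, score: Double)

  def weightedSum(results: Seq[Scored]): Double =
    results.map(r => r.weight * r.score).sum

  /** Inputs where the candidate scores lower than the baseline. */
  def regressions(baseline: Seq[Scored], candidate: Seq[Scored]): Seq[String] = {
    val old = baseline.map(r => r.inputId -> r.score).toMap
    candidate.collect { case r if old.get(r.inputId).exists(_ > r.score) => r.inputId }
  }

  /** Accept the change if overall weighted quality does not decline. */
  def acceptable(baseline: Seq[Scored], candidate: Seq[Scored]): Boolean =
    weightedSum(candidate) >= weightedSum(baseline)
}

object QualityRegressionExample {
  import QualityRegression._
  // Input "a" gets worse, the more heavily weighted input "b" gets better.
  val baseline = Seq(Scored("a", 1.0, 1.0), Scored("b", 3.0, 0.0))
  val candidate = Seq(Scored("a", 1.0, 0.0), Scored("b", 3.0, 1.0))
}
```

Listing the individual regressions alongside the verdict keeps them visible for review even when the change is accepted.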

Quality metrics
● Processing tool (Spark/Hadoop) counters
○ Odd code path => bump counter
● Dedicated quality assessment pipelines
● Workflow quality predicate for consumption
○ Depends on consumer use case
[Figure labels: Hadoop / Spark counters, DB, Quality assessment job, Tiny quality metadataset]

Quality testing in the process
● Binary self-contained
○ Validate in CI
● Relative vs history
○ E.g. large drops
○ Precondition for consuming dataset?
● Push aggregates to DB
○ Standard ops: monitor, alert
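A "relative vs history" check can be as small as comparing today's aggregate with the recent mean. The function name and the 0.5 default threshold below are assumptions; a suitable ratio depends on the dataset and its consumers.

```scala
object HistoryCheck {
  /** True if today's count is at least `minRatio` of the mean of recent counts.
    * With no history there is nothing to compare against, so the check passes. */
  def noLargeDrop(history: Seq[Long], today: Long, minRatio: Double = 0.5): Boolean =
    history.isEmpty || today >= minRatio * (history.sum.toDouble / history.size)
}
```

Used as a precondition, a downstream job would consume the dataset only when `noLargeDrop` holds; the same aggregate can be pushed to a DB for standard monitoring and alerting.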
[Figure labels: DB, ∆?, Code ∆!]

Data pipeline = yet another program
Don’t veer from best practices
● Regression testing
● Design: Separation of concerns, modularity, etc
● Process: CI/CD, code review, static analysis tools
● Avoid anti-patterns: Global state, hard-coding location, duplication, ...
In data engineering, slipping is in the culture... :-(
● Mix in solid backend engineers
● Document “golden path”

Test code = not yet another program
● Shared between data engineers & QA engineers
○ Best result with mutual respect and collaboration
● Readability >> abstractions
○ Create (Scala, Python) DSLs
● Some software bad practices are benign
○ Duplication
○ Inconsistencies
○ Lazy error handling

Honey traps
The cloud
● PaaS components do not work locally
○ Cloud providers should provide fake implementations
○ Exceptions: Kubernetes, Cloud SQL, Relational Database Service, (S3)
● Integrating a PaaS service as a fixture component is challenging
○ Distribute access tokens, etc
○ Pay $ or $$$. Much better with per-second billing.
Vendor batch test frameworks
● Spark, Scalding, Crunch variants
● Seam == internal data structure
○ Omits I/O - common bug source
● Vendor lock-in when switching batch framework:
○ Need tests for protection
○ Test rewrite is unnecessary burden
[Figure: Test input => Job => Test output]

Top anti-patterns
1. Test as afterthought or in production. Data processing applications are suited for test!
2. Developer testing requires cluster
3. Static test input in version control
4. Exact expected output test oracle
5. Unit testing volatile interfaces
6. Using mocks & dependency injection
7. Tool-specific test framework - vendor lock-in
8. Using wall clock time
9. Embedded fixture components, e.g. in-memory Kafka/Cassandra/RDBMS
10. Performance testing (low ROI for offline)
11. Data quality not measured, not monitored

Thank you. Questions?
Credits:
Øyvind Løkling, Schibsted Media Group
Images:
Tracey Saxby, Integration and Application Network, University of Maryland Center for Environmental Science (ian.umces.edu/imagelibrary/).

Bonus slides

Deployment, example
[Figure: Hg/git repo (Luigi DSL, jars, config) => my-pipe-7.tar.gz => HDFS; Luigi daemon; Workers]
On each worker:
> pip install my-pipe-7.tar.gz
Redundant cron schedule, higher frequency + backfill (Luigi range tools):
* 10 * * * bin/my_pipe_daily --backfill 14
All that a pipeline needs, installed atomically

Continuous deployment, example
● Poll and pull latest on worker nodes
○ virtualenv package/version
■ No need to sync environment & versions
○ Cron package/latest/bin/*
■ Old versions run pipelines to completion, then exit
[Figure: Hg/git repo (Luigi DSL, jars, config) => my-pipe-7.tar.gz => HDFS; my_cd.py hdfs://pipelines/ on each Worker]
> virtualenv my_pipe/7
> pip install my-pipe-7.tar.gz
* 10 * * * my_pipe/7/bin/*
