
Test strategies for data processing pipelines, v2.0

This talk will present recommended patterns and corresponding anti-patterns for testing data processing pipelines. We will suggest technology and architecture to improve testability, both for batch and streaming processing pipelines. We will primarily focus on testing for the purpose of development productivity and product iteration speed, but briefly also cover data quality testing.

  • Presentation held at Stockholm Hadoop User Group 2017-10-23. No video is available, but here are video and slides for an earlier version of the presentation: https://vimeo.com/192429554 https://www.slideshare.net/lallea/test-strategies-for-data-processing-pipelines-67244458

Test strategies for data processing pipelines, v2.0

  1. 1. www.mapflat.com Test strategies for data processing pipelines 1 Lars Albertsson, independent consultant (Mapflat)
  2. 2. www.mapflat.com Who's talking ● Swedish Institute of Computer Science (test & debug tools) ● Sun Microsystems (large machine verification) ● Google (Hangouts, productivity) ● Recorded Future (NLP startup) (data integrations) ● Cinnober Financial Tech. (trading systems) ● Spotify (data processing & modelling, productivity) ● Schibsted Products & Tech (data processing & modelling) ● Mapflat (independent data engineering consultant) 2
  3. 3. www.mapflat.com Agenda ● Data applications from a test perspective ● Testing batch processing products ● Testing stream processing products ● Data quality testing Main focus is functional, regression testing Prerequisites: Backend dev testing, basic data experience, reading Scala 3
  4. 4. www.mapflat.com Data lake Batch pipeline anatomy 4 Cluster storage Unified log Ingress ETL Egress DB Service Dataset Job Pipeline Service Export Business intelligence DB DB Import
  5. 5. www.mapflat.com Workflow orchestrator ● Dataset “build tool” ● Run job instance when ○ input is available ○ output missing ○ resources are available ● Backfill for previous failures ● DSL describes DAG ○ Includes ingress & egress Luigi / Airflow 5 DB Orchestrator
  6. 6. www.mapflat.com 6 Stream pipeline anatomy ● Unified log - bus of all business events ● Pub/sub with history ○ Kafka ○ AWS Kinesis, Google Pub/Sub, Azure Event Hub ● Decoupled producers/consumers ○ In source/deployment ○ In space ○ In time ● Publish results to log ● Recovery from link failures ● Replay on job bug fix Job Ads Search Feed App App App Stream Stream Stream Stream Stream Stream Job Job Stream Stream Stream Job Job Data lake Business intelligence Job
  7. 7. www.mapflat.com Online vs offline 7 Online Offline ETL Cold store AI feature Dataset Job Pipeline Service Service Online
  8. 8. www.mapflat.com Online failure vs Offline failure 8 10000s of customers, imprecise feedback Need low probability => Proactive prevention => Low ROI 10s of employees, precise feedback Ok with medium probability => Reactive repair => High ROI Risk = probability * impact
  9. 9. www.mapflat.com Value of testing 9 For data-centric (batch) applications, in this order: ● Productivity ○ Move fast without breaking things ● Fast experimentation ○ 10% good ideas, 90% neutral or bad ● Data quality ○ Challenging, more important than ● Technical quality ○ Technical failure => ■ Operations hassle. ■ Stale data. Often acceptable. ■ No customers were harmed. Usually. Significant harm to external customer is rare enough to be reactive - data-driven by bug frequency.
  10. 10. www.mapflat.com Test concepts 10 Test harness Test fixture System under test (SUT) 3rd party component (e.g. DB) 3rd party component 3rd party component Test input Test oracle Test framework (e.g. JUnit, Scalatest) Seam IDEs Build tools
  11. 11. www.mapflat.com 11 Data pipeline properties ● Output = function(input, code) ○ No external factors => deterministic ○ Easy to craft input, even in large tests ○ Perfect for test! ● Pipeline and job endpoints are stable ○ Correspond to business value ● Internal abstractions are volatile ○ Reslicing in different dimensions is common
  12. 12. www.mapflat.com 12 Potential test scopes ● Unit/component ● Single job ● Multiple jobs ● Pipeline, including service ● Full system, including client Choose stable interfaces Each scope has a cost Job Service App Stream Stream Job Stream Job
  13. 13. www.mapflat.com 13 Recommended scopes ● Single job ● Multiple jobs ● Pipeline, including service Job Service App Stream Stream Job Stream Job
  14. 14. www.mapflat.com 14 Scopes to avoid ● Unit/Component ○ Few stable interfaces ○ Not necessary ○ Avoid mocks, DI rituals ● Full system, including client ○ Client automation fragile “Focus on functional system tests, complement with smaller where you cannot get coverage.” - Henrik Kniberg Job Service App Stream Stream Job Stream Job
  15. 15. www.mapflat.com Testing single batch job 15 Job Standard Scalatest harness file://test_input/ file://test_output/ 1. Generate input 2. Run in local mode 3. Verify output f() p() Runs well in CI / from IDE
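  A minimal sketch (not from the slides) of the job side of this setup, assuming Spark and a hypothetical word-count job: all I/O paths come from arguments, so the test harness can point them at file:// fixtures and run the job with a local master.

      import org.apache.spark.sql.SparkSession

      object WordCountJob {
        def main(args: Array[String]): Unit = {
          // Paths come from arguments, so the test can point them at file:// fixtures.
          val Array(inputPath, outputPath) = args
          val spark = SparkSession.builder()
            .master(sys.env.getOrElse("SPARK_MASTER", "local[2]"))  // local mode in tests
            .appName("word-count")
            .getOrCreate()
          import spark.implicits._

          spark.read.textFile(inputPath)
            .flatMap(_.split("\\s+"))
            .groupByKey(identity)
            .count()
            .map { case (word, n) => s"$word\t$n" }
            .write.text(outputPath)

          spark.stop()
        }
      }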
  16. 16. www.mapflat.com Testing batch pipelines - two options 16 Standard Scalatest harness file://test_input/ file://test_output/ 1. Generate input 2. Run either A: a custom test job with a sequence of jobs, or B: a customised workflow manager setup 3. Verify output f() p() Option A: + Runs in CI + Runs in IDE + Quick setup - Multi-job maintenance Option B: + Tests workflow logic + More authentic - Workflow mgr setup for testability - Difficult to debug - Dataset handling with Python ● Both can be extended with ingress (Kafka), egress DBs
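  For option A, the test job that runs a sequence of jobs can be a plain main method invoking the production jobs in dependency order. A minimal sketch with hypothetical job names (CleanUserJob, AggregateUserJob):

      object UserPipelineTestJob {
        // Runs the production jobs back to back, so the Scalatest harness can treat
        // the whole chain as a single job: generate input, run, verify final output.
        def main(args: Array[String]): Unit = {
          val Array(rawUsers, cleanUsers, userStats) = args
          CleanUserJob.main(Array("--user", rawUsers, "--output", cleanUsers))
          AggregateUserJob.main(Array("--user", cleanUsers, "--output", userStats))
        }
      }

  The maintenance cost noted above comes from this wiring duplicating the production DAG, which must be kept in sync by hand.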
  17. 17. www.mapflat.com 17 Test data Test input data is code ● Tied to production code ● Should live in version control ● Duplication to eliminate ● Generation gives more power ○ Randomness can be desirable ○ Larger scale ○ Statistical distribution ● Maintainable > git add src/resources/test-data.json userInputFile.overwrite(Json.toJson( userTemplate.copy( age = 27, product = "premium", country = "NO")) .toString) Prod data Privacy!
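  A minimal sketch of generated test input, assuming a User case class with the fields used on these slides (name, age, product, country); the template and generator are hypothetical:

      import scala.util.Random

      case class User(name: String, age: Int, product: String, country: String)

      object TestInput {
        val userTemplate = User(name = "u1", age = 30, product = "free", country = "SE")

        // Randomise only the fields the test does not assert on; a fixed seed keeps runs reproducible.
        def randomUsers(n: Int, seed: Long = 1L): Seq[User] = {
          val rnd = new Random(seed)
          val countries = Seq("SE", "NO", "DK", "FI")
          (1 to n).map { i =>
            userTemplate.copy(
              name = s"user-$i",
              age = 18 + rnd.nextInt(60),
              country = countries(rnd.nextInt(countries.size)))
          }
        }
      }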
  18. 18. www.mapflat.com 18 Putting batch tests together class CleanUserTest extends FlatSpec { val inFile = (tmpDir / "test_user.json") val outFile = (tmpDir / "test_output.json") def writeInput(users: Seq[User]) = inFile.overwrite(users.map(u => Json.toJson(u).toString).mkString("\n")) def readOutput() = outFile.contentAsString.split("\n").map(line => Json.fromJson[User](Json.parse(line)).get).toSeq def runJob(input: Seq[User]): Seq[User] = { writeInput(input) val args = Seq( "--user", inFile.path.toString, "--output", outFile.path.toString) CleanUserJob.main(args.toArray) // Works for some processing frameworks, e.g. Spark readOutput() } "Clean users" should "remain untouched" in { val output = runJob(Seq(TestInput.userTemplate)) assert(output === Seq(TestInput.userTemplate)) } }
  19. 19. www.mapflat.com 19 Test oracle class CleanUserTest extends FlatSpec { // ... "Lower case country" should "translate to upper case" in { val input = Seq(userTemplate.copy(country = "se")) val output = runJob(input) assert(output.size === 1) assert(output.head.country === "SE") val lens = GenLens[User](_.country) assert(lens.get(output.head) === "SE") } } ● Avoid full record comparisons ○ Except for a few tests ● Examine only fields relevant to test ○ Or tests break for unrelated changes ● Lenses are your friend ○ JSON: JsonPath (Java), Play (Scala), ... ○ Case classes: Monocle, Shapeless, Scalaz, Sauron, Quicklens
  20. 20. www.mapflat.com class CleanUserTest extends FlatSpec { def runJob(input: Seq[User]): (Seq[User], Map[String, Int]) = { writeInput(input) val args = Seq("--user", inFile.path.toString, "--output", outFile.path.toString) CleanUserJob.main(args.toArray) // Ugly way to expose counters val counters = Map( "invalid" -> CleanUserJob.invalidCount, "upper-cased" -> CleanUserJob.upperCased) (readOutput(), counters) } "Lower case country" should "translate to upper case" in { val input = Seq(userTemplate.copy(country = "se")) val (output, counters) = runJob(input) assert(output.size === 1) assert(output.head.country === "SE") assert(counters("upper-cased") === 1) assert((counters - "upper-cased").filter(_._2 != 0).isEmpty) } } 20 Inspecting counters ● Counters (accumulators in Spark) are critical for monitoring and quality ● Test that expected counters are bumped ○ But no other counters
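  A minimal sketch of the counter exposure used above, assuming Spark accumulators held in fields of a hypothetical job object; the harness reads their values after main() returns:

      import org.apache.spark.sql.SparkSession
      import org.apache.spark.util.LongAccumulator

      object CleanUserJob {
        // Exposed for tests; the harness builds its counters map from these values.
        var invalidCount: LongAccumulator = _
        var upperCased: LongAccumulator = _

        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().master("local[2]").appName("clean-user").getOrCreate()
          invalidCount = spark.sparkContext.longAccumulator("invalid")
          upperCased = spark.sparkContext.longAccumulator("upper-cased")
          // ... read users, bump invalidCount/upperCased while cleaning, write output ...
          spark.stop()
        }
      }

      // In the harness above, the counters map would then read the accumulator values,
      // e.g. "upper-cased" -> CleanUserJob.upperCased.value.toInt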
  21. 21. www.mapflat.com class CleanUserTest extends FlatSpec { def validateInvariants( input: Seq[User], output: Seq[User], counters: Map[String, Int]) = { output.foreach(recordInvariant) // Dataset invariants assert(input.size === output.size) assert(input.size >= counters("upper-cased")) } def recordInvariant(u: User) = assert(u.country.size === 2) def runJob(input: Seq[User]): (Seq[User], Map[String, Int]) = { // Same as before ... validateInvariants(input, output, counters) (output, counters) } // Test case is the same } 21 Invariants ● Some things are true ○ For every record ○ For every job invocation ● Not necessarily in production ○ Reuse invariant predicates as quality probes
  22. 22. www.mapflat.com 22 Streaming SUT, example harness Scalatest Spark Streaming jobs IDE, CI, debug integration DB Topic Kafka Test input Test oracle Docker IDE / Gradle Polling Service JVM monolith for test
  23. 23. www.mapflat.com 23 Test lifecycle 1. Initiate from IDE / build system 2. Start fixture containers 3. Await fixture ready 4. Allocate test case resources 5. Start jobs 6. Push input data 7. While (!done && !timeout) { pollDatabase() sleep(1ms) } 8. While (moreTests) { Goto 4 } 9. Tear down fixture
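  Step 7 is usually wrapped in a small polling helper rather than a fixed sleep. A minimal sketch (hypothetical helper, not tied to any particular framework; lookupUserInDb in the usage comment is an assumed harness helper):

      import scala.concurrent.duration._

      trait PollSupport {
        // Polls until the condition holds or the timeout expires; returns whether it held.
        def pollUntil(timeout: FiniteDuration = 30.seconds,
                      interval: FiniteDuration = 100.millis)(condition: () => Boolean): Boolean = {
          val deadline = timeout.fromNow
          var done = condition()
          while (!done && deadline.hasTimeLeft()) {
            Thread.sleep(interval.toMillis)
            done = condition()
          }
          done
        }
      }

      // Mixed into the test class, a test case can then wait for the expected effect:
      //   assert(pollUntil()(() => lookupUserInDb("user-1").exists(_.country == "SE")))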
  24. 24. www.mapflat.com 24 Testing for absence 1. Send test input 2. Send dummy input 3. Await effect of dummy input 4. Verify test output absence Assumes in-order semantics
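  A minimal sketch of the absence pattern, assuming in-order delivery, the pollUntil helper sketched above, and hypothetical harness helpers sendUserEvent and lookupUserInDb; the invalid record should be dropped by the pipeline while the barrier record passes through:

      "Invalid country" should "not reach the database" in {
        val invalid = userTemplate.copy(name = "bad-user", country = "neverland")  // expected to be dropped
        val barrier = userTemplate.copy(name = "barrier-user")                     // expected to pass through

        sendUserEvent(invalid)
        sendUserEvent(barrier)

        // Wait until the barrier record has propagated end to end...
        assert(pollUntil()(() => lookupUserInDb("barrier-user").isDefined))
        // ...only then is the absence of the invalid record meaningful.
        assert(lookupUserInDb("bad-user").isEmpty)
      }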
  25. 25. www.mapflat.com Testing in the process 1. Integration tests for happy paths. 2. Likely odd cases, e.g. ○ Empty inputs ○ Missing matching records, e.g. in joins ○ Odd encodings 3. Dark corners ○ Motivated for externally generated input 25 ● Minimal test cases triggering desired paths ● On production failure: ○ Debug in production ○ Create minimal test case ○ Add to regression suite ● Cannot trigger bug in test? ○ Consider new scope? Significantly different from old scopes. ○ Staging pipelines? ○ Automatic code inspection?
  26. 26. www.mapflat.com Quality testing variants ● Functional regression ○ Binary, key to productivity ● Golden set ○ Extreme inputs => obvious output ○ No regressions tolerated ● (Saved) production data input ○ Individual regressions ok ○ Weighted sum must not decline ○ Beware of privacy 26
  27. 27. www.mapflat.com 27 Quality metrics ● Processing tool (Spark/Hadoop) counters ○ Odd code path => bump counter ● Dedicated quality assessment pipelines ● Workflow quality predicate for consumption ○ Depends on consumer use case Hadoop / Spark counters DB Quality assessment job Tiny quality metadataset
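  A minimal sketch of a workflow quality predicate over the tiny quality metadata set; names and thresholds are hypothetical and depend on the consumer use case:

      case class QualityMetrics(totalRecords: Long, invalidRecords: Long)

      object QualityGate {
        // Decides whether a dataset is good enough for a given consumer to read.
        def goodEnoughFor(consumer: String, m: QualityMetrics): Boolean = {
          val invalidRatio =
            if (m.totalRecords == 0) 1.0 else m.invalidRecords.toDouble / m.totalRecords
          consumer match {
            case "billing"         => m.totalRecords >= 10000 && invalidRatio <= 0.001  // strict
            case "recommendations" => m.totalRecords >= 1000  && invalidRatio <= 0.05   // lenient
            case _                 => invalidRatio <= 0.01
          }
        }
      }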
  28. 28. www.mapflat.com 28 Quality testing in the process ● Binary self-contained ○ Validate in CI ● Relative vs history ○ E.g. large drops ○ Precondition for consuming dataset? ● Push aggregates to DB ○ Standard ops: monitor, alert DB ∆? Code ∆!
  29. 29. www.mapflat.com 29 Data pipeline = yet another program Don’t veer from best practices ● Regression testing ● Design: Separation of concerns, modularity, etc ● Process: CI/CD, code review, static analysis tools ● Avoid anti-patterns: Global state, hard-coding location, duplication, ... In data engineering, slipping is in the culture... :-( ● Mix in solid backend engineers ● Document “golden path”
  30. 30. www.mapflat.com Test code = not yet another program ● Shared between data engineers & QA engineers ○ Best result with mutual respect and collaboration ● Readability >> abstractions ○ Create (Scala, Python) DSLs ● Some software bad practices are benign ○ Duplication ○ Inconsistencies ○ Lazy error handling 30
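  A minimal sketch of such a DSL, reusing the runJob harness from the batch-test slides; the helpers are deliberately thin so that readability wins over abstraction:

      trait UserJobDsl {
        def runJob(input: Seq[User]): Seq[User]   // provided by the concrete test harness

        def givenUsers(users: User*): Seq[User] = users
        def whenCleaned(input: Seq[User]): Seq[User] = runJob(input)
        def countriesOf(output: Seq[User]): Seq[String] = output.map(_.country)
      }

      // In a test:
      //   val output = whenCleaned(givenUsers(userTemplate.copy(country = "se")))
      //   assert(countriesOf(output) == Seq("SE"))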
  31. 31. www.mapflat.com Honey traps The cloud ● PaaS components do not work locally ○ Cloud providers should provide fake implementations ○ Exceptions: Kubernetes, Cloud SQL, Relational Database Service, (S3) ● Integrating a PaaS service as a fixture component is challenging ○ Distribute access tokens, etc ○ Pay $ or $$$. Much better with per-second billing. 31 Vendor batch test frameworks ● Spark, Scalding, Crunch variants ● Seam == internal data structure ○ Omits I/O - common bug source ● Vendor lock-in - when switching batch framework: ○ Need tests for protection ○ Test rewrite is unnecessary burden Test input Test output Job
  32. 32. www.mapflat.com Top anti-patterns 1. Test as afterthought or in production. Data processing applications are suited for test! 2. Developer testing requires cluster 3. Static test input in version control 4. Exact expected output test oracle 5. Unit testing volatile interfaces 6. Using mocks & dependency injection 7. Tool-specific test framework - vendor lock-in 8. Using wall clock time 9. Embedded fixture components, e.g. in-memory Kafka/Cassandra/RDBMS 10. Performance testing (low ROI for offline) 11. Data quality not measured, not monitored 32
  33. 33. www.mapflat.com Thank you. Questions? Credits: Øyvind Løkling, Schibsted Media Group Images: Tracey Saxby, Integration and Application Network, University of Maryland Center for Environmental Science (ian.umces.edu/imagelibrary/). 33
  34. 34. www.mapflat.com Bonus slides 34
  35. 35. www.mapflat.com Deployment, example 35 Hg/git repo Luigi DSL, jars, config my-pipe-7.tar.gz HDFS Luigi daemon > pip install my-pipe-7.tar.gz Worker Worker Worker Worker Worker Worker Worker Worker Redundant cron schedule, higher frequency + backfill (Luigi range tools) * 10 * * * bin/my_pipe_daily --backfill 14 All that a pipeline needs, installed atomically
  36. 36. www.mapflat.com Continuous deployment, example 36 ● Poll and pull latest on worker nodes ○ virtualenv package/version ■ No need to sync environment & versions ○ Cron package/latest/bin/* ■ Old versions run pipelines to completion, then exit Hg/git repo Luigi DSL, jars, config my-pipe-7.tar.gz HDFS my_cd.py hdfs://pipelines/ Worker > virtualenv my_pipe/7 > pip install my-pipe-7.tar.gz * 10 * * * my_pipe/7/bin/*
