
Test strategies for data processing pipelines

This talk will present recommended patterns and corresponding anti-patterns for testing data processing pipelines. We will suggest technology and architecture to improve testability, both for batch and streaming processing pipelines. We will primarily focus on testing for the purpose of development productivity and product iteration speed, but also briefly cover data quality testing.

Presented at highloadstrategy.com 2016 by Lars Albertsson (independent, www.mapflat.com), joint work with Øyvind Løkling (Schibsted Products & Technology).

  1. 1. www.mapflat.com Test strategies for data processing pipelines Lars Albertsson, independent consultant (Mapflat) Øyvind Løkling, Schibsted Products & Technology 1
  2. 2. www.mapflat.com Who’s talking? Swedish Institute of Computer Science (test & debug tools) Sun Microsystems (large machine verification) Google (Hangouts, productivity) Recorded Future (NLP startup) (data integrations) Cinnober Financial Tech. (trading systems) Spotify (data processing & modelling, productivity) Schibsted Products & Tech (data processing & modelling) Mapflat (independent data engineering consultant) 2
  3. 3. www.mapflat.com Agenda ● Data applications from a test perspective ● Testing stream processing products ● Testing batch processing products ● Data quality testing Main focus is functional, regression testing Prerequisites: Backend dev testing, basic data experience 3
  4. 4. www.mapflat.com Test value For data-centric applications, in this order: ● Productivity ○ Move fast without breaking things ● Fast experimentation ○ 10% good ideas, 90% bad ● Data quality ○ Challenging, more important than ● Technical quality ○ Technical failure => ops hassle, stale data 4
  5. 5. www.mapflat.com Streaming data product anatomy 5 Pub / sub Unified log Ingress Stream processing Egress DB Service Topic Job Pipeline Service Export Business intelligence DB
  6. 6. www.mapflat.com Test harness Test fixture Test concepts 6 System under test (SUT) 3rd party component (e.g. DB) 3rd party component 3rd party component Test input Test oracle Test framework (e.g. JUnit, Scalatest) Seam IDEs Build tools
  7. 7. www.mapflat.com Test scopes 7 Unit Class Component Integration System / acceptance ● Pick stable seam ● Small scope ○ Fast? ○ Easy, simple? ● Large scope ○ Real app value? ○ Slow, unstable? ● Maintenance, cost ○ Pick few SUTs
  8. 8. www.mapflat.com ● Output = function(input, code) ○ No external factors => deterministic ● Pipeline and job endpoints are stable ○ Correspond to business value ● Internal abstractions are volatile ○ Reslicing in different dimensions is common Data-centric application properties 8
  9. 9. www.mapflat.com ● Output = function(input, code) ○ Perfect for test! ○ Avoid: external service calls, wall clock ● Pipeline/job edges are suitable seams ○ Focus on large tests ● Internal seams => high maintenance, low value ○ Omit unit tests, mocks, dependency injection! ● Long pipelines cross teams ○ Need for end-to-end tests, but culture challenge Data-centric app test properties 9
  10. 10. www.mapflat.com Suitable seams, streaming 10 Pub / sub DB Service Topic Job Pipeline Service Export Business intelligence DB
  11. 11. www.mapflat.com Streaming SUT, example harness 11 [diagram] 1. IDE / Gradle 2, 7. Scalatest 3. Docker fixture (Kafka topic, DB) 4. Spark Streaming jobs 5. Test input 6. Test oracle, polling the DB; IDE, CI, debug integration
  12. 12. www.mapflat.com Test lifecycle 12 1. Start fixture containers 2. Await fixture ready 3. Allocate test case resources 4. Start jobs 5. Push input data to Kafka 6. While (!done && !timeout) { pollDatabase(); sleep(1ms) } 7. While (moreTests) { Goto 3 } 8. Tear down fixture For absence test, send dummy sync messages at end.
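As a sketch of lifecycle steps 5-6 above, a ScalaTest harness can push input and poll for output roughly as follows. The Kafka address (the Docker fixture from step 1), the "pageviews" topic, and the countSessionsFor helper that queries the egress DB are all illustrative assumptions, not part of the original talk.

      import java.util.Properties
      import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
      import org.scalatest.concurrent.Eventually
      import org.scalatest.funsuite.AnyFunSuite
      import org.scalatest.time.{Millis, Seconds, Span}

      class SessionPipelineTest extends AnyFunSuite with Eventually {

        test("pageview events show up as sessions in the egress DB") {
          // 5. Push test input to the ingress topic (Kafka runs in the Docker fixture).
          val props = new Properties()
          props.put("bootstrap.servers", "localhost:9092")
          props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
          props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
          val producer = new KafkaProducer[String, String](props)
          producer.send(new ProducerRecord("pageviews", "user-1", """{"userId":"user-1","url":"/a"}"""))
          producer.flush()
          producer.close()

          // 6. Poll the egress database until the streaming job has produced output,
          //    or fail after a timeout.
          eventually(timeout(Span(60, Seconds)), interval(Span(100, Millis))) {
            assert(countSessionsFor("user-1") == 1)
          }
        }

        // Hypothetical helper: query the fixture DB (e.g. over JDBC) for produced sessions.
        def countSessionsFor(userId: String): Int = ???
      }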
  13. 13. www.mapflat.com Input generation 13 ● Input & output is denormalised & wide ● Fields are frequently changed ○ Additions are compatible ○ Modifications are incompatible => new, similar data type ● Static test input, e.g. JSON files ○ Unmaintainable ● Input generation routines ○ Robust to changes, reusable
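A minimal sketch of such an input generation routine, with an illustrative MobileAdImpression record type as the assumption: defaults live in one place, and each test overrides only the fields it cares about, so field additions stay compatible.

      import java.time.Instant

      case class MobileAdImpression(
          userId: String,
          adId: String,
          timestamp: Instant,
          country: String)

      object TestData {
        // Adding a field to the record type means adding one default here,
        // not editing every checked-in JSON file.
        def impression(
            userId: String = "user-1",
            adId: String = "ad-42",
            timestamp: Instant = Instant.parse("2016-02-06T13:00:00Z"),
            country: String = "se"): MobileAdImpression =
          MobileAdImpression(userId, adId, timestamp, country)
      }

      object InputGenerationExample {
        // Only the fields relevant to the scenario are spelled out.
        val input: Seq[MobileAdImpression] = Seq(
          TestData.impression(userId = "user-1", country = "se"),
          TestData.impression(userId = "user-2", country = "no"))
      }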
  14. 14. www.mapflat.com Test oracles 14 ● Compare with expected output ● Check fields relevant for test ○ Robust to field changes ○ Reusable for new, similar types ● Tip: Use lenses ○ JSON: JsonPath (Java), Play JSON (Scala) ○ Case classes: Monocle ● Express invariants for each data type
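A sketch of such an oracle using Play JSON path traversal as a simple lens (one of the options named above); the field names and invariants are illustrative assumptions. The same idea carries over to JsonPath in Java or Monocle lenses over case classes.

      import play.api.libs.json._

      object SessionOracle {
        // Invariants that should hold for every record of this data type.
        def checkInvariants(record: JsValue): Unit = {
          assert((record \ "userId").as[String].nonEmpty)
          assert((record \ "durationMs").as[Long] >= 0L)
        }

        // Comparison for a specific test case: only inspect the relevant fields,
        // so the oracle survives unrelated field additions.
        def assertSession(record: JsValue, userId: String, pageviews: Int): Unit = {
          checkInvariants(record)
          assert((record \ "userId").as[String] == userId)
          assert((record \ "pageviews").as[Int] == pageviews)
        }
      }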
  15. 15. www.mapflat.com Batch data product anatomy 15 Cluster storage Unified log Ingress ETL Egress Data lake DB Service Dataset Job Pipeline Service Export Business intelligence DB DB Import
  16. 16. Datasets ● Pipeline equivalent of objects ● Dataset class == homogeneous records, open-ended ○ Compatible schema ○ E.g. MobileAdImpressions ● Dataset instance = dataset class + parameters ○ Immutable ○ Finite set of homogeneous records ○ E.g. MobileAdImpressions(hour=”2016-02-06T13”) 16
  17. 17. www.mapflat.com Directory datasets 17 hdfs://red/pageviews/v1/country=se/year=2015/month=11/day=4/_SUCCESS part-00000.json part-00001.json ● Some tools, e.g. Spark, understand Hive name conventions [path annotations: Dataset class, Instance parameters (Hive convention), Seal, Partitions, Privacy level, Schema version]
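A sketch of how a dataset instance (class + parameters, previous slide) can map deterministically to a directory following the Hive convention above. The Pageviews class is an illustrative assumption; Pageviews("se", LocalDate.of(2015, 11, 4)).path reproduces the example path.

      import java.time.LocalDate

      case class Pageviews(country: String, date: LocalDate, version: Int = 1) {
        // hdfs://red/pageviews/v1/country=se/year=2015/month=11/day=4/
        def path: String =
          s"hdfs://red/pageviews/v$version/country=$country/" +
            s"year=${date.getYear}/month=${date.getMonthValue}/day=${date.getDayOfMonth}/"

        // The _SUCCESS marker seals the instance: once written, it never changes.
        def successMarker: String = path + "_SUCCESS"
      }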
  18. 18. Batch processing ● outDatasets = code(inDatasets) ● Components that scale up ○ Spark, (Flink, Scalding, Crunch) ● And scale down ○ Local mode ○ Most jobs fit in one machine 18 Gradual refinement 1. Wash - time shuffle, dedup, ... 2. Decorate - geo, demographic, ... 3. Domain model - similarity, clusters, ... 4. Application model - Recommendations, ...
  19. 19. www.mapflat.com Workflow manager ● Dataset “build tool” ● Run job instance when ○ input is available ○ output missing ● Backfills for previous failures ● DSL describes dependencies ● Includes ingress & egress Suggested: Luigi / Airflow 19 DB
  20. 20. www.mapflat.com DSL DAG example (Luigi) 20
      [diagram: Actions (hourly) + UserDB -> time shuffle, user decorate -> ClientActions (daily) -> form sessions -> ClientSessions + A/B tests -> A/B compare -> A/B session evaluation; labels: dataset instances, Job (aka Task) classes]

      class ClientActions(SparkSubmitTask):
          hour = DateHourParameter()
          def requires(self):
              return [Actions(hour=self.hour - timedelta(hours=h)) for h in range(0, 24)] + [UserDB(date=self.hour.date())]
          ...

      class ClientSessions(SparkSubmitTask):
          hour = DateHourParameter()
          def requires(self):
              return [ClientActions(hour=self.hour - timedelta(hours=h)) for h in range(0, 3)]
          ...

      class SessionsABResults(SparkSubmitTask):
          hour = DateHourParameter()
          def requires(self):
              return [ClientSessions(hour=self.hour), ABExperiments(hour=self.hour)]
          def output(self):
              return HdfsTarget("hdfs://production/red/ab_sessions/v1/" +
                                "{:year=%Y/month=%m/day=%d/hour=%H}".format(self.hour))
          ...

      ● Expressive, embedded DSL - a must for ingress, egress
        ○ Avoid weak DSL tools: Oozie, AWS Data Pipeline
  21. 21. www.mapflat.com Batch processing testing ● Omit collection in SUT ○ Technically different ● Avoid clusters ○ Slow tests ● Seams ○ Between jobs ○ End of pipelines 21 Data lake Artifact of business value E.g. service index Job Pipeline
  22. 22. www.mapflat.com Batch test frameworks - don’t ● Spark, Scalding, Crunch variants ● Seam == internal data structure ○ Omits I/O - common bug source ● Vendor lock-in When switching batch framework: ○ Need tests for protection ○ Test rewrite is unnecessary burden 22 Test input Test output Job
  23. 23. www.mapflat.com Testing single job 23 Standard Scalatest harness file://test_input/ file://test_output/ 1. Generate input 2. Run in local mode 3. Verify output f() p() Don’t commit - expensive to maintain. Generate / verify with code. Runs well in CI / from IDE Job
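A sketch of such a single-job test, assuming a hypothetical CleanPageviewsJob whose entry point is a pure function of input and output locations: input is generated with code into a temp directory, the job runs in Spark local mode, and only the fields relevant to the test are verified.

      import java.nio.file.Files
      import org.apache.spark.sql.SparkSession
      import org.scalatest.funsuite.AnyFunSuite

      class CleanPageviewsJobTest extends AnyFunSuite {

        test("job deduplicates pageviews") {
          val spark = SparkSession.builder()
            .master("local[2]")            // most jobs fit on one machine
            .appName("clean-pageviews-test")
            .getOrCreate()
          import spark.implicits._

          val tmp = Files.createTempDirectory("job_test").toString
          val inputDir  = s"file://$tmp/input"
          val outputDir = s"file://$tmp/output"

          // 1. Generate input with code rather than committing static files.
          Seq("""{"userId":"u1","url":"/a"}""", """{"userId":"u1","url":"/a"}""")
            .toDS().write.text(inputDir)

          // 2. Run the job in local mode against plain files.
          CleanPageviewsJob.run(spark, inputDir, outputDir)

          // 3. Verify output, checking only what is relevant to this test.
          assert(spark.read.json(outputDir).count() == 1)

          spark.stop()
        }
      }

      // Hypothetical job entry point: a pure function of input/output locations.
      object CleanPageviewsJob {
        def run(spark: SparkSession, in: String, out: String): Unit =
          spark.read.json(in).dropDuplicates().write.json(out)
      }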
  24. 24. www.mapflat.com Testing pipelines - two options 24 Standard Scalatest harness file://test_input/ file://test_output/ 1. Generate input 2. Run custom multi-job 3. Verify output f() p() A: Customised workflow manager setup + Runs in CI + Runs in IDE + Quick setup - Multi-job maintenance p() + Tests workflow logic + More authentic - Workflow mgr setup for testability - Difficult to debug - Dataset handling with Python f() B: ● Both can be extended with egress DBs Test job with sequence of jobs
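For the custom multi-job variant (running a sequence of jobs from the test harness), the setup can stay small if every job exposes the same entry point. A sketch, with the Job trait and the staging layout as illustrative assumptions: the test generates raw input, calls runPipeline with the jobs in dependency order, and points the test oracle at the returned output location.

      import org.apache.spark.sql.SparkSession

      // A job exposes a pure entry point from an input location to an output location.
      trait Job {
        def run(spark: SparkSession, in: String, out: String): Unit
      }

      object PipelineTest {
        // Chain jobs through intermediate datasets under a temp root; return the
        // final output location for the test oracle to inspect.
        def runPipeline(spark: SparkSession, jobs: Seq[Job],
                        rawInput: String, tmpRoot: String): String =
          jobs.zipWithIndex.foldLeft(rawInput) { case (in, (job, i)) =>
            val out = s"$tmpRoot/stage-$i"
            job.run(spark, in, out)
            out
          }
      }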
  25. 25. www.mapflat.com Testing with cloud services ● PaaS components do not work locally ○ Cloud providers should provide fake implementations ○ Exceptions: Kubernetes, Cloud SQL, (S3) ● Integrate PaaS service as fixture component ○ Distribute access tokens, etc ○ Pay $ or $$$ 25
  26. 26. www.mapflat.com Quality testing variants ● Functional regression ○ Binary, key to productivity ● Golden set ○ Extreme inputs => obvious output ○ No regressions tolerated ● (Saved) production data input ○ Individual regressions ok ○ Weighted sum must not decline ○ Beware of privacy 26
  27. 27. www.mapflat.com Hadoop / Spark counters ● Processing tool (Spark/Hadoop) counters ○ Odd code path => bump counter ● Dedicated quality assessment pipelines ○ Reuse test oracle invariants in production Obtaining quality metrics 27 DB Quality assessment job
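A sketch of the counter approach in Spark, assuming an illustrative decoration job: longAccumulator provides the counter, the check for the odd code path is deliberately crude, and the resulting value would be pushed to a metrics or quality DB rather than printed.

      import org.apache.spark.sql.SparkSession

      object DecorateJob {
        def run(spark: SparkSession, in: String, out: String): Unit = {
          import spark.implicits._
          // Counter for an odd code path, published after the job as a quality metric.
          val missingGeo = spark.sparkContext.longAccumulator("records.missing_geo")

          val decorated = spark.read.textFile(in).map { line =>
            if (!line.contains("\"geoIp\"")) missingGeo.add(1)  // crude check, for illustration
            line                                                // real job would decorate here
          }
          decorated.write.text(out)  // the action runs here, so the counter is now populated

          // Push the counter to the metrics / quality DB instead of just printing it.
          println(s"records.missing_geo = ${missingGeo.value}")
        }
      }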
  28. 28. www.mapflat.com Quality testing in the process ● Binary self-contained ○ Validate in CI ● Relative vs history ○ E.g. large drops ○ Precondition for publishing dataset ● Push aggregates to DB ○ Standard ops: monitor, alert 28 DB ∆? Code ∆!
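A sketch of the "relative vs history" gate, with the 20% threshold and the averaged baseline as illustrative assumptions: the metric for the new dataset instance is compared against recent history, and a large drop blocks publishing.

      object QualityGate {
        // Refuse to publish (seal) a dataset instance if a key metric drops
        // sharply versus the recent history stored in the metrics DB.
        def allowPublish(todayCount: Long, recentCounts: Seq[Long], maxDrop: Double = 0.2): Boolean =
          if (recentCounts.isEmpty) true  // no history yet: let it through, but alert
          else {
            val baseline = recentCounts.sum.toDouble / recentCounts.size
            todayCount >= baseline * (1.0 - maxDrop)
          }
      }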
  29. 29. www.mapflat.com Computer program anatomy 29 [diagram: input data (File, HID, Device) -> process in RAM (Variable, Function, Execution path, Lookup structure) -> output data (File, Window)]
  30. 30. www.mapflat.com Data pipeline = yet another program Don’t veer from best practices ● Regression testing ● Design: Separation of concerns, modularity, etc ● Process: CI/CD, code review, static analysis tools ● Avoid anti-patterns: Global state, hard-coding location, duplication, ... In data engineering, slipping is in the culture... :-( ● Mix in solid backend engineers, document “golden path” 30
  31. 31. www.mapflat.com Top anti-patterns 1. Test as afterthought or in production Data processing applications are suited for test! 2. Static test input in version control 3. Exact expected output test oracle 4. Unit testing volatile interfaces 5. Using mocks & dependency injection 6. Tool-specific test framework 7. Calling services / databases from (batch) jobs 8. Using wall clock time 9. Embedded fixture components 31
  32. 32. www.mapflat.com Further resources. Questions? 32
      http://www.slideshare.net/lallea/data-pipelines-from-zero-to-solid
      http://www.mapflat.com/lands/resources/reading-list
      http://www.slideshare.net/mathieu-bastian/the-mechanics-of-testing-large-data-pipelines-qcon-london-2016
      http://www.slideshare.net/hkarau/effective-testing-for-spark-programs-strata-ny-2015
      https://spark-summit.org/2014/wp-content/uploads/2014/06/Testing-Spark-Best-Practices-Anupama-Shetty-Neil-Marshall.pdf
  33. 33. www.mapflat.com Bonus slides 33
  34. 34. www.mapflat.com Unified log ● Replicated append-only log ● Pub / sub with history ● Decoupled producers and consumers ○ In source/deployment ○ In space ○ In time ● Recovers from link failures ● Replay on transformation bug fix 34 Credits: Confluent
  35. 35. www.mapflat.com Stream pipelines ● Parallelised jobs ● Read / write to Kafka ● View egress ○ Serving index ○ SQL / cubes for Analytics ● Stream egress ○ Services subscribe to topic ○ E.g. REST post for export 35 Credits: Apache Samza
  36. 36. www.mapflat.com The data lake Unified log + snapshots ● Immutable datasets ● Raw, unprocessed ● Source of truth from batch processing perspective ● Kept as long as permitted ● Technically homogeneous ○ Except for raw imports 36 Cluster storage Data lake
  37. 37. www.mapflat.com Batch pipelines ● Things will break ○ Input will be missing ○ Jobs will fail ○ Jobs will have bugs ● Datasets must be rebuilt ● Determinism, idempotency ● Backfill missing / failed ● Eventual correctness 37 Cluster storage Data lake Pristine, immutable datasets Intermediate Derived, regenerable
  38. 38. www.mapflat.com Job == function([input datasets]): [output datasets] ● No orthogonal concerns ○ Invocation ○ Scheduling ○ Input / output location ● Testable ● No other input factors, no side-effects ● Ideally: atomic, deterministic, idempotent ● Necessary for audit Batch job 38
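A sketch of that contract, assuming a hypothetical ClientSessions job: the core is a pure function of its input datasets, while invocation, scheduling, and concrete locations stay outside and arrive as arguments from the workflow manager.

      import org.apache.spark.sql.{DataFrame, SparkSession}

      object ClientSessionsJob {
        // Pure core: output datasets = f(input datasets). No wall clock, no service calls.
        def sessionize(actions: DataFrame, userDb: DataFrame): DataFrame =
          actions.join(userDb, "userId")  // placeholder for the real session logic

        // Thin shell: dataset locations come from the workflow manager as arguments.
        def main(args: Array[String]): Unit = {
          val Array(actionsPath, userDbPath, outPath) = args
          val spark = SparkSession.builder().appName("client-sessions").getOrCreate()
          val out = sessionize(spark.read.json(actionsPath), spark.read.json(userDbPath))
          out.write.json(outPath)  // write once, then seal; never mutate
          spark.stop()
        }
      }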
  39. 39. www.mapflat.com Form teams that are driven by business cases & need Forward-oriented -> filters implicitly applied Beware of: duplication, tech chaos/autonomy, privacy loss Data pipelines
  40. 40. www.mapflat.com Data platform, pipeline chains Common data infrastructure Productivity, privacy, end-to-end agility, complexity Beware: producer-consumer disconnect
  41. 41. www.mapflat.com Ingress / egress representation Larger variation: ● Single file ● Relational database table ● Cassandra column family, other NoSQL ● BI tool storage ● BigQuery, Redshift, ... Egress datasets are also atomic and immutable. E.g. write full DB table / CF, switch service to use it, never change it. 41
  42. 42. www.mapflat.com Egress datasets ● Serving ○ Precomputed user query answers ○ Denormalised ○ Cassandra, (many) ● Export & Analytics ○ SQL (single node / Hive, Presto, ..) ○ Workbenches (Zeppelin) ○ (Elasticsearch, proprietary OLAP) ● BI / analytics tool needs change frequently ○ Prepare to redirect pipelines 42
  43. 43. Deployment 43 Hg/git repo Luigi DSL, jars, config my-pipe-7.tar.gz HDFS Luigi daemon > pip install my-pipe-7.tar.gz Workers (x8) Redundant cron schedule, higher frequency + backfill (Luigi range tools) * 10 * * * bin/my_pipe_daily --backfill 14 All that a pipeline needs, installed atomically
