Test strategies for data processing pipelines

Lars Albertsson
Lars AlbertssonFounder & Data Engineer
www.mapflat.com
Test strategies for data
processing pipelines
Lars Albertsson, independent consultant (Mapflat)
Øyvind Løkling, Schibsted Products & Technology
1
www.mapflat.com
Who’s talking?
Swedish Institute of Computer. Science. (test & debug tools)
Sun Microsystems (large machine verification)
Google (Hangouts, productivity)
Recorded Future (NLP startup) (data integrations)
Cinnober Financial Tech. (trading systems)
Spotify (data processing & modelling, productivity)
Schibsted Products & Tech (data processing & modelling)
Mapflat (independent data engineering consultant)
2
www.mapflat.com
Agenda
● Data applications from a test perspective
● Testing stream processing product
● Testing batch processing products
● Data quality testing
Main focus is functional, regression testing
Prerequisites: Backend dev testing, basic data experience
3
www.mapflat.com
Test value
For data-centric applications, in this order:
● Productivity
○ Move fast without breaking things
● Fast experimentation
○ 10% good ideas, 90% bad
● Data quality
○ Challenging, more important than
● Technical quality
○ Technical failure => ops hassle, stale data
4
www.mapflat.com
Streaming data product anatomy
5
Pub / sub
Unified log
Ingress Stream processing Egress
DB
Service
TopicJob
Pipeline
Service
Export
Business
intelligence
DB
www.mapflat.com
Test harness
Test
fixture
Test concepts
6
System under test
(SUT)
3rd party
component
(e.g. DB)
3rd party
component
3rd party
component
Test
input
Test
oracle
Test framework (e.g. JUnit, Scalatest)
Seam
IDEs
Build
tools
www.mapflat.com
Test scopes
7
Unit
Class
Component
Integration
System / acceptance
● Pick stable seam
● Small scope
○ Fast?
○ Easy, simple?
● Large scope
○ Real app value?
○ Slow, unstable?
● Maintenance, cost
○ Pick few SUTs
www.mapflat.com
● Output = function(input, code)
○ No external factors => deterministic
● Pipeline and job endpoints are stable
○ Correspond to business value
● Internal abstractions are volatile
○ Reslicing in different dimensions is common
Data-centric application properties
8
q
www.mapflat.com
● Output = function(input, code)
○ Perfect for test!
○ Avoid: external service calls, wall clock
● Pipeline/job edges are suitable seams
○ Focus on large tests
● Internal seams => high maintenance, low value
○ Omit unit tests, mocks, dependency injection!
● Long pipelines crosses teams
○ Need for end-to-end tests, but culture challenge
Data-centric app test properties
9
q
www.mapflat.com
Suitable seams, streaming
10
Pub / sub
DB
Service
TopicJob
Pipeline
Service
Export
Business
intelligence
DB
www.mapflat.com
2, 7. Scalatest. 4. Spark Streaming jobs
IDE, CI, debug integration
Streaming SUT, example harness
11
DB
Topic
Kafka
5. Test
input
6. Test
oracle
3. Docker
1. IDE / Gradle
Polling
www.mapflat.com
Test lifecycle
12
1. Start fixture containers
2. Await fixture ready
3. Allocate test case resources
4. Start jobs
5. Push input data to Kafka
6. While (!done && !timeout) { pollDatabase(); sleep(1ms) }
7. While (moreTests) { Goto 3 }
8. Tear down fixture
For absence test, send dummy sync messages at end.
www.mapflat.com
Input generation
13
● Input & output is denormalised & wide
● Fields are frequently changed
○ Additions are compatible
○ Modifications are incompatible =>
new, similar data type
● Static test input, e.g. JSON files
○ Unmaintainable
● Input generation routines
○ Robust to changes, reusable
www.mapflat.com
Test oracles
14
● Compare with expected output
● Check fields relevant for test
○ Robust to field changes
○ Reusable for new, similar types
● Tip: Use lenses
○ JSON: JsonPath (Java), Play JSON (Scala)
○ Case classes: Monocle
● Express invariants for each data type
www.mapflat.com
Batch data product anatomy
15
Cluster storage
Unified log
Ingress ETL Egress
Data
lake
DB
Service
DatasetJob
Pipeline
Service
Export
Business
intelligence
DB
DB
Import
Datasets
● Pipeline equivalent of objects
● Dataset class == homogeneous records, open-ended
○ Compatible schema
○ E.g. MobileAdImpressions
● Dataset instance = dataset class + parameters
○ Immutable
○ Finite set of homogeneous records
○ E.g. MobileAdImpressions(hour=”2016-02-06T13”)
16
www.mapflat.com
Directory datasets
17
hdfs://red/pageviews/v1/country=se/year=2015/month=11/day=4/_SUCCESS
part-00000.json
part-00001.json
● Some tools, e.g. Spark, understand Hive name
conventions
Dataset
class
Instance parameters,
Hive convention
Seal PartitionsPrivacy
level
Schema
version
Batch processing
● outDatasets =
code(inDatasets)
● Component that scale up
○ Spark, (Flink,
Scalding, Crunch)
● And scale down
○ Local mode
○ Most jobs fit in one
machine
18
Gradual refinement
1. Wash
- time shuffle, dedup, ...
2. Decorate
- geo, demographic, ...
3. Domain model
- similarity, clusters, ...
4. Application model
- Recommendations, ...
www.mapflat.com
Workflow manager
● Dataset “build tool”
● Run job instance when
○ input is available
○ output missing
● Backfills for previous failures
● DSL describes dependencies
● Includes ingress & egress
Suggested: Luigi / Airflow
19
DB
`
www.mapflat.com
ClientSessions A/B tests
DSL DAG example (Luigi)
20
class ClientActions(SparkSubmitTask):
hour = DateHourParameter()
def requires(self):
return [Actions(hour=self.hour - timedelta(hours=h)) for h in range(0, 24)] + 
[UserDB(date=self.hour.date)]
...
class ClientSessions(SparkSubmitTask):
hour = DateHourParameter()
def requires(self):
return [ClientActions(hour=self.hour - timedelta(hours=h)) for h in range(0, 3)]
...
class SessionsABResults(SparkSubmitTask):
hour = DateHourParameter()
def requires(self):
return [ClientSessions(hour=self.hour), ABExperiments(hour=self.hour)]
def output(self):
return HdfsTarget(“hdfs://production/red/ab_sessions/v1/” +
“{:year=%Y/month=%m/day=%d/hour=%H}”.format(self.hour))
...
Actions, hourly
● Expressive, embedded DSL - a must for ingress, egress
○ Avoid weak DSL tools: Oozie, AWS Data Pipeline
UserDB
Time shuffle,
user decorate
Form sessions
A/B compare
ClientActions, daily
A/B session
evaluationDataset instance
Job (aka Task) classes
www.mapflat.com
Batch processing testing
● Omit collection in SUT
○ Technically different
● Avoid clusters
○ Slow tests
● Seams
○ Between jobs
○ End of pipelines
21
Data lake
Artifact of business value
E.g. service index
Job
Pipeline
www.mapflat.com
Batch test frameworks - don’t
● Spark, Scalding, Crunch variants
● Seam == internal data structure
○ Omits I/O - common bug source
● Vendor lock-in
When switching batch framework:
○ Need tests for protection
○ Test rewrite is unnecessary burden
22
Test input Test output
Job
www.mapflat.com
Testing single job
23
Standard Scalatest harness
file://test_input/ file://test_output/
1. Generate input 2. Run in local mode 3. Verify output
f() p()
Don’t commit -
expensive to maintain.
Generate / verify with
code.
Runs well in
CI / from IDE
Job
www.mapflat.com
Testing pipelines - two options
24
Standard Scalatest harness
file://test_input/ file://test_output/
1. Generate input 2. Run custom multi-job 3. Verify output
f() p()
A:
Customised workflow manager setup
+ Runs in CI
+ Runs in IDE
+ Quick setup
- Multi-job
maintenance
p()
+ Tests workflow logic
+ More authentic
- Workflow mgr setup
for testability
- Difficult to debug
- Dataset handling
with Python
f()
B:
● Both can be extended with egress DBs
Test job with sequence of jobs
www.mapflat.com
Testing with cloud services
● PaaS components do not work locally
○ Cloud providers should provide fake
implementations
○ Exceptions: Kubernetes, Cloud SQL, (S3)
● Integrate PaaS service as fixture component
○ Distribute access tokens, etc
○ Pay $ or $$$
25
www.mapflat.com
Quality testing variants
● Functional regression
○ Binary, key to productivity
● Golden set
○ Extreme inputs => obvious output
○ No regressions tolerated
● (Saved) production data input
○ Individual regressions ok
○ Weighted sum must not decline
○ Beware of privacy
26
www.mapflat.com
Hadoop / Spark counters
● Processing tool (Spark/Hadoop) counters
○ Odd code path => bump counter
● Dedicated quality assessment pipelines
○ Reuse test oracle invariants in production
Obtaining quality metrics
27
DB
Quality assessment job
www.mapflat.com
Quality testing in the process
● Binary self-contained
○ Validate in CI
● Relative vs history
○ E.g. large drops
○ Precondition for
publishing dataset
● Push aggregates to DB
○ Standard ops:
monitor, alert
28
DB
∆?
Code ∆!
www.mapflat.com
RAM
Input
Computer program anatomy
29
Input data
Process Output
File
HID
VariableFunction
Execution path
Lookup
structure
Output
data
Window
File
File
Device
www.mapflat.com
Data pipeline = yet another program
Don’t veer from best practices
● Regression testing
● Design: Separation of concerns, modularity, etc
● Process: CI/CD, code review, static analysis tools
● Avoid anti-patterns: Global state, hard-coding location,
duplication, ...
In data engineering, slipping is in the culture... :-(
● Mix in solid backend engineers, document “golden path”
30
www.mapflat.com
Top anti-patterns
1. Test as afterthought or in production
Data processing applications are suited for test!
2. Static test input in version control
3. Exact expected output test oracle
4. Unit testing volatile interfaces
5. Using mocks & dependency injection
6. Tool-specific test framework
7. Calling services / databases from (batch) jobs
8. Using wall clock time
9. Embedded fixture components
31
www.mapflat.com
Further resources. Questions?
32
http://www.slideshare.net/lallea/data-pipelines-from-zero-to-solid
http://www.mapflat.com/lands/resources/reading-list
http://www.slideshare.net/mathieu-bastian/the-mechanics-of-testing-large-data-
pipelines-qcon-london-2016
http://www.slideshare.net/hkarau/effective-testing-for-spark-programs-strata-ny-
2015
https://spark-summit.org/2014/wp-content/uploads/2014/06/Testing-Spark-Best-
Practices-Anupama-Shetty-Neil-Marshall.pdf
www.mapflat.com
Bonus slides
33
www.mapflat.com
Unified log
● Replicated append-only log
● Pub / sub with history
● Decoupled producers and consumers
○ In source/deployment
○ In space
○ In time
● Recovers from link failures
● Replay on transformation bug fix
34
Credits: Confluent
www.mapflat.com
Stream pipelines
● Parallelised jobs
● Read / write to Kafka
● View egress
○ Serving index
○ SQL / cubes for Analytics
● Stream egress
○ Services subscribe to topic
○ E.g. REST post for export
35
Credits: Apache Samza
www.mapflat.com
The data lake
Unified log + snapshots
● Immutable datasets
● Raw, unprocessed
● Source of truth from batch
processing perspective
● Kept as long as permitted
● Technically homogeneous
○ Except for raw imports
36
Cluster storage
Data lake
www.mapflat.com
Batch pipelines
● Things will break
○ Input will be missing
○ Jobs will fail
○ Jobs will have bugs
● Datasets must be rebuilt
● Determinism, idempotency
● Backfill missing / failed
● Eventual correctness
37
Cluster storage
Data lake
Pristine,
immutable
datasets
Intermediate
Derived,
regenerable
www.mapflat.com
Job == function([input datasets]): [output datasets]
● No orthogonal concerns
○ Invocation
○ Scheduling
○ Input / output location
● Testable
● No other input factors, no side-effects
● Ideally: atomic, deterministic, idempotent
● Necessary for audit
Batch job
38
q
www.mapflat.com
Form teams that are driven by business cases & need
Forward-oriented -> filters implicitly applied
Beware of: duplication, tech chaos/autonomy, privacy loss
Data pipelines
www.mapflat.com
Data platform, pipeline chains
Common data infrastructure
Productivity, privacy, end-to-end agility, complexity
Beware: producer-consumer disconnect
www.mapflat.com
Ingress / egress representation
Larger variation:
● Single file
● Relational database table
● Cassandra column family, other NoSQL
● BI tool storage
● BigQuery, Redshift, ...
Egress datasets are also atomic and immutable.
E.g. write full DB table / CF, switch service to use it, never
change it.
41
www.mapflat.com
Egress datasets
● Serving
○ Precomputed user query answers
○ Denormalised
○ Cassandra, (many)
● Export & Analytics
○ SQL (single node / Hive, Presto, ..)
○ Workbenches (Zeppelin)
○ (Elasticsearch, proprietary OLAP)
● BI / analytics tool needs change frequently
○ Prepare to redirect pipelines
42
Deployment
43
Hg/git
repo Luigi DSL, jars, config
my-pipe-7.tar.gz
HDFS
Luigi
daemon
> pip install my-pipe-7.tar.gz
Worker
Worker
Worker
Worker
Worker
Worker
Worker
Worker
Redundant cron schedule, higher
frequency + backfill (Luigi range tools)
* 10 * * * bin/my_pipe_daily 
--backfill 14
All that a pipeline needs, installed atomically
1 of 43

Recommended

Engineering data quality by
Engineering data qualityEngineering data quality
Engineering data qualityLars Albertsson
1.3K views50 slides
The Mechanics of Testing Large Data Pipelines (QCon London 2016) by
The Mechanics of Testing Large Data Pipelines (QCon London 2016)The Mechanics of Testing Large Data Pipelines (QCon London 2016)
The Mechanics of Testing Large Data Pipelines (QCon London 2016)Mathieu Bastian
7.7K views74 slides
Effective testing for spark programs Strata NY 2015 by
Effective testing for spark programs   Strata NY 2015Effective testing for spark programs   Strata NY 2015
Effective testing for spark programs Strata NY 2015Holden Karau
13.5K views38 slides
Data pipelines from zero to solid by
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solidLars Albertsson
10.7K views58 slides
Testing data streaming applications by
Testing data streaming applicationsTesting data streaming applications
Testing data streaming applicationsLars Albertsson
4K views26 slides
Data engineering zoomcamp introduction by
Data engineering zoomcamp  introductionData engineering zoomcamp  introduction
Data engineering zoomcamp introductionAlexey Grigorev
17.4K views32 slides

More Related Content

What's hot

Koalas: Making an Easy Transition from Pandas to Apache Spark by
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkDatabricks
1.8K views33 slides
The columnar roadmap: Apache Parquet and Apache Arrow by
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowJulien Le Dem
6.8K views45 slides
Funnel Analysis with Apache Spark and Druid by
Funnel Analysis with Apache Spark and DruidFunnel Analysis with Apache Spark and Druid
Funnel Analysis with Apache Spark and DruidDatabricks
499 views81 slides
Introduction to Redis by
Introduction to RedisIntroduction to Redis
Introduction to RedisDvir Volk
121K views24 slides
Image Processing on Delta Lake by
Image Processing on Delta LakeImage Processing on Delta Lake
Image Processing on Delta LakeDatabricks
1.5K views25 slides
Data Quality With or Without Apache Spark and Its Ecosystem by
Data Quality With or Without Apache Spark and Its EcosystemData Quality With or Without Apache Spark and Its Ecosystem
Data Quality With or Without Apache Spark and Its EcosystemDatabricks
1.3K views22 slides

What's hot(20)

Koalas: Making an Easy Transition from Pandas to Apache Spark by Databricks
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache Spark
Databricks1.8K views
The columnar roadmap: Apache Parquet and Apache Arrow by Julien Le Dem
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
Julien Le Dem6.8K views
Funnel Analysis with Apache Spark and Druid by Databricks
Funnel Analysis with Apache Spark and DruidFunnel Analysis with Apache Spark and Druid
Funnel Analysis with Apache Spark and Druid
Databricks499 views
Introduction to Redis by Dvir Volk
Introduction to RedisIntroduction to Redis
Introduction to Redis
Dvir Volk121K views
Image Processing on Delta Lake by Databricks
Image Processing on Delta LakeImage Processing on Delta Lake
Image Processing on Delta Lake
Databricks1.5K views
Data Quality With or Without Apache Spark and Its Ecosystem by Databricks
Data Quality With or Without Apache Spark and Its EcosystemData Quality With or Without Apache Spark and Its Ecosystem
Data Quality With or Without Apache Spark and Its Ecosystem
Databricks1.3K views
Optimizing Delta/Parquet Data Lakes for Apache Spark by Databricks
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
Databricks2.5K views
Introduction to Apache Flink - Fast and reliable big data processing by Till Rohrmann
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
Till Rohrmann7.2K views
From flat files to deconstructed database by Julien Le Dem
From flat files to deconstructed databaseFrom flat files to deconstructed database
From flat files to deconstructed database
Julien Le Dem2.1K views
Evening out the uneven: dealing with skew in Flink by Flink Forward
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
Flink Forward2.5K views
Real-time Analytics with Trino and Apache Pinot by Xiang Fu
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache Pinot
Xiang Fu1.2K views
Accelerating Data Ingestion with Databricks Autoloader by Databricks
Accelerating Data Ingestion with Databricks AutoloaderAccelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks Autoloader
Databricks1.7K views
Dynamic Partition Pruning in Apache Spark by Databricks
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
Databricks4.9K views
Introduction to MongoDB by Mike Dirolf
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
Mike Dirolf38.4K views
Kafka for Real-Time Replication between Edge and Hybrid Cloud by Kai Wähner
Kafka for Real-Time Replication between Edge and Hybrid CloudKafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid Cloud
Kai Wähner1.5K views
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S... by Spark Summit
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark Summit9.4K views
Timeseries - data visualization in Grafana by OCoderFest
Timeseries - data visualization in GrafanaTimeseries - data visualization in Grafana
Timeseries - data visualization in Grafana
OCoderFest995 views
Kafka timestamp offset by DaeMyung Kang
Kafka timestamp offsetKafka timestamp offset
Kafka timestamp offset
DaeMyung Kang4.7K views
Altinity Quickstart for ClickHouse by Altinity Ltd
Altinity Quickstart for ClickHouseAltinity Quickstart for ClickHouse
Altinity Quickstart for ClickHouse
Altinity Ltd1.5K views
Cassandra Introduction & Features by DataStax Academy
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
DataStax Academy31.9K views

Viewers also liked

Real-time Big Data Processing with Storm by
Real-time Big Data Processing with StormReal-time Big Data Processing with Storm
Real-time Big Data Processing with Stormviirya
12.1K views38 slides
Large scale data processing pipelines at trivago by
Large scale data processing pipelines at trivago Large scale data processing pipelines at trivago
Large scale data processing pipelines at trivago Clemens Valiente
2.2K views21 slides
[246] foursquare데이터라이프사이클 설현준 by
[246] foursquare데이터라이프사이클 설현준[246] foursquare데이터라이프사이클 설현준
[246] foursquare데이터라이프사이클 설현준NAVER D2
1.5K views103 slides
[211]대규모 시스템 시각화 현동석김광림 by
[211]대규모 시스템 시각화 현동석김광림[211]대규모 시스템 시각화 현동석김광림
[211]대규모 시스템 시각화 현동석김광림NAVER D2
3.4K views72 slides
[115] clean fe development_윤지수 by
[115] clean fe development_윤지수[115] clean fe development_윤지수
[115] clean fe development_윤지수NAVER D2
2.8K views71 slides
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual... by
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...Brandon O'Brien
2.6K views13 slides

Viewers also liked(17)

Real-time Big Data Processing with Storm by viirya
Real-time Big Data Processing with StormReal-time Big Data Processing with Storm
Real-time Big Data Processing with Storm
viirya12.1K views
Large scale data processing pipelines at trivago by Clemens Valiente
Large scale data processing pipelines at trivago Large scale data processing pipelines at trivago
Large scale data processing pipelines at trivago
Clemens Valiente2.2K views
[246] foursquare데이터라이프사이클 설현준 by NAVER D2
[246] foursquare데이터라이프사이클 설현준[246] foursquare데이터라이프사이클 설현준
[246] foursquare데이터라이프사이클 설현준
NAVER D21.5K views
[211]대규모 시스템 시각화 현동석김광림 by NAVER D2
[211]대규모 시스템 시각화 현동석김광림[211]대규모 시스템 시각화 현동석김광림
[211]대규모 시스템 시각화 현동석김광림
NAVER D23.4K views
[115] clean fe development_윤지수 by NAVER D2
[115] clean fe development_윤지수[115] clean fe development_윤지수
[115] clean fe development_윤지수
NAVER D22.8K views
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual... by Brandon O'Brien
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Brandon O'Brien2.6K views
[225]yarn 기반의 deep learning application cluster 구축 김제민 by NAVER D2
[225]yarn 기반의 deep learning application cluster 구축 김제민[225]yarn 기반의 deep learning application cluster 구축 김제민
[225]yarn 기반의 deep learning application cluster 구축 김제민
NAVER D26.2K views
Real-time Stream Processing with Apache Flink by DataWorks Summit
Real-time Stream Processing with Apache FlinkReal-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache Flink
DataWorks Summit6.1K views
Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka ... by Roberto Hashioka
Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka ...Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka ...
Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka ...
Roberto Hashioka5.4K views
[125]react로개발자2명이플랫폼4개를서비스하는이야기 심상민 by NAVER D2
[125]react로개발자2명이플랫폼4개를서비스하는이야기 심상민[125]react로개발자2명이플랫폼4개를서비스하는이야기 심상민
[125]react로개발자2명이플랫폼4개를서비스하는이야기 심상민
NAVER D210.9K views
[112]rest에서 graph ql과 relay로 갈아타기 이정우 by NAVER D2
[112]rest에서 graph ql과 relay로 갈아타기 이정우[112]rest에서 graph ql과 relay로 갈아타기 이정우
[112]rest에서 graph ql과 relay로 갈아타기 이정우
NAVER D225.3K views
[236] 카카오의데이터파이프라인 윤도영 by NAVER D2
[236] 카카오의데이터파이프라인 윤도영[236] 카카오의데이터파이프라인 윤도영
[236] 카카오의데이터파이프라인 윤도영
NAVER D212.8K views
Big Data Architecture by Guido Schmutz
Big Data ArchitectureBig Data Architecture
Big Data Architecture
Guido Schmutz25.3K views
Building a Data Pipeline from Scratch - Joe Crobak by Hakka Labs
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe Crobak
Hakka Labs38.6K views
Real-Time Analytics with Apache Cassandra and Apache Spark by Guido Schmutz
Real-Time Analytics with Apache Cassandra and Apache SparkReal-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache Spark
Guido Schmutz9.1K views
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E... by Spark Summit
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Spark Summit3.6K views

Similar to Test strategies for data processing pipelines

Test strategies for data processing pipelines, v2.0 by
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Lars Albertsson
2.7K views36 slides
Holistic data application quality by
Holistic data application qualityHolistic data application quality
Holistic data application qualityLars Albertsson
396 views30 slides
Scaling up uber's real time data analytics by
Scaling up uber's real time data analyticsScaling up uber's real time data analytics
Scaling up uber's real time data analyticsXiang Fu
758 views49 slides
Data ops in practice - Swedish style by
Data ops in practice - Swedish styleData ops in practice - Swedish style
Data ops in practice - Swedish styleLars Albertsson
408 views59 slides
Google Cloud Dataflow by
Google Cloud DataflowGoogle Cloud Dataflow
Google Cloud DataflowAlex Van Boxel
7.6K views52 slides
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin... by
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...HostedbyConfluent
1.1K views39 slides

Similar to Test strategies for data processing pipelines(20)

Test strategies for data processing pipelines, v2.0 by Lars Albertsson
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0
Lars Albertsson2.7K views
Holistic data application quality by Lars Albertsson
Holistic data application qualityHolistic data application quality
Holistic data application quality
Lars Albertsson396 views
Scaling up uber's real time data analytics by Xiang Fu
Scaling up uber's real time data analyticsScaling up uber's real time data analytics
Scaling up uber's real time data analytics
Xiang Fu758 views
Data ops in practice - Swedish style by Lars Albertsson
Data ops in practice - Swedish styleData ops in practice - Swedish style
Data ops in practice - Swedish style
Lars Albertsson408 views
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin... by HostedbyConfluent
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...
HostedbyConfluent1.1K views
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streaming by Yaroslav Tkachenko
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to StreamingBravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streaming
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streaming
Yaroslav Tkachenko542 views
Productionalizing a spark application by datamantra
Productionalizing a spark applicationProductionalizing a spark application
Productionalizing a spark application
datamantra1.3K views
Fast federated SQL with Apache Calcite by Chris Baynes
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache Calcite
Chris Baynes1.4K views
Druid Optimizations for Scaling Customer Facing Analytics by Amir Youssefi
Druid Optimizations for Scaling Customer Facing AnalyticsDruid Optimizations for Scaling Customer Facing Analytics
Druid Optimizations for Scaling Customer Facing Analytics
Amir Youssefi20 views
Yaetos Tech Overview by prevota
Yaetos Tech OverviewYaetos Tech Overview
Yaetos Tech Overview
prevota67 views
Scio - Moving to Google Cloud, A Spotify Story by Neville Li
 Scio - Moving to Google Cloud, A Spotify Story Scio - Moving to Google Cloud, A Spotify Story
Scio - Moving to Google Cloud, A Spotify Story
Neville Li2.5K views
Under the hood of the Altalis Platform by Safe Software
Under the hood of the Altalis PlatformUnder the hood of the Altalis Platform
Under the hood of the Altalis Platform
Safe Software5.6K views
A primer on building real time data-driven products by Lars Albertsson
A primer on building real time data-driven productsA primer on building real time data-driven products
A primer on building real time data-driven products
Lars Albertsson951 views
How I learned to time travel, or, data pipelining and scheduling with Airflow by PyData
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with Airflow
PyData8.7K views
DSDT Meetup Nov 2017 by DSDT_MTL
DSDT Meetup Nov 2017DSDT Meetup Nov 2017
DSDT Meetup Nov 2017
DSDT_MTL55 views
Dsdt meetup 2017 11-21 by JDA Labs MTL
Dsdt meetup 2017 11-21Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21
JDA Labs MTL406 views
From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita... by confluent
From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita...From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita...
From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita...
confluent1.2K views

More from Lars Albertsson

Crossing the data divide by
Crossing the data divideCrossing the data divide
Crossing the data divideLars Albertsson
3 views31 slides
How to not kill people - Berlin Buzzwords 2023.pdf by
How to not kill people - Berlin Buzzwords 2023.pdfHow to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdfLars Albertsson
34 views51 slides
Data engineering in 10 years.pdf by
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdfLars Albertsson
841 views52 slides
The 7 habits of data effective companies.pdf by
The 7 habits of data effective companies.pdfThe 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdfLars Albertsson
252 views44 slides
Secure software supply chain on a shoestring budget by
Secure software supply chain on a shoestring budgetSecure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetLars Albertsson
268 views49 slides
DataOps - Lean principles and lean practices by
DataOps - Lean principles and lean practicesDataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesLars Albertsson
787 views29 slides

More from Lars Albertsson(20)

How to not kill people - Berlin Buzzwords 2023.pdf by Lars Albertsson
How to not kill people - Berlin Buzzwords 2023.pdfHow to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdf
Lars Albertsson34 views
Data engineering in 10 years.pdf by Lars Albertsson
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdf
Lars Albertsson841 views
The 7 habits of data effective companies.pdf by Lars Albertsson
The 7 habits of data effective companies.pdfThe 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdf
Lars Albertsson252 views
Secure software supply chain on a shoestring budget by Lars Albertsson
Secure software supply chain on a shoestring budgetSecure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budget
Lars Albertsson268 views
DataOps - Lean principles and lean practices by Lars Albertsson
DataOps - Lean principles and lean practicesDataOps - Lean principles and lean practices
DataOps - Lean principles and lean practices
Lars Albertsson787 views
The right side of speed - learning to shift left by Lars Albertsson
The right side of speed - learning to shift leftThe right side of speed - learning to shift left
The right side of speed - learning to shift left
Lars Albertsson202 views
Mortal analytics - Covid-19 and the problem of data quality by Lars Albertsson
Mortal analytics - Covid-19 and the problem of data qualityMortal analytics - Covid-19 and the problem of data quality
Mortal analytics - Covid-19 and the problem of data quality
Lars Albertsson416 views
Eventually, time will kill your data processing by Lars Albertsson
Eventually, time will kill your data processingEventually, time will kill your data processing
Eventually, time will kill your data processing
Lars Albertsson413 views
Taming the reproducibility crisis by Lars Albertsson
Taming the reproducibility crisisTaming the reproducibility crisis
Taming the reproducibility crisis
Lars Albertsson521 views
Eventually, time will kill your data pipeline by Lars Albertsson
Eventually, time will kill your data pipelineEventually, time will kill your data pipeline
Eventually, time will kill your data pipeline
Lars Albertsson936 views
10 ways to stumble with big data by Lars Albertsson
10 ways to stumble with big data10 ways to stumble with big data
10 ways to stumble with big data
Lars Albertsson1.4K views

Recently uploaded

[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f... by
[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f...[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f...
[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f...DataScienceConferenc1
5 views18 slides
[DSC Europe 23] Aleksandar Tomcic - Adversarial Attacks by
[DSC Europe 23] Aleksandar Tomcic - Adversarial Attacks[DSC Europe 23] Aleksandar Tomcic - Adversarial Attacks
[DSC Europe 23] Aleksandar Tomcic - Adversarial AttacksDataScienceConferenc1
5 views20 slides
CRM stick or twist workshop by
CRM stick or twist workshopCRM stick or twist workshop
CRM stick or twist workshopinfo828217
14 views16 slides
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation by
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented GenerationDataScienceConferenc1
17 views29 slides
Infomatica-MDM.pptx by
Infomatica-MDM.pptxInfomatica-MDM.pptx
Infomatica-MDM.pptxKapil Rangwani
11 views16 slides
[DSC Europe 23] Ivan Dundovic - How To Treat Your Data As A Product.pptx by
[DSC Europe 23] Ivan Dundovic - How To Treat Your Data As A Product.pptx[DSC Europe 23] Ivan Dundovic - How To Treat Your Data As A Product.pptx
[DSC Europe 23] Ivan Dundovic - How To Treat Your Data As A Product.pptxDataScienceConferenc1
5 views21 slides

Recently uploaded(20)

[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f... by DataScienceConferenc1
[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f...[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f...
[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f...
CRM stick or twist workshop by info828217
CRM stick or twist workshopCRM stick or twist workshop
CRM stick or twist workshop
info82821714 views
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation by DataScienceConferenc1
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation
[DSC Europe 23] Ivan Dundovic - How To Treat Your Data As A Product.pptx by DataScienceConferenc1
[DSC Europe 23] Ivan Dundovic - How To Treat Your Data As A Product.pptx[DSC Europe 23] Ivan Dundovic - How To Treat Your Data As A Product.pptx
[DSC Europe 23] Ivan Dundovic - How To Treat Your Data As A Product.pptx
CRM stick or twist.pptx by info828217
CRM stick or twist.pptxCRM stick or twist.pptx
CRM stick or twist.pptx
info82821711 views
[DSC Europe 23] Ales Gros - Quantum and Today s security with Quantum.pdf by DataScienceConferenc1
[DSC Europe 23] Ales Gros - Quantum and Today s security with Quantum.pdf[DSC Europe 23] Ales Gros - Quantum and Today s security with Quantum.pdf
[DSC Europe 23] Ales Gros - Quantum and Today s security with Quantum.pdf
LIVE OAK MEMORIAL PARK.pptx by ms2332always
LIVE OAK MEMORIAL PARK.pptxLIVE OAK MEMORIAL PARK.pptx
LIVE OAK MEMORIAL PARK.pptx
ms2332always7 views
Lack of communication among family.pptx by ahmed164023
Lack of communication among family.pptxLack of communication among family.pptx
Lack of communication among family.pptx
ahmed16402314 views
[DSC Europe 23] Rania Wazir - Opening up the box: the complexity of human int... by DataScienceConferenc1
[DSC Europe 23] Rania Wazir - Opening up the box: the complexity of human int...[DSC Europe 23] Rania Wazir - Opening up the box: the complexity of human int...
[DSC Europe 23] Rania Wazir - Opening up the box: the complexity of human int...
OPPOTUS - Malaysians on Malaysia 3Q2023.pdf by Oppotus
OPPOTUS - Malaysians on Malaysia 3Q2023.pdfOPPOTUS - Malaysians on Malaysia 3Q2023.pdf
OPPOTUS - Malaysians on Malaysia 3Q2023.pdf
Oppotus27 views
Dr. Ousmane Badiane-2023 ReSAKSS Conference by AKADEMIYA2063
Dr. Ousmane Badiane-2023 ReSAKSS ConferenceDr. Ousmane Badiane-2023 ReSAKSS Conference
Dr. Ousmane Badiane-2023 ReSAKSS Conference
AKADEMIYA20635 views
[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptx by DataScienceConferenc1
[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptx[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptx
[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptx
[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M... by DataScienceConferenc1
[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M...[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M...
[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M...
Chapter 3b- Process Communication (1) (1)(1) (1).pptx by ayeshabaig2004
Chapter 3b- Process Communication (1) (1)(1) (1).pptxChapter 3b- Process Communication (1) (1)(1) (1).pptx
Chapter 3b- Process Communication (1) (1)(1) (1).pptx
ayeshabaig20048 views
[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ... by DataScienceConferenc1
[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ...[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ...
[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ...
[DSC Europe 23] Predrag Ilic & Simeon Rilling - From Data Lakes to Data Mesh ... by DataScienceConferenc1
[DSC Europe 23] Predrag Ilic & Simeon Rilling - From Data Lakes to Data Mesh ...[DSC Europe 23] Predrag Ilic & Simeon Rilling - From Data Lakes to Data Mesh ...
[DSC Europe 23] Predrag Ilic & Simeon Rilling - From Data Lakes to Data Mesh ...
[DSC Europe 23] Luca Morena - From Psychohistory to Curious Machines by DataScienceConferenc1
[DSC Europe 23] Luca Morena - From Psychohistory to Curious Machines[DSC Europe 23] Luca Morena - From Psychohistory to Curious Machines
[DSC Europe 23] Luca Morena - From Psychohistory to Curious Machines

Test strategies for data processing pipelines

  • 1. www.mapflat.com Test strategies for data processing pipelines Lars Albertsson, independent consultant (Mapflat) Øyvind Løkling, Schibsted Products & Technology 1
  • 2. www.mapflat.com Who’s talking? Swedish Institute of Computer. Science. (test & debug tools) Sun Microsystems (large machine verification) Google (Hangouts, productivity) Recorded Future (NLP startup) (data integrations) Cinnober Financial Tech. (trading systems) Spotify (data processing & modelling, productivity) Schibsted Products & Tech (data processing & modelling) Mapflat (independent data engineering consultant) 2
  • 3. www.mapflat.com Agenda ● Data applications from a test perspective ● Testing stream processing product ● Testing batch processing products ● Data quality testing Main focus is functional, regression testing Prerequisites: Backend dev testing, basic data experience 3
  • 4. www.mapflat.com Test value For data-centric applications, in this order: ● Productivity ○ Move fast without breaking things ● Fast experimentation ○ 10% good ideas, 90% bad ● Data quality ○ Challenging, more important than ● Technical quality ○ Technical failure => ops hassle, stale data 4
  • 5. www.mapflat.com Streaming data product anatomy 5 Pub / sub Unified log Ingress Stream processing Egress DB Service TopicJob Pipeline Service Export Business intelligence DB
  • 6. www.mapflat.com Test harness Test fixture Test concepts 6 System under test (SUT) 3rd party component (e.g. DB) 3rd party component 3rd party component Test input Test oracle Test framework (e.g. JUnit, Scalatest) Seam IDEs Build tools
  • 7. www.mapflat.com Test scopes 7 Unit Class Component Integration System / acceptance ● Pick stable seam ● Small scope ○ Fast? ○ Easy, simple? ● Large scope ○ Real app value? ○ Slow, unstable? ● Maintenance, cost ○ Pick few SUTs
  • 8. www.mapflat.com ● Output = function(input, code) ○ No external factors => deterministic ● Pipeline and job endpoints are stable ○ Correspond to business value ● Internal abstractions are volatile ○ Reslicing in different dimensions is common Data-centric application properties 8 q
  • 9. www.mapflat.com ● Output = function(input, code) ○ Perfect for test! ○ Avoid: external service calls, wall clock ● Pipeline/job edges are suitable seams ○ Focus on large tests ● Internal seams => high maintenance, low value ○ Omit unit tests, mocks, dependency injection! ● Long pipelines crosses teams ○ Need for end-to-end tests, but culture challenge Data-centric app test properties 9 q
  • 10. www.mapflat.com Suitable seams, streaming 10 Pub / sub DB Service TopicJob Pipeline Service Export Business intelligence DB
  • 11. www.mapflat.com 2, 7. Scalatest. 4. Spark Streaming jobs IDE, CI, debug integration Streaming SUT, example harness 11 DB Topic Kafka 5. Test input 6. Test oracle 3. Docker 1. IDE / Gradle Polling
  • 12. www.mapflat.com Test lifecycle 12 1. Start fixture containers 2. Await fixture ready 3. Allocate test case resources 4. Start jobs 5. Push input data to Kafka 6. While (!done && !timeout) { pollDatabase(); sleep(1ms) } 7. While (moreTests) { Goto 3 } 8. Tear down fixture For absence test, send dummy sync messages at end.
  • 13. www.mapflat.com Input generation 13 ● Input & output is denormalised & wide ● Fields are frequently changed ○ Additions are compatible ○ Modifications are incompatible => new, similar data type ● Static test input, e.g. JSON files ○ Unmaintainable ● Input generation routines ○ Robust to changes, reusable
  • 14. www.mapflat.com Test oracles 14 ● Compare with expected output ● Check fields relevant for test ○ Robust to field changes ○ Reusable for new, similar types ● Tip: Use lenses ○ JSON: JsonPath (Java), Play JSON (Scala) ○ Case classes: Monocle ● Express invariants for each data type
  • 15. www.mapflat.com Batch data product anatomy 15 Cluster storage Unified log Ingress ETL Egress Data lake DB Service DatasetJob Pipeline Service Export Business intelligence DB DB Import
  • 16. Datasets ● Pipeline equivalent of objects ● Dataset class == homogeneous records, open-ended ○ Compatible schema ○ E.g. MobileAdImpressions ● Dataset instance = dataset class + parameters ○ Immutable ○ Finite set of homogeneous records ○ E.g. MobileAdImpressions(hour=”2016-02-06T13”) 16
  • 17. www.mapflat.com Directory datasets 17 hdfs://red/pageviews/v1/country=se/year=2015/month=11/day=4/_SUCCESS part-00000.json part-00001.json ● Some tools, e.g. Spark, understand Hive name conventions Dataset class Instance parameters, Hive convention Seal PartitionsPrivacy level Schema version
  • 18. Batch processing ● outDatasets = code(inDatasets) ● Component that scale up ○ Spark, (Flink, Scalding, Crunch) ● And scale down ○ Local mode ○ Most jobs fit in one machine 18 Gradual refinement 1. Wash - time shuffle, dedup, ... 2. Decorate - geo, demographic, ... 3. Domain model - similarity, clusters, ... 4. Application model - Recommendations, ...
  • 19. www.mapflat.com Workflow manager ● Dataset “build tool” ● Run job instance when ○ input is available ○ output missing ● Backfills for previous failures ● DSL describes dependencies ● Includes ingress & egress Suggested: Luigi / Airflow 19 DB `
  • 20. www.mapflat.com ClientSessions A/B tests DSL DAG example (Luigi) 20 class ClientActions(SparkSubmitTask): hour = DateHourParameter() def requires(self): return [Actions(hour=self.hour - timedelta(hours=h)) for h in range(0, 24)] + [UserDB(date=self.hour.date)] ... class ClientSessions(SparkSubmitTask): hour = DateHourParameter() def requires(self): return [ClientActions(hour=self.hour - timedelta(hours=h)) for h in range(0, 3)] ... class SessionsABResults(SparkSubmitTask): hour = DateHourParameter() def requires(self): return [ClientSessions(hour=self.hour), ABExperiments(hour=self.hour)] def output(self): return HdfsTarget(“hdfs://production/red/ab_sessions/v1/” + “{:year=%Y/month=%m/day=%d/hour=%H}”.format(self.hour)) ... Actions, hourly ● Expressive, embedded DSL - a must for ingress, egress ○ Avoid weak DSL tools: Oozie, AWS Data Pipeline UserDB Time shuffle, user decorate Form sessions A/B compare ClientActions, daily A/B session evaluationDataset instance Job (aka Task) classes
  • 21. www.mapflat.com Batch processing testing ● Omit collection in SUT ○ Technically different ● Avoid clusters ○ Slow tests ● Seams ○ Between jobs ○ End of pipelines 21 Data lake Artifact of business value E.g. service index Job Pipeline
  • 22. www.mapflat.com Batch test frameworks - don’t ● Spark, Scalding, Crunch variants ● Seam == internal data structure ○ Omits I/O - common bug source ● Vendor lock-in When switching batch framework: ○ Need tests for protection ○ Test rewrite is unnecessary burden 22 Test input Test output Job
  • 23. www.mapflat.com Testing single job 23 Standard Scalatest harness file://test_input/ file://test_output/ 1. Generate input 2. Run in local mode 3. Verify output f() p() Don’t commit - expensive to maintain. Generate / verify with code. Runs well in CI / from IDE Job
  • 24. www.mapflat.com Testing pipelines - two options 24 Standard Scalatest harness file://test_input/ file://test_output/ 1. Generate input 2. Run custom multi-job 3. Verify output f() p() A: Customised workflow manager setup + Runs in CI + Runs in IDE + Quick setup - Multi-job maintenance p() + Tests workflow logic + More authentic - Workflow mgr setup for testability - Difficult to debug - Dataset handling with Python f() B: ● Both can be extended with egress DBs Test job with sequence of jobs
  • 25. www.mapflat.com Testing with cloud services ● PaaS components do not work locally ○ Cloud providers should provide fake implementations ○ Exceptions: Kubernetes, Cloud SQL, (S3) ● Integrate PaaS service as fixture component ○ Distribute access tokens, etc ○ Pay $ or $$$ 25
  • 26. www.mapflat.com Quality testing variants ● Functional regression ○ Binary, key to productivity ● Golden set ○ Extreme inputs => obvious output ○ No regressions tolerated ● (Saved) production data input ○ Individual regressions ok ○ Weighted sum must not decline ○ Beware of privacy 26
  • 27. www.mapflat.com Hadoop / Spark counters ● Processing tool (Spark/Hadoop) counters ○ Odd code path => bump counter ● Dedicated quality assessment pipelines ○ Reuse test oracle invariants in production Obtaining quality metrics 27 DB Quality assessment job
  • 28. www.mapflat.com Quality testing in the process ● Binary self-contained ○ Validate in CI ● Relative vs history ○ E.g. large drops ○ Precondition for publishing dataset ● Push aggregates to DB ○ Standard ops: monitor, alert 28 DB ∆? Code ∆!
  • 29. www.mapflat.com RAM Input Computer program anatomy 29 Input data Process Output File HID VariableFunction Execution path Lookup structure Output data Window File File Device
  • 30. www.mapflat.com Data pipeline = yet another program Don’t veer from best practices ● Regression testing ● Design: Separation of concerns, modularity, etc ● Process: CI/CD, code review, static analysis tools ● Avoid anti-patterns: Global state, hard-coding location, duplication, ... In data engineering, slipping is in the culture... :-( ● Mix in solid backend engineers, document “golden path” 30
  • 31. www.mapflat.com Top anti-patterns 1. Test as afterthought or in production Data processing applications are suited for test! 2. Static test input in version control 3. Exact expected output test oracle 4. Unit testing volatile interfaces 5. Using mocks & dependency injection 6. Tool-specific test framework 7. Calling services / databases from (batch) jobs 8. Using wall clock time 9. Embedded fixture components 31
  • 34. www.mapflat.com Unified log ● Replicated append-only log ● Pub / sub with history ● Decoupled producers and consumers ○ In source/deployment ○ In space ○ In time ● Recovers from link failures ● Replay on transformation bug fix 34 Credits: Confluent
  • 35. www.mapflat.com Stream pipelines ● Parallelised jobs ● Read / write to Kafka ● View egress ○ Serving index ○ SQL / cubes for Analytics ● Stream egress ○ Services subscribe to topic ○ E.g. REST post for export 35 Credits: Apache Samza
  • 36. www.mapflat.com The data lake Unified log + snapshots ● Immutable datasets ● Raw, unprocessed ● Source of truth from batch processing perspective ● Kept as long as permitted ● Technically homogeneous ○ Except for raw imports 36 Cluster storage Data lake
  • 37. www.mapflat.com Batch pipelines ● Things will break ○ Input will be missing ○ Jobs will fail ○ Jobs will have bugs ● Datasets must be rebuilt ● Determinism, idempotency ● Backfill missing / failed ● Eventual correctness 37 Cluster storage Data lake Pristine, immutable datasets Intermediate Derived, regenerable
  • 38. www.mapflat.com Job == function([input datasets]): [output datasets] ● No orthogonal concerns ○ Invocation ○ Scheduling ○ Input / output location ● Testable ● No other input factors, no side-effects ● Ideally: atomic, deterministic, idempotent ● Necessary for audit Batch job 38 q
  • 39. www.mapflat.com Form teams that are driven by business cases & need Forward-oriented -> filters implicitly applied Beware of: duplication, tech chaos/autonomy, privacy loss Data pipelines
  • 40. www.mapflat.com Data platform, pipeline chains Common data infrastructure Productivity, privacy, end-to-end agility, complexity Beware: producer-consumer disconnect
  • 41. www.mapflat.com Ingress / egress representation Larger variation: ● Single file ● Relational database table ● Cassandra column family, other NoSQL ● BI tool storage ● BigQuery, Redshift, ... Egress datasets are also atomic and immutable. E.g. write full DB table / CF, switch service to use it, never change it. 41
  • 42. www.mapflat.com Egress datasets ● Serving ○ Precomputed user query answers ○ Denormalised ○ Cassandra, (many) ● Export & Analytics ○ SQL (single node / Hive, Presto, ..) ○ Workbenches (Zeppelin) ○ (Elasticsearch, proprietary OLAP) ● BI / analytics tool needs change frequently ○ Prepare to redirect pipelines 42
  • 43. Deployment 43 Hg/git repo Luigi DSL, jars, config my-pipe-7.tar.gz HDFS Luigi daemon > pip install my-pipe-7.tar.gz Worker Worker Worker Worker Worker Worker Worker Worker Redundant cron schedule, higher frequency + backfill (Luigi range tools) * 10 * * * bin/my_pipe_daily --backfill 14 All that a pipeline needs, installed atomically