Why is TDD so hard for Data Engineering and Analytics Projects?

Empowering your data
Empowering your business
Why is Test Driven Development so Hard for Analytics Projects?
Phil Watt
Director
27th March 2020
phil.watt@elait.com
www.elait.com

2
Outline
INTRODUCTION TO THE PROBLEM SPACE
REVIEW OF THE LITERATURE
METHODOLOGY
RESULTS
DISCUSSION AND FURTHER WORK

3
Why is Test-Driven Development (TDD) so hard to adopt for Data and Analytics
projects?

4
Current Academic Conclusions on TDD Challenges in Analytics
• Code focus vs data and information
• Volume x Variety
• Valid use case combinations can be virtually unlimited
• Testing continues in production
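
As an illustration of "data focus" and of testing that continues in production, here is a minimal Python sketch. It is not taken from the talk: the function `check_orders`, the column names `order_id` and `amount`, and the rules themselves are hypothetical. The idea is that the same check can run against a small fixture during development and against each real batch after deployment.

```python
def check_orders(rows):
    """Return a list of readable data-quality failures (empty list = pass).

    Hypothetical rules: order_id must be present and unique,
    amount must be a non-negative number.
    """
    failures = []
    seen_ids = set()
    for i, row in enumerate(rows):
        if row.get("order_id") is None:
            failures.append(f"row {i}: order_id is missing")
        elif row["order_id"] in seen_ids:
            failures.append(f"row {i}: duplicate order_id {row['order_id']}")
        else:
            seen_ids.add(row["order_id"])
        amount = row.get("amount")
        if not isinstance(amount, (int, float)) or amount < 0:
            failures.append(f"row {i}: amount {amount!r} is not a non-negative number")
    return failures


# Development-time fixtures: one clean batch, one deliberately broken batch.
good = [{"order_id": 1, "amount": 9.5}, {"order_id": 2, "amount": 0}]
bad = [{"order_id": 1, "amount": -3}, {"order_id": 1, "amount": "n/a"}]

assert check_orders(good) == []
assert len(check_orders(bad)) == 3  # negative amount, duplicate id, non-numeric amount
```

The check tests the data rather than the code that produced it, so a fixture can only cover a handful of the possible value combinations. That is one reason such checks tend to keep running against production batches.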

5
Current Academic Conclusions on TDD Challenges in Analytics
• Non-deterministic results
• Combined reasons drive poor project / developer discipline
• Combined reasons escalate cost

6
Deterministic vs Non-deterministic
Neural Network by sachin modgekar from the Noun Project
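
To make the distinction on this slide concrete, here is a minimal Python sketch. It is not taken from the talk, and the functions `total_revenue` and `sample_mean` are hypothetical. A deterministic transformation can be asserted exactly. A sampled estimate can either be made repeatable by fixing the random seed, or be asserted against a bound that holds for every possible sample.

```python
import random
import statistics


def total_revenue(rows):
    """Deterministic: the same rows always give the same total."""
    return sum(r["amount"] for r in rows)


def sample_mean(values, k, rng):
    """Non-deterministic: the result depends on which k values are drawn."""
    return statistics.mean(rng.sample(values, k))


rows = [{"amount": 10}, {"amount": 15}, {"amount": 5}]
# An exact assertion is safe for deterministic logic.
assert total_revenue(rows) == 30

values = list(range(1, 101))  # the integers 1..100

# Option 1: fix the seed so two runs draw the same sample.
first = sample_mean(values, 20, random.Random(42))
second = sample_mean(values, 20, random.Random(42))
assert first == second

# Option 2: assert a bound that holds for any sample of 20 distinct values.
# The smallest possible mean is mean(1..20) = 10.5, the largest mean(81..100) = 90.5.
estimate = sample_mean(values, 20, random.Random())
assert 10.5 <= estimate <= 90.5
```

A tighter tolerance in option 2 would catch more defects but could fail occasionally by chance, which is the trade-off that makes tests of non-deterministic results harder to write.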

Methodology
Mixed methods
Formal Interviews
Short online survey
Synthesis and Analysis

8
Who Responded to the Survey?

9
Survey Respondents that Recognised Each Challenge
[Bar chart, axis scale 0 to 16. Challenges shown:]
• Testing focused on data, not software
• Analytics data volumes drive much larger testing context
• Limited valid testing scenarios for software testing, but unlimited for data
• Data Warehouse Testing continues in production
• Analytics tests can be non-deterministic
• Combination of these reasons drives up TDD costs for analytics
• Combination of reasons can drive poor habits in developers or project managers
• Other challenges

10
Difficulty With Each Challenge
[Chart. Challenges shown:]
• Testing focused on data, not software
• Analytics data volumes drive much larger testing context
• Limited valid testing scenarios for software testing, but unlimited for data
• Data Warehouse Testing continues in production
• Analytics tests can be non-deterministic
• Combination of these reasons drives up TDD costs for analytics
• Combination of reasons can drive poor habits in developers or project managers

11
Other challenges
• DWH can have complex logic related to delta processing, historical delta etc., which makes it even more difficult to automate [testing]. Multiple source systems, which can inject a different type of data due to their own changes, make it even more complex.
• Capability to handle end-to-end complexity of development task is rare.
• 1. People with a software background may not understand analytics. 2. DW bugs not fixed post deployment. 3. DW not tested for other purposes, e.g. marketing analytics.
• Dev Teams / Leaders don't think of testing in this way.
• Analysts and Data Scientists rarely have the personality or training to do TDD effectively.
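
The delta-processing point above can be illustrated with a minimal Python sketch. It is not taken from the survey responses: the function `apply_delta`, the record layout (`key`, `value`, `valid_from`, `valid_to`) and the sample values are all hypothetical. It assumes a simple history table in which a changed key closes the open record and appends a new version.

```python
from datetime import date


def apply_delta(history, delta, load_date):
    """Close the open record for each changed key and append the new version.

    history: list of dicts with key, value, valid_from, valid_to (None = open).
    delta:   dict mapping key -> latest value from the source system.
    """
    result = [dict(row) for row in history]  # copy so the input is not mutated
    for key, new_value in delta.items():
        current = next(
            (r for r in result if r["key"] == key and r["valid_to"] is None),
            None,
        )
        if current is not None and current["value"] == new_value:
            continue  # unchanged value: nothing to do
        if current is not None:
            current["valid_to"] = load_date  # close the old version
        result.append(
            {"key": key, "value": new_value,
             "valid_from": load_date, "valid_to": None}
        )
    return result


def test_changed_key_keeps_history():
    history = [{"key": "C1", "value": "Bronze",
                "valid_from": date(2020, 1, 1), "valid_to": None}]
    out = apply_delta(history, {"C1": "Gold"}, date(2020, 3, 1))
    assert len(out) == 2
    assert out[0]["valid_to"] == date(2020, 3, 1)
    assert out[1]["value"] == "Gold" and out[1]["valid_to"] is None


def test_unchanged_key_is_idempotent():
    history = [{"key": "C1", "value": "Bronze",
                "valid_from": date(2020, 1, 1), "valid_to": None}]
    assert apply_delta(history, {"C1": "Bronze"}, date(2020, 3, 1)) == history


test_changed_key_keeps_history()
test_unchanged_key_is_idempotent()
```

Even in this small case each test needs a history state, a delta and a load date. A real warehouse multiplies those combinations across many source systems, which fits the respondent's remark that such logic is difficult to automate.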

12
About the interviewees
14 individuals
12 with strong analytics domain
experience
• 4 Data Scientists
• 2 Data Engineers
• 4 Enterprise Analytics
Architects
• 2 Programme Managers
2 control interviews with
software engineering
backgrounds
5 Industry sectors
1 Public Sector
7 Professional Services (each
with experience across multiple
sectors)
2 Financial Services
1 Telco
1 Media

13
aws_transcribe_to_docx sample

14
Interview Highlights
• TDD advocates (n=4) stressed the importance of ‘habit forming’ to drive adoption and benefits realisation
• Everyone (n=14) recognised the theoretical benefits of TDD in Analytics
  • 8 said benefits were subject to the expected duration of a project, e.g. one-off pieces of work would not benefit
• Some disagreement between Data Scientists (n=4)
  • 1 agnostic
  • 2 relied on manual testing, arguing that their work was mainly one-off jobs
  • 1 strongly advocated forming good habits early, adding that test scope could be limited for one-off jobs, but was still needed
• Interviewee commentary about the Recognised Challenges (slide 4) was broadly in line with the survey results
  • All interviewees were invited to complete the survey; 10 responded
  • 8 survey respondents were not interviewed, but were invited to respond through my LinkedIn network

16
Synthesising the Results
# | Challenge | Category | Difficulty
1 | Analytics data volumes drive much larger testing context | Data | Hard
2 | Data Warehouse Testing continues in production | Data | Hard
3 | Upstream Data Changes Impact on Historical Records | Data | Hard
4 | Limited valid testing scenarios for software testing, but unlimited for data | Data | Medium
5 | Testing focused on data, not software | Data | Medium
6 | Clear requirements | Organisation | Very Hard
7 | People with a software background may not understand analytics | Organisation | Very Hard
8 | Technical Maturity of Organisation | Organisation | Very Hard
9 | Combination of reasons can drive poor habits in developers or project managers | Organisation | Very Hard
10 | Combination of these reasons drives up TDD costs for analytics | Organisation | Medium
11 | Capability to handle end-to-end complexity of development task is rare | Organisation | Medium
12 | Developers, Data Scientists and Leaders don't think of testing in this way | Organisation | Medium
13 | Executive support for TDD | Organisation | Medium
14 | Project Duration | Organisation | Easy
15 | Technical Debt | Technical | Very Hard
16 | Analytics tests can be non-deterministic | Technical | Hard
17 | Modularity of Code | Technical | Medium-Hard

17
Addressing the Data Challenges
• Volume x Variety
• Testing continues in production
• Upstream Changes Impact Historical Records
• Valid use case combinations can be virtually unlimited
• Code Focus vs Data and information
Icon: The Martial Arts by Anyssa Ferreira from the Noun Project

18
Addressing the Organisation Challenges
• Clear Requirements
• People with a software background may not understand analytics
• Technical Maturity of Organisation
• Combined reasons escalate cost
• Combined reasons drive poor project / developer discipline
• Capability to handle end-to-end complexity of development is rare
• Devs, Data Scientists & Leaders don't think of testing in this way
• Executive support for TDD
• Project Duration
Icons: computer code by Juicy Fish; maturity by Ralf Schmitzer; skills by Rflor; all from the Noun Project

19
Addressing the Technical Challenges
• Non-deterministic results
• Modularity of Code
• Technical Debt
Icon: The Martial Arts by Anyssa Ferreira from the Noun Project

21
Further work
• More interviews, more survey responses, more data
• A range of Test Automation case studies over a matrix of scenarios
  • Where TDD is used extensively
  • Where other test automation is used instead of TDD
  • Where manual testing is used
  • For project durations that are short, medium or long
  • For systems that are simple through to complex
• Analysis of the impact of other factors that could drive productivity, cycle time and quality:
  • Frameworks
  • Low-code development tools
  • Open Source vs proprietary tools

I need your help with a 10-minute survey
https://qrco.de/DATAENGRES

We’re hiring!
For more information or to connect on social
media:
Phil Watt
phil.watt@elait.com
https://qrco.de/philwatt


25
References
• Collier, KW 2011, ‘Chapter 7. Test-Driven Data Warehouse Development’, in Agile Analytics: A Value-Driven Approach to Business Intelligence and Data Warehousing, Addison-Wesley Professional, viewed 8 September 2019, <https://learning-oreilly-com.ezp.lib.unimelb.edu.au/library/view/agile-analytics-a/9780321669575/ch07.html>.
• Dzakovic, M 2016, ‘Industrial Application of Automated Regression Testing in Test-Driven ETL Development - IEEE Conference Publication’, in 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), Institute of Electrical and Electronics Engineers, viewed 8 September 2019, <https://ieeexplore-ieee-org.ezp.lib.unimelb.edu.au/document/7816512?arnumber=7816512&SID=EBSCO:edseee>.
• Golfarelli, M & Rizzi, S 2009, ‘A comprehensive approach to data warehouse testing’, Proceeding of the ACM twelfth international workshop on Data warehousing and OLAP - DOLAP ’09, viewed 7 September 2019, <https://dl-acm-org.ezp.lib.unimelb.edu.au/citation.cfm?id=1651295>.
• Ivo, AAS, Guerra, EM, Porto, SM, Choma, J & Quiles, MG 2018, ‘An approach for applying Test-Driven Development (TDD) in the development of randomized algorithms’, Journal of Software Engineering Research and Development, vol. 6, no. 1, viewed 13 September 2019, <https://doaj.org/article/8be2f4e3709747e68c04537838b3b314?>.
• Krawatzeck, R, Tetzner, A & Dinter, B 2015, An Evaluation of Open Source Unit Testing Tools Suitable for Data Warehouse Testing, p. 22.
• Rencberoglu, E 2019, ‘Fundamental Techniques of Feature Engineering for Machine Learning’, Towards Data Science, April, Towards Data Science, viewed 28 September 2019, <https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114>.
• Sambinelli, F, Ursini, EL, Borges, MAF & Martins, PS 2018, ‘Modeling and Performance Analysis of Scrumban with Test-Driven Development Using Discrete Event and Fuzzy Logic - IEEE Conference Publication’, in 2018 6th International Conference in Software Engineering Research and Innovation (CONISOFT), IEEE, viewed 14 September 2019, <https://ieeexplore-ieee-org.ezp.lib.unimelb.edu.au/document/8645924?arnumber=8645924&SID=EBSCO:edseee>.
• Schutte, S, Ariyachandra, T & Frolick, M 2011, ‘Test-Driven Development of Data Warehouses’, International Journal of Business Intelligence Research, vol. 2, no. 1, pp. 64–73, viewed 8 September 2019, <https://pdfs.semanticscholar.org/c3e1/575409cbaa9e7f4c07201de5774f5c0181f9.pdf>.

Problem statement
• Test Driven Development (TDD) is a common pattern in software engineering that helps reduce cycle time, improve code quality and reduce production defects.
• Within data engineering and analytics projects, TDD is held up as best practice in development and maintenance lifecycle phases.
• Many organisations do not see the promised benefits of TDD in an analytics context, prompting the question:
  • Why is it so hard to effectively implement Test Driven Development in an analytics platform?
