This document outlines Phil Watt's oral thesis presentation on the challenges of adopting test-driven development (TDD) for analytics and data projects. It opens by noting that although TDD is an established best practice in software development, it is rarely used on analytics projects. It then reviews literature identifying challenges such as testing large data volumes and non-deterministic outputs, and describes the author's mixed methodology of interviews and an invitation-only online survey of analytics professionals to understand the recognised challenges and their perceived difficulty. The results showed broad agreement that TDD for analytics is more complex than for software, but opinions varied on solutions. Suggested further work includes test automation case studies and analysis of the impact of other productivity factors.
Why is Test-Driven Development for Analytics or Data Projects so Hard?
1. Phil Watt: Oral Thesis Presentation
Why is Test-Driven Development so Hard in Analytics or Data-Focussed Projects?
University of Melbourne, ISYS90111_2019_TM4
3. Why is Test-Driven Development (TDD) so hard to adopt within Data and Analytics projects?
TDD is an established best practice in software development, promising benefits such as:
• Reduced cycle time
• Improved developer productivity
• Reduced production defects
Observation that analytics and data projects mostly do not use TDD, based on:
• Analytics/data management consulting and delivery experience in 19 countries and 5 continents
• Working across hundreds of projects in this domain
Concept validated with eight informal interviews, whose purpose was to shape the research direction before formal data gathering began.
Interviews with analytics leaders across 5 industry segments:
• 2 Chief Data Officers
• 2 Enterprise Architects managing large analytics programmes
• 2 Heads of Data Engineering
• 2 Analytics programme leaders in large enterprises
• 1 Advanced Analytics practice leader in a large professional services organisation
4. Recognised Challenges from the Literature Review
• Software testing is focussed on program code; analytics is focussed on data and information
• Analytics data volumes drive a testing context several orders of magnitude greater than most software tests
• The valid combinations of scenarios in general software testing are limited, but for analytics can be virtually unlimited
• Data warehouse testing continues after production deployment (notably regression testing even when code is not changed for some given data), unlike general software testing
• Analytics outputs can be non-deterministic, especially for predictions and Machine Learning use cases
• The combination of these reasons drives up the cost of TDD and test automation in analytics
• Because of these constraints, developer or project discipline may slip, leading to lower test coverage and increased defects/cycle time
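To make the data-focus and non-determinism challenges above concrete, here is a minimal sketch of what TDD-style tests for an analytics transformation might look like. The function `dedupe_customers`, its record shape, and the 10% sampling rate are hypothetical illustrations, not from the presentation; the second test shows one common way to handle a non-deterministic output, asserting a tolerance band rather than an exact value.

```python
import random

def dedupe_customers(rows):
    """Keep the most recent record per customer_id (hypothetical logic)."""
    latest = {}
    for row in rows:
        cid = row["customer_id"]
        # ISO-8601 date strings compare correctly as plain strings
        if cid not in latest or row["updated_at"] > latest[cid]["updated_at"]:
            latest[cid] = row
    return list(latest.values())

def test_dedupe_keeps_latest_record():
    # Deterministic test: exact assertions on a small, hand-built dataset
    rows = [
        {"customer_id": 1, "updated_at": "2019-01-01", "email": "old@example.com"},
        {"customer_id": 1, "updated_at": "2019-06-01", "email": "new@example.com"},
        {"customer_id": 2, "updated_at": "2019-03-01", "email": "b@example.com"},
    ]
    result = dedupe_customers(rows)
    assert len(result) == 2                      # one row per customer
    by_id = {r["customer_id"]: r for r in result}
    assert by_id[1]["email"] == "new@example.com"  # latest record wins

def test_sample_rate_within_tolerance():
    # Non-deterministic output: assert a statistical property, not exact rows
    sample = [r for r in range(10_000) if random.random() < 0.1]
    rate = len(sample) / 10_000
    assert 0.07 < rate < 0.13   # tolerance band instead of exact equality

test_dedupe_keeps_latest_record()
test_sample_rate_within_tolerance()
```

The second test is deliberately loose: a tighter band would make the suite flaky, which is one practical reason test automation costs more in analytics.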
5. Methodology
Mixed methods:
Formal interviews
• A 6-page briefing pack supplied to interviewees two weeks before the interview
• Audio or video recorded, then transcribed
Short online survey
• Invitation only
• Two questions:
• Which of the challenges on the previous slide do you recognise?
• How difficult were these challenges to overcome?
Synthesis and analysis
8. Recognised Challenges
[Bar chart, scale 0 to 16: number of survey respondents recognising each challenge: Testing focused on data, not software; Analytics data volumes drive a much larger testing context; Limited valid testing scenarios for software testing, but unlimited for data; Data warehouse testing continues in production; Analytics tests can be non-deterministic; Combination of these reasons drives up TDD costs for analytics; Combination of reasons can drive poor habits in developers or project managers; Other challenges]
9. Other challenges
• DWH can have complex logic related to delta processing, historical deltas etc., which makes it even more difficult to automate [testing]. Multiple source systems, which can inject different types of data due to their own changes, make it even more complex.
• Capability to handle the end-to-end complexity of the development task is rare
• 1. People with a software background may not understand analytics. 2. DW bugs not fixed post deployment. 3. DW not tested for other purposes, e.g. marketing analytics.
• Dev teams/leaders don't think of testing in this way
• Analysts and Data Scientists rarely have the personality or training to do TDD effectively.
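The first comment above, on delta processing, hints at a class of regression tests that can be automated despite the complexity. As a hedged sketch (the `apply_delta` function and record shapes are hypothetical, not from the presentation), one useful property to assert is idempotence: replaying the same delta must leave the warehouse state unchanged.

```python
def apply_delta(current, delta):
    """Upsert delta rows into current state by primary key (hypothetical logic)."""
    state = {row["id"]: row for row in current}
    for row in delta:
        if row.get("deleted"):
            state.pop(row["id"], None)   # handle logical deletes
        else:
            state[row["id"]] = row       # insert or update by key
    return list(state.values())

def test_delta_is_idempotent():
    current = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
    delta = [{"id": 2, "amount": 25}, {"id": 3, "amount": 30}]
    once = apply_delta(current, delta)
    twice = apply_delta(once, delta)     # replay the same delta
    assert sorted(r["id"] for r in once) == [1, 2, 3]  # keys remain unique
    assert once == twice                 # re-running a delta must not change state

test_delta_is_idempotent()
```

Property-style checks like this (key uniqueness, replay safety) are cheaper to maintain than row-by-row expected outputs when multiple source systems keep changing their feeds.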
10. Difficulty With Each Challenge
[Chart: perceived difficulty of each challenge: Testing focused on data, not software; Analytics data volumes drive a much larger testing context; Limited valid testing scenarios for software testing, but unlimited for data; Data warehouse testing continues in production; Analytics tests can be non-deterministic; Combination of these reasons drives up TDD costs for analytics; Combination of reasons can drive poor habits in developers or project managers]
12. About the interviewees
14 individuals
12 with strong analytics domain experience:
• 4 Data Scientists
• 2 Data Engineers
• 4 Enterprise Analytics Architects
• 2 Programme Managers
2 control interviews with software engineering backgrounds
5 industry sectors:
• 1 Public Sector
• 7 Professional Services (each with experience across multiple sectors)
• 2 Financial Services
• 1 Telco
• 1 Media
13. Interview Highlights
TDD advocates (n=4) stressed the importance of 'habit forming' to drive adoption and benefits realisation
Everyone (n=14) recognised the theoretical benefits of TDD in Analytics
• 8 said benefits were subject to the expected duration of a project, e.g. one-off pieces of work would not benefit
Some disagreement among Data Scientists (n=4):
• 1 agnostic
• 2 relied on manual testing, arguing that their work was mainly one-off jobs
• 1 strongly advocated forming good habits early, adding that test scope could be limited for one-off jobs but was still needed
Interviewee commentary about the Recognised Challenges (slide 4) was broadly in line with the survey results
• All interviewees were invited to complete the survey; 10 responded
• 8 survey respondents were not interviewed, but were invited to respond through my LinkedIn network
15. Discussion
There is strong agreement between survey respondents and interviewees that TDD for analytics is different and more complex than for traditional software engineering
• Although opinions vary on why, some core reasons were identified
Some support for the idea that TDD is best applied to longer-term projects, but should be avoided when they are of short duration
• Similar to the heuristic model from Sambinelli et al. (2018) for general software projects
A minority of interviewees stressed that TDD is always the right thing for analytics, but that success depends upon:
• Early, strong habit forming around TDD practices
• Careful design of the scope of TDD
I find this minority view compelling, though this may be confirmation bias on my part
16. Further work
With more time I would improve the accuracy of the transcriptions, to enable better text analytics and concept matching across interviews
A range of test automation case studies over a matrix of scenarios:
• Where TDD is used extensively
• Where other test automation is used instead of TDD
• Where manual testing is used
• For project durations that are short, medium or long
• For systems that are simple through to complex
Analysis of the impact of other factors that could drive productivity, cycle time and quality:
• Frameworks
• Low-code development tools
• Open source vs proprietary tools
17. References
• Collier, K 2011, Agile Analytics: A Value-Driven Approach to Business Intelligence and Data Warehousing, Addison-Wesley Professional, viewed 8 September 2019, <https://learning-oreilly-com.ezp.lib.unimelb.edu.au/library/view/agile-analytics-a/9780321669575/ch07.html>.
• Dzakovic, M 2016, 'Industrial Application of Automated Regression Testing in Test-Driven ETL Development', in 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), Institute of Electrical and Electronics Engineers, viewed 8 September 2019, <https://ieeexplore-ieee-org.ezp.lib.unimelb.edu.au/document/7816512?arnumber=7816512&SID=EBSCO:edseee>.
• Golfarelli, M & Rizzi, S 2009, 'A comprehensive approach to data warehouse testing', Proceedings of the ACM twelfth international workshop on Data warehousing and OLAP - DOLAP '09, viewed 7 September 2019, <https://dl-acm-org.ezp.lib.unimelb.edu.au/citation.cfm?id=1651295>.
• Ivo, AAS, Guerra, EM, Porto, SM, Choma, J & Quiles, MG 2018, 'An approach for applying Test-Driven Development (TDD) in the development of randomized algorithms', Journal of Software Engineering Research and Development, vol. 6, no. 1, viewed 13 September 2019, <https://doaj.org/article/8be2f4e3709747e68c04537838b3b314?>.
• Krawatzeck, R, Tetzner, A & Dinter, B 2015, An Evaluation of Open Source Unit Testing Tools Suitable for Data Warehouse Testing, p. 22.
• Rencberoglu, E 2019, 'Fundamental Techniques of Feature Engineering for Machine Learning', Towards Data Science, April, viewed 28 September 2019, <https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114>.
• Sambinelli, F, Ursini, EL, Borges, MAF & Martins, PS 2018, 'Modeling and Performance Analysis of Scrumban with Test-Driven Development Using Discrete Event and Fuzzy Logic', in 2018 6th International Conference in Software Engineering Research and Innovation (CONISOFT), IEEE, viewed 14 September 2019, <https://ieeexplore-ieee-org.ezp.lib.unimelb.edu.au/document/8645924?arnumber=8645924&SID=EBSCO:edseee>.
• Schutte, S, Ariyachandra, T & Frolick, M 2011, 'Test-Driven Development of Data Warehouses', International Journal of Business Intelligence Research, vol. 2, no. 1, pp. 64–73, viewed 8 September 2019.