Phil Watt
Oral Thesis Presentation
Why is Test-Driven Development so Hard in Analytics or Data Focussed Projects?
University of Melbourne, ISYS90111_2019_TM4
Outline
INTRODUCTION TO THE PROBLEM SPACE
REVIEW OF THE LITERATURE
METHODOLOGY
RESULTS
DISCUSSION AND FURTHER WORK
Why is Test-Driven Development (TDD) so hard to adopt within Data and Analytics projects?

TDD is an established best practice in software development, promising benefits such as:
• Reduced Cycle Time
• Improved Developer Productivity
• Reduced Production Defects
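To make the practice concrete, here is a minimal single-file sketch of the TDD red-green cycle in Python with pytest. The `normalise_country` function and its tests are hypothetical illustrations, not taken from any project in this study.

```python
# tdd_sketch.py -- a single file for brevity; in real TDD the tests live in
# their own module and are written first, failing (red) until the
# implementation makes them pass (green), after which the code is refactored.
import pytest

# --- implementation (written second, in response to the failing tests) ---
_ALIASES = {"uk": "United Kingdom", "united kingdom": "United Kingdom"}


def normalise_country(raw: str) -> str:
    """Map free-text country aliases to a canonical name."""
    key = raw.strip().lower()
    if key not in _ALIASES:
        raise ValueError(f"unknown country: {raw!r}")
    return _ALIASES[key]


# --- tests (written first) ---
def test_normalise_country_maps_common_aliases():
    assert normalise_country("UK") == "United Kingdom"
    assert normalise_country(" united kingdom ") == "United Kingdom"


def test_normalise_country_rejects_unknown_values():
    with pytest.raises(ValueError):
        normalise_country("ZZ")
```

Run with `pytest tdd_sketch.py`; the red-green rhythm these tests enforce is the ‘habit forming’ that interviewees describe later in this presentation.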
Observation that analytics and data projects mostly do not use TDD, based on:
• Analytics/data management consulting and delivery experience in 19 countries and 5 continents
• Working across hundreds of projects in this domain

Concept validated with eight informal interviews, whose purpose was to shape the research direction before formal data gathering began. These were interviews with analytics leaders across 5 industry segments:
• 2 Chief Data Officers
• 2 Enterprise Architects managing large analytics programmes
• 2 Heads of Data Engineering
• 2 Analytics programme leaders in large enterprises
• 1 Advanced Analytics practice leader in a large professional services organisation
Recognised Challenges from the Literature Review
• Software testing is focussed on program code; analytics is focussed on data and information
• Analytics data volumes drive a testing context several orders of magnitude greater than most software tests
• The valid combinations of scenarios in general software testing are limited, but for analytics they can be virtually unlimited
• Data warehousing testing continues after production deployment (notably regression testing even when code is not changed for some given data), unlike general software testing
• Analytics outputs can be non-deterministic, especially for predictions and Machine Learning use cases (see the sketch after this list)
• The combination of these reasons drives up the cost of TDD and test automation in analytics
• Because of these constraints, developer or project discipline may slip, leading to lower test coverage, more defects and longer cycle times
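As a concrete illustration of the non-determinism point, here is a hedged sketch of one common workaround: asserting that a model's quality metric stays within an agreed tolerance band rather than asserting exact outputs. The synthetic data, model choice and 0.7 threshold are all illustrative assumptions, not drawn from the survey or interviews.

```python
# nondeterministic_test_sketch.py -- rather than asserting an exact prediction,
# assert that a quality metric stays inside an agreed tolerance band.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def test_model_quality_within_tolerance_band():
    # Synthetic data; deliberately unseeded so each run varies, as analytics
    # outputs often do in practice.
    rng = np.random.default_rng()
    X = rng.normal(size=(500, 4))
    y = X @ np.array([1.5, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=500)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # No random_state: training is non-deterministic from run to run.
    model = RandomForestRegressor(n_estimators=50)
    model.fit(X_train, y_train)

    # Assert a property of the output, not an exact value; the 0.7 floor is an
    # illustrative threshold a team might agree with stakeholders.
    assert r2_score(y_test, model.predict(X_test)) > 0.7
```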
Methodology
Mixed methods:

Formal interviews
• A 6-page briefing pack supplied to interviewees two weeks before the interview
• Audio or video recorded, then transcribed

Short online survey
• Invitation only
• Two questions:
  • Which of the challenges on the previous slide do you recognise?
  • How difficult were these challenges to overcome?

Synthesis and analysis
SURVEY RESULTS

Who Responded?
Recognised Challenges

[Bar chart: number of survey respondents (scale 0 to 16) recognising each challenge: testing focused on data, not software; analytics data volumes drive a much larger testing context; limited valid testing scenarios for software testing, but unlimited for data; Data Warehouse testing continues in production; analytics tests can be non-deterministic; the combination of these reasons drives up TDD costs for analytics; the combination of reasons can drive poor habits in developers or project managers; other challenges.]
Other Challenges
• DWH can have complex logic related to delta processing, historical delta etc. which makes it even more difficult to automate [testing]. Multiple source systems which can inject a different type of data due to their own changes make it even more complex.
• Capability to handle end-to-end complexity of development task is rare.
• 1. People with a software background may not understand analytics. 2. DW bugs not fixed post deployment. 3. DW not tested for other purposes, e.g. marketing analytics.
• Dev Teams / Leaders don't think of testing in this way.
• Analysts and Data Scientists rarely have the personality or training to do TDD effectively.
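The first response above singles out delta (incremental load) processing as especially hard to test automatically. As a hedged sketch of how such logic can still be unit-tested against small fixtures, the following `apply_delta` upsert and its fixtures are hypothetical, not taken from any respondent's warehouse:

```python
# delta_merge_sketch.py -- illustrative fixture-based test for incremental-load
# (delta) logic: apply a day's delta to a warehouse table and check the result.
import pandas as pd


def apply_delta(target: pd.DataFrame, delta: pd.DataFrame) -> pd.DataFrame:
    """Upsert delta rows into target, keyed on customer_id (hypothetical logic)."""
    merged = pd.concat([target, delta]).drop_duplicates(
        subset="customer_id", keep="last"
    )
    return merged.sort_values("customer_id").reset_index(drop=True)


def test_apply_delta_updates_and_inserts():
    target = pd.DataFrame({"customer_id": [1, 2], "balance": [100.0, 200.0]})
    # The delta updates customer 2 and inserts customer 3.
    delta = pd.DataFrame({"customer_id": [2, 3], "balance": [250.0, 50.0]})

    result = apply_delta(target, delta)

    assert list(result["customer_id"]) == [1, 2, 3]
    assert result.loc[result["customer_id"] == 2, "balance"].item() == 250.0
```

The respondent's point stands, though: multiple source systems, historical corrections and schema drift multiply the fixtures required well beyond this toy case.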
Difficulty With Each Challenge

[Chart: reported difficulty of overcoming each of the seven recognised challenges; categories as in the previous chart.]
INTERVIEW RESULTS

About the Interviewees
14 individuals

12 with strong analytics domain experience:
• 4 Data Scientists
• 2 Data Engineers
• 4 Enterprise Analytics Architects
• 2 Programme Managers

2 control interviews with software engineering backgrounds

5 industry sectors:
• 1 Public Sector
• 7 Professional Services (each with experience across multiple sectors)
• 2 Financial Services
• 1 Telco
• 1 Media
Interview Highlights

TDD advocates (n=4) stressed the importance of ‘habit forming’ to drive adoption and benefits realisation

Everyone (n=14) recognised the theoretical benefits of TDD in Analytics
• 8 said the benefits were subject to the expected duration of a project, e.g. one-off pieces of work would not benefit

Some disagreement among Data Scientists (n=4):
• 1 agnostic
• 2 relied on manual testing, arguing that their work was mainly one-off jobs
• 1 strongly advocated forming good habits early, adding that test scope could be limited for one-off jobs, but testing was still needed

Interviewee commentary about the Recognised Challenges (slide 4) was broadly in line with the survey results
• All interviewees were invited to complete the survey; 10 responded
• The other 8 survey respondents were not interviewed, but were invited to respond through my LinkedIn network
DISCUSSION & FURTHER WORK

Discussion
There is strong agreement between survey respondents and interviewees that TDD for analytics is different from, and more complex than, TDD for traditional software engineering
• Although opinions vary on why, some core reasons are consistently identified

Some support for the idea that TDD is best applied to longer-term projects, but should be avoided for work of short duration
• Similar to the heuristic model from Sambinelli et al. (2018) for general software projects

A minority of interviewees stress that TDD is always the right thing for analytics, but that success depends upon:
• Early, strong habit forming around TDD practices
• Careful design of the scope of TDD

I find this minority view compelling
• But this may be confirmation bias on my part
Further work

With more time, I would improve the accuracy of the transcriptions to enable better text analytics and concept matching across interviews

A range of test automation case studies over a matrix of scenarios:
• Where TDD is used extensively
• Where other test automation is used instead of TDD
• Where manual testing is used
• For project durations that are short, medium or long
• For systems that are simple through to complex

Analysis of the impact of other factors that could drive productivity, cycle time and quality:
• Frameworks
• Low-code development tools
• Open Source vs proprietary tools
References
• Collier, K 2011, Agile Analytics: A Value-Driven Approach to Business Intelligence and Data Warehousing, Addison-Wesley Professional, viewed 8 September 2019, <https://learning-oreilly-com.ezp.lib.unimelb.edu.au/library/view/agile-analytics-a/9780321669575/ch07.html>.
• Dzakovic, M 2016, ‘Industrial Application of Automated Regression Testing in Test-Driven ETL Development’, in 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), Institute of Electrical and Electronics Engineers, viewed 8 September 2019, <https://ieeexplore-ieee-org.ezp.lib.unimelb.edu.au/document/7816512?arnumber=7816512&SID=EBSCO:edseee>.
• Golfarelli, M & Rizzi, S 2009, ‘A comprehensive approach to data warehouse testing’, Proceeding of the ACM twelfth international workshop on Data warehousing and OLAP (DOLAP ’09), viewed 7 September 2019, <https://dl-acm-org.ezp.lib.unimelb.edu.au/citation.cfm?id=1651295>.
• Ivo, AAS, Guerra, EM, Porto, SM, Choma, J & Quiles, MG 2018, ‘An approach for applying Test-Driven Development (TDD) in the development of randomized algorithms’, Journal of Software Engineering Research and Development, vol. 6, no. 1, viewed 13 September 2019, <https://doaj.org/article/8be2f4e3709747e68c04537838b3b314?>.
• Krawatzeck, R, Tetzner, A & Dinter, B 2015, An Evaluation of Open Source Unit Testing Tools Suitable for Data Warehouse Testing, p. 22.
• Rencberoglu, E 2019, ‘Fundamental Techniques of Feature Engineering for Machine Learning’, Towards Data Science, April, viewed 28 September 2019, <https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114>.
• Sambinelli, F, Ursini, EL, Borges, MAF & Martins, PS 2018, ‘Modeling and Performance Analysis of Scrumban with Test-Driven Development Using Discrete Event and Fuzzy Logic’, in 2018 6th International Conference in Software Engineering Research and Innovation (CONISOFT), IEEE, viewed 14 September 2019, <https://ieeexplore-ieee-org.ezp.lib.unimelb.edu.au/document/8645924?arnumber=8645924&SID=EBSCO:edseee>.
• Schutte, S, Ariyachandra, T & Frolick, M 2011, ‘Test-Driven Development of Data Warehouses’, International Journal of Business Intelligence Research, vol. 2, no. 1, pp. 64–73, viewed 8 September 2019.
