Empowering your data
Empowering your business
Why is Test Driven Development so Hard for Analytics Projects?
Phil Watt
Director
27th March 2020
phil.watt@elait.com
www.elait.com
Outline
• Introduction to the problem space
• Review of the literature
• Methodology
• Results
• Discussion and further work
Why is Test-Driven Development (TDD) so hard to adopt for Data and Analytics projects?
Current Academic Conclusions on TDD Challenges in Analytics
• Code focus vs data and information
• Volume x Variety
• Valid use case combinations can be virtually unlimited
• Testing continues in production
Current Academic Conclusions on TDD Challenges in Analytics (continued)
• Non-deterministic results
• Combined reasons drive poor project / developer discipline
• Combined reasons escalate cost
Deterministic vs Non-deterministic
Neural Network by sachin modgekar from the Noun Project
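To make the deterministic vs non-deterministic distinction concrete, here is a minimal Python sketch. The functions `total_revenue` and `bootstrap_mean` are hypothetical and do not come from the talk. The sketch shows two common ways a test can cope with randomness: fixing the random seed, or asserting within a tolerance instead of on an exact value.

```python
import random
import statistics


def total_revenue(rows):
    """Deterministic transformation: the same input always gives the same output."""
    return sum(r["amount"] for r in rows)


def bootstrap_mean(values, n_resamples, rng):
    """Non-deterministic estimate: the result depends on random resampling."""
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(values) for _ in values]
        means.append(statistics.fmean(sample))
    return statistics.fmean(means)


# Deterministic case: an exact-equality assertion is appropriate.
assert total_revenue([{"amount": 10}, {"amount": 5}]) == 15

# Non-deterministic case, option 1: fix the seed so repeated runs agree.
values = [1.0, 2.0, 3.0, 4.0, 5.0]
first = bootstrap_mean(values, 200, random.Random(42))
second = bootstrap_mean(values, 200, random.Random(42))
assert first == second

# Option 2: assert that the estimate lies within a tolerance of the true mean.
estimate = bootstrap_mean(values, 200, random.Random(7))
assert abs(estimate - 3.0) < 0.5
```

The tolerance of 0.5 is an arbitrary, generous choice for this sketch. In practice it would be derived from the expected variance of the estimate.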
Methodology
Mixed methods:
• Formal interviews
• Short online survey
• Synthesis and analysis
Who Responded to the Survey?
Survey Respondents that Recognised Each Challenge
[Bar chart, axis scale 0 to 16, with one bar per challenge:]
• Testing focused on data, not software
• Analytics data volumes drive a much larger testing context
• Limited valid testing scenarios for software testing, but unlimited for data
• Data Warehouse testing continues in production
• Analytics tests can be non-deterministic
• Combination of these reasons drives up TDD costs for analytics
• Combination of reasons can drive poor habits in developers or project managers
• Other challenges
Difficulty With Each Challenge
[Chart rating the difficulty of each challenge:]
• Testing focused on data, not software
• Analytics data volumes drive a much larger testing context
• Limited valid testing scenarios for software testing, but unlimited for data
• Data Warehouse testing continues in production
• Analytics tests can be non-deterministic
• Combination of these reasons drives up TDD costs for analytics
• Combination of reasons can drive poor habits in developers or project managers
Other challenges
• DWH can have complex logic related to delta processing, historical delta etc which makes it even more difficult to automate [testing]. Multiple source systems which can inject a different type of data due to their own changes make it even more complex.
• Capability to handle end-to-end complexity of development task is rare
• 1. People with a software background may not understand analytics. 2. DW bugs not fixed post deployment. 3. DW not tested for other purposes, e.g. Marketing analytics.
• Dev Teams / Leaders don't think of testing in this way
• Analysts and Data Scientists rarely have the personality or training to do TDD effectively.
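The respondent comment about delta processing can be illustrated with a small test-first sketch in Python. The function `compute_delta` and the row layout are hypothetical and kept deliberately simple. Real warehouse delta logic (history tables, late-arriving records, multiple sources) is considerably richer, which is part of the difficulty the respondent describes.

```python
def compute_delta(previous, current):
    """Classify changes between two snapshots keyed by primary key.

    Hypothetical helper for illustration only.
    """
    inserts = {k: v for k, v in current.items() if k not in previous}
    deletes = {k: v for k, v in previous.items() if k not in current}
    updates = {
        k: v for k, v in current.items()
        if k in previous and previous[k] != v
    }
    return {"insert": inserts, "update": updates, "delete": deletes}


def test_insert_update_delete():
    previous = {1: {"qty": 5}, 2: {"qty": 7}}
    current = {2: {"qty": 9}, 3: {"qty": 1}}
    delta = compute_delta(previous, current)
    assert delta["insert"] == {3: {"qty": 1}}
    assert delta["update"] == {2: {"qty": 9}}
    assert delta["delete"] == {1: {"qty": 5}}


def test_source_type_change_is_flagged():
    # A source system starts sending qty as text: the row surfaces as an update.
    previous = {1: {"qty": 5}}
    current = {1: {"qty": "5"}}
    assert compute_delta(previous, current)["update"] == {1: {"qty": "5"}}


test_insert_update_delete()
test_source_type_change_is_flagged()
```

In a TDD workflow the two test functions would be written before `compute_delta`. The second test captures the "different type of data" case from the comment as an explicit, repeatable check.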
About the interviewees
14 individuals:
• 12 with strong analytics domain experience
  • 4 Data Scientists
  • 2 Data Engineers
  • 4 Enterprise Analytics Architects
  • 2 Programme Managers
• 2 control interviews with software engineering backgrounds
5 industry sectors:
• 1 Public Sector
• 7 Professional Services (each with experience across multiple sectors)
• 2 Financial Services
• 1 Telco
• 1 Media
aws_transcribe_to_docx sample
Interview Highlights
• TDD advocates (n=4) stressed the importance of ‘habit forming’ to drive adoption and benefits realisation
• Everyone (n=14) recognised the theoretical benefits of TDD in Analytics
  • 8 said benefits were subject to the expected duration of a project, e.g. one-off pieces of work would not benefit
• Some disagreement between Data Scientists (n=4)
  • 1 agnostic
  • 2 relied on manual testing, arguing that their work was mainly one-off jobs
  • 1 strongly advocated forming good habits early, adding that test scope could be limited for one-off jobs, but was still needed
• Interviewee commentary about the Recognised Challenges (slide 4) was broadly in line with the survey results
  • All interviewees were invited to complete the survey; 10 responded
  • 8 survey respondents were not interviewed, but were invited to respond through my LinkedIn network
What can be done?
Synthesising the Results
# | Challenge | Category | Difficulty
1 | Analytics data volumes drive a much larger testing context | Data | Hard
2 | Data Warehouse testing continues in production | Data | Hard
3 | Upstream data changes impact on historical records | Data | Hard
4 | Limited valid testing scenarios for software testing, but unlimited for data | Data | Medium
5 | Testing focused on data, not software | Data | Medium
6 | Clear requirements | Organisation | Very Hard
7 | People with a software background may not understand analytics | Organisation | Very Hard
8 | Technical maturity of organisation | Organisation | Very Hard
9 | Combination of reasons can drive poor habits in developers or project managers | Organisation | Very Hard
10 | Combination of these reasons drives up TDD costs for analytics | Organisation | Medium
11 | Capability to handle end-to-end complexity of development task is rare | Organisation | Medium
12 | Developers, Data Scientists and Leaders don't think of testing in this way | Organisation | Medium
13 | Executive support for TDD | Organisation | Medium
14 | Project duration | Organisation | Easy
15 | Technical debt | Technical | Very Hard
16 | Analytics tests can be non-deterministic | Technical | Hard
17 | Modularity of code | Technical | Medium-Hard
Addressing the Data Challenges
• Volume x Variety
• Testing continues in production
• Upstream changes impact historical records
• Valid use case combinations can be virtually unlimited
• Code focus vs data and information
The Martial Arts by Anyssa Ferreira
from the Noun Project
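The claim that valid data combinations can be virtually unlimited comes down to simple arithmetic: the number of distinct rows is the product of the column domain sizes. The Python sketch below uses a made-up four-column table. The column names and values are illustrative assumptions, not data from the study.

```python
from itertools import product
from math import prod

# Hypothetical column domains for a small customer table (illustrative only).
domains = {
    "country": ["AU", "NZ", "UK", "US"],
    "segment": ["retail", "business", "enterprise"],
    "status": ["active", "dormant", "closed"],
    "channel": ["web", "branch", "phone", "partner"],
}

# The number of distinct value combinations is the product of the domain sizes.
combination_count = prod(len(values) for values in domains.values())

# Enumeration is feasible here, but every extra column multiplies the total.
all_combinations = list(product(*domains.values()))
assert len(all_combinations) == combination_count  # 4 * 3 * 3 * 4 = 144
```

Four low-cardinality columns already give 144 cases. Free-text, date or numeric columns make exhaustive enumeration impractical, so data tests generally have to sample or target specific classes of input.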
Addressing the Organisation Challenges
• Clear requirements
• People with a software background may not understand analytics
• Technical maturity of organisation
• Combined reasons escalate cost
• Combined reasons drive poor project / developer discipline
• Capability to handle end-to-end complexity of development is rare
• Devs, Data Scientists & Leaders don't think of testing in this way
• Executive support for TDD
• Project duration
computer code by Juicy Fish; maturity by Ralf Schmitzer; skills by Rflor; all from the Noun Project
Addressing the Technical Challenges
• Non-deterministic results
• Modularity of code
• Technical debt
The Martial Arts by Anyssa Ferreira
from the Noun Project
Next Steps
Further work
• More interviews, more survey responses, more data
• A range of Test Automation case studies over a matrix of scenarios:
  • Where TDD is used extensively
  • Where other test automation is used instead of TDD
  • Where manual testing is used
  • For project durations that are short, medium or long
  • For systems that are simple through to complex
• Analysis of the impact of other factors that could drive productivity, cycle time and quality:
  • Frameworks
  • Low-code development tools
  • Open Source vs proprietary tools
I need your
help
With a 10-minute survey
https://qrco.de/DATAENGRES
We’re hiring!
For more information or to connect on social
media:
Phil Watt
phil.watt@elait.com
https://qrco.de/philwatt
More information:
Recruitment: phil.watt@elait.com
Connect on social media: https://qrco.de/philwatt
Complete the survey:
https://qrco.de/DATAENGRES
References
• Collier, KW 2011, ‘Chapter 7. Test-Driven Data Warehouse Development’, in Agile Analytics: A Value-Driven Approach to Business
Intelligence and Data Warehousing, Addison-Wesley Professional, viewed 8 September 2019, <https://learning-oreilly-
com.ezp.lib.unimelb.edu.au/library/view/agile-analytics-a/9780321669575/ch07.html>.
• Dzakovic, M 2016, ‘Industrial Application of Automated Regression Testing in Test-Driven ETL Development - IEEE Conference
Publication’, in 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), Institute of Electrical and
Electronics Engineers, viewed 8 September 2019, <https://ieeexplore-ieee-
org.ezp.lib.unimelb.edu.au/document/7816512?arnumber=7816512&SID=EBSCO:edseee>.
• Golfarelli, M & Rizzi, S 2009, ‘A comprehensive approach to data warehouse testing’, Proceeding of the ACM twelfth international
workshop on Data warehousing and OLAP - DOLAP ’09, viewed 7 September 2019, <https://dl-acm-
org.ezp.lib.unimelb.edu.au/citation.cfm?id=1651295>.
• Ivo, AAS, Guerra, EM, Porto, SM, Choma, J & Quiles, MG 2018, ‘An approach for applying Test-Driven Development (TDD) in the
development of randomized algorithms’, Journal of Software Engineering Research and Development, vol. 6, no. 1, viewed 13
September 2019, <https://doaj.org/article/8be2f4e3709747e68c04537838b3b314?>.
• Krawatzeck, R, Tetzner, A & Dinter, B 2015, An Evaluation of Open Source Unit Testing Tools Suitable for Data Warehouse Testing, p.
22.
• Rencberoglu, E 2019, ‘Fundamental Techniques of Feature Engineering for Machine Learning’, Towards Data Science, April, Towards
Data Science, viewed 28 September 2019, <https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114>.
• Sambinelli, F, Ursini, EL, Borges, MAF & Martins, PS 2018, ‘Modeling and Performance Analysis of Scrumban with Test-Driven
Development Using Discrete Event and Fuzzy Logic - IEEE Conference Publication’, in 2018 6th International Conference in Software
Engineering Research and Innovation (CONISOFT), IEEE, viewed 14 September 2019, <https://ieeexplore-ieee-
org.ezp.lib.unimelb.edu.au/document/8645924?arnumber=8645924&SID=EBSCO:edseee>.
• Schutte, S, Ariyachandra, T & Frolick, M 2011, ‘Test-Driven Development of Data Warehouses’, International Journal of Business
Intelligence Research, vol. 2, no. 1, pp. 64–73, viewed 8 September 2019,
<https://pdfs.semanticscholar.org/c3e1/575409cbaa9e7f4c07201de5774f5c0181f9.pdf>.
Problem statement
• Test Driven Development (TDD) is a common pattern in
software engineering that helps reduce cycle time, improve
code quality and reduce production defects.
• Within data engineering and analytics projects, TDD is held
up as best practice in development and maintenance
lifecycle phases.
• Many organisations do not see the promised benefits of
TDD in an analytics context, prompting the question:
• Why is it so hard to effectively implement
Test Driven Development in an analytics
platform?

More Related Content

What's hot

14- Tumbling Window Trigger dependency in Azure Data Factory.pptx
14- Tumbling Window Trigger dependency in Azure Data Factory.pptx14- Tumbling Window Trigger dependency in Azure Data Factory.pptx
14- Tumbling Window Trigger dependency in Azure Data Factory.pptx
BRIJESH KUMAR
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
James Serra
 
Summary introduction to data engineering
Summary introduction to data engineeringSummary introduction to data engineering
Summary introduction to data engineering
Novita Sari
 
Azure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudAzure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the Cloud
Mark Kromer
 
Screw DevOps, Let's Talk DataOps
Screw DevOps, Let's Talk DataOpsScrew DevOps, Let's Talk DataOps
Screw DevOps, Let's Talk DataOps
Kellyn Pot'Vin-Gorman
 
Using Redash for SQL Analytics on Databricks
Using Redash for SQL Analytics on DatabricksUsing Redash for SQL Analytics on Databricks
Using Redash for SQL Analytics on Databricks
Databricks
 
SQL vs NoSQL | MySQL vs MongoDB Tutorial | Edureka
SQL vs NoSQL | MySQL vs MongoDB Tutorial | EdurekaSQL vs NoSQL | MySQL vs MongoDB Tutorial | Edureka
SQL vs NoSQL | MySQL vs MongoDB Tutorial | Edureka
Edureka!
 
Building the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump InBuilding the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump In
SnapLogic
 
Elasticsearch in Netflix
Elasticsearch in NetflixElasticsearch in Netflix
Elasticsearch in Netflix
Danny Yuan
 
Architecting Agile Data Applications for Scale
Architecting Agile Data Applications for ScaleArchitecting Agile Data Applications for Scale
Architecting Agile Data Applications for Scale
Databricks
 
Webinar: When to Use MongoDB
Webinar: When to Use MongoDBWebinar: When to Use MongoDB
Webinar: When to Use MongoDB
MongoDB
 
Testing with JUnit 5 and Spring
Testing with JUnit 5 and SpringTesting with JUnit 5 and Spring
Testing with JUnit 5 and Spring
VMware Tanzu
 
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data PlatformsData Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms
Anant Corporation
 
ALTERYX TOOL
ALTERYX TOOLALTERYX TOOL
ALTERYX TOOL
Sagnik Banerjee
 
Streamline Data Governance with Egeria: The Industry's First Open Metadata St...
Streamline Data Governance with Egeria: The Industry's First Open Metadata St...Streamline Data Governance with Egeria: The Industry's First Open Metadata St...
Streamline Data Governance with Egeria: The Industry's First Open Metadata St...
DataWorks Summit
 
Introduction to PolyBase
Introduction to PolyBaseIntroduction to PolyBase
Introduction to PolyBase
James Serra
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
Divij Sehgal
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
Databricks
 
Enabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data VirtualizationEnabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data Virtualization
Denodo
 
Spring Web Service, Spring Integration and Spring Batch
Spring Web Service, Spring Integration and Spring BatchSpring Web Service, Spring Integration and Spring Batch
Spring Web Service, Spring Integration and Spring Batch
Eberhard Wolff
 

What's hot (20)

14- Tumbling Window Trigger dependency in Azure Data Factory.pptx
14- Tumbling Window Trigger dependency in Azure Data Factory.pptx14- Tumbling Window Trigger dependency in Azure Data Factory.pptx
14- Tumbling Window Trigger dependency in Azure Data Factory.pptx
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
Summary introduction to data engineering
Summary introduction to data engineeringSummary introduction to data engineering
Summary introduction to data engineering
 
Azure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudAzure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the Cloud
 
Screw DevOps, Let's Talk DataOps
Screw DevOps, Let's Talk DataOpsScrew DevOps, Let's Talk DataOps
Screw DevOps, Let's Talk DataOps
 
Using Redash for SQL Analytics on Databricks
Using Redash for SQL Analytics on DatabricksUsing Redash for SQL Analytics on Databricks
Using Redash for SQL Analytics on Databricks
 
SQL vs NoSQL | MySQL vs MongoDB Tutorial | Edureka
SQL vs NoSQL | MySQL vs MongoDB Tutorial | EdurekaSQL vs NoSQL | MySQL vs MongoDB Tutorial | Edureka
SQL vs NoSQL | MySQL vs MongoDB Tutorial | Edureka
 
Building the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump InBuilding the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump In
 
Elasticsearch in Netflix
Elasticsearch in NetflixElasticsearch in Netflix
Elasticsearch in Netflix
 
Architecting Agile Data Applications for Scale
Architecting Agile Data Applications for ScaleArchitecting Agile Data Applications for Scale
Architecting Agile Data Applications for Scale
 
Webinar: When to Use MongoDB
Webinar: When to Use MongoDBWebinar: When to Use MongoDB
Webinar: When to Use MongoDB
 
Testing with JUnit 5 and Spring
Testing with JUnit 5 and SpringTesting with JUnit 5 and Spring
Testing with JUnit 5 and Spring
 
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data PlatformsData Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms
 
ALTERYX TOOL
ALTERYX TOOLALTERYX TOOL
ALTERYX TOOL
 
Streamline Data Governance with Egeria: The Industry's First Open Metadata St...
Streamline Data Governance with Egeria: The Industry's First Open Metadata St...Streamline Data Governance with Egeria: The Industry's First Open Metadata St...
Streamline Data Governance with Egeria: The Industry's First Open Metadata St...
 
Introduction to PolyBase
Introduction to PolyBaseIntroduction to PolyBase
Introduction to PolyBase
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 
Enabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data VirtualizationEnabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data Virtualization
 
Spring Web Service, Spring Integration and Spring Batch
Spring Web Service, Spring Integration and Spring BatchSpring Web Service, Spring Integration and Spring Batch
Spring Web Service, Spring Integration and Spring Batch
 

Similar to Why is TDD so hard for Data Engineering and Analytics Projects?

Why is Test Driven Development for Analytics or Data Projects so Hard?
Why is Test Driven Development for Analytics or Data Projects so Hard?Why is Test Driven Development for Analytics or Data Projects so Hard?
Why is Test Driven Development for Analytics or Data Projects so Hard?
Phil Watt
 
MTech- Viva_Voce
MTech- Viva_VoceMTech- Viva_Voce
MTech- Viva_Voce
Vijayananda Mohire
 
fe.docx
fe.docxfe.docx
fe.docx
lmelaine
 
Integrating the users logic into Requirements Engineering
Integrating the users logic into Requirements EngineeringIntegrating the users logic into Requirements Engineering
Integrating the users logic into Requirements Engineering
Sofia Ouhbi
 
Datastage 4.5 years Exp at IBM INDIA PVT
Datastage 4.5 years Exp at IBM INDIA PVTDatastage 4.5 years Exp at IBM INDIA PVT
Datastage 4.5 years Exp at IBM INDIA PVT
Prathapreddy Sareddy
 
Comparison between Test-Driven Development and Conventional Development: A Ca...
Comparison between Test-Driven Development and Conventional Development: A Ca...Comparison between Test-Driven Development and Conventional Development: A Ca...
Comparison between Test-Driven Development and Conventional Development: A Ca...
IJERA Editor
 
Research-Based Innovation with Industry: Project Experience and Lessons Learned
Research-Based Innovation with Industry: Project Experience and Lessons LearnedResearch-Based Innovation with Industry: Project Experience and Lessons Learned
Research-Based Innovation with Industry: Project Experience and Lessons Learned
Lionel Briand
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data Science
Pouria Amirian
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data Science
Pouria Amirian
 
P44098087
P44098087P44098087
P44098087
IJERA Editor
 
Challenges of Executing AI
Challenges of Executing AIChallenges of Executing AI
Challenges of Executing AI
Dr. Umesh Rao.Hodeghatta
 
IT PROJECT SHOWSTOPPER FRAMEWORK: THE VIEW OF PRACTITIONERS
IT PROJECT SHOWSTOPPER FRAMEWORK: THE VIEW OF PRACTITIONERSIT PROJECT SHOWSTOPPER FRAMEWORK: THE VIEW OF PRACTITIONERS
IT PROJECT SHOWSTOPPER FRAMEWORK: THE VIEW OF PRACTITIONERS
ijseajournal
 
Importance of Process Mining for Big Data Requirements Engineering
Importance of Process Mining for Big Data Requirements EngineeringImportance of Process Mining for Big Data Requirements Engineering
Importance of Process Mining for Big Data Requirements Engineering
AIRCC Publishing Corporation
 
IMPORTANCE OF PROCESS MINING FOR BIG DATA REQUIREMENTS ENGINEERING
IMPORTANCE OF PROCESS MINING FOR BIG DATA REQUIREMENTS ENGINEERINGIMPORTANCE OF PROCESS MINING FOR BIG DATA REQUIREMENTS ENGINEERING
IMPORTANCE OF PROCESS MINING FOR BIG DATA REQUIREMENTS ENGINEERING
ijcsit
 
Importance of Process Mining for Big Data Requirements Engineering
Importance of Process Mining for Big Data Requirements EngineeringImportance of Process Mining for Big Data Requirements Engineering
Importance of Process Mining for Big Data Requirements Engineering
AIRCC Publishing Corporation
 
Software Defect Prediction Using Local and Global Analysis
Software Defect Prediction Using Local and Global AnalysisSoftware Defect Prediction Using Local and Global Analysis
Software Defect Prediction Using Local and Global Analysis
Editor IJMTER
 
Software Architecture Evaluation: A Systematic Mapping Study
Software Architecture Evaluation: A Systematic Mapping StudySoftware Architecture Evaluation: A Systematic Mapping Study
Software Architecture Evaluation: A Systematic Mapping Study
Sofia Ouhbi
 

Similar to Why is TDD so hard for Data Engineering and Analytics Projects? (20)

Why is Test Driven Development for Analytics or Data Projects so Hard?
Why is Test Driven Development for Analytics or Data Projects so Hard?Why is Test Driven Development for Analytics or Data Projects so Hard?
Why is Test Driven Development for Analytics or Data Projects so Hard?
 
MTech- Viva_Voce
MTech- Viva_VoceMTech- Viva_Voce
MTech- Viva_Voce
 
fe.docx
fe.docxfe.docx
fe.docx
 
tem7
tem7tem7
tem7
 
Integrating the users logic into Requirements Engineering
Integrating the users logic into Requirements EngineeringIntegrating the users logic into Requirements Engineering
Integrating the users logic into Requirements Engineering
 
Datastage 4.5 years Exp at IBM INDIA PVT
Datastage 4.5 years Exp at IBM INDIA PVTDatastage 4.5 years Exp at IBM INDIA PVT
Datastage 4.5 years Exp at IBM INDIA PVT
 
Resume
ResumeResume
Resume
 
Comparison between Test-Driven Development and Conventional Development: A Ca...
Comparison between Test-Driven Development and Conventional Development: A Ca...Comparison between Test-Driven Development and Conventional Development: A Ca...
Comparison between Test-Driven Development and Conventional Development: A Ca...
 
Research-Based Innovation with Industry: Project Experience and Lessons Learned
Research-Based Innovation with Industry: Project Experience and Lessons LearnedResearch-Based Innovation with Industry: Project Experience and Lessons Learned
Research-Based Innovation with Industry: Project Experience and Lessons Learned
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data Science
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data Science
 
P44098087
P44098087P44098087
P44098087
 
Challenges of Executing AI
Challenges of Executing AIChallenges of Executing AI
Challenges of Executing AI
 
Meha_Ghadge
Meha_GhadgeMeha_Ghadge
Meha_Ghadge
 
IT PROJECT SHOWSTOPPER FRAMEWORK: THE VIEW OF PRACTITIONERS
IT PROJECT SHOWSTOPPER FRAMEWORK: THE VIEW OF PRACTITIONERSIT PROJECT SHOWSTOPPER FRAMEWORK: THE VIEW OF PRACTITIONERS
IT PROJECT SHOWSTOPPER FRAMEWORK: THE VIEW OF PRACTITIONERS
 
Importance of Process Mining for Big Data Requirements Engineering
Importance of Process Mining for Big Data Requirements EngineeringImportance of Process Mining for Big Data Requirements Engineering
Importance of Process Mining for Big Data Requirements Engineering
 
IMPORTANCE OF PROCESS MINING FOR BIG DATA REQUIREMENTS ENGINEERING
IMPORTANCE OF PROCESS MINING FOR BIG DATA REQUIREMENTS ENGINEERINGIMPORTANCE OF PROCESS MINING FOR BIG DATA REQUIREMENTS ENGINEERING
IMPORTANCE OF PROCESS MINING FOR BIG DATA REQUIREMENTS ENGINEERING
 
Importance of Process Mining for Big Data Requirements Engineering
Importance of Process Mining for Big Data Requirements EngineeringImportance of Process Mining for Big Data Requirements Engineering
Importance of Process Mining for Big Data Requirements Engineering
 
Software Defect Prediction Using Local and Global Analysis
Software Defect Prediction Using Local and Global AnalysisSoftware Defect Prediction Using Local and Global Analysis
Software Defect Prediction Using Local and Global Analysis
 
Software Architecture Evaluation: A Systematic Mapping Study
Software Architecture Evaluation: A Systematic Mapping StudySoftware Architecture Evaluation: A Systematic Mapping Study
Software Architecture Evaluation: A Systematic Mapping Study
 

Recently uploaded

一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 

Recently uploaded (20)

一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单

Why is TDD so hard for Data Engineering and Analytics Projects?

  • 1. Empowering your data, empowering your business. Why is Test Driven Development so Hard for Analytics Projects? Phil Watt, Director, 27th March 2020. phil.watt@elait.com, www.elait.com
  • 2. Outline: Introduction to the problem space; Review of the literature; Methodology; Results; Discussion and further work
  • 3. Why is Test-Driven Development (TDD) so hard to adopt for Data and Analytics projects?
  • 4. Current Academic Conclusions on TDD Challenges in Analytics: Code focus vs data and information focus; Volume x Variety; Valid use case combinations can be virtually unlimited; Testing continues in production
  • 5. Current Academic Conclusions on TDD Challenges in Analytics: Non-deterministic results; Combined reasons drive poor project / developer discipline; Combined reasons escalate cost
  • 6. Deterministic vs Non-deterministic (image: Neural Network by sachin modgekar from the Noun Project)
  • 7. Methodology: Mixed methods; Formal interviews; Short online survey; Synthesis and analysis
  • 8. Who Responded to the Survey?
  • 9. Survey Respondents that Recognised Each Challenge (bar chart, axis 0 to 16). Challenges shown: Testing focused on data, not software; Analytics data volumes drive much larger testing context; Limited valid testing scenarios for software testing, but unlimited for data; Data Warehouse testing continues in production; Analytics tests can be non-deterministic; Combination of these reasons drives up TDD costs for analytics; Combination of reasons can drive poor habits in developers or project managers; Other challenges
  • 10. Difficulty With Each Challenge (chart). Challenges shown: Testing focused on data, not software; Analytics data volumes drive much larger testing context; Limited valid testing scenarios for software testing, but unlimited for data; Data Warehouse testing continues in production; Analytics tests can be non-deterministic; Combination of these reasons drives up TDD costs for analytics; Combination of reasons can drive poor habits in developers or project managers
  • 11. Other challenges: • DWH can have complex logic related to delta processing, historical delta, etc., which makes it even more difficult to automate [testing]. Multiple source systems, which can inject a different type of data due to their own changes, make it even more complex. • Capability to handle end-to-end complexity of development task is rare. • (1) People with a software background may not understand analytics. (2) DW bugs not fixed post deployment. (3) DW not tested for other purposes, e.g. marketing analytics. • Dev teams / leaders don't think of testing in this way. • Analysts and Data Scientists rarely have the personality or training to do TDD effectively.
  • 12. About the interviewees: 14 individuals. 12 with strong analytics domain experience (4 Data Scientists, 2 Data Engineers, 4 Enterprise Analytics Architects, 2 Programme Managers); 2 control interviews with software engineering backgrounds. 5 industry sectors: 1 Public Sector, 7 Professional Services (each with experience across multiple sectors), 2 Financial Services, 1 Telco, 1 Media.
  • 14. Interview Highlights: TDD advocates (n=4) stressed the importance of ‘habit forming’ to drive adoption and benefits realisation. Everyone (n=14) recognised the theoretical benefits of TDD in Analytics. 8 said benefits were subject to the expected duration of a project, e.g. one-off pieces of work would not benefit. Some disagreement between Data Scientists (n=4): 1 agnostic; 2 relied on manual testing, arguing that their work was mainly one-off jobs; 1 strongly advocated forming good habits early, adding that test scope could be limited for one-off jobs, but was still needed. Interviewee commentary about the Recognised Challenges (slide 4) was broadly in line with the survey results. All interviewees were invited to complete the survey; 10 responded. 8 survey respondents were not interviewed, but were invited to respond through my LinkedIn network.
  • 16. Synthesising the Results
      # | Challenge | Category | Difficulty
      1 | Analytics data volumes drive much larger testing context | Data | Hard
      2 | Data Warehouse testing continues in production | Data | Hard
      3 | Upstream data changes impact on historical records | Data | Hard
      4 | Limited valid testing scenarios for software testing, but unlimited for data | Data | Medium
      5 | Testing focused on data, not software | Data | Medium
      6 | Clear requirements | Organisation | Very Hard
      7 | People with a software background may not understand analytics | Organisation | Very Hard
      8 | Technical maturity of organisation | Organisation | Very Hard
      9 | Combination of reasons can drive poor habits in developers or project managers | Organisation | Very Hard
      10 | Combination of these reasons drives up TDD costs for analytics | Organisation | Medium
      11 | Capability to handle end-to-end complexity of development task is rare | Organisation | Medium
      12 | Developers, Data Scientists and Leaders don't think of testing in this way | Organisation | Medium
      13 | Executive support for TDD | Organisation | Medium
      14 | Project duration | Organisation | Easy
      15 | Technical debt | Technical | Very Hard
      16 | Analytics tests can be non-deterministic | Technical | Hard
      17 | Modularity of code | Technical | Medium-Hard
  • 17. Addressing the Data Challenges: Volume x Variety; Testing continues in production; Upstream changes impact historical records; Valid use case combinations can be virtually unlimited; Code focus vs data and information focus (image: The Martial Arts by Anyssa Ferreira from the Noun Project)
  • 18. Addressing the Organisation Challenges: Clear requirements; People with a software background may not understand analytics; Technical maturity of organisation; Combined reasons escalate cost; Combined reasons drive poor project / developer discipline; Capability to handle end-to-end complexity of development is rare; Devs, Data Scientists & Leaders don't think of testing in this way; Executive support for TDD; Project duration (images: computer code by Juicy Fish; maturity by Ralf Schmitzer; skills by Rflor; all from the Noun Project)
  • 19. Addressing the Technical Challenges: Non-deterministic results; Modularity of code; Technical debt (image: The Martial Arts by Anyssa Ferreira from the Noun Project)
  • 21. Further work: More interviews, more survey responses, more data. A range of Test Automation case studies over a matrix of scenarios: where TDD is used extensively; where other test automation is used instead of TDD; where manual testing is used; for project durations that are short, medium or long; for systems that are simple through to complex. Analysis of the impact of other factors that could drive productivity, cycle time and quality: frameworks; low-code development tools; Open Source vs proprietary tools.
  • 22. I need your help with a 10-minute survey: https://qrco.de/DATAENGRES
  • 23. We’re hiring! For more information or to connect on social media: Phil Watt, phil.watt@elait.com, https://qrco.de/philwatt
  • 24. More information: Recruitment: phil.watt@elait.com; Connect on social media: https://qrco.de/philwatt; Complete the survey: https://qrco.de/DATAENGRES. Phil Watt, Director, 26th March 2020.
  • 25. References
      • Collier, KW 2011, ‘Chapter 7. Test-Driven Data Warehouse Development’, in Agile Analytics: A Value-Driven Approach to Business Intelligence and Data Warehousing, Addison-Wesley Professional, viewed 8 September 2019, <https://learning-oreilly-com.ezp.lib.unimelb.edu.au/library/view/agile-analytics-a/9780321669575/ch07.html>.
      • Dzakovic, M 2016, ‘Industrial Application of Automated Regression Testing in Test-Driven ETL Development - IEEE Conference Publication’, in 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), Institute of Electrical and Electronics Engineers, viewed 8 September 2019, <https://ieeexplore-ieee-org.ezp.lib.unimelb.edu.au/document/7816512?arnumber=7816512&SID=EBSCO:edseee>.
      • Golfarelli, M & Rizzi, S 2009, ‘A comprehensive approach to data warehouse testing’, Proceeding of the ACM twelfth international workshop on Data warehousing and OLAP - DOLAP ’09, viewed 7 September 2019, <https://dl-acm-org.ezp.lib.unimelb.edu.au/citation.cfm?id=1651295>.
      • Ivo, AAS, Guerra, EM, Porto, SM, Choma, J & Quiles, MG 2018, ‘An approach for applying Test-Driven Development (TDD) in the development of randomized algorithms’, Journal of Software Engineering Research and Development, vol. 6, no. 1, viewed 13 September 2019, <https://doaj.org/article/8be2f4e3709747e68c04537838b3b314?>.
      • Krawatzeck, R, Tetzner, A & Dinter, B 2015, An Evaluation of Open Source Unit Testing Tools Suitable for Data Warehouse Testing, p. 22.
      • Rencberoglu, E 2019, ‘Fundamental Techniques of Feature Engineering for Machine Learning’, Towards Data Science, April, Towards Data Science, viewed 28 September 2019, <https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114>.
      • Sambinelli, F, Ursini, EL, Borges, MAF & Martins, PS 2018, ‘Modeling and Performance Analysis of Scrumban with Test-Driven Development Using Discrete Event and Fuzzy Logic - IEEE Conference Publication’, in 2018 6th International Conference in Software Engineering Research and Innovation (CONISOFT), IEEE, viewed 14 September 2019, <https://ieeexplore-ieee-org.ezp.lib.unimelb.edu.au/document/8645924?arnumber=8645924&SID=EBSCO:edseee>.
      • Schutte, S, Ariyachandra, T & Frolick, M 2011, ‘Test-Driven Development of Data Warehouses’, International Journal of Business Intelligence Research, vol. 2, no. 1, pp. 64–73, viewed 8 September 2019, <https://pdfs.semanticscholar.org/c3e1/575409cbaa9e7f4c07201de5774f5c0181f9.pdf>.
  • 26. Problem statement: • Test Driven Development (TDD) is a common pattern in software engineering that helps reduce cycle time, improve code quality and reduce production defects. • Within data engineering and analytics projects, TDD is held up as best practice in development and maintenance lifecycle phases. • Many organisations do not see the promised benefits of TDD in an analytics context, prompting the question: why is it so hard to effectively implement Test Driven Development in an analytics platform?
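
Slide 11 names delta processing as one reason data warehouse tests are hard to automate. As a rough illustration only, the sketch below writes the tests first for a small delta merge. The function `apply_delta`, the `id` key and the `deleted` flag are hypothetical names chosen for this example; they do not come from the talk or from any particular tool.

```python
def apply_delta(current, delta):
    """Hypothetical upsert: merge changed rows into a snapshot keyed by 'id'.

    Rows are plain dicts; a delta row with 'deleted': True removes that key.
    """
    merged = {row["id"]: row for row in current}
    for row in delta:
        if row.get("deleted"):
            merged.pop(row["id"], None)
        else:
            merged[row["id"]] = row
    return sorted(merged.values(), key=lambda row: row["id"])


CURRENT = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Bob"},
    {"id": 3, "name": "Cy"},
]
DELTA = [
    {"id": 2, "name": "Robert"},  # update an existing row
    {"id": 3, "deleted": True},   # delete an existing row
    {"id": 4, "name": "Dee"},     # insert a new row
]
EXPECTED = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Robert"},
    {"id": 4, "name": "Dee"},
]


def test_apply_delta_updates_inserts_and_deletes():
    assert apply_delta(CURRENT, DELTA) == EXPECTED


def test_apply_delta_is_idempotent():
    # Replaying the same delta (for example after a retried load)
    # should leave the snapshot unchanged.
    once = apply_delta(CURRENT, DELTA)
    assert apply_delta(once, DELTA) == once
```

The fixture here is three rows; the challenge the slides describe is that real source systems produce far more combinations than a hand-written fixture can cover, so a test like this checks the merge logic rather than the data itself.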

Editor's Notes

  1. TDD is an established best practice in software development, promising benefits such as: reduced cycle time; improved developer productivity; reduced production defects. Observation that analytics and data projects mostly do not use TDD, based on: analytics/data management consulting and delivery experience in 19 countries and 5 continents; working across hundreds of projects in this domain. Concept validated with eight informal interviews, whose purpose was to shape research direction before formal data gathering began. Interviews with analytics leaders across 5 industry segments: 2 Chief Data Officers; 2 Enterprise Architects managing large analytics programmes; 2 Heads of data engineering; 2 Analytics programme leaders in large enterprises; 1 Advanced Analytics practice leader in a large professional services organisation.
  2. Some models are stochastic, while others are deterministic (such as linear regression). Training data is not production data. Data discovery: you don’t know what you are going to find, so how can you tell if you calculate the right answer?
  3. Mixed methods. Formal interviews: a 6-page briefing pack supplied to interviewees two weeks before the interview; audio or video recorded, then transcribed. Short online survey: invitation only; two questions: (1) Which of the challenges in the previous slide do you recognise? (2) How difficult were these challenges to overcome? Synthesis and analysis.
  4. There is strong agreement between survey respondents and interviewees that TDD for analytics is different from, and more complex than, TDD for traditional software engineering. Although opinions vary on why, some core reasons were identified. There is some support for the idea that TDD is best applied to longer-term projects but should be avoided for projects of short duration, like the heuristic model from Sambinelli et al. (2018) for general software projects. A minority of interviewees stress that TDD is always the right thing for analytics, but that success depends upon: early, strong habit forming around TDD practices; careful design of the scope of TDD. I find this minority view compelling, but this may be confirmation bias on my part.
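
Note 2 contrasts stochastic and deterministic models. Two common ways to write a test for stochastic code are to pin the random seed, so the result is reproducible, or to assert that the result falls within a statistical tolerance. The sketch below shows both using only the Python standard library. The function `estimate_mean` is a hypothetical stand-in for a stochastic model, not something taken from the talk.

```python
import random
import statistics


def estimate_mean(sample_size, rng):
    """Hypothetical stochastic 'model': the mean of draws from Uniform(0, 1)."""
    return statistics.fmean([rng.random() for _ in range(sample_size)])


def test_reproducible_with_fixed_seed():
    # Same seed, same draws: the stochastic code is deterministic for the test.
    first = estimate_mean(1_000, random.Random(42))
    second = estimate_mean(1_000, random.Random(42))
    assert first == second


def test_within_statistical_tolerance():
    # No seed is fixed here, so no exact value can be pinned.
    # Uniform(0, 1) has mean 0.5 and standard deviation of about 0.2887,
    # so the standard error for n = 10_000 is about 0.0029.
    # A band of 0.02 is roughly 7 standard errors wide on each side.
    estimate = estimate_mean(10_000, random.Random())
    assert abs(estimate - 0.5) < 0.02
```

The fixed-seed test guards against accidental changes in behaviour, while the tolerance test checks that the answer is plausible. Choosing the tolerance is itself a design decision: too tight and the test fails at random, too loose and it detects nothing.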