GLOBAL SOFTWARE CONSULTANCY
AGILE, QA AND DATA PROJECTS
Anjuman Saiyed, Pranesh Gaikwad
Credits: Balvinder Khurana
© 2020 ThoughtWorks
Quality Analyst
ANJUMAN SAIYED
Data Engineer
BALVINDER KHURANA
Quality Analyst
PRANESH GAIKWAD
What's all the fuss about?
In this talk we will share insights from the data projects Pranesh and I have worked on as Quality Analysts.

We will briefly explore how the Agile framework applies to data projects, and the challenges within.

We will do so through a case study, and along the way discuss how QA specifically differs on data projects.
MEET SALLY… OUR CLIENT
Sally Stephen, Price Analyst

Sally's goals:
- I want price recommendations to be generated automatically
- I want to publish new prices to stores
- I want to optimize profit through the newly recommended prices
- I want to periodically review the recommended prices
- I want customers to continue buying the products at the new prices

Her day-to-day work:
- Review historical prices on products to analyse product performance
- Analyse prices with respect to company strategy
- Suggest prices after analysing competitor data
- Manual data sorting and validation
- Mathematical derivation for every price change
- Filter data and generate reports
Vision Statement: To provide the right price to the right customer at the right time and the right place.

Objective: The business wants to increase its profitability by pricing its products more intelligently based on external factors.
DATA SCIENCE
Data science is a blend of various tools, algorithms and machine learning principles, with the goal of discovering hidden patterns in raw data. It is primarily used to make decisions and predictions, using predictive causal analytics and prescriptive analytics (predictive analytics plus decision science and machine learning).
(Edureka, 2019)
DATA ENGINEERING
Data engineering is a set of operations aimed at creating interfaces and mechanisms for the flow of, and access to, information. It takes dedicated specialists, data engineers, to maintain data so that it remains available and usable by others.
(AltexSoft, 2019)
PRICE RECOMMENDATION PIPELINE
MAPPING IT TO DATA TERMINOLOGIES
DATA WORKFLOW PIPELINE
Agile Story Life Cycle
Agile Feedback Loop: Iteration Planning Meeting → Analysis & Scoping → Story Kick-Offs → In Development → Desk Check → Quality Analysis → Show Case → Sign Off → Deployment
A DE Story Life Cycle?
Example story: "Provide historical product prices"
Iteration Planning Meeting → Analysis & Scoping → Story Kick-Offs → In Development → Desk Check → Quality Analysis → Show Case → Sign Off → Deployment
Data Engineering Stages
Iteration Planning Meeting → Data Mapping → Data Modeling/Architecture → Data Acquisition → Data Transformation → Data Validation → Data Quality → Sign Off → Deployment
Data Science Story Life Cycle?
Example story: "Analyse demand and price relationship"
Iteration Planning Meeting
Analysis & Scoping: analyse the in-scope stories for algorithm / business-logic development
Story Kick-Offs
In Development: data scientists build the actual algorithm / business logic
Desk Check
Quality Analysis: quality-control checks on transformed data and analysis of the algorithm's output
Show Case
Sign Off: signing off the stories / business logic with a go-ahead flag
Deployment: promoting the algorithm / business logic to production
Data Science Stages
Iteration Planning Meeting → Algorithm Development → Result Analysis → Quality Analysis → Sign Off → Deployment
Literature Review
Data Analysis
QA on Data
Projects
DATA WORKFLOW PIPELINE
DATA CONTRACT VALIDATION
1. Data to be consumed is from the expected sources (environment-specific data)
2. Availability of production-like data
3. Availability of different inputs: files, events, etc.
4. Presence of mandatory attributes in the inputs

DATA INGESTION
1. Data is pushed to the correct underlying storage locations
2. Data is ingested as filtered data subsets based on the algorithm's requirements
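Contract checks like these can be automated as lightweight assertions at the pipeline boundary. A minimal sketch; the attribute names below are hypothetical examples, not the client's actual schema:

```python
# Minimal data-contract check: every incoming record must carry the
# mandatory attributes before it is allowed into the pipeline.
# (Attribute names are illustrative, not the real schema.)
MANDATORY_ATTRIBUTES = {"product_id", "price", "timestamp", "source"}

def validate_contract(records):
    """Partition input records into (valid, rejected-with-missing-fields)."""
    valid, rejected = [], []
    for record in records:
        missing = MANDATORY_ATTRIBUTES - record.keys()
        if missing:
            rejected.append((record, sorted(missing)))
        else:
            valid.append(record)
    return valid, rejected

valid, rejected = validate_contract([
    {"product_id": 1, "price": 9.99, "timestamp": "2020-01-01", "source": "pos"},
    {"product_id": 2, "price": 4.50},  # missing timestamp and source
])
```

In a real pipeline the rejected partition would be routed to a quarantine location and surfaced to the QA, rather than silently dropped.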
DATA QUALITY
1. Comparing source data with the data pushed into the system
2. Validating that data is pushed to the correct locations
3. Validating data-ingestion semantics such as at-least-once or exactly-once delivery
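Two of these checks can be sketched directly: a source-versus-ingested count comparison, and a duplicate-key scan that catches at-least-once deliveries when exactly-once semantics are expected. The record shape and key name are assumptions for illustration:

```python
# Data-quality sketch: (a) source and ingested row counts should match,
# (b) under exactly-once semantics, no record key may appear twice.
from collections import Counter

def counts_match(source_rows, ingested_rows):
    return len(source_rows) == len(ingested_rows)

def find_duplicates(ingested_rows, key="product_id"):
    counts = Counter(row[key] for row in ingested_rows)
    return [k for k, n in counts.items() if n > 1]

source = [{"product_id": 1}, {"product_id": 2}]
ingested = [{"product_id": 1}, {"product_id": 2}, {"product_id": 2}]  # re-delivered
```

Here the duplicate of product 2 would flag either a re-delivery (acceptable under at-least-once) or a bug (under exactly-once), which is exactly the distinction the QA has to make.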
DATA TRANSFORMATION
1. All inputs needed for the algorithm are transformed as expected
2. Data is not corrupted as a result of transformation
3. Data integrity is intact
4. Data readiness is achieved for further processing
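A transformation check can verify that no rows were lost, no nulls were introduced, and that transformed values still reconcile with the source. The price-to-cents conversion below is a hypothetical transformation used only to make the sketch concrete:

```python
# Transformation-integrity sketch: row count preserved, no nulls
# introduced, and each transformed value reconciles with its source.
def transform(rows):
    # Hypothetical transformation: price in currency units -> integer cents.
    return [{**r, "price_cents": round(r["price"] * 100)} for r in rows]

def check_transformation(source, transformed):
    assert len(source) == len(transformed), "rows lost or duplicated"
    assert all(r["price_cents"] is not None for r in transformed), "nulls introduced"
    # Integrity: the transformed value must still match the source value.
    assert all(abs(r["price_cents"] / 100 - s["price"]) < 0.005
               for s, r in zip(source, transformed))
    return True

rows = [{"sku": "A", "price": 9.99}, {"sku": "B", "price": 4.5}]
```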
DATA PREPARATION
1. Transformed data is available in the format expected by the algorithm
2. Data modeling parameters are available
3. Removal of any outliers
4. Validate that mean deviations are within the threshold for each product
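Items 3 and 4 can be sketched with the standard library: an IQR filter for outlier removal and a mean-absolute-deviation threshold check. The method and the threshold value are assumptions for illustration; the talk does not prescribe a specific technique:

```python
# Data-preparation sketch: drop outliers with an IQR filter, then check
# that the mean absolute deviation stays within a (hypothetical) threshold.
import statistics

def remove_outliers(prices):
    q = statistics.quantiles(prices, n=4)   # q[0] = Q1, q[2] = Q3
    iqr = q[2] - q[0]
    lo, hi = q[0] - 1.5 * iqr, q[2] + 1.5 * iqr
    return [p for p in prices if lo <= p <= hi]

def within_mad_threshold(prices, threshold):
    mean = statistics.fmean(prices)
    mad = statistics.fmean(abs(p - mean) for p in prices)
    return mad <= threshold

prices = [10.0, 10.5, 9.8, 10.2, 10.1, 9.9, 10.3, 42.0]  # 42.0 is an outlier
cleaned = remove_outliers(prices)
```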
DATA ALGORITHM
1. All pre-conditions for the algorithm are met
2. A failing pre-condition fails the algorithm
3. If the algorithm fails, the next stages do not execute and a call to action is triggered
4. All post-stages of the algorithm execute successfully
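The gating behaviour in items 2 and 3 can be sketched as a small precondition gate; stage and precondition names here are illustrative, not the project's actual ones:

```python
# Precondition-gate sketch: a failing precondition fails the algorithm
# stage, and downstream stages are never executed.
class PreconditionError(RuntimeError):
    pass

def run_stage(preconditions, stage, downstream):
    """Run `stage` and then `downstream` only if all preconditions hold."""
    for name, check in preconditions:
        if not check():
            raise PreconditionError(f"precondition failed: {name}")
    executed = [stage]
    executed.extend(downstream)
    return executed

# A failing precondition should stop everything after it.
try:
    run_stage([("input data present", lambda: False)],
              "price_algorithm", ["export", "publish"])
    ran = True
except PreconditionError:
    ran = False  # the call to action would be triggered here (e.g. an alert)
```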
ALGORITHM RESULTS
1. Ignore expected variations in some values if the algorithm error is within the expected range
2. Validate that output results are available in the expected consumable format
3. Verify that outputs are available as inputs to the next execution
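Item 1 is the key difference from conventional assertion-style QA: results are compared within a tolerance rather than for exact equality. A minimal sketch, with a hypothetical 1% tolerance:

```python
# Tolerance-based result comparison: accept small variations in the
# algorithm's output when the error is within the expected range.
import math

def results_match(expected, actual, rel_tol=0.01):
    """True when every recommended price is within rel_tol of expectation."""
    return all(math.isclose(e, a, rel_tol=rel_tol)
               for e, a in zip(expected, actual))

expected = [10.00, 20.00, 30.00]
actual = [10.05, 19.90, 30.10]   # small drift from a re-run of the algorithm
```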
DATA STORAGE CHECKS
1. Expected transformed data is getting stored
2. No corrupt data is getting stored
3. Data integrity with upstream data sources
4. Metadata generation
DATA EXPORT AND PUBLISH
1. All output files are present in the output location
2. Files to be used as inputs for the next algorithm run are available
3. Data visualisation matches the transformed data attributes
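Items 1 and 2 reduce to a presence check on the output location before downstream consumers or the next run pick the files up. The directory layout and file names below are hypothetical:

```python
# Export-check sketch: report which expected output files are missing
# from the output location. File names are illustrative.
from pathlib import Path
import tempfile

EXPECTED_OUTPUTS = ["recommended_prices.csv", "run_metadata.json"]

def missing_outputs(output_dir):
    out = Path(output_dir)
    return [name for name in EXPECTED_OUTPUTS if not (out / name).exists()]

with tempfile.TemporaryDirectory() as d:
    (Path(d) / "recommended_prices.csv").write_text("sku,price\n")
    missing = missing_outputs(d)  # run_metadata.json was never written
```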
WORKFLOW MANAGEMENT VERIFICATION
1. Verify job scheduling
2. Verify job completion time
3. Verify availability of the prices
PERFORMANCE TEST
1. The system is able to consume huge amounts of data efficiently
2. Message queues are cleared within the given amount of time
3. Inserting huge volumes of data into the underlying storage (e.g. HDFS)
4. Speed of data processing (e.g. MapReduce)
5. Memory and resource utilization
6. Data visualization after processing
RECOVERY TEST
1. Job failures and recoverability
2. Correctness of data processed post-recovery
3. Node failure scenarios
4. Logging for identifying failure reasons
ENVIRONMENT TEST
1. The test environment should have enough storage capacity
2. Cluster availability with distributed nodes
3. Test data availability on the test environment
CHALLENGES & LEARNINGS
CHALLENGES
1. Specialised skills are required to test data storage systems like HDFS and GCP
2. End-to-end automation of data pipelines is hard to achieve
3. Selecting automation tools can be difficult
4. Most of the effort goes into generating data from the sources and verifying it
LEARNINGS
1. Data preciseness and integrity are crucial attributes
2. QA is not always about the end result
3. Scope management is challenging on data projects
4. The three V's (Volume, Variety and Velocity) should always be considered
5. Validate output with SMEs
6. Build good friendships with data scientists (bribe them!)
THANK YOU
Quality Analyst
ANJUMAN SAIYED
Data Engineer
BALVINDER KHURANA
Quality Analyst
PRANESH GAIKWAD
QUESTIONS
Any questions/comments?
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 

Agile, QA and Data Projects (Geek Night 2020)

  • 1. GLOBAL SOFTWARE CONSULTANCY AGILE, QA AND DATA PROJECTS Anjuman Saiyed, Pranesh Gaikwad Credits: Balvinder Khurana 1© 2020 ThoughtWorks
  • 2. 2 Quality Analyst ANJUMAN SAIYED Data Engineer BALVINDER KHURANA Quality Analyst PRANESH GAIKWAD © 2020 ThoughtWorks
  • 3. 3 What’s the fuss about! In this talk we will share our insights from data projects that Pranesh and I have worked on as Quality Analysts. We will briefly explore aspects of the Agile framework on data projects and the challenges within. We will do so by presenting a case study, and through it we will discuss how QA specifically differs on data projects. © 2020 ThoughtWorks
  • 4. 4© 2020 ThoughtWorks I want to get price recommendations automatically generated I want to publish new prices to stores I want to optimize profit through new recommended prices I want to periodically review the recommended prices I want customers to continue buying the products at new prices Review historical prices on products to analyse product performance Analyse prices with respect to company strategy Suggest prices after analysing competitor data Manual data sorting and validation Mathematical derivation for every price change Filter data and generate reports Sally Stephen Price Analyst ? MEET SALLY… our CLIENT
  • 5. Vision Statement To be able to provide the right price to the right customer at the right time and the right place Objective A business wants to increase its profitability. It wants to price its products more intelligently based on external factors. 5© 2020 ThoughtWorks
  • 6. 6 DATA SCIENCE Data science is a blend of various tools, algorithms and machine learning principles with the goal to discover hidden patterns from the raw data. It is primarily used to make decisions and predictions making use of predictive causal analytics, prescriptive analytics (predictive analytics plus decision science) and machine learning. (edureka, 2019) © 2020 ThoughtWorks
  • 7. 7 DATA ENGINEERING Data engineering is a set of operations aimed at creating interfaces and mechanisms for the flow and access of information. It takes dedicated specialists - data engineers - to maintain data so that it remains available and usable by others. (Altexsoft, 2019) © 2020 ThoughtWorks
  • 9. MAPPING IT TO DATA TERMINOLOGIES 9© 2020 ThoughtWorks
  • 10. DATA WORKFLOW PIPELINE 10 © 2020 ThoughtWorks
  • 11. 11 Agile Story Life Cycle: Iteration Planning Meeting (Analysis & Scoping), Story Kick-Offs, In Development, Desk Check, Quality Analysis, Show Case, Sign Off, Deployment - the Agile Feedback Loop © 2020 ThoughtWorks
  • 12. 12 A DE Story Life Cycle? Iteration Planning Meeting (Analysis & Scoping), Story Kick-Offs, In Development, Desk Check, Quality Analysis, Show Case, Sign Off, Deployment. Story: Provide historical product prices © 2020 ThoughtWorks
  • 13. 13 Data Engineering Stages: Iteration Planning Meeting, Data Mapping, Data Modeling/Architecture, Data Acquisition, Data Quality, Data Transformation, Data Validation, Sign Off, Deployment © 2020 ThoughtWorks
  • 14. 14 Data Science Story Life Cycle? Analysis & Scoping: analyse the in-scope stories for algorithm/business logic development. In Development: data scientists build the actual algorithm/business logic. Quality Analysis: quality control checks on transformed data and analysis of the algorithm’s output. Sign Off: signing off the stories/business logic with a go-ahead flag. Deployment: promoting the algorithm/business logic to the production phase. Story: Analyse demand and price relationship. Iteration Planning Meeting, Story Kick-Offs, Desk Check, Show Case © 2020 ThoughtWorks
  • 15. 15 Data Science Stages: Iteration Planning Meeting, Literature Review, Data Analysis, Algorithm Development, Result Analysis, Quality Analysis, Sign Off, Deployment © 2020 ThoughtWorks
  • 16. QA on Data Projects © 2020 ThoughtWorks
  • 17. DATA WORKFLOW PIPELINE 17 © 2020 ThoughtWorks
  • 18. 18© 2020 ThoughtWorks DATA CONTRACT VALIDATION 1. Data to be consumed is from expected sources (env specific data) 2. Availability of Production like data 3. Availability of different inputs - files, events, etc. 4. Presence of mandatory attributes in the inputs
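The mandatory-attribute check in point 4 can be sketched as a simple contract validator. This is an illustrative sketch: the record shape and attribute names (product_id, store_id, selling_price) are assumed, not taken from the talk.

```python
# Sketch of a data-contract check, assuming each input record arrives as a dict.
# Attribute names here are illustrative, not the project's actual schema.
MANDATORY_ATTRIBUTES = {"product_id", "store_id", "selling_price"}

def missing_attributes(record):
    """Return the mandatory attributes that are absent (or null) in a record."""
    return sorted(
        attr for attr in MANDATORY_ATTRIBUTES
        if attr not in record or record[attr] is None
    )

valid = {"product_id": "P1", "store_id": "S1", "selling_price": 9.99}
invalid = {"product_id": "P2", "selling_price": None}
```

A real pipeline would typically run a check like this at the ingestion boundary and reject or quarantine records with a non-empty result.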
  • 19. DATA INGESTION 1. Data is pushed to correct underlying storage locations 2. Data is ingested as filtered data subsets based on algorithm’s requirements 19© 2020 ThoughtWorks
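Point 2 (ingesting filtered subsets) is described in the speaker notes as restricting ingestion to a historical window, roughly the past 5 years. A minimal sketch of that filter, with assumed record fields:

```python
# Illustrative sketch: keep only records within the required historical window.
# The 5-year window comes from the talk's example; field names are assumed.
from datetime import date, timedelta

def within_window(records, today, years=5):
    cutoff = today - timedelta(days=365 * years)
    return [r for r in records if r["price_date"] >= cutoff]

records = [
    {"product_id": "P1", "price_date": date(2019, 6, 1)},
    {"product_id": "P1", "price_date": date(2012, 6, 1)},
]
recent = within_window(records, today=date(2020, 1, 1))
```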
  • 20. DATA QUALITY 1. Comparing source data with data pushed into the system 2. Data validation for data pushed to correct locations 3. Validation on data ingestion semantics like At Least once or Exactly Once 20© 2020 ThoughtWorks
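The speaker notes give concrete examples of these quality checks: no negative or null selling prices, and no duplicate product rows. A hedged sketch of such checks, with assumed row shapes:

```python
# Minimal sketch of post-ingestion quality checks: invalid selling prices and
# duplicate product/store/date rows. Field names are assumed for illustration.
def quality_issues(rows):
    issues = []
    seen = set()
    for row in rows:
        key = (row["product_id"], row["store_id"], row["price_date"])
        if key in seen:
            issues.append(("duplicate", key))
        seen.add(key)
        price = row["selling_price"]
        if price is None or price < 0:
            issues.append(("invalid_price", key))
    return issues

rows = [
    {"product_id": "P1", "store_id": "S1", "price_date": "2020-01-06", "selling_price": 9.99},
    {"product_id": "P1", "store_id": "S1", "price_date": "2020-01-06", "selling_price": 9.99},
    {"product_id": "P2", "store_id": "S1", "price_date": "2020-01-06", "selling_price": -1.0},
]
issues = quality_issues(rows)
```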
  • 21. DATA TRANSFORMATION 1. All inputs needed for the algorithm are transformed as expected 2. Data is not corrupted as a result of transformation. 3. Data integrity is intact 4. Data readiness is achieved for further processing 21© 2020 ThoughtWorks
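The integrity check the speakers describe in the notes - if X products go into a transformation, X transformed prices must come out - can be sketched like this. The aggregation (averaging prices across stores) and the data shapes are illustrative assumptions.

```python
# Sketch of a transformation (average price per product across stores) plus
# the X-in / X-out integrity check described in the talk. Shapes are assumed.
from collections import defaultdict

def average_price_per_product(rows):
    totals = defaultdict(lambda: [0.0, 0])
    for row in rows:
        t = totals[row["product_id"]]
        t[0] += row["selling_price"]
        t[1] += 1
    return {pid: s / n for pid, (s, n) in totals.items()}

rows = [
    {"product_id": "P1", "store_id": "S1", "selling_price": 10.0},
    {"product_id": "P1", "store_id": "S2", "selling_price": 14.0},
    {"product_id": "P2", "store_id": "S1", "selling_price": 5.0},
]
transformed = average_price_per_product(rows)
# Integrity check: every input product has exactly one transformed price.
assert set(transformed) == {r["product_id"] for r in rows}
```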
  • 22. DATA PREPARATION 1. Transformed data is available in the format expected by the algorithm. 2. Data modeling parameters are available 3. Removal of any outliers 4. Validate if Mean Deviations are within threshold for each product 22© 2020 ThoughtWorks
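Points 3 and 4 (outlier removal, deviations within a threshold) might look like the following sketch: drop prices whose deviation from the product's mean exceeds a threshold. The threshold value and the mean-based rule are assumptions for illustration; the project's actual criterion may differ.

```python
# Hedged sketch of an outlier filter: keep prices whose relative deviation
# from the mean is within a threshold. Threshold value is an assumption.
def remove_outliers(prices, threshold=0.5):
    mean = sum(prices) / len(prices)
    return [p for p in prices if abs(p - mean) / mean <= threshold]

prices = [10.0, 12.0, 11.0, 47.0]   # 47.0: a heavily discounted outlier
cleaned = remove_outliers(prices)
```

This matches the speaker-note example of a product priced far off its maintained average because a sales manager gave a heavy discount.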
  • 23. DATA ALGORITHM 1. All pre-conditions for the algorithm are met 2. Failing pre-condition fails the algorithm 3. If the algorithm fails, the next stages do not execute and call to action is triggered 4. All post-stages of the algorithm are executed successfully 23© 2020 ThoughtWorks
  • 24. ALGORITHM RESULTS 1. Ignore expected variations in some values if the algorithm error is within expected range 2. Validate if output results are available in the expected consumable format 3. Verify if outputs are available as inputs to the next execution 24© 2020 ThoughtWorks
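Point 1 (ignore variations when the algorithm error is within the expected range) amounts to comparing outputs with a tolerance rather than strict equality. A sketch, where the 2% tolerance is an assumed value:

```python
# Compare a recommended price against a reference value with a relative
# tolerance instead of exact equality. The tolerance here is an assumption.
import math

def within_expected_error(recommended, reference, rel_tol=0.02):
    return math.isclose(recommended, reference, rel_tol=rel_tol)
```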
  • 25. DATA STORAGE CHECKS 1. Expected transformed data is getting stored 2. No corrupt data is getting stored 3. Data integrity with upstream data sources 4. Metadata generation 25© 2020 ThoughtWorks
  • 26. DATA EXPORT AND PUBLISH 1. All output files are present in the output location 2. Files to be used as inputs for next algorithm run are available 3. Data visualisation is as per transformed data attributes 26© 2020 ThoughtWorks
  • 27. WORKFLOW MANAGEMENT VERIFICATION 1. Verify job scheduling 2. Verify job completion time 3. Availability of the prices 27© 2020 ThoughtWorks
  • 28. PERFORMANCE TEST 1. System is able to consume huge amounts of data efficiently 2. Message queues are getting cleared in a given amount of time 3. Inserting huge data into underlying storage (HDFS) 4. Speed of data processing (MapReduce) 5. Memory and resource utilization 6. Data visualization after processing 28© 2020 ThoughtWorks
  • 29. RECOVERY TEST 1. Job failures and recoverability 2. Correctness of data processed post recovery 3. Node failure scenarios 4. Logging for identifying failure reasons 29© 2020 ThoughtWorks
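Points 1 and 4 above (job recoverability with logged failure reasons) can be exercised with a bounded-retry harness. Everything here - the job function, failure mode, and retry count - is hypothetical, intended only to show the shape of such a test:

```python
# Illustrative retry harness: re-run a failed job a bounded number of times,
# logging each failure reason. The flaky job below is a hypothetical stand-in.
def run_with_retries(job, max_attempts=3, log=print):
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception as exc:
            log(f"attempt {attempt} failed: {exc}")
    raise RuntimeError(f"job failed after {max_attempts} attempts")

calls = {"n": 0}
def flaky_job():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("node unavailable")   # simulated node failure
    return "processed"

result = run_with_retries(flaky_job, log=lambda msg: None)
```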
  • 30. ENVIRONMENT TEST 1. Test environment should have enough storage capacity 2. Cluster availability with distributed nodes 3. Test data availability on the test environment 30© 2020 ThoughtWorks
  • 31. CHALLENGES & LEARNINGS 31© 2020 ThoughtWorks
  • 32. CHALLENGES 32© 2020 ThoughtWorks 1. Skills are required to test data storage systems like HDFS, GCP 2. End-to-end automation on data pipelines is hard to achieve 3. Automation tool selection can be difficult 4. Significant effort goes into generating data from sources and verifying it
  • 33. LEARNINGS 33© 2020 ThoughtWorks 1. Data preciseness and integrity are crucial attributes 2. QA is not always on the end result 3. Scope management is challenging on data projects 4. The important 3 V’s - Volume, Variety and Velocity - should always be considered 5. Validate output with SMEs 6. Build good friendships with data scientists - bribe them!
  • 34. 34 THANK YOU Quality Analyst ANJUMAN SAIYED Data Engineer BALVINDER KHURANA Quality Analyst PRANESH GAIKWAD © 2020 ThoughtWorks ANY QUESTIONS/COMMENTS?

Editor's Notes

  1. Anjuman
  2. Anjuman
  3. Anjuman
  4. Anjuman How this translated into their business objective would be something like this. They have a vision...
  5. Pranesh Thanks, Anjuman. This particular problem had two perspectives. One is Data Science: coming up with an algorithm that can address Sally’s problem statement and predict the product prices using historical data and analysis. So I can say data science is something that provides meaningful information based on large amounts of complex data.
  6. Pranesh And the other perspective to solve this problem is Data Engineering, which will collect the product-specific data from different sources and transform it as per requirements for further processing, or store the transformed data somewhere for data science to make use of.
  7. Anjuman That sounds about right. Visually, if I were to break Sally’s ask down, this is what it may look like. Essentially what she is asking for are prices and reports on a periodic basis. Since we want prices that make profits, we want a data-intelligent mechanism to get prices at optimised profitability. This needs valid sales data on products, like aggregate prices across stores and aggregate sales on a weekly basis. And this data needs to be ingested from source systems like point-of-sale transactions, historical product prices from production pricing databases, and competitor data from external agencies.
  8. Pranesh So what you are saying is, we have translated the problem statement into a data architecture to solve Sally’s problem. Now we will try to map it to data terminologies. We first identified all the data sources as input, following which we ingest the data into the system with quality checks on it. Then we apply data transformation rules on the ingested data so that it is available for the algorithm to consume. Finally we get the expected results - in our case, price predictions - publish them, and have them stored somewhere so that we can export the outputs to the pricing system.
  9. Anjuman
  10. Pranesh With that said, the way a story moves in usual non-data agile projects differs fairly from data projects. Here is an example of the usual non-data story life cycle. We start with the Iteration Planning Meeting, which we call the IPM, where BAs, developers and testers sit together to discuss scoping and analyse the stories to be covered in a particular sprint. Once the IPM is done, the team continues with story kick-offs, where developers and testers discuss the functional and technical aspects of the story. Once this is done, the developer builds the actual logic, followed by a desk check where the developer showcases the developed functionality to BAs and testers. QAs then test the functionality with all the checks and showcase it to stakeholders, who provide the sign-off so the functionality can be promoted to the next stage. Now let’s see how this fits on data projects with a problem statement like ours.
  11. Pranesh So let’s now consider a specific story from our problem statement, where we need to consume historical product prices to predict an optimum price. And here come the pain points. Since the scope of this story is so vast, its analysis and scoping become tricky, as its coverage could spill over into the next cycles. Since we are only consuming the historical data, its desk check will only involve the developer showing that we have consumed specific data. But what we do in a regular desk check is validate more checks on the logic, so that we can find issues early in the life cycle; in a desk check for a data project, it is humanly impossible to check all the different data variations and their outputs. The next pain point is providing sign-off from the perspective of the data scientists, and since we are only consuming the historical data, deploying this functionality may not add much value on its own.
  12. Pranesh So with data projects, we see certain practice changes in the day-to-day product life cycle. If you remember the previous pain points we talked about, we will try to map data engineering analogies onto them instead. As mentioned here, for data engineering we first do data mapping, where we identify the data sources and the required data to be collected from those sources. In our case it will be historical pricing data. Once we identify all the data sources and required data, the next stage is data modeling, which focuses on how we structure and store this data for our use case. Then we acquire that particular data and validate its quality. Afterwards we transform the acquired data to extract the required attributes, say product information and its prices, and validate the transformed data for the next phase.
  13. Anjuman With that said, even the way a story would move in non-data agile projects would fairly differ on data projects. Such as this one here… DS
  14. Pranesh/Anjuman Whereas a data science journey usually begins with a literature review to get more insight into the problem statement and the forces at work in the given domain. With this basic understanding, we analyse the data to find patterns. An algorithm is developed based on the understanding from the previous two stages. The results of this algorithm are analysed to see if they achieve the desired accuracy. Once we are satisfied with the algorithm or model, it can be deployed.
  15. Anjuman
  16. Pranesh Going back to data pipeline, this is what a data architecture for data pipeline commonly looks like. We can now look into Specific QA activities around each stage in it.
  17. Pranesh Since the first part of our data pipeline is consuming data from different sources, let’s now discuss the QA activities around this stage. We need a continuous stream of events like sales transactions from the point-of-sale terminals of different stores, historical product prices from production pricing databases, and competitor data from external agencies. We ensured that all of the inputs are as fresh as they are on production, because this helped the algorithm analyse valid data. Since we must test using different sets of data - the algorithm is going to need different types of pricing as inputs - we ensured that all of these prices are available from their respective sources, such as files. We also ensured that all mandatory attributes are present in the source data; in our case these are the selling prices of the products, discount prices, etc.
  18. Pranesh If you recall, the next phase in the data pipeline is ingesting the source data into the system. With that said, we validated whether the products and their respective prices that we consumed are stored in the correct underlying storage locations. Storage systems here can be HDFS (Hadoop Distributed File System) or even storage buckets like S3. Recall also that we only needed historical data, let’s say in the range of the past 5 years. Hence, from the input sources, we also ensured that we ingest data within this specific time period only, rather than considering all historical data.
  19. Pranesh The next stage in the data pipeline is checking the quality of the data ingested in the previous step. After ingestion, we ensured data integrity by comparing products and their prices with the source data. We also validated that all products have valid prices and that data is pushed to the correct storage location - for example, no negative selling prices and no null or blank values. We also checked for duplicate or missing product information, as this could impact the outcome of the algorithm’s intelligence.
  20. Pranesh The most crucial step in our data pipeline is transforming the ingested data. Up to the last stage we just validated the quality of the data we ingested from source. Data transformation at a high level means extracting the necessary attributes from the data set, since we might not be interested in all of the ingested data. With that said, we ensured the transformed data met the algorithm’s requirements - for example, aggregate prices of a product across different stores, or aggregate sales for a product on a weekly basis. Once that was done, we validated that there is no corrupt data in the system. After transformation, we also verified data integrity to check that values are intact (the aggregation logic does not corrupt them). One example of how we ensured that: say we ingested data for X products and applied some transformation rules on the product prices; at the end we should still have X transformed product prices. We also validated that the data is ready to be consumed by the algorithm in the next stage. Data transformation is defined by what inputs the algorithm needs or what needs to be generated in the reports. For example, if a report needed to be generated out of the system for C-level executives every week, then the transformation logic would only need to run weekly.
  21. Anjuman All data to be consumed by the algorithm is in the expected format - for example, converting into CSV at run time. Data modeling parameters (hyperparameters) are available to be used by the algorithm. Ensuring that the ingested data has all outliers removed. An example would be a product whose selling price is way off from its average maintained price over the years. This could be due to good relations with the sales manager, who gave heavy discounts to her favourite customer.
  22. Anjuman
  23. Anjuman (Example price points, etc.) Algorithm Errors To append to the historical data and to also evaluate if the algorithm is improving or not
  24. Anjuman Mostly with these kinds of storage systems, you would also check that the data is stored in the correct partitions and locations, and that the right level of metadata is generated with every storage operation.
  25. Anjuman
  26. Anjuman
  27. Pranesh Apart from the stages mentioned in the data pipeline earlier, we feel there are more validation checks we can apply to further test our pipeline. One of them was validating how the pipeline behaves when it encounters huge data. We validated whether the system consumes such huge data efficiently or not, that message queues are cleared on time so that there is no overlap between two subsequent jobs, that this huge data is stored in the underlying storage systems with ease, and that the same data is transformed correctly when the data processing rules are applied. We also ensured that the memory and resource utilization of our jobs stays within thresholds while handling huge data, and that there is no difficulty in visualizing it.
  28. Pranesh Another form of testing we considered was recovery testing for jobs: how the system recovers itself from any failures, and how data is consumed correctly even after recovery. We also ensured that if any node failed, the other nodes shared the load, so that our data analysis process stayed intact. The next important aspect we covered is the logging mechanism, as it helped debug the failures.
  29. Pranesh The health of the environment need not be validated in each iteration, so these checks can be done at some time interval - for example, checking that the environment has enough storage capacity and clusters with distributed nodes. So, to summarize the QA activities we talked about: they are not the same as the traditional activities we follow in normal projects, but the approach to testing remains the same - "challenge the business logic to make it more robust and give us confidence". These QA activities might differ on other data projects.
  30. Anjuman Automation tool selection examples
  31. Anjuman Precise data - we need only selling prices, discount prices and promotion prices; no need for buying prices or refunds. Production data - we are dealing with the modeling of an algorithm that takes historical production data into account and produces near-accurate price recommendations. Logging and monitoring - helped in debugging job failures. Close collaboration and setting up expectations. Change management - developing a framework to adapt to changes at any data processing stage. Iterative approach - so that we can promote the algorithm or business logic as an MVP to the production phase. The important 3 V’s - to have variations in data so that price predictions will be close to accurate. Good friendship - data scientists and QAs should work hand in hand; collaborate with data scientists. Since data is most valuable to organisations, its preciseness and integrity are the most important attributes. The important 3 V’s while testing huge data systems: Volume, Variety and Velocity. Validate output with SMEs. QA need not wait for the end result on data projects. Always use production-like data to test. Change management - scope management? Good friendships with data scientists - bribe them! As a QA, if I were to QA a data science project, would I need to know about data science?