Experimenting with Data!
Andrea Montemaggio
head of data practice @ mashfrog group
andrea.montemaggio@mashfrog.com
github.com/klinamen
linkedin.com/in/amontemaggio
Data Science Trend
Source: Google Trends
Keyword: “Data Science”
Wikipedia on “Data Science”
Data science is a "concept to unify statistics, data analysis, informatics, and their related methods" in order to "understand and analyze actual phenomena" with data. [3]
It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, information science, and domain knowledge. [4]
Workflow Overview
- Scoping & Data Acquisition
- Data Analysis, Modeling and Prototyping
  - Exploratory Data Analysis
  - Experimental loop: Data preparation, Feature selection and extraction, Model selection, Evaluation
  - Solution Prototyping
- Engineering & Deployment
- Monitoring and Tuning
Scenario
We are a wholesale distributor and want to improve our receivables collection strategy.
Knowing as early as possible whether an invoice is going to be paid on time would allow us to plan ahead and target our collection efforts at the most critical situations first.
How long will it take to cash a given invoice?
Problem
How long will it take to cash a given invoice?
Supervised Machine-Learning
Using historical enterprise data and machine-learning to predict whether or not an invoice is likely to be paid on time can help organizations to optimize invoice-to-cash flow.
A trained classification model is able to assign an invoice to one of a predefined set of classes.
[Diagram: Scoping, then Data Acquisition (invoice data + customers data + …), Data preparation, Feature selection and extraction, Model selection and Evaluation produce a Trained Classification Model, which assigns a new invoice to "on time", "1-30 days late" or "> 30 days late".]
- Classification: predicting a discrete label (e.g. “on-time”, “1-30”).
- Regression: predicting a continuous quantity (e.g. 17.5 days).
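The two framings can be sketched on the same quantity, the payment delay in days. This is a minimal illustration, not the talk's code: the function name `delay_class` and the exact bin edges are assumptions, chosen to match the three classes named on the slide.

```python
def delay_class(days_late: float) -> str:
    """Map a payment delay in days to one of three classes (assumed bin edges)."""
    if days_late <= 0:
        return "on time"
    if days_late <= 30:
        return "1-30 days late"
    return "> 30 days late"


# Regression target: the delay itself, a continuous quantity.
regression_target = 17.5
# Classification target: a discrete label derived from the same delay.
classification_target = delay_class(regression_target)
print(regression_target, "->", classification_target)
```

Binning throws away some information (17.5 and 29 days end up in the same class), but the resulting classes map directly onto collection priorities.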
Steps: Scoping & Data Acquisition, Exploratory Data Analysis, Data preparation, Feature engineering, Model selection & training, Evaluation
Dataset
Dataset: https://www.kaggle.com/datasets/himanshu007121/invoice-data
Description: Wholesale invoice data extracted from some accounting system (SAP?) in CSV format.
Each record describes a document and has, among others, these pieces of information:
- the branch that issued the document
- customer information
- total amount
- due date
- payment date
Geometry (rows × cols): 50,000 × 19
Size: 7.17 MB
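As a rough sketch of how such records could be read and turned into a payment delay, the snippet below parses a tiny in-memory CSV with the standard library. The column names (`branch`, `customer`, `total_amount`, `due_date`, `payment_date`) and the sample rows are hypothetical; the real file's 19 columns are not listed on the slide.

```python
import csv
import io
from datetime import date

# Tiny in-memory stand-in for the CSV file; column names and rows are hypothetical.
SAMPLE_CSV = """branch,customer,total_amount,due_date,payment_date
B01,ACME,1200.50,2020-01-31,2020-01-28
B02,Globex,830.00,2020-02-15,2020-03-01
B01,Initech,99.90,2020-02-20,2020-04-02
"""


def days_late(row: dict) -> int:
    """Days between due date and payment date (zero or negative means on time)."""
    due = date.fromisoformat(row["due_date"])
    paid = date.fromisoformat(row["payment_date"])
    return (paid - due).days


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
delays = [days_late(r) for r in rows]
print(delays)
```

The delay computed from the due date and the payment date is the natural target variable for the scenario's question.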
Exploratory Data Analysis
Getting to know your data.
Goals
- Data understanding
- Data Quality assessment (e.g. missing data, encoding problems, inconsistencies)
- Assessing value distributions and correlations
Tools
- Excel (!)
- Programming languages: Python¹, R¹
- low/no code integrated data analysis tools such as OpenRefine¹, Orange¹, KNIME, RapidMiner
- statistical software packages
¹ FLOSS (free/libre and open-source software)
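The goals above can be sketched with nothing but the Python standard library. The records and field names below are made up for illustration, and in practice a data-frame library would usually do this work; the point is only to show what "missing data" and "value distributions" look like in code.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical records; None marks a missing value.
records = [
    {"branch": "B01", "total_amount": 1200.5, "days_late": -3},
    {"branch": "B02", "total_amount": 830.0, "days_late": 15},
    {"branch": "B01", "total_amount": None, "days_late": 42},
    {"branch": None, "total_amount": 99.9, "days_late": 0},
]


def missing_counts(rows):
    """Count missing (None) values per column."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None:
                counts[column] += 1
    return dict(counts)


def summarize(values):
    """Basic distribution summary of the non-missing numeric values."""
    present = [v for v in values if v is not None]
    return {
        "n": len(present),
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
        "median": median(present),
    }


missing = missing_counts(records)
delay_summary = summarize([r["days_late"] for r in records])
branch_counts = Counter(r["branch"] for r in records)
print(missing, delay_summary, branch_counts, sep="\n")
```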
Data Preparation
Cook before eating.
Goals
- processing raw data (or primary data), which is rarely ready to feed your algorithms
- fixing missing values and inconsistencies
- converting between different representations of the same datum (e.g. dates, decimal numbers)
Tools
- Python¹
- Visual tools: OpenRefine¹, AWS Glue DataBrew
¹ FLOSS (free/libre and open-source software)
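A minimal sketch of these preparation steps, assuming the raw data mixes a few date formats and uses a comma as decimal separator in some amounts. The accepted formats, the separator convention and the median-based fill are all illustrative assumptions, not a description of the actual dataset.

```python
from datetime import date, datetime
from statistics import median


def parse_date(text: str) -> date:
    """Parse a date written in one of a few assumed formats."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def parse_amount(text: str) -> float:
    """Parse a decimal number, accepting a comma as decimal separator."""
    cleaned = text.strip()
    if "," in cleaned:
        # Assumed convention: dots are thousands separators, the comma is decimal.
        cleaned = cleaned.replace(".", "").replace(",", ".")
    return float(cleaned)


def fill_missing(values):
    """Replace missing values (None) with the median of the present ones."""
    fallback = median(v for v in values if v is not None)
    return [fallback if v is None else v for v in values]


print(parse_date("31/01/2020"), parse_amount("1.234,56"), fill_missing([10.0, None, 30.0]))
```

Filling with the median is only one option; dropping the record or adding a "was missing" flag may suit a given column better.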
Feature Engineering
Knowledge is power.
Goals
- using domain knowledge to augment data with derived information (feature extraction), which usually leads to better performance of ML models
- selecting the smallest number of features with the greatest significance (feature selection)
- removing redundant or useless information
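To illustrate feature extraction and selection for the invoice scenario, here is a small sketch. Every field and feature name (`issue_date`, `terms_days`, `due_weekday`, `past_late_ratio`) is a hypothetical example of derived information, not a feature taken from the talk.

```python
from datetime import date


def extract_features(invoice: dict, history: list) -> dict:
    """Derive features from a raw invoice and the customer's past delays (in days)."""
    issued, due = invoice["issue_date"], invoice["due_date"]
    late_history = [d for d in history if d > 0]
    return {
        "total_amount": invoice["total_amount"],
        # Payment terms granted to the customer, in days.
        "terms_days": (due - issued).days,
        # 0 = Monday ... 6 = Sunday.
        "due_weekday": due.weekday(),
        # Share of the customer's past invoices that were paid late.
        "past_late_ratio": len(late_history) / len(history) if history else 0.0,
    }


def select_features(features: dict, keep: list) -> dict:
    """Feature selection: keep only the named features."""
    return {name: features[name] for name in keep}


invoice = {
    "issue_date": date(2020, 1, 1),
    "due_date": date(2020, 1, 31),
    "total_amount": 1200.5,
}
features = extract_features(invoice, history=[-2, 5, 40, 0])
selected = select_features(features, keep=["terms_days", "past_late_ratio"])
print(features)
print(selected)
```

Which features deserve to be kept is normally decided from measured significance, for example by comparing model performance with and without them; the hand-picked `keep` list here only shows the mechanics.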
Model Selection and Training
One model does not fit all.
Goals
- identifying candidate models for the problem and dataset at hand
- splitting the dataset into a training set and a test set
- training the candidate models and measuring their performance
- optimizing hyperparameters (i.e. model parameters that control the learning process) and fine-tuning to select “The One”
[Diagram: the whole dataset (100%) is split into a training set (~80%), used for model selection, and a test set (~20%), used for evaluation.]
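The 80/20 split can be sketched in a few lines of standard-library Python. The function names and the fixed seed are assumptions made for reproducibility of the example; the majority-class predictor is a deliberately trivial baseline that any real candidate model should beat, not one of the models used in the talk.

```python
import random
from collections import Counter


def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (training set, test set)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


def majority_class(labels):
    """Trivial baseline: always predict the most frequent training label."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical labelled samples: (sample id, delay class).
labels = ["on time"] * 6 + ["1-30 days late"] * 3 + ["> 30 days late"]
samples = list(enumerate(labels))

train, test = train_test_split(samples)
baseline = majority_class([label for _, label in train])
print(len(train), len(test), baseline)
```

Shuffling before splitting matters when the file is sorted, for example by date or by customer; without it the test set would not resemble the training set.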
Evaluation
Is it really “The One”?
Goals
- testing the best candidate model on the test set to see how it behaves with unseen data (generalization)
[Figure: plot with "Model complexity (# of parameters)" on one axis.]
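Why the test set matters can be shown with a toy model that simply memorizes its training data, so it has one "parameter" per training sample. The class `MemorizingModel` and the data are invented for this sketch; it is an illustration of poor generalization, not a model from the talk.

```python
from collections import Counter


class MemorizingModel:
    """Deliberately over-complex model: one 'parameter' per training sample."""

    def fit(self, features, labels):
        self.table = dict(zip(features, labels))
        self.default = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, features):
        # Unseen inputs fall back to the most frequent training label.
        return [self.table.get(x, self.default) for x in features]


def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical data: feature = (customer, amount bucket), label = delay class.
train_x = [("ACME", "high"), ("ACME", "low"), ("Globex", "high"), ("Initech", "low")]
train_y = ["on time", "on time", "1-30", "> 30"]
test_x = [("ACME", "high"), ("Globex", "low"), ("Hooli", "high"), ("Initech", "high")]
test_y = ["on time", "1-30", "1-30", "> 30"]

model = MemorizingModel().fit(train_x, train_y)
train_acc = accuracy(model.predict(train_x), train_y)
test_acc = accuracy(model.predict(test_x), test_y)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

A perfect score on the training set combined with a much lower score on unseen data is the typical signature of a model that is too complex for the data available.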
Classification Metrics
[Figure: confusion matrix; its diagonal holds the correct predictions.]
Precision (“1-5” class)
TP / (TP + FP) = 6674 / 10087 = 66.16%
10087 samples predicted as “1-5”: 6674 TP + 3413 FP
Recall (“1-5” class)
TP / (TP + FN) = 6674 / 10542 = 63.31%
10542 samples that are really “1-5”: 6674 TP + 3868 FN
Accuracy (unweighted)
Pc / Pt = 23872 / 32382 = 73.72%
32382 total predictions (Pt): 23872 correct (Pc) + 8510 errors
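The three metrics can be recomputed directly from the counts reported on the slide. The function and constant names are illustrative; only the numbers come from the source.

```python
def precision(tp: int, fp: int) -> float:
    """Share of samples predicted as the class that really belong to it."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Share of samples really in the class that were predicted as such."""
    return tp / (tp + fn)


def accuracy(correct: int, total: int) -> float:
    """Share of all predictions that are correct, across every class."""
    return correct / total


# Counts for the "1-5" class and the overall totals, as reported on the slide.
TP, FP, FN = 6674, 3413, 3868
CORRECT, TOTAL = 23872, 32382

print(f"precision: {precision(TP, FP):.2%}")
print(f"recall:    {recall(TP, FN):.2%}")
print(f"accuracy:  {accuracy(CORRECT, TOTAL):.2%}")
```

Precision and recall are per-class figures, while accuracy summarizes all classes at once, which is why the three values differ.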
Don’t Try This at Home!
Just clone the following repository and have fun!
https://github.com/klinamen/ds0-experimenting-with-data
Thank you.
experiments never fail.

