Experimenting with Data!

Andrea Montemaggio
head of data practice @ mashfrog group
andrea.montemaggio@mashfrog.com
github.com/klinamen
linkedin.com/in/amontemaggio

Data Science Trend
Source: Google Trends
Keyword: “Data Science”
Wikipedia on “Data Science”
Data science is a "concept to unify statistics, data analysis, informatics, and their related methods" in order to "understand and analyze actual phenomena" with data.[3] It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, information science, and domain knowledge.[4]

Workflow Overview
- Scoping & Data Acquisition
- Data Analysis, Modeling and Prototyping
  - Exploratory Data Analysis
  - Experimental loop: Data preparation, Feature selection and extraction, Model selection, Evaluation
  - Solution Prototyping
- Engineering & Deployment
- Monitoring and Tuning

Scenario
We are a wholesale distributor and want to improve our receivables collection strategy.
Knowing as soon as possible whether an invoice is going to be paid on time would allow us to plan ahead and target our collection efforts to address the most critical situations first.
How long will it take to cash a given invoice?

Scoping
Supervised Machine-Learning
Training: Invoice data + Customers data + … → Data Acquisition → Data preparation → Feature selection and extraction → Model selection → Evaluation → Trained Classification Model
Prediction: New invoice → Trained Classification Model → on time | 1-30 days late | > 30 days late
A trained classification model is able to assign an invoice to one of a predefined set of classes.
Using historical enterprise data and machine learning to predict whether or not an invoice is likely to be paid on time can help organizations optimize their invoice-to-cash flow.
Problem
How long will it take to cash a given invoice?
- Classification: predicting a discrete label (e.g. “on-time”, “1-30”).
- Regression: predicting a continuous quantity (e.g. 17.5 days).
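The difference between the two framings can be illustrated with a minimal sketch. It assumes the three classes shown in the diagram (on time, 1-30 days late, > 30 days late); the function name and the exact boundary handling are illustrative choices, not taken from the talk.

```python
def days_late_to_class(days_late: float) -> str:
    """Map a payment delay in days to one of three discrete class labels.

    The boundaries (<= 0, <= 30) are an assumption made for this sketch.
    """
    if days_late <= 0:
        return "on time"
    if days_late <= 30:
        return "1-30 days late"
    return "> 30 days late"


# A regression model would predict the number itself (e.g. 17.5 days);
# a classification model predicts only the bucket that number falls into.
for delay in (-2, 0, 17.5, 45):
    print(delay, "->", days_late_to_class(delay))
```

Framing the problem as classification trades precision of the answer for a target that is usually easier to learn and to act on.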
Workflow steps: Scoping & Data Acquisition → Exploratory Data Analysis → Data preparation → Feature engineering → Model selection & training → Evaluation

Dataset
Dataset: https://www.kaggle.com/datasets/himanshu007121/invoice-data
Description:
Wholesale invoice data extracted from some accounting system (SAP?) in CSV format.
Each record describes a document and has, among others, these pieces of information:
- the branch that issued the document
- customer information
- total amount
- due date
- payment date
Geometry (rows × cols): 50,000 × 19
Size: 7.17 MB

Exploratory Data Analysis
Getting to know your data.
Goals
- Data understanding
- Data quality assessment (e.g. missing data, encoding problems, inconsistencies)
- Assessing value distributions and correlations
Tools
- Excel (!)
- Programming languages: Python¹, R¹
- Low/no-code integrated data analysis tools such as OpenRefine¹, Orange¹, KNIME, RapidMiner
- Statistical software packages
¹ FLOSS (free or open source software)
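As a rough illustration of the goals above, the following sketch counts missing values per column and summarizes one numeric column. The sample rows and column names are hypothetical and do not reflect the real dataset schema. Only the Python standard library is used so the snippet runs anywhere; a data analysis library would be a common choice in practice.

```python
import csv
import io
import statistics
from collections import Counter

# Hypothetical sample: column names and values are illustrative only.
SAMPLE_CSV = """invoice_id,customer,total_amount,due_date,payment_date
1,ACME,1200.50,2020-01-31,2020-02-03
2,ACME,,2020-02-15,2020-02-10
3,Globex,310.00,2020-02-20,
4,Initech,98.75,2020-03-01,2020-04-12
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Data quality: count empty cells per column.
missing = Counter(col for row in rows for col, value in row.items() if value == "")

# Value distribution: basic statistics on the amounts that are present.
amounts = [float(r["total_amount"]) for r in rows if r["total_amount"]]
summary = {
    "count": len(amounts),
    "min": min(amounts),
    "max": max(amounts),
    "mean": statistics.mean(amounts),
    "median": statistics.median(amounts),
}

print("rows x cols:", len(rows), "x", len(rows[0]))
print("missing per column:", dict(missing))
print("total_amount summary:", summary)
```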

Data Preparation
Cook before eating.
Goals
- processing raw data (or primary data), which is rarely ready to feed your algorithms
- fixing missing values and inconsistencies
- converting between different representations of the same datum (e.g. dates, decimal numbers)
Tools
- Python¹
- Visual tools: OpenRefine¹, AWS Glue DataBrew
¹ FLOSS (free or open source software)
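The conversion goals above might look like the sketch below, which normalizes dates and decimal numbers that arrive in different textual representations and maps empty cells to `None`. The list of accepted date formats and the helper names are assumptions made for the example, not something the talk specifies.

```python
from datetime import date, datetime
from typing import Optional

# Assumed input formats; a real export may use others.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def parse_date(raw: str) -> Optional[date]:
    """Return a date for any of the assumed formats, or None for an empty cell."""
    raw = raw.strip()
    if not raw:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")


def parse_amount(raw: str) -> Optional[float]:
    """Accept '1234.50' and the comma-decimal style '1.234,50'.

    Limitation: a comma used as a thousands separator ('1,234.50')
    is not handled by this sketch.
    """
    raw = raw.strip()
    if not raw:
        return None
    if "," in raw:
        raw = raw.replace(".", "").replace(",", ".")
    return float(raw)


print(parse_date("31/01/2020"), parse_date("20200131"), parse_date(""))
print(parse_amount("1.234,50"), parse_amount("98.75"), parse_amount(""))
```

Raising on an unrecognized date, rather than silently returning `None`, keeps genuinely missing values distinguishable from values in an unexpected format.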

Feature Engineering
Knowledge is power.
Goals
- using domain knowledge to augment data with derived information (feature extraction), which usually leads to better performance of ML models
- selecting the smallest set of features with the greatest significance (feature selection)
- removing redundant or useless information
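Feature extraction for the invoice scenario could be sketched as follows. All field and feature names are hypothetical, including the invoice date and the per-customer history of past delays, which the dataset description does not explicitly list. The sketch covers extraction only, not feature selection.

```python
from datetime import date
from typing import Dict, Sequence


def extract_features(
    invoice_date: date,
    due_date: date,
    total_amount: float,
    customer_past_delays: Sequence[int],
) -> Dict[str, float]:
    """Derive model inputs from raw invoice fields (illustrative names only)."""
    late_count = sum(1 for d in customer_past_delays if d > 0)
    late_ratio = late_count / len(customer_past_delays) if customer_past_delays else 0.0
    return {
        # Domain knowledge: longer payment terms may behave differently.
        "payment_terms_days": (due_date - invoice_date).days,
        # Calendar effects: 0 = Monday ... 6 = Sunday.
        "due_weekday": due_date.weekday(),
        "due_month": due_date.month,
        "total_amount": total_amount,
        # How often this customer paid late in the past.
        "customer_late_ratio": late_ratio,
    }


features = extract_features(date(2020, 1, 1), date(2020, 1, 31), 1200.5, [0, 5, -2, 40])
print(features)
```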

Model Selection and Training
One model does not fit all.
Goals
- identifying candidate models for the problem and dataset at hand
- splitting the dataset into a training set and a test set
- training the candidate models and comparing their performance
- optimizing hyperparameters (i.e. model parameters that control the learning process) and fine-tuning to select “The One”
Dataset split
- whole dataset: 100%
- training set: ~80% (model selection)
- test set: ~20% (evaluation)
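The roughly 80/20 split described above can be sketched with the standard library alone. The helper name `split_train_test` and the fixed seed are choices made for this example; machine-learning libraries usually ship an equivalent utility.

```python
import random


def split_train_test(rows, test_ratio=0.2, seed=42):
    """Shuffle the rows reproducibly and hold out a fraction as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    # The first n_test shuffled rows become the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_train_test(range(100))
print(len(train), "training rows,", len(test), "test rows")
```

Shuffling before splitting avoids a test set that only contains, say, the most recent invoices. The test set is then left untouched until the final evaluation.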

Evaluation
Is it really “The One”?
Goals
- testing the best candidate model on the test set to see how it behaves with unseen data (generalization)
[Figure: chart plotted against model complexity (# of parameters)]

Classification Metrics
Confusion Matrix
Precision (“1-5” class)
TP / (TP + FP) = 6674 / 10087 = 66.16%
10087 samples predicted as “1-5”: 6674 TP + 3413 FP
Recall (“1-5” class)
TP / (TP + FN) = 6674 / 10542 = 63.31%
10542 samples that are really “1-5”: 6674 TP + 3868 FN
Accuracy (unweighted)
Pc / Pt = 23872 / 32382 = 73.72%
32382 total predictions (Pt): 23872 correct (Pc) + 8510 errors
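The three metrics can be recomputed directly from the counts reported above. This is a minimal sketch: the function names are illustrative, and the counts are the ones given for the “1-5” class.

```python
def precision(tp: int, fp: int) -> float:
    """Share of samples predicted as the class that really belong to it."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Share of samples really in the class that the model found."""
    return tp / (tp + fn)


def accuracy(correct: int, total: int) -> float:
    """Share of all predictions that are correct, across every class."""
    return correct / total


# Counts for the "1-5" class, copied from the confusion matrix figures.
TP, FP, FN = 6674, 3413, 3868
CORRECT, TOTAL = 23872, 32382

print(f"precision: {precision(TP, FP):.2%}")
print(f"recall:    {recall(TP, FN):.2%}")
print(f"accuracy:  {accuracy(CORRECT, TOTAL):.2%}")
```

Precision and recall are computed per class, while accuracy summarizes all classes at once, which is why it can be higher than both.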

Don’t Try This at Home!
Just clone the following repository and have fun!
https://github.com/klinamen/ds0-experimenting-with-data

Thank you.
Andrea Montemaggio
head of data practice @ mashfrog group
andrea.montemaggio@mashfrog.com
github.com/klinamen
linkedin.com/in/amontemaggio
experiments never fail.
