Introduction to the implementation of Data Science projects in organizations, with a practice session on how to apply machine-learning techniques to a business problem.
Notebook of the practice session is available at https://github.com/klinamen/ds0-experimenting-with-data
Experimenting with Data!
1. Experimenting with Data!
Andrea Montemaggio
head of data practice @ mashfrog group
andrea.montemaggio@mashfrog.com
github.com/klinamen
linkedin.com/in/amontemaggio
2. Data Science Trend
Source: Google Trends
Keyword: “Data Science”
Wikipedia on “Data Science”
Data science is a "concept to unify statistics, data analysis, informatics, and their related methods" in order to "understand and analyze actual phenomena" with data. It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, information science, and domain knowledge.
3. Workflow Overview
Scoping & Data Acquisition
Data Analysis, Modeling and Prototyping
- Exploratory Data Analysis
- Experimental loop: Data preparation, Feature selection and extraction, Model selection, Evaluation
- Solution Prototyping
Engineering & Deployment
Monitoring and Tuning
4. Scenario
We are a wholesale distributor and want to improve our receivables collection strategy.
Knowing as early as possible whether an invoice is going to be paid on time would allow us to plan ahead and target our collection efforts at the most critical situations first.
How long will it take to cash a given invoice?
5. Scoping
[Diagram: Invoice data + Customers data + … → Data Acquisition → Data preparation → Feature selection and extraction → Model selection → Evaluation → Trained Classification Model. A new invoice given to the trained model is labeled “on time”, “1-30 days late”, or “> 30 days late”.]
Supervised Machine-Learning
A trained classification model is able to assign an invoice to one of a predefined set of classes.
Using historical enterprise data and machine learning to predict whether or not an invoice is likely to be paid on time can help organizations optimize their invoice-to-cash flow.
Problem
How long will it take to cash a given invoice?
- Classification: predicting a discrete label (e.g. “on-time”, “1-30”).
- Regression: predicting a continuous quantity (e.g. 17.5 days).
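The two framings of the problem can be illustrated with a minimal sketch. It assumes each invoice has a due date and a payment date; the function names and the class boundaries (taken from the three labels shown on the slide) are illustrative and not necessarily those used in the notebook.

```python
from datetime import date


def delay_in_days(due: date, paid: date) -> int:
    """Regression target: days between due date and payment (negative means early)."""
    return (paid - due).days


def delay_class(days_late: int) -> str:
    """Classification target: bucket the delay into three assumed classes."""
    if days_late <= 0:
        return "on time"
    if days_late <= 30:
        return "1-30 days late"
    return "> 30 days late"


# Hypothetical invoice: due on 1 March, paid on 18 March.
days = delay_in_days(date(2023, 3, 1), date(2023, 3, 18))
label = delay_class(days)
print(days, label)
```

The same raw information (due date and payment date) feeds both targets; the choice between them depends on whether the business needs an exact estimate or only a priority bucket.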
Pipeline: Scoping & Data Acquisition → Exploratory Data Analysis → Data preparation → Feature engineering → Model selection & training → Evaluation
6. Dataset
Dataset: https://www.kaggle.com/datasets/himanshu007121/invoice-data
Description: wholesale invoice data extracted from an accounting system (SAP?) in CSV format.
Each record describes a document and has, among others, these pieces of information:
- the branch that issued the document
- customer information
- total amount
- due date
- payment date
Geometry (rows × cols): 50,000 × 19
Size: 7.17 MB
7. Exploratory Data Analysis
Getting to know your data.
Goals
- Data understanding
- Data quality assessment (e.g. missing data, encoding problems, inconsistencies)
- Assessing value distributions and correlations
Tools
- Excel (!)
- Programming languages: Python¹, R¹
- Low/no-code integrated data analysis tools such as OpenRefine¹, Orange¹, KNIME, RapidMiner
- Statistical software packages
¹ FLOSS (free/libre and open-source software)
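A first pass at the goals above (data quality and value distributions) might look like the following sketch. It uses only the Python standard library; the inline sample and its column names are hypothetical stand-ins for the real CSV file, whose actual headers are not listed on the slides.

```python
import csv
import io
import statistics
from collections import Counter

# Hypothetical sample standing in for the real CSV; column names are assumptions.
SAMPLE = """branch,customer,total_amount,due_date,payment_date
B01,ACME,1200.50,2023-03-01,2023-03-18
B01,Globex,310.00,2023-03-05,
B02,ACME,,2023-03-10,2023-03-09
B02,Initech,89.90,2023-03-12,2023-04-20
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Data quality: count empty cells per column.
missing = Counter(
    column for row in rows for column, value in row.items() if value == ""
)

# Value distribution: basic statistics on the amounts that are present.
amounts = [float(row["total_amount"]) for row in rows if row["total_amount"]]
summary = {
    "count": len(amounts),
    "min": min(amounts),
    "max": max(amounts),
    "median": statistics.median(amounts),
}

print(dict(missing))
print(summary)
```

On the real file, `io.StringIO(SAMPLE)` would be replaced by an open file handle.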
8. Data Preparation
Cook before eating.
Goals
- Processing raw data (or primary data), which is rarely ready to feed your algorithms
- Fixing missing values and inconsistencies
- Converting between different representations of the same datum (e.g. dates, decimal numbers)
Tools
- Python¹
- Visual tools: OpenRefine¹, AWS Glue DataBrew
¹ FLOSS (free/libre and open-source software)
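Converting between representations of the same datum can be sketched as below. The accepted date formats and the rule that a comma marks a European-style decimal separator are assumptions made for illustration; the real dataset may need different rules.

```python
from datetime import date, datetime
from typing import Optional

# Assumed input formats; extend this tuple to match the actual data.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def parse_date(text: str) -> Optional[date]:
    """Return a date for any known format, or None for a missing value."""
    text = text.strip()
    if not text:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def parse_amount(text: str) -> Optional[float]:
    """Return a float, or None for a missing value.

    Assumption: if a comma is present it is the decimal separator and
    dots are thousands separators (e.g. "1.234,56").
    """
    text = text.strip()
    if not text:
        return None
    if "," in text:
        text = text.replace(".", "").replace(",", ".")
    return float(text)


print(parse_date("18/03/2023"), parse_amount("1.234,56"))
```

Returning `None` for empty cells keeps missing values explicit, so a later step can decide whether to drop or impute them.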
9. Feature Engineering
Knowledge is power.
Goals
- Using domain knowledge to augment data with derived information (feature extraction), which usually leads to better performance of ML models
- Selecting the smallest set of features with the greatest significance (feature selection)
- Removing redundant or useless information
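As an illustration of feature extraction driven by domain knowledge, the sketch below derives a few features from a due date and from a customer's payment history. The history records, feature names, and the "end of month" threshold are hypothetical; they are not taken from the dataset or the notebook.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical past invoices: (customer, due date, payment date).
history = [
    ("ACME", date(2023, 1, 10), date(2023, 1, 25)),
    ("ACME", date(2023, 2, 10), date(2023, 2, 15)),
    ("Globex", date(2023, 1, 31), date(2023, 1, 30)),
]

# Average past delay per customer, in days (negative means paid early).
delays = defaultdict(list)
for customer, due, paid in history:
    delays[customer].append((paid - due).days)
avg_delay = {customer: mean(values) for customer, values in delays.items()}


def extract_features(customer: str, due: date) -> dict:
    """Derive features for a new invoice from its due date and customer history."""
    return {
        "due_weekday": due.weekday(),   # 0 = Monday ... 6 = Sunday
        "due_near_month_end": due.day >= 25,  # assumed threshold
        # Unknown customers fall back to 0.0 (an arbitrary default).
        "customer_avg_delay": avg_delay.get(customer, 0.0),
    }


features = extract_features("ACME", date(2023, 3, 31))
print(features)
```

When building such history-based features for training, only invoices settled before the invoice being scored should contribute, otherwise information from the future leaks into the model.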
10. Model Selection and Training
One model does not fit all.
Goals
- Identifying candidate models for the problem and dataset at hand
- Splitting the dataset into a training set and a test set
- Training the candidate models and comparing their performance
- Optimizing hyperparameters (i.e. model parameters that control the learning process) and fine-tuning to select “The One”
[Diagram: the whole dataset (100%) is split into a training set (~80%), used for model selection, and a test set (~20%), used for evaluation.]
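The 80/20 split can be sketched with the standard library alone. `split_train_test` is a hypothetical helper written for this example (it is not a library function), and the integer list merely stands in for the 50,000 invoice records.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out a fraction as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for the 50,000 invoice records.
invoices = list(range(50_000))
train, test = split_train_test(invoices)
print(len(train), len(test))
```

The test set is put aside and not touched during model selection, so that the final evaluation reflects behavior on unseen data.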
11. Evaluation
Is it really “The One”?
Goals
- Testing the best candidate model on the test set to see how it behaves with unseen data (generalization)
[Figure: plot with axis “Model complexity (# of parameters)”]
12. Classification Metrics
[Figure: confusion matrix]
Precision (“1-5” class)
TP / (TP + FP) = 6674 / 10087 = 66.16%
10087 samples predicted as “1-5”: 6674 TP + 3413 FP
Recall (“1-5” class)
TP / (TP + FN) = 6674 / 10542 = 63.31%
10542 samples that are really “1-5”: 6674 TP + 3868 FN
Accuracy (unweighted)
Pc / Pt = 23872 / 32382 = 73.72%
32382 total predictions (Pt): 23872 correct predictions (Pc) + 8510 errors
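The three metrics can be recomputed from the counts on the slide with a few lines of Python. The variable names are chosen for this example; only the counts come from the slide.

```python
# Counts for the "1-5" class, as reported on the slide.
tp = 6674   # predicted "1-5" and really "1-5"
fp = 3413   # predicted "1-5" but really another class
fn = 3868   # really "1-5" but predicted as another class

# Overall counts across all classes.
correct = 23872
total = 32382

precision = tp / (tp + fp)   # share of "1-5" predictions that are right
recall = tp / (tp + fn)      # share of real "1-5" samples that are found
accuracy = correct / total   # share of all predictions that are right

print(f"precision={precision:.2%} recall={recall:.2%} accuracy={accuracy:.2%}")
```

Precision and recall are computed per class, while accuracy summarizes all classes at once, which is why it can look better than the per-class figures.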
13. Don’t Try This at Home!
Just clone the following repository and have fun!
https://github.com/klinamen/ds0-experimenting-with-data
14. Thank you.
Andrea Montemaggio
head of data practice @ mashfrog group
andrea.montemaggio@mashfrog.com
github.com/klinamen
linkedin.com/in/amontemaggio
experiments never fail.