Learning Content
Data Science basics
Content
1. Supervised classification
1. Performance indicators: Gini, Lift, Level, Weight, Contribution, etc.
2. Statistical reduction: Discretization / Grouping
3. Naïve Bayes and SNB
4. Feature Engineering: feature surfacing
2. Relational database
1. Table, column, field, attribute
2. Types: categorical, numeric and date
3. Aggregation and filters
3. Methodology: CRISP-DM
• Step-by-step
4. Zoom on Data Preparation
1. Description (what drives the outcome)
2. Prediction (how to act on it)
3. Expected data
1. Train and test data
2. Central and Peripheral datasets
3. Central dataset
4. Peripheral dataset
5. CSV format
6. Settings
http://predicsis-ai-doc.readthedocs.io
http://predicsis-python-sdk.readthedocs.io
Supervised classification
Knowledge database about …
Knowledge database about Alien Invasion
Supervised classification
Alien?
Yes No
Supervised classification on Data
Example
• Medical data
• Tumors size, patient age
• Knowledge about cancer
PredicSis.ai learns and classifies:
Tumor size
Patient age
Use case PredicSis.ai: Fidelio
Bank clients
• Using bank account data,
• Could we identify which clients have an appetite for the loyalty loan Fidelio?
Go to PredicSis.ai !!
Performance indicators
Performance: a measure of statistical dispersion
• 0 ~ random
• 1 ~ perfect prediction
Stability: the coherence between train and test data with respect to the model.
• 0 ~ incoherent
• 1 ~ totally coherent
Main outcome frequency: the percentage of records with the main outcome modality in the test data
Performance indicators (2)
Precision: the rate of the main outcome modality within a quantile.
Cumulative gain: The percentage of
the outcome main modality in the top
x% of the population.
Lift: The ratio of the proportion of
main outcome modality in the quantile
compared to the overall proportion of
main outcome modality.
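The two ranking metrics above can be sketched in a few lines. This is an illustrative implementation (not PredicSis.ai code); function and variable names are our own:

```python
# Sketch: cumulative gain and lift for the top x% of a population
# ranked by predicted score. `scores` are model outputs, `outcomes`
# are the binary target (1 = main outcome modality).

def cumulative_gain(scores, outcomes, top_fraction):
    """Share of all positives captured in the top x% by score."""
    ranked = sorted(zip(scores, outcomes), key=lambda p: p[0], reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    positives_in_top = sum(y for _, y in ranked[:k])
    return positives_in_top / sum(outcomes)

def lift(scores, outcomes, top_fraction):
    """Positive rate in the top x% divided by the overall positive rate."""
    ranked = sorted(zip(scores, outcomes), key=lambda p: p[0], reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    top_rate = sum(y for _, y in ranked[:k]) / k
    overall_rate = sum(outcomes) / len(outcomes)
    return top_rate / overall_rate

scores   = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.15, 0.1, 0.05, 0.01]
outcomes = [1,   1,   0,   1,   0,   0,   0,    0,   0,    0]   # 3 positives
print(cumulative_gain(scores, outcomes, 0.2))  # top 2 are both positive: 2/3
print(lift(scores, outcomes, 0.2))             # 1.0 / 0.3 ≈ 3.33
```

A lift of 1 means the top segment is no better than random; higher is better.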
Contributive features
Level: The measure of the correlation of a feature to the outcome feature.
• 0 ~ no correlation
• 1 ~ fully discriminant
Contributive features (2)
Weight: The contribution measure of the feature to the predictive performance.
• 0 ~ not used, redundant
• 1 ~ fully contributive
Contributive features (3)
Contribution: A normalized mean of level and weight.
Contributive features (4)
Optimal discretization:
• Each interval / group is statistically homogeneous
• MODL (Boullé; 2003)
• Coverage: The percentage of the population in
a class.
• Frequency: The main outcome feature rate in a class.
Discretization / Grouping
Target set (e.g. sick, healthy), split such that the trade-off between entropy and compression is optimal
Feature engineering! Why?
Example (XOR): neither X nor Y alone is correlated with the outcome, but the constructed feature Z = (X·Y > 0) separates the two classes perfectly.
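The XOR example can be checked directly. A minimal sketch (variable names are ours):

```python
# Why feature engineering matters on XOR-like data: neither coordinate
# alone separates the classes, but the constructed feature X*Y does.

points = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
labels = [True, False, False, True]   # Z = (X*Y > 0): same-sign points are positive

# A threshold on X alone does not reproduce the labels...
x_alone = [x > 0 for x, _ in points]
print(x_alone == labels)              # False

# ...but the engineered feature X*Y separates them perfectly.
engineered = [x * y > 0 for x, y in points]
print(engineered == labels)           # True
```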
Discretization / Grouping
Target set (e.g. sick, healthy); intervals I: i_1 i_2 i_3 i_4 i_5 i_6 i_7; n records in total.
Discretize with MODL = minimize the following criterion:

Value(D) = log n + log C(n+I−1, I−1) + Σ_{i=1}^{I} log C(n_i+J−1, J−1) + Σ_{i=1}^{I} log [ n_i! / (n_{i,1}! n_{i,2}! … n_{i,J}!) ]

where C(·,·) is the binomial coefficient, I the number of intervals, J the number of target classes, n_i the number of records in interval i, and n_{i,j} the number of records of class j in interval i. The first three terms measure compression (model cost); the last term measures entropy (fit of the target within each interval).
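The criterion above can be evaluated numerically with log-factorials. This is a sketch of the cost function only (not PredicSis.ai's implementation, and without the search over interval boundaries):

```python
# MODL discretization cost (natural logs), via math.lgamma for stability.
# `counts` has one row per interval and one column per target class.
import math

def log_fact(n):
    return math.lgamma(n + 1)

def log_binom(n, k):
    return log_fact(n) - log_fact(k) - log_fact(n - k)

def modl_cost(counts):
    I = len(counts)                                  # number of intervals
    J = len(counts[0])                               # number of target classes
    n = sum(sum(row) for row in counts)
    cost = math.log(n)                               # choice of I
    cost += log_binom(n + I - 1, I - 1)              # choice of interval bounds
    for row in counts:
        n_i = sum(row)
        cost += log_binom(n_i + J - 1, J - 1)        # class distribution prior
        cost += log_fact(n_i) - sum(log_fact(c) for c in row)  # entropy term
    return cost

# A class-pure split should cost less than lumping everything together:
pure  = [[10, 0], [0, 10]]
mixed = [[10, 10]]
print(modl_cost(pure) < modl_cost(mixed))   # True
```

Minimizing this cost over all possible interval partitions yields the optimal discretization.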
Naïve Bayes / SNB
Naïve Bayes / SNB (2)
Selective Naïve Bayes: an ensemble algorithm (feature selection plus model averaging) used to improve Naïve Bayes
Boullé, 2007
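For reference, here is a minimal categorical Naïve Bayes classifier with Laplace smoothing. It illustrates the base algorithm only, not the SNB variant; all names and the toy data are ours:

```python
# Minimal categorical Naïve Bayes sketch with Laplace smoothing.
from collections import Counter, defaultdict
import math

def train_nb(rows, labels):
    classes = Counter(labels)
    cond = defaultdict(Counter)       # (feature index, class) -> value counts
    for row, y in zip(rows, labels):
        for j, v in enumerate(row):
            cond[(j, y)][v] += 1
    return classes, cond

def predict_nb(model, row):
    classes, cond = model
    total = sum(classes.values())
    best, best_lp = None, -math.inf
    for y, ny in classes.items():
        lp = math.log(ny / total)                    # log prior
        for j, v in enumerate(row):
            counts = cond[(j, y)]
            # Laplace smoothing over values seen for this feature/class
            lp += math.log((counts[v] + 1) / (ny + len(counts) + 1))
        if lp > best_lp:
            best, best_lp = y, lp
    return best

rows   = [("small", "red"), ("small", "red"), ("large", "blue"), ("large", "blue")]
labels = ["toy", "toy", "furniture", "furniture"]
model = train_nb(rows, labels)
print(predict_nb(model, ("small", "red")))   # toy
```

SNB extends this by searching for the subset of features that best compresses the data, then averaging the resulting models (Boullé, 2007).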
Relational database
Relational database: Table, column, attribute
(Diagram labels: Table, Column / Attribute / Field, Record)
Relational database: Types
Categorical Numeric Date
Relational database
Client id Firstname Lastname …
AAAAA James Bond …
BBBBB Robert Gros …
CCCCC Johny Bigoude …
Client id Contract id Prime $/month …
AAAAA iiiiiii 40 …
AAAAA hhhh 5 …
CCCCC kkkk 67 …
CRM Contract
Contract id Sinister id …
iiiiiii 222222 …
hhhh 555555 …
hhhh 555555 …
kkkk 777777 …
Sinisters
City District …
Paris 1 …
Paris 2 …
London City …
Chicago … …
Area record
(Diagram: relationship cardinalities between the tables: 1:n / 1:1, 1:n / 1:1, 0:n / 1:1)
Client id Contract id …
AAAAA hhhh …
AAAAA pppp …
AAAAA hhhh …
BBBBB yyyy …
Support records (cardinalities: 0:n / 1:1, 0:1 / 0:n)
Use case PredicSis.ai: Outbound Mail Campaign
How to optimize a mail campaign?
• Using CRM data about customers:
• Orders
• Webpages visited
• Campaigns
• Could we identify which clients have an appetite for red t-shirts?
Go to PredicSis.ai !!
Outbound Mail Campaign
3 peripheral tables:
• Pages visited on the website (visited pages, duration of the session, browser type…)
• Orders (number of products, amount spent, order status…)
• E-mail campaign reactions (action, action type, time since e-mail was sent…)
What is Feature Surfacing?
1. Extraction of information contained in a multi-table data
source
• Aggregation operators
• Filter operators
2. Evaluation of aggregates extracted from a star-relational
data schema
Feature surfacing consists of applying a set of aggregation operators to the peripheral tables to generate features in the central table.
Central table
Peripheral table 1 Peripheral table 2
Peripheral table 3 Peripheral table 4
(Star schema diagram: each peripheral table is linked to the central table with cardinality 0,n on the peripheral side and 1,1 on the central side.)
* 1 row per entity in the central table, corresponding to
several rows for the same entity in the peripheral table.
Extraction → Evaluation (supervised)
Relational database: Join
SELECT * FROM CRM JOIN Contract ON CRM."Client id" = Contract."Client id"
Client id Firstname Lastname …
AAAAA James Bond …
BBBBB Robert Gros …
CCCCC Johny Bigoude …
Client id Contract id Prime $/month …
AAAAA iiiiiii 40 …
AAAAA hhhh 5 …
CCCCC kkkk 67 …
CRM Contract
1:n
1:1
Client id Contract id Prime $/month Firstname Lastname …
AAAAA iiiiiii 40 James Bond …
AAAAA hhhh 5 James Bond …
CCCCC kkkk 67 Johny Bigoude …
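The same join can be sketched in plain Python: index CRM by client id, then extend each Contract row with the matching CRM fields (data copied from the tables above):

```python
# Sketch of the join above: CRM indexed by client id, then each
# Contract row is extended with the matching first/last name.
crm = {
    "AAAAA": ("James", "Bond"),
    "BBBBB": ("Robert", "Gros"),
    "CCCCC": ("Johny", "Bigoude"),
}
contracts = [
    ("AAAAA", "iiiiiii", 40),
    ("AAAAA", "hhhh", 5),
    ("CCCCC", "kkkk", 67),
]

joined = [(cid, con, prime) + crm[cid] for cid, con, prime in contracts]
for row in joined:
    print(row)   # e.g. ('AAAAA', 'iiiiiii', 40, 'James', 'Bond')
```

Note that BBBBB has no contract, so (as with an inner join) it does not appear in the result.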
Relational database: Aggregate
SELECT "Client id", COUNT("Contract id")
FROM Contract
GROUP BY "Client id"
Client id Contract id Prime $/month …
AAAAA iiiiiii 40 …
AAAAA hhhh 5 …
CCCCC kkkk 67 …
Contract
Client id Count(Contract id)
AAAAA 2
CCCCC 1
DDDD 42
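The COUNT … GROUP BY above is a one-liner in Python (data from the Contract table above):

```python
# The COUNT ... GROUP BY above, sketched with a plain Counter.
from collections import Counter

contracts = [("AAAAA", "iiiiiii"), ("AAAAA", "hhhh"), ("CCCCC", "kkkk")]
counts = Counter(client_id for client_id, _ in contracts)
print(counts)   # Counter({'AAAAA': 2, 'CCCCC': 1})
```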
Relational database: Aggregate (2)
Name Return type Operands Label
Count Num Table Number of records
CountDistinct Num Table, Cat Number of distinct attributes
Mode Cat Table, Cat Most frequent attribute
Mean Num Table, Num Mean value
StdDev Num Table, Num Standard deviation
Median Num Table, Num Median value
Min Num Table, Num Min value
Max Num Table, Num Max value
Sum Num Table, Num Sum of values
Client id Contract id Prime $/month Created at Resign at …
AAAAA iiiiiii 40 2017-01-15 …
AAAAA hhhh 5 2016-09-01 …
CCCCC kkkk 67 2015-09-01 2017-02-01 …
Relational database: Filter
SELECT *
FROM Contract
WHERE "Created at" < '2017-01-01'
Contract
Client id Contract id Prime $/month Created at Resign at
AAAAA iiiiiii 40 2017-01-15
AAAAA hhhh 5 2016-09-01
CCCCC kkkk 67 2015-09-01 2017-02-01
Relational database: Filter (2)
Name Return type Operands Label
<, ≤ Table Table, Num Table filtered to records whose field value is smaller than (or equal to) a given value
>, ≥ Table Table, Num Table filtered to records whose field value is greater than (or equal to) a given value
= Table Table, Field Table filtered to records whose field value is equal to a given value
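The date filter above translates directly to Python. A sketch (data from the Contract table above; ISO-formatted date strings compare correctly as strings):

```python
# The "Created at < 2017-01-01" filter, sketched as a list comprehension.
contracts = [
    ("AAAAA", "iiiiiii", 40, "2017-01-15"),
    ("AAAAA", "hhhh",    5,  "2016-09-01"),
    ("CCCCC", "kkkk",    67, "2015-09-01"),
]
before_2017 = [row for row in contracts if row[3] < "2017-01-01"]
print(len(before_2017))   # 2 (hhhh and kkkk)
```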
Presentation of some smart aggregates
1. Count(Pages visited)
2. Max(Orders, amount spent)
3. Mode(Email reactions, action type)
4. Median(Pages visited, duration) when Pages visited.device = “smartphone”
Number of pages visited by the customer
The maximal amount spent by the customer
The customer's most frequent email reaction
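Smart aggregate no. 4 (a filter followed by an aggregation) can be sketched as below. The field names and toy rows are illustrative, not PredicSis.ai's schema:

```python
# Median(Pages visited, duration) when device == "smartphone",
# computed for one customer over a toy visited-pages table.
from statistics import median

visited_pages = [
    {"client": "AAAAA", "duration": 12, "device": "smartphone"},
    {"client": "AAAAA", "duration": 40, "device": "desktop"},
    {"client": "AAAAA", "duration": 20, "device": "smartphone"},
    {"client": "AAAAA", "duration": 30, "device": "smartphone"},
]

smartphone = [p["duration"] for p in visited_pages
              if p["client"] == "AAAAA" and p["device"] == "smartphone"]
print(median(smartphone))   # 20
```

The desktop visit is filtered out before aggregating, which is exactly the Filter + Aggregation pattern discussed next.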
How to be smart?
• A good aggregate, from safest to riskiest:
• 1st: Aggregation ☀❤️️🐰
• 2nd: Filter + Aggregation ⭐
• 3rd: Filter + Filter + Aggregation ⚠♨️️
• etc. ⛔️🔞
M. Boullé. Towards Automatic Feature Construction for Supervised Classification. In ECML/PKDD, pp. 181-196, 2014.
Interpretation of smart aggregates
calculated over the visited pages table
Count(VisitedPages) = Number of visited pages
The interpretation graphic shows that:
• there is a niche of future buyers: those who have visited more than 96.5 pages over the period (top segment)
• the majority of the base has visited few or no pages of the site over the period
For each customer:
Median(VisitedPages, duration) = median duration of stay on a specific page
CRISP-DM methodology
CRISP-DM
• Business Understanding:
• Upsell, Cross-sell, Attrition, etc.
• Goals
• Data Understanding:
• Usability of internal data source
• Access
• Relevance to the business problem
• Data volume
• Are fields well populated?
• Exploit external data? Open data?
• Data Preparation:
• Cleaning
• Orchestration
• Build a portfolio
• Set a target
• Extract observations
CRISP-DM (2)
• Modeling and Evaluation :
• Use PredicSis.ai
• Validate data preparation
• Discriminating signals
• Expected signals (e.g. number of claims, etc.)
• Deployment:
• Move to production
• Exploit new knowledge reports
• Fast decision making
• Build a list of actionable items
Data Preparation
Cleaning
Highlighting
Orchestration
Zoom Data Preparation
Data Preparation – Cleaning / Highlighting
• Extraction of "hidden fields":
Ex: address
5bis, rue des Coquelicots, 75 001 Paris
→ STREET: 5bis, rue des Coquelicots
→ CITY: Paris
→ ZIP CODE: 75 001
→ AREA: 1
→ STATE: 75
→ etc.
PredicSis.ai doesn't understand business! It understands the correlation between a field and an outcome.
"Each record has a probability of being a target!"
• Normalizing values:
Ex: a manually entered "sex" field
M, m, g → M
F, f → F
Normalizing makes sense! Reports are easier to read and interpret.
Data Preparation - Orchestration
Translating the business question into DATA!
• Build a portfolio:
• All active (and cancelled) insurance contracts (churn fighting)
• All multi-equipped customers (cross-sell campaign)
• Etc.
• Period definitions:
• Date of reference (ex: November 2nd, 2016)
• Horizon (ex: 15 days of marketing campaign)
• Target period (ex: cancellation between November 17th and December 31st, 2016)
• Observation periods (claims over the last month, last 6 months, last year, over the entire contract life)
(Timeline: observation periods 1, 2 and 3 end at the date of reference; the horizon follows it, then the target period.)
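The period arithmetic above can be sketched with the standard library. The dates come from the example; the variable names and the 30/182-day approximations are ours:

```python
# Deriving target and observation windows from a reference date and a
# horizon, as in the orchestration example above.
from datetime import date, timedelta

reference = date(2016, 11, 2)           # date of reference
horizon = timedelta(days=15)            # marketing campaign lead time

target_start = reference + horizon      # target period begins after the horizon
target_end = date(2016, 12, 31)

# Observation periods look strictly backwards from the reference date
obs_last_month    = (reference - timedelta(days=30), reference)
obs_last_6_months = (reference - timedelta(days=182), reference)

print(target_start)   # 2016-11-17
```

Keeping observation periods strictly before the reference date avoids leaking target-period information into the features.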
Data Preparation – Orchestration (2)
Train / Test Data:
• Knowledge database
• Data with an outcome modality
Data to score:
• Data used to predict
The only difference should be the outcome field
Expected Data
1. File format:
All files must be CSV with:
• a header line
• UTF-8 encoding without BOM
• …
2. Train and test data:
At project creation, you must provide train and test files
3. Central dataset
Should have an index
4. Peripheral dataset
Contains the index field used to join it to the central dataset
5. Settings
• Separator ("\t", ",", ";" or "|")
• Join key / index
• Outcome field
• Main outcome
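Reading a file with these settings is straightforward with the standard csv module. A sketch with an in-memory stand-in for a file (column names are illustrative):

```python
# Reading a ";"-separated CSV with a header line, as described in the
# expected-data settings above. `raw` stands in for a UTF-8 file.
import csv, io

raw = "client_id;outcome\nAAAAA;yes\nBBBBB;no\n"
reader = csv.DictReader(io.StringIO(raw), delimiter=";")
rows = list(reader)
print(rows[0]["outcome"])   # yes
```

For a real file, `open(path, encoding="utf-8", newline="")` would replace the `io.StringIO` wrapper.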
Editor's Notes

• #17 Feature construction: feature generation to capture correlations