Feature surfacing
Discover, Aggregate & Evaluate
Jean-Baptiste PRIEZ
8 March 2017
Feature engineering
[Figure: the XOR problem — points plotted on the X and Y axes are not linearly separable, but the engineered feature Z = (XY > 0) separates them on a single axis.]
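The XOR figure can be sketched as follows (a minimal numpy illustration with assumed toy values; the label marks points where X and Y have the same sign):

```python
import numpy as np

# Four XOR-style points: the label is 1 when X and Y have the same sign.
X = np.array([-1.0, -1.0, 1.0, 1.0])
Y = np.array([-1.0, 1.0, -1.0, 1.0])
label = np.array([1, 0, 0, 1])

# Neither X nor Y alone separates the classes, but the engineered
# feature Z = (X*Y > 0) does, with a trivial threshold.
Z = (X * Y > 0).astype(int)
print((Z == label).all())  # True
```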
Users, Sales, Web

Users(CustomerId, Firstname, Lastname, Age)
Sales(CustomerId, Product, Amount, Time)
Web(CustomerId, Page, Time)

Generated features in the central table:
Users.CustomerId
Users.Firstname
Users.Lastname
Users.Age
Outcome
Count(Sales.Product)
CountDistinct(Sales.Product)
Mean(Sales.Amount)
Sum(Sales.Amount) where Sales.Product = 'Mobile Data'
Count(Web.Page) where Day(Web.Time) in [6;7]
…
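The features above can be sketched in pandas (toy data assumed; table and column names taken from the schema, with `Day(Web.Time) in [6;7]` read as ISO weekday 6–7, i.e. the week-end):

```python
import pandas as pd

# Hypothetical toy versions of the three tables.
users = pd.DataFrame({"CustomerId": [1, 2],
                      "Firstname": ["Ada", "Bob"], "Age": [35, 42]})
sales = pd.DataFrame({"CustomerId": [1, 1, 2],
                      "Product": ["Mobile Data", "Phone", "Mobile Data"],
                      "Amount": [10.0, 200.0, 15.0]})
web = pd.DataFrame({"CustomerId": [1, 2, 2],
                    "Page": ["home", "home", "offers"],
                    "Time": pd.to_datetime(
                        ["2017-03-04", "2017-03-06", "2017-03-05"])})

# Aggregate each peripheral table per entity of the central table.
feats = users.set_index("CustomerId")
by_cust = sales.groupby("CustomerId")
feats["Count(Sales.Product)"] = by_cust["Product"].count()
feats["CountDistinct(Sales.Product)"] = by_cust["Product"].nunique()
feats["Mean(Sales.Amount)"] = by_cust["Amount"].mean()

# Filter + aggregate: Sum(Sales.Amount) where Product = 'Mobile Data'.
mob = sales[sales["Product"] == "Mobile Data"]
feats["Sum(Amount) where Mobile Data"] = mob.groupby("CustomerId")["Amount"].sum()

# Count(Web.Page) where Day(Web.Time) in [6;7] (ISO: 6 = Saturday, 7 = Sunday).
wk = web[web["Time"].dt.isocalendar().day.isin([6, 7])]
feats["Count(Web.Page) week-end"] = wk.groupby("CustomerId")["Page"].count()
print(feats)
```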
Feature surfacing
LET’S START WITH AN EXAMPLE…
Example: Outbound Mail Campaign
1 Central Table
Customer
(Id, e-mail address, age, state,
will buy within 5 days)
Example: Outbound Mail Campaign
3 Peripheral Tables:
• Pages visited on the website (visited pages, duration of the session, browser type…)
• Orders (number of products, amount spent, order status…)
• E-mail campaign reactions (action, action type, time since e-mail was sent…)
Which sources and variables should we choose?
How should we represent them?
Should we sit and meditate around a table?
Should we try each and every variable manually?
What if we let the machine work?
It’s a Machine Learning problem:
• How to smartly explore the entire set of possible aggregates?
• Without under/overfitting?
• With linearithmic complexity?
What is Feature Surfacing?
1. Extraction of information contained in a
multi-table data source
• Aggregation operators
• Filter operators
2. Evaluation of aggregates extracted from a
star-relational data schema
Feature surfacing consists in applying a set of
aggregation operators on the peripheral tables to
generate features in the central table.
[Figure: star schema — one central table linked to four peripheral tables; each central-table record (cardinality 1,1) corresponds to several peripheral-table rows (cardinality 0,n).]
* 1 row per entity in the central table, corresponding to several rows for the same entity in the peripheral table.
Extraction → Evaluation (supervised)
What are the operators?
Some aggregation operators:
Name          | Return type | Operands   | Label
Count         | Num         | Table      | Number of records
CountDistinct | Num         | Table, Cat | Number of distinct values
Mode          | Cat         | Table, Cat | Most frequent value
Mean          | Num         | Table, Num | Mean value
StdDev        | Num         | Table, Num | Standard deviation
Median        | Num         | Table, Num | Median value
Min           | Num         | Table, Num | Min value
Max           | Num         | Table, Num | Max value
Sum           | Num         | Table, Num | Sum of values
Some filter operators:
Name | Return type | Operands     | Label
<, ≤ | Table       | Table, Num   | Table filtered over field values smaller than (or equal to) a record
>, ≥ | Table       | Table, Num   | Table filtered over field values greater than (or equal to) a record
=    | Table       | Table, Field | Table filtered over field values equal to a record
Customize your operators:
• Date: before, after, week-end, etc.
• Time: morning, afternoon, etc.
• String: split, infinitive verb, etc.
• ...
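Custom operators of this kind can be sketched as small table filters (pandas; the helper names and thresholds are assumptions for illustration):

```python
import pandas as pd

def weekend(table, time_col):
    """Date operator: keep rows whose timestamp falls on Saturday or Sunday."""
    return table[table[time_col].dt.dayofweek >= 5]  # Mon=0 … Sun=6

def morning(table, time_col):
    """Time operator: keep rows between 06:00 and 11:59."""
    return table[table[time_col].dt.hour.between(6, 11)]

web = pd.DataFrame({"Page": ["home", "offers", "cart"],
                    "Time": pd.to_datetime(
                        ["2017-03-04 09:30",    # Saturday morning
                         "2017-03-06 10:00",    # Monday morning
                         "2017-03-05 20:00"])}) # Sunday evening
print(len(weekend(web, "Time")))  # 2
print(len(morning(web, "Time")))  # 2
```

Any such filter composes with the aggregation operators, e.g. Count over `weekend(web, "Time")`.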
Presentation of some smart aggregates
1. Count(Pages visited): number of visited pages by the customer
2. Max(Orders, amount spent): the maximal amount spent by the customer
3. Mode(Email reactions, action type): the most frequent email request of the customer
4. Median(Pages visited, duration) when Pages visited.device = “smartphone”
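Aggregate no. 4, a filter followed by an aggregation, can be sketched in pandas (toy data and column names assumed):

```python
import pandas as pd

# Hypothetical "Pages visited" peripheral table.
pages = pd.DataFrame({"CustomerId": [1, 1, 1, 2],
                      "duration": [30.0, 90.0, 60.0, 45.0],
                      "device": ["smartphone", "smartphone",
                                 "desktop", "smartphone"]})

# Filter: keep smartphone visits only; then aggregate: median duration.
sm = pages[pages["device"] == "smartphone"]
feat = sm.groupby("CustomerId")["duration"].median()
print(feat.to_dict())  # {1: 60.0, 2: 45.0}
```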
How to be smart?
Good aggregates, by increasing construction depth:
• 1st: Aggregation ☀❤🐰
• 2nd: Filter + Aggregation ⭐
• 3rd: Filter + Filter + Aggregation ⚠♨🤔
• … etc. ⛔🔞
M. BOULLÉ. Towards Automatic Feature Construction for Supervised Classification. In ECML/PKDD, pp. 181-196, 2014.
How to evaluate and select features?
• Discretization / Grouping → correlation with the target
• Select the most correlated features
[Figure: a numeric feature discretized against the target set (ex: sick, healthy), split such that the trade-off between entropy & compression is optimal.]
Discretization algorithms
• ChiMerge (R, SAS)
• Optimize entropy
• C4.5 (…)
• Optimize compression
• Fusinter (Zighed & co - Sipina)
• MDL-disc / MDLP (Fayyad & Irani, Pfahringer - Spark)
• MODL (Boullé)
• Optimize both: entropy & compression
Popularize: MODL
Target set (ex: sick, healthy). A numeric variable over $n$ instances is split into $I$ intervals $i_1, \dots, i_I$; there are $J$ target values, $n_i$ instances in interval $i$, of which $n_{i,j}$ belong to class $j$.
Discretize with MODL = minimize the following formula:
$$\mathit{Value}(D) = \log n + \log \binom{n+I-1}{I-1} + \sum_{i=1}^{I} \log \binom{n_i+J-1}{J-1} + \sum_{i=1}^{I} \log \frac{n_i!}{n_{i,1}!\, n_{i,2}! \cdots n_{i,J}!}$$
The first terms measure compression (the cost of coding the partition and the per-interval class distributions); the last sum measures entropy (the cost of coding the class labels within each interval).
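The criterion can be sketched with the standard library, using log-factorials via `math.lgamma`; the per-interval class counts below are hypothetical:

```python
from math import lgamma, log

def log_binom(a, b):
    """log of the binomial coefficient C(a, b)."""
    return lgamma(a + 1) - lgamma(b + 1) - lgamma(a - b + 1)

def modl_value(counts):
    """MODL criterion for a discretization D.
    counts[i][j] = number of instances of class j in interval i."""
    I = len(counts)
    J = len(counts[0])
    n_i = [sum(row) for row in counts]
    n = sum(n_i)
    # Cost of coding I and the interval bounds (compression).
    value = log(n) + log_binom(n + I - 1, I - 1)
    for row, ni in zip(counts, n_i):
        # Cost of coding the class distribution of interval i (compression).
        value += log_binom(ni + J - 1, J - 1)
        # Cost of coding the class labels within interval i (entropy):
        # log( n_i! / (n_{i,1}! ... n_{i,J}!) ).
        value += lgamma(ni + 1) - sum(lgamma(c + 1) for c in row)
    return value

# Pure intervals code the labels for free, so they score better than mixed ones.
pure = [[10, 0], [0, 10]]
mixed = [[5, 5], [5, 5]]
print(modl_value(pure) < modl_value(mixed))  # True
```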
Interpretation of smart aggregates calculated over the visited pages table
For each customer: Count(VisitedPages) = number of visited pages
The interpretation graphic shows that:
• there is a niche of future buyers: those who have visited more than 96.5 pages over the period (top segment)
• the majority of the base has visited no or only a few pages of the site over the period
For each customer: Median(VisitedPages, duration) = median duration of stay on a specific page
+ & -
• + Good complexity
• + Statistically efficient
• + Manages overfitting by design
• - Not enough to win every Kaggle contest…
Let’s stay in	touch!
Jean-Baptiste PRIEZ
Data Scientist
jbp@predicsis.ai
