Feature surfacing
Discover, Aggregate & Evaluate
Jean-Baptiste PRIEZ
8 March 2017
Feature engineering
[Figure: the XOR problem — points plotted on the X and Y axes are not linearly separable, but the engineered feature Z = (XY > 0) separates them on a single axis.]
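The XOR figure can be sketched as follows (a minimal numpy illustration with assumed toy values; the label marks points where X and Y have the same sign):

```python
import numpy as np

# Four XOR-style points: the label is 1 when X and Y have the same sign.
X = np.array([-1.0, -1.0, 1.0, 1.0])
Y = np.array([-1.0, 1.0, -1.0, 1.0])
label = np.array([1, 0, 0, 1])

# Neither X nor Y alone separates the classes, but the engineered
# feature Z = (X*Y > 0) does, with a trivial threshold.
Z = (X * Y > 0).astype(int)
print((Z == label).all())  # True
```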
Users, Sales, Web

Users(CustomerId, Firstname, Lastname, Age)
Sales(CustomerId, Product, Amount, Time)
Web(CustomerId, Page, Time)

Generated features in the central table:
Users.CustomerId
Users.Firstname
Users.Lastname
Users.Age
Outcome
Count(Sales.Product)
CountDistinct(Sales.Product)
Mean(Sales.Amount)
Sum(Sales.Amount) where Sales.Product = 'Mobile Data'
Count(Web.Page) where Day(Web.Time) in [6;7]
…
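The features above can be sketched in pandas (toy data assumed; table and column names taken from the schema, with `Day(Web.Time) in [6;7]` read as ISO weekday 6–7, i.e. the week-end):

```python
import pandas as pd

# Hypothetical toy versions of the three tables.
users = pd.DataFrame({"CustomerId": [1, 2],
                      "Firstname": ["Ada", "Bob"], "Age": [35, 42]})
sales = pd.DataFrame({"CustomerId": [1, 1, 2],
                      "Product": ["Mobile Data", "Phone", "Mobile Data"],
                      "Amount": [10.0, 200.0, 15.0]})
web = pd.DataFrame({"CustomerId": [1, 2, 2],
                    "Page": ["home", "home", "offers"],
                    "Time": pd.to_datetime(
                        ["2017-03-04", "2017-03-06", "2017-03-05"])})

# Aggregate each peripheral table per entity of the central table.
feats = users.set_index("CustomerId")
by_cust = sales.groupby("CustomerId")
feats["Count(Sales.Product)"] = by_cust["Product"].count()
feats["CountDistinct(Sales.Product)"] = by_cust["Product"].nunique()
feats["Mean(Sales.Amount)"] = by_cust["Amount"].mean()

# Filter + aggregate: Sum(Sales.Amount) where Product = 'Mobile Data'.
mob = sales[sales["Product"] == "Mobile Data"]
feats["Sum(Amount) where Mobile Data"] = mob.groupby("CustomerId")["Amount"].sum()

# Count(Web.Page) where Day(Web.Time) in [6;7] (ISO: 6 = Saturday, 7 = Sunday).
wk = web[web["Time"].dt.isocalendar().day.isin([6, 7])]
feats["Count(Web.Page) week-end"] = wk.groupby("CustomerId")["Page"].count()
print(feats)
```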
Feature surfacing
LET’S START WITH AN EXAMPLE…
Example: Outbound Mail Campaign
1 Central Table
Customer
(Id, e-mail address, age, state,
will buy within 5 days)
Example: Outbound Mail Campaign
3 Peripheral Tables:
• Pages visited on the website (visited pages, duration of the session, browser type…)
• Orders (number of products, amount spent, order status…)
• E-mail campaign reactions (action, action type, time since e-mail was sent…)
Which sources and variables should we choose?
How should we represent them?
Should we sit and meditate around a table?
Should we try each and every variable manually?
What if we let the machine work?
It’s a Machine Learning problem:
• How to smartly explore the entire set of possible aggregates?
• Without under/overfitting?
• With linearithmic complexity?
What is Feature Surfacing?
1. Extraction of information contained in a
multi-table data source
• Aggregation operators
• Filter operators
2. Evaluation of aggregates extracted from a
star-relational data schema
Feature surfacing consists in applying a set of
aggregation operators on the peripheral tables to
generate features in the central table.
[Figure: star schema — one central table linked to four peripheral tables; each central-table record (cardinality 1,1) corresponds to several peripheral-table rows (cardinality 0,n).]
* 1 row per entity in the central table, corresponding to several rows for the same entity in the peripheral table.
Extraction → Evaluation (supervised)
What are the operators?
Some aggregation operators:
Name          | Return type | Operands   | Label
Count         | Num         | Table      | Number of records
CountDistinct | Num         | Table, Cat | Number of distinct values
Mode          | Cat         | Table, Cat | Most frequent value
Mean          | Num         | Table, Num | Mean value
StdDev        | Num         | Table, Num | Standard deviation
Median        | Num         | Table, Num | Median value
Min           | Num         | Table, Num | Min value
Max           | Num         | Table, Num | Max value
Sum           | Num         | Table, Num | Sum of values
Some filter operators:
Name | Return type | Operands     | Label
<, ≤ | Table       | Table, Num   | Table filtered over field values smaller than (or equal to) a record
>, ≥ | Table       | Table, Num   | Table filtered over field values greater than (or equal to) a record
=    | Table       | Table, Field | Table filtered over field values equal to a record
Customize your operators:
• Date: before, after, week-end, etc.
• Time: morning, afternoon, etc.
• String: split, infinitive verb, etc.
• ...
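Custom operators of this kind can be sketched as small table filters (pandas; the helper names and thresholds are assumptions for illustration):

```python
import pandas as pd

def weekend(table, time_col):
    """Date operator: keep rows whose timestamp falls on Saturday or Sunday."""
    return table[table[time_col].dt.dayofweek >= 5]  # Mon=0 … Sun=6

def morning(table, time_col):
    """Time operator: keep rows between 06:00 and 11:59."""
    return table[table[time_col].dt.hour.between(6, 11)]

web = pd.DataFrame({"Page": ["home", "offers", "cart"],
                    "Time": pd.to_datetime(
                        ["2017-03-04 09:30",    # Saturday morning
                         "2017-03-06 10:00",    # Monday morning
                         "2017-03-05 20:00"])}) # Sunday evening
print(len(weekend(web, "Time")))  # 2
print(len(morning(web, "Time")))  # 2
```

Any such filter composes with the aggregation operators, e.g. Count over `weekend(web, "Time")`.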
Presentation of some smart aggregates
1. Count(Pages visited): number of visited pages by the customer
2. Max(Orders, amount spent): the maximal amount spent by the customer
3. Mode(Email reactions, action type): the most frequent email request of the customer
4. Median(Pages visited, duration) when Pages visited.device = “smartphone”
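Aggregate no. 4, a filter followed by an aggregation, can be sketched in pandas (toy data and column names assumed):

```python
import pandas as pd

# Hypothetical "Pages visited" peripheral table.
pages = pd.DataFrame({"CustomerId": [1, 1, 1, 2],
                      "duration": [30.0, 90.0, 60.0, 45.0],
                      "device": ["smartphone", "smartphone",
                                 "desktop", "smartphone"]})

# Filter: keep smartphone visits only; then aggregate: median duration.
sm = pages[pages["device"] == "smartphone"]
feat = sm.groupby("CustomerId")["duration"].median()
print(feat.to_dict())  # {1: 60.0, 2: 45.0}
```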
How to be smart?
Good aggregates, by increasing construction depth:
• 1st: Aggregation ☀❤🐰
• 2nd: Filter + Aggregation ⭐
• 3rd: Filter + Filter + Aggregation ⚠♨🤔
• … etc. ⛔🔞
M. BOULLÉ. Towards Automatic Feature Construction for Supervised Classification. In ECML/PKDD, pp. 181-196, 2014.
How to evaluate and select features?
• Discretization / Grouping → correlation with the target
• Select the most correlated features
[Figure: a numeric feature discretized against the target set (ex: sick, healthy), split such that the trade-off between entropy & compression is optimal.]
Discretization algorithms
• ChiMerge (R, SAS)
• Optimize entropy
• C4.5 (…)
• Optimize compression
• Fusinter (Zighed & co - Sipina)
• MDL-disc / MDLP (Fayyad & Irani, Pfahringer - Spark)
• MODL (Boullé)
• Optimize both: entropy & compression
Popularize: MODL
Target set (ex: sick, healthy). A numeric variable over $n$ instances is split into $I$ intervals $i_1, \dots, i_I$; there are $J$ target values, $n_i$ instances in interval $i$, of which $n_{i,j}$ belong to class $j$.
Discretize with MODL = minimize the following formula:
$$\mathit{Value}(D) = \log n + \log \binom{n+I-1}{I-1} + \sum_{i=1}^{I} \log \binom{n_i+J-1}{J-1} + \sum_{i=1}^{I} \log \frac{n_i!}{n_{i,1}!\, n_{i,2}! \cdots n_{i,J}!}$$
The first terms measure compression (the cost of coding the partition and the per-interval class distributions); the last sum measures entropy (the cost of coding the class labels within each interval).
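The criterion can be sketched with the standard library, using log-factorials via `math.lgamma`; the per-interval class counts below are hypothetical:

```python
from math import lgamma, log

def log_binom(a, b):
    """log of the binomial coefficient C(a, b)."""
    return lgamma(a + 1) - lgamma(b + 1) - lgamma(a - b + 1)

def modl_value(counts):
    """MODL criterion for a discretization D.
    counts[i][j] = number of instances of class j in interval i."""
    I = len(counts)
    J = len(counts[0])
    n_i = [sum(row) for row in counts]
    n = sum(n_i)
    # Cost of coding I and the interval bounds (compression).
    value = log(n) + log_binom(n + I - 1, I - 1)
    for row, ni in zip(counts, n_i):
        # Cost of coding the class distribution of interval i (compression).
        value += log_binom(ni + J - 1, J - 1)
        # Cost of coding the class labels within interval i (entropy):
        # log( n_i! / (n_{i,1}! ... n_{i,J}!) ).
        value += lgamma(ni + 1) - sum(lgamma(c + 1) for c in row)
    return value

# Pure intervals code the labels for free, so they score better than mixed ones.
pure = [[10, 0], [0, 10]]
mixed = [[5, 5], [5, 5]]
print(modl_value(pure) < modl_value(mixed))  # True
```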
Interpretation of smart aggregates calculated over the visited pages table
For each customer: Count(VisitedPages) = number of visited pages
The interpretation graphic shows that:
• there is a niche of future buyers: those who have visited more than 96.5 pages over the period (top segment)
• the majority of the base has visited no or only a few pages of the site over the period
For each customer: Median(VisitedPages, duration) = median duration of stay on a specific page
+ & -
• + Good complexity
• + Statistically efficient
• + Manages overfitting by design
• - Not enough to win every Kaggle contest…
Let’s stay in	touch!
Jean-Baptiste PRIEZ
Data Scientist
jbp@predicsis.ai
