Feature surfacing
Discover, Aggregate & Evaluate
Jean-Baptiste PRIEZ
8 March 2017
Feature engineering
[Figure: XOR example, with inputs X and Y and target Z = (XY > 0)]
Source tables:
Users (CustomerId, Firstname, Lastname, Age)
Sales (CustomerId, Product, Amount, Time)
Web (CustomerId, Page, Time)
Flattened table, one row per customer:
• Users.CustomerId
• Users.Firstname
• Users.Lastname
• Users.Age
• Outcome
• Count(Sales.Product)
• CountDistinct(Sales.Product)
• Mean(Sales.Amount)
• Sum(Sales.Amount) where Sales.Product = 'Mobile Data'
• Count(Web.Page) where Day(Web.Time) in [6;7]
• …
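To make the aggregates listed above concrete, here is a minimal sketch with pandas. The toy rows are invented for illustration, and reading Day(Web.Time) in [6;7] as "Saturday or Sunday" is an assumption, not something the slide states.

```python
import pandas as pd

# Hypothetical toy data following the Users / Sales / Web schema of the slide.
users = pd.DataFrame({
    "CustomerId": [1, 2, 3],
    "Firstname": ["Ann", "Bob", "Cid"],
    "Lastname": ["A", "B", "C"],
    "Age": [31, 45, 27],
})
sales = pd.DataFrame({
    "CustomerId": [1, 1, 1, 2],
    "Product": ["Mobile Data", "Phone", "Mobile Data", "Phone"],
    "Amount": [10.0, 200.0, 15.0, 180.0],
    "Time": pd.to_datetime(["2017-03-01", "2017-03-02", "2017-03-04", "2017-03-03"]),
})
web = pd.DataFrame({
    "CustomerId": [1, 2, 2, 3],
    "Page": ["home", "home", "offers", "home"],
    "Time": pd.to_datetime(["2017-03-04", "2017-03-05", "2017-03-06", "2017-03-07"]),
})

by_customer = sales.groupby("CustomerId")
mobile_data = sales[sales["Product"] == "Mobile Data"]
# Assumption: days 6 and 7 are Saturday and Sunday (pandas numbers them 5 and 6).
weekend_web = web[web["Time"].dt.dayofweek.isin([5, 6])]

# One column per aggregate; pandas aligns the Series on CustomerId.
features = pd.DataFrame({
    "Count(Sales.Product)": by_customer["Product"].count(),
    "CountDistinct(Sales.Product)": by_customer["Product"].nunique(),
    "Mean(Sales.Amount)": by_customer["Amount"].mean(),
    "Sum(Sales.Amount) where Sales.Product = 'Mobile Data'":
        mobile_data.groupby("CustomerId")["Amount"].sum(),
    "Count(Web.Page) where Day(Web.Time) in [6;7]":
        weekend_web.groupby("CustomerId")["Page"].count(),
})

# Attach the aggregates to the central table: one row per customer.
flat = users.join(features, on="CustomerId")

# A customer without any matching record has a count of 0, not a missing value.
count_cols = [c for c in features.columns if c.startswith("Count")]
flat[count_cols] = flat[count_cols].fillna(0).astype(int)

print(flat)
```

Only the count columns are filled with 0 here; the mean and the filtered sum stay missing for customers without matching records, which is one possible convention among several.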
Let’s start with an example…
Example: Outbound Mail Campaign
1 Central Table
• Customer (Id, e-mail address, age, state, will buy within 5 days)
3 Peripheral Tables
• Pages visited on the website (visited pages, duration of the session, browser type…)
• Orders (number of products, amount spent, order status...)
• E-mail campaign reactions (action, action type, time since e-mail was sent…)
Which sources and variables should we choose?
How should we represent them?
Should we sit and meditate around a table?
Should we try each and every variable manually?
What if we let the machine work?
It’s a Machine Learning problem:
• How to smartly explore the entire set of possible aggregates?
• Without under/overfitting?
• With linearithmic complexity?
What is Feature Surfacing?
Feature surfacing consists of applying a set of aggregation operators to the peripheral tables to generate features in the central table. It has two steps:
1. Extraction of the information contained in a multi-table data source
• Aggregation operators
• Filter operators
2. Evaluation (supervised) of the aggregates extracted from a star schema
[Diagram: a central table linked to peripheral tables 1 to 4 (*), each link annotated with the cardinalities 1,1 and 0,n]
* 1 row per entity in the central table corresponds to several rows for the same entity in the peripheral table.
What are the operators?
Some aggregation operators:
Name           Return type   Operands     Label
Count          Num           Table        Number of records
CountDistinct  Num           Table, Cat   Number of distinct values
Mode           Cat           Table, Cat   Most frequent value
Mean           Num           Table, Num   Mean value
StdDev         Num           Table, Num   Standard deviation
Median         Num           Table, Num   Median value
Min            Num           Table, Num   Minimum value
Max            Num           Table, Num   Maximum value
Sum            Num           Table, Num   Sum of values
Some filter operators:
Name   Return type   Operands       Label
<, ≤   Table         Table, Num     Rows whose field value is less than (or equal to) a given value
>, ≥   Table         Table, Num     Rows whose field value is greater than (or equal to) a given value
=      Table         Table, Field   Rows whose field value is equal to a given value
Customize your operators:
• Date: before, after, weekend, etc.
• Time: morning, afternoon, etc.
• String: split, infinitive verb, etc.
• ...
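The operator signatures above suggest a simple composition rule: a filter takes a table and returns a table, an aggregation takes a table and returns a value. A minimal sketch in plain Python follows, assuming each "table" is the list of peripheral records of one customer. All function names, field names and rows are hypothetical.

```python
from collections import Counter
from datetime import datetime
from statistics import median

# Aggregation operators: table (list of records) -> single value.
def count(table):
    return len(table)

def count_distinct(table, field):
    return len({row[field] for row in table})

def mode(table, field):
    values = [row[field] for row in table]
    return Counter(values).most_common(1)[0][0] if values else None

def median_of(table, field):
    values = [row[field] for row in table]
    return median(values) if values else None

# Filter operators: table -> table, so they can be chained before an aggregation.
def where_equal(table, field, value):
    return [row for row in table if row[field] == value]

def where_greater(table, field, threshold):
    return [row for row in table if row[field] > threshold]

# Custom date operator (hypothetical): keep records that fall on a weekend.
def where_weekend(table, field):
    return [row for row in table if row[field].weekday() >= 5]

# Invented page visits of a single customer.
visits = [
    {"page": "home", "duration": 30, "device": "smartphone", "time": datetime(2017, 3, 4, 10, 0)},
    {"page": "offers", "duration": 90, "device": "desktop", "time": datetime(2017, 3, 6, 9, 0)},
    {"page": "home", "duration": 50, "device": "smartphone", "time": datetime(2017, 3, 5, 18, 0)},
    {"page": "cart", "duration": 70, "device": "smartphone", "time": datetime(2017, 3, 7, 12, 0)},
]

# Filter + aggregation: median duration of the smartphone visits.
smartphone_median = median_of(where_equal(visits, "device", "smartphone"), "duration")
# Custom filter + aggregation: number of pages visited during a weekend.
weekend_pages = count(where_weekend(visits, "time"))

print(smartphone_median, weekend_pages)
```

Because filters return tables, adding a new custom operator only requires one more function with the same shape.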
Presentation of some smart aggregates
1. Count(Pages visited): the number of pages visited by the customer
2. Max(Orders, amount spent): the maximum amount spent by the customer
3. Mode(Email reactions, action type): the customer's most frequent type of e-mail reaction
4. Median(Pages visited, duration) where Pages visited.device = “smartphone”
How to be smart?
Good aggregate:
• 1st: Aggregation ☀❤🐰
• 2nd: Filter + Aggregation ⭐
• 3rd: Filter + Filter + Aggregation ⚠♨🤔
• … etc ... ⛔🔞
M. Boullé. Towards Automatic Feature Construction for Supervised Classification. In ECML/PKDD, pp. 181-196, 2014.
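The ranking on this slide prefers aggregates built from fewer operators. As an illustration only (this is not the procedure of the cited paper), the sketch below enumerates candidate expressions simplest first, by number of filters. The table, fields and filter predicates are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Aggregation operators available for each field type (from the operator table).
AGGREGATIONS = {
    "Num": ["Mean", "StdDev", "Median", "Min", "Max", "Sum"],
    "Cat": ["CountDistinct", "Mode"],
}

def candidate_aggregates(table, fields, filters, max_filters):
    """Yield (number_of_filters, expression) pairs, simplest expressions first."""
    for depth in range(max_filters + 1):
        for combo in combinations(filters, depth):
            source = table if not combo else f"{table} where " + " and ".join(combo)
            yield depth, f"Count({source})"
            for name, kind in fields.items():
                for op in AGGREGATIONS[kind]:
                    yield depth, f"{op}({source}, {name})"

# Hypothetical peripheral table with one numerical and one categorical field.
fields = {"duration": "Num", "device": "Cat"}
filters = ["device = 'smartphone'", "duration > 60", "weekend(time)"]

candidates = list(candidate_aggregates("PagesVisited", fields, filters, max_filters=2))
per_depth = Counter(depth for depth, _ in candidates)

print(candidates[0])  # the simplest candidate: a plain count
print(dict(per_depth))
```

Even with three fixed predicates, the filtered candidates already outnumber the plain aggregations. With free thresholds the space grows much faster, which is why simple aggregates are explored first.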
How to evaluate and select features?
• Discretization / Grouping → Correlation with the target
• Select (the most) correlated features
Target set (e.g. sick, healthy)
Split such that the trade-off between entropy & compression is optimal
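Why a trade-off is needed can be seen on the entropy side alone: the finer the split, the lower the entropy of the target inside the intervals, down to zero with one interval per instance. A minimal sketch, with invented labels sorted by the value of some feature:

```python
from collections import Counter
from math import log2

def conditional_entropy(intervals):
    """Entropy (in bits) of the target given the interval.

    intervals: one list of target labels per interval.
    """
    n = sum(len(labels) for labels in intervals)
    h = 0.0
    for labels in intervals:
        weight = len(labels) / n
        for class_count in Counter(labels).values():
            p = class_count / len(labels)
            h -= weight * p * log2(p)
    return h

# Hypothetical target labels, ordered by increasing feature value.
labels = ["healthy", "healthy", "healthy", "sick", "healthy", "sick", "sick", "sick"]

one_interval = conditional_entropy([labels])
two_intervals = conditional_entropy([labels[:4], labels[4:]])
one_per_instance = conditional_entropy([[label] for label in labels])

print(one_interval, round(two_intervals, 4), one_per_instance)
```

Minimizing entropy alone would always pick the last split, which memorizes the data. The compression term is what penalizes such splits.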
Discretization algorithms
• ChiMerge (R, SAS)
• Optimize entropy
• C4.5 (…)
• Optimize compression
• Fusinter (Zighed & co - Sipina)
• MDL-disc / MDLP (Fayyad & Irani, Pfahringer - Spark)
• MODL (Boullé)
• Optimize both: entropy & compression
MODL in plain terms
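Target set (e.g. sick, healthy): J classes
I intervals: i_1, i_2, …, i_7 in the example
n: number of instances
Discretize with MODL = Minimize the following formula:
\[
\mathrm{Value}(D) = \log n + \log \binom{n+I-1}{I-1} + \sum_{i=1}^{I} \log \binom{n_i+J-1}{J-1} + \sum_{i=1}^{I} \log \frac{n_i!}{n_{i,1}!\, n_{i,2}! \cdots n_{i,J}!}
\]
where n_i is the number of instances in interval i and n_{i,j} the number of instances of class j in interval i.
Compression: the first three terms (cost of describing the discretization). Entropy: the last term (how well the intervals separate the classes).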
Interpretation of smart aggregates calculated over the visited pages table
For each customer: Count(VisitedPages) = number of visited pages
The interpretation graphic shows that:
• there is a niche of future buyers: those who have visited more than 96.5 pages over the period (top segment)
• the majority of the base has visited no pages, or only a few pages, of the site over the period
For each customer: Median(VisitedPages, duration) = median duration of stay on a specific page
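A reading like the one above comes from comparing the share of buyers on each side of a discretization bound. A minimal sketch, using the 96.5-page bound from the slide and invented customers (the counts and outcomes below are not the talk's data):

```python
def buy_rate_by_segment(page_counts, will_buy, threshold=96.5):
    """Share of buyers among customers below and above the threshold."""
    totals = {"low": 0, "high": 0}
    buyers = {"low": 0, "high": 0}
    for pages, bought in zip(page_counts, will_buy):
        segment = "high" if pages > threshold else "low"
        totals[segment] += 1
        buyers[segment] += bought
    # Skip empty segments to avoid dividing by zero.
    return {s: buyers[s] / totals[s] for s in totals if totals[s]}

# Hypothetical customers: Count(VisitedPages) and the "will buy" outcome (1 = yes).
page_counts = [0, 2, 5, 120, 0, 150, 3, 98]
will_buy = [0, 0, 1, 1, 0, 1, 0, 0]

rates = buy_rate_by_segment(page_counts, will_buy)
print(rates)
```

In this toy sample most customers sit in the low segment, while the small high segment has a much higher share of buyers, which is the pattern the slide describes as a niche.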
+ & -
• + Good complexity
• + Statistically efficient
• + Manages overfitting by design
• - Not enough to win every Kaggle contest…
Let’s stay in touch!
Jean-Baptiste PRIEZ
Data Scientist
jbp@predicsis.ai
