Traditional feature selection algorithms are often motivated by the objective of obtaining an optimal list of variables for a classification or regression problem. An alternative is the exploratory approach, in which all features that may contribute to explaining the decision vector are reported, together with an explicit list of discovered feature interactions. Thanks to their simplicity, versatility and ability to handle mixed categorical and unnormalized numerical input, decision tree based ensemble methods are a powerful feature selection tool. Furthermore, the very structure of the trained decision trees can provide hints about feature interdependencies. In this paper, the capability of detecting strong synthetic benchmark feature interactions in a set of mixed categorical and continuous variables is demonstrated using C4.5 Decision Trees, RandomForests and Extremely Randomized Decision Trees, following the feature selection methodology of the Monte Carlo Feature Selection (MCFS) algorithm. MCFS's original way of detecting feature interactions, which relies on the structure of the trees, is compared with our modified approach, which relies on a series of variable permutations. The new approach is slightly more robust and more flexible, as it allows different types of classifiers, or even regressors, to be plugged into MCFS. Our recommendation for researchers applying decision tree based methods to mixed categorical and continuous datasets is to use heuristics that rely purely on the features' impact on the performance of the classifier rather than on its structure.
Multidimensional Feature Selection and Interaction Mining with Decision Tree based ensemble methods
1. Multidimensional Feature Selection and Interaction Mining
with Decision Tree based ensemble methods
Łukasz Król, Joanna Polańska
Data Mining Group
Faculty of Automatic Control,
Electronics and Computer Science
Silesian University of Technology
2. Feature Selection – supervised or unsupervised?
MACHINE
LEARNING
SUPERVISED
AUTOMATION
+feature selection
UNDERSTANDING
THE PROCESS
+feature selection
UNSUPERVISED
+feature selection
3. Explorative Supervised Feature Selection
MACHINE
LEARNING
SUPERVISED
AUTOMATION
+feature selection
UNDERSTANDING
THE PROCESS
+feature selection
UNSUPERVISED
+feature selection
5. Explorative Supervised Feature Selection
platform observations features
PCR 102-103 101-102
RNA microarrays 102-103 104
RNA sequencing 102-103 105-106
SNP microarrays 102-103 105-106
CNV microarrays 102-103 106
methylation sites 102-103 108-109
full genome 102-103 109
mixed data 102-103 101-109
6. Explorative Supervised Feature Selection
Common requirements:
• Handles high-dimensional mixed-input data.
• Considers feature interactions.
• Not bound to a greedy search path.
• Agnostic to variable types and the number of categories.
• Does not transform the feature space.
• Supports a broad range of problems (types of decision vectors):
• categorical
• continuous
• censored survival time
7. Monte Carlo Feature Selection
Bioinformatics (2008) 24: 110-117
Advances in Machine Learning II (2010) 263: 371-385
Big Data Analysis: New Algorithms for a New Society (2015) 16: 285-304
8.–12. MCFS - short description
[Diagram, built incrementally across slides 8–12: the FULL DATA is sampled into random FEATURE SUBSETs (× s), each split into TRAIN/TEST sets (× t); every split trains a D. TREE whose black-box SCORE feeds the Relative Importance ranking, while the D. TREE's structure feeds the Inter-Dependency measure]
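The sampling scheme in the diagram can be sketched roughly as follows (an illustration of the idea only, not the authors' implementation; scikit-learn trees stand in for the black box, and the parameters s, t, m mirror the × s and × t annotations, with everything else our assumption):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def mcfs_relative_importance(X, y, s=100, t=5, m=10, seed=0):
    """MCFS-style sampling: draw s random feature subsets of size m,
    make t train/test splits per subset, fit one decision tree per
    split, and credit the subset's used features with the tree's
    black-box test score."""
    rng = np.random.default_rng(seed)
    ri = np.zeros(X.shape[1])
    for _ in range(s):
        subset = rng.choice(X.shape[1], size=m, replace=False)
        for _ in range(t):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X[:, subset], y, test_size=0.3,
                random_state=int(rng.integers(1 << 30)))
            tree = DecisionTreeClassifier().fit(X_tr, y_tr)
            score = tree.score(X_te, y_te)        # black-box SCORE
            used = tree.feature_importances_ > 0  # features actually split on
            ri[subset] += score * used
    return ri / (s * t)
```

Features that end up in well-performing trees accumulate credit, so informative variables rise above noise after enough Monte Carlo draws.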
13. MCFS - fields for improvement
• distributing computations
• allowing a wider range of models and decision vectors
• introducing universal and robust feature importance metrics
14. Broadside - Architecture
• Can be run on an arbitrary number of physical machines.
• Allows nodes to be dynamically attached and detached while computations are running.
• Scales almost linearly with the number of available processors.
• Platform-independent.
• Has no dependencies other than Java 1.8.
• Open for extension with new types of feature selectors.
15. Broadside – Feature Importance Metrics
[Diagram: the TEST SET and a PERMUTED TEST SET are each scored by the trained MODEL (BLACK BOX); the difference between the two SCOREs is the DELTA]
base: the standard RandomForests feature importance metric
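The scheme in the diagram can be sketched as follows (a minimal illustration; `model`, `score_fn` and the single-feature interface are our assumptions, not the Broadside API):

```python
import numpy as np

def permutation_delta(model, X_test, y_test, feature, score_fn, seed=0):
    """SCORE the trained black-box model on the TEST SET, then on a
    PERMUTED TEST SET in which one feature's column has been shuffled;
    the score drop (DELTA) estimates that feature's importance."""
    rng = np.random.default_rng(seed)
    base = score_fn(y_test, model.predict(X_test))
    X_perm = X_test.copy()
    rng.shuffle(X_perm[:, feature])   # break the feature-target link
    return base - score_fn(y_test, model.predict(X_perm))
```

With accuracy as `score_fn`, an informative feature yields a clearly positive delta, while a feature the model ignores yields a delta near zero.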
16. Broadside – Feature Importance Metrics
[Same diagram as slide 15: TEST SET and PERMUTED TEST SET each scored by the MODEL (BLACK BOX), their score difference giving the DELTA]
base: the standard RandomForests feature importance metric
enhancement: total effect decomposition into main effects and interaction effects
[Diagram: interaction graph over four features A, B, C, D]
17.–24. Broadside – Feature Importance Metrics
[Slides 17–24 build the decomposition table row by row for four features A, B, C, D: each row marks which main effects and interaction effects the corresponding single or joint permutation covers]
Final table (slide 24):

total effect   main effects        interaction effects
               A  B  C  D    A-B  A-C  A-D  B-C  B-D  C-D
A              x             x    x    x
B                 x          x              x    x
C                    x            x         x         x
D                       x              x         x    x
AB             x  x          x    x    x    x    x
AC             x     x       x    x    x    x         x
AD             x        x    x    x    x         x    x
BC                x  x       x    x         x    x    x
BD                x     x    x         x    x    x    x
CD                   x  x         x    x    x    x    x
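Per the table above, a single-feature delta covers that feature's main effect plus every interaction it takes part in, while a pair delta counts the shared interaction only once; so the A-B interaction can be estimated as delta(A) + delta(B) - delta(A,B jointly). This sketch is our reading of the decomposition, not Broadside's code; `score_fn` takes (y_true, y_pred):

```python
import numpy as np

def delta(model, X, y, cols, score_fn, rng):
    """Score drop after jointly permuting the listed feature columns:
    the same row permutation is applied to all of them, preserving
    their joint distribution while breaking their link to the target."""
    base = score_fn(y, model.predict(X))
    X_perm = X.copy()
    perm = rng.permutation(len(X_perm))
    X_perm[:, cols] = X[perm][:, cols]
    return base - score_fn(y, model.predict(X_perm))

def interaction_effect(model, X, y, a, b, score_fn, seed=0):
    """Each single delta = main effect + all interactions of that
    feature; the pair delta counts the shared interaction once.
    Hence I(a, b) = delta(a) + delta(b) - delta({a, b})."""
    rng = np.random.default_rng(seed)
    return (delta(model, X, y, [a], score_fn, rng)
            + delta(model, X, y, [b], score_fn, rng)
            - delta(model, X, y, [a, b], score_fn, rng))
```

On a model that learns an XOR of two features, I(a, b) for the interacting pair comes out large, while pairs involving an irrelevant feature stay near zero.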
25. Broadside – Flexibility
Different types of models can be plugged into Broadside by using different model assessment metrics, e.g.:
• categorical – Weighted Accuracy
• continuous – Mean Absolute Error
• survival – Concordance Index
The supported types of input variables depend on the choice of model.
Currently implemented models:
• C4.5 classification trees
• RandomForests
• Extremely Randomized Trees
• Regression Trees
• Survival Trees (Ishwaran et al.)
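One of the listed metrics, the Concordance Index for censored survival data, can be sketched as follows (a simplified illustration, not the Broadside implementation; tied event times are not handled):

```python
def concordance_index(times, events, risks):
    """Fraction of comparable pairs where the higher predicted risk
    corresponds to the earlier observed event.  A pair (i, j) is
    comparable when the earlier time belongs to an actual event
    (events[i] == 1), not to a censored observation."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # i must experience the event strictly before j's time
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties in risk count as half
    return concordant / comparable
```

A value of 1.0 means the model ranks every comparable pair correctly, 0.5 is random ordering, and 0.0 is perfectly inverted ranking.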
26. Broadside – decision tree based ensemble methods
[Same content as slide 25]
39. Broadside - summary
• A new feature selection and interaction mining software tool.
• Follows some of the original MCFS ideas (Draminski et al.).
• Distributed: tested on ~350 cores.
• Scales up to millions of features.
• Three types of decision vectors:
• categorical
• numeric
• survival time
• Two types of input features:
• categorical
• numeric
• Interactive feature importance graphs.