Multidimensional Feature Selection and Interaction Mining
with Decision Tree based ensemble methods
Łukasz Król, Joanna Polańska
Data Mining Group
Faculty of Automatic Control,
Electronics and Computer Science
Silesian University of Technology
Feature Selection – supervised or unsupervised?
[Diagram labels: MACHINE LEARNING; SUPERVISED; AUTOMATION (+feature selection); UNDERSTANDING THE PROCESS (+feature selection); UNSUPERVISED (+feature selection)]
Explorative Supervised Feature Selection
platform           observations   features
PCR                10^2-10^3      10^1-10^2
RNA microarrays    10^2-10^3      10^4
RNA sequencing     10^2-10^3      10^5-10^6
SNP microarrays    10^2-10^3      10^5-10^6
CNV microarrays    10^2-10^3      10^6
methylation sites  10^2-10^3      10^8-10^9
full genome        10^2-10^3      10^9
mixed data         10^2-10^3      10^1-10^9
Explorative Supervised Feature Selection
Common requirements:
• Handles high-dimensional mixed-input data.
• Considers feature interactions.
• Not bound to a greedy search path.
• Agnostic of type of variables and number of categories.
• Does not transform the feature space.
• A broad range of problems (types of decision vectors):
    • categorical
    • continuous
    • censored survival time
Monte Carlo Feature Selection
Bioinformatics (2008) 24: 110-117
Advances in Machine Learning II (2010) 263: 371-385
Big Data Analysis: New Algorithms for a New Society (2015) 16: 285-304
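MCFS - short description
[Diagram: a FEATURE SUBSET is drawn from the FULL DATA (repeated x S); each subset is split into TRAIN and TEST sets (repeated x T); a D. TREE is fitted and used both as a BLACK BOX, which gives a SCORE, and through its STRUCTURE; the outputs are Relative Importance and Inter-Dependency.]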
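The resampling loop in the diagram can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the published MCFS procedure: a one-split decision stump stands in for the decision tree, and the credit rule (the test accuracy is credited to the feature the stump splits on) is a simplification of Relative Importance. The names `fit_stump`, `accuracy`, `mcfs_like` and the parameter `test_frac` are hypothetical.

```python
import random
from collections import defaultdict

def fit_stump(rows, labels, features):
    """Stand-in for the decision tree (assumption): choose the single
    feature/threshold split with the best training accuracy."""
    best = None
    for f in features:
        for thr in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= thr]
            right = [y for r, y in zip(rows, labels) if r[f] > thr]
            if not left or not right:
                continue  # not a real split
            l_lab = max(set(left), key=left.count)    # majority label, left side
            r_lab = max(set(right), key=right.count)  # majority label, right side
            acc = (left.count(l_lab) + right.count(r_lab)) / len(labels)
            if best is None or acc > best[0]:
                best = (acc, f, thr, l_lab, r_lab)
    return best[1:] if best else None

def accuracy(model, rows, labels):
    """Black-box score of a fitted stump on held-out rows."""
    f, thr, l_lab, r_lab = model
    hits = sum((l_lab if r[f] <= thr else r_lab) == y
               for r, y in zip(rows, labels))
    return hits / len(labels)

def mcfs_like(rows, labels, n_features, m, s, t, test_frac=0.34, seed=0):
    """Draw s random subsets of m features; for each, make t random
    train/test splits, fit a model, and credit the feature it used
    with the test score (simplified credit rule, an assumption)."""
    rng = random.Random(seed)
    credit = defaultdict(float)
    idx = list(range(len(rows)))
    for _ in range(s):                       # "x S" in the diagram
        subset = rng.sample(range(n_features), m)
        for _ in range(t):                   # "x T" in the diagram
            rng.shuffle(idx)
            cut = int(len(idx) * test_frac)
            test, train = idx[:cut], idx[cut:]
            model = fit_stump([rows[i] for i in train],
                              [labels[i] for i in train], subset)
            if model is None:
                continue
            score = accuracy(model, [rows[i] for i in test],
                             [labels[i] for i in test])
            credit[model[0]] += score
    return dict(credit)
```

Because each model only ever sees m features at a time, a feature is scored in many different random contexts rather than along one greedy search path.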
MCFS - fields for improvement
• distributing computations
• allowing a wider range of models and decision vectors
• introducing universal and robust feature importance metrics
Broadside - Architecture
• Can be run on an arbitrary number of physical machines.
• Allows nodes to be attached and detached dynamically while computations are running.
• Scales almost linearly with the number of available processors.
• Platform-independent.
• Has no dependencies other than Java 1.8.
• Can be extended with new types of feature selectors.
Broadside – Feature Importance Metrics
[Diagram: the same MODEL (BLACK BOX) is evaluated on the TEST SET and on the PERMUTED TEST SET; each evaluation gives a SCORE, and the difference between the two scores is the DELTA.]
base: the standard RandomForests feature importance metric
enhancement: total effect decomposition into main effects and interaction effects
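The score-difference idea can be sketched as below. This is a minimal illustration, assuming the model is available as a plain `predict` function and the score is "higher is better"; the function name `permutation_importance`, its parameters, and the default of 100 permutations are choices made for this example, not Broadside's API.

```python
import random

def permutation_importance(predict, score, rows, targets, feature,
                           n_perm=100, seed=0):
    """Mean drop in score when one feature column of the test set
    is randomly permuted; the model itself is treated as a black box."""
    rng = random.Random(seed)
    base = score(targets, [predict(r) for r in rows])
    column = [r[feature] for r in rows]
    deltas = []
    for _ in range(n_perm):
        shuffled = column[:]
        rng.shuffle(shuffled)
        # Rebuild each row with the permuted value in place of the original.
        permuted = [list(r[:feature]) + [v] + list(r[feature + 1:])
                    for r, v in zip(rows, shuffled)]
        deltas.append(base - score(targets, [predict(r) for r in permuted]))
    return sum(deltas) / n_perm
```

A feature the model never uses gets a delta of exactly zero, since permuting it cannot change any prediction. For an error-type score such as Mean Absolute Error, the sign of the difference would need to be flipped.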
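Broadside – Feature Importance Metrics
Features: A, B, C, D

total effects | main effects    | interaction effects
              | A   B   C   D   | A-B  A-C  A-D  B-C  B-D  C-D
A             | x   .   .   .   | x    x    x    .    .    .
B             | .   x   .   .   | x    .    .    x    x    .
C             | .   .   x   .   | .    x    .    x    .    x
D             | .   .   .   x   | .    .    x    .    x    x
AB            | x   x   .   .   | x    x    x    x    x    .
AC            | x   .   x   .   | x    x    x    x    .    x
AD            | x   .   .   x   | x    x    x    .    x    x
BC            | .   x   x   .   | x    x    .    x    x    x
BD            | .   x   .   x   | x    .    x    x    x    x
CD            | .   .   x   x   | .    x    x    x    x    x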
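One way to read the matrix is as a linear system: the total effect of permuting a set of features is the sum of the main effects of its members plus every pairwise interaction that involves at least one member. That reading is an assumption inferred from the counts on the slide (4 marks per single-feature row, 7 per pair row), limited to pairwise interactions. Under it, the system has a closed-form solution, sketched here with the hypothetical function `decompose`.

```python
from itertools import combinations

def decompose(total_single, total_pair):
    """Split total effects into main and pairwise interaction effects.

    Assumed additive model (not confirmed by the source):
      T(X)   = main(X) + sum of interactions involving X
      T(X,Y) = main(X) + main(Y) + sum of interactions involving X or Y
    which gives  inter(X,Y) = T(X) + T(Y) - T(X,Y).
    """
    features = sorted(total_single)
    interaction = {}
    for a, b in combinations(features, 2):
        interaction[(a, b)] = (total_single[a] + total_single[b]
                               - total_pair[(a, b)])
    main = {}
    for f in features:
        # Remove every interaction that touches f from its total effect.
        main[f] = total_single[f] - sum(
            v for pair, v in interaction.items() if f in pair)
    return main, interaction
```

With four features this needs ten measured total effects (four single and six joint permutations) to recover the ten unknowns.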
Broadside – Flexibility
Different types of models can be plugged into Broadside by using
different model assessment metrics, e.g.:
• categorical – Weighted Accuracy
• continuous – Mean Absolute Error
• survival – Concordance Index
Supported types of input variables depend on the choice of model.
Currently implemented models are:
• C4.5 classification trees
• RandomForests
• Extremely Randomized Trees
• Regression Trees
• Survival Trees (Ishwaran et al.)
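The three assessment metrics named above can be sketched as plain functions. These are common textbook definitions and may differ from what Broadside computes: weighted accuracy is taken here as the unweighted mean of per-class recall, and the concordance index follows Harrell's pairwise definition for right-censored data, with predictions given as survival times. All function and parameter names are illustrative.

```python
def weighted_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally (assumed definition)."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def concordance_index(times, events, predicted_times):
    """Fraction of comparable pairs ordered correctly by the prediction.

    A pair (i, j) is comparable when i has an observed event (events[i]
    is truthy) strictly before times[j]; tied predictions count as half.
    Raises ZeroDivisionError if no pair is comparable.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if predicted_times[i] < predicted_times[j]:
                    concordant += 1
                elif predicted_times[i] == predicted_times[j]:
                    concordant += 0.5
    return concordant / comparable
```

Since all three reduce a model to a single number on a test set, any of them can serve as the black-box score in a permutation-based importance measure, provided the direction (higher or lower is better) is handled consistently.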
Broadside initial assessment – test data
Broadside initial assessment – configurations
dataset  features  feature sample size (m)  classifier     tree pruning  metrics
A        14        2                        C4.5           0.25          MCFS
A        14        2                        C4.5           training set  MCFS
A        14        2                        C4.5           none          MCFS
A        14        2                        C4.5           0.25          Broadside
A        14        2                        C4.5           training set  Broadside
A        14        2                        C4.5           none          Broadside
A        14        2                        RandomForests  none          Broadside
A        14        2                        ERT [15]       none          Broadside
B        10000     500                      C4.5           0.25          MCFS
B        10000     500                      C4.5           none          Broadside
B        10000     500                      RandomForests  none          Broadside
All configurations: 20 000 feature samples, 100 permutations
Broadside initial assessment – dataset A
Broadside:
Broadside initial assessment – dataset A
MCFS:
Broadside initial assessment – dataset A
Broadside:
Broadside initial assessment – dataset B
Broadside:
Broadside initial assessment – dataset B
MCFS:
Broadside initial assessment – NSCLC PCR data
CATEGORICAL (DEATH/LIFE)
Broadside initial assessment – NSCLC PCR data
REGRESSION
(SURVIVAL TIME FOR DEATHS)
Broadside initial assessment – CNV data
Broadside - summary
• A new feature selection and interaction mining software.
• Follows some of the original MCFS ideas (Draminski et al.).
• Distributed – tested on ~350 cores.
• Up to millions of features.
• Three types of decision vectors:
    • categorical
    • numeric
    • survival time
• Two types of input features:
    • categorical
    • numeric
• Interactive feature importance graphs.
availability
(+ C/C++?)
Broadside:
Upon request.
Fangorn:
https://github.com/LukaszKrol/Fangorn
