Boruta – A System for Feature Selection
Miron B. Kursa, Aleksander Jankowski, Witold R. Rudnicki
Presented by Brandon Sherman
December 1, 2016
Outline
1 Background
2 Boruta
3 Example – Detecting Aptamer Sequences
4 Questions?
Background
Red deer; Białowieża Forest - Podlaskie Voivodeship, Poland
How do random forests work?
A random forest is a machine learning classifier that works as follows:
1 For a dataset (X, Y) with observations {x1, x2, . . . , xN} and response {y1, y2, . . . , yN} (yi ∈ {0, 1}), bootstrap B datasets of size N.
Bootstrapping: for a dataset (X, Y) with N observations, generate another dataset (Xb, Yb) with N observations by sampling from (X, Y) with replacement.
2 For b ∈ {1, . . . , B}, train a decision tree on dataset (Xb, Yb).
3 Predict a new observation Xnew with each decision tree, and classify by the majority vote across all trees.
4 Calculate the out-of-bag (OOB) error: the mean prediction error of each decision tree on the observations not included in its bootstrapped sample.
5 Calculate the importance of feature Xj by randomly permuting it (yielding the permuted copy Xj^(s)) and comparing the resulting out-of-bag error to the out-of-bag error with the original Xj.
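This permutation-based importance is what the randomForest R package reports when importance = TRUE. A minimal sketch on simulated two-class data (the data and all parameter values here are illustrative, not from the talk):

```r
# Minimal sketch: fit a random forest, then read off the OOB error
# and permutation importances. Data are simulated for illustration.
library(randomForest)

set.seed(1)
n <- 200; p <- 10
X <- as.data.frame(matrix(rnorm(n * p), n, p))
y <- factor(as.integer(X[[1]] + X[[2]] + rnorm(n) > 0))

# importance = TRUE computes the permutation (mean decrease in
# accuracy) importance from each tree's out-of-bag samples.
rf <- randomForest(X, y, ntree = 500, importance = TRUE)

rf$err.rate[rf$ntree, "OOB"]   # overall out-of-bag error
importance(rf, type = 1)       # permutation importance per feature
```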
Problems with Feature Importances
“Feature importances can’t tell you the emotional story. They can
give you the exact mathematical design, but what’s missing is the
eyebrows.” – Frank Zappa (heavily paraphrased)
Vague, nebulous notion of “how important is this feature?”
Feature importances are relative, so there is no notion of “how important is this feature on its own?”
No idea how many features are needed.
Is a feature with small importance unimportant or slightly important?
The “Breiman assumption”, which states that feature importance is normally distributed, is false.
Boruta
Boruta (Slavic forest spirit); artist’s depiction
A Quick Diversion
Fire stockbroker Dan Aykroyd and hire untrained homeless man Eddie Murphy (the plot of Trading Places).
Can Eddie Murphy be as good a stockbroker as Dan Aykroyd?
If so, then there is nothing inherent about Dan Aykroyd that makes him a good stockbroker.
Boruta’s shadow features play the Eddie Murphy role: if a real feature cannot beat a randomized one, nothing inherent makes it important.
Boruta Algorithm
For each feature Xj, randomly permute it to generate a “shadow feature” (random attribute) Xj^(s).
Fit a random forest to the original and the shadow features.
Calculate feature importances for the original and the shadow features.
A feature is important for a single run if its importance is higher than the maximum importance of all shadow features (the Maximal Importance of the Random Attributes, MIRA).
Eliminate all features whose importances across all runs are low enough; keep all features whose importances across all runs are high enough.
Repeat from the beginning with all tentative features.
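The authors distribute this algorithm as the Boruta package for R. A minimal sketch of running it end to end; the built-in iris data and the maxRuns value are stand-ins for illustration:

```r
# Minimal sketch of the Boruta package, which implements the
# shadow-feature procedure above. iris is just a stand-in dataset.
library(Boruta)

set.seed(1)
b <- Boruta(Species ~ ., data = iris, maxRuns = 100)

print(b)                                         # Confirmed / Tentative / Rejected
getSelectedAttributes(b, withTentative = FALSE)  # the kept features
attStats(b)                                      # per-feature importance statistics
```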
Boruta – Now with more math! (Part 1)
Iterate the following procedure N times for all original features {X1, . . . , Xp} (one run is sketched below):
1 Create a random forest consisting of the original and newly-generated shadow features.
2 Calculate all Imp(Xj) and MIRA.
3 If a particular Imp(Xj) > MIRA, then increment the hit count Hj and call Xj important for the run.
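One run of that loop might look like the sketch below; here X is a data.frame of original features, y the response, and hits a zero-initialized integer vector of hit counts, all assumed rather than taken from the talk:

```r
# One Boruta-style run: build shadow copies, fit a forest on the
# combined data, and record which originals beat the best shadow.
shadow <- as.data.frame(lapply(X, sample))          # permuted copies
names(shadow) <- paste0("shadow_", names(X))

rf  <- randomForest::randomForest(cbind(X, shadow), y, importance = TRUE)
imp <- randomForest::importance(rf, type = 1)[, 1]  # permutation importances

mira <- max(imp[startsWith(names(imp), "shadow_")]) # max shadow importance
hits <- hits + as.integer(imp[names(X)] > mira)     # Hj += 1 where Imp > MIRA
```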
Boruta – Now with more math! (Part 2)
Once the hit counts {H1, . . . , Hp} have been calculated:
1 Perform the statistical test H0: Hj = E(H) vs. H1: Hj ≠ E(H).
Because hits follow a binomial distribution with N trials and success probability 0.5 under H0, we have, approximately, Hj ∼ N(0.5N, (√(0.25N))²).
2 If Hj is significantly greater than E(H), then we say the feature is important.
3 If Hj is significantly lower than E(H), then we say the feature is unimportant.
4 Finish the procedure after some number of iterations, or once all features have been rejected or deemed important. Otherwise, repeat the procedure from the beginning on all tentative features.
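Since each run contributes one fair coin flip under H0, this decision rule is an ordinary two-sided binomial test. A simplified sketch (the real package also corrects for testing many features at once; the counts below are invented):

```r
# Two-sided binomial test on a feature's hit count after N runs.
# Thresholds and example counts are invented for illustration.
classify_hits <- function(hits, runs, alpha = 0.05) {
  p_hi <- pbinom(hits - 1, runs, 0.5, lower.tail = FALSE)  # P(H >= hits)
  p_lo <- pbinom(hits, runs, 0.5)                          # P(H <= hits)
  if (p_hi < alpha / 2) "important"
  else if (p_lo < alpha / 2) "unimportant"
  else "tentative"
}

classify_hits(hits = 18, runs = 20)  # "important"
classify_hits(hits = 2,  runs = 20)  # "unimportant"
classify_hits(hits = 11, runs = 20)  # "tentative"
```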
Example – Detecting Aptamer Sequences
The Big Question
Which DNA sequences are indicative of aptamers?
The Aptamer Problem
An aptamer is an RNA or DNA chain that strongly binds to various molecular targets.
An aptamer is represented as a genetic sequence that contains certain k-mers.
A, GC, TGA, and AGAC are a 1-, 2-, 3-, and 4-mer, respectively.
Each row of the dataset encodes one sequence as p binary k-mer features (1 if the k-mer appears in the sequence, 0 if it does not), plus a label for whether the sequence is an aptamer (sketched after this list).
Very few sequences in the dataset (small n).
Many possible k-mers (large p).
How do we know which k-mers make up aptamers?
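A toy version of that presence/absence encoding; the sequences and the k-mer vocabulary are invented for illustration:

```r
# Toy version of the k-mer encoding: one row per sequence, one 0/1
# column per k-mer. All sequences and k-mers are invented.
seqs  <- c("ATGACGT", "GGCTGAC", "TTAGACA")
kmers <- c("TGA", "GAC", "AGAC", "GGCT")

X <- t(sapply(seqs, function(s)
  as.integer(vapply(kmers, grepl, logical(1), x = s, fixed = TRUE))))
colnames(X) <- kmers
X
#         TGA GAC AGAC GGCT
# ATGACGT   1   1    0    0
# GGCTGAC   1   1    0    1
# TTAGACA   0   1    1    0
```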
Boruta and the Aptamer Problem
For a dataset consisting of n genetic sequences, p features (the 3-, 4-, and 5-mers), and a label for whether each sequence is an aptamer (a sketch follows this list):
1 Create a shadow feature for each of the p k-mers.
2 Run Boruta on the combined dataset.
3 Build a random forest on only the k-mers selected by Boruta.
4 Calculate the out-of-bag (OOB) error of the new random forest model; 30% is the maximum acceptable OOB error.
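A sketch of that pipeline, assuming a binary k-mer matrix X (n rows, p columns, encoded as earlier) and a 0/1 aptamer label vector y already exist:

```r
# Hypothetical selection-then-refit pipeline; X and y are assumed
# to be the k-mer matrix and aptamer labels described above.
library(Boruta)
library(randomForest)

b    <- Boruta(as.data.frame(X), factor(y))
keep <- getSelectedAttributes(b, withTentative = FALSE)

rf  <- randomForest(as.data.frame(X)[, keep, drop = FALSE], factor(y),
                    ntree = 1000)
oob <- rf$err.rate[rf$ntree, "OOB"]
oob <= 0.30   # the talk's acceptance threshold
```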
Example: ATP Binding Sites
See 2-Boruta.R
Aptamer Problem Results
Out of 23 genetic sequence datasets:
2 had OOB error greater than 30%, and Boruta selected no k-mers for them.
1 saw its OOB error increase from 11% to 38% after Boruta.
The remaining 20 had an average OOB error of 11% and, on average, selected 18 out of 1170 k-mers.
The selected k-mers were already known to be important from prior biological knowledge.
Boruta does pretty darn well!
Questions?
Any questions?