Eliminating the Irrelevant:
The HARVEST Algorithm
SAMSI
March 14, 2019
Herbert I. Weisberg
Victor P. Pontes
Mathis Thoma
CAUSALYTICS, LLC
Predictive Analytics: A Very Hard Problem
• Huge number of features but few are relevant
• Often a limited number of observations
• Two contradictory challenges:
– Find the relevant features
– Construct a statistical model
Divide and Conquer: Two (Easier?) Problems
1) Screen out irrelevant features efficiently
2) Build model using (mostly) relevant features
We focus on task 1. If done well, then task 2 should be
(relatively) easy.
Conceptual Framework
• Prediction depends on Correlation, not Causation
• But no stable Correlation exists without Causation
Relevance = Correlation grounded in Causation
Rationale for HARVEST
• A relevant feature cannot decrease model fit to
data except by chance
• An irrelevant feature cannot increase model fit
except by chance
How can we identify features that don’t truly
contribute to accuracy and eliminate them, while
“sparing” relevant features?
Univariate “filtering” methods:
• Poor trade-off between sensitivity and
specificity
• Insensitive to inter-relationships among
features
• Cannot distinguish stable relationships from
random associations
HARVEST
• Highly
• Accurate &
• Robust
• Variable
• Evaluation
• Selection &
• Testing
HARVEST
• Step 1: Generate n random subsets of k features.
• Step 2: For each random subset, train (fit) the
corresponding predictive model.
• Step 3: For each subset, calculate an estimate of
accuracy A.
• Step 4: Rank all n subsets based on the A-values,
from 1 (highest A) to n (lowest A).
• Step 5: For each feature i, identify the n_i subsets
that contain feature i.
For each feature i:
• Step 6: Calculate the average rank of the n_i
subsets that contain feature i.
• Step 7: Apply the Wilcoxon Rank Sum test to
calculate a (one-sided) p-value for feature i.
• Step 8: Eliminate all features not statistically
significant.
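The eight steps can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation. The function name `harvest`, the callback `score_fn` (standing in for "fit a model on this subset and return its accuracy A"), and the toy scoring rule are all assumptions made for the example. The p-value uses a normal approximation to the Wilcoxon Rank Sum test and assumes no tied A-values.

```python
import math
import random
from statistics import NormalDist


def harvest(num_features, n, k, score_fn, alpha=0.05, seed=0):
    """Screen features by the ranks of the random subsets that contain them."""
    rng = random.Random(seed)
    # Step 1: n random subsets of k features each.
    subsets = [rng.sample(range(num_features), k) for _ in range(n)]
    # Steps 2-3: fit a model on each subset and record its accuracy A.
    scores = [score_fn(s) for s in subsets]
    # Step 4: rank subsets, 1 = highest A, n = lowest A (ties not handled).
    order = sorted(range(n), key=lambda j: scores[j], reverse=True)
    rank = [0] * n
    for r, j in enumerate(order, start=1):
        rank[j] = r
    # Step 5: collect the ranks of the subsets containing each feature.
    ranks_with = [[] for _ in range(num_features)]
    for j, s in enumerate(subsets):
        for i in s:
            ranks_with[i].append(rank[j])
    mu = (n + 1) / 2  # null mean of the average rank
    kept, pvals = [], {}
    for i, rs in enumerate(ranks_with):
        n_i = len(rs)
        if n_i == 0 or n_i == n:
            continue  # no comparison group, so the feature cannot be tested
        # Step 6: average rank of the subsets that contain feature i.
        mean_rank = sum(rs) / n_i
        # Step 7: one-sided p-value; a small average rank means high accuracy.
        sigma = math.sqrt((n - n_i) * (n + 1) / (12 * n_i))
        pvals[i] = NormalDist().cdf((mean_rank - mu) / sigma)
        # Step 8: keep only the statistically significant features.
        if pvals[i] < alpha:
            kept.append(i)
    return kept, pvals


# Toy stand-in for model fitting (hypothetical): accuracy grows with the
# number of truly relevant features in the subset, plus random noise.
relevant = {0, 1, 2}
noise = random.Random(1)


def toy_score(subset):
    return sum(i in relevant for i in subset) + noise.gauss(0.0, 0.5)


kept, pvals = harvest(num_features=30, n=2000, k=5, score_fn=toy_score)
print(sorted(kept))
```

Passing the model as a callback keeps the screening logic independent of the learning machine, which matches the slides' point that the model form is a user-chosen parameter.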
HARVEST “Parameters”
• The general form of the learning machine (model)
• The accuracy criterion: A
• The total number of random subsets: n
• The number of features included in each subset: k
• The specification of a desired p-value
How Many Subsets?
$E(n_i) = \frac{nk}{N}$
For example, if N = 1000, k = 20, n = 5000,
then $E(n_i) = 100$.
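The expected count follows from spreading the n × k subset slots evenly over the features, reading N as the total number of candidate features. A small sketch, with hypothetical function names, that computes the expectation and checks it against randomly drawn subsets:

```python
import random


def expected_subsets_per_feature(n, k, N):
    """E(n_i): expected number of the n size-k subsets containing a given feature."""
    return n * k / N


def simulated_counts(n, k, N, seed=0):
    """Draw n random size-k subsets and return (min, mean, max) of n_i over features."""
    rng = random.Random(seed)
    counts = [0] * N
    for _ in range(n):
        for i in rng.sample(range(N), k):
            counts[i] += 1
    return min(counts), sum(counts) / N, max(counts)


print(expected_subsets_per_feature(n=5000, k=20, N=1000))
print(simulated_counts(n=5000, k=20, N=1000))
```

The mean of the simulated counts equals nk/N exactly, because the counts always sum to n × k; individual features vary around that value. With the settings used in the two simulations later in the deck, the same formula gives 4000 × 15 / 40 = 1500 and 1500 × 20 / 300 = 100 subsets per feature.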
Wilcoxon Rank Sum Test
$\mu_i = \frac{n + 1}{2}$
$\sigma_i = \sqrt{\frac{(n - n_i)(n + 1)}{12\, n_i}}$
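One way to read these two quantities is as the null mean and standard deviation of the average rank of the n_i subsets containing feature i, treating those ranks as a simple random sample drawn without replacement from 1, ..., n. The sketch below works under that assumption. The symbols $\bar{R}_i$ (average rank) and $z_i$ (standardized statistic) are notation introduced here, not taken from the slides.

```latex
\begin{align*}
E[\bar{R}_i] &= \frac{n + 1}{2} = \mu_i, \\
\operatorname{Var}(\bar{R}_i)
  &= \underbrace{\frac{n^2 - 1}{12}}_{\text{variance of } 1,\dots,n}
     \cdot \frac{1}{n_i}
     \cdot \underbrace{\frac{n - n_i}{n - 1}}_{\text{finite-population correction}}
   = \frac{(n - n_i)(n + 1)}{12\, n_i} = \sigma_i^2, \\
z_i &= \frac{\bar{R}_i - \mu_i}{\sigma_i},
\qquad p_i \approx \Phi(z_i) \quad \text{(one-sided, normal approximation)}.
\end{align*}
```

For n = 5000 and n_i = 100 this gives $\mu_i = 2500.5$ and $\sigma_i = \sqrt{4900 \cdot 5001 / 1200} \approx 142.9$, so a feature whose subsets have an average rank a few hundred places better than 2500 would be flagged as relevant.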
Possible Ways to Utilize “Harvested” Features
• Create a model of the same form (e.g. GLM,
logistic)
• Refine the model (e.g. interactions,
transformations)
• Perform a stepwise regression
• Create a new model of a different form (e.g. CART)
• Perform another round of HARVEST
A Genomics Example: The Study
• Problem: predict prostate cancer metastasis
• Started with gene expressions for 6,000 genes
• Applied HARVEST in three rounds to training
data
• Selected final 10 genes
• Fit a logistic model to internal validation data
• Validated in 5 external samples
A Genomics Example: The Results
Existing Model: Mean AUC for 5 samples = 0.75
(Best achieved previously)
Our Model: Mean AUC for 5 samples = 0.79
Simulation: HARVEST vs. Six Other Methods
• Model from Wang et al.: Random Lasso
(Annals of Applied Statistics, 2011)
• Linear model: 6 relevant features out of 40 total
• Number of observations = 50
• Methods were Random Lasso, Lasso, and 4 others
• Generated data and compared with published
results
HARVEST Parameters
• Model form: simple linear regression
• The accuracy criterion: R-squared
• The total number of random subsets (n): 4000
• Number of features included in each subset (k): 15
• Desired p-value: 0.05
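With these settings, the accuracy A of a subset is the R-squared of a linear regression on that subset's features. The sketch below shows one way to compute it with the standard library only. It assumes an in-sample R-squared from ordinary least squares with an intercept; the slides do not say whether the estimate is in-sample or held out. The function name `r_squared` and the toy data are hypothetical.

```python
import random


def r_squared(X, y, subset):
    """In-sample R^2 of an OLS fit of y on the columns in `subset` plus an intercept."""
    rows = [[1.0] + [row[j] for j in subset] for row in X]
    p = len(rows[0])
    # Normal equations (A^T A) b = A^T y, stored as an augmented p x (p + 1) matrix.
    M = [[sum(r[a] * r[b] for r in rows) for b in range(p)]
         + [sum(r[a] * yi for r, yi in zip(rows, y))]
         for a in range(p)]
    # Gaussian elimination with partial pivoting (assumes non-collinear columns).
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, p):
            f = M[r][c] / M[c][c]
            for cc in range(c, p + 1):
                M[r][cc] -= f * M[c][cc]
    # Back substitution for the coefficient vector b.
    b = [0.0] * p
    for c in reversed(range(p)):
        b[c] = (M[c][p] - sum(M[c][j] * b[j] for j in range(c + 1, p))) / M[c][c]
    fitted = [sum(bj * rj for bj, rj in zip(b, r)) for r in rows]
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Toy data (hypothetical): 50 observations, 5 features, only features 0 and 1 matter.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(50)]
y = [2.0 * row[0] - row[1] + rng.gauss(0, 0.1) for row in X]
print(r_squared(X, y, [0, 1]), r_squared(X, y, [2, 3]))
```

Because every subset has the same size k, in-sample R-squared values are comparable across subsets without a penalty for model size, which may be why a plain R-squared is sufficient for ranking.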
Simulation Results: HARVEST vs. Six Other Methods
                       Sensitivity                  Specificity
Method            Minimum  Median  Maximum    Minimum  Median  Maximum
Lasso                  11      70       77         75      83       88
Adaptive Lasso         16      49       59         86      92       96
Elastic Net            63      92       96         77      83       91
Relaxed Lasso           4      63       70         91      96      100
VISA                    4      62       73         92      97       99
Random Lasso           84      96       97         70      79       89
HARVEST                93      95       98         84      91       96
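The slides do not spell out how sensitivity and specificity are computed for variable selection. A common reading, assumed here, is that sensitivity is the share of truly relevant features that a method selects and specificity is the share of irrelevant features that it excludes. A minimal sketch under that assumption, with a hypothetical function name and a made-up selection:

```python
def sensitivity_specificity(selected, relevant, num_features):
    """Return (sensitivity, specificity) of a selected feature set."""
    selected, relevant = set(selected), set(relevant)
    irrelevant = set(range(num_features)) - relevant
    sensitivity = len(selected & relevant) / len(relevant)
    specificity = len(irrelevant - selected) / len(irrelevant)
    return sensitivity, specificity


# Hypothetical selection in a 40-feature problem whose first 6 features are relevant:
# 5 of the 6 relevant features are found and 2 of the 34 irrelevant ones slip through.
sens, spec = sensitivity_specificity([0, 1, 2, 3, 4, 10, 11], range(6), 40)
print(sens, spec)
```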
Modified Simulation: HARVEST vs. Lasso
• Modified Model from Wang et al.
• Linear model: 15 relevant features out of 300 total
• Number of observations = 200
• Method for comparison was Lasso
• Generated data and compared results
HARVEST Parameters: Modified Wang Model Simulation
• Model form: simple linear regression
• The accuracy criterion (A): R-squared
• The total number of random subsets (n): 1500
• Number of features included in each subset (k): 20
• Desired p-value: 0.01
[Figure] Variable Selection: HARVEST vs. LASSO (100 simulated data sets)
Summary
• Motivated by a clear conceptual definition of
feature relevance.
• Can rigorously test features for relevance.
• Early results of predictive models produced using
HARVEST are very promising.
• Needs additional testing and further refinement.
