Topics in the Development and
Validation of Gene Expression
Profiling Based Predictive Classifiers
Richard Simon, D.Sc.
Chief, Biometric Research Branch
National Cancer Institute
linus.nci.nih.gov/brb
BRB Website
http://linus.nci.nih.gov/brb
• Powerpoint presentations and audio files
• Reprints & Technical Reports
• BRB-ArrayTools software
• BRB-ArrayTools Data Archive
• Sample Size Planning for Targeted
Clinical Trials
Simplified Description of Microarray
Assay
• Extract mRNA from cells of interest
– Each mRNA molecule is transcribed from a single gene and has a linear
structure complementary to that gene
• Convert the mRNA to cDNA, introducing a fluorescently
labeled dye into each molecule
• Distribute the cDNA sample to a solid surface containing
“probes” of DNA representing all “genes”; the probes are
in known locations on the surface
• Let the molecules from the sample hybridize with the
probes for the corresponding genes
• Remove excess sample and illuminate surface with laser
with frequency corresponding to the dye
• Measure intensity of fluorescence over each probe
Resulting Data
• Intensity over a probe is approximately
proportional to abundance of mRNA
molecules in the sample for the gene
corresponding to the probe
• 40,000 variables measured for each case
– Excessive hype
– Excessive skepticism
– Some familiar statistical paradigms don’t work
well
Good Microarray Studies Have
Clear Objectives
• Class Comparison (Gene Finding)
– Find genes whose expression differs among predetermined
classes, e.g. tissue or experimental condition
• Class Prediction
– Prediction of predetermined class (e.g. treatment outcome)
using information from gene expression profile
– Survival risk-group prediction
• Class Discovery
– Discover clusters of specimens having similar expression
profiles
Class Comparison and Class
Prediction
• Not clustering problems
• Supervised methods
Class Prediction ≠ Class Comparison
• A set of genes is not a predictive model
• Emphasis in class comparison is often on understanding
biological mechanisms
– More difficult than accurate prediction and usually requires a
different experiment
• Demonstrating statistical significance of prognostic
factors is not the same as demonstrating predictive
accuracy
Components of Class Prediction
• Feature (gene) selection
– Which genes will be included in the model
• Select model type
– E.g. Diagonal linear discriminant analysis,
Nearest-Neighbor, …
• Fitting parameters (regression
coefficients) for model
– Selecting value of tuning parameters
Feature Selection
• Genes that are differentially expressed among the
classes at a significance level α (e.g. 0.01)
– The α level is a tuning parameter
– Number of false discoveries is not of direct relevance for
prediction
• For prediction it is usually more serious to exclude an
informative variable than to include some noise variables
Optimal significance level cutoffs for gene selection. 50 differentially expressed genes
out of 22,000 genes on the microarrays
2δ/σ n=10 n=30 n=50
1 0.167 0.003 0.00068
1.25 0.085 0.0011 0.00035
1.5 0.045 0.00063 0.00016
1.75 0.026 0.00036 0.00006
2 0.015 0.0002 0.00002
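A sketch of this univariate selection rule on simulated data (the gene counts, effect size, and α cutoff here are illustrative, not the table's settings):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n1, n2, p = 15, 15, 2000           # samples per class, number of genes

# Simulated log-expression: the first 20 genes are shifted in class 2
X1 = rng.normal(0.0, 1.0, size=(n1, p))
X2 = rng.normal(0.0, 1.0, size=(n2, p))
X2[:, :20] += 1.5                  # inter-class mean difference for informative genes

# Per-gene two-sample t-test; keep genes significant at level alpha
alpha = 0.01
t, pval = stats.ttest_ind(X1, X2, axis=0)
selected = np.flatnonzero(pval < alpha)
print(len(selected), "genes selected")
```

Note that with 2,000 genes, a cutoff of α = 0.01 admits roughly 20 false discoveries in expectation; as the slide says, for prediction that is usually a lesser evil than excluding informative genes.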
Complex Gene Selection
• Small subset of genes which together give
most accurate predictions
– Genetic algorithms
• Little evidence that complex feature
selection is useful in microarray problems
Linear Classifiers for Two Classes
• l(x) = Σ_{i ∈ F} w_i x_i
– x = vector of log-ratios or log signals
– F = set of features (genes) included in the model
– w_i = weight for the i-th feature
– Decision boundary: l(x) > or < d
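A minimal sketch of such a linear classifier in code (the weights, selected features, and threshold d below are placeholder values, not a fitted model):

```python
import numpy as np

def linear_classify(x, features, weights, d):
    """Return class 2 if l(x) = sum over i in F of w_i * x_i exceeds d, else class 1."""
    l = sum(weights[i] * x[i] for i in features)
    return 2 if l > d else 1

# Hypothetical profile: 5 genes measured, 3 included in the model
x = np.array([0.8, -1.2, 0.4, 2.0, -0.3])   # log-ratios
features = [0, 3, 4]                        # indices of selected genes
weights = {0: 1.0, 3: 0.5, 4: -2.0}
print(linear_classify(x, features, weights, d=0.0))  # → 2
```

The methods listed on the next slide differ only in how the weights w_i (and threshold d) are derived from the training data.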
Linear Classifiers for Two Classes
• Fisher linear discriminant analysis
• Diagonal linear discriminant analysis (DLDA)
– Ignores correlations among genes
• Compound covariate predictor
• Golub’s weighted voting method
• Support vector machines with inner product
kernel
• Perceptrons
When p>>n
• It is always possible to find a set of
features and a weight vector for which the
classification error on the training set is
zero.
• There is generally not sufficient
information in p>>n training sets to
effectively use more complex methods
Myth
• Complex classification algorithms such as
neural networks perform better than
simpler methods for class prediction.
• Comparative studies have shown that
simpler methods work as well or better for
microarray problems because they avoid
overfitting the data.
Other Simple Methods
• Nearest neighbor classification
• Nearest k-neighbors
• Nearest centroid classification
• Shrunken centroid classification
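A sketch of the nearest-centroid idea on toy data (two genes, two classes; data are illustrative):

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Mean expression profile (centroid) per class."""
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def nearest_centroid_predict(x, classes, centroids):
    """Assign x to the class with the closest centroid (Euclidean distance)."""
    d = np.linalg.norm(centroids - x, axis=1)
    return classes[np.argmin(d)]

# Toy data: two well-separated classes
X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
y = np.array([1, 1, 2, 2])
classes, centroids = nearest_centroid_fit(X, y)
print(nearest_centroid_predict(np.array([4.5, 5.2]), classes, centroids))  # → 2
```

The shrunken-centroid (PAM) variant additionally shrinks each centroid component toward the overall mean, zeroing out uninformative genes.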
Evaluating a Classifier
• Most statistical methods were not developed for
p>>n prediction problems
• Fit of a model to the same data used to develop
it is no evidence of prediction accuracy for
independent data
• Demonstrating statistical significance of
prognostic factors is not the same as
demonstrating predictive accuracy
• Testing whether analysis of independent data
results in selection of the same set of genes is
not an appropriate test of predictive accuracy of
a classifier
Internal Validation of a Classifier
• Re-substitution estimate
– Develop classifier on dataset, test predictions
on same data
– Very biased for p>>n
• Split-sample validation
• Cross-validation
Split-Sample Evaluation
• Training-set
– Used to select features, select model type, determine
parameters and cut-off thresholds
• Test-set
– Withheld until a single model is fully specified using
the training-set.
– Fully specified model is applied to the expression
profiles in the test-set to predict class labels.
– Number of errors is counted
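A sketch of this protocol on simulated data (the nearest-centroid rule, sizes, and effect size are illustrative). The essential point is that feature selection and parameter estimation both use the training set only:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 60, 500
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[y == 1, :10] += 1.0              # 10 informative genes

# Split once; the test set is untouched until the model is fully specified
idx = rng.permutation(n)
train, test = idx[:40], idx[40:]

# Develop the classifier on the training set only:
# feature selection, then class centroids, both from training cases alone
diff = X[train][y[train] == 1].mean(0) - X[train][y[train] == 0].mean(0)
top = np.argsort(-np.abs(diff))[:10]          # simple selection: largest mean differences
c0 = X[train][y[train] == 0][:, top].mean(0)
c1 = X[train][y[train] == 1][:, top].mean(0)

# Apply the frozen model to the held-out test set and count errors
d0 = np.linalg.norm(X[test][:, top] - c0, axis=1)
d1 = np.linalg.norm(X[test][:, top] - c1, axis=1)
pred = (d1 < d0).astype(int)
errors = int((pred != y[test]).sum())
print(f"{errors} errors on {len(test)} test cases")
```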
Leave-one-out Cross Validation
• Omit sample 1
– Develop multivariate classifier from scratch on
training set with sample 1 omitted
– Predict class for sample 1 and record whether
prediction is correct
Leave-one-out Cross Validation
• Repeat analysis for training sets with each
single sample omitted one at a time
• e = number of misclassifications
determined by cross-validation
• Subdivide e for estimation of sensitivity
and specificity
• With proper cross-validation, the model
must be developed from scratch for each
leave-one-out training set. This means
that feature selection must be repeated for
each leave-one-out training set.
– Simon R, Radmacher MD, Dobbin K, McShane LM. Pitfalls in the analysis of
DNA microarray data. Journal of the National Cancer Institute 95:14-18, 2003.
• The cross-validated estimate of
misclassification error estimates the
prediction error of the model obtained by
applying the specified algorithm to the full dataset
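A sketch of proper LOOCV, with feature selection repeated from scratch inside every leave-one-out training set (the classifier and simulated data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 30, 1000, 10
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[y == 1, :10] += 1.0              # 10 informative genes

def develop_and_predict(Xtr, ytr, xnew):
    """Classifier developed from scratch: feature selection THEN centroids,
    both using the leave-one-out training cases only."""
    diff = Xtr[ytr == 1].mean(0) - Xtr[ytr == 0].mean(0)
    top = np.argsort(-np.abs(diff))[:k]       # selection repeated per training set
    c0 = Xtr[ytr == 0][:, top].mean(0)
    c1 = Xtr[ytr == 1][:, top].mean(0)
    return int(np.linalg.norm(xnew[top] - c1) < np.linalg.norm(xnew[top] - c0))

e = sum(develop_and_predict(np.delete(X, i, 0), np.delete(y, i), X[i]) != y[i]
        for i in range(n))
print(f"LOOCV misclassifications: {e} / {n}")
```

Selecting `top` once on all n cases and cross-validating only the centroids would be the incomplete cross-validation criticized above.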
Prediction on Simulated Null Data
Generation of Gene Expression Profiles
• 14 specimens (Pi is the expression profile for specimen i)
• Log-ratio measurements on 6000 genes
• Pi ~ MVN(0, I6000)
• Can we distinguish between the first 7 specimens (Class 1) and the last 7
(Class 2)?
Prediction Method
• Compound covariate prediction
• Compound covariate built from the log-ratios of the 10 most differentially
expressed genes.
(Histogram: distribution of the number of misclassifications, 0–20.)
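This null simulation can be reproduced in miniature. The sketch below contrasts re-substitution with full LOOCV for a compound covariate built from the 10 most differentially expressed genes (the t-like score and midpoint threshold are simplifications of the method):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 14, 6000
y = np.repeat([0, 1], 7)
X = rng.normal(size=(n, p))        # pure noise: P_i ~ MVN(0, I)

def cc_predict(Xtr, ytr, xnew):
    """Compound covariate from the 10 most differentially expressed genes,
    developed on the training cases only."""
    m1, m0 = Xtr[ytr == 1].mean(0), Xtr[ytr == 0].mean(0)
    s = Xtr.std(0, ddof=1) + 1e-8
    t = (m1 - m0) / s                             # t-like score per gene
    top = np.argsort(-np.abs(t))[:10]
    cc = t[top] @ xnew[top]                       # compound covariate for the new case
    cut = t[top] @ ((m1 + m0) / 2)[top]           # midpoint threshold
    return int(cc > cut)

# Re-substitution: develop on all 14 cases, "test" on the same 14
resub = sum(cc_predict(X, y, X[i]) != y[i] for i in range(n))
# LOOCV: redevelop from scratch with each case held out
loocv = sum(cc_predict(np.delete(X, i, 0), np.delete(y, i), X[i]) != y[i]
            for i in range(n))
print(f"re-substitution errors: {resub}, LOOCV errors: {loocv}")
```

On null data the re-substitution count is near zero while LOOCV hovers around chance (about 7 of 14), which is exactly the bias the slide illustrates.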
Major Flaws Found in 40 Studies
Published in 2004
• Inadequate control of multiple comparisons in gene
finding
– 9/23 studies had unclear or inadequate methods to deal with
false positives
• 10,000 genes x .05 significance level = 500 false positives
• Misleading report of prediction accuracy
– 12/28 reports based on incomplete cross-validation
• Misleading use of cluster analysis
– 13/28 studies invalidly claimed that expression clusters based on
differentially expressed genes could help distinguish clinical
outcomes
• 50% of studies contained one or more major flaws
Myth
• Split sample validation is superior to
LOOCV or 10-fold CV for estimating
prediction error
Comparison of Internal Validation Methods
Molinaro, Pfeiffer & Simon
• For small sample sizes, LOOCV is much less
biased than split-sample validation
• For small sample sizes, LOOCV is preferable to
10-fold, 5-fold cross-validation or repeated k-fold
versions
• For moderate sample sizes, 10-fold is preferable
to LOOCV
• Some claims for bootstrap resampling for
estimating prediction error are not valid for p>>n
problems
Simulated Data
40 cases, 10 genes selected from 5000
Method Estimate Std Deviation
True .078
Resubstitution .007 .016
LOOCV .092 .115
10-fold CV .118 .120
5-fold CV .161 .127
Split sample 1-1 .345 .185
Split sample 2-1 .205 .184
.632+ bootstrap .274 .084
Simulated Data
40 cases
Method Estimate Std Deviation
True .078
10-fold .118 .120
Repeated 10-fold .116 .109
5-fold .161 .127
Repeated 5-fold .159 .114
Split 1-1 .345 .185
Repeated split 1-1 .371 .065
DLBCL Data
Method Bias Std Deviation MSE
LOOCV -.019 .072 .008
10-fold CV -.007 .063 .006
5-fold CV .004 .07 .007
Split 1-1 .037 .117 .018
Split 2-1 .001 .119 .017
.632+ bootstrap -.006 .049 .004
• Ordinary bootstrap
– Training and test sets overlap
• Bootstrap cross-validation (Fu, Carroll, Wang)
– Perform LOOCV on bootstrap samples
– Training and test sets overlap
• Leave-one-out bootstrap
– Predict for cases not in bootstrap sample
– Training sets are too small
• Out-of-bag bootstrap (Breiman)
– Predict for case i based on majority rule of predictions for
bootstrap samples not containing case i
• .632+ bootstrap
– Weighted average w·LOOBS + (1−w)·RSB of the leave-one-out
bootstrap (LOOBS) and re-substitution (RSB) estimates
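A sketch of the out-of-bag idea (Breiman): predict each case by majority vote over the bootstrap replicates that did not draw it. The nearest-centroid rule and all sizes here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, B = 40, 50, 100
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[y == 1, :5] += 1.2               # 5 informative genes

def centroid_predict(Xtr, ytr, Xte):
    c0, c1 = Xtr[ytr == 0].mean(0), Xtr[ytr == 1].mean(0)
    d0 = np.linalg.norm(Xte - c0, axis=1)
    d1 = np.linalg.norm(Xte - c1, axis=1)
    return (d1 < d0).astype(int)

# Out-of-bag: vote for case i only from bootstrap samples NOT containing case i
votes = np.zeros((n, 2))
for _ in range(B):
    boot = rng.integers(0, n, size=n)            # bootstrap sample (with replacement)
    oob = np.setdiff1d(np.arange(n), boot)       # cases left out of this sample
    if len(oob) == 0 or len(np.unique(y[boot])) < 2:
        continue
    pred = centroid_predict(X[boot], y[boot], X[oob])
    votes[oob, pred] += 1

decided = votes.sum(1) > 0
oob_error = float(np.mean(np.argmax(votes[decided], 1) != y[decided]))
print(f"out-of-bag error estimate: {oob_error:.3f}")
```

Training and test cases never overlap here, which avoids the optimism of the ordinary bootstrap noted above.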
Permutation Distribution of Cross-
validated Misclassification Rate of a
Multivariate Classifier
• Randomly permute class labels and
repeat the entire cross-validation
• Re-do for all (or 1000) random
permutations of class labels
• Permutation p value is fraction of random
permutations that gave as few
misclassifications as e in the real data
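A sketch of this permutation test (simulated data; the classifier, sizes, and number of permutations are illustrative). Note that the entire cross-validation, including feature selection, is repeated for every permuted labeling:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 20, 100
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[y == 1, :5] += 1.5               # genuine signal in 5 genes

def loocv_errors(X, y):
    """Full LOOCV of a nearest-centroid rule with per-fold selection of 5 genes."""
    e = 0
    for i in range(len(y)):
        Xtr, ytr = np.delete(X, i, 0), np.delete(y, i)
        diff = Xtr[ytr == 1].mean(0) - Xtr[ytr == 0].mean(0)
        top = np.argsort(-np.abs(diff))[:5]
        c0 = Xtr[ytr == 0][:, top].mean(0)
        c1 = Xtr[ytr == 1][:, top].mean(0)
        pred = int(np.linalg.norm(X[i, top] - c1) < np.linalg.norm(X[i, top] - c0))
        e += pred != y[i]
    return e

e_obs = loocv_errors(X, y)
# Permutation distribution: redo the ENTIRE cross-validation per permutation
B = 100
count = sum(loocv_errors(X, rng.permutation(y)) <= e_obs for _ in range(B))
p_value = (count + 1) / (B + 1)
print(f"observed LOOCV errors: {e_obs}/{n}, permutation p = {p_value:.3f}")
```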
Does an Expression Profile Classifier
Predict More Accurately Than Standard
Prognostic Variables?
• Not an issue of which variables are significant
after adjusting for which others or which are
independent predictors
– Predictive accuracy, not significance
• The two classifiers can be compared by ROC
analysis as functions of the threshold for
classification
• The predictiveness of the expression profile
classifier can be evaluated within levels of the
classifier based on standard prognostic variables
Does an Expression Profile Classifier
Predict More Accurately Than Standard
Prognostic Variables?
• Some publications fit logistic model to
standard covariates and the cross-validated
predictions of expression profile classifiers
• This is valid only with split-sample analysis
because the cross-validated predictions are
not independent
logit p(y_i | x̂_(−i), z_i) = α + β x̂_(−i) + γ z_i
Survival Risk Group Prediction
• For analyzing right censored data to develop predictive
classifiers it is not necessary to make the data binary
• Can do cross-validation to predict high or low risk group
for each case
• Compute Kaplan-Meier curves of predicted risk groups
• Permutation significance of log-rank statistic
• Implemented in BRB-ArrayTools
• BRB-ArrayTools also provides for comparing the risk
group classifier based on expression profiles to one
based on standard covariates and one based on a
combination of both types of variables
Myth
• Huge sample sizes are needed to develop
effective predictive classifiers
Sample Size Planning
References
• K Dobbin, R Simon. Sample size
determination in microarray experiments
for class comparison and prognostic
classification. Biostatistics 6:27-38, 2005
• K Dobbin, R Simon. Sample size planning
for developing classifiers using high
dimensional DNA microarray data.
Biostatistics (2007)
Sample Size Planning for Classifier
Development
• The expected value (over training sets) of
the probability of correct classification
PCC(n) should be within γ of the maximum
achievable PCC(∞)
Probability Model
• Two classes
• Log expression or log ratio MVN in each class with
common covariance matrix
• m differentially expressed genes
• p-m noise genes
• Expression of differentially expressed genes are
independent of expression for noise genes
• All differentially expressed genes have same inter-class
mean difference 2δ
• Common variance for differentially expressed genes and
for noise genes
Classifier
• Feature selection based on univariate t-
tests for differential expression at
significance level α
• Simple linear classifier with equal weights
(except for sign) for all selected genes.
• Power for selecting each informative gene
(differentially expressed by mean difference
2δ) is 1−β(n)
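A sketch of the selection power 1−β(n), using a normal approximation to the two-sided two-sample test; this approximation (and the assumption of n samples per class) is mine, not spelled out on the slide:

```python
from scipy.stats import norm

def selection_power(std_diff, n_per_class, alpha):
    """Approximate power of a two-sided two-sample test of mean difference
    2*delta at per-gene significance level alpha.
    std_diff = 2*delta/sigma (standardized inter-class difference)."""
    ncp = std_diff / (2.0 / n_per_class) ** 0.5   # (2δ/σ) / sqrt(2/n)
    z = norm.ppf(1 - alpha / 2)
    return norm.cdf(ncp - z) + norm.cdf(-ncp - z)

print(round(selection_power(1.5, 25, 0.001), 3))  # → 0.978
```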
• For 2 classes of equal prevalence, let λ1 denote
the largest eigenvalue of the covariance matrix
of informative genes. Then
PCC(∞) ≤ Φ( √m · δ / (σ √λ1) )

PCC(n) ≥ Φ( (1−β) m δ / ( σ √( λ1 [ (1−β) m + α (p−m) ] ) ) )
(Figure a: Sample size as a function of effect size 2δ/σ (log-base-2 fold change
between classes divided by standard deviation), for tolerances γ = 0.05 and
γ = 0.10. Each class is equally represented in the population; 22,000 genes on
an array.)
(Figure b: PCC(60) as a function of the proportion in the under-represented
class, parameter settings as in a), with 10 differentially expressed genes among
22,000 total genes. If the proportion in the under-represented class is small
(e.g. <20%), PCC(60) can decline significantly.)
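A numeric sketch of these PCC bounds, as reconstructed here: PCC(∞) ≤ Φ(√m δ/(σ√λ1)) and PCC(n) ≥ Φ((1−β)mδ / (σ√(λ1[(1−β)m + α(p−m)]))). All parameter values below are illustrative:

```python
from math import sqrt
from scipy.stats import norm

def pcc_bounds(m, p, delta_over_sigma, lam1, alpha, beta):
    """Upper bound on PCC(infinity) and lower bound on PCC(n).
    delta_over_sigma = δ/σ; lam1 = largest eigenvalue λ1; beta = β(n)."""
    pcc_inf = norm.cdf(sqrt(m) * delta_over_sigma / sqrt(lam1))
    kept = (1 - beta) * m + alpha * (p - m)   # expected number of selected genes
    pcc_n = norm.cdf((1 - beta) * m * delta_over_sigma / sqrt(lam1 * kept))
    return pcc_inf, pcc_n

hi, lo = pcc_bounds(m=10, p=22000, delta_over_sigma=0.75, lam1=1.0,
                    alpha=0.001, beta=0.2)
print(f"PCC(inf) <= {hi:.3f}, PCC(n) >= {lo:.3f}")
```

The gap between the two bounds shrinks as the per-gene power 1−β rises and the selection level α falls, which is the trade-off the sample-size planning targets.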
Acknowledgements
• Kevin Dobbin
• Alain Dupuy
• Wenyu Jiang
• Annette Molinaro
• Ruth Pfeiffer
• Michael Radmacher
• Joanna Shih
• Yingdong Zhao
• BRB-ArrayTools Development Team
BRB-ArrayTools
• Contains analysis tools that I have selected as
valid and useful
• Analysis wizard and multiple help screens for
biomedical scientists
• Imports data from all platforms and major
databases
• Automated import of data from the NCBI Gene
Expression Omnibus
Predictive Classifiers in
BRB-ArrayTools
• Classifiers
– Diagonal linear discriminant
– Compound covariate
– Bayesian compound covariate
– Support vector machine with
inner product kernel
– K-nearest neighbor
– Nearest centroid
– Shrunken centroid (PAM)
– Random forest
– Tree of binary classifiers for k-
classes
• Survival risk-group
– Supervised principal components
• Feature selection options
– Univariate t/F statistic
– Hierarchical variance option
– Restricted by fold effect
– Univariate classification power
– Recursive feature elimination
– Top-scoring pairs
• Validation methods
– Split-sample
– LOOCV
– Repeated k-fold CV
– .632+ bootstrap
Selected Features of BRB-ArrayTools
• Multivariate permutation tests for class comparison to control
number and proportion of false discoveries with specified
confidence level
– Permits blocking by another variable, pairing of data, averaging of
technical replicates
• SAM
– Fortran implementation 7X faster than R versions
• Extensive annotation for identified genes
– Internal annotation of NetAffx, Source, Gene Ontology, Pathway
information
– Links to annotations in genomic databases
• Find genes correlated with quantitative factor while controlling
number or proportion of false discoveries
• Find genes correlated with censored survival while controlling
number or proportion of false discoveries
• Analysis of variance
Selected Features of BRB-ArrayTools
• Gene set enrichment analysis.
– Gene Ontology groups, signaling pathways, transcription
factor targets, micro-RNA putative targets
– Automatic data download from Broad Institute
– KS & LS test statistics for null hypothesis that gene set is not
enriched
– Hotelling’s and Goeman’s Global test of null hypothesis that
no genes in set are differentially expressed
– Goeman’s Global test for survival data
• Class prediction
– Multiple classifiers
– Complete LOOCV, k-fold CV, repeated k-fold, .632
bootstrap
– permutation significance of cross-validated error rate
Selected Features of BRB-ArrayTools
• Survival risk-group prediction
– Supervised principal components with and without clinical
covariates
– Cross-validated Kaplan Meier Curves
– Permutation test of cross-validated KM curves
• Clustering tools for class discovery with
reproducibility statistics on clusters
– Internal access to Eisen’s Cluster and Treeview
• Visualization tools including rotating 3D principal
components plot exportable to Powerpoint with
rotation controls
• Extensible via R plug-in feature
• Tutorials and datasets
BRB-ArrayTools
• Extensive built-in gene annotation and
linkage to gene annotation websites
• Publicly available for non-commercial use
– http://linus.nci.nih.gov/brb
BRB-ArrayTools
December 2006
• 6635 Registered users
• 1938 Distinct institutions
• 68 Countries
• 311 Citations
UiPath Test Automation using UiPath Test Suite series, part 5
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Data structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdfData structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdf
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 

Vanderbilt b

  • 1. Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch National Cancer Institute linus.nci.nih.gov/brb
  • 2. BRB Website http://linus.nci.nih.gov/brb • Powerpoint presentations and audio files • Reprints & Technical Reports • BRB-ArrayTools software • BRB-ArrayTools Data Archive • Sample Size Planning for Targeted Clinical Trials
  • 3. Simplified Description of Microarray Assay • Extract mRNA from cells of interest – Each mRNA molecule is transcribed from a single gene and has a linear structure complementary to that gene • Convert the mRNA to cDNA, incorporating a fluorescently labeled dye into each molecule • Distribute the cDNA sample over a solid surface containing DNA “probes” representing all “genes”; the probes are in known locations on the surface • Let the molecules from the sample hybridize with the probes for the corresponding genes • Remove excess sample and illuminate the surface with a laser at the frequency corresponding to the dye • Measure the intensity of fluorescence over each probe
  • 4. Resulting Data • Intensity over a probe is approximately proportional to abundance of mRNA molecules in the sample for the gene corresponding to the probe • 40,000 variables measured for each case – Excessive hype – Excessive skepticism – Some familiar statistical paradigms don’t work well
  • 5. Good Microarray Studies Have Clear Objectives • Class Comparison (Gene Finding) – Find genes whose expression differs among predetermined classes, e.g. tissue or experimental condition • Class Prediction – Prediction of predetermined class (e.g. treatment outcome) using information from gene expression profile – Survival risk-group prediction • Class Discovery – Discover clusters of specimens having similar expression profiles
  • 6. Class Comparison and Class Prediction • Not clustering problems • Supervised methods
  • 7. Class Prediction ≠ Class Comparison • A set of genes is not a predictive model • Emphasis in class comparison is often on understanding biological mechanisms – More difficult than accurate prediction and usually requires a different experiment • Demonstrating statistical significance of prognostic factors is not the same as demonstrating predictive accuracy
  • 8. Components of Class Prediction • Feature (gene) selection – Which genes will be included in the model • Select model type – E.g. Diagonal linear discriminant analysis, Nearest-Neighbor, … • Fitting parameters (regression coefficients) for model – Selecting value of tuning parameters
  • 9. Feature Selection • Genes that are differentially expressed among the classes at a significance level α (e.g. 0.01) – The α level is a tuning parameter – Number of false discoveries is not of direct relevance for prediction • For prediction it is usually more serious to exclude an informative variable than to include some noise variables
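The selection rule above can be sketched in a few lines of numpy. This is an illustrative sketch (simulated data, invented function names and parameter values), not BRB-ArrayTools code; the null distribution of the t statistic is approximated as normal to keep the example dependency-free.

```python
import numpy as np
from statistics import NormalDist

def two_sample_t(X, y):
    """Pooled-variance two-sample t statistic for every gene (column).
    X: n_samples x n_genes log-expression matrix; y: 0/1 class labels."""
    X0, X1 = X[y == 0], X[y == 1]
    n0, n1 = len(X0), len(X1)
    sp2 = ((n0 - 1) * X0.var(axis=0, ddof=1)
           + (n1 - 1) * X1.var(axis=0, ddof=1)) / (n0 + n1 - 2)
    return (X0.mean(axis=0) - X1.mean(axis=0)) / np.sqrt(sp2 * (1/n0 + 1/n1))

def select_genes(X, y, alpha=0.001):
    """Keep genes differentially expressed at two-sided level alpha,
    using a normal approximation for the t critical value."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return np.where(np.abs(two_sample_t(X, y)) > z)[0]

# Demo on simulated data: 10 truly informative genes out of 2000
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 2000))
X[y == 1, :10] += 2.0
selected = select_genes(X, y, alpha=0.001)
```

With alpha as a tuning parameter, a looser cutoff admits more noise genes but, as the slide notes, excluding an informative gene is usually the more serious error.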
  • 11. Optimal significance level cutoffs for gene selection. 50 differentially expressed genes out of 22,000 genes on the microarrays.
       2δ/σ    n=10     n=30      n=50
       1.00    0.167    0.003     0.00068
       1.25    0.085    0.0011    0.00035
       1.50    0.045    0.00063   0.00016
       1.75    0.026    0.00036   0.00006
       2.00    0.015    0.0002    0.00002
  • 12. Complex Gene Selection • Small subset of genes which together give most accurate predictions – Genetic algorithms • Little evidence that complex feature selection is useful in microarray problems
  • 13. Linear Classifiers for Two Classes
       l(x) = Σ_{i∈F} w_i x_i
       x = vector of log-ratios or log-signals
       F = features (genes) included in the model
       w_i = weight for the i'th feature
       decision boundary: l(x) > or < d
  • 14. Linear Classifiers for Two Classes • Fisher linear discriminant analysis • Diagonal linear discriminant analysis (DLDA) – Ignores correlations among genes • Compound covariate predictor • Golub’s weighted voting method • Support vector machines with inner product kernel • Perceptrons
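A minimal sketch of one classifier in this family, the compound covariate predictor: each selected gene is weighted by its t statistic and a new case is assigned to the class whose mean compound score is nearer. The simulated data, cutoff value, and function names are illustrative assumptions, not the reference implementation of any listed method.

```python
import numpy as np

def tstat(X, y):
    # Pooled two-sample t statistic per gene
    X0, X1 = X[y == 0], X[y == 1]
    n0, n1 = len(X0), len(X1)
    sp2 = ((n0 - 1) * X0.var(axis=0, ddof=1)
           + (n1 - 1) * X1.var(axis=0, ddof=1)) / (n0 + n1 - 2)
    return (X0.mean(axis=0) - X1.mean(axis=0)) / np.sqrt(sp2 * (1/n0 + 1/n1))

def fit_compound_covariate(X, y, t_cutoff=3.0):
    """Compound covariate predictor: l(x) = sum_i w_i x_i over the
    selected genes, with w_i = t_i; classify by the nearer class mean
    of the compound score."""
    t = tstat(X, y)
    genes = np.where(np.abs(t) > t_cutoff)[0]
    w = t[genes]
    s = X[:, genes] @ w
    m0, m1 = s[y == 0].mean(), s[y == 1].mean()
    def predict(Xnew):
        s = Xnew[:, genes] @ w
        return np.where(np.abs(s - m0) < np.abs(s - m1), 0, 1)
    return predict

# Demo: strong signal on 10 of 2000 genes
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 2000))
X[y == 1, :10] += 2.0
predict = fit_compound_covariate(X, y)
train_acc = float((predict(X) == y).mean())
```

Like DLDA, this ignores correlations among genes, which is exactly the kind of simplification the later slides argue works well when p >> n.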
  • 15. When p>>n • It is always possible to find a set of features and a weight vector for which the classification error on the training set is zero. • There is generally not sufficient information in p>>n training sets to effectively use more complex methods
  • 16. Myth • Complex classification algorithms such as neural networks perform better than simpler methods for class prediction.
  • 17. • Comparative studies have shown that simpler methods work as well or better for microarray problems because they avoid overfitting the data.
  • 18. Other Simple Methods • Nearest neighbor classification • Nearest k-neighbors • Nearest centroid classification • Shrunken centroid classification
  • 19. Evaluating a Classifier • Most statistical methods were not developed for p>>n prediction problems • Fit of a model to the same data used to develop it is no evidence of prediction accuracy for independent data • Demonstrating statistical significance of prognostic factors is not the same as demonstrating predictive accuracy • Testing whether analysis of independent data results in selection of the same set of genes is not an appropriate test of predictive accuracy of a classifier
  • 22. Internal Validation of a Classifier • Re-substitution estimate – Develop classifier on dataset, test predictions on same data – Very biased for p>>n • Split-sample validation • Cross-validation
  • 23. Split-Sample Evaluation • Training-set – Used to select features, select model type, determine parameters and cut-off thresholds • Test-set – Withheld until a single model is fully specified using the training-set. – Fully specified model is applied to the expression profiles in the test-set to predict class labels. – Number of errors is counted
  • 24. Leave-one-out Cross Validation • Omit sample 1 – Develop multivariate classifier from scratch on training set with sample 1 omitted – Predict class for sample 1 and record whether prediction is correct
  • 25. Leave-one-out Cross Validation • Repeat analysis for training sets with each single sample omitted one at a time • e = number of misclassifications determined by cross-validation • Subdivide e for estimation of sensitivity and specificity
  • 26. • With proper cross-validation, the model must be developed from scratch for each leave-one-out training set. This means that feature selection must be repeated for each leave-one-out training set. – Simon R, Radmacher MD, Dobbin K, McShane LM. Pitfalls in the analysis of DNA microarray data. Journal of the National Cancer Institute 95:14-18, 2003. • The cross-validated estimate of misclassification error is an estimate of the prediction error for model fit using specified algorithm to full dataset
  • 27. Prediction on Simulated Null Data Generation of Gene Expression Profiles • 14 specimens (Pi is the expression profile for specimen i) • Log-ratio measurements on 6000 genes • Pi ~ MVN(0, I6000) • Can we distinguish between the first 7 specimens (Class 1) and the last 7 (Class 2)? Prediction Method • Compound covariate prediction • Compound covariate built from the log-ratios of the 10 most differentially expressed genes.
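The null-data experiment above, combined with slide 26's rule that gene selection must be repeated for every leave-one-out training set, can be sketched as follows. The simulation parameters match the slide (14 specimens, 6,000 genes, MVN(0, I), compound covariate from the 10 most differentially expressed genes); the random seed and helper names are invented for the demo.

```python
import numpy as np

def tstat(Xtr, ytr):
    # Pooled two-sample t statistic per gene
    X0, X1 = Xtr[ytr == 0], Xtr[ytr == 1]
    n0, n1 = len(X0), len(X1)
    sp2 = ((n0 - 1) * X0.var(axis=0, ddof=1)
           + (n1 - 1) * X1.var(axis=0, ddof=1)) / (n0 + n1 - 2)
    return (X0.mean(axis=0) - X1.mean(axis=0)) / np.sqrt(sp2 * (1/n0 + 1/n1))

def ccp_predict(Xtr, ytr, Xte, k=10):
    """Compound covariate from the k most differentially expressed
    genes, selected from the training set only."""
    t = tstat(Xtr, ytr)
    genes = np.argsort(-np.abs(t))[:k]
    w = t[genes]
    s_tr = Xtr[:, genes] @ w
    m0, m1 = s_tr[ytr == 0].mean(), s_tr[ytr == 1].mean()
    s = Xte[:, genes] @ w
    return np.where(np.abs(s - m0) < np.abs(s - m1), 0, 1)

rng = np.random.default_rng(0)
n, p = 14, 6000
X = rng.normal(size=(n, p))            # P_i ~ MVN(0, I): pure noise
y = np.repeat([0, 1], 7)

# Resubstitution: genes selected on, and errors counted on, the same data
resub_err = float(np.mean(ccp_predict(X, y, X) != y))

# Proper LOOCV: gene selection redone for each leave-one-out training set
errors = 0
for i in range(n):
    tr = np.arange(n) != i
    errors += int(ccp_predict(X[tr], y[tr], X[i:i + 1])[0] != y[i])
loocv_err = errors / n
```

On this null data the resubstitution error is essentially zero while proper cross-validation yields roughly chance-level error, reproducing the pitfall the slide illustrates.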
  • 28. [Histogram: number of misclassifications, x-axis 0–20]
  • 30. Major Flaws Found in 40 Studies Published in 2004 • Inadequate control of multiple comparisons in gene finding – 9/23 studies had unclear or inadequate methods to deal with false positives • 10,000 genes x .05 significance level = 500 false positives • Misleading report of prediction accuracy – 12/28 reports based on incomplete cross-validation • Misleading use of cluster analysis – 13/28 studies invalidly claimed that expression clusters based on differentially expressed genes could help distinguish clinical outcomes • 50% of studies contained one or more major flaws
  • 31. Myth • Split sample validation is superior to LOOCV or 10-fold CV for estimating prediction error
  • 33. Comparison of Internal Validation Methods Molinaro, Pfeiffer & Simon • For small sample sizes, LOOCV is much less biased than split-sample validation • For small sample sizes, LOOCV is preferable to 10-fold, 5-fold cross-validation or repeated k-fold versions • For moderate sample sizes, 10-fold is preferable to LOOCV • Some claims for bootstrap resampling for estimating prediction error are not valid for p>>n problems
  • 35. Simulated Data, 40 cases, 10 genes selected from 5000
       Method              Estimate   Std Deviation
       True                .078
       Resubstitution      .007       .016
       LOOCV               .092       .115
       10-fold CV          .118       .120
       5-fold CV           .161       .127
       Split sample 1-1    .345       .185
       Split sample 2-1    .205       .184
       .632+ bootstrap     .274       .084
  • 36. Simulated Data, 40 cases
       Method              Estimate   Std Deviation
       True                .078
       10-fold             .118       .120
       Repeated 10-fold    .116       .109
       5-fold              .161       .127
       Repeated 5-fold     .159       .114
       Split 1-1           .345       .185
       Repeated split 1-1  .371       .065
  • 37. DLBCL Data
       Method            Bias    Std Deviation   MSE
       LOOCV             -.019   .072            .008
       10-fold CV        -.007   .063            .006
       5-fold CV         .004    .07             .007
       Split 1-1         .037    .117            .018
       Split 2-1         .001    .119            .017
       .632+ bootstrap   -.006   .049            .004
  • 39. • Ordinary bootstrap – Training and test sets overlap • Bootstrap cross-validation (Fu, Carroll, Wang) – Perform LOOCV on bootstrap samples – Training and test sets overlap • Leave-one-out bootstrap – Predict for cases not in the bootstrap sample – Training sets are too small • Out-of-bag bootstrap (Breiman) – Predict for case i by majority rule over the bootstrap samples not containing case i • .632+ bootstrap – w·LOOBS + (1−w)·RSB
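The .632+ combination in the last bullet can be written out directly. The sketch below follows the Efron–Tibshirani (1997) form, in which the weight w grows from .632 toward 1 as the relative overfitting rate increases; the exact clipping conventions are an assumption of this sketch.

```python
def err_632plus(resub_err, loo_boot_err, no_info_err):
    """.632+ estimate w*LOOBS + (1-w)*RSB: resub_err is the
    resubstitution error (RSB), loo_boot_err the leave-one-out
    bootstrap error (LOOBS), no_info_err the error of a useless
    classifier. The relative overfitting rate R compares LOOBS's
    excess over RSB to the no-information excess."""
    err1 = min(loo_boot_err, no_info_err)
    denom = no_info_err - resub_err
    r = (err1 - resub_err) / denom if denom > 0 else 0.0
    r = min(max(r, 0.0), 1.0)              # clip R into [0, 1]
    w = 0.632 / (1.0 - 0.368 * r)          # w in [.632, 1]
    return (1.0 - w) * resub_err + w * err1
```

For example, with a severely overfit p >> n classifier (resubstitution error 0, leave-one-out bootstrap error 0.45, no-information rate 0.5) the weight moves close to 1, so the estimate stays near the bootstrap error rather than being pulled toward the optimistic resubstitution figure.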
  • 43. Permutation Distribution of Cross-validated Misclassification Rate of a Multivariate Classifier • Randomly permute class labels and repeat the entire cross-validation • Re-do for all (or 1000) random permutations of class labels • Permutation p value is fraction of random permutations that gave as few misclassifications as e in the real data
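The permutation procedure can be sketched end to end with a toy classifier. A nearest-centroid classifier on simulated data stands in for the real classifier purely for speed; the data, seed, and parameter values are invented for the demo. Note that gene selection sits inside the leave-one-out loop, and the entire cross-validation is repeated for every permuted labeling.

```python
import numpy as np

def loocv_errors(X, y, k=10):
    """LOOCV misclassification count for a nearest-centroid classifier
    built on the k genes with the largest mean class difference; gene
    selection is repeated inside every leave-one-out fold."""
    n = len(y)
    errors = 0
    for i in range(n):
        tr = np.arange(n) != i
        Xtr, ytr = X[tr], y[tr]
        d = Xtr[ytr == 0].mean(axis=0) - Xtr[ytr == 1].mean(axis=0)
        genes = np.argsort(-np.abs(d))[:k]
        c0 = Xtr[ytr == 0][:, genes].mean(axis=0)
        c1 = Xtr[ytr == 1][:, genes].mean(axis=0)
        xi = X[i, genes]
        pred = 0 if ((xi - c0) ** 2).sum() < ((xi - c1) ** 2).sum() else 1
        errors += int(pred != y[i])
    return errors

rng = np.random.default_rng(0)
X = rng.normal(size=(14, 500))
y = np.repeat([0, 1], 7)
X[y == 1, :10] += 2.0                  # genuine class difference

e_obs = loocv_errors(X, y)
B = 100                                # use 1000+ in practice
hits = sum(loocv_errors(X, rng.permutation(y)) <= e_obs for _ in range(B))
p_value = (hits + 1) / (B + 1)
```

The +1 terms give the standard conservative permutation p value, so it can never be exactly zero.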
  • 46. Does an Expression Profile Classifier Predict More Accurately Than Standard Prognostic Variables? • Not an issue of which variables are significant after adjusting for which others or which are independent predictors – Predictive accuracy, not significance • The two classifiers can be compared by ROC analysis as functions of the threshold for classification • The predictiveness of the expression profile classifier can be evaluated within levels of the classifier based on standard prognostic variables
  • 47. Does an Expression Profile Classifier Predict More Accurately Than Standard Prognostic Variables? • Some publications fit logistic model to standard covariates and the cross-validated predictions of expression profile classifiers • This is valid only with split-sample analysis because the cross-validated predictions are not independent
       logit p(y_i) = α + β x_(−i) + γ z_i
       (x_(−i) = cross-validated classifier prediction for case i; z_i = standard covariates)
  • 48. Survival Risk Group Prediction • For analyzing right censored data to develop predictive classifiers it is not necessary to make the data binary • Can do cross-validation to predict high or low risk group for each case • Compute Kaplan-Meier curves of predicted risk groups • Permutation significance of log-rank statistic • Implemented in BRB-ArrayTools • BRB-ArrayTools also provides for comparing the risk group classifier based on expression profiles to one based on standard covariates and one based on a combination of both types of variables
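BRB-ArrayTools handles the full cross-validated survival risk-group procedure; the sketch below is only its Kaplan-Meier building block, a minimal estimator one could apply separately to the predicted high- and low-risk groups. It is illustrative only, not the BRB-ArrayTools implementation.

```python
import numpy as np

def kaplan_meier(time, event):
    """Kaplan-Meier curve: at each distinct event time t, multiply the
    survival estimate by (1 - d_t / n_t), where d_t = events at t and
    n_t = number still at risk. event is 1 for death, 0 for censoring."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    order = np.argsort(time)
    time, event = time[order], event[order]
    times, surv = [], []
    s, n, i = 1.0, len(time), 0
    while i < n:
        t, at_risk, d = time[i], n - i, 0
        while i < n and time[i] == t:  # collect ties at time t
            d += event[i]
            i += 1
        if d:                          # censoring alone does not step the curve
            s *= 1.0 - d / at_risk
            times.append(t)
            surv.append(s)
    return np.array(times), np.array(surv)
```

In the cross-validated procedure, each case's risk group is predicted from a model that did not use that case, and curves like these are then drawn for the two predicted groups and compared by a permutation log-rank test.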
  • 49. Myth • Huge sample sizes are needed to develop effective predictive classifiers
  • 50. Sample Size Planning References • K Dobbin, R Simon. Sample size determination in microarray experiments for class comparison and prognostic classification. Biostatistics 6:27-38, 2005 • K Dobbin, R Simon. Sample size planning for developing classifiers using high dimensional DNA microarray data. Biostatistics (2007)
  • 51. Sample Size Planning for Classifier Development • The expected value (over training sets) of the probability of correct classification PCC(n) should be within γ of the maximum achievable PCC(∞)
  • 52. Probability Model • Two classes • Log expression or log ratio MVN in each class with common covariance matrix • m differentially expressed genes • p-m noise genes • Expression of differentially expressed genes are independent of expression for noise genes • All differentially expressed genes have same inter-class mean difference 2δ • Common variance for differentially expressed genes and for noise genes
  • 53. Classifier • Feature selection based on univariate t-tests for differential expression at significance level α • Simple linear classifier with equal weights (except for sign) for all selected genes. Power for selecting each of the informative genes that are differentially expressed by mean difference 2δ is 1-β(n)
  • 54. • For 2 classes of equal prevalence, let λ1 denote the largest eigenvalue of the covariance matrix of informative genes. Then
       PCC(∞) ≤ Φ( √m δ / (σ √λ1) )
  • 55. PCC(n) ≥ Φ( (1−β) m δ / ( σ √( λ1 [ (1−β) m + α (p−m) ] ) ) ), where 1−β = 1−β(n) is the power for selecting each informative gene
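The planning criterion of slide 51 can be evaluated numerically from the bounds on slides 54–55. The sketch below approximates the per-gene power 1−β(n) with a normal approximation to the two-sample t test (the published method uses exact calculations and optimizes α); the values of m, p, α and the effect size are illustrative, and λ1 is set to 1 (independent informative genes).

```python
import math
from statistics import NormalDist

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def pcc_inf(m, delta, lam=1.0):
    # Slide-54 bound; delta is already scaled by sigma (delta/sigma)
    return phi(math.sqrt(m) * delta / math.sqrt(lam))

def pcc_n(n, m, p, delta, alpha, lam=1.0):
    # Per-gene power 1-beta(n): normal approximation to the two-sample
    # t test with n/2 cases per class and inter-class difference 2*delta
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    power = phi(delta * math.sqrt(n) - z)
    num = power * m * delta
    den = math.sqrt(lam * (power * m + alpha * (p - m)))
    return phi(num / den)

def smallest_n(gamma, m=10, p=22000, effect=1.0, alpha=1e-5):
    """Smallest (even) n with E[PCC(n)] within gamma of PCC(inf)."""
    delta = effect / 2.0               # effect = 2*delta/sigma
    target = pcc_inf(m, delta) - gamma
    for n in range(10, 2001, 2):
        if pcc_n(n, m, p, delta, alpha) >= target:
            return n
    return None
```

With these illustrative settings, loosening the tolerance from γ = 0.05 to γ = 0.10 reduces the required sample size, matching the qualitative behavior shown in the slide-56 figure.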
  • 56. [Figure] Sample size as a function of effect size (log-base-2 fold change between classes divided by standard deviation, 2δ/σ). Two tolerances shown: γ = 0.05 and γ = 0.10. Each class is equally represented in the population. 22,000 genes on an array.
  • 58. [Figure] b) PCC(60) as a function of the proportion in the under-represented class. Parameter settings same as a), with 10 differentially expressed genes among 22,000 total genes. If the proportion in the under-represented class is small (e.g., <20%), then the PCC(60) can decline significantly.
  • 61. Acknowledgements • Kevin Dobbin • Alain Dupuy • Wenyu Jiang • Annette Molinaro • Ruth Pfeiffer • Michael Radmacher • Joanna Shih • Yingdong Zhao • BRB-ArrayTools Development Team
  • 62. BRB-ArrayTools • Contains analysis tools that I have selected as valid and useful • Analysis wizard and multiple help screens for biomedical scientists • Imports data from all platforms and major databases • Automated import of data from NCBI Gene Expression Omnibus
  • 63. Predictive Classifiers in BRB-ArrayTools • Classifiers – Diagonal linear discriminant – Compound covariate – Bayesian compound covariate – Support vector machine with inner product kernel – K-nearest neighbor – Nearest centroid – Shrunken centroid (PAM) – Random forest – Tree of binary classifiers for k-classes • Survival risk-group – Supervised principal components • Feature selection options – Univariate t/F statistic – Hierarchical variance option – Restricted by fold effect – Univariate classification power – Recursive feature elimination – Top-scoring pairs • Validation methods – Split-sample – LOOCV – Repeated k-fold CV – .632+ bootstrap
  • 64. Selected Features of BRB-ArrayTools • Multivariate permutation tests for class comparison to control number and proportion of false discoveries with specified confidence level – Permits blocking by another variable, pairing of data, averaging of technical replicates • SAM – Fortran implementation 7X faster than R versions • Extensive annotation for identified genes – Internal annotation of NetAffx, Source, Gene Ontology, Pathway information – Links to annotations in genomic databases • Find genes correlated with quantitative factor while controlling number or proportion of false discoveries • Find genes correlated with censored survival while controlling number or proportion of false discoveries • Analysis of variance
  • 65. Selected Features of BRB-ArrayTools • Gene set enrichment analysis. – Gene Ontology groups, signaling pathways, transcription factor targets, micro-RNA putative targets – Automatic data download from Broad Institute – KS & LS test statistics for null hypothesis that gene set is not enriched – Hotelling’s and Goeman’s Global test of null hypothesis that no genes in set are differentially expressed – Goeman’s Global test for survival data • Class prediction – Multiple classifiers – Complete LOOCV, k-fold CV, repeated k-fold, .632 bootstrap – permutation significance of cross-validated error rate
  • 66. Selected Features of BRB-ArrayTools • Survival risk-group prediction – Supervised principal components with and without clinical covariates – Cross-validated Kaplan-Meier curves – Permutation test of cross-validated KM curves • Clustering tools for class discovery with reproducibility statistics on clusters – Internal access to Eisen’s Cluster and TreeView • Visualization tools including rotating 3D principal components plot exportable to Powerpoint with rotation controls • Extensible via R plug-in feature • Tutorials and datasets
  • 67. BRB-ArrayTools • Extensive built-in gene annotation and linkage to gene annotation websites • Publicly available for non-commercial use – http://linus.nci.nih.gov/brb
  • 68. BRB-ArrayTools December 2006 • 6635 Registered users • 1938 Distinct institutions • 68 Countries • 311 Citations