A Study of RandomForests Learning Mechanism with Application to the Identification of Informative Gene Interactions in Microarray Data
 

Presentation Transcript

  • A Study of Random Forests Learning Mechanism with Application to the Identification of Informative Gene Interactions in Microarray Data
    Jorge M. Arevalillo and Hilario Navarro
    Dpt. Statistics and Operational Research, Universidad Nacional de Educación a Distancia
    Salford Analytics and Data Mining Conference 2012, San Diego
  • Outline
      • Weak marginal / strong bivariate genetic interactions
      • RF learning mechanism
      • RF bivariate interaction detector procedure
      • Controlling the curse of dimensionality
      • Handling the small sample effect
      • Application to microarray data
      • Conclusions
    Jorge M. Arevalillo and H. Navarro. RF Learning Mechanism with Application to the Identification of Gene Interactions
  • Human Genetics Basics
      • DNA is often described as the blueprint of living organisms. It is composed of two complementary strands of nucleotides: adenine (A) pairs with thymine (T) and cytosine (C) with guanine (G)
      • Basically, a gene is a piece of DNA that contains the genetic information for the synthesis of a protein
      • The human genome in numbers
          • 23 pairs of chromosomes
          • 2 meters of DNA
          • A sequence about 3 billion base pairs long
          • 30,000 to 40,000 genes
          • Over 99% of the genome is identical in all human beings
  • The central dogma of molecular biology
      • The expression of the genetic information stored in DNA occurs in two stages
          • TRANSCRIPTION. DNA is transcribed into messenger RNA (mRNA)
          • TRANSLATION. mRNA is transported to the cell cytoplasm and translated to produce a protein
      • Amino acids are used to construct proteins, which in turn determine the observed phenotype
      • DNA microarray technologies make it possible to measure the abundance of mRNA by monitoring the expression levels of hundreds or thousands of genes under different conditions of the phenotype
  • Weak marginal / strong bivariate genetic interactions
      • In binary classification we define a weak marginal / strong bivariate (WM/SB) gene-to-gene interaction as a pair of variables (genes) whose joint distribution discriminates the outcome but whose marginal distributions are irrelevant for class separation
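The XOR pattern used later in the talk is the textbook instance of this definition. The sketch below is only an illustration under simplifying assumptions: the two "genes" are coded as independent fair 0/1 values rather than continuous expression levels, and the function names are hypothetical. It shows that a rule based on either gene alone performs at about chance level, while the joint rule recovers the outcome exactly.

```python
import random

def make_xor_sample(n, rng):
    """Draw n cases: two binary 'genes' and an outcome equal to their XOR."""
    rows = []
    for _ in range(n):
        x1 = rng.randint(0, 1)
        x2 = rng.randint(0, 1)
        rows.append((x1, x2, x1 ^ x2))
    return rows

def accuracy(rows, predict):
    """Share of cases whose outcome is reproduced by predict(x1, x2)."""
    return sum(predict(x1, x2) == y for x1, x2, y in rows) / len(rows)

def best_marginal_accuracy(rows, index):
    """Better of the two single-gene rules 'predict x' and 'predict 1 - x'."""
    direct = accuracy(rows, lambda x1, x2: (x1, x2)[index])
    flipped = accuracy(rows, lambda x1, x2: 1 - (x1, x2)[index])
    return max(direct, flipped)

rng = random.Random(0)
rows = make_xor_sample(2000, rng)

acc_gene1 = best_marginal_accuracy(rows, 0)          # close to 0.5
acc_gene2 = best_marginal_accuracy(rows, 1)          # close to 0.5
acc_joint = accuracy(rows, lambda x1, x2: x1 ^ x2)   # exactly 1.0
print(acc_gene1, acc_gene2, acc_joint)
```

Any method that screens genes one at a time would therefore discard both members of such a pair.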
  • RF learning mechanism
      • Random Forest (RF) is an ensemble of decision trees grown in a special way
      • Randomness is injected into the RF mechanism in two ways: each tree in the forest is grown on a bootstrap resample, and the best splitter at each node is sought within a randomly selected subset of inputs
      • The number ntree of trees in the forest and the number R of candidate inputs for splitting each node must be set in advance. Defaults: ntree = 500 and R = square root of the number p of inputs
      • Each tree is grown on a bootstrap sample that contains nearly 63% of the distinct cases. The classification error rate is estimated using the remaining 37% of observations. The error rate evaluated on these out-of-bag cases is called oob
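The 63% / 37% split follows from sampling n cases with replacement: a given case is missed by all n draws with probability (1 - 1/n)^n, which tends to 1/e, about 0.368. The short check below computes the in-bag share exactly and by simulation; the function names are illustrative, and n = 57 is used because that is the number of cases retained in the colon data analysed later in the talk.

```python
import math
import random

def in_bag_fraction(n):
    """Probability that a given case appears in a bootstrap sample of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n

def simulated_in_bag_fraction(n, rng):
    """Share of distinct cases hit by n draws with replacement."""
    drawn = {rng.randrange(n) for _ in range(n)}
    return len(drawn) / n

limit = 1.0 - math.exp(-1.0)          # large-sample value, about 0.632
exact_57 = in_bag_fraction(57)        # slightly above the limit for small n
simulated = simulated_in_bag_fraction(10000, random.Random(1))
print(round(limit, 3), round(exact_57, 3), round(simulated, 3))
```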
  • RF table of variable importance
      • The high dimensional nature of the data obtained from gene expression microarray experiments has created the need for variable selection procedures that separate relevant predictors (genes), which carry useful information for classifying the phenotype, from irrelevant ones
      • RF generates variable importance measures that rank predictors according to their contribution to the predictive accuracy of the ensemble
      • RF gives two measures of variable importance
          • GINI MEASURE. Each variable is assigned a score that adds up the improvements in the Gini index over all the nodes of the trees in the forest that use the variable as the splitting variable
          • PERMUTATION BASED MEASURE. For each variable, its values are randomly permuted across the cases, which turns it into a noisy predictor; this noisy predictor is used in place of the original one and the oob is computed again. The importance of the variable is defined as the difference between the oob errors after and before permutation
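The permutation based measure can be sketched in a few lines. This is a simplified illustration, not the RF implementation: it uses a fixed toy classifier and a plain evaluation set in place of a fitted forest and its out-of-bag cases, and all names (`error_rate`, `permutation_importance`, `predict`) are hypothetical.

```python
import random

def error_rate(predict, X, y):
    """Share of cases misclassified by predict(row)."""
    wrong = sum(predict(row) != label for row, label in zip(X, y))
    return wrong / len(y)

def permutation_importance(predict, X, y, column, rng):
    """Increase in error after shuffling one column across the cases."""
    baseline = error_rate(predict, X, y)
    shuffled = [row[column] for row in X]
    rng.shuffle(shuffled)
    X_perm = [row[:column] + [value] + row[column + 1:]
              for row, value in zip(X, shuffled)]
    return error_rate(predict, X_perm, y) - baseline

rng = random.Random(0)
# Toy data: column 0 determines the label, column 1 is pure noise.
X = [[rng.random(), rng.random()] for _ in range(500)]
y = [int(row[0] > 0.5) for row in X]

def predict(row):
    """Stand-in for a fitted classifier; it only looks at column 0."""
    return int(row[0] > 0.5)

imp_signal = permutation_importance(predict, X, y, 0, rng)  # large
imp_noise = permutation_importance(predict, X, y, 1, rng)   # zero
print(imp_signal, imp_noise)
```

Permuting the informative column destroys its link to the outcome and the error jumps; permuting the noise column changes nothing, so its importance is zero.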
  • The oob error rate degradation in high dimensional settings
      • An extreme synthetic example: the XOR interaction pattern
      • The oob error rate degrades rapidly as the number of noisy inputs increases; hence the XOR signal will be lost
      • The interaction is captured as long as it appears alone, without the disturbance of the noisy inputs; so an exhaustive search among all the pairs of inputs is required for the RF learning mechanism to detect the interaction
      • Our proposal offers shortcuts and tricks that simplify the search
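One back-of-the-envelope way to see part of the difficulty (an illustration, not the authors' analysis): a node can only split on a gene that is among its R randomly drawn candidates. With the default R equal to the square root of p, the chance that a given gene is a candidate is R/p, and the chance that both genes of a pair are candidates at the same node is R(R-1)/(p(p-1)). The sketch assumes R is the square root rounded down; the rounding rule is an assumption.

```python
import math

def candidate_probabilities(p):
    """Return (R, P(one given gene is a candidate), P(both genes of a pair are))."""
    r = math.isqrt(p)                      # assumed rounding: floor(sqrt(p))
    one = r / p
    both = r * (r - 1) / (p * (p - 1))
    return r, one, both

for p in (4, 100, 2000):
    r, one, both = candidate_probabilities(p)
    print(f"p={p:5d}  R={r:3d}  one gene: {one:.4f}  both genes: {both:.6f}")
```

For p = 2000, the size of the colon data, a given gene is a candidate at a node with probability 0.022, and both genes of a pair with probability below 0.0005.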
  • Search procedure. Sequential stage
      • The RF ranking of variable importance gives new insights into the degradation of the oob error rate
      • Some alternatives that explore this ranking in a sequential manner, Díaz Uriarte (2006) and Genuer (2008), have been proposed to identify relevant patterns correlated with the outcome
  • Search procedure. Hunting stage
      • The second stage is designed to hunt for bivariate associations that are difficult to uncover and are lost by sequential search strategies
      • The idea is to group the inputs into blocks, then run RF on all the variables belonging to each pair of blocks and use its oob error to highlight the block matches where WM/SB interactions are more likely to appear. This will limit the search
    [Figure: block i and block j combined into match (i,j); ranking of block matches]
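The bookkeeping of the hunting stage can be sketched as follows. Several points are assumptions made for illustration: blocks are formed from consecutive variables, only pairs of distinct blocks are matched, and `score` stands for any function that returns an error measure for a set of variables (in the talk, the oob error of an RF run on them). The sizes 1700 and 5 are the ones reported later for the colon data.

```python
from itertools import combinations

def make_blocks(variables, block_size):
    """Split the list of variables into consecutive blocks."""
    return [variables[i:i + block_size]
            for i in range(0, len(variables), block_size)]

def block_matches(blocks):
    """Yield ((i, j), variables) for every pair of distinct blocks i < j."""
    for i, j in combinations(range(len(blocks)), 2):
        yield (i, j), blocks[i] + blocks[j]

def rank_matches(matches, score):
    """Sort matches by score(variables), lowest error first."""
    return sorted(matches, key=lambda match: score(match[1]))

genes = [f"gene_{g}" for g in range(1700)]   # hypothetical gene labels
blocks = make_blocks(genes, 5)
matches = list(block_matches(blocks))
n_gene_pairs = len(genes) * (len(genes) - 1) // 2

print(len(blocks), len(matches), n_gene_pairs)  # 340 57630 1444150
```

Under these assumptions the search runs 57,630 small forests of 10 variables each, and an exhaustive pairwise search is needed only inside the top ranked matches rather than over all 1,444,150 gene pairs.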
  • Drawback with the oob error rate
      • Simulation experiment with block size = 6
      • The boxplots show that the oob error rate cannot distinguish between block matches containing a weak marginal / strong bivariate association and block matches with only noisy inputs
      • The curses of dimensionality and low sample size are coming up again
    [Figure: boxplots of the oob error rate for block matches containing the XOR interaction and for block matches with noisy inputs only. Sample sizes (40,40): overlap = 0.31. Sample sizes (40,20): overlap = 0.42]
  • Data augmentation
      • To overcome this drawback, the data are artificially augmented and then the oob error rate of an RF run on the augmented data is computed
      • Data perturbation is carried out according to the following scheme
    [Figure: perturbation scheme]
          • r is the sample range of X
          • b is the number of bins into which the range is divided. It controls the amount of perturbation
          • An augmentation parameter k, which gives the factor by which the dataset must be amplified, is also introduced
      • Details in Arevalillo and Navarro (2011), Fundamenta Informaticae, Special Issue on Machine Learning in Bioinformatics
      • The new oob error computed on the augmented merged dataset is actually a perturbed error rate measure. We call it perturbed oob
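The transcript does not reproduce the perturbation formula, so the following is only a guess at its general shape, not the authors' scheme. It assumes that each perturbed copy adds uniform noise of at most half a bin width, r / (2b), to every value, and that the augmented dataset consists of the original cases plus k - 1 perturbed copies with unchanged labels. The function name `augment` is hypothetical.

```python
import random

def augment(X, y, b, k, rng):
    """Return the original cases followed by k - 1 perturbed copies (assumed scheme)."""
    n_cols = len(X[0])
    widths = []
    for c in range(n_cols):
        column = [row[c] for row in X]
        r = max(column) - min(column)      # sample range of the variable
        widths.append(r / b)               # bin width r / b
    X_aug = [list(row) for row in X]
    y_aug = list(y)
    for _ in range(k - 1):
        for row, label in zip(X, y):
            # Assumption: uniform jitter within half a bin on either side.
            X_aug.append([v + rng.uniform(-w / 2, w / 2)
                          for v, w in zip(row, widths)])
            y_aug.append(label)
    return X_aug, y_aug

X = [[0.0, 10.0], [1.0, 20.0], [2.0, 30.0]]
y = [0, 1, 0]
X_aug, y_aug = augment(X, y, b=5, k=3, rng=random.Random(0))
print(len(X_aug), len(y_aug))  # 9 9
```

Under this reading, a larger b gives narrower bins and milder perturbation, and k multiplies the number of cases seen by the forest.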
  • The perturbed oob measure
      • The perturbed oob measure overcomes the initial drawback
    [Figure: boxplots of the oob error rate (XOR versus noisy inputs) next to boxplots of the perturbed oob for k = 1, 3, 5, 7, 9]
    Overlap between block matches with the XOR interaction and block matches with noisy inputs only:
      sample sizes   oob    k=1    k=3    k=5    k=7    k=9
      (40,40)        0.31   0.15   0.07   0.05   0.05   0.03
      (40,20)        0.42   0.35   0.24   0.18   0.16   0.14
  • Summary of the algorithm
      • The details of the implementation of the algorithm can be found in Arevalillo and Navarro (2011), Fundamenta Informaticae, Special Issue on Machine Learning in Bioinformatics
      • Usually bsize = 6, 8, b = 5 and k = 3, 5, 7 are good settings
      • Strategies for the sequential search step include screeplots for variable importance, VARSEL (Díaz Uriarte, BMC Bioinformatics, 2006) and oob error smoothing (Genuer et al., INRIA, 2008)
  • Application to the colon cancer data
      • Gene expression levels corresponding to 40 tumor and 22 healthy tissue samples were collected with an Affymetrix oligonucleotide Hum6000 array (Alon et al., PNAS 1999). The expression levels were arranged in a matrix with 2000 columns (genes) and 62 rows, along with a column containing the clinical outcome variable Y
      • Y = 1 for tumorous samples and Y = 0 for healthy samples
      • The data are publicly available and can be downloaded from the package colonCA of Bioconductor, www.bioconductor.org
  • Data pre-processing
      • Gene expression intensities were pre-processed with a log transformation and a standardization across genes
      • The figure shows the potential outliers given by the RF outlier detector. Cases 18, 20, 52, 55 and 58 were previously identified as outliers in the specialized literature (Chow et al., Physiol. Genomics 2001; Ambroise and McLachlan, PNAS 2002)
      • These outliers might be caused by different sources of error while collecting the data. We eliminate them from the analysis and end up with a data set containing 57 cases and 2000 predictors
    [Figure: RF outlier measure by case]
  • A first selection. Sequential search
      • Simple inspection of the screeplot of RF variable importance allows us to identify the most relevant variables. A forward sequential search strategy as in Genuer (2008) gives a selection containing the most informative genes for classifying the clinical outcome
      • List of genes selected after the sequential search step. It agrees closely with previous selections (Ben-Dor et al., J. Comp. Biol. 2000)
  • Results
      • Control parameters for the hunting stage of the procedure were set to blocksize = 5, k = 5 and b = 5. The RF controls ntree and mtry were set to their default values
      • Findings for the three top ranked block matches (heat map plots of the oob for each match and scatter plots for the selected gene-to-gene interactions)
      • Bivariate gene interactions: (X86693, M80815), (R60883, U04953), (L12350, X86693)
  • Additional insights
      • Oob error rate with all the genes = 3.5%
      • Oob error rate with the first 300 top ranked genes as predictors = 1.8%
      • Oob error rate with all the genes but the 300 top ranked = 26.3%
      • In this case the sequential stage is carried out manually by filtering the 300 top ranked genes
      • The hunting step of the RF bivariate interaction detector procedure uncovers interesting patterns among the remaining 1700 genes
      • Interesting gene associations come up from the first 100 positions of the ranking of block matches
      • RF oob error for the best 10 gene-to-gene interactions is 10.5%
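With so few cases, each oob percentage corresponds to a handful of misclassified samples. Assuming all four rates were computed on the 57 cases retained after outlier removal (the slides do not state this explicitly), the implied counts can be recovered and checked against the reported figures:

```python
n_cases = 57  # cases left after removing the five outliers

reported = {
    "all genes": 3.5,
    "top 300 genes": 1.8,
    "all but top 300 genes": 26.3,
    "best 10 interactions": 10.5,
}

def implied_errors(rate_percent, n):
    """Number of misclassified cases closest to the reported rate."""
    return round(rate_percent / 100 * n)

counts = {name: implied_errors(rate, n_cases) for name, rate in reported.items()}
recomputed = {name: round(100 * k / n_cases, 1) for name, k in counts.items()}

for name in reported:
    print(f"{name}: {counts[name]} of {n_cases} -> {recomputed[name]}%")
```

Under that assumption the rates correspond to 2, 1, 15 and 6 misclassified cases, and each count reproduces the reported percentage after rounding.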
  • Summary and conclusions
      • RF is a widely used algorithm for classification and variable selection in high dimensional, small sample data. However, sequential search strategies based on the oob error and its ranking of variable importance usually fail to uncover hidden weak marginal / strong bivariate interactions in these data structures
      • This happens because of the curse of dimensionality and the small sample size; both degrade the performance of the RF classifier. Data augmentation and an exhaustive exploration of the feature space by blocks, which uses RF as the search engine, will protect us from this phenomenon
      • A perturbed oob measure is obtained when RF is run on all the features belonging to every pair of blocks in the augmented dataset
      • So the ranking of perturbed oobs will limit the search from the set of all possible bivariate interactions to the variables within the top ranked blocks
      • The application of the proposed bivariate interaction detector algorithm to a real gene expression data set was able to uncover WM/SB gene-to-gene interactions associated with the phenotype
  • Future research
      • The method was proposed for binary classification. Its extension to multi-class problems and the development of tricks and shortcuts that reduce the computational cost open future research avenues
      • The interaction detector algorithm uses RF as the search engine. The use of other search engines with classifiers like LDA, QDA, SVM, … is also an issue for future research. Recently, Arevalillo and Navarro (2011), BMC Bioinformatics, have proposed QDA as the search engine
      • The development of an R package that incorporates all these improvements
      • Finally, the study of the problem of finding informative WM/SB genomic interactions in SNP data is an open research issue
  • Thank you for your attention
    Jorge M. Arevalillo: jmartin@ccia.uned.es
    Hilario Navarro: hnavarro@ccia.uned.es
    Department of Statistics and Operational Research, Universidad Nacional de Educación a Distancia
    Paseo Senda del Rey nº 9. 28040 Madrid