Bolstered error estimation for discrete classifier applied to genomic signal processing
Marcel Brun (1), Virginia Ballarín (1)
(1) Laboratorio de Procesos y Medición de Señales, Facultad de Ingeniería, UNMdP
mbrun@fi.mdp.edu.ar

Introduction

Bolstering is an error estimation technique that provides a less biased estimate than resubstitution while avoiding the large variability of leave-one-out and cross-validation [Braga & Dougherty 2004]. So far, a general model for bolstering has been provided only for continuous classification spaces, like R^n, where the concept of expanding the sample points by a circular kernel is conceptually clear and works very well in practice [Sima & Braga 2005]. On the other hand, discrete classifiers, like the ones used for image processing and genomic signal processing, present a more complex framework for the design of bolstered error estimation. In this work we define a model for bolstering based on a convolution kernel applied to both conditional probabilities.

Discrete Classification in Genomics

Can we deduce the transcriptional state of a gene, or a phenotypical feature, based on the transcriptional state of other genes? For example: "If gene X1 is active and gene X2 is suppressed, gene Y would be activated". Can we infer regulatory genetic function from the cDNA microarray data, for both known and unknown functions?

[Figure labels: BRCA2 = 1, BRCA1 = 0]

[Figure: Microarray Data. Rows: cell line and condition (ML-1, Molt4, SR, A549, MCF7, RKO under IR, MMS, UV). Columns: REL-B, RCH1, BCL3, FRA1, IAP-1, ATF3. Entries take the values -1, 0, 1.]

Data Collecting

Gene 1   Gene 2   Gene 3   Status
1        0        1        1
1        0        0        1
0        1        1        0
1        0        1        1
1        0        1        0
…        …        …        …

Frequencies and Decision

x1x2x3   0    1    ψ
000      0    14   1
001      2    6    1
010      3    2    0
011      5    1    0
100      0    3    1
101      7    2    0
110      3    1    0
111      15   1    0

Automatic Design: Statistical analysis of the relationship between the index (target) and the status of the genes of interest (predictors) defines the optimal binary classifier. The resubstitution error is estimated by the probability of wrong classification (values in red in the original figure). In this example it is 9/65 = 13.8%.

The resubstitution estimator is usually low-biased.

Continuous Bolstering: Bolstered resubstitution for linear classification, assuming uniform circular bolstering kernels. The bolstered resubstitution error is the sum of all contributions (shaded areas in the original figure) divided by the number of points.

Discrete Bolstering

Bolstered resubstitution error estimation for discrete classification, using a lattice bolstering kernel. The bolstered count for each configuration is a weighted sum of its original value and the values of its neighbors. In this example, the assigned class for configuration 010 changes from Positive to Negative because of the new counts.

Before Bolstering: estimated error = 0.138
After Bolstering: estimated error = 0.223

Convolution kernel (values shown in the figure): 0.7, 0.1, 0.1, 0.1

Number of positive samples (35 observations) and negative samples (30 observations) for each observed configuration, before and after convolution:

Config   Positive   Negative   Positive (convolved)   Negative (convolved)
000      0          14         0.5                    10.9
001      2          6          2.6                    5.9
010      3          2          2.9                    3
011      5          1          5.5                    1.6
100      0          3          1                      3.8
101      7          2          6.6                    2.4
110      3          1          3.9                    1.3
111      15         1          12                     1.1

Results

Simulated data with 3 variables (geometric spatial distribution), with the convolution kernel varying as a function of a parameter a.

[Figure: Convolution Kernels. x-axis: Distance; y-axis: Kernel Value.]

[Figure: Estimated Error as function of the Bolstering Kernel (N = 3, M = 58). x-axis: Parameter b; y-axis: Estimated Error. Curves: Bolstered (Best b = 0.05); Diff Oper (Best Diff = 0.01).]

Bayes error   0.282
True error    0.301
LOO error     0.293
Resub error   0.224
Bolstered     0.302

Conclusions

• Discrete bolstering can be defined in terms of convolution kernels, as in the continuous case.
• Convolution of both conditional probabilities induces changes in the amount of error computed for the estimated classifier.
• The increase or decrease in the estimated error can be made to change continuously as a function of a kernel size parameter a.
• Usually there is an optimal a that makes the bolstered error estimate similar to the true error of the estimated classifier.
• Future work is directed at choosing the optimal kernel parameter a for specific situations.

References

• U. Braga-Neto and E. R. Dougherty, "Bolstered error estimation", Pattern Recognition, 37, pp. 1267-1281, 2004.
• U. Braga-Neto and E. R. Dougherty, "Classification", in Genomic Signal Processing and Statistics, eds. E. R. Dougherty, I. Shmulevich, J. Chen, and Z. J. Wang, EURASIP Book Series on Signal Processing and Communication, Hindawi Publishing Corporation, 2005.
• A. Choudhary, M. Brun, J. Hua, J. Lowey, E. Suh, and E. R. Dougherty, "Genetic test bed for feature selection", Bioinformatics, 22 (7), pp. 837-842, 2006.
• C. Sima, U. Braga-Neto, and E. R. Dougherty, "Superior feature-set ranking for small samples using bolstered error estimation", Bioinformatics, 21 (7), pp. 1046-1054, 2005.
• P. Stafford and M. Brun, "Three methods for optimization of cross-laboratory and cross-platform microarray expression data", Nucleic Acids Research, pp. 1-16, 2007.
• Q. Xu, J. Hua, U. Braga-Neto, Z. Xiong, E. Suh, and E. R. Dougherty, "Confidence Intervals for the True Classification Error Conditioned on the Estimated Error", Technology in Cancer Research and Treatment, 5 (6), December 2006.
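
The discrete bolstering example in this poster (positive and negative sample counts per 3-bit configuration, smoothed by a lattice kernel) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The counts are transcribed from the poster's example. The kernel layout is an assumption inferred from the figure's numbers: weight 0.7 on a configuration itself and 0.1 on each configuration at Hamming distance 1. The function names (`neighbors`, `bolster`, `design`, `error_rate`) are hypothetical, and scoring the originally designed classifier against the bolstered counts is also an assumption. The poster does not state exactly how its after-bolstering value of 0.223 was computed, so the bolstered error printed here should not be expected to match it.

```python
# Sample counts per configuration x1x2x3, transcribed from the poster's example.
POSITIVE = {"000": 0, "001": 2, "010": 3, "011": 5,
            "100": 0, "101": 7, "110": 3, "111": 15}   # 35 observations
NEGATIVE = {"000": 14, "001": 6, "010": 2, "011": 1,
            "100": 3, "101": 2, "110": 1, "111": 1}    # 30 observations


def neighbors(config):
    """Return the configurations at Hamming distance 1 from `config`."""
    flip = {"0": "1", "1": "0"}
    return [config[:i] + flip[config[i]] + config[i + 1:]
            for i in range(len(config))]


def bolster(counts, center=0.7, side=0.1):
    """Convolve counts with a lattice kernel.

    Assumed kernel: `center` on the configuration itself and `side` on each
    Hamming-distance-1 neighbor (inferred from the poster's figure).
    """
    return {c: center * counts[c] + side * sum(counts[n] for n in neighbors(c))
            for c in counts}


def design(pos, neg):
    """Majority-vote classifier: True means the configuration is labeled Positive."""
    return {c: pos[c] >= neg[c] for c in pos}


def error_rate(pos, neg, classifier):
    """Fraction of (possibly fractional) sample mass the classifier gets wrong."""
    total = sum(pos.values()) + sum(neg.values())
    wrong = sum(neg[c] if classifier[c] else pos[c] for c in pos)
    return wrong / total


original_rule = design(POSITIVE, NEGATIVE)
resub_error = error_rate(POSITIVE, NEGATIVE, original_rule)

bolstered_pos = bolster(POSITIVE)
bolstered_neg = bolster(NEGATIVE)
# Assumption: score the classifier designed on the raw counts against the
# bolstered counts.
bolstered_error = error_rate(bolstered_pos, bolstered_neg, original_rule)

print(f"resubstitution error: {resub_error:.3f}")  # 9/65, i.e. 0.138
print(f"bolstered error (this sketch's convention): {bolstered_error:.3f}")
print("label of 010 before:", original_rule["010"],
      "after:", design(bolstered_pos, bolstered_neg)["010"])
```

Under the assumed kernel, the convolved counts agree with the values shown on the poster (for example, 12 positive and 1.1 negative for configuration 111), and the majority label of configuration 010 flips from Positive to Negative. Because this kernel's weights sum to 1 and every configuration has exactly three neighbors, the total counts (35 and 30) are preserved.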