Data Balancing for Phenotype Classification Based on SNPs
Marcel Brun 1,2, Virginia Ballarín 1
1 Facultad de Ingeniería - Universidad Nacional de Mar del Plata
2 Agencia Nacional de Promoción Científica y Tecnológica – FONCyT – PICT 2006
1er Congreso Argentino de Bioinformática
Single Nucleotide Polymorphism (SNP)
A single nucleotide polymorphism is a position in the genome where two different alleles are present in the population, each at a frequency greater than 1%. SNPs are responsible for most of the genetic variation between individuals. They may occur in non-coding regions as well as in coding regions (cSNPs).
SNPs
SNPs Categorization
Categorization:
A) Functional SNPs that affect the phenotype directly via gene alterations.
B) Functional SNPs that produce a higher predisposition to phenotypic changes (via transcriptional changes or external factors).
C) Silent SNPs, with no functional effect.
Can we predict the phenotype based on a combination of SNPs? Can we find SNPs associated with specific phenotype changes?
[Diagram: SNPs → Coding Regions (Changes in Protein / No Changes in Protein) and Non-Coding Regions (Changes in Transcription Factors).]
Classification based on SNP data
Pipeline: Calls (SNiPer-HD) → SNP data → combinatorial search → error tables for combinations of n SNPs as predictors of control/disease → processed results in HTML pages and Excel spreadsheets.
Excerpt of an error table for triplets of SNPs (header row: 35 3 1):
1 2 3  0 0  0.423 0.430 0.423 0.248
1 2 4  0 0  0.340 0.430 0.340 0.284
1 2 5  0 0  0.351 0.430 0.351 0.279
1 2 6  0 0  0.205 0.430 0.205 0.342
1 2 7  0 0  0.323 0.430 0.323 0.291
1 3 4  0 0  0.344 0.430 0.344 0.282
1 3 5  0 0  0.328 0.430 0.328 0.289
1 3 6  0 0  0.333 0.430 0.333 0.287
1 3 7  0 0  0.314 0.430 0.314 0.295
1 4 5  0 0  0.267 0.430 0.267 0.315
1 4 6  0 0  0.130 0.430 0.130 0.374
1 4 7  0 0  0.267 0.430 0.267 0.315
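As an illustration of this combinatorial search, the sketch below enumerates all SNP triplets and scores each with the resubstitution error of its full-logic classifier. This is a minimal sketch under stated assumptions (genotypes coded 0/1/2, labels in {0, 1}, resubstitution as the error estimate), not the actual pipeline behind the table above:

```python
import numpy as np
from itertools import combinations

def resubstitution_error(snps, labels, combo):
    """Resubstitution error of the full-logic (plug-in) classifier built
    on the SNP columns in `combo`. snps: (n_samples, n_snps) array of
    genotype codes {0, 1, 2} for AA/AB/BB; labels: (n_samples,) in {0, 1}."""
    counts = {}
    for row, y in zip(snps[:, list(combo)], labels):
        counts.setdefault(tuple(row), [0, 0])[y] += 1
    # Each observed genotype pattern predicts its majority class, so the
    # minority count of each pattern is the number of training mistakes.
    mistakes = sum(min(c0, c1) for c0, c1 in counts.values())
    return mistakes / len(labels)

def search_triplets(snps, labels):
    """Exhaustively score every combination of 3 SNPs, best first."""
    results = [(combo, resubstitution_error(snps, labels, combo))
               for combo in combinations(range(snps.shape[1]), 3)]
    return sorted(results, key=lambda r: r[1])
```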
Example: Bovine breed classification based on SNPs
Breed classification based on small sets of SNPs: European vs. Indicine breeds. Data from the Bovine HapMap Consortium, focusing on regions of interest in the genome (chromosomes 4 and 29).
Mariela A. Gonzalez, Marcel Brun, Pablo M. Corva, Virginia Ballarin, "Análisis de señales genómicas para la clasificación de razas bovinas", CAI 2009, 1er Congreso Argentino de Agroinformática, 38 JAIIO, Mar del Plata, Argentina, August 24-28, 2009.
Selected SNPs:
Chromosome | SNP 1      | SNP 2      | SNP 3
4          | BTA-117838 | rs29019831 | BTA-71641
29         | BTA-140710 | BTA-160695 | rs29024708
Performance:
Metric      | Chr 4   | Chr 29
Error       | 0.01864 | 0.00339
Sensitivity | 0.989   | 0.999
Specificity | 0.974   | 0.995
NPV         | 0.994   | 0.999
PPV         | 0.957   | 0.991
FNR         | 0.011   | 0.00098
FPR         | 0.022   | 0.0046
Mean TN     | 38.14   | 38.46
Mean TP     | 19.76   | 20.34
Mean FN     | 0.22    | 0.02
Mean FP     | 0.88    | 0.18
Discrete Full Logic for SNP-based classification
SNP calls can be considered discrete variables with three possible values: AA, AB and BB. SNPs can be used together to predict the status of disease against control; this status can be considered a Boolean variable: "1" for case and "0" for control. Training consists in learning the logic that determines the outcome as a function of the observed SNPs. The problem is constrained to a few observed SNPs to avoid over-fitting in training. Given a number of SNPs, a decision table defines a "full-logic" discrete classifier.
[Example of a decision table: each (SNP 1, SNP 2) genotype pair is mapped to an outcome — Case, Control, or Unknown.]
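In code, such a decision table is simply a lookup from genotype tuples to outcomes. A minimal sketch; the table entries shown are placeholders, not the values from the slide:

```python
# Outcome codes: 1 = case, 0 = control, None = unknown (pattern not
# resolved by the training data).
CASE, CONTROL, UNKNOWN = 1, 0, None

# Illustrative decision table over two SNPs: one entry per observed
# (SNP 1, SNP 2) genotype pair.
decision_table = {
    ("AA", "AA"): CONTROL,
    ("AA", "AB"): CASE,
    ("AB", "BB"): CASE,
    ("BB", "BB"): CONTROL,
}

def classify(snp1, snp2):
    """Full-logic classification: look the genotype pair up; pairs never
    seen in training remain unknown."""
    return decision_table.get((snp1, snp2), UNKNOWN)
```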
Estimation of error rate for 2 SNPs from data – Discrete Full Logic
Statistical inference of the optimal function, using multi-resolution to "generalize". On the test set: 2 mistakes out of 6 → 66% accuracy.
[Worked example: a training table (samples S1–S7 with their SNP 1/SNP 2 genotypes and phenotypes), the counts of cases and controls per genotype pair, the generalized decision table (cells never observed in training marked "???"), and a test table (samples S8–S13) used to estimate the error.]
Example of a truth table for a real case (3 SNPs; frequencies corrected for class proportions):
SNP 1 | SNP 6 | SNP 41 | # Control (corr. freq.) | # Case (corr. freq.) | Predicted
AA | AA | AA | 0 (0%)     | 2 (0.3%)   | Case
AA | AA | AB | 0 (0%)     | 0 (0%)     | Case
AA | AA | BB | 0 (0%)     | 0 (0%)     | Case
AA | AB | AA | 0 (0%)     | 28 (3.5%)  | Case
AA | AB | AB | 0 (0%)     | 11 (1.4%)  | Case
AA | AB | BB | 0 (0%)     | 0 (0%)     | Case
AA | BB | AA | 3 (0.8%)   | 22 (2.8%)  | Case
AA | BB | AB | 0 (0%)     | 11 (1.4%)  | Case
AA | BB | BB | 0 (0%)     | 0 (0%)     | Case
AB | AA | AA | 8 (2.2%)   | 5 (0.6%)   | Control
AB | AA | AB | 2 (0.5%)   | 5 (0.6%)   | Case
AB | AA | BB | 1 (0.3%)   | 1 (0.1%)   | Control
AB | AB | AA | 14 (3.8%)  | 60 (7.6%)  | Case
AB | AB | AB | 5 (1.4%)   | 35 (4.4%)  | Case
AB | AB | BB | 1 (0.3%)   | 3 (0.4%)   | Case
AB | BB | AA | 11 (3%)    | 49 (6.2%)  | Case
AB | BB | AB | 6 (1.6%)   | 31 (3.9%)  | Case
AB | BB | BB | 2 (0.5%)   | 3 (0.4%)   | Control
BB | AA | AA | 25 (6.8%)  | 8 (1%)     | Control
BB | AA | AB | 4 (1.1%)   | 9 (1.1%)   | Case
BB | AA | BB | 1 (0.3%)   | 1 (0.1%)   | Control
BB | AB | AA | 49 (13.2%) | 41 (5.2%)  | Control
BB | AB | AB | 11 (3%)    | 15 (1.9%)  | Control
BB | AB | BB | 0 (0%)     | 2 (0.3%)   | Case
BB | BB | AA | 34 (9.2%)  | 31 (3.9%)  | Control
BB | BB | AB | 5 (1.4%)   | 21 (2.7%)  | Case
BB | BB | BB | 3 (0.8%)   | 2 (0.3%)   | Control
Total: 185 controls, 396 cases.
Feature Selection – Sets of 3 SNPs
[Diagram: the sets of selected SNPs and their errors at step N and step N+1 of the feature selection process.]
Now the issues: Why Balancing?
Why Balancing
The previous example shows 185 controls vs. 396 cases. These numbers reflect how the data sampling was done; they do not represent the population proportions. But they affect the classifier design through differences in the prior probabilities.
Why Balancing
Using the minimum-error threshold may yield undesirable FPR and FNR values. Changing the threshold provides "better" combined FPR and FNR at the cost of an increased error rate.
[Plot: at threshold 0, Error = 6.6% with FPR = 2.5% and FNR = 48%; at threshold 2.9, Error = 14.2% with FPR and FNR closer together (around 15%).]
Why Balancing
Usual learning algorithms assume: a) the goal is to minimize the error rate; b) training data and application data have the same distribution. But the error rate may not be the best goal for many problems, and usually the proportion of the two classes in the training samples does not reflect their population probabilities (priors). Even if the samples do represent the population proportions, the classifier that is best in terms of error rate may be very bad regarding FPR and FNR.
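The role of the priors can be made explicit. Writing FPR = P(ψ(X) = 1 | Y = 0) and FNR = P(ψ(X) = 0 | Y = 1), the error of a classifier ψ decomposes as:

$$\mathrm{Error}(\psi) \;=\; P(Y=0)\,\mathrm{FPR}(\psi) \;+\; P(Y=1)\,\mathrm{FNR}(\psi)$$

A classifier trained to minimize the empirical error therefore weights FPR and FNR by the sample proportions; if those differ from the true priors, it is minimizing the wrong objective.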
Why is this bad?
For a fixed population, training data with different proportions of positive/negative samples yields a designed classifier that is sub-optimal (red line) and an estimated error that may not predict the true error (blue line). The optimal classifiers for the extreme proportions have zero (empirical) error. The optimal classifier is obtained when the training proportions (60%-40%) are similar to the population's. The classifier trained with 50%-50% does not perform too badly, and in both cases the empirical estimate is a good estimator.
Balancing
Several techniques exist to artificially balance continuous data (see the sketch after this list):
- Replicate samples from the smaller set (upsampling)
- Remove samples from the larger set (downsampling)
- Threshold adjustment (LDA / trees / etc.)
- Taking new samples (resampling)
These techniques change the empirical joint distribution while avoiding changes in the class-conditional distributions.
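A minimal sketch of the first two techniques (upsampling and downsampling) for a binary-labeled data set; the array layout and function name are illustrative assumptions, not the authors' code:

```python
import numpy as np

def balance_by_resampling(X, y, method="up", seed=None):
    """Balance a binary-labeled sample by replicating the minority class
    (upsampling) or discarding part of the majority class (downsampling).
    X: (n, d) feature array; y: (n,) labels in {0, 1}."""
    rng = np.random.default_rng(seed)
    idx0, idx1 = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
    small, large = (idx0, idx1) if len(idx0) < len(idx1) else (idx1, idx0)
    if method == "up":
        # Replicate minority samples (with replacement) up to the majority size.
        extra = rng.choice(small, size=len(large) - len(small), replace=True)
        keep = np.concatenate([idx0, idx1, extra])
    else:
        # Keep a random subset of the majority class, of the minority size.
        keep = np.concatenate([small, rng.choice(large, size=len(small),
                                                 replace=False)])
    return X[keep], y[keep]
```

Both variants change only the empirical class marginals; the within-class (conditional) distributions are preserved in expectation.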
Balancing on discrete data
Unbalanced data → classifier design.
[Tables: counts of controls (# 0) and cases (# 1) per (SNP 1, SNP 2) genotype pair, and the estimated joint probabilities with marginals P(0) = 0.814 and P(1) = 0.186, together with the resulting plug-in classifier ψ.]
Balancing on discrete data
Balancing without resampling: the class-0 joint probabilities are rescaled by 0.5/0.814 and the class-1 probabilities by 0.5/0.1860. This changes the marginal distributions while leaving the conditional distributions unchanged, and it changes the plug-in classifier and its error rates.
Under the original distribution, ψ: ERROR = 25%, FPR = 11%, FNR = 86%.
Under the balanced distribution, ψ_B: ERROR = 34%, FPR = 23%, FNR = 45%.
[Tables: the original joint distribution with classifier ψ, and the rebalanced joint distribution with classifier ψ_B.]
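A sketch of this reweighting for two SNPs. Assumptions not taken from the slides: counts stored in dicts keyed by genotype pairs, and ties broken toward control in the plug-in rule:

```python
from itertools import product

GENOTYPES = ("AA", "AB", "BB")

def balanced_plugin_classifier(n0, n1):
    """Plug-in classifier after balancing the class marginals to 0.5/0.5
    without resampling: each class's joint probabilities are rescaled by
    0.5 / P(class), leaving the class-conditionals P(x | y) unchanged.
    n0, n1: dicts mapping a (SNP 1, SNP 2) genotype pair to its count
    among controls / cases."""
    N0, N1 = sum(n0.values()), sum(n1.values())
    psi = {}
    for x in product(GENOTYPES, repeat=2):
        p0 = 0.5 * n0.get(x, 0) / N0   # balanced joint probability of (x, 0)
        p1 = 0.5 * n1.get(x, 0) / N1   # balanced joint probability of (x, 1)
        psi[x] = int(p1 > p0)          # ties go to control
    return psi

def rates(psi, n0, n1, prior1=0.5):
    """Error, FPR and FNR of classifier psi when the true case prior is
    `prior1` (the conditionals are estimated from the counts)."""
    N0, N1 = sum(n0.values()), sum(n1.values())
    fpr = sum(c for x, c in n0.items() if psi[x] == 1) / N0
    fnr = sum(c for x, c in n1.items() if psi[x] == 0) / N1
    return (1 - prior1) * fpr + prior1 * fnr, fpr, fnr
```

Evaluating `rates` with different values of `prior1` is what the robustness comparison on the next slide amounts to.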
Balancing on discrete data
What happens if we exchange the classifiers (i.e., each one is applied under a different prior distribution than the one used for its design)?
ERROR of ψ_B on the unbalanced distribution = 26% (against 25% for ψ).
ERROR of ψ on the balanced distribution = 49% (against 34% for ψ_B).
The balanced classifier ψ_B is more robust against a wrong prior distribution!
[Tables: the balanced joint distribution evaluated with ψ, and the original joint distribution evaluated with ψ_B.]
Balancing – Simulations
Same example as before (same joint point-label distribution), but before training the classifier the samples are "balanced" artificially. Balancing the samples allows for a classifier design that is independent of the proportion of samples between cases and controls: we always obtain the same classifier. The error predicted from the training samples still does not predict the true error correctly (but it is bounded by FPR and FNR). Even with the correct proportions (60% class 0), the designed operator does not reach the optimal error (because of balancing).
Other Advantages
When balancing the data, sampling from the same population generates the same classifier regardless of the number of samples drawn from each class. This is because, if the conditional probabilities are the same, after balancing we always reach the same joint distribution. Therefore the error rate, FPR and FNR of the designed classifier do NOT depend on the proportion of samples used!
Other Advantages
Balancing tends to minimize FPR and FNR jointly, because it minimizes Error = (FPR + FNR)/2. The designed classifier also has a smaller range for the estimated error.
[Plot: estimated-error ranges for non-balanced vs. balanced design.]
Other Advantages
Balancing the data is equivalent to changing the decision threshold so as to obtain the closest possible values of FPR and FNR, and it does so without the need to search for the best threshold.
[Diagram: threshold change ↔ balancing.]
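The equivalence is visible in the plug-in rule. With the likelihood ratio $\Lambda(x) = P(x \mid Y{=}1)/P(x \mid Y{=}0)$, the unbalanced design decides "case" when $\Lambda(x)$ exceeds the prior ratio, and balancing simply moves that threshold to 1:

$$\psi(x) = 1 \;\Longleftrightarrow\; \Lambda(x) > \frac{P(Y=0)}{P(Y=1)}, \qquad\quad \psi_B(x) = 1 \;\Longleftrightarrow\; \Lambda(x) > 1$$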
Example of Application
Victor L. Boyartchuk, Karl W. Broman, Rebecca E. Mosher, Sarah E. F. D'Orazio, Michael N. Starnbach & William F. Dietrich, "Multigenic control of Listeria monocytogenes susceptibility in mice", Nature Genetics (Brief Communications), 2001.
Analysis over the selected SNPs: survival vs. no survival.
Results
Balanced vs. unbalanced classifier design: a small increase in FPR but a large reduction in FNR. We expect the balanced classifier to be more robust against varying prior proportions.
Design          | Error Rate | FPR   | FNR
Classic Design  | 29.7%      | 50.9% | 19.8%
Balanced Design | 30.3%      | 51.4% | 9.7%
Conclusions
Data balancing is a necessary step to ensure proper classifier design under unknown priors P(case) and P(control). The designed classifier is more robust against the unknown priors. For discrete data, balancing can (and should) be applied easily via the estimated joint distribution, avoiding resampling.
Acknowledgments
UNMdP: Virginia Ballarin, Mariela Azul Gonzalez
INTA (Balcarce): Pablo Corva
FI-UNER: Inti Anabela Pagnuco
Agencia FONCyT – PICT 2313
TGen: Edward Dougherty, Dietrich Stephan
