Bioinformatics, class 5: Statistical Genetics
Kristel Van Steen, PhD, PhD
Dept of Electrical Engineering and Computer Science, Montefiore Institute – ULg
2008-2009
Causes of association between a marker and a disease:
Very close linkage
Chance
Pleiotropy
This can become a problem when selection on one trait favors one specific mutant, while selection on the other trait, which is influenced by the same gene, favors another mutant.
Stratification / Population heterogeneity
SNPs or CNVs (copy number variations)? Over 99% of human DNA sequences are the same across the population.
A number of generations ago, a normal allele d mutated to a disease allele D on a particular chromosome on which the allele at a marker locus was M.
This chromosome is passed down through the generations, and now there are many copies. If the distance between D and M is small, recombinations are unlikely, so most D chromosomes carry M.
This type of association contrasts with “spurious associations”.
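The argument that close linkage preserves the D–M association can be illustrated with the standard decay formula for linkage disequilibrium, D_t = D_0 (1 − θ)^t, where θ is the recombination fraction per generation. The sketch below is only an illustration under textbook assumptions (random mating, no selection, no new mutation); the function name and the example values are hypothetical and not taken from the slides.

```python
def ld_after_generations(d0: float, theta: float, generations: int) -> float:
    """Expected LD coefficient after `generations`, assuming D_t = D_0 * (1 - theta)^t."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("recombination fraction must lie in [0, 0.5]")
    return d0 * (1.0 - theta) ** generations


# Fraction of the initial association left after 100 generations
# for a tightly linked, a loosely linked and an unlinked marker.
for theta in (0.0001, 0.01, 0.5):
    remaining = ld_after_generations(1.0, theta, 100)
    print(f"theta = {theta}: fraction of initial LD remaining = {remaining:.4f}")
```

With a very small θ almost all of the original association survives, which is why most D chromosomes still carry M.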
Computer capacities need to be large (storage and CPU time)
Skills and algorithms from computer science and bioinformatics are handy
Statistical analysis
Burden of dimensionality
Multiple testing
False positives
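Why multiple testing produces false positives can be made concrete with a small calculation. The sketch below assumes independent tests, which is a simplification for correlated markers; the function names are hypothetical, and Bonferroni is shown only as the simplest familiar correction, not as the method used in the slides.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one false positive among n independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the family-wise error rate at most alpha."""
    return alpha / n_tests


for m in (1, 10, 100, 500_000):
    print(m, familywise_error_rate(0.05, m), bonferroni_threshold(0.05, m))
```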
[Figure: multiple testing / missing data. N = 100 (50 cases, 50 controls); genotype classes AA/Aa/aa, BB/Bb/bb, CC/Cc/cc and DD/Dd/dd cross-classified over SNP 1 to SNP 4.]
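The sparseness problem behind this example can be checked with simple arithmetic: each biallelic SNP has three genotypes, so k SNPs define 3^k multi-locus genotype cells, while the sample stays at N = 100. A minimal sketch follows; the function names are hypothetical.

```python
def genotype_cells(n_snps: int) -> int:
    """Number of multi-locus genotype combinations for biallelic SNPs (3 genotypes each)."""
    return 3 ** n_snps


def mean_subjects_per_cell(n_subjects: int, n_snps: int) -> float:
    """Average number of subjects per cell if subjects were spread evenly over all cells."""
    return n_subjects / genotype_cells(n_snps)


for k in range(1, 5):
    print(k, genotype_cells(k), round(mean_subjects_per_cell(100, k), 2))
```

With four SNPs there are already 81 cells for 100 subjects, so many cells are empty or nearly empty.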
Bellman R (1961). Adaptive Control Processes: A Guided Tour. Princeton University Press:
“ ... Multidimensional variational problems cannot be solved routinely ... . This does not mean that we cannot attack them. It merely means that we must employ some more sophisticated techniques.”
Van Steen K, McQueen MB, Herbert A et al. (2005). Genomic screening and replication using the same data set in family-based association testing. Nat Genet 37:683-691.
PBAT, family-based association: X = genotype, S = parental genotypes
PBAT, population-based association: X = genotype, S = parental genotypes
(Laird NM and Lange C. Nat Rev Genet 2006). PBAT
Affymetrix Platform: 1 DSL (Method IV: FDR – Benjamini and Hochberg 1995) Power by simulation: Prostate cancer data on 467 subjects from 167 families (Kennedy et al 2003)
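The FDR method named here, Benjamini and Hochberg (1995), is a step-up procedure on the ordered p-values. The sketch below shows only that generic procedure; how PBAT combines it with its screening step is not described in these slides and is not modelled here. The function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k (1-based) with p_(k) <= (k / m) * q.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_k = rank
    # Reject the hypotheses with the max_k smallest p-values.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(example, q=0.05))
```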
There are likely to be many susceptibility genes each with combinations of rare and common alleles and genotypes that impact disease susceptibility primarily through non-linear interactions with genetic and environmental factors
MDR is a strategy to tackle the dimensionality problem of interaction detection
MDR creates a one-dimensional multi-locus genotype variable (high and low risk), which is evaluated for its ability to classify and predict disease status through cross-validation and permutation testing.
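MDR steps: the data are split into 9/10 training data and 1/10 test data. On the training data, the single model with minimum classification error is the best model. Repeating this over 10 runs (10-fold cross-validation) gives 10 best models. The model with minimum prediction error (PE) is the best n-locus model.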
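The MDR steps can be sketched in a few lines. This is a minimal illustration, not the reference MDR software: it assumes genotypes coded 0/1/2, status coded 1 (case) or 0 (control), both classes present in every training split, interleaved folds instead of random ones, and a high-risk threshold equal to the case:control ratio of the training data. All function names are hypothetical, and permutation testing is left out.

```python
import math
from collections import Counter
from itertools import combinations


def label_cells(genotypes, status, loci, threshold):
    """Label each observed multi-locus genotype cell high risk (True) or low risk (False)."""
    cases, controls = Counter(), Counter()
    for row, y in zip(genotypes, status):
        cell = tuple(row[j] for j in loci)
        if y == 1:
            cases[cell] += 1
        else:
            controls[cell] += 1
    # High risk when cases/controls >= threshold (written without division).
    return {cell: cases[cell] >= threshold * controls[cell]
            for cell in set(cases) | set(controls)}


def error_rate(genotypes, status, loci, high_risk):
    """Misclassification rate; subjects in cells unseen in training are not predicted."""
    wrong = predicted = 0
    for row, y in zip(genotypes, status):
        cell = tuple(row[j] for j in loci)
        if cell in high_risk:
            predicted += 1
            wrong += high_risk[cell] != (y == 1)
    return wrong / predicted if predicted else math.nan


def mdr(genotypes, status, n_loci, n_folds=10):
    """Return (prediction error, loci) of the best n-locus model over the folds."""
    n, n_snps = len(status), len(genotypes[0])
    fold_results = []
    for fold in range(n_folds):
        test = [i for i in range(n) if i % n_folds == fold]
        train = [i for i in range(n) if i % n_folds != fold]
        g_train = [genotypes[i] for i in train]
        y_train = [status[i] for i in train]
        n_cases = sum(y_train)
        threshold = n_cases / (len(y_train) - n_cases)
        best = None
        for loci in combinations(range(n_snps), n_loci):
            cells = label_cells(g_train, y_train, loci, threshold)
            ce = error_rate(g_train, y_train, loci, cells)
            if best is None or ce < best[0]:
                best = (ce, loci, cells)  # minimum classification error so far
        _, loci, cells = best
        pe = error_rate([genotypes[i] for i in test],
                        [status[i] for i in test], loci, cells)
        fold_results.append((pe, loci))
    # Among the per-fold best models, keep the one with minimum prediction error.
    return min(fold_results, key=lambda r: (math.isnan(r[0]), r[0]))


# Toy data: disease status depends only on the combination of SNP 0 and SNP 1.
toy_genotypes = [[i % 3, (i // 3) % 3, (i // 9) % 3] for i in range(200)]
toy_status = [(g[0] + g[1]) % 2 for g in toy_genotypes]
print(mdr(toy_genotypes, toy_status, n_loci=2))
```

The exhaustive loop over all locus combinations is what makes the method computationally heavy as the number of factors grows.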
Low power in the presence of genetic heterogeneity
Some important interactions could be missed due to pooling too many cells together
No adjustment can be made for main effects and confounding factors / restricted to dichotomous outcomes
Computationally very intensive. Only feasible for relatively small number of factors. Impractical to test very high-dimensional models. (~ 15 factors out of 500, on 4,000 subjects).
When the dimensionality of the best model is relatively high and the sample is relatively small, many observations in the test set cannot be predicted…
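The computational burden can be made concrete by counting how many k-factor models an exhaustive search has to evaluate. The sketch below uses 500 candidate factors as in the example figure above; the function name is hypothetical and the count ignores the cost of evaluating each model.

```python
from math import comb


def n_models(n_factors: int, order: int) -> int:
    """Number of distinct models that combine `order` factors out of `n_factors`."""
    return comb(n_factors, order)


for k in range(1, 6):
    print(f"{k}-factor models among 500 factors: {n_models(500, k):,}")
```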
Limited number of features: Preselecting favorable features
De Lobel et al (work in progress):
Rely on RF screening applied to combined multi-locus genotype info (De Lobel et al – PhD student)
Rely on screening strategies such as C2BAT?
Rely on screening strategies such as FITF (Millstein et al, 2005)?
Homogeneous populations: the EIGENSTRAT algorithm (Price et al, 2006)
(a) Principal components analysis is applied to genotype data to infer continuous axes of genetic variation.
(b) Genotype at a candidate SNP and phenotype are adjusted by amounts attributable to ancestry along each axis, removing all correlations to ancestry.
(c) After ancestry adjustment, an association statistic is computed.
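Steps (a) to (c) can be sketched with NumPy. This is an illustration under stated assumptions, not the EIGENSTRAT software: genotypes are 0/1/2 allele counts with no missing values, SNPs are scaled by sqrt(p(1 − p)) with p taken as the plain sample allele frequency, the candidate SNP is polymorphic, and the statistic is assumed to be (N − K − 1) times the squared correlation of the adjusted genotype and phenotype, with K the number of axes. The function name and arguments are hypothetical.

```python
import numpy as np


def eigenstrat_statistic(genotypes, phenotype, candidate, n_axes=2):
    """Ancestry-adjusted association statistic for one candidate SNP (column index)."""
    G = np.asarray(genotypes, dtype=float)   # shape (n_subjects, n_snps)
    y = np.asarray(phenotype, dtype=float)
    n = G.shape[0]

    # (a) PCA on the normalised genotype matrix via SVD.
    p = G.mean(axis=0) / 2.0
    keep = (p > 0) & (p < 1)                  # drop monomorphic SNPs
    X = (G[:, keep] - 2.0 * p[keep]) / np.sqrt(p[keep] * (1.0 - p[keep]))
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    axes = U[:, :n_axes]                      # orthonormal axes of variation

    # (b) Remove from a vector the part attributable to ancestry along each axis.
    def adjust(v):
        v = v - v.mean()
        return v - axes @ (axes.T @ v)

    g_adj = adjust(G[:, candidate])
    y_adj = adjust(y)

    # (c) Association statistic on the adjusted quantities.
    r2 = (g_adj @ y_adj) ** 2 / ((g_adj @ g_adj) * (y_adj @ y_adj))
    return (n - n_axes - 1) * r2
```

Calling the function with `n_axes=0` gives the unadjusted statistic, which is a convenient way to see how much of an apparent association is due to stratification.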
Hypothesis testing: choice of test by type of data (DV)
Qualitative (categorical)
  1 independent variable: goodness-of-fit χ²
  2 independent variables: independence test χ²
  2 dependent variables: McNemar test
Quantitative (measurement)
  Relationships
    1 predictor
      Continuous measurement
        Primary interest is degree of relationship: Pearson r
        Primary interest is form of relationship: regression
      Ranks: Spearman r_s
    Multiple predictors: multiple regression
  Differences
    2 groups
      Independent: 2-sample t (parametric), Mann-Whitney U (nonparametric)
      Dependent: related-sample t (parametric), Wilcoxon T (nonparametric)
    Multiple groups
      Independent
        1 IV: one-way ANOVA (parametric), Kruskal-Wallis H (nonparametric)
        Multiple IVs: factorial ANOVA
      Dependent: repeated measures ANOVA (parametric), Friedman (nonparametric)
Fewer than 10 outcome events per independent variable can lead to biased estimates of the regression coefficients and to an increase in Type I and Type II errors.
This gives a rule of thumb for the number of parameters P: P ≤ min(n_case, n_control)/10 − 1.
For 200 cases and 200 controls, this formula suggests that no more than 19 (= 200/10 – 1) parameters should be estimated in a logistic regression model.
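The rule of thumb above is easy to turn into a small helper. The sketch below assumes rounding down when the smaller group size is not a multiple of 10, which the slides do not specify; the function and parameter names are hypothetical.

```python
def max_parameters(n_cases: int, n_controls: int, events_per_parameter: int = 10) -> int:
    """Upper bound on the number of parameters: min(n_case, n_control) / 10 - 1."""
    return min(n_cases, n_controls) // events_per_parameter - 1


print(max_parameters(200, 200))  # the 200 cases / 200 controls example
print(max_parameters(50, 50))    # a study with N = 100, 50 cases and 50 controls
```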