Transcript

  • 1. Bioinformatics class 5: Statistical Genetics. Kristel Van Steen, PhD, PhD, Dept of Electrical Engineering and Computer Science, Montefiore Institute – ULg, 2008-2009
  • 2. Outline
    • Introduction
      • Vocabulary and Definitions
    • Biostatistical aspects of GWAs
      • GWAs
      • Surrogate for true genetic info
      • Challenges of GWAs
        • Power
        • Abundance of data
        • Design
        • Epistasis
  • 3. Outline
    • Quality control in GWAs including
      • Genotype calling and missing frequencies
      • Minor allele frequencies and HWE
      • Population stratification
    • Frequentist and Bayesian approaches to analyses
    • Accumulating the evidence for GWAs
    • The future of the field ...
  • 4.
    • Introduction
  • 5. Some Definitions
    • A locus
    • is a location on the genome which is sufficiently short so that recombination can be ignored, say up to a few Kb.
    • Alleles
    • are DNA sequence variants distinguished at a locus.
    • It may not be possible to distinguish all possible variants, so they are often grouped into two alleles, D and d.
  • 6. Some Definitions
    • A genotype
    • is the unordered pair of an individual’s two alleles at one or more loci
    • E.g., genotypes for one bi-allelic locus are Dd (heterozygous) and DD or dd (homozygous)
    • A haplotype
    • consists of the allelic types at two or more loci on a single chromosome.
  • 7.
    • Biostatistical Aspects of Genome-Wide Association Studies
  • 8.
    • Genome-Wide Association Studies
  • 9. Genetic Association
    • A statistical relationship in a population between an individual’s phenotype (characteristic of interest) and their genotype at a genetic locus
    • Genotypes
      • Known mutation in a gene
      • Marker with/without known effects on coding
  • 10. Genetic Association Studies
    • Aim:
      • to detect associations between one or more genetic polymorphisms and a trait; the latter may be
        • measured,
        • dichotomous,
        • time to onset.
    • Genuine genetic associations arise only because human populations share common ancestry
  • 11. Gene Mapping Approaches
    • Candidate gene studies
      • Association approaches
        • Family-based
        • Population-based
      • Resequencing approaches
    • Genome-wide studies
      • Linkage analysis
        • Family-based
      • Association analysis
        • Family-based
        • Population-based
  • 12. Disease Association
    • Candidate gene approach vs (genome-wide) screening approach
    • Can’t see the forest for the trees vs can’t see the trees for the forest
  • 13.
        • Association studies have greater power than linkage studies to detect small effects, but require looking at more places
    (Risch and Merikangas 1996)
  • 14. Association versus Linkage
    • Allelic association
      • Association at the population level
      • Pinpoints alleles
      • More powerful
      • More tests required
      • More sensitive to mistyping
      • Sensitive to population stratification
    • Linkage
      • Intrafamilial association
      • Pinpoints loci
      • Less powerful
      • Fewer tests required
      • Less sensitive to mistyping
      • Not sensitive to population stratification
  • 15. Linkage versus Association (courtesy of Ed Silverman)
  • 16.
    • Surrogate for True Genetic Info
  • 17. SNPs
    • Single nucleotide polymorphisms (SNPs) are DNA sequence variations occurring when a single nucleotide (A, C, T, G) in the genome is altered.
    • The inherited allelic variation must have >1% population frequency.
    • SNPs can occur in both coding and non-coding regions, making up 90% of all human genetic variation
    • Frequency: roughly, every 100 to 300 bases along the about 3 billion base human genome
    • Remark: Some definitions include methylated and deaminated dinucleotides
  • 18. Distribution of SNPs and Power
  • 19. CDCV Hypothesis
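  • 20. Linkage Disequilibrium
    • Disease locus with alleles D and d (frequencies p_D, p_d); marker locus with alleles 1 and 2 (frequencies p_1, p_2)
    • Linkage equilibrium: p_D1 = p_D p_1; linkage disequilibrium is any departure from this product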
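As a hedged illustration of the quantities on this slide, the sketch below computes the disequilibrium coefficient D = p_D1 - p_D p_1, which is zero exactly when the product rule holds. The function name and the normalised measures |D'| and r^2 are additions for illustration and do not appear on the slide.

```python
def ld_measures(p_D1, p_D, p_1):
    """Return (D, |D'|, r^2) for disease allele D and marker allele 1.

    p_D1 is the frequency of the D-1 haplotype; p_D and p_1 are the
    allele frequencies. Illustrative helper, not taken from the slides.
    """
    p_d, p_2 = 1.0 - p_D, 1.0 - p_1
    D = p_D1 - p_D * p_1                      # departure from p_D1 = p_D * p_1
    # Largest |D| that the allele frequencies allow, given the sign of D.
    d_max = min(p_D * p_2, p_d * p_1) if D >= 0 else min(p_D * p_1, p_d * p_2)
    d_prime = abs(D) / d_max if d_max > 0 else 0.0
    denom = p_D * p_d * p_1 * p_2
    r2 = D * D / denom if denom > 0 else 0.0
    return D, d_prime, r2


print(ld_measures(0.25, 0.5, 0.5))   # haplotype frequency equals the product
print(ld_measures(0.50, 0.5, 0.5))   # D always travels with marker allele 1
```

The first call describes two loci in equilibrium; the second describes the situation in which every D chromosome carries marker allele 1.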
  • 21. Indirect Associations
    • The polymorphism is a surrogate for the causal locus:
      • Indirect associations are weaker than the direct associations they reflect
      • Essential to type several surrounding markers
      • Try to exclude the possibility that a causal variant exists but is not picked up by the marker set:
      • Genome-wide vs Candidate gene approach
    (Ott 2004)
  • 22. Causes for an Association
    • Causes of association between a marker and a disease:
    • Very close linkage
    • Chance
    • Pleiotropy
      • This can become a problem when selection on one trait favors one specific mutant, while the selection at the other trait, that is influenced by the same gene, favors another mutant.
    • Stratification / Population heterogeneity
  • 23. SNPs or CNVs (Copy Number Variations)?
    • Over 99% of human DNA sequences are the same across the population
  • 24. Several Platforms (Ziegler et al 2008)
  • 25. A Successful GWA
    • Succession of design, experimental and data analysis steps in a genome-wide association study
    (Ziegler et al 2008)
  • 26.
    • Challenges of Genome-Wide Association Studies
  • 27.
    • Challenges of Genome-Wide Association Studies
    • Power
  • 28. Power Determinants
    • Important variables and parameters:
    • Study design
    • QTL effect size
    • Allele frequencies of marker and QTL
    • Linkage disequilibrium
  • 29. Study Design (Cordell and Clayton, 2005)
  • 30. Power of an Association Study
    • Important variables and parameters:
    • Study design
    • Effect size
    • Allele frequencies of marker and DSL
    • Linkage disequilibrium
  • 32. Power of an Association Study
    • Important variables and parameters:
    • Study design
    • QTL effect size
    • Allele frequencies of marker and QTL
    • Linkage disequilibrium
  • 33. Associations of Interest
    • A number of generations ago, a normal allele d mutated to a disease allele D on a particular chromosome on which the allele at a marker locus was M.
    • This chromosome is passed down through the generations, and now there are many copies. If the distance between D and M is small, recombinations are unlikely, so most D chromosomes carry M
    • This type of association contrasts with “spurious associations”
    [Diagram: a mutation turns the haplotype M-d into M-D]
  • 34.
    • Challenges of Genome-Wide Association Studies
    • Abundance of Data
  • 35. Sheer Amount of Data
    • Computer capacities need to be large (storage and CPU time)
    • Skills and algorithms from computer science and bioinformatics are handy
    • Statistical analysis
      • Burden of dimensionality
      • Multiple testing
      • False positives
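To make the multiple-testing burden concrete, here is a minimal sketch of the arithmetic. The chip size of 500,000 SNPs is a hypothetical figure chosen for illustration, and the Bonferroni correction is only one possible way to control false positives.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of false positives if every null hypothesis is true."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, family_alpha=0.05):
    """Per-test level that keeps the family-wise error rate at family_alpha."""
    return family_alpha / n_tests


n_snps = 500_000          # hypothetical chip size, not taken from the slides
print(expected_false_positives(n_snps, 0.05))
print(bonferroni_threshold(n_snps))
```

Testing every SNP at the nominal 5% level would be expected to flag tens of thousands of null SNPs, which is why much stricter per-test thresholds are used.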
  • 36. Multiple Testing / Missing Data [figure: N = 100 (50 cases, 50 controls); genotype classes AA/Aa/aa, BB/Bb/bb, CC/Cc/cc and DD/Dd/dd are cross-tabulated as SNP 1 through SNP 4 are added]
  • 37. Curse of Dimensionality
    • Bellman R (1961) Adaptive control processes: A guided tour. Princeton University Press:
    • “ ... Multidimensional variational problems cannot be solved routinely ... . This does not mean that we cannot attack them. It merely means that we must employ some more sophisticated techniques.”
  • 38.
    • Challenges of Genome-Wide Association Studies
    • Change Design
  • 39. The FBAT Test Statistic: Simple Formulation
    • N trios: 2 parents, 1 affected child
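The formula on this slide did not survive in the transcript. One common simple formulation, assumed here rather than taken from the slide, compares each affected child's genotype with its Mendelian expectation given the parents and standardises the sum. With additive coding and affected-offspring trios this coincides with the transmission disequilibrium test. The function name and data layout are illustrative.

```python
import math


def fbat_trio_statistic(trios):
    """Chi-square (1 df) FBAT-type statistic for affected-offspring trios.

    Each trio is (father, mother, child), coded as the number of copies
    of the allele under test (0, 1 or 2). Additive coding is assumed.
    """
    u = 0.0   # sum of observed minus expected child genotype
    v = 0.0   # sum of conditional variances given the parental genotypes
    for father, mother, child in trios:
        expected = (father + mother) / 2.0            # Mendelian expectation
        # Only heterozygous parents contribute variance (1/4 each).
        variance = sum(0.25 for parent in (father, mother) if parent == 1)
        u += child - expected
        v += variance
    if v == 0:
        return 0.0, 1.0                               # no informative parents
    chi2 = u * u / v
    p_value = math.erfc(math.sqrt(chi2 / 2.0))        # chi-square tail, 1 df
    return chi2, p_value


# Six transmissions of the allele and one non-transmission from heterozygous parents.
example = [(1, 1, 2)] * 3 + [(1, 0, 0)]
print(fbat_trio_statistic(example))
```

Because the statistic conditions on the parental genotypes, homozygous parents add nothing to either the numerator or the variance.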
  • 40. PBAT Screening Methodology
    • Family-based studies
    • Address multiple-comparisons
    • Screen and test using the same dataset
    Van Steen K, McQueen MB, Herbert A et al. (2005). Genomic screening and replication using the same data set in family-based association testing. Nat Genet 37:683-691.
  • 41. PBAT Family-Based Association X=Genotype S=Parental Genotypes
  • 42. PBAT Population-Based Association X=Genotype S=Parental Genotypes
  • 43. PBAT: Screening Step
    • Screen
      • Use ‘between-family’ information [ f(S,Y) ]
      • Calculate conditional power ( a b ,Y,S )
      • Select top N SNPs on the basis of power
  • 44. PBAT (Laird NM and Lange C. Nat Rev Genet 2006)
  • 45. Affymetrix Platform: 1 DSL (Method IV: FDR – Benjamini and Hochberg 1995) Power by simulation: Prostate cancer data on 467 subjects from 167 families (Kennedy et al 2003)
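The slide names the Benjamini and Hochberg (1995) FDR procedure as one of the compared methods. A minimal sketch of that step-up procedure follows; the function name and the example p-values are illustrative and unrelated to the prostate cancer data.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of hypotheses rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    n_reject = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: remember the largest rank whose p-value passes.
        if p_values[i] <= rank * q / m:
            n_reject = rank
    return sorted(order[:n_reject])


p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p_vals))
```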
  • 46. Affymetrix Platform
  • 47. Simulation: mDSL - Affymetrix
    • Genetic data from Affymetrix SNPChip 10K array on 467 subjects from 167 families
    • Simulations (10,000 replicates)
      • Select 5 regions; 1 DSL in each region
      • Generate traits according to normal distribution, including up to 5 genetic contributions
      • For each replicate: generate heritability according to uniform distribution with mean h = 0.03 for all loci considered (or h = 0.05 for all loci)
  • 48. Power to detect Genes with multiple DSL (Van Steen et al. Nature Genetics 2005).
  • 49. Power to detect Genes with multiple DSL (Van Steen et al. Nature Genetics 2005).
  • 50.
    • Challenges of Genome-Wide Association Studies
    • Out-Of-Control Screening for Epistasis?
  • 51. Genetic Architecture of Disease
    • The number of genes that impact disease susceptibility
    • The distribution of alleles and genotypes at those genes
    • The manner in which the alleles and genotypes impact disease susceptibility
    • (Weiss 1993)
  • 52. Complications in disentangling?
    • There are likely to be many susceptibility genes each with combinations of rare and common alleles and genotypes that impact disease susceptibility primarily through non-linear interactions with genetic and environmental factors
  • 53. Gene-Environment Interactions
  • 54. Interactions (Thornton-Wells et al 2004)
  • 55. Terminology: Epistasis (Moore 2004)
    • Does evidence of statistical epistasis necessarily imply genetical or biological epistasis?
  • 56. MDR for Interaction Detection
    • MDR is a strategy to tackle the dimensionality problem of interaction detection
    • MDR creates a one-dimensional multi-locus genotype variable (high and low risk), which is evaluated for its ability to classify and predict disease status through cross-validation and permutation testing.
  • 57. MDR Steps
    • Split the data into 9/10 training data and 1/10 test data
    • Within a training set, the single model with minimum classification error is the best model
    • 10 runs (10-fold cross-validation) → 10 best models
    • The model with minimum prediction error (PE) is the best n-locus model
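The MDR steps on this slide can be sketched as follows. This is a simplified illustration under stated assumptions: cells are labelled high risk when their case:control ratio is at least the overall ratio in the training data, folds are assigned by index, unseen cells are predicted as low risk, and permutation testing is left out. All names and the toy data set are hypothetical.

```python
from collections import defaultdict
from itertools import combinations, product


def cell(genotypes, loci):
    """Multi-locus genotype of one subject at the chosen loci."""
    return tuple(genotypes[i] for i in loci)


def high_risk_cells(rows, status, loci):
    """Cells whose case:control ratio is at least the overall ratio."""
    counts = defaultdict(lambda: [0, 0])           # cell -> [cases, controls]
    for g, y in zip(rows, status):
        counts[cell(g, loci)][0 if y == 1 else 1] += 1
    n_case = sum(status)
    n_ctrl = len(status) - n_case
    # Cross-multiplied form of cases/controls >= n_case/n_ctrl.
    return {c for c, (a, u) in counts.items() if a * n_ctrl >= u * n_case}


def error_rate(rows, status, loci, high):
    """Fraction of subjects whose high/low-risk label disagrees with status."""
    wrong = sum((cell(g, loci) in high) != (y == 1) for g, y in zip(rows, status))
    return wrong / len(rows)


def mdr(rows, status, n_loci, k=10):
    """Return (prediction error, loci) of the best n-locus model."""
    winners = []
    for fold in range(k):
        train = [i for i in range(len(rows)) if i % k != fold]
        test = [i for i in range(len(rows)) if i % k == fold]
        tr_x = [rows[i] for i in train]
        tr_y = [status[i] for i in train]
        best = None
        for loci in combinations(range(len(rows[0])), n_loci):
            high = high_risk_cells(tr_x, tr_y, loci)
            ce = error_rate(tr_x, tr_y, loci, high)   # classification error
            if best is None or ce < best[0]:
                best = (ce, loci, high)
        _, loci, high = best
        pe = error_rate([rows[i] for i in test], [status[i] for i in test], loci, high)
        winners.append((pe, loci))
    return min(winners)


# Toy data: 3 SNPs coded 0/1/2; disease status depends only on SNPs 0 and 1.
rows = [g for _ in range(10) for g in product(range(3), repeat=3)]
status = [(g[0] + g[1]) % 2 for g in rows]
best_pe, best_loci = mdr(rows, status, n_loci=2)
print(best_loci, best_pe)
```

Pooling the genotype cells into a single high/low-risk variable is what reduces the multi-locus problem to one dimension.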
  • 58. Advantages of MDR
    • Simultaneous detection of multiple genetic loci associated with a discrete clinical endpoint, even in the absence of main effects.
    • Non-parametric:
      • Overcomes the “curse of dimensionality” that limits logistic regression models.
    • No assumption about a genetic model
    • Low false positive rates
  • 59.  
  • 60. Some Disadvantages of MDR
    • Low power in the presence of genetic heterogeneity
    • Some important interactions could be missed due to pooling too many cells together
    • No adjustment can be made for main effects and confounding factors / restricted to dichotomous outcomes
    • Computationally very intensive. Only feasible for relatively small number of factors. Impractical to test very high-dimensional models. (~ 15 factors out of 500, on 4,000 subjects).
    • When the dimensionality of the best model is relatively high and the sample is relatively small, many observations in the test set can not be predicted…
    Solutions ?
  • 61. MDR Extensions
    • GH / main effects / confounding factors:
    • Model-based MDR
    • Computationally very intensive. Only feasible for relatively small number of factors. Impractical to test very high-dimensional models. (~ 15 factors out of 500, on 4,000 subjects).
    • When the dimensionality of the best model is relatively high and the sample is relatively small, many observations in the test set can not be predicted…
  • 62. MDR Extensions
    • GH: Model-based MDR
    • Limited number of features: Preselecting favorable features
    • De Lobel et al (work in progress):
      • Rely on RF screening applied to combined multi-locus genotype info (De Lobel et al – PhD student)
      • Rely on screening strategies such as C2BAT?
      • Rely on screening strategies such as FITF (Millstein et al, 2005)?
    • When the dimensionality of the best model is relatively high and the sample is relatively small, many observations in the test set can not be predicted…
  • 63. MDR Extensions
    • GH: Model-based MDR
    • Limited number of features: Preselecting favorable features
    • Missing genotype data: Alternatives to current MDR approach of creating a “4th category level”
  • 64. MB-MDR: Alternative Identification of Risk Cells
    • MDR: X = {H, L}
    • MB-MDR: X = {H, L, O}
    • OR_H, OR_L
    • Significance: adjust for the number of combined cells in each category
  • 65. MB-MDR Power of MDR compared to MB-MDR under aforementioned scenarios (Calle, Urrea, Malats, Van Steen 2008- submitted soon)
  • 66.
    • Quality Control in Genome-Wide Association Studies
  • 67. Genotype Calling
    • Assignment of genotypes to subjects according to their signal intensities is performed automatically using one of the available algorithms
      • Different laboratories may use different coding systems
      • Pooled analyses?
  • 68. Genotype Calling
    • Within-subject variability due to
      • DNA concentration
      • Specific features during the hybridization process
  • 69. Allele Signal Intensity/Classification (Ziegler et al 2008)
  • 70. Calling all Subjects Together?
    • Current standard approach to genotype calling is separate calling of cases and controls
    • Alternatively, do joint calling of all subjects and adjust for possible differential bias
    • Differential bias may result in displacement of genotype clouds between cases and controls
    (Clayton et al 2005)
  • 71. Cross-Platform Comparisons
    • High reproducibility of genotypes is important
    • Compare several platforms
    • Compare several technologies
    • Report concordance statistics
      • Same platform: ~ 99%
      • Across platforms: > 95%
  • 72. Missing Frequency per SNP
    • Questionable SNP quality if genotyping failed in many individuals or could not be done
    • Differential missingness in study groups requires checking this criterion separately for all study groups
    • Exclude SNP when missingness frequency > 2-3%
  • 73. Minor Allele Frequency
    • Minor allele frequency is typically used as a data filter
    • Exclude SNPs when minor allele frequency < 1%:
      • Low power to detect association between the SNP and the trait of interest
      • Sometimes cut-off of 5% is considered
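The two SNP-level filters (missing frequency and minor allele frequency) can be sketched as a single check. The thresholds follow the slides (3% missingness, 1% MAF); the data layout, with genotypes as allele counts and None for a failed call, is an assumption made for illustration.

```python
def snp_passes_qc(genotypes, max_missing=0.03, min_maf=0.01):
    """Apply the missingness and minor-allele-frequency filters to one SNP.

    genotypes: allele counts (0, 1 or 2) per subject, None for a failed call.
    """
    n = len(genotypes)
    called = [g for g in genotypes if g is not None]
    if n == 0 or not called:
        return False
    missing = 1.0 - len(called) / n
    if missing > max_missing:
        return False
    freq = sum(called) / (2.0 * len(called))    # frequency of the counted allele
    maf = min(freq, 1.0 - freq)
    return maf >= min_maf


print(snp_passes_qc([0, 1, 2, 1] * 25))          # common SNP, fully called
print(snp_passes_qc([0] * 100))                  # monomorphic
print(snp_passes_qc([1] * 90 + [None] * 10))     # 10% missing
```

To follow the advice on differential missingness, the same check would be run separately on each study group.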
  • 74. Comparison of Control Groups
    • Compare the genotype frequencies between the control groups:
      • Trends of genotypes should be identical
      • Apply an equivalence test
      • E.g., equivalence Cochran-Armitage trend test
    • The Cochran-Armitage test for trend can also be described as a method of directing chi-squared tests toward narrow alternatives.
    • It is quite common as a genotype-based test for candidate gene association.
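For reference, a minimal sketch of the ordinary Cochran-Armitage trend test on a 2 x 3 genotype table follows. It is the standard test of association, not the equivalence version mentioned above; the additive weights (0, 1, 2) and the function name are assumptions.

```python
import math


def cochran_armitage_trend(cases, controls, weights=(0, 1, 2)):
    """Chi-square (1 df) trend statistic and p-value for genotype counts.

    cases and controls hold the counts for the three genotypes in order,
    e.g. (aa, Aa, AA).
    """
    n = [r + s for r, s in zip(cases, controls)]
    N, R = sum(n), sum(cases)
    S = N - R
    t_r = sum(t * r for t, r in zip(weights, cases))
    t_n = sum(t * c for t, c in zip(weights, n))
    t2_n = sum(t * t * c for t, c in zip(weights, n))
    T = N * t_r - R * t_n                        # trend score
    var = R * S * (N * t2_n - t_n ** 2) / N      # its variance under the null
    if var == 0:
        return 0.0, 1.0
    chi2 = T * T / var
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


print(cochran_armitage_trend((20, 20, 10), (10, 20, 20)))
```

Restricting attention to a linear trend across the three genotypes is what directs the chi-squared test toward a narrow alternative.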
  • 75. Hardy-Weinberg Equilibrium
    • HWE: genotype and allele frequencies in a large, randomly mating population remain stable over generations
    • Consequence:
      • Fixed relationship between allele and genotype frequencies
      • Deviations may point to quality problems (other sources: selection by disease status)
    • Check HWE in controls and omit SNP when p-value < 10^-4
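A minimal sketch of the HWE check in controls, using the Pearson chi-square goodness-of-fit test with 1 degree of freedom. The choice of test is an assumption, since the slide does not specify one, and an exact test is often preferred when genotype counts are small.

```python
import math


def hwe_chi2(n_AA, n_Aa, n_aa):
    """Chi-square statistic and p-value for departure from HWE."""
    n = n_AA + n_Aa + n_aa
    if n == 0:
        return 0.0, 1.0
    p = (2 * n_AA + n_Aa) / (2.0 * n)           # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 0.0, 1.0                          # monomorphic: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


def keep_snp(control_counts, threshold=1e-4):
    """Keep the SNP unless the HWE p-value in controls is below the threshold."""
    return hwe_chi2(*control_counts)[1] >= threshold


print(hwe_chi2(25, 50, 25))
print(keep_snp((50, 0, 50)))
```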
  • 76. Subject-Level Quality Control
    • Complement of SNP-wise missing frequency is subject-wise missing frequency (i.e., call rate)
    • Exclude probands when genotype success < 97% of the SNPs (this number still includes the monomorphic SNPs)
    • Sometimes 90% is used as ...
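The subject-level filter mirrors the SNP-level one. A minimal sketch, assuming each subject's genotypes are stored as a list with None for a failed call:

```python
def subject_call_rate(genotypes):
    """Fraction of SNPs successfully genotyped for one subject."""
    called = sum(g is not None for g in genotypes)
    return called / len(genotypes)


def keep_subject(genotypes, min_call_rate=0.97):
    """Keep the subject when the call rate reaches the chosen cut-off."""
    return subject_call_rate(genotypes) >= min_call_rate


print(keep_subject([0] * 97 + [None] * 3))
print(keep_subject([0] * 96 + [None] * 4))
```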
  • 77. Population Stratification
    • Simpson's paradox
    • If we mix two populations that differ in both disease prevalence and marker allele frequency, and there is no association between the disease and the marker allele within each population, then there will be an association between the disease and the marker allele in the mixed population.
    (Marchini, 2004)
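A small numerical illustration with made-up counts: population 1 has 50% disease prevalence and 80% marker carriers, population 2 has 10% prevalence and 20% carriers. Within each population the odds ratio is exactly 1, yet the pooled table shows a strong association.

```python
def odds_ratio(table):
    """table = (cases_with, cases_without, controls_with, controls_without)."""
    a, b, c, d = table
    return (a * d) / (b * c)


# Hypothetical counts, 1000 subjects per population.
pop1 = (400, 100, 400, 100)    # prevalence 50%, carrier frequency 80%
pop2 = (20, 80, 180, 720)      # prevalence 10%, carrier frequency 20%
mixed = tuple(x + y for x, y in zip(pop1, pop2))

print(odds_ratio(pop1), odds_ratio(pop2), odds_ratio(mixed))
```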
  • 78. Guarding Against Stratification
    • Possible solutions:
      • Apply genomic control
      • Consider a homogenous population
      • Use family-based controls
  • 79. Genomic Control (GC)
    • GC is all about:
      • Calculating an association statistic for a candidate locus
      • Calculating the same association statistic, from the same sample, for a set of unlinked loci
      • Determining significance by reference to the results for the unlinked loci
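The three steps above can be sketched with the median-based inflation factor that is commonly used for genomic control. That particular estimator is an assumption here, since the slide gives only the outline. All statistics are taken to be chi-square with 1 degree of freedom.

```python
import statistics

CHI2_1DF_MEDIAN = 0.4549364   # median of a chi-square distribution with 1 df


def inflation_factor(null_stats):
    """Estimate lambda from association statistics at unlinked loci."""
    lam = statistics.median(null_stats) / CHI2_1DF_MEDIAN
    return max(lam, 1.0)       # never deflate the candidate statistic


def gc_adjust(candidate_stat, null_stats):
    """Rescale the candidate-locus statistic by the inflation factor."""
    return candidate_stat / inflation_factor(null_stats)


null_stats = [0.1, 2 * CHI2_1DF_MEDIAN, 5.0]   # toy values for illustration
print(inflation_factor(null_stats), gc_adjust(10.0, null_stats))
```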
  • 80. Homogenous Populations
    • STRUCTURE
      • Uses multilocus genotype data to investigate pop. structure
      • Assigns samples to discrete subpop. clusters
      • Aggregates evidence of association within each cluster
    • Issues
      • LE within subpop.?
      • Missing data?
      • The number of clusters?
      • Convergence of MCMC?
    (Pritchard et al, 2000)
  • 81. Homogenous Populations
  • 82. Homogenous Populations: The EIGENSTRAT algorithm (Price et al, 2006)
    • (a) Principal components analysis is applied to genotype data to infer continuous axes of genetic variation.
    • (b) The genotype at a candidate SNP and the phenotype are adjusted by amounts attributable to ancestry along each axis, removing all correlations to ancestry.
    • (c) After ancestry adjustment, an association statistic is computed.
  • 83. When Ethnic Stratification can be Ruled Out …
    • Measured Genotype approach (MG)
      • (Hopper and Mathews 1982; Boerwinkle et al 1986; George and Elston 1987)
    • Overall test of between- and within-family variation
      • (Havill et al 2005; Lange et al 2005)
    • The basis is a computationally heavy mixed model:
      • Genetic polymorphism under study = fixed effect or covariate
      • Polygenic component = random effect
    • Analytical solution: GRAMMAR (Aulchenko et al 2007) with MG
  • 84.
    • Frequentist and Bayesian Approaches to Analyses
  • 85. Analysis Targets
    • Single marker analysis
    • Haplotype analysis
    • Gene-gene interactions
    • Gene-environment interactions
  • 86. Hypothesis Testing [flowchart for choosing a test by type of data (DV): qualitative data lead to the chi-square goodness-of-fit test, the chi-square independence test or the McNemar test; quantitative data split into relationships (Pearson r, Spearman r_s, regression, multiple regression) and differences between groups (2-sample t, Mann-Whitney U, related-sample t, Wilcoxon T, one-way ANOVA, Kruskal-Wallis H, factorial ANOVA, repeated-measures ANOVA, Friedman)]
  • 87. Limitation of Regression
    • Having too many independent variables in relation to the number of observed outcome events
    • Assuming 10 bi-allelic loci, the number of parameters is:
      • Main effects: 20
      • 2-locus interactions: 180
      • 3-locus interactions: 960
      • 4-locus interactions: 3360
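These counts are consistent with choosing k of the 10 loci and spending 2 degrees of freedom per locus (3 genotypes), which gives C(10, k) x 2^k parameters. That reading is inferred from the numbers rather than stated on the slide. A short check:

```python
from math import comb


def n_parameters(n_loci, order, df_per_locus=2):
    """Parameters needed for all interactions of a given order."""
    return comb(n_loci, order) * df_per_locus ** order


for k in range(1, 5):
    print(k, n_parameters(10, k))
```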
  • 88. Limitation of Regression
    • Fewer than 10 outcome events per independent variable can lead to biased estimates of the regression coefficients and to an increase in Type 1 and Type 2 errors.
    • For 200 cases and 200 controls, this formula suggests that no more than 19 (= 200/10 – 1) parameters should be estimated in a logistic regression model.
    Number of parameters P ≤ min(n_case, n_control)/10 – 1
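The rule of thumb translates directly into a one-line helper (the function name is illustrative):

```python
def max_parameters(n_cases, n_controls, events_per_parameter=10):
    """Upper bound on the number of logistic regression parameters."""
    return min(n_cases, n_controls) // events_per_parameter - 1


print(max_parameters(200, 200))
```

With 200 cases and 200 controls the bound is 19, already below the 20 main-effect parameters needed for 10 bi-allelic loci.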
  • 89. Alternatives to Traditional Methods
    • Tree-based methods:
      • Recursive Partitioning (Helix Tree)
      • Random Forests (R, CART)
    • Pattern recognition methods:
      • Symbolic Discriminant Analysis (SDA)
      • Mining association rules
      • Neural networks (NN)
      • Support vector machines (SVM)
    • Data reduction methods:
      • DICE ( Detection of Informative Combined Effects )
      • MDR (Multifactor Dimensionality Reduction)
      • Logic regression …
    (e.g., Onkamo and Toivonen 2006)
  • 90.
    • Accumulating the Evidence for Analyzing GWAs
  • 91. Focus on Replication
    • What SNPs are most likely to replicate?
      • Modest to strong statistical significance
      • Relatively common minor allele frequency
      • Modest to strong genetic effect size
  • 92. Multi-Stage Design [figure: SNPs against study sample, with sample sizes N1, N2, N3; two further symbols with subscripts 1 and 2 were lost in extraction]
  • 93. Multi-Stage Design [same figure, Stage I]
  • 94. Multi-Stage Design [same figure, Stage II]
  • 95. Multi-Stage Design [same figure, Stage III]
  • 96. One-Stage Design [figure: SNPs against a study sample of size N, one stage]
  • 97. One-Stage Design [same figure; one further symbol was lost in extraction]
  • 98. Study Design Comparison
    • Multi-Stage
      • Less Expensive
      • More Complicated
      • Less Powerful
    • Single-Stage
      • More Expensive
      • Less Complicated
      • More Powerful
  • 99. (Skol et al 2005)
  • 100.
    • The future: Are we (really) ready for genome-wide analyses?
  • 101. Are we Ready for Genome-wide Association Studies?
    • Availability of International HapMap resource, documenting patterns of genome-wide variation and LD in 4 population samples
    • Availability of dense genotyping chips, with good coverage of the human genome
    • Availability of large and well-characterized clinical samples for many common diseases
    (The Wellcome Trust Case Control Consortium, 2007)
  • 102. (Cancer Epidemiol Biomarkers Prev 2006; 15(4))
  • 103.
    • Practical
  • 104. GWA in Practice
    • Go to http://mga.bionet.nsc.ru/nlru/GenABEL/
    • At this page you can find
      • GenABEL package
      • Documentation
      • Test data sets
      • Links to related GWA software (ProbABEL -- analysis of imputed data, MetABEL -- meta-analysis)
  • 105. GenABEL
  • 106. GenABEL (http://mga.bionet.nsc.ru/nlru/GenABEL/)
  • 107. Getting Started
    • To run under Windows, first install the R statistical computing software.
    • Download and save GenABEL_1.3-7.zip.
    • Start R, choose “Packages”, then “Install package(s) from local zip files...” and select the GenABEL_1.3-7.zip.
    • Make sure the packages “genetics” and “DGCgenetics” are installed as well (http://www-gene.cimr.cam.ac.uk/clayton/software/)
    • Start with the command “library(GenABEL)”.
    • To see what it can do, try “demo(srdta)” or “demo(srdtawin)”.
  • 108. Computer Practical Exercise
    • Heather Cordell
    • http://www.staff.ncl.ac.uk/heather.cordell/WTACcasecon2007.html
    • Using R for
      • Case-control association
      • Gene-gene interactions
  • 109. Homework Assignment
    • Background Reading due 21 October
    • on
    • Multifactor Dimensionality Reduction
  • 110. Some Notes
    • on the exam …
    • on the class schedule for the remaining classes
      • Check the updates
  • 111. Presentations
    • 15 minutes per group
    • Mind
      • the correctness of the content
      • the quality of « presenting »
    • Group score
  • 112.