
    • Bioinformatics class 5 Statistical Genetics Kristel Van Steen, PhD, PhD Dept of Electrical Engineering and Computer Science Montefiore Institute – ULg 2008-2009
    • Outline
      • Introduction
        • Vocabulary and Definitions
      • Biostatistical aspects of GWAs
        • GWAs
        • Surrogate for true genetic info
        • Challenges of GWAs
          • Power
          • Abundance of data
          • Design
          • Epistasis
    • Outline
      • Quality control in GWAs including
        • Genotype calling and missing frequencies
        • Minor allele frequencies and HWE
        • Population stratification
      • Frequentist and Bayesian approaches to analyses
      • Accumulating the evidence for GWAs
      • The future of the field ...
      • Introduction
    • Some Definitions
      • A locus
      • is a location on the genome which is sufficiently short so that recombination can be ignored, say up to a few Kb.
      • Alleles
      • are DNA sequence variants distinguished at a locus.
      • It may not be possible to distinguish all possible variants; they are often grouped into two alleles, D and d
    • Some Definitions
      • A genotype
      • is the unordered pair of an individual’s two alleles at one or more loci
      • E.g., genotypes for one bi-allelic locus are Dd (heterozygous) and DD or dd (homozygous)
      • A haplotype
      • consists of the allelic types at two or more loci on a single chromosome.
      • Biostatistical Aspects of Genome-Wide Association Studies
      • Genome-Wide Association Studies
    • Genetic Association
      • A statistical relationship in a population between an individual’s phenotype (characteristic of interest) and their genotype at a genetic locus
      • Genotypes
        • Known mutation in a gene
        • Marker with/without known effects on coding
    • Genetic Association Studies
      • Aim:
        • to detect associations between one or more genetic polymorphisms and a trait; the latter may be
          • measured,
          • dichotomous,
          • time to onset.
      • Genuine genetic associations arise only because human populations share common ancestry
    • Gene Mapping Approaches
      • Candidate gene studies
        • Association approaches
          • Family-based
          • Population-based
        • Resequencing approaches
      • Genome-wide studies
        • Linkage analysis
          • Family-based
        • Association analysis
          • Family-based
          • Population-based
    • Disease Association
      • candidate gene approach
      • vs
      • (genome-wide) screening approach
      • Can’t see the forest for the trees
      • vs
      • Can’t see the trees for the forest
          • Association studies have greater power than linkage studies to detect small effects, but require looking at more places
      (Risch and Merikangas 1996)
    • Association versus Linkage
      • Allelic association
        • Association at the population level
        • Pinpoints alleles
        • More powerful
        • More tests required
        • More sensitive to mistyping
        • Sensitive to population stratification
      • Linkage
        • Intrafamilial association
        • Pinpoints loci
        • Less powerful
        • Fewer tests required
        • Less sensitive to mistyping
        • Not sensitive to population stratification
    • Linkage versus Association (Courtesy of Ed Silverman)
      • Surrogate for True Genetic Info
    • SNPs
      • Single nucleotide polymorphisms (SNPs) are DNA sequence variations occurring when a single nucleotide (A, C, T, G) in the genome is altered.
      • The inherited allelic variation must have >1% population frequency.
      • SNPs can occur in both coding and non-coding regions, making up 90% of all human genetic variation
      • Frequency: roughly, every 100 to 300 bases along the about 3 billion base human genome
      • Remark: Some definitions include methylated and deaminated dinucleotides
    • Distribution of SNPs and Power
    • CDCV Hypothesis
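    • Linkage Disequilibrium
      • Disease locus with alleles D and d; marker locus with alleles 1 and 2
      • Allele frequencies: p_D, p_d, p_1, p_2
      • Linkage equilibrium: p_D1 = p_D p_1; linkage disequilibrium is any departure from this equality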
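A small numerical sketch of the quantities on the linkage disequilibrium slide may help. It takes the haplotype frequency p_D1 and the allele frequencies p_D and p_1, and returns the disequilibrium coefficient D = p_D1 - p_D p_1 together with the two usual normalised measures, |D'| and r². The function name and the choice to report |D'| and r² are assumptions made here for illustration; the slide itself only shows the frequencies.

```python
def ld_measures(p_D1, p_D, p_1):
    """Return (D, |D'|, r^2) for disease allele D and marker allele 1.

    p_D1 is the frequency of the D-1 haplotype; p_D and p_1 are the
    allele frequencies. All frequencies are assumed to lie strictly
    between 0 and 1 (illustrative helper, not taken from the slides).
    """
    p_d, p_2 = 1.0 - p_D, 1.0 - p_1
    D = p_D1 - p_D * p_1                      # zero under linkage equilibrium
    if D >= 0:
        D_max = min(p_D * p_2, p_d * p_1)     # largest attainable positive D
    else:
        D_max = min(p_D * p_1, p_d * p_2)     # largest attainable negative D
    D_prime = abs(D) / D_max if D_max > 0 else 0.0
    r2 = D * D / (p_D * p_d * p_1 * p_2)
    return D, D_prime, r2


if __name__ == "__main__":
    # Every D chromosome carries marker allele 1, so |D'| is at its maximum.
    print(ld_measures(p_D1=0.2, p_D=0.2, p_1=0.3))
```

Note that |D'| can equal 1 while r² stays well below 1, which is one reason indirect associations through a marker tend to be weaker than the direct association at the causal locus.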
    • Indirect Associations
      • The polymorphism is a surrogate for the causal locus:
        • Indirect associations are weaker than the direct associations they reflect
        • Essential to type several surrounding markers
        • Try to exclude the possibility that a causal variant exists but is not picked up by the marker set:
        • Genome-wide vs Candidate gene approach
      (Ott 2004)
    • Causes for an Association
      • Causes of association between a marker and a disease:
      • Very close linkage
      • Chance
      • Pleiotropy
        • This can become a problem when selection on one trait favors one specific mutant, while the selection at the other trait, that is influenced by the same gene, favors another mutant.
      • Stratification / Population heterogeneity
    • SNPs or CNVs (Copy Number Variations)? Over 99% of human DNA sequences are the same across the population
    • Several Platforms (Ziegler et al 2008)
    • A Successful GWA
      • Succession of design, experimental and data analysis steps in a genome-wide association study
      (Ziegler et al 2008)
      • Challenges of Genome-Wide Association Studies
      • Power
    • Power Determinants
      • Important variables and parameters:
      • Study design
      • QTL effect size
      • Allele frequencies of marker and QTL
      • Linkage disequilibrium
    • Study Design (Cordell and Clayton, 2005)
    • Power of an Association Study
      • Important variables and parameters:
      • Study design
      • Effect size
      • Allele frequencies of marker and DSL
      • Linkage disequilibrium
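To make the role of effect size, allele frequency and sample size concrete, here is a rough power sketch for a case-control design. It is not the calculation behind the slides: it assumes the typed marker is the DSL itself (perfect LD), a multiplicative per-allele odds ratio, and a normal approximation to the difference in allele frequencies between cases and controls. The function name and all parameter values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def allelic_test_power(p_control, odds_ratio, n_cases, n_controls, alpha=0.05):
    """Approximate power of a two-sided allelic test (cases vs controls).

    Assumptions (not from the slides): marker = DSL, multiplicative
    per-allele odds ratio, normal approximation with 2N alleles per group,
    and 0 < p_control < 1.
    """
    odds_case = odds_ratio * p_control / (1.0 - p_control)
    p_case = odds_case / (1.0 + odds_case)    # allele frequency in cases
    se = sqrt(p_case * (1.0 - p_case) / (2 * n_cases)
              + p_control * (1.0 - p_control) / (2 * n_controls))
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1.0 - alpha / 2.0)
    z_effect = abs(p_case - p_control) / se
    # Probability that the test statistic falls in either rejection region.
    return normal.cdf(z_effect - z_alpha) + normal.cdf(-z_effect - z_alpha)


if __name__ == "__main__":
    # Illustrative values only: allele frequency 0.2, odds ratio 1.3.
    for n in (500, 2000, 8000):
        print(n, round(allelic_test_power(0.2, 1.3, n, n, alpha=1e-6), 3))
```

Incomplete LD between marker and DSL would lower the power further, which the sketch deliberately ignores.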
    • Associations of Interest
      • A number of generations ago, a normal allele d mutated to a disease allele D on a particular chromosome on which the allele at a marker locus was M.
      • This chromosome is passed down through the generations, and now there are many copies. If the distance between D and M is small, recombinations are unlikely, so most D chromosomes carry M
      • This type of association contrasts with “spurious associations”
      [Figure: a mutation turns allele d into D on a chromosome carrying marker allele M]
      • Challenges of Genome-Wide Association Studies
      • Abundance of Data
    • Sheer Amount of Data
      • Computer capacities need to be large (storage and CPU time)
      • Skills and algorithms from computer science and bioinformatics are handy
      • Statistical analysis
        • Burden of dimensionality
        • Multiple testing
        • False positives
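The link between multiple testing and false positives can be illustrated with simple arithmetic. The sketch below computes the expected number of false positives when every SNP is tested at an unadjusted level, and the per-test threshold under a Bonferroni correction. Bonferroni is used here only as the simplest familiar correction, and the figure of 500,000 SNPs is a made-up example, not a number from the slides.

```python
def expected_false_positives(alpha, n_null_tests):
    """Expected number of rejections among true null hypotheses."""
    return alpha * n_null_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error at alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    n_snps = 500_000  # hypothetical chip size
    print(expected_false_positives(0.05, n_snps))
    print(bonferroni_threshold(0.05, n_snps))
```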
    • Multiple Testing / Missing Data: N = 100 (50 cases, 50 controls) [figure: the sample is cross-classified by genotype at SNP 1 (AA/Aa/aa), SNP 2 (BB/Bb/bb), SNP 3 (CC/Cc/cc) and SNP 4 (DD/Dd/dd)]
    • Curse of Dimensionality
      • Bellman R (1961) Adaptive control processes : A guided tour. Princeton University Press:
      • “ ... Multidimensional variational problems cannot be solved routinely ... . This does not mean that we cannot attack them. It merely means that we must employ some more sophisticated techniques.”
      • Challenges of Genome-Wide Association Studies
      • Change Design
    • The FBAT Test Statistic: Simple Formulation
      • N trios: 2 parents, 1 affected child
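The formula on this slide did not survive in the transcript. For trios with one affected child, a dichotomous trait and additive allele coding, the FBAT statistic is usually described as reducing to the transmission disequilibrium test (TDT), so the sketch below implements the TDT as a stand-in. Treat it as an illustration under that assumption, not as the slide's formulation. Genotypes are coded as the number of copies (0, 1, 2) of the allele of interest; the function name is hypothetical.

```python
from math import erfc, sqrt


def tdt(trios):
    """TDT for (father, mother, affected child) genotype triples.

    Returns (chi-square statistic with 1 df, p-value). Only heterozygous
    parents are informative: b counts transmissions of the allele of
    interest from heterozygous parents, c counts non-transmissions.
    """
    b = c = 0
    for father, mother, child in trios:
        n_het = (father == 1) + (mother == 1)
        n_hom = (father == 2) + (mother == 2)   # these always transmit the allele
        transmitted = child - n_hom             # transmissions from het parents
        if not 0 <= transmitted <= n_het:
            raise ValueError("Mendelian inconsistency in trio %r"
                             % ((father, mother, child),))
        b += transmitted
        c += n_het - transmitted
    if b + c == 0:
        return 0.0, 1.0                         # no informative parents
    stat = (b - c) ** 2 / (b + c)
    return stat, erfc(sqrt(stat / 2.0))         # chi-square (1 df) upper tail
```

Because the test conditions on the parental genotypes, it is not affected by population stratification, which is the main attraction of the family-based design.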
    • PBAT Screening Methodology
      • Family-based studies
      • Address multiple-comparisons
      • Screen and test using the same dataset
      Van Steen K, McQueen MB, Herbert A et al. (2005). Genomic screening and replication using the same data set in family-based association testing. Nat Genet 37:683-691.
    • PBAT Family-Based Association X=Genotype S=Parental Genotypes
    • PBAT Population-Based Association X=Genotype S=Parental Genotypes
    • PBAT: Screening Step
      • Screen
        • Use ‘between-family’ information [ f(S,Y) ]
        • Calculate conditional power ( a b ,Y,S )
        • Select top N SNPs on the basis of power
    • PBAT (Laird NM and Lange C. Nat Rev Genet 2006)
    • Affymetrix Platform: 1 DSL (Method IV: FDR – Benjamini and Hochberg 1995) Power by simulation: Prostate cancer data on 467 subjects from 167 families (Kennedy et al 2003)
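The FDR method cited here, Benjamini and Hochberg (1995), is a step-up procedure on the sorted p-values. The following is a minimal sketch of that procedure only; it does not reproduce the PBAT screening step or the simulation, and the function name and example p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of hypotheses rejected at FDR level q.

    Step-up rule: find the largest rank k such that the k-th smallest
    p-value is <= k * q / m, then reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    return sorted(order[:k_max])


if __name__ == "__main__":
    pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(pvals, q=0.05))
```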
    • Affymetrix Platform
    • Simulation: mDSL - Affymetrix
      • Genetic data from Affymetrix SNPChip 10K array on 467 subjects from 167 families
      • Simulations (10,000 replicates)
        • Select 5 regions; 1 DSL in each region
        • Generate traits according to normal distribution, including up to 5 genetic contributions
        • For each replicate: generate heritability according to uniform distribution with mean h = 0.03 for all loci considered (or h = 0.05 for all loci)
    • Power to detect Genes with multiple DSL (Van Steen et al. Nature Genetics 2005).
      • Challenges of Genome-Wide Association Studies
      • Out-Of-Control Screening for Epistasis?
    • Genetic Architecture of Disease
      • The number of genes that impact disease susceptibility
      • The distribution of alleles and genotypes at those genes
      • The manner in which the alleles and genotypes impact disease susceptibility
      • (Weiss 1993)
    • Complications in disentangling?
      • There are likely to be many susceptibility genes each with combinations of rare and common alleles and genotypes that impact disease susceptibility primarily through non-linear interactions with genetic and environmental factors
    • Gene-Environment Interactions
    • Interactions (Thornton-Wells et al 2004)
    • Terminology: Epistasis (Moore 2004). Does evidence of statistical epistasis necessarily imply genetical or biological epistasis?
    • MDR for Interaction Detection
      • MDR is a strategy to tackle the dimensionality problem of interaction detection
      • MDR creates a one-dimensional multi-locus genotype variable (high and low risk), which is evaluated for its ability to classify and predict disease status through cross-validation and permutation testing.
    • MDR Steps
      • Split the data: 9/10 training data, 1/10 test data
      • In each run, the single model with minimum classification error is the best model
      • 10 runs (10-fold cross-validation) → 10 best models
      • The model with minimum PE (prediction error) is the best n-locus model
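The MDR steps can be sketched in a few functions. This is a deliberately simplified illustration, not the published MDR software: it labels a multi-locus genotype cell high risk when its case:control ratio exceeds the overall ratio in the training data, treats cells unseen in training as low risk, uses a fixed interleaved fold assignment, and omits permutation testing. All function names are hypothetical, and the data are assumed to contain both cases (1) and controls (0) in every training split.

```python
from collections import Counter
from itertools import combinations


def high_risk_cells(genotypes, status, loci):
    """Cells whose case:control ratio exceeds the overall ratio."""
    cases, controls = Counter(), Counter()
    for g, y in zip(genotypes, status):
        cell = tuple(g[j] for j in loci)
        (cases if y == 1 else controls)[cell] += 1
    threshold = sum(cases.values()) / sum(controls.values())
    return {cell for cell in set(cases) | set(controls)
            if cases[cell] > threshold * controls[cell]}


def error_rate(genotypes, status, loci, high):
    """Misclassification rate; unseen cells default to low risk."""
    wrong = sum((tuple(g[j] for j in loci) in high) != (y == 1)
                for g, y in zip(genotypes, status))
    return wrong / len(status)


def mdr(genotypes, status, n_loci, n_folds=10):
    """Return (best_loci, prediction_error, per_fold_results)."""
    n, n_snps = len(status), len(genotypes[0])
    per_fold = []
    for fold in range(n_folds):
        test = list(range(fold, n, n_folds))                 # 1/10 test data
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]   # 9/10 training data
        tr_g = [genotypes[i] for i in train]
        tr_y = [status[i] for i in train]
        # Best model of this run: minimum training classification error.
        best = min(combinations(range(n_snps), n_loci),
                   key=lambda loci: error_rate(
                       tr_g, tr_y, loci, high_risk_cells(tr_g, tr_y, loci)))
        high = high_risk_cells(tr_g, tr_y, best)
        pe = error_rate([genotypes[i] for i in test],
                        [status[i] for i in test], best, high)
        per_fold.append((best, pe))
    # Among the per-run best models, keep the one with minimum PE.
    best_loci, best_pe = min(per_fold, key=lambda r: r[1])
    return best_loci, best_pe, per_fold
```

The exhaustive search over all n-locus combinations inside every fold is what makes the approach computationally intensive as the number of factors grows.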
    • Advantages of MDR
      • Simultaneous detection of multiple genetic loci associated with a discrete clinical endpoint in absence of main effect.
      • Non-parametric:
        • Overcomes the “curse of dimensionality” encountered by the logistic regression model.
      • No assumption about a genetic model
      • Low false positive rates
    • Some Disadvantages of MDR
      • Low power in the presence of genetic heterogeneity
      • Some important interactions could be missed due to pooling too many cells together
      • No adjustment can be made for main effects and confounding factors / restricted to dichotomous outcomes
      • Computationally very intensive. Only feasible for relatively small number of factors. Impractical to test very high-dimensional models. (~ 15 factors out of 500, on 4,000 subjects).
      • When the dimensionality of the best model is relatively high and the sample is relatively small, many observations in the test set can not be predicted…
      Solutions ?
    • MDR Extensions
      • GH / main effects / confounding factors:
      • Model-based MDR
      • Computationally very intensive. Only feasible for relatively small number of factors. Impractical to test very high-dimensional models. (~ 15 factors out of 500, on 4,000 subjects).
      • When the dimensionality of the best model is relatively high and the sample is relatively small, many observations in the test set can not be predicted…
    • MDR Extensions
      • GH: Model-based MDR
      • Limited number of features: Preselecting favorable features
      • De Lobel et al (work in progress):
        • Rely on RF screening applied to combined multi-locus genotype info (De Lobel et al – PhD student)
        • Rely on screening strategies such as C2BAT?
        • Rely on screening strategies such as FITF (Millstein et al, 2005)?
      • When the dimensionality of the best model is relatively high and the sample is relatively small, many observations in the test set can not be predicted…
    • MDR Extensions
      • GH: Model-based MDR
      • Limited number of features: Preselecting favorable features
      • Missing genotype data: Alternatives to current MDR approach of creating a “4th category level”
    • MB-MDR: Alternative Identification of Risk Cells
      • MDR: X = {H, L}
      • MB-MDR: X = {H, L, O}
      • OR_H, OR_L
      • significance: adjust for the number of combined cells in the category
    • MB-MDR: Power of MDR compared to MB-MDR under the aforementioned scenarios (Calle, Urrea, Malats, Van Steen 2008, submitted soon)
      • Quality Control in Genome-Wide Association Studies
    • Genotype Calling
      • Assignment of genotypes to subjects according to their signal intensities is performed automatically using one of the available algorithms
        • Different laboratories may use different coding systems
        • Pooled analyses?
    • Genotype Calling
      • Within-subject variability due to
        • DNA concentration
        • Specific features during the hybridization process
    • Allele Signal Intensity/Classification (Ziegler et al 2008)
    • Calling all Subjects Together?
      • Current standard approach to genotype calling is separate calling of cases and controls
      • Alternatively, do joint calling of all subjects and adjust for possible differential bias
      • Differential bias may result in displacement of genotype clouds between cases and controls
      (Clayton et al 2005)
    • Cross-Platform Comparisons
      • High reproducibility of genotypes is important
      • Compare several platforms
      • Compare several technologies
      • Report concordance statistics
        • Same platform: ~ 99%
        • Across platforms: > 95%
    • Missing Frequency per SNP
      • Questionable SNP quality if genotyping failed in many individuals or could not be done
      • Differential missingness in study groups requires checking this criterion separately for all study groups
      • Exclude SNP when missingness frequency > 2-3%
    • Minor Allele Frequency
      • Minor allele frequency is typically used as a data filter
      • Exclude SNPs when minor allele frequency < 1%:
        • Low power to detect association between the SNP and the trait of interest
        • Sometimes cut-off of 5% is considered
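The two SNP-level filters above (missing frequency and minor allele frequency) are easy to express in code. The sketch below assumes genotypes coded as 0, 1 or 2 copies of one allele, with None for a failed call; the default cut-offs of 3% missingness and 1% MAF follow the slides, while the function name and data layout are illustrative assumptions.

```python
def snp_qc(genotypes, max_missing=0.03, min_maf=0.01):
    """Return (missing_frequency, minor_allele_frequency, keep) for one SNP.

    genotypes: list with 0/1/2 allele counts, or None where calling failed.
    """
    n = len(genotypes)
    called = [g for g in genotypes if g is not None]
    missing = 1.0 - len(called) / n
    if not called:
        return missing, 0.0, False
    freq = sum(called) / (2.0 * len(called))   # frequency of the counted allele
    maf = min(freq, 1.0 - freq)
    keep = missing <= max_missing and maf >= min_maf
    return missing, maf, keep
```

To check for differential missingness, the same function can be applied separately to each study group, for example to cases and to controls.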
    • Comparison of Control Groups
      • Compare the genotype frequencies between the control groups:
        • Trends of genotypes should be identical
        • Apply an equivalence test
        • E.g., equivalence Cochran-Armitage trend test
      • The Cochran-Armitage test for trend can also be described as a method of directing chi-squared tests toward narrow alternatives.
      • It is quite common as a genotype-based test for candidate gene association.
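For reference, here is a sketch of the standard Cochran-Armitage trend test on a 2 x 3 genotype table. It is the ordinary test of association with additive weights (0, 1, 2), not the equivalence version mentioned for comparing control groups. The function name and the default weights are assumptions for illustration, and the table is assumed to contain both groups and more than one genotype class.

```python
from math import erfc, sqrt


def cochran_armitage(cases, controls, weights=(0, 1, 2)):
    """Trend test for genotype counts (aa, Aa, AA) in two groups.

    Returns (chi-square statistic with 1 df, p-value).
    """
    R, S = sum(cases), sum(controls)
    N = R + S
    totals = [r + s for r, s in zip(cases, controls)]
    # Weighted difference between the two groups.
    T = sum(w * (r * S - s * R) for w, r, s in zip(weights, cases, controls))
    var = sum(w * w * n * (N - n) for w, n in zip(weights, totals))
    var -= 2 * sum(weights[i] * weights[j] * totals[i] * totals[j]
                   for i in range(len(totals))
                   for j in range(i + 1, len(totals)))
    var *= R * S / N
    stat = T * T / var
    return stat, erfc(sqrt(stat / 2.0))   # chi-square (1 df) upper tail
```

With weights (0, 1, 2) the test is directed toward an additive effect of the allele, which is what "directing chi-squared tests toward narrow alternatives" refers to.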
    • Hardy-Weinberg Equilibrium
      • HWE: genotype and allele frequencies in a large, randomly mating population remain stable over generations
      • Consequence:
        • Fixed relationship between allele and genotype frequencies
        • Deviations may point to quality problems (other sources: selection by disease status)
      • Check HWE in controls and omit SNP when p-value < 10⁻⁴
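The fixed relationship between allele and genotype frequencies gives a simple test. Below is a sketch of the Pearson chi-square test for HWE from the three genotype counts; the slides do not say which test is intended, so the chi-square version is an assumption, and an exact test is often preferred when expected counts are small. The function name is hypothetical.

```python
from math import erfc, sqrt


def hwe_chisq(n_AA, n_Aa, n_aa):
    """Chi-square test (1 df) of Hardy-Weinberg proportions.

    Returns (statistic, p-value). Expected counts under HWE are
    n*p^2, 2*n*p*q and n*q^2, with p estimated from the sample.
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2.0 * n)
    q = 1.0 - p
    observed = (n_AA, n_Aa, n_aa)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return stat, erfc(sqrt(stat / 2.0))


def passes_hwe(n_AA, n_Aa, n_aa, threshold=1e-4):
    """Apply the filter from the slide: omit the SNP when p-value < threshold."""
    return hwe_chisq(n_AA, n_Aa, n_aa)[1] >= threshold
```

In line with the slide, the counts passed in would be those of the controls only.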
    • Subject-Level Quality Control
      • The counterpart of the SNP-wise missing frequency is the subject-wise missing frequency (i.e., one minus the call rate)
      • Exclude probands when genotype success < 97% of the SNPs (this number still includes the monomorphic SNPs)
      • Sometimes 90% is used as ...
    • Population Stratification
      • Simpson's paradox
      • If we mix two populations that have both different disease prevalence and different marker allele frequencies, and there is no association between the disease and marker allele in each population, then there will be an association between the disease and the marker allele in the mixed population.
      (Marchini, 2004)
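A worked example with made-up counts shows the effect. In each of two hypothetical populations, carriers and non-carriers of the marker allele have the same disease risk (odds ratio 1), but the populations differ in both prevalence and carrier frequency. Pooling them produces an odds ratio well above 1. All numbers are invented for illustration.

```python
def odds_ratio(table):
    """Odds ratio for ((carrier cases, carrier controls),
                       (non-carrier cases, non-carrier controls))."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)


def pool(table_1, table_2):
    """Add two 2 x 2 tables cell by cell (mixing the populations)."""
    return tuple(tuple(x + y for x, y in zip(row_1, row_2))
                 for row_1, row_2 in zip(table_1, table_2))


# Population 1: prevalence 10%, carrier frequency 20% (hypothetical).
pop1 = ((20, 180), (80, 720))
# Population 2: prevalence 30%, carrier frequency 60% (hypothetical).
pop2 = ((180, 420), (120, 280))
mixed = pool(pop1, pop2)

if __name__ == "__main__":
    print(odds_ratio(pop1), odds_ratio(pop2), odds_ratio(mixed))
```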
    • Guarding Against Stratification
      • Possible solutions:
        • Apply genomic control
        • Consider a homogenous population
        • Use family-based controls
    • Genomic Control (GC)
      • GC is all about:
        • Calculating an association statistic for a candidate locus
        • Calculating the same association statistic, from the same sample, for a set of unlinked loci
        • Determining significance by reference to the results for the unlinked loci
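The three steps can be sketched as follows. The slide does not say how the results for the unlinked loci are summarised; a common choice, assumed here, is the median-based inflation factor λ: the median of the chi-square (1 df) statistics at the unlinked loci divided by the theoretical median of about 0.455. The candidate statistic is then divided by λ. The function name and the decision to cap λ at 1 from below are illustrative assumptions.

```python
from math import erfc, sqrt
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 df (about 0.455).
CHI2_1_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def genomic_control(candidate_stat, null_stats):
    """Adjust a 1-df chi-square statistic by the inflation factor lambda.

    null_stats: the same association statistic computed, in the same
    sample, at a set of unlinked loci.
    Returns (lambda, adjusted statistic, adjusted p-value).
    """
    lam = max(1.0, median(null_stats) / CHI2_1_MEDIAN)  # never deflate
    adjusted = candidate_stat / lam
    return lam, adjusted, erfc(sqrt(adjusted / 2.0))
```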
    • Homogenous Populations
      • STRUCTURE
        • Uses multilocus genotype data to investigate pop. structure
        • Assigns samples to discrete subpop. clusters
        • Aggregates evidence of association within each cluster
      • Issues
        • LE within subpop.?
        • Missing data?
        • The number of clusters?
        • Convergence of MCMC?
      (Pritchard et al, 2000)
    • Homogenous Populations: The EIGENSTRAT algorithm
      • (a) Principal components analysis is applied to genotype data to infer continuous axes of genetic variation.
      • (b) Genotype at a candidate SNP and phenotype are adjusted by amounts attributable to ancestry along each axis, removing all correlations to ancestry.
      • (c) After ancestry adjustment, an association statistic is computed.
      (Price et al, 2006)
    • When Ethnic Stratification can be Ruled Out …
      • Measured Genotype approach (MG)
        • (Hopper and Mathews 1982; Boerwinkle et al 1986; George and Elston 1987)
      • Overall test of between- and within-family variation
        • (Havill et al 2005; Lange et al 2005)
      • The basis is a computationally heavy mixed model:
        • Genetic polymorphism under study = fixed effect or covariate
        • Polygenic component = random effect
      • Analytical solution: GRAMMAR (Aulchenko et al 2007) with MG
      • Frequentist and Bayesian Approaches to Analyses
    • Analysis Targets
      • Single marker analysis
      • Haplotype analysis
      • Gene-gene interactions
      • Gene-environment interactions
    • Hypothesis Testing: choosing a test by type of data (DV)
      • Qualitative (categorical)
        • 1 independent variable: goodness-of-fit χ²
        • 2 independent variables: independence test χ²
        • 2 dependent variables: McNemar test
      • Quantitative (measurement): relationships
        • 1 predictor, continuous measurement: Pearson r (primary interest: degree of relationship) or regression (form of relationship)
        • 1 predictor, ranks: Spearman r_s
        • Multiple predictors: multiple regression
      • Quantitative (measurement): differences
        • 2 groups, independent: 2-sample t (parametric) or Mann-Whitney U (nonparametric)
        • 2 groups, dependent: related-sample t (parametric) or Wilcoxon T (nonparametric)
        • Multiple groups, independent, 1 IV: one-way ANOVA (parametric) or Kruskal-Wallis H (nonparametric)
        • Multiple groups, independent, multiple IVs: factorial ANOVA
        • Multiple groups, dependent: repeated measures ANOVA (parametric) or Friedman (nonparametric)
    • Limitation of Regression
      • Having too many independent variables in relation to the number of observed outcome events
      • Assuming 10 bi-allelic loci, the number of parameters is:
        • Main effects: 20
        • 2-locus interactions: 180
        • 3-locus interactions: 960
        • 4-locus interactions: 3360
    • Limitation of Regression
      • Fewer than 10 outcome events per independent variable can lead to biased estimates of the regression coefficients and to an increase in Type 1 and Type 2 errors.
      • For 200 cases and 200 controls, this formula suggests that no more than 19 (= 200/10 – 1) parameters should be estimated in logistic regression model.
      # of parameters P ≤ min(n_case, n_control)/10 - 1
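The parameter counts quoted for 10 bi-allelic loci (20, 180, 960, 3360) match a simple counting rule: each locus contributes 2 parameters (three genotypes minus a reference), so a k-locus interaction needs C(10, k) x 2^k parameters. The sketch below reproduces those counts and the rule of thumb for the maximum number of parameters. The counting rule is inferred from the numbers on the slide rather than stated there, and the function names are hypothetical.

```python
from math import comb


def n_interaction_parameters(n_loci, order, params_per_locus=2):
    """Number of parameters for all interactions of a given order.

    order=1 gives the main effects; params_per_locus=2 corresponds to
    three genotypes per bi-allelic locus with one reference category.
    """
    return comb(n_loci, order) * params_per_locus ** order


def max_parameters(n_cases, n_controls, events_per_parameter=10):
    """Rule of thumb: P <= min(n_case, n_control) / 10 - 1."""
    return min(n_cases, n_controls) // events_per_parameter - 1


if __name__ == "__main__":
    print([n_interaction_parameters(10, k) for k in (1, 2, 3, 4)])
    print(max_parameters(200, 200))
```

Even the main effects alone (20 parameters) already exceed the 19 parameters that the rule allows for 200 cases and 200 controls.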
    • Alternatives to Traditional Methods
      • Tree-based methods:
        • Recursive Partitioning (Helix Tree)
        • Random Forests (R, CART)
      • Pattern recognition methods:
        • Symbolic Discriminant Analysis (SDA)
        • Mining association rules
        • Neural networks (NN)
        • Support vector machines (SVM)
      • Data reduction methods:
        • DICE (Detection of Informative Combined Effects)
        • MDR (Multifactor Dimensionality Reduction)
        • Logic regression …
      (e.g., Onkamo and Toivonen 2006)
      • Accumulating the Evidence for Analyzing GWAs
    • Focus on Replication
      • What SNPs are most likely to replicate?
        • Modest to strong statistical significance
        • Relatively common minor allele frequency
        • Modest to strong genetic effect size
    • Multi-Stage Design [figure: SNPs against study sample, with subsamples N1, N2 and N3 used in Stages I, II and III]
    • One-Stage Design [figure: SNPs against the full study sample N, analysed in one stage]
    • Study Design Comparison
      • Multi-Stage
        • Less Expensive
        • More Complicated
        • Less Powerful
      • Single-Stage
        • More Expensive
        • Less Complicated
        • More Powerful
    • (Skol et al 2005)
      • The future: Are we (really) ready for genome-wide analyses?
    • Are we Ready for Genome-wide Association Studies?
      • Availability of International HapMap resource, documenting patterns of genome-wide variation and LD in 4 population samples
      • Availability of dense genotyping chips, with good coverage of the human genome
      • Availability of large and well-characterized clinical samples for many common diseases
      (The Wellcome Trust Case Control Consortium, 2007)
    • (Cancer Epidemiol Biomarkers Prev 2006; 15(4))
      • Practical
    • GWA in Practice
      • Go to http://mga.bionet.nsc.ru/nlru/GenABEL/
      • At this page you can find
        • GenABEL package
        • Documentation
        • Test data sets
        • Links to related GWA software (ProbABEL -- analysis of imputed data, MetABEL -- meta-analysis)
    • GenABEL
    • GenABEL (http://mga.bionet.nsc.ru/nlru/GenABEL/)
    • Getting Started
      • To run under Windows, first install the R statistical computing software.
      • Download and save GenABEL_1.3-7.zip.
      • Start R, choose “Packages”, then “Install package(s) from local zip files...” and select the GenABEL_1.3-7.zip.
      • Make sure the packages “genetics” and “DGCgenetics” are installed as well (http://www-gene.cimr.cam.ac.uk/clayton/software/)
      • Start with command “library(GenABEL)”.
      • To see what it can do, try “demo(srdta)” or “demo(srdtawin)”.
    • Computer Practical Exercise
      • Heather Cordell
      • http://www.staff.ncl.ac.uk/heather.cordell/WTACcasecon2007.html
      • Using R for
        • Case-control association
        • Gene-gene interactions
    • Homework Assignment
      • Background Reading due 21 October
      • on
      • Multifactor Dimensionality Reduction
    • Some Notes
      • on the exam …
      • on the class schedule for the remaining classes
        • Check the updates
    • Presentations
      • 15 minutes per group
      • Mind
        • the correctness of the content
        • the quality of « presenting »
      • Group score