
  • Bioinformatics, class 5: Statistical Genetics. Kristel Van Steen, PhD, PhD, Dept of Electrical Engineering and Computer Science, Montefiore Institute – ULg, 2008-2009
  • Outline
    • Introduction
      • Vocabulary and Definitions
    • Biostatistical aspects of GWAs
      • GWAs
      • Surrogate for true genetic info
      • Challenges of GWAs
        • Power
        • Abundance of data
        • Design
        • Epistasis
  • Outline
    • Quality control in GWAs including
      • Genotype calling and missing frequencies
      • Minor allele frequencies and HWE
      • Population stratification
    • Frequentist and Bayesian approaches to analyses
    • Accumulating the evidence for GWAs
    • The future of the field ...
    • Introduction
  • Some Definitions
    • A locus
    • is a location on the genome which is sufficiently short so that recombination can be ignored, say up to a few Kb.
    • Alleles
    • are DNA sequence variants distinguished at a locus.
    • It may not be possible to distinguish all possible variants. Often grouped into two alleles D and d
  • Some Definitions
    • A genotype
    • is the unordered pair of an individual’s two alleles at one or more loci
    • E.g., genotypes for one bi-allelic locus are Dd (heterozygous) and DD or dd (homozygous)
    • A haplotype
    • consists of the allelic types at two or more loci on a single chromosome.
    • Biostatistical Aspects of Genome-Wide Association Studies
    • Genome-Wide Association Studies
  • Genetic Association
    • A statistical relationship in a population between an individual’s phenotype (characteristic of interest) and their genotype at a genetic locus
    • Genotypes
      • Known mutation in a gene
      • Marker with/without known effects on coding
  • Genetic Association Studies
    • Aim:
      • to detect associations between one or more genetic polymorphisms and a trait; the latter may be
        • measured,
        • dichotomous,
        • time to onset.
    • Genuine genetic associations arise only because human populations share common ancestry
  • Gene Mapping Approaches
    • Candidate gene studies
      • Association approaches
        • Family-based
        • Population-based
      • Resequencing approaches
    • Genome-wide studies
      • Linkage analysis
        • Family-based
      • Association analysis
        • Family-based
        • Population-based
  • Disease Association
    • Candidate gene approach vs (genome-wide) screening approach
    • Can’t see the forest for the trees vs can’t see the trees for the forest
    • Association studies have greater power than linkage studies to detect small effects, but require looking at more places (Risch and Merikangas 1996)
  • Association versus Linkage
    • Allelic association
      • Association at the population level
      • Pinpoints alleles
      • More powerful
      • More tests required
      • More sensitive to mistyping
      • Sensitive to population stratification
    • Linkage
      • Intrafamilial association
      • Pinpoints loci
      • Less powerful
      • Fewer tests required
      • Less sensitive to mistyping
      • Not sensitive to population stratification
  • Linkage versus Association (Courtesy of Ed Silverman)
    • Surrogate for True Genetic Info
  • SNPs
    • Single nucleotide polymorphisms (SNPs) are DNA sequence variations occurring when a single nucleotide (A, C, T, G) in the genome is altered.
    • The inherited allelic variation must have >1% population frequency.
    • SNPs can occur in both coding and non-coding regions, making up 90% of all human genetic variation
    • Frequency: roughly, every 100 to 300 bases along the about 3 billion base human genome
    • Remark: Some definitions include methylated and deaminated dinucleotides
  • Distribution of SNPs and Power
  • CDCV Hypothesis
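  • Linkage Disequilibrium
    • Disease locus: alleles D and d, with frequencies $p_D$ and $p_d$
    • Marker locus: alleles 1 and 2, with frequencies $p_1$ and $p_2$
    • Under linkage equilibrium: $p_{D1} = p_D \, p_1$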
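The slide above relates the haplotype frequency of D with marker allele 1 to the product of the two allele frequencies. As a minimal sketch (the function name and the choice of reporting D, D' and r² are assumptions for illustration, not part of the slides), the departure from that product can be quantified as follows:

```python
def ld_measures(p_d1, p_d, p_1):
    """Return (D, D_prime, r2) for disease allele D and marker allele 1.

    p_d1 : frequency of the haplotype carrying D and marker allele 1
    p_d  : frequency of disease allele D
    p_1  : frequency of marker allele 1
    """
    # D is zero exactly when p_d1 equals p_d * p_1 (linkage equilibrium).
    d = p_d1 - p_d * p_1

    # Largest |D| attainable given the allele frequencies.
    if d >= 0:
        d_max = min(p_d * (1 - p_1), (1 - p_d) * p_1)
    else:
        d_max = min(p_d * p_1, (1 - p_d) * (1 - p_1))
    d_prime = d / d_max if d_max > 0 else 0.0

    # Squared correlation between the two loci; undefined if a locus is monomorphic.
    denom = p_d * (1 - p_d) * p_1 * (1 - p_1)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Complete LD: every D chromosome carries marker allele 1.
print(ld_measures(0.5, 0.5, 0.5))
# Equilibrium: haplotype frequency equals the product of allele frequencies.
print(ld_measures(0.25, 0.5, 0.5))
```

A low r² between marker and causal locus is one reason why indirect associations are weaker than the direct associations they reflect.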
  • Indirect Associations
    • The polymorphism is a surrogate for the causal locus:
      • Indirect associations are weaker than the direct associations they reflect
      • Essential to type several surrounding markers
      • Try to exclude the possibility that a causal variant exists but is not picked up by the marker set:
      • Genome-wide vs Candidate gene approach
    (Ott 2004)
  • Causes for an Association
    • Causes of association between a marker and a disease:
    • Very close linkage
    • Chance
    • Pleiotropy
      • This can become a problem when selection on one trait favors one specific mutant, while selection on the other trait, which is influenced by the same gene, favors another mutant.
    • Stratification / Population heterogeneity
  • SNPs or CNVs (Copy Number Variations)?
    • Over 99% of human DNA sequences are the same across the population
  • Several Platforms (Ziegler et al 2008)
  • A Successful GWA
    • Succession of design, experimental and data analysis steps in a genome-wide association study
    (Ziegler et al 2008)
    • Challenges of Genome-Wide Association Studies
    • Power
  • Power Determinants
    • Important variables and parameters:
    • Study design
    • QTL effect size
    • Allele frequencies of marker and QTL
    • Linkage disequilibrium
  • Study Design (Cordell and Clayton, 2005)
  • Power of an Association Study
    • Important variables and parameters:
    • Study design
    • Effect size
    • Allele frequencies of marker and DSL
    • Linkage disequilibrium
  • Associations of Interest
    • A number of generations ago, a normal allele d mutated to a disease allele D on a particular chromosome on which the allele at a marker locus was M.
    • This chromosome is passed down through the generations, and now there are many copies. If the distance between D and M is small, recombinations are unlikely, so most D chromosomes carry M
    • This type of association contrasts with “spurious associations”
    (Figure: a mutation on a chromosome carrying marker allele M changes the haplotype M-d into M-D.)
    • Challenges of Genome-Wide Association Studies
    • Abundance of Data
  • Sheer Amount of Data
    • Computer capacities need to be large (storage and CPU time)
    • Skills and algorithms from computer science and bioinformatics are handy
    • Statistical analysis
      • Burden of dimensionality
      • Multiple testing
      • False positives
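To make the multiple-testing burden listed above concrete, here is a small sketch. The figure of 500,000 SNPs and the nominal level of 0.05 are illustrative assumptions, and the family-wise error formula assumes independent tests, which linked SNPs are not.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def expected_false_positives(alpha, n_tests):
    """Expected number of false positives if every null hypothesis is true."""
    return alpha * n_tests


def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


n_snps = 500_000  # hypothetical chip size
print(bonferroni_threshold(0.05, n_snps))      # per-SNP threshold
print(expected_false_positives(0.05, n_snps))  # false positives at nominal 0.05
print(familywise_error_rate(0.05, 100))        # already close to 1 for 100 tests
```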
  • Multiple Testing / Missing Data
    (Figure: N = 100, 50 cases and 50 controls; genotype grids for SNP 1 to SNP 4 with genotypes AA/Aa/aa, BB/Bb/bb, CC/Cc/cc and DD/Dd/dd.)
  • Curse of Dimensionality
    • Bellman R (1961) Adaptive control processes : A guided tour. Princeton University Press:
    • “ ... Multidimensional variational problems cannot be solved routinely ... . This does not mean that we cannot attack them. It merely means that we must employ some more sophisticated techniques.”
    • Challenges of Genome-Wide Association Studies
    • Change Design
  • The FBAT Test Statistic: Simple Formulation
    • N trios: 2 parents, 1 affected child
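The formula on this slide did not survive in the transcript. As a hedged illustration only: for trios with an affected child, a bi-allelic marker and additive coding, the FBAT statistic is known to coincide with the classical transmission disequilibrium test (TDT). The sketch below implements that TDT, not necessarily the formulation shown in class; the function names are hypothetical.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def tdt_statistic(b, c):
    """TDT chi-square from heterozygous parents.

    b : number of heterozygous parents transmitting the candidate allele
    c : number of heterozygous parents transmitting the other allele
    """
    if b + c == 0:
        raise ValueError("no informative (heterozygous) parents")
    return (b - c) ** 2 / (b + c)


stat = tdt_statistic(60, 40)
print(stat, chi2_sf_1df(stat))
```

Only heterozygous parents contribute, which is why the test conditions on parental genotypes and is robust to population stratification.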
  • PBAT Screening Methodology
    • Family-based studies
    • Address multiple-comparisons
    • Screen and test using the same dataset
    Van Steen K, McQueen MB, Herbert A et al. (2005). Genomic screening and replication using the same data set in family-based association testing. Nat Genet 37:683-691.
  • PBAT Family-Based Association (X = genotype, S = parental genotypes)
  • PBAT Population-Based Association (X = genotype, S = parental genotypes)
  • PBAT: Screening Step
    • Screen
      • Use ‘between-family’ information [ f(S,Y) ]
      • Calculate conditional power ( a b ,Y,S )
      • Select top N SNPs on the basis of power
  • PBAT (Laird NM and Lange C. Nat Rev Genet 2006)
  • Affymetrix Platform: 1 DSL (Method IV: FDR – Benjamini and Hochberg 1995) Power by simulation: Prostate cancer data on 467 subjects from 167 families (Kennedy et al 2003)
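The slide above names the false discovery rate procedure of Benjamini and Hochberg (1995). A minimal sketch of the step-up procedure follows; the function name and the example p-values are assumptions for illustration and are unrelated to the prostate cancer data.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    # Indices of the p-values in increasing order.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k with p_(k) <= k * q / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank

    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


print(benjamini_hochberg([0.041, 0.001, 0.6, 0.008], q=0.05))
```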
  • Affymetrix Platform
  • Simulation: mDSL - Affymetrix
    • Genetic data from Affymetrix SNPChip 10K array on 467 subjects from 167 families
    • Simulations (10,000 replicates)
      • Select 5 regions; 1 DSL in each region
      • Generate traits according to normal distribution, including up to 5 genetic contributions
      • For each replicate: generate heritability according to uniform distribution with mean h = 0.03 for all loci considered (or h = 0.05 for all loci)
  • Power to detect Genes with multiple DSL (Van Steen et al. Nature Genetics 2005).
    • Challenges of Genome-Wide Association Studies
    • Out-Of-Control Screening for Epistasis?
  • Genetic Architecture of Disease
    • The number of genes that impact disease susceptibility
    • The distribution of alleles and genotypes at those genes
    • The manner in which the alleles and genotypes impact disease susceptibility
    • (Weiss 1993)
  • Complications in disentangling?
    • There are likely to be many susceptibility genes each with combinations of rare and common alleles and genotypes that impact disease susceptibility primarily through non-linear interactions with genetic and environmental factors
  • Gene-Environment Interactions
  • Interactions (Thornton-Wells et al 2004)
  • Terminology: Epistasis (Moore 2004)
    • Does evidence of statistical epistasis necessarily imply genetical or biological epistasis?
  • MDR for Interaction Detection
    • MDR is a strategy to tackle the dimensionality problem of interaction detection
    • MDR creates a one-dimensional multi-locus genotype variable (high and low risk), which is evaluated for its ability to classify and predict disease status through cross-validation and permutation testing.
  • MDR Steps
    • Split the data: 9/10 training data, 1/10 test data
    • In each run, the single model with minimum classification error is the best model
    • 10 runs (10-fold cross-validation) yield 10 best models
    • The model with minimum PE (prediction error) is the best n-locus model.
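The core of MDR, pooling multi-locus genotype cells into a high-risk and a low-risk group, can be sketched as below. This is a simplified illustration under stated assumptions: it labels cells on one data set and reports the resulting classification error, and it omits the cross-validation and permutation testing described on the slides. The function names and the tie rule (ratio equal to the threshold counts as high risk) are choices made here.

```python
from collections import Counter


def mdr_risk_cells(genotypes, status, threshold=None):
    """Label each multi-locus genotype cell 'H' (high risk) or 'L' (low risk).

    genotypes : list of tuples, one multi-locus genotype per subject
    status    : list of 0/1 (control/case), same length
    threshold : case/control ratio above which a cell is high risk;
                defaults to the overall case/control ratio
    """
    cases = Counter(g for g, s in zip(genotypes, status) if s == 1)
    controls = Counter(g for g, s in zip(genotypes, status) if s == 0)
    if threshold is None:
        threshold = sum(cases.values()) / sum(controls.values())

    labels = {}
    for cell in set(cases) | set(controls):
        # Cross-multiplied comparison avoids dividing by zero controls.
        labels[cell] = "H" if cases[cell] >= threshold * controls[cell] else "L"
    return labels


def classification_error(labels, genotypes, status):
    """Fraction of subjects whose cell label disagrees with their disease status."""
    wrong = sum(
        (labels.get(g, "L") == "H") != (s == 1)
        for g, s in zip(genotypes, status)
    )
    return wrong / len(status)


# Toy two-locus data with an interaction pattern and no main effects.
geno = ([(0, 1)] * 4 + [(1, 0)] * 4 + [(0, 0)] + [(1, 1)]      # cases
        + [(0, 0)] * 4 + [(1, 1)] * 4 + [(0, 1)] + [(1, 0)])   # controls
stat = [1] * 10 + [0] * 10
cells = mdr_risk_cells(geno, stat)
print(cells, classification_error(cells, geno, stat))
```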
  • Advantages of MDR
    • Simultaneous detection of multiple genetic loci associated with a discrete clinical endpoint in absence of main effect.
    • Non-parametric:
      • Overcomes the “curse of dimensionality” that hampers logistic regression models.
    • No assumption about a genetic model
    • Low false positive rates
  • Some Disadvantages of MDR
    • Low power in the presence of genetic heterogeneity
    • Some important interactions could be missed due to pooling too many cells together
    • No adjustment can be made for main effects and confounding factors / restricted to dichotomous outcomes
    • Computationally very intensive. Only feasible for relatively small number of factors. Impractical to test very high-dimensional models. (~ 15 factors out of 500, on 4,000 subjects).
    • When the dimensionality of the best model is relatively high and the sample is relatively small, many observations in the test set can not be predicted…
    Solutions ?
  • MDR Extensions
    • GH / main effects / confounding factors:
    • Model-based MDR
    • Computationally very intensive. Only feasible for relatively small number of factors. Impractical to test very high-dimensional models. (~ 15 factors out of 500, on 4,000 subjects).
    • When the dimensionality of the best model is relatively high and the sample is relatively small, many observations in the test set can not be predicted…
  • MDR Extensions
    • GH: Model-based MDR
    • Limited number of features: Preselecting favorable features
    • De Lobel et al (work in progress):
      • Rely on RF screening applied to combined multi-locus genotype info (De Lobel et al – PhD student)
      • Rely on screening strategies such as C2BAT?
      • Rely on screening strategies such as FITF (Millstein et al, 2005)?
    • When the dimensionality of the best model is relatively high and the sample is relatively small, many observations in the test set cannot be predicted…
  • MDR Extensions
    • GH: Model-based MDR
    • Limited number of features: Preselecting favorable features
    • Missing genotype data: Alternatives to current MDR approach of creating a “4th category level”
  • MDR-MB: Alternative Identification of Risk Cells
    • MDR: X = {H, L}
    • MDR-MB: X = {H, L, O}
    • OR_H, OR_L
    • Significance: adjust for the number of combined cells in each category
  • MB-MDR Power of MDR compared to MB-MDR under aforementioned scenarios (Calle, Urrea, Malats, Van Steen 2008- submitted soon)
    • Quality Control in Genome-Wide Association Studies
  • Genotype Calling
    • Assignment of genotypes to subjects according to their signal intensities is performed automatically using one of the available algorithms
      • Different laboratories may use different coding systems
      • Pooled analyses?
  • Genotype Calling
    • Within-subject variability due to
      • DNA concentration
      • Specific features during the hybridization process
  • Allele Signal Intensity/Classification (Ziegler et al 2008)
  • Calling all Subjects Together?
    • Current standard approach to genotype calling is separate calling of cases and controls
    • Alternatively, do joint calling of all subjects and adjust for possible differential bias
    • Differential bias may result in displacement of genotype clouds between cases and controls
    (Clayton et al 2005)
  • Cross-Platform Comparisons
    • High reproducibility of genotypes is important
    • Compare several platforms
    • Compare several technologies
    • Report concordance statistics
      • Same platform: ~ 99%
      • Across platforms: > 95%
  • Missing Frequency per SNP
    • Questionable SNP quality if genotyping failed in many individuals or could not be done
    • Differential missingness in study groups requires checking this criterion separately for all study groups
    • Exclude SNP when missingness frequency > 2-3%
  • Minor Allele Frequency
    • Minor allele frequency is typically used as a data filter
    • Exclude SNPs when minor allele frequency < 1%:
      • Low power to detect association between the SNP and the trait of interest
      • Sometimes cut-off of 5% is considered
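The per-SNP missingness and minor allele frequency filters from the two slides above can be combined into one check. The sketch below is illustrative: genotypes are assumed to be coded as 0, 1 or 2 copies of one allele with `None` for a failed call, and the defaults (3% missingness, 1% MAF) are taken from the thresholds quoted on the slides.

```python
def snp_qc(genotypes, max_missing=0.03, min_maf=0.01):
    """Return (missing_rate, maf, passed) for one SNP.

    genotypes : list with 0, 1 or 2 allele copies per subject, None if missing
    """
    n = len(genotypes)
    called = [g for g in genotypes if g is not None]
    missing_rate = (n - len(called)) / n

    if called:
        # Two alleles per called subject.
        freq = sum(called) / (2 * len(called))
        maf = min(freq, 1.0 - freq)
    else:
        maf = 0.0

    passed = missing_rate <= max_missing and maf >= min_maf
    return missing_rate, maf, passed


snp = [0] * 90 + [1] * 8 + [2] * 1 + [None] * 1
print(snp_qc(snp))
```

Because differential missingness matters, the same check would be run separately within each study group rather than only on the pooled sample.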
  • Comparison of Control Groups
    • Compare the genotype frequencies between the control groups:
      • Trends of genotypes should be identical
      • Apply an equivalence test
      • E.g., equivalence Cochran-Armitage trend test
    • The Cochran-Armitage test for trend can also be described as a method of directing chi-squared tests toward narrow alternatives.
    • It is quite common as a genotype-based test for candidate gene association.
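For reference, the ordinary Cochran-Armitage trend statistic for a 2 × 3 genotype table can be sketched as follows. Note the assumptions: this is the standard test of association, not the equivalence variant mentioned above, and the additive weights (0, 1, 2) are a common default chosen here for illustration.

```python
def cochran_armitage(cases, controls, weights=(0, 1, 2)):
    """Chi-square (1 df) trend statistic for genotype counts in cases and controls.

    cases, controls : counts per genotype, e.g. (n_dd, n_Dd, n_DD)
    weights         : scores assigned to the genotypes
    """
    totals = [r + s for r, s in zip(cases, controls)]
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls

    sum_t_r = sum(t * r for t, r in zip(weights, cases))
    sum_t_n = sum(t * c for t, c in zip(weights, totals))
    sum_t2_n = sum(t * t * c for t, c in zip(weights, totals))

    # Numerator: weighted excess of cases over their expectation.
    trend = n * sum_t_r - n_cases * sum_t_n
    # Denominator: variance term under the null hypothesis of no trend.
    denom = n_cases * n_controls * (n * sum_t2_n - sum_t_n ** 2)
    return n * trend ** 2 / denom


print(cochran_armitage((10, 20, 30), (30, 20, 10)))
```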
  • Hardy-Weinberg Equilibrium
    • HWE: genotype and allele frequencies in a large, randomly mating population remain stable over generations
    • Consequence:
      • Fixed relationship between allele and genotype frequencies
      • Deviations may point to quality problems (other sources: selection by disease status)
    • Check HWE in controls and omit SNP when p-value < $10^{-4}$
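A minimal sketch of the HWE check described above, using the Pearson chi-square test with one degree of freedom. This is one common choice; exact tests are often preferred for rare alleles, and the function names are hypothetical.

```python
import math


def hwe_chi2(n_aa, n_ab, n_bb):
    """Return (chi2, p_value) for Hardy-Weinberg equilibrium from genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele a
    q = 1.0 - p

    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)

    # Skip cells with zero expectation (monomorphic SNP).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square upper tail, 1 df
    return chi2, p_value


def fails_hwe(n_aa, n_ab, n_bb, cutoff=1e-4):
    """True if the SNP would be omitted under the p-value cutoff."""
    return hwe_chi2(n_aa, n_ab, n_bb)[1] < cutoff


print(hwe_chi2(25, 50, 25))   # genotype counts in exact HWE proportions
print(fails_hwe(50, 0, 50))   # no heterozygotes at all
```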
  • Subject-Level Quality Control
    • Complement of SNP-wise missing frequency is subject-wise missing frequency (i.e., call rate)
    • Exclude probands when genotype success < 97% of the SNPs (this number still includes the monomorphic SNPs)
    • Sometimes 90% is used as ...
  • Population Stratification
    • Simpson's paradox
    • If we mix two populations that differ in both disease prevalence and marker allele frequency, and there is no association between the disease and the marker allele within each population, then there will be an association between the disease and the marker allele in the mixed population.
    (Marchini, 2004)
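A small numerical illustration of this effect, with made-up allele counts: within each population the allelic odds ratio is exactly 1, yet pooling the two populations produces a strong apparent association.

```python
def odds_ratio(case_m, case_other, control_m, control_other):
    """Allelic odds ratio for marker allele M in cases versus controls."""
    return (case_m * control_other) / (case_other * control_m)


# Hypothetical allele counts: (cases M, cases other, controls M, controls other).
# Population 1: many cases, marker allele M common (80%).
pop1 = (160, 40, 40, 10)
# Population 2: few cases, marker allele M rare (20%).
pop2 = (10, 40, 40, 160)
# Mixed population: add the counts cell by cell.
pooled = tuple(a + b for a, b in zip(pop1, pop2))

print(odds_ratio(*pop1))    # no association within population 1
print(odds_ratio(*pop2))    # no association within population 2
print(odds_ratio(*pooled))  # spurious association after mixing
```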
  • Guarding Against Stratification
    • Possible solutions:
      • Apply genomic control
      • Consider a homogenous population
      • Use family-based controls
  • Genomic Control (GC)
    • GC is all about:
      • Calculating an association statistic for a candidate locus
      • Calculating the same association statistic, from the same sample, for a set of unlinked loci
      • Determining significance by reference to the results for the unlinked loci
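One common way to carry out the three steps above is the median-based inflation factor: divide the median of the chi-square statistics at the unlinked loci by the median of a 1-df chi-square distribution, then deflate the candidate statistic. The sketch below assumes 1-df chi-square statistics and floors the factor at 1; the function name is hypothetical.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control(candidate_stat, null_stats):
    """Return (lambda, adjusted_statistic).

    candidate_stat : 1-df chi-square statistic at the candidate locus
    null_stats     : the same statistic computed at unlinked loci
    """
    # Inflation factor; values below 1 are not used to inflate the statistic.
    lam = max(1.0, median(null_stats) / CHI2_1DF_MEDIAN)
    return lam, candidate_stat / lam


lam, adjusted = genomic_control(10.0, [0.2, 0.9098728, 3.0])
print(lam, adjusted)
```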
  • Homogenous Populations
    • STRUCTURE
      • Uses multilocus genotype data to investigate pop. structure
      • Assigns samples to discrete subpop. clusters
      • Aggregates evidence of association within each cluster
    • Issues
      • LE within subpop.?
      • Missing data?
      • The number of clusters?
      • Convergence of MCMC?
    (Pritchard et al, 2000)
  • Homogenous Populations: The EIGENSTRAT algorithm
    • (a) Principal components analysis is applied to genotype data to infer continuous axes of genetic variation.
    • (b) Genotype at a candidate SNP and phenotype are adjusted by amounts attributable to ancestry along each axis, removing all correlations to ancestry.
    • (c) After ancestry adjustment, an association statistic is computed.
    (Price et al, 2006)
  • When Ethnic Stratification can be Ruled Out …
    • Measured Genotype approach (MG)
      • (Hopper and Mathews 1982; Boerwinkle et al 1986; George and Elston 1987)
    • Overall test of between- and within-family variation
      • (Havill et al 2005; Lange et al 2005)
    • The basis is a computationally heavy mixed model:
      • Genetic polymorphism under study = fixed effect or covariate
      • Polygenic component = random effect
    • Analytical solution: GRAMMAR (Aulchenko et al 2007) with MG
    • Frequentist and Bayesian Approaches to Analyses
  • Analysis Targets
    • Single marker analysis
    • Haplotype analysis
    • Gene-gene interactions
    • Gene-environment interactions
  • Hypothesis Testing (figure: decision tree for choosing a test by type of data (DV))
    • Qualitative (categorical): goodness-of-fit χ², independence test χ², McNemar test
    • Quantitative (measurement), relationships: Pearson r, Spearman r_s, regression, multiple regression
    • Quantitative (measurement), differences between 2 groups: 2-sample t, Mann-Whitney U, related-sample t, Wilcoxon T
    • Quantitative (measurement), differences between multiple groups: one-way ANOVA, Kruskal-Wallis H, factorial ANOVA, repeated measures ANOVA, Friedman
  • Limitation of Regression
    • Having too many independent variables in relation to the number of observed outcome events
    • Assuming 10 bi-allelic loci, the number of parameters is:
      • Main effects: 20
      • 2-locus interactions: 180
      • 3-locus interactions: 960
      • 4-locus interactions: 3360
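The counts on this slide are consistent with two parameters per locus (three genotypes, one taken as reference): a k-locus interaction among L loci then needs C(L, k) · 2^k parameters. A short check under that assumed coding:

```python
from math import comb


def n_interaction_parameters(n_loci, order, params_per_locus=2):
    """Parameters needed for all interactions of a given order among bi-allelic loci."""
    # Choose which loci interact, then multiply the per-locus parameters.
    return comb(n_loci, order) * params_per_locus ** order


counts = [n_interaction_parameters(10, k) for k in range(1, 5)]
print(counts)
```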
  • Limitation of Regression
    • Fewer than 10 outcome events per independent variable can lead to biased estimates of the regression coefficients and to an increase in Type 1 and Type 2 errors.
    • For 200 cases and 200 controls, this formula suggests that no more than 19 (= 200/10 – 1) parameters should be estimated in logistic regression model.
    Number of parameters: $P \le \min(n_{\text{case}}, n_{\text{control}})/10 - 1$
  • Alternatives to Traditional Methods
    • Tree-based methods:
      • Recursive Partitioning (Helix Tree)
      • Random Forests (R, CART)
    • Pattern recognition methods:
      • Symbolic Discriminant Analysis (SDA)
      • Mining association rules
      • Neural networks (NN)
      • Support vector machines (SVM)
    • Data reduction methods:
      • DICE ( Detection of Informative Combined Effects )
      • MDR (Multifactor Dimensionality Reduction)
      • Logic regression …
    (e.g., Onkamo and Toivonen 2006)
    • Accumulating the Evidence for Analyzing GWAs
  • Focus on Replication
    • What SNPs are most likely to replicate?
      • Modest to strong statistical significance
      • Relatively common minor allele frequency
      • Modest to strong genetic effect size
  • Multi-Stage Design (figure: SNPs against study sample, with sample sizes N1, N2 and N3 used in Stages I, II and III)
  • One-Stage Design (figure: all SNPs typed on the full study sample N in one stage)
  • Study Design Comparison
    • Multi-Stage
      • Less Expensive
      • More Complicated
      • Less Powerful
    • Single-Stage
      • More Expensive
      • Less Complicated
      • More Powerful
  • (Skol et al 2005)
    • The future: Are we (really) ready for genome-wide analyses?
  • Are we Ready for Genome-wide Association Studies?
    • Availability of International HapMap resource, documenting patterns of genome-wide variation and LD in 4 population samples
    • Availability of dense genotyping chips, with good coverage of the human genome
    • Availability of large and well-characterized clinical samples for many common diseases
    (The Wellcome Trust Case Control Consortium, 2007)
  • (Cancer epidemiol Biomarkers Prev 2006: 15(4))
    • Practical
  • GWA in Practice
    • Go to http://mga.bionet.nsc.ru/nlru/GenABEL/
    • At this page you can find
      • GenABEL package
      • Documentation
      • Test data sets
      • Links to related GWA software (ProbABEL -- analysis of imputed data, MetABEL -- meta-analysis)
  • GenABEL
  • GenABEL (http://mga.bionet.nsc.ru/nlru/GenABEL/)
  • Getting Started
    • To run under Windows, first install the R statistical computing software.
    • Download and save GenABEL_1.3-7.zip.
    • Start R, choose “Packages”, then “Install package(s) from local zip files...” and select the GenABEL_1.3-7.zip.
    • Make sure the packages “genetics” and “DGCgenetics” are installed as well (http://www-gene.cimr.cam.ac.uk/clayton/software/)
    • Start with command “library(GenABEL)”.
    • To see what it can do, try “demo(srdta)” or “demo(srdtawin)”.
  • Computer Practical Exercise
    • Heather Cordell
    • http://www.staff.ncl.ac.uk/heather.cordell/WTACcasecon2007.html
    • Using R for
      • Case-control association
      • Gene-gene interactions
  • Homework Assignment
    • Background reading on Multifactor Dimensionality Reduction, due 21 October
  • Some Notes
    • on the exam …
    • on the class schedule for the remaining classes
      • Check the updates
  • Presentations
    • 15 minutes per group
    • Mind
      • the correctness of the content
      • the quality of « presenting »
    • Group score