Statistical Genetics Using Sequence Data
Dajiang J. Liu
Department of Statistics
Why We Study Statistical Genetics
  • Statistics originated from genetics
    • R.A. Fisher: “The Correlation Between Relatives on the Supposition of Mendelian Inheritance”; introduced the concept of variance in this article
    • Francis Galton: regression of human height toward the mean; introduced correlation and regression
    • Karl Pearson: “Mendelism and the Problem of Mental Defect”; “Tuberculosis, Heredity and Environment”
  • Why don’t we seek our roots?
  • To find disease genes in the genome, statistics is a must
Statistical Genetics
  • Disease gene mapping: the determination of the sequence of genes and their relative distances from one another on a specific chromosome
  • Technology-driven field
    • Mendel’s era: segregation analysis
      • Patience: peas, fruit fly; inbreeding is necessary
      • Experimental design
Statistical Genetics
  • Modern era
    • Microsatellite markers: genetic linkage analysis
      • Extremely successful for mapping and identifying Mendelian traits
    • Single nucleotide polymorphism (SNP) markers
      • Case-control studies and genome-wide association studies (GWAS): to identify common variants involved in complex traits
    • Computational techniques for likelihood in pedigrees
    • Statistics plays a major role
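The case-control comparison behind a GWAS can be illustrated for a single SNP. The sketch below is a minimal example, assuming an allelic chi-square test on a 2x2 table of allele counts; the slides do not name a specific test, and the function name and counts are hypothetical.

```python
import math


def allelic_chi_square(case_minor, case_major, control_minor, control_major):
    """Pearson chi-square test (1 df) on a 2x2 table of allele counts.

    Assumption: alleles are treated as independent observations, which is
    only one of several ways to test a SNP in a case-control study.
    """
    a, b, c, d = case_minor, case_major, control_minor, control_major
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical allele counts, for illustration only.
chi2, p_value = allelic_chi_square(300, 700, 250, 750)
print(f"chi-square = {chi2:.3f}, p-value = {p_value:.4f}")
```

In a genome-wide scan a test of this kind is repeated at every SNP, so the significance threshold has to account for the number of tests.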
Statistical Genetics
  • Sequencing era
    • Study of diseases due to rare variants is emerging
    • ABI SOLiD sequencer
    • Statistics is ALL for sequencing data
Statistical Genetics
  • Data we work with
    • Human Genome Project
    • HapMap Project
    • 1000 Genomes Project
Multifactorial Disease Etiology Hypothesis
  • Common Disease/Common Variants (CD/CV) hypothesis: common diseases are caused by a few common variants with moderate effect
    • E.g. age-related macular degeneration
    • Common variants are likely to have lower odds ratios than rare variants
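The odds ratio mentioned above can be computed directly from a 2x2 table of carrier counts. The sketch below is a minimal illustration with a 95% Wald confidence interval; the function name and all counts are hypothetical and are not taken from any study cited in the slides.

```python
import math


def odds_ratio(case_carriers, case_noncarriers, control_carriers, control_noncarriers):
    """Odds ratio with a 95% Wald confidence interval from a 2x2 table."""
    a, b, c, d = case_carriers, case_noncarriers, control_carriers, control_noncarriers
    estimate = (a * d) / (b * c)
    # Standard error of log(OR); assumes every cell count is greater than zero.
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - 1.96 * se_log)
    upper = math.exp(math.log(estimate) + 1.96 * se_log)
    return estimate, lower, upper


# Hypothetical counts: a common variant with a moderate effect ...
common_or, common_lo, common_hi = odds_ratio(600, 400, 500, 500)
# ... and a rare variant with a large effect.
rare_or, rare_lo, rare_hi = odds_ratio(20, 980, 5, 995)

print(f"common variant: OR = {common_or:.2f} ({common_lo:.2f}, {common_hi:.2f})")
print(f"rare variant:   OR = {rare_or:.2f} ({rare_lo:.2f}, {rare_hi:.2f})")
```

With these made-up counts the rare variant has the larger point estimate but also a much wider interval, because so few individuals carry it.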
Multifactorial Disease Etiology Hypothesis
  • Common Disease/Rare Variants hypothesis: common diseases are caused by multiple rare variants with large effect sizes
    • The discovery of rare variants will have a high impact on public health, since they will aid in risk prediction and treatment
    • E.g. multiple rare alleles contribute to low plasma levels of HDL cholesterol
    • E.g. colorectal adenomas
Challenges for Statistical Methodologies
  • Variant misclassification
    • Non-causal variants included: there is a huge number of mutations in the genome, and most of them do not cause the disease under study
    • Causal variants excluded: intronic mutations; intergenic regions
  • Unknown patterns of interactions
    • Within-gene interactions, e.g. Hirschsprung’s disease (RET gene)
    • Gene x gene interactions, e.g. breast cancer genes (BRCA1, BRCA2 x CHEK2)
  • Adaptive methods are needed
Kernel Based Adaptive Clustering
  • Combines variant classification with association testing in a coherent framework
  • Applicable to population-based case/control studies using unrelated individuals
  • Robust against variant misclassification
  • Can handle gene x gene interactions and gene x environment interactions
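For orientation, a much simpler, non-adaptive way to test rare variants in unrelated cases and controls is to collapse all rare variants in a gene into a single carrier indicator and compare carrier counts. The sketch below is that plain collapsing baseline with a one-sided Fisher exact test. It is not the kernel-based adaptive clustering method described above, and the function names and genotypes are hypothetical.

```python
import math


def is_carrier(genotype):
    """True if the individual carries a minor allele at any rare site."""
    return any(count > 0 for count in genotype)


def fisher_exact_greater(case_carriers, n_cases, control_carriers, n_controls):
    """One-sided Fisher exact p-value: P(at least this many carriers among cases)."""
    total = n_cases + n_controls
    carriers = case_carriers + control_carriers
    denominator = math.comb(total, n_cases)
    numerator = sum(
        math.comb(carriers, x) * math.comb(total - carriers, n_cases - x)
        for x in range(case_carriers, min(carriers, n_cases) + 1)
    )
    return numerator / denominator


def collapsing_test(case_genotypes, control_genotypes):
    """Collapse rare variants to carrier status, then test cases vs controls."""
    case_carriers = sum(is_carrier(g) for g in case_genotypes)
    control_carriers = sum(is_carrier(g) for g in control_genotypes)
    return fisher_exact_greater(
        case_carriers, len(case_genotypes), control_carriers, len(control_genotypes)
    )


# Hypothetical minor-allele counts at three rare sites of one gene.
cases = [[0, 1, 0], [0, 0, 0], [1, 0, 0], [0, 0, 1], [0, 0, 0]]
controls = [[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0]]

p_value = collapsing_test(cases, controls)
print(f"one-sided p-value = {p_value:.4f}")
```

Because every variant is treated alike, a baseline of this kind loses power when non-causal variants are included or causal ones are left out, which is the misclassification problem that motivates adaptive methods.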
