Sequencing as a tool for studying complex human phenotypes: from genes to whole genomes (Vasily Ramenskiy)
1. Sequencing as a tool for studying complex human phenotypes: from genes to whole genomes
Vasily Ramenskiy
UCLA Center for Neurobehavioral Genetics, Los Angeles, USA
October 2, 2014
4. (a) Contribution of genetic factors
// Genetic ≠ inherited: de novo mutations
(b) Non-Mendelian inheritance
What is a complex phenotype?
5. (a) Contribution of genetic factors
// Genetic ≠ inherited: de novo mutations
(b) Non-Mendelian inheritance
SCZ, schizophrenia; ASD, autism spectrum disorders; BP, bipolar disorder; AD, Alzheimer’s disease; ADHD, attention deficit hyperactivity disorder; TS, Tourette syndrome; OCD, obsessive compulsive disorder; ID, intellectual disability
What is a complex phenotype?
6. -- Heritable quantitative traits
Examples: working memory, executive function, sociability, attention, temperament, brain measures
-- Hypothesis: individuals diagnosed with conditions like ASD or SCZ may be at the extreme end of the distribution for some endophenotypes; risk prediction
-- Hope: simpler genetic architectures than clinical diagnoses, easier to dissect
Endophenotypes: intermediate layer
9. Tactical
-- Loci involved (in an individual and in the population)
-- Causal allele spectrum at each locus: rare, common…
-- Locus interaction: common allele as a modifier of rare ones
Strategic
-- Risk prediction
-- Identification of disease pathways; treatment
Goals of genetic analysis
10. Experiment
-- Genome-wide association analysis (GWAS)
-- Sequencing: DNA-Seq, RNA-Seq, ChIP-Seq, …
Data analysis
-- Bioinformatics: variant calling and quality control
-- Bioinformatics: variant annotation and functionality prediction
-- Statistical genetics: single variant or gene level association analysis
Validation
-- Follow-up genotyping
-- Model organisms, in vitro experiments
Methods
13. Sullivan et al. 2012; Mitchell 2014
-- AD, BP, SCZ: allelic spectrum and aetiological role for both rare and common variation
-- ASD, SCZ: variation at hundreds of different genes involved; organized in pathways
-- AD: unexpected cholesterol metabolism and the innate immune response pathways
-- ASD: de novo mutations
-- The same SNVs in ASD, SCZ, epilepsy, ADHD, ID and other disorders (Mitchell 2014)
-- SCZ: GWAS points to verified and predicted targets of non-coding RNA miR-137
Genetic architecture of disease
15. 1) High risk rare alleles causing Mendelian disease
-- Mostly coding: nonsense, missense, splice site, indels
-- Examples: APP or PS mutations in AD; LRRK2 mutations in Parkinson’s disease
2) Moderate risk low frequency alleles
-- Example: GBA mutations in Parkinson’s disease
-- Most difficult to detect earlier
3) Low risk common alleles
-- Detectable by GWAS
-- Examples: SNCA or MAPT in Parkinson’s disease; CLU, PICALM, CR1 in AD
-- Rarely coding; gene regulation?
Genetic architecture of disease
16. 4) High risk common alleles
-- Examples: APOE mutations in AD; complement factor H in macular degeneration
-- Easily identifiable by GWAS
-- Late onset diseases
5) De novo mutations
-- Example: autism
-- Diseases which affect reproductive fitness
-- Requires trio sequencing
Genetic architecture of disease
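The trio requirement for de novo mutations can be illustrated with a minimal sketch. It assumes biallelic autosomal sites with genotypes coded as alternate-allele counts (0, 1, 2); the function names and the example calls are hypothetical, and a real pipeline would also filter on depth and genotype quality in all three family members.

```python
def is_de_novo_candidate(child: int, father: int, mother: int) -> bool:
    """Child carries an alternate allele that neither parent carries."""
    return child > 0 and father == 0 and mother == 0


def de_novo_candidates(trio_calls: dict) -> list:
    """Return the sites whose (child, father, mother) genotypes look de novo."""
    return [
        site
        for site, (child, father, mother) in trio_calls.items()
        if is_de_novo_candidate(child, father, mother)
    ]


# Hypothetical genotype calls for one trio: site -> (child, father, mother).
example_calls = {
    "chr1:1000": (1, 0, 0),  # absent from both parents: candidate
    "chr1:2000": (1, 1, 0),  # inherited from the father
    "chr2:3000": (0, 0, 0),  # reference in everyone
}
candidates = de_novo_candidates(example_calls)
```

Without both parents sequenced, the first and second sites above cannot be told apart, which is why singleton cases are not sufficient for this class of variants.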
17. 6) Low risk rare variants
-- Expected to affect gene regulation, splicing etc.
-- Most difficult to identify, require:
-- large number of cases and controls,
-- reliable bioinformatic and statistical genetics methods;
-- functional follow-up
“Auxiliary” alleles:
7) Alleles in phenotype modifier genes
-- Example: modifier genes in cystic fibrosis
8) Alleles in epistasis with the disease one
-- Example: Bardet-Biedl syndrome
Genetic architecture of disease
18. Published GWA at p ≤ 5×10^-8 for 18 trait categories (07/2012)
NHGRI GWA Catalog
www.genome.gov/GWAStudies
www.ebi.ac.uk/fgpt/gwas/
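The genome-wide significance cutoff is commonly motivated as a Bonferroni correction of a 0.05 family-wise error rate. A minimal sketch of that arithmetic follows; the figure of one million effectively independent tests is the conventional assumption, not a number taken from these slides, and the function names are illustrative.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


# Assumption: roughly 1,000,000 effectively independent common-variant tests.
GENOME_WIDE = bonferroni_threshold(0.05, 1_000_000)


def is_genome_wide_significant(p_value: float, threshold: float = GENOME_WIDE) -> bool:
    """True if an association p-value passes the genome-wide cutoff."""
    return p_value <= threshold
```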
20. I. Allelic Spectrum of Metabolic Syndrome (ASMS) in the Northern Finland Birth Cohort 1966 (NFBC66)
21. Genetically homogeneous Finnish population
-- Finns descend from a small number of founders 4,000-2,000 years ago
-- Internal migration in the 17th century created small subisolates
-- These grew rapidly with little further migration
-- Genetically homogeneous sub-populations
Sabatti et al., 2009
23. NFBC66:
-- genetic isolate that is relatively homogeneous in genetic background (extensive LD) and environmental exposures;
-- quantitative traits: no biases characteristic of case-control studies;
-- birth cohort: no age as a potential confounder; longitudinal data;
-- founder population: potential enrichment in damaging variants (not pertinent for GWAS, though)
-- genotypes on ~329K SNPs in 4,763 individuals (out of 12,058 live births)
Nine heritable traits (risk factors for cardiovascular disease or T2D):
-- body mass index (BMI, 1); fasting serum concentrations of lipids: triglycerides (TG), HDL and LDL (2-4); indicators of glucose homeostasis (glucose (GLU), and insulin (INS)) and inflammation (CRP) (5-7); systolic (SBP) and diastolic (DBP) blood pressure (8-9);
-- Extreme values of these traits, in combination, identify a metabolic syndrome, hypothesized to increase risks for both CVD and T2D
NFBC66 and metabolic traits
24. -- 31 associations to 6 traits passing a 5×10^-7 threshold after correction, mostly replicating earlier findings;
-- 9 previously unreported associations
-- “Five of these associations—HDL with NR1H3 (LXRA), LDL with AR and FADS1-FADS2, glucose with MTNR1B and insulin with PANK1— implicate genes with known or postulated roles in metabolism”;
-- the currently identified loci, singly and cumulatively, explain little of the trait variability in NFBC1966 (at most ~6% based on multivariate regression);
-- contribution of rare variants?
GWAS results in NFBC66
29. ASMS sequencing: overview
-- Samples: 6,121 persons: 4,447 NFBC + 835 FUSION controls + 839 FUSION cases (Finland-United States Investigation of NIDDM Genetics)
-- Regions of interest: 78 genes from 17 loci on 10 chromosomes, UTRs + coding, ~270 kbp
-- Sequencing: pools of barcoded libraries per lane (12 for Illumina GAIIx, 18 for Illumina HiSeq 2000); mean coverage depth 31-285x
-- Data processing: BWA, single-sample BAMs, independent variant calling by three centers (UMich, WashU, UCLA); extensive QC
-- Consensus sites: 2,234, with an overall concordance rate between centers of 99.96%; 1,072 singletons or doubletons; 1,697 with MAF ≤ 0.5%
-- Annotation/prediction: MapSNPs/PolyPhen-2
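The site counts above (singletons or doubletons, MAF ≤ 0.5%) follow from simple allele counting. The sketch below assumes genotypes coded as alternate-allele counts with None for missing calls; the function names and the toy data are hypothetical and do not reproduce the three-center consensus pipeline.

```python
from collections import Counter


def minor_allele_count(genotypes):
    """Minor allele count from genotypes coded 0/1/2 (None = missing)."""
    called = [g for g in genotypes if g is not None]
    alt = sum(called)
    return min(alt, 2 * len(called) - alt)


def minor_allele_frequency(genotypes):
    """Minor allele frequency, or None when no genotype was called."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    return minor_allele_count(genotypes) / (2 * len(called))


def classify_sites(site_genotypes, rare_maf=0.005):
    """Count singleton/doubleton sites and sites with 0 < MAF <= rare_maf."""
    counts = Counter()
    for genotypes in site_genotypes.values():
        mac = minor_allele_count(genotypes)
        maf = minor_allele_frequency(genotypes)
        if mac in (1, 2):
            counts["singleton_or_doubleton"] += 1
        if maf is not None and 0 < maf <= rare_maf:
            counts["rare"] += 1
    return counts


# Toy data: 200 individuals, one site with a single carrier, one with ten.
toy_sites = {
    "site_a": [1] + [0] * 199,
    "site_b": [1] * 10 + [0] * 190,
}
toy_counts = classify_sites(toy_sites)
```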
33. Association analysis strategy
Phenotypes:
-- low-density lipoprotein (LDL), high-density lipoprotein (HDL), total cholesterol (TC), triglycerides (TG), fasting glucose (FG), fasting insulin (FI);
-- residuals after regression on age, age^2, sex, oral contraceptive use, and pregnancy status;
-- T2D cases from FUSION excluded from the GLU and INS analyses
Single-variant analysis: variants with MAF>0.1% in additive genetic model; first 5 PCs as covariates; method: PLINK
Gene-level tests: non-synonymous variants with MAF<1% (from 2 to 33 per gene); methods: CMC, SKAT (with direction)
Goal: new single variant signals independent from GWAS or association at the gene level (group tests)
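The idea behind gene-level tests can be sketched as a collapsing step followed by a comparison of phenotype residuals between carriers and non-carriers. This is a simplified illustration in the spirit of CMC, not the CMC or SKAT implementations used in the study; the function names, the Welch t statistic as the comparison, and the toy data are all assumptions.

```python
from math import sqrt
from statistics import mean, variance


def collapse_rare_carriers(genotypes_by_variant, maf_by_variant, max_maf=0.01):
    """Per-individual indicator: 1 if the person carries any variant with MAF < max_maf."""
    rare = [v for v, maf in maf_by_variant.items() if maf < max_maf]
    n_individuals = len(next(iter(genotypes_by_variant.values())))
    return [
        int(any(genotypes_by_variant[v][i] > 0 for v in rare))
        for i in range(n_individuals)
    ]


def welch_t(carriers, residuals):
    """Welch t statistic for carrier vs non-carrier phenotype residuals."""
    a = [r for c, r in zip(carriers, residuals) if c]
    b = [r for c, r in zip(carriers, residuals) if not c]
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


# Toy gene with two rare variants and one common variant, eight individuals.
toy_genotypes = {
    "v1": [1, 0, 0, 0, 0, 0, 0, 0],
    "v2": [0, 1, 0, 0, 0, 0, 0, 0],
    "v3": [0, 0, 1, 1, 1, 0, 1, 0],
}
toy_maf = {"v1": 0.002, "v2": 0.004, "v3": 0.20}
toy_residuals = [1.2, 0.9, -0.1, 0.2, -0.3, 0.0, 0.1, -0.2]

toy_carriers = collapse_rare_carriers(toy_genotypes, toy_maf)
toy_t = welch_t(toy_carriers, toy_residuals)
```

A pure collapsing test of this kind loses power when a gene holds both trait-increasing and trait-decreasing variants, which is one reason direction-aware methods such as SKAT are run alongside it.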
34. Association results
Initially: 17 loci × 6 metabolic phenotypes => 39 unique locus-phenotype combinations (32 for lipid measures + 6 for GLU + 1 for INS)
Results:
-- For 27 of the 39 locus-phenotype combinations, the re-sequencing analysis essentially recapitulated the results from the GWAS
-- Remaining 12 locus-phenotype associations (7 loci): new signals independent from GWAS
-- ABCA1, gene-level: 23 rare variants implicated in TC and HDL-C
-- CETP, gene-level: 4 rare NS variants associated with increased HDL-C and 4 with decreased HDL-C
-- Protective variant His177Tyr in G6PC2 (lowering FG), FinnMAF=1.4% (vs. 0.23% in Europe);
-- Damaging rs28933094 in LIPC (hepatic lipase deficiency), FinnMAF=1.5%
39. Why?!
-- Incomplete coverage for some loci
-- Causal non-coding variants?
-- Indels, CNVs etc (complicated architecture)?
-- Epistatic interactions?
-- Compound heterozygotes?
40. -- Extensive rare variation in the human population
-- GWAS → DNA-seq transition: knowing the full coding SNV spectrum may not give immediate answers
Lessons from ASMS story
41. Harvard Medical School: Jeremiah Scharf, Dongmei Yu; UCLA: Giovanni Coppola, Nelson Freimer, Alden Huang, Jae-Hoon Sul, Renee Sears, Vasily Ramenskiy; U. Chicago: Nancy Cox, Vasa Trubetskoy, Lea Davis
II. Tourette syndrome in large pedigrees and independent samples
42. Tourette syndrome (TS)
-- an inherited neuropsychiatric disorder with onset in childhood, characterized by multiple physical (motor) tics and at least one vocal (phonic) tic
-- ~0.4%-3.8% of children ages 5 to 18 may have TS
-- extreme TS in adulthood is a rarity, and TS does not adversely affect intelligence or life expectancy
43. TS/CT chr 2p linkage region in pedigrees
Dongmei Yu, Jeremiah Scharf
44. Tourette syndrome Large Family sequencing by CIDR (2011)
Samples: 15 pedigrees, 109 samples: 66 affected, 35 not affected, 8 unknown
Exome sequencing: Agilent HumanExon 50Mb Kit, >100 K SNVs
Custom targeted sequencing: 5.7 Mbp from chr2 (1-91 Mbp): ~22K SNVs
-- known and predicted exons not on the Agilent exome kit;
-- additional, brain-specific transcripts and AS exons (derived from UCLA fetal and adult brain RNA-sequencing libraries);
-- alternative brain-specific TSS tags using a brain cap-analysis gene expression (CAGE) library;
-- putative promoter regions;
-- predicted splice sites;
-- conserved sequences derived from alignments with 44 vertebrate species
45. -- Single-variant analysis: EMMAX, EIGENSTRAT, PLINK-TDT
-- Gene-based tests: PLINK/SEQ methods, VAAST, Zhu-Xiong method (?)
-- Imputed data
-- CNV analysis
Analysis Plan
Global
Local
-- Perfect cosegregation
-- Whole dataset
-- Under linkage peaks
-- Regions from literature
-- Multiple-hit analysis
-- Family-based VAAST
-- De novo analysis
Data in Web-Based Database
46. Manhattan plot of GWAS meta-analysis (Dongmei Yu)
-- Genome-wide significant result in the linkage region
-- Significant SNPs are located in the lncRNA gene
47. Expression correlation with top hit gene
-- BrainSpan database: expression values for 48,582 genes in 237 experiments, prenatal stages only (total: ~53K genes in 524 experiments); a gene must have >0 expression in at least one experiment
-- Pearson correlation coefficient calculated for all gene pairs in prenatal samples
-- List of genes with expression in developing brain correlated with the query gene
48. -- Compares a gene list against background of ~49K genes
-- Check 1-tail p<0.01 positive correlation: 476 genes
-- Check 1-tail p<0.001 negative correlation: 259 genes
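The correlation screen can be sketched as follows. This is a minimal illustration, not the code used for the analysis: the one-tailed p-value uses a normal approximation to the t statistic of the correlation (reasonable for a couple of hundred samples, crude for the six-sample toy data here), and the function names and expression values are hypothetical.

```python
from math import erfc, sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def one_tailed_p(r, n):
    """Approximate P(correlation >= r) under no correlation (normal approximation)."""
    if r >= 1:
        return 0.0
    if r <= -1:
        return 1.0
    t = r * sqrt((n - 2) / (1 - r * r))
    return 0.5 * erfc(t / sqrt(2))


def correlated_genes(query_values, expression, alpha=0.01, direction="positive"):
    """Genes whose expression is significantly correlated with the query gene."""
    sign = 1.0 if direction == "positive" else -1.0
    n = len(query_values)
    return [
        gene
        for gene, values in expression.items()
        if one_tailed_p(sign * pearson_r(query_values, values), n) < alpha
    ]


# Hypothetical expression values across six prenatal samples.
query = [1, 2, 3, 4, 5, 6]
toy_expression = {
    "geneA": [2, 4, 6, 8, 10, 13],
    "geneB": [6, 5, 4, 3, 2, 0.5],
    "geneC": [3, 1, 4, 1, 5, 2],
}
positive_hits = correlated_genes(query, toy_expression, alpha=0.01, direction="positive")
negative_hits = correlated_genes(query, toy_expression, alpha=0.001, direction="negative")
```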
51. GO terms in 259 genes (neg. corr., p<0.001)
-- “Wnt1 has also been shown to antagonize neural differentiation and is a major factor in self-renewal of neural stem cells. This allows for regeneration of nervous system cells, which is further evidence of a role in promoting neural stem cell proliferation”
52. -- Sample sizes
-- GWAS is not dead
-- Non-coding RNA genes
Lessons from TS story: what matters?
53. III. Analysis of WGS variation in the genomic region associated with amygdala volume in bipolar family individuals
UCLA Bipolar project
Nelson Freimer
Susan Service
Scott Fears
Carrie Bearden
+ many others
54. Bipolar disorder
-- A severe psychiatric illness characterized by alternating episodes of depression and mania
-- Ranks among the top ten causes of morbidity and lifelong disability worldwide
-- Prevalence: 1-2% of the population
55. Sequencing and variant calling
1) Initial WGS and variant calling: Illumina
-- 450 individuals from 27 large families (67 trios, 78 married-ins)
2) Genotype recalling at high-quality segregating sites: Samtools
-- 24.6 million variants in 450 individuals
-- Average genotype concordance with genotyped SNPs per individual: 99.78%; Mendelian inconsistency rate in trios: 1.78%
3) Pedigree-based genotype refinement: TrioCaller (Jae-Hoon Sul)
-- 23 million variants in 450 individuals
-- Genotype concordance: 99.86%; Mendelian inconsistencies: 0.18%;
4) Imputation on chr6: PLINK, FamLDCaller (Jae-Hoon Sul)
-- 977K variants on chr6 after QC in 839 individuals
-- No singletons, no sites with >=5 discordant genotypes, threshold r^2=0.1
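The Mendelian inconsistency rate used as a QC metric above can be computed per trio as sketched below. The sketch assumes biallelic autosomal sites with genotypes coded as alternate-allele counts; the function names and example genotypes are hypothetical, and this is not the TrioCaller implementation.

```python
from itertools import product


def is_mendelian_consistent(child: int, father: int, mother: int) -> bool:
    """True if the child's genotype can be built from one allele of each parent."""
    # Alternate-allele counts a parent with genotype g can transmit.
    transmissible = {0: {0}, 1: {0, 1}, 2: {1}}
    return any(
        f + m == child
        for f, m in product(transmissible[father], transmissible[mother])
    )


def mendelian_inconsistency_rate(trio_genotypes) -> float:
    """Fraction of (child, father, mother) genotype triples that violate Mendelian rules."""
    checks = [is_mendelian_consistent(c, f, m) for c, f, m in trio_genotypes]
    return 1 - sum(checks) / len(checks)


# Hypothetical genotypes at four sites in one trio.
example_trio = [
    (1, 0, 0),  # inconsistent: alternate allele absent from both parents
    (1, 0, 1),  # consistent
    (2, 1, 1),  # consistent
    (0, 2, 0),  # inconsistent: father can only transmit the alternate allele
]
example_rate = mendelian_inconsistency_rate(example_trio)
```

Most inconsistencies found this way are genotyping errors rather than true de novo events, which is why the rate falls after pedigree-aware genotype refinement.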
56. Multisystem component phenotypes of bipolar disorder (Fears et al, 2014)
-- 169 quantitative neurocognitive, temperament-related, and neuroanatomical phenotypes that appear heritable and associated with severe BP, measured in 738 adults (181 affected);
-- About 25% of the phenotypes, including measures from each phenotype domain, were both heritable and associated with BP-I
59. -- Burden test with effect direction (over-dispersion)
-- Earlier method SKAT (Sequence Kernel Association Test) modified to work with family samples
-- OK for quantitative phenotypes
60. Bioinformatics: potential functional variants
Coding variants (score):
-- 3: nonsense, splice site
-- 2: damaging
-- 1: benign
Non-coding variants (scores accumulate):
-- +1 if conserved or accelerated in any available lineage
-- +0.5 if Active/Strong chromatin in 10 brain tissues
-- +0.25 if disrupts TF binding site
-- Protein coding AND “my rank”>0: ~16 K variants
-- MAF<10%
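The scoring scheme on this slide can be written out as a small sketch. The weights and the two filters (rank > 0, MAF < 10%) come from the slide; the function names, the consequence labels, and the boolean inputs are illustrative assumptions about how the annotations might be represented.

```python
# Scores for coding consequences, as listed on the slide.
CODING_SCORE = {"nonsense": 3.0, "splice_site": 3.0, "damaging": 2.0, "benign": 1.0}


def coding_rank(consequence: str) -> float:
    """Score of a coding variant; unknown consequences score 0."""
    return CODING_SCORE.get(consequence, 0.0)


def noncoding_rank(conserved_or_accelerated: bool,
                   active_chromatin: bool,
                   disrupts_tfbs: bool) -> float:
    """Accumulated score of a non-coding variant from its three annotations."""
    rank = 0.0
    if conserved_or_accelerated:
        rank += 1.0
    if active_chromatin:
        rank += 0.5
    if disrupts_tfbs:
        rank += 0.25
    return rank


def keep_variant(rank: float, maf: float, max_maf: float = 0.10) -> bool:
    """Variant passes if it has a positive rank and MAF below the cutoff."""
    return rank > 0 and maf < max_maf
```

With these weights the three non-coding annotations cannot tie in combination, so every distinct set of evidence maps to a distinct score.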
61. Gene    P-value    Nvar   Pos (Mbp)
---------------------------------------
LATS1       0.001005   138    150.01
RAET1G      0.003231   14     150.24
CNKSR3      0.004171   30     154.73
UST         0.004190   494    149.23
PPIL4       0.005438   97     149.85
famSKAT results: take 1 (Susan Service)
Variants:
-- Protein coding genes
-- MAF<10% in married-ins
-- Rank>0
63. Bioinformatics: functional variants for take 2
-- Priority: nonsense >> splice-site >> damaging >> benign >> synonymous >> UTR exon >> flank >> enhancer >> intron
-- New: “Enhancer”, FANTOM5 enhancers associated with a gene
-- New: GWAVA scores (0..1)
-- Components for “my rank”: conservation, TFBS overlap, active chromatin
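When a variant carries several annotations, the priority order above picks the one to report. A minimal sketch, assuming each variant comes with a list of annotation labels spelled as on the slide; the function name is hypothetical.

```python
# Priority order from the slide, highest priority first.
PRIORITY = [
    "nonsense", "splice-site", "damaging", "benign", "synonymous",
    "UTR exon", "flank", "enhancer", "intron",
]
PRIORITY_INDEX = {name: i for i, name in enumerate(PRIORITY)}


def top_annotation(annotations):
    """Return the highest-priority annotation among those assigned to a variant."""
    return min(annotations, key=PRIORITY_INDEX.__getitem__)


# Example: a variant that is intronic in one transcript and synonymous in another.
chosen = top_annotation(["intron", "enhancer", "synonymous"])
```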