Laboratory of Systems Biology and Genetics
Bart Deplancke (firstname.lastname@example.org)
Human genetic variation and its contribution to complex traits
26 June 2000
The human genome: first announcement
• June 2000: first announcement of a working draft (haplotype!), with the Nature and Science papers in February 2001. James Kent (UCSC), Eugene Myers (Celera). International Human Genome Sequencing Consortium (2001) Nature 409:860-921; Venter et al. (2001) Science 291:1304-1351.
• June 2001: finished chromosome 20, with others following until the finishing of chromosome 1 in May 2006. Gregory et al. (2006) Nature 441:315-321.
Classes of human genetic variation
Common versus rare refers to the frequency of the minor allele in the human population:
• Common variants = minor allele frequency (MAF) > 1% in the population. Also described as polymorphisms.
• Rare variants = MAF < 1%
Neutrality:
• The vast majority of genetic variants are likely neutral = no contribution to phenotypic variation.
• Some may reach significant frequencies, but this is chance.
Two different nucleotide composition classes:
• Single nucleotide variants
• Structural variants
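As a rough illustration of the common/rare split, the sketch below computes the minor allele frequency from a list of observed alleles and applies the 1% cut-off given on this slide. The function names are hypothetical, and treating a MAF of exactly 1% as rare is an assumption, since the slide leaves that boundary open.

```python
from collections import Counter


def minor_allele_frequency(alleles):
    """Frequency of the second most common allele (0.0 if only one allele is seen)."""
    counts = Counter(alleles)
    if len(counts) < 2:
        return 0.0
    # most_common(2) returns the two most frequent alleles; the second is the minor one.
    minor_count = counts.most_common(2)[1][1]
    return minor_count / len(alleles)


def classify_variant(maf, threshold=0.01):
    """Label a variant 'common' if MAF > threshold, else 'rare' (boundary case is an assumption)."""
    return "common" if maf > threshold else "rare"


# Example: 100 sampled chromosomes, 3 of which carry the T allele.
sample = ["G"] * 97 + ["T"] * 3
print(classify_variant(minor_allele_frequency(sample)))
```

With 3 minor alleles in 100 chromosomes the MAF is 3%, so the variant counts as common under this definition.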
Single nucleotide variants
Variant sites, left to right: T/G, T/G, A/C
ATTGCAATCCGTGG...ATCGAGCCA…TACGATTGCACGCCG…
ATTGCAAGCCGTGG...ATCTAGCCA…TACGATTGCAAGCCG…
ATTGCAAGCCGTGG...ATCTAGCCA…TACGATTGCAAGCCG…
ATTGCAATCCGTGG...ATCGAGCCA…TACGATTGCACGCCG…
ATTGCAAGCCGTGG...ATCTAGCCA…TACGATTGCAAGCCG…
How are SNPs detected? High-density oligonucleotide arrays (Chee et al., Science, 1996)
• Simple 5’ to 3’ read-out
• Flanking issues
• Unique oligonucleotide primers to generate minimally overlapping long-range PCR products of 10-kb average length
How are SNPs detected? Other strategies
• Reduced representation shotgun sequencing followed by genomic alignment (clustered alignment)
• Gene-centric studies (reference sequence)
From Rothberg et al., Nature Biotech, 2001
The SNP database - dbSNP
http://www.ncbi.nlm.nih.gov/projects/SNP/
Three “out of Africa” genomes:
• 1.2 million (67%) (all three), 1.7 million (52%) (any two), 1.0 million (30%) unique
• Overall, 5.2 million SNPs in the three genomes, the majority being present in dbSNP
• Data indicate that most SNVs are common rather than rare
Single nucleotide variants
• Estimated that the human genome contains > 11 million SNPs (~7 million with MAF > 5%, rest between 1-5%).
• Unknown how many rare or even novel (“de novo”) SNVs exist.
• SNP alleles in the same genomic interval are often correlated with one another: “linkage disequilibrium (LD)” = nonrandom association of alleles. LD varies in a complex and unpredictable manner across the genome and between different populations.
• International HapMap Project: can we divide the genome into groups of highly correlated SNPs that are generally inherited together = “LD bins”?
Figure: number of tag SNPs required to capture common Phase II SNPs
Single nucleotide variants: recap
• International HapMap Project: can we divide the genome into groups of highly correlated SNPs that are generally inherited together = “LD bins”?
• Figure: number of tag SNPs required to capture common Phase II SNPs. Pairwise linkage disequilibrium (LD) r2 (if r2 = 1, SNPs are statistically indistinguishable). Based on genotyping over 3.1 million SNPs in 270 individuals from 4 geographically diverse populations (Frazer et al., Nature, 2007).
• By genotyping the DNA sample of an individual with a “tagging” SNP from each LD bin, knowledge regarding 80% of SNPs with a MAF > 5% across the genome is gained (Frazer et al., Nature Rev. Genet., 2010).
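The pairwise r2 mentioned above can be sketched in a few lines. This is a minimal example, assuming phased haplotypes are available and each SNP is coded 0/1 (the coding and the function name are illustrative choices, not something the slides specify). It uses the standard definitions D = p_AB - p_A p_B and r2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B)).

```python
def ld_r2(haplotypes):
    """Pairwise r2 between two biallelic SNPs.

    haplotypes: list of (allele_at_snp1, allele_at_snp2) tuples, alleles coded 0 or 1.
    """
    n = len(haplotypes)
    p_a = sum(h[0] for h in haplotypes) / n          # frequency of allele 1 at SNP 1
    p_b = sum(h[1] for h in haplotypes) / n          # frequency of allele 1 at SNP 2
    p_ab = sum(1 for h in haplotypes if h[0] == 1 and h[1] == 1) / n
    d = p_ab - p_a * p_b                             # disequilibrium coefficient D
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("r2 is undefined when either SNP is monomorphic")
    return d * d / denominator


# Two SNPs that always travel together: r2 = 1, so either one tags the other.
linked = [(0, 0)] * 6 + [(1, 1)] * 4
# Two SNPs whose alleles combine at random: r2 = 0.
unlinked = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(ld_r2(linked), ld_r2(unlinked))
```

When r2 is 1, genotyping one SNP gives full knowledge of the other, which is the idea behind picking one tagging SNP per LD bin.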
Population stratification
Subdivision of a population into different ethnic groups with potentially different marker allele frequencies and thus different disease prevalence.
Principal Component Analysis reveals SNP-vectors explaining the largest variation in the data.
From Sven Bergmann, UNIL
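A toy version of this analysis might look as follows. It is only a sketch: genotypes are assumed to be coded as minor-allele counts (0/1/2), the data are invented, and the leading component is found by plain power iteration rather than the full eigendecomposition a real PCA package would use.

```python
def first_principal_component(genotypes, iterations=200):
    """Return (loading vector, per-individual scores) for the first principal component.

    genotypes: list of rows (individuals), each a list of 0/1/2 allele counts per SNP.
    """
    n, m = len(genotypes), len(genotypes[0])
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    # SNP-by-SNP scatter matrix X^T X of the centered genotypes.
    scatter = [[sum(r[a] * r[b] for r in centered) for b in range(m)] for a in range(m)]
    vector = [float(j + 1) for j in range(m)]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        nxt = [sum(scatter[a][b] * vector[b] for b in range(m)) for a in range(m)]
        norm = sum(x * x for x in nxt) ** 0.5
        if norm == 0:
            break
        vector = [x / norm for x in nxt]
    scores = [sum(r[j] * vector[j] for j in range(m)) for r in centered]
    return vector, scores


# Invented data: three individuals from each of two groups with different allele frequencies.
genotypes = [
    [2, 2, 0, 0], [2, 1, 0, 0], [1, 2, 0, 1],   # group A
    [0, 0, 2, 2], [0, 1, 2, 2], [0, 0, 1, 2],   # group B
]
loadings, pc1 = first_principal_component(genotypes)
print([round(score, 2) for score in pc1])
```

In this toy data the two groups fall on opposite sides of zero along PC1, which is the kind of separation the slide's figure shows for real populations.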
Population stratification: ethnic groups cluster according to geographic distances
Figure: two PCA plots (PC1 versus PC2). From Sven Bergmann, UNIL
Population stratification: PCA of POPRES cohort
From Sven Bergmann, UNIL
Structural variants (Frazer et al., Nature Rev. Genet., 2010)
A classic that opened the door to structural variant research: Sebat et al. Large-Scale Copy Number Polymorphism in the Human Genome. Science, 2004. Used the ROMA technique to detect copy number variants.
Representational Oligonucleotide Microarray Analysis (ROMA)
1) Genome digestion
2) Adapters are ligated to the sticky ends, followed by PCR amplification
3) PCR yields amplified representations of the entire genome (restriction fragments), in which relative increases, decreases or equal copy number between the two genomes are preserved and become more pronounced
4) Representations of the two different genomes are labeled with different fluorophores and co-hybridized to a microarray with probes specific to restriction site locations across the entire human genome
Structural variants (SVs) (Frazer et al., Nature Rev. Genet., 2010)
Our ability to detect SVs is still very poor (see later)
Structural variants (SVs)
Fosmid-based library sequencing of 8 humans (4 Yoruba and 4 non-African) (Kidd et al., Nature, 2008)
• 1 million fosmid clones/individual
• Both ends of each clone insert are sequenced (~450 bp/sequence), giving a pair of high-quality end sequences (termed an end-sequence pair (ESP))
• Only SVs over 8 kb can be detected
Structural variants (SVs)
Fosmid-based library sequencing of 8 humans (4 Yoruba and 4 non-African) (Kidd et al., Nature, 2008)
Figure: ~2,000 SVs that were experimentally verified; novel sequence (either in gaps (black) or not (orange))
Structural variants (SVs)
Fosmid-based library sequencing of 8 humans (4 Yoruba and 4 non-African) (Kidd et al., Nature, 2008)
• 50% of SVs seen in > 1 individual
• ~50% outside regions previously annotated as SVs: nearly half lay outside regions of the genome previously described as structurally variant
• 525 new insertion sequences
• 20% of all genetic variants = SVs, but they cover > 70% of nucleotide variation
• SVs span between 9 and 25 Mb (~0.5-1% of the genome)
• The majority of SVs are yet to be discovered
Figure: ~2,000 SVs that were experimentally verified; novel sequence (either in gaps (black) or not (orange))
Structural variants (SVs)
Fosmid-based library sequencing of 8 humans (4 Yoruba and 4 non-African) (Kidd et al., Nature, 2008)
Figure: regions of increased SNV density
Structural variants and linkage disequilibrium (McCarroll et al., Nature Genet., 2008)
• Most common, diallelic CNPs (with MAF greater than 5%) were perfectly captured (r2 = 1.0) by at least one SNP tag from HapMap Phase II
• Mean r2 as a function of distance from a polymorphism = indistinguishable for SNPs and diallelic CNPs, i.e. common, diallelic CNPs are ancestral mutations
Common SVs are in LD with tagging SNPs
Common versus rare
“Common disease - common variant hypothesis” versus “common complex traits are the summation of low-frequency, high-penetrance variants”
OR = odds ratio
PAR = population attributable risk = measure of the multifactorial inherited component of a disease
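Both quantities are easy to compute from summary counts. The sketch below is an illustration only: the 2x2 layout and the function names are assumptions, and the PAR is calculated with Levin's formula, p(RR - 1) / (1 + p(RR - 1)), which is one common definition and not necessarily the one behind the slide's figure.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds ratio from a 2x2 table of risk-allele carriers versus non-carriers."""
    if unexposed_cases == 0 or exposed_controls == 0:
        raise ValueError("odds ratio is undefined when an off-diagonal cell is zero")
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)


def population_attributable_risk(exposure_frequency, relative_risk):
    """Levin's formula: fraction of cases attributable to the risk factor (an assumed definition)."""
    excess = exposure_frequency * (relative_risk - 1.0)
    return excess / (1.0 + excess)


# Invented numbers: 30 of 100 cases and 10 of 100 controls carry the risk allele.
print(odds_ratio(30, 70, 10, 90))
# A common allele (20% of the population) with a modest relative risk of 1.5.
print(population_attributable_risk(0.2, 1.5))
```

The second call shows why common variants matter under the common disease - common variant hypothesis: even a modest relative risk yields a non-trivial attributable fraction when the allele is frequent.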
Whole Genome Association studies How significant is this?
Whole genome association studies
Figure axis: P-value
Note: “Genome-wide” is a misnomer
• 20% of common SNPs are not or only partially tagged
• Rare variants are not tagged at all
Whole Genome Association studies: concept
• Scan entire genome - 500,000 SNPs
• Identify local regions of interest, examine genes, SNP density, regulatory regions, etc.
• Replicate the finding
Figure: -log10(p) along the genome, with peaks marked (*). From Sven Bergmann, UNIL
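To make the -log10(p) scan concrete, here is a minimal sketch of a single-SNP test. It assumes a simple allelic chi-square test on a 2x2 table of allele counts (1 degree of freedom) and a Bonferroni correction for the 500,000 SNPs mentioned above; actual studies may use other tests and thresholds, and all counts are invented.

```python
import math


def allelic_chi_square(case_minor, case_major, control_minor, control_major):
    """Pearson chi-square statistic for a 2x2 table of allele counts."""
    a, b, c, d = case_minor, case_major, control_minor, control_major
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("a row or column of the table is empty")
    return n * (a * d - b * c) ** 2 / denominator


def chi_square_p_value_1df(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


chi2 = allelic_chi_square(600, 1400, 500, 1500)   # 1,000 cases and 1,000 controls
p = chi_square_p_value_1df(chi2)
threshold = bonferroni_threshold(0.05, 500_000)
print(f"chi2={chi2:.2f}  p={p:.2e}  -log10(p)={-math.log10(p):.2f}  significant={p < threshold}")
```

In this invented example the nominal p-value is small, yet it does not pass the corrected threshold of 1e-7, which illustrates why replication and large samples are needed.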
Whole Genome Association studies: visualization
Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661-678 (2007).
McCarthy et al., Nature Rev. Genet., 2008
Whole genome association studies: an avalanche of GWA studies
• Since 2006, > 220 studies reported to date
• For over 80 phenotypes, 300 loci have been implicated
• Most implicated loci were identified for the first time (no prior knowledge)
Whole genome association studies: type 2 diabetes (T2D) as an example (Frazer et al., Nat. Rev. Genet., 2010)
• 18 genomic intervals, with 4 containing previously implicated genes
• Major message: the molecular diversity of T2D genes was not anticipated, thus: (patients with the same disease) ≠ (patients with the same underlying biological disorder)
Whole genome association studies: overlap of genetic risk factor loci for common diseases (Frazer et al., Nat. Rev. Genet., 2010)
• 15 loci are associated with two or more diseases (8 are shown)
• Not necessarily the same impact (PTPN22: + for Crohn’s, - for other autoimmune diseases)
• Different diseases may have similar molecular underpinnings
  • Expected: autoimmune diseases (same clinical features)
  • Unexpected: e.g. GCKR in both triglyceride (TGC) levels and autoimmune disease
Whole genome association studies: from association to molecular mechanism
• Very difficult:
  • What are the precise variants associated with a trait?
  • If located in exons: easy, but outside, then what?
  • Most are located outside exons! (e.g. the 9p21 association with myocardial infarction is located 150 kb from the nearest gene!)
  • May have a regulatory function, i.e. control gene expression
• Humans are heterozygous at more functional cis-regulatory sites than at amino acid positions, with 10,700 functional biallelic cis-regulatory polymorphisms in a typical human (Rockman and Wray, Mol. Biol. Evol., 2002: 19, 1991).
• 34% of promoter polymorphisms (170 tested) significantly modulated reporter gene expression (> 1.5-fold) (Hoogendoorn et al., Hum. Mol. Genet., 2003: 12, 2249).
• Case study with the CC chemokine receptor 5 (CCR5), a major chemokine coreceptor of HIV-1 necessary for viral entry into cells:
  • G to A SNP of CCR5 at -2459 nt
  • CCR5 density: low (homozygous -2459GG), intermediate (GA), and highest (homozygous -2459AA) (Salkowitz et al., Clin. Immunol., 2003: 108, 234).
Whole genome association studies: mapping eQTLs
• Transcript abundance = a quantitative trait that can be mapped with considerable power = eQTLs
• Heritability (H2) = genetic variance over total trait variance, with 0 = no genetic effects and 1 = all variance is under genetic control (figure labels: environment, genetics)
Classic paper: Schadt et al., Nature, 2003. Genetics of gene expression surveyed in maize, mouse and man
• Liver tissues from 111 F2 mice (constructed from C57BL/6J and DBA/2J)
• Microarray analysis of 23,574 genes: 7,861 significantly differentially expressed (either in the parental strains or in at least 10% of the F2 mice)
• eQTL identification (log of the odds ratio (LOD) > 4.3 (P-value < 0.00005)) for 2,123 genes
• These eQTLs explained 25% of the transcription variation of the corresponding genes
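Two of the numbers on this slide can be reproduced with a short calculation. The sketch assumes that total trait variance is simply genetic plus environmental variance (no interaction term), and that the LOD score is converted to a pointwise P-value through a likelihood-ratio test with 2 degrees of freedom, as is common for an F2 intercross (additive plus dominance effect). Both are assumptions rather than details taken from the paper; under them, 2 ln(10) LOD follows a chi-square distribution and the P-value reduces to 10^(-LOD).

```python
import math


def broad_sense_heritability(genetic_variance, environmental_variance):
    """H2 = genetic variance / total variance, assuming total = genetic + environmental."""
    total = genetic_variance + environmental_variance
    if total <= 0:
        raise ValueError("total variance must be positive")
    return genetic_variance / total


def lod_to_p_value_2df(lod):
    """Pointwise P-value for a LOD score, assuming a 2-degree-of-freedom chi-square test."""
    chi2 = 2.0 * math.log(10.0) * lod      # likelihood-ratio statistic
    return math.exp(-chi2 / 2.0)           # chi-square survival function for 2 df


print(broad_sense_heritability(1.0, 3.0))   # a quarter of the variance is genetic
print(lod_to_p_value_2df(4.3))              # about 5e-5
```

Under these assumptions a LOD of 4.3 gives a P-value of roughly 5e-5, in line with the threshold quoted on the slide.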
Whole genome association studies: mapping eQTLs (Schadt et al., Nature, 2003)
Figure: % eQTL across 920 evenly spaced bins, each 2 cM wide
• Several hotspots (> 1% of detected eQTLs are located within a 4 cM interval)
• 40% of genes with ≥ 1 eQTL (LOD > 3.0) had more than one eQTL, and close to 4% of such genes had more than three eQTLs
Gene expression = complex trait
Whole genome association studies: mapping eQTLs (Schadt et al., Nature, 2003)
Known polymorphisms between the two parental strains
• Overlap between polymorphism and eQTL = cis-acting transcriptional regulation
For example:
• The C5 gene: a 2 bp deletion in the coding region in DBA mice results in rapid transcript decay compared with B6. A LOD of 27.4 centred over the C5 gene on chromosome 2 is readily detected (black curve).
• The Alad gene is present in 2 copies in DBA
Whole genome association studies: mapping eQTLs (Schadt et al., Nature, 2003)
Combining clinical, gene expression and genetic factors
• Classical QTLs for fat pad mass (FPM): 4 significant loci
• Further analyses with subgroups: additional loci identified
• Some QTLs only affect a subset of the F2 population, demonstrating the complexity underlying traits such as obesity
Whole genome association studies: mapping eQTLs
Dixon et al., Nature Genet., 2007: A genome-wide association study of global gene expression
• 206 families of British descent, using immortalized lymphoblastoid cell lines (LCLs) from 400 children (Affy microarrays; 54,675 transcripts ~ 20,599 genes)
• ~15,000 transcripts with H2 > 0.3
Gene Ontology descriptors for:
• Response to unfolded protein (HSFs, chaperones)
• Immune responses and apoptosis
• Regulation of progression through the cell cycle
• RNA processing and DNA repair
Whole genome association studies: mapping eQTLs
Dixon et al., Nature Genet., 2007: A genome-wide association study of global gene expression
• 206 families of British descent, using immortalized lymphoblastoid cell lines (LCLs) from 400 children (Affy microarrays; 54,675 transcripts ~ 20,599 genes)
• Trans effects are weaker than those in cis
• Nevertheless, significant trans associations were detected, e.g.:
  1) ~700 transcripts with the peak of association on the same chromosome but > 100 kb from the nearest transcribed gene
  2) 10,382 transcripts for which the peak of association was on a different chromosome
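The association step behind an eQTL scan can be sketched as a per-SNP regression of transcript level on genotype, followed by a positional label. This is a simplified illustration, not the family-based analysis of Dixon et al.: genotypes are assumed to be coded as allele dosages (0/1/2), the data are invented, and the 100 kb cis window is a hypothetical parameter loosely borrowed from the distance quoted above.

```python
def eqtl_regression(dosages, expression):
    """Least-squares slope and r2 of expression regressed on allele dosage."""
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    syy = sum((y - mean_y) ** 2 for y in expression)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    if sxx == 0 or syy == 0:
        raise ValueError("genotype and expression must both vary")
    return sxy / sxx, sxy * sxy / (sxx * syy)


def classify_eqtl(snp_chrom, snp_pos, gene_chrom, gene_start, gene_end, cis_window=100_000):
    """Label an association 'cis' or 'trans' using a hypothetical distance window."""
    if snp_chrom != gene_chrom:
        return "trans"
    if gene_start <= snp_pos <= gene_end:
        distance = 0
    else:
        distance = min(abs(snp_pos - gene_start), abs(snp_pos - gene_end))
    return "cis" if distance <= cis_window else "trans"


dosages = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]   # invented transcript levels
slope, r2 = eqtl_regression(dosages, expression)
print(round(slope, 3), round(r2, 3), classify_eqtl("chr5", 1_000, "chr5", 50_000, 60_000))
```

Here each extra copy of the allele raises expression by about one unit, and the SNP sits 49 kb from the gene, so it is labeled cis under the assumed window.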
Whole genome association studies: mapping eQTLs
Using eQTLs to better understand GWAS results (Libioulle et al., PLOS Genet., 2007)
GWAS for Crohn’s disease: association within a 1.25 Mb gene desert
• One of the neighboring genes, PTGER4, may be involved
• Trace eQTLs in LCL data
• Disease-associated polymorphisms may be regulating PTGER4 expression in cis, but > 250 kb away: more research needed, but likely a regulatory polymorphism
Whole genome association studies: mapping eQTLs
We looked at SNPs, but what about structural variants?
Stranger et al., Science, 2007: Relative Impact of Nucleotide and Copy Number Variation on Gene Expression Phenotypes
• LCLs of 210 unrelated HapMap individuals from four populations
• Copy number variants were identified via CGH against a common reference individual
• SNPs and CNVs captured 83.6% and 17.7%, respectively, of the total detected genetic variation in gene expression
• SNPs lie close to their respective genes, less so for CNVs
• Little overlap between SNP and CNV associations (only 20%)
• Not “mere” gene dosage effects
Figure: SNP and CNV panels, distance from probe to associated linked gene
Whole genome association studies: how universal are GWAS findings? (Frazer et al., Nat. Rev. Genet., 2010)
Figure: locus associated with myocardial infarction. Red = high pairwise SNP correlation; SNPs that efficiently (r2 > 0.8) tag one another are connected
• Allele frequencies are different in different populations
• LD patterns across loci that co-segregate with a causally associated variant may be different from population to population (LD is less strong in the African population: bottleneck principle)
• Control for population differences is essential in large studies
Whole genome association studies: impact so far
• No complex traits for which > 10% of the genetic variance is explained, e.g. T2D: 18 genetic variants together explain < 4% of the total trait liability
• Sample size may compensate (increased statistical power). But:
  • studies for lipid phenotypes involving > 40,000 people still explain < 10%
  • some diseases have only a low number of affected individuals
• Does the answer lie in structural variants? Most are still unmapped. But they are likely in LD with common SNPs
• Does the answer lie in rare variants? Possibly:
  • Rare variants are not in LD with tagging SNPs and thus so far undetected (Amish study)
  • Can have very high penetrance
  • However, how to detect on a population-wide basis?
Whole genome association studies: the power of whole-genome sequencing
Miller syndrome: autosomal recessive genetic trait (Roach et al., Science, 2010)
• Sequenced genomes of 2 parents and 2 children, both children affected by Miller syndrome
• Identified 3.7 million SNPs that varied within the family
• Resequenced 34,000 candidate mutations: 28 de novo mutations
• Narrowing down via “rare” assumption and knowledge of recessive inheritance
• Found one gene, dihydroorotate dehydrogenase (DHODH), known to be involved
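The narrowing-down step can be illustrated with a simple family filter. The sketch is hypothetical throughout: the record layout, gene names and 1% frequency cut-off are invented, and it only covers two recessive scenarios (a shared homozygous variant, or two different rare variants in the same gene with one inherited from each parent). It is not the procedure of Roach et al.

```python
def recessive_candidate_genes(variants, max_frequency=0.01):
    """Genes compatible with recessive inheritance in two affected siblings.

    variants: dicts with keys gene, frequency (population frequency) and
    mother, father, child1, child2 (alternate-allele counts 0/1/2).
    """
    by_gene = {}
    for v in variants:
        if v["frequency"] > max_frequency:
            continue                      # "rare" assumption
        if v["child1"] == 0 or v["child2"] == 0:
            continue                      # both affected children must carry the variant
        by_gene.setdefault(v["gene"], []).append(v)

    candidates = []
    for gene, hits in by_gene.items():
        homozygous = any(
            v["child1"] == 2 and v["child2"] == 2 and v["mother"] >= 1 and v["father"] >= 1
            for v in hits
        )
        maternal = any(v["mother"] >= 1 and v["father"] == 0 for v in hits)
        paternal = any(v["father"] >= 1 and v["mother"] == 0 for v in hits)
        if homozygous or (maternal and paternal):
            candidates.append(gene)
    return sorted(candidates)


family_variants = [
    {"gene": "GENE_A", "frequency": 0.001, "mother": 1, "father": 0, "child1": 1, "child2": 1},
    {"gene": "GENE_A", "frequency": 0.002, "mother": 0, "father": 1, "child1": 1, "child2": 1},
    {"gene": "GENE_B", "frequency": 0.2, "mother": 1, "father": 1, "child1": 2, "child2": 2},
    {"gene": "GENE_C", "frequency": 0.001, "mother": 1, "father": 0, "child1": 1, "child2": 0},
]
print(recessive_candidate_genes(family_variants))
```

Only GENE_A survives: GENE_B is too common and GENE_C is missing in one affected child.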
Entering the age of personalized medicine
Toward the elucidation of each person’s genetic make-up
Necessary for:
1) DNA-based risk assessment for common complex disease
2) Drug discovery (new implicated genes can be identified)
But also to:
3) Identify molecular signatures for disease diagnosis and prognosis
And for:
4) A DNA-guided therapy and dose selection
A person’s genetic make-up significantly affects the efficacy of a drug:
• Polymorphisms in the VKORC1 and CYP2C9 genes dictate the effective dose levels of the anti-coagulant Warfarin
• Polymorphisms in the UGT1A1 gene correlate with increased toxicity of the anti-colon cancer drug Irinotecan
• Polymorphisms in the MTHFR gene are associated with increased toxicity of Methotrexate used to treat Crohn’s disease
• Polymorphisms in the CYP2D6 gene dictate the probability of relapse in women with breast cancer treated with Tamoxifen
Entering the age of personalized medicine
The revolution of high-throughput sequencing: Illumina (Metzker et al., Nat. Rev. Genet., 2010)
Solid phase amplification: 1) initial priming and extending of the single-stranded, single-molecule template, and 2) bridge amplification of the immobilized template with immediately adjacent primers to form clusters.
Entering the age of personalized medicine
From sequence to genome: mapping reads (Trapnell and Salzberg, Nat. Biotech., 2009)
Seed-based indexing:
• Four sequences of equal strength = seeds
• If 1 SNP, the other 3 seeds are intact; if 2 SNPs, the other 2 seeds are intact; thus, max 2 SNPs/read
• Limitation: indexing takes up huge memory
Burrows-Wheeler (BW) approach:
• Using BW, the index for the entire human genome fits into < 2 Gb of memory
• Is 30 times faster than indexing
• Is also limited to 2 SNPs within one read
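The seed argument on this slide is a pigeonhole argument, which the following sketch makes explicit. It assumes the read is cut into four consecutive, non-overlapping seeds and compared against an already chosen reference segment of the same length; real mappers index seed combinations across the genome instead, so this only demonstrates the counting.

```python
def split_into_seeds(read, n_seeds=4):
    """Cut a read into n_seeds consecutive pieces; the last piece takes any remainder."""
    size = len(read) // n_seeds
    seeds = [read[i * size:(i + 1) * size] for i in range(n_seeds - 1)]
    seeds.append(read[(n_seeds - 1) * size:])
    return seeds


def count_intact_seeds(read, reference_segment, n_seeds=4):
    """Number of seeds that match the reference segment exactly."""
    if len(read) != len(reference_segment):
        raise ValueError("read and reference segment must have the same length")
    pairs = zip(split_into_seeds(read, n_seeds), split_into_seeds(reference_segment, n_seeds))
    return sum(1 for read_seed, ref_seed in pairs if read_seed == ref_seed)


read = "ACGT" * 8                       # 32-base read, four 8-base seeds
reference = list(read)
reference[3] = "A"                      # mismatch inside seed 1
reference[20] = "C"                     # mismatch inside seed 3
print(count_intact_seeds(read, "".join(reference)))
```

Two mismatches can break at most two seeds, so at least two seeds still match exactly and can be used to locate the read.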
Entering the age of personalized medicine
Burrows-Wheeler transform (figure from Wikipedia)
Easier to compress strings with runs of repeated characters
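For readers who want to see the transform itself, here is a naive textbook version. It sorts all rotations of the input, which is fine for short strings but not how genome-scale indexes are built; the "$" terminator is a conventional choice assumed here.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler transform: last column of the sorted rotation matrix."""
    if terminator in text:
        raise ValueError("text must not contain the terminator character")
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, terminator="$"):
    """Rebuild the original text by repeatedly prepending the transform and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(transformed[i] + table[i] for i in range(len(transformed)))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


print(bwt("banana"))                 # annb$aa
print(inverse_bwt(bwt("GATTACA")))   # GATTACA
```

The output for "banana" places the two n's and two of the a's next to each other, which is the run-forming property that makes the transformed string easier to compress while remaining fully reversible.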
Entering the age of personalized medicine
A first human genome project using HTS (Bentley et al., Nature, 2008)
• Solexa technology
• First: X chromosome
• 204 million reads
• Sampling of sequence fragments is close to random (GC content has a slight effect)
Entering the age of personalized medicine
A first human genome project using HTS (Bentley et al., Nature, 2008)
• 135 Gb of sequence (~4 billion paired 35-base reads) (8 weeks)
• The approximate consumables cost = $250,000
• 97% of the reads were aligned using MAQ
• 99.9% of the human reference covered with ≥ 1 read at 40.6X
• 99% agreement with HapMap results!
Entering the age of personalized medicine More human genome projects Snyder et al., G&D, 2010
Entering the age of personalized medicine
Tackling the SV problem using HTS
• Really difficult and progress is limited.
• Existing methods are based on two approaches:
  • Paired-end mapping (PEM)
  • Depth-of-coverage (DOC) approach
Paired-end library preparation:
• The ends of each fragment are tagged by a biotinylated (B) nucleotide
• Circularization forms a junction between the two ends
• Circularized DNA is randomly fragmented and the biotinylated junction fragments are recovered
• Standard sequencing procedure thereafter
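The logic of paired-end mapping can be sketched as a comparison between the distance at which the two ends map on the reference and the insert size expected from the library. The function name, the 3,000 bp insert size and the 300 bp tolerance below are hypothetical values chosen for illustration; real callers also cluster many supporting pairs before reporting an SV.

```python
def classify_read_pair(mapped_span, expected_insert, tolerance, orientation_ok=True):
    """Interpret one read pair relative to the reference genome.

    mapped_span: distance between the two ends when mapped to the reference.
    expected_insert / tolerance: library insert size and allowed deviation (assumed known).
    orientation_ok: False if the ends map with an unexpected relative orientation.
    """
    if not orientation_ok:
        return "inversion candidate"
    if mapped_span > expected_insert + tolerance:
        # Ends are further apart on the reference: the sample lacks sequence (deletion).
        return "deletion candidate"
    if mapped_span < expected_insert - tolerance:
        # Ends are closer on the reference: the sample carries extra sequence (insertion).
        return "insertion candidate"
    return "concordant"


for span in (3_100, 5_200, 1_900):
    print(span, classify_read_pair(span, expected_insert=3_000, tolerance=300))
print(classify_read_pair(3_000, 3_000, 300, orientation_ok=False))
```

The tolerance sets the smallest event that can be detected, which is why larger inserts with tighter size distributions are preferred for SV discovery.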
Entering the age of personalized medicine
Tackling the SV problem using HTS: paired-end mapping (Medvedev et al., Nature Meth., 2009)
Entering the age of personalized medicine
Tackling the SV problem using HTS: DOC
Snyder et al., G&D, 2010; Campbell et al., Nature Genet., 2008
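A depth-of-coverage call can be sketched by counting reads in fixed windows and comparing each window with the typical depth. The version below is deliberately simple: the median window count stands in for the two-copy baseline, and the 1.25 and 0.75 ratio cut-offs (halfway to the 1.5x and 0.5x expected for three copies and one copy) are assumptions, as are the invented counts. Real methods also correct for GC content and mappability.

```python
from statistics import median


def call_copy_number(window_counts, gain_ratio=1.25, loss_ratio=0.75):
    """Label each window 'gain', 'loss' or 'normal' from its read count."""
    baseline = median(window_counts)
    if baseline == 0:
        raise ValueError("median read count is zero; cannot normalise")
    calls = []
    for count in window_counts:
        ratio = count / baseline
        if ratio >= gain_ratio:
            calls.append("gain")
        elif ratio <= loss_ratio:
            calls.append("loss")
        else:
            calls.append("normal")
    return calls


# Invented read counts for nine consecutive windows.
counts = [100, 98, 102, 150, 155, 101, 49, 52, 100]
print(call_copy_number(counts))
```

Windows four and five show about 1.5 times the typical depth (a candidate duplication), while windows seven and eight show about half (a candidate heterozygous deletion).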
Entering the age of personalized medicine
Tackling the SV problem using HTS: state-of-the-art (Snyder et al., G&D, 2010)