Human genetic variation and its contribution to complex traits
1. Deplancke Lab
Monica Albarca
Jean-Daniel Feuz
Carine Gubelmann
Korneel Hens
Alina Isakova
Irina Krier
Andreas Massouras
Sunil Raghav
Jovan Simicevic
Sebastian Waszak
Wiebke Westhall
You?
deplanckelab.epfl.ch
2. Laboratory of Systems Biology and Genetics
Bart Deplancke (bart.deplancke@epfl.ch)
Human genetic variation and its contribution to complex traits
26 June 2000
3. The human genome
First announcement
In June 2000: first announcement of a working draft (haplotype!), with the Nature and Science papers in February 2001
James Kent (UCSC), Eugene Myers (Celera)
International Human Genome Sequencing Consortium (2001) Nature 409:860-921; Venter et al. (2001) Science 291:1304-1351.
In June 2001: finished chromosome 20, with others following until finishing of chromosome 1 in May 2006
Gregory et al. (2006), Nature, 441, 315-321
5. Classes of human genetic variation
Common versus rare
Refers to the frequency of the minor allele in the human population:
• Common variants = minor allele frequency (MAF) > 1% in the population. Also described as polymorphisms.
• Rare variants = MAF < 1%
Neutrality:
• The vast majority of genetic variants are likely neutral = no contribution to phenotypic variation.
• Some neutral variants may reach significant frequencies, but only by chance.
Two different nucleotide composition classes:
• Single nucleotide variants
• Structural variants
7. How are SNPs detected?
High-density oligonucleotide arrays
Chee et al., Science, 1996
Simple 5′ to 3′ read-out
Flanking issues
Unique oligonucleotide primers to generate minimally overlapping long-range PCR products of 10-kb average length
8. How are SNPs detected?
Other strategies
Reduced representation shotgun sequencing followed by genomic alignment
Clustered alignment
Gene-centric studies
Reference sequence
From Rothberg et al. Nature Biotech, 2001
9. The SNP database - dbSNP
http://www.ncbi.nlm.nih.gov/projects/SNP/
Three "out of Africa" genomes:
• 1.2 million (67%) (all three), 1.7 million (52%) (any two), 1.0 million (30%) unique
• Overall, 5.2 million SNPs in the three genomes, the majority being present in dbSNP
• Data indicate that most SNVs are common rather than rare
10. Single nucleotide variants
• The human genome is estimated to contain > 11 million SNPs (~7 million with MAF > 5%, the rest between 1 and 5%).
• It is unknown how many rare or even novel ("de novo") SNVs exist.
• SNP alleles in the same genomic interval are often correlated with one another → "linkage disequilibrium (LD)" = nonrandom association of alleles; it varies in a complex and unpredictable manner across the genome and between different populations.
• International HapMap Project → can we divide the genome into groups of highly correlated SNPs that are generally inherited together = "LD bins"?
Number of tag SNPs required to capture common Phase II SNPs
11. Single nucleotide variants
Recap
• International HapMap Project → can we divide the genome into groups of highly correlated SNPs that are generally inherited together = "LD bins"?
Number of tag SNPs required to capture common Phase II SNPs
Based on genotyping over 3.1 million SNPs in 270 individuals from 4 geographically diverse populations (Frazer et al., Nature, 2007)
Pairwise linkage disequilibrium (LD) r² (if 1 → SNPs statistically indistinguishable)
By genotyping the DNA sample of an individual with a "tagging" SNP from each LD bin, knowledge regarding 80% of SNPs with a MAF > 5% across the genome is gained.
(Frazer et al., Nature Rev. Genet., 2010)
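The pairwise r² statistic can be sketched in a few lines of Python. This is a minimal illustration that assumes phased haplotypes with alleles coded 0/1; the encoding and the function name are hypothetical and chosen only for this example.

```python
def ld_r2(haplotypes):
    """Pairwise LD r^2 between two biallelic SNPs.

    `haplotypes` is a list of (a, b) tuples, one per phased chromosome,
    with alleles coded 0/1 (hypothetical encoding).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n     # frequency of allele 1 at SNP A
    p_b = sum(b for _, b in haplotypes) / n     # frequency of allele 1 at SNP B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                        # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    return d * d / denom

# Two SNPs that always travel together: statistically indistinguishable.
perfect = [(0, 0), (0, 0), (1, 1), (1, 1)]
# Two SNPs whose alleles combine freely: no correlation.
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(ld_r2(perfect), ld_r2(independent))
```

When r² = 1 for a pair, genotyping one SNP reveals the other, which is the idea behind choosing a single tagging SNP per LD bin.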
13. Population Stratification
Subdivision of a population into different ethnic groups with potentially different marker allele frequencies and thus different disease prevalence
From Sven Bergmann, UNIL
Principal Component Analysis reveals SNP-vectors explaining largest variation in the data
16. Structural variants
(Frazer et al., Nature Rev. Genetic., 2010)
A classic that opened the door to structural variant research:
Sebat et al. Large-Scale Copy Number Polymorphism in the Human Genome. Science, 2004.
Used the ROMA technique to detect copy number variants
17. Representational Oligonucleotide Microarray Analysis (ROMA)
1) Genome digestion
2) Adapters to sticky ends and PCR amplification
3) After PCR, representations of the entire genome (restriction fragments) are amplified to pronounce relative increases, decreases or preserve equal copy number in the two genomes.
4) Representations of the two different genomes are labeled with different fluorophores and co-hybridized to a microarray with probes specific to restriction site locations across the entire human genome.
19. Structural variants (SVs)
(Frazer et al., Nature Rev. Genetic., 2010)
Our ability to detect SVs is still very poor (see later)
20. Structural variants (SVs)
Fosmid-based library sequencing of 8 humans (4 Yoruba and 4 non-African) (Kidd et al., Nature, 2008)
• 1 million fosmid clones/individual
• Both ends of each clone insert sequenced (~450 bp/sequence) → a pair of high-quality end sequences, termed an end-sequence pair (ESP)
Only SVs over 8 kb can be detected
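The logic of calling SVs from end-sequence pairs can be sketched as follows. This is an illustration, not the published pipeline: the ~40 kb expected insert size is a typical fosmid value assumed here, the 8 kb tolerance simply mirrors the detection limit quoted above, and the function name is hypothetical. Orientation-based signals (e.g. inversions) are ignored.

```python
def classify_esp(left_end, right_end, mean_insert=40_000, tolerance=8_000):
    """Classify one end-sequence pair by its mapped span on the reference.

    `left_end` and `right_end` are reference coordinates of the two clone ends.
    `mean_insert` and `tolerance` are assumed values for illustration.
    """
    span = right_end - left_end
    if span > mean_insert + tolerance:
        # Ends map too far apart: the sample lacks reference sequence.
        return "deletion candidate"
    if span < mean_insert - tolerance:
        # Ends map too close together: the sample carries extra sequence.
        return "insertion candidate"
    return "concordant"

for pair in [(100_000, 140_000), (100_000, 155_000), (100_000, 125_000)]:
    print(pair, classify_esp(*pair))
```

Because the clone insert size varies from clone to clone, size changes smaller than that spread cannot be distinguished from noise, which is why only large SVs are detectable with this design.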
21. Structural variants (SVs)
Fosmid-based library sequencing of 8 humans (4 Yoruba and 4 non-African) (Kidd et al., Nature, 2008)
~2,000 SVs that were experimentally verified
Novel sequence (either in gaps (black) or not (orange))
22. Structural variants (SVs)
Fosmid-based library sequencing of 8 humans (4 Yoruba and 4 non-African) (Kidd et al., Nature, 2008)
• 50% of SVs seen in >1 individual
• ~50% lay outside regions of the genome previously described as structurally variant
• 525 new insertion sequences
• 20% of all genetic variants = SVs, but they cover >70% of nucleotide variation
• SVs → between 9 and 25 Mb (~0.5-1% of the genome)
• The majority of SVs are yet to be discovered
~2,000 SVs that were experimentally verified
Novel sequence (either in gaps (black) or not (orange))
23. Structural variants (SVs)
Fosmid-based library sequencing of 8 humans (4 Yoruba and 4 non-African) (Kidd et al., Nature, 2008)
Regions of increased SNV density
24. Structural variants and linkage disequilibrium
McCarroll et al., Nature Genet., 2008
• Most common, diallelic CNPs (with MAF greater than 5%) were perfectly captured (r² = 1.0) by at least one SNP tag from HapMap Phase II
• Mean r² as a function of distance from a polymorphism = indistinguishable for SNPs and diallelic CNPs → common, diallelic CNPs are ancestral mutations
Common SVs are in LD with tagging SNPs
26. Common versus rare
"Common disease - common variant hypothesis"
versus
Common complex traits are the summation of low-frequency, high-penetrance variants
OR = odds ratio
PAR = population attributable risk = measure of the multifactorial inherited component of a disease
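As a worked example of the odds ratio, the sketch below computes it from a 2x2 table of carriers and non-carriers among cases and controls. The counts are invented for illustration and the function name is hypothetical.

```python
def odds_ratio(case_carriers, case_noncarriers, control_carriers, control_noncarriers):
    """Odds of carrying the variant among cases divided by the odds among controls."""
    if case_noncarriers == 0 or control_carriers == 0:
        raise ValueError("odds ratio is undefined when a denominator cell is zero")
    return (case_carriers * control_noncarriers) / (case_noncarriers * control_carriers)

# Invented counts: 30 of 100 cases and 20 of 100 controls carry the variant.
or_value = odds_ratio(30, 70, 20, 80)
print(round(or_value, 2))
```

An OR of 1 means the variant is equally frequent in cases and controls; values above 1 indicate enrichment in cases.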
28. Whole genome association studies
P-value
Note: "Genome-wide" is a misnomer
• 20% of common SNPs not or only partially tagged
• Rare variants not tagged at all
29. Whole Genome Association studies
Concept
Scan entire genome (500,000 SNPs); plot -log10(p) for each SNP
Identify local regions of interest; examine genes, SNP density, regulatory regions, etc.
Replicate the finding
From Sven Bergmann, UNIL
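The scanning step can be sketched as a per-SNP case/control test followed by a multiple-testing threshold. The sketch below assumes an allelic chi-square test with 1 degree of freedom and a Bonferroni correction for 500,000 tests; both choices, the SNP names, and the allele counts are assumptions made for illustration.

```python
import math

def allelic_chi2(case_a, case_b, ctrl_a, ctrl_b):
    """Pearson chi-square (1 df) on a 2x2 table of allele counts."""
    n = case_a + case_b + ctrl_a + ctrl_b
    num = n * (case_a * ctrl_b - case_b * ctrl_a) ** 2
    den = (case_a + case_b) * (ctrl_a + ctrl_b) * (case_a + ctrl_a) * (case_b + ctrl_b)
    return num / den

def chi2_1df_pvalue(x):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))

def scan(snps, n_tests=500_000, alpha=0.05):
    """Return (name, -log10(p), passes Bonferroni threshold) for each SNP."""
    threshold = alpha / n_tests
    results = []
    for name, counts in snps.items():
        p = chi2_1df_pvalue(allelic_chi2(*counts))
        score = -math.log10(p) if p > 0 else float("inf")
        results.append((name, score, p < threshold))
    return results

# Hypothetical allele counts: (case allele A, case allele B, control A, control B)
example = {
    "snp_hypothetical_1": (600, 1400, 400, 1600),
    "snp_hypothetical_2": (500, 1500, 490, 1510),
}
results = scan(example)
for name, score, significant in results:
    print(name, round(score, 1), significant)
```

Plotting the -log10(p) scores against genomic position gives the familiar Manhattan-style view, and any SNP passing the threshold still needs replication in an independent cohort.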
30. Whole Genome Association studies
Visualization
Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661-678 (2007).
McCarthy et al., Nature Rev. Genet., 2008
32. Whole genome association studies
An avalanche of GWA studies
• Since 2006 → >220 studies reported to date
• For over 80 phenotypes → 300 loci have been implicated
• Most implicated loci were identified for the first time (no prior knowledge)
33. Whole genome association studies
Type 2 diabetes: an example
Frazer et al., Nat. Rev. Genet., 2010
• 18 genomic intervals, 4 of which contain previously implicated genes
• Major message: the molecular diversity of T2D genes was not anticipated, thus:
(Patients with the same disease) ≠ (Patients with the same underlying biological disorder)
34. Whole genome association studies
Overlap of genetic risk factor loci for common diseases
Frazer et al., Nat. Rev. Genet., 2010
• 15 loci are associated with two or more diseases (8 are shown)
• Not necessarily the same impact (PTPN22: + for Crohn's, - for other autoimmune (ai) diseases)
• Different diseases may have similar molecular underpinnings
• Expected: ai diseases (same clinical features)
• Unexpected: e.g. GCKR in both TGC levels and ai disease
35. Whole genome association studies
From association to molecular mechanism
• Very difficult:
• What are the precise variants associated with a trait?
• If located in exons: easy, but outside, then what?
• Most are located outside exons!
(e.g. 9p21 <-> myocardial infarction is located 150 kb from the nearest gene!)
• May have a regulatory function, i.e. control gene expression
A → G
• Humans are heterozygous at more functional cis-regulatory sites than at amino acid positions, with 10,700 functional biallelic cis-regulatory polymorphisms in a typical human (Rockman and Wray. Mol. Biol. Evol., 2002: 19, 1991).
• 34% of promoter polymorphisms (170 tested) significantly modulated reporter gene expression (>1.5-fold) (Hoogendoorn et al., Hum. Mol. Genet., 2003: 12, 2249).
• Case study with the CC chemokine receptor 5 (CCR5), a major chemokine coreceptor of HIV-1 necessary for viral entry into cells
• G to A SNP of CCR5 at -2459 nt
• CCR5 density: low (homozygous GG), intermediate (GA), and highest (homozygous -2459AA) (Salkowitz et al., Clin. Immunol., 2003: 108, 234).
36. Whole genome association studies
Mapping eQTLs
• Transcript abundance = a quantitative trait that can be mapped with considerable power = eQTLs
Trait variance = Environment + Genetics
Heritability (H²) = genetic variance over total trait variance, with 0 = no genetic effects and 1 = all variance is under genetic control
Classic paper: Schadt et al., Nature, 2003
Genetics of gene expression surveyed in maize, mouse and man
• Liver tissues from 111 F2 mice constructed from C57BL/6J and DBA/2J
• Microarray analysis of 23,574 genes: 7,861 significantly differentially expressed (either in the parental strains or in at least 10% of the F2 mice)
• eQTL identification (logarithm of the odds (LOD) score > 4.3 (P-value < 0.00005)) for 2,123 genes
• These eQTLs explained 25% of the transcription variation of the corresponding genes
37. Whole genome association studies
Mapping eQTLs
Schadt et al., Nature, 2003
% eQTL across 920 evenly spaced bins, each 2 cM wide
• Several hotspots (>1% of detected eQTLs are located within a 4 cM interval)
• 40% of genes with ≥ 1 eQTL (LOD > 3.0) had more than one eQTL, and close to 4% of such genes had more than three eQTL
→ Gene expression = complex trait
38. Whole genome association studies
Mapping eQTLs
Schadt et al., Nature, 2003
Known polymorphisms between the two parental strains
• Overlap between polymorphism and eQTL = cis-acting transcriptional regulation
For example:
• The C5 gene → 2 bp deletion in the coding region in DBA mice, resulting in rapid transcript decay compared with B6. A LOD of 27.4 centred over the C5 gene on chromosome 2 is readily detected (black curve).
• The Alad gene is present in 2 copies in DBA
39. Whole genome association studies
Mapping eQTLs
Schadt et al., Nature, 2003
Combining clinical, gene expression and genetic factors
• Classical QTLs for fat pad mass (FPM): 4 significant loci
• Further analyses with subgroups: additional loci identified
• Some QTLs only affect a subset of the F2 population, demonstrating the complexity underlying traits such as obesity
40. Whole genome association studies
Mapping eQTLs
Dixon et al., Nature Genet., 2007: A genome-wide association study of global gene expression
• 206 families of British descent using immortalized lymphoblastoid cell lines (LCLs) from 400 children (Affy microarrays; 54,675 transcripts ~ 20,599 genes)
~15,000 transcripts → H² > 0.3
Gene Ontology descriptors for:
• Response to unfolded protein (HSFs, chaperones)
• Immune responses and apoptosis
• Regulation of progression through the cell cycle
• RNA processing and DNA repair
41. Whole genome association studies
Mapping eQTLs
Dixon et al., Nature Genet., 2007: A genome-wide association study of global gene expression
• 206 families of British descent using immortalized lymphoblastoid cell lines (LCLs) from 400 children (Affy microarrays; 54,675 transcripts ~ 20,599 genes)
• Trans effects are weaker than those in cis
• Nevertheless, significant trans associations were detected, e.g.:
1) ~700 transcripts with the peak of association on the same chromosome but >100 kb from the nearest transcribed gene
2) 10,382 transcripts for which the peak of association was on a different chromosome
42. Whole genome association studies
Mapping eQTLs
Using eQTLs to better understand GWAS results
Libioulle et al., PLOS Genet., 2007
GWAS for Crohn's disease
1.25 Mb gene desert
• One of the neighboring genes, PTGER4, may be involved
• Trace eQTLs in LCL data
• Disease-associated polymorphisms may be regulating PTGER4 expression in cis, but >250 kb away → more research needed, but likely a regulatory polymorphism
43. Whole genome association studies
Mapping eQTLs
We looked at SNPs, but what about structural variants?
Stranger et al., Science, 2007: Relative Impact of Nucleotide and Copy Number Variation on Gene Expression Phenotypes
• LCLs of 210 unrelated HapMap individuals from four populations
• Copy number variants were identified via CGH against a common reference individual
Figure panels: SNP and CNV (distance from probe associated with linked gene)
• SNPs and CNVs captured 83.6% and 17.7%, respectively, of the total detected genetic variation in gene expression
• SNPs lie close to their respective genes, less so for CNVs
• Little overlap between SNP and CNV associations (only 20%)
• Not "mere" gene dosage effects
44. Whole genome association studies
How universal are GWAS findings?
Frazer et al., Nat. Rev. Genet., 2010
Associated with myocardial infarction
• Allele frequencies are different in different populations
• LD patterns across loci that co-segregate with a causally associated variant may be different from population to population
• Control for population differences is essential in large studies
LD less strong in African population → bottleneck principle
Red = high pairwise SNP correlation; SNPs that efficiently (r² > 0.8) tag one another are connected
45. Whole genome association studies
Impact so far
• No complex traits for which > 10% of the genetic variance is explained
e.g. T2D: 18 genetic variants together < 4% of the total trait liability
• Sample size may compensate (increased statistical power)
But… studies for lipid phenotypes involving >40,000 people still <10%
… some diseases have only a low number of affected individuals
• Does the answer lie in structural variants? Most are still unmapped
But… they are likely in LD with common SNPs
• Does the answer lie in rare variants?
Possibly…
• Rare variants are not in LD with tagging SNPs and thus so far undetected (Amish study)
• Can have very high penetrance
• However, how to detect on a population-wide basis?
46. Whole genome association studies
The power of whole-genome sequencing
Miller syndrome: autosomal recessive genetic trait (Roach et al., Science, 2010)
• Sequenced genomes of 2 parents and 2 children, both children affected by Miller syndrome
• Identified 3.7 million SNPs that varied within the family
• Resequenced 34,000 candidate mutations → 28 de novo mutations
• Narrowing down via "rare" assumption and knowledge of recessive inheritance
• Found one gene, dihydroorotate dehydrogenase (DHODH), known to be involved
47. Entering the age of personalized medicine
Toward the elucidation of each person's genetic make-up
Necessary for:
1) DNA-based risk assessment for common complex disease
2) Drug discovery (new implicated genes can be identified)
But also to:
3) Identify molecular signatures for disease diagnosis and prognosis
And for:
4) A DNA-guided therapy and dose selection
A person's genetic make-up significantly affects the efficacy of a drug
• Polymorphisms in the VKORC1 and CYP2C9 genes dictate the effective dose levels of the anti-coagulant Warfarin
• Polymorphisms in the UGT1A1 gene correlate with increased toxicity of the anti-colon cancer drug Irinotecan
• Polymorphisms in the MTHFR gene are associated with increased toxicity of Methotrexate used to treat Crohn's disease
• Polymorphisms in the CYP2D6 gene dictate the probability of relapse in women with breast cancer treated with Tamoxifen
48. Entering the age of personalized medicine
The revolution of high-throughput sequencing: Illumina
Metzker et al., Nat. Rev. Genet., 2010
Solid phase amplification: 1) initial priming and extending of the single-stranded, single-molecule template, and 2) bridge amplification of the immobilized template with immediately adjacent primers to form clusters.
49. Entering the age of personalized medicine
From sequence to genome: mapping reads
Trapnell and Salzberg, Nat. Biotech., 2009
Four sequences of equal length = seeds
If 1 SNP, the other 3 seeds intact;
If 2 SNPs, the other 2 seeds intact;
Thus, max 2 SNPs/read
Limitation: indexing takes up huge memory
Using BW, the index for the entire human genome fits into < 2 Gb of memory
It is 30 times faster than indexing
It is also limited to 2 SNPs within one read
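The seed argument is a pigeonhole principle: if a read is cut into four seeds and carries at most two mismatches, at least two seeds still match the reference exactly. The minimal sketch below illustrates this with toy sequences; the function names and the 16-base sequences are invented for the example and do not reproduce any particular aligner.

```python
def split_into_seeds(read, n_seeds=4):
    """Cut a read into `n_seeds` consecutive pieces; return (offset, seed) pairs."""
    size = len(read) // n_seeds
    seeds = []
    for i in range(n_seeds):
        start = i * size
        end = (i + 1) * size if i < n_seeds - 1 else len(read)
        seeds.append((start, read[start:end]))
    return seeds

def intact_seeds(read, reference_window, n_seeds=4):
    """Count seeds that match the reference exactly at the same offset."""
    return sum(
        1
        for start, seed in split_into_seeds(read, n_seeds)
        if reference_window[start:start + len(seed)] == seed
    )

reference = "ACGTACGTTGCAAGCT"
one_snp = "ACGAACGTTGCAAGCT"    # mismatch in the first seed
two_snps = "ACGAACGTTGCTAGCT"   # mismatches in the first and third seeds
print(intact_seeds(one_snp, reference), intact_seeds(two_snps, reference))
```

An aligner built on this idea only needs exact lookups of seed pairs in an index, then verifies the full read at each candidate position.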
50. Entering the age of personalized medicine
Burrows-Wheeler transform
Wikipedia
Easier to compress strings with runs of repeated characters
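A naive version of the transform and its inverse can be written in a few lines. This sketch sorts all rotations explicitly, which is fine for short strings but far too slow and memory-hungry for a genome; real aligners build the transform via suffix arrays. The `$` sentinel is assumed to sort before every other character, which holds for ASCII letters.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform via explicit sorting of all rotations."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(last_column, sentinel="$"):
    """Rebuild the original text by repeatedly prepending and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(c + row for c, row in zip(last_column, table))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]

transformed = bwt("banana")
print(transformed)               # annb$aa
print(inverse_bwt(transformed))  # banana
```

The transform groups identical characters into runs ("nn", "aa"), which is what makes the output compress well, and it is fully reversible.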
51. Entering the age of personalized medicine
A first human genome project using HTS
Bentley et al., Nature, 2008
• Solexa Technology
• First: X-chromosome
• 204 million reads
• Sampling of sequence fragments is close to random (GC content slight effect)
52. Entering the age of personalized medicine
A first human genome project using HTS
Bentley et al., Nature, 2008
• 135 Gb of sequence (~4 billion paired 35-base reads) (8 weeks)
• The approximate consumables cost = $250,000
• 97% of the reads were aligned using MAQ
• 99.9% of the human reference covered with ≥ 1 read at 40.6X
99% agreement with HapMap results!
53. Entering the age of personalized medicine
More human genome projects
Snyder et al., G&D, 2010
56. Entering the age of personalized medicine
Tackling the SV problem using HTS
• Really difficult and progress is limited.
• Existing methods are based on two approaches:
• Paired-end mapping (PEM)
• Depth-of-coverage (DOC) approach
• The ends of each fragment are tagged by a biotinylated (B) nucleotide
• Circularization forms a junction between the two ends
• Circularized DNA is randomly fragmented and the biotinylated junction fragments are recovered
• Standard sequencing procedure thereafter
57. Entering the age of personalized medicine
Tackling the SV problem using HTS: paired-end mapping
Medvedev et al., Nature Meth., 2009
58. Entering the age of personalized medicine
Tackling the SV problem using HTS: DOC
Snyder et al., G&D, 2010; Campbell et al., Nature Genet., 2008
59. Entering the age of personalized medicine
Tackling the SV problem using HTS: state-of-the-art
Snyder et al., G&D, 2010