Human genetic variation and its contribution to complex traits

Deplancke Lab
Monica Albarca
Jean-Daniel Feuz
Carine Gubelmann
Korneel Hens
Alina Isakova
Irina Krier
Andreas Massouras
Sunil Raghav
Jovan Simicevic
Sebastian Waszak
Wiebke Westhall
You?

Laboratory of Systems Biology and Genetics
Bart Deplancke (bart.deplancke@epfl.ch)
deplanckelab.epfl.ch

26 June 2000

The human genome
First announcement
In June 2000: first announcement of a working draft (haplotype!)
with the Nature and Science papers in February 2001

James Kent (UCSC); Eugene Myers (Celera)
International Human Genome Sequencing Consortium (2001) Nature 409:860-921;
Venter et al. (2001) Science 291:1304-1351.

In June 2001: finished chromosome 20, with others following
until finishing of chromosome 1 in May 2006

Gregory et al. (2006), Nature, 441, 315-321

Why are we so phenotypically different?

Classes of human genetic variation
Common versus rare
Refers to the frequency of the minor allele in the human population:
• Common variants = minor allele frequency (MAF) >1% in the
population. Also described as polymorphisms.
• Rare variants = MAF < 1%
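
The common/rare split above can be sketched in a few lines. This is a minimal illustration, assuming a biallelic site and a list of observed alleles; the function names are hypothetical, and treating a MAF of exactly 1% as rare is an arbitrary choice because the slide leaves that boundary open.

```python
from collections import Counter


def minor_allele_frequency(alleles):
    """Return the frequency of the less common allele at a biallelic site."""
    counts = Counter(alleles)
    if len(counts) == 1:
        return 0.0  # monomorphic site: no minor allele observed
    if len(counts) != 2:
        raise ValueError("expected a biallelic site")
    return min(counts.values()) / len(alleles)


def classify_variant(maf, threshold=0.01):
    """Label a variant 'common' (MAF > 1%) or 'rare' (otherwise)."""
    return "common" if maf > threshold else "rare"


# Ten sampled chromosomes, one of which carries the G allele.
maf = minor_allele_frequency(list("AAAAAAAAAG"))
print(maf, classify_variant(maf))
```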

Neutrality:
• The vast majority of genetic variants are likely neutral = no
contribution to phenotypic variation.
• Some may reach significant frequencies, but this happens by chance.

Two different nucleotide composition classes:
• Single nucleotide variants
• Structural variants

Single nucleotide variants
T/G T/G A/C

ATTGCAATCCGTGG...ATCGAGCCA…TACGATTGCACGCCG…

ATTGCAAGCCGTGG...ATCTAGCCA…TACGATTGCAAGCCG…


ATTGCAATCCGTGG...ATCGAGCCA…TACGATTGCACGCCG…
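
As a minimal sketch of what the sequences above illustrate, single nucleotide differences between two already-aligned sequences of equal length can be listed position by position. The function name is hypothetical, and the sketch assumes no insertions or deletions (those would need a real alignment step).

```python
def find_snvs(seq_a, seq_b):
    """List (0-based position, base in seq_a, base in seq_b) for each mismatch."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned and of equal length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# First two fragments from the slide.
print(find_snvs("ATTGCAATCCGTGG", "ATTGCAAGCCGTGG"))
print(find_snvs("ATCGAGCCA", "ATCTAGCCA"))
```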


How are SNPs detected?
High-density oligonucleotide arrays
Chee et al., Science, 1996
Simple 5’ to 3’ read-out

Flanking issues

Unique oligonucleotide primers to
generate minimally overlapping
long-range PCR products of 10-kb average
length

How are SNPs detected?
Other strategies
Clustered alignment

Reduced representation shotgun sequencing
followed by genomic alignment

Gene-centric studies

Reference sequence

From Rothberg et al. Nature Biotech, 2001

The SNP database - dbSNP
http://www.ncbi.nlm.nih.gov/projects/SNP/


Three “out of Africa” genomes:
• 1.2 million (67%) (all three), 1.7 million (52%) (any two), 1.0 million (30%) unique
• Overall, 5.2 million SNPs in the three genomes, the majority being present in dbSNP
• Data indicate that most SNVs are common rather than rare

• Estimated that the human genome contains > 11 million SNPs
(~7 million with MAF > 5%, rest between 1-5%).
• It is unknown how many rare or even novel (“de novo”) SNVs exist

• SNP alleles in the same genomic interval are often correlated with
one another → “Linkage disequilibrium (LD)” = nonrandom
association of alleles; it varies in a complex and unpredictable manner
across the genome and between different populations.

• International HapMap Project → can we divide the genome into
groups of highly correlated SNPs that are generally inherited
together = “LD bins”?
Number of tag SNPs required to capture common Phase II SNPs

Recap
• International HapMap Project → can we divide the genome into
groups of highly correlated SNPs that are generally inherited
together = “LD bins”?
Number of tag SNPs required to capture common Phase II SNPs
Pairwise linkage disequilibrium (LD) r2 (if 1 → SNPs statistically indistinguishable)
Based on genotyping over 3.1 million SNPs in 270 individuals from 4
geographically diverse populations (Frazer et al., Nature, 2007)

By genotyping the DNA sample of an individual with a “tagging”
SNP from each LD bin, knowledge regarding 80% of SNPs with a
MAF > 5% across the genome is gained.
(Frazer et al., Nature Rev. Genet., 2010)
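
The pairwise LD measure r2 used for tagging can be sketched as follows. This is a minimal illustration, assuming phased haplotypes with alleles coded 0/1 at two SNPs; the function name and the toy haplotypes are hypothetical. It uses the standard definition r2 = D^2 / (pA(1-pA) pB(1-pB)) with D = pAB - pA*pB.

```python
def ld_r2(haplotypes):
    """Compute r2 between two biallelic SNPs.

    haplotypes: list of (allele_at_snp1, allele_at_snp2), alleles coded 0 or 1.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # frequency of allele 1 at SNP 1
    p_b = sum(b for _, b in haplotypes) / n   # frequency of allele 1 at SNP 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    return d * d / denom


# Perfectly correlated SNPs: one tags the other (r2 = 1).
print(ld_r2([(0, 0), (0, 0), (1, 1), (1, 1)]))
# Independent SNPs: genotyping one says nothing about the other (r2 = 0).
print(ld_r2([(0, 0), (0, 1), (1, 0), (1, 1)]))
```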

Querying human genetic variation
Scan Entire Genome
- 500,000 SNPs

Population Stratification
Subdivision of a population into different ethnic groups with
potentially different marker allele frequencies and thus different
disease prevalence
From Sven Bergmann, UNIL

Principal Component Analysis reveals SNP-vectors
explaining largest variation in the data

Ethnic groups cluster according to geographic distances
(Figure: two scatter plots of PC2 against PC1. From Sven Bergmann, UNIL)
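
How a first principal component separates subpopulations can be sketched on a toy genotype matrix. This is an illustration only, assuming genotypes coded as allele counts (0/1/2) in an individuals-by-SNPs matrix; the function name and the use of power iteration are choices made here for a dependency-free example, not the method used for the figures on these slides.

```python
def first_principal_component(genotypes, iterations=200):
    """Return unit-length PC1 coordinates, one per individual (sign is arbitrary)."""
    n = len(genotypes)
    m = len(genotypes[0])
    # Centre each SNP (column) so that PCA captures variation around the mean.
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centred = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    # Individual-by-individual similarity (Gram) matrix.
    gram = [[sum(a * b for a, b in zip(centred[i], centred[k])) for k in range(n)]
            for i in range(n)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [float(i + 1) for i in range(n)]  # arbitrary non-constant start vector
    for _ in range(iterations):
        w = [sum(gram[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            raise ValueError("no variation in the genotype matrix")
        v = [x / norm for x in w]
    return v


# Three individuals from each of two hypothetical populations, five SNPs.
toy = [[0, 0, 1, 2, 0], [0, 1, 0, 2, 0], [0, 0, 0, 2, 1],
       [2, 2, 1, 0, 1], [2, 1, 2, 0, 0], [2, 2, 2, 0, 0]]
print(first_principal_component(toy))
```

In this toy data the first three individuals receive PC1 coordinates of one sign and the last three of the opposite sign, which is the clustering effect the plots show.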

PCA of POPRES cohort

From Sven Bergmann, UNIL

Structural variants


A classic that opened the door to structural variant research:
Sebat et al. Large-Scale Copy Number Polymorphism in the Human Genome. Science, 2004.

Used ROMA technique to detect copy number variants

Representational Oligonucleotide Microarray Analysis (ROMA)

1) Genome digestion
2) Adapters to sticky ends and
PCR amplification
3) After PCR, representations of
the entire genome (restriction
fragments) are amplified to
pronounce relative increases,
decreases or preserve equal
copy number in the two
genomes.
4) Representations of the two
different genomes are labeled
with different fluorophores
and co-hybridized to a
microarray with probes
specific to restriction site
locations across the entire
human genome.

Representational Oligonucleotide Microarray Analysis (ROMA)

On average, individuals (20
tested) differed by 11 CNPs
(average length = 465 kb)
affecting 70 genes.

Structural variants (SVs)

Our ability to detect SVs is still very poor (see later)

Fosmid-based library sequencing of 8 humans (4 Yoruba and 4 non-African)
(Kidd et al., Nature, 2008)

• 1 million fosmid clones/individual
• Both ends of each clone insert sequenced (~450 bp/sequence)
→ a pair of high-quality end sequences, termed an end-sequence pair (ESP)
• Only SVs over 8 kb can be detected

Fosmid-based library sequencing of 8 humans (4 Yoruba
and 4 non-African) (Kidd et al., Nature, 2008)

• 50% of SVs seen in >1 individual
• Nearly half lay outside regions of the genome previously described as structurally variant
• 525 new insertion sequences
• 20% of all genetic variants = SVs, but they cover >70% of nucleotide variation
• SVs → between 9 and 25 Mb (~0.5-1% of the genome)
• The majority of SVs are yet to be discovered

(Figure: ~2,000 SVs that were experimentally verified; novel sequence either in gaps (black) or not (orange))


Regions of
increased SNV
density

Structural variants and linkage disequilibrium
McCarroll et al., Nature Genet., 2008

• Most common, diallelic CNPs (with MAF greater than 5%) were perfectly
captured (r2 = 1.0) by at least one SNP tag from HapMap Phase II

• Mean r2 as a function of distance from a polymorphism = indistinguishable for
SNPs and diallelic CNPs → common, diallelic CNPs are ancestral mutations

Common SVs are in LD with tagging SNPs

Contribution of variants to phenotypes?

Common versus rare
“Common disease – common variant hypothesis”
versus
Common complex traits are the summation of low-frequency, high-penetrance variants

OR = odds ratio

PAR = population attributable risk = measure of the multifactorial inherited component
of a disease
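
A minimal sketch of the two measures, under stated assumptions: the odds ratio is computed from a 2x2 table of carrier counts in cases and controls, and the population attributable risk uses Levin's formula PAR = p(RR-1) / (1 + p(RR-1)), which is a standard epidemiological definition brought in here and not taken from the slide. Function names and the example counts are hypothetical.

```python
def odds_ratio(case_carriers, case_noncarriers, control_carriers, control_noncarriers):
    """Odds of carrying the risk allele in cases divided by the odds in controls."""
    return (case_carriers * control_noncarriers) / (case_noncarriers * control_carriers)


def population_attributable_risk(exposure_frequency, relative_risk):
    """Levin's formula: fraction of disease in the population attributable to the factor."""
    excess = exposure_frequency * (relative_risk - 1.0)
    return excess / (1.0 + excess)


# Hypothetical counts: 30/100 cases and 20/100 controls carry the risk allele.
print(odds_ratio(30, 70, 20, 80))
# A common allele (20% of the population) with a modest relative risk of 1.5.
print(population_attributable_risk(0.2, 1.5))
```

The second call illustrates why a common variant with a small effect can still account for a noticeable share of cases in the population.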

Whole Genome Association studies

How significant is this?

Whole genome association studies
P-value

Note: “Genome-wide” is a misnomer
• 20% of common SNPs not or only partially tagged
• Rare variants not tagged at all

Concept

1) Scan entire genome (500,000 SNPs) and plot -log10(p) for each SNP; peaks (*) mark candidate associations
2) Identify local regions of interest, examine genes, SNP density, regulatory regions, etc.
3) Replicate the finding

From Sven Bergmann, UNIL
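
To make "how significant is this?" concrete, here is a minimal sketch of a multiple-testing threshold for a scan of 500,000 SNPs. The Bonferroni correction and the family-wise error rate of 0.05 are assumptions chosen for illustration; the slides do not state which correction was applied.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def minus_log10(p_value):
    """Convert a p-value to the -log10(p) scale used on the y-axis of the scan."""
    return -math.log10(p_value)


threshold = bonferroni_threshold(0.05, 500_000)
print(threshold, minus_log10(threshold))
```

With these assumptions a SNP needs a p-value of about 1e-7, i.e. a peak of about 7 on the -log10(p) axis, before it counts as a genome-wide hit.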

Visualization

Wellcome Trust Case Control Consortium. Genome-wide association
study of 14,000 cases of seven common diseases and 3,000 shared
controls. Nature 447, 661–678 (2007).

McCarthy et al., Nature Rev. Genet., 2008


An avalanche of GWA studies

• From 2006 → >220 studies reported to date
• For over 80 phenotypes → 300 loci have been implicated
• Most implicated loci were identified for the first time (no prior knowledge)

Type 2 diabetes: an example

Frazer et al., Nat. Rev. Genet., 2010

• 18 genomic intervals with 4 containing previously implicated genes
• Major message: the molecular diversity of T2D genes was not anticipated, thus:
(Patients with same disease) ≠ (Patients with same underlying biological disorder)

Overlap of genetic risk factor loci for common diseases


• 15 loci are associated with two or more diseases (8 are shown)
• Not necessarily the same impact (PTPN22: + for Crohn’s, - for other autoimmune (ai) diseases)
• Different diseases may have similar molecular underpinnings
• Expected: ai diseases (same clinical features)
• Unexpected: e.g. GCKR in both TGC levels and ai disease

From association to molecular mechanism
• Very difficult:
• what are the precise variants associated with a trait?
• if located in exons: easy, but outside, then what?
• most are located outside exons!
(e.g. 9p21 <-> myocardial infarction is located 150 kb from the nearest gene!)
• May have a regulatory function, i.e. control gene expression

AG
1 c2 3

• humans are heterozygous at more functional cis-regulatory sites than at amino acid positions, with
10,700 functional biallelic cis-regulatory polymorphisms in a typical human (Rockman and Wray. Mol.
Biol. Evol., 2002: 19, 1991).

• 34% of promoter polymorphisms (170 tested) significantly modulated reporter gene expression
(>1.5-fold) (Hoogendoorn et al., Hum. Mol. Genet., 2003: 12, 2249).

• Case study with the CC chemokine receptor 5, a major chemokine coreceptor of HIV-1 necessary for
viral entry into cells
• G to A SNP of CCR5 at –2459 nt
• CCR5 density – low (homozygous GG), intermediate (GA), and highest (homozygous –2459AA)
(Salkowitz et al., Clin. Immunol., 2003: 108, 234).

Mapping eQTLs
• Transcript abundance = a quantitative trait that can be mapped with considerable power = eQTLs
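
Treating transcript abundance as a quantitative trait can be sketched as a simple regression of expression on genotype. This is an illustration under assumptions: genotypes are coded as allele dosage (0/1/2), the model is additive, and the function name and toy values are hypothetical; real eQTL studies use more elaborate models and corrections.

```python
def eqtl_regression(dosages, expression):
    """Least-squares slope and r2 for expression level versus allele dosage."""
    n = len(dosages)
    mean_g = sum(dosages) / n
    mean_e = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    syy = sum((e - mean_e) ** 2 for e in expression)
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(dosages, expression))
    if sxx == 0 or syy == 0:
        raise ValueError("genotype and expression must both vary")
    slope = sxy / sxx               # expression change per extra allele copy
    r2 = sxy * sxy / (sxx * syy)    # fraction of expression variance explained
    return slope, r2


# Toy data: each extra copy of the allele adds one unit of expression.
print(eqtl_regression([0, 1, 2, 0, 1, 2], [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]))
```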

Environment Genetics

Heritability (H2) = genetic variance over total trait variance with 0 =
no genetic effects and 1 = all variance is under genetic control
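
The definition above amounts to a single ratio. A minimal sketch, assuming total trait variance is simply the sum of a genetic and an environmental component (no interaction or covariance terms); the function name is hypothetical.

```python
def broad_sense_heritability(genetic_variance, environmental_variance):
    """H2 = genetic variance / total trait variance (assumed Vg + Ve)."""
    total = genetic_variance + environmental_variance
    if total <= 0:
        raise ValueError("total variance must be positive")
    return genetic_variance / total


print(broad_sense_heritability(3.0, 1.0))  # mostly genetic
print(broad_sense_heritability(0.0, 2.0))  # no genetic effects
```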

Classic paper: Schadt et al., Nature, 2003
Genetics of gene expression surveyed in maize, mouse and man

• Liver tissues from 111 F2 mice constructed (from C57BL/6J and DBA/2J)
• Microarray analysis of 23,574 genes: 7,861 significantly differentially expressed (either in the
parental strains or in at least 10% of the F2 mice)
• eQTL identification (log of the odds ratio (LOD) > 4.3 (P-value < 0.00005)) for 2,123 genes
• These eQTLs explained 25% of the transcription variation of the corresponding genes
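
The correspondence between the LOD cutoff of 4.3 and a P-value of about 0.00005 can be checked with a short calculation. This is a sketch under an assumption not stated on the slide: that the likelihood-ratio statistic 2·ln(10)·LOD follows a chi-square distribution with 2 degrees of freedom (as for a test with an additive and a dominance term in an F2 cross), for which the tail probability is exp(-x/2). Function names are hypothetical.

```python
import math


def lod_to_chi2(lod):
    """Convert a LOD score (log10 likelihood ratio) to a likelihood-ratio statistic."""
    return 2.0 * math.log(10) * lod


def lod_to_pvalue_2df(lod):
    """Pointwise p-value assuming a chi-square distribution with 2 degrees of freedom."""
    return math.exp(-lod_to_chi2(lod) / 2.0)


print(lod_to_pvalue_2df(4.3))
```

Under this assumption the p-value reduces to 10^(-LOD), so a LOD of 4.3 gives roughly 5e-5, in line with the threshold quoted above.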

Mapping eQTLs
Schadt et al., Nature, 2003

% eQTL across 920 evenly spaced bins, each 2 cM wide
• Several hotspots (>1% of detected
eQTLs are located within a 4 cM
interval)

• 40% of genes with ≥ 1 eQTL (LOD >
3.0) had more than one eQTL, and
close to 4% of such genes had more
than three eQTL
→ Gene expression = complex trait

Mapping eQTLs

Known polymorphisms between the two parental strains
• Overlap between polymorphism and
eQTL = cis-acting transcriptional
regulation

For example:
• The C5 gene has a 2 bp deletion in the
coding region in DBA mice, resulting in
rapid transcript decay compared with B6.
A LOD of 27.4 centred over the C5 gene
on chromosome 2 is readily detected
(black curve).
• The Alad gene is present in 2 copies in
DBA

Mapping eQTLs

Combining clinical, gene expression and genetic factors

• Classical QTLs for fat pad mass (FPM):
4 significant loci

• Further analyses with subgroups:
additional loci identified

• Some QTLs only affect a subset of the F2
population, demonstrating the complexity
underlying traits such as obesity

Mapping eQTLs
Dixon et al., Nature Genet., 2007: A genome-wide association study of global gene expression

• 206 families of British descent using immortalized lymphoblastoid cell lines
(LCLs) from 400 children (Affy microarrays; 54,675 transcripts ~ 20,599 genes)

~15,000 transcripts with H2 > 0.3
Gene Ontology descriptors for:
• Response to unfolded protein (HSFs, chaperones)
• Immune responses and apoptosis
• Regulation of progression through the cell cycle,
• RNA processing and DNA repair.

Mapping eQTLs
Dixon et al., Nature Genet., 2007: A genome-wide association study of global gene expression

• 206 families of British descent using immortalized lymphoblastoid cell lines
(LCLs) from 400 children (Affy microarrays; 54,675 transcripts ~ 20,599 genes)

• Trans effects are weaker than those in cis

• Nevertheless, significant trans associations
were detected:
e.g. 1) ~700 transcripts with the peak of
association on the same chromosome but
>100 kb from the nearest transcribed gene,
2) 10,382 transcripts, the peak of
association was on a different chromosome

Mapping eQTLs
Using eQTLs to better understand GWAS results
Libioulle et al., PLOS Genet., 2007

GWAS for Crohn’s disease

• One of the neighboring genes, PTGER4, may be involved
• Trace eQTLs in LCL data
(Figure label: 1.25 Mb gene desert)

• Disease-associated polymorphisms may be regulating PTGER4 expression
in cis, but >250 kb away → more research needed, but likely a regulatory
polymorphism

Mapping eQTLs
We looked at SNPs but what about other structural variants?
Stranger et al., Science, 2007: Relative Impact of Nucleotide and Copy Number Variation on
Gene Expression Phenotypes
• LCLs of 210 unrelated HapMap individuals from four populations
• Copy number variants were identified via CGH against a common reference individual
(Figure panels: SNP and CNV; axis label: from probe associated with linked gene)

• SNPs and CNVs captured 83.6% and 17.7%, respectively,
of the total detected genetic variation in gene expression
• SNPs close to their respective genes, less so for CNVs
• Little overlap between SNP and CNV associations (only 20%)
• Not “mere” gene dosage effects

How universal are GWAS findings?
(Figure: locus associated with myocardial infarction)

• Allele frequencies are different in different populations

• LD patterns across loci that co-segregate with a causally associated
variant may be different from population to population

• Control for population differences is essential in large studies

LD less strong in African population → bottleneck principle
Red = high pairwise SNP correlation
SNPs that efficiently (r2 > 0.8) tag one another are connected

Impact so far
• No complex traits for which there is > 10% of the genetic variance explained
e.g. T2D: 18 genetic variants together < 4% of the total trait liability

• Sample size may compensate (increased statistical power)
But…studies for lipid phenotypes involving >40,000 people still <10%
… some diseases have only a low number of affected individuals

• Does the answer lie in structural variants? Most are still unmapped
But… they are likely in LD with common SNPs

• Does the answer lie in rare variants?
Possibly…
• Rare variants are not in LD with tagging SNPs and thus so far undetected
(Amish study)
• Can have very high penetrance
• However, how to detect on a population-wide basis?

The power of whole-genome sequencing
Miller syndrome: autosomal recessive genetic trait (Roach et al., Science, 2010)

• Sequenced genomes of 2 parents and 2 children, both affected by Miller Syndrome

• Identified 3.7 million SNPs that varied within the family

• Resequenced 34,000 candidate mutations → 28 de novo mutations

• Narrowing down via “rare” assumption and knowledge of recessive inheritance

• Found one gene, dihydroorotate dehydrogenase (DHODH), known to be involved

Entering the age of personalized medicine
Toward the elucidation of each person’s genetic make-up
Necessary for:
1) DNA-based risk assessment for common complex disease
2) Drug discovery (new implicated genes can be identified)

But also to:
3) Identify molecular signatures for disease diagnosis and prognosis

And for:
4) A DNA-guided therapy and dose selection

A person’s genetic make-up significantly affects the efficacy of a drug
• Polymorphisms in the VKORC1 and CYP2C9 genes dictate the effective dose levels of the
anti-coagulant Warfarin
• Polymorphisms in the UGT1A1 gene correlate with increased toxicity of the anti-colon
cancer drug Irinotecan
• Polymorphisms in the MTHFR gene are associated with increased toxicity of Methotrexate
used to treat Crohn’s disease
• Polymorphisms in the CYP2D6 gene dictate the probability of relapse in women with
breast cancer treated with Tamoxifen

The revolution of high-throughput sequencing: Illumina
Metzker et al., Nat. Rev. Genet., 2010

Solid phase amplification: 1) initial priming and extending
of the single-stranded, single-molecule template, and 2)
bridge amplification of the immobilized template with
immediately adjacent primers to form clusters.

From sequence to genome: mapping reads
Trapnell and Salzberg, Nat. Biotech., 2009

Seed-based indexing:
• Four sequences of equal length = seeds
• If 1 SNP, the other 3 seeds intact; if 2 SNPs, the other 2 seeds intact
• Thus, max 2 SNPs/read
• Limitation: indexing takes up huge memory

Burrows-Wheeler (BW) indexing:
• Using BW, the index for the entire human genome fits into < 2 Gb of memory
• Is 30 times faster than seed-based indexing
• Also is limited to 2 SNPs within one read
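
The seed argument (with at most 2 mismatches, at least 2 of 4 seeds still match exactly) can be sketched as follows. This is a minimal illustration with hypothetical function names; it compares a read against a reference window of the same length and does not build a genome index.

```python
def split_into_seeds(sequence, n_seeds=4):
    """Cut a sequence into n_seeds consecutive pieces of equal length."""
    if len(sequence) % n_seeds:
        raise ValueError("sequence length must be a multiple of n_seeds")
    size = len(sequence) // n_seeds
    return [sequence[i * size:(i + 1) * size] for i in range(n_seeds)]


def count_intact_seeds(read, reference_window, n_seeds=4):
    """Count seeds of the read that match the reference window exactly."""
    if len(read) != len(reference_window):
        raise ValueError("read and reference window must have equal length")
    pairs = zip(split_into_seeds(read, n_seeds),
                split_into_seeds(reference_window, n_seeds))
    return sum(r == g for r, g in pairs)


reference = "ACGTACGTACGTACGT"
print(count_intact_seeds("TCGTACGTACGTACGT", reference))  # one mismatch
print(count_intact_seeds("TCGTAGGTACGTACGT", reference))  # two mismatches
```

Because every mismatch can destroy at most one seed, looking up any pair of intact seeds is enough to find candidate positions for reads with up to two differences.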

Burrows-Wheeler transform
Wikipedia

Easier to compress strings with runs of repeated characters
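
A minimal sketch of the transform itself, assuming a "$" end marker that does not occur in the text and sorts before every other character. It builds all rotations explicitly, which is fine for a slide-sized example but far too slow and memory-hungry for a genome; real aligners use suffix-array based constructions.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform: last column of the sorted rotation table."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, sentinel="$"):
    """Recover the original text by repeatedly prepending and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(c + row for c, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


print(bwt("banana"))
print(inverse_bwt(bwt("banana")))
```

The output for "banana" groups identical letters into runs, which is what makes the transformed string easier to compress while remaining fully reversible.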

A first human genome project using HTS

Bentley et al., Nature,
2008
• Solexa Technology
• First: X-chromosome
• 204 million reads
• Sampling of
sequence fragments
is close to random
(GC content slight
effect)

A first human genome project using HTS
Bentley et al., Nature, 2008
• 135 Gb of sequence (~4 billion paired 35-base reads) (8 weeks)
• The approximate consumables cost = $250,000
• 97% of the reads were aligned using MAQ
• 99.9% of the human reference covered with ≥ 1 reads at 40.6X

99% agreement with HapMap results!

More human genome projects
Snyder et al., G&D, 2010

Tackling the SV problem using HTS
• Really difficult and progress is limited.
• Existing methods are based on two approaches:
• Paired-end mapping (PEM)
• Depth-of-coverage (DOC) approach

• The ends of each fragment tagged by a biotinylated (B) nucleotide
• Circularization forms a junction between the two ends
• Circularized DNA is randomly fragmented and the biotinylated junction fragments are
recovered
• Standard sequencing procedure thereafter

Tackling the SV problem using HTS: paired-end mapping
Medvedev et al., Nature Meth., 2009
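
The core idea of paired-end mapping can be sketched as a comparison between the distance at which two read ends map on the reference and the known library insert size. This is a simplified illustration: the function name, the fixed tolerance and the example sizes are hypothetical, and real callers also use read orientation and cluster many supporting pairs before reporting a variant.

```python
def classify_pair(mapped_span, expected_insert, tolerance):
    """Interpret one read pair from its mapped distance on the reference."""
    if mapped_span > expected_insert + tolerance:
        # Ends map too far apart: the sample lacks sequence present in the reference.
        return "deletion"
    if mapped_span < expected_insert - tolerance:
        # Ends map too close together: the sample carries extra sequence.
        return "insertion"
    return "concordant"


for span in (3200, 5000, 1500):
    print(span, classify_pair(span, expected_insert=3000, tolerance=500))
```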

Tackling the SV problem using HTS: DOC
Snyder et al., G&D, 2010; Campbell et al., Nature Genet., 2008
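
The depth-of-coverage idea can be sketched by scaling the read depth in fixed windows against the typical depth of the sample. This is a minimal illustration under assumptions: a diploid baseline of 2 copies, the median window as the normal reference, and hypothetical names and depths; real methods also correct for GC content and mappability.

```python
from statistics import median


def copy_number_from_depth(window_depths, baseline_copies=2):
    """Estimate integer copy number per window from read depth."""
    typical = median(window_depths)
    if typical <= 0:
        raise ValueError("median depth must be positive")
    return [round(baseline_copies * depth / typical) for depth in window_depths]


# Depth ~40 is normal; 20 suggests a one-copy loss, 60 a one-copy gain.
print(copy_number_from_depth([40, 41, 39, 20, 60, 40]))
```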

Tackling the SV problem using HTS: state-of-the-art
Snyder et al., G&D, 2010
