This document summarizes genetic variation projects like HapMap and 1000 Genomes that aimed to catalog common human genetic variants. It describes types of variation like SNPs and how factors like selection and recombination influence their distribution. It provides overviews of the HapMap and 1000 Genomes projects, including their goals, populations studied, methods, and data formats. The information from these projects can be used to study traits, diseases, and human history.
3. Types of Human Genetic Variation
• Individual: de novo and rare variations
• Population: variations which have become fixed within a population
– Single Nucleotide Polymorphisms (SNPs): base pair substitutions
• Transition: purine -> purine (A<->G), pyrimidine -> pyrimidine (C<->T)
• Transversion: purine <-> pyrimidine
• common ~1-5% minor allele frequency (MAF) in major populations
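The transition/transversion split and the MAF threshold above can be sketched in a few lines of Python. This is only an illustration: the function names `classify_substitution` and `minor_allele_frequency` are hypothetical, and the MAF helper assumes a biallelic site described by two allele counts.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Label a single-base substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases:
        raise ValueError("ref and alt must each be one of A, C, G, T")
    if ref == alt:
        raise ValueError("ref and alt are identical, so this is not a substitution")
    # Transition: both bases are purines or both are pyrimidines.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


def minor_allele_frequency(count_a: int, count_b: int) -> float:
    """MAF at a biallelic site (assumption: exactly two alleles observed)."""
    total = count_a + count_b
    if total <= 0:
        raise ValueError("at least one allele must be observed")
    return min(count_a, count_b) / total


if __name__ == "__main__":
    print(classify_substitution("A", "G"))
    print(classify_substitution("A", "C"))
    print(minor_allele_frequency(180, 20))
```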
4. Types of Human Genetic Variation (cont.)
– Copy-Number Variations (CNVs):
• insertions, deletions, duplications of DNA segments (>1 kb)
– Other Variations:
• Structural: inversions
• Repeats: microsatellites (STRs), minisatellites (VNTRs)
• Frameshift mutations
5. SNP Distribution throughout the Genome
[Figure: SNP distribution throughout the genome, with the HLA region labeled. Sachidanandam et al. 2001]
• Genetic variability throughout the genome reflects function (among other factors)
6. Factors Affecting SNP Distribution
• Intrinsic, Structural: mutation clusters due to recombination events and sequence context-specific effects [3,4]
– a) Time to Most Recent Common Ancestor of genes in population influences SNPs (older genes -> more SNPs in population)
– b) base composition, local recombination, gene density, chromatin structure, nucleosome position, replication timing
Lercher and Hurst 2002
7. Factors Affecting SNP Distribution (cont.)
• Functional: mutation clusters due to natural selection (examples include immunoglobulin genes)
– a) balancing selection increases diversity
– b) purifying and directional selection decrease diversity
– c) transcriptional activity
• Ascertainment bias: better characterization of SNPs around genes of interest [5]
8. Effects of Genetic Variation
• Pathogenic and non-pathogenic heritable traits
• Genetic variation reveals millions of years of human history
– “One can think of selective pressures as natural, in vivo human experiments in which we can measure the response of human populations to unknown perturbations, and these alterations can inform the function of genes within a given locus.” Raj et al. 2012
– Understand the history of mutation, selection and recombination within the human genome
9. Potential Uses of SNP data
Ultimately, synergy of genomics and functional work will allow us to understand human traits and disease.
• Association Mapping: Genome Wide Association (GWA) studies, Pharmacogenomics
• Modeling Mendelian and Complex diseases
• eQTL and functional genomics
• Selection!
11. Selection of Lassa Fever Susceptibility Genes in YRI populations
Andersen et al. (2012)
12. eQTL
SLE susceptibility locus (rs11755393; GWAS p = 2.20 x 10^-8)
Positive Selection
Slide from Replogle and Raj
13. International HapMap Project
• “to identify and catalog genetic similarities and differences in human beings”
• Haplotype Map: SNPs (genotypes) at separate loci whose alleles are statistically associated due to limited genetic recombination
HapMap Project
14. Linkage Disequilibrium (LD)
• Alleles at different loci are not independent
due to
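[Figure: haplotype frequency diagrams for two loci with alleles A/a and B/b (haplotypes AB, Ab, aB, ab; allele frequencies fA, fa, fB, fb), contrasting linkage equilibrium with linkage disequilibrium]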
Image by Gil McVean
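One common way to put a number on the non-independence described above is the coefficient D = f(AB) - fA * fB and its normalized form r^2. The slide does not name these statistics, so the sketch below brings them in as standard population-genetics definitions. The function name `ld_stats` is hypothetical, and the inputs are assumed to be haplotype and allele frequencies for two biallelic loci.

```python
from typing import Tuple


def ld_stats(f_ab: float, f_a: float, f_b: float) -> Tuple[float, float]:
    """Return (D, r_squared) for two biallelic loci.

    f_ab: frequency of the AB haplotype
    f_a:  frequency of allele A at the first locus
    f_b:  frequency of allele B at the second locus
    """
    for f in (f_a, f_b):
        if not 0.0 < f < 1.0:
            raise ValueError("both loci must be polymorphic (0 < frequency < 1)")
    # Under linkage equilibrium f_ab equals f_a * f_b, so D is zero.
    d = f_ab - f_a * f_b
    r_squared = d * d / (f_a * (1.0 - f_a) * f_b * (1.0 - f_b))
    return d, r_squared


if __name__ == "__main__":
    print(ld_stats(0.25, 0.5, 0.5))  # alleles combine independently
    print(ld_stats(0.50, 0.5, 0.5))  # A always travels with B
```

With fA = fB = 0.5, an AB frequency of 0.25 gives D = 0 (equilibrium), while an AB frequency of 0.5 gives r^2 = 1 (complete disequilibrium).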
15. Origin of LD
[Figure: origin of LD, shown in three stages. Image modified from Gil McVean]
• The mutation arises on a particular genetic background
• If the mutation increases in frequency, the associated haplotype will also increase in frequency.
• Over time the association between the new mutation and linked mutations will decay by recombination
• Factors Increasing LD:
1) Genetic Drift (stochastic sampling)
2) Selection
3) Non-Random Mating
• Recombination is the only factor which decreases LD.
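The decay of LD by recombination can be illustrated with the textbook recursion D_t = D_0 * (1 - r)^t, where r is the recombination fraction between the two loci per generation. This is a simplified sketch, not something taken from the slide: it assumes random mating and ignores drift, selection, and new mutation, and the function names are hypothetical.

```python
import math


def ld_after_generations(d0: float, r: float, generations: int) -> float:
    """Expected D after `generations` generations of recombination at rate r."""
    if not 0.0 <= r <= 0.5:
        raise ValueError("recombination fraction must lie in [0, 0.5]")
    if generations < 0:
        raise ValueError("generations must be non-negative")
    return d0 * (1.0 - r) ** generations


def generations_to_halve(r: float) -> float:
    """Number of generations for D to fall to half its starting value."""
    if not 0.0 < r <= 0.5:
        raise ValueError("recombination fraction must lie in (0, 0.5]")
    return math.log(0.5) / math.log(1.0 - r)


if __name__ == "__main__":
    # Tightly linked loci (r = 0.001) keep their association far longer
    # than unlinked loci (r = 0.5).
    for r in (0.001, 0.01, 0.5):
        print(r, ld_after_generations(0.25, r, 100), generations_to_halve(r))
```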
16. Haplotype
HapMap Project
• ~10^7 common (MAF >1%) SNPs in the human genome
• ‘tag SNPs’ allow for identification of an individual’s haplotypes
• Estimated 300,000-600,000 tag SNPs in genome
• Genotyping: testing tag SNPs
• Sequencing: whole genome sequence
17. HapMap Populations
• 270 total DNA samples
• Yoruba in Ibadan, Nigeria (YRI)
• Japanese in Tokyo, Japan (JPT)
• Han Chinese in Beijing, China (CHB)
• CEPH (Utah residents with ancestry from northern and western Europe) (CEU)
18. HapMap Methodology
• Genotype individuals for several million SNPs
– 1 SNP per 5kb or less
– MAF >1% as estimated by TSC project, JSNP, dbSNP, and initial SNP map
– Random shotgun sequencing to obtain additional SNPs
– Coding and noncoding SNPs
19. HapMap Methodology (cont.)
• Data analysis to identify LD and Haplotype maps
• Tag SNPs are useful with haplotype and recombination map
• Data available online in multiple formats
http://hapmap.ncbi.nlm.nih.gov/downloads/index.html.en
• Phase III data released 2009
20. Reference Genome?
• Mosaic haploid DNA sequence
• GRCh37
21. 1000 Genomes
• “to find most genetic variants that have frequencies of at least 1% in the populations studied”
• Low coverage sequencing of >2000 individuals, exome sequencing, trios
• Characterization of SNPs and Structural Variants (INDELs)
22. 1000 Genomes Populations
• Yoruba in Ibadan, Nigeria (YRI)
• Japanese in Tokyo, Japan (JPT)
• Han Chinese in Beijing, China (CHB)
• CEPH (Utah residents with ancestry from northern and western Europe) (CEU)
• Luhya in Webuye, Kenya (LWK)
• Toscani in Italy (TSI)
• Peruvians in Lima, Peru (PEL)
• Mexican ancestry in Los Angeles, CA (MXL)
• And many more!
23. “Low-Coverage” Sequencing
• Sequencing:
1) DNA copies broken into short pieces
2) Each piece is sequenced (random pieces means most of genome is covered)
3) Sequenced fragments are aligned and joined to determine complete genome
• 28X sequencing coverage necessary for complete genome
• Low-coverage sequencing (4X coverage): many pieces of individual genomes are missed
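To get a rough feel for why low coverage misses parts of a genome, read depth at a base can be modeled as Poisson-distributed around the mean coverage. This is a simplifying assumption brought in for illustration (it ignores GC bias, repeats, and mapping problems), and the slide's 28X figure is not derived from this model. The function name `prob_depth_below` is hypothetical.

```python
import math


def prob_depth_below(mean_coverage: float, min_depth: int) -> float:
    """P(depth < min_depth) at one base, assuming Poisson-distributed depth."""
    if mean_coverage <= 0:
        raise ValueError("mean_coverage must be positive")
    if min_depth < 0:
        raise ValueError("min_depth must be non-negative")
    # Sum the Poisson probabilities of seeing 0, 1, ..., min_depth - 1 reads.
    return sum(
        math.exp(-mean_coverage) * mean_coverage ** k / math.factorial(k)
        for k in range(min_depth)
    )


if __name__ == "__main__":
    for coverage in (4.0, 28.0):
        print(
            coverage,
            prob_depth_below(coverage, 1),  # base not covered by any read
            prob_depth_below(coverage, 5),  # base covered by fewer than 5 reads
        )
```

Under this model a base has about a 2% chance of receiving no reads at 4X, and a much larger chance of receiving too few reads to call both alleles of a heterozygous site with confidence.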
24. 1000 Genomes Data
• Latest release:
– 1092 samples
– SNP, indel, and large deletion
– Autosomes and chrX
– ~38.2 M SNPs from low coverage and exome sequencing
• 1000genomes site has a link to an NCBI FTP site with their latest data
25. VCF file format
• Variant Call Format 4.1: meta-info followed by header and data
• tab-delimited text file
• Compressed .gz
zcat file.vcf.gz | grep -e '^#' -e 'SNP' | bgzip -c > snps.vcf.gz
• http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
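The shell pipeline above keeps header lines (those starting with #) plus any record containing the text SNP. A rough Python equivalent is sketched below. The function names are hypothetical, and unlike the pipeline it writes ordinary gzip output rather than bgzip (BGZF) blocks, so the result may need recompressing with bgzip before it can be indexed.

```python
import gzip
from typing import Iterable, Iterator


def filter_vcf_lines(lines: Iterable[str], keyword: str = "SNP") -> Iterator[str]:
    """Yield header lines and any record line that contains `keyword`."""
    for line in lines:
        # Like grep, the keyword may match anywhere on the line,
        # not only in the INFO column.
        if line.startswith("#") or keyword in line:
            yield line


def filter_vcf_file(in_path: str, out_path: str, keyword: str = "SNP") -> None:
    """Stream a gzip-compressed VCF through filter_vcf_lines."""
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        dst.writelines(filter_vcf_lines(src, keyword))


if __name__ == "__main__":
    demo = [
        "##fileformat=VCFv4.1\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "1\t100\t.\tA\tG\t50\tPASS\tVT=SNP\n",
        "1\t200\t.\tAT\tA\t50\tPASS\tVT=INDEL\n",
    ]
    print("".join(filter_vcf_lines(demo)), end="")
```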
26. Columns in VCF format
• CHROM: chromosome (no colons)
• POS: numerical reference position, with the 1st base having
position 1 (some variants have multiple pos records)
• ID: semi-colon separated list of unique identifiers where available
(ex. dbSNP rs number)
• REF: reference base(s) A,C,G,T,N (case insensitive) for a given variant
• ALT: comma separated list of alternate non-reference alleles called
on at least one of the samples.
• QUAL: phred-scaled quality score for the assertion made in
ALT. i.e. -10log_10 prob(call in ALT is wrong)
• FILTER: another quality measure; PASS if this position has passed all
filters
• INFO: semicolon separated additional info; ex. AF (allele frequency), DB (dbSNP membership), VALIDATED
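The column layout above maps naturally onto a small record type. The sketch below splits one tab-delimited data line into the eight fixed columns and converts a phred-scaled QUAL back into an error probability. The names `VcfRecord`, `parse_vcf_record`, and `qual_to_error_prob` are hypothetical, the sample line is made-up illustrative data, and the handling of "." as a missing value follows the VCF specification rather than the slide.

```python
from typing import Dict, List, NamedTuple, Optional, Union


class VcfRecord(NamedTuple):
    chrom: str
    pos: int
    ids: List[str]
    ref: str
    alts: List[str]
    qual: Optional[float]
    filter_status: str
    info: Dict[str, Union[str, bool]]


def parse_vcf_record(line: str) -> VcfRecord:
    """Parse the eight fixed columns of one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 tab-separated columns")
    chrom, pos, id_, ref, alt, qual, filt, info = fields[:8]
    info_dict: Dict[str, Union[str, bool]] = {}
    if info != ".":
        for entry in info.split(";"):
            key, sep, value = entry.partition("=")
            # Flags such as DB carry no value; store them as True.
            info_dict[key] = value if sep else True
    return VcfRecord(
        chrom=chrom,
        pos=int(pos),
        ids=[] if id_ == "." else id_.split(";"),
        ref=ref,
        alts=[] if alt == "." else alt.split(","),
        qual=None if qual == "." else float(qual),
        filter_status=filt,
        info=info_dict,
    )


def qual_to_error_prob(qual: float) -> float:
    """Invert QUAL = -10 * log10(P(call in ALT is wrong))."""
    return 10.0 ** (-qual / 10.0)


if __name__ == "__main__":
    record = parse_vcf_record("20\t14370\trs6054257\tG\tA\t29\tPASS\tDP=14;AF=0.5;DB")
    print(record)
    print(qual_to_error_prob(30.0))
```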
28. Interested?
• Get Prof. Cavalcanti to buy Human Evolutionary Genetics: Origins, Peoples and Disease
29. References
1. Sachidanandam R et al. (2001) A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409: 928-933.
2. Lercher MJ and Hurst LD (2002) Human SNP variability and mutation rate are higher in regions of high recombination. Trends Genet 18: 337-340.
3. Rogozin IB and Pavlov YI (2003) Theoretical analysis of mutational hotspots and their DNA sequence context specificity. Mutat Res 544(1): 65-85.
4. Ma X et al. (2012) Mutation Hot Spots in Yeast Caused by Long-Range Clustering of Homopolymeric Sequences. Cell Reports 1(1): 36-42.
5. Clark AG et al. (2005) Ascertainment bias in studies of human genome-wide polymorphism. Genome Res 15: 1496-1502.
6. Raj T et al. (2012) Alzheimer Disease Susceptibility Loci: Evidence for a Protein Network under Natural Selection. AJHG 90: 720-726.
7. Voight BF et al. (2006) A Map of Recent Positive Selection in the Human Genome. PLoS Biology 4(3): e72.
8. Andersen KG et al. (2012) Genome-wide scans provide evidence for positive selection of genes implicated in Lassa fever. Philos Trans R Soc Lond B Biol Sci 367(1590): 868-877.
9. Hapmap.org
10. McVean G (2004) Population Genetics of the Human Genome. Oxford Human Genome Lecture Series.
11. Gibbs RA et al. (2003) The International HapMap Project. Nature 426: 789-796.
12. 1000genomes.org
13. Durbin RM et al. (2010) A map of human genome variation from population-scale sequencing. Nature 467(7319): 1061-1073.