GWAS "GENOME WIDE ASSOCIATION STUDIES" A STEP AHEAD
1. Department of Genetics and Plant Breeding, College of Agriculture,
C.S.K.H.P. Krishi Vishvavidyalaya,
Palampur-176062
Understanding Different Components of GWAS to Increase Its Efficiency
in Crop Improvement
Speaker:
Om Prakash Raigar
(Genetics and Plant Breeding)
2. What is QTL Mapping
Association Mapping
Genome Wide Association Mapping
Components of GWAS
Applications of GWAS
Conclusion
Content
3. Quantitative Trait Loci
Quantitative traits are controlled by a large number of polygenes, each of which has a small individual effect on the phenotype; together their effects are cumulative
Efforts to physically localize polygenes began when Sax (1923) reported linkage between seed coat color and seed size in common bean (Phaseolus vulgaris), which are qualitative and quantitative traits, respectively
A genomic region associated with the expression of a quantitative trait is referred to as a quantitative trait locus (QTL)
Phenotypic evaluation of the mapping population for QTL analysis should be performed at multiple locations, since evaluation at a single location may underestimate the total number of QTLs involved in the control of the trait concerned
4. Parent 1 × Parent 2
Mapping population (RILs, F2, DH, BC lines)
Genotyping with molecular markers
Phenotypic evaluation
Linking trait data with marker data (mapping software)
Trait QTL mapped on chromosome
QTL Mapping
5. Mapping population (F2, BC, DH, RIL, NIL)
Few alleles per locus
Low resolution (5-10 cM) due to few recombination events
Additional steps required to narrow the QTL
Difficult to discover causative genes
Disadvantages of QTL Mapping
7. Why Association Mapping
Association mapping was first used in a plant species by Thornsberry et al. (2001).
The objective of gene mapping is to find molecular markers tightly linked to, or associated with, the genes governing quantitative traits, making marker-assisted selection feasible.
Two approaches are mostly used for genetic mapping:
(i) Linkage mapping
(ii) Association mapping (utilizes linkage disequilibrium)
“Association mapping is a population-based survey used to identify trait-marker relationships based on linkage disequilibrium by exploiting historical and evolutionary recombination events”
Linkage disequilibrium was first defined by Jennings in 1917 and quantified in 1964 by Lewontin.
8. Zhu et al. (2008)
Linkage Mapping Association Mapping
• Both linkage analysis and association mapping rely on co-inheritance of functional polymorphisms and neighboring DNA variants. However, in some cases, DNA variants associated with the traits might be on different chromosomes.
• In linkage analysis using an F2 generation, there are only a few opportunities for recombination, resulting in relatively low mapping resolution.
• Association mapping, in contrast, utilizes historical recombination and natural genetic diversity, resulting in high-resolution mapping.
9. “Linkage disequilibrium” (LD) is the non-random association between alleles at different loci (Jennings, 1917)
“Association mapping” refers to the significant association of a marker locus with a phenotypic trait.
Soto-Cerda and Cloutier, 2009
Principle of Association Mapping
10. 1. Mutation
2. Population bottleneck and genetic drift
3. Selection
4. Population structure
LD
Soto-Cerda and Cloutier, 2009
Factors Creating Linkage Disequilibrium
11. LD is the difference between the observed frequency of a haplotype (gamete type) and the frequency expected under independence
D = coefficient of LD
where P_AB is the frequency of gametes carrying alleles A and B at two loci, and P_A and P_B are the frequencies of alleles A and B, respectively; their product P_A × P_B is the expected gamete frequency
At linkage equilibrium (L.E.), D = 0
At linkage disequilibrium (L.D.), D ≠ 0
Soto-Cerda and Cloutier, 2009
LD Quantification
D = P_AB − P_A × P_B
Loci | B | b | Total
A | P_AB | P_Ab | P_A
a | P_aB | P_ab | P_a
Total | P_B | P_b | 1.0
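As a minimal sketch of the calculation on this slide, the Python below computes D from the four haplotype frequencies of the 2 × 2 table. The function name `ld_coefficient` and the example frequencies are hypothetical, chosen only for illustration; they are not data from the presentation.

```python
def ld_coefficient(hap_freqs):
    """Return (P_A, P_B, D) from haplotype frequencies keyed 'AB', 'Ab', 'aB', 'ab'."""
    total = sum(hap_freqs.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    p_a = hap_freqs["AB"] + hap_freqs["Ab"]  # row total for allele A
    p_b = hap_freqs["AB"] + hap_freqs["aB"]  # column total for allele B
    d = hap_freqs["AB"] - p_a * p_b          # D = P_AB - P_A * P_B
    return p_a, p_b, d


# Hypothetical frequencies (illustrative only)
example = {"AB": 0.5, "Ab": 0.1, "aB": 0.1, "ab": 0.3}
p_a, p_b, d = ld_coefficient(example)
print(f"P_A={p_a:.2f}  P_B={p_b:.2f}  D={d:.2f}")
```

With these made-up frequencies both allele frequencies are 0.6, the expected AB frequency is 0.36, and D is 0.14, so the two loci are not in linkage equilibrium.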
13. For unlinked loci, LD decays by one-half with each generation of random mating
Thus, LD declines as the number of generations increases, so that in diverse populations LD is limited to small distances
LD tends to decay with increasing genetic distance between the loci under consideration
Soto-Cerda and Cloutier, 2009
LD Decay
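The decay described on this slide can be sketched with the standard random-mating relation D_t = D_0 × (1 − c)^t, where c is the recombination fraction between the two loci. The symbol c, the function names and the example values are assumptions added for illustration; the slide itself gives only the halving rule, which corresponds to c = 0.5.

```python
import math


def ld_after_generations(d0, c, generations):
    """LD remaining after t generations of random mating: D_t = D_0 * (1 - c)**t."""
    if not 0.0 <= c <= 0.5:
        raise ValueError("recombination fraction c must lie in [0, 0.5]")
    return d0 * (1.0 - c) ** generations


def generations_to_halve(c):
    """Number of generations needed for LD to fall to half its starting value."""
    if not 0.0 < c <= 0.5:
        raise ValueError("c must lie in (0, 0.5]")
    return math.log(0.5) / math.log(1.0 - c)


# Hypothetical starting LD and recombination fractions (illustrative only)
for c in (0.5, 0.1, 0.01):
    remaining = ld_after_generations(0.25, c, 10)
    print(f"c={c}: D after 10 generations = {remaining:.4f}, "
          f"halving time = {generations_to_halve(c):.1f} generations")
```

Tightly linked loci (small c) keep their LD for many generations, which is why LD is expected to fall off with genetic distance.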
14. • The two most widely used statistics for LD are
• D′ (Lewontin, 1964) and
• r², the square of the correlation coefficient between two loci (Hill & Robertson, 1968)
• D′ reflects only the recombinational history, whereas r² summarizes both recombinational and mutational history
Soto-Cerda and Cloutier, 2009
LD Statistics
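A short sketch of how the two statistics are usually computed from P_AB, P_A and P_B, using the standard definitions: D′ is |D| divided by its maximum possible value given the allele frequencies, and r² is D² divided by P_A(1 − P_A)P_B(1 − P_B). The function name and example frequencies are hypothetical.

```python
def d_prime_and_r2(p_ab, p_a, p_b):
    """Return (|D'|, r^2) for two biallelic loci."""
    d = p_ab - p_a * p_b
    # Largest |D| attainable with these allele frequencies
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d_prime, r2


# Hypothetical frequencies (illustrative only)
d_prime, r2 = d_prime_and_r2(p_ab=0.5, p_a=0.6, p_b=0.6)
print(f"D' = {d_prime:.3f}, r^2 = {r2:.3f}")
```

Both statistics range from 0 (no LD) to 1 (complete LD); r² reaches 1 only when just two haplotypes are present, so it also depends on the allele frequencies.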
15. • The statistical significance of LD is typically determined using a χ²
test of a 2 × 2 contingency table
• A p-value threshold of 0.05 is often used to declare lack of
independence of alleles at two loci, thus suggesting association
Computer software
• “Graphical Overview of Linkage Disequilibrium” (GOLD)
• “Trait Analysis by aSSociation, Evolution and Linkage” (TASSEL)
• PowerMarker
Gupta et al., 2005
Statistical Significance of LD
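The χ² test mentioned on this slide can be sketched in plain Python for a 2 × 2 table of haplotype counts. The counts below are hypothetical, and the p-value uses the closed form for one degree of freedom, p = erfc(√(χ²/2)), so no statistics library is assumed.

```python
import math


def ld_chi_square(n_AB, n_Ab, n_aB, n_ab):
    """Pearson chi-square (1 d.f.) for independence of alleles at two loci."""
    n = n_AB + n_Ab + n_aB + n_ab
    row_A, row_a = n_AB + n_Ab, n_aB + n_ab
    col_B, col_b = n_AB + n_aB, n_Ab + n_ab
    denom = row_A * row_a * col_B * col_b
    if denom == 0:
        raise ValueError("both loci must be polymorphic in the sample")
    chi2 = n * (n_AB * n_ab - n_Ab * n_aB) ** 2 / denom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # survival function, 1 d.f.
    return chi2, p_value


# Hypothetical haplotype counts (illustrative only)
chi2, p = ld_chi_square(50, 10, 10, 30)
print(f"chi-square = {chi2:.2f}, p = {p:.2e}, significant at 0.05: {p < 0.05}")
```

For a 2 × 2 table this χ² equals N × r², which links the significance test to the r² statistic.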
17. Genome Wide Association Mapping
GWAS helps identify candidate genes for many important traits in crops and other organisms by testing the association between marker genotypes (e.g. SNPs) and the phenotype of a target trait.
GWAS involves linkage disequilibrium (LD)-based association mapping
Aims:
To identify marker-trait associations for one trait at a time,
To study the genetic architecture of the trait, which involves identifying all QTLs/genes detected through GWAS and the interactions among them.
It is also used simply to identify marker-trait associations for marker-assisted selection
20. Nested Association Mapping
Linkage analysis maps broad chromosome regions with relatively low marker coverage (coarse mapping), while association mapping offers high resolution with very high marker coverage (fine mapping).
An integrated mapping strategy like NAM combines the advantages of both approaches to improve mapping resolution without requiring excessively dense marker maps.
Steps to develop NAM population:
1. Selection of diverse founders and developing recombinant inbred lines (RILs).
2. Dense genotyping of founders
3. Genotyping a smaller number of tagging markers on both the founders and the progenies
to project the high-density marker information from the founders to the progenies
4. Phenotyping progenies for various complex traits
5. Conducting genome-wide association mapping
21. Yu et al. (2008)
Cross between 25 diverse founders and the common parent (B73).
The genomes of these immortalized RILs are mosaics of the founder genomes, because the chance of recombination diminishes over a short genetic distance and a given number of generations.
X – Crossing
- Selfing
SSD - Single-seed descent.
Nested Association Mapping (NAM) population development
22. Multiparent Advanced Generation Intercross (MAGIC)
MAGIC: A recombinant inbred line (RIL) population is created from multiple founder lines, in which the genomes of the founders are first mixed by several rounds of mating and subsequently inbred by the single-seed descent method to generate a stable panel of RILs.
More parental accessions increase the allelic diversity, potentially increasing the number of QTL that segregate in the population.
The successive rounds of recombination cause LD to decay, thereby increasing the precision and resolution of QTL location (Mackay & Powell, 2007).
24. Comparison of mapping population and resolution
One chance for recombination: poor resolution
Many recombination events during generation advancement: intermediate resolution
So many historical recombination events: fine resolution
Soto-Cerda and Cloutier (2012)
25. DNA sequencing is the process of determining the exact order of nucleotides along chromosomes and genomes
It includes any method or technology used to determine the order of the four bases (adenine, guanine, cytosine and thymine) in a strand of DNA
First genome sequenced: bacteriophage ΦX174, in 1977
First bacterial genome: Haemophilus influenzae, in 1995
First eukaryotic genome: Saccharomyces cerevisiae, in 1996
DNA Sequencing
26. Next-generation sequencing approaches, e.g. genotyping-by-sequencing (GBS), provide thousands of single nucleotide polymorphisms (SNPs) covering most genomic regions of plant chromosomes.
High-throughput (next-generation) sequencing applies to:
genome sequencing
genome resequencing
transcriptome profiling (RNA-Seq)
DNA-protein interactions (ChIP-sequencing) and
epigenome characterization
Next Generation Sequencing (NGS)
27. Publications related to Next Generation Sequencing (NGS)
28. Population Structure, Relatedness (Kinship) and LD
The larger the genetic variation, the faster the LD decay and the finer the resolution, a direct consequence of broader historical recombination
LD declines quickly in outcrossing species, facilitating fine mapping of a trait.
LD blocks are more extended in self-pollinated crops than in cross-pollinated crops because the effective recombination rate is low in autogamous crops.
Population structure arises from the unequal distribution of alleles among subpopulations of different ancestries.
When these subgroups are sampled to construct a panel of lines for AM, the intentional or unintentional mixing of individuals with different allele frequencies creates false LD.
29. LD decays faster in a broad genetic base, so the mapping population should be composed of unrelated, diverse genotypes for fine mapping (higher resolution).
A high number of related individuals in a mapping population gives false-positive LD.
Genetic drift can create LD between closely linked loci. The effect is similar to taking a small sample from a large population.
Even if two loci are in linkage equilibrium, sampling only a few individuals can create LD
Allele mixing also increases LD by introducing new alleles into populations.
Population Structure, Relatedness (Kinship) and LD
30. Statistical Models
Population structure (Q matrix) can be estimated using the program STRUCTURE, and the kinship coefficients (K matrix) using SPAGeDi.
The general linear model (GLM) uses only one of Q or K.
The mixed linear model (MLM) can account for both population structure (Q) and kinship information (K).
The Q+K MLM performs better than any model that uses the Q or K matrix alone.
The compressed mixed linear model (CMLM) clusters the individuals into fewer groups based on the kinship among them.
31. Yu and Buckler (2006)
• Ideal sample with subtle population structure and familial relatedness: regression and genomic control (GC).
• Family-based sample: GC and mixed model (Q + K).
• Both population structure and familial relationships: SA, GC and mixed model (Q plus kinship matrix K).
• With population structure: structured association (SA) and GC.
• With severe population structure and familial relationships: methods unknown.
Different types of population encountered in association mapping studies and the relevant statistical solutions.
32. Association Detection and Validation
False Discovery Rate (FDR): The FDR is the proportion of positive results
that are actually false positives versus the whole set of positive results
obtained from a statistical test. FDR approaches may be most appropriate
when multiple traits are being compared or when the markers are not in
extensive LD
Bonferroni Correction: It controls type I error rate (α) for simultaneous
multiple testing by reducing false positives. To perform a Bonferroni
correction, divide the critical P value (α) by the number of comparisons being
made. For example, if 10 hypotheses are being tested, the new critical P value
would be α/10. Put simply, the probability of identifying at least one
significant result due to chance increases as more hypotheses are tested.
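The two corrections can be sketched as follows. The Bonferroni function follows the rule stated on this slide. For FDR the slide names no specific procedure, so the widely used Benjamini-Hochberg step-up procedure is assumed here; the P-values are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test critical P-value after Bonferroni correction."""
    return alpha / n_tests


def benjamini_hochberg(p_values, fdr=0.05):
    """Indices of tests declared significant by the Benjamini-Hochberg procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending P-values
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose P-value is below (rank / m) * fdr
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Hypothetical P-values for 10 marker tests (illustrative only)
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
bonf = bonferroni_threshold(0.05, len(p_vals))
print("Bonferroni hits:", [i for i, p in enumerate(p_vals) if p <= bonf])
print("BH (FDR 5%) hits:", benjamini_hochberg(p_vals))
```

On these made-up values Bonferroni keeps only the first marker, while the FDR procedure keeps the first two, which illustrates why FDR is the less conservative choice.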
33. Genomic coordinates (e.g. SNP positions) are displayed along the X-axis, with the negative logarithm of the association P-value for each single nucleotide polymorphism (SNP) displayed on the Y-axis, meaning that each dot on the Manhattan plot represents a SNP. Because the strongest associations have the smallest P-values (e.g., 10^-15), their negative logarithms will be the greatest (e.g., 15). The size of a data point and its height on the Y-axis scale relate directly to its significance: the larger the point and the higher it sits on the scale, the more significant the association with the trait.
Manhattan Plot
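A minimal sketch of the data preparation behind a Manhattan plot: each SNP's P-value is converted to −log10(P) and compared with a significance line. The SNP records are hypothetical, and placing the line at a Bonferroni-corrected 0.05 is an assumption made for illustration; studies may use other thresholds.

```python
import math


def manhattan_points(results, alpha=0.05):
    """results: list of (snp_id, chromosome, position_bp, p_value).

    Returns the threshold on the -log10 scale and one tuple per SNP:
    (snp_id, chromosome, position_bp, -log10(p), passes_threshold).
    """
    threshold = -math.log10(alpha / len(results))  # assumed Bonferroni line
    points = []
    # Order by chromosome, then position, as on the X-axis of the plot
    for snp_id, chrom, pos, p in sorted(results, key=lambda r: (r[1], r[2])):
        y = -math.log10(p)
        points.append((snp_id, chrom, pos, y, y >= threshold))
    return threshold, points


# Hypothetical association results (illustrative only)
results = [
    ("snp1", 1, 1000, 0.2),
    ("snp2", 1, 5000, 1e-15),
    ("snp3", 2, 300, 0.03),
    ("snp4", 2, 900, 0.001),
]
threshold, points = manhattan_points(results)
print(f"threshold on -log10 scale: {threshold:.2f}")
for snp_id, chrom, pos, y, hit in points:
    print(f"{snp_id} chr{chrom}:{pos}  -log10(P)={y:.2f}  significant={hit}")
```

The resulting (position, −log10 P) pairs can be passed to any plotting tool, with one colour per chromosome and a horizontal line at the threshold.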
34. Bioinformatics for Candidate QTL Identification
Sr. No. | Software | Focus | Website
1. | STRUCTURE 2.3 | Population structure | http://pritch.bsd.uchicago.edu/software.html
2. | BAPS 5.0 | Population structure | http://web.abo.fi/fak/mnf//mate/jc/software/baps.html
3. | mStruct | Population structure | http://www.cs.cmu.edu/~suyash/mstruct.html
4. | Haploview 4.2 | Haplotype analysis and LD | http://www.broad.mit.edu/mpg/haploview/
5. | TASSEL | Stratification, LD and AM | http://www.maizegenetics.net
6. | GenStat | Stratification, LD and AM | http://www.vsni.co.uk/
7. | JMP Genomics | Stratification, LD and structured AM | http://www.jmp.com/software/genomics
35. Sr. No. | Software | Focus | Website
8. | SVS7 | Stratification, LD and AM | http://goldenhelix.com
9. | EIGENSTRAT | PCA (as an alternative to population structure) and association | http://genepath.med.harvard.edu/~reich/Software.htm
10. | MTDFREML | Mixed model | http://aipl.arsusda.gov/curtvt/mtdfreml.html
11. | ASREML | Mixed model | http://www.vsni.co.uk/products/asreml
Bioinformatics for Candidate QTL Identification
37. Gene Identification
In GWAS, a trait-associated SNP is often not causal but is simply in LD with the causal SNP. Therefore, identification of the causal variant among GWAS signals becomes important.
The following approaches can be used to identify causal variants:
(i) fine mapping
(ii) localization success rate approach
(iii) conditional analysis
(a) Conditional analysis (cGWAS) at a locus or in an LD block
(b) Conditional analysis (cGWAS) of whole genome
(c) Network-based conditional analysis (cGWAS)
(d) Conditional analysis using GWAS summary statistics
38. Gene prediction is the prediction of protein-coding regions and of the functional sites of genes
Two classes of methods are generally adopted: one is based on sequence similarity searches, while the other, known as ab initio prediction, is based on gene structure and signal searches
GENE PREDICTION PROGRAMS
FIRST GENERATION: TestCode, GRAIL
SECOND GENERATION: SORFIND, Xpound
THIRD GENERATION: GeneID, GeneParser, GenLang, FGENEH
FOURTH GENERATION: GENSCAN, AUGUSTUS
GENE PREDICTION
39. Gene Characterization
Identification of causal markers and prioritization of associated markers should
generally be followed by identification and functional characterization of
candidate genes through bioinformatics analysis.
For validation and functional characterization, reverse genetics technologies are often used, in which the effect of variations/alterations in a gene on the phenotype is examined
Approaches used for gene characterization:
Targeting induced local lesions in genomes (TILLING)
Insertional mutagenesis, VIGS and RNAi
Genome editing and base editing
40. Case Studies-I
The haplotype block characterization showed 1268 blocks of different sizes spread along the genome, including highly conserved regions such as the 1BS chromosome arm, where the 1BL/1RS wheat/rye translocation is located.
Based on GWAS, ninety-seven chromosome regions were identified as associated with heading date, plant height, thousand grain weight, grain number per spike and fruiting efficiency at harvest (FEh).
In particular, FEh stands out as a promising trait to raise yield potential in Argentinean wheats
Fifteen haplotypes/markers associated with increased FEh values were detected, eleven of which showed significant effects in all three evaluated locations
In the case of adaptation, the Ppd-D1 gene is consolidated as the main determinant of the life cycle of Argentinean wheat cultivars
41. This work reveals the genetic structure of the Argentinean hexaploid wheat germplasm using a wide set of molecular markers anchored to the Ref Seq v1.0. Additionally, GWAS detects chromosomal regions (haplotypes) associated with important yield and adaptation components that will allow improvement of these traits through marker-assisted selection
Case Studies-I
43. Case Studies-II
• The core collection captures most of the diversity of pearl millet in Senegal.
• It includes 60 early-flowering Souna and 31 late-flowering Sanio morphotypes.
• Sixteen agro-morphological traits were evaluated in the panel in two seasons.
• Different phenological and phenotypic traits were used.
• Using GBS, 21,663 SNPs with a minor allele frequency (MAF) above 5% were discovered.
• Linkage groups LG 3 (~ 89.7 Mb) and LG 6 (~ 68.1 Mb) differentiated two clusters among the early-flowering Souna.
• Using GWAS, 18 genes were linked to phenotypic variation.
Linkage mapping, the most widely exploited approach, is very costly and time-consuming, evaluates few alleles and has low resolution.
It has identified very few QTLs
Fig. a represents strong linkage disequilibrium between two loci (locus 1 and 2), while Fig. b represents a significant association between loci in LD and the phenotype. When loci in LD are associated with a phenotype, it is called association mapping.
Factors such as genetic drift, population bottlenecks and gene flow can contribute to generating artificial LD and negatively impact the ability to use LD in AM for the precise localization of QTL
LD describes unequal haplotype frequencies in a population (P_AB ≠ P_A × P_B), where A and B are alleles at two different loci, P_AB is the frequency of haplotypes carrying both alleles (co-occurrence) at the two loci, and P_A and P_B are the frequencies of allele A and allele B, respectively.
Significant LD can occur between alleles at distant loci or even at different chromosomes, generated by different genetic factors other than linkage.
The figure shows a common ancestor and evolutionary variation emerged over time. The recent mutation is represented with red colour.
Combined strategies (Linkage + LD) used in NAM and MAGIC population
With common-parent-specific (CPS) markers (i.e., markers for which B73 has a rare allele) scored for both founders and RILs, the marker or sequence information nested between two flanking CPS markers can be predicted for RILs on the basis of marker or genome sequence available for the founders.
By choosing diverse founders, linkage disequilibrium within these chromosome segments resulting from historical/evolutionary recombination was mostly preserved in RILs due to the small probability of recombination (during RIL development) within the short genetic distances between flanking CPS markers
The figure compares mapping with doubled haploids (DH), recombinant inbred lines (RILs) and diverse germplasm. With DH, there is only one chance for recombination, so resolution is poor (about 10 cM). In RILs, a few recombination events during the segregating generations result in intermediate resolution (about 5 cM). In association mapping with diverse germplasm, fine resolution (about 1 cM) is possible because it exploits the many historical recombination events that occurred during the history of the germplasm.
Kinship: genetic relatedness
Q matrix
STRUCTURE software is typically used to estimate Q. Q is an n × p matrix, where n is the number of individuals and p is the number of defined subpopulations.
K matrix
SPAGeDi software is used to estimate K among individuals. K is an n × n matrix whose off-diagonal elements are Fij, a marker-based estimate of the probability of identity by descent. The diagonal elements of K are one for inbreds and 0.5 × (1 + Fx) for non-inbred individuals, where Fx is the inbreeding coefficient.
In CMLM, the kinship between pairs of individuals is replaced by the kinship between pairs of groups, which reduces the computational demand substantially.
This slide shows the various possible populations in terms of population structure and familial relatedness. Fig. a shows an ideal population with only subtle Q and K, so regression and GC can be applied for association mapping…..
P value and logP value :
Non-parametric estimate of the FDR is possible through comparison of distributions of P values against a set of markers of known influence and a set of random markers scored on the same association population, with simulations. The probability of false associations is simply the ratio of the proportion of significant associations detected in the random set to the proportion of significant associations detected in the simulated set of loci with known influence.
In order to increase the power of fine mapping that can differentiate among several SNPs in LD, a large population is required
LSR (localization success rate) is the probability of the causal SNP being top-ranked within an associated region. Often, LSR can be improved by using multiple populations in a joint analysis rather than a single large population
The LSR approach considers the following two issues when conducting the analysis: (i) the structure of LD in the population being studied, and (ii) identification of the population(s) achieving an increase in LSR for fine mapping.
Conditional analysis can identify a causal SNP from among many correlated variants within an LD block or a haplotype. Such conditional analysis may be conducted at the level of an individual locus representing a genomic region or at the whole-genome level
Fig. 1 a HB (haplotype block) sizes and positions in each wheat chromosome based on IWGSC Ref Seq v1.0 coordinates. Red HBs indicate sizes >30 Mb, orange HBs sizes between 10 and 30 Mb and green HBs sizes <10 Mb. b Schematic representation of the LD (linkage disequilibrium) detected on chromosome 1B. In red, the high level of LD in the short arm of the chromosome due to the presence of the 1BL/1RS wheat/rye translocation
a Manhattan plot on 21 wheat chromosomes for fruiting efficiency at harvest (FEh) in Balcarce 2013. The red line represents the GWAS threshold of P < 0.05 (−log10(P-value) = 1.3) and the blue line represents the GWAS threshold of P < 0.001 (−log10(P-value) = 3). The red square highlights Chr5A-B43-Hap1, associated with FEh in five of ten tested environments. b Haplotype block based on seven SNP markers located between 476.44 and 476.67 Mb on chromosome 5A, named Chr5A-B43. Four different haplotype variants (Hap1–Hap4) are present at different frequencies in the analyzed population. The red rectangle highlights Chr5A-B43-Hap1, associated with FEh. c Boxplots indicate the phenotype values corresponding to the four haplotype groups in the three evaluated locations. Hap1 was associated with significantly higher FEh in all locations