GWAS "GENOME WIDE ASSOCIATION STUDIES" A STEP AHEAD

Department of Genetics and Plant Breeding, College of Agriculture,
C.S.K.H.P. Krishi Vishvavidyalaya,
Palampur-176062
Understanding Different Components of GWAS to Increase Its Efficiency
in Crop Improvement
Speaker:
Om Prakash Raigar
(Genetics and Plant Breeding)

Content
• What is QTL Mapping
• Association Mapping
• Genome Wide Association Mapping
• Components of GWAS
• Applications of GWAS
• Conclusion

Quantitative Trait Loci
• Quantitative traits are controlled by a large number of polygenes, each of which has a small individual effect on the phenotype; together these effects are cumulative.
• Efforts to physically localize polygenes began when Sax (1923) reported linkage between seed coat color and seed size in common bean (Phaseolus vulgaris), which are qualitative and quantitative traits, respectively.
• QTL analysis identifies the genomic regions associated with the expression of a quantitative trait; such a genomic region is referred to as a quantitative trait locus (QTL).
• Phenotypic evaluation of the mapping population for QTL analysis should be performed at multiple locations, since evaluation at a single location may underestimate the total number of QTLs controlling the trait concerned.

QTL Mapping
[Flow diagram: Parent 1 × Parent 2 → mapping population (RILs, F2, DH, BC lines) → genotyping with molecular markers and phenotypic evaluation → linking trait data with marker data using mapping software → trait QTL mapped on chromosome]

Disadvantages of QTL Mapping
• Requires a mapping population (F2, BC, DH, RIL, NIL)
• Few alleles per locus
• Low resolution (5-10 cM) due to few recombination events
• Additional steps required to narrow the QTL
• Difficult to discover causative genes

Genome Wide Association Mapping
Alqudah et al., 2020

Why Association Mapping
Association mapping was first used in a plant species by Thornsberry et al. (2001).
The objective of gene mapping is to find molecular markers tightly linked to, or associated with, the genes governing quantitative traits, making marker-assisted selection feasible.
Two approaches are mostly used for genetic mapping:
(i) Linkage mapping
(ii) Association mapping (utilizes linkage disequilibrium)
“Association mapping is a population-based survey used to identify trait-marker relationships based on linkage disequilibrium by exploiting historical and evolutionary recombination events.”
Linkage disequilibrium was first defined by Jennings in 1917 and quantified in 1964 by Lewontin.

Zhu et al. (2008)
Linkage Mapping Association Mapping
• Both linkage analysis and association mapping rely on co-inheritance of functional polymorphisms and neighboring DNA variants. However, in some cases, DNA variants associated with the traits might be on different chromosomes.
• In linkage analysis using an F2 generation, there are only a few opportunities for recombination, resulting in relatively low mapping resolution.
• Association mapping, in contrast, utilizes historical recombination and natural genetic diversity, resulting in high-resolution mapping.

Principle of Association Mapping
“Linkage disequilibrium” (LD) is the non-random association between alleles at different loci (Jennings, 1917).
“Association mapping” refers to the significant association of a marker locus with a phenotypic trait.
Soto-Cerda and Cloutier, 2009

Factors Creating Linkage Disequilibrium
1. Mutation
2. Population bottleneck and genetic drift
3. Selection
4. Population structure

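LD Quantification
LD is the difference between the observed gamete (haplotype) frequency and the frequency expected under independence:
$D = P_{AB} - P_A P_B$
where D is the coefficient of LD, $P_{AB}$ is the frequency of gametes carrying alleles A and B at the two loci, and $P_A$ and $P_B$ are the frequencies of alleles A and B, respectively; their product $P_A P_B$ is the expected gamete frequency.
At linkage equilibrium (L.E.), D = 0
At linkage disequilibrium (L.D.), D ≠ 0
Loci  | B    | b    | Total
A     | P_AB | P_Ab | P_A
a     | P_aB | P_ab | P_a
Total | P_B  | P_b  | 1.0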
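As a minimal sketch of the definition of D, the following Python function computes it from the four gamete (haplotype) counts of the 2 × 2 table. The function name and the example counts are illustrative assumptions and do not come from the slides.

```python
def ld_coefficient(n_AB, n_Ab, n_aB, n_ab):
    """Return D = P_AB - P_A * P_B from four gamete counts (hypothetical helper)."""
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("at least one gamete must be observed")
    p_AB = n_AB / total            # observed frequency of the AB gamete
    p_A = (n_AB + n_Ab) / total    # marginal frequency of allele A
    p_B = (n_AB + n_aB) / total    # marginal frequency of allele B
    return p_AB - p_A * p_B


# Illustrative counts only: equal counts in all four cells correspond to
# linkage equilibrium, while a sample with only AB and ab gametes does not.
print(ld_coefficient(25, 25, 25, 25))
print(ld_coefficient(50, 0, 0, 50))
```

The sign of D depends on which alleles are labelled A and B, which is one reason standardized statistics are usually reported instead of raw D.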

LD Decay
• For unlinked loci, LD decays by one-half with each generation of random mating; for linked loci with recombination fraction c, it decays by a factor of (1 − c) per generation
• Thus, LD declines as the number of generations increases, so that in diverse populations LD is limited to small distances
• LD tends to decay with genetic distance between the loci under consideration
Soto-Cerda and Cloutier, 2009
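The decay of LD under random mating can be sketched with the standard population-genetics relation D_t = D_0 (1 − c)^t, where c is the recombination fraction between the two loci and t the number of generations. This relation is not written out on the slide; the function names below are hypothetical and the example values are purely illustrative.

```python
import math


def ld_after_generations(d0, c, t):
    """LD remaining after t generations of random mating: D_t = D_0 * (1 - c)**t."""
    if not 0.0 <= c <= 0.5:
        raise ValueError("recombination fraction c must lie in [0, 0.5]")
    return d0 * (1.0 - c) ** t


def generations_to_halve(c):
    """Smallest whole number of generations after which D is at most half its start value."""
    if not 0.0 < c <= 0.5:
        raise ValueError("c must lie in (0, 0.5]")
    return math.ceil(math.log(0.5) / math.log(1.0 - c))


# Unlinked loci (c = 0.5) lose half of their LD in a single generation,
# whereas tightly linked loci (here c = 0.01, an illustrative value) keep it far longer.
print(ld_after_generations(0.25, 0.5, 1))
print(generations_to_halve(0.01))
```

This is why markers close to a causal variant can remain associated with it over many generations, which association mapping exploits.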

LD Statistics
• The two most widely used statistics for LD are
• D′ (Lewontin, 1964) and
• r², the square of the correlation coefficient between two loci (Hill & Robertson, 1968)
• D′ reflects only the recombinational history, whereas r² summarizes both recombinational and mutational history
Soto-Cerda and Cloutier, 2009
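A minimal sketch of how the two statistics are commonly computed for a pair of biallelic loci is given below. It assumes the usual textbook definitions, which the slide does not spell out: |D′| = |D| / D_max, where D_max is the largest |D| possible for the given allele frequencies, and r² = D² / (P_A (1 − P_A) P_B (1 − P_B)). The function name and input values are hypothetical.

```python
def ld_statistics(p_AB, p_A, p_B):
    """Return (D, |D'|, r^2) for two biallelic loci given haplotype and allele frequencies."""
    if not (0.0 < p_A < 1.0 and 0.0 < p_B < 1.0):
        raise ValueError("both loci must be polymorphic")
    d = p_AB - p_A * p_B
    # D_max depends on the sign of D and on the allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max
    r_squared = d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return d, d_prime, r_squared


# Illustrative frequencies only.
d, d_prime, r2 = ld_statistics(p_AB=0.4, p_A=0.5, p_B=0.6)
print(d, d_prime, r2)
```

In this example |D′| is larger than r², which is typical when the allele frequencies at the two loci differ.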

Statistical Significance of LD
• The statistical significance of LD is typically determined using a χ² test of a 2 × 2 contingency table
• A p-value threshold of 0.05 is often used to declare lack of independence of alleles at two loci, thus suggesting association
Computer Software
• “Graphical Overview of Linkage Disequilibrium” (GOLD)
• “Trait Analysis by aSSociation, Evolution and Linkage” (TASSEL)
• PowerMarker
Gupta et al., 2005
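The χ² test on a 2 × 2 table of gamete counts can be sketched in plain Python as follows. This is an illustrative implementation of the Pearson statistic without continuity correction, not the routine used by GOLD, TASSEL or PowerMarker; the function name and the counts are assumptions.

```python
import math


def ld_chi_square(n_AB, n_Ab, n_aB, n_ab):
    """Pearson chi-square (1 df, no continuity correction) for a 2 x 2 gamete table."""
    n = n_AB + n_Ab + n_aB + n_ab
    row_A, row_a = n_AB + n_Ab, n_aB + n_ab
    col_B, col_b = n_AB + n_aB, n_Ab + n_ab
    denom = row_A * row_a * col_B * col_b
    if denom == 0:
        raise ValueError("both loci must be polymorphic in the sample")
    chi2 = n * (n_AB * n_ab - n_Ab * n_aB) ** 2 / denom
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Illustrative counts only; LD would be declared if p_value falls below 0.05.
chi2, p_value = ld_chi_square(30, 20, 20, 30)
print(chi2, p_value)
```

With small expected counts, an exact test is usually preferred over the χ² approximation.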

• GWAS helps identify candidate genes for many important traits in crops and other organisms, because it tests the association between marker genotype (e.g., SNP) and the phenotype of a target trait.
• GWAS involves linkage disequilibrium (LD) based association mapping.
Aim:
• To identify marker-trait associations, one trait at a time
• To study the genetic architecture of the trait, which involves identifying all QTLs/genes and the interactions among the QTLs identified through GWAS
• To identify marker-trait associations for use in marker-assisted selection

Mapping Population

QTL Mapping Population
Experimental population
F2
Backcross
RIL
Doubled haploid line
AIL
BIL
MAGIC
NAM

Nested Association Mapping
• Linkage analysis maps broad chromosome regions with relatively low marker coverage (coarse mapping), while association mapping offers high resolution with very high marker coverage (fine mapping).
• An integrated mapping strategy such as NAM combines the advantages of both approaches to improve mapping resolution without requiring excessively dense marker maps.
Steps to develop NAM population:
1. Selection of diverse founders and developing recombinant inbred lines (RILs).
2. Dense genotyping of founders
3. Genotyping a smaller number of tagging markers on both the founders and the progenies
to project the high-density marker information from the founders to the progenies
4. Phenotyping progenies for various complex traits
5. Conducting genome-wide association mapping

Nested Association Mapping (NAM) population development
[Figure: cross between 25 diverse founders and the common parent (B73), followed by selfing and single-seed descent to produce immortalized RILs.]
Because the chances of recombination diminish over short genetic distances and a given number of generations, the genomes of these immortalized RILs are mosaics of the founder genomes.
Legend: X, crossing; ⊗, selfing; SSD, single-seed descent.
Yu et al. (2008)

Multiparent Advanced Generation Intercross (MAGIC)
• MAGIC: a recombinant inbred line (RIL) population created from multiple founder lines, in which the genomes of the founders are first mixed by several rounds of mating and subsequently inbred by the single-seed descent method to generate a stable panel of inbred lines (RILs).
• More parental accessions increase the allelic diversity, potentially increasing the number of QTLs that segregate in the population.
• The successive rounds of recombination cause LD to decay, thereby increasing the precision and resolution of QTL location (Mackay & Powell, 2007).

Development of MAGIC Population for Indica
Immortalized RILs

Comparison of mapping population and resolution
• One chance for recombination: poor resolution
• Many recombination events during generation advancement: intermediate resolution
• Many historical recombination events: fine resolution
Soto-Cerda and Cloutier (2012)

DNA Sequencing
• The process of determining the accurate order of nucleotides along chromosomes and genomes
• It includes any method or technology used to determine the order of the four bases (adenine, guanine, cytosine, and thymine) in a strand of DNA
• The first genome sequenced was that of bacteriophage ΦX174, in 1977
• The genome of Haemophilus influenzae was sequenced in 1995
• The first eukaryotic genome, that of Saccharomyces cerevisiae, was sequenced in 1996

Next Generation Sequencing (NGS)
Next-generation sequencing approaches, e.g. genotyping-by-sequencing (GBS), provide thousands of single nucleotide polymorphisms (SNPs) covering most genomic regions of plant chromosomes.
High-throughput (next-generation) sequencing applies to
• genome sequencing
• genome resequencing
• transcriptome profiling (RNA-Seq)
• DNA-protein interactions (ChIP-sequencing) and
• epigenome characterization

Publications related to Next Generation Sequencing (NGS)

Population Structure, Relatedness (Kinship) and LD
• The larger the genetic variation, the faster the LD decay and the finer the resolution, a direct consequence of broader historical recombination.
• LD declines quickly in out-crossing species, facilitating fine mapping of a trait.
• LD blocks are more extended in self-pollinated crops than in cross-pollinated crops, because of the low effective recombination rate in autogamous crops.
• Population structure arises from the unequal distribution of alleles among subpopulations of different ancestries.
• When these subgroups are sampled to construct a panel of lines for association mapping (AM), the intentional or unintentional mixing of individuals with different allele frequencies creates false LD.
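How mixing subgroups creates false LD can be illustrated numerically. The sketch below pools subpopulations that are each in linkage equilibrium (D = 0 within every subgroup) and computes D in the pooled sample. The function name, mixing weights and allele frequencies are hypothetical values chosen only for illustration.

```python
def mixture_ld(subpops):
    """D in a pooled sample; each entry is (weight, p_A, p_B) for a subpopulation with D = 0."""
    total_w = sum(w for w, _, _ in subpops)
    p_A = sum(w * a for w, a, _ in subpops) / total_w
    p_B = sum(w * b for w, _, b in subpops) / total_w
    # Within each subpopulation P_AB = P_A * P_B because the loci are independent there.
    p_AB = sum(w * a * b for w, a, b in subpops) / total_w
    return p_AB - p_A * p_B


# Two equally sized subgroups with contrasting allele frequencies at both loci.
pooled = mixture_ld([(1, 0.9, 0.9), (1, 0.1, 0.1)])
print(pooled)
```

Although neither subgroup shows any LD on its own, the pooled panel does, so a marker can appear associated with a trait locus it is not linked to. This is the motivation for correcting association tests for population structure.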

Population Structure, Relatedness (Kinship) and LD
• LD decays faster in a population with a broad genetic base, so the mapping population should be composed of unrelated, diverse genotypes for fine mapping or higher resolution.
• A high number of related individuals in the mapping population gives false-positive LD.
• Genetic drift can create LD between closely linked loci. The effect is similar to taking a small sample from a large population.
• Even if two loci are in linkage equilibrium, sampling only a few individuals can create LD.
• Admixture (allele mixing) also increases LD by introducing new alleles into populations.

Statistical Models
• Population structure (Q-matrix) can be estimated using the program STRUCTURE; the kinship coefficient matrix (K-matrix) is estimated from marker data, for example in TASSEL.
• The general linear model (GLM) accounts for population structure only, typically by including the Q-matrix as a fixed covariate.
• The mixed linear model (MLM) can account for both population structure (Q) and kinship (K).
• The Q + K MLM performs better than models that use the Q- or K-matrix alone.
• The compressed mixed linear model (CMLM) clusters individuals into fewer groups based on the kinship among them.

Different types of population encountered in association mapping studies and the relevant statistical solutions (Yu and Buckler, 2006)
Sample type | Statistical methods
Ideal sample with subtle population structure and familial relatedness | Regression and genomic control (GC)
Family-based sample | GC and mixed model (Q + K)
Sample with population structure | Structured association (SA) and GC
Sample with both population structure and familial relationships | SA, GC, mixed model (Q plus kinship matrix K)
Sample with severe population structure and familial relationships | Methods unknown
Association Detection and Validation
 False Discovery Rate (FDR): The FDR is the proportion of positive results
that are actually false positives versus the whole set of positive results
obtained from a statistical test. FDR approaches may be most appropriate
when multiple traits are being compared or when the markers are not in
extensive LD
 Bonferroni Correction: It controls type I error rate (α) for simultaneous
multiple testing by reducing false positives. To perform a Bonferroni
correction, divide the critical P value (α) by the number of comparisons being
made. For example, if 10 hypotheses are being tested, the new critical P value
would be α/10. Put simply, the probability of identifying at least one
significant result due to chance increases as more hypotheses are tested.
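The two corrections can be sketched in a few lines of Python. The Bonferroni part follows the α divided by number-of-tests rule described above. For FDR control the sketch uses the Benjamini-Hochberg step-up procedure, which is one common FDR method and an assumption here, since the slide does not name a specific procedure. Function names and p-values are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test critical P value that keeps the family-wise type I error at alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans marking which tests are declared significant."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * fdr.
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            largest_rank = rank
    significant = [False] * m
    for idx in order[:largest_rank]:
        significant[idx] = True
    return significant


# Illustrative p-values for 10 marker-trait tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
cutoff = bonferroni_threshold(0.05, len(p_values))
print([p <= cutoff for p in p_values])
print(benjamini_hochberg(p_values, fdr=0.05))
```

On these values Bonferroni keeps fewer tests than the FDR procedure, reflecting the general trade-off: Bonferroni is stricter about false positives, while FDR control retains more power when many markers are tested.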

Manhattan Plot
Genomic coordinates (e.g., SNP positions) are displayed along the X-axis, and the negative logarithm of the association P-value for each single nucleotide polymorphism (SNP) is displayed on the Y-axis, so each dot on the Manhattan plot represents a SNP. Because the strongest associations have the smallest P-values (e.g., $10^{-15}$), their negative logarithms are the greatest (e.g., 15). The height of a data point on the Y-axis scale, and in some plotting software its size, relates directly to its significance: the larger and higher the point, the more significant the association with the trait.
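The Y-axis transformation of a Manhattan plot can be sketched as follows. The code only prepares the plotting coordinates and the height of a Bonferroni significance line; the function names, chromosome labels and P-values are hypothetical, and the actual drawing would be done with whichever plotting tool is at hand.

```python
import math


def manhattan_points(results):
    """Convert (chromosome, position, p_value) tuples to (chromosome, position, -log10 p)."""
    return [(chrom, pos, -math.log10(p)) for chrom, pos, p in results if p > 0]


def bonferroni_line(alpha, n_snps):
    """Y-axis height of a Bonferroni significance line for n_snps tests."""
    return -math.log10(alpha / n_snps)


# Illustrative association results for three SNPs.
results = [("1A", 100, 1e-15), ("1A", 200, 0.5), ("2B", 50, 1e-3)]
points = manhattan_points(results)
threshold = bonferroni_line(0.05, 1000)
for chrom, pos, y in points:
    print(chrom, pos, round(y, 2), y > threshold)
```

SNPs whose points rise above the threshold line are the candidates carried forward to validation and candidate gene identification.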

Bioinformatics for Candidate QTL Identification
Sr. No. | Software | Focus | Website
1. | STRUCTURE 2.3 | Population structure | http://pritch.bsd.uchicago.edu/software.html
2. | BAPS 5.0 | Population structure | http://web.abo.fi/fak/mnf/mate/jc/software/baps.html
3. | mStruct | Population structure | http://www.cs.cmu.edu/~suyash/mstruct.html
4. | Haploview 4.2 | Haplotype analysis and LD | http://www.broad.mit.edu/mpg/haploview/
5. | TASSEL | Stratification, LD and AM | http://www.maizegenetics.net
6. | GenStat | Stratification, LD and AM | http://www.vsni.co.uk/
7. | JMP Genomics | Stratification, LD and structured AM | http://www.jmp.com/software/genomics
8. | SVS7 | Stratification, LD and AM | http://goldenhelix.com
9. | EIGENSTRAT | PCA (as an alternative to population structure) and association | http://genepath.med.harvard.edu/~reich/Software.htm
10. | MTDFREML | Mixed model | http://aipl.arsusda.gov/curtvt/mtdfreml.html
11. | ASREML | Mixed model | http://www.vsni.co.uk/products/asreml

Post GWAS activities and Approaches

Gene Identification
• In GWAS, often a trait-associated SNP is not causal, but is simply in LD with the causal SNP. Therefore, identification of the causal variant among GWAS signals becomes important.
 For identification of causal variants following approaches can be utilized:
(i) fine mapping
(ii) localization success rate approach
(iii) conditional analysis
(a) Conditional analysis (cGWAS) at a locus or in an LD block
(b) Conditional analysis (cGWAS) of whole genome
(c) Network-based conditional analysis (cGWAS)
(d) Conditional analysis using GWAS summary statistics

GENE PREDICTION
• Gene prediction covers the prediction of protein-coding regions and of the functional sites of genes
• Two classes of methods are generally adopted: one is based on sequence similarity searches, while the other is based on gene structure and signal searches, also known as ab initio prediction
GENE PREDICTION PROGRAMS
First generation: TestCode, GRAIL
Second generation: SORFIND, Xpound
Third generation: GeneID, GeneParser, GenLang, FGENEH
Fourth generation: GENSCAN, AUGUSTUS

Gene Characterization
 Identification of causal markers and prioritization of associated markers should
generally be followed by identification and functional characterization of
candidate genes through bioinformatics analysis.
 But for validation and functional characterization, reverse genetics
technologies are often used, where the effect of variations/alterations in a
gene on phenotype is examined
Approaches used for gene characterization:
 Targeting induced local lesions in genomes (TILLING)
 Insertional mutagenesis, VIGS and RNAi
 Genome editing and base editing

Case Studies-I
The haplotype block characterization showed 1268 blocks of different sizes spread along the genome, including highly conserved regions such as the 1BS chromosome arm, where the 1BL/1RS wheat/rye translocation is located.
Based on GWAS, ninety-seven chromosome regions were identified as associated with heading date, plant height, thousand grain weight, grain number per spike and fruiting efficiency at harvest (FEh).
In particular, FEh stands out as a promising trait to raise yield potential in Argentinean wheats.
Fifteen haplotypes/markers associated with increased FEh values were detected, eleven of which showed significant effects in all three evaluated locations.
In the case of adaptation, the Ppd-D1 gene is consolidated as the main determinant of the life cycle of Argentinean wheat cultivars.

This work reveals the genetic structure of the Argentinean hexaploid wheat germplasm using a wide set of molecular markers anchored to RefSeq v1.0. Additionally, GWAS detects chromosomal regions (haplotypes) associated with important yield and adaptation components that will allow improvement of these traits through marker-assisted selection.

Case Studies-II
• The core collection captures most of the diversity of pearl millet in Senegal.
• It includes 60 early-flowering Souna and 31 late-flowering Sanio morphotypes.
• Sixteen agro-morphological traits were evaluated in the panel over two seasons.
• Different phenological and phenotypic traits were used.
• Using GBS, 21,663 SNPs with a minor allele frequency (MAF) above 5% were discovered.
• Regions on linkage groups LG 3 (~89.7 Mb) and LG 6 (~68.1 Mb) differentiated two clusters among the early-flowering Souna.
• Using GWAS, 18 genes were linked to phenotypic variation.


Role of GWAS in Crop Improvement
