Seminar-II
Maruthi Prasad B P
II PhD
PAMB 1066
Dept. of GPB, UASB
POPULATION INCREASE!!!!
CLIMATE CHANGE!!!!
Biotic and Abiotic stresses!!!!
• Conventional breeding has been very successful in developing high-yielding crop varieties
• It is important to accelerate the pace of crop improvement programmes, especially for complex traits such as yield under stress conditions (Varshney et al., 2005)
Genetic Variations
Trait Improvement
Efficient Breeding
Environmental Resilience
Marker allelic variations within the genome of a single species
SNP
GAATTC
CTTAAG
GAACTC
CTTGAG
InDels
CATCGCGAATTCCCATCG
GTAGCGCTTAAGGGTAGC
CATCG----------------CATCG
GTAGC----------------GTAGC
SSR
ACTGTCGACACACACACACGCTAGCT
TGACAGCTGTGTGTGTGTGCGATCGA
ACTGTCGACACACACACACACACACGCTAGCT
TGACAGCTGTGTGTGTGTGTGTGTGCGATCGA
ACTGTCGACACACACACACACACACACACACACGCTAGCT
TGACAGCTGTGTGTGTGTGTGTGTGTGTGTGTGCGATCGA
1. Single nucleotide polymorphisms – SNPs
2. Segmental/nucleotide insertions/deletions - InDels
3. Differences in the number of tandem repeats at a locus – SSRs
Mammadov et al., 2012
Depending on detection method and throughput
1. Low-throughput, hybridization-based markers : RFLPs
2. Medium-throughput, PCR-based markers: RAPD, AFLP, SSRs
3. High-throughput (HTP) sequence-based markers: SNPs
Targeting genetic variants associated with agronomic traits and identifying the important underlying candidate genes has become a key area of crop genetic research.
Single Nucleotide Polymorphism
• A single nucleotide polymorphism (SNP, pronounced "snip") is a genetic variation in which a single nucleotide (i.e., A, T, C, or G) is altered and the change is maintained through heredity.
• SNP: single DNA base variation found in >1% of the population
• Mutation: single DNA base variation found in <1% of the population
SNP: C T T A G C T T (94%) vs. C T T A G T T T (6%)
Mutation: C T T A G C T T (99.9%) vs. C T T A G T T T (0.1%)
Mutations and SNPs
[Figure: genealogy from a common ancestor to the present (time axis), showing the observed genetic variations as mutations and SNPs]
Single Nucleotide Polymorphism
• A SNP is usually assumed to be a binary (biallelic) variable
• The probability of a repeat mutation at the same SNP locus is quite small
• Tri-allelic cases are usually attributed to genotyping errors
• The nucleotide at a SNP locus is called
• the major allele (if its allele frequency is > 50%)
• the minor allele (if its allele frequency is < 50%)
Example: A C T T A G C T T (94%; T is the major allele) vs. A C T T A G C T C (6%; C is the minor allele)
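The definitions above (major vs. minor allele, and the 1% cutoff separating a SNP from a rare mutation) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `classify_locus`, the `snp_threshold` parameter, and the sample data are assumptions made for the example, not part of the original slides.

```python
from collections import Counter


def classify_locus(alleles, snp_threshold=0.01):
    """Summarise one locus from a list of observed alleles (hypothetical helper)."""
    counts = Counter(alleles)
    total = sum(counts.values())
    # Rank alleles from most to least frequent.
    ranked = counts.most_common()
    major, _ = ranked[0]
    if len(ranked) == 1:
        # Monomorphic locus: there is no minor allele, so it is not a SNP.
        return {"major": major, "minor": None, "maf": 0.0, "is_snp": False}
    minor, minor_n = ranked[1]
    maf = minor_n / total  # minor allele frequency
    # Slide convention: variation found in >1% is a SNP, otherwise a rare mutation.
    return {"major": major, "minor": minor, "maf": maf, "is_snp": maf > snp_threshold}


# Slide example: 94% of chromosomes carry T and 6% carry C at the locus.
common = classify_locus(["T"] * 94 + ["C"] * 6)
# Hypothetical rare variant: 1 chromosome in 1000 carries T.
rare = classify_locus(["C"] * 999 + ["T"])
```

With the slide's 94%/6% example, `common` reports T as the major allele and C as the minor allele, while the 0.1% variant in `rare` falls below the 1% cutoff.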
Single Nucleotide Polymorphism
►SNPs are found in coding and (mostly) noncoding regions
►They occur at a very high frequency: from about 1 in 1,000 bases to 1 in 100 to 300 bases
►SNP genotyping is easily automated
►A SNP close to a particular gene can act as a marker for that gene
►SNPs have become the preferred markers for association studies because of their high abundance and the availability of high-throughput SNP genotyping technologies.
Genotypes
The use of haplotype information has been limited because an individual's genome is diploid. To obtain haplotype data, the two chromosome copies have to be separated first.
• In large sequencing projects, genotypes instead of haplotypes are collected because of cost considerations.
[Figure: haplotype data (e.g., A T and C G at SNP1 and SNP2) compared with genotype data (A/C at SNP1, G/T at SNP2)]
Problems of Genotypes
• Genotypes only tell us the alleles at each SNP locus
• But we don’t know the connection of alleles at different SNP
loci
• There could be several possible haplotypes for the same
genotype
Example: the genotype data A/C at SNP1 and G/T at SNP2 can arise from the haplotype pair A T + C G or from the haplotype pair A G + C T. We don't know which haplotype pair is real.
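The ambiguity described above can be made concrete by enumerating every haplotype pair that is consistent with an unphased genotype. The following is a minimal sketch; the function name `compatible_haplotype_pairs` and the input encoding (one unordered allele pair per SNP) are assumptions chosen for illustration.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """List all haplotype pairs that could produce an unphased genotype.

    genotype: list of (allele1, allele2) tuples, one per SNP locus.
    """
    pairs = set()
    # For each locus, either allele may sit on the first chromosome copy.
    orientations = [((a, b), (b, a)) for a, b in genotype]
    for choice in product(*orientations):
        h1 = "".join(locus[0] for locus in choice)
        h2 = "".join(locus[1] for locus in choice)
        # The two chromosome copies are unordered, so normalise the pair.
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


# Slide example: A/C at SNP1 and G/T at SNP2.
pairs = compatible_haplotype_pairs([("A", "C"), ("G", "T")])
```

For the slide's genotype this yields the two candidate pairs (AG, CT) and (AT, CG). In general, a genotype with k heterozygous loci is compatible with 2^(k-1) haplotype pairs, which is why phasing becomes harder as more SNPs are considered.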
“Haplotype-led approaches for
increasing precision in plant
breeding”
Outline of Presentation: Haplotype Mapping
01 Introduction
02 Haplotype construction and inference
03 Tag SNPs and methods to select tSNPs
04 Application of haplotype-led approaches in plant breeding
05 Case studies
06 Conclusion
• A haplotype is a group of genes in an organism that are
inherited together from a single parent in a defined order
(Bevan et al., 2017)
• These variants tend to be inherited together, often because they are
very close together in the same chromosome region and therefore
less likely to be separated by crossing over (Snowdon et al.,
2015)
Haplotype: a string of SNPs that are linked and co-inherited together ("polymorphic frozen blocks")
[Figure: alleles at each locus combined into haplotypes]
Haplotypes
In terms of SNP-
“Two or more SNP alleles that tend to be inherited as a
unit” (Bernardo, 2010)
• A haplotype stands for a set of linked SNPs on the same
chromosome not easily separable by recombination
• Within each block, recombination is rare due to tight linkage
Haplotype | Sequence | SNP1 SNP2 SNP3
Haplotype 1 | -A C T T A G C T T- | C A T
Haplotype 2 | -A A T T T G C T C- | A T C
Haplotype 3 | -A C T T T G C T C- | C T C
Haplotype blocks
Haplotype blocks are defined as contiguous series of SNPs that show very little evidence of historical recombination among the individuals (Gabriel et al., 2002)
Recombination Hotspots and Haplotype Blocks
[Figure: a chromosome partitioned into haplotype blocks separated by recombination hotspots; haplotype patterns P1 to P4 across SNP loci S1 to S12, with major and minor alleles marked]
A Haplotype Block Example
• Patil et al. (Science, 2001) partitioned human chromosome 21 into 4,135 haplotype blocks covering 24,047 SNPs
• Blue box: major allele
• Yellow box: minor allele
HapMap
The HapMap is a map of the haplotype blocks and of the specific SNPs that identify the haplotypes.
The haplotype map, or "HapMap", acts as a tool to find genes and genetic variations that affect trait expression.
Source: The International HapMap Project
Third generation sequencing: Alleviating the bottlenecks
in haplotype identification
[Figure: haplotype identification with NGS technology compared with TGS technology]
Different phasing methods for haplotype construction/reconstruction
1. Reference-based phasing
2. De novo genome assembly (such as diploid and polyploid assembly)
3. Strain-resolved metagenome assembly (de novo re-assembly, single nucleotide variant-based assembly, read and contig binning)
Tools for Haplotype analysis
• WhatsHap
• HapCut2
• WhatsHap polyphase
• HapTree
• Falcon phase
• Hifiasm
• SDip
• POLYTE
• DESMAN
• MetaMaps
• ProxiMeta
Haplotagging-A novel sequencing strategy for
rapid discovery of haplotypes
Steps in HapMap construction
1. SNPs are identified in DNA samples from multiple individuals
2. Adjacent SNPs that are inherited together are compiled into haplotypes
3. "Tag" SNPs that uniquely describe those haplotypes are identified within the haplotypes
Source: The International HapMap Project
Haplotype blocking
1. Confidence interval test
2. Four gamete test
3. Solid spine of linkage disequilibrium
Saad et al., 2018
Confidence interval test
• Up to 5% of SNP pairs with weak LD are allowed within a haplotype block because, in addition to recombination events, forces such as recurrent mutation, gene conversion, or errors in genome assembly or genotyping can also weaken LD (Saad et al., 2018)
Four gamete test
A haplotype block partitioning method that assumes recombination events are not allowed within a block
• Three gametes observed for a SNP pair = no recombination: haplotype block
• Four gametes observed for a SNP pair = a recombination event has occurred: no blocking
• The rare gamete must have a frequency > 0.01 to count as a recombination event
• Recombination events are only accepted between blocks
Three-gamete condition (SNP1 SNP2): A C, G C, G T
Four-gamete condition (SNP1 SNP2): A C, G C, G T, A T
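The four gamete test for a single SNP pair can be sketched as below. This is an illustrative sketch only: the function name `four_gametes_observed` and the representation of haplotypes as strings are assumptions, while the 0.01 rare-gamete frequency cutoff follows the slide.

```python
from collections import Counter


def four_gametes_observed(haplotypes, i, j, min_freq=0.01):
    """Return True if all four gametes are seen for SNPs i and j.

    haplotypes: list of phased haplotype strings, one character per SNP.
    A gamete only counts if its frequency exceeds min_freq.
    """
    counts = Counter((hap[i], hap[j]) for hap in haplotypes)
    total = len(haplotypes)
    counted = [gamete for gamete, n in counts.items() if n / total > min_freq]
    # Four distinct gametes imply a historical recombination event
    # between the two SNPs, so they cannot share a block.
    return len(counted) == 4


three = ["AC", "GC", "GT"]        # three-gamete condition from the slide
four = ["AC", "GC", "GT", "AT"]   # four-gamete condition from the slide
# Hypothetical sample where the fourth gamete is too rare to count (1/200).
rare_fourth = ["AC"] * 100 + ["GC"] * 50 + ["GT"] * 49 + ["AT"]
```

Here `four_gametes_observed(three, 0, 1)` is false (the pair can stay in one block), whereas the four-gamete sample triggers a block boundary. The `rare_fourth` sample shows the effect of the frequency cutoff.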
Solid spine of linkage disequilibrium
A scenario in which a SNP marker exhibits strong and consistent associations with the surrounding SNPs, indicating the presence of a stable haplotype block.
The solid spine is a line of strong LD (> 0.8) that runs from one marker to the next along the legs of the triangle in the LD chart and defines a particular haplotype block.
SNP# | SNP1 | SNP2 | SNP3 | SNP4 | SNP5
SNP1 | - | 0.97 | 0.99 | 0.93 | 0.96
SNP2 |  | - | 0.18 | 0.67 | 0.98
SNP3 |  |  | - | 0.03 | 0.94
SNP4 |  |  |  | - | 0.95
SNP5 |  |  |  |  | -
Strong LD is observed between the first SNP and the last SNP, and between each of them and all the intermediate SNPs.
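The solid spine criterion can be checked on the pairwise LD values of the example table with a short sketch. The function name `is_solid_spine_block` and the dictionary layout are assumptions for illustration, and the table values are assumed to be D´ values compared against the 0.8 threshold mentioned on the slide.

```python
def is_solid_spine_block(ld, n_snps, threshold=0.8):
    """Check the solid-spine criterion on pairwise LD values.

    ld maps (i, j) with i < j (1-based SNP indices) to the LD value.
    Only the two 'legs' are tested: the first SNP against all others,
    and the last SNP against all others.
    """
    first, last = 1, n_snps
    first_leg = all(ld[(first, k)] > threshold for k in range(2, n_snps + 1))
    last_leg = all(ld[(k, last)] > threshold for k in range(1, n_snps))
    return first_leg and last_leg


# Pairwise values from the slide's example table (assumed to be D').
LD = {
    (1, 2): 0.97, (1, 3): 0.99, (1, 4): 0.93, (1, 5): 0.96,
    (2, 3): 0.18, (2, 4): 0.67, (2, 5): 0.98,
    (3, 4): 0.03, (3, 5): 0.94,
    (4, 5): 0.95,
}
block_ok = is_solid_spine_block(LD, 5)
```

Note that the weak values between intermediate SNPs (e.g., 0.18 between SNP2 and SNP3) do not break the block, because only the legs anchored at the first and last SNP are tested.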
Comparison among haplotype blocking methods
Items | FGT | CIT | SSLD
Recombination event within block | Not allowed | ≤ 5% is allowed | Only between a few intermediate SNPs
Strong LD | LD is not used | D´ upper limit = 0.98; D´ lower limit = 0.9 | D´ > 0.8
Weak LD | LD is not used | D´ upper limit = 0.9; D´ lower limit = 0.7 | D´ ≤ 0.8
Morphology in the LD chart | No recombination event between all SNPs in the block | > 95% strong LD between all SNPs in the block | Strong LD in the legs of the LD chart
FGT: four gamete test; CIT: confidence interval test; SSLD: solid spine of linkage disequilibrium.
The FGT method differs from the other methods in that it does not require an LD threshold (Qian et al., 2017).
Haplotype Inference
• The problem of inferring the haplotypes from a set of genotypes is called haplotype inference.
• Most combinatorial methods use the maximum parsimony model to solve this problem.
• This model assumes that the number of distinct haplotypes in a natural population is small
• The solution of this problem is a minimum set of haplotypes that can explain the given genotypes
Maximum Parsimony
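• Find a minimum set of haplotypes to explain the given genotypes.
Example: genotype G1 is A/C at SNP1 and G/T at SNP2; genotype G2 is A/A at SNP1 and T/T at SNP2.
• G1 can be resolved as h3 (A G) + h4 (C T) or as h1 (A T) + h2 (C G)
• G2 can only be resolved as h1 (A T) + h1 (A T)
• Resolving G1 as h1 + h2 explains both genotypes with two haplotypes (h1, h2) instead of three (h1, h3, h4)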
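The maximum parsimony idea can be illustrated with a brute-force search over all resolutions of the genotypes. This is a sketch under stated assumptions: the function names `compatible_pairs` and `max_parsimony` are hypothetical, and exhaustive enumeration is only feasible for toy examples like this one (practical tools rely on more scalable algorithms).

```python
from itertools import product


def compatible_pairs(genotype):
    """All unordered haplotype pairs consistent with one unphased genotype."""
    pairs = set()
    orientations = [((a, b), (b, a)) for a, b in genotype]
    for choice in product(*orientations):
        h1 = "".join(locus[0] for locus in choice)
        h2 = "".join(locus[1] for locus in choice)
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


def max_parsimony(genotypes):
    """Pick the resolution that uses the fewest distinct haplotypes."""
    best_set, best_resolution = None, None
    for resolution in product(*(compatible_pairs(g) for g in genotypes)):
        used = {hap for pair in resolution for hap in pair}
        if best_set is None or len(used) < len(best_set):
            best_set, best_resolution = used, resolution
    return sorted(best_set), list(best_resolution)


# Slide example: G1 = A/C, G/T and G2 = A/A, T/T.
G1 = [("A", "C"), ("G", "T")]
G2 = [("A", "A"), ("T", "T")]
haplotypes, resolution = max_parsimony([G1, G2])
```

For the slide's two genotypes the search selects the haplotype set {AT, CG}, resolving G1 as AT + CG and G2 as AT + AT.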
Factors affecting haplotype
map construction
• SNP allele frequency distribution
• Haplotype allele numbers
• Linkage disequilibrium (LD)
Hamblin & Jannink, 2011
Problems of Using SNPs for
Association Studies
• The number of SNPs is still too large to be used for
association studies
• There are millions of SNPs in a plant genome
• To reduce the SNP genotyping cost, we wish to use as few SNPs as
possible for association studies
• Tag SNPs are a small subset of SNPs that is sufficient for
performing association studies without losing the power of
using all SNPs.
Brief glossary of terms
Haplotype tagging: Selecting minimum subsets of SNPs that can represent the (common) haplotypes inferred from the original set
Tagging SNP or tag SNP: A SNP that reports partially or totally the state of other SNP(s)
Tagged SNP: A SNP that does not need to be genotyped because its state is reported by tagging SNPs
Halldorsson et al., 2004
Examples of Tag SNPs
• Suppose we wish to identify an unknown haplotype sample
• We can genotype all SNPs to identify the haplotype sample
[Figure: haplotype patterns P1 to P4 across SNP loci S1 to S12 (major and minor alleles marked), alongside an unknown haplotype sample]
Examples of Tag SNPs
• In fact, it is not necessary to genotype all SNPs
• SNPs S3, S4, and S5 can form a set of tag SNPs
[Figure: haplotype patterns P1 to P4 across SNP loci S1 to S12, and the same patterns reduced to S3, S4, and S5]
Examples of Wrong Tag SNPs
• SNPs S1, S2, and S3 cannot form a set of tag SNPs because P1 and P4 would be ambiguous
[Figure: haplotype patterns P1 to P4 across SNP loci S1 to S12, and the same patterns reduced to S1, S2, and S3]
Examples of Tag SNPs
• SNPs S1 and S12 can form a set of tag SNPs
• This set of SNPs is the minimum solution in this example
[Figure: haplotype patterns P1 to P4 across SNP loci S1 to S12, and the same patterns reduced to S1 and S12]
Steps for ‘tag SNP’ selection
(1) Determining predictive neighborhoods
(2) Minimizing the number of tagging SNPs
(3) Tagging quality assessment
Halldorsson et al., 2004
Haplotype Blocks and Tag SNPs
• Recent studies have shown that the chromosome can be partitioned into haplotype blocks interspersed with recombination hotspots
• Within a haplotype block, little or no recombination has occurred
• The SNPs within a haplotype block tend to be inherited together
• Within a haplotype block, a small subset of SNPs (called tag SNPs) is sufficient to distinguish each pair of haplotype patterns in the block
• We only need to genotype the tag SNPs instead of all SNPs within a haplotype block
Problem Formulation
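• The relation between SNPs and haplotype patterns can be formulated as a bipartite graph, with the SNPs (S1 to S4) on one side and the pairs of patterns (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) on the other
• For $K$ patterns there are $K(K-1)/2$ pairs of patterns (six pairs for $K = 4$)
• S1 can distinguish (P1, P3), (P1, P4), (P2, P3), and (P2, P4)
• S2 can distinguish (P1, P4), (P2, P4), and (P3, P4)
[Figure: haplotype patterns P1 to P4 at SNPs S1 to S4 and the corresponding bipartite graph]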
Observation
• A set of SNPs can form a set of tag SNPs if each pair of patterns is connected by at least one edge
• e.g., S1 and S3 can form a set of tag SNPs
• e.g., S1 and S2 cannot be tag SNPs
[Figure: bipartite graph linking SNPs S1 to S4 to the pattern pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) for haplotype patterns P1 to P4]
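The bipartite-graph formulation amounts to a set-cover problem: find the fewest SNPs whose edges cover all pattern pairs. The sketch below solves it by exhaustive search. The function names are hypothetical, the 0/1 coding stands for major/minor alleles, and the allele patterns for S3 and S4 are assumed for illustration (only S1 and S2 follow the pairs listed on the slide).

```python
from itertools import combinations


def distinguished_pairs(column):
    """Pattern pairs (1-based) that one SNP can tell apart."""
    return {
        (i + 1, j + 1)
        for i, j in combinations(range(len(column)), 2)
        if column[i] != column[j]
    }


def minimum_tag_snps(patterns):
    """Smallest SNP subset that distinguishes every pair of patterns."""
    k = len(next(iter(patterns.values())))
    all_pairs = set(combinations(range(1, k + 1), 2))  # K(K-1)/2 pairs
    cover = {snp: distinguished_pairs(col) for snp, col in patterns.items()}
    for size in range(1, len(patterns) + 1):
        for subset in combinations(sorted(patterns), size):
            covered = set().union(*(cover[snp] for snp in subset))
            if covered == all_pairs:
                return list(subset)
    return None


# Alleles of patterns P1..P4 at each SNP (0 = major, 1 = minor).
PATTERNS = {
    "S1": [0, 0, 1, 1],  # distinguishes (1,3), (1,4), (2,3), (2,4) as on the slide
    "S2": [0, 0, 0, 1],  # distinguishes (1,4), (2,4), (3,4) as on the slide
    "S3": [0, 1, 0, 1],  # assumed pattern
    "S4": [0, 1, 1, 1],  # assumed pattern
}
tags = minimum_tag_snps(PATTERNS)
```

Under these assumed patterns the search returns S1 and S3, while S1 with S2 fails because neither SNP separates P1 from P2. Exhaustive search is exponential in the number of SNPs, so it is only suitable for small blocks.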
Methods to select tSNPs
Based on principal component analysis (PCA) to reduce the dimensions of the complete set of SNPs:
1. Compute the covariance matrix of the SNPs
2. Perform principal component analysis
3. SNPs that contribute most to the eigenvectors associated with the largest eigenvalues are considered more influential
4. The selected SNPs are added to the set of tagging SNPs
Methods to select tSNPs
Shannon entropy: based on defining how well a subset of SNPs captures the variation in the complete set
Shannon entropy helps us quantify how much genetic diversity a particular SNP captures
• SNP has high entropy → it occurs with different allele versions → reflects greater diversity → the tSNP is selected
• SNP has low entropy → most individuals carry the same allele version → less diversity → less informative
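As a rough illustration of the entropy idea, the per-SNP Shannon entropy $H = -\sum_i p_i \log_2 p_i$ can be computed from allele frequencies. This sketch scores each SNP on its own, which is a simplification of subset-based entropy methods; the function name and the example allele lists are assumptions.

```python
from collections import Counter
from math import log2


def shannon_entropy(alleles):
    """Shannon entropy (in bits) of the allele distribution at one SNP."""
    total = len(alleles)
    return -sum((n / total) * log2(n / total) for n in Counter(alleles).values())


# Hypothetical SNPs scored in eight individuals.
snps = {
    "balanced": ["A"] * 4 + ["G"] * 4,     # both alleles equally common
    "skewed": ["A"] * 7 + ["G"],           # one allele dominates
    "monomorphic": ["A"] * 8,              # no variation at all
}
entropy = {name: shannon_entropy(alleles) for name, alleles in snps.items()}
ranked = sorted(entropy, key=entropy.get, reverse=True)
```

A biallelic SNP reaches the maximum of 1 bit when both alleles are equally frequent and drops to 0 when only one allele is present, so ranking by entropy places the most informative SNPs first.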
Linkage Disequilibrium
• The problem of finding tag SNPs can be also
solved from the statistical point of view
• We can measure the correlation between SNPs
and identify sets of highly correlated SNPs
• For each set of correlated SNPs, only one SNP needs to be genotyped, and it can be used to predict the values of the other SNPs
• Linkage disequilibrium (LD) is a measure that estimates such correlation between two SNPs
Introduction to Linkage
Disequilibrium
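Two SNPs (SNP1 with alleles A/a, SNP2 with alleles B/b) give four possible haplotypes: A B, A b, a B, a b. The SNPs are in linkage disequilibrium when the haplotype frequencies differ from the products of the allele frequencies:
• $P_{AB} \neq P_A P_B$
• $P_{Ab} \neq P_A P_b = P_A (1 - P_B)$
• $P_{aB} \neq P_a P_B = (1 - P_A) P_B$
• $P_{ab} \neq P_a P_b = (1 - P_A)(1 - P_B)$
SNP1 \ SNP2 | B | b | Total
A | $P_{AB}$ | $P_{Ab}$ | $P_A$
a | $P_{aB}$ | $P_{ab}$ | $P_a$
Total | $P_B$ | $P_b$ | 1.0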
Linkage Disequilibrium
Formulas
• Mathematical formula for computing LD as a correlation, $r^2$ (also written $\Delta^2$):
\[
r^2 = \frac{(P_{AB} - P_A P_B)^2}{P_A (1 - P_A)\, P_B (1 - P_B)}
\]
• This follows from treating each SNP as a 0/1 indicator variable:
\[
r^2 = \frac{\mathrm{Cov}(A, B)^2}{\mathrm{Var}(A)\,\mathrm{Var}(B)} = \frac{(P_{AB} - P_A P_B)^2}{P_A (1 - P_A)\, P_B (1 - P_B)}
\]
\[
\mathrm{Cov}(A, B) = E[AB] - E[A]\,E[B] = P_{AB} - P_A P_B
\]
\[
\mathrm{Var}(A) = E[A^2] - E[A]^2 = P_A - P_A^2 = P_A (1 - P_A)
\]
Linkage Disequilibrium Bins
• The statistical methods for finding tag SNPs are
based on the analysis of LD among all SNPs
• An LD bin is a set of SNPs such that SNPs within the
same bin are highly correlated with each other
• The value of a single SNP in one LD bin can predict the
values of other SNPs of the same bin
• These methods try to identify the minimum set of LD bins
An Example of LD Bins (1/3)
• SNP1 and SNP2 can not form an LD bin
• e.g., A in SNP1 may imply either G or A in SNP2
Individual SNP1 SNP2 SNP3 SNP4 SNP5 SNP6
1 A G A C G T
2 T G C C G C
3 A A A T A T
4 T G C T A C
5 T A C C G C
6 T G C T A C
7 A A A T A T
8 A A A T A T
An Example of LD Bins (2/3)
• SNP1, SNP3, and SNP6 can form an LD bin (A at SNP1 always occurs with A at SNP3 and T at SNP6).
• Any SNP in this bin is sufficient to predict the values of the others.
Individual SNP1 SNP2 SNP3 SNP4 SNP5 SNP6
1 A G A C G T
2 T G C C G C
3 A A A T A T
4 T G C T A C
5 T A C C G C
6 T G C T A C
7 A A A T A T
8 A A A T A T
An Example of LD Bins (3/3)
• There are three LD bins, and only three tag SNPs are required to be
genotyped (e.g., SNP1, SNP2, and SNP4).
Individual SNP1 SNP2 SNP3 SNP4 SNP5 SNP6
1 A G A C G T
2 T G C C G C
3 A A A T A T
4 T G C T A C
5 T A C C G C
6 T G C T A C
7 A A A T A T
8 A A A T A T
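The binning in this example can be reproduced with a small sketch that groups SNPs whose pairwise $r^2$ reaches a threshold. The greedy grouping strategy, the function names, and the threshold of 1 (perfect correlation) are assumptions chosen for illustration; published LD-binning methods typically use lower thresholds and more careful tag selection.

```python
from fractions import Fraction


def r_squared(snp_a, snp_b):
    """Exact r^2 between two biallelic SNPs given per-individual alleles."""
    n = len(snp_a)
    ref_a, ref_b = snp_a[0], snp_b[0]  # r^2 does not depend on this choice
    p_a = Fraction(sum(x == ref_a for x in snp_a), n)
    p_b = Fraction(sum(y == ref_b for y in snp_b), n)
    p_ab = Fraction(sum(x == ref_a and y == ref_b for x, y in zip(snp_a, snp_b)), n)
    return (p_ab - p_a * p_b) ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))


def ld_bins(snps, threshold=1):
    """Greedily place each SNP in the first bin where it correlates with all members."""
    bins = []
    for name in snps:
        for members in bins:
            if all(r_squared(snps[name], snps[m]) >= threshold for m in members):
                members.append(name)
                break
        else:
            bins.append([name])
    return bins


# Alleles of individuals 1 to 8 at each SNP, copied from the table above.
SNPS = {
    "SNP1": "ATATTTAA",
    "SNP2": "GGAGAGAA",
    "SNP3": "ACACCCAA",
    "SNP4": "CCTTCTTT",
    "SNP5": "GGAAGAAA",
    "SNP6": "TCTCCCTT",
}
bins = ld_bins(SNPS)
tag_snps = [members[0] for members in bins]
```

On the table's data this gives three bins and the tag SNPs SNP1, SNP2, and SNP4, in line with the slide. The function assumes both SNPs are polymorphic; a monomorphic SNP would cause a division by zero.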
Examples of Computing LD
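Individual SNP1 SNP2 SNP3 SNP4 SNP5 SNP6
1 A T A A G T
2 G T C C T T
3 G A C A G T
4 G A C C T T
5 G A C A G C
For SNP1 and SNP2 (with $P_A = 4/5$, $P_B = 3/5$, and $P_{AB} = 3/5$):
\[
r^2_{12} = \frac{(P_{AB} - P_A P_B)^2}{P_A (1 - P_A)\, P_B (1 - P_B)}
= \frac{\left(\frac{3}{5} - \frac{4}{5} \cdot \frac{3}{5}\right)^2}{\frac{4}{5} \cdot \frac{1}{5} \cdot \frac{3}{5} \cdot \frac{2}{5}} = 0.375
\]
For SNP1 and SNP3 (with $P_A = 4/5$, $P_B = 4/5$, and $P_{AB} = 4/5$):
\[
r^2_{13} = \frac{(P_{AB} - P_A P_B)^2}{P_A (1 - P_A)\, P_B (1 - P_B)}
= \frac{\left(\frac{4}{5} - \frac{4}{5} \cdot \frac{4}{5}\right)^2}{\frac{4}{5} \cdot \frac{1}{5} \cdot \frac{4}{5} \cdot \frac{1}{5}} = 1
\]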
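The two worked values for the five-individual table can be checked numerically. The sketch below uses exact fractions to avoid rounding issues; the function name `r_squared` and the string encoding of each SNP column are assumptions made for the example.

```python
from fractions import Fraction


def r_squared(snp_a, snp_b):
    """Exact r^2 between two biallelic SNPs given per-individual alleles."""
    n = len(snp_a)
    # Use the first observed allele as the reference at each SNP;
    # for biallelic SNPs, r^2 is the same whichever allele is chosen.
    ref_a, ref_b = snp_a[0], snp_b[0]
    p_a = Fraction(sum(x == ref_a for x in snp_a), n)
    p_b = Fraction(sum(y == ref_b for y in snp_b), n)
    p_ab = Fraction(sum(x == ref_a and y == ref_b for x, y in zip(snp_a, snp_b)), n)
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("r^2 is undefined when a SNP is monomorphic")
    return (p_ab - p_a * p_b) ** 2 / denominator


# Columns SNP1, SNP2, SNP3 of the five-individual table.
snp1 = "AGGGG"
snp2 = "TTAAA"
snp3 = "ACCCC"
r2_12 = r_squared(snp1, snp2)
r2_13 = r_squared(snp1, snp3)
```

The results are 3/8 (0.375) for SNP1 with SNP2 and exactly 1 for SNP1 with SNP3, which are perfectly correlated in this sample.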
Difference between Haplotype
Blocks and LD bins
Haplotype blocks | LD bins
Based on the assumption that SNPs in close proximity tend to be correlated with each other | Can group correlated SNPs that are distant from each other
Represent co-inheritance of alleles within the block | A disease is usually affected by multiple genes instead of a single one
The probability of recombination within the block is low | Reflect similar allelic frequencies within the bin
The SNPs in a haplotype block do not appear in another block | The SNPs in one LD bin can be shared by other bins
Ding & Kullo, 2007
Tagging quality assessment
Prediction accuracy (PA): how well an identified set of tag SNPs consistently identifies the haplotypes
Tagging consistency: quantifies the consistency between the sets of tSNPs derived from one population using two different methods, or from two populations using one method
Tagging efficiency: provides an estimate of the savings in genotyping offered by tSNPs
Applications of Haplotype-led
approaches for increasing
precision in plant breeding
Haplo-pheno analysis
Sinha et al., 2020
DNA sequences of six individuals with contrasting phenotypes (two SNPs):
1 GATATTCGTACGGAT
2 GATGTTCGTACTGAT
3 GATATTCGTACGGAT
4 GATATTCGTACGGAT
5 GATGTTCGTACTGAT
6 GATGTTCGTACTGAT
[Figure: genetic map, candidate gene, and phenotype distribution]
Advantage of using Haplo-pheno analysis (hypothetical example)
Haplotype | Frequency | Phenotypic mean
GATTGTA | 0.35 | 96
GATTATA | 0.55 | 89
GTTCATA | 0.10 | 115
Haplotype information provides better resolution!
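A haplo-pheno summary of this kind can be produced by grouping individuals by haplotype and averaging their phenotypes. The sketch below is illustrative only: the function name, the per-individual records, and the assumption that a higher phenotypic mean is "superior" are all hypothetical, with the data constructed to reproduce the frequencies and means of the hypothetical table.

```python
from collections import defaultdict


def haplo_pheno_summary(records):
    """records: iterable of (haplotype, phenotype) pairs, one per individual."""
    groups = defaultdict(list)
    for haplotype, phenotype in records:
        groups[haplotype].append(phenotype)
    total = sum(len(values) for values in groups.values())
    return {
        hap: {"frequency": len(values) / total, "mean": sum(values) / len(values)}
        for hap, values in groups.items()
    }


# Hypothetical panel of 20 individuals matching the table above.
records = (
    [("GATTGTA", 96)] * 7
    + [("GATTATA", 89)] * 11
    + [("GTTCATA", 115)] * 2
)
summary = haplo_pheno_summary(records)
# Assumes a larger trait value is desirable.
superior = max(summary, key=lambda hap: summary[hap]["mean"])
```

In this toy data the rare haplotype GTTCATA (frequency 0.10) has the highest phenotypic mean, the kind of contrast a haplotype-level comparison is meant to expose. A real analysis would also test whether the differences between haplotype groups are statistically significant.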
Haplotype-based breeding (HBB)
Retrospective approach: genomes of popular cultivars are resequenced to identify favorable haplotypes.
Prospective approach: new and potentially valuable haplotypes are identified by studying larger ancestral and wild-relative populations of the crop species.
GWAS using haplotypes
Association mapping is a tool to resolve complex trait variation down to the sequence level by exploiting historical and evolutionary recombination events at the population level.
Importance:
• It can directly exploit existing and extensive phenotypic data collected during variety registration trials
• Compared with biparental crosses, it offers increased genetic resolution because it uses historical recombination events
• It has become increasingly feasible with the emergence of cost-effective, high-throughput genotyping approaches
Statistical approaches for marker–
trait associations
1. Single-locus mixed model
2. High-density SNP multi-locus model
3. Haplotype-based model
Haplotype vs. individual markers-
Comparative efficiency for crop breeding
• SNPs tiled on arrays are usually chosen for their moderate to high minor allele frequency (MAF)
• The SNP-based genomic relationship matrix (GSNP) therefore relies on SNPs with relatively high MAF and may trace changes due to recent selection less accurately than the haplotype-based matrix (GHAP)
• GHAP can differentiate between identity by descent (IBD) and identity by state (IBS)
• Meuwissen et al. (2014) suggested that building the relationship matrix from haplotypes instead of single SNPs may improve the accuracy of genomic predictions
Haplotype markers in genome-wide association mapping (GWAS) analysis in different crop species
Crop species | Trait | Population size / haplotype markers / haplotype-trait associations | References
Soybean | 100-seed weight; plant height; seed yield | 169 / 941 / 87 | Contreras-Soto et al., 2017
Wheat | Heading date; plant height; 1000-grain weight; grain number per spike; fruiting efficiency at harvest | 102 / 4516 / 97 | Basile et al., 2019
Wheat | Grain yield; days to heading; plant height | 6461 / 519 / 36 | Sehgal et al., 2020
Barley | Deoxynivalenol content in kernels; heading time; days to maturity; grain yield; plant height; specific weight; 1000-kernel weight | 277 / 14400 / - | Abed et al., 2019
Barley | Yield and quality-related traits | 106 / 2770 / 23 | Gawenda et al., 2019
Rice | Grain shape | 372 / 30 | Ogawa et al., 2021
Rice | Agronomic traits | 414 / 15275 | Hamazaki et al., 2020
Maize | Agronomic and reproductive traits | 322 / 53403 / 44 | Maldonado et al., 2019
Maize | Total plant height; ear height; ear height/plant height | 183 / 7831 / 40 | Maldonado et al., 2019
Maize | Agronomic and reproductive traits | >1000 / 154104 | Mayer et al., 2020
Oat | Heading date | 4657 / 164741 / 184 | Bekele et al., 2018
Rapeseed | Days to flowering; seed glucosinolate content | 950 / 15 | Jan et al., 2019
For rows with only two values, one of the three counts is not reported.
Limitations of Haplotype based GWAS
• Non-informative SNPs in a given haplotype block mask the effect of adjacent informative SNPs, which in turn leads to spurious associations and decreases the effectiveness of the GWAS analysis.
• Haplotype construction approaches have one thing in common: they all use consecutive SNPs in high LD to build haplotypes.
• Haplotypes generated by these approaches have been observed to provide no more information than single SNPs, because SNPs in high LD provide redundant information.
Functional haplotype-GWAS (FH-GWAS)
FH-GWAS combines only those SNPs that are true contributors into haplotypes (Qian et al., 2017)
Steps:
1. Selecting significant SNPs identified via GWAS analysis of individual SNP markers
2. Constructing functional haplotypes considering main and epistatic effects
3. Association tests using functional haplotypes
Advantages:
• Narrows down the candidate region
• Takes the associated epistatic effects of SNPs into consideration
Haplotype-based marker-assisted selection
Pan genomics- Shift from useful
individuals to useful haplotypes
Genomic prediction using pangenome haplotype graphs
• Genomic selection is an alternative approach for complex traits controlled by QTLs with small effects
• It uses SNPs as predictors and is biased by the use of a single reference genome
• Pangenomic data can be used to identify new markers to improve prediction accuracy
• Practical Haplotype Graph (PHG) approach
• A wheat GS panel (n = 383) was phenotyped for yield, test weight, and protein content
• The panel was genotyped using the Illumina 90K SNP assay
• Haplotype blocks of 5, 10, 15, and 20 adjacent markers were constructed for all chromosomes
• A multi-allelic haplotype prediction algorithm was implemented and compared with single SNPs using both k-fold cross validation and stratified sampling optimization
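The construction of fixed-size haplotype blocks from adjacent markers can be sketched as follows. This is not the algorithm used in the study: the function name, the toy marker strings, and the assumption that each line is homozygous (so its marker string is itself a haplotype) are illustrative choices.

```python
def haplotype_block_alleles(lines, block_size):
    """Recode adjacent markers into multi-allelic haplotype blocks.

    lines: dict mapping a line name to its marker string (one character
    per marker, assumed homozygous). Returns one dict per block that maps
    each line to an integer code identifying its haplotype allele.
    """
    n_markers = len(next(iter(lines.values())))
    blocks = []
    # Slide a non-overlapping window along the chromosome; the final
    # block may be shorter if n_markers is not a multiple of block_size.
    for start in range(0, n_markers, block_size):
        codes = {}
        block = {}
        for name, markers in lines.items():
            haplotype = markers[start:start + block_size]
            # Each distinct marker combination becomes a new allele code.
            block[name] = codes.setdefault(haplotype, len(codes))
        blocks.append(block)
    return blocks


# Hypothetical lines genotyped at six markers, grouped into blocks of three.
LINES = {"L1": "AAGTCC", "L2": "AAGTCT", "L3": "ACGTCC", "L4": "AAGGCT"}
blocks = haplotype_block_alleles(LINES, block_size=3)
allele_counts = [len(set(block.values())) for block in blocks]
```

Unlike a biallelic SNP, a block can carry more than two alleles (three in the second toy block), which is why a multi-allelic prediction model is needed downstream.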
Results:
Haplotypes of 15 adjacent markers combined with training population optimization significantly improved the predictive ability for yield and protein content, by 14.3% and 16.8% respectively, compared with single SNPs and k-fold cross validation.
Output of study:
These results emphasize the effectiveness of using haplotypes in genomic selection to increase genetic gain.
The use of haplotype markers in genomic
selection in different crop species
Crop species | Trait | Training population size | Haplotype markers | GS prediction accuracy | References
Bluegum | Traits related to wood quality and tree growth | 646 | ~3000 | 0.58 | Ballesta et al., 2019
Soybean | Plant height & grain yield per plant | 235 | 357 | >0.80 & >0.45 | Ma et al., 2018
Sorghum | Agronomic and yield-related traits | 207 | 1974 | 0.57-0.73 | Jensen et al., 2020
Wheat | Yield, test weight, and protein content | 383 | 1400 | >0.40 | Sallam et al., 2019
Wheat | Grain yield and related traits | 4302 | 1162 | 0.39-0.48 | Sehgal et al., 2020
Oat | Heading date | 635 | 13954 | 0.42-0.67 | Bekele et al., 2018
Objectives:
• To identify superior haplotypes
• To design haplotype-based breeding to develop tailor-made, drought-resistant pigeon pea varieties
Material & Method:
Material: pigeon pea reference set of 292 accessions (117 breeding lines, 166 landraces, 9 wild species)
Methods:
1. Drought tolerance phenotyping of 137 accessions
2. Haplotype analysis using Haploview software (Barrett et al., 2005)
3. Candidate gene-based association analysis
4. Haplo-pheno analysis
RESULTS:
Phenotyping of the subset panel
Trait | Minimum | Maximum | Mean | SD | Median | Mode | Kurtosis | Skewness | Standard error
Plant weight (PW, g) | 0.11 | 2.17 | 0.87 | 0.33 | 0.84 | 0.84 | 1.08 | 0.53 | 0.03
Shoot length (SL, cm) | 4.75 | 23.50 | 16.03 | 3.63 | 16.33 | 19.00 | 0.83 | -0.74 | 0.31
Root length (RL, cm) | 5.00 | 24.33 | 13.05 | 3.39 | 12.67 | 13.00 | 0.94 | 0.66 | 0.29
Fresh weight (FW, g) | 0.03 | 0.68 | 0.19 | 0.11 | 0.17 | 0.12 | 3.58 | 1.50 | 0.01
Turgid weight (TW, g) | 0.03 | 1.28 | 0.32 | 0.18 | 0.30 | 0.15 | 4.60 | 1.44 | 0.02
Dry weight (DW, g) | 0.02 | 0.24 | 0.07 | 0.04 | 0.07 | 0.05 | 3.43 | 1.33 | 0.00
Relative water content (RWC, %) | 7.58 | 98.96 | 47.65 | 21.06 | 45.89 | 33.33 | -0.69 | 0.24 | 1.78
Haplotype diversity
Group | Number of haplotypes
Breeding lines | 83
Landraces | 132
Wild species | 60
Association of drought-responsive genes with phenotype for identification of trait-associated genes
Trait | Gene | CcLG/Scaffold | SNP position (bp) | Gene annotation | P-value | PVE (%)
Shoot length | C.cajan_30211 | Scaffold126966 | 344212 | U-box domain-containing protein 52 | 0.00490142 | 4.48
Root length | C.cajan_30211 | Scaffold126966 | 349139 | U-box domain-containing protein 52 | 0.00129826 | 8.16
Root length | C.cajan_29830 | Scaffold128889 | 297640 | Universal stress protein A-like protein | 0.00724983 | 5.62
Plant weight | C.cajan_30211 | Scaffold126966 | 344102 | U-box domain-containing protein 52 | 0.00033134 | 8.85
Plant weight | C.cajan_23080 | CcLG05 | 86455 | Universal stress protein | 0.00080033 | 7.67
Plant weight | C.cajan_30211 | Scaffold126966 | 344496 | U-box domain-containing protein 52 | 0.00206941 | 6.42
Fresh weight | C.cajan_30211 | Scaffold126966 | 344102 | U-box domain-containing protein 52 | 0.00013847 | 11.41
Fresh weight | C.cajan_23080 | CcLG05 | 86455 | Universal stress protein | 0.00044457 | 9.61
Fresh weight | C.cajan_30211 | Scaffold126966 | 344496 | U-box domain-containing protein 52 | 0.00161236 | 7.67
Fresh weight | C.cajan_26230 | Scaffold133234 | 89141 | U-box domain-containing protein 35 | 0.00164604 | 7.64
Fresh weight | C.cajan_46779 | Scaffold117697 | 3348 | Cation/H(+) antiporter 15 | 0.0054616 | 5.9
Dry weight | C.cajan_26230 | Scaffold133234 | 91818 | U-box domain-containing protein 35 | 4.61E-06 | 17.02
Dry weight | C.cajan_30211 | Scaffold126966 | 344102 | U-box domain-containing protein 52 | 0.00089919 | 8.59
Dry weight | C.cajan_30211 | Scaffold126966 | 344496 | U-box domain-containing protein 52 | 0.00150988 | 7.81
Turgid weight | C.cajan_30211 | Scaffold126966 | 344102 | U-box domain-containing protein 52 | 2.39E-05 | 13.62
Turgid weight | C.cajan_30211 | Scaffold126966 | 344496 | U-box domain-containing protein 52 | 6.70E-05 | 12.03
Turgid weight | C.cajan_23080 | CcLG05 | 86455 | Universal stress protein | 0.00026617 | 9.95
Turgid weight | C.cajan_46779 | Scaffold117697 | 3348 | Cation/H(+) antiporter 15 | 0.00142431 | 7.52
Turgid weight | C.cajan_46779 | Scaffold117697 | 1822 | Cation/H(+) antiporter 15 | 0.00395323 | 6.09
Turgid weight | C.cajan_30211 | Scaffold126966 | 345767 | U-box domain-containing protein 52 | 0.0053937 | 5.67
Turgid weight | C.cajan_30211 | Scaffold126966 | 345945 | U-box domain-containing protein 52 | 0.0053937 | 5.67
Turgid weight | C.cajan_30211 | Scaffold126966 | 348513 | U-box domain-containing protein 52 | 0.0053937 | 5.67
Relative water content | C.cajan_26230 | Scaffold133234 | 88787 | U-box domain-containing protein 35 | 0.00572797 | 2.78
Haplo-pheno analysis of C.cajan_23080 gene
Superior haplotypes for drought-responsive genes
Trait | Gene | Superior haplotype
Plant weight (PW) | C.cajan_30211 | H6
Plant weight (PW) | C.cajan_23080 | H2
Fresh weight (FW) | C.cajan_30211 | H6
Fresh weight (FW) | C.cajan_23080 | H2
Fresh weight (FW) | C.cajan_26230 | H11
Turgid weight (TW) | C.cajan_30211 | H6
Turgid weight (TW) | C.cajan_23080 | H2
Dry weight (DW) | C.cajan_26230 | H11
Dry weight (DW) | C.cajan_30211 | H6
Relative water content (RWC) | C.cajan_26230 | H5
List of accessions carrying superior haplotypes for three drought-associated responsive genes
Genotype | Gene(s) | PW | FW | TW | DW | RWC | Biological status | Region | Geographic origin (country)
ICP 10447 | C.cajan_23080 | H2 | H2 | H2 |  |  | Landrace | South Asia | India
ICP 1156 | C.cajan_23080 | H2 | H2 | H2 |  |  | Landrace | South Asia | India
ICP 1273 | C.cajan_23080 | H2 | H2 | H2 |  |  | Landrace | South Asia | India
ICP 1273 | C.cajan_26230 |  |  |  |  | H5 | Landrace | South America | Venezuela
ICP 9236 | C.cajan_23080 | H2 | H2 | H2 |  |  | Breeding line | South Asia | India
ICP 10683 | C.cajan_30211 | H6 | H6 | H6 | H6 |  | Breeding line | South Asia | India
ICP 7896 | C.cajan_30211 | H6 | H6 | H6 | H6 |  | Landrace | South Asia | India
ICP 12765 | C.cajan_26230 |  | H11 |  | H11 |  | Landrace | South Asia | Philippines
ICP 14163 | C.cajan_26230 |  | H11 |  | H11 |  | Landrace | South Asia | Indonesia
ICP 12410 | C.cajan_26230 |  |  |  |  | H5 | Landrace | Unknown | Unknown
ICP 13191 | C.cajan_26230 |  |  |  |  | H5 | Landrace | South Asia | India
ICP 14971 | C.cajan_26230 |  |  |  |  | H5 | Landrace | South Asia | Indonesia
ICP 2698 | C.cajan_26230 |  |  |  |  | H5 | Landrace | South Asia | India
ICP 4167 | C.cajan_26230 |  |  |  |  | H5 | Landrace | South Asia | India
ICP 6992 | C.cajan_26230 |  |  |  |  | H5 | Landrace | South Asia | India
ICP 7420 | C.cajan_26230 |  |  |  |  | H5 | Landrace | South Asia | India
ICP 8012 | C.cajan_26230 |  |  |  |  | H5 | Landrace | South Asia | India
ICP 7314 | C.cajan_26230 |  |  |  |  | H5 | Breeding line | South Asia | India
New plant type proposed for
drought tolerant Pigeonpea
Output of study:
• Whole-genome re-sequencing data were used to identify the superior haplotypes for 10 drought-responsive candidate genes
• Candidate gene-based association analysis of these 10 genes revealed 23 strong marker-trait associations (MTAs)
• Haplo-pheno analysis identified four superior haplotypes for three genes regulating five component drought traits
• 17 accessions containing these four superior haplotypes were identified
• Haplotype-based breeding can be used as a promising breeding approach to develop tailor-made crop varieties
Objective:
• To introduce the new concept of Heterotic Haplotype Capture (HHC)
Material and Method:
-50 lines
Rejuvenating a depleted breeding pool
with novel species diversity
Output of study:
• Genome-scale data of the NAM and HHC populations enable the identification (in any given NAM line) of haplotype blocks that are predicted to be heterozygous in combination with a genotyped maternal tester
• The availability of immortal heterotic populations provides a powerful resource for genome-scale investigations into the genetic basis of heterosis for yield and other important agronomic traits
Conclusion
Haplotype mapping and its application in Plant Breeding

Haplotype mapping and its application in Plant Breeding

  • 1.
    Seminar-II Maruthi Prasad BP II PhD PAMB 1066 Dept. of GPB, UASB 1
  • 2.
  • 3.
     Conventional breedinghas made great success in the development of high-yielding crop varieties  It is important to accelerate the pace of crop improvement programmes especially for the complex traits such as yield under stress condition Varshney et al. 2005 3
  • 4.
    Genetic Variations 4 Trait Improvement EfficientBreeding Environmental Resilience
  • 5.
    Marker allelic variationswithin a genome of a same species SNP GAATTC CTTAAG GAACTC CTTGAG InDels CATCGCGAATTCCCATCG GTAGCGCTTAAGGGTAGC CATCG----------------CATCG GTAGC----------------GTAGC SSR ACTGTCGACACACACACACGCTAGCT TGACAGCTGTGTGTGTGTGCGATCGA ACTGTCGACACACACACACACACACGCTAGCT TGACAGCTGTGTGTGTGTGTGTGTGCGATCGA ACTGTCGACACACACACACACACACACACACACGCTAGCT TGACAGCTGTGTGTGTGTGTGTGTGTGTGTGTGCGATCGA 1. Single nucleotide polymorphisms – SNPs 2. Segmental/nucleotide insertions/deletions - InDels 3. Differences in the number of tandem repeats at a locus – SSRs Mammadov et al., 2012 5
  • 6.
    Depending on detectionmethod and throughput 1. Low-throughput, hybridization-based markers : RFLPs 2. Medium-throughput, PCR-based markers: RAPD, AFLP, SSRs 3. High-throughput (HTP) sequence-based markers: SNPs Targeting genetic variants associated with agronomic traits and identifying important underlying candidate genes have become a key area in crop genetic research 6
  • 7.
    Single Nucleotide Polymorphism •A Single Nucleotide Polymorphisms (SNP), pronounced “snips,” is a genetic variation when a single nucleotide (i.e., A, T, C, or G) is altered and kept through heredity. • SNP: Single DNA base variation found >1% • Mutation: Single DNA base variation found <1% C T T A G C T T C T T A G T T T SNP C T T A G C T T C T T A G T T T Mutation 94% 6% 99.9% 0.1% 7
  • 8.
    Mutations and SNPs 8 Common Ancestor timepresent Observed genetic variations Mutations SNPs
  • 9.
    Single Nucleotide Polymorphism •A SNP is usually assumed to be a binary variable • The probability of repeat mutation at the same SNP locus is quite small • The tri-allele cases are usually considered to be the effect of genotyping errors • The nucleotide on a SNP locus is called • a major allele (if allele frequency > 50%) • a minor allele (if allele frequency < 50%) A C T T A G C T T A C T T A G C T C C: Minor allele 94% 6% T: Major allele 9
  • 10.
    Single Nucleotide Polymorphism 10 ►SNPsare found in  coding and (mostly) noncoding regions ►Occur with a very high frequency  about 1 in 1000 bases to 1 in 100 to 300 bases ►Easily automated ►SNPs close to particular gene can act as a marker for that gene ►SNPs have become the preferred markers for association studies because of their high abundance and high-throughput SNP genotyping technologies.
  • 11.
    Genotypes The use ofhaplotype information has been limited because the individual genome is a diploid. To obtain the haplotype data, we have to separate them first  In large sequencing projects, genotypes instead of haplotypes are collected due to cost consideration. 11 A C G T A T SNP1 SNP2 C G Haplotype data SNP1 SNP2 Genotype data A G SNP1 SNP2 A T C G SNP1 SNP2 C T
  • 12.
    Problems of Genotypes
    • Genotypes only tell us the alleles at each SNP locus
    • We do not know how the alleles at different SNP loci are connected
    • Several haplotype pairs may be possible for the same genotype
    Example: the genotype SNP1 = A/C, SNP2 = G/T is consistent with the haplotype pair (A T, C G) or with the pair (A G, C T); we do not know which haplotype pair is real
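The ambiguity described on this slide can be made concrete by enumerating every haplotype pair that is consistent with an unphased genotype. The sketch below is illustrative only; the function name and the tuple-based genotype encoding are assumptions, not part of the presentation.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """List the distinct unordered haplotype pairs consistent with a genotype.

    genotype: one (allele1, allele2) tuple per SNP locus, phase unknown.
    """
    pairs = set()
    # Each locus can hand its two alleles to the two haplotypes in either order.
    for orientation in product((0, 1), repeat=len(genotype)):
        hap1 = "".join(g[o] for g, o in zip(genotype, orientation))
        hap2 = "".join(g[1 - o] for g, o in zip(genotype, orientation))
        pairs.add(tuple(sorted((hap1, hap2))))
    return sorted(pairs)


# Genotype from the slide: SNP1 = A/C, SNP2 = G/T.
pairs = compatible_haplotype_pairs([("A", "C"), ("G", "T")])
```

For this genotype the function returns the two candidate pairs named on the slide, (AG, CT) and (AT, CG). With k heterozygous loci the number of candidate pairs grows as 2^(k-1), which is why phasing needs extra information.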
  • 13.
    “Haplotype-led approaches for increasing precision in plant breeding”
  • 14.
    Outline of Presentation
    • Introduction
    • Haplotype mapping
    • Haplotype construction and inference
    • Tag SNPs and methods to select tSNPs
    • Application of haplotype-led approaches in plant breeding
    • Case studies
    • Conclusion
  • 15.
    Haplotype
    • A haplotype is a group of genes in an organism that are inherited together from a single parent in a defined order (Bevan et al., 2017)
    • These variants tend to be inherited together, often because they lie very close together in the same chromosome region and are therefore less likely to be separated by crossing over (Snowdon et al., 2015)
    • A haplotype is a string of SNPs that are linked and co-inherited (“polymorphic frozen blocks”)
    [Figure labels: alleles, locus, haplotypes]
  • 16.
    Haplotypes
    • In terms of SNPs: “Two or more SNP alleles that tend to be inherited as a unit” (Bernardo, 2010)
    • A haplotype stands for a set of linked SNPs on the same chromosome that are not easily separated by recombination
    • Within each block, recombination is rare because of tight linkage
    Example (SNP1, SNP2, and SNP3 are the 2nd, 5th, and 9th bases):
      -A C T T A G C T T-   Haplotype 1: C A T
      -A A T T T G C T C-   Haplotype 2: A T C
      -A C T T T G C T C-   Haplotype 3: C T C
  • 17.
    Haplotype blocks Haplotype blocksare defined as a contiguous series of SNPs and appearing to have very little evidence of historical recombination among the individuals (Gabriel et al., 2002) Recombination Hotspots and Haplotype Blocks 17
  • 18.
    Recombination Hotspots and Haplotype Blocks
    [Figure: a chromosome divided into haplotype blocks separated by recombination hotspots; haplotype patterns P1 to P4 across SNP loci S1 to S12, with major and minor alleles marked]
  • 19.
    A Haplotype Block Example
    Patil et al. (Science, 2001) partitioned human chromosome 21 into 4,135 haplotype blocks spanning 24,047 SNPs.
    [Figure legend: blue box = major allele; yellow box = minor allele]
  • 20.
    Hapmap The HapMap isa map of the haplotype blocks and specific SNPs that identify the haplotypes The haplotype map or "HapMap" acts as tool to find genes and genetic variations that affect the trait expression. Source: The International Hapmap Project 20
  • 21.
    Third-generation sequencing: alleviating the bottlenecks in haplotype identification
    [Figure labels: NGS technology; TGS technology; steps in HapMap construction]
  • 22.
    Phasing methods for haplotype construction/reconstruction
    1. Reference-based phasing
    2. De novo genome assembly (such as diploid and polyploid assembly)
    3. Strain-resolved metagenome assembly (de novo re-assembly, single nucleotide variant-based assembly, read and contig binning)
  • 23.
    Tools for haplotype analysis
    WhatsHap, HapCUT2, WhatsHap polyphase, HapTree, FALCON-Phase, hifiasm, SDip, POLYTE, DESMAN, MetaMaps, ProxiMeta
  • 24.
    Haplotagging: a novel sequencing strategy for rapid discovery of haplotypes
  • 25.
    Steps in HapMap construction
    1. SNPs are identified in DNA samples from multiple individuals
    2. Adjacent SNPs that are inherited together are compiled into haplotypes
    3. “Tag” SNPs that uniquely describe those haplotypes are identified within them
    Source: The International HapMap Project
  • 26.
    Haplotype blocking methods (Saad et al., 2018)
    1. Confidence interval test
    2. Four gamete test
    3. Solid spine of linkage disequilibrium
  • 27.
    Confidence interval test
    • Up to 5% of weak LD is allowed within a haplotype block because forces such as recurrent mutation, gene conversion, and genome assembly or genotyping errors can weaken LD in addition to recombination events (Saad et al., 2018)
  • 28.
    Four gamete test
    • A haplotype block partitioning method that assumes no recombination events occur within a block
    • Three gametes observed for a SNP pair (e.g., A C, G C, G T): no recombination, so the SNPs form a haplotype block
    • Four gametes observed for a SNP pair (e.g., A C, G C, G T, A T): a recombination event has occurred, so no block is formed
    • The rare gamete must have a frequency > 0.01 to count as a recombination event
    • Recombination events are accepted only between blocks
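A minimal sketch of the four gamete test for one pair of bi-allelic SNPs, assuming phased haplotypes are available as strings with one character per SNP. The function name and the `min_freq` parameter are illustrative; the 0.01 default follows the rare-gamete rule quoted on the slide.

```python
from collections import Counter


def four_gamete_test(haplotypes, i, j, min_freq=0.01):
    """Return True if SNPs i and j show all four gametes (recombination inferred).

    haplotypes: list of phased haplotype strings, one character per SNP.
    Gametes whose frequency does not exceed min_freq are ignored.
    """
    counts = Counter((h[i], h[j]) for h in haplotypes)
    total = sum(counts.values())
    observed = [g for g, n in counts.items() if n / total > min_freq]
    return len(observed) == 4


# Gametes from the slide: three-gamete and four-gamete conditions.
three_gametes = ["AC", "GC", "GT"]
four_gametes = ["AC", "GC", "GT", "AT"]
recombination_in_three = four_gamete_test(three_gametes, 0, 1)
recombination_in_four = four_gamete_test(four_gametes, 0, 1)
```

A block partitioning routine would extend a block SNP by SNP and close it as soon as any pair inside the candidate block returns True.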
  • 29.
    Solid spine of linkage disequilibrium
    • A scenario in which a SNP marker shows strong and consistent associations with the surrounding SNPs, indicating a stable haplotype block
    • The solid spine is a line of strong LD (> 0.8) that runs from one marker to the next along the legs of the triangle in the LD chart and so defines a particular haplotype block
    Pairwise LD among five SNPs:
    SNP# | SNP1 | SNP2 | SNP3 | SNP4 | SNP5
    SNP1 | - | 0.97 | 0.99 | 0.93 | 0.96
    SNP2 |  | - | 0.18 | 0.67 | 0.98
    SNP3 |  |  | - | 0.03 | 0.94
    SNP4 |  |  |  | - | 0.95
    SNP5 |  |  |  |  | -
    • Strong LD is observed between the first SNP and the last SNP, and between each of them and all the intermediate SNPs
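The solid-spine condition can be checked directly on the five-SNP table from this slide. The sketch below assumes the usual reading of the criterion: the first and last markers must be in strong LD with every other marker, while intermediate markers need not be in LD with each other. The function name and the dictionary layout are illustrative.

```python
def has_solid_spine(ld, threshold=0.8):
    """Check the solid-spine condition on an upper-triangular LD table.

    ld[(i, j)] holds the pairwise LD value for markers i < j (0-based).
    Only the two "legs" of the LD triangle are inspected: the row of the
    first marker and the column of the last marker.
    """
    n = 1 + max(j for _, j in ld)
    first_leg = all(ld[(0, j)] > threshold for j in range(1, n))
    last_leg = all(ld[(i, n - 1)] > threshold for i in range(n - 1))
    return first_leg and last_leg


# Pairwise values copied from the slide's five-SNP table.
ld = {
    (0, 1): 0.97, (0, 2): 0.99, (0, 3): 0.93, (0, 4): 0.96,
    (1, 2): 0.18, (1, 3): 0.67, (1, 4): 0.98,
    (2, 3): 0.03, (2, 4): 0.94,
    (3, 4): 0.95,
}
spine = has_solid_spine(ld)
```

The low values between intermediate SNPs (0.18, 0.67, 0.03) do not break the block, because only the legs of the triangle are tested.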
  • 30.
    Comparison among haplotype blocking methods (Qian et al., 2017)
    Item | FGT | CIT | SSLD
    Recombination event within block | Not allowed | ≤ 5% is allowed | Only between a few intermediate SNPs
    Strong LD | LD is not used | D′ upper limit = 0.98, D′ lower limit = 0.9 | D′ > 0.8
    Weak LD | LD is not used | D′ upper limit = 0.9, D′ lower limit = 0.7 | D′ ≤ 0.8
    Morphology in the LD chart | No recombination event between all SNPs in the block | > 95% strong LD between all SNPs in the block | Strong LD in the legs of the LD chart
    FGT: four gamete test; CIT: confidence interval test; SSLD: solid spine of linkage disequilibrium
    The FGT method differs from the other methods in that it does not require an LD threshold.
  • 31.
    Haplotype Inference
    • The problem of inferring haplotypes from a set of genotypes is called haplotype inference
    • Most combinatorial methods solve this problem under the maximum parsimony model
    • This model assumes that the number of distinct haplotypes in a natural population is small
    • The solution is a minimum set of haplotypes that can explain the given genotypes
  • 32.
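    Maximum Parsimony
    • Find a minimum set of haplotypes that explains the given genotypes
    Example:
      G1 (SNP1 = A/C, SNP2 = G/T) can be explained by h3 (A G) + h4 (C T) or by h1 (A T) + h2 (C G)
      G2 (SNP1 = A/A, SNP2 = T/T) is explained by h1 (A T) + h1 (A T)
      The minimum set explaining both genotypes is {h1, h2}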
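For very small inputs, the maximum parsimony model can be illustrated by brute force: try haplotype sets of increasing size until one explains every genotype. This is a sketch for intuition only (the search is exponential, and practical tools use far better algorithms); all function names are hypothetical.

```python
from itertools import combinations, product


def haplotype_pairs(genotype):
    """All unordered haplotype pairs consistent with an unphased genotype."""
    pairs = set()
    for orientation in product((0, 1), repeat=len(genotype)):
        hap1 = "".join(g[o] for g, o in zip(genotype, orientation))
        hap2 = "".join(g[1 - o] for g, o in zip(genotype, orientation))
        pairs.add(tuple(sorted((hap1, hap2))))
    return pairs


def explains(hap_set, genotype):
    """True if some consistent pair of the genotype lies entirely in hap_set."""
    return any(h1 in hap_set and h2 in hap_set
               for h1, h2 in haplotype_pairs(genotype))


def max_parsimony(genotypes):
    """Smallest set of haplotypes that explains every genotype (brute force)."""
    candidates = sorted({h for g in genotypes
                         for pair in haplotype_pairs(g) for h in pair})
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            if all(explains(set(subset), g) for g in genotypes):
                return set(subset)
    return set()


# Genotypes from the slide: G1 = (A/C, G/T), G2 = (A/A, T/T).
g1 = [("A", "C"), ("G", "T")]
g2 = [("A", "A"), ("T", "T")]
solution = max_parsimony([g1, g2])
```

For the slide's two genotypes the search settles on {AT, CG}, that is h1 and h2: G2 forces h1 into the set, and h1 + h2 then also explains G1 without needing h3 or h4.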
  • 33.
    Factors affecting haplotype map construction (Hamblin & Jannink, 2011)
    • SNP allele frequency distribution
    • Haplotype allele numbers
    • Linkage disequilibrium (LD)
  • 34.
    Problems of Using SNPs for Association Studies
    • The number of SNPs is still too large for association studies: there are millions of SNPs in a plant genome
    • To reduce SNP genotyping cost, we wish to use as few SNPs as possible
    • Tag SNPs are a small subset of SNPs that is sufficient for association studies without losing the power of using all SNPs
  • 35.
    Brief glossary of terms (Halldorsson et al., 2004)
    • Haplotype tagging: selecting minimum subsets of SNPs that can represent the (common) haplotypes inferred from the original set
    • Tagging SNP or tag SNP: a SNP that reports, partially or totally, the state of one or more other SNPs
    • Tagged SNP: a SNP that does not need to be genotyped because its state is reported by tagging SNPs
  • 36.
    Examples of Tag SNPs
    • Suppose we wish to identify an unknown haplotype sample
    • We can genotype all SNPs to identify the haplotype sample
    [Figure: haplotype patterns P1 to P4 across SNP loci S1 to S12 and an unknown haplotype sample; major and minor alleles are marked]
  • 37.
    Examples of Tag SNPs
    • In fact, it is not necessary to genotype all SNPs
    • SNPs S3, S4, and S5 can form a set of tag SNPs
    [Figure: haplotype patterns P1 to P4 across SNP loci S1 to S12, with columns S3, S4, and S5 extracted]
  • 38.
    Examples of Wrong Tag SNPs
    • SNPs S1, S2, and S3 cannot form a set of tag SNPs because P1 and P4 would be ambiguous
    [Figure: haplotype patterns P1 to P4 across SNP loci S1 to S12, with columns S1, S2, and S3 extracted]
  • 39.
    Examples of Tag SNPs
    • SNPs S1 and S12 can form a set of tag SNPs
    • This set of SNPs is the minimum solution in this example
    [Figure: haplotype patterns P1 to P4 across SNP loci S1 to S12, with columns S1 and S12 extracted]
  • 40.
    Steps for tag SNP selection (Halldorsson et al., 2004)
    1. Determining predictive neighborhoods
    2. Minimizing the number of tagging SNPs
    3. Tagging quality assessment
  • 41.
    Haplotype Blocks and Tag SNPs
    • Recent studies have shown that a chromosome can be partitioned into haplotype blocks interspersed with recombination hotspots
    • Within a haplotype block, little or no recombination has occurred
    • The SNPs within a haplotype block tend to be inherited together
    • Within a haplotype block, a small subset of SNPs (called tag SNPs) is sufficient to distinguish each pair of haplotype patterns in the block
    • We only need to genotype the tag SNPs instead of all SNPs within a haplotype block
  • 42.
    Problem Formulation
    • The relation between SNPs and haplotype patterns can be formulated as a bipartite graph: SNPs S1 to S4 on one side and the pairs of patterns (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) on the other
    • For K patterns there are K(K − 1)/2 pairs of patterns
    • S1 can distinguish (P1, P3), (P1, P4), (P2, P3), and (P2, P4)
    • S2 can distinguish (P1, P4), (P2, P4), and (P3, P4)
    [Figure: haplotype patterns P1 to P4 over SNPs S1 to S4 and the corresponding bipartite graph]
  • 43.
    Observation
    • A set of SNPs can form a set of tag SNPs if each pair of patterns is connected by at least one edge to a SNP in the set
    • e.g., S1 and S3 can form a set of tag SNPs
    • e.g., S1 and S2 cannot form a set of tag SNPs
    [Figure: bipartite graph restricted to S1 and S3, in which every pair of patterns is connected by at least one edge]
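The observation above translates into a simple search: a SNP subset is a valid tag set when every haplotype pattern gets a unique signature on those SNPs, which is the same as every pair of patterns being distinguished by at least one SNP. The pattern matrix below is hypothetical. Columns S1 to S3 were chosen to agree with what the slides state about which pairs S1 and S2 distinguish and with S1 + S3 being a tag set; column S4 is invented for illustration.

```python
from itertools import combinations


def is_tag_set(patterns, snp_indices):
    """True if the chosen SNP columns give every pattern a unique signature."""
    signatures = {tuple(p[s] for s in snp_indices) for p in patterns.values()}
    return len(signatures) == len(patterns)


def minimum_tag_snps(patterns, snp_names):
    """Smallest tag SNP set by exhaustive search (fine for small blocks only)."""
    for size in range(1, len(snp_names) + 1):
        for subset in combinations(range(len(snp_names)), size):
            if is_tag_set(patterns, subset):
                return tuple(snp_names[s] for s in subset)
    return None


# Hypothetical patterns: 0 = major allele, 1 = minor allele at S1..S4.
snp_names = ["S1", "S2", "S3", "S4"]
patterns = {
    "P1": (0, 0, 0, 0),
    "P2": (0, 0, 1, 1),
    "P3": (1, 0, 0, 1),
    "P4": (1, 1, 1, 1),
}
tags = minimum_tag_snps(patterns, snp_names)
```

With this matrix, S1 and S2 together leave P1 and P2 indistinguishable, while S1 and S3 separate all four patterns. Exhaustive search is exponential in the number of SNPs, so larger blocks are usually handled with greedy set-cover heuristics.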
  • 44.
    Methods to select tSNPs: principal component analysis (PCA)
    PCA is used to reduce the dimensions of the complete set of SNPs:
    1. Compute the covariance matrix of the SNPs
    2. Perform principal components analysis
    3. SNPs that contribute most to the eigenvectors associated with the largest eigenvalues are considered the most influential
    4. Add the selected SNPs to the set of tagging SNPs
  • 45.
    Methods to select tSNPs: Shannon entropy
    • Based on how well a subset of SNPs captures the variation in the complete set
    • Shannon entropy quantifies how much genetic diversity a particular SNP captures
    • High entropy: the SNP occurs in different allelic versions across individuals → greater diversity → the SNP is selected as a tSNP
    • Low entropy: most individuals carry the same allele → less diversity → less informative
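As a small illustration of the entropy criterion, the Shannon entropy of a single SNP can be computed from its allele frequencies as H = -Σ p log2 p. The function name and the toy allele strings are assumptions made for this sketch.

```python
import math
from collections import Counter


def snp_entropy(alleles):
    """Shannon entropy (in bits) of the allele distribution at one SNP."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical SNPs typed in eight individuals.
balanced = snp_entropy("AAAACCCC")     # both alleles equally common
skewed = snp_entropy("AAAAAAAC")       # one allele dominates
monomorphic = snp_entropy("AAAAAAAA")  # no variation at all
```

A bi-allelic SNP reaches its maximum of 1 bit when both alleles have frequency 0.5 and drops to 0 when the locus is monomorphic, which matches the high-entropy and low-entropy cases on the slide.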
  • 46.
    Linkage Disequilibrium
    • The problem of finding tag SNPs can also be approached from a statistical point of view
    • We can measure the correlation between SNPs and identify sets of highly correlated SNPs
    • For each set of correlated SNPs, only one SNP needs to be genotyped; it can then be used to predict the values of the other SNPs
    • Linkage disequilibrium (LD) is a measure of this correlation between two SNPs
  • 47.
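    Introduction to Linkage Disequilibrium
    Two bi-allelic SNPs (SNP1 with alleles A/a, SNP2 with alleles B/b) give four possible haplotypes: A B, A b, a B, a b. The SNPs are in linkage disequilibrium when the haplotype frequencies differ from the products of the allele frequencies:
    • P_AB ≠ P_A P_B
    • P_Ab ≠ P_A P_b = P_A (1 − P_B)
    • P_aB ≠ P_a P_B = (1 − P_A) P_B
    • P_ab ≠ P_a P_b = (1 − P_A)(1 − P_B)
    Haplotype frequency table (rows: SNP1 allele; columns: SNP2 allele):
          | B    | b    | Total
    A     | P_AB | P_Ab | P_A
    a     | P_aB | P_ab | P_a
    Total | P_B  | P_b  | 1.0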
  • 48.
    Linkage Disequilibrium Formulas
    • Mathematical formulas for computing LD as a correlation, r^2 (also written Δ^2), treating A and B as 0/1 indicators of the alleles at the two SNPs:
    $$\operatorname{Cov}(A,B) = E[AB] - E[A]\,E[B] = P_{AB} - P_A P_B$$
    $$\operatorname{Var}(A) = E[A^2] - E[A]^2 = P_A - P_A^2 = P_A(1 - P_A)$$
    $$r^2 = \frac{\operatorname{Cov}(A,B)^2}{\operatorname{Var}(A)\,\operatorname{Var}(B)} = \frac{(P_{AB} - P_A P_B)^2}{P_A(1 - P_A)\,P_B(1 - P_B)}$$
  • 49.
    Linkage Disequilibrium Bins
    • Statistical methods for finding tag SNPs are based on the analysis of LD among all SNPs
    • An LD bin is a set of SNPs that are highly correlated with each other
    • The value of a single SNP in an LD bin can predict the values of the other SNPs in the same bin
    • These methods try to identify the minimum set of LD bins
  • 50.
    An Example of LD Bins (1/3)
    • SNP1 and SNP2 cannot form an LD bin
    • e.g., A in SNP1 may imply either G or A in SNP2
    Individual | SNP1 | SNP2 | SNP3 | SNP4 | SNP5 | SNP6
    1 | A | G | A | C | G | T
    2 | T | G | C | C | G | C
    3 | A | A | A | T | A | T
    4 | T | G | C | T | A | C
    5 | T | A | C | C | G | C
    6 | T | G | C | T | A | C
    7 | A | A | A | T | A | T
    8 | A | A | A | T | A | T
  • 51.
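    An Example of LD Bins (2/3)
    • SNP1, SNP3, and SNP6 can form an LD bin: in every individual, A at SNP1 goes with A at SNP3 and T at SNP6, and T at SNP1 goes with C at SNP3 and C at SNP6
    • Any SNP in this bin is sufficient to predict the values of the others
    Individual | SNP1 | SNP2 | SNP3 | SNP4 | SNP5 | SNP6
    1 | A | G | A | C | G | T
    2 | T | G | C | C | G | C
    3 | A | A | A | T | A | T
    4 | T | G | C | T | A | C
    5 | T | A | C | C | G | C
    6 | T | G | C | T | A | C
    7 | A | A | A | T | A | T
    8 | A | A | A | T | A | T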
  • 52.
    An Example of LD Bins (3/3)
    • There are three LD bins, and only three tag SNPs need to be genotyped (e.g., SNP1, SNP2, and SNP4)
    Individual | SNP1 | SNP2 | SNP3 | SNP4 | SNP5 | SNP6
    1 | A | G | A | C | G | T
    2 | T | G | C | C | G | C
    3 | A | A | A | T | A | T
    4 | T | G | C | T | A | C
    5 | T | A | C | C | G | C
    6 | T | G | C | T | A | C
    7 | A | A | A | T | A | T
    8 | A | A | A | T | A | T
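The binning in this example can be reproduced with a simple greedy procedure: each unassigned SNP seeds a bin and collects every other unassigned SNP whose r^2 with the seed reaches a threshold. This is a sketch under assumptions: the greedy strategy, the function names, and the 0.8 threshold are illustrative choices, not the method used in the presentation.

```python
def r_squared(x, y):
    """r^2 between two bi-allelic SNP columns given as sequences of alleles."""
    a, b = x[0], y[0]  # reference alleles: those carried by the first individual
    n = len(x)
    p_a = sum(v == a for v in x) / n
    p_b = sum(v == b for v in y) / n
    p_ab = sum(u == a and v == b for u, v in zip(x, y)) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return 0.0 if denom == 0 else (p_ab - p_a * p_b) ** 2 / denom


def ld_bins(columns, threshold=0.8):
    """Greedy LD binning: each unassigned SNP seeds a bin of correlated SNPs."""
    bins, assigned = [], set()
    names = list(columns)
    for seed in names:
        if seed in assigned:
            continue
        members = [seed] + [
            s for s in names
            if s != seed and s not in assigned
            and r_squared(columns[seed], columns[s]) >= threshold
        ]
        assigned.update(members)
        bins.append(members)
    return bins


# The eight individuals from the slide, one string per individual (SNP1..SNP6).
rows = ["AGACGT", "TGCCGC", "AAATAT", "TGCTAC",
        "TACCGC", "TGCTAC", "AAATAT", "AAATAT"]
columns = {f"SNP{k + 1}": [row[k] for row in rows] for k in range(6)}
bins = ld_bins(columns)
```

On the slide's data this yields three bins, {SNP1, SNP3, SNP6}, {SNP2}, and {SNP4, SNP5}, so one tag SNP per bin (for instance SNP1, SNP2, and SNP4) is enough.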
  • 53.
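    Examples of Computing LD
    Individual | SNP1 | SNP2 | SNP3 | SNP4 | SNP5 | SNP6
    1 | A | T | A | A | G | T
    2 | G | T | C | C | T | T
    3 | G | A | C | A | G | T
    4 | G | A | C | C | T | T
    5 | G | A | C | A | G | C
    SNP1 vs. SNP2, using allele G at SNP1 (frequency 4/5) and allele A at SNP2 (frequency 3/5), which occur together in 3 of 5 individuals:
    $$r_{12}^2 = \frac{(P_{AB} - P_A P_B)^2}{P_A(1 - P_A)\,P_B(1 - P_B)} = \frac{\left(\frac{3}{5} - \frac{4}{5}\cdot\frac{3}{5}\right)^2}{\frac{4}{5}\cdot\frac{1}{5}\cdot\frac{3}{5}\cdot\frac{2}{5}} = 0.375$$
    SNP1 vs. SNP3, using allele G at SNP1 (frequency 4/5) and allele C at SNP3 (frequency 4/5), which occur together in 4 of 5 individuals:
    $$r_{13}^2 = \frac{(P_{AB} - P_A P_B)^2}{P_A(1 - P_A)\,P_B(1 - P_B)} = \frac{\left(\frac{4}{5} - \frac{4}{5}\cdot\frac{4}{5}\right)^2}{\frac{4}{5}\cdot\frac{1}{5}\cdot\frac{4}{5}\cdot\frac{1}{5}} = 1$$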
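The two worked values can be checked numerically through the covariance form of r^2, coding each SNP as a 0/1 indicator of one of its alleles. The helper names are hypothetical; the data are the five individuals from this slide.

```python
def indicator(column, allele):
    """Code a SNP column as 1 where the given allele is present, else 0."""
    return [1 if v == allele else 0 for v in column]


def mean(values):
    return sum(values) / len(values)


def r_squared_cov(x, y):
    """r^2 = Cov(x, y)^2 / (Var(x) * Var(y)) for two 0/1 indicator columns."""
    ex, ey = mean(x), mean(y)
    cov = mean([a * b for a, b in zip(x, y)]) - ex * ey
    var_x = mean([a * a for a in x]) - ex ** 2
    var_y = mean([b * b for b in y]) - ey ** 2
    return cov ** 2 / (var_x * var_y)


# Five individuals from the slide, one string per individual (SNP1..SNP6).
rows = ["ATAAGT", "GTCCTT", "GACAGT", "GACCTT", "GACAGC"]
snp1 = indicator([r[0] for r in rows], "G")
snp2 = indicator([r[1] for r in rows], "A")
snp3 = indicator([r[2] for r in rows], "C")

r2_12 = r_squared_cov(snp1, snp2)
r2_13 = r_squared_cov(snp1, snp3)
```

Up to floating-point rounding, `r2_12` and `r2_13` agree with the 0.375 and 1 obtained by hand. Swapping which allele is coded as 1 leaves r^2 unchanged, so the choice of reference allele does not matter.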
  • 54.
    Difference between Haplotype Blocks and LD Bins
    Haplotype blocks | LD bins
    Based on the assumption that SNPs in close proximity tend to be correlated with each other | Can group correlated SNPs that are distant from each other
    Represent co-inheritance of alleles within the block | A disease is usually affected by multiple genes instead of a single one
    The probability of recombination within the block is low | Reflect similar allelic frequencies within the bin
    The SNPs in a haplotype block do not appear in another block | The SNPs in one LD bin can be shared by other bins
  • 55.
    Tagging quality assessment (Ding & Kullo, 2007)
    • Prediction accuracy (PA): how well an identified set of tag SNPs identifies the haplotypes
    • Tagging consistency: quantifies the consistency between the sets of tSNPs derived from one population using two different methods, or from two populations using one method
    • Tagging efficiency: provides an estimate of the savings in genotyping offered by tSNPs
  • 56.
    Applications of haplotype-led approaches for increasing precision in plant breeding
  • 57.
    Haplo-pheno analysis (Sinha et al., 2020)
    DNA sequences of individuals 1 to 6 with contrasting phenotypes, differing at two SNPs:
      GATATTCGTACGGAT
      GATGTTCGTACTGAT
      GATATTCGTACGGAT
      GATATTCGTACGGAT
      GATGTTCGTACTGAT
      GATGTTCGTACTGAT
    [Figure labels: genetic map; candidate gene; phenotype distribution]
  • 58.
    Advantage of using haplo-pheno analysis (hypothetical example)
    Haplotype | Frequency | Phenotypic mean
    GATTGTA | 0.35 | 96
    GATTATA | 0.55 | 89
    GTTCATA | 0.10 | 115
    [Figure annotation: 102]
    Haplotype information provides better resolution!
  • 59.
    Haplotype-based breeding (HBB)
    • Retrospective approach: the genomes of popular cultivars are resequenced to identify favorable haplotypes
    • Prospective approach: new and potentially valuable haplotypes are identified by studying larger populations of ancestral types and wild relatives of the crop species
  • 60.
    GWAS using haplotypes
    Association mapping is a tool for resolving complex trait variation down to the sequence level by exploiting historical and evolutionary recombination events at the population level.
    Importance:
    • It can directly exploit the extensive phenotypic data already collected during variety registration trials
    • Compared with biparental crosses, it offers higher genetic resolution because it uses historical recombination events
    • It has become increasingly feasible with the emergence of cost-effective, high-throughput genotyping approaches
  • 61.
    Statistical approaches for marker–trait associations
    1. Single-locus mixed model
    2. High-density SNP multi-locus model
    3. Haplotype-based model
  • 62.
    Haplotype vs. individual markers: comparative efficiency for crop breeding
  • 63.
    Haplotype vs. individual markers: comparative efficiency for crop breeding
    • SNPs tiled on arrays are usually chosen for their moderate to high minor allele frequency (MAF)
    • The SNP-based genomic relationship matrix (GSNP) therefore rests on SNPs with relatively high MAF and may trace changes due to recent selection less accurately than the haplotype-based matrix (GHAP)
    • GHAP can differentiate between identity by descent (IBD) and identity by state (IBS)
    • Meuwissen et al. (2014) suggested that building the relationship matrix from haplotypes instead of single SNPs may improve the accuracy of genomic predictions
  • 64.
    Haplotype markers in genome-wide association (GWAS) analysis in different crop species
    Crop species | Trait | Population size | Haplotype markers | Haplotype-trait associations | Reference
    Soybean | 100-seed weight; plant height; seed yield | 169 | 941 | 87 | Contreras-Soto et al., 2017
    Wheat | Heading date; plant height; 1000-grain weight; grain number per spike; fruiting efficiency at harvest | 102 | 4516 | 97 | Basile et al., 2019
    Wheat | Grain yield; days to heading; plant height | 6461 | 519 | 36 | Sehgal et al., 2020
    Barley | Deoxynivalenol content in kernels; heading time; days to maturity; grain yield; plant height; specific weight; 1000-kernel weight | 277 | 14400 | - | Abed et al., 2019
    Barley | Yield and quality-related traits | 106 | 2770 | 23 | Gawenda et al., 2019
    Rice | Grain shape | 372 | 30 | Ogawa et al., 2021
    Rice | Agronomic traits | 414 | 15275 | Hamazaki et al., 2020
    Maize | Agronomic and reproductive traits | 322 | 53403 | 44 | Maldonado et al., 2019
    Maize | Total plant height; ear height; ear height/plant height | 183 | 7831 | 40 | Maldonado et al., 2019
    Maize | Agronomic and reproductive traits | >1000 | 154104 | Mayer et al., 2020
    Oat | Heading date | 4657 | 164741 | 184 | Bekele et al., 2018
    Rapeseed | Days to flowering; seed glucosinolate content | 950 | 15 | Jan et al., 2019
  • 65.
    Limitations of haplotype-based GWAS
    • Non-informative SNPs in a haplotype block can mask the effect of adjacent informative SNPs, which leads to spurious associations and reduces the effectiveness of the GWAS analysis
    • The haplotype construction approaches have one thing in common: they all build haplotypes from consecutive SNPs in high LD
    • Haplotypes generated this way have been observed to provide no more information than single SNPs, because SNPs in high LD carry redundant information
  • 66.
    Functional haplotype-GWAS (FH-GWAS)
    FH-GWAS combines into a haplotype only those SNPs that are true contributors:
    1. Select the significant SNPs identified by GWAS on individual SNP markers
    2. Construct functional haplotypes, considering main and epistatic effects
    3. Run association tests using the functional haplotypes
    4. Narrow down the candidate region
    • FH-GWAS takes the associated epistatic effects of SNPs into consideration
  • 67.
    Haplotype-based marker-assisted selection (Qian et al., 2017)
  • 68.
    Pan-genomics: shift from useful individuals to useful haplotypes
  • 69.
    Genomic prediction using pangenome haplotype graphs
    • Genomic selection is an alternative approach for complex traits controlled by QTLs with small effects
    • It uses SNPs as predictors and is biased by the use of a single reference genome
    • Pan-genomic data can be used to identify new markers that improve prediction accuracy
    • Practical haplotype graph (PHG) approach
  • 70.
  • 71.
    • A wheat genomic selection (GS) panel of 383 genotypes was phenotyped for yield, test weight, and protein content
    • The panel was genotyped using the Illumina 90K SNP assay
    • Haplotype blocks of 5, 10, 15, and 20 adjacent markers were constructed for all chromosomes
    • A multi-allelic haplotype prediction algorithm was implemented and compared with single SNPs using both k-fold cross-validation and stratified sampling optimization
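The idea of building multi-allelic haplotype predictors from fixed-size blocks of adjacent markers can be sketched as follows. This is not the algorithm used in the study, which the slide does not describe in detail; the function names, the one-hot encoding, and the three-line toy panel are all assumptions made for illustration.

```python
def block_alleles(calls, block_size):
    """Split one line's marker string (in map order) into consecutive blocks."""
    return [calls[i:i + block_size] for i in range(0, len(calls), block_size)]


def one_hot_blocks(panel, block_size):
    """One-hot encode haplotype blocks: one column per (block index, allele).

    panel: {line_name: marker string}; all strings must have equal length.
    Returns (columns, rows), where rows[line] is a 0/1 list aligned to columns.
    """
    blocks = {name: block_alleles(calls, block_size)
              for name, calls in panel.items()}
    columns = sorted({(b, allele)
                      for alleles in blocks.values()
                      for b, allele in enumerate(alleles)})
    rows = {name: [1 if alleles[b] == allele else 0 for b, allele in columns]
            for name, alleles in blocks.items()}
    return columns, rows


# Hypothetical panel: three lines scored at six bi-allelic markers (A/B calls).
panel = {"line1": "AABBAB", "line2": "AABBBB", "line3": "ABBBAB"}
columns, rows = one_hot_blocks(panel, block_size=3)
```

Each block can carry more than two alleles, so the resulting design matrix has one column per observed block allele instead of one column per SNP; such a matrix could then be passed to any genomic prediction model.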
  • 72.
  • 73.
    • Haplotypes of 15 adjacent markers combined with training population optimization significantly improved the predictive ability for yield and protein content, by 14.3% and 16.8% respectively, compared with single SNPs and k-fold cross-validation
    Output of study: these results emphasize the effectiveness of using haplotypes in genomic selection to increase genetic gain
  • 74.
    The use of haplotype markers in genomic selection in different crop species
    Crop species | Trait | Training population size | Haplotype markers | GS prediction accuracy | Reference
    Bluegum | Traits related to wood quality and tree growth | 646 | ~3000 | 0.58 | Ballesta et al., 2019
    Soybean | Plant height & grain yield per plant | 235 | 357 | >0.80 & >0.45 | Ma et al., 2018
    Sorghum | Agronomic and yield-related traits | 207 | 1974 | 0.57-0.73 | Jensen et al., 2020
    Wheat | Yield, test weight, and protein content | 383 | 1400 | >0.40 | Sallam et al., 2019
    Wheat | Grain yield and related traits | 4302 | 1162 | 0.39-0.48 | Sehgal et al., 2020
    Oat | Heading date | 635 | 13954 | 0.42-0.67 | Bekele et al., 2018
  • 75.
    Objectives:
    • To identify superior haplotypes
    • To design haplotype-based breeding to develop tailor-made pigeonpea varieties resistant to drought
  • 76.
    Material & Method:
    • Material: pigeonpea reference set of 292 accessions (117 breeding lines, 166 landraces, 9 wild species)
    • Drought tolerance phenotyping of 137 accessions
    • Haplotype analysis using Haploview software (Barrett et al., 2005)
    • Candidate gene-based association analysis
    • Haplo-pheno analysis
  • 77.
    RESULTS: Phenotyping of the subset panel
    Trait | Minimum | Maximum | Mean | SD | Median | Mode | Kurtosis | Skewness | Standard error
    Plant weight (PW, g) | 0.11 | 2.17 | 0.87 | 0.33 | 0.84 | 0.84 | 1.08 | 0.53 | 0.03
    Shoot length (SL, cm) | 4.75 | 23.50 | 16.03 | 3.63 | 16.33 | 19.00 | 0.83 | -0.74 | 0.31
    Root length (RL, cm) | 5.00 | 24.33 | 13.05 | 3.39 | 12.67 | 13.00 | 0.94 | 0.66 | 0.29
    Fresh weight (FW, g) | 0.03 | 0.68 | 0.19 | 0.11 | 0.17 | 0.12 | 3.58 | 1.50 | 0.01
    Turgid weight (TW, g) | 0.03 | 1.28 | 0.32 | 0.18 | 0.30 | 0.15 | 4.60 | 1.44 | 0.02
    Dry weight (DW, g) | 0.02 | 0.24 | 0.07 | 0.04 | 0.07 | 0.05 | 3.43 | 1.33 | 0.00
    Relative water content (RWC, %) | 7.58 | 98.96 | 47.65 | 21.06 | 45.89 | 33.33 | -0.69 | 0.24 | 1.78
  • 78.
    Haplotype diversity
    Group | Number of haplotypes
    Breeding lines | 83
    Landraces | 132
    Wild species | 60
  • 79.
    Association of drought-responsive genes with phenotype for identification of trait-associated genes
    Trait | Gene | CcLG/Scaffold | SNP position (bp) | Gene (annotation) | P-value | PVE (%)
    Shoot length | C.cajan_30211 | Scaffold126966 | 344212 | U-box domain-containing protein 52 | 0.00490142 | 4.48
    Root length | C.cajan_30211 | Scaffold126966 | 349139 | U-box domain-containing protein 52 | 0.00129826 | 8.16
    Root length | C.cajan_29830 | Scaffold128889 | 297640 | Universal stress protein A-like protein | 0.00724983 | 5.62
    Plant weight | C.cajan_30211 | Scaffold126966 | 344102 | U-box domain-containing protein 52 | 0.00033134 | 8.85
    Plant weight | C.cajan_23080 | CcLG05 | 86455 | Universal stress protein | 0.00080033 | 7.67
    Plant weight | C.cajan_30211 | Scaffold126966 | 344496 | U-box domain-containing protein 52 | 0.00206941 | 6.42
    Fresh weight | C.cajan_30211 | Scaffold126966 | 344102 | U-box domain-containing protein 52 | 0.00013847 | 11.41
    Fresh weight | C.cajan_23080 | CcLG05 | 86455 | Universal stress protein | 0.00044457 | 9.61
    Fresh weight | C.cajan_30211 | Scaffold126966 | 344496 | U-box domain-containing protein 52 | 0.00161236 | 7.67
    Fresh weight | C.cajan_26230 | Scaffold133234 | 89141 | U-box domain-containing protein 35 | 0.00164604 | 7.64
    Fresh weight | C.cajan_46779 | Scaffold117697 | 3348 | Cation/H(+) antiporter 15 | 0.0054616 | 5.9
    Dry weight | C.cajan_26230 | Scaffold133234 | 91818 | U-box domain-containing protein 35 | 4.61E-06 | 17.02
    Dry weight | C.cajan_30211 | Scaffold126966 | 344102 | U-box domain-containing protein 52 | 0.00089919 | 8.59
    Dry weight | C.cajan_30211 | Scaffold126966 | 344496 | U-box domain-containing protein 52 | 0.00150988 | 7.81
    Turgid weight | C.cajan_30211 | Scaffold126966 | 344102 | U-box domain-containing protein 52 | 2.39E-05 | 13.62
    Turgid weight | C.cajan_30211 | Scaffold126966 | 344496 | U-box domain-containing protein 52 | 6.70E-05 | 12.03
    Turgid weight | C.cajan_23080 | CcLG05 | 86455 | Universal stress protein | 0.00026617 | 9.95
    Turgid weight | C.cajan_46779 | Scaffold117697 | 3348 | Cation/H(+) antiporter 15 | 0.00142431 | 7.52
    Turgid weight | C.cajan_46779 | Scaffold117697 | 1822 | Cation/H(+) antiporter 15 | 0.00395323 | 6.09
    Turgid weight | C.cajan_30211 | Scaffold126966 | 345767 | U-box domain-containing protein 52 | 0.0053937 | 5.67
    Turgid weight | C.cajan_30211 | Scaffold126966 | 345945 | U-box domain-containing protein 52 | 0.0053937 | 5.67
    Turgid weight | C.cajan_30211 | Scaffold126966 | 348513 | U-box domain-containing protein 52 | 0.0053937 | 5.67
    Relative water content | C.cajan_26230 | Scaffold133234 | 88787 | U-box domain-containing protein 35 | 0.00572797 | 2.78
  • 80.
    Haplo-pheno analysis of the C.cajan_23080 gene
  • 81.
    Superior haplotypes for drought-responsive genes
    Trait | Gene | Superior haplotype
    Plant weight (PW) | C.cajan_30211 | H6
    Plant weight (PW) | C.cajan_23080 | H2
    Fresh weight (FW) | C.cajan_30211 | H6
    Fresh weight (FW) | C.cajan_23080 | H2
    Fresh weight (FW) | C.cajan_26230 | H11
    Turgid weight (TW) | C.cajan_30211 | H6
    Turgid weight (TW) | C.cajan_23080 | H2
    Dry weight (DW) | C.cajan_26230 | H11
    Dry weight (DW) | C.cajan_30211 | H6
    Relative water content (RWC) | C.cajan_26230 | H5
    Summary by gene: C.cajan_23080: H2; C.cajan_30211: H6; C.cajan_26230: H5 and H11
  • 82.
    List of accessions carrying superior haplotypes for three drought-associated responsive genes
    Genotype | Gene(s) | PW | FW | TW | DW | RWC | Biological status | Region | Geographic origin (country)
    ICP 10447 | C.cajan_23080 | H2 | H2 | H2 | - | - | Landrace | South Asia | India
    ICP 1156 | C.cajan_23080 | H2 | H2 | H2 | - | - | Landrace | South Asia | India
    ICP 1273 | C.cajan_23080 | H2 | H2 | H2 | - | - | Landrace | South Asia | India
     | C.cajan_26230 | - | - | - | - | H5 | Landrace | South America | Venezuela
    ICP 9236 | C.cajan_23080 | H2 | H2 | H2 | - | - | Breeding line | South Asia | India
    ICP 10683 | C.cajan_30211 | H6 | H6 | H6 | H6 | - | Breeding line | South Asia | India
    ICP 7896 | C.cajan_30211 | H6 | H6 | H6 | H6 | - | Landrace | South Asia | India
    ICP 12765 | C.cajan_26230 | - | H11 | - | H11 | - | Landrace | South Asia | Philippines
    ICP 14163 | C.cajan_26230 | - | H11 | - | H11 | - | Landrace | South Asia | Indonesia
    ICP 12410 | C.cajan_26230 | - | - | - | - | H5 | Landrace | Unknown | Unknown
    ICP 13191 | C.cajan_26230 | - | - | - | - | H5 | Landrace | South Asia | India
    ICP 14971 | C.cajan_26230 | - | - | - | - | H5 | Landrace | South Asia | Indonesia
    ICP 2698 | C.cajan_26230 | - | - | - | - | H5 | Landrace | South Asia | India
    ICP 4167 | C.cajan_26230 | - | - | - | - | H5 | Landrace | South Asia | India
    ICP 6992 | C.cajan_26230 | - | - | - | - | H5 | Landrace | South Asia | India
    ICP 7420 | C.cajan_26230 | - | - | - | - | H5 | Landrace | South Asia | India
    ICP 8012 | C.cajan_26230 | - | - | - | - | H5 | Landrace | South Asia | India
    ICP 7314 | C.cajan_26230 | - | - | - | - | H5 | Breeding line | South Asia | India
  • 83.
    New plant type proposed for drought-tolerant pigeonpea
  • 84.
    Output of study:
    • Whole-genome re-sequencing data were used to identify superior haplotypes for 10 drought-responsive candidate genes
    • Candidate gene-based association analysis of these 10 genes revealed 23 strong marker-trait associations (MTAs)
    • Haplo-pheno analysis identified four superior haplotypes for three genes regulating five component drought traits
    • 17 accessions containing these four superior haplotypes were identified
    • Haplotype-based breeding can be used as a promising breeding approach to develop tailor-made crop varieties
  • 85.
    Objective:
    • To introduce the new concept of Heterotic Haplotype Capture (HHC)
  • 86.
  • 87.
    Rejuvenating a depleted breeding pool with novel species diversity
  • 88.
    Output of study:
    • Genome-scale data from the NAM and HHC populations enable the identification, in any given NAM line, of haplotype blocks that are predicted to be heterozygous in combination with a genotyped maternal tester
    • The availability of immortal heterotic populations provides a powerful resource for genome-scale investigations into the genetic basis of heterosis for yield and other important agronomic traits
  • 89.