Genome to pangenome: A doorway into crop genome exploration

GENOME TO PANGENOME :
A doorway into crop’s genome
exploration
KIRAN K.M
PGS20AGR8449
Department of genetics and plant breeding
MASTERS SEMINAR 1
UNIVERSITY OF AGRICULTURAL SCIENCES, DHARWAD

Our journey through…
How to capture this information? : Birth of pangenome
Introduction: How does genome-assisted crop improvement work, and what
information is missing from this approach?
How important is this “missing information”?
Is the information mined from pangenome-oriented GWAS worthwhile?
How can a pangenome be represented and analyzed effectively to dig out new
kinds of information?
What are the applications and future perspectives of pangenome-oriented crop
improvement?
Is pangenome the end of a story? : Conclusion

Entire genic/allelic variant forms within a
species
Domestication
Ecotype differentiation
Selection pressure
Birth or death of some
genes/modification via deletion,
duplication, transposition etc.

SINGLE-REFERENCE GENOME
Single reference genome oriented Comparative genome analysis
What if our reference genome is too incomplete to capture all of this information?
We need to capture the entire genetic diversity of a species: the doorway
into single-reference-free pan-genome analysis

“Boosting-up” of crop improvement programs
Genomic data
derived from multiple
accessions and cultivars
Full extent of
sequence variations
within a species
PAN-GENOMIC
approach to figure out new genes and alleles
directly related to phenotype
“A pangenome refers to the full complement of genes of a biological
clade, such as a species, which can be partitioned into a set of core genes that are
shared by all individuals and a set of dispensable genes that are partially shared
or individual specific.”
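The core/dispensable partition in this definition can be illustrated with a minimal Python sketch. The accession and gene names are hypothetical toy data, and `partition_pangenome` is an illustrative helper, not part of any published pipeline.

```python
from typing import Dict, Set, Tuple


def partition_pangenome(gene_sets: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Split the pangenome into core genes (present in every accession)
    and dispensable genes (absent from at least one accession)."""
    if not gene_sets:
        return set(), set()
    all_sets = list(gene_sets.values())
    pangenome = set().union(*all_sets)      # every gene seen in any accession
    core = set.intersection(*all_sets)      # genes shared by all accessions
    return core, pangenome - core


# Hypothetical toy accessions and gene names (assumptions for illustration)
accessions = {
    "acc1": {"geneA", "geneB", "geneC"},
    "acc2": {"geneA", "geneB", "geneD"},
    "acc3": {"geneA", "geneB"},
}
core, dispensable = partition_pangenome(accessions)
print("core:", sorted(core))
print("dispensable:", sorted(dispensable))
```

Here geneA and geneB form the core, while geneC and geneD are dispensable because each is missing from at least one accession.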

Hervé Tettelin Duccio Medini
✔ Pangenomes were first introduced by
Tettelin et al., to describe gene diversity
in Streptococcus agalactiae.
Michele Morgante
✔ Pangenomics in plants was first proposed by
Morgante et al.,
✔ 2014: first crop plant pangenome, in soybean
(Glycine max)

Extensive structural variants (SVs)
Presence-absence variation
(PAV)
Copy number variation (CNV) Chromosomal rearrangements

Origin of SV’s
Recombination errors,
Non-allelic homologous
recombination (NAHR)
Replication errors
Microhomology-mediated break-
induced replication (MMBIR)
DNA break repair errors
eg: Non-homologous end
joining (NHEJ)
Non-reciprocal exchanges.
Homoeologous non-reciprocal
transpositions (HNRT)
Replication errors
Fork stalling and template
switching (FoSTeS)
PAV
Gabur et al., 2018
CNV
Polyploidization and/or
Whole-genome duplication
Structural variations (SV’s) : The un-tapped genetic potential

❑Resistance to biotic stress
⮚Rhg1 locus –resistance to cyst
nematode(soybean)
⮚Absence of sulfotransferase
gene in PAVs with various sizes-
resistance to striga (Sorghum)
⮚ Deletions in the Pi21 gene
result in quantitative and
durable resistance against blast
disease (rice)
SVs affecting complex agronomic traits

❑Resistance to abiotic stress
⮚ PAVs of the Sub1A gene (Xu et al. 2006), encoding an ERF-like gene: submergence
tolerance; two ERF genes, SNORKEL1 and SNORKEL2 (Hattori et al. 2009): deep-water
response (rice)
⮚ Tolerance of phosphorus starvation at the Pup1 locus is attributed to the presence of
a receptor-like cytoplasmic kinase gene, PSTOL1 (rice)
❑Plant architecture
⮚ An extra copy of Rht-D1b, resulting from duplication of a >1 Mb region, causes a
>70% reduction in plant height (wheat)
❑Yield and grain quality
⮚ 1212-bp deletion 5 kb downstream of the GW5 gene causes variation of grain
width and grain weight(Rice)
❑Flowering Time
⮚ A CNV at the HvFT1 locus was found to be associated with flowering time
variation (Barley)

Compared with the entire pan-genome, genes in the flexible genome were
significantly enriched with those involved in biological processes, such as defense
response, photosynthesis and biosynthetic processes

Dynamics of pan-genome compartments
⮚Gene birth and death processes
• Errors during recombination
• De novo genes
• Duplication followed by rapid divergence,
neo-functionalization
⮚Transposable elements
• In maize, helitron TE activity can modify
50% of the genome structure.
⮚Horizontal transfers
• Conjugation
• Transduction
• Transformation (Christine et al., 2019)

Pan-genome construction and assembly methods
1. De novo sequencing and comparison
▪ Errors in assembly and annotation may lead to the false calling of variation
▪ Costly; requires high-quality data with high sequencing coverage
▪ Limits the application to relatively few individuals
2. Iterative mapping and assembly
▪ Start with a single reference genome as the base for the pangenome
▪ Whole-genome sequence data for multiple individuals are aligned to the reference genome
▪ Non-aligning sequence reads are assembled and added to the reference to build a pangenome
3. Graph-based approaches
▪ Sequence and variation graphs (VGs)
▪ Practical haplotype graphs (PHGs)
MULTIPLE GENOMES

Reference guided assembly
De-novo genome assembly
(Danilevicz, 2020)
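The iterative mapping and assembly idea (align reads to the current pangenome, assemble what does not align, append it) can be sketched as a toy loop. This is only an illustration under strong assumptions: "mapping" is exact substring matching, `toy_assemble` merely de-duplicates reads as a stand-in for a real assembler, and all sequences and names are hypothetical.

```python
from typing import Dict, List


def toy_assemble(reads: List[str]) -> List[str]:
    """Placeholder for a real assembler: collapse identical reads, keep order."""
    return list(dict.fromkeys(reads))


def iterative_pangenome(reference: str,
                        accessions: Dict[str, List[str]],
                        min_contig_len: int = 5) -> List[str]:
    """Grow a pangenome by appending sequence that does not map to it."""
    pangenome = [reference]
    for name, reads in accessions.items():
        # Toy "mapping": a read maps if it occurs exactly in any pangenome sequence
        unmapped = [r for r in reads if not any(r in seq for seq in pangenome)]
        contigs = toy_assemble(unmapped)
        # Keep only contigs above a minimum length, then append them
        pangenome.extend(c for c in contigs if len(c) >= min_contig_len)
    return pangenome


# Hypothetical toy data
reference = "ACGTACGTTT"
accessions = {
    "acc1": ["ACGTA", "GGGCCC"],
    "acc2": ["GGGCCC", "TTAA", "CCCCAAAA"],
}
pangenome = iterative_pangenome(reference, accessions)
print(pangenome)
```

Because each accession is mapped against the pangenome as it stands after the previous iteration, sequence that was already added (GGGCCC) is not added a second time.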

(Eizenga et al., 2020)
Pan-genome graphs
• To represent the sequence content and the corresponding functional
annotation of an entire population, species, or a clade
• Redundant sequences are compressed into smaller data structures while
retaining information on genomic diversity and whole-genome
relationships
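A minimal sketch of how a graph compresses redundant sequence: shared segments are stored once as nodes, and each haplotype is a path through the nodes. The class and the toy sequences are assumptions for illustration and do not reflect the data model of any specific graph toolkit.

```python
from typing import Dict, List


class VariationGraph:
    """Toy variation graph: nodes hold sequence, paths are haplotypes."""

    def __init__(self) -> None:
        self.nodes: Dict[int, str] = {}
        self.paths: Dict[str, List[int]] = {}

    def add_node(self, node_id: int, sequence: str) -> None:
        self.nodes[node_id] = sequence

    def add_path(self, name: str, node_ids: List[int]) -> None:
        missing = [n for n in node_ids if n not in self.nodes]
        if missing:
            raise KeyError(f"unknown nodes in path {name}: {missing}")
        self.paths[name] = list(node_ids)

    def spell(self, name: str) -> str:
        """Reconstruct the haplotype sequence by walking its path."""
        return "".join(self.nodes[n] for n in self.paths[name])

    def stored_bases(self) -> int:
        return sum(len(seq) for seq in self.nodes.values())


graph = VariationGraph()
graph.add_node(1, "ACGT")   # shared flank
graph.add_node(2, "A")      # reference allele
graph.add_node(3, "G")      # alternative allele (SNP)
graph.add_node(4, "TTGC")   # shared flank
graph.add_path("ref", [1, 2, 4])
graph.add_path("alt", [1, 3, 4])

total_haplotype_bases = sum(len(graph.spell(p)) for p in graph.paths)
print(graph.spell("ref"), graph.spell("alt"))
print(graph.stored_bases(), "bases stored for", total_haplotype_bases, "haplotype bases")
```

Both haplotypes remain fully recoverable even though the shared flanks are stored only once.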

Ideally, a complete and fully annotated
pangenome graph would integrate genomic,
epigenomic, and transcriptomic datasets, thus
facilitating downstream functional and
comparative analyses

Genome region harboring CsFT locus among six cucumber accessions
44.0 kb complex
insertion
25.3 kb complex
insertion
39.3 kb canonical
insertion
Li et al., 2022,
Nature Communications
Variation Graph (VG)

Practical haplotype graph (PHG): database and haplotype creation; trellis graph
representing genic and intergenic regions.
Jensen et al., 2020,
The Plant Genome
The aim of a PHG is to determine
which haplotypes of parents that have been
sequenced at high coverage are
present in progeny that have been
sequenced at low coverage
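The underlying idea, choosing the parental haplotype that best explains sparse progeny data, can be sketched as follows. This is a deliberately naive match-count rule for a single genomic region with hypothetical data; the published PHG software should not be assumed to work this way.

```python
from typing import Dict


def best_parental_haplotype(parent_haps: Dict[str, Dict[int, str]],
                            observed: Dict[int, str]) -> str:
    """Return the parent whose haplotype agrees with the most observed alleles.

    parent_haps: parent name -> {position: allele} from high-coverage data
    observed:    {position: allele} called from low-coverage progeny reads
    """
    def score(hap: Dict[int, str]) -> int:
        return sum(1 for pos, allele in observed.items() if hap.get(pos) == allele)

    # Ties resolve to the first parent in insertion order
    return max(parent_haps, key=lambda name: score(parent_haps[name]))


# Hypothetical region with three variant positions
parents = {
    "P1": {1: "A", 2: "C", 3: "G"},
    "P2": {1: "A", 2: "T", 3: "T"},
}
progeny_calls = {2: "T", 3: "T"}          # position 1 was not covered

chosen = best_parental_haplotype(parents, progeny_calls)
imputed = parents[chosen]                  # fills in the uncovered position
print(chosen, imputed)
```

Once the haplotype is chosen, alleles at positions the low-coverage reads never touched (position 1 here) are read off the parental haplotype.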

❖ Extrapolation of pangenome size leads to:
• a predicted pangenome of 63,865±31 genes (37,766±62 gene families)
• a predicted core genome of 49,740±164 genes (28,496±91 gene families)
Model describing the sizes of core and pangenome.
Orthologous gene clusters
61,379 genes
All genes
35,853 gene families
49,895 genes 28,532 gene families
Golicz et al., 2016
Nature Communications
DOI: 10.1038/ncomms13390

How many genomes are needed to capture the whole genome content?
✔A core/pangenome ratio
below 85% indicates extensive
adaptation.
✔In plants, core genome
represents from 40-80% of the
total pangenome.
Munir et al.,2020
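As a worked check, the core/pangenome ratio can be computed from gene counts quoted elsewhere in this deck: 49,895 core of 61,379 genes for Brassica oleracea, and 16,821 core of 35,719 genes for the sorghum case study. The helper name is an illustrative assumption.

```python
def core_fraction(core_genes: int, pangenome_genes: int) -> float:
    """Fraction of the pangenome made up of core genes."""
    if pangenome_genes <= 0:
        raise ValueError("pangenome size must be positive")
    return core_genes / pangenome_genes


b_oleracea = core_fraction(49_895, 61_379)   # counts from the B. oleracea slide
sorghum = core_fraction(16_821, 35_719)      # counts from the sorghum results slide

print(f"B. oleracea core fraction: {b_oleracea:.1%}")
print(f"Sorghum core fraction:     {sorghum:.1%}")
```

Both values fall inside the 40-80% band quoted for plants only in the sorghum case; the B. oleracea figure sits just above it, at roughly 81%.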

Power-law regression for new genes
The numbers n of new genes are plotted
for increasing values of the number N of
genomes sequenced.
(Tettelin et al. 2008)
✔Blue curves are least-squares fit of the
exponential function, as in the original
pangenome model.
✔Red curves are least-squares fit of the
power-law function.
Open pan genome
Closed pan genome
Closed (α > 1)
Open (α < 1)
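The open/closed decision can be sketched by fitting the power law n = κ·N^(−α) to the number of new genes n found when the N-th genome is added. The sketch below uses ordinary linear regression in log-log space, which is a simplification and not necessarily the fitting procedure of the cited study; the data are synthetic.

```python
import math
from typing import List, Tuple


def fit_power_law(genomes: List[int], new_genes: List[float]) -> Tuple[float, float]:
    """Fit n = kappa * N**(-alpha) by least squares on log-transformed data."""
    xs = [math.log(N) for N in genomes]
    ys = [math.log(n) for n in new_genes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope       # (kappa, alpha)


def classify(alpha: float) -> str:
    """alpha > 1: closed pangenome; otherwise treated as open."""
    return "closed" if alpha > 1 else "open"


# Synthetic example generated from n = 100 * N**-0.5 (assumed, not real data)
N_values = [2, 3, 4, 5, 6, 7, 8]
n_values = [100 * N ** -0.5 for N in N_values]

kappa, alpha = fit_power_law(N_values, n_values)
print(f"kappa={kappa:.1f}, alpha={alpha:.2f}, pangenome is {classify(alpha)}")
```

With α below 1 the number of new genes decays slowly, so the pangenome keeps growing as genomes are added; with α above 1 it converges to a finite size.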

Tools – pangenome analysis (Aggarwal et al., 2022)
Software / Tool | Description / Role | URI link
PanSeq | Extracts regions unique to a genome, identifies SNPs and constructs the input file for phylogeny programmes | https://lfz.corefacility.ca/panseq/
PanFunPro | Homology detection and pairwise genome analysis in the pan/core genome | https://zenodo.org/record/7583#.YTR36p0zY2w
PGAP | Detection of homologous and orthologous genes, SNPs, phylogenetic studies, pangenome plotting and functional annotation | http://pgap.sf.net
PanACEA | Identification of genomic regions that are phylogenetically dissimilar | https://github.com/JCVenterInstitute/PanACEA
PGAP-X | Analyses genome diversity and visualizes genome structure and gene content to understand evolution | http://pgapx.ybzhao.com/
PAN2HGENE | Identifies new products, altering the behaviour of the α value in the pangenome without altering the original genomic sequence | https://sourceforge.net/projects/pan2hgene-software
BGDMdocker | Pangenome analysis, visualization, clustering and genome annotation | https://www.docker.com/whatisdocker

MATERIALS AND METHODS
Pan-Genome Assembly and Annotation
Gene Presence-Absence Variations(gPAVs)
SNP Discovery and Annotation
Sorghum Diversity and Population Structure
Genome-Wide Association Analysis (GWAS) : Two different mapping
populations having the phenotypic data of 10 traits
Drought RNA-Seq Assay Analysis

Pan-Genome Assembly and Annotation
Iterative mapping and assembly approach
Start with Sorghum reference
assembly v3.0.1 and adds on whole
genome sequence data iteratively
The assembled sequences were compared with the NCBI
non-redundant nucleotide database using
BLASTn; sequences with homology to
sorghum mitochondrial or chloroplast
sequences, or with best hits outside the Viridiplantae
taxonomy (outside the green plant group),
were removed
Unmapped reads were assembled
Sorghum reference assembly v3.0.1.
Reads from 176 sorghum accessions with
a minimum of 10X coverage sequence
data were mapped to the sorghum
reference v3.0
(The assembled contig sequence more than 500 bp length was only considered and appended to
reference genome sequence.)
Bowtie2 v2.3.4,
IDBA_UD assembler,

REPEATMASKER-v4.0.7
Masked repetitive elements (>90 percent
coverage with greater than 90 percent identity)
using sorghum as the species
AUGUSTUS v3.3.2
The sorghum expressed sequence tags
(ESTs) from GenBank were aligned
with tBLASTx, and genes were predicted
with homology-based and ab initio methods
On average, each iteration of the process added 1.9 Mb
263.7 Mbp
89.2Mb removed
RNA-Seq mapping hints from 25 accessions were
used for evidence-combining ab initio gene
prediction, and the 3,589 genes supported by
mapped expressed sequence tag (EST) sequences
were retained.
Identified 11,057 to 17,616 variable genes in the
176 genomes,
✔ Average gene sequence length- 1,567 bp
✔ Average exons per gene - 3.6

Sorghum Pan-Genome Gene PAV
(gPAV)
R “ape” package to construct an
NJ (neighbour joining) tree
Whole-genome sequence reads of all 354
sorghum accessions were mapped with
Bowtie2 v2.3.4
Gene models on contigs longer than 1 Kbp were used in this analysis.
Cluster analysis
all-by-all BLASTp followed by MCL
Gene enrichment analysis: R
“topGO” package, using the “elim” method.
SNP Discovery and Annotation
whole genome sequence reads of 354
accessions were quality trimmed using
Trimmomatic
Bowtie2 v2.3.4: paired-read mapping
to the pangenome
Picard tools: to filter out duplicate reads
SNPs functionally annotated with
SnpEff v.4.3
Variants against the reference (pan-genome)
were called with GATK v.4.1

▪ Closed-type pangenome: 35,719 genes
• 18,898 variable genes
▪ 30 genes: uniquely present
▪ 3,183 (8.9%): uniquely absent
35719
16821 (47%)
RESULTS
▪ Total: 2 million SNPs; 91,319 SNPs in the extra
contig assembly
▪ Variable genes are shorter, with fewer exons
▪ Variable genes have fewer synonymous SNPs
and a similar number of non-synonymous SNPs compared to
core genes

A - Reference whole genome sequence reads mapping
B - Drought expression (RNASeq) sequence mapping
density
C - Gene density
D - Genes commonly present in all accessions (core genes)
E - genes absent in at least one of the accessions (variable
genes)
F - SNP density
G - Insertions and deletions (indels).
SNP density
• Extra contigs: 0.52/kb
• Rest: 2.71/kb
• 210,805 Contigs
• Minimum contig length : 500 bp

gPAVs- based neighbour-joining tree
with Histogram bars
▪ Among the 35,719 total genes, 53%
exhibited the genic variations to
estimate the relationship among
the accessions
✔ The largest number of genes
▪ uniquely present : Macia (9
genes)
▪ uniquely absent : PI660645 (372
genes)
This indicated the evolutionary
distance from other accessions

The Ka/Ks ratio estimating the balance between
neutral mutations, purifying selection, and beneficial
mutations on a set of core and variable genes
Distribution of Infinium SNP array markers on chromosome
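How a Ka/Ks ratio is usually read can be sketched with a small classifier: values well below 1 suggest purifying selection, values near 1 suggest neutrality, and values above 1 suggest positive selection. The tolerance band around 1 is a hypothetical parameter chosen for illustration, and estimating Ka and Ks themselves requires codon-aware methods not shown here.

```python
def classify_selection(ka: float, ks: float, tolerance: float = 0.1) -> str:
    """Interpret a Ka/Ks ratio.

    ka: non-synonymous substitution rate
    ks: synonymous substitution rate
    tolerance: half-width of the band around 1 treated as neutral (assumed)
    """
    if ks <= 0:
        raise ValueError("Ks must be positive to form the ratio")
    ratio = ka / ks
    if ratio < 1 - tolerance:
        return "purifying selection"
    if ratio > 1 + tolerance:
        return "positive selection"
    return "approximately neutral"


# Hypothetical rate estimates for three genes
for gene, (ka, ks) in {"gene1": (0.02, 0.2),
                       "gene2": (0.2, 0.2),
                       "gene3": (0.3, 0.2)}.items():
    print(gene, round(ka / ks, 2), classify_selection(ka, ks))
```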

Principal co-ordinate analysis
✔ Three different clusters with one of them having two groups (Caudatum and Kafir )
✔ Durra and guinea sorghum races displayed identifiable clusters
✔ Caudatum and Kafir accessions exhibited the admixtures

▪ A total of 111 of the variable genes are
race-specific
▪ Unique genes from durra are associated with
✔ Heat shock protein, LRR repeat protein, L-type
lectin-domain receptor, ABC transporter family
proteins, and Ras-related proteins.
▪ Guinea group unique genes are associated with
✔ disease resistance protein, beta-glucosidase
proteins, NRT1/PTR protein family, etc.
GENE CLUSTER ANALYSIS Identified
⮚ 11,470 gene families
⮚ Un-clustered genes (6,057)
▪ 556 from the non-reference genes and the remaining 5,501 from
reference genes.
Specific and common genes across races

Genome-Wide Association Analysis (GWAS) with two different
mapping populations having the phenotypic data of 10 traits
POPULATION 1
▪ A subset of 227 accessions from the 354 WGS set, belonging to four
major races of sorghum with representation from Africa, Asia,
and America, was used
▪ Phenotype and genotype data were associated with
⮚ Plant height (PH),
⮚ Dry biomass (DBM), and
⮚ Starch (ST)
POPULATION 2
▪ The stay-green fine-mapping population, developed from the
introgression line cross RSG04008-6 × J2614-11, was used for the
association study using the pan-genome assembly
⮚ Green leaf area (GLA)
⮚ Glossy (GL) leaf
⮚ Sheath pigment (LSP)
⮚ Plant vigor (V),
⮚ Trichome low (TL),
⮚ Trichome up (TU),
⮚ Shoot fly dead hearts (SFDH)
Pan-genome helps identifying novel genes
1

Significant association of SNPs for plant biomass on chromosome 9

Significant association of SNPs for plant height on extra-contigs
✔ From Population 1 : A total of 36 SNPs on extra contigs found associated with target
traits. Among them,
10 SNP
25 SNP
1 SNP
✔ From Population2 : Trait Green leaf area (GLA) significant association with five SNPs
on extra-contigs
Starch (ST)
Dry bio mass (DBM)
Plant height (PH)

Identification of the Drought Candidate Genes
▪ A sorghum RNASeq data generated from
▪ 79 out of 1,788 total drought responsive
genes(differentially expressed genes) were reported
from genes on assembly sequence (extra- contig).
Drought-resistant: BTx623 (DR1), SC56 (DR2)
Drought-susceptible: Tx7000 (DS1), PI482662 (DS2)
6-hr treatment:
DR1 data set: 14 (13 up- and 1 down-regulated) and
DR2 data set: 34 (31 up- and 3 down-regulated)
genes from novel sequence were expressed

▪ Overall, five drought-related genes were co-mapped with the trait-associated
genes. Among these,
two drought-resistance-specific genes, Sobic.005G069800 and
Sobic.006G127800, were linked to the plant height and sheath pigment (LSP) traits.
DR 1 Data-set DR 2 Data-set
Venn diagram

Functional consequences of
new transposable element
insertions
Possible effects on gene
product structure
Transposable elements (TEs) , a driver of structural variation

The TE insertions were shown to be associated with changes in methylation,
chromatin accessibility and potentially regulatory functions
Possible effects on gene
product abundance

TEs as novel regulatory elements
TEs carrying ACRs are enriched for association with higher expression of
nearby genes, indicates their role as novel regulatory elements
(a) Insertions of transposons into
genes/regions of accessible chromatin
regions (ACR’s) or regulatory elements
Might often result in reduced expression
of nearby genes or altered patterns of
expression
(b) Insertion of TE’s that contain ACR’s
Might act as mobile enhancers that
affect the expression of both the TE
promoters and nearby gene promoters
ie re-wiring of transcription of nearby
genes
(Noshay et al., 2020)

Pangenome : A tool to unveil the hidden role of
Transposable elements(TE’s) in crop evolution

Eight high-quality genomes reveal pan-genome
architecture and ecotype differentiation of Brassica napus
VOL 6 | 2020

SNP-based GWAS versus PAV-based GWAS
: case study for silique length(SL), seed weight (SW) and flowering time in Brassica napus.
Manhattan plots of SNP-GWAS and PAV-GWAS for silique length.
GWAS (-lmm 1: Wald test) was performed with 3,971,412 SNPs or 27,216 PAVs in the BN-NAM population
containing 2,141 RILs.
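The idea of using presence/absence calls as GWAS markers can be sketched with a naive per-marker test. The study itself used a linear mixed model with a Wald test; the Welch t statistic below is a swapped-in simplification with no correction for population structure, and all marker names and phenotype values are hypothetical.

```python
import math
from statistics import mean, variance
from typing import Dict, List


def welch_t(group_a: List[float], group_b: List[float]) -> float:
    """Welch's t statistic for the difference in group means."""
    se = math.sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


def pav_scan(pav: Dict[str, Dict[str, int]],
             phenotype: Dict[str, float]) -> Dict[str, float]:
    """Compare the phenotype of lines carrying (1) vs lacking (0) each PAV."""
    results = {}
    for marker, calls in pav.items():
        present = [phenotype[s] for s, c in calls.items() if c == 1]
        absent = [phenotype[s] for s, c in calls.items() if c == 0]
        if len(present) < 2 or len(absent) < 2:
            continue                      # too few lines in one class to test
        results[marker] = welch_t(present, absent)
    return results


# Hypothetical silique lengths (cm) for six lines
phenotype = {"s1": 8.0, "s2": 8.5, "s3": 9.0, "s4": 5.0, "s5": 5.5, "s6": 6.0}
pav = {
    "pav1": {"s1": 1, "s2": 1, "s3": 1, "s4": 0, "s5": 0, "s6": 0},
    "pav2": {"s1": 1, "s2": 0, "s3": 0, "s4": 1, "s5": 1, "s6": 0},
}
results = pav_scan(pav, phenotype)
for marker, t in sorted(results.items(), key=lambda kv: -abs(kv[1])):
    print(marker, round(t, 2))
```

In this toy data pav1 separates long from short siliques cleanly and therefore ranks first, which is the kind of signal a PAV marker sitting on the causal insertion would give.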

• Although the peak SNP on chromosome A09
fell within the region previously
identified by traditional quantitative trait
locus mapping and positional cloning,
none of the associated SNPs was located
in the regulatory region or coding
sequence of the target gene
BnaA9.CYP78A9
• Encouragingly, PAV-GWAS directly
detected the 3.9-kb CACTA-like TE
inserted upstream of the
BnaA9.CYP78A9 promoter region(P450
monooxygenase), which was identified as
the causal variation for SL and SW
Phenotype data of silique length in eight B.
napus accessions.
• Experiments were repeated five times with
similar results.
Phenotype data of seed weight in eight
assembled B. napus accessions.
A 3.6-kb CACTA-like insertion as lead PAV
of BnaA09.CYP78A9 promoter region.

Pangenome Revealing secret of niche specific fitness
2
Tapidor
Quinta
Gagan
ZS11
Shengli
Zheyou7
Westar
No2127
Winter type
(WORs)
Semi-winter type
(SWORs)
Spring type
(SORs)
Eight B. napus accessions
Neighbour-joining tree of 210 B. napus accessions, eight
assembled accessions and 199 B. rapa accessions

Insertions of four transposable elements around BnaA10.FLC in different ecotypes
✔ Validated these TEs in 210 B. napus accessions (141 of which had ecotype information)
The role of FLC genes in the divergence of the three rapeseed ecotypes
SWORs
WORs SORs

✔Due to the LINE insertion in the first exon of BnaA10.FLC, the loss-of-function
mutation makes SORs require weak or no vernalization.
✔An 824-bp hAT insertion in the last exon of BnaA02.FLC was identified as
the lead PAV by PAV-GWAS in the SOR (spring) type
✔The MITE insertion in the promoter region of BnaA10.FLC enhances the
expression of BnaA10.FLC which leads to a requirement of strong
vernalization for WORs.
✔A demand for vernalization of SWOR is somewhere between the other two
ecotypes due to the hAT insertion in the promoter region of BnaA10.FLC
CONCLUSIVE RESULTS

Indicating a strong correlation between specific TE insertions in
BnaA10.FLC and ecotype classification
Haplotypes of six SNPs and the three
TEs located within the 5.0-kb
upstream and downstream regions
and the coding sequence of
BnaA10.FLC
Pangenome uncovers the potential of transposable
elements (TEs) as powerful molecular markers
3

Crop | Transposable element | Associated trait
Maize | A Harbinger-like DNA transposon | Represses expression of the ZmCCT9 gene to promote flowering under long-day conditions
Rice | A Gypsy retrotransposon | Enhances expression of the OsFRDL4 gene and promotes aluminum tolerance
Orange | Two Copia retrotransposons independently inserted into the promoter region of the Ruby gene | Enhanced expression, driving convergent evolution of the blood orange trait
Maize | Ac/fAc (hAT family element) transposon | Induces expression of the pericarp color 2 gene (p2) by capturing the enhancer sequence of another gene

The tomato pangenome un-covers new genes and a
rare allele regulating fruit flavor
4
(Gao et al., 2019).
Pangenome un-covers rare alleles

• Solanum pimpinellifolium (SP)
• Solanum cheesmaniae ssp galapagense (SCG)
• Solanum lycopersicum L. var. cerasiforme (SLC)
• Solanum lycopersicum L. lycopersicum (SLL)
Phylogenetic and principal component analyses
(PCA) using the PAVs suggested that wild
accessions clearly separated from domesticated
accessions with only a few exceptions, and the two
domesticated groups (SLC and SLL) separated but
with clear overlaps
Violin graph
Principal component analyses (PCA)

“Who will last in the Run?”
Scatter plots Gene selection preference during tomato domestication and improvement

A rare promoter allele that modifies fruit flavor
Pan-genome analysis ~4-kb substitution in the promoter region of TomLoxC
(Solyc01g006540) Encodes a 13-lipoxygenase essential for C5 and C6 green-leaf
volatile production in tomato fruit
4,151-bp nonreference allele of the TomLoxC promoter captured in Pan-genome
Rare allele in cultivated tomatoes
that reflects strong negative selection during domestication.
• TomatoPan028690, a truncated part of the fruit weight gene Cell Size
Regulator (CSR), was detected in all SP, 88.6% of SLC and 14.4% of SLL
heirlooms.
✔This supports the view that the deletion allele arose during domestication and
has been largely fixed in cultivated tomatoes.
Human selection influenced fruit quality or additional phenotypes in
some instances by targeting regulatory sequences

S. pimpinellifolium SP
(47.4%)
Modern SLL cultivars
(7.2%)
All heterozygotes
S. lycopersicum var. cerasiforme (SLC)
(8.4%)
SLL heirlooms
(1.1%),
The higher frequency in modern SLL cultivars is most likely due to recent introgressions from wild into cultivated tomatoes, consistent with selection during modern breeding,
possibly as a consequence of selecting lines with superior stress tolerance in agricultural settings
The frequency of the non-reference allele
✔ Expression levels of TomLoxC in
orange-stage fruit of accessions with
different promoter alleles
✔ Heterozygous TomLoxC promoter
genotypes have the strongest
expression in orange-stage fruit.

5 Pangenome helps trace back to domestication trajectory

Nature | Vol 588 | 10 December 2020
• Constructed chromosome-scale sequence assemblies
for 20 accessions
• Paired-end and mate-pair Illumina short reads were assembled into scaffolds
• Chromium linked-reads and chromosome conformation capture (Hi-C) data to arrange scaffolds into
chromosomal pseudomolecules using the TRITEX pipeline
• Use single-copy pan-genome for genetic analysis in
a wider diversity panel -single-copy regions extracted from each of
the 20 assemblies and clustered into a non-redundant set of sequences

Translate single-copy sequences variation into scorable markers which
are amenable to population genetic analysis and association scans
Genome-wide association scan for lemma adherence on the basis of PAV markers
Lemma adherence covered - NUDUM (NUD) gene
INFERENCE
All varieties of naked barley are thought to trace back to a single mutational
event, deleting the entire NUD sequence
How significant is a single-copy pangenome?

Mapping of polymorphic inversions in population
Objective : To discover inversions in a broader set of germplasm
• Hi-C-based inversion scans on Hi-C data of a diversity panel mapped to a
single reference genome
✔Scans among 69 barley genotypes (67 domesticated and 2 wild accessions)
revealed a total of 42 events that ranged from 4 to 141 Mb in size
(mean size of 23.9 Mb)
✔A notable finding was the prevalence of large (more than 5 Mb in size)
inversion polymorphisms in current elite germplasm
6 Mapping of polymorphic inversions in population

Identification and characterization of a large inversion on chromosome 7H
1. RGT Planet (inversion carrier) × Hindmarsh (R × H)
2. Morex × Barke (M × B) mapping population
• The earliest cultivar that carried the inversion was Diamant, one of the
donors of the semi-dwarf growth habit

This strongly suggests that mutation breeding in the 1960s has given rise to
a cryptic large inversion, which—unbeknownst to breeders—segregates in
elite varieties of barley
INFERENCE
• A map of inversion polymorphisms will give breeders a point of
reference to avoid, or correctly interpret, crosses between carriers and non-carriers.
• Diamant: a highly influential founder line of modern barley breeding
that traces back to a mutant induced by gamma irradiation of the
Czech cultivar Valticky.
• Gene bank/germplasm study: none of the Valticky samples carried
the inversion, whereas it segregated in the Diamant samples

Expanding Gene-Editing Potential in Crop Improvement
with Pangenomes
Identify non-recombinant inversions in the
pangenome: high-precision identification of
chromosomal rearrangement boundaries
CRISPR protein complex
Induce inversion / re-inversion
Genes locked in the region become accessible to
recombination in the population
Reversal of an inversion through CRISPR technology allows genes in inverted regions to be crossed
(Fernandez, 2022)
7

✔CRISPR-Cas can be used to study the effect of gene dosage by generating a
series of allelic mutants through knock-out/knock-down mutation of specific variant
alleles
E.g.: pleiotropic effects of the mlo gene (barley) against powdery mildew
✔Potential benefits of using a pangenome reference for genetic modification:
1. Genetic diversity analysis helps to identify potential target
sites for genome editing
2. Identify CNVs that influence CRISPR-Cas mutation effectiveness
3. Identify novel target alleles and map their positions on the pan-genome
4. Avoid off-target effects in multiplex editing by designing specific sgRNAs

Crop pangenome | Key findings | Reference
Maize pangenome (66 inbred lines) | Identified inversions; the largest spans 75.5 Mb in the pericentric region of chromosome 2 | Schwartz C et al., 2020, Nature Plants
Cotton pangenome (890 accessions) | Meta-GWAS, gene expression analysis and gene knockout with CRISPR-Cas9 identified the previously uncharacterized gene GhIDD7, subsequently shown to control fibre length | Li et al., 2021, Genome Biol.
Rice pangenome (66 accessions), “Green revolution phenotype” | Identified 129 conserved gene loci; a CRISPR-Cas knock-out/down study uncovered 31 high yield-related genes, including six previously reported genes such as the sd1 semi-dwarf gene | Huang J et al., 2018

Role of Cis-regulatory elements (CREs) and their Pan-
genome identification for fine tuning of gene expression
❑ The CREs are noncoding DNA sequences capable
of recruiting transcription factors and affecting
gene expression
❑ The CREs can be broadly subdivided into
promoters and enhancers or silencers
(Zanini et al., 2021)
8

Genome editing of cis-regulatory elements: a hypothetical scenario of editing the cis-regulatory elements of Brassica
napus CLV3 homologues to generate multilocular siliques and a range
of variation in seed number. Brassica napus has two, mostly redundant, copies of BnCLV3,
so editing of both would be necessary
(Xu et al., 2021; Yang et al., 2018)

Importance of pan-genomics as approach to explaining
heterosis
❖ Pan-genomics can play an important
role in unraveling gene members and
families contributing to heterosis,
according to the proposed model
❖ A new gene and variant finding is
essential to explaining and utilizing
heterosis for crop improvement.
A model of heterosis proposed by Swanson-Wagner et al.,
10

Pan-genome : A resource to explore the Breeding
Potential of Under-utilised Crop Species
Guava: fruit and leaf metabolites and fruit aroma volatiles were investigated in 27
guava accessions. These datasets could be used to scan a guava pangenome for
fruit-related traits
Integrating rich
phenotype data
A super-pangenome of yam
bean species (P. erosus, P. ahipa
and P. tuberosus)
Helps to infer the effects of SVs on phenotype,
including traits directly related to plant
performance such as
✔ Day to flowering and maturity,
✔ plant height
✔ root biomass
By developing resources for under-utilised crops, novel genes related to agro-morphological
traits can be detected and used to inform breeding programs or used for introgression
11

CURRENT STATUS AND FUTURE
ASPECTS OF PANGENOMIC STUDIES

A summary of plant pangenome studies.

Species | Single reference size | Pangenome size | Traits studied using the pangenome | Variant type | Reference
Brassica oleracea, B. macrocarpa (cultivated and wild cabbage) | (Bo TO1000) 488 Mb; 59,225 genes | 587 Mb; 61,379 genes | Disease resistance, flowering time, secondary metabolites | PAV | Golicz et al., 2016
Cajanus cajan (pigeon pea) | (Asha) 606 Mb; 53,612 genes | 622 Mb; 55,512 genes | Self-fertilization, disease resistance, seed weight | SNP, PAV | Zhao et al., 2020
Glycine soja (wild soybean) | (GsojaD, Shandong) 985 Mb; 57,631 genes | 986.3 Mb; 59,080 gene families | Disease resistance, flowering time, oil content, height and lodging, yield | CNV, PAV, SNP, InDel | Li et al., 2014
Gossypium hirsutum (upland cotton) | (TM-1) 2,347 Mb; 70,199 genes | 3,388 Mb; 102,768 genes | Flowering time, morphology, yield, fiber traits | PAV, SNP | Li et al., 2021
Oryza sativa | (Nipponbare) 384 Mb | Indica- 52976 | Disease, stress resistance, grain width and size | SNP | Yao et al.,
Zea mays (maize) | B73, 2,182 | 103,538 genes | Disease resistance, flowering time | PAV, TE insertion | Hufford et al., 2021

Rice Pan-genome Array (RPGA): an efficient genotyping solution
for pan-genome-based accelerated crop improvement in
rice
Anurag Daware, Ankit Malik, Rishi Srivastava, Durdam Das, Ranjith K.
Ellur, Ashok K. Singh, Akhilesh K. Tyagi and Swarup K. Parida
✔ The “Rice Pan-genome Genotyping Array (RPGA)” is the first
pan-genome-based SNP genotyping assay developed for crop plants
✔ Efficiently capture haplotype variation from the entire 3K rice pan-genome
representing diverse population (Indica, Tropical/Temperate japonica, aus
and Aromatic, etc.)
✔ RPGA assays total of 80504 SNPs including 60026 SNPs from 12 Nipponbare
chromosomes and 20478 SNPs from 12 pseudo127 chromosomes of 3K rice
pan-genome.
(2022)

‘RICE PAN-GENOME GENOTYPING ARRAY’ ANALYSIS PORTAL(RAP)
http://www.rpgaweb.com 3K Rice Reference Panel and subsequent GWAS

“Super-pangenome is the approach of developing a pangenome of the pangenomes of
different species for a given genus”.
Super-pangenome: A way forward

Khan,W.et al.,2020
Approaches for the construction of super pangenome

✔ Super-pangenomes support the breeding of crops better adapted
to diverse environments and more resilient to climate change by
analyzing gene frequency change during domestication/evolution
A super-pangenome study involving polyploid Brassica napus and its
two diploid progenitor genomes gives,
• Comparative modelling of the gene loss propensity in diploid and
polyploid Brassica sp.
 Diploids- Primarily associated with transposable elements
 Polyploid, B. napus - Associated with homoeologous
recombination.
 Identification of beneficial haplotypes that could be introgressed
through conventional breeding
(Bayer et al 2021, Plant Biotechnology Journal)

 Tomato super-pangenome identified functional polymorphisms in the
genes associated with fruit flavour(LIN5, ALMT9, AAT1, CXE1, and LoxC ).
 Cotton super-pangenome give knowledge on Genomic diversity
among five polyploids and their monophyletic origin
 Polyploidy genomes are conserved in gene content and synteny
 Diversified by sub-genomic transposon exchanges that equilibrate
genome size, Evolutionary rate, and positive selection between
homeologs within and among lineage
 The super-pangenome of banana identified
Gene differences between Musa and Ensete genera , as well as 12,310
new gene models in the species, forming distinct PAV clusters between the
Ensete and Musa accessions
(Chen et al., 2020 Nature genetics)

APPLICATIONS OF PANGENOME
1. Finding novel genes
2. Revealing niche specific fitness
3. Evolution, Domestication and Breeding History
4. Helps to identify potential target site for genome editing
5. Facilitating taxonomic identification
6. Approach to explaining heterosis
7. Elucidating host pathogen interaction
8. Strengthening proteogenomics

0
CONCLUSION
WHAT TO ADD?
&
WHERE TO ADD?

Beyond pan-genome ?
Pan-Transcriptome
Potent bioinformatic tools
Pan-Metabolome Pan- Epigenomes

 Comparative modelling of the propensity for gene loss in the three
species revealed that in the diploids, genes with propensity for loss are
primarily associated with transposable elements, while in the polyploid
B. napus, propensity for gene loss was associated with the position of
the gene on the pseudomolecule
 Studying how genes change in frequency between domesticated crops
and their wild relatives using

Rapeseed (Brassica napus) reference genomes:
Two winter-type oilseed rapes (WORs) (Darmor-bzh and Tapidor)
Two semi-winter oilseed rapes (SWORs) (ZS11 and NY7)
Genome-wide comparative analysis of eight well-assembled genomes and the
Darmor-bzh genome identified the core gene clusters, dispensable gene
clusters and specific gene clusters.
Created by,
⮚ ZS11 de novo assembly using PacBio, Hi-C and BioNano data
⮚ The other seven accessions were obtained by integrating high-coverage PacBio
and Illumina data; two of them were verified by Hi-C or BioNano data.
Multiple high-quality reference genomes representing different ecotypes are
necessary for a better understanding of the genome structure and genetic basis
of morphotype
Materials and methods
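
The split into core, dispensable and specific gene clusters described above can be illustrated with a small sketch. It assumes each cluster is already summarised as the set of accessions in which it is present. The cluster IDs and the presence patterns below are hypothetical and are not data from the study; only the accession names come from these slides.

```python
from typing import Dict, Set


def classify_clusters(pav: Dict[str, Set[str]], accessions: Set[str]) -> Dict[str, str]:
    """Label each gene cluster by how many accessions carry it.

    core        -> present in every accession
    specific    -> present in exactly one accession
    dispensable -> present in more than one but not all accessions
    """
    labels = {}
    n_total = len(accessions)
    for cluster, present_in in pav.items():
        n_present = len(present_in & accessions)
        if n_present == n_total:
            labels[cluster] = "core"
        elif n_present == 1:
            labels[cluster] = "specific"
        elif n_present > 1:
            labels[cluster] = "dispensable"
        else:
            labels[cluster] = "absent"
    return labels


# Hypothetical presence/absence patterns, for illustration only.
accessions = {"ZS11", "NY7", "Tapidor", "Westar"}
pav = {
    "cluster_A": {"ZS11", "NY7", "Tapidor", "Westar"},
    "cluster_B": {"ZS11", "Tapidor"},
    "cluster_C": {"Westar"},
}
labels = classify_clusters(pav, accessions)
print(labels)
```

Real pipelines usually derive the presence sets from orthology clustering or read mapping; this sketch starts after that step.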

GWAS of flowering time in the nested association mapping (NAM) population.
Manhattan plots for flowering time analysed by
SNP-GWAS in winter and spring environments,
respectively.
Manhattan plots for flowering time analysed
by PAV-GWAS in winter and spring
environments, respectively.
GWAS (-lmm 1: Wald test) was performed with 3,971,412 SNPs or 27,216 PAVs in the BN-NAM
population containing 2,141 RILs.
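
The marker counts above differ by two orders of magnitude, which changes how strict a genome-wide significance cutoff has to be. The sketch below assumes a plain Bonferroni correction at alpha = 0.05. The slides do not state which threshold the study used, so the correction method and the function name are illustrative assumptions.

```python
import math


def bonferroni_threshold(n_markers: int, alpha: float = 0.05) -> float:
    """Return the -log10(p) cutoff for n_markers independent tests (Bonferroni)."""
    return -math.log10(alpha / n_markers)


# Marker counts taken from the slide; the correction method is an assumption.
snp_cutoff = bonferroni_threshold(3_971_412)
pav_cutoff = bonferroni_threshold(27_216)
print(f"SNP-GWAS cutoff: {snp_cutoff:.2f}")
print(f"PAV-GWAS cutoff: {pav_cutoff:.2f}")
```

With fewer markers tested, a PAV scan needs a less extreme p-value to reach genome-wide significance under this kind of correction.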

Insertions of four transposable elements around
BnaA10.FLC in different ecotypes
✔ Four TEs were identified in the
promoter and coding region of
BnaA10.FLC
✔ Validated these TEs in 210 B. napus
accessions (141 of which had ecotype
information)
The role of FLC genes in the divergence of the three rapeseed
ecotypes.
RESULTS
✔ All the WORs contained the MITE
insertion
✔ 85% (22/26) of the SORs contained
the LINE insertion
✔ 81% (80/99) of the SWORs contained
the hAT insertion
(Figure panels: SWORs, WORs, SORs)

Flowering time of lines with different BnaA02.FLC alleles in spring and winter environments, respectively.
(Figure panels: Spring, Winter)

An 824-bp hAT insertion in the last exon of BnaA02.FLC
was identified as the lead PAV by PAV-GWAS.
SOR (spring type)

✔ The LINE insertion in the first exon of BnaA10.FLC is a loss-of-function
mutation, so SORs require weak or no vernalization.
✔ The MITE insertion in the promoter region of BnaA10.FLC enhances the
expression of BnaA10.FLC, which leads to a requirement of strong
vernalization for WORs.
✔ The vernalization demand of SWORs lies between those of the other
two ecotypes, due to the hAT insertion in the promoter region of
BnaA10.FLC
✔ An 824-bp hAT insertion in the last exon of BnaA02.FLC was identified
as the lead PAV by PAV-GWAS.
CONCLUSIVE RESULTS
SOR (spring type)


BnaA02.FLC has a stronger flowering repression effect than BnaC02.FLC
Winter types (WORs) Tapidor and Quinta: both possess BnaA10.FLC
⮚ Tapidor: two copies of BnaA02.FLC
⮚ Quinta: one copy of BnaA02.FLC and one copy of BnaC02.FLC
This may cause the difference in flowering time between them
Shengli, Zheyou7, Tapidor: the BnaC02.FLC gene is replaced by BnaA02.FLC
No2127: BnaA02.FLC was not expressed at any stage + one functional BnaC02.FLC
Gangan, Westar: the BnaC02.FLC gene is completely absent

✔ The cumulative expression levels of three FLC genes and the flowering time
characterization of the eight assembled B. napus accessions, associated with PAVs and
copy number
Stacked histogram showing FLC expression at T0–T3:
24 (T0), 54 (T1), 82 (T2) and 115 (T3) days after sowing
(Figure group labels: spring type (SOR), semi-winter type)

• Among the unfavorable genes, seven were not full length.
• TomatoPan028690, the truncated part of the fruit weight gene Cell Size Regulator
(CSR), was detected in all SP, 88.6% of SLC and 14.4% of SLL heirlooms.
✔ This supports that the deletion allele arose during domestication and has
been largely fixed in cultivated tomatoes.
Selection of promoter PAVs during tomato breeding.
A total of 90,929 nonreference contigs
3,741 nonreference sequences localized in putative promoter regions
980 promoter sequences under selection
Checked the expression of their downstream genes (RNA-Seq data for
orange-stage fruit) in the 397 accessions
Checked PAV patterns of these promoters, as well as those
in the reference genome

RESULT - Of these promoters, 240 had downstream genes with significantly
different expression
Human selection influenced fruit quality or additional phenotypes
in some instances by targeting regulatory sequences
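
One way to ask whether a promoter PAV goes with different expression of its downstream gene is to compare expression between accessions that carry the promoter sequence and those that lack it. The sketch below uses a simple permutation test on the difference in group means. The slides do not say which statistical test the study applied, so the test choice, the function name and the expression values are all illustrative assumptions.

```python
import random


def permutation_pvalue(present, absent, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in mean expression.

    present / absent: expression values of the downstream gene in accessions
    with and without the promoter sequence.
    """
    rng = random.Random(seed)

    def mean(values):
        return sum(values) / len(values)

    observed = abs(mean(present) - mean(absent))
    pooled = list(present) + list(absent)
    n_present = len(present)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_present]) - mean(pooled[n_present:]))
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_perm + 1)


# Hypothetical expression values, for illustration only.
with_promoter = [10.1, 9.8, 10.5, 10.2, 9.9]
without_promoter = [5.0, 5.3, 4.8, 5.1, 5.2]
p_value = permutation_pvalue(with_promoter, without_promoter)
print(p_value)
```

Across hundreds of promoters, the resulting p-values would also need a multiple-testing correction before any promoter is called significant.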
A rare promoter allele that modifies fruit flavor
Pan-genome analysis revealed a ~4-kb substitution in the promoter region of TomLoxC
(Solyc01g006540), which encodes a 13-lipoxygenase essential for C5 and C6 green-leaf
volatile production in tomato fruit
A 4,151-bp nonreference allele of the TomLoxC promoter was captured in the pan-genome
It is a rare allele in cultivated tomatoes,
which reflects strong negative selection during domestication.

Confirmation of the involvement of TomLoxC in apocarotenoid biosynthesis:
✔ QTL mapping identified TomLoxC as the cause of changed levels of
flavor-associated lipid- and carotenoid-derived volatiles.
✔ Analysis of transgenic tomato fruit (in which TomLoxC expression was repressed)
revealed a previously unknown alternative apocarotenoid production
route.
❑ The tomato pan-genome harbors useful genetic variation which
was not visible in the ‘Heinz 1706’ reference genome alone.
❑ The tomato pan-genome revealed extensive domestication- and
improvement-associated loci and genes, with an evident bias
toward those involved in defense response

On average, each of the 20 genotypes
contained 2.9 Mb of single-copy sequence not
present in any other assembly
To test the suitability of the single-copy pan-genome for genetic analysis:
The abundance of 160,716 single-copy clusters that overlap structural variants was
estimated by counting cluster-constituent k-mers (k = 31) in sequence reads of the
diversity panel
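
The k-mer counting step above can be sketched in a few lines. This is a minimal illustration, not the study's pipeline: it ignores reverse complements and sequencing errors, keeps everything in memory, and the function names and toy sequences are hypothetical. Dedicated k-mer counters are normally used at this scale.

```python
from collections import Counter


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence (forward strand only)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def cluster_abundance(cluster_seq, reads, k=31):
    """Mean number of read occurrences per distinct k-mer of the cluster."""
    cluster_kmers = set(kmers(cluster_seq, k))
    if not cluster_kmers:
        return 0.0
    counts = Counter()
    for read in reads:
        for kmer in kmers(read, k):
            if kmer in cluster_kmers:
                counts[kmer] += 1
    return sum(counts.values()) / len(cluster_kmers)


# Toy example with k = 3 so the sequences stay readable; the slide uses k = 31.
toy_abundance = cluster_abundance("ACGTA", ["ACGTA", "TTACG"], k=3)
print(toy_abundance)
```

An abundance near zero in an accession's reads suggests the cluster is absent there, while a value near the sequencing depth suggests a single-copy presence.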

• Local PCA and haplotype analysis in our panel of 200 domesticated and 100 wild varieties of barley indicated a
single origin of the inverted haplotype.
• The inversion occurred only among domesticated barley of Western geographical origin, indicating that it
arose or has risen to high frequency after domestication. The inverted region contains high-confidence genes
in the Morex cultivar. The closest gene to the inversion breakpoint—at 448 kb distance from the distal
breakpoint in the non-carrier Morex—was HvCENTRORADIALIS (HvCEN)
• Although induced mutants of HvCEN flower very early, natural variation in HvCEN has previously been
implicated in environmental adaptation to northern European climates.
• All of the inversion carriers we analysed had HvCEN haplotype III, which is associated with later flowering in
spring barley varieties from northern Europe

Neighbor-joining tree of 271 diverse rice
accessions belonging to three different cultivated
and wild rice species viz. O. sativa, O. nivara and
O. rufipogon
RPGA-based SNP genotyping for efficiently decoding the natural allelic diversity and
population genetic structure in order to understand the domestication pattern in rice
genepool.

✔ Indian traditional Basmati accessions were found to cluster distinctly from
aromatic rice accessions belonging to both north-eastern India and other
parts of the world.
✔ Traditional Basmati displayed a closer genetic relationship with
japonica and aus accessions
Evolved Basmati: traditional Indian Basmati × indica variety (IND 1)
✔ Identified 2 sub-groups within the indica subpopulation:
INDI, corresponding to Xian/Indica-2 (XI-2) from South Asia
INDII, corresponding to XI-3 from Southeast Asia,
previously reported along with two other indica subpopulation groups (XI-1A from East Asia, XI-1B of modern
varieties of diverse origins)
RESULTS

High-resolution QTL mapping was conducted using the RPGA-based
ultra-high-density 535 genetic linkage map (Sonasal × PB 1121 RILs)
The RPGA-based GWAS detected many previously known major grain size/weight
genes such as GS3 and PGL1 (grain length and length-to-width ratio) and GW5 (grain
width), which validates the ability of pan-genome-based GWAS to detect true associations
The WDR12 gene (candidate gene regulating grain length) underlying the qLWR7 QTL,
validated by both RPGA-based GWAS and QTL mapping, is known to and thus appears to be a
promising
