A pan-genome represents the full complement of diversity within a clade, or the union of all genes or SNPs across a representative selection of genomes. One of the first pan-genomes was that of Streptococcus agalactiae, introduced in 2005. Since then, with the acceleration of whole-genome sequencing technology, pan-genomes have been generated across a wide range of multicellular eukaryotes. This presentation will outline the history of pan-genomes, the categories of pan-genomes, advances in pan-genome assessment, and the challenges of representing the diversity of a taxonomic clade in complex eukaryotes.
2. What is a pan-genome?
A pan-genome represents the full complement of diversity within a clade, or the union of all genes or SNPs across a representative selection of genomes
4. The first pan-genome: S. agalactiae
• Microbial pan-genome of Streptococcus agalactiae
• Released in 2005
• Comprised of eight S. agalactiae genomes (8x coverage; ~1.8Mb in size)
• Whole-genome alignments were done using the MUMmer alignment program
• Gene sequence similarity was determined by translated protein sequence similarity
Tettelin et al. 2005 PNAS doi: 10.1073/pnas.0506758102
Reference lacking sequence similarity to other accessions (white space)
Reference sharing sequence similarity in the same syntenic position across all accessions
Reference sharing sequence similarity in some accessions but not in the same syntenic position
5. A recent pan-genome: rice
• A pan-genome dataset of the O. sativa–O. rufipogon species complex (430Mb in size)
• Deep sequencing (115X) of 66 divergent accessions into contigs
• Contigs were anchored to the reference genome using the MUMmer alignment program
• Predicted genes in contigs were aligned against the reference genome with BLASTN to determine gene presence-absence in contigs
Zhao et al. 2018 Nature Genetics doi: 10.1038/s41588-018-0041-z
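The presence-absence step on this slide can be sketched roughly as follows. This is a minimal illustration, not the pipeline used in the cited rice study: the input rows are a simplified stand-in for BLASTN tabular hits, and the identity and coverage thresholds, the function name `call_presence`, and the example gene and accession names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical thresholds chosen for illustration only.
MIN_IDENTITY = 95.0  # minimum percent identity for a hit to count
MIN_COVERAGE = 0.80  # minimum fraction of the gene covered by its best hit


def call_presence(blast_rows, gene_lengths, accessions):
    """Return {gene: {accession: bool}} from simplified BLASTN-like hits.

    blast_rows: iterable of (gene, accession, pct_identity, aln_length)
    gene_lengths: {gene: length in bp}
    accessions: list of accession names to report on
    """
    best_coverage = defaultdict(float)
    for gene, accession, pct_identity, aln_length in blast_rows:
        if pct_identity < MIN_IDENTITY:
            continue  # ignore low-identity hits
        coverage = aln_length / gene_lengths[gene]
        key = (gene, accession)
        best_coverage[key] = max(best_coverage[key], coverage)
    return {
        gene: {
            acc: best_coverage.get((gene, acc), 0.0) >= MIN_COVERAGE
            for acc in accessions
        }
        for gene in gene_lengths
    }


# Made-up example data.
rows = [
    ("geneA", "acc1", 99.0, 950),
    ("geneA", "acc2", 97.5, 400),
    ("geneB", "acc1", 90.0, 500),
    ("geneB", "acc2", 98.0, 480),
]
lengths = {"geneA": 1000, "geneB": 500}
calls = call_presence(rows, lengths, ["acc1", "acc2"])
print(calls)
```

A gene is called present only when a single sufficiently identical hit covers most of its length; real analyses often merge multiple hits per gene, which this sketch does not attempt.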
6. A future pan-genome: maize
Nested Association Mapping (NAM) population
• 25 diverse inbred lines, plus the reference line B73
• All 25 complete genomes are currently being sequenced, assembled, and annotated by the NAM Sequencing Consortium
• Each genome is approximately 2 gigabases in size
• Whole-genome alignment to create the pan-genome
D. Hufnagel, ISU, unpublished
7. Species with existing pan-genomes
• Human pan-genome
• Dozens of microbial pan-genomes
• Many crop plant pan-genomes (though mostly
re-sequencing efforts)
– pepper
– soybean
– rice
– wheat
– maize
9. Features of a pan-genome (genes as unit)
• Core genome (C): Made up of genes present in all accessions in the pan-genome
• “Dispensable” genome (D): The genes that are present in some, but not all, accessions in the pan-genome
• Orphan genes (O): Lineage-specific genes; found only within a particular accession
[Figure: gene content of Accession 1, Accession 2, and Accession 3 shown above the combined pan-genome, with each gene labeled core (C), dispensable (D), or orphan (O)]
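The three categories above can be assigned mechanically once gene presence per accession is known. The sketch below assumes a simple mapping from gene to the set of accessions carrying it; the function name `classify_genes` and the gene and accession names are hypothetical.

```python
def classify_genes(presence, accessions):
    """Label each gene 'C' (core), 'D' (dispensable), or 'O' (orphan).

    presence: {gene: set of accessions in which the gene is found}
    accessions: collection of all accessions in the pan-genome
    """
    total = len(set(accessions))
    labels = {}
    for gene, found_in in presence.items():
        count = len(set(found_in))
        if count == 0:
            raise ValueError(f"{gene} is not present in any accession")
        if count == total:
            labels[gene] = "C"  # present in every accession
        elif count == 1:
            labels[gene] = "O"  # found in a single accession only
        else:
            labels[gene] = "D"  # present in some, but not all
    return labels


# Made-up example with three accessions.
accessions = ["acc1", "acc2", "acc3"]
presence = {
    "gene1": {"acc1", "acc2", "acc3"},
    "gene2": {"acc1", "acc2"},
    "gene3": {"acc3"},
}
labels = classify_genes(presence, accessions)
print(labels)
```

Core is tested before orphan so that a pan-genome built from a single accession reports all of its genes as core rather than orphan.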
12. Pan-genome formats
• Consensus fasta file: all variation across accessions is collapsed into a consensus sequence
• Used for either SNP diversity or whole-genome alignment against a reference sequence
– pros: blastable/alignable
– cons: hard to represent large regions of variation; difficult to represent an all-against-all pan-genome
sorghum    5' AGTAAGACCTGATGCAT 3'
setaria    5' AGCAAGACCTGATGCAT 3'
rice       5' AGCAAGACCTGATGCAT 3'
consensus  5' AGCAAGACCTGATGCAT 3'
pineapple  5' AGAATGACCTGATTCCT 3'
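One way to arrive at a consensus like the one on this slide is a per-column majority vote over pre-aligned sequences. The slide does not say how its consensus was derived, so the majority rule, the alphabetical tie-break, and the function name `majority_consensus` are assumptions made for illustration.

```python
from collections import Counter


def majority_consensus(sequences):
    """Return the per-column majority base of equal-length aligned sequences."""
    lengths = {len(seq) for seq in sequences}
    if len(lengths) != 1:
        raise ValueError("sequences must be pre-aligned to the same length")
    consensus = []
    for column in zip(*sequences):
        counts = Counter(column)
        top = max(counts.values())
        # Break ties alphabetically so the result is deterministic.
        consensus.append(min(base for base, n in counts.items() if n == top))
    return "".join(consensus)


# Sequences copied from the slide.
grasses = {
    "sorghum": "AGTAAGACCTGATGCAT",
    "setaria": "AGCAAGACCTGATGCAT",
    "rice": "AGCAAGACCTGATGCAT",
}
consensus = majority_consensus(list(grasses.values()))
print(consensus)
```

The sorghum-specific T at the third position is lost in the consensus, which is a small-scale instance of the drawback listed above: collapsing variation hides differences between accessions.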
14. Levels of pan-genomes
• Species-specific: e.g. cultivars within a species (most common)
• Genus-specific: closely related species within a genus
• Family-specific: the species within a family (e.g. Grasses)
→ Higher-order clade pan-genomes are possible, but complexity increases at higher taxonomic levels because the number of shared genes declines:
[Figure: phylogeny of Sorghum, Setaria, Rice, Pineapple, Peach, and Tomato, with clade labels Grasses, Monocots, Dicots, Rosids, and Asterids]
Species    # syntenic orthologs in Sorghum
Setaria    31142
Rice       28406
Pineapple  16147
Peach      5317
Tomato     1196
15. Challenges of creating whole-genome pan-genomes for complex genomes
• Plants such as maize and wheat have undergone recent polyploidy events, have large inversions and translocations relative to outgroups, and contain a great number of repeat elements
• These genomes also contain a large number of duplicated genes in cis and trans, and have simultaneously undergone a great deal of gene deletion relative to outgroups
• Therefore, creating pan-genomes to represent the extent of the diversity in these species is daunting
18. The role of biological databases in curating pan-genomes
• Biological databases will need to address nomenclature rules for pan-genes
• AgBioData is in a unique position to direct and standardize naming conventions for pan-genes at this early stage in pan-genome development
• Our understanding of the complexity of pan-genomes can help inform us as to the best nomenclature conventions to implement