AgBioData: Complexity and Diversity of the Pan-Genome

  1. Margaret Woodhouse MaizeGDB Gale and Devos, 1998, PNAS
  2. What is a pan-genome?
     A pan-genome represents the full complement of diversity within a clade, or the union of all genes or SNPs across a representative selection of genomes
  3. Why a pan-genome?
     •  A single reference genome cannot represent the diversity within a given species
     •  Advances in sequencing technology and lower costs have made generating a pan-genome a feasible goal for many genome research groups
  4. The first pan-genome: S. agalactiae
     •  Microbial pan-genome of Streptococcus agalactiae
     •  Released in 2005
     •  Composed of eight S. agalactiae genomes (8x coverage; ~1.8 Mb in size)
     •  Whole-genome alignments were done using the MUMmer alignment program
     •  Gene sequence similarity was determined by translated protein sequence similarity
     Tettelin et al. 2005, PNAS, doi: 10.1073/pnas.0506758102
     [Figure legend: reference lacking sequence similarity to other accessions (white space); reference sharing sequence similarity in the same syntenic position across all accessions; reference sharing sequence similarity in some accessions but not in the same syntenic position]
  5. A recent pan-genome: rice
     •  A pan-genome dataset of the O. sativa–O. rufipogon species complex (430 Mb in size)
     •  Deep sequencing (115X) of 66 divergent accessions, assembled into contigs
     •  Contigs were anchored to the reference genome using the MUMmer alignment program
     •  Predicted genes in contigs were aligned against the reference genome with BLASTN to determine gene presence-absence in contigs
     Zhou et al. 2018, Nature Genetics, doi: 10.1038/s41588-018-0041-z
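The BLASTN presence-absence step can be illustrated with a minimal sketch. It assumes BLAST tabular output (`-outfmt 6`, whose first eight columns are query ID, subject ID, percent identity, alignment length, mismatches, gap opens, query start, query end). The function name, the identity and coverage thresholds, and the single-hit coverage rule are hypothetical choices made for illustration; the slide does not state the criteria used in the rice study.

```python
from typing import Dict, Iterable


def call_presence(blast_lines: Iterable[str],
                  gene_lengths: Dict[str, int],
                  min_identity: float = 95.0,   # hypothetical threshold
                  min_coverage: float = 0.8     # hypothetical threshold
                  ) -> Dict[str, bool]:
    """Mark each query gene as present if one hit passes both thresholds."""
    present = {gene: False for gene in gene_lengths}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        gene = fields[0]
        pident = float(fields[2])
        qstart, qend = int(fields[6]), int(fields[7])
        if gene not in gene_lengths:
            continue
        # Fraction of the query gene covered by this single hit.
        coverage = (abs(qend - qstart) + 1) / gene_lengths[gene]
        if pident >= min_identity and coverage >= min_coverage:
            present[gene] = True
    return present


# Made-up hits for three made-up genes, each 1000 bp long.
example_hits = [
    "geneA\tchr1\t98.5\t950\t10\t2\t1\t950\t1000\t1949\t0.0\t1700",
    "geneB\tchr2\t99.0\t200\t2\t0\t1\t200\t500\t699\t1e-90\t350",
]
example_lengths = {"geneA": 1000, "geneB": 1000, "geneC": 1000}
calls = call_presence(example_hits, example_lengths)
print(calls)
```

A real pipeline would likely merge multiple hits per gene before computing coverage; this sketch judges each hit on its own to stay short.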
  6. A future pan-genome: maize Nested Association Mapping (NAM) population
     •  25 diverse inbred lines, plus the reference line B73
     •  All 25 complete genomes are currently being sequenced, assembled, and annotated by the NAM Sequencing Consortium
     •  Each genome is approximately 2 gigabases in size
     •  Whole-genome alignment to create the pan-genome
     [Figure: D. Hufnagel, ISU, unpublished]
  7. Species with existing pan-genomes
     •  Human pan-genome
     •  Dozens of microbial pan-genomes
     •  Many crop plant pan-genomes (though mostly re-sequencing efforts)
        – pepper
        – soybean
        – rice
        – wheat
        – maize
  8. Methods of creating a pan-genome
     Usually the comparisons done to create a pan-genome include
     •  whole-genome alignment
     •  or alignment of short reads to a reference sequence.
  9. Features of a pan-genome (genes as unit)
     •  Core genome (C): made up of genes present in all accessions in the pan-genome
     •  “Dispensable” genome (D): the genes that are present in some, but not all, accessions in the pan-genome
     •  Orphan genes (O): lineage-specific genes; found only within a particular accession
     [Figure: genes of Accessions 1–3 labelled C, D, or O, combined into a single pan-genome]
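The three categories can be computed from a gene presence table. The sketch below is a minimal illustration: the function name, accession names, and gene IDs are made up, and orphan genes are reported separately from the dispensable set even though, strictly, a gene found in one accession is also "present in some, but not all".

```python
from typing import Dict, Set


def classify_genes(presence: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Split a pan-genome into core, dispensable, and orphan gene sets.

    `presence` maps each accession name to the set of gene IDs found in it.
    """
    pan = set().union(*presence.values())  # every gene seen in any accession
    core, dispensable, orphan = set(), set(), set()
    for gene in pan:
        n = sum(gene in genes for genes in presence.values())
        if n == len(presence):
            core.add(gene)          # present in all accessions
        elif n == 1:
            orphan.add(gene)        # found in exactly one accession
        else:
            dispensable.add(gene)   # present in some, but not all
    return {"core": core, "dispensable": dispensable, "orphan": orphan}


# Hypothetical example with three accessions.
example = {
    "accession_1": {"g1", "g2", "g3", "g4"},
    "accession_2": {"g1", "g2", "g3", "g5"},
    "accession_3": {"g1", "g2", "g5", "g6"},
}
classes = classify_genes(example)
print(classes)
```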
  10. Types of pan-genomes
      1. Reference-based: all accessions are mapped to a single reference
         – pros: fast and easy; can be represented as a fasta file (consensus sequence)
         – cons: does not represent genes that are present in other accessions but missing in the reference
      [Figure: genes of Accession 1, Accession 2, and the Reference labelled C, D, or O; in the resulting pan-genome, a gene missing from the reference is marked X]
  11. Types of pan-genomes
      2. All-against-all: the pan-genome includes the total diversity of all accessions studied
         – pros: contains the total diversity among all accessions
         – cons: cannot be easily represented as a fasta file; very slow and CPU-intensive
      [Figure: genes of Accessions 1–3 labelled C, D, or O; the resulting pan-genome retains all of them]
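The difference between the two types can be shown with gene sets. This is a toy sketch at the gene-ID level only, with made-up names; it models a reference-based pan-genome as "whatever the reference contains" and an all-against-all pan-genome as the union over every accession, and it ignores the alignment work that makes the second approach slow in practice.

```python
from typing import Dict, Set


def all_against_all(genomes: Dict[str, Set[str]]) -> Set[str]:
    """Union of the genes found in any accession."""
    return set().union(*genomes.values())


def reference_based(genomes: Dict[str, Set[str]], reference: str) -> Set[str]:
    """Only genes present in the chosen reference can be represented."""
    return set(genomes[reference])


def missed_by_reference(genomes: Dict[str, Set[str]], reference: str) -> Set[str]:
    """Genes present in some accession but absent from the reference."""
    return all_against_all(genomes) - reference_based(genomes, reference)


# Hypothetical gene content for a reference and two other accessions.
genomes = {
    "reference": {"g1", "g2", "g3"},
    "accession_1": {"g1", "g2", "g4"},
    "accession_2": {"g1", "g3", "g5"},
}
full = all_against_all(genomes)
missed = missed_by_reference(genomes, "reference")
print(sorted(full), sorted(missed))
```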
  12. Pan-genome formats
      •  Consensus fasta file: all variation across accessions is collapsed into a consensus sequence
      •  Used for either SNP diversity or whole-genome alignment against a reference sequence
         – pros: blastable/alignable
         – cons: hard to represent large regions of variation; difficult to represent an all-against-all pan-genome
      sorghum    5’ AGTAAGACCTGATGCAT 3’
      setaria    5’ AGCAAGACCTGATGCAT 3’
      rice       5’ AGCAAGACCTGATGCAT 3’
      consensus  5’ AGCAAGACCTGATGCAT 3’
      pineapple  5’ AGAATGACCTGATTCCT 3’
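One simple way to obtain such a consensus is a per-column majority vote over pre-aligned sequences of equal length. The sketch below uses the sequences from the slide; the majority rule and the tie-breaking behaviour (the first base seen in a column wins a tie) are assumptions, since the slide does not say how its consensus was derived.

```python
from collections import Counter
from typing import List


def majority_consensus(seqs: List[str]) -> str:
    """Return the most common base at each column of aligned sequences."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be non-empty and of equal length")
    # zip(*seqs) walks the alignment column by column.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


grasses = {
    "sorghum": "AGTAAGACCTGATGCAT",
    "setaria": "AGCAAGACCTGATGCAT",
    "rice":    "AGCAAGACCTGATGCAT",
}
pineapple = "AGAATGACCTGATTCCT"

consensus = majority_consensus(list(grasses.values()))
# Count positions where the more distant genome differs from the consensus.
pineapple_mismatches = sum(a != b for a, b in zip(consensus, pineapple))
print(consensus, pineapple_mismatches)
```

The sorghum-specific T at the third position disappears from the consensus, which is the information loss the "cons" bullet describes.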
  13. Pan-genome formats
      •  Graphical format: variation represented as a graph with nodes and edges; shared sequence similarity is collapsed into a single node
      •  Can be used for all types of pan-genomes
         – pros: can represent large regions of diversity better than a fasta file
         – cons: not blastable/alignable
      [Figure: sequence segments of Accessions 1–3 (CCTG, ATGCGA, TCTGAC, ACTC, ATTC, GGAC, GGTC, GGAG) merged into a single pan-genome graph]
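A graph of this kind can be sketched as a set of nodes plus directed edges between consecutive segments. In the example below, the segment strings are taken from the slide's figure, but the order of segments within each accession is an illustrative guess, not a reconstruction of the figure. Keying nodes by their sequence is also a simplification: real graph formats give nodes their own identifiers so that identical sequence at different loci stays distinct.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def build_graph(paths: Dict[str, List[str]]) -> Tuple[Set[str], Dict[str, Set[str]]]:
    """Collapse per-accession segment paths into shared nodes and edges."""
    nodes: Set[str] = set()
    edges: Dict[str, Set[str]] = defaultdict(set)
    for path in paths.values():
        nodes.update(path)  # identical segments collapse into one node
        for left, right in zip(path, path[1:]):
            edges[left].add(right)  # edge between consecutive segments
    return nodes, dict(edges)


# Hypothetical paths: each accession shares the two middle segments.
paths = {
    "accession_1": ["CCTG", "ATGCGA", "TCTGAC", "ACTC"],
    "accession_2": ["ATTC", "ATGCGA", "TCTGAC", "GGAC"],
    "accession_3": ["GGTC", "ATGCGA", "TCTGAC", "GGAG"],
}
nodes, edges = build_graph(paths)
print(len(nodes), edges["TCTGAC"])
```

Twelve segments across three accessions reduce to eight nodes, and the branching after the shared segments is kept as separate edges rather than being averaged away as in a consensus sequence.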
  14. Levels of pan-genomes
      •  Species-specific: e.g. cultivars within a species (most common)
      •  Genus-specific: closely related species within a genus
      •  Family-specific: the species within a family (e.g. grasses)
      → Higher-order clade pan-genomes are possible, but complexity increases as you go up in clade, since the number of shared genes declines:
      Species      # syntenic orthologs in Sorghum
      Setaria      31142
      Rice         28406
      Pineapple    16147
      Peach        5317
      Tomato       1196
      [Figure: tree of Sorghum, Setaria, Rice, Pineapple, Peach, and Tomato, with clade labels Grasses, Monocots, Dicots, Rosids, and Asterids]
  15. Challenges of creating whole-genome pan-genomes for complex genomes
      •  Plants such as maize and wheat have undergone recent polyploidy events, have large inversions and translocations relative to outgroups, and contain a great number of repeat elements
      •  These genomes also contain a large number of duplicated genes in cis and trans, and have simultaneously undergone a great deal of gene deletion relative to outgroups
      •  Therefore, creating pan-genomes to represent the extent of the diversity in these species is daunting
  16. Example: two different maize inbred lines compared to the reference maize line B73
      https://genomevolution.org/r/14hzw
      [Figure: tracks for W22, Mo17, and B73 showing gene models and DNA sequence in chromosomal coordinates (chr, start, stop); brown = blast hit from B73 to Mo17; orange = blast hit from B73 to W22; labels: tandem array, W22 cultivar]
  17. Example: two different maize inbred lines compared to the reference maize line B73
      https://genomevolution.org/r/14hzw
      [Figure: tracks for W22, Mo17, and B73 showing gene models; red circles = genes with no syntenic ortholog to B73; brown = blast hit from B73 to Mo17; orange = blast hit from B73 to W22; labels: tandem array, W22 cultivar]
  18. The role of biological databases in curating pan-genomes
      •  Biological databases will need to address nomenclature rules for pan-genes
      •  AgBioData is in a unique position to direct and standardize naming conventions for pan-genes at this early stage in pan-genome development
      •  Our understanding of the complexity of pan-genomes can help inform us as to the best nomenclature conventions to implement
  19. The role of biological databases in curating pan-genomes
      •  Storing pan-genomic data in a public database is also a concern
      •  What pan-genome formats would we store? How would we connect functional data (expression data, mapping and marker data, SNP data) to the pan-genome in a useful manner?
  20. Case study: the maize Nested Association Mapping (NAM) founder lines
      •  MaizeGDB will soon be hosting these 25 genomes and is working on the most efficient way to do so
      [Figure: D. Hufnagel, ISU, unpublished]
  21. Case study: the maize Nested Association Mapping (NAM) founder lines
      Some approaches MaizeGDB is considering:
      Gene model pages:
      1.  Storing all gene models with an ortholog in either B73 or at least one other NAM line* in a single gene model page
      2.  These pages will also contain mapping and functional data associated with the NAM loci as well as the B73 gene model
      3.  Orphan NAM gene models will have their own page
      *as determined by the NAM Sequencing Consortium
  22. Case study: the maize Nested Association Mapping (NAM) founder lines
      Nomenclature:
      •  NAM gene models will be given a pan-ID (e.g. Z0123456).
      •  All NAM gene models that share orthology will share a single pan-ID.
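One way such a scheme could work is to group gene models connected by orthology and give each group one identifier. The sketch below uses a small union-find structure; the function name, gene model names, and numbering order are hypothetical, and only the "Z" plus seven digits pattern comes from the slide's example. It is not MaizeGDB's actual procedure.

```python
from typing import Dict, Iterable, List, Tuple


def assign_pan_ids(gene_models: List[str],
                   ortholog_pairs: Iterable[Tuple[str, str]],
                   prefix: str = "Z", width: int = 7) -> Dict[str, str]:
    """Give every gene model a pan-ID shared by its whole ortholog group.

    Every name used in `ortholog_pairs` must also appear in `gene_models`.
    """
    parent = {g: g for g in gene_models}

    def find(g: str) -> str:
        # Follow parent links to the group representative, compressing the path.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b in ortholog_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two ortholog groups

    groups: Dict[str, List[str]] = {}
    for g in gene_models:
        groups.setdefault(find(g), []).append(g)

    pan_ids: Dict[str, str] = {}
    # Number groups in a stable order (by their alphabetically first member).
    for i, members in enumerate(sorted(groups.values(), key=min), start=1):
        for g in members:
            pan_ids[g] = f"{prefix}{i:0{width}d}"
    return pan_ids


# Hypothetical gene models from three lines; W22_g9 has no ortholog.
models = ["B73_g1", "Mo17_g7", "W22_g3", "W22_g9"]
pairs = [("B73_g1", "Mo17_g7"), ("Mo17_g7", "W22_g3")]
pan_ids = assign_pan_ids(models, pairs)
print(pan_ids)
```

Treating orthology as transitive, as the union-find does, is itself a design decision: tandem duplicates and many-to-many ortholog relationships could pull large families under one ID.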
  23. Summary
      •  There are various ways of representing a pan-genome
      •  Pan-genomes can be very complex in certain species
      •  Biological databases will need to determine
         – How to assign pan-IDs
         – What sorts of pan-genome formats to host
         – The most efficient way to host pan-genomes