12/10/2019 1
Presented by,
Aruna, K
III Ph.D scholar
Dept. of GPB
• The first report of genic structural variation (SV) affecting a
phenotype dates back more than 80 years: Bridges (1936)
discovered a duplication of the Bar gene associated with small
eyes in the fruit fly, Drosophila
• There is growing evidence that genome-wide SV is a major
factor underlying observed phenotypic variation
Diversity of structural variants
• Translocations
• CNV and PAV
• InDels
• Inversions
Different kinds of SV can occur independently or
simultaneously, resulting in complex genome
alterations
Origin of structural variants
Various cellular mechanisms can trigger generation of SV
during meiotic or mitotic cell division
1. Homoeologous non-reciprocal transpositions (HNRT) (Parkin
et al., 1995)
2. Non-allelic homologous recombination (NAHR) (Lupski, 1998)
3. Non-homologous end joining (NHEJ) (Moore and Haber, 1996)
4. Fork stalling and template switching (FoSTeS) (Lee et al.,
2007)
5. Microhomology-mediated break-induced replication (MMBIR)
(Hastings et al., 2009)
NAHR is the most likely cause of most of the CNVs observed in
plants
Changes in ploidy lead to
generation of SV
1. Polyploidization
2. Whole genome doubling
3. Paleopolyploidization
Examples:
• Autopolyploid potato (Solanum
tuberosum; 2n = 4x = 48)
• allohexaploid wheat (Triticum
aestivum; 2n = 6x = 42)
• allotetraploid oilseed
rape/canola (Brassica napus; 2n
= 4x = 38)
• paleohexaploids Brassica
oleracea (2n = 2x = 18) and
Brassica rapa (2n = 2x = 20)
(Lagercrantz et al., 1996; Tang
et al., 2012; Parkin et al., 2014)
Visualization of large-scale SV
1. Classical cytology
2. Molecular marker technologies
3. Comparative genome hybridization (CGH)
4. Molecular cytogenetic techniques
• Fluorescence in situ hybridization (FISH)
• Genomic in situ hybridization (GISH)
Sequencing-based SV detection
• NGS approaches have accelerated the process of assembling
sequences
• Methods that detect single-nucleotide differences using whole
genome sequencing data, high-coverage exome sequence data
or sequence capture data have been a major breakthrough in
deciphering complex SV (Chen et al., 2008; Schiessl et al.,
2017b)
Approaches for characterization of SVs from NGS reads
• Combination of read depths (RD)
• Paired reads (PR)
• Split reads (SR)
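As a rough illustration of the read-depth (RD) signal only, the sketch below labels fixed windows as a possible gain or loss when their depth departs from the sample median. The function name, the window depths and the 1.5 and 0.5 ratio cutoffs are all assumptions made for this example, not values from the cited studies; paired-read and split-read evidence is not modelled here.

```python
from statistics import median

def call_depth_outliers(window_depths, gain_ratio=1.5, loss_ratio=0.5):
    """Label each window 'gain', 'loss' or 'normal' from depth relative to the median.

    The ratio cutoffs are illustrative assumptions, not published thresholds.
    """
    baseline = median(window_depths)
    if baseline <= 0:
        raise ValueError("median depth must be positive")
    calls = []
    for depth in window_depths:
        ratio = depth / baseline
        if ratio >= gain_ratio:
            calls.append("gain")      # candidate duplication / copy-number gain
        elif ratio <= loss_ratio:
            calls.append("loss")      # candidate deletion / absence
        else:
            calls.append("normal")
    return calls

# Hypothetical mean read depths for ten consecutive windows
depths = [30, 31, 29, 62, 60, 30, 2, 1, 30, 28]
calls = call_depth_outliers(depths)
print(calls)
```

In practice RD calls are usually combined with PR and SR evidence, since depth alone cannot place breakpoints or detect balanced events such as inversions.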
• In 1987, it was proposed that bacterial strains showing
>70% DNA-DNA reassociation and sharing characteristic
phenotypic traits should be considered strains of
the same species (the classical definition of a bacterial species)
• Today, this classical definition is being challenged by an
increasing amount of genomic information
• Thus far, the genome sequence of one or two strains for
each species has provided unprecedented information
• However, the question of how many genomes are
necessary to fully describe a bacterial species has yet to
be asked
Tettelin et al., 2005
• To explore gene variability within Streptococcus
agalactiae [group B Streptococcus (GBS)], the complete
genome of the type Ia strain A909 and draft genome
sequences (8× sequence coverage) of five additional strains,
representing the five major serotypes, were generated
• To address how many genomes are necessary to fully describe
a bacterial species, the genomes of strains of each of the major
pathogenic serotypes were sequenced
• Comparative analysis of the six newly sequenced genomes
and the two genomes already available in the databases
suggests that a bacterial species can be described by its
"pan-genome"
Tettelin et al., 2005
De novo sequencing
Whole genome alignment of GBS strains
Tettelin et al., 2005
GBS core genome. The number of shared genes is plotted as a
function of the number n of strains sequentially added
Tettelin et al., 2005
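The core-genome curve described on this slide can be sketched as follows: for every order in which the strains could be added, the set of genes shared by all strains so far is intersected with each new strain, and the counts are averaged per step. The strain names and gene sets are toy data invented for this example, and the exponential-decay extrapolation used in the paper is not reproduced here.

```python
from itertools import permutations
from statistics import mean

def core_sizes_for_order(gene_sets, order):
    """Number of genes shared by all strains after each sequential addition."""
    shared = None
    sizes = []
    for name in order:
        if shared is None:
            shared = set(gene_sets[name])
        else:
            shared = shared & gene_sets[name]
        sizes.append(len(shared))
    return sizes

def mean_core_curve(gene_sets):
    """Average shared-gene count at each n over all orders of strain addition."""
    names = sorted(gene_sets)
    curves = [core_sizes_for_order(gene_sets, order) for order in permutations(names)]
    return [mean(column) for column in zip(*curves)]

# Toy presence data (hypothetical strains and gene identifiers)
toy = {
    "strainA": {"g1", "g2", "g3", "g4"},
    "strainB": {"g1", "g2", "g3", "g5"},
    "strainC": {"g1", "g2", "g6"},
}
curve = mean_core_curve(toy)
print(curve)
```

Enumerating every permutation is feasible for eight genomes (40,320 orders) but grows factorially, so larger collections would need random sampling of orders instead.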
• Diverse rice accessions have been resequenced and
phenotyped in recent years
• In these resequencing efforts, characterization of
genetic variants relies entirely on a reference genome
• As a result, information from highly polymorphic regions is
often lost
(Zhao et al., 2018)
• A total of 66 accessions were selected from ~1,500
diverse accessions of O. sativa and O. rufipogon
• These were used for deep sequencing and whole-genome de novo
assembly, independently of the Nipponbare reference
• The 66 genome assemblies were anchored onto the
Nipponbare reference genome to discover detailed
sequence variations
(Zhao et al., 2018)
Quality control
To confirm that the 66 accessions
represent the 1,529 accessions
1. Phylogenetic tree
2. Comparing SNPs
To check the accuracy of de novo
sequencing
1. BACs
2. De novo sequencing of the reference
genome, compared with the reference to determine error
rates
3. Screening for domestication sweeps
(Zhao et al., 2018)
1. Phylogenetic tree
Phylogenetic tree for the 66 genomes, which is consistent with
that of the 1,529 rice accessions
(Zhao et al., 2018)
2. Comparing SNPs
• Resequenced a total of 1,529 accessions of O. sativa and O.
rufipogon
• Among the common SNPs (minor allele frequency > 0.01)
identified in the large population, 89.2% (1,405,349 of
1,575,718) were detected in the 66 genome assemblies as
well
• Suggesting that the core collection captured a large
proportion of common genetic variation in the O. sativa – O.
rufipogon complex
(Zhao et al., 2018)
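The 89.2% figure on this slide follows directly from the two SNP counts quoted; the short check below recomputes it and the number of common SNPs not recovered. The variable names are chosen for this example only.

```python
# Counts quoted on the slide (Zhao et al., 2018)
common_snps_in_population = 1_575_718   # common SNPs (MAF > 0.01) in the 1,529 accessions
recovered_in_assemblies = 1_405_349     # of those, detected in the 66 genome assemblies

recovered_percent = 100 * recovered_in_assemblies / common_snps_in_population
missed = common_snps_in_population - recovered_in_assemblies

print(f"{recovered_percent:.1f}% recovered, {missed:,} common SNPs not detected")
# prints: 89.2% recovered, 170,369 common SNPs not detected
```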
1. BAC-based sequence
(Zhao et al., 2018)
2. De novo sequencing of reference genome and
comparison
• The de novo genome assembly of Nipponbare was compared with
the reference sequence for quality control
• The sequence identity between them was > 99.96%
• Error rates of 0.0218% in genes and 0.0352% in intergenic
regions were recorded
• Error rates were plotted across the 12 rice chromosomes
(Zhao et al., 2018)
3. Screening for domestication sweep
(Zhao et al., 2018)
• Searched for orthologs of the gene set in
each of the 67 rice genomes
• Generated a list of one-to-one correspondences
and their presence-or-absence information in
different accessions
• There were 26,372 genes present in ≥90% of
the collection and 16,208 genes present in
<90% of the collection
• These were defined as the core genome set and
the dispensable genome set, respectively
(Zhao et al., 2018)
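The core/dispensable split above amounts to a threshold on how many accessions carry each gene. A minimal sketch, assuming presence-or-absence is stored as a mapping from gene to the set of accessions containing it; the function name, gene names and accession names are hypothetical. The editor's notes give the cutoff as presence in at least 60 of the 67 accessions, so the threshold is passed as a count.

```python
def split_core_dispensable(presence, min_accessions):
    """Split genes into core and dispensable sets by number of accessions carrying them."""
    core, dispensable = set(), set()
    for gene, accessions in presence.items():
        if len(accessions) >= min_accessions:
            core.add(gene)          # present in at least the cutoff number of accessions
        else:
            dispensable.add(gene)   # present in fewer accessions than the cutoff
    return core, dispensable

# Toy presence/absence data for a five-accession collection (illustrative only)
presence = {
    "geneA": {"acc1", "acc2", "acc3", "acc4", "acc5"},
    "geneB": {"acc1", "acc2", "acc3", "acc4"},
    "geneC": {"acc2"},
}
core, dispensable = split_core_dispensable(presence, min_accessions=4)
print(sorted(core), sorted(dispensable))
```

For the rice collection described here the call would use `min_accessions=60`.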
The number of coding genes present in one group but absent
in all other groups (285 group-specific genes among the
dispensable genome set)
(Zhao et al., 2018)
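The definition of a group-specific gene used here (present in one group, absent from all others) can be expressed as a small filter over presence-or-absence data. This is a sketch under assumed data structures: the function name, group labels, accession names and genes are invented for illustration.

```python
def group_specific_genes(presence, group_of):
    """Return {gene: group} for genes found in accessions of exactly one group."""
    specific = {}
    for gene, accessions in presence.items():
        groups = {group_of[accession] for accession in accessions}
        if len(groups) == 1:
            specific[gene] = next(iter(groups))
    return specific

# Hypothetical accession-to-group assignment and gene presence sets
group_of = {"acc1": "group1", "acc2": "group1", "acc3": "group2", "acc4": "group3"}
presence = {
    "geneA": {"acc1", "acc2"},   # only in group1
    "geneB": {"acc1", "acc3"},   # spans group1 and group2
    "geneC": {"acc4"},           # only in group3
}
specific = group_specific_genes(presence, group_of)
print(specific)
```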
• Constructed a wheat pangenome using a reference and
whole-genome sequencing data from 18 cultivars
• 64.3% core genome, 35.7% variable genome
• 12,150 genes are absent from the Chinese Spring
reference sequence but present in all the other
cultivars
Montenegro et al., 2017
• 8 cultivated lines
• 1 wild type
• Compared with B. oleracea var TO1000
Golicz et al., 2016
Pangenome contig placement on TO1000 chromosomes
Golicz et al., 2016
• Pangenome composed of 81.3% core genome, 18.7%
variable genes and 2.2% unique genes
• Functional analysis of variable genes suggests enrichment of
genes involved in:
  • Disease resistance
  • Defense response
  • Water homeostasis
  • Amino acid phosphorylation
  • Signal transduction
Golicz et al., 2016
• Maize exhibits the highest level of SV
• Half of the genome is not shared between any two
lines
• High level of TE activity
• 85% of the B73 genome consists of TEs
Lu et al., 2015
Conclusion

Pangenome: A future reference paradigm

Editor's Notes

  • #6 Details of each method
  • #7  (A) Simple genomic rearrangement: (i) the replication fork encounters a lesion (grey) and stalls; (ii) fork stalling leads to replication fork collapse and cleavage by an endonuclease creates a single-ended double-strand break (DSB), shown here as the shorter of the two strands; (iii) 5′ to 3′ resection exposes a region of microhomology (red) and generates a 3′ single-stranded overhang; (iv) the formation of a D-loop by the template strand is followed by the invasion of the 3′ overhang, which anneals to the microhomologous region to restart synthesis; and (v) synthesis is continued to the end of the chromosome. (B) Complex genomic rearrangement: (i–iv) As in simple MMBIR; (v) the invading strand disengages due to low processivity DNA polymerases; (vi) The 3′ overhang invades a different DNA segment and anneals to a microhomologous region. Further template switches occur until DNA polymerases with higher processivity enable continuous synthesis; (vii) eventually, synthesis restarts at the original strand and continues to the end of the chromosome; and (viii) following MMBIR with template switching, the replicated chromatid exhibits a region of complex rearrangements with junctional microhomology.
  • #10 Connect to the next slide by explaining why we cannot rely on a single genome to understand genomic diversity or SV. To differentiate one plant, or any organism, from another within a species, and to explain the extent of variation present in the genome of a species, it is not logical to rely on the sequence of one genome (the reference genome). Let us see with an example how illogical it is…
  • #12 S. agalactiae is a leading cause of illness or death among newborn infants and an emerging cause of invasive infection in the elderly. What was special about the sequencing?
  • #15 To estimate the number of genes present in every GBS strain (core genome), the number of shared genes found on sequential addition of each new genome sequence was extrapolated by fitting an exponential decaying function to the data (Fig. 2). The results of all permutations of the order of addition for each of the eight genomes are shown. As expected, the number of shared genes initially decreased with addition of each new sequence. Nevertheless, extrapolation of the curve indicates that the core genome reaches a minimum of 1,806 genes (95% confidence interval 1,750–1,841) and will remain relatively constant, even as many more genomes are added
  • #16 As with the shared genes, the plot of the numbers of new genes was well fitted by a decaying exponential. The average number of new genes added by a novel sequence was 161 when a second genome was added, and this number decreased to 54 after five genomes; but, even the eighth genome continued to add new genes. Remarkably, the extrapolated curve reaches a nonzero asymptotic value of 33 new genes (95% confidence interval 22–42) with increasing numbers of genomes
  • #23  The draft sequence of one rice accession, Guangluai-4 (GLA4), was validated using 22 Mb of high-quality BAC-based sequences as a gold standard, (a) The physical locations of 87 GLA4 contigs (merged by 273 BACs through sanger-based sequencing) on chromosome 4 and the corresponding scaffolds in the GLA4 assembly of the rice pan genome data, colored in blue and green, respectively. Repeats in red show the RepeatMasker-annotated transposable elements and other repeats. (b) The local genomic region of a 3-Mb interval on chromosome 4. (c) Comparison between a scaffold in the GLA4 assembly and the corresponding BAC-based sequences. The grey blocks show highly consistent regions between Sanger-based BAC and the scaffold. The inconsistent bases between them were indicated. (d)-(f)
  • #26  Using the pan-genome-based variants, we performed a global analysis for the domestication selection scan (Supplementary Fig. 9). As expected, the results for the major domestication sweeps were almost the same as previous results from low-coverage resequencing of 1,529 accessions What are the important findings of this study?
  • #27 the whole-genome de novo assemblies provided the opportunity to discover genes that are absent in the Nipponbare reference genome sequence and to explore the PAV information of all coding genes among the rice accessions. We performed genome annotations for all 67 assemblies (including that for Nipponbare). With the exclusion of repetitive sequences, we predicted all the non-transposable-element (non-TE) protein-coding genes for each genome. There were a total of 10,872 genes in the 67 rice accessions that were at least partially absent in the Nipponbare reference. These ‘newly identified’ genes were mostly due to large indels among accessions (for example, a large insertion relative to the Nipponbare variety; see Fig. 4a,b
  • #28 Previous studies had identified several genes that had not been observed in the Nipponbare reference, including Sub1A, SNORKEL1 and SNORKEL2, which control submergence tolerance11,12, and Pstol, which controls phosphorus-deficiency tolerance13. Sequence searching showed that all of these reported genes were among the newly identified genes found in the pangenome (Fig. 5a). Taken together, these pieces of evidence suggest that at least some of the newly identified genes are functionally important. We searched the orthologs of the gene set against each of the 67 rice genomes (see the number of shared genes between two accessions in Fig. 5b) and generated a list of one-to-one correspondences and their presence-or-absence information in different accessions (Fig. 5c). According to PAV of the genes, there were 26,372 and 16,208 genes present in ≥ 60 rice accessions (90% of the collection) and present in < 60 accessions, respectively, and these were defined as the core genome set and the dispensable genome set of coding genes in rice.
  • #29 Among the dispensable genome set, there were 285 group-specific genes (Supplementary Fig. 15), whereas most of the genes were present in only a few accessions.
  • #30 Screened InterPro domains of coding genes in the core genome set and those in the dispensable genome set and compared the functional classifications of the coding genes from the two sets (Supplementary Fig. 16). As expected, the genes of the dispensable genome set were enriched for abiotic and biotic response genes, especially for NBS-LRR (nucleotide-binding site–leucine-rich repeat) and NB-ARC (nucleotide-binding adaptor shared by APAF-1, R proteins and CED-4) genes, which control disease resistance in rice. Can we stop at studying 66 accessions, or will adding more accessions give further useful information? This is the obvious question at the end. To answer it, they used software-based extrapolation to evaluate the number of protein-coding genes in the rice pan-genome
  • #31 Evaluation of the number of protein-coding genes in the rice pan-genome. Stepwise addition of rice accessions from n = 2 to n = 67 was performed to evaluate the number of coding genes in the rice genome. In each independent run from n = 2 to n = 67, clusters are defined as coding genes present in at least two rice accessions, and singletons are defined as coding genes present in only one of the rice accessions.
  • #34 SNP density, pangenome contig placement on TO1000 chromosomes and variable gene density. Each chromosome is represented by three tracks, which from the top correspond to the following: (1) SNP density (black line)—each chromosome was split into 500kb bins, the number of SNPs in bins is plotted as a function of bin position; (2) pangenome contig placement on TO1000 chromosomes (coloured rectangle)—contigs originating from each step of pangenome construction were placed along the chromosomes and colour coded according to the line; (3) variable gene density (burgundy line)—the number of variable genes in each 500kb bin is divided by the total number of genes.
  • #38 The 4.4-M mapped tags were aligned to the B73 reference genome and the physical positions were compared with their genetic positions. If a tag did not align to the reference or did not have alignment within the 10-Mb region of its genetic position, then it would be considered as a PAV tag. The B73 tags are those that have at least one perfect match to the reference genome.
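The PAV-tag rule in note #38 can be written as a small predicate: a tag counts as a PAV tag if it has no alignment at all, or no alignment near its genetic position. The sketch below reads "within the 10-Mb region" as at most 10 Mb from the genetic position on the same chromosome; that reading, the function name and the example coordinates are all assumptions for illustration.

```python
MAX_DISTANCE_BP = 10_000_000  # 10 Mb, from note #38; the exact window definition is assumed

def is_pav_tag(genetic_chrom, genetic_pos, alignments, max_distance=MAX_DISTANCE_BP):
    """Return True if no alignment lies near the tag's genetic position.

    alignments is a list of (chromosome, position) tuples; an empty list
    means the tag did not align to the reference at all.
    """
    for chrom, pos in alignments:
        if chrom == genetic_chrom and abs(pos - genetic_pos) <= max_distance:
            return False  # physical and genetic positions agree
    return True

# Hypothetical tags mapped genetically to chromosome 1 at 50 Mb
print(is_pav_tag("1", 50_000_000, []))                    # unaligned tag
print(is_pav_tag("1", 50_000_000, [("1", 55_000_000)]))   # aligned 5 Mb away
print(is_pav_tag("1", 50_000_000, [("2", 50_000_000)]))   # aligned on another chromosome
```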