This document describes the process taken to analyze genomes of Pneumocystis species to elucidate their putative mating system. The approach involved using Schizosaccharomyces pombe as a reference genome to search for orthologs in the target genomes through similarity searches, domain architecture analysis, and phylogenetic analysis. Key findings included the lack of elements of the RNA interference pathway and cell fusion/meiosis regulation in Pneumocystis, providing evidence they may have a primary homothallic mating system. The document also discusses limitations and need for reassessing some annotations.
4. Reminiscing
• Why
• Elucidation of the putative mating system for Pneumocystis.
• Available genomes
• Rationale
• A ‘Taphrina paper’-like approach was initially envisaged
• Arbitrary list of genes of interest based on published literature
• Gene location by similarity to a bait protein reference
• Tentative exon mapping by hand
• Functional assessment
• A (not so) close, but very well annotated, reference
• Schizosaccharomyces pombe
• Assumptions
• The existence of three close genomes allowed for
• Inference of non-existence whenever absent on all
• P. carinii - rats
• P. jirovecii – humans
• P. murina – mice
• Taphrina deformans – peach tree
• Saitoella complicata - saprophyte
5. Looking for the Road Ahead
• Similarity based methods (TBLASTN)
• Specifics
• Protein sequence as bait
• S. pombe @ UniProt
• Genome sequence as substrate
• Scaffolds|contigs library
• Limitations
• Unspecific hits
• Highly divergent sequences
• No hits expected
• No statistical model available
• Gene relevance
• To know the relevant process(es) inner workings
• To detect putative choke points
• To navigate a sea of heterogeneous names
7. Navigating Troubled Waters
• Final solution
• Phylogenetic analysis
• Labour+computation intensive
• Restricted set of sequences
• Sensible route
• Domain architecture analysis
• Faster & Sound
• Based on alignments of all the accepted cases
• Modular nature of proteins
• Underlying, but transparent phylogenetics
• Stable & Reusable
• Location by prototypical synteny
• Too short
• Too divergent
8. Cell Designer 4.3
The Reference
• Schizosaccharomyces pombe
• Proteome as gateway to genome annotation
• Swiss-Prot grade annotation
• Set of 5000+ genes
• Availability of functional annotation
• Published references
• Referenced papers
• Papers located by textual search
• KEGG PATHWAY
• Sparsely used
9. Looking for Orthologs
[Flowchart of the annotation protocol. Recoverable elements: S. pombe query sequence; target genome; TBLASTN; genomic regions; target genome annotation (yes/no); Genewise and manual curation; CDS translation; InterProScan; target domains or domain architecture; S. pombe specific domains or domain architecture; Match? (yes/no); phylogenetic analysis; outcomes: no homolog found / homolog found.]
11. Going Deeper
• Method Performance
• Reassessment of annotated genes
• e.g. ste23@Pjiro
12. Out of Trouble?
• Method Limitations
• Annotation ambiguity
• e.g. rad24 & rad25
Phylemon2::Phymlbestaictree
13. A Question of Entourage
• Small highly divergent genes
• Small MAT genes through synteny
14. Major Findings (Pneumocystis)
• Considering Schizosaccharomyces pombe
• RNA interference (RNAi) pathway
• Lacks crucial elements (e.g.)
• dcr1
• ARC Complex
• etc
• Cell Fusion & Meiosis Onset Regulation
• Missing assorted elements
• Pheromone Action
• Incomplete signal transmission component set
• Signal transduction seems to be fully present
• Cell Cycle Regulation by Environmental Factors
• Missing some of the putative crucial elements (e.g.)
• wis4, wis1
• hog1
• atf1
• rst2
15. Conclusions
• Pneumocystis species
• Probable primary homothallism
• Only two different, and incomplete MAT ‘cassette’ candidates
• In the same scaffold
• No conventional silencing RNA interference pathway
• Additional argument in favor
• Further study required
• Sequencing of further clinical isolates seems to support this
• Existing annotation
• Should be assessed depending on its quality grade
• Odd situations should be reappraised
Good afternoon. Thank you for attending this talk about…
…what can be rightfully titled as the making of the paper shown here.
This study began in the wake of a previous paper that accompanied the release of the annotated genome of Taphrina deformans.
The initial challenge was to apply the same approach used before to locate and characterize the MAT, and some other sex-related genes in Pneumocystis sequenced genomes.
The existence of three sequenced genomes and a well-annotated reference would allow some bold assumptions to be made, namely the inference of absence for a given gene when it is missing from all three.
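The absence rule can be sketched in a few lines of Python. This is only an illustration of the reasoning, not the procedure actually used: the gene labels and presence calls below are hypothetical placeholders, not results from the study.

```python
# Hypothetical presence calls per gene; labels and values are illustrative only.
GENOMES = ("P. carinii", "P. jirovecii", "P. murina")

def infer_absent(calls):
    """Return the genes for which no candidate was found in any of the genomes."""
    return sorted(
        gene for gene, found_in in calls.items()
        if not any(genome in found_in for genome in GENOMES)
    )

calls = {
    "geneA": {"P. carinii", "P. murina"},  # found in two genomes: not called absent
    "geneB": set(),                        # missing everywhere: inferred absent
    "geneC": {"P. jirovecii"},             # a single hit is enough to keep it
}

absent = infer_absent(calls)
print(absent)
```

A gene missing from only one or two genomes is deliberately not reported, since a gap in a single assembly is a more likely explanation than a real loss.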
The methodology to be used would involve gene pinpointing by similarity to a reference protein sequence, manual fitting for the mapping of the CDS, and confirmation through the functional annotation of its product. The smallest MAT genes were to be located through relative position and synteny.
A multitude of nonspecific BLAST hits, together with the lack of a clear assessment of the role and relevance of many of the genes listed as of interest, led to a complete reappraisal of the approach.
To avoid misunderstandings about gene symbols, the Schpo nomenclature, as presented in UniProt/Swiss-Prot, was adopted.
The main cause of the nonspecific hits was readily assigned to the fact that many of the genes studied share related domains, as can be seen here for the SPK1 protein.
Several good hits across the target scaffolds are the result of a domain present in many different known architectures.
Ignore this common fact at your own peril, and expect to find yourself in a disturbing maze of mirrors…
To overcome this problem, gene annotation had to be made independent of plain sequence similarity methods.
No doubt phylogenetic analysis would be the definitive approach.
Being a very demanding process in terms of computation and of expert interaction with the results, it can only be applied to restricted sets of sequences.
In this case protein domain architecture analysis is clearly the sensible route.
It is based on the underlying alignment of all the accepted elements for a given domain, and these alignments are usually broader than anything the regular phylogenetics user can envisage.
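To make the difference between "shares a domain" and "has the same architecture" concrete, here is a minimal Python sketch. It assumes each protein is described by a list of (domain identifier, start position) pairs; the identifiers `domA`, `domB`, and `kinase` are hypothetical and do not come from the talk.

```python
from typing import List, Tuple

Domain = Tuple[str, int]  # (domain identifier, start position in the protein)

def architecture(domains: List[Domain]) -> Tuple[str, ...]:
    """Order domains by start position and keep only their identifiers."""
    return tuple(name for name, _ in sorted(domains, key=lambda d: d[1]))

def same_architecture(reference: List[Domain], candidate: List[Domain]) -> bool:
    """True only when both proteins carry the same domains in the same order."""
    return architecture(reference) == architecture(candidate)

def shares_any_domain(reference: List[Domain], candidate: List[Domain]) -> bool:
    """True when at least one domain is common, which is all a BLAST hit may reflect."""
    return bool(set(architecture(reference)) & set(architecture(candidate)))

# Hypothetical annotations, for illustration only.
reference = [("domA", 10), ("kinase", 120)]
candidate_same = [("kinase", 95), ("domA", 5)]      # same order along the sequence
candidate_partial = [("kinase", 40), ("domB", 300)]  # shares only the common domain

print(same_architecture(reference, candidate_same))
print(shares_any_domain(reference, candidate_partial),
      same_architecture(reference, candidate_partial))
```

The second candidate would probably produce a good similarity hit through the shared domain, yet it fails the architecture test, which is the situation described for widespread domains.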
…
In this way BLAST hits are used just as a broad location tool.
The availability of a seemingly complete, and well annotated proteome opened the door to an assessment of the regulatory processes of interest.
The latter allowed for a more precise appraisal of the role each gene plays, and of its relevance for the whole.
Missing information was brought into this analysis from published references, and the remaining gaps were filled with data collected from KEGG Pathway.
The final annotation protocol amounted to this diagram.
…
Please note that this approach, where ortholog identification is made independent of the similarity search, frees the user from the shackles of a given E-value threshold.
This means that TBLASTN searches can be carried down to almost preposterous E-value levels in order to find the faintest of similarity signals.
Hit validation is carried out independently, so it is not affected by the significance level of the TBLASTN hit.
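As a rough illustration of using TBLASTN purely as a location tool, the Python sketch below builds a permissive command line and turns every tabular hit into a region to validate. The file names, the E-value of 10, and the sample hits are assumptions made for the example; the options shown (`-query`, `-db`, `-evalue`, `-outfmt 6`) are standard BLAST+ options, and the command is only built here, not executed.

```python
# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore")

def tblastn_command(query_faa, genome_db, evalue=10.0):
    """Build a permissive TBLASTN call; the E-value cutoff is an illustrative choice."""
    return ["tblastn", "-query", query_faa, "-db", genome_db,
            "-evalue", str(evalue), "-outfmt", "6"]

def parse_hits(tabular_text):
    """Parse -outfmt 6 lines into dictionaries with numeric coordinates."""
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip():
            continue
        row = dict(zip(FIELDS, line.split("\t")))
        for key in ("qstart", "qend", "sstart", "send"):
            row[key] = int(row[key])
        row["evalue"] = float(row["evalue"])
        hits.append(row)
    return hits

def candidate_regions(hits):
    """Every hit becomes a region to validate; none is discarded on E-value."""
    return [(h["sseqid"], min(h["sstart"], h["send"]), max(h["sstart"], h["send"]))
            for h in hits]

# Made-up hits: one strong, one very weak and on the reverse strand.
sample = (
    "query1\tcontig_2\t62.0\t150\t57\t0\t1\t150\t9000\t9449\t1e-30\t120\n"
    "query1\tcontig_7\t28.0\t100\t72\t0\t200\t299\t400\t100\t5.2\t25.0\n"
)
regions = candidate_regions(parse_hits(sample))
print(tblastn_command("spombe_query.faa", "target_scaffolds"))
print(regions)
```

The weak hit on `contig_7` is kept on equal terms with the strong one; whether it is a real homolog is left to the downstream domain architecture check.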
Such a protocol enabled us to get ortholog candidates even when they are dispersed among different unordered contigs, as is the case for P. carinii.
As an example, we may consider the DNA mismatch repair protein MSH2.
As you can see the functional annotation for the product of the concatenated exons found is not only very similar to the annotation of the COOH end of MSH2 in Schpo, but it matches even the relevant PANTHER subfamily.
The pattern of exon distribution among the genomes studied denotes a typical increase in CDS fragmentation from S. complicata towards Pneumocystis.
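One way to picture how fragments scattered over unordered contigs can be put back together is to order them by the part of the reference protein they match. The Python sketch below does that and reports how much of the query is covered; the contig names, coordinates, and query length are invented for the example and say nothing about the real MSH2 case.

```python
def order_fragments(hits):
    """Order hits by the position they match on the reference protein."""
    return sorted(hits, key=lambda h: h["qstart"])

def query_coverage(hits, query_length):
    """Fraction of the reference protein covered, counting overlaps once."""
    covered = 0
    last_end = 0
    for h in order_fragments(hits):
        start = max(h["qstart"], last_end + 1)  # skip residues already counted
        if h["qend"] >= start:
            covered += h["qend"] - start + 1
            last_end = h["qend"]
    return covered / query_length

# Hypothetical fragments found on three different contigs.
hits = [
    {"contig": "contig_12", "qstart": 301, "qend": 450},
    {"contig": "contig_3", "qstart": 1, "qend": 200},
    {"contig": "contig_9", "qstart": 180, "qend": 300},
]

order = [h["contig"] for h in order_fragments(hits)]
coverage = query_coverage(hits, query_length=500)
print(order, coverage)
```

The ordering suggests how the exons would be concatenated before functional annotation; it does not by itself prove that the contigs are adjacent in the genome.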
The same protocol enabled us to review the existing annotation for the genomic regions where the hits were found.
The case shown presents the A-factor processing enzyme STE23 from S. cerevisiae (no evident ortholog in S. pombe, maybe YAN2).
In the genomic sequence of P. jirovecii the TBLASTN hit pointed to a region already occupied by two shorter genes.
As these genes presented no meaningful functional annotation, and Genewise was able to match STE23 to exons very convincingly, the case for an annotation review was very compelling. Moreover, both length and domain architecture were those expected for STE23.
Of course danger is always around the corner, and situations emerged where even the reference architecture proved ambiguous.
An example of this were the DNA damage checkpoint proteins RAD24 and RAD25.
Their architecture is identical, and even when a phylogenetic analysis is carried out it is very difficult to assign the homologs found to each of the prototypes.
Moreover, the branching posterior probability and the exon structure seem to point in different directions.
The larger MAT genes were easily found by the standard protocol, but their smaller and more divergent neighbours had to be located through relative position, and assessed by synteny.
From the several putative candidates for matMi, the largest were chosen, and they appear to present a common trait: matching the signal peptide signature at the NH2 end.
The matPc site in Pneumocystis was found to be consistently occupied by an hsp104 ortholog, and no other candidates were found.
The original annotation involved a very large gene that, after review, was split into the hsp104 and end4 genes.
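Location by relative position can be pictured as searching the interval between two genes that are already placed. The Python sketch below returns that interval for one scaffold; the marker names `flankA` and `flankB` and all coordinates are hypothetical, as the notes do not state which flanking genes were used.

```python
def region_between(genes, left_marker, right_marker):
    """Return the (start, end) interval lying between two located genes.

    genes: list of (name, start, end) tuples on a single scaffold.
    Returns None when a marker is missing or the markers leave no gap.
    """
    by_name = {name: (start, end) for name, start, end in genes}
    if left_marker not in by_name or right_marker not in by_name:
        return None
    first, second = sorted((by_name[left_marker], by_name[right_marker]))
    gap_start, gap_end = first[1] + 1, second[0] - 1
    if gap_start > gap_end:
        return None
    return (gap_start, gap_end)

# Hypothetical scaffold layout, for illustration only.
scaffold = [
    ("flankA", 1000, 2500),
    ("flankB", 3400, 5000),
    ("other", 7000, 8000),
]

window = region_between(scaffold, "flankA", "flankB")
print(window)
```

Any open reading frame found inside the returned window would still need to be assessed, for example by comparing its position across the three genomes.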
For Pneumocystis several pathways seem to be impaired, if the S. pombe reference proves valid.
The most notable is the RNA interference pathway.
Other circuits important to sexual reproduction also seem affected, namely the control of the onset of meiosis.
These findings would allow for some interesting conclusions, but some effort has to be placed in finding possible circumventing pathways that could be at play.