CAMERA Annotation Pipelines
      (and related infrastructure)



            Brett Whitty
            12/20/2007
Overview

 Compute Infrastructure
 GOS/CAMERA ncRNA/ORF calling pipeline
   rRNA finding pipeline
   ORF calling
 GOS (incremental) protein clustering
 CAMERA Annotation Pipeline
   Specifications
   Implementation
Compute Infrastructure
CALIT2 Compute Grid

 48 dual-core, dual-CPU 64-bit machines
    192 SGE slots
 Red Hat-based ‘Rocks Clusters’ Linux
  distribution (see http://rocksclusters.org)
 ‘Rocks Rolls’
   Bio-roll (/opt/Bio)
   Used to image/install each node separately,
    including local Perl module installs (patches)
sos.camera.calit2.net

 Head node of sos cluster
    SSH into here
 Is not an SGE submit host
SOS Cluster Global Mounts
 /share/apps
    applications (and related files) are installed here,
     analysis data should not be stored here
 /home/thumper6
    a global mount point --- 18T(!!!) storage volume
     on which all analysis data/results should be
     stored
 /opt/Bio
    tools such as clustalw, EMBOSS, hmmer, ncbi
     blast are installed under here
SOS Local Mounts
                   (on each grid node)


 /state/partition1
    local storage device on each grid node available
     for local scratch space (438G)
 /tmp
    system tmp partition (7G)
pg0-0.camera.calit2.net

 SSH accessible only through head
 Is an SGE submit host
 Running apache and postgres servers
pg0-0.camera.calit2.net

 http://web1.camera.calit2.net/ergatis/


 /var/www/cgi-bin/ergatis
     /home/bwhitty/temp/subversion-1.4.5/subversion/svn/svn export --force
      https://ergatis.svn.sf.net/svnroot/ergatis/tags/ergatis-v2r6b1/cgi-bin ergatis

 /var/www/html/ergatis
     /home/bwhitty/temp/subversion-1.4.5/subversion/svn/svn export --force
      https://ergatis.svn.sf.net/svnroot/ergatis/tags/ergatis-v2r6b1/htdocs ergatis
pg0-0.camera.calit2.net
 CGI scripts run as the user 'apache' on pg0-0, but ‘apache’ has
  sudo permissions for user 'ergatis'
    The two files in the install which run RunWorkflow and
     KillWorkflow (ergatis/kill_wf.cgi, ergatis/Ergatis/Pipeline.pm)
     have been modified, and 'sudo -u ergatis ' has been prepended
     to their normal execution strings

 IdGenerator.pm has been modified to use JCVIIdGenerator.pm

 Many of the settings in ergatis.ini have been changed from
  defaults, including disabling a number of the components
    When updating the Ergatis CGI directory from the SVN
     repository, a backup copy should be set aside in advance
SGE/Workflow Notes
   Two SGE queues have been configured for ergatis:
        ergatis.q (192 slots)
        ergatis-fast.q (144 slots)
   ergatis.q is a subordinate queue of ergatis-fast.q

   ergatis.q is set as default queue for user ‘ergatis’ by specifying ‘-q ergatis.q’ in
    /home/ergatis/.sge_request

   Workflow version 3.0 is installed
        /share/apps/workflow

   Workflow requires that the SGE queue's prolog and epilog scripts be set to the
    following:
        prolog=/share/apps/workflow/bin/prolog $host $job_owner $job_id $job_name $queue
        epilog=/share/apps/workflow/bin/epilog $host $job_owner $job_id $job_name $queue

   The queue configuration can be checked using the command
    'qconf -sq ergatis.q'
Ergatis Application Install
   The main ergatis application install directory is under /share/apps/ergatis

   The chado-v1r12b1 release is the current version installed
        direct copy of the install located at /usr/local/devel/ANNOTATION/ard/ at JCVI
        Perl wrappers were modified via sed to the correct local directory structures
        Proper install wasn't done because no working installer script was available at the
         time

   /share/apps/ergatis/chado-v1r12b1
    symlinked to /share/apps/ergatis/current

   Executables which some ergatis components use, but which are not installed with
    Ergatis (e.g.: JCVI internal scripts), are located under /share/apps/ergatis/bin

   External tools which are not globally installed on sos are installed under
    /share/apps/ergatis/external_apps

   Ergatis global directories (global_id_repository, global_saved_templates) are
    located under /share/apps/ergatis/ergatis_global
Ergatis Data Locations
 All ergatis data should be put under /home/thumper6/ergatis

 Project repositories are located under
  /home/thumper6/ergatis/projects
  or symlink /share/apps/ergatis/projects

 CAMERA project repository is
  /home/thumper6/ergatis/projects/camera

 Databases are located under /home/thumper6/ergatis/db
  or symlink /share/apps/ergatis/db

 Global scratch space is under /home/thumper6/ergatis/scratch
  or symlink /share/apps/ergatis/scratch
ikelite.rocksclusters.org

 Fewer machines than sos cluster (~20 slots?)
 Initial test ergatis install was done here
  (similar directory structure to sos)
 Completely distinct from sos cluster
 Sandbox
 Shibu, Weizhong Li and others run computes
  here (e.g.: clustering pipeline)
Pipelines
GOS/CAMERA Pipelines Overview

[Flow diagram: Metagenomic Reads -> ncRNA/ORF Finding Pipeline -> ORFs/peptides;
 ORFs/peptides -> Incremental Clustering Pipeline -> Cluster Memberships;
 ORFs/peptides -> Annotation Pipeline]
Challenges
 All computes in the pipeline must be performed on
  multi-sequence input/output files, as the filesystem
  cannot physically support 12M+ individual FASTA
  input/output files
    other partitioning solutions could work(?) but most tools
     support multiple sequence inputs anyway

 Overall space consumption was an issue when
  computes were running on the TIGR grid, but this is not
  as much of an issue (currently) on the CALIT2 grid
    Solution here was to keep all inputs/outputs gzipped
     during pipeline execution, at the cost of some performance
     loss (using things like zcat -f | with NCBI BLAST, etc.)
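
A rough sketch of the same idea outside the shell: read a multi-sequence FASTA file directly, whether or not it is gzipped, without ever unpacking it to disk. This is only an illustration in Python; the function names `open_maybe_gzip` and `iter_fasta` are made up for this example and are not part of the pipeline. Like `zcat -f`, it passes uncompressed input through unchanged.

```python
import gzip
from typing import Iterator, TextIO, Tuple


def open_maybe_gzip(path: str) -> TextIO:
    """Open a file as text, decompressing on the fly if it is gzipped."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rt")
    return open(path, "rt")


def iter_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a multi-sequence FASTA file."""
    header = None
    chunks = []
    with open_maybe_gzip(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header = line[1:]
                chunks = []
            elif header is not None:
                chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)
```

Records are yielded one at a time, so memory use stays flat even for files with millions of sequences.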
GOS/CAMERA ncRNA and
  ORF Finding Pipeline
GOS/CAMERA ncRNA and ORF
     Finding Pipeline Overview
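[Flow diagram]
 Main path: Reads -> Find tRNAs -> Soft-Mask tRNAs -> Find rRNAs ->
  Soft-Mask rRNAs -> ORF calling (GOS ORF calling, Metagene) ->
  ORF stats, ORF overlaps
 Find tRNAs -> Extract tRNAs -> tRNAs FASTA
 Find rRNAs -> Extract rRNAs -> rRNAs FASTA
 GOS ORF calling -> ORFs FASTA, Peptides FASTA
 Metagene -> ORFs FASTA, Peptides FASTA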
GOS/CAMERA
ncRNA and ORF Finding Pipeline
                    CAMERA-specific
                   Ergatis components
camera_extract_trna
CAMERA rRNA Finder Overview
 BLAST vs. a database of coded pooled rRNA
  subunit sequences
 BLAST prefilter step with loose parameters
    blastall -p blastn -i reads.fsa -d rrna_db.fsa -e 0.1 -F 'T' -b 1 -v 1 -z 3000000000 -W 9
 Reads with prefilter hits are searched using strict
  parameters
    blastall -p blastn -i aligned.fsa -d rrna_db.fsa -e 1e-4 -F 'm L' -b 1500 -v 1500 -q -5 -r 4 -X 1500 -z 3000000000 -W 9 -U T
 Collapse aligned intervals of the same rRNA type
  and extract the highest scoring alignments from
  each region
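
The collapsing step can be pictured roughly as follows. This is a Python sketch, not the camera_rrna_finder code: the `Hit` and `Region` types are hypothetical, hits are assumed to be already parsed from the BLAST output, and "same region" is assumed to mean overlapping by at least one base (the real component may use a different criterion).

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Hit(NamedTuple):
    rrna_type: str  # e.g. "16S" or "23S"
    start: int      # 0-based start on the read
    end: int        # end on the read (exclusive)
    score: float


class Region(NamedTuple):
    rrna_type: str
    start: int
    end: int
    best: Hit       # highest scoring alignment inside the region


def collapse_hits(hits: List[Hit]) -> List[Region]:
    """Merge overlapping hits of the same rRNA type, keeping the best hit per region."""
    by_type: Dict[str, List[Hit]] = defaultdict(list)
    for hit in hits:
        by_type[hit.rrna_type].append(hit)

    regions: List[Region] = []
    for rrna_type, group in by_type.items():
        group.sort(key=lambda h: (h.start, h.end))
        current = None
        for hit in group:
            if current is not None and hit.start < current.end:
                # Overlaps the open region: extend it and keep the better hit.
                best = hit if hit.score > current.best.score else current.best
                current = Region(rrna_type, current.start, max(current.end, hit.end), best)
            else:
                if current is not None:
                    regions.append(current)
                current = Region(rrna_type, hit.start, hit.end, hit)
        if current is not None:
            regions.append(current)
    return sorted(regions, key=lambda r: (r.start, r.rrna_type))
```

Hits of different rRNA types are never merged with each other, even when they overlap on the read.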
camera_filter_blast
camera_rrna_finder




Custom DB
rRNA Finder DB
  /usr/local/annotation/CAMERA/CustomDB/camera_rRNA_finder.all_rRNA.coded.cdhit_80.fsa



 5S
    Sequences from Archaea, Bacteria and Eukaryota were
     obtained from the 5S Ribosomal RNA Database
    http://biobases.ibch.poznan.pl/5SData/
 16S
    Sequences for Archaea and Bacteria were obtained from the
     Greengenes 16S db
    http://greengenes.lbl.gov/
 18S
    Source was Doug Rusch's 18S database prepared for the GOS
     paper
 23S
    Source was Doug Rusch's 23S database prepared for the GOS
     paper.
rRNA Finder DB

FASTA headers were coded as follows:

>#S [D] ...original.header...

where # is one of (5, 16, 18, 23) and D is one of
 (A, B, E). The camera_rrna_finder
 component expects this format.
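
For illustration, the coded header format can be written and read back like this. A Python sketch only: `encode_header`, `parse_header` and `CodedHeader` are names invented for this example, and reading A/B/E as Archaea/Bacteria/Eukaryota is an assumption based on the sources listed for the database.

```python
import re
from typing import NamedTuple

# Matches e.g. ">16S [B] original header text"
CODED_HEADER = re.compile(r"^>?(5|16|18|23)S \[([ABE])\] (.*)$")

SUBUNITS = {"5S", "16S", "18S", "23S"}
DOMAINS = {"A", "B", "E"}  # assumed: Archaea, Bacteria, Eukaryota


class CodedHeader(NamedTuple):
    subunit: str
    domain: str
    original: str


def encode_header(subunit: str, domain: str, original: str) -> str:
    """Build a coded FASTA header from a subunit, a domain code and the original header."""
    if subunit not in SUBUNITS:
        raise ValueError(f"unknown subunit: {subunit!r}")
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain code: {domain!r}")
    return f">{subunit} [{domain}] {original}"


def parse_header(line: str) -> CodedHeader:
    """Split a coded FASTA header back into its three parts."""
    match = CODED_HEADER.match(line.rstrip("\n"))
    if match is None:
        raise ValueError(f"not a coded rRNA header: {line!r}")
    return CodedHeader(match.group(1) + "S", match.group(2), match.group(3))
```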
rRNA Finder DB
 CD-HIT was run on the entire database to cluster sequences with
  high similarity to reduce the database size but maintain a range
  of diverse sequences

Command line:
/usr/local/devel/ANNOTATION/bwhitty/cdhit/cd-hit/cd-hit-est -i
   input_database.fsa -o output_database.fsa -c 0.8 -n 4

 Consistency of clustering was checked with a Perl script to
  ensure no heterogeneous clustering
  (e.g.: 18S and 16S clustering together)
 Clusters were consistent
 Database size was reduced from 65,591 sequences to 1,329
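
The original check was a Perl script that is not shown here. A comparable check could look like the Python sketch below. It assumes the standard CD-HIT `.clstr` layout (`>Cluster N` lines followed by member lines of the form `0	1542nt, >name... *`) and that each member name begins with the rRNA type code from the coded header; both are assumptions, and the function names are hypothetical.

```python
import re
from typing import Dict, Iterable, List, Set

# Member lines look like: "0\t1542nt, >16S... *"
MEMBER_TYPE = re.compile(r">(5|16|18|23)S")


def cluster_types(clstr_lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each cluster name to the set of rRNA types found among its members."""
    clusters: Dict[str, Set[str]] = {}
    current = None
    for line in clstr_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = line[1:]
            clusters[current] = set()
        elif current is not None:
            match = MEMBER_TYPE.search(line)
            if match:
                clusters[current].add(match.group(1) + "S")
    return clusters


def heterogeneous_clusters(clstr_lines: Iterable[str]) -> List[str]:
    """Return the names of clusters that mix more than one rRNA type."""
    return sorted(
        name for name, types in cluster_types(clstr_lines).items() if len(types) > 1
    )
```

An empty result from `heterogeneous_clusters` corresponds to the "clusters were consistent" outcome.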
rRNA Finder
open_reading_frames
ORF Overlaps/ORF Stats
FASTA Headers
   >HOT_READ_85779353 /accession=DU765170.1 /sample_id=JGI_SMPL_HF4000_12-21-03
    /template_id=JGI_TMPL_ANIW12796 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT
    /clr_range_begin=0 /clr_range_end=1088 /length=1088
   >JCVI_ORF_1108836626524 /pep_id=JCVI_PEP_1108836626525 /read_id=HOT_READ_85760722
    /begin=0 /end=234 /orientation=1 /5_prime_stop=0 /3_prime_stop=TAG /ttable=11 /length=234
    /read_defline="/accession=DU750886.1 /sample_id=JGI_SMPL_HF130_10-06-02
    /template_id=JGI_TMPL_ASNF1709 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT
    /clr_range_begin=0 /clr_range_end=841 /length=841"
   >JCVI_PEP_1108836626525 /orf_id=JCVI_ORF_1108836626524 /read_id=HOT_READ_85760722
    /begin=0 /end=234 /orientation=1 /5_prime_stop=0 /3_prime_stop=TAG /ttable=11 /length=234
    /read_defline="/accession=DU750886.1 /sample_id=JGI_SMPL_HF130_10-06-02
    /template_id=JGI_TMPL_ASNF1709 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT
    /clr_range_begin=0 /clr_range_end=841 /length=841"
   >JCVI_NT_1108826205795 /read_id=HOT_READ_85801707 /begin=785 /end=858 /orientation=1
    /type=Asn_tRNA /ergatis_id=1108826197895 /defline="HOT_READ_85801707
    /accession=DU787412.1 /sample_id=JGI_SMPL_HF770_12-21-03
    /template_id=JGI_TMPL_APKH2110 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT
    /clr_range_begin=0 /clr_range_end=902 /length=902"
   >JCVI_NT_1108806998652 /read_id=HOT_READ_85760731 /begin=55 /end=847 /orientation=0
    /type=23S_rRNA /ergatis_id=1108826197895 /read_defline="/accession=DU750895.1
    /sample_id=JGI_SMPL_HF130_10-06-02 /template_id=JGI_TMPL_ASNF1714
    /sequencing_direction=forward /site_id=HAWAII_SITE_HOT /clr_range_begin=0 /clr_range_end=847
    /length=847"
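
These deflines are a sequence ID followed by `/key=value` pairs, where a value may be double-quoted and may itself contain a nested defline. A small Python sketch of a reader for that layout follows; `parse_defline` is a name made up for this example, and it assumes each defline sits on a single line (the wrapping above is only for display).

```python
import re
from typing import Dict, Tuple

# /key=value, where value is either "quoted text" or a run of non-space characters
ATTRIBUTE = re.compile(r'/(\w+)=(?:"([^"]*)"|(\S+))')


def parse_defline(defline: str) -> Tuple[str, Dict[str, str]]:
    """Split a defline into its sequence ID and a dict of /key=value attributes."""
    text = defline.lstrip(">").strip()
    parts = text.split(None, 1)
    if not parts:
        raise ValueError("empty defline")
    seq_id = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    attributes: Dict[str, str] = {}
    for match in ATTRIBUTE.finditer(rest):
        quoted, bare = match.group(2), match.group(3)
        attributes[match.group(1)] = quoted if quoted is not None else bare
    return seq_id, attributes
```

A quoted value such as `read_defline` is kept as one string, so its inner `/length` does not overwrite the outer `/length`; it can be parsed in a second pass if needed.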
The absence of called ORFs in this region of the read is due to the soft-masked rRNA sequence
RNAmmer didn’t identify the 23S sequence, though it is capable of finding 23S

Again, RNAmmer failed to identify rRNA sequence

These ORFs have >150 unmasked bases
BLAST-based approach does a pretty good job of finding correct boundaries

BLAST-based rRNA finding appears to outperform RNAmmer for 23S sequences, and some 16S
GOS (Incremental)
    Clustering Pipeline

http://camera.venterinstitute.org/wiki/display/V
Clustering Overview
[Diagram: All Public Proteins + GOS ORFs are grouped into Core Clusters,
 shown in two columns labelled GOS and v1.2. Other labels: Longest Sequence
 Representatives; Non-Redundant 90% Identity CD-HIT Sequence Representatives;
 Historical Artifacts (with respect to annotation)]
CAMERA Polypeptide
Annotation Pipeline
Thoughts on Specifications
 Annotation rules should not be literally codified as
  Perl code (and only Perl code)!!!
  (especially when the “decision makers” never look at the code)


 What tools do we trust?
 What cutoffs do we use?
 What evidence/data types do we consider?


 These will (and in some cases should) change over time
More Thoughts

 Specifications are easier to change than
  code, so code should be written to support
  change

 But unless they’re defined first, the
  specifications will be a moving target
(My) Design Objectives

 Must be able to add/remove annotation data
  sources as the annotation SOP changes
 Must be able to easily change the ways in
  which these annotation data types are
  applied/combined to produce final annotation
 Must be able to change/expand the types of
  final annotation data we are producing
Object-Oriented Design Approach

 OOP in Perl == *, but lesser of two evils
    (don’t ask me what the other evil is, but it must be pretty evil)



 Encapsulates possible sources of change and prevents
    them from affecting downstream components
    (like HACCP)
 Polymorphism of $parser->parse($infile) producing
  annotation objects is nice
 Re-use was not really a motive here


*Damian Conway, in his OOP Perl book, says using OOP in Perl yields a 5X performance hit
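
To make the polymorphism point concrete, here is a small Python sketch of the same pattern (the pipeline itself is Perl). Every parser exposes the same `parse` call and returns the same kind of object, so downstream code never needs to know which tool produced the data. The two input layouts and all class names below are invented for illustration; only the evidence type strings come from the rules shown later.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class AnnotationData:
    polypeptide_id: str
    type: str
    attributes: Dict[str, List[str]] = field(default_factory=dict)


class Parser(ABC):
    """Common interface: turn raw tool output into AnnotationData objects."""

    @abstractmethod
    def parse(self, lines: Iterable[str]) -> List[AnnotationData]:
        ...


class TabNameParser(Parser):
    """Hypothetical layout: polypeptide_id<TAB>common_name"""

    def parse(self, lines):
        records = []
        for line in lines:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            pep_id, name = line.split("\t", 1)
            records.append(
                AnnotationData(pep_id, "PandaBLASTP::Characterized", {"common_name": [name]})
            )
        return records


class KeyValueEcParser(Parser):
    """Hypothetical layout: polypeptide_id EC=<number>"""

    def parse(self, lines):
        records = []
        for line in lines:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            pep_id, _, rest = line.partition(" ")
            if rest.startswith("EC="):
                records.append(AnnotationData(pep_id, "PRIAM", {"EC": [rest[3:]]}))
        return records


def parse_all(jobs: Iterable[Tuple[Parser, Iterable[str]]]) -> List[AnnotationData]:
    """Run each parser on its input; the caller never branches on parser type."""
    collected: List[AnnotationData] = []
    for parser, lines in jobs:
        collected.extend(parser.parse(lines))
    return collected
```

Adding a new annotation data source then means adding one more `Parser` subclass, with no change to `parse_all` or anything after it.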
Annotation Pipeline Overview
[Flow diagram: Annotation Tool(s) -> Annotation Source Data -> Parser(s) ->
 Annotation Data Object(s) -> Annotation Rules -> Final Annotation Data]

 We can make changes to the annotation rules without necessarily having to
  re-run or re-parse the data
Design Objectives for Parsers
A parser must:
 Produce polypeptides with associated AnnotationData objects of a defined type
 Produce AnnotationData objects with attributes specified in a consistent way
        E.g.: All parsers should produce EC number attributes in the full four-field form,
         anywhere from ‘1.1.1.1’ to ‘1.-.-.-’, and never a shortened form such as ‘1.-’.
         Multiple values should be split. Any clean-up or verification should be done before
         the AnnotationData object is created; if the data is invalid, the attribute should
         not be populated, or the object should not be created.
   Produce annotation data objects that are independent of the source annotation
    data they were parsed from
        e.g.: They have already been canonized as a type of ‘trusted annotation evidence
         type’ when they are created as AnnotationData objects. These trusted types are
         defined in the annotation SOP.

   These features create a separation between how trusted evidence is defined
    (input data), and how the evidence is used to produce annotation (annotation
    rules)
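
As a concrete reading of the EC example above, a clean-up routine might look like the Python sketch below. It is not the pipeline's code: `normalize_ec` is a hypothetical name, and the choice of delimiters for multiple values (whitespace, comma, semicolon) is an assumption.

```python
import re
from typing import List

_EC_FIELD = re.compile(r"\d+|-")


def normalize_ec(raw: str) -> List[str]:
    """Split a raw EC string and return only valid, four-field EC numbers."""
    results: List[str] = []
    for token in re.split(r"[\s,;]+", raw.strip()):
        if not token:
            continue
        fields = token.split(".")
        if len(fields) > 4 or not all(_EC_FIELD.fullmatch(f) for f in fields):
            continue  # invalid: do not populate the attribute
        fields += ["-"] * (4 - len(fields))  # pad '1.-' out to '1.-.-.-'
        if fields[0] == "-":
            continue  # the top-level class must be given
        # Once a field is unspecified, every later field must be too.
        first_dash = fields.index("-") if "-" in fields else 4
        if any(f != "-" for f in fields[first_dash:]):
            continue
        ec = ".".join(fields)
        if ec not in results:
            results.append(ec)
    return results
```

An empty list means the parser should leave the EC attribute unpopulated.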
AnnotationData Objects
[Class diagram: classes AnnotationData and AnnotationData::Polypeptide;
 a Polypeptide holds one or more AnnotationData Object(s)]

Each AnnotationData object has:
 type: [some string]
 attributes: common_name, gene_symbol, EC, GO, TIGR_role, …
AnnotationRules

 AnnotationRules object implements the rules
 from the annotation SOP document

 AnnotationRules::PredictedProtein takes a
 Polypeptide object with associated
 AnnotationData objects of varying type and
 applies the annotation rules to create a final
 AnnotationData object
AnnotationRules
 Rules are encoded as an array in the following
  format:
ANNOTATION_TYPE|OPERATOR|ATTRIBUTE1 ATTRIBUTE2

 Where OPERATOR is one of:
   = for assign attribute (if unassigned)
   + for append attribute
   - for overwrite attribute

 Further operators can be defined as needed: each operator is
  applied through a hash of handler subroutines
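
The rule format and the handler table can be sketched in a few lines of Python (the real implementation is Perl and is not reproduced here). The function names and the shape of the `evidence` dictionary are assumptions made for this example; only the rule string format and the three operators come from the slide.

```python
from typing import Callable, Dict, List, Tuple

Attributes = Dict[str, List[str]]


def parse_rule(rule: str) -> Tuple[str, str, List[str]]:
    """Split 'ANNOTATION_TYPE|OPERATOR|ATTRIBUTE1 ATTRIBUTE2' into its parts."""
    annotation_type, operator, attribute_names = rule.split("|")
    return annotation_type, operator, attribute_names.split()


def assign_if_unassigned(final: Attributes, name: str, values: List[str]) -> None:
    if not final.get(name):
        final[name] = list(values)


def append(final: Attributes, name: str, values: List[str]) -> None:
    existing = final.setdefault(name, [])
    for value in values:
        if value not in existing:
            existing.append(value)


def overwrite(final: Attributes, name: str, values: List[str]) -> None:
    final[name] = list(values)


# Operator symbol -> handler; supporting a new operator means adding an entry here.
HANDLERS: Dict[str, Callable[[Attributes, str, List[str]], None]] = {
    "=": assign_if_unassigned,
    "+": append,
    "-": overwrite,
}


def apply_rules(rules: List[str], evidence: Dict[str, Attributes]) -> Attributes:
    """Apply rules in order; evidence maps annotation type -> its attributes."""
    final: Attributes = {}
    for rule in rules:
        annotation_type, operator, names = parse_rule(rule)
        source = evidence.get(annotation_type)
        if source is None:
            continue  # no evidence of this type for this polypeptide
        handler = HANDLERS[operator]
        for name in names:
            values = source.get(name)
            if values:
                handler(final, name, values)
    return final
```

Because rules are applied in list order and `=` only fills unassigned attributes, the position of a rule in the array sets its priority.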
AnnotationRules::PredictedProtein
    my @annotation_order = (
           ## equivalog level tigrfam hits
           'TIGRFAM::FullLength::Equivalog|=|common_name gene_symbol GO EC TIGR_role',
           'TIGRFAM::FullLength::Exception|=|common_name gene_symbol GO EC TIGR_role',
           'TIGRFAM::FullLength::HypotheticalEquivalog|=|common_name gene_symbol GO EC TIGR_role',

            'TIGRFAM::FRAG::Equivalog|=|GO',
            'TIGRFAM::FRAG::Exception|=|GO',
            'TIGRFAM::FRAG::HypotheticalEquivalog|=|GO',
            'TIGRFAM::FullLength::Domain|=|GO',
            'PandaBLASTP::Characterized|=|GO',

            'PRIAM|=|GO EC',

            ## equivalog level hits vs tigrfam frag
            'TIGRFAM::FRAG::Equivalog|=|common_name gene_symbol GO EC TIGR_role',
            'TIGRFAM::FRAG::Exception|=|common_name gene_symbol GO EC TIGR_role',
            'TIGRFAM::FRAG::HypotheticalEquivalog|=|common_name gene_symbol GO EC TIGR_role',

            ## characterized high confidence blast hit
            'PandaBLASTP::Characterized|=|common_name gene_symbol',

            ## pfam and non-equivalog tigrfams
            'PFAM::FullLength::Equivalog|=|common_name gene_symbol GO EC TIGR_role',
            'PFAM::FullLength::HypotheticalEquivalog|=|common_name gene_symbol GO EC TIGR_role',
            'TIGRFAM::FullLength::Subfamily|=|common_name gene_symbol GO EC TIGR_role',
            'TIGRFAM::FullLength::Superfamily|=|common_name gene_symbol GO EC TIGR_role',
            'TIGRFAM::FullLength::EquivalogDomain|=|common_name gene_symbol GO EC TIGR_role',
            'TIGRFAM::FullLength::HypotheticalEquivalogDomain|=|common_name gene_symbol GO EC TIGR_role',
            'TIGRFAM::FullLength::SubfamilyDomain|=|common_name gene_symbol GO EC TIGR_role',
            'TIGRFAM::FullLength::Domain|=|common_name gene_symbol GO EC TIGR_role',
            …
CAMERA Annotation Pipeline




       CAMERA-specific
      Ergatis components
camera_annotation_parser
camera_annotation_rules
camera_annotation_rules
CAMERA-specific Code in SVN

 http://iwebsvn.tigr.org/listing.php?repname=ANNO
Future Development
                                     (My 2 cents)



   Pipeline development must be driven by annotation SOP development
    work
      Feedback on pipeline bugs must be vigilantly kept separate from feedback
       on annotation SOP bugs
      First discuss and update the SOP, then modify the code
   Cluster summary annotation
      Shortest path here seems to be a combination of GO Slim and EC
       assignments? GO consortium makes some scripts available for
       summarizing sets of GO assignments
      If using the current code, PolypeptideSet container class exists already.
       Cluster members can be added to a PolypeptideSet and that can be used
       as input to an AnnotationRules::FinalCluster object that is similar to the one
       for PredictedProtein, but with a different set of handler routines.
   Incremental clustering pipeline
        Good luck 
