CAMERA Annotation Pipelines
      (and related infrastructure)

            Brett Whitty

 Compute Infrastructure
 GOS/CAMERA ncRNA/ORF calling pipeline
   rRNA finding pipeline
   ORF calling
 GOS (incremental) protein clustering
 CAMERA Annotation Pipeline
   Specifications
   Implementation
Compute Infrastructure
CALIT2 Compute Grid

 48 dual-core, dual-CPU, 64-bit machines
    192 SGE slots
 Redhat-based ‘Rocks Clusters’ Linux
  distribution
 ‘Rocks Rolls’
   Bio-roll (/opt/Bio)
   Used to image/install each node separately,
    including local Perl module installs (patches)

 Head node of sos cluster
    SSH into here
 Is not an SGE submit host
SOS Cluster Global Mounts
 /share/apps
    applications (and related files) are installed here,
     analysis data should not be stored here
 /home/thumper6
    a global mount point: an 18T(!!!) storage volume
     on which all analysis data/results should be stored
 /opt/Bio
    tools such as clustalw, EMBOSS, hmmer, ncbi
     blast are installed under here
SOS Local Mounts
                   (on each grid node)

 /state/partition1
    local storage device on each grid node available
     for local scratch space (438G)
 /tmp
    system tmp partition (7G)

 SSH accessible only through head
 Is an SGE submit host
 Running apache and postgres servers


 /var/www/cgi-bin/ergatis
     /home/bwhitty/temp/subversion-1.4.5/subversion/svn/svn export --force ergatis

 /var/www/html/ergatis
     /home/bwhitty/temp/subversion-1.4.5/subversion/svn/svn export --force ergatis
 CGI scripts run as the user 'apache' on pg0-0, but ‘apache’ has
  sudo permissions for user 'ergatis'
    The two CGI scripts in the install which run RunWorkflow and
     KillWorkflow (ergatis/kill_wf.cgi, ergatis/Ergatis/
     have been modified, and 'sudo -u ergatis ' has been appended
     to their normal execution strings

 has been modified to use

 Many of the settings in ergatis.ini have been changed from
  defaults, including disabling a number of the components
    When updating the Ergatis CGI directory from the SVN
     repository, a backup copy should be set aside in advance
SGE/Workflow Notes
   Two SGE queues have been configured for ergatis:
        ergatis.q (192 slots)
        ergatis-fast.q (144 slots)
   ergatis.q is a subordinate queue of ergatis-fast.q

   ergatis.q is set as default queue for user ‘ergatis’ by specifying ‘-q ergatis.q’ in

   Workflow version 3.0 is installed
        /share/apps/workflow

   Workflow requires that the SGE queue's prolog and epilog scripts be set to the following:
        prolog=/share/apps/workflow/bin/prolog $host $job_owner $job_id $job_name $queue
        epilog=/share/apps/workflow/bin/epilog $host $job_owner $job_id $job_name $queue

   The queue configuration can be checked using the command
    'qconf -sq ergatis.q'
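
A minimal sketch of checking the prolog/epilog settings from the text that 'qconf -sq ergatis.q' prints. It assumes the usual layout of one attribute per line (name, whitespace, value) with no backslash-continued lines; the sample text and the function names parse_queue_config and hooks_are_set are illustrative and were not taken from the sos cluster.

```python
def parse_queue_config(text):
    """Parse 'name value' lines, as printed by `qconf -sq`, into a dict."""
    config = {}
    for line in text.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) == 2:
            config[parts[0]] = parts[1]
    return config


def hooks_are_set(config, workflow_bin="/share/apps/workflow/bin"):
    """True if both prolog and epilog point into the Workflow bin directory."""
    return all(
        config.get(key, "").startswith(f"{workflow_bin}/{key} ")
        for key in ("prolog", "epilog")
    )


# Illustrative sample only; real output lists many more attributes.
SAMPLE = """\
qname                 ergatis.q
prolog                /share/apps/workflow/bin/prolog $host $job_owner $job_id $job_name $queue
epilog                /share/apps/workflow/bin/epilog $host $job_owner $job_id $job_name $queue
"""

config = parse_queue_config(SAMPLE)
print("prolog/epilog configured:", hooks_are_set(config))
```

In practice the text would come from running the qconf command on a submit host rather than from a hard-coded string.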
Ergatis Application Install
   The main ergatis application install directory is under /share/apps/ergatis

   The chado-v1r12b1 release is the current version installed
        direct copy of the install located at /usr/local/devel/ANNOTATION/ard/ at JCVI
        Perl wrappers were modified via sed to the correct local directory structures
        Proper install wasn't done because no working installer script was available at the time

   /share/apps/ergatis/chado-v1r12b1
    symlinked to /share/apps/ergatis/current

   Executables which some ergatis components use, but which are not installed with
    Ergatis (e.g.: JCVI internal scripts) are located under /share/apps/ergatis/bin

   External tools which are not globally installed on sos are installed under

   Ergatis global directories (global_id_repository, global_saved_templates) are
    located under /share/apps/ergatis/ergatis_global
Ergatis Data Locations
 All ergatis data should be put under /home/thumper6/ergatis

 Project repositories are located under
  or symlink /share/apps/ergatis/projects

 CAMERA project repository is

 Databases are located under /home/thumper6/ergatis/db
  or symlink /share/apps/ergatis/db

 Global scratch space is under /home/thumper6/ergatis/scratch
  or symlink /share/apps/ergatis/scratch

 Fewer machines than the sos cluster (~20 slots?)
 Initial test ergatis install was done here
  (similar directory structure to sos)
 Completely distinct from sos cluster
 Sandbox
 Shibu, Weizhong Li and others run computes
  here (e.g.: clustering pipeline)
GOS/CAMERA Pipelines Overview

 [Flow diagram: Metagenomic Reads -> ncRNA/ORF Finding Pipeline -> ORFs/peptides;
  ORFs/peptides feed the Incremental Clustering Pipeline (-> Cluster Memberships)
  and the Annotation Pipeline]
 All computes in the pipeline must be performed on
  multi-sequence input/output files, as the filesystem
  cannot physically support 12M+ individual FASTA
  input/output files
    other partitioning solutions could work(?) but most tools
     support multiple sequence inputs anyway

 Overall total space consumption was an issue when
  computes were running on TIGR grid, but this is not
  as much an issue (currently) on CALIT2 grid
    Solution here was to keep all inputs/outputs gzipped
     during pipeline execution, at the cost of some performance
     loss (using things like zcat -f | with NCBI BLAST, etc.)
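
A small sketch of the same idea in Python: read a multi-sequence FASTA file whether or not it is gzipped, in the spirit of zcat -f. It uses only the standard gzip module; the function names open_text and read_fasta are illustrative and are not part of the pipeline.

```python
import gzip


def open_text(path):
    """Open a file as text, decompressing it first if it is gzipped.

    Mirrors `zcat -f`: gzip data is decompressed, anything else passes through.
    """
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"  # gzip magic bytes
    return gzip.open(path, "rt") if is_gzip else open(path, "rt")


def read_fasta(path):
    """Yield (header, sequence) pairs from a multi-sequence FASTA file."""
    header, chunks = None, []
    with open_text(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line and header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

Because read_fasta is a generator, a file holding millions of reads can be streamed one record at a time without unpacking it on disk.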
ncRNA/ORF Finding Pipeline
ncRNA/ORF Finding Pipeline Overview

 Find tRNAs -> Extract tRNAs -> tRNAs FASTA
 Soft-Mask tRNAs
 Find rRNAs -> Extract rRNAs -> rRNAs FASTA
 Soft-Mask rRNAs
 GOS ORF calling -> ORFs FASTA, Peptides FASTA
 ORF stats, ORF overlaps -> ORFs FASTA, Peptides FASTA
ncRNA and ORF Finding Pipeline
                   Ergatis components
CAMERA rRNA Finder Overview
 BLAST vs. a database of coded pooled rRNA
  subunit sequences
 BLAST prefilter step with loose parameters
    blastall -p blastn -i reads.fsa -d rrna_db.fsa -e 0.1 -F 'T' -b 1 -v 1
     -z 3000000000 -W 9
 Reads with prefilter hits are searched using strict parameters
    blastall -p blastn -i aligned.fsa -d rrna_db.fsa -e 1e-4 -F 'm L' -b
     1500 -v 1500 -q -5 -r 4 -X 1500 -z 3000000000 -W 9 -U T
 Collapse aligned intervals of the same rRNA type
  and extract the highest scoring alignments from
  each region
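
The collapse step can be sketched roughly as below. This is an assumption-laden illustration, not the camera_rrna_finder code: hits on one read are taken to be (rrna_type, start, end, score) tuples, overlapping hits of the same type are merged into one region, and the best-scoring hit is kept per region. The slides do not give the exact overlap rule.

```python
from collections import defaultdict


def collapse_hits(hits):
    """Merge overlapping hits of the same rRNA type on one read.

    hits: iterable of (rrna_type, start, end, score) with start <= end.
    Returns one (rrna_type, region_start, region_end, best_hit) per region.
    """
    by_type = defaultdict(list)
    for hit in hits:
        by_type[hit[0]].append(hit)

    regions = []
    for rrna_type, group in sorted(by_type.items()):
        group.sort(key=lambda h: (h[1], h[2]))
        start, end, best = group[0][1], group[0][2], group[0]
        for hit in group[1:]:
            if hit[1] <= end:  # overlaps the current region: extend it
                end = max(end, hit[2])
                if hit[3] > best[3]:
                    best = hit
            else:  # gap: close the current region and start a new one
                regions.append((rrna_type, start, end, best))
                start, end, best = hit[1], hit[2], hit
        regions.append((rrna_type, start, end, best))
    return regions


# Made-up coordinates and scores, for illustration only.
hits = [
    ("23S", 55, 400, 310.0),
    ("23S", 350, 847, 620.0),
    ("16S", 10, 90, 75.0),
    ("23S", 900, 1000, 80.0),
]
for region in collapse_hits(hits):
    print(region)
```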

Custom DB
rRNA Finder DB

 5S
    Sequences from Archaea, Bacteria and Eukaryota were
     obtained from the 5S Ribosomal RNA Database
 16S
    Sequences for Archaea and Bacteria were obtained from the
     Green Genes 16S db
 18S
    Source was Doug Rusch's 18S database prepared for the GOS
 23S
    Source was Doug Rusch's 23S database prepared for the GOS
rRNA Finder DB

Fasta headers were coded as follows:

>#S [D] ...original.header...

where # is one of (5, 16, 18, 23) and D is one of
 (A, B, E). The camera_rrna_finder
 component expects this format.
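
A hedged sketch of reading the coded headers. It assumes the square brackets are literal, that exactly one space separates the fields, and that A, B and E stand for Archaea, Bacteria and Eukaryota (the three domains named for the 5S source); none of this is confirmed by the component's code, and parse_coded_header is a hypothetical name.

```python
import re

# '>#S [D] ...' where # is 5, 16, 18 or 23 and D is A, B or E
CODED_HEADER = re.compile(r"^>(5|16|18|23)S \[([ABE])\] ?(.*)$")

# Assumed meaning of the single-letter domain codes.
DOMAINS = {"A": "Archaea", "B": "Bacteria", "E": "Eukaryota"}


def parse_coded_header(line):
    """Return (subunit, domain, original_header) or raise ValueError."""
    match = CODED_HEADER.match(line.rstrip("\n"))
    if match is None:
        raise ValueError(f"not a coded rRNA header: {line!r}")
    size, domain, original = match.groups()
    return f"{size}S", DOMAINS[domain], original


print(parse_coded_header(">16S [B] example original header"))
```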
rRNA Finder DB
 CD-HIT was run on the entire database to cluster sequences with
  high similarity to reduce the database size but maintain a range
  of diverse sequences

Command line:
/usr/local/devel/ANNOTATION/bwhitty/cdhit/cd-hit/cd-hit-est -i
   input_database.fsa -o output_database.fsa -c 0.8 -n 4

 Consistency of clustering was checked with a Perl script to
  ensure no heterogeneous clustering
  (e.g.: 18S and 16S clustering together)
 Clusters were consistent
 Database size was reduced from 65,591 sequences to 1,329
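
The consistency check was done with a Perl script that is not shown; the Python sketch below illustrates one way such a check could work. It assumes the standard CD-HIT .clstr layout ('>Cluster N' lines followed by member lines containing '>name...') and that member names begin with the subunit code from the coded header. The sample lines are made up.

```python
import re


def read_clusters(lines):
    """Group member names by cluster from the lines of a CD-HIT .clstr file."""
    clusters, current = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            current = line[1:]
            clusters[current] = []
        elif line and current is not None:
            # Member lines look like: '0<TAB>1542nt, >16S... *'
            match = re.search(r">(\S+?)\.\.\.", line)
            if match:
                clusters[current].append(match.group(1))
    return clusters


def subunit_of(member):
    """Read the subunit code (5S, 16S, 18S, 23S) from the start of a name."""
    match = re.match(r"(5|16|18|23)S", member)
    return match.group(0) if match else "unknown"


def mixed_clusters(clusters):
    """Return {cluster: sorted subunit codes} for clusters mixing types."""
    mixed = {}
    for name, members in clusters.items():
        types = sorted({subunit_of(m) for m in members})
        if len(types) > 1:
            mixed[name] = types
    return mixed


# Made-up example data, not real output for the rRNA database.
SAMPLE = [
    ">Cluster 0",
    "0\t1542nt, >16S... *",
    "1\t1500nt, >16S... at +/92.10%",
    ">Cluster 1",
    "0\t1800nt, >18S... *",
    "1\t1530nt, >16S... at +/81.00%",
]
print(mixed_clusters(read_clusters(SAMPLE)))
```

An empty result corresponds to the "clusters were consistent" outcome reported above.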
rRNA Finder
ORF Overlaps/ORF Stats
FASTA Headers
   >HOT_READ_85779353 /accession=DU765170.1 /sample_id=JGI_SMPL_HF4000_12-21-03
    /template_id=JGI_TMPL_ANIW12796 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT
    /clr_range_begin=0 /clr_range_end=1088 /length=1088
   >JCVI_ORF_1108836626524 /pep_id=JCVI_PEP_1108836626525 /read_id=HOT_READ_85760722
    /begin=0 /end=234 /orientation=1 /5_prime_stop=0 /3_prime_stop=TAG /ttable=11 /length=234
    /read_defline="/accession=DU750886.1 /sample_id=JGI_SMPL_HF130_10-06-02
    /template_id=JGI_TMPL_ASNF1709 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT
    /clr_range_begin=0 /clr_range_end=841 /length=841"
   >JCVI_PEP_1108836626525 /orf_id=JCVI_ORF_1108836626524 /read_id=HOT_READ_85760722
    /begin=0 /end=234 /orientation=1 /5_prime_stop=0 /3_prime_stop=TAG /ttable=11 /length=234
    /read_defline="/accession=DU750886.1 /sample_id=JGI_SMPL_HF130_10-06-02
    /template_id=JGI_TMPL_ASNF1709 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT
    /clr_range_begin=0 /clr_range_end=841 /length=841"
   >JCVI_NT_1108826205795 /read_id=HOT_READ_85801707 /begin=785 /end=858 /orientation=1
    /type=Asn_tRNA /ergatis_id=1108826197895 /defline="HOT_READ_85801707
    /accession=DU787412.1 /sample_id=JGI_SMPL_HF770_12-21-03
    /template_id=JGI_TMPL_APKH2110 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT
    /clr_range_begin=0 /clr_range_end=902 /length=902"
   >JCVI_NT_1108806998652 /read_id=HOT_READ_85760731 /begin=55 /end=847 /orientation=0
    /type=23S_rRNA /ergatis_id=1108826197895 /read_defline="/accession=DU750895.1
    /sample_id=JGI_SMPL_HF130_10-06-02 /template_id=JGI_TMPL_ASNF1714
    /sequencing_direction=forward /site_id=HAWAII_SITE_HOT /clr_range_begin=0 /clr_range_end=847
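
A minimal sketch of splitting these deflines into fields. It assumes each defline is a single line in the actual FASTA file (they are wrapped here only for display) and that values are either bare tokens or double-quoted strings; parse_defline is a hypothetical helper, not part of the pipeline.

```python
import re

# '/key=value' where value is a bare token or a "double-quoted string"
FIELD = re.compile(r'/(\w+)=("[^"]*"|\S+)')


def parse_defline(line):
    """Split a '>ID /key=value ...' defline into (identifier, fields)."""
    identifier, _, rest = line.strip().lstrip(">").partition(" ")
    fields = {key: value.strip('"') for key, value in FIELD.findall(rest)}
    return identifier, fields


defline = (
    ">JCVI_ORF_1108836626524 /pep_id=JCVI_PEP_1108836626525 "
    "/read_id=HOT_READ_85760722 /begin=0 /end=234 /orientation=1 "
    "/5_prime_stop=0 /3_prime_stop=TAG /ttable=11 /length=234 "
    '/read_defline="/accession=DU750886.1 /sample_id=JGI_SMPL_HF130_10-06-02 '
    '/clr_range_begin=0 /clr_range_end=841 /length=841"'
)
orf_id, fields = parse_defline(defline)
# The quoted read defline can be parsed again with the same pattern.
read_fields = dict(FIELD.findall(fields["read_defline"]))
print(orf_id, fields["length"], read_fields["length"])
```

Matching the quoted alternative first keeps the nested /length=841 of the read from overwriting the ORF's own /length=234.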
 The absence of called ORFs in this region of the read is due to the soft-masked rRNA
 RNAmmer didn’t identify the 23S sequence, though it is capable of finding 23S
 Again, RNAmmer failed to identify rRNA sequence
 These ORFs have >150 unmasked
 approach does a pretty good job of finding correct
 BLAST-based rRNA finding appears to outperform RNAmmer for 23S sequences, and some 16S
GOS (Incremental) Clustering Pipeline
Clustering Overview

 [Diagram labels: All Public Proteins + GOS ORFs; Core Cluster; GOS; Cluster v1.2;
  Longest Sequence; Non-Redundant 90% Identity CD-HIT Sequence Representatives;
  Historical Artifacts (with respect to annotation)]
CAMERA Polypeptide
Annotation Pipeline
Thoughts on Specifications
 Annotation rules should not be literally codified as
  Perl code (and only Perl code)!!!
  (especially when the “decision makers” never look at the code)

 What tools do we trust?
 What cutoffs do we use?
 What evidence/data types do we consider?

 These will (in some cases should) change over time
More Thoughts

 Specifications are easier to change than
  code, so code should be written to support

 But unless they’re defined first, the
  specifications will be a moving target
(My) Design Objectives

 Must be able to add/remove annotation data
  sources as the annotation SOP changes
 Must be able to easily change the ways in
  which these annotation data types are
  applied/combined to produce final annotation
 Must be able to change/expand the types of
  final annotation data we are producing
Object-Oriented Design Approach

 OOP in Perl == *, but lesser of two evils
    (don’t ask me what the other evil is, but it must be pretty evil)

 Encapsulates possible sources of change and prevents
    them from affecting downstream components
    (like HACCP)
 Polymorphism of $parser->parse($infile) producing
  annotation objects is nice
 Re-use was not really a motive here

*Damian Conway, in his OOP Perl book, says that using OOP in Perl yields a 5X performance hit
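
The polymorphic parse call can be illustrated with a short Python sketch (the pipeline itself is Perl). All class names and both input formats here are toy inventions for the example; only the idea of a shared parse interface producing annotation objects comes from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotationData:
    polypeptide_id: str
    data_type: str
    attributes: dict = field(default_factory=dict)


class Parser:
    """Common interface: every parser turns raw lines into AnnotationData."""

    def parse(self, lines):
        raise NotImplementedError


class TabParser(Parser):
    """Toy format: 'polypeptide_id<TAB>EC number' per line."""

    def parse(self, lines):
        for line in lines:
            pep_id, ec = line.rstrip("\n").split("\t")
            yield AnnotationData(pep_id, "ToyEC", {"EC": [ec]})


class KeyValueParser(Parser):
    """Toy format: 'polypeptide_id name=some common name' per line."""

    def parse(self, lines):
        for line in lines:
            pep_id, _, rest = line.rstrip("\n").partition(" ")
            name = rest.split("=", 1)[1]
            yield AnnotationData(pep_id, "ToyName", {"common_name": [name]})


def collect(parser, lines):
    # The caller never needs to know which concrete parser it was given.
    return list(parser.parse(lines))


print(collect(TabParser(), ["PEP_1\t2.7.1.1\n"]))
print(collect(KeyValueParser(), ["PEP_1 name=hexokinase family\n"]))
```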
Annotation Pipeline Overview
 Annotation Tool(s) -> Annotation Source Data -> Annotation Data Object(s) -> Final Annotation Data

 We can make changes to the annotation rules without necessarily having to
  re-run or re-parse the data
Design Objectives for Parsers
A parser must:
 Produce polypeptides with associated AnnotationData objects of a defined type
 Produce AnnotationData objects with attributes specified in a consistent way
        E.g.: All parsers should produce EC number attributes that look like ‘’ ->
         ‘1.-.-.-’, not sometimes ‘1.-’. Multiple values should be split. Any clean-up or
         verification should be done before the AnnotationData object is created; if the data is
         invalid, the attribute should not be populated, or the object should not be created.
   Produce annotation data objects that are independent of the source annotation
    data they were parsed from
        e.g.: They have already been canonized as a type of ‘trusted annotation evidence
         type’ when they are created as AnnotationData objects. These trusted types are
         defined in the annotation SOP.

   These features create a separation between how trusted evidence is defined
    (input data), and how the evidence is used to produce annotation (annotation rules)
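
As an illustration of the clean-up a parser might do before creating an AnnotationData object, the sketch below pads EC numbers to four fields, splits multi-valued strings and drops invalid values. The separator set and the function names are assumptions, and provisional EC fields such as 'n1' are deliberately not handled.

```python
import re

EC_FIELD = re.compile(r"^(\d+|-)$")


def normalize_ec(raw):
    """Return a four-field EC number such as '1.-.-.-', or None if invalid."""
    fields = raw.strip().split(".")
    if not 1 <= len(fields) <= 4:
        return None
    if not all(EC_FIELD.match(f) for f in fields):
        return None
    fields += ["-"] * (4 - len(fields))
    if fields[0] == "-":
        return None
    seen_dash = False
    for value in fields:
        if value == "-":
            seen_dash = True
        elif seen_dash:  # a number may not follow an unspecified field
            return None
    return ".".join(fields)


def parse_ec_values(raw, separators=r"[;,\s]+"):
    """Split a multi-valued string and keep only the valid EC numbers."""
    parts = [p for p in re.split(separators, raw.strip()) if p]
    normalized = (normalize_ec(p) for p in parts)
    return [ec for ec in normalized if ec is not None]


print(parse_ec_values("1.- 2.7.1.1; bogus"))
```

If parse_ec_values returns an empty list, the parser would leave the EC attribute unpopulated, as the objective above requires.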
AnnotationData Objects

          [some string]
attributes:                       AnnotationData Object(s)

 AnnotationRules object implements the rules
 from the annotation SOP document

 AnnotationRules::PredictedProtein takes a
 Polypeptide object with associated
 AnnotationData objects of varying type and
 applies the annotation rules to create a final
 AnnotationData object
 Rules are encoded as an array of strings, each giving a data type, an
  OPERATOR and a list of attributes, separated by '|'

 Where OPERATOR is one of:
   = for assign attribute (if unassigned)
   + for append attribute
   - for overwrite attribute

 Any operators can be defined as they are applied
  with a hash of handler subroutines
    my @annotation_order = (
        ## equivalog level tigrfam hits
        'TIGRFAM::FullLength::Equivalog|=|common_name gene_symbol GO EC TIGR_role',
        'TIGRFAM::FullLength::Exception|=|common_name gene_symbol GO EC TIGR_role',
        'TIGRFAM::FullLength::HypotheticalEquivalog|=|common_name gene_symbol GO EC TIGR_role',

        'TIGRFAM::FRAG::Equivalog|=|GO',
        'TIGRFAM::FRAG::Exception|=|GO',
        'TIGRFAM::FRAG::HypotheticalEquivalog|=|GO',
        'TIGRFAM::FullLength::Domain|=|GO',
        'PandaBLASTP::Characterized|=|GO',

        'PRIAM|=|GO EC',
        ## equivalog level hits vs tigrfam frag
        'TIGRFAM::FRAG::Equivalog|=|common_name gene_symbol GO EC TIGR_role',
        'TIGRFAM::FRAG::Exception|=|common_name gene_symbol GO EC TIGR_role',
        'TIGRFAM::FRAG::HypotheticalEquivalog|=|common_name gene_symbol GO EC TIGR_role',
        ## characterized high confidence blast hit
        'PandaBLASTP::Characterized|=|common_name gene_symbol',
        ## pfam and non-equivalog tigrfams
        'PFAM::FullLength::Equivalog|=|common_name gene_symbol GO EC TIGR_role',
        'PFAM::FullLength::HypotheticalEquivalog|=|common_name gene_symbol GO EC TIGR_role',
        'TIGRFAM::FullLength::Subfamily|=|common_name gene_symbol GO EC TIGR_role',
        'TIGRFAM::FullLength::Superfamily|=|common_name gene_symbol GO EC TIGR_role',
        'TIGRFAM::FullLength::EquivalogDomain|=|common_name gene_symbol GO EC TIGR_role',
        'TIGRFAM::FullLength::HypotheticalEquivalogDomain|=|common_name gene_symbol GO EC TIGR_role',
        'TIGRFAM::FullLength::SubfamilyDomain|=|common_name gene_symbol GO EC TIGR_role',
        'TIGRFAM::FullLength::Domain|=|common_name gene_symbol GO EC TIGR_role',
        …
CAMERA Annotation Pipeline

      Ergatis components
CAMERA-specific Code in SVN

Future Development
                                     (My 2 cents)

   Pipeline development must be driven by annotation SOP development
      Feedback on pipeline bugs must be vigilantly kept separate from feedback
       on annotation SOP bugs
      First discuss and update the SOP, then modify the code
   Cluster summary annotation
      Shortest path here seems to be a combination of GO Slim and EC
       assignments? GO consortium makes some scripts available for
       summarizing sets of GO assignments
      If using the current code, PolypeptideSet container class exists already.
       Cluster members can be added to a PolypeptideSet and that can be used
       as input to an AnnotationRules::FinalCluster object that is similar to the one
       for PredictedProtein, but with a different set of handler routines.
   Incremental clustering pipeline
        Good luck 

UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...

CAMERA metagenomic annotation pipeline

  • 1. CAMERA Annotation Pipelines (and related infrastructure) Brett Whitty 12/20/2007
  • 2. Overview
      • Compute Infrastructure
      • GOS/CAMERA ncRNA/ORF calling pipeline
          • rRNA finding pipeline
          • ORF calling
      • GOS (incremental) protein clustering
      • CAMERA Annotation Pipeline
          • Specifications
          • Implementation
  • 4. CALIT2 Compute Grid
      • 48 dual-core, dual-CPU 64-bit machines
          • 192 SGE slots
      • Redhat-based ‘Rocks Clusters’ Linux distribution (see
      • ‘Rocks Rolls’
          • Bio-roll (/opt/Bio)
          • Used to image/install each node separately, including local Perl module installs (patches)
  • 5.
      • Head node of sos cluster
          • SSH into here
      • Is not an SGE submit host
  • 6. SOS Cluster Global Mounts
      • /share/apps
          • applications (and related files) are installed here, analysis data should not be stored here
      • /home/thumper6
          • a global mount point --- 18T(!!!) storage volume on which all analysis data/results should be stored
      • /opt/Bio
          • tools such as clustalw, EMBOSS, hmmer, ncbi blast are installed under here
  • 7. SOS Local Mounts (on each grid node)
      • /state/partition1
          • local storage device on each grid node available for local scratch space (438G)
      • /tmp
          • system tmp partition (7G)
  • 8.
      • SSH accessible only through head
      • Is an SGE submit host
      • Running apache and postgres servers
  • 9.
      • /var/www/cgi-bin/ergatis
          • /home/bwhitty/temp/subversion-1.4.5/subversion/svn/svn export --force ergatis
      • /var/www/html/ergatis
          • /home/bwhitty/temp/subversion-1.4.5/subversion/svn/svn export --force ergatis
  • 10.
      • CGI scripts run as the user 'apache' on pg0-0, but ‘apache’ has sudo permissions for user 'ergatis'
          • The two CGI scripts in the install which run RunWorkflow and KillWorkflow (ergatis/kill_wf.cgi, ergatis/Ergatis/ have been modified, and 'sudo -u ergatis ' has been appended to their normal execution strings
      • has been modified to use
      • Many of the settings in ergatis.ini have been changed from defaults, including disabling a number of the components
      • When updating the Ergatis CGI directory from the SVN repository, a backup copy should be set aside in advance
  • 11. SGE/Workflow Notes
      • Two SGE queues have been configured for ergatis:
          • ergatis.q (192 slots)
          • ergatis-fast.q (144 slots)
      • ergatis.q is a subordinate queue of ergatis-fast.q
      • ergatis.q is set as the default queue for user ‘ergatis’ by specifying ‘-q ergatis.q’ in /home/ergatis/.sge_request
      • Workflow version 3.0 is installed
          • /share/apps/workflow
      • Workflow requires that the SGE queue's prolog and epilog scripts be set to the following:
          • prolog=/share/apps/workflow/bin/prolog $host $job_owner $job_id $job_name $queue
          • epilog=/share/apps/workflow/bin/epilog $host $job_owner $job_id $job_name $queue
      • The queue configuration can be checked using the command 'qconf -sq ergatis.q'
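The prolog/epilog check on the slide above can be automated. The following is a minimal sketch, not part of the original deck: it assumes the output of 'qconf -sq ergatis.q' has been captured as text, that each line is an attribute name followed by its value, and that long values are wrapped with a trailing backslash. The function names and the SAMPLE text are hypothetical.

```python
# Sketch only: verify the prolog/epilog settings that Workflow requires.
# Assumes `qconf -sq <queue>` prints "attribute   value" lines and wraps
# long values with a trailing backslash (an assumption, not from the slides).

EXPECTED = {
    "prolog": "/share/apps/workflow/bin/prolog $host $job_owner $job_id $job_name $queue",
    "epilog": "/share/apps/workflow/bin/epilog $host $job_owner $job_id $job_name $queue",
}


def parse_queue_config(text):
    """Turn captured qconf output into a dict of attribute -> value."""
    config = {}
    joined = text.replace("\\\n", " ")  # undo backslash line wrapping
    for line in joined.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            # Collapse runs of whitespace left over from wrapping.
            config[parts[0]] = " ".join(parts[1].split())
    return config


def misconfigured(text):
    """Return the attributes whose value differs from what Workflow needs."""
    config = parse_queue_config(text)
    return [name for name, value in EXPECTED.items() if config.get(name) != value]


# Hypothetical captured output: prolog is correct, epilog is unset.
SAMPLE = """qname                 ergatis.q
prolog                /share/apps/workflow/bin/prolog $host $job_owner $job_id \\
                      $job_name $queue
epilog                NONE
"""

if __name__ == "__main__":
    print(misconfigured(SAMPLE))
```

In practice the text would come from running the qconf command on the submit host; the sample is embedded here only to keep the sketch self-contained.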
  • 12. Ergatis Application Install
      • The main ergatis application install directory is under /share/apps/ergatis
      • The chado-v1r12b1 release is the current version installed
          • direct copy of the install located at /usr/local/devel/ANNOTATION/ard/ at JCVI
          • Perl wrappers were modified via sed to the correct local directory structures
          • Proper install wasn't done because no working installer script was available at the time
          • /share/apps/ergatis/chado-v1r12b1 symlinked to /share/apps/ergatis/current
      • Executables which some ergatis components use, but which are not installed with Ergatis (e.g.: JCVI internal scripts), are located under /share/apps/ergatis/bin
      • External tools which are not globally installed on sos are installed under /share/apps/ergatis/external_apps
      • Ergatis global directories (global_id_repository, global_saved_templates) are located under /share/apps/ergatis/ergatis_global
  • 13. Ergatis Data Locations
      • All ergatis data should be put under /home/thumper6/ergatis
      • Project repositories are located under /home/thumper6/ergatis/projects or symlink /share/apps/ergatis/projects
          • CAMERA project repository is /home/thumper6/ergatis/projects/camera
      • Databases are located under /home/thumper6/ergatis/db or symlink /share/apps/ergatis/db
      • Global scratch space is under /home/thumper6/ergatis/scratch or symlink /share/apps/ergatis/scratch
  • 14.
      • Fewer machines than sos cluster (~20 slots?)
      • Initial test ergatis install was done here (similar directory structure to sos)
      • Completely distinct from sos cluster
      • Sandbox
      • Shibu, Weizhong Li and others run computes here (e.g.: clustering pipeline)
  • 16. GOS/CAMERA Pipelines Overview (diagram labels: Metagenomic Reads; ncRNA/ORF Finding Pipeline; ORFs/peptides; Incremental Clustering Pipeline; Annotation Pipeline; Cluster Memberships)
  • 17. Challenges
      • All computes in the pipeline must be performed on multi-sequence input/output files, as the filesystem cannot physically support 12M+ individual FASTA input/output files
          • other partitioning solutions could work(?) but most tools support multiple sequence inputs anyway
      • Overall total space consumption was an issue when computes were running on the TIGR grid, but this is not as much of an issue (currently) on the CALIT2 grid
          • Solution here was to keep all inputs/outputs gzipped during pipeline execution, at the cost of some performance loss (using things like zcat -f | with NCBI BLAST, etc.)
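The gzipped multi-sequence approach described above can be illustrated with a short reader. This is a sketch under assumptions, not the pipeline's actual code: the function name read_fasta is hypothetical, and the magic-byte check is one way to mimic the pass-through behaviour of 'zcat -f' for files that are not compressed.

```python
import gzip


def read_fasta(path):
    """Yield (header, sequence) pairs from a plain or gzipped multi-FASTA file."""
    # Gzip streams start with the two magic bytes 0x1f 0x8b; anything else
    # is read as plain text, similar in spirit to `zcat -f`.
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"
    opener = gzip.open if is_gzip else open
    with opener(path, "rt") as handle:
        header, chunks = None, []
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line.strip())
        if header is not None:
            yield header, "".join(chunks)
```

Because records are yielded one at a time, a file holding millions of sequences never has to be decompressed to disk or held in memory as a whole.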
  • 18. GOS/CAMERA ncRNA and ORF Finding Pipeline
  • 19. GOS/CAMERA ncRNA and ORF Finding Pipeline Overview (diagram labels: Reads; Find tRNAs; Extract tRNAs; tRNAs FASTA; Soft-Mask tRNAs; Find rRNAs; Extract rRNAs; rRNAs FASTA; Soft-Mask rRNAs; Metagene; GOS ORF calling; ORFs FASTA; Peptides FASTA; ORF stats; ORF overlaps)
  • 20. GOS/CAMERA ncRNA and ORF Finding Pipeline CAMERA-specific Ergatis components
  • 22. CAMERA rRNA Finder Overview
      • BLAST vs. a database of coded pooled rRNA subunit sequences
      • BLAST prefilter step with loose parameters
          blastall -p blastn -i reads.fsa -d rrna_db.fsa -e 0.1 -F 'T' -b 1 -v 1 -z 3000000000 -W 9
      • Reads with prefilter hits are searched using strict parameters
          blastall -p blastn -i aligned.fsa -d rrna_db.fsa -e 1e-4 -F 'm L' -b 1500 -v 1500 -q -5 -r 4 -X 1500 -z 3000000000 -W 9 -U T
      • Collapse aligned intervals of the same rRNA type and extract the highest scoring alignments from each region
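The final collapsing step can be sketched as an interval merge. The slides do not give the exact rule, so this is one plausible reading rather than the camera_rrna_finder implementation: hits on the same read and of the same rRNA type are merged when they overlap, and each merged region keeps the best score seen. The tuple layout and the function name collapse_hits are assumptions.

```python
from collections import defaultdict


def collapse_hits(hits):
    """Merge overlapping hits per (read, rRNA type); keep the best score.

    hits: iterable of (read_id, rrna_type, start, end, score) tuples.
    Returns a list of (read_id, rrna_type, start, end, best_score).
    """
    grouped = defaultdict(list)
    for read_id, rrna_type, start, end, score in hits:
        lo, hi = sorted((start, end))  # minus-strand hits may be reversed
        grouped[(read_id, rrna_type)].append((lo, hi, score))

    regions = []
    for (read_id, rrna_type), intervals in sorted(grouped.items()):
        intervals.sort()
        cur_lo, cur_hi, best = intervals[0]
        for lo, hi, score in intervals[1:]:
            if lo <= cur_hi:
                # Overlaps the current region: extend it.
                cur_hi = max(cur_hi, hi)
                best = max(best, score)
            else:
                regions.append((read_id, rrna_type, cur_lo, cur_hi, best))
                cur_lo, cur_hi, best = lo, hi, score
        regions.append((read_id, rrna_type, cur_lo, cur_hi, best))
    return regions
```

Grouping by rRNA type before merging means a 16S hit and a 23S hit on the same read never collapse into one region, even if their coordinates overlap.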
  • 25. rRNA Finder DB
      /usr/local/annotation/CAMERA/CustomDB/camera_rRNA_finder.all_rRNA.coded.cdhit_80.fsa
      • 5S
          • Sequences from Archaea, Bacteria and Eukaryota were obtained from the 5S Ribosomal RNA Database
      • 16S
          • Sequences for Archaea and Bacteria were obtained from the Green Genes 16S db
      • 18S
          • Source was Doug Rusch's 18S database prepared for the GOS paper
      • 23S
          • Source was Doug Rusch's 23S database prepared for the GOS paper
  • 26. rRNA Finder DB Fasta headers were coded as follows: >#S [D] ...original.header... where # is one of (5, 16, 18, 23) and D is one of (A, B, E). The camera_rrna_finder component expects this format.
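A header in the coded format above could be parsed as follows. This is a sketch, assuming the square brackets around D are literal and that A, B and E stand for Archaea, Bacteria and Eukaryota (inferred from the database sources listed earlier, not stated on the slide). The function name and the example header are hypothetical.

```python
import re

# >#S [D] ...original.header...  with # in (5, 16, 18, 23) and D in (A, B, E)
CODED_HEADER = re.compile(r"^>(5|16|18|23)S \[([ABE])\] (.*)$")

# Assumed meaning of the domain codes.
DOMAINS = {"A": "Archaea", "B": "Bacteria", "E": "Eukaryota"}


def parse_coded_header(line):
    """Split a coded rRNA database header into subunit, domain and original text."""
    match = CODED_HEADER.match(line.rstrip("\n"))
    if match is None:
        raise ValueError("not a coded rRNA header: %r" % line)
    subunit, domain, original = match.groups()
    return {
        "subunit": subunit + "S",
        "domain": DOMAINS[domain],
        "original": original,
    }


if __name__ == "__main__":
    # Hypothetical header, for illustration only.
    print(parse_coded_header(">16S [B] example_sequence_1 some description"))
```

Rejecting headers that do not match keeps malformed entries out of the database before the BLAST steps ever see them.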
  • 27. rRNA Finder DB
      • CD-HIT was run on the entire database to cluster sequences with high similarity to reduce the database size but maintain a range of diverse sequences
          Command line: /usr/local/devel/ANNOTATION/bwhitty/cdhit/cd-hit/cd-hit-est -i input_database.fsa -o output_database.fsa -c 0.8 -n 4
      • Consistency of clustering was checked with a Perl script to ensure no heterogeneous clustering (e.g.: 18S and 16S clustering together)
          • Clusters were consistent
      • Database size was reduced from 65,591 sequences to 1,329
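The consistency check was done with a Perl script that is not shown in the slides. The sketch below shows the same idea in Python, under two assumptions: that the cluster report uses the usual cd-hit .clstr layout (a '>Cluster N' line followed by one line per member containing '>name...'), and that each member name begins with the coded subunit type. The function name and sample text are hypothetical.

```python
import re
from collections import defaultdict

# Member names are assumed to start with the coded type, e.g. ">16S".
MEMBER_TYPE = re.compile(r">(5|16|18|23)S")


def mixed_clusters(clstr_text):
    """Return {cluster name: sorted rRNA types} for clusters mixing types."""
    types = defaultdict(set)
    current = None
    for line in clstr_text.splitlines():
        if line.startswith(">Cluster"):
            current = line[1:].strip()
        elif current is not None:
            match = MEMBER_TYPE.search(line)
            if match:
                types[current].add(match.group(1) + "S")
    return {name: sorted(found) for name, found in types.items() if len(found) > 1}


# Hypothetical cluster report: Cluster 1 mixes 16S and 18S sequences.
SAMPLE = """>Cluster 0
0\t1500nt, >16S... *
1\t1480nt, >16S... at +/85.00%
>Cluster 1
0\t1700nt, >18S... *
1\t1490nt, >16S... at +/81.00%
"""

if __name__ == "__main__":
    print(mixed_clusters(SAMPLE))
```

An empty result corresponds to the "Clusters were consistent" outcome reported on the slide.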
  • 31. FASTA Headers
      • >HOT_READ_85779353 /accession=DU765170.1 /sample_id=JGI_SMPL_HF4000_12-21-03 /template_id=JGI_TMPL_ANIW12796 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT /clr_range_begin=0 /clr_range_end=1088 /length=1088
      • >JCVI_ORF_1108836626524 /pep_id=JCVI_PEP_1108836626525 /read_id=HOT_READ_85760722 /begin=0 /end=234 /orientation=1 /5_prime_stop=0 /3_prime_stop=TAG /ttable=11 /length=234 /read_defline="/accession=DU750886.1 /sample_id=JGI_SMPL_HF130_10-06-02 /template_id=JGI_TMPL_ASNF1709 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT /clr_range_begin=0 /clr_range_end=841 /length=841"
      • >JCVI_PEP_1108836626525 /orf_id=JCVI_ORF_1108836626524 /read_id=HOT_READ_85760722 /begin=0 /end=234 /orientation=1 /5_prime_stop=0 /3_prime_stop=TAG /ttable=11 /length=234 /read_defline="/accession=DU750886.1 /sample_id=JGI_SMPL_HF130_10-06-02 /template_id=JGI_TMPL_ASNF1709 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT /clr_range_begin=0 /clr_range_end=841 /length=841"
      • >JCVI_NT_1108826205795 /read_id=HOT_READ_85801707 /begin=785 /end=858 /orientation=1 /type=Asn_tRNA /ergatis_id=1108826197895 /defline="HOT_READ_85801707 /accession=DU787412.1 /sample_id=JGI_SMPL_HF770_12-21-03 /template_id=JGI_TMPL_APKH2110 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT /clr_range_begin=0 /clr_range_end=902 /length=902"
      • >JCVI_NT_1108806998652 /read_id=HOT_READ_85760731 /begin=55 /end=847 /orientation=0 /type=23S_rRNA /ergatis_id=1108826197895 /read_defline="/accession=DU750895.1 /sample_id=JGI_SMPL_HF130_10-06-02 /template_id=JGI_TMPL_ASNF1714 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT /clr_range_begin=0 /clr_range_end=847 /length=847"
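The /key=value deflines shown above, including the quoted nested read defline, can be pulled apart with a single regular expression. This is a sketch based only on the example headers; the function name parse_defline is hypothetical and the real pipeline may handle cases not visible in the slides.

```python
import re

# Matches /key=value where the value is either a double-quoted string
# (which may itself contain /key=value pairs) or a run of non-space characters.
FIELD = re.compile(r'/(\w+)=(?:"([^"]*)"|(\S+))')


def parse_defline(line):
    """Return (sequence id, dict of fields) for a GOS/CAMERA style header."""
    line = line.rstrip("\n").lstrip(">")
    seq_id, _, rest = line.partition(" ")
    fields = {}
    for match in FIELD.finditer(rest):
        key, quoted, bare = match.groups()
        fields[key] = quoted if quoted is not None else bare
    return seq_id, fields


# One of the headers from the slide.
EXAMPLE = (
    '>JCVI_NT_1108806998652 /read_id=HOT_READ_85760731 /begin=55 /end=847 '
    '/orientation=0 /type=23S_rRNA /ergatis_id=1108826197895 '
    '/read_defline="/accession=DU750895.1 /sample_id=JGI_SMPL_HF130_10-06-02 '
    '/template_id=JGI_TMPL_ASNF1714 /sequencing_direction=forward '
    '/site_id=HAWAII_SITE_HOT /clr_range_begin=0 /clr_range_end=847 /length=847"'
)

if __name__ == "__main__":
    seq_id, fields = parse_defline(EXAMPLE)
    print(seq_id, fields["type"], fields["begin"], fields["end"])
```

Because a quoted value is consumed as a whole, the keys inside read_defline do not overwrite the outer fields; the nested string can be parsed again with the same FIELD pattern if the read-level attributes are needed.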
  • 32. The absence of called ORFs in this region of the read is due to the soft-masked rRNA sequence. RNAmmer didn’t identify the 23S sequence, though it is capable of finding 23S.
  • 33. Again, RNAmmer failed to identify rRNA sequence
  • 34. These ORFs have >150 unmasked bases. The BLAST-based approach does a pretty good job of finding correct boundaries.
  • 35. BLAST-based rRNA finding appears to outperform RNAmmer for 23S sequences, and some 16S
  • 36. GOS (Incremental) Clustering Pipeline
  • 37. Clustering Overview (diagram labels: All Public Proteins + GOS ORFs; Core Clusters; GOS Cluster v1.2; Non-Redundant Sequence Representatives; 90% Identity CD-HIT; Longest Sequence Representatives; Historical Artifacts (with respect to annotation))
  • 39. Thoughts on Specifications
      • Annotation rules should not be literally codified as Perl code (and only Perl code)!!! (especially when the “decision makers” never look at the code)
          • What tools do we trust?
          • What cutoffs do we use?
          • What evidence/data types do we consider?
      • These will (in some cases should) change over time
  • 40. More Thoughts
      • Specifications are easier to change than code, so code should be written to support change
      • But unless they’re defined first, the specifications will be a moving target
  • 41. (My) Design Objectives
      • Must be able to add/remove annotation data sources as the annotation SOP changes
      • Must be able to easily change the ways in which these annotation data types are applied/combined to produce final annotation
      • Must be able to change/expand the types of final annotation data we are producing
  • 42. Object-Oriented Design Approach
      • OOP in Perl == *, but lesser of two evils (don’t ask me what the other evil is, but it must be pretty evil)
      • Encapsulates possible sources of change and prevents them from affecting downstream components (like HACCP)
      • Polymorphism of $parser->parse($infile) producing annotation objects is nice
      • Re-use was not really a motive here
      *Damian Conway in his OOP Perl book says using OOP in Perl yields a 5X performance hit
  • 43. Annotation Pipeline Overview (diagram flow: Annotation Tool(s); Annotation Source Data; Parser(s); Annotation Data Object(s); Annotation Rules; Final Annotation Data)
      • We can make changes to the annotation rules without necessarily having to re-run or re-parse the data
  • 44. Design Objectives for Parsers
      A parser must:
      • Produce polypeptides with associated AnnotationData objects of a defined type
      • Produce AnnotationData objects with attributes specified in a consistent way
          • E.g.: All parsers should produce EC number attributes that look like ‘’ -> ‘1.-.-.-’, not sometimes ‘1.-’. Multiple values should be split. Any clean-up or verification should be done before the AnnotationData object is created; if the data is invalid, the attribute should not be populated, or the object should not be created.
      • Produce annotation data objects that are independent of the source annotation data they were parsed from
          • e.g.: They have already been canonized as a type of ‘trusted annotation evidence type’ when they are created as AnnotationData objects. These trusted types are defined in the annotation SOP.
      • These features create a separation between how trusted evidence is defined (input data), and how the evidence is used to produce annotation (annotation rules)
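The EC number clean-up described above can be sketched as a small normalizer. This is an illustration of the stated objective, not the pipeline's Perl code: it splits multiple values, pads partial numbers to four fields, and drops anything that does not validate so that an invalid attribute is never populated. The function name and the accepted separators (commas, semicolons, whitespace) are assumptions.

```python
import re


def normalize_ec(raw):
    """Return a list of four-field EC numbers such as '1.-.-.-'.

    Multiple values are split; invalid values are silently dropped.
    """
    normalized = []
    for value in re.split(r"[,;\s]+", raw.strip()):
        fields = value.split(".") if value else []
        if not 1 <= len(fields) <= 4:
            continue
        if not fields[0].isdigit():
            continue  # the class number must always be present
        fields += ["-"] * (4 - len(fields))  # pad '1.-' to '1.-.-.-'
        valid, seen_dash = True, False
        for field in fields[1:]:
            if field == "-":
                seen_dash = True
            elif field.isdigit() and not seen_dash:
                continue
            else:
                # Non-numeric field, or a number after a '-' placeholder.
                valid = False
                break
        if valid:
            normalized.append(".".join(fields))
    return normalized


if __name__ == "__main__":
    print(normalize_ec("1.-"))
    print(normalize_ec("2.7.1.1; 3.1.-"))
```

Doing this inside each parser, before the AnnotationData object is built, means the annotation rules never have to know which tool produced a given EC value.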
  • 45. AnnotationData Objects (diagram: an AnnotationData::Polypeptide object, labelled Polypeptide, holds AnnotationData Object(s); each AnnotationData object has type: [some string] and attributes: common_name, gene_symbol, EC, GO, TIGR_role, …)
  • 46. AnnotationRules
      • AnnotationRules object implements the rules from the annotation SOP document
      • AnnotationRules::PredictedProtein takes a Polypeptide object with associated AnnotationData objects of varying type and applies the annotation rules to create a final AnnotationData object
  • 47. AnnotationRules
      • Rules are encoded as an array in the following format:
          ANNOTATION_TYPE|OPERATOR|ATTRIBUTE1 ATTRIBUTE2
      • Where OPERATOR is one of:
          • = for assign attribute (if unassigned)
          • + for append attribute
          • - for overwrite attribute
      • Any operators can be defined, as they are applied with a hash of handler subroutines
  • 48. AnnotationRules::PredictedProtein
      my @annotation_order = (
          ## equivalog level tigrfam hits
          'TIGRFAM::FullLength::Equivalog|=|common_name gene_symbol GO EC TIGR_role',
          'TIGRFAM::FullLength::Exception|=|common_name gene_symbol GO EC TIGR_role',
          'TIGRFAM::FullLength::HypotheticalEquivalog|=|common_name gene_symbol GO EC TIGR_role',
          'TIGRFAM::FRAG::Equivalog|=|GO',
          'TIGRFAM::FRAG::Exception|=|GO',
          'TIGRFAM::FRAG::HypotheticalEquivalog|=|GO',
          'TIGRFAM::FullLength::Domain|=|GO',
          'PandaBLASTP::Characterized|=|GO',
          'PRIAM|=|GO EC',
          ## equivalog level hits vs tigrfam frag
          'TIGRFAM::FRAG::Equivalog|=|common_name gene_symbol GO EC TIGR_role',
          'TIGRFAM::FRAG::Exception|=|common_name gene_symbol GO EC TIGR_role',
          'TIGRFAM::FRAG::HypotheticalEquivalog|=|common_name gene_symbol GO EC TIGR_role',
          ## characterized high confidence blast hit
          'PandaBLASTP::Characterized|=|common_name gene_symbol',
          ## pfam and non-equivalog tigrfams
          'PFAM::FullLength::Equivalog|=|common_name gene_symbol GO EC TIGR_role',
          'PFAM::FullLength::HypotheticalEquivalog|=|common_name gene_symbol GO EC TIGR_role',
          'TIGRFAM::FullLength::Subfamily|=|common_name gene_symbol GO EC TIGR_role',
          'TIGRFAM::FullLength::Superfamily|=|common_name gene_symbol GO EC TIGR_role',
          'TIGRFAM::FullLength::EquivalogDomain|=|common_name gene_symbol GO EC TIGR_role',
          'TIGRFAM::FullLength::HypotheticalEquivalogDomain|=|common_name gene_symbol GO EC TIGR_role',
          'TIGRFAM::FullLength::SubfamilyDomain|=|common_name gene_symbol GO EC TIGR_role',
          'TIGRFAM::FullLength::Domain|=|common_name gene_symbol GO EC TIGR_role',
          …
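How an ordered rule list and a table of operator handlers fit together can be sketched in a few lines. The real implementation is Perl and is not shown in the slides, so everything here is an assumption for illustration: the function names, the representation of evidence as a dict of annotation type to attribute lists, and the example evidence values are all hypothetical. Only the rule strings and the meaning of the three operators come from the slides.

```python
def assign_if_unset(final, attribute, values):
    """'=' : assign the attribute only if it has no value yet."""
    final.setdefault(attribute, list(values))


def append_values(final, attribute, values):
    """'+' : append to whatever is already assigned."""
    final.setdefault(attribute, []).extend(values)


def overwrite(final, attribute, values):
    """'-' : replace any existing value."""
    final[attribute] = list(values)


# New operators can be supported by adding entries to this table.
HANDLERS = {"=": assign_if_unset, "+": append_values, "-": overwrite}


def parse_rule(rule):
    """Split 'ANNOTATION_TYPE|OPERATOR|ATTRIBUTE1 ATTRIBUTE2' into parts."""
    annotation_type, operator, attributes = rule.split("|")
    return annotation_type, operator, attributes.split()


def apply_rules(rules, evidence):
    """Apply rules in order; evidence maps annotation type -> {attribute: [values]}."""
    final = {}
    for rule in rules:
        annotation_type, operator, attributes = parse_rule(rule)
        data = evidence.get(annotation_type)
        if data is None:
            continue  # no evidence of this type for the polypeptide
        for attribute in attributes:
            if data.get(attribute):
                HANDLERS[operator](final, attribute, data[attribute])
    return final


if __name__ == "__main__":
    rules = [
        "TIGRFAM::FullLength::Equivalog|=|common_name gene_symbol GO EC TIGR_role",
        "PRIAM|=|GO EC",
        "PandaBLASTP::Characterized|=|common_name gene_symbol",
    ]
    # Hypothetical evidence for one polypeptide.
    evidence = {
        "PRIAM": {"EC": ["2.7.1.1"]},
        "PandaBLASTP::Characterized": {"common_name": ["hexokinase"], "EC": ["9.9.9.9"]},
    }
    print(apply_rules(rules, evidence))
```

With the '=' operator, rule order is the priority order: the first evidence type that supplies an attribute wins, and an evidence type contributes only the attributes its rule names, which is why the BLAST hit's EC value above is ignored.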
  • 49. CAMERA Annotation Pipeline CAMERA-specific Ergatis components
  • 53. CAMERA-specific Code in SVN
  • 54. Future Development (My 2 cents)
      • Pipeline development must be driven by annotation SOP development work
          • Feedback on pipeline bugs must be vigilantly kept separate from feedback on annotation SOP bugs
          • First discuss and update the SOP, then modify the code
      • Cluster summary annotation
          • Shortest path here seems to be a combination of GO Slim and EC assignments? GO consortium makes some scripts available for summarizing sets of GO assignments
          • If using the current code, PolypeptideSet container class exists already. Cluster members can be added to a PolypeptideSet and that can be used as input to an AnnotationRules::FinalCluster object that is similar to the one for PredictedProtein, but with a different set of handler routines.
      • Incremental clustering pipeline
          • Good luck