BioPerl Update 2010:
Towards a Modern BioPerl
Chris Fields (UIUC)
BOSC 7-10-10
Present Day BioPerl



✤   Addressing new bioinformatics problems

✤   Collaborations in Open Bioinformatics Foundation

✤   Google Summer of Code
Towards a Modern BioPerl



✤   Lowering the barrier for new users to become involved

✤   Using Modern Perl language features

✤   Dealing with the BioPerl monolith
BioPerl 2.0?



✤   BioPerl and Modern Perl OOP (Moose)

✤   BioPerl and Perl 6
Background

✤   Started in 1996, many contributors over the years
    ✤   Jason Stajich (UCR)               ✤   Ian Korf (Wash U)

    ✤   Hilmar Lapp (NESCent)             ✤   Chris Mungall (NCBO)

    ✤   Heikki Lehväslaiho (KAUST)        ✤   Brian Osborne (BioTeam)

    ✤   Georg Fuellen (Bielefeld)         ✤   Steve Trutane (Stanford)

    ✤   Ewan Birney (Sanger, EBI)         ✤   Sendu Bala (Sanger)

    ✤   Aaron Mackey (Univ. Virginia)     ✤   Dave Messina (Sonnhammer Lab)

    ✤   Chris Dagdigian (BioTeam)         ✤   Mark Jensen (TCGA)

    ✤   Steven Brenner (UC-Berkeley)      ✤   Rob Buels (SGN)

    ✤   Lincoln Stein (OICR, CSHL)        ✤   Many, many more!
Background


✤   Open source: ‘Released under the same license as Perl itself’ i.e.
    Artistic or GPL

✤   http://bioperl.org

✤   Core developers - make releases, drive the project, set vision

✤   Regular contributors - have direct commit access
BioPerl Distributions



✤   BioPerl Core - the main distribution (aka ‘bioperl-live’ if using dev
    version)

✤   BioPerl-Run - Perl ‘wrappers’ for common bioinformatics tools

✤   BioPerl-DB - BioSQL ORM to BioPerl classes
Biological Sequences
✤   Bio::Seq - sequence record class
         #!/usr/bin/perl -w

         use Modern::Perl;
         use Bio::Seq;

         my $seq_obj = Bio::Seq->new(-seq             =>   "aaaatgggggggggggccccgtt",
                                     -display_id      =>   "ABC12345",
                                     -desc            =>   "example 1",
                                     -alphabet        =>   "dna");

         say $seq_obj->display_id;   # ABC12345
         say $seq_obj->desc;         # example 1
         say $seq_obj->seq;          # aaaatgggggggggggccccgtt

         my $revcom = $seq_obj->revcom; # new Bio::Seq, but revcom
         say $revcom->seq;          # aacggggcccccccccccatttt
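
As a rough illustration of what a reverse complement is, here is a minimal core-Perl sketch for plain a/c/g/t strings. It is not BioPerl's implementation; the helper name simple_revcom is hypothetical, and the sketch ignores ambiguity codes and non-DNA alphabets.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of BioPerl: reverse-complement a plain
# DNA string containing only a/c/g/t (either case).
sub simple_revcom {
    my ($dna) = @_;
    my $rc = reverse $dna;          # scalar context: reverses the string
    $rc =~ tr/acgtACGT/tgcaTGCA/;   # complement each base
    return $rc;
}

print simple_revcom("aaaatg"), "\n";   # catttt
```

Reversing first and complementing second gives the same result as the other order; the point is only that both steps are needed.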
Sequence I/O
✤   Bio::SeqIO - sequence I/O stream classes (pluggable)
                 #!/usr/bin/perl -w

                 use Modern::Perl;
                 use Bio::SeqIO;

                 my ($infile, $outfile) = @ARGV;

                 my $in = Bio::SeqIO->new(-file => $infile,
                                          -format => 'genbank');
                 my $out = Bio::SeqIO->new(-file => ">$outfile",
                                          -format => 'fasta');

                 while (my $seq_obj = $in->next_seq) {
                     say $seq_obj->display_id;
                     $out->write_seq($seq_obj);
                 }
Sequence Features

✤   Bio::SeqFeature::Generic - generic SF implementation
    use Modern::Perl;
    use Bio::SeqIO;

    my $in = Bio::SeqIO->new(-file => shift,
                             -format => 'genbank');

    while (my $seq_obj = $in->next_seq) {
        for my $feat_obj ($seq_obj->get_SeqFeatures) {
            say "Primary tag: ".$feat_obj->primary_tag;
            say "Location: ".$feat_obj->location->to_FTstring;
            for my $tag ($feat_obj->get_all_tags) {
                say " tag: $tag";
                for my $value ($feat_obj->get_tag_values($tag)) {
                    say "    value: $value";
                }
            }
        }
    }

GenBank File
    source            1..2629
                      /organism="Enterococcus faecalis OG1RF"
                      /mol_type="genomic DNA"
                      /strain="OG1RF"
                      /db_xref="taxon:474186"
    gene              25..>2629
                      /gene="pyr operon"
                      /note="pyrimidine biosynthetic operon"

Output
    Primary tag: source
    Location: 1..2629
     tag: db_xref
        value: taxon:474186
     tag: mol_type
        value: genomic DNA
     tag: organism
        value: Enterococcus faecalis OG1RF
     tag: strain
        value: OG1RF
Report Parsing
     Query= gi|1786183|gb|AAC73113.1| (AE000111) aspartokinase I,
     homoserine dehydrogenase I [Escherichia coli]
              (820 letters)

     Database: ecoli.aa
                4289 sequences; 1,358,990 total letters

     Searching..................................................done

                                                                            Score       E
     Sequences producing significant alignments:                            (bits)    Value

     gb|AAC73113.1|   (AE000111)   aspartokinase I, homoserine dehydrogen...   1567   0.0
     gb|AAC76922.1|   (AE000468)   aspartokinase II and homoserine dehydr...    332   1e-91
     gb|AAC76994.1|   (AE000475)   aspartokinase III, lysine sensitive [E...    184   3e-47
     gb|AAC73282.1|   (AE000126)   uridylate kinase [Escherichia coli]           42   3e-04

     >gb|AAC73113.1| (AE000111) aspartokinase I, homoserine dehydrogenase I [Escherichia
                coli]
               Length = 820

      Score = 1567 bits (4058), Expect = 0.0
      Identities = 806/820 (98%), Positives = 806/820 (98%)

     Query: 1   MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDA 60
                MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDA
     Sbjct: 1   MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDA 60
Report Parsing
✤   Bio::SearchIO
    #!/usr/bin/perl -w

    use Modern::Perl;
    use Bio::SearchIO;

    my $in = Bio::SearchIO->new(-format => 'blast',
                                -file   => 'ecoli.bls');

    while( my $result = $in->next_result ) {
      while( my $hit = $result->next_hit ) {
        while( my $hsp = $hit->next_hsp ) {
          say "Query=".$result->query_name;
          say " Hit=".$hit->name;
          say " Length=".$hsp->length('total');
          say " Percent_id=".$hsp->percent_identity."\n";
        }
      }
    }

Output
    Query=gi|1786183|gb|AAC73113.1|
     Hit=gb|AAC73113.1|
     Length=820
     Percent_id=98.2926829268293

    Query=gi|1786183|gb|AAC73113.1|
     Hit=gb|AAC76922.1|
     Length=821
     Percent_id=29.5980511571255

    Query=gi|1786183|gb|AAC73113.1|
     Hit=gb|AAC76994.1|
     Length=471
     Percent_id=30.1486199575372

    Query=gi|1786183|gb|AAC73113.1|
     Hit=gb|AAC73282.1|
     Length=97
     Percent_id=28.8659793814433
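
The first Percent_id value is consistent with the "Identities = 806/820" line in the raw report. The core-Perl sketch below shows that arithmetic directly on a report line. It assumes the line has the "Identities = X/Y" shape shown above; the helper name percent_identity_from_line is hypothetical and this is not how Bio::SearchIO is implemented.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the identity counts out of a BLAST
# "Identities = X/Y (Z%)" line and return X/Y as a percentage.
sub percent_identity_from_line {
    my ($line) = @_;
    my ($identical, $aligned) = $line =~ m{Identities\s*=\s*(\d+)/(\d+)}
        or return;                    # no match: nothing to compute
    return 100 * $identical / $aligned;
}

my $pct = percent_identity_from_line(
    " Identities = 806/820 (98%), Positives = 806/820 (98%)");
printf "Percent_id=%.2f\n", $pct;     # Percent_id=98.29
```

Hand-rolled regexes like this break when the report layout changes, which is the problem the SearchIO parser classes exist to hide.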
Local/Remote Database Interfaces

✤   Bio::DB::GenBank

              #!/usr/bin/perl -w

              use Modern::Perl;
              use Bio::DB::GenBank;

              my $db_obj = Bio::DB::GenBank->new;    # query NCBI nuc db

              my $seq_obj = $db_obj->get_Seq_by_acc('A00002');

              say $seq_obj->display_id;   # A00002
              say $seq_obj->length();     # 194




✤   Also EntrezGene, GenPept, RefSeq, UniProt, EBI, etc.
And Lots More!

✤   Bio::Align/IO            ✤   Bio::Map/IO

✤   Bio::Assembly/IO         ✤   Bio::Restriction/IO

✤   Bio::Tree/IO             ✤   Bio::Structure/IO

✤   Local flatfile databases   ✤   Bio::Factory

✤   Bio::Graphics            ✤   Bio::Tools::Run (catch-all namespace)

✤   SeqFeature databases     ✤   Bio::Factory (create objects)

✤   Bio::Pedigree/IO         ✤   Bio::Range/Location

✤   Bio::Coordinate/IO
Current Development
Next-Gen Sequence



✤   Second-generation/next-generation sequencing

    ✤   This is Lincoln Stein

    ✤   There is a reason he is smiling...
Next-Gen Sequence

✤   Bio-SamTools - support for SAM and BAM data (via SamTools)

✤   Bio-BigFile - support for BigWig/BigBed (via Jim Kent’s UCSC tools)

    ✤   Separate CPAN distributions

✤   GBrowse (Lincoln’s talk this afternoon), BioPerl

    ✤   Via SeqFeatures (high-level API for both modules)

    ✤   Via Bio::Assembly and BioPerl-Run (using the above modules)
Data Courtesy R. Khetani, M. Hudson, G. Robinson
New Tools/Wrappers

✤   BowTie            ✤   Infernal v.1.0
✤   BWA               ✤   NCBI eUtils (SOAP, CGI-based)
✤   MAQ               ✤   TopHat/CuffLinks (upcoming)
✤   BEDTools (beta)   ✤   The Cloud - bioperl-max
✤   SAMTools
✤   HMMER3
✤   BLAST+
✤   PAML

Mark Jensen, Thomas Sharpton, Dave Messina, Kai Blin, Dan Kortschak
Collaborations

  Peter J. A. Cock, Christopher J. Fields, Naohisa Goto, Michael L. Heuer and
  Peter M. Rice
  The Sanger FASTQ file format for sequences with quality scores, and the
  Solexa/Illumina FASTQ variants (Survey and Summary)
  Nucleic Acids Research, 2010, Vol. 38, No. 6, 1767–1771
  Published online 16 December 2009, doi:10.1093/nar/gkp1137
The Google Summer of Code



✤   O|B|F was accepted this year for the first time

✤   Headed by Rob Buels (SGN), with some help from Hilmar Lapp and
    myself

✤   Six projects, covering BioPerl, BioJava, Biopython, BioRuby
The Google Summer of Code

✤   BioPerl has actually been part of the Google Summer of Code for the
    last three years (as have many other Bio*):

    ✤   NESCent - admin: H. Lapp:

        ✤   2008 - PhyloXML parsing (student: Mira Han)

        ✤   2009 - NeXML parsing (student: Chase Miller)

    ✤   O|B|F - admin: R. Buels:

        ✤   2010 - Alignment subsystem refactoring (student: Jun Yin)
GSoC - Alignment Subsystem

✤   Clean up current code

✤   Include capability of dealing with large datasets

✤   Target next-gen data, very large alignments?

    ✤   Abstract the backend (DB, memory, etc.)

    ✤   SAM/BAM may work (via Bio::DB::SAM)

    ✤   ...but what about protein sequences?
Towards a Modern BioPerl
Towards a Modern BioPerl


✤   BioPerl will be turning 15 soon

✤   What can we improve?

✤   What can we do with the current code?

✤   Maybe some of it can be used in a BioPerl 2.0?

✤   Or a BioPerl 6?
What We Can Do Now



✤   Lower the barrier

✤   Use Modern Perl

✤   Deal with the monolith
Lower the Barrier

✤   We have already started on this - May 2010

✤   Migrate source code repository to git and GitHub

✤   Original BioPerl developers are added as collaborators on GitHub...

    ✤   ...but now anyone can ‘fork’ BioPerl, make changes, submit
        ‘pull requests’, etc.

✤   Since May, we have had many forks and pull requests with code reviews
    (so a decent success)
Using Modern Perl

✤   Minimal version of Perl required for BioPerl is v5.6.1

✤   Even v5.8.1 is considered quite old

✤   Both the 5.6.x and 5.8.x releases are EOL (as of Dec. 2008)
Using Modern Perl

say

    print "I like newlines\n";

    say "I like newlines";

yada yada

    sub implement_me {
        shift->throw_not_implemented
    }

    sub implement_me { ... }     # yada yada

defined-or

    # ||= replaces any false value, even a defined one (0, '')
    $foo ||= 'default';

    if (!defined($foo)) {
        $foo = 'default'
    }

    $foo //= 'default';
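
The difference between ||= and //= only shows up when a value is defined but false. A minimal sketch, assuming Perl 5.10 or later; the %opt hash and its keys are made up for illustration:

```perl
use strict;
use warnings;
use feature 'say';             # Perl 5.10+

my %opt = (threads => 0);      # 0 is a valid, deliberate setting

my $with_or = $opt{threads};
$with_or ||= 4;                # 0 is false, so it is overwritten

my $with_dor = $opt{threads};
$with_dor //= 4;               # 0 is defined, so it is kept

my $missing = $opt{verbose};   # key absent: undef
$missing //= 1;                # undef, so the default applies

say "||= gives $with_or";      # ||= gives 4
say "//= gives $with_dor";     # //= gives 0
say "missing gives $missing";  # missing gives 1
```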
Using Modern Perl

Smart Match

    if ($key ~~ %hash) { # like exists
        # do something
    }

    if ($foo ~~ /\d+/ ) { # like =~
        # do something
    }

given/when

    given ($foo) {
        when (%lookup) { ... }
        when (/^(\d+)/) { ... }
        when (/^[A-Za-z]+/) { ... }
        default { ... }
    }
Dealing with the Monolith

✤   Release manager nightmares:

    ✤   Remote databases disappear (XEMBL)

    ✤   Others change service or URLs (SeqHound)

    ✤   Services become obsolete (Pise)

    ✤   Developers move on, disappear, modules bit-rot (not saying :)

✤   How do we solve this problem?
Dealing with the Monolith

    Distribution           Classes    Tests (Files)
    bioperl-live (Core)      874      23146 (341)
    bioperl-run              123*      2468 (80)
    bioperl-db                72        113 (16)
    bioperl-network            9        327 (9)

    * Had 285 more prior to Pise module removal!
Dealing with the Monolith


✤   Maybe we shouldn’t be friendly to the monolith

✤   Maybe we should ‘blow it up’

✤   (Of course, that means make the code modular)

✤   It was originally designed with that somewhat in mind (interfaces)
Dealing with the Monolith

✤   Separate distributions make it easier to submit fixes as needed

    ✤   However, separate distributions make developing a little trickier

✤   Can we create a distribution that resembles BioPerl as users know it?

✤   Is this something we should worry about?

    ✤   YES

    ✤   Don’t alienate end-users!
Towards BioPerl 2.0?



✤   Biome: BioPerl with Moose

✤   BioPerl6: self-explanatory
Biome

✤   BioPerl classes implemented in Moose

✤   GitHub: http://github.com/cjfields/biome

✤   Implemented: Ranges, Locations, simple PrimarySeq, Annotation,
    SeqFeatures, prototype SeqIO

✤   Interfaces converted to Moose Roles

✤   ‘Type’-checking used for data types
Role

    package Biome::Role::Range;

    use Biome::Role;
    use Biome::Types qw(SequenceStrand);

    requires 'to_string';

    # Attributes
    has strand    =>   (
        isa       =>   SequenceStrand,
        is        =>   'rw',
        default   =>   0,
        coerce    =>   1
    );

    has start     => (
        is        => 'rw',
        isa       => 'Int',
    );

    has end       => (
        is        => 'rw',
        isa       => 'Int'
    );

    sub length {
        $_[0]->end - $_[0]->start + 1;
    }

Class

    package Biome::Range;

    use Biome;

    with 'Biome::Role::Range';

    sub to_string {
        my ($self) = @_;
        return sprintf("(%s, %s) strand=%s",
                       $self->start,
                       $self->end,
                       $self->strand);
    }
BioPerl 6


✤   BioPerl6: http://github.com/cjfields/bioperl6

✤   Little has been done beyond simple implementations

✤   Code is open to anyone for experimentation

✤   Ex: Philip Mabon donated a FASTA grammar:
Grammar (FASTA)

    grammar Bio::Grammar::Fasta {
        token TOP {
            ^<fasta>+ $
        }
        token fasta {
            <description_line> <sequence>
        }
        token description_line {
            ^^\> <id> <.ws> <description> \n
        }
        token id {
            | <identifier>
            | <generic_id>
        }
        token identifier {
            \S+
        }
        token generic_id {
            \S+
        }
        token description {
            \N+
        }
        token sequence {
            <-[>]>+
        }
    }

Actions (FASTA)

    class Bio::Grammar::Actions::Fasta {
        method TOP($/){
            my @matches = gather for $/<fasta> -> $m {
                take $m.ast;
            };
            make @matches;
        }
        method fasta($/){
            my $id = $/<description_line>.ast<id>;
            my $desc = $/<description_line>.ast<description>;
            my $obj = Bio::PrimarySeq.new(
                display_id  => $id,
                description => $desc,
                seq         => $/<sequence>.ast);
            make $obj;
        }
        method description_line($/){
            make $/;
        }
        method id($/) {
            make $/;
        }
        method description($/){
            make $/;
        }
        method sequence($/){
            make (~$/).subst("\n", '', :g);
        }
    }
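
For comparison, the record structure the grammar describes (a '>' line carrying an id and a description, then sequence text up to the next '>') can be sketched in Perl 5 with a single regex. This is an illustrative core-Perl sketch, not BioPerl's Bio::SeqIO FASTA parser; the helper name parse_fasta is hypothetical, and unlike the grammar it tolerates an empty description.

```perl
use strict;
use warnings;

# Hypothetical helper: split FASTA text into records with the same
# pieces the grammar names (id, description, sequence).
sub parse_fasta {
    my ($text) = @_;
    my @records;
    # Each record: a '>' line, then everything up to the next '>'.
    while ($text =~ /^>(\S+)[ \t]*([^\n]*)\n([^>]*)/mg) {
        my ($id, $desc, $seq) = ($1, $2, $3);
        $seq =~ s/\s+//g;             # drop line breaks and other whitespace
        push @records, { id => $id, description => $desc, seq => $seq };
    }
    return @records;
}

my @recs = parse_fasta(">seq1 example one\nACGT\nTTGA\n>seq2\nGGCC\n");
print "$_->{id}\t$_->{description}\t$_->{seq}\n" for @recs;
```

The grammar/actions split keeps the format description separate from object construction; the regex version folds both into one loop, which is shorter but harder to extend.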
Grammar (FASTA)                                    Actions (FASTA)
grammar Bio::Grammar::Fasta {              class Bio::Grammar::Actions::Fasta {
     token TOP {                               method TOP($/){
        ^<fasta>+ $                                my @matches = gather for $/<fasta> -> $m {
                                                       take $m.ast;
    }                                              };
    token fasta {
        <description_line> <sequence>              make @matches;
    }                                          }
                                               method fasta($/){
    token description_line    {                    my $id =$/<description_line>.ast<id>;
        ^^> <id> <.ws> <description> n           my $desc = $/<description_line>.ast<description>;
    }                                              my $obj = Bio::PrimarySeq.new(
    token id           {                               display_id => $id,
        | <identifier>                                 description => $desc,
        | <generic_id>                                 seq         => $/<sequence>.ast);
    }                                              make $obj;
    token identifier   {                       }
        S+                                    method description_line($/){
    }                                              make $/;
    token generic_id {                         }
        S+                                    method id($/) {
    }                                              make $/;
                                               }
    token description   {                      method description($/){
        N+                                        make $/;
    }                                          }
    token sequence      {                      method sequence($/){
        <-[>]>+                                    make (~$/).subst("n", '', :g);
    }                                          }
}                                          }
Grammar (FASTA)                                    Actions (FASTA)
grammar Bio::Grammar::Fasta {              class Bio::Grammar::Actions::Fasta {
     token TOP {                               method TOP($/){
        ^<fasta>+ $                                my @matches = gather for $/<fasta> -> $m {
                                                       take $m.ast;
    }                                              };
    token fasta {
        <description_line> <sequence>              make @matches;
    }                                          }
                                               method fasta($/){
    token description_line    {                    my $id =$/<description_line>.ast<id>;
        ^^> <id> <.ws> <description> n           my $desc = $/<description_line>.ast<description>;
    }                                              my $obj = Bio::PrimarySeq.new(
    token id           {                               display_id => $id,
        | <identifier>                                 description => $desc,
        | <generic_id>                                 seq         => $/<sequence>.ast);
    }                                              make $obj;
    token identifier   {                       }
        S+                                    method description_line($/){
    }                                              make $/;
    token generic_id {                         }
        S+                                    method id($/) {
    }                                              make $/;
                                               }
    token description   {                      method description($/){
        N+                                        make $/;
    }                                          }
    token sequence      {                      method sequence($/){
        <-[>]>+                                    make (~$/).subst("n", '', :g);
    }                                          }
}                                          }
Grammar (FASTA)                                    Actions (FASTA)
grammar Bio::Grammar::Fasta {              class Bio::Grammar::Actions::Fasta {
     token TOP {                               method TOP($/){
        ^<fasta>+ $                                my @matches = gather for $/<fasta> -> $m {
                                                       take $m.ast;
    }                                              };
    token fasta {
        <description_line> <sequence>              make @matches;
    }                                          }
                                               method fasta($/){
    token description_line    {                    my $id =$/<description_line>.ast<id>;
        ^^> <id> <.ws> <description> n           my $desc = $/<description_line>.ast<description>;
    }                                              my $obj = Bio::PrimarySeq.new(
    token id           {                               display_id => $id,
        | <identifier>                                 description => $desc,
        | <generic_id>                                 seq         => $/<sequence>.ast);
    }                                              make $obj;
    token identifier   {                       }
        S+                                    method description_line($/){
    }                                              make $/;
    token generic_id {                         }
        S+                                    method id($/) {
    }                                              make $/;
                                               }
    token description   {                      method description($/){
        N+                                        make $/;
    }                                          }
    token sequence      {                      method sequence($/){
        <-[>]>+                                    make (~$/).subst("n", '', :g);
    }                                          }
}                                          }
Grammar (FASTA)                                    Actions (FASTA)
grammar Bio::Grammar::Fasta {              class Bio::Grammar::Actions::Fasta {
     token TOP {                               method TOP($/){
        ^<fasta>+ $                                my @matches = gather for $/<fasta> -> $m {
                                                       take $m.ast;
    }                                              };
    token fasta {
        <description_line> <sequence>              make @matches;
    }                                          }
                                               method fasta($/){
    token description_line    {                    my $id =$/<description_line>.ast<id>;
        ^^> <id> <.ws> <description> n           my $desc = $/<description_line>.ast<description>;
    }                                              my $obj = Bio::PrimarySeq.new(
    token id           {                               display_id => $id,
        | <identifier>                                 description => $desc,
        | <generic_id>                                 seq         => $/<sequence>.ast);
    }                                              make $obj;
    token identifier   {                       }
        S+                                    method description_line($/){
    }                                              make $/;
    token generic_id {                         }
        S+                                    method id($/) {
    }                                              make $/;
                                               }
    token description   {                      method description($/){
        N+                                        make $/;
    }                                          }
    token sequence      {                      method sequence($/){
        <-[>]>+                                    make (~$/).subst("n", '', :g);
    }                                          }
}                                          }
Acknowledgements


✤   All BioPerl developers

✤   Chris Dagdigian and Mauricio Herrera Cuadra (O|B|F gurus)

✤   Cross-Collaborative work: Peter Cock (Biopython), Pjotr Prins
    (BioLib, BioRuby), Naohisa Goto (BioRuby), Michael Heuer and
    Andreas Prlic (BioJava), Peter Rice (EMBOSS)

✤   Questions? Do we even have time?

Fields bosc2010 bio_perl

  • 1. BioPerl Update 2010: Towards a Modern BioPerl Chris Fields (UIUC) BOSC 7-10-10
  • 2. Present Day BioPerl ✤ Addressing new bioinformatics problems ✤ Collaborations in Open Bioinformatics Foundation ✤ Google Summer of Code
  • 3. Towards a Modern BioPerl ✤ Lowering the barrier for new users to become involved ✤ Using Modern Perl language features ✤ Dealing with the BioPerl monolith
  • 4. BioPerl 2.0? ✤ BioPerl and Modern Perl OOP (Moose) ✤ BioPerl and Perl 6
  • 5. Background ✤ Started in 1996, many contributors over the years ✤ Jason Stajich (UCR) ✤ Ian Korf (Wash U) ✤ Hilmar Lapp (NESCent) ✤ Chris Mungall (NCBO) ✤ Heikki Lehväslaiho (KAUST) ✤ Brian Osborne (BioTeam) ✤ Georg Fuellen (Bielefeld) ✤ Steve Trutane (Stanford) ✤ Ewan Birney (Sanger, EBI) ✤ Sendu Bala (Sanger) ✤ Aaron Mackey (Univ. Virginia) ✤ Dave Messina (Sonnhammer Lab) ✤ Chris Dagdigian (BioTeam) ✤ Mark Jensen (TCGA) ✤ Steven Brenner (UC-Berkeley) ✤ Rob Buels (SGN) ✤ Lincoln Stein (OICR, CSHL) ✤ Many, many more!
  • 6. Background ✤ Open source: ‘Released under the same license as Perl itself’ i.e. Artistic ✤ http://bioperl.org ✤ Core developers - make releases, drive the project, set vision ✤ Regular contributors - have direct commit access
  • 7. BioPerl Distributions ✤ BioPerl Core - the main distribution (aka ‘bioperl-live’ if using dev version) ✤ BioPerl-Run - Perl ‘wrappers’ for common bioinformatics tools ✤ BioPerl-DB - BioSQL ORM to BioPerl classes
  • 8. Biological Sequences ✤ Bio::Seq - sequence record class #!/bin/perl -w use Modern::Perl; use Bio::Seq; my $seq_obj = Bio::Seq->new(-seq => "aaaatgggggggggggccccgtt", -display_id => "ABC12345", -desc => "example 1", -alphabet => "dna"); say $seq_obj->display_id; # ABC12345 say $seq_obj->desc; # example 1 say $seq_obj->seq; # aaaatgggggggggggccccgtt my $revcom = $seq_obj->revcom; # new Bio::Seq, but revcom say $revcom->seq; # aacggggcccccccccccatttt
  • 9. Sequence I/O ✤ Bio::SeqIO - sequence I/O stream classes (pluggable) #!/usr/bin/perl -w use Modern::Perl; use Bio::SeqIO; my ($infile, $outfile) = @ARGV; my $in = Bio::SeqIO->new(-file => $infile, -format => 'genbank'); my $out = Bio::SeqIO->new(-file => ">$outfile", -format => 'fasta'); while (my $seq_obj = $in->next_seq) { say $seq_obj->display_id; $out->write_seq($seq_obj); }
  • 10. Sequence Features ✤ Bio::SeqFeature::Generic - generic SF implementation GenBank File use Modern::Perl; source 1..2629 use Bio::SeqIO; /organism="Enterococcus faecalis OG1RF" /mol_type="genomic DNA" my $in = Bio::SeqIO->new(-file => shift, /strain="OG1RF" -format => 'genbank'); /db_xref="taxon:474186" gene 25..>2629 while (my $seq_obj = $in->next_seq) { /gene="pyr operon" for my $feat_obj ($seq_obj->get_SeqFeatures) { /note="pyrimidine biosynthetic operon" say "Primary tag: ".$feat_obj->primary_tag; say "Location: ".$feat_obj->location->to_FTstring; Primary tag: source for my $tag ($feat_obj->get_all_tags) { Location: 1..2629 say " tag: $tag"; tag: db_xref for my $value ($feat_obj->get_tag_values($tag)) { value: taxon:474186 say " value: $value"; tag: mol_type } value: genomic DNA } tag: organism } value: Enterococcus faecalis OG1RF } tag: strain value: OG1RF
  • 11. Sequence Features ✤ Bio::SeqFeature::Generic - generic SF implementation GenBank File use Modern::Perl; source 1..2629 use Bio::SeqIO; /organism="Enterococcus faecalis OG1RF" /mol_type="genomic DNA" my $in = Bio::SeqIO->new(-file => shift, /strain="OG1RF" -format => 'genbank'); /db_xref="taxon:474186" gene 25..>2629 while (my $seq_obj = $in->next_seq) { /gene="pyr operon" for my $feat_obj ($seq_obj->get_SeqFeatures) { /note="pyrimidine biosynthetic operon" say "Primary tag: ".$feat_obj->primary_tag; say "Location: ".$feat_obj->location->to_FTstring; Primary tag: source for my $tag ($feat_obj->get_all_tags) { Location: 1..2629 say " tag: $tag"; tag: db_xref for my $value ($feat_obj->get_tag_values($tag)) { value: taxon:474186 say " value: $value"; tag: mol_type } value: genomic DNA } tag: organism } value: Enterococcus faecalis OG1RF } tag: strain value: OG1RF
  • 12. Sequence Features ✤ Bio::SeqFeature::Generic - generic SF implementation GenBank File use Modern::Perl; source 1..2629 use Bio::SeqIO; /organism="Enterococcus faecalis OG1RF" /mol_type="genomic DNA" my $in = Bio::SeqIO->new(-file => shift, /strain="OG1RF" -format => 'genbank'); /db_xref="taxon:474186" gene 25..>2629 while (my $seq_obj = $in->next_seq) { /gene="pyr operon" for my $feat_obj ($seq_obj->get_SeqFeatures) { /note="pyrimidine biosynthetic operon" say "Primary tag: ".$feat_obj->primary_tag; say "Location: ".$feat_obj->location->to_FTstring; Primary tag: source for my $tag ($feat_obj->get_all_tags) { Location: 1..2629 say " tag: $tag"; tag: db_xref for my $value ($feat_obj->get_tag_values($tag)) { value: taxon:474186 say " value: $value"; tag: mol_type } value: genomic DNA } tag: organism } value: Enterococcus faecalis OG1RF } tag: strain value: OG1RF
  • 13. Report Parsing Query= gi|1786183|gb|AAC73113.1| (AE000111) aspartokinase I, homoserine dehydrogenase I [Escherichia coli] (820 letters) Database: ecoli.aa 4289 sequences; 1,358,990 total letters Searching..................................................done Score E Sequences producing significant alignments: (bits) Value gb|AAC73113.1| (AE000111) aspartokinase I, homoserine dehydrogen... 1567 0.0 gb|AAC76922.1| (AE000468) aspartokinase II and homoserine dehydr... 332 1e-91 gb|AAC76994.1| (AE000475) aspartokinase III, lysine sensitive [E... 184 3e-47 gb|AAC73282.1| (AE000126) uridylate kinase [Escherichia coli] 42 3e-04 >gb|AAC73113.1| (AE000111) aspartokinase I, homoserine dehydrogenase I [Escherichia coli] Length = 820 Score = 1567 bits (4058), Expect = 0.0 Identities = 806/820 (98%), Positives = 806/820 (98%) Query: 1 MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDA 60 MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDA Sbjct: 1 MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDA 60
  • 14. Report Parsing Query=gi|1786183|gb|AAC73113.1| ✤ Bio::SearchIO Hit=gb|AAC73113.1| #!/usr/bin/perl -w Length=820 Percent_id=98.2926829268293 use Modern::Perl; use Bio::SearchIO; Query=gi|1786183|gb|AAC73113.1| my $in = Bio::SearchIO->new(-format => 'blast', -file => 'ecoli.bls'); Hit=gb|AAC76922.1| Length=821 while( my $result = $in->next_result ) { Percent_id=29.5980511571255 while( my $hit = $result->next_hit ) { while( my $hsp = $hit->next_hsp ) { Query=gi|1786183|gb|AAC73113.1| say "Query=".$result->query_name; Hit=gb|AAC76994.1| say " Hit=".$hit->name; Length=471 say " Length=".$hsp->length('total'); say " Percent_id=".$hsp->percent_identity."n"; Percent_id=30.1486199575372 } } Query=gi|1786183|gb|AAC73113.1| } Hit=gb|AAC73282.1| Length=97 Percent_id=28.8659793814433
  • 15. Local/Remote Database Interfaces ✤ Bio::DB::GenBank #!/bin/perl -w use Modern::Perl; use Bio::DB::GenBank; my $db_obj = Bio::DB::GenBank->new; # query NCBI nuc db my $seq_obj = $db_obj->get_Seq_by_acc('A00002'); say $seq_obj->display_id; # A00002 say $seq_obj->length(); # 194 ✤ Also EntrezGene, GenPept, RefSeq, UniProt, EBI, etc.
  • 16. And Lots More! ✤ Bio::Align/IO ✤ Bio::Map/IO ✤ Bio::Assembly/IO ✤ Bio::Restriction/IO ✤ Bio::Tree/IO ✤ Bio::Structure/IO ✤ Local flatfile databases ✤ Bio::Factory ✤ Bio::Graphics ✤ Bio::Tools::Run (catch-all namespace) ✤ SeqFeature databases ✤ Bio::Factory (create objects) ✤ Bio::Pedigree/IO ✤ Bio::Range/Location ✤ Bio::Coordinate/IO
  • 18. Next-Gen Sequence ✤ Second-generation/next-generation sequencing ✤ This is Lincoln Stein ✤ There is a reason he is smiling...
  • 19. Next-Gen Sequence ✤ Bio-SamTools - support for SAM and BAM data (via SamTools) ✤ Bio-BigFile - support for BigWig/BigBed (via Jim Kent’s UCSC tools) ✤ Separate CPAN distributions ✤ GBrowse (Lincoln’s talk this afternoon), BioPerl ✤ Via SeqFeatures (high-level API for both modules) ✤ Via Bio::Assembly and BioPerl-Run (using the above modules)
  • 20. Data Courtesy R. Khetani, M. Hudson, G. Robinson
  • 21. New Tools/Wrappers ✤ BowTie ✤ Infernal v.1.0 ✤ BWA ✤ NCBI eUtils (SOAP, CGI-based) ✤ MAQ ✤ TopHat/CuffLinks (upcoming) ✤ BEDTools (beta) ✤ The Cloud - bioperl-max ✤ SAMTools Mark Jensen, ✤ HMMER3 Thomas Sharpton, Dave Messina, ✤ BLAST+ Kai Blin, ✤ PAML Dan Kortschak
  • 22. Collaborations Published online 16 December 2009 Nucleic Acids Research, 2010, Vol. 38, No. 6 1767–1771 doi:10.1093/nar/gkp1137 SURVEY AND SUMMARY The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants Peter J. A. Cock1,*, Christopher J. Fields2, Naohisa Goto3, Michael L. Heuer4 and Peter M. Rice5 1 Plant Pathology, SCRI, Invergowrie, Dundee DD2 5DA, UK, 2Institute for Genomic Biology, 1206 W. Gregory Drive, M/C 195, University of Illinois at Urbana-Champaign, IL 61801, USA, 3Genome Information Research Center, Research Institute for Microbial Diseases, Osaka University, 3-1 Yamadaoka, Suita, Osaka 565-0871, Japan, 4Harbinger Partners, Inc., 855 Village Center Drive, Suite 356, St. Paul, MN 55127, USA and 5EMBL Outstation - Hinxton, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK Received October 13, 2009; Revised November 13, 2009; Accepted November 17, 2009 ABSTRACT of an explicit standard some parsers will fail to cope with very long ‘>’ title lines or very long sequences without FASTQ has emerged as a common file format for line wrapping. There is also no standardization for
  • 23. The Google Summer of Code ✤ O|B|F was accepted this year for the first time ✤ Headed by Rob Buels (SGN), with some help from Hilmar Lapp and myself ✤ Six projects, covering BioPerl, BioJava, Biopython, BioRuby
  • 24. The Google Summer of Code ✤ BioPerl has actually been part of the Google Summer of Code for the last three years (as have many other Bio*): ✤ NESCent - admin: H. Lapp: ✤ 2008 - PhyloXML parsing (student: Mira Han) ✤ 2009 - NeXML parsing (student: Chase Miller) ✤ O|B|F - admin: R. Buels: ✤ 2010 - Alignment subsystem refactoring (student: Jun Yin)
  • 25. GSoC - Alignment Subsystem ✤ Clean up current code ✤ Include capability of dealing with large datasets ✤ Target next-gen data, very large alignments? ✤ Abstract the backend (DB, memory, etc.) ✤ SAM/BAM may work (via Bio::DB::SAM) ✤ ...but what about protein sequences?
  • 26. Towards a Modern BioPerl
  • 27. Towards a Modern BioPerl ✤ BioPerl will be turning 15 soon ✤ What can we improve? ✤ What can we do with the current code? ✤ Maybe some that we can use in a BioPerl 2.0? ✤ Or a BioPerl 6?
  • 28. What We Can Do Now ✤ Lower the barrier ✤ Use Modern Perl ✤ Deal with the monolith
  • 29. Lower the Barrier ✤ We have already started on this (May 2010) ✤ Migrate source code repository to git and GitHub ✤ Original BioPerl developers are added as collaborators on GitHub... ✤ ...but now anyone can ‘fork’ BioPerl, make changes, submit ‘pull requests’, etc. ✤ Since May, we have had many forks and pull requests with code reviews (so a decent success)
  • 30. Using Modern Perl ✤ Minimal version of Perl required for BioPerl is v5.6.1 ✤ Even v5.8.1 is considered quite old ✤ Both the 5.6.x and 5.8.x releases are EOL (as of Dec. 2008)
  • 32. Using Modern Perl
      say:
          print "I like newlines\n";
          say "I like newlines";
      defined-or:
          # ||= assigns whenever $foo is false, even when it is defined (0, '')
          $foo ||= 'default';
          if (!defined($foo)) { $foo = 'default' }
          $foo //= 'default';
      yada yada:
          sub implement_me { shift->throw_not_implemented }
          sub implement_me { ... } # yada yada
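The difference between `||=` and `//=` on this slide is easiest to see with a value that is defined but false. The following is a minimal sketch using only core Perl 5.10+ features; the two helper subs are hypothetical names made up for this illustration and are not part of BioPerl.

```perl
use strict;
use warnings;
use feature 'say';    # 'say' needs Perl 5.10 or later

# Hypothetical helper: default using logical-or assignment.
sub default_with_or {
    my ($value) = @_;
    $value ||= 'default';    # replaces ANY false value: undef, 0, '', '0'
    return $value;
}

# Hypothetical helper: default using defined-or assignment.
sub default_with_defined_or {
    my ($value) = @_;
    $value //= 'default';    # replaces undef only
    return $value;
}

say default_with_or(0);                # prints "default" (the 0 is lost)
say default_with_defined_or(0);        # prints "0" (the 0 is kept)
say default_with_defined_or(undef);    # prints "default"
```

For sequence coordinates or strand values, where 0 can be a legitimate value, the defined-or form is usually the one that is wanted.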
  • 33. Using Modern Perl
      Smart Match:
          if ($key ~~ %hash) { # like exists
              # do something
          }
          if ($foo ~~ /\d+/) { # like =~
              # do something
          }
      given/when:
          given ($foo) {
              when (%lookup) { ... }
              when (/^(\d+)/) { ... }
              when (/^[A-Za-z]+/) { ... }
              default { ... }
          }
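For readers comparing the smart-match forms with the operators they resemble, here is a rough sketch of the same dispatch written with `exists`, `=~` and early returns. The `%lookup` contents and the `classify` sub are assumptions made up for this illustration; only the three patterns come from the slide.

```perl
use strict;
use warnings;

# Hypothetical lookup table standing in for %lookup on the slide.
my %lookup = (alpha => 1, beta => 2);

# Hypothetical classifier mirroring the given/when chain with explicit tests.
sub classify {
    my ($foo) = @_;
    return 'known key' if exists $lookup{$foo};     # when (%lookup)
    return "number $1" if $foo =~ /^(\d+)/;         # when (/^(\d+)/)
    return 'letters'   if $foo =~ /^[A-Za-z]+/;     # when (/^[A-Za-z]+/)
    return 'other';                                 # default
}

print classify($_), "\n" for qw(alpha 42abc gamma !);
```

As with `given`/`when`, the first matching test wins, so the order of the tests matters.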
  • 34. Dealing with the Monolith ✤ Release manager nightmares: ✤ Remote databases disappear (XEMBL) ✤ Others change service or URLs (SeqHound) ✤ Services become obsolete (Pise) ✤ Developers move on, disappear, modules bit-rot (not saying :) ✤ How do we solve this problem?
  • 35. Dealing with the Monolith
      Distribution           Classes   Tests (Files)
      bioperl-live (Core)    874       23146 (341)
      bioperl-run            123*      2468 (80)
      bioperl-db             72        113 (16)
      bioperl-network        9         327 (9)
      * Had 285 more prior to Pise module removal!
  • 36. Dealing with the Monolith ✤ Maybe we shouldn’t be friendly to the monolith ✤ Maybe we should ‘blow it up’ ✤ (Of course, that means make the code modular) ✤ It was originally designed with that somewhat in mind (interfaces)
  • 37. Dealing with the Monolith ✤ Separate distributions make it easier to submit fixes as needed ✤ However, separate distributions make developing a little trickier ✤ Can we create a distribution that resembles BioPerl as users know it? ✤ Is this something we should worry about? ✤ YES ✤ Don’t alienate end-users!
  • 38. Towards BioPerl 2.0? ✤ Biome: BioPerl with Moose ✤ BioPerl6: self-explanatory
  • 39. Biome ✤ BioPerl classes implemented in Moose ✤ GitHub: http://github.com/cjfields/biome ✤ Implemented: Ranges, Locations, simple PrimarySeq, Annotation, SeqFeatures, prototype SeqIO ✤ Interfaces converted to Moose Roles ✤ ‘Type’-checking used for data types
  • 40. Biome: Role, Attributes and Class
      Role (with Attributes):
          package Biome::Role::Range;
          use Biome::Role;
          use Biome::Types qw(SequenceStrand);

          requires 'to_string';

          has strand => (
              isa     => SequenceStrand,
              is      => 'rw',
              default => 0,
              coerce  => 1
          );
          has start => (
              is  => 'rw',
              isa => 'Int',
          );
          has end => (
              is  => 'rw',
              isa => 'Int'
          );

          sub length {
              $_[0]->end - $_[0]->start + 1;
          }
      Class:
          package Biome::Range;
          use Biome;
          with 'Biome::Role::Range';

          sub to_string {
              my ($self) = @_;
              return sprintf("(%s, %s) strand=%s",
                             $self->start,
                             $self->end,
                             $self->strand);
          }
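To try the role/class split without installing Biome, the same shape can be sketched in plain Moose (which must be installed from CPAN). This is an approximation under stated assumptions, not Biome's implementation: the `Sketch::*` package names are hypothetical, and the strand attribute uses a plain `Int` in place of Biome's coercing `SequenceStrand` type.

```perl
use strict;
use warnings;

# Plain Moose stand-ins for the Biome packages; the Sketch::* names are
# hypothetical and exist only for this illustration.
package Sketch::Role::Range {
    use Moose::Role;

    requires 'to_string';    # consuming classes must supply this method

    has start  => (is => 'rw', isa => 'Int');
    has end    => (is => 'rw', isa => 'Int');
    has strand => (is => 'rw', isa => 'Int', default => 0);

    # Inclusive coordinates: end - start + 1
    sub length { my ($self) = @_; return $self->end - $self->start + 1 }
}

package Sketch::Range {
    use Moose;
    with 'Sketch::Role::Range';    # composes attributes and length()

    sub to_string {
        my ($self) = @_;
        return sprintf '(%s, %s) strand=%s',
            $self->start, $self->end, $self->strand;
    }
}

package main;

my $range = Sketch::Range->new(start => 10, end => 20, strand => 1);
print $range->to_string, "\n";    # prints "(10, 20) strand=1"
print $range->length, "\n";       # prints "11"
```

Because the role declares `requires 'to_string'`, a class that consumes it without defining `to_string` fails when the role is composed, which is how an interface converted to a role stays enforceable.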
  • 41. BioPerl 6 ✤ BioPerl6: http://github.com/cjfields/bioperl6 ✤ Little has been done beyond simple implementations ✤ Code is open to anyone for experimentation ✤ Ex: Philip Mabon donated a FASTA grammar:
  • 42. Grammar (FASTA)
      grammar Bio::Grammar::Fasta {
          token TOP {
              ^<fasta>+ $
          }
          token fasta {
              <description_line> <sequence>
          }
          token description_line {
              ^^\> <id> <.ws> <description> \n
          }
          token id {
              | <identifier>
              | <generic_id>
          }
          token identifier {
              \S+
          }
          token generic_id {
              \S+
          }
          token description {
              \N+
          }
          token sequence {
              <-[>]>+
          }
      }
  • 43. Grammar (FASTA) and Actions (FASTA)
      Grammar (FASTA):
          grammar Bio::Grammar::Fasta {
              token TOP { ^<fasta>+ $ }
              token fasta { <description_line> <sequence> }
              token description_line { ^^\> <id> <.ws> <description> \n }
              token id { | <identifier> | <generic_id> }
              token identifier { \S+ }
              token generic_id { \S+ }
              token description { \N+ }
              token sequence { <-[>]>+ }
          }
      Actions (FASTA):
          class Bio::Grammar::Actions::Fasta {
              method TOP($/) {
                  my @matches = gather for $/<fasta> -> $m {
                      take $m.ast;
                  };
                  make @matches;
              }
              method fasta($/) {
                  my $id   = $/<description_line>.ast<id>;
                  my $desc = $/<description_line>.ast<description>;
                  my $obj  = Bio::PrimarySeq.new(
                      display_id  => $id,
                      description => $desc,
                      seq         => $/<sequence>.ast);
                  make $obj;
              }
              method description_line($/) {
                  make $/;
              }
              method id($/) {
                  make $/;
              }
              method description($/) {
                  make $/;
              }
              method sequence($/) {
                  make (~$/).subst("\n", '', :g);
              }
          }
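For comparison, roughly the same record structure can be pulled apart in Perl 5 with `split` and one regular expression. This is a loose sketch, not BioPerl's FASTA parser: `parse_fasta` is a hypothetical helper, it returns plain hash references instead of Bio::PrimarySeq objects, and unlike the grammar it accepts an empty description.

```perl
use strict;
use warnings;

# Hypothetical helper (not a BioPerl API): parse FASTA text into a list of
# hash references with the keys display_id, description and seq.
sub parse_fasta {
    my ($text) = @_;
    my @records;

    # Each record starts with '>' at the beginning of a line.
    for my $chunk (split /^>/m, $text) {
        next unless length $chunk;    # skip the empty leading field

        my ($header, @seq_lines) = split /\n/, $chunk;
        next unless defined $header;

        # description_line: an id, whitespace, then the description
        my ($id, $desc) = $header =~ /^(\S+)\s*(.*)$/;
        next unless defined $id;

        # sequence: join the lines and drop whitespace, much like the
        # subst("\n", '', :g) call in the actions class
        my $seq = join '', @seq_lines;
        $seq =~ s/\s+//g;

        push @records,
            { display_id => $id, description => $desc, seq => $seq };
    }
    return @records;
}

my @records = parse_fasta(
    ">seq1 first record\nACGT\nACGT\n>seq2 second record\nGGCC\n");
printf "%s\t%s\t%s\n", @{$_}{qw(display_id description seq)} for @records;
```

The grammar keeps the parsing rules (tokens) separate from object construction (actions); in this sketch both are mixed into one loop, which is part of what the Perl 6 version is meant to improve on.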
  • 50. Acknowledgements ✤ All BioPerl developers ✤ Chris Dagdigian and Mauricio Herrera Cuadra (O|B|F gurus) ✤ Cross-Collaborative work: Peter Cock (Biopython), Pjotr Prins (BioLib, BioRuby), Naohisa Goto (BioRuby), Michael Heuer and Andreas Prlic (BioJava), Peter Rice (EMBOSS) ✤ Questions? Do we even have time?