BioRuby
Project Update

Raoul J.P. Bonnal
r@bioruby.org
Life Science Informatics
Integrative Biology Program
Fondazione INGM
Italy

co-authors:
  Toshiaki Katayama
  Pjotr Prins
  Mitsuteru Nakao
  Christian M Zmasek
  Naohisa Goto

11th Annual Bioinformatics Open Source Conference (BOSC) 2010
Boston, Massachusetts, USA

Introduction


BioRuby: a bioinformatics library for the Ruby language
• Ruby is an object-oriented scripting language, functional and reflective
• made popular by "Ruby on Rails"
• created by Matz in 1993 in Japan
BioRuby & Platforms

[Diagram: BioRuby, installed with "gem install bio", sits on top of the Ruby interpreter, which runs on the supported operating systems.
 Performance: Ruby, RubyEE
 Portability: JRuby, Java libraries
 Later builds of this slide add BioLib and Cytoscape to the diagram.]
History

[Timeline, 2008 to 2010]
  Themes:    Web Services (2008), Workflows (2009), Semantic Web (2010)
  Releases:  1.3.0, 1.3.1, 1.4.0
  Source:    repository moved to git
  GSoC 2009: phyloXML
  GSoC 2010: Ruby 1.9.2; NeXML I/O, RDF triples; infer gene duplications
  Events:    Codefest, BOSC


GitHub:
    http://github.com/bioruby/bioruby

GSoC references:
    Ruby 1.9.2 support of BioRuby (OBF)
    Develop an API for NeXML I/O, and, RDF triples for BioRuby (NESCent)
    Implementation of algorithm to infer gene duplications in BioRuby (OBF)
    Implementing phyloXML support in BioRuby (NESCent)
BioRuby Features

Category               Modules
Object                 Sequence, pathway, tree, bibliography reference
Sequence manipulation  Translation, alignment, location, mapping, feature table,
                       molecular weight, siRNA design, restriction enzyme
Format                 GenBank, EMBL, UniProt, KEGG, PDB, MEDLINE, REBASE, FASTQ, GFF,
                       MSF, ABIF, SCF, GCG, Lasergene, GEO SOFT, Gene Ontology
Tool                   BLAST, FASTA, EMBOSS, HMMER, InterProScan, GenScan, BLAT, Sim4,
                       Spidey, MEME, ClustalW, MUSCLE, MAFFT, T-Coffee, ProbCons
Phylogeny              PHYLIP, PAML, phyloXML, NEXUS, Newick
Web service            NCBI, EBI, DDBJ, KEGG, TogoWS, PSORT, TargetP, PTS1, SOSUI, TMHMM
OBDA                   BioSQL, BioFetch, indexed flat files
Shell                  Interactive environment for rapid bioinformatics analyses
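
To make the "Sequence manipulation" row concrete, here is a minimal plain-Ruby sketch of two of the listed operations, reverse complement and translation. It does not use BioRuby's classes and is not how BioRuby implements them. The codon table is deliberately partial and the method names are made up for this illustration.

```ruby
# Plain-Ruby illustration only; BioRuby's own sequence classes are not used.

# Partial codon table: just the codons needed for the demo sequence.
CODON_TABLE = {
  'atg' => 'M', 'gct' => 'A', 'aaa' => 'K', 'taa' => '*'
}.freeze

# Reverse complement of a lowercase DNA string.
def reverse_complement(dna)
  dna.reverse.tr('acgt', 'tgca')
end

# Translate codon by codon; codons missing from the table become 'X'.
def translate(dna)
  dna.scan(/.{3}/).map { |codon| CODON_TABLE.fetch(codon, 'X') }.join
end

seq = 'atggctaaataa'
puts reverse_complement(seq) # prints ttatttagccat
puts translate(seq)          # prints MAK*
```

A library such as BioRuby adds what this sketch leaves out: full codon tables, ambiguity codes and reading frames.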
Relevant New Features (1)


Bio::SQL: interoperable storage of sequences (Raoul Bonnal)
  require 'bio'
  # also requires active_record (ORM) and
  # your database adapter (MySQL, PostgreSQL, JDBC)
  connection =
    Bio::SQL.establish_connection({'development' => {'hostname' => your_host_name,
                                                     'database' => 'CoolBioSeqDB',
                                                     'adapter'  => 'jdbcmysql',
                                                     'username' => 'Raoul',
                                                     'password' => 'SmartPassword'}},
                                  'development')
  # read a GenBank file and store its entries:
  my_storage = Bio::SQL::Biodatabase.find(:first)
  genbank = Bio::GenBank.open('dbvrl1.gb')
  genbank.each_entry do |gb|
    Bio::SQL::Sequence.new(:biosequence => gb.to_biosequence,
                           :biodatabase => my_storage)
  end

  # fetching an accession is easy
  Bio::SQL.fetch_accession(your_accession).to_biosequence.output(:embl)
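
The connection call above takes a hash of settings keyed by environment name, followed by the environment to use. As a sketch, and assuming you prefer to keep those settings out of the script, the same two arguments could be built from YAML text with Ruby's standard library. The helper name `connection_arguments` and the `localhost` value are hypothetical; only the shape of the hash comes from the slide.

```ruby
require 'yaml'

# Hypothetical configuration text; in practice it would be read from a file.
CONFIG_YAML = <<~YAML
  development:
    hostname: localhost
    database: CoolBioSeqDB
    adapter: jdbcmysql
    username: Raoul
    password: SmartPassword
YAML

# Returns [configurations, environment], the two arguments that the slide
# passes to Bio::SQL.establish_connection.
def connection_arguments(yaml_text, environment)
  configurations = YAML.safe_load(yaml_text)
  unless configurations.key?(environment)
    raise ArgumentError, "no settings for environment #{environment.inspect}"
  end
  [configurations, environment]
end

configurations, environment = connection_arguments(CONFIG_YAML, 'development')
puts configurations[environment]['adapter'] # prints jdbcmysql
```

Failing early on a missing environment gives a clearer error than a failed database connection later on.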
Relevant New Features (2)


Bio::PhyloXML: read/write (Diana Jaunzeikare, Christian M Zmasek)
  require 'bio' # needs libxml-ruby

  # Create a parser
  phyloxml = Bio::PhyloXML::Parser.new('example.xml')

  # Consume the trees
  phyloxml.each do |tree|
    puts tree.name
  end

  # Writing (tree2 is a tree object you already have)
  writer = Bio::PhyloXML::Writer.new('my_tree.xml')
  writer.write(tree2)

  # Extract information
  phyloxml = Bio::PhyloXML::Parser.new('ncbi_taxonomy_mollusca.xml')
  phyloxml.each do |tree|
    tree.each_node do |node|
      print 'Scientific name: ', node.taxonomies[0].scientific_name, "\n"
    end
  end

Han, M. V. and Zmasek, C. M. (2009). phyloXML: XML for evolutionary biology and
comparative genomics. BMC Bioinformatics, 10, 356.
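
For readers who have not seen phyloXML, the sketch below shows the kind of document the parser consumes and pulls the scientific names out of it using REXML, which ships with standard Ruby installations. This is not Bio::PhyloXML and it ignores most of the format. The XML is a hand-written, hypothetical example, and `scientific_names` is a name chosen for this illustration.

```ruby
require 'rexml/document'

# Hand-written minimal phyloXML document (hypothetical data).
PHYLOXML = <<~XML
  <phyloxml xmlns="http://www.phyloxml.org">
    <phylogeny rooted="true">
      <name>example tree</name>
      <clade>
        <clade><taxonomy><scientific_name>Octopus vulgaris</scientific_name></taxonomy></clade>
        <clade><taxonomy><scientific_name>Loligo pealei</scientific_name></taxonomy></clade>
      </clade>
    </phylogeny>
  </phyloxml>
XML

# Walk every element in document order and collect scientific names.
def scientific_names(xml_text)
  names = []
  doc = REXML::Document.new(xml_text)
  doc.root.each_recursive do |element|
    names << element.text.strip if element.name == 'scientific_name'
  end
  names
end

scientific_names(PHYLOXML).each { |name| puts "Scientific name: #{name}" }
```

A dedicated parser is preferable for real files, since it also maps clades, branch lengths and annotations to tree objects.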
Relevant New Features (3)


Bio::FASTQ: read/write Next Generation Sequencing FASTQ (Naohisa Goto)
  require 'bio'
  # Merge a FASTA file and its matching QUAL file into FASTQ
  ff_fasta = Bio::FlatFile.open('filename.fasta')
  ff_qual = Bio::FlatFile.open('filename.qual')

  while entry_fasta = ff_fasta.next_entry
    seq = entry_fasta.to_biosequence
    seq.quality_score_type = :phred
    seq.quality_scores = ff_qual.next_entry.data
    puts seq.output(:fastq,
                    :title => entry_fasta.definition)
  end

   ●   Formats supported: SOLEXA, ILLUMINA

Cock, P. J., Fields, C. J., Goto, N., Heuer, M. L., and Rice, P. M. (2010). The Sanger
FASTQ file format for sequences with quality scores, and the Solexa/Illumina
FASTQ variants. Nucleic Acids Res, 38(6), 1767-1771.
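
The FASTQ variants differ mainly in how quality scores are written as characters. The sketch below is plain Ruby, not BioRuby code, and shows the ASCII offsets for Phred scores in the Sanger variant (offset 33) and the Illumina 1.3+ variant (offset 64) as described by Cock et al. Solexa scores use a different scale and are left out. All method names are chosen for this illustration.

```ruby
# ASCII offsets for Phred scores in two FASTQ variants (Cock et al., 2010).
OFFSETS = { sanger: 33, illumina: 64 }.freeze

# Turn an array of integer Phred scores into a FASTQ quality string.
def encode_phred(scores, variant)
  offset = OFFSETS.fetch(variant)
  scores.map { |q| (q + offset).chr }.join
end

# Turn a FASTQ quality string back into integer Phred scores.
def decode_phred(quality_string, variant)
  offset = OFFSETS.fetch(variant)
  quality_string.each_byte.map { |byte| byte - offset }
end

# Assemble one four-line FASTQ record.
def fastq_record(title, sequence, scores, variant = :sanger)
  "@#{title}\n#{sequence}\n+\n#{encode_phred(scores, variant)}\n"
end

puts encode_phred([0, 20, 40], :sanger)   # prints !5I
puts encode_phred([0, 20, 40], :illumina) # prints @Th
puts fastq_record('read1', 'ACG', [0, 20, 40])
```

Because the two offsets overlap in their printable range, a file's variant generally has to be known or guessed before its scores can be decoded, which is why the slide's code sets the quality score type explicitly.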
Relevant New Features (4)


Bio::NCBI::REST example
  require 'bio'
  ncbi = Bio::NCBI::REST::ESearch.new
  ncbi.search("nucleotide", "tardigrada")
  ncbi.count("nucleotide", "tardigrada")
  ncbi.nucleotide("tardigrada")
  ncbi.taxonomy("tardigrada")
  ncbi.pubmed("tardigrada", "reldate" => 365)
  ncbi.pubmed("mammoth mitochondrial genome")


Bio::TogoWS: entry point for PDBj, NCBI, DDBJ, EBI, KEGG
  require 'bio'
  t = Bio::TogoWS::REST.new
  puts t.entry('genbank', 'AF237819')
  puts t.search('uniprot', 'lung cancer')
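
Under the hood, an ESearch query is an HTTP request to NCBI's E-utilities. The sketch below only builds the request URL with Ruby's standard library and makes no network call. The helper `esearch_url` is hypothetical and is not part of BioRuby. The `db`, `term` and `reldate` parameters are standard ESearch parameters.

```ruby
require 'uri'

# Public NCBI E-utilities ESearch endpoint.
ESEARCH_ENDPOINT = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'

# Build an ESearch URL for a database and a search term.
# Extra options (for example 'reldate') are added as query parameters.
def esearch_url(db, term, options = {})
  params = { 'db' => db, 'term' => term }.merge(options)
  uri = URI(ESEARCH_ENDPOINT)
  uri.query = URI.encode_www_form(params)
  uri.to_s
end

puts esearch_url('pubmed', 'tardigrada', 'reldate' => 365)
puts esearch_url('pubmed', 'mammoth mitochondrial genome')
```

A client library wraps this step and also handles fetching the response, parsing the returned identifiers, and respecting NCBI's request rate guidelines.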
BioRuby is Agile
●   OpenBio* developers are the stakeholders
    ●    Faster iteration process
    ●    Frequent meetings (mail, Skype/voice chat, IRC)
●   Test everything (tests are required for new features)
     –   Improves quality and maintainability, and guarantees portability
     –   Ruby Unit Testing Framework, RSpec
●   GitHub
    ●    Low barrier for new developers
    ●    32 forks and 100 people watching us

Reference: Agile Manifesto
Moving to Agile Programming

[Chart: number of tests and number of tutorial lines for each release from 1.0.0 to 1.4.0 (1.0.0, 1.1.0, 1.2.0, 1.2.1, 1.3.0, 1.3.1, 1.4.0); y-axis from 0 to 2500]
Refactoring

[Chart: number of files, classes, modules and methods for each release from 1.0.0 to 1.4.0 (1.0.0, 1.1.0, 1.2.0, 1.2.1, 1.3.0, 1.3.1, 1.4.0); y-axis from 0 to 3500]
Ongoing Work
●   Semantic Web (started at BioHackathon 2010)
    ●   Expose data in RDF
    ●   Consume SPARQL endpoints efficiently
●   Ruby 1.9.2 support of BioRuby (GSoC & OBF)
    ●   Improved performance
●   Develop an API for NeXML I/O, and, RDF triples for BioRuby (GSoC &
    NESCent)
●   Implementation of algorithm to infer gene duplications in BioRuby
    (GSoC & OBF)
PlugIn system
●   We want a BioRuby core that is stable on every OS
    ●   But… we also want to use experimental code ASAP
    ●   With BioRuby + BioRuby plugins + Rails we can have multiple
        applications that share a single core and add specific features
        –   User or Application
●   Suggested guideline for the plugin namespace
    ●   On GitHub you can find our plugins by searching for
        bioruby-plugin-NAME
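
A naming convention like bioruby-plugin-NAME is easy to work with programmatically. As a small sketch, the code below picks plugin names out of a list of repository names. The list and the helper `plugin_name` are hypothetical and are not part of the plugin system itself.

```ruby
# Matches repository names that follow the bioruby-plugin-NAME convention.
PLUGIN_PATTERN = /\Abioruby-plugin-(?<name>[a-z0-9_-]+)\z/

# Returns the plugin name, or nil when the repository is not a plugin.
def plugin_name(repository)
  match = PLUGIN_PATTERN.match(repository)
  match && match[:name]
end

# Hypothetical repository names, for illustration only.
repositories = %w[bioruby bioruby-plugin-graphics rails bioruby-plugin-ngs]
puts repositories.map { |repo| plugin_name(repo) }.compact.inspect
# prints ["graphics", "ngs"]
```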
PlugIn system
The plugin system will be delivered with the next BioRuby release.

Example: BioGraphics (Jan Aerts)

For biologists:
bioruby --plugin install graphics

For geeks:
bioruby --plugin install git://github.com/user/repo.git

It's very experimental
What We Need



●   Better integration with R
●   Better support for data visualization (interpretation)
●   Detailed Roadmap
Publications
BioRuby: Bioinformatics software for the Ruby programming language (submitted)
    Naohisa Goto, Pjotr Prins, Mitsuteru Nakao, Raoul Bonnal, Jan Aerts and Toshiaki Katayama

The DBCLS BioHackathon: standardization and interoperability for bioinformatics web services and
    workflows (accepted)
    Toshiaki Katayama et al.

Toshiaki Katayama, Mitsuteru Nakao and Toshihisa Takagi (2010)
    TogoWS: integrated SOAP and REST APIs for interoperable bioinformatics Web services. Nucleic Acids
    Research, Vol. 38, No. suppl_2, W706-W711, doi:10.1093/nar/gkq386 (Web Server Issue 2010)

Cock, P. J., Fields, C. J., Goto, N., Heuer, M. L., and Rice, P. M. (2010)
    The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants.
    Nucleic Acids Res, 38(6), 1767-1771.

Over 24 articles use BioRuby in their analyses; check the up-to-date list:
    http://bioruby.open-bio.org/wiki/Research_using_BioRuby
Acknowledgments
●   BioRuby Team
    ●   Toshiaki Katayama*
    ●   Naohisa Goto*
    ●   Pjotr Prins*
    ●   Mitsuteru Nakao*
    ●   Jan Aerts*
    ●   Christian M Zmasek*
    ●   All GSoC students
●   Organizations
    ●   Open Bioinformatics Foundation
    ●   Database Center for Life Science
    ●   Google Summer of Code
    ●   NESCent (National Evolutionary Synthesis Center)

* co-author
