Rice bosc2010 emboss

  1. EMBOSS: European Molecular Biology Open Software Suite. Open-Bio Project Update 2010. Peter Rice, pmr@ebi.ac.uk
  2. A quick introduction
     - Open source package for sequence analysis
       - ANSI C source code
       - GPL licensed applications, LGPL libraries
       - 200+ applications
       - 100+ third party applications in 15 associated packages
         - MIRA, MEME, HMMER, PHYLIP, etc.
       - Project started 1996 at Sanger and HGMP
       - Now based at EBI
       - Release 1.0.0 15th July 2000
       - Release 6.3.0 15th July 2010
       - Funded by UK-BBSRC and EMBL-EBI
       - Originally funded by the Wellcome Trust
       - Additional funds from UK-MRC

     BOSC 2010: EMBOSS 11.07.10
  3. Who do we serve?
     - Expert software developers
       - Bioinformaticians
       - Computer scientists
     - Expert users
       - Biology research community
       - Industry
     - Scientific users
       - Biology research community
       - Industry
  4. EMBOSS command line interface
     - EMBOSS applications run from the command line
     - This is not the only interface
       - There are over 100 interfaces and packaged systems available
         - Web: wEMBOSS
         - GUI: Jemboss
         - Web Services: SoapLab
         - Workflows: Galaxy, Taverna
         - Windows: mEMBOSS
     - All applications have a command definition file (.acd)
       - Defines all inputs, outputs, and other options
       - Read at startup
       - Contains all command line options with descriptions
       - Template for any other interface
  5. EMBOSS Update
     - Release 6.3.0 as usual on 15th July 2010
     - New support for NGS sequence formats
     - Adaptor detection added to supermatcher
     - Metadata and ontologies
     - Full set of public data resources
     - Three open source books: users, developers, admin
       - Cambridge University Press
  6. NGS sequence formats
     - SAM format: tab-delimited short read data
     - BAM format: binary compressed SAM format
       - More work needed on remote access to mapped reads
     - FASTQ short reads and quality scores
       - OpenBio project collaboration on format standards
       - Improved error detection (for all formats)
       - Improved performance for input and output
       - Indexing in dbxflat
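To make "tab-delimited short read data" concrete, here is a minimal Python sketch that splits one SAM alignment line into the eleven mandatory columns defined by the SAM specification. It is not how EMBOSS reads SAM (EMBOSS is written in C); the `SamRecord` type, the `parse_sam_line` function and the example line are illustrative assumptions, and the read in the example is made up.

```python
from typing import NamedTuple, Tuple


class SamRecord(NamedTuple):
    """The eleven mandatory SAM columns plus any optional tags."""
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: Tuple[str, ...]


def parse_sam_line(line: str) -> SamRecord:
    """Split one alignment line on tabs; header lines start with '@'."""
    if line.startswith("@"):
        raise ValueError("header line, not an alignment record")
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 columns, got {len(fields)}")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tuple(fields[11:]),
    )


# Made-up example record (hypothetical read name and mapping position).
EXAMPLE = ("read1\t0\tchr1\t100\t60\t25M\t*\t0\t0\t"
           "CCCTTCTTGTCTTCAGCGTTTCTCC\t;;3;;;;;;;;;;;;7;;;;;;;88\tNM:i:0\n")

if __name__ == "__main__":
    print(parse_sam_line(EXAMPLE))
```

BAM carries the same fields in a compressed binary encoding, so a text parser like this applies only to SAM.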
  7. NGS sequence formats
     - FASTQ joint effort with Bio* projects
     - Definition of 3 conflicting FASTQ formats
     - Agreement on standard parsing procedures

         @EAS54_6_R1_2_1_413_324
         CCCTTCTTGTCTTCAGCGTTTCTCC
         +EAS54_6_R1_2_1_413_324
         ;;3;;;;;;;;;;;;7;;;;;;;88
         @EAS54_6_R1_2_1_443_348
         GTTGCTTCTGGCGTGGGTGGGGGGG
         +EAS54_6_R1_2_1_443_348
         ;;;;;;;;;;;9;7;;.7;393333
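The three conflicting FASTQ variants differ only in how the quality line is encoded: Sanger uses Phred scores with an ASCII offset of 33, Illumina 1.3+ uses Phred scores with an offset of 64, and Solexa uses Solexa-scaled scores with an offset of 64. The Python sketch below reads simple four-line records, like the two shown above, and decodes a quality string under each variant. The function names are illustrative assumptions and this is not the EMBOSS implementation; it also assumes sequences are not wrapped over several lines.

```python
import io
import math
from typing import Iterator, List, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality) from four-line FASTQ records."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got {header!r}")
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not plus.startswith("+"):
            raise ValueError(f"expected '+' separator, got {plus!r}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


def phred_scores(qual: str, variant: str = "sanger") -> List[int]:
    """Decode a quality string to Phred scores for one FASTQ variant."""
    if variant == "sanger":      # Phred scores, ASCII offset 33
        return [ord(c) - 33 for c in qual]
    if variant == "illumina":    # Illumina 1.3+: Phred scores, offset 64
        return [ord(c) - 64 for c in qual]
    if variant == "solexa":      # Solexa scores, offset 64, mapped to Phred
        return [round(10 * math.log10(10 ** ((ord(c) - 64) / 10) + 1))
                for c in qual]
    raise ValueError(f"unknown FASTQ variant: {variant}")


SAMPLE = """@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+EAS54_6_R1_2_1_413_324
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
;;;;;;;;;;;9;7;;.7;393333
"""

if __name__ == "__main__":
    for name, seq, qual in read_fastq(io.StringIO(SAMPLE)):
        print(name, phred_scores(qual, "sanger"))
```

The file itself does not say which variant it uses, so the caller has to supply it. That ambiguity is why agreed parsing procedures were needed.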
  8. Other sequence formats

         >AB036666 AB036666 Wolbachia sp. wKue genes
         cattactatttcagtcgagacatattaggtcaatcaattttaatcaacaagattggtcaa
         gatcaaagtaacattaaaaaatatatatactcatatggtgagtaccctctgaactggcct
         cagggaacagaatacactttatctaacagccctgttacaacattaatatttgttcaaggt
         aatgaaggacaagaaaaaacagcattcatttttcatatacgagagtccaatacaaaggaa
         ttctatgctgataaaaaaattccagtgctaaacatacctaaaataggaaaagtaggaaat
         gccgtagaaattaaaatgagtctaaaaaaatatgaaacagggttatcttttgaagacctt
         tttgaaatagaacagataagtaaatatgaatcaagtggtaatgatcaacaatttacagat
         ggcaagtttattgagatacctaattctgatgaattaaaggcaaaatttgatcaagcaatc
         acttctcaacatgcttccgacggtgaggtttcattgcaagcctataaagtgttgcttact
         gaagtagcagatacgatttaccctatcaaagatttgattactaatgaagcaagattacaa
         gctgttcttaatggtttgcttagtagctatagtgatttaaagctacaggagacttctgcg
         aagactgtaattatacctgaatttcaagtaggagcaggtggtcgtgtagatatggtaatt
         caaggtattggtccttcgtctcagggtactaaagaatacactcctatagcgctggaattt
  9. New data sources for EMBOSS
     - BioMart access
       - As a sequence database, define sequence, identifier, etc.
       - Need to define a very large number of databases
     - Ensembl access
       - Code from Michael Schuster
       - Ensembl SQL access code in library (access method soon)
       - Same issues as BioMart
     - DAS 1.6 client access planned
     - GMOD access planned
     - BioSQL access planned
  10. Data servers
      - Defining individual sequence databases is tedious
      - Many database definitions are similar
      - Simplify (and extend) with server definitions:
        - SRS
        - MRS
        - BioMart
        - Ensembl
        - DAS 1.6
      - Define server
      - USA to give server:dbname:queryfield-value
      - Database name and query field known to user
        - Or reported by a query to the server in an extended showdb
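As a rough illustration of the `server:dbname:queryfield-value` form named on this slide, the Python sketch below splits such a string into its four parts. It follows the slide's wording literally and is an assumption throughout: the syntax accepted by a released EMBOSS may differ, and `ServerQuery`, `parse_server_usa` and the example names are hypothetical.

```python
from typing import NamedTuple


class ServerQuery(NamedTuple):
    """Parts of a server-qualified query as written on the slide."""
    server: str
    dbname: str
    field: str
    value: str


def parse_server_usa(usa: str) -> ServerQuery:
    """Split 'server:dbname:queryfield-value' into its four parts."""
    parts = usa.split(":", 2)
    if len(parts) != 3:
        raise ValueError(f"expected server:dbname:field-value, got {usa!r}")
    server, dbname, query = parts
    # Only the first hyphen separates field from value, so the value
    # itself may contain hyphens.
    field, sep, value = query.partition("-")
    if not (server and dbname and field and sep and value):
        raise ValueError(f"incomplete query: {usa!r}")
    return ServerQuery(server, dbname, field, value)


if __name__ == "__main__":
    # Hypothetical server and database names, for illustration only.
    print(parse_server_usa("myserver:mydb:id-ABC123"))
```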
  11. New data sources for EMBOSS (2)
      - Non-sequence data
        - Cross-referenced resources from EMBL/UniProt/etc.
        - Useful to return as:
          - Identifiers
          - Text for entries
          - HTML with markup
          - URLs for browsing
      - Dbxref.dat
        - List of all known data resources
        - Standard names
        - Standard queries for sequence, text, HTML, etc.
        - Query by identifier and other fields
  12. Ontologies
      - Support for OBO format ontologies:
        - Gene Ontology
        - Sequence Ontology (used internally for features)
        - BioSapiens Ontology (used internally for features)
      - Parsing and format validation
      - Indexing with new dbx applications
      - Indexing cross-references in EMBL/UniProt/etc.
      - Navigation up, down, siblings, etc.
      - Remote and local access
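The parsing and navigation steps listed above can be sketched in a few lines of Python. The example reads `[Term]` stanzas from OBO text, keeps the `id`, `name` and `is_a` tags, and derives parents (up), children (down) and siblings. It is a simplified illustration, not the EMBOSS code: it ignores other tags, relationship types and trailing modifiers, and the `TOY:` terms are made up rather than taken from any real ontology.

```python
from collections import defaultdict
from typing import Dict, List


def parse_obo(text: str) -> Dict[str, dict]:
    """Return {term id: {'id', 'name', 'is_a'}} for every [Term] stanza."""
    terms: Dict[str, dict] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are kept.
            current = {"is_a": []} if line == "[Term]" else None
            continue
        if current is None or not line or line.startswith("!"):
            continue
        tag, sep, value = line.partition(":")
        if not sep:
            continue
        if tag == "id":
            current["id"] = value.strip()
            terms[current["id"]] = current
        elif tag == "name":
            current["name"] = value.strip()
        elif tag == "is_a":
            # Drop the trailing '! parent name' comment.
            current["is_a"].append(value.split("!", 1)[0].strip())
    return terms


def children_index(terms: Dict[str, dict]) -> Dict[str, List[str]]:
    """Invert the is_a links so each parent maps to its children."""
    kids: Dict[str, List[str]] = defaultdict(list)
    for term_id, term in terms.items():
        for parent in term["is_a"]:
            kids[parent].append(term_id)
    return kids


def siblings(terms: Dict[str, dict], kids: Dict[str, List[str]],
             term_id: str) -> List[str]:
    """Terms that share at least one parent with term_id."""
    found: List[str] = []
    for parent in terms[term_id]["is_a"]:
        for child in kids[parent]:
            if child != term_id and child not in found:
                found.append(child)
    return found


# Made-up toy ontology, for illustration only.
SAMPLE = """format-version: 1.2

[Term]
id: TOY:0001
name: root term

[Term]
id: TOY:0002
name: first child
is_a: TOY:0001 ! root term

[Term]
id: TOY:0003
name: second child
is_a: TOY:0001 ! root term
"""

if __name__ == "__main__":
    terms = parse_obo(SAMPLE)
    kids = children_index(terms)
    print("up:", terms["TOY:0002"]["is_a"])
    print("down:", kids["TOY:0001"])
    print("siblings:", siblings(terms, kids, "TOY:0002"))
```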
  13. Ontologies: EDAM
      - EMBRACE Datatypes And Methods
        - OBO format (so far)
      - All ACD files have relations attributes
        - “topic” for application (Immunological analysis)
        - “operation” for application (Epitope mapping)
        - “data” for inputs and outputs
          - Pure protein sequence
            - Sequence record
            - 1 or more
          - Sequence length
          - “Peptide immunogenicity report”
      - Validation by acdvalid application
  14. EDAM in ACD

          application: antigenic [
            documentation: "Finds antigenic sites in proteins"
            groups: "Protein:Motifs"
            relations: "/edam/topic/0000201 Immunological analysis"
            relations: "/edam/operation/0000416 Epitope mapping"
          ]

          seqall: sequence [
            parameter: "Y"
            type: "proteinstandard"
            relations: "/edam/data/0001219 Pure protein sequence"
            relations: "/edam/data/0000849 Sequence record"
            relations: "/edam/data/0002178 1 or more"
          ]

          integer: minlen [
            standard: "Y"
            minimum: "1"
            maximum: "50"
            default: "6"
            information: "Minimum length of antigenic region"
            relations: "/edam/data/0001249 Sequence length"
          ]

          report: outfile [
            parameter: "Y"
            rformat: "motif"
            multiple: "Y"
            taglist: "int:pos=Max_score_pos"
            relations: "/edam/data/0001534 Peptide immunogenicity report"
          ]
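Because each `relations:` attribute is plain text in a fixed shape, the EDAM annotations can be pulled out of an ACD file with a small script. The Python sketch below collects them per ACD block, assuming the path-style identifiers shown on this slide (`/edam/<branch>/<number> <label>`) and one attribute per line. It is an illustration only: EMBOSS reads ACD files with its own C library, and `edam_relations` is a hypothetical helper.

```python
import re
from typing import Dict, List, Tuple

# 'type: name [' opens a block; a lone ']' closes it.
BLOCK_OPEN = re.compile(r'^(\w+):\s*(\w+)\s*\[$')
# relations: "/edam/<branch>/<number> <label>"
RELATION = re.compile(r'relations:\s*"\s*/edam/(\w+)/(\d+)\s+([^"]*)"')


def edam_relations(acd_text: str) -> Dict[str, List[Tuple[str, str, str]]]:
    """Map each ACD block name to its (branch, identifier, label) triples."""
    result: Dict[str, List[Tuple[str, str, str]]] = {}
    current = None
    for line in acd_text.splitlines():
        stripped = line.strip()
        opened = BLOCK_OPEN.match(stripped)
        if opened:
            current = opened.group(2)
            result[current] = []
            continue
        if stripped == "]":
            current = None
            continue
        if current is not None:
            for branch, ident, label in RELATION.findall(stripped):
                result[current].append((branch, ident, label.strip()))
    return result


# Excerpt of the ACD definition shown on the slide.
SAMPLE = '''application: antigenic [
  documentation: "Finds antigenic sites in proteins"
  groups: "Protein:Motifs"
  relations: "/edam/topic/0000201 Immunological analysis"
  relations: "/edam/operation/0000416 Epitope mapping"
]

seqall: sequence [
  parameter: "Y"
  type: "proteinstandard"
  relations: "/edam/data/0001219 Pure protein sequence"
  relations: "/edam/data/0000849 Sequence record"
  relations: "/edam/data/0002178 1 or more"
]

integer: minlen [
  standard: "Y"
  information: "Minimum length of antigenic region"
  relations: "/edam/data/0001249 Sequence length"
]
'''

if __name__ == "__main__":
    for block, relations in edam_relations(SAMPLE).items():
        print(block, relations)
```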
  15. Ontologies: EDAM (2)
      - SoapLab web services annotated with EDAM
        - EDAM terms parsed from ACD files
        - Web services have WSDL files
        - SAWSDL annotation with EDAM terms
        - Annotation can be used by BioCatalogue
          - www.biocatalogue.org
        - Also can be used by EMBRACE registry
          - www.embraceregistry.net
  16. Ontologies: NCBI Taxonomy
      - Parsers for “.dmp” files
      - Will add dbx indexing applications
      - Local and remote access
      - Navigation up, down, siblings (the usual suspects)
      - Automatic cross references from sequence data
        - EMBL source line
        - UniProt OX lines
        - BioMart mart name (organism name)
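The NCBI taxonomy “.dmp” files separate fields with a tab, a vertical bar and a tab, and end each row with a tab and a vertical bar. In `nodes.dmp` the first three fields are the taxon id, the parent taxon id and the rank; in `names.dmp` they are the taxon id, the name, a unique name and the name class. The Python sketch below parses both layouts and walks up from a taxon to the root. It is a hedged illustration rather than the EMBOSS parser, and the sample rows use made-up ids and names, not real NCBI entries.

```python
from typing import Dict, Iterable, List, Tuple


def split_dmp_line(line: str) -> List[str]:
    """Split one .dmp row on the tab-bar-tab field separator."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the row terminator
    return line.split("\t|\t")


def load_nodes(lines: Iterable[str]) -> Tuple[Dict[int, int], Dict[int, str]]:
    """nodes.dmp: taxon id, parent taxon id, rank, (further columns)."""
    parent: Dict[int, int] = {}
    rank: Dict[int, str] = {}
    for line in lines:
        fields = split_dmp_line(line)
        tax_id = int(fields[0])
        parent[tax_id] = int(fields[1])
        rank[tax_id] = fields[2]
    return parent, rank


def load_scientific_names(lines: Iterable[str]) -> Dict[int, str]:
    """names.dmp: taxon id, name, unique name, name class."""
    names: Dict[int, str] = {}
    for line in lines:
        fields = split_dmp_line(line)
        if fields[3] == "scientific name":
            names[int(fields[0])] = fields[1]
    return names


def lineage(tax_id: int, parent: Dict[int, int]) -> List[int]:
    """Walk up to the root, which is recorded as its own parent."""
    path = [tax_id]
    while parent[path[-1]] != path[-1]:
        path.append(parent[path[-1]])
    return path


# Made-up rows in the real column layout; ids and names are toy values.
NODES = [
    "1\t|\t1\t|\tno rank\t|\n",
    "10\t|\t1\t|\tgenus\t|\n",
    "100\t|\t10\t|\tspecies\t|\n",
]
NAMES = [
    "1\t|\troot\t|\t\t|\tscientific name\t|\n",
    "10\t|\tToygenus\t|\t\t|\tscientific name\t|\n",
    "100\t|\tToygenus exemplum\t|\t\t|\tscientific name\t|\n",
    "100\t|\ttoy example\t|\t\t|\tcommon name\t|\n",
]

if __name__ == "__main__":
    parent, rank = load_nodes(NODES)
    names = load_scientific_names(NAMES)
    for tax_id in lineage(100, parent):
        print(tax_id, rank[tax_id], names[tax_id])
```

Navigation down and across siblings follows from inverting the `parent` mapping.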
  17. EMBOSS Interfaces and wrappers
      - Two releases in the past year
      - Possibly three releases next year
      - Too many for other projects to keep up
        - So we are obliged to help, starting with:
          - SoapLab2
          - Jemboss
          - Galaxy
          - Pipeline Pilot
            - BioPerl
          - wEMBOSS and Explorer
          - G-language?
          - … and anyone else who asks!
  18. The EMBOSS Team: Peter Rice, Alan Bleasby, Jon Ison, Mahmut Uludag
  19. Acknowledgements
      - EBI: Peter Rice, Alan Bleasby, Jon Ison, Mahmut Uludag, Martin Senger, Tom Oinn, Jaina Mistry, Rodrigo Lopez, Sharmilla Pillai, Hamish McWilliam, Michael Schuster, Syed Haider
      - RFCGR/HGMP: Alan Bleasby, Jon Ison, Tim Carver, Hugh Morgan, Claude Beazley, Lisa Mullan, Damian Counsell, Gary Williams, Val Curwen, Mark Faller, Sinead O’Leary, Thon deBoer, Martin Bishop
      - LION: Thomas Laurent, Bijay Jassal, Bren Vaughan, Thure Etzold
      - Sanger Institute: Ian Longden, Richard Bruskiewich, Simon Kelley
      - National bioinformatics service providers in: Norway, Spain, Italy, Netherlands, Germany, Belgium, Russia, China, Canada, Australia, Argentina
      - Others: Catherine Letondal, Don Gilbert, Rodger Staden, Bill Pearson, Webb Miller, Marie-Laetitia Denayer, Amandine Schurmann, Gabriele Weiler, Luke McCarthy, David Mathog, David Bauer, Henrikki Almusa, Thomas Siegmund, Scott Markel, Darryl Leon, Bastien Chevreux, Ivo Hofacker, Kristoffer Rapacki, Matus Kalas
      - IBM, Hewlett-Packard (Compaq), Apple, SGI, Sun, LION bioscience, SciTegic, Cambridge University Press
      - Open-Bio Foundation, Sourceforge, ... and the British Antarctic Survey
      - http://emboss.sourceforge.net
      - http://emboss.open-bio.org/wiki