Rice bosc2010 emboss
Published in: Technology
Transcript

  • 1. EMBOSS: European Molecular Biology Open Software Suite
      Open-Bio Project Update 2010
      Peter Rice, pmr@ebi.ac.uk
  • 2. A quick introduction
      • Open source package for sequence analysis
          • ANSI C source code
          • GPL licensed applications, LGPL libraries
          • 200+ applications
          • 100+ third party applications in 15 associated packages
              • MIRA, MEME, HMMER, PHYLIP, etc.
          • Project started 1996 at Sanger and HGMP
          • Now based at EBI
          • Release 1.0.0 15th July 2000
          • Release 6.3.0 15th July 2010
          • Funded by UK-BBSRC and EMBL-EBI
          • Originally funded by the Wellcome Trust
          • Additional funds from UK-MRC
      BOSC 2010: EMBOSS 11.07.10
  • 3. Who do we serve?
      • Expert software developers
          • Bioinformaticians
          • Computer scientists
      • Expert users
          • Biology research community
          • Industry
      • Scientific users
          • Biology research community
          • Industry
  • 4. EMBOSS command line interface
      • EMBOSS applications run from the command line
      • This is not the only interface
          • There are over 100 interfaces and packaged systems available
              • Web: wEMBOSS
              • GUI: Jemboss
              • Web Services: SoapLab
              • Workflows: Galaxy, Taverna
              • Windows: mEMBOSS
      • All applications have a command definition file (.acd)
          • Defines all inputs, outputs, and other options
          • Read at startup
          • Contains all command line options with descriptions
          • Template for any other interface
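
Because every application is driven by named command line qualifiers, a wrapper in another language can be very thin. The following is a minimal Python sketch, not part of EMBOSS: the helper names `build_emboss_command` and `run_emboss` are hypothetical, and the file names are placeholders. It assumes the `seqret` application with its `-sequence` and `-outseq` qualifiers, plus the global `-auto` qualifier that suppresses interactive prompts.

```python
import shutil
import subprocess


def build_emboss_command(app, **qualifiers):
    """Build an argument list such as ['seqret', '-sequence', 'in.fasta', ..., '-auto'].

    Hypothetical helper: each keyword argument becomes '-name value'.
    '-auto' is appended so the application does not prompt for missing values.
    """
    cmd = [app]
    for name, value in qualifiers.items():
        cmd += [f"-{name}", str(value)]
    cmd.append("-auto")
    return cmd


def run_emboss(app, **qualifiers):
    """Run an EMBOSS application if it is installed; raise otherwise."""
    if shutil.which(app) is None:
        raise FileNotFoundError(f"{app} not found on PATH (is EMBOSS installed?)")
    return subprocess.run(
        build_emboss_command(app, **qualifiers),
        check=True,
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    # Only prints the command; the file names are placeholders.
    print(build_emboss_command("seqret", sequence="in.fasta", outseq="out.fasta"))
```

A fuller interface would read the application's .acd file to discover the valid qualifiers rather than trusting the caller, which is the role the slide describes for ACD as a template.
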
  • 5. EMBOSS Update
      • Release 6.3.0 as usual on 15th July 2010
      • New support for NGS sequence formats
      • Adaptor detection added to supermatcher
      • Metadata and ontologies
      • Full set of public data resources
      • Three open source books: users, developers, admin
          • Cambridge University Press
  • 6. NGS sequence formats
      • SAM format: tab-delimited short read data
      • BAM format: binary compressed SAM format
          • More work needed on remote access to mapped reads
      • FASTQ short reads and quality scores
          • OpenBio project collaboration on format standards
          • Improved error detection (for all formats)
          • Improved performance for input and output
          • Indexing in dbxflat
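
To make "tab-delimited short read data" concrete, here is a minimal Python sketch of splitting one SAM alignment line into its 11 mandatory fields, with any optional tags kept as raw strings. This is an illustration, not EMBOSS code: the `SamRecord` type and `parse_sam_line` function are hypothetical names, and the alignment line in the example is invented.

```python
from typing import NamedTuple, Tuple


class SamRecord(NamedTuple):
    """The 11 mandatory SAM alignment fields, plus optional tags as raw strings."""
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: Tuple[str, ...]


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line; header lines (starting with '@') are rejected."""
    if line.startswith("@"):
        raise ValueError("header line, not an alignment record")
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 tab-separated fields, got {len(fields)}")
    return SamRecord(
        qname=fields[0], flag=int(fields[1]), rname=fields[2],
        pos=int(fields[3]), mapq=int(fields[4]), cigar=fields[5],
        rnext=fields[6], pnext=int(fields[7]), tlen=int(fields[8]),
        seq=fields[9], qual=fields[10], tags=tuple(fields[11:]),
    )


if __name__ == "__main__":
    # Invented example line, for illustration only.
    example = "read1\t0\tchr1\t100\t60\t25M\t*\t0\t0\tCCCTTCTTGTCTTCAGCGTTTCTCC\t;;3;;;;;;;;;;;;7;;;;;;;88\tNM:i:0"
    print(parse_sam_line(example))
```

BAM carries the same records in a compressed binary encoding, so it needs a dedicated reader rather than line splitting.
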
  • 7. NGS sequence formats
      • FASTQ joint effort with Bio* projects
      • Definition of 3 conflicting FASTQ formats
      • Agreement on standard parsing procedures
      Example:
        @EAS54_6_R1_2_1_413_324
        CCCTTCTTGTCTTCAGCGTTTCTCC
        +EAS54_6_R1_2_1_413_324
        ;;3;;;;;;;;;;;;7;;;;;;;88
        @EAS54_6_R1_2_1_443_348
        GTTGCTTCTGGCGTGGGTGGGGGGG
        +EAS54_6_R1_2_1_443_348
        ;;;;;;;;;;;9;7;;.7;393333
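
The records above can be read with a few lines of code, and the parsing rules are where the variants collide. Below is a minimal Python sketch, not the EMBOSS or Bio* implementation; `read_fastq` and `phred_scores` are hypothetical names. It assumes unwrapped four-line records and decodes qualities with the Sanger convention (ASCII offset 33). The other two variants use offset 64, and the Solexa variant additionally uses a different score scale, so the offset is left as a parameter.

```python
def read_fastq(lines):
    """Yield (identifier, sequence, quality) tuples from four-line FASTQ records.

    Assumption: sequence and quality are not line-wrapped. Reading by position
    (rather than by leading character) matters because '@' and '+' are also
    legal quality characters.
    """
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header line, got {header!r}")
        seq = next(it, "").strip()
        plus = next(it, "").strip()
        qual = next(it, "").strip()
        if not plus.startswith("+"):
            raise ValueError(f"expected '+' separator line, got {plus!r}")
        # The identifier may optionally be repeated after '+'; if so it must match.
        if len(plus) > 1 and plus[1:] != header[1:]:
            raise ValueError("identifier after '+' does not match the '@' line")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Decode a quality string to integer scores (offset 33 = Sanger variant)."""
    return [ord(ch) - offset for ch in qual]


EXAMPLE = """\
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+EAS54_6_R1_2_1_413_324
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
;;;;;;;;;;;9;7;;.7;393333
"""

if __name__ == "__main__":
    for name, seq, qual in read_fastq(EXAMPLE.splitlines()):
        print(name, len(seq), phred_scores(qual)[:5])
```
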
  • 8. Other sequence formats
        >AB036666 AB036666 Wolbachia sp. wKue genes
        cattactatttcagtcgagacatattaggtcaatcaattttaatcaacaagattggtcaa
        gatcaaagtaacattaaaaaatatatatactcatatggtgagtaccctctgaactggcct
        cagggaacagaatacactttatctaacagccctgttacaacattaatatttgttcaaggt
        aatgaaggacaagaaaaaacagcattcatttttcatatacgagagtccaatacaaaggaa
        ttctatgctgataaaaaaattccagtgctaaacatacctaaaataggaaaagtaggaaat
        gccgtagaaattaaaatgagtctaaaaaaatatgaaacagggttatcttttgaagacctt
        tttgaaatagaacagataagtaaatatgaatcaagtggtaatgatcaacaatttacagat
        ggcaagtttattgagatacctaattctgatgaattaaaggcaaaatttgatcaagcaatc
        acttctcaacatgcttccgacggtgaggtttcattgcaagcctataaagtgttgcttact
        gaagtagcagatacgatttaccctatcaaagatttgattactaatgaagcaagattacaa
        gctgttcttaatggtttgcttagtagctatagtgatttaaagctacaggagacttctgcg
        aagactgtaattatacctgaatttcaagtaggagcaggtggtcgtgtagatatggtaatt
        caaggtattggtccttcgtctcagggtactaaagaatacactcctatagcgctggaattt
  • 9. New data sources for EMBOSS
      • BioMart access
          • As a sequence database, define sequence, identifier, etc.
          • Need to define a very large number of databases
      • Ensembl access
          • Code from Michael Schuster
          • Ensembl SQL access code in library (access method soon)
          • Same issues as BioMart
      • DAS 1.6 client access planned
      • GMOD access planned
      • BioSQL access planned
  • 10. Data servers
      • Defining individual sequence databases is tedious
      • Many database definitions are similar
      • Simplify (and extend) with server definitions:
          • SRS
          • MRS
          • BioMart
          • Ensembl
          • DAS 1.6
      • Define server
      • USA to give server:dbname:queryfield-value
      • Database name and query field known to user
          • Or reported by a query to the server in an extended showdb
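
The proposed `server:dbname:queryfield-value` address can be split mechanically. The Python sketch below illustrates only the syntax as written on the slide; it is not how EMBOSS itself parses a USA, the `ServerQuery` type and `parse_server_usa` function are hypothetical names, and the server, database, and field names in the example are invented.

```python
from typing import NamedTuple


class ServerQuery(NamedTuple):
    """Components of a 'server:dbname:queryfield-value' address (slide syntax)."""
    server: str
    dbname: str
    field: str
    value: str


def parse_server_usa(usa: str) -> ServerQuery:
    """Split the address on the first two ':' and then on the first '-'.

    Assumption: server, database, and field names contain neither ':' nor '-';
    the query value may contain both.
    """
    parts = usa.split(":", 2)
    if len(parts) != 3:
        raise ValueError(f"expected server:dbname:field-value, got {usa!r}")
    server, dbname, query = parts
    field, sep, value = query.partition("-")
    if not (server and dbname and field and sep and value):
        raise ValueError(f"expected server:dbname:field-value, got {usa!r}")
    return ServerQuery(server, dbname, field, value)


if __name__ == "__main__":
    # 'myserver', 'mydb', and 'id' are placeholder names, not real definitions.
    print(parse_server_usa("myserver:mydb:id-AB036666"))
```
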
  • 11. New data sources for EMBOSS (2)
      • Non-sequence data
          • Cross-referenced resources from EMBL/UniProt/etc.
          • Useful to return as:
              • Identifiers
              • Text for entries
              • HTML with markup
              • URLs for browsing
      • Dbxref.dat
          • List of all known data resources
          • Standard names
          • Standard queries for sequence, text, HTML, etc.
          • Query by identifier and other fields
  • 12. Ontologies
      • Support for OBO format ontologies:
          • Gene Ontology
          • Sequence Ontology (used internally for features)
          • BioSapiens Ontology (used internally for features)
      • Parsing and format validation
      • Indexing with new dbx applications
      • Indexing cross-references in EMBL/UniProt/etc.
      • Navigation up, down, siblings, etc.
      • Remote and local access
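
Parsing plus "navigation up, down, siblings" can be sketched in a few lines. The Python below is a minimal illustration, not the EMBOSS parser: it reads only `[Term]` stanzas and the `id`, `name`, and `is_a` tags, ignores every other tag and relationship type, and uses an invented three-term ontology with made-up `EX:` identifiers. All function names are hypothetical.

```python
from collections import defaultdict


def parse_obo(text):
    """Return {term_id: {'id', 'name', 'is_a': [parent ids]}} for [Term] stanzas."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are kept.
            current = {"is_a": []} if line == "[Term]" else None
            continue
        if current is None or ":" not in line:
            continue  # header lines, blank lines, or non-Term stanzas
        tag, _, value = line.partition(":")
        value = value.strip()
        if tag == "id":
            current["id"] = value
            terms[value] = current
        elif tag == "name":
            current["name"] = value
        elif tag == "is_a":
            # Drop the trailing '! human-readable name' comment.
            current["is_a"].append(value.split("!")[0].strip())
    return terms


def build_children(terms):
    """Invert the is_a links so the ontology can be walked downwards."""
    children = defaultdict(list)
    for term_id, term in terms.items():
        for parent in term["is_a"]:
            children[parent].append(term_id)
    return children


def siblings(terms, children, term_id):
    """Other terms that share at least one is_a parent with term_id."""
    found = []
    for parent in terms[term_id]["is_a"]:
        for child in children[parent]:
            if child != term_id and child not in found:
                found.append(child)
    return found


# Invented toy ontology; the EX: identifiers are not from any real resource.
EXAMPLE = """\
format-version: 1.2

[Term]
id: EX:0000001
name: root

[Term]
id: EX:0000002
name: left
is_a: EX:0000001 ! root

[Term]
id: EX:0000003
name: right
is_a: EX:0000001 ! root
"""

if __name__ == "__main__":
    terms = parse_obo(EXAMPLE)
    kids = build_children(terms)
    print("up:", terms["EX:0000002"]["is_a"])
    print("down:", kids["EX:0000001"])
    print("siblings:", siblings(terms, kids, "EX:0000002"))
```
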
  • 13. Ontologies: EDAM
      • EMBRACE Datatypes And Methods
          • OBO format (so far)
      • All ACD files have relations attributes
          • “topic” for application (Immunological analysis)
          • “operation” for application (Epitope mapping)
          • “data” for inputs and outputs
              • Pure protein sequence
                  • Sequence record
                  • 1 or more
              • Sequence length
              • “Peptide immunogenicity report”
      • Validation by acdvalid application
  • 14. EDAM in ACD
        application: antigenic [
          documentation: "Finds antigenic sites in proteins"
          groups: "Protein:Motifs"
          relations: "/edam/topic/0000201 Immunological analysis"
          relations: "/edam/operation/0000416 Epitope mapping"
        ]
        seqall: sequence [
          parameter: "Y"
          type: "proteinstandard"
          relations: "/edam/data/0001219 Pure protein sequence"
          relations: "/edam/data/0000849 Sequence record"
          relations: "/edam/data/0002178 1 or more"
        ]
        integer: minlen [
          standard: "Y"
          minimum: "1"
          maximum: "50"
          default: "6"
          information: "Minimum length of antigenic region"
          relations: "/edam/data/0001249 Sequence length"
        ]
        report: outfile [
          parameter: "Y"
          rformat: "motif"
          multiple: "Y"
          taglist: "int:pos=Max_score_pos"
          relations: "/edam/data/0001534 Peptide immunogenicity report"
        ]
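
Any interface that reads ACD can pull these annotations out. The Python sketch below uses regular expressions to collect the `relations` attributes per definition; it is not the EMBOSS ACD parser or `acdvalid`, and the function name `edam_relations` is hypothetical. It assumes that no quoted value contains a `]` character, which holds for this fragment but is not guaranteed for ACD files in general.

```python
import re

# 'type: name [ ...attributes... ]', e.g. 'integer: minlen [ ... ]'
SECTION_RE = re.compile(r"(\w+):\s*(\w+)\s*\[(.*?)\]", re.S)
# 'relations: "/edam/<namespace>/<id> <label>"'
RELATION_RE = re.compile(r'relations:\s*"\s*/edam/(\w+)/(\d+)\s+([^"]*)"')


def edam_relations(acd_text):
    """Map each ACD definition name to its (namespace, id, label) EDAM relations."""
    found = {}
    for _kind, name, body in SECTION_RE.findall(acd_text):
        relations = [
            (namespace, ident, label.strip())
            for namespace, ident, label in RELATION_RE.findall(body)
        ]
        if relations:
            found[name] = relations
    return found


# Fragment of the ACD definition shown on the slide.
ACD = '''
application: antigenic [
  documentation: "Finds antigenic sites in proteins"
  groups: "Protein:Motifs"
  relations: "/edam/topic/0000201 Immunological analysis"
  relations: "/edam/operation/0000416 Epitope mapping"
]
integer: minlen [
  standard: "Y"
  minimum: "1"
  maximum: "50"
  default: "6"
  information: "Minimum length of antigenic region"
  relations: "/edam/data/0001249 Sequence length"
]
'''

if __name__ == "__main__":
    for name, relations in edam_relations(ACD).items():
        for namespace, ident, label in relations:
            print(f"{name}: {namespace} {ident} ({label})")
```
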
  • 15. Ontologies: EDAM (2)
      • SoapLab web services annotated with EDAM
          • EDAM terms parsed from ACD files
          • Web services have WSDL files
          • SAWSDL annotation with EDAM terms
          • Annotation can be used by BioCatalogue
              • www.biocatalogue.org
          • Also can be used by EMBRACE registry
              • www.embraceregistry.net
  • 16. Ontologies: NCBI Taxonomy
      • Parsers for “.dmp” files
      • Will add dbx indexing applications
      • Local and remote access
      • Navigation up, down, siblings (the usual suspects)
      • Automatic cross references from sequence data
          • EMBL source line
          • UniProt OX lines
          • BioMart mart name (organism name)
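
As an illustration of the ".dmp" parsing and upward navigation, here is a minimal Python sketch, not the EMBOSS parser. It relies on the layout of the NCBI taxonomy dump files, where fields are separated by tab, pipe, tab and each line ends with tab, pipe, and on the first two fields of `nodes.dmp` being the taxon id and its parent id, with the root listed as its own parent. The function names are hypothetical and the three-node example uses invented ids and only three fields per line.

```python
def parse_dmp_line(line):
    """Split one .dmp line into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the line terminator
    return line.split("\t|\t")


def load_parents(node_lines):
    """Build {tax_id: parent_tax_id} from nodes.dmp-style lines."""
    parents = {}
    for line in node_lines:
        if not line.strip():
            continue
        fields = parse_dmp_line(line)
        parents[int(fields[0])] = int(fields[1])
    return parents


def lineage(parents, tax_id):
    """Walk up from tax_id to the root (the node that is its own parent)."""
    path = [tax_id]
    while parents[tax_id] != tax_id:
        tax_id = parents[tax_id]
        path.append(tax_id)
    return path


# Invented toy data in the same field layout; not real taxon ids.
EXAMPLE_NODES = [
    "1\t|\t1\t|\tno rank\t|\n",
    "10\t|\t1\t|\tgenus\t|\n",
    "100\t|\t10\t|\tspecies\t|\n",
]

if __name__ == "__main__":
    parents = load_parents(EXAMPLE_NODES)
    print(lineage(parents, 100))
```

Navigation down and across follows by inverting the parent map, as with any other is_a hierarchy.
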
  • 17. EMBOSS Interfaces and wrappers
      • Two releases in the past year
      • Possibly three releases next year
      • Too many for other projects to keep up
          • So we are obliged to help, starting with:
              • SoapLab2
              • Jemboss
              • Galaxy
              • Pipeline Pilot
                  • BioPerl
              • wEMBOSS and Explorer
              • G-language?
              • ... and anyone else who asks!
  • 18. The EMBOSS Team
      • Peter Rice
      • Alan Bleasby
      • Jon Ison
      • Mahmut Uludag
  • 19. Acknowledgements
      • EBI: Peter Rice, Alan Bleasby, Jon Ison, Mahmut Uludag, Martin Senger, Tom Oinn, Jaina Mistry, Rodrigo Lopez, Sharmilla Pillai, Hamish McWilliam, Michael Schuster, Syed Haider
      • RFCGR/HGMP: Alan Bleasby, Jon Ison, Tim Carver, Hugh Morgan, Claude Beazley, Lisa Mullan, Damian Counsell, Gary Williams, Val Curwen, Mark Faller, Sinead O’Leary, Thon deBoer, Martin Bishop
      • LION: Thomas Laurent, Bijay Jassal, Bren Vaughan, Thure Etzold
      • Sanger Institute: Ian Longden, Richard Bruskiewich, Simon Kelley
      • National bioinformatics service providers in: Norway, Spain, Italy, Netherlands, Germany, Belgium, Russia, China, Canada, Australia, Argentina
      • Others: Catherine Letondal, Don Gilbert, Rodger Staden, Bill Pearson, Webb Miller, Marie-Laetitia Denayer, Amandine Schurmann, Gabriele Weiler, Luke McCarthy, David Mathog, David Bauer, Henrikki Almusa, Thomas Siegmund, Scott Markel, Darryl Leon, Bastien Chevreux, Ivo Hofacker, Kristoffer Rapacki, Matus Kalas
      • IBM, Hewlett-Packard (Compaq), Apple, SGI, Sun, LION bioscience, SciTegic, Cambridge University Press
      • Open-Bio Foundation, Sourceforge, ... and the British Antarctic Survey
      • http://emboss.sourceforge.net
      • http://emboss.open-bio.org/wiki
