Rice bosc2010 emboss
Transcript

  • 1. EMBOSS: European Molecular Biology Open Software Suite. Open-Bio Project Update 2010. Peter Rice, pmr@ebi.ac.uk
  • 2. A quick introduction
    • Open source package for sequence analysis
      • ANSI C source code
      • GPL licensed applications, LGPL libraries
      • 200+ applications
      • 100+ third party applications in 15 associated packages
        • MIRA, MEME, HMMER, PHYLIP, etc.
      • Project started 1996 at Sanger and HGMP
      • Now based at EBI
      • Release 1.0.0 15th July 2000
      • Release 6.3.0 15th July 2010
      • Funded by UK-BBSRC and EMBL-EBI
      • Originally funded by the Wellcome Trust
      • Additional funds from UK-MRC
    BOSC 2010: EMBOSS 11.07.10
  • 3. Who do we serve?
    • Expert software developers
      • Bioinformaticians
      • Computer scientists
    • Expert users
      • Biology research community
      • Industry
    • Scientific users
      • Biology research community
      • Industry
  • 4. EMBOSS command line interface
    • EMBOSS applications run from the command line
    • This is not the only interface
      • There are over 100 interfaces and packaged systems available
        • Web: wEMBOSS
        • GUI: Jemboss
        • Web Services: SoapLab
        • Workflows: Galaxy, Taverna
        • Windows: mEMBOSS
    • All applications have a command definition file (.acd)
      • Defines all inputs, outputs, and other options
      • Read at startup
      • Contains all command line options with descriptions
      • Template for any other interface
  • 5. EMBOSS Update
    • Release 6.3.0 as usual on 15th July 2010
    • New support for NGS sequence formats
    • Adaptor detection added to supermatcher
    • Metadata and ontologies
    • Full set of public data resources
    • Three open source books: users, developers, admin
      • Cambridge University Press
  • 6. NGS sequence formats
    • SAM format: tab-delimited short read data
    • BAM format: binary compressed SAM format
      • More work needed on remote access to mapped reads
    • FASTQ short reads and quality scores
      • OpenBio project collaboration on format standards
      • Improved error detection (for all formats)
      • Improved performance for input and output
      • Indexing in dbxflat
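To make the "tab-delimited" description of SAM concrete, here is a minimal Python sketch that splits one alignment line into the 11 mandatory columns named in the SAM specification. It is not EMBOSS code; the example read (name, reference, position, tag) is invented for illustration, and header lines starting with `@` are assumed to have been filtered out already.

```python
from typing import NamedTuple, Tuple


class SamRecord(NamedTuple):
    """The 11 mandatory SAM columns plus any optional TAG:TYPE:VALUE fields."""
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: Tuple[str, ...]


def parse_sam_line(line: str) -> SamRecord:
    """Split one tab-delimited alignment line into typed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 columns, got {len(fields)}")
    return SamRecord(
        qname=fields[0],
        flag=int(fields[1]),
        rname=fields[2],
        pos=int(fields[3]),
        mapq=int(fields[4]),
        cigar=fields[5],
        rnext=fields[6],
        pnext=int(fields[7]),
        tlen=int(fields[8]),
        seq=fields[9],
        qual=fields[10],
        tags=tuple(fields[11:]),
    )


# Invented example line, for illustration only.
example = "\t".join([
    "read1", "0", "chr1", "100", "60", "25M", "*", "0", "0",
    "CCCTTCTTGTCTTCAGCGTTTCTCC", ";;3;;;;;;;;;;;;7;;;;;;;88", "NM:i:0",
])
record = parse_sam_line(example)
```

BAM carries the same fields in a compressed binary layout, so a reader for it would need a decompression and decoding step that this sketch does not attempt.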
  • 7. NGS sequence formats
    • FASTQ joint effort with Bio* projects
    • Definition of 3 conflicting FASTQ formats
    • Agreement on standard parsing procedures
    • @EAS54_6_R1_2_1_413_324
    • CCCTTCTTGTCTTCAGCGTTTCTCC
    • +EAS54_6_R1_2_1_413_324
    • ;;3;;;;;;;;;;;;7;;;;;;;88
    • @EAS54_6_R1_2_1_443_348
    • GTTGCTTCTGGCGTGGGTGGGGGGG
    • +EAS54_6_R1_2_1_443_348
    • ;;;;;;;;;;;9;7;;.7;393333
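The record layout shown on this slide can be read with a few lines of Python. The sketch below is an illustration, not the EMBOSS or Bio* parser: it assumes simple four-line records (no line-wrapped sequences) and decodes qualities with a plain ASCII offset. The three conflicting variants are commonly described as Sanger (Phred scores, offset 33), Solexa (Solexa scores, offset 64) and Illumina 1.3+ (Phred scores, offset 64); the slide does not name them, and the Solexa score scale is not handled here.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (title, sequence, quality) from simple four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' title line, got {header!r}")
        body = [next(it, None) for _ in range(3)]
        if None in body:
            raise ValueError(f"truncated record after {header!r}")
        seq, plus, qual = (part.strip() for part in body)
        title = header[1:]
        if not plus.startswith("+"):
            raise ValueError(f"expected '+' separator in record {title!r}")
        # The '+' line may be bare or may repeat the title; if present it must match.
        if len(plus) > 1 and plus[1:] != title:
            raise ValueError(f"'+' line does not repeat title {title!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {title!r}")
        yield title, seq, qual


def decode_quality(qual: str, offset: int = 33) -> List[int]:
    """Convert a quality string to integer scores using an ASCII offset."""
    return [ord(ch) - offset for ch in qual]


EXAMPLE = """\
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+EAS54_6_R1_2_1_413_324
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
;;;;;;;;;;;9;7;;.7;393333
"""

records = list(read_fastq(EXAMPLE.splitlines()))
first_scores = decode_quality(records[0][2])  # offset 33 is an assumption
```

Because the same quality string decodes to different scores under each offset, a reader has to be told (or has to guess) which variant it is looking at, which is why agreed parsing procedures matter.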
  • 8. Other sequence formats
    • >AB036666 AB036666 Wolbachia sp. wKue genes
    • cattactatttcagtcgagacatattaggtcaatcaattttaatcaacaagattggtcaa
    • gatcaaagtaacattaaaaaatatatatactcatatggtgagtaccctctgaactggcct
    • cagggaacagaatacactttatctaacagccctgttacaacattaatatttgttcaaggt
    • aatgaaggacaagaaaaaacagcattcatttttcatatacgagagtccaatacaaaggaa
    • ttctatgctgataaaaaaattccagtgctaaacatacctaaaataggaaaagtaggaaat
    • gccgtagaaattaaaatgagtctaaaaaaatatgaaacagggttatcttttgaagacctt
    • tttgaaatagaacagataagtaaatatgaatcaagtggtaatgatcaacaatttacagat
    • ggcaagtttattgagatacctaattctgatgaattaaaggcaaaatttgatcaagcaatc
    • acttctcaacatgcttccgacggtgaggtttcattgcaagcctataaagtgttgcttact
    • gaagtagcagatacgatttaccctatcaaagatttgattactaatgaagcaagattacaa
    • gctgttcttaatggtttgcttagtagctatagtgatttaaagctacaggagacttctgcg
    • aagactgtaattatacctgaatttcaagtaggagcaggtggtcgtgtagatatggtaatt
    • caaggtattggtccttcgtctcagggtactaaagaatacactcctatagcgctggaattt
  • 9. New data sources for EMBOSS
    • BioMart access
      • As a sequence database, define sequence, identifier, etc.
      • Need to define a very large number of databases
    • Ensembl access
      • Code from Michael Schuster
      • Ensembl SQL access code in library (access method soon)
      • Same issues as BioMart
    • DAS 1.6 client access planned
    • GMOD access planned
    • BioSQL access planned
  • 10. Data servers
    • Defining individual sequence databases is tedious
    • Many database definitions are similar
    • Simplify (and extend) with server definitions:
      • SRS
      • MRS
      • BioMart
      • Ensembl
      • DAS 1.6
    • Define server
    • USA (Uniform Sequence Address) to give server:dbname:queryfield-value
    • Database name and query field known to user
      • Or reported by a query to the server in an extended showdb
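The `server:dbname:queryfield-value` form proposed above can be split into its parts as sketched below. This is only an illustration of the syntax as written on the slide, not the EMBOSS implementation; the function name, the `ServerQuery` type, and the server, database and field names in the example are all hypothetical.

```python
from typing import NamedTuple


class ServerQuery(NamedTuple):
    server: str
    dbname: str
    field: str
    value: str


def parse_server_usa(usa: str) -> ServerQuery:
    """Split 'server:dbname:queryfield-value' into its four components."""
    parts = usa.split(":", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected server:dbname:queryfield-value, got {usa!r}")
    server, dbname, query = parts
    # Split on the first hyphen only, so the value itself may contain hyphens.
    field, sep, value = query.partition("-")
    if not sep or not field or not value:
        raise ValueError(f"expected queryfield-value, got {query!r}")
    return ServerQuery(server, dbname, field, value)


# Hypothetical names, chosen only to show the shape of the address.
query = parse_server_usa("myserver:mydb:id-ABC-1")
```

Here `query` holds server `myserver`, database `mydb`, field `id` and value `ABC-1`.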
  • 11. New data sources for EMBOSS (2)
    • Non-sequence data
      • Cross-referenced resources from EMBL/UniProt/etc.
      • Useful to return as:
        • Identifiers
        • Text for entries
        • HTML with markup
        • URLs for browsing
    • Dbxref.dat
      • List of all known data resources
      • Standard names
      • Standard queries for sequence, text, HTML, etc
      • Query by identifier and other fields
  • 12. Ontologies
    • Support for OBO format ontologies:
      • Gene Ontology
      • Sequence Ontology (used internally for features)
      • BioSapiens Ontology (used internally for features)
    • Parsing and format validation
    • Indexing with new dbx applications
    • Indexing cross-references in EMBL/UniProt/etc.
    • Navigation up, down, siblings, etc.
    • Remote and local access
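As a rough picture of what parsing an OBO file and navigating "up, down, siblings" involves, the Python sketch below reads `[Term]` stanzas and follows `is_a` links. It is a simplified illustration rather than the EMBOSS code: it handles only the `id`, `name` and `is_a` tags, and the tiny ontology with `EX:` identifiers is invented.

```python
from typing import Dict, List


def parse_obo(text: str) -> Dict[str, dict]:
    """Return {term_id: {"id", "name", "is_a"}} for every [Term] stanza."""
    terms: Dict[str, dict] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # Only [Term] stanzas are kept; [Typedef] and others are skipped.
            current = {"is_a": []} if line == "[Term]" else None
            continue
        if current is None or not line or line.startswith("!"):
            continue
        tag, _, value = line.partition(":")
        value = value.strip()
        if tag == "id":
            current["id"] = value
            terms[value] = current
        elif tag == "name":
            current["name"] = value
        elif tag == "is_a":
            # Drop the trailing "! parent name" comment.
            current["is_a"].append(value.split("!")[0].strip())
    return terms


def parents(terms: Dict[str, dict], term_id: str) -> List[str]:
    """Navigate up: direct is_a parents."""
    return list(terms[term_id]["is_a"])


def children(terms: Dict[str, dict], term_id: str) -> List[str]:
    """Navigate down: terms that name term_id as an is_a parent."""
    return sorted(t for t, rec in terms.items() if term_id in rec["is_a"])


def siblings(terms: Dict[str, dict], term_id: str) -> List[str]:
    """Terms sharing at least one parent with term_id."""
    found = set()
    for parent in parents(terms, term_id):
        found.update(children(terms, parent))
    found.discard(term_id)
    return sorted(found)


# Invented ontology fragment, for illustration only.
EXAMPLE_OBO = """\
format-version: 1.2

[Term]
id: EX:0001
name: root

[Term]
id: EX:0002
name: child a
is_a: EX:0001 ! root

[Term]
id: EX:0003
name: child b
is_a: EX:0001 ! root

[Typedef]
id: part_of
name: part of
"""

terms = parse_obo(EXAMPLE_OBO)
```

Scanning every term to find children is fine for a sketch; an index from parent to children, of the kind the dbx applications would build, avoids the repeated scan on large ontologies.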
  • 13. Ontologies: EDAM
    • EMBRACE Datatypes And Methods
      • OBO format (so far)
    • All ACD files have relations attributes
      • “topic” for application (Immunological analysis)
      • “operation” for application (Epitope mapping)
      • “data” for inputs and outputs
        • Pure protein sequence
          • Sequence record
          • 1 or more
        • Sequence length
        • “Peptide immunogenicity report”
    • Validation by acdvalid application
  • 14. EDAM in ACD
    • application: antigenic [
    • documentation: "Finds antigenic sites in proteins"
    • groups: "Protein:Motifs"
    • relations: "/edam/topic/0000201 Immunological analysis"
    • relations: "/edam/operation/0000416 Epitope mapping"
    • ]
    • seqall: sequence [
    • parameter: "Y"
    • type: "proteinstandard"
    • relations: "/edam/data/0001219 Pure protein sequence"
    • relations: "/edam/data/0000849 Sequence record"
    • relations: "/edam/data/0002178 1 or more"
    • ]
    • integer: minlen [
    • standard: "Y"
    • minimum: "1"
    • maximum: "50"
    • default: "6"
    • information: "Minimum length of antigenic region"
    • relations: "/edam/data/0001249 Sequence length"
    • ]
    • report: outfile [
    • parameter: "Y"
    • rformat: "motif"
    • multiple: "Y"
    • taglist: "int:pos=Max_score_pos"
    • relations: "/edam/data/0001534 Peptide immunogenicity report"
    • ]
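Since ACD files serve as the template for other interfaces, a wrapper might pull the EDAM annotations out of them roughly as follows. This Python sketch is an assumption-laden illustration, not the ACD parser used by EMBOSS or SoapLab: it relies on two regular expressions, expects every attribute value to be double-quoted, and would break on a `]` inside a quoted value. The function name `edam_relations` is hypothetical.

```python
import re
from typing import Dict, List, Tuple

# "type: name [ ... ]" blocks, e.g. "integer: minlen [ ... ]".
BLOCK_RE = re.compile(r'(\w+):\s*(\w+)\s*\[(.*?)\]', re.S)
# 'attribute: "value"' pairs inside a block.
ATTR_RE = re.compile(r'(\w+):\s*"([^"]*)"')


def edam_relations(acd_text: str) -> Dict[str, List[Tuple[str, str]]]:
    """Map each ACD block name to its (EDAM path, label) relations."""
    result: Dict[str, List[Tuple[str, str]]] = {}
    for _kind, name, body in BLOCK_RE.findall(acd_text):
        relations = []
        for attr, value in ATTR_RE.findall(body):
            if attr == "relations":
                # Value looks like "/edam/data/0001249 Sequence length".
                path, _, label = value.strip().partition(" ")
                relations.append((path, label))
        result[name] = relations
    return result


# Abbreviated from the antigenic ACD definition shown on the slide.
ACD_TEXT = '''
application: antigenic [
  documentation: "Finds antigenic sites in proteins"
  groups: "Protein:Motifs"
  relations: "/edam/topic/0000201 Immunological analysis"
  relations: "/edam/operation/0000416 Epitope mapping"
]
integer: minlen [
  standard: "Y"
  default: "6"
  relations: "/edam/data/0001249 Sequence length"
]
'''

relations = edam_relations(ACD_TEXT)
```

The resulting mapping gives each input, output and the application itself a list of EDAM terms, which is the information a service description needs for its annotations.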
  • 15. Ontologies: EDAM (2)
    • SoapLab web services annotated with EDAM
      • EDAM terms parsed from ACD files
      • Web services have WSDL files
      • SAWSDL annotation with EDAM terms
      • Annotation can be used by BioCatalogue
        • www.biocatalogue.org
      • Also can be used by EMBRACE registry
        • www.embraceregistry.net
  • 16. Ontologies: NCBI Taxonomy
    • Parsers for “.dmp” files
    • Will add dbx indexing applications
    • Local and remote access
    • Navigation up, down, siblings (the usual suspects)
    • Automatic cross references from sequence data
      • EMBL source line
      • UniProt OX lines
      • BioMart mart name (organism name)
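The NCBI taxonomy dump files separate fields with tab, pipe, tab and end each row with tab, pipe. The Python sketch below, which is not the EMBOSS parser, reads rows in the `nodes.dmp` layout (tax_id, parent tax_id, rank, ...) and walks up and sideways through the tree. The four-node taxonomy and its identifiers are invented for illustration.

```python
from typing import Dict, List, Tuple


def parse_dmp_line(line: str) -> List[str]:
    """Split one .dmp row on the tab-pipe-tab delimiter."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the row terminator
    return line.split("\t|\t")


def load_nodes(text: str) -> Dict[int, Tuple[int, str]]:
    """Return {tax_id: (parent_tax_id, rank)} from nodes.dmp-style rows."""
    nodes = {}
    for line in text.splitlines():
        if line.strip():
            fields = parse_dmp_line(line)
            nodes[int(fields[0])] = (int(fields[1]), fields[2])
    return nodes


def lineage(nodes: Dict[int, Tuple[int, str]], tax_id: int) -> List[int]:
    """Navigate up: the node, then each ancestor, ending at the root."""
    path = [tax_id]
    while nodes[path[-1]][0] != path[-1]:  # the root is its own parent
        path.append(nodes[path[-1]][0])
    return path


def siblings(nodes: Dict[int, Tuple[int, str]], tax_id: int) -> List[int]:
    """Other nodes that share the same parent."""
    parent = nodes[tax_id][0]
    return sorted(t for t, (p, _rank) in nodes.items()
                  if p == parent and t != tax_id and t != parent)


# Invented rows in nodes.dmp layout, for illustration only.
NODES_DMP = (
    "1\t|\t1\t|\tno rank\t|\n"
    "10\t|\t1\t|\tgenus\t|\n"
    "100\t|\t10\t|\tspecies\t|\n"
    "101\t|\t10\t|\tspecies\t|\n"
)

nodes = load_nodes(NODES_DMP)
```

A cross-reference from a sequence entry (for example the taxon identifier on a UniProt OX line) would be the starting `tax_id` for calls such as `lineage(nodes, 100)`.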
  • 17. EMBOSS Interfaces and wrappers
    • Two releases in the past year
    • Possibly three releases next year
    • Too many for other projects to keep up
      • So we are obliged to help, starting with:
        • SoapLab2
        • Jemboss
        • Galaxy
        • Pipeline Pilot
          • BioPerl
        • wEMBOSS and Explorer
        • G-language?
        • … and anyone else who asks!
  • 18. The EMBOSS Team: Peter Rice, Alan Bleasby, Jon Ison, Mahmut Uludag
  • 19. Acknowledgements
    • EBI: Peter Rice, Alan Bleasby, Jon Ison, Mahmut Uludag, Martin Senger, Tom Oinn, Jaina Mistry, Rodrigo Lopez, Sharmilla Pillai, Hamish McWilliam, Michael Schuster, Syed Haider
    • RFCGR/HGMP: Alan Bleasby, Jon Ison, Tim Carver, Hugh Morgan, Claude Beazley, Lisa Mullan, Damian Counsell, Gary Williams, Val Curwen, Mark Faller, Sinead O’Leary, Thon deBoer, Martin Bishop
    • LION: Thomas Laurent, Bijay Jassal, Bren Vaughan, Thure Etzold
    • Sanger Institute: Ian Longden, Richard Bruskiewich, Simon Kelley
    • National bioinformatics service providers in: Norway, Spain, Italy, Netherlands, Germany, Belgium, Russia, China, Canada, Australia, Argentina
    • Others: Catherine Letondal, Don Gilbert, Rodger Staden, Bill Pearson, Webb Miller, Marie-Laetitia Denayer, Amandine Schurmann, Gabriele Weiler, Luke McCarthy, David Mathog, David Bauer, Henrikki Almusa, Thomas Siegmund, Scott Markel, Darryl Leon, Bastien Chevreux, Ivo Hofacker, Kristoffer Rapacki, Matus Kalas
    • IBM, Hewlett-Packard (Compaq), Apple, SGI, Sun, LION bioscience, SciTegic, Cambridge University Press
    • Open-Bio Foundation, SourceForge, ... and the British Antarctic Survey
    • http://emboss.sourceforge.net
    • http://emboss.open-bio.org/wiki