Rice bosc2010 emboss
Published in: Technology
Transcript

  • 1. EMBOSS: European Molecular Biology Open Software Suite. Open-Bio Project Update 2010. Peter Rice, pmr@ebi.ac.uk
  • 2. A quick introduction
    • Open source package for sequence analysis
      • ANSI C source code
      • GPL licensed applications, LGPL libraries
      • 200+ applications
      • 100+ third party applications in 15 associated packages
        • MIRA, MEME, HMMER, PHYLIP, etc.
      • Project started 1996 at Sanger and HGMP
      • Now based at EBI
      • Release 1.0.0 15th July 2000
      • Release 6.3.0 15th July 2010
      • Funded by UK-BBSRC and EMBL-EBI
      • Originally funded by the Wellcome Trust
      • Additional funds from UK-MRC
    BOSC 2010: EMBOSS 11.07.10
  • 3. Who do we serve?
    • Expert software developers
      • Bioinformaticians
      • Computer scientists
    • Expert users
      • Biology research community
      • Industry
    • Scientific users
      • Biology research community
      • Industry
  • 4. EMBOSS command line interface
    • EMBOSS applications run from the command line
    • This is not the only interface
      • There are over 100 interfaces and packaged systems available
        • Web: wEMBOSS
        • GUI: Jemboss
        • Web Services: SoapLab
        • Workflows: Galaxy, Taverna
        • Windows: mEMBOSS
    • All applications have a command definition file (.acd)
      • Defines all inputs, outputs, and other options
      • Read at startup
      • Contains all command line options with descriptions
      • Template for any other interface
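As a rough illustration of the command line interface described above, the sketch below assembles an argument list for one EMBOSS application in Python. `seqret` and its `-sequence`, `-outseq` and `-auto` qualifiers are standard EMBOSS names; the helper `emboss_argv` and the file names are hypothetical, and the command is only built here, not executed.

```python
import shlex


def emboss_argv(program, **qualifiers):
    """Build an argument list for an EMBOSS program.

    Each keyword becomes a "-name value" pair; a value of True becomes a
    bare "-name" switch. This helper is illustrative, not part of EMBOSS.
    """
    argv = [program]
    for name, value in qualifiers.items():
        argv.append("-" + name)
        if value is not True:
            argv.append(str(value))
    return argv


# Hypothetical file names; -auto turns off interactive prompting.
argv = emboss_argv("seqret", sequence="input.fasta",
                   outseq="output.fasta", auto=True)
print(shlex.join(argv))
```

On a machine with EMBOSS installed, the list could be passed to `subprocess.run(argv, check=True)`. Keeping the arguments as a list avoids shell quoting problems with file names.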
  • 5. EMBOSS Update
    • Release 6.3.0 as usual on 15th July 2010
    • New support for NGS sequence formats
    • Adaptor detection added to supermatcher
    • Metadata and ontologies
    • Full set of public data resources
    • Three open source books: users, developers, admin
      • Cambridge University Press
  • 6. NGS sequence formats
    • SAM format: tab-delimited short read data
    • BAM format: binary compressed SAM format
      • More work needed on remote access to mapped reads
    • FASTQ short reads and quality scores
      • OpenBio project collaboration on format standards
      • Improved error detection (for all formats)
      • Improved performance for input and output
      • Indexing in dbxflat
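To make "tab-delimited short read data" concrete, here is a minimal Python sketch that splits one SAM alignment line into its eleven mandatory columns plus optional tags. It is not EMBOSS code: the field names follow the current SAM specification, and the example alignment (read name, reference name, position) is invented for illustration.

```python
from typing import NamedTuple, Tuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # quality string (Phred + 33)
    tags: Tuple[str, ...]  # optional TAG:TYPE:VALUE fields


def parse_sam_line(line):
    """Split one SAM alignment line into typed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq), cigar,
                     rnext, int(pnext), int(tlen), seq, qual, tuple(fields[11:]))


# Invented example alignment, for illustration only.
example = ("read1\t0\tchr1\t100\t60\t25M\t*\t0\t0\t"
           "CCCTTCTTGTCTTCAGCGTTTCTCC\t;;3;;;;;;;;;;;;7;;;;;;;88\tNM:i:0")
record = parse_sam_line(example)
print(record.qname, record.rname, record.pos, record.cigar)
```

Header lines (those starting with `@`) would need to be skipped before calling this function.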
  • 7. NGS sequence formats
    • FASTQ joint effort with Bio* projects
    • Definition of 3 conflicting FASTQ formats
    • Agreement on standard parsing procedures
    • @EAS54_6_R1_2_1_413_324
    • CCCTTCTTGTCTTCAGCGTTTCTCC
    • +EAS54_6_R1_2_1_413_324
    • ;;3;;;;;;;;;;;;7;;;;;;;88
    • @EAS54_6_R1_2_1_443_348
    • GTTGCTTCTGGCGTGGGTGGGGGGG
    • +EAS54_6_R1_2_1_443_348
    • ;;;;;;;;;;;9;7;;.7;393333
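The two records above can be read with a few lines of Python. This is a minimal sketch, not the EMBOSS parser: it assumes unwrapped four-line records, and it treats the three conflicting variants as Sanger (Phred scores, ASCII offset 33), Solexa (Solexa scores, offset 64) and Illumina 1.3+ (Phred scores, offset 64). The function names are hypothetical.

```python
# ASCII offsets for the three FASTQ variants. Note that Solexa values are
# Solexa-scaled scores rather than Phred scores, even after the offset
# has been subtracted.
OFFSETS = {"sanger": 33, "solexa": 64, "illumina": 64}


def read_fastq(lines):
    """Yield (identifier, sequence, quality) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        rest = [next(it, None) for _ in range(3)]
        if None in rest:
            raise ValueError("truncated record: " + header)
        seq, plus, qual = (s.rstrip("\n") for s in rest)
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("bad record markers near: " + header)
        if plus[1:] and plus[1:] != header[1:]:
            raise ValueError("'+' line does not repeat the identifier")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


def decode_quality(qual, variant="sanger"):
    """Convert a quality string to integer scores for the given variant."""
    offset = OFFSETS[variant]
    return [ord(ch) - offset for ch in qual]


FASTQ_TEXT = """\
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+EAS54_6_R1_2_1_413_324
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
;;;;;;;;;;;9;7;;.7;393333
"""

records = list(read_fastq(FASTQ_TEXT.splitlines()))
for name, seq, qual in records:
    print(name, len(seq), decode_quality(qual)[:5])
```

The variant cannot always be determined from the file alone, which is why `decode_quality` takes it as an explicit argument instead of guessing.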
  • 8. Other sequence formats
    • >AB036666 AB036666 Wolbachia sp. wKue genes
    • cattactatttcagtcgagacatattaggtcaatcaattttaatcaacaagattggtcaa
    • gatcaaagtaacattaaaaaatatatatactcatatggtgagtaccctctgaactggcct
    • cagggaacagaatacactttatctaacagccctgttacaacattaatatttgttcaaggt
    • aatgaaggacaagaaaaaacagcattcatttttcatatacgagagtccaatacaaaggaa
    • ttctatgctgataaaaaaattccagtgctaaacatacctaaaataggaaaagtaggaaat
    • gccgtagaaattaaaatgagtctaaaaaaatatgaaacagggttatcttttgaagacctt
    • tttgaaatagaacagataagtaaatatgaatcaagtggtaatgatcaacaatttacagat
    • ggcaagtttattgagatacctaattctgatgaattaaaggcaaaatttgatcaagcaatc
    • acttctcaacatgcttccgacggtgaggtttcattgcaagcctataaagtgttgcttact
    • gaagtagcagatacgatttaccctatcaaagatttgattactaatgaagcaagattacaa
    • gctgttcttaatggtttgcttagtagctatagtgatttaaagctacaggagacttctgcg
    • aagactgtaattatacctgaatttcaagtaggagcaggtggtcgtgtagatatggtaatt
    • caaggtattggtccttcgtctcagggtactaaagaatacactcctatagcgctggaattt
  • 9. New data sources for EMBOSS
    • BioMart access
      • As a sequence database, define sequence, identifier, etc.
      • Need to define a very large number of databases
    • Ensembl access
      • Code from Michael Schuster
      • Ensembl SQL access code in library (access method soon)
      • Same issues as BioMart
    • DAS 1.6 client access planned
    • GMOD access planned
    • BioSQL access planned
  • 10. Data servers
    • Defining individual sequence databases is tedious
    • Many database definitions are similar
    • Simplify (and extend) with server definitions:
      • SRS
      • MRS
      • BioMart
      • Ensembl
      • DAS 1.6
    • Define server
    • USA (Uniform Sequence Address) of the form server:dbname:queryfield-value
    • Database name and query field known to user
      • Or reported by a query to the server in an extended showdb
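One way to read the proposed `server:dbname:queryfield-value` address is sketched below in Python. This is only an interpretation of the syntax on the slide, not EMBOSS code: the function name, the splitting rules (first two colons, first hyphen) and the example server, database and identifier are all assumptions.

```python
from typing import NamedTuple


class ServerQuery(NamedTuple):
    server: str
    dbname: str
    field: str
    value: str


def parse_server_usa(usa):
    """Split 'server:dbname:queryfield-value' into its four parts.

    Assumes the server and database names contain no ':' and the query
    field contains no '-'; the value may contain either character.
    """
    parts = usa.split(":", 2)
    if len(parts) != 3:
        raise ValueError("expected server:dbname:queryfield-value")
    server, dbname, query = parts
    field, sep, value = query.partition("-")
    if not (server and dbname and field and sep and value):
        raise ValueError("expected server:dbname:queryfield-value")
    return ServerQuery(server, dbname, field, value)


# Hypothetical server, database and identifier.
query = parse_server_usa("ensembl:homo_sapiens_core:id-ENSG00000139618")
print(query)
```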
  • 11. New data sources for EMBOSS (2)
    • Non-sequence data
      • Cross-referenced resources from EMBL/UniProt/etc.
      • Useful to return as:
        • Identifiers
        • Text for entries
        • HTML with markup
        • URLs for browsing
    • Dbxref.dat
      • List of all known data resources
      • Standard names
      • Standard queries for sequence, text, HTML, etc
      • Query by identifier and other fields
  • 12. Ontologies
    • Support for OBO format ontologies:
      • Gene Ontology
      • Sequence Ontology (used internally for features)
      • BioSapiens Ontology (used internally for features)
    • Parsing and format validation
    • Indexing with new dbx applications
    • Indexing cross-references in EMBL/UniProt/etc.
    • Navigation up, down, siblings, etc.
    • Remote and local access
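The parsing and navigation steps listed above can be sketched in Python. This is a deliberately small reader, not the EMBOSS implementation: it only handles `[Term]` stanzas and the `id`, `name` and `is_a` tags, and the three Gene Ontology terms are a hand-written excerpt. All function names are hypothetical.

```python
def parse_obo(text):
    """Return {term_id: {"id", "name", "is_a"}} for the [Term] stanzas."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are kept.
            current = {"is_a": []} if line == "[Term]" else None
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            value = value.split(" ! ", 1)[0].strip()  # drop trailing comment
            if tag == "id":
                current["id"] = value
                terms[value] = current
            elif tag == "name":
                current["name"] = value
            elif tag == "is_a":
                current["is_a"].append(value)
    return terms


def parents(terms, term_id):
    """Navigate up: direct is_a parents."""
    return list(terms[term_id]["is_a"])


def children(terms, term_id):
    """Navigate down: terms that name term_id as an is_a parent."""
    return sorted(t for t, rec in terms.items() if term_id in rec["is_a"])


def siblings(terms, term_id):
    """Terms sharing at least one parent with term_id."""
    found = set()
    for parent in parents(terms, term_id):
        found.update(children(terms, parent))
    found.discard(term_id)
    return sorted(found)


OBO_TEXT = """\
format-version: 1.2

[Term]
id: GO:0008150
name: biological_process

[Term]
id: GO:0008152
name: metabolic process
is_a: GO:0008150 ! biological_process

[Term]
id: GO:0009987
name: cellular process
is_a: GO:0008150 ! biological_process
"""

terms = parse_obo(OBO_TEXT)
print(children(terms, "GO:0008150"))
print(siblings(terms, "GO:0008152"))
```

A linear scan per `children` call is fine for a sketch; an indexed store would be the natural choice for full-size ontologies.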
  • 13. Ontologies: EDAM
    • EMBRACE Datatypes And Methods
      • OBO format (so far)
    • All ACD files have relations attributes
      • “topic” for application (Immunological analysis)
      • “operation” for application (Epitope mapping)
      • “data” for inputs and outputs
        • Pure protein sequence
          • Sequence record
          • 1 or more
        • Sequence length
        • “Peptide immunogenicity report”
    • Validation by acdvalid application
  • 14. EDAM in ACD
    • application: antigenic [
    • documentation: "Finds antigenic sites in proteins"
    • groups: "Protein:Motifs"
    • relations: "/edam/topic/0000201 Immunological analysis"
    • relations: "/edam/operation/0000416 Epitope mapping"
    • ]
    • seqall: sequence [
    • parameter: "Y"
    • type: "proteinstandard"
    • relations: "/edam/data/0001219 Pure protein sequence"
    • relations: "/edam/data/0000849 Sequence record"
    • relations: "/edam/data/0002178 1 or more"
    • ]
    • integer: minlen [
    • standard: "Y" minimum: "1" maximum: "50" default: "6"
    • information: "Minimum length of antigenic region"
    • relations: "/edam/data/0001249 Sequence length"
    • ]
    • report: outfile [
    • parameter: "Y"
    • rformat: "motif"
    • multiple: "Y"
    • taglist: "int:pos=Max_score_pos"
    • relations: "/edam/data/0001534 Peptide immunogenicity report"
    • ]
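To show how the `relations` attributes in an ACD file like this one could be harvested by another tool, here is a minimal Python sketch using a regular expression. It is not how EMBOSS or SoapLab parse ACD files: the pattern and function name are assumptions, and the result is a flat list that does not record which ACD section each relation came from.

```python
import re

# Matches: relations: "/edam/<branch>/<number> <term name>"
# Optional whitespace is allowed just inside the quotes.
RELATION = re.compile(r'relations:\s*"\s*/edam/(\w+)/(\d+)\s+([^"]*?)\s*"')


def edam_relations(acd_text):
    """Return (branch, identifier, label) for every EDAM relation found."""
    return [match.groups() for match in RELATION.finditer(acd_text)]


# Excerpt of the antigenic ACD definition shown on the slide.
ACD_TEXT = '''
application: antigenic [
  documentation: "Finds antigenic sites in proteins"
  groups: "Protein:Motifs"
  relations: "/edam/topic/0000201 Immunological analysis"
  relations: "/edam/operation/0000416 Epitope mapping"
]
seqall: sequence [
  parameter: "Y"
  type: "proteinstandard"
  relations: "/edam/data/0001219 Pure protein sequence"
  relations: "/edam/data/0000849 Sequence record"
  relations: "/edam/data/0002178 1 or more"
]
'''

relations = edam_relations(ACD_TEXT)
for branch, number, label in relations:
    print(branch, number, label)
```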
  • 15. Ontologies: EDAM (2)
    • SoapLab web services annotated with EDAM
      • EDAM terms parsed from ACD files
      • Web services have WSDL files
      • SAWSDL annotation with EDAM terms
      • Annotation can be used by BioCatalogue
        • www.biocatalogue.org
      • Also can be used by EMBRACE registry
        • www.embraceregistry.net
  • 16. Ontologies: NCBI Taxonomy
    • Parsers for “.dmp” files
    • Will add dbx indexing applications
    • Local and remote access
    • Navigation up, down, siblings (the usual suspects)
    • Automatic cross references from sequence data
      • EMBL source line
      • UniProt OX lines
      • BioMart mart name (organism name)
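The NCBI taxonomy ".dmp" files separate fields with a tab, a vertical bar and a tab, and end each line with a tab and a bar. The Python sketch below reads that layout and walks up the tree. It is not the EMBOSS parser: the function names are hypothetical, and the two samples are hand-written excerpts in which the node lines are cut down to their first three columns (taxon, parent, rank).

```python
def split_dmp(line):
    """Split one .dmp line on the tab-bar-tab field separator."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the end-of-line terminator
    return line.split("\t|\t")


def load_parents(node_lines):
    """Map tax_id -> parent tax_id from nodes.dmp style lines."""
    parents = {}
    for line in node_lines:
        fields = split_dmp(line)
        parents[int(fields[0])] = int(fields[1])
    return parents


def load_scientific_names(name_lines):
    """Map tax_id -> scientific name from names.dmp style lines."""
    names = {}
    for line in name_lines:
        tax_id, name, _unique, name_class = split_dmp(line)[:4]
        if name_class == "scientific name":
            names[int(tax_id)] = name
    return names


def lineage(parents, tax_id):
    """Navigate up from tax_id until the root or an unknown parent."""
    path = [tax_id]
    while tax_id in parents and parents[tax_id] != tax_id:
        tax_id = parents[tax_id]
        path.append(tax_id)
    return path


# Hand-written excerpts, truncated for illustration.
NODES = [
    "9606\t|\t9605\t|\tspecies\t|\n",
    "9605\t|\t207598\t|\tgenus\t|\n",
    "9598\t|\t9596\t|\tspecies\t|\n",
    "9596\t|\t207598\t|\tgenus\t|\n",
]
NAMES = [
    "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n",
    "9606\t|\thuman\t|\t\t|\tgenbank common name\t|\n",
    "9605\t|\tHomo\t|\t\t|\tscientific name\t|\n",
    "207598\t|\tHomininae\t|\t\t|\tscientific name\t|\n",
]

parents = load_parents(NODES)
names = load_scientific_names(NAMES)
print([names.get(t, str(t)) for t in lineage(parents, 9606)])
```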
  • 17. EMBOSS Interfaces and wrappers
    • Two releases in the past year
    • Possibly three releases next year
    • Too many for other projects to keep up
      • So we are obliged to help, starting with:
        • SoapLab2
        • Jemboss
        • Galaxy
        • Pipeline Pilot
          • BioPerl
        • wEMBOSS and Explorer
        • G-language?
        • … and anyone else who asks!
  • 18. The EMBOSS Team: Peter Rice, Alan Bleasby, Jon Ison, Mahmut Uludag
  • 19. Acknowledgements
    • EBI: Peter Rice, Alan Bleasby, Jon Ison, Mahmut Uludag, Martin Senger, Tom Oinn, Jaina Mistry, Rodrigo Lopez, Sharmilla Pillai, Hamish McWilliam, Michael Schuster, Syed Haider
    • RFCGR/HGMP: Alan Bleasby, Jon Ison, Tim Carver, Hugh Morgan, Claude Beazley, Lisa Mullan, Damian Counsell, Gary Williams, Val Curwen, Mark Faller, Sinead O’Leary, Thon deBoer, Martin Bishop
    • LION: Thomas Laurent, Bijay Jassal, Bren Vaughan, Thure Etzold
    • Sanger Institute: Ian Longden, Richard Bruskiewich, Simon Kelley
    • National bioinformatics service providers in: Norway, Spain, Italy, Netherlands, Germany, Belgium, Russia, China, Canada, Australia, Argentina
    • Others: Catherine Letondal, Don Gilbert, Rodger Staden, Bill Pearson, Webb Miller, Marie-Laetitia Denayer, Amandine Schurmann, Gabriele Weiler, Luke McCarthy, David Mathog, David Bauer, Henrikki Almusa, Thomas Siegmund, Scott Markel, Darryl Leon, Bastien Chevreux, Ivo Hofacker, Kristoffer Rapacki, Matus Kalas
    • IBM, Hewlett-Packard, (Compaq), Apple, SGI, Sun, LION bioscience, SciTegic, Cambridge University Press
    • Open-Bio Foundation, SourceForge, ... and the British Antarctic Survey
    • http://emboss.sourceforge.net
    • http://emboss.open-bio.org/wiki
