Adding Meaning To Your Data
Biosapiens talk 2007-12-04

Usage Rights

CC Attribution-ShareAlike License

Presentation Transcript

  • Making your data more meaningful: using BioMOBY and myGrid Taverna services as examples. Duncan Hull, University of Manchester. BioSapiens Network of Excellence, 2007-12-04
  • Outline
    • Semantic Web redux
      • Semantic web stack,
      • There are lots of TLAs (three-letter acronyms) in this talk; this is unavoidable
      • XML, URI, Namespaces, Unicode,
      • RDF, RDFS, OWL
    • In this talk when I say “data” I usually mean Web Services…
      • Example: InterProScan
      • BioMOBY http://www.biomoby.org
      • myGrid Taverna http://www.mygrid.org.uk
      • myExperiment http://www.myexperiment.org
    • Conclusions
      • What did we learn?
  • The semantic web “knows what you mean” thanks to added meaning
    • According to Tim Berners-Lee, Ora Lassila and Jim Hendler http://scholar.google.com/scholar?q=semantic+web
    • … and Mark Butler from HP labs http://www.flickr.com/photos/dullhunk/303503677/
    • Vague, audacious, “visionary”, controversial and/or doomed (depending on who you ask)
    • in practice this means…
  • Semantic Web “stack”
    • A suite of technology and standards* for adding meaning to data
    • Taken from “Semantic Web Architecture: Stack or Two Towers?” by Ian Horrocks et al see http://dx.doi.org/10.1007/11552222_4 and http://www.flickr.com/photos/dullhunk/415645490/
  • … But first, InterProScan…
    • InterProScan: Protein domains identifier @ EBI http://view.ncbi.nlm.nih.gov/pubmed/15980438
    • http://www.ebi.ac.uk/Tools/webservices/rest/submit?tool=iprscan&sequence=uniprot:slpi_human&seqtype=P&email=homer@simpsons.com
    • That horrendously long URI submits a job to InterProScan with 4 parameters
      • ?tool=iprscan “use the InterProScan tool”
      • &sequence=uniprot:slpi_human “…with the sequence secretory leukocyte proteinase inhibitor (SLPI) in UniProt format”
      • &seqtype=P “sequence type is protein (e.g. not DNA)”
      • &email=homer@simpsons.com “email results to Homer Simpson”
      • Returns a job identifier e.g. iprscan-20071203-18053660
      • http://www.ebi.ac.uk/cgi-bin/iprscan/iprscan?tool=iprscan&jobid=iprscan-20071203-18053660
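The way the submission URI is assembled from its four parameters can be sketched in Python. This is a minimal sketch that only builds and parses the string; it does not contact the EBI service, and the variable names (`BASE`, `params`, `submit_uri`) are illustrative choices, not part of any EBI API.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base address of the REST submission endpoint, copied from the slide.
BASE = "http://www.ebi.ac.uk/Tools/webservices/rest/submit"

# The four parameters described on the slide, in the same order.
params = {
    "tool": "iprscan",
    "sequence": "uniprot:slpi_human",
    "seqtype": "P",
    "email": "homer@simpsons.com",
}

# safe=":@" leaves ':' and '@' unescaped so the result matches the slide.
submit_uri = BASE + "?" + urlencode(params, safe=":@")
print(submit_uri)

# Going the other way: split a URI back into its named parameters.
query = urlsplit(submit_uri).query
recovered = {key: values[0] for key, values in parse_qs(query).items()}
print(recovered)
```

Keeping the parameters in a dictionary makes each one visible by name, which is the point of the slide: every part of the long URI carries a specific meaning.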
  • Back to the Stack
  • URI, Unicode, XML and Namespaces
    • Bottom of semantic web stack:
    • Namespaces http://www.w3.org/TR/xml-names/
    • eXtensible Markup Language (XML) http://www.w3.org/TR/xml
    • Uniform Resource Identifiers (URI) http://www.ietf.org/rfc/rfc3986
    • Unicode http://www.unicode.org/charts
  • URI: Uniform Resource IDENTIFIER
    • URIs include the Uniform Resource Locators (URLs) that most people are familiar with for locating things, usually just called “links”…
      • E.g. http://www.biosapiens.info locator for the biosapiens website
      • E.g. http://view.ncbi.nlm.nih.gov/pubmed/16015280 locates a biosapiens publication
      • Not persistent: sometimes unstable, and they break, e.g. “404 not found”
      • Not guaranteed to be unique
    • URIs include Uniform Resource Names (URNs) for naming things that are less familiar like ISBN, Digital Object Identifiers (DOI) and Life Science Identifiers (LSID) etc
      • E.g. urn:doi:10.1038/sj.ejhg.5201470 names a publication using DOI
      • E.g. urn:isbn:0387484361 names a book using ISBN
      • E.g. urn:lsid:ncbi.nlm.nih.gov.lsid.biopathways.org:genbank:bx247883 names a biological sequence using LSID
      • Unlike URLs, URNs are UNIQUE and PERSISTENT
    • URIs can be names, identifiers or locators (sometimes all three)
    • See URI Generic syntax http://www.ietf.org/rfc/rfc3986 and URN syntax http://www.ietf.org/rfc/rfc2141 from the Internet Engineering Task Force (IETF)
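The locator/name distinction can be seen by splitting the example URIs with Python's standard library. This is a minimal sketch: the `describe` helper is a hypothetical name, and classifying anything outside the `urn` scheme as a locator is a simplification made for illustration.

```python
from urllib.parse import urlparse

def describe(uri):
    """Return (kind, authority_or_namespace, rest) for a URI string."""
    parts = urlparse(uri)
    if parts.scheme == "urn":
        # URN syntax (RFC 2141): urn:<namespace id>:<namespace-specific string>
        namespace_id, _, specific = parts.path.partition(":")
        return ("name", namespace_id, specific)
    # Simplification: every other scheme is treated here as a locator (URL).
    return ("locator", parts.netloc, parts.path)

examples = [
    "http://view.ncbi.nlm.nih.gov/pubmed/16015280",
    "urn:doi:10.1038/sj.ejhg.5201470",
    "urn:isbn:0387484361",
    "urn:lsid:ncbi.nlm.nih.gov.lsid.biopathways.org:genbank:bx247883",
]
for uri in examples:
    print(describe(uri))
```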
  • Unicode: Boring but important
    • Unicode provides a unique number for every character
      • no matter what the platform (Windows, Unix, iPhone, toaster etc)
      • no matter what the program (protein database, email client etc)
      • no matter what the language (English, Chinese, Swahili… you name it)
    • E.g. U+0041 is the number for “LATIN CAPITAL LETTER A”
    • E.g. U+0F03 is the number for “TIBETAN MARK GTER YIG MGO -UM GTER TSHEG MA”
    • E.g. U+221E is the number for “INFINITY”
    • http://www.unicode.org/standard/WhatIsUnicode.html http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode
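The mapping from number to character can be checked with Python's `unicodedata` module. This is a minimal sketch; `code_point_info` is a hypothetical helper name, and the UTF-8 bytes are included only to show that the same number can be serialised identically on any platform.

```python
import unicodedata

def code_point_info(code_point):
    """Return (U+XXXX label, official character name, UTF-8 bytes)."""
    char = chr(code_point)
    return (f"U+{code_point:04X}", unicodedata.name(char), char.encode("utf-8"))

# The three code points used as examples on the slide.
for code_point in (0x0041, 0x0F03, 0x221E):
    print(code_point_info(code_point))
```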
  • XML: eXtensible Markup Language, boring but incredibly useful
    • Data marked using tags as “trees”
    • <operation name="InterProScan" method="get">
    • <request>
    • <parameter name="sequence" type="xsd:string" required="true"/>
    • </request>
    • <response>
    • <representation mediaType="text/xml" element="yn:ResultSet">
    • <parameter name="totalResults" type="xsd:nonNegativeInteger"/>
    • </representation>
    • </response>
    • </operation>
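To see why XML is described as a “tree”, the operation description can be parsed and walked with Python's standard library. This is a minimal sketch: the XML text is the slide's example with its closing tags completed so that it parses, and the `walk` helper is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# The slide's example, with closing tags completed so that it is well-formed.
DESCRIPTION = """
<operation name="InterProScan" method="get">
  <request>
    <parameter name="sequence" type="xsd:string" required="true"/>
  </request>
  <response>
    <representation mediaType="text/xml" element="yn:ResultSet">
      <parameter name="totalResults" type="xsd:nonNegativeInteger"/>
    </representation>
  </response>
</operation>
"""

def walk(element, depth=0):
    """Yield (depth, tag, attributes) for every node, parents before children."""
    yield depth, element.tag, dict(element.attrib)
    for child in element:
        yield from walk(child, depth + 1)

root = ET.fromstring(DESCRIPTION.strip())
tree_rows = list(walk(root))
for depth, tag, attributes in tree_rows:
    # Indentation reflects nesting depth, i.e. position in the tree.
    print("  " * depth + tag, attributes)
```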
  • Namespaces
    • “XML namespaces provide a simple method for qualifying element and attribute names used in XML by associating them with namespaces identified by URI references.”
    • <?xml version="1.0" standalone="yes"?>
    • xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    • xmlns:xsd="http://www.w3.org/2001/XMLSchema" …
    • What this means is that
    • <xsi:fred> and <xsd:fred>
    • are different because they belong to different namespaces. “xsi:fred” is shorthand for http://www.w3.org/2001/XMLSchema-instance:fred. This allows us to have lots of different things called “fred”
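The two “fred” elements can be told apart programmatically. This is a minimal sketch: the wrapping `<root>` element is an assumption added so that the fragment parses. Note that Python's ElementTree writes the expanded name as `{namespace-URI}fred` rather than the `URI:fred` shorthand used on the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical document declaring both prefixes and using "fred" in each.
DOC = """
<root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsi:fred/>
  <xsd:fred/>
</root>
"""

root = ET.fromstring(DOC.strip())

# After parsing, each prefix has been replaced by the namespace URI it stands for.
expanded = [child.tag for child in root]
for tag in expanded:
    print(tag)

# Same local name, different namespaces, therefore different elements.
print(expanded[0] != expanded[1])
```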
  • Describing Web Services
    • This (XML, URI, namespaces + Unicode) gives us enough to describe Web Services
    • There are two styles of services on the Web: “RESTful” and “RESTless”
      • “RESTful” (usually no SOAP): described with Web Application Description Language (WADL) https://wadl.dev.java.net/
      • “RESTless” (uses SOAP and WSDL): described with Web Services Description Language (WSDL) http://www.w3.org/TR/wsdl
      • Most services you’ll come across in bioinformatics are the latter… but that might change
  • WSDL, WuzzDuLL, MiserabuL…
    • WSDL http://www.ebi.ac.uk/Tools/webservices/wsdl/WSInterProScan.wsdl
    • Describes Inputs and Outputs, tells us how to interact with service
    • Registries of Web Services, like myGrid and BioMOBY, use these WSDLs to build their index
    • But it's really difficult to find services based on information in their WSDL
      • Poor metadata e.g. input “string” name “in0”, “in1”, “out1” etc
      • Often auto-generated by tools, not humans
      • No constraints on what can be said
      • Machine readable but not very human readable
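The “poor metadata” problem can be illustrated by scanning a WSDL for auto-generated part names. This is a minimal sketch under stated assumptions: the fragment below is a hypothetical WSDL 1.1 snippet invented for illustration (it is not the real WSInterProScan.wsdl), and the `UNINFORMATIVE` pattern is one guess at what counts as an uninformative name.

```python
import re
import xml.etree.ElementTree as ET

WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"

# Hypothetical WSDL 1.1 fragment with auto-generated part names.
FRAGMENT = """
<definitions xmlns="http://schemas.xmlsoap.org/wsdl/"
             xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <message name="runRequest">
    <part name="in0" type="xsd:string"/>
    <part name="in1" type="xsd:string"/>
  </message>
  <message name="runResponse">
    <part name="out1" type="xsd:string"/>
  </message>
</definitions>
"""

# Assumed heuristic: names such as in0, in1, out1, arg2 say nothing about content.
UNINFORMATIVE = re.compile(r"^(in|out|arg|param)\d*$", re.IGNORECASE)

def uninformative_parts(wsdl_text):
    """List (message, part name, type) for parts whose names carry no meaning."""
    root = ET.fromstring(wsdl_text.strip())
    found = []
    for message in root.iter(f"{{{WSDL_NS}}}message"):
        for part in message.iter(f"{{{WSDL_NS}}}part"):
            if UNINFORMATIVE.match(part.get("name", "")):
                found.append((message.get("name"), part.get("name"), part.get("type")))
    return found

for row in uninformative_parts(FRAGMENT):
    print(row)
```

A machine can read every one of these parts, yet nothing in them says that the string is a protein sequence, which is the gap the RDF and OWL layers are meant to fill.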
  • RDF and OWL: Adding metadata and semantics
    • Web Ontology Language (OWL) http://www.w3.org/TR/owl-features/
    • Resource Description Framework (RDF) http://www.w3.org/TR/rdf-primer/ (M&S stands for model and syntax)
    • RDF Schema (RDFS) http://www.w3.org/TR/rdf-schema/
  • RDF and RDF schema
    • RDF is just triples of (subject, verb, object); we can say things about services like
      • InterProScan isA service
      • InterProScan isA protein_domains_identifier
      • InterProScan hasInput protein_sequence
      • InterProScan hasOutput InterProScan_report
    • The idea is simple …
    • … but unfortunately the specifications and syntax are horrible to read and write
      • But see http://notabug.com/2002/rdfprimer/ by Aaron Swartz
      • RDF Schema gives us “templates” for RDF
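The triple idea needs no special syntax to try out. This is a minimal sketch in plain Python, not real RDF: the statements are the four from the slide stored as tuples, and `match` is a hypothetical helper that treats `None` as a wildcard.

```python
# The four statements from the slide as (subject, verb, object) tuples.
TRIPLES = {
    ("InterProScan", "isA", "service"),
    ("InterProScan", "isA", "protein_domains_identifier"),
    ("InterProScan", "hasInput", "protein_sequence"),
    ("InterProScan", "hasOutput", "InterProScan_report"),
}

def match(triples, subject=None, verb=None, obj=None):
    """Return the sorted triples matching a pattern; None acts as a wildcard."""
    return sorted(
        triple for triple in triples
        if (subject is None or triple[0] == subject)
        and (verb is None or triple[1] == verb)
        and (obj is None or triple[2] == obj)
    )

# "Which services take a protein sequence as input?"
print(match(TRIPLES, verb="hasInput", obj="protein_sequence"))

# "What kinds of thing is InterProScan?"
print(match(TRIPLES, subject="InterProScan", verb="isA"))
```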
  • BioMOBY.org
    • A registry of annotated Web Services:
    • BioMOBY has three ontologies
      • Namespace e.g. genbank
      • Object e.g. protein_sequence (inputs and outputs)
      • Service (tasks, e.g. alignment)
      • And an API too, which lets users add terms to ontology when they register services
      • Everything in BioMOBY is annotated (unlike myGrid and myExperiment)
      • Ontologies and Services are available from:
        • http://biomoby.org/cgi-bin/serviceList
        • http://biomoby.org/RESOURCES/MOBY-S/Namespaces
        • http://biomoby.org/RESOURCES/MOBY-S/Objects
        • http://biomoby.org/RESOURCES/MOBY-S/Services
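How annotations drawn from three ontologies make services findable can be sketched as follows. This is an illustration only and does not use the BioMOBY API: the `Annotation` class, the `find` helper, the three registry entries and any terms not named on the slide are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    service: str        # hypothetical service name
    service_type: str   # term from the Service ontology, e.g. "alignment"
    input_object: str   # term from the Object ontology
    output_object: str  # term from the Object ontology
    namespace: str      # term from the Namespace ontology, e.g. "genbank"

# Hypothetical registry entries, invented for illustration.
REGISTRY = [
    Annotation("exampleAligner", "alignment",
               "protein_sequence", "alignment_report", "genbank"),
    Annotation("exampleDomainFinder", "protein_domains_identifier",
               "protein_sequence", "InterProScan_report", "uniprot"),
    Annotation("exampleTranslator", "translation",
               "DNA_sequence", "protein_sequence", "genbank"),
]

def find(registry, **criteria):
    """Names of services whose annotation matches every given field=value."""
    return [
        entry.service for entry in registry
        if all(getattr(entry, field) == value for field, value in criteria.items())
    ]

print(find(REGISTRY, input_object="protein_sequence"))
print(find(REGISTRY, service_type="alignment", namespace="genbank"))
```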
  • myGrid Taverna
    • myGrid has a registry of services
      • Many aren’t annotated
      • … but arbitrary services can be added, not just BioMOBY
      • Lovingly curated by Franck Tanoh and Katy Wolstencroft
      • Using a single ontology http://www.mygrid.org.uk/ontology
    • Accessible from Taverna workflow engine
    • myGrid makes a bit more use of OWL but not much
  • Web Ontology Language (OWL)
    • RDF and RDFS provide limited capabilities for reasoning
      • All men are mortal
      • Socrates is a man
      • --------------------------------------
      • Therefore Socrates is mortal
    • Do this using deductive reasoners like FaCT++, Pellet, KAON2 etc
    • Ulrike Sattler's list of reasoners
      • http://www.cs.man.ac.uk/~sattler/reasoners.html
    • Socrates picture: http://flickr.com/photos/dullhunk/337473755/
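The Socrates deduction can be mimicked with a few lines of forward chaining. This is a toy sketch of the single rule “x is a C, and every C is a D, therefore x is a D”; it is not how FaCT++, Pellet or KAON2 are implemented, and the function and variable names are hypothetical.

```python
def infer_types(subclass_of, instance_of):
    """Forward-chain: x type C and C subClassOf D  =>  x type D."""
    inferred = {individual: set(classes) for individual, classes in instance_of.items()}
    changed = True
    while changed:  # repeat until no new facts are derived
        changed = False
        for classes in inferred.values():
            for cls in list(classes):
                for parent in subclass_of.get(cls, ()):
                    if parent not in classes:
                        classes.add(parent)
                        changed = True
    return inferred

SUBCLASS_OF = {"man": {"mortal"}}    # All men are mortal
INSTANCE_OF = {"Socrates": {"man"}}  # Socrates is a man

result = infer_types(SUBCLASS_OF, INSTANCE_OF)
print(sorted(result["Socrates"]))    # includes the derived fact "mortal"
```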
  • What can a reasoner do?
    • Subsumption: check knowledge is correct, e.g. all protein_sequences are biological_sequences
    • Equivalence: check knowledge is minimally redundant, e.g. SLPI and WAP4 are synonyms for “Secretory leukocyte protease inhibitor”
    • Consistency: check that knowledge is meaningful and no contradictions are made, e.g. SLPI can’t be both a DNA_sequence and a protein_sequence because these are disjoint classes
    • Instantiation: check if an individual is an instance of a class, e.g. is myProtein an instance of SLPI?
    • If you have used Protégé, you have used a reasoner
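Two of these checks, subsumption and consistency, can be sketched over a tiny hand-written class hierarchy. This is a toy illustration, not a description logic reasoner: placing SLPI under protein_sequence is an assumption made for the example, and all function names are hypothetical.

```python
def ancestors(cls, subclass_of):
    """All classes that subsume cls (transitively), including cls itself."""
    seen, stack = {cls}, [cls]
    while stack:
        for parent in subclass_of.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def is_subsumed_by(cls, other, subclass_of):
    """Subsumption: is every member of cls also a member of other?"""
    return other in ancestors(cls, subclass_of)

def is_consistent(asserted_classes, subclass_of, disjoint_pairs):
    """Consistency: False if the assertions entail two disjoint classes at once."""
    entailed = set()
    for cls in asserted_classes:
        entailed |= ancestors(cls, subclass_of)
    return not any(a in entailed and b in entailed for a, b in disjoint_pairs)

SUBCLASS_OF = {
    "protein_sequence": {"biological_sequence"},
    "DNA_sequence": {"biological_sequence"},
    "SLPI": {"protein_sequence"},  # assumption made for this example
}
DISJOINT = [("DNA_sequence", "protein_sequence")]

print(is_subsumed_by("SLPI", "biological_sequence", SUBCLASS_OF))
print(is_consistent({"SLPI"}, SUBCLASS_OF, DISJOINT))
print(is_consistent({"SLPI", "DNA_sequence"}, SUBCLASS_OF, DISJOINT))
```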
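  • Semantic Web Services in a nutshell
    • Metadata
      • BioMOBY: lots of user generated metadata
      • myGrid: bit of an afterthought
      • Taverna 2 / myExperiment: user driven
    • Community participation
      • BioMOBY: yes yes yes!
      • myGrid: kind of
      • Taverna 2 / myExperiment: yes
    • Reasoning / semantics
      • BioMOBY: no no no!
      • myGrid: a bit
      • Taverna 2 / myExperiment: possibly?
    • API?
      • BioMOBY: yes
      • myGrid: no
      • Taverna 2 / myExperiment: maybe?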
  • www.myExperiment.org
    • Getting large quantities of high-quality metadata about services is time-consuming and expensive…
    • Many new web applications rely on users to provide metadata for them
      • E.g. flickr, myspace, facebook, delicious etc
    • People annotate services by uploading collections of services, workflows
    • They can also “tag” them
  • Conclusions
    • We really need standard metadata to describe and find services
    • Standards are boring but important
    • You’re unlikely to win a Nobel prize for creating or using one…
      • But science can’t work without them
      • Especially “data-driven” rather than “hypothesis-driven” science
    • We’ve looked at semantic web standards for describing Web Services, using InterProScan as an example
      • And myGrid, BioMOBY and myExperiment too
      • But didn’t talk about DAS / BioDAS
      • Thanks for listening
  • Acknowledgements and References
    • Thanks to everyone I robbed stuff off:
      • Carole Goble, Homer Simpson, David De Roure, Tim Bray, Mark Butler, Stian Soiland, Katy Wolstencroft, Franck Tanoh, Rod Page, Mark Wilkinson, myGrid team, myExperiment team, Ian Horrocks, Ulrike Sattler, Tim Berners-Lee, Ora Lassila, Jim Hendler, Steve Pettifer, Douglas Kell, IETF, W3C etc
    • These slides are also available at http://www.slideshare.net/dullhunk/slideshows
    • See Also:
      • This talk is mostly about semantics rather than web services: see also “Web of Science - REST or SOAP?” at http://www.slideshare.net/dullhunk/web-of-science-rest-or-soap/
      • BioMOBY http://view.ncbi.nlm.nih.gov/pubmed/12511062
      • myGrid Ontology http://view.ncbi.nlm.nih.gov/pubmed/18048194
      • Taverna workflow http://view.ncbi.nlm.nih.gov/pubmed/16845108
      • myExperiment: social networking for workflow-using e-scientists (Goble and DeRoure) http://portal.acm.org/citation.cfm?doid=1273360.1273361
    • Questions?