Making your data more meaningful: using BioMOBY and myGrid Taverna services as examples. Duncan Hull, University of Manchester. BioSapiens Network of Excellence, 2007-12-04
Outline: Semantic Web redux. Semantic web stack: there are lots of unavoidable TLAs (three-letter acronyms) in this talk: XML, URI, Namespaces, Unicode, RDF, RDFS, OWL. In this talk when I say “data” I usually mean Web Services… Example: InterProScan. BioMOBY http://www.biomoby.org myGrid Taverna http://www.mygrid.org.uk myExperiment http://www.myexperiment.org Conclusions: what did we learn?
The semantic web  “knows what  you mean” thanks to added meaning According to Tim Berners-Lee,  Ora Lassila and Jim Hendler http://scholar.google.com/scholar?q=semantic+web … and Mark Butler from HP labs http://www.flickr.com/photos/dullhunk/303503677/ Vague, audacious, “visionary”, controversial and/or doomed  (depending on who you ask) in practice this means…
Semantic Web “stack” A suite of technology and standards* for adding meaning to data Taken from “Semantic Web Architecture: Stack or Two Towers?” by Ian Horrocks   et al  see  http://dx.doi.org/10.1007/11552222_4  and  http://www.flickr.com/photos/dullhunk/415645490/
… But first, InterProScan… InterProScan: Protein domains identifier @ EBI http://view.ncbi.nlm.nih.gov/pubmed/15980438 http://www.ebi.ac.uk/Tools/webservices/rest/submit?tool=iprscan&sequence=uniprot:slpi_human&seqtype=P&email=homer@simpsons.com That horrendously long URI submits a job to InterProScan with 4 parameters: ?tool=iprscan “use the InterProScan tool” &sequence=uniprot:slpi_human “…with the sequence secretory leukocyte proteinase inhibitor (SLPI) in UniProt format” &seqtype=P “sequence type is protein (i.e. not DNA)” &email=homer@simpsons.com “email results to Homer Simpson” Returns a job identifier e.g. iprscan-20071203-18053660 http://www.ebi.ac.uk/cgi-bin/iprscan/iprscan?tool=iprscan&jobid=iprscan-20071203-18053660
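The four-parameter query string on this slide can be assembled programmatically rather than typed by hand. This is only a sketch using Python's standard library: the function name `build_submit_uri` is made up for illustration, the base address is copied from the slide and may no longer be live, and nothing here contacts the service.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base address copied from the slide; assumed, not verified to still exist.
BASE = "http://www.ebi.ac.uk/Tools/webservices/rest/submit"


def build_submit_uri(sequence, email, tool="iprscan", seqtype="P"):
    """Build the job submission URI from the four parameters on the slide.

    build_submit_uri is a hypothetical helper, not part of any EBI client.
    """
    params = {
        "tool": tool,          # which tool to run
        "sequence": sequence,  # e.g. a UniProt identifier
        "seqtype": seqtype,    # "P" for protein
        "email": email,        # where results are sent
    }
    # safe=":@" leaves ':' and '@' unescaped so the URI reads as on the slide
    return BASE + "?" + urlencode(params, safe=":@")


uri = build_submit_uri("uniprot:slpi_human", "homer@simpsons.com")
print(uri)

# Going the other way: split the URI back into its named parameters.
print(parse_qs(urlsplit(uri).query))
```

Keeping the parameters in a dictionary makes it harder to mistype a separator such as `=` or `&` in a long URI.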
Back to the Stack
URI, Unicode, XML and Namespaces  Bottom of semantic web stack: Namespaces  http://www.w3.org/TR/xml-names/ eXtensible Markup Language (XML)  http://www.w3.org/TR/xml   Uniform Resource Identifiers (URI)  http://www.ietf.org/rfc/rfc3986 Unicode  http://www.unicode.org/charts
URI: Uniform Resource IDENTIFIER URIs include Uniform Resource Locators (URLs), which most people are familiar with for locating things and usually just call “links”… E.g. http://www.biosapiens.info locates the BioSapiens website E.g. http://view.ncbi.nlm.nih.gov/pubmed/16015280 locates a BioSapiens publication Not persistent: sometimes unstable, and they break, e.g. “404 not found” Not guaranteed to be unique URIs also include Uniform Resource Names (URNs) for naming things; these are less familiar, like ISBN, Digital Object Identifiers (DOI) and Life Science Identifiers (LSID) etc E.g. urn:doi:10.1038/sj.ejhg.5201470 names a publication using DOI E.g. urn:isbn:0387484361 names a book using ISBN E.g. urn:lsid:ncbi.nlm.nih.gov.lsid.biopathways.org:genbank:bx247883 names a biological sequence using LSID Unlike URLs, URNs are UNIQUE and PERSISTENT URIs can be names, identifiers or locators (sometimes all three) See URI Generic syntax http://www.ietf.org/rfc/rfc3986 and URN syntax http://www.ietf.org/rfc/rfc2141 from the Internet Engineering Task Force (IETF)
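The locator/name distinction can be seen by splitting a few of the slide's example URIs into their parts. This is a rough sketch with Python's standard `urllib.parse`; the `describe` helper is hypothetical, and it only looks at syntax (it says nothing about whether an identifier is actually persistent or resolvable).

```python
from urllib.parse import urlsplit

# Example identifiers taken from the slide.
EXAMPLES = [
    "http://www.biosapiens.info",
    "urn:isbn:0387484361",
    "urn:lsid:ncbi.nlm.nih.gov.lsid.biopathways.org:genbank:bx247883",
]


def describe(uri):
    """Classify a URI as a name (URN) or a locator, purely by its syntax."""
    parts = urlsplit(uri)
    if parts.scheme == "urn":
        # URN syntax: urn:<namespace identifier>:<namespace-specific string>
        namespace_id, _, specific = parts.path.partition(":")
        return ("name", namespace_id, specific)
    # Anything else is treated here as a locator: scheme plus host
    return ("locator", parts.scheme, parts.netloc)


for example in EXAMPLES:
    print(describe(example))
```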
Unicode: Boring but important Unicode provides a unique number for every character no matter what the platform (Windows, Unix, iPhone, toaster etc) no matter what the program (protein database, email client etc) no matter what the language (English, Chinese, Swahili… you name it) E.g. U+0041 is the number for “LATIN CAPITAL LETTER A” E.g. U+0F03 is the number for “TIBETAN MARK GTER YIG MGO -UM GTER TSHEG MA” E.g. U+221E is the number for “INFINITY” http://www.unicode.org/standard/WhatIsUnicode.html http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode
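The three code points on this slide can be looked up with Python's standard `unicodedata` module. This is a small illustration only; `code_point_label` is a helper name invented here, and the names printed come from whichever Unicode database ships with your Python.

```python
import unicodedata


def code_point_label(character):
    """Format a character's code point in the usual U+XXXX notation."""
    return f"U+{ord(character):04X}"


# The three code points mentioned on the slide.
for code_point in (0x0041, 0x0F03, 0x221E):
    character = chr(code_point)
    print(code_point_label(character), unicodedata.name(character))

# The number is the same everywhere; only the byte encoding varies.
infinity_utf8 = chr(0x221E).encode("utf-8")
print(infinity_utf8)
```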
XML: eXtensible Markup Language, boring but incredibly useful Data marked up using tags as “trees” <operation name="InterProScan" method="get"> <request> <parameter name="sequence" type="xsd:string" required="true"/> </request> <response> <representation mediaType="text/xml" element="yn:ResultSet"> <parameter name="totalResults" type="xsd:nonNegativeInteger"/> </representation> </response> </operation>
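To see why XML is described as a tree, the slide's example can be parsed and walked with Python's standard `xml.etree.ElementTree`. This sketch assumes the closing tags that the slide text appears to have lost (the `/>` on the second parameter and `</representation>`); the `outline` function is just an illustrative helper.

```python
import xml.etree.ElementTree as ET

# The slide's example, with the missing closing tags assumed.
DOC = """<operation name="InterProScan" method="get">
  <request>
    <parameter name="sequence" type="xsd:string" required="true"/>
  </request>
  <response>
    <representation mediaType="text/xml" element="yn:ResultSet">
      <parameter name="totalResults" type="xsd:nonNegativeInteger"/>
    </representation>
  </response>
</operation>"""


def outline(element, depth=0):
    """Return the tag names as an indented outline, one line per element."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(DOC)
print("\n".join(outline(root)))

# Attributes hang off the tree nodes, e.g. the names of required parameters.
required = [p.get("name") for p in root.iter("parameter") if p.get("required") == "true"]
print(required)
```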
Namespaces “XML namespaces provide a simple method for qualifying element and attribute names used in XML by associating them with namespaces identified by URI references.” <?xml version="1.0" standalone="yes"?> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" … What this means is that <xsi:fred> and <xsd:fred> are different because they belong in different namespaces; “xsi:fred” is shorthand for http://www.w3.org/2001/XMLSchema-instance:fred This allows us to have lots of different things called “fred”
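The two “fred” elements can be told apart in code once a parser expands the prefixes. This is a sketch with Python's standard `xml.etree.ElementTree`, which writes an expanded name as `{namespace-uri}local-name`; the `<root>` wrapper element is an assumption added so that the fragment is a complete document.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"
XSD = "http://www.w3.org/2001/XMLSchema"

# <root> is a made-up wrapper; the slide only shows the xmlns declarations.
DOC = f"""<root xmlns:xsi="{XSI}" xmlns:xsd="{XSD}">
  <xsi:fred/>
  <xsd:fred/>
</root>"""

root = ET.fromstring(DOC)

# Prefixes are expanded to full namespace URIs, so the two freds differ.
tags = [child.tag for child in root]
print(tags)

# Searching by prefix needs an explicit prefix-to-URI mapping.
xsd_fred = root.find("xsd:fred", {"xsd": XSD})
print(xsd_fred.tag)
```

The prefix itself (`xsi`, `xsd`) is only a local abbreviation; the URI it maps to is what identifies the namespace.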
Describing Web Services This (XML, URI, namespaces + Unicode) gives us enough to describe Web Services There are two styles of services on the Web: “RESTful” and “RESTless” “RESTful” (usually no SOAP): described with Web Application Description Language (WADL) https://wadl.dev.java.net/ “RESTless” (uses SOAP and WSDL): described with Web Services Description Language (WSDL) http://www.w3.org/TR/wsdl Most services you’ll come across in bioinformatics are the latter… but that might change
WSDL, WuzzDuLL, MiserabuL… WSDL http://www.ebi.ac.uk/Tools/webservices/wsdl/WSInterProScan.wsdl Describes inputs and outputs, tells us how to interact with the service Registries of Web Services, like myGrid and BioMOBY, use these WSDLs to build their index But it’s really difficult to find services based on information in their WSDL Poor metadata e.g. input “string” named “in0”, “in1”, “out1” etc Often auto-generated by tools, not humans No constraints on what can be said Machine readable but not very human readable
RDF and OWL: Adding metadata and semantics Web Ontology Language (OWL)  http://www.w3.org/TR/owl-features/   Resource Description Framework (RDF)  http://www.w3.org/TR/rdf-primer/   (M&S stands for model and syntax) RDF Schema (RDFS)  http://www.w3.org/TR/rdf-schema/
RDF and RDF schema RDF is just triples of (subject, verb, object) We can say things about services like: InterProScan isA service; InterProScan isA protein_domains_identifier; InterProScan hasInput protein_sequence; InterProScan hasOutput InterProScan_report The idea is simple… …but unfortunately the specifications and syntax are horrible to read and write But see http://notabug.com/2002/rdfprimer/ by Aaron Swartz RDF Schema gives us “templates” for RDF
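The triple idea can be tried out without any RDF syntax at all. What follows is a toy sketch using plain Python tuples for the four statements on this slide; it is not RDF/XML and does not use an RDF library, and the `match` function is a hypothetical helper where `None` acts as a wildcard.

```python
# The four statements from the slide as (subject, verb, object) tuples.
TRIPLES = {
    ("InterProScan", "isA", "service"),
    ("InterProScan", "isA", "protein_domains_identifier"),
    ("InterProScan", "hasInput", "protein_sequence"),
    ("InterProScan", "hasOutput", "InterProScan_report"),
}


def match(triples, subject=None, verb=None, obj=None):
    """Return the triples matching a pattern, sorted; None matches anything."""
    return sorted(
        triple
        for triple in triples
        if (subject is None or triple[0] == subject)
        and (verb is None or triple[1] == verb)
        and (obj is None or triple[2] == obj)
    )


# "What does InterProScan take as input?"
print(match(TRIPLES, subject="InterProScan", verb="hasInput"))

# "Which things accept a protein_sequence?"
print(match(TRIPLES, verb="hasInput", obj="protein_sequence"))
```

In real RDF each subject, verb and object would normally be a URI rather than a bare string, which is where the lower layers of the stack come in.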
BioMOBY.org A registry of annotated Web Services: BioMOBY has three ontologies Namespace  e.g.  genbank Object  e.g.  protein_sequence  (inputs and outputs) Services  (tasks e.g.  alignment ) And an API too, which lets users add terms to ontology when they register services Everything  in BioMOBY is annotated (unlike myGrid and myExperiment) Ontologies and Services are available from: http://biomoby.org/cgi-bin/serviceList http://biomoby.org/RESOURCES/MOBY-S/Namespaces http://biomoby.org/RESOURCES/MOBY-S/Objects http://biomoby.org/RESOURCES/MOBY-S/Services
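One way to picture why annotation with the three ontologies helps is a registry lookup by ontology term instead of by raw WSDL text. This is a hypothetical sketch only: it does not use the BioMOBY API, the `ServiceAnnotation` record and `find_services` function are invented here, and the registry entries (including the `ExampleAligner` service, the `alignment_report` term and the namespace assignments) are illustrative data, not real registrations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceAnnotation:
    """A made-up record with one term from each of the three ontologies."""
    name: str
    namespace: str      # term from the Namespace ontology
    input_object: str   # term from the Object ontology
    output_object: str  # term from the Object ontology
    service_type: str   # term from the Service ontology


# Illustrative entries only; not taken from the real registry.
REGISTRY = [
    ServiceAnnotation("InterProScan", "uniprot", "protein_sequence",
                      "InterProScan_report", "protein_domains_identifier"),
    ServiceAnnotation("ExampleAligner", "genbank", "protein_sequence",
                      "alignment_report", "alignment"),
]


def find_services(registry, input_object=None, service_type=None):
    """Return names of services whose annotations match the given terms."""
    return [
        service.name
        for service in registry
        if (input_object is None or service.input_object == input_object)
        and (service_type is None or service.service_type == service_type)
    ]


print(find_services(REGISTRY, input_object="protein_sequence"))
print(find_services(REGISTRY, service_type="alignment"))
```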
myGrid Taverna myGrid has a registry of services Many aren’t annotated … but arbitrary services can be added, not just BioMOBY Lovingly curated by Franck Tanoh and Katy Wolstencroft Using a single ontology  http://www.mygrid.org.uk/ontology   Accessible from Taverna workflow engine myGrid makes a bit more use of OWL but not much
Web Ontology Language (OWL) RDF and RDFS provide limited capabilities for reasoning, e.g. All men are mortal; Socrates is a man -------------------------------------- Therefore Socrates is mortal Do this using deductive reasoners like FaCT++, Pellet, KAON2 etc Ulrike Sattler’s list of reasoners http://www.cs.man.ac.uk/~sattler/reasoners.html http://flickr.com/photos/dullhunk/337473755/ (Socrates picture)
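The Socrates syllogism can be mimicked in a few lines to show the flavour of deduction. This is a toy sketch in plain Python, not how FaCT++, Pellet or KAON2 work internally; it handles only “is a subclass of” and “is an instance of”, and the function name `infer_types` is made up.

```python
# "All men are mortal" as a subclass statement, "Socrates is a man" as a type.
SUBCLASS_OF = {("man", "mortal")}
INSTANCE_OF = {("Socrates", "man")}


def infer_types(instance_of, subclass_of):
    """Repeatedly apply: x is a C, and C is a subclass of D, so x is a D."""
    inferred = set(instance_of)
    changed = True
    while changed:
        changed = False
        for individual, cls in list(inferred):
            for sub, sup in subclass_of:
                if sub == cls and (individual, sup) not in inferred:
                    inferred.add((individual, sup))
                    changed = True
    return inferred


conclusions = infer_types(INSTANCE_OF, SUBCLASS_OF)
print(("Socrates", "mortal") in conclusions)
```

The conclusion was never stated explicitly; it follows from the two premises, which is the kind of work a reasoner automates over much larger ontologies.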
What can a reasoner do? Subsumption: check knowledge is correct, e.g. all protein_sequences are biological_sequences Equivalence: check knowledge is minimally redundant, e.g. SLPI and WAP4 are synonyms for “Secretory leukocyte protease inhibitor” Consistency: check that knowledge is meaningful and no contradictions are made, e.g. SLPI can’t be both a DNA_sequence and a protein_sequence because these are disjoint classes Instantiation: check if an individual is an instance of a class, e.g. is myProtein an instance of SLPI? If you have used Protégé, you have used a reasoner
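Two of these checks, subsumption and consistency, can be illustrated with the slide's own class names. This is a deliberately tiny sketch in plain Python, assuming a hand-written class hierarchy and one disjointness statement; real OWL reasoners handle far richer logic, and all function names here are invented.

```python
# Hand-written hierarchy: each class maps to its direct superclasses.
SUBCLASS_OF = {
    "protein_sequence": {"biological_sequence"},
    "DNA_sequence": {"biological_sequence"},
}
# Pairs of classes declared to share no members.
DISJOINT = {frozenset({"protein_sequence", "DNA_sequence"})}


def ancestors(cls):
    """Collect every superclass reachable from cls."""
    seen = set()
    stack = [cls]
    while stack:
        for parent in SUBCLASS_OF.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_subsumed_by(sub, sup):
    """Subsumption: is every member of sub also a member of sup?"""
    return sub == sup or sup in ancestors(sub)


def is_consistent(asserted_classes):
    """Consistency: an individual may not belong to two disjoint classes."""
    expanded = set(asserted_classes)
    for cls in asserted_classes:
        expanded |= ancestors(cls)
    return not any(pair <= expanded for pair in DISJOINT)


print(is_subsumed_by("protein_sequence", "biological_sequence"))
print(is_consistent({"protein_sequence"}))
print(is_consistent({"protein_sequence", "DNA_sequence"}))
```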
Semantic Web Services in a nutshell
                         Taverna 2 / myExperiment   myGrid                  BioMOBY
Metadata                 User driven                Bit of an afterthought  Lots of user generated metadata
Community participation  yes                        Kind of                 Yes Yes Yes!
Reasoning / semantics    possibly?                  A bit                   No no no!
API?                     Maybe?                     no                      yes
www.myExperiment.org Getting large quantities of high-quality metadata about services is time-consuming and expensive… Many new web applications rely on users to provide metadata for them, e.g. flickr, myspace, facebook, delicious etc People annotate services by uploading collections of services and workflows, and can “tag” them
Conclusions We really need standard metadata to describe and find services Standards are boring but important You’re unlikely to win a Nobel prize for creating or using one… But science can’t work without them Especially “data-driven” rather than “hypothesis-driven” Science We’ve looked at semantic web standards for describing Web Services, using InterProScan as an example And myGrid, BioMOBY and myExperiment too But didn’t talk about DAS / BioDAS Thanks for listening
Acknowledgements and References Thanks to everyone I robbed stuff off: Carole Goble, Homer Simpson, David De Roure, Tim Bray, Mark Butler, Stian Soiland, Katy Wolstencroft, Franck Tanoh, Rod Page, Mark Wilkinson, myGrid team, myExperiment team, Ian Horrocks, Ulrike Sattler, Tim Berners-Lee, Ora Lassila, Jim Hendler, Steve Pettifer, Douglas Kell, IETF, W3C etc These slides are also available at http://www.slideshare.net/dullhunk/slideshows See also: this talk is mostly about semantics rather than web services; for the latter see “Web of Science - REST or SOAP?” at http://www.slideshare.net/dullhunk/web-of-science-rest-or-soap/ BioMOBY http://view.ncbi.nlm.nih.gov/pubmed/12511062 myGrid Ontology http://view.ncbi.nlm.nih.gov/pubmed/18048194 Taverna workflow http://view.ncbi.nlm.nih.gov/pubmed/16845108 myExperiment: social networking for workflow-using e-scientists (Goble and De Roure) http://portal.acm.org/citation.cfm?doid=1273360.1273361 Questions?

Adding Meaning To Your Data
