Adding Meaning To Your Data

Making your data more meaningful
Using BioMOBY and myGrid Taverna services as examples
Duncan Hull, University of Manchester
BioSapiens Network of Excellence, 2007-12-04

Outline
- Semantic Web redux
  - Semantic web stack
  - There are lots of TLAs (three letter acronyms) in this talk, unavoidable: XML, URI, Namespaces, Unicode, RDF, RDFS, OWL
  - In this talk when I say “data” I usually mean Web Services…
- Example: InterProScan
- BioMOBY http://www.biomoby.org
- myGrid Taverna http://www.mygrid.org.uk
- myExperiment http://www.myexperiment.org
- Conclusions: what did we learn?

The semantic web “knows what you mean” thanks to added meaning
- According to Tim Berners-Lee, Ora Lassila and Jim Hendler http://scholar.google.com/scholar?q=semantic+web
- … and Mark Butler from HP Labs http://www.flickr.com/photos/dullhunk/303503677/
- Vague, audacious, “visionary”, controversial and/or doomed (depending on who you ask)
- In practice this means…

Semantic Web “stack”
- A suite of technologies and standards for adding meaning to data
- Taken from “Semantic Web Architecture: Stack or Two Towers?” by Ian Horrocks et al., see http://dx.doi.org/10.1007/11552222_4 and http://www.flickr.com/photos/dullhunk/415645490/

… But first, InterProScan…
- InterProScan: protein domains identifier @ EBI http://view.ncbi.nlm.nih.gov/pubmed/15980438
- http://www.ebi.ac.uk/Tools/webservices/rest/submit?tool=iprscan&sequence=uniprot:slpi_human&seqtype=P&email=homer@simpsons.com
- That horrendously long URI submits a job to InterProScan with 4 parameters:
  - ?tool=iprscan “use the InterProScan tool”
  - &sequence=uniprot:slpi_human “…with the sequence secretory leukocyte proteinase inhibitor (SLPI) in UniProt format”
  - &seqtype=P “sequence type is protein (i.e. not DNA)”
  - &email=homer@simpsons.com “email results to Homer Simpson”
- Returns a job identifier, e.g. iprscan-20071203-18053660
- http://www.ebi.ac.uk/cgi-bin/iprscan/iprscan?tool=iprscan&jobid=iprscan-20071203-18053660
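As a rough illustration of how that URI breaks down, the sketch below pulls the four parameters out of the query string using Python's standard library. It only parses the text of the URI and never contacts the EBI service; the helper name `query_parameters` is made up for this example.

```python
from urllib.parse import urlsplit, parse_qs

# The job submission URI from the slide, split over lines for readability.
SUBMIT_URI = (
    "http://www.ebi.ac.uk/Tools/webservices/rest/submit"
    "?tool=iprscan&sequence=uniprot:slpi_human&seqtype=P"
    "&email=homer@simpsons.com"
)

def query_parameters(uri):
    """Return the query-string parameters of a URI as a plain dict.

    Hypothetical helper: assumes each parameter appears only once.
    """
    query = urlsplit(uri).query
    return {name: values[0] for name, values in parse_qs(query).items()}

params = query_parameters(SUBMIT_URI)
for name, value in params.items():
    print(f"{name} = {value}")
```

Everything after the `?` is just name=value pairs joined by `&`, which is why the URI is long but still easy for a machine to take apart.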

URI, Unicode, XML and Namespaces
Bottom of semantic web stack:
- Namespaces http://www.w3.org/TR/xml-names/
- eXtensible Markup Language (XML) http://www.w3.org/TR/xml
- Uniform Resource Identifiers (URI) http://www.ietf.org/rfc/rfc3986
- Unicode http://www.unicode.org/charts

URI: Uniform Resource IDENTIFIER
- URIs include Uniform Resource Locators (URLs), which most people are familiar with for locating things and usually just call “links”…
  - E.g. http://www.biosapiens.info is a locator for the biosapiens website
  - E.g. http://view.ncbi.nlm.nih.gov/pubmed/16015280 locates a biosapiens publication
  - Not persistent: sometimes unstable, and they break, e.g. “404 not found”
  - Not guaranteed to be unique
- URIs include Uniform Resource Names (URNs) for naming things. These are less familiar, like ISBN, Digital Object Identifiers (DOI) and Life Science Identifiers (LSID) etc.
  - E.g. urn:doi:10.1038/sj.ejhg.5201470 names a publication using DOI
  - E.g. urn:isbn:0387484361 names a book using ISBN
  - E.g. urn:lsid:ncbi.nlm.nih.gov.lsid.biopathways.org:genbank:bx247883 names a biological sequence using LSID
  - Unlike URLs, URNs are UNIQUE and PERSISTENT
- URIs can be names, identifiers or locators (sometimes all three)
- See URI generic syntax http://www.ietf.org/rfc/rfc3986 and URN syntax http://www.ietf.org/rfc/rfc2141 from the Internet Engineering Task Force (IETF)
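To make the locator/name distinction concrete, here is a minimal sketch that sorts the example URIs above by their scheme. The function `classify` is hypothetical, and treating every non-`urn:` URI as a URL is a simplification made for this example, not something the RFCs say.

```python
from urllib.parse import urlsplit

# The example URIs from the slide.
EXAMPLES = [
    "http://www.biosapiens.info",
    "http://view.ncbi.nlm.nih.gov/pubmed/16015280",
    "urn:doi:10.1038/sj.ejhg.5201470",
    "urn:isbn:0387484361",
    "urn:lsid:ncbi.nlm.nih.gov.lsid.biopathways.org:genbank:bx247883",
]

def classify(uri):
    """Return ("URN", namespace) for urn: URIs, otherwise ("URL", scheme).

    Simplifying assumption: anything that is not a URN is called a URL.
    """
    scheme = urlsplit(uri).scheme.lower()
    if scheme == "urn":
        # A URN looks like urn:<namespace>:<namespace-specific string>.
        namespace = uri.split(":", 2)[1].lower()
        return ("URN", namespace)
    return ("URL", scheme)

kinds = {uri: classify(uri) for uri in EXAMPLES}
for uri, (kind, detail) in kinds.items():
    print(f"{kind:3} {detail:5} {uri}")
```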

Unicode: Boring but important
- Unicode provides a unique number for every character
  - no matter what the platform (Windows, Unix, iPhone, toaster etc.)
  - no matter what the program (protein database, email client etc.)
  - no matter what the language (English, Chinese, Swahili… you name it)
- E.g. U+0041 is the number for “LATIN CAPITAL LETTER A”
- E.g. U+0F03 is the number for “TIBETAN MARK GTER YIG MGO -UM GTER TSHEG MA”
- E.g. U+221E is the number for “INFINITY”
- http://www.unicode.org/standard/WhatIsUnicode.html
- http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode
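The three code points above can be looked up with Python's built-in `unicodedata` module. This is only a small sketch; the helper `describe` is made up for the example, and the names it prints come from whichever version of the Unicode database ships with your Python.

```python
import unicodedata

# The three code points mentioned on the slide.
CODE_POINTS = [0x0041, 0x0F03, 0x221E]

def describe(code_point):
    """Return the code point, its official Unicode name and its UTF-8 bytes."""
    character = chr(code_point)
    name = unicodedata.name(character)
    utf8 = character.encode("utf-8")
    return f"U+{code_point:04X} {name} (UTF-8 bytes: {utf8.hex()})"

descriptions = [describe(cp) for cp in CODE_POINTS]
for line in descriptions:
    print(line)
```

The number is the same on every platform; only the byte encoding (UTF-8 here) is a separate choice.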

XML: eXtensible Markup Language, boring but incredibly useful
Data marked up using tags as “trees”:

    <operation name="InterProScan" method="get">
      <request>
        <parameter name="sequence" type="xsd:string" required="true"/>
      </request>
      <response>
        <representation mediaType="text/xml" element="yn:ResultSet">
          <parameter name="totalResults" type="xsd:nonNegativeInteger"/>
        </representation>
      </response>
    </operation>
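A short sketch of what “trees” buys you: once the description is well-formed, a standard parser can walk it and answer questions such as “what are the inputs and outputs?”. The fragment is the one from the slide, completed with a closing `/>` and `</representation>` so that it parses; those two completions are assumptions.

```python
import xml.etree.ElementTree as ET

# The operation description from the slide, completed so it is well-formed.
DESCRIPTION = """
<operation name="InterProScan" method="get">
  <request>
    <parameter name="sequence" type="xsd:string" required="true"/>
  </request>
  <response>
    <representation mediaType="text/xml" element="yn:ResultSet">
      <parameter name="totalResults" type="xsd:nonNegativeInteger"/>
    </representation>
  </response>
</operation>
""".strip()

root = ET.fromstring(DESCRIPTION)

# Follow the branches of the tree to find input and output parameters.
inputs = [p.get("name") for p in root.findall("./request/parameter")]
outputs = [p.get("name") for p in root.findall("./response/representation/parameter")]

print("operation:", root.get("name"))
print("inputs:", inputs)
print("outputs:", outputs)
```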

Namespaces
- “XML namespaces provide a simple method for qualifying element and attribute names used in XML by associating them with namespaces identified by URI references.”

    <?xml version="1.0" standalone="yes"?>
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema" …

- What this means is that <xsi:fred> and <xsd:fred> are different because they belong in different namespaces
- “xsi:fred” is shorthand for http://www.w3.org/2001/XMLSchema-instance:fred
- Allows us to have lots of different things called “fred”
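The two “fred” elements can be told apart by any namespace-aware parser. In this sketch the wrapping `<root>` element is invented so the fragment is a complete document; Python's ElementTree reports each tag as `{namespace URI}local-name`, its own notation for the expanded name.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: <root> is made up to hold the two namespace declarations.
DOCUMENT = """<root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsi:fred/>
  <xsd:fred/>
</root>"""

root = ET.fromstring(DOCUMENT)

# The prefixes are gone; each tag now carries its full namespace URI.
tags = [child.tag for child in root]
for tag in tags:
    print(tag)

same_name = tags[0] == tags[1]
print("Same element?", same_name)
```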

Describing Web Services
- This (XML, URI, namespaces + Unicode) gives us enough to describe Web Services
- There are two styles of services on the Web: “RESTful” and “RESTless”
  - “RESTful” (usually no SOAP): described with Web Application Description Language (WADL) https://wadl.dev.java.net/
  - “RESTless” (uses SOAP and WSDL): described with Web Services Description Language (WSDL) http://www.w3.org/TR/wsdl
- Most services you’ll come across in bioinformatics are the latter… but that might change

WSDL, WuzzDuLL, MiserabuL…
- WSDL http://www.ebi.ac.uk/Tools/webservices/wsdl/WSInterProScan.wsdl
  - Describes inputs and outputs, tells us how to interact with the service
- Registries of Web Services, like myGrid and BioMOBY, use these WSDLs to build their index
- But it’s really difficult to find services based on information in their WSDL
  - Poor metadata, e.g. input “string”, name “in0”, “in1”, “out1” etc.
  - Often auto-generated by tools, not humans
  - No constraints on what can be said
  - Machine readable but not very human readable

RDF and OWL: Adding metadata and semantics
- Web Ontology Language (OWL) http://www.w3.org/TR/owl-features/
- Resource Description Framework (RDF) http://www.w3.org/TR/rdf-primer/ (M&S stands for model and syntax)
- RDF Schema (RDFS) http://www.w3.org/TR/rdf-schema/

RDF and RDF schema
- RDF is just triples of (subject, verb, object)
- We can say things about services like:
  - InterProScan isA service
  - InterProScan isA protein_domains_identifier
  - InterProScan hasInput protein_sequence
  - InterProScan hasOutput InterProScan_report
- The idea is simple…
- … but unfortunately the specifications and syntax are horrible to read and write
  - But see http://notabug.com/2002/rdfprimer/ by Aaron Swartz
- RDF Schema gives us “templates” for RDF
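To show how little machinery the triple idea needs, here is a toy sketch that stores the four statements above as plain Python tuples and queries them by pattern. It deliberately avoids any real RDF library or syntax; the `match` function and the use of bare strings instead of URIs are simplifications made for this example.

```python
# The four statements from the slide as (subject, verb, object) tuples.
TRIPLES = {
    ("InterProScan", "isA", "service"),
    ("InterProScan", "isA", "protein_domains_identifier"),
    ("InterProScan", "hasInput", "protein_sequence"),
    ("InterProScan", "hasOutput", "InterProScan_report"),
}

def match(subject=None, verb=None, obj=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return sorted(
        (s, v, o)
        for (s, v, o) in TRIPLES
        if subject in (None, s) and verb in (None, v) and obj in (None, o)
    )

# "Which services take a protein sequence as input?"
takers = [s for (s, _, _) in match(verb="hasInput", obj="protein_sequence")]
print(takers)

# "What kinds of thing is InterProScan?"
kinds = [o for (_, _, o) in match(subject="InterProScan", verb="isA")]
print(kinds)
```

In real RDF each of these names would be a URI, which is what lets statements from different registries be merged safely.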

BioMOBY.org
- A registry of annotated Web Services
- BioMOBY has three ontologies:
  - Namespace, e.g. genbank
  - Object, e.g. protein_sequence (inputs and outputs)
  - Services (tasks, e.g. alignment)
- And an API too, which lets users add terms to the ontology when they register services
- Everything in BioMOBY is annotated (unlike myGrid and myExperiment)
- Ontologies and Services are available from:
  - http://biomoby.org/cgi-bin/serviceList
  - http://biomoby.org/RESOURCES/MOBY-S/Namespaces
  - http://biomoby.org/RESOURCES/MOBY-S/Objects
  - http://biomoby.org/RESOURCES/MOBY-S/Services

myGrid Taverna
- myGrid has a registry of services
  - Many aren’t annotated
  - … but arbitrary services can be added, not just BioMOBY
- Lovingly curated by Franck Tanoh and Katy Wolstencroft
- Using a single ontology http://www.mygrid.org.uk/ontology
- Accessible from the Taverna workflow engine
- myGrid makes a bit more use of OWL, but not much

Web Ontology Language (OWL)
- RDF and RDFS provide limited capabilities for reasoning

    All men are mortal
    Socrates is a man
    --------------------------------------
    Therefore Socrates is mortal

- Do this using deductive reasoners like FaCT++, Pellet, KAON2 etc.
- Ulrike Sattler’s list of reasoners http://www.cs.man.ac.uk/~sattler/reasoners.html
- Socrates picture: http://flickr.com/photos/dullhunk/337473755/
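The Socrates syllogism can be mimicked in a few lines, as sketched below. This is a toy rule applied repeatedly (“if X is an instance of C and C is a subclass of D, then X is an instance of D”); it is not how FaCT++, Pellet or KAON2 work internally, and the names `SUBCLASS_OF`, `INSTANCE_OF` and `infer_types` are invented for the example.

```python
# "All men are mortal" as a subclass statement,
# "Socrates is a man" as an instance statement.
SUBCLASS_OF = {("man", "mortal")}
INSTANCE_OF = {("Socrates", "man")}

def infer_types(instance_of, subclass_of):
    """Add every class membership that follows from the subclass statements."""
    inferred = set(instance_of)
    changed = True
    while changed:  # repeat until no new facts appear
        changed = False
        for individual, cls in list(inferred):
            for sub, sup in subclass_of:
                if sub == cls and (individual, sup) not in inferred:
                    inferred.add((individual, sup))
                    changed = True
    return inferred

facts = infer_types(INSTANCE_OF, SUBCLASS_OF)
print(("Socrates", "mortal") in facts)
```

The conclusion was never written down by anyone; it is derived from the two statements that were.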

What can a reasoner do?
- Subsumption: check knowledge is correct, e.g. all protein_sequences are biological_sequences
- Equivalence: check knowledge is minimally redundant, e.g. SLPI and WAP4 are synonyms for “Secretory leukocyte protease inhibitor”
- Consistency: check that knowledge is meaningful and no contradictions are made, e.g. SLPI can’t be both a DNA_sequence and a protein_sequence because these are disjoint classes
- Instantiation: check if an individual is an instance of a class, e.g. is myProtein an instance of SLPI?
- If you have used Protégé, you have used a reasoner
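Two of these checks, subsumption and consistency, can be sketched with a toy class hierarchy. The hierarchy below is an assumption built from the slide's examples (that DNA_sequence is also a biological_sequence is added for illustration), and the functions are hypothetical stand-ins for what a real OWL reasoner does far more generally.

```python
# Toy class hierarchy (assumed) and one disjointness statement from the slide.
SUBCLASS_OF = {
    ("protein_sequence", "biological_sequence"),
    ("DNA_sequence", "biological_sequence"),
}
DISJOINT = {frozenset({"DNA_sequence", "protein_sequence"})}

def ancestors(cls):
    """Return cls together with every class it is a (transitive) subclass of."""
    found = {cls}
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for sub, sup in SUBCLASS_OF:
            if sub == current and sup not in found:
                found.add(sup)
                frontier.append(sup)
    return found

def subsumes(general, specific):
    """Subsumption: is every `specific` also a `general`?"""
    return general in ancestors(specific)

def is_consistent(asserted_classes):
    """Consistency: an individual must not fall into two disjoint classes."""
    all_classes = set()
    for cls in asserted_classes:
        all_classes |= ancestors(cls)
    return not any(pair <= all_classes for pair in DISJOINT)

print(subsumes("biological_sequence", "protein_sequence"))
print(is_consistent({"protein_sequence"}))
print(is_consistent({"DNA_sequence", "protein_sequence"}))
```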

Semantic Web Services in a nutshell

                          | Taverna 2 / myExperiment | myGrid                 | BioMOBY
  Metadata                | User driven              | Bit of an afterthought | Lots of user generated metadata
  Community participation | yes                      | Kind of                | Yes Yes Yes!
  Reasoning / semantics   | possibly?                | A bit                  | No no no!
  API?                    | Maybe?                   | no                     | yes

www.myExperiment.org
- Getting large quantities of high-quality metadata about services is time-consuming and expensive…
- Many new web applications rely on users to provide metadata for them
  - E.g. Flickr, MySpace, Facebook, Delicious etc.
- People annotate services by uploading collections of services and workflows
- They can “tag” them

Conclusions
- We really need standard metadata to describe and find services
- Standards are boring but important
  - You’re unlikely to win a Nobel prize for creating or using one…
  - But science can’t work without them
  - Especially “data-driven” rather than “hypothesis-driven” science
- We’ve looked at semantic web standards for describing Web Services, using InterProScan as an example
  - And myGrid, BioMOBY and myExperiment too
  - But didn’t talk about DAS / BioDAS
- Thanks for listening

Acknowledgements and References
- Thanks to everyone I robbed stuff off: Carole Goble, Homer Simpson, David De Roure, Tim Bray, Mark Butler, Stian Soiland, Katy Wolstencroft, Franck Tanoh, Rod Page, Mark Wilkinson, myGrid team, myExperiment team, Ian Horrocks, Ulrike Sattler, Tim Berners-Lee, Ora Lassila, Jim Hendler, Steve Pettifer, Douglas Kell, IETF, W3C etc.
- These slides are also available at http://www.slideshare.net/dullhunk/slideshows
- See also:
  - This talk is mostly about semantics rather than web services: see also “Web of Science - REST or SOAP?” at http://www.slideshare.net/dullhunk/web-of-science-rest-or-soap/
  - BioMOBY http://view.ncbi.nlm.nih.gov/pubmed/12511062
  - myGrid Ontology http://view.ncbi.nlm.nih.gov/pubmed/18048194
  - Taverna workflow http://view.ncbi.nlm.nih.gov/pubmed/16845108
  - myExperiment: social networking for workflow-using e-scientists (Goble and De Roure) http://portal.acm.org/citation.cfm?doid=1273360.1273361
- Questions?
