Producing, publishing and consuming linked data - CSHALS 2013
Bio2RDF presentation at CSHALS 2013 in Boston.

Producing, publishing and consuming linked data - CSHALS 2013 Presentation Transcript

  • 1. Producing, Publishing and Consuming Linked Data: Three Lessons from the Bio2RDF Project. François Belleau, Centre de recherche du CHUQ, Laval University, Québec, Canada. @bio2rdf
  • 2. • Looking backward to 2004 • Lessons: 1) How to produce RDF 2) How to publish Linked Data 3) How to consume SPARQL endpoints • Looking forward to the next decade
  • 3. The story of two images, or the Bio2RDF fairy tale: 2004 vision, 2011 reality
  • 4. Rdfizer inspiration
  • 5. Data Integration problem in bioinformatics
  • 6. Where Bio2RDF got its name
  • 7. Mashup ! FungalWeb from Christopher Baker, YeastHub from Kei-Hoi Cheung
  • 8. ISMB 2005 Birds of a Feather
  • 9. W3C conference in 2007: 46 million documents in SESAME
  • 10. DILS conference in 2008: 63 million triples in Virtuoso
  • 11. ISMB conference in 2008: 65 million triples in Virtuoso
  • 12. March 2009: the Linked Data cloud is published; Bio2RDF's 2.3 billion triples represent 54% of the global graph
  • 13. W3C-HCLS F2F Meeting in 2009: 41 Virtuoso endpoints
  • 14. CSHALS conference in 2013: 1 billion triples in 19 Virtuoso endpoints with Bio2RDF release 2, and still adding…
  • 15. Bio2RDF is not alone anymore !
  • 16. How to produce RDF • The Bio2RDF project transforms existing public databases into RDF; • Transforming data formats into RDF triples is simple to do; • The transformation needs to be done from many kinds of formats (CSV, XML, JSON, HTML, relational databases) to RDF.
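    As a rough illustration of that claim, here is a minimal Python sketch of such a transformation. It is not part of the Bio2RDF code base; the input file name, column names and namespace are assumptions made only for the example.

    import csv

    # Minimal rdfizer sketch: turn a hypothetical tab-separated file with columns
    # "id" and "symbol" into N-Triples, using a Bio2RDF-style URI pattern
    # http://bio2rdf.org/namespace:identifier.
    RDFS_LABEL = "<http://www.w3.org/2000/01/rdf-schema#label>"

    def rdfize(tsv_path, nt_path, namespace="example"):
        with open(tsv_path, newline="") as src, open(nt_path, "w") as out:
            for row in csv.DictReader(src, delimiter="\t"):
                subject = "<http://bio2rdf.org/%s:%s>" % (namespace, row["id"])
                out.write('%s %s "%s" .\n' % (subject, RDFS_LABEL, row["symbol"]))

    rdfize("genes.tsv", "genes.nt")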
  • 17. Methods • 2006 - Converting XML and HTML documents from the web using the JSP JSTL library • 2007-2010 - Perl scripts, JSP web pages • 2012 - Release 2.0 rdfizers are written in PHP • 2013 - Use Talend ETL jobs
  • 18. ETL definition from Wikipedia: In computing, Extract, Transform and Load (ETL) refers to a process in database usage, and especially in data warehousing, that involves: • Extracting data from outside sources • Transforming it to fit operational needs (which can include quality levels) • Loading it into the end target (database; more specifically, an operational data store, data mart or data warehouse) http://en.wikipedia.org/wiki/Extract,_transform,_load
  • 19. Why not use ETL software to rdfize existing data ?
  • 20. Talend Open Studio for Data Integration, a free open source ETL tool built on Eclipse http://www.talend.com/
  • 21. HGNC 2 Bio2RDF example: EXTRACT from the web, TRANSFORM to RDF, LOAD into the triplestore
  • 22. HGNC 2 Bio2RDF : EXTRACT
  • 23. HGNC 2 Bio2RDF : TRANSFORM
  • 24. HGNC 2 Bio2RDF : LOAD
  • 25. This rdfizer is available on myExperiment http://www.myexperiment.org/workflows/3420.html
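    The slides show the LOAD step as a Talend component; as a rough equivalent outside Talend, one can push the generated N-Triples into a named graph through Virtuoso's SPARQL 1.1 Graph Store HTTP interface. This is only a sketch under that assumption: the endpoint URL, graph URI, file name and credentials below are placeholders to adapt to your own installation.

    import requests

    # Placeholders: adjust to your own Virtuoso installation.
    GRAPH_STORE = "http://localhost:8890/sparql-graph-crud-auth"
    GRAPH_URI = "http://example.org/graph/hgnc"

    with open("hgnc.nt", "rb") as data:
        response = requests.post(
            GRAPH_STORE,
            params={"graph-uri": GRAPH_URI},  # parameter name used by Virtuoso's Graph Store implementation
            data=data,
            headers={"Content-Type": "application/n-triples"},
            auth=requests.auth.HTTPDigestAuth("dba", "dba"),  # default demo credentials, change them
        )
    response.raise_for_status()
    print("Loaded hgnc.nt into", GRAPH_URI)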
  • 26. Lesson #1 • Use an existing ETL tool, like Talend, to do fast and efficient transformation to the RDF N-Triples format. • Talend could be extended with new Semantic Web components to ease RDF transformation and simplify SPARQL query submission.
  • 27. How to publish Linked Data • Design your URI pattern; • Publish a SPARQL endpoint on the Internet; • Offer a search engine and a browser; • Register it in an official registry like CKAN; • Advertise it in SPARQL endpoint lists; • Describe your triples with an ontology, or the way Bio2RDF does; • Publish SPARQL query examples; • Index your data in a semantic search service like Sindice;
  • 28. Design your URI pattern • Bio2RDF uses Banff Manifesto URIs • http://sourceforge.net/apps/mediawiki/bio2rdf/index.php?title=Banff_Manifesto • Example: http://bio2rdf.org/geneid:15275 • Apply the four Linked Data rules • http://www.w3.org/DesignIssues/LinkedData.html • Be polite with other URIs • http://hackathon3.dbcls.jp/wiki/URI • Example: http://purl.uniprot.org/uniprot/P05067
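    One of the Linked Data rules is that such URIs should be dereferenceable, so a useful quick check is to request RDF for the slide's example identifier through content negotiation. A minimal Python sketch follows; the Accept header is only an assumption about which formats bio2rdf.org actually serves.

    import urllib.request

    # Ask the server for an RDF representation of the example URI from the slide.
    request = urllib.request.Request(
        "http://bio2rdf.org/geneid:15275",
        headers={"Accept": "application/rdf+xml"},
    )
    with urllib.request.urlopen(request) as response:
        print(response.headers.get("Content-Type"))
        print(response.read(500).decode("utf-8", errors="replace"))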
  • 29. Publish a SPARQL endpoint on the Internet • Choose a triplestore technology • http://en.wikipedia.org/wiki/Triplestore
  • 30. Offer a search engine and a browser
  • 31. Register it in an official registry like CKAN
  • 32. Advertise it in SPARQL endpoint lists http://www.freebase.com/view/base/politeuri/sparql_endpoint http://beta.bio2rdf.org/
  • 33. Describe your triples
  • 34. Publish SPARQL query examples http://sourceforge.net/apps/mediawiki/bio2rdf/index.php?title=Essential_SPARQL_queries
  • 35. Index your data in a semantic search service
  • 36. Lesson #2 • To be present in the Linked Data cloud, just publish your data through a SPARQL endpoint. • Register it in public resources, describe its content and suggest SPARQL queries. • We have used the OpenLink Virtuoso free edition since 2007. Without this first-class triplestore software there would be no Bio2RDF service.
  • 37. How to consume SPARQL endpoints Two principles : 1. To answer a specific question first build a mashup using public or private SPARQL endpoints. 2. Then, ask your questions to the mashup.
  • 38. How to build a semantic mashup • 2005 - Import RDF files into Protégé. • 2006 - Use the ELMO RDF crawler to import RDF data into the SESAME triplestore. • 2007 - We implemented an import function in SESAME based on dereferenceable URIs. • 2008 - Use the Virtuoso Sponger option and Perl scripts. • 2009 - Use the Taverna workflow engine to fetch triples from SPARQL endpoints. • 2012 - Use a Talend workflow consuming SPARQL endpoints.
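    Outside of Talend, the core step behind all of these approaches, fetching triples from a SPARQL endpoint into a local mashup store, can be sketched in a few lines of Python with the SPARQLWrapper and rdflib libraries. The query below is illustrative only and not part of the Bio2RDF workflow itself.

    from SPARQLWrapper import SPARQLWrapper, JSON
    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import RDFS

    mashup = Graph()  # local store that accumulates triples from one or more endpoints

    endpoint = SPARQLWrapper("http://bio2rdf.org/sparql")  # any public or private endpoint
    endpoint.setQuery("""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?s ?label WHERE { ?s rdfs:label ?label } LIMIT 100
    """)
    endpoint.setReturnFormat(JSON)

    for row in endpoint.query().convert()["results"]["bindings"]:
        mashup.add((URIRef(row["s"]["value"]), RDFS.label, Literal(row["label"]["value"])))

    print(len(mashup), "triples in the local mashup graph")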
  • 39. Who is influential at CSHALS ? http://cshals.mashup.bio2rdf.org/relfinder/ http://cshals.mashup.bio2rdf.org/sparql
  • 40. Talend workflow to create the needed semantic mashup • Do a full text search for each author (~80) who talked at CSHALS since 2007 and get their publications; • For each publication, get its XML description (~1,000) and rdfize it; • For each publication, get its citation list; • For each publication citing a previous one, get its description (~10,000).
  • 41. Global workflow in 3 steps Full text search Describe publication Describe citing publication
  • 42. Full text search using ncbi/esearch
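    The first step maps to a single call to the NCBI esearch service. A minimal Python sketch of that call follows; the author query string and the result limit are just example parameters.

    import requests

    ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

    # Full text search in PubMed for one speaker's publications.
    params = {"db": "pubmed", "term": "Belleau F[Author]", "retmode": "json", "retmax": 100}
    pmids = requests.get(ESEARCH, params=params).json()["esearchresult"]["idlist"]
    print(len(pmids), "PubMed identifiers found, e.g.", pmids[:5])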
  • 43. Describe publication, pubmed rdfizer for ncbi/efetch and ncbi/elink service
  • 44. Describe citing publication using ncbi/elinks
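    The citing publications for a given PubMed identifier can be retrieved through the elink service. In this sketch the linkname pubmed_pubmed_citedin is my assumption about which link type the slide's workflow relies on, so verify it against the E-utilities documentation.

    import requests

    ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"

    def cited_in(pmid):
        """Return the PubMed identifiers of articles that cite the given one (may be empty)."""
        params = {"dbfrom": "pubmed", "db": "pubmed",
                  "linkname": "pubmed_pubmed_citedin", "id": pmid, "retmode": "json"}
        for linkset in requests.get(ELINK, params=params).json().get("linksets", []):
            for group in linkset.get("linksetdbs", []):
                return group.get("links", [])
        return []

    print(cited_in("12345678"))  # placeholder identifier: replace with a real PubMed id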
  • 45. Then query the mashup • What is the CSHALS conference about ? • Who are the most influential researchers in the community ? • Which articles in semantics have been most cited ?
  • 46. What is the CSHALS conference about ?
    select ?label2 as ?mesh count(*) as ?count
    where {
      ?s <http://bio2rdf.org/pubmed_vocabulary#xFoundIn> ?pubmed .
      ?pubmed <http://bio2rdf.org/pubmed_vocabulary#xMesh> ?xMesh .
      ?xMesh rdfs:label "Semantics" .
      ?pubmed <http://bio2rdf.org/pubmed_vocabulary#xMesh> ?xMesh2 .
      ?xMesh2 rdfs:label ?label2 .
    }
    order by desc(2)
  • 47. Who are the most influential researchers in the community ?
    select ?l3 as ?author count(distinct ?pubmed) as ?citation
    where {
      ?s a <http://bio2rdf.org/pubmed_vocabulary#searchResults> .
      ?s rdfs:label ?l .
      ?s <http://bio2rdf.org/pubmed_vocabulary#xFoundIn> ?pubmed .
      ?pubmed <http://bio2rdf.org/pubmed_vocabulary:xCitedIn> ?xCitedIn .
      ?pubmed rdfs:label ?l2 .
      ?pubmed <http://bio2rdf.org/pubmed_vocabulary#xMesh> ?xMesh .
      ?xMesh rdfs:label "Semantics" .
      ?pubmed <http://bio2rdf.org/pubmed_vocabulary#xPerson> ?xPerson .
      ?xPerson rdfs:label ?l3 .
    }
    order by desc(2)
  • 48. Which articles in semantics have been most cited ?
    select ?l2 as ?title count(?xCitedIn) as ?count
    where {
      ?s a <http://bio2rdf.org/pubmed_vocabulary#searchResults> .
      ?s rdfs:label ?l .
      ?s <http://bio2rdf.org/pubmed_vocabulary#xFoundIn> ?pubmed .
      ?pubmed <http://bio2rdf.org/pubmed_vocabulary:xCitedIn> ?xCitedIn .
      ?pubmed rdfs:label ?l2 .
      ?pubmed <http://bio2rdf.org/pubmed_vocabulary#xMesh> ?xMesh .
      ?xMesh rdfs:label "Semantics" .
    }
    order by desc(2)
  • 49. What is the relation between François Belleau and Michel Dumontier ?
  • 50. Using RelFinder http://www.visualdataweb.org/relfinder.php http://cshals.mashup.bio2rdf.org/relfinder
  • 51. Using Sentient Knowledge Explorer http://www.io-informatics.com/
  • 52. Gruff for AllegroGraph http://www.franz.com/agraph/gruff/
  • 53. Lesson #3 • To answer a specific question, build a mashup from SPARQL endpoints and query it. • To build your semantic mashup, use a workflow, which can be created with an ETL tool like Talend. • Explore the mashup with semantic software like the Virtuoso faceted browser, RelFinder, Gruff or Sentient.
  • 54. Projects • Add new data sources to the Bio2RDF collection of SPARQL endpoints; • Develop a Talend ETL Semantic Web extension to ease rdfizing and the SPARQL endpoint consumption needed to build mashups; • Create a mobile application to browse Bio2RDF or other SPARQL data sources.
  • 55. Looking forward to the next decade • More data providers will expose their data as SPARQL endpoints, but Bio2RDF is still needed. • Now that data has been converted to RDF (a dirty job), we need to ask useful questions of the Linked Data cloud (a hard one). SPARQL queries will not be sufficient and reasoners will be essential. • Semantic software for browsing, visualisation and editing will be created, and SPARQL federated query engines will become available. This will be the next game changer. • Intuitive mobile applications will give access to Semantic Web data in a user-friendly manner. • The data integration experience will be successful for scientist users only if our enthusiastic community gets organized, so governance for Linked Data in the Life Sciences is a major issue.
  • 56. LSSEC - Life Science SPARQL Endpoint Club https://groups.google.com/d/forum/life-science-sparql-endpoint-club A private club for SPARQL endpoint publishers to gather and discuss their concerns about Linked Data, ontologies and promotion of the Semantic Web in the Life Science community. To become a member you need to publish RDF or host a SPARQL endpoint of interest to the Life Science community.
  • 57. Acknowledgements • Bio2RDF is a community project available at http://bio2rdf.org • The community can be joined at https://groups.google.com/forum/?fromgroups#!forum/bio2rdf • This work was done under the supervision of Dr Arnaud Droit, assistant professor and director of the Centre de Biologie Computationnelle du CRCHUQ at Laval University, where Bio2RDF is hosted. • Michel Dumontier, from the Dumontier Lab at Carleton University, also hosts a Bio2RDF server, and his team created the new release 2. • Thanks to all the members of the Bio2RDF community, and especially Marc-Alexandre Nolin and Peter Ansell, the initial developers. • This work was supported by the Ministère du Développement Economique, Innovation Exportation (MDEIE).