Producing, Publishing and Consuming Linked Data Three lessons from the Bio2RDF project
François Belleau (firstname.lastname@example.org)

Background

With the proliferation of new online databases, data integration continues to be one of the major unsolved problems for bioinformatics. In spite of initiatives like BioPAX [1], Biomart [2], and the integrated web resources of the EBI, KEGG and NCBI, the web of bioinformatics databases is still a web of independent data silos.

Since 2005, the aim of the Bio2RDF project has been to make popular public datasets available in RDF, the data description format of the growing Semantic Web. Initially, data from OMIM, KEGG and Entrez Gene, along with numerous other resources, were converted to RDF. Currently 38 SPARQL endpoints are available from the Bio2RDF server [3].

In 2009, the Bio2RDF project was the primary source of bioinformatics data in the Linked Data cloud. Today many organisations have started to publish their datasets or knowledge bases using the RDF/SPARQL standards. GO, Uniprot and Reactome were early converts to publishing in RDF. Most recently PDBJ, KEGG and NCBO have started to publish their own data in the new semantic way. From the data integration perspective, projects like BioLOD [4] from the Riken Institute and Linked Life Data [5] from Ontotext have pushed the Semantic Web model close to production-quality service. The Linked Data cloud of bioinformatics is now growing rapidly [6]. The technology incubation phase is over.

Figure caption: Lesson #1. Rdfize data using ETL software like Talend. This is the workflow producing triples from the GenBank HTML web page about external database references.

One question data providers should ask themselves now is: how costly is it to produce and publish data in RDF according to this new paradigm?
And, from the point of view of the bioinformatician as data consumer: how useful can Semantic Web technologies be for building the data mashups needed to support specific knowledge discovery tasks and the needs of domain experts?

These are the questions we answer here by proposing methods for producing, publishing and consuming RDF data, and by sharing the lessons we have learnt while building Bio2RDF.

Producing RDF

RDF is all about triples: building triples, storing triples and querying triples. A triple is defined by the subject-predicate-object model. If you have used a key-value table before, you already know what triples are. A collection of triples defines a graph so generic that all data can be represented using it. Every kind of data can be converted into triples from all known formats: HTML, XML, relational databases, column tables or key-value representations. Converting data to RDF is so important for building the Semantic Web that it is expressed by new verbs: triplify or rdfize! In building the Bio2RDF rdfizers, we had to deal with all of those kinds of data formats and sources.

Figure caption: These are the instructions creating triples from the data flow.

Figure caption: Expose data as RDF using dereferenceable URIs, according to design rules #1 and #2.

Lesson #1: Transforming data into RDF is an ETL (Extract, Transform, Load) task, and there are now free and professional frameworks available for this purpose.

Talend [7] is a first-class ETL framework, based on Eclipse, that generates native Java code from a graphical representation of the data transformation workflow. Using this professional-quality software to rdfize data is much more productive than writing Java, Perl or PHP scripts, as we used to do in the past.
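The key-value analogy can be made concrete with a minimal Python sketch. It is not the Talend workflow used by Bio2RDF; it only illustrates how one key-value record becomes subject-predicate-object triples serialised as N-Triples. The subject URI follows the http://bio2rdf.org/namespace:identifier pattern used by the project, while the function names, the http://example.org/vocab/ predicate namespace and the sample values are hypothetical.

```python
def escape_literal(value: str) -> str:
    """Escape the characters that N-Triples requires to be escaped in literals."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def record_to_ntriples(namespace, identifier, record,
                       vocab="http://example.org/vocab/"):
    """Turn one key-value record into a list of N-Triples lines.

    The subject is a Bio2RDF-style URI. Each key becomes a predicate in a
    hypothetical vocabulary. A value given as a (namespace, identifier)
    tuple is treated as a cross-reference and becomes a URI; any other
    value becomes a plain string literal.
    """
    subject = f"<http://bio2rdf.org/{namespace}:{identifier}>"
    lines = []
    for key, value in record.items():
        predicate = f"<{vocab}{key}>"
        if isinstance(value, tuple):
            obj = f"<http://bio2rdf.org/{value[0]}:{value[1]}>"
        else:
            obj = f'"{escape_literal(str(value))}"'
        lines.append(f"{subject} {predicate} {obj} .")
    return lines


# Placeholder record: the label and the cross-reference are invented values.
example = {"label": "Example entry", "xref": ("geneid", "0000")}
for line in record_to_ntriples("omim", "602080", example):
    print(line)
```

Each printed line is one triple. An ETL tool does the same thing at scale, with the extraction and loading steps handled by the framework rather than by hand-written code.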
To build the namespace SPARQL endpoint at Bio2RDF [8], an RDF mashup composed of GO, Uniprot, LSRN, GenBank, MIRIAM and Bio2RDF namespace descriptions, we generated RDF from XML, HTML, key/value files, tabular files and also an RDF dump. Using the Talend ETL framework has made the programming and quality testing far more efficient.

Publish on the Linked Data web

The inventor of HTML, Tim Berners-Lee, has also defined the rules by which the Semantic Web should be designed [9]:
1) Use URIs as names for things.
2) Use HTTP URIs so that people can look up those names.
3) When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL).
4) Include links to other URIs, so that they can discover more things.

In building Bio2RDF, we have been early adopters of those rules. The DBpedia project, a version of Wikipedia available in RDF format and through one of the first major public SPARQL endpoints, is at the heart of the Linked Data cloud. It is built using the Virtuoso triplestore [10], first-class software that is free and open source.

Figure caption: Lesson #2. To publish semantic data, use a triplestore like Virtuoso.

Figure caption: Make a SPARQL endpoint public so that queries can be submitted. Here is a query used to discover the schema of an unknown triplestore.

Lesson #2: To publish Semantic Web data, choose a good triplestore and make a SPARQL endpoint publicly available on the Internet.

Figure caption: Discover concepts using type-ahead search. Full-text search query results, with ranking based on the number of connections in the graph.

The Bio2RDF project has also depended on Virtuoso, and benefits from the innovations in each new version. Virtuoso not only offers a SPARQL endpoint for submitting queries based on the W3C standards; full-text search and a facet-browsing user interface are also available, so the RDF graph can be browsed, queried, searched and explored with a type-ahead completion service.
All this comes from one software product, directly out of the box.

Sesame [11], 4Store [12], Mulgara [13] and other new projects emerging each year make publishing data over the web a new, affordable reality.

Consuming triples

Why should we start using Semantic Web data and technologies? Because building a database from public resources on the web is more efficient than the traditional way of creating a data warehouse. The Giant Global Graph (GGG) of the entire Semantic Web is the new datastore from which you can build your semantic mashup with the tools of your choice.

To answer a high-level scientific question from data already available in RDF, you first need to build a specific triplestore that you can then query to obtain, hopefully, the expected results. Building a specific database just to answer a specific question: this is what semantic data mashups are about.

Figure caption: Lesson #3. Consume semantic data as you like, using HTTP GET, SOAP services or new tools designed to explore RDF data.

Lesson #3: Semantic data sources available from SPARQL endpoints can be consumed in all kinds of ways to create mashups.

For example, ways of consuming RDF include: (i) SPARQL queries over REST, (ii) RDF graphs dereferenced by URI over HTTP, (iii) SOAP services returning RDF or, better still, (iv) the new web services model proposed by the SADI framework [14]. Programming in Java, PHP, Ruby or Perl, using the RDF/XML, Turtle or JSON/RDF formats, is also possible, and the needed software gets better each year. It is a wild new world of open technologies for you to learn, use and benefit from.

Figure caption: Using the popular soapUI tool [16], you can consume Bio2RDF's SOAP services returning triples in N-Triples format.

The Bio2RDF project first offered an RDF graph that could be dereferenced by a URI of the form http://bio2rdf.org/omim:602080. Any HTTP GET request will return the RDF version of a document from one of the databases we expose as RDF, in the format of your choice.
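Options (i) and (ii) can be sketched with nothing more than the Python standard library. The sketch below only builds the HTTP requests. It assumes that the endpoint accepts the standard SPARQL Protocol `query` parameter over GET and that the server selects the output format from the `Accept` header; neither behaviour has been checked against the live Bio2RDF servers. The function names and the exploratory query are hypothetical and are not the query shown on the poster.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint named in the text; its current availability is not guaranteed.
SPARQL_ENDPOINT = "http://namespace.bio2rdf.org/sparql"

# Hypothetical exploratory query: list some of the classes used in a store.
LIST_CLASSES_QUERY = "SELECT DISTINCT ?type WHERE { ?s a ?type } LIMIT 50"


def sparql_get_request(endpoint, query,
                       accept="application/sparql-results+json"):
    """Build (without sending) a SPARQL query request over HTTP GET."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": accept})


def dereference_request(uri, accept="application/rdf+xml"):
    """Build (without sending) a GET request that dereferences a resource URI.

    The Accept header asks for a serialisation; whether a given server
    honours it is an assumption.
    """
    return Request(uri, headers={"Accept": accept})


def fetch(request, timeout=30):
    """Send a prepared request and return the response body as text."""
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch(dereference_request("http://bio2rdf.org/omim:602080"))` would retrieve the RDF description of that record, and `fetch(sparql_get_request(SPARQL_ENDPOINT, LIST_CLASSES_QUERY))` would return query results, provided the servers are reachable. Keeping request construction separate from the network call makes the sketch easy to test offline.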
Next, you can submit queries directly to one of our public SPARQL endpoints, like http://namespace.bio2rdf.org/sparql. By programming a script or designing a workflow with software like Taverna or Talend, you can build your data mashup from the growing Semantic Web data sources in days, not weeks.

To explore the possibilities offered by a triplestore, discover the Bio2RDF SPARQL endpoint about bioinformatics databases at http://namespace.bio2rdf.org/fct, and submit SPARQL queries to its endpoint at http://namespace.bio2rdf.org/sparql. And, if you are a SOAP services user, consume its web services described at http://namespace.bio2rdf.org/bio2rdf/SOAP/services.wsdl.

Discussion

Combining data from different sources is the main problem of data integration in bioinformatics. The Semantic Web community has addressed this problem for years; the emerging Semantic Web technologies are now mature and ready to be used in production-scale systems. The Bio2RDF community thinks that the data integration problem in bioinformatics can be solved by applying existing Semantic Web practices. The bioinformatics community could benefit significantly from what is being developed now; in fact, our community has done a lot to show that the Semantic Web model has great potential for solving Life Science problems. By sharing our own Bio2RDF experience and these simple lessons we have learned, we hope that you may give it a try in your next data integration project.

Figure caption: Using the RelFinder tool [15], it is possible to query RDF graphically and visualise the triplestore's graph.

Acknowledgements

● Bio2RDF is a community project available at http://bio2rdf.org
● The community can be joined at https://groups.google.com/forum/?fromgroups#!forum/bio2rdf
● This work was done under the supervision of Dr Arnaud Droit, assistant professor and director of the Centre de Biologie Computationnelle du CRCHUQ at Laval University, where a mirror of Bio2RDF is hosted.
● Michel Dumontier, from the Dumontier Lab at Carleton University, is also hosting a Bio2RDF server and currently leads the project.
● Thanks to all the members of the Bio2RDF community, and especially Marc-Alexandre Nolin and Peter Ansell, the initial developers.

References

1) http://www.biopax.org/
2) http://www.biomart.org/
3) http://www.bio2rdf.org/
4) http://biolod.org/
5) http://linkedlifedata.com/
6) http://richard.cyganiak.de/2007/10/lod/
7) http://talend.com/
8) http://namespace.bio2rdf.org/sparql
9) http://www.w3.org/DesignIssues/LinkedData.html
10) http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/
11) http://www.openrdf.org/
12) http://4store.org/
13) http://www.mulgara.org/
14) http://sadiframework.org
15) http://www.visualdataweb.org/relfinder.php
16) http://www.soapui.org/