Best practices for generating Bio2RDF linked data
Upcoming SlideShare
Loading in...5
×
 

Best practices for generating Bio2RDF linked data

on

  • 560 views

Slides summarizing best practices for generating Bio2RDF Linked Data

Slides summarizing best practices for generating Bio2RDF Linked Data

Statistics

Views

Total Views
560
Views on SlideShare
560
Embed Views
0

Actions

Likes
0
Downloads
2
Comments
0

0 Embeds 0

No embeds

Accessibility

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

Best practices for generating Bio2RDF linked data Best practices for generating Bio2RDF linked data Presentation Transcript

  • Best practices for generating linked data Tutorial @ ICBO 2013
  • Tutorial Roadmap
  • Bio2RDF Best Practices 1. Assign a URI for all things 2. Assign labels and identifiers 3. Declare and assign types 4. Provide dataset provenance
  • 1. Assign URIs for all things ● The base Bio2RDF URI pattern: http://bio2rdf.org/namespace:identifier ● Data provider record identifiers are maintained from source ● Linked Data = no blank nodes!
  • 1. Assign URIs for all things ● Data provider records are maintained from source ○ e.g. DrugBank’s resource IRI for Leucovorin http://bio2rdf.org/drugbank:DB00650
  • 1. Assign URIs for all things ● Vocabulary namespaces are used for dataset specific types and predicates http://bio2rdf.org/drugbank_vocabulary:Drug ● Resource namespaces are used to assign an identifier when one isn't a provided by the source - unique identifier with UUID, hash, counter, concatenated strings, etc http://bio2rdf.org/drugbank_resource:DB00440_DB00650
  • 1. Assign URIs for all things ● All valid namespaces are listed in the Bio2RDF Life Sciences Registry ○ ensures that URIs are consistent across all Bio2RDF datasets ○ registry is publicly available at http://tinyurl. com/dataregistry
  • 2. Assign labels and identifiers ● Use rdfs:label to assign a language-specified label for all resources ○ can be a source provided title, a script generated phrase, or a phrase provided in a third party dataset ○ Pattern: rdfs:label "label [ns:id]"@lang ● Use Dublin Core predicates for source- provided label and identifiers ○ Pattern: dc:title "label"@lang (assign language tag only when one is provided) ○ Pattern: dc:identifier "ns:id"^^xsd:string
  • 2. Assign labels and identifiers ● Use Bio2RDF predicates to assign Bio2RDF namespace and Bio2RDF identifiers: ○ Pattern: bio2rdf_vocabulary:namespace "ns"^^xsd: string ○ Pattern: bio2rdf_vocabulary:identifier "id"^^xsd: string
  • 2. Assign labels and identifiers Example: DrugBank entry for Nitrazepam drugbank:DB0159 rdfs:label "Nitrazepam [drugbank:DB0159]"@en ; dc:title “Nitrazepam”@en ; dc:identifier “drugbank:DB0159”^^xsd:string ; bio2rdf_vocabulary:namespace “drugbank”^^xsd:string ; bio2rdf_vocabulary:identifier “DB0159”^^xsd:string .
  • 3. Declare and assign types ● All resources should be typed as being resources of the dataset ○ Pattern: rdf:type namespace_vocabulary:Resource ● Instances of a dataset vocabulary type should also be typed as owl: NamedIndividual ○ Pattern: rdf:type namespace_vocabulary:Type ○ Pattern: rdf:type owl:NamedIndividual ● Classes should be typed as owl:Class ○ Pattern: rdf:type owl:Class ○ If superclass has been described using namespace_vocabulary pattern, then link class using rdfs:subClassOf
  • 3. Declare and assign types ● Object properties and datatype properties should also be typed ○ Pattern: rdf:type owl:ObjectProperty ○ Pattern: rdf:type owl:DatatypeProperty ● Examples: drugbank:DB0159 rdf:type drugbank_vocabulary:Resource ; rdf:type owl:Class ; rdfs:subClassOf drugbank_vocabulary:Drug . drugbank_vocabulary:ddi-interactor-in rdf:type owl:ObjectProperty .
  • 4. Provide dataset provenance data item Bio2RDF dataset Features -Entity-dataset link -Creator -Publisher -Date created -License & rights -Source -Availability - SPARQL endpoint - Data dump Vocabularies VoID Dublin Core W3C Provenance Bio2RDF vocabulary Source dataset prov:wasDerivedFrom void:inDataset
  • 4. Provide dataset provenance ● link every resource to the versioned/dated Bio2RDF dataset in which it is described ○ Pattern: void:inDataset <http://bio2rdf.org/dataset: namespace-dd-mm-yyyy.rdf> ○ Example: drugbank:DB0159 void:inDataset <http://bio2rdf. org/dataset:drugbank-03-07-2013> .
  • A crash course in PHP
  • PHP : Hypertext Preprocessor ● A general-purpose open source scripting language ○ homepage : http://php.net ● PHP scripts can be executed from the command line or embedded in HTML documents ● Syntactically similar to C/C++/Java but it is not strongly typed
  • A hello world PHP script ● All PHP scripts are surrounded by the <?php and ?> tags
  • Declaring and instantiating classes
  • Using the Bio2RDF PHP API to create an RDFizer ● Basic structure of a Bio2RDFizer script: ○ Initialize script parameters - input file(s), default dataset namespace, etc. ○ Define a Run() function that handles downloading and iterating over input files, as well as function calls to parse and convert input data to RDF ○ Define function(s) to convert input data to RDF using Bio2RDF API helper functions
  • Using the Bio2RDF PHP API to create an RDFizer ● Bio2RDF PHP API defines helper functions that implement Bio2RDF best practices: ○ getNamespace() ○ getVoc() ○ getRes() ○ triplify($subject, $predicate, $object) //object is an rdf resource ○ triplifyString($subject, $predicate, "string")// object is a literal ○ describeIndividual($uri, $label, $type, $title, $description, $language) ○ describeClass( ... ) ○ describeProperty ( ... )
  • Example: The Comparative Toxicogenomics Database CTD Bio2RDFizer script is available on GitHub
  • Using and contributing to the Bio2RDF project on GitHub
  • Using and contributing to the Bio2RDF project on GitHub 1. Fork the bio2rdf-scripts and php-lib repositories on Github https://help.github.com/articles/fork-a-repo 2. Write some code! 3. Commit code to your fork 4. Make a pull request to the bio2rdf-scripts repo