This document discusses the Semantic Web and Linked Open Data. It explains how the Semantic Web helps integrate data by using shared vocabularies and URIs to normalize meanings between data sources. As more datasets adopt Semantic Web principles by exposing structured data through URIs and RDF formats, individual datasets become less isolated and are interconnected to form a large knowledge base. The document provides examples of querying and exploring Linked Open Data through SPARQL and the LOD Cloud. It also offers recommendations for publishing and working with Linked Open Data.
The Semantic Web and Linked Open Data: An Introduction
1. The Semantic Web and Linked Open Data
Pete DeVries
TaxonConcept.org
http://www.taxonconcept.org/
Department of Entomology
University of Wisconsin - Madison
2. What is the Semantic Web and how does it Work?
Let's Look at the Traditional Way
Taxon Table
Location Table
This data structure is really only interpretable within the context of this specific database
3. Data Islands
The result is a set of database islands that contain a lot of redundant, independently curated data.
Each effort benefits little from the other efforts.
4. Data Sets often Overlap
What they don’t have is a common set of field names or IDs
5. Each Data set has its own “Vocabulary”
Different Fields
Different Names for the Same Fields
Same Names for Different Fields
Different ways of Interpreting those Fields
These nuances in meaning are often only understood by the designers of each individual data set.
Consider how differently people interpret the meaning of what seem to be the same terms.
6. Where the Semantic Web Helps
Tim Berners-Lee’s 4 Rules
1. Use URIs* as names for things
2. Use HTTP URIs so that people can look up those names.
3. When someone looks up a URI, provide useful information.
4. Include links to other URIs so that they can discover more things.
*URI = Uniform Resource Identifier
http://www.w3.org/DesignIssues/LinkedData.html
7. Use URIs as Names for Things?
Instead of “Door County” use
http://sws.geonames.org/5250768/
8. For Humans this URI Dereferences to a Human Interpretable Web Page
9. For Machines this Dereferences to a Machine Interpretable File
As N-Triples
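The human/machine split on slides 8 and 9 is usually achieved with HTTP content negotiation: the same URI is requested with a different Accept header. The sketch below only builds the two requests and does not send them. The media types are assumptions about what a server might offer, not a statement of what sws.geonames.org actually returns.

```python
from urllib.request import Request

# URI taken from the slide above.
DOOR_COUNTY = "http://sws.geonames.org/5250768/"

def build_request(uri: str, machine: bool) -> Request:
    """Build (but do not send) a request for a human or a machine client."""
    # Assumed media types: N-Triples for machines, HTML for browsers.
    accept = "application/n-triples" if machine else "text/html"
    return Request(uri, headers={"Accept": accept})

human_req = build_request(DOOR_COUNTY, machine=False)
machine_req = build_request(DOOR_COUNTY, machine=True)

print(human_req.get_header("Accept"))
print(machine_req.get_header("Accept"))
```

Passing either request to `urllib.request.urlopen` would perform the actual lookup; a server may redirect or ignore the header, so the response type should be checked rather than assumed.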
10. Why Would Anyone Think this Made Sense?
Now, each of these different databases is using an ID with a shared meaning.
A meaning that can be determined by dereferencing the URI.
All the data sets that use this vocabulary are now connectable.
All the data sets that are linked to this URI are now also linked to each other.
11. Life Sciences Example
Example: Two databases with county records
One uses “La Crosse County,” the other lists “La Crosse” for La
Crosse County, Wisconsin
You want to link and merge those records so that it is clear that you
mean a particular species was observed in a particular county
12. Normalize the Meaning between Data Sources
Use this shared vocabulary to integrate these two data sources
Use that shared vocabulary to find and link to other relevant data
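A minimal sketch of this normalization step, assuming two small, made-up record sets whose field names and county spellings differ. The lookup table is hand-built for illustration; the GeoNames URI is the one this deck uses for La Crosse County.

```python
# Hypothetical records: one source uses "county", the other "cnty".
insect_records = [{"species": "Ochlerotatus triseriatus", "county": "La Crosse County"}]
plant_records = [{"species": "Quercus alba", "cnty": "La Crosse"}]

# Hand-built mapping from local spellings to one shared identifier.
COUNTY_URI = {
    "La Crosse County": "http://sws.geonames.org/5258961/",
    "La Crosse": "http://sws.geonames.org/5258961/",
}

def normalize(records, county_field):
    """Replace each source's local county string with the shared URI."""
    return [
        {"species": r["species"], "county_uri": COUNTY_URI[r[county_field]]}
        for r in records
    ]

# Once both sources use the same URI, merging is a plain concatenation.
merged = normalize(insect_records, "county") + normalize(plant_records, "cnty")

# Group species by the shared county identifier.
by_county = {}
for row in merged:
    by_county.setdefault(row["county_uri"], []).append(row["species"])

print(by_county)
```

Any other data set that uses the same URI can be joined in the same way, without agreeing on field names first.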
13. As More Data Sets Adopt these Principles
The individual datasets are no longer islands, but are one interconnected knowledge base
14. Other Benefits
Reduced duplication of effort and a better separation of concerns
It would be more efficient for me to simply link to a bibliographic reference URI on a site that specializes in that than to create my own bibliographic database.
Similarly, it would be more efficient for the bibliographic database to link to a URI in a nomenclatural database than to curate that aspect separately.
When represented as URIs in a Semantic Web database or “Triple Store”, information can be encoded more efficiently, at roughly 32 bytes per statement.
This enables usable knowledge bases that scale to billions of “facts”.
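As a rough back-of-the-envelope check of what that figure implies, the sketch below multiplies the quoted ~32 bytes per statement by a statement count. It is an estimate only and assumes the figure covers everything; real stores may add index and other overhead not accounted for here.

```python
BYTES_PER_STATEMENT = 32  # approximate figure quoted on the slide

def store_size_gb(statements: int) -> float:
    """Estimated size in decimal gigabytes for a given number of statements."""
    return statements * BYTES_PER_STATEMENT / 1e9

print(store_size_gb(1_000_000_000))  # one billion statements -> 32.0
```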
16. What is Linked Open Data?
1. data representation using open standards
2. use of hyperlinks to make it work on the global web
17. Wikipedia Images linked to my Species Concepts
TaxonConcept <=> DBpedia <=> WikiCommons Images
Virtuoso OpenSource and Microsoft Pivot
(some images are too large to display)
18. How do I Mark up my Data?
Your data set can continue to exist in its current relational
database form, but you need to expose it to the semantic web in a
different form
The goal is to make structured data accessible and discoverable via
hyperlinks.
It also includes the use of hyperlinks to denote properties/
predicates that have well defined semantics.
These semantics are what ontologies and vocabularies deliver with
more fidelity than what's available in a typical RDBMS.
Thus, the Semantic Web isn't a destination - it is the effect of
publishing data in line with a set of principles as outlined in TimBL's
meme.
19. Knowledge as Triples
Statements are represented in a triple structure
Subject ➜ Predicate ➜ Object
• An English text version of a triple might look like
• Ochlerotatus triseriatus expected in La Crosse County, WI
20. Machine Processable Version
Ochlerotatus triseriatus is expected in La Crosse County, WI
Now represented as the following triple*
http://lod.taxonconcept.org/ses/iuCXz#Species
http://lod.taxonconcept.org/ontology/txn.owl#isExpectedIn
http://sws.geonames.org/5258961/
*Not Meant for Human Consumption
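The three URIs above can be held as a simple subject/predicate/object record and written out as one N-Triples line. This is a minimal sketch that handles only triples whose three parts are all URIs; literals and blank nodes are out of scope. The class and method names are illustrative.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

    def to_ntriples(self) -> str:
        """Serialize a URI-only triple as a single N-Triples line."""
        return f"<{self.subject}> <{self.predicate}> <{self.obj}> ."

# The triple from the slide above.
expected_in = Triple(
    "http://lod.taxonconcept.org/ses/iuCXz#Species",
    "http://lod.taxonconcept.org/ontology/txn.owl#isExpectedIn",
    "http://sws.geonames.org/5258961/",
)

print(expected_in.to_ntriples())
```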
22. The Same Triple in Different Formats
RDF/XML (.rdf)
N3 (.n3)
Turtle (.ttl)
You might find one of these forms easier to create.
There are various tools that will allow you to convert between one form and another.
If you need RDF/XML but find N3 easier to write, author in N3 and then convert those files to RDF/XML.
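To illustrate how the formats differ only in surface syntax, the sketch below writes the deck's example triple in Turtle by abbreviating URIs against a prefix table. The prefix name `txn` is an assumption chosen for this example, and the abbreviation is naive: it does not check that the local part is a legal Turtle name, so a real conversion should use a dedicated RDF tool.

```python
# Assumed prefix table; "txn" is an illustrative prefix name.
PREFIXES = {"txn": "http://lod.taxonconcept.org/ontology/txn.owl#"}

def abbreviate(uri: str) -> str:
    """Return prefix:local if a known namespace matches, else <uri>."""
    for prefix, namespace in PREFIXES.items():
        if uri.startswith(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    return f"<{uri}>"

def to_turtle(subject: str, predicate: str, obj: str) -> str:
    """Write one URI-only triple as Turtle, with prefix declarations first."""
    header = "".join(f"@prefix {p}: <{ns}> .\n" for p, ns in PREFIXES.items())
    return f"{header}{abbreviate(subject)} {abbreviate(predicate)} {abbreviate(obj)} ."

turtle = to_turtle(
    "http://lod.taxonconcept.org/ses/iuCXz#Species",
    "http://lod.taxonconcept.org/ontology/txn.owl#isExpectedIn",
    "http://sws.geonames.org/5258961/",
)
print(turtle)
```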
23. How do I tell the Semantic Web
about my Data?
PingtheSemanticWeb
http://pingthesemanticweb.com/
Semantic Sitemaps
http://sw.deri.org/2007/07/sitemapextension/
25. Semantic SiteMaps
http://site.example.com/sitemap.xml
http://site.example.com/sitemap.xml.gz
Refer to the sitemap.xml file in your site's robots.txt file
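A small sketch of what that robots.txt reference can look like, generated from Python. It uses the `Sitemap:` line from the sitemaps protocol and an empty `Disallow:` rule that permits all crawling; the function name is illustrative, and the crawl policy is an assumption you would adapt to your own site.

```python
def robots_txt(sitemap_url: str) -> str:
    """Return a permissive robots.txt body that points crawlers at a sitemap."""
    return (
        "User-agent: *\n"
        "Disallow:\n"
        "\n"
        f"Sitemap: {sitemap_url}\n"
    )

# Example sitemap URL from the slide above.
content = robots_txt("http://site.example.com/sitemap.xml")
print(content)
```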
26. How can I Find other Potentially Useful
Data Sets?
CKAN Comprehensive Knowledge Archive Network
http://ckan.net/
27. Ask the LOD Cloud
Enter a term or name, such as “Quercus alba”, to see which entities contain it
29. How can I set up my own Knowledge Base?
Virtuoso Open-Source Edition
http://virtuoso.openlinksw.com/
30. How can I Query a Knowledge Base?
SPARQL
http://en.wikipedia.org/wiki/SPARQL
http://www.w3.org/TR/rdf-sparql-query/
Query using the Web Interface
Query using your own script or web application
Example
“Describe those occurrences of the species concept Boloria selene”
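Querying from your own script usually means sending the query text to a SPARQL endpoint over HTTP, with the query passed in a `query` parameter. The sketch below only builds such a URL and does not send it. Both the endpoint address and the species concept URI are hypothetical placeholders; substitute the real ones for your knowledge base.

```python
from urllib.parse import urlencode, urlparse, parse_qs

ENDPOINT = "http://example.org/sparql"  # hypothetical endpoint
CONCEPT_URI = "http://example.org/species/boloria-selene"  # hypothetical URI

def describe_url(endpoint: str, resource_uri: str) -> str:
    """Build a GET URL that asks the endpoint to DESCRIBE one resource."""
    query = f"DESCRIBE <{resource_uri}>"
    return f"{endpoint}?{urlencode({'query': query})}"

url = describe_url(ENDPOINT, CONCEPT_URI)
print(url)

# Decoding the URL recovers the original query text.
print(parse_qs(urlparse(url).query)["query"][0])
```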
33. More Elaborate SPARQL Query
Query for those mammals that are “expected in” Wisconsin.
* Use the OPTIONAL keyword for those attributes that may not exist
* The query lists those attributes that should be returned
The result set will be fed through Microsoft Pivot for Browsing
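The effect of OPTIONAL can be sketched without a triple store: required patterns filter the results, while an optional pattern adds a value when one exists and leaves the slot empty otherwise. The data below is made up, and everything in the `http://example.org/` namespace is a hypothetical placeholder; only the `isExpectedIn` predicate comes from this deck. The query string shows the shape of the SPARQL that the Python function imitates.

```python
TXN = "http://lod.taxonconcept.org/ontology/txn.owl#"
EX = "http://example.org/"  # hypothetical namespace for illustration

# The SPARQL query whose behaviour is imitated below (not executed here).
QUERY = f"""
PREFIX txn: <{TXN}>
PREFIX ex: <{EX}>
SELECT ?mammal ?image WHERE {{
  ?mammal txn:isExpectedIn ex:wisconsin .
  OPTIONAL {{ ?mammal ex:image ?image }}
}}
"""

# Made-up triples: mammalA has an image, mammalB does not.
triples = [
    (EX + "mammalA", TXN + "isExpectedIn", EX + "wisconsin"),
    (EX + "mammalA", EX + "image", EX + "mammalA.jpg"),
    (EX + "mammalB", TXN + "isExpectedIn", EX + "wisconsin"),
    (EX + "mammalC", TXN + "isExpectedIn", EX + "elsewhere"),
]

def expected_in(place: str, optional_predicate: str):
    """Return (subject, optional value) rows; the value is None when absent."""
    rows = []
    for s, p, o in triples:
        if p == TXN + "isExpectedIn" and o == place:
            values = [o2 for s2, p2, o2 in triples
                      if s2 == s and p2 == optional_predicate]
            for value in values or [None]:
                rows.append((s, value))
    return rows

rows = expected_in(EX + "wisconsin", EX + "image")
print(rows)
```

Without OPTIONAL, the image pattern would be required and mammalB would drop out of the results entirely.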
39. What does the Future hold for the
Semantic Web and Linked Open Data
Improvements in the quantity and quality of LOD data sets.
Improved Alignment of Vocabularies
Improvements in SPARQL and Quadstores
Human and Machine Interpretable Views Merged in RDFa
Better Visualization and Analysis Tools
40. Other Resources
Linked Open Data http://linkeddata.org/
W3C.org http://esw.w3.org/Main_Page
public-lod email list http://lists.w3.org/Archives/Public/public-lod/
TaxonConcept.org http://www.taxonconcept.org/
TaxonConcept.org Examples http://bit.ly/bundles/pjdlinkeddata/
SlideShare Talks
Evolution Towards Web 3.0: The Semantic Web
http://www.slideshare.net/LeeFeigenbaum/evolution-towards-web-30-the-semantic-web
41. Recommendations
Try using and experimenting with existing vocabularies before creating
your own.
Although these technologies allow you to run queries that you might not
have anticipated, thinking about use cases etc. will provide some guidance
on the best way to mark up your data.
Start with simple models and representations and add complexity as you
gain experience.
You may not want or be able to expose all your data to the LOD Cloud,
but exposing the metadata in commonly used vocabularies will make your
data more “findable”
Some vocabularies* are still under development and discussion, but in
many cases you can modify your SQL to RDF export to accommodate
changes.
* For instance, it is not clear to me what is the “best” vocabulary for
representing publications.
42. Acknowledgments
Kingsley Idehen
http://www.openlinksw.com/blog/~kidehen/
David “Paddy” Patterson mbl.edu
Anne Thessen mbl.edu
Dmitry Mozzherin mbl.edu
Han Wang rpi.edu
Patrick Leary eol.org
Editor's Notes
Today I am going to give you a brief overview of the semantic web and how it can be useful for life sciences data.
Here is a traditional table in a spreadsheet. It is a list of the various species and includes an ID field that is used to connect it to another table for locations.
In this we have a representation for taxa and a representation for location that are often specific to this and only this database.
Other similar databases that might be useful will have different names for the fields, and different names within the fields for what is often the same entity.
What you have is a data island that knows nothing else about potentially related data and shares nothing about itself with other data sets.
The result of this structure is large islands of data that are difficult to integrate.
Each of these gains little value from other data sets and is of little value to them.
Different data sets often overlap.
Use URIs as names for things.
Use HTTP URIs so that people can look up those names.
When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL).
Include links to other URIs so that they can discover more things.
Let's look at an example of how these unique IDs can be used to reduce ambiguity and allow easier integration of disparate data sets.
Here we have two databases of collection records, one for Wisconsin Insects, the other for Wisconsin Plants.
One of the databases uses “county” for the county field, the other uses “cnty”. In addition, one database lists “La Crosse County” while the other lists “La Crosse”.
What you want to do is link and merge those records so that it is clear that you mean that a particular species was observed in a particular county.
If both these data sources use the geonames vocabulary, then it is easy to integrate the data for both insects and plants.
It is also possible to look for other uses of this identifier to find related data about this particular county.
Linked Data is data that is linked together following the principles laid out by Tim Berners-Lee.
Linked Open Data is Linked Data that is open and accessible.
There are ways to query this knowledge base, but you can also create your own subset for your own knowledge base.
Since all these data sets are connected, you can do some interesting things.
Because my data set is linked to Wikipedia through DBpedia, I can easily pull in all the images for my species that are in Wikipedia.
The RDF icons are for images that are too large to be displayed.
Your data set can continue to exist in its current relational database form, but you need to expose it to the semantic web in a different form.
The goal is to make structured data accessible and discoverable via hyperlinks.
It also includes the use of hyperlinks to denote properties/predicates that have well defined semantics.
These semantics are what ontologies and vocabularies deliver with more fidelity than what's available in a typical RDBMS.
Thus, the Semantic Web isn't a destination - it is the effect of publishing data in line with a set of principles as outlined in TimBL's meme.
The semantic web represents statements as triples.
Triples consist of a subject, a predicate and an object.
An English language version of a triple might look something like this:
“Ochlerotatus triseriatus occurrence in La Crosse County, WI”
I can now use these unique identifiers to make machine processable statements about these entities.
The statement “Ochlerotatus triseriatus is expected in La Crosse County, WI” can now be represented as the following triple:
<http://lod.taxonconcept.org/ses/iuCXz#Species>
<http://lod.taxonconcept.org/ontology/txn.owl#isExpectedIn>
<http://sws.geonames.org/5258961/> .
It is important to recognize that these statements are part of the database, but they are not there for humans to process; they are there so that it is clear to this system and others what we actually mean.
Here are the different ways of representing that original triple.
CKAN serves as a registry of data sets. It does not represent all linked data sets, but it is the information source that is used to generate the LOD Cloud Diagram.
Triple stores and quadstores have their own SQL-like query language called SPARQL.
Virtuoso has a human accessible iSPARQL interface.
Faceted view allows you to display the returned results in different ways.
“The LOV dataset contains the description of RDFS vocabularies or OWL ontologies defined for and used by datasets in the Linked Data Cloud. Whenever available each vocabulary includes references to the datasets using it, in particular those listed in CKAN.”
It documents what vocabularies different data sets are using.
Here are some examples of the early EoL LOD data. We have a taxon, the Giant Panda, which links to several things, most notably various types of data objects.
Here is one such data object. It is an Image Object that links to the original and various other representations of the image.
This shows a scientific name string, which is linked to three different data sets that include taxa with that same scientific name. If the name is a homonym it is possible to see that there are different kinds of concepts with that same name.
Improvements in the quantity and quality of LOD data sets.
Improved Alignment of Vocabularies
Improvements in SPARQL and Quadstores
Human and Machine Interpretable Views Merged in RDFa
Better Visualization and Analysis Tools