Providing Linked Data


Published on

This presentation covers the whole spectrum of Linked Data production and exposure. After a grounding in the Linked Data principles and best practices, with special emphasis on the VoID vocabulary, we cover R2RML, operating on relational databases, Open Refine, operating on spreadsheets, and GATECloud, operating on natural language. Finally we describe the means to increase interlinkage between datasets, especially the use of tools like Silk.

Published in: Technology, Education

Providing Linked Data

  1. 1. Providing Linked DataPresented by:Barry Norton
  2. 2. Motivation: Music!2VisualizationModuleMetadataStreaming providersPhysical WrapperDownloadsDataacquisitionR2R Transf.LD WrapperMusical ContentApplicationAnalysis &Mining ModuleLDDatasetAccessLD WrapperRDF/XMLIntegratedDatasetInterlinking CleansingVocabularyMappingSPARQLEndpointPublishingRDFaOther content
  3. 3. LINKED DATA LIFECYCLEEUCLID - Querying Linked Data 3
  4. 4. Linked Data Principles1. Use URIs as names for things.2. Use HTTP URIs so that users can look upthose names.3. When someone looks up a URI, provideuseful information, using the standards(RDF*, SPARQL).4. Include links to other URIs, so that userscan discover more things.EUCLID - Providing Linked Data 4CH 1
  5. 5. Linked DataLifecycleLinked Data LifecycleEUCLID - Providing Linked Data 5Source: Sören Auer. “The Semantic Data Web” (slides)Source: José M. Alvarez. “My Linked Data Lifecycle”Source: Michael Hausenblas. “Linked Data lifeyclcle”
  6. 6. CoreTasks for ProvidingLinked DataEUCLID - Providing Linked Data 6Based on the proposed LD lifecycles and the LDprinciples, we can identify 3 main tasks for providing LD:① Creating: includes data extraction, creation of HTTPURIs, and vocabulary selection. (LD principles 1 & 2)② Interlinking: involves the creation of (RDF) links toexternal data sets. (LD principle 4)③ Publishing: consists of creating the metadata andmaking the data set accessible. (LD principle 3)
  7. 7. Agenda1. Creating Linked Data2. Interlinking Linked Data3. Publishing Linked Data4. Linked Data publishing checklist7EUCLID - Providing Linked Data
  8. 8. CREATING LINKED DATAEUCLID - Querying Linked Data 8
  9. 9. • The data of interest may be stored in a wide range orformats:• Several tools support the process of mining datafrom different repositories, for example:Extracting the Data9EUCLID - Providing Linked DataSpreadsheetsor tabular dataDatabases TextR2RML
  10. 10. Using the RDF Data ModelEUCLID - Providing Linked Data 10• The RDF data model is used to represent theextracted information• The nodes represent the concepts/entities within thedata. A node corresponds to a URI, a blank node or aliteral (only in predicates)• The relationships between the concepts/entities aremodeled as arcsSubject ObjectPredicate
  11. 11. NamingThings: URIs• All the things or distinct entities within the data mustbe named• According to the Linked Data principles, the standardmechanism to name entities is the URI• Designing Cool URIs:– Leave out information about the data regarding to: author,technologies, status, access mechanisms, …– Simplicity: short, mnemonic URIs– Stability: maintain the URIs as long as possible– Manageability: issue the URIs in a way that you can manage11EUCLID - Providing Linked DataSource:
  12. 12. SelectingVocabularies• Vocabularies model the concepts and therelationship between them in a knowledge domain• Terms from well-known vocabularies should bereused wherever possible• New terms should be define only if you can not findrequired terms in existing vocabularies• A large number of vocabularies in RDF are openlyavailable, e.g., Linked Open Vocabularies (LOV)12EUCLID - Providing Linked Data
  13. 13. SelectingVocabularies (2)EUCLID - Providing Linked Data 13Linked OpenVocabularies322 vocabulariesclassified by domainSource:
  14. 14. SelectingVocabularies (3)EUCLID - Providing Linked Data 14Linked OpenVocabularies: Analyzing MusicOntologySource:
  15. 15. SelectingVocabularies (4)EUCLID - Providing Linked Data 15Other lists of well-known vocabularies are maintainedby:• W3C SWEO Linking Open Data community project• Library Linked Data Incubator Group: Vocabularies inthe library domain
  16. 16. INTERLINKING LINKED DATAEUCLID - Providing Linked Data 16
  17. 17. Interlinking Data Sets• It’s one of the Linked Data principles!• Involves the creation of RDF links between twodifferent RDF data sets:– Links at instance level (rdfs:seeAlso, owl:sameAs)– Links at schema level (RDFS subclass/subproperty, OWLequivalent class/property, SKOS mapping properties)• Appropriate links are detected via link discoveryEUCLID - Providing Linked Data 174. Include links to other URIs, so that users can discovermore things.
  18. 18. Interlinking Data Sets (2)Challenges for link discovery• Linked Data sets are heterogeneous in terms ofvocabularies, formats and data representation• Large range of knowledge domains• Scalability: LD is composed of a large number of datasets and RDF triples, hence it is not possible tocompare every possible entity pairEUCLID - Providing Linked Data 18Source: Robert Isele. “LOD2 Webinar Series:Silk”
  19. 19. Interlinking Data Sets (3)Challenges for link discovery• It corresponds to the entity resolution problem:deciding whether two entities correspond to same object inthe real world• Name ambiguities: typos, misspellings, differentlanguages, homonyms• Structural ambiguities: same concepts/entities withdifferent structures. Requires the application of ontologyand schema matching techniquesEUCLID - Providing Linked Data 19
  20. 20. Interlinking Data Sets (4)EUCLID - Providing Linked Data 20RDF data setscan be interlinked:Manually• Involves the manual exploration ofLD data sets and their RDFresources to identify linking targets• May not be feasible when thenumber of entities within the dataset is very largeAutomatically• Using tools that perform linkdiscovery based on linkage rules, forexample: Silk, Limes and xCurator
  21. 21. owl:sameAs & rdfs:seeAlso• owl:sameAs• Creates links between individuals• States that two URIs refer to the same individuals• rdfs:seeAlso• States that a resource may provide additional informationabout the subject resource• Links in MusicBrainz:– owl:seeAlso is used for music artists– rdfs:seeAlso is used for albumsEUCLID - Providing Linked Data 21
  22. 22. SKOS• Simple Knowledge Organization System–• Data model for knowledge organization systems(thesauri, classification scheme, taxonomies)• SKOS data is expressed as RDF triples• Allows the creation of RDF links between differentdata sets with the usage of mapping propertiesEUCLID - Providing Linked Data 22
  23. 23. SKOS: Mapping PropertiesThese properties are used to link SKOS concepts(particularly instances) in different schemes:• skos:closeMatch: links two concepts that aresufficiently similar (sometimes can be used interchangeably)• skos:exactMatch: indicates that the two conceptscan be used interchangeably.• Axiom: It is a transitive property• skos:relatedMatch: states an associative mappinglink between two conceptsEUCLID - Providing Linked Data 23
  24. 24. Example of SKOS exact matchSKOS: Mapping Properties (2)EUCLID - Providing Linked Data 24mo:MusicArtist skos:exactMatch dbpedia-ont:MusicalArtist.@prefix skos: <>@prefix mo: <>@prefix dbpedia-ont: <>@prefix schema: <>mo:MusicGroup skos:exactMatch skos:exactMatch dbpedia-ont:Band.
  25. 25. Example of SKOS close matchSKOS: Mapping Properties (3)EUCLID - Providing Linked Data 25mo:SignalGroup skos:closeMatch schema:MusicAlbum.@prefix skos: <>@prefix mo: <>@prefix dbpedia-ont: <>@prefix schema: <>mo:SignalGroup skos:closeMatch dbpedia-ont:Album.
  26. 26. Integrity conditions• Guarantee consistency and avoid contradictions in therelationships between SKOS conceptsSKOS: Mapping Properties (4)EUCLID - Providing Linked Data 26skos:MappingRelationskos:closeMatchskos:exactMatchskos:relatedMatchSymmetric& TransitiveDisjointwithPartial Mapping Relation diagram with integrity conditionsSymmetric
  27. 27. PUBLISHING LINKED DATAEUCLID - Providing Linked Data 27
  28. 28. Publishing Linked DataOnce the RDF data set has been created andinterlinked, the publishing process involves thefollowing tasks:1. Metadata creation for describing the data set2. Making the data set accessible3. Exposing the data set in Linked Data repositories4. Validating the data setEUCLID - Providing Linked Data 28
  29. 29. • Consists of providing (machine-readable) metadataof RDF data sets which can be processed by engines• This information allows for:– Efficient and effective search of data sets– Selection of appropriate data sets (for consumption orinterlinking)– Get general statistics of the data setsEUCLID - Providing Linked Data 29Describing RDF Data Sets
  30. 30. Describing RDF Data Sets (2)• The common language for describing RDF data sets isVoID (Vocabulary of Interlinked Data sets)• Defines an RDF data set with the predicatevoid:Dataset• Covers 4 types of metadata:EUCLID - Providing Linked Data 30• General metadata• Structural metadata• Descriptions of linksets• Access metadata
  31. 31. VoID: General Metadata• General metadata is used by users to identifyappropriate data sets.• Specifies information about description of the dataset, contact person/organization, the license of thedata set, data subject and some technical features.• VoID (re)uses predicates from the Dublin CoreMetadata1 and FOAF2 vocabularies.EUCLID - Providing Linked Data 311
  32. 32. VoID: General Metadata (2)Predicate Range Descriptiondcterms:title Literal Name of the data set.dcterms:description Literal Description of the data set.dcterms:source RDF resource Source from which the data set was derived.dcterms:creator RDF resource Primarily responsible of creating the data set.dcterms:date xsd:date Time associated with an event in the life-cycle of the resource.dcterms:created xsd:date Date of creation of the data set.dcterms:issued xsd:date Date of publication of the data set.dcterms:modified xsd:date Date on which the data set was changed.foaf:homepage Literal Name of the data set.dcterms:publisher RDF resource Entity responsible for making the data set available.dcterms:contributor RDF resource Entity responsible for making contributions to the data set.EUCLID - Providing Linked Data 32Source: InformationContains information about the creation of the data set
  33. 33. VoID: General Metadata (3)Other Information• License of the data set: specifies the usage conditions ofthe data. The license can be pointed with the propertydcterms:license• Category of the data set: to specify the topics or domainscovered by the data set, the property dcterms:subject canbe used• Technical features: the property void:feature can beused to express technical properties of the data (e.g. RDFserialization formats)EUCLID - Providing Linked Data 33
  34. 34. VoID: Structural MetadataEUCLID - Providing Linked Data 34• Provides high-level information about the internalstructure of the data set• This metadata is useful when exploring or queryingthe data set• Includes information about resources, vocabulariesused in the data set, statistics and examples ofresources in the data set
  35. 35. VoID: Structural Metadata (2)EUCLID - Providing Linked Data 35Information about resources• Example resources: allow users to get an impression of thekind of resources included in the data set. Examples can beshown with the property void:exampleResource• Pattern for resource URIs: the void:uriSpace propertycan be used to state that all the entity URIs in a data set startwith a given string:MusicBrainz a void:Dataset;void:exampleResource<> .:MusicBrainz a void:Dataset;void:uriSpace "" .
  36. 36. VoID: Structural Metadata (3)EUCLID - Providing Linked Data 36Vocabularies used in the data set• The void:vocabulary property identifies the vocabulary orontology that is used in a data set• Typically, only the most relevant vocabularies are listed• This property can only be used for entire vocabularies. Itcannot be used to express that a subset of the vocabularyoccurs in the data set.:MusicBrainz a void:Dataset;void:vocabulary <> .
  37. 37. VoID: Structural Metadata (4)EUCLID - Providing Linked Data 37Source: about a data setExpress numeric statistics about a data set:Predicate Range Descriptionvoid:triples Number Total number of triples contained in the data set.void:entities NumberTotal number of entities that are described in the data set.An entity must have a URI, and match the void:uriRegexPatternvoid:classes Number Total number of distinct classes in the data set.void:properties Number Total number of distinct properties in the data set.void:distinctSubjects Number Total number of distinct subjects in the data set.void:distinctObjects Number Total number of distinct objects in the data set.void:documents Number Total number of documents, in case that the data set ispublished as a set of individual documents.
  38. 38. VoID: Structural Metadata (5)EUCLID - Providing Linked Data 38Partitioned data sets• The void:subset property provides description of parts of adata set• Data sets can be partitioned based on classes or properties:• void:classPartition contains only instances of a particular class• void:propertyPartition contains only triples with a particular predicate:MusicBrainz a void:Dataset;void:subset :MusicBrainzArtists .:MusicBrainz a void:Dataset;void:classPartition [ void:class mo:Release .] ;void:propertyParition [ void:property mo:member .] .
  39. 39. VoID: Describing LinksetsEUCLID - Providing Linked Data 39• Linkset: collection of RDF links between twoRDF data sets:DS1 :DS2:LS1 :LS2Image based on void:<>@PREFIX owl:<>:DS1 a void:Dataset .:DS2 a void:Dataset .:DS1 void:subset :LS1 .:LS1 a void:Linkset;void:linkPredicateowl:sameAs;void:target :DS1, :DS2 .
  40. 40. VoID: Describing Linksets (2)EUCLID - Providing Linked Data 40Example@PREFIX void:<>@PREFIX skos:<>:MusicBrainz a void:Dataset .:DBpedia a void:Dataset .:MusicBrainz void:classPartition :MBArtists .:MBArtists void:class mo:MusicArtist .:MBArtists a void:Linkset;void:linkPredicateskos:exactMatch;void:target :MusicBrainz, :DBpedia .
  41. 41. The access metadata describes the methods ofaccessing the actual RDF data set* This assumes that the default graph of the SPARQL endpoint contains the data set.VoID cannot express that a data set is contained a specific named graph. This can bespecified with SPARQL 1.1. Service DescriptionVoID: Access MetadataEUCLID - Providing Linked Data 41Method Predicate DescriptionURI look up endpoint void:uriLookupEndpointSpecifies the URI of a service for accessing thedata set (different from the SPARQL protocol)Root resource void:rootResourceURI of the top concepts (only for data setsstructured as trees)SPARQL endpoint void:sparqlEndpoint Provides access to the data set via the SPARQLprotocol.*RDF data dumps void:dataDump Specifies the location of the dump file. If the dataset is split into multiple files, then several values ofthis property are provided.CH 5
  42. 42. Providing Access to theData SetThe data set can be accessed via differentmechanisms:EUCLID - Providing Linked Data 42RDFaRDFdumpSPARQLendpointDereferencingHTTP URIs
  43. 43. Dereferencing HTTP URIs• Allows for easily exploring certain resourcescontained in the data set• What to return for a URI?• Immediate description: triples where the URI is the subject.• Backlinks: triples where the URI is the object.• Related descriptions: information of interest in typical usage scenarios.• Metadata: information as author and licensing information.• Syntax: RDF descriptions as RDF/XML and human-readable formats.• Applications (e.g. LD browsers) render the retrievedinformation so it can be perceived by a user.EUCLID - Providing Linked Data 43Source: How to Publish Linked Data on The Web - Chris Bizer, Richard Cyganiak, Tom Heath.CH 1
  44. 44. Dereferencing HTTP URIs (2)Example: DereferencingEUCLID - Providing Linked Data 44
  45. 45. RDFa• RDFa = “RDF in attributes”• Extension to HTML5 for embedding RDF withinHTML pages:– The HTML is processed by the browser, the (human)consumer don’t see the RDF data– The RDF triples within the page are consumed by APIs toextract the (semi-)structured data• It is considered as the bridge between the Web ofData and the Web of Documents• It is a complete serialization of RDFEUCLID - Providing Linked Data45
  46. 46. RDFa: AttributesAttribute role Attribute DescriptionSyntaxprefix List of prefix-name IRIs pairsvocab IRI that specifies the vocabulary where the concept is definedSubject about Specifies the subject of the relationshipPredicateproperty Express the relationship between the subject and the valuerel Defines a relation between the subject and a URLrev Express reverse relationships between two resourcesResourcehref Specifies an object URI for the rel and rev attributesresource Same as href (used when href is not present)src Specifies the subject of a relationshipLiteraldatatype Express the datatype of the object of the property attributecontent Supply machine-readable content for a literalxml:lang, lang Specifies the language of the literalMacro typeof Indicate the RDF type(s) to associate with a subjectinlist An object is added to the list of a predicate.EUCLID - Providing Linked Data46
  47. 47. RDFa: ExampleExtracting RDF from HTMLEUCLID - Providing Linked Data47<div class="artistheader"about=""typeof="">…</div><>HTML (+RDFa):RDF:
  48. 48. RDFa: ExampleExtracting RDF from HTMLEUCLID - Providing Linked Data48<div class="artistheader"about=""typeof="">…</div><><>HTML (+RDFa):RDF:
  49. 49. RDFa: ExampleExtracting RDF from HTMLEUCLID - Providing Linked Data49<div class="artistheader"about=""typeof="">…</div><><><>.HTML (+RDFa):RDF:
  50. 50. RDFa: Example (2)Extracting RDF from MusicBrainz.orgEUCLID - Providing Linked Data50
  51. 51. RDFa: Example (2)Extracting RDF from MusicBrainz.orgEUCLID - Providing Linked Data51Source:
  52. 52. RDFa: Example (2)Extracting RDF from MusicBrainz.orgEUCLID - Providing Linked Data52 the EUCLID screencast:
  53. 53. RDF Dump• An RDF dump refers to a file which contains (part of)a data set specified in an RDF format (RDF/XML, N-Triples,N-Quads)• The data set can be split into several RDF dumps• A list of available data sets available as RDF dumpscan be found at:– - Providing Linked Data 53
  54. 54. SPARQL Endpoint• The SPARQL endpoint refers to the URI of thelistener of the SPARQL protocol service, whichhandles requests for SPARQL protocol operations• The user submits SPARQL queries to the SPARQLendpoint in order to retrieve only a desired subset ofthe RDF data set• List of available SPARQL endpoints:•• - Providing Linked Data 54CH 2
  55. 55. Using Linked Data Catalogs• Data catalogs, markets or repositories are platformsdedicated to provide access to a wide range of datasets from different domains• Allow data consumers to easily find and use the data• Usually the catalogs offer relevant metadata aboutthe creation of the data setEUCLID - Providing Linked Data 55
  56. 56. Using Linked Data Catalogs (2)How to publish an RDF data set into a catalog?EUCLID - Providing Linked Data 56Create your own datacatalogRecommended for bigorganizations/institutionsaiming at providing a largenumber of data setsUse a data managementsystem, for example:Upload your data setinto an existing catalogAllows data consumers toeasily find new data setsCommon LD catalogs are:-- The Linking Open Data Cloud
  57. 57. Validating Data SetsThere are different ways to validate the published RDFdata set:EUCLID - Providing Linked Data 57GeneralvalidatorsParsing &Syntax• Vapour - Performs two types of tests: without contentnegotiation and requesting RDF/XML content• URI Debugger - Retreieves the HTTP responses of accessing a URI• RDF Triple-Checker – Dereferences namespaces associated withthe resources used in the document• W3C RDF/XML Validation Service – Evaluates the syntax ofRDF/XML documents and displays the RDF triples in it• W3C Markup Validation Service – Checks syntactic correctnessfor web documents with RDFa markup• RDF:ALERTS – Validates syntax, undefined resources, datatypeand other types of errors
  58. 58. Validating Data Sets (2)Example:Validating URIs withVapourEUCLID - Providing Linked Data 58Source:
  59. 59. Validating Data Sets (3)Example:Validating URIs withVapourEUCLID - Providing Linked Data 59Source:
  60. 60. Validating Data Sets (4)Example:Validating URIs withVapourEUCLID - Providing Linked Data 60Source: URIs withVapour
  61. 61. PROVIDING LINKED DATA:CHECKLISTEUCLID - Providing Linked Data 61
  62. 62. Providing Linked Data:Checklist (1)Creating Linked Datao All the relevant entities/concepts wereeffectively extracted from the raw data ?o Are all the created URIs dereferenceable?o Are you reusing terms from widely acceptedvocabularies?EUCLID - Providing Linked Data 62
  63. 63. Providing Linked Data:Checklist (2)Interlinking Linked Datao Is the data set linked to other RDF data sets?o Are the created vocabulary terms linked toother vocabularies?EUCLID - Providing Linked Data 63
  64. 64. Providing Linked Data:Checklist (3)Publishing Linked Datao Do you provide data set metadata?o Do you provide information about licensing?o Do you provide additional access methods?o Is the data set available in LD catalogs?o Did the data set pass the validation tests?EUCLID - Providing Linked Data 64
  65. 65. SummaryEUCLID - Providing Linked Data 65• The Linked Data lifecycle:• 3 core tasks: creating, interlinking and publishing• Creation of Linked Data:• Extracting relevant data, using URIs to name entities and selectingvocabularies and expressing the data using the RDF data model• Interlinking Linked Data:• Challenges of link discovery, using Silk to create links between twodata sets and using SKOS links• Publishing Linked Data:• Creation of data set metadata; publishing the data set via RDFdumps, SPARQL endpoints or RDFa; using RDFa and toenrich search results, and uploading the data set to a LD catalogIn this chapter we studied:
  66. 66. TheWeb & Linked Data• Linked Data catalogs• Applications
  67. 67. CKAN• CKAN is an open source platform for developing dataset catalogs• Implement useful tools for data publishers tosupport:• Data harvesting• Creation of metadata• Access mechanisms to the data set• Updating the data set• Monitoring the access to the data setEUCLID - Providing Linked Data 67
  68. 68. CKAN (2)EUCLID - Providing Linked Data 68Source:
  69. 69. CKAN (3)EUCLID - Providing Linked Data 69Source:
  70. 70. • The Data Hub is a community-run data catalog whichcontains more than 5,000 data sets1• “(…) is an openly editable open data catalogue, in thestyle of Wikipedia”.2• It is implemented on top of the CKAN platform• Allows the creation of groups:– The Linking Open Data Cloud group exclusively containsLinked Data setsEUCLID - Providing Linked Data 701 According to the information presented in the portal on March 20132 Source: Data Hub
  71. 71. EUCLID - Providing Linked Data 71Source: Data Hub (2)
  72. 72. The Data Hub (3)EUCLID - Providing Linked Data 72Source:
  73. 73. The Linking Open Data CloudEUCLID - Providing Linked Data 73September 2011Source: Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch
  74. 74. The Linking Open Data CloudHow to publish an RDF data set in this cloud?1. The data set must follow the Linked Data principles2. The data set must contain at least 1,000 RDF triples3. The data set must contain at least 50 RDF links to adata set that is already in the diagram4. Access to the data set must be providedOnce these criteria are met, the data publisher must addthe data set to the Data Hub catalog, and contact theadministrators of the Linking Open Data Cloud groupEUCLID - Providing Linked Data 74Source:
  75. 75. Linked Data & Search EnginesEUCLID - Providing Linked Data 75• Search engines collect information about webresources in order to produce richer search resultsby improving the display of the results• This is only possible if the search engines are able tounderstand the content within the web pages• The HTML pages must be annotated with machine-readable content to describe their content:Mark up format Vocabulary
  76. 76. RDFa for marking up dataEUCLID - Providing Linked Data 76• RDFa is used to provide (semi-)structured LinkedData embedded in web content• Examples:– Some search engines use RDFa, e.g., Google,Yahoo! and Bing– Facebook’s Open Graph is based on RDFa
  77. 77. Google Rich SnippetsEUCLID - Providing Linked Data 77• Embedding semantics via RDFa (ormicroformats/microdata) enhances search results:
  78. 78. Google Rich Snippets (2)EUCLID - Providing Linked Data 78
  79. 79. Schema.orgEUCLID - Providing Linked Data 79• Collection of schemas/vocabularies to markup theHTML pages• It is recognized by Bing, Google, Yahoo! and Yandex• Covers a wide range of knowledge domains• It also offers an extension mechanism in case thepublisher is interested in adding new concepts to thevocabularies
  80. 80. (2)EUCLID - Providing Linked Data 80The vocabularies cover the following topics:Source:“The world is too rich,complex and interesting fora single schema to describefully on its own. we aim to find abalance, by providing a coreschema that covers lots ofsituations, alongsideextension mechanisms forextra detail.”(Dan Brickley,
  81. 81. EUCLID - Providing Linked Data 81Integrates(/aligns) existing vocabularies whereappropriate, e.g. rNewsSource: (3)
  82. 82. Google Knowledge GraphEUCLID - Providing Linked Data 82• The user is able to findanswer to their querieswithout browsing pages• Provides detailedinformation
  83. 83. Google Knowledge Graph (2)EUCLID - Providing Linked Data 83• Google Search resultsinclude structured datafrom Freebase• Might disambiguatesearch terms
  84. 84. FreebaseEUCLID - Providing Linked Data 84• Knowledge base ofstructured data• Data is stored as agraph• Describes data fromdifferent domains
  85. 85. Bing SnapshotEUCLID - Providing Linked Data 85• Provides structured data related to the search term• Includes a significant number of entities from moredomains• Connects data from LinkedIn• Is is powered by the graph engine Trinity.RDF
  86. 86. Bing Snapshot (2)EUCLID - Providing Linked Data 86
  87. 87. Open Graph ProtocolEUCLID - Providing Linked Data 87• It was originally created by Facebook• Allows describing web content as graph objects,establishing connections between people andobjects• The descriptions are embedded in the web page asRDFa data• Supports description of several domains: basicmetadata, music, video, articles, books, websites anduser profilesSource:
  88. 88. Open Graph Protocol (2)EUCLID - Providing Linked Data 88Source: is using Open Graph protocol?Source: PublishersIMDbMicrosoftNHLPosterousRotten TomatoesTIME
  89. 89. Open Graph Protocol (3)EUCLID - Providing Linked Data 89• Facebook expands vocabulary of relationships beyond“friendship” and “like” more actions!Source:
  90. 90. Open Graph Protocol &FacebookEUCLID - Providing Linked Data 90List of domains and actionsSource:• Listen• Create a playlist• Watch• Rate• Wants to watch• Rate• Read• Quote• Wants to read• Achieve• High score• Bike• Run• Walk• Like• Recommend• FollowGeneralMusicMovies&TVGamesFitnessBookHow can we exploit these links and relationships?
  91. 91. Facebook Graph SearchEUCLID - Providing Linked Data 91• focuse on people and their interests, exploitinghow everything is related to each other• Queries are specified using natural language• Takes advantage of context and suggest possiblequeries• Allows for building more complex (expressive)queries that are not possible with normal search:– For example, “music liked by me and friends who live inmy city”
  92. 92. Facebook Graph Search (2)EUCLID - Providing Linked Data 92Context (information from profile):Graph search suggestions:
  93. 93. Facebook Graph Search (3)EUCLID - Providing Linked Data 93Results
  94. 94. Facebook Graph Search (4)EUCLID - Providing Linked Data 94Observations• Allows for conjunctive queries (applying filter over intermediateresults = “apply operator”)• Disjunctive queries are not supported:– For example: “My friends who like OR ReadWrite”• Post search is not supported– It is not possible to search in post content submitted to the timeline• User privacy settings affect the results
  95. 95. Tools for providing Linked Data• Extracting data from spreadsheets: OpenRefine• Extracting data from RDBMS: R2RML• Extracting data from text: Zemanta, OpenCalais, GATE• Interlinking data sets: Silk
  97. 97. Integrate Chart Data• Task: Integrate latest chartinformation into your RDFdatabase.• Data may be available in non-RDF formats:– Plain text– CSV, TSV, separator-basedfiles– HTML tables– Spreadsheets(OpenDocument, Excel, …)– XML– JSON– …97LDDatasetAccessIntegratedData SetInterlinking CleansingVocabularyMappingSPARQLEndpointPublishingCSV/TSVHTML Spreadsheets JSONDataacquisitionEUCLID - Providing Linked Data
  98. 98. Example DataThe Beatles, 250 millionElvis Presley, 203.3 millionMichael Jackson, 157.4 millionMadonna, 160.1 millionLed Zeppelin, 135.5 millionQueen, 90.5 million98 oforiginPeriodactiveRelease-yearof firstchartedrecordTotal certified units(from available markets)[Notes]The BeatlesUnitedKingdom1960–1970[4] 1962[4] Total available certified units:250 million[show]Elvis PresleyUnitedStates1954–1977[28] 1954[28] Total available certified units:203.3 million[show]MichaelJackson[Note 2]UnitedStates1964–2009[32] 1971[32] Total available certified units:157.4 million[show]MadonnaUnitedStates1979–present[44] 1982[44] Total available certified units:160.1 million[show]Led ZeppelinUnitedKingdom1968–1980[50] 1969[50] Total available certified units:135.5 million[show]QueenUnitedKingdom1971–present[53] 1973[53] Total available certified units:90.5 million[show]{"artist": {"class": "artist","name": "The Beatles"},"rank": 1,"value": 250 million},…CSVJSONHTML tablesEUCLID - Providing Linked Data
  99. 99. OpenRefine• transforms and cleans messyinput data sets.• is an open-source successor ofGoogle Refine.• allows for entity reconciliationagainst SPARQL endpoints orRDF data.• is extended with plugins thatenhance its functionality, e.g.for RDF support.99EUCLID - Providing Linked DataQuick Facts
  100. 100. Use of OpenRefine1001. Messy input data isimported, transformedinto a table represen-tation and cleaned.3. Define the structure ofthe RDF output.4. The data is exportedinto some RDF syntax.2. Entity reconciliation isapplied to allow forinterlinking withexisting data sets.The Beatles, 250 millionElvis Presley, 203.3 millionMichael Jackson, 157.4 millionMadonna, 160.1 millionLed Zeppelin, 135.5 millionQueen, 90.5 millionCSVmusicbrainz:b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d :totalSales "25000000000"^^xsd:int .musicbrainz:01809552-4f87-45b0-afff-2c6f0730a3be :totalSales "2.033E10"^^xsd:int .musicbrainz:f27ec8db-af05-4f36-916e-3d57f91ecf5e :totalSales "1.574E10"^^xsd:int .musicbrainz:79239441-bfd5-4981-a70c-55c3f15c1287 :totalSales "1.601E10"^^xsd:int .musicbrainz:678d88b2-87b0-403b-b63d-5da7465aecc3 :totalSales "1.355E10"^^xsd:int .musicbrainz:0383dadf-2a4e-4d10-a46a-e9e041da8eb3 :totalSales "9.05E9"^^xsd:int .RDFEUCLID - Providing Linked Data
  101. 101. Typical steps:• Group and explore dataitems• Deleting columns or rowsbased on filter condition• Split columns into severalcolumns based oncondition• Modify messy data itemswith GREL, a powerfulexpression language• Replay steps from aprevious Refine project101EUCLID - Providing Linked DataDataTransformation
  102. 102. How to Generate RDF?• Additional problem: data needs to be interlinkedwith existing MusicBrainz data• This is the point where plugins come into play:– RDF Refine: developed by DERI– An extension of OpenRefine to support RDF102?RDFEUCLID - Providing Linked Data
  103. 103. Core Capabilities• Interlinking of data by entity reconciliation– Against SPARQL endpoints, RDF dumps– Discovery of relevant RDF data sets• RDF export with the help of RDF skeletons– Define the vocabulary and graph structure of theRDF serialization– In Turtle, RDF/XML103EUCLID - Providing Linked Data
  104. 104. Typical steps:• Define a reconciliation service• Select specific types to reconcile against• Start reconciling a column against theservice104EUCLID - Providing Linked DataEntity Reconciliation
  105. 105. Define RDF Skeletons• An RDF skeleton defines the structure of theRDF triples that are exportedEUCLID - Providing Linked Data 105
  106. 106. RDF Skeletons16.06.2013 106106EUCLID - Providing Linked Data
  107. 107. EXTRACTING DATA FROMRDBMS WITH R2RMLEUCLID - Providing Linked Data 107
  108. 108. W3C RDB2RDF• Task: Integrate data fromrelational DBMS with LinkedData• Approach: map fromrelational schema to semanticvocabulary with R2RML• Publishing: two alternatives –– Translate SPARQL into SQLon the fly– Batch transform data intoRDF, index and provideSPARQL access in atriplestore108LDDatasetAccessIntegratedData inTriplestoreInterlinking CleansingVocabularyMappingSPARQLEndpointPublishingDataacquisitionEUCLID - Providing Linked DataR2RMLEngineRelationalDBMS
  109. 109. W3C RDB2RDF• The W3C made, last year, two recommendations formapping between relational databases and RDF:– Direct mapping directly exposes data as RDF• Not allowance for vocabulary mapping• No allowance for interlinking (unless URIs used in relational data)• Not appropriate for this topic– R2RML, the RDB to RDF mapping language• Allows vocabulary mapping (subject, predicate andobject maps with class options)• Allows interlinking – URIs can be constructed• Means to provide MusicBrainz RDF/SPARQL itselfEUCLID - Providing Linked Data 109
  110. 110. MusicBrainz Next Gen SchemaEUCLID - Providing Linked Data 110• ArtistAs pre-NGS, butfurther attributes• Artist CreditAllows joint credit• Release GroupCf. ‘album’versus:• Release• Medium• Track• Track List• Work• RecordingSource:
  111. 111. Music Ontology• OWL ontology with following core concepts(classes) and relationships (properties):EUCLID - Providing Linked Data 111Source:
  112. 112. R2RML Class Mapping• Mapping tables to classes is ‘easy’:lb:Artist a rr:TriplesMap ;rr:logicalTable [rr:tableName "artist"] ;rr:subjectMap[rr:class mo:MusicArtist ;rr:template"{gid}#_"] ;rr:predicateObjectMap[rr:predicate mo:musicbrainz_guid ;rr:objectMap [rr:column "gid" ;rr:datatype xsd:string]] .EUCLID - Providing Linked Data 112
  113. 113. R2RML Property Mapping• Mapping columns to properties can be easy:lb:artist_name a rr:TriplesMap ;rr:logicalTable [rr:sqlQuery"""SELECT artist.gid, artist_name.nameFROM artistINNER JOIN artist_name ON"""] ;rr:subjectMap lb:sm_artist ;rr:predicateObjectMap[rr:predicate foaf:name ;rr:objectMap [rr:column "name"]] .EUCLID - Providing Linked Data 113
  114. 114. NGS Advanced RelationsEUCLID - Providing Linked Data 114• Major entities (Artist, Release Group, Track, etc.) plusURL are paired(l_artist_artist)• Each pairingof instancesrefers to a Link• Links have types(cf. RDF properties)and attributesSource:
  115. 115. R2RML Advanced Mapping• Mapping advanced relationships (SQL joins):lb:artist_member a rr:TriplesMap ;rr:logicalTable [rr:sqlQuery"""SELECT a1.gid, a2.gid AS bandFROM artist a1INNER JOIN l_artist_artist ON =l_artist_artist.entity0INNER JOIN link ON = link.idINNER JOIN link_type ON link_type = link_type.idINNER JOIN artist a2 on l_artist_artist.entity1 = a2.idWHERE link_type.gid=5be4c609-9afa-4ea0-910b-12ffb71e3821AND link.ended=FALSE"""] ;rr:subjectMap lb:sm_artist ;rr:predicateObjectMap[rr:predicate mo:member_of ;rr:objectMap [rr:template"{band}#_" ;rr:termType rr:IRI]] .EUCLID - Providing Linked Data 115
  116. 116. EXTRACTING DATA FROMTEXTEUCLID - Providing Linked Data 116
  117. 117. OpenCalais• Not easily customised/extended• Domain-specific coverage variesEUCLID - Providing Linked Data 117Source:
  118. 118. DBpedia Spotlight• Not easily customised/extended• Is currently only available for EnglishEUCLID - Providing Linked Data 118Source:
  119. 119. ZemantaEUCLID - Providing Linked Data 119Source:• Common problem with general purpose, open-domainsemanticannotationtools• Best resultsrequirebespokecustomisation
  120. 120. • General Architecture for Text Engineering• Free open-source (LGPL)framework and development environment• Started 1996, large developer community• Used worldwide by many organisations tobuild bespoke solutions; e.g. Press Associationand the National Archive• Information Extraction in many languagesGATEEUCLID - Providing Linked Data 120
  121. 121. • Increases recall over DBpedia by deriving newlexicalisations for URIs from link anchor texts,disambiguation pages, and redirect pagesGATE Example - LODIEEUCLID - Providing Linked Data 121
  122. 122. Precision and Recall• Generic services typically very low recall• Combination is one solution• Other solution is custom extraction122PER LOC ORG TOTALDB Spotlight 0.97 / 0.40 0.82 / 0.46 0.86 / 0.31 0.85 / 0.39Zemanta 0.96 / 0.84 0.89 / 0.62 0.82 / 0.57 0.90 / 0.68LODIE 0.81 / 0.82 0.73 / 0.76 0.56 / 0.59 0.71 / 0.74Zemanta ∩ LODIE 1.00 / 0.74 0.95 / 0.45 0.97 / 0.42 0.97 / 0.54Zemanta U LODIE 0.94 / 0.93 0.77 / 0.76 0.72 / 0.71 0.82 / 0.81EUCLID - Providing Linked Data
  123. 123. Custom GATE Gazetteer• Retrieve MusicBrainzentity/label/classwith SPARQL query123EUCLID - Providing Linked Data
  124. 124. GATECloud• Custom (e.g. based around custom gazetteer)GATE pipelines can be executed on the cloud:124EUCLID - Providing Linked Data
  125. 125. INTERLINKING DATA SETS WITHSILKEUCLID - Providing Linked Data 125
  126. 126. Interlinking with Silk• Task: Create links betweenthe data set and externalLinked Data sources.• Approach: Creation ofspecified links by queryingthe target data sets• Alternatives:– Manual creation oflinkage rules by the user– Automatic learninglinkage rules bysubmitting predefinedSPARQL queries126LDDatasetAccessIntegratedData SetInterlinking CleansingVocabularyMappingSPARQLEndpointPublishingCSV/TSVHTML Spreadsheets JSONDataacquisitionEUCLID - Providing Linked Data
  127. 127. Link Discovery with Silk• Open source tool for discovering RDF links betweendata items within different Linked Data sources• It is based on the Silk Link Specification Language(Silk-LSL) for expressing linkage rules• It accesses the target RDF data sets via SPARQLendpoints to generate RDF linksEUCLID - Providing Linked Data 127Source: Robert Isele. “LOD2 Webinar Series:Silk”
  128. 128. SilkVariants• Silk Single Machine• Generates RDF links on a single machine• Data sets can reside either locally or in remote machines• Provides multithreading and caching• Silk MapReduce• Uses a cluster composed of multiple machines• Based on Hadoop and designed to scale to big data sets• Silk Server• Used within applications that consume Linked Data from the Webwhile keeping track of known entities• Provides an HTTP API for matching entities from an incomingstreamEUCLID - Providing Linked Data 128Source:
  129. 129. Source: Silk workflow is partially based on “LOD2 Webinar Series: Silk -(Simplified)Linking Workflow” by Rober Isele.SilkWorkflowEUCLID - Providing Linked Data 129Select LDdata sets• Identifysuitable datasets in LDcatalogs*• Select the twodata sets tolinkSpecify LDdata sets• Specify theaccess methodto the data set(RDF dump,SPARQLendpoint)*• Specify theentity types tobe linkedWrite linkagerule• Specifies howto comparethe resources• Use Silk-LSL• The rules canalso be learntGenerateRDF links• Output linkscan be storedin a file or atriple store• Can discoverSKOS linksSilk framework* See section “Publishing Linked Data”
  130. 130. Linkage Rule Components• Linkage rules define the conditions to create the linksbetween the data sets. These rules are composed of:EUCLID - Providing Linked Data 130Source: Paths• Describe the elements to becompared• Example: ?a/rdfs:labelTransformations• Apply transformations to theresult set of an RDF path• Examples: LowerCase,Concatenate, Replace, …Comparators• Compute the similarity of twoinputs• Examples: String similaritymetrics, Date similarity, …Aggregations• Compute an aggregated valuefrom multiple comparators• Examples: Min, Max, Avg, variousmeans, Euclidian distance …1 23 4
  131. 131. SilkWorkbench• Web application built on top of Silk, which allows thecreation of projects to manage the creation of linksbetween RDF data sets• The data sets can be stored locally or accessedremotely by specifying the SPARQL endpoint• The user is able to create customized linking tasks:– The tool offers a graphical editor to create linkage rules bycombining the linkage rules components via drag & dropelements– Includes support for (automatic) learning linkage rulesEUCLID - Providing Linked Data 131
  132. 132. Project configurationSilkWorkbench (2)EUCLID - Providing Linked Data 13212341. Project: name and components (datasources, linking tasks and output tasks)2. Data sources: specification of the datasets to be interlinked3. Linking task: specification of the linkagerules and type of links to be created4. Output task: mechanism to store theresults from the lnking process2
  133. 133. Editing a linking taskSilkWorkbench (3)EUCLID - Providing Linked Data 13314231. Linkage rule components2. Graphical editor: the items from (1) are dragged &dropped in this area, and connected to compose thelinkage rules3. Generate links: based on the defined linkage rules in (2),the data sets are accessed to discover possible links4. Learn: automatic learning of linkage rules
  134. 134. Adding a linkage ruleSilkWorkbench (4)EUCLID - Providing Linked Data 134The previous linkage rule states:1. Retrieve the foaf:name values from MusicBrainz andthe rdfs:label from DBpedia2. Apply lower case transformation to the output of (1)3. Compare the output from (2) using the metric“Levenshtein distance”. If this distance is greater than0.90, then create a link.1 23
  135. 135. Generate LinksSilkWorkbench (5)EUCLID - Providing Linked Data 135
  136. 136. Learn RulesSilkWorkbench (6)EUCLID - Providing Linked Data 136
  137. 137. For exercises, quiz and further material visit our website:EUCLID - Providing Linked Data 137@euclid_project euclidproject euclidprojecthttp://www.euclid-project.euOther channels:eBook Course
  138. 138. Acknowledgements• Alexander Mikroyannidis• Alice Carpentier• Andreas Harth• Andreas Wagner• Andriy Nikolov• Barry Norton• Daniel M. Herzig• Elena Simperl• Günter Ladwig• Inga Shamkhalov• Jacek Kopecky• John Domingue
• Juan Sequeda• Kalina Bontcheva• Maria Maleshkova• Maria-Esther Vidal• Maribel Acosta• Michael Meier• Ning Li• Paul Mulholland• Peter Haase• Richard Power• Steffen Stadtmüller138
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.