A metadata focused crawler for Linked Data

Raphael do Vale A. Gomes1, Marco A. Casanova1,
Giseli Rabello Lopes1 and Luiz André P. Paes Leme2

Outline
• Introduction
• Background
• Use case
• A metadata focused crawler
• Tests and results
• Conclusions and future work
• Acknowledgments
ICEIS 2014 - April, 27-30, 2014,
Lisbon, Portugal

Introduction
• Linked Data principles
  - Use URIs as names for things
  - Use HTTP URIs so that people can look up those names
  - When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)
  - Include links to other URIs, so that they can discover more things
Source: http://www.w3.org/DesignIssues/LinkedData.html

Introduction
• How can we recommend Linked Data sources to a beginner user?
  - Data sources may not use popular ontologies
  - There might be more than one ontology for the same domain
  - The user may not know all (if any) of the ontologies

Introduction
• Our solution:
  - Create a recommender system that receives a small set of generic URI resources and returns a complete report of related resources (URIs, datasets and ontologies)
    - Why generic? Because our user is a beginner exploring Linked Data. The user does not need to know about specific datasets or ontologies, only how to get started.
  - The recommender system would benefit from a Linked Data crawler based on metadata

Introduction
• Metadata focused crawler
  - INPUT:
    - The user summarizes the desired domain with a small set of related terms (URI resources)
  - OUTPUT:
    - The tool returns a list of vocabulary terms, as well as provenance data indicating how the output was generated
• With the output, the user should evaluate the most relevant vocabularies for the triplification or linkage process
  - This step could be manual or use another tool (e.g., a recommender system)

Background
• Important properties
  - rdfs:subClassOf, owl:sameAs, rdfs:seeAlso and rdf:type
• SPARQL queries
  - Similar to SQL

Use case
• Scenario
  - A user wants to publish a relational database that stores music data as Linked Data

Use case
• Input
  - The user defines an initial set T of terms to describe the application domain
[Figure: the term dbpedia:Music, from DBpedia, as input to the Metadata Focused Crawler]

Use case
• Process
  - The crawler focuses on finding new terms
    - Subclasses of the class, or related terms (owl:sameAs or rdfs:seeAlso)
  - It also counts the number of instances of the class found in each dataset
[Figure: Metadata Focused Crawler]

Use case
• Output - The crawler will return:
  1. A list of the terms found, indicating their provenance
  2. For each term found, an estimate of the number of instances in each tripleset probed
[Figure: sample output of the Metadata Focused Crawler]
wordnet:synset-music-noun-1 -> owl:sameAs -> opencyc:Music -> rdfs:subClassOf -> opencyc:LoveSong -> instance -> 500 instances.
...

A metadata focused crawler
• Our solution:
  - Executes several SPARQL queries over the whole LOD Cloud (Linked Open Data Cloud)
  - For each dataset, applies several queries to discover relationships between the dataset and the resource being crawled
  - A breadth-first algorithm is used to discover more data in cycles
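
The breadth-first cycles described above can be illustrated with a minimal sketch. It is not the authors' implementation: the in-memory `DATASETS` dictionary stands in for the LOD Cloud, its triples are invented for illustration, and the names `probe` and `crawl` are hypothetical.

```python
from collections import deque

# Toy stand-in for the LOD Cloud (assumption): dataset name -> list of
# (subject, property, object) triples. The triples are illustrative only.
DATASETS = {
    "wordnet": [("opencyc:Music", "owl:sameAs", "wordnet:synset-music-noun-1")],
    "opencyc": [("opencyc:LoveSong", "rdfs:subClassOf", "opencyc:Music")],
}
CRAWLING_PROPERTIES = ("owl:sameAs", "rdfs:subClassOf", "rdfs:seeAlso")


def probe(dataset, prop, term):
    """Inverted lookup: every ?item such that (?item, prop, term) is in the dataset."""
    return [s for (s, p, o) in DATASETS[dataset] if p == prop and o == term]


def crawl(initial_terms, max_stages=3):
    """Breadth-first crawl; returns found term -> (parent term, property, dataset)."""
    visited = set(initial_terms)
    frontier = deque(initial_terms)
    provenance = {}
    for _ in range(max_stages):
        next_frontier = deque()
        while frontier:
            term = frontier.popleft()
            for dataset in DATASETS:
                for prop in CRAWLING_PROPERTIES:
                    for found in probe(dataset, prop, term):
                        if found not in visited:  # skip terms already seen
                            visited.add(found)
                            provenance[found] = (term, prop, dataset)
                            next_frontier.append(found)
        if not next_frontier:
            break  # nothing new was discovered in this cycle
        frontier = next_frontier
    return provenance


result = crawl(["wordnet:synset-music-noun-1"])
```

Each pass of the outer loop corresponds to one crawling cycle: terms found in a cycle are only expanded in the next one, which is what makes the traversal breadth-first.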

Crawling terms
• Terms elected to be crawled
• Initial crawling terms
  - The initial set of terms selected by the user
• Crawling properties
  - The list of properties that will be used to crawl
• Crawling frontier

Crawling queries
• Each crawling query is applied to each dataset found
• Each crawling property is crawled using one query
• For each crawling term, all such queries are applied to all datasets
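
Taken together, these rules mean that one query is issued per combination of crawling term, crawling property and dataset. The sketch below only enumerates those combinations; `plan_queries` is a hypothetical helper, and the dataset names are placeholders borrowed from elsewhere in the slides.

```python
from itertools import product


def plan_queries(terms, properties, datasets):
    """Return one (term, property, dataset) job for every query to be run."""
    return list(product(terms, properties, datasets))


jobs = plan_queries(
    terms=["dbpedia:Music"],
    properties=["rdfs:subClassOf", "owl:sameAs", "rdfs:seeAlso"],
    datasets=["DBpedia", "WordNet", "Music Ontology"],
)
# The number of jobs is len(terms) * len(properties) * len(datasets).
```

Because the job count is a product of three sizes, every term added to the crawling frontier multiplies the work by the number of properties and datasets.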

Crawling queries
• SPARQL endpoint or RDF dump: inverted query, where p is a crawling property and t is a crawling term
    SELECT DISTINCT ?item
    WHERE { ?item p <t> }
• Instance count
  - Similar to the other queries, but only the result size is saved
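
As a rough illustration, the inverted query template can be filled in for a concrete property and term. The helper names are hypothetical. The second function uses a SPARQL 1.1 `COUNT` aggregate, which is an alternative to saving the size of the result set as the slide describes, not necessarily what the crawler does.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def inverted_query(prop_iri, term_iri):
    """Fill the template: find every ?item that points to the term via the property."""
    return f"SELECT DISTINCT ?item WHERE {{ ?item <{prop_iri}> <{term_iri}> }}"


def instance_count_query(term_iri):
    """Assumed variant: let the endpoint count the instances of a class."""
    return (
        "SELECT (COUNT(DISTINCT ?item) AS ?n) "
        f"WHERE {{ ?item <{RDF_TYPE}> <{term_iri}> }}"
    )


query = inverted_query(
    "http://www.w3.org/2002/07/owl#sameAs",
    "http://dbpedia.org/resource/Music",
)
```

The query is called inverted because the known term sits in the object position, so the results are resources that link to the term rather than resources the term links to.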

Crawling stages
• Challenge: starting from generic terms, how can we discover more data?
• Answer: by following strong relationships (sameAs, subClassOf, seeAlso and instanceOf)
[Figure: Schema.org, DBpedia, WordNet, Music Ontology and BBC Music, labeled "More specific"]

Crawling stages
• Each new resource found is saved for the next level of crawling
• Crawling frontier
  - All terms elected to be processed in the next cycle
  - Circular references are prevented
• Parameters to prevent long processing times
  - Number of stages
  - Maximum number of terms probed
  - Maximum number of terms probed, for each term in the crawling frontier
  - Maximum number of terms probed in each tripleset, for each term in the crawling frontier
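
One way to picture how these limits and the circular-reference check might interact is the sketch below. The names `CrawlLimits` and `select_new_terms`, the default values and the exact order in which the limits are applied are all assumptions; the slides do not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlLimits:
    """Hypothetical container for the four parameters; defaults are arbitrary."""
    max_stages: int = 3
    max_terms: int = 1000
    max_terms_per_frontier_term: int = 50
    max_terms_per_tripleset: int = 10


def select_new_terms(results_by_tripleset, visited, limits):
    """Pick the terms that one frontier term contributes to the next cycle.

    results_by_tripleset maps a tripleset name to the terms found in it
    for a single term of the crawling frontier.
    """
    selected = []
    for terms in results_by_tripleset.values():
        taken = 0  # terms accepted from the current tripleset
        for term in terms:
            if len(selected) >= limits.max_terms_per_frontier_term:
                return selected
            if len(visited) + len(selected) >= limits.max_terms:
                return selected
            if taken >= limits.max_terms_per_tripleset:
                break
            if term in visited or term in selected:
                continue  # already seen: avoids circular references
            selected.append(term)
            taken += 1
    return selected


limits = CrawlLimits(max_terms=100, max_terms_per_frontier_term=3,
                     max_terms_per_tripleset=2)
picked = select_new_terms(
    {"a": ["t1", "t2", "t3"], "b": ["t1", "t4", "t5"]},
    visited={"t0"},
    limits=limits,
)
```

In this sketch `max_stages` would be consumed by the outer crawl loop rather than by `select_new_terms`.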

Crawling stages
• Example
    wordnet:synset-music-noun-1 -> owl:sameAs -> OpenCyc:Music -> rdfs:subClassOf -> OpenCyc:LoveSong -> instance -> 500 instances.
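
A provenance line like the one in this example could be produced by walking back from a found term to the initial term. The sketch assumes a hypothetical `parents` mapping from each found term to the term and property that led to it; the crawler's real data structures are not described in the slides.

```python
def provenance_chain(term, parents, instance_count):
    """Render the path from the initial term to `term` as an arrow-separated line."""
    steps = [term]
    while term in parents:
        term, prop = parents[term]  # step back to the term this one was found from
        steps = [term, prop] + steps
    return " -> ".join(steps) + f" -> instance -> {instance_count} instances."


parents = {
    "OpenCyc:Music": ("wordnet:synset-music-noun-1", "owl:sameAs"),
    "OpenCyc:LoveSong": ("OpenCyc:Music", "rdfs:subClassOf"),
}
line = provenance_chain("OpenCyc:LoveSong", parents, 500)
```

The loop terminates only because circular references are excluded during crawling, so `parents` never contains a cycle.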

Tests and results
• Domain: Music

Term                  | Instances | Subclasses                         | SameAs                | SeeAlso
mo:MusicArtist        | 103,541   | 2                                  | --                    | --
mo:MusicalWork        | 16,833    | 1                                  | --                    | --
dbpedia:MusicalWork   | 145,656   | 5 from dbpedia and 21,413 from yago | 2                     | 12
dbpedia:Song          | 10,987    | 1                                  | 1                     | 14 (half in Japanese)
dbpedia:Album         | 100,090   | 3 plus over 17,222 from yago       | 3 and other languages | --
dbpedia:MusicalArtist | 49,973    | 2 plus 2,178 from yago             | 2                     | 1
dbpedia:Single        | 44,623    | 3,414                              | --                    | 9

Tests and results
Music domain
Tool                     | Precision | Recall
Metadata Focused Crawler | 95%       | 91%
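
For readers unfamiliar with the two measures, the sketch below shows the standard definitions: precision is the fraction of returned terms that are relevant, and recall is the fraction of relevant terms that were returned. The slides do not say how the reference set was built, and the term sets in the usage lines are hypothetical and unrelated to the reported figures.

```python
def precision_recall(retrieved, relevant):
    """Standard set-based precision and recall."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant terms that were actually returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy sets, for illustration only.
p, r = precision_recall(
    retrieved=["a", "b", "c", "d"],
    relevant=["a", "b", "c", "e", "f"],
)
```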

Lessons learned
• Parameter setting
  - The crawl may grow exponentially
• Choosing initial crawling terms
  - Music Ontology is not interlinked with more popular data sources
  - Linked Data principles are not followed
• Multiple ontologies describing the domain of interest
  - The larger the number of data sources in the domain, the more useful the results will be

Conclusions and future work
• Improvements
  - Discovering relationships between resources of two triplesets described by a third one
  - Crawling with SPARQL queries
  - Identifying resources in different languages
  - Performing simple deductions

Conclusions and future work
• Improving input
  - Summarization techniques for automatic input generation
  - Accepting natural language keywords and converting them to URI resources
• Improving system performance
  - Caching
  - Better queries that return results with fewer requests per endpoint
• Web interface
• Open source
• Recommender system

Acknowledgments
• This work was partly supported by:
  - grants 160326/2012-5, 303332/2013-1 and 57128/2009-9
  - grants E-26/170028/2008 and E-26/103.070/2011

Contact: rgomes@inf.puc-rio.br