A metadata focused crawler for Linked Data
Raphael do Vale A. Gomes1, Marco A. Casanova1, 
Giseli Rabello Lopes1 and Luiz André P. Paes Leme2 
Outline
• Introduction
• Background
• Use case
• A metadata focused crawler
• Tests and results
• Conclusions and future work
• Acknowledgments
ICEIS 2014 - April, 27-30, 2014, 
Lisbon, Portugal 
Introduction
• Linked Data principles
  - Use URIs as names for things
  - Use HTTP URIs so that people can look up those names
  - When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)
  - Include links to other URIs, so that they can discover more things
Source: http://www.w3.org/DesignIssues/LinkedData.html
Introduction
• How can we recommend Linked Data sources to a beginner user?
  - Data sources may not use popular ontologies
  - There might be more than one ontology for the same domain
  - The user may not know all (if any) of the ontologies
Introduction
• Our solution:
  - Create a recommender system that receives a small set of generic URI resources and returns a complete report of related resources (URIs, datasets and ontologies)
  - Why generic? Because our user is a beginner exploring Linked Data. They do not need to know about specific datasets or ontologies; they only need to know how to get started.
  - The recommender system would benefit from a metadata-based Linked Data crawler
Introduction
• Metadata focused crawler
  - INPUT: the user summarizes the desired domain with a small set of related terms (URI resources)
  - OUTPUT: the tool returns a list of vocabulary terms, as well as provenance data indicating how the output was generated
• With the output, the user should evaluate the most relevant vocabularies for the triplification or linkage process
  - This step could be manual or use another tool (e.g., a recommender system)
Background
• Important properties
  - rdfs:subClassOf, owl:sameAs, rdfs:seeAlso and rdf:type
• SPARQL queries
  - Similar to SQL
Use case
• Scenario
  - The user wants to publish a relational database storing music data as Linked Data
Use case
• Input
  - The user defines an initial set T of terms to describe the application domain
[Diagram: dbpedia:Music, from DBpedia, is given as input to the Metadata Focused Crawler]
Use case
• Process
  - The crawler focuses on finding new terms
  - Subclasses of the class, or related terms (owl:sameAs or rdfs:seeAlso)
  - Also counts the number of instances of the class found in each dataset
[Diagram: Metadata Focused Crawler]
Use case
• Output: the crawler will return
  1. A list of the terms found, indicating their provenance
  2. For each term found, an estimate of the number of instances in each tripleset probed
[Diagram: Metadata Focused Crawler output]
wordnet:synset-music-noun-1 -> owl:sameAs -> opencyc:Music -> rdfs:subClassOf -> opencyc:LoveSong -> instance -> 500 instances.
...
A metadata focused crawler
• Our solution:
  - Executes several SPARQL queries over the whole LOD Cloud (Linked Open Data Cloud)
  - For each dataset, applies several queries to discover relationships between the dataset and the resource being crawled
  - A breadth-first algorithm is used to discover more data in cycles
A metadata focused crawler
• Crawling terms
  - Terms elected to be crawled
• Initial crawling terms
  - The initial set of terms selected by the user
• Crawling properties
  - The list of properties that will be used to crawl
• Crawling frontier
A metadata focused crawler
• Crawling queries
  - Each crawling query is applied to each dataset found
  - Each crawling property is crawled using one query
  - For each crawling term, all such queries are applied to all datasets
A metadata focused crawler
• Crawling queries
  - SPARQL endpoint or RDF dump: inverted query, for a crawling property p and a crawling term t
      SELECT DISTINCT ?item
      WHERE { ?item p <t> }
  - Instance count
    Similar to the other queries, but only the result size is saved
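The two query shapes on the slide above can be sketched as small string builders. This is a minimal sketch in Python, not the authors' implementation: the function names `inverted_query`, `instance_count_query` and `queries_for_term` are hypothetical, the prefixed property names are expanded to their full URIs, and the use of a SPARQL 1.1 `COUNT` aggregate for the instance count is an assumption (the slide only says that the result size is saved).

```python
# Full URIs of the properties listed on the Background slide.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Properties followed to find new terms (rdf:type is used for counting only).
CRAWLING_PROPERTIES = [RDFS_SUBCLASS_OF, OWL_SAME_AS, RDFS_SEE_ALSO]


def inverted_query(prop: str, term: str) -> str:
    """Inverted query: every ?item that points at `term` through `prop`."""
    return f"SELECT DISTINCT ?item WHERE {{ ?item <{prop}> <{term}> }}"


def instance_count_query(term: str) -> str:
    """Assumed form of the instance count: only the number of instances is returned."""
    return (
        "SELECT (COUNT(DISTINCT ?item) AS ?n) "
        f"WHERE {{ ?item <{RDF_TYPE}> <{term}> }}"
    )


def queries_for_term(term: str) -> list:
    """One inverted query per crawling property, plus the instance count."""
    return [inverted_query(p, term) for p in CRAWLING_PROPERTIES] + [
        instance_count_query(term)
    ]


if __name__ == "__main__":
    # http://example.org/Music is a placeholder term, not taken from the slides.
    for query in queries_for_term("http://example.org/Music"):
        print(query)
```

Each generated string would then be sent to every dataset, whether through a SPARQL endpoint or against a locally loaded RDF dump; that transport step is outside this sketch.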
A metadata focused crawler
• Crawling stages
  - Challenge: starting from generic terms, how can we discover more data?
  - Answer: by following strong relationships (sameAs, subClassOf, seeAlso and instanceOf)
[Diagram labels: Schema.org, DBpedia, WordNet, Music Ontology, BBC Music; arrow labelled "More specific"]
A metadata focused crawler
• Crawling stages
  - Each new resource found is saved for the next level of crawling
  - Crawling frontier: all terms elected to be processed in the next cycle
  - Circular references are prevented
• Parameters to limit processing time
  - Number of stages
  - Maximum number of terms probed
  - Maximum number of terms probed for each term in the crawling frontier
  - Maximum number of terms probed in each tripleset for each term in the crawling frontier
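The staged crawl described above can be sketched as a breadth-first traversal with a frontier, a visited check and two of the listed limits. This is a minimal sketch under stated assumptions, not the authors' code: `crawl`, `probe`, `max_stages` and `max_terms` are hypothetical names, and `TOY_GRAPH` is an in-memory stand-in for the triplesets, built from the example path shown in these slides plus one back-link added to exercise the circular-reference check.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Edge = Tuple[str, str]  # (crawling property, term discovered through it)


def crawl(
    initial_terms: Iterable[str],
    probe: Callable[[str], List[Edge]],
    max_stages: int = 2,
    max_terms: int = 100,
) -> Dict[str, List[str]]:
    """Breadth-first crawl; returns each term found with its provenance path."""
    provenance: Dict[str, List[str]] = {term: [term] for term in initial_terms}
    frontier = list(provenance)
    for _ in range(max_stages):
        next_frontier: List[str] = []
        for term in frontier:
            for prop, found in probe(term):
                if found in provenance:
                    continue  # already seen: prevents circular references
                if len(provenance) >= max_terms:
                    return provenance  # global limit on terms reached
                provenance[found] = provenance[term] + [prop, found]
                next_frontier.append(found)
        if not next_frontier:
            break  # nothing new was found in this stage
        frontier = next_frontier  # terms elected for the next cycle
    return provenance


# Toy data standing in for the real triplesets (an assumption for illustration).
TOY_GRAPH: Dict[str, List[Edge]] = {
    "wordnet:synset-music-noun-1": [("owl:sameAs", "opencyc:Music")],
    "opencyc:Music": [
        ("owl:sameAs", "wordnet:synset-music-noun-1"),  # back-link forming a cycle
        ("rdfs:subClassOf", "opencyc:LoveSong"),
    ],
}


def toy_probe(term: str) -> List[Edge]:
    """Stand-in for running every crawling query for `term` on every dataset."""
    return TOY_GRAPH.get(term, [])


if __name__ == "__main__":
    found = crawl(["wordnet:synset-music-noun-1"], toy_probe, max_stages=2)
    for path in found.values():
        print(" -> ".join(path))
```

Storing the full path for each term keeps the provenance that the crawler is said to report, and the same dictionary doubles as the visited set. The per-term and per-tripleset limits from the slide are left out to keep the sketch short.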
A metadata focused crawler
• Crawling stages
  - Example:
    wordnet:synset-music-noun-1 -> owl:sameAs -> opencyc:Music -> rdfs:subClassOf -> opencyc:LoveSong -> instance -> 500 instances.
Tests and results
• Domain: Music

Term                  | Instance | Subclass                            | SameAs                | SeeAlso
mo:MusicArtist        | 103,541  | 2                                   | --                    | --
mo:MusicalWork        | 16,833   | 1                                   | --                    | --
dbpedia:MusicalWork   | 145,656  | 5 from dbpedia and 21,413 from yago | 2                     | 12
dbpedia:Song          | 10,987   | 1                                   | 1                     | 14 (half in Japanese)
dbpedia:Album         | 100,090  | 3 plus over 17,222 from yago        | 3 and other languages | --
dbpedia:MusicalArtist | 49,973   | 2 plus 2,178 from yago              | 2                     | 1
dbpedia:Single        | 44,623   | 3,414                               | --                    | 9
Tests and results
• Music domain

Tool                     | Precision | Recall
Metadata Focused Crawler | 95%       | 91%
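For readers unfamiliar with the two measures, the following sketch shows the standard set-based definitions of precision and recall. The slides do not say how the reference set of relevant terms was built or how the figures were computed, so the function name `precision_recall` and the placeholder term names below are assumptions for illustration only.

```python
from typing import Set, Tuple


def precision_recall(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Standard definitions: precision = hits / retrieved, recall = hits / relevant."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Placeholder identifiers, not results from the slides.
    retrieved = {"term-a", "term-b", "term-c", "term-d"}
    relevant = {"term-a", "term-b", "term-c", "term-e", "term-f"}
    p, r = precision_recall(retrieved, relevant)
    print(f"precision={p:.2f} recall={r:.2f}")
```

Here three of the four retrieved terms are relevant and three of the five relevant terms were retrieved, so the sketch reports a precision of 0.75 and a recall of 0.60.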
Lessons learned
• Parameter setting
  - The crawl may grow exponentially
• Choosing initial crawling terms
  - The Music Ontology is not interlinked with more popular data sources
  - Linked Data principles are not followed
• Multiple ontologies describing the domain of interest
  - The larger the number of data sources in the domain, the more useful the results will be
Conclusions and future work
• Improvements
  - Discovering relationships between resources of two triplesets described by a third one
  - Crawling with SPARQL queries
  - Identifying resources in different languages
  - Performing simple deductions
Conclusions and future work
• Improving input
  - Summarization techniques for automatic input generation
  - Accepting natural language keywords and converting them to URI resources
• Improving system performance
  - Caching
  - Better queries that return results with fewer requests per endpoint
• Web interface
• Open source
• Recommender system
Acknowledgments
• This work was partly supported by:
  - grants 160326/2012-5, 303332/2013-1 and 57128/2009-9
  - grants E-26/170028/2008 and E-26/103.070/2011
A metadata focused crawler for Linked Data
Raphael do Vale A. Gomes1, Marco A. Casanova1, 
Giseli Rabello Lopes1 and Luiz André P. Paes Leme2 
Contact: rgomes@inf.puc-rio.br 
