Merging controlled vocabularies through semantic alignment based on linked data
Authors: Konstantinos Kyprianos, Ioannis Papadakis
1. Merging controlled vocabularies through semantic alignment based on linked data
   Authors: Konstantinos Kyprianos, Ioannis Papadakis
   Ionian University, Department of Archives, Library Science and Museology
   Ioannou Theotoki 72, 49100, Corfu
2. Presentation outline
   • Introduction
   • Proposed approach
   • Proof of concept
   • Deployment of the proposed approach
   • Deployment results
   • Comparative evaluation
   • Conclusions
   • Future work
3. Introduction (1/2)
   • Controlled vocabularies are predefined lists of words for knowledge organization and the description of libraries' collections
   • Creation of semantically similar yet syntactically and linguistically heterogeneous controlled vocabularies with overlapping parts
   • Matching tools and techniques: lexical similarity
     ◦ Compares terms according to the order of their characters
     ◦ Edit distance, prefix/suffix variations, n-grams, etc.
   • Matching tools and techniques: semantic alignment
     ◦ Based on semantic techniques to identify similar terms between two structured vocabularies
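For readers unfamiliar with the lexical techniques named on this slide, the following is an illustrative sketch only, not the authors' tooling (matching in the deployment is done via Google Refine reconciliation, see slide 8): a character-order similarity and a character n-gram overlap, both with the Python standard library.

```python
# Illustration of the two families of lexical-similarity measures named on
# this slide (character-order / edit-distance style, and n-grams). Not the
# authors' implementation.
from difflib import SequenceMatcher

def char_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] based on the order of characters (difflib ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap of character n-grams, in [0, 1]."""
    grams = lambda s: {s[i:i + n] for i in range(len(s) - n + 1)}
    ga, gb = grams(a.lower()), grams(b.lower())
    return len(ga & gb) / len(ga | gb) if (ga | gb) else 0.0

# char_similarity("Data bases", "Databases") is roughly 0.95
```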
4. Introduction (2/2)
   • Our approach: a methodology to bring together semantically similar yet different vocabularies through the semantic alignment of their underlying terms, employing LOD technologies
     ◦ Semantic alignment is achieved through external linguistic datasets
     ◦ No structure of any kind (schema or ontology) is required of the compared datasets
5. Proposed approach
   • S is the set of terms in the source dataset
   • T is the set of terms in the target dataset
   • L is the set of terms in the linguistic dataset
   • L' is the set of terms in L that are found to be linguistically associated with some terms of the source dataset
   • L'' is the set of terms in L that are found to be semantically associated with some terms of L'
   • T' contains the terms in T that are linguistically associated with some terms of L' and L''
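A minimal sketch of these set definitions, assuming exact case-insensitive string comparison for the lexical ("linguistically associated") step and a precomputed map of semantic links standing in for the linguistic dataset's relations; the actual deployment performs both steps through Google Refine (slide 8).

```python
# Minimal sketch of the sets on this slide. ASSUMPTIONS: lexical association
# is exact, case-insensitive string equality, and L_links is a precomputed
# dict standing in for the linguistic dataset's semantic relations.

def align(S, T, L_terms, L_links):
    """Return (S', L', L'', T') for source S, target T, linguistic dataset L.

    S, T, L_terms -- sets of term strings
    L_links       -- dict: term in L -> set of semantically related terms in L
    """
    norm = lambda t: t.strip().lower()
    S_norm = {norm(t) for t in S}

    # L': terms of L lexically matching some term of S
    L_prime = {t for t in L_terms if norm(t) in S_norm}
    # S': terms of S that found a lexical match in L
    L_prime_norm = {norm(t) for t in L_prime}
    S_prime = {t for t in S if norm(t) in L_prime_norm}
    # L'': terms of L semantically associated with some term of L'
    L_second = {r for t in L_prime for r in L_links.get(t, set())}
    # T': terms of T lexically matching some term of L' or L''
    bridge = {norm(t) for t in L_prime | L_second}
    T_prime = {t for t in T if norm(t) in bridge}
    return S_prime, L_prime, L_second, T_prime
```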
6. Proof of concept (1/2)
   • University of Piraeus digital library (Dione)
     ◦ Theses and dissertations
     ◦ 3,323 bilingual subject headings
     ◦ DSpace installation
   • New York Times (NYT)
     ◦ Approximately 10,000 subject headings
     ◦ Journal articles
   • DBpedia
     ◦ Extracts structured information from Wikipedia
     ◦ 3.5 million entities
   • WordNet
     ◦ Lexical database
     ◦ Consists of synsets (~117,659 distinct concepts containing terms interlinked through conceptual-semantic relations)
7. Proof of concept (2/2)
   1. Let the source dataset S be D (i.e. Dione)
   2. Let the target dataset T be N (i.e. NYT)
   3. Let linguistic dataset A (L) be DB (i.e. DBpedia), and
   4. Let linguistic dataset B (L) be W (i.e. WordNet)
   5. D1' corresponds to S', assuming that the linguistic dataset L is DB. In a similar manner, D2' corresponds to S', assuming that the linguistic dataset L is W.
   6. DB' and DB'' correspond to L' and L'' respectively, assuming that the linguistic dataset L is DB. In a similar manner, W' and W'' correspond to L' and L'' respectively, assuming that the linguistic dataset L is W.
   7. N1' corresponds to T', assuming that the linguistic dataset L is DB. In a similar manner, N2' corresponds to T', assuming that the linguistic dataset L is W.
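A hypothetical instantiation of the align() sketch given after slide 5; dione_terms, nyt_terms, dbpedia_labels, dbpedia_links, wordnet_lemmas and wordnet_synonyms are made-up placeholder collections, not the authors' data.

```python
# Placeholder inputs, only to show how the two runs of this slide map onto
# the align() sketch above.
dione_terms = {"Information retrieval", "Databases"}
nyt_terms = {"Information storage and retrieval", "Databases"}
dbpedia_labels = {"Information retrieval", "Databases"}
dbpedia_links = {"Information retrieval": {"Information storage and retrieval"}}
wordnet_lemmas, wordnet_synonyms = set(), {}

# Run A: linguistic dataset L = DB (DBpedia)
D1_prime, DB_prime, DB_second, N1_prime = align(
    dione_terms, nyt_terms, dbpedia_labels, dbpedia_links)

# Run B: linguistic dataset L = W (WordNet)
D2_prime, W_prime, W_second, N2_prime = align(
    dione_terms, nyt_terms, wordnet_lemmas, wordnet_synonyms)
```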
8. Deployment of the proposed approach
   • Google Refine
     ◦ Tool to manipulate tabular data
     ◦ Reconciliation of data with existing knowledge bases
     ◦ RDF extension
   • Process
     1. Subject headings from Dione are imported into Google Refine
     2. DBpedia and WordNet endpoints are registered in Google Refine as SPARQL reconciliation services
     3. The subject headings of Dione are linguistically matched (i.e. lexical similarity) against DBpedia's and WordNet's reconciliation services, creating the corresponding subsets
     4. The terms in the subsets of step 3 are extended with semantically equivalent terms (i.e. semantic alignment) derived from the rest of DBpedia and WordNet
     5. Subject headings from NYT are imported into Google Refine
     6. The subject headings of NYT are linguistically matched (i.e. lexical similarity) against the terms belonging to the subsets described in steps 3 and 4
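The slides do not show the underlying queries; the sketch below illustrates how the semantic-expansion part of step 4 could be expressed against the DBpedia SPARQL endpoint, assuming that labels of pages redirecting to the matched resource are taken as equivalent terms. This is an assumption: the authors used Google Refine's RDF extension and do not name specific DBpedia properties.

```python
# Sketch of step 4 (semantic expansion) for one matched heading against
# DBpedia. ASSUMPTION: equivalent terms are harvested from labels of redirect
# pages; the exact properties/queries used inside Google Refine are not shown
# on the slides.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
endpoint.setReturnFormat(JSON)

def dbpedia_equivalents(label: str) -> set:
    endpoint.setQuery(f"""
        PREFIX dbo:  <http://dbpedia.org/ontology/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT DISTINCT ?altLabel WHERE {{
          ?res rdfs:label "{label}"@en .
          ?alt dbo:wikiPageRedirects ?res .
          ?alt rdfs:label ?altLabel .
          FILTER (lang(?altLabel) = "en")
        }}""")
    rows = endpoint.query().convert()["results"]["bindings"]
    return {row["altLabel"]["value"] for row in rows}

# e.g. dbpedia_equivalents("Information retrieval")
```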
9. Deployment results (1/2)
   • Linguistically matched terms between Dione and DBpedia, and between Dione and WordNet, through lexical similarity techniques

   Dione subject headings               DBpedia      WordNet
   One-word subject headings            331 (29%)    297 (65%)
   Two-word subject headings            658 (59%)    128 (28%)
   Subject headings with 3+ words       130 (12%)     30 (7%)
   Subject headings with subdivisions     0             0
   Sum (1,574 in total)                1,119           455
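Purely as an illustration of how the word-count breakdown in this table could be computed; the " -- " subdivision separator is an assumed convention, not documented on the slides.

```python
# Rough sketch of bucketing matched headings as in this table.
# ASSUMPTION: subdivisions are marked with a " -- " separator.
from collections import Counter

def bucket(heading: str) -> str:
    if " -- " in heading:
        return "with subdivisions"
    words = len(heading.split())
    if words == 1:
        return "one-word"
    if words == 2:
        return "two-word"
    return "3+ words"

def breakdown(matched_headings):
    return Counter(bucket(h) for h in matched_headings)

# breakdown(["Databases", "Information retrieval", "Greece -- History"])
# -> Counter with one heading in each of the three buckets
```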
10. Deployment results (2/2)
   [Venn-style diagram of the subsets produced by the two runs. Recoverable values: D = 3,323 terms; N = 10,000 terms; D1' = 1,119 terms (via DBpedia) and D2' = 455 terms (via WordNet); N1' = 163 terms and N2' = 117 terms. The remaining figure values (986; 5,700; 72; 86; 77; 45 for DB', DB'', W', W'') cannot be reliably assigned from the extracted text.]
11. Comparative evaluation (1/4)
   • The proposed methodology is compared against an algorithm (introduced in previous work*) applied to Dione and NYT that is based only on lexical similarity techniques
   • Dione and NYT are not described by schemas; thus, any attempt to merge their underlying terms cannot be based on traditional ontology-alignment techniques

   * Papadakis, I., Kyprianos, K.: Merging Controlled Vocabularies for More Efficient Subject-based Search. International Journal of Knowledge Management 7(3), 76-90, July-September (2011)
12. Comparative evaluation (2/4)
   • List A (previous work, 207 pairs): only lexically matched pairs between Dione and NYT
   • List B (proposed work, 280 pairs): lexically AND semantically matched pairs between Dione and NYT
13. Comparative evaluation (3/4)
   • List A ∩ List B = 180 pairs
   • Pairs only in List A: 27
   • Pairs only in List B: 100
   • (27 + 180 = 207 pairs in List A; 180 + 100 = 280 pairs in List B)
14. Comparative evaluation (4/4)

   Matched pairs       List A          List B
   D1-NYT1             ✓ (lexical)     ✓ (lexical)
   …                   ✓ (lexical)     ✓ (lexical)
   D158-NYT158         ✓ (lexical)     ✓ (lexical)
   …                   ✓ (lexical)     ✓ (semantic)
   D180-NYT180         ✓ (lexical)     ✓ (semantic)
   …                                   ✓ (semantic)
   D280-NYT280                         ✓ (semantic)
   …                   ✓ (lexical)
   D307-NYT307         ✓ (lexical)

   TOTAL (No. of pairs): 158 lexical in both lists; 22 more lexical in A and semantic in B (180 through pair D180-NYT180); 100 only in List B; 27 only in List A; 307 pairs overall
15. Conclusions
   • A methodology was presented that is capable of finding equivalent terms between semantically similar controlled vocabularies
   • Lexical similarity discovery and semantic alignment are performed through external LOD datasets
   • Google Refine renders the deployment of the proposed methodology a straightforward process that can be applied to other cases aimed at discovering equivalent terms in different yet semantically similar datasets
   • The deployment of the proposed methodology is facilitated by the employment of linked data technologies
16. Future work
   • Future work is targeted towards the reconciliation of Dione's subject headings with linked data services such as the French National Library (RAMEAU), the German National Library (GND), the Biblioteca Nacional de España (BNE) and LIBRIS.
17. Thank you for your attention! Questions?
