
Exploring Linked Data For The Automatic Enrichment of Historical Archives

Workshop paper
Gary Munnelly, Harshvardhan J. Pandit, Séamus Lawless.
Third International Workshop on Semantic Web for Cultural Heritage (SW4CH). Co-located with 15th European Semantic Web Conference (ESWC). Crete, Heraklion, Greece. 2018


  1. Linking Historical Sources to Established Knowledge Bases in Order to Inform Entity Linkers in Cultural Heritage
     Annalina Caputo, Gary Munnelly, Séamus Lawless
     The ADAPT Centre is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.
  2. Background (www.adaptcentre.ie)
     Overarching research looks at Entity Linking in Irish Cultural Heritage collections.
     Many possible advantages to being able to perform reliable Entity Linking on digital cultural heritage collections:
     • Helps to organise content.
     • Integrates collections with existing resources.
     • Enables us to build bridges between disparate collections based on mutual entities.
     • Facilitates the construction of better search and/or personalization tools.
     ...
  3. Problem
     Entities in Cultural Heritage collections often have poor representation in popular Knowledge Bases.

     Study            Content                                  Coverage [1]
     Agirre et al.    Europeana Scran and Cgrid collections    22%
     Munnelly et al.  17th Century English Legal               23%

     An investigation by Max de Wilde reported recall scores between 0.213 and 0.517 for data curated by Historische Kranten.
     [1] Studies used a Wikipedia-derived source for the Knowledge Base.
  6. Solution?
     Find a better knowledge base:
     1. GeoNames
     2. Pleiades
     3. Getty
     4. Library of Congress
     We couldn’t find a source with adequate coverage for the collections we handle.
     Can we build our own knowledge base?
  7. Question
  9. Primary Sources
     • Statute Staple
     • Petty Maps
     • Books of Survey and Distribution
     • Lodges Peerage
     Excellent sources of information, but very isolating. Relying on them essentially reduces our efforts to little more than record resolution. The result is guaranteed to be noisy and open to dispute.
  11. Secondary Sources
     • Oxford Dictionary of National Biography (ODNB)
     • Dictionary of Irish Biography (DIB)
     Not bad! Coverage is not as good as the primary sources, but entities are more concrete. Still somewhat isolating.
     Can we connect these resources to a more established knowledge base?
  12. Goal
     Identify corresponding DBpedia entities for DIB and ODNB biographies:
     • Enables us to link our collections with more widely accepted linked data resources.
     • Implicitly limits the Entity Linker to entities that are relevant to our geographic region.
     • Enables linking across multiple knowledge bases, shown to be beneficial by Brando et al.
  13. Method: Overview
     Given a biography which describes an entity:
     1. Retrieve DBpedia candidates which may discuss the same entity.
     2. Compare the biography to all candidates.
     3. Choose the most similar candidate, or disregard all if no suitable equivalent is found.
     4. Link the biography to the corresponding DBpedia entry.
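The four steps above can be sketched as a small function. This is only an illustrative skeleton, not the authors' implementation: `link_biography`, `toy_retrieve`, `toy_score` and the `dbpedia:` identifiers are all hypothetical names, and retrieval and scoring are passed in as plain callables so that any concrete method can be plugged in.

```python
from typing import Callable, Iterable, Optional


def link_biography(
    biography: str,
    retrieve: Callable[[str], Iterable[str]],
    score: Callable[[str, str], float],
    threshold: float,
) -> Optional[str]:
    """Return the best-scoring candidate, or None if none is similar enough."""
    best_candidate, best_score = None, float("-inf")
    for candidate in retrieve(biography):               # step 1: retrieve candidates
        candidate_score = score(biography, candidate)   # step 2: compare
        if candidate_score > best_score:                # step 3: keep the most similar
            best_candidate, best_score = candidate, candidate_score
    if best_candidate is None or best_score < threshold:
        return None                                     # step 3: disregard all candidates
    return best_candidate                               # step 4: the link to record


# Toy stand-ins (hypothetical) so the sketch can be run end to end.
def toy_retrieve(biography: str) -> Iterable[str]:
    return ["dbpedia:James_Butler", "dbpedia:John_Smith"]


def toy_score(biography: str, candidate: str) -> float:
    wanted = set(biography.lower().split())
    found = set(candidate.split(":")[1].lower().split("_"))
    return len(wanted & found) / 2


print(link_biography("James Butler", toy_retrieve, toy_score, threshold=0.5))
print(link_biography("Mary Byrne", toy_retrieve, toy_score, threshold=0.5))
```

Returning `None` models the case where a biography has no suitable DBpedia equivalent.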
  14. Method: Features
  17. Method: Indexing and Retrieving Candidates
     • Indexed all DBpedia people in a Solr instance.
     • Indexed content included:
       name of DBpedia entity;
       text of corresponding Wikipedia article;
       anchor text associated with entity.
     • For candidate retrieval, the title of the biography was executed as a query against the search index. Results were boosted on matches in title and anchor text.
     • Method only looks at the top 10 returned results.
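A candidate query of the kind described above might be built as follows. This is a sketch under stated assumptions: the core name `dbpedia_people`, the field names `name`, `anchor_text` and `article_text`, and the boost values are hypothetical, since the slides do not give the schema. The `edismax` query parser and its `qf` field boosts are standard Solr features. The function only builds the request URL and does not contact a server.

```python
from urllib.parse import urlencode

# Hypothetical Solr core; the slides do not name the actual index.
SOLR_SELECT_URL = "http://localhost:8983/solr/dbpedia_people/select"


def build_candidate_query(biography_title: str, rows: int = 10) -> str:
    """Build a Solr select URL that searches for a biography title."""
    params = {
        "defType": "edismax",                        # extended DisMax query parser
        "q": biography_title,                        # the biography title is the query
        "qf": "name^3 anchor_text^2 article_text",   # boost name and anchor-text matches
        "rows": rows,                                # only the top results are considered
        "fl": "id,name,score",                       # fields to return
        "wt": "json",
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


print(build_candidate_query("Butler, James"))
```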
  18. Method: Comparing Names
     Compared title of biography to name of DBpedia entity using Monge-Elkan distance:
     • Tokenize both strings and add to bipartite graph.
     • Find optimal mapping between sets of tokens based on string similarity (similarity computed with Jaro-Winkler distance).
     • Compute harmonic mean of similarities based on optimal mapping.
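A minimal sketch of this name comparison, written from the textbook definitions rather than from the authors' code. Two simplifications are assumptions of this sketch: each title token is paired with its single best-matching name token (a greedy best match, not a true optimal bipartite assignment), and the prefix scale of 0.1 with a four-character prefix cap is the conventional Jaro-Winkler setting. The function names are hypothetical.

```python
import re


def jaro(s: str, t: str) -> float:
    """Jaro similarity between two strings, in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_seq = [c for c, m in zip(s, s_matched) if m]
    t_seq = [c for c, m in zip(t, t_matched) if m]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s: str, t: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity with a bonus for a shared prefix of up to 4 characters."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


def tokenize(text: str) -> list:
    return re.findall(r"\w+", text.lower())


def monge_elkan_harmonic(title: str, name: str) -> float:
    """Harmonic mean of each title token's best Jaro-Winkler match in the name."""
    title_tokens, name_tokens = tokenize(title), tokenize(name)
    if not title_tokens or not name_tokens:
        return 0.0
    best = [max(jaro_winkler(a, b) for b in name_tokens) for a in title_tokens]
    if min(best) == 0.0:
        return 0.0  # the harmonic mean collapses to zero if any token has no match
    return len(best) / sum(1.0 / x for x in best)


print(monge_elkan_harmonic("Butler, James", "James Butler"))
```

Because every title token looks for its own best match, the measure is asymmetric: swapping the arguments can change the score when the two strings have different numbers of tokens.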
  19. Method: Comparing Biographical Content
     There can be vocabulary differences between biographies and Wikipedia.
     Use Word Mover Similarity (WMS) to compare the content of articles:
     • Trained a Word2Vec model on the contents of Wikipedia.
     • Convert the content of the biography and of all candidate entity articles to a vector representation using the Word2Vec model.
     • Compute similarity between articles by computing how far apart they are in the embedding space.
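To make the idea concrete, the sketch below computes the relaxed Word Mover's Distance, the cheap lower bound described by Kusner et al., rather than the full optimal-transport distance. Everything here is an assumption for illustration: the two-dimensional `TOY_VECTORS` stand in for a trained Word2Vec model, and the conversion `1 / (1 + distance)` is just one plausible way to turn a distance into a similarity, since the slides do not say how WMS was derived.

```python
import math
from collections import Counter

# Toy 2-D "embeddings" (hypothetical); a real system would use Word2Vec vectors.
TOY_VECTORS = {
    "earl": (1.0, 0.0),
    "duke": (0.9, 0.1),
    "soldier": (0.0, 1.0),
    "general": (0.1, 0.9),
}


def nbow(tokens):
    """Normalised bag-of-words weights over tokens that have a vector."""
    counts = Counter(t for t in tokens if t in TOY_VECTORS)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def one_sided_cost(source, target):
    """Move each source word's weight entirely to its nearest target word."""
    return sum(
        weight * min(math.dist(TOY_VECTORS[w], TOY_VECTORS[v]) for v in target)
        for w, weight in source.items()
    )


def relaxed_wmd(doc_a, doc_b):
    """Relaxed WMD: the larger of the two one-sided lower bounds."""
    a, b = nbow(doc_a), nbow(doc_b)
    if not a or not b:
        return math.inf  # nothing to compare
    return max(one_sided_cost(a, b), one_sided_cost(b, a))


def word_mover_similarity(doc_a, doc_b):
    """Map a distance in [0, inf) to a similarity in (0, 1]; 0.0 if incomparable."""
    return 1.0 / (1.0 + relaxed_wmd(doc_a, doc_b))


print(word_mover_similarity(["earl", "soldier"], ["duke", "general"]))
```

Documents that share no words can still score as similar when their words lie close together in the embedding space, which is what makes this family of measures tolerant of vocabulary differences.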
  20. Method: Scoring
     Final score is a linear combination of the two previous scores:
     $\Psi(b, p) = \alpha \, \Phi(b_{title}, p_{name}) + \beta \, \Omega(b_{content}, p_{article})$
     where $\alpha$ and $\beta$ are tuning parameters.
     We apply a hard threshold to this score. If the score of the highest ranked candidate is less than this threshold, then the biography does not have a corresponding DBpedia entry.
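The scoring rule is simple enough to write out directly. In this sketch the default weights of 0.5 and the threshold of 0.55 are placeholder values chosen only for illustration; the slides do not state the tuned values of α and β.

```python
def combined_score(name_similarity: float, content_similarity: float,
                   alpha: float = 0.5, beta: float = 0.5) -> float:
    """Linear combination of the name score and the content score."""
    return alpha * name_similarity + beta * content_similarity


def has_referent(best_score: float, threshold: float = 0.55) -> bool:
    """Hard threshold: below it, the biography is treated as having no DBpedia entry."""
    return best_score >= threshold


score = combined_score(name_similarity=0.9, content_similarity=0.4)
print(score, has_referent(score))
```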
  21. Evaluation
     • Sampled 200 biographies from each of ODNB and DIB: 400 samples in total.
     • Created a manually annotated gold standard from DBpedia:
       72 DIB biographies do not have a DBpedia referent;
       64 ODNB biographies do not have a DBpedia referent.
     • Presented approach is effectively an Entity Linking method: evaluated with the BAT Framework.
     • Results given in terms of percentage of correct vs. incorrect annotations on the gold standard.
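Accuracy against a gold standard that includes biographies with no referent can be sketched as below. This is not the BAT Framework's API; it is a hypothetical stand-alone function in which a missing referent is represented by `None`, so correctly abstaining counts as a correct annotation.

```python
from typing import Dict, Optional


def accuracy(predictions: Dict[str, Optional[str]],
             gold: Dict[str, Optional[str]]) -> float:
    """Fraction of gold-standard biographies whose predicted link matches exactly.

    A gold value of None means "no DBpedia referent". A biography missing from
    `predictions` is treated as a prediction of None.
    """
    if not gold:
        raise ValueError("gold standard is empty")
    correct = sum(
        1 for biography_id, referent in gold.items()
        if predictions.get(biography_id) == referent
    )
    return correct / len(gold)


# Hypothetical identifiers, for illustration only.
gold = {"bio-1": "dbpedia:A", "bio-2": None, "bio-3": "dbpedia:C"}
predicted = {"bio-1": "dbpedia:A", "bio-2": None, "bio-3": "dbpedia:X"}
print(accuracy(predicted, gold))
```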
  22. Results: Threshold DIB
     [Plot: accuracy (axis ticks 0.4 to 0.8) against threshold (axis ticks 0.45 to 0.65).]
  23. Results: Threshold ODNB
     [Plot: accuracy (axis ticks 0.4 to 0.6) against threshold (axis ticks 0.45 to 0.65).]
  24. Results
     • Wide disparity in performance on the two collections:
       81.5% accurate on DIB;
       67.5% accurate on ODNB.
     • Inaccuracies due to Solr not finding the referent as a candidate:
       45.9% of errors on DIB;
       43.1% of errors on ODNB.
     • Problems with ODNB likely due to pictorial resources: very little text content for WMS to compare.
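The percentages above can be turned back into approximate counts. The sketch assumes that accuracy was measured over all 200 sampled biographies in each collection, which the slides imply but do not state outright; under that assumption the figures correspond to 163 correct and 37 incorrect links on DIB (17 of them retrieval misses), and 135 correct and 65 incorrect on ODNB (28 retrieval misses).

```python
def implied_counts(accuracy: float, retrieval_error_share: float,
                   samples: int = 200):
    """Convert reported percentages into (correct, errors, retrieval misses)."""
    correct = round(accuracy * samples)
    errors = samples - correct
    retrieval_misses = round(retrieval_error_share * errors)
    return correct, errors, retrieval_misses


for collection, (acc, share) in {"DIB": (0.815, 0.459),
                                 "ODNB": (0.675, 0.431)}.items():
    print(collection, implied_counts(acc, share))
```

On this reading, improving candidate retrieval alone could recover a little under half of the errors in either collection.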
  25. Summary
     • Presented a method for linking biographies to DBpedia.
     • Resulting output can be used as a knowledge base for Entity Linking.
     • Links to DBpedia enable us to associate our historical collections with established knowledge bases where possible.
     • Want to investigate the possible benefits of linking across multiple knowledge bases, given that no single resource satisfies our requirements.
  26. Thank You. Questions?
  27. References I
     Eneko Agirre et al. “Matching Cultural Heritage items to Wikipedia”. In: LREC. 2012, pp. 1729–1735.
     Carmen Brando, Francesca Frontini, and Jean-Gabriel Ganascia. “REDEN: Named Entity Linking in Digital Literary Editions Using Linked Data Sets”. In: Complex Systems Informatics and Modeling Quarterly 7 (July 2016), pp. 60–80.
     Marco Cornolti, Paolo Ferragina, and Massimiliano Ciaramita. “A framework for benchmarking entity-annotation systems”. In: Proceedings of the 22nd international conference on World Wide Web. ACM. 2013, pp. 249–260.
     Sergio Jimenez et al. “Generalized mongue-elkan method for approximate text string comparison”. In: International Conference on Intelligent Text Processing and Computational Linguistics. Springer. 2009, pp. 559–570.
  28. References II
     Matt Kusner et al. “From word embeddings to document distances”. In: International Conference on Machine Learning. 2015, pp. 957–966.
     Alvaro Monge and Charles Elkan. “The field matching problem: Algorithms and applications”. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. 1996, pp. 267–270.
     Gary Munnelly and Séamus Lawless. “Investigating Entity Linking in Early English Legal Documents”. In: ACM/IEEE Joint Conference on Digital Libraries (JCDL). In Press.
