Session 1.2 improving access to digital content by semantic enrichment
1. Improving access to digital collections by
semantic enrichment
Theo van Veen and Juliette Lonij, Semantics 2017, 12-09-2017
2. Overview
• Motivation and approach
• Entity linking
• Presentation
• Semantic search
• User feedback
• Wikidata as thesaurus
• Conclusions and next steps
T. v. Veen and J. Lonij, Semantics 2017
4. Motivation
• Improving discovery and usability requires intelligent
connection of content to the outside world.
• Content contains “knowledge” requiring intelligent
preprocessing to be found.
• Knowledge should be offered to the user more or less
unsolicited
• Our software should have read, analyzed and
enriched our content completely before the user arrives!
5. Enrichment: purpose and approach
• making content, especially newspaper articles,
easier to find and use
• by enriching text and names in the text with links to
related information
• which is in most cases linked data (links to Wikidata,
Polygoon news reels, images)
• and which enables advanced queries and presentation
of context information
6. How?
• “Things” in text have to be uniquely identified
• When identifiers link to resource descriptions it is
possible to present context information about
“things”
• Relevant context information can be indexed as part of
a “thing” and so it can be searched for
• Using properties in external resource descriptions enables
semantic search
1. Identification
2. Context
3. Indexing
4. Semantic search
7. Enrichment types
• Newspaper articles and radio bulletins linked to Polygoon newsreels
• Named entities linked to DBpedia (and Wikidata, VIAF etc.)
• Place-street combinations in newspaper articles linked to latitude
and longitude
• Newspaper articles linked to images from the Memory of the
Netherlands
Enrichment categories:
• Named entities: DBpedia, Wikidata, VIAF, Geonames, etc.
• Geodata: street, place, lat., long.; place, lat., long.
• Links: web pages, video, images, sound
• Extracted features: classification, sentiment, relevance, interestingness
• User annotation: tags, stories (oral history)
• Image enrichment: face recognition, emotion detection, object detection,
structure detection
(Slide legend: “Now available”)
9. How do we deal with names in text?
• Recognize names (named entity recognition)
• Identify names by searching for them in DBpedia and link the names
to the DBpedia descriptions
• Names are ambiguous: does Einstein link to Albert Einstein
or Alfred Einstein? We need disambiguation algorithms.
• The accuracy of the links will be improved by machine learning
techniques; conventional “if then else” software isn’t fit for this job
• We need user feedback to correct false links or add missing
links and this will be used for additional training
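The disambiguation step can be illustrated with a toy example. The sketch below is not the algorithm from these slides (which relies on machine learning); it only ranks candidates by word overlap between the article context and a candidate description. The candidate descriptions and the names `CANDIDATES`, `tokens` and `rank_candidates` are made up for illustration; a real system would take the descriptions from DBpedia.

```python
import re

# Hypothetical candidate descriptions, written by hand for this example.
CANDIDATES = {
    "Albert_Einstein": "theoretical physicist who developed the theory of relativity",
    "Alfred_Einstein": "musicologist and music editor known for the Mozart catalogue",
}

def tokens(text):
    """Return the set of lowercase words in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def rank_candidates(context, candidates):
    """Return candidate ids sorted by word overlap with the context, best first."""
    ctx = tokens(context)
    scored = [(len(ctx & tokens(desc)), cid) for cid, desc in candidates.items()]
    # Sort by descending overlap, then by id so that ties have a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [cid for _, cid in scored]

context = "Einstein published a new edition of the Mozart catalogue"
best = rank_candidates(context, CANDIDATES)[0]
print(best)
```

Word overlap is a deliberately crude signal. It stands in for the context features that a trained model would weigh, and user corrections would enter as extra training examples rather than as changes to a rule like this one.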
13. Continuous improvement of enrichment algorithm
• All DBpedia titles searched in news articles
• Named entities searched in DBpedia
• Speedup by using the SURFsara HPC cloud
• Using context and machine learning
• At the end, cycle back to the first article and overwrite earlier
enrichments with the newest algorithm
[Figure: quality/confidence (%), with ticks at 70, 80 and 90, plotted against
article number / time, from 1 to 108 million]
14. From conventional entity linking to deep learning and beyond
Algorithm                                 Accuracy  Link recall  Link precision  Link F-measure
Rule based                                .76       .76          .65             .70
Machine learning (SVM)                    .84       .76          .83             .79
Neural network                            .84       .73          .87             .79
Extra features, e.g. word embedding       .85       .81          .82             .82
Extra Wikidata data, more training data   .87       .81          .86             .84
Entity embedding                          .88       .86          .85             .85
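The link F-measure column can be checked against the recall and precision columns. The sketch below assumes the F-measure is the usual balanced F1 (the harmonic mean of precision and recall); the slide does not say which variant was used. The function name `f_measure` is chosen here for illustration.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (link recall, link precision) for the first three rows of the table.
rows = {
    "Rule based": (0.76, 0.65),
    "Machine learning (SVM)": (0.76, 0.83),
    "Neural network": (0.73, 0.87),
}

for name, (recall, precision) in rows.items():
    print(f"{name}: F = {f_measure(precision, recall):.2f}")
```

For these three rows the recomputed values match the reported .70, .79 and .79. Because the table gives recall and precision rounded to two decimals, a recomputed value for other rows can differ from the reported one by 0.01.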
15. Development cycle
Justification: Our aim is obtaining a higher quality than existing entity linking software (e.g. DBpedia Spotlight)
[Figure: trust/quality of the stored LNEs, the running algorithm, the
algorithm in development and the LNEs enriched by users, set against a target
trust level, with the labels “train”, “replace” and “improve”]
Example of comparison of stored LNEs, result of current algorithm, result
of algorithm under development and existing software.
22. Semantic search: index resource identifiers
• Indexing: get text for article X; get enrichments for article X from
the enrichment database
• Newspaper index: text + VIAF id + Wikidata id etc.
• Wikidata semantic search (SPARQL) provides Wikidata ids
• Search articles with those Wikidata ids
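The idea of indexing resource identifiers next to the text can be sketched with a small in-memory index. This is an illustration only, not the system's actual index: the article ids, the identifier values and the function names `build_index` and `search` are hypothetical.

```python
from collections import defaultdict

def build_index(enrichments):
    """Map each resource identifier to the set of article ids that mention it."""
    index = defaultdict(set)
    for article_id, resource_ids in enrichments.items():
        for rid in resource_ids:
            index[rid].add(article_id)
    return index

def search(index, resource_ids):
    """Return the sorted ids of articles mentioning at least one identifier."""
    hits = set()
    for rid in resource_ids:
        hits |= index.get(rid, set())
    return sorted(hits)

# Hypothetical enrichment database: article id -> linked identifiers.
enrichments = {
    "article-1": ["Q937", "Q254"],
    "article-2": ["Q254"],
    "article-3": [],
}
index = build_index(enrichments)

# In the architecture described here, the id list would come from a SPARQL query.
print(search(index, ["Q937"]))
print(search(index, ["Q937", "Q254"]))
```

Keeping identifiers in the index means a semantic query only has to produce a list of ids; the article lookup itself stays an ordinary index search.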
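23. Articles mentioning members of parliament not born in the Netherlands
SELECT ?p WHERE {
  ?p wdt:P39 wd:Q18887908 .    # P39 = position held (the member-of-parliament item)
  ?p wdt:P19 ?place .          # P19 = place of birth
  ?place wdt:P17 ?country .    # P17 = country
  FILTER NOT EXISTS {
    ?place wdt:P17 wd:Q55 .    # Q55 = Netherlands
  }
}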
24. For the same query in the catalogue, the Wikidata identifier is
converted to the local thesaurus identifier.
25. Navigation example
• Semantic query between [ ], in this case expanding to all Roman emperors
• Select “newspaper+” collection
• Select a result
• Click on a linked named entity for more information
• Click on “More info” for properties of this entity
• Click on a property for searching more articles about resources with
that property
• And see the result: all articles mentioning persons that have been
married to Elizabeth Taylor
Using square brackets: the software tries a few Wikidata SPARQL queries
and replaces the bracketed string with the Wikidata results.
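The square-bracket expansion can be sketched as a string substitution. This is a minimal illustration, not the portal's code: the resolver is a stub with canned results standing in for the Wikidata SPARQL lookups, the OR-group output syntax is an assumption about the search engine's query language, and the names `expand_brackets` and `stub_resolve` are hypothetical.

```python
import re

def expand_brackets(query, resolve):
    """Replace each [phrase] in the query with an OR-group of resolved names."""
    def substitute(match):
        phrase = match.group(1).strip()
        names = resolve(phrase)
        if not names:
            # Nothing found: fall back to the literal phrase.
            return phrase
        return "(" + " OR ".join(f'"{name}"' for name in names) + ")"
    return re.sub(r"\[([^\[\]]+)\]", substitute, query)

def stub_resolve(phrase):
    """Stand-in for the Wikidata lookups; returns canned results."""
    canned = {"roman emperors": ["Augustus", "Tiberius", "Caligula"]}
    return canned.get(phrase.lower(), [])

print(expand_brackets("[Roman Emperors] coins", stub_resolve))
```

Passing the resolver in as a function keeps the string handling separate from the lookup, so the stub can later be swapped for real queries without touching the substitution logic.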
30. Navigation example: spouse=Elizabeth Taylor
38. Wikidata as central hub?
“Everything links to everything” versus “Everything links to Wikidata”
39. Current situation: many to many links
(many identifiers for single resource)
Proposed: everything links to Wikidata
(same identifier for single resource)
Wikidata as universal thesaurus for libraries
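The hub idea can be sketched as a single cross-reference table keyed by Wikidata id. Each system then needs only its own link to Wikidata, and conversions between any two systems go through the hub. All identifier values below are placeholders, and the names `HUB`, `from_wikidata` and `convert` are hypothetical.

```python
# Hypothetical cross-reference table: Wikidata id -> identifiers in other systems.
HUB = {
    "Q937": {"viaf": "viaf-0001", "local_thesaurus": "thes-42"},
    "Q254": {"viaf": "viaf-0002"},
}

def from_wikidata(wikidata_id, target):
    """Return the identifier that `target` uses for a Wikidata item, or None."""
    return HUB.get(wikidata_id, {}).get(target)

def convert(source, source_id, target):
    """Convert an identifier between two systems by going through the hub."""
    for links in HUB.values():
        if links.get(source) == source_id:
            return links.get(target)
    return None

print(from_wikidata("Q937", "local_thesaurus"))
print(convert("viaf", "viaf-0001", "local_thesaurus"))
```

The linear scan in `convert` keeps the sketch short; a reverse lookup table per system would be the obvious choice for real data volumes.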
41. Conclusions and next steps
• Entity linking combining machine learning and domain knowledge is
promising and we still have ideas for improvements
• We have shown the added value of linking named entities to
Wikidata/DBpedia: it improves findability and usability of content as
demonstrated with the research portal
• Our aim is to increase the confidence of links so users can trust them
“enough” for using them for searching and research
• User feedback provides additional training data and needs to be
deployed on a larger scale