Challenging knowledge extraction to support the curation of documentary evidence in the humanities

Challenging knowledge extraction to support 
the curation of documentary evidence in the humanities
Sciknow 2019
19 November 2019
Marina del Rey, California, United States
Enrico Daga and Enrico Motta
The Open University
Feedback: @enridaga | www.enridaga.net
Paper: http://oro.open.ac.uk/67961/

Background
• The identification and cataloguing of documentary evidence from textual
corpora is an important part of empirical research in the humanities.
• Semantic databases of documentary evidence: a recent trend
• The Listening Experience Database Project (LED) (over 10.000 unique
experiences) - https://led.kmi.open.ac.uk/
• READ-IT: Reading Europe Advanced Data Investigation Tool - https://readit-
project.eu/
• Two-step workflow:
1. Identification + Retrieval (*)
2. Cataloguing and curation (*) “Capturing Themed Evidence, a Hybrid Approach”
Details tomorrow afternoon …

Knowledge Extraction (KE)
• Bet: metadata curation could be supported with KE methods
• KE: automatic or semi-automatic derivation of formal symbolic knowledge from
unstructured or semi-structured sources
• Approaches in the literature vary in task / scope:
• (Named) Entity Recognition and Classification (Person, Work, Time, Place, …)
• Entity Linking (DBpedia, Gazetteers)
• Relation Extraction (listener of, in place)
• Event extraction (Performance)
• Machine reading

Example #1
"I then went to Amsterdam to conduct Oedipus at the
Concertgebouw, which was celebrating its fortieth
anniversary by a series of sumptuous musical
productions. The fine Concertgebouw orchestra, always
at the same high level, the magnificent male choruses
from the Royal Apollo Society, soloists of the first rank -
among them Mme Hélène Sadoven as Jocasta, Louis van
Tulder as Oedipus, and Paul Huf, an excellent reader -
and the way in which my work was received by the public,
have left a particularly precious memory that I recall with
much enjoyment."
listener: Igor Strawinsky
time: in the beginning of 1928
place: Amsterdam
opera: Oedipus Rex
/by: Igor Strawinsky
performer: Concertgebouw orch.
environment: Public
Igor Stravinksy
An Autobiography (1936), p. 139.
https://led.kmi.open.ac.uk/entity/lexp/1435674909834

Example #2
"Music is certainly a pleasure that may be reckoned
intellectual, and we shall never again have it in the
perfection it is this year, because Mr. Handel will not
compose any more! Oratorios begin next week, to
my great joy, for they are the highest entertainment
to me."
listener: Mrs Delany
time: March, 1737
place: London
opera: Operas and Oratorios
/by: G. F. Handel
environment: Public
From: Mary Granville, and Augusta Hall (ed.),
Autobiography and Correspondence of Mary Granville, Mrs
Delany: with interesting Reminiscences of King George the
Third and Queen Charlotte, volume 1 (London, 1861), p.
594.
https://led.kmi.open.ac.uk/entity/lexp/1444424772006

Analysis
• Scope: 7.3% of the LED with sources available (archive.org) and
including DBpedia entities as place or agent, 690 excerpts from 26
books.
• Focus on Entity Recognition: Listener & Place
1. Find the position of the evidence text back in the original source
2. Check where the DBpedia entity (listener or place) is mentioned
• Analysis is clearly limited and results are preliminary, although …

Analysis
• Q1 - in the excerpt? The place is mentioned in the
excerpt in 25.9% cases. The listener only in 13.4%.
• Q2 - near the excerpt? Only 10% of the times the
place mention is less than 5 paragraphs from the
excerpt. The agent, in 4% of the cases.
• Q3 - in the source? 83.2% of the times the place is
mentioned at least once in the source. In 11.4% the
place hasn’t been found.
• Q4 - in the meta? 64.8% of the listeners are also
the authors of the text - 5874 cases in LED.
Distance of entity (in n of paragraphs)

Challenges
• Implicit information, based on inference requiring expertise (e.g. Mr Handel is
G.F Handel, Oedipus is “Oedipus Rex”)
• The role of contextual knowledge is fundamental (1) in identifying the agent
from the metadata of the source; (2) common sense inference (“in the beginning
of 1928”)
• Entities can exist in distributed, heterogeneous resources (encyclopaedic KBs,
domain-specific taxonomies, gazetteers, …)
• Machine reading generates an ontology formalising the discourse in the text,
reducing the task to one of ontology alignment (not a simplification!)
• Cultural studies typically coin novel concepts (ListeningExperience) with original
schemas. Portability of the methods is even more at risk!

Thank you
Questions?

Challenging knowledge extraction to support the curation of documentary evidence in the humanities

Recommended

Recommended

More Related Content

Similar to Challenging knowledge extraction to support the curation of documentary evidence in the humanities

Similar to Challenging knowledge extraction to support the curation of documentary evidence in the humanities (20)

More from Enrico Daga

More from Enrico Daga (17)

Recently uploaded

Recently uploaded (20)