Identifying and curating documentary evidence from textual corpora is an essential part of empirical research in the humanities.
Initially, we discuss "themed" evidence, that is, traces of a fact or situation relevant to a theme of interest, and focus on the problem of identifying them in texts. To that end, we combine statistical NLP, background knowledge, and Semantic Web technologies in a hybrid approach. We illustrate the method's effectiveness in a case study of a database of evidence of experiences of listening to music. We also demonstrate its generality by testing it on a different use case in the digital humanities.
Finally, we ponder the applicability of knowledge extraction techniques to automatically populate a database of documentary evidence and discuss the challenges from the point of view of scientific knowledge acquisition.
Capturing the semantics of documentary evidence for humanities research
1. Capturing the semantics of documentary
evidence for humanities research
DBpedia Day, NLP & DBpedia
09 / 09 / 2021,
Semantics 2021
Amsterdam (& online)
Enrico Daga
The Open University
@enridaga | www.enridaga.net
2. Motivation
The identification and cataloguing of documentary evidence
from textual corpora is an important part of empirical research in the
humanities (e.g. historiographic methodology).
Semantic databases of documentary evidence: a recent trend
• The Listening Experience Database Project (LED) (over 10,000 unique
experiences) - https://led.kmi.open.ac.uk/ (2 UK AHRC 2012-2019)
• READ-IT: Reading Europe Advanced Data Investigation Tool -
https://readit-project.eu/ (2018-2020)
• Polifonia: Knowledge Graph of Musical Cultural Heritage, with pilots
focusing on scholars in the musical heritage domain -
http://polifonia-project.eu (2021-2023)
Two problems:
• Identification -> find evidence in texts
• Cataloguing -> curate a database of evidence
3. Identification
The task of identifying pieces of evidence in books is manual work, which
may include relying on free-text search tools (e.g. PDF viewers)
Problems: the activity (a) requires effort / time, (b) is not systematic, (c) is
prone to errors, and (d) the methodology is (often) not documented
4. "Capturing themed evidence, a hybrid approach."
Enrico Daga and Enrico Motta
In Proceedings of the 10th International Conference on Knowledge Capture, pp. 93-100. 2019.
• Focus on Identification
• We coin the expression "themed evidence" to refer to (direct or indirect)
traces of a fact or situation relevant to a theme of interest, and study the
problem of identifying such traces in texts.
• The task of identifying themed evidence is at the intersection between
topical text classification (finding texts relevant to a certain theme) and
event retrieval (finding events mentioned in texts).
• Not all topical texts are themed evidence, and the nature of the event itself
is often assumed, implicit, and left to the reader
Paper: http://oro.open.ac.uk/67961/
5. Finding Listening Experiences (theme: music)
• RECMUS-619, positive: Introduced to the Anacreontic Society, consisting of
amateurs who perform admirably the best orchestral works. The usual supper
followed. After propitiating me with a trio from ’Cosi Fan Tutte’, they drew me to
the piano.
• MASONB-31, positive: In the evening we went to Rev. Baptist Noel’s chapel,
where one is always sure of edification from the sermon if not from the psalms.
• MASONB-88, negative: Flags and pendants were suspended from the
windows, [. . . ] the colors of the German States were waving harmoniously
together, and the banners of the Fine Arts, with appropriate inscriptions,
particularly those of music, poetry and painting, were especially honored, and
floated triumphant amidst the standards of electorates, dukedoms, and
kingdoms.
6. Entity boost: promote terms mapped to entities
PoS filter: demote terms other than verbs and
nouns, to privilege factual statements
1) Statistical Relatedness Analysis
2) Themed entity detection
3) Hybridisation
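The three steps above can be read as a scoring pipeline. The following is a minimal sketch in Python, not the implementation from the paper. It assumes a hand-written relatedness dictionary standing in for the statistical relatedness analysis, a toy set of terms assumed to map to theme-related entities, and tokens that already carry a part-of-speech tag. All names, weights, and the threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    pos: str  # coarse part-of-speech tag, e.g. "NOUN", "VERB", "ADV"


# 1) Statistical relatedness: assumed toy scores of terms with respect to
#    the theme "music". A real system would derive these from a corpus.
RELATEDNESS = {
    "trio": 0.8, "piano": 0.9, "perform": 0.6,
    "supper": 0.05, "flags": 0.02, "harmoniously": 0.5,
}

# 2) Themed entity detection: assumed set of terms that an entity linker
#    mapped to entities related to the theme.
THEMED_ENTITY_TERMS = {"trio", "piano"}

# 3) Hybridisation parameters (illustrative values only).
KEEP_POS = {"NOUN", "VERB"}  # PoS filter keeps verbs and nouns
POS_DEMOTION = 0.1           # factor applied to all other terms
ENTITY_BOOST = 2.0           # factor applied to entity-mapped terms
THRESHOLD = 0.25             # minimum average score for a positive


def term_score(token: Token) -> float:
    """Relatedness of one term, demoted by the PoS filter and boosted by entities."""
    key = token.text.lower()
    score = RELATEDNESS.get(key, 0.0)
    if token.pos not in KEEP_POS:
        score *= POS_DEMOTION
    if key in THEMED_ENTITY_TERMS:
        score *= ENTITY_BOOST
    return score


def text_score(tokens: list[Token]) -> float:
    """Average term score over the text (0.0 for an empty text)."""
    if not tokens:
        return 0.0
    return sum(term_score(t) for t in tokens) / len(tokens)


def is_themed_evidence(tokens: list[Token]) -> bool:
    return text_score(tokens) >= THRESHOLD


positive = [Token("trio", "NOUN"), Token("piano", "NOUN"), Token("supper", "NOUN")]
negative = [Token("flags", "NOUN"), Token("harmoniously", "ADV"), Token("windows", "NOUN")]
print(is_themed_evidence(positive), is_themed_evidence(negative))
```

In this toy setting "harmoniously" is related to music but is demoted because it is neither a verb nor a noun, which mirrors the role of the PoS filter as noise correction.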
7. RECMUS-619, positive: Introduced to the
Anacreontic Society, consisting of
amateurs who perform admirably the best
orchestral works. The usual supper
followed. After propitiating me with a trio
from 'Cosi Fan Tutte', they drew me to the
piano.
http://dbpedia.org/resource/Anacreontic_Society
http://dbpedia.org/resource/Orchestra
http://dbpedia.org/resource/Trio_(music)
http://dbpedia.org/resource/Così_fan_tutte
http://dbpedia.org/resource/Piano
http://led.kmi.open.ac.uk/discovery/findler
8. MASONB-31, positive: In the
evening we went to Rev. Baptist
Noel's chapel, where one is
always sure of edification from the
sermon if not from the psalms.
http://dbpedia.org/resource/Evening_Prayer_(Anglican)
http://dbpedia.org/resource/Psalms
MASONB-88, negative: Flags and
pendants were suspended from the
windows, [...] the colours of the
German States were waving
harmoniously together, and the
banners of the Fine Arts, with
appropriate inscriptions, particularly
those of music, poetry and painting,
were especially honored, and floated
triumphant amidst the standards of
electorates, dukedoms, and
kingdoms.
http://dbpedia.org/resource/Music
9. Evaluation
Results: 87% F-measure and accuracy
Baseline methods:
• Fo: Random Forest classifier. High precision, low recall, accuracy slightly
above random. On the training/test split it reached 80% accuracy: the gold
standard is robust
• ST: Statistical baseline, using a dictionary from Gutenberg’s Music shelf
and average TF-IDF
Variants on our method:
• Em: Statistical relatedness component only (Embeddings)
• En: Themed entity detection component only (Entity). Slightly above random:
the gold standard is pessimistic / robust
• Em+F: Statistical relatedness + PoS filter (Embeddings - Filtered)
• Hy-F: No filter, only entity boost (Hybrid - Unfiltered). Without noise
correction (PoS filter), precision is generally lower; this variant shows the
impact of entity detection on recall
• Hy: best of both worlds. Substantial agreement with annotators (Cohen’s
kappa)
Our method on an alternative case study:
• Hy/R: Our hybrid approach on the Reading Experience Database (to
test portability). Core concept: book[n]; core entity: dbc:Literature.
The approach is applicable to other domains with little configuration
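For readers less familiar with the measures reported on this slide, here is a minimal sketch of how F-measure, accuracy, and Cohen's kappa are computed for binary labels. The label lists are made-up toy data, not the gold standard or the predictions from the paper, and the function names are hypothetical.

```python
def binary_metrics(gold: list[int], predicted: list[int]) -> dict[str, float]:
    """Precision, recall, F-measure (F1) and accuracy for 0/1 labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)
    tn = sum(1 for g, p in pairs if g == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


def cohen_kappa(labels_a: list[int], labels_b: list[int]) -> float:
    """Agreement between two 0/1 labelings, corrected for chance agreement."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label lists must be non-empty and of equal length")
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    pos_a = sum(labels_a) / n
    pos_b = sum(labels_b) / n
    # Probability that both labelings agree by chance (both 1 or both 0).
    expected = pos_a * pos_b + (1 - pos_a) * (1 - pos_b)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy data: 1 = themed evidence, 0 = not themed evidence.
toy_gold = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 1]
toy_scores = binary_metrics(toy_gold, toy_pred)
toy_kappa = cohen_kappa(toy_gold, toy_pred)
print(toy_scores, toy_kappa)
```

On a balanced toy set like this one, a classifier at chance level would score about 50% accuracy and a kappa close to 0, which is the sense of "slightly above random" above.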
11. “Challenging knowledge extraction to support the curation
of documentary evidence in the humanities.”
Enrico Daga and Enrico Motta
In: Third International Workshop on Capturing Scientific Knowledge (Sciknow). @K-CAP 2019
• Bet: metadata curation could be supported by Knowledge Extraction (KE)
• “Slot filling”
• Approaches in the literature vary in task / scope:
• (Named) Entity Recognition and Classification
• Entity Linking: encyclopedic (DBpedia, WikiData), domain specific (Gazetteers)
• Relation Extraction (e.g. listener of, in place)
• Event extraction (e.g. Performance)
• Semantic Role Labelling, Machine reading, …
• Assumption: the information is IN the text. Is that a valid assumption?
Paper: http://oro.open.ac.uk/67961/
12. Example #1
"I then went to Amsterdam to conduct Oedipus at the
Concertgebouw, which was celebrating its fortieth
anniversary by a series of sumptuous musical
productions. The fine Concertgebouw orchestra,
always at the same high level, the magnificent male
choruses from the Royal Apollo Society, soloists of
the first rank - among them Mme Hélène Sadoven as
Jocasta, Louis van Tulder as Oedipus, and Paul Huf,
an excellent reader - and the way in which my work
was received by the public, have left a particularly
precious memory that I recall with much enjoyment."
listener: Igor Strawinsky
time: in the beginning of 1928
place: Amsterdam
opera: Oedipus Rex
/by: Igor Strawinsky
performer: Concertgebouw orch.
environment: Public
Igor Stravinsky
An Autobiography (1936), p. 139.
https://led.kmi.open.ac.uk/entity/lexp/1435674909834
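The assumption that the information is in the text can be checked directly on this example. The sketch below tests, with a plain case-insensitive substring match, which curated slot values literally occur in the excerpt. The helper name and the dictionary layout are hypothetical; a real extraction pipeline would use entity recognition and linking rather than string matching.

```python
EXCERPT = (
    "I then went to Amsterdam to conduct Oedipus at the "
    "Concertgebouw, which was celebrating its fortieth "
    "anniversary by a series of sumptuous musical "
    "productions. The fine Concertgebouw orchestra, "
    "always at the same high level, the magnificent male "
    "choruses from the Royal Apollo Society, soloists of "
    "the first rank - among them Mme Hélène Sadoven as "
    "Jocasta, Louis van Tulder as Oedipus, and Paul Huf, "
    "an excellent reader - and the way in which my work "
    "was received by the public, have left a particularly "
    "precious memory that I recall with much enjoyment."
)

# Slot values as shown on the slide.
SLOTS = {
    "listener": "Igor Strawinsky",
    "time": "in the beginning of 1928",
    "place": "Amsterdam",
    "opera": "Oedipus Rex",
    "opera/by": "Igor Strawinsky",
    "performer": "Concertgebouw orch.",
    "environment": "Public",
}


def literal_slot_matches(slots: dict[str, str], text: str) -> dict[str, bool]:
    """For each slot, report whether its value occurs verbatim in the text."""
    lowered = text.lower()
    return {name: value.lower() in lowered for name, value in slots.items()}


found = literal_slot_matches(SLOTS, EXCERPT)
for name, present in found.items():
    print(f"{name}: {'in excerpt' if present else 'not in excerpt'}")
```

Only "place" and "environment" match literally. The listener, the date, the full title of the work, and the normalised performer name do not appear verbatim in the excerpt.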
13. Example #2
"Music is certainly a pleasure that may be
reckoned intellectual, and we shall never again
have it in the perfection it is this year, because
Mr. Handel will not compose any more!
Oratorios begin next week, to my great joy, for
they are the highest entertainment to me."
listener: Mrs Delany
time: March, 1737
place: London
opera: Operas and Oratorios
/by: G. F. Handel
environment: Public
From: Mary Granville, and Augusta Hall (ed.),
Autobiography and Correspondence of Mary
Granville, Mrs Delany: with interesting
Reminiscences of King George the Third and Queen
Charlotte, volume 1 (London, 1861), p. 594.
https://led.kmi.open.ac.uk/entity/lexp/1444424772006
14. Experiments
• Focus on Entity Recognition: Listener & Place
• Scope: 7.3% of the LED with sources available (archive.org) and including
DBpedia entities as place or agent, 690 excerpts from 26 books.
1. Find the position of the evidence text back in the original source
2. Check where the DBpedia entity (listener or place) is mentioned
• Details of the experiments are in the paper
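The two steps above can be sketched as follows, assuming the source book is already split into paragraphs and the entity is represented by a list of surface labels. This is an illustration only: the function names are hypothetical, the sample paragraphs are invented for the example (only the excerpt wording comes from Example #2), and the matching used in the paper may differ.

```python
from typing import Optional


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so line breaks do not block matches."""
    return " ".join(text.split()).lower()


def locate_excerpt(paragraphs: list[str], excerpt: str) -> Optional[int]:
    """Step 1: index of the first paragraph containing the excerpt, else None."""
    needle = normalise(excerpt)
    for index, paragraph in enumerate(paragraphs):
        if needle in normalise(paragraph):
            return index
    return None


def mention_distance(paragraphs: list[str], excerpt_index: int,
                     labels: list[str]) -> Optional[int]:
    """Step 2: distance in paragraphs to the nearest mention of the entity.

    0 means the entity is mentioned in the excerpt's own paragraph;
    None means no label occurs anywhere in the source.
    """
    wanted = [normalise(label) for label in labels]
    hits = [index for index, paragraph in enumerate(paragraphs)
            if any(label in normalise(paragraph) for label in wanted)]
    if not hits:
        return None
    return min(abs(index - excerpt_index) for index in hits)


# Invented three-paragraph "source" for illustration.
sample_source = [
    "We arrived in London in March.",
    "The weather was poor and we stayed indoors.",
    "Oratorios begin next week, to my great joy.",
]
position = locate_excerpt(sample_source, "Oratorios begin next week")
print(position)
print(mention_distance(sample_source, position, ["London"]))
print(mention_distance(sample_source, position, ["Amsterdam"]))
```

A distance of 0 means the entity is mentioned in the paragraph that holds the excerpt, a small positive distance means it is mentioned nearby, and `None` means it is never mentioned in the source.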
15. Analysis
• Q1 - in the excerpt? The place is mentioned in the excerpt in
25.9% of cases, the listener in only 13.4%.
• Q2 - near the excerpt? In only 10% of cases is the place mentioned
less than 5 paragraphs from the excerpt; for the agent, in 4% of
cases.
• Q3 - in the source? In 83.2% of cases the place is mentioned at
least once in the source. In 11.4% the place was not found.
• Q4 - in the metadata? 64.8% of the listeners are also the authors of
the text - 5874 cases in LED.
Figure: distance of entity (in number of paragraphs)
16. Lessons learnt
• Implicit information, based on inference
requiring expertise (e.g. Mr Handel is G. F.
Handel, Oedipus is “Oedipus Rex”)
• The role of contextual knowledge is key to
• (1) identifying the entities (e.g. metadata);
• (2) common-sense reasoning (“the next
year”, "in the beginning of 1928")
• Entities can exist in distributed, heterogeneous
resources (encyclopaedic KBs, domain-specific
taxonomies, gazetteers, …)
• Machine reading generates an ontology
formalising the discourse in the text, reducing the
task to one of ontology alignment (not a
simplification!)
• AI / Knowledge Extraction research is often
focused on common sense & encyclopaedic
knowledge
• Documentary evidence is heavily domain-specific
• Problem: humanities scholars coin novel
concepts, e.g. LED, READ-IT
• Sitting Experience in Portraiture History (OU
Arts History PhD)
• Polifonia / CHILD pilot: music of/for children
• Polifonia / MEETUPS pilot: encounters and
exchange of ideas
This research has partly received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 870811
The communication reflects only the author’s view and the Research Executive Agency is not responsible for any use that may be made of the information it contains