Capturing the semantics of documentary
evidence for humanities research
DBpedia Day, NLP & DBpedia
09 / 09 / 2021,
Semantics 2021
Amsterdam (& online)
Enrico Daga
The Open University
@enridaga | www.enridaga.net
Motivation
The identification and cataloguing of documentary evidence
from textual corpora is an important part of empirical research in the
humanities (e.g. historiographic methodology).
Semantic databases of documentary evidence: a recent trend
• The Listening Experience Database Project (LED) (over 10,000 unique
experiences) - https://led.kmi.open.ac.uk/ (2 UK AHRC 2012-2019)
• READ-IT: Reading Europe Advanced Data Investigation Tool -
https://readit-project.eu/ (2018-2020)
• Polifonia: Knowledge Graph of Musical Cultural Heritage, with pilots
focusing on scholars in the musical heritage domain -
http://polifonia-project.eu (2021-2023)
Two problems:
• Identification -> find evidence in texts
• Cataloguing -> curate a database of evidence
Identification
Identifying pieces of evidence in books is manual work, which
may include relying on free-text search tools (e.g. PDF viewers)
Problems: the activity (a) requires effort / time, (b) is not systematic, (c) is
prone to errors, and (d) the methodology is (often) not documented
"Capturing themed evidence, a hybrid approach."
Enrico Daga and Enrico Motta
In Proceedings of the 10th International Conference on Knowledge Capture, pp. 93-100. 2019.
• Focus on Identification
• We coin the expression themed evidence, to refer to (direct or indirect)
traces of a fact or situation relevant to a theme of interest and study the
problem of identifying them in texts.
• The task of identifying themed evidence is at the intersection between
topical text classification (finding texts relevant to a certain theme) and
event retrieval (finding events mentioned in texts).
• Not all topical texts are themed evidence and the nature of the event itself
is often assumed, implicit, and left to the reader
Paper: http://oro.open.ac.uk/67961/
Finding Listening Experiences (theme: music)
• RECMUS-619, positive: Introduced to the Anacreontic Society, consisting of
amateurs who perform admirably the best orchestral works. The usual supper
followed. After propitiating me with a trio from ’Cosi Fan Tutte’, they drew me to
the piano.
• MASONB-31, positive: In the evening we went to Rev. Baptist Noel’s chapel,
where one is always sure of edification from the sermon if not from the psalms.
• MASONB-88, negative: Flags and pendants were suspended from the
windows, [. . . ] the colors of the German States were waving harmoniously
together, and the banners of the Fine Arts, with appropriate inscriptions,
particularly those of music, poetry and painting, were especially honored, and
floated triumphant amidst the standards of electorates, dukedoms, and
kingdoms.
Daga, Enrico, and Enrico Motta. "Capturing themed evidence, a hybrid approach."
In Proceedings of the 10th International Conference on Knowledge Capture, pp. 93-100. 2019.
Entity boost: promote terms mapped to entities
PoS filter: demote terms other than verbs and
nouns, to privilege factual statements
1) Statistical Relatedness Analysis
2) Themed entity detection
3) Hybridisation
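How the three components might combine can be pictured with a minimal sketch. Everything in it is an illustrative assumption rather than the method of the paper: the relatedness values, the boost and penalty factors, the averaging, and the names `score_text`, `RELATEDNESS`, `ENTITY_TERMS` and `CONTENT_TAGS` are hypothetical, and PoS tags are supplied by hand instead of coming from a tagger.

```python
# Hypothetical relatedness of terms to the theme "music"; in practice such
# scores might come from word embeddings (the values here are made up).
RELATEDNESS = {
    "perform": 0.6, "orchestral": 0.9, "supper": 0.1, "trio": 0.7,
    "piano": 0.95, "flags": 0.05, "harmoniously": 0.5, "music": 1.0,
}

# Terms assumed to have been linked to theme-related DBpedia entities.
ENTITY_TERMS = {"orchestral", "trio", "piano", "music"}

# PoS tags treated as carrying factual content.
CONTENT_TAGS = {"NOUN", "VERB"}


def score_text(tagged_terms, entity_boost=1.5, pos_penalty=0.5):
    """Average term relatedness after PoS demotion and entity boost."""
    if not tagged_terms:
        return 0.0
    total = 0.0
    for term, pos in tagged_terms:
        score = RELATEDNESS.get(term, 0.0)  # unknown terms contribute nothing
        if pos not in CONTENT_TAGS:
            score *= pos_penalty   # PoS filter: demote terms other than nouns/verbs
        if term in ENTITY_TERMS:
            score *= entity_boost  # entity boost: promote entity-linked terms
        total += score
    return total / len(tagged_terms)


# Hand-tagged terms loosely based on RECMUS-619 (positive) and MASONB-88 (negative).
positive = [("perform", "VERB"), ("orchestral", "ADJ"), ("supper", "NOUN"),
            ("trio", "NOUN"), ("piano", "NOUN")]
negative = [("flags", "NOUN"), ("harmoniously", "ADV"), ("banners", "NOUN"),
            ("music", "NOUN"), ("kingdoms", "NOUN")]

pos_score = score_text(positive)
neg_score = score_text(negative)
print(f"positive={pos_score:.2f} negative={neg_score:.2f}")
```

In this toy setting the excerpt with several entity-linked nouns scores higher than the one where music appears only in passing; a real system would still need a decision threshold, which this sketch leaves out.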
RECMUS-619, positive: Introduced to the
Anacreontic Society, consisting of
amateurs who perform admirably the best
orchestral works. The usual supper
followed. After propitiating me with a trio
from 'Cosi Fan Tutte', they drew me to the
piano.
http://dbpedia.org/resource/Anacreontic_Society
http://dbpedia.org/resource/Orchestra
http://dbpedia.org/resource/Trio_(music)
http://dbpedia.org/resource/Così_fan_tutte
http://dbpedia.org/resource/Piano
http://led.kmi.open.ac.uk/discovery/findler
MASONB-31, positive: In the
evening we went to Rev. Baptist
Noel's chapel, where one is
always sure of edification from the
sermon if not from the psalms.
http://dbpedia.org/resource/Evening_Prayer_(Anglican)
http://dbpedia.org/resource/Psalms
MASONB-88, negative: Flags and
pendants were suspended from the
windows, [...] the colours of the
German States were waving
harmoniously together, and the
banners of the Fine Arts, with
appropriate inscriptions, particularly
those of music, poetry and painting,
were especially honored, and floated
triumphant amidst the standards of
electorates, dukedoms, and
kingdoms.
http://dbpedia.org/resource/Music
Evaluation
The results are very good: 87% F-measure and accuracy
Baseline methods:
• Fo: Random Forest classifier. High precision, low recall, accuracy slightly
above random (on training/test it reached 80% accuracy: the gold standard is robust)
• ST: Statistical. A dictionary built from Gutenberg’s Music shelf, scored by average TF/IDF
Variants of our method:
• Em: Statistical relatedness component only (Embeddings)
• En: Themed entity detection component only (Entity). Slightly above random:
the gold standard is pessimistic / robust
• Em+F: Statistical relatedness + PoS filter (Embeddings - Filtered)
• Hy-F: No filter, only entity boost (Hybrid - Unfiltered). Without
noise correction (PoS filter), precision is generally lower; this variant shows
the impact of entity detection on recall
• Hy: best of both worlds. Substantial agreement with annotators (Cohen’s
kappa)
Our method on an alternative case study:
• Hy/R: Our Hybrid approach on the Reading Experience Database (to
test portability). Core concept: book[n]; core entity: dbc:Literature.
The approach is applicable to other domains with little configuration
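For readers less familiar with the measures reported above, the following minimal sketch computes precision, recall, F-measure, accuracy and Cohen's kappa for binary labels. The function name `evaluate` and the toy label lists are hypothetical and not taken from the gold standard; kappa is computed here between gold and predicted labels, whereas the slide reports agreement with annotators.

```python
def evaluate(gold, pred):
    """Return precision, recall, F-measure, accuracy and Cohen's kappa
    for two equally long sequences of binary labels (1 = themed evidence)."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    n = len(gold)
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    accuracy = (tp + tn) / n

    # Cohen's kappa: observed agreement corrected for chance agreement.
    chance_yes = ((tp + fn) / n) * ((tp + fp) / n)
    chance_no = ((tn + fp) / n) * ((tn + fn) / n)
    expected = chance_yes + chance_no
    # Kappa is undefined when chance agreement is 1; this sketch reports 0.0.
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 0.0

    return {"precision": precision, "recall": recall, "f_measure": f_measure,
            "accuracy": accuracy, "kappa": kappa}


# Made-up labels, for illustration only.
gold = [1, 1, 1, 1, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = evaluate(gold, pred)
print(metrics)
```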
Cataloguing
“Challenging knowledge extraction to support the curation
of documentary evidence in the humanities.”
Enrico Daga and Enrico Motta
In: Third International Workshop on Capturing Scientific Knowledge (Sciknow). @K-CAP 2019
• Bet: metadata curation could be supported by Knowledge Extraction (KE)
• “Slot filling”
• Approaches in the literature vary in task / scope:
• (Named) Entity Recognition and Classification
• Entity Linking: encyclopedic (DBpedia, WikiData), domain specific (Gazetteers)
• Relation Extraction (e.g. listener of, in place)
• Event extraction (e.g. Performance)
• Semantic Role Labelling, Machine reading, …
• Assumption: the information is IN the text. Is that a valid assumption?
Paper: http://oro.open.ac.uk/67961/
Example #1
"I then went to Amsterdam to conduct Oedipus at the
Concertgebouw, which was celebrating its fortieth
anniversary by a series of sumptuous musical
productions. The fine Concertgebouw orchestra,
always at the same high level, the magnificent male
choruses from the Royal Apollo Society, soloists of
the first rank - among them Mme Hélène Sadoven as
Jocasta, Louis van Tulder as Oedipus, and Paul Huf,
an excellent reader - and the way in which my work
was received by the public, have left a particularly
precious memory that I recall with much enjoyment."
listener: Igor Strawinsky
time: in the beginning of 1928
place: Amsterdam
opera: Oedipus Rex
/by: Igor Strawinsky
performer: Concertgebouw orch.
environment: Public
Igor Stravinsky
An Autobiography (1936), p. 139.
https://led.kmi.open.ac.uk/entity/lexp/1435674909834
Daga, Enrico and Motta, Enrico (2019). Challenging knowledge extraction to support the curation of documentary evidence in the humanities.
In: Third International Workshop on Capturing Scientific Knowledge (Sciknow). Collocated with the K-CAP conference.
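The "is the information IN the text?" question can be made concrete for Example #1 with a minimal sketch: treat the curated record as a set of slots and test which values occur verbatim in the excerpt. The class `ListeningExperience`, its field names and the helper `slots_in_text` are hypothetical, the excerpt is abridged with "[...]", and plain case-insensitive substring matching is an assumption (a real pipeline would need entity linking).

```python
from dataclasses import dataclass, asdict


@dataclass
class ListeningExperience:
    """Slots of a curated record (field names are illustrative)."""
    listener: str
    time: str
    place: str
    opera: str
    opera_by: str
    performer: str
    environment: str


def slots_in_text(record, excerpt):
    """Map each slot to True if its value occurs verbatim (ignoring case)."""
    text = excerpt.lower()
    return {slot: value.lower() in text for slot, value in asdict(record).items()}


EXCERPT = (
    "I then went to Amsterdam to conduct Oedipus at the Concertgebouw, "
    "which was celebrating its fortieth anniversary by a series of "
    "sumptuous musical productions. The fine Concertgebouw orchestra, "
    "always at the same high level, [...] and the way in which my work "
    "was received by the public, have left a particularly precious "
    "memory that I recall with much enjoyment."
)

record = ListeningExperience(
    listener="Igor Strawinsky",
    time="in the beginning of 1928",
    place="Amsterdam",
    opera="Oedipus Rex",
    opera_by="Igor Strawinsky",
    performer="Concertgebouw orch.",
    environment="Public",
)

found = slots_in_text(record, EXCERPT)
print([slot for slot, present in found.items() if present])
```

Under these assumptions only the place and the environment are matched; the listener, the date and the full title of the work are not in the excerpt at all, and the performer only under a different surface form.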
Example #2
"Music is certainly a pleasure that may be
reckoned intellectual, and we shall never again
have it in the perfection it is this year, because
Mr. Handel will not compose any more!
Oratorios begin next week, to my great joy, for
they are the highest entertainment to me."
listener: Mrs Delany
time: March, 1737
place: London
opera: Operas and Oratorios
/by: G. F. Handel
environment: Public
From: Mary Granville, and Augusta Hall (ed.),
Autobiography and Correspondence of Mary
Granville, Mrs Delany: with interesting
Reminiscences of King George the Third and Queen
Charlotte, volume 1 (London, 1861), p. 594.
https://led.kmi.open.ac.uk/entity/lexp/1444424772006
Experiments
• Focus on Entity Recognition: Listener & Place
• Scope: 7.3% of the LED with sources available (archive.org) and including
DBpedia entities as place or agent, 690 excerpts from 26 books.
1. Find the position of the evidence text back in the original source
2. Check where the DBpedia entity (listener or place) is mentioned
• Details of the experiments are in the paper
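The two steps can be sketched as follows, under loud assumptions: paragraphs are separated by blank lines, the excerpt and the entity mention are found by exact substring matching (real OCR text from archive.org would likely need fuzzy matching), and the function name `paragraph_distance` and the toy source text are hypothetical. The paper describes the actual procedure.

```python
def paragraph_distance(source, excerpt, mention):
    """Distance, in paragraphs, between the paragraph holding the excerpt
    and the nearest paragraph mentioning the entity.
    Returns None if the excerpt or the mention cannot be found."""
    paragraphs = [p.strip() for p in source.split("\n\n") if p.strip()]

    # Step 1: find the position of the evidence text in the source.
    excerpt_idx = next(
        (i for i, p in enumerate(paragraphs) if excerpt in p), None)
    if excerpt_idx is None:
        return None

    # Step 2: check where the entity (listener or place) is mentioned.
    mention_idxs = [i for i, p in enumerate(paragraphs) if mention in p]
    if not mention_idxs:
        return None
    return min(abs(i - excerpt_idx) for i in mention_idxs)


# Made-up source text, for illustration only.
SOURCE = "\n\n".join([
    "We arrived in London in the spring.",
    "The weather was poor for a week.",
    "In the evening we heard a fine oratorio.",
])

print(paragraph_distance(SOURCE, "we heard a fine oratorio", "London"))
```

A distance of 0 means the entity is mentioned in the same paragraph as the excerpt; `None` covers both an excerpt that cannot be located and an entity that is never mentioned.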
Analysis
• Q1 - in the excerpt? The place is mentioned in the excerpt in
25.9% of cases, the listener in only 13.4%.
• Q2 - near the excerpt? In only 10% of cases is the place mentioned
less than 5 paragraphs from the excerpt; the agent, in 4% of
cases.
• Q3 - in the source? In 83.2% of cases the place is mentioned at
least once in the source. In 11.4% the place was not found.
• Q4 - in the metadata? 64.8% of the listeners are also the authors of
the text - 5874 cases in LED.
Figure: distance of the entity mention from the excerpt (in number of paragraphs)
Lessons learnt
• Implicit information, based on inference requiring expertise
(e.g. Mr Handel is G. F. Handel, Oedipus is “Oedipus Rex”)
• Contextual knowledge is key to
• (1) identifying the entities (e.g. metadata);
• (2) common sense reasoning (“the next year”, “in the beginning of 1928”)
• Entities can exist in distributed, heterogeneous resources
(encyclopaedic KBs, domain-specific taxonomies, gazetteers, …)
• Machine reading generates an ontology formalising the discourse in the
text, reducing the task to one of ontology alignment (not a simplification!)
• AI / Knowledge Extraction research is often focused on common sense &
encyclopaedic knowledge
• Documentary evidence is heavily domain-specific
• Problem: humanities scholars coin novel concepts, e.g. LED, READ-IT
• Sitting Experience in Portraiture History (OU Arts History PhD)
• Polifonia / CHILD pilot: music of/for children
• Polifonia / MEETUPS pilot: encounters and exchange of ideas
This research has partly received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 870811
The communication reflects only the author’s view and the Research Executive Agency is not responsible for any use that may be made of the information it contains
Thank you
Questions?
@enridaga | www.enridaga.net

More Related Content

Similar to Capturing the semantics of documentary evidence for humanities research

Experience planet earth
Experience planet earthExperience planet earth
Experience planet earth
Simon Schneider
 
HERA - Creativity and Craft Production in Middle and Late Bronze Age Europe (...
HERA - Creativity and Craft Production in Middle and Late Bronze Age Europe (...HERA - Creativity and Craft Production in Middle and Late Bronze Age Europe (...
HERA - Creativity and Craft Production in Middle and Late Bronze Age Europe (...
Arts and Humanities Research Council (AHRC)
 
Trier
TrierTrier
Medieval Studies: Some Hopes and Fears for the Future
Medieval Studies: Some Hopes and Fears for the FutureMedieval Studies: Some Hopes and Fears for the Future
Medieval Studies: Some Hopes and Fears for the Future
Andrew Prescott
 
The Future of Medieval Studies: Hopes and Fears
The Future of Medieval Studies: Hopes and FearsThe Future of Medieval Studies: Hopes and Fears
The Future of Medieval Studies: Hopes and Fears
Andrew Prescott
 
Cultural Heritage and the Technology of Culture: Finding the Nature of Illumi...
Cultural Heritage and the Technology of Culture: Finding the Nature of Illumi...Cultural Heritage and the Technology of Culture: Finding the Nature of Illumi...
Cultural Heritage and the Technology of Culture: Finding the Nature of Illumi...
Martin Kalfatovic
 
Europeana Aggregators' Fair day 1
Europeana Aggregators' Fair day 1Europeana Aggregators' Fair day 1
Europeana Aggregators' Fair day 1
Europeana
 
Gayle levy, 9.5.13
Gayle levy, 9.5.13Gayle levy, 9.5.13
Gayle levy, 9.5.13
sarl2007
 
HTAV Calligraphy presentation
HTAV Calligraphy presentationHTAV Calligraphy presentation
HTAV Calligraphy presentation
SLV Education
 
Cultural heritage: Tradition, Museums and Wikis
Cultural heritage: Tradition, Museums and WikisCultural heritage: Tradition, Museums and Wikis
Cultural heritage: Tradition, Museums and Wikis
Thomas Tunsch
 
Mediating Media Art. Digital Visual Archives as Mediation-Tools
Mediating Media Art. Digital Visual Archives as Mediation-ToolsMediating Media Art. Digital Visual Archives as Mediation-Tools
Mediating Media Art. Digital Visual Archives as Mediation-Tools
fwiencek
 
KuneraPeregrinations
KuneraPeregrinationsKuneraPeregrinations
KuneraPeregrinations
Hanneke van Asperen
 
Integrated History Unit: How can Friendships and Dance shape History?
Integrated History Unit: How can Friendships and Dance shape History?Integrated History Unit: How can Friendships and Dance shape History?
Integrated History Unit: How can Friendships and Dance shape History?
MahriAutumn
 
Rimini 16 5 2008
Rimini 16 5 2008Rimini 16 5 2008
Rimini 16 5 2008
Stuart Dunn
 
Celebrations of the International Mother Language Day 2024.
Celebrations of the International Mother Language Day 2024.Celebrations of the International Mother Language Day 2024.
Celebrations of the International Mother Language Day 2024.
Christina Parmionova
 
Call for papers, project on the "Continuous Page: Scrolls and Scrolling from ...
Call for papers, project on the "Continuous Page: Scrolls and Scrolling from ...Call for papers, project on the "Continuous Page: Scrolls and Scrolling from ...
Call for papers, project on the "Continuous Page: Scrolls and Scrolling from ...
Encyclopaedia Iranica
 
Mobile Technology and the Museum
Mobile Technology and the MuseumMobile Technology and the Museum
Mobile Technology and the Museum
Dorota Kawęcka
 
Rescue Archival Documents
Rescue Archival DocumentsRescue Archival Documents
Musical Meetups Knowledge Graph (MMKG): a collection of evidence for historic...
Musical Meetups Knowledge Graph (MMKG): a collection of evidence for historic...Musical Meetups Knowledge Graph (MMKG): a collection of evidence for historic...
Musical Meetups Knowledge Graph (MMKG): a collection of evidence for historic...
Alba Morales
 
SDoyle Presentation
SDoyle PresentationSDoyle Presentation
SDoyle Presentation
Siobhán Doyle
 

Similar to Capturing the semantics of documentary evidence for humanities research (20)

Experience planet earth
Experience planet earthExperience planet earth
Experience planet earth
 
HERA - Creativity and Craft Production in Middle and Late Bronze Age Europe (...
HERA - Creativity and Craft Production in Middle and Late Bronze Age Europe (...HERA - Creativity and Craft Production in Middle and Late Bronze Age Europe (...
HERA - Creativity and Craft Production in Middle and Late Bronze Age Europe (...
 
Trier
TrierTrier
Trier
 
Medieval Studies: Some Hopes and Fears for the Future
Medieval Studies: Some Hopes and Fears for the FutureMedieval Studies: Some Hopes and Fears for the Future
Medieval Studies: Some Hopes and Fears for the Future
 
The Future of Medieval Studies: Hopes and Fears
The Future of Medieval Studies: Hopes and FearsThe Future of Medieval Studies: Hopes and Fears
The Future of Medieval Studies: Hopes and Fears
 
Cultural Heritage and the Technology of Culture: Finding the Nature of Illumi...
Cultural Heritage and the Technology of Culture: Finding the Nature of Illumi...Cultural Heritage and the Technology of Culture: Finding the Nature of Illumi...
Cultural Heritage and the Technology of Culture: Finding the Nature of Illumi...
 
Europeana Aggregators' Fair day 1
Europeana Aggregators' Fair day 1Europeana Aggregators' Fair day 1
Europeana Aggregators' Fair day 1
 
Gayle levy, 9.5.13
Gayle levy, 9.5.13Gayle levy, 9.5.13
Gayle levy, 9.5.13
 
HTAV Calligraphy presentation
HTAV Calligraphy presentationHTAV Calligraphy presentation
HTAV Calligraphy presentation
 
Cultural heritage: Tradition, Museums and Wikis
Cultural heritage: Tradition, Museums and WikisCultural heritage: Tradition, Museums and Wikis
Cultural heritage: Tradition, Museums and Wikis
 
Mediating Media Art. Digital Visual Archives as Mediation-Tools
Mediating Media Art. Digital Visual Archives as Mediation-ToolsMediating Media Art. Digital Visual Archives as Mediation-Tools
Mediating Media Art. Digital Visual Archives as Mediation-Tools
 
KuneraPeregrinations
KuneraPeregrinationsKuneraPeregrinations
KuneraPeregrinations
 
Integrated History Unit: How can Friendships and Dance shape History?
Integrated History Unit: How can Friendships and Dance shape History?Integrated History Unit: How can Friendships and Dance shape History?
Integrated History Unit: How can Friendships and Dance shape History?
 
Rimini 16 5 2008
Rimini 16 5 2008Rimini 16 5 2008
Rimini 16 5 2008
 
Celebrations of the International Mother Language Day 2024.
Celebrations of the International Mother Language Day 2024.Celebrations of the International Mother Language Day 2024.
Celebrations of the International Mother Language Day 2024.
 
Call for papers, project on the "Continuous Page: Scrolls and Scrolling from ...
Call for papers, project on the "Continuous Page: Scrolls and Scrolling from ...Call for papers, project on the "Continuous Page: Scrolls and Scrolling from ...
Call for papers, project on the "Continuous Page: Scrolls and Scrolling from ...
 
Mobile Technology and the Museum
Mobile Technology and the MuseumMobile Technology and the Museum
Mobile Technology and the Museum
 
Rescue Archival Documents
Rescue Archival DocumentsRescue Archival Documents
Rescue Archival Documents
 
Musical Meetups Knowledge Graph (MMKG): a collection of evidence for historic...
Musical Meetups Knowledge Graph (MMKG): a collection of evidence for historic...Musical Meetups Knowledge Graph (MMKG): a collection of evidence for historic...
Musical Meetups Knowledge Graph (MMKG): a collection of evidence for historic...
 
SDoyle Presentation
SDoyle PresentationSDoyle Presentation
SDoyle Presentation
 

More from Enrico Daga

Citizen Experiences in Cultural Heritage Archives: a Data Journey
Citizen Experiences in Cultural Heritage Archives: a Data JourneyCitizen Experiences in Cultural Heritage Archives: a Data Journey
Citizen Experiences in Cultural Heritage Archives: a Data Journey
Enrico Daga
 
Streamlining Knowledge Graph Construction with a façade: the SPARQL Anything...
Streamlining Knowledge Graph Construction with a façade:  the SPARQL Anything...Streamlining Knowledge Graph Construction with a façade:  the SPARQL Anything...
Streamlining Knowledge Graph Construction with a façade: the SPARQL Anything...
Enrico Daga
 
Data integration with a façade. The case of knowledge graph construction.
Data integration with a façade. The case of knowledge graph construction.Data integration with a façade. The case of knowledge graph construction.
Data integration with a façade. The case of knowledge graph construction.
Enrico Daga
 
Knowledge graph construction with a façade - The SPARQL Anything Project
Knowledge graph construction with a façade - The SPARQL Anything ProjectKnowledge graph construction with a façade - The SPARQL Anything Project
Knowledge graph construction with a façade - The SPARQL Anything Project
Enrico Daga
 
Trying SPARQL Anything with MEI
Trying SPARQL Anything with MEITrying SPARQL Anything with MEI
Trying SPARQL Anything with MEI
Enrico Daga
 
The SPARQL Anything project
The SPARQL Anything projectThe SPARQL Anything project
The SPARQL Anything project
Enrico Daga
 
Towards a Smart (City) Data Science. A case-based retrospective on policies, ...
Towards a Smart (City) Data Science. A case-based retrospective on policies, ...Towards a Smart (City) Data Science. A case-based retrospective on policies, ...
Towards a Smart (City) Data Science. A case-based retrospective on policies, ...
Enrico Daga
 
Ld4 dh tutorial
Ld4 dh tutorialLd4 dh tutorial
Ld4 dh tutorial
Enrico Daga
 
OU RSE Tutorial Big Data Cluster
OU RSE Tutorial Big Data ClusterOU RSE Tutorial Big Data Cluster
OU RSE Tutorial Big Data Cluster
Enrico Daga
 
CityLABS Workshop: Working with large tables
CityLABS Workshop: Working with large tablesCityLABS Workshop: Working with large tables
CityLABS Workshop: Working with large tables
Enrico Daga
 
Propagating Data Policies - A User Study
Propagating Data Policies - A User StudyPropagating Data Policies - A User Study
Propagating Data Policies - A User Study
Enrico Daga
 
Linked Data at the OU - the story so far
Linked Data at the OU - the story so farLinked Data at the OU - the story so far
Linked Data at the OU - the story so far
Enrico Daga
 
Propagation of Policies in Rich Data Flows
Propagation of Policies in Rich Data FlowsPropagation of Policies in Rich Data Flows
Propagation of Policies in Rich Data Flows
Enrico Daga
 
A bottom up approach for licences classification and selection
A bottom up approach for licences classification and selectionA bottom up approach for licences classification and selection
A bottom up approach for licences classification and selection
Enrico Daga
 
A BASILar Approach for Building Web APIs on top of SPARQL Endpoints
A BASILar Approach for Building Web APIs on top of SPARQL EndpointsA BASILar Approach for Building Web APIs on top of SPARQL Endpoints
A BASILar Approach for Building Web APIs on top of SPARQL Endpoints
Enrico Daga
 
Early Analysis and Debuggin of Linked Open Data Cubes
Early Analysis and Debuggin of Linked Open Data CubesEarly Analysis and Debuggin of Linked Open Data Cubes
Early Analysis and Debuggin of Linked Open Data Cubes
Enrico Daga
 

More from Enrico Daga (16)

Citizen Experiences in Cultural Heritage Archives: a Data Journey
Citizen Experiences in Cultural Heritage Archives: a Data JourneyCitizen Experiences in Cultural Heritage Archives: a Data Journey
Citizen Experiences in Cultural Heritage Archives: a Data Journey
 
Streamlining Knowledge Graph Construction with a façade: the SPARQL Anything...
Streamlining Knowledge Graph Construction with a façade:  the SPARQL Anything...Streamlining Knowledge Graph Construction with a façade:  the SPARQL Anything...
Streamlining Knowledge Graph Construction with a façade: the SPARQL Anything...
 
Data integration with a façade. The case of knowledge graph construction.
Data integration with a façade. The case of knowledge graph construction.Data integration with a façade. The case of knowledge graph construction.
Data integration with a façade. The case of knowledge graph construction.
 
Knowledge graph construction with a façade - The SPARQL Anything Project
Knowledge graph construction with a façade - The SPARQL Anything ProjectKnowledge graph construction with a façade - The SPARQL Anything Project
Knowledge graph construction with a façade - The SPARQL Anything Project
 
Trying SPARQL Anything with MEI
Trying SPARQL Anything with MEITrying SPARQL Anything with MEI
Trying SPARQL Anything with MEI
 
The SPARQL Anything project
The SPARQL Anything projectThe SPARQL Anything project
The SPARQL Anything project
 
Towards a Smart (City) Data Science. A case-based retrospective on policies, ...
Towards a Smart (City) Data Science. A case-based retrospective on policies, ...Towards a Smart (City) Data Science. A case-based retrospective on policies, ...
Towards a Smart (City) Data Science. A case-based retrospective on policies, ...
 
Ld4 dh tutorial
Ld4 dh tutorialLd4 dh tutorial
Ld4 dh tutorial
 
OU RSE Tutorial Big Data Cluster
OU RSE Tutorial Big Data ClusterOU RSE Tutorial Big Data Cluster
OU RSE Tutorial Big Data Cluster
 
CityLABS Workshop: Working with large tables
CityLABS Workshop: Working with large tablesCityLABS Workshop: Working with large tables
CityLABS Workshop: Working with large tables
 
Propagating Data Policies - A User Study
Propagating Data Policies - A User StudyPropagating Data Policies - A User Study
Propagating Data Policies - A User Study
 
Linked Data at the OU - the story so far
Linked Data at the OU - the story so farLinked Data at the OU - the story so far
Linked Data at the OU - the story so far
 
Propagation of Policies in Rich Data Flows
Propagation of Policies in Rich Data FlowsPropagation of Policies in Rich Data Flows
Propagation of Policies in Rich Data Flows
 
A bottom up approach for licences classification and selection
A bottom up approach for licences classification and selectionA bottom up approach for licences classification and selection
A bottom up approach for licences classification and selection
 
A BASILar Approach for Building Web APIs on top of SPARQL Endpoints
A BASILar Approach for Building Web APIs on top of SPARQL EndpointsA BASILar Approach for Building Web APIs on top of SPARQL Endpoints
A BASILar Approach for Building Web APIs on top of SPARQL Endpoints
 
Early Analysis and Debuggin of Linked Open Data Cubes
Early Analysis and Debuggin of Linked Open Data CubesEarly Analysis and Debuggin of Linked Open Data Cubes
Early Analysis and Debuggin of Linked Open Data Cubes
 

Recently uploaded

Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
Timothy Spann
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
nyfuhyz
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
g4dpvqap0
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
jitskeb
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Kiwi Creative
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Aggregage
 
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
74nqk8xf
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
Social Samosa
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
soxrziqu
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
AlessioFois2
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
Sm321
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 

Capturing the semantics of documentary evidence for humanities research

  • 1. Capturing the semantics of documentary evidence for humanities research DBpedia Day, NLP & DBpedia 09 / 09 / 2021, Semantics 2021 Amsterdam (& online) Enrico Daga The Open University @enridaga | www.enridaga.net
  • 2. Motivation The identification and cataloguing of documentary evidence from textual corpora is an important part of empirical research in the humanities (e.g. historiographic methodology). Semantic databases of documentary evidence: a recent trend • The Listening Experience Database Project (LED) (over 10.000 unique experiences) - https://led.kmi.open.ac.uk/ (2 UK AHRC 2012-2019) • READ-IT: Reading Europe Advanced Data Investigation Tool - https://readit-project.eu/ (2018-2020) • Polifonia: Knowledge Graph of Musical Cultural Heritage, with pilots focusing on scholars in the musical heritage domain - http://polifonia-project.eu (2021-2023) Two problems: • Identification -> find evidence in texts • Cataloguing -> curate a database of evidence
  • 3. Identification The task of identifying pieces of evidence in books is manual work, which may include relying on free-text search tools (e.g. PDF viewers). Problems: the activity (a) requires effort / time, (b) is not systematic, (c) is prone to errors, and (d) the methodology is (often) not documented
  • 4. "Capturing themed evidence, a hybrid approach." Enrico Daga and Enrico Motta In Proceedings of the 10th International Conference on Knowledge Capture, pp. 93-100. 2019. • Focus on Identification • We coin the expression themed evidence, to refer to (direct or indirect) traces of a fact or situation relevant to a theme of interest and study the problem of identifying them in texts. • The task of identifying themed evidence is at the intersection between topical text classification (finding texts relevant to a certain theme) and event retrieval (find events mentioned in texts). • Not all topical texts are themed evidence and the nature of the event itself is often assumed, implicit, and left to the reader Paper: http://oro.open.ac.uk/67961/
  • 5. Finding Listening Experiences (theme: music) • RECMUS-619, positive: Introduced to the Anacreontic Society, consisting of amateurs who perform admirably the best orchestral works. The usual supper followed. After propitiating me with a trio from ’Cosi Fan Tutte’, they drew me to the piano. • MASONB-31, positive: In the evening we went to Rev. Baptist Noel’s chapel, where one is always sure of edification from the sermon if not from the psalms. • MASONB-88, negative: Flags and pendants were suspended from the windows, [. . . ] the colors of the German States were waving harmoniously together, and the banners of the Fine Arts, with appropriate inscriptions, particularly those of music, poetry and painting, were especially honored, and floated triumphant amidst the standards of electorates, dukedoms, and kingdoms. Daga, Enrico, and Enrico Motta. "Capturing themed evidence, a hybrid approach." In Proceedings of the 10th International Conference on Knowledge Capture, pp. 93-100. 2019.
  • 6. 1) Statistical Relatedness Analysis 2) Themed entity detection 3) Hybridisation • Entity boost: promote terms mapped to entities • PoS filter: demote terms other than verbs and nouns, to privilege factual statements
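The hybridisation on slide 6 can be illustrated with a minimal sketch. It is not the implementation from the paper: the `Term` structure, the weights `ENTITY_BOOST` and `POS_DEMOTION`, the averaging, and the 0.5 threshold are all assumptions made for illustration. It only shows how a per-term relatedness score could be combined with an entity boost and a PoS filter.

```python
from dataclasses import dataclass


@dataclass
class Term:
    lemma: str
    pos: str                        # coarse part-of-speech tag, e.g. "NOUN", "VERB", "ADJ"
    relatedness: float              # statistical relatedness to the theme, assumed in [0, 1]
    is_themed_entity: bool = False  # True if linked to an entity related to the theme


# Hypothetical weights, chosen only for illustration.
ENTITY_BOOST = 1.0   # themed entities are promoted to the maximum score
POS_DEMOTION = 0.0   # terms that are neither nouns nor verbs contribute nothing


def term_score(term, use_pos_filter=True, use_entity_boost=True):
    """Score one term: relatedness, demoted by the PoS filter, promoted by the entity boost."""
    score = term.relatedness
    if use_pos_filter and term.pos not in {"NOUN", "VERB"}:
        score = POS_DEMOTION
    if use_entity_boost and term.is_themed_entity:
        score = max(score, ENTITY_BOOST)
    return score


def text_score(terms, **options):
    """Average term score of a text; an empty text scores 0."""
    if not terms:
        return 0.0
    return sum(term_score(t, **options) for t in terms) / len(terms)


def is_themed_evidence(terms, threshold=0.5, **options):
    """Classify a text as themed evidence if its score reaches the (assumed) threshold."""
    return text_score(terms, **options) >= threshold


example = [
    Term("piano", "NOUN", 0.9, is_themed_entity=True),
    Term("usual", "ADJ", 0.4),
    Term("perform", "VERB", 0.7),
]
hybrid = text_score(example)
embeddings_only = text_score(example, use_pos_filter=False, use_entity_boost=False)
```

Switching the two flags off and on loosely mirrors the variants compared in the evaluation (embeddings only, embeddings with filter, hybrid without filter, full hybrid).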
  • 7. RECMUS-619, positive: Introduced to the Anacreontic Society, consisting of amateurs who perform admirably the best orchestral works. The usual supper followed. After propitiating me with a trio from 'Cosi Fan Tutte', they drew me to the piano. http://dbpedia.org/resource/Anacreontic_Society http://dbpedia.org/resource/Orchestra http://dbpedia.org/resource/Trio_(music) http://dbpedia.org/resource/Così_fan_tutte http://dbpedia.org/resource/Piano http://led.kmi.open.ac.uk/discovery/findler
  • 8. MASONB-31, positive: In the evening we went to Rev. Baptist Noel's chapel, where one is always sure of edification from the sermon if not from the psalms. http://dbpedia.org/resource/Evening_Prayer_(Anglican) http://dbpedia.org/resource/Psalms MASONB-88, negative: Flags and pendants were suspended from the windows, [...] the colours of the German States were waving harmoniously together, and the banners of the Fine Arts, with appropriate inscriptions, particularly those of music, poetry and painting, were especially honored, and floated triumphant amidst the standards of electorates, dukedoms, and kingdoms. http://dbpedia.org/resource/Music
  • 9. Evaluation The results are very good: 87% F-Measure & Accuracy Baseline methods: • Fo: Random Forest Classifier: high precision, low recall, accuracy slightly above random (on training/test, it reached 80% accuracy: robust gold standard) • ST: Statistical: a dictionary from Gutenberg’s Music shelf, average TF/IDF Variants on our method: • Em: Statistical relatedness component only (Embeddings) • En: Themed entity detection component (Entity): slightly above random; the gold standard is pessimistic / robust • Em+F: Statistical relatedness + PoS Filter (Embeddings - Filtered) • Hy-F: No filter, only entity boost (Hybrid - Unfiltered). Without applying noise correction (PoS filter), precision is generally lower; shows the impact of entity detection on recall • Hy: best of both worlds. Substantial agreement with annotators (Cohen’s K) Our method on an alternative case study: • Hy/R: Our Hybrid approach on the Reading Experience Database (to test portability). Core concept: book[n] and core entity: dbc:Literature. The approach is applicable to other domains with little configuration
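The measures named on slide 9 (accuracy, precision, recall, F-measure, Cohen's kappa) can be computed from gold and predicted labels as in the sketch below. The function name `binary_metrics` and the toy labels are assumptions for illustration; they are not data from the paper.

```python
def binary_metrics(gold, predicted):
    """Accuracy, precision, recall, F1 and Cohen's kappa for binary labels (1 = evidence)."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and of equal length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    tn = sum(1 for g, p in pairs if g == 0 and p == 0)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)
    n = len(pairs)

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # Cohen's kappa: observed agreement corrected for agreement expected by chance.
    expected = ((tp + fn) / n) * ((tp + fp) / n) + ((tn + fp) / n) * ((tn + fn) / n)
    kappa = (accuracy - expected) / (1 - expected) if expected < 1 else 0.0

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "kappa": kappa}


# Toy labels, invented for illustration only.
gold = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(gold, predicted)
```

Kappa is reported alongside accuracy because it discounts the agreement that two annotators (or a classifier and an annotator) would reach by chance.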
  • 11. “Challenging knowledge extraction to support the curation of documentary evidence in the humanities.” Enrico Daga and Enrico Motta In: Third International Workshop on Capturing Scientific Knowledge (Sciknow). @K-CAP 2019 • Bet: metadata curation could be supported by Knowledge Extraction (KE) • “Slot filling” • Approaches in the literature vary in task / scope: • (Named) Entity Recognition and Classification • Entity Linking: encyclopedic (DBpedia, WikiData), domain specific (Gazetteers) • Relation Extraction (e.g. listener of, in place) • Event extraction (e.g. Performance) • Semantic Role Labelling, Machine reading, … • Assumption: the information is IN the text. Is that a valid assumption? Paper: http://oro.open.ac.uk/67961/
  • 12. Example #1 "I then went to Amsterdam to conduct Oedipus at the Concertgebouw, which was celebrating its fortieth anniversary by a series of sumptuous musical productions. The fine Concertgebouw orchestra, always at the same high level, the magnificent male choruses from the Royal Apollo Society, soloists of the first rank - among them Mme Hélène Sadoven as Jocasta, Louis van Tulder as Oedipus, and Paul Huf, an excellent reader - and the way in which my work was received by the public, have left a particularly precious memory that I recall with much enjoyment." listener: Igor Strawinsky time: in the beginning of 1928 place: Amsterdam opera: Oedipus Rex /by: Igor Strawinsky performer: Concertgebouw orch. environment: Public From: Igor Stravinsky, An Autobiography (1936), p. 139. https://led.kmi.open.ac.uk/entity/lexp/1435674909834 Daga, Enrico and Motta, Enrico (2019). Challenging knowledge extraction to support the curation of documentary evidence in the humanities. In: Third International Workshop on Capturing Scientific Knowledge (Sciknow). Collocated with the K-CAP conference.
  • 13. Example #2 "Music is certainly a pleasure that may be reckoned intellectual, and we shall never again have it in the perfection it is this year, because Mr. Handel will not compose any more! Oratorios begin next week, to my great joy, for they are the highest entertainment to me." listener: Mrs Delany time: March, 1737 place: London opera: Operas and Oratorios /by: G. F. Handel environment: Public From: Mary Granville, and Augusta Hall (ed.), Autobiography and Correspondence of Mary Granville, Mrs Delany: with interesting Reminiscences of King George the Third and Queen Charlotte, volume 1 (London, 1861), p. 594. https://led.kmi.open.ac.uk/entity/lexp/1444424772006
  • 14. Experiments • Focus on Entity Recognition: Listener & Place • Scope: 7.3% of the LED with sources available (archive.org) and including DBpedia entities as place or agent, 690 excerpts from 26 books. 1. Find the position of the evidence text back in the original source 2. Check where the DBpedia entity (listener or place) is mentioned • Details of the experiments are in the paper
  • 15. Analysis • Q1 - in the excerpt? The place is mentioned in the excerpt in 25.9% of cases; the listener in only 13.4%. • Q2 - near the excerpt? In only 10% of cases is the place mentioned fewer than 5 paragraphs from the excerpt; the agent, in 4% of cases. • Q3 - in the source? In 83.2% of cases the place is mentioned at least once in the source. In 11.4% the place was not found. • Q4 - in the meta? 64.8% of the listeners are also the authors of the text - 5874 cases in LED. [Figure: distance of entity from the excerpt, in number of paragraphs]
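The distance measure behind Q1 to Q3 can be sketched as follows. This is a simplified illustration, not the procedure from the paper: it assumes the source is already split into paragraphs, that the excerpt's paragraph index is known, and that an entity mention can be found by case-insensitive substring matching on its labels. The function name and sample text are hypothetical.

```python
def nearest_mention_distance(paragraphs, excerpt_index, entity_labels):
    """Distance, in paragraphs, from the excerpt to the closest mention of the entity.

    Returns 0 if the entity is mentioned in the excerpt paragraph itself (Q1),
    a small positive number if it is mentioned nearby (Q2), and None if it is
    never mentioned in the source (the negative case of Q3).
    """
    labels = [label.lower() for label in entity_labels]
    distances = [
        abs(index - excerpt_index)
        for index, paragraph in enumerate(paragraphs)
        if any(label in paragraph.lower() for label in labels)
    ]
    return min(distances) if distances else None


# Invented sample source, for illustration only.
source = [
    "I then went to Amsterdam.",
    "The journey was long.",
    "The orchestra played at the same high level.",
    "The public received the work warmly.",
]
place_distance = nearest_mention_distance(source, 2, ["Amsterdam"])
missing = nearest_mention_distance(source, 2, ["London"])
```

Substring matching will miss implicit or variant references (e.g. "Mr. Handel" for G. F. Handel), which is exactly the difficulty the analysis highlights.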
  • 16. Lessons learnt • Implicit information, based on inference requiring expertise (e.g. Mr Handel is G.F. Handel, Oedipus is “Oedipus Rex”) • The role of contextual knowledge is key to • (1) identify the entities (e.g. metadata); • (2) common sense reasoning (“the next year”, "in the beginning of 1928") • Entities can exist in distributed, heterogeneous resources (encyclopaedic KBs, domain-specific taxonomies, gazetteers, …) • Machine reading generates an ontology formalising the discourse in the text, reducing the task to one of ontology alignment (not a simplification!) • AI / Knowledge Extraction research is often focused on common sense & encyclopaedic knowledge • Documentary evidence is heavily domain-specific • Problem: humanities scholars coin novel concepts, e.g. LED, READ-IT • Sitting Experience in Portraiture History (OU Arts History PhD) • Polifonia / CHILD pilot: music of/for children • Polifonia / MEETUPS pilot: encounters and exchange of ideas This research has partly received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 870811. The communication reflects only the author’s view and the Research Executive Agency is not responsible for any use that may be made of the information it contains.