
Capturing the semantics of documentary evidence for humanities research


Identifying and curating documentary evidence from textual corpora is an essential part of empirical research in the humanities.
First, we discuss "themed" evidence (traces of a fact or situation relevant to a theme of interest) and focus on the problem of identifying such traces in texts. To that end, we combine statistical NLP, background knowledge, and Semantic Web technologies in a hybrid approach. We illustrate the method's effectiveness in a case study of a database of evidence of experiences of listening to music. We also demonstrate its generality by testing it on a different use case in the digital humanities.
Finally, we ponder the applicability of knowledge extraction techniques to automatically populate a database of documentary evidence and discuss the challenges from the point of view of scientific knowledge acquisition.


Capturing the semantics of documentary evidence for humanities research

  1. Capturing the semantics of documentary evidence for humanities research. DBpedia Day, NLP & DBpedia, 09/09/2021, Semantics 2021, Amsterdam (& online). Enrico Daga, The Open University. @enridaga | www.enridaga.net
  2. Motivation: The identification and cataloguing of documentary evidence from textual corpora is an important part of empirical research in the humanities (e.g. historiographic methodology). Semantic databases of documentary evidence are a recent trend: • The Listening Experience Database Project (LED) (over 10,000 unique experiences) - https://led.kmi.open.ac.uk/ (2 UK AHRC 2012-2019) • READ-IT: Reading Europe Advanced Data Investigation Tool - https://readit-project.eu/ (2018-2020) • Polifonia: Knowledge Graph of Musical Cultural Heritage, with pilots focusing on scholars in the musical heritage domain - http://polifonia-project.eu (2021-2023) Two problems: • Identification -> find evidence in texts • Cataloguing -> curate a database of evidence
  3. Identification: Identifying pieces of evidence in books is manual work, which may rely on free-text search tools (e.g. PDF viewers). Problems: the activity (a) requires effort and time, (b) is not systematic, (c) is prone to errors, and (d) follows a methodology that is (often) not documented
  4. "Capturing themed evidence, a hybrid approach." Enrico Daga and Enrico Motta. In Proceedings of the 10th International Conference on Knowledge Capture, pp. 93-100. 2019. • Focus on Identification • We coin the expression themed evidence to refer to (direct or indirect) traces of a fact or situation relevant to a theme of interest, and study the problem of identifying them in texts. • The task of identifying themed evidence is at the intersection between topical text classification (finding texts relevant to a certain theme) and event retrieval (finding events mentioned in texts). • Not all topical texts are themed evidence, and the nature of the event itself is often assumed, implicit, and left to the reader. Paper: http://oro.open.ac.uk/67961/
  5. Finding Listening Experiences (theme: music) • RECMUS-619, positive: Introduced to the Anacreontic Society, consisting of amateurs who perform admirably the best orchestral works. The usual supper followed. After propitiating me with a trio from ’Cosi Fan Tutte’, they drew me to the piano. • MASONB-31, positive: In the evening we went to Rev. Baptist Noel’s chapel, where one is always sure of edification from the sermon if not from the psalms. • MASONB-88, negative: Flags and pendants were suspended from the windows, [...] the colors of the German States were waving harmoniously together, and the banners of the Fine Arts, with appropriate inscriptions, particularly those of music, poetry and painting, were especially honored, and floated triumphant amidst the standards of electorates, dukedoms, and kingdoms. Daga, Enrico, and Enrico Motta. "Capturing themed evidence, a hybrid approach." In Proceedings of the 10th International Conference on Knowledge Capture, pp. 93-100. 2019.
  6. 1) Statistical Relatedness Analysis 2) Themed entity detection 3) Hybridisation • Entity boost: promote terms mapped to entities • PoS filter: demote terms other than verbs and nouns, to privilege factual statements
  7. RECMUS-619, positive: Introduced to the Anacreontic Society, consisting of amateurs who perform admirably the best orchestral works. The usual supper followed. After propitiating me with a trio from 'Cosi Fan Tutte', they drew me to the piano. Entities: http://dbpedia.org/resource/Anacreontic_Society http://dbpedia.org/resource/Orchestra http://dbpedia.org/resource/Trio_(music) http://dbpedia.org/resource/Così_fan_tutte http://dbpedia.org/resource/Piano Tool: http://led.kmi.open.ac.uk/discovery/findler
  8. MASONB-31, positive: In the evening we went to Rev. Baptist Noel's chapel, where one is always sure of edification from the sermon if not from the psalms. Entities: http://dbpedia.org/resource/Evening_Prayer_(Anglican) http://dbpedia.org/resource/Psalms MASONB-88, negative: Flags and pendants were suspended from the windows, [...] the colours of the German States were waving harmoniously together, and the banners of the Fine Arts, with appropriate inscriptions, particularly those of music, poetry and painting, were especially honored, and floated triumphant amidst the standards of electorates, dukedoms, and kingdoms. Entities: http://dbpedia.org/resource/Music
  9. Evaluation: The results are very good: 87% F-measure and accuracy. Baseline methods: • Fo: Random Forest classifier. High precision, low recall, accuracy slightly above random (on training/test it reached 80% accuracy: robust gold standard) • ST: Statistical, with a dictionary from Gutenberg’s Music shelf and average TF-IDF. Variants of our method: • Em: Statistical relatedness component only (Embeddings) • En: Themed entity detection component (Entity). Slightly above random: the gold standard is pessimistic / robust • Em+F: Statistical relatedness + PoS filter (Embeddings - Filtered) • Hy-F: No filter, only entity boost (Hybrid - Unfiltered). Without noise correction (PoS filter), precision is generally lower; this variant shows the impact of entity detection on recall • Hy: best of both worlds. Substantial agreement with annotators (Cohen’s kappa). Our method on an alternative case study: • Hy/R: Our Hybrid approach on the Reading Experience Database (to test portability). Core concept: book[n]; core entity: dbc:Literature. The approach is applicable to other domains with little configuration
  10. Cataloguing
  11. "Challenging knowledge extraction to support the curation of documentary evidence in the humanities." Enrico Daga and Enrico Motta. In: Third International Workshop on Capturing Scientific Knowledge (Sciknow), @K-CAP 2019. • Bet: metadata curation could be supported by Knowledge Extraction (KE) • "Slot filling" • Approaches in the literature vary in task / scope: • (Named) Entity Recognition and Classification • Entity Linking: encyclopedic (DBpedia, WikiData), domain-specific (gazetteers) • Relation Extraction (e.g. listener of, in place) • Event Extraction (e.g. Performance) • Semantic Role Labelling, Machine Reading, … • Assumption: the information is IN the text. Is that a valid assumption? Paper: http://oro.open.ac.uk/67961/
  12. Example #1: "I then went to Amsterdam to conduct Oedipus at the Concertgebouw, which was celebrating its fortieth anniversary by a series of sumptuous musical productions. The fine Concertgebouw orchestra, always at the same high level, the magnificent male choruses from the Royal Apollo Society, soloists of the first rank - among them Mme Hélène Sadoven as Jocasta, Louis van Tulder as Oedipus, and Paul Huf, an excellent reader - and the way in which my work was received by the public, have left a particularly precious memory that I recall with much enjoyment." Metadata: listener: Igor Strawinsky; time: in the beginning of 1928; place: Amsterdam; opera: Oedipus Rex / by: Igor Strawinsky; performer: Concertgebouw orch.; environment: Public. From: Igor Stravinsky, An Autobiography (1936), p. 139. https://led.kmi.open.ac.uk/entity/lexp/1435674909834 Daga, Enrico and Motta, Enrico (2019). Challenging knowledge extraction to support the curation of documentary evidence in the humanities. In: Third International Workshop on Capturing Scientific Knowledge (Sciknow). Collocated with the K-CAP conference.
  13. Example #2: "Music is certainly a pleasure that may be reckoned intellectual, and we shall never again have it in the perfection it is this year, because Mr. Handel will not compose any more! Oratorios begin next week, to my great joy, for they are the highest entertainment to me." Metadata: listener: Mrs Delany; time: March, 1737; place: London; opera: Operas and Oratorios / by: G. F. Handel; environment: Public. From: Mary Granville, and Augusta Hall (ed.), Autobiography and Correspondence of Mary Granville, Mrs Delany: with interesting Reminiscences of King George the Third and Queen Charlotte, volume 1 (London, 1861), p. 594. https://led.kmi.open.ac.uk/entity/lexp/1444424772006
  14. Experiments • Focus on Entity Recognition: Listener & Place • Scope: 7.3% of the LED, with sources available (archive.org) and including DBpedia entities as place or agent: 690 excerpts from 26 books. Steps: 1. Find the position of the evidence text back in the original source 2. Check where the DBpedia entity (listener or place) is mentioned • Details of the experiments are in the paper
  15. Analysis • Q1 - in the excerpt? The place is mentioned in the excerpt in 25.9% of cases, the listener in only 13.4%. • Q2 - near the excerpt? In only 10% of cases is the place mentioned less than 5 paragraphs from the excerpt; the agent, in 4% of cases. • Q3 - in the source? In 83.2% of cases the place is mentioned at least once in the source. In 11.4% the place was not found. • Q4 - in the meta? 64.8% of the listeners are also the authors of the text - 5874 cases in LED. [Chart: distance of entity from the excerpt, in number of paragraphs]
  16. Lessons learnt • Implicit information, based on inference requiring expertise (e.g. Mr Handel is G. F. Handel, Oedipus is "Oedipus Rex") • The role of contextual knowledge is key to • (1) identify the entities (e.g. metadata); • (2) common sense reasoning ("the next year", "in the beginning of 1928") • Entities can exist in distributed, heterogeneous resources (encyclopaedic KBs, domain-specific taxonomies, gazetteers, …) • Machine reading generates an ontology formalising the discourse in the text, reducing the task to one of ontology alignment (not a simplification!) • AI / Knowledge Extraction research is often focused on common sense & encyclopaedic knowledge • Documentary evidence is heavily domain-specific • Problem: humanities scholars coin novel concepts, e.g. LED, READ-IT • Sitting Experience in Portraiture History (OU Arts History PhD) • Polifonia / CHILD pilot: music of/for children • Polifonia / MEETUPS pilot: encounters and exchange of ideas. This research has partly received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 870811. The communication reflects only the author’s view and the Research Executive Agency is not responsible for any use that may be made of the information it contains.
  17. Thank you. Questions? @enridaga | www.enridaga.net
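
The hybridisation step on slide 6 (statistical relatedness combined with an entity boost and a PoS filter) could be sketched roughly as below. This is only an illustration, not the implementation from the paper: the `Token` fields, the multiplicative form of the boost and demotion, the constants `ENTITY_BOOST` and `POS_DEMOTION`, the averaging and the threshold are all assumptions made for the example. In practice the relatedness values would come from an embedding model and the entity flags from an entity linker.

```python
from dataclasses import dataclass

# Hypothetical weights, chosen only for illustration.
ENTITY_BOOST = 2.0      # promote terms mapped to themed entities
POS_DEMOTION = 0.5      # demote terms that are neither nouns nor verbs
KEPT_POS = {"NOUN", "VERB"}


@dataclass
class Token:
    text: str
    pos: str                 # coarse part-of-speech tag, e.g. "NOUN"
    relatedness: float       # statistical relatedness to the theme, in [0, 1]
    is_themed_entity: bool   # True if linked to an entity relevant to the theme


def token_score(tok: Token) -> float:
    """Combine relatedness with the entity boost and the PoS filter."""
    score = tok.relatedness
    if tok.is_themed_entity:
        score *= ENTITY_BOOST
    if tok.pos not in KEPT_POS:
        score *= POS_DEMOTION
    return score


def text_score(tokens: list[Token]) -> float:
    """Average token score of a text; 0.0 for an empty text."""
    if not tokens:
        return 0.0
    return sum(token_score(t) for t in tokens) / len(tokens)


def is_themed_evidence(tokens: list[Token], threshold: float = 0.3) -> bool:
    """Classify a text as themed evidence if its score reaches the threshold."""
    return text_score(tokens) >= threshold
```

Dropping the `is_themed_entity` branch gives something like the filtered embeddings-only variant (Em+F on slide 9), and dropping the PoS branch gives something like the unfiltered hybrid (Hy-F).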
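
The measures named on slide 9 (F-measure, accuracy, Cohen's kappa) can be computed from binary gold and predicted labels as in the following sketch. The function name and the toy labels in the usage note are illustrative assumptions; they are not data from the evaluation.

```python
def binary_metrics(gold: list[int], pred: list[int]) -> dict[str, float]:
    """Precision, recall, F1, accuracy and Cohen's kappa for 0/1 labels."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")

    # Confusion-matrix counts.
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    n = len(gold)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / n

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = accuracy
    p_expected = (((tp + fn) / n) * ((tp + fp) / n)
                  + ((tn + fp) / n) * ((tn + fn) / n))
    kappa = ((p_observed - p_expected) / (1 - p_expected)
             if p_expected != 1 else 0.0)

    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "kappa": kappa}
```

For example, `binary_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0])` yields 0.75 for precision, recall, F1 and accuracy, and a kappa of 0.5.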
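
The two experiment steps on slide 14 (locate the excerpt in its source, then check where the entity is mentioned) and the paragraph distance analysed on slide 15 might look like this in a minimal form. The sketch assumes paragraphs are separated by blank lines, that an excerpt falls within a single paragraph, and that exact matching after whitespace and case normalisation is enough. Real OCR text from archive.org would likely need fuzzy matching, and the function names here are hypothetical.

```python
import re
from typing import Optional


def normalise(text: str) -> str:
    """Collapse whitespace and lowercase, to make matching less brittle."""
    return re.sub(r"\s+", " ", text).strip().lower()


def split_paragraphs(source: str) -> list[str]:
    """Split a source text into normalised paragraphs on blank lines."""
    return [normalise(p) for p in re.split(r"\n\s*\n", source) if p.strip()]


def locate_excerpt(paragraphs: list[str], excerpt: str) -> Optional[int]:
    """Step 1: index of the first paragraph containing the excerpt, if any."""
    needle = normalise(excerpt)
    for index, paragraph in enumerate(paragraphs):
        if needle in paragraph:
            return index
    return None


def nearest_mention_distance(paragraphs: list[str], excerpt_index: int,
                             label: str) -> Optional[int]:
    """Step 2: distance, in paragraphs, to the closest mention of the label.

    Returns 0 if the label occurs in the excerpt's own paragraph and
    None if it is never mentioned in the source.
    """
    needle = normalise(label)
    distances = [abs(index - excerpt_index)
                 for index, paragraph in enumerate(paragraphs)
                 if needle in paragraph]
    return min(distances) if distances else None
```

A result of `None` from `nearest_mention_distance` corresponds to the case on slide 15 where the place is not found in the source at all.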
