Documents as Data: Harvesting Knowledge from Textual Resources with DADAlytics

This presentation covers the development and assessment of DADAlytics, a semantic tool created by the Semantic Lab at Pratt to help librarians, archivists, and humanities scholars generate linked data from textual resources and descriptive records. It describes the overall design of the tool and outlines the methods adopted to make it intuitive and flexible. It also addresses ongoing assessment activities conducted through piloting and testing. The testbed is the collection of personal diaries of Mary Berenson, part of the Bernard and Mary Berenson Papers (1880-2002) held at the Berenson Library at the Villa I Tatti in Florence, Italy.


  1. Documents as Data: Harvesting Knowledge from Textual Resources with DADAlytics. Mary Mann, Sarah Ann Adams, Rose Gold, Ilaria Della Monica, M. Cristina Pattuelli. Qualitative and Quantitative Methods in Libraries, Florence, May 28 - June 1, 2019. Semantic Lab at Pratt Institute, @semlabteam, bit.ly/QQMLSemLab
  2. What is Linked Open Data? LOD: recommended best practices for exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web, conceived by Tim Berners-Lee in 2006. Diagrams by Sarah Ann Adams.
  3. Linked Data Obstacles: availability of easy-to-use tools; technological understanding. How DADAlytics Helps: intuitive data service; lowers barrier to LOD creation.
  4. What is DADAlytics? Named-Entity Recognition (NER) Module [diagram: text tagged as organization, location, date, person, misc]. Sélavy Document Analysis Tool [diagram: document divided into Title, Subtitle, Body]. Partners: Carnegie Hall, Tulane University, University of Minnesota, Harvard University, Villa I Tatti, Whitney Museum of American Art. Diagrams by Sarah Ann Adams.
  5. [Image-only slide]
  6. The Six DADAlytics NER Tools
     - DBpedia Spotlight. Tool type: NLP tool with NER component. Training data: DBpedia resources (Wikipedia-extracted structured content). Further info: dbpedia-spotlight.org
     - Stanford NLP. Tool type: NLP tool with NER component. Training data: mix of CoNLL, MUC-6, MUC-7, and ACE named entity corpora, using the english.muc.7class.distsim.crf.ser.gz classifier. Further info: nlp.stanford.edu
     - NLTK. Tool type: NLP tool with NER component. Training data: Groningen Meaning Bank corpus. Further info: nltk.org
     - SpaCy. Tool type: NLP tool with NER component. Training data: OntoNotes and Common Crawl. Further info: spacy.io
     - OpeNER Project. Tool type: NLP tool with NER component. Training data: Apache OpenNLP models. Further info: opener-project.eu
     - TensorFlow Syntaxnet. Tool type: neural network part-of-speech tagger. Training data: Parsey McParseface. Further info: research.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html
  7. DADA•Berenson: Mary Berenson and her Diaries. Mary (Whitall) Berenson: art historian, art critic; wife of art historian Bernard Berenson; influenced Bernard’s work; archive held at Villa I Tatti. Images: Mary Berenson [1885], Public Domain, held at National Portrait Gallery; Mary and Bernard Berenson near Fernhurst, England, 1898, courtesy of the Villa I Tatti Berenson Library.
  8. DADA•Berenson: Villa I Tatti Diary Project. NAMES? PLACES? ARTISTS? WORKS OF ART? Photograph courtesy of the Villa I Tatti Berenson Library.
  9-15. Methodology (built up one step at a time across slides 9 to 15): 1] DIARY SELECTION; 2] HANDWRITTEN DIARY TRANSCRIBED TO A DIGITAL DOCUMENT; 3] CLASSIFICATION OF MISCELLANEOUS ENTITY TYPES; 4] MANUALLY EXTRACT ENTITIES; 5] RUN DIARY TEXT THROUGH DADALYTICS NER MODULE; 6] COMPARE MANUAL EXTRACTION TO DADALYTICS OUTPUT
  16. DADAlytics NER Demo
  17. Entity Classification. DADAlytics Entity Categories: Person, Location, Date, Organization, Event, Miscellaneous. Diary-Specific Entity Types: Literature, Music, Poetry, Theater, Non-Fiction, Visual Art, Art Described by Era, Art Described by Region, Drawing, Painting, Photography, Pottery, Print, Sculpture, Stained Glass, Textile, Mural, Art Collection, Biographic, Cultural, Historic, Article, Journal, Lecture, Magazine, Newspaper, Thesis.
  18. Extraction Comparison Results: semlab.io/DADAlytics-ner-evaluation/
  19. Extraction Comparison Results, NLTK Example: semlab.io/DADAlytics-ner-evaluation/
  20. Analysis of Results [chart; same values as slide 21]
  21. Analysis of Results [chart; axis and series labels not recoverable]. Values shown: 100.00%, 54.50%, 11.17%, 57.33%, 38.00%, 54.50%, 29.33%, 21.67%, 20.50%, 12.67%, 11.50%, 35.00%, 24.17%, 38.50%, 39.67%, 75.00%, 23.83%, 68.83%, 73.00%, 78.67%, 67.83%, 61.33%. Callouts: 66.00% and 50.00%, labeled [For partial matches] and [For exact matches]. Reference: Batista, D. (2018, May 9). Named-Entity evaluation metrics based on entity-level. Retrieved from www.davidsbatista.net/blog/2018/05/09/Named_Entity_Evaluation/
  22. SÉLAVY - DOCUMENT ANALYSIS TOOL. Marcel Duchamp as Rrose Sélavy (pronounced “c’est la vie”). [Diagram: document divided into Block 1, Block 2, Block 3.] Image: Rrose Selavy (Marcel Duchamp), 1920 © Man Ray Trust/ADAGP, Paris and DACS, London 2015. Diagram by Sarah Ann Adams.
  23. Turning the document into blocks
  24. Document clean up
  25. Processing of document through Sélavy: the text that was formatted in Sélavy is pushed through the NER tool for entity recognition, then pulled back into Sélavy for further transformation.
  26. Reviewing the entities
  27. Next Steps: complete the development of the Sélavy module; test Sélavy using Mary Berenson’s diary, and then on other types of documents (interviews, finding aids, etc.); evaluate the tool with the intended community of users; review and refine the tool and workflow; apply the methodology to other Semantic Lab projects.
  28. Thank You. Semantic Lab at Pratt Institute. SemLab co-directors: Prof. M. Cristina Pattuelli, Prof. Matt Miller. SemLab team: Mary Mann, Rose Gold, Sarah Adams, Taylor Baker, Megan Lyon. Contact: w :: semlab.io, t :: @semlabteam, e :: foaf.Person@semlab.io. A special thank you to Ilaria Della Monica, Archivist, Villa I Tatti. Noun Project image credits: tools by Nithinan Tatah (slide 4); personal solution by ProSymbols (slide 4); solution by Gregor Cresnar (slide 4); Person passive confused by Margaret Hagan (slide 5). Questions? @semlabteam
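
Step 6 of the methodology compares the manual extraction with the DADAlytics output, and the results slide distinguishes exact from partial matches. The sketch below shows one way such a comparison could be scored at the entity level. It is not the Semantic Lab's evaluation code: the `Entity` class, the `mention` and `compare` helpers, and the "tool output" are all hypothetical, and the overlap rule is only a simplified reading of the exact/partial distinction. The sample sentence is the photo caption from the Mary Berenson slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """One entity mention: character offsets plus a category label."""
    start: int   # offset of the first character (inclusive)
    end: int     # offset one past the last character (exclusive)
    label: str   # e.g. "person", "location", "date"


def mention(text: str, phrase: str, label: str) -> Entity:
    """Build an Entity from the first occurrence of `phrase` in `text`."""
    start = text.index(phrase)
    return Entity(start, start + len(phrase), label)


def overlaps(a: Entity, b: Entity) -> bool:
    """True when the two character spans share at least one character."""
    return a.start < b.end and b.start < a.end


def compare(manual: list[Entity], predicted: list[Entity]) -> dict[str, float]:
    """Share of manually extracted entities that the tool also found."""
    exact = 0    # same span and same label
    partial = 0  # same label, overlapping but not identical span
    for gold in manual:
        if gold in predicted:
            exact += 1
        elif any(p.label == gold.label and overlaps(p, gold) for p in predicted):
            partial += 1
    total = len(manual)
    return {
        "exact_recall": exact / total if total else 0.0,
        "partial_recall": (exact + partial) / total if total else 0.0,
    }


text = "Mary and Bernard Berenson near Fernhurst, England, 1898"

# Manual extraction (the reference the tool is measured against).
manual = [
    mention(text, "Mary", "person"),
    mention(text, "Bernard Berenson", "person"),
    mention(text, "Fernhurst", "location"),
    mention(text, "England", "location"),
    mention(text, "1898", "date"),
]

# Hypothetical tool output, invented for illustration only.
predicted = [
    mention(text, "Berenson", "person"),       # partial: surname only
    mention(text, "Fernhurst", "location"),    # exact
    mention(text, "England", "organization"),  # wrong label, not counted
    mention(text, "1898", "date"),             # exact
]

scores = compare(manual, predicted)
print(scores)
```

With this invented input, two of the five manual entities are matched exactly and one more is matched partially, so the script prints an exact recall of 0.4 and a partial recall of 0.6. A fuller evaluation would also report precision, that is, how many of the tool's entities appear in the manual extraction.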

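The stated goal of the tool is to generate linked data from recognized entities. As a rough illustration of what that output can look like, the sketch below writes a few statements about two people named in the slides as N-Triples, using the FOAF vocabulary. The `http://example.org/entity/` namespace, the slugs, and the helper functions are assumptions made for this example; the slides do not say which vocabularies or URIs DADAlytics emits.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/entity/"  # hypothetical namespace, not a SemLab URI


def uri(value: str) -> str:
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value: str) -> str:
    """Quote a plain string literal, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def triple(subject: str, predicate: str, obj: str) -> str:
    """Serialize one statement as a single N-Triples line."""
    return f"{subject} {predicate} {obj} ."


def person_triples(slug: str, name: str) -> list[str]:
    """Describe a person entity: its type and its name."""
    subject = uri(BASE + slug)
    return [
        triple(subject, uri(RDF_TYPE), uri(FOAF + "Person")),
        triple(subject, uri(FOAF + "name"), literal(name)),
    ]


lines = person_triples("mary-berenson", "Mary Berenson")
lines += person_triples("bernard-berenson", "Bernard Berenson")
# Link the two people; foaf:knows is a deliberately generic relation.
lines.append(
    triple(
        uri(BASE + "mary-berenson"),
        uri(FOAF + "knows"),
        uri(BASE + "bernard-berenson"),
    )
)

print("\n".join(lines))
```

Each printed line is one subject, predicate, object statement. In practice the subject URIs would more likely point at an existing authority or knowledge base than at a made-up namespace, which is what makes the data "linked".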