
Documents as Data: Harvesting Knowledge from Textual Resources with DADAlytics


Development and assessment of DADAlytics, a semantic tool created by the Semantic Lab at Pratt to help librarians, archivists and humanities scholars generate linked data from textual resources and descriptive records. The presentation includes the description of the overall design of the tool and outlines the methods adopted to make the tool intuitive and flexible. It also addresses ongoing assessment activities conducted through piloting and testing. The testbed is provided by the collection of personal diaries of Mary Berenson, part of the Bernard and Mary Berenson Papers (1880-2002) held at the Berenson Library at the Villa I Tatti in Florence, Italy.


  1. Documents as Data: Harvesting Knowledge from Textual Resources with DADAlytics. Mary Mann, Sarah Ann Adams, Rose Gold, Ilaria Della Monica, M. Cristina Pattuelli. Qualitative and Quantitative Methods in Libraries, Florence, May 28 - June 1, 2019. Semantic Lab at Pratt Institute @semlabteam bit.ly/QQMLSemLab
  2. What is Linked Open Data. LOD: recommended best practices for exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web, conceived by Tim Berners-Lee in 2006. [Diagrams by Sarah Ann Adams] bit.ly/QQMLSemLab
  3. Linked Data Obstacles: availability of easy-to-use tools; technological understanding. How DADAlytics Helps: intuitive data service; lowers the barrier to LOD creation. bit.ly/QQMLSemLab
  4. What is DADAlytics. Partners: Carnegie Hall, Tulane University, University of Minnesota, Harvard University, Villa I Tatti, Whitney Museum of American Art. Modules: Named-Entity Recognition (NER) Module; Sélavy Document Analysis Tool. [Diagrams of entity types (organization, location, date, person, misc) and document structure (title, subtitle, body) by Sarah Ann Adams] bit.ly/QQMLSemLab
  5. [Screenshot of NER toolchain output on an archival document] bit.ly/QQMLSemLab
  6. The Six DADAlytics NER Tools:
     - DBpedia Spotlight: NLP tool with NER component; training data: DBpedia resources (Wikipedia-extracted structured content); dbpedia-spotlight.org
     - Stanford NLP: NLP tool with NER component; training data: mix of CoNLL, MUC-6, MUC-7, and ACE named entity corpora, using the english.muc.7class.distsim.crf.ser.gz classifier; nlp.stanford.edu
     - NLTK: NLP tool with NER component; training data: Groningen Meaning Bank corpus; nltk.org
     - SpaCy: NLP tool with NER component; training data: OntoNotes and Common Crawl; spacy.io
     - OpeNER Project: NLP tool with NER component; training data: Apache OpenNLP models; opener-project.eu
     - TensorFlow Syntaxnet: neural network part-of-speech tagger; model: Parsey McParseface; research.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html
     bit.ly/QQMLSemLab
  7. DADA•Berenson: Mary Berenson and her Diaries. Mary (Whitall) Berenson: art historian and art critic; wife of art historian Bernard Berenson; influenced Bernard’s work; archive held at Villa I Tatti. [Images: Mary Berenson, 1885, public domain, held at the National Portrait Gallery; Mary and Bernard Berenson near Fernhurst, England, 1898, courtesy of the Villa I Tatti Berenson Library] bit.ly/QQMLSemLab
  8. DADA•Berenson: Villa I Tatti Diary Project. Names? Places? Artists? Works of art? [Photograph courtesy of the Villa I Tatti Berenson Library] bit.ly/QQMLSemLab
  9. Methodology: 1] DIARY SELECTION bit.ly/QQMLSemLab
  10. Methodology: 1] DIARY SELECTION 2] HANDWRITTEN DIARY TRANSCRIBED TO A DIGITAL DOCUMENT bit.ly/QQMLSemLab
  11. Methodology: 1] DIARY SELECTION 2] HANDWRITTEN DIARY TRANSCRIBED TO A DIGITAL DOCUMENT 3] CLASSIFICATION OF MISCELLANEOUS ENTITY TYPES bit.ly/QQMLSemLab
  12. Methodology: 1] DIARY SELECTION 2] HANDWRITTEN DIARY TRANSCRIBED TO A DIGITAL DOCUMENT 3] CLASSIFICATION OF MISCELLANEOUS ENTITY TYPES 4] MANUALLY EXTRACT ENTITIES bit.ly/QQMLSemLab
  13. Methodology: 1] DIARY SELECTION 2] HANDWRITTEN DIARY TRANSCRIBED TO A DIGITAL DOCUMENT 3] CLASSIFICATION OF MISCELLANEOUS ENTITY TYPES 4] MANUALLY EXTRACT ENTITIES bit.ly/QQMLSemLab
  14. Methodology: 1] DIARY SELECTION 2] HANDWRITTEN DIARY TRANSCRIBED TO A DIGITAL DOCUMENT 3] CLASSIFICATION OF MISCELLANEOUS ENTITY TYPES 4] MANUALLY EXTRACT ENTITIES 5] RUN DIARY TEXT THROUGH DADALYTICS NER MODULE bit.ly/QQMLSemLab
  15. Methodology: 1] DIARY SELECTION 2] HANDWRITTEN DIARY TRANSCRIBED TO A DIGITAL DOCUMENT 3] CLASSIFICATION OF MISCELLANEOUS ENTITY TYPES 4] MANUALLY EXTRACT ENTITIES 5] RUN DIARY TEXT THROUGH DADALYTICS NER MODULE 6] COMPARE MANUAL EXTRACTION TO DADALYTICS OUTPUT bit.ly/QQMLSemLab
  16. DADAlytics NER Demo bit.ly/QQMLSemLab
  17. Entity Classification. DADAlytics Entity Categories: Person, Location, Date, Organization, Event, Miscellaneous. Diary-Specific Entity Types: Literature, Music, Poetry, Theater, Non-Fiction, Visual Art, Art Described by Era, Art Described by Region, Drawing, Painting, Photography, Pottery, Print, Sculpture, Stained Glass, Textile, Mural, Art Collection, Biographic, Cultural, Historic, Article, Journal, Lecture, Magazine, Newspaper, Thesis. bit.ly/QQMLSemLab
  18. Extraction Comparison Results: semlab.io/DADAlytics-ner-evaluation/ bit.ly/QQMLSemLab
  19. Extraction Comparison Results: semlab.io/DADAlytics-ner-evaluation/ NLTK Example bit.ly/QQMLSemLab
  20. Analysis of Results. [Chart of exact-match and partial-match percentages per document] bit.ly/QQMLSemLab
  21. Analysis of Results. [Chart as on slide 20, with benchmark lines at 50.00% for exact matches and 66.00% for partial matches] Batista, D. Named-Entity evaluation metrics based on entity-level. (2018, May 9). Retrieved from www.davidsbatista.net/blog/2018/05/09/Named_Entity_Evaluation/ bit.ly/QQMLSemLab
  22. SÉLAVY - DOCUMENT ANALYSIS TOOL. Marcel Duchamp as Rrose Sélavy (pronounced “c’est la vie”). [Image: Rrose Sélavy (Marcel Duchamp), 1920, © Man Ray Trust/ADAGP, Paris and DACS, London 2015. Diagram of a document divided into Block 1, Block 2, Block 3, by Sarah Ann Adams] bit.ly/QQMLSemLab
  23. Turning the document into blocks bit.ly/QQMLSemLab
  24. Document clean-up bit.ly/QQMLSemLab
  25. Processing of the document through Sélavy. The text that was formatted in Sélavy is now pushed through the NER tool for entity recognition, and will then be pulled back into the Sélavy tool for further transformation. bit.ly/QQMLSemLab
  26. Reviewing the entities bit.ly/QQMLSemLab
  27. Next Steps: complete the development of the Sélavy module; test Sélavy using Mary Berenson’s diary, and then on other types of documents (interviews, finding aids, etc.); evaluate the tool with the intended community of users; review and refine the tool and workflow; apply the methodology to other Semantic Lab projects. bit.ly/QQMLSemLab
  28. Thank You. Semantic Lab at Pratt Institute. SEMLAB CO-DIRECTORS: Prof. M. Cristina Pattuelli, Prof. Matt Miller. SEMLAB TEAM: Mary Mann, Rose Gold, Sarah Adams, Taylor Baker, Megan Lyon. CONTACT: w :: semlab.io | t :: @semlabteam | e :: foaf.Person@semlab.io. A special thank you to Ilaria della Monica, Archivist, Villa I Tatti. Noun Project image credits: “tools” by Nithinan Tatah (slide 4); “personal solution” by ProSymbols (slide 4); “solution” by Gregor Cresnar (slide 4); “PErson passive confused” by Margaret Hagan (slide 5). Questions? @semlabteam bit.ly/QQMLSemLab

Editor's Notes

  • SLIDE 1 [MARY]
    Hello, my name is Mary, from the Semantic Lab at Pratt Institute. We are very happy to be here at QQML to present “Documents as Data: Harvesting Knowledge from Textual Resources with DADAlytics.” Before we start, I’d like to invite you all to take a minute to stand up and stretch - I know we’ve all been sitting for a long time already!
    On the screen here you can see a link to this presentation, if you’d like to follow along. And just to give a roadmap of what we’ll be sharing with you, I will start by speaking about Linked Open Data and introducing the Linked Data creation tool package we’ve coined DADAlytics; then my colleague Rose will speak about the process of using a diary written by Mary Berenson from Villa I Tatti to test the first component of DADAlytics; lastly our colleague Sarah will speak about the results of that testing and then describe the second component of the DADAlytics tool package, Sélavy.
    So let’s get started.
  • SLIDE 2 [MARY]
    DADAlytics is a tool package that helps institutions and researchers create linked open data. So before we get into DADAlytics itself, I’ll just give you a brief overview of what linked open data is and why you might want to create it.
    The internet as we currently know it is a series of documents linked by URLs. But the semantic web is a web of linked data, rather than just linked documents. In this context, “data” could mean anything from statistics to people to names of artworks. A semantic web with information stored as linked data allows for more granular searching of data, which has the added benefit of increasing discoverability of and access to the data and/or resources.

  • SLIDE 3 [MARY]
    Because the generation of linked open data is relatively new to the cultural heritage domain, there are still significant barriers to entry, particularly in terms of understanding the processes and technology involved, and the availability of intuitive linked data tools.
    We recognize that there’s a significant time cost to creating linked open data at this stage in its development, which can make the process daunting for already-busy library and museum professionals.
    So the Semantic Lab envisioned DADAlytics, with the goal of creating a lightweight tool package that could enable every librarian, archivist, museum professional and digital humanities scholar to contribute to the Semantic web by creating linked open data, advancing scholarship and creating new knowledge.
  • SLIDE 4 [MARY]
    The DADAlytics tool package is being developed by Matt Miller, one of the two co-directors of the Semantic Lab at Pratt, and was informed by needs of the stakeholders. Representatives from these institutions gave us feedback on what they might want to see in a package designed to help them create linked open data.
    DADAlytics currently consists of two modules: a named-entity-recognition toolchain and a document analysis tool called Sélavy.
    My colleague Sarah will speak about Sélavy at the end of the presentation, but right now I’ll focus on the named-entity-recognition toolchain, or NER toolchain for short.
  • SLIDE 5 [MARY]
    The NER toolchain is a combination of six existing open source tools that work together to recognize entities. In this context, an “entity” can loosely be thought of as a proper noun. The main categories of entities picked up by NER tools are dates, locations, organizations, people, and “miscellaneous”.
    Here’s a visual of the NER toolchain process at work on an archival document. First the document has to be transcribed into machine-readable type. Then you can simply copy and paste the text into the NER toolchain, and it returns something that looks like this (gesture to screen) where all of the detected entities are highlighted with a color block indicating what type of entity the tools believe they are.
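The copy-paste-and-highlight flow just described can be sketched in miniature. This is not the DADAlytics code: a tiny hard-coded gazetteer stands in for the six real NER tools so the sketch runs without any model downloads, and the sample sentence and entity list are invented for illustration.

```python
# Toy stand-in for an NER pass: look up known entities in pasted text
# and report each one with its type and character offset.
GAZETTEER = {
    "Mary Berenson": "person",
    "Bernard Berenson": "person",
    "Villa I Tatti": "organization",
    "Florence": "location",
}

def recognize(text):
    """Return (entity, type, offset) triples, ordered by position.

    Only the first occurrence of each entity is reported; a real NER
    tool would of course find every mention and unseen names too.
    """
    hits = []
    for entity, etype in GAZETTEER.items():
        start = text.find(entity)
        if start != -1:
            hits.append((entity, etype, start))
    return sorted(hits, key=lambda h: h[2])

sample = "Mary Berenson kept her diaries at Villa I Tatti in Florence."
for entity, etype, _ in recognize(sample):
    print(f"{etype:>12}: {entity}")
```

In the real toolchain the highlighted output on screen corresponds to these (entity, type, offset) spans, rendered as colored blocks over the text.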
  • SLIDE 6 [MARY]
    The NER toolchain harnesses the strengths of six existing tools into one super-tool, which outputs stronger results combined than any one of these tools could do individually.
    That said, users have the option to select or deselect tools before processing a document through the NER toolchain.
    And now I’ll turn it over to my colleague Rose, who will talk about the testing of the NER toolchain...
  • SLIDE 7 [ROSE]
    We used 7 different types of written documents to test the NER toolchain my colleague Mary just described:
    Chapter from a fiction book
    Interview transcript
    Metadata descriptions
    Press release
    Artist cv/resume
    Portion of an EAD finding aid
    And a diary -- the diary of art historian and critic Mary Berenson, whose papers are held, along with those of her husband Bernard Berenson, at Villa I Tatti, The Harvard University Center for Italian Renaissance Studies
    While Mary worked in the shadow of her more renowned husband, she is credited with having had significant influence over his scholarly work, as well as cultivating relationships with the intellectuals, artists and collectors who surrounded the couple while they lived in Florence.
    Because of our partnership with I Tatti and their interest in knowing more about Mary Berenson’s diaries, we decided to take a closer look at one of them
  • SLIDE 8 [ROSE]
    Why Berenson’s diaries? As I Tatti archivist Ilaria della Monica puts it: “Mary recorded the travels she undertook with her second husband Bernard Berenson to visit museums, churches and private collections. She also took notes on books, music and the people they met.”
    Her diaries are thus rich in information, full of useful entities like the names of artwork and artists, places and people, books and theories
    These entities are helpful because they reconstruct the world and social orbit that Berenson moved within.
    This information is highly valued by I Tatti researchers and staff
  • SLIDE 9 [ROSE]
    We began by choosing the 1903 diary of the Berensons’ trip to America, which I Tatti researchers were particularly interested in knowing more about

  • SLIDE 10 [ROSE]
    I Tatti provided us with a transcribed and OCR’ed PDF of the diary
  • SLIDE 11 [ROSE]
    And once we had the diary, we began building a sort of dictionary of terms that we could use to classify entities. We’ll show you some of these terms later.

  • SLIDE 12 [ROSE]
    We also began manually extracting entities from the document - and by entities I mean names of people, pieces of art, places they visited, and so on.
  • SLIDE 13 [ROSE]
    This manual extraction informed the classification dictionary and vice versa, so the development of both happened concurrently
  • SLIDE 14 [ROSE]
    Once this manual extraction was complete, we ran the diary through the NER toolchain...
  • SLIDE 15 [ROSE]
    ...and compared the toolchain output to the results of manual extraction
  • SLIDE 16 [ROSE]
    You’ve seen this image before. This is what the first page of Mary Berenson’s diary looked like after being processed by the NER toolchain.
  • SLIDE 17 [ROSE]
    And here you can see the classification of entities
    On the left are the entities that DADAlytics looks for: date, location, organization, person and miscellaneous
    miscellaneous can be domain specific depending on what you choose to be relevant to your research needs
    On the right side of this slide, you can see the nuances that became available when the miscellaneous entities were further categorized - for example, this differentiation between drawings, paintings, and murals
  • SLIDE 18 [SARAH]
    Thank you Rose, for describing the process for preparing the Mary Berenson document. As previously mentioned by Mary, I will talk about the results of testing the DADAlytics NER tool with the Mary Berenson diary.
    This screenshot here shows the overall results of which tools picked up which percentages of exact matches, partial matches, and no matches, out of the 3273 entities that were extracted manually from the diary.
    All of this information - including the extraction results from the 6 other typologies of documents - is available on our website (semlab.io/DADAlytics-ner-evaluation) as well as our github repository (https://github.com/SemanticLab/DADAlytics-ner-evaluation)
  • SLIDE 19 [SARAH]
    Clicking on “more” shows each entity that was manually marked up and whether it was matched exactly, partially, or not at all by a given tool.
    This slide shows examples of the results of the NLTK tool on the Mary Berenson diary.
    Manually extracted entities were compared to the entities detected by the DADAlytics NER tool through a series of python scripts written by Semantic Lab co-director Matt Miller. These python scripts can also be found in our github repository
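The exact/partial/no-match bookkeeping described here can be illustrated with a small sketch. These are not the actual comparison scripts (those are in the GitHub repository); the matching rule below (string equality for exact, substring overlap for partial) and the sample entity lists are simplifying assumptions.

```python
def classify_match(manual, detected):
    """Classify one manually extracted entity against a tool's output:
    'exact' if the tool produced the identical string, 'partial' if the
    strings overlap (one contains the other), otherwise 'none'."""
    if manual in detected:          # list membership = exact string match
        return "exact"
    for d in detected:
        if manual in d or d in manual:
            return "partial"
    return "none"

# Invented sample data for illustration only.
manual_entities = ["Bryn Mawr", "Mary Berenson", "Boston"]
tool_entities = ["Bryn Mawr College", "Mary Berenson"]

tally = {"exact": 0, "partial": 0, "none": 0}
for m in manual_entities:
    tally[classify_match(m, tool_entities)] += 1
print(tally)  # -> {'exact': 1, 'partial': 1, 'none': 1}
```

Running a tally like this per tool and per document yields the match percentages shown on the results pages.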
  • SLIDE 20 [SARAH]
    So how did the DADAlytics NER tool do? This chart shows the exact match average of all the six tools combined for each document (the dark purple) as well as the additional partial match average of all the six tools combined for each document (pink).
    As you can see, the DADAlytics NER tool picked up the fewest exact and partial matches for Mary Berenson’s diary
    We believe the diary scored low because we conducted a more nuanced manual extraction of entities from it, which greatly expanded the potential matches within the miscellaneous category
    Although this significantly differentiates the results of the diary from the results of the other documents, this granular encoding process was still useful because it can reveal the specific strengths or gaps of each tool. A more in-depth study of the precision, recall, and F1 measurements for the first 100 entities of the diary has been conducted and will be made available on our website.
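For reference, entity-level precision, recall, and F1 of the kind mentioned in that study are computed as below. The counts in the example are hypothetical, not figures from the evaluation.

```python
def prf(true_positives, false_positives, false_negatives):
    """Standard entity-level precision, recall, and F1 score."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for illustration: of 100 manually extracted
# entities, a tool finds 60 correctly and produces 20 spurious hits.
p, r, f = prf(true_positives=60, false_positives=20, false_negatives=40)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# prints: precision=0.75 recall=0.60 f1=0.67
```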
  • SLIDE 21 [SARAH]
    Moving back to looking at the results as a whole, what is a typical Named-Entity evaluation metric to which we can compare our results?
    In a 2018 article David S. Batista cites a metric of 50% precision for exact matches and 66% precision for partial matches as NER performance averages.
    Even though the results of DADAlytics NER tool don’t meet those evaluation metric thresholds with all types of documents, the NER tool is useful in doing the heavy lifting of recognizing named entities, especially in large amounts of texts, which lessens the researcher’s manual workload.
    This task becomes more powerful in conjunction with the second component of the DADAlytics tool package, Sélavy
  • SLIDE 22 [SARAH]
    Sélavy is a document analysis tool that will support the generation of linked data from text. While the NER tool is powerful in identifying entities, the strength of the Sélavy tool is in its ability to relate the entities to one another.
    This module is still in development by Matt Miller, but I’ll show a few screenshots in these last slides to give an idea of how this tool might be used.
  • SLIDE 23 [SARAH]
    The first step in using Sélavy is to determine the text blocks that make up a document.
    For example, the text block of Mary Berenson’s diary is a day.
    The Sélavy tool can automatically detect some document structure, and the user can also refine the structure using regular expressions.
    An example of how this can be useful is that, in the case of the diary, an entity can now be related to a specific day rather than to the whole diary.
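A regular-expression block split of the kind described might look like the following sketch. The day-heading pattern and the short diary excerpt are invented for illustration; the real Sélavy interface lets the user supply and refine such a pattern interactively.

```python
import re

# Invented excerpt standing in for a transcribed diary page, where
# each entry begins with a day heading like "May 28".
diary = """May 28 Arrived in Boston with Bernard.
Visited the museum in the afternoon.
May 29 Lunch at Bryn Mawr.
May 30 Quiet day of reading."""

# Split at the start of any line that opens with "<Month> <day>",
# using a zero-width lookahead so the headings stay with their blocks.
DAY = re.compile(r"^(?=[A-Z][a-z]+ \d{1,2}\b)", re.MULTILINE)
blocks = [b.strip() for b in DAY.split(diary) if b.strip()]

for block in blocks:
    print("BLOCK:", block.splitlines()[0])
```

Once the text is segmented this way, an entity can be related to the specific day block it appears in rather than to the diary as a whole.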
  • SLIDE 24 [SARAH]
    A traditional “find and replace” can also be used to clean up a document. For example, the administrative text that was on each page of the original PDF was removed so that it would not be run through Sélavy.
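A scripted equivalent of that clean-up step might look like this; the page-header text below is a hypothetical stand-in for the administrative text mentioned above.

```python
import re

# Invented page with a recurring administrative header line.
page = """ARCHIVE COPY - DO NOT CIRCULATE
May 28 Arrived in Boston with Bernard."""

# Find-and-replace: delete the header line wherever it starts a line.
cleaned = re.sub(r"^ARCHIVE COPY.*\n", "", page, flags=re.MULTILINE)
print(cleaned)
```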
  • SLIDE 25 [SARAH]
    Once a document has been sufficiently transformed, it is pushed through the DADAlytics NER tool for entity recognition, and then pushed back into the Sélavy module.
  • SLIDE 26 [SARAH]
    This slide shows an example of how Sélavy picked up the entity Bryn Mawr (a college in Pennsylvania) 18 times during the processing of the text
    This is where the knowledge of a domain expert comes into play. A user can review the detected entities and decide whether or not to include them in a linked data set.
    The curated entities can then be downloaded as a file in RDF, a linked data format.
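As a sketch of what an RDF export of curated entities could contain, here is a minimal N-Triples serializer. The example.org URIs and the type vocabulary are placeholders, not the actual DADAlytics output format.

```python
def to_ntriples(entities):
    """Serialize (label, entity_type) pairs as N-Triples, one simple
    line-oriented RDF syntax. URIs here are illustrative placeholders."""
    lines = []
    for i, (label, etype) in enumerate(entities):
        subject = f"<http://example.org/entity/{i}>"
        # One triple for the human-readable label...
        lines.append(
            f'{subject} <http://www.w3.org/2000/01/rdf-schema#label> "{label}" .'
        )
        # ...and one relating the entity to its curated type.
        lines.append(
            f"{subject} <http://example.org/vocab/type> "
            f"<http://example.org/vocab/{etype}> ."
        )
    return "\n".join(lines)

curated = [("Bryn Mawr", "organization"), ("Mary Berenson", "person")]
print(to_ntriples(curated))
```

Each curated entity becomes two triples, which downstream linked data tools can load directly.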
  • SLIDE 27 [SARAH]
    Complete the development of the Sélavy module
    Test Sélavy using Mary Berenson’s diary, and then on other types of documents (interviews, finding aids, etc.)
    Evaluate the tool with the intended community of users
    Review and refine the tool and workflow
    Apply methodology to other Semantic Lab projects
  • SLIDE 28 [SARAH]
    Thank you so much for your time and attention. It has been a pleasure for us to be here. We’re happy to take any questions.
