Documents as Data: Harvesting Knowledge from Textual Resources with DADAlytics

This presentation reports on the development and assessment of DADAlytics, a semantic tool created by the Semantic Lab at Pratt to help librarians, archivists, and humanities scholars generate linked data from textual resources and descriptive records. It describes the overall design of the tool and outlines the methods adopted to make it intuitive and flexible. It also addresses ongoing assessment activities conducted through piloting and testing. The testbed is the collection of personal diaries of Mary Berenson, part of the Bernard and Mary Berenson Papers (1880-2002) held at the Berenson Library at the Villa I Tatti in Florence, Italy.

  1. Documents as Data: Harvesting Knowledge from Textual Resources with DADAlytics. Mary Mann, Sarah Ann Adams, Rose Gold, Ilaria Della Monica, M. Cristina Pattuelli. Qualitative and Quantitative Methods in Libraries, Florence, May 28 - June 1, 2019. Semantic Lab at Pratt Institute, @semlabteam, bit.ly/QQMLSemLab
  2. What is Linked Open Data. LOD: Recommended best practices for exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web, conceived by Tim Berners-Lee in 2006. Diagrams by Sarah Ann Adams.
  3. Linked Data Obstacles: availability of easy-to-use tools; technological understanding. How DADAlytics Helps: intuitive data service; lowers barrier to LOD creation.
  4. What is DADAlytics. Partners: Carnegie Hall, Tulane University, University of Minnesota, Harvard University, Villa I Tatti, Whitney Museum of American Art. Named-Entity Recognition (NER) Module [diagram labels: organization, location, date, person, misc]. Sélavy Document Analysis Tool [diagram labels: Title, Subtitle, Body]. Diagrams by Sarah Ann Adams.
  5. [Image-only slide; Noun Project icon credited on slide 28]
  6. The Six DADAlytics NER Tools
     Tool | Tool type | Training data | Further info
     DBpedia Spotlight | NLP tool with NER component | DBpedia resources (Wikipedia-extracted structured content) | dbpedia-spotlight.org
     Stanford NLP | NLP tool with NER component | Mix of CoNLL, MUC-6, MUC-7, and ACE named entity corpora, using the english.muc.7class.distsim.crf.ser.gz classifier | nlp.stanford.edu
     NLTK | NLP tool with NER component | Groningen Meaning Bank corpus | nltk.org
     SpaCy | NLP tool with NER component | OntoNotes and Common Crawl | spacy.io
     OpeNER Project | NLP tool with NER component | Apache OpenNLP models | opener-project.eu
     TensorFlow Syntaxnet | Neural network part-of-speech tagger | Parsey McParseface | research.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html
  7. DADA•Berenson: Mary Berenson and her Diaries. Mary (Whitall) Berenson: art historian, art critic; wife of art historian Bernard Berenson; influenced Bernard’s work; archive held at Villa I Tatti. Images: Mary Berenson [1885], Public Domain, held at National Portrait Gallery; Mary and Bernard Berenson near Fernhurst, England, 1898, courtesy of the Villa I Tatti Berenson Library.
  8. DADA•Berenson: Villa I Tatti Diary Project. NAMES? PLACES? ARTISTS? WORKS OF ART? Photograph courtesy of the Villa I Tatti Berenson Library.
  9-15. Methodology (slides 9-15 reveal the steps one at a time)
     1] DIARY SELECTION
     2] HANDWRITTEN DIARY TRANSCRIBED TO A DIGITAL DOCUMENT
     3] CLASSIFICATION OF MISCELLANEOUS ENTITY TYPES
     4] MANUALLY EXTRACT ENTITIES
     5] RUN DIARY TEXT THROUGH DADALYTICS NER MODULE
     6] COMPARE MANUAL EXTRACTION TO DADALYTICS OUTPUT
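Step 6 of the methodology compares the manually extracted entities with the DADAlytics output. The slides do not say how that comparison was implemented, so the following is only a minimal Python sketch of one way it could be done, assuming both sides are available as (text, type) pairs. The `Entity` type, the `compare_extractions` function, and the sample entities are hypothetical illustrations, not the Semantic Lab's code or data.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    text: str   # surface form as it appears in the diary
    label: str  # entity type, e.g. "Person" or "Location"


def normalize(entity: Entity) -> Entity:
    """Lower-case and collapse whitespace so trivial differences do not count as misses."""
    return Entity(" ".join(entity.text.lower().split()), entity.label.lower())


def compare_extractions(manual, tool):
    """Split entities into those found by both, only by hand, or only by the tool."""
    manual_set = {normalize(e) for e in manual}
    tool_set = {normalize(e) for e in tool}
    return {
        "matched": manual_set & tool_set,   # in both lists
        "missed": manual_set - tool_set,    # found by hand, not by the tool
        "spurious": tool_set - manual_set,  # found by the tool, not by hand
    }


# Hypothetical sample data, for illustration only.
manual = [
    Entity("Bernard Berenson", "Person"),
    Entity("Florence", "Location"),
    Entity("Villa I Tatti", "Location"),
]
tool = [
    Entity("bernard  berenson", "PERSON"),
    Entity("Florence", "Location"),
    Entity("Mary", "Person"),
]

result = compare_extractions(manual, tool)
share_found = len(result["matched"]) / len(manual)
print(f"Share of manual entities found by the tool: {share_found:.2%}")
```

Treating the manual list as the reference makes the "missed" set the most useful output for reviewers, since it shows what a given NER tool fails to pick up in diary text.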
  16. DADAlytics NER Demo
  17. Entity Classification. DADAlytics Entity Categories: Person, Location, Date, Organization, Event, Miscellaneous. Diary-Specific Entity Types: Literature, Music, Poetry, Theater, Non-Fiction, Visual Art, Art Described by Era, Art Described by Region, Drawing, Painting, Photography, Pottery, Print, Sculpture, Stained Glass, Textile, Mural, Art Collection, Biographic, Cultural, Historic, Article, Journal, Lecture, Magazine, Newspaper, Thesis.
  18-19. Extraction Comparison Results (slide 19 shows the NLTK example): semlab.io/DADAlytics-ner-evaluation/
  20-21. Analysis of Results. [Chart; the values appear without their labels in this transcript] 100.00%, 54.50%, 11.17%, 57.33%, 38.00%, 54.50%, 29.33%, 21.67%, 20.50%, 12.67%, 11.50%, 35.00%, 24.17%, 38.50%, 39.67%, 75.00%, 23.83%, 68.83%, 73.00%, 78.67%, 67.83%, 61.33%. Slide 21 adds: 66.00% 50.00% [For partial matches] [For exact matches]. Source cited on slide 21: Batista, D. (2018, May 9). Named-Entity evaluation metrics based on entity-level. Retrieved from www.davidsbatista.net/blog/2018/05/09/Named_Entity_Evaluation/
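The results slide distinguishes exact matches from partial matches. As a rough illustration of that distinction, the Python sketch below counts a reference entity as an exact match when a predicted span has the same boundaries, and as a partial match when a predicted span merely overlaps it. This is a simplified assumption: it ignores entity labels, it is not the full scheme described in the cited post, and it is not necessarily how the percentages on the slide were computed. The `Span` type, the `match_rates` function, and the sample offsets are hypothetical.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str


def overlaps(a: Span, b: Span) -> bool:
    """True when the two character ranges share at least one character."""
    return a.start < b.end and b.start < a.end


def match_rates(gold, predicted):
    """Share of gold spans matched exactly, and share matched at least partially.

    Boundaries only: labels are ignored in this sketch.
    """
    exact = 0
    partial = 0
    for g in gold:
        if any(p.start == g.start and p.end == g.end for p in predicted):
            exact += 1
            partial += 1  # an exact match also counts as (at least) partial
        elif any(overlaps(g, p) for p in predicted):
            partial += 1
    total = len(gold)
    return {
        "exact": exact / total if total else 0.0,
        "partial": partial / total if total else 0.0,
    }


# Hypothetical offsets, for illustration only.
gold = [
    Span(0, 13, "Person"),
    Span(25, 33, "Location"),
    Span(40, 53, "Location"),
    Span(60, 64, "Date"),
]
predicted = [
    Span(0, 13, "Person"),     # same boundaries as the first gold span
    Span(25, 30, "Location"),  # overlaps the second gold span
    Span(70, 75, "Person"),    # overlaps nothing
]

rates = match_rates(gold, predicted)
print(rates)
```

Because every exact match is also counted as partial, the partial rate can never be lower than the exact rate.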
  22. SÉLAVY - DOCUMENT ANALYSIS TOOL. Marcel Duchamp as Rrose Sélavy (pronounced “c’est la vie”). Image: Rrose Selavy (Marcel Duchamp), 1920 © Man Ray Trust/ADAGP, Paris and DACS, London 2015. Diagram by Sarah Ann Adams [labels: Block 1, Block 2, Block 3].
  23. Turning the document into blocks
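The slides do not state how Sélavy decides where one block ends and the next begins. As a minimal sketch of the general idea, the Python below assumes blocks are separated by blank lines in a plain-text transcription; the `split_into_blocks` function and the sample text are hypothetical.

```python
import re
from typing import List


def split_into_blocks(text: str) -> List[str]:
    """Split a transcription into blocks at blank lines.

    Assumption for this sketch: one or more blank lines separate blocks.
    Line breaks inside a block are collapsed to single spaces.
    """
    parts = re.split(r"\n\s*\n", text.strip())
    return [" ".join(part.split()) for part in parts if part.strip()]


# Hypothetical sample text, for illustration only.
sample = "Diary heading\n\nFirst paragraph line one\ncontinues here.\n\n\nSecond paragraph."
blocks = split_into_blocks(sample)
for number, block in enumerate(blocks, start=1):
    print(f"Block {number}: {block}")
```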
  24. Document clean-up
  25. Processing of document through Sélavy. The text that was formatted in Sélavy is now being pushed through the NER tool for entity recognition, and will then be pulled back into the Sélavy tool for further transformation.
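The round trip described on this slide (formatted text goes out to the NER tool, and the recognized entities come back for further work) can be sketched in Python as below. This is only an illustration under stated assumptions: `gazetteer_ner` is a hypothetical stand-in that looks up a fixed list of names and is not the DADAlytics NER module, and `Block` and `annotate_blocks` are invented names.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

EntityHit = Tuple[str, str]  # (surface text, entity type)


@dataclass
class Block:
    text: str
    entities: List[EntityHit] = field(default_factory=list)


def gazetteer_ner(text: str) -> List[EntityHit]:
    """Hypothetical stand-in for an NER tool: finds names from a fixed list."""
    known = {"Florence": "location", "Bernard": "person"}
    return [(name, label) for name, label in known.items() if name in text]


def annotate_blocks(blocks: List[str], ner: Callable[[str], List[EntityHit]]) -> List[Block]:
    """Send each block's text to the NER function and keep the results with the block."""
    return [Block(text=block, entities=ner(block)) for block in blocks]


# Hypothetical sample blocks, for illustration only.
annotated = annotate_blocks(
    ["We drove to Florence with Bernard.", "Rain all day."],
    gazetteer_ner,
)
for block in annotated:
    print(block.text, "->", block.entities)
```

Passing the NER step in as a function keeps the block handling independent of any particular tool, so a different recognizer could be substituted without changing `annotate_blocks`.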
  26. Reviewing the entities
  27. Next Steps
      - Complete the development of the Sélavy module
      - Test Sélavy using Mary Berenson’s diary, and then on other types of documents (interviews, finding aids, etc.)
      - Evaluate the tool with the intended community of users
      - Review and refine the tool and workflow
      - Apply methodology to other Semantic Lab projects
  28. Thank You. Questions? Semantic Lab at Pratt Institute.
      SemLab co-directors: Prof. M. Cristina Pattuelli, Prof. Matt Miller
      SemLab team: Mary Mann, Rose Gold, Sarah Adams, Taylor Baker, Megan Lyon
      Contact: w :: semlab.io, t :: @semlabteam, e :: foaf.Person@semlab.io
      A special thank you to Ilaria Della Monica, Archivist, Villa I Tatti.
      Noun Project image credits: "tools" by Nithinan Tatah (slide 4); "personal solution" by ProSymbols (slide 4); "solution" by Gregor Cresnar (slide 4); "Person passive confused" by Margaret Hagan (slide 5).