“Visualizing Textual Data”
Drayton C. Benner
Founder/President, Miklal Software Solutions
PhD Candidate, Northwest Semitic Philology
University of Chicago
DraytonBenner@MiklalSoftware.com
Analyzing a word in a corpus
• Digital: search results
• Print: concordances
Drayton Benner | Miklal Software Solutions | University of Chicago | DraytonBenner@MiklalSoftware.com
Searching digital Bibles
Search results from Olive Tree Bible Software
(OliveTree.com) on a Samsung Galaxy S 3
smartphone. Disclaimer: I wrote the search
engine for Olive Tree Bible Software, but I did
not write the display of the search results.
Search results from Logos Bible Software (logos.com) on a PC.
Searching digital texts: KWIC display
From Perseus under Philologic at philologic.uchicago.edu
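A keyword-in-context (KWIC) display can be sketched in a few lines. This is a minimal illustration only, not the PhiloLogic implementation: the function name `kwic` and the fixed number of context words per side are assumptions made for the example.

```python
def kwic(tokens, keyword, width=3):
    """Return one display line per hit, with `width` tokens of context per side.

    Hypothetical helper: a fixed, symmetric window is the simplest KWIC policy.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-justify the left context so the key words line up in a column.
            lines.append(f"{left:>30} [{tok}] {right}")
    return lines


text = "In the beginning God created the heaven and the earth"
for line in kwic(text.split(), "created"):
    print(line)
```

A fixed symmetric window like this ignores sentence structure entirely, which is the limitation the rest of the talk addresses.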
Print concordances
• From Strong 1890
• Context chosen by hand to maximize understanding of the key word’s context given the space limitations
• Incredibly labor-intensive (tens or hundreds of person-years)
Can we unite the benefits of digital and print?
• Advantage of print
• Context chosen carefully for maximum understanding of the key word’s
context given the space constraints
• Advantages of digital
• Ability to present many texts
• Ability to present search results for any key word almost instantaneously
• Ability to allow for multiple fonts and font sizes
• Uniting the advantages of print and digital: algorithmically select the
best context for maximum understanding of the key word’s context
given the space constraints
Texts for experiment
• Bibles
• KJV (1769 edition)
• ESV (2011 edition) Old Testament/Hebrew Bible
• Novel
• Henry James, What Maisie Knew
Training data
• 500 key words chosen at random from ESV and presented to an
annotator
• A width is chosen at random, ranging from approximately what would
fit on a smartphone to a width three times as long
• All possible contexts for the key word are presented to the annotator
• The line is filled with as many words of context as will fit
• Punctuation is handled reasonably. E.g., the context cannot begin with a
period or end with an open quotation mark.
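The enumeration of possible contexts described above can be sketched as follows. This is an illustration under stated assumptions, not the software used in the experiment: width is measured in characters rather than rendered pixels, tokens are joined by single spaces, the punctuation rules are omitted, and the name `candidate_contexts` is hypothetical.

```python
def candidate_contexts(tokens, key_index, max_chars):
    """Enumerate maximal token windows that contain the key word and fit the line.

    Returns (start, end) pairs with `end` exclusive. Assumes width is counted in
    characters with single spaces between tokens; punctuation rules are omitted.
    """
    def length(start, end):
        return sum(len(t) for t in tokens[start:end]) + (end - start - 1)

    candidates = []
    for start in range(key_index + 1):
        end = key_index + 1
        if length(start, end) > max_chars:
            continue  # the span from `start` to the key word is already too wide
        # Fill the line: extend to the right while the next word still fits.
        while end < len(tokens) and length(start, end + 1) <= max_chars:
            end += 1
        # Skip windows that could still grow to the left; they are not full lines.
        if start > 0 and length(start - 1, end) <= max_chars:
            continue
        candidates.append((start, end))
    return candidates


tokens = "the quick brown fox jumps over the lazy dog".split()
for start, end in candidate_contexts(tokens, key_index=4, max_chars=15):
    print(" ".join(tokens[start:end]))
```

Each printed line is one option an annotator could be shown for the key word "jumps" at a width of 15 characters.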
Training data
Algorithm: overview
• Score each nearby word according to its relevance to understanding
the key word
• Factors determining the score
• Is the nearby word a function word or a content word?
• How much punctuation separates the key word and the nearby word?
• Contiguous punctuation counts as one
• Syntax-based measures
• How far apart are the key word and the nearby word in a phrase structure tree (= constituency-based parse tree)?
• How far apart are the key word and the nearby word in a dependency tree?
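The two surface-level factors can be sketched as below. This is a hedged illustration: the token list is assumed to hold punctuation marks as separate tokens, and the small `FUNCTION_WORDS` set is a made-up sample, not the list used in the experiment.

```python
import string

# Illustrative sample only; a real function-word list would be much longer.
FUNCTION_WORDS = {"the", "a", "an", "of", "and", "to", "in", "that", "is", "was"}
PUNCT = set(string.punctuation) | {"“", "”", "‘", "’"}


def is_function_word(token):
    return token.lower() in FUNCTION_WORDS


def punctuation_runs_between(tokens, i, j):
    """Count runs of punctuation tokens strictly between positions i and j.

    Contiguous punctuation tokens count as a single run.
    """
    lo, hi = sorted((i, j))
    runs = 0
    in_run = False
    for tok in tokens[lo + 1:hi]:
        is_punct = bool(tok) and all(ch in PUNCT for ch in tok)
        if is_punct and not in_run:
            runs += 1
        in_run = is_punct
    return runs


tokens = ["he", "said", ",", "“", "Behold", "the", "man", "”", "."]
print(punctuation_runs_between(tokens, 1, 4))  # the comma and the open quote form one run
print(is_function_word("the"), is_function_word("man"))
```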
Algorithm: phrase structure and dependency
trees
From http://en.wikipedia.org/wiki/Phrase_structure_rules
Algorithm: phrase structure and dependency
trees (cont.)
• Generated using the Stanford Parser
• Some text pre-processing, especially to replace the major
archaisms of the KJV
• Some post-processing
• Fix repetitive parsing errors algorithmically
• Restore the major archaisms of the KJV
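One way to measure how far apart two words are in a tree is to count the edges from each word up to their lowest common ancestor. The sketch below assumes the tree is stored as a list of parent indices, which is a common way to hold a dependency parse but is not the Stanford Parser's output format; whether the talk measures distance exactly this way is also an assumption.

```python
def distances_to_common_ancestor(heads, k, n):
    """Return (d_k, d_n): edges from node k and node n up to their lowest common ancestor.

    heads[i] is the parent index of node i, or -1 for the root (assumed encoding).
    The total path length between the two nodes is d_k + d_n.
    """
    def path_to_root(i):
        path = [i]
        while heads[i] != -1:
            i = heads[i]
            path.append(i)
        return path

    depth_from_k = {node: d for d, node in enumerate(path_to_root(k))}
    for d_n, node in enumerate(path_to_root(n)):
        if node in depth_from_k:
            return depth_from_k[node], d_n
    raise ValueError("nodes are not in the same tree")


# "The dog chased the cat": "chased" is the root, "dog" and "cat" depend on it,
# and each determiner depends on its noun.
heads = [1, 2, -1, 4, 2]
print(distances_to_common_ancestor(heads, 1, 4))  # dog vs. cat
print(distances_to_common_ancestor(heads, 0, 4))  # The vs. cat
```

The same function works on a phrase structure tree if its nodes, including the non-terminal ones, are given indices in the same parent-list form.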
Algorithm: scoring nearby words
The weight w for a nearby word n of a key word k is calculated as
follows:
Algorithm: scoring nearby words (cont.)
[Figure: key word $k$ and nearby word $n$, with the labels $d_{pk}$, $d_{pn}$, $d_{dk}$, and $d_{dn}$.]
Algorithm: picking the best context
• Each possible context for a key word k is evaluated as the
sum of w(k,n) for each nearby word n in the possible context.
The context with the highest sum is chosen.
• The various constants were optimized using a Monte Carlo
particle filter on the training data.
• $c_f = 1.5$; $c_p = -3.37$; $c_{pk} = 0.175$; $c_{pn} = 0.2$; $c_{dk} = 1$; $c_{dn} = 1.4$.
• The dependency tree was more important than the phrase
structure tree.
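The selection step itself is a maximization over candidate windows. In the sketch below the per-word weights are placeholder numbers, not values produced by the talk's weight formula, and the function name `best_context` and the (start, end) window encoding are assumptions for illustration.

```python
def best_context(weights, key_index, candidates):
    """Pick the candidate window whose nearby words have the highest total weight.

    weights[i] stands in for w(k, n) of the token at position i (placeholder values);
    candidates are (start, end) windows with `end` exclusive.
    """
    def score(window):
        start, end = window
        return sum(
            w for i, w in enumerate(weights[start:end], start) if i != key_index
        )

    return max(candidates, key=score)


# Made-up weights for a nine-token sentence whose key word is at position 4.
weights = [0.1, 0.5, 2.0, 1.5, 0.0, 0.3, 0.2, 1.0, 0.4]
candidates = [(2, 5), (3, 6), (4, 7)]
print(best_context(weights, key_index=4, candidates=candidates))
```

With these placeholder weights the window (2, 5) wins because the two words to the left of the key word carry most of the weight.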
Algorithm: displaying the chosen context
Demo
Results
A0 is the observed rate of agreement; Ae is the rate expected if selections were random from a uniform distribution; $S = \frac{A_0 - A_e}{1 - A_e}$.

Measure                                  ESV training set  ESV test set  Maisie test set
Algorithm matches user selection (A0)    67.8%             62.5%         47.8%
Expected algorithm matches (Ae)          27.4%             25.5%         21.9%
S                                        0.556             0.497         0.332
Inter-annotator agreement (A0)           N/A               65.8%         53.0%
Expected inter-annotator agreement (Ae)  N/A               27.0%         23.5%
S                                        N/A               0.532         0.386
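The S values in the results can be recomputed directly from the reported percentages. The short check below uses only figures from the slide; the function name and the dictionary labels are chosen for illustration.

```python
def chance_corrected_agreement(a_o, a_e):
    """S = (A0 - Ae) / (1 - Ae): agreement above chance, scaled by the room above chance."""
    return (a_o - a_e) / (1 - a_e)


# (observed agreement, expected agreement) as reported on the slide.
reported = {
    "algorithm, ESV training set": (0.678, 0.274),
    "algorithm, ESV test set": (0.625, 0.255),
    "algorithm, Maisie test set": (0.478, 0.219),
    "inter-annotator, ESV test set": (0.658, 0.270),
    "inter-annotator, Maisie test set": (0.530, 0.235),
}

for name, (a_o, a_e) in reported.items():
    print(f"{name}: S = {chance_corrected_agreement(a_o, a_e):.3f}")
```

On both test sets the algorithm's S is somewhat below the inter-annotator S (0.497 against 0.532 for the ESV, 0.332 against 0.386 for Maisie).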
Conclusion
• Using these techniques, we can marry the benefits of print and
digital!
• Context chosen for maximum understanding of the key word’s context given
the space constraints
• Ability to present search results for any key word almost instantaneously
• Ability to allow for multiple fonts and font sizes
• Ability to present well-chosen context for many texts
• As statistical parsers improve and are extended to more languages, this
technique will improve and can be applied more broadly
Possible future improvements
• Get more training data from more annotators
• Allow best context not to use all of the available
space?
• Allow for ellipses?
• Handle multiple key words?
• Reference
• Drayton C. Benner. “Marrying the Benefits of Print and Digital.” Proceedings of Digital
Humanities 2014. http://dharchive.org/paper/DH2014/Paper-845.xml
Acknowledgements
• James Covington
• Annotator for the training set and both test sets
• Rodelle Williams and D. Chris Benner
• Annotators for both test sets
• Humphrey H. Hardy
• Annotator for the ESV test set
• Samuel L. Boyd
• Annotator for the Maisie test set
