Duke University Libraries, Digital Scholarship
Text > Data, October 25




HIGH-LEVEL TEXT ANALYSIS
AND TECHNIQUES
Angela Zoss
Data Visualization Coordinator
226 Perkins Library
angela.zoss@duke.edu
DOCUMENTS AS CONTEXT
But first,

ANGELA AS CONTEXT
How I learned to love the
document.
B.A. courses:         Linguistics, Communication

M.S. courses:         Communication, Human-Computer
Interaction

Employment:           arXiv.org Administrator

Ph.D. courses:        • Bibliometrics/Scientometrics
                      • Computer Mediated Discourse Analysis
                      • Latent Structure Analysis
                      • Natural Language Processing
Now,

DOCUMENTS AS CONTEXT
Text analysis from…
• documents down to words (“low-level”)
• words up to documents (“high-level”)
Using documents to learn about
language
(or other social phenomena)
Analyzing documents as records/proxies of
language, social structures, events, etc.

Linguistic studies:
• morphology, word counts, syntax, etc. … over time (e.g., Google ngram viewer)
• language across corpora (e.g., political speeches)

Underwood, T. (2012). Where to start with text mining.
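
A minimal sketch of the "word counts across corpora" idea, assuming a simple regular-expression tokenizer and two tiny made-up corpora. The function name `relative_frequencies` and the sample sentences are hypothetical and serve only as illustration; a real study would use a full corpus and a more careful tokenizer.

```python
from collections import Counter
import re


def relative_frequencies(texts):
    """Return each word's share of all tokens in a list of texts."""
    tokens = []
    for text in texts:
        # Lowercase and keep runs of letters/apostrophes as tokens.
        tokens.extend(re.findall(r"[a-z']+", text.lower()))
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


# Two tiny made-up corpora (hypothetical data, for illustration only).
corpus_a = ["We the people will rise", "We will build"]
corpus_b = ["I will decide", "I alone decide"]

freq_a = relative_frequencies(corpus_a)
freq_b = relative_frequencies(corpus_b)

# Compare how often a few words occur in each corpus.
for word in ("we", "i", "will"):
    print(word, round(freq_a.get(word, 0.0), 3), round(freq_b.get(word, 0.0), 3))
```

Using shares of all tokens rather than raw counts keeps corpora of different sizes comparable.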
Using documents to learn about
language
  Historical culturomics of pronoun frequencies
Using documents to learn about
language
 Universal properties of mythological networks
Using language to learn about
documents
Analyzing documents as artifacts themselves, with
their own properties and dynamics

Literary, documentary studies:
Structural/rhetorical/stylistic analysis
Document categorization, classification
Detecting clusters of document features (topic
modeling)


Underwood, T. (2012). Where to start with text mining.
Using language to learn about
documents
   Literary Empires, Mapping Temporal and
         Spatial Settings in Swinburne
Using language to learn about
documents
 Using Word Clouds for Topic Modeling Results
What are documents?
For this discussion,
     digital versions of works of
     spoken or written language
Examples:
     books, articles, transcripts, emails, tweets…
Documents as context
Documents have:
• form(at)
• style
• provenance
• entities
• intentions
STUDIES OF DOCUMENTS
Why study documents?
• Describe a corpus
• Compare/organize documents
• Locate relevant information/filter out
  irrelevant information
Describing a corpus
• Finding regularities/differences across
  groups of documents
• Developing theories of structure, style, etc.
  that can then be tested or applied
• May be manual (content analysis) or
  computer-assisted (statistical)
Example: Storylines




            http://xkcd.com/657/
Differences of
format, genre, participants…
• Articles may have sections, but these will
  vary by discipline and type of article
• Books may be fiction or non-fiction (or
  both)
• Transcripts may refer to multiple speakers,
  non-text content
• …ad infinitum
Example: Literature
Fingerprinting




 Keim, D. A., & Oelke, D. (2007). Literature fingerprinting: A new method for visual literary analysis. In IEEE
 Symposium on Visual Analytics Science and Technology, VAST 2007 (pp. 115-122).
 doi:10.1109/VAST.2007.4389004
Organizing documents
Detect similarity between documents and a
known category (or simply among
themselves)

Supports browsing, sentiment
analysis, authorship detection
Example: Bohemian Bookshelf




Thudt, A., Hinrichs, U., & Carpendale, S. (2012). The Bohemian Bookshelf: Supporting Serendipitous Book
Discoveries through
Information Visualization. In CHI '12: Proceedings of the SIGCHI Conference on Human Factors in Computing
Systems, to appear.
Similarity based on…
• common document attributes
    authorship, genre
• common language patterns
    topics, phrases
• common entity references
    characters, citations
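
One common way to measure similarity from shared language patterns is cosine similarity over word counts. The sketch below assumes a bag-of-words representation and three made-up sentences; the helper names `bag_of_words` and `cosine_similarity` are hypothetical, and other measures (or other features, such as entity references) would work just as well.

```python
from collections import Counter
import math
import re


def bag_of_words(text):
    """Count lowercase word tokens in a text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up documents, for illustration only.
doc1 = bag_of_words("the whale and the sea")
doc2 = bag_of_words("the sea and the storm")
doc3 = bag_of_words("stock prices fell")

sim_12 = cosine_similarity(doc1, doc2)  # documents sharing several words
sim_13 = cosine_similarity(doc1, doc3)  # documents sharing no words
print(round(sim_12, 3), round(sim_13, 3))
```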
Example: Quantitative
Formalism




Allison, S., Heuser, R., Jockers, M., Moretti, F., & Witmore, M. (2011). Quantitative formalism: An
experiment. Pamphlets of the Stanford Literary Lab (vol. 1).
Example: Clinton’s DNC Speech




                http://b.globe.com/TogUqq
Example: View DHQ




      http://digitalliterature.net/viewDHQ/vis3.html
Classification
• assigning an object to a single class
• often supervised, using an existing
  classification scheme and a tagged corpus
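
A minimal sketch of supervised classification from a tagged corpus, assuming a multinomial Naive Bayes model with add-one smoothing. This is one of many possible classifiers, not a method named in the talk; the labels, training sentences, and function names (`train`, `classify`) are hypothetical.

```python
from collections import Counter, defaultdict
import math
import re


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def train(tagged_corpus):
    """Build per-class word counts from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in tagged_corpus:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Assign the text to the single most probable class."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        # Log prior: share of training documents carrying this label.
        score = math.log(doc_counts[label] / total_docs)
        class_total = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (class_total + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Tiny made-up tagged corpus, for illustration only.
tagged = [
    ("the ship sailed across the stormy sea", "fiction"),
    ("the captain feared the whale", "fiction"),
    ("the study measured sample variance", "nonfiction"),
    ("results show the method improves accuracy", "nonfiction"),
]
model = train(tagged)
print(classify("the whale and the sea", model))
print(classify("the variance of the sample", model))
```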
Example: Relative signatures




Jankowska, M., Keselj, V., & Milios, E. (2012). Relative n-gram signatures: Document visualization at the level
of character n-grams. In Proceedings of IEEE Conference on Visual Analytics Science and Technology 2012
(pp. 103-112).
Categorization
• assigning documents to one or more
  categories
• suggestive of unsupervised clustering
  techniques
• design choices made to fit particular tasks
  or goals
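
A minimal sketch of unsupervised categorization in which a document may land in more than one category. It assumes a greedy "leader" clustering with Jaccard overlap of word sets; the threshold value, the sample titles, and the function names are hypothetical design choices made for this illustration, which is exactly the kind of task-specific decision the slide refers to.

```python
import re


def word_set(text):
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """Share of words two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def categorize(documents, threshold=0.2):
    """Group document indices; the first index in each group is its leader.

    A document joins every existing group whose leader it resembles,
    so it can belong to more than one category. If it resembles none,
    it starts a new group.
    """
    clusters = []
    for i, doc in enumerate(documents):
        words = word_set(doc)
        joined = False
        for cluster in clusters:
            leader_words = word_set(documents[cluster[0]])
            if jaccard(words, leader_words) >= threshold:
                cluster.append(i)
                joined = True
        if not joined:
            clusters.append([i])
    return clusters


# Made-up titles, for illustration only.
docs = [
    "gene expression in cancer cells",
    "network models of social media",
    "network models of gene expression",
    "medieval poetry manuscripts",
]
groups = categorize(docs)
print(groups)
```

Raising the threshold produces more, tighter categories; lowering it produces fewer, broader ones with more overlap.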
Example: UCSD Map of
Science




Börner, K., Klavans, R., Patek, M., Zoss, A. M., Biberstine, J. R., Light, R. P., Larivière, V., &
Boyack, K. W. (2012). Design and update of a classification system: The UCSD Map of Science. PLoS
ONE, 7(7), e39464.
Example: NIH Map Viewer




        https://app.nihmaps.org/nih/browser/
Reference
systems, infrastructure
What do we gain by adding structure?

What do we lose?
SUMMARIZING DOCUMENTS
Text is only one component of a document.

Research questions often push us to be
creative with how we operationalize
constructs.

The richness of language and documents is
best preserved by using
multiple, complementary approaches.
QUESTIONS?
angela.zoss@duke.edu


Editor's Notes

  • #22 why categorize/organize?