DH Tools Workshop #1: Text Analysis


A text extraction workshop delivered by Cameron Buckner on Friday, October 18th, 2012 as part of the University of Houston Digital Humanities Initiative.

Published in: Education


  1. 1. DH TOOLS: Introduction to Text Analysis
     Cameron Buckner, Visiting Assistant Professor, Department of Philosophy, cjbuckner@uh.edu
  2. 2. Our Initiative
     • Promote, facilitate, interact
       • Reading group
       • Tools workshops
       • Speaker series
       • Grantwriting support
       • Infrastructure advocacy
     http://www.uh.edu/class/digitalhumanities/
  3. 3. Roadmap
     Goal today: Analyze texts using cutting-edge analyses from computational psycholinguistics with an off-the-shelf tool, word2word
     1. What can you do with text analysis?
     2. A little bit of theory: Semantic spaces
     3. BEAGLE: The holographic lexicon
     4. MDS: Visualizing multidimensional networks
     5. Examples
     6. Hands-on play
  4. 4. What is DH?
     • Computation and interpretation
       • The use of computational tools for the production, exploration, analysis, and dissemination of humanistic knowledge
       • Thread common between new and old: pattern recognition
     • Includes
       • Digitization and archiving, markup
       • Analysis & visualization
       • Search & dissemination
       • Pedagogy
  5. 5. Methods of Text Analysis I
     • Statistical analysis, information extraction, machine learning
       • Syntactic: word frequencies (Google n-grams), vocabulary usage, stylometry (authorship and genre), PageRank
     http://www.nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html
  6. 6. Methods of Text Analysis II
     • Semantic: tf-idf, latent semantic analysis, latent Dirichlet allocation, entropy-based measures, ontologies
       • Aim to model relevance, semantic similarity, taxonomic relationships, object properties and relations
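As an illustration of the simplest measure named on this slide, here is a minimal tf-idf sketch. The three toy documents and the unsmoothed weighting (term frequency times log(N / document frequency)) are assumptions chosen for illustration; real toolkits usually add smoothing and normalization.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Toy corpus (hypothetical, for illustration only).
docs = [
    "the dragon guards the wall",
    "the senator praised the economy",
    "the dragon burned the city",
]
weights = tf_idf(docs)
```

A term that appears in every document ("the") gets weight zero, while a term unique to one document ("wall") outweighs one shared by two ("dragon"). That is the sense in which tf-idf models relevance rather than raw frequency.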
  7. 7. Reminders
     • Be creative and have fun, but if you want to publish…
     • Be principled:
       • Junk in, junk out
       • Always know assumptions required by a method
       • Analyses should hold up under trivial transformations of data representation
       • Be prepared for pragmatic design decisions
       • Go in with hypotheses and structured questions
       • Confirm with careful humanistic interpretation
  8. 8. The Mental Lexicon
     • A “mental dictionary”
       • Contains information about:
         • Word meaning, grammatical roles, taxonomic relations, typical properties
       • Behavioral indicators: recognition speed, synonymy and relevance judgments, priming, frequency effects, categorization
  9. 9. BEAGLE
     • A model that learns (unsupervised) a holographic mental lexicon automatically from text
     • History: Two approaches to semantic analysis
       • Co-occurrence based measures (“bag of words”, LSA, tf-idf)
         • Good at determining relevance, bad at determining roles and relations
       • Order-based measures (n-gram models, generative grammars, hidden Markov models)
         • Good at identifying grammatical and structural relations, bad at identifying relevance and meaning
     • Challenge: Can the two be combined?
  10. 10. Context + Role
      • Assumption: People acquire an idiosyncratic mental lexicon from patterns of co-occurrence and syntactic relationships they encounter in natural language.
        • “You shall know a word by both the company it keeps and how it keeps it.”
      • Goal: If we could build a representation of a text’s context/role distributions, we could predict the structure of a mental lexicon that produced a corpus and/or that would be produced by it
        • Texts as “mental fingerprints”
  11. 11. How Holograms Work
  12. 12. Basic Vector Approach
      1. Start with a multi-dimensional vector space
      2. Each term meaning is initially represented by a random, constant environment vector and an empty memory vector
      3. Associations between terms can be represented by adding or averaging their environment vectors into their memory vectors
      4. Each time terms co-occur, their memory vectors become closer in multi-dimensional similarity space
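The four steps on this slide can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the word2word or BEAGLE implementation: the dimensionality (256), the Gaussian environment vectors, the sentence-wide co-occurrence window, and the three example sentences are all chosen here for illustration.

```python
import math
import random

DIM = 256  # assumed dimensionality; the slide does not fix one

def random_vector(rng, dim=DIM):
    """Random environment vector with elements drawn from N(0, 1/dim)."""
    return [rng.gauss(0.0, 1.0 / math.sqrt(dim)) for _ in range(dim)]

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def learn_context(sentences, dim=DIM, seed=0):
    """Add each word's sentence neighbours into its memory vector."""
    rng = random.Random(seed)
    env, mem = {}, {}
    for sentence in sentences:
        words = sentence.lower().split()
        for w in words:
            if w not in env:
                env[w] = random_vector(rng, dim)  # constant once assigned
                mem[w] = [0.0] * dim              # memory starts empty
        for i, w in enumerate(words):
            for j, other in enumerate(words):
                if i != j:
                    mem[w] = [m + e for m, e in zip(mem[w], env[other])]
    return env, mem

# Toy corpus (hypothetical).
sentences = ["dog chases cat", "dog chases ball", "senator gives speech"]
env, mem = learn_context(sentences)
```

Here "cat" and "ball" occur in identical contexts, so their memory vectors coincide, while "speech" shares no neighbours with either and stays close to orthogonal. Comparing `cosine(mem["cat"], mem["ball"])` with `cosine(mem["cat"], mem["speech"])` shows the effect described in step 4.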
  13. 13. Representing Order Info
      • Convolution: compressing outer-product matrix of two term vectors so that the product contains recoverable information about both
      • Example: z = x * y
        • Association vector z contains information about both x and y
        • Can (approximately) reconstruct source vector y by probing z (deconvolution) with x (and vice versa)
      • Combined BEAGLE memory vector: Context memory comes from vector addition, and order information comes from n-gram binding using convolution
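A small sketch of the binding idea, assuming circular convolution as the compression of the outer product and circular correlation as the approximate inverse (the "probing" step). The dimensionality, random seed, and Gaussian vectors are assumptions for illustration.

```python
import math
import random

DIM = 256  # assumed dimensionality

def cconv(x, y):
    """Circular convolution: z[i] = sum_j x[j] * y[(i - j) mod n]."""
    n = len(x)
    return [sum(x[j] * y[(i - j) % n] for j in range(n)) for i in range(n)]

def ccorr(x, z):
    """Circular correlation: approximately undoes cconv(x, .) for random x."""
    n = len(x)
    return [sum(x[j] * z[(i + j) % n] for j in range(n)) for i in range(n)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

rng = random.Random(1)

def rand_vec():
    return [rng.gauss(0.0, 1.0 / math.sqrt(DIM)) for _ in range(DIM)]

x, y, other = rand_vec(), rand_vec(), rand_vec()
z = cconv(x, y)        # association vector: same length as x and y
decoded = ccorr(x, z)  # probe z with x to get a noisy copy of y
```

The decoded vector is only a noisy reconstruction: its cosine with `y` is well above its cosine with the unrelated vector `other`, but well below 1. In practice the noisy result is compared against all known vectors to find the best match.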
  14. 14. Combined Memory Vector
      • m = memory vector
      • e = initial random environment vector
      • p = position in sentence
      • lambda = constant chunking factor (size of n-gram window)
      • bind i,j = a non-commutative convolution of constant order vector with other environment vectors in n-gram
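One way the symbols on this slide might fit together is sketched below, loosely following the descriptions on slides 13 and 14. Treat every detail as an assumption: the name `phi` for the constant order vector, the use of two fixed random permutations to make convolution non-commutative, the left-to-right chaining of n-gram bindings, the small dimensionality, and the toy sentence. The combined update is context (sum of the other words' environment vectors) plus order (sum of bindings for every n-gram of up to lambda words that contains the target position).

```python
import math
import random

DIM = 64  # assumed dimensionality (kept small so pure Python stays fast)
LAM = 5   # assumed chunking factor: longest n-gram considered

def cconv(x, y):
    """Circular convolution of two equal-length vectors."""
    n = len(x)
    return [sum(x[j] * y[(i - j) % n] for j in range(n)) for i in range(n)]

def make_bind(dim, rng):
    """Non-commutative bind: permute each argument differently, then convolve."""
    p1 = rng.sample(range(dim), dim)
    p2 = rng.sample(range(dim), dim)
    def bind(a, b):
        return cconv([a[k] for k in p1], [b[k] for k in p2])
    return bind

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def encode_sentence(words, env, mem, phi, bind, lam=LAM):
    """Add context and order information for one sentence into mem.

    Returns the number of n-gram bindings formed at each position.
    """
    n = len(words)
    binding_counts = []
    for p, word in enumerate(words):
        update = [0.0] * len(phi)
        # Context: sum the environment vectors of the other words.
        for q, other in enumerate(words):
            if q != p:
                update = add(update, env[other])
        # Order: bind every n-gram (2..lam words) containing position p,
        # with the target word replaced by the constant placeholder phi.
        count = 0
        for size in range(2, lam + 1):
            for start in range(max(0, p - size + 1), min(p, n - size) + 1):
                slots = [phi if k == p else env[words[k]]
                         for k in range(start, start + size)]
                chunk = slots[0]
                for vec in slots[1:]:
                    chunk = bind(chunk, vec)
                update = add(update, chunk)
                count += 1
        mem[word] = add(mem[word], update)
        binding_counts.append(count)
    return binding_counts

rng = random.Random(0)

def rand_vec():
    return [rng.gauss(0.0, 1.0 / math.sqrt(DIM)) for _ in range(DIM)]

words = "a dog bit the mailman".split()  # toy sentence, illustration only
env = {w: rand_vec() for w in words}
mem = {w: [0.0] * DIM for w in words}
phi = rand_vec()
bind = make_bind(DIM, rng)
binding_counts = encode_sentence(words, env, mem, phi, bind)
```

For this five-word sentence with lambda = 5, the words in the middle take part in more n-grams than the words at the edges, so `binding_counts` depends on the position p as well as on lambda.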
  15. 15. Resonance retrieval…
  16. 16. So, BEAGLE method
      1. Choose number of dimensions for vector space, size of n-gram window for order info
      2. Clean up source documents using standard NLP (stop words, stemmers, etc.)
      3. Learn context and order vectors from corpus, combine
      4. Select words of interest
      5. Visualize multi-dimensional space using favorite method (e.g. MDS)
  17. 17. Limitations of BEAGLE
      • Only considers 1-sentence windows
      • Lexical ambiguity
      • Valence (e.g. synonyms, antonyms)
  18. 18. MDS
      • A way to view a multi-dimensional similarity space
      • Collapses the multi-dimensional space in a way that tries to preserve the mutual distances between vectors
        • Collapsing dimensions often reveals most significant [higher-order] dimensions
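To make the idea concrete, here is a sketch of classical (Torgerson) MDS, one of several MDS variants and not necessarily the one word2word uses. It double-centers the squared distance matrix and extracts the top eigenvectors by power iteration so that it runs on the standard library alone. The four 3-D points are hypothetical and were chosen to lie on a plane, so a 2-D layout can reproduce their distances.

```python
import math
import random

def classical_mds(dist, k=2, iters=200, seed=0):
    """Embed n items in k dimensions from an n x n symmetric distance matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double centering: B = -1/2 * J * D^2 * J (dist is assumed symmetric).
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * k for _ in range(n)]
    for axis in range(k):
        # Power iteration for the current largest eigenvalue of B.
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigval = sum(x * y for x, y in zip(v, bv))  # Rayleigh quotient, |v| = 1
        scale = math.sqrt(max(eigval, 0.0))
        for i in range(n):
            coords[i][axis] = scale * v[i]
        # Deflate so the next pass finds the next eigenvector.
        b = [[b[i][j] - eigval * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords

# Hypothetical 3-D points lying on the plane z = x + y.
points = [(0, 0, 0), (3, 0, 3), (3, 1, 4), (0, 1, 1)]
dist = [[math.dist(a, b) for b in points] for a in points]
coords = classical_mds(dist, k=2)
```

Because these points are exactly planar, the 2-D coordinates reproduce every pairwise distance. With real semantic vectors the low-dimensional layout is only an approximation, which is why the slide says MDS "tries to" preserve distances.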
  19. 19. Uses
      • How do two academic reference works compare in their coverage of a discipline?
        • Biases? Overlap?
      InPhO-Semantics. Credit: Robert Rose
  20. 20. Black = SEP, Red = IEP. Credit: Jun Otsuka
  21. 21. Political rhetoric
      • What can we learn from the “semantic space” derived from a party or candidate’s rhetoric?
        • Central issues?
        • Key comparisons?
        • Ideological focus/big tent?
        • Location on ideological spectrum?
      • Example: compare speeches from Republican and Democratic political conventions
  22. 22. Heat Map: Terms most diagnostic of a speech’s being delivered by a Democrat
      “Hotter” indicates more diagnostic in comparison. Hottest terms = aarp, experience, affordable, abuelo, billionaires, afghanistan, beijing, biofuels, aliens
  23. 23. Character Analysis
      • Moretti: “protagonist is the character that minimized the sum of the distances to all other vertices”
        • (But Moretti did it by hand!)
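The definition quoted on this slide is easy to automate. The sketch below assumes an unweighted, connected character network and uses breadth-first search to sum the shortest-path distances from each character. The network itself is a made-up toy using names from Hamlet; it is not Moretti's actual data.

```python
from collections import deque

def distance_sums(graph):
    """For each vertex, sum the shortest-path distances to all other vertices."""
    sums = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        sums[source] = sum(dist.values())
    return sums

# Toy interaction network (hypothetical edges, for illustration only).
graph = {
    "Hamlet": ["Horatio", "Claudius", "Gertrude", "Ophelia"],
    "Horatio": ["Hamlet"],
    "Claudius": ["Hamlet", "Gertrude", "Polonius"],
    "Gertrude": ["Hamlet", "Claudius"],
    "Ophelia": ["Hamlet", "Polonius"],
    "Polonius": ["Claudius", "Ophelia"],
}
sums = distance_sums(graph)
protagonist = min(sums, key=sums.get)
```

In this toy network Hamlet reaches everyone in a total of 6 steps and Claudius in 7, so Hamlet comes out as the protagonist under this definition.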
  24. 24. Character similarity analysis from A Dance with Dragons
  25. 25. Acknowledgements
      • InPhO Team
      • Brent Kievit-Kylar (word2word)
      • Mike Jones (BEAGLE)