DH Tools Workshop #1: Text Analysis


A text extraction workshop delivered by Cameron Buckner on Friday, October 18th, 2012 as part of the University of Houston Digital Humanities Initiative.


Published in: Education
  • “You shall know a word by the company it keeps” / “You shall know a word by the company it keeps and how it keeps it”
  • If you have a photographic medium that can record and reproduce not only the amount of light that strikes it but also its direction, then you can represent multiple dimensions of an object simultaneously and recall the desired dimensions by shining a reconstruction beam at different angles.
  • Snow/slow
  • SEP = Stanford Encyclopedia of Philosophy, IEP = Internet Encyclopedia of Philosophy
  • Analysis method = LSA
  • Analysis computed on composite transcripts from 2012 Democratic and Republican national conventions.
  • Transcript

    • 1. DH TOOLS: Introduction to Text Analysis • Cameron Buckner, Visiting Assistant Professor, Department of Philosophy • cjbuckner@uh.edu
    • 2. Our Initiative • Promote, facilitate, interact • Reading group • Tools workshops • Speaker series • Grant-writing support • Infrastructure advocacy • http://www.uh.edu/class/digitalhumanities/
    • 3. Roadmap • Goal today: Analyze texts using cutting-edge analyses from computational psycholinguistics with an off-the-shelf tool, word2word 1. What can you do with text analysis? 2. A little bit of theory: Semantic spaces 3. BEAGLE: The holographic lexicon 4. MDS: Visualizing multidimensional networks 5. Examples 6. Hands-on play
    • 4. What is DH? • Computation and interpretation • The use of computational tools for the production, exploration, analysis, and dissemination of humanistic knowledge • Thread common between new and old: pattern recognition • Includes • Digitization and archiving, markup • Analysis & visualization • Search & dissemination • Pedagogy
    • 5. Methods of Text Analysis I • Statistical analysis, information extraction, machine learning • Syntactic: word frequencies (Google n-grams), vocabulary usage, stylometry (authorship and genre), PageRank • http://www.nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html
    • 6. Methods of Text Analysis II • Semantic: tf-idf, latent semantic analysis, latent Dirichlet allocation, entropy-based measures, ontologies • Aim to model relevance, semantic similarity, taxonomic relationships, object properties and relations
    • 7. Reminders • Be creative and have fun, but if you want to publish… • Be principled: • Junk in, junk out • Always know assumptions required by a method • Analyses should hold up under trivial transformations of data representation • Be prepared for pragmatic design decisions • Go in with hypotheses and structured questions • Confirm with careful humanistic interpretation
    • 8. The Mental Lexicon • A “mental dictionary” • Contains information about: • Word meaning, grammatical roles, taxonomic relations, typical properties • Behavioral indicators: recognition speed, synonymy and relevance judgments, priming, frequency effects, categorization
    • 9. BEAGLE • A model that learns (unsupervised) a holographic mental lexicon automatically from text • History: Two approaches to semantic analysis • Co-occurrence based measures (“bag of words”, LSA, tf-idf) • Good at determining relevance, bad at determining roles and relations • Order-based measures (n-gram models, generative grammars, hidden Markov models) • Good at identifying grammatical and structural relations, bad at identifying relevance and meaning • Challenge: Can the two be combined?
    • 10. Context + Role • Assumption: People acquire an idiosyncratic mental lexicon from patterns of co-occurrence and syntactic relationships they encounter in natural language. • “You shall know a word by both the company it keeps and how it keeps it.” • Goal: If we could build a representation of a text’s context/role distributions, we could predict the structure of a mental lexicon that produced a corpus and/or that would be produced by it • Texts as “mental fingerprints”
    • 11. How Holograms Work
    • 12. Basic Vector Approach 1. Start with a multi-dimensional vector space 2. Each term meaning is initially represented by a random, constant environment vector and an empty memory vector 3. Associations between terms can be represented by adding or averaging their environment vectors into their memory vectors 4. Each time terms co-occur, their memory vectors become closer in multi-dimensional similarity space
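
The four steps on slide 12 can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the word2word or BEAGLE implementation: the dimensionality (`DIM`), the Gaussian initialisation, the choice to add only the *other* words' environment vectors into a word's memory, and the four-sentence toy corpus are all assumptions made for the example.

```python
import math
import random

DIM = 1024  # assumed dimensionality, chosen only to keep random noise low


def random_vector(rng, dim=DIM):
    """Random environment vector; elements drawn from N(0, 1/dim)."""
    sd = 1.0 / math.sqrt(dim)
    return [rng.gauss(0.0, sd) for _ in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def learn_context(sentences, dim=DIM, seed=0):
    """Return (environment, memory) vectors learned from tokenized sentences."""
    rng = random.Random(seed)
    env, mem = {}, {}
    for sentence in sentences:
        for word in sentence:
            if word not in env:
                env[word] = random_vector(rng, dim)  # fixed once, never updated
                mem[word] = [0.0] * dim              # memory starts empty
    for sentence in sentences:
        for i, word in enumerate(sentence):
            for j, other in enumerate(sentence):
                if i != j:
                    # Add the neighbour's environment vector into this word's memory.
                    mem[word] = [m + e for m, e in zip(mem[word], env[other])]
    return env, mem


# Hypothetical toy corpus, already tokenized and cleaned.
sentences = [
    ["dog", "chased", "cat"],
    ["dog", "bit", "cat"],
    ["senator", "gave", "speech"],
    ["senator", "wrote", "speech"],
]
env, mem = learn_context(sentences)
sim_dog_cat = cosine(mem["dog"], mem["cat"])
sim_dog_speech = cosine(mem["dog"], mem["speech"])
print(f"dog~cat: {sim_dog_cat:.2f}  dog~speech: {sim_dog_speech:.2f}")
```

In this sketch "dog" and "cat" end up close because they keep the same company ("chased", "bit"), while "dog" and "speech" stay near zero similarity because they share no neighbours.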
    • 13. Representing Order Info • Convolution: compressing the outer-product matrix of two term vectors so that the product contains recoverable information about both • Example: z = x * y • Association vector z contains information about both x and y • Can (approximately) reconstruct source vector y by probing z (deconvolution) with x (and vice versa) • Combined BEAGLE memory vector: Context memory comes from vector addition, and order information comes from n-gram binding using convolution
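
A minimal sketch of the binding and probing described on slide 13, assuming that "convolution" means circular convolution and that probing is done with circular correlation, as in holographic reduced representations. The vector size and random seed are arbitrary choices for the example. Plain circular convolution is commutative; the non-commutative binding mentioned on the next slide would need an extra step (for instance permuting the operands) that is not shown here.

```python
import math
import random


def circular_convolution(x, y):
    """Bind x and y: z[i] = sum_j x[j] * y[(i - j) mod n]."""
    n = len(x)
    return [sum(x[j] * y[(i - j) % n] for j in range(n)) for i in range(n)]


def circular_correlation(x, z):
    """Probe z with x: out[i] = sum_j x[j] * z[(i + j) mod n]."""
    n = len(x)
    return [sum(x[j] * z[(i + j) % n] for j in range(n)) for i in range(n)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


rng = random.Random(1)
n = 512                      # assumed vector size
sd = 1.0 / math.sqrt(n)
x = [rng.gauss(0.0, sd) for _ in range(n)]
y = [rng.gauss(0.0, sd) for _ in range(n)]
unrelated = [rng.gauss(0.0, sd) for _ in range(n)]

z = circular_convolution(x, y)       # association vector, same size as x and y
y_hat = circular_correlation(x, z)   # noisy reconstruction of y

sim_y = cosine(y_hat, y)
sim_unrelated = cosine(y_hat, unrelated)
print(f"recovered~y: {sim_y:.2f}  recovered~unrelated: {sim_unrelated:.2f}")
```

The reconstruction is only approximate: the recovered vector is clearly more similar to y than to an unrelated vector, but it is not an exact copy, which is why the slide says "approximately".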
    • 14. Combined Memory Vector • m = memory vector • e = initial random environment vector • p = position in sentence • lambda = constant chunking factor (size of n-gram window) • bind i,j = a non-commutative convolution of constant order vector with other environment vectors in n-gram
    • 15. Resonance retrieval…
    • 16. So, BEAGLE method 1. Choose number of dimensions for vector space, size of n-gram window for order info 2. Clean up source documents using standard NLP (stop words, stemmers, etc.) 3. Learn context and order vectors from corpus, combine 4. Select words of interest 5. Visualize multi-dimensional space using favorite method (e.g. MDS)
    • 17. Limitations of BEAGLE • Only considers 1-sentence windows • Lexical ambiguity • Valence (e.g. synonyms, antonyms)
    • 18. MDS • A way to view a multi-dimensional similarity space • Collapses the multi-dimensional space in a way that tries to preserve the pairwise distances between vectors • Collapsing dimensions often reveals most significant [higher-order] dimensions
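
One way to make the idea on slide 18 concrete is classical (Torgerson) MDS, sketched below in standard-library Python. The slides do not say which MDS variant word2word uses, so this particular algorithm is an assumption, as are the power-iteration eigen-solver, the iteration count, and the four-point example. The input is assumed to be a symmetric matrix of pairwise distances.

```python
import math
import random


def classical_mds(dist, dims=2, iters=500, seed=0):
    """Embed n items in `dims` dimensions from a symmetric n x n distance matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double-centre the squared distances to get a Gram (inner-product) matrix.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        v = [rng.random() - 0.5 for _ in range(n)]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        for _ in range(iters):  # power iteration for the leading eigenvector
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:    # nothing left to extract
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eig = sum(v[i] * bv[i] for i in range(n))  # Rayleigh quotient
        scale = math.sqrt(max(eig, 0.0))           # ignore negative eigenvalues
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflate so the next pass finds the next-largest eigenvalue.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eig * v[i] * v[j]
    return coords


# Hypothetical example: four items at the corners of a 3-by-4 rectangle.
points = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0), (0.0, 4.0)]
dist = [[math.dist(p, q) for q in points] for p in points]
embedded = classical_mds(dist)
for label, xy in zip("ABCD", embedded):
    print(label, [round(c, 3) for c in xy])
```

Because the example points already lie in a plane, the two-dimensional embedding reproduces their distances up to rotation and reflection. With high-dimensional word vectors the embedding can only approximate the original distances.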
    • 19. Uses• How do two academic reference works compare in their coverage of a discipline? • Biases? Overlap? InPhO- Semantics Credit: Robert Rose
    • 20. Black = SEP, Red = IEP Credit: Jun Otsuka
    • 21. Political rhetoric • What can we learn from the “semantic space” derived from a party or candidate’s rhetoric? • Central issues? • Key comparisons? • Ideological focus/big tent? • Location on ideological spectrum? • Example: compare speeches from Republican and Democratic political conventions
    • 22. Heat Map: Terms most diagnostic of a speech’s being delivered by a Democrat • “Hotter” indicates more diagnostic in comparison. Hottest terms = aarp, experience, affordable, abuelo, billionaires, afghanistan, beijing, biofuels, aliens
    • 23. Character Analysis• Moretti: “protagonist is the character that minimized the sum of the distances to all other vertices” • (But Moretti did it by hand!)
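
The criterion quoted on slide 23 is easy to automate once a character network exists. The sketch below assumes an unweighted, undirected, connected graph in which an edge means two characters interact, and measures distance as the number of edges on the shortest path. The five-character network is hypothetical and is not taken from any particular text.

```python
from collections import deque


def distances_from(graph, source):
    """Shortest path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def protagonist(graph):
    """Return (character with the smallest distance total, all totals).

    Assumes the graph is connected, so every total covers every character.
    """
    totals = {node: sum(distances_from(graph, node).values()) for node in graph}
    return min(totals, key=totals.get), totals


# Hypothetical interaction network (adjacency lists, symmetric).
network = {
    "A": ["B", "C", "D"],
    "B": ["A", "C"],
    "C": ["A", "B"],
    "D": ["A", "E"],
    "E": ["D"],
}
hero, totals = protagonist(network)
print(hero, totals)
```

In this toy network "A" has the smallest total distance to the other characters, so it is the protagonist under this definition.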
    • 24. Character similarity analysis from A Dance with Dragons
    • 25. Acknowledgements • InPhO Team • Brent Kievit-Kylar (word2word) • Mike Jones (BEAGLE)
