Zoss High-Level Text Analysis and Techniques


Published on

2012 Oct 25 presentation by Angela Zoss (Duke University) for Duke University Libraries' Text > Data digital scholarship series

Published in: Education
  • why categorize/organize?
  • Zoss High-Level Text Analysis and Techniques

    1. 1. Duke University Libraries, Digital Scholarship
       Text > Data, October 25
       HIGH-LEVEL TEXT ANALYSIS AND TECHNIQUES
       Angela Zoss, Data Visualization Coordinator
       226 Perkins Library
       angela.zoss@duke.edu
    2. 2. DOCUMENTS AS CONTEXT
    3. 3. But first, ANGELA AS CONTEXT
    4. 4. How I learned to love the document.
       B.A. courses: Linguistics, Communication
       M.S. courses: Communication, Human-Computer Interaction
       Employment: arXiv.org Administrator • Bibliometrics/Scientometrics
       Ph.D. • courses: Computer Mediated Discourse Analysis • Latent Structure Analysis • Natural Language Processing
    5. 5. Now, DOCUMENTS AS CONTEXT
    6. 6. Text analysis from…
       • documents down to words (“low-level”)
       • words up to documents (“high-level”)
    7. 7. Using documents to learn about language (or other social phenomena)
       Analyzing documents as records/proxies of language, social structures, events, etc.
       Linguistic studies:
       morphology, word counts, syntax, etc.
       … over time (e.g., Google ngram viewer)
       language across corpora (e.g., political speeches)
       Underwood, T. (2012). Where to start with text mining.
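The "word counts … over time" idea on this slide can be sketched in a few lines. This is only a minimal illustration, not the method behind the Google ngram viewer: the helper names (`tokenize`, `ngram_counts`, `relative_frequency`) and the two-year toy corpus are hypothetical, invented here for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(tokens, n=1):
    """Count every run of n consecutive tokens (an n-gram)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def relative_frequency(term, docs_by_year):
    """For each year, the share of all tokens that equal the given word."""
    result = {}
    for year, texts in sorted(docs_by_year.items()):
        tokens = [tok for text in texts for tok in tokenize(text)]
        result[year] = tokens.count(term) / len(tokens) if tokens else 0.0
    return result


# Hypothetical toy corpus, invented for illustration only.
corpus = {
    1900: ["He said that he would go.", "She wrote to him."],
    2000: ["She said that she would go.", "She wrote to her."],
}
she_trend = relative_frequency("she", corpus)
bigrams = ngram_counts(tokenize("to be or not to be"), n=2)
```

Dividing by the total token count per year, rather than reporting raw counts, keeps years with more text from dominating the trend.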
    8. 8. Using documents to learn about language
       Historical culturomics of pronoun frequencies
    9. 9. Using documents to learn about language
       Universal properties of mythological networks
    10. 10. Using language to learn about documents
        Analyzing documents as artifacts themselves, with their own properties and dynamics
        Literary, documentary studies:
        Structural/rhetorical/stylistic analysis
        Document categorization, classification
        Detecting clusters of document features (topic modeling)
        Underwood, T. (2012). Where to start with text mining.
    11. 11. Using language to learn about documents
        Literary Empires, Mapping Temporal and Spatial Settings in Swinburne
    12. 12. Using language to learn about documents
        Using Word Clouds for Topic Modeling Results
    13. 13. What are documents?
        For this discussion, digital versions of works of spoken or written language
        Examples: books, articles, transcripts, emails, tweets…
    14. 14. Documents as context
        Documents have:
        • form(at)
        • style
        • provenance
        • entities
        • intentions
    15. 15. STUDIES OF DOCUMENTS
    16. 16. Why study documents?
        • Describe a corpus
        • Compare/organize documents
        • Locate relevant information/filter out irrelevant information
    17. 17. Describing a corpus
        • Finding regularities/differences across groups of documents
        • Developing theories of structure, style, etc. that can then be tested or applied
        • May be manual (content analysis) or computer-assisted (statistical)
    18. 18. Example: Storylines http://xkcd.com/657/
    19. 19. Differences of format, genre, participants…
        • Articles may have sections, but these will vary by discipline and type of article
        • Books may be fiction or non-fiction (or both)
        • Transcripts may refer to multiple speakers, non-text content
        • …ad infinitum
    20. 20. Example: Literature Fingerprinting
        Keim, D. A., & Oelke, D. (2007). Literature fingerprinting: A new method for visual literary analysis. In IEEE Symposium on Visual Analytics Science and Technology, VAST 2007 (pp. 115-122). doi: 10.1109/VAST.2007.4389004
    21. 21. Organizing documents
        Detect similarity between documents and a known category (or simply among themselves)
        Supports browsing, sentiment analysis, authorship detection
    22. 22. Example: Bohemian Bookshelf
        Thudt, A., Hinrichs, U., & Carpendale, S. (2012). The Bohemian Bookshelf: Supporting Serendipitous Book Discoveries through Information Visualization. In CHI 12: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, to appear.
    23. 23. Similarity based on…
        • common document attributes: authorship, genre
        • common language patterns: topics, phrases
        • common entity references: characters, citations
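Two of the similarity bases on this slide can be sketched directly. The example below is a minimal illustration under stated assumptions: it uses cosine similarity over word counts as one possible stand-in for "common language patterns" and Jaccard overlap as one possible stand-in for "common entity references". The function names and sample inputs are hypothetical and do not come from the presentation.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Represent a document as a mapping from word to count."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[term] * b[term] for term in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard(entities_a, entities_b):
    """Shared entities divided by all entities mentioned in either document."""
    union = set(entities_a) | set(entities_b)
    if not union:
        return 0.0
    return len(set(entities_a) & set(entities_b)) / len(union)


# Hypothetical inputs, invented for illustration only.
language_score = cosine_similarity(
    bag_of_words("the whale and the sea"),
    bag_of_words("the sea and the storm"),
)
entity_score = jaccard({"Ahab", "Ishmael"}, {"Ishmael", "Queequeg"})
```

Cosine similarity ignores document length, which helps when comparing a tweet with a book chapter; Jaccard treats entities as present or absent and ignores how often each is mentioned.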
    24. 24. Example: Quantitative Formalism
        Allison, S., Heuser, R., Jockers, M., Moretti, F., & Witmore, M. (2011). Quantitative formalism: An experiment. Pamphlets of the Stanford Literary Lab (vol. 1).
    25. 25. Example: Clinton’s DNC Speech http://b.globe.com/TogUqq
    26. 26. Example: View DHQ http://digitalliterature.net/viewDHQ/vis3.html
    27. 27. Classification
        • assigning an object to a single class
        • often supervised, using an existing classification scheme and a tagged corpus
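Supervised classification from a tagged corpus can be sketched as follows. This is a minimal multinomial Naive Bayes with add-one smoothing, chosen here only as one common technique; the slide does not name a specific classifier. The `train` and `classify` helpers and the four tagged sentences are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a string and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(tagged_docs):
    """Collect per-class document and word counts from (text, label) pairs."""
    class_doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in tagged_docs:
        class_doc_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_doc_counts, word_counts, vocab


def classify(text, model):
    """Return the single class with the highest log-probability score."""
    class_doc_counts, word_counts, vocab = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_doc_counts):
        score = math.log(class_doc_counts[label] / total_docs)  # class prior
        class_total = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are skipped
                count = word_counts[label][tok] + 1  # add-one smoothing
                score += math.log(count / (class_total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical tagged corpus, invented for illustration only.
tagged = [
    ("the goal was scored in the match", "sports"),
    ("the team won the game", "sports"),
    ("the senate passed the bill", "politics"),
    ("voters elected the senator", "politics"),
]
model = train(tagged)
label = classify("the team scored", model)
```

Each document receives exactly one label, which matches the "single class" point on the slide.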
    28. 28. Example: Relative signatures
        Jankowska, M., Keselj, V., & Milios, E. (2012). Relative n-gram signatures: Document visualization at the level of character n-grams. In Proceedings of IEEE Conference on Visual Analytics Science and Technology 2012 (pp. 103-112).
    29. 29. Categorization
        • assigning documents to one or more categories
        • suggestive of unsupervised clustering techniques
        • design choices made to fit particular tasks or goals
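Unsupervised categorization with overlapping membership can be sketched as below. This is a simple threshold-based grouping (a variant of leader clustering that lets a document join more than one group), picked here purely for illustration; it is not a technique named in the presentation. The `group_documents` helper, the `threshold` value, and the sample texts are all hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Represent a document as a mapping from word to count."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def group_documents(texts, threshold=0.4):
    """Group documents without labels; a document may land in several groups.

    Each document joins every existing category whose first member (its
    "leader") is at least `threshold` similar, or founds a new category.
    """
    vectors = [bag_of_words(text) for text in texts]
    leaders = []      # index of the founding document of each category
    categories = []   # one list of document indices per category
    for i, vec in enumerate(vectors):
        joined = False
        for c, leader in enumerate(leaders):
            if cosine(vec, vectors[leader]) >= threshold:
                categories[c].append(i)
                joined = True
        if not joined:
            leaders.append(i)
            categories.append([i])
    return categories


# Hypothetical documents, invented for illustration only.
texts = [
    "gene protein cell biology",
    "galaxy star telescope astronomy",
    "protein cell gene expression",
    "star cell biology astronomy",
]
categories = group_documents(texts)
```

The threshold is a design choice of the kind the slide mentions: raising it produces more, narrower categories, while lowering it produces fewer, broader ones. The result also depends on document order, since earlier documents become leaders.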
    30. 30. Example: UCSD Map of Science
        Börner, K., Klavans, R., Patek, M., Zoss, A. M., Biberstine, J. R., Light, R. P., Larivière, V., & Boyack, K. W. (2012). Design and update of a classification system: The UCSD Map of Science. PLoS ONE, 7(7), e39464.
    31. 31. Example: NIH Map Viewer https://app.nihmaps.org/nih/browser/
    32. 32. Reference systems, infrastructure
        What do we gain by adding structure?
        What do we lose?
    33. 33. SUMMARIZING DOCUMENTS
    34. 34. Text is only one component of a document.
        Research questions often push us to be creative with how we operationalize constructs.
        The richness of language and documents is best preserved by using multiple, complementary approaches.
    35. 35. QUESTIONS?
        angela.zoss@duke.edu