text analysis tools

Published in: Technology

  1. 1. New Tools in Digital Humanities. UDHIG, June 13, 2006. Zoe Borovsky
  2. 2. New tools
       • Text:
           • Juxta
           • TAPoR, HyperPo
           • WordHoard
       • Images:
           • Image Markup Tool
  3. 3. Why digitize text? Text analysis: discovering new knowledge by linking information together in interesting ways, not just showing overall trends. “I think discovering new knowledge vs. showing trends is like the difference between a detective following clues to find the criminal vs. analysts looking at crime statistics to assess overall trends in car theft.” (Marti Hearst, 2003)
  4. 4. Three volumes of sagas, hundreds of giants and giantesses: the verb “look” occurs more often near the words for and names of giantesses than near those of giants.
  5. 5. Types of tools
       • Concordance, comparison, corpus, critical editions (Juxta)
       • Search (TAPoR, HyperPo, WordHoard)
           • Key words in context (KWIC)
           • Collocates (associations)
           • Markup: lemma, parts of speech, speaker
  6. 6. Juxta
       • Produces critical editions, comparing and collating multiple witnesses of a single work
       http://www.patacriticism.org/juxta/
  7. 7. Juxta
       • Desktop application: Mac, Windows and Unix/Linux (open source)
       • Input: plain text (UTF-8) or XML
       • Output: HTML critical apparatus
  8. 8. The darker the color, the more variants differ at that point in the text
  9. 9. Toggle between texts
  10. 10. Generate HTML
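
The collation step described in the Juxta slides can be illustrated with a minimal sketch. It does not reproduce Juxta's own algorithm, which these slides do not describe; it uses Python's standard `difflib` module as a stand-in, and the function name `collate` and the two sample witnesses are invented for illustration.

```python
import difflib

def collate(base: str, witness: str):
    """List every point where the witness departs from the base text.

    Returns (operation, base_words, witness_words) tuples, where operation
    is 'replace', 'delete' or 'insert' as reported by difflib.
    """
    a = base.split()
    b = witness.split()
    # Compare word by word rather than character by character.
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    variants = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            variants.append((tag, a[i1:i2], b[j1:j2]))
    return variants

# Invented sample witnesses, not taken from any real edition.
base = "the quick brown fox jumps over the lazy dog"
witness = "the quick red fox leaps over the dog"

for tag, old, new in collate(base, witness):
    print(tag, old, new)
```

Running `collate` against several witnesses and counting how many of them differ at each position would give the kind of tally that a darker-or-lighter shading could be based on.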
  11. 12. TAPoR
       • Web-based text analysis portal
       • Search and display using online tools
       • Input: XML, HTML, TEI, plain text
       http://test-tapor.mcmaster.ca/portal/portal
  12. 13. TAPoR
       • Mostly English, some western European languages
       • Word lists
       • KWIC (key word in context)
       • Collocates/co-occurrences: words that occur in proximity to a search term
  13. 14. Word List HyperPo
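
A word list of this kind is essentially a frequency table. The following is a rough sketch in Python, not HyperPo's implementation; the function name `word_list` and the simple `\w+` tokenizer are assumptions made for illustration.

```python
import re
from collections import Counter

def word_list(text: str, top: int = 10):
    """Return the `top` most frequent words as (word, count) pairs."""
    # Lowercase the text and treat each run of letters/digits as one word.
    words = re.findall(r"\w+", text.lower())
    return Counter(words).most_common(top)

# Invented sample sentence.
print(word_list("The whale, the white whale!", top=3))
```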
  14. 15. Key word in context, HyperPo
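
A key-word-in-context display shows each occurrence of a search term together with a few words on either side. A minimal sketch, assuming a simple `\w+` tokenizer and a hypothetical function name `kwic` (this is not HyperPo's code):

```python
import re

def kwic(text: str, keyword: str, width: int = 3):
    """Return (left context, keyword, right context) for every occurrence."""
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            # Up to `width` words before and after the match.
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Invented sample text.
sample = "The giantess looked at the hero. The hero did not look back."
for left, word, right in kwic(sample, "hero", width=2):
    print(f"{left:>20} | {word} | {right}")
```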
  15. 16. Co-occurrences
       • “white”
       • Add secondary corpus
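
Co-occurrence counting can be sketched as tallying the words that fall within a fixed window around each occurrence of a search term such as “white”. The window size, the tokenizer and the function name `collocates` below are illustrative assumptions, not the behaviour of TAPoR or HyperPo.

```python
import re
from collections import Counter

def collocates(text: str, node: str, window: int = 2) -> Counter:
    """Count words appearing within `window` words of each occurrence of `node`."""
    tokens = re.findall(r"\w+", text.lower())
    node = node.lower()
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # Neighbours on both sides, excluding the node word itself.
            neighbors = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbors)
    return counts

# Invented sample text.
print(collocates("the white whale and the white sail", "white", window=1))
```

Running the same function over a second corpus and comparing the two tallies is one simple way to approximate a comparison against a secondary corpus.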
  16. 17. WordHoard
       • Desktop application/server version
       • Texts are annotated or tagged by morphological, lexical, semantic, prosodic, and narratological criteria.
       http://wordhoard.northwestern.edu/userman/index.html
  17. 18. The downloadable version comes with texts. The open source version can be installed on your own server with your own texts.
  18. 19. Sample WordHoard query
       • Shakespeare’s use of the word “love” over time
  19. 20. Results…
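
A query of this shape amounts to computing a word's relative frequency in each dated work and ordering the results chronologically. The sketch below is not WordHoard's query engine: it matches surface forms only, whereas WordHoard works with tagged texts, and the corpus, years and function names are placeholders invented for illustration.

```python
import re

def relative_frequency(text: str, word: str) -> float:
    """Share of the tokens in `text` that equal `word` (case-insensitive)."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    return tokens.count(word.lower()) / len(tokens)

def frequency_by_year(corpus, word):
    """Return (year, relative frequency) pairs in chronological order."""
    return [(year, relative_frequency(text, word)) for year, text in sorted(corpus)]

# Placeholder corpus: the years and texts are invented, not real data.
corpus = [
    (1610, "the storm came and the ship sank"),
    (1590, "love is love and nothing more"),
    (1600, "no love lost here today friend"),
]

for year, freq in frequency_by_year(corpus, "love"):
    print(year, round(freq, 3))
```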
  20. 21. Image Markup Tool (Windows only): http://www.tapor.uvic.ca/~mholmes/image_markup/
  21. 22. Image Markup Tool
       • Input: an image that you want to make available on a web page with annotations directly on the image
       Example: Robert Watson’s Back to Nature
  22. 24. Image Markup Tool
       • Output (sample):
           • A copy of your XML data file with an added XSL stylesheet declaration
           • A copy of the image file you're marking up (usually reduced to a size suitable for a Web page; you can control this size in the Options / Web view preferences window)
           • An XSLT file (copied from the web_view folder in the program folder, with some variables modified to suit your data)
           • A JavaScript file (copied from the web_view folder in the program folder)
           • A CSS stylesheet file (copied from the web_view folder in the program folder)
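
The first output item, an XML file with an added XSL stylesheet declaration, can be illustrated with a short sketch. This is not how the Image Markup Tool itself does it; it uses Python's standard `xml.dom.minidom`, and the element names and the file name `view.xsl` are hypothetical.

```python
from xml.dom import minidom

def add_stylesheet(xml_text: str, href: str) -> str:
    """Return the XML with an xml-stylesheet declaration placed before the root element."""
    doc = minidom.parseString(xml_text)
    pi = doc.createProcessingInstruction(
        "xml-stylesheet", f'type="text/xsl" href="{href}"'
    )
    # The declaration has to come before the root element.
    doc.insertBefore(pi, doc.documentElement)
    return doc.toxml()

# Hypothetical annotation file; the element names are made up.
data = "<annotations><note>Tree in foreground</note></annotations>"
print(add_stylesheet(data, "view.xsl"))
```

A browser that opens the resulting file can then apply the referenced XSLT to render the annotations as a web page.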