Text-mining as a Research Tool in the Humanities and Social Sciences

Ryan Shaw (School of Information & Library Science, UNC Chapel Hill) provides an overview and a critique of text-mining projects, and discusses project design, methodology, scope, integrity of data and analysis as well as preservation. This presentation will help scholars understand the research potential of text mining, and offer a summary of issues and concerns about technology and methods.

See also:

http://aesh.in/RC
http://sfy.co/e8ys


  1. Duke Libraries / Text > Data, September 20, 2012
      Text-mining as a Research Tool in the Humanities and Social Sciences
      Ryan Shaw, ryanshaw@unc.edu
      http://aesh.in/RC
      @rybesh #duketext
  11. Roberto Busa
  13. Automated text analysis
      Automated text analysis is a tool for discovery and measurement in textual data of prevalent attitudes, concepts, or events.
      O'Connor, Bamman & Smith 2011, "Computational Text Analysis for Social Science" http://goo.gl/PxruI
  14. Automated text analysis
      Automated text analysis is a tool for discovery and measurement in textual data of patterns of language use interpretable as prevalent attitudes, concepts, or events.
      O'Connor, Bamman & Smith 2011, "Computational Text Analysis for Social Science" http://goo.gl/PxruI
  18. Language modeling
      • Methods for automated text analysis are based on mathematical models of language
      • Mathematical models distinguish elements and make explicit the relations among them
      • They do not explain, but they can be interpreted
      Black 1962, "Models and Archetypes" http://goo.gl/zKtrx
  22. Language modeling
      • All mathematical models of language are necessarily wrong
      • Nevertheless they may be useful
      • They must be evaluated on their ability to help scholars make inferences, achieve insights, and generate new interpretations
      Grimmer & Stewart 2012, "Text as Data" http://goo.gl/tFPFs
  28. Plan of attack
      • Acquiring text
      • Representing text
      • Analyzing text
      • Validating results
      • Managing data
  29. Acquiring text: Collecting your data
  34. Sources
      • Existing digital corpora
      • Other digital sources (e.g. the Web, Twitter)
      • Undigitized text
  39. Existing digital corpora
      • Ideally, texts will be available as XML
      • Quality of text and metadata is high
      • But collections tend to be small
      • Licensing agreements may prohibit text analysis
  40. • 10.5 million total volumes
      • 5.5 million book titles
      • 270,000 serial titles
      • 3.2 million public domain
      http://www.hathitrust.org/htrc
  46. Other digital sources
      • Some kinds of texts (e.g. tweets) can be obtained through an API
      • Websites without APIs can be "scraped"
      • Generally requires custom programming
      • Website restrictions may limit how much or how quickly texts can be collected
      • Metadata will be limited or absent
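One way to respect limits on how quickly texts can be collected is to pause between requests. The following is a minimal sketch, not a full scraper: the function names, the two-second default delay, and the stand-in fetcher are assumptions made for illustration, and a site's terms of use and robots.txt still need to be checked by hand.

```python
import time
import urllib.request

def fetch(url):
    """Download one page and return its body as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

def collect(urls, delay_seconds=2.0, fetch_page=fetch, sleep=time.sleep):
    """Fetch pages one at a time, pausing between consecutive requests."""
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # wait before every request except the first
        pages[url] = fetch_page(url)
    return pages

# Offline demonstration with a stand-in fetcher (hypothetical page names);
# real use would pass URLs and leave fetch_page at its default.
demo = collect(["page-1", "page-2"], delay_seconds=0,
               fetch_page=lambda url: "text of " + url)
print(demo)
```

Passing the fetcher and the sleep function as parameters keeps the pacing logic testable without touching the network.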
  51. Undigitized text
      • Undigitized text must be scanned and subjected to Optical Character Recognition
      • Time and labor intensive
      • OCR will introduce errors in your texts
      • You need to produce your own metadata
  56. Preparing texts
      • OCR errors
      • Words broken across lines
      • Running headers and footers
      • Breaking into paragraphs, sentences, etc.
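Two of the cleanup steps above can be sketched in a few lines of Python. This assumes plain-text OCR output with one string per page; the function names, the sample pages, and the header text "CHAPTER ONE" are hypothetical.

```python
import re

def join_broken_words(text):
    """Rejoin words hyphenated across a line break, e.g. 'light-\\nhouse'."""
    # Assumption: a hyphen directly before a newline marks a broken word.
    # Real hyphenated compounds that happen to end a line are joined too.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

def drop_running_header(pages, header):
    """Remove a known running header when it is the first line of a page."""
    cleaned = []
    for page in pages:
        lines = page.split("\n")
        if lines and lines[0].strip() == header:
            lines = lines[1:]
        cleaned.append("\n".join(lines))
    return cleaned

pages = [
    "CHAPTER ONE\nthe light-\nhouse wobbled",
    "CHAPTER ONE\nbut the blot had spread",
]
pages = drop_running_header(pages, "CHAPTER ONE")
text = join_broken_words("\n".join(pages))
print(text)
```

Real OCR output is messier than this, so rules like these usually need checking against a sample of pages before being applied to a whole corpus.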
  60. Preparing texts
      • The bulk of your time will be spent acquiring and preparing your texts
      • Worth your time to learn a scripting language (such as Python)
      • Command-line text-processing tools on Mac OS and Unix also very useful
  61. Representing text: Turning words into numbers
  62. Slowly welling from the point of her gold nib, pale blue ink dissolved the full stop; for there her pen stuck; her eyes fixed, and tears slowly filled them. The entire bay quivered; the lighthouse wobbled; and she had the illusion that the mast of Mr. Connor's little yacht was bending like a wax candle in the sun. She winked quickly. Accidents were awful things. She winked again. The mast was straight; the waves were regular; the lighthouse was upright; but the blot had spread.
  63. 11 the            1 wax           1 quivered
       3 was            1 waves         1 quickly
       3 she            1 upright       1 point
       3 her            1 things        1 pen
       2 winked         1 there         1 pale
       2 were           1 them          1 nib
       2 slowly         1 that          1 mr
       2 of             1 tears         1 little
       2 mast           1 sun           1 like
       2 lighthouse     1 stuck         1 ink
       2 had            1 straight      1 in
       2 and            1 stop          1 illusion
       1 yacht          1 spread        1 gold
       1 wobbled        1 s             1 full
       1 welling        1 regular       1 from
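Counts like these can be produced with a few lines of Python. This is a minimal sketch, assuming the simplest possible tokenization (lowercase everything, then split on anything that is not a letter), which is why "Connor's" yields a stray "s" and "Mr." becomes "mr".

```python
import re
from collections import Counter

PASSAGE = (
    "Slowly welling from the point of her gold nib, pale blue ink dissolved "
    "the full stop; for there her pen stuck; her eyes fixed, and tears slowly "
    "filled them. The entire bay quivered; the lighthouse wobbled; and she had "
    "the illusion that the mast of Mr. Connor's little yacht was bending like "
    "a wax candle in the sun. She winked quickly. Accidents were awful things. "
    "She winked again. The mast was straight; the waves were regular; the "
    "lighthouse was upright; but the blot had spread."
)

def count_words(text):
    """Lowercase the text, keep runs of letters as tokens, and count them."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

counts = count_words(PASSAGE)
for word, n in counts.most_common(5):
    print(n, word)
```

Different tokenization choices (keeping apostrophes, keeping case) would give slightly different counts, so the choice is worth recording alongside the results.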
  64. 11 the            1 wax           1 quiver
       3 wa             1 wave          1 quickli
       3 she            1 upright       1 point
       3 her            1 thing         1 pen
       2 wink           1 there         1 pale
       2 were           1 them          1 nib
       2 slowli         1 that          1 mr
       2 of             1 tear          1 littl
       2 mast           1 sun           1 like
       2 lighthous      1 stuck         1 ink
       2 had            1 straight      1 in
       2 and            1 stop          1 illus
       1 yacht          1 spread        1 gold
       1 wobbl          1 s             1 full
       1 well           1 regular       1 from
  66. [Table: term-by-document count matrix with columns doc 1 to doc 6. Nonzero counts per stemmed term, column positions not recoverable:]
      accid 1
      actual 1
      again 1 1
      alreadi 1
      antenna 1
      archer 1
      avoid 2 1
      awai 1
      aw 1
      bag 1
      bandanna 1
      barfoot 2
  70. Document similarity
      [Figure: doc 1 and doc 6 plotted as points on axes "avoid" and "again" (ticks at 1 and 2), with the similarity between the two documents marked.]
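Once documents are vectors of term counts, similarity can be computed numerically. The sketch below uses cosine similarity, which is one common choice and an assumption here, since the slide does not name a measure. The counts are hypothetical and are not the values from the matrix above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (dicts of term -> count)."""
    terms = set(a) | set(b)
    dot = sum(a.get(t, 0) * b.get(t, 0) for t in terms)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)

# Hypothetical counts for two documents on the terms "again" and "avoid".
doc_a = {"again": 1, "avoid": 2}
doc_b = {"again": 1, "avoid": 1}
print(round(cosine_similarity(doc_a, doc_b), 3))
```

Cosine similarity depends only on the direction of the vectors, so a long document and a short one with the same proportions of words count as identical.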
  71. Analyzing text: Counting, comparing, categorizing and pattern-finding
  79. Six methods of text analysis
      • Reading
      • Counting words
      • Human coding (manual content analysis)
      • Dictionary methods
      • Supervised machine learning
      • Unsupervised machine learning
      Quinn et al. 2010 http://dx.doi.org/10.1111/j.1540-5907.2009.00427.x
  80. Counting words
      http://www.nytimes.com/ref/washington/20070123_STATEOFUNION.html
  82. Michel et al. 2010 http://dx.doi.org/10.1126/science.1199644
  88. Counting words
      • Easily computed
      • Results are replicable
      • Comparisons require metadata, e.g. publication year, language, subject category, location
      • Word use is ambiguous
      • Spelling may vary
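To illustrate why comparisons need metadata, here is a minimal sketch that tracks a word's relative frequency by publication year. The function name and the three tiny documents are hypothetical; each document is assumed to arrive as a (year, text) pair.

```python
import re
from collections import defaultdict

def relative_frequency_by_year(documents, word):
    """documents: list of (year, text) pairs.

    Returns {year: share of that year's tokens that equal `word`}.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in documents:
        tokens = re.findall(r"[a-z]+", text.lower())
        totals[year] += len(tokens)
        hits[year] += tokens.count(word)
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}

documents = [
    (1900, "the war and the peace"),
    (1900, "peace at last"),
    (1950, "war war war"),
]
print(relative_frequency_by_year(documents, "war"))
```

Dividing by the total number of tokens per year matters because the amount of text available usually differs greatly from year to year.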
  91. Concordance tools
  95. Dictionary methods
      • A dictionary is simply a list of words
      • Lists are compiled for specific categories of interest: negative words, law-related words, names of places, names of chemicals, etc.
      • May be custom-built or reused
  96. Lexicoder Sentiment Dictionary
      A LIE     0   WOUNDED   0   ABILITY*     1   WOOS           1
      ABANDON*  0   WOUNDS    0   ABOUND*      1   WORKABLE*      1
      ABAS*     0   WRATH*    0   ABSOLV*      1   WORKMANSHIP*   1
      ABATTOIR* 0   WRECK*    0   ABSORBENT*   1   WORSHIP*       1
      ABDICAT*  0   WRESTL*   0   ABSORPTION*  1   WORTH          1
      ABERRA*   0   WRETCH*   0   ABUNDANC*    1   WORTH WHILE*   1
      ABHOR*    0   WRITHE*   0   ABUNDANT*    1   WORTHI*        1
      ABJECT*   0   WRONG*    0   ACCED*       1   WORTHWHILE*    1
      ABNORMAL* 0   XENOPHOB* 0   ACCENTUAT*   1   WORTHY*        1
      ABOLISH*  0   YAWN*     0   ACCEPT*      1   YOUNG AT HEART 1
      ABOMINAB* 0   YEARN*    0   ACCESSIB*    1   ZEAL           1
      ABOMINAT* 0   YUCK*     0   ACCLAIM*     1   ZEALOUS*       1
      ABRASIV*  0   ZEALOT*   0   ACCLAMATION* 1   ZEST*          1
  98. Litigious Words
      ACQUITTANCE    DOCKET         LEGALIZATIONS  QUITCLAIM
      ADJOURNING     ESCHEATED      LEGALLY        REBUTS
      APPELLANTS     EXCEEDENCES    LITIGATORS     REQUESTER
      APPOINTOR      EXCULPATED     MISTRIALS      RESCINDS
      ARBITRATE      FOREBEAR       NOTARIZE       STATUTE
      ASSERTABLE     INASMUCH       NOTARIZED      SUBPARAGRAPHS
      CHATTEL        INDEMNITY      OBLIGOR        SUBPOENAS
      CODIFICATIONS  INJUNCTION     PERSONAM       SUBTRUSTS
      CONVICTED      INTERLOCUTORY  PLEADS         TENANTABILITY
      COUNTERSUIT    INTERPLEADER   POSTJUDGMENT   TESTAMENTARY
      DEFEASANCE     INTERROGATE    PRETRIAL       UNENCUMBERED
      DELEGATEE      IRREVOCABLY    PRIMA          UNREMEDIATED
      DEPOSED        LEGALIZATION   PROSECUTIONS   WHEREOF
  103. Simple dictionary algorithm
       • For each word in document:
         • +1 if the word is in the positive list
         • –1 if the word is in the negative list
       • Divide the total by the number of words
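The algorithm above is short enough to write out directly. This is a minimal sketch with hypothetical word lists and a hypothetical sentence; it matches whole words only, so dictionary entries written with a trailing asterisk (which appear to be prefix patterns) would need extra handling.

```python
import re

def dictionary_score(text, positive, negative):
    """Score = (positive matches - negative matches) / total number of words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    total = 0
    for token in tokens:
        if token in positive:
            total += 1
        elif token in negative:
            total -= 1
    return total / len(tokens)

# Hypothetical word lists, for illustration only.
POSITIVE = {"good", "great", "success"}
NEGATIVE = {"poor", "failed", "trouble"}

score = dictionary_score(
    "Poor shooting and foul trouble, but a great finish.", POSITIVE, NEGATIVE
)
print(score)
```

Here two negative matches and one positive match over nine words give a slightly negative score; dividing by length keeps long and short documents comparable.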
  109. 26 uses of positive words – 51 uses of negative words = –25
       –25 / 779 total words = –0.032
  111. Negative: AGAINST, AGGRESSIVENESS, ATTACK, ATTACKING, CHALLENGE, CONTRAST, DEFENSIVE, DEFICIENCIES, DEVIL, DEVILS, DISMAL, EXPLOIT, FAILED, FOUL, FOULING, FOULS, FUTILITY, INABILITY, LIMITED, LIMITING, NEGATE, OFFENSE, OFFENSIVE, OFFENSIVELY, OPPOSING, PLAGUED, POOR, PROBLEM, SHORTCOMINGS, SLUGGISH, THORNTON, THREATS, TOO, TROUBLE, TROUBLES, UNABLE
       Positive: ADEQUATELY, ADVANTAGE, ASSISTS, EFFICIENT, EFFICIENTLY, EFFORT, FREE, FRESHMAN, GOOD, GREAT, IMPROVEMENT, KEEPING, LIKE, PATRIOT, PERFECT, RESPONSIBLE, SIGNIFICANT, STRONGER, SUCCESS, WELL
  118. Supervised machine learning
       • The situation: you know the categories of interest
       • The problem: human coding of documents doesn't scale
       • The solution: teach a robot to do it
  120. Welcome your robot overlords
  121. Augmenting human capacity
  132. Supervised machine learning
       1. Create a training set.
       2. Use the training set to "teach" a supervised learning algorithm how to map document features (e.g. words) to categories.
       3. Test your classifying machine to see if it learned correctly.
       4. Use it to classify the rest of your documents.
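The four steps can be sketched end to end with a small Naïve Bayes classifier written in plain Python. Everything here is illustrative: the algorithm is just one of many possible choices, the categories and sentences are hypothetical, and a real training set would be far larger.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (list_of_words, category) pairs."""
    doc_counts = Counter()              # documents per category
    word_counts = defaultdict(Counter)  # word counts per category
    vocabulary = set()
    for words, category in labeled_docs:
        doc_counts[category] += 1
        word_counts[category].update(words)
        vocabulary.update(words)
    return doc_counts, word_counts, vocabulary

def classify(words, model):
    """Return the category with the highest log-probability for the words."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in words:
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids taking log(0).
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category

# Step 1: a hand-coded training set (hypothetical).
training = [
    ("the court granted the injunction".split(), "law"),
    ("the judge dismissed the subpoena".split(), "law"),
    ("the team won the game".split(), "sports"),
    ("the coach praised the team".split(), "sports"),
]
# Step 2: teach the algorithm.
model = train_naive_bayes(training)
# Step 3: test on hand-coded documents that were held out of training.
held_out = [
    ("the judge granted the motion".split(), "law"),
    ("the coach won the game".split(), "sports"),
]
accuracy = sum(classify(w, model) == c for w, c in held_out) / len(held_out)
print("accuracy:", accuracy)
# Step 4: classify the remaining, uncoded documents.
print(classify("the court dismissed the case".split(), model))
```

Keeping the test documents separate from the training documents is what makes step 3 meaningful: a classifier tested on what it was trained on will look better than it is.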
  136. Creating a training set
       • Create a coding scheme that humans can use reliably and without ambiguity.
       • Select (ideally randomly) a subset of your documents, and code them by hand.
       • You need "enough" documents: more categories, more documents.
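Random selection is easy to make reproducible. The sketch below assumes documents are identified by integer ids; the corpus size, the sample size, and the function name are hypothetical.

```python
import random

def sample_for_coding(doc_ids, k, seed=0):
    """Draw k document ids uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the same sample can be redrawn
    return rng.sample(doc_ids, k)

doc_ids = list(range(1000))  # hypothetical corpus of 1,000 documents
to_code = sample_for_coding(doc_ids, 50)
remaining = sorted(set(doc_ids) - set(to_code))
print(len(to_code), len(remaining))
```

Recording the seed along with the sample lets someone else recreate exactly which documents were coded by hand.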
  140. Supervised learning algorithms
       • Many kinds: Naïve Bayes, decision trees / random forests, support vector machines, neural networks, etc.
       • No "best" one: performance is domain- and dataset-specific
       • "Ensembles" of different algorithms can often outperform single algorithms
  145. Unsupervised machine learning
       • The situation: you don't know the categories of interest, or want to discover new ones
       • The solution: have a robot explore and find possible categorizations for you, and use them to categorize documents
       • Also known as "clustering"
  149. No free lunch
       • No need for manual coding beforehand
       • But as much or more manual labor is needed to evaluate suggested categorizations afterwards
       • The value is a novel categorization, not time or labor saved
       Grimmer & Stewart 2012, "Text as Data" http://goo.gl/tFPFs
  152. Two kinds of unsupervised learning
       • Single membership clustering: each document is assigned to one category
       • Mixed membership clustering: a document may be assigned to multiple categories, each with a different proportion
  156. 156. Duke Libraries / Text > Data September 20, 2012 Single membership clustering 1. Define a quantitative measure of similarity between documents. 2. Define a quantitative measure of how "good" a cluster is. 3. Define a process for optimizing the overall goodness of the clusters. @rybesh #duketext 55
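The three steps above can be sketched in code. This is only a minimal illustration, not necessarily the method shown in the talk. It assumes documents have already been reduced to word-count vectors, uses squared Euclidean distance as the (dis)similarity measure, the total within-cluster distance as the measure of goodness, and k-means as the optimization process. All function names and the toy vectors are made up for the example.

```python
import random

def squared_distance(a, b):
    """Step 1: a quantitative (dis)similarity between two document vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(vectors):
    """The average vector of a group of documents."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def within_cluster_cost(vectors, labels, centers):
    """Step 2: how "good" the clusters are (lower cost means tighter clusters)."""
    return sum(squared_distance(v, centers[l]) for v, l in zip(vectors, labels))

def k_means(vectors, k, iterations=20, seed=0):
    """Step 3: alternate between assigning documents and moving the centers."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)  # start from k randomly chosen documents
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each document to its nearest center.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centers[c]))
            for v in vectors
        ]
        # Move each center to the average of the documents assigned to it.
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centers[c] = centroid(members)
    return labels, centers

# Toy word-count vectors for four documents (hypothetical data).
docs = [[5, 0, 1], [4, 1, 0], [0, 6, 5], [1, 5, 6]]
labels, centers = k_means(docs, k=2)
cost = within_cluster_cost(docs, labels, centers)
```

Each of the three choices (distance, cost, optimizer) could be swapped for another, and different choices can produce different clusterings of the same texts.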
  157. 157. Duke Libraries / Text > Data September 20, 2012 @rybesh #duketext 56
  159. 159. Duke Libraries / Text > Data September 20, 2012 http://shabal.in/visuals.html @rybesh #duketext 57
  165. 165. Duke Libraries / Text > Data September 20, 2012 Mixed membership clustering • Topic modeling is a popular example • Each document is modeled as a mixture of categories or topics • A document is a probability distribution over topics • A topic is a probability distribution over words @rybesh #duketext 58
  166. 166. Duke Libraries / Text > Data September 20, 2012 Probability distribution @rybesh #duketext 59
  171. 171. Duke Libraries / Text > Data September 20, 2012 "Generating" text 1. Roll our "topic dice" to choose a topic. 2. Get the "word dice" corresponding to the chosen topic. 3. Roll the "word dice" to choose a word. 4. Repeat until we've chosen all the words for our text. @rybesh #duketext 60
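The dice-rolling procedure above can be written out directly. This is a toy sketch: the two topics, their words, and all the probabilities are invented for illustration, and `roll` and `generate_text` are hypothetical helper names.

```python
import random

# The "topic dice" for one document: a probability distribution over topics.
topic_dice = {"farming": 0.7, "politics": 0.3}

# One set of "word dice" per topic: a probability distribution over words.
word_dice = {
    "farming": {"corn": 0.5, "harvest": 0.3, "vote": 0.2},
    "politics": {"vote": 0.6, "senate": 0.3, "corn": 0.1},
}

def roll(dice, rng):
    """Pick one outcome at random, weighted by its probability."""
    outcomes = list(dice)
    weights = [dice[outcome] for outcome in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]

def generate_text(n_words, rng):
    """Repeat steps 1-3 until all the words have been chosen."""
    words = []
    for _ in range(n_words):
        topic = roll(topic_dice, rng)               # 1. roll the topic dice
        chosen_word_dice = word_dice[topic]         # 2. get that topic's word dice
        words.append(roll(chosen_word_dice, rng))   # 3. roll the word dice
    return words

rng = random.Random(42)
text = generate_text(10, rng)
print(" ".join(text))
```

Fitting a topic model runs this story in reverse: given only the texts, it estimates dice that could plausibly have produced them.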
  172. 172. Duke Libraries / Text > Data September 20, 2012 Topic modeling demo @rybesh #duketext 61
  173. 173. Duke Libraries / Text > Data September 20, 2012 http://dsl.richmond.edu/dispatch/ @rybesh #duketext 62
  174. 174. Duke Libraries / Text > Data September 20, 2012 [Figure: text analysis methods placed on two axes. One axis runs from simple to complex statistics / computation, the other from weaker to stronger domain assumptions. Methods shown: word counting, dictionary methods, supervised methods, topic models.] O'Connor, Bamman & Smith 2011 http://goo.gl/PxruI @rybesh #duketext 63
  175. 175. Duke Libraries / Text > Data September 20, 2012 Validating results Keeping the machines from leading you astray @rybesh #duketext 64
  180. 180. Duke Libraries / Text > Data September 20, 2012 Validating word counts • Text data may have errors (e.g. from OCR) • Metadata may have errors • Texts may appear multiple times • Collections are biased samples @rybesh #duketext 65
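One of these checks, finding texts that appear more than once, is easy to automate in its simplest form. The sketch below is an assumption about how one might do it, not something shown in the talk: it lowercases each text, collapses whitespace, and groups documents whose normalized text is identical. The corpus and the function names are hypothetical.

```python
import hashlib
import re
from collections import defaultdict

def fingerprint(text):
    """Hash the text after lowercasing and collapsing runs of whitespace."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def find_duplicates(corpus):
    """Return groups of document ids whose normalized text is identical."""
    groups = defaultdict(list)
    for doc_id, text in corpus.items():
        groups[fingerprint(text)].append(doc_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]

# Hypothetical corpus: "a" and "c" differ only in case and whitespace.
corpus = {
    "a": "The  quick brown fox.",
    "b": "A different text.",
    "c": "the quick\nbrown fox.",
}
duplicates = find_duplicates(corpus)
print(duplicates)  # [['a', 'c']]
```

This only catches copies that are identical after normalization. Two OCR scans of the same page will usually differ by a few characters and would need a fuzzier comparison.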
  181. 181. Duke Libraries / Text > Data September 20, 2012 http://languagelog.ldc.upenn.edu/nll/?p=1701 @rybesh #duketext 66
  188. 188. Duke Libraries / Text > Data September 20, 2012 Validating dictionary methods • Must verify that dictionary categorizations match human judgments • But humans can't reliably "score" documents on "positivity" or "litigiousness" • Better to convert scores to simple binaries @rybesh #duketext 67
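A rough sketch of the binary approach: score each document with a dictionary, reduce the score to a yes/no judgment, and measure how often that judgment matches a human coder's. The word lists, the threshold of zero, and the example texts are all invented for illustration and do not come from any real sentiment dictionary.

```python
import re

# Hypothetical dictionary categories.
POSITIVE = {"good", "great", "success"}
NEGATIVE = {"bad", "failure", "crisis"}

def dictionary_score(text):
    """Count positive words minus negative words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives

def is_positive(text):
    """Convert the numeric score to a simple binary."""
    return dictionary_score(text) > 0

def agreement(texts, human_labels):
    """Proportion of documents where the dictionary matches the human coder."""
    matches = sum(is_positive(t) == h for t, h in zip(texts, human_labels))
    return matches / len(texts)

texts = [
    "A great success for the city",
    "The crisis was a failure of planning",
    "Good weather, bad roads",
]
human_labels = [True, False, True]  # hypothetical human judgments
rate = agreement(texts, human_labels)
```

Here the dictionary and the human disagree on the third text, whose positive and negative words cancel out. Validation is meant to surface this kind of case.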
  192. 192. Duke Libraries / Text > Data September 20, 2012 Validating supervised methods • Ideally: take two random non-overlapping samples and manually code them. • Use the first sample to train your supervised learning algorithm. • Use the second sample to evaluate its performance. @rybesh #duketext 68
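Drawing the two samples can be done in a few lines. The sketch below assumes each document has an identifier and that you have decided how many documents you can afford to code by hand. The sample sizes and the function name are placeholders.

```python
import random

def split_for_validation(doc_ids, n_train, n_test, seed=0):
    """Draw two random, non-overlapping samples of document ids."""
    if n_train + n_test > len(doc_ids):
        raise ValueError("not enough documents for both samples")
    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    sample = rng.sample(doc_ids, n_train + n_test)  # drawn without replacement
    return sample[:n_train], sample[n_train:]

doc_ids = list(range(1000))  # hypothetical collection of 1000 documents
train_ids, test_ids = split_for_validation(doc_ids, n_train=100, n_test=100)
```

Because both samples come from a single draw without replacement, no document can land in both. Recording the seed makes it possible to reconstruct later exactly which documents were coded.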
  195. 195. Duke Libraries / Text > Data September 20, 2012 Accuracy: 197 / 262 = 75% @rybesh #duketext 262 documents 69
                    figurative   mixed   literal
      figurative        57         32        2
      mixed             21         30        6
      literal            0          4      110
  196. 196. Duke Libraries / Text > Data September 20, 2012 Precision: 57 / 78 = 73% figurative category (same table as slide 195) @rybesh #duketext 262 documents 70
  197. 197. Duke Libraries / Text > Data September 20, 2012 Recall: 57 / 91 = 63% figurative category (same table as slide 195) @rybesh #duketext 262 documents 71
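The three figures on these slides can be recomputed from the table. The slides do not say which axis holds the manual codes. The sketch below assumes rows are the manual codes and columns are the classifier's output, which is the reading consistent with the slides' denominators: 78 is the sum of the figurative column (57 + 21 + 0) and 91 is the sum of the figurative row (57 + 32 + 2).

```python
labels = ["figurative", "mixed", "literal"]

# Rows: manual code (assumed). Columns: classifier output (assumed).
matrix = [
    [57, 32, 2],
    [21, 30, 6],
    [0, 4, 110],
]

def accuracy(m):
    """Proportion of all documents that were correctly classified."""
    correct = sum(m[i][i] for i in range(len(m)))
    total = sum(sum(row) for row in m)
    return correct / total

def precision(m, i):
    """Of the documents the classifier put in category i, the share that belong there."""
    return m[i][i] / sum(row[i] for row in m)

def recall(m, i):
    """Of the documents that belong in category i, the share the classifier found."""
    return m[i][i] / sum(m[i])

fig = labels.index("figurative")
print(f"accuracy:  {accuracy(matrix):.0%}")        # 75%
print(f"precision: {precision(matrix, fig):.0%}")  # 73%
print(f"recall:    {recall(matrix, fig):.0%}")     # 63%
```

Accuracy hides where the errors fall. Here nearly all of them are confusions between "figurative" and "mixed", which the per-category precision and recall make visible.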
  201. 201. Duke Libraries / Text > Data September 20, 2012 Validating unsupervised methods • There are statistical measures of how well a particular clustering "fits" the data • These are not appropriate for evaluating unsupervised clustering of texts • The "data" is butchered text; we don't want to fit it well @rybesh #duketext 72
  206. 206. Duke Libraries / Text > Data September 20, 2012 Validating unsupervised methods • Does the categorization make sense? • Are the categories distinct? • Are they internally consistent? • Do they provide insight? @rybesh #duketext 73
  210. 210. Duke Libraries / Text > Data September 20, 2012 Validating topic coherence { dog, cat, horse, apple, pig, cow } { car, teacher, platypus, agile, blue, Zaire } ? Chang et al. 2009 http://goo.gl/FCizP @rybesh #duketext 74
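The word sets above illustrate the "word intrusion" test from Chang et al. 2009: people are shown a topic's top words plus one word that does not belong and asked to spot the intruder. If they can ("apple" among the animals), the topic is coherent. If their guesses scatter, it is not. A minimal sketch of setting up and scoring such a task follows. The function names and the guesses are invented for illustration.

```python
import random

def make_intrusion_task(topic_words, intruder, rng):
    """Mix one intruder word into a topic's top words and shuffle the order."""
    options = list(topic_words) + [intruder]
    rng.shuffle(options)
    return options

def intruder_detection_rate(guesses, intruder):
    """Proportion of human guesses that picked out the true intruder."""
    return sum(guess == intruder for guess in guesses) / len(guesses)

rng = random.Random(1)
task = make_intrusion_task(["dog", "cat", "horse", "pig", "cow"], "apple", rng)

# Hypothetical answers from four human judges shown the shuffled list.
guesses = ["apple", "apple", "horse", "apple"]
rate = intruder_detection_rate(guesses, "apple")
```

A rate close to 1 suggests that people see the topic's words as belonging together. A rate near chance (1 in 6 here) suggests they do not.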
  211. 211. Duke Libraries / Text > Data September 20, 2012 Validating topic assignment @rybesh #duketext 75
  216. 216. Duke Libraries / Text > Data September 20, 2012 Validating unsupervised methods • Compared to other (manual) categorizations, how well does this one approximate judgments of document relatedness? • Do the categories correlate with external facts? • Turn the categories into a coding scheme and apply supervised methods @rybesh #duketext 76
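One simple way to make the first comparison concrete is to look at every pair of documents and ask whether the machine categorization and a manual one agree on whether the pair belongs together. The proportion of pairs on which they agree is known as the Rand index. Using it here is a suggestion, not necessarily the measure the talk had in mind, and the two labelings below are made up.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Share of document pairs on which two categorizations agree.

    A pair counts as an agreement when both categorizations put the two
    documents together, or both keep them apart.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)

machine = [0, 0, 1, 1, 2]           # hypothetical cluster assignments
manual = ["x", "x", "y", "y", "y"]  # hypothetical human categories
score = rand_index(machine, manual)
print(score)  # 0.8
```

The category names themselves never need to match. Only the judgments about which documents are related are compared.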
  217. 217. Duke Libraries / Text > Data September 20, 2012 Managing data Helping others stand on your shoulders @rybesh #duketext 77
  221. 221. Duke Libraries / Text > Data September 20, 2012 Three kinds of data 1. The texts you're analyzing and derivations thereof 2. The software code you're using to process and analyze your texts 3. Documentation of your process @rybesh #duketext 78
  225. 225. Duke Libraries / Text > Data September 20, 2012 Textual data • You want to keep all intermediate versions of the texts you're processing • A version control system is ideal for this • Version control hosting platforms such as GitHub are ideal for sharing your data too @rybesh #duketext 79
  229. 229. Duke Libraries / Text > Data September 20, 2012 Software data • Ideally, use open-source software • Keep past versions of whatever software you use • Use version control for your own scripts and software @rybesh #duketext 80
  233. 233. Duke Libraries / Text > Data September 20, 2012 Documentary data • This is the hardest data to manage • Consider keeping a (public or private) "lab notebook" blog • Anything else you write related to the project, formal or informal @rybesh #duketext 81
  237. 237. Duke Libraries / Text > Data September 20, 2012 Long-term preservation • Data under version control can be exported, including all versions • Create static snapshots of websites, blogs, etc. • Place everything in a long-term digital repository such as DukeSpace @rybesh #duketext 82
  242. 242. Duke Libraries / Text > Data September 20, 2012 Take-aways • Text analysis can be a powerful tool. • It's a systematic method of transforming texts to produce new texts for interpretation. • It only augments human judgment and interpretation; it can't replace them. • Be excited by the possibilities but skeptical of the hype. @rybesh #duketext 83
  245. 245. Duke Libraries / Text > Data September 20, 2012 Thanks! http://aesh.in/RC ryanshaw@unc.edu @rybesh #duketext 84

Editor's Notes

  • 1949: persuaded IBM to sponsor his project to produce a complete concordance of the works of St. Thomas Aquinas. 30 years. Not new; what's new is that it has become affordable, in both money and time.
  • title of this workshop mentions "text mining", I prefer
  • through a process of abstraction...
  • We computed a “suppression index” for each person by dividing their frequency from 1933-1945 by the mean frequency in 1925-1933 and in 1955-1965.
  • designed to capture the sentiment of political texts
  • accuracy: proportion correctly classified
