Duke Libraries / Text > Data                                           September 20, 2012                                 ...
Duke Libraries / Text > Data   September 20, 2012@rybesh #duketext                              2
Duke Libraries / Text > Data                  September 20, 2012                               Roberto Busa@rybesh #dukete...
Duke Libraries / Text > Data                       September 20, 2012                         Automated text analysis@rybe...
Duke Libraries / Text > Data                                                   September 20, 2012                         ...
Duke Libraries / Text > Data                                                  September 20, 2012                         A...
Duke Libraries / Text > Data                                          September 20, 2012                               Lan...
Duke Libraries / Text > Data                                           September 20, 2012                               La...
Duke Libraries / Text > Data                                             September 20, 2012                               ...
Duke Libraries / Text > Data                                               September 20, 2012                             ...
Duke Libraries / Text > Data                    September 20, 2012                               Plan of attack@rybesh #du...
Duke Libraries / Text > Data                    September 20, 2012                               Plan of attack           ...
Duke Libraries / Text > Data                            September 20, 2012                               Acquiring text   ...
Duke Libraries / Text > Data   September 20, 2012@rybesh #duketext                             10
Duke Libraries / Text > Data             September 20, 2012                               Sources@rybesh #duketext        ...
Duke Libraries / Text > Data                September 20, 2012                               Sources               • Exist...
Duke Libraries / Text > Data                                 September 20, 2012                               Sources     ...
Duke Libraries / Text > Data                          September 20, 2012                           Existing digital corpor...
Duke Libraries / Text > Data                               September 20, 2012                           Existing digital c...
Duke Libraries / Text > Data                                                September 20, 2012                            ...
Duke Libraries / Text > Data                           September 20, 2012                               Other digital sour...
Duke Libraries / Text > Data                                September 20, 2012                               Other digital...
Duke Libraries / Text > Data                                 September 20, 2012                               Other digita...
Duke Libraries / Text > Data                      September 20, 2012                               Undigitized text@rybesh...
Duke Libraries / Text > Data                               September 20, 2012                               Undigitized te...
Duke Libraries / Text > Data                     September 20, 2012                               Preparing texts@rybesh #...
Duke Libraries / Text > Data                     September 20, 2012                               Preparing texts         ...
Duke Libraries / Text > Data                                 September 20, 2012                               Preparing te...
Duke Libraries / Text > Data                     September 20, 2012                               Preparing texts@rybesh #...
Duke Libraries / Text > Data                           September 20, 2012                               Preparing texts   ...
Duke Libraries / Text > Data                            September 20, 2012                               Preparing texts  ...
Duke Libraries / Text > Data                                September 20, 2012                          Representing text ...
Duke Libraries / Text > Data                                September 20, 2012               Slowly welling from the point...
Duke Libraries / Text > Data                              September 20, 2012            11       the          1   wax     ...
Duke Libraries / Text > Data                             September 20, 2012            11       the         1   wax       ...
Duke Libraries / Text > Data                                   September 20, 2012                               doc 1 doc ...
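The slides on representing text reduce a passage to a table of word counts and then line documents up side by side as columns of counts. The sketch below illustrates that abstraction under stated assumptions: the tokenizer is a crude lowercase regex, and the two example documents are invented for illustration, not taken from the talk.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def term_counts(text):
    """Bag-of-words view of one document: word -> number of occurrences."""
    return Counter(tokenize(text))

def document_term_matrix(docs):
    """Return the sorted vocabulary and, for each term, its count per document."""
    counts = [term_counts(d) for d in docs]
    vocab = sorted(set().union(*counts))
    # A Counter returns 0 for words a document does not contain.
    matrix = {term: [c[term] for c in counts] for term in vocab}
    return vocab, matrix

# Hypothetical two-document corpus.
docs = ["the cat sat on the mat", "the dog sat"]
vocab, matrix = document_term_matrix(docs)
for term in vocab:
    print(f"{term:>4}", matrix[term])
```

Word order is discarded entirely in this representation; only the counts survive, which is what makes documents comparable as vectors.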
Duke Libraries / Text > Data                         September 20, 2012                               Document similarity ...
Duke Libraries / Text > Data                          September 20, 2012                               Document similarity...
Duke Libraries / Text > Data                                     September 20, 2012                               Document...
Duke Libraries / Text > Data                                    September 20, 2012                               Analyzing...
Duke Libraries / Text > Data                                               September 20, 2012                  Six methods...
Duke Libraries / Text > Data                                                          September 20, 2012                  ...
Duke Libraries / Text > Data                    September 20, 2012                               Counting words@rybesh #du...
Duke Libraries / Text > Data                                       September 20, 2012                                     ...
Duke Libraries / Text > Data                    September 20, 2012                               Counting words@rybesh #du...
Duke Libraries / Text > Data                        September 20, 2012                                   Counting words   ...
Duke Libraries / Text > Data                              September 20, 2012                                   Counting wo...
Duke Libraries / Text > Data                                        September 20, 2012                                   C...
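The speaker notes for the word-counting slides quote a "suppression index": a person's frequency in 1933-1945 divided by their mean frequency in 1925-1933 and 1955-1965. A minimal sketch of that ratio follows. It assumes yearly frequencies are available as a dictionary keyed by year, averages the yearly values within 1933-1945, and pools all baseline years into a single mean; the note does not say whether the original study pooled years or averaged the two period means, and the toy data is invented.

```python
from statistics import mean

# Year ranges as quoted in the note (inclusive); 1933 appears in both.
DURING = range(1933, 1946)
BASELINE = list(range(1925, 1934)) + list(range(1955, 1966))

def suppression_index(freq_by_year):
    """Mean frequency in 1933-1945 divided by the pooled baseline mean.

    Values well below 1 indicate a name appearing far less often
    during the period than before and after it.
    """
    during = mean(freq_by_year[y] for y in DURING)
    baseline = mean(freq_by_year[y] for y in BASELINE)
    return during / baseline

# Hypothetical frequencies: 4.0 per year, dropping to 1.0 in 1934-1945.
toy = {y: 4.0 for y in range(1925, 1966)}
toy.update({y: 1.0 for y in range(1934, 1946)})
index = suppression_index(toy)
print(round(index, 3))
```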
Duke Libraries / Text > Data   September 20, 2012@rybesh #duketext                             32
Duke Libraries / Text > Data   September 20, 2012@rybesh #duketext                             33
Duke Libraries / Text > Data                       September 20, 2012                               Concordance tools@rybe...
Duke Libraries / Text > Data                        September 20, 2012                               Dictionary methods@ry...
Duke Libraries / Text > Data                              September 20, 2012                               Dictionary meth...
Duke Libraries / Text > Data                                  September 20, 2012                               Dictionary ...
Duke Libraries / Text > Data                                       September 20, 2012            Lexicoder Sentiment Dicti...
Duke Libraries / Text > Data                  September 20, 2012ACQUITTANCE DOCKET          LEGALIZATIONS QUITCLAIMADJOURN...
Duke Libraries / Text > Data                     September 20, 2012                               Litigious WordsACQUITTAN...
Duke Libraries / Text > Data                 September 20, 2012                   Simple dictionary algorithm@rybesh #duke...
Duke Libraries / Text > Data                  September 20, 2012                   Simple dictionary algorithm            ...
Duke Libraries / Text > Data                               September 20, 2012                   Simple dictionary algorith...
Duke Libraries / Text > Data                           September 20, 2012                   Simple dictionary algorithm   ...
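The text of the "Simple dictionary algorithm" slides is cut off in this copy, so the following is only a guess at the general shape of a dictionary method, not the algorithm presented in the talk. It counts the tokens that appear in a positive and a negative word list and reports the difference as a share of all tokens. The word lists are tiny hypothetical stand-ins, not entries from the Lexicoder Sentiment Dictionary, and the scoring formula is an assumption.

```python
import re

# Hypothetical miniature dictionaries, for illustration only.
POSITIVE = {"good", "great", "hope"}
NEGATIVE = {"bad", "fear", "loss"}

def dictionary_score(text, positive=POSITIVE, negative=NEGATIVE):
    """Count dictionary hits and return (positive, negative, tone).

    tone = (positive - negative) / total tokens, or 0.0 for empty text.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(t in positive for t in tokens)
    neg = sum(t in negative for t in tokens)
    tone = (pos - neg) / len(tokens) if tokens else 0.0
    return pos, neg, tone

pos, neg, tone = dictionary_score("Great hope, little fear, no loss of good will.")
print(pos, neg, round(tone, 3))
```

A method like this is only as good as its word list: a word can count as positive in one domain and neutral or negative in another, which is why dictionary results need validating against human judgments.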
Duke Libraries / Text > Data   September 20, 2012@rybesh #duketext                             39
Duke Libraries / Text > Data                               September 20, 2012                               26 uses of pos...
Duke Libraries / Text > Data                                      September 20, 2012        AGAINST                LIMITED...
Duke Libraries / Text > Data   September 20, 2012@rybesh #duketext                             43
Duke Libraries / Text > Data                September 20, 2012                  Supervised machine learning@rybesh #dukete...
Duke Libraries / Text > Data                           September 20, 2012                  Supervised machine learning    ...
Duke Libraries / Text > Data   September 20, 2012          Welcome your         robot overlords@rybesh #duketext          ...
Duke Libraries / Text > Data              September 20, 2012                  Augmenting human capacity@rybesh #duketext  ...
Duke Libraries / Text > Data   September 20, 2012@rybesh #duketext                             47
Duke Libraries / Text > Data                September 20, 2012                  Supervised machine learning@rybesh #dukete...
Duke Libraries / Text > Data                September 20, 2012                  Supervised machine learning               ...
Duke Libraries / Text > Data                              September 20, 2012                  Supervised machine learning ...
Duke Libraries / Text > Data                                   September 20, 2012                  Supervised machine lear...
Duke Libraries / Text > Data                             September 20, 2012                               Creating a train...
Duke Libraries / Text > Data                               September 20, 2012                               Creating a tra...
Duke Libraries / Text > Data              September 20, 2012              Supervised learning algorithms@rybesh #duketext ...
Duke Libraries / Text > Data                                    September 20, 2012              Supervised learning algori...
Duke Libraries / Text > Data           September 20, 2012             Unsupervised machine learning@rybesh #duketext      ...
Duke Libraries / Text > Data                            September 20, 2012             Unsupervised machine learning      ...
Duke Libraries / Text > Data                             September 20, 2012             Unsupervised machine learning     ...
Duke Libraries / Text > Data                                           September 20, 2012                               No...
Duke Libraries / Text > Data                                              September 20, 2012                              ...
Duke Libraries / Text > Data                           September 20, 2012                                   Two kinds of  ...
Duke Libraries / Text > Data                               September 20, 2012                                   Two kinds ...
Duke Libraries / Text > Data                                 September 20, 2012                                   Two kind...
Duke Libraries / Text > Data              September 20, 2012                Single membership clustering@rybesh #duketext ...
Duke Libraries / Text > Data                              September 20, 2012                Single membership clustering  ...
Duke Libraries / Text > Data   September 20, 2012@rybesh #duketext                             56
Duke Libraries / Text > Data                 September 20, 2012                               http://shabal.in/visuals.htm...
Duke Libraries / Text > Data             September 20, 2012                Mixed membership clustering@rybesh #duketext   ...
Duke Libraries / Text > Data                           September 20, 2012                Mixed membership clustering      ...
Duke Libraries / Text > Data                                September 20, 2012                Mixed membership clustering ...
Duke Libraries / Text > Data                          September 20, 2012                           Probability distributio...
Duke Libraries / Text > Data                       September 20, 2012                               "Generating" text@rybe...
Duke Libraries / Text > Data                                 September 20, 2012                               "Generating"...
Duke Libraries / Text > Data                         September 20, 2012                               Topic modeling demo@...
Duke Libraries / Text > Data                           September 20, 2012                               http://dsl.richmon...
Duke Libraries / Text > Data                                                      September 20, 2012                      ...
Duke Libraries / Text > Data                                         September 20, 2012                               Vali...
Duke Libraries / Text > Data                        September 20, 2012                           Validating word counts@ry...
Duke Libraries / Text > Data                           September 20, 2012                           Validating word counts...
Duke Libraries / Text > Data                                         September 20, 2012                               http...
Duke Libraries / Text > Data              September 20, 2012               Validating dictionary methods@rybesh #duketext ...
Duke Libraries / Text > Data                              September 20, 2012               Validating dictionary methods  ...
Duke Libraries / Text > Data                                  September 20, 2012               Validating dictionary metho...
Duke Libraries / Text > Data             September 20, 2012              Validating supervised methods@rybesh #duketext   ...
Duke Libraries / Text > Data                            September 20, 2012              Validating supervised methods     ...
Duke Libraries / Text > Data                             September 20, 2012              Validating supervised methods    ...
Duke Libraries / Text > Data                                             September 20, 2012                               ...
Duke Libraries / Text > Data                                             September 20, 2012    Accuracy: 197 / 262 = 75%  ...
Duke Libraries / Text > Data                                             September 20, 2012       Precision: 57 / 78 = 73%...
Duke Libraries / Text > Data                                             September 20, 2012       Recall: 57 / 91 = 63%   ...
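The three figures quoted on these slides can be checked with a few lines of arithmetic. The sketch below takes the counts exactly as quoted and reads them in the standard way: accuracy is correct classifications over all documents, precision is correct positive predictions over all positive predictions, and recall is correct positive predictions over all documents that truly belong to the class. The variable names are illustrative, and no attempt is made to reconstruct the full confusion matrix, which is not visible in this copy.

```python
# Counts as quoted on the slides.
correct, total = 197, 262          # Accuracy: 197 / 262
true_positives = 57                # numerator of precision and recall
predicted_positives = 78           # Precision: 57 / 78
actual_positives = 91              # Recall: 57 / 91

accuracy = correct / total
precision = true_positives / predicted_positives
recall = true_positives / actual_positives

# Format as whole percentages, matching the rounding on the slides.
print(f"accuracy  {accuracy:.0%}")
print(f"precision {precision:.0%}")
print(f"recall    {recall:.0%}")
```

Reporting precision and recall alongside accuracy matters because a classifier can score well on accuracy while still missing many members of the class of interest, as the lower recall figure here shows.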
Duke Libraries / Text > Data          September 20, 2012          Validating unsupervised methods@rybesh #duketext        ...
Duke Libraries / Text > Data                                  September 20, 2012          Validating unsupervised methods ...
Duke Libraries / Text > Data          September 20, 2012          Validating unsupervised methods@rybesh #duketext        ...
Duke Libraries / Text > Data                           September 20, 2012          Validating unsupervised methods        ...
Duke Libraries / Text > Data                               September 20, 2012                     Validating topic coheren...
Duke Libraries / Text > Data                  September 20, 2012                    Validating topic assignment@rybesh #du...
Duke Libraries / Text > Data          September 20, 2012          Validating unsupervised methods@rybesh #duketext        ...
Duke Libraries / Text > Data                            September 20, 2012          Validating unsupervised methods       ...
Duke Libraries / Text > Data                                            September 20, 2012                                ...
Duke Libraries / Text > Data                         September 20, 2012                               Three kinds of data@...
Duke Libraries / Text > Data                              September 20, 2012                               Three kinds of ...
Duke Libraries / Text > Data                  September 20, 2012                               Textual data@rybesh #dukete...
Duke Libraries / Text > Data                            September 20, 2012                               Textual data     ...
Duke Libraries / Text > Data                                  September 20, 2012                               Textual dat...
Duke Libraries / Text > Data                                September 20, 2012                               Textual data ...
Duke Libraries / Text > Data                   September 20, 2012                               Software data@rybesh #duke...
Duke Libraries / Text > Data                         September 20, 2012                               Software data       ...
Duke Libraries / Text > Data                           September 20, 2012                               Software data     ...
Duke Libraries / Text > Data                            September 20, 2012                               Software data    ...
Duke Libraries / Text > Data                      September 20, 2012                               Documentary data@rybesh...
Duke Libraries / Text > Data                          September 20, 2012                               Documentary data   ...
Duke Libraries / Text > Data                              September 20, 2012                               Documentary dat...
Duke Libraries / Text > Data                      September 20, 2012                         Long-term preservation@rybesh...
Duke Libraries / Text > Data                         September 20, 2012                         Long-term preservation    ...
Duke Libraries / Text > Data                             September 20, 2012                         Long-term preservation...
Duke Libraries / Text > Data                September 20, 2012                               Take-aways@rybesh #duketext  ...
Duke Libraries / Text > Data                           September 20, 2012                               Take-aways        ...
Duke Libraries / Text > Data                                September 20, 2012                               Take-aways   ...
Duke Libraries / Text > Data                                September 20, 2012                                Take-aways  ...

Text-mining as a Research Tool in the Humanities and Social Sciences

Ryan Shaw (School of Information & Library Science, UNC Chapel Hill) provides an overview and a critique of text-mining projects, and discusses project design, methodology, scope, integrity of data and analysis as well as preservation. This presentation will help scholars understand the research potential of text mining, and offer a summary of issues and concerns about technology and methods.

See also:

http://aesh.in/RC
http://sfy.co/e8ys

Published in: Technology, Education
  • 1949: Roberto Busa persuaded IBM to sponsor his project to produce a complete concordance of the works of St. Thomas Aquinas, a 30-year effort. Automated text analysis is not new; what is new is that it has become affordable, in both money and time.
  • The title of this workshop mentions "text mining"; I prefer "automated text analysis".
  • through a process of abstraction...
  • We computed a "suppression index" for each person by dividing their frequency from 1933-1945 by the mean frequency in 1925-1933 and in 1955-1965.
  • The Lexicoder Sentiment Dictionary is designed to capture the sentiment of political texts.
  • accuracy: proportion correctly classified
  • Text-mining as a Research Tool in the Humanities and Social Sciences

    1. Duke Libraries / Text > Data, September 20, 2012
        Text-mining as a Research Tool in the Humanities and Social Sciences
        Ryan Shaw, ryanshaw@unc.edu, http://aesh.in/RC
        @rybesh #duketext
    11. Roberto Busa
    13. Automated text analysis
        Automated text analysis is a tool for discovery and measurement in textual data of prevalent attitudes, concepts, or events.
        O'Connor, Bamman & Smith 2011, "Computational Text Analysis for Social Science" http://goo.gl/PxruI
    14. Automated text analysis
        Automated text analysis is a tool for discovery and measurement in textual data of patterns of language use interpretable as prevalent attitudes, concepts, or events.
        O'Connor, Bamman & Smith 2011, "Computational Text Analysis for Social Science" http://goo.gl/PxruI
    18. Language modeling
        • Methods for automated text analysis are based on mathematical models of language
        • Mathematical models distinguish elements and make explicit the relations among them
        • They do not explain, but they can be interpreted
        Black 1962, "Models and Archetypes" http://goo.gl/zKtrx
    22. Language modeling
        • All mathematical models of language are necessarily wrong
        • Nevertheless they may be useful
        • They must be evaluated on their ability to help scholars make inferences, achieve insights, and generate new interpretations
        Grimmer & Stewart 2012, "Text as Data" http://goo.gl/tFPFs
    28. Plan of attack
        • Acquiring text
        • Representing text
        • Analyzing text
        • Validating results
        • Managing data
    29. Acquiring text: Collecting your data
    34. Sources
        • Existing digital corpora
        • Other digital sources (e.g. Web, Twitter)
        • Undigitized text
    39. Existing digital corpora
        • Ideally, texts will be available as XML
        • Quality of text and metadata is high
        • But collections tend to be small
        • Licensing agreements may prohibit text analysis
    40. http://www.hathitrust.org/htrc
        • 10.5 million total volumes
        • 5.5 million book titles
        • 270,000 serial titles
        • 3.2 million public domain
    46. Other digital sources
        • Some kinds of texts (e.g. tweets) can be obtained through an API
        • Websites without APIs can be "scraped"
        • Generally requires custom programming
        • Website restrictions may limit how much or how quickly texts can be collected
        • Metadata will be limited or absent
    51. Undigitized text
        • Undigitized text must be scanned and subjected to Optical Character Recognition
        • Time and labor intensive
        • OCR will introduce errors in your texts
        • You need to produce your own metadata
    56. Preparing texts
        • OCR errors
        • Words broken across lines
        • Running headers and footers
        • Breaking into paragraphs, sentences, etc.
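Three of the preparation problems listed on this slide can be handled with short scripts. The following is a minimal sketch, not a robust pipeline. All function names, the 80% threshold, and the sample pages are hypothetical. It assumes each page is a list of lines and that a running header repeats verbatim across pages.

```python
import re
from collections import Counter

def drop_running_lines(pages, min_share=0.8):
    """Drop lines that recur on at least min_share of the pages (likely headers or footers)."""
    seen = Counter()
    for page in pages:
        seen.update(set(line.strip() for line in page if line.strip()))
    threshold = min_share * len(pages)
    return [[ln for ln in page if seen[ln.strip()] < threshold] for page in pages]

def join_broken_words(text):
    """Rejoin words hyphenated across a line break, e.g. 'light-' + 'house'."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

# Hypothetical OCR output: three pages sharing one running header.
pages = [
    ["SAMPLE RUNNING HEADER", "The entire bay quivered; the light-"],
    ["SAMPLE RUNNING HEADER", "house wobbled. She winked quickly."],
    ["SAMPLE RUNNING HEADER", "Accidents were awful things."],
]
cleaned = drop_running_lines(pages)
text = join_broken_words("\n".join(line for page in cleaned for line in page))
sentences = split_sentences(text)
print(sentences)
```

Each heuristic has known failure modes. Page numbers change from page to page, so the header filter misses them. Genuinely hyphenated compounds that happen to break at a line end get merged. Abbreviations such as "Mr." trigger false sentence breaks. Spot-checking the output by hand remains necessary.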
    60. Preparing texts
        • The bulk of your time will be spent acquiring and preparing your texts
        • Worth your time to learn a scripting language (such as Python)
        • Command-line text-processing tools on Mac OS and Unix are also very useful
    61. Representing text: Turning words into numbers
    62. Slowly welling from the point of her gold nib, pale blue ink dissolved the full stop; for there her pen stuck; her eyes fixed, and tears slowly filled them. The entire bay quivered; the lighthouse wobbled; and she had the illusion that the mast of Mr. Connor's little yacht was bending like a wax candle in the sun. She winked quickly. Accidents were awful things. She winked again. The mast was straight; the waves were regular; the lighthouse was upright; but the blot had spread.
    63.
        11 the           1 wax         1 quivered
         3 was           1 waves       1 quickly
         3 she           1 upright     1 point
         3 her           1 things      1 pen
         2 winked        1 there       1 pale
         2 were          1 them        1 nib
         2 slowly        1 that        1 mr
         2 of            1 tears       1 little
         2 mast          1 sun         1 like
         2 lighthouse    1 stuck       1 ink
         2 had           1 straight    1 in
         2 and           1 stop        1 illusion
         1 yacht         1 spread      1 gold
         1 wobbled       1 s           1 full
         1 welling       1 regular     1 from
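The word counts above can be reproduced with a few lines of Python. This is a minimal sketch. The tokenization rule (lowercase, then keep runs of the letters a-z) is an assumption, chosen because it yields the stray "s" from "Connor's" that appears in the slide's counts. The function name is hypothetical.

```python
import re
from collections import Counter

PASSAGE = """Slowly welling from the point of her gold nib, pale blue ink
dissolved the full stop; for there her pen stuck; her eyes fixed, and tears
slowly filled them. The entire bay quivered; the lighthouse wobbled; and she
had the illusion that the mast of Mr. Connor's little yacht was bending like
a wax candle in the sun. She winked quickly. Accidents were awful things.
She winked again. The mast was straight; the waves were regular; the
lighthouse was upright; but the blot had spread."""

def count_words(text):
    """Lowercase the text and count every run of ASCII letters as one word."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

counts = count_words(PASSAGE)
for word, n in counts.most_common(5):
    print(n, word)
```

Once a text is reduced to counts like these, word order is discarded. That is the simplification later slides rely on when they compare documents.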
    65.
        11 the           1 wax         1 quiver
         3 wa            1 wave        1 quickli
         3 she           1 upright     1 point
         3 her           1 thing       1 pen
         2 wink          1 there       1 pale
         2 were          1 them        1 nib
         2 slowli        1 that        1 mr
         2 of            1 tear        1 littl
         2 mast          1 sun         1 like
         2 lighthous     1 stuck       1 ink
         2 had           1 straight    1 in
         2 and           1 stop        1 illus
         1 yacht         1 spread      1 gold
         1 wobbl         1 s           1 full
         1 well          1 regular     1 from
    66. doc 1  doc 2  doc 3  doc 4  doc 5  doc 6
        accid 1
        actual 1
        again 1 1
        alreadi 1
        antenna 1
        archer 1
        avoid 2 1
        awai 1
        aw 1
        bag 1
        bandanna 1
        barfoot 2
    70. Document similarity [figure: doc 1 and doc 6 plotted on axes "avoid" and "again", with a "similarity" label]
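Once documents are rows of term counts, each can be treated as a point in a space with one axis per term, as in the figure. The slide does not name a similarity measure. The sketch below assumes cosine similarity, a common choice, and uses hypothetical counts for two documents on the "again" and "avoid" axes.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors stored as dicts."""
    terms = set(a) | set(b)
    dot = sum(a.get(t, 0) * b.get(t, 0) for t in terms)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical term counts, not taken from the slide's matrix.
doc_a = {"again": 1, "avoid": 2}
doc_b = {"again": 1, "avoid": 1}
sim = cosine_similarity(doc_a, doc_b)
print(round(sim, 3))
```

Cosine similarity depends on the direction of the vectors and not their length, so a long and a short document with the same mix of words score as highly similar.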
    71. Analyzing text: Counting, comparing, categorizing and pattern-finding
    78. Six methods of text analysis
        • Reading
        • Counting words
        • Human coding (manual content analysis)
        • Dictionary methods
        • Supervised machine learning
        • Unsupervised machine learning
        Quinn et al. 2010 http://dx.doi.org/10.1111/j.1540-5907.2009.00427.x
    80. Counting words http://www.nytimes.com/ref/washington/20070123_STATEOFUNION.html
    82. Michel et al. 2010 http://dx.doi.org/10.1126/science.1199644
    88. Counting words
        • Easily computed
        • Results are replicable
        • Comparisons require metadata, e.g. publication year, language, subject category, location
        • Word use is ambiguous
        • Spelling may vary
    91. Concordance tools
    95. Dictionary methods
        • A dictionary is simply a list of words
        • Lists are compiled for specific categories of interest: negative words, law-related words, names of places, names of chemicals, etc.
        • May be custom-built or reused
    96. Lexicoder Sentiment Dictionary
        A LIE 0        WOUNDED 0      ABILITY* 1        WOOS 1
        ABANDON* 0     WOUNDS 0       ABOUND* 1         WORKABLE* 1
        ABAS* 0        WRATH* 0       ABSOLV* 1         WORKMANSHIP* 1
        ABATTOIR* 0    WRECK* 0       ABSORBENT* 1      WORSHIP* 1
        ABDICAT* 0     WRESTL* 0      ABSORPTION* 1     WORTH 1
        ABERRA* 0      WRETCH* 0      ABUNDANC* 1       WORTH WHILE* 1
        ABHOR* 0       WRITHE* 0      ABUNDANT* 1       WORTHI* 1
        ABJECT* 0      WRONG* 0       ACCED* 1          WORTHWHILE* 1
        ABNORMAL* 0    XENOPHOB* 0    ACCENTUAT* 1      WORTHY* 1
        ABOLISH* 0     YAWN* 0        ACCEPT* 1         YOUNG AT HEART 1
        ABOMINAB* 0    YEARN* 0       ACCESSIB* 1       ZEAL 1
        ABOMINAT* 0    YUCK* 0        ACCLAIM* 1        ZEALOUS* 1
        ABRASIV* 0     ZEALOT* 0      ACCLAMATION* 1    ZEST* 1
    98. Litigious Words
        ACQUITTANCE      DOCKET           LEGALIZATIONS    QUITCLAIM
        ADJOURNING       ESCHEATED        LEGALLY          REBUTS
        APPELLANTS       EXCEEDENCES      LITIGATORS       REQUESTER
        APPOINTOR        EXCULPATED       MISTRIALS        RESCINDS
        ARBITRATE        FOREBEAR         NOTARIZE         STATUTE
        ASSERTABLE       INASMUCH         NOTARIZED        SUBPARAGRAPHS
        CHATTEL          INDEMNITY        OBLIGOR          SUBPOENAS
        CODIFICATIONS    INJUNCTION       PERSONAM         SUBTRUSTS
        CONVICTED        INTERLOCUTORY    PLEADS           TENANTABILITY
        COUNTERSUIT      INTERPLEADER     POSTJUDGMENT     TESTAMENTARY
        DEFEASANCE       INTERROGATE      PRETRIAL         UNENCUMBERED
        DELEGATEE        IRREVOCABLY      PRIMA            UNREMEDIATED
        DEPOSED          LEGALIZATION     PROSECUTIONS     WHEREOF
    103. Simple dictionary algorithm
        • For each word in document:
            • +1 if the word is in the positive list
            • –1 if the word is in the negative list
        • Divide the total by the number of words
    109. 26 uses of positive words – 51 uses of negative words = –25
        –25 / 779 total words = –0.032
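The simple dictionary algorithm and the worked arithmetic above translate almost line for line into code. This is a minimal sketch. The function name, the tiny word lists, and the sample sentence are hypothetical. It matches whole words only, whereas dictionaries such as Lexicoder also use wildcard stems (e.g. ABANDON*), which this sketch does not handle.

```python
import re

def dictionary_score(text, positive, negative):
    """(+1 per positive word, -1 per negative word) divided by the total word count."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    total = 0
    for word in words:
        if word in positive:
            total += 1
        elif word in negative:
            total -= 1
    return total / len(words)

# Hypothetical miniature word lists for illustration only.
positive = {"good", "great", "success"}
negative = {"poor", "problem", "failed"}
score = dictionary_score(
    "A great effort, but a poor finish and a failed comeback.", positive, negative
)
print(round(score, 3))

# The arithmetic from the slide: 26 positive uses, 51 negative uses, 779 words.
slide_score = (26 - 51) / 779
print(round(slide_score, 3))
```

Dividing by the number of words makes scores comparable across documents of different lengths. It does nothing about the deeper weakness that a word may carry a different meaning in a new domain.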
    110. AGAINST, AGGRESSIVENESS, ATTACK, ATTACKING, CHALLENGE, CONTRAST, DEFENSIVE, DEFICIENCIES, DEVIL, DEVILS, DISMAL, EXPLOIT, FAILED, FOUL, FOULING, FOULS, FUTILITY, INABILITY, LIMITED, LIMITING, NEGATE, OFFENSE, OFFENSIVE, OFFENSIVELY, OPPOSING, PLAGUED, POOR, PROBLEM, SHORTCOMINGS, SLUGGISH, THORNTON, THREATS, TOO, TROUBLE, TROUBLES, UNABLE
        ADEQUATELY, ADVANTAGE, ASSISTS, EFFICIENT, EFFICIENTLY, EFFORT, FREE, FRESHMAN, GOOD, GREAT, IMPROVEMENT, KEEPING, LIKE, PATRIOT, PERFECT, RESPONSIBLE, SIGNIFICANT, STRONGER, SUCCESS, WELL
    118. Supervised machine learning
        • The situation: you know the categories of interest
        • The problem: human coding of documents doesn't scale
        • The solution: teach a robot to do it
    119. Welcome your robot overlords
    121. Augmenting human capacity
    132. Supervised machine learning
        1. Create a training set.
        2. Use the training set to "teach" a supervised learning algorithm how to map document features (e.g. words) to categories.
        3. Test your classifying machine to see if it learned correctly.
        4. Use it to classify the rest of your documents.
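The four steps above can be walked through on toy data. The sketch below is one possible instance, assuming a multinomial Naive Bayes classifier with add-one smoothing (one of several algorithm families the deck mentions). All function names, documents, and categories are hypothetical. Accuracy is computed as the proportion correctly classified, as in the speaker notes.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Step 2: learn per-category word counts from (words, category) pairs."""
    doc_counts, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for words, category in labeled_docs:
        doc_counts[category] += 1
        word_counts[category].update(words)
        vocab.update(words)
    return {"docs": doc_counts, "words": word_counts, "vocab": vocab}

def classify(model, words):
    """Step 4: pick the category with the highest smoothed log-probability."""
    total_docs = sum(model["docs"].values())
    vocab_size = len(model["vocab"])
    best_category, best_score = None, -math.inf
    for category, n_docs in model["docs"].items():
        n_words = sum(model["words"][category].values())
        score = math.log(n_docs / total_docs)
        for word in words:
            if word in model["vocab"]:  # words never seen in training are skipped
                count = model["words"][category][word]
                score += math.log((count + 1) / (n_words + vocab_size))
        if score > best_score:
            best_category, best_score = category, score
    return best_category

def accuracy(model, labeled_docs):
    """Step 3: proportion of held-out documents classified correctly."""
    correct = sum(classify(model, words) == cat for words, cat in labeled_docs)
    return correct / len(labeled_docs)

# Step 1: a hypothetical hand-coded training set, plus a held-out test set.
training = [
    ("court ruled on the injunction".split(), "law"),
    ("the judge dismissed the subpoena".split(), "law"),
    ("the team won the game".split(), "sports"),
    ("a foul ended the game".split(), "sports"),
]
held_out = [
    ("the judge ruled".split(), "law"),
    ("the team committed a foul".split(), "sports"),
]
model = train(training)
acc = accuracy(model, held_out)
print(acc)
```

Testing on documents the model has not seen is the important design choice here. Accuracy measured on the training set itself would overstate how well the classifier handles the rest of the corpus.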
    136. Creating a training set
        • Create a coding scheme that humans can use reliably and without ambiguity.
        • Select (ideally randomly) a subset of your documents, and code them by hand.
        • You need "enough" documents: more categories, more documents.
    140. Supervised learning algorithms
        • Many kinds: Naïve Bayes, decision trees / random forests, support vector machines, neural networks, etc.
        • No "best" one: performance is domain- and dataset-specific
        • "Ensembles" of different algorithms can often outperform single algorithms
    145. Unsupervised machine learning
        • The situation: you don't know the categories of interest, or want to discover new ones
        • The solution: have a robot explore and find possible categorizations for you, and use them to categorize documents
        • Also known as "clustering"
    149. No free lunch
        • No need for manual coding beforehand
        • But as much or more manual labor is needed to evaluate suggested categorizations afterwards
        • The value is a novel categorization, not time or labor saved
        Grimmer & Stewart 2012, "Text as Data" http://goo.gl/tFPFs
    152. Two kinds of unsupervised learning
        • Single membership clustering: each document is assigned to one category
        • Mixed membership clustering: a document may be assigned to multiple categories, each with a different proportion
    156. Single membership clustering: 1. Define a quantitative measure of similarity between documents. 2. Define a quantitative measure of how "good" a cluster is. 3. Define a process for optimizing the overall goodness of the clusters.
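The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not name a similarity measure or an optimization procedure, so cosine similarity over word counts and a k-means-style loop are choices made here, and the function names and seed documents are hypothetical.

```python
import math
from collections import Counter

def vectorize(text):
    """Represent a document as a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Step 1: similarity between two count vectors (assumed measure: cosine)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Sum of the member vectors; cosine ignores scale, so no division is needed."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total

def goodness(vectors, labels, centroids):
    """Step 2: total similarity of each document to its own cluster centroid."""
    return sum(cosine(v, centroids[k]) for v, k in zip(vectors, labels))

def cluster(texts, seeds, max_rounds=20):
    """Step 3: alternate between reassigning documents and recomputing centroids."""
    vectors = [vectorize(t) for t in texts]
    centroids = [vectors[i] for i in seeds]  # seed documents are an arbitrary choice
    labels = None
    for _ in range(max_rounds):
        new_labels = [max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
                      for v in vectors]
        if new_labels == labels:
            break  # no document changed cluster, so the process has converged
        labels = new_labels
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = centroid(members)
    return labels, goodness(vectors, labels, centroids)

texts = [
    "the cat chased the dog",
    "the dog chased the cat",
    "stocks fell as markets closed",
    "markets closed as stocks rose",
]
labels, score = cluster(texts, seeds=[0, 2])
```

Each document ends up with exactly one label, which is what makes this "single membership". Different seeds or a different similarity measure can produce a different clustering.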
    160. http://shabal.in/visuals.html
    165. Mixed membership clustering • Topic modeling is a popular example • Each document is modeled as a mixture of categories or topics • A document is a probability distribution over topics • A topic is a probability distribution over words
    166. Probability distribution
    171. "Generating" text: 1. Roll our "topic dice" to choose a topic. 2. Get the "word dice" corresponding to the chosen topic. 3. Roll the "word dice" to choose a word. 4. Repeat until we've chosen all the words for our text.
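The dice-rolling procedure above can be written out directly. The sketch below assumes a document is given as a dictionary of topic probabilities and each topic as a dictionary of word probabilities; the topic names, words, and numbers are made up for illustration.

```python
import random

def generate(topic_dice, word_dice, length, rng):
    """Generate `length` words by rolling the topic dice, then that topic's word dice."""
    topics = list(topic_dice)
    words = []
    for _ in range(length):  # 4. Repeat until all the words are chosen.
        # 1. Roll the "topic dice" to choose a topic.
        topic = rng.choices(topics, weights=[topic_dice[t] for t in topics])[0]
        # 2. Get the "word dice" corresponding to the chosen topic.
        dice = word_dice[topic]
        # 3. Roll the "word dice" to choose a word.
        vocab = list(dice)
        words.append(rng.choices(vocab, weights=[dice[w] for w in vocab])[0])
    return words

# Hypothetical document: a probability distribution over topics.
document = {"animals": 0.7, "finance": 0.3}
# Hypothetical topics: each is a probability distribution over words.
topics = {
    "animals": {"dog": 0.5, "cat": 0.3, "horse": 0.2},
    "finance": {"stock": 0.6, "market": 0.4},
}
text = generate(document, topics, length=10, rng=random.Random(0))
```

Topic modeling runs this story in reverse: given only the words, it estimates the topic dice and word dice that would most plausibly have produced them.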
    172. Topic modeling demo
    173. http://dsl.richmond.edu/dispatch/
    174. [Figure: text analysis methods plotted on two axes, from weaker to stronger domain assumptions and from simple to complex statistics / computation. Methods shown: word counting, dictionary methods, supervised methods, topic models.] O'Connor, Bamman & Smith 2011 http://goo.gl/PxruI
    175. Validating results: Keeping the machines from leading you astray
    180. Validating word counts • Text data may have errors (e.g. from OCR) • Metadata may have errors • Texts may appear multiple times • Collections are biased samples
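One of these checks, finding texts that appear multiple times, is easy to automate in a rough way. The sketch below is an assumption about how one might do it, not a method from the talk: it only catches copies that are identical after lowercasing and whitespace normalization, so OCR variants of the same text would still slip through.

```python
import hashlib
import re
from collections import defaultdict

def fingerprint(text):
    """Hash of the text after lowercasing and collapsing runs of whitespace."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def find_duplicates(corpus):
    """Return groups of document ids whose normalized texts are identical."""
    groups = defaultdict(list)
    for doc_id, text in corpus.items():
        groups[fingerprint(text)].append(doc_id)
    return [ids for ids in groups.values() if len(ids) > 1]

# Hypothetical corpus keyed by document id.
corpus = {
    "a": "Call me Ishmael.",
    "b": "call  me\nIshmael.",
    "c": "It was the best of times.",
}
duplicates = find_duplicates(corpus)
```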
    184. http://languagelog.ldc.upenn.edu/nll/?p=1701
    188. Validating dictionary methods • Must verify that dictionary categorizations match human judgments • But humans can't reliably "score" documents on "positivity" or "litigiousness" • Better to convert scores to simple binaries
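As a rough illustration of the last point, dictionary scores can be thresholded into yes/no labels and compared against human yes/no judgments of the same documents. The threshold of 0.0, the scores, and the human labels below are all hypothetical.

```python
def to_binary(score, threshold=0.0):
    """Collapse a continuous dictionary score into a simple yes/no label."""
    return score > threshold

def agreement(scores, human_labels, threshold=0.0):
    """Fraction of documents where the binarized score matches the human label."""
    matches = sum(to_binary(s, threshold) == h
                  for s, h in zip(scores, human_labels))
    return matches / len(human_labels)

# Hypothetical "positivity" scores and human judgments ("is this positive?").
scores = [2.5, -1.0, 0.3, -0.2]
human = [True, False, False, False]
rate = agreement(scores, human)
```

Raw agreement is the simplest possible comparison; it does not correct for agreement that would occur by chance.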
    192. Validating supervised methods • Ideally: take two random non-overlapping samples and manually code them. • Use the first sample to train your supervised learning algorithm. • Use the second sample to evaluate its performance.
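Drawing the two samples can be sketched as follows. The sample sizes, the seed, and the function name are assumptions for illustration; the key property is that both samples come from a single draw without replacement, so they cannot overlap.

```python
import random

def two_samples(doc_ids, n_train, n_test, seed=0):
    """Draw two random, non-overlapping samples of document ids."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    chosen = rng.sample(doc_ids, n_train + n_test)  # sampling without replacement
    return chosen[:n_train], chosen[n_train:]

# Hypothetical collection of 1,000 documents identified by number.
train_ids, test_ids = two_samples(list(range(1000)), n_train=200, n_test=100)
```

Both samples are then coded by hand; only the first is shown to the learning algorithm, and the second is held back for evaluation.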
    195. Accuracy: 197 / 262 = 75% (262 documents)

                      figurative   mixed   literal
         figurative           57      32         2
         mixed                21      30         6
         literal               0       4       110

    196. Precision: 57 / 78 = 73% (figurative category)
    197. Recall: 57 / 91 = 63% (figurative category)
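The three figures can be recomputed from the table. The slides do not label the axes; the sketch below assumes rows are the manual codes and columns are the classifier's output, which is the orientation under which the standard definitions give 57 / 78 for precision (column total) and 57 / 91 for recall (row total).

```python
labels = ["figurative", "mixed", "literal"]
# Assumed orientation: matrix[actual][predicted].
matrix = [
    [57, 32, 2],
    [21, 30, 6],
    [0, 4, 110],
]

def accuracy(m):
    """Correctly classified documents (the diagonal) over all documents."""
    return sum(m[i][i] for i in range(len(m))) / sum(map(sum, m))

def precision(m, k):
    """Of the documents assigned to category k, the share that belong there."""
    return m[k][k] / sum(row[k] for row in m)

def recall(m, k):
    """Of the documents that belong to category k, the share assigned to it."""
    return m[k][k] / sum(m[k])

figurative = labels.index("figurative")
acc = accuracy(matrix)
prec = precision(matrix, figurative)
rec = recall(matrix, figurative)
```

The same two functions work for the "mixed" and "literal" categories by passing a different index.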
    201. Validating unsupervised methods • There are statistical measures of how well a particular clustering "fits" the data • These are not appropriate for evaluating unsupervised clustering of texts • The "data" is butchered text; we don't want to fit it well
    206. Validating unsupervised methods • Does the categorization make sense? • Are the categories distinct? • Are they internally consistent? • Do they provide insight?
    210. Validating topic coherence: { dog, cat, horse, apple, pig, cow } { car, teacher, platypus, agile, blue, Zaire } ? (Chang et al. 2009, http://goo.gl/FCizP)
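The word sets above illustrate the "word intrusion" test from Chang et al. 2009: annotators see a topic's top words plus one out-of-place word and try to spot the intruder. A minimal sketch follows, with hypothetical annotator guesses; if people reliably pick "apple" out of the first set, the topic is coherent, while a set like the second offers no obvious odd one out.

```python
import random

def intrusion_item(topic_words, intruder, rng):
    """Shuffle a topic's top words together with one intruder word."""
    item = topic_words + [intruder]
    rng.shuffle(item)
    return item

def model_precision(guesses, intruder):
    """Fraction of annotators whose guess matches the true intruder."""
    return sum(g == intruder for g in guesses) / len(guesses)

item = intrusion_item(["dog", "cat", "horse", "pig", "cow"], "apple", random.Random(1))
# Hypothetical guesses from four annotators shown `item`.
guesses = ["apple", "apple", "pig", "apple"]
coherence = model_precision(guesses, "apple")
```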
    212. Validating topic assignment
    216. Validating unsupervised methods • Compared to other (manual) categorizations, how well does this one approximate judgments of document relatedness? • Do the categories correlate with external facts? • Turn the categories into a coding scheme and apply supervised methods
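One way to make the first question concrete is to count how often an automatic categorization and a manual one agree about whether two documents belong together. The slides do not name a measure; the sketch below uses the Rand index as an assumed choice, with made-up labels.

```python
from itertools import combinations

def rand_index(auto, manual):
    """Share of document pairs on which the two categorizations agree:
    both place the pair in the same category, or both keep it apart."""
    pairs = list(combinations(range(len(auto)), 2))
    agree = sum((auto[i] == auto[j]) == (manual[i] == manual[j])
                for i, j in pairs)
    return agree / len(pairs)

# Hypothetical labels for four documents; category names need not match.
auto = [0, 0, 1, 1]
manual = ["a", "a", "b", "c"]
score = rand_index(auto, manual)
```

Because only pairwise relatedness is compared, the two schemes can use different category names and even different numbers of categories.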
    217. Managing data: Helping others stand on your shoulders
    221. Three kinds of data: 1. The texts you're analyzing and derivations thereof 2. The software code you're using to process and analyze your texts 3. Documentation of your process
    225. Textual data • You want to keep all intermediate versions of the texts you're processing • A version control system is ideal for this • Version control hosting platforms such as GitHub are ideal for sharing your data too
    229. Software data • Ideally, use open-source software • Keep past versions of whatever software you use • Use version control for your own scripts and software
    233. Documentary data • This is the hardest data to manage • Consider keeping a (public or private) "lab notebook" blog • Anything else you write related to the project, formal or informal
    237. Long-term preservation • Data under version control can be exported, including all versions • Create static snapshots of websites, blogs, etc. • Place everything in a long-term digital repository such as DukeSpace
    241. Take-aways • Text analysis can be a powerful tool. • It's a systematic method of transforming texts to produce new texts for interpretation. • It only augments human judgment and interpretation; it can't replace them.