
DiscoverText: Tools for Text

A talk prepared for a presentation at the Digital Methods Initiative 2014 Winter School held at the University of Amsterdam.



  1. Tools for Text
 Dr. Stuart Shulman
 @stuartwshulman
 stu@texifter.com
 Prepared for the Digital Methods Initiative Winter School 2014
 University of Amsterdam
  2. Acknowledgements
 Richard Rogers
 The National Science Foundation
 Mark J. Hoy
  3. Plan of Attack
 A few high-level thoughts
 Five pillars of text analytics
 Getting started on DiscoverText
 A small collaborative project
 The twittersifter.com beta release
  4. “A funny thing happened…”
 A brief history of DiscoverText
  5. A Master Metaphor: Sifter
  6. An Open Source Kernel
  7. Three Primary Tasks in CAT (the Coding Analysis Toolkit)
  8. Classification of Text
 A 2,500-year-old problem
 Plato argued it would be frustrating
 It still is…
  9. Grimmer & Stewart, “Text as Data,” Political Analysis (2013)
 Volume is a problem for scholars
 Coders are expensive
 Groups struggle to accurately label text at scale
 Validation of both humans and machines is “essential”
 Some models are easier to validate than others
 All models are wrong
 Automated models enhance/amplify, but don’t replace, humans
 There is no one right way to do this
 “Validate, validate, validate”
 “What should be avoided, then, is the blind use of any method without a validation step.”
  10. (Patent Pending)
  11. Three Important Books
  12. One Particularly Important Idea
  13. Five Pillars of Text Analytics
 Search
 Filter
 Code
 Cluster
 Classify
 You can execute all five using DiscoverText
  14. Pillar #1: Search
  15. Search for Negative Cases
  16. Defined Search (Multi-term)
  17. Pillar #2: Filters
 Remember this filter
  18. Another Common Filter
  20. Pillar #3: Human Coding
  21. Keystroke Coding is Fast
  22. Coding Off a List is Faster
  23. Data Cleaning is Fundamental
  24. Pillar #4: Clustering
  26. Latent Dirichlet Allocation (LDA) Topic Models
  27. LDA on the Christie Data
 Data is still processing…
  28. Pillar #5: Machine Learning
  29. Getting Started on DiscoverText
  30. Use the Key in Your Email
  31. Note the Peer Visibility Setting
  32. Peers Make Collaboration Possible
  36. Perhaps a Trending Topic
  38. The Basics
 Raw Data
 Subsets of Data
 Humans or Machines Classify
  40. Grab Some Twitter Data
  41. Create an Empty Archive
  42. Log In to a Twitter Account
  43. Enable via OAuth
  44. Ready to Query Twitter
  45. Use Operators to Refine Queries
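The slides do not list the operators themselves; Twitter's search syntax supports quoted phrases, `OR`, and negation with `-`. The sketch below composes such a query string. The `build_query` helper is hypothetical (DiscoverText takes the query string directly), and the example terms are invented:

```python
# Sketch: composing a Twitter search query from standard search operators
# (quoted phrases, OR, "-" negation). build_query is a hypothetical helper.
def build_query(any_of, none_of=()):
    """OR together the wanted terms, then append negated terms."""
    wanted = " OR ".join(f'"{t}"' if " " in t else t for t in any_of)
    unwanted = " ".join(f"-{t}" for t in none_of)
    return f"({wanted}) {unwanted}".strip()

query = build_query(["chris christie", "#bridgegate"], none_of=["RT"])
print(query)  # ("chris christie" OR #bridgegate) -RT
```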
  46. Set the Frequency of Fetches
  47. Data Will Start Flowing
  48. Data List View
  49. Best List Settings for Twitter Data
  50. Use Buckets to Refine Lists
 Search results go into buckets
 “Defined search” is a multi-term filter
 Metadata filters are also useful for buckets
 Buckets focus the text analytic process
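A "defined search" in the sense of slide 50 is a multi-term filter: an item lands in the bucket if it matches any of the filter terms. A minimal sketch of that semantics, with invented tweets and simplified matching (plain case-insensitive substring, not DiscoverText's actual search engine):

```python
# Sketch of a "defined search" style multi-term filter: keep an item if it
# matches any of the terms (case-insensitive substring match). The tweets
# are invented; real bucket semantics are richer than this.
def bucket(items, terms):
    terms = [t.lower() for t in terms]
    return [it for it in items if any(t in it.lower() for t in terms)]

tweets = [
    "Christie press conference on the bridge",
    "Weather update: snow in Amsterdam",
    "Fort Lee lane closures under review",
]
print(bucket(tweets, ["bridge", "lane"]))
```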
  52. Create a Dataset to Code
 Any archive or bucket
 Use the random sampling tool
 Standard: all coders get all items
 Triage: coders get the next uncoded item
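The two assignment modes on slide 52 can be sketched in a few lines. These are simplified stand-ins with hypothetical names (`sample_dataset`, `standard_assign`, `triage_assign`), not DiscoverText internals:

```python
# Sketch of random sampling plus the two dataset-assignment modes:
# standard (every coder sees every item) and triage (each item goes to
# one coder, round-robin over the "next uncoded" item). All names are
# hypothetical stand-ins for illustration.
import random

def sample_dataset(archive, n, seed=0):
    """Draw a random sample of n items to code."""
    rng = random.Random(seed)
    return rng.sample(archive, n)

def standard_assign(items, coders):
    """Standard: every coder gets every item."""
    return {c: list(items) for c in coders}

def triage_assign(items, coders):
    """Triage: each item is coded once, distributed round-robin."""
    out = {c: [] for c in coders}
    for i, item in enumerate(items):
        out[coders[i % len(coders)]].append(item)
    return out
```

Standard mode is what makes inter-rater reliability (next slides) measurable, since reliability needs overlapping judgments; triage maximizes coverage per coder-hour.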
  54. Select from Three Coding Styles
 Default: mutually exclusive codes
 Option 1: non-mutually exclusive codes
 Option 2: user-defined codes (grounded theory)
  56. Assign Peers to Code a Dataset
 How many coders?
 How many items need to be coded?
 How many test or training sets?
 There are no cookbook answers
  57. Look at Inter-Rater Reliability
 Highly reliable coding (easy tasks)
 Unreliable coding (interesting tasks)
 If humans can’t, neither can machines
 Some tasks are better suited for machines
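One standard inter-rater reliability statistic for two coders is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The deck doesn't say which statistic DiscoverText reports, so this is an illustration of the general idea, computed from scratch with invented labels:

```python
# Inter-rater reliability sketch: Cohen's kappa for two coders,
# kappa = (observed agreement - chance agreement) / (1 - chance agreement).
# The label sequences below are invented examples.
from collections import Counter

def cohens_kappa(a, b):
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement: product of each coder's marginal label rates.
    expected = sum(ca[label] / n * cb[label] / n for label in set(a) | set(b))
    return (observed - expected) / (1 - expected)

coder_a = ["pos", "pos", "neg", "neg", "pos", "neg"]
coder_b = ["pos", "neg", "neg", "neg", "pos", "neg"]
print(round(cohens_kappa(coder_a, coder_b), 3))  # 0.667
```

Kappa of 1.0 means perfect agreement and 0 means no better than chance, which maps onto the slide's contrast between "easy tasks" and "interesting tasks".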
  58. Adjudication: The Secret Sauce
 Expert review or consensus process
 Invalidate false positives
 Identify strong and weak coders
 Exclude false positives from training sets
  61. Use Classification Scores as Filters
 Iteration plays a critical role
 Train, classify, filter
 Repeat until the model is trusted
 Each round weeds out false positives
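The filtering step of the train, classify, filter loop reduces to thresholding classifier scores: keep only items the model scores above a cutoff, then code and adjudicate the survivors in the next round. A minimal sketch with invented scores and texts:

```python
# Sketch of using classification scores as a filter: keep items at or
# above a score threshold (e.g. the >95% cut on slide 69). The scored
# items below are invented for illustration.
def filter_by_score(scored_items, threshold=0.95):
    """Keep items the classifier scores at or above the threshold."""
    return [text for text, score in scored_items if score >= threshold]

scored = [
    ("Christie speaks at Fort Lee", 0.98),
    ("Unrelated sports chatter", 0.12),
    ("Bridge lane closure memo", 0.96),
]
print(filter_by_score(scored))
```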
  62. Classifier Histograms: More Filtering
  63. Track Your Progress
  66. Running the Classifier
  68. Filter by Classification
  69. Filtered List: >95% Not Chris Christie
  70. http://beta.twittersifter.com
  71. Thanks for Having Me!
 Dr. Stuart Shulman
 @stuartwshulman
 stu@texifter.com
 discovertext.com
 twittersifter.com
