DiscoverText: Tools for Text

A talk prepared for a presentation at the Digital Methods Initiative 2014 Winter School held at the University of Amsterdam.

Published in: Technology, Education

Transcript

  • 1. Tools for Text
 Dr. Stuart Shulman
 @stuartwshulman
 stu@texifter.com
 Prepared for the Digital Methods Initiative Winter School 2014
 University of Amsterdam
  • 2. Acknowledgements
 Richard Rogers
 The National Science Foundation
 Mark J. Hoy
  • 3. Plan of Attack
 A few high-level thoughts
 Five pillars of text analytics
 Getting started on DiscoverText
 A small collaborative project
 The twittersifter.com beta release
  • 4. “A funny thing happened…”
 A brief history of DiscoverText
  • 5. A Master Metaphor: Sifter
  • 6. An Open Source Kernel
  • 7. Three Primary Tasks in CAT
  • 8. Classification of Text
 A 2,500-year-old problem
 Plato argued it would be frustrating
 It still is…
  • 9. Grimmer & Stewart, “Text as Data,” Political Analysis (2013)
 Volume is a problem for scholars
 Coders are expensive
 Groups struggle to accurately label text at scale
 Validation of both humans and machines is “essential”
 Some models are easier to validate than others
 All models are wrong
 Automated models enhance/amplify, but don’t replace humans
 There is no one right way to do this
 “Validate, validate, validate”
 “What should be avoided then, is the blind use of any method without a validation step.”
  • 10. (Patent Pending)
  • 11. Three Important Books
  • 12. One Particularly Important Idea
  • 13. Five Pillars of Text Analytics
 Search
 Filter
 Code
 Cluster
 Classify
 You can execute all five using DiscoverText
  • 14. Pillar #1: Search
  • 15. Search for Negative Cases
  • 16. Defined Search (Multi-term)
  • 17. Pillar #2: Filters
 Remember this filter
  • 18. Another Common Filter
  • 19.
  • 20. Pillar #3: Human Coding
  • 21. Keystroke Coding is Fast
  • 22. Coding Off a List is Faster
  • 23. Data Cleaning is Fundamental
  • 24. Pillar #4: Clustering
  • 25.
  • 26. Latent Dirichlet Allocation (LDA) Topic Models
  • 27. LDA on the Christie Data
 Data is still processing…
  • 28. Pillar #5: Machine Learning
  • 29. Getting Started on DiscoverText
  • 30. Use the Key in Your Email
  • 31. Note the Peer Visibility Setting
  • 32. Peers Make Collaboration Possible
  • 33.
  • 34.
  • 35.
  • 36. Perhaps a Trending Topic
  • 37.
  • 38. The Basics
 Raw Data
 Subsets of Data
 Data Humans or Machines Classify
  • 39.
  • 40. Grab Some Twitter Data
  • 41. Create an Empty Archive
  • 42. Log in to a Twitter Account
  • 43. Enable via OAuth
  • 44. Ready to Query Twitter
  • 45. Use Operators to Refine Queries
  • 46. Set the Frequency of Fetches
  • 47. Data Will Start Flowing
  • 48. Data List View
  • 49. Best List Settings for Twitter Data
  • 50. Use Buckets to Refine Lists
 Search results go into buckets
 “Defined search” is a multi-term filter
 Metadata filters are also useful for buckets
 Buckets focus the text analytic process
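
The idea on slide 50, that a multi-term "defined search" routes matching items into a bucket, can be sketched in a few lines of Python. This is only an illustration, not DiscoverText's implementation: the function name `defined_search`, the `match_all` switch, and the sample texts are all assumptions made for the example.

```python
def defined_search(items, terms, match_all=False):
    """Return the items that contain any (or, with match_all, every) term.

    Hypothetical helper: matching is a plain case-insensitive substring test.
    """
    lowered_terms = [term.lower() for term in terms]
    combine = all if match_all else any
    bucket = []
    for text in items:
        lowered = text.lower()
        if combine(term in lowered for term in lowered_terms):
            bucket.append(text)
    return bucket


# Invented sample archive, for illustration only.
archive = [
    "Traffic on the bridge was terrible this morning",
    "Governor holds press conference about the bridge lanes",
    "Great weather in Amsterdam today",
]

# Items mentioning either term go into the bucket.
bucket = defined_search(archive, ["bridge", "governor"])

# A stricter bucket that requires both terms.
strict_bucket = defined_search(archive, ["bridge", "governor"], match_all=True)
```

Working on the smaller bucket rather than the full archive is what keeps later coding and classification focused.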
  • 51.
  • 52. Create a Dataset to Code
 Any archive or bucket
 Use the random sampling tool
 Standard: All coders get all items
 Triage: Coders get next uncoded item
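
As a rough sketch of slide 52, the snippet below draws a random sample from an archive and then contrasts the two assignment styles: standard (every coder sees every item) and triage (each request hands out the next uncoded item). The names `sample_dataset`, `assign_standard`, and `TriageQueue` are hypothetical, and the archive is placeholder data; the slides do not describe how DiscoverText implements either step.

```python
import random
from collections import deque


def sample_dataset(items, size, seed=None):
    """Draw a simple random sample without replacement."""
    rng = random.Random(seed)
    return rng.sample(list(items), size)


def assign_standard(dataset, coders):
    """Standard style: every coder receives a copy of every item."""
    return {coder: list(dataset) for coder in coders}


class TriageQueue:
    """Triage style: each call hands out the next uncoded item exactly once."""

    def __init__(self, dataset):
        self._uncoded = deque(dataset)

    def next_item(self):
        # Returns None once every item has been handed out.
        return self._uncoded.popleft() if self._uncoded else None


# Placeholder archive, for illustration only.
archive = [f"item-{i}" for i in range(100)]
dataset = sample_dataset(archive, size=10, seed=42)

standard = assign_standard(dataset, ["coder_a", "coder_b"])

queue = TriageQueue(dataset)
first_for_a = queue.next_item()  # coder_a asks first
first_for_b = queue.next_item()  # coder_b gets a different item
```

Standard assignment produces overlapping codes, which is what makes agreement measurable; triage spreads the work so that more items are covered in the same amount of time.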
  • 53.
  • 54. Select from Three Coding Styles
 Default: Mutually Exclusive Codes
 Option 1: Non-Mutually Exclusive Codes
 Option 2: User-Defined Codes (Grounded Theory)
  • 55.
  • 56. Assign Peers to Code a Dataset
 How many coders?
 How many items need to be coded?
 How many test or training sets?
 There are no cookbook answers
  • 57. Look at Inter-Rater Reliability
 Highly reliable coding (easy tasks)
 Unreliable coding (interesting tasks)
 If humans can’t, neither can machines
 Some tasks better suited for machines
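
Slide 57 does not say which reliability statistic is used, so the sketch below assumes Cohen's kappa for two coders, one common choice that corrects raw agreement for the agreement expected by chance. The function name `cohens_kappa` and the toy labels are illustrative assumptions.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who labelled the same items in the same order."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(coder_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of each coder's marginal label frequencies.
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both coders used a single identical label; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy labels, for illustration only.
labels_a = ["on", "on", "off", "off"]
labels_b = ["on", "off", "off", "off"]
kappa = cohens_kappa(labels_a, labels_b)
```

Here the coders agree on three of four items (0.75 observed) against 0.5 expected by chance, so kappa is 0.5. Values near 1 would correspond to the "easy tasks" on the slide, and values near 0 to coding that is no better than chance.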
  • 58. Adjudication: The Secret Sauce
 Expert review or consensus process
 Invalidate false positives
 Identify strong and weak coders
 Exclude false positives from training sets
  • 59.
  • 60.
  • 61. Use Classification Scores as Filters
 Iteration plays a critical role
 Train, classify, filter
 Repeat until the model is trusted
 Each round weeds out false positives
  • 62. Classifier Histograms: More Filtering
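
Slides 61 and 62 treat classifier scores as filters. A minimal sketch of that step follows, assuming each item already carries a score between 0 and 1 from some classifier. The helper names `filter_by_score` and `score_histogram`, the default 0.95 cutoff, and the sample scores are assumptions for illustration; the training step itself is not shown.

```python
def filter_by_score(scored_items, threshold=0.95):
    """Keep the texts whose classifier score meets the threshold."""
    return [text for text, score in scored_items if score >= threshold]


def score_histogram(scored_items, bins=10):
    """Count items per equal-width score bin over the range 0 to 1."""
    counts = [0] * bins
    for _, score in scored_items:
        # A score of exactly 1.0 falls into the last bin.
        index = min(int(score * bins), bins - 1)
        counts[index] += 1
    return counts


# Invented (text, score) pairs, for illustration only.
scored = [
    ("item one", 0.05),
    ("item two", 0.5),
    ("item three", 0.97),
    ("item four", 0.99),
]

confident = filter_by_score(scored, threshold=0.95)
histogram = score_histogram(scored)
```

In an iterative workflow, the confident items would be reviewed, any false positives removed from the training set, and the classifier retrained before the next round of filtering.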
  • 63. Track Your Progress
  • 64.
  • 65.
  • 66. Running the Classifier
  • 67.
  • 68. Filter by Classification
  • 69. Filtered List >95% Not Chris Christie
  • 70. http://beta.twittersifter.com
  • 71. Thanks for Having Me!
 Dr. Stuart Shulman
 @stuartwshulman
 stu@texifter.com
 discovertext.com
 twittersifter.com
