Mining Text, Survey, Twitter & RSS Data Using DiscoverText


Participate in this workshop to learn how to use DiscoverText to build custom machine classifiers for sifting free text, emails, survey responses, Twitter data, RSS feeds, and more. Each participant will receive a gratis Special Enterprise Account, good for a group of 10 users for 90 days. Please email to request a license key in advance of the workshop.

The topics covered include how to: fetch fresh sample Twitter Search API datasets, construct precise or broad social data queries, join Twitter data research teams working on #metoo, #balancetonporc, #cuéntalo, respect the right to be forgotten,
use Boolean search on raw data archives, filter on metadata or other project attributes, tabulate, explore, and set aside duplicates, cluster near-duplicates, crowd source human coding (annotation),
measure and think about inter-rater reliability, adjudicate coder disagreements, and quickly build reusable word sense, language, and topic disambiguation machine classifiers.
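
The description above mentions setting aside duplicates and clustering near-duplicates. As a rough illustration of the general idea, and not a description of how DiscoverText implements it, the sketch below groups texts by Jaccard similarity over word shingles. The function names, the shingle size `k`, and the `threshold` value are all assumptions chosen for the example.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased, whitespace-split text."""
    words = text.lower().split()
    if len(words) < k:
        # Very short texts become a single shingle made of all their words.
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_near_duplicates(texts, threshold=0.5, k=3):
    """Group text indices whose shingle sets have Jaccard similarity >= threshold.

    Clusters are formed by single linkage using a small union-find structure.
    The pairwise comparison is quadratic, so this is only suitable for small samples.
    """
    parent = list(range(len(texts)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    sets = [shingles(t, k) for t in texts]
    for i, j in combinations(range(len(texts)), 2):
        if jaccard(sets[i], sets[j]) >= threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                # Keep the smaller index as the cluster representative.
                parent[max(ri, rj)] = min(ri, rj)

    clusters = {}
    for i in range(len(texts)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


if __name__ == "__main__":
    sample = [
        "RT breaking news about the big game tonight",
        "breaking news about the big game tonight",
        "completely unrelated survey response about farming",
    ]
    print(cluster_near_duplicates(sample))
```

Exact duplicates have a similarity of 1.0 and therefore always land in the same cluster; lowering `threshold` merges looser variants such as retweets with added prefixes.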

Published in: Software

Mining Text, Survey, Twitter & RSS Data Using DiscoverText

  1. Mining Text, Survey, Twitter & RSS Data Using DiscoverText. Dr. Stuart W. Shulman, Founder & CEO, Texifter
  2. My Roots as a Coder: Emergent properties found in very well-read texts, such as the character type “extremist agent of the law”
  3. Dissertation Data (1997-1999)
  4. Relations between Classes, Rates and Terms for Credit, Farm Profitability, Cost of Living, Soil Fertility, Education, Exploration, Speculation, Coding, Validation
  5. Circa 1999
  6. May 2001: Council for Excellence in Government. June 2002: National Defense University
  7. A Spectrum of Methods Approaches. Purist: deep immersion, closeness to data, antipathy to numbers, credible interpretation, in-depth analysis, contextual, subjective. Pluralist: experimental, mixed method, adaptive hybrid, flexible approach, interdisciplinary, open minded. Positivist: quantitative, focus on error measurement, critical validity and reliability, replication & objectivity, generalization, hypotheses
  8. An Important Book in the Journey
  9. Other Very Important Books
  10. Text Classification: A problem for 2.5 millennia. Plato argued it is frustrating. Software cannot remove the problem. It can expose problems humans must fix.
  11. Grimmer & Stewart, “Text as Data,” Political Analysis (2013): Volume is a problem for scholars. Coders are expensive. Groups struggle to accurately label text at scale. Validation of both humans and machines is “essential.” Some models are easier to validate than others. All models are wrong. Automated models enhance/amplify, but don’t replace humans. There is no one right way to do this. “Validate, validate, validate.” “What should be avoided then, is the blind use of any method without a validation step.”
  12. Computer Science & NSF Influence: Measure Everything! How fast? How reliable? How accurate? Valid?
  13. Labeling, Tagging, or Annotation Improves Machine Learning Over Time
  14. A Labeling Interface Built for Speed
  15. Samples for Human Coding & Machine Classification
  16. Inter-Rater Reliability is Only One Factor. Understanding the landscape of human interpretation better prepares us to face the challenge of machine classification. Fleiss’ Kappa: The Level of Agreement Beyond Chance
  17. Adjudicate Coder Disagreement
  18. Enhanced Machine Learning: “CoderRank is to text analytics what PageRank was to search. Just as Google said not all web pages are created equal, Texifter argues that not all humans are created equal. When training machines, it is best to rely most on the humans most likely to create a valid observation. We proposed a unique way to rank humans on trust and knowledge vectors.”
  19. [Figure: coders labeled CoderRank 1 through CoderRank 9]
  20. Iterate Human & Machine Learning
  21. Word Sense Disambiguation (Relevance)
  22. “Patriots” Football Versus Politics
  23. New Work on Fake News Detection
  24. What Can CAT & DiscoverText Contribute? • A free and open source software option (CAT) • Web-based crowd source collaborative tools • Measurement innovations • Free real time Twitter data collection • Random sampling and keystroke coding • Advanced search and filtering • Deduplication and clustering algorithms • Custom machine-learning classifiers • Word sense disambiguation • CoderRank for enhanced machine learning
  25. Thanks for Listening! Dr. Stuart W. Shulman, Founder & CEO, Texifter, LLC; Editor Emeritus, Journal of Information Technology & Politics. Contact Information: Email: Twitter: @stuartwshulman
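
Slide 16 describes Fleiss' kappa as the level of agreement beyond chance. As a minimal sketch of the standard textbook formula, not of DiscoverText's own code, the function below computes kappa = (P_bar - P_e) / (1 - P_e), where P_bar is the mean observed agreement per item and P_e is the agreement expected by chance from the overall category proportions. The function name and the input layout (one row per item, one column per category, each cell a count of coders) are assumptions made for this example.

```python
def fleiss_kappa(counts):
    """Compute Fleiss' kappa.

    counts[i][j] is the number of coders who assigned item i to category j.
    Every item must have been rated by the same number of coders (at least 2).
    """
    n_items = len(counts)
    if n_items == 0:
        raise ValueError("at least one item is required")
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two ratings per item are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")
    n_cats = len(counts[0])

    # Share of all assignments that went to each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]

    # Observed agreement for each item: fraction of coder pairs that agree.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]

    p_bar = sum(p_i) / n_items        # mean observed agreement
    p_e = sum(p * p for p in p_j)     # agreement expected by chance

    if p_e == 1.0:
        # All ratings fall in one category, so kappa is undefined.
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Three coders, two categories, four items, with full agreement on each item.
    print(fleiss_kappa([[3, 0], [0, 3], [3, 0], [0, 3]]))
```

A value of 1 means perfect agreement, values near 0 mean agreement no better than chance, and negative values mean agreement worse than chance.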