Text Mining with R for Social Science Research


Text Mining workshop using R for Project Mosaic, UNC Charlotte's Social Science Research Initiative. The workshop analyzes the Federalist Papers and the Federal Reserve Beige Book with applications in text classification, topic modeling and sentiment analysis.

Published in: Education

  1. 1.
  2. 2. Source: Dan Jurafsky
     mostly solved
       • Spam detection (Classification): "Let’s go to Agra!" ✓ / "Buy V1AGRA …" ✗
       • Part-of-speech (POS) tagging: "Colorless green ideas sleep furiously." ADJ ADJ NOUN VERB ADV
       • Named entity recognition (NER): "Einstein met with UN officials in Princeton" PERSON ORG LOC
     making good progress
       • Sentiment analysis: "Best roast chicken in San Francisco!" / "The waiter ignored us for 20 minutes."
       • Coreference resolution: "Carter told Mubarak he shouldn’t run again."
       • Word sense disambiguation (WSD): "I need new batteries for my mouse."
       • Parsing: "I can see Alcatraz from the window!"
       • Machine translation (MT): "第13届上海国际电影节开幕…" / "The 13th Shanghai International Film Festival…"
       • Information extraction (IE): "You’re invited to our dinner party, Friday May 27 at 8:30" / Party, May 27, add
     still really hard
       • Question answering (QA): "Q. How effective is ibuprofen in reducing fever in patients with acute febrile illness?"
       • Paraphrase: "XYZ acquired ABC yesterday" / "ABC has been taken over by XYZ"
       • Summarization: "The Dow Jones is up", "The S&P500 jumped", "Housing prices rose" / "Economy is good"
       • Dialog: "Where is Citizen Kane playing in SF?" / "Castro Theatre at 7:30. Do you want a ticket?"
  3. 3. Source: Dan Jurafsky (modified)
       • non-standard English: "Great job @justinbieber! Were SOO PROUD of what youve accomplished! U taught us 2 #neversaynever & you yourself should never give up either♥"
       • segmentation issues: "the New York-New Haven Railroad" / "the New York-New Haven Railroad"
       • idioms: dark horse, get cold feet, lose face, throw in the towel
       • neologisms: unfriend, Retweet, bromance
       • tricky entity names: "Where is A Bug’s Life playing …", "Let It Be was recorded …", "… a mutation on the for gene …"
       • sarcasm: "A: I love Justin Bieber. Do you like him too? B: Yeah. Sure. I absolutely love him."
  4. 4. apers-10
  5. 5. 1:10pm
  6. 6. Non-Stop
  7. 7. Adair; Mosteller & Wallace; Fung; Collins et al.
  8. 8. Corpus Document Term
  9. 9. Source: Chris Manning
  10. 10. Tokenize, Clean, Stem, Filter. Example: "Then a hurricane came, and devastation reigned" becomes "then a hurricane came and devastation reigned"
  11. 11. GitHub site. 1:20pm. Code Lines: 1-49
  12. 12. Code Lines: 50-79
  13. 13. Federalist Paper 1: Before / Federalist Paper 1: After. Code Lines: 71-88
  14. 14. Federalist Paper 1: After Code Lines: 89-104
  15. 15. Code Lines: 142-149
  16. 16. Code Lines: 151-165. 1:30pm
  17. 17. Code Lines: 167-171
  18. 18. Code Lines: 173-188
  19. 19. Code Lines: 189-201
  20. 20. Code Lines: 202-207
  21. 21. Uncomment (Ctrl + Shift + C) and run lines 107-139, then rerun lines 141-206. Code Lines: 107-139
  22. 22. 1:50pm - 2pm
  23. 23. Bayes' Theorem (these slides)
  24. 24. Code Lines: 208-219
  25. 25. Update Code Lines: 231-241
  26. 26. Code Lines: 242-248
  27. 27. Code Lines: 250-273
  28. 28. Code Lines: 275-290
  29. 29. This will take about 4 minutes, depending on the computer you run it on. Code Lines: 295-308
  30. 30. Source: David Blei (link to article)
  31. 31. Code Lines: 295-308
  32. 32. Open the Index.html file in the “Federalist” folder in your working directory. Use Firefox; Chrome and IE are not supported.
  33. 33. Code Lines: 321-349
  34. 34. Code Lines: 350-370
  35. 35. Code Lines: 371-373
       • Naïve Bayes predicts 9 of the 12 papers as written by Madison.
       • K-NN predicts only 4 of the 12 papers as written by Madison.
       • Why? How stable are these results?
  36. 36. 2:30pm
  37. 37. Source: Richard Heimann
  38. 38. Source: Richard Heimann
  39. 39. Source: Richard Heimann
  40. 40. The Beige Book GitHub Source: Richard Heimann
  41. 41.
  42. 42. First six records of BB.sentiment
  43. 43. First six records of BB.sentiment (updated)
  44. 44. Raw Scored Sentiment Scaled Scored Sentiment
  45. 45. Stanford Deep Learning NLP class materials
  46. 46. GNIP access http://www.r-
  47. 47. AlchemyAPI; Taste Analytics Signals; SAS Enterprise Miner; SAS Sentiment Analysis; Hamilton Soundtrack; Amazon Reviews
  48. 48. R tm package; Python nltk package; Python gensim package; Mallet
  49. 49. Introductory Text Mining Class; Coursera Natural Language Processing Class; Coursera Text Mining & Analytics Course; Deep Learning for Natural Language Processing
  50. 50. 1-for-beginners-bag-of-words guide/twitter-sentiment-analysis introduction-to-topic-modeling-using-r/ trump-using-r-and-tableau/ Follow this link for all R “text” blogs on Rbloggers website
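
Slide 8 names the three building blocks of a text analysis: corpus, document, and term. One way to picture how they fit together is a document-term matrix, with one row per document and one column per term. The sketch below is plain Python, not the workshop's R script; the function name `document_term_matrix` and the two-sentence toy corpus are made up for illustration.

```python
from collections import Counter


def document_term_matrix(corpus):
    """Return (terms, matrix): the sorted vocabulary and one row of counts per document."""
    # Count the terms of each document separately.
    counts = [Counter(doc.lower().split()) for doc in corpus]
    # The vocabulary is the union of all terms seen in any document.
    terms = sorted(set().union(*counts))
    # A Counter returns 0 for terms a document does not contain.
    matrix = [[c[term] for term in terms] for c in counts]
    return terms, matrix


# Toy corpus (hypothetical, not taken from the workshop data).
corpus = [
    "the people of the union",
    "the union of states",
]

terms, matrix = document_term_matrix(corpus)
print(terms)   # ['of', 'people', 'states', 'the', 'union']
for row in matrix:
    print(row)  # [1, 1, 0, 2, 1] then [1, 0, 1, 1, 1]
```

Each row is the bag-of-words representation of one document; classifiers and topic models typically start from a matrix of this shape.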
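
The four preprocessing steps on slide 10 (tokenize, clean, stem, filter) can be sketched roughly as follows. This is plain Python rather than the workshop's R code, and it rests on loud assumptions: the stop-word list is a tiny made-up sample, and the "stemmer" only strips a few common suffixes as a crude stand-in for a real algorithm such as Porter's.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "then", "of", "to", "in"}

# Suffixes removed by the crude stemmer below, longest first.
SUFFIXES = ("ation", "ing", "ed", "es", "s")


def tokenize(text):
    """Split raw text on whitespace."""
    return text.split()


def clean(tokens):
    """Lowercase each token and drop every character that is not a letter."""
    cleaned = (re.sub(r"[^a-z]", "", tok.lower()) for tok in tokens)
    return [tok for tok in cleaned if tok]


def stem(tokens):
    """Strip at most one common suffix per token (not a real Porter stemmer)."""
    stemmed = []
    for tok in tokens:
        for suffix in SUFFIXES:
            # Keep at least three characters of the stem.
            if tok.endswith(suffix) and len(tok) - len(suffix) >= 3:
                tok = tok[: -len(suffix)]
                break
        stemmed.append(tok)
    return stemmed


def filter_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [tok for tok in tokens if tok not in STOP_WORDS]


def preprocess(text):
    """Run the four steps in the order given on the slide."""
    return filter_stop_words(stem(clean(tokenize(text))))


sentence = "Then a hurricane came, and devastation reigned"
print(preprocess(sentence))  # ['hurricane', 'came', 'devast', 'reign']
```

Stems such as "devast" are not dictionary words; the point is only that related word forms collapse to the same term before counting.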
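
Slides 23 and 35 rely on Bayes' theorem and a Naïve Bayes classifier for authorship attribution. A minimal sketch of the idea, assuming a multinomial model with add-one smoothing: score each author by log P(author) plus the sum of log P(word | author), then pick the highest score. This is plain Python, not the workshop's R code; the function names and the four toy "documents" are hypothetical and are not drawn from the Federalist Papers.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (label, tokens). Returns counts needed for prediction."""
    doc_counts = Counter()              # number of documents per label
    word_counts = defaultdict(Counter)  # word frequencies per label
    vocabulary = set()
    for label, tokens in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return {"doc_counts": doc_counts, "word_counts": word_counts, "vocabulary": vocabulary}


def log_posteriors(model, tokens):
    """Unnormalized log P(label | tokens) for each label, with add-one smoothing."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocabulary"])
    scores = {}
    for label, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # log prior
        for tok in tokens:
            if tok not in model["vocabulary"]:
                continue  # ignore words never seen in training
            score += math.log((counts[tok] + 1) / (total_words + vocab_size))
        scores[label] = score
    return scores


def predict(model, tokens):
    """Return the label with the highest posterior score."""
    scores = log_posteriors(model, tokens)
    return max(scores, key=scores.get)


# Toy training data (made up for illustration).
training = [
    ("Hamilton", ["upon", "the", "whole", "upon", "this"]),
    ("Hamilton", ["while", "upon", "the", "subject"]),
    ("Madison", ["whilst", "the", "powers", "on", "the"]),
    ("Madison", ["whilst", "on", "this", "subject"]),
]
model = train_naive_bayes(training)
print(predict(model, ["whilst", "on", "the", "powers"]))  # Madison
print(predict(model, ["upon", "the", "whole"]))           # Hamilton
```

With so few training documents, a single word can flip the prediction, which is one reason to ask how stable a classifier's results are.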
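
Slide 44 contrasts raw and scaled sentiment scores. A rough sketch of one common approach, assuming a lexicon-based score (positive-word count minus negative-word count) followed by standardization to mean 0 and sample standard deviation 1. The transcript does not say which method the workshop uses, so the word lists, function names, and example sentences below are all hypothetical, and the code is plain Python rather than R.

```python
import statistics

# Tiny made-up lexicons (assumptions; real opinion lexicons hold thousands of words).
POSITIVE = {"strong", "growth", "improved", "gains"}
NEGATIVE = {"weak", "decline", "slow", "losses"}


def raw_score(text):
    """Number of positive words minus number of negative words."""
    tokens = [tok.strip(".,;:!?").lower() for tok in text.split()]
    positives = sum(tok in POSITIVE for tok in tokens)
    negatives = sum(tok in NEGATIVE for tok in tokens)
    return positives - negatives


def scale(scores):
    """Center scores on their mean and divide by the sample standard deviation."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # needs at least two scores that are not all equal
    return [(s - mean) / sd for s in scores]


# Made-up example sentences, not text from the Beige Book.
reports = [
    "Retail sales improved and growth was strong.",
    "Manufacturing was weak with a slow decline.",
    "Lending showed gains despite slow demand.",
]

raw = [raw_score(r) for r in reports]
scaled = scale(raw)
print(raw)  # [3, -3, 0]
print([round(s, 2) for s in scaled])
```

Scaling puts documents of different lengths and periods on a comparable footing, since a raw score depends on how many lexicon words a document happens to contain.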