Working with text data

  1. Working with linguistic data. Ekaterina Vylomova, April 14, 2014. Outline: Data sources; How to retrieve the data; Data preprocessing; Some key concepts; Facebook; R package; Twitter & R; Sentiment analysis (based on Chris Potts' tutorial).
  2. Possible data sources: dictionaries and corpora; linked data; social media.
  3. Thesauri and corpora: WordNets; Roget's Thesaurus; Moby Project.
  4. Moby Project: Moby Hyphenator (185,000 entries, fully hyphenated); Moby Language (word lists in five of the world's great languages); Moby Part-of-Speech (230,000 entries fully described by part(s) of speech); Moby Pronunciator (175,000 entries fully coded in the International Phonetic Alphabet); Moby Thesaurus (30,000 root words, 2.5 million synonyms and related words); Moby Words (610,000+ words and phrases).
  5. Linked structured data (in RDF format). DBpedia is a project that aims to extract structured content from the information created as part of the Wikipedia project. Freebase is a large collaborative knowledge base consisting of metadata composed mainly by its community members. BabelNet is a multilingual lexicalized semantic network and ontology, created automatically from Wikipedia. YAGO is a knowledge base developed at the Max Planck Institute, also built automatically.
  6. Spoken corpora. TalkBank (multilingual): first language acquisition, second language acquisition, conversation analysis, classroom discourse, and aphasic language. CHILDES (part of TalkBank): Child Language Data Exchange System.
  7. Sentiment data: SentiWordNet; the dictionary by Warriner et al.; the dictionary by Hu and Liu.
  8. Social media. Rating systems: IMDB, Amazon, TripAdvisor, OpenTable. Sentiment: ExperienceProject, FMyLife, MyLifeIsAverage. Facebook (OpenGraph). Twitter. Blogs (LiveJournal, Blogger, etc.).
  9. Possible ways to get the data. Corpora: just download them. Facebook, Twitter and other social media: use the API. Blogs and other Internet data: download the web page with wget or curl, then parse the HTML or XML. Linked data: parse RDF.
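The "parse HTML" route above can be sketched with Python's standard library alone. This is a minimal illustration, not part of the original slides: the class name `TextExtractor` and the helper `extract_text` are hypothetical, and the page is a literal string so that no network access is needed. In practice the string would come from a page downloaded with wget or curl.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self.chunks = []       # visible text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML string as one space-joined line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page content, standing in for a downloaded blog post.
page = "<html><body><p>Hello <b>world</b></p><script>x=1</script></body></html>"
text = extract_text(page)
```

The extracted text can then go through the preprocessing steps on the next slide.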
  10. Data preprocessing: don't forget this step! Tokenization. Remove punctuation, possibly numbers and stop words, and lower-case everything. Lemmatization or stemming (Porter, Snowball). For a bag-of-words model you may create a term x document or term x term matrix (using TF, TF-IDF, or RIDF for normalization).
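The preprocessing steps above can be sketched in plain Python. This is a minimal sketch under stated assumptions: the stop-word list is a tiny illustrative one, TF is taken as the relative term frequency within a document, IDF as log(N / document frequency), and stemming is left out. The function names `tokenize` and `tfidf_matrix` are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "is", "and", "of", "to", "in"}


def tokenize(text):
    """Lower-case, replace punctuation and digits with spaces, drop stop words."""
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]


def tfidf_matrix(docs):
    """Return (terms, matrix); matrix[i][j] is the TF-IDF of term i in document j."""
    token_lists = [tokenize(d) for d in docs]
    terms = sorted({t for toks in token_lists for t in toks})
    n_docs = len(docs)
    # Document frequency: the number of documents in which each term occurs.
    doc_freq = Counter(t for toks in token_lists for t in set(toks))
    matrix = []
    for term in terms:
        idf = math.log(n_docs / doc_freq[term])
        row = []
        for toks in token_lists:
            tf = toks.count(term) / len(toks) if toks else 0.0
            row.append(tf * idf)
        matrix.append(row)
    return terms, matrix


docs = ["The cat sat in the hat.", "A dog and a cat!", "Dogs bark, 2 cats purr."]
terms, tdm = tfidf_matrix(docs)
```

Without stemming, "cat" and "cats" stay separate terms, which is exactly the problem that lemmatization or stemming addresses.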
  11. A few key terms from data mining. Compute set similarity: Jaccard, Dice, F-scores. Transform words to vectors: LSA, MDS. Get the topics of documents: LDA. For classification you may use: SVM, neural networks, discriminant analysis, Bayesian networks, decision trees, random forest, AdaBoost. For clustering you may use: k-means, knn, SOM, SVM. For regression you may use: SVM, neural networks, GLM, NLS.
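As an illustration of the set-similarity measures named above, here is a short Python sketch of the Jaccard and Dice coefficients. Treating two empty sets as identical (similarity 1.0) is a convention chosen for this sketch, and the example word sets are made up.

```python
def jaccard(a, b):
    """Jaccard coefficient: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention for this sketch: two empty sets are identical
    return len(a & b) / len(a | b)


def dice(a, b):
    """Dice coefficient: 2 * |A intersect B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # same convention as above
    return 2 * len(a & b) / (len(a) + len(b))


# Made-up bags of words from two short documents.
x = {"cat", "dog", "fish"}
y = {"cat", "dog", "bird"}
sim_jaccard = jaccard(x, y)  # 2 shared words out of 4 distinct words
sim_dice = dice(x, y)        # 2 * 2 shared words over 3 + 3 words
```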
  12. Connect to Facebook OpenGraph. Get an access token at https://developers.facebook.com/tools/access_token/. Check that it works in the Graph API Explorer, https://developers.facebook.com/tools/explorer?method=GET&path=me%3Ffields%3Did%2Cname, with the query me?fields=id,name,gender. Then follow the tutorial: https://developers.facebook.com/docs/graph-api/common-scenarios/
  13. Facebook & Python. Download the package: https://github.com/pythonforfacebook/facebook-sdk. Install it: python setup.py install
  14. Facebook & Python. Get the names and genders of your friends. Possible project: predicting gender from names.
      import facebook

      token = 'your_token'  # paste your access token here
      graph = facebook.GraphAPI(token)
      profile = graph.get_object('me')
      friends = graph.get_connections('me', 'friends')
      friend_list = [friend['id'] for friend in friends['data']]
      for friend_id in friend_list:
          data = graph.get_object(friend_id)
          # not every profile exposes a gender field
          if 'gender' in data:
              print(data['name'], data['gender'])
  15. Using R. Packages you may need: tm (text mining), plus tm.plugin.webmining for web corpora, HTML parsers, and plain-text extraction; topicmodels (topic modelling); wordcloud (create a cloud of words); qdap (sentiment analysis); RCurl (curl: get the contents of a web page); twitteR (use data from Twitter); wordnet (WordNet access; a dictionary is needed); e1071 (machine learning: clustering, SVM, naive Bayes, LSA).
  16. Package usage. Installation: install.packages("name"). Usage: library(name).
  17. Twitter with R. Load the packages:
      library(twitteR)
      library(tm)
      library(RCurl)
      library(qdap)
      library(wordcloud)
  18. Twitter with R. Get a token:
      reqURL <- "https://api.twitter.com/oauth/request_token"
      accessURL <- "https://api.twitter.com/oauth/access_token"
      authURL <- "https://api.twitter.com/oauth/authorize"
      consumerKey <- "key"
      consumerSecret <- "secret"
      twitCred <- OAuthFactory$new(consumerKey = consumerKey, consumerSecret = consumerSecret,
                                   requestURL = reqURL, accessURL = accessURL, authURL = authURL)
      # handshake() returns a link where you get a PIN code; enter that code when prompted
      twitCred$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
      registerTwitterOAuth(twitCred)
  19. Twitter with R. Get the data and convert it to a corpus:
      # search by hashtag (you may also search by plain words); get n = 1000 entries
      gglTweets <- searchTwitter('#sochi2014', n = 1000)
      n <- length(gglTweets)
      # show the first 3 entries
      gglTweets[1:3]
      # put the tweets in a data frame
      df <- do.call(rbind, lapply(gglTweets, as.data.frame))
      # get the dimensions
      dim(df)
      # create a corpus
      myCorpus <- Corpus(VectorSource(df$text))
  20. Twitter with R. Do the normalization:
      # lower-case everything
      myCorpus <- tm_map(myCorpus, tolower)
      # remove punctuation
      myCorpus <- tm_map(myCorpus, removePunctuation)
      # remove numbers
      myCorpus <- tm_map(myCorpus, removeNumbers)
      # remove stop words (very frequent words, e.g. articles, prepositions)
      myStopwords <- c(stopwords('english'), "sochi", "amp", "get")
      myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
  21. Twitter with R. Stem the documents:
      # keep a copy of the unstemmed corpus as a dictionary for stem completion
      dictCorpus <- myCorpus
      # apply stemming for normalization; you may use lemmatization instead
      myCorpus <- tm_map(myCorpus, stemDocument)
      inspect(myCorpus[1:3])
      myCorpus <- tm_map(myCorpus, stemCompletion, dictionary = dictCorpus)
      inspect(myCorpus[1:3])
  22. Twitter with R. Create a term-document matrix (TDM):
      # create the term-document matrix; you may use the TF or TF-IDF metric
      myDtm <- TermDocumentMatrix(myCorpus, control = list(minWordLength = 1, weighting = weightTfIdf))
      inspect(myDtm[66:70, 11:20])
      # frequent terms and associations
      findFreqTerms(myDtm, lowfreq = 10)
  23. Twitter with R. Create a word cloud:
      # convert the TDM to a plain matrix
      m <- as.matrix(myDtm)
      # sort by decreasing frequency
      v <- sort(rowSums(m), decreasing = TRUE)
      # show the first 14 entries
      head(v, 14)
      # get the terms (the names of v)
      words <- names(v)
      # create a data frame of words with their frequencies
      dat <- data.frame(word = words, freq = v)
      # create a word cloud from words that appeared at least 5 times
      wordcloud(dat$word, dat$freq, min.freq = 5)
  24. Experience Project. Experience Project is a free social networking website consisting of various online communities. Users submit experiences: personal stories, confessions, blogs, groups, photos, and videos. The users assign categories to the stories. Example: "I really hate being shy ... I just want to be able to talk to someone about anything and everything and be myself ... That's all I've ever wanted." Reactions: hugs: 1; rock: 1; teehee: 2; understand: 10; wow: 0. Author age: 21. Author gender: female. Text group: friends.
  25. Experience Project: data. Let's load the data:
      # read the .csv file with the data
      ep = read.csv('ep3-context.csv')
      Here, Count is the number of Category reactions received by confessions containing Word in Group with an author of Gender and Age. Total is the number of Category reactions used by confessions containing any Word in Group with an author of Gender and Age.
  26. Experience Project: data. Look at the different parameters:
      # show examples of words
      levels(ep$Word)
  27. Words and categories: word-category correlation. Check whether there is any correlation between words and categories:
      # include the source file
      source('ep.R')
      # create a subset for the word 'funny'
      funny = epCollapsedFrame(ep, 'funny')
      # plot the frequencies of the word for each category
      plot(funny$Category, funny$Count, xlab='Category', ylab='Count', main='funny')
  28. Word-category correlation (plot): 'funny' corresponds to the 'understand' category. This doesn't look realistic.
  29. Word-category correlation: we need normalization!
      # apply normalization (divide by the total number of words)
      funny$Count / funny$Total
      # get a subset for 'funny', taking frequencies into account
      funny = epCollapsedFrame(ep, 'funny', freqs=TRUE)
      # create a plot
      plot(funny$Category, funny$Freq, xlab='Category', ylab='Count/Total', main='funny')
  30. Word-category correlation (plot after normalization): much better!
  31. Probability theory: get the category from the word. Freq corresponds to the conditional probability P(word|category), i.e. the probability that a speaker used 'word' in a given 'category'. Let's apply Bayes' rule and compute P(category|word), i.e. the probability of the category given that a speaker used 'word'.
      funny$Freq / sum(funny$Freq)
      funny = epCollapsedFrame(ep, 'funny', freqs=TRUE, probs=TRUE)
      plot(funny$Category, funny$Pr, xlab='Category', ylab='(Count/Total)/sum(Count/Total)', main='funny')
      Question: are there any other words specific to a category?
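The normalization and the Bayes step above can be checked on a toy example. The sketch below uses Python and made-up counts (they are not taken from the Experience Project data); the category names follow the reaction types listed earlier. Dividing Freq by its sum yields P(category|word) under the assumption of a uniform prior over categories, since the prior then cancels in Bayes' rule.

```python
# Hypothetical counts for one word across the five reaction categories.
categories = ["hugs", "rock", "teehee", "understand", "wow"]
count = [10, 5, 40, 60, 5]            # reactions on stories containing the word
total = [1000, 500, 800, 6000, 500]   # reactions on all stories, per category

# Step 1: normalize. Freq estimates P(word | category).
freq = [c / t for c, t in zip(count, total)]

# Step 2: Bayes' rule with a uniform prior over categories.
# P(category | word) = Freq / sum(Freq); the prior cancels out.
pr = [f / sum(freq) for f in freq]

top_by_count = categories[count.index(max(count))]
top_by_pr = categories[pr.index(max(pr))]
```

With these numbers the raw counts point to "understand" simply because that category is the largest, while the normalized probabilities point to "teehee", the same effect the two plots show for 'funny'.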
  32. Compare with the expected value.
      funny = epCollapsedFrame(ep, 'funny', freqs=TRUE, probs=TRUE, oe=TRUE)
      Expected value: $\mathrm{Exp} = \sum_{i=1}^{N} x_i \, p(x_i)$, where $p(x_i)$ is the probability of $x_i$.
      category.probs = funny$Total / sum(funny$Total)
      funny.count = sum(funny$Count)
      funny.expected = funny.count * category.probs
      funny.expected
      Compare the expected and observed values:
      (funny$observed / funny.expected) - 1
      A value less than 0 means that the word is underrepresented in the category.
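The observed/expected comparison above can be traced by hand on small numbers. This Python sketch mirrors the R steps with made-up counts (hypothetical, not from the real data): the expected count of a category is the word's total count times the category's share of all reactions.

```python
# Hypothetical counts for one word across five categories.
count = [10, 5, 40, 60, 5]            # observed reactions for the word
total = [1000, 500, 800, 6000, 500]   # all reactions, per category

# Share of each category among all reactions.
category_probs = [t / sum(total) for t in total]

# Expected counts if the word were spread in proportion to category size.
word_count = sum(count)
expected = [word_count * p for p in category_probs]

# Observed/expected minus 1: below 0 means underrepresented, above 0 overrepresented.
oe = [obs / exp - 1 for obs, exp in zip(count, expected)]
```

Here the third category comes out well above 0 and the fourth below 0, even though the fourth has the highest raw count.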
  33. Adding context: 'awesome' by gender. Usage of 'awesome' by male/female/unknown authors:
      eptok = read.csv('ep3-context-tokencounts.csv')
      par(mfrow=c(1,3))
      epPlot(ep, eptok, 'awesome', genders='male', probs=TRUE)
      epPlot(ep, eptok, 'awesome', genders='female', probs=TRUE)
      epPlot(ep, eptok, 'awesome', genders='unknown', probs=TRUE)
  34. Adding context: 'awesome' by gender (plot): usage of 'awesome' by male/female/unknown authors.
  35. Adding context: 'awesome' by age. Usage of 'awesome' by people of different ages:
      par(mfrow=c(2,3))
      for (i in 1:5) {
        epPlot(ep, eptok, 'awesome', ages=i, probs=TRUE)
      }
  36. Adding context: 'awesome' by age (plot): usage of 'awesome' by people of different ages.
  37. Adding context: 'awesome', comparing gender with the category. Changing the parameter for each category separately:
      epCategoryByFactorPlot(ep, eptok, 'awesome', 'Gender', probs=TRUE, type='b')
  38. Adding context: 'awesome', comparing gender with the category (plot).
  39. Adding context: 'drunk', comparing age with the category. Stories containing 'drunk' depend on the author's age:
      epCategoryByFactorPlot(ep, eptok, 'drunk', 'Age', probs=TRUE, type='b')
  40. Adding context: 'drunk', comparing age with the category (plot).
  41. Creating a logistic regression model. Let's create a regression model that predicts the frequency of 'drunk' from age and category:
      drunk = epFullFrame(ep, 'drunk', age=c(1,2,3,4,5))
      drunk$Age = as.numeric(drunk$Age)
      fit.glm = glm(cbind(Count, Total - Count) ~ Category - 1 + Age, data=drunk, family=binomial)
      summary(fit.glm)
  42. Creating a logistic regression model. Define a function that predicts the probability of the word from the category and the author's age:
      FittedGlmFunc = function(fit, category, age) {
        coefs = fit$coef
        # coefficient names look like 'Categorywow'
        cat.coef = coefs[[paste('Category', category, sep='')]]
        # inverse logit of the linear predictor
        prediction = plogis(cat.coef + coefs[['Age']] * age)
        return(prediction)
      }
      Calling the function:
      FittedGlmFunc(fit.glm, 'wow', 1)
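To make the prediction step concrete: R's plogis is the logistic (inverse logit) function, 1 / (1 + exp(-x)). The Python sketch below reproduces the same computation with hypothetical coefficient values standing in for a fitted model; the names `plogis` and `fitted_prob` and the numbers are illustrative assumptions, not output of the real fit.

```python
import math


def plogis(x):
    """Logistic function, the inverse of the logit link used by a binomial GLM."""
    return 1.0 / (1.0 + math.exp(-x))


def fitted_prob(coefs, category, age):
    """Predicted probability for a category and age group, from GLM coefficients."""
    linear_predictor = coefs["Category" + category] + coefs["Age"] * age
    return plogis(linear_predictor)


# Hypothetical coefficients (made-up values, not from the real model).
coefs = {"Categorywow": -9.0, "Categoryhugs": -8.5, "Age": 0.1}
p_wow_age1 = fitted_prob(coefs, "wow", 1)
p_wow_age5 = fitted_prob(coefs, "wow", 5)
```

Because the model has one intercept per category and a single Age slope, a positive Age coefficient raises the predicted probability in every category as age increases.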
  43. Creating a logistic regression model. Visualize the data and compare the empirical (black) values with the fitted (red) values:
      par(mfrow=c(2,3))
      cats = levels(ep$Category)
      for (i in 1:5) {
        epPlot(ep, eptok, 'drunk', age=i)
        for (j in 1:5) {
          val = FittedGlmFunc(fit.glm, cats[j], i)
          points(j, val, col='red', pch=19)
        }
      }
  44. Regression modelling (plot): empirical (black) values compared with fitted (red) values.
  45. IMDB data: analysis of ADV-ADJ collocations.
  46. Data from rating systems. We will use data from rating systems (Amazon.com, OpenTable.com, Goodreads.com, IMDB.com). Load them:
      d = read.csv('ratings-advadj.csv')
      head(d)
  47. Extract subsets: 'horrid' categories.
      horrid = ratingFullFrame(d, 'horrid', types=NULL, modifiers=NULL, modifier.types=NULL, ratingmax=0)
      nrow(horrid)
      head(horrid)
  48. Extract subsets: 'absolutely' + 'horrid'. With a modifier:
      horrid = ratingFullFrame(d, 'horrid', modifiers='absolutely')
      nrow(horrid)
      head(horrid)
  49. Tonality evaluation for adjectives: probabilities of categories for 'horrid'.
      horrid = ratingCollapsedFrame(d, 'horrid', freqs=TRUE, probs=TRUE)
      horrid
  50. Tonality: probabilities vs. frequencies.
      par(mfrow=c(1,2))
      ratingPlot(d, 'horrid', probs=FALSE)
      ratingPlot(d, 'horrid', probs=TRUE)
  51. Evaluating expectation: predict the category from the adjective. Expectation:
      sum(horrid$Category * horrid$Pr)
      The ExpectedCategory function does the same:
      ExpectedCategory(horrid)
      Adding the value to the plot:
      ratingPlot(d, 'horrid', probs=TRUE, ec=TRUE)
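The expectation above is a probability-weighted average of the rating categories. A small Python sketch with made-up probabilities (hypothetical, not computed from the ratings data) shows the arithmetic:

```python
# Rating categories (stars) and hypothetical P(category | word) for a negative adjective.
ratings = [1, 2, 3, 4, 5]
pr = [0.5, 0.2, 0.1, 0.1, 0.1]   # must sum to 1

# Expected category: sum over categories of category * probability.
expected_category = sum(r * p for r, p in zip(ratings, pr))
```

An expected category near the low end of the scale marks a word as typical of negative reviews; one near the top marks it as typical of positive reviews.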
  52. Evaluating expectation (plot): predict the category from the adjective.
  53. Regression model: a model for predicting. Let's create a model that predicts the probability that a word appears in a particular category:
      fit.horrid = glm(cbind(horrid$Count, horrid$Total - horrid$Count) ~ Category, family=quasibinomial, data=horrid)
      fit.horrid
  54. Regression model (plot): a model for predicting.
  55. Regression model: a model for predicting. Improve the model by adding a quadratic term:
      GlmWordQuadratic <- function(pf) {
        pf$Category2 = pf$Category^2
        fit = glm(cbind(Count, Total - Count) ~ Category + Category2, family=quasibinomial, data=pf)
        return(fit)
      }
      par(mfrow=c(2,2))
      ratingPlot(d, 'good', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=5, ylim=c(0, 0.5))
      ratingPlot(d, 'good', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=10, ylim=c(0, 0.3))
      ratingPlot(d, 'disappointing', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=5, ylim=c(0, 0.5))
      ratingPlot(d, 'disappointing', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=10, ylim=c(0, 0.3))
  56. Regression model (plot): a model for predicting.
  57. Vector space models. How to transform words to vectors: LSA (latent semantic analysis); MDS (multidimensional scaling).
  58. Basics about vectors.
      Euclidean distance: $\mathrm{EuclideanDist}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
      Vector length: $\mathrm{VectorLength}(x) = \sqrt{\sum_{i=1}^{n} x_i^2}$
      Vector normalization: each component is divided by the vector's length.
      Cosine distance between vectors: $\mathrm{CosineDist}(x, y) = 1 - \frac{\sum_{i=1}^{n} x_i y_i}{\mathrm{VectorLength}(x) \cdot \mathrm{VectorLength}(y)}$
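The vector operations on this slide translate directly into a few lines of Python. This is a minimal sketch using plain lists; the function names are chosen for this example and only loosely mirror the helpers named in the slides, and the cosine distance is taken as one minus the dot product over the product of the vector lengths.

```python
import math


def euclidean_dist(x, y):
    """Square root of the summed squared component differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def vector_length(x):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(sum(a ** 2 for a in x))


def length_norm(x):
    """Divide each component by the vector's length."""
    length = vector_length(x)
    return [a / length for a in x]


def cosine_dist(x, y):
    """One minus the cosine of the angle between x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    return 1.0 - dot / (vector_length(x) * vector_length(y))


d_euclid = euclidean_dist([0, 0], [3, 4])
d_parallel = cosine_dist([1, 2, 3], [1000, 2000, 3000])
d_orthogonal = cosine_dist([1, 0], [0, 1])
```

Cosine distance ignores magnitude: vectors that point in the same direction are at distance 0 regardless of their lengths.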
  59. Vector space models: data from IMDB. Initial data: a term x term matrix, where element x_ij is the frequency of co-occurrence of term_i and term_j in a context (document, sentence, etc.).
      source('vsm.R')
      # co-occurrence matrix (words appearing in the same context: phrase, sentence, paragraph)
      imdb = Csv2Matrix('imdb-wordword.csv')
      imdb[100:110, 100:110]
  60. Semantically related words. Extract semantically related words:
      df = Neighbors(imdb, 'happy')
      head(df)
  61. Semantically related words: the problem. Normalization discards magnitude: a and b differ by a factor of 1000, yet they normalize to the same vector.
      a = c(1000, 2000, 3000)
      b = c(1, 2, 3)
      a/sum(a)       # 0.1666667 0.3333333 0.5000000
      b/sum(b)       # 0.1666667 0.3333333 0.5000000
      LengthNorm(a)  # 0.2672612 0.5345225 0.8017837
      LengthNorm(b)  # 0.2672612 0.5345225 0.8017837
  62. PMI (pointwise mutual information). How to deal with it? PMI!
      $\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x) \, p(y)}$
      PMI normalization:
      $\mathrm{NPMI}(i, j) = \mathrm{pmi}(i, j) \cdot \frac{p(i, j)}{p(i, j) + 1} \cdot \frac{\min\left(\sum_{k=1}^{m} p(k, j), \sum_{k=1}^{n} p(i, k)\right)}{\min\left(\sum_{k=1}^{m} p(k, j), \sum_{k=1}^{n} p(i, k)\right) + 1}$
      where p = M / sum(M), i.e. $p(i, j) = M_{ij} / \sum M$, and M is the term x term matrix.
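The basic PMI formula above can be sketched in Python on a toy co-occurrence matrix. This is an illustration under stated assumptions: it uses plain nested lists, omits the discounting factor, and sets cells with a zero count to 0 (a common positive-PMI convention, used here in both modes as a simplification). The function name `pmi_matrix` is hypothetical and is not the PMI function from vsm.R.

```python
import math


def pmi_matrix(counts, positive=True):
    """PMI for each cell of a co-occurrence count matrix (list of rows)."""
    total = sum(sum(row) for row in counts)
    row_p = [sum(row) / total for row in counts]         # p(i): row marginals
    col_p = [sum(col) / total for col in zip(*counts)]   # p(j): column marginals
    result = []
    for i, row in enumerate(counts):
        out_row = []
        for j, c in enumerate(row):
            if c == 0:
                value = 0.0  # log(0) is undefined; zero counts are set to 0 here
            else:
                value = math.log((c / total) / (row_p[i] * col_p[j]))
                if positive:
                    value = max(value, 0.0)  # positive PMI clips negative values
            out_row.append(value)
        result.append(out_row)
    return result


# Toy matrices: perfectly associated terms vs. independent terms.
associated = pmi_matrix([[10, 0], [0, 10]])
independent = pmi_matrix([[1, 1], [1, 1]])
```

Terms that co-occur more often than chance get a PMI above 0, and independent terms get 0, whatever their raw frequencies are. That is what removes the magnitude problem shown on the previous slide.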
  63. PMI (pointwise mutual information) in practice:
      imdb.ppcd = PMI(imdb, positive=TRUE, discounting=TRUE)
      df = Neighbors(imdb.ppcd, 'happy', byrow=TRUE, distfunc=CosineDistance)
      head(df)
  64. Semantic orientation method. Define two sets of words, S1 and S2 (as vector representations). Choose a distance measure. For a word w, calculate the sum of its distances to the vectors of S1 and the sum of its distances to the vectors of S2. The tonality is the difference between the two sums.
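The procedure above fits in a few lines of Python. The sketch below is an illustration, not the SemanticOrientation function from vsm.R: the two-dimensional word vectors are made up, cosine distance is the chosen measure, and the score is taken as the summed distance to S1 minus the summed distance to S2, so a positive score means the word lies closer to S2.

```python
import math


def cosine_dist(x, y):
    """One minus the cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def semantic_orientation(vectors, word, seeds1, seeds2):
    """Sum of distances to seeds1 minus sum of distances to seeds2."""
    w = vectors[word]
    dist1 = sum(cosine_dist(w, vectors[s]) for s in seeds1)
    dist2 = sum(cosine_dist(w, vectors[s]) for s in seeds2)
    return dist1 - dist2


# Made-up 2-d vectors; real ones would be rows of a reweighted term x term matrix.
vectors = {
    "good": [1.0, 0.1], "nice": [0.9, 0.2],
    "bad": [0.1, 1.0], "nasty": [0.2, 0.9],
    "great": [1.0, 0.2], "horrid": [0.1, 0.8],
}
neg = ["bad", "nasty"]
pos = ["good", "nice"]
score_great = semantic_orientation(vectors, "great", seeds1=neg, seeds2=pos)
score_horrid = semantic_orientation(vectors, "horrid", seeds1=neg, seeds2=pos)
```

With the negative seeds as S1 and the positive seeds as S2, a word far from the negative seeds and close to the positive ones gets a positive score.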
  65. Example of the semantic orientation method:
      neg = c('bad', 'nasty', 'poor', 'negative', 'unfortunate', 'wrong', 'inferior')
      pos = c('good', 'nice', 'excellent', 'positive', 'fortunate', 'correct', 'superior')
      SemanticOrientation(imdb.ppcd, word='great', seeds1=neg, seeds2=pos, distfunc=CosineDistance)
      # 0.8923544
      SemanticOrientation(imdb.ppci, word='horrid', seeds1=neg, seeds2=pos, distfunc=CosineDistance)
      # -0.04741898
  66. More information. For more detailed examples and tutorials on sentiment analysis, see Chris Potts' tutorials: http://nasslli2012.christopherpotts.net and http://sentiment.christopherpotts.net. Email me if you need any help!
