Quantifying Text Sentiment in R

Transcript

  • 1. Happy, Sad, Indifferent … Quantifying Text Sentiment in R
    Rajarshi Guha
    CT R Users Group
    May 2012
  • 2. Preamble
    • https://github.com/rajarshi/ctrug-tweet
    • Focus is on using R to perform this task
    • Won't comment on the validity, rigor, utility, … of sentiment analysis methods
    • Some of the example data is available freely; other parts are available on request
  • 3. Getting Twitter Data
    • Based on a collaboration with Prof. Debs Ghosh (UConn), studying obesity & social media
    • Accessing Twitter is easy from many languages
      – We obtained tweets via a PHP client running over an extended period of time
      – Ended up with 108,164 tweets
    • Won't focus on accessing Twitter data from R
      – Very straightforward with twitteR (see the sketch below)
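    For reference, a minimal sketch of pulling a small sample directly from R with
    twitteR's searchTwitter(); the "pizza" query and the count are illustrative, and
    current Twitter API versions also require an OAuth handshake, which is omitted here:

    # Sketch only: assumes the twitteR package is installed and (for
    # current API versions) an OAuth-registered app has been set up
    library(twitteR)
    tweets <- searchTwitter("pizza", n=100)  # list of status objects
    df <- twListToDF(tweets)                 # flatten to a data.frame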
  • 4. Cleaning Text
    • Load in the tweet data; get rid of URLs, HTML escape codes, punctuation, etc.

    d <- read.csv("pizza-unique.csv", colClass="character",
                  comment="", header=TRUE)
    d$geox <- as.numeric(d$geox)
    d$geoy <- as.numeric(d$geoy)

    # strip URLs, whether mid-tweet or at the end of the tweet
    remove.urls <- function(x) gsub("http.*$", "", gsub("http.*\\s", " ", x))
    # strip HTML escape codes such as &quot;
    remove.html <- function(x) gsub("&quot;", "", x)

    d$text <- remove.urls(d$text)
    d$text <- remove.html(d$text)
    d$text <- gsub("@", "FOOBAZ", d$text)         # protect @ before stripping punctuation
    d$text <- gsub("[[:punct:]]+", " ", d$text)   # remove punctuation
    d$text <- gsub("FOOBAZ", "@", d$text)         # restore the @ mentions
    d$text <- gsub("[[:space:]]+", " ", d$text)   # collapse runs of whitespace
    d$text <- tolower(d$text)
  • 5. Quantifying Sentiment
    • Based on identifying words with positive or negative connotations
    • Fundamentally based on looking up words in a dictionary
    • If a tweet has more positive words than negative words, the tweet is positive (see the sketch below)
    • More sophisticated scoring schemes are possible
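    The simplest version of this idea, essentially the word-count approach the later
    slides call the "Breen" score, is a sketch like the following; pos.words and
    neg.words are hypothetical stand-ins for the actual word lists:

    # Sketch of dictionary-lookup scoring: positive hits minus negative hits
    score.breen <- function(tweet, pos.words, neg.words) {
      words <- strsplit(tweet, "\\s+")[[1]]
      sum(words %in% pos.words) - sum(words %in% neg.words)
    }
    score.breen("i love pizza", c("love", "happy"), c("hate", "sad"))  # 1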
  • 6. Better Dictionaries?
    • SentiWordNet
      – Derived from WordNet; each term is assigned a positivity and a negativity score
      – 206K terms
      – Converted to a simple CSV for easy import into R
    • Ideally, should also perform POS tagging
    [Figure: proportion of negative, neutral, and positive sentiment by part of speech (adjective, adverb, noun, verb)]
  • 7. Scoring Tweets
    • Given a scoring function, we can process the tweets
      – Perfect use case for parallel processing
      – Easy to switch out the scoring function

    library(parallel)   # provides mclapply()

    swn <- read.csv("sentinet_r.csv", header=TRUE, as.is=TRUE)

    # look up a word in SentiWordNet; columns 3 and 4 hold the
    # positivity and negativity scores
    swn.match <- function(w) {
      tmp <- subset(swn, Term == w)
      if (nrow(tmp) >= 1) return(tmp[1, c(3,4)])
      else return(c(0,0))
    }

    # score = total positivity - total negativity over the tweet's words
    score.swn <- function(tweet) {
      words <- strsplit(tweet, "\\s+")[[1]]
      cs <- colSums(do.call(rbind, lapply(words, function(z) swn.match(z))))
      return(cs[1] - cs[2])
    }

    scores <- mclapply(d$text, score.swn)
  • 8. Profiling Makes Me Happy
    • 6052 sec with 24 cores for the swn.match()-based scorer on the previous slide
    • Rprof() is a good way to identify bottlenecks*
    • Replacing the per-word subset() with a single match() brings this down to 461 sec with 24 cores

    # vectorised lookup: one match() against the Term column instead of
    # a subset() call per word
    score.swn.2 <- function(tweet) {
      words <- strsplit(tweet, "\\s+")[[1]]
      rows <- match(words, swn$Term)
      rows <- rows[!is.na(rows)]
      cs <- colSums(swn[rows, c(3,4)])
      return(cs[1] - cs[2])
    }

    * overkill for this example
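    The basic Rprof() workflow looks like the sketch below; the output file name and
    the 500-tweet sample are arbitrary choices for illustration:

    # profile scoring a sample of tweets, then summarise by time spent
    Rprof("score.out")
    invisible(lapply(d$text[1:500], score.swn))
    Rprof(NULL)
    summaryRprof("score.out")$by.self   # inspect which calls dominate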
  • 9. Looking at the Scores
    • The bulk of the tweets are neutral
    • Similar behavior from either scoring function
    [Figure: overlaid density plots of sentiment scores for the SWN and Breen methods]

    library(ggplot2)

    d$swn <- unlist(scores.swn)
    d$breen <- unlist(scores.breen)

    tmp <- rbind(data.frame(Method="SWN", Scores=d$swn),
                 data.frame(Method="Breen", Scores=d$breen))
    ggplot(tmp, aes(x=Scores, fill=Method)) +
      geom_density(alpha=0.25) +
      xlab("Sentiment Scores")
  • 10. Sentiment & Time of Day
    • Group tweets by hour and evaluate how the proportions of positive, negative, and neutral tweets vary

    library(reshape)    # provides melt() with the variable_name argument
    library(ggplot2)

    tmp <- d
    tmp$hour <- strptime(d$time, format="%a, %d %b %Y %H:%M")$hour

    tmp <- subset(tmp, !is.na(swn))
    tmp$status <- sapply(tmp$swn, function(x) {
      if (x > 0) return("Positive")
      else if (x < 0) return("Negative")
      else return("Neutral")
    })

    # counts of each status by hour, then reshape for plotting
    tmp <- data.frame(do.call(rbind,
               by(tmp, tmp$hour, function(x) table(x$status))))
    tmp$Hour <- factor(rownames(tmp), levels=0:23)
    tmp <- melt(tmp, id="Hour", variable_name="Sentiment")
    ggplot(tmp, aes(x=Hour, y=value, fill=Sentiment)) +
      geom_bar(stat="identity", position="fill") +
      xlab("") + ylab("Proportion")
  • 11. Sentiment & Time of Day
    [Figure: stacked bar chart of the Negative/Neutral/Positive proportions for each hour of the day, 0 through 23]
  • 12. Contradictions?
    • Tweets that are negative according to one score but positive according to the other

    subset(d, swn < -2 & breen > 1)

    "i m trying to get some legit food right now like pizza or chicken not this shi7y ass school lunch"
    "24 i like reading 25 i hate hopsin 26 i love chips salsa 27 i love chevys 28 i was a thug in middle school 29 i love pizza"
    "@naturesempwm had a raw pizza 4 lunch today but i was not impressed with the dried out not fresh vegetable spring roll i bought threw out"
  • 13. Sentiment and Geography
    • What's the spatial distribution of tweet sentiment?
    • Extract tweets located in the CONUS (~500)
    • Visualize the direction and strength of sentiment (see the sketch below)
    • Correlate with other socio-economic factors?
    [Figure: map of CONUS tweets; colour shows the sign of the swn score, point size shows abs(swn)]
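    One way such a map could be drawn in ggplot2 is sketched below; it assumes the
    maps package for the state outlines, and that geox/geoy are the longitude and
    latitude columns parsed earlier from the tweet data:

    library(ggplot2)
    library(maps)   # assumed: supplies the outlines behind map_data("state")

    conus <- map_data("state")
    ggplot() +
      geom_polygon(data=conus, aes(x=long, y=lat, group=group),
                   fill="grey90", colour="white") +
      # colour = direction of sentiment, size = strength
      geom_point(data=subset(d, !is.na(geox)),
                 aes(x=geox, y=geoy, colour=factor(sign(swn)), size=abs(swn)),
                 alpha=0.7) +
      labs(colour="swn", size="abs(swn)")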
  • 14. Other Considerations
    • Should take negation into account
      – Scan for negation terms and adjust the score appropriately (a crude sketch follows)
    • Oblivious to sarcasm
    • Sentiment scores should probably be modified by context
    • Lots of M/L opportunities
      – Spatial analysis
      – Topic modeling / clustering
      – Predictive models
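    A crude version of that negation adjustment, flipping a word's contribution when
    the preceding word is a negator, might look like this sketch; the negator list
    and the one-word window are assumptions, as are the helper's name and word lists:

    # Sketch: flip the score contribution of a word preceded by a negator
    negators <- c("not", "no", "never", "dont", "cant")   # assumed list
    score.negated <- function(words, pos.words, neg.words) {
      s <- ifelse(words %in% pos.words, 1,
                  ifelse(words %in% neg.words, -1, 0))
      flip <- c(FALSE, head(words, -1) %in% negators)  # previous word a negator?
      sum(ifelse(flip, -s, s))
    }
    score.negated(c("not", "happy"), c("happy"), c("sad"))   # -1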