Comparing social tags to microblogs


   Victoria Lai, Christopher Rajashekar, William Rand
              Modeling Social Media 2011
                    October 9, 2011
Social Tags and Social Media
     Brand manager – what are people saying about a product
      online?
     Goal: See if tags about an album
      reflect Twitter conversations
     Amazon tags
       Where purchases take place
       Easier to collect than tweets




Similarity framework S(fa(ta), fw(tw)) > θ
 Inputs
   ta: album tags (all tags, or top ten tags)
   tw: keywords from album tweets
 Importance measures
   fa (tags): tag weights
   fw (tweet keywords): frequency, or tf-idf
 Each importance measure yields a scored phrase list
   phrase 1   #
   phrase 2   #
   phrase 3   #
   …
 Similarity S between the two lists, tested as S > θ?
   Spearman
   Kendall tau
   Precision
   Recall
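The rank-correlation choices for S can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the example scores, and the decision to compare only phrases that appear in both lists are assumptions, since the slides do not say how phrases missing from one list are handled.

```python
from itertools import combinations


def average_ranks(scores):
    """Rank scores from highest (rank 1) down; tied scores share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / pairs if pairs else 0.0


def similar(tag_scores, tweet_scores, theta, measure=spearman):
    """Test S(fa(ta), fw(tw)) > theta over phrases present in both lists (assumption)."""
    shared = sorted(set(tag_scores) & set(tweet_scores))
    a = [tag_scores[p] for p in shared]
    b = [tweet_scores[p] for p in shared]
    return measure(a, b) > theta


# Hypothetical scores, for illustration only.
tag_scores = {"rock": 5, "indie": 3, "live": 1}
tweet_scores = {"rock": 40, "indie": 10, "live": 2, "tour": 7}
print(similar(tag_scores, tweet_scores, theta=0.5))
```

Passing `measure=kendall_tau` swaps in the other correlation without changing the threshold test.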
Baselines (θ)
 General control
   I, the, and, a, of
   Used in tf-idf
 Music control
   music
   Used as threshold
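One way the general control could feed tf-idf is as the background corpus for document frequencies. The sketch below assumes exactly that: each control tweet counts as one document, words are split on whitespace, and add-one smoothing is applied. None of these details are stated on the slide, and the function name and example tweets are hypothetical.

```python
import math
from collections import Counter


def tf_idf(album_tweets, control_tweets):
    """Score each album-tweet word by term frequency times inverse document frequency.

    Assumption: document frequency is counted over the general-control tweets.
    """
    tf = Counter(word for tweet in album_tweets for word in tweet.lower().split())
    df = Counter()
    for tweet in control_tweets:
        df.update(set(tweet.lower().split()))
    n_docs = len(control_tweets)
    # Add-one smoothing keeps idf finite for words absent from the control set.
    return {w: c * math.log((1 + n_docs) / (1 + df[w])) for w, c in tf.items()}


# Hypothetical tweets, for illustration only.
album = ["the new album is great", "the album tour"]
control = ["the weather and the news", "I saw a movie", "the end of the day"]
scores = tf_idf(album, control)
print(sorted(scores, key=scores.get, reverse=True))
```

Words that are common in the control stream, such as "the", are pushed down the ranking, while album-specific words keep a high score.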
Relevant Work
 Heymann, Ramage, and Garcia-Molina (2008)
  IR measures
 Eck, Lamere, Bertin-Mahieux, and Green (2007)
  correlation measures
 Wagner and Strohmaier (2010)
  tweet stream properties
 Inouye and Kalita (2011)
  automatic tweet summarization
 Wu, Zhang, and Ostendorf (2010)
  tf-idf on user tweets
Correlations
 C1: threshold (music control): ta = all tags, fw = freq, tw = music
 C2: base case: ta = all tags, fw = freq
 C3: best case: ta = top tags, fw = tf-idf

 Album   C1 Spearman  C1 Kendall  C2 Spearman  C2 Kendall  C3 Spearman  C3 Kendall
 D1      0.44         0.38        0.29         0.25        0.69         0.43
 D2      0.29         0.24        0.38         0.37        0.78         0.70
 D3      0.24         0.20        0.38         0.33        0.33         0.31
 D4      0.30         0.26        0.40         0.35        0.60         0.51
 J1      0.64         0.55        0.31         0.28        0.31         0.28
 J5      0.20         0.18        0.23         0.18        0.63         0.44
 J6      0.47         0.37        0.28         0.19        0.63         0.45
 F2      0.24         0.20        0.43         0.36        0.30         0.28

 Shaded: strongest correlation listed
 C3 bolded: better than base case
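As a worked instance of the S > θ test, the Spearman values for the music-control threshold (C1) and the best case (C3) can be compared album by album. The numbers are copied from the table. Treating C1 as a per-album θ is an assumption about how the threshold was applied, and the variable names are illustrative.

```python
# Spearman values copied from the table: album -> (C1 threshold, C3 best case).
spearman_c1_c3 = {
    "D1": (0.44, 0.69),
    "D2": (0.29, 0.78),
    "D3": (0.24, 0.33),
    "D4": (0.30, 0.60),
    "J1": (0.64, 0.31),
    "J5": (0.20, 0.63),
    "J6": (0.47, 0.63),
    "F2": (0.24, 0.30),
}

# Albums whose best-case correlation exceeds the music-control threshold.
above_threshold = [
    album for album, (theta, s) in spearman_c1_c3.items() if s > theta
]
print(above_threshold)
```

Under this reading, every listed album except J1 clears its threshold.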
Information Retrieval
 Album       Precision (P1)  Precision threshold (P2)  Recall
 D1          0.48            0.43                      0.002
 D2          0.24            0.62                      0.008
 D3          0.29            0.36                      0.001
 D4          0.36            0.36                      0.0004
 J1          0.20            0.50                      0.0003
 J3          0.00            0.75                      0.00
 J5          0.57            0.40                      0.0002
 J6          0.75            0.38                      0.0004
 F1          0.00            0.50                      0.00
 F2          0.67            0.59                      0.00009
 Average     0.35            0.49                      0.001
 HV average  0.51            0.45                      0.0003
 LV average  0.20            0.53                      0.002
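The slides do not define precision and recall for this setting. A plausible reading, sketched below, treats the album tags as the retrieved set and the tweet keywords as the relevant set: precision is the share of tags that also occur as tweet keywords, and recall is the share of tweet keywords covered by tags. That definition, the function name, and the example data are assumptions.

```python
def precision_recall(tags, tweet_keywords):
    """Set-overlap precision and recall of album tags against tweet keywords."""
    tags, tweet_keywords = set(tags), set(tweet_keywords)
    hits = tags & tweet_keywords
    precision = len(hits) / len(tags) if tags else 0.0
    recall = len(hits) / len(tweet_keywords) if tweet_keywords else 0.0
    return precision, recall


# Hypothetical data, for illustration only.
tags = ["rock", "indie", "live", "vinyl"]
keywords = ["rock", "live", "tour", "tickets", "new", "album", "love", "song"]
print(precision_recall(tags, keywords))
```

Under this definition, recall shrinks as the tweet vocabulary grows, because a handful of tags is divided by a much larger keyword set.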
Conclusions
 Amazon tags are a good proxy for top Twitter content when there is sufficient Twitter activity
 More relevant tags are higher in tweet keyword rankings
 TF-IDF is effective


Next Steps
 Larger dataset
 Analysis over time
 Other sources like LastFM
 Linguistic analysis (clustering, stemming)
 Other user-generated data (e.g. user reviews)
Questions?
