Uprising microblogs: A Bayesian network retrieval model for tweet search

3,392 views

Published on

Lamjed Ben Jabeur, Lynda Tamine, Mohand Boughanem. Uprising microblogs: A Bayesian network retrieval model for tweet search (regular paper). Dans : ACM Symposium on Applied Computing (SAC 2012), Riva del Garda (Trento), Italy, 26/03/12-30/03/12, mars 2012 http://dx.doi.org/10.1145/2245276.2245459

We investigate in this paper the problem of accessing to real-time information and we propose a Bayesian network
retrieval model for tweet search. The proposed model interprets tweet relevance as a conditional probability and
estimates it using different sources of evidence. In particular, we introduce a social search model that considers, in
addition to text similarity measures, the microblogger’s influence, the time magnitude and the presence of hashtags.
To evaluate our model, we conducted a series of experiments on the TREC Tweets2011 corpus. Experiments with “Arab
Spring” topic set show that both of social and temporal features improve tweet search for different types of queries.
Final results show also that our model outperforms other traditional information retrieval baselines.

Published in: Education, Technology
0 Comments
5 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
3,392
On SlideShare
0
From Embeds
0
Number of Embeds
131
Actions
Shares
0
Downloads
0
Comments
0
Likes
5
Embeds 0
No embeds

No notes for slide

Uprising microblogs: A Bayesian network retrieval model for tweet search

  1. Uprising microblogs: A Bayesian network retrieval model for tweet search Lamjed Ben Jabeur, Lynda Tamine and Mohand Boughanem IRIT, Université Paul Sabatier
  2. A Bayesian network retrieval model for tweet search Outline1. Microblogging service2. Tweet search3. Bayesian network topology4. Computing conditional probabilities5. Experimental evaluation6. Conclusion and future work 2
  3. Microblogging service Microblog?“ Microblogging is a new form of communication [….] that enables users to broadcast and share information about their activities, opinions and status. [Java et al.2007]. ”• Microblog post – Short (140 characters) 1 billions Publications /week – Real-time 50 millions Publications /day – Social motivation 177 million Publications in mars 2011 – Mobile device +106 millions User accounts 3
  4. Microblogging service Tweet, retweet et hashtag ?“ Jack Dorsey 21 Mars 06  1ier Tweetinviting coworkers #oilspill“ Stephen Colbert 21 Juin 2010  Golden Tweet Award 2010In honor of oil-soaked birds, tweets are now gurgles. http://bit.ly/cIhZNf“ Wendys 8 Juin 2011  Golden Tweet Award 2011RT for a good cause. Each Retweet sends 50¢ to help kids in foster care. #TreatItFwd “ CORIA11 16 mars 2010 CORIA 2011 : Université dAvignon #CORIA11 http://yfrog.com/h3y ““ MohBoughanem 17 Mars 2010 @coria2011 well visualized, quickly found MohBoughanem CORIA11 17 Mars 2010 4 @coria2011 well visualized, quickly found
  5. Microblogging serviceSocial information network 5
  6. Tweet search Microblog IR• Users overwhelmed by the huge quantity of tweets – Important publication rate – Diverse sources of information Difficulty to accessing to interesting posts• Microblog IR tasks – Person search and follower suggestion – Trend extraction – Opinion search – Tweet search 6
  7. Tweet search Tweet search task“ real-time search task, where the user wishes to see the most recent but relevant information to the query. (Ounis et al., 2011). ”“ adhoc search on Twitter, where a user’s information need is ” represented by a query at a specific time. (Ounis et al., 2011).• Search motivations – access to concise and credible information – access to fresh and real-time news – follow an event – collect opinions and public sentiments 7
  8. Tweet search Related work1. Spatio-temporel context TwitterStand (Sankaranarayanan J. et al, 2009) TweetSieve (Grinev M et al, 2009)2. Microblog features – followership, tweets, retweets, reply, hashtags, URLs – Linear combination (Nagmoti et al., 2010) – Learn to Rank (Duan Y et al., 2010) 8
  9. Tweet search Related work3. Social network structure – Indegree, Retweet et Mention influence (Cha et al., 2010).,TweetRank, FollowerRank (Nagmoti et al., 2010). – Authority (Kwak et al., 2010) – Influence (Kwak et al., 2010), TwitterRank (Weng et al., 2010), Popularity (Duan et al.,2010) 9
  10. Tweet search Contributions topical• Relevance features: – Term occurrence – social influence – time magnitude• Bayesian network model temporal social 10
  11. Bayesian network topology Definitions and notations• Query: q  0,1 q, q• Term: ki  0,1 k , ki i• Term configuration: k example : k1 , k 2  k   k1 , k2 ), (k1 , k2 ), (k1 , k2 ), (k1 , k2 ) (• Tweet: t j  0,1 ti , ti• Microblogger: uk  0,1 uk , uk 11
  12. Bayesian network topologyNetwork nodes and edges Query q Terms k1 k2 k3 Tweets t1 t2 t3 Microbloggers u1 u2 12
  13. Computing conditional probabilities Query evaluationQuery q   P(q  t i )   P(q | k )P(k | t i ) P( t i | u k ) P(u k )  kTerms k1 k2 k3  P(q  t j )   P(q | k )P( t j | u k ) P(u k )  kTweets t1 t2 t3     P(k i | t j )   P(k i | t j )   k |on(i,k ) 1    i k i |on(i,k )  0 Microbloggers u1 u2 13
  14. Computing conditional probabilities Query   P(q  t j )   P(q | k )P( t j | u k ) P(u k )  P(k i | t j )   P( k i | t j )    k |on(i,k )1   k  i k i |on(i,k )  0   P(q | k )   on(i, k ) i , ki q 14
  15. Computing conditional probabilities Tweet   P(q  t j )   P(q | k )P( t j | u k ) P(u k )  P(k i | t j )   P( k i | t j )    k |on(i,k )1   k  i k i |on(i,k )  0  P(k i | t j )  (1   ) F (ki , t j ) H (ki , t j )  T (ki , t j ) L(t j ) Term occurrence Tweet properties P( k i | t j )  1  P( k i | t j ) 15
  16. Computing conditional probabilities Term frequency   P(q  t j )   P(q | k )P( t j | u k ) P(u k )  P(k i | t j )   P( k i | t j )    k |on(i,k )1   k  i k i |on(i,k )  0  P(k i | t j )  (1   ) F (ki , t j ) H (ki , t j )  T (ki , t j ) L(t j )  a if k i  t j F ( ki , t j ) 1  1F (ki , t j )   tf ki ,t j 0,8  0 a=0,1  otherwise 0,6 a=0,25 0,4 a=0,5 0,2 a=0,75 0 a=1 0 5 tf ki ,t j10 16
  17. Computing conditional probabilities Hashtag   P(q  t j )   P(q | k )P( t j | u k ) P(u k )  P(k i | t j )   P( k i | t j )    k |on(i,k )1   k  i k i |on(i,k )  0  P(k i | t j )  (1   ) F (ki , t j ) H (ki , t j )  T (ki , t j ) L(t j )  b if # k i  t j 1 H (ki , t j )   tf #ki ,t j  b otherwise  17
  18. Computing conditional probabilities Time magnitude   P(q  t j )   P(q | k )P( t j | u k ) P(u k )  P(k i | t j )   P( k i | t j )    k |on(i,k )1   k  i k i |on(i,k )  0  P(k i | t j )  (1   ) F (ki , t j ) H (ki , t j )  T (ki , t j ) L(t j ) tweets df k i, j T ( ki , t j )  30 j 20 t1 10 t2 0 1 2 tems 3 4 5   j  t k ,  t j   t k  t  time 18
  19. Computing conditional probabilities Tweet length   P(q  t j )   P(q | k )P( t j | u k ) P(u k )  P(k i | t j )   P( k i | t j )    k |on(i,k )1   k  i k i |on(i,k )  0  P(k i | t j )  (1   ) F (ki , t j ) H (ki , t j )  T (ki , t j ) L(t j ) 1L(t j )  1  avgtl  tltj 19
  20. Computing conditional probabilities Microblogger   P(q  t j )   P(q | k )P( t j | u k ) P(u k )  P(k i | t j )   P( k i | t j )    k |on(i,k )1   k  i k i |on(i,k )  0  1 P( t j | u k )  u k 20
  21. Computing conditional probabilities Social influence   P(q  t j )   P(q | k )P( t j | u k ) P(u k )  P(k i | t j )   P( k i | t j )    k |on(i,k )1   k  i k i |on(i,k )  0  P(uk )  Inf (uk )PageRank on Retweet Social Network 1 Inf G 1 (ui ) kInf Gk (ui )  d  (1  d )  w j ,i U u j ,e ( u j ,ui )E O(u j )  (u j )   (u j )w j ,i   (u j ) 21
  22. Computing conditional probabilities Social influence   P(q  t j )   P(q | k )P( t j | u k ) P(u k )  P(k i | t j )   P( k i | t j )    k |on(i,k )1   k  i k i |on(i,k )  0   (u j )   (ui ) wi , j   (ui ) 22
  23. Experimental evaluationTREC 2011 Microblog NESTOR Microblog Search EngineTweets 16 141 812 Microbloggers 5 356 432Retweets 1 128 179 Retweet relationships 1 060 551Tweet 1 860 112 Social network of retweets: nodes 5 495 081Terms 7 781 775 Social network of retweets: edges 1 024 914Hashtags 455 179 Giant component 11.12% Term frequency Hashtags Tweet length 1.5E8 1.5E 7 1.5E 60 5 10 0 5 10 0 20 Term frequency, hashtags and length distributions 23
  24. Experimental evaluation Queries and ground truth• “Arab Spring” query dataset (25 queries) – Topical “Number of protesters in Tahrir”, “Tunisian revolution” – Temporal “ElBaradei arrvies in Egypt”, “Clashes in Tahrir”, “SMS Down Egypt” – Social “Wael Ghonim”, “Mubarak dissolves government”• User rating (relevant, not relevent)• Tweets ranked by Score; p@10; p@20 24
  25. Experimental evaluation Configurations and baselinesBNTS Bayesian network model for tweet search*BNTS-L BNTS, Tweet length feature disabledBNTS-T BNTS, Time magnitude feature disabledBNTS-H BNTS, Hashtag feature disabledBNTS-S BNTS, Social influence feature disabledBM25 Okapi BM25VSM Vector Space ModelBM Boolean Model*   0.25, a  0.25, b  0.4, t  1h, d  0.15 25
  26. Experimental evaluation Features impact BNTS BNTS-L BNTS-T BNTS-H BNTS-S 0,584 0,580,552 0,532 0,548 0,542 0,528 0,502 0,294 0,256 p@10 p@20 26
  27. Experimental evaluationFeatures impact Topical BNTS BNTS-L BNTS-T BNTS-H BNTS-S 0,7533 0,7333 0,72330,66 0,6867 0,6867 0,6833 0,6833 0,3767 0,2867 p@10 p@20 27
  28. Experimental evaluation Features impact Temporal BNTS BNTS-L BNTS-T BNTS-H BNTS-S0,4333 0,4 0,3333 0,35 0,3 0,3167 0,2333 0,2 0,1 0,0667 p@10 p@20 28
  29. Experimental evaluation Features impact Social BNTS BNTS-L BNTS-T BNTS-H BNTS-L0,3714 0,3286 0,3286 0,3286 0,3357 0,2714 0,2857 0,2429 0,2571 0,2 p@10 p@20 29
  30. Experimental evaluation Retrieval effectiveness p@10 p@20BNTS 0,552 0,548BM25 0,576 -4% 0,494 11%BM 0,416 ** 33% 0,382 ** 34%VSM 0,376 ** 47% 0,36 ** 52% 30
  31. A Bayesian network retrieval model for tweet search Conclusion and future work• Tweet search model – Normalized Term frequency – Time magnitude – Social influence• Integrating relevance factors within a Bayesian network• Query profile impact features performances.• Our model outperforms traditional IR baselines.• Future work – Automatically detect optimal time window – Select appropriate feature depending on the query profile 31
  32. Thank you for your attention! Follow me on Twitter! http://twitter.com/amjedbj
  33. Computing conditional probabilities Query evaluation    q P(t j | q)   P(q | k ) P(t j | k )P(k )  k     k1 k2 k3 P(t j | q)   P(q | k ) P(tkj | k )P(toj | k ) P(t sj | k ) P(k )  k o1 o2 u1 u1tk1 tk2 tk3 to3 to2 to3 ts1 ts2 ts3 t1 t2 t3 33
  34. Experimental evaluation Term frequency normalization• BNTS.K p @ 30  1 tf ki ,t j    0,35 P(t kj | k )    0,3 k ki k t j tf ki ,t j 0,25 0,2 0,15 0,1 0,05 0 0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1  34
  35. Experimental evaluation Time window • BNTS.KOp @ 30 0,32  t t  oe :  oe  , oe   0,315  2 2 0,31 0,305 0,3 0,295 jours 0,29 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 35 t
  36. Experimental evaluation Retrieval effectiveness isiFDL DFReeKLIM30 BNTS Médiane Nestor BM25 Disjunctive 0,50,45 0,40,35 0,30,25 0,20,15 0,10,05 0 p@30 MAP 36
  37. Experimental evaluation TREC Microblogs 2011 Ranked by time Ranked by score All rel High rel All rel p@30 MAP p@30 MAP p@30 MAPNestor* 0.2027 0.1305 0.0838 0.1287 0.2218 0.1384Nestor-S* 0.2027 0.1305 0.0838 0.1286 0.2184 0.1360Nestor-T 0.2082 0.1343 0.0585 0.0912 0.1912 0.1196Nestor-L 0.2048 0.1306 0.0565 0.0867 0.2293 0.1426Median 0.2592 0.1433 0.2646 0.1381 37
  38. Experimental evaluation TREC Microblogs 2011 Système Seuil p@10 p@20 p@30 Map1 Somme IDF des termes présents 30 0,3633 0,3316 0,3333 0,17592 BM25 30 0,3571 0,3245 0,2973 0,15463 Proportion des termes présents 30 0,2653 0,2561 0,2782 0,144 Somme des fréquences booléennes 30 0,2571 0,2663 0,2755 0,13875 EBM (AND) 30 0,3041 0,2918 0,2714 0,12826 Réseau d’inférence Bayésien 30 0,302 0,2888 0,2687 0,12747 Somme TF*IDF 30 0,302 0,2888 0,2687 0,12748 VSM 30 0,302 0,2888 0,2687 0,12749 Somme TF 30 0,2327 0,2276 0,2238 0,106610 Nestor 0,2857 0,2347 0,2027 0,130511 EBM (OR) 30 0,1837 0,1786 0,166 0,054112 Sommes des fréquences des Hashtags 30 0,1612 0,1541 0,1469 0,051213 Lucene-Baseline 1000 0,1612 0,1143 0,0986 0,141114 Somme TF (normalise par longueur) 30 0,0816 0,0673 0,0612 0,022315 Ordre chronologique inverse 30 0,0184 0,0255 0,0218 0,0082 38

×