CERI2010
I Congreso Español
de Recuperación de Información   Daniel Gayo Avello @pfcdgayo, David J. Brenes @brenes

Overcoming Spammers in Twitter – A Tale of Five Algorithms

Micro-blogging services such as Twitter can develop into valuable sources of up-to-date information provided the spam problem is overcome. Thus, separating the most relevant users from the spammers is a highly pertinent question for which graph centrality methods can provide an answer. In this paper we examine the vulnerability of five different algorithms to linking malpractice in Twitter and propose a first step towards "desensitizing" them against such abusive behavior.

Published in: Technology, News & Politics

Transcript of "Overcoming Spammers in Twitter – A Tale of Five Algorithms"

  1. 1. CERI2010 I Congreso Español de Recuperación de Información Daniel Gayo Avello @pfcdgayo, David J. Brenes @brenes
  2. 2. 40% of Twitter conversation is pointless babble Pear Analytics (2009)
  3. 3. 40% of Twitter conversation is pointless babble Pear Analytics (2009) “Who would have thought that the status message would be one of the hottest features on the Web? Jansen, Chowdury & Cook (2010)
  4. 4. 40% of Twitter conversation is pointless babble Pear Analytics (2009) “Who would have thought that the status message would be one of the hottest features on the Web? Jansen, Chowdury & Cook (2010) “Micro-blogging services can develop into valuable sources of up-to-date information provided the spam problem is overcome. Us (Now)
  5. 5. 40% of Twitter conversation is pointless babble Pear Analytics (2009) “Who would have thought that the status message would be one of the hottest features on the Web? Jansen, Chowdury & Cook (2010) “Micro-blogging services can develop into valuable sources of up-to-date information provided the spam problem is overcome. Us (Now) How can all of this be reconciled?
  8. 8. 40% of Twitter conversation is pointless babble Pear Analytics (2009) ok, it may be true, but...
  9. 9. 60% is not. 40% of Twitter conversation is pointless babble Pear Analytics (2009) There is still room for hope: a huuuuuuuuge number of users, info on current events, also valuable contents …
  10. 10. “Micro-blogging services can develop into valuable sources of up-to-date information provided the spam problem is overcome. Us (Now) A huuuuuuuuge number of users, info on current events, also valuable contents …
  11. 11. “Micro-blogging services can develop into valuable sources of up-to-date information provided the spam problem is overcome. Us (Now)
  12. 12. “Micro-blogging services can develop into valuable sources of up-to-date information provided the spam problem is overcome. Us (Now) Aggregated analysis topic detection and tracking opinion mining …
  13. 13. “Micro-blogging services can develop into valuable sources of up-to-date information provided the spam problem is overcome. Us (Now) Aggregated analysis topic detection and tracking opinion mining … finding “authoritative” sources/users
  15. 15. finding “authoritative” sources/users
  16. 16. black magic, secret sauce approach finding “authoritative” sources/users
  17. 17. black magic, secret sauce approach finding “authoritative” sources/users algorithmic methods based on the users’ graph
  19. 19. Warning! Slight detour…
  20. 20. El Listo – Chulerías 2.0 http://listocomics.com/414-chulerias-2-0/
  23. 23. http://listocomics.com/394-piramide-del-glamour-twittero/
  24. 24. The Brads – Twitter Outage http://bradcolbow.com/archive/view/the_brads_twitter_outage/
  26. 26. End of detour. What’s the moral? People want lots of followers (?!) The follower/followee ratio “matters” more than raw number of followers. Following people is a simple way to get followers. In other words, users are “abusing” social networks to grab “prestige”.
  27. 27. Research questions? Vulnerability of rank prestige algorithms to link spamming in social graphs. Feasibility of “desensitization”.
  28. 28. 5 Rank prestige algorithms PageRank HITS NodeRanking TunkRank TwitterRank
  29. 29. TunkRank Originally proposed by Daniel Tunkelang Rather similar to PageRank
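A minimal sketch of the TunkRank recurrence as Tunkelang described it: a user's influence is the sum, over each follower, of (1 + p × that follower's influence) divided by the number of accounts the follower follows. The retweet probability `p`, the iteration count, and the toy graph are assumptions for illustration, not values from the talk.

```python
def tunkrank(following, p=0.05, iterations=50):
    """Iterate the TunkRank recurrence.

    `following` maps each user to the set of users they follow.
    `p` (assumed value) is the probability that a follower passes a tweet on.
    """
    users = set(following)
    for followees in following.values():
        users |= set(followees)

    influence = {user: 0.0 for user in users}
    for _ in range(iterations):
        updated = {user: 0.0 for user in users}
        for follower, followees in following.items():
            if not followees:
                continue
            # Each follower splits its attention evenly among the accounts it follows.
            share = (1.0 + p * influence[follower]) / len(followees)
            for followee in followees:
                updated[followee] += share
        influence = updated
    return influence


# Hypothetical toy graph: ann and dan follow others but nobody follows them.
toy = {"ann": {"bob"}, "bob": {"cat"}, "cat": {"bob"}, "dan": {"bob", "cat"}}
scores = tunkrank(toy)
```

In this toy graph `bob` ends up with the highest influence, and users without followers stay at zero.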
  30. 30. Desensitizing against link spamming in social graphs follower/followee ratio can be interpreted as the user’s value regarding the introduction of new information from the outside world into the Twitter global ecosystem. Reciprocal links are “counterfeit currency” to increase followers count.
  32. 32. Desensitizing against link spamming in social graphs
  33. 33. Desensitizing against link spamming in social graphs HotSEOGuru: 12,000 following, 11,800 followers; ratio = 11,800/12,000 = 0.983
  34. 34. Desensitizing against link spamming in social graphs HotSEOGuru: 12,000 following, 11,800 followers; ratio = 11,800/12,000 = 0.983; if 11,650 were reciprocal links, ratio_discounted = 150/350 = 0.429
  35. 35. Desensitizing against link spamming in social graphs stevebaker: 225 following, 2,227 followers; ratio = 2,227/225 = 9.898
  36. 36. Desensitizing against link spamming in social graphs stevebaker: 225 following, 2,227 followers; ratio = 2,227/225 = 9.898; 140 are reciprocal links, ratio_discounted = 2,087/85 = 24.553
  37. 37. Desensitizing against link spamming in social graphs HotSEOGuru (12,000 following, 11,800 followers): she would prefer 0.983 but she obtains 0.429. stevebaker (225 following, 2,227 followers): he would prefer 24.553 but he obtains 9.898.
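The arithmetic on the HotSEOGuru and stevebaker slides can be reproduced with a short sketch: subtract the reciprocal links from both the follower and the following counts, then take the ratio. The function names and the handling of a zero denominator are assumptions; the slides do not cover that case.

```python
def follower_ratio(followers, following):
    """Plain follower/followee ratio."""
    return followers / following


def discounted_ratio(followers, following, reciprocal):
    """Follower/followee ratio after discarding reciprocal links."""
    if reciprocal > min(followers, following):
        raise ValueError("reciprocal links cannot exceed either count")
    net_followers = followers - reciprocal
    net_following = following - reciprocal
    if net_following == 0:
        # Assumption: not covered by the slides; treat as unbounded or zero.
        return float("inf") if net_followers > 0 else 0.0
    return net_followers / net_following


# Figures taken from the slides.
hot_plain = round(follower_ratio(11_800, 12_000), 3)
hot_discounted = round(discounted_ratio(11_800, 12_000, 11_650), 3)
steve_plain = round(follower_ratio(2_227, 225), 3)
steve_discounted = round(discounted_ratio(2_227, 225, 140), 3)
print(hot_plain, hot_discounted, steve_plain, steve_discounted)
# prints: 0.983 0.429 9.898 24.553
```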
  38. 38. Desensitizing against link spamming in social graphs for this study it has been applied as an extra weight to PageRank and also to prune the graph before applying PageRank
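The slide does not say exactly how the graph is pruned, so the following is only one plausible reading: drop every reciprocal follow pair, then run an ordinary power-iteration PageRank on what remains. The damping factor, iteration count, helper names, and toy graph are all assumptions made for illustration.

```python
def prune_reciprocal(edges):
    """Drop every edge (u, v) whose reverse (v, u) is also present."""
    edge_set = set(edges)
    return {(u, v) for (u, v) in edge_set if (v, u) not in edge_set}


def pagerank(edges, nodes, damping=0.85, iterations=100):
    """Power-iteration PageRank; an edge (u, v) means 'u follows v'."""
    nodes = sorted(nodes)
    out = {node: [] for node in nodes}
    for u, v in edges:
        out[u].append(v)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread evenly over all nodes.
        dangling = sum(rank[u] for u in nodes if not out[u])
        new = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for u in nodes:
            for v in out[u]:
                new[v] += damping * rank[u] / len(out[u])
        rank = new
    return rank


# Hypothetical toy graph: "spam" trades follows with a, b and c,
# while "expert" is followed by r1 and r2 without following back.
nodes = ["spam", "a", "b", "c", "expert", "r1", "r2"]
edges = [("spam", "a"), ("a", "spam"), ("spam", "b"), ("b", "spam"),
         ("spam", "c"), ("c", "spam"), ("r1", "expert"), ("r2", "expert")]

plain = pagerank(edges, nodes)
pruned = pagerank(prune_reciprocal(edges), nodes)
```

On this toy graph the account that trades follows outranks the expert under plain PageRank and falls below the expert once reciprocal links are pruned.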
  39. 39. How can we measure performance in this scenario?
  43. 43. How can we measure performance in this scenario? …
  44. 44. How can we measure performance in this scenario? The lower the ranking spammers reach, the better a method is.
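One way to turn "the lower the ranking spammers reach, the better" into numbers is sketched below: sort users by score, record the positions occupied by known spammers, and count how many land in the top k. The function names and the toy scores are hypothetical; the slides do not specify the exact measure used.

```python
def spammer_positions(scores, spammers):
    """1-based positions of spammers when users are sorted by descending score."""
    ordered = sorted(scores, key=lambda user: (-scores[user], user))
    return sorted(i + 1 for i, user in enumerate(ordered) if user in spammers)


def spammers_in_top(scores, spammers, k):
    """Number of spammers among the k best-ranked users."""
    return sum(1 for pos in spammer_positions(scores, spammers) if pos <= k)


# Hypothetical scores produced by two ranking methods for the same users.
spammers = {"spam1", "spam2"}
method_a = {"u1": 0.9, "spam1": 0.8, "u2": 0.5, "spam2": 0.1}
method_b = {"u1": 0.9, "u2": 0.8, "spam1": 0.2, "spam2": 0.1}

positions_a = spammer_positions(method_a, spammers)  # [2, 4]
positions_b = spammer_positions(method_b, spammers)  # [3, 4]
```

Here method B would be preferred, because its best-placed spammer sits lower in the ranking than method A's.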
  45. 45. A Twitter dataset is needed!
  47. 47. A Twitter dataset is needed! January to August 2009. Tweets: 27.9M English entries by 4.98M users. Graph: 1.8M users, 134M links.
  48. 48. What about the spammers?
  50. 50. What about the spammers? A simple method based on URL presence and keyword matching. Using this method, 9,369 users were marked as spammers and another 22,290 users as aggressive marketers (similar bios).
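The slide gives no detail beyond "URL presence and keyword matching", so this is only a rough sketch of what such a heuristic could look like. The keyword list, the 50% threshold, and the function names are all assumptions and are not the authors' actual rules.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)
# Hypothetical keyword list; the talk does not list the real one.
SPAM_KEYWORDS = {"free", "followers", "money", "win"}


def looks_spammy(tweet):
    """True if the tweet contains a URL and at least one spam keyword."""
    has_url = URL_PATTERN.search(tweet) is not None
    # Remove URLs first so that words inside links are not matched.
    text = URL_PATTERN.sub(" ", tweet).lower()
    words = set(re.findall(r"[a-z']+", text))
    return has_url and bool(words & SPAM_KEYWORDS)


def flag_spammers(tweets_by_user, min_share=0.5):
    """Flag users whose share of spammy tweets reaches `min_share` (assumed)."""
    flagged = set()
    for user, tweets in tweets_by_user.items():
        if not tweets:
            continue
        share = sum(looks_spammy(t) for t in tweets) / len(tweets)
        if share >= min_share:
            flagged.add(user)
    return flagged


# Hypothetical sample data.
sample = {
    "promo_bot": ["Get FREE followers now http://example.com/x",
                  "win money fast http://example.com/y"],
    "reader": ["Reading the CERI programme http://example.org",
               "coffee time"],
}
flagged = flag_spammers(sample)
```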
  51. 51. Results
  52. 52. Results HITS and TwitterRank underperform PageRank :(
  53. 53. Results HITS and TwitterRank underperform PageRank :( NodeRanking, “pruned” PageRank and PageRank very similar :/
  54. 54. Results HITS and TwitterRank underperform PageRank :( NodeRanking, “pruned” PageRank and PageRank very similar :/ “discounted” PageRank inconclusive (elitism & “giant shoulders” effect) :S
  55. 55. Results HITS and TwitterRank underperform PageRank :( NodeRanking, “pruned” PageRank and PageRank very similar :/ “discounted” PageRank inconclusive (elitism & “giant shoulders” effect) :S TunkRank better choice :)
  56. 56. Conclusions Rank prestige can be “gamed” in social networks. Ranking in itself shouldn’t be the point. TunkRank better choice to rank social graphs. Reciprocal links can help to find valuable users. An extended report is available at http://arxiv.org/abs/1004.0816
