Overcoming Spammers in Twitter – A Tale of Five Algorithms

Micro-blogging services such as Twitter can develop into valuable sources of up-to-date information provided the spam problem is overcome. Thus, separating the most relevant users from the spammers is a highly pertinent question for which graph centrality methods can provide an answer. In this paper we examine the vulnerability of five different algorithms to linking malpractice in Twitter and propose a first step towards "desensitizing" them against such abusive behavior.

Published in: Technology, News & Politics

  1. 1. CERI2010: I Congreso Español de Recuperación de Información. Daniel Gayo Avello (@pfcdgayo), David J. Brenes (@brenes)
  2. 2. 40% of Twitter conversation is pointless babble Pear Analytics (2009)
  3. 3. 40% of Twitter conversation is pointless babble Pear Analytics (2009) “Who would have thought that the status message would be one of the hottest features on the Web?” Jansen, Chowdury & Cook (2010)
  4. 4. 40% of Twitter conversation is pointless babble Pear Analytics (2009) “Who would have thought that the status message would be one of the hottest features on the Web?” Jansen, Chowdury & Cook (2010) “Micro-blogging services can develop into valuable sources of up-to-date information provided the spam problem is overcome.” Us (Now)
  5. 5. 40% of Twitter conversation is pointless babble Pear Analytics (2009) “Who would have thought that the status message would be one of the hottest features on the Web?” Jansen, Chowdury & Cook (2010) “Micro-blogging services can develop into valuable sources of up-to-date information provided the spam problem is overcome.” Us (Now) How can all of this be reconcilable?
  8. 8. 40% of Twitter conversation is pointless babble Pear Analytics (2009) OK, it may be true, but...
  9. 9. 60% of Twitter conversation is not pointless babble Pear Analytics (2009) There is still room for hope: huuuuuuuuge number of users, info on current events, also valuable contents …
  10. 10. “Micro-blogging services can develop into valuable sources of up-to-date information provided the spam problem is overcome.” Us (Now) Huuuuuuuuge number of users, info on current events, also valuable contents …
  11. 11. “Micro-blogging services can develop into valuable sources of up-to-date information provided the spam problem is overcome.” Us (Now)
  12. 12. “Micro-blogging services can develop into valuable sources of up-to-date information provided the spam problem is overcome.” Us (Now) Aggregated analysis, topic detection and tracking, opinion mining …
  13. 13. “Micro-blogging services can develop into valuable sources of up-to-date information provided the spam problem is overcome.” Us (Now) Aggregated analysis, topic detection and tracking, opinion mining … finding “authoritative” sources/users
  15. 15. finding “authoritative” sources/users
  16. 16. finding “authoritative” sources/users: black magic, secret sauce approach
  17. 17. finding “authoritative” sources/users: black magic, secret sauce approach, or algorithmic methods based on the users’ graph
  19. 19. Warning! Slight detour…
  20. 20. El Listo – Chulerías 2.0 http://listocomics.com/414-chulerias-2-0/
  23. 23. http://listocomics.com/394-piramide-del-glamour-twittero/
  24. 24. The Brads – Twitter Outage http://bradcolbow.com/archive/view/the_brads_twitter_outage/
  26. 26. End of detour. What’s the moral? People want lots of followers (?!). The follower/followee ratio “matters” more than the raw number of followers. Following people is a simple way to get followers. In other words, users are “abusing” social networks to grab “prestige”.
  27. 27. Research questions? Vulnerability of rank prestige algorithms to link spamming in social graphs. Feasibility of “desensitization”.
  28. 28. 5 rank prestige algorithms: PageRank, HITS, NodeRanking, TunkRank, TwitterRank
  29. 29. TunkRank: originally proposed by Daniel Tunkelang; rather similar to PageRank
  30. 30. Desensitizing against link spamming in social graphs: the follower/followee ratio can be interpreted as the user’s value regarding the introduction of new information from the outside world into the Twitter global ecosystem. Reciprocal links are “counterfeit currency” used to increase the follower count.
  32. 32. Desensitizing against link spamming in social graphs
  33. 33. Desensitizing against link spamming in social graphs. HotSEOGuru: 12,000 following, 11,800 followers; ratio = 11,800/12,000 = 0.983
  34. 34. Desensitizing against link spamming in social graphs. HotSEOGuru: 12,000 following, 11,800 followers; ratio = 11,800/12,000 = 0.983; if 11,650 were reciprocal links, ratio_discounted = 150/350 = 0.429
  35. 35. Desensitizing against link spamming in social graphs. stevebaker: 225 following, 2,227 followers; ratio = 2,227/225 = 9.898
  36. 36. Desensitizing against link spamming in social graphs. stevebaker: 225 following, 2,227 followers; ratio = 2,227/225 = 9.898; 140 are reciprocal links, ratio_discounted = 2,087/85 = 24.553
  37. 37. Desensitizing against link spamming in social graphs. HotSEOGuru (12,000 following, 11,800 followers): she would prefer 0.983 but she obtains 0.429. stevebaker (225 following, 2,227 followers): he would prefer 24.553 but he obtains 9.898.
  38. 38. Desensitizing against link spamming in social graphs: for this study, this idea has been applied both as an extra weight to PageRank and to prune the graph before applying PageRank
  39. 39. How can we measure performance in this scenario?
  43. 43. How can we measure performance in this scenario? …
  44. 44. How can we measure performance in this scenario? The lower the ranking spammers reach, the better a method is.
  45. 45. A Twitter dataset is needed!
  47. 47. A Twitter dataset is needed! January to August 2009. Tweets: 27.9M English entries by 4.98M users. Graph: 1.8M users, 134M links.
  48. 48. What about the spammers?
  50. 50. What about the spammers? A simple method based on URL presence and keyword matching. Using this method, 9,369 users were marked as spammers; another 22,290 users were marked as aggressive marketers (similar bios).
  51. 51. Results
  52. 52. Results: HITS and TwitterRank underperform PageRank :(
  53. 53. Results: HITS and TwitterRank underperform PageRank :( NodeRanking, “pruned” PageRank and PageRank are very similar :/
  54. 54. Results: HITS and TwitterRank underperform PageRank :( NodeRanking, “pruned” PageRank and PageRank are very similar :/ “Discounted” PageRank is inconclusive (elitism & “giant shoulders” effect) :S
  55. 55. Results: HITS and TwitterRank underperform PageRank :( NodeRanking, “pruned” PageRank and PageRank are very similar :/ “Discounted” PageRank is inconclusive (elitism & “giant shoulders” effect) :S TunkRank is the better choice :)
  56. 56. Conclusions: Rank prestige can be “gamed” in social networks. Ranking in itself shouldn’t be the point. TunkRank is a better choice to rank social graphs. Reciprocal links can help to find valuable users. An extended report is available at http://arxiv.org/abs/1004.0816
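
The TunkRank slide only says that the algorithm was proposed by Daniel Tunkelang and is rather similar to PageRank. The sketch below assumes the commonly cited formulation, in which a user's influence is the sum, over each follower Y, of (1 + p * Influence(Y)) / |Following(Y)|, with p a constant probability that a tweet is passed along. The function name `tunkrank`, the value of `p`, the fixed iteration count and the toy graph are all illustrative assumptions, not taken from the slides.

```python
def tunkrank(following, p=0.05, iterations=50):
    """Iteratively estimate TunkRank-style influence (assumed formulation).

    `following` maps each user to the set of users they follow.
    """
    users = set(following)
    for followees in following.values():
        users |= set(followees)

    influence = {u: 0.0 for u in users}
    for _ in range(iterations):
        updated = {u: 0.0 for u in users}
        for follower, followees in following.items():
            if not followees:
                continue
            # Each follower splits (1 + p * own influence) among the
            # accounts it follows.
            share = (1.0 + p * influence[follower]) / len(followees)
            for followee in followees:
                updated[followee] += share
        influence = updated
    return influence


# Hypothetical toy graph: three users follow bob, one of them also follows alice.
toy_graph = {
    "alice": {"bob"},
    "carol": {"bob"},
    "dave": {"bob", "alice"},
    "bob": set(),
}
scores = tunkrank(toy_graph)
```

Because a follower's contribution is divided by the number of accounts it follows, an account that follows thousands of users passes very little influence to each of them, which is one plausible reason for the behaviour reported on the results slides.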
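
The arithmetic on the HotSEOGuru and stevebaker slides can be checked with a few lines of code. The plain and discounted ratios follow the slides; the function names are illustrative, and `obtained_ratio` (taking the smaller of the two values) is only an assumption that happens to reproduce the "obtains" figures shown on the comparison slide.

```python
def follower_ratio(followers, following):
    """Plain follower/followee ratio."""
    return followers / following


def discounted_ratio(followers, following, reciprocal):
    """Ratio after removing reciprocal links from both counts."""
    remaining_following = following - reciprocal
    if remaining_following <= 0:
        # Every followee is reciprocal: the ratio is undefined, so this
        # sketch reports infinity rather than dividing by zero.
        return float("inf")
    return (followers - reciprocal) / remaining_following


def obtained_ratio(followers, following, reciprocal):
    """ASSUMPTION: the user obtains the lower of the two ratios."""
    return min(
        follower_ratio(followers, following),
        discounted_ratio(followers, following, reciprocal),
    )


# Figures from the slides.
hot_seo_guru = dict(followers=11_800, following=12_000, reciprocal=11_650)
stevebaker = dict(followers=2_227, following=225, reciprocal=140)

for name, user in (("HotSEOGuru", hot_seo_guru), ("stevebaker", stevebaker)):
    plain = follower_ratio(user["followers"], user["following"])
    discounted = discounted_ratio(**user)
    print(f"{name}: ratio={plain:.3f} discounted={discounted:.3f} "
          f"obtained={obtained_ratio(**user):.3f}")
```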
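
The slides mention pruning the graph before applying PageRank but do not spell out the procedure. The sketch below assumes that pruning means dropping every reciprocal link (A follows B and B follows A) and then running an ordinary power-iteration PageRank on what is left. The function names, the damping factor and the toy graph are illustrative assumptions.

```python
def prune_reciprocal(edges):
    """Drop every (follower, followee) edge whose reverse edge also exists."""
    edge_set = set(edges)
    return {(a, b) for (a, b) in edge_set if (b, a) not in edge_set}


def pagerank(nodes, edges, damping=0.85, iterations=100):
    """Plain power-iteration PageRank; rank flows from follower to followee."""
    nodes = list(nodes)
    n = len(nodes)
    out_links = {u: [] for u in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Rank held by users who follow nobody is spread evenly.
        dangling = sum(rank[u] for u in nodes if not out_links[u])
        new_rank = {u: (1.0 - damping) / n + damping * dangling / n
                    for u in nodes}
        for u in nodes:
            for v in out_links[u]:
                new_rank[v] += damping * rank[u] / len(out_links[u])
        rank = new_rank
    return rank


# Hypothetical toy graph: "spammer" trades follows with "a" and "b",
# who also follow "expert" without being followed back.
toy_nodes = ["spammer", "a", "b", "expert"]
toy_edges = [
    ("spammer", "a"), ("a", "spammer"),
    ("spammer", "b"), ("b", "spammer"),
    ("a", "expert"), ("b", "expert"),
]
pruned_edges = prune_reciprocal(toy_edges)
pruned_rank = pagerank(toy_nodes, pruned_edges)
```

After pruning, only the one-way links to "expert" remain, so "expert" ends up ranked above "spammer" in this toy graph.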
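
The evaluation criterion on the performance slides (the lower the ranking spammers reach, the better a method is) can be made concrete in several ways; the slides do not name a specific metric. One possible sketch, with illustrative function names and made-up scores, counts how many known spammers land in the top k positions of each method's ranking.

```python
def rank_positions(scores):
    """Map each user to a 1-based position, best score first."""
    # Ties are broken by user name so the ordering is deterministic.
    ordered = sorted(scores, key=lambda user: (-scores[user], user))
    return {user: index + 1 for index, user in enumerate(ordered)}


def spammers_in_top_k(scores, spammers, k):
    """Count known spammers ranked within the top k positions."""
    positions = rank_positions(scores)
    return sum(1 for user in spammers
               if user in positions and positions[user] <= k)


# Hypothetical scores produced by two ranking methods on the same users.
method_a = {"expert": 0.50, "spam1": 0.30, "fan": 0.15, "spam2": 0.05}
method_b = {"expert": 0.50, "fan": 0.30, "spam1": 0.15, "spam2": 0.05}
known_spammers = {"spam1", "spam2"}

top2_a = spammers_in_top_k(method_a, known_spammers, k=2)
top2_b = spammers_in_top_k(method_b, known_spammers, k=2)
```

Under this measure, the method that lets fewer spammers into the top k (here `method_b`) would be considered the more robust one.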
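
The spammer-labelling slide describes a simple method based on URL presence and keyword matching without giving details. The following is a minimal sketch of that kind of heuristic: the keyword list, the rule that a tweet needs both a URL and a keyword, and the `min_fraction` threshold are placeholders invented for illustration, not the values used in the study.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)

# HYPOTHETICAL keyword list; the slides do not say which keywords were used.
SPAM_KEYWORDS = {"free followers", "make money", "work from home"}


def looks_spammy(tweet):
    """Flag a tweet that contains both a URL and a spam keyword."""
    text = tweet.lower()
    has_url = URL_PATTERN.search(text) is not None
    has_keyword = any(keyword in text for keyword in SPAM_KEYWORDS)
    return has_url and has_keyword


def mark_spammers(tweets_by_user, min_fraction=0.5):
    """Mark users whose share of spammy tweets reaches `min_fraction`.

    The threshold is an assumption made for this sketch.
    """
    flagged = set()
    for user, tweets in tweets_by_user.items():
        if not tweets:
            continue
        spammy = sum(1 for tweet in tweets if looks_spammy(tweet))
        if spammy / len(tweets) >= min_fraction:
            flagged.add(user)
    return flagged


# Made-up sample tweets.
sample = {
    "seller": ["Make money fast http://example.com/x",
               "FREE FOLLOWERS here https://example.com/y"],
    "reader": ["Reading about PageRank http://example.com/paper",
               "make money? no thanks"],
}
flagged_users = mark_spammers(sample)
```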
