Characterizing the Splogosphere

  1. 1. Characterizing the Splogosphere
       Tim Finin
       http://ebiquity.umbc.edu/paper/html/id/299/
       Pranam Kolari, Akshay Java and Tim Finin, University of Maryland, Baltimore County
       3rd Annual Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 22 May 2006
  2. 2. Outline
       • Introduction
       • Motivation
       • BlogPulse Dataset
       • Weblogs.com Dataset
       • Implications
  3. 3. The Blogosphere
       • 57% of online US teens generate content, 40% read blogs, 20% have them! (Pew Nov. 2005)
       • 53% of companies are blogging (Guideware Oct. 2005)
       • MySpace accounts for 1/3 of all web clicks (Hendler, 2006) ?!
       • But … the Blogosphere is awash in spam
       Source: Wikipedia
  4. 4. Blogosphere/Splogosphere
  5. 5. Spam in the Blogosphere
       • Types: comment spam, ping spam, spam blogs
       • Akismet: “87% of all comments are spam”
       • 75% of update pings are spam (ebiquity 2005)
       • 20% of the blogs indexed by popular blog search engines are spam (Umbria 2006, ebiquity 2005)
       • “Spam blogs, sometimes referred to by the neologism splogs, are weblog sites which the author uses only for promoting affiliated websites”
       • “Spings, or ping spam, are pings that are sent from spam blogs”
       Source: Wikipedia
  6. 6. Motivation: host ads
  7. 7. Motivation: index affiliates, promote PageRank
  8. 8. Spings from weblogs.com
  9. 9. Where do Splogs come from?
       • “Honestly, Do you think people who make $10k/month from adsense make blogs manually? Come on, they need to make them as fast as possible. Save Time = More Money! It's Common SENSE! How much money do you think you will save if you can increase your work pace by a hundred times? Think about it…”
       • “Discover The Amazing Stealth Traffic Secrets Insiders Use To Drive Thousands Of Targeted Visitors To Any Site They Desire!”
       • “Holy Grail Of Advertising...”
       • “Easily Dominate Any Market, Any Search Engine, Any Keyword.”
       • $197
  10. 11. Our splog bait was picked up and used by dozens of sploggers
  11. 13. Our feed is RSSjacked by at least one splogger
  12. 14. Why are splogs a problem?
       • Splogs undermine ranking algorithms
       • Splogs water down search results
       • Splogs threaten the Web advertising model
       • Splogs indulge in “plagiarism”
       • Splogs skew results of market research tools
       • Splogs stress the Blogosphere infrastructure of ping servers, blog search engines, etc.
  13. 15. Outline
       • Introduction
       • Motivation
       • BlogPulse Dataset
       • Weblogs.com Dataset
       • Implications
  14. 16. Splog Detection
       • SVM-based probabilistic splog detection (Kolari et al., 2006)
       • Hand-verified training set of blogs and splogs
       • Precision/Recall of 87%
       • Bag-of-words based feature using text on blog home-page, O(x)
       • Some additional local features
       [Figure: top features for blogs and splogs, with the estimates P( x is a blog | O(x) ) and P( x is a splog | O(x) ). Feature words shown, in slide order: we, what, was, my, org, flickr, paper, 600, open, words, weblog, motion, me, thank, go, january, trackback, archives, now, political, find, info, news, your, 27, another, website, best, articles, on, perfect, products, uncategorized, 280, hot, resources, inc, 60, three, copyright]
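The bag-of-words feature O(x) mentioned above can be illustrated with a minimal sketch. This is not the authors' feature extractor: the tokenizer, the function names `bag_of_words` and `binary_features`, and the tiny vocabulary are all assumptions made for illustration, and the SVM itself is left out.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase alphanumeric tokens in home-page text.

    A simple stand-in for O(x); the tokenization rule is an assumption.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def binary_features(text, vocabulary):
    """Return a 0/1 vector marking which vocabulary words occur in the text."""
    bag = bag_of_words(text)
    return [1 if word in bag else 0 for word in vocabulary]


# Hypothetical vocabulary built from a few of the feature words on the slide.
VOCABULARY = ["weblog", "trackback", "products", "copyright"]

if __name__ == "__main__":
    page = "My weblog: archives, trackback and comments"
    print(binary_features(page, VOCABULARY))
```

A vector like this would then be passed to whatever classifier produces the splog probability; the choice of binary rather than count-valued features here is only one possible design.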
  15. 17. This Work
       By characterizing the splogosphere, we aim to achieve the following:
       (i) Get a handle on the seriousness of the problem,
       (ii) Develop new techniques for splog detection, and
       (iii) Recommend placement of splog filters on the blogging infrastructure.
       Characterization is based on comparing the nature of authentic blogs against splogs to identify discriminating features.
  16. 18. Outline
       • Introduction
       • Motivation
       • BlogPulse Dataset
       • Weblogs.com Dataset
       • Implications
  17. 19. BlogPulse Dataset
       • 21 days of July 2005
       • 1.3 million blogs
       • Eliminated LiveJournal
       • Re-fetched blog home-pages; many spam blogs no longer existed, since spam blogs are short-lived
       • Arrived at 500K samples
       • Set probability thresholds to 0.2 (authentic blog) and 0.8 (splog)
       • Identified 27K splogs
       • Sampled for 27K authentic blogs
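The two-threshold labeling and the matched sampling described above might look roughly like the following. This is a sketch under stated assumptions: the score is taken to be the classifier's splog probability, the comparisons are assumed inclusive, and `label_by_threshold` and `balanced_sample` are hypothetical names, not functions from the original work.

```python
import random


def label_by_threshold(p_splog, blog_max=0.2, splog_min=0.8):
    """Label a sample from its splog probability.

    Scores at or below blog_max count as authentic blogs, scores at or
    above splog_min as splogs; everything in between is left unlabeled.
    """
    if p_splog <= blog_max:
        return "blog"
    if p_splog >= splog_min:
        return "splog"
    return "uncertain"


def balanced_sample(blogs, n_splogs, seed=0):
    """Draw as many authentic blogs as there are splogs (or all, if fewer)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(blogs, min(n_splogs, len(blogs)))


if __name__ == "__main__":
    # Hypothetical scores keyed by made-up URLs.
    scores = {"http://example.com/a": 0.05, "http://example.com/b": 0.93}
    print({url: label_by_threshold(p) for url, p in scores.items()})
```

Discarding the middle band trades coverage for cleaner labels, which fits a characterization study better than a live filter.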
  18. 20. Splogs vs. Blogs – Word Count
       [Figure panels: blogs; splogs; blogs and splogs]
  19. 21. Splogs vs. Blogs – In-degree
       Top 5:
         http://www.engadget.com  1942
         http://www.huffingtonpost.com/theblog  905
         http://www.crooksandliars.com  637
         http://blogs.guardian.co.uk/news  616
         http://www.littlegreenfootballs.com/weblog  611
       Top 5:
         http://spaces.msn.com/members/pony-girl  505
         http://spaces.msn.com/members/black-puss  505
         http://spaces.msn.com/members/amputee-women  505
         http://spaces.msn.com/members/free-stories  505
         http://spaces.msn.com/members/first-time-girl  505
  20. 22. Splogs vs. Blogs – Out-degree
       Top 5:
         http://www.xanga.com/home.aspx?user=hit_me_layoutz  273
         http://www.xanga.com/home.aspx?user=i_jock_layouts  271
         http://www.xanga.com/home.aspx?user=slp_layouts_slp  198
         http://spaces.msn.com/members/cyrustse1986  193
         http://www.xanga.com/home.aspx?user=layouts_n_codes2005  180
       Top 5:
         http://worldseriesofpokerchipscardguard.blogspot.com  898
         http://rule-wsop.blogspot.com  898
         http://worldseries-ofpoler.blogspot.com  898
         http://qsopcom-1.blogspot.com  898
         http://weopcom.blogspot.com  898
  21. 23. Outline
       • Introduction
       • Motivation
       • BlogPulse Dataset
       • Weblogs.com Dataset
       • Implications
  22. 24. Weblogs.com Dataset
       • 20 Nov 2005 – 11 Dec 2005
       • 16 million update pings
       • Pings subdivided by language: da, de, en, es, fi, fr, it, nl, pt, sv
       • Heuristics to identify Japanese, Chinese, Korean
       • Set threshold of 0.5 to separate out authentic blogs from splogs
       Thanks to James Mayfield, JHU APL
  23. 25. Ping times – Italian Blogs
  24. 26. Sping vs. Ping times
  25. 27. Spings vs. Pings: frequency
       The distribution of blogs by ping frequency follows a power law, but the distribution of splogs by sping frequency does not.
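One rough way to check a claim like this is to fit a straight line to the ping-frequency histogram on log-log axes. The sketch below is an illustration only: the slides do not say how the power law was assessed, `loglog_fit` is a hypothetical name, and a least-squares fit on a log-log histogram is a quick diagnostic rather than a rigorous power-law test.

```python
import math
from collections import Counter


def loglog_fit(pings_per_blog):
    """Fit log(number of blogs) against log(ping count) by least squares.

    pings_per_blog holds one positive ping count per blog.
    Returns (slope, r_squared). A slope that is clearly negative with
    r_squared near 1 is consistent with a power law.
    """
    histogram = Counter(pings_per_blog)  # ping count -> number of blogs
    if len(histogram) < 2:
        raise ValueError("need at least two distinct ping counts")
    xs = [math.log(k) for k in histogram]
    ys = [math.log(n) for n in histogram.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return slope, r_squared


if __name__ == "__main__":
    # Synthetic data: the number of blogs with k pings is 4096 / k**2.
    sample = [1] * 4096 + [2] * 1024 + [4] * 256 + [8] * 64
    print(loglog_fit(sample))
```

Running the same fit separately on pings from authentic blogs and on spings would show whether one set falls on a line and the other does not.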
  26. 28. All Pings – 16 Million
       • Close to 40% spings
       • Among English blogs
           • 75% of pings are spings
           • Authentic blogs are 13% of all pings
       • Including Info domain
           • 50% of all pings are spings
       count  url
       1191   http://www.countrymusicdigest.com
       1207   http://www.tipstohealth.com/blog
       1211   http://www.microdermabrasion-secrets.com
       1215   http://www.criss-angel.biz
       1375   http://www.myaquariumiplace.com
       1452   http://www.freecancerfacts.com/wp
       1491   http://www.wiccapaganblog.com
  27. 29. Outline
       • Introduction
       • Motivation
       • BlogPulse Dataset
       • Weblogs.com Dataset
       • Implications
  28. 30. Implications (1)
       • BlogPulse dataset
           • Local word models most effective for fast splog detection
           • If splogs escape filters, in-link and out-link distributions point to link-based classification
       • Weblogs.com dataset
           • Ping frequency can be useful
           • Splogs probably not a big problem in most European languages. Yet.
       • The nature of the domain points to spam filters employing a multi-step, adaptive approach, which we are currently pursuing
  29. 31. Implications (2) – Filter Design
       [Diagram: a Spam Blog Filter with four numbered stages (1 to 4) that separates Authentic Blogs from Spam Blogs. Component labels: Heuristics, Language Identifiers, Spam Blog Detectors, Blog Identifier, IP Blacklists, Supporting Info (OPTIONAL)]
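A multi-step filter of the kind this diagram suggests could be sketched as a chain of stages, each of which either decides or passes the ping along. The stage order, the stage logic, the ping fields (`ip`, `p_splog`) and every name below are assumptions, since the surviving slide text does not say how the components are wired; only the 0.5 threshold comes from the Weblogs.com slide, and treating it as inclusive is also an assumption.

```python
# Hypothetical blacklist; 192.0.2.1 is a reserved documentation address.
IP_BLACKLIST = {"192.0.2.1"}


def ip_blacklist_stage(ping):
    """Flag pings from blacklisted addresses; otherwise defer (None)."""
    return "splog" if ping.get("ip") in IP_BLACKLIST else None


def detector_stage(ping, threshold=0.5):
    """Decide from a precomputed splog probability, if one is present."""
    p_splog = ping.get("p_splog")
    if p_splog is None:
        return None
    return "splog" if p_splog >= threshold else "authentic"


# Cheap checks run first so that costly detection is reached less often.
STAGES = [("ip_blacklist", ip_blacklist_stage), ("detector", detector_stage)]


def run_filter(ping, stages=STAGES):
    """Return (verdict, deciding stage); ("undecided", None) if none decide."""
    for name, stage in stages:
        verdict = stage(ping)
        if verdict is not None:
            return verdict, name
    return "undecided", None


if __name__ == "__main__":
    print(run_filter({"url": "http://example.com/blog", "ip": "192.0.2.1"}))
    print(run_filter({"url": "http://example.com/blog", "p_splog": 0.1}))
```

Keeping each stage as a plain function makes it easy to reorder stages or add a language identifier without touching the rest of the chain.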
  30. 32. Conclusions
       • Blog spam is a serious problem
           • Classic arms race, e.g., increased plagiarism, feedjacking
       • Blog spam identification requires different tactics than those used for email and Web spam
           • Local features effective, but not sufficient
           • Lots of relational features (e.g., links, ads, IP addresses, tight but disconnected communities), but dynamism reduces effectiveness of analysis
       • Getting good training sets is expensive, especially in a multilingual environment
           • A minute or more per judgment
       • Good opportunities for infrastructure insertion, e.g., sping-free ping servers
  31. 33. For more information
       • http://ebiquity.umbc.edu/
       Annotated in OWL
  32. 34. Questions?
  33. 35. Blogs – A Specialized Domain
       [Diagram with four numbered steps (1 to 4). Labels: Update Pings, Ping Stream, Update Stream, Fetch Content]
