Machine learning and link spam - how hard can it be?

My presentation from LinkLove 2013. All about link spam, machine learning and my fumbling around with both.

Usage Rights

© All Rights Reserved

Machine learning and link spam - how hard can it be? Presentation Transcript

  • 1. Enterprise LinkSpam Analysis Ian Lurie @portentint ian@portent.com
  • 2. MACHINE LEARNING: How hard can it be? Ian Lurie Portent, Inc @portentint
  • 3. http://portent.co/machine-spam yes, this is right, no ‘m’
  • 4. NERDOSITY TIME
  • 5. 1: There’s gotta be a better way 2: How hard can it be? 3: Lessons learned
  • 6. ONE: THESE ARE GOOD LINKS!!!!
  • 7. try to relax...
  • 8. Some of these are real
  • 9. BUILD A REALLY BIG SPREADSHEET.
  • 10. GET ALL LINKS FROM GOOGLE WEBMASTER TOOLS, OPEN SITE EXPLORER, MAJESTIC SEO
  • 11. URL, TITLE, ANCHOR TEXT, MAJESTIC TRUSTFLOW, MOZ DA AND PA
  • 12. EVALUATE URLS: FREELINKSHERE.COM, ARTICLEFUNHOSTING.COM, GETYERGREATARTICLESHERE.COM, NEWJERSEYPRESSRELEASES.COM: NO
  • 13. EVALUATE TITLES: BAD GRAMMAR, JUST STUPID, MAKES YOU ITCH: NO
  • 14. CONTACT. DISAVOW.
  • 15. So cute. So harmless.
  • 16. A FILTRATION PROBLEM
  • 17. 500 LINKS: THE LOSERS ARE OBVIOUS
  • 18. 500,000 LINKS: YOU GET A MIGRAINE
  • 19. A KNOWLEDGE PROBLEM
  • 20. [Chart: Anchor Text Distribution. Categories: Col Von Miseryburgen, Alp d Huez, Sufferlandria, Grunter von Agony, Mithril Shirt, The Devils Den, John Snow, Broad Street Pump, Galactica, Bagshot Row, Little Round Top, Pegasus, The Lonely Mountain. Axis: 0 to 100,000.]
  • 21. BUT THESE ARE GOOD LINKS!!!!
  • 22. WE CAN’T READ GOOGLE’S MIND
  • 23. MACHINE LEARNING
  • 24. TWO: HOW HARD CAN IT BE?
  • 25. MACHINE LEARNING IN 60 SECONDS
  • 26. START WITH A QUESTION
  • 27. START WITH A QUESTION: IS THIS PAGE SPAM?
  • 28. THE ANSWER IS A CLASSIFICATION
  • 29. TRAINING SET + ALGORITHM = CLASSIFICATION
  • 30. TRAINING SET + ALGORITHM = CLASSIFICATION
  • 31. TRAINING SET + ALGORITHM = PREDICTION
  • 32. TRAINING SET (TEXT-BASED?) + ALGORITHM (SUPERVISED? UNSUPERVISED?) = CLASSIFICATION
  • 33. TRAINING SET + ALGORITHM = CLASSIFICATION: CORRECT? OR MORONIC?
  • 34. CLASSIFICATION = QUESTION
  • 35. CLASSIFICATION = SPAM? TRUE OR FALSE
  • 36. TRAINING SET 1: JUST WORDS
  • 37. A home cooking blog featuring healthy low-glycemic recipes with step-by-step photos, as well as cooking tips, vegetable gardening, and products Kalyn loves. Mouse Trap (originally titled Mouse Trap Game) is a board game first published by Ideal in 1963 for two or more players. Over the course of the game, players at first. A delicious and refreshing cherry pie and a story about making friends of enemies :). Brady Bunch Punch drink recipe made with Amaretto, Cranberry juice, Orange juice, Triple Sec, Vodka,. How to make a Brady Bunch Punch with all the instructions and. A blog about a foreigners life in Japan, on a mountainside above Lake Biwa. SEE the worlds greatest collection of tattoo designs! Sample FREE Downloads! Cutting Edge Art by Famous Tattoo Artists! YOUR TATTOO DESIGN IS HERE!.
  • 38. TRAINING SET + BAYESIAN = FAIL
  • 39. +
  • 40. TRAINING SET 2: WORDS INTO NUMBERS
  • 41. TRAINING SET + LOGISTIC REGRESSION = WIN (who da nerd?!!! WHO DA NERD?!!!!!)
  • 42. logistic regression
  • 43. python, nltk, scikit-learn, mongodb
  • 44. Flesch-Kincaid (FK): FK grade level, FK reading ease, word count, sentence count, syllable count
  • 45. links/word, MajesticSEO Page TrustFlow, Domain TrustFlow, unique c-blocks
  • 46. THE TRAINING SET: python, nltk, scikit-learn, mongodb
  • 47. seogadget.co.uk
  • 48. Is seogadget.co.uk spam? true = 1.93% false = 98.07%
  • 49. THREE: LESSONS LEARNED
  • 50. ABOUT GOOGLE
  • 51. IT’S ABOUT LINKS. NOT PAGES. HREF=“HTTP://GETYERLINKSHEREFREE.COM
  • 52. TRUST FLOW 71 DA 97 PA 47
  • 53. LESSON 1: THERE IS NO SPAM
  • 54. HOW LIKELY IS IT THAT THIS LINK, FROM THIS PAGE, IN THE CONTEXT OF ALL OTHER LINKS TO THIS SITE, MIGHT SEEM SPAM-LIKE?
  • 55. USEFUL INTERFLORA CNN.COM MANIPULATIVE TRUSTWORTHY GODX.NET DAILY SQUEE NOPE
  • 56. LINKS FROM EDU SITES 45
  • 57. LINKS FROM EDU SITES 45
  • 58. 2: DECLINING SPAM TOLERANCE
  • 59. [Chart: percent spam links. Axis: 0% to 90%.]
  • 60. THIS WAS SPAM IN APRIL
  • 61. THIS MAY BE SPAM NOW: http://www.cs.hiram.edu/~oliphantlt/cpsc171/links.html
  • 62. GOOGLE’S GETTING GRUMPIER
  • 63. CLEAN UP. NOW.
  • 64. ABOUT MACHINE LEARNING
  • 65. NEED A BIGGER TRAINING SET. TOO BIG.
  • 66. USE WORDS, TOO
  • 67. SPECIALIZE BY VERTICAL
  • 68. ian wtf!!!!
  • 69. MACHINE LEARNING IS GROWING
  • 70. MACHINE LEARNING IS BECOMING EASIER
  • 71. GOOGLE PREDICTION API, BIGML, EXCEL, DATASCOPE ….?
  • 72. How hard can it be? EASY TO UNDERSTAND
  • 73. How hard can it be? SO UNDERSTAND IT
  • 74. http://portent.co/machine-spam yes, this is right, no ‘m’
  • 75. THE END. Ian Lurie Portent, Inc @portentint
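
Slides 40 through 45 describe turning words into numbers: Flesch-Kincaid grade level and reading ease, word, sentence and syllable counts, and links per word. The deck does not show how those values were computed, so the following is only a minimal sketch of one way to derive them with the Python standard library. The function names, the link_count argument and the vowel-run syllable heuristic are assumptions made for illustration, not the presenter's code; the two score formulas are the standard published Flesch-Kincaid ones. The TrustFlow and c-block features come from MajesticSEO data and are left out here.

```python
import re


def count_syllables(word):
    """Rough syllable estimate (an assumed heuristic, not from the deck).

    Counts runs of vowels, drops a trailing silent 'e', and never
    returns less than one syllable per word.
    """
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    if lowered.endswith("e") and not lowered.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def page_features(text, link_count):
    """Turn page text into the numeric features named on slides 44 and 45.

    link_count is a hypothetical input: the number of links found on the
    page, counted elsewhere. Returns None when the text has no words.
    """
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return None

    # Sentences are approximated as runs of text ending in . ! or ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    n_sentences = max(len(sentences), 1)
    n_syllables = sum(count_syllables(w) for w in words)

    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words

    return {
        "word_count": n_words,
        "sentence_count": n_sentences,
        "syllable_count": n_syllables,
        # Standard Flesch reading ease formula.
        "fk_reading_ease": 206.835
        - 1.015 * words_per_sentence
        - 84.6 * syllables_per_word,
        # Standard Flesch-Kincaid grade level formula.
        "fk_grade_level": 0.39 * words_per_sentence
        + 11.8 * syllables_per_word
        - 15.59,
        "links_per_word": link_count / n_words,
    }
```

Each page then becomes one row of numbers, which is the shape of input a classifier such as logistic regression expects. A real syllable counter (for example a pronunciation dictionary) would likely be more accurate than the vowel-run heuristic used here.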