Machine learning and link spam - how hard can it be?

My presentation from LinkLove 2013. All about link spam, machine learning and my fumbling around with both.

Published in: Business


  1. Enterprise LinkSpam Analysis Ian Lurie @portentint ian@portent.com
  2. MACHINE LEARNING: How hard can it be? Ian Lurie, Portent, Inc, @portentint
  3. http://portent.co/machine-spam yes, this is right, no ‘m’
  4. NERDOSITY TIME
  5. 1: There’s gotta be a better way. 2: How hard can it be? 3: Lessons learned
  6. ONE: THESE ARE GOOD LINKS!!!!
  7. try to relax...
  8. Some of these are real
  9. BUILD A REALLY BIG SPREADSHEET.
  10. GET ALL LINKS FROM: GOOGLE WEBMASTER TOOLS, OPEN SITE EXPLORER, MAJESTIC SEO
  11. URL, TITLE, ANCHOR TEXT, MAJESTIC TRUSTFLOW, MOZ DA AND PA
  12. EVALUATE URLS: FREELINKSHERE.COM, ARTICLEFUNHOSTING.COM, GETYERGREATARTICLESHERE.COM, NEWJERSEYPRESSRELEASES.COM: NO
  13. EVALUATE TITLES: BAD GRAMMAR, JUST STUPID, MAKES YOU ITCH: NO
  14. CONTACT, DISAVOW
  15. So cute. So harmless.
  16. A FILTRATION PROBLEM
  17. 500 LINKS: THE LOSERS ARE OBVIOUS
  18. 500,000 LINKS: YOU GET A MIGRAINE
  19. A KNOWLEDGE PROBLEM
  20. [Chart: Anchor Text Distribution. Anchor texts shown: Col Von Miseryburgen, Alp d Huez, Sufferlandria, Grunter von Agony, Mithril Shirt, The Devils Den, John Snow, Broad Street Pump, Galactica, Bagshot Row, Little Round Top, Pegasus, The Lonely Mountain. Axis runs from 0 to 100,000.]
  21. BUT THESE ARE GOOD LINKS!!!!
  22. WE CAN’T READ GOOGLE’S MIND
  23. MACHINE LEARNING
  24. TWO: HOW HARD CAN IT BE?
  25. MACHINE LEARNING IN 60 SECONDS
  26. START WITH A QUESTION
  27. START WITH A QUESTION: IS THIS PAGE SPAM?
  28. THE ANSWER IS A CLASSIFICATION
  29. TRAINING SET + ALGORITHM: CLASSIFICATION
  30. TRAINING SET + ALGORITHM: CLASSIFICATION
  31. TRAINING SET + ALGORITHM: PREDICTION
  32. TRAINING SET (TEXT-BASED?) + ALGORITHM (SUPERVISED? UNSUPERVISED?): CLASSIFICATION
  33. TRAINING SET + ALGORITHM: CLASSIFICATION (CORRECT? OR MORONIC?)
  34. CLASSIFICATION = QUESTION
  35. CLASSIFICATION = SPAM? TRUE OR FALSE
  36. TRAINING SET 1: JUST WORDS
  37. A home cooking blog featuring healthy low-glycemic recipes with step-by-step photos, as well as cooking tips, vegetable gardening, and products Kalyn loves. Mouse Trap (originally titled Mouse Trap Game) is a board game first published by Ideal in 1963 for two or more players. Over the course of the game, players at first. A delicious and refreshing cherry pie and a story about making friends of enemies :). Brady Bunch Punch drink recipe made with Amaretto,Cranberry juice,Orange juice,Triple Sec,Vodka,. How to make a Brady Bunch Punch with all the instructions and. A blog about a foreigners life in Japan, on a mountainside above Lake Biwa. SEE the worlds greatest collection of tattoo designs! Sample FREE Downloads! Cutting Edge Art by Famous Tattoo Artists! YOUR TATTOO DESIGN IS HERE!.
  38. TRAINING SET + BAYESIAN: FAIL
  39. +
  40. TRAINING SET 2: WORDS INTO NUMBERS
  41. who da nerd?!!! WHO DA NERD?!!!!! TRAINING SET + LOGISTIC REGRESSION: WIN
  42. logistic regression
  43. python, nltk, scikit-learn, mongodb
  44. Flesch-Kincaid (FK): FK grade level, FK reading ease, word count, sentence count, syllable count
  45. links/word, MajesticSEO page TrustFlow, domain TrustFlow, unique c-blocks
  46. THE TRAINING SET: python, nltk, scikit-learn, mongodb
  47. seogadget.co.uk
  48. Is seogadget.co.uk spam? true = 1.93%, false = 98.07%
  49. THREE: LESSONS LEARNED
  50. ABOUT GOOGLE
  51. IT’S ABOUT LINKS. NOT PAGES. HREF=“HTTP://GETYERLINKSHEREFREE.COM
  52. TRUST FLOW 71, DA 97, PA 47
  53. LESSON 1: THERE IS NO SPAM
  54. HOW LIKELY IS IT THAT THIS LINK, FROM THIS PAGE, IN THE CONTEXT OF ALL OTHER LINKS TO THIS SITE, MIGHT SEEM SPAM-LIKE?
  55. [Chart labels: USEFUL, NOPE, MANIPULATIVE, TRUSTWORTHY. Sites plotted: INTERFLORA, CNN.COM, GODX.NET, DAILY SQUEE.]
  56. LINKS FROM EDU SITES 45
  57. LINKS FROM EDU SITES 45
  58. 2: DECLINING SPAM TOLERANCE
  59. [Chart: Percent spam links. Axis runs from 0% to 90%.]
  60. THIS WAS SPAM IN APRIL
  61. THIS MAY BE SPAM NOW: http://www.cs.hiram.edu/~oliphantlt/cpsc171/links.html
  62. GOOGLE’S GETTING GRUMPIER
  63. CLEAN UP. NOW.
  64. ABOUT MACHINE LEARNING
  65. NEED A BIGGER TRAINING SET. TOO BIG.
  66. USE WORDS, TOO
  67. SPECIALIZE BY VERTICAL
  68. ian wtf!!!!
  69. MACHINE LEARNING IS GROWING
  70. MACHINE LEARNING IS BECOMING EASIER
  71. GOOGLE PREDICTION API, BIGML, EXCEL, DATASCOPE ….?
  72. How hard can it be? EASY TO UNDERSTAND
  73. How hard can it be? SO UNDERSTAND IT
  74. http://portent.co/machine-spam yes, this is right, no ‘m’
  75. THE END. Ian Lurie Portent, Inc @portentint
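
Slides 40 to 48 name the ingredients (words turned into numbers, Flesch-Kincaid counts, links per word, logistic regression, Python) but the deck shows no code. The block below is a minimal sketch of that idea, not the author's implementation. It uses only the Python standard library instead of nltk or scikit-learn, the syllable counter is a crude vowel-group heuristic, and every function name, toy row, and label is invented for illustration.

```python
import math
import re


def count_syllables(word):
    """Crude heuristic (an assumption): one syllable per run of vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def readability_features(text):
    """Word, sentence and syllable counts plus the two standard Flesch scores."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / max(1, len(sentences))
    syllables_per_word = syllables / max(1, len(words))
    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "syllable_count": syllables,
        "fk_grade_level": 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59,
        "fk_reading_ease": 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
    }


def sigmoid(z):
    """Logistic function, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def spam_probability(row, weights, bias):
    """Probability that a feature row is spam under the fitted model."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, row)))


def train_logistic(rows, labels, lr=0.5, epochs=1000):
    """Fit logistic regression with plain stochastic gradient descent."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for row, label in zip(rows, labels):
            error = spam_probability(row, weights, bias) - label
            weights = [w - lr * error * v for w, v in zip(weights, row)]
            bias -= lr * error
    return weights, bias


# Invented toy rows: [links per word, FK reading ease / 100]; label 1 means spam.
TOY_ROWS = [[0.30, 0.20], [0.25, 0.30], [0.40, 0.10],
            [0.02, 0.70], [0.05, 0.60], [0.01, 0.80]]
TOY_LABELS = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic(TOY_ROWS, TOY_LABELS)
print(readability_features("The cat sat. The dog ran."))
print("spam probability:", spam_probability([0.35, 0.15], weights, bias))
```

The model returns a probability rather than a yes or no, which is the same shape of answer as the true/false percentages on slide 48. A real run would need far more rows and the other features listed on slide 45.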
