
Machine learning and link spam - how hard can it be?

My presentation from LinkLove 2013. All about link spam, machine learning and my fumbling around with both.

Published in: Business

Transcript of "Machine learning and link spam - how hard can it be?"

  1. Enterprise Link Spam Analysis. Ian Lurie, @portentint, ian@portent.com
  2. MACHINE LEARNING: How hard can it be? Ian Lurie, Portent, Inc, @portentint
  3. http://portent.co/machine-spam (yes, this is right, no ‘m’)
  4. NERDOSITY TIME
  5. 1: There’s gotta be a better way. 2: How hard can it be? 3: Lessons learned
  6. ONE: THESE ARE GOOD LINKS!!!!
  7. try to relax...
  8. Some of these are real
  9. BUILD A REALLY BIG SPREADSHEET.
  10. GET ALL LINKS FROM: GOOGLE WEBMASTER TOOLS, OPEN SITE EXPLORER, MAJESTIC SEO
  11. URL, TITLE, ANCHOR TEXT, MAJESTIC TRUSTFLOW, MOZ DA AND PA
  12. EVALUATE URLS: FREELINKSHERE.COM, ARTICLEFUNHOSTING.COM, GETYERGREATARTICLESHERE.COM, NEWJERSEYPRESSRELEASES.COM. NO
  13. EVALUATE TITLES: BAD GRAMMAR, JUST STUPID, MAKES YOU ITCH. NO
  14. CONTACT, DISAVOW
  15. So cute. So harmless.
  16. A FILTRATION PROBLEM
  17. 500 LINKS: THE LOSERS ARE OBVIOUS
  18. 500,000 LINKS: YOU GET A MIGRAINE
  19. A KNOWLEDGE PROBLEM
  20. Chart: Anchor Text Distribution. Labels: Col Von Miseryburgen, Alp d Huez, Sufferlandria, Grunter von Agony, Mithril Shirt, The Devils Den, John Snow, Broad Street Pump, Galactica, Bagshot Row, Little Round Top, Pegasus, The Lonely Mountain. Axis: 0 to 100000.
  21. BUT THESE ARE GOOD LINKS!!!!
  22. WE CAN’T READ GOOGLE’S MIND
  23. MACHINE LEARNING
  24. TWO: HOW HARD CAN IT BE?
  25. MACHINE LEARNING IN 60 SECONDS
  26. START WITH A QUESTION
  27. START WITH A QUESTION: IS THIS PAGE SPAM?
  28. THE ANSWER IS A CLASSIFICATION
  29. TRAINING SET + ALGORITHM = CLASSIFICATION
  30. TRAINING SET + ALGORITHM = CLASSIFICATION
  31. TRAINING SET + ALGORITHM = PREDICTION
  32. TRAINING SET (TEXT-BASED?) + ALGORITHM (SUPERVISED? UNSUPERVISED?) = CLASSIFICATION
  33. TRAINING SET + ALGORITHM = CLASSIFICATION (CORRECT? OR MORONIC?)
  34. CLASSIFICATION = QUESTION
  35. CLASSIFICATION = SPAM? TRUE OR FALSE
  36. TRAINING SET 1: JUST WORDS
  37. A home cooking blog featuring healthy low-glycemic recipes with step-by-step photos, as well as cooking tips, vegetable gardening, and products Kalyn loves. Mouse Trap (originally titled Mouse Trap Game) is a board game first published by Ideal in 1963 for two or more players. Over the course of the game, players at first. A delicious and refreshing cherry pie and a story about making friends of enemies :). Brady Bunch Punch drink recipe made with Amaretto, Cranberry juice, Orange juice, Triple Sec, Vodka,. How to make a Brady Bunch Punch with all the instructions and. A blog about a foreigners life in Japan, on a mountainside above Lake Biwa. SEE the worlds greatest collection of tattoo designs! Sample FREE Downloads! Cutting Edge Art by Famous Tattoo Artists! YOUR TATTOO DESIGN IS HERE!.
  38. TRAINING SET + BAYESIAN = FAIL
  39. +
  40. TRAINING SET 2: WORDS INTO NUMBERS
  41. who da nerd?!!! WHO DA NERD?!!!!! TRAINING SET + LOGISTIC REGRESSION = WIN
  42. logistic regression
  43. python, nltk, scikit-learn, mongodb
  44. Flesch-Kincaid (FK): FK grade level, FK reading ease, word count, sentence count, syllable count
  45. links/word; MajesticSEO: Page TrustFlow, Domain TrustFlow; Unique c-blocks
  46. THE TRAINING SET: python, nltk, scikit-learn, mongodb
  47. seogadget.co.uk
  48. Is seogadget.co.uk spam? true = 1.93%, false = 98.07%
  49. THREE: LESSONS LEARNED
  50. ABOUT GOOGLE
  51. IT’S ABOUT LINKS. NOT PAGES. HREF=“HTTP://GETYERLINKSHEREFREE.COM
  52. TRUST FLOW 71, DA 97, PA 47
  53. LESSON 1: THERE IS NO SPAM
  54. HOW LIKELY IS IT THAT THIS LINK, FROM THIS PAGE, IN THE CONTEXT OF ALL OTHER LINKS TO THIS SITE, MIGHT SEEM SPAM-LIKE?
  55. USEFUL INTERFLORA CNN.COM MANIPULATIVE TRUSTWORTHY GODX.NET DAILY SQUEE NOPE
  56. LINKS FROM EDU SITES 45
  57. LINKS FROM EDU SITES 45
  58. 2: DECLINING SPAM TOLERANCE
  59. Chart: Percent spam links. Axis: 0% to 90%.
  60. THIS WAS SPAM IN APRIL
  61. THIS MAY BE SPAM NOW: http://www.cs.hiram.edu/~oliphantlt/cpsc171/links.html
  62. GOOGLE’S GETTING GRUMPIER
  63. CLEAN UP. NOW.
  64. ABOUT MACHINE LEARNING
  65. NEED A BIGGER TRAINING SET. TOO BIG.
  66. USE WORDS, TOO
  67. SPECIALIZE BY VERTICAL
  68. ian wtf!!!!
  69. MACHINE LEARNING IS GROWING
  70. MACHINE LEARNING IS BECOMING EASIER
  71. GOOGLE PREDICTION API, BIGML, EXCEL DATASCOPE, ….?
  72. How hard can it be? EASY TO UNDERSTAND
  73. How hard can it be? SO UNDERSTAND IT
  74. http://portent.co/machine-spam (yes, this is right, no ‘m’)
  75. THE END. Ian Lurie, Portent, Inc, @portentint
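
The "words into numbers" slides list Flesch-Kincaid grade level, FK reading ease, word count, sentence count and syllable count as training features. The deck does not show how they were computed, so what follows is only a rough sketch of that step in plain Python. It uses the standard published Flesch-Kincaid formulas, but the function names and the vowel-group syllable counter are assumptions made for illustration, not the author's code (the deck names nltk for this work).

```python
import re


def count_syllables(word):
    """Hypothetical helper: approximate syllables as runs of vowels, minimum one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def readability_features(text):
    """Turn a block of page text into the numeric features named on the slides."""
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")

    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)

    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "syllable_count": syllables,
        # Standard Flesch reading ease formula.
        "fk_reading_ease": 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
        # Standard Flesch-Kincaid grade level formula.
        "fk_grade_level": 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59,
    }


if __name__ == "__main__":
    sample = "SEE the worlds greatest collection of tattoo designs! Sample FREE Downloads!"
    for name, value in readability_features(sample).items():
        print(name, value)
```

Each page then becomes one row of numbers like these, alongside link-based features such as links/word and TrustFlow, which is the kind of input a logistic regression classifier expects.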