Peters matthew periodictableseo

Published in: Technology, Design


  1. Modern On Page Factors. SMX Advanced. Matthew Peters, PhD, matt@moz.com, @mattthemathman
  2. “philadelphia phillies”
  3. “philadelphia phillies”
  4. “Relevance” vs “Ranking”: Conceptually, “relevance” determination and “ranking” can be thought of as two different steps (even if they are implemented as one in a search engine).
  5. “Relevance” vs “Ranking”: Conceptually, “relevance” determination and “ranking” can be thought of as two different steps (even if they are implemented as one in a search engine). Relevance.
  6. “Relevance” vs “Ranking”: Conceptually, “relevance” determination and “ranking” can be thought of as two different steps (even if they are implemented as one in a search engine). Relevance (1), Ranking (2).
  7. Is this page relevant to “philadelphia phillies”?
  8. Is this page relevant to “philadelphia phillies”? query-body similarity: 0.74
  9. Is this page relevant to “philadelphia phillies”? query-body similarity: 0.74; query-title similarity: 0.8; query-H1 similarity: 1.0; etc. …
  10. Measuring query-document similarity. Goal: given query + document string, compute “similarity”.
  11. Measuring query-document similarity. See “Introduction to Information Retrieval” by Manning et al.: http://nlp.stanford.edu/IR-book/. > 700 papers. Goal: given query + document string, compute “similarity”.
  12. Measuring query-document similarity. Query: “philadelphia phillies”. In this context “document” can also refer to title tag, meta description, H1, etc. Similarity: 0.74.
  13. Measuring query-document similarity. Query: “philadelphia phillies”. Query Model: tokenization, normalization (stemming), query expansion, intent. In this context “document” can also refer to title tag, meta description, H1, etc. Similarity: 0.74.
  14. Measuring query-document similarity. Query: “philadelphia phillies”. Query Model: tokenization, normalization (stemming), query expansion, intent. Document Model: tokenization, normalization (stemming), vector space representation, language model. In this context “document” can also refer to title tag, meta description, H1, etc. Similarity: 0.74.
  15. Measuring query-document similarity. Query: “philadelphia phillies”. Query Model: tokenization, normalization (stemming), query expansion, intent. Document Model: tokenization, normalization (stemming), vector space representation, language model. In this context “document” can also refer to title tag, meta description, H1, etc. Scoring function. Similarity: 0.74.
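
The query model, document model, and scoring function on slides 10 to 15 can be illustrated with a minimal sketch. It assumes raw term counts as the vector space representation and cosine similarity as the scoring function, which is only one of many possible choices and not necessarily what any search engine uses. The page fields below are invented for illustration, so the scores will not reproduce the 0.74, 0.8, and 1.0 values shown on the slides.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on runs of non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


def cosine_similarity(query, document):
    """Cosine of the angle between raw term-count vectors (0.0 to 1.0)."""
    q, d = Counter(tokenize(query)), Counter(tokenize(document))
    dot = sum(q[term] * d[term] for term in q)
    q_norm = math.sqrt(sum(v * v for v in q.values()))
    d_norm = math.sqrt(sum(v * v for v in d.values()))
    norm = q_norm * d_norm
    return dot / norm if norm else 0.0


# Hypothetical page fields, invented for illustration only.
page = {
    "title": "Philadelphia Phillies official site",
    "h1": "Philadelphia Phillies",
    "body": "Schedule, scores and news for the Philadelphia Phillies baseball team.",
}

# One similarity score per field, mirroring query-title, query-H1, query-body.
scores = {
    field: cosine_similarity("philadelphia phillies", text)
    for field, text in page.items()
}
```

Because the same function is applied to each field, the "document" can be the title tag, the H1, or the body, as the slides note. In this toy data the H1 matches the query exactly, so it scores highest.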
  16. Query representation: Language identification; Word segmentation (Japanese, Chinese); Tokenization + normalization, {reviews, reviewer, reviewing} -> review; Spelling correction.
  17. Query representation: Language identification; Word segmentation (Japanese, Chinese); Tokenization + normalization, {reviews, reviewer, reviewing} -> review; Query expansion; User intent (transactional, navigational, informational); Local; Classification (images, video, news); Spelling correction.
  18. Query representation: Language identification; Word segmentation (Japanese, Chinese); Tokenization + normalization, {reviews, reviewer, reviewing} -> review; Query expansion; User intent (transactional, navigational, informational); Local; Classification (images, video, news); Topic Model (LDA); Entity extraction; Spelling correction.
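
The tokenization + normalization step, {reviews, reviewer, reviewing} -> review, can be sketched with a toy suffix stripper. This is not the Porter stemmer mentioned later in the deck. The suffix list and the minimum stem length are hypothetical choices made only to keep the example short, and they will mis-stem many real words.

```python
import re

# Suffixes checked longest-first; a deliberately tiny, hypothetical rule set.
SUFFIXES = ("ing", "ers", "er", "s")


def normalize(token):
    """Strip one common English suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def query_terms(query):
    """Lowercase, split on non-alphanumerics, then normalize each token."""
    tokens = [tok for tok in re.split(r"[^a-z0-9]+", query.lower()) if tok]
    return [normalize(tok) for tok in tokens]


# The three variants from the slide collapse to a single term.
variants = {normalize(word) for word in ["reviews", "reviewer", "reviewing"]}
```

With this normalization, `query_terms("movie reviews")` and `query_terms("movie review")` produce the same term list. This is why the two phrasings do not need separate optimization.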
  19. Document representation: TF-IDF.
  20. Document representation: TF-IDF; Language Model, P(optimization | search, engine) >> P(walking | search, engine).
  21. Document representation: Probability Ranking Principle, P(R = 1 | d, q) or P(R = 0 | d, q); TF-IDF; Language Model, P(optimization | search, engine) >> P(walking | search, engine).
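
As a rough illustration of the TF-IDF document representation, the sketch below weights each term by its relative frequency in the document times the log of its inverse document frequency. This is one common textbook variant, chosen here as an assumption; the slides do not say which TF-IDF formulation was used. The three-document corpus is invented.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical corpus, already tokenized.
corpus = [
    "search engine optimization tips".split(),
    "search engine ranking factors".split(),
    "walking tips for beginners".split(),
]
weights = tf_idf_vectors(corpus)
```

In the first document, "optimization" receives a larger weight than "search" because it appears in fewer documents of the corpus. A term that occurs in every document would receive a weight of zero under this variant.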
  22. Which method performs best? What are the characteristics of sites that rank highly? Data: 14,000+ keywords; top 50 results; 600,000 URLs; Google-US, no personalization; March 2013. Metric: Mean Spearman Correlation. Remember: “correlation is not causation”.
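
A mean Spearman correlation of the kind named on slide 22 could be computed along the following lines. This is a sketch, not the study's code. It assumes the correlation is computed per keyword between SERP position and a feature value and then averaged. It also assumes positions are negated so that a positive correlation means "higher feature value, better ranking". The input layout is hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_spearman(serps):
    """serps: {keyword: [(position, feature_value), ...]} (assumed layout)."""
    rhos = []
    for results in serps.values():
        # Negate positions so that position 1 counts as the "highest" rank.
        positions = [-position for position, _ in results]
        features = [feature for _, feature in results]
        rhos.append(spearman(positions, features))
    return sum(rhos) / len(rhos)
```

Averaging per-keyword correlations, rather than pooling all URLs into one calculation, keeps keywords with different score scales from dominating the result.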
  23. Which method performs best? We tried a few different types of smoothing for the language model; Dirichlet worked best (Zhai and Lafferty, SIGIR 2001).
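
For readers unfamiliar with Dirichlet smoothing, here is a minimal sketch of a query-likelihood score with a Dirichlet prior. Each document term probability is estimated as (count in document + mu * collection probability) / (document length + mu). The value of `mu`, the handling of terms unseen in the whole collection, and the toy collection are all assumptions for illustration, not details from the study.

```python
import math
from collections import Counter


def dirichlet_log_likelihood(query, doc, collection, mu=2000.0):
    """log P(query | doc) under a Dirichlet-smoothed unigram language model.

    query, doc: lists of tokens; collection: list of tokenized documents.
    mu is the smoothing parameter (placeholder value, should be tuned).
    """
    doc_counts = Counter(doc)
    coll_counts = Counter(tok for d in collection for tok in d)
    coll_len = sum(coll_counts.values())
    score = 0.0
    for term in query:
        p_coll = coll_counts[term] / coll_len
        if p_coll == 0.0:
            # Term never seen in the collection: skipped here (an assumption).
            continue
        p_doc = (doc_counts[term] + mu * p_coll) / (len(doc) + mu)
        score += math.log(p_doc)
    return score


# Hypothetical two-document collection, already tokenized.
collection = [
    ["philadelphia", "phillies", "baseball", "news"],
    ["movie", "reviews", "and", "ratings"],
]
query = ["philadelphia", "phillies"]
doc_scores = [dirichlet_log_likelihood(query, d, collection, mu=10.0) for d in collection]
```

Smoothing gives the second document a small non-zero probability for query terms it does not contain, so its log-likelihood is finite but lower than that of the first document.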
  24. Impact of stemming: Porter stemmer provided a slight increase in correlations.
  25. These correlations are still relatively low compared to other factors.
  26. “movie reviews”: 50 results, 450 random pages.
  27. “movie reviews”: 50 results, 450 random pages. For each query: 500 pages, 10% relevant, 90% irrelevant.
  28. “movie reviews”: 50 results, 450 random pages. For each query: 500 pages, 10% relevant, 90% irrelevant.
      PA table (URL ID, PA, In SERP?): (86, 92, 1), (355, 90, 0), …, (27, 18, 0).
      Language Model table (URL ID, Language Model, In SERP?): (213, 0.97, 1), (156, 0.95, 1), …, (355, 0.06, 0).
  29. “movie reviews”: 50 results, 450 random pages. For each query: 500 pages, 10% relevant, 90% irrelevant.
      PA table (URL ID, PA, In SERP?): (86, 92, 1), (355, 90, 0), …, (27, 18, 0).
      Language Model table (URL ID, Language Model, In SERP?): (213, 0.97, 1), (156, 0.95, 1), …, (355, 0.06, 0).
      P@50 is the “Precision of the top 50 results”. It is the percentage of top 50 results by PA/Language Model that are actually in the SERP. Top 50 ranked.
  30. “movie reviews”: 50 results, 450 random pages. For each query: 500 pages, 10% relevant, 90% irrelevant.
      PA table (URL ID, PA, In SERP?): (86, 92, 1), (355, 90, 0), …, (27, 18, 0).
      Language Model table (URL ID, Language Model, In SERP?): (213, 0.97, 1), (156, 0.95, 1), …, (355, 0.06, 0).
      P@50 is the “Precision of the top 50 results”. It is the percentage of top 50 results by PA/Language Model that are actually in the SERP. Top 50 ranked.
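
The P@50 definition above translates directly into a few lines of code. The sketch below sorts pages by score, keeps the top k, and reports the fraction that are in the SERP. The function name and input layout are assumptions. The sample rows are only the three rows of each table that are visible on the slides, so the example uses k = 2 instead of 50.

```python
def precision_at_k(scored_pages, k=50):
    """Fraction of the k highest-scoring pages that are in the SERP.

    scored_pages: iterable of (score, in_serp) pairs, where in_serp is 0 or 1.
    """
    top = sorted(scored_pages, key=lambda page: page[0], reverse=True)[:k]
    return sum(in_serp for _, in_serp in top) / len(top) if top else 0.0


# (score, in_serp) pairs taken from the rows visible on the slides.
pa_rows = [(92, 1), (90, 0), (18, 0)]
lm_rows = [(0.97, 1), (0.95, 1), (0.06, 0)]

pa_precision = precision_at_k(pa_rows, k=2)  # 1 of the top 2 by PA is in the SERP
lm_precision = precision_at_k(lm_rows, k=2)  # 2 of the top 2 by language model
```

With the full 500 pages per query and k = 50, a score that separates relevant from irrelevant pages perfectly would reach a P@50 of 1.0. Random ordering would average about 0.1, since 10% of the pages are in the SERP.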
  31. Takeaways. Implication: Query-document similarity is based on decades of research. It’s immune to algorithm change.
  32. Takeaways. Implication: Query-document similarity is based on decades of research. It’s immune to algorithm change. Action item: With sophisticated query and document models, no need to optimize separately for similar words, e.g. “movie reviews” vs “movie review”.
  33. Takeaways. Implication: Query-document similarity is based on decades of research. It’s immune to algorithm change. Action item: With sophisticated query and document models, no need to optimize separately for similar words, e.g. “movie reviews” vs “movie review”. Action item: Each page is relevant to many different keywords, so optimize each page for a broad set of related keywords, instead of a single keyword.
  34. Takeaways. Implication: Query-document similarity is based on decades of research. It’s immune to algorithm change. Action item: With sophisticated query and document models, no need to optimize separately for similar words, e.g. “movie reviews” vs “movie review”. Action item: Each page is relevant to many different keywords, so optimize each page for a broad set of related keywords, instead of a single keyword. Use case: Content creation. What keywords will this new blog post target? Is it relevant to a set of queries?
  35. Thanks for watching! Matthew Peters, matt@moz.com, @mattthemathman
