TechSEO Boost 2017: Fun with Machine Learning: How Machine Learning is Shaping Google and Technical SEO

Join JR as he shares a few projects that he has been involved with in the SEO space, as well as recent technologies of particular interest to SEOs. He’ll also cover recent research and share thoughts on how Google is using machine learning. SEO is a data-rich realm, and machine learning thrives on data. Some of the limitations involve access to data, cost of processing, and the learning curve for platforms (and platform specialization). From query semantics, anomaly detection, and ontologies to having machines write descriptions for your images, you’ll learn what is available and how to get started.

Published in: Marketing

  1. JR Oakes | @jroakes #TechSEOBoost Fun with Machines. How Machine Learning is Shaping Google and Technical SEO
  2. About Me • Studied Industrial Design at NCSU. • Worked as an architectural glass artist for 10 years. • Was Lead Developer and then Director of Strategy for a medium-sized agency with 100+ clients worldwide. • Work as Director, Technical SEO for Adapt.
  3. I have a problem with TF-IDF
  4. About TF-IDF: TF-IDF is very hand-wavy and sounds very fancy, but it is not the magic elixir to DOMINATING ON GOOGLE.
  5. About TF-IDF: TF-IDF is actually not even the best information retrieval (IR) ranking function. BM25, in its various iterations, also takes document length into account, among other factors.
  6. About TF-IDF: https://wikimedia-research.github.io/Discovery-Search-Test-BM25/ https://wikimedia-research.github.io/Discovery-Search-Test-InterleavedLTR/
  7. About TF-IDF: “The Search Platform Team has been working on improving search on Wikimedia projects with machine learning. Machine-learned ranking (MLR) enables us to rank relevance of pages using a model trained on implicit and explicit judgements. In the first test of the learning-to-rank (LTR) project, we evaluated the performance of a click-based model on users searching English Wikipedia. We found that users were slightly more likely to engage with MLR-provided results than with BM25 results (assessed via the clickthrough rate and a preference statistic). We also found that users with machine learning-ranked results were statistically significantly more likely to click on the first search result first than users with BM25-ranked results, which indicates that we are onto something. The next step for us is to evaluate the model’s performance on Wikipedia in other languages.”
  8. About TF-IDF
  9. About TF-IDF: Wikimedia Research released their first model on GitHub last month. MjoLniR: “our Python and Spark-based library for handling the backend data processing for Machine Learned Ranking at Wikimedia.” https://github.com/wikimedia/search-MjoLniR/tree/master/mjolnir
  10. About TF-IDF: We are WAY beyond TF-IDF. TF-IDF seems to work because it causes you to look for related phrases, but it is not a very good relevance metric. It is a keyword frequency metric.
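
To make the TF-IDF versus BM25 contrast concrete, here is a minimal sketch in plain Python. It is not Wikimedia's or Google's implementation. The corpus is made up, the TF-IDF variant (raw term count times IDF) is only one of several in use, and the `k1` and `b` values are commonly quoted defaults that are assumed here.

```python
import math
from collections import Counter

def idf(term, docs):
    # Smoothed inverse document frequency (the BM25-style variant, always positive).
    containing = sum(1 for doc in docs if term in doc)
    return math.log(1 + (len(docs) - containing + 0.5) / (containing + 0.5))

def tfidf_score(query, doc, docs):
    # Raw term count times IDF: one common TF-IDF variant, with no length awareness.
    counts = Counter(doc)
    return sum(counts[term] * idf(term, docs) for term in query)

def bm25_score(query, doc, docs, k1=1.5, b=0.75):
    # k1 caps the benefit of repeating a term; b controls document length normalization.
    counts = Counter(doc)
    avg_len = sum(len(d) for d in docs) / len(docs)
    length_norm = k1 * (1 - b + b * len(doc) / avg_len)
    return sum(
        idf(term, docs) * counts[term] * (k1 + 1) / (counts[term] + length_norm)
        for term in query
    )

# Made-up corpus: a short on-topic page, a long keyword-stuffed page, an unrelated page.
short_doc = "green tea benefits".split()
stuffed_doc = ("tea " * 6 + "buy cheap now " * 20).split()
other_doc = "coffee brewing guide".split()
docs = [short_doc, stuffed_doc, other_doc]
query = ["tea"]

for name, doc in [("short", short_doc), ("stuffed", stuffed_doc)]:
    print(name, round(tfidf_score(query, doc, docs), 3), round(bm25_score(query, doc, docs), 3))
```

On this toy corpus the raw TF-IDF score favors the keyword-stuffed page because it repeats "tea" six times, while BM25 ranks the short page higher once repetition saturates and document length is taken into account.
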
  11. How is Google Using Machine Learning?
  12. Was Larry Kim right?
  13. CTR As A Ranking Factor
  14. CTR As A Ranking Factor
  15. CTR As A Ranking Factor: Potentially: • Clicks - “For our click model we use a generalization of the Position-Based Model (PBM) [9], at the core of which lies an examination hypothesis, stating that in order to be clicked a document has to be examined and attractive.” • Attention - What if users get the information that they need directly from the SERP (answer boxes), without a click? How do we know they were satisfied? • Satisfaction - “While looking at the reasons specified by the raters we found out that 42% of the raters who said that they would click through on a SERP, indicated that their goal was ‘to confirm information already present in the summary.’” So additional clicks don’t necessarily mean a poor initial result.
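
The examination hypothesis above can be sketched in a few lines. This is the basic Position-Based Model, not the generalization the quoted paper uses, and the examination probabilities per rank are hypothetical values chosen for illustration, not measured data.

```python
# Hypothetical probability that a user examines a result at each rank (illustrative only).
EXAMINATION = {1: 1.0, 2: 0.6, 3: 0.4, 4: 0.3, 5: 0.2}

def click_probability(rank, attractiveness, examination=EXAMINATION):
    # PBM: a click requires the result to be examined AND found attractive.
    return examination[rank] * attractiveness

def estimate_attractiveness(clicks, impressions, rank, examination=EXAMINATION):
    # Invert the model: divide observed CTR by the chance the result was examined at all.
    ctr = clicks / impressions
    return min(1.0, ctr / examination[rank])

# A result at rank 2 with a 30% CTR is as attractive as a rank 1 result with a 50% CTR.
print(estimate_attractiveness(clicks=30, impressions=100, rank=2))
print(estimate_attractiveness(clicks=50, impressions=100, rank=1))
```

The point of the model is that raw CTR mixes position bias with result quality, so any system using clicks as a signal has to separate the two first.
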
  16. CTR As A Ranking Factor
  17. CTR As A Ranking Factor: Machine learning in its simplest form takes: 1. Input features. 2. An algorithm that processes the features (most often) in a linear, non-linear, or tree-based way to make a prediction. 3. An evaluation metric that compares the prediction to your “ground truth” data. It is technically possible that CTR and/or Quality Rater data provide the ground truth.
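
The three pieces above (features, algorithm, evaluation metric) fit in a short plain-Python sketch. Everything here is assumed for illustration: the two feature names, the toy rows, and the "satisfied" labels standing in for ground truth. It says nothing about what Google actually uses.

```python
import math

# 1. Input features: hypothetical [title_match, content_depth] scores per page.
rows = [[0.9, 0.8], [0.8, 0.6], [0.7, 0.9], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]]
# Ground truth: 1 if the searcher was satisfied, 0 if not (made-up labels).
labels = [1, 1, 1, 0, 0, 0]

def predict(weights, bias, features):
    # 2. The algorithm: a linear model squashed into a probability (logistic regression).
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 / (1 + math.exp(-z))

def train(rows, labels, learning_rate=0.5, epochs=500):
    # Stochastic gradient descent on log loss.
    weights, bias = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        for features, label in zip(rows, labels):
            error = predict(weights, bias, features) - label
            weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias

def accuracy(weights, bias, rows, labels):
    # 3. The evaluation metric: share of predictions that match the ground truth.
    hits = sum((predict(weights, bias, x) >= 0.5) == bool(y) for x, y in zip(rows, labels))
    return hits / len(rows)

weights, bias = train(rows, labels)
print(accuracy(weights, bias, rows, labels))
```

For brevity the model is scored on its own training rows; in practice the metric would be computed on held-out data.
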
  18. CTR As A Ranking Factor: The problem is: we don’t have the ground truth, we don’t know the features, and we sure as hell have no idea what is in here:
  19. CTR As A Ranking Factor: We know that it probably depends on: • Click-through rate • Context models • Ground-truth quality (Quality Rater’s Guidelines) • And other standard factors.
  20. Storytelling
  21. Storytelling
  22. Storytelling: Using Generative Adversarial Networks to teach machines to see the storylines in news events. https://www.ijcai.org/proceedings/2017/0554.pdf
  23. LSTMs
  24. LSTMs: We would also guess that LSTMs (with attention) play some role in RankBrain, based on their state-of-the-art ability to pick up referential information in texts well beyond traditional bag-of-words (BOW) models. http://colah.github.io/posts/2015-08-Understanding-LSTMs/
  25. What should we focus on?
  26. Query Disambiguation
  27. Query Disambiguation: There is very little information in the query and a lot of information in the possible results.
  28. Query Disambiguation: Google tries to give us a nudge.
  29. Query Disambiguation: What a strong hint to consider when thinking about what needs to be included on a page discussing Lipton Tea. It is also a very strong hint at potential navigation.
  30. Query Disambiguation: AT&T does an amazing job at this.
  31. Semantic Relevance
  32. Semantic Relevance: Bill Slawski (as always) is spot on.
  33. Semantic Relevance: Going back to the Google patent from 2014 (Integrated external related phrase information into a phrase-based indexing information retrieval system), we see that there is a marked gain in the significance of phrases in a page based on additional semantically related qualifying phrases.
  34. Semantic Relevance: There are many ways to handle this on a page level.
  35. Semantic Relevance: But this really starts much sooner, with discovering the content and intent categories that your site is relevant for, before you even begin building out relevant content categories for your visitors. https://anaconda.org/jroakes/cluster-share/notebook
  36. Semantic Relevance: The prior notebook ingests your keywords, maps them into a vector space, and then runs k-means to group the keywords into relevance clusters.
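
As a rough illustration of the keywords-to-vectors-to-k-means flow, here is a plain-Python sketch. It is not the notebook's code: the keywords are made up, simple bag-of-words vectors stand in for whatever embedding the notebook uses, and the farthest-point initialization is an assumption chosen to keep the result deterministic.

```python
def vectorize(keywords):
    # Bag-of-words stand-in for a real embedding: one dimension per distinct word.
    vocab = sorted({word for keyword in keywords for word in keyword.split()})
    return [[1.0 if word in keyword.split() else 0.0 for word in vocab] for keyword in keywords]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=10):
    # Farthest-point initialization: start with the first vector, then add the
    # vector that is farthest from every centroid chosen so far.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids)))
    labels = []
    for _ in range(iterations):
        # Assignment step: each keyword joins its nearest centroid.
        labels = [min(range(k), key=lambda i: sq_dist(v, centroids[i])) for v in vectors]
        # Update step: each centroid moves to the mean of its members.
        for i in range(k):
            members = [v for v, label in zip(vectors, labels) if label == i]
            if members:
                centroids[i] = [sum(column) / len(members) for column in zip(*members)]
    return labels

keywords = [
    "lipton green tea", "green tea benefits", "lipton tea bags",
    "iphone case", "iphone charger", "cheap iphone case",
]
labels = kmeans(vectorize(keywords), k=2)
for keyword, label in zip(keywords, labels):
    print(label, keyword)
```

With a real keyword export you would swap in dense word vectors and pick `k` by inspecting the clusters, since bag-of-words vectors only group keywords that share exact words.
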
  37. Semantic Relevance
  38. Semantic Relevance: Note that this goes well beyond term frequency.
  39. Semantic Relevance: Skip-gram models capture the probability of co-occurrence across large corpora, which is much closer to what Google does than simple TF-IDF.
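
To show what "probability of co-occurrence" means in practice, the sketch below generates skip-gram (target, context) pairs from a tiny made-up corpus and estimates context probabilities by counting. A real skip-gram model such as word2vec trains embeddings to predict these pairs; the counting here is only a simplified stand-in, and the window size is an assumed value.

```python
from collections import Counter

def skipgram_pairs(tokens, window=2):
    # Pair every word with each neighbor that sits within `window` positions of it.
    pairs = []
    for i, target in enumerate(tokens):
        start, stop = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(start, stop) if j != i)
    return pairs

def context_probability(counts, target, context):
    # Share of the target's observed contexts that are `context`.
    total = sum(n for (t, _), n in counts.items() if t == target)
    return counts[(target, context)] / total if total else 0.0

# Made-up corpus for illustration.
corpus = ["lipton iced tea recipe", "green tea health benefits", "iced tea recipe ideas"]
counts = Counter(pair for line in corpus for pair in skipgram_pairs(line.split()))

print(context_probability(counts, "tea", "recipe"))
print(context_probability(counts, "tea", "lipton"))
```

Unlike a term-frequency count on a single page, these probabilities describe which words tend to appear near each other across the whole corpus.
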
  40. We should also care about click satisfaction.
  41. Click Satisfaction
  42. Click Satisfaction: Work hard to ensure that your pages get the clicks. H/T to @fighto for the excellent article here: https://searchengineland.com/alert-abnormal-organic-ctr-detected-automatic-detection-poorly-performing-meta-data-280290 https://anaconda.org/jroakes/ctr_anamolies_share/notebook
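
One simple way to surface pages with abnormally low CTR is to compare each page against the median CTR of pages at the same rounded position. This sketch is an assumption about how such a check could work, not the method used in the linked article or notebook. The rows, field names, and the 50% threshold are all made up.

```python
from collections import defaultdict
from statistics import median

def flag_low_ctr(rows, threshold=0.5):
    # Group CTRs by rounded average position so pages are compared with their peers.
    by_position = defaultdict(list)
    for row in rows:
        by_position[round(row["position"])].append(row["clicks"] / row["impressions"])
    flagged = []
    for row in rows:
        expected = median(by_position[round(row["position"])])
        ctr = row["clicks"] / row["impressions"]
        # Flag pages earning less than `threshold` times the typical CTR for their position.
        if ctr < threshold * expected:
            flagged.append(row["page"])
    return flagged

# Hypothetical rows shaped like a search performance export.
rows = [
    {"page": "/a", "position": 1.1, "clicks": 300, "impressions": 1000},
    {"page": "/b", "position": 0.9, "clicks": 280, "impressions": 1000},
    {"page": "/c", "position": 1.3, "clicks": 50, "impressions": 1000},
    {"page": "/d", "position": 3.2, "clicks": 100, "impressions": 1000},
    {"page": "/e", "position": 2.8, "clicks": 90, "impressions": 1000},
    {"page": "/f", "position": 3.0, "clicks": 110, "impressions": 1000},
]
print(flag_low_ctr(rows))
```

Flagged pages are candidates for a title and meta description review, since they rank well but attract far fewer clicks than their neighbors.
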
  43. We should also care about content deduplication.
  44. Content Deduplication: https://anaconda.org/jroakes/duplicate_detection_with_shingling_share/notebook
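
Shingling compares pages by their overlapping word sequences. The sketch below is a minimal version under assumed settings (three-word shingles, Jaccard similarity, made-up page text). It is not the notebook's code, and large sites would typically add MinHash or a similar trick to avoid comparing every pair of pages.

```python
def shingles(text, size=3):
    # Break the text into overlapping runs of `size` consecutive words.
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(max(1, len(words) - size + 1))}

def jaccard(a, b):
    # Shared shingles divided by all distinct shingles across both pages.
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Made-up page copy for illustration.
page_a = shingles("our lipton green tea is sourced from the finest tea gardens")
page_b = shingles("our lipton green tea is sourced from the best tea gardens")
page_c = shingles("read our guide to fixing a broken iphone charger cable")

print(jaccard(page_a, page_b))
print(jaccard(page_a, page_c))
```

A single changed word knocks out every shingle that contains it, so near-duplicates score well below 1.0 but still far above unrelated pages.
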
  45. Wrapping Up: It is very difficult to gain intuition into how Google works based solely on external data. The reality is that context, machine learning, and click data allow for the building of models that humans cannot easily understand. We wanted to move the conversation away from simplistic keyword mechanisms and towards an understanding that semantics and context are much more valuable to ranking.
  46. JR Oakes | @jroakes #TechSEOBoost