
NLP for SEO

Paul Shapiro's slides from TechSEO Boost 2019

Published in: Marketing

  1. 1. Paul Shapiro | @fighto | #TechSEOBoost | @CatalystSEM NLP for SEO Paul Shapiro, Catalyst
  2. 2. Paul Shapiro | @fighto | #TechSEOBoost Paul Shapiro, Catalyst Breaking Down NLP for SEO
  3. 3. Paul Shapiro | @fighto | #TechSEOBoost Paul Shapiro Senior Partner, Head of SEO @ Catalyst, a GroupM Agency
  4. 4. Paul Shapiro | @fighto | #TechSEOBoost Assumptions & Prerequisites • Familiarity with Python • Familiarity with common data science libraries such as pandas and NumPy • Familiarity with Jupyter Notebooks (optional) • But no prior knowledge of NLP
  5. 5. Paul Shapiro | @fighto | #TechSEOBoost Libraries Used in Examples
  6. 6. Paul Shapiro | @fighto | #TechSEOBoost KNIME as an Alternative https://www.knime.com
  7. 7. Paul Shapiro | @fighto | #TechSEOBoost What is Natural Language Processing (NLP)?
  8. 8. Paul Shapiro | @fighto | #TechSEOBoost What is NLP? “NLP is a way for computers to analyze, understand, and derive meaning from human language in a smart and useful way. By utilizing NLP, developers can organize and structure knowledge to perform tasks such as automatic summarization, translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.” https://blog.algorithmia.com/introduction-natural-language-processing-nlp/
  9. 9. Paul Shapiro | @fighto | #TechSEOBoost NLP • Old: Linguistic Heuristics • New: Statistics, Machine Learning
  10. 10. Paul Shapiro | @fighto | #TechSEOBoost Input: Parse Semi/Unstructured Text Data https://github.com/niderhoff/nlp-datasets
  11. 11. Paul Shapiro | @fighto | #TechSEOBoost Example Data Sources • (Digital) Books • CSVs, Excel, JSON, XML, etc. • Word Docs/PDFs • Web Pages (most relevant to SEO)
  12. 12. Paul Shapiro | @fighto | #TechSEOBoost
  13. 13. Paul Shapiro | @fighto | #TechSEOBoost Text Pre-Processing Tokenization • Text must be broken into units aka tokens • (Usually individual words)
  14. 14. Paul Shapiro | @fighto | #TechSEOBoost Text Pre-Processing We need to parse, clean, and prepare text data both for analysis and for conversion into machine-interpretable formats.
  15. 15. Paul Shapiro | @fighto | #TechSEOBoost Tokenize Words
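
The code on this slide did not survive in the transcript. As a minimal sketch of word tokenization, assuming a simple regular expression is an acceptable stand-in for a library tokenizer (the `tokenize` helper and the sample sentence are illustrative, not from the slides):

```python
import re

def tokenize(text):
    """Split text into word tokens, keeping contractions such as "didn't" whole."""
    # One or more letters/digits, optionally followed by an apostrophe suffix.
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)

tokens = tokenize("Why did the chicken cross the road? It didn't say.")
print(tokens)
```

A library tokenizer handles many more edge cases (hyphens, URLs, emoji); the regex only illustrates the idea of breaking text into units.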
  16. 16. Paul Shapiro | @fighto | #TechSEOBoost Text Pre-Processing Noise and Junk Removal/Cleanup • Punctuation and Special Characters • Stop Words • Common Abbreviations • Common Character Cases • Etc.
  17. 17. Paul Shapiro | @fighto | #TechSEOBoost Lowercase + Remove Punctuation
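
A possible version of this step using only the Python standard library; the `clean` helper name and the sample string are assumptions made for illustration:

```python
import string

def clean(text):
    """Lowercase the text and delete every ASCII punctuation character."""
    # str.maketrans with a third argument builds a table that deletes those characters.
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table)

cleaned = clean("Hello, World! It's NLP-time.")
print(cleaned)
```

Note that deleting punctuation also merges hyphenated words and drops apostrophes, which may or may not be desirable for a given corpus.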
  18. 18. Paul Shapiro | @fighto | #TechSEOBoost Tokenize & Remove Stop Words
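
A rough sketch of stop word removal. The `STOP_WORDS` set below is a tiny hand-picked list for illustration only; a real project would normally use a fuller list from an NLP library:

```python
import re

# Tiny illustrative stop word list (an assumption, not the list used in the talk).
STOP_WORDS = {"a", "an", "and", "the", "is", "in", "of", "to", "it"}

def tokenize_without_stop_words(text):
    """Lowercase, tokenize on letters/digits, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

kept = tokenize_without_stop_words("The chicken is in the middle of the road")
print(kept)
```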
  19. 19. Paul Shapiro | @fighto | #TechSEOBoost Expand Common Abbreviations
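
One way to expand abbreviations is a lookup table applied with a regular expression. The mapping below is hypothetical, and some entries are ambiguous in practice ("St." can also mean "Saint"), so treat it as a sketch:

```python
import re

# Hypothetical abbreviation map for illustration.
ABBREVIATIONS = {
    "dr": "doctor",
    "st": "street",
    "vs": "versus",
    "approx": "approximately",
}

# Match any abbreviation as a whole word, with an optional trailing period.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b\.?",
    re.IGNORECASE,
)

def expand_abbreviations(text):
    """Replace each known abbreviation (and its trailing period) with its expansion."""
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1).lower()], text)

expanded = expand_abbreviations("Dr. Smith vs. the approx. 40 cats on Main St.")
print(expanded)
```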
  20. 20. Paul Shapiro | @fighto | #TechSEOBoost Text Pre-Processing Normalization and Standardization • Stemming • Lemmatization
  21. 21. Paul Shapiro | @fighto | #TechSEOBoost Why Normalization? Text Analytics Example • Speeds up machine learning analysis • Disambiguation Say there are 500 jokes in our corpus that mention “Donald Trump” • 25 of those jokes include the word “economy”, 15 include the word “economic”, and 10 mention “world economies”. • All of these jokes have to do with both “economics” and “Donald Trump” but would turn up as 3 distinct co-occurrences.
  22. 22. Paul Shapiro | @fighto | #TechSEOBoost Why Stemming and Pitfalls • More basic method of reducing different forms of the same word to a common base • Stemming chops off the end of the word to accomplish this • Faster method • Results in terms that are not real words:
  23. 23. Paul Shapiro | @fighto | #TechSEOBoost Stemming
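
To illustrate the "chop off the end" behaviour, here is a deliberately naive suffix stripper. It is not the Porter stemmer or any library algorithm; the suffix list and the minimum stem length are assumptions chosen for the example:

```python
# Suffixes are checked in order; longer or more specific ones come first.
SUFFIXES = ("ies", "ing", "ic", "es", "ed", "y", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["economy", "economic", "economies"]]
print(stems)
```

All three forms collapse to "econom", which groups them together but is not a real word. That is the trade-off stemming makes.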
  24. 24. Paul Shapiro | @fighto | #TechSEOBoost Why Lemmatization and Pitfalls • More sophisticated method of reducing different forms of the same word to a common base • Lemmatization leverages vocabulary and grammar to infer the root of a word • Requires part-of-speech tagging • Slower but more accurate method
  25. 25. Paul Shapiro | @fighto | #TechSEOBoost Lemmatization
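
A toy sketch of why lemmatization needs a part-of-speech tag. Real lemmatizers rely on a large vocabulary and morphological rules; the small `LEMMA_TABLE` here is hand-written purely for illustration:

```python
# (word, part of speech) -> lemma. Hand-written entries, not a real lexicon.
LEMMA_TABLE = {
    ("economies", "NOUN"): "economy",
    ("better", "ADJ"): "good",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
    ("was", "VERB"): "be",
}

def lemmatize(word, pos):
    """Look up the lemma for a word and its part of speech; fall back to the word itself."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())

print(lemmatize("meeting", "VERB"))  # the verb form reduces to its base
print(lemmatize("meeting", "NOUN"))  # the noun stays as it is
```

The same surface form maps to different lemmas depending on the tag, and every result is a real word, unlike the output of a stemmer.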
  26. 26. Paul Shapiro | @fighto | #TechSEOBoost Information Extraction & Grouping Getting more context • N-Grams • Parts of Speech Tagging • Chunking/Chinking • Named Entity Recognition • Word Embeddings
  27. 27. Paul Shapiro | @fighto | #TechSEOBoost N-Grams
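
N-grams are runs of n consecutive tokens. A minimal sketch, with an illustrative `ngrams` helper and sample tokens:

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams(["search", "engine", "optimization", "tips"], 2)
print(bigrams)
```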
  28. 28. Paul Shapiro | @fighto | #TechSEOBoost Parts of Speech Tagging
  29. 29. Paul Shapiro | @fighto | #TechSEOBoost
  30. 30. Paul Shapiro | @fighto | #TechSEOBoost Named Entity Recognition
  31. 31. Paul Shapiro | @fighto | #TechSEOBoost Word Embeddings: word2vec, GloVe
  32. 32. Paul Shapiro | @fighto | #TechSEOBoost Word Embeddings: word2vec, GloVe
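
Word embeddings represent words as vectors so that related words end up close together. The sketch below compares vectors with cosine similarity; the three-dimensional `TOY_VECTORS` are made up for illustration and do not come from word2vec, GloVe, or any trained model:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented vectors; trained embeddings usually have hundreds of dimensions.
TOY_VECTORS = {
    "seo": [0.9, 0.1, 0.0],
    "search": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

related = cosine_similarity(TOY_VECTORS["seo"], TOY_VECTORS["search"])
unrelated = cosine_similarity(TOY_VECTORS["seo"], TOY_VECTORS["banana"])
print(related > unrelated)
```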
  33. 33. Paul Shapiro | @fighto | #TechSEOBoost Statistical Feature Creation • Leverage personal heuristics to create customized numeric representations that you think could be used by a machine learning model to make predictions
  34. 34. Paul Shapiro | @fighto | #TechSEOBoost Example: Joke Lines & Length
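
A sketch of two hand-built features for a joke: how many lines it has and how long it is. The function names and the choice to ignore whitespace when measuring length are assumptions:

```python
def line_count(joke):
    """Number of lines in the joke text."""
    return len(joke.splitlines())

def char_length(joke):
    """Number of non-whitespace characters in the joke text."""
    return len(joke) - sum(ch.isspace() for ch in joke)

joke = "Knock knock.\nWho's there?"
print(line_count(joke), char_length(joke))
```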
  35. 35. Paul Shapiro | @fighto | #TechSEOBoost Example: Boolean Profanity
  36. 36. Paul Shapiro | @fighto | #TechSEOBoost Example: Number of Profane Words
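
The boolean and count features from the last two slides could look roughly like this. `PROFANE_WORDS` is a mild placeholder list, since the word list used in the talk is not in the transcript:

```python
import re

# Placeholder list for illustration; substitute a real profanity lexicon.
PROFANE_WORDS = {"darn", "heck"}

def profane_word_count(text):
    """Count tokens that appear in the profanity list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(1 for t in tokens if t in PROFANE_WORDS)

def has_profanity(text):
    """Boolean feature: does the text contain any listed word?"""
    return profane_word_count(text) > 0

print(profane_word_count("Well, heck. Darn it, heck!"))
print(has_profanity("A perfectly clean joke"))
```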
  37. 37. Paul Shapiro | @fighto | #TechSEOBoost Feature Normalization Box-Cox Power Transformations • “A Box Cox transformation is a way to transform non-normal dependent variables into a normal shape. Normality is an important assumption for many statistical techniques; if your data isn’t normal, applying a Box-Cox means that you are able to run a broader number of tests.” https://www.statisticshowto.datasciencecentral.com/box-cox-transformation/
  38. 38. Paul Shapiro | @fighto | #TechSEOBoost Box-Cox Power Transformation
  39. 39. Paul Shapiro | @fighto | #TechSEOBoost Check Distribution with Histogram
  40. 40. Paul Shapiro | @fighto | #TechSEOBoost Check Distribution with Histogram
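
The plots from these slides are not in the transcript. As a plotting-free stand-in, the sketch below bins values into equal-width buckets so skew can be seen from the counts; the `histogram` helper and sample values are assumptions:

```python
def histogram(values, bins):
    """Count how many values fall into each of `bins` equal-width buckets."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # avoid a zero width when all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum value would land one past the end, so clamp it into the last bucket.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts

counts = histogram([1, 2, 2, 3, 3, 3, 10], bins=3)
print(counts)
```

Most values pile into the first bucket with a lone value far to the right, the kind of skewed shape a power transformation is meant to reduce.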
  41. 41. Paul Shapiro | @fighto | #TechSEOBoost Box-Cox Transformation & Apply
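
For reference, the Box-Cox transform of a positive value x with parameter lambda is (x^lambda - 1) / lambda, or ln(x) when lambda is 0. A minimal sketch with a fixed, hand-chosen lambda (statistical libraries usually estimate lambda from the data; the `box_cox` helper is illustrative):

```python
import math

def box_cox(values, lam):
    """Apply the Box-Cox power transform with a fixed lambda to positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]

transformed = box_cox([1, 4, 9], 0.5)
print(transformed)
```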
  42. 42. Paul Shapiro | @fighto | #TechSEOBoost Vectorization • Count Vectorizer • N-Gram Vectorizer • TF-IDF Vectorizer
  43. 43. Paul Shapiro | @fighto | #TechSEOBoost Count Vectorizer – Cleaning Function
  44. 44. Paul Shapiro | @fighto | #TechSEOBoost Count Vectorizer
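
What a count vectorizer produces can be sketched with the standard library: a sorted vocabulary and one row of term counts per document. This is a hand-rolled illustration that splits on whitespace, not a library's vectorizer:

```python
from collections import Counter

def count_vectorize(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts vocab[j] in docs[i]."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix

vocab, matrix = count_vectorize(["the cat sat", "the cat saw the dog"])
print(vocab)
print(matrix)
```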
  45. 45. Paul Shapiro | @fighto | #TechSEOBoost N-Gram Vectorizer
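
The same idea applied to bigrams instead of single words, again as a hand-rolled sketch rather than a library call (the helper name and sample documents are assumptions):

```python
from collections import Counter

def bigram_vectorize(docs):
    """Return (vocabulary, matrix) of bigram counts, one row per document."""
    per_doc = []
    for doc in docs:
        tokens = doc.lower().split()
        # Pair each token with its successor to form bigrams.
        per_doc.append(Counter(" ".join(pair) for pair in zip(tokens, tokens[1:])))
    vocab = sorted({term for counts in per_doc for term in counts})
    matrix = [[counts[term] for term in vocab] for counts in per_doc]
    return vocab, matrix

bigram_vocab, bigram_matrix = bigram_vectorize(["the cat sat", "the cat ran"])
print(bigram_vocab)
print(bigram_matrix)
```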
  46. 46. Paul Shapiro | @fighto | #TechSEOBoost Let’s Talk About TF-IDF for a Moment • Count Vectorizer looks at how many times a term or n-gram appears in a joke and represents that count as a non-negative integer • TF-IDF creates a score that considers how many times a term appears in a joke as well as how common it is across the entire corpus of jokes • Rarer words are deemed more important because they can be used to distinguish one joke from another • Higher TF-IDF value = rarer, more distinctive term • Lower TF-IDF value = more common term
  47. 47. Paul Shapiro | @fighto | #TechSEOBoost TF-IDF Vectorizer
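
A minimal sketch of the textbook TF-IDF score, term frequency times log(N / document frequency). Library implementations typically add smoothing and normalization, so their numbers will differ; the helper and the sample documents are illustrative:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per non-empty document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

scores = tf_idf(["the cat sat", "the dog barked", "the cat purred"])
print(scores[0])
```

In the first document, "the" appears in every document and scores zero, "cat" appears in two of three and scores low, and "sat" appears only once in the corpus and scores highest.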
  48. 48. Paul Shapiro | @fighto | #TechSEOBoost Decision Trees Will [Sports Team] win? • Players’ statistics are favorable? (Yes / No) • Is the team they’re playing historically better? (Yes / No)
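
A decision tree is a chain of yes/no questions ending in a prediction. The slide's diagram is flattened in the transcript, so the branch outcomes below are an assumption, written by hand rather than learned from data:

```python
def will_team_win(stats_favorable, opponent_historically_better):
    """Hand-written two-question tree; the leaf outcomes are assumed for illustration."""
    if stats_favorable:
        # Second question is asked only on the "yes" branch of the first.
        if opponent_historically_better:
            return False
        return True
    return False

print(will_team_win(True, False))
print(will_team_win(True, True))
```

A trained tree would pick the questions and their order from labelled examples instead of having them written by hand.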
  49. 49. Paul Shapiro | @fighto | #TechSEOBoost Random Forest (the decision tree repeated as multiple trees) Will [Sports Team] win? • Players’ statistics are favorable? (Yes / No) • Is the team they’re playing historically better? (Yes / No)
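
A random forest combines the votes of many decision trees. In a real forest each tree is trained on a random sample of rows and features; the three hand-written trees below, and the `home_game` feature, are hypothetical and only show the majority-vote step:

```python
from collections import Counter

def tree_a(f):
    return f["stats_favorable"] and not f["opponent_historically_better"]

def tree_b(f):
    return f["home_game"] or f["stats_favorable"]

def tree_c(f):
    return not f["opponent_historically_better"]

def forest_predict(features, trees):
    """Return the prediction that receives the most votes across the trees."""
    votes = Counter(tree(features) for tree in trees)
    return votes.most_common(1)[0][0]

TREES = [tree_a, tree_b, tree_c]
game = {"stats_favorable": True, "opponent_historically_better": True, "home_game": False}
print(forest_predict(game, TREES))
```

Using an odd number of trees avoids ties for a yes/no prediction.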
  50. 50. Paul Shapiro | @fighto | #TechSEOBoost Basic Machine Learning
  51. 51. Paul Shapiro | @fighto | #TechSEOBoost Basic Machine Learning
  52. 52. Paul Shapiro | @fighto | #TechSEOBoost Basic Machine Learning
  53. 53. Paul Shapiro | @fighto | #TechSEOBoost Having Done This Better • Reduce overfitting • Standardize features (mixing sparse and non-sparse data) • Word embeddings for more context • More sophisticated models
  54. 54. Paul Shapiro | @fighto | #TechSEOBoost More Applications for SEO • Creating performant content (joke example extrapolated) • Predicting natural link earning potential • Natural language generation, writing bits of content • Semantic content optimization • Site architecture design and taxonomy • User flow creation • Keyword research • Etc.
  55. 55. Paul Shapiro | @fighto | #TechSEOBoost How to Learn More, Resources • https://web.stanford.edu/~jurafsky/slp3/ • https://www.kaggle.com/learn/overview • https://towardsdatascience.com • https://github.com/keon/awesome-nlp
  56. 56. Paul Shapiro | @fighto | #TechSEOBoost LET’S REDEFINE TECHNICAL SEO
  57. 57. Paul Shapiro | @fighto | #TechSEOBoost Thank You – Paul Shapiro, Senior Partner, Head of SEO, Catalyst Paul.Shapiro@groupm.com
  58. 58. Paul Shapiro | @fighto | #TechSEOBoost Thanks for Viewing the Slideshare! – Watch the Recording: https://youtube.com/session-example Or Contact us today to discover how Catalyst can deliver unparalleled SEO results for your business. https://www.catalystdigital.com/