Paul Shapiro | @fighto | #TechSEOBoost
#TechSEOBoost | @CatalystSEM
THANK YOU TO THIS YEAR’S SPONSORS
NLP for SEO
Paul Shapiro, Catalyst
Breaking Down NLP for SEO
Paul Shapiro
Senior Partner, Head of SEO
@ Catalyst, a GroupM Agency
Assumptions & Prerequisites
• Familiarity with Python
• Familiarity with common data science libraries such as pandas and NumPy
• Familiarity with Jupyter Notebooks (optional)
• But no prior knowledge of NLP
Libraries Used in Examples
KNIME as an Alternative
https://www.knime.com
What is Natural Language Processing (NLP)?
What is NLP?
“NLP is a way for computers to analyze, understand, and derive
meaning from human language in a smart and useful way. By utilizing
NLP, developers can organize and structure knowledge to perform
tasks such as automatic summarization, translation, named entity
recognition, relationship extraction, sentiment analysis, speech
recognition, and topic segmentation.”
https://blog.algorithmia.com/introduction-natural-language-processing-nlp/
NLP, from Old to New
• Linguistic Heuristics
• Statistics
• Machine Learning
Input: Parse Semi/Unstructured Text Data
https://github.com/niderhoff/nlp-datasets
Example Data Sources
• (Digital) Books
• CSVs, Excel, JSON, XML, etc.
• Word Docs/PDFs
• Web Pages (most relevant to SEO)
Text Pre-Processing
Tokenization
• Text must be broken into units aka tokens
• (Usually individual words)
Text Pre-Processing
We need to parse, clean, and prepare text data for both analysis and conversion into machine-interpretable formats.
Tokenize Words
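
The code shown on this slide did not survive extraction. As a stand-in, here is a minimal sketch of word tokenization using only Python's standard library. The function name `tokenize` and the regular expression are assumptions for illustration, not the code from the talk; libraries such as NLTK provide more robust tokenizers.

```python
import re

def tokenize(text):
    """Split text into word tokens, keeping apostrophes inside words (e.g. "Don't")."""
    # One or more letters/digits, optionally followed by an apostrophe and more letters.
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)

tokens = tokenize("Why did the SEO cross the road? To get hit with traffic!")
print(tokens)
```

Punctuation is dropped as a side effect of the pattern; a tokenizer that keeps punctuation as separate tokens would need a different expression.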
Text Pre-Processing
Noise and Junk Removal/Cleanup
• Punctuation and Special Characters
• Stop Words
• Common Abbreviations
• Common Character Cases
• Etc.
Lowercase + Remove Punctuation
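
A minimal sketch of this cleanup step, assuming plain ASCII punctuation. The helper name `clean_text` is hypothetical; the original slide's code is not available.

```python
import string

def clean_text(text):
    """Lowercase the text and delete every ASCII punctuation character."""
    lowered = text.lower()
    # str.maketrans with a third argument maps each listed character to None (deletion).
    # Note: string.punctuation covers ASCII only, so curly quotes would be left in place.
    return lowered.translate(str.maketrans("", "", string.punctuation))

print(clean_text("Hello, World! It's SEO-time."))  # hello world its seotime
```

Deleting hyphens and apostrophes outright merges word parts ("seotime", "its"); replacing them with spaces instead is a reasonable alternative depending on the corpus.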
Tokenize & Remove Stop Words
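
The step named on this slide could be sketched as follows. The `STOP_WORDS` set is a tiny hypothetical list made up for illustration; real stop word lists (for example, the one shipped with NLTK) are much longer.

```python
# Hypothetical, deliberately small stop word list for illustration only.
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "to", "of", "in", "on", "is", "are", "with"}

def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop tokens found in STOP_WORDS."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in STOP_WORDS]

print(remove_stop_words("The chicken crossed the road to get to the other side"))
# ['chicken', 'crossed', 'road', 'get', 'other', 'side']
```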
Expand Common Abbreviations
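
One simple way to do this is a lookup table applied to already-cleaned tokens. The `ABBREVIATIONS` mapping below is a hypothetical example, not the list used in the talk.

```python
# Hypothetical abbreviation table; extend it to suit the corpus.
ABBREVIATIONS = {
    "dr": "doctor",
    "govt": "government",
    "vs": "versus",
    "approx": "approximately",
}

def expand_abbreviations(tokens):
    """Replace each token that has an entry in ABBREVIATIONS; leave the rest untouched."""
    return [ABBREVIATIONS.get(token, token) for token in tokens]

print(expand_abbreviations(["dr", "smith", "vs", "govt"]))
# ['doctor', 'smith', 'versus', 'government']
```

This assumes lowercasing and punctuation removal have already run, so "Dr." arrives as "dr".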
Text Pre-Processing
Normalization and Standardization
• Stemming
• Lemmatization
Why Normalization? Text Analytics Example
• Speeds up machine learning analysis
• Disambiguation
Say there are 500 jokes in our corpus that mention “Donald Trump”.
• 25 of those jokes include the word “economy”, 15 include the word “economic”, and 10 mention “world economies”.
• All of these jokes have to do with both “economics” and “Donald Trump” but would turn up as 3 distinct co-occurrences.
Why Stemming and Pitfalls
• More basic method of reducing different forms of the same word to a common base
• Stemming chops off the end of the word to accomplish this
• Faster method
• Results in terms that are not real words:
Stemming
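
To make the "chops off the end of the word" idea concrete, here is a toy suffix stripper. It is not the Porter stemmer or any library algorithm; the `SUFFIXES` tuple and the minimum stem length of 3 are assumptions chosen so the "economy" example collapses to one stem.

```python
# Toy suffix list, checked in order; purely illustrative.
SUFFIXES = ("ies", "ic", "es", "s", "y")

def toy_stem(word):
    """Strip the first matching suffix, as long as at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["economy", "economic", "economies", "jokes"]:
    print(word, "->", toy_stem(word))
# economy -> econom
# economic -> econom
# economies -> econom
# jokes -> jok
```

All three "economy" variants land on the same stem, which is the point of normalization, but "econom" and "jok" are not real words, which is the pitfall.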
Why Lemmatization and Pitfalls
• More sophisticated method of reducing different forms of the same word to a common base
• Lemmatization leverages vocabulary and grammar to infer the root of a word
• Requires part-of-speech tagging
• Slower but more accurate method
Lemmatization
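
A real lemmatizer relies on a full vocabulary (NLTK's WordNet-based lemmatizer is one example). The sketch below only illustrates why the part-of-speech tag matters, using a tiny hand-written table; the `LEMMAS` entries and the tag names "NOUN", "VERB", and "ADJ" are assumptions for illustration.

```python
# Hypothetical mini-dictionary keyed by (word, part of speech).
LEMMAS = {
    ("economies", "NOUN"): "economy",
    ("jokes", "NOUN"): "joke",
    ("better", "ADJ"): "good",
    ("saw", "VERB"): "see",   # "I saw the joke"
    ("saw", "NOUN"): "saw",   # "a hand saw"
}

def toy_lemmatize(word, pos):
    """Look up the lemma for (word, pos); fall back to the lowercased word."""
    key = (word.lower(), pos)
    return LEMMAS.get(key, word.lower())

print(toy_lemmatize("saw", "VERB"))        # see
print(toy_lemmatize("saw", "NOUN"))        # saw
print(toy_lemmatize("Economies", "NOUN"))  # economy
```

The same surface form "saw" maps to different lemmas depending on the tag, which is why lemmatization is slower than stemming but returns real words.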
Information Extraction & Grouping
Getting more context
• N-Grams
• Parts of Speech Tagging
• Chunking/Chinking
• Named Entity Recognition
• Word Embeddings
N-Grams
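
An n-gram is a run of n consecutive tokens. A minimal sketch, with the function name `ngrams` chosen for illustration:

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["natural", "language", "processing", "for", "seo"]
print(ngrams(tokens, 2))
# [('natural', 'language'), ('language', 'processing'), ('processing', 'for'), ('for', 'seo')]
```

If the token list is shorter than n, the result is an empty list.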
Parts of Speech Tagging
Named Entity Recognition
Word Embeddings: word2vec, GloVe
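
Word embeddings represent each word as a vector of numbers, so that related words end up close together. The sketch below shows how closeness is typically measured with cosine similarity. The three-dimensional vectors in `toy_vectors` are invented for illustration; real word2vec or GloVe vectors are learned from large corpora and usually have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors, NOT real embeddings.
toy_vectors = {
    "economy": [0.9, 0.1, 0.3],
    "economic": [0.8, 0.2, 0.3],
    "joke": [0.1, 0.9, 0.2],
}

print(cosine_similarity(toy_vectors["economy"], toy_vectors["economic"]))
print(cosine_similarity(toy_vectors["economy"], toy_vectors["joke"]))
```

With these toy values, "economy" scores much closer to "economic" than to "joke".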
Statistical Feature Creation
• Leverage personal heuristics to create customized numeric
representations that you think could be used by a machine
learning model to make predictions
Example: Joke Lines & Length
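
A sketch of how these two features might be computed for a single joke. The feature names `line_count` and `char_length` are assumptions; the original code is not available.

```python
def joke_features(joke):
    """Count non-empty lines and total characters in a joke."""
    lines = [line for line in joke.splitlines() if line.strip()]
    return {"line_count": len(lines), "char_length": len(joke)}

joke = "Knock knock.\nWho's there?\n\nSEO."
print(joke_features(joke))
```

With pandas, the same function could be applied to a column of jokes to build two new numeric columns.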
Example: Boolean Profanity
Example: Number of Profane Words
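
Both profanity features (the Boolean flag and the count) can come from one pass over the tokens. The `PROFANE_WORDS` set below is a mild placeholder list made up for illustration; a real list would be much longer.

```python
import re

# Placeholder word list for illustration only.
PROFANE_WORDS = {"darn", "heck"}

def profanity_features(text):
    """Return a Boolean flag and a count of tokens found in PROFANE_WORDS."""
    tokens = re.findall(r"[a-z']+", text.lower())
    count = sum(1 for token in tokens if token in PROFANE_WORDS)
    return {"has_profanity": count > 0, "profanity_count": count}

print(profanity_features("Well, darn it. What the heck? Darn!"))
# {'has_profanity': True, 'profanity_count': 3}
```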
Feature Normalization
Box-Cox Power Transformations
• “A Box Cox transformation is a way to transform non-normal dependent variables into a normal shape. Normality is an important assumption for many statistical techniques; if your data isn’t normal, applying a Box-Cox means that you are able to run a broader number of tests.”
https://www.statisticshowto.datasciencecentral.com/box-cox-transformation/
Box-Cox Power Transformation
Check Distribution with Histogram
Box-Cox Transformation & Apply
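
The Box-Cox transform itself is a short formula: (x^λ − 1) / λ for λ ≠ 0, and ln(x) for λ = 0, defined for positive x only. The sketch below applies it with a λ chosen by hand; the function name `box_cox` is an assumption. In practice λ is usually estimated from the data, for example with `scipy.stats.boxcox`.

```python
import math

def box_cox(values, lam):
    """Apply the Box-Cox power transform with a fixed lambda to positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]

print(box_cox([1, 4, 9], 0.5))  # [0.0, 2.0, 4.0]
```

A feature such as joke length can contain zeros, so a small positive offset may be needed before transforming; whether that is appropriate depends on the data.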
Vectorization
• Count Vectorizer
• N-Gram Vectorizer
• TF-IDF Vectorizer
Count Vectorizer – Cleaning Function
Count Vectorizer
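
A count vectorizer turns each document into a row of term counts over a shared vocabulary. The pure-Python sketch below shows the idea; it is not scikit-learn's `CountVectorizer`, and the function name `count_vectorize` is hypothetical.

```python
from collections import Counter

def count_vectorize(docs):
    """Return (vocabulary, rows) where each row holds per-term counts for one document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

vocabulary, rows = count_vectorize(["knock knock joke", "dad joke"])
print(vocabulary)  # ['dad', 'joke', 'knock']
print(rows)        # [[0, 1, 2], [1, 1, 0]]
```

Most entries in a real document-term matrix are zero, which is why library implementations store it as a sparse matrix.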
N-Gram Vectorizer
Let’s Talk About TF-IDF for a Moment
• Count Vectorizer looked at how many times a term or n-gram appeared in a joke and represented that count as a non-negative integer.
• TF-IDF creates a score that considers how many times a term appears in a joke as well as how common the term is across the entire corpus of jokes.
• Rarer words are deemed to be more important because they can be used to distinguish one joke from another.
• Higher TF-IDF value = more uncommon
• Lower TF-IDF value = more common
TF-IDF Vectorizer
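
As a worked illustration, the sketch below computes the plain textbook score: term frequency within a document multiplied by log(N / document frequency). This is an assumption about which variant to show; scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    scores = []
    for doc in tokenized:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

scores = tf_idf(["chicken crossed road", "chicken joke", "dad joke"])
print(scores[0])
```

In the first joke, "crossed" appears in only one document and so scores higher than "chicken", which appears in two. A term present in every document scores 0 under this variant.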
Decision Trees
[Diagram: a decision tree for the question “Will [Sports Team] win?”, with Yes/No branches on two questions: “Are the players’ statistics favorable?” and “Is the team they’re playing historically better?”]
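
A decision tree is a chain of yes/no questions ending in a prediction. The sketch below hard-codes one plausible reading of the diagram; the order of the questions and the outcome at each leaf are assumptions, since the original branch layout is not recoverable. In practice the tree is learned from training data rather than written by hand.

```python
def predict_win(stats_favorable, opponent_historically_better):
    """Hand-written decision tree: walk the questions in order and return a prediction."""
    if not stats_favorable:
        return False  # unfavorable player statistics: predict a loss
    if opponent_historically_better:
        return False  # favorable stats, but the opponent has the better record
    return True       # favorable stats and a weaker opponent: predict a win

print(predict_win(stats_favorable=True, opponent_historically_better=False))  # True
```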
Random Forest
[Diagram: a random forest drawn as multiple decision trees side by side. Each tree asks “Will [Sports Team] win?” and branches Yes/No on “Are the players’ statistics favorable?” and “Is the team they’re playing historically better?”]
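
A random forest combines many decision trees and takes a majority vote. The sketch below shows only the voting step, with three hand-written trees. The trees and the `home_game` feature are hypothetical; a real random forest trains each tree on a random sample of the rows and features.

```python
from collections import Counter

# Three hypothetical hand-written trees, each looking at a different mix of features.
def tree_a(game):
    return game["stats_favorable"] and not game["opponent_historically_better"]

def tree_b(game):
    return game["stats_favorable"] or game["home_game"]

def tree_c(game):
    return game["home_game"] and not game["opponent_historically_better"]

def forest_predict(game, trees=(tree_a, tree_b, tree_c)):
    """Return the prediction that the majority of trees vote for."""
    votes = Counter(tree(game) for tree in trees)
    return votes.most_common(1)[0][0]

game = {"stats_favorable": True, "opponent_historically_better": False, "home_game": False}
print(forest_predict(game))  # True (two of three trees vote for a win)
```

Using an odd number of trees avoids ties for a yes/no prediction.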
Basic Machine Learning
How This Could Have Been Done Better
• Reduce overfitting
• Standardize features (mixing sparse and non-sparse data)
• Word embeddings for more context
• More sophisticated models
More Applications for SEO
• Creating performant content (joke example extrapolated)
• Predicting natural link earning potential
• Natural language generation, writing bits of content
• Semantic content optimization
• Site architecture design and taxonomy
• User flow creation
• Keyword research
• Etc.
How to Learn More, Resources
• https://web.stanford.edu/~jurafsky/slp3/
• https://www.kaggle.com/learn/overview
• https://towardsdatascience.com
• https://github.com/keon/awesome-nlp
LET’S REDEFINE TECHNICAL SEO
Thank You
–
Paul Shapiro, Senior Partner, Head of SEO, Catalyst
Paul.Shapiro@groupm.com
Thanks for Viewing the Slideshare!
–
Watch the Recording: https://youtube.com/session-example
Or
Contact us today to discover how Catalyst can deliver unparalleled SEO
results for your business. https://www.catalystdigital.com/
