TF*IDF
Introduction to Topic Modeling for SEO
Prepared for 24 Hours of SEO
January 25, 2018
Presented by Nick Eubanks
@nick_eubanks
Hi, I’m Nick Eubanks
❏ First business (painting) at 19 years old. $280k in revenue during summer of 2003.
❏ Launched first digital company in 2008 (atomni), built CMS for rapid deployment of microsites. Sold.
❏ Built custom CMS for reviews in Japan. Angel funded. Sold.
❏ Built and sold lead-gen websites in medical, legal, and SEO.
❏ Joined TrafficSafetyStore.com in May 2012. Grew from ~$3M to well north of $20M.
❏ Currently split time between NK Tech and IFTF.
Nick Eubanks
Founder and CEO, I’m From The Future
Founder, TrafficThinkTank.com
Executive Director, NK Tech
Owner, ADDHero.com
We All Want More Traffic to Our Website
Ideally traffic that is...
❏ Sustainable
❏ Scalable
❏ Free
SEO
Understanding Search Engines
❏ Process of information retrieval; NLP
❏ Training machines to “understand” text
❏ Identifying patterns; topics
❏ Identifying importance; term frequency
In Comes TF*IDF
❏ Term frequency × inverse document frequency
❏ Process for machine-based information retrieval
❏ Weights each term by how often it appears in a document, offset by how common it is across the entire document set
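A minimal sketch of the weighting, assuming the common tf × log(N / df) variant (the slides do not say which variant is used). The function name `tf_idf` and the three sample documents are invented for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = term count / document length
    idf = log(N / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Toy corpus, invented for this example.
docs = [
    "organic farming uses organic standards",
    "organic watermelons and organic lemons",
    "traffic cones for road safety",
]
weights = tf_idf(docs)
```

In the first toy document, "farming" ends up weighted higher than "organic" even though "organic" appears twice, because "organic" also appears in a second document and is therefore less distinctive.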
What does that look like...
❏ Defining term “weights” based on the frequency of use throughout the document
❏ Removal of common “stop” words like
the, and, is, because, and so on
❏ Excluding common HTML elements
such as words used in header, footer,
sidebars, and navigational elements
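One way the cleanup steps above could look in practice, as a rough sketch using only Python's standard `html.parser`. The stop-word list, the set of skipped tags, and the sample page are assumptions made for this example; real pipelines use much longer stop-word lists and more careful boilerplate detection.

```python
from html.parser import HTMLParser

# Small illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "and", "is", "because", "a", "of", "to", "in"}
# Elements treated as page boilerplate in this sketch.
SKIP_TAGS = {"header", "footer", "nav", "aside", "script", "style"}

class MainTextExtractor(HTMLParser):
    """Collect words that appear outside boilerplate elements."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.words.extend(data.lower().split())

def content_terms(html):
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    # Strip surrounding punctuation, then drop stop words.
    cleaned = (w.strip(".,;:!?\"'()") for w in parser.words)
    return [w for w in cleaned if w and w not in STOP_WORDS]

page = """
<html><body>
<header><nav>Home Products Contact</nav></header>
<main><p>Organic farming is the practice of growing food
without synthetic pesticides.</p></main>
<footer>Copyright and terms</footer>
</body></html>
"""
terms = content_terms(page)
```

Only the words from the main paragraph survive; the navigation and footer text never reach the term list, so they cannot skew the weights.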
Topic Identification
❏ Identify themes across terms to group
into “topics”
❏ Ahrefs has some of this
❏ Much better with trained corpora
using tf*idf
❏ Can do manually if you have the
patience
❏ Use topics to design your content
calendars
Building Relevance
❏ Process:
Keyword research > Topic
Identification > Theme
❏ Pillar content = Top-level theme
❏ Build topic content in clusters
❏ Example: https://www.hubspot.com/marketing-statistics
Imagine this Scenario
❏ I gave you a stack of 1,000 documents
❏ Could you answer these questions:
❏ How would you organize them?
❏ What if I asked what they were
about?
❏ What if I asked for all the documents
about watermelons?
❏ How would you provide this
information?
You would need to create a topic model
❏ To do this, you could use a set of different
colored highlighters
❏ Then scan the documents looking for
words that are related to each other, and
highlight them
❏ Then you would create an index
❏ You would come up with a topic for
each set of related terms
❏ You would note which pages the terms relevant to each topic appeared on
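The highlighter exercise amounts to building an index from topics to pages. A toy sketch of that idea follows; the topic groupings (`TOPIC_TERMS`) and page texts are hypothetical, and a real system would discover the groupings rather than hard-code them.

```python
from collections import defaultdict

# Hypothetical topics and their related terms (the "highlighter colors").
TOPIC_TERMS = {
    "fruit": {"watermelons", "lemons", "apples"},
    "vegetables": {"squash", "lettuce", "broccoli"},
}

def build_topic_index(pages):
    """Map each topic to the sorted page numbers where its terms appear."""
    index = defaultdict(set)
    for page_number, text in pages.items():
        words = set(text.lower().split())
        for topic, terms in TOPIC_TERMS.items():
            if words & terms:  # at least one related term is on this page
                index[topic].add(page_number)
    return {topic: sorted(numbers) for topic, numbers in index.items()}

# Invented sample pages.
pages = {
    1: "Watermelons need warm soil",
    2: "Lettuce and broccoli prefer cool weather",
    3: "Apples and squash store well",
}
index = build_topic_index(pages)
```

With an index like this, a request such as "all the documents about watermelons" becomes a lookup instead of a re-read of the whole stack.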
This is How Search Engines Work
Example Topic: Organic Farming
❏ Fruit
❏ Watermelons
❏ Lemons
❏ Apples
❏ Vegetables
❏ Squash
❏ Lettuce
❏ Broccoli
❏ Organic foods
❏ USDA
❏ Lower yields
❏ Organic standards
❏ Farmers
❏ Organic farmers
❏ Certified organic
❏ Synthetic pesticides
TF*IDF for “Organic Farming”
Compared to #1 Ranking URL
Probabilistic Complexities
❏ Automating the identification of topics
❏ Validating topics and training your model
❏ Creation of word vectors
What Are Word Vectors?
❏ Simply, a vector of weights
❏ Word2Vec
❏ Pre-trained
❏ King - man + woman = queen
❏ Spatial relations where proximity
represents relevance
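To make the "king - man + woman = queen" idea concrete, here is a toy sketch. The two-dimensional vectors are hand-picked for illustration (roughly "royalty" and "gender" axes) and are not real Word2Vec output; trained vectors typically have hundreds of dimensions.

```python
import math

# Hand-picked toy vectors: (royalty, gender). Not trained embeddings.
VECTORS = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "princess": (0.7, -1.0),
}

def cosine(a, b):
    """Cosine similarity: closer to 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c)."""
    target = tuple(
        x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])
    )
    # Exclude the input words, as is conventional for analogy lookups.
    candidates = (w for w in VECTORS if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(VECTORS[w], target))

result = analogy("king", "man", "woman")
```

Here `result` is "queen": subtracting "man" and adding "woman" flips the gender axis while leaving the royalty axis alone, and "queen" is the nearest remaining vector.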
Term Clustering into
Topics using LDA
❏ TF*IDF has evolved: a look at latent Dirichlet allocation (LDA)
❏ LDA is a generative statistical model
❏ Ability to generate probabilities; this is
where it gets interesting...
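"Generative" means LDA assumes each document was produced by a random process: pick a mix of topics, then pick each word from one of those topics. The sketch below simulates only that assumed process, with two hypothetical topics whose word probabilities are fixed by hand (full LDA also draws these from a Dirichlet prior). Fitting LDA runs this logic in reverse, inferring topics from observed documents, and is not shown here.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

# Hypothetical topics: each is a probability distribution over words.
TOPICS = {
    "fruit": {"watermelons": 0.5, "lemons": 0.3, "apples": 0.2},
    "farming": {"farmers": 0.4, "yields": 0.3, "pesticides": 0.3},
}

def generate_document(n_words, alpha, rng):
    """Generate a document the way LDA assumes documents arise."""
    topic_names = list(TOPICS)
    # 1. Draw this document's topic mixture from a Dirichlet prior.
    mixture = sample_dirichlet([alpha] * len(topic_names), rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word according to the mixture.
        topic = rng.choices(topic_names, weights=mixture)[0]
        # 3. Pick a word from that topic's word distribution.
        vocab = TOPICS[topic]
        word = rng.choices(list(vocab), weights=list(vocab.values()))[0]
        words.append(word)
    return mixture, words

rng = random.Random(0)  # seeded so repeated runs give the same draw
mixture, words = generate_document(10, alpha=0.5, rng=rng)
```

The `mixture` values are the probabilities the slide refers to: for any document, the model yields a probability for each topic rather than a single hard label.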
Topic Clustering
❏ Grouping topics into themes
❏ Where keyword research becomes
important
❏ Building corpora of documents
representative of relevant topics
❏ Scraping Google really helps with
this
How Does This Get You More SEO Traffic?
What if you knew...
❏ The terms Google was expecting to find
❏ The topics Google was expecting to find
❏ The concepts that should be linked to
❏ The concepts and content that your page
should be linked from?
Not “Keyword Density”
Don’t
❏ Dial in the term weights from tf*idf for a specific document
❏ Stuff in a bunch of keywords because you see them weighted heavily in the top ranking pages
❏ Google is much, much smarter than this
Do
❏ Look for terms that represent topics
❏ Use those topics to perform additional keyword research
❏ Explore the pages currently ranking, and analyze their:
  ❏ Internal links (both directions)
  ❏ External links (both directions)
Implementing Data from a Topic Model
Terms
❏ 28” orange traffic cones
❏ 18” lime traffic cones
❏ Portable traffic cones
❏ Yellow traffic cones
❏ Extendable traffic cone barricades
Topics
❏ Orange traffic cones
❏ Lime traffic cones
❏ Colored traffic cones
❏ Collapsible traffic cones
❏ Cone bars
❏ Work zone safety
Themes
❏ Traffic cones
❏ Safety cones
❏ Road safety cones
Looking at the URL Architecture
Product
❏ Traffic Cones
❏ Orange and Lime Traffic Cones
❏ 28” Orange Traffic Cones
❏ 28” Lime Traffic Cones
❏ 18” Orange Traffic Cone
❏ Colored Traffic Cones
❏ Grabber Cones
❏ Cone Bars
❏ Collapsible Traffic Cones
❏ Traffic Cone Accessories
Content
❏ Traffic Cones
❏ History of Traffic Cones
❏ Custom Traffic Cones
❏ Traffic Cone Selection Guide
❏ Traffic Cones for Construction
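A rough sketch of how a theme > topic > term hierarchy could be turned into URL paths. The nesting in `CONTENT_MAP` and the resulting paths are assumptions for illustration only; they are not the actual URLs of any site mentioned in this deck.

```python
import re

# Assumed nesting for illustration: theme -> topic -> term-level pages.
CONTENT_MAP = {
    "Traffic Cones": {
        "Orange and Lime Traffic Cones": [
            '28" Orange Traffic Cones',
            '28" Lime Traffic Cones',
        ],
        "Collapsible Traffic Cones": [],
    },
}

def slugify(title):
    """Lowercase the title and join alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def url_paths(content_map):
    """List one path per theme, topic, and term page, parents first."""
    paths = []
    for theme, topics in content_map.items():
        theme_path = "/" + slugify(theme) + "/"
        paths.append(theme_path)
        for topic, terms in topics.items():
            topic_path = theme_path + slugify(topic) + "/"
            paths.append(topic_path)
            paths.extend(topic_path + slugify(term) + "/" for term in terms)
    return paths

paths = url_paths(CONTENT_MAP)
```

Each term-level page sits under its topic, and each topic under its theme, so the URL structure mirrors the topic model.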
Implementing Data from a Topic Model
❏ Defining content type
❏ Designing content map
Reference: imfromthefuture.com/content-map/
❏ Laying out content calendar
❏ Designing an SEO-First URL Architecture
❏ Publishing
❏ Expanding keyword footprint
Reference: imfromthefuture.com/bigfoot-strategy/
Take TF*IDF For a Spin
https://imfromthefuture.com/tfidf-embed/
Take LDA Visualization For a Spin
https://imfromthefuture.com/lda-tool/
TrafficThinkTank.com

TF*IDF and the Evolution of Topic Models
