Some queries are very simple - a search for "wikipedia" is unambiguous. It’s straightforward and can be answered effectively by even a very basic web search engine. Other searches aren't nearly as simple. Let's look at how engines might order two results - a simple problem most of the time, though it can become quite complex depending on the situation. Since Content A contains the word “Batman” and Content B does not, the engine can easily choose which one to rank.
The search engine can use TF*IDF to determine that “Wiggum” is a much less common word than “chief” and thus, Content A is more relevant to the query than Content B. NOTE: This example also does a good job of showing the inherent weakness of a metric like keyword density.
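A minimal sketch of the TF*IDF idea using a toy two-document corpus (the documents and terms are invented for illustration, not taken from any real index):

```python
import math

# Toy corpus: each document is a bag of words. Content A mentions the
# rare term "wiggum"; both documents mention the common term "chief".
docs = [
    ["wiggum", "chief", "police", "springfield"],   # Content A
    ["chief", "executive", "officer", "company"],   # Content B
]

def tf(term, doc):
    # Term frequency: how often the term appears, normalized by length.
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Inverse document frequency: rarer terms get a higher weight.
    df = sum(1 for d in corpus if term in d)
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# "wiggum" appears in only one of the two documents, so it carries real
# weight; "chief" appears in both, so its idf here is log(1) = 0.
score_a = tf_idf("wiggum", docs[0], docs)
score_b = tf_idf("wiggum", docs[1], docs)
```

Because "wiggum" is rare across the corpus, Content A ends up with a higher score for it, which is exactly the intuition in the example above.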
Using co-occurrence, the engine can determine that phrases like “Daily Planet” and “Clark Kent” appear with “Superman” and thus, Content B is more relevant than Content A.
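Co-occurrence itself is just counting which words appear together. A rough sketch with made-up sentences (the phrases are illustrative, not real crawl data):

```python
from collections import Counter
from itertools import combinations

# Toy "corpus": the engine observes which words appear together.
sentences = [
    "superman clark kent daily planet",
    "superman daily planet metropolis",
    "batman gotham wayne",
]

cooc = Counter()
for s in sentences:
    words = set(s.split())
    # Count each unordered pair of distinct words in the sentence.
    for a, b in combinations(sorted(words), 2):
        cooc[(a, b)] += 1

# "daily" co-occurs with "superman" twice and with "batman" never, so a
# page mentioning "Daily Planet" is more likely to be about Superman.
```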
As humans reading both sentences, we can infer that Content B is obviously about the musical instrument – a piano – and the woman playing it. But a search engine armed with only the methods we described above will struggle, since both sentences use the words “keys” and “notes”, some of the few clues to the puzzle. NOTE: We were pretty excited to see that our LDA modeling tool correctly scored B higher than A… but then things got REALLY interesting.
For complex queries, or when rating large quantities of results with lots of content-related signals, search engines need ways to determine the intent of a particular page. Simply because a page contains a keyword 4 or 5 times in prominent places, or even mentions similar phrases/synonyms, doesn’t necessarily mean that it's truly relevant to the searcher's query.
In this imaginary example, every word in the English language is related to either "cat" or "dog". They are the only topics available. To measure whether a word is more related to "cat" or "dog," we use a vector space model that displays those relationships mathematically. The illustration does a reasonable job showing our simplistic world. Words like "bigfoot" are perfectly in the middle, with no more closeness to "cat" than "dog." But words like "canine" and "feline" are clearly closer to one than the other, and the degree of the angle in the vector model illustrates this - and gives us a number. BTW, in an LDA vector space model, topics wouldn't have exact label associations like "dog" and "cat" but would instead be things like "the vector around the topic of dogs." Now take the simple model above and scale it to thousands or millions of topics, each of which has its own dimension. Using this construct, the model can compute the similarity between any word or group of words and the topics it's created. You can learn more about this from Stanford University's posting of Introduction to Information Retrieval, <http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html>, which has a specific section on Vector Space Models: <http://nlp.stanford.edu/IR-book/html/htmledition/dot-products-1.html>
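The "degree of the angle" in the two-topic world reduces to cosine similarity. A sketch, with the word vectors invented purely to match the illustration:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors: 1.0 = same direction,
    0.0 = perpendicular (no relationship)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Two-topic world: each word is a vector of (dog-ness, cat-ness).
# These values are invented for illustration.
topics = {"dog": (1.0, 0.0), "cat": (0.0, 1.0)}
words = {
    "canine":  (0.9, 0.1),   # clearly closer to "dog"
    "feline":  (0.1, 0.9),   # clearly closer to "cat"
    "bigfoot": (0.5, 0.5),   # perfectly in the middle
}
```

For example, `cosine(words["canine"], topics["dog"])` comes out much higher than `cosine(words["canine"], topics["cat"])`, while "bigfoot" scores identically against both topics - the same picture the illustration paints, just as a number.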
The correlation of the LDA scores with rankings is uncanny. Certainly, it's not a perfect correlation, but that’s expected, given the complexity of Google's ranking algorithm. Seeing LDA scores show this dramatic a result makes us seriously question whether there's causation at work here. We hope to do additional research via our ranking models to attempt to show that impact. Perhaps good links are more likely to point to pages that are more "relevant" via a topic model, or some other aspect of Google's algorithm that we don't yet understand naturally biases towards these.
Like anything else in the SEO world, manipulatively applying the process is probably a terrible idea. Even if this tool worked perfectly to measure keyword relevance and topic modeling in Google, it would be unwise to simply stuff 50 keywords on your page to get the highest LDA score you could. Quality content that real people actually want to find should be the goal of SEO, and Google is sophisticated enough to determine the difference between junk content that matches topic models and real content that real users will like, even if the tool's scoring can't do that.
We've just made the LDA Labs tool available. You can use this to input a word, phrase, chunk of text or an entire page's content (via the URL input box) along with a desired query (the keyword term/phrase you want to rank for) and the tool will give back a score that represents the cosine similarity in a percentage form (100% = perfect, 0% = no relationship).
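The tool's percentage score is a cosine similarity. The real tool compares LDA topic vectors, but the shape of the calculation can be sketched with a plain bag-of-words stand-in (a simplification - this is not the tool's actual scoring):

```python
import math
from collections import Counter

def cosine_pct(text_a, text_b):
    """Bag-of-words cosine similarity expressed as a percentage:
    100% = identical term profile, 0% = no shared terms."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return 100.0 * dot / norm if norm else 0.0
```

Identical texts score 100%, texts with no vocabulary overlap score 0%, and everything else lands in between - the same 0–100% scale the tool reports.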
If you're trying to do serious SEO analysis and improvement, Rand suggests you build a chart something like this. This chart shows a SERPs analysis of "SEO" in Google.com w/ Linkscape Metrics + LDA
Search engines have, classically, relied on a relatively universal algorithm - one that rates pages based on the metrics available, without massive swings between verticals. In the past few years, however, savvy searchers and many SEOs have noted a distinct shift to a model where certain types of sites have a greater opportunity to perform for certain queries. The odds aren't necessarily stacked against outsiders, but the engines appear to bias toward the types of content providers that are likely to fulfill the users' intent. For example, when a user performs a search for "lamb shanks," it could make a lot of sense to give an extra boost to sites whose content is focused on recipes and food. Bill Slawski reported on Entity Association - rather than just looking for brands, it’s more likely that Google is trying to understand when a query includes an entity – a specific person, place, or thing. And if it can identify an entity, that identification can influence the search results that you see...
Click and visit data is being used to rank results for better personalization.
Transcript of "iStrategy AMS 2011 - Gillian Muessig, SEO Moz"
From SEO to Cloud Marketing<br />Where We Came From<br />Where We’re Headed<br />What To Do About It<br />Gillian Muessig – iStrategy May 2011<br />
The Web’s Most Popular Search Marketing Software<br />
What IS SEO?<br />It is enabling the dissemination of ideas on the web<br />
1999-2008: What Page Ranked #1 for the Queries “Exit” & “Leave”?<br />http://googleblog.blogspot.com/2010/06/our-new-search-index-caffeine.html<br />http://searchengineland.com/google-kills-bushs-miserable-failure-search-other-google-bombs-10363<br />
Topic Modeling<br />LDA correlates w/ Google rankings better than any other on-page feature<br />http://www.seomoz.org/blog/content-optimization-revisiting-topic-modeling-lda-our-labs-tool<br />
Twitter Data<br />Danny Sullivan: If an article is retweeted or referenced much in Twitter, do you count that as a signal outside of finding any non-nofollowed links that may naturally result from it?<br />Google: Yes, we do use it as a signal. It is used as a signal in our organic and news rankings. We also use it to enhance our news universal by marking how many people shared an article.<br />http://searchengineland.com/what-social-signals-do-google-bing-really-count-55389<br />
Twitter Test<br />Page A<br />646 links from 36 root domains<br />2 tweets<br />Page B<br />1 link from 1 root domain<br />522 tweets<br />http://www.seomoz.org/blog/how-do-tweets-influence-search-rankings-an-experiment-for-a-cause<br />
Twitter: Clearly Influencing Google<br />Page B – the tweeted version – ranks #1!<br />Page A<br />646 links from 36 root domains<br />2 tweets<br />Page B<br />1 link from 1 root domain<br />522 tweets<br />http://www.seomoz.org/blog/how-do-tweets-influence-search-rankings-an-experiment-for-a-cause<br />
Twitter Data: Very Powerful for QDF<br />http://www.seomoz.org/blog/tweets-effect-rankings-unexpected-case-study<br />
Don’t Bother Abusing Twitter for SEO<br />http://www.seomoz.org/blog/tweets-effect-rankings-unexpected-case-study#jtc133590<br />
Author Authority<br />Danny Sullivan: Do you try to calculate the authority of someone who tweets that might be assigned to their Twitter page. Do you try to “know,” if you will, who they are?<br />Bing: Yes. We do calculate the authority of someone who tweets. For known public figures or publishers, we do associate them with who they are. (For example, query for Danny Sullivan)<br />Google: Yes we do compute and use author quality. We don’t know who anyone is in real life :-)<br />http://searchengineland.com/what-social-signals-do-google-bing-really-count-55389<br />
From the Mouths of Googlers<br />Wired.com: How do you recognize a shallow-content site? <br />Singhal: (W)e asked… “Would you be comfortable giving this site your credit card? Would you be comfortable giving medicine prescribed by this site to your kids?”<br />Cutts: (Using) a rigorous set of questions… “Do you consider this site to be authoritative? Would it be okay if this was in a magazine? Does this site have excessive ads?”<br />Singhal: And based on that, we basically formed some definition of what could be considered low quality. <br />http://www.wired.com/epicenter/2011/03/the-panda-that-hates-farms/all/1<br />
From the Mouths of Googlers<br />Wired.com: But how do you implement that algorithmically?<br />Cutts: I think you look for signals that recreate that same intuition, that same experience that you have as an engineer and that users have. <br />Singhal: You can imagine in a hyperspace a bunch of points, some points are red, some points are green, and in others there’s some mixture. Your job is to find a plane which says that most things on this side of the plane are red, and most of the things on that side of the plane are the opposite of red.<br />http://www.wired.com/epicenter/2011/03/the-panda-that-hates-farms/all/1<br />
From the Mouths of Googlers<br />Googlers want to know:<br /><ul><li>Trustworthy?
Don’t “Look” Like a Content Farm<br />http://hubpages.com/hub/WomensFashionsofthe1920-FlappersandtheJazz-Age<br />
Avoid “Classic” SEO Tactics<br />Directory Link Building<br />Keyword-Variant Abuse<br />Reciprocal Link Pages<br />Paid Links w/ Manipulative Anchor Text<br />Sitewide, Footer Links<br />Navigation for Engines, Not Humans<br />Low Cost/Quality, Outsourced Content<br />Generic Design and Layout<br />Anchor-Text Rich Internal Links<br />Anonymous Contact Forms<br />Keyword Stuffed Titles + Pages<br />Ad Blocks Dominating the Page<br />It’s great to do good SEO, just don’t look like the only reason the site exists is to draw Google traffic<br />
Take Advantage of New and Evolving Opportunities<br />
Become a “Brand”<br />Brands<br />Generics<br /><ul><li> Have real people working at a physical address
When in Rome…<br />Find Your Corporate Voice<br />Phenomenal analysis of statements by Googlers + how they translate to content/marketing actions: http://bit.ly/iGd7Pe<br />I’m excited to be able to share my life’s passion with you.<br />http://outspokenmedia.com/social-media/quora-hipsters/<br />
Embrace All of Inbound Marketing<br />News/Media/PR<br />SEO<br />Email<br />Research/White Papers<br />Blogs + Blogging<br />Infographics<br />Comment Marketing<br />Social Networks<br />Online Video<br />INBOUND MARKETING!(AKA all the “free” traffic sources)<br />Webinars<br />Forums<br />Document Sharing<br />Social Bookmarking<br />Word of Mouth<br />Podcasting<br />Direct/Referring Links<br />Type-In Traffic<br />Q+A Sites<br />