Necessity of SEO…
Online advertising drives $6 in offline (in-store) sales for every $1 spent online.
Search marketing has a greater impact on in-store sales lift than display advertising: three times greater, in fact.
74% of respondents used search engines to find local business information, versus 65% who turned to print Yellow Pages, 50% who used Internet Yellow Pages, and 44% who used traditional newspapers.
86% of those surveyed said they have used the Internet to find a local business, a rise from the 70% figure reported the year before.
80% reported researching a product or service online, then making that purchase offline from a local business.
“iProspect and Jupiter” Research…
62% of search engine users click on a search result within the first page of results, and 90% within the first three pages.
41% of search engine users who continue their search report changing their search term and/or search engine if they do not find what they’re looking for on the first page of results; 88% report doing so after three pages.
36% of users agree that “seeing a company listed among the top results on a search engine makes me think that the company is a top one within its field.”
How do “Search Engines” work?
Defining “Search Engine”: a system which collects, organizes, and presents a way to select Web documents based on certain words, phrases, or patterns within documents
Model the Web as a full-text DB
Index a portion of the Web docs
Search Web documents using user-specified words/patterns in a text
Categories of “Search Engines”
general-purpose search engines, e.g. Yahoo!, AltaVista and Google
special-purpose search engines (or Internet Portals), e.g. LinuxStart (www.linuxstart.com)
Components of “Search Engines”
Two main components:
web crawler (spider), which collects Web pages on a massive scale.
large database, which stores and indexes collected Web pages.
Ranking has to be performed without accessing the text, just the index
“Search Engine” Models
Information Retrieval (IR) is key to search engines and Web search.
Most commonly used models:
Vector Space Model (VSM)
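The Vector Space Model can be illustrated with a minimal sketch: documents and the query become term-weight vectors, and documents are ranked by cosine similarity to the query. The TF-IDF weighting, the whitespace tokenizer, the function names, and the sample documents below are all assumptions made for illustration, not details from these slides.

```python
import math
from collections import Counter

# Hypothetical sample documents, used only for illustration.
DOCS = [
    "search engine ranking basics",
    "cooking pasta at home",
    "web search engine crawler index",
]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_documents(query, docs):
    """Return (score, doc_index) pairs, best match first."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    # Assumed smoothing: idf = log(N / df) + 1, so no term gets weight zero.
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}

    def vectorize(tokens):
        counts = Counter(tokens)
        # Terms unknown to the collection are ignored.
        return {t: counts[t] * idf[t] for t in counts if t in idf}

    query_vec = vectorize(query.lower().split())
    scores = [(cosine(query_vec, vectorize(doc)), i)
              for i, doc in enumerate(tokenized)]
    return sorted(scores, reverse=True)


ranked = rank_documents("search engine", DOCS)
```

Here the shorter document that contains both query terms ranks first, and the document with no query terms scores zero.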
PageRank in Google is defined as follows:
Assume page A has pages P1 ... Pn which point to it. The parameter d is a damping factor which can be set between 0 and 1. Also, C(Pi) is defined as the number of links going out of page Pi. The PageRank of a page A is given as follows:
PR(A) = (1 - d) + d (PR(P1)/C(P1) + ... + PR(Pn)/C(Pn))
Usually the parameter d is set to 0.85. PageRank, or PR(A), can be calculated using a simple iterative algorithm.
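The iterative calculation can be sketched as follows. This is a minimal illustration of the formula above, assuming a small hypothetical link graph, a starting value of 1.0 for every page, and a fixed number of iterations; it is not Google's actual implementation.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(Pi) / C(Pi)) over all pages.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    # C(P): number of links going out of page P.
    out_count = {p: len(links.get(p, [])) for p in pages}
    # Backlinks: for each page, the pages that point to it.
    backlinks = {p: [] for p in pages}
    for source, targets in links.items():
        for target in targets:
            backlinks[target].append(source)

    pr = {p: 1.0 for p in pages}  # assumed starting value
    for _ in range(iterations):
        pr = {
            p: (1 - d) + d * sum(pr[q] / out_count[q] for q in backlinks[p])
            for p in pages
        }
    return pr


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

In this toy graph, C ends up with the highest value because it receives links from both A and B, and A follows because its single backlink comes from the high-ranking C.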
Other features: anchor text processing, location information management and various data structures, which fully make use of the features of the web.
Many pages containing search terms may be of poor quality or irrelevant
Example: a page with just a line “search engine”.
Many high-quality or relevant pages do not even contain the search terms
Example: Google homepage
Pages containing more occurrences of the search terms are ranked higher, so spamming is easy
Example: a page with the line “search engine” repeated many times
Based on link structure
Hyperlinks among web pages provide new web search opportunities.
A backlink of a page p is a link that points to p
A page with more backlinks is ranked higher
Each backlink is a “vote” for the page’s importance
Pages pointed to by high-ranking pages are ranked higher
Definition is recursive by design
Web can be viewed as a huge directed graph G(V, E)
where V is the set of web pages (vertices) and E is the set of hyperlinks (directed edges).
Each page may have a number of outgoing edges (forward links) and a number of incoming links (backlinks).
Each backlink of a page represents a citation to the page.
PageRank is a measure of global web page importance based on the backlinks of web pages.
“Crawlers” or “Spiders” in Web…
The link structure of the Web serves to bind together all of the pages that were made public as a result of someone linking to them. Through links, search engines’ automated robots, called crawlers or spiders, can reach the many billions of interconnected documents.
robots.txt: prevents search engines from crawling your pages.
NOINDEX: prevents content from appearing in search results when "NOINDEX" is added to the robots meta tag.
.htaccess: password-protects directories.
Google Webmaster Tools: removes content that has already been crawled.
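As a minimal sketch of the robots.txt option, Python's standard `urllib.robotparser` module can check whether a given rule set blocks a URL. The rules, the example.com URLs, and the helper name `is_crawlable` are hypothetical and only illustrate the mechanism.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: block every crawler from /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_crawlable(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = is_crawlable(ROBOTS_TXT, "https://www.example.com/private/report.html")
allowed = is_crawlable(ROBOTS_TXT, "https://www.example.com/index.html")
```

Note that robots.txt only asks well-behaved crawlers not to fetch a page; content that must stay private still needs password protection.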
Google “Crawl Budget”…
Latest Update by “Matt Cutts”…
Factors affecting Crawl Budget:
1) PageRank: the number of pages that Google crawls is roughly proportional to the PageRank. Pages that get linked to a lot tend to get discovered and crawled quite quickly.
2) Host load: the maximum number of simultaneous connections that a particular web server can handle.
Low host load: allows only one page to be fetched at a time.
Social network sites like Facebook or Twitter have a very high host load because they can take a lot of simultaneous connections.
3) Content: crawlers discard web pages with duplicate content.
Use 301 redirects for duplicate URLs to merge them into one single URL.
Note: “301 redirects may result in a certain PageRank loss”.
How do Search Engines Rank Websites?
How Search Engines evaluate “trust in a Website”…
Key Factor: click distance between your website and the most trusted websites.
[Diagram: “Your website” and “Most trusted website”, separated by the click distance]
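Click distance can be read as the fewest link clicks needed to get from a trusted site to your site. A minimal sketch, assuming the web is available as a small link graph and using a breadth-first search; the site names and the function name are hypothetical, and real search engines do not publish how they compute this.

```python
from collections import deque


def click_distance(links, trusted, target):
    """Fewest clicks from any trusted page to target, or None if unreachable.

    links maps each page to the list of pages it links to.
    """
    queue = deque((page, 0) for page in trusted)
    seen = set(trusted)
    while queue:
        page, distance = queue.popleft()
        if page == target:
            return distance
        for neighbor in links.get(page, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, distance + 1))
    return None


# Hypothetical link graph.
web = {
    "trusted.example": ["hub.example"],
    "hub.example": ["yoursite.example"],
    "spam.example": ["yoursite.example"],
}
distance = click_distance(web, ["trusted.example"], "yoursite.example")
```

Under this reading, a smaller distance suggests more trust, and a page that no trusted site can reach has no distance at all.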
Search Engine “Retrieval and Ranking” Aspects…
Relevance: Degree to which the content of the documents returned in a search matches the user’s query intention and terms.
Importance or popularity: Relative importance, measured via citation (the act of one work referencing another, as often occurs in academic and business documents), of a given document that matches the user’s query.
Relative authority of the site, and the trust the search engine places in it
How to determine “Relevancy and Importance”
IR scientists realized that two critical components comprised the majority of search functionality: relevance and importance
Combination of relevance and importance determines the ranking order.
Popularity and relevance aren’t determined manually
Algorithms are used instead; their components are called “ranking factors” or “algorithmic ranking criteria”.
Analyzing Relevancy and Importance
Document analysis (including semantic analysis of concepts across documents)
Link (or citation) analysis.
Theories/Concepts Used:
Fuzzy Logic Theory
Latent Semantic Indexing (LSI)
What does “Semantic Connectivity” refer to…
Semantic connectivity or Co-occurrence refers to words or phrases that are commonly associated with one another.
For example, if you see the word aloha you associate it with Hawaii, not Florida.
Why care about “Co-occurrence”?
Brand visibility across search engines.
Co-citation of products and services.
Search volume co-occurrence (Co-Volume).
Positioning of documents in search engine results pages (SERPs).
Keywords research and terms discovery.
Analysis of seasonal trends.
Design of thematic sites.
Global; extracted from databases
Local; extracted from individual documents
Fractal; extracted from self-similar, scaled distributions
What matters when working with co-occurrence data…
scope; i.e., whether the words behave as broader or narrower terms in a given context.
type; i.e., whether we are dealing with nouns, verbs, adjectives, stems, etc
synonymity; i.e., whether we are dealing with synonyms.
architecture; i.e., whether the documents reside in a horizontal, topic-specific vertical, or regional directory
seasonality; i.e., whether we are dealing with repositories containing seasonal trends and periodic fluctuations.
sequencing; i.e., the order in which terms are queried or appear in documents.
polysemy; i.e., whether we are dealing with terms with multiple meanings
cognates; i.e., whether we are dealing with different terms with the same meaning in different languages.
query modes; i.e., the retrieval modes used.
“Broader” and “Narrower” Terms…
For the search queries “dog pet” and “dog canine”:
scenario 1: k1 = dog, k2 = canine
scenario 2: k1 = dog, k2 = pet
As of 06/16/05, searches in Google for these terms return
53,400,000 results for dog
55,800,000 results for pet
3,570,000 results for canine
"dog" and "pet" (broader terms) each return more results than "canine" (narrower term)
"canine" is considered a narrower term because:
there is a synonymity relationship between "canine" and "dog" but not between "canine" and "pet" or "pet" and "dog".
"canine" has different meanings (polysemy). According to WordNet, "canine" can be used as a noun or adjective, each having different meanings.
"canine" is one of those terms that possess a meaning within a meaning. Such terms behave as having a scope within a scope (or a context within a context, i.e., fractality), such as the canine of a canine.
“Global” Co-occurrence…
In Google search engine, the default query mode is “AND”
As of 06/16/05 searches in Google for these terms return
scenario 1: 12,800,000 for the query, k12 = k1 + k2 = dog pet
scenario 2: 1,710,000 for the query, k12 = k1 + k2 = dog canine
Both queries return fewer documents than either term searched alone.
The new result set n12, containing both k1 and k2, must be a subset of both n1 and n2; i.e., of the sets of documents returned for k1 alone and for k2 alone.
The term "dog" is more frequently co-cited with "pet" than with "canine" since:
in scenario 1 we are combining two broader terms.
in scenario 1 the terms are not synonyms.
in scenario 2 we are combining a broader term with a narrower term.
in scenario 2 the terms are synonyms and synonyms rarely occur together but appear in similar contexts.
“Normalized” Co-occurrence
Also known as the Co-Occurrence Index, or C-index.
For the co-citation frequency between exactly two terms k1 and k2, the C-index c12 behaves as follows:
c12 = 0 when n12 = 0; i.e., k1 and k2 do not co-occur (terms are mutually exclusive).
c12 > 0 when n12 > 0; i.e., k1 and k2 co-occur (terms are non mutually exclusive).
c12 = 1 when n12 = n1 = n2; i.e., k1 and k2 co-occur whenever either term occurs.
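The three properties above do not pin down a single formula on their own. As a minimal sketch, assume a Jaccard-style ratio, c12 = n12 / (n1 + n2 - n12), which satisfies all three; this choice is an assumption made for illustration. The result counts are the Google figures quoted earlier for dog, pet, and canine.

```python
def c_index(n1, n2, n12):
    """Assumed Jaccard-style C-index: shared results over combined results."""
    combined = n1 + n2 - n12  # documents containing k1 or k2 (or both)
    return n12 / combined if combined else 0.0


# Result counts quoted earlier (as of 06/16/05).
DOG, PET, CANINE = 53_400_000, 55_800_000, 3_570_000
DOG_PET, DOG_CANINE = 12_800_000, 1_710_000

c_dog_pet = c_index(DOG, PET, DOG_PET)
c_dog_canine = c_index(DOG, CANINE, DOG_CANINE)
```

Under this assumed formula, "dog pet" scores roughly 0.13 and "dog canine" roughly 0.03, which matches the earlier observation that "dog" is co-cited more often with "pet" than with "canine".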
“Syntagmatic and Paradigmatic Association” theory
Syntagmatic associations are terms that frequently occur together.
Paradigmatic associations are terms with high semantic similarity.
These types of associations allow us to understand why synonyms do not tend to co-occur. This has a lot to do with contextuality or lexical neighborhoods.
Fuzzy Set Theory…
Discovers the semantic connectivity between two words.
e.g., both oranges and bananas are fruits, but oranges and bananas are not both round.
A machine can learn that an orange is round and a banana is not by scanning thousands of occurrences of the words banana and orange in its index and noting that round and banana do not co-occur much, while orange and round do.
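The orange/banana example can be sketched by counting how often two words appear in the same sentence. This is only a toy illustration of co-occurrence counting, with a hypothetical four-sentence corpus and hypothetical function names; it is not an implementation of fuzzy set theory itself.

```python
from collections import Counter
from itertools import combinations

# Hypothetical mini-corpus standing in for a search engine's index.
SENTENCES = [
    "the orange is round and sweet",
    "a round orange rolled away",
    "the banana is long and yellow",
    "orange and banana are fruit",
]


def cooccurrence_counts(sentences):
    """Count, for every pair of words, the sentences that contain both."""
    counts = Counter()
    for sentence in sentences:
        words = sorted(set(sentence.lower().split()))
        # Each pair comes out in alphabetical order because words is sorted.
        counts.update(combinations(words, 2))
    return counts


def cooccur(counts, a, b):
    """Look up the count for an unordered word pair."""
    return counts[tuple(sorted((a, b)))]


counts = cooccurrence_counts(SENTENCES)
orange_round = cooccur(counts, "orange", "round")
banana_round = cooccur(counts, "banana", "round")
```

In this corpus "orange" and "round" share two sentences while "banana" and "round" share none, which is the kind of signal the slide describes.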
Latent Semantic Indexing (LSI)
LSI (Latent Semantic Indexing) goes a step beyond the fuzzy set approach by using semantic analysis to identify related web pages.
e.g., the search engine may notice one page that talks about doctors and another one that talks about physicians, and determine that there is a relationship between the pages based on the other words they have in common.
Common types of searches in the IR field.
Identifying Authority of Links.
Identifying Relevancy of Links.
Link neighborhood: the concept of grouping sites based on their relevance.
Placement of Links
Top elements of SEO
Meta keyword tag
Alt attribute for images: the alt attribute was originally intended to allow something to be rendered when viewing the image is not possible
Content: defines what a page is about.
Act as navigational elements for the search engines during crawl and to do a detailed analysis of each web page
The search engine performs a detailed analysis of all the words and phrases that appear on a web page, and then builds a map of that data to consider when deciding whether to show your page in the results for a related search query. This map is referred to as a semantic map.
Defines the relationships between web pages so that the search engine can better understand how to match the right web pages with user search queries.
Google working on new techniques
Although search engines are able to detect that you are displaying an image, they have little idea what the image is a picture of, except for whatever information you provide them in the alt attribute
Search engines will not recognize any text rendered in the image
Optical Character Recognition (OCR): to extract text from images
Search engines are beginning to extract information from Flash
A third type of content that search engines cannot see is the pictorial aspects of anything contained in Flash.
When text is converted into a vector-based outline (i.e., rendered graphically), the textual information that search engines can read is lost.
Audio and video files are also not easy for search engines to read. There are a few exceptions where the search engines can extract some limited data, such as ID3 tags within MP3 files.
Search engines also cannot read any content contained within a program
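Because the alt attribute is the main information a search engine gets about an image, a page can be checked for images that lack one. A minimal sketch using Python's standard `html.parser` module; the class name, function name, and sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Collect the src of every <img> tag with no alt text."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attributes = dict(attrs)
            # An absent or blank alt gives search engines nothing to read.
            # (A blank alt can be deliberate for purely decorative images.)
            if not (attributes.get("alt") or "").strip():
                self.missing.append(attributes.get("src"))


def images_missing_alt(html):
    """Return the src values of images that have no usable alt text."""
    finder = MissingAltFinder()
    finder.feed(html)
    return finder.missing


# Hypothetical page fragment.
PAGE = (
    '<img src="logo.png" alt="Company logo">'
    '<img src="chart.png">'
    '<img src="spacer.gif" alt="">'
)
missing = images_missing_alt(PAGE)
```

The sketch reports `chart.png` and `spacer.gif`; whether a blank alt is a problem depends on whether the image carries meaning.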
Moving Ahead with AJAX
AJAX is another technology that can present significant human-readable content that the search engines cannot see.
Positive Ranking Factors
Keyword use in title tag
Anchor text of inbound link
Global link authority of site
Age of site
Link popularity within the site’s internal link structure
Topical relevance of inbound links
Link popularity of site in topical community
Keyword use in body text
Global link popularity of sites that link to the site
Negative Ranking Factors
Server is often inaccessible to crawlers
Search engines want their users to have good experiences. If your site is subject to frequent outages, by definition it is not providing a good user experience. So, if the search engine crawler frequently is unable to access your web pages, the search engine will assume that it is dealing with a low-quality site.
Content very similar to or duplicate of other web pages
External links to low-quality/spam sites
Participation in link schemes or actively selling links
Duplicate titles/meta tags on many pages
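Duplicate titles are one of the easier negative factors to check for on your own site. A minimal sketch, assuming the pages are already available as a mapping from URL to HTML and using a simple regular expression to pull out the title; the function name and the sample pages are hypothetical, and a real audit would use a proper HTML parser.

```python
import re
from collections import defaultdict

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def duplicate_titles(pages):
    """Group URLs by normalized <title> and keep titles used more than once.

    pages maps each URL to its HTML source.
    """
    by_title = defaultdict(list)
    for url, html in pages.items():
        match = TITLE_RE.search(html)
        if match:
            # Normalize whitespace and case before comparing.
            title = " ".join(match.group(1).split()).lower()
            by_title[title].append(url)
    return {title: urls for title, urls in by_title.items() if len(urls) > 1}


# Hypothetical site with two pages sharing a title.
SITE = {
    "/a": "<html><head><title>Home</title></head></html>",
    "/b": "<html><head><title> home </title></head></html>",
    "/c": "<html><head><title>Contact</title></head></html>",
}
duplicates = duplicate_titles(SITE)
```

Here `/a` and `/b` are flagged because their titles match once case and whitespace are normalized.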
Other Ranking Factors
Rate of acquisition of links
Have Some “Google Caffeine”
a next-generation architecture for Google’s web search
A ranking system that heightens the importance of page load speeds
A more focused relevance on real-time search data
Stricter spam controls
Changes with Google Caffeine
Changes in how Google stores the massive amount of data gathered by their robots.
This is a direct response to the rise in new digital media such as streaming videos, blog posts, and social media content (Twitter, Facebook). The old Google infrastructure was built to handle data by way of Collection > Quality Ranking > Sandbox > Indexing. However, with the explosion of real-time content, search engines face the daunting task of filtering all this content to provide real-time search.
Changes in how Google collects its data
Google uses robots (Googlebot) that crawl the web for data; traditionally, this is data that does not change or update in real time. The Caffeine update must include changes to the robot to cater for real-time content. The current theory is that Google has developed several types of robots that differ in their indexing rate and crawl rate to cater for different media content.
Google’s New Algorithm “Caffeine”
an increased weighting on domain authority and on some authoritative tag-type pages (like Technorati tag pages and Facebook tag pages), as well as pages on sites like Scribd ranking for some long-tail queries based mostly on domain authority and somewhat spammy on-page text
perhaps slightly more weight on exact match domain names
perhaps a bit better understanding of related words / synonyms
tuning down some of the exposure for video & some universal search results
the new search engine improves the index size, the speed of the queries and most importantly, changes the value of search engine rankings.
A search on the new infrastructure, for instance, returns video and news results midway down the page.
A search on the existing infrastructure, however, returns news at the top, video in the middle, and images at the bottom of the page.
Tools to evaluate speed of the site.
Page Speed: An open source Firefox/Firebug add-on that evaluates the performance of web pages and gives suggestions for improvement.
YSlow: A free tool from Yahoo! that suggests ways to improve website speed.
WebPageTest: Shows a waterfall view of your pages’ load performance plus an optimization checklist.
In Webmaster Tools, Labs > Site Performance shows the speed of your website as experienced by users around the world as in the chart below.