SEO and Analytics
SEO: SEO Introduction, Search Engine basics, Technology Considerations, Tweaking your Content, Promoting Web Pages, Tools for Web Masters
Analytics: Analytics Introduction, Analytics – methods, Tools for Analytics, Some Key Terminologies
SEO – what’s that???
Search Engine Optimization (SEO) has been a buzzword since the advent of the major search engines. SEO covers the best practices that make it easier for search engines to crawl, index and understand the content on your web page.
How do search engines work?
Spiders (also called robots) comb the web by following links. The search engine formats the data it finds and stores it in its database. All search engines maintain extensive and highly indexed databases.
SEO – what’s that???
Results are indexed using complex algorithms that depend on a large number of parameters. Web masters have spent years analyzing the behavior of the major search engines, so there is a considerable knowledge base on what makes pages more search engine friendly.
Paid and Organic Search Results
Many search engines have launched paid services such as Google AdWords. The organic search results are the ones that are not influenced by paid or sponsored programs. SEO applies to the organic results. It normally has no impact on the results shown from sponsored links.
User-Agent HTTP Header
Most web sites make heavy use of the User-Agent HTTP header to determine who is requesting the page. Often the web site's behavior is altered depending on what is passed in the User-Agent field. Typical applications are:
• Changing the CSS for IE and Firefox (the (in)famous browser incompatibility issues).
• Forwarding a user to a mobile version of the web site if the user agent is a mobile device.
The common Robot user agents The following are the most famous Robot user agent strings
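A minimal sketch of how a server-side tool might recognize a robot from its User-Agent header, for example to exclude robot hits from traffic statistics. The token list and the function name `is_robot` are assumptions made for illustration; real crawlers publish their own, longer user agent strings.

```python
# Hypothetical, incomplete list of lower-case substrings found in the
# User-Agent headers of some well-known crawlers.
KNOWN_ROBOT_TOKENS = ("googlebot", "bingbot", "slurp")


def is_robot(user_agent: str) -> bool:
    """Return True if the User-Agent header contains a known crawler token."""
    ua = user_agent.lower()
    return any(token in ua for token in KNOWN_ROBOT_TOKENS)
```

A substring match like this only catches robots that identify themselves honestly, so it is best treated as a rough filter.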
Cloaking
Cloaking was a very popular SEO technique in the early days. It is a simple way of disguising your web site as a different, text-based site (with a lot of keywords sprinkled all over) when a request comes from a web robot (spider). Most spiders are identifiable by their User-Agent headers. For example, the Google robot is called “Googlebot”. As search engines strengthened their spam detection technologies, they started penalizing “cloaked” web sites, often by removing them altogether from their indices. Today, cloaking is not a recommended practice and should be avoided in all scenarios.
URL Structure
Simple-to-understand URLs convey content information easily. They are easier for users as well as crawlers to organize. Crawlers typically give lower indexing priority to URLs containing arbitrary numbers and characters. The PageRank (TM – Google Inc.) algorithm gives a lot of weight to the number of pages that link to your page. If your URLs are simpler, it is easier for users to link to your page. If your URL contains relevant words, it gives users and search engines more information about the page than an ID or oddly named parameter would.
URL best practices
• Avoid lengthy URLs with unnecessary parameters and session IDs.
• Avoid generic page names like "page1.html".
• Keep the directory nesting as simple as possible.
• Keep the directory names relevant to the content provided in the directory.
• Avoid using numbers for directory names.
• Do not mix upper and lower case in URLs (like CreateOrder.html). Users prefer a single case, and it should always be lower case.
URL best practices
Web sites should be as flat as possible, with content relating to highly competitive keywords placed on pages high in the hierarchy. Rewrite URLs on the server side to make them simpler and less nested. Note that search engines tend to assign a lower relevance score to content that is nested deep inside the web site. The content in the top folders is considered much more relevant.
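The URL guidelines above (relevant words, lower case only, no odd characters) can be sketched as a small helper that derives a URL segment from a page title. The function name `slugify` and the choice of hyphens as separators are assumptions for illustration, not something the slides prescribe.

```python
import re


def slugify(title: str) -> str:
    """Turn a page title into a lower-case, hyphen-separated URL segment."""
    slug = title.strip().lower()
    # Replace every run of characters other than a-z and 0-9 with one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", slug)
    # Drop hyphens left over at either end.
    return slug.strip("-")
```

For example, `slugify("Create Order")` gives `create-order`, which could replace a mixed-case page name such as CreateOrder.html.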
Canonical URL
More often than not, there are multiple ways to reach the same page on a web site. Canonicalization is the process of picking the best URL when there are several choices, and it usually refers to the homepage of a web site. For example, consider http://www.google.com and http://google.com. Both URLs provide the same content. Another example is “domain.com/aboutus.htm” and “blog.domain.com/aboutus.htm”. Search engines are usually intelligent enough to recognize that the content on the pages is the same, and they pick one of the URLs, which might not be our preferred one.
Canonical URL – best practices
There are a few ways to ensure that the proper URL is indexed:
• When linking to your homepage, always point to the same URL.
• When requesting links from other sites, always point to the same URL.
• Redirect the non-www homepage to the www version of the homepage using a 301 permanent redirect.
A 301 redirect example (JSP) is shown below.
<%
    // Send a permanent (301) redirect to the canonical URL.
    response.setStatus(HttpServletResponse.SC_MOVED_PERMANENTLY);
    response.setHeader("Location", "http://www.new-url.com/");
    response.setHeader("Connection", "close");
%>
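The decision behind such a redirect can be sketched independently of the server technology. The snippet below is only an illustration: the host name `www.example.com` and the function name `canonical_redirect` are assumptions, and a real site would normally configure this rule in the web server.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_redirect(url: str, canonical_host: str = "www.example.com"):
    """Return (301, target_url) if the request should be redirected, else None."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host == canonical_host:
        return None  # already on the canonical host
    if "www." + host == canonical_host:
        # Same site without the "www." prefix: redirect permanently,
        # keeping the path and query string.
        target = urlunsplit(
            (parts.scheme, canonical_host, parts.path or "/", parts.query, "")
        )
        return 301, target
    return None  # some other host; not handled by this sketch
```

Keeping the path and query in the target means deep links survive the redirect instead of all landing on the homepage.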
HTTP 301 & HTTP 302
302 is a temporary redirect and 301 is the permanent redirect. As far as possible, use only 301 for redirection (explained on the previous slide). Always redirect from the server (sample on the previous slide). A 302 redirect indicates that the move is temporary and will change in the near future, so the popularity attained by the previous site or page will not be passed on to the new one. A 301 permanent redirect should be used when the change is long-term or permanent, which allows PageRank and link popularity to transfer. The indexing engines of all major search engines take care of this.
Name Value pairs in URLs
Name-value pairs are used in URLs to provide the information needed to produce dynamic content. URLs tend to become lengthy with name-value pairs. They often contain numbers, which search engines typically treat as junk. Further, a parameter name like “prod_code” does not make any sense to an ordinary user; a product name would have been better. Use valuable keywords in the name-value pairs whenever possible, and keep the number of pairs to no more than three.
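As a rough sketch, the two rules of thumb above (no more than three pairs, keywords rather than bare numbers) can be checked mechanically. The function name `query_pair_warnings` and the wording of the messages are hypothetical.

```python
from urllib.parse import parse_qsl, urlsplit

MAX_PAIRS = 3  # limit suggested in the text above


def query_pair_warnings(url: str) -> list:
    """List the ways a URL's name-value pairs break the guidelines."""
    pairs = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    warnings = []
    if len(pairs) > MAX_PAIRS:
        warnings.append(f"{len(pairs)} name-value pairs (more than {MAX_PAIRS})")
    for name, value in pairs:
        if value.isdigit():
            # A bare number tells neither users nor crawlers anything.
            warnings.append(f"{name}={value} is a bare number")
    return warnings
```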
User Input Fronting Screens
Many sites have a front page where you must enter your location or other details before the site gives you information about products. Search engines cannot enter information or make selections from form drop-downs. This means search engine spiders are effectively locked out of the relevant content and cannot index or rank it. A related problem is a splash screen with a country chooser that does not let people go beyond that page without selecting a country to set the locale. It is better to start with a default locale, let the visitor in, and then offer an option to change it. With such a design the robot will be able to index your pages.
Provide alternative to Flash content
Spiders cannot read Flash content. Links embedded in Flash are never navigated or indexed. If you cannot do away with Flash for usability reasons, implement a site with the same links in HTML. You can implement user-agent detection to deliver the HTML site to spiders and the Flash version to human visitors, but keep the links and content identical in both versions, because serving spiders different content is the cloaking practice described earlier.
Excessive In-page Scripting
Web crawlers typically limit the amount of content they index from a page, often to about 100 KB of data. If you have too much in-page scripting, the only thing the search engine might see is the script on your page, and some of your content will be ignored once the limit is reached. Crawlers ignore the <script> tag, but the total content read (100 KB) includes the scripts as well. It is always sensible to keep your scripts in a separate file and include it in your page. This way you do not risk running into the crawler's content limit and can still write a lot of code for dynamic behavior.
Session IDs on the URL
Some web servers put a unique session ID variable in the URL for each visit for tracking purposes. A search engine spider revisiting a URL is assigned a different session ID on each visit, so each visit to a page appears as a unique URL. This causes indexing inconsistencies and possibly duplicate content penalties. Implement user-agent detection to remove the session IDs for search engine visits.
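A minimal sketch of removing session IDs from a URL before it is linked or served. The parameter names in `SESSION_PARAMS` are common conventions used here as assumptions; the actual names depend on your application server.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical set of query parameter names that carry a session ID.
SESSION_PARAMS = {"jsessionid", "phpsessid", "sid", "sessionid"}


def strip_session_id(url: str) -> str:
    """Return the URL without session ID parameters."""
    parts = urlsplit(url)
    # Java servlet containers append ";jsessionid=..." to the path.
    path = parts.path.split(";jsessionid=")[0]
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, path, urlencode(kept), parts.fragment)
    )
```

With the session ID stripped, every visit by a spider sees the same URL for the same page.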
“nofollow” settings
Setting the value of the "rel" attribute of a link to "nofollow" tells search engine robots that the link should not be followed and should not pass your page's reputation to the page linked to. This is especially relevant for pages that allow user comments. Say you are a famous company and allow people to post feedback on your blog. Always set “nofollow” to avoid a scenario like the following.
Sample: <a href="http://www.cheapdrugs123.com" rel="nofollow">Comment by a spammer</a>
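One way to make sure every user-submitted link carries "nofollow" is to generate the anchor tag in one place. This is a sketch only; the function name `comment_link` is hypothetical, and a real comment system would also validate the URL.

```python
from html import escape


def comment_link(url: str, text: str) -> str:
    """Render a user-submitted link so that robots do not follow it."""
    # Escape both values so the comment cannot inject its own markup.
    return '<a href="{}" rel="nofollow">{}</a>'.format(
        escape(url, quote=True), escape(text)
    )
```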
404 pages
Pages or content that are moved, removed or changed can result in errors such as a 404 Page Not Found. A custom 404 page that kindly guides users back to a working page on your site can greatly improve the user's experience.
• Your 404 page should have a link back to your root page and could also provide links to popular or related content on your site.
• Never allow your 404 pages to be indexed by search engines.
• Do not use a design for your 404 pages that is inconsistent with the rest of your site.
• Repair all broken links as soon as possible.
The <title> tag
Most search engines give a lot of weight to the content of the <title> HTML tag. A title tag tells both users and search engines what the topic of a particular page is. The <title> tag should be placed within the <head> tag of the HTML document. Ideally, you should create a unique title for each page on your site.
<title> tag tips
• Always put a sensible title on every page.
• Do not repeat the same text on all pages or on a group of pages unless it makes sense.
• Make sure all your important business keywords are reflected in the title.
• Never choose a title that has no relation to the content on the page.
• Never use default or vague titles like "Untitled" or "New Page 1".
• Google displays about 63 characters of the page title in the search results, which means the first 63 characters should contain all the relevant detail you need.
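The tips above lend themselves to a simple audit over a site's titles. The sketch below assumes you already have a mapping from URL to title text; the function name `title_problems` and the issue labels are hypothetical, and the 63-character limit is the figure quoted on this slide.

```python
from collections import Counter

DISPLAY_LIMIT = 63  # characters shown in results, per the slide
VAGUE_TITLES = {"untitled", "new page 1"}


def title_problems(titles: dict) -> dict:
    """Map each URL to the list of issues found with its <title> text."""
    counts = Counter(title.strip().lower() for title in titles.values())
    problems = {}
    for url, title in titles.items():
        key = title.strip().lower()
        issues = []
        if not key or key in VAGUE_TITLES:
            issues.append("vague or empty")
        if counts[key] > 1:
            issues.append("duplicate")
        if len(title) > DISPLAY_LIMIT:
            issues.append("longer than display limit")
        if issues:
            problems[url] = issues
    return problems
```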
<meta> tags
A page's description meta tag gives search engines a summary of what the page is about.
• Limit descriptions to 250 characters.
• Include all targeted key phrases.
• Write the copy with users in mind (the description copy appears in search results).
• Create a unique meta description for every page.
<meta> keywords tag
Keywords are specified in the head section of the HTML. Google gives very little importance to this tag. Bing and Yahoo give it some importance, so it still makes sense to specify it. Search engines normally do not display this content in the search results. Use only relevant phrases in this tag, and use distinct phrases for each page.
Header tags <h1>, <h2>, <h3>
Search engines give a lot of importance to the content that appears inside the header tags. Use strictly one <h1> tag per page, for the most important heading on the page. Use <h2> and <h3> tags for the other relevant headings. Always keep the natural hierarchy: first h1, then h2, and then h3.
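The two heading rules (exactly one <h1>, no skipped levels) can be checked with the standard library HTML parser. This is a sketch under the assumption that the page is available as a string; the names `HeadingCollector` and `heading_problems` are hypothetical.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Record the level of every heading tag in document order."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self.levels.append(int(tag[1]))


def heading_problems(html: str) -> list:
    """Report a wrong number of <h1> tags and any skipped heading levels."""
    collector = HeadingCollector()
    collector.feed(html)
    levels = collector.levels
    problems = []
    if levels.count(1) != 1:
        problems.append(f"expected exactly one <h1>, found {levels.count(1)}")
    for previous, current in zip(levels, levels[1:]):
        if current > previous + 1:
            problems.append(f"<h{current}> follows <h{previous}> and skips a level")
    return problems
```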
Importance of Anchor text
Anchor text is the clickable text that users see for a link, and it is placed within the anchor tag <a href="..."></a>.
e.g. <a href="http://www.mydomain.com/articles/our-prices.htm">Lowest prices on earth for international calls</a>
This text tells search engines something about the page you are linking to.
• Avoid generic anchor text like "page", "article" or "click here".
• Avoid text that is off-topic or has no relation to the content of the page linked to.
• Avoid CSS or text styling that makes links look just like regular text.
Duplication of Content
Duplicate content exists when two or more pages within a web site, or on different domains, share identical content. Different domain names do not create distinct content, for example:
company.com/aboutus.html
blog.company.com/aboutus.html
Major search engines consider duplicate content to be spam and are continually improving their spam filtering to penalize and remove offenders. Avoid duplication of content as far as possible. Use 301 permanent redirects to inform search engines of the proper URL to use.
Optimizing image content
Images form an integral part of any web site. The "alt" attribute lets you specify alternative text for the image in case it cannot be displayed. This is a very important usability aspect, because the screen reader programs used by blind people identify the alt text and read it out. Another reason is that if you use an image as a link, the alt text for that image is treated similarly to the anchor text of a text link. Optimizing your image filenames and alt text makes it easier for image search services like Google Image Search to understand and rank the images on your web site.
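A short sketch of how missing alt text might be detected on a page, again using the standard library parser. The names `AltChecker` and `images_missing_alt` are assumptions. Note that the sketch also flags images with an empty alt attribute, which can be legitimate for purely decorative images.

```python
from html.parser import HTMLParser


class AltChecker(HTMLParser):
    """Collect the src of every <img> that has no usable alt text."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attributes = dict(attrs)
            # Treat a missing, empty or whitespace-only alt as missing.
            if not (attributes.get("alt") or "").strip():
                self.missing.append(attributes.get("src", ""))


def images_missing_alt(html: str) -> list:
    checker = AltChecker()
    checker.feed(html)
    return checker.missing
```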
The robots.txt file
Web site owners use the /robots.txt file to give instructions about their site to web robots; this is called the Robots Exclusion Protocol. A sample can be seen here: http://www.robotstxt.org/robotstxt.html
All major search engine robots read this file to see which pages they may crawl. The Disallow directives specify which pages should be ignored by the crawler. A robots.txt file typically contains entries such as:
Disallow: /residential/customerService/
Disallow: /residential/customerService/contacts.html
Disallow: /residential/customerService/contactus/billing.html
The robots.txt file
There are some important considerations when using /robots.txt:
• Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities and email address harvesters used by spammers will pay no attention to it.
• The /robots.txt file is publicly available. Anyone can see which sections of your server you do not want robots to use.
You could put all the files you do not want robots to visit in a separate subdirectory, make that directory un-listable on the web (by configuring your server), and list only the directory name in /robots.txt. An ill-willed robot then cannot traverse that directory unless someone puts a direct link to one of your files on the web, and that is not the fault of /robots.txt.
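To see how a well-behaved robot interprets such rules, Python's standard `urllib.robotparser` module can be used. The sketch below adds a `User-agent: *` line, which is an assumption (the sample above shows only the Disallow lines), and the helper name `allowed` is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Rules taken from the sample above; the "User-agent: *" line is assumed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /residential/customerService/
"""


def allowed(path: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the rules permit this user agent to fetch the path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)
```

Because matching is by path prefix, the single directory rule already covers every page beneath it, so the per-page Disallow lines in the sample are redundant.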
Linking your websites
Internal linking between pages within a web site, such as navigational elements or a site map, plays an important role in how search engines perceive the relevance and theme of both the linking and the linked page. Proper intra-site linking helps spidering and increases the relevance of pages.
• Maintain a sitemap.
• Keep sitemap pages to fewer than 100 links per page.
• Link sitemaps directly from the homepage and from other major pages throughout the web site.
Promotion through external channels
Effectively promoting your new content leads to faster discovery by those who are interested in the same subject. Increasing back-links to your site is one option, but it should be done properly. Social media activity (e.g. a Facebook Like) adds to your link count. It is typically not advisable to promote every small update in this fashion, as search engines nowadays recognize such patterns. You could include your updates in an RSS feed, or have them linked from the blogs of people in the related community. Today's search engines do not rely on PageRank alone to determine relevance; it also depends on traffic and content.
Webmaster tools
Every major search engine has launched its own set of webmaster tools:
Google: http://www.google.com/webmasters/
Yahoo: http://siteexplorer.search.yahoo.com/
Bing: http://www.bing.com/toolbox/webmasters/
We will examine some of the most important tools which Google provides.
Webmaster tools
Google provides the following services:
• see which parts of a site Googlebot had problems crawling
• notify Google of an XML Sitemap file
• analyze and generate robots.txt files
• remove URLs already crawled by Googlebot
• specify your preferred domain
• identify issues with title and description meta tags
• understand the top searches used to reach a site
• get a glimpse of how Googlebot sees pages
• remove unwanted site links that Google may use in results
• receive notification of quality guideline violations and request a site reconsideration
Web Analytics - Introduction
Web analytics is the measurement, collection, analysis and reporting of internet data for the purposes of understanding and optimizing web usage. It is a very important tool for business and market research. Web analytics provides data on the number of visitors, page views and so on, which is used to gauge traffic and popularity trends for market research. There are predominantly two types: off-site and on-site.
Web Analytics - Introduction
Off-site web analytics refers to web measurement and analysis regardless of whether you own or maintain a web site. It includes the measurement of a web site's potential audience (opportunity), share of voice (visibility) and buzz (comments) on the Internet as a whole. On-site web analytics measures a visitor's journey once on your web site. This includes its drivers and conversions, for example which pages encourage people to make a purchase. On-site web analytics measures the performance of your web site in a commercial context.
Methods for measuring: Log file analysis
All web servers record most of their transactions in a log file (the access log, in the case of Apache). Log file analysis was the most prominent method when the web evolved in the late 90s. It involved running a tool that identified the hits to a page from the log file and derived statistics from them. It became very inaccurate in later times, as there are thousands of “non-human” actors on the web today. Googlebot is an example.
Methods for measuring: Log file analysis – contd.
The tools adapted to robots by measuring hits based on cookie tracking and by ignoring the known robots. This is not fully practical, as robots are written not only by search engines but also by spammers. Log file analysis also failed when users enabled their browser caches. Pages were cached in the browser, and when the user requested the same pages again, the content was delivered from the cache and no hit was made on the web server.
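A minimal sketch of log file analysis as described above: count successful page hits from an access log while ignoring known robots. It assumes the Apache "combined" log format; the robot token list and the function name `count_human_hits` are hypothetical.

```python
import re
from collections import Counter

# One line of the Apache "combined" log format (assumed here).
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical, incomplete list of known robot tokens.
ROBOT_TOKENS = ("googlebot", "bingbot", "slurp")


def count_human_hits(lines) -> Counter:
    """Count status-200 hits per path, skipping known robots."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # not in the expected format
        agent = match.group("agent").lower()
        if any(token in agent for token in ROBOT_TOKENS):
            continue  # known robot
        if match.group("status") == "200":
            hits[match.group("path")] += 1
    return hits
```

The weaknesses named on the slide still apply: robots that do not identify themselves are counted as humans, and pages served from a browser cache never appear in the log at all.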
Methods for measuring: Page tagging
Page tagging was developed during later stages of the web. It embeds a JavaScript code segment in the page. When a tracking operation is triggered, the script collects data from the HTTP request, browser and system information, and cookies. The script submits the data as parameters attached to an image request (for a single-pixel image) sent to the analytics server. For example, take a look at the Google Analytics data collection request which gets sent out:
http://www.google-analytics.com/__utm.gif?utmwv=4&utmn=769876874&utmhn=example.com&utmcs=ISO-8859-1&utmsr=1280x1024&utmsc=32-bit&utmul=en-us&utmje=1&utmfl=9.0%20%20r115&utmcn=1&utmdt=GATC012%20setting%20variables&utmhid=2059107202&utmr=0&utmp=/auto/GATC012.html?utm_source=www.gatc012.org&utm_campaign=campaign+gatc012&utm_term=keywords+gatc012& …..etc…..
Methods for measuring: Page tagging contd.
After the advent of XHR (XMLHttpRequest), some page tagging scripts have used an AJAX submission of user data to the collection server. This is often bound to fail because of the same-origin restrictions that most modern browsers place on XHR. As the page tagging approach involves downloading a one-pixel image from another domain (like Google), it adds an additional DNS (Domain Name System) lookup to your page, which is sometimes seen as slowing down page loading.
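The encoding step of page tagging can be sketched as follows. Real page tags run as JavaScript in the browser; this Python version only illustrates how collected values become parameters on an image request and how the collection server can read them back. The collector address and the parameter names are hypothetical and are not the Google Analytics ones.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical collection endpoint serving a single-pixel image.
COLLECTOR = "http://analytics.example.com/pixel.gif"


def build_beacon_url(hostname, page, title, screen, language) -> str:
    """Attach the collected values to the image request as parameters."""
    params = {
        "host": hostname,
        "page": page,
        "title": title,
        "screen": screen,
        "lang": language,
    }
    return COLLECTOR + "?" + urlencode(params)


def read_beacon(url: str) -> dict:
    """What the collection server does: decode the parameters again."""
    return {name: values[0] for name, values in parse_qs(urlsplit(url).query).items()}
```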
Page tagging is the new Analytics
Page tagging is the de facto standard today. It has the significant advantage that it works even for pages hosted in the cloud, meaning that you do not need to have dedicated web servers and monitor their logs. Analytics today is mostly an outsourced service. There are many specialist providers like Google and Adobe, and page tagging is the only method supported there.
Major tools – Web Analytics
Google Analytics
• Free from Google (5M page view cap per month for non-AdWords advertisers).
• Uses page tagging as the analytics method.
• The user embeds a script in the page.
• The script collects information on the page actions and submits it to the analytics server as parameters on an image fetch.
• Detailed reports are presented to the user after logging in to a Google account.
Web Analytics KPIs
KPIs are those metrics which give information on what changes could make your web site more effective. All KPIs are metrics, but not all metrics are KPIs. In web analytics it is critical to measure the right things.
First and Third Party Cookies
First-party cookies are cookies that are associated with the host domain. Third-party cookies are cookies from any other domain. For example, you go to the site http://yahoo.com, and there is a banner ad on this site for http://youbuy.com. Both yahoo.com and youbuy.com place cookies on your browser. For you, the cookie from yahoo.com is a first-party cookie and the one from youbuy.com is a third-party cookie.
First and Third Party Cookies
If we placed the Google Analytics script on our page http://mozvo.com and it set a cookie for the domain “google.com”, that would be a third-party cookie. Third-party cookies are widely discouraged, as quite a few sites plant tracker cookies. A lot of users (about 40%) disable third-party cookies. All of the analytics providers have switched to using first-party cookies to track information, which means that the user will see only cookies from mozvo.com even though the Google Analytics code is embedded on the page.
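The distinction can be expressed as a simple comparison between the host of the page being viewed and the domain a cookie is set for. This is a simplified sketch with a hypothetical function name; browsers apply more elaborate rules (for example, public suffix handling) that are not modeled here.

```python
def cookie_party(page_host: str, cookie_domain: str) -> str:
    """Classify a cookie relative to the page the user is visiting."""
    host = page_host.lower().rstrip(".")
    domain = cookie_domain.lower().lstrip(".")
    # First-party: the cookie domain is the page host or a parent of it.
    if host == domain or host.endswith("." + domain):
        return "first-party"
    return "third-party"
```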
Bounce Rate and Click-Through Rate
Bounce rate: the bounce rate for the homepage, or for any other page through which visitors enter your site, tells you what share of people 'bounce' away (leave your site) after viewing only that one page. A low bounce rate is preferred.
Click-through rate: the click-through rate (or click-thru rate) tells you what share of the people who see a link to your site on a third-party source, such as a search engine, banner, advertisement or email campaign, actually click through to it. A higher click-through rate is preferred.
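Both figures are usually reported as percentages. The sketch below uses the common definitions (single-page visits divided by entries, and clicks divided by impressions); these formulas are an assumption, since the slide describes the metrics only in words, and the function names are hypothetical.

```python
def bounce_rate(single_page_visits: int, total_entries: int) -> float:
    """Percentage of visits entering on a page that viewed only that page."""
    if total_entries == 0:
        return 0.0
    return 100.0 * single_page_visits / total_entries


def click_through_rate(clicks: int, impressions: int) -> float:
    """Percentage of times a link or ad was shown that led to a click."""
    if impressions == 0:
        return 0.0
    return 100.0 * clicks / impressions
```

For instance, 40 single-page visits out of 100 entries is a bounce rate of 40%, and 5 clicks on 200 impressions is a click-through rate of 2.5%.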
Click Stream Analysis
Clickstreams, also known as clickpaths, are the routes that visitors choose when clicking or navigating through a site. A clickstream is a list of all the pages viewed by a visitor, presented in the order the pages were viewed; it is also defined as the ‘succession of mouse clicks’ that each visitor makes. A clickstream shows you when and where a person came in to a site, all the pages viewed, the time spent on each page, and when and where they left. The most obvious reason for examining clickstreams is to extract specific information about what people are doing on your site.
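A minimal sketch of building clickstreams from raw page view records. It assumes each record is a (visitor id, timestamp in seconds, page) tuple, which is a hypothetical input format; the function name `clickstreams` is likewise an assumption.

```python
from collections import defaultdict


def clickstreams(page_views) -> dict:
    """Group page views by visitor and order them by time."""
    streams = defaultdict(list)
    for visitor, timestamp, page in sorted(page_views, key=lambda v: (v[0], v[1])):
        streams[visitor].append((timestamp, page))

    result = {}
    for visitor, views in streams.items():
        pages = [page for _, page in views]
        # Time on a page is the gap until the next view; it is unknown
        # for the last page, so that list is one entry shorter.
        seconds = [nxt[0] - cur[0] for cur, nxt in zip(views, views[1:])]
        result[visitor] = {
            "entry": pages[0],
            "exit": pages[-1],
            "path": pages,
            "seconds_per_page": seconds,
        }
    return result
```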