Analysis Of The Modern Methods For Web Positioning

Master Thesis

Faculty of Computer Science and Management
Field of study: Computer Science
Specialization: Software Engineering

Master thesis
Analysis of the Modern Methods for Web Positioning
Paweł Kowalski

Keywords: search engine, SEO, personalization, optimization, web positioning

The thesis contains an analysis of the mechanism for personalizing search results in popular search engines. It presents experiments and considerations concerning the impact of personalization on search rankings and its effect on Search Engine Optimization (SEO). Some new SEO methods that take advantage of personalization in search engines are proposed.

Supervisor: dr inż. Dariusz Król (name and surname, grade, signature)

For archival purposes, the thesis has been classified as:* a) category A (permanent records) b) category BE 50 (subject to expert appraisal after 50 years) (* delete as appropriate)
Stamp of the institute

Wrocław 2010
Contents

Chapter 1. Introduction .... 1
  1.1. Beginning of the SEO Concept .... 1
  1.2. Search Engines Evolution .... 2
  1.3. The Goal .... 3
Chapter 2. Personalized Search .... 5
  2.1. Operation Principles .... 6
  2.2. Methods for the Analysis of Behavioural Data .... 7
    2.2.1. Methods of behavioural data collecting .... 8
    2.2.2. Process of tracking user .... 10
  2.3. Research .... 15
    2.3.1. Location .... 16
    2.3.2. Phrase language .... 16
    2.3.3. Search history .... 17
  2.4. Spam Issue .... 19
Chapter 3. Impact of Personalized Search on SEO .... 21
  3.1. Metrics .... 21
    3.1.1. Areas for Consideration .... 23
Chapter 4. SEO Guide .... 25
  4.1. Website Presentation in Google Search .... 26
    4.1.1. Title .... 26
    4.1.2. Description .... 27
    4.1.3. Sitelinks .... 28
  4.2. Website Content .... 28
    4.2.1. Unique content .... 28
    4.2.2. Keywords .... 28
  4.3. Source Code .... 29
    4.3.1. Headers .... 29
    4.3.2. Highlights .... 30
    4.3.3. Alternative texts .... 30
    4.3.4. Layout .... 30
  4.4. Internal Linking .... 30
    4.4.1. Distribution .... 31
    4.4.2. Links anchors .... 31
    4.4.3. Broken links .... 31
    4.4.4. Nofollow attribute .... 31
    4.4.5. Sitemap .... 32
  4.5. Addresses and Redirects .... 33
    4.5.1. Friendly addresses .... 33
    4.5.2. Redirect 301 .... 34
  4.6. Other Issues .... 34
    4.6.1. Information for robots .... 35
    4.6.2. Performance .... 35
  4.7. Summary .... 36
Chapter 5. The System for Global Personalization .... 37
  5.1. Problems to Solve .... 37
  5.2. Objectives .... 38
    5.2.1. Web Positioning .... 38
    5.2.2. Cooperation .... 39
    5.2.3. Control .... 39
  5.3. Architecture .... 39
  5.4. Visitor Session Algorithm .... 40
  5.5. Task Assignation Algorithm .... 41
  5.6. Proof Study .... 42
    5.6.1. Tools .... 42
    5.6.2. Results .... 43
  5.7. Summary .... 44
Chapter 6. Conclusion .... 46
List of Figures .... 48
Bibliography .... 49
Abstract

Modern search engines are constantly being improved. The most recent big step introduced into their algorithms concerns the personalization mechanism. Its goal is to extract information about the user's preferences implicitly from his search behaviour, as well as from such factors as location, phrase language and search history. This information is the basis for building the user's search profile. The motivation for this process is to provide search results that are more relevant to the specific user and his interests. The thesis concerns the details of this personalization mechanism and tries to examine how various factors affect search results. The author also analyses the methods used by search engines for collecting behavioural data. He attempts to define the possible impact of the customization of search results on Search Engine Optimization (SEO) issues such as metrics, spam filtering and changes in the significance of website optimization factors. Then the author tries to evaluate the possibility of manipulating personalized search rankings through a proposed system for generating human-like web traffic.

Streszczenie

Modern web search engines are constantly being improved. The latest big step forward in their algorithms concerns the personalization mechanism. Its task is to obtain information about the user's preferences from his behaviour while searching for information, with respect to such factors as his location, the language of the searched phrase and the search history. This information is the basis for creating a user profile. The goal of these actions is to return to the specific user search results that better match his interests. This thesis contains a detailed analysis of the personalization mechanism and an attempt to examine how particular factors influence search results. The author also analyses the methods by which search engines acquire data about user behaviour. He attempts to determine the possible influence of adjusting search results to the user on topics related to web positioning, such as metrics, spam filtering or changes in the significance of particular factors in website optimization. Then the author tries to assess the possibility of manipulating personalized search results through a proposed system for generating natural-looking web traffic.
Chapter 1
Introduction

Before the Web and present-day search engines, searching meant simply matching the terms of a query against the exact appearance of those terms in a database of textual documents. Some database searches only let you locate documents in which certain words appeared within a defined distance of other specified words in the same document. Sorting documents by relevance or importance would have been a monumental task, if it was possible at all.

1.1. Beginning of the SEO Concept

When the Internet was introduced, it revolutionized the worldwide sharing of information. Free, unrestricted access to this web for everyone is the reason why the Internet is considered one of the greatest inventions of the 20th century. But this freedom has a serious implication: many problems with organizing this enormous body of information. Hyperlinks turned out to be insufficient for the task. This is why the first search engines were introduced. They quickly became the main source of visits to commercial websites, and good search results became a very important issue for content publishers. That moment was the beginning of the SEO concept, which is still a major element of Internet marketing.

1. Search Engine Optimization (SEO) – the process of improving the volume or quality of traffic to a website from search engines. The term Web Positioning is used as a synonym.

The early search engines like AltaVista or Lycos were launched around 1994–1995 [7]. Their algorithms analysed only the content of websites and the keywords in meta tags. It was easy to circumvent these algorithms by placing false information in the keywords tags. Another popular fraud was filling website content with irrelevant text that was visible only to search engine robots, not to the user. As a result, search engine result pages (hereafter SERPs) contained websites filled with spam and inappropriate content [21].
1.2. Search Engines Evolution

However, the relevance of search results to the query was still based on keyword matching. But search engines started to understand differences in the importance of words located in different parts of a page. For example, if you searched for a certain phrase, pages containing those words in their titles and headlines might be considered more relevant than other pages where those words also appeared, but not in those "important" parts of the page.

Google, founded in September 1998, revolutionized search engines. Its co-founders, Larry Page and Sergey Brin, developed the PageRank algorithm [3]. This algorithm redefined the search problem. The content of websites became slightly less significant: instead of text content, PageRank rates websites mainly on the basis of the quantity and quality of links leading to them. With such improvements, the Internet works as a kind of voting system, where every link is a vote for the website it leads to. Relevance was also derived by indexing the words that link to other pages. If a link leading to a page used the phrase "american basketball" as anchor text, the page being pointed to would be considered relevant to American basketball. The existence of links to a page has also been used to help define its perceived importance: information about the quality and quantity of links to a page can give search engines a sense of the implied importance of the page being linked to.

2. Link label or link title: the text in a hyperlink that is visible and clickable by the user.

Nevertheless, after a short time, techniques [21] spoiling the results of PageRank were also discovered. Most of them work by increasing the number of links leading to a particular website in order to enhance its PageRank value. Such activity on a large scale is usually called linkbaiting, and there are many scripts and web catalogs that facilitate and automate it. However, Google is constantly working to improve its search engine algorithm and to make it resistant to linkbaiting. According to [11], many new factors are being introduced into the website evaluation process in order to reduce the impact of linkbaiting, which is a sort of spam.

Besides, there is a limit to the effectiveness of this type of keyword matching. When two people perform a search at one of the major search engines, there is a chance that even if they use the same search terms, they might be looking for something completely different. For example, when an anthropologist searches for the phrase "jaguar", he expects websites with information about big cats as a result. But he can also receive a collection of websites about Jaguar cars instead.

As search engines progressed and users were given more and more websites with valuable information, the engines needed to respond with a refined approach to search. The main idea for improving the relevance of search results was to better understand the user's intent and expectations when he types a certain phrase into a search box. So it seems that the next step in search engine improvement is tracking regular users on the Internet. Collecting data on their activity might give useful information about which websites are valuable to them.
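As an aside to the PageRank discussion above, the core idea can be illustrated with a minimal sketch of the classic power iteration. This is the simplified textbook form of the algorithm described in [3], not Google's production version, and the example graph is invented:

```python
# Simplified PageRank power iteration (textbook form of the idea in [3]).
# Each page's score is distributed evenly over its outgoing links; the
# damping factor d models a surfer who sometimes jumps to a random page.
def pagerank(links, d=0.85, iterations=50):
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {}
        for p in pages:
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - d) / len(pages) + d * incoming
        pr = new
    return pr

# Three pages: A and B vote for C, C votes for A. C ends up ranked highest.
# (Every page in this toy graph must have at least one outgoing link.)
print(pagerank({"A": ["C"], "B": ["C"], "C": ["A"]}))
```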
The major engines such as Google, Yahoo and Bing guard their search secrets closely, so one can never be absolutely certain how they operate. But they are evolving, and personalization seems to be the wave of the future.

1.3. The Goal

It seems quite clear that search engines fitted with a personalization mechanism would have two main benefits:
1. Improvement of the relevance of search results for a specific user.
2. A decrease in the number of spam entries in SERPs.

Google, the leader in the search engine market, has already taken the first steps in this area. Such information reaches us from the official company blog [13]. Moreover, the company already holds several patents connected with the personalization mechanism. For this reason, this thesis will be mainly concerned with Google Search. But the high competition in the Internet market suggests that other popular search engines like Yahoo and Bing are also being improved in this direction.

The goal of this thesis is to analyse the possible aspects of the personalization mechanism in Google Search, on the basis of the available information. Several factors which can have an influence on the changeability of SERPs will be taken into account:
• geolocation
• language of the query
• web search history
• query complexity
• search behaviour (e.g. bounce rates, time of visits)

3. Bounce rate is a term used in website traffic analysis. It essentially represents the percentage of initial visitors to a site who "bounce" away to a different site, rather than continue on to other pages within the same website.

This will be the basis for several experiments which should determine how advanced the current level of personalization in the considered search engine is. The research also includes an analysis of the data used to describe users' search behaviour, particularly the methods of collecting these data and their types. The obtained results will be used to specify the potential impact on SEO and its metrics, in particular:
• the possibility of using search engine personalization to create new SEO techniques
• the usability of a website's ranking in search results as a measure of success in SEO activity

Personalization is an opportunity for search engines to make spam less significant for search results and to make SEO workers' lives harder. But this concerns only spam in the sense of websites with content of no value to humans, and irrelevant hyperlinks. Personalization also opens the door to another kind of information noise: behavioural data spam. The thesis presents the architecture of a distributed system generating artificial web traffic and thereby imitating the search activity of a real user. However, using such a system can be seen as unethical, so the thesis contains only a conception and a design. The author has no intention of implementing such a system, but he tries to examine with the available tools whether building it would be reasonable. In this way, possible harmful actions, against which search engines should be protected, may be indicated.

After that, there is a short analysis of the known up-to-date information about significant factors in web positioning. Together with the results of the personalization research, it helped to prepare a collection of advice on how to build a website attractive to search engines: a sort of guide for webmasters.

At the end of the thesis there is a short conclusion containing the author's thoughts about future trends in search engines and SEO.
Chapter 2
Personalized Search

Pretschner [27] wrote in 1999: With the exponentially growing amount of information available on the Internet, the task of retrieving documents of interest has become increasingly difficult. Search engines usually return more than 1,500 results per query, yet out of the top twenty results, only one half turn out to be relevant to the user. One reason for this is that Web queries are in general very short and give an incomplete specification of individual users' information needs.

To be more specific, Speretta [31] wrote in 2005: [...] the most common query length submitted to a search engine (32.6%) was only two words long and 77.2% of all queries were three words long or less. These short queries are often ambiguous, providing little information to a search engine on which to base its selection of the most relevant Web pages among millions.

According to Wikipedia, by 2006 Google had indexed over 25 billion web pages, 1.3 billion images and over one billion Usenet messages, and was handling 400 million queries per day. The Internet grows very quickly. For this reason, search accuracy is a crucial area for constant improvement in modern search engines. One of the major solutions to meet this challenge is personalization.

Personalized search is simply an attempt to deliver more relevant and useful results to the end user (the searcher) and to minimize less useful results. The personalization mechanism uses information about the user's past actions and behaviour to build his profile and match relevant search results to this profile. It should provide a more useful set of results, or a set of results with fewer irrelevant or spam entries. For this reason, personalized search seems to be desirable to the end user. Google puts it this way: Search algorithms that are designed to take your personal preferences into account, including the things you search for and the sites you visit, have better odds of delivering useful results [13]. The goal is simple: to reduce spam and to deliver better results. This looks like a dangerous weapon against SEO workers, who are major offenders in generating spam.
Official information [13], [25], [38] indicates that Google is the only major search engine that has already introduced a personalization mechanism. The first personalized search results appeared almost 5 years ago [13], and since then the mechanism has been constantly evolving.

2.1. Operation Principles

Of course, the details of Google's search algorithms are not public. But it can be expected that the main principles are based on ideas which can be found in the scientific literature. According to [31], personalization can be applied to search in two different ways:
1. by providing tools that help users organize their own past searches, preferences, and visited URLs;
2. by creating and maintaining sets of the user's interests, stored in profiles, which can be used by the retrieval process of a search engine to provide better results.

His research proved that user profiles can be created implicitly out of the limited amount of information available to the search engine itself. The profiles are built on the basis of the user's interactions with a particular search engine. Google has applied this second approach in their search engine, because they do not provide any additional tools like toolbars or browser add-ons for personalizing search.

After [31]: In order to learn about a user, systems must collect personal information, analyze it, and store the results of the analysis in a user profile. Information can be collected from users in two ways: explicitly, for example asking for feedback such as preferences or ratings; or implicitly, for example observing user behaviors such as the time spent reading an on-line document.

Google Search does not provide any forms which let users specify their interests and preferences, so to build the user profile this information must be collected in another way. According to [31], user browsing histories are the most frequently used source of information for creating interest profiles. But not only the browsing history (like the one presented in figure 2.1) is significant. For example, a user gets search results after sending a search query. He selects a specific entry that seems interesting, clicks on it, and the website is saved in his browsing history. However, the user quickly realizes that the selected website does not fit his interests and goes back to the search results after a few seconds. Such a visit should be qualified rather negatively. So not only the browsing history, but also the user's behaviour should be taken into consideration by the personalization mechanism.

Figure 2.1. Google Web History panel

Studying a series of searches from the same user may also offer a glimpse into modified search behaviour. How does an individual change their queries after receiving unsatisfactory results? Are search terms shortened, lengthened or combined with new terms? There is much other information that a search engine might collect about a user when a search is performed: location, language preferences indicated in the browser, or the type of device being used (mobile phone, handheld or desktop). But how can such behavioural data be collected by a search engine? The answer is in the next section.
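Before moving on, the negative qualification of a quick return described above can be made concrete with a small sketch. The thresholds and scores below are purely illustrative assumptions; no search engine publishes its actual rules:

```python
# Hypothetical implicit-feedback scoring for a clicked search result.
# A quick return to the results page ("bounce") counts against the page,
# a longer stay counts in its favour. Thresholds are illustrative only.
def score_click(dwell_seconds: float) -> int:
    if dwell_seconds < 10:      # user came straight back: likely irrelevant
        return -1
    if dwell_seconds > 180:     # user stayed and read: likely relevant
        return +1
    return 0                    # ambiguous visit, no signal

profile = {}                    # url -> accumulated preference signal

def record_visit(profile, url, dwell_seconds):
    profile[url] = profile.get(url, 0) + score_click(dwell_seconds)

record_visit(profile, "http://example.com/cats", 4.0)    # bounce
record_visit(profile, "http://example.com/jaguar", 240)  # engaged visit
print(profile)  # {'http://example.com/cats': -1, 'http://example.com/jaguar': 1}
```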
2.2. Methods for the Analysis of Behavioural Data

Search engine robots, hereafter crawlers [26], continuously gather information from almost every website on the Internet. It is well known that Google collects an enormous amount of data through this process. These data have the greatest significance for the search engine algorithm; thus classic web positioning methods are based on link maintenance, mainly the acquisition of links. Google processes these data and sorts websites according to their value to the user. The user sends queries to the search engine and gets the appropriate SERPs. Because Google knows what people search for, they are able to determine the popularity of specific information on the Internet.
But eventually, it is the user who decides which website is valuable to him and which is not. The value of a particular website is reflected in users' activity: which links have been clicked and the time between these actions. This information is called behavioural data. It is reasonable to make all these data useful for the search engine. Certainly, Google knows that too, which is probably the reason why they collect an enormous amount of behavioural data in addition to the data gathered by crawlers. This kind of information is what this study is most interested in.

2.2.1. Methods of behavioural data collecting

The entire web is based on the HTTP protocol, which generates requests containing the following information:
• the IP address of the user making the request, which can be used for geolocation of this user,
• the date and time of the request,
• the language spoken by the user,
• the operating system of the user,
• the browser of the user,
• the address of the website whose link redirected the user to the requested website.

These HTTP requests are used by Google in:
Click tracking – Google logs all of its users' clicks on all of its services,
Forms – Google logs every piece of information typed into every submitted form,
Javascript execution – requests (and sometimes even more data) are sent when a user's browser executes a script embedded in a website,
Web beacons – small (1 pixel by 1 pixel) transparent images on websites, which cause the browser to send a request every time it tries to download such an image,
Cookies – small pieces of text information stored on the user's computer, which let Google track users' movement around the web every time they land on any page that carries a Google advertisement.

But all these elements have to be placed on the websites being indexed by Google's crawlers. Fortunately for Google, they have a lot of services that are very useful for Internet publishers. Because they are mostly free to use, webmasters gladly embed them in their websites.
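To make the web beacon technique concrete, here is a minimal sketch of a tracking endpoint built only on the Python standard library. The host, port and logged fields are illustrative; real trackers are of course far more elaborate:

```python
# Minimal web-beacon endpoint (illustrative only): serves a 1x1 transparent
# GIF and logs who requested it. Embedding <img src="http://tracker.example/b.gif">
# on a page makes every visitor's browser send such a request.
from http.server import BaseHTTPRequestHandler, HTTPServer

PIXEL = (b"GIF89a\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00\xff\xff\xff"
         b"!\xf9\x04\x01\x00\x00\x00\x00,\x00\x00\x00\x00\x01\x00\x01\x00"
         b"\x00\x02\x02D\x01\x00;")   # a valid 43-byte transparent GIF

class Beacon(BaseHTTPRequestHandler):
    def do_GET(self):
        # Everything interesting arrives with the request itself:
        # the IP address, the time, the Referer (the page being viewed)
        # and the User-Agent (browser and operating system).
        print(self.client_address[0], self.headers.get("Referer"),
              self.headers.get("User-Agent"))
        self.send_response(200)
        self.send_header("Content-Type", "image/gif")
        self.end_headers()
        self.wfile.write(PIXEL)

HTTPServer(("", 8080), Beacon).serve_forever()
```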
Google Analytics

One of these attractive services is Google Analytics. It generates detailed statistics about:
• the visitors to the website,
• the previous website of each visitor,
• the activity (navigation) of the user within the website.

Figure 2.2. Google Analytics main panel

It is the most popular and one of the most powerful tools for examining the web traffic of a website. It gives the owner a lot of useful information about the visitors to his website; figure 2.2 shows a few features of this piece of software. The information it provides is certainly very interesting for Google itself. For this reason, there is a lot of discussion among SEO workers about the possible disadvantages of using Google Analytics in SEO campaigns: poor results of a particular website, reported through Google Analytics, could tell Google's search engine to decrease the value of this website in the search ranking. But this is only unconfirmed speculation.

Google Toolbar

Another Google tool which provides them with even more valuable data is Google Toolbar. It is a plug-in adding a few new features to popular browsers, mainly quick access to Google's services. One of these features is checking the PageRank value of the currently viewed webpage.
This gives Google information about every website that users with the Toolbar installed are viewing.

Google AdSense

There is also Google AdSense, a contextual advertising program for website publishers. Millions of websites use this service to generate financial profit for their authors. The effect of this is that all these websites display ads served from Google's servers, which can provide Google with information similar to that from Google Analytics and Google Toolbar.

Google Public DNS

The latest service launched by Google is Public DNS (Domain Name System) [23]. It is said to be faster and more secure than others, and this is how Google encourages us to start using their DNS. It can generate a massive amount of information about web traffic, since every single query to the DNS can be analyzed by its provider. So the more popular their DNS becomes, the better for Google: it can provide a lot of information helpful in determining website popularity.

But because of DNS caching mechanisms [23], Google does not get all the desired information. A DNS client sends a query only when it wants to visit a domain for the first time. After it gets the IP address of this domain from the DNS, it caches it for an interval determined by a value called time to live (TTL). No query is sent for any subsequent visit during this interval. Consequently, Google is still in need of other services to gather the desired information about the activity of a particular website's visitors.

Other Google Services

Google has other very popular services, for example YouTube and Google Maps (Fig. 2.3). They allow users to embed objects like videos or maps on their own websites. There is also Google Reader, which can indicate the popularity of particular websites by counting their RSS subscribers.

Figure 2.3. Google Maps example screen

1. RSS (most commonly expanded as Really Simple Syndication) is a family of web feed formats used to publish frequently updated works – such as blog entries, news headlines, audio, and video – in a standardized format.

There are many other ways for Google to gain useful data [9]. In fact, Google itself admits to the use of all the described techniques in its privacy policy [12]. Most of these data are probably used to improve the accuracy of their search engine and the quality of their services.

2.2.2. Process of tracking user

The described services can be a great source of behavioural data; there is no doubt about that. But the process of tracking a user's search activity would be incomplete without the data provided by the search engine itself. The next few sections present what the tracking process looks like.
Starting the session

When the user opens the search engine site (typing the www.google.com address into a browser), he sends an HTTP request [37] to the server. This request contains the IP address of the user's computer. Thanks to this information, the search engine has the ability to relate the following search queries to particular users. Each of them is assigned a unique session identifier, stored on the server. This is the beginning of the user's search session. The identifier expires after a certain period of the user's inactivity; in this way the search session is terminated.

Sending the search query

The view presented in figure 2.4 should be familiar to every Internet surfer. This is the place where the user can type his search query. After the search query is sent, two things follow:
1. The query is stored in a database and connected with the user's session identifier. The personalization mechanism then takes advantage of it.
2. The query is analysed and used by the search algorithm to provide relevant search results to the user.
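A small sketch of the bookkeeping just described: queries tied to a per-client session identifier that expires after a period of inactivity. The timeout length and identifier format are assumptions, since Google does not document them:

```python
# Hypothetical session bookkeeping on the search-engine side: each client IP
# gets a session identifier that expires after a period of inactivity.
import time
import uuid

SESSION_TIMEOUT = 30 * 60          # assumed 30-minute inactivity window
sessions = {}                      # ip -> (session_id, last_seen)

def session_for(ip):
    sid, last_seen = sessions.get(ip, (None, 0))
    if sid is None or time.time() - last_seen > SESSION_TIMEOUT:
        sid = uuid.uuid4().hex     # start a new search session
    sessions[ip] = (sid, time.time())
    return sid

def log_query(ip, query):
    print(session_for(ip), query)  # queries become linkable within a session

log_query("203.0.113.7", "jaguar")
log_query("203.0.113.7", "jaguar cars")   # same session id as the query above
```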
Figure 2.4. Google Search main screen

After that, the user receives an HTTP response [37] with the search results as HTML code.

Result selection

Figure 2.5 presents one of the results for the query "query example", with the hyperlink highlighted.

Figure 2.5. Example of the search result

A click on a link normally sends an HTTP request to the server the link leads to. So in this case, it should be sent to:
http://www.wisegeek.com/what-is-query-by-example.htm
But when you look into the source code of the SERP, you will find a URL like this:
http://www.google.com/url?sa=t&source=web&cd=6&ved=0CDMQFjAF&url=http://www.wisegeek.com/what-is-query-by-example.htm&ei=CekOTLT4EpHu0gTYitWXDg&usg=AFQjCNE3t34-kSehUAK8TFNwh5CV9K-OWg&sig2=PdwrnqnhLhowpC8t5-06bw

The most important thing to notice is that the links in the SERPs lead to Google's server. The chosen website still finally appears on the user's screen, because Google's server performs a URL redirection (forwarding). This technique bears the downside of a short delay caused by the additional request to the search engine server. However, in this way the search engine can log every user's click in the SERPs. What is more, there are some additional data in the result's URL which probably provide some extra information. For example, the value of the cd parameter is the position of the entry in the current search ranking.
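The tracked data can be read back from such a link with a few lines of code; the sketch below uses the example URL above. The meaning of cd follows from the text, while the remaining parameters are undocumented:

```python
# Extracting the tracked data from the Google redirect link shown above.
from urllib.parse import urlparse, parse_qs

redirect = ("http://www.google.com/url?sa=t&source=web&cd=6&ved=0CDMQFjAF"
            "&url=http://www.wisegeek.com/what-is-query-by-example.htm"
            "&ei=CekOTLT4EpHu0gTYitWXDg&usg=AFQjCNE3t34-kSehUAK8TFNwh5CV9K-OWg"
            "&sig2=PdwrnqnhLhowpC8t5-06bw")

params = parse_qs(urlparse(redirect).query)
print(params["url"][0])  # the website the user actually chose
print(params["cd"][0])   # its position in the ranking shown to this user: 6
```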
What is more, this URL can be seen by the target server, because browsers place it into the HTTP request data as the Referer field [37]. This fact is used by software like Google Analytics to aggregate the traffic sources of a website carrying the Analytics scripts. Thanks to this, the website owner can get information about:
• the most popular search phrases that result in visits to his website,
• the place in the ranking of his website for a particular search phrase and a particular user (it can vary due to the personalization mechanism).
Of course, the same information is taken into consideration by the search engine.

Behavioural data extraction

According to much research [1], [6], [7], [10], [20] and [36], more than half of a website's visits come from the SERPs. In the case of e-commerce websites, this value is even higher, because such services usually do not have regular visitors; they mostly come from search engines (even 90% of visits) or from ads appearing on other websites. According to an analysis of real users' web traffic [29], a typical user spends about 2 hours per session and 5 minutes per page. These statistics concern a website of good quality, relevant to the user's interests. A visit to a website with poor content would be terminated after as little as a few seconds, a so-called bounce. Such a visit should indicate the irrelevance of the website selected by the user, so it would be desirable if that website did not appear in the search results for the particular phrase.

Not only what you select and interact with from a given set of search results (or the ads served with them), but also what you do not select or have minimal interaction with (bounce rates) can have an effect. These metrics can be used to create a better probability model for future search result sets.

What is more, a mechanism based on cookies has recently been introduced in Google Search. It allows the engine to learn the history of search queries from the last 180 days for every user, including those not logged into any Google service. Officially, it is used to personalize SERPs according to the past interests of the user. Google [13] says about that: Because many people might search from a single computer, the browser cookie may be associated with more than one person's search activity. For this reason, we don't provide a method for viewing this signed-out search activity.

The diagram in figure 2.6 shows the process of tracking a user who has not been signed in to a Google Account. This is what Google [14] says about personalized search for signed-out users:

When you search using Google, you get more relevant, useful search results, recommendations, and other personalized features. By personalizing your results, we hope to deliver you the most useful, relevant information on the Internet. In the past, the only way to receive better results was to sign up for personalized search. Now, you can get customized results whenever you use Google. Depending upon whether or not you're signed in to a Google Account when you search, the information we use for customizing your experience will be different:
  • 18. Chapter 2. Personalized Search 14 User Search Engine Enter search engine via URL Extract profile information [Else] Type search query Show profile based search page Save cookie Search for relevant documents [Else] Look for interesting website Prepare search results [Found history cookies] [Found something interesting] Re-rank found documents Click chosen website Log phrase-selection Visit the website Redirect to selected website [Else] [Curiosity satisfied] [Else] Log bounce Go back to search engine [Visit longer than 3 minutes] Log visit Close browser Figure 2.6. Activity diagram of a visit session or not you’re signed in to a Google Account when you search, the information we use for customizing your experience will be different: Signed-in personalization: When you’re signed in, Google personalizes your search experience based on your Web History. If you don’t want to receive personalized results while you’re signed in, you can turn off Web History and remove it from your Google Account. You can also view and remove individual items from your Web History. Signed-out customization: When you’re not signed in, Google customizes your search experience based on past search information linked to your browser, using a cookie. Google stores up to 180 days of signed-out search activity linked to your browser’s cookie, including queries and results you click.
Table 2.1. Information used by Google Search in personalization

                              Signed-in Personalized Search       Signed-out Personalized Search
  Place of data storage       Web History, linked to the          On Google's servers, linked to
                              Google Account                      an anonymous browser cookie
  Time interval of storage    Indefinitely, or until the          Up to 180 days
                              user removes it
  Searches used to            Only signed-in search activity,     Only signed-out search activity
  customize                   and only if the user has signed
                              up for Web History

2.3. Research

The goal of this section is to evaluate the current level of personalization based on several factors. What Google [14] says about the types of result customization should be helpful for this task:

When you use Google to search, we try to provide the best possible results. To do that, we sometimes customize your search results based on one or more factors:

Search history: Sometimes, we customize your search results based on your past search activity on Google, such as searches you've done or results you've clicked. If you're signed in to your Google Account and have Web History enabled, these customizations are based on your Web History. If you're signed in and don't have Web History enabled, no search history customizations will be made. (Using Web History, you can control exactly what searches are stored and used to personalize your results.) If you aren't signed in to a Google Account, your search results may be customized based on past search information linked to your browser using a cookie. Because many people might be searching on one computer, Google doesn't show a list of previous search activity on this computer.

Location: We try to use information about your location to customize your search results if there's a reason to believe it'll be helpful (for example, if you search for a restaurant chain, you may want to find the one near you). If you're signed in to your Google Account, that customization may rely on a default location that you've previously specified (for example, in Google Maps). If you're not signed in, the results may be customized for an approximate location based on your IP address. If you'd like Google to use a different location, you can sign in to or create a Google Account and provide a city or street address. Your specific location will be used not only for customizing search results, but also to improve your experience in Google Maps and other Google products.
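IP-based location customization of the kind described in the quote is typically implemented with a geolocation database. The sketch below uses the MaxMind GeoLite2 database via the geoip2 package as one plausible approach; Google's internal method is of course unknown:

```python
# Approximate geolocation from an IP address, as a search engine might do
# for signed-out users. Uses the MaxMind GeoLite2 database via the geoip2
# package (pip install geoip2); the database file must be downloaded first.
import geoip2.database

with geoip2.database.Reader("GeoLite2-City.mmdb") as reader:
    response = reader.city("128.101.101.101")   # example IP address
    print(response.country.iso_code)            # e.g. 'US'
    print(response.city.name)                   # nearest city estimate
    print(response.location.latitude, response.location.longitude)
```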
2.3.1. Location

While you can search at google.com just about anywhere in the world, you can also access Google at a number of country-specific addresses, such as google.co.uk, www.google.fr or www.google.co.in. In fact, Google automatically redirects you to the proper domain using your IP address to determine your geolocation. The browser setting indicating a preferred language was cleared in this experiment.

This experiment was performed in one location in Poland. However, to simulate requests from other locations, a software environment similar to the one described in section 5.6.1 was used. The phrase "jaguar" used here is multi-lingual, so the language of the phrase does not affect the search results.

The first query was sent through three Tor hosts, where the exit host was located in Los Angeles, California, United States. The result of the query is presented in figure 2.7 (only the first several entries). All websites in this SERP are in English, which is the prevailing language at the described location. Moreover, near the bottom there are some places indicated on Google Maps which are physically close to the location of the exit host.

Figure 2.7. Search results of the query sent via a host located in the USA

The second query was sent via an exit host located in Erfurt, Thuringia, Germany. Figure 2.8 presents the results of this query.

Figure 2.8. Search results of the query sent via a host located in Germany

The Official Google Blog [13] states that the same query typed in multiple countries may deserve completely different results, and the presented results clearly show that those words are true. Unfortunately, the author failed to check whether a search for the query "football" provides different results in the US, the UK and Australia, where the term refers to completely different sports; but it is rather probable. A preferred country might include the country of the searcher as well as other countries the searcher might find acceptable, such as showing search results from the United States to people located in Canada.
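The location experiment can be reproduced with standard tooling. The sketch below assumes a local Tor client listening on its default SOCKS port (9050) and the requests library with SOCKS support; section 5.6.1 describes the environment actually used by the author:

```python
# Sending the same query through Tor so that Google sees the exit host's
# location instead of ours. Assumes a local Tor client on its default SOCKS
# port and the requests library with SOCKS support (pip install requests[socks]).
import requests

TOR_PROXY = {"http": "socks5h://127.0.0.1:9050",    # socks5h: DNS also via Tor
             "https": "socks5h://127.0.0.1:9050"}

response = requests.get("http://www.google.com/search",
                        params={"q": "jaguar"},
                        proxies=TOR_PROXY,
                        headers={"User-Agent": "Mozilla/5.0"})
print(response.url)          # Google may redirect to a country-specific domain
print(response.status_code)  # note: automated clients may be served a CAPTCHA
```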
2.3.2. Phrase language

It is rather clear that the language of the searched phrase is significant for the results. Search engines, despite personalization, still use the matching of phrases to the content of indexed pages as the major factor in evaluating search relevance. For this reason, phrases identical in semantic meaning but expressed in different languages are, in general, completely different. Serving search results with English pages about birds would be senseless if the user typed the phrase "Vogel" into the search box, which means "bird" in German.

2.3.3. Search history

The most interesting factor said to have influence in the personalization mechanism of the Google search engine is the user's search history. Figure 2.9 shows search results which were slightly modified by re-ranking based on search history. In the fifth position, right after two video thumbnails, there is a link to a website visited 4 times by the author of this study to gather information (the exact number of visits is visible on the right side of the hyperlink).

Figure 2.9. Search results personalized by the user's search history

These fluctuations appear only when the user is signed in to a Google Account; otherwise there is no access to the web history (figure 2.1).
This modified search result was not intentional on the author's part; the phrase "personalized search" was not the object of the experiment. However, this result shows that search history affects future search results in similar areas of information. To compare the modified results with the original (without the impact of personalization), there are two ways to disable result customization:
1. signing out from the Google Account,
2. using "View customizations", which is available at the bottom of the results screen.

After using one of these options, we can check the original position of the visited website in the ranking. In this particular case, the website holds the 17th position in the results with no customization, so the personalization re-ranking increased its position by 12 places. Most importantly, this change places the visited website on the first SERP of the search ranking. In most cases (more than 90% of searches), users do not go beyond the first page of the results, so such a change in ranking causes a huge increase in the number of visitors via this phrase.
Unfortunately, the author has failed to force the search engine to re-rank search results intentionally. So after this experiment, an approximation of the re-ranking algorithm is impossible.

2.4. Spam Issue

There is a huge amount of value in getting to the top of the search results, especially for competitive phrases related to business. This is a marketing area which quite often has millions of dollars in it. So spammers are highly motivated, because there is a lot of money at stake. Unfortunately, regular users searching for valuable content are the main victims of these practices.

One of the more interesting aspects of implicit and explicit user feedback in the search personalization process is that it can be very effective in dealing with spam. The more personalized the results, the smaller the chance that spam will appear in the search ranking. In most cases, spammy websites do get clicked by users (who are tricked by a link with false information about the target website), but after realizing the real value of such a website, users quickly go away and do not come back.
Not only will this enable search engines to help limit spam through personalization; it would also be a great source of query/click analysis for Google. It is worth considering the case where the click data across multiple users shows that a given entry in a query space is rarely clicked, or shows a high bounce rate. Google might use that signal as a dampening factor for the spam result.
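What such a dampening factor might look like is purely speculative, since the text above only suggests that Google could use click data this way. One naive formulation, with invented thresholds and values:

```python
# Speculative spam-dampening signal built from aggregated click data:
# results that are rarely clicked, or quickly bounced from, get demoted.
def dampening_factor(impressions, clicks, bounces):
    if clicks == 0:
        return 0.5                     # shown often but never chosen
    ctr = clicks / impressions
    bounce_rate = bounces / clicks
    if ctr < 0.01 or bounce_rate > 0.9:
        return 0.5                     # strong negative signal: halve the score
    return 1.0                         # no dampening

score = 42.0
score *= dampening_factor(impressions=10_000, clicks=40, bounces=39)
print(score)                           # 21.0, demoted as probable spam
```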
Chapter 3
Impact of Personalized Search on SEO

3.1. Metrics

For quite a long time, SEO workers were using the position in the search ranking for particular phrases as the indicator of web positioning. An increase in a website's position has always been a desirable consequence of SEO actions.

After the implementation of the personalization mechanism, the issue is not so simple. Although customizations of rankings are still not very influential (only one entry in a whole SERP), the highly visible benefits of personalization suggest that their impact will be increasing. For this reason, ranking position can no longer be the major metric of success: the position of a particular website in the ranking can be different for every user, especially for users who are regular visitors of this website.

Of course, this indicator is still measurable, because position monitors are not subject to personalization (they do not use cookies or a Google Account). It can also give useful information about the position seen by users searching for the concerned keywords for the first time. But it is the increase of inbound traffic which has always been the main motivation of SEO actions, so the major metrics of these actions should be closely related to this motivation. For example, these are possible metrics for SEO in the time of personalized search:

1. Software which automates the monitoring of a website's position in search rankings for given phrases.

Number of unique visitors. A higher value indicates a good result of the advertising campaign and gaining popularity among new customers.

Previous search queries. As an example: if the searcher has recently been searching for the term 'diabetes' and submits a query for 'organic food', the system attempts to learn and presents additional results relating to organic foods that are helpful in fighting diabetes.
Previously presented results. Results that have already been presented to the end user can be omitted from future results for a given period of time in exchange for other potentially viable results.

User query selection. Previously selected or preferred documents can be analysed, and similar documents or linking documents can be used to refine subsequent results. Furthermore, certain document types can be treated as preferred, in what would be a combination with Universal Search concepts. Websites that are commonly accessed can also be tagged as preferred locations for further weighting.

Selection and bounce rates (and user activity on the website). An editorial scoring can be devised from the amount of time a user spends on a page, the amount of scrolling activity, what has been printed, or even what has been saved or bookmarked. All of these can be used to further refine the 'intent' and 'satisfaction' associated with a given result that has been accessed.

Advertising activity. The advertisements clicked on can also begin to add to a clearer understanding of the end user's preferences and interests.

User preferences. The end user can also provide specific information as to personal interests or location-specific ranking prominence. It could also include favourite types of music or sports, inclusive of geographic preferences such as a favourite sport in a given city.

Historical user patterns. A person's surfing habits over a given period of time (e.g. 6 months) can also play a role in defining what is more likely to be of interest to them in a given query result. More recent information (on the above factors) is likely to be weighted more heavily than older historical performance metrics within a set of results.

Past visited sites. Many of the above metrics, such as time spent and scrolling on a given web page or historical patterns and preferred locations, can be collected in a variety of ways (invasive or non-invasive). Cookies actually save resources for the search engine, an added benefit.

Advice on how to improve the values of such metrics is presented in the next chapter.

A higher position in the rankings does not always mean more visitors. Moreover, there is no significant difference between positions 6 and 10. Very often the proper optimization of a page's title and description visible in a SERP is more important and brings more visitors than a higher position. Better website titles and meta descriptions would have an advantage, as getting the user to engage with the SERP listing upon initial presentation would be at a premium. Quality content as well would begin to take on a more meaningful role than it has in the past, as bounce rates and user satisfaction now start to play into the actual search result rankings.
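Several of the metrics above fall out directly from ordinary analytics data. A toy computation of two of them, with invented visit records and an assumed 10-second bounce threshold:

```python
# Computing two of the metrics above from a simplified visit log:
# unique visitors and bounce rate. Data and threshold are illustrative.
visits = [
    {"visitor": "u1", "duration": 5},     # bounced
    {"visitor": "u2", "duration": 320},
    {"visitor": "u1", "duration": 210},
    {"visitor": "u3", "duration": 8},     # bounced
]

unique_visitors = len({v["visitor"] for v in visits})
bounces = sum(1 for v in visits if v["duration"] < 10)
bounce_rate = bounces / len(visits)

print(unique_visitors)        # 3
print(f"{bounce_rate:.0%}")   # 50%
```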
3.1.1. Areas for Consideration

The author's experience in commercial SEO, which is closely related to the topic of marketing, is rather small; thus this section is based on [16].

Demographics

It should be ensured that any obvious demographics that may apply to your site are leveraged. Whether it is geographic, topical (sports, politics) or even a given age group, targeting this effectively is important, because the 'topical' nature of personalized search can group results prior to even ranking them. If a particular website is not clear in each of these areas, it risks lower weighting within tighter demographic starting document sets. Even your off-site activities (link building, social media marketing etc.) should be as tightly targeted as possible.

Relevance profile

Of particular interest is potential categorization in terms of topical relevance. Ensuring that your site provides a strong relevance trail would be particularly valuable. Much like in phrase-based indexing and retrieval concepts, probabilities play a large role: when refining results, the search engine looks at related probable matches. Through a concerted effort at on-site and off-site relevance strengthening, you increase the odds of making it into a given set of results in a world of 'flux'. It never hurts to review the concepts surrounding phrase-based indexing and retrieval, as many of the related patents addressed deriving concepts and topics from phrases. One would also have to imagine that tightening up the relevance profile of your social media marketing efforts would be beneficial to a tighter topical link profile. Furthermore, many topically targeted visitors that enter a site may bookmark it (passive collection), which adds to the organic search profile without the site ever being included in a search result. As such, there are many external opportunities to be had beyond traditional off-site SEO.

Keyword Targeting

Building out from your core terms will be important as far as understanding search behaviour is concerned. The long tail as we know it would be targeted towards potential query refinements for a given subset of searcher types. Building out logical phrase extensions and potential query refinements would be something to look at. Furthermore, with changeable personalized ranks, we would measure SEO success in actual traffic and conversions, which puts term targeting in a new light as far as nailing money terms and having a cohesive plan that targets long-tail query refinement opportunities.

Quality Content

In considering the value of a website, user interaction becomes a consideration as far as bounce rates, time spent on the page and scrolling activities are concerned.
Producing compelling and resourceful content would be at a premium to best leverage these tendencies of the system. If a searcher has selected and interacted with your site on multiple occasions, your site would be given weight in their personal rankings as well as for related topical and searcher types. The more effective the resource, the greater the increase in ranking weight.

Search result conversion

Working with the page title, meta description and snippets takes on a more important role in your SEO efforts when adjusting for personalized search. Using analytics and a form of split testing would be a great advantage for satisfying not only what ranks, but what converts.

Freshness

Another area which may be important is document freshness, in that people could be able to set default date ranges, or the system could passively begin to see a pattern of a user accessing more current content. A valuable website that has been ranking well for a year may no longer be getting all the traffic it has been used to. Updating such pages with fresh information should be considered, or creating new related pages and passing the flow via internal links. Depending on the nature of the content (searcher group profile), more current content may be more popular over the larger data set, and thus newer content would be weighted more overall.

Site Usability

From the crawler's or the end user's perspective, having a logical architecture and a quality end-user experience is also at a premium. If similar searcher types embark on similar pathways and related actions (bookmark, print, navigate, subscribe to RSS), then this will give greater value to those target pages within that community of searcher types. This also furthers the relevance profile.

Analytics

There is clearly a strong need for the use of analytics in understanding traffic flows, common pathways, bottlenecks, the paths to conversion, and much more. These data will be of immeasurable use in dealing with many of the factors that can affect personalized web positioning. This issue is closely connected with psychology (particularly behavioural targeting).
Chapter 4

SEO Guide

This chapter presents a set of areas to consider during the process of website optimization. The advice concerns only areas which may have a positive effect on the popularity of the website. It should help achieve a higher position in search rankings, which should increase the number of visitors, and it should also make the website more attractive for users. This will likely decrease the bounce rate, which otherwise has a negative impact on the website in the personalization re-ranking process.

The areas covered in this guide deliberately exclude SEO techniques connected with external actions, i.e. those that require contact with other websites, such as:

• linking (the acquisition of links), free or paid
• advertising
• presell pages¹

These methods are closely connected with generating spam and thus reduce the proportion of quality content on the Internet, so Internet surfers gain no benefit from them.

This chapter is based mainly on the information from [13], [15], [21] and [36].

¹ Presell page – a page created only for SEO purposes. The text on such a page is merely a surrounding for a link leading to the positioned website. The content has no value for a human reader; it is only prepared to look natural to crawlers, so as not to be filtered as spam.
4.1. Website Presentation in Google Search

4.1.1. Title

The title is the first piece of information about a website shown in the SERP. It is also one of the main factors with impact on the website's ranking. An example of a title in HTML code looks like this:

<title>Jaguars, Jaguar Pictures, Jaguar Facts - National Geographic</title>

Such a title as presented in the SERP is shown in figure 4.1.

Figure 4.1. Presentation of a website title in Google Search

These are the issues connected with the website title which are significant for SEO:

Length up to about 65 characters
Longer titles can also be indexed by the crawler, but a title of up to 65 characters is close to optimal and fits entirely in the SERP. Longer titles are shortened with an ellipsis.

Diversity of titles
Each page of the website conveys slightly different information (e.g. a product page, a contact form etc.). The title should be prepared individually for each of them.

Keywords
There are three principles related to creating a title:
1. Keywords should be distributed over all the pages. Each page should be optimized for only 3–4 keywords. The front-page title should carry the most general expression, titles of product pages should contain words characterizing the type of those products, and so on. Sticking to this rule is very important, because otherwise the pages of the website could be treated by the crawler as duplicated content.
2. The most important keywords should be placed at the beginning of the title.
3. Google can combine keywords from the title into different phrases, but those which appear one right after another have the greatest impact on ranking position. Due to this, a key phrase should not be split up.

4.1.2. Description

The description is the second piece of information about a website, presented right after the title in the SERP. Such a description as presented in the SERP is shown in figure 4.2.

Figure 4.2. Presentation of a website description in Google Search

The description presented in the SERP can be generated from the following sources:

• the description meta tag, for example:

<meta name="description" content="Learn all you wanted to know about jaguars with pictures, videos, photos, facts, and news from National Geographic." />

• a fragment of the website content (in case the description meta tag is too long or there is none in the source code)

Here are some tips on the page description in the meta tag:

Length up to about 150 characters
Longer descriptions will not be presented in the SERP as they were written.

Diversity of descriptions
Just like titles, the description of a particular page should differ slightly from the others. It should be specific to the information presented on the page.

Keywords
The description should contain the keywords targeted by the SEO strategy. When it does, the keywords will be shown in bold in the search results for queries based on those keywords, which should draw users' attention to our website. At the same time, the description should be written in a way that encourages users to visit the website.
4.1.3. Sitelinks

Sitelinks are links leading to other pages of the same website. They can be presented in SERPs in two ways:

1. Horizontally – 4 links in 1 row (presented in figure 4.3)
2. Vertically – 8 links in 2 columns

Figure 4.3. Presentation of a website sitelinks in Google Search

There is no manual way for publishers to force sitelinks to be presented in the SERP; it depends on how the website was indexed. However, the crawler's job can be made easier in two ways:

1. The source code related to the website navigation must be well designed and its syntax must be very clear.
2. A sitemap of the website should be prepared (e.g. in XML format). This issue is described later.

4.2. Website Content

4.2.1. Unique content

The basis of proper content optimization is uniqueness. This means that the same text, or larger fragments of it, should not be reproduced on other websites or on different pages of our website. To verify the degree of uniqueness of the content, this tool can be used: http://www.copyscape.com

4.2.2. Keywords

It is very important to enable search engines to relate our website to a specific theme and keywords. For this to be possible, keywords must be considered not only when designing the website title and description; they must also be contained in the website content. When preparing text for the website, it is suggested to stick to the following principles:
Repetition
Keywords should be repeated several times on every page, but it cannot be forgotten that the text should be written primarily for users. The task is to find a compromise between text that is attractive for users and good for SEO. Too high a density of keywords on a page can be treated by the search engine as abuse, in which case our website will be penalized by exclusion from the ranking.

Variations and synonyms
The website content will read more naturally if the keywords are used in many (grammatical) variations. Modern search engines are proficient enough to also detect the use of synonyms. For this reason we can, for example, use the word "drug" in content being optimized for the "medicine" keyword.

Location
Keywords should be spread over the whole page with a similar density. This gives a better positioning result than an accumulation of keywords, for example, only at the beginning of the page.

4.3. Source Code

The website's source code has no direct influence on the position in the search ranking. However, some errors can cause problems with proper indexing by search engine robots. For this reason it is worth ensuring that the code contains no errors and is compatible with current WWW standards. Very useful is the code validation tool provided by the World Wide Web Consortium (W3C), which can be found here: http://validator.w3.org

4.3.1. Headers

HTML header tags (h1–h6) are very significant for proper indexation of the website content, so their correct usage is an important part of website design. There are a couple of issues which must be considered from the SEO point of view.

Hierarchy
Header tags are designed to separate particular sections of a document. They must be used in the correct order and only when there is an actual need, as sketched below.
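For illustration, a minimal sketch of a consistent header hierarchy (the page subject reuses the National Geographic example from section 4.1.1; the section names are invented for this example):

<h1>Jaguars - National Geographic</h1>  <!-- only one h1 per document -->
<h2>Jaguar Facts</h2>                   <!-- main section -->
<h3>Habitat</h3>                        <!-- subsection of "Jaguar Facts" -->
<h3>Diet</h3>
<h2>Jaguar Pictures</h2>                <!-- next main section -->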
Repetition
According to current HTML standards, the first-degree header (h1 tag) may occur only once in the whole document. The other headers can be used repeatedly.

Keywords
It is suggested to put keywords into header tags, because they have more "positioning power" there than in regular text. This power probably corresponds to the header hierarchy, so the most important keywords should be placed in the h1 tag.

4.3.2. Highlights

Keywords can be distinguished from the rest of the text by using the <strong> (bold) and <em> (italics) tags. In this way, keywords are highlighted both for users and for crawlers. However, it should be done with restraint: not every occurrence of a keyword should be highlighted, only the most important ones.

4.3.3. Alternative texts

Sometimes there are images placed in the document. It is recommended to include alternative texts for those images. It can be done in this way:

<img src="path/to/image" alt="alternative text" />

The alternative text is displayed on the screen when the browser cannot display images (e.g. when they are unavailable on the server). These texts are also interpreted by search engine robots, and the data is then used in image search (when the search engine offers such an option).

4.3.4. Layout

A well-indexing website should have a clear and minimalistic layout. The content is the most important factor, so even the ratio of the amount of text to the amount of HTML code is significant: the higher this value, the better and more valuable the website is from the search engine's point of view.
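This ratio can be estimated with a few lines of code. The following is a minimal, illustrative Python sketch (the class and function names are the author's own invention, not part of any SEO tool); it uses only the standard library and ignores text inside script and style blocks:

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the visible text of an HTML document,
    skipping the contents of <script> and <style> blocks."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)

def text_to_code_ratio(html_source):
    """Share of visible text characters in the raw HTML source (0.0-1.0)."""
    extractor = TextExtractor()
    extractor.feed(html_source)
    visible_text = "".join(extractor.chunks).strip()
    return len(visible_text) / len(html_source) if html_source else 0.0

print(text_to_code_ratio("<html><body><h1>Jaguars</h1><p>Big cats.</p></body></html>"))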
4.4. Internal Linking

Quite an important issue in website optimization is internal linking. An internal link is a hyperlink which leads to another page of the same website. There are some recommendations connected with this link type.

4.4.1. Distribution

Each page should be reachable in at most 3 or 4 clicks. If it is not, the website navigation must be redesigned. Attention should be given especially to the links on the main page. The structure of the website must be clear. This is also very important for the usability of the website: complicated navigation can discourage a user from continuing the visit.

4.4.2. Link anchors

A link anchor is the clickable text. It is displayed to the user on a website instead of the plain URL, which is rather unreadable for a human. It looks like this:

<a href="some_url.html">Anchor text</a>

Anchors should describe the content of the pages their links lead to. If links are located within other text, they should match the context of the whole text. For example, it is not advisable to write "click here", as was popular a couple of years ago.

4.4.3. Broken links

A very important thing in website positioning is to beware of links which lead to unavailable URLs. Such an issue is very annoying and discouraging for visitors. A website with broken links will also be less valuable for search engines, because robots crawl the web using links: after indexing a page, the robot uses one of the links placed on that page to go to another page. When such a link is broken, the crawler can interrupt the indexing process, which leads to a situation where not every page of the website gets indexed. (A simple automated check is sketched at the end of this section.)

4.4.4. Nofollow attribute

Nofollow is an HTML attribute value used to instruct some search engines that a hyperlink should not influence the link target's ranking in the search engine's index. This is an example of such a hyperlink:

<a href="some_url" rel="nofollow">Some website</a>

It is intended to reduce the effectiveness of certain types of search engine spam, thereby improving the quality of search engine results and preventing a particular website from being indexed as spam. The nofollow attribute is commonly used on outbound links², for example in paid advertising.

² Links which target other websites.
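Returning to broken links: these can be detected automatically before a crawler stumbles over them. Below is a minimal, standard-library-only Python sketch under the assumption that checking the links of a single page is enough; the names are illustrative, and a real checker would also crawl recursively and respect robots.txt:

import urllib.error
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collects the href targets of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def find_broken_internal_links(page_url):
    """Returns the internal links of page_url that answer with an error."""
    host = urlparse(page_url).netloc
    html = urllib.request.urlopen(page_url, timeout=10).read().decode("utf-8", "replace")
    collector = LinkCollector()
    collector.feed(html)
    broken = []
    for href in collector.links:
        url = urljoin(page_url, href)
        if urlparse(url).netloc != host:
            continue  # external link, out of scope here
        try:
            # a HEAD request is enough, only the status code matters
            urllib.request.urlopen(urllib.request.Request(url, method="HEAD"), timeout=10)
        except (urllib.error.HTTPError, urllib.error.URLError):
            broken.append(url)
    return broken

print(find_broken_internal_links("http://www.example.com/"))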
4.4.5. Sitemap

A sitemap is a list of the pages of a website, accessible to crawlers or users. It helps visitors and search engine bots find pages on the website.

Sitemap for users

A dedicated page can be prepared containing links to all of the website's pages, or only to the most important ones. Thanks to this, users having problems with navigation will be able to find quickly what they are looking for. An example of such a sitemap located in a footer is presented in figure 4.4.

Figure 4.4. Example of sitemap for visitor

Sitemap for robots

A sitemap for crawlers must be easy to process automatically. Such a sitemap is usually prepared in the XML document format. This is what an example looks like:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.example.com/catalog?item=12</loc>
    <changefreq>weekly</changefreq>
  </url>
  <url>
    <loc>http://www.example.com/catalog?item=73</loc>
    <lastmod>2004-12-23</lastmod>
    <changefreq>weekly</changefreq>
  </url>
  <url>
    <loc>http://www.example.com/catalog?item=74</loc>
    <lastmod>2004-12-23T18:00:15+00:00</lastmod>
    <priority>0.3</priority>
  </url>
  <url>
    <loc>http://www.example.com/catalog?item=83</loc>
    <lastmod>2004-11-23</lastmod>
  </url>
</urlset>

As can be noticed, such a document contains some information about each link:

loc: the URL of a particular page
lastmod: the time of the last modification of the page
changefreq: the average period between changes to the page
priority: the priority value for the crawler to index this particular page

Such information is welcome by crawlers, and it can profit the publisher through faster indexing. In most cases such documents are prepared using software tools such as: http://www.xml-sitemaps.com/ Once the sitemap document is prepared, the search engine must be notified of its existence via a special form.
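For a small website, such a document can also be generated with a few lines of code. The sketch below is merely illustrative (the build_sitemap helper is invented for this example) and produces XML in the format shown above:

from xml.sax.saxutils import escape

def build_sitemap(entries):
    """entries: a list of dicts; 'loc' is required, while
    'lastmod', 'changefreq' and 'priority' are optional."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for entry in entries:
        lines.append("  <url>")
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                lines.append("    <%s>%s</%s>" % (tag, escape(str(entry[tag])), tag))
        lines.append("  </url>")
    lines.append("</urlset>")
    return "\n".join(lines)

print(build_sitemap([
    {"loc": "http://www.example.com/", "changefreq": "monthly", "priority": 0.8},
    {"loc": "http://www.example.com/catalog?item=12", "changefreq": "weekly"},
]))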
4.5. Addresses and Redirects

Among the previously described factors used in ranking algorithms, search engines also consider the form of the indexed website's URLs and the information included in HTTP responses.

4.5.1. Friendly addresses

Search engines give a higher rank value to websites whose pages have URLs that are more readable for humans. For example, an address like this:

http://www.example.com/index.php?page=product&num=5

can be written in this way:

http://www.example.com/product/5

Such an effect can be achieved using mod_rewrite, a module for the Apache Server which allows regular-expression patterns to be created for mapping URLs to particular pages. Modern web frameworks, such as the Django Framework or Ruby on Rails, offer the same possibility. Moreover, it provides another opportunity for placing keywords: a page being optimized for a particular keyword should have this keyword in its URL. If it is a phrase of a couple of words, it is suggested to separate them with dashes.

4.5.2. Redirect 301

Redirect 301 is a permanent redirect from one address to another. After applying it:

• visitors entering the old address into the browser's address bar will be redirected to the new one
• some search engines will replace the old address in their database with the new one

It is therefore very useful after a domain change. As stated earlier, the content of the website should not be duplicated. It is often forgotten that allowing a website to be entered through several addresses has the same result. Sometimes the same website can be downloaded via:

• example.com
• www.example.com
• example.com/index.html
• www.example.com/index.html
• example.com/default
• www.example.com/default

In such a case it must be decided whether the main address of the website will have the "www" prefix. If it will, a .htaccess file should be placed in the main folder of the server, with content such as:

RewriteCond %{HTTP_HOST} ^example.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

Similarly, we can manage the redirection from the index.html file:

RewriteCond %{REQUEST_FILENAME} index.html
RewriteRule ^(.*)$ http://www.example.com [R=301,L]

There are many other possibilities which can be managed likewise.

4.6. Other Issues

There are a couple of other things which have some influence on the website ranking.
4.6.1. Information for robots

Sometimes publishers do not want robots to index some of a website's pages while keeping them available for regular visitors. For example:

• results from internal search engines
• data sorting results
• print versions of pages
• pages which should not be indexed, like the login page of an administration panel

To manage this issue, a robots.txt file can be prepared with content like this:

User-agent: *
Disallow: /admin-panel/

It should prevent robots from indexing pages whose URLs start with www.example.com/admin-panel/.

4.6.2. Performance

One of the most recently introduced factors in the Google search engine algorithm is website performance. Google promotes websites with a short download time. It is not as significant as internal linking or quality of content: big information portals are very complex, so they cannot be downloaded as fast as, for example, a small blog, and valuable information remains the most important factor. However, good performance can increase the rank value of a website compared to another one with similar content that is not as efficient. Several tools can be used to improve website performance, like PageSpeed by Google: http://code.google.com/speed/page-speed It provides an analysis of the website's download efficiency and gives tips on how to improve performance. The most common suggestions given by this software are presented next.

Gzip compression

Modern web browsers allow the gzip compression mechanism to be used to reduce the size of website files (images, CSS files, Javascript files). If there is such a possibility, it is recommended to use it.
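On the Apache Server, for instance, this can be switched on with a few directives in the .htaccess file (a minimal sketch, assuming the mod_deflate module is installed):

<IfModule mod_deflate.c>
    # Compress common text resources before sending them to the browser
    AddOutputFilterByType DEFLATE text/html text/css application/javascript text/javascript
</IfModule>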
Number of DNS lookups

The DNS caching mechanism [23] means there is no need to look up the IP address matching a particular domain several times. For this reason, when every file used by the website is located on the same server (or another server in the same domain), there is only one DNS lookup during the download process. Placing media files (images, CSS, Javascript) on a different domain without clear need should therefore be avoided.

External files

Information commonly located in external files, like CSS style sheets or Javascript, can also be placed inside the HTML document. However, this should be avoided, because it makes the parsing of the source code more complex, so the browser needs more time to display the website on the screen.

4.7. Summary

The tips presented in this chapter should significantly increase the rank value of any website. A higher ranking in the search engine means more inbound traffic; in other words, the website will gain more popularity. Making the website more attractive to visitors should also lead to better results in personalization re-ranking. The assumptions about the impact of personalized search results on the global ranking are very plausible, so improving the quality of a website on the basis of the presented guide should increase the website's ranking in general, both through the personalization impact and through collecting inbound links as an effect of the increase in popularity.
Chapter 5

The System for Global Personalization

The goal of this chapter is to propose a method for improving a website's search ranking by affecting the personalization mechanism in the search engine. The idea of this method is to generate artificial behavioural data. The author of this thesis is a co-author of the article [19], on which this chapter is based.

In section 2.2 of this thesis it has been shown that a lot of data goes into Google and a lot of useful, processed data comes out. But we can only guess what happens in between, or try to learn from observing the data coming out of Google.

Evans wrote [10] that identifying the factors involved in a search engine ranking algorithm is extremely difficult without a large dataset of millions of SERPs and extremely sophisticated data-mining techniques. That is why observation, experience and common sense are the main sources of knowledge on Search Engine Optimization (SEO) methods. It was according to this knowledge that the Search Engine Ranking Factors [11] were compiled. The latest edition assumes that the traffic generated by the visitors of a website carries 7% of the importance in Google's evaluation of the website's value. It is, after links to the specific website and its content value, the most significant factor in the website evaluation process. On the basis of the previous editions of the ranking, one can notice that the importance of this factor is increasing.

Because all of these are only reasonable assumptions, the intention is to evaluate the validity of the described factor in web positioning efforts. For this purpose we need a simulation tool which will generate the necessary human-like traffic on a tested website. The tool is going to be a multi-agent system (MAS) which will imitate real visitors of the websites.

5.1. Problems to Solve

Fig. 5.1 presents the main reason why the system must be distributed.
A few queries to Google, sent frequently one after another from the same IP address, are detected by Google and treated as abuse. Google suspects automated activity and requires completion of a captcha form in order to continue searching.

Figure 5.1. Information displayed by Google on the abuse detection

When using a distributed system, the queries would be sent from many different IP addresses, which should guarantee that Google will not consider this activity abuse. This issue cannot be solved by using a set of public proxy servers, as Google has probably put them on its black list: every single query to Google via such a proxy server leads to the same end – a captcha request.

What is more, following Tuzhilin [35] we can say that Google puts a lot of reasonable effort into filtering invalid clicks on advertisements. There is a big chance that Google uses some of those mechanisms in its analysis of web traffic. This is why the generating of behavioural data should be our concern: artificial web traffic, once recognized, could be treated by Google as abuse and punished with a decline in the website's position.

5.2. Objectives

5.2.1. Web Positioning

The main goal of the system is to improve the position of a website by generating traffic related to the website. The system should care only about the activity that is visible to Google, which means there is no need to download all the content from a particular website; that would only waste bandwidth. The system should only send requests to the Google services used by a particular website, for example:

• links to the website on SERPs,
• Google Analytics scripts,
• Google Public DNS queries,
• Google media embedded on the website, like AdSense advertisements, maps, YouTube videos, calendars etc.

5.2.2. Cooperation

The whole idea of the system is to spread positioning traffic across worldwide IP addresses. As a result of this distributed character, the system requires a large group of cooperating users, and nobody will use the system if there are no benefits for them. A mechanism must therefore be introduced which lets the system's users share their Internet connections in order to help each other in web positioning. What is more, the mechanism must treat all users equally and fairly, meaning it should not allow anyone to take benefits without any contribution.

5.2.3. Control

According to [36], web positioning is not a single action but a process, and this process must be controllable; otherwise it could be destructive instead of improving the website's position. For this reason, the system should allow users to:

• control the impact of the system's activity on their websites,
• check the current results of the system's activity (changes in the website's position on SERPs),
• check the current state of the website in the web positioning process.

5.3. Architecture

Fig. 5.2 presents the architecture of the system, which takes into consideration all the specified problems and objectives.

Server – necessary to control the whole process of generating web traffic according to the specified algorithm. It gives orders to clients to start generating traffic on specified websites. It also receives information from clients about the number of requests sent to particular Google services on a website's account.

Database – serves as storage for process statistics. These can be presented to clients via a web interface, and they are also used by the server for creating orders in accordance with the algorithm.

Clients – the agents of the presented MAS. They take orders from the server, each naming a particular website registered in the database to be processed. Processing a website means mimicking a real visitor. A client performs this autonomously, using the visitor session algorithm described in the next section.
Figure 5.2. System architecture

5.4. Visitor Session Algorithm

According to much research ([1], [6], [7], [10], [20] and [36]), more than half of a website's visits come from the SERPs. That is why starting a single visitor session (the sequence of requests concerning a single website registered in the system) by querying Google Search sounds reasonable – however, only if the currently considered website appears on one of the first few SERPs. Otherwise, the visitor session should be started directly on the processed website, or should refer to an incoming link, if an existing one is available.

Because of Google's likely abuse-detection measures, the visitor session should be as human-like as possible. The analysis of real users' web traffic [29] is very useful here. According to it, a typical user:

• visits about 22 pages on 5 websites in one sitting,
• follows 5 links before jumping to a new website,
• spends about 2 hours per session and 5 minutes per page.

These statistics clearly describe a typical visit session on a website of good quality; a visit to a poor website would be aborted after only a few seconds, and such a visit could have a negative impact on the website's quality evaluation by Google. The algorithm, illustrated in figure 5.3, proceeds as follows:
1 – The server retrieves from the database information about the next website to be processed.
2 – The task is assigned to a client.
3 – The client starts the visitor session.
4 – The client searches the SERPs for a link to the processed website.
5 – If a link has been found, the client clicks on it; otherwise it requests the website directly.
6 – The visit session is processed.
7 – The client requests another website to process from the server.

Figure 5.3. Visitor session algorithm

5.5. Task Assignation Algorithm

The task assignation algorithm helps the server build a queue of the registered websites, ordered by visitor session priority. The website with the highest priority value is the next one on which a visitor session is started; in other words, a client always receives the website with the highest priority value to process. The priority value PV is calculated using the function:

PV(α) = r(α) · t(α) · v(α) / T(α)    (5.1)

where
α – a record in the system (a website with a phrase for web positioning)
r(α) – returns the current position in the search engine ranking for α (0 if α is not in the ranking)
t(α) – returns the time since the end of the last visitor session on α (in seconds)
v(α) – returns the number of visitor sessions made by α owner's client
T(α) – returns the time since the registration of α in the system (in days)
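A direct Python transcription of equation (5.1) might look as follows; the attribute names are illustrative, since the thesis does not prescribe a data model, and the age in days is guarded against division by zero:

import time
from dataclasses import dataclass

@dataclass
class Record:
    """A website/phrase pair registered in the system (field names are illustrative)."""
    rank: int            # r(α): current position in the ranking, 0 if absent
    last_session: float  # Unix time when the last visitor session on α ended
    sessions_made: int   # v(α): sessions performed by the record owner's client
    age_days: float      # T(α): days since registration in the system

def priority(rec, now=None):
    """PV(α) = r(α) · t(α) · v(α) / T(α), equation (5.1)."""
    t = (now if now is not None else time.time()) - rec.last_session  # t(α) in seconds
    return rec.rank * t * rec.sessions_made / max(rec.age_days, 1)    # guard day-zero records

def next_record(queue):
    """The server hands out the record with the highest priority value."""
    return max(queue, key=priority)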
The presented function gives the greatest "power" to the ranking factor. The reason for this is that websites with a high ranking value should already have more real visitors, so the system's efforts will not be as crucial for their popularity. Time of participation in the system is not very significant: novice participants have an equal chance to gain attention for their websites as the senior ones. However, the function promotes continuous activity of the clients. Also worth considering is the possibility of dynamically modifying the weights of individual factors depending on the results. Because the website queue is built by the server, it is possible to change the whole function while the system is running.

5.6. Proof Study

The presented system requires a large number of users to work properly. Otherwise the generated traffic would not be distributed enough and thus would look unnatural: as was shown, centralized series of queries are seen as abuse. Unfortunately, the thesis author's resources were insufficient for this purpose. However, a simulation has been performed which was intended to prove the proposed concept.

5.6.1. Tools

The idea was to use the Tor application (http://www.torproject.org) to make a single host (the author's computer) generate distributed traffic. In this way, the behavioural data of one real user could be seen by the search engine as multi-user traffic.

Tor is free software enabling Internet anonymity by thwarting network traffic analysis. It aims to conceal its users' identity and their network activity from traffic analysis. Operators of the system run an overlay network of onion routers which provides anonymity in network location as well as anonymous hidden services.

Users of a Tor network run an onion proxy on their machine. The Tor software periodically negotiates a virtual circuit through the Tor network. An application like a browser can be pointed at Tor, which then multiplexes the traffic through a Tor virtual circuit. Once inside the Tor network, the encrypted traffic is sent from one host to another, ultimately reaching an exit node, at which point the decrypted packet is available and is forwarded on to its original destination. Viewed from the destination, the source of the traffic appears to be the Tor exit node.

As figure 5.4 shows, Tor has become quite popular, so its network involves a large number of users. This makes Tor fit for the objective of this study. The Mozilla Firefox browser was used, connected to Tor; additionally, the iMacros plug-in was installed in order to automate the execution of visitor sessions. For analysis of the behavioural data received by Google during the study, the Google Analytics software (shown in figure 2.2) was used. It was installed on every examined website.
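The same routing can also be reproduced programmatically. Below is a minimal Python sketch, assuming a local Tor client listening on its default SOCKS port 9050 and the requests library installed with SOCKS support (pip install requests[socks]):

import requests  # SOCKS support: pip install requests[socks]

# A local Tor client exposes a SOCKS proxy, by default on port 9050.
# The 'socks5h' scheme also resolves DNS names inside the Tor network.
TOR_PROXY = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

def fetch_via_tor(url):
    """Fetches a URL so that, for the server, the request appears
    to originate from the Tor exit node, not from this machine."""
    response = requests.get(url, proxies=TOR_PROXY, timeout=60)
    response.raise_for_status()
    return response.text

# Depending on the current exit node, Google redirects to a country domain:
print(fetch_via_tor("http://www.google.com/")[:300])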
Figure 5.4. Tor interface screen

5.6.2. Results

Distributing the traffic succeeded. After opening the Google Search main page (www.google.com), the server redirects to the domain belonging to the country in which the Tor exit node of the particular session was located. For example, when the exit node was in Germany, the Google server redirected the browser from google.com to the google.de address. The Google Search page was displayed in the appropriate language, in spite of the fact that the browser's default language setting had been removed. After visits to the examined pages, Google Analytics also indicated that the source of the visits was not in Poland (where the study was actually conducted) but in the countries of the exit nodes of the traffic.

However, routing the traffic through the distributed Tor network turned out to be an insufficient solution. Firstly, traffic routed by Tor is significantly slowed down; from time to time there were even difficulties with downloading the complete search engine site. What is more, despite the large number of visit sources, there was still only one browser and one real user. Because of this, the simulation could only imitate one signed-in user or a group of signed-out users.
Signed-in user

In the first case there is essentially no difference between visiting through the Tor proxy network or directly. As described, Tor is a tool for concealing the identity of a user, but after signing in to a Google Account the identity is evident. From Google's point of view, such a visit is seen as a regular user travelling very quickly all around the world (metaphorically speaking). But it is still only one user, and the personalized search results applied to him should not be globally significant.

Group of signed-out users

As described earlier, Google introduced the personalization mechanism not only for users with a Google Account; there is also personalization of search results for users with no such profile account. It is based on storing cookies in the user's browser for up to 180 days, which contain information about past search activities. But cookies are not tied to a specific user, only to a browser. In this case, search results are re-ranked not for the person but rather for the particular computer which this person uses. The simulating system uses only one browser, so there was no possibility of evaluating the impact of personalization re-ranking on the search ranking from a global perspective.

Disabling the storing of cookies in the browser makes personalization impossible, because there is no way to relate past search engine queries to a particular user. Moreover, a browser with blocked cookies is a rather rare situation nowadays; therefore the Google search engine is rather suspicious of traffic with blocked cookies, and for such requests it serves the "Sorry page" (figure 5.1).

5.7. Summary

Generating artificial traffic on the Internet does not seem very praiseworthy, as it is dangerously close to the appearance of spam and introduces informational noise into visitor statistics. On the other hand, it is no worse than other SEO activities like linkbaiting.

Following [6], today's search engines use mainly link-popularity metrics to measure the "quality" of a page. This is the main idea of the PageRank algorithm [3]. This fact causes the "rich-get-richer" phenomenon: more popular websites appear higher on SERPs, which brings them even more popularity. Unfortunately, this is not very beneficial for new, unknown pages which have not gained popularity yet. It is possible that these websites contain more valuable information than the popular ones; despite this, they are ignored by search engines because of their small number of links. These sites, in particular, need SEO efforts. Classic techniques will probably be more effective than the one presented in this paper. Nevertheless, the methods presented here are likely to improve the rate of web positioning, because web traffic can be noticed by search engines immediately.
This is quite the opposite of crawlers, where it can take much longer until they find links to the considered websites and improve their ratings for the search engine. Of course, the best results should come from using the presented technique together with classical methods at the same time. That would mimic a gain in popularity in the most natural way: the number of links leading to a website usually rises linearly with the traffic and the number of visitors. However, the present-day SEO market focuses merely on classical methods.

What is more, the presented system could be used not only in SEO; it could also be very useful in other areas. Nowadays, performance testing of web applications is quite a challenging issue. Of course, there are some solutions for that, like the JMeter application (http://jakarta.apache.org/jmeter). However, the presented MAS would generate more natural traffic, so it could better simulate user behaviour "in real life".

To sum up, the aim of this study is to show that methods based on the manipulation of dynamic data are worth involving in SEO efforts. Moreover, these methods do not increase the amount of spam seen by Internet users, in contrast to classical methods, some of which involve generating static data (links and texts) with no value for end users.
Chapter 6

Conclusion

Google [13] says that in the future users will have a much greater choice of services with better, more targeted results. For example, a search engine should be able to recommend books or news articles that are particularly relevant, or jobs that an individual user would be especially well suited to.

Personalization should, and likely will, have a big impact on the way people search, on what publishers learn about their intended audiences, and on measuring the effectiveness of SEO campaigns – especially for SEO firms using ranking reports as one way of measuring the efficacy of their efforts. No longer can the SEO practitioner think simply in terms of ranking for a main index query result, since there are so many other potential rankings for documents that are not as readily observable in the traditional sense. Theoretically, rankings could seemingly drop a few places while traffic actually increases, due to a niche crowd following that assists the rankings via personalized search activities and popularity among a given subset of users.

While we are left to speculate about search engine behaviour and observe the changing landscape, there are some steps that an SEO professional or any website owner can take while anticipating the effects of personalization:

• learn about Social Networking Theory and Online Social Networks
• recognize and share with clients the diminishing value of ranking reports
• aim towards measuring results and conversions in a meaningful manner from log file analysis and Web analytics tools
• find ways to learn more about intended audiences and existing customers

Comparing the websites selected on one user's SERPs with another's for the same query could be very telling. Although the search engines will not share their strategies, it is clear that this type of analysis is being used elsewhere on the Web: consider, for example, the recommendations offered in Internet shops when people perform searches at that store
("people who purchased this book were also interested in..."). A search engine can likewise recommend pages selected by other users who searched using the same terms.

In attempting to provide personalized search results, the focus of search engines' efforts has shifted from matching keywords to knowing more about the true interests of searchers. Keyword matching still plays a role in what search engines do when returning results, but information gathered from the searchers themselves is playing an increasing role in the results they see.

Particularly interesting is how this personalization can be used from the global perspective. Information collected from a large number of interactions between users and search engines, such as which pages people click when faced with a list of search results, can be aggregated. If the vast majority of those searching for "jaguar" choose pages about the animal, it would make sense to show more pages with animal facts in the search results and fewer pages about cars.

However, the fact that affecting the personalization mechanism in the way presented in chapter 5 is not easy seems to be quite a positive conclusion, because personalization is a big step towards better filtering of spam in search results, and such a method used by SEO marketers could reduce the positive effect of global personalization. Unfortunately, there are already some indications that a kind of "SEO NetBot" is under construction, so in the near future the personalization mechanism may become vulnerable to manipulation.
List of Figures

2.1 Google Web History panel . . . 7
2.2 Google Analytics main panel . . . 9
2.3 Google Maps example screen . . . 11
2.4 Google Search main screen . . . 12
2.5 Example of the search result . . . 12
2.6 Activity diagram of a visit session . . . 14
2.7 Search results of the query sent via host located in USA . . . 17
2.8 Search results of the query sent via host located in Germany . . . 18
2.9 Search results personalized by user's search history . . . 19
4.1 Presentation of a website title in Google Search . . . 26
4.2 Presentation of a website description in Google Search . . . 27
4.3 Presentation of a website sitelinks in Google Search . . . 28
4.4 Example of sitemap for visitor . . . 32
5.1 Information displayed by Google on the abuse detection . . . 38
5.2 System architecture . . . 40
5.3 Visitor session algorithm . . . 41
5.4 Tor interface screen . . . 43
Bibliography

[1] Bifet A., Castillo C., Chirita P., Weber I., An Analysis of Factors Used in Search Engine Ranking. In: First International Workshop on Adversarial Information Retrieval on the Web, pp. 48–57. Lehigh University, Bethlehem (2005)
[2] Blankson S., Meta Tags, Optimising Your Website for Internet Search Engines. Blankson Enterprises Limited (2007)
[3] Brin S., Page L., The Anatomy of a Large-Scale Hypertextual Web Search Engine. World Wide Web 7, 107–117 (1998)
[4] Chirita P.A., Firan C., Nejdl W., Summarizing local context to personalize global web search. Information and Knowledge Management (2006)
[5] Carterette B., Jones R., Evaluating Search Engines by Modeling the Relationship Between Relevance and Clicks. Advances in Neural Information Processing (2007)
[6] Cho J., Roy S., Impact of Web Search Engines on Page Popularity. World Wide Web 13, 20–29 (2004)
[7] Chu H., Rosenthal M., Search Engines for the World Wide Web, A Comparative Study and Evaluation Methodology. American Society for Information Science 33, 127–135 (1996)
[8] Dou Z., Song R., Wen J., A large-scale evaluation and analysis of personalized search strategies. Research and Development in Information Retrieval 33 (2007)
[9] Dover D., The Comprehensive List of All the Data Google Admits to Collecting from Users. http://www.seomoz.org/user_files/google-user-data/SEOmoz-Google-User-Data.pdf
[10] Evans M.P., Analysing Google rankings through search engine optimization data. Internet Research 17(1), 21–37 (2007)
[11] Fishkin R., Search Engine Ranking Factors 2009. http://www.seomoz.org/article/search-ranking-factors
[12] Google Inc., The Google Privacy Policy. http://www.google.com/privacypolicy.html
[13] Google Inc., The Official Google Blog. http://googleblog.blogspot.com
[14] Google Inc., Web Search Help. http://www.google.com/support/websearch
[15] Gryszko M., Darmowy poradnik o podstawach optymalizacji stron WWW [A free guide to the basics of website optimization]. http://www.lexy.com.pl (in Polish)
[16] Harry D., The Fire Horse Guide to Google Personalized Search For Search Marketers. http://www.huomah.com/search-engines/learn-seo/fire-horse-guide-to-personalized-search.html
[17] Jeh G., Widom J., Scaling personalized web search. World Wide Web, 271–279 (2003)
[18] Joachims T., Optimizing search engines using clickthrough data. Knowledge Discovery and Data Mining 8, 133–142 (2002)
[19] Kowalski P., Król D., An Approach to Evaluate the Impact of Web Traffic in Web Positioning. KES-AMSTA 2010, LNAI 6071, Springer, 380–389 (2010)
[20] Lawrence S., Giles C.L., Searching the World Wide Web. Science 280, 98–100 (1998)
[21] Ledford J.L., SEO, Search Engine Optimization Bible. Wiley Publishing, Inc., Indianapolis (2008)
[22] Liu F., Yu C., Meng W., Personalized web search by mapping user queries to categories. Information and Knowledge Management, 558–565 (2002)
[23] Liu C., Albitz P., DNS and BIND, Fifth Edition. O'Reilly Media (2006)
[24] Micarelli A., Gasparetti F., Sciarrone F., Gauch S., Personalized search on the world wide web. The Adaptive Web, Lecture Notes in Computer Science 4321, 195–230 (2007)
[25] Microsoft Corporation, Bing Search Blog. http://www.bing.com/toolbox/blogs/search/default.aspx
[26] Pant G., Srinivasan P., Menczer F., Crawling the Web. In: Levene M., Poulovassilis A. (eds.) Web Dynamics, pp. 153–178. Springer-Verlag (2004)
[27] Pretschner A., Gauch S., Ontology based personalized search. Tools with Artificial Intelligence 11, 391–398 (1999)
[28] Qiu F., Cho J., Automatic identification of user interest for personalized search. World Wide Web, 727–736 (2006)
[29] Qiu F., Liu Z., Cho J., Analysis of User Web Traffic with a Focus on Search Activities. Web and Databases 8, 103–108 (2005)
[30] Shen X., Tan B., Zhai C., Implicit user modeling for personalized search. Information and Knowledge Management, 824–831 (2005)
[31] Spereta M., Gauch S., Personalizing Search Based on User Search Histories. In: Proceedings of WI '05, pp. 622–628 (2005)
[32] Sugiyama K., Hatano K., Yoshikawa M., Adaptive web search based on user profile constructed without any effort from users. World Wide Web 12, 675–684 (2004)
[33] Teevan J., Dumais S.T., Horvitz E., Beyond the commons, Investigating the value of personalizing Web search. The Workshop on New Technologies for Personalized Information Access (2005)
[34] Teevan J., Dumais S.T., Horvitz E., Personalizing search via automated analysis of interests and activities. Research and Development in Information Retrieval 31, 449–456 (2005)
[35] Tuzhilin A., The Lane's Gifts v. Google Report. http://googleblog.blogspot.com/pdf/Tuzhilin_Report.pdf
[36] Walter A., Building Findable Web Sites, Web Standards SEO and Beyond. New Riders, Berkeley (2008)
[37] World Wide Web Consortium, Hypertext Transfer Protocol - HTTP/1.1.
[38] Yahoo! Inc., Yahoo! Search Blog. http://www.ysearchblog.com/
[39] Yi X., Raghavan H., Leggetter C., Discovering Users' Specific Geo Intention in Web Search. World Wide Web (2009)
