Faculty of Computer Science and Management
field of study: Computer Science
specialization: Software Engineering




Master's thesis

Analysis of the Modern Methods for Web Positioning

Paweł Kowalski

keywords: search engine, SEO, personalization, optimization, web positioning


This thesis contains an analysis of the search results personalization mechanism in
popular search engines. It presents experiments and considerations about the impact of
personalization on search rankings and how it affects Search Engine Optimization
(SEO). Some new SEO methods that take advantage of personalization in search
engines are proposed.

Supervisor:          dr inż. Dariusz Król ............................. .............................
                         name and surname                  grade                    signature



For archival purposes, this diploma thesis has been classified as:*
     a) category A (permanent records)
     b) category BE 50 (subject to expert appraisal after 50 years)
*
  delete as appropriate
                                                                            Stamp of the institute



                                       Wrocław 2010
Contents

Chapter 1. Introduction
  1.1. Beginning of the SEO Concept
  1.2. Search Engines Evolution
  1.3. The Goal
Chapter 2. Personalized Search
  2.1. Operation Principles
  2.2. Methods for the Analysis of Behavioural Data
       2.2.1. Methods of behavioural data collecting
       2.2.2. Process of tracking user
  2.3. Research
       2.3.1. Location
       2.3.2. Phrase language
       2.3.3. Search history
  2.4. Spam Issue
Chapter 3. Impact of Personalized Search on SEO
  3.1. Metrics
       3.1.1. Areas for Consideration
Chapter 4. SEO Guide
  4.1. Website Presentation in Google Search
       4.1.1. Title
       4.1.2. Description
       4.1.3. Sitelinks
  4.2. Website Content
       4.2.1. Unique content
       4.2.2. Keywords
  4.3. Source Code
       4.3.1. Headers
       4.3.2. Highlights
       4.3.3. Alternative texts
       4.3.4. Layout
  4.4. Internal Linking
       4.4.1. Distribution
       4.4.2. Links anchors
       4.4.3. Broken links
       4.4.4. Nofollow attribute
       4.4.5. Sitemap
  4.5. Addresses and Redirects
       4.5.1. Friendly addresses
       4.5.2. Redirect 301
  4.6. Other Issues
       4.6.1. Information for robots
       4.6.2. Performance
  4.7. Summary
Chapter 5. The System for Global Personalization
  5.1. Problems to Solve
  5.2. Objectives
       5.2.1. Web Positioning
       5.2.2. Cooperation
       5.2.3. Control
  5.3. Architecture
  5.4. Visitor Session Algorithm
  5.5. Task Assignation Algorithm
  5.6. Proof Study
       5.6.1. Tools
       5.6.2. Results
  5.7. Summary
Chapter 6. Conclusion
List of Figures
Bibliography
Abstract
Modern search engines are constantly being improved. The most recent big step
introduced into their algorithms concerns the personalization mechanism. Its goal is to
extract information about the user's preferences implicitly from his search behaviour
and from factors such as location, phrase language and search history. This
information is the basis for building the user's search profile. The motivation for this
process is to provide search results that are more relevant to a specific user and his interests.
This thesis concerns the details of this personalization mechanism and tries to examine
how various factors affect search results. The author also analyses the methods used by
search engines for collecting behavioural data. He attempts to define the possible
impact of search results customization on Search Engine Optimization (SEO) issues
such as metrics, spam filtering or changes in the significance of website optimization
factors. The author then tries to evaluate the possibility of manipulating personalized
search rankings through a proposed system for generating human-like web traffic.

                                   Streszczenie (Polish abstract)
Modern web search engines are constantly being improved. The latest big step forward
introduced in their algorithms concerns the personalization mechanism. Its task is to
obtain information about the user's preferences from his behaviour while searching for
information, with respect to factors such as his location, the language of the searched
phrase and the search history. This information is the basis for creating a user profile.
The goal of these actions is to return to a specific user search results that better match
his interests. This thesis contains a detailed analysis of the personalization mechanism
and an attempt to examine how particular factors affect search results. The author also
analyses the methods by which search engines acquire data about users' behaviour. He
attempts to determine the possible impact of adjusting search results to the user on
topics related to web positioning, such as metrics, spam filtering or changes in the
significance of particular factors in website optimization. The author then tries to assess
the possibility of manipulating personalized search results by means of a proposed
system for generating natural-looking web traffic.




Chapter 1

Introduction

Before the Web and present-day search engines, searching meant simply matching the
terms in a query to the exact appearance of those terms in a database filled with textual
documents. Some database searches only let you locate documents where certain words
appeared within a defined distance of other specified words in the same document.
Sorting documents by relevance or importance would have been a monumental task, if
possible at all.



1.1. Beginning of the SEO Concept

When the Internet was introduced, it revolutionized the worldwide sharing of information.
Free access for everyone, without any restrictions, is the reason why the
Internet is considered to be one of the greatest inventions of the 20th century. But this
freedom has a serious implication: many problems with organizing this enormous set
of information.
Hyperlinks turned out to be insufficient for the task. This is why the first search engines
were introduced. They quickly became the main source of visits to commercial websites,
and good search results became a very important issue for content publishers.
That moment was the beginning of the SEO1 concept, which is still a major element
of Internet marketing.
The early search engines like AltaVista or Lycos were launched around 1994–1995 [7].
Their algorithms analysed only the content of websites and the keywords in
meta tags. It was easy to circumvent these algorithms by placing false information
in the keywords tags. Another popular fraud was filling website content with irrelevant
text which was visible only to search engine robots, but not to the user. As a result,
search engine result pages (hereafter SERPs) contained websites filled with spam and
inappropriate content [21].
 1. Search Engine Optimization (SEO) – the process of improving the volume or quality of traffic to
    a website from search engines. The term Web Positioning is also used as a synonym.

1.2. Search Engines Evolution

The relevance of search results to a query was still based on keyword matching, but
search engines started to understand differences in the importance of words located in
different parts of a page. For example, if you searched for a certain phrase, pages
containing those words in their titles and headlines might be considered more relevant
than other pages where those words also appeared, but not in those "important" parts
of the page.
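As a rough illustration of this idea, the sketch below scores a page by counting query terms in each part of the page and weighting the counts by where they occur; the weights and field names are illustrative assumptions, not values used by any real engine.

# A minimal sketch of field-weighted keyword matching (assumed weights).
FIELD_WEIGHTS = {"title": 3.0, "headline": 2.0, "body": 1.0}

def keyword_score(query_terms, page_fields):
    """page_fields maps a field name ('title', 'headline', 'body') to its text."""
    score = 0.0
    for field, text in page_fields.items():
        words = text.lower().split()
        for term in query_terms:
            score += FIELD_WEIGHTS.get(field, 1.0) * words.count(term.lower())
    return score

page = {
    "title": "American basketball history",
    "headline": "The origins of basketball",
    "body": "Basketball was invented in 1891 ...",
}
print(keyword_score(["american", "basketball"], page))  # hits in the title dominate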
Google, the company, which started in September 1998, revolutionized search engines. Its
co-founders, Larry Page and Sergey Brin, developed the PageRank algorithm [3]. This
algorithm redefined the search problem: the textual content of websites became somewhat
less significant. Instead of text content, PageRank rates websites mainly on the basis of
the quantity and quality of links leading to them. With the help of such improvements, the
Internet works as a kind of voting system. Every link is a vote for the website it leads to.
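A minimal sketch of the PageRank idea described above is given below: a page's score depends on the number and quality of pages linking to it. It uses plain power iteration with a damping factor of 0.85; the tiny link graph is an illustrative assumption, not data from the thesis.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:          # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:                     # each outgoing link is a "vote"
                for target in outlinks:
                    new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(graph))  # C collects the most votes and gets the highest rank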
Relevance was also determined by indexing the words used in links to other pages. If a link
leading to a page used the phrase "american basketball" as anchor text2, the page being
pointed to would be considered relevant to American basketball. The existence of links to
pages has also been used to help define the perceived importance of a page. Information
about the quality and quantity of links to a page can be used by search engines to get
a sense of the implied importance of the page being linked to.
Nevertheless, after a short time, techniques [21] for spoiling the results of PageRank
were also discovered. Basically, most of them work by increasing the
number of links leading to a particular website in order to enhance its PageRank value. Such
activity on a large scale is usually called linkbaiting. There are many scripts and web
catalogues that facilitate and automate such activity. However, Google is also constantly
working to improve its search engine algorithm and to make it resistant to
linkbaiting. According to [11], many new factors are being introduced into the website
evaluation process in order to reduce the impact of linkbaiting, which is a form of
spam.
Besides, there is a limit to the effectiveness of this type of keyword matching. When
two people perform a search at one of the major search engines, there is a chance that
even if they use the same search terms, they might be looking for something completely
different. For example, when an anthropologist searches for the phrase "jaguar", he expects
websites with information about the big cat as a result, but he may instead receive a
collection of websites about Jaguar cars.
As search engines progressed and users were given more and more websites with valuable
information, the engines needed to respond with a refined approach to search.
The main idea for improving the relevance of search results was to better understand
the user's intent and expectations when he types a certain phrase into a search box. So
it seems that the next step in search engine improvement is tracking regular users
on the Internet. Collecting data on their activity might give useful information about
which websites are valuable to them.
The major engines such as Google, Yahoo and
Bing guard their search secrets closely, so one can never be absolutely certain how they
operate. But they are evolving, and personalization seems to be the wave of the
future.
 2. Link label or link title: the text in a hyperlink that is visible to and clickable by the user.



1.3. The Goal

It seems quite clear that search engines fitted with a personalization mechanism
would have two main benefits:
 1. improvement of search result relevance for a specific user,
 2. a decrease in the number of spam entries in SERPs.
Google, the leader in the search engine market, already has the first steps in this area behind it;
such information reaches us from the official company blog [13]. Moreover, Google already
holds several patents connected with the personalization mechanism. For this reason, this
thesis is mainly concerned with Google Search. However, the high level of competition in the
Internet market suggests that other popular search engines like Yahoo and Bing are
also being improved in this direction.
The goal of this thesis is to analyse the possible aspects of the personalization mechanism
in Google Search, on the basis of available information. Several factors will be taken
into account which can have an influence on the changeability of SERPs:
 • geolocation
 • language of the query
 • web search history
 • query complexity
 • search behaviour (e.g. bounce rates3, time of visits)
This will be the basis for several experiments which should determine how advanced the
current level of personalization in the considered search engine is. This research
also includes an analysis of the data used to describe users' search behaviour,
particularly the methods of collecting these data and their types.
The obtained results will be used to specify the potential impact on SEO and its
metrics, in particular:
 • the possibility of using search engine personalization to create new SEO techniques
 • the usability of a website's ranking in search results as a measure of success in SEO
   activity

 3. Bounce rate is a term used in website traffic analysis. It essentially represents the percentage of
    initial visitors to a site who "bounce" away to a different site, rather than continuing on to other
    pages within the same website.
Personalization is an opportunity for search engines to make spam less significant for
search results and to make SEO workers' lives harder, but only spam in the sense of
websites with content of no value to humans and irrelevant hyperlinks. Personalization
also opens the door to another kind of information noise: behavioural data spam. The
thesis presents the architecture of a distributed system generating artificial web traffic,
thereby imitating the search activity of a real user. However, using such a system can be
seen as unethical, so the thesis contains only a concept and a design. The author has no
intention of implementing such a system, but he tries to examine with available tools
whether building it would be feasible. In this way, possible harmful actions against which
search engines should be protected may be indicated.
After that, there is a short analysis of the known up-to-date information about
significant factors in web positioning. Together with the results of the personalization
research, this helped to prepare a collection of advice on how to build a website attractive
to search engines: a sort of guide for webmasters.
At the end of the thesis there is a short conclusion. It contains the author's thoughts about
future trends in search engines and SEO.
Chapter 2

Personalized Search

Pretschner [27] in 1999 wrote: With the exponentially growing amount of information
available on the Internet, the task of retrieving documents of interest has become in-
creasingly difficult. Search engines usually return more than 1,500 results per query,
yet out of the top twenty results, only one half turn out to be relevant to the user. One
reason for this is that Web queries are in general very short and give an incomplete
specification of individual users’ information needs.

To be more specific, Speretta [31] in 2005 wrote: [...] most common query length sub-
mitted to a search engine (32.6%) was only two words long and 77.2% of all queries
were three words long or less. These short queries are often ambiguous, providing little
information to a search engine on which to base its selection of the most relevant Web
pages among millions.

According to Wikipedia, by 2006 Google had indexed over 25 billion web pages and
1.3 billion images, handled 400 million queries per day, and archived over one billion
Usenet messages. The Internet grows very quickly. For this reason, search accuracy is a
crucial area for constant improvement in modern search engines. One of the major
solutions to meet this challenge is personalization.

Personalized search is simply an attempt to deliver more relevant and useful results
to the end user (searcher) and to minimize less useful results. The personalization mechanism
uses information about the user's past actions and behaviour to build his profile and
match relevant search results to this profile. It should provide a more useful set of results,
or a set of results with fewer irrelevant or spam entries. For this reason, personalized
search seems to be desirable to the end user.

Google puts it this way: Search algorithms that are designed to take your personal
preferences into account, including the things you search for and the sites you visit,
have better odds of delivering useful results [13]. The goal is simple: to reduce spam
and to deliver better results. This looks like a dangerous weapon against SEO workers,
who are major offenders in generating spam.

Official information [13], [25], [38] indicates that Google is the only major search engine
that has already introduced a personalization mechanism. The first personalized search
results appeared almost 5 years ago [13], and since then the mechanism has been
constantly evolving.



2.1. Operation Principles

Of course, the details of Google's search algorithms are not public. But it can be expected
that the main principles are based on ideas which can be found in the scientific
literature.
According to [31], personalization can be applied to search in two different ways:
 1. by providing tools that help users organize their own past searches, preferences,
    and visited URLs;
 2. by creating and maintaining sets of user's interests, stored in profiles, that can be
    used by the retrieval process of a search engine to provide better results.
His research proved that user profiles can be implicitly created out of the limited
amount of information available to the search engine itself. The profiles are built on
the basis of the user's interactions with a particular search engine. Google has applied
the second approach in its search engine, because it does not provide any additional
tools like toolbars or browser add-ons for personalizing search.
After [31]: In order to learn about a user, systems must collect personal information,
analyze it, and store the results of the analysis in a user profile. Information can be
collected from users in two ways: explicitly, for example asking for feedback such as
preferences or ratings; or implicitly, for example observing user behaviors such as the
time spent reading an on-line document.
Google Search does not provide any forms which let users specify their interests
and preferences, so to build a user profile this information must be collected in another
way. According to [31], user browsing histories are the most frequently used source
of information for creating interest profiles. But browsing history (such as that
presented in Figure 2.1) is not the only significant source.
For example, a user sends a search query and gets search results. He selects a specific
entry that seems interesting, clicks on it, and the website is saved in his browsing
history. However, the user quickly realizes that the selected website does not fit his
interests and goes back to the search results after a few seconds. Such a visit should be
qualified rather negatively. So not only browsing history, but also the user's behaviour
should be taken into consideration by the personalization mechanism.
Studying a series of searches from the same user may also offer a glimpse into modified
search behaviour. How does an individual change their queries after receiving
unsatisfactory results? Are search terms shortened, lengthened or combined with new
terms? There is much other information that a search engine might collect about a user
when a search is performed: location, language preferences indicated in the browser,
or the type of device being used (mobile phone, handheld or desktop). But how can
such behavioural data be collected by a search engine? The answer is in the next
section.

                       Figure 2.1. Google Web History panel



2.2. Methods for the Analysis of Behavioural Data

Search engine robots, hereafter crawlers [26], continuously gather information from
almost every website on the Internet. It is well known that Google collects an enormous
amount of data through this process. These data have the greatest significance for the
search engine algorithm, which is why classic web positioning methods are based on link
maintenance, mainly the acquisition of links.
Google processes these data and sorts websites according to their value to the user.
The user sends queries to the search engine and gets the appropriate SERPs. Because Google
knows what people search for, it is able to determine the popularity of specific
information on the Internet. But eventually, it is the user who decides which website
is valuable for him and which is not. The value of a particular website is reflected in
users' activity: which links have been clicked and the time between these actions.
This information is called behavioural data.
It is reasonable to make all these data useful for the search engine. Certainly, Google knows
that too. This is probably the reason why it collects an enormous amount of behavioural
data in addition to the data collected by crawlers. This kind of information is what this
study is most interested in.


2.2.1. Methods of behavioural data collecting

The entire web is based on the HTTP protocol, whose requests contain the following
information:
 • the IP address of the user making the request, which can be used for geolocation of
   this user,
 • the date and time of the request,
 • the language spoken by the user,
 • the operating system of the user,
 • the browser of the user,
 • the address of the website which, via a link, referred the user to the requested website.
These HTTP requests are used by Google in:
Click tracking – Google logs all of its users' clicks on all of its services,
Forms – Google logs every piece of information typed into every submitted form,
JavaScript execution – requests, and sometimes even more data, are sent when a
    user's browser executes a script embedded in a website,
Web Beacons – small (1 pixel by 1 pixel) transparent images embedded in websites, which
   cause a request to be sent every time a user's browser downloads such an image
   (a sketch of such a tracking endpoint follows this list),
Cookies – small pieces of text stored on the user's computer which let Google track
   users' movements around the web every time they visit a page that carries
   Google advertisement.
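To make the last two items more concrete, below is a minimal sketch of a tracking-pixel endpoint; Flask, the route name and the cookie handling are illustrative assumptions, not a description of Google's actual implementation.

import base64
from datetime import datetime, timezone

from flask import Flask, request, make_response

app = Flask(__name__)

# The classic 1x1 transparent GIF used as a "web beacon".
PIXEL = base64.b64decode("R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7")

@app.route("/beacon.gif")
def beacon():
    # All of this arrives with the HTTP request itself (cf. the list above).
    hit = {
        "time": datetime.now(timezone.utc).isoformat(),
        "ip": request.remote_addr,                          # geolocation source
        "language": request.headers.get("Accept-Language"),
        "user_agent": request.headers.get("User-Agent"),    # OS and browser
        "referer": request.headers.get("Referer"),          # page embedding the pixel
    }
    print(hit)  # a real system would persist this in a log or database

    response = make_response(PIXEL)
    response.headers["Content-Type"] = "image/gif"
    if "uid" not in request.cookies:
        # A long-lived cookie ties subsequent hits to the same browser.
        response.set_cookie("uid", "generated-unique-id", max_age=180 * 24 * 3600)
    return response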
But all these elements have to be placed on the websites being indexed by Google's crawlers.
Fortunately for Google, the company offers many services that are very useful for Internet
publishers. Because they are mostly free to use, webmasters gladly embed them in their websites.

Google Analytics

One of these attractive services is Google Analytics. It generates detailed statistics
about:
 • the visitors to the website,
 • the website the visitor came from,
 • the activity (navigation) of the visitor within the website.




                       Figure 2.2. Google Analytics main panel


It is the most popular and one of the most powerful tools for examining the web traffic on
a website, and it gives the owner a lot of useful information about its visitors.
Figure 2.2 shows a few features of this piece of software. The information which it
provides is certainly very interesting for Google itself. For this reason there are many
discussions among SEO workers about the possible disadvantages of using Google
Analytics in SEO campaigns: poor results for a particular website reported
through Google Analytics could prompt Google's search engine to decrease the value of this
website in the search ranking. But this is only unconfirmed speculation.

Google Toolbar

Another Google tool which provides even more valuable data is Google
Toolbar. It is a plug-in adding a few new features to popular browsers, mainly
quick access to Google's services. One of these features is checking the PageRank value
of the currently viewed webpage. This gives Google information about every website
that users with the Toolbar installed are viewing.

Google AdSense

There is also Google AdSense, a contextual advertising program for website publishers.
Millions of websites use this service to generate some financial profit for their authors.
The effect is that all these websites display ads published by Google's servers. This can
provide Google with similar information to Google Analytics and Google Toolbar.

Google Public DNS

The latest service launched by Google is Public DNS (Domain Name System) [23]. It
is said to be faster and more secure than other resolvers, and this is how Google encourages
us to start using it. It can generate a massive amount of information about web traffic,
since every single query to the DNS can be analyzed by its provider. So the more
popular their DNS becomes, the better for Google: it can provide a lot of information
helpful in determining the popularity of websites.
But because of DNS caching mechanisms [23], Google does not get all the desired
information. A DNS client sends a query only when the user wants to visit a domain for the
first time. After it gets the IP address of this domain from the DNS, it caches it for an
interval determined by a value called time to live (TTL). Visits during this
interval do not send any query. Consequently, Google still needs other services
to gather the desired information about the activity of a particular website's visitors.
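A minimal sketch of this client-side caching behaviour, with invented names and a fixed TTL purely for illustration, shows why only the first lookup within the TTL window reaches the resolver:

import time

class CachingResolver:
    def __init__(self, upstream_lookup, ttl_seconds=300):
        self.upstream_lookup = upstream_lookup  # e.g. a query to the public resolver
        self.ttl = ttl_seconds
        self.cache = {}  # domain -> (ip, expiry timestamp)

    def resolve(self, domain):
        now = time.time()
        cached = self.cache.get(domain)
        if cached and cached[1] > now:
            # Served locally: the DNS provider never sees this visit.
            return cached[0]
        ip = self.upstream_lookup(domain)   # only this call is visible upstream
        self.cache[domain] = (ip, now + self.ttl)
        return ip

# Usage: repeated visits within the TTL trigger a single upstream query.
resolver = CachingResolver(lambda domain: "93.184.216.34", ttl_seconds=300)
for _ in range(3):
    resolver.resolve("example.com")  # upstream_lookup runs only once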

Other Google Services

Google has other very popular services, for example YouTube and Google Maps (Fig. 2.3),
which allow users to embed objects like videos or maps on their own websites.
There is also Google Reader, which can indicate the popularity of particular websites
by counting their RSS1 subscribers.
There are many other ways for Google to gain useful data [9]. In fact, Google itself
admits to the use of all the described techniques in its privacy policy [12]. Most of these
data are probably used to improve the accuracy of its search engine and the quality
of its services.


2.2.2. Process of tracking user

The described services can be a great source of behavioural data; there is no doubt about that.
But the process of tracking a user's search activity would be incomplete without the data
provided by the search engine itself. The next few sections present what the tracking process looks like.
 1. RSS (most commonly expanded as Really Simple Syndication) is a family of web feed formats used
    to publish frequently updated works – such as blog entries, news headlines, audio, and video – in
    a standardized format




                       Figure 2.3. Google Maps example screen


Starting the session

When the user opens the search engine site (by typing the www.google.com address
in a browser), he sends an HTTP Request [37] to the server. This request contains the
IP address of the user's computer. Thanks to this information, the search engine is able
to relate subsequent search queries to particular users. Each of them is assigned a unique
session identifier, stored on the server. This is the beginning of the user's search session.
The identifier expires after a certain period of the user's inactivity. In this way the search
session is terminated.

Sending the search query

The view presented in Figure 2.4 should be familiar to every Internet surfer. This is the
place where the user can type his search query.
After the search query is sent, two things happen (a minimal sketch of both steps follows):
 1. The query is stored in a database and connected with the user's session identifier.
    The personalization mechanism then takes advantage of it.
 2. The query is analysed and used by the search algorithm to provide relevant search
    results to the user.
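Below is a minimal sketch of the server-side bookkeeping described above; the timeout value, identifier scheme and in-memory storage are illustrative assumptions.

import time
import uuid

SESSION_TIMEOUT = 30 * 60   # assumed: expire after 30 minutes of inactivity
sessions = {}               # session_id -> {"queries": [...], "last_seen": timestamp}

def get_or_create_session(session_id=None):
    """Return a live session identifier, starting a new session if needed."""
    now = time.time()
    session = sessions.get(session_id)
    if session and now - session["last_seen"] < SESSION_TIMEOUT:
        session["last_seen"] = now
        return session_id
    new_id = uuid.uuid4().hex           # unique identifier for the new session
    sessions[new_id] = {"queries": [], "last_seen": now}
    return new_id

def log_query(session_id, query):
    """Step 1 above: store the query and tie it to the user's session."""
    sessions[session_id]["queries"].append((time.time(), query))

sid = get_or_create_session()
log_query(sid, "query example")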




                       Figure 2.4. Google Search main screen


After that, the user receives an HTTP Response [37] with the search results as HTML
code.

Result selection

Figure 2.5 presents one of the results for the query "query example", with the
hyperlink highlighted.




                      Figure 2.5. Example of the search result


A click on the link normally sends an HTTP Request to the server the link leads to. So
in this case, it should be sent to:
http://www.wisegeek.com/what-is-query-by-example.htm
But when you look into the source code of the SERP, you will find a URL like this:
http://www.google.com/url?sa=t&source=web&cd=6&ved=0CDMQFjAF&
  url=http://www.wisegeek.com/what-is-query-by-example.htm&
  ei=CekOTLT4EpHu0gTYitWXDg&usg=AFQjCNE3t34-kSehUAK8TFNwh5CV9K-OWg&
  sig2=PdwrnqnhLhowpC8t5-06bw
The most important thing to notice is that the links in the SERPs lead to Google's server.
The chosen website still finally appears on the user's screen, because Google's server
performs a URL redirection (forwarding). This technique has the downside of a short
delay caused by the additional request to the search engine server.
However, in this way the search engine can log every user's click in the SERPs. What is more,
there are some additional data in the result's URL which probably provide some extra
information. For example, the value of the cd parameter is the position of the entry in the
current search ranking. Moreover, this URL can be seen by the target server, because
browsers place it into the HTTP Request data as the Referer field [37]. This fact is used by
software like Google Analytics to aggregate the traffic sources of websites equipped with Analytics
scripts. Thanks to this, the website owner can obtain:
 • the most popular search phrases that result in visits to his website,
 • the position of his website in the ranking for a particular search phrase and a particular user
   (it can vary due to the personalization mechanism).
Of course, the same information is taken into consideration by the search engine.
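As a small illustration, the sketch below extracts the chosen target and its ranking position from a redirect URL of the kind shown above; apart from url and cd, whose roles are described in the text, the remaining parameters are undocumented, and the logging step is an assumption.

from urllib.parse import urlparse, parse_qs

redirect_url = ("http://www.google.com/url?sa=t&source=web&cd=6"
                "&url=http://www.wisegeek.com/what-is-query-by-example.htm")

params = parse_qs(urlparse(redirect_url).query)
target = params["url"][0]        # the website the user actually chose
position = int(params["cd"][0])  # position of the clicked entry in the ranking

# On the search engine side, a click handler could log this selection and then
# answer with an HTTP 302 redirect to the target, as described above.
print({"target": target, "rank": position})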

Behavioural data extraction

According to much research [1], [6], [7], [10], [20] and [36], more than half of a website's
visits come from SERPs. In the case of e-commerce websites, this value is even higher,
because such services usually do not have regular visitors; they mostly come from
search engines (up to 90% of visits) or from ads appearing on other websites.
According to an analysis of real users' web traffic [29], a typical user spends about 2
hours per session and 5 minutes per page.
These statistics concern a website of good quality, relevant to the user's interests. A visit
to a website with poor content would be terminated after as little as a few seconds, a so
called bounce. Such a visit should indicate the irrelevance of the website selected by the user,
so it would be desirable for it not to appear in the search results for the particular phrase.
Not only what you select and interact with from a given set of search results (or the ads
served with them), but also what you do not select or have minimal interaction with
(bounce rates) can have an effect. These metrics can be used to create a better
probability model for future search result sets.
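A minimal sketch of how a logged visit could be classified from its dwell time is given below; the 3-minute threshold follows the visit-session diagram in Figure 2.6, while the field names and the hard cut-off are illustrative assumptions.

from dataclasses import dataclass

BOUNCE_THRESHOLD_SECONDS = 3 * 60   # threshold taken from the diagram in Fig. 2.6

@dataclass
class Visit:
    phrase: str           # the query that led to the click
    url: str              # the selected result
    dwell_seconds: float  # time until the user returned to the SERP or left

def classify(visit: Visit) -> str:
    """Return the signal the personalization mechanism would log for this visit."""
    if visit.dwell_seconds < BOUNCE_THRESHOLD_SECONDS:
        return "bounce"   # negative signal: the result was probably irrelevant
    return "visit"        # positive signal: the result satisfied the user

print(classify(Visit("query example", "http://www.wisegeek.com/...", 12.0)))  # bounce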
What is more, a mechanism based on cookies has recently been introduced in Google
Search. It makes it possible to learn the history of search queries from the last 180 days for
every user, including those not logged into any Google service. Officially, it is used to
personalize SERPs according to the past interests of the user. Google [13] says about this: Because
many people might search from a single computer, the browser cookie may be associated
with more than one person's search activity. For this reason, we don't provide a method
for viewing this signed-out search activity.
The diagram in Figure 2.6 shows the process of tracking a user who has not signed in
to a Google Account.

     Figure 2.6. Activity diagram of a visit session (a visit longer than 3 minutes is
     logged as a visit, a shorter one as a bounce)

This is what Google [14] says about personalized search for signed-out users:
When you search using Google, you get more relevant, useful search results, recom-
mendations, and other personalized features. By personalizing your results, we hope to
deliver you the most useful, relevant information on the Internet.
In the past, the only way to receive better results was to sign up for personalized search.
Now, you can get customized results whenever you use Google. Depending upon whether
or not you’re signed in to a Google Account when you search, the information we use
for customizing your experience will be different:

Signed-in personalization: When you’re signed in, Google personalizes your search
    experience based on your Web History. If you don’t want to receive personalized
    results while you’re signed in, you can turn off Web History and remove it from
    your Google Account. You can also view and remove individual items from your
    Web History.

Signed-out customization: When you’re not signed in, Google customizes your search
    experience based on past search information linked to your browser, using a cookie.
    Google stores up to 180 days of signed-out search activity linked to your browser’s
    cookie, including queries and results you click.

          Table 2.1. Information used by Google Search in personalization

                               Signed-in Personalized Search        Signed-out Personalized Search
  Place of data storage        Web History, linked to the           On Google's servers, linked to an
                               Google Account                       anonymous browser cookie
  Time interval of             Indefinitely, or until the user      Up to 180 days
  data storage                 removes it
  Searches used                Only signed-in search activity,      Only signed-out search activity
  to customize                 and only if the user has signed
                               up for Web History


2.3. Research

The goal of this section is to evaluate the current level of personalization based on
several factors. For this task, what Google [14] says about the types of results
customization should be helpful:

When you use Google to search, we try to provide the best possible results. To do that,
we sometimes customize your search results based on one or more factors:

Search history: Sometimes, we customize your search results based on your past
    search activity on Google, such as searches you’ve done or results you’ve clicked.
    If you’re signed in to your Google Account and have Web History enabled, these
    customizations are based on your Web History. If you’re signed in and don’t have
    Web History enabled, no search history customizations will be made. (Using Web
    History, you can control exactly what searches are stored and used to personalize
    your results. Learn about using Web History)

     If you aren’t signed in to a Google Account, your search results may be customized
     based on past search information linked to your browser using a cookie. Because
     many people might be searching on one computer, Google doesn’t show a list of
     previous search activity on this computer. Learn how to turn off these customiza-
     tions

Location: We try to use information about your location to customize your search
    results if there’s a reason to believe it’ll be helpful (for example, if you search for
    a restaurant chain, you may want to find the one near you). If you’re signed in
    to your Google Account, that customization may rely on a default location that
    you’ve previously specified (for example, in Google Maps). If you’re not signed
    in, the results may be customized for an approximate location based on your IP
    address.

     If you’d like Google to use a different location, you can sign in to or create a
     Google Account and provide a city or street address. Your specific location will be
     used not only for customizing search results, but also to improve your experience
     in Google Maps and other Google products.

2.3.1. Location


While you can search at google.com just about anywhere in the world, you can also
access Google at a number of different country-specific addresses, such as google.co.uk,
www.google.fr or www.google.co.in. In fact, Google automatically redirects you to the
proper domain by using your IP address to determine your geolocation. The browser's
language setting was left at its default (recommended) value in this experiment.

This experiment was performed from a single location in Poland. However, to simulate
requests from other locations, a software environment similar to the one described
in Section 5.6.1 was used. The phrase used, "jaguar", is multi-lingual, so the language of the
phrase does not affect the search results.

The first query was sent through three Tor hosts, where the exit host was located in
Los Angeles, California, United States. The result of the query is presented in Figure
2.7 (only the first several entries).
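A minimal sketch of sending such a query through Tor, so that the search engine sees the exit node's IP address, might look as follows; it assumes a local Tor client listening on port 9050 and the requests library installed with SOCKS support, and choosing a specific exit country is a Tor configuration detail not shown here.

import requests

TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",   # socks5h: DNS is also resolved via Tor
    "https": "socks5h://127.0.0.1:9050",
}

response = requests.get(
    "https://www.google.com/search",
    params={"q": "jaguar"},
    proxies=TOR_PROXIES,
    timeout=30,
)
print(response.url)          # Google may redirect to a country-specific domain
print(response.status_code)  # the HTML body would then be parsed for the entries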

All websites in this SERP are in English, which is the language spoken at the described
location. Moreover, near the bottom there are some places indicated on Google
Maps which are physically close to the location of the exit host.

The second query was sent via an exit host located in Erfurt, Thuringia, Germany.
Figure 2.8 presents the results of this query.

The Official Google Blog [13] states that the same query typed in different countries
may deserve completely different results. The presented results clearly show that those
words are true.

Unfortunately, the author was unable to check whether a search for the query "football"
provides different results in the US, the UK and Australia, where the term refers to
completely different sports, but it is quite possible. A preferred country might
include the country of the searcher as well as other countries that the searcher might find
acceptable, such as showing search results from the United States to people located in
Canada.



2.3.2. Phrase language


It is rather clear that the language of the searched phrase is significant for the results.
Search engines, despite personalization, still use the matching of phrases to the content of
indexed pages as the major factor in evaluating search relevance. For this reason,
phrases identical in semantic meaning but expressed in different languages are, in general,
completely different.

So serving search results with English pages about birds would be senseless if the
user typed the phrase "Vogel" into the search box, which means "bird" in German.




        Figure 2.7. Search results of the query sent via host located in USA


2.3.3. Search history

The most interesting factor said to have an influence on the personalization mechanism
in the Google search engine is the user's search history. Figure 2.9 shows search results which
were slightly modified by re-ranking based on search history. In the fifth position, right
after two video thumbnails, there is a link to a website which was visited 4 times
(the exact number of visits is visible on the right side of the hyperlink) and used by the author
of this study to gather information.
These fluctuations appear only when the user is signed in to a Google Account; otherwise
there is no access to the web history (Figure 2.1).




      Figure 2.8. Search results of the query sent via host located in Germany


This modification of the search results was not intentional on the author's part; the
phrase "personalized search" was not the object of the experiment. However, this result
shows that search history affects future search results in similar areas of information.

To compare the modified results with the original ones (without the impact of personalization),
there are two ways to disable results customization:

 1. signing out from the Google Account,

 2. using the "View customization" link available at the bottom of the results screen.

After using one of these options, we can check the original position of the visited website
in the ranking. In this particular case, the website holds 17th position in the results
with no customization, so the personalization re-ranking increased its position by
12 places.

Most importantly, this change places the visited website on the first
SERP of the search ranking. In most cases (more than 90% of searches) users do not
go beyond the first page of results, so such a change in ranking causes a huge increase
in the number of visitors arriving via this phrase.




           Figure 2.9. Search results personalized by user’s search history


Unfortunately, the author failed to force the search engine to re-rank search results
intentionally, so after this experiment an approximation of the re-ranking algorithm is
not possible.



2.4. Spam Issue

There is a huge amount of value in getting to the top of the search results, especially
for competitive phrases related to a business; this is a marketing area with, quite often,
millions of dollars in it. So spammers are highly motivated, because there
is a lot of money at stake. Unfortunately, regular users searching for valuable content
are the main victims of these practices.
One of the more interesting aspects of implicit/explicit user feedback in the search
personalization process is that it can be very effective in dealing with spam. The more
personalized the results, the smaller the chance that spam will appear in the search ranking.
In most cases, spammy websites do get clicked by users (who are tricked by a
link with false information about the target website), but after realizing the real value of
those websites, they quickly go away and do not come back.

Not only does this enable search engines to limit spam through personalization, it is also
a great source of query/click analysis for Google. It is worth considering the case where
the click data across multiple users shows that a given entry in a query space is rarely
clicked, or shows a high bounce rate; Google might use that signal as a dampening
factor for a spam result.
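A minimal sketch of that dampening idea is given below; the formula and weights are purely illustrative assumptions, not a known part of any ranking algorithm.

def dampened_score(base_score, impressions, clicks, bounces):
    ctr = clicks / impressions if impressions else 0.0
    bounce_rate = bounces / clicks if clicks else 1.0
    # Rarely clicked and frequently bounced-from entries look like spam.
    dampening = ctr * (1.0 - bounce_rate)
    return base_score * dampening

# A result shown 1000 times, clicked 20 times and bounced from 18 times
# keeps only a tiny fraction of its base score.
print(dampened_score(10.0, impressions=1000, clicks=20, bounces=18))  # 0.02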
Chapter 3

Impact of Personalized Search on
SEO

3.1. Metrics


For quite a long time, SEO workers used the position in the search ranking for particular
phrases as the indicator of web positioning success. An increase in the website's position has
always been a desirable consequence of SEO actions.

After the implementation of the personalization mechanism, the issue is not so simple.
Although customizations of rankings are still not very influential (only one entry in a whole
SERP), the highly visible benefits of personalization suggest that the impact will
grow. For this reason, ranking position can no longer be the major metric of success,
because the ranking position of a particular website can be different for
every user, especially for users who are regular visitors of this website.

Of course, this indicator is still measurable, because position monitors1 are not subject to
personalization (they do not use cookies or a Google Account). It can also give useful
information about the position seen by users searching for the concerned keywords for the first
time. But it is the increase in inbound traffic which has always been the main motivation
of SEO actions, so the major metrics of these actions should be closely related
to this motivation. For example, these are metrics for SEO in the time of personalized
search (a sketch computing a few of them from a visit log follows the list):

Number of unique visitors. A higher value indicates a good result of the advertising
   campaign and growing popularity among new customers.

Previous search queries. As an example: if the searcher has recently been searching
    for the term 'diabetes' and submits a query for 'organic food', the system attempts
    to learn from this and presents additional results relating to organic foods that are
    helpful in fighting diabetes.

 1. Software which automates monitoring of a website’s position in search rankings for given phrases

Previously presented results. Results that have been presented to the end user
    can be omitted in future results for a given period of time in exchange for other
    potentially viable results.

User query selection. Previously selected or preferred documents can be analysed, and
    similar documents or linking documents can be used to refine subsequent results.
    Furthermore, certain document types can be treated as preferred, in what would be
    a combination with Universal Search concepts. Commonly accessed websites can
    also be tagged as preferred locations for further weighting.

Selection and bounce rates (and user activity on website). An editorial scor-
    ing can be devised from the amount of time a user spends on a page, the amount
    of scrolling activity, what has been printed, or even what has been saved or book-
    marked. All can be used to further refine the ‘intent’ and ‘satisfaction’ with a
    given result that has been accessed.

Advertising activity. The advertisements clicked on can also begin to add to a
   clearer understanding of the end user's preferences and interests.

User preferences. The end user can also provide specific information as to personal
    interests or location-specific ranking prominence. This could also include favourite
    types of music or sports, inclusive of geographic preferences such as a favourite
    sport in a given city.

Historical user patterns. A person's surfing habits over a given period of time (e.g.
    6 months) can also play a role in defining what is more likely to be of interest
    to them in a given query result. More recent information (on the above factors) is
    likely to be weighted more heavily than older historical performance metrics within
    a set of results.

Past visited sites. Many of the above metrics, such as time spent and scrolling on a
    given web page, or historical patterns and preferred locations, can also be collected
    in a variety of ways (invasive or non-invasive). Cookies actually save resources for
    the search engine, an added benefit.
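As a small illustration of the traffic-oriented metrics mentioned above, the sketch below computes the number of unique visitors, the bounce rate and the average time on site from an analytics-style visit log; the log format and values are illustrative assumptions.

visits = [
    # (visitor_id, entry_phrase, seconds_on_site, pages_viewed)
    ("u1", "organic food", 300, 4),
    ("u2", "organic food", 10, 1),    # a bounce
    ("u1", "diabetes diet", 100, 2),
    ("u3", "organic food", 10, 1),    # a bounce
]

unique_visitors = len({visitor for visitor, _, _, _ in visits})
bounces = [v for v in visits if v[3] == 1]            # single-page visits
bounce_rate = len(bounces) / len(visits)
average_time_on_site = sum(v[2] for v in visits) / len(visits)

print(unique_visitors, bounce_rate, average_time_on_site)   # 3 0.5 105.0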

Advice on how to improve the values of such metrics is presented in the next chapter.

A higher position in the rankings does not always imply more visitors. Moreover, there is no
significant difference between positions 6 and 10. Very often, proper optimization of the
page's title and description visible in a SERP is more important and
brings more visitors than a higher position. Better website titles and meta descriptions
would have an advantage, as getting the user to engage with the SERP listing upon
initial presentation would be at a premium. Quality content as well would begin to take
on a more meaningful role than it has in the past, as bounce rates and user satisfaction
now start to play into the actual search result rankings.

3.1.1. Areas for Consideration

The author’s experience in commercial SEO, which is closely related to marketing, is
rather limited. Thus, this section is based on [16].

Demographics

Care should be taken to leverage any obvious demographics that may apply to your site.
Whether it is geographic, topical (sports, politics) or even a given age group, targeting it
effectively is important, because the ‘topical’ nature of personalized search can group
results prior to even ranking them. If a particular website is not clear in each of these
areas, it risks a lower weighting within tighter demographic starting document sets. Even
your off-site activities (link building, Social Media Marketing etc.) should be as tightly
targeted as possible.

Relevance profile

Of particular interest is potential categorization in terms of topical relevance. Ensuring
that your site provides a strong relevance trail would be particularly valuable.
Much like phrase based indexing and retrieval concepts, probabilities play a large role.
When refining results the search engine looks at related probable matches. Through
a concerted effort at on-site and off-site relevance strengthening, you increase the
odds of making it into a given set of results in a world of ‘flux’. It never hurts to review
the concepts surrounding Phrase Based Indexing and Retrieval, as many of the related
patents addressed deriving concepts/topics from phrases.
One would also have to imagine that tightening up the relevance profile in your Social
Media Marketing efforts would also be beneficial to a tighter topical link profile. Fur-
thermore, many topically targeted visitors that enter a site may bookmark your site (or
it may be collected passively), which adds to the organic search profile without ever being
included in a search result. As such, there are many exterior opportunities to be had beyond
the traditional off-site SEO.

Keyword Targeting

Building out from your core terms will be important as far as understanding search
behaviour. The long tail as we know it would be targeted towards potential query
refinements by a given subset of searcher types. Building out logical phrase extensions
and potential query refinements would be something to look at. Furthermore, with
changeable personalized ranks we would measure SEO success in actual traffic and
conversions, which puts term targeting in a new light as far as nailing money terms
and having a cohesive plan that targets query-refinement long-tail opportunities.

Quality Content

In considering the value of a website, user interaction becomes a consideration as far
as bounce rates, time spent on page and scrolling activities are concerned. Producing
compelling and resourceful content would be at a premium to best leverage these
tendencies of the system. If a searcher has selected and interacted with your site on
multiple occasions, your site would be given weight in their personal rankings as well
as for related topical and searcher types. The more effective a resource, the greater the
ranking weight increase.

Search result conversion

Working with the page title, meta-description and snippets takes on a more important
role in your SEO efforts when adjusting for personalized search. It could even be said that
using analytics and a form of split testing would be a great advantage as far as finding
what not only ranks, but converts.

Freshness

Another area which may be important is document freshness: people could set default
date ranges, or the system could passively begin to see a pattern of a user accessing
more current content. A valuable website that has been ranking well for a year may no
longer be getting all the traffic it is used to. In such a case it is worth updating such
pages with fresh information, or creating new related pages and passing the flow to them
via internal links. Depending on the nature of the content (searcher group profile), more
current content may be more popular over the larger data set and thus newer content
would be weighted more overall.

Site Usability

From a crawler or the end user perspective, having logical architecture and a quality
end user experience is also at a premium. If similar searcher types embark on similar
pathways and related actions (bookmark, print, navigate, and subscribe to RSS) then
this will give greater value to those target pages within that community of search types.
This also furthers the relevance profile.

Analytics

It can be noticed that there is a strong need for the use of analytics in understanding
traffic flows, common pathways, bottlenecks, the paths to conversion, and much more.
This data will be of immeasurable use in dealing with many of the factors that can affect
Personalized Web Positioning. This issue is closely connected with psychology
(particularly behavioural targeting).
Chapter 4

SEO Guide


This chapter presents a set of areas for consideration during the process of website opti-
mization. The prepared advice concerns only areas which may have a positive effect on
the popularity of the website. It should be helpful in achieving a higher position in
search rankings, which should increase the number of visitors. It should also make the
website more attractive for users. This will probably decrease the bounce rate, which
has a negative impact on the website in the personalization re-ranking process.


The areas covered in this guide do not take into account any SEO techniques connected
with external actions, i.e. those that require contact with other websites, such as:


 • linking (the acquisition of links), free or paid


 • advertising


 • presell pages1


The listed methods are closely connected with generating spam. Due to this, they reduce
the proportion of quality content on the Internet, so Internet surfers gain no benefit
from them.


This chapter is based mainly on the information from [13], [15], [21] and [36].



 1. Presell page – a page created only for SEO purposes. Text on such a page is only a surrounding
    for a link leading to the positioned website. The content has no value for a human reader; it is
    only prepared to look natural to crawlers, so as not to be filtered as spam.

4.1. Website Presentation in Google Search

4.1.1. Title

The title is the first piece of information about a particular website in the SERP. It is
also one of the main factors with an impact on the website’s ranking. An example of such
a title in HTML code looks like this:
<title>Jaguars, Jaguar Pictures, Jaguar Facts -
National Geographic</title>
Such a title presented in the SERP looks as shown in figure 4.1.




             Figure 4.1. Presentation of a website title in Google Search


These are the issues connected with the website title which are significant in SEO:

Length up to about 65 characters

Longer titles can also be indexed by the crawler, but a title of about 65 characters is
rather optimal and fits entirely in the SERP. Longer titles are shortened with an ellipsis.

Diversity of titles

Each of the website’s pages presents slightly different information (e.g. a product page,
a contact form etc.). The title should be prepared individually for each of them.

Keywords

There are 3 principles related to creating a title:
 1. Keywords should be distributed across all the pages. Each page should be
    optimized for only 3–4 keywords. The front page title should contain the most general
    expression, titles of product pages should contain words characterizing the type of
    these products, etc. Sticking to this rule is very important, because otherwise
    the pages of the website could be treated by the crawler as duplicated content.
 2. The most important keywords should be placed at the beginning of the title.
 3. Google can combine keywords from the title into different phrases, but those which
    appear one after another have the greatest impact on the position in the ranking.
    Due to this fact, a key phrase should not be separated.


4.1.2. Description

The description is the second piece of information about a particular website, presented
right after the title in the SERP. Such a description presented in the SERP looks as shown
in figure 4.2.




         Figure 4.2. Presentation of a website description in Google Search


The description presented in the SERP can be generated from the following sources:
 • description metatag, for example:
     <meta name="description" content="Learn all you wanted to know
     about jaguars with pictures, videos, photos, facts,
     and news from National Geographic." />
 • a fragment of the website content (in case the description metatag is too long or
   there is none in the source code)
Here are some tips on the page description in the metatag:

Length up to about 150 characters

Longer descriptions will not be presented in the SERP as they were written.

Diversity of descriptions

Just like titles, the description of a particular page should be slightly different from the
others. It should be specific to the information presented on the page.

Keywords

The description should contain the keywords targeted by the SEO strategy. When it does,
the keywords will be bolded in the search results for a query phrase based on such keywords.
This should draw users’ attention to our website. At the same time, the description should
be prepared in a way that encourages users to visit the website.

4.1.3. Sitelinks

Sitelinks are links leading to other pages of the same website. They can be presented in
SERPs in 2 ways:
 1. Horizontally – 4 links in 1 row (presented in figure 4.3)
 2. Vertically – 8 links in 2 columns




           Figure 4.3. Presentation of a website sitelinks in Google Search

There is no manual way for publishers to force sitelinks to be presented in the SERP. It
depends on how the website was indexed, but the crawler’s job can be made easier.
There are two things which can be done:
 1. First of all, the source code related to navigation on our website must be well
    designed. Its syntax must be very clear.
 2. Prepare a sitemap of the website (e.g. in XML format). This issue will be described
    later.


4.2. Website Content

4.2.1. Unique content

The basis of proper content optimization is its uniqueness. This means that the
same text, or larger fragments of it, should not be reproduced on other websites or on
different pages of our website.
In order to verify the degree of uniqueness of our content, the following tool can be used:
http://www.copyscape.com


4.2.2. Keywords

It is very important to make search engines able to relate our website to a specific theme
and keywords. In order to make this possible, keywords must be considered not only in
the process of designing the website title and description; they must also be contained in
the website content.
In preparing the text for the website it is suggested to stick to the following principles:

Repetition

Keywords should be repeated several times on every page. But it cannot be forgotten
that the text should be written primarily for users. The task is to find a compromise
between text that is attractive for users and good for SEO. Too high a density of keywords
on a particular page can be treated by the search engine as an abuse. In such a situation
our website will be penalized by exclusion from the ranking.

Variations and synonyms

The website content will be more natural if the contained keywords are used in many
(grammatical) variations. Modern search engines are proficient enough to also detect
the use of synonyms. For this reason we can, for example, use the word "drug" in content
being optimized for the "medicine" keyword.

Location

Keywords should be located across the whole page with similar density. This will give a
better result in positioning than an accumulation of keywords, for example, only at the
beginning of the page.



4.3. Source Code

A website’s source code has no direct influence on the position in the search ranking. How-
ever, some errors can cause problems with proper indexing by search engine robots. For
this reason it is worth ensuring that the code contains no errors and is compatible
with current WWW standards.
The code validation tool provided by the World Wide Web Consortium (W3C) is very
useful. It can be found here:
http://validator.w3.org


4.3.1. Headers

HTML header tags (h1–h6) are very significant for proper indexation of the website
content. Their right usage is very important in the design of a website. There are a couple
of issues which must be considered from the SEO point of view.

Hierarchy

Header tags are designed to separate particular sections of a document. They must
be used in the correct order and only when there is a need for them.

Repetition

According to current HTML standards, the header of the first degree (h1 tag) may occur
only once in the whole document. Other headers can be used repeatedly.

Keywords

It is suggested to put keywords into header tags, because they have more "positioning
power" than regular text. This power probably corresponds to the header hierarchy, so
the most important keywords should be placed in the h1 tag.
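For illustration only (the keyword and headings are hypothetical, not taken from the thesis),
a page optimized for the phrase "organic food" could use a header hierarchy like this:
<h1>Organic food</h1>
<h2>Why organic food is worth its price</h2>
<h2>Where to buy organic food</h2>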


4.3.2. Highlights

Keywords can be distinguished from the rest of the text by using the <strong> (bold)
and <em> (italics) tags. In this way, keywords are highlighted both for users and crawlers.
However, it should be done with restraint: not every occurrence of a keyword should
be highlighted, but only the most important ones.
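As an illustration (hypothetical content, following the same example keyword as above),
such highlighting could look like this:
<p>Our shop offers fresh <strong>organic food</strong> delivered daily,
including <em>organic vegetables</em> from local farms.</p>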


4.3.3. Alternative texts

Sometimes there are images placed in the document. It is recommended to include
alternative texts for those images. It can be done in this way:
<img src="path/to/image" alt="alternative text" />
The alternative text is displayed on the screen in case the browser cannot display images
(e.g. when they are unavailable on the server).
These alternative texts are also interpreted by search engine robots. This data is then
used in image search (when the search engine has such an option).


4.3.4. Layout

A well-indexed website should have a clear and minimalistic layout. The content is the
most important factor, so even the ratio of the amount of text to HTML code is significant.
The higher this value, the better and more valuable the website from the search engine’s
point of view.
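The ratio itself can be estimated in a few lines. The following is a minimal sketch (Python
with its standard library only; it is not part of the thesis) which strips markup with the
built-in HTML parser and reports the share of visible text in the raw document:

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    # Collects visible text, skipping the contents of script and style tags.
    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)

def text_to_code_ratio(html):
    parser = TextExtractor()
    parser.feed(html)
    text = "".join(parser.chunks).strip()
    return len(text) / len(html) if html else 0.0

# Example: ratio = text_to_code_ratio(open("index.html", encoding="utf-8").read())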



4.4. Internal Linking

Quite an important issue in website optimization is internal linking. An internal link is a
hyperlink which leads to another page of the same website. There are some recommen-
dations connected with this link type.

4.4.1. Distribution

Each of the pages should be available in 3 or 4 clicks at most. If not, the website
navigation must be re-designed. Attention should be given especially to the links on
the main page. The structure of the website must be clear.
What is more, this is also very important for the usability of the website. Complicated
navigation can discourage the user from continuing the visit.
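The "3 or 4 clicks" rule can be checked automatically. Below is a minimal sketch (Python;
not from the thesis, and the link graph is a hypothetical input that would normally come
from a crawl of the site) which computes each page’s click depth from the main page with
a breadth-first search:

from collections import deque

def click_depths(link_graph, start="/"):
    # link_graph maps a page URL to the list of internal URLs it links to
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in link_graph.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

# Pages deeper than 4 clicks (or missing from the result, i.e. unreachable)
# are candidates for better internal linking:
graph = {"/": ["/products", "/contact"], "/products": ["/products/item-5"]}
print({page: depth for page, depth in click_depths(graph).items() if depth > 4})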


4.4.2. Links anchors

A link anchor is the clickable text. It is displayed to the user on a website instead of the
plain URL, which is rather unreadable for a human. It looks like this:
<a href="some_url.html">Anchor text</a>
Anchors should describe the content of the pages their links lead to. If links are
located among other text, the anchor should match the context of the whole text. For
example, it is not advised to write "click here", as was popular a couple of years ago.


4.4.3. Broken links

A very important thing in website positioning is to beware of links which lead to unavail-
able URLs. Such an issue is very annoying and discouraging for visitors.
A website with broken links will also be less valuable for search engines, because
robots crawl the web using links. After indexing a page, the robot uses one of the links
placed on this page to go to another page. When such a link is broken, the crawler can
interrupt the indexing process. This will cause a situation where not every page of the
website is indexed.
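Broken links can be detected before the crawler finds them. The following is a minimal
sketch (Python standard library only; not part of the thesis) which checks a list of URLs
and reports those answering with an error:

from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def find_broken_links(urls):
    broken = []
    for url in urls:
        try:
            urlopen(Request(url, method="HEAD"), timeout=10)
        except HTTPError as error:   # the server answered, but with a 4xx/5xx status
            broken.append((url, error.code))
        except URLError:             # DNS failure, refused connection, timeout etc.
            broken.append((url, None))
    return broken

# Example: print(find_broken_links(["http://www.example.com/", "http://www.example.com/old-page"]))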


4.4.4. Nofollow attribute

Nofollow is an HTML attribute value used to instruct some search engines that a
hyperlink should not influence the link target’s ranking in the search engine’s index.
This is an example of such a hyperlink:
<a href="some_url" rel="nofollow">Some website</a>
It is intended to reduce the effectiveness of certain types of search engine spam, thereby
improving the quality of search engine results and preventing a particular website from
being indexed as spam. The nofollow attribute is commonly used in outbound links2 , for
example in paid advertising.

 2. Links which point to other websites

4.4.5. Sitemap

A sitemap is a list of pages of a website accessible to crawlers or users. This helps
visitors and search engine bots find pages on the website.

Sitemap for users

A page can be prepared containing links leading to all the website’s pages, or only to the
most important ones. Thanks to this, users having problems with navigation will be able
to quickly find what they are looking for. An example of such a sitemap located in the
footer is presented in figure 4.4.




                     Figure 4.4. Example of sitemap for visitor


Sitemap for robots

A sitemap for crawlers must be easy to process automatically. Such a sitemap is mostly
prepared in the XML document format. This is what an example looks like:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-01</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>
   <url>
      <loc>http://www.example.com/catalog?item=12</loc>
      <changefreq>weekly</changefreq>
   </url>
   <url>
      <loc>http://www.example.com/catalog?item=73</loc>
      <lastmod>2004-12-23</lastmod>

      <changefreq>weekly</changefreq>
   </url>
   <url>
      <loc>http://www.example.com/catalog?item=74</loc>
      <lastmod>2004-12-23T18:00:15+00:00</lastmod>
      <priority>0.3</priority>
   </url>
   <url>
      <loc>http://www.example.com/catalog?item=83</loc>
      <lastmod>2004-11-23</lastmod>
   </url>
</urlset>
As can be noticed, such a document contains some information about each link:
loc: URL of the particular page
lastmod: time of the last modification of the page
changefreq: average period of time between changes to the page
priority: priority value for the crawler to index the particular page
Such information is welcome by crawlers. It can benefit the publisher with faster in-
dexing by the crawler.
In most cases such documents are prepared using software tools like:
http://www.xml-sitemaps.com/
When the sitemap document is prepared, the search engine must be notified about
its existence via a special form.



4.5. Addresses and Redirects

Besides the previously described factors used in the ranking algorithm, search engines also
consider the form of the indexed website’s URLs and the information included in HTTP
responses.


4.5.1. Friendly addresses

Search engines give a higher rank value to those websites whose pages have URLs that are
more readable for a human. For example, an address like this:
http://www.example.com/index.php?page=product&num=5
can be written in this way:
http://www.example.com/product/5
Such an effect can be achieved using mod_rewrite. It is a module for the Apache Server
which allows creating regular expression patterns for mapping URLs to particular pages.
Modern web frameworks, such as the Django Framework or Ruby on Rails, also offer such
a possibility.
Moreover, this gives another opportunity for placing keywords. Due to this fact, a page
being optimized for a particular keyword should have this keyword in its URL. If it is a
phrase with a couple of words, it is suggested to separate them with a dash.
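For illustration only (assuming an Apache server with mod_rewrite enabled and the example
addresses above), such a mapping could be defined in the .htaccess file like this:
RewriteEngine On
RewriteRule ^product/([0-9]+)$ index.php?page=product&num=$1 [L,QSA]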


4.5.2. Redirect 301

A 301 redirect is a permanent redirect from one address to another. After using it:
 • visitors typing the old address into the browser’s address bar will be redirected
   to the new one
 • some search engines will replace the old address in their database with the new one
So it is very useful after a domain change.
Earlier the author said that the content of the website should not be duplicated. It is often
forgotten that allowing entry to a website through several addresses has the same
result. Sometimes the same website is served via:
 • example.com
 • www.example.com
 • example.com/index.html
 • www.example.com/index.html
 • example.com/default
 • www.example.com/default
In such a case it must be decided whether the main address of the website will have the
"www" prefix. If it will, an .htaccess file with the following content should be placed in
the main folder of the server:
such content:
RewriteCond %{HTTP_HOST} ^example.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
Similarly we can manage the redirection from the index.html file:
RewriteCond %{REQUEST_FILENAME} index.html
RewriteRule ^(.*)$ http://www.example.com [R=301,L]
There are also many other possibilities which can be managed likewise.



4.6. Other Issues

There are a couple of other things which have some influence on the website’s ranking.

4.6.1. Information for robots

Sometimes publishers do not want robots to index some of the website’s pages, but want
to keep them available for regular visitors. For example:
 • results from internal search engines
 • data sort results
 • print version of pages
 • pages which should not be indexed, like login page to administration panel
To manage this issue, a robots.txt file can be prepared with content like this:
User-agent: *
Disallow: /admin-panel/
It should prevent robots from indexing pages whose URLs start with www.example.com/admin-panel/.


4.6.2. Performance

One of the most recent factors introduced into the Google search engine algorithm is the
website performance value. Google promotes websites with a short download time. It is
not as significant as internal linking or the quality of content. Big information portals
are very complex, so they cannot be downloaded as fast as e.g. a small blog. But valuable
information is the most important factor.
However, good performance can increase the rank value of the website compared to
another one with similar content but not so efficient.
To improve website performance, several tools can be used, like PageSpeed by Google:
http://code.google.com/speed/page-speed
It provides an analysis of the website’s download efficiency and gives some tips on
how to improve the performance. Next, the most common suggestions given by this
software are presented.

Gzip compression

Modern web browsers allow the use of the gzip compression mechanism to reduce the size of
website files (images, CSS files, Javascript files). If there is such a possibility, it is recom-
mended to use it.
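As an example (assuming an Apache server with the mod_deflate module available; this
fragment is not part of the thesis), compression of the most common text resources can be
enabled with a configuration like this:
<IfModule mod_deflate.c>
    AddOutputFilterByType DEFLATE text/html text/css application/javascript
</IfModule>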

Number of DNS lookups

The DNS caching mechanism [23] means there is no need to look up the IP address matching
a particular domain several times. For this reason, when every file used by the website
is located on the same server (or another server in the same domain), there is only one
DNS lookup during the download process.

So placing media files (images, CSS, Javascript) on a different domain without a clear
need should be avoided.

External files

Information commonly located in external files, like CSS style sheets or Javascript,
can also be placed inside the HTML document. However, it should be avoided, because
it makes parsing the source code more complex. Thus the browser needs more
time to display the website on the screen.



4.7. Summary

The tips presented in this chapter should significantly increase the rank value of every
website. With a higher ranking in the search engine, there will be more inbound
traffic. In other words, the website will gain more popularity.
Making the website more attractive to visitors should imply better results of the per-
sonalization re-ranking. The assumptions about the impact of personalized search results
on the global ranking are very plausible. So improving the quality of a website on the basis
of the presented guide should increase the website’s ranking in general, both through the
personalization impact and through collecting inbound links as an effect of the increase in
popularity.
Chapter 5

The System for Global
Personalization

The goal of this chapter is to propose a method for improving the website’s search
ranking through affecting the personalization mechanism in search engine. The idea of
this method is to generate artificial behavioural data. The author of this thesis is the
co-author of the article [19], which this chapter is based on.
In section 2.2 of this thesis it has been shown that a lot of data goes into Google
and a lot of useful manipulated data comes out. But we can only guess what happens
in between or try to learn from the observation of the data coming out of Google.
Evans wrote [10] that identifying the factors involved in a search engine ranking algo-
rithm is extremely difficult without a large dataset of millions of SERPs and extremely
sophisticated data-mining techniques.
That is why observation, experience and common sense are the main sources
of knowledge on Search Engine Optimization (SEO) methods. It was according to this
knowledge that the Search Engine Ranking Factors [11] were created. The last edition of it
assumes that traffic generated by the visitors of a website has 7% importance in
Google’s evaluation of the website’s value. It is, after links to the specific website and its
content value, the most significant factor in the website evaluation process. On the basis of
the previous editions of the ranking, one can notice that the importance of this factor
is increasing.
Because all of these are only reasonable assumptions, the intention is to evaluate the
validity of the described factor in web positioning efforts. For this purpose we
need a simulation tool which will generate the necessary human-like traffic on a tested
website. The tool is going to be a multi-agent system (MAS) which will imitate real
visitors of websites.



5.1. Problems to Solve

Fig. 5.1 presents the main reason why the system must be distributed. A few queries
to Google, sent frequently one after another from the same IP address, are




        Figure 5.1. Information displayed by Google on the abuse detection


detected by Google and treated as abuse. Google suspects automated activity and
requires completion of the captcha form in order to continue searching. If a distributed
system were used, the queries would be sent from many different IP addresses. This
should guarantee that Google will not consider this activity abusive. The issue cannot
be solved by using a set of public proxy servers: Google has probably put them on its
black list, and every single query to Google via such a proxy server leads to the same
end – a captcha request.
What is more, after Tuzhilin [35] we can say that Google puts a lot of reasonable
effort into filtering invalid clicks on advertisements. There is a big chance that Google
uses some of those mechanisms in the analysis of web traffic. This is the reason why the
way behavioural data is generated should be our concern. Artificial web traffic that is
recognized could be treated by Google as an abuse and cause punishment (a decline of
the website’s position).



5.2. Objectives

5.2.1. Web Positioning

The main goal of the system is to improve the position of a website by generating traffic
related to the website. The system should only care about activity which can be visible
to Google. This means there is no need to download all the content from a particular
website; it would only waste bandwidth. The system should only send requests to the
Google services used by a particular website, for example:
 • links to the website on SERPs,
 • Google Analytics scripts,
 • Google Public DNS queries,

 • Google media embedded on the website like AdSense advertisement, maps, YouTube
   videos, calendars etc.



5.2.2. Cooperation

The whole idea of the system is to spread positioning traffic across worldwide IP ad-
dresses. As a result of this distributed character, the system requires a large group of
cooperating users. Nobody will use the system if there are no benefits for them. A mech-
anism must be introduced which lets the system’s users share their Internet connections in
order to help each other in web positioning. What is more, the mechanism must treat all
users equally and fairly. This means it should not allow taking benefits without any
contribution.



5.2.3. Control

According to [36], web positioning is not a single action, but a process. It must be
possible to control this process; otherwise it could be destructive, instead of improving
the website’s position. For this reason, the system should allow users to:

 • control the impact of the system activity on their websites,

 • check the current results of the system activity (changes in the website position
   on SERPs),

 • check the current state of the website in the web positioning process.




5.3. Architecture

Fig. 5.2 presents the architecture of the system, which takes into consideration all the
specified problems and objectives. The server is necessary to control the whole process of
generating web traffic according to a specified algorithm. It gives orders to the clients to
start generating traffic on specified websites. It also receives information from the clients
about the number of requests sent to particular Google services on the website’s account.
The database serves as storage for process statistics. These can be presented to clients via
a web interface. They are also useful to the server for creating orders in accordance with
the algorithm.

The clients are the agents of the presented MAS. They take orders from the server with a
particular website registered in the database to be processed. Processing the website means
mimicking its real visitor. The client performs this autonomously using the visitor session
algorithm described in the next section.

Figure 5.2. System architecture (the Server and Database coordinate a cloud of Clients,
which interact with Google Search, the processed Websites and all other Google services)


5.4. Visitor Session Algorithm

According to much research [1], [6], [7], [10], [20] and [36], more than half of a web-
site’s visits come from the SERPs. That is why starting a single visitor session (the
sequence of requests concerning a single website registered in the system) by querying
Google Search sounds reasonable; however, only if the currently considered website appears
on one of the first few SERPs. Otherwise, the visitor session should be started directly on
the processed website, or should come via an existing incoming link, if there is one. Because
of Google’s likely actions to detect abuse, the visitor session should be as human-like as
possible. The analysis of real users’ web traffic [29]
is very useful here. According to it, a typical user:
 • visits about 22 pages in 5 websites in one sitting,
 • follows 5 links before jumping to a new website,
 • spends about 2 hours per session and 5 minutes per page.
These statistics clearly indicate that a typical visit session concerns a website of good
quality. A visit to a poor website would be aborted after as little as a few seconds. Such a
visit could have a negative impact on the website’s quality evaluation by Google.
The algorithm, illustrated in figure 5.3, proceeds as follows:
1 – The server retrieves from the database information about the next website to be pro-
     cessed.
2 – The task is assigned to a client.
3 – The client starts the visitor session.


 Database                                                                         6
             1                              4
                                                                  Visit session
             2                3                      5


                                        Google                       Website      All Google
                                        Search                                     Services
                 Client
  Server



                                    7

                          Figure 5.3. Visitor session algorithm


4 – Searching the SERPs for a link to the processed website.
5 – If a link has been found – click on the link; otherwise a direct request.
6 – Processing the visit session.
7 – Request to the server for another website to process.
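The steps above can be summarized in a short sketch (Python; this is an illustration only,
not the thesis implementation – get_task, search_serp_for, visit and report are hypothetical
callables standing in for the real client and server actions):

def run_client(get_task, search_serp_for, visit, report, max_serp_pages=3):
    while True:
        website = get_task()                              # steps 1-2: the server assigns the next website
        if website is None:
            break
        link = search_serp_for(website["phrase"], website["url"],
                               pages=max_serp_pages)      # step 4: look for the site on the first SERPs
        if link is not None:
            session = visit(link, came_from_serp=True)    # steps 5-6: click-through and human-like browsing
        else:
            session = visit(website["url"], came_from_serp=False)   # direct request instead
        report(website, session)                          # step 7: send statistics back to the server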



5.5. Task Assignation Algorithm

The task assignation algorithm helps the server to build a queue of registered websites
ordered by visitor session priority. The website with the highest priority value is the
next one to start a visitor session. In other words, a client always receives the website with
the highest priority value to process.
The priority value P V is calculated using the function:


                              P V (α) = r(α) · t(α) · v(α) / T (α)                         (5.1)

where
    α — record in the system (website with phrase for web positioning)
r(α) — returns current position in the search engine ranking for the α (returns
        0 if there is no α in the ranking)
 t(α) — returns time since the end of the last visitor session on the α (in
        seconds)
v(α) — returns number of visitor sessions made by α owner’s client
T (α) — returns time since the registration of α in the system (in days)
Chapter 5. The System for Global Personalization                                    42

The presented function gives the greatest "power" to the ranking factor. The reason for this
is that websites with a high ranking value should already have more real visitors, so the
system’s efforts will not be so crucial for their popularity.
The time of participation in the system is not very significant. Novice participants have an
equal chance to gain attention for their websites as the senior ones; however, the function
promotes continuous activity of the clients.
Also worth considering is the possibility of dynamically modifying the weights of individual
factors depending on the results. Because the websites queue is built by the server, it is
possible to change the whole function during the system’s activity.
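A minimal sketch of this assignment rule (Python; illustrative only, with hypothetical field
names for the record’s attributes) computes P V (α) for every registered record and returns
the one with the highest priority:

def priority(record):
    r = record["ranking_position"]               # r(α): 0 if the site is absent from the ranking
    t = record["seconds_since_last_session"]     # t(α)
    v = record["owner_sessions_performed"]       # v(α)
    T = record["days_since_registration"]        # T(α)
    return r * t * v / T if T else 0.0

def next_website(records):
    # The client always receives the record with the highest priority value.
    return max(records, key=priority, default=None)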



5.6. Proof Study

The presented system requires a large number of users to work properly. Otherwise, the
generated traffic would not be distributed enough and thus would look unnatural. As was
shown, centralized series of queries are seen as abuse. Unfortunately, the thesis
author’s resources were insufficient for this purpose. However, a simulation has
been performed which was intended to prove the proposed concept.


5.6.1. Tools

The idea was to use the Tor application (http://www.torproject.org) to make a single
host (the author’s computer) generate distributed traffic. In this way, the behavioural
data of one real user could be seen by the search engine as multi-user traffic.
Tor is free software enabling Internet anonymity by thwarting network traffic analysis.
Tor aims to conceal its users’ identity and their network activity from traffic analysis.
Operators of the system run an overlay network of onion routers which provides
anonymity in network location as well as anonymous hidden services.
Users of the Tor network run an onion proxy on their machine. The Tor software peri-
odically negotiates a virtual circuit through the Tor network. An application like a browser
may be pointed at Tor, which then multiplexes the traffic through a Tor virtual circuit.
Once inside the Tor network, the encrypted traffic is sent from one host to another,
ultimately reaching an exit node at which point the decrypted packet is available and
is forwarded on to its original destination. Viewed from the destination, the source of
the traffic appears to be the Tor exit node.
As figure 5.4 shows, Tor has become quite popular, so its network involves a large
number of users. This makes Tor fit the objective of this study. The Mozilla Firefox
browser was used, connected to Tor. Additionally, the iMacros plug-in was installed
in order to automate the execution of visitor sessions.
For the analysis of the behavioural data received by Google during the study, the Google
Analytics software (shown in figure 2.2) was used. It was installed on every ex-
amined website.




                           Figure 5.4. Tor interface screen



5.6.2. Results



The traffic distribution issue ended with success. After opening the Google Search main
page (www.google.com), the server redirects to the domain belonging to the country in
which the Tor exit node of the particular session was located. For example, when the exit
node was in Germany, the Google server redirected the browser from google.com to the
google.de address. The Google Search page was displayed in the appropriate language,
despite the fact that the browser’s default language setting was removed. After visits to
the examined pages, Google Analytics also indicated that the source of the visits was not
in Poland (where the study was actually conducted) but in the countries of the exit nodes
of the traffic.

However, routing the traffic through the distributed Tor network appeared to be an insuf-
ficient solution. Firstly, the traffic routed through Tor is significantly slowed down. From
time to time there were even difficulties with downloading the complete search engine site.
What is more, despite the large number of visit sources, there is still only one browser and
one real user. Because of this, the simulation could only imitate one signed-in user or
a group of signed-out users.

Signed-in user

In the first case there is essentially no difference between visiting through the Tor proxy
network or directly. As was described, Tor is a tool for concealing the identity of
a user, but after signing in to a Google Account, the identity is evident. From Google’s
point of view, such a visit is seen as a regular user travelling very quickly all around the
world (metaphorically speaking). But it is still only one user, and the personalized
search results applied to him should not be globally significant.

Group of signed-out users

As was described earlier, Google introduced the personalization mechanism not only for
users with a Google Account. There is also personalization of search results for users
with no such profile account. It is based on storing cookies in the user’s browser for up to
180 days, which contain information about past search activities. But cookies are not
related to a specific user, but to a browser. In this case, search results are re-ranked not
for the person but rather for the particular computer which this person uses.
This simulating system uses only one browser, so there was no possibility of evaluating the
impact of personalization re-ranking on the search ranking from the global perspective.
Disabling the cookie-storing option in the browser makes personalization impossible,
because there is no way to relate past queries in the search engine to a particular user.
Moreover, a browser with blocked cookies is a rather rare situation nowadays. Therefore
the Google search engine is rather suspicious about traffic with blocked cookies, and for
such requests it serves the "Sorry page" (figure 5.1).



5.7. Summary

Generating artificial traffic on the Internet does not seem very praiseworthy, as it
is dangerously close to spam and introduces information noise into visitor
statistics. On the other hand, it is no worse than other SEO activities like
linkbaiting.
Following [6], today’s search engines use mainly link-popularity metrics to measure the
"quality" of a page. This is the main idea of the PageRank algorithm [3]. This fact causes
the "rich-get-richer" phenomenon: more popular websites appear higher on SERPs,
which brings them more popularity.
Unfortunately, this is not very beneficial for new, unknown pages which have not
gained popularity yet. There is a possibility that these websites contain more valuable
information than the popular ones. Despite this fact, they are ignored by search engines
because of the small number of links. These sites, in particular, need SEO efforts. Probably
classic techniques will be more effective than the one presented in this paper.
Nevertheless, the methods presented in this article are likely to improve the rate of
web positioning, because web traffic can be noticed by search engines immediately. It
Analysis Of The Modern Methods For Web Positioning
Analysis Of The Modern Methods For Web Positioning
Analysis Of The Modern Methods For Web Positioning
Analysis Of The Modern Methods For Web Positioning
Analysis Of The Modern Methods For Web Positioning
Analysis Of The Modern Methods For Web Positioning
Analysis Of The Modern Methods For Web Positioning

More Related Content

What's hot

Ibm watson analytics
Ibm watson analyticsIbm watson analytics
Ibm watson analyticsLeon Henry
 
Daftar isi print
Daftar isi printDaftar isi print
Daftar isi printdimas34343
 
School library management system software
School library management system softwareSchool library management system software
School library management system softwareRanganath Shivaram
 
Mongo db notes for professionals
Mongo db notes for professionalsMongo db notes for professionals
Mongo db notes for professionalsZafer Galip Ozberk
 
Scrapbook User\'s Manual
Scrapbook User\'s ManualScrapbook User\'s Manual
Scrapbook User\'s Manualhaven832
 
Data Export 2010 for MySQL
Data Export 2010 for MySQLData Export 2010 for MySQL
Data Export 2010 for MySQLwebhostingguy
 
Joomla 2.5 Tutorial For Beginner PDF
Joomla 2.5 Tutorial For Beginner PDFJoomla 2.5 Tutorial For Beginner PDF
Joomla 2.5 Tutorial For Beginner PDFVineet Kumar Saini
 
Conceptualising and Measuring the Sport Identity and Motives of Rugby League ...
Conceptualising and Measuring the Sport Identity and Motives of Rugby League ...Conceptualising and Measuring the Sport Identity and Motives of Rugby League ...
Conceptualising and Measuring the Sport Identity and Motives of Rugby League ...Daniel Sutic
 
SchoolAdmin - School Fees Collection & Accounting Software
SchoolAdmin - School Fees Collection & Accounting SoftwareSchoolAdmin - School Fees Collection & Accounting Software
SchoolAdmin - School Fees Collection & Accounting SoftwareRanganath Shivaram
 
B4X Programming Language Guide
B4X Programming Language GuideB4X Programming Language Guide
B4X Programming Language GuideB4X
 
First Solar Inc., Strategic Analysis Report
First Solar Inc., Strategic Analysis ReportFirst Solar Inc., Strategic Analysis Report
First Solar Inc., Strategic Analysis ReportLauren Zahringer
 
Protel 99 se_traning_manual_pcb_design
Protel 99 se_traning_manual_pcb_designProtel 99 se_traning_manual_pcb_design
Protel 99 se_traning_manual_pcb_designhoat6061
 
Martin Gregory - Capability Statement 2013
Martin Gregory -  Capability Statement 2013Martin Gregory -  Capability Statement 2013
Martin Gregory - Capability Statement 2013Martin Gregory
 

What's hot (19)

IBM Watson Content Analytics Redbook
IBM Watson Content Analytics RedbookIBM Watson Content Analytics Redbook
IBM Watson Content Analytics Redbook
 
Ibm watson analytics
Ibm watson analyticsIbm watson analytics
Ibm watson analytics
 
Slackbook 2.0
Slackbook 2.0Slackbook 2.0
Slackbook 2.0
 
Daftar isi print
Daftar isi printDaftar isi print
Daftar isi print
 
School library management system software
School library management system softwareSchool library management system software
School library management system software
 
Mongo db notes for professionals
Mongo db notes for professionalsMongo db notes for professionals
Mongo db notes for professionals
 
Scrapbook User\'s Manual
Scrapbook User\'s ManualScrapbook User\'s Manual
Scrapbook User\'s Manual
 
Guide to managing and publishing a journal on the LAMJOL
Guide to managing and publishing a journal on the LAMJOLGuide to managing and publishing a journal on the LAMJOL
Guide to managing and publishing a journal on the LAMJOL
 
Data Export 2010 for MySQL
Data Export 2010 for MySQLData Export 2010 for MySQL
Data Export 2010 for MySQL
 
Joomla 2.5 Tutorial For Beginner PDF
Joomla 2.5 Tutorial For Beginner PDFJoomla 2.5 Tutorial For Beginner PDF
Joomla 2.5 Tutorial For Beginner PDF
 
Conceptualising and Measuring the Sport Identity and Motives of Rugby League ...
Conceptualising and Measuring the Sport Identity and Motives of Rugby League ...Conceptualising and Measuring the Sport Identity and Motives of Rugby League ...
Conceptualising and Measuring the Sport Identity and Motives of Rugby League ...
 
SchoolAdmin - School Fees Collection & Accounting Software
SchoolAdmin - School Fees Collection & Accounting SoftwareSchoolAdmin - School Fees Collection & Accounting Software
SchoolAdmin - School Fees Collection & Accounting Software
 
B4X Programming Language Guide
B4X Programming Language GuideB4X Programming Language Guide
B4X Programming Language Guide
 
First Solar Inc., Strategic Analysis Report
First Solar Inc., Strategic Analysis ReportFirst Solar Inc., Strategic Analysis Report
First Solar Inc., Strategic Analysis Report
 
2003guide
2003guide2003guide
2003guide
 
Protel 99 se_traning_manual_pcb_design
Protel 99 se_traning_manual_pcb_designProtel 99 se_traning_manual_pcb_design
Protel 99 se_traning_manual_pcb_design
 
Report on dotnetnuke
Report on dotnetnukeReport on dotnetnuke
Report on dotnetnuke
 
Official basketball rules_2014_y
Official basketball rules_2014_yOfficial basketball rules_2014_y
Official basketball rules_2014_y
 
Martin Gregory - Capability Statement 2013
Martin Gregory -  Capability Statement 2013Martin Gregory -  Capability Statement 2013
Martin Gregory - Capability Statement 2013
 

Similar to Analysis Of The Modern Methods For Web Positioning

Similar to Analysis Of The Modern Methods For Web Positioning (20)

Specification of the Linked Media Layer
Specification of the Linked Media LayerSpecification of the Linked Media Layer
Specification of the Linked Media Layer
 
Dsa
DsaDsa
Dsa
 
cs-2002-01
cs-2002-01cs-2002-01
cs-2002-01
 
Master thesis xavier pererz sala
Master thesis  xavier pererz salaMaster thesis  xavier pererz sala
Master thesis xavier pererz sala
 
test5
test5test5
test5
 
test6
test6test6
test6
 
test4
test4test4
test4
 
test5
test5test5
test5
 
test6
test6test6
test6
 
Sdd 2
Sdd 2Sdd 2
Sdd 2
 
Thesis
ThesisThesis
Thesis
 
Dimensional modeling in a bi environment
Dimensional modeling in a bi environmentDimensional modeling in a bi environment
Dimensional modeling in a bi environment
 
Lecture Notes in Machine Learning
Lecture Notes in Machine LearningLecture Notes in Machine Learning
Lecture Notes in Machine Learning
 
XAdES Specification based on the Apache XMLSec Project
XAdES Specification based on the Apache XMLSec Project XAdES Specification based on the Apache XMLSec Project
XAdES Specification based on the Apache XMLSec Project
 
Red book Blueworks Live
Red book Blueworks LiveRed book Blueworks Live
Red book Blueworks Live
 
Bwl red book
Bwl red bookBwl red book
Bwl red book
 
Information extraction systems aspects and characteristics
Information extraction systems  aspects and characteristicsInformation extraction systems  aspects and characteristics
Information extraction systems aspects and characteristics
 
Location In Wsn
Location In WsnLocation In Wsn
Location In Wsn
 
Yii blog-1.1.9
Yii blog-1.1.9Yii blog-1.1.9
Yii blog-1.1.9
 
&lt;img src="../i/r_14.png" />
&lt;img src="../i/r_14.png" />&lt;img src="../i/r_14.png" />
&lt;img src="../i/r_14.png" />
 

Analysis Of The Modern Methods For Web Positioning

  • 1. Faculty of Computer Science and Management field of study: Computer Science specialization: Software Engineering Master thesis Analysis of the Modern Methods for Web Positioning Paweł Kowalski keywords: search engine, SEO, personalization, optimization, web positioning Thesis contains an analysis of personalization of search results mechanism in popular search engines. It presents experiments and considerations about impact of personalization on search rankings and how it affects on Search Engine Optimization (SEO). There are proposed some new SEO methods that take an advantage of personalization in search engines. Supervisor: dr inż. Dariusz Król ............................. ............................. name and surname grade signature Do celów archiwalnych pracę dyplomową zakwalifikowano do:* a) kategorii A (akta wieczyste) b) kategorii BE 50 (po 50 latach podlegające ekspertyzie) * niepotrzebne skreślić Stamp of the institute Wrocław 2010
  • 2. Contents Chapter 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1. Beginning of the SEO Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2. Search Engines Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3. The Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Chapter 2. Personalized Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1. Operation Principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2. Methods for the Analysis of Behavioural Data . . . . . . . . . . . . . . . . . . 7 2.2.1. Methods of behavioural data collecting . . . . . . . . . . . . . . . . . . 8 2.2.2. Process of tracking user . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3. Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.3.1. Location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.3.2. Phrase language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.3.3. Search history . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.4. Spam Issue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 Chapter 3. Impact of Personalized Search on SEO . . . . . . . . . . . . . . . . 21 3.1. Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.1.1. Areas for Consideration . . . . . . . . . . . . . . . . . . . . . . . . . . 23 Chapter 4. SEO Guide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 4.1. Website Presentation in Google Search . . . . . . . . . . . . . . . . . . . . . . 26 4.1.1. Title . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 4.1.2. Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 4.1.3. Sitelinks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 4.2. Website Content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 4.2.1. Unique content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 4.2.2. Keywords . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 4.3. Source Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.3.1. Headers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.3.2. Highlights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.3.3. Alternative texts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.3.4. Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.4. Internal Linking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.4.1. Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 i
  • 3. 4.4.2. Links anchors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4.4.3. Broken links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4.4.4. Nofollow attribute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4.4.5. Sitemap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.5. Addresses and Redirects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.5.1. Friendly addresses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.5.2. Redirect 301 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4.6. Other Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4.6.1. Information for robots . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.6.2. Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.7. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 Chapter 5. The System for Global Personalization . . . . . . . . . . . . . . . . 37 5.1. Problems to Solve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 5.2. Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 5.2.1. Web Positioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 5.2.2. Cooperation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 5.2.3. Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 5.3. Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 5.4. Visitor Session Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 5.5. Task Assignation Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 5.6. Proof Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 5.6.1. Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 5.6.2. Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 5.7. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Chapter 6. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 ii
  • 4. Abstract Modern search engines are constantly improved. The most recent big step introduced into their algorithms concern personalization mechanism. Its goal is to extract information about user’s preferences implicitly from its search behaviour and also such factors as location, phrase language and search history. This information is the basis for building a user’s search profile. Motivation of this process is to provide more relevant search results for specific user and its interests. Thesis concern details of this personalization mechanism and try to examine how various factors affect search results. Author also analyse the methods for collecting behavioural data by search engines. He approaches to define possible impact of customization of search results on the Search Engine Optimization (SEO) issues like metrics, spam filtering or changes in significance of website optimization factors. Then author tries to evaluate the possibility of personalized search rankings manipulation through proposed system for generating human-like web traffic. Streszczenie Nowoczesne wyszukiwarki internetowe są ciągle ulepszane. Ostatni duży krok na- przód wprowadzony w ich algorytmach dotyczy mechanizmu personalizacji. Jego zadaniem jest zdobycie informacji na temat preferencji użytkownika z jego za- chowania podczas wyszukiwania informacji w wyszukiwarce pod względem takich czynników jak jego lokalizacja, język wyszukiwanej frazy i historia wyszukiwań. Te informacje są podstawą do utworzenia profilu użytkownika. Celem tych dzia- łań jest zwrócenie konkretnemu użytkownikowi rezultatów wyszukiwania bardziej odpowiadających jego zainteresowaniom. W pracy tej znajduje się szczegółowa analiza mechanizmu personalizacji oraz próba zbadania jak poszczególne czyn- niki wpływają na wyniki wyszukiwań. Autor analizuje także metody pozyskiwania danych na temat zachowań użytkowników przez wyszukiwarki. Podejmuje próbę określenia możliwego wpływu dostosowywania wyników wyszukiwania do użyt- kownika na tematy związane z pozycjonowaniem witryn internetowych takie, jak metryki, filtrowanie spamu lub zmiany w znaczeniu poszczególnych czynników w optymalizacji stron WWW. Następnie autor próbuje ocenić możliwość mani- pulowania spersonalizowanymi wynikami wyszukiwania poprzez zaproponowany system służący do generowania naturalnie wyglądającego ruchu sieciowego. iii
Chapter 1
Introduction

Before the Web and present-day search engines, searching meant simply matching the terms in a query to the exact appearance of those terms in a database filled with textual documents. Some database searches only let you locate documents where certain words appeared within a defined distance from other specified words in the same document. Sorting documents by relevance or importance would have been a monumental task, if it was possible at all.

1.1. Beginning of the SEO Concept

When the Internet was introduced, it revolutionized the worldwide sharing of information. Free access for everyone, without any restrictions, is the reason why the Internet is considered one of the greatest inventions of the 20th century. But this freedom has a serious implication – many problems with organizing this enormous body of information. Hyperlinks turned out to be insufficient for the task. This is why the first search engines were introduced. They quickly became the main source of visits to commercial websites. Good search results became a very important issue for content publishers. That moment was the beginning of the SEO¹ concept, which is still a major element of Internet marketing.

The early search engines such as AltaVista or Lycos were launched around 1994–1995 [7]. Their algorithms analysed only the content of websites and the keywords in meta tags. It was easy to circumvent these algorithms by placing false information in the keywords tags. Another popular fraud was filling website content with irrelevant text that was visible only to search engine robots, but not to the user. As a result, search engine result pages (hereafter SERPs) contained websites filled with spam and inappropriate content [21].

1. Search Engine Optimization (SEO) – the process of improving the volume or quality of traffic to a website from search engines. The term Web Positioning is also used as a synonym.
1.2. Search Engines Evolution

However, the relevance of the search results to the query was still based on keyword matching. But search engines started to understand the differences in importance of words located in different parts of a page. For example, if you searched for a certain phrase, pages containing those words in their titles and headlines might be considered more relevant than other pages where those words also appeared, but not in those "important" parts of the page.

Google, founded in September 1998, revolutionized search engines. Its co-founders, Larry Page and Sergey Brin, developed the PageRank algorithm [3]. This algorithm redefined the search problem. The content of websites became slightly less significant. Instead of text content, PageRank rates websites mainly on the basis of the quantity and quality of links leading to them. With such improvements, the Internet works as a kind of voting system: every link is a vote for the website it leads to. Relevance was also derived from indexing the words that link to other pages. If a link leading to a page used the phrase "american basketball" as anchor text², the page being pointed to would be considered relevant to American basketball. The existence of links to pages has also been used to help define the perceived importance of a page. Information about the quality and quantity of links to a page can be used by search engines to get a sense of the implied importance of the page being linked to.

Nevertheless, after a short time, techniques [21] spoiling the results of PageRank were also discovered. Basically, most of them work by increasing the number of links leading to a particular website in order to enhance its PageRank value. Such activity on a large scale is usually called linkbaiting. There are many scripts and web catalogs that facilitate and automate it. However, Google is also constantly working to improve its search engine algorithm and trying to make it resistant to linkbaiting. According to [11], many new factors are being introduced into the website evaluation process in order to reduce the impact of linkbaiting, which is a sort of spam.

Besides, there is a limit to the effectiveness of this type of keyword matching. When two people perform a search at one of the major search engines, there is a chance that even if they use the same search terms, they might be looking for something completely different. For example, when an anthropologist searches for the phrase "jaguar", he expects websites with information about big cats as a result. But he can also receive a collection of websites about Jaguar cars instead.

As search engines progressed and users were given more and more websites with valuable information, the engines needed to respond with a refined approach to search. The main idea for improving the relevance of search results was to better understand the user's intent and expectation when he types a certain phrase into a search box.

So it seems that the next step in search engine improvement is tracking regular users on the Internet. Collecting data on their activity might give useful information about

2. Link label or link title, the text in a hyperlink visible and clickable by the user.
which websites are valuable for them. The major engines such as Google, Yahoo and Bing guard their search secrets closely, so one can never be absolutely certain how they operate. But they are evolving, and personalization seems to be the wave of the future.

1.3. The Goal

It seems quite clear that search engines fitted with a personalization mechanism would bring two main benefits:
1. Improvement of search result relevance for a specific user.
2. Decrease in the number of spam entries in SERPs.

Google, the leader of the search engine market, already has the first steps in this area behind it. Such information reaches us from the official company blog [13]. Moreover, Google already holds several patents connected with the personalization mechanism. For this reason this thesis is mainly concerned with Google Search. However, the high competition in the Internet market suggests that other popular search engines such as Yahoo and Bing are also being improved in this direction.

The goal of this thesis is to analyse the possible aspects of the personalization mechanism in Google Search on the basis of the available information. Several factors which can influence the changeability of SERPs will be taken into account:
• geolocation
• language of the query
• web search history
• query complexity
• search behaviour (e.g. bounce rates³, time of visits)

This will be the base for several experiments which should determine how advanced the current level of personalization introduced into the considered search engine is. The research also includes an analysis of the data used to describe users' search behaviour – particularly the types of these data and the methods of collecting them. The obtained results will be used to specify the potential impact on SEO and its metrics, in particular:
• the possibility of using search engine personalization to create new SEO techniques

3. Bounce rate is a term used in website traffic analysis. It essentially represents the percentage of initial visitors to a site who "bounce" away to a different site, rather than continue on to other pages within the same website.
• the usability of website ranking in search results as a measure of success of SEO activity

Personalization is an opportunity for search engines to make spam less significant in search results and to make SEO workers' lives harder. But this concerns only spam in the sense of websites with content of no value to humans and irrelevant hyperlinks. Personalization also opens the door to another kind of information noise – behavioural data spam. The thesis presents the architecture of a distributed system generating artificial web traffic and therefore imitating the search activity of a real user. However, using such a system can be seen as unethical, so the thesis contains only a conception and a design. The author has no intention of implementing such a system, but he tries to examine with the available tools whether building it would be feasible. In this way, possible harmful actions against which search engines should be protected can be indicated.

After that, there is a short analysis of the known, up-to-date information about significant factors in web positioning. Together with the results of the personalization research, it helped to prepare a collection of advice on how to build a website attractive to search engines. It is a sort of guide for webmasters.

At the end of the thesis there is a short conclusion. It contains the author's thoughts about future trends in search engines and SEO.
Chapter 2
Personalized Search

Pretschner [27] wrote in 1999: With the exponentially growing amount of information available on the Internet, the task of retrieving documents of interest has become increasingly difficult. Search engines usually return more than 1,500 results per query, yet out of the top twenty results, only one half turn out to be relevant to the user. One reason for this is that Web queries are in general very short and give an incomplete specification of individual users' information needs.

To be more specific, Speretta [31] wrote in 2005: [...] most common query length submitted to a search engine (32.6%) was only two words long and 77.2% of all queries were three words long or less. These short queries are often ambiguous, providing little information to a search engine on which to base its selection of the most relevant Web pages among millions.

According to Wikipedia, in 2006 Google had indexed over 25 billion web pages and 1.3 billion images, handled 400 million queries per day, and stored over one billion Usenet messages. The Internet grows very quickly. For this reason search accuracy is a crucial area of constant improvement in modern search engines. One of the major solutions to meet this challenge is personalization.

Personalized search is simply an attempt to deliver more relevant and useful results to the end user (searcher) and to minimize less useful results. The personalization mechanism uses information about the user's past actions and behaviour to build his profile and to match relevant search results to this profile. It should provide a more useful set of results, or a set of results with fewer irrelevant or spam entries. For this reason personalized search seems desirable to the end user. Google puts it this way: Search algorithms that are designed to take your personal preferences into account, including the things you search for and the sites you visit, have better odds of delivering useful results [13]. The goal is simple: to reduce spam and to deliver better results. This looks like a dangerous weapon against SEO workers, who are the major offenders in generating spam.
Official information [13], [25], [38] indicates that Google is the only major search engine that has already introduced a personalization mechanism. The first personalized search results appeared almost 5 years ago [13], and since that time the mechanism has been constantly evolving.

2.1. Operation Principles

Of course, the details of Google's search algorithms are not public. But it can be expected that the main principles are based on ideas which can be found in the scientific literature. According to [31], personalization can be applied to search in two different ways:
1. by providing tools that help users organize their own past searches, preferences, and visited URLs
2. by creating and maintaining sets of user's interests, stored in profiles, that can be used by the retrieval process of a search engine to provide better results.

That research proved that user profiles can be implicitly created out of the limited amount of information available to the search engine itself. The profiles are built on the basis of the user's interactions with a particular search engine. Google has applied the second approach in its search engine, because it does not provide any additional tools like toolbars or browser add-ons for personalizing search. After [31]: In order to learn about a user, systems must collect personal information, analyze it, and store the results of the analysis in a user profile. Information can be collected from users in two ways: explicitly, for example asking for feedback such as preferences or ratings; or implicitly, for example observing user behaviors such as the time spent reading an on-line document.

Google Search does not provide any forms which let users specify their interests and preferences. So, to build a user profile, this information must be collected in another way. According to [31], user browsing histories are the most frequently used source of information for creating interest profiles. But not only the browsing history (such as the one presented in figure 2.1) is significant. For example, a user gets search results after sending a search query. He selects a specific entry that seems interesting. He clicks on it and the website is saved in his browsing history. However, the user quickly realizes that the selected website does not fit his interests and goes back to the search results after a few seconds. Such a visit should be qualified rather negatively. So not only the browsing history, but also the user's behaviour should be taken into consideration by the personalization mechanism.

Also, studying a series of searches from the same user may offer a glimpse into modified search behaviour. How does an individual change their queries after receiving unsatisfactory results? Are search terms shortened, lengthened or combined with new terms? There is much other information that a search engine might collect about a user when a search is performed – location, language preferences indicated in their browser, or the type of device they are using (mobile phone, handheld or desktop). But how
Figure 2.1. Google Web History panel

such behavioural data can be collected by a search engine? The answer is in the next section.

2.2. Methods for the Analysis of Behavioural Data

Search engine robots, hereafter crawlers [26], continuously gather information from almost every website on the Internet. It is well known that Google collects an enormous amount of data through this process. The reason is that these data have the greatest significance for the search engine algorithm; this is why classic web positioning methods are based on link maintenance, mainly the acquisition of links. Google processes these data and sorts the websites according to their value to the user. The user sends queries to the search engine and gets appropriate SERPs. Because Google knows what people search for, it is able to determine the popularity of specific information on the Internet. But eventually, it is the user who decides which website is valuable for him and which is not. The value of a particular website is reflected in users' activity – which links have been clicked and how much time passed between these actions. This information is called behavioural data. It is reasonable to make all these data useful to the search engine. Certainly, Google knows that, too. This is probably the reason why it collects an enormous amount of behavioural data in addition to the data collected by crawlers. This kind of information is what this study is most interested in.

2.2.1. Methods of behavioural data collecting

The entire web is based on the HTTP protocol, which generates requests containing the following information:
• IP address of the user making the request, which can be used for geolocation of this user,
• date and time of the request,
• language spoken by the user,
• operating system of the user,
• browser of the user,
• address of the website whose link redirected the user to the requested website.

These HTTP requests are used by Google in:
Click tracking – Google logs all of its users' clicks on all of its services,
Forms – Google logs every piece of information typed into every submitted form,
Javascript execution – requests, and sometimes even more data, are sent when a user's browser executes a script embedded in a website,
Web beacons – small (1 pixel by 1 pixel) transparent images on websites, which cause a request to be sent every time a user's browser tries to download such an image,
Cookies – small pieces of text stored on the user's computer which let Google track users' movement around the web every time they land on any page that carries Google advertisements.

But all these elements have to be placed on the websites being indexed by Google's crawlers. Fortunately for Google, it has a lot of services that are very useful for Internet publishers. Because they are mostly free to use, webmasters gladly use them on their websites.

Google Analytics

One of these attractive services is Google Analytics. It generates detailed statistics about:
  • 13. Chapter 2. Personalized Search 9 • the visitors to the website, • the previous website of the visitor, • activity (navigation) of the user in the website. Figure 2.2. Google Analytics main panel It is the most popular and one of the most powerful tool to examine the web traffic on our website. It gives the owner a lot of useful information about visitors on his website. Figure 2.2 shows a few features of this piece of software. The information which it provides is certainly very interesting for the Google itself. For this reason there is a lot of discussions between SEO workers about possible disadvantages of using Google Analytics in SEO campaigns. It is because poor results of particular website indicated through Google Analytics can inform Google’s search engine to decrease value of this website in search ranking. But this is only an unconfirmed speculation. Google Toolbar Another Google’s tool which provides them with even more valuable data is Google Toolbar. It is the plug-in adding a few new features to popular browsers, mainly a quick access to Google’s services. One of these features is checking the PageRank value
for the currently viewed webpage. This gives Google information about every website that users with the Toolbar installed are viewing.

Google AdSense

There is also Google AdSense – a contextual advertising program for website publishers. Millions of websites use this service to generate some financial profit for their authors. The effect is that all these websites display ads served by Google's servers. It can provide Google with similar information as Google Analytics and Google Toolbar.

Google Public DNS

The latest service launched by Google is Public DNS (Domain Name System) [23]. It is said to be faster and more secure than other resolvers, and this is how Google encourages us to start using it. It can generate a massive amount of information about web traffic, since every single query to the DNS can be analyzed by its provider. So the more popular their DNS becomes, the better for Google. It can provide a lot of information helpful in determining website popularity.

However, because of DNS caching mechanisms [23], Google does not get all the desired information. A DNS client sends a query only when the user wants to visit a domain for the first time. After it gets the IP address of this domain from the DNS, it caches it for an interval determined by a value called time to live (TTL). Subsequent visits during this interval do not send any query. Consequently, Google still needs other services to gather the desired information about the activity of a particular website's visitors.

Other Google Services

Google has other very popular services, for example YouTube, Google Maps (Fig. 2.3) etc. They allow users to embed objects like videos or maps on their own websites. There is also Google Reader, which can indicate the popularity of particular websites by counting their RSS¹ subscribers. There are many other ways for Google to gain useful data [9]. In fact, Google itself admits to the use of all the described techniques in its privacy policy [12]. Most of these data are probably used to improve the accuracy of their search engine and the quality of their services.

2.2.2. Process of tracking user

The described services can be a great source of behavioural data; there is no doubt about that. But the process of tracking a user's search activity would be incomplete without the data provided by the search engine itself. The next few sections present how the tracking process looks.

1. RSS (most commonly expanded as Really Simple Syndication) is a family of web feed formats used to publish frequently updated works – such as blog entries, news headlines, audio, and video – in a standardized format.
Figure 2.3. Google Maps example screen

Starting the session

When the user opens the search engine site (typing the www.google.com address into a browser), he sends an HTTP Request [37] to the server. This request contains the IP address of the user's computer. Thanks to this information, the search engine is able to relate subsequent search queries to particular users. Each of them is assigned a unique session identifier, stored on the server. This is the beginning of the user's search session. The identifier expires after a certain period of the user's inactivity, and in this way the search session is terminated.

Sending the search query

The view presented in figure 2.4 should be familiar to every Internet surfer. This is the place where the user can type his search query. After the search query is sent, two things happen:
1. The query is stored in a database and connected with the user's session identifier. The personalization mechanism then takes advantage of it.
2. The query is analysed and used by the search algorithm to provide relevant search results to the user.
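To make these two steps concrete, the following is a minimal sketch, written in Python, of how a search query could be tied to a session identifier and stored for later use by a personalization mechanism. It is not Google's actual implementation, which is not public; all names (sessions, query_log, log_query) are hypothetical.

import time
import uuid

# In-memory stores standing in for the search engine's databases (illustrative only).
sessions = {}     # session_id -> {"ip": ..., "last_seen": ...}
query_log = []    # list of (session_id, query, timestamp) tuples

SESSION_TIMEOUT = 30 * 60   # session expires after 30 minutes of inactivity

def get_session(ip_address):
    """Return an existing session for this IP or start a new one ("Starting the session")."""
    now = time.time()
    for sid, data in sessions.items():
        if data["ip"] == ip_address and now - data["last_seen"] < SESSION_TIMEOUT:
            data["last_seen"] = now
            return sid
    sid = uuid.uuid4().hex
    sessions[sid] = {"ip": ip_address, "last_seen": now}
    return sid

def log_query(ip_address, query):
    """Store the query together with the session identifier ("Sending the search query")."""
    sid = get_session(ip_address)
    query_log.append((sid, query, time.time()))
    return sid

A later personalization step could then read query_log to build a per-session interest profile.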
  • 16. Chapter 2. Personalized Search 12 Figure 2.4. Google Search main screen After that user receives an HTTP Response [37] with the search results as the HTML code. Result selection In the figure 2.5 there is presented one of the results of ”query example” with the hyperlink highlighted. Figure 2.5. Example of the search result Commonly click on the link sends HTTP Request to the server which link leads to. So in this case, it should be sent to: http://www.wisegeek.com/what-is-query-by-example.htm But when you look into source code of the SERP, you will find URL like this: http://www.google.com/url?sa=t&source=web&cd=6&ved=0CDMQFjAF& url=http://www.wisegeek.com/what-is-query-by-example.htm& ei=CekOTLT4EpHu0gTYitWXDg&usg=AFQjCNE3t34-kSehUAK8TFNwh5CV9K-OWg& sig2=PdwrnqnhLhowpC8t5-06bw The most important thing which can be noticed is that the links in the SERPs lead to Google’s server. But the chosen website finally appears on the user’s screen, because Google’s server is doing URL redirection (forwarding). This technique bears the down- side of the short delay caused by the additional request to search engine server. However, in this way search engine can log every user’s click in SERPs. What is more, there are some additional data in the result’s URL which probably provide some extra information. For example, the value of the cd parameter is the place number in the
current search ranking. What is more, this URL can be seen by the target server, because browsers place it into the HTTP Request data as the Referer field [37]. This fact is used by software like Google Analytics to aggregate the traffic sources of websites containing Analytics scripts. Thanks to this, the website owner can learn:
• the most popular search phrases that result in visits to his website
• the place in the ranking of his website for a particular search phrase and a particular user (it can vary due to the personalization mechanism)
Of course, the same information is taken into consideration by the search engine.

Behavioural data extraction

According to much research [1], [6], [7], [10], [20] and [36], more than half of a website's visits come from SERPs. In the case of e-commerce websites this value is even higher, because such services usually do not have regular visitors. Their visitors mostly come from search engines (even 90% of visits) or from ads appearing on other websites. According to the analysis of real users' web traffic [29], a typical user spends about 2 hours per session and 5 minutes per page. These statistics concern a website of good quality, relevant to the user's interests. A visit to a website with poor content would be terminated after as little as a few seconds – a so-called bounce. Such a visit should indicate the irrelevance of the website selected by the user, so it would be desirable if it did not appear in the search results for that particular phrase.

Not only what you select or interact with from a given set of search results (or the ads served with them), but also what you do not select or have minimal interaction with (bounce rates) can have an effect. These metrics can be used to create a better probability model for future search result sets.

What is more, a mechanism based on cookies has recently been introduced in Google Search. It allows Google to learn every user's history of search queries from the last 180 days, even if the user is not logged in to any Google service. Officially it is used to personalize SERPs according to the past interests of the user. Google [13] says about that: Because many people might search from a single computer, the browser cookie may be associated with more than one person's search activity. For this reason, we don't provide a method for viewing this signed-out search activity.

The diagram in figure 2.6 shows the process of tracking a user who has not been signed in to a Google Account. This is what Google [14] says about personalized search for signed-out users: When you search using Google, you get more relevant, useful search results, recommendations, and other personalized features. By personalizing your results, we hope to deliver you the most useful, relevant information on the Internet. In the past, the only way to receive better results was to sign up for personalized search. Now, you can get customized results whenever you use Google. Depending upon whether
Figure 2.6. Activity diagram of a visit session

or not you're signed in to a Google Account when you search, the information we use for customizing your experience will be different:

Signed-in personalization: When you're signed in, Google personalizes your search experience based on your Web History. If you don't want to receive personalized results while you're signed in, you can turn off Web History and remove it from your Google Account. You can also view and remove individual items from your Web History.

Signed-out customization: When you're not signed in, Google customizes your search experience based on past search information linked to your browser, using a cookie. Google stores up to 180 days of signed-out search activity linked to your browser's cookie, including queries and results you click.
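As a side note to the tracking process described above, the ranking position carried in the cd parameter of the redirecting URL can be recovered by the target server from the Referer header. The sketch below, assuming Python's standard urllib, only illustrates this idea; the URL format is the one observed earlier in this chapter and may change at any time.

from urllib.parse import urlparse, parse_qs

def parse_google_referer(referer):
    """Extract the target URL and the ranking position (cd parameter) from a Google redirect URL."""
    params = parse_qs(urlparse(referer).query)
    target = params.get("url", [None])[0]
    position = params.get("cd", [None])[0]
    return target, int(position) if position else None

referer = ("http://www.google.com/url?sa=t&source=web&cd=6"
           "&url=http://www.wisegeek.com/what-is-query-by-example.htm")
print(parse_google_referer(referer))
# ('http://www.wisegeek.com/what-is-query-by-example.htm', 6)

This is, in essence, the kind of processing that lets analytics software report the ranking position for each visit coming from a SERP.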
Table 2.1. Information used by Google Search in personalization

Place of data storage – Signed-in Personalized Search: Web History, linked to the Google Account. Signed-out Personalized Search: on Google's servers, linked to an anonymous browser cookie.
Time of data storage – Signed-in: indefinitely, or until the user removes it. Signed-out: up to 180 days.
Searches used to customize – Signed-in: only signed-in search activity, and only if the user has signed up for Web History. Signed-out: only signed-out search activity.

2.3. Research

The goal of this section is to evaluate the current level of personalization based on several factors. Helpful for this task is what Google [14] says about the types of result customizations:

When you use Google to search, we try to provide the best possible results. To do that, we sometimes customize your search results based on one or more factors:

Search history: Sometimes, we customize your search results based on your past search activity on Google, such as searches you've done or results you've clicked. If you're signed in to your Google Account and have Web History enabled, these customizations are based on your Web History. If you're signed in and don't have Web History enabled, no search history customizations will be made. (Using Web History, you can control exactly what searches are stored and used to personalize your results.) If you aren't signed in to a Google Account, your search results may be customized based on past search information linked to your browser using a cookie. Because many people might be searching on one computer, Google doesn't show a list of previous search activity on this computer.

Location: We try to use information about your location to customize your search results if there's a reason to believe it'll be helpful (for example, if you search for a restaurant chain, you may want to find the one near you). If you're signed in to your Google Account, that customization may rely on a default location that you've previously specified (for example, in Google Maps). If you're not signed in, the results may be customized for an approximate location based on your IP address. If you'd like Google to use a different location, you can sign in to or create a Google Account and provide a city or street address. Your specific location will be used not only for customizing search results, but also to improve your experience in Google Maps and other Google products.
2.3.1. Location

While you can search at google.com just about anywhere in the world, you can also access Google at a number of country-specific addresses, such as google.co.uk, www.google.fr or www.google.co.in. In fact, Google automatically redirects you to the proper domain using your IP address to determine your geolocation. The browser's preferred-language setting was cleared for this experiment.

The experiment was performed in one location in Poland. However, to simulate requests from other locations, a software environment similar to the one described in section 5.6.1 was used. The phrase used, "jaguar", is multi-lingual, so the language of the phrase does not affect the search results. The first query was sent through three Tor hosts, where the exit host was located in Los Angeles, California, United States. The result of the query is presented in figure 2.7 (only the first several entries). All websites in this SERP are in English, which is the prevailing language at the described location. Moreover, near the bottom there are some places indicated on Google Maps which are physically close to the location of the exit host. The second query was sent via an exit host located in Erfurt, Thuringen, Germany. Figure 2.8 presents the results of this query.

The Official Google Blog [13] states that the same query typed in multiple countries may deserve completely different results. The presented results clearly show that those words are true. Unfortunately, the author was unable to check whether a search for the query "football" provides different results in the US, the UK and Australia, where the term refers to completely different sports. But it is quite possible. A preferred country might include the country of the searcher as well as other countries the searcher might find acceptable, such as showing search results from the United States to people located in Canada.

2.3.2. Phrase language

It is rather clear that the language of the searched phrase is significant for the results. Despite personalization, search engines still use matching of phrases to the content of indexed pages as the major factor in evaluating search relevance. For this reason, phrases identical in semantic meaning but expressed in different languages are in general completely different. So serving search results with English pages about birds would be senseless if the user typed the phrase "Vogel" into the search box, which means "bird" in German.
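The language dimension of such experiments can also be observed without changing the physical location, simply by varying the request headers. The sketch below assumes the third-party requests library and only illustrates the idea of comparing result pages fetched with different Accept-Language headers; it is not the exact environment used in this thesis (that environment is described in section 5.6.1).

import requests   # third-party HTTP library, assumed to be installed

QUERY = "jaguar"

def fetch_results(accept_language):
    """Fetch a results page for the same query, announcing a different preferred language."""
    response = requests.get(
        "http://www.google.com/search",
        params={"q": QUERY},
        headers={"Accept-Language": accept_language,
                 "User-Agent": "Mozilla/5.0 (personalization experiment)"},
    )
    return response.text

english_page = fetch_results("en-US,en;q=0.8")
german_page = fetch_results("de-DE,de;q=0.8")

# Comparing the two HTML documents (for example, the domains of the top entries)
# shows how much the declared language alone changes the returned results.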
  • 21. Chapter 2. Personalized Search 17 Figure 2.7. Search results of the query sent via host located in USA 2.3.3. Search history The most interesting factor which is said to have influence in personalization mechanism in Google search engine is user’s search history. Figure 2.9 shows search results which was slightly modified by re-ranking based on search history. On the fifth position, right after two video thumbnails, there is a link to the website, which was visited 4 times (exact number of visits is visible on the right side of the hyperlink) used by the author of this this study to gather information. These fluctuations appear only when user is signed into Google Account, otherwise there is no access to the web history (figure 2.1). This modified search result was not
  • 22. Chapter 2. Personalized Search 18 Figure 2.8. Search results of the query sent via host located in Germany the author’s intentionally. The phrase ”personalized search” was not the object of the experiment. However, this result shows, that search history affects future search results on similar areas of information. To compare modified results with the original (without impact of personalization), there are two ways to disable results customization: 1. signing-out from Google Account 2. using ”View customization” which is available on the bottom of results screen After using one of these options, we can check the original position of the visited website in the ranking. In this particular case, the website holds 17th position in the results with no customization. So after personalization re-rank, there was position increase by 12 places. But the most important is the fact that this change shows visited website on the first SERP of the search ranking. In most cases (more than 90% of searches) users does not go beyond first page of the results. So such change in ranking causes huge increment of the visitors via this phrase.
Figure 2.9. Search results personalized by user's search history

Unfortunately, the author did not manage to force the search engine to re-rank search results intentionally. So after this experiment, an approximation of the re-ranking algorithm is impossible.

2.4. Spam Issue

There is a huge amount of value in getting to the top of the search results, especially for competitive phrases related to business. This is a marketing area that often involves millions of dollars. So spammers are highly motivated, because there is a lot of money at stake. Unfortunately, regular users searching for valuable content are the main victims of these practices.

One of the more interesting aspects of implicit and explicit user feedback during the search personalization process is that it can be very effective in dealing with spam. The more personalized the results, the smaller the chance that spam will appear in the search ranking. In most cases users do click on spammy websites (tricked by a link with false information about the target website), but after realizing the real value of those websites they quickly go away and do not come back to them.
Not only does this enable Google to limit spam through personalization, it is also a great source of query/click analysis. It is worth considering the case where the click data aggregated across multiple users shows that a given entry in a query space is rarely clicked, or shows a high bounce rate. Google might simply use that signal as a dampening factor for a spam result.
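How such a signal could be folded into the ranking can be pictured as a simple multiplier. The following sketch is purely speculative – Google's actual algorithm is not public – and the thresholds and weights are invented for the example.

def dampening_factor(impressions, clicks, bounces):
    """Speculative spam-dampening signal: rarely clicked or quickly abandoned results lose weight."""
    if impressions == 0 or clicks == 0:
        return 1.0                      # no behavioural evidence, leave the score unchanged
    click_through_rate = clicks / impressions
    bounce_rate = bounces / clicks
    if click_through_rate < 0.01 and bounce_rate > 0.8:
        return 0.5                      # strong negative signal: halve the original score
    return 1.0 - 0.3 * bounce_rate      # otherwise damp the score in proportion to bounces

# Example: 10,000 impressions, 50 clicks, 45 of them bounced quickly.
print(dampening_factor(10000, 50, 45))  # 0.5 -> the entry is pushed down in the ranking

The point is only that aggregated click-through and bounce statistics give the engine a cheap signal for demoting entries that users themselves consistently reject.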
  • 25. Chapter 3 Impact of Personalized Search on SEO 3.1. Metrics For quite a long time, SEO workers were using position in search ranking for particular phrases as the indicator of the web positioning. Increase of the website position has been always a desirable consequence of the SEO actions. After implementation of personalization mechanism, the issue is not so simple. Al- though customizations in rankings are still not very influential (only one entry in whole SERP), the highly visible benefits of the personalization suggest, that the impact will be increasing. For this reason ranking position cannot be the major metrics of success any longer. It is because position in ranking of particular website can be different for every user. Especially for those of them, which are regular visitors of this website. Of course, this indicator is still measurable, because position monitors1 are not person- alization subject (they do not use cookies and Google Account). It can also give useful information about position seen by users searching for concerned keywords for the first time. But it is the increase of inbound traffic which has always been the main moti- vation of SEO actions. So the major metrics of this actions should be closely related to this motivation. For example, those are metrics for SEO in time of personalized search: Number of unique visitors. Higher value indicates good result of the advertising campaign and gaining popularity among new customers. Previous search queries. As an example: if the searcher has been recently searching the term ‘diabetes’ and submits a query for ‘organic food’ the system attempts to learn and presents additional results relating to organic foods that are helpful in fighting diabetes. 1. Software which automates monitoring of a website’s position in search rankings for given phrases
  • 26. Chapter 3. Impact of Personalized Search on SEO 22 Previously presented results. Results that have been presented to the end user can be omitted in future results for a given period of time in exchange for other potentially viable results. User query selection. Past selected or preferred documents can be analysed and similar documents or linking documents can be used to refine subsequent results. Furthermore, certain documents types can be seen as preferred, in what would be a combination of Universal Search concepts. Common websites that accessed can also be tagged as preferred locations for further weighting. Selection and bounce rates (and user activity on website). An editorial scor- ing can be devised from the amount of time a user spends on a page, the amount of scrolling activity, what has been printed, or even what has been saved or book- marked. All can be used to further refine the ‘intent’ and ‘satisfaction’ with a given result that has been accessed. Advertising activity. The advertisements clicked on can also begin to add to a clearer understanding of the end users preferences and interests. User preferences. The end user can also provide specific information as to personal interests or location specific ranking prominence. It could also include favourite types of music or sports, inclusive of geo-graphic preferences such as a favourite sport in a given city. Historical user patterns. A persons surfing habits over a given period of time (e.g. 6 months) can also play a role in defining what is more likely to be of interest to them in a given query result. More recent information (on above factors) is likely to be weighted more than older historical performance metrics within a set of results. Past visited sites. Many of the above metrics, such as time spent and scrolling on a given web page or historical patterns and preferred locations can also be collected in a variety of ways (invasive or non-invasive). Cookies actually save resources for the Search Engine, an added benefit. The advices how to improve values of such metrics are presented in the next chap- ter. Higher position in rankings not always implicate more visitors. Moreover, there is no significant difference for positions between 6 and 10. Very often the proper website optimization of page’s title and description visible in a SERP is more important and brings more visitors than higher position. Better website titles and meta-descriptions would have an advantage as getting the user to engage with the SERP listing upon initial presentation would be at a premium. Quality content as well would begin to take on a more meaningful role than it has in the past, as bounce rates and user satisfaction now starts to play into actual search results rankings.
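Before moving on to specific areas for consideration, it is worth noting that several of the metrics listed above can be computed directly from a site's own visit log, independently of any ranking position. Below is a minimal sketch, with an invented log format, of how the number of unique visitors and the bounce rate of search-engine traffic could be derived.

# Each visit record is assumed to look like:
# {"visitor_id": "...", "referrer": "google", "pages_viewed": 3, "seconds_on_site": 240}

def seo_metrics(visits):
    """Compute unique visitors and bounce rate for traffic coming from search engines."""
    search_visits = [v for v in visits if v["referrer"] in ("google", "bing", "yahoo")]
    unique_visitors = len({v["visitor_id"] for v in search_visits})
    bounces = sum(1 for v in search_visits
                  if v["pages_viewed"] == 1 and v["seconds_on_site"] < 10)
    bounce_rate = bounces / len(search_visits) if search_visits else 0.0
    return {"unique_visitors": unique_visitors, "bounce_rate": bounce_rate}

visits = [
    {"visitor_id": "a", "referrer": "google", "pages_viewed": 1, "seconds_on_site": 5},
    {"visitor_id": "b", "referrer": "google", "pages_viewed": 4, "seconds_on_site": 300},
    {"visitor_id": "b", "referrer": "direct", "pages_viewed": 2, "seconds_on_site": 60},
]
print(seo_metrics(visits))   # {'unique_visitors': 2, 'bounce_rate': 0.5}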
  • 27. Chapter 3. Impact of Personalized Search on SEO 23 3.1.1. Areas for Consideration Author’s experience in commercial SEO, which is closely related to the topic of mar- keting, is rather small. Thus this section is based on [16]. Demographics It should be ensure to leverage any obvious demographics that may apply to your site. If it is geographic, topical (sports, politics) or even a given age group, ensuring that this is targeted effectively is important in that the ‘topical’ nature of personalized search can group results prior to even ranking them. If the particular website is not clear in each of these areas, it risks less weighting to tighter demographic starting document sets. Even your off site activities (link building, Social Media Marketing etc.) should be as tightly targeted as possible. Relevance profile Of particular interest is potential categorization in terms of topical relevance. Ensur- ing that your site provides a strong relevance train would be particularly valuable. Much like phrase based indexing and retrieval concepts, probabilities play a large role. When refining results the search engine looks at related probable matches. Through a concerned effort with on-site and off-site relevance strengthening, you increase the odds of making it to a given set of results in a world of ‘flux’. It never hurts to review the concepts surrounding Phrase Based Indexing and Retrieval as many of the related patents addressed deriving concepts/topics from phrases. One would also have to imagine that tightening up the relevance profile in your Social Media Marketing efforts would also be beneficial to a tighter topical link profile. Fur- thermore, many topically targeted visitors that enter a site may bookmark (or passive collection) your site which ads to the organic search profile without ever being included in a search result. As such, there are many exterior opportunities to be had beyond the traditional off-site SEO. Keyword Targeting Building out from your core terms will be important as far as understanding search behaviour. The long tail as we know it would be targeted towards potential query refinements on a given subset of searcher types. Building out logical phrase extensions and potential query refinements would be something to look at. Furthermore, with changeable personalized ranks we would measure SEO success in actual traffic and conversions which puts term targeting into a new light as far as nailing money terms and having a cohesive plan that targets query refinement long-tail opportunities. Quality Content In considering the value of a website, user interaction becomes a consideration as far as bounce rates, time spent on page and scrolling activities are concerned. Producing
  • 28. Chapter 3. Impact of Personalized Search on SEO 24 compelling and resourceful content would be at a premium to best leverage these tendencies of the system. If a searcher has selected and interacted with your site on multiple occasions your site would be given weight in their personal rankings as well as related topical and searcher types. The more effective a resource the greater the ranking weight increase. Search result conversion Working with the page title, meta-description and snippets takes on a more important role in your SEO efforts when adjusting for personalized search. I dare say using ana- lytics and a form of split testing would be a great advantage as far as satisfying what not only ranks, but converts. Freshness Another area which may be important is document freshness in that people could be able to set default date ranges or the system could passively begin to see a pattern of a user accessing more current content. Valuable website that has been ranking well for a year that may no longer be getting all the traffic that is has been used to. It should be looked at updating such pages with fresh information, or creating new related pages and pass the flow via internal links. Depending on the nature of the content (searcher group profile) more current content may be more popular over the larger data set and thus newer content would be weighted more overall. Site Usability From a crawler or the end user perspective, having logical architecture and a quality end user experience is also at a premium. If similar searcher types embark on similar pathways and related actions (bookmark, print, navigate, and subscribe to RSS) then this will give greater value to those target pages within that community of search types. This also furthers the relevance profile. Analytics It can be noticed there is a strong need for the use of analytics in understanding traffic flows, understanding common pathways, bottlenecks, the paths to conversion, and much more. This data will be of immeasurable use in dealing with many of the factors that can affect Personalized Web Positioning. This issue is closely connected with psychology (particularly behavioural targeting).
Chapter 4
SEO Guide

This chapter presents a set of areas for consideration during the process of website optimization. The prepared advice concerns only areas which may have a positive effect on gaining popularity for the website. It should help to achieve a higher position in search rankings, which should increase the number of visitors. It should also make the website more attractive for users, which will probably decrease the bounce rate – a high bounce rate has a negative impact on the website in the personalization re-ranking process.

The areas covered in this guide do not include any SEO techniques connected with external actions, i.e. those that require contact with other websites, such as:
• linking (the acquisition of links), free or paid
• advertising
• presell pages¹

The listed methods are closely connected with generating spam. Because of this, they reduce the proportion of quality content on the Internet, so Internet surfers gain no benefit from them. This chapter is based mainly on information from [13], [15], [21] and [36].

1. Presell page – a page created only for SEO purposes. The text on such a page is only a surrounding for a link leading to the positioned website. The content has no value for a human reader; it is only prepared to look natural to crawlers, so that it is not filtered as spam.
  • 30. Chapter 4. SEO Guide 26 4.1. Website Presentation in Google Search 4.1.1. Title Title is the first information about particular website in SERP. It is also one of the main factors in with impact on the website ranking. An example of such title in html code looks like this: <title>Jaguars, Jaguar Pictures, Jaguar Facts - National Geographic</title> Such title presented in SERP looks like in the figure 4.1. Figure 4.1. Presentation of a website title in Google Search These are the issues connected with website title, which are significant in SEO: Length up to about 65 characters Longer titles can be also indexed by crawler, but title with 65 characters is rather optimal and it entire fits in SERP. Longer titles are shortened with ellipsis. Diversity of titles Each of the website pages relates slightly different information (e.g. product page, contact form etc.). The title should be prepared individually for each of them. Keywords There are 3 principles related to creating a title: 1. Keywords should be distributed on all the pages. Each of the pages must be optimized for only 3–4 keywords. Front page title should have most general ex- pression, titles of product pages should contain words characterizing the type of these products etc. Sticking to this rule is very important, because in other case the pages of the website could be treated by crawler as duplicated content. 2. The most important keywords should be place at the beginning of title.
3. Google can connect keywords from the title into different phrases, but those which appear one after another have the greatest impact on the position in the ranking. Due to this fact, a key phrase should not be split up.

4.1.2. Description

The description is the second piece of information about a particular website, presented right after the title in the SERP. Such a description presented in a SERP looks like figure 4.2.

Figure 4.2. Presentation of a website description in Google Search

The description shown in the SERP can be generated from the following sources:
• the description metatag, for example:
<meta name="description" content="Learn all you wanted to know about jaguars with pictures, videos, photos, facts, and news from National Geographic." />
• a fragment of the website content (in case the description metatag is too long or there is no such tag in the source code)

Here are some tips on the description metatag:

Length up to about 150 characters
Longer descriptions will not be presented in the SERP as they have been written.

Diversity of descriptions
Just as with titles, the description of a particular page should be slightly different from the others. It should be specific to the information presented on the page.

Keywords
The description should contain the keywords targeted by the SEO strategy. When it does, the keywords will be bolded in the search results for query phrases based on those keywords. This should draw users' attention to our website. The description should also be written in a way that encourages users to visit the website.
  • 32. Chapter 4. SEO Guide 28 4.1.3. Sitelinks Sitelinks are links leading to other pages of the same website. The can be presented in SERPs in 2 ways: 1. Horizontally – 4 links in 1 row (presented in figure 4.3) 2. Vertically – 8 links in 2 columns Figure 4.3. Presentation of a website sitelinks in Google Search There is no manual way for publishers to force sitelinks presenting in SERP. It depends on how the website was indexed. But it can be made easier for crawler to make it correctly. There are two things which can be done: 1. First of all it must be well designed source code related to navigation on our website. Its syntax must be very clear. 2. Prepare a sitemap of website (e.g. in XML format). This issue will be described later. 4.2. Website Content 4.2.1. Unique content The basis of the proper content optimization is its uniqueness. This means that the same text or its larger fragments should not be reproduced on other websites or on different pages of our website. In order to verify the degree of uniqueness of our content, it can be used this tool: http://www.copyscape.com 4.2.2. Keywords It is very important make search engines able to relate our website to specific theme and keywords. In order to make it possible, keywords must be considered not only in website title and description design process. Keywords must be also contained in website content. In preparing the text for the website it suggested to stick following principles:
  • 33. Chapter 4. SEO Guide 29 Repetition Keywords should be repeated several times on every page. But it cannot be forgotten that the text should be written primarily for users. The task is to find a compromise between attractive text for users and good for SEO. Too high density of keywords on particular page can be treated by search engine as an abuse. In such situation our website will be penalize by ranking exclusion. Variations and synonyms The website content will be more natural, if contained keywords are used in many variations (grammatical). The proficiency of modern search engines can also detect using synonyms. For this reason we can use for example word ”drug” in the content being optimized for ”medicine” keyword. Location Keywords should located on whole page with similar density. This will give a better result in positioning than accumulation of keywords for example only at the beginning of the page. 4.3. Source Code Website’s source code has not direct influence on the position in search ranking. How- ever, some errors can cause problem with proper indexing by search engine robots. For this reason it is worth to ensure that the code contains no errors and it is compatible with current WWW standards. Very useful is the code validation tool, provided by the World Wide Web Consortium (W3C). It can by found here: http://validator.w3.org 4.3.1. Headers HTML headers tags (h1–h6) are very significant for proper indexation of the website content. Right usage of them is very important in desing of a website. There couple issues which must be considered form SEO point of view. Hierarchy Headers tags are designed to separate particular sections of a document. They must be used in the correct order and only when there is a need to use.
Repetition

By current HTML standards, a header of the first degree (the h1 tag) may occur only once in the whole document. Other headers can be used repeatedly.

Keywords

It is suggested to put keywords into header tags, because they carry more "positioning power" than regular text. This power probably corresponds to the header hierarchy, so the most important keywords should be placed in the h1 tag.

4.3.2. Highlights

Keywords can be distinguished from the rest of the text by using the tags <strong> (bold) and <em> (italics). In this way, keywords are highlighted both for users and for crawlers. However, it should be done with restraint: not every occurrence of a keyword should be highlighted, only the most important ones.

4.3.3. Alternative texts

Sometimes there are images placed in the document. It is recommended to include alternative texts for those images. It can be done in this way:

<img src="path/to/image" alt="alternative text" />

The alternative text is displayed on the screen when the browser cannot display images (e.g. when they are unavailable on the server). These alternative texts are also interpreted by search engine robots. These data are then used in image search (when the search engine has such an option).

4.3.4. Layout

A well-indexing website should have a clear and minimalistic layout. The content is the most important factor, so even the ratio of the amount of text to the amount of HTML code is significant. The higher this value, the better and more valuable the website from the search engine's point of view.

4.4. Internal Linking

Quite an important issue in website optimization is internal linking. An internal link is a hyperlink which leads to another page of the same website. There are some recommendations connected with this link type.
  • 35. Chapter 4. SEO Guide 31 4.4.1. Distribution Each of the pages should be available in 3 or 4 clicks at most. If not, the website navigation must be re-designed. Attention should be given especially to the links on the main page. The structure of the website must be clear. What is more, it is also very important for usability of the website. Complicated navi- gation can discourage user to continue the visit. 4.4.2. Links anchors Link anchor is the clickable text. It is displayed for the user on a website, instead of plain URL which is rather unreadable for human. It looks like this: <a href="some_url.html">Anchor text</a> Anchors should describe the content of pages which their links lead to. If links are located among the other text, it should match the context of whole text. For example, it is not advised to write ”click here” like it was popular couple years ago. 4.4.3. Broken links Very important thing in website positioning is to beware of links which lead to unavail- able URLs. Such issue is very annoying and discouraging for visitors. The website with broken links will be also less valuable for search engines, because robots crawl the web using links. After indexing a page robot uses one of the links placed on this page to go to another page. When such link is broken, crawler can interrupt indexing process. It will cause the situation where not every page of the website will be indexed. 4.4.4. Nofollow attribute Nofollow is an HTML attribute value used to instruct some search engines that a hyperlink should not influence the link target’s ranking in the search engine’s index. This is example of such hyperlink: <a href="some_url" rel="nofollow">Some website</a> It is intended to reduce the effectiveness of certain types of search engine spam, thereby improving the quality of search engine results and preventing indexing particular web- site as spam. Nofollow attribute is used commonly in outbound links2 , for example in paid advertising. 2. Links which target at other websites
4.4.5. Sitemap
A sitemap is a list of the pages of a website, accessible to crawlers or users. It helps visitors and search engine bots find pages on the website.

Sitemap for users
A dedicated page can be prepared containing links to all of the website's pages, or only to the most important ones. Thanks to this, users having problems with navigation will be able to find quickly what they are looking for. An example of such a sitemap located in a footer is presented in figure 4.4.

Figure 4.4. Example of sitemap for visitors

Sitemap for robots
A sitemap for crawlers must be easy to process automatically. Such a sitemap is usually prepared in the XML document format. This is what an example looks like:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.example.com/catalog?item=12</loc>
    <changefreq>weekly</changefreq>
  </url>
  <url>
    <loc>http://www.example.com/catalog?item=73</loc>
    <lastmod>2004-12-23</lastmod>
    <changefreq>weekly</changefreq>
  </url>
  <url>
    <loc>http://www.example.com/catalog?item=74</loc>
    <lastmod>2004-12-23T18:00:15+00:00</lastmod>
    <priority>0.3</priority>
  </url>
  <url>
    <loc>http://www.example.com/catalog?item=83</loc>
    <lastmod>2004-11-23</lastmod>
  </url>
</urlset>

As can be noticed, such a document contains some information about each link:
• loc: URL of the particular page
• lastmod: time of the last modification of the page
• changefreq: average period between changes to the page
• priority: priority value for the crawler when indexing the particular page
Such information is welcomed by crawlers and can benefit the publisher through faster indexing. In most cases such documents are prepared using software tools such as: http://www.xml-sitemaps.com/
Once the sitemap document is prepared, the search engine must be notified about its existence through a special form.

4.5. Addresses and Redirects
Besides the previously described factors used in the ranking algorithm, search engines also consider the form of the indexed website's URLs and the information included in HTTP responses.

4.5.1. Friendly addresses
Search engines give a higher rank value to websites whose pages have URLs that are more readable for a human. For example, an address like this:
http://www.example.com/index.php?page=product&num=5
can be written in this way:
http://www.example.com/product/5
Such an effect can be achieved using mod_rewrite, a module for the Apache Server which allows creating regular expression patterns for mapping URLs to particular pages. Modern web frameworks, such as the Django Framework or Ruby on Rails, offer the same possibility, as the sketch below illustrates.
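As an illustration of how such friendly addresses are typically declared in a web framework, the following is a minimal sketch of a URL configuration in a recent version of Django. The view name product_detail and the module layout are assumptions made for this example, not part of the original guide.

# urls.py -- minimal sketch of friendly URLs in Django (view names are assumed)
from django.urls import path

from . import views

urlpatterns = [
    # Serves http://www.example.com/product/5 instead of index.php?page=product&num=5
    path("product/<int:num>/", views.product_detail, name="product-detail"),
]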
Moreover, friendly addresses give another opportunity for placing keywords. For this reason, a page being optimized for a particular keyword should have this keyword in its URL. If it is a phrase of several words, it is suggested to separate them with dashes.

4.5.2. Redirect 301
Redirect 301 is a permanent redirect from one address to another. After using it:
• visitors who type the old address into the browser's address bar will be redirected to the new one,
• some search engines will replace the old address in their database with the new one.
So it is very useful after a domain change.
It was said earlier that the content of a website should not be duplicated. It is often forgotten that allowing a website to be entered through several addresses has the same result. Sometimes the same website can be downloaded via:
• example.com
• www.example.com
• example.com/index.html
• www.example.com/index.html
• example.com/default
• www.example.com/default
In such a case, it must be decided whether the main address of the website will have the "www" prefix. If it will, an .htaccess file with the following content should be placed in the main folder of the server:

RewriteCond %{HTTP_HOST} ^example.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

Similarly, we can manage the redirection from the index.html file:

RewriteCond %{REQUEST_FILENAME} index.html
RewriteRule ^(.*)$ http://www.example.com [R=301,L]

There are also many other possibilities which can be managed likewise.

4.6. Other Issues
There are a couple of other things which have some influence on the website ranking.
4.6.1. Information for robots
Sometimes publishers do not want robots to index some of the website's pages, while keeping them available for regular visitors. For example:
• results from internal search engines,
• data sorting results,
• print versions of pages,
• pages which should not be indexed, like the login page of the administration panel.
To manage this issue, a robots.txt file can be prepared with content like this:

User-agent: *
Disallow: /admin-panel/

It should prevent robots from indexing pages whose URLs start with www.example.com/admin-panel/.

4.6.2. Performance
One of the most recent factors introduced into the Google search engine algorithm is the website performance value. Google promotes websites with a short download time. It is not as significant as internal linking or quality of content. Big information portals are very complex, so they cannot be downloaded as fast as, for example, a small blog; valuable information remains the most important factor. However, good performance can increase the rank value of a website compared to another one with similar content but lower efficiency.
To improve website performance, several tools can be used, like PageSpeed by Google: http://code.google.com/speed/page-speed
It provides an analysis of the website's downloading efficiency and gives tips on how to improve the performance. The most common suggestions given by this software are presented next.

Gzip compression
Modern web browsers allow the use of the gzip compression mechanism to reduce the size of text-based website files (HTML documents, CSS files, Javascript files). If there is such a possibility, it is recommended to use it.

Number of DNS lookups
The DNS caching mechanism [23] means there is no need to look up the IP address matching a particular domain several times. For this reason, when every file used by the website is located on the same server (or on another server in the same domain), there is only one DNS lookup during the download process.
Therefore, placing media files (images, CSS, Javascript) on a different domain without a clear need should be avoided.

External files
Information commonly located in external files, like CSS style sheets or Javascript, can also be placed inside the HTML document. However, this should be avoided, because it makes parsing the source code more complex, so the browser needs more time to display the website on the screen.

4.7. Summary
The tips presented in this chapter should significantly increase the rank value of any website. With a higher ranking in the search engine there will be more inbound traffic; in other words, the website will gain more popularity.
Making the website more attractive to visitors should also lead to better results of personalization re-ranking. The assumed impact of personalized search results on the global ranking is very likely, so improving the quality of a website on the basis of the presented guide should increase the website's ranking in general: both through the personalization impact and through collecting inbound links as an effect of the increase in popularity.
Chapter 5
The System for Global Personalization

The goal of this chapter is to propose a method for improving a website's search ranking by affecting the personalization mechanism of the search engine. The idea of this method is to generate artificial behavioural data. The author of this thesis is a co-author of the article [19] on which this chapter is based.
In section 2.2 of this thesis it has been shown that a lot of data goes into Google and a lot of useful, manipulated data comes out. But we can only guess what happens in between, or try to learn from observing the data coming out of Google. Evans wrote [10] that identifying the factors involved in a search engine ranking algorithm is extremely difficult without a large dataset of millions of SERPs and extremely sophisticated data-mining techniques. That is why observation, experience and common sense are the main sources of knowledge about Search Engine Optimization (SEO) methods. It was according to this knowledge that Search Engine Ranking Factors [11] was created. Its last edition assumes that traffic generated by the visitors of a website has 7% importance in Google's evaluation of the website's value. It is, after links to the specific website and its content value, the most significant factor in the website evaluation process. On the basis of the previous editions of the ranking, one can notice that the importance of this factor is increasing.
Because all of these are only reasonable assumptions, the intention is to evaluate the validity of the described factor in web positioning efforts. For this purpose we need a simulation tool which will generate the necessary human-like traffic on a tested website. The tool is going to be a multi-agent system (MAS) which will imitate real visitors of the websites.

5.1. Problems to Solve
Fig. 5.1 presents the main reason why the system must be distributed. A few queries to Google, sent in quick succession from the same IP address, are
detected by Google and treated as abuse. Google suspects automated activity and requires completion of a captcha form in order to continue searching.

Figure 5.1. Information displayed by Google on abuse detection

If a distributed system were used, the queries would be sent from many different IP addresses. This should guarantee that Google will not consider the activity abusive.
This issue cannot be solved by using a set of public proxy servers, as Google has probably put them on its black list. Every single query to Google via such a proxy server leads to the same end: a captcha request.
What is more, following Tuzhilin [35], we can say that Google puts a lot of reasonable effort into filtering invalid clicks on advertisements. There is a big chance that Google uses some of those mechanisms in the analysis of web traffic. This is the reason why the way behavioural data is generated should be our concern: web traffic recognized as artificial could be treated by Google as abuse and cause a punishment (a decline of the website's position).

5.2. Objectives

5.2.1. Web Positioning
The main goal of the system is to improve the position of a website by generating traffic related to that website. The system should only care about activity which can be visible to Google. This means there is no need to download all content from the particular website; it would only waste bandwidth. The system should only send requests to the Google services used by a particular website, for example:
• links to the website on SERPs,
• Google Analytics scripts,
• Google Public DNS queries,
• Google media embedded on the website, like AdSense advertisements, maps, YouTube videos, calendars etc.

5.2.2. Cooperation
The whole idea of the system is to spread positioning traffic over world-wide IP addresses. As a result of this distributed character, the system requires a large group of cooperating users. Nobody will use the system if there is no benefit to them. A mechanism must be introduced which lets the system users share their Internet connections in order to help each other in web positioning. What is more, the mechanism must treat all users equally and fairly, meaning it should not allow anyone to take benefits without any contribution.

5.2.3. Control
According to [36], web positioning is not a single action but a process. It must be possible to control this process; otherwise it could be destructive instead of improving the website's position. For this reason, the system should allow users to:
• control the impact of the system activity on their websites,
• check the current results of the system activity (changes in the website's position on SERPs),
• check the current state of the website in the web positioning process.

5.3. Architecture
Fig. 5.2 presents the architecture of the system, which takes into consideration all of the specified problems and objectives.
Server is necessary to control the whole process of generating web traffic according to the specified algorithm. It gives orders to clients to start generating traffic on the specified websites. It also receives information from clients about the number of requests sent to particular Google services on the website's account.
Database serves as storage for process statistics. They can be presented to clients via a web interface. They are also used by the server for creating the orders in accordance with the algorithm.
Clients are the agents of the presented MAS. They take orders from the server, each concerning a particular website registered in the database to be processed. Processing a website means mimicking its real visitor. The client performs this autonomously using the visitor session algorithm described in the next section. An illustrative sketch of the messages exchanged between the server and the clients is given below.

Figure 5.2. System architecture
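The thesis does not define a concrete message format for the server-client communication, so the following is only a minimal sketch assuming simple serializable structures; all field names are hypothetical.

from dataclasses import dataclass, field

@dataclass
class Order:
    """Hypothetical order sent by the server to a client."""
    website_url: str              # website registered in the database
    phrase: str                   # phrase for which the website is being positioned
    sessions_to_run: int = 1      # how many visitor sessions the client should perform

@dataclass
class Report:
    """Hypothetical report returned by a client after finishing its sessions."""
    website_url: str
    requests_per_service: dict = field(default_factory=dict)  # e.g. {"SERP": 1, "Analytics": 4}

# Example round trip between server and client:
order = Order("http://www.example.com/", "example phrase")
report = Report(order.website_url, {"SERP": 1, "Analytics": 4})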
5.4. Visitor Session Algorithm
According to much research ([1], [6], [7], [10], [20] and [36]), more than half of a website's visits come from SERPs. That is why starting a single visitor session (the sequence of requests concerning a single website registered in the system) by querying Google Search sounds reasonable; however, this only works if the currently considered website appears on one of the first few SERPs. Otherwise, the visitor session should be started directly on the processed website, or should refer to an incoming link, if such a link exists.
Because of Google's likely efforts to detect abuse, the visitor session should be as human-like as possible. The analysis of real users' web traffic [29] is very useful at this point. According to it, a typical user:
• visits about 22 pages on 5 websites in one sitting,
• follows 5 links before jumping to a new website,
• spends about 2 hours per session and 5 minutes per page.
These statistics clearly indicate that a typical visit session concerns a website of good quality. A visit to a poor website would be aborted after only a few seconds, and such a visit could have a negative impact on Google's evaluation of the website's quality.
The algorithm, illustrated in figure 5.3, consists of the following steps:
1 – The server retrieves from the database information about the next website to be processed.
2 – The task is assigned to a client.
3 – The client starts the visitor session.
4 – Searching on SERPs for a link to the processed website.
5 – If a link has been found, click on the link; otherwise, make a direct request.
6 – Processing the visit session.
7 – Request to the server for another website to process.

Figure 5.3. Visitor session algorithm

5.5. Task Assignation Algorithm
The task assignation algorithm helps the server build a queue of registered websites ordered by visitor session priority. The website with the highest priority value is the next one to start a visitor session. In other words, the client always receives the website with the highest priority value to process. The priority value PV is calculated using the function:

PV(α) = r(α) · t(α) · v(α) / T(α)     (5.1)

where
α – a record in the system (a website with a phrase for web positioning),
r(α) – returns the current position in the search engine ranking for α (returns 0 if α is not present in the ranking),
t(α) – returns the time since the end of the last visitor session on α (in seconds),
v(α) – returns the number of visitor sessions made by the client of α's owner,
T(α) – returns the time since the registration of α in the system (in days).
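A direct implementation of formula (5.1) could look like the sketch below. The record fields mirror the symbols defined above; how r, t, v and T are actually measured is left open in the text, so the concrete field names are only illustrative.

from dataclasses import dataclass
import time

@dataclass
class Record:
    """α: a website together with the phrase used for its positioning."""
    ranking_position: int        # r(α); 0 if the website is absent from the ranking
    last_session_end: float      # Unix timestamp of the end of the last visitor session
    sessions_by_owner: int       # v(α)
    registered_at_days: float    # T(α): days since registration in the system

def priority_value(record: Record) -> float:
    """Compute PV(α) = r(α) · t(α) · v(α) / T(α), formula (5.1)."""
    r = record.ranking_position
    t = time.time() - record.last_session_end   # seconds since the last session ended
    v = record.sessions_by_owner
    T = max(record.registered_at_days, 1.0)     # avoid division by zero on the first day
    return r * t * v / T

# The server would pick the registered website with the highest priority value:
# next_record = max(records, key=priority_value)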
The presented function gives the highest "power" to the ranking factor. The reason for this is that websites with a high ranking value should already have more real visitors, so the system's efforts will not be as crucial for their popularity. Time of participation in the system is not very significant: novice participants have an equal chance to gain attention for their websites as the senior ones. However, the function promotes continuous activity of the clients. Also worth considering is the possibility of dynamically modifying the weights of individual factors depending on the results. Because the website queue is built by the server, it is possible to change the whole function while the system is running.

5.6. Proof Study
The presented system requires a large number of users to work properly. Otherwise, the generated traffic would not be distributed enough and would therefore look unnatural. As was shown, a centralized series of queries is seen as abuse. Unfortunately, the thesis author's resources were insufficient for this purpose. However, a simulation has been performed, which was intended to prove the proposed concept.

5.6.1. Tools
The idea was to use the Tor application (http://www.torproject.org) to make a single host (the author's computer) generate distributed traffic. In this way, the behavioural data of one real user could be seen by the search engine as multi-user traffic.
Tor is free software enabling Internet anonymity by thwarting network traffic analysis. Tor aims to conceal its users' identity and their network activity from traffic analysis. Operators of the system run an overlay network of onion routers which provides anonymity in network location as well as anonymous hidden services.
Users of the Tor network run an onion proxy on their machine. The Tor software periodically negotiates a virtual circuit through the Tor network. An application such as a browser may be pointed at Tor, which then multiplexes the traffic through a Tor virtual circuit. Once inside the Tor network, the encrypted traffic is sent from one host to another, ultimately reaching an exit node, at which point the decrypted packet is available and is forwarded on to its original destination. Viewed from the destination, the source of the traffic appears to be the Tor exit node.
As figure 5.4 shows, Tor has become quite popular, so its network involves a large number of users. This makes Tor fit the objective of this study. The Mozilla Firefox browser has been used, connected with Tor. Additionally, the iMacros plug-in has been installed in order to automate the execution of visitor sessions. For the analysis of the behavioural data received by Google during the study, Google Analytics (shown in figure 2.2) has been used. It was installed on every examined website.

Figure 5.4. Tor interface screen
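In this study the traffic was routed through Tor by the browser itself; for completeness, the following is a minimal sketch of how a script could route HTTP requests through a locally running Tor client. It assumes Tor is listening on its default SOCKS port 9050 and that the requests library is installed with SOCKS support; these details are assumptions, not part of the original experiment.

import requests  # requires the optional SOCKS support: pip install requests[socks]

# Route all HTTP(S) traffic through the local Tor onion proxy (default port 9050).
TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

def fetch_via_tor(url: str) -> str:
    """Download a page through the Tor network; the exit node becomes the visible source."""
    response = requests.get(url, proxies=TOR_PROXIES, timeout=60)
    response.raise_for_status()
    return response.text

# Example: from the destination's point of view, this request originates at a Tor exit node.
page = fetch_via_tor("http://www.example.com/")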
5.6.2. Results
The issue of distributing the traffic ended with success. After opening the Google Search main page (www.google.com), the server redirected to the domain belonging to the country in which the Tor exit node of the particular session was located. For example, when the exit node was in Germany, the Google server redirected the browser from the google.com to the google.de address. The Google Search page was displayed in the appropriate language, in spite of the fact that the browser's default language setting had been removed. After visits to the examined pages, Google Analytics also indicated that the source of the visits was not in Poland (where the study was actually conducted) but in the countries of the exit nodes of the traffic.
However, routing the traffic through the distributed Tor network appeared to be an insufficient solution. Firstly, traffic routed by Tor is significantly slowed down; from time to time there were even difficulties with downloading the complete search engine site. What is more, despite the large number of visit sources, there was still only one browser and one real user. Because of this, the simulation could only imitate one signed-in user, or a group of signed-out users.
Signed-in user
In the first case there is essentially no difference between visiting through the Tor proxy network or directly. As described, Tor is a tool for concealing the identity of a user, but after signing in to a Google Account, the identity is evident. From Google's point of view, such a visit is seen as a regular user travelling very quickly all around the world (metaphorically speaking). But it is still only one user, and the personalized search results applied to him should not be globally significant.

Group of signed-out users
As described earlier, Google introduced the personalization mechanism not only for users with a Google Account. There is also personalization of search results for users with no such profile account. It is based on storing cookies in the user's browser for up to 180 days, which contain information about past search activities. But cookies are not related to a specific user, only to a browser. In this case, search results are re-ranked not for the person but rather for the particular computer which this person uses. The simulating system used only one browser, so there was no possibility to evaluate the impact of personalization re-ranking on the search ranking from the global perspective.
Disabling the cookie storage option in the browser makes personalization impossible, because there is no way to relate past queries in the search engine to a particular user. Moreover, a browser with blocked cookies is a rather rare situation nowadays. Therefore the Google search engine is rather suspicious about traffic with blocked cookies, and for such requests it serves the "Sorry page" (figure 5.1).

5.7. Summary
Generating artificial traffic on the Internet does not seem very praiseworthy, as it is dangerously close to spam and introduces information noise into visitor statistics. On the other hand, it is no worse than other SEO activities like linkbaiting.
Following [6], today's search engines use mainly link-popularity metrics to measure the "quality" of a page. This is the main idea of the PageRank algorithm [3]. This fact causes the "rich-get-richer" phenomenon: more popular websites appear higher on SERPs, which brings them more popularity. Unfortunately, this is not very beneficial for new, unknown pages which have not gained popularity yet. There is a possibility that these websites contain more valuable information than the popular ones. Despite this fact, they are ignored by search engines because of their small number of links. These sites, in particular, need SEO efforts. Probably classic techniques will be more effective than the one presented in this paper. Nevertheless, the methods presented in this article are likely to improve the rate of web positioning, because web traffic can be noticed by search engines immediately. It