Faculty of Computer Science and Management
field of study: Computer Science
specialization: Software Engineering




Master's thesis

Analysis of the Modern Methods for Web Positioning

Paweł Kowalski

keywords: search engine, SEO, personalization, optimization, web positioning


This thesis contains an analysis of the search results personalization mechanism in
popular search engines. It presents experiments and considerations about the impact of
personalization on search rankings and how it affects Search Engine Optimization
(SEO). Some new SEO methods that take advantage of personalization in search
engines are proposed.

Supervisor:          dr inż. Dariusz Król ............................. .............................
                         name and surname                  grade                    signature



For archival purposes, this diploma thesis has been classified as:*
     a) category A (permanent records)
     b) category BE 50 (subject to expert appraisal after 50 years)
*
  delete as appropriate
                                                                            Stamp of the institute



                                       Wrocław 2010
Contents

Chapter 1. Introduction
  1.1. Beginning of the SEO Concept
  1.2. Search Engines Evolution
  1.3. The Goal
Chapter 2. Personalized Search
  2.1. Operation Principles
  2.2. Methods for the Analysis of Behavioural Data
       2.2.1. Methods of behavioural data collecting
       2.2.2. Process of tracking user
  2.3. Research
       2.3.1. Location
       2.3.2. Phrase language
       2.3.3. Search history
  2.4. Spam Issue
Chapter 3. Impact of Personalized Search on SEO
  3.1. Metrics
       3.1.1. Areas for Consideration
Chapter 4. SEO Guide
  4.1. Website Presentation in Google Search
       4.1.1. Title
       4.1.2. Description
       4.1.3. Sitelinks
  4.2. Website Content
       4.2.1. Unique content
       4.2.2. Keywords
  4.3. Source Code
       4.3.1. Headers
       4.3.2. Highlights
       4.3.3. Alternative texts
       4.3.4. Layout
  4.4. Internal Linking
       4.4.1. Distribution
       4.4.2. Links anchors
       4.4.3. Broken links
       4.4.4. Nofollow attribute
       4.4.5. Sitemap
  4.5. Addresses and Redirects
       4.5.1. Friendly addresses
       4.5.2. Redirect 301
  4.6. Other Issues
       4.6.1. Information for robots
       4.6.2. Performance
  4.7. Summary
Chapter 5. The System for Global Personalization
  5.1. Problems to Solve
  5.2. Objectives
       5.2.1. Web Positioning
       5.2.2. Cooperation
       5.2.3. Control
  5.3. Architecture
  5.4. Visitor Session Algorithm
  5.5. Task Assignation Algorithm
  5.6. Proof Study
       5.6.1. Tools
       5.6.2. Results
  5.7. Summary
Chapter 6. Conclusion
List of Figures
Bibliography
Abstract
Modern search engines are constantly being improved. The most recent big step
introduced into their algorithms concerns the personalization mechanism. Its goal is to
extract information about the user's preferences implicitly from his search behaviour
and from factors such as location, phrase language and search history. This
information is the basis for building the user's search profile. The motivation for this
process is to provide search results that are more relevant to a specific user and his interests.
This thesis concerns the details of this personalization mechanism and tries to examine
how various factors affect search results. The author also analyses the methods used by
search engines for collecting behavioural data. He attempts to define the possible
impact of search results customization on Search Engine Optimization (SEO) issues
such as metrics, spam filtering or changes in the significance of website optimization
factors. The author then tries to evaluate the possibility of manipulating personalized
search rankings through a proposed system for generating human-like web traffic.

                                   Streszczenie (Polish abstract)
Modern web search engines are constantly being improved. The latest big step forward
introduced in their algorithms concerns the personalization mechanism. Its task is to
obtain information about the user's preferences from his behaviour while searching for
information, with respect to factors such as his location, the language of the searched
phrase and the search history. This information is the basis for creating a user profile.
The goal of these actions is to return to a specific user search results that better match
his interests. This thesis contains a detailed analysis of the personalization mechanism
and an attempt to examine how particular factors affect search results. The author also
analyses the methods by which search engines acquire data about users' behaviour. He
attempts to determine the possible impact of adjusting search results to the user on
topics related to web positioning, such as metrics, spam filtering or changes in the
significance of particular factors in website optimization. The author then tries to assess
the possibility of manipulating personalized search results by means of a proposed
system for generating natural-looking web traffic.




Chapter 1

Introduction

Before the Web and present-day search engines, searching meant simply matching the
terms in a query to the exact appearance of those terms in a database filled with textual
documents. Some database searches only let you locate documents where certain words
appeared within a defined distance of other specified words in the same document.
Sorting documents by relevance or importance would have been a monumental task, if
possible at all.



1.1. Beginning of the SEO Concept

When the Internet was introduced, it revolutionized the worldwide sharing of information.
Free access for everyone, without any restrictions, is the reason why the
Internet is considered to be one of the greatest inventions of the 20th century. But this
freedom has a serious implication: many problems with organizing this enormous set
of information.
Hyperlinks turned out to be insufficient for the task. This is why the first search engines
were introduced. They quickly became the main source of visits to commercial websites,
and good search results became a very important issue for content publishers.
That moment was the beginning of the SEO1 concept, which is still a major element
of Internet marketing.
The early search engines like AltaVista or Lycos were launched around 1994–1995 [7].
Their algorithms analysed only the content of websites and the keywords in
meta tags. It was easy to circumvent these algorithms by placing false information
in the keywords tags. Another popular fraud was filling website content with irrelevant
text which was visible only to search engine robots, but not to the user. As a result,
search engine result pages (hereafter SERPs) contained websites filled with spam and
inappropriate content [21].
 1. Search Engine Optimization (SEO) – the process of improving the volume or quality of traffic to
    a website from search engines. The term Web Positioning is also used as a synonym.

1.2. Search Engines Evolution

The relevance of search results to a query was still based on keyword matching, but
search engines started to understand differences in the importance of words located in
different parts of a page. For example, if you searched for a certain phrase, pages
containing those words in their titles and headlines might be considered more relevant
than other pages where those words also appeared, but not in those "important" parts
of the page.
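As a rough illustration of this idea, the sketch below scores a page by counting query terms in each part of the page and weighting the counts by where they occur; the weights and field names are illustrative assumptions, not values used by any real engine.

# A minimal sketch of field-weighted keyword matching (assumed weights).
FIELD_WEIGHTS = {"title": 3.0, "headline": 2.0, "body": 1.0}

def keyword_score(query_terms, page_fields):
    """page_fields maps a field name ('title', 'headline', 'body') to its text."""
    score = 0.0
    for field, text in page_fields.items():
        words = text.lower().split()
        for term in query_terms:
            score += FIELD_WEIGHTS.get(field, 1.0) * words.count(term.lower())
    return score

page = {
    "title": "American basketball history",
    "headline": "The origins of basketball",
    "body": "Basketball was invented in 1891 ...",
}
print(keyword_score(["american", "basketball"], page))  # hits in the title dominate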
Google, the company, which started in September 1998, revolutionized search engines. Its
co-founders, Larry Page and Sergey Brin, developed the PageRank algorithm [3]. This
algorithm redefined the search problem: the textual content of websites became somewhat
less significant. Instead of text content, PageRank rates websites mainly on the basis of
the quantity and quality of links leading to them. With the help of such improvements, the
Internet works as a kind of voting system. Every link is a vote for the website it leads to.
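A minimal sketch of the PageRank idea described above is given below: a page's score depends on the number and quality of pages linking to it. It uses plain power iteration with a damping factor of 0.85; the tiny link graph is an illustrative assumption, not data from the thesis.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:          # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:                     # each outgoing link is a "vote"
                for target in outlinks:
                    new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(graph))  # C collects the most votes and gets the highest rank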
Relevance was also determined by indexing the words used in links to other pages. If a link
leading to a page used the phrase "american basketball" as anchor text2, the page being
pointed to would be considered relevant to American basketball. The existence of links to
pages has also been used to help define the perceived importance of a page. Information
about the quality and quantity of links to a page can be used by search engines to get
a sense of the implied importance of the page being linked to.
Nevertheless, after a short time, techniques [21] for spoiling the results of PageRank
were also discovered. Basically, most of them work by increasing the
number of links leading to a particular website in order to enhance its PageRank value. Such
activity on a large scale is usually called linkbaiting. There are many scripts and web
catalogues that facilitate and automate such activity. However, Google is also constantly
working to improve its search engine algorithm and to make it resistant to
linkbaiting. According to [11], many new factors are being introduced into the website
evaluation process in order to reduce the impact of linkbaiting, which is a form of
spam.
Besides, there is a limit to the effectiveness of this type of keyword matching. When
two people perform a search at one of the major search engines, there is a chance that
even if they use the same search terms, they might be looking for something completely
different. For example, when an anthropologist searches for the phrase "jaguar", he expects
websites with information about the big cat as a result, but he may instead receive a
collection of websites about Jaguar cars.
As search engines progressed and users were given more and more websites with valuable
information, the engines needed to respond with a refined approach to search.
The main idea for improving the relevance of search results was to better understand
the user's intent and expectations when he types a certain phrase into a search box. So
it seems that the next step in search engine improvement is tracking regular users
on the Internet. Collecting data on their activity might give useful information about
which websites are valuable to them.
The major engines such as Google, Yahoo and
Bing guard their search secrets closely, so one can never be absolutely certain how they
operate. But they are evolving, and personalization seems to be the wave of the
future.
 2. Link label or link title: the text in a hyperlink that is visible to and clickable by the user.



1.3. The Goal

It seems quite clear that search engines fitted with a personalization mechanism
would have two main benefits:
 1. improvement of search result relevance for a specific user,
 2. a decrease in the number of spam entries in SERPs.
Google, the leader in the search engine market, already has the first steps in this area behind it;
such information reaches us from the official company blog [13]. Moreover, Google already
holds several patents connected with the personalization mechanism. For this reason, this
thesis is mainly concerned with Google Search. However, the high level of competition in the
Internet market suggests that other popular search engines like Yahoo and Bing are
also being improved in this direction.
The goal of this thesis is to analyse the possible aspects of the personalization mechanism
in Google Search, on the basis of available information. Several factors will be taken
into account which can have an influence on the changeability of SERPs:
 • geolocation
 • language of the query
 • web search history
 • query complexity
 • search behaviour (e.g. bounce rates3, time of visits)
This will be the basis for several experiments which should determine how advanced the
current level of personalization in the considered search engine is. This research
also includes an analysis of the data used to describe users' search behaviour,
particularly the methods of collecting these data and their types.
The obtained results will be used to specify the potential impact on SEO and its
metrics, in particular:
 • the possibility of using search engine personalization to create new SEO techniques
 • the usability of a website's ranking in search results as a measure of success in SEO
   activity

 3. Bounce rate is a term used in website traffic analysis. It essentially represents the percentage of
    initial visitors to a site who "bounce" away to a different site, rather than continuing on to other
    pages within the same website.
Personalization is an opportunity for search engines to make spam less significant for
search results and to make SEO workers' lives harder, but only spam in the sense of
websites with content of no value to humans and irrelevant hyperlinks. Personalization
also opens the door to another kind of information noise: behavioural data spam. The
thesis presents the architecture of a distributed system generating artificial web traffic,
thereby imitating the search activity of a real user. However, using such a system can be
seen as unethical, so the thesis contains only a concept and a design. The author has no
intention of implementing such a system, but he tries to examine with available tools
whether building it would be feasible. In this way, possible harmful actions against which
search engines should be protected may be indicated.
After that, there is a short analysis of the known up-to-date information about
significant factors in web positioning. Together with the results of the personalization
research, this helped to prepare a collection of advice on how to build a website attractive
to search engines: a sort of guide for webmasters.
At the end of the thesis there is a short conclusion. It contains the author's thoughts about
future trends in search engines and SEO.
Chapter 2

Personalized Search

Pretschner [27] in 1999 wrote: With the exponentially growing amount of information
available on the Internet, the task of retrieving documents of interest has become in-
creasingly difficult. Search engines usually return more than 1,500 results per query,
yet out of the top twenty results, only one half turn out to be relevant to the user. One
reason for this is that Web queries are in general very short and give an incomplete
specification of individual users’ information needs.

To be more specific, Speretta [31] in 2005 wrote: [...] most common query length sub-
mitted to a search engine (32.6%) was only two words long and 77.2% of all queries
were three words long or less. These short queries are often ambiguous, providing little
information to a search engine on which to base its selection of the most relevant Web
pages among millions.

According to Wikipedia, by 2006 Google had indexed over 25 billion web pages and
1.3 billion images, handled 400 million queries per day, and archived over one billion
Usenet messages. The Internet grows very quickly. For this reason, search accuracy is a
crucial area for constant improvement in modern search engines. One of the major
solutions to meet this challenge is personalization.

Personalized search is simply an attempt to deliver more relevant and useful results
to the end user (searcher) and to minimize less useful results. The personalization mechanism
uses information about the user's past actions and behaviour to build his profile and
match relevant search results to this profile. It should provide a more useful set of results,
or a set of results with fewer irrelevant or spam entries. For this reason, personalized
search seems to be desirable to the end user.

Google puts it this way: Search algorithms that are designed to take your personal
preferences into account, including the things you search for and the sites you visit,
have better odds of delivering useful results [13]. The goal is simple: to reduce spam
and to deliver better results. This looks like a dangerous weapon against SEO workers,
who are major offenders in generating spam.

Official information [13], [25], [38] indicates that Google is the only major search engine
that has already introduced a personalization mechanism. The first personalized search
results appeared almost 5 years ago [13], and since then the mechanism has been
constantly evolving.



2.1. Operation Principles

Of course, the details of Google's search algorithms are not public. But it can be expected
that the main principles are based on ideas which can be found in the scientific
literature.
According to [31], personalization can be applied to search in two different ways:
 1. by providing tools that help users organize their own past searches, preferences,
    and visited URLs;
 2. by creating and maintaining sets of user's interests, stored in profiles, that can be
    used by the retrieval process of a search engine to provide better results.
His research proved that user profiles can be implicitly created out of the limited
amount of information available to the search engine itself. The profiles are built on
the basis of the user's interactions with a particular search engine. Google has applied
the second approach in its search engine, because it does not provide any additional
tools like toolbars or browser add-ons for personalizing search.
After [31]: In order to learn about a user, systems must collect personal information,
analyze it, and store the results of the analysis in a user profile. Information can be
collected from users in two ways: explicitly, for example asking for feedback such as
preferences or ratings; or implicitly, for example observing user behaviors such as the
time spent reading an on-line document.
Google Search does not provide any forms which let users specify their interests
and preferences, so to build a user profile this information must be collected in another
way. According to [31], user browsing histories are the most frequently used source
of information for creating interest profiles. But browsing history (such as that
presented in Figure 2.1) is not the only significant source.
For example, a user sends a search query and gets search results. He selects a specific
entry that seems interesting, clicks on it, and the website is saved in his browsing
history. However, the user quickly realizes that the selected website does not fit his
interests and goes back to the search results after a few seconds. Such a visit should be
qualified rather negatively. So not only browsing history, but also the user's behaviour
should be taken into consideration by the personalization mechanism.
Studying a series of searches from the same user may also offer a glimpse into modified
search behaviour. How does an individual change their queries after receiving
unsatisfactory results? Are search terms shortened, lengthened or combined with new
terms? There is much other information that a search engine might collect about a user
when a search is performed: location, language preferences indicated in the browser,
or the type of device being used (mobile phone, handheld or desktop). But how can
such behavioural data be collected by a search engine? The answer is in the next
section.

                       Figure 2.1. Google Web History panel



2.2. Methods for the Analysis of Behavioural Data

Search engine robots, hereafter crawlers [26], continuously gather information from
almost every website on the Internet. It is well known that Google collects an enormous
amount of data through this process. These data have the greatest significance for the
search engine algorithm, which is why classic web positioning methods are based on link
maintenance, mainly the acquisition of links.
Google processes these data and sorts websites according to their value to the user.
The user sends queries to the search engine and gets the appropriate SERPs. Because Google
knows what people search for, it is able to determine the popularity of specific
information on the Internet. But eventually, it is the user who decides which website
is valuable for him and which is not. The value of a particular website is reflected in
users' activity: which links have been clicked and the time between these actions.
This information is called behavioural data.
It is reasonable to make all these data useful for the search engine. Certainly, Google knows
that too. This is probably the reason why it collects an enormous amount of behavioural
data in addition to the data collected by crawlers. This kind of information is what this
study is most interested in.


2.2.1. Methods of behavioural data collecting

The entire web is based on the HTTP protocol, whose requests contain the following
information:
 • the IP address of the user making the request, which can be used for geolocation of
   this user,
 • the date and time of the request,
 • the language spoken by the user,
 • the operating system of the user,
 • the browser of the user,
 • the address of the website which, via a link, referred the user to the requested website.
These HTTP requests are used by Google in:
Click tracking – Google logs all of its users' clicks on all of its services,
Forms – Google logs every piece of information typed into every submitted form,
JavaScript execution – requests, and sometimes even more data, are sent when a
    user's browser executes a script embedded in a website,
Web Beacons – small (1 pixel by 1 pixel) transparent images embedded in websites, which
   cause a request to be sent every time a user's browser downloads such an image
   (a sketch of such a tracking endpoint follows this list),
Cookies – small pieces of text stored on the user's computer which let Google track
   users' movements around the web every time they visit a page that carries
   Google advertisement.
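To make the last two items more concrete, below is a minimal sketch of a tracking-pixel endpoint; Flask, the route name and the cookie handling are illustrative assumptions, not a description of Google's actual implementation.

import base64
from datetime import datetime, timezone

from flask import Flask, request, make_response

app = Flask(__name__)

# The classic 1x1 transparent GIF used as a "web beacon".
PIXEL = base64.b64decode("R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7")

@app.route("/beacon.gif")
def beacon():
    # All of this arrives with the HTTP request itself (cf. the list above).
    hit = {
        "time": datetime.now(timezone.utc).isoformat(),
        "ip": request.remote_addr,                          # geolocation source
        "language": request.headers.get("Accept-Language"),
        "user_agent": request.headers.get("User-Agent"),    # OS and browser
        "referer": request.headers.get("Referer"),          # page embedding the pixel
    }
    print(hit)  # a real system would persist this in a log or database

    response = make_response(PIXEL)
    response.headers["Content-Type"] = "image/gif"
    if "uid" not in request.cookies:
        # A long-lived cookie ties subsequent hits to the same browser.
        response.set_cookie("uid", "generated-unique-id", max_age=180 * 24 * 3600)
    return response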
But all these elements have to be placed on the websites being indexed by Google's crawlers.
Fortunately for Google, the company offers many services that are very useful for Internet
publishers. Because they are mostly free to use, webmasters gladly embed them in their websites.

Google Analytics

One of these attractive services is Google Analytics. It generates detailed statistics
about:
 • the visitors to the website,
 • the website the visitor came from,
 • the activity (navigation) of the visitor within the website.




                       Figure 2.2. Google Analytics main panel


It is the most popular and one of the most powerful tools for examining the web traffic on
a website, and it gives the owner a lot of useful information about its visitors.
Figure 2.2 shows a few features of this piece of software. The information which it
provides is certainly very interesting for Google itself. For this reason there are many
discussions among SEO workers about the possible disadvantages of using Google
Analytics in SEO campaigns: poor results for a particular website reported
through Google Analytics could prompt Google's search engine to decrease the value of this
website in the search ranking. But this is only unconfirmed speculation.

Google Toolbar

Another Google tool which provides even more valuable data is Google
Toolbar. It is a plug-in adding a few new features to popular browsers, mainly
quick access to Google's services. One of these features is checking the PageRank value
of the currently viewed webpage. This gives Google information about every website
that users with the Toolbar installed are viewing.

Google AdSense

There is also Google AdSense, a contextual advertising program for website publishers.
Millions of websites use this service to generate some financial profit for their authors.
The effect is that all these websites display ads published by Google's servers. This can
provide Google with similar information to Google Analytics and Google Toolbar.

Google Public DNS

The latest service launched by Google is Public DNS (Domain Name System) [23]. It
is said to be faster and more secure than other resolvers, and this is how Google encourages
us to start using it. It can generate a massive amount of information about web traffic,
since every single query to the DNS can be analyzed by its provider. So the more
popular their DNS becomes, the better for Google: it can provide a lot of information
helpful in determining the popularity of websites.
But because of DNS caching mechanisms [23], Google does not get all the desired
information. A DNS client sends a query only when the user wants to visit a domain for the
first time. After it gets the IP address of this domain from the DNS, it caches it for an
interval determined by a value called time to live (TTL). Visits during this
interval do not send any query. Consequently, Google still needs other services
to gather the desired information about the activity of a particular website's visitors.
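A minimal sketch of this client-side caching behaviour, with invented names and a fixed TTL purely for illustration, shows why only the first lookup within the TTL window reaches the resolver:

import time

class CachingResolver:
    def __init__(self, upstream_lookup, ttl_seconds=300):
        self.upstream_lookup = upstream_lookup  # e.g. a query to the public resolver
        self.ttl = ttl_seconds
        self.cache = {}  # domain -> (ip, expiry timestamp)

    def resolve(self, domain):
        now = time.time()
        cached = self.cache.get(domain)
        if cached and cached[1] > now:
            # Served locally: the DNS provider never sees this visit.
            return cached[0]
        ip = self.upstream_lookup(domain)   # only this call is visible upstream
        self.cache[domain] = (ip, now + self.ttl)
        return ip

# Usage: repeated visits within the TTL trigger a single upstream query.
resolver = CachingResolver(lambda domain: "93.184.216.34", ttl_seconds=300)
for _ in range(3):
    resolver.resolve("example.com")  # upstream_lookup runs only once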

Other Google Services

Google has other very popular services, for example YouTube and Google Maps (Fig. 2.3),
which allow users to embed objects like videos or maps on their own websites.
There is also Google Reader, which can indicate the popularity of particular websites
by counting their RSS1 subscribers.
There are many other ways for Google to gain useful data [9]. In fact, Google itself
admits to the use of all the described techniques in its privacy policy [12]. Most of these
data are probably used to improve the accuracy of its search engine and the quality
of its services.


2.2.2. Process of tracking user

The described services can be a great source of behavioural data; there is no doubt about that.
But the process of tracking a user's search activity would be incomplete without the data
provided by the search engine itself. The next few sections present what the tracking process looks like.
 1. RSS (most commonly expanded as Really Simple Syndication) is a family of web feed formats used
    to publish frequently updated works – such as blog entries, news headlines, audio, and video – in
    a standardized format




                       Figure 2.3. Google Maps example screen


Starting the session

When the user opens the search engine site (by typing the www.google.com address
in a browser), he sends an HTTP Request [37] to the server. This request contains the
IP address of the user's computer. Thanks to this information, the search engine is able
to relate subsequent search queries to particular users. Each of them is assigned a unique
session identifier, stored on the server. This is the beginning of the user's search session.
The identifier expires after a certain period of the user's inactivity. In this way the search
session is terminated.

Sending the search query

The view presented in Figure 2.4 should be familiar to every Internet surfer. This is the
place where the user can type his search query.
After the search query is sent, two things happen (a minimal sketch of both steps follows):
 1. The query is stored in a database and connected with the user's session identifier.
    The personalization mechanism then takes advantage of it.
 2. The query is analysed and used by the search algorithm to provide relevant search
    results to the user.
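Below is a minimal sketch of the server-side bookkeeping described above; the timeout value, identifier scheme and in-memory storage are illustrative assumptions.

import time
import uuid

SESSION_TIMEOUT = 30 * 60   # assumed: expire after 30 minutes of inactivity
sessions = {}               # session_id -> {"queries": [...], "last_seen": timestamp}

def get_or_create_session(session_id=None):
    """Return a live session identifier, starting a new session if needed."""
    now = time.time()
    session = sessions.get(session_id)
    if session and now - session["last_seen"] < SESSION_TIMEOUT:
        session["last_seen"] = now
        return session_id
    new_id = uuid.uuid4().hex           # unique identifier for the new session
    sessions[new_id] = {"queries": [], "last_seen": now}
    return new_id

def log_query(session_id, query):
    """Step 1 above: store the query and tie it to the user's session."""
    sessions[session_id]["queries"].append((time.time(), query))

sid = get_or_create_session()
log_query(sid, "query example")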




                       Figure 2.4. Google Search main screen


After that, the user receives an HTTP Response [37] with the search results as HTML
code.

Result selection

Figure 2.5 presents one of the results for the query "query example", with the
hyperlink highlighted.




                      Figure 2.5. Example of the search result


A click on the link normally sends an HTTP Request to the server the link leads to. So
in this case, it should be sent to:
http://www.wisegeek.com/what-is-query-by-example.htm
But when you look into the source code of the SERP, you will find a URL like this:
http://www.google.com/url?sa=t&source=web&cd=6&ved=0CDMQFjAF&
  url=http://www.wisegeek.com/what-is-query-by-example.htm&
  ei=CekOTLT4EpHu0gTYitWXDg&usg=AFQjCNE3t34-kSehUAK8TFNwh5CV9K-OWg&
  sig2=PdwrnqnhLhowpC8t5-06bw
The most important thing to notice is that the links in the SERPs lead to Google's server.
The chosen website still finally appears on the user's screen, because Google's server
performs a URL redirection (forwarding). This technique has the downside of a short
delay caused by the additional request to the search engine server.
However, in this way the search engine can log every user's click in the SERPs. What is more,
there are some additional data in the result's URL which probably provide some extra
information. For example, the value of the cd parameter is the position of the entry in the
current search ranking. Moreover, this URL can be seen by the target server, because
browsers place it into the HTTP Request data as the Referer field [37]. This fact is used by
software like Google Analytics to aggregate the traffic sources of websites equipped with Analytics
scripts. Thanks to this, the website owner can obtain:
 • the most popular search phrases that result in visits to his website,
 • the position of his website in the ranking for a particular search phrase and a particular user
   (it can vary due to the personalization mechanism).
Of course, the same information is taken into consideration by the search engine.
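As a small illustration, the sketch below extracts the chosen target and its ranking position from a redirect URL of the kind shown above; apart from url and cd, whose roles are described in the text, the remaining parameters are undocumented, and the logging step is an assumption.

from urllib.parse import urlparse, parse_qs

redirect_url = ("http://www.google.com/url?sa=t&source=web&cd=6"
                "&url=http://www.wisegeek.com/what-is-query-by-example.htm")

params = parse_qs(urlparse(redirect_url).query)
target = params["url"][0]        # the website the user actually chose
position = int(params["cd"][0])  # position of the clicked entry in the ranking

# On the search engine side, a click handler could log this selection and then
# answer with an HTTP 302 redirect to the target, as described above.
print({"target": target, "rank": position})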

Behavioural data extraction

According to much research [1], [6], [7], [10], [20] and [36], more than half of a website's
visits come from SERPs. In the case of e-commerce websites, this value is even higher,
because such services usually do not have regular visitors; they mostly come from
search engines (up to 90% of visits) or from ads appearing on other websites.
According to an analysis of real users' web traffic [29], a typical user spends about 2
hours per session and 5 minutes per page.
These statistics concern a website of good quality, relevant to the user's interests. A visit
to a website with poor content would be terminated after as little as a few seconds, a so
called bounce. Such a visit should indicate the irrelevance of the website selected by the user,
so it would be desirable for it not to appear in the search results for the particular phrase.
Not only what you select and interact with from a given set of search results (or the ads
served with them), but also what you do not select or have minimal interaction with
(bounce rates) can have an effect. These metrics can be used to create a better
probability model for future search result sets.
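A minimal sketch of how a logged visit could be classified from its dwell time is given below; the 3-minute threshold follows the visit-session diagram in Figure 2.6, while the field names and the hard cut-off are illustrative assumptions.

from dataclasses import dataclass

BOUNCE_THRESHOLD_SECONDS = 3 * 60   # threshold taken from the diagram in Fig. 2.6

@dataclass
class Visit:
    phrase: str           # the query that led to the click
    url: str              # the selected result
    dwell_seconds: float  # time until the user returned to the SERP or left

def classify(visit: Visit) -> str:
    """Return the signal the personalization mechanism would log for this visit."""
    if visit.dwell_seconds < BOUNCE_THRESHOLD_SECONDS:
        return "bounce"   # negative signal: the result was probably irrelevant
    return "visit"        # positive signal: the result satisfied the user

print(classify(Visit("query example", "http://www.wisegeek.com/...", 12.0)))  # bounce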
What is more, a mechanism based on cookies has recently been introduced in Google
Search. It makes it possible to learn the history of search queries from the last 180 days for
every user, including those not logged into any Google service. Officially, it is used to
personalize SERPs according to the past interests of the user. Google [13] says about this: Because
many people might search from a single computer, the browser cookie may be associated
with more than one person's search activity. For this reason, we don't provide a method
for viewing this signed-out search activity.
The diagram in Figure 2.6 shows the process of tracking a user who has not signed in
to a Google Account.

     Figure 2.6. Activity diagram of a visit session (a visit longer than 3 minutes is
     logged as a visit, a shorter one as a bounce)

This is what Google [14] says about personalized search for signed-out users:
When you search using Google, you get more relevant, useful search results, recom-
mendations, and other personalized features. By personalizing your results, we hope to
deliver you the most useful, relevant information on the Internet.
In the past, the only way to receive better results was to sign up for personalized search.
Now, you can get customized results whenever you use Google. Depending upon whether
or not you’re signed in to a Google Account when you search, the information we use
for customizing your experience will be different:

Signed-in personalization: When you’re signed in, Google personalizes your search
    experience based on your Web History. If you don’t want to receive personalized
    results while you’re signed in, you can turn off Web History and remove it from
    your Google Account. You can also view and remove individual items from your
    Web History.

Signed-out customization: When you’re not signed in, Google customizes your search
    experience based on past search information linked to your browser, using a cookie.
    Google stores up to 180 days of signed-out search activity linked to your browser’s
    cookie, including queries and results you click.

          Table 2.1. Information used by Google Search in personalization

                               Signed-in Personalized Search        Signed-out Personalized Search
  Place of data storage        Web History, linked to the           On Google's servers, linked to an
                               Google Account                       anonymous browser cookie
  Time interval of             Indefinitely, or until the user      Up to 180 days
  data storage                 removes it
  Searches used                Only signed-in search activity,      Only signed-out search activity
  to customize                 and only if the user has signed
                               up for Web History


2.3. Research

The goal of this section is to evaluate the current level of personalization based on
several factors. For this task, what Google [14] says about the types of results
customization should be helpful:

When you use Google to search, we try to provide the best possible results. To do that,
we sometimes customize your search results based on one or more factors:

Search history: Sometimes, we customize your search results based on your past
    search activity on Google, such as searches you’ve done or results you’ve clicked.
    If you’re signed in to your Google Account and have Web History enabled, these
    customizations are based on your Web History. If you’re signed in and don’t have
    Web History enabled, no search history customizations will be made. (Using Web
    History, you can control exactly what searches are stored and used to personalize
    your results. Learn about using Web History)

     If you aren’t signed in to a Google Account, your search results may be customized
     based on past search information linked to your browser using a cookie. Because
     many people might be searching on one computer, Google doesn’t show a list of
     previous search activity on this computer. Learn how to turn off these customiza-
     tions

Location: We try to use information about your location to customize your search
    results if there’s a reason to believe it’ll be helpful (for example, if you search for
    a restaurant chain, you may want to find the one near you). If you’re signed in
    to your Google Account, that customization may rely on a default location that
    you’ve previously specified (for example, in Google Maps). If you’re not signed
    in, the results may be customized for an approximate location based on your IP
    address.

     If you’d like Google to use a different location, you can sign in to or create a
     Google Account and provide a city or street address. Your specific location will be
     used not only for customizing search results, but also to improve your experience
     in Google Maps and other Google products.

2.3.1. Location


While you can search at google.com just about anywhere in the world, you can also
access Google at a number of different country-specific addresses, such as google.co.uk,
www.google.fr or www.google.co.in. In fact, Google automatically redirects you to the
proper domain by using your IP address to determine your geolocation. The browser's
language setting was left at its default (recommended) value in this experiment.

This experiment was performed from a single location in Poland. However, to simulate
requests from other locations, a software environment similar to the one described
in Section 5.6.1 was used. The phrase used, "jaguar", is multi-lingual, so the language of the
phrase does not affect the search results.

The first query was sent through three Tor hosts, where the exit host was located in
Los Angeles, California, United States. The result of the query is presented in Figure
2.7 (only the first several entries).
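A minimal sketch of sending such a query through Tor, so that the search engine sees the exit node's IP address, might look as follows; it assumes a local Tor client listening on port 9050 and the requests library installed with SOCKS support, and choosing a specific exit country is a Tor configuration detail not shown here.

import requests

TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",   # socks5h: DNS is also resolved via Tor
    "https": "socks5h://127.0.0.1:9050",
}

response = requests.get(
    "https://www.google.com/search",
    params={"q": "jaguar"},
    proxies=TOR_PROXIES,
    timeout=30,
)
print(response.url)          # Google may redirect to a country-specific domain
print(response.status_code)  # the HTML body would then be parsed for the entries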

All websites in this SERP are in English, which is the language spoken at the described
location. Moreover, near the bottom there are some places indicated on Google
Maps which are physically close to the location of the exit host.

The second query was sent via an exit host located in Erfurt, Thuringia, Germany.
Figure 2.8 presents the results of this query.

The Official Google Blog [13] states that the same query typed in different countries
may deserve completely different results. The presented results clearly show that those
words are true.

Unfortunately, the author was unable to check whether a search for the query "football"
provides different results in the US, the UK and Australia, where the term refers to
completely different sports, but it is quite possible. A preferred country might
include the country of the searcher as well as other countries that the searcher might find
acceptable, such as showing search results from the United States to people located in
Canada.



2.3.2. Phrase language


It is rather clear that the language of the searched phrase is significant for the results.
Search engines, despite personalization, still use the matching of phrases to the content of
indexed pages as the major factor in evaluating search relevance. For this reason,
phrases identical in semantic meaning but expressed in different languages are, in general,
completely different.

So serving search results with English pages about birds would be senseless if the
user typed the phrase "Vogel" into the search box, which means "bird" in German.




        Figure 2.7. Search results of the query sent via host located in USA


2.3.3. Search history

The most interesting factor said to have an influence on the personalization mechanism
in the Google search engine is the user's search history. Figure 2.9 shows search results which
were slightly modified by re-ranking based on search history. In the fifth position, right
after two video thumbnails, there is a link to a website which was visited 4 times
(the exact number of visits is visible on the right side of the hyperlink) and used by the author
of this study to gather information.
These fluctuations appear only when the user is signed in to a Google Account; otherwise
there is no access to the web history (Figure 2.1).




      Figure 2.8. Search results of the query sent via host located in Germany


This modification of the search results was not intentional on the author's part; the
phrase "personalized search" was not the object of the experiment. However, this result
shows that search history affects future search results in similar areas of information.

To compare the modified results with the original ones (without the impact of personalization),
there are two ways to disable results customization:

 1. signing out from the Google Account,

 2. using the "View customization" link available at the bottom of the results screen.

After using one of these options, we can check the original position of the visited website
in the ranking. In this particular case, the website holds 17th position in the results
with no customization, so the personalization re-ranking increased its position by
12 places.

Most importantly, this change places the visited website on the first
SERP of the search ranking. In most cases (more than 90% of searches) users do not
go beyond the first page of results, so such a change in ranking causes a huge increase
in the number of visitors arriving via this phrase.




           Figure 2.9. Search results personalized by user’s search history


Unfortunately, the author failed to force the search engine to re-rank search results
intentionally, so after this experiment an approximation of the re-ranking algorithm is
not possible.



2.4. Spam Issue

There is a huge amount of value in getting to the top of the search results, especially
for competitive phrases related to a business; this is a marketing area with, quite often,
millions of dollars in it. So spammers are highly motivated, because there
is a lot of money at stake. Unfortunately, regular users searching for valuable content
are the main victims of these practices.
One of the more interesting aspects of implicit/explicit user feedback in the search
personalization process is that it can be very effective in dealing with spam. The more
personalized the results, the smaller the chance that spam will appear in the search ranking.
In most cases, spammy websites do get clicked by users (who are tricked by a
link with false information about the target website), but after realizing the real value of
those websites, they quickly go away and do not come back.

Not only does this enable search engines to limit spam through personalization, it is also
a great source of query/click analysis for Google. It is worth considering the case where
the click data across multiple users shows that a given entry in a query space is rarely
clicked, or shows a high bounce rate; Google might use that signal as a dampening
factor for a spam result.
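A minimal sketch of that dampening idea is given below; the formula and weights are purely illustrative assumptions, not a known part of any ranking algorithm.

def dampened_score(base_score, impressions, clicks, bounces):
    ctr = clicks / impressions if impressions else 0.0
    bounce_rate = bounces / clicks if clicks else 1.0
    # Rarely clicked and frequently bounced-from entries look like spam.
    dampening = ctr * (1.0 - bounce_rate)
    return base_score * dampening

# A result shown 1000 times, clicked 20 times and bounced from 18 times
# keeps only a tiny fraction of its base score.
print(dampened_score(10.0, impressions=1000, clicks=20, bounces=18))  # 0.02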
Chapter 3

Impact of Personalized Search on
SEO

3.1. Metrics


For quite a long time, SEO workers used the position in the search ranking for particular
phrases as the indicator of web positioning success. An increase in the website's position has
always been a desirable consequence of SEO actions.

After the implementation of the personalization mechanism, the issue is not so simple.
Although customizations of rankings are still not very influential (only one entry in a whole
SERP), the highly visible benefits of personalization suggest that the impact will
grow. For this reason, ranking position can no longer be the major metric of success,
because the ranking position of a particular website can be different for
every user, especially for users who are regular visitors of this website.

Of course, this indicator is still measurable, because position monitors1 are not subject to
personalization (they do not use cookies or a Google Account). It can also give useful
information about the position seen by users searching for the concerned keywords for the first
time. But it is the increase in inbound traffic which has always been the main motivation
of SEO actions, so the major metrics of these actions should be closely related
to this motivation. For example, these are metrics for SEO in the time of personalized
search (a sketch computing a few of them from a visit log follows the list):

Number of unique visitors. A higher value indicates a good result of the advertising
   campaign and growing popularity among new customers.

Previous search queries. As an example: if the searcher has recently been searching
    for the term 'diabetes' and submits a query for 'organic food', the system attempts
    to learn from this and presents additional results relating to organic foods that are
    helpful in fighting diabetes.

 1. Software which automates monitoring of a website’s position in search rankings for given phrases

Previously presented results. Results that have been presented to the end user
    can be omitted in future results for a given period of time in exchange for other
    potentially viable results.

User query selection. Previously selected or preferred documents can be analysed, and
    similar documents or linking documents can be used to refine subsequent results.
    Furthermore, certain document types can be treated as preferred, in what would be
    a combination with Universal Search concepts. Commonly accessed websites can
    also be tagged as preferred locations for further weighting.

Selection and bounce rates (and user activity on website). An editorial scor-
    ing can be devised from the amount of time a user spends on a page, the amount
    of scrolling activity, what has been printed, or even what has been saved or book-
    marked. All can be used to further refine the ‘intent’ and ‘satisfaction’ with a
    given result that has been accessed.

Advertising activity. The advertisements clicked on can also begin to add to a
   clearer understanding of the end user's preferences and interests.

User preferences. The end user can also provide specific information as to personal
    interests or location-specific ranking prominence. This could also include favourite
    types of music or sports, inclusive of geographic preferences such as a favourite
    sport in a given city.

Historical user patterns. A person's surfing habits over a given period of time (e.g.
    6 months) can also play a role in defining what is more likely to be of interest
    to them in a given query result. More recent information (on the above factors) is
    likely to be weighted more heavily than older historical performance metrics within
    a set of results.

Past visited sites. Many of the above metrics, such as time spent and scrolling on a
    given web page, or historical patterns and preferred locations, can also be collected
    in a variety of ways (invasive or non-invasive). Cookies actually save resources for
    the search engine, an added benefit.
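As a small illustration of the traffic-oriented metrics mentioned above, the sketch below computes the number of unique visitors, the bounce rate and the average time on site from an analytics-style visit log; the log format and values are illustrative assumptions.

visits = [
    # (visitor_id, entry_phrase, seconds_on_site, pages_viewed)
    ("u1", "organic food", 300, 4),
    ("u2", "organic food", 10, 1),    # a bounce
    ("u1", "diabetes diet", 100, 2),
    ("u3", "organic food", 10, 1),    # a bounce
]

unique_visitors = len({visitor for visitor, _, _, _ in visits})
bounces = [v for v in visits if v[3] == 1]            # single-page visits
bounce_rate = len(bounces) / len(visits)
average_time_on_site = sum(v[2] for v in visits) / len(visits)

print(unique_visitors, bounce_rate, average_time_on_site)   # 3 0.5 105.0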

Advice on how to improve the values of such metrics is presented in the next chapter.

A higher position in the rankings does not always imply more visitors. Moreover, there is no
significant difference between positions 6 and 10. Very often, proper optimization of the
page's title and description visible in a SERP is more important and
brings more visitors than a higher position. Better website titles and meta descriptions
would have an advantage, as getting the user to engage with the SERP listing upon
initial presentation would be at a premium. Quality content as well would begin to take
on a more meaningful role than it has in the past, as bounce rates and user satisfaction
now start to play into the actual search result rankings.

3.1.1. Areas for Consideration

The author’s experience in commercial SEO, which is closely related to marketing, is
rather limited. Thus, this section is based on [16].

Demographics

Care should be taken to leverage any obvious demographics that may apply to your site.
Whether it is geographic, topical (sports, politics) or even a given age group, targeting it
effectively is important, because the ‘topical’ nature of personalized search can group
results prior to even ranking them. If a particular website is not clear in each of these
areas, it risks a lower weighting within tighter demographic starting document sets. Even
your off-site activities (link building, Social Media Marketing etc.) should be as tightly
targeted as possible.

Relevance profile

Of particular interest is potential categorization in terms of topical relevance. Ensuring
that your site provides a strong relevance trail would be particularly valuable.
Much like phrase based indexing and retrieval concepts, probabilities play a large role.
When refining results the search engine looks at related probable matches. Through
a concerted effort at on-site and off-site relevance strengthening, you increase the
odds of making it into a given set of results in a world of ‘flux’. It never hurts to review
the concepts surrounding Phrase Based Indexing and Retrieval, as many of the related
patents addressed deriving concepts/topics from phrases.
One would also have to imagine that tightening up the relevance profile in your Social
Media Marketing efforts would also be beneficial to a tighter topical link profile. Fur-
thermore, many topically targeted visitors that enter a site may bookmark your site (or
it may be collected passively), which adds to the organic search profile without ever being
included in a search result. As such, there are many exterior opportunities to be had beyond
the traditional off-site SEO.

Keyword Targeting

Building out from your core terms will be important as far as understanding search
behaviour. The long tail as we know it would be targeted towards potential query
refinements by a given subset of searcher types. Building out logical phrase extensions
and potential query refinements would be something to look at. Furthermore, with
changeable personalized ranks we would measure SEO success in actual traffic and
conversions, which puts term targeting in a new light as far as nailing money terms
and having a cohesive plan that targets query-refinement long-tail opportunities.

Quality Content

In considering the value of a website, user interaction becomes a consideration as far
as bounce rates, time spent on page and scrolling activities are concerned. Producing
compelling and resourceful content would be at a premium to best leverage these
tendencies of the system. If a searcher has selected and interacted with your site on
multiple occasions, your site would be given weight in their personal rankings as well
as for related topical and searcher types. The more effective a resource, the greater the
ranking weight increase.

Search result conversion

Working with the page title, meta-description and snippets takes on a more important
role in your SEO efforts when adjusting for personalized search. It could even be said that
using analytics and a form of split testing would be a great advantage as far as finding
what not only ranks, but converts.

Freshness

Another area which may be important is document freshness: people could set default
date ranges, or the system could passively begin to see a pattern of a user accessing
more current content. A valuable website that has been ranking well for a year may no
longer be getting all the traffic it is used to. In such a case it is worth updating such
pages with fresh information, or creating new related pages and passing the flow to them
via internal links. Depending on the nature of the content (searcher group profile), more
current content may be more popular over the larger data set and thus newer content
would be weighted more overall.

Site Usability

From a crawler or the end user perspective, having logical architecture and a quality
end user experience is also at a premium. If similar searcher types embark on similar
pathways and related actions (bookmark, print, navigate, and subscribe to RSS) then
this will give greater value to those target pages within that community of search types.
This also furthers the relevance profile.

Analytics

It can be noticed that there is a strong need for the use of analytics in understanding
traffic flows, common pathways, bottlenecks, the paths to conversion, and much more.
This data will be of immeasurable use in dealing with many of the factors that can affect
Personalized Web Positioning. This issue is closely connected with psychology
(particularly behavioural targeting).
Chapter 4

SEO Guide


This chapter presents a set of areas for consideration during the process of website opti-
mization. The prepared advice concerns only areas which may have a positive effect on
the popularity of the website. It should be helpful in achieving a higher position in
search rankings, which should increase the number of visitors. It should also make the
website more attractive for users. This will probably decrease the bounce rate, which
has a negative impact on the website in the personalization re-ranking process.


The areas covered in this guide do not take into account any SEO techniques connected
with external actions, i.e. those that require contact with other websites, such as:


 • linking (the acquisition of links), free or paid


 • advertising


 • presell pages1


The listed methods are closely connected with generating spam. Due to this, they reduce
the proportion of quality content on the Internet, so Internet surfers gain no benefit
from them.


This chapter is based mainly on the information from [13], [15], [21] and [36].



 1. Presell page – a page created only for SEO purposes. Text on such a page is only a surrounding
    for a link leading to the positioned website. The content has no value for a human reader; it is
    only prepared to look natural to crawlers, so as not to be filtered as spam.

4.1. Website Presentation in Google Search

4.1.1. Title

The title is the first piece of information about a particular website in the SERP. It is
also one of the main factors with an impact on the website’s ranking. An example of such
a title in HTML code looks like this:
<title>Jaguars, Jaguar Pictures, Jaguar Facts -
National Geographic</title>
Such a title presented in the SERP looks as shown in figure 4.1.




             Figure 4.1. Presentation of a website title in Google Search


These are the issues connected with the website title which are significant in SEO:

Length up to about 65 characters

Longer titles can also be indexed by the crawler, but a title of about 65 characters is
rather optimal and fits entirely in the SERP. Longer titles are shortened with an ellipsis.

Diversity of titles

Each of the website’s pages presents slightly different information (e.g. a product page,
a contact form etc.). The title should be prepared individually for each of them.

Keywords

There are 3 principles related to creating a title:
 1. Keywords should be distributed across all the pages. Each page should be
    optimized for only 3–4 keywords. The front page title should contain the most general
    expression, titles of product pages should contain words characterizing the type of
    these products, etc. Sticking to this rule is very important, because otherwise
    the pages of the website could be treated by the crawler as duplicated content.
 2. The most important keywords should be placed at the beginning of the title.
 3. Google can combine keywords from the title into different phrases, but those which
    appear one after another have the greatest impact on the position in the ranking.
    Due to this fact, a key phrase should not be separated.


4.1.2. Description

The description is the second piece of information about a particular website, presented
right after the title in the SERP. Such a description presented in the SERP looks as shown
in figure 4.2.




         Figure 4.2. Presentation of a website description in Google Search


The description presented in the SERP can be generated from the following sources:
 • description metatag, for example:
     <meta name="description" content="Learn all you wanted to know
     about jaguars with pictures, videos, photos, facts,
     and news from National Geographic." />
 • a fragment of the website content (in case the description metatag is too long or
   there is none in the source code)
Here are some tips on the page description in the metatag:

Length up to about 150 characters

Longer descriptions will not be presented in the SERP as they were written.

Diversity of descriptions

Just like titles, the description of a particular page should be slightly different from the
others. It should be specific to the information presented on the page.

Keywords

The description should contain the keywords targeted by the SEO strategy. When it does,
the keywords will be bolded in the search results for a query phrase based on such keywords.
This should draw users’ attention to our website. At the same time, the description should
be prepared in a way that encourages users to visit the website.

4.1.3. Sitelinks

Sitelinks are links leading to other pages of the same website. They can be presented in
SERPs in 2 ways:
 1. Horizontally – 4 links in 1 row (presented in figure 4.3)
 2. Vertically – 8 links in 2 columns




           Figure 4.3. Presentation of a website sitelinks in Google Search

There is no manual way for publishers to force sitelinks to be presented in the SERP. It
depends on how the website was indexed, but the crawler’s job can be made easier.
There are two things which can be done:
 1. First of all, the source code related to navigation on our website must be well
    designed. Its syntax must be very clear.
 2. Prepare a sitemap of the website (e.g. in XML format). This issue will be described
    later.


4.2. Website Content

4.2.1. Unique content

The basis of proper content optimization is its uniqueness. This means that the
same text, or larger fragments of it, should not be reproduced on other websites or on
different pages of our website.
In order to verify the degree of uniqueness of our content, the following tool can be used:
http://www.copyscape.com


4.2.2. Keywords

It is very important to make search engines able to relate our website to a specific theme
and keywords. In order to make this possible, keywords must be considered not only in
the process of designing the website title and description; they must also be contained in
the website content.
In preparing the text for the website it is suggested to stick to the following principles:

Repetition

Keywords should be repeated several times on every page. But it cannot be forgotten
that the text should be written primarily for users. The task is to find a compromise
between text that is attractive for users and good for SEO. Too high a density of keywords
on a particular page can be treated by the search engine as an abuse. In such a situation
our website will be penalized by exclusion from the ranking.

Variations and synonyms

The website content will be more natural if the contained keywords are used in many
(grammatical) variations. Modern search engines are proficient enough to also detect
the use of synonyms. For this reason we can, for example, use the word "drug" in content
being optimized for the "medicine" keyword.

Location

Keywords should be located across the whole page with similar density. This will give a
better result in positioning than an accumulation of keywords, for example, only at the
beginning of the page.



4.3. Source Code

A website’s source code has no direct influence on the position in the search ranking. How-
ever, some errors can cause problems with proper indexing by search engine robots. For
this reason it is worth ensuring that the code contains no errors and is compatible
with current WWW standards.
The code validation tool provided by the World Wide Web Consortium (W3C) is very
useful. It can be found here:
http://validator.w3.org


4.3.1. Headers

HTML header tags (h1–h6) are very significant for proper indexation of the website
content. Their right usage is very important in the design of a website. There are a couple
of issues which must be considered from the SEO point of view.

Hierarchy

Header tags are designed to separate particular sections of a document. They must
be used in the correct order and only when there is a need for them.

Repetition

According to current HTML standards, the header of the first degree (h1 tag) may occur
only once in the whole document. Other headers can be used repeatedly.

Keywords

It is suggested to put keywords into header tags, because they have more "positioning
power" than regular text. This power probably corresponds to the header hierarchy, so
the most important keywords should be placed in the h1 tag.
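For illustration only (the keyword and headings are hypothetical, not taken from the thesis),
a page optimized for the phrase "organic food" could use a header hierarchy like this:
<h1>Organic food</h1>
<h2>Why organic food is worth its price</h2>
<h2>Where to buy organic food</h2>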


4.3.2. Highlights

Keywords can be distinguished from the rest of the text by using the <strong> (bold)
and <em> (italics) tags. In this way, keywords are highlighted both for users and crawlers.
However, it should be done with restraint: not every occurrence of a keyword should
be highlighted, but only the most important ones.
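As an illustration (hypothetical content, following the same example keyword as above),
such highlighting could look like this:
<p>Our shop offers fresh <strong>organic food</strong> delivered daily,
including <em>organic vegetables</em> from local farms.</p>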


4.3.3. Alternative texts

Sometimes there are images placed in the document. It is recommended to include
alternative texts for those images. It can be done in this way:
<img src="path/to/image" alt="alternative text" />
The alternative text is displayed on the screen in case the browser cannot display images
(e.g. when they are unavailable on the server).
These alternative texts are also interpreted by search engine robots. This data is then
used in image search (when the search engine has such an option).


4.3.4. Layout

A well-indexed website should have a clear and minimalistic layout. The content is the
most important factor, so even the ratio of the amount of text to HTML code is significant.
The higher this value, the better and more valuable the website from the search engine’s
point of view.
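The ratio itself can be estimated in a few lines. The following is a minimal sketch (Python
with its standard library only; it is not part of the thesis) which strips markup with the
built-in HTML parser and reports the share of visible text in the raw document:

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    # Collects visible text, skipping the contents of script and style tags.
    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)

def text_to_code_ratio(html):
    parser = TextExtractor()
    parser.feed(html)
    text = "".join(parser.chunks).strip()
    return len(text) / len(html) if html else 0.0

# Example: ratio = text_to_code_ratio(open("index.html", encoding="utf-8").read())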



4.4. Internal Linking

Quite an important issue in website optimization is internal linking. An internal link is a
hyperlink which leads to another page of the same website. There are some recommen-
dations connected with this link type.

4.4.1. Distribution

Each of the pages should be available in 3 or 4 clicks at most. If not, the website
navigation must be re-designed. Attention should be given especially to the links on
the main page. The structure of the website must be clear.
What is more, this is also very important for the usability of the website. Complicated
navigation can discourage the user from continuing the visit.
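The "3 or 4 clicks" rule can be checked automatically. Below is a minimal sketch (Python;
not from the thesis, and the link graph is a hypothetical input that would normally come
from a crawl of the site) which computes each page’s click depth from the main page with
a breadth-first search:

from collections import deque

def click_depths(link_graph, start="/"):
    # link_graph maps a page URL to the list of internal URLs it links to
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in link_graph.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

# Pages deeper than 4 clicks (or missing from the result, i.e. unreachable)
# are candidates for better internal linking:
graph = {"/": ["/products", "/contact"], "/products": ["/products/item-5"]}
print({page: depth for page, depth in click_depths(graph).items() if depth > 4})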


4.4.2. Links anchors

A link anchor is the clickable text. It is displayed to the user on a website instead of the
plain URL, which is rather unreadable for a human. It looks like this:
<a href="some_url.html">Anchor text</a>
Anchors should describe the content of the pages their links lead to. If links are
located among other text, the anchor should match the context of the whole text. For
example, it is not advised to write "click here", as was popular a couple of years ago.


4.4.3. Broken links

A very important thing in website positioning is to beware of links which lead to unavail-
able URLs. Such an issue is very annoying and discouraging for visitors.
A website with broken links will also be less valuable for search engines, because
robots crawl the web using links. After indexing a page, the robot uses one of the links
placed on this page to go to another page. When such a link is broken, the crawler can
interrupt the indexing process. This will cause a situation where not every page of the
website is indexed.
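Broken links can be detected before the crawler finds them. The following is a minimal
sketch (Python standard library only; not part of the thesis) which checks a list of URLs
and reports those answering with an error:

from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def find_broken_links(urls):
    broken = []
    for url in urls:
        try:
            urlopen(Request(url, method="HEAD"), timeout=10)
        except HTTPError as error:   # the server answered, but with a 4xx/5xx status
            broken.append((url, error.code))
        except URLError:             # DNS failure, refused connection, timeout etc.
            broken.append((url, None))
    return broken

# Example: print(find_broken_links(["http://www.example.com/", "http://www.example.com/old-page"]))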


4.4.4. Nofollow attribute

Nofollow is an HTML attribute value used to instruct some search engines that a
hyperlink should not influence the link target’s ranking in the search engine’s index.
This is an example of such a hyperlink:
<a href="some_url" rel="nofollow">Some website</a>
It is intended to reduce the effectiveness of certain types of search engine spam, thereby
improving the quality of search engine results and preventing a particular website from
being indexed as spam. The nofollow attribute is commonly used in outbound links2 , for
example in paid advertising.

 2. Links which point to other websites

4.4.5. Sitemap

A sitemap is a list of pages of a website accessible to crawlers or users. This helps
visitors and search engine bots find pages on the website.

Sitemap for users

A page can be prepared containing links leading to all the website’s pages, or only to the
most important ones. Thanks to this, users having problems with navigation will be able
to quickly find what they are looking for. An example of such a sitemap located in the
footer is presented in figure 4.4.




                     Figure 4.4. Example of sitemap for visitor


Sitemap for robots

A sitemap for crawlers must be easy to process automatically. Such a sitemap is mostly
prepared in the XML document format. This is what an example looks like:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-01</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>
   <url>
      <loc>http://www.example.com/catalog?item=12</loc>
      <changefreq>weekly</changefreq>
   </url>
   <url>
      <loc>http://www.example.com/catalog?item=73</loc>
      <lastmod>2004-12-23</lastmod>

      <changefreq>weekly</changefreq>
   </url>
   <url>
      <loc>http://www.example.com/catalog?item=74</loc>
      <lastmod>2004-12-23T18:00:15+00:00</lastmod>
      <priority>0.3</priority>
   </url>
   <url>
      <loc>http://www.example.com/catalog?item=83</loc>
      <lastmod>2004-11-23</lastmod>
   </url>
</urlset>
As can be noticed, such a document contains some information about each link:
loc: URL of the particular page
lastmod: time of the last modification of the page
changefreq: average period of time between changes to the page
priority: priority value for the crawler to index the particular page
Such information is welcome by crawlers. It can benefit the publisher with faster in-
dexing by the crawler.
In most cases such documents are prepared using software tools like:
http://www.xml-sitemaps.com/
When the sitemap document is prepared, the search engine must be notified about
its existence via a special form.



4.5. Addresses and Redirects

Besides the previously described factors used in the ranking algorithm, search engines also
consider the form of the indexed website’s URLs and the information included in HTTP
responses.


4.5.1. Friendly addresses

Search engines give a higher rank value to those websites whose pages have URLs that are
more readable for a human. For example, an address like this:
http://www.example.com/index.php?page=product&num=5
can be written in this way:
http://www.example.com/product/5
Such an effect can be achieved using mod_rewrite. It is a module for the Apache Server
which allows creating regular expression patterns for mapping URLs to particular pages.
Modern web frameworks, such as the Django Framework or Ruby on Rails, also offer such
a possibility.
Moreover, this gives another opportunity for placing keywords. Due to this fact, a page
being optimized for a particular keyword should have this keyword in its URL. If it is a
phrase with a couple of words, it is suggested to separate them with a dash.
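For illustration only (assuming an Apache server with mod_rewrite enabled and the example
addresses above), such a mapping could be defined in the .htaccess file like this:
RewriteEngine On
RewriteRule ^product/([0-9]+)$ index.php?page=product&num=$1 [L,QSA]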


4.5.2. Redirect 301

A 301 redirect is a permanent redirect from one address to another. After using it:
 • visitors typing the old address into the browser’s address bar will be redirected
   to the new one
 • some search engines will replace the old address in their database with the new one
So it is very useful after a domain change.
Earlier the author said that the content of the website should not be duplicated. It is often
forgotten that allowing entry to a website through several addresses has the same
result. Sometimes the same website is served via:
 • example.com
 • www.example.com
 • example.com/index.html
 • www.example.com/index.html
 • example.com/default
 • www.example.com/default
In such a case it must be decided whether the main address of the website will have the
"www" prefix. If it will, an .htaccess file with the following content should be placed in
the main folder of the server:
such content:
RewriteCond %{HTTP_HOST} ^example.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
Similarly we can manage the redirection from the index.html file:
RewriteCond %{REQUEST_FILENAME} index.html
RewriteRule ^(.*)$ http://www.example.com [R=301,L]
There are also many other possibilities which can be managed likewise.



4.6. Other Issues

There are a couple of other things which have some influence on the website’s ranking.

4.6.1. Information for robots

Sometimes publishers do not want robots to index some of the website’s pages, but want
to keep them available for regular visitors. For example:
 • results from internal search engines
 • data sort results
 • print version of pages
 • pages which should not be indexed, like login page to administration panel
To manage this issue, a robots.txt file can be prepared with content like this:
User-agent: *
Disallow: /admin-panel/
It should prevent robots from indexing pages whose URLs start with www.example.com/admin-panel/.


4.6.2. Performance

One of the most recent factors introduced into the Google search engine algorithm is the
website performance value. Google promotes websites with a short download time. It is
not as significant as internal linking or the quality of content. Big information portals
are very complex, so they cannot be downloaded as fast as e.g. a small blog. But valuable
information is the most important factor.
However, good performance can increase the rank value of the website compared to
another one with similar content but not so efficient.
To improve website performance, several tools can be used, like PageSpeed by Google:
http://code.google.com/speed/page-speed
It provides an analysis of the website’s download efficiency and gives some tips on
how to improve the performance. Next, the most common suggestions given by this
software are presented.

Gzip compression

Modern web browsers allow the use of the gzip compression mechanism to reduce the size of
website files (images, CSS files, Javascript files). If there is such a possibility, it is recom-
mended to use it.
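As an example (assuming an Apache server with the mod_deflate module available; this
fragment is not part of the thesis), compression of the most common text resources can be
enabled with a configuration like this:
<IfModule mod_deflate.c>
    AddOutputFilterByType DEFLATE text/html text/css application/javascript
</IfModule>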

Number of DNS lookups

The DNS caching mechanism [23] means there is no need to look up the IP address matching
a particular domain several times. For this reason, when every file used by the website
is located on the same server (or another server in the same domain), there is only one
DNS lookup during the download process.

So placing media files (images, CSS, Javascript) on a different domain without a clear
need should be avoided.

External files

Information commonly located in external files, like CSS style sheets or Javascript,
can also be placed inside the HTML document. However, it should be avoided, because
it makes parsing the source code more complex. Thus the browser needs more
time to display the website on the screen.



4.7. Summary

The tips presented in this chapter should significantly increase the rank value of every
website. With a higher ranking in the search engine, there will be more inbound
traffic. In other words, the website will gain more popularity.
Making the website more attractive to visitors should imply better results of the per-
sonalization re-ranking. The assumptions about the impact of personalized search results
on the global ranking are very plausible. So improving the quality of a website on the basis
of the presented guide should increase the website’s ranking in general, both through the
personalization impact and through collecting inbound links as an effect of the increase in
popularity.
Chapter 5

The System for Global
Personalization

The goal of this chapter is to propose a method for improving the website’s search
ranking through affecting the personalization mechanism in search engine. The idea of
this method is to generate artificial behavioural data. The author of this thesis is the
co-author of the article [19], which this chapter is based on.
In section 2.2 of this thesis it has been shown that a lot of data goes into Google
and a lot of useful manipulated data comes out. But we can only guess what happens
in between or try to learn from the observation of the data coming out of Google.
Evans wrote [10] that identifying the factors involved in a search engine ranking algo-
rithm is extremely difficult without a large dataset of millions of SERPs and extremely
sophisticated data-mining techniques.
That is why observation, experience and common sense are the main sources
of knowledge on Search Engine Optimization (SEO) methods. It was according to this
knowledge that the Search Engine Ranking Factors [11] were created. The last edition of it
assumes that traffic generated by the visitors of a website has 7% importance in
Google’s evaluation of the website’s value. It is, after links to the specific website and its
content value, the most significant factor in the website evaluation process. On the basis of
the previous editions of the ranking, one can notice that the importance of this factor
is increasing.
Because all of these are only reasonable assumptions, the intention is to evaluate the
validity of the described factor in web positioning efforts. For this purpose we
need a simulation tool which will generate the necessary human-like traffic on a tested
website. The tool is going to be a multi-agent system (MAS) which will imitate real
visitors of websites.



5.1. Problems to Solve

Fig. 5.1 presents the main reason why the system must be distributed. A few queries
to Google, sent frequently one after another from the same IP address, are




        Figure 5.1. Information displayed by Google on the abuse detection


detected by Google and treated as abuse. Google suspects automated activity and
requires completion of the captcha form in order to continue searching. If a distributed
system were used, the queries would be sent from many different IP addresses. This
should guarantee that Google will not consider this activity abusive. The issue cannot
be solved by using a set of public proxy servers: Google has probably put them on its
black list, and every single query to Google via such a proxy server leads to the same
end – a captcha request.
What is more, after Tuzhilin [35] we can say that Google puts a lot of reasonable
effort into filtering invalid clicks on advertisements. There is a big chance that Google
uses some of those mechanisms in the analysis of web traffic. This is the reason why the
way behavioural data is generated should be our concern. Artificial web traffic that is
recognized could be treated by Google as an abuse and cause punishment (a decline of
the website’s position).



5.2. Objectives

5.2.1. Web Positioning

The main goal of the system is to improve the position of a website by generating traffic
related to the website. The system should only care about activity which can be visible
to Google. This means there is no need to download all the content from a particular
website; it would only waste bandwidth. The system should only send requests to the
Google services used by a particular website, for example:
 • links to the website on SERPs,
 • Google Analytics scripts,
 • Google Public DNS queries,

 • Google media embedded on the website like AdSense advertisement, maps, YouTube
   videos, calendars etc.



5.2.2. Cooperation

The whole idea of the system is to spread positioning traffic across worldwide IP ad-
dresses. As a result of this distributed character, the system requires a large group of
cooperating users. Nobody will use the system if there are no benefits for them. A mech-
anism must be introduced which lets the system’s users share their Internet connections in
order to help each other in web positioning. What is more, the mechanism must treat all
users equally and fairly. This means it should not allow taking benefits without any
contribution.



5.2.3. Control

According to [36], web positioning is not a single action, but a process. It must be
possible to control this process; otherwise it could be destructive, instead of improving
the website’s position. For this reason, the system should allow users to:

 • control the impact of the system activity on their websites,

 • check the current results of the system activity (changes in the website position
   on SERPs),

 • check the current state of the website in the web positioning process.




5.3. Architecture

Fig. 5.2 presents the architecture of the system, which takes into consideration all the
specified problems and objectives. The server is necessary to control the whole process of
generating web traffic according to a specified algorithm. It gives orders to the clients to
start generating traffic on specified websites. It also receives information from the clients
about the number of requests sent to particular Google services on the website’s account.
The database serves as storage for process statistics. These can be presented to clients via
a web interface. They are also useful to the server for creating orders in accordance with
the algorithm.

The clients are the agents of the presented MAS. They take orders from the server with a
particular website registered in the database to be processed. Processing the website means
mimicking its real visitor. The client performs this autonomously using the visitor session
algorithm described in the next section.

Figure 5.2. System architecture (the Server and Database coordinate a cloud of Clients,
which interact with Google Search, the processed Websites and all other Google services)


5.4. Visitor Session Algorithm

According to much research [1], [6], [7], [10], [20] and [36], more than half of a web-
site’s visits come from the SERPs. That is why starting a single visitor session (the
sequence of requests concerning a single website registered in the system) by querying
Google Search sounds reasonable; however, only if the currently considered website appears
on one of the first few SERPs. Otherwise, the visitor session should be started directly on
the processed website, or should come via an existing incoming link, if there is one. Because
of Google’s likely actions to detect abuse, the visitor session should be as human-like as
possible. The analysis of real users’ web traffic [29]
is very useful here. According to it, a typical user:
 • visits about 22 pages in 5 websites in one sitting,
 • follows 5 links before jumping to a new website,
 • spends about 2 hours per session and 5 minutes per page.
These statistics clearly indicate that a typical visit session concerns a website of good
quality. A visit to a poor website would be aborted after as little as a few seconds. Such a
visit could have a negative impact on the website’s quality evaluation by Google.
The algorithm, illustrated in figure 5.3, proceeds as follows:
1 – The server retrieves from the database information about the next website to be pro-
     cessed.
2 – The task is assigned to a client.
3 – The client starts the visitor session.


 Database                                                                         6
             1                              4
                                                                  Visit session
             2                3                      5


                                        Google                       Website      All Google
                                        Search                                     Services
                 Client
  Server



                                    7

                          Figure 5.3. Visitor session algorithm


4 – Searching the SERPs for a link to the processed website.
5 – If a link has been found – click on the link; otherwise a direct request.
6 – Processing the visit session.
7 – Request to the server for another website to process.
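The steps above can be summarized in a short sketch (Python; this is an illustration only,
not the thesis implementation – get_task, search_serp_for, visit and report are hypothetical
callables standing in for the real client and server actions):

def run_client(get_task, search_serp_for, visit, report, max_serp_pages=3):
    while True:
        website = get_task()                              # steps 1-2: the server assigns the next website
        if website is None:
            break
        link = search_serp_for(website["phrase"], website["url"],
                               pages=max_serp_pages)      # step 4: look for the site on the first SERPs
        if link is not None:
            session = visit(link, came_from_serp=True)    # steps 5-6: click-through and human-like browsing
        else:
            session = visit(website["url"], came_from_serp=False)   # direct request instead
        report(website, session)                          # step 7: send statistics back to the server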



5.5. Task Assignation Algorithm

The task assignation algorithm helps the server to build a queue of registered websites
ordered by visitor session priority. The website with the highest priority value is the
next one to start a visitor session. In other words, a client always receives the website with
the highest priority value to process.
The priority value P V is calculated using the function:


                              P V (α) = r(α) · t(α) · v(α) / T (α)                         (5.1)

where
    α — record in the system (website with phrase for web positioning)
r(α) — returns current position in the search engine ranking for the α (returns
        0 if there is no α in the ranking)
 t(α) — returns time since the end of the last visitor session on the α (in
        seconds)
v(α) — returns number of visitor sessions made by α owner’s client
T (α) — returns time since the registration of α in the system (in days)
Chapter 5. The System for Global Personalization                                    42

The presented function gives the greatest "power" to the ranking factor. The reason for this
is that websites with a high ranking value should already have more real visitors, so the
system’s efforts will not be so crucial for their popularity.
The time of participation in the system is not very significant. Novice participants have an
equal chance to gain attention for their websites as the senior ones; however, the function
promotes continuous activity of the clients.
Also worth considering is the possibility of dynamically modifying the weights of individual
factors depending on the results. Because the websites queue is built by the server, it is
possible to change the whole function during the system’s activity.
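A minimal sketch of this assignment rule (Python; illustrative only, with hypothetical field
names for the record’s attributes) computes P V (α) for every registered record and returns
the one with the highest priority:

def priority(record):
    r = record["ranking_position"]               # r(α): 0 if the site is absent from the ranking
    t = record["seconds_since_last_session"]     # t(α)
    v = record["owner_sessions_performed"]       # v(α)
    T = record["days_since_registration"]        # T(α)
    return r * t * v / T if T else 0.0

def next_website(records):
    # The client always receives the record with the highest priority value.
    return max(records, key=priority, default=None)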



5.6. Proof Study

The presented system requires a large number of users to work properly. Otherwise, the
generated traffic would not be distributed enough and thus would look unnatural. As was
shown, centralized series of queries are seen as abuse. Unfortunately, the thesis
author’s resources were insufficient for this purpose. However, a simulation has
been performed which was intended to prove the proposed concept.


5.6.1. Tools

The idea was to use the Tor application (http://www.torproject.org) to make a single
host (the author’s computer) generate distributed traffic. In this way, the behavioural
data of one real user could be seen by the search engine as multi-user traffic.
Tor is free software enabling Internet anonymity by thwarting network traffic analysis.
Tor aims to conceal its users’ identity and their network activity from traffic analysis.
Operators of the system run an overlay network of onion routers which provides
anonymity in network location as well as anonymous hidden services.
Users of the Tor network run an onion proxy on their machine. The Tor software peri-
odically negotiates a virtual circuit through the Tor network. An application like a browser
may be pointed at Tor, which then multiplexes the traffic through a Tor virtual circuit.
Once inside the Tor network, the encrypted traffic is sent from one host to another,
ultimately reaching an exit node at which point the decrypted packet is available and
is forwarded on to its original destination. Viewed from the destination, the source of
the traffic appears to be the Tor exit node.
As figure 5.4 shows, Tor has become quite popular, so its network involves a large
number of users. This makes Tor fit the objective of this study. The Mozilla Firefox
browser was used, connected to Tor. Additionally, the iMacros plug-in was installed
in order to automate the execution of visitor sessions.
For the analysis of the behavioural data received by Google during the study, the Google
Analytics software (shown in figure 2.2) was used. It was installed on every ex-
amined website.




                           Figure 5.4. Tor interface screen



5.6.2. Results



The traffic distribution issue ended with success. After opening the Google Search main
page (www.google.com), the server redirects to the domain belonging to the country in
which the Tor exit node of the particular session was located. For example, when the exit
node was in Germany, the Google server redirected the browser from google.com to the
google.de address. The Google Search page was displayed in the appropriate language,
despite the fact that the browser’s default language setting was removed. After visits to
the examined pages, Google Analytics also indicated that the source of the visits was not
in Poland (where the study was actually conducted) but in the countries of the exit nodes
of the traffic.

However, routing the traffic through the distributed Tor network appeared to be an insuf-
ficient solution. Firstly, the traffic routed through Tor is significantly slowed down. From
time to time there were even difficulties with downloading the complete search engine site.
What is more, despite the large number of visit sources, there is still only one browser and
one real user. Because of this, the simulation could only imitate one signed-in user or
a group of signed-out users.

Signed-in user

In the first case there is essentially no difference between visiting through the Tor proxy
network or directly. As was described, Tor is a tool for concealing the identity of
a user, but after signing in to a Google Account, the identity is evident. From Google’s
point of view, such a visit is seen as a regular user travelling very quickly all around the
world (metaphorically speaking). But it is still only one user, and the personalized
search results applied to him should not be globally significant.

Group of signed-out users

As was described earlier, Google introduced the personalization mechanism not only for
users with a Google Account. There is also personalization of search results for users
with no such profile account. It is based on storing cookies in the user’s browser for up to
180 days, which contain information about past search activities. But cookies are not
related to a specific user, but to a browser. In this case, search results are re-ranked not
for the person but rather for the particular computer which this person uses.
This simulating system uses only one browser, so there was no possibility of evaluating the
impact of personalization re-ranking on the search ranking from the global perspective.
Disabling the cookie-storing option in the browser makes personalization impossible,
because there is no way to relate past queries in the search engine to a particular user.
Moreover, a browser with blocked cookies is a rather rare situation nowadays. Therefore
the Google search engine is rather suspicious about traffic with blocked cookies, and for
such requests it serves the "Sorry page" (figure 5.1).



5.7. Summary

Generating artificial traffic on the Internet does not seem very praiseworthy, as it
is dangerously close to spam and introduces information noise into visitor
statistics. On the other hand, it is no worse than other SEO activities like
linkbaiting.
Following [6], today’s search engines use mainly link-popularity metrics to measure the
"quality" of a page. This is the main idea of the PageRank algorithm [3]. This fact causes
the "rich-get-richer" phenomenon: more popular websites appear higher on SERPs,
which brings them more popularity.
Unfortunately, this is not very beneficial for new, unknown pages which have not
gained popularity yet. There is a possibility that these websites contain more valuable
information than the popular ones. Despite this fact, they are ignored by search engines
because of the small number of links. These sites, in particular, need SEO efforts. Probably
classic techniques will be more effective than the one presented in this paper.
Nevertheless, the methods presented in this article are likely to improve the rate of
web positioning, because web traffic can be noticed by search engines immediately. It
Analysis Of The Modern Methods For Web Positioning
Analysis Of The Modern Methods For Web Positioning
Analysis Of The Modern Methods For Web Positioning
Analysis Of The Modern Methods For Web Positioning
Analysis Of The Modern Methods For Web Positioning
Analysis Of The Modern Methods For Web Positioning
Analysis Of The Modern Methods For Web Positioning

More Related Content

What's hot

Ibm watson analytics
Ibm watson analyticsIbm watson analytics
Ibm watson analyticsLeon Henry
 
Daftar isi print
Daftar isi printDaftar isi print
Daftar isi printdimas34343
 
School library management system software
School library management system softwareSchool library management system software
School library management system softwareRanganath Shivaram
 
Mongo db notes for professionals
Mongo db notes for professionalsMongo db notes for professionals
Mongo db notes for professionalsZafer Galip Ozberk
 
Scrapbook User\'s Manual
Scrapbook User\'s ManualScrapbook User\'s Manual
Scrapbook User\'s Manualhaven832
 
Data Export 2010 for MySQL
Data Export 2010 for MySQLData Export 2010 for MySQL
Data Export 2010 for MySQLwebhostingguy
 
Joomla 2.5 Tutorial For Beginner PDF
Joomla 2.5 Tutorial For Beginner PDFJoomla 2.5 Tutorial For Beginner PDF
Joomla 2.5 Tutorial For Beginner PDFVineet Kumar Saini
 
Conceptualising and Measuring the Sport Identity and Motives of Rugby League ...
Conceptualising and Measuring the Sport Identity and Motives of Rugby League ...Conceptualising and Measuring the Sport Identity and Motives of Rugby League ...
Conceptualising and Measuring the Sport Identity and Motives of Rugby League ...Daniel Sutic
 
SchoolAdmin - School Fees Collection & Accounting Software
SchoolAdmin - School Fees Collection & Accounting SoftwareSchoolAdmin - School Fees Collection & Accounting Software
SchoolAdmin - School Fees Collection & Accounting SoftwareRanganath Shivaram
 
B4X Programming Language Guide
B4X Programming Language GuideB4X Programming Language Guide
B4X Programming Language GuideB4X
 
First Solar Inc., Strategic Analysis Report
First Solar Inc., Strategic Analysis ReportFirst Solar Inc., Strategic Analysis Report
First Solar Inc., Strategic Analysis ReportLauren Zahringer
 
Protel 99 se_traning_manual_pcb_design
Protel 99 se_traning_manual_pcb_designProtel 99 se_traning_manual_pcb_design
Protel 99 se_traning_manual_pcb_designhoat6061
 
Martin Gregory - Capability Statement 2013
Martin Gregory -  Capability Statement 2013Martin Gregory -  Capability Statement 2013
Martin Gregory - Capability Statement 2013Martin Gregory
 

What's hot (19)

IBM Watson Content Analytics Redbook
IBM Watson Content Analytics RedbookIBM Watson Content Analytics Redbook
IBM Watson Content Analytics Redbook
 
Ibm watson analytics
Ibm watson analyticsIbm watson analytics
Ibm watson analytics
 
Slackbook 2.0
Slackbook 2.0Slackbook 2.0
Slackbook 2.0
 
Daftar isi print
Daftar isi printDaftar isi print
Daftar isi print
 
School library management system software
School library management system softwareSchool library management system software
School library management system software
 
Mongo db notes for professionals
Mongo db notes for professionalsMongo db notes for professionals
Mongo db notes for professionals
 
Scrapbook User\'s Manual
Scrapbook User\'s ManualScrapbook User\'s Manual
Scrapbook User\'s Manual
 
Guide to managing and publishing a journal on the LAMJOL
Guide to managing and publishing a journal on the LAMJOLGuide to managing and publishing a journal on the LAMJOL
Guide to managing and publishing a journal on the LAMJOL
 
Data Export 2010 for MySQL
Data Export 2010 for MySQLData Export 2010 for MySQL
Data Export 2010 for MySQL
 
Joomla 2.5 Tutorial For Beginner PDF
Joomla 2.5 Tutorial For Beginner PDFJoomla 2.5 Tutorial For Beginner PDF
Joomla 2.5 Tutorial For Beginner PDF
 
Conceptualising and Measuring the Sport Identity and Motives of Rugby League ...
Conceptualising and Measuring the Sport Identity and Motives of Rugby League ...Conceptualising and Measuring the Sport Identity and Motives of Rugby League ...
Conceptualising and Measuring the Sport Identity and Motives of Rugby League ...
 
SchoolAdmin - School Fees Collection & Accounting Software
SchoolAdmin - School Fees Collection & Accounting SoftwareSchoolAdmin - School Fees Collection & Accounting Software
SchoolAdmin - School Fees Collection & Accounting Software
 
B4X Programming Language Guide
B4X Programming Language GuideB4X Programming Language Guide
B4X Programming Language Guide
 
First Solar Inc., Strategic Analysis Report
First Solar Inc., Strategic Analysis ReportFirst Solar Inc., Strategic Analysis Report
First Solar Inc., Strategic Analysis Report
 
2003guide
2003guide2003guide
2003guide
 
Protel 99 se_traning_manual_pcb_design
Protel 99 se_traning_manual_pcb_designProtel 99 se_traning_manual_pcb_design
Protel 99 se_traning_manual_pcb_design
 
Report on dotnetnuke
Report on dotnetnukeReport on dotnetnuke
Report on dotnetnuke
 
Official basketball rules_2014_y
Official basketball rules_2014_yOfficial basketball rules_2014_y
Official basketball rules_2014_y
 
Martin Gregory - Capability Statement 2013
Martin Gregory -  Capability Statement 2013Martin Gregory -  Capability Statement 2013
Martin Gregory - Capability Statement 2013
 

Similar to Analysis Of The Modern Methods For Web Positioning

Similar to Analysis Of The Modern Methods For Web Positioning (20)

Specification of the Linked Media Layer
Specification of the Linked Media LayerSpecification of the Linked Media Layer
Specification of the Linked Media Layer
 
Dsa
DsaDsa
Dsa
 
cs-2002-01
cs-2002-01cs-2002-01
cs-2002-01
 
Master thesis xavier pererz sala
Master thesis  xavier pererz salaMaster thesis  xavier pererz sala
Master thesis xavier pererz sala
 
test5
test5test5
test5
 
test6
test6test6
test6
 
test4
test4test4
test4
 
test5
test5test5
test5
 
test6
test6test6
test6
 
Sdd 2
Sdd 2Sdd 2
Sdd 2
 
Thesis
ThesisThesis
Thesis
 
Dimensional modeling in a bi environment
Dimensional modeling in a bi environmentDimensional modeling in a bi environment
Dimensional modeling in a bi environment
 
Lecture Notes in Machine Learning
Lecture Notes in Machine LearningLecture Notes in Machine Learning
Lecture Notes in Machine Learning
 
XAdES Specification based on the Apache XMLSec Project
XAdES Specification based on the Apache XMLSec Project XAdES Specification based on the Apache XMLSec Project
XAdES Specification based on the Apache XMLSec Project
 
Red book Blueworks Live
Red book Blueworks LiveRed book Blueworks Live
Red book Blueworks Live
 
Bwl red book
Bwl red bookBwl red book
Bwl red book
 
Information extraction systems aspects and characteristics
Information extraction systems  aspects and characteristicsInformation extraction systems  aspects and characteristics
Information extraction systems aspects and characteristics
 
Location In Wsn
Location In WsnLocation In Wsn
Location In Wsn
 
Yii blog-1.1.9
Yii blog-1.1.9Yii blog-1.1.9
Yii blog-1.1.9
 
&lt;img src="../i/r_14.png" />
&lt;img src="../i/r_14.png" />&lt;img src="../i/r_14.png" />
&lt;img src="../i/r_14.png" />
 

Analysis Of The Modern Methods For Web Positioning

  • 1. Faculty of Computer Science and Management field of study: Computer Science specialization: Software Engineering Master thesis Analysis of the Modern Methods for Web Positioning Paweł Kowalski keywords: search engine, SEO, personalization, optimization, web positioning Thesis contains an analysis of personalization of search results mechanism in popular search engines. It presents experiments and considerations about impact of personalization on search rankings and how it affects on Search Engine Optimization (SEO). There are proposed some new SEO methods that take an advantage of personalization in search engines. Supervisor: dr inż. Dariusz Król ............................. ............................. name and surname grade signature Do celów archiwalnych pracę dyplomową zakwalifikowano do:* a) kategorii A (akta wieczyste) b) kategorii BE 50 (po 50 latach podlegające ekspertyzie) * niepotrzebne skreślić Stamp of the institute Wrocław 2010
  • 2. Contents Chapter 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1. Beginning of the SEO Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2. Search Engines Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3. The Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Chapter 2. Personalized Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1. Operation Principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2. Methods for the Analysis of Behavioural Data . . . . . . . . . . . . . . . . . . 7 2.2.1. Methods of behavioural data collecting . . . . . . . . . . . . . . . . . . 8 2.2.2. Process of tracking user . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3. Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.3.1. Location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.3.2. Phrase language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.3.3. Search history . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.4. Spam Issue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 Chapter 3. Impact of Personalized Search on SEO . . . . . . . . . . . . . . . . 21 3.1. Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.1.1. Areas for Consideration . . . . . . . . . . . . . . . . . . . . . . . . . . 23 Chapter 4. SEO Guide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 4.1. Website Presentation in Google Search . . . . . . . . . . . . . . . . . . . . . . 26 4.1.1. Title . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 4.1.2. Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 4.1.3. Sitelinks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 4.2. Website Content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 4.2.1. Unique content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 4.2.2. Keywords . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 4.3. Source Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.3.1. Headers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.3.2. Highlights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.3.3. Alternative texts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.3.4. Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.4. Internal Linking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.4.1. Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 i
  • 3. 4.4.2. Links anchors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4.4.3. Broken links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4.4.4. Nofollow attribute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4.4.5. Sitemap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.5. Addresses and Redirects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.5.1. Friendly addresses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.5.2. Redirect 301 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4.6. Other Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4.6.1. Information for robots . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.6.2. Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.7. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 Chapter 5. The System for Global Personalization . . . . . . . . . . . . . . . . 37 5.1. Problems to Solve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 5.2. Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 5.2.1. Web Positioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 5.2.2. Cooperation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 5.2.3. Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 5.3. Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 5.4. Visitor Session Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 5.5. Task Assignation Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 5.6. Proof Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 5.6.1. Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 5.6.2. Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 5.7. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Chapter 6. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 ii
  • 4. Abstract Modern search engines are constantly improved. The most recent big step introduced into their algorithms concern personalization mechanism. Its goal is to extract information about user’s preferences implicitly from its search behaviour and also such factors as location, phrase language and search history. This information is the basis for building a user’s search profile. Motivation of this process is to provide more relevant search results for specific user and its interests. Thesis concern details of this personalization mechanism and try to examine how various factors affect search results. Author also analyse the methods for collecting behavioural data by search engines. He approaches to define possible impact of customization of search results on the Search Engine Optimization (SEO) issues like metrics, spam filtering or changes in significance of website optimization factors. Then author tries to evaluate the possibility of personalized search rankings manipulation through proposed system for generating human-like web traffic. Streszczenie Nowoczesne wyszukiwarki internetowe są ciągle ulepszane. Ostatni duży krok na- przód wprowadzony w ich algorytmach dotyczy mechanizmu personalizacji. Jego zadaniem jest zdobycie informacji na temat preferencji użytkownika z jego za- chowania podczas wyszukiwania informacji w wyszukiwarce pod względem takich czynników jak jego lokalizacja, język wyszukiwanej frazy i historia wyszukiwań. Te informacje są podstawą do utworzenia profilu użytkownika. Celem tych dzia- łań jest zwrócenie konkretnemu użytkownikowi rezultatów wyszukiwania bardziej odpowiadających jego zainteresowaniom. W pracy tej znajduje się szczegółowa analiza mechanizmu personalizacji oraz próba zbadania jak poszczególne czyn- niki wpływają na wyniki wyszukiwań. Autor analizuje także metody pozyskiwania danych na temat zachowań użytkowników przez wyszukiwarki. Podejmuje próbę określenia możliwego wpływu dostosowywania wyników wyszukiwania do użyt- kownika na tematy związane z pozycjonowaniem witryn internetowych takie, jak metryki, filtrowanie spamu lub zmiany w znaczeniu poszczególnych czynników w optymalizacji stron WWW. Następnie autor próbuje ocenić możliwość mani- pulowania spersonalizowanymi wynikami wyszukiwania poprzez zaproponowany system służący do generowania naturalnie wyglądającego ruchu sieciowego. iii
Chapter 1
Introduction

Before the Web and present-day search engines, searching meant simply matching the terms in a query to the exact appearance of those terms in a database filled with textual documents. Some database searches only let you locate documents where certain words appeared within a defined distance from other specified words in the same document. Sorting documents by relevance or importance would have been a monumental task, if it was possible at all.

1.1. Beginning of the SEO Concept

When the Internet was introduced, it revolutionized the worldwide sharing of information. Free access for everyone, without any restrictions, is the reason why the Internet is considered one of the greatest inventions of the 20th century. But this freedom has a serious implication – many problems with organizing this enormous body of information. Hyperlinks turned out to be insufficient for the task. This is why the first search engines were introduced. They quickly became the main source of visits to commercial websites. Good search results became a very important issue for content publishers. That moment was the beginning of the SEO¹ concept, which is still a major element of Internet marketing.

The early search engines such as AltaVista or Lycos were launched around 1994–1995 [7]. Their algorithms analysed only the content of websites and the keywords in meta tags. It was easy to circumvent these algorithms by placing false information in the keywords tags. Another popular fraud was filling website content with irrelevant text that was visible only to search engine robots, but not to the user. As a result, search engine result pages (hereafter SERPs) contained websites filled with spam and inappropriate content [21].

1. Search Engine Optimization (SEO) – the process of improving the volume or quality of traffic to a website from search engines. The term Web Positioning is also used as a synonym.
1.2. Search Engines Evolution

However, the relevance of the search results to the query was still based on keyword matching. But search engines started to understand the differences in importance of words located in different parts of a page. For example, if you searched for a certain phrase, pages containing those words in their titles and headlines might be considered more relevant than other pages where those words also appeared, but not in those "important" parts of the page.

Google, founded in September 1998, revolutionized search engines. Its co-founders, Larry Page and Sergey Brin, developed the PageRank algorithm [3]. This algorithm redefined the search problem. The content of websites became slightly less significant. Instead of text content, PageRank rates websites mainly on the basis of the quantity and quality of links leading to them. With such improvements, the Internet works as a kind of voting system: every link is a vote for the website it leads to. Relevance was also derived from indexing the words that link to other pages. If a link leading to a page used the phrase "american basketball" as anchor text², the page being pointed to would be considered relevant to American basketball. The existence of links to pages has also been used to help define the perceived importance of a page. Information about the quality and quantity of links to a page can be used by search engines to get a sense of the implied importance of the page being linked to.

Nevertheless, after a short time, techniques [21] spoiling the results of PageRank were also discovered. Basically, most of them work by increasing the number of links leading to a particular website in order to enhance its PageRank value. Such activity on a large scale is usually called linkbaiting. There are many scripts and web catalogs that facilitate and automate it. However, Google is also constantly working to improve its search engine algorithm and trying to make it resistant to linkbaiting. According to [11], many new factors are being introduced into the website evaluation process in order to reduce the impact of linkbaiting, which is a sort of spam.

Besides, there is a limit to the effectiveness of this type of keyword matching. When two people perform a search at one of the major search engines, there is a chance that even if they use the same search terms, they might be looking for something completely different. For example, when an anthropologist searches for the phrase "jaguar", he expects websites with information about big cats as a result. But he can also receive a collection of websites about Jaguar cars instead.

As search engines progressed and users were given more and more websites with valuable information, the engines needed to respond with a refined approach to search. The main idea for improving the relevance of search results was to better understand the user's intent and expectation when he types a certain phrase into a search box.

So it seems that the next step in search engine improvement is tracking regular users on the Internet. Collecting data on their activity might give useful information about

2. Link label or link title, the text in a hyperlink visible and clickable by the user.
which websites are valuable for them. The major engines such as Google, Yahoo and Bing guard their search secrets closely, so one can never be absolutely certain how they operate. But they are evolving, and personalization seems to be the wave of the future.

1.3. The Goal

It seems quite clear that search engines fitted with a personalization mechanism would bring two main benefits:
1. Improvement of search result relevance for a specific user.
2. Decrease in the number of spam entries in SERPs.

Google, the leader of the search engine market, already has the first steps in this area behind it. Such information reaches us from the official company blog [13]. Moreover, Google already holds several patents connected with the personalization mechanism. For this reason this thesis is mainly concerned with Google Search. However, the high competition in the Internet market suggests that other popular search engines such as Yahoo and Bing are also being improved in this direction.

The goal of this thesis is to analyse the possible aspects of the personalization mechanism in Google Search on the basis of the available information. Several factors which can influence the changeability of SERPs will be taken into account:
• geolocation
• language of the query
• web search history
• query complexity
• search behaviour (e.g. bounce rates³, time of visits)

This will be the base for several experiments which should determine how advanced the current level of personalization introduced into the considered search engine is. The research also includes an analysis of the data used to describe users' search behaviour – particularly the types of these data and the methods of collecting them. The obtained results will be used to specify the potential impact on SEO and its metrics, in particular:
• the possibility of using search engine personalization to create new SEO techniques

3. Bounce rate is a term used in website traffic analysis. It essentially represents the percentage of initial visitors to a site who "bounce" away to a different site, rather than continue on to other pages within the same website.
• the usability of website ranking in search results as a measure of success of SEO activity

Personalization is an opportunity for search engines to make spam less significant in search results and to make SEO workers' lives harder. But this concerns only spam in the sense of websites with content of no value to humans and irrelevant hyperlinks. Personalization also opens the door to another kind of information noise – behavioural data spam. The thesis presents the architecture of a distributed system generating artificial web traffic and therefore imitating the search activity of a real user. However, using such a system can be seen as unethical, so the thesis contains only a conception and a design. The author has no intention of implementing such a system, but he tries to examine with the available tools whether building it would be feasible. In this way, possible harmful actions against which search engines should be protected can be indicated.

After that, there is a short analysis of the known, up-to-date information about significant factors in web positioning. Together with the results of the personalization research, it helped to prepare a collection of advice on how to build a website attractive to search engines. It is a sort of guide for webmasters.

At the end of the thesis there is a short conclusion. It contains the author's thoughts about future trends in search engines and SEO.
Chapter 2
Personalized Search

Pretschner [27] wrote in 1999: With the exponentially growing amount of information available on the Internet, the task of retrieving documents of interest has become increasingly difficult. Search engines usually return more than 1,500 results per query, yet out of the top twenty results, only one half turn out to be relevant to the user. One reason for this is that Web queries are in general very short and give an incomplete specification of individual users' information needs.

To be more specific, Speretta [31] wrote in 2005: [...] most common query length submitted to a search engine (32.6%) was only two words long and 77.2% of all queries were three words long or less. These short queries are often ambiguous, providing little information to a search engine on which to base its selection of the most relevant Web pages among millions.

According to Wikipedia, in 2006 Google had indexed over 25 billion web pages and 1.3 billion images, handled 400 million queries per day, and stored over one billion Usenet messages. The Internet grows very quickly. For this reason search accuracy is a crucial area of constant improvement in modern search engines. One of the major solutions to meet this challenge is personalization.

Personalized search is simply an attempt to deliver more relevant and useful results to the end user (searcher) and to minimize less useful results. The personalization mechanism uses information about the user's past actions and behaviour to build his profile and to match relevant search results to this profile. It should provide a more useful set of results, or a set of results with fewer irrelevant or spam entries. For this reason personalized search seems desirable to the end user. Google puts it this way: Search algorithms that are designed to take your personal preferences into account, including the things you search for and the sites you visit, have better odds of delivering useful results [13]. The goal is simple: to reduce spam and to deliver better results. This looks like a dangerous weapon against SEO workers, who are the major offenders in generating spam.
Official information [13], [25], [38] indicates that Google is the only major search engine that has already introduced a personalization mechanism. The first personalized search results appeared almost 5 years ago [13], and since that time the mechanism has been constantly evolving.

2.1. Operation Principles

Of course, the details of Google's search algorithms are not public. But it can be expected that the main principles are based on ideas which can be found in the scientific literature. According to [31], personalization can be applied to search in two different ways:
1. by providing tools that help users organize their own past searches, preferences, and visited URLs
2. by creating and maintaining sets of user's interests, stored in profiles, that can be used by the retrieval process of a search engine to provide better results.

That research proved that user profiles can be implicitly created out of the limited amount of information available to the search engine itself. The profiles are built on the basis of the user's interactions with a particular search engine. Google has applied the second approach in its search engine, because it does not provide any additional tools like toolbars or browser add-ons for personalizing search. After [31]: In order to learn about a user, systems must collect personal information, analyze it, and store the results of the analysis in a user profile. Information can be collected from users in two ways: explicitly, for example asking for feedback such as preferences or ratings; or implicitly, for example observing user behaviors such as the time spent reading an on-line document.

Google Search does not provide any forms which let users specify their interests and preferences. So, to build a user profile, this information must be collected in another way. According to [31], user browsing histories are the most frequently used source of information for creating interest profiles. But not only the browsing history (such as the one presented in figure 2.1) is significant. For example, a user gets search results after sending a search query. He selects a specific entry that seems interesting. He clicks on it and the website is saved in his browsing history. However, the user quickly realizes that the selected website does not fit his interests and goes back to the search results after a few seconds. Such a visit should be qualified rather negatively. So not only the browsing history, but also the user's behaviour should be taken into consideration by the personalization mechanism.

Also, studying a series of searches from the same user may offer a glimpse into modified search behaviour. How does an individual change their queries after receiving unsatisfactory results? Are search terms shortened, lengthened or combined with new terms? There is much other information that a search engine might collect about a user when a search is performed – location, language preferences indicated in their browser, or the type of device they are using (mobile phone, handheld or desktop). But how
Figure 2.1. Google Web History panel

such behavioural data can be collected by a search engine? The answer is in the next section.

2.2. Methods for the Analysis of Behavioural Data

Search engine robots, hereafter crawlers [26], continuously gather information from almost every website on the Internet. It is well known that Google collects an enormous amount of data through this process. The reason is that these data have the greatest significance for the search engine algorithm; this is why classic web positioning methods are based on link maintenance, mainly the acquisition of links. Google processes these data and sorts the websites according to their value to the user. The user sends queries to the search engine and gets appropriate SERPs. Because Google knows what people search for, it is able to determine the popularity of specific information on the Internet. But eventually, it is the user who decides which website is valuable for him and which is not. The value of a particular website is reflected in users' activity – which links have been clicked and how much time passed between these actions. This information is called behavioural data. It is reasonable to make all these data useful to the search engine. Certainly, Google knows that, too. This is probably the reason why it collects an enormous amount of behavioural data in addition to the data collected by crawlers. This kind of information is what this study is most interested in.

2.2.1. Methods of behavioural data collecting

The entire web is based on the HTTP protocol, which generates requests containing the following information:
• IP address of the user making the request, which can be used for geolocation of this user,
• date and time of the request,
• language spoken by the user,
• operating system of the user,
• browser of the user,
• address of the website whose link redirected the user to the requested website.

These HTTP requests are used by Google in:
Click tracking – Google logs all of its users' clicks on all of its services,
Forms – Google logs every piece of information typed into every submitted form,
Javascript execution – requests, and sometimes even more data, are sent when a user's browser executes a script embedded in a website,
Web beacons – small (1 pixel by 1 pixel) transparent images on websites, which cause a request to be sent every time a user's browser tries to download such an image,
Cookies – small pieces of text stored on the user's computer which let Google track users' movement around the web every time they land on any page that carries Google advertisements.

But all these elements have to be placed on the websites being indexed by Google's crawlers. Fortunately for Google, it has a lot of services that are very useful for Internet publishers. Because they are mostly free to use, webmasters gladly use them on their websites.

Google Analytics

One of these attractive services is Google Analytics. It generates detailed statistics about:
  • 13. Chapter 2. Personalized Search 9 • the visitors to the website, • the previous website of the visitor, • activity (navigation) of the user in the website. Figure 2.2. Google Analytics main panel It is the most popular and one of the most powerful tool to examine the web traffic on our website. It gives the owner a lot of useful information about visitors on his website. Figure 2.2 shows a few features of this piece of software. The information which it provides is certainly very interesting for the Google itself. For this reason there is a lot of discussions between SEO workers about possible disadvantages of using Google Analytics in SEO campaigns. It is because poor results of particular website indicated through Google Analytics can inform Google’s search engine to decrease value of this website in search ranking. But this is only an unconfirmed speculation. Google Toolbar Another Google’s tool which provides them with even more valuable data is Google Toolbar. It is the plug-in adding a few new features to popular browsers, mainly a quick access to Google’s services. One of these features is checking the PageRank value
for the currently viewed webpage. This gives Google information about every website that users with the Toolbar installed are viewing.

Google AdSense

There is also Google AdSense – a contextual advertising program for website publishers. Millions of websites use this service to generate some financial profit for their authors. The effect is that all these websites display ads served by Google's servers. It can provide Google with similar information as Google Analytics and Google Toolbar.

Google Public DNS

The latest service launched by Google is Public DNS (Domain Name System) [23]. It is said to be faster and more secure than other resolvers, and this is how Google encourages us to start using it. It can generate a massive amount of information about web traffic, since every single query to the DNS can be analyzed by its provider. So the more popular their DNS becomes, the better for Google. It can provide a lot of information helpful in determining website popularity.

However, because of DNS caching mechanisms [23], Google does not get all the desired information. A DNS client sends a query only when the user wants to visit a domain for the first time. After it gets the IP address of this domain from the DNS, it caches it for an interval determined by a value called time to live (TTL). Subsequent visits during this interval do not send any query. Consequently, Google still needs other services to gather the desired information about the activity of a particular website's visitors.

Other Google Services

Google has other very popular services, for example YouTube, Google Maps (Fig. 2.3) etc. They allow users to embed objects like videos or maps on their own websites. There is also Google Reader, which can indicate the popularity of particular websites by counting their RSS¹ subscribers. There are many other ways for Google to gain useful data [9]. In fact, Google itself admits to the use of all the described techniques in its privacy policy [12]. Most of these data are probably used to improve the accuracy of their search engine and the quality of their services.

2.2.2. Process of tracking user

The described services can be a great source of behavioural data; there is no doubt about that. But the process of tracking a user's search activity would be incomplete without the data provided by the search engine itself. The next few sections present how the tracking process looks.

1. RSS (most commonly expanded as Really Simple Syndication) is a family of web feed formats used to publish frequently updated works – such as blog entries, news headlines, audio, and video – in a standardized format.
Figure 2.3. Google Maps example screen

Starting the session

When the user opens the search engine site (typing the www.google.com address into a browser), he sends an HTTP Request [37] to the server. This request contains the IP address of the user's computer. Thanks to this information, the search engine is able to relate subsequent search queries to particular users. Each of them is assigned a unique session identifier, stored on the server. This is the beginning of the user's search session. The identifier expires after a certain period of the user's inactivity, and in this way the search session is terminated.

Sending the search query

The view presented in figure 2.4 should be familiar to every Internet surfer. This is the place where the user can type his search query. After the search query is sent, two things happen:
1. The query is stored in a database and connected with the user's session identifier. The personalization mechanism then takes advantage of it.
2. The query is analysed and used by the search algorithm to provide relevant search results to the user.
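To make these two steps concrete, the following is a minimal sketch, written in Python, of how a search query could be tied to a session identifier and stored for later use by a personalization mechanism. It is not Google's actual implementation, which is not public; all names (sessions, query_log, log_query) are hypothetical.

import time
import uuid

# In-memory stores standing in for the search engine's databases (illustrative only).
sessions = {}     # session_id -> {"ip": ..., "last_seen": ...}
query_log = []    # list of (session_id, query, timestamp) tuples

SESSION_TIMEOUT = 30 * 60   # session expires after 30 minutes of inactivity

def get_session(ip_address):
    """Return an existing session for this IP or start a new one ("Starting the session")."""
    now = time.time()
    for sid, data in sessions.items():
        if data["ip"] == ip_address and now - data["last_seen"] < SESSION_TIMEOUT:
            data["last_seen"] = now
            return sid
    sid = uuid.uuid4().hex
    sessions[sid] = {"ip": ip_address, "last_seen": now}
    return sid

def log_query(ip_address, query):
    """Store the query together with the session identifier ("Sending the search query")."""
    sid = get_session(ip_address)
    query_log.append((sid, query, time.time()))
    return sid

A later personalization step could then read query_log to build a per-session interest profile.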
  • 16. Chapter 2. Personalized Search 12 Figure 2.4. Google Search main screen After that user receives an HTTP Response [37] with the search results as the HTML code. Result selection In the figure 2.5 there is presented one of the results of ”query example” with the hyperlink highlighted. Figure 2.5. Example of the search result Commonly click on the link sends HTTP Request to the server which link leads to. So in this case, it should be sent to: http://www.wisegeek.com/what-is-query-by-example.htm But when you look into source code of the SERP, you will find URL like this: http://www.google.com/url?sa=t&source=web&cd=6&ved=0CDMQFjAF& url=http://www.wisegeek.com/what-is-query-by-example.htm& ei=CekOTLT4EpHu0gTYitWXDg&usg=AFQjCNE3t34-kSehUAK8TFNwh5CV9K-OWg& sig2=PdwrnqnhLhowpC8t5-06bw The most important thing which can be noticed is that the links in the SERPs lead to Google’s server. But the chosen website finally appears on the user’s screen, because Google’s server is doing URL redirection (forwarding). This technique bears the down- side of the short delay caused by the additional request to search engine server. However, in this way search engine can log every user’s click in SERPs. What is more, there are some additional data in the result’s URL which probably provide some extra information. For example, the value of the cd parameter is the place number in the
current search ranking. What is more, this URL can be seen by the target server, because browsers place it into the HTTP Request data as the Referer field [37]. This fact is used by software like Google Analytics to aggregate the traffic sources of websites containing Analytics scripts. Thanks to this, the website owner can learn:
• the most popular search phrases that result in visits to his website
• the place in the ranking of his website for a particular search phrase and a particular user (it can vary due to the personalization mechanism)
Of course, the same information is taken into consideration by the search engine.

Behavioural data extraction

According to much research [1], [6], [7], [10], [20] and [36], more than half of a website's visits come from SERPs. In the case of e-commerce websites this value is even higher, because such services usually do not have regular visitors. Their visitors mostly come from search engines (even 90% of visits) or from ads appearing on other websites. According to the analysis of real users' web traffic [29], a typical user spends about 2 hours per session and 5 minutes per page. These statistics concern a website of good quality, relevant to the user's interests. A visit to a website with poor content would be terminated after as little as a few seconds – a so-called bounce. Such a visit should indicate the irrelevance of the website selected by the user, so it would be desirable if it did not appear in the search results for that particular phrase.

Not only what you select or interact with from a given set of search results (or the ads served with them), but also what you do not select or have minimal interaction with (bounce rates) can have an effect. These metrics can be used to create a better probability model for future search result sets.

What is more, a mechanism based on cookies has recently been introduced in Google Search. It allows Google to learn every user's history of search queries from the last 180 days, even if the user is not logged in to any Google service. Officially it is used to personalize SERPs according to the past interests of the user. Google [13] says about that: Because many people might search from a single computer, the browser cookie may be associated with more than one person's search activity. For this reason, we don't provide a method for viewing this signed-out search activity.

The diagram in figure 2.6 shows the process of tracking a user who has not been signed in to a Google Account. This is what Google [14] says about personalized search for signed-out users: When you search using Google, you get more relevant, useful search results, recommendations, and other personalized features. By personalizing your results, we hope to deliver you the most useful, relevant information on the Internet. In the past, the only way to receive better results was to sign up for personalized search. Now, you can get customized results whenever you use Google. Depending upon whether
Figure 2.6. Activity diagram of a visit session

or not you're signed in to a Google Account when you search, the information we use for customizing your experience will be different:

Signed-in personalization: When you're signed in, Google personalizes your search experience based on your Web History. If you don't want to receive personalized results while you're signed in, you can turn off Web History and remove it from your Google Account. You can also view and remove individual items from your Web History.

Signed-out customization: When you're not signed in, Google customizes your search experience based on past search information linked to your browser, using a cookie. Google stores up to 180 days of signed-out search activity linked to your browser's cookie, including queries and results you click.
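As a side note to the tracking process described above, the ranking position carried in the cd parameter of the redirecting URL can be recovered by the target server from the Referer header. The sketch below, assuming Python's standard urllib, only illustrates this idea; the URL format is the one observed earlier in this chapter and may change at any time.

from urllib.parse import urlparse, parse_qs

def parse_google_referer(referer):
    """Extract the target URL and the ranking position (cd parameter) from a Google redirect URL."""
    params = parse_qs(urlparse(referer).query)
    target = params.get("url", [None])[0]
    position = params.get("cd", [None])[0]
    return target, int(position) if position else None

referer = ("http://www.google.com/url?sa=t&source=web&cd=6"
           "&url=http://www.wisegeek.com/what-is-query-by-example.htm")
print(parse_google_referer(referer))
# ('http://www.wisegeek.com/what-is-query-by-example.htm', 6)

This is, in essence, the kind of processing that lets analytics software report the ranking position for each visit coming from a SERP.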
Table 2.1. Information used by Google Search in personalization

Place of data storage – Signed-in Personalized Search: Web History, linked to the Google Account. Signed-out Personalized Search: on Google's servers, linked to an anonymous browser cookie.
Time of data storage – Signed-in: indefinitely, or until the user removes it. Signed-out: up to 180 days.
Searches used to customize – Signed-in: only signed-in search activity, and only if the user has signed up for Web History. Signed-out: only signed-out search activity.

2.3. Research

The goal of this section is to evaluate the current level of personalization based on several factors. Helpful for this task is what Google [14] says about the types of result customizations:

When you use Google to search, we try to provide the best possible results. To do that, we sometimes customize your search results based on one or more factors:

Search history: Sometimes, we customize your search results based on your past search activity on Google, such as searches you've done or results you've clicked. If you're signed in to your Google Account and have Web History enabled, these customizations are based on your Web History. If you're signed in and don't have Web History enabled, no search history customizations will be made. (Using Web History, you can control exactly what searches are stored and used to personalize your results.) If you aren't signed in to a Google Account, your search results may be customized based on past search information linked to your browser using a cookie. Because many people might be searching on one computer, Google doesn't show a list of previous search activity on this computer.

Location: We try to use information about your location to customize your search results if there's a reason to believe it'll be helpful (for example, if you search for a restaurant chain, you may want to find the one near you). If you're signed in to your Google Account, that customization may rely on a default location that you've previously specified (for example, in Google Maps). If you're not signed in, the results may be customized for an approximate location based on your IP address. If you'd like Google to use a different location, you can sign in to or create a Google Account and provide a city or street address. Your specific location will be used not only for customizing search results, but also to improve your experience in Google Maps and other Google products.
2.3.1. Location

While you can search at google.com just about anywhere in the world, you can also access Google at a number of country-specific addresses, such as google.co.uk, www.google.fr or www.google.co.in. In fact, Google automatically redirects you to the proper domain using your IP address to determine your geolocation. The browser's preferred-language setting was cleared for this experiment.

The experiment was performed in one location in Poland. However, to simulate requests from other locations, a software environment similar to the one described in section 5.6.1 was used. The phrase used, "jaguar", is multi-lingual, so the language of the phrase does not affect the search results. The first query was sent through three Tor hosts, where the exit host was located in Los Angeles, California, United States. The result of the query is presented in figure 2.7 (only the first several entries). All websites in this SERP are in English, which is the prevailing language at the described location. Moreover, near the bottom there are some places indicated on Google Maps which are physically close to the location of the exit host. The second query was sent via an exit host located in Erfurt, Thuringen, Germany. Figure 2.8 presents the results of this query.

The Official Google Blog [13] states that the same query typed in multiple countries may deserve completely different results. The presented results clearly show that those words are true. Unfortunately, the author was unable to check whether a search for the query "football" provides different results in the US, the UK and Australia, where the term refers to completely different sports. But it is quite possible. A preferred country might include the country of the searcher as well as other countries the searcher might find acceptable, such as showing search results from the United States to people located in Canada.

2.3.2. Phrase language

It is rather clear that the language of the searched phrase is significant for the results. Despite personalization, search engines still use matching of phrases to the content of indexed pages as the major factor in evaluating search relevance. For this reason, phrases identical in semantic meaning but expressed in different languages are in general completely different. So serving search results with English pages about birds would be senseless if the user typed the phrase "Vogel" into the search box, which means "bird" in German.
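The language dimension of such experiments can also be observed without changing the physical location, simply by varying the request headers. The sketch below assumes the third-party requests library and only illustrates the idea of comparing result pages fetched with different Accept-Language headers; it is not the exact environment used in this thesis (that environment is described in section 5.6.1).

import requests   # third-party HTTP library, assumed to be installed

QUERY = "jaguar"

def fetch_results(accept_language):
    """Fetch a results page for the same query, announcing a different preferred language."""
    response = requests.get(
        "http://www.google.com/search",
        params={"q": QUERY},
        headers={"Accept-Language": accept_language,
                 "User-Agent": "Mozilla/5.0 (personalization experiment)"},
    )
    return response.text

english_page = fetch_results("en-US,en;q=0.8")
german_page = fetch_results("de-DE,de;q=0.8")

# Comparing the two HTML documents (for example, the domains of the top entries)
# shows how much the declared language alone changes the returned results.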
  • 21. Chapter 2. Personalized Search 17 Figure 2.7. Search results of the query sent via host located in USA 2.3.3. Search history The most interesting factor which is said to have influence in personalization mechanism in Google search engine is user’s search history. Figure 2.9 shows search results which was slightly modified by re-ranking based on search history. On the fifth position, right after two video thumbnails, there is a link to the website, which was visited 4 times (exact number of visits is visible on the right side of the hyperlink) used by the author of this this study to gather information. These fluctuations appear only when user is signed into Google Account, otherwise there is no access to the web history (figure 2.1). This modified search result was not
  • 22. Chapter 2. Personalized Search 18 Figure 2.8. Search results of the query sent via host located in Germany the author’s intentionally. The phrase ”personalized search” was not the object of the experiment. However, this result shows, that search history affects future search results on similar areas of information. To compare modified results with the original (without impact of personalization), there are two ways to disable results customization: 1. signing-out from Google Account 2. using ”View customization” which is available on the bottom of results screen After using one of these options, we can check the original position of the visited website in the ranking. In this particular case, the website holds 17th position in the results with no customization. So after personalization re-rank, there was position increase by 12 places. But the most important is the fact that this change shows visited website on the first SERP of the search ranking. In most cases (more than 90% of searches) users does not go beyond first page of the results. So such change in ranking causes huge increment of the visitors via this phrase.
Figure 2.9. Search results personalized by user's search history

Unfortunately, the author did not manage to force the search engine to re-rank search results intentionally. So after this experiment, an approximation of the re-ranking algorithm is impossible.

2.4. Spam Issue

There is a huge amount of value in getting to the top of the search results, especially for competitive phrases related to business. This is a marketing area that often involves millions of dollars. So spammers are highly motivated, because there is a lot of money at stake. Unfortunately, regular users searching for valuable content are the main victims of these practices.

One of the more interesting aspects of implicit and explicit user feedback during the search personalization process is that it can be very effective in dealing with spam. The more personalized the results, the smaller the chance that spam will appear in the search ranking. In most cases users do click on spammy websites (tricked by a link with false information about the target website), but after realizing the real value of those websites they quickly go away and do not come back to them.
Not only does this enable Google to limit spam through personalization, it is also a great source of query/click analysis. It is worth considering the case where the click data aggregated across multiple users shows that a given entry in a query space is rarely clicked, or shows a high bounce rate. Google might simply use that signal as a dampening factor for a spam result.
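How such a signal could be folded into the ranking can be pictured as a simple multiplier. The following sketch is purely speculative – Google's actual algorithm is not public – and the thresholds and weights are invented for the example.

def dampening_factor(impressions, clicks, bounces):
    """Speculative spam-dampening signal: rarely clicked or quickly abandoned results lose weight."""
    if impressions == 0 or clicks == 0:
        return 1.0                      # no behavioural evidence, leave the score unchanged
    click_through_rate = clicks / impressions
    bounce_rate = bounces / clicks
    if click_through_rate < 0.01 and bounce_rate > 0.8:
        return 0.5                      # strong negative signal: halve the original score
    return 1.0 - 0.3 * bounce_rate      # otherwise damp the score in proportion to bounces

# Example: 10,000 impressions, 50 clicks, 45 of them bounced quickly.
print(dampening_factor(10000, 50, 45))  # 0.5 -> the entry is pushed down in the ranking

The point is only that aggregated click-through and bounce statistics give the engine a cheap signal for demoting entries that users themselves consistently reject.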
  • 25. Chapter 3 Impact of Personalized Search on SEO 3.1. Metrics For quite a long time, SEO workers were using position in search ranking for particular phrases as the indicator of the web positioning. Increase of the website position has been always a desirable consequence of the SEO actions. After implementation of personalization mechanism, the issue is not so simple. Al- though customizations in rankings are still not very influential (only one entry in whole SERP), the highly visible benefits of the personalization suggest, that the impact will be increasing. For this reason ranking position cannot be the major metrics of success any longer. It is because position in ranking of particular website can be different for every user. Especially for those of them, which are regular visitors of this website. Of course, this indicator is still measurable, because position monitors1 are not person- alization subject (they do not use cookies and Google Account). It can also give useful information about position seen by users searching for concerned keywords for the first time. But it is the increase of inbound traffic which has always been the main moti- vation of SEO actions. So the major metrics of this actions should be closely related to this motivation. For example, those are metrics for SEO in time of personalized search: Number of unique visitors. Higher value indicates good result of the advertising campaign and gaining popularity among new customers. Previous search queries. As an example: if the searcher has been recently searching the term ‘diabetes’ and submits a query for ‘organic food’ the system attempts to learn and presents additional results relating to organic foods that are helpful in fighting diabetes. 1. Software which automates monitoring of a website’s position in search rankings for given phrases
  • 26. Chapter 3. Impact of Personalized Search on SEO 22 Previously presented results. Results that have been presented to the end user can be omitted in future results for a given period of time in exchange for other potentially viable results. User query selection. Past selected or preferred documents can be analysed and similar documents or linking documents can be used to refine subsequent results. Furthermore, certain documents types can be seen as preferred, in what would be a combination of Universal Search concepts. Common websites that accessed can also be tagged as preferred locations for further weighting. Selection and bounce rates (and user activity on website). An editorial scor- ing can be devised from the amount of time a user spends on a page, the amount of scrolling activity, what has been printed, or even what has been saved or book- marked. All can be used to further refine the ‘intent’ and ‘satisfaction’ with a given result that has been accessed. Advertising activity. The advertisements clicked on can also begin to add to a clearer understanding of the end users preferences and interests. User preferences. The end user can also provide specific information as to personal interests or location specific ranking prominence. It could also include favourite types of music or sports, inclusive of geo-graphic preferences such as a favourite sport in a given city. Historical user patterns. A persons surfing habits over a given period of time (e.g. 6 months) can also play a role in defining what is more likely to be of interest to them in a given query result. More recent information (on above factors) is likely to be weighted more than older historical performance metrics within a set of results. Past visited sites. Many of the above metrics, such as time spent and scrolling on a given web page or historical patterns and preferred locations can also be collected in a variety of ways (invasive or non-invasive). Cookies actually save resources for the Search Engine, an added benefit. The advices how to improve values of such metrics are presented in the next chap- ter. Higher position in rankings not always implicate more visitors. Moreover, there is no significant difference for positions between 6 and 10. Very often the proper website optimization of page’s title and description visible in a SERP is more important and brings more visitors than higher position. Better website titles and meta-descriptions would have an advantage as getting the user to engage with the SERP listing upon initial presentation would be at a premium. Quality content as well would begin to take on a more meaningful role than it has in the past, as bounce rates and user satisfaction now starts to play into actual search results rankings.
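Before moving on to specific areas for consideration, it is worth noting that several of the metrics listed above can be computed directly from a site's own visit log, independently of any ranking position. Below is a minimal sketch, with an invented log format, of how the number of unique visitors and the bounce rate of search-engine traffic could be derived.

# Each visit record is assumed to look like:
# {"visitor_id": "...", "referrer": "google", "pages_viewed": 3, "seconds_on_site": 240}

def seo_metrics(visits):
    """Compute unique visitors and bounce rate for traffic coming from search engines."""
    search_visits = [v for v in visits if v["referrer"] in ("google", "bing", "yahoo")]
    unique_visitors = len({v["visitor_id"] for v in search_visits})
    bounces = sum(1 for v in search_visits
                  if v["pages_viewed"] == 1 and v["seconds_on_site"] < 10)
    bounce_rate = bounces / len(search_visits) if search_visits else 0.0
    return {"unique_visitors": unique_visitors, "bounce_rate": bounce_rate}

visits = [
    {"visitor_id": "a", "referrer": "google", "pages_viewed": 1, "seconds_on_site": 5},
    {"visitor_id": "b", "referrer": "google", "pages_viewed": 4, "seconds_on_site": 300},
    {"visitor_id": "b", "referrer": "direct", "pages_viewed": 2, "seconds_on_site": 60},
]
print(seo_metrics(visits))   # {'unique_visitors': 2, 'bounce_rate': 0.5}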
  • 27. Chapter 3. Impact of Personalized Search on SEO 23 3.1.1. Areas for Consideration Author’s experience in commercial SEO, which is closely related to the topic of mar- keting, is rather small. Thus this section is based on [16]. Demographics It should be ensure to leverage any obvious demographics that may apply to your site. If it is geographic, topical (sports, politics) or even a given age group, ensuring that this is targeted effectively is important in that the ‘topical’ nature of personalized search can group results prior to even ranking them. If the particular website is not clear in each of these areas, it risks less weighting to tighter demographic starting document sets. Even your off site activities (link building, Social Media Marketing etc.) should be as tightly targeted as possible. Relevance profile Of particular interest is potential categorization in terms of topical relevance. Ensur- ing that your site provides a strong relevance train would be particularly valuable. Much like phrase based indexing and retrieval concepts, probabilities play a large role. When refining results the search engine looks at related probable matches. Through a concerned effort with on-site and off-site relevance strengthening, you increase the odds of making it to a given set of results in a world of ‘flux’. It never hurts to review the concepts surrounding Phrase Based Indexing and Retrieval as many of the related patents addressed deriving concepts/topics from phrases. One would also have to imagine that tightening up the relevance profile in your Social Media Marketing efforts would also be beneficial to a tighter topical link profile. Fur- thermore, many topically targeted visitors that enter a site may bookmark (or passive collection) your site which ads to the organic search profile without ever being included in a search result. As such, there are many exterior opportunities to be had beyond the traditional off-site SEO. Keyword Targeting Building out from your core terms will be important as far as understanding search behaviour. The long tail as we know it would be targeted towards potential query refinements on a given subset of searcher types. Building out logical phrase extensions and potential query refinements would be something to look at. Furthermore, with changeable personalized ranks we would measure SEO success in actual traffic and conversions which puts term targeting into a new light as far as nailing money terms and having a cohesive plan that targets query refinement long-tail opportunities. Quality Content In considering the value of a website, user interaction becomes a consideration as far as bounce rates, time spent on page and scrolling activities are concerned. Producing
  • 28. Chapter 3. Impact of Personalized Search on SEO 24 compelling and resourceful content would be at a premium to best leverage these tendencies of the system. If a searcher has selected and interacted with your site on multiple occasions your site would be given weight in their personal rankings as well as related topical and searcher types. The more effective a resource the greater the ranking weight increase. Search result conversion Working with the page title, meta-description and snippets takes on a more important role in your SEO efforts when adjusting for personalized search. I dare say using ana- lytics and a form of split testing would be a great advantage as far as satisfying what not only ranks, but converts. Freshness Another area which may be important is document freshness in that people could be able to set default date ranges or the system could passively begin to see a pattern of a user accessing more current content. Valuable website that has been ranking well for a year that may no longer be getting all the traffic that is has been used to. It should be looked at updating such pages with fresh information, or creating new related pages and pass the flow via internal links. Depending on the nature of the content (searcher group profile) more current content may be more popular over the larger data set and thus newer content would be weighted more overall. Site Usability From a crawler or the end user perspective, having logical architecture and a quality end user experience is also at a premium. If similar searcher types embark on similar pathways and related actions (bookmark, print, navigate, and subscribe to RSS) then this will give greater value to those target pages within that community of search types. This also furthers the relevance profile. Analytics It can be noticed there is a strong need for the use of analytics in understanding traffic flows, understanding common pathways, bottlenecks, the paths to conversion, and much more. This data will be of immeasurable use in dealing with many of the factors that can affect Personalized Web Positioning. This issue is closely connected with psychology (particularly behavioural targeting).
Chapter 4
SEO Guide

This chapter presents a set of areas for consideration during the process of website optimization. The prepared advice concerns only areas which may have a positive effect on gaining popularity for the website. It should help to achieve a higher position in search rankings, which should increase the number of visitors. It should also make the website more attractive for users, which will probably decrease the bounce rate – a high bounce rate has a negative impact on the website in the personalization re-ranking process.

The areas covered in this guide do not include any SEO techniques connected with external actions, i.e. those that require contact with other websites, such as:
• linking (the acquisition of links), free or paid
• advertising
• presell pages¹

The listed methods are closely connected with generating spam. Because of this, they reduce the proportion of quality content on the Internet, so Internet surfers gain no benefit from them. This chapter is based mainly on information from [13], [15], [21] and [36].

1. Presell page – a page created only for SEO purposes. The text on such a page is only a surrounding for a link leading to the positioned website. The content has no value for a human reader; it is only prepared to look natural to crawlers, so that it is not filtered as spam.
  • 30. Chapter 4. SEO Guide 26 4.1. Website Presentation in Google Search 4.1.1. Title Title is the first information about particular website in SERP. It is also one of the main factors in with impact on the website ranking. An example of such title in html code looks like this: <title>Jaguars, Jaguar Pictures, Jaguar Facts - National Geographic</title> Such title presented in SERP looks like in the figure 4.1. Figure 4.1. Presentation of a website title in Google Search These are the issues connected with website title, which are significant in SEO: Length up to about 65 characters Longer titles can be also indexed by crawler, but title with 65 characters is rather optimal and it entire fits in SERP. Longer titles are shortened with ellipsis. Diversity of titles Each of the website pages relates slightly different information (e.g. product page, contact form etc.). The title should be prepared individually for each of them. Keywords There are 3 principles related to creating a title: 1. Keywords should be distributed on all the pages. Each of the pages must be optimized for only 3–4 keywords. Front page title should have most general ex- pression, titles of product pages should contain words characterizing the type of these products etc. Sticking to this rule is very important, because in other case the pages of the website could be treated by crawler as duplicated content. 2. The most important keywords should be place at the beginning of title.
3. Google can connect keywords from the title into different phrases, but those which appear one after another have the greatest impact on the position in the ranking. Due to this fact, a key phrase should not be split up.

4.1.2. Description

The description is the second piece of information about a particular website, presented right after the title in the SERP. Such a description presented in a SERP looks like figure 4.2.

Figure 4.2. Presentation of a website description in Google Search

The description shown in the SERP can be generated from the following sources:
• the description metatag, for example:
<meta name="description" content="Learn all you wanted to know about jaguars with pictures, videos, photos, facts, and news from National Geographic." />
• a fragment of the website content (in case the description metatag is too long or there is no such tag in the source code)

Here are some tips on the description metatag:

Length up to about 150 characters
Longer descriptions will not be presented in the SERP as they have been written.

Diversity of descriptions
Just as with titles, the description of a particular page should be slightly different from the others. It should be specific to the information presented on the page.

Keywords
The description should contain the keywords targeted by the SEO strategy. When it does, the keywords will be bolded in the search results for query phrases based on those keywords. This should draw users' attention to our website. The description should also be written in a way that encourages users to visit the website.
  • 32. Chapter 4. SEO Guide 28 4.1.3. Sitelinks Sitelinks are links leading to other pages of the same website. The can be presented in SERPs in 2 ways: 1. Horizontally – 4 links in 1 row (presented in figure 4.3) 2. Vertically – 8 links in 2 columns Figure 4.3. Presentation of a website sitelinks in Google Search There is no manual way for publishers to force sitelinks presenting in SERP. It depends on how the website was indexed. But it can be made easier for crawler to make it correctly. There are two things which can be done: 1. First of all it must be well designed source code related to navigation on our website. Its syntax must be very clear. 2. Prepare a sitemap of website (e.g. in XML format). This issue will be described later. 4.2. Website Content 4.2.1. Unique content The basis of the proper content optimization is its uniqueness. This means that the same text or its larger fragments should not be reproduced on other websites or on different pages of our website. In order to verify the degree of uniqueness of our content, it can be used this tool: http://www.copyscape.com 4.2.2. Keywords It is very important make search engines able to relate our website to specific theme and keywords. In order to make it possible, keywords must be considered not only in website title and description design process. Keywords must be also contained in website content. In preparing the text for the website it suggested to stick following principles:
  • 33. Chapter 4. SEO Guide 29 Repetition Keywords should be repeated several times on every page. But it cannot be forgotten that the text should be written primarily for users. The task is to find a compromise between attractive text for users and good for SEO. Too high density of keywords on particular page can be treated by search engine as an abuse. In such situation our website will be penalize by ranking exclusion. Variations and synonyms The website content will be more natural, if contained keywords are used in many variations (grammatical). The proficiency of modern search engines can also detect using synonyms. For this reason we can use for example word ”drug” in the content being optimized for ”medicine” keyword. Location Keywords should located on whole page with similar density. This will give a better result in positioning than accumulation of keywords for example only at the beginning of the page. 4.3. Source Code Website’s source code has not direct influence on the position in search ranking. How- ever, some errors can cause problem with proper indexing by search engine robots. For this reason it is worth to ensure that the code contains no errors and it is compatible with current WWW standards. Very useful is the code validation tool, provided by the World Wide Web Consortium (W3C). It can by found here: http://validator.w3.org 4.3.1. Headers HTML headers tags (h1–h6) are very significant for proper indexation of the website content. Right usage of them is very important in desing of a website. There couple issues which must be considered form SEO point of view. Hierarchy Headers tags are designed to separate particular sections of a document. They must be used in the correct order and only when there is a need to use.
Repetition

By current HTML standards, a header of the first degree (the h1 tag) may occur only once in the whole document. Other headers can be used repeatedly.

Keywords

It is suggested to put keywords into header tags, because they carry more "positioning power" than regular text. This power probably corresponds to the header hierarchy, so the most important keywords should be placed in the h1 tag.

4.3.2. Highlights

Keywords can be distinguished from the rest of the text by using the tags <strong> (bold) and <em> (italics). In this way, keywords are highlighted both for users and for crawlers. However, it should be done with restraint: not every occurrence of a keyword should be highlighted, only the most important ones.

4.3.3. Alternative texts

Sometimes there are images placed in the document. It is recommended to include alternative texts for those images. It can be done in this way:

<img src="path/to/image" alt="alternative text" />

The alternative text is displayed on the screen when the browser cannot display images (e.g. when they are unavailable on the server). These alternative texts are also interpreted by search engine robots. These data are then used in image search (when the search engine has such an option).

4.3.4. Layout

A well-indexing website should have a clear and minimalistic layout. The content is the most important factor, so even the ratio of the amount of text to the amount of HTML code is significant. The higher this value, the better and more valuable the website from the search engine's point of view.

4.4. Internal Linking

Quite an important issue in website optimization is internal linking. An internal link is a hyperlink which leads to another page of the same website. There are some recommendations connected with this link type.
  • 35. Chapter 4. SEO Guide 31 4.4.1. Distribution Each of the pages should be available in 3 or 4 clicks at most. If not, the website navigation must be re-designed. Attention should be given especially to the links on the main page. The structure of the website must be clear. What is more, it is also very important for usability of the website. Complicated navi- gation can discourage user to continue the visit. 4.4.2. Links anchors Link anchor is the clickable text. It is displayed for the user on a website, instead of plain URL which is rather unreadable for human. It looks like this: <a href="some_url.html">Anchor text</a> Anchors should describe the content of pages which their links lead to. If links are located among the other text, it should match the context of whole text. For example, it is not advised to write ”click here” like it was popular couple years ago. 4.4.3. Broken links Very important thing in website positioning is to beware of links which lead to unavail- able URLs. Such issue is very annoying and discouraging for visitors. The website with broken links will be also less valuable for search engines, because robots crawl the web using links. After indexing a page robot uses one of the links placed on this page to go to another page. When such link is broken, crawler can interrupt indexing process. It will cause the situation where not every page of the website will be indexed. 4.4.4. Nofollow attribute Nofollow is an HTML attribute value used to instruct some search engines that a hyperlink should not influence the link target’s ranking in the search engine’s index. This is example of such hyperlink: <a href="some_url" rel="nofollow">Some website</a> It is intended to reduce the effectiveness of certain types of search engine spam, thereby improving the quality of search engine results and preventing indexing particular web- site as spam. Nofollow attribute is used commonly in outbound links2 , for example in paid advertising. 2. Links which target at other websites
4.4.5. Sitemap
A sitemap is a list of the pages of a website, accessible to crawlers or users. It helps visitors and search engine bots find pages on the website.

Sitemap for users
A dedicated page can be prepared containing links to all of the website's pages, or only to the most important ones. Thanks to this, users having problems with navigation will be able to find quickly what they are looking for. An example of such a sitemap located in a footer is presented in figure 4.4.

Figure 4.4. Example of sitemap for visitors

Sitemap for robots
A sitemap for crawlers must be easy to process automatically. Such a sitemap is usually prepared in the XML document format. This is what an example looks like:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.example.com/catalog?item=12</loc>
    <changefreq>weekly</changefreq>
  </url>
  <url>
    <loc>http://www.example.com/catalog?item=73</loc>
    <lastmod>2004-12-23</lastmod>
    <changefreq>weekly</changefreq>
  </url>
  <url>
    <loc>http://www.example.com/catalog?item=74</loc>
    <lastmod>2004-12-23T18:00:15+00:00</lastmod>
    <priority>0.3</priority>
  </url>
  <url>
    <loc>http://www.example.com/catalog?item=83</loc>
    <lastmod>2004-11-23</lastmod>
  </url>
</urlset>

As can be noticed, such a document contains some information about each link:
• loc: URL of the particular page
• lastmod: time of the last modification of the page
• changefreq: average period between changes to the page
• priority: priority value for the crawler when indexing the particular page
Such information is welcomed by crawlers and can benefit the publisher through faster indexing. In most cases such documents are prepared using software tools such as: http://www.xml-sitemaps.com/
Once the sitemap document is prepared, the search engine must be notified about its existence through a special form.

4.5. Addresses and Redirects
Besides the previously described factors used in the ranking algorithm, search engines also consider the form of the indexed website's URLs and the information included in HTTP responses.

4.5.1. Friendly addresses
Search engines give a higher rank value to websites whose pages have URLs that are more readable for a human. For example, an address like this:
http://www.example.com/index.php?page=product&num=5
can be written in this way:
http://www.example.com/product/5
Such an effect can be achieved using mod_rewrite, a module for the Apache Server which allows creating regular expression patterns for mapping URLs to particular pages. Modern web frameworks, such as the Django Framework or Ruby on Rails, offer the same possibility, as the sketch below illustrates.
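As an illustration of how such friendly addresses are typically declared in a web framework, the following is a minimal sketch of a URL configuration in a recent version of Django. The view name product_detail and the module layout are assumptions made for this example, not part of the original guide.

# urls.py -- minimal sketch of friendly URLs in Django (view names are assumed)
from django.urls import path

from . import views

urlpatterns = [
    # Serves http://www.example.com/product/5 instead of index.php?page=product&num=5
    path("product/<int:num>/", views.product_detail, name="product-detail"),
]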
Moreover, friendly addresses give another opportunity for placing keywords. For this reason, a page being optimized for a particular keyword should have this keyword in its URL. If it is a phrase of several words, it is suggested to separate them with dashes.

4.5.2. Redirect 301
Redirect 301 is a permanent redirect from one address to another. After using it:
• visitors who type the old address into the browser's address bar will be redirected to the new one,
• some search engines will replace the old address in their database with the new one.
So it is very useful after a domain change.
It was said earlier that the content of a website should not be duplicated. It is often forgotten that allowing a website to be entered through several addresses has the same result. Sometimes the same website can be downloaded via:
• example.com
• www.example.com
• example.com/index.html
• www.example.com/index.html
• example.com/default
• www.example.com/default
In such a case, it must be decided whether the main address of the website will have the "www" prefix. If it will, an .htaccess file with the following content should be placed in the main folder of the server:

RewriteCond %{HTTP_HOST} ^example.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

Similarly, we can manage the redirection from the index.html file:

RewriteCond %{REQUEST_FILENAME} index.html
RewriteRule ^(.*)$ http://www.example.com [R=301,L]

There are also many other possibilities which can be managed likewise.

4.6. Other Issues
There are a couple of other things which have some influence on the website ranking.
4.6.1. Information for robots
Sometimes publishers do not want robots to index some of the website's pages, while keeping them available for regular visitors. For example:
• results from internal search engines,
• data sorting results,
• print versions of pages,
• pages which should not be indexed, like the login page of the administration panel.
To manage this issue, a robots.txt file can be prepared with content like this:

User-agent: *
Disallow: /admin-panel/

It should prevent robots from indexing pages whose URLs start with www.example.com/admin-panel/.

4.6.2. Performance
One of the most recent factors introduced into the Google search engine algorithm is the website performance value. Google promotes websites with a short download time. It is not as significant as internal linking or quality of content. Big information portals are very complex, so they cannot be downloaded as fast as, for example, a small blog; valuable information remains the most important factor. However, good performance can increase the rank value of a website compared to another one with similar content but lower efficiency.
To improve website performance, several tools can be used, like PageSpeed by Google: http://code.google.com/speed/page-speed
It provides an analysis of the website's downloading efficiency and gives tips on how to improve the performance. The most common suggestions given by this software are presented next.

Gzip compression
Modern web browsers allow the use of the gzip compression mechanism to reduce the size of text-based website files (HTML documents, CSS files, Javascript files). If there is such a possibility, it is recommended to use it.

Number of DNS lookups
The DNS caching mechanism [23] means there is no need to look up the IP address matching a particular domain several times. For this reason, when every file used by the website is located on the same server (or on another server in the same domain), there is only one DNS lookup during the download process.
Therefore, placing media files (images, CSS, Javascript) on a different domain without a clear need should be avoided.

External files
Information commonly located in external files, like CSS style sheets or Javascript, can also be placed inside the HTML document. However, this should be avoided, because it makes parsing the source code more complex, so the browser needs more time to display the website on the screen.

4.7. Summary
The tips presented in this chapter should significantly increase the rank value of any website. With a higher ranking in the search engine there will be more inbound traffic; in other words, the website will gain more popularity.
Making the website more attractive to visitors should also lead to better results of personalization re-ranking. The assumed impact of personalized search results on the global ranking is very likely, so improving the quality of a website on the basis of the presented guide should increase the website's ranking in general: both through the personalization impact and through collecting inbound links as an effect of the increase in popularity.
Chapter 5
The System for Global Personalization

The goal of this chapter is to propose a method for improving a website's search ranking by affecting the personalization mechanism of the search engine. The idea of this method is to generate artificial behavioural data. The author of this thesis is a co-author of the article [19] on which this chapter is based.
In section 2.2 of this thesis it has been shown that a lot of data goes into Google and a lot of useful, manipulated data comes out. But we can only guess what happens in between, or try to learn from observing the data coming out of Google. Evans wrote [10] that identifying the factors involved in a search engine ranking algorithm is extremely difficult without a large dataset of millions of SERPs and extremely sophisticated data-mining techniques. That is why observation, experience and common sense are the main sources of knowledge about Search Engine Optimization (SEO) methods. It was according to this knowledge that Search Engine Ranking Factors [11] was created. Its last edition assumes that traffic generated by the visitors of a website has 7% importance in Google's evaluation of the website's value. It is, after links to the specific website and its content value, the most significant factor in the website evaluation process. On the basis of the previous editions of the ranking, one can notice that the importance of this factor is increasing.
Because all of these are only reasonable assumptions, the intention is to evaluate the validity of the described factor in web positioning efforts. For this purpose we need a simulation tool which will generate the necessary human-like traffic on a tested website. The tool is going to be a multi-agent system (MAS) which will imitate real visitors of the websites.

5.1. Problems to Solve
Fig. 5.1 presents the main reason why the system must be distributed. A few queries to Google, sent in quick succession from the same IP address, are
detected by Google and treated as abuse. Google suspects automated activity and requires completion of a captcha form in order to continue searching.

Figure 5.1. Information displayed by Google on abuse detection

If a distributed system were used, the queries would be sent from many different IP addresses. This should guarantee that Google will not consider the activity abusive.
This issue cannot be solved by using a set of public proxy servers, as Google has probably put them on its black list. Every single query to Google via such a proxy server leads to the same end: a captcha request.
What is more, following Tuzhilin [35], we can say that Google puts a lot of reasonable effort into filtering invalid clicks on advertisements. There is a big chance that Google uses some of those mechanisms in the analysis of web traffic. This is the reason why the way behavioural data is generated should be our concern: web traffic recognized as artificial could be treated by Google as abuse and cause a punishment (a decline of the website's position).

5.2. Objectives

5.2.1. Web Positioning
The main goal of the system is to improve the position of a website by generating traffic related to that website. The system should only care about activity which can be visible to Google. This means there is no need to download all content from the particular website; it would only waste bandwidth. The system should only send requests to the Google services used by a particular website, for example:
• links to the website on SERPs,
• Google Analytics scripts,
• Google Public DNS queries,
• Google media embedded on the website, like AdSense advertisements, maps, YouTube videos, calendars etc.

5.2.2. Cooperation
The whole idea of the system is to spread positioning traffic over world-wide IP addresses. As a result of this distributed character, the system requires a large group of cooperating users. Nobody will use the system if there is no benefit to them. A mechanism must be introduced which lets the system users share their Internet connections in order to help each other in web positioning. What is more, the mechanism must treat all users equally and fairly, meaning it should not allow anyone to take benefits without any contribution.

5.2.3. Control
According to [36], web positioning is not a single action but a process. It must be possible to control this process; otherwise it could be destructive instead of improving the website's position. For this reason, the system should allow users to:
• control the impact of the system activity on their websites,
• check the current results of the system activity (changes in the website's position on SERPs),
• check the current state of the website in the web positioning process.

5.3. Architecture
Fig. 5.2 presents the architecture of the system, which takes into consideration all of the specified problems and objectives.
Server is necessary to control the whole process of generating web traffic according to the specified algorithm. It gives orders to clients to start generating traffic on the specified websites. It also receives information from clients about the number of requests sent to particular Google services on the website's account.
Database serves as storage for process statistics. They can be presented to clients via a web interface. They are also used by the server for creating the orders in accordance with the algorithm.
Clients are the agents of the presented MAS. They take orders from the server, each concerning a particular website registered in the database to be processed. Processing a website means mimicking its real visitor. The client performs this autonomously using the visitor session algorithm described in the next section. An illustrative sketch of the messages exchanged between the server and the clients is given below.

Figure 5.2. System architecture
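The thesis does not define a concrete message format for the server-client communication, so the following is only a minimal sketch assuming simple serializable structures; all field names are hypothetical.

from dataclasses import dataclass, field

@dataclass
class Order:
    """Hypothetical order sent by the server to a client."""
    website_url: str              # website registered in the database
    phrase: str                   # phrase for which the website is being positioned
    sessions_to_run: int = 1      # how many visitor sessions the client should perform

@dataclass
class Report:
    """Hypothetical report returned by a client after finishing its sessions."""
    website_url: str
    requests_per_service: dict = field(default_factory=dict)  # e.g. {"SERP": 1, "Analytics": 4}

# Example round trip between server and client:
order = Order("http://www.example.com/", "example phrase")
report = Report(order.website_url, {"SERP": 1, "Analytics": 4})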
5.4. Visitor Session Algorithm
According to much research ([1], [6], [7], [10], [20] and [36]), more than half of a website's visits come from SERPs. That is why starting a single visitor session (the sequence of requests concerning a single website registered in the system) by querying Google Search sounds reasonable; however, this only works if the currently considered website appears on one of the first few SERPs. Otherwise, the visitor session should be started directly on the processed website, or should refer to an incoming link, if such a link exists.
Because of Google's likely efforts to detect abuse, the visitor session should be as human-like as possible. The analysis of real users' web traffic [29] is very useful at this point. According to it, a typical user:
• visits about 22 pages on 5 websites in one sitting,
• follows 5 links before jumping to a new website,
• spends about 2 hours per session and 5 minutes per page.
These statistics clearly indicate that a typical visit session concerns a website of good quality. A visit to a poor website would be aborted after only a few seconds, and such a visit could have a negative impact on Google's evaluation of the website's quality.
The algorithm, illustrated in figure 5.3, consists of the following steps:
1 – The server retrieves from the database information about the next website to be processed.
2 – The task is assigned to a client.
3 – The client starts the visitor session.
4 – Searching on SERPs for a link to the processed website.
5 – If a link has been found, click on the link; otherwise, make a direct request.
6 – Processing the visit session.
7 – Request to the server for another website to process.

Figure 5.3. Visitor session algorithm

5.5. Task Assignation Algorithm
The task assignation algorithm helps the server build a queue of registered websites ordered by visitor session priority. The website with the highest priority value is the next one to start a visitor session. In other words, the client always receives the website with the highest priority value to process. The priority value PV is calculated using the function:

PV(α) = r(α) · t(α) · v(α) / T(α)     (5.1)

where
α – a record in the system (a website with a phrase for web positioning),
r(α) – returns the current position in the search engine ranking for α (returns 0 if α is not present in the ranking),
t(α) – returns the time since the end of the last visitor session on α (in seconds),
v(α) – returns the number of visitor sessions made by the client of α's owner,
T(α) – returns the time since the registration of α in the system (in days).
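A direct implementation of formula (5.1) could look like the sketch below. The record fields mirror the symbols defined above; how r, t, v and T are actually measured is left open in the text, so the concrete field names are only illustrative.

from dataclasses import dataclass
import time

@dataclass
class Record:
    """α: a website together with the phrase used for its positioning."""
    ranking_position: int        # r(α); 0 if the website is absent from the ranking
    last_session_end: float      # Unix timestamp of the end of the last visitor session
    sessions_by_owner: int       # v(α)
    registered_at_days: float    # T(α): days since registration in the system

def priority_value(record: Record) -> float:
    """Compute PV(α) = r(α) · t(α) · v(α) / T(α), formula (5.1)."""
    r = record.ranking_position
    t = time.time() - record.last_session_end   # seconds since the last session ended
    v = record.sessions_by_owner
    T = max(record.registered_at_days, 1.0)     # avoid division by zero on the first day
    return r * t * v / T

# The server would pick the registered website with the highest priority value:
# next_record = max(records, key=priority_value)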
The presented function gives the highest "power" to the ranking factor. The reason for this is that websites with a high ranking value should already have more real visitors, so the system's efforts will not be as crucial for their popularity. Time of participation in the system is not very significant: novice participants have an equal chance to gain attention for their websites as the senior ones. However, the function promotes continuous activity of the clients. Also worth considering is the possibility of dynamically modifying the weights of individual factors depending on the results. Because the website queue is built by the server, it is possible to change the whole function while the system is running.

5.6. Proof Study
The presented system requires a large number of users to work properly. Otherwise, the generated traffic would not be distributed enough and would therefore look unnatural. As was shown, a centralized series of queries is seen as abuse. Unfortunately, the thesis author's resources were insufficient for this purpose. However, a simulation has been performed, which was intended to prove the proposed concept.

5.6.1. Tools
The idea was to use the Tor application (http://www.torproject.org) to make a single host (the author's computer) generate distributed traffic. In this way, the behavioural data of one real user could be seen by the search engine as multi-user traffic.
Tor is free software enabling Internet anonymity by thwarting network traffic analysis. Tor aims to conceal its users' identity and their network activity from traffic analysis. Operators of the system run an overlay network of onion routers which provides anonymity in network location as well as anonymous hidden services.
Users of the Tor network run an onion proxy on their machine. The Tor software periodically negotiates a virtual circuit through the Tor network. An application such as a browser may be pointed at Tor, which then multiplexes the traffic through a Tor virtual circuit. Once inside the Tor network, the encrypted traffic is sent from one host to another, ultimately reaching an exit node, at which point the decrypted packet is available and is forwarded on to its original destination. Viewed from the destination, the source of the traffic appears to be the Tor exit node.
As figure 5.4 shows, Tor has become quite popular, so its network involves a large number of users. This makes Tor fit the objective of this study. The Mozilla Firefox browser has been used, connected with Tor. Additionally, the iMacros plug-in has been installed in order to automate the execution of visitor sessions. For the analysis of the behavioural data received by Google during the study, Google Analytics (shown in figure 2.2) has been used. It was installed on every examined website.

Figure 5.4. Tor interface screen
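In this study the traffic was routed through Tor by the browser itself; for completeness, the following is a minimal sketch of how a script could route HTTP requests through a locally running Tor client. It assumes Tor is listening on its default SOCKS port 9050 and that the requests library is installed with SOCKS support; these details are assumptions, not part of the original experiment.

import requests  # requires the optional SOCKS support: pip install requests[socks]

# Route all HTTP(S) traffic through the local Tor onion proxy (default port 9050).
TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

def fetch_via_tor(url: str) -> str:
    """Download a page through the Tor network; the exit node becomes the visible source."""
    response = requests.get(url, proxies=TOR_PROXIES, timeout=60)
    response.raise_for_status()
    return response.text

# Example: from the destination's point of view, this request originates at a Tor exit node.
page = fetch_via_tor("http://www.example.com/")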
5.6.2. Results
The issue of distributing the traffic ended with success. After opening the Google Search main page (www.google.com), the server redirected to the domain belonging to the country in which the Tor exit node of the particular session was located. For example, when the exit node was in Germany, the Google server redirected the browser from the google.com to the google.de address. The Google Search page was displayed in the appropriate language, in spite of the fact that the browser's default language setting had been removed. After visits to the examined pages, Google Analytics also indicated that the source of the visits was not in Poland (where the study was actually conducted) but in the countries of the exit nodes of the traffic.
However, routing the traffic through the distributed Tor network appeared to be an insufficient solution. Firstly, traffic routed by Tor is significantly slowed down; from time to time there were even difficulties with downloading the complete search engine site. What is more, despite the large number of visit sources, there was still only one browser and one real user. Because of this, the simulation could only imitate one signed-in user, or a group of signed-out users.
Signed-in user
In the first case there is essentially no difference between visiting through the Tor proxy network or directly. As described, Tor is a tool for concealing the identity of a user, but after signing in to a Google Account, the identity is evident. From Google's point of view, such a visit is seen as a regular user travelling very quickly all around the world (metaphorically speaking). But it is still only one user, and the personalized search results applied to him should not be globally significant.

Group of signed-out users
As described earlier, Google introduced the personalization mechanism not only for users with a Google Account. There is also personalization of search results for users with no such profile account. It is based on storing cookies in the user's browser for up to 180 days, which contain information about past search activities. But cookies are not related to a specific user, only to a browser. In this case, search results are re-ranked not for the person but rather for the particular computer which this person uses. The simulating system used only one browser, so there was no possibility to evaluate the impact of personalization re-ranking on the search ranking from the global perspective.
Disabling the cookie storage option in the browser makes personalization impossible, because there is no way to relate past queries in the search engine to a particular user. Moreover, a browser with blocked cookies is a rather rare situation nowadays. Therefore the Google search engine is rather suspicious about traffic with blocked cookies, and for such requests it serves the "Sorry page" (figure 5.1).

5.7. Summary
Generating artificial traffic on the Internet does not seem very praiseworthy, as it is dangerously close to spam and introduces information noise into visitor statistics. On the other hand, it is no worse than other SEO activities like linkbaiting.
Following [6], today's search engines use mainly link-popularity metrics to measure the "quality" of a page. This is the main idea of the PageRank algorithm [3]. This fact causes the "rich-get-richer" phenomenon: more popular websites appear higher on SERPs, which brings them more popularity. Unfortunately, this is not very beneficial for new, unknown pages which have not gained popularity yet. There is a possibility that these websites contain more valuable information than the popular ones. Despite this fact, they are ignored by search engines because of their small number of links. These sites, in particular, need SEO efforts. Probably classic techniques will be more effective than the one presented in this paper. Nevertheless, the methods presented in this article are likely to improve the rate of web positioning, because web traffic can be noticed by search engines immediately. It