Abstract—The web is a worldwide collection of systems
providing a variety of information and communication
services. The internet as we know it today is an
outstanding success: more than 1,300,000,000 users
are connected across the globe. Since users surf the web
for resources, it is worth noting that the web is the
most popular internet service. This paper focuses on the
various ways users are monitored and tracked while surfing
the web, and on the methods websites currently use to
track them. It also enumerates how users can protect
themselves from being tracked and highlights the
importance of privacy.
Keywords: Cookies, Tracking devices, Authentication
protocols, Server and proxy logs, Eavesdropping,
Scripting.
I. Introduction
A good way to gain insight into the users and customers who
visit a website is to track their areas of interest while they
surf the web. This paper focuses on the ways a user can be
tracked during web surfing using internet tools such as
cookies, web bugs, server logs and log files, and JavaScript
running in the browser.
A single tool cannot reliably identify a user surfing the web,
but a combination of tools can be used coherently in order to
identify a particular user's private details, such as name,
age, address, email, location and frequently visited websites,
as addressed in this paper. The privacy protection options
available to internet users are also addressed. The
information gathered enables a company to improve the services
rendered to users and helps top strategic managers devise new
business strategies and objectives as well as meet goals.
II. Methodology
A step-by-step approach is used to track users and to identify
their Personally Identifiable Information (PII) with the
following tools: cookies, scripts, and server and proxy log
files. The walk-through starts when a request is triggered from
the client's browser to the server and covers the activities
that occur during the process.
Fig. 1: Physical representation.
Fig. 1 illustrates how a user visits a site and what happens
when a new party, the attacker, joins the same Wi-Fi network as
the sample user.
By running tools which perform access-point spoofing together
with packet-sniffing software, the attacker can read all data,
such as scripts, cookies and server logs, sent between the
client and the HTTP server located on the internet.
The figure below illustrates the set-up used to eavesdrop on
the connection between the client and the internet server.
Web Security
Atsegwasi Otsemhuno Rogers
RA956@live.mdx.ac.uk. M00478276
The access-point spoofing tool tricks the router into using a
compromised computer as an access point to the HTTP server
located on the internet.
Internet activity between the client and the internet server
passes through this compromised access point. As the access
point is located on the same Wi-Fi network, the location of the
client computer can easily be traced to within a 10 m radius.
Packet-sniffing tools run on the compromised system invisibly
capture cookie data, session data, browser-agent data, the
client IP address, the target IP address, usernames, passwords
and much other information.
Scripts or cookies in transit can be modified to add malicious
content which would harm the client's computer.
Besides, sensitive information such as usernames and
passwords, if not encrypted, can be seen in plain text.
This data can be used to impersonate the client on the
internet.
III. Related Work
The advancement of the web has ushered in the philosophy of
non-obtrusive use of the web. Besides, this advancement has
made data a valuable asset for the web. In order to enhance
user experience, provide a personalized feel and improve
services for users on the web, this data proves invaluable [14].
Providing personalized services to users makes them feel
special and hence makes them loyal users of the web service.
Several researchers have experimented with ways through
which data on the web can be used to personally identify users
and hence give them web results based on their unique
identity[5][13][16][17][18]. This is termed as ‘user
profiling’[16]. The tools available to internet platforms to
achieve this include cookies, browser cache, proxy servers,
browser agents, web logs, search logs. Data accumulated from
these sources enable internet platforms to understand their
users better.
Xia and Brustoloni investigated the extent of the personal
information disclosed on the internet [10]. In their case study
of sample users, they discovered that over 90% of users had
publicly submitted information about their real name, pictures,
email address, date of birth, relationship status and
interests. While just 10% of sample users disclosed their
physical address, it can easily be obtained with geo-location
tools.
This information can be put to beneficial use[13][17].
However, when it falls into the wrong hands, it could be used
for criminal and harmful purposes. One of the techniques used
is eavesdropping[20], for example with Airsnarf, an
access-point tool with integrated DHCP, DNS and HTTP spoofing.
The tool enables attackers to re-associate a client computer
to a rogue access point. This is done by amplifying the rogue
signal over the legitimate access point's signal (with the aid
of antennas). Packets between the client and the server are
then easily intercepted, and they hold important information
[13]. Besides personal information, the physical address can
be determined from them[17].
Mobasher, Cooley and Srivastava proposed the mathematical
formula used to process the vast amount of data gathered from
web users. They stated that a database of user profiles is
represented as $UP = [s_{u_k}(i_j)]_{m \times n}$, where UP
represents the universal set of user profiles and
$s_{u_k}(i_j)$ represents the degree of interest in an item
$i_j$ by the user $u_k$. They stated that this formulation is
used by analytic firms to provide personalized products or web
experiences[14].
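The profile matrix can be illustrated with a small sketch. It assumes that rows stand for users and columns for items; the user names, item names and interest scores are invented for illustration and do not come from [14].

```python
# Hypothetical data: 3 users (rows, u_k) and 4 items (columns, i_j).
users = ["u1", "u2", "u3"]
items = ["news", "sports", "music", "travel"]

# UP[k][j] holds s_{u_k}(i_j), the degree of interest of user u_k in item i_j.
# The scores are made-up values in [0, 1], not data from the paper.
UP = [
    [0.9, 0.1, 0.0, 0.4],
    [0.2, 0.8, 0.5, 0.0],
    [0.0, 0.3, 0.7, 0.6],
]


def interest(user: str, item: str) -> float:
    """Look up the degree of interest of one user in one item."""
    return UP[users.index(user)][items.index(item)]


def top_items(user: str, n: int = 2) -> list[str]:
    """Return the n items this user is most interested in."""
    row = UP[users.index(user)]
    ranked = sorted(zip(items, row), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:n]]
```

A personalization service could call `top_items` to pick what to show a returning user; real systems derive the scores from logged behaviour rather than fixed numbers.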
Data is collected through explicit feedback or implicit
feedback[16]. Implicit feedback is un-obtrusive and uses
cookies, browser cache, proxy servers and the other tools to
gather information about user habits and profile. Explicit
feedback requires users to knowingly submit information
regarding their personalities to the website. Explicit feedback
uses selection tools, surveys, feedback forms to gather
information regarding users’ persona, habits and preferences.
Both modes of feedback have their advantages and
disadvantages[16].
In their research, Quiroga and Mostafa compared the strength
of each feedback mode in a test in which the system suggests
the documents a specific user needs based on his or her
profile [16].
Eighteen users each worked with a system containing 6,000
health records, categorized into 15 different areas. Each user
had to use all 15 categories.
For explicit feedback, the user was presented with a form
which collects the user's chosen preferences based on the
suggestions given. This input is used as the user's profile
and to provide automatic suggestions of documents needed by
the user.
For implicit feedback, on the other hand, the usage activity
on the system and the documents viewed are automatically
logged and used as the user's profile. As with explicit
feedback, the profile is used to suggest the health records in
which the sample users would be interested.
Finally, the accuracy of the implicit and explicit feedback
modes was compared with the accuracy of a system which made
use of both feedback methods.
The results showed that while the accuracy of the explicit and
implicit feedback modes was almost the same, the accuracy of
the system which made use of both feedback methods was far
greater [16].
The end result highlighted how automated systems can help
provide a better picture of the user when combined with
explicit feedback.
In research by Teevan et al., data such as browsing history,
documents stored on the computer and email history was added
to a client-profiling agent. They found that the more
information available to the client-profiling agent, the
better the profile performed in providing results which
matched the user's intentions. In addition, results produced
with a profile out-performed those from a non-personalized
search[18].
Results from several other researchers back up the usefulness
of the data collected by user-tracking tools.
IV. Literature Review
Cookies:- When a user visits a website, cookies are sent to the
client's browser to uniquely identify the user's browser[2].
The browser then sends them back together with each later
request made by the user. Cookies are small text files stored
by the browser on a computer when a person visits a
website[2].
Cookies usually contain a serial number which uniquely
identifies a user. Once they are put on a user's computer, they
track the user's activity on the website and send this
information back to the website owners [6]. Whenever a user
returns to the website, the web server uses the unique
identifier to retrieve the user's record from its database.
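The identifier round trip described above can be sketched in a few lines. This is a minimal illustration, not a production server: the cookie name `uid`, the in-memory `records` store and the `handle_request` helper are all assumptions made for the example.

```python
from http.cookies import SimpleCookie
from typing import Optional
import uuid

# Hypothetical in-memory "database" mapping cookie identifiers to visit records.
records: dict = {}


def handle_request(cookie_header: Optional[str]):
    """Return (Set-Cookie value or "", user record) for one request."""
    jar = SimpleCookie(cookie_header or "")
    if "uid" in jar and jar["uid"].value in records:
        record = records[jar["uid"].value]  # returning visitor: look up the record
        record["visits"] += 1
        return "", record
    uid = uuid.uuid4().hex  # new visitor: issue a fresh identifier
    records[uid] = {"visits": 1}
    out = SimpleCookie()
    out["uid"] = uid
    return out["uid"].OutputString(), records[uid]
```

On the first call the function issues a cookie; when the same value comes back in a later request, the stored record is found again and updated.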
There are two classes of cookies: session cookies and
persistent cookies. Cookies are generally used for session
handling, authentication, identification of clients and storage
of site preferences.
Normal cookies (Session Cookies) are data saved by a
website onto the user's computer during a visit to the site.
Session cookies reside on a client's computer only for the
duration of his or her browsing session[12].
When the browser is closed, these cookies are deleted
automatically. They do not store any personal information.
They are used by commercial websites and are mostly
employed for shopping cart functionality[9].
On the other hand, Tracking cookies (Persistent Cookies)
are a specialized type of cookie that can be shared by multiple
websites. Persistent cookies remain stored in a user's browser
even after the browsing session ends[12]. These cookies are
used to identify individual users and by website owners to
analyze users' surfing habits on their websites[17]. Cookies
keep track of which advertisements the user has already seen
on the site. Personal information, however, is not generated
by the cookies but by the user's own input into the website
through order forms, registration pages, payment pages and
other online forms[10][12].
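In practice, the difference between the two classes shows up in the cookie's attributes: a cookie sent without an `Expires` or `Max-Age` attribute lasts only for the browsing session. The sketch below labels cookies on that basis; the helper name `cookie_kind` and the sample values are hypothetical.

```python
from http.cookies import SimpleCookie


def cookie_kind(set_cookie_value: str) -> dict:
    """Label each cookie in a Set-Cookie value as 'session' or 'persistent'."""
    jar = SimpleCookie(set_cookie_value)
    kinds = {}
    for name, morsel in jar.items():
        # With neither Expires nor Max-Age, the cookie ends with the browser session.
        has_lifetime = bool(morsel["expires"] or morsel["max-age"])
        kinds[name] = "persistent" if has_lifetime else "session"
    return kinds
```

For example, `cookie_kind("cart=42; Path=/")` labels `cart` as a session cookie, while a value carrying `Max-Age=31536000` is labelled persistent.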
Flash cookies are stored by the Adobe Flash plugin. These
cookies usually back up data from regular cookies. If a user
deletes regular cookies, flash cookies still keep the data. A
website that placed a cookie on a client's computer can still
recognize the user even when the cookie is deleted, as long as
it is backed up in a Flash cookie[3].
Cookies offer several benefits to web users. They remember
personal information (such as name, address, payment
information, email addresses and much else), so one does not
need to refill website forms or repeat the same tasks over and
over[10]. Cookies remain on a user's computer for a long
time, thereby making a website much easier for the client to
access[10].
However, depending on a website's policy, the data collected
by cookies may be sold to third parties such as marketing
firms, advertisers and junk mailers[13].
Cookies can be a very powerful tool for tracking a particular
user. Because cookies help remember the online footprint of a
returning user, once the client computer's IP address is
resolved with geo-location tools (such as JavaScript or the
Google Maps API), the physical location of the user can be
matched with his or her online profile or footprint.
Scripts:- Code that runs in a web browser is one of the most
powerful tools used to track user activity online. These scripts
are based on JavaScript. They can be either client-side
JavaScript or server-side JavaScript implementations whose
output is translated into browser-readable language or scripts.
JavaScript was created in 1995 to allow the browser to
become more interactive. Since then, it has become a language
used in network programming, game development and the
creation of mobile and desktop applications[4].
Marketers add JavaScript to their websites through the source
code or the template of their web pages to collect visitor data.
When visitors load the web pages for the first time, a script
is sent which generates user browser data for storage on the
client computer.
The collected data is stored for a long time, so when the user
returns, the stored data can identify him or her as a returning
user. For storing such data, JavaScript is aided by cookies[15].
As for the server-side JavaScript implementations, this code
is processed by the server and its output is sent to the browser
for further processing. It collects information on location
data, social network activity, music selections, movie
preferences and user behaviour.
This data is sent back to the marketers or social networks,
stored, analyzed and put to various uses[15].
Due to the vast amount of data which can be collected by
JavaScript, privacy activists or organizations fight to curtail
the excessive use of JavaScript for tracking users online[13].
Advantages of using JavaScript for user tracking include the
fact that it is mostly unobtrusive to the user experience on
websites. In addition, it often enhances the user experience
by tailoring products to the user profile. It sometimes also
provides useful suggestions as to related products or
information which a user might need[4].
In addition, as browsers become faster and more stable,
JavaScript code can run very fast in the background while the
user undertakes his or her web activities[1][4].
Furthermore, the user still has control over the use of these
scripts. By customizing browser JavaScript preferences or by
installing tools such as NoScript, the user can block scripts not
needed[15][17][20].
There are however several disadvantages of scripts. These
scripts provide an opportunity for hackers and malicious script
authors to run scripts on a client's computer which could be
potentially harmful. However, browser vendors are aiming to
restrict this problem by running scripts in a sandbox and
restricting sites to a same-origin policy[1].
Social Networks Tracking: This comprises both online social
networks (OSN) and mobile online social networks, which are
social-based services such as Facebook, MySpace, Twitter,
Instagram and many more. With the help of these immense
social-based services, individuals have been able to share
some of their personal information together with a number of
entities, such as companies, events, public places and current
locations, through the use of the Geo location API, the Geo
latitude API and the check_In plug-in[15][17].
Furthermore, browsers such as Chrome, Internet Explorer,
Firefox, Safari and Opera support these functions. Geo
location is, however, much more accurate on devices with GPS
compatibility, such as smartphones. Geo location and Geo
latitude make use of web technologies such as JavaScript, HTML,
CSS and PHP to operate effectively, as they are associated
with Google Maps in determining a user's position during
tracking[13][17].
Keyloggers: As the name suggests, key-loggers log the keys
typed on users' computers. Key-loggers can take the form of
either software or a hardware device[11].
Legitimate programs use key-logging functionality to capture
hotkeys and provide additional functionality to the user.
Key-loggers can, however, also be put to illegitimate use.
These run invisibly while recording every keystroke
made. They also log browsing activity, applications used and
screenshots of the computer[11].
Key-loggers are sometimes installed as spyware via bugged or
cracked software. The user believes he is installing
legitimate software but, inadvertently, the key-logger
installs separately and silently.
Legitimate key-loggers, however, are installed along with
bundled software to enhance the user experience.
An advantage of using key-loggers to track users is that the
data collected from the client's system can be used to improve
products and services[11][15]. Such products and services
include auto-complete or spell-check features.
Server Logs: A server log file is one of the tools employed by
websites to track the activities of their users online or on the
website[7]. The log file is a file (or sometimes several files)
created by the server and consists of all activities and
requests performed by the user.
Whenever a user visits a website, the web server
automatically collects information on the new user. Typical
server log information includes the IP address of the client,
the referrer link, the operating system version, the user
agent, the page requested and the time and date of the client
request[7].
A typical server log file looks like the image below:
Figure: A typical server log[7]
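To make the listed fields concrete, here is a minimal sketch of reading one log entry. It assumes the server writes the widely used Apache-style "combined" log format; the sample line, the IP address and the URLs are invented for illustration and are not taken from [7].

```python
import re

# Assumed layout: Apache-style "combined" log format. The sample line is invented.
LINE = ('203.0.113.7 - - [10/Oct/2015:13:55:36 +0000] '
        '"GET /products/index.html HTTP/1.1" 200 2326 '
        '"http://example.com/start.html" '
        '"Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0"')

PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line: str) -> dict:
    """Split one log line into the fields a tracker would read from it."""
    match = PATTERN.match(line)
    if match is None:
        raise ValueError("line is not in the assumed combined log format")
    return match.groupdict()


entry = parse_line(LINE)
```

The resulting `entry` exposes the client IP address, the request time, the page requested, the referrer and the user agent, from which the operating system and browser can be read.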
While server logs do not typically collect information on
specific users, the information provided can be used to track
a user's browsing activity or pattern. It is important to note
that users have no control whatsoever over the data collected
by web servers. This raises ethical and legal questions over
data mining[13]. Server logs are accessible only to the web
administrator or webmaster. Information gathered from
server logs is typically used to analyze web traffic
patterns, URL referrers or user agents.
Data logging by servers has both advantages and
disadvantages.
One of the advantages of server logs is that they aid in
resolving issues. When a customer has problems using an
e-commerce website, technical problems can be resolved with
only a little information from the client, thanks to the
abundant information provided by the server log(s)[7].
Besides, marketers utilize the log file to monitor trends and
tailor products and services to meet the demands of end-users.
In addition, information provided by the server logs enables
webmasters and system administrators to fix loopholes and
better ensure optimum connectivity for their web clients[6][7].
On the other hand, the lack of transparency in terms of what
data is logged can raise privacy and subsequently legal issues.
Users have no control over how companies process
information obtained through server logs[13]. While a Terms
of Service document may provide answers to this, there are
situations whereby there is no such document[13].
Furthermore, since users have no control over how
organizations process data logs, user tracking information
such as IP addresses or page requests can be sold to third
parties without the end-client's consent. Data collected from
server logs can easily be stored into a database for further
analysis or usage[13].
Server log analytics software includes Google Analytics, Deep
Log Analyzer, AWStats, Piwik and Webalizer. These tools
provide several indicators to marketers or website owners[7].
Scenario: A growing online marketplace with about 3000
users, seen as one of the most popular leading companies
providing internet services that let users earn an income from
home using a PC and an internet connection.
V. Proposed Method
The proposed method involves steps for obtaining user tracking
information. Information such as usernames, passwords, cookie
data, browsing activity and IP addresses will be gathered and
stored in log files. The unsuspecting user is attacked without
his knowledge. While browsing the internet, he may fail to
discover that the certificates sent to his computer are spoofed
certificates self-signed by the attacker. However, anti-virus
software and modern browsers may trigger warnings to the
unsuspecting user.
If the user fails to heed the warnings and continues browsing,
all data between the user and the internet passes through the
attacker's computer. From then on, no further suspicious
warnings are given to the user.
Fig. 2: Logical representation.
Our sample user is called Benedict, a student of Middlesex
University. His web surfing session begins when he starts
surfing and ends when he quits the browser. Our test subject is
about to browse on Facebook, check his emails on Yahoo and
see course materials on Unihub.
Figure __: Benedict has to provide Google with information such as his name,
email address, mobile phone number, date of birth and gender.
Benedict provides his username and password on all three
websites.
However, on another computer, the attacker, Ivan, uses Cain
and Abel to scan the wireless network for computers
available for exploitation. In Cain and Abel, Ivan resolves the
host names of the captured systems and sees Benedict's
computer online.
He begins his man-in-the-middle attack by choosing to
capture data between the wireless router and Benedict's system.
This mode of attack tricks Benedict's computer into thinking
that Ivan's computer is the router. The router in turn thinks
that Ivan's computer is Benedict's computer. He thereafter
starts poisoning the traffic between the router and Benedict.
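From the victim's side, this kind of impersonation can leave a visible trace. The sketch below assumes the poisoning works at the ARP level, which the paper does not state explicitly, and that the victim can read his own ARP table as a mapping from IP address to MAC address. The helper name and all addresses are hypothetical.

```python
from collections import defaultdict


def suspicious_macs(arp_table: dict) -> dict:
    """Return MAC addresses that more than one IP address maps to."""
    ips_by_mac = defaultdict(list)
    for ip, mac in arp_table.items():
        ips_by_mac[mac.lower()].append(ip)
    return {mac: sorted(ips) for mac, ips in ips_by_mac.items() if len(ips) > 1}


# Hypothetical ARP table on Benedict's computer during the attack:
# the router (192.168.1.1) and Ivan's machine (192.168.1.23) share one MAC.
table = {
    "192.168.1.1": "AA:BB:CC:DD:EE:FF",
    "192.168.1.23": "aa:bb:cc:dd:ee:ff",
    "192.168.1.40": "11:22:33:44:55:66",
}
```

A MAC address shared by the router's IP address and another host is a hint, not proof, that traffic is being redirected; some legitimate network set-ups produce the same pattern.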
When Benedict clicks the log-in button after entering his
username and password, this data is captured in plain text and
can be seen by Ivan on his computer.
A cookie is placed on Benedict's computer upon login. This
cookie is also captured by Ivan and stored in several log files.
Besides using the username and password to log in to any
of Benedict's sites, Ivan uses the Cookie Manager Firefox
add-on to replace his own Facebook cookie data in Firefox with
Benedict's. Thereafter, when Ivan accesses Facebook, he is
recognized as Benedict and can masquerade as him.
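The reason a replayed cookie is enough can be shown with a simplified model of a server that identifies users by session cookie alone. This is an illustrative sketch, not how Facebook actually works: the cookie name `session`, the session identifier and the `sessions` store are all hypothetical.

```python
from http.cookies import SimpleCookie

# Hypothetical server-side session store: session id -> logged-in user.
sessions = {"3f9a1c": "benedict"}


def who_is_logged_in(cookie_header: str) -> str:
    """Resolve the user for a request from its Cookie header alone."""
    jar = SimpleCookie(cookie_header)
    session_id = jar["session"].value if "session" in jar else ""
    return sessions.get(session_id, "anonymous")


benedict_request = "session=3f9a1c"  # sent by Benedict's browser
ivan_request = "session=3f9a1c"      # same value, replayed from Ivan's browser
```

Because the lookup sees only the cookie value, both requests resolve to the same user; nothing in this model tells the server which machine sent the request.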
VI. Conclusion
In conclusion, this report has analyzed how users are tracked
on the internet. It highlighted the advantages and
disadvantages of user tracking. It also briefly outlined what
the information from user tracking is used for. Finally, it
outlined a sample exercise in which a man-in-the-middle
attack was conducted in order to gather user tracking
information which passes over the internet. This exercise was
successful, and the attack can be replicated.
Bibliography
[1] ADsafe, Making JavaScript Safe for Advertising., 2015.
[2] Allaboutcookies, All About Computer Cookies - privacy
concerns on cookies, 2015.
[3] M. Brinkmann, Flash Cookies explained - gHacks Tech
News, 2007.
[4] D. Flanagan and P. Ferguson, JavaScript, 5 ed., O'Reilly,
2006.
[5] S. Gauch, M. Speretta, A. Chandramouli and A.
Micarelli, "User profiles for personalized information
access.," The adaptive web, vol. 1, no. 2, pp. 54-89, 2007.
[6] Java Republic, Privacy Policy - Java Republic, 2015.
[7] L. Joshila Grace, V. Maheswari and D. Nagamalai,
"Analysis of Web Logs And Web User In Web Mining,"
International Journal of Network Security & Its
Applications, vol. 3, no. 1, pp. 99-110, 2011.
[8] N. Kamaraj and M. Chandran, "Tracking Down Travel
Agencies Geo-location using Software Engineering,"
IJCTT, vol. 9, no. 2, pp. 49-52, 2014.
[9] J. P. Kesan and R. C. Shah, "Deconstructing Code," Yale
Journal of Law & Technology, vol. 6, pp. 277-389, 2004.
[10] D. M. Kristol, "HTTP Cookies: Standards, privacy, and
politics," ACM Transactions on Internet Technology, vol.
1, no. 2, pp. 151-198, 2001.
[11] M. Kusuma-Atmadja, "Some Thoughts on ASEAN
Security Co-Operation: An Indonesian Perspective,"
Contemporary Southeast Asia, vol. 12, no. 3, pp. 161-
171, 1990.
[12] Microsoft, Description of Persistent and Per-Session
Cookies in Internet Explorer, 2007.
[13] A. D. Miyazaki, "Online Privacy and the Disclosure of
Cookie Use: Effects on Consumer Trust and Anticipated
Patronage," Journal of Public Policy & Marketing, vol.
27, no. 1, pp. 19-33, 2008.
[14] B. Mobasher, R. Cooley and J. Srivastava, "Automatic
personalization based on Web usage mining," Commun.
ACM, vol. 43, no. 8, pp. 142-151, 2000.
[15] OpenTracker, How does user-tracking work?, 2015.
[16] L. M. Quiroga and J. Mostafa, "Empirical evaluation of
explicit versus implicit acquisition of user profiles in
information filtering systems," in Proceedings of the
fourth ACM conference on Digital libraries, ACM, 1999,
pp. 238-239.
[17] N. Schmuker, Web Tracking, 1 ed., Berlin University of
Technol, 2011, pp. 1-3.
[18] J. Teevan, S. T. Dumais and E. Horvitz, "Personalizing
search via automated analysis of interests and activities,"
in Proceedings of the 28th annual international ACM
SIGIR conference on Research and development in
information retrieval, ACM, 2005, pp. 449-456.
[19] Wikipedia, HTTP cookie, 2015.
[20] H. Xia and J. C. Brustoloni, "Hardening web browsers
against man-in-the-middle and eavesdropping attacks,"
in Proceedings of the 14th international conference on
World Wide Web, ACM, 2005.