2. Web Metrics
• Some interesting terms
– Hits: an action on a website such as when a user
views a page or downloads a file
– When a page is an assembly of downloaded
objects the hits include all the downloaded files!
– Artificially inflated
– If anything, hits are at the bottom of the pyramid
– Measuring hits does not mean anything!
12/03/18 Professor V. Nagadevara
3. Click-Through
• Click-through is the number of people who click on a
particular thing – a link, an ad word, banner ad, etc.
• “Percentage of users who click on a viewed
advertisement”
• Click-oops-back is part of the click-through
• A better measure is the response rate: the number of
visitors who arrived at the site after clicking the link
• “we had 200,000 hits last month” or “our click-through rate
from Yahoo is 95%”
4. Page View
• What is a page?
– Static pages, pages measured by page tags, documents are
okay
– Dynamically generated pages: is the entire page a view, or is
each object a view?
– Repeat views: are they different from the original view? How to
decide when it is a reload?
– How to consider frames?
• A successful loading of any document containing content that
was requested by a website visitor regardless of the delivery
mechanism or the number or frequency with which the said
content is delivered
5. Visits
• A visit is a session or user session
• Defined as any period of activity separated by
m minutes (say 30 minutes).
• A visit is counted when a unique visitor creates an activity on a
webpage measured via a page statistic, regardless of the
duration of the activity, as long as the period of inactivity
between page views does not exceed the pre-specified limit
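The visit definition above can be sketched in code. This is a minimal illustration, assuming one visitor's page-view timestamps are already available; the function name `count_visits` and the sample timestamps are hypothetical, and the 30-minute limit is the example value from the slide.

```python
from datetime import datetime, timedelta

def count_visits(timestamps, timeout=timedelta(minutes=30)):
    """Count visits for one visitor from page-view timestamps."""
    visits = 0
    previous = None
    for ts in sorted(timestamps):
        # A new visit starts on the first page view, or when the gap
        # since the previous page view exceeds the inactivity limit.
        if previous is None or ts - previous > timeout:
            visits += 1
        previous = ts
    return visits

# Hypothetical page views for a single visitor.
views = [
    datetime(2018, 3, 12, 9, 0),
    datetime(2018, 3, 12, 9, 10),
    datetime(2018, 3, 12, 9, 35),   # 25 minutes after the previous view: same visit
    datetime(2018, 3, 12, 10, 30),  # 55 minutes after the previous view: new visit
]
print(count_visits(views))  # 2
```

The duration of each visit plays no role; only the gaps between consecutive page views decide where one visit ends and the next begins.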
6. Visitor/Unique visitor
• Human beings are visitors; bots are not
• How to determine uniqueness?
• A unique visitor is counted when a human being uses a web
browser to visit a website, regardless of the number of pages,
or duration. A visitor can be unique for different periods of
time
• A unique visitor for any arbitrary time frame should be
counted once and only once, on their first visit
between the start date and end date
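A small sketch of counting unique visitors for a time frame. It assumes each visit already carries some visitor identifier (for example a cookie value) and that bot traffic has been filtered out; `unique_visitors` and the sample log are hypothetical.

```python
from datetime import date

def unique_visitors(visit_log, start, end):
    """visit_log: iterable of (visitor_id, visit_date) pairs."""
    # A set keeps each visitor once, however many visits fall in the range.
    return len({vid for vid, day in visit_log if start <= day <= end})

# Hypothetical visit log.
log = [
    ("a", date(2018, 3, 1)),
    ("b", date(2018, 3, 2)),
    ("a", date(2018, 3, 9)),
    ("c", date(2018, 3, 20)),
]
print(unique_visitors(log, date(2018, 3, 1), date(2018, 3, 7)))   # 2
print(unique_visitors(log, date(2018, 3, 8), date(2018, 3, 14)))  # 1
print(unique_visitors(log, date(2018, 3, 1), date(2018, 3, 31)))  # 3
```

Visitor "a" is unique in each of the two weeks but is counted only once for the month, which is why weekly unique-visitor counts cannot simply be added up to a monthly figure.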
7. Referrer
• This is the source of traffic to the website
• Referring URL or referring Domain
• How to report when there is no referrer?
– Refer to them as “bookmarks or directly referred”
• Group the source into different referrers
– Banner ads, paid search engines, partners, affiliate programs
etc.
• A referrer to any website should be an undifferentiated
and complete uniform resource locator describing the
exact page on the referring website that contained the
link to the site in question
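As an illustration of reporting by referring domain, the sketch below reduces full referring URLs to their domain and labels requests without a referrer. The helper name `referring_domain` and the example URLs are hypothetical; the label for missing referrers follows the wording of the slide.

```python
from collections import Counter
from urllib.parse import urlparse

def referring_domain(referrer_url):
    """Return the referring domain, or a label when there is no referrer."""
    if not referrer_url:
        return "bookmarks or directly referred"
    return urlparse(referrer_url).netloc.lower()

# Hypothetical referrer values as they might appear in a log.
referrers = [
    "https://search.example.com/results?q=web+metrics",
    "",
    "https://partner.example.org/links.html",
    None,
    "https://search.example.com/results?q=hits",
]
report = Counter(referring_domain(r) for r in referrers)
for source, count in report.most_common():
    print(source, count)
```

Grouping domains further into categories such as banner ads, paid search, or partners would need a site-specific mapping, which is not shown here.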
8. Conversion Rate
• One of the most important metrics for an
ecommerce site or online business
• A conversion rate is the number of completers
divided by the number of starters for any on-line
activity that is more than one logical step in length
• Starting and ending points do not matter
• There can be many conversion rates for a given
website
• Can be measured by using views, visits, visitors
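The conversion-rate definition amounts to a single division. A minimal sketch, with hypothetical counts; the unit (views, visits, or visitors) must be the same for starters and completers.

```python
def conversion_rate(starters, completers):
    """Completers divided by starters for one multi-step activity."""
    if starters <= 0:
        raise ValueError("conversion rate is undefined without starters")
    return completers / starters

# Hypothetical example: 400 visits began checkout, 100 completed it.
checkout_rate = conversion_rate(starters=400, completers=100)
print(checkout_rate)  # 0.25
```

Because any activity with more than one logical step qualifies, the same function can be applied separately to a newsletter sign-up, a download form, or a purchase, giving one conversion rate per activity.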
9. Abandonment Rate
• The abandonment rate for any step in a multi-step process is
“one minus the number of units that make it to step n+1
divided by those at step n”.
• AR_n = 1 - (N_{n+1} / N_n), where N_n is the number of units at step n
• It can be measured by using visits, visitors, views etc.
• More important if the abandonment (leakage) occurs
before the conversion
• There are k-1 abandonment rates for a process involving k
steps
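The step-by-step abandonment rates can be computed directly from the counts at each step. A minimal sketch; the funnel numbers are hypothetical and `abandonment_rates` is an illustrative name.

```python
def abandonment_rates(step_counts):
    """Return AR_n = 1 - N_{n+1} / N_n for each consecutive pair of steps."""
    return [
        1 - following / current
        for current, following in zip(step_counts, step_counts[1:])
    ]

# Hypothetical 4-step process: units (e.g. visits) reaching each step.
funnel = [1000, 400, 200, 100]
rates = abandonment_rates(funnel)
for step, rate in enumerate(rates, start=1):
    print(f"step {step} -> step {step + 1}: {rate:.0%} abandoned")
```

A process with k = 4 steps yields k - 1 = 3 rates, here roughly 60%, 50%, and 50%. The overall conversion rate for the same process is the last count divided by the first, 100 / 1000.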
10. Loyalty, Frequency and Recency
• Measures how well the traffic is maintained
• Loyalty is the measure of the number of visits any visitor is
likely to make over their lifetime as a visitor.
• It is the raw number of visits all visitors have made since the
beginning of measurement, de-duplicated
• Example: 100 visitors made 6 visits, 92 visitors made 7 visits
• De-duplication adjusts the counts when a visitor moves
from 6 to 7 visits, so that each visitor is counted in one group only
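A de-duplicated loyalty report can be sketched as a count of visitors per number of visits. This assumes a log with one visitor identifier per visit; the function name and sample data are hypothetical.

```python
from collections import Counter

def loyalty_distribution(visit_log):
    """visit_log: iterable of visitor ids, one entry per visit.

    Returns a Counter mapping 'number of visits' -> 'number of visitors'.
    """
    visits_per_visitor = Counter(visit_log)
    # Each visitor contributes to exactly one group (de-duplication):
    # a visitor with 3 visits is not also counted under 1 or 2 visits.
    return Counter(visits_per_visitor.values())

# Hypothetical log: visitor "a" made 3 visits, "b" made 1, "c" made 2.
log = ["a", "a", "b", "a", "c", "c"]
distribution = loyalty_distribution(log)
for visits, visitors in sorted(distribution.items()):
    print(f"{visitors} visitor(s) made {visits} visit(s)")
```

When a visitor returns once more, recomputing the report moves that visitor from one group to the next, which is the adjustment the slide describes.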
11. Loyalty, Frequency and Recency
• Frequency is a measure of activity a visitor generates
on the website in terms of average time between
return visits.
• It is measured in logical groups or discrete numbers
of days between visits
• Data should be de-duplicated
• Reported as “once a day”, “once a week”
• Can also be presented as average days between
visits or distribution of return visitors in each group
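One way to compute frequency as the average number of days between visits is sketched below for a single visitor. The function name and the visit dates are hypothetical.

```python
from datetime import date

def average_days_between_visits(visit_dates):
    """Average gap in days between consecutive visits of one visitor."""
    days = sorted(visit_dates)
    if len(days) < 2:
        return None  # no return visit yet, so no gap to measure
    gaps = [(later - earlier).days for earlier, later in zip(days, days[1:])]
    return sum(gaps) / len(gaps)

# Hypothetical visitor who comes back every seven days.
visits = [date(2018, 3, 1), date(2018, 3, 8), date(2018, 3, 15)]
print(average_days_between_visits(visits))  # 7.0
```

A value of 7.0 would be reported in the "once a week" group; running this per visitor and counting visitors per group gives the distribution of return visitors.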
12. Loyalty, Frequency and Recency
• Recency is the number of days since the last
visit
• Reported as the number of visitors who
returned after d days
• Usually used in the context of online
purchase, “as the number of days since last
purchase”
• Recency is a moving target
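Recency can be sketched as a count of visitors by days since their last visit. This assumes the date of each visitor's most recent visit is known; the names and dates are hypothetical.

```python
from collections import Counter
from datetime import date

def recency_distribution(last_visits, report_date):
    """last_visits: dict of visitor_id -> date of most recent visit.

    Returns a Counter mapping 'days since last visit' -> 'number of visitors'.
    """
    return Counter((report_date - day).days for day in last_visits.values())

# Hypothetical last-visit dates.
last_visits = {
    "a": date(2018, 3, 10),
    "b": date(2018, 3, 10),
    "c": date(2018, 3, 1),
}
report = recency_distribution(last_visits, date(2018, 3, 12))
print(report[2], report[11])  # 2 1
```

Running the same report one day later shifts every visitor who has not returned by one day, which is why recency is a moving target.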
13. Value Pyramid
• Pyramid levels, from top to bottom:
– Uniquely Identified Visitors
– Unique Visitors
– Visits
– Page Views
– Hits
• Axis labels: Volume of Available Data; Increasing Value of Data
15. Authoritative Web Pages
• Search Engines need to retrieve relevant
pages which are of high quality or
Authoritative
• Authority is in the hyperlinks
• Hyperlink is an endorsement
• Collective endorsement of a given page by
different authors indicates its importance
16. Limitations
• Not All Links Are Meant for Endorsement
• Some Are Navigation
• Some Are for Paid Advertisement
• Commercial or Competitive Interests Avoid Links to
Rivals’ Pages
• Real Authoritative Pages Are Seldom Self-Descriptive, e.g., a Search
Engine’s Page May Not Contain the Phrase “Web Search Engine”
17.
18. HUBs
• Hub is one or a set of web pages providing
collections of links to authorities
• Hub pages may not be prominent
• Have very few links pointing to them
• They provide a collection of prominent links to
a common topic
19. Hubs
• These could be
– Recommended links in individual home pages
– Recommended reference sites
– Professionally assembled resource lists
– Professional collection of commercial sites
20.
21. Hubs and Authorities
• A good hub is one which points to many good
authorities
• A good authority is one which has many good
hubs pointing to it
• This mutual reinforcement is the basis for
mining
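The mutual reinforcement between hubs and authorities can be sketched as an iterative update. This is an illustration only: the link graph is hypothetical, and the fixed iteration count and the normalization by Euclidean length are assumptions of this sketch.

```python
def hub_authority_scores(links, iterations=20):
    """links: dict mapping a page to the list of pages it points to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages pointing to the page.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub score: sum of the authority scores of the pages it points to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded across iterations.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Hypothetical graph: h1 and h2 are link collections, a1 and a2 are content pages.
links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hub_authority_scores(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))  # a1 h1
```

Page a1 ends up as the stronger authority because both hubs point to it, and h1 as the stronger hub because it points to both authorities.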
22. Page Rank Algorithm
• Static concept
– Computed off-line and does not depend on the
search query
– It can be regarded as the “prestige”
– “In-links” of page p are the hyperlinks from other
pages that point to page p. Ignore the in-links
from the same website
– “Out-links” of page p are the hyperlinks that point
to other pages from page p
23. Page importance
• A hyperlink from a page pointing to another page is
an implicit endorsement (conveyance of authority) of
the target page. The more in-links page p has, the
more “prestige” page p has.
• Pages that point to page p also have their own
prestige. A page with higher prestige pointing to
page p is more important than a page with lower
prestige score pointing to the same page (page p)
• A page is more important if it is pointed to by other
important pages.
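The idea that a page is important when important pages point to it can be sketched as a repeated redistribution of scores along out-links. The slides do not give the formula, so this sketch uses the commonly published form with a damping factor; the value 0.85, the iteration count, the even spreading of score from pages without out-links, and the tiny graph are all assumptions. In-links from the same website are not filtered out here.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping a page to the list of pages it points to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Score held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        # Each page passes its score in equal shares along its out-links.
        for p, targets in links.items():
            for q in targets:
                new_rank[q] += damping * rank[p] / len(targets)
        rank = new_rank
    return rank

# Hypothetical graph: a and b both point to c; c points back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"]}
rank = pagerank(links)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

Page c ranks highest because it has two in-links. Page a has only one in-link, but it comes from the high-prestige page c, so a ranks well above b, which has none.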
25. Why page Rank?
• Not influenced by search string
• Page rank values are computed and stored. A
look-up is done at the time of the query.
Hence very efficient
• Hard to spam: it is not easy to create links
in other pages pointing to your page
• Limitation: cannot distinguish between pages that are
authoritative in general and those which are
authoritative on the query topic
26. Hypertext Induced Topic Search (HITS)
• HITS is search query dependent
• Uses Authority ranking and Hub ranking
• It sends a query q and collects the t highest ranked
pages (t = 200 in the original paper); this is the root set W
• Then it grows W by including any page pointed
to by a page in W and any page that points to a
page in W creating a larger set S.
• It limits the size by allowing each page in W to
bring in at most k pages (k=50 in the paper)
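The growth of the root set W into the larger set S can be sketched as follows. This is a minimal illustration following the slide's wording, in which each root page brings in at most k neighbouring pages; which k pages are kept (here simply the first k, out-links before in-links) is an assumption, and the link data is hypothetical.

```python
def expand_root_set(root, out_links, in_links, k=50):
    """Grow root set W into base set S.

    out_links: dict page -> pages it points to
    in_links:  dict page -> pages that point to it
    """
    base = set(root)
    for page in root:
        neighbours = list(out_links.get(page, [])) + list(in_links.get(page, []))
        # Cap the contribution of each root page at k pages.
        base.update(neighbours[:k])
    return base

# Hypothetical link data around a single root page.
root = ["w1"]
out_links = {"w1": ["x", "y"]}
in_links = {"w1": ["z"]}
print(sorted(expand_root_set(root, out_links, in_links)))       # ['w1', 'x', 'y', 'z']
print(sorted(expand_root_set(root, out_links, in_links, k=2)))  # ['w1', 'x', 'y']
```

The cap keeps a single heavily linked root page from flooding S with neighbours, which also limits how much irrelevant material enters the set.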
28. HITS
• Ranks according to the query topic.
• Provides more relevant authorities and hubs
• Easy to spam: create a hub pointing to many
high ranking authorities to boost its hub score H(i)
• Problem of topic drift. Expanding the root set
can lead to collecting irrelevant pages
• Time consuming
29. Layout of Search Results
Name | Description | Position
Organic | Results from Web crawl; “objective hits” not influenced by direct payments | Central on results page
Sponsored | Paid results, separated from the organic results list | Above or below organic results, on the right-hand side of the results list
Shortcuts | Emphasized result pointing to results from a special collection | Above organic results, within organic results list
Primary search result | Extended result that points to different collections; it comes with an image and further information | Above organic results, often within organic results
Prefetch | Result from a preferred source, emphasized in the results set | Above or within organic results
Snippet | Regular organic result with result description extended by additional navigational links | Within organic results list (usually first position only)
Child | Second result from the same server with a link to further results from same server | Within organic results list; indented
30. Research Questions
• RQ1: How many sponsored links are on the results screen?
• RQ2: Are there popular hosts, domains, and content types
preferred by a certain search engine?
• RQ3: Is there a difference between search engines regarding
the results types presented?
• RQ4: How many specially displayed results are on the first
results page?
• RQ5: To what extent are shortcuts used on the results pages?
• RQ6: What is the difference between search engines regarding
the questions above?
31. Methodology
• Search engines Google, Yahoo, MSN/Live.com, and Ask.com.
• Obtained 500 queries from the top 100k queries and 500 from
the last 100k (long tail) queries from Ask.com
• Sorted the list of all queries by frequency and alphabetically.
• Selected every two-hundredth query so that the selection is
representative of both the popular queries and the very
rare queries in the long tail.
• A script was developed to automatically download the search
engines’ results pages and to analyze the HTML code.
• Every result on those pages was categorized as organic, paid
advertisement, snippet, etc.
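The query selection step can be sketched as systematic sampling from a sorted list. This is only an illustration of taking every two-hundredth query from the first and last 100k entries; the function names are hypothetical and the query list is synthetic, standing in for the real query log.

```python
def every_nth(items, n):
    """Return every n-th item (the n-th, 2n-th, ...) of a list."""
    return items[n - 1::n]

def sample_queries(sorted_queries, block=100_000, step=200):
    """Sample from the most popular and the rarest block of a sorted query list."""
    popular = every_nth(sorted_queries[:block], step)
    rare = every_nth(sorted_queries[-block:], step)
    return popular, rare

# Synthetic stand-in for a query list sorted by frequency, then alphabetically.
queries = [f"query-{i:06d}" for i in range(300_000)]
popular, rare = sample_queries(queries)
print(len(popular), len(rare))  # 500 500
```

Taking every two-hundredth entry from a block of 100k yields 500 queries spread evenly across that block, rather than 500 drawn only from its top.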
38. Use of shortcuts in Google
Result type | Popular: Count | Popular: Position | Rare: Count | Rare: Position
Prefetch | 170 | 1 | 69 | 1
Snippet | 290 | var | 196 | var
Images | 9 | 1 | 5 | 1
Total Shortcuts | 183 | – | 60 | –
Local Results | 2 | 1 | 4 | 1
Video | 108 | var | 127 | var
Blog search | 6 | 11+ | 4 | 11+
Books | 22 | 10+ | 9 | 10+
Calculator | 7 | 1 | 1 | 1
Dictionary | 1 | 1 | 3 | 1
News | 10 | 3/10 | 4 | 3/10
Scholar | 3 | 1 | 2 | 1
Shopping | 17 | var | 37 | var
Weather | 5 | 1 | 1 | 1
NASDAQ | 2 | 1 | – | –