WEB STRUCTURE MINING
12/03/18 Professor V. Nagadevara
Web Metrics
• Some interesting terms
– Hits: an action on a website, such as when a user
views a page or downloads a file
– When a page is assembled from several downloaded
objects, the hits include every downloaded file
– Hit counts are therefore artificially inflated
– Hits are at the bottom of the value pyramid
– Measuring hits alone says little about actual usage
Click-Through
• Click-through is the number of people who click on a
particular item: a link, an ad word, a banner ad, etc.
• Click-through rate: “percentage of users who click on a viewed
advertisement”
• Click-oops-back is counted as part of the click-through
• A better measure is the response rate: the number of
visitors who arrived at the site after clicking the link
• “We had 200,000 hits last month” or “Our click-through rate
from Yahoo is 95%”
Page View
• What is a page?
– Static pages, pages measured by page tags, documents are
okay
– Dynamically generated pages: is the entire page a view, or is
each object a view?
– Repeat views: are they different from the original view? How to
decide when it is a reload?
– How to consider frames?
• A successful loading of any document containing content that
was requested by a website visitor regardless of the delivery
mechanism or the number or frequency with which the said
content is delivered
Visits
• A visit is a session, or user session
• Defined as a period of activity separated from other activity by at least
m minutes of inactivity (say, 30 minutes)
• A visit is counted when a unique visitor creates activity on a
webpage, measured via a page statistic, regardless of the
duration of the activity, as long as the period of inactivity
between page views does not exceed the pre-specified limit
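The inactivity rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `count_visits`, the sample timestamps and the assumption that all page views belong to one already-identified visitor are hypothetical.

```python
from datetime import datetime, timedelta

def count_visits(timestamps, timeout_minutes=30):
    """Count visits for one visitor: a new visit starts whenever the gap
    between consecutive page views exceeds the inactivity limit."""
    limit = timedelta(minutes=timeout_minutes)
    visits = 0
    previous = None
    for ts in sorted(timestamps):
        if previous is None or ts - previous > limit:
            visits += 1
        previous = ts
    return visits

# Hypothetical page-view times for a single visitor.
views = [
    datetime(2018, 1, 1, 9, 0),
    datetime(2018, 1, 1, 9, 10),
    datetime(2018, 1, 1, 9, 35),   # 25-minute gap: still the same visit
    datetime(2018, 1, 1, 11, 0),   # 85-minute gap: a new visit starts
]
print(count_visits(views))  # 2
```

Changing `timeout_minutes` changes the visit count for the same data, which is why the limit has to be fixed before visits are compared across reports.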
Visitor/Unique visitor
• Human beings are visitors; bots are not
• How to determine uniqueness?
• A unique visitor is counted when a human being uses a web
browser to visit a website, regardless of the number of pages,
or duration. A visitor can be unique for different periods of
time
• A unique visitor for any arbitrary time frame should be
counted once and only once, on their first visit
between the start date and the end date
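A small sketch of counting unique visitors for a time frame, assuming each visit is already tagged with a visitor identifier (for example a cookie value) and that bot traffic has been filtered out beforehand. The identifiers, dates and the function name `unique_visitors` are hypothetical.

```python
from datetime import date

def unique_visitors(visits, start, end):
    """Count each visitor once if they have at least one visit in [start, end]."""
    return len({visitor for visitor, day in visits if start <= day <= end})

# Hypothetical (visitor id, visit date) records.
visits = [
    ("v1", date(2018, 1, 1)),
    ("v1", date(2018, 1, 2)),
    ("v2", date(2018, 1, 2)),
    ("v3", date(2018, 2, 5)),
]
print(unique_visitors(visits, date(2018, 1, 1), date(2018, 1, 31)))  # 2
print(unique_visitors(visits, date(2018, 1, 1), date(2018, 2, 28)))  # 3
```

In this data the daily counts for 1 and 2 January are 1 and 2, yet January as a whole has only 2 unique visitors: unique-visitor counts for sub-periods cannot simply be added up.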
Referrer
• This is the source of traffic to the website
• Referring URL or referring Domain
• How to report when there is no referrer?
– Refer to them as “bookmarks or directly referred”
• Group the sources into referrer categories
– Banner ads, paid search engines, partners, affiliate programs,
etc.
• A referrer to any website should be an undifferentiated
and complete uniform resource locator describing the
exact page on the referring website that contained the
link to the site in question
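As an illustration of reporting by referring domain, the sketch below reduces each referring URL to its domain and labels visits without a referrer as suggested above. The URLs are made-up examples and `referring_domain` is a hypothetical helper; only the standard-library `urllib.parse.urlparse` is assumed.

```python
from collections import Counter
from urllib.parse import urlparse

NO_REFERRER = "bookmarks or directly referred"

def referring_domain(referrer):
    """Return the referring domain, or a fixed label when there is no referrer."""
    if not referrer:
        return NO_REFERRER
    return urlparse(referrer).netloc.lower()

# Hypothetical referrer values taken from a log; empty or None means no referrer.
referrers = [
    "https://www.example.com/search?q=web+mining",
    "https://partner.example.org/links.html",
    "",
    None,
    "https://www.example.com/ads/banner1",
]
counts = Counter(referring_domain(r) for r in referrers)
for domain, n in counts.most_common():
    print(domain, n)
```

Keeping the complete URL in the raw data and deriving the domain only for reporting preserves the exact referring page, as the definition above requires.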
Conversion Rate
• One of the most important metrics for an
ecommerce site or online business
• A conversion rate is the number of completers
divided by the number of starters for any on-line
activity that is more than one logical step in length
• The choice of starting and ending points does not matter
• There can be many conversion rates for a given
website
• Can be measured using views, visits or visitors
Abandonment Rate
• The abandonment rate for any step in a multi-step process is
“one minus the number of units that make it to step n+1
divided by those at step n”
• AR_n = 1 - N_{n+1} / N_n, where N_n is the number of units at step n
• It can be measured using visits, visitors, views, etc.
• Abandonment (leakage) matters more when it occurs
before the conversion
• A process involving k steps has k-1 abandonment
rates
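The conversion-rate and abandonment-rate definitions can be checked on a small funnel. The step counts below are invented for illustration, and the function names are hypothetical; the sketch assumes every step count is non-zero.

```python
def conversion_rate(step_counts):
    """Completers (last step) divided by starters (first step)."""
    return step_counts[-1] / step_counts[0]

def abandonment_rates(step_counts):
    """AR_n = 1 - N_{n+1} / N_n for each pair of consecutive steps."""
    return [1 - nxt / cur for cur, nxt in zip(step_counts, step_counts[1:])]

# Hypothetical visits reaching each of four steps of a checkout process.
funnel = [1000, 500, 250, 100]

print(conversion_rate(funnel))                           # 0.1
print([round(r, 2) for r in abandonment_rates(funnel)])  # [0.5, 0.5, 0.6]
```

Four steps give three abandonment rates, matching the k-1 rule. The same functions work whether the counts are views, visits or visitors, as long as one unit is used throughout.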
Loyalty, Frequency and Recency
• Measures how well the traffic is maintained
• Loyalty is a measure of the number of visits any visitor is
likely to make over their lifetime as a visitor
• It is reported as the raw number of visits all visitors have made since the
beginning of measurement, de-duplicated
• Example: 100 visitors made 6 visits, 92 visitors made 7 visits
• De-duplication adjusts the counts when a visitor moves
from 6 to 7 visits, so that each visitor is counted only once
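One way to read the de-duplication rule is that every visitor appears in exactly one bucket, the one for their current total number of visits. The sketch below follows that reading; it is an assumption about the intended bookkeeping, and the visitor ids and the name `loyalty_distribution` are hypothetical.

```python
from collections import Counter

def loyalty_distribution(visit_log):
    """Map 'number of visits' to 'number of visitors with exactly that many visits'.

    visit_log holds one visitor id per visit since measurement began."""
    visits_per_visitor = Counter(visit_log)
    return Counter(visits_per_visitor.values())

log = ["a", "b", "a", "c", "a", "b"]       # a: 3 visits, b: 2, c: 1
print(dict(loyalty_distribution(log)))      # {3: 1, 2: 1, 1: 1}

log.append("c")                             # c returns and moves from 1 to 2 visits
print(dict(loyalty_distribution(log)))      # {3: 1, 2: 2}
```

After the extra visit, visitor c is no longer counted in the one-visit bucket, which is the adjustment described above for a visitor moving from 6 to 7 visits.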
Loyalty, Frequency and Recency
• Frequency is a measure of activity a visitor generates
on the website in terms of average time between
return visits.
• It is measured in logical groups or discrete numbers
of days between visits
• Data should be de-duplicated
• Reported as “once a day”, “once a week”, etc.
• Can also be presented as average days between
visits or distribution of return visitors in each group
Loyalty, Frequency and Recency
• Recency is the number of days since the last
visit
• Reported as the number of visitors who
returned after d days
• Usually used in the context of online
purchase, “as the number of days since last
purchase”
• Recency is a moving target
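Frequency and recency as defined above can be computed from one visitor's visit dates. This is a sketch under simple assumptions: visits are de-duplicated to one per day, and the dates and function names are hypothetical.

```python
from datetime import date

def average_days_between_visits(visit_days):
    """Frequency: mean gap in days between consecutive visits (None if fewer than 2)."""
    days = sorted(set(visit_days))
    if len(days) < 2:
        return None
    gaps = [(later - earlier).days for earlier, later in zip(days, days[1:])]
    return sum(gaps) / len(gaps)

def recency(visit_days, today):
    """Recency: number of days between the most recent visit and `today`."""
    return (today - max(visit_days)).days

# Hypothetical visit dates for one visitor.
visits = [date(2018, 1, 1), date(2018, 1, 8), date(2018, 1, 22)]

print(average_days_between_visits(visits))  # 10.5
print(recency(visits, date(2018, 1, 25)))   # 3
print(recency(visits, date(2018, 2, 1)))    # 10
```

The two recency values differ only because the reporting date moved, which illustrates why recency is called a moving target.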
Value Pyramid
[Figure: value pyramid. Levels from top to bottom: Uniquely Identified Visitors, Unique Visitors, Visits, Page Views, Hits. Axis labels: Volume of Available Data; Increasing Value of Data.]
Web Structure Mining
Authoritative Web Pages
• Search Engines need to retrieve relevant
pages which are of high quality or
Authoritative
• Authority is in the hyperlinks
• Hyperlink is an endorsement
• Collective endorsement of a given page by
different authors indicates its importance
Limitations
• Not all links are meant as endorsements
• Some are for navigation
• Some are paid advertisements
• Commercial or competitive interests lead sites to avoid linking to
rivals’ pages
• Real authoritative pages are seldom self-descriptive, e.g. they may not contain the phrase
“Web Search Engine”
Hubs
• A hub is one or a set of web pages providing
collections of links to authorities
• Hub pages may not be prominent
• They may have very few links pointing to them
• They provide a collection of prominent links to
a common topic
Hubs
• These could be
– Recommended links in individual home pages
– Recommended reference sites
– Professionally assembled resource lists
– Professional collection of commercial sites
Hubs and Authorities
• A good hub is one which points to many good
authorities
• A good authority is one which has many good
hubs pointing to it
• This mutual reinforcement is the basis for
mining
Page Rank Algorithm
• Static concept
– Computed off-line and does not depend on the
search query
– It can be regarded as the “prestige” of a page
– “In-links” of page p are the hyperlinks from other
pages that point to page p. Ignore the in-links
from the same website
– “Out-links” of page p are the hyperlinks that point
to other pages from page p
Page importance
• A hyperlink from one page to another is
an implicit endorsement (conveyance of authority) of
the target page. The more in-links page p has, the
more “prestige” page p has.
• Pages that point to page p also have their own
prestige. A link to page p from a page with higher prestige
counts for more than a link from a page with lower
prestige.
• A page is more important if it is pointed to by other
important pages.
Page Rank
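The prestige idea, that a page is important if important pages point to it, can be sketched as a power iteration. This is a minimal illustration rather than the formulation used in these slides: the damping factor 0.85, the fixed iteration count, the handling of pages without out-links and the toy graph are all assumptions, and the graph is assumed to contain only links between different websites, so same-site in-links are already excluded.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict: page -> list of pages it links to.

    Every link target must also appear as a key of out_links."""
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, targets in out_links.items():
            if targets:
                # A page passes its prestige on in equal shares to its out-links.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page without out-links: spread its prestige over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[p] / n
        rank = new_rank
    return rank

# Hypothetical four-page web graph.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this graph C ends up with the highest score because three pages link to it, and A comes second because its single in-link comes from the high-prestige page C. The scores depend only on the link graph, so they can be computed off-line and looked up at query time.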
Why Page Rank?
• Not influenced by the search string
• Page rank values are computed and stored in advance; only a
look-up is done at the time of the query,
so it is very efficient
• Hard to spam: it is not easy to create links
in other pages pointing to your page
• Limitation: it cannot distinguish between pages that are
authoritative in general and those which are
authoritative on the query topic
Hypertext Induced Topic Search (HITS)
• HITS is search-query dependent
• Uses authority ranking and hub ranking
• It issues a query q and collects the t highest-ranked
pages (t = 200 in the original paper); this is the root set W
• Then it grows W by including any page pointed
to by a page in W and any page that points to a
page in W, creating a larger set S
• It limits the size of S by allowing each page in W to
bring in at most k pages (k = 50 in the paper)
HITS
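The mutual reinforcement between hubs and authorities can be sketched as an iterative update on the expanded page set. This is an illustrative sketch, not the exact procedure of the original paper: the graph is assumed to be the already expanded set S, the iteration count and the normalisation by Euclidean length are assumptions, and the page names are hypothetical.

```python
import math

def hits(out_links, iterations=50):
    """Iterative hub/authority scoring over a dict: page -> list of pages it links to.

    Every link target must also appear as a key of out_links."""
    pages = list(out_links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages pointing to the page.
        new_auth = {p: 0.0 for p in pages}
        for p, targets in out_links.items():
            for t in targets:
                new_auth[t] += hub[p]
        # Hub score: sum of the authority scores of the pages the page points to.
        new_hub = {p: sum(new_auth[t] for t in targets)
                   for p, targets in out_links.items()}
        # Normalise so that the scores stay bounded.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return auth, hub

# Hypothetical expanded set: three hub pages linking to two authority pages.
graph = {
    "hub1": ["auth1", "auth2"],
    "hub2": ["auth1", "auth2"],
    "hub3": ["auth1"],
    "auth1": [],
    "auth2": [],
}
auth, hub = hits(graph)
print(max(auth, key=auth.get))  # auth1
```

Here auth1 gets the highest authority score because all three hubs point to it, while hub1 and hub2 score higher as hubs than hub3 because they point to both authorities.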
HITS
• Ranks according to the query topic
• Provides more relevant authorities and hubs
• Easy to spam: create a hub page pointing to many
high-ranking authorities, which boosts its hub score H(i)
• Topic drift: expanding the root set
can lead to collecting irrelevant pages
• Time-consuming, since the ranking is computed at query time
Layout of Search Results
Name | Description | Position
--- | --- | ---
Organic | Results from Web crawl. “Objective hits” not influenced by direct payments | Central on results page
Sponsored | Paid results, separated from the organic results list | Above or below organic results, on the right-hand side of the results list
Shortcuts | Emphasized result pointing to results from a special collection | Above organic results, within organic results list
Primary search result | Extended result that points to different collections. It comes with an image and further information | Above organic results, often within organic results
Prefetch | Result from a preferred source, emphasized in the results set | Above or within organic results
Snippet | Regular organic result with result description extended by additional navigational links | Within organic results list (usually first position only)
Child | Second result from the same server with a link to further results from same server | Within organic results list; indented
Research Questions
• RQ1: How many sponsored links are on the results screen?
• RQ2: Are there popular hosts, domains, and content types
preferred by a certain search engine?
• RQ3: Is there a difference between search engines regarding
the results types presented?
• RQ4: How many specially displayed results are on the first
results page?
• RQ5: To what extent are shortcuts used on the results pages?
• RQ6: What is the difference between search engines regarding
the questions above?
Methodology
• Search engines: Google, Yahoo, MSN/Live.com, and Ask.com
• Obtained 500 queries from the top 100k queries and 500 from
the last 100k queries (the long tail) of an Ask.com query list
• Sorted the list of all queries by frequency and alphabetically
• Selected every two-hundredth query so that the selection
is representative of both the popular queries and the very
rare queries in the long tail
• Developed a script to automatically download the search
engines’ results pages and analyze the HTML code
• Categorized every result on those pages as organic, paid
advertisement, snippet, etc.
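The query selection step amounts to systematic sampling: taking every 200th entry of a sorted list of 100,000 queries yields 500 queries. The sketch below shows only that arithmetic; the placeholder query strings and the function name `every_kth` are hypothetical, and the real query list is not reproduced here.

```python
def every_kth(items, k=200):
    """Systematic sample: take the k-th, 2k-th, 3k-th, ... item of a list."""
    return items[k - 1::k]

# Placeholder for a list of 100,000 queries sorted by frequency.
popular = [f"query_{i}" for i in range(1, 100_001)]

sample = every_kth(popular)
print(len(sample))   # 500
print(sample[0])     # query_200
print(sample[-1])    # query_100000
```

Because the list is sorted before sampling, the selected queries are spread evenly across the whole frequency range instead of clustering at one end.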
Results
Measure | Google | Yahoo | MSN/Live | Ask
--- | --- | --- | --- | ---
Valid popular queries | 499 | 463 | 500 | 500
Valid rare queries | 498 | 492 | 457 | 498
URLs in results screens | 12,522 | 9,436 | 11,700 | 9,127
URLs (from popular) | 6,731 | 5,232 | 6,685 | 5,224
URLs (from rare) | 5,791 | 4,204 | 5,065 | 3,903
Organic URLs | 9,641 | 8,454 | 9,177 | 8,183
Organic URLs (from popular) | 5,041 | 4,543 | 4,996 | 4,661
Organic URLs (from rare) | 4,600 | 3,911 | 4,181 | 3,522
Sponsored links | 2,881 | 982 | 2,573 | 944
Sponsored links (from popular) | 1,690 | 689 | 1,689 | 563
Sponsored links (from rare) | 1,191 | 293 | 884 | 381
No results (from popular) | 0 | 2 | 0 | 5
No results (from rare) | 16 | 64 | 12 | 79
Domains
Rank | Domain | Google | Yahoo | MSN/Live | Ask
--- | --- | --- | --- | --- | ---
1 | .com | 6614 (68.60%) | 5519 (65.30%) | 5255 (57.30%) | 4450 (54.40%)
2 | .co.uk | 1141 (11.80%) | 830 (9.80%) | 1350 (14.70%) | 1630 (19.90%)
3 | .org | 1029 (10.70%) | 1014 (12.00%) | 1070 (11.70%) | 687 (8.40%)
4 | .net | 300 (3.10%) | 382 (4.50%) | 304 (3.30%) | 271 (3.30%)
5 | .gov | 130 (1.30%) | 111 (1.30%) | 89 (1.00%) | 100 (1.20%)
6 | .edu | 141 (1.50%) | 170 (2.00%) | 231 (2.50%) | 166 (2.00%)
7 | .gov.uk | 111 (1.20%) | 73 (0.90%) | 140 (1.50%) | 105 (1.30%)
8 | .au | 72 (0.70%) | 56 (0.70%) | 106 (1.20%) | 91 (1.10%)
9 | .info | 59 (0.30%) | 35 (0.40%) | 55 (0.60%) | 17 (0.20%)
10 | .us | 40 (0.40%) | 50 (0.60%) | 21 (0.20%) | 29 (0.40%)
File Types
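Type | Google | Yahoo | MSN/Live | Ask.com
--- | --- | --- | --- | ---
.html | 1838 | 1423 | 1366 | 1982
.htm | 885 | 787 | 899 | 1149
.php | 176 | 182 | 221 | 153
.pdf | 158 | 152 | 111 | 50
.aspx | 141 | 130 | 152 | 97
.asp | 140 | 140 | 199 | 126
.shtml | 111 | 84 | 87 | 106
.cfm | 30 | 32 | 42 | 18
.doc | 29 | 5 | 11 | 0
.ppt | 4 | 2 | 2 | 0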
Ads (popular Queries)
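Position: Top | Google | Yahoo | MSN/Live | Ask
--- | --- | --- | --- | ---
All | 137 (27.40%) | 180 (38.90%) | 193 (38.60%) | 134 (26.80%)
1 | 51 (10.20%) | 120 (25.90%) | 70 (14.00%) | 63 (12.60%)
2 | 26 (5.20%) | 50 (10.80%) | 39 (7.80%) | 33 (6.60%)
3 | 60 (12.00%) | 10 (2.20%) | 84 (16.80%) | 13 (2.60%)
4 | – | – | – | 14 (2.80%)
5 | – | – | – | 11 (2.20%)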
Ads (popular Queries)
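Position: Right | Google | Yahoo | MSN/Live | Ask
--- | --- | --- | --- | ---
All | 287 (57.50%) | 117 (25.30%) | 329 (65.80%) | None
1 | 62 (12.40%) | 58 (12.50%) | 107 (21.40%) | –
2 | 32 (6.40%) | 25 (5.40%) | 52 (10.40%) | –
3 | 24 (4.80%) | 13 (2.80%) | 36 (7.20%) | –
4 | 17 (3.40%) | 7 (1.50%) | 16 (3.20%) | –
5 | 13 (2.60%) | 5 (1.10%) | 118 (23.60%) | –
6 | 13 (2.60%) | 2 (0.40%) | – | –
7 | 10 (2.00%) | 5 (1.10%) | – | –
8 | 116 (23.20%) | 2 (0.40%) | – | –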
Ads (popular Queries)
Google Yahoo MSN/Live Ask
Below 1 58 12.50% 70 14.00% 63 12.60%
2 59 5.40% 123 24.60% 33 6.60%
3 13 2.60%
4 9 1.80%
5 16 3.20%
Use of shortcuts in Google
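Type | Popular: Count | Popular: Position | Rare: Count | Rare: Position
--- | --- | --- | --- | ---
Prefetch | 170 | 1 | 69 | 1
Snippet | 290 | var | 196 | var
Images | 9 | 1 | 5 | 1
Total Shortcuts | 183 | – | 60 | –
Local Results | 2 | 1 | 4 | 1
Video | 108 | var | 127 | var
Blog search | 6 | 11+ | 4 | 11+
Books | 22 | 10+ | 9 | 10+
Calculator | 7 | 1 | 1 | 1
Dictionary | 1 | 1 | 3 | 1
News | 10 | 3/10 | 4 | 3/10
Scholar | 3 | 1 | 2 | 1
Shopping | 17 | var | 37 | var
Weather | 5 | 1 | 1 | 1
NASDAQ | 2 | 1 | – | –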
• Questions?
Pentesting_AI and security challenges of AI
 
sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444
 

Web structure mining

  • 1. WEB STRUCTURE MINING 12/03/18 Professor V. Nagadevara
  • 2. Web Metrics • Some interesting terms – Hits: an action on a website, such as a user viewing a page or downloading a file – When a page is assembled from several downloaded objects, the hits include every downloaded file – Hit counts are therefore artificially inflated – If anything, hits are at the bottom of the pyramid – Measuring hits alone does not mean anything!
  • 3. Click-Through • Click-through is the number of people who click on a particular thing: a link, an ad word, a banner ad, etc. • “Percentage of users who click on a viewed advertisement” • Click-oops-back is part of the click-through • A better measure is the response rate: the number of visitors who arrived at the site after clicking the link • “We had 200,000 hits last month” or “our click-through rate from Yahoo is 95%”
  • 4. Page View • What is a page? – Static pages, pages measured by page tags, and documents are okay – Dynamically generated pages: is the entire page a view, or is each object a view? – Repeat views: are they different from the original view? How to decide when it is a reload? – How to consider frames? • Definition: a successful loading of any document containing content that was requested by a website visitor, regardless of the delivery mechanism or the number or frequency with which the said content is delivered
  • 5. Visits • A visit is a session or user session • Defined as any period of activity separated by m minutes of inactivity (say 30 minutes) • A visit is counted when a unique visitor creates activity on a webpage, measured via a page statistic, regardless of the duration of the activity, as long as the period of inactivity between page views does not exceed the pre-specified limit
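The visit definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `count_visits`, the sample timestamps, and the choice to treat a gap of exactly 30 minutes as the same visit are not from the slides.

```python
from datetime import datetime, timedelta

def count_visits(timestamps, timeout_minutes=30):
    """Count visits for one visitor.

    A new visit starts whenever the gap between two consecutive
    page views exceeds the inactivity limit (hypothetical default: 30 min).
    """
    if not timestamps:
        return 0
    ordered = sorted(timestamps)
    limit = timedelta(minutes=timeout_minutes)
    visits = 1
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > limit:
            visits += 1
    return visits

# Hypothetical page-view timestamps for a single visitor.
views = [
    datetime(2018, 3, 12, 9, 0),
    datetime(2018, 3, 12, 9, 10),   # 10 min gap: same visit
    datetime(2018, 3, 12, 9, 35),   # 25 min gap: same visit
    datetime(2018, 3, 12, 10, 30),  # 55 min gap: new visit
]
print(count_visits(views))
```

The duration of each visit plays no role here, which matches the definition: only the inactivity gap between page views decides where one visit ends and the next begins.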
  • 6. Visitor/Unique visitor • Human beings are visitors; bots are not • How to determine uniqueness? • A unique visitor is counted when a human being uses a web browser to visit a website, regardless of the number of pages or the duration. A visitor can be unique for different periods of time • A unique visitor for any arbitrary time frame should be counted once and only once, on their first visit between the start date and end date
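A minimal sketch of counting unique visitors over a time frame follows. It assumes each visit already carries a visitor identifier (for example a cookie id) and that bot traffic has been filtered out beforehand; the identifiers, dates, and the name `unique_visitors` are hypothetical.

```python
from datetime import date

def unique_visitors(visit_log, start, end):
    """Count each visitor once if they have any visit within [start, end]."""
    return len({visitor for visitor, day in visit_log if start <= day <= end})

# Hypothetical (visitor id, visit date) pairs.
log = [
    ("a", date(2018, 3, 1)),
    ("a", date(2018, 3, 2)),
    ("b", date(2018, 3, 2)),
    ("c", date(2018, 4, 1)),
]

march = unique_visitors(log, date(2018, 3, 1), date(2018, 3, 31))
daily_total = sum(
    unique_visitors(log, d, d) for d in (date(2018, 3, 1), date(2018, 3, 2))
)
print(march, daily_total)
```

Visitor "a" is unique on each of two days but only once for the month, so daily unique counts cannot simply be added up to obtain the monthly figure.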
  • 7. Referrer • This is the source of traffic to the website • Referring URL or referring domain • How to report when there is no referrer? – Refer to them as “bookmarks or directly referred” • Group the sources into different referrers – Banner ads, paid search engines, partners, affiliate programs etc. • A referrer to any website should be an undifferentiated and complete uniform resource locator describing the exact page on the referring website that contained the link to the site in question
  • 8. Conversion Rate • One of the most important metrics for an ecommerce site or online business • A conversion rate is the number of completers divided by the number of starters for any online activity that is more than one logical step in length • The starting and ending points do not matter • There can be many conversion rates for a given website • Can be measured using views, visits, or visitors
  • 9. Abandonment Rate • The abandonment rate for any step in a multi-step process is “one minus the number of units that make it to step n+1 divided by those at step n” • $AR_n = 1 - \frac{N_{n+1}}{N_n}$ • It can be measured using visits, visitors, views etc. • More important if the abandonment (leakage) occurs before the conversion • A process involving k steps has k-1 abandonment rates
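The two definitions above translate directly into code. The sketch below uses made-up funnel counts and assumes every step has a positive count; the function names are illustrative only.

```python
def abandonment_rates(step_counts):
    """Return AR_n = 1 - N_{n+1} / N_n for each consecutive pair of steps."""
    return [1 - nxt / cur for cur, nxt in zip(step_counts, step_counts[1:])]

def conversion_rate(step_counts):
    """Completers divided by starters for the whole process."""
    return step_counts[-1] / step_counts[0]

# Hypothetical 4-step checkout funnel: visits counted at each step.
funnel = [1000, 400, 200, 150]

print(abandonment_rates(funnel))  # one rate per transition, k-1 in total
print(conversion_rate(funnel))
```

Because each step's survival rate is 1 minus its abandonment rate, multiplying the survival rates of all k-1 transitions gives the overall conversion rate.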
  • 10. Loyalty, Frequency and Recency • These measure how well the traffic is maintained • Loyalty is the measure of the number of visits any visitor is likely to make over their lifetime as a visitor • It is the raw number of visits all visitors have made since the beginning of measurement, de-duplicated • Example: 100 visitors made 6 visits, 92 visitors made 7 visits • De-duplication is adjusting the numbers when a visitor moves from 6 to 7 visits
  • 11. Loyalty, Frequency and Recency • Frequency is a measure of the activity a visitor generates on the website in terms of the average time between return visits • It is measured in logical groups or discrete numbers of days between visits • Data should be de-duplicated • Reported as “once a day”, “once a week” • Can also be presented as average days between visits or as the distribution of return visitors in each group
  • 12. Loyalty, Frequency and Recency • Recency is the number of days since the last visit • Reported as the number of visitors who returned after d days • Usually used in the context of online purchases, as “the number of days since last purchase” • Recency is a moving target
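The three measures can be sketched as small helper functions. Everything named here (function names, visitor ids, dates) is hypothetical, and the loyalty histogram is de-duplicated in the sense that each visitor appears only in the bucket for their current visit count.

```python
from collections import Counter
from datetime import date

def loyalty_distribution(visits_per_visitor):
    """Map a visit count to the number of visitors with exactly that count."""
    return Counter(visits_per_visitor.values())

def mean_days_between_visits(visit_days):
    """Frequency: average gap in days between consecutive visits."""
    ordered = sorted(visit_days)
    gaps = [(b - a).days for a, b in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps) if gaps else None

def recency_days(last_visit, today):
    """Recency: days since the last visit, relative to a reporting date."""
    return (today - last_visit).days

visit_counts = {"a": 6, "b": 7, "c": 6}
print(loyalty_distribution(visit_counts))

days = [date(2018, 3, 1), date(2018, 3, 4), date(2018, 3, 10)]
print(mean_days_between_visits(days))
print(recency_days(days[-1], date(2018, 3, 12)))
```

Recency depends on the reporting date passed as `today`, which is why it is described as a moving target.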
  • 13. Value Pyramid • Figure: pyramid with levels, from top to bottom: Uniquely Identified Visitors, Unique Visitors, Visits, Page Views, Hits • Axis labels: Volume of Available Data; Increasing Value of Data
  • 14. Web Structure Mining
  • 15. Authoritative Web Pages • Search engines need to retrieve relevant pages that are of high quality, or authoritative • Authority is in the hyperlinks • A hyperlink is an endorsement • Collective endorsement of a given page by different authors indicates its importance
  • 16. Limitations • Not all links are meant for endorsement • Some are navigation • Some are for paid advertisement • Commercial or competitive interests avoid links to rivals’ pages • Real authoritative pages are seldom descriptive, e.g. “Web Search Engine”
  • 18. HUBs • Hub is one or a set of web pages providing collections of links to authorities • Hub pages may not be prominent • Have very few links pointing to them • They provide a collection of prominent links to a common topic
  • 19. Hubs • These could be – Recommended links in individual home pages – Recommended reference sites – Professionally assembled resource lists – Professional collection of commercial sites
  • 21. Hubs and Authorities • A good hub is one that points to many good authorities • A good authority is one that has many good hubs pointing to it • This mutual reinforcement is the basis for mining
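The mutual reinforcement between hubs and authorities can be illustrated with a short iterative computation. The sketch assumes a tiny hand-made link graph, a fixed number of iterations, and Euclidean normalization of the scores; none of these details come from the slides.

```python
import math

def hub_authority_scores(links, iterations=50):
    """links maps each page to the pages it points to.

    Authority of a page = sum of hub scores of the pages pointing to it.
    Hub score of a page = sum of authority scores of the pages it points to.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize so the scores stay bounded across iterations.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Hypothetical graph: two hub-like pages pointing to two authority-like pages.
links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hub_authority_scores(links)
print(max(hub, key=hub.get), max(auth, key=auth.get))
```

In this graph `a1` ends up as the strongest authority because both hubs point to it, and `h1` as the strongest hub because it points to both authorities.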
  • 22. Page Rank Algorithm • Static concept – Computed off-line and does not depend on the search query – It can be regarded as the “prestige” – “In-links” of page p are the hyperlinks from other pages that point to page p. Ignore the in-links from the same website – “Out-links” of page p are the hyperlinks that point to other pages from page p
  • 23. Page importance • A hyperlink from a page pointing to another page is an implicit endorsement (conveyance of authority) of the target page. The more in-links page p has, the more “prestige” it has • Pages that point to page p also have their own prestige. A link to page p from a page with higher prestige counts for more than a link from a page with a lower prestige score • A page is more important if it is pointed to by other important pages
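The idea that a page is important when important pages point to it can be sketched with the commonly published form of PageRank, computed by repeated updates. The damping factor d = 0.85, the uniform handling of pages without out-links, and the toy graph are assumptions of this sketch, not values taken from the slides; links within the same website are assumed to have been removed already.

```python
def pagerank(links, d=0.85, iterations=100):
    """links maps each page to the list of pages it points to."""
    pages = sorted(set(links) | {q for t in links.values() for q in t})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - d) / n + d * dangling / n for p in pages}
        for p, targets in links.items():
            if targets:
                share = d * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
        rank = new_rank
    return rank

# Hypothetical graph: three pages point to "c", and "c" points to "a".
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Page `a` has a single in-link but still outranks `b` and `d`, because that link comes from the high-prestige page `c`.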
  • 25. Why PageRank? • Not influenced by the search string • PageRank values are computed and stored; only a look-up is done at query time, hence very efficient • Hard to spam: it is not easy to create links in other pages pointing to your page • Limitation: cannot distinguish between pages that are authoritative in general and those that are authoritative on the query topic
  • 26. Hypertext Induced Topic Search (HITS) • HITS is search-query dependent • Uses authority ranking and hub ranking • It sends a query q to a search engine and collects the t highest-ranked pages (t = 200 in the original paper): the root set W • Then it grows W by including any page pointed to by a page in W and any page that points to a page in W, creating a larger set S • It limits the size by allowing each page in W to bring in at most k pages (k = 50 in the paper)
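The growth of the root set W into the larger set S can be sketched as follows. The sketch follows the wording of the slide (each root page brings in at most k new pages) and assumes the link structure is available as two hypothetical dictionaries, `out_links` and `in_links`; a real system would obtain these from a crawl or a search engine.

```python
def build_base_set(root, out_links, in_links, k=50):
    """Grow the root set W into the base set S.

    Each root page may add at most k pages that it points to
    or that point to it.
    """
    base = set(root)
    for page in root:
        neighbours = list(out_links.get(page, [])) + list(in_links.get(page, []))
        added = 0
        for candidate in neighbours:
            if added >= k:
                break
            if candidate not in base:
                base.add(candidate)
                added += 1
    return base

# Hypothetical link data for a two-page root set.
root = ["w1", "w2"]
out_links = {"w1": ["x", "y", "z"], "w2": ["x"]}
in_links = {"w1": ["p"], "w2": ["q", "r"]}

print(sorted(build_base_set(root, out_links, in_links, k=2)))
print(sorted(build_base_set(root, out_links, in_links)))
```

A small k keeps S manageable when a root page is very heavily linked, at the cost of dropping some neighbours.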
  • 28. HITS • Ranks according to the query topic • Provides more relevant authorities and hubs • Easy to spam: create a hub pointing to many high-ranking authorities and it will boost the hub score H(i) • Problem of topic drift: expanding the root set can lead to collecting irrelevant pages • Time consuming
  • 29. Layout of Search Results
      Name | Description | Position
      Organic | Results from Web crawl. “Objective hits” not influenced by direct payments | Central on results page
      Sponsored | Paid results, separated from the organic results list | Above or below organic results, on the right-hand side of the results list
      Shortcuts | Emphasized result pointing to results from a special collection | Above organic results, within organic results list
      Primary search result | Extended result that points to different collections. It comes with an image and further information | Above organic results, often within organic results
      Prefetch | Result from a preferred source, emphasized in the results set | Above or within organic results
      Snippet | Regular organic result with result description extended by additional navigational links | Within organic results list (usually first position only)
      Child | Second result from the same server with a link to further results from same server | Within organic results list; indented
  • 30. Research Questions • RQ1: How many sponsored links are on the results screen? • RQ2: Are there popular hosts, domains, and content types preferred by a certain search engine? • RQ3: Is there a difference between search engines regarding the result types presented? • RQ4: How many specially displayed results are on the first results page? • RQ5: To what extent are shortcuts used on the results pages? • RQ6: What is the difference between search engines regarding the questions above?
  • 31. Methodology • Search engines: Google, Yahoo, MSN/Live.com, and Ask.com • Obtained 500 queries from the 100k top queries and 500 from the last 100k queries of the long tail, from Ask.com • Sorted the list of all queries by frequency and alphabetically • Selected every two-hundredth query to ensure a representative selection over both the popular queries and the very rare queries in the long tail • A script was developed to automatically download the search engines’ results pages and to analyze the HTML code • Every single result in those pages was categorized as organic, paid advertisement, snippet, etc.
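The query selection step can be sketched as systematic sampling over a sorted list. The sketch reads “every second hundredth query” as every 200th query, which is an assumption, though it is consistent with obtaining 500 queries from a list of 100k; the query strings are placeholders.

```python
def systematic_sample(sorted_queries, step=200):
    """Pick every step-th item from an already sorted list."""
    return sorted_queries[step - 1::step]

# Placeholder for 100k queries already sorted by frequency and alphabetically.
top_queries = [f"query-{i:06d}" for i in range(1, 100_001)]

sample = systematic_sample(top_queries)
print(len(sample), sample[0], sample[-1])
```

Sampling at a fixed interval from the sorted list spreads the selection evenly across the whole frequency range instead of clustering it at one end.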
  • 32. Results
      Engine                         | Google | Yahoo | MSN/Live |  Ask
      Valid popular queries          |    499 |   463 |      500 |  500
      Valid rare queries             |    498 |   492 |      457 |  498
      URLs in results screens        | 12,522 |  9436 |   11,700 | 9127
      URLs (from popular)            |   6731 |  5232 |     6685 | 5224
      URLs (from rare)               |   5791 |  4204 |     5065 | 3903
      Organic URLs                   |   9641 |  8454 |     9177 | 8183
      Organic URLs (from popular)    |   5041 |  4543 |     4996 | 4661
      Organic URLs (from rare)       |   4600 |  3911 |     4181 | 3522
      Sponsored links                |   2881 |   982 |     2573 |  944
      Sponsored links (from popular) |   1690 |   689 |     1689 |  563
      Sponsored links (from rare)    |   1191 |   293 |      884 |  381
      No results (from popular)      |      0 |     2 |        0 |    5
      No results (from rare)         |     16 |    64 |       12 |   79
  • 33. Domains
      Top | Domain  | Google        | Yahoo         | MSN/Live      | Ask
      1   | .com    | 6614 (68.60%) | 5519 (65.30%) | 5255 (57.30%) | 4450 (54.40%)
      2   | .co.uk  | 1141 (11.80%) |  830 (9.80%)  | 1350 (14.70%) | 1630 (19.90%)
      3   | .org    | 1029 (10.70%) | 1014 (12.00%) | 1070 (11.70%) |  687 (8.40%)
      4   | .net    |  300 (3.10%)  |  382 (4.50%)  |  304 (3.30%)  |  271 (3.30%)
      5   | .gov    |  130 (1.30%)  |  111 (1.30%)  |   89 (1.00%)  |  100 (1.20%)
      6   | .edu    |  141 (1.50%)  |  170 (2.00%)  |  231 (2.50%)  |  166 (2.00%)
      7   | .gov.uk |  111 (1.20%)  |   73 (0.90%)  |  140 (1.50%)  |  105 (1.30%)
      8   | .au     |   72 (0.70%)  |   56 (0.70%)  |  106 (1.20%)  |   91 (1.10%)
      9   | .info   |   59 (0.30%)  |   35 (0.40%)  |   55 (0.60%)  |   17 (0.20%)
      10  | .us     |   40 (0.40%)  |   50 (0.60%)  |   21 (0.20%)  |   29 (0.40%)
  • 34. File Types
      Type   | Google | Yahoo | MSN/Live | Ask.com
      .html  |   1838 |  1423 |     1366 |    1982
      .htm   |    885 |   787 |      899 |    1149
      .php   |    176 |   182 |      221 |     153
      .pdf   |    158 |   152 |      111 |      50
      .aspx  |    141 |   130 |      152 |      97
      .asp   |    140 |   140 |      199 |     126
      .shtml |    111 |    84 |       87 |     106
      .cfm   |     30 |    32 |       42 |      18
      .doc   |     29 |     5 |       11 |       0
      .ppt   |      4 |     2 |        2 |       0
  • 35. Ads (popular queries), position: Top
      Number of ads | Google       | Yahoo        | MSN/Live     | Ask
      All           | 137 (27.40%) | 180 (38.90%) | 193 (38.60%) | 134 (26.80%)
      1             |  51 (10.20%) | 120 (25.90%) |  70 (14.00%) |  63 (12.60%)
      2             |  26 (5.20%)  |  50 (10.80%) |  39 (7.80%)  |  33 (6.60%)
      3             |  60 (12.00%) |  10 (2.20%)  |  84 (16.80%) |  13 (2.60%)
      4             |  –           |  –           |  –           |  14 (2.80%)
      5             |  –           |  –           |  –           |  11 (2.20%)
  • 36. Ads (popular queries), position: Right
      Number of ads | Google       | Yahoo        | MSN/Live     | Ask
      All           | 287 (57.50%) | 117 (25.30%) | 329 (65.80%) | None
      1             |  62 (12.40%) |  58 (12.50%) | 107 (21.40%) | –
      2             |  32 (6.40%)  |  25 (5.40%)  |  52 (10.40%) | –
      3             |  24 (4.80%)  |  13 (2.80%)  |  36 (7.20%)  | –
      4             |  17 (3.40%)  |   7 (1.50%)  |  16 (3.20%)  | –
      5             |  13 (2.60%)  |   5 (1.10%)  | 118 (23.60%) | –
      6             |  13 (2.60%)  |   2 (0.40%)  |  –           | –
      7             |  10 (2.00%)  |   5 (1.10%)  |  –           | –
      8             | 116 (23.20%) |   2 (0.40%)  |  –           | –
  • 37. Ads (popular Queries) Google Yahoo MSN/Live Ask Below 1 58 12.50% 70 14.00% 63 12.60% 2 59 5.40% 123 24.60% 33 6.60% 3 13 2.60% 4 9 1.80% 5 16 3.20%
  • 38. Use of shortcuts in Google
      Result type     | Popular: count | Popular: position | Rare: count | Rare: position
      Prefetch        |            170 | 1                 |          69 | 1
      Snippet         |            290 | var               |         196 | var
      Images          |              9 | 1                 |           5 | 1
      Total shortcuts |            183 | –                 |          60 | –
      Local Results   |              2 | 1                 |           4 | 1
      Video           |            108 | var               |         127 | var
      Blog search     |              6 | 11+               |           4 | 11+
      Books           |             22 | 10+               |           9 | 10+
      Calculator      |              7 | 1                 |           1 | 1
      Dictionary      |              1 | 1                 |           3 | 1
      News            |             10 | 3/10              |           4 | 3/10
      Scholar         |              3 | 1                 |           2 | 1
      Shopping        |             17 | var               |          37 | var
      Weather         |              5 | 1                 |           1 | 1
      NASDAQ          |              2 | 1                 |           – | –