1.1 Problems With Web
Difficulty in finding
consumers or individual
To Survey the area of
Introduction to Link
Review of the HITS and PageRank algorithms.
3. Web Mining: Definition
Process of discovering potentially useful information or knowledge
from web data.
3.1 Web Mining: Subtasks
3.1 Web Mining Categories
3.1.1 Web Content Mining
Scans the data of a web page to determine how relevant its content
is to a search query.
3.1.2 Web Structure Mining
between Web pages.
Focuses on following
Reducing irrelevant search
information on the web.
3.1.3 Web Usage Mining
Focuses on techniques that predict user behavior while
interacting with the WWW.
Web log records are analyzed to discover user access patterns.
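As a minimal sketch of the log analysis described above, the snippet below counts page-to-page transitions per user. The log format (a visit-ordered list of (user, page) pairs), the function name access_patterns and the sample records are all assumptions made for illustration; real web server logs carry more fields and need parsing first.

```python
from collections import Counter


def access_patterns(log):
    """Count page-to-page transitions in a visit-ordered list of (user, page) records."""
    last_page = {}           # most recent page seen for each user
    transitions = Counter()  # (from_page, to_page) -> number of occurrences
    for user, page in log:
        if user in last_page:
            transitions[(last_page[user], page)] += 1
        last_page[user] = page
    return transitions


# Hypothetical log records, already ordered by time.
log = [
    ("u1", "/home"), ("u2", "/home"),
    ("u1", "/products"), ("u2", "/products"),
    ("u1", "/cart"), ("u2", "/about"),
]
patterns = access_patterns(log)
print(patterns.most_common(1))  # [(('/home', '/products'), 2)]
```

Frequent transitions such as /home to /products are one simple form of access pattern; predicting behavior would build on counts like these.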
The challenges could be
divided into three phases:
4. Link Mining
It is located at the intersection of the work in
Hypertext and web mining
Relational learning and inductive logic programming
Some tasks of link mining applicable in web structure mining:
Link-based cluster analysis
(i) Link-based Classification
Predicts the category of a web page based on:
words that occur on the page
links between pages
other possible attributes of the web page
E.g.: predicting the category of a paper based on its citations
and co-citations.
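To make the idea concrete, here is a deliberately simple sketch that uses only the link information: a page is assigned the most common category among the labelled pages it is linked with. This majority-vote rule is an illustrative stand-in, not the method of any cited paper, and the function name classify_by_links and the sample data are hypothetical. A fuller classifier would also use the page's words and other attributes.

```python
from collections import Counter


def classify_by_links(page, links, labels):
    """Predict a page's category by majority vote among its labelled neighbours.

    links  -- list of (source, target) pairs
    labels -- dict mapping already-classified pages to their category
    """
    neighbours = set()
    for source, target in links:
        if source == page:
            neighbours.add(target)
        elif target == page:
            neighbours.add(source)
    votes = Counter(labels[n] for n in neighbours if n in labels)
    # No labelled neighbour means no prediction.
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical citation links between papers.
links = [("p1", "p2"), ("p1", "p3"), ("p4", "p1")]
labels = {"p2": "databases", "p3": "databases", "p4": "networks"}
print(classify_by_links("p1", links, labels))  # databases
```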
(ii) Link-based Cluster Analysis
Goal: finding naturally occurring subclasses.
Data is segmented into groups:
similar objects are grouped together
dissimilar objects fall into different groups.
Helps in discovering hidden patterns.
E.g.: finding diseases with similar transmission patterns.
(iii) Link Type
Predicting link type
between two entities.
Predicting purpose of
Eg. Navigational or
(iv) Link Strength
Links could be associated with weights.
Strong links - higher weight
Weak links - lower weight
(v) Link Cardinality
Refers to the number
of inbound links to a
Link popularity :
factors that weigh the
importance of each
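Counting inbound links, as described above, can be sketched in a few lines. The (source, target) link list and the function name inbound_link_counts are assumptions for illustration only.

```python
from collections import Counter


def inbound_link_counts(links):
    """Return the number of inbound links for every page in a (source, target) list."""
    counts = Counter()
    for source, target in links:
        counts[source] += 0  # ensures pages with no inbound links still appear
        counts[target] += 1
    return counts


# Hypothetical link structure: a and b link to c, c links back to a.
links = [("a", "c"), ("b", "c"), ("c", "a")]
counts = inbound_link_counts(links)
print(counts["c"], counts["a"], counts["b"])  # 2 1 0
```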
5. Hyperlink-Induced Topic Search
Link analysis algorithm that identifies two kinds of pages
from the Web hyperlink structure:
Authorities: contain valuable information on the subject.
Hubs: contain useful links towards the authoritative pages.
Two-step process:
Sampling step: a set of relevant pages is collected.
Iterative step: hubs and authorities are found using the output
of the sampling step.
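The iterative step can be sketched as follows. A page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to; both are normalized after each round. The function name hits, the fixed iteration count and the sample links are assumptions; this is a minimal sketch, not a production implementation.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {p: (v / norm if norm else v) for p, v in scores.items()}


def hits(links, iterations=50):
    """Compute hub and authority scores from a list of (source, target) links."""
    pages = {p for link in links for p in link}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: add up the hub scores of in-linking pages.
        new_auth = {p: 0.0 for p in pages}
        for source, target in links:
            new_auth[target] += hub[source]
        # Hub update: add up the new authority scores of linked-to pages.
        new_hub = {p: 0.0 for p in pages}
        for source, target in links:
            new_hub[source] += new_auth[target]
        auth, hub = normalize(new_auth), normalize(new_hub)
    return hub, auth


# Hypothetical pages: h1 and h2 act as hubs, a1 and a2 as authorities.
links = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
hub, auth = hits(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))  # a1 h1
```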
A query submitted to a search engine yields a root set.
The root set is then expanded into a base set.
Expanding the root set into base set
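The slides do not spell out the expansion rule. The sketch below assumes the usual one from Kleinberg's paper: the base set contains the root set plus every page that a root page links to and every page that links to a root page. The function name expand_to_base_set and the sample data are hypothetical.

```python
def expand_to_base_set(root_set, links):
    """Add pages linked from, or linking to, any page of the root set."""
    base_set = set(root_set)
    for source, target in links:
        if source in root_set:
            base_set.add(target)   # page pointed to by a root page
        if target in root_set:
            base_set.add(source)   # page pointing to a root page
    return base_set


# Hypothetical data: z is two hops away from the root set, so it stays out.
root_set = {"r1", "r2"}
links = [("r1", "x"), ("y", "r2"), ("x", "z")]
base_set = expand_to_base_set(root_set, links)
print(sorted(base_set))  # ['r1', 'r2', 'x', 'y']
```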
Problems With HITS Algorithm
Some problems with the HITS algorithm are:
Mutually reinforced relationships between hosts
Automatically generated links
Hubs and authorities
6. PageRank Model
It is a link analysis algorithm.
Assigns a numeric value that indicates the importance of a web page.
Computes importance from the number of incoming links.
Rank of a page is divided evenly among its out-links to
contribute to the ranks of the pages they point to.
PageRanks form a probability distribution over web
pages, so the sum of all pages’ PageRanks is one.
PageRank can be calculated by:
PR(A) = (1 - d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
T1..Tn are the pages that point to page A.
C(A) is defined as the number of links going out of page A.
d is the damping factor, which is usually set to 0.85.
The damping factor is the probability that, at each page, a
random surfer continues by following a link; with probability
1 - d the surfer gets bored and requests another random page.
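The formula can be evaluated by repeated substitution, as in the sketch below. Two choices here are assumptions rather than part of the source. First, the (1 - d) term is divided by the number of pages N, so that the ranks sum to one as a probability distribution; with the formula exactly as written they sum to N instead. Second, a page with no out-links spreads its rank evenly over all pages. The function name pagerank and the sample graph are hypothetical.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iteratively compute PageRank from a list of (source, target) links."""
    pages = sorted({p for link in links for p in link})
    n = len(pages)
    out_links = {p: [] for p in pages}
    for source, target in links:
        out_links[source].append(target)

    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Random-jump share, normalized by N (assumption, see lead-in).
        new_rank = {p: (1.0 - d) / n for p in pages}
        for p in pages:
            # Assumption: a page without out-links shares its rank with all pages.
            targets = out_links[p] or pages
            share = d * rank[p] / len(targets)  # rank divided evenly among out-links
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical four-link graph.
links = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]
rank = pagerank(links)
print(max(rank, key=rank.get))  # C
```

In this example C receives links from both A and B and ends up with the highest rank, while B, which receives only half of A's rank, ends up lowest.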
HITS was used in the Clever search engine by IBM.
PageRank is used by Google.
Knowledge Discovery and Retrieval on World Wide Web Using Web Structure
Mining by Sekhar Babu Boddu, V.P Krishna Anne, Rajesekhara Rao Kurra and
Durgesh Kumar Mishra, 2010, In proceedings of Fourth Asia International
Conference on Mathematical/Analytical Modelling and Computer Simulation
Link Mining: A New Data Mining Challenge by Lise Getoor, 2003, SIGKDD
Explorations, Volume 4, Issue 2
Authoritative Sources in a Hyperlinked Environment by Jon M. Kleinberg, 1998, In
proceedings of ACM-SIAM Symposium on Discrete Algorithms
The PageRank Citation Ranking: Bringing Order to the Web by L. Page, S. Brin,
R. Motwani and T. Winograd, 1998, Technical report, Stanford University