4.5 Web Mining

DATA MINING
MINING THE WORLD WIDE WEB

Mining the Web’s Link Structures to Identify
Authoritative Web Pages
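• Number the pages {1, 2, ..., n} and define their adjacency matrix A
to be an n×n matrix, where A(i, j) is 1 if page i links to page j, and 0
otherwise.
• Define the authority weight vector a = (a1, a2, ..., an) and the hub
weight vector h = (h1, h2, ..., hn). Then we have
$h = A a, \quad a = A^{T} h$
• Unfolding these two equations k times, we have
$h = A a = A A^{T} h = (A A^{T}) h = (A A^{T})^{2} h = \cdots = (A A^{T})^{k} h$
$a = A^{T} h = A^{T} A a = (A^{T} A) a = (A^{T} A)^{2} a = \cdots = (A^{T} A)^{k} a$
• When normalized, these two sequences of iterations converge to the
principal eigenvectors of $A A^{T}$ and $A^{T} A$, respectively.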
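
The hub and authority updates can be sketched as a simple power iteration. This is a minimal illustration rather than a reference implementation: the function name `hits`, the fixed iteration count, and the choice of Euclidean normalization are assumptions made here, and the three-page graph is a toy example.

```python
import math

def hits(adj, iterations=50):
    """Power iteration for hub and authority weights.

    adj[i][j] is 1 if page i links to page j, else 0.
    """
    n = len(adj)
    hubs = [1.0] * n
    auths = [1.0] * n
    for _ in range(iterations):
        # a = A^T h: authority of page j sums the hub weights of pages linking to j
        auths = [sum(adj[i][j] * hubs[i] for i in range(n)) for j in range(n)]
        # h = A a: hub weight of page i sums the authority weights of pages i links to
        hubs = [sum(adj[i][j] * auths[j] for j in range(n)) for i in range(n)]
        # Normalize to unit Euclidean length (assumed here; any fixed norm would do)
        a_norm = math.sqrt(sum(x * x for x in auths)) or 1.0
        h_norm = math.sqrt(sum(x * x for x in hubs)) or 1.0
        auths = [x / a_norm for x in auths]
        hubs = [x / h_norm for x in hubs]
    return hubs, auths

# Toy graph: pages 0 and 1 both link to page 2; page 2 links to nothing.
A = [[0, 0, 1],
     [0, 0, 1],
     [0, 0, 0]]
hubs, auths = hits(A)
```

In this toy graph, page 2 ends up with all of the authority weight, while pages 0 and 1 share the hub weight equally.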

• HITS sometimes drifts when hubs cover multiple topics. It may
also suffer from “topic hijacking” when many pages from a single
website point to the same popular site, giving that site too
large a share of the authority weight.
• Such problems can be overcome by replacing the sums in the HITS
equations with weighted sums: scaling down the weights of multiple
links from within the same site, using anchor text to adjust the
weight of the links along which authority is propagated, and
breaking large hub pages into smaller units.
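
One way to scale down multiple links from the same site is sketched below. The 1/k rule used here (if k pages on one host link to the same target, each link gets weight 1/k) is an assumption chosen for illustration; the slides do not specify a weighting scheme. The function name and the example URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse

def site_scaled_weights(links):
    """links: list of (source_url, target_url). Returns {(src, dst): weight}.

    If k distinct pages from one host link to the same target, each of
    those links gets weight 1/k, so one site contributes at most 1 in total.
    """
    sources_per_host_target = defaultdict(set)
    for src, dst in links:
        sources_per_host_target[(urlparse(src).netloc, dst)].add(src)
    weights = {}
    for src, dst in links:
        k = len(sources_per_host_target[(urlparse(src).netloc, dst)])
        weights[(src, dst)] = 1.0 / k
    return weights

links = [
    ("http://a.example/p1", "http://popular.example/"),
    ("http://a.example/p2", "http://popular.example/"),
    ("http://a.example/p3", "http://popular.example/"),
    ("http://b.example/x", "http://popular.example/"),
]
w = site_scaled_weights(links)
```

These weights would then replace the 0/1 entries of the adjacency matrix in the weighted sums, so that three links from a.example count no more than the single link from b.example.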

• Link analysis algorithms are based on two assumptions:
– Links convey human endorsement. (If there exists a link from page
A to page B and these two pages are authored by different
people, then the link implies that the author of page A found page
B valuable.)
– Pages that are co-cited by a certain page are likely related to the
same topic.
• Problems:
– The importance of a page may be miscalculated by PageRank.
– Topic drift may occur in HITS.
• Cause: a single Web page often contains multiple semantics, and
the different parts of the Web page have different importance within
that page.

• Using VIPS (Vision-based Page Segmentation), construct a page
graph and a block graph.
• Using this graph model, the new link analysis algorithms can
discover the intrinsic semantic structure of the Web.
• The graph model in block-level link analysis is induced from two
kinds of relationships: block-to-page (link structure) and
page-to-block (page layout).

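• The block-to-page relationship (link structure): it is more
reasonable to consider hyperlinks as going from block to page,
rather than from page to page.
• Let Z denote the block-to-page matrix with dimension n×k
(n blocks, k pages). Z can be defined as:
$Z_{ij} = 1/s_i$ if there is a link from block i to page j, and $Z_{ij} = 0$ otherwise,
where $s_i$ is the number of pages to which block i links.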
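
As a small illustration, the block-to-page matrix can be built from the set of pages each block links to. The sketch assumes the usual definition in which each of the $s_i$ pages linked from block i receives weight $1/s_i$; the function name and the sample data are hypothetical.

```python
def block_to_page_matrix(block_links, num_pages):
    """block_links[i] is the set of page indices that block i links to."""
    Z = []
    for targets in block_links:
        row = [0.0] * num_pages
        s = len(targets)  # s_i: number of pages block i links to
        for j in targets:
            row[j] = 1.0 / s
        Z.append(row)
    return Z

# Three blocks and three pages; the last block contains no links.
block_links = [{1, 2}, {0}, set()]
Z = block_to_page_matrix(block_links, num_pages=3)
```

Each non-empty row sums to 1, so a row can be read as the probability of jumping from that block to each page.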

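• The page-to-block relationship (page layout): let X
denote the page-to-block matrix with dimension k×n.
• Each Web page can be segmented into blocks. X is defined
as
$X_{ij} = f_{p_i}(b_j)$ if block $b_j$ belongs to page $p_i$, and $X_{ij} = 0$ otherwise,
• where f is a function that assigns to every block b in page
p an importance value. The bigger $f_p(b)$ is, the more important
the block b is. Function f is empirically defined as
$f_p(b) = \alpha \times \frac{\text{size of block } b}{\text{distance between the center of } b \text{ and the center of the screen}}$
where $\alpha$ is a normalization factor that makes the values $f_p(b)$
over all blocks b of page p sum to 1.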
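
The page-to-block matrix can be sketched as follows, assuming the empirical importance function takes the form "block size divided by distance from the screen center, normalized to sum to 1 within each page". The function names, the (size, distance) input format, and the numbers are hypothetical, and distances are assumed to be strictly positive.

```python
def block_importance(blocks):
    """blocks: list of (size, distance_to_screen_center) for one page.

    Returns importance values normalized to sum to 1 over the page.
    """
    raw = [size / dist for size, dist in blocks]
    alpha = 1.0 / sum(raw)  # normalization factor
    return [alpha * r for r in raw]

def page_to_block_matrix(pages, num_blocks):
    """pages[i] maps a global block index to (size, distance) for page i."""
    X = []
    for page in pages:
        row = [0.0] * num_blocks
        indices = list(page)
        values = block_importance([page[j] for j in indices])
        for j, f in zip(indices, values):
            row[j] = f
        X.append(row)
    return X

pages = [
    {0: (300.0, 100.0), 1: (100.0, 100.0)},  # page 0 owns blocks 0 and 1
    {2: (50.0, 10.0)},                       # page 1 owns block 2
]
X = page_to_block_matrix(pages, num_blocks=3)
```

Block 0 is three times larger than block 1 at the same distance, so it receives three times the importance within page 0.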

• Based on the block-to-page and page-to-block relations, a
new Web page graph that incorporates the block importance
information is defined as
$W_P = X Z$
where $W_P$ is a k×k matrix.
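
Assuming the page graph is the product of the page-to-block matrix X and the block-to-page matrix Z, a toy computation looks like this. The helper `matmul` and the matrix values are illustrative only.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# X: pages x blocks (importance of each block within its page); 2 pages, 3 blocks
X = [[0.75, 0.25, 0.0],
     [0.0,  0.0,  1.0]]
# Z: blocks x pages (probability of following a link from a block to a page)
Z = [[0.0, 1.0],
     [0.5, 0.5],
     [1.0, 0.0]]
W_P = matmul(X, Z)
```

Here `W_P[0][1]` is 0.875: most of page 0's weight flows to page 1 because its most important block links only there. A plain page-to-page link count would not capture that difference between blocks.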

Mining Multimedia Data on the Web
• Web-based multimedia data are embedded in Web pages and are
associated with text and link information.
• Using Web page layout mining techniques (such as VIPS), a
Web page can be partitioned into a set of semantic blocks.
• VIPS helps to identify the surrounding text for Web images. This
text provides a textual description of the images and can be used
to build an image index.
• The Web image search problem can then be partially solved
using traditional text search techniques.
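
A rough sketch of such an image index: each image is indexed by the words of the text in its surrounding block, and a query returns the images whose surrounding text contains every query term. The segmentation step itself (for example by VIPS) is assumed to have already produced the image-to-text mapping; the function names and sample data are hypothetical.

```python
from collections import defaultdict

def build_image_index(image_blocks):
    """image_blocks: {image_url: surrounding text of the block holding it}."""
    index = defaultdict(set)
    for image, text in image_blocks.items():
        for term in text.lower().split():
            index[term.strip(".,;:!?")].add(image)
    return index

def search(index, query):
    """Return images whose surrounding text contains every query term."""
    term_sets = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*term_sets) if term_sets else set()

image_blocks = {
    "img/tiger.jpg": "A Bengal tiger resting in the grass.",
    "img/lion.jpg": "A lion resting near a tree.",
}
index = build_image_index(image_blocks)
tiger_hits = search(index, "resting tiger")
resting_hits = search(index, "resting")
```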

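• The block-level link analysis technique can be used to
organize Web images. Consider a new relation: the
block-to-image relation.
• Let Y denote the block-to-image matrix with dimension
n×m. For each image, at least one block contains this
image.
• Y is defined as
$Y_{ij} = 1/s_i$ if image $I_j$ is contained in block $b_i$, and $Y_{ij} = 0$ otherwise,
where $s_i$ is the number of images contained in block $b_i$.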

• We first construct the block graph, from which the image
graph can be induced. The block graph is defined as:
$W_B = (1 - t) Z X + t D^{-1} U$
• where t is a suitable constant and D is a diagonal matrix with
$D_{ii} = \sum_j U_{ij}$. $U_{ij}$ is 0 if block i and block j are contained in
two different Web pages; otherwise, it is set to the DOC (degree of
coherence) value of the smallest block containing both block i and
block j. It is easy to check that each row of $D^{-1} U$ sums to 1.
• $W_B$ can be viewed as a probability transition matrix such
that $W_B(a, b)$ is the probability of jumping from block a to
block b.
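
The block graph can be sketched numerically under the assumption that it combines a link term Z·X with a within-page coherence term, each row of U being normalized by its sum: $W_B = (1 - t) Z X + t D^{-1} U$. The value t = 0.2, the DOC values in U, and the helper names are hypothetical.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def block_graph(Z, X, U, t=0.2):
    """W_B = (1 - t) * Z X + t * D^{-1} U, with D_ii the i-th row sum of U."""
    ZX = matmul(Z, X)
    W_B = []
    for i, u_row in enumerate(U):
        d = sum(u_row)  # D_ii
        W_B.append([
            (1 - t) * ZX[i][j] + (t * u / d if d else 0.0)
            for j, u in enumerate(u_row)
        ])
    return W_B

# Blocks 0 and 1 belong to page 0; block 2 belongs to page 1.
Z = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]       # blocks x pages
X = [[0.75, 0.25, 0.0], [0.0, 0.0, 1.0]]       # pages x blocks
U = [[0.9, 0.6, 0.0],                          # assumed DOC values; 0 across pages
     [0.6, 0.8, 0.0],
     [0.0, 0.0, 0.7]]
W_B = block_graph(Z, X, U)
```

Because the rows of Z·X and of the normalized U each sum to 1 in this example, every row of `W_B` also sums to 1, which is what allows it to be read as a transition matrix.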

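• The image graph can be constructed by noticing that
every image is contained in at least one block.
• The weight matrix of the image graph is defined as:
$W_I = Y^{T} W_B Y$
• where $W_I$ is an m×m matrix. If two images i and j are in
the same block, say b, then $W_I(i, j) = W_B(b, b) = 0$.
• However, images in the same block are supposed to be semantically
related. Thus, we get the new definition
$W_I = t D^{-1} Y^{T} Y + (1 - t) Y^{T} W_B Y$
where t is a suitable constant and D is a diagonal matrix with
$D_{ii} = \sum_j (Y^{T} Y)_{ij}$.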
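
The effect of adding a same-block term can be seen in a small sketch. It assumes the image graph has the form $W_I = t D^{-1} Y^{T} Y + (1 - t) Y^{T} W_B Y$, with $D_{ii}$ the i-th row sum of $Y^{T} Y$; the value t = 0.2, the matrices, and the helper names are hypothetical.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def image_graph(Y, W_B, t=0.2):
    """W_I = t * D^{-1} Y^T Y + (1 - t) * Y^T W_B Y."""
    Yt = transpose(Y)
    YtY = matmul(Yt, Y)                   # same-block co-occurrence term
    link_term = matmul(matmul(Yt, W_B), Y)
    W_I = []
    for i, row in enumerate(YtY):
        d = sum(row)                      # D_ii
        W_I.append([
            (t * v / d if d else 0.0) + (1 - t) * link_term[i][j]
            for j, v in enumerate(row)
        ])
    return W_I

# Block 0 holds images 0 and 1, block 1 holds none, block 2 holds image 2.
Y = [[0.5, 0.5, 0.0],
     [0.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
# A toy block transition matrix with no self-transition for block 0.
W_B = [[0.0, 0.0, 1.0],
       [0.5, 0.0, 0.5],
       [1.0, 0.0, 0.0]]
W_I = image_graph(Y, W_B)
```

Images 0 and 1 share block 0, and block 0 has no transition to itself, so the link term alone would give them zero similarity. The same-block term gives `W_I[0][1]` a positive weight of 0.1.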

Automatic Classification of Web Documents
• Each document is assigned a class label from a set of predefined
topic categories, based on a set of examples of preclassified
documents.
• For example, Yahoo!’s taxonomy and its associated documents can
be used as training and test sets in order to derive a Web document
classification scheme.
• Because a Web page may contain multiple themes, advertisements,
and navigation information, block-based page content analysis plays
an important role in the construction of high-quality classification
models.
• Block-based Web linkage analysis reduces such noise and enhances
the quality of Web document classification.
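
To illustrate why block-level filtering can help, the sketch below trains a tiny naive Bayes text classifier and classifies a page twice: once using only its important blocks and once using all of its text. Naive Bayes is swapped in here purely as an example classifier; the slides do not name one. The importance threshold, the training documents, and the block importance values are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (text, label). Returns per-class word counts and doc counts."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in docs:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, class_counts

def classify(text, word_counts, class_counts):
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(class_counts.values())
    best, best_score = None, -math.inf
    for label in class_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for w in text.lower().split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

def main_content(blocks, threshold=0.2):
    """Keep only blocks whose importance is at least the (assumed) threshold."""
    return " ".join(text for text, importance in blocks if importance >= threshold)

model = train([
    ("football match goal team", "sports"),
    ("stock market shares price", "finance"),
])
# (block text, block importance); the second block plays the role of an ad.
blocks = [
    ("team wins match with late goal", 0.7),
    ("buy shares stock price market deals", 0.1),
]
filtered_label = classify(main_content(blocks), *model)
unfiltered_label = classify(" ".join(text for text, _ in blocks), *model)
```

With the low-importance ad block removed, the page is labeled "sports". With the ad text included, the extra finance vocabulary tips the toy classifier to "finance".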

Web Usage Mining
• A Web server usually registers a (Web) log entry, or Weblog entry,
for every access of a Web page. It includes the URL requested, the
IP address from which the request originated and a timestamp.
• Web usage mining mines Weblog records to discover user access
patterns of Web pages.
• Analyzing and exploring Weblog records can identify potential
customers for electronic commerce, enhance the quality and
delivery of Internet information services to the end user, and
improve Web server system performance.
• Example: Web-based e-commerce servers.
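
A Weblog entry can be turned into a structured record with a few lines of parsing. The sketch assumes the server writes the widely used Common Log Format; the slides do not say which format is used, and the sample line and field names are illustrative.

```python
import re
from datetime import datetime

# Assumed layout: host ident user [time] "method url protocol" status bytes
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

def parse_entry(line):
    """Return the IP address, URL, timestamp and status of one log line, or None."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    return {
        "ip": m.group("ip"),
        "url": m.group("url"),
        "timestamp": datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
        "status": int(m.group("status")),
    }

line = ('192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] '
        '"GET /products/index.html HTTP/1.0" 200 2326')
entry = parse_entry(line)
```

Lines that do not match the assumed layout return `None`, which makes it easy to count and discard malformed entries during cleaning.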

• Techniques for developing Web usage mining:
– Success depends on what and how much valid and reliable
knowledge can be discovered from the large raw log data. The data
need to be cleaned, condensed, and transformed in order to retrieve
and analyze significant and useful information.
– A multidimensional view can be constructed on the Weblog
database, and multidimensional OLAP analysis can be performed to
find the top N users, the top N accessed Web pages, and so on, which
helps to discover potential customers, users, markets, and others.
– Data mining can be performed on Weblog records to find
association patterns, sequential patterns, and trends of Web
accessing.
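
Two of these analyses can be sketched on already-cleaned log records: a top-N count along one dimension, and a very simple sequential pattern (frequent page-to-page transitions within each user's visits). The record layout, the use of the IP address as a user identifier, the integer timestamps, and the support threshold are assumptions made for this illustration.

```python
from collections import Counter, defaultdict

def top_n(records, key, n=2):
    """Most frequent values of one dimension, e.g. 'url' or 'ip'."""
    return Counter(r[key] for r in records).most_common(n)

def frequent_transitions(records, min_support=2):
    """Count page-to-page transitions in each user's time-ordered visits."""
    by_user = defaultdict(list)
    for r in sorted(records, key=lambda r: r["time"]):
        by_user[r["ip"]].append(r["url"])
    pairs = Counter()
    for urls in by_user.values():
        pairs.update(zip(urls, urls[1:]))
    return {pair: count for pair, count in pairs.items() if count >= min_support}

records = [
    {"ip": "u1", "time": 1, "url": "/home"},
    {"ip": "u1", "time": 2, "url": "/catalog"},
    {"ip": "u1", "time": 3, "url": "/cart"},
    {"ip": "u2", "time": 1, "url": "/home"},
    {"ip": "u2", "time": 2, "url": "/catalog"},
    {"ip": "u3", "time": 1, "url": "/catalog"},
]
top_pages = top_n(records, "url")
top_users = top_n(records, "ip", n=1)
transitions = frequent_transitions(records)
```

In a real Weblog, users would more likely be identified by sessions than by raw IP addresses, and a dedicated sequential pattern mining algorithm would replace the pair counting used here.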

• For example, some studies have proposed adaptive sites:
websites that improve themselves by learning from user access
patterns.
• Weblog analysis may also help build customized Web services
for individual users.
• Weblog information can be integrated with Web content and
Web linkage structure mining to help with Web page ranking, Web
document classification, and the construction of a multilayered
Web information base.
