3. Web Mining
It is the application of data mining techniques to
automatically discover and extract information from
Web data, including web documents, hyperlinks
between documents, usage logs of websites, etc.
Web mining is a multidisciplinary field that draws on:
Data mining,
Machine learning,
Natural language processing,
Statistics,
Databases,
Information retrieval, multimedia, etc.
4. Web mining challenges
The Web has many unique characteristics, which make
mining useful information and knowledge a fascinating and
challenging task.
The amount of information on the Web is huge, and easily
accessible.
Information/data of almost all types exist on the Web, e.g.,
structured tables, texts, multimedia data, etc.
Much of the Web information is redundant. The same piece of
information or its variants may appear in many pages.
The Web is noisy. A Web page typically contains a mixture of many
kinds of information, e.g., main contents, advertisements,
navigation panels, copyright notices, etc.
5. Web mining challenges (cont …)
The Web is dynamic. Information on the Web changes
constantly. Keeping up with the changes and monitoring the
changes are important issues.
Above all, the Web is a virtual society. It is not only about data,
information and services, but also about interactions among
people, organizations and automatic systems, i.e., communities.
6. Classification of Web Mining Techniques
Web Structure Mining
Web Usage Mining
Web Content Mining
7. Web-Structure Mining
Discovering useful knowledge from hyperlinks,
which represent the structure of the Web.
Link mining refers to data mining techniques that
explicitly consider these links when building predictive or
descriptive models of the linked data. It is used in
applications such as:
In search engines: for discovering important Web
pages.
In social network analysis: for discovering
communities of users who share common interests.
Citation analysis (co-citation & bibliographic coupling)
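The two citation measures named above can be illustrated with a small sketch. It assumes the usual definitions (co-citation of i and j = number of documents that cite both; bibliographic coupling of i and j = number of documents cited by both). The `cites` map and the paper names are made up for illustration.

```python
# Hypothetical citation data: each paper maps to the set of papers it cites.
cites = {
    "A": {"C", "D"},
    "B": {"C", "D", "E"},
    "C": {"E"},
    "D": set(),
    "E": set(),
}

def co_citation(i, j, cites):
    """Number of papers that cite both i and j."""
    return sum(1 for refs in cites.values() if i in refs and j in refs)

def bibliographic_coupling(i, j, cites):
    """Number of papers cited by both i and j."""
    return len(cites[i] & cites[j])

print(co_citation("C", "D", cites))             # 2: A and B each cite both C and D
print(bibliographic_coupling("A", "B", cites))  # 2: A and B both cite C and D
```

Co-citation looks at incoming links (who cites the pair), while bibliographic coupling looks at outgoing links (what the pair cites).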
8. Web-Usage Mining
Discovery of user access patterns from Web
usage logs, which record user clickstreams.
Clickstream
It is the record of what a computer user clicks on
while browsing the Web. As the user clicks anywhere on
a webpage, the action is logged on the client or on
the Web server, and possibly in other sources.
9. Web-Usage Mining
Clickstream Analysis answers the following questions:
Which web page is the most common point of entry for users?
Are visitors entering through the gateway constructed by the
website developers, or are they somehow bypassing the
gateway and landing in the middle of the Web site?
In which order have the pages been viewed?
Is this page sequencing as the developers might have expected,
or is there something the users are trying to tell us about how the
Web site should be structured?
Which other Web sites referred the users to your Web site?
Which referrer sites are providing us with the greatest number of
referrals?
How many web pages have been viewed in the typical visit?
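A minimal sketch of how some of these questions might be answered, assuming the clickstream has already been split into sessions. The record layout (session_id, referrer, page), the function name `summarize` and all sample values are hypothetical.

```python
from collections import Counter

# Hypothetical, already-sessionized clickstream: (session_id, referrer, page),
# listed in the order the clicks occurred.
clicks = [
    ("s1", "search.example", "/home"),
    ("s1", None, "/products"),
    ("s1", None, "/cart"),
    ("s2", "blog.example", "/products"),
    ("s2", None, "/cart"),
    ("s3", "search.example", "/home"),
]

def summarize(clicks):
    sessions = {}          # session_id -> ordered list of viewed pages
    referrers = Counter()  # external site that started each session
    for session_id, referrer, page in clicks:
        # Only the first click of a session counts as a referral.
        if session_id not in sessions and referrer is not None:
            referrers[referrer] += 1
        sessions.setdefault(session_id, []).append(page)
    entry_pages = Counter(pages[0] for pages in sessions.values())
    avg_pages = sum(len(pages) for pages in sessions.values()) / len(sessions)
    return entry_pages, referrers, sessions, avg_pages

entry_pages, referrers, sessions, avg_pages = summarize(clicks)
print(entry_pages.most_common(1))  # most common point of entry
print(sessions["s1"])              # order in which pages were viewed
print(referrers.most_common(1))    # referrer with the most referrals
print(avg_pages)                   # pages viewed in a typical visit
```

Here session s2 enters at /products rather than /home, which is the kind of gateway bypass the second question asks about.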
10. Web-Usage Mining Benefits
Restructure a website
Extract user access patterns to target ads
Count the number of accesses to individual files
Predict user behavior based on previously learned
rules and users’ profile
11. Web-Usage Mining Techniques
Data Preprocessing
Conversion of the raw data in usage logs into a form
suitable for mining (e.g., data cleaning).
Pattern Discovery
- Uses algorithms and techniques from data mining,
machine learning, statistics, pattern recognition, etc.
- Common data mining techniques are association rules
and sequential pattern mining.
Pattern Analysis
Validation and interpretation of the mined patterns.
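The first two steps can be sketched on a toy log. This is a simplified illustration, not a full association rule miner: it only counts page pairs that co-occur in enough sessions. The record layout, the list of ignored file suffixes and the `min_support` threshold are all assumptions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw log records: (session_id, requested_path, http_status).
raw_log = [
    ("s1", "/home", 200), ("s1", "/logo.png", 200), ("s1", "/products", 200),
    ("s2", "/home", 200), ("s2", "/products", 200), ("s2", "/missing", 404),
    ("s3", "/home", 200), ("s3", "/about", 200),
]

# Assumed list of embedded resources that are not page views.
IGNORED_SUFFIXES = (".png", ".jpg", ".gif", ".css", ".js")

def preprocess(raw_log):
    """Data cleaning: drop failed requests and embedded resources,
    then group the remaining pages by session."""
    sessions = {}
    for session_id, path, status in raw_log:
        if status != 200 or path.endswith(IGNORED_SUFFIXES):
            continue
        sessions.setdefault(session_id, set()).add(path)
    return sessions

def frequent_pairs(sessions, min_support):
    """Pattern discovery: page pairs whose share of sessions is >= min_support."""
    counts = Counter()
    for pages in sessions.values():
        counts.update(combinations(sorted(pages), 2))
    n = len(sessions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

sessions = preprocess(raw_log)
patterns = frequent_pairs(sessions, min_support=0.5)
print(patterns)  # only ('/home', '/products') survives, seen in 2 of 3 sessions
```

Pattern analysis would then decide whether a surviving pair is interesting, for example by checking it against the site's existing navigation.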
12. Web Content Mining
Discovering useful information or knowledge
from Web page contents.
Web content data include text, images, audio, video,
metadata and hyperlinks.
Technologies that are normally used in web
content mining are NLP (Natural Language
Processing) and IR (Information Retrieval).
13. Web Content Mining Applications
Web Information Integration and Schema
Matching.
(Lecture 2)
Opinion extraction from online sources.
(Lecture 3)
Knowledge synthesis (representation).
(Lecture 4)
15. Social network analysis
CS583, Bing Liu, UIC
Social network analysis is the study of social entities
(people in an organization, called actors) and their
interactions and relationships.
The interactions and relationships can be
represented with a network or graph, in which
each vertex (or node) represents an actor and
each link represents a relationship.
From the network, we can study the properties of its
structure, and find various kinds of sub-graphs, e.g.,
communities formed by groups of actors.
We study two types of social network analysis, centrality
and prestige, which are closely related to hyperlink
analysis and search on the Web.
16. Centrality
Important or prominent actors are those that
are linked or involved with other actors
extensively.
A person with extensive contacts (links) or
communications with many other people in
the organization is considered more important
than a person with relatively fewer contacts.
The links can also be called ties. A central
actor is one involved in many ties.
17. Centrality
Based on the varying notions of importance of
vertices or edges, different centrality measures
were developed:
1. Degree centrality
2. Betweenness centrality
3. Closeness centrality
18. Degree Centrality
Central actors are the most active actors, those that have the most links
or ties with other actors. Let the total number of actors in the network be n.
Undirected Graph: In an undirected graph, the degree centrality of an
actor i (denoted by C_D(i)) is simply the node degree (the number of edges)
of the actor node, denoted by d(i), normalized with the maximum degree, n-1:
$$C_D(i) = \frac{d(i)}{n-1}$$
The value of this measure ranges between 0 and 1 as n-1 is the maximum
value of d(i).
Directed Graph: In this case, we need to distinguish in-links of actor i
(links pointing to i) and out-links (links pointing out from i). The degree
centrality is defined based on only the out-degree (the number of out-links or
edges), d_o(i), normalized in the same way:
$$C_D(i) = \frac{d_o(i)}{n-1}$$
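A minimal sketch of degree centrality under the definitions above, assuming the network is stored as an adjacency map (a representation chosen here for illustration). Both sample networks and the function name `degree_centrality` are hypothetical.

```python
def degree_centrality(adj):
    """adj maps each actor to its set of neighbours (undirected graph)
    or to its set of out-neighbours (directed graph).
    Returns d(i) / (n - 1), or d_o(i) / (n - 1), for every actor i."""
    n = len(adj)
    return {actor: len(nbrs) / (n - 1) for actor, nbrs in adj.items()}

# Hypothetical undirected network: a star with centre "a" and three leaves.
undirected = {"a": {"b", "c", "d"}, "b": {"a"}, "c": {"a"}, "d": {"a"}}
print(degree_centrality(undirected))  # a -> 1.0, each leaf -> 1/3

# Hypothetical directed network: only out-links are counted.
directed = {"a": {"b", "c"}, "b": {"c"}, "c": set()}
print(degree_centrality(directed))    # a -> 1.0, b -> 0.5, c -> 0.0
```

In the directed case actor c receives two links but scores 0, because the measure ignores in-links.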
20. Closeness Centrality
This view of centrality is based on closeness or distance. The basic
idea is that an actor i is central if it can easily interact with all other
actors. That is, its distance to all other actors is short. Thus, we can use
the shortest distance to compute this measure. Let the shortest
distance from actor i to actor j be d(i, j) (measured as the number of
links in a shortest path).
Undirected Graph: The closeness centrality C_C(i) of actor i is defined as
$$C_C(i) = \frac{n-1}{\sum_{j=1}^{n} d(i,j)}$$
The value of this measure also ranges between 0 and 1 as n-1 is the
minimum value of the denominator, which is the sum of the shortest
distances from i to all other actors.
Directed Graph: The same equation can be used for a directed graph. The
distance computation needs to consider directions of links or edges.
21. Closeness Centrality
Example: C_C(d) = 0.75
Node d is at distance 1 from 4 nodes and at distance 2 from 2 nodes. Then
$$\sum_{j \neq d} \mathrm{dist}(d, j) = 1+1+1+1+2+2 = 8$$
Since there are 7 nodes in the network, the numerator of the closeness
centrality equation (n-1) is 6, so the closeness centrality of d is
6/8 = 0.75.
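The arithmetic above can be checked with a short sketch that computes shortest distances by breadth-first search. The 7-node graph below is not the network from the slide; it is a made-up graph chosen only so that d has four neighbours and two actors at distance 2. The function names are hypothetical as well.

```python
from collections import deque

def shortest_distances(adj, source):
    """Breadth-first search: number of links on a shortest path
    from source to every reachable actor."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def closeness_centrality(adj, i):
    """(n - 1) divided by the sum of shortest distances from i."""
    n = len(adj)
    dist = shortest_distances(adj, i)
    if len(dist) < n:
        raise ValueError("some actors are unreachable from " + repr(i))
    return (n - 1) / sum(dist.values())

# Hypothetical undirected network with 7 actors.
adj = {
    "d": {"a", "b", "c", "e"},
    "a": {"d", "f"},
    "b": {"d"},
    "c": {"d"},
    "e": {"d", "g"},
    "f": {"a"},
    "g": {"e"},
}
print(closeness_centrality(adj, "d"))  # 0.75, i.e. 6 / 8
```

For a directed graph the same functions apply if `adj` lists only out-links, since the search then follows link directions.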
22. Betweenness Centrality
If two non-adjacent actors j and k want to
interact and actor i is on the path between j
and k, then i may have some control over the
interactions between j and k.
Betweenness measures this control of i over
other pairs of actors. Thus,
if i is on the paths of many such interactions, then
i is an important actor.
23. Betweenness Centrality (cont …)
Undirected graph: Let p_jk be the number of
shortest paths between actor j and actor k.
The betweenness of an actor i is defined as the
number of shortest paths that pass i (p_jk(i)),
normalized by the total number of shortest paths
and summed over all pairs j < k:
$$C_B(i) = \sum_{j<k} \frac{p_{jk}(i)}{p_{jk}}$$
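A brute-force sketch of this definition for a small undirected graph. It counts shortest paths with breadth-first search and uses the fact that a shortest j-k path passes through i exactly when d(j, i) + d(i, k) = d(j, k). The function names and both sample graphs are hypothetical, and no further normalization is applied beyond the sum described above.

```python
from collections import deque
from itertools import combinations

def bfs_counts(adj, source):
    """Return (dist, sigma): shortest distance from source to each reachable
    actor, and the number of distinct shortest paths reaching it."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    return dist, sigma

def betweenness(adj, i):
    """Sum over pairs j < k (both different from i) of p_jk(i) / p_jk."""
    info = {actor: bfs_counts(adj, actor) for actor in adj}
    dist_i, sigma_i = info[i]
    total = 0.0
    for j, k in combinations([x for x in adj if x != i], 2):
        dist_j, sigma_j = info[j]
        if k not in dist_j or i not in dist_j or k not in dist_i:
            continue  # j and k are not connected, or i cannot lie between them
        if dist_j[i] + dist_i[k] == dist_j[k]:
            # Paths through i = (paths j..i) * (paths i..k).
            total += sigma_j[i] * sigma_i[k] / sigma_j[k]
    return total

# Hypothetical star: every leaf-to-leaf shortest path passes through "a".
star = {"a": {"b", "c", "d"}, "b": {"a"}, "c": {"a"}, "d": {"a"}}
print(betweenness(star, "a"))   # 3.0, one for each of the 3 leaf pairs

# Hypothetical 4-cycle: b and d have two shortest paths, one of them via a.
cycle = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "a"}}
print(betweenness(cycle, "a"))  # 0.5
```

Running one search per actor is fine for toy graphs; larger networks would call for a more efficient algorithm.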