2. Web Mining:
“Web mining refers to the overall process of
discovering potentially useful and previously
unknown information or knowledge from the Web
data.”
3. Discovering knowledge from and about the WWW is one
of the basic abilities of an intelligent agent.
[Figure: Knowledge and the WWW]
4. Web Mining: Subtasks
1. Resource finding
◦ Retrieving intended documents
2. Information selection/pre-processing
◦ Select and pre-process specific information from selected
documents
3. Generalization
◦ Discover general patterns within and across web sites
4. Analysis
◦ Validation and/or interpretation of mined patterns
5. Web Mining:
As Kosala et al. put it, we interact
with the web for the following
purposes.
1. Finding Relevant Information
We either browse or use the search service when we want to find
specific information on the web.
We usually specify a simple keyword query and the response
from the web search engine is a list of pages, ranked based on
their similarity to the query.
6. Search tools have the following problems:
1. Low precision: Many of the search results are irrelevant.
We may get many pages of information
which are not really relevant to our query.
2. Low recall/unindexed information: This is due to the
inability to index all the information available on the web.
Because some of the relevant pages are not properly indexed,
we may not get those pages through any of the search engines.
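The two problems above correspond to the standard precision and recall measures. The following is a minimal sketch, assuming we know which pages a search tool returned and which pages are truly relevant; the function name and the page identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant
    # Precision: share of returned pages that are relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant pages that were returned.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 pages returned, 5 pages are truly relevant.
retrieved = ["p1", "p2", "p3", "p4"]
relevant = ["p2", "p4", "p5", "p6", "p7"]
precision, recall = precision_recall(retrieved, relevant)
```

Here only two of the four returned pages are relevant (low precision), and three relevant pages are never returned, for example because they were not indexed (low recall).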
7. 2. Discovering new knowledge from the
web
We can have a data-triggered process that presumes that we
already have a collection of web data and we want to extract
potentially useful knowledge out of it (data mining-oriented).
3. Personalized web page synthesis
We may wish to synthesize a web page for different
individuals from the available set of web pages, i.e.,
catering to personal preferences in content and
presentation.
Individuals have their own preferences in the
style of content and presentation while
interacting with the web.
8. 4. Learning about individual users
This problem has several subproblems, such
as mass-customizing information for the
intended consumers or even personalizing it for
individual users, problems related to effective
web site design and management, and problems
related to marketing.
9. •Web mining can be said to have three
operations of interest:
1. Clustering – Finding natural groupings of users,
pages, etc.
2. Associations – which URLs tend to be
requested together.
3. Sequential Analysis – The order in which URLs
tend to be accessed.
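Two of the three operations above, associations and sequential analysis, can be sketched with simple counting. This is only an illustration under stated assumptions: the sessions, URLs, and function names are hypothetical, and clustering is left out because it needs a distance measure the slides do not specify.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sessions: each is the ordered list of URLs one visitor requested.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/products", "/cart"],
]


def co_requested_pairs(sessions):
    """Associations: count unordered URL pairs that occur in the same session."""
    counts = Counter()
    for session in sessions:
        for pair in combinations(sorted(set(session)), 2):
            counts[pair] += 1
    return counts


def consecutive_transitions(sessions):
    """Sequential analysis: count ordered (from_url, to_url) steps."""
    counts = Counter()
    for session in sessions:
        counts.update(zip(session, session[1:]))
    return counts


pairs = co_requested_pairs(sessions)
transitions = consecutive_transitions(sessions)
```

The pair counts ignore order, while the transition counts keep it, so `("/products", "/cart")` is counted as a transition but `("/cart", "/products")` is not.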
10.
11. WEB CONTENT MINING:
•Web content mining describes the discovery of useful information
from the web contents.
•The web contains many kinds of data.
•Much government information has gradually been
placed on the web in recent years.
•We also know the existence of Digital Libraries that are also
accessible from the web.
• Many commercial institutions are transforming their business and
services electronically.
• We cannot ignore another type of web content: web
applications accessed through web interfaces.
•Some of the web content data are hidden data, and some are generated
dynamically as a result of queries and reside in the DBMS. These data
are generally not indexed.
12. Web content consists of several types of data such as
textual, image, audio, video, metadata as well as
hyperlinks.
Recent research on mining multiple types of data is termed
multimedia data mining.
The textual parts of web content data consist of
unstructured data such as free texts, semi-structured
data such as HTML documents and more structured
data such as data in the tables or database-generated
HTML pages.
13. Web Structure Mining:
Web structure mining is concerned with
discovering the model underlying the link
structures of the web. In other words, it studies how
documents are linked within the web itself.
It is used to study the topology of the hyperlinks
with or without the description of the links.
This model can be used to categorize web pages
and is useful to generate information such as the
similarity and relationship between different web
sites.
It is interested in the structure between web documents
(not within a document), and is inspired by the study of
social networks and citation analysis.
14. PageRank (PR) is an algorithm used by Google
Search to rank web pages in their search engine
results. It is named after both the term "web page"
and co-founder Larry Page. PageRank is a way of
measuring the importance of website pages.
Damping factor:
The PageRank theory holds that an imaginary
surfer who is randomly clicking on links will
eventually stop clicking. The probability, at any
step, that the person will continue is a damping
factor d.
15. PageRank is defined as follows:
We assume page A has pages T1, ..., Tn which
point to it (i.e., are citations). Then
PR(A) = (1 - d) + d * (PR(T1)/out_deg(T1) + ... + PR(Tn)/out_deg(Tn))
The parameter d is a damping factor which can be
set between 0 and 1 and is usually set to 0.85.
out_deg(A) denotes the number of links going out
of page A (out-degree of A).
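The definition can be turned into a short iterative computation. The sketch below assumes the commonly cited form PR(A) = (1 - d) + d * (PR(T1)/out_deg(T1) + ... + PR(Tn)/out_deg(Tn)), starts every page at rank 1, and repeats the update a fixed number of times. The function name, the iteration count, and the three-page graph are hypothetical.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate the PageRank update; `links` maps a page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    out_deg = {p: len(links.get(p, [])) for p in pages}
    # Invert the graph: for each page, collect the pages pointing to it.
    incoming = {p: [] for p in pages}
    for source, targets in links.items():
        for target in targets:
            incoming[target].append(source)
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Every page is updated from the ranks of the previous round.
        rank = {
            p: (1 - d) + d * sum(rank[t] / out_deg[t] for t in incoming[p])
            for p in pages
        }
    return rank


# Hypothetical three-page web: A -> B, A -> C, B -> C, C -> A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

In this small graph C ends up ranked highest, because it receives all of B's rank and half of A's, and B ends up lowest. Pages without outgoing links are not treated specially here, which a production implementation would need to handle.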
16. Social Network:
Social network analysis is another way of studying
the web link structure. It uses an exponentially
varying damping factor.
Web structure mining utilizes the hyperlink
structure of the web to apply social network
analysis and model the underlying link structure of
the web itself.
Social network analysis studies ways to measure the
relative standing or importance of individuals in a
network. The same process can be mapped to study
the link structures of web pages.
17. • INDEX NODE: An index node is a node whose
out-degree is significantly larger than the
average out-degree of the graph.
• REFERENCE NODE: A reference node is a
node whose in-degree is significantly larger
than the average in-degree of the graph.
A link is said to be a transverse link if it is between
pages with different domain names, and
an intrinsic link if it is between pages with the same
domain name. Here, by "domain name" we mean the
first level in the URL string associated with the page.
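These definitions can be sketched in a few lines. Two assumptions are made here that the slides do not fix: "significantly larger" is read as more than `factor` times the average degree, and "domain name" is read as the host part of the URL as returned by `urllib.parse.urlparse`. The function names and example URLs are hypothetical.

```python
from urllib.parse import urlparse


def link_type(source_url, target_url):
    """Classify a hyperlink by comparing the host part of both URLs."""
    same_host = urlparse(source_url).netloc == urlparse(target_url).netloc
    return "intrinsic" if same_host else "transverse"


def index_and_reference_nodes(edges, factor=2.0):
    """Return (index_nodes, reference_nodes) for a list of (source, target) edges."""
    nodes = {n for edge in edges for n in edge}
    out_deg = {n: 0 for n in nodes}
    in_deg = {n: 0 for n in nodes}
    for source, target in edges:
        out_deg[source] += 1
        in_deg[target] += 1
    # The average in-degree and average out-degree are both edges / nodes.
    avg = len(edges) / len(nodes)
    index_nodes = {n for n in nodes if out_deg[n] > factor * avg}
    reference_nodes = {n for n in nodes if in_deg[n] > factor * avg}
    return index_nodes, reference_nodes


# Hypothetical graph: "hub" links out to many pages, "d" is linked to by many.
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"),
         ("a", "d"), ("b", "d"), ("c", "d")]
index_nodes, reference_nodes = index_and_reference_nodes(edges)
kind = link_type("http://a.example/x", "http://b.example/y")
```

With seven edges over five nodes the average degree is 1.4, so only `hub` (out-degree 4) qualifies as an index node and only `d` (in-degree 4) as a reference node.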
18. For determining the collection of similar pages, we
need to define the similarity measure between pages.
There are two basic similarity functions: bibliographic coupling and co-citation.
For the pair of nodes, p and q, the bibliographic
coupling is equal to the number of nodes that
have links from both p and q.
Example: Documents are said to be
bibliographically coupled if they share one or
more bibliographic references. It is used as an
indicator of subject relatedness. There is no guarantee
that two bibliographically coupled documents (A) and
(B) cite the same piece of information in (C).
19. For the pair of nodes, p and q, the co-citation is
the number of nodes that point to both p and q.
For example, if A and B are both cited by C, they may be said
to be related to one another, even though they
don't directly reference each other.
If A and B are both cited by many other items,
they have a stronger relationship. The more items
they are cited by, the stronger their relationship is.
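Both similarity functions are simple counts over the link graph. The following is a minimal sketch, assuming the graph is given as a mapping from each page to the pages it links to; the function names and the example graph are hypothetical.

```python
def bibliographic_coupling(links, p, q):
    """Number of pages that both p and q link to."""
    return len(set(links.get(p, ())) & set(links.get(q, ())))


def co_citation(links, p, q):
    """Number of pages that link to both p and q."""
    return sum(1 for targets in links.values() if p in targets and q in targets)


# Hypothetical link graph: page -> pages it links to.
links = {
    "A": ["C", "D"],
    "B": ["C", "D", "E"],
    "W": ["A", "B"],
    "X": ["A", "B"],
    "Y": ["A", "B"],
    "Z": ["A"],
}
coupling = bibliographic_coupling(links, "A", "B")  # shared outgoing targets
cocited = co_citation(links, "A", "B")              # shared incoming sources
```

A and B both link to C and D, so their bibliographic coupling is 2. W, X, and Y link to both of them, so their co-citation count is 3; Z links only to A and does not count.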
20. In some cases we can take into
account both bibliographic coupling and
co-citation. The similarity
measure between two sub-clusters
Sx and Sy is computed as
|Sx ∩ Sy| / |Sx ∪ Sy|
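Reading the ratio above as the size of the intersection divided by the size of the union (the Jaccard coefficient), the measure can be sketched as follows. The function name and the example sub-clusters are hypothetical, and returning 0.0 for two empty sub-clusters is a choice made here, not something the slides specify.

```python
def subcluster_similarity(sx, sy):
    """Size of the intersection divided by the size of the union."""
    sx, sy = set(sx), set(sy)
    union = sx | sy
    return len(sx & sy) / len(union) if union else 0.0


# Hypothetical sub-clusters of pages.
similarity = subcluster_similarity({"p1", "p2", "p3"}, {"p2", "p3", "p4"})
```

The two sub-clusters share two of four distinct pages, so the similarity is 0.5. The value ranges from 0 (disjoint) to 1 (identical).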
21.
22. Web Usage Mining is the process of
applying data mining techniques to the
discovery of usage patterns from Web data,
in order to understand and better serve the
needs of Web-based applications.
23. WEB USAGE MINING
Web usage mining deals with studying the data
generated by the web surfer's sessions or behaviors.
Web content and structure mining utilize the real or
primary data on the web. On the contrary, web usage
mining mines the secondary data derived from
web server access logs, proxy server logs, browser
logs, user profiles, registration data, user sessions or
transactions, cookies, user queries, bookmark data,
mouse clicks and scrolls, and any other data which
results from these interactions.
In simple words, we can say that it is the discovery of user
access patterns from web usage logs.
24. Two Approaches in web usage mining
1. General access pattern tracking:
This is to learn user navigation patterns
(impersonalized).
General access pattern tracking analyzes web
logs to understand access patterns and trends.
2. Customized usage tracking:
This is to learn a user profile, or to perform user modeling in adaptive
interfaces (personalized).
Customized usage tracking analyzes individual trends.
Its purpose is to customize web sites to users.
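The difference between the two approaches can be illustrated on a small access log. This is a sketch under several assumptions not taken from the slides: the log lines follow the Common Log Format, the client address stands in for a user, and the sample lines and names are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Captures the client host and the requested URL from a Common Log Format line.
LOG_PATTERN = re.compile(r'(\S+) \S+ \S+ \[[^\]]+\] "\S+ (\S+) [^"]*" \d{3} \S+')

# Hypothetical access log entries.
log_lines = [
    '10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /home HTTP/1.1" 200 512',
    '10.0.0.1 - - [01/Jan/2024:10:00:05 +0000] "GET /products HTTP/1.1" 200 1024',
    '10.0.0.2 - - [01/Jan/2024:10:01:00 +0000] "GET /home HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2024:10:01:30 +0000] "GET /about HTTP/1.1" 200 256',
    '10.0.0.1 - - [01/Jan/2024:10:02:00 +0000] "GET /products HTTP/1.1" 200 1024',
]


def parse(lines):
    """Yield (host, url) for every line that matches the pattern."""
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            yield match.group(1), match.group(2)


overall = Counter()                 # general access pattern tracking
per_visitor = defaultdict(Counter)  # customized usage tracking
for host, url in parse(log_lines):
    overall[url] += 1
    per_visitor[host][url] += 1
```

`overall` answers site-wide questions such as which pages are requested most, while `per_visitor` holds one small profile per client that a site could use for customization. Real logs would also need session splitting and handling of shared addresses.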
25. Text mining is the subset of data mining that
involves processing unstructured text
documents into a structured format.
Web mining is a subset of data mining that
involves processing data related to the
Web. It can be web logs, web structure data,
or web content data.
26. Due to the continuous growth of the volume of text
data, automated extraction of implicit, previously
unknown, and potentially useful information
becomes more necessary to properly utilize this
vast source of knowledge.
Text mining, therefore, corresponds to the
extension of the data mining approach to textual
data and is concerned with various tasks, such as
extraction of information implicitly contained in
collections of documents, or similarity-based
structuring.
27. UNSTRUCTURED TEXT
Unstructured documents are free texts such as
news stories.
Features
1. Word Occurrences
The bag-of-words or vector representation takes single words in the
training corpus as features, ignoring the sequence in which the words
occur.
2. Stop-words
Feature selection includes removing case, punctuation,
infrequent words, and stop words.
3. Latent Semantic Indexing
Latent Semantic Indexing (LSI) transforms the original
document vectors to a lower-dimensional space by
analyzing the correlational structure of terms in the document
collection, such that similar documents that do not share terms
are placed in the same topic.
28. 4. Stemming
Stemming is a process which reduces words to their
morphological roots. For example, the words "informing", "information",
"informer", and "informed" would be stemmed to their common root
"inform", and only the latter word is used as the feature instead of
the former four.
5. N-Gram
Other feature representations are also possible, such as using
information about word positions in the document, or using
n-grams representation (word sequence of length up to n).
6. Part Of Speech (POS)
One important feature is POS. There can be 25 possible values
for POS tags. Most common tags are noun, verb, adjective
and adverb.
7. Positional Collocations
The values of this type of feature are the words that occur one or
two positions to the right or left of the given word.
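Several of the features listed above can be sketched with the Python standard library alone. This is a minimal illustration, not a full pipeline: the stop-word list is a tiny hypothetical one, `ngrams` returns sequences of exactly n words rather than up to n, and stemming, POS tagging, and LSI are omitted because they normally rely on external libraries.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "is", "and", "to", "in"}  # tiny illustrative list


def tokenize(text):
    """Lowercase the text and drop punctuation."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text):
    """Word-occurrence features with stop words removed; word order is ignored."""
    return Counter(t for t in tokenize(text) if t not in STOP_WORDS)


def ngrams(tokens, n):
    """All word sequences of length exactly n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def collocations(tokens, word, window=2):
    """Words up to `window` positions left or right of each occurrence of `word`."""
    found = []
    for i, token in enumerate(tokens):
        if token == word:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            found.extend(left + right)
    return found


text = "The web is a vast source of data, and mining the web is hard."
tokens = tokenize(text)
bag = bag_of_words(text)
bigrams = ngrams(tokens, 2)
near_mining = collocations(tokens, "mining")
```

In this example `bag` counts "web" twice and contains no stop words, `bigrams` keeps adjacent word pairs in order, and `near_mining` lists the two words on each side of "mining".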