WEB MINING.pptx

WEB MINING
 We make use of the web in several ways. As Kosala et al. put it,
we interact with the web for the following purposes.
FINDING RELEVANT INFORMATION
 We either browse or use the search service when we want to find
specific information on the web.
 We usually specify a simple keyword query, and the response from a
web search engine is a list of pages ranked by their
similarity to the query.
 However, today's search tools have the following problems:
 Low precision: many of the search results are irrelevant. We may get
many pages of information that are not really relevant to our query.
 Low recall: not all the information available on the web can be
indexed. Because some of the relevant pages are not properly
indexed, we may not get those pages through any of the search engines.

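As a rough illustration of the two measures, precision is the fraction of returned pages that are relevant, and recall is the fraction of relevant pages that are returned. The sketch below uses made-up page identifiers and a hypothetical helper name, precision_recall.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: pages the search tool returned
    relevant:  pages that actually answer the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical page identifiers, for illustration only.
retrieved = ["p1", "p2", "p3", "p4", "p5"]
relevant = ["p1", "p2", "p6", "p7"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Here two of the five returned pages are relevant (low precision), and two of the four relevant pages were found at all (low recall).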
DISCOVERING NEW KNOWLEDGE FROM THE WEB
 We can term the above problem a query-triggered process
(retrieval-oriented).
 On the other hand, we can have a data-triggered process that
presumes that we already have a collection of web data and we want
to extract potentially useful knowledge out of it (data
mining-oriented).
PERSONALIZED WEB PAGE SYNTHESIS
 We may wish to synthesize a web page for different individuals from
the available set of web pages.
 Individuals have their own preferences for the style of content
and presentation while interacting with the web.
 Information providers would like to create a system that responds to
user queries by potentially aggregating information from several
sources, in a manner that depends on the user.

LEARNING ABOUT INDIVIDUAL USERS
 It is about knowing what the customers do and want. Within
this problem there are sub-problems, such as mass
customizing the information for the intended consumers or
even personalizing it for an individual user, problems related to
effective web site design and management, problems related to
marketing, etc.
Web mining provides a set of techniques that can
be used to solve the above problems.
Sometimes, web mining techniques provide direct solutions
to the above problems.

Web mining techniques can
be categorized into three areas:
 Web content mining,
 Web structure mining, and
 Web usage mining.
Figure: Web mining tasks. Web Mining divides into Web Content Mining
(Web Page Content Mining, Search Result Mining), Web Structure Mining,
and Web Usage Mining (General Access Pattern Tracking, Customized
Usage Tracking).

Web content mining
 Web content mining describes the discovery of useful
information from the web contents.
 The web contains many kinds of data:
 Much government information has gradually
been placed on the web in recent years.
 Digital libraries are also accessible
from the web.
 Many commercial institutions are transforming their
businesses and services electronically.
 We cannot ignore another type of web content: web
applications, which users can
access through web interfaces.

 Basically, web content consists of several types of
data, such as text, image, audio, video, and metadata, as
well as hyperlinks.
 The textual part of web content consists of
unstructured data such as free text, semi-structured
data such as HTML documents, and more structured
data such as data in tables or database-generated
HTML pages.

WEB STRUCTURE MINING
 Web structure mining is the process of discovering
structure information from the web.
 According to the type of web structural data, web structure
mining can be divided into two kinds:
 Extracting patterns from hyperlinks in the web:
a hyperlink is a structural component that connects a web
page to a different location.
 Mining the document structure:
analysis of the tree-like structure of pages to
describe HTML or XML tag usage.
 The structure of a typical web graph consists of web pages as
nodes and hyperlinks as edges connecting
related pages.

Web structure mining terminology:
 Web graph: directed graph representing the web.
 Node: a web page in the graph.
 Edge: a hyperlink.
 In-degree: number of links pointing to a particular node.
 Out-degree: number of links going out of a particular node.

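A minimal sketch of these terms, assuming the web graph is given as a list of (source page, target page) edges. The page names are made up for illustration.

```python
from collections import Counter

# Hypothetical web graph as a list of directed edges (source page, target page).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"), ("D", "C")]

# Out-degree counts the links leaving a page; in-degree counts the links arriving.
out_degree = Counter(src for src, _ in edges)
in_degree = Counter(dst for _, dst in edges)

nodes = sorted({page for edge in edges for page in edge})
for page in nodes:
    # A Counter returns 0 for pages that never appear on that side of an edge.
    print(page, "in:", in_degree[page], "out:", out_degree[page])
```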
 Some of the techniques that are useful in modeling web
topology are described below.
 PAGE RANK
 Used to discover the most important pages on the web.
 A page can have a high PageRank if there are many pages
that point to it, or if there are some pages that point to it
which have a high PageRank.
 PageRank is defined as follows:
We assume page A has pages T1, ..., Tn which point to it
(i.e., are citations). The parameter d is a damping factor which
can be set between 0 and 1 and is usually set to 0.85.
out_deg(A) denotes the number of links going out of page A
(the out-degree of A). The PageRank of page A is then
PR(A) = (1 - d) + d * (PR(T1)/out_deg(T1) + ... + PR(Tn)/out_deg(Tn))

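A minimal sketch of how PageRank can be computed by repeated substitution, assuming the commonly cited form PR(A) = (1 - d) + d * (PR(T1)/out_deg(T1) + ... + PR(Tn)/out_deg(Tn)). The function name, the fixed iteration count, and the three-page graph are illustrative assumptions, not a production algorithm.

```python
def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it points to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    out_deg = {p: len(links.get(p, [])) for p in pages}
    # incoming[p] lists the pages T1..Tn that point to p.
    incoming = {p: [] for p in pages}
    for src, targets in links.items():
        for dst in targets:
            incoming[dst].append(src)
    pr = {p: 1.0 for p in pages}  # arbitrary starting values
    for _ in range(iterations):
        # Every page is updated from the previous round's values.
        pr = {
            p: (1 - d) + d * sum(pr[t] / out_deg[t] for t in incoming[p])
            for p in pages
        }
    return pr


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
for page in sorted(ranks):
    print(page, round(ranks[page], 3))
```

Page C is pointed to by both A and B, so it ends up with a higher rank than B, which is pointed to by A alone.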
 SOCIAL NETWORK
 Social network analysis is yet another way of studying the
web link structure. It uses an exponentially varying
damping factor.
 Web structure mining uses the hyperlink structure of
the web to apply social network analysis and to model the
underlying link structure of the web itself.
 Social network analysis studies ways to measure the relative
standing or importance of individuals in a network.
 The same process can be mapped to study the link
structure of web pages. The basic premise here is that
if a web page links to another web page, then the
former is, in some sense, endorsing the importance of the
latter.

 Kautz et al., in a pioneering work on web structure
mining, The Hidden Web, propose a measure of the
standing of a node based on path counting. They carry
out social network analysis to model the network of AI
researchers. The standing of a node (page) can be
defined as follows.

 TRANSVERSE AND INTRINSIC LINKS
A link is said to be a transverse link if it is between
pages with different domain names, and
an intrinsic link if it is between pages with the same
domain name.
 REFERENCE NODES AND INDEX NODES
Botafogo et al. propose another way of ranking pages.
They define the notion of index nodes and reference nodes.
 DEFINITION 11.3: INDEX NODE
An index node is a node whose out-degree is
significantly larger than the average out-degree of the graph.
 DEFINITION 11.4: REFERENCE NODE
A reference node is a node whose in-degree is
significantly larger than the average in-degree of the graph.

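The definitions on this slide can be sketched as follows. Two assumptions are made here that the slide does not state: the domain name is approximated by the host name of an absolute URL, and "significantly larger" is taken to mean more than a chosen factor (2.0 by default) times the average degree. The function names are hypothetical.

```python
from urllib.parse import urlparse


def link_type(source_url, target_url):
    """Classify a link by comparing the host names of its two end pages."""
    same = urlparse(source_url).hostname == urlparse(target_url).hostname
    return "intrinsic" if same else "transverse"


def index_and_reference_nodes(edges, factor=2.0):
    """Return (index_nodes, reference_nodes) for a list of (source, target) edges.

    The `factor` threshold is an assumption, not part of the definitions.
    """
    nodes = {n for edge in edges for n in edge}
    if not nodes:
        return set(), set()
    out_deg = {n: 0 for n in nodes}
    in_deg = {n: 0 for n in nodes}
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    # Each edge adds one out-link and one in-link, so both averages are equal.
    average = len(edges) / len(nodes)
    index_nodes = {n for n in nodes if out_deg[n] > factor * average}
    reference_nodes = {n for n in nodes if in_deg[n] > factor * average}
    return index_nodes, reference_nodes


# Hypothetical graph: "hub" links to four pages, which all link to "ref".
edges = [("hub", p) for p in "abcd"] + [(p, "ref") for p in "abcd"]
index_nodes, reference_nodes = index_and_reference_nodes(edges)
print(index_nodes, reference_nodes)
print(link_type("http://example.com/a.html", "http://example.org/b.html"))
```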
 CLUSTERING AND DETERMINING SIMILAR PAGES
For determining the collection of similar pages, we need
to define the similarity measure between pages. There can be
two basic similarity functions.
 DEFINITION 11.5: BIBLIOGRAPHIC COUPLING
For a pair of nodes, p and q, the bibliographic coupling
is equal to the number of nodes that have links from both p
and q.
 DEFINITION 11.6: CO-CITATION
For a pair of nodes, p and q, the co-citation is the
number of nodes that point to both p and q.

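Both similarity functions reduce to set intersections over the link structure. A minimal sketch, assuming the graph is a list of (source, target) edges; the node names and helper names are made up for illustration.

```python
from collections import defaultdict


def build_neighbors(edges):
    """Return (successors, predecessors) dictionaries for directed edges."""
    succ, pred = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        succ[src].add(dst)
        pred[dst].add(src)
    return succ, pred


def bibliographic_coupling(succ, p, q):
    """Number of nodes that both p and q link to."""
    return len(succ[p] & succ[q])


def co_citation(pred, p, q):
    """Number of nodes that link to both p and q."""
    return len(pred[p] & pred[q])


# Hypothetical graph: p and q both link to y; r and s both link to p and q.
edges = [("p", "x"), ("p", "y"), ("q", "y"), ("q", "z"),
         ("r", "p"), ("r", "q"), ("s", "p"), ("s", "q"), ("t", "q")]
succ, pred = build_neighbors(edges)
coupling = bibliographic_coupling(succ, "p", "q")  # shared target: y
cocitation = co_citation(pred, "p", "q")           # shared sources: r and s
print(coupling, cocitation)
```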
WEB USAGE MINING
 Web usage mining deals with studying the data generated by
users' sessions on the web.
 Web content and structure mining use the real or primary
data on the web.
 Web usage mining mines the secondary data derived from the
interactions of the users with the web.
 The secondary data includes the data from web server
access logs, proxy server logs, browser logs, user profiles,
registration data, user sessions or transactions, cookies, user
queries, bookmark data, mouse clicks and scrolls, and any
other data which are the results of these interactions.
 This data can be accumulated by the web server.
 Analyses of the web access logs of different web sites can
facilitate an understanding of user behavior and the web
structure, thereby improving the design of this large collection
of information.

There are two main approaches in
web usage mining:
1. GENERAL ACCESS PATTERN TRACKING
 This is to learn user navigation patterns (impersonalized).
 The general access pattern tracking analyzes the web logs
to understand access patterns and trends.
2. CUSTOMIZED USAGE TRACKING
 This is to learn a user profile, or to perform user modeling in adaptive
interfaces (personalized).
 Customized usage tracking analyzes individual trends. Its
purpose is to customize web sites to users.
 The information displayed, the depth of the site structure,
and the format of the resources can all be dynamically
customized for each user over time, based on their access
patterns.

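The two approaches can be contrasted on a small access log. This is a sketch under assumptions: the log lines are invented, they follow the Common Log Format, and a visitor is identified by IP address, which is a simplification.

```python
import re
from collections import Counter, defaultdict

# Hypothetical access-log lines in Common Log Format.
LOG_LINES = [
    '10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2024:10:00:05 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.1 - - [01/Jan/2024:10:00:09 +0000] "GET /products.html HTTP/1.1" 200 1024',
    '10.0.0.1 - - [01/Jan/2024:10:00:20 +0000] "GET /cart.html HTTP/1.1" 200 256',
]

# host, identity, user, [timestamp], "method path protocol", status, size
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+) [^"]*" (\d{3}) \S+$')


def parse(lines):
    """Yield (host, path) for every line that matches the pattern."""
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            yield match.group(1), match.group(3)


# General access pattern tracking: which pages are requested most, over all users.
page_hits = Counter(path for _, path in parse(LOG_LINES))

# Customized usage tracking: the click path of each individual visitor.
paths_by_user = defaultdict(list)
for host, path in parse(LOG_LINES):
    paths_by_user[host].append(path)

print(page_hits.most_common(1))
print(dict(paths_by_user))
```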
Text mining
 Text mining corresponds to the extension of the data mining
approach to textual data and is concerned with various tasks,
such as extraction of information implicitly contained in
a collection of documents, or similarity-based structuring.
 The text expresses a vast range of information, but encodes the
information in a form that is difficult to interpret
automatically.
 When the data is structured it is easy to define the set of items,
and hence, it becomes easy to employ the traditional mining
techniques.
 Identifying individual items or terms is not so obvious in a
textual database.
 Thus, unstructured data, particularly free-running text, places
a new demand on data mining methodology.

OTHER RELATED AREAS
 Information Retrieval (IR),
 Information Extraction (IE),
 Computational Linguistics.

Information Retrieval (IR)
 IR is concerned with finding and ranking documents that
match the user’s information needs.
 The IR community deals with textual information through
a keyword-based document representation.
 A body of text is analyzed by its constituent words, and various
techniques are used to build the core words for a document.
 In effect, IR is the automatic retrieval of all relevant documents.
 The goals of IR are:
 To find documents that are similar, based on some specification of the
user.
 To find the right index terms in a collection, so that querying will return
the appropriate document.

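A keyword-based document representation is often stored as an inverted index, which maps each word to the documents that contain it. The sketch below is one simple way to do this, not a description of any particular IR system. The three documents and the scoring rule (count of matching query keywords) are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical mini collection.
DOCS = {
    "d1": "web mining discovers knowledge from web data",
    "d2": "text mining extends data mining to text",
    "d3": "search engines rank web pages",
}


def build_index(docs):
    """Map each keyword to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Rank documents by how many distinct query keywords they contain."""
    scores = defaultdict(int)
    for word in set(query.lower().split()):
        for doc_id in index.get(word, ()):
            scores[doc_id] += 1
    # Highest score first; ties are broken by document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


index = build_index(DOCS)
results = search(index, "web mining")
print(results)
```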
Information Extraction (IE)
 IE has the goal of transforming a collection of documents,
with the help of an IR system, into information that is more
readily digested and analyzed.
 IE extracts relevant facts from the documents, while IR
selects relevant documents. Thus, in general, IE works at a
finer granularity level than IR does on the documents.
 Most IE systems use machine learning or data mining
techniques to learn the extraction patterns or rules for
documents semi-automatically or automatically.
 The results of the IE process could be in the form of a
structured database, or could be a compression or
summary of the original text or documents.

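To illustrate the idea of turning free text into a structured record, the sketch below uses two hand-written regular expressions. This is a deliberate simplification: the slide describes systems that learn extraction patterns, whereas these patterns are fixed by hand, and the field names and sample sentence are made up.

```python
import re

# Hand-written patterns (an assumption; real IE systems usually learn them).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")  # ISO-style dates only


def extract_facts(doc_id, text):
    """Return one structured record for a document."""
    return {
        "doc": doc_id,
        "emails": EMAIL_RE.findall(text),
        "dates": DATE_RE.findall(text),
    }


text = "Contact alice@example.com before 2024-03-15 about the workshop."
record = extract_facts("d1", text)
print(record)
```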
Computational Linguistics
 In the computational linguistics framework, patterns are
discovered to aid other problems within the same
domain, whereas text data mining is aimed at
discovering unknown information for different
applications.

UNSTRUCTURED TEXT
Unstructured documents are free texts, such as news
stories.
 FEATURES
For an unstructured document, features are extracted
to convert it to a structured form. Some of the important
features are listed below.
1. WORD OCCURRENCES
Word occurrences can be used to identify the most
recurrent terms or concepts in a set of data.
2. STOP-WORDS
Feature selection includes removing case,
punctuation, infrequent words, and stop words. A good site
for the set of stop-words for the English language is
www.dcs.gla.ac.uk/idorn/irresources/linguisticutil/stopwords

3. LATENT SEMANTIC INDEXING
Latent Semantic Indexing (LSI) transforms the original
document vectors to a lower-dimensional space by
analyzing the correlational structure of terms in the
document collection, such that similar documents that do
not share terms are placed in the same topic.
4. STEMMING
Stemming is a process which reduces words to their
morphological roots. For example, the words “informing”,
“information”, “informer”, and “informed” would be
stemmed to their common root “inform”, and only the
root is used as the feature instead of the four original words.
5. n-GRAM
Other feature representations are also possible, such
as using information about word positions in the
document, or using an n-gram representation (word
sequences of length up to n), as in WEBSOM.

6. PART-OF-SPEECH (POS)
One important feature is the POS. There can be 25 possible
values for POS tags. The most common tags are noun, verb, adjective,
and adverb. Thus, we can assign a number 1, 2, 3, 4, or 5, depending
on whether the word is a noun, verb, adjective, adverb, or any
other, respectively.
7. POSITIONAL COLLOCATIONS
The values of this type of feature are the words that occur
one or two positions to the right or left of the given word.
8. HIGHER ORDER FEATURES
Other features include phrases, document concept
categories, terms, hypernyms, named entities, dates, email
addresses, locations, organizations, or URLs. These features could
be reduced further by applying some other feature selection
techniques, such as information gain, mutual information, cross
entropy, or odds ratio.

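Several of the features listed above can be sketched in a few lines. Everything here is illustrative: the stop-word list is a tiny made-up subset, toy_stem is a toy suffix stripper rather than a real stemming algorithm, and ngrams returns sequences of exactly n words (call it for each length to get sequences of length up to n).

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "and", "to", "are"}


def tokenize(text):
    """Lower-case the text and drop punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(word):
    """Strip one of a few suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ation", "ing", "er", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def ngrams(tokens, n):
    """All word sequences of length exactly n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def collocations(tokens, i, span=2):
    """Words up to `span` positions to the left and right of tokens[i]."""
    return tokens[max(0, i - span):i], tokens[i + 1:i + 1 + span]


text = "The discovery of knowledge in databases is the discovery of patterns."
tokens = tokenize(text)
content = remove_stop_words(tokens)
counts = Counter(content)                      # word occurrences
stems = {toy_stem(w) for w in ("informing", "information", "informer", "informed")}
bigrams = ngrams(content, 2)                   # n-grams with n = 2
left, right = collocations(tokens, tokens.index("knowledge"))
print(counts.most_common(1), stems, bigrams[0], left, right)
```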
 Once the features are extracted, the text is represented
as structured data, and traditional data mining
techniques can be used.
 The techniques include discovering frequent sets,
frequent sequences, and episode rules. We describe
below the preprocessing stage to find frequent
episodes.

EPISODE RULE DISCOVERY FOR TEXTS
 Ahonen et al. propose to apply sequence mining techniques to
text data.
 They consider text as sequential data which consists of a
sequence of pairs (feature vector, index), where the feature
vector is an ordered set of features and the index contains
information about the position of the word in the sequence.
For example, the text "Pathfinder photographs Mars" can be
represented as
 ((pathfinder_noun_singular,1),(photographs_verb_singular,2),
(Mars_noun_singular,3))

 Similarly, the text "knowledge discovery in databases" can be
represented as the sequence
((knowledge_noun_singular,1),(discovery_noun_singular,2),
(in_preposition,3),(databases_noun_plural,4))
 Instead of considering all occurrences of the episode, a restriction is
set that the episode must occur within a prespecified window of size
w. Thus, we examine the substrings S' of S such that the difference of
the indices in S' is at most w.
 For w=2, the subsequence (knowledge_noun_singular,
discovery_noun_singular) is an episode contained in the window, but
the subsequence (knowledge_noun_singular,
databases_noun_plural) is not contained within the window.

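The window restriction can be checked mechanically. The sketch below assumes the text is a list of (feature, index) pairs sorted by index, and reads "within a window of size w" as: the first and last matched indices differ by at most w. The function name is hypothetical.

```python
def occurs_within_window(sequence, episode, w):
    """True if the features in `episode` occur in order in `sequence`
    and the first and last matched indices differ by at most w.

    sequence: list of (feature, index) pairs, sorted by index
    episode:  tuple of features
    """
    def search(start, k, first_index):
        if k == len(episode):
            return True
        for pos in range(start, len(sequence)):
            feature, index = sequence[pos]
            if first_index is not None and index - first_index > w:
                break  # later positions are only further away
            if feature == episode[k]:
                anchor = index if first_index is None else first_index
                if search(pos + 1, k + 1, anchor):
                    return True
        return False

    return search(0, 0, None)


S = [("knowledge_noun_singular", 1), ("discovery_noun_singular", 2),
     ("in_preposition", 3), ("databases_noun_plural", 4)]

near = occurs_within_window(S, ("knowledge_noun_singular", "discovery_noun_singular"), 2)
far = occurs_within_window(S, ("knowledge_noun_singular", "databases_noun_plural"), 2)
print(near, far)
```

With w=2 the first pair is found (indices 1 and 2) and the second is not (indices 1 and 4), matching the example in the text.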