CHAPTER-1
WEB MINING
 We make use of the web in several ways. As Kosala et al. put it,
we interact with the web for the following purposes.
FINDING RELEVANT INFORMATION
 We either browse or use the search service when we want to find
specific information on the web.
 We usually specify a simple keyword query, and the response from a
web search engine is a list of pages ranked by their
similarity to the query.
 However, today's search tools have the following problems:
 Low precision: Many of the search results are irrelevant.
We may get many pages of information which are not really relevant to our
query.
 Low recall: Not all the information available on the web can be
indexed. Because some of the relevant pages are not properly
indexed, we may not get those pages through any of the search engines.
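In retrieval terms, precision is the fraction of returned pages that are relevant, and recall is the fraction of relevant pages that are returned. The two problems above can be made concrete with a minimal Python sketch; the page identifiers and the function name precision_recall are hypothetical and serve only as illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved and a set of relevant pages."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: the engine returns 4 pages, 2 of which are relevant,
# while 5 relevant pages exist on the web.
retrieved = ["p1", "p2", "p3", "p4"]
relevant = ["p2", "p4", "p7", "p8", "p9"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Low precision corresponds to many returned pages falling outside the relevant set; low recall corresponds to relevant pages that were never indexed and so can never be returned.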
DISCOVERING NEW KNOWLEDGE FROM THE WEB
 We can term the above problem as a query-triggered process
(retrieval oriented).
 On the other hand, we can have a data-triggered process that
presumes that we already have a collection of web data and we want
to extract potentially useful knowledge out of it (data mining
oriented).
PERSONALIZED WEB PAGE SYNTHESIS
 We may wish to synthesize a web page for different individuals from
the available set of web pages.
 Individuals have their own preferences for the style of the content
and its presentation while interacting with the web.
 Information providers would like to create a system which responds to
user queries by potentially aggregating information from several
sources, in a manner which depends on the user.
LEARNING ABOUT INDIVIDUAL USERS
 This is about knowing what customers do and want. Within
this problem there are subproblems, such as mass-customizing
information for the intended consumers or
even personalizing it for individual users, problems related to
effective web site design and management, problems related to
marketing, etc.
Web mining provides a set of techniques that can
be used to solve the above problems.
Sometimes, web mining techniques provide direct solutions
to the above problems.
Mining techniques in the web can
be categorized into three areas:
 Web content mining,
 Web structure mining, and
 Web usage mining.
Figure: Web mining tasks. Web mining divides into web content mining
(web page content mining and search result mining), web structure mining,
and web usage mining (general access pattern tracking and customized
usage tracking).
 Web content mining describes the discovery of useful
information from web contents.
 The web contains many kinds of data.
 Much government information has gradually
been placed on the web in recent years.
 Digital libraries are also accessible
from the web.
 Many commercial institutions are moving their
businesses and services online.
 We cannot ignore another type of web content:
web applications, which users can
access through web interfaces.
Web content mining
 Web content consists of several types of
data, such as text, image, audio, video, and metadata, as
well as hyperlinks.
 The textual part of web content data consists of
unstructured data such as free text, semi-structured
data such as HTML documents, and more structured
data such as data in tables or database-generated
HTML pages.
 Web structure mining is the process of discovering
structural information from the web.
 According to the type of web structural data, web structure
mining can be divided into two kinds:
 Extracting patterns from hyperlinks in the web:
a hyperlink is a structural component that connects a web
page to a different location.
 Mining the document structure:
analysis of the tree-like structure of pages to
describe HTML or XML tag usage.
 A typical web graph consists of web pages as
nodes and hyperlinks as edges connecting two
related pages.
WEB STRUCTURE MINING
Web structure mining terminology:
 Web graph: a directed graph representing the web.
 Node: a web page in the graph.
 Edge: a hyperlink.
 In-degree: the number of links pointing to a particular node.
 Out-degree: the number of links going out of a particular node.
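The terminology above can be illustrated with a small Python sketch that stores a web graph as a list of directed (source, target) hyperlinks and computes the in-degree and out-degree of each node. The page names and the function name degrees are hypothetical.

```python
from collections import defaultdict


def degrees(edges):
    """Return (in_degree, out_degree) dicts for a directed graph of (source, target) pairs."""
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for src, dst in set(edges):  # duplicate hyperlinks are counted once
        out_deg[src] += 1
        in_deg[dst] += 1
        # Touch the opposite counters so every node appears in both dicts.
        in_deg[src] += 0
        out_deg[dst] += 0
    return dict(in_deg), dict(out_deg)


# Hypothetical four-link web graph.
edges = [("home", "about"), ("home", "news"), ("news", "about"), ("about", "home")]
in_deg, out_deg = degrees(edges)
print(in_deg)
print(out_deg)
```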
 Some of the techniques that are useful in modeling web
topology are described below.
 PAGE RANK
 Used to discover the most important pages on the web.
 A page can have a high PageRank if there are many pages
that point to it, or if there are some pages that point to it
which have a high PageRank.
 PageRank is defined as follows:
We assume page A has pages T1, ..., Tn which point to it
(i.e., are citations). The parameter d is a damping factor which
can be set between 0 and 1 and is usually set to 0.85.
out_deg(A) denotes the number of links going out of page A
(the out-degree of A). The PageRank of page A is then
PR(A) = (1 - d) + d * (PR(T1)/out_deg(T1) + ... + PR(Tn)/out_deg(Tn)).
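A minimal Python sketch of an iterative PageRank computation follows. It assumes the commonly cited Brin and Page form PR(A) = (1 - d) + d * (PR(T1)/out_deg(T1) + ... + PR(Tn)/out_deg(Tn)), which matches the symbols defined above; the fixed iteration count, the starting value of 1.0, and the example graph are illustrative assumptions rather than part of the definition.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / out_deg(T)) over pages T linking to A.

    `links` maps each page to the pages it links to.
    """
    out_links = {page: set(targets) for page, targets in links.items()}
    pages = set(out_links) | {t for targets in out_links.values() for t in targets}
    pr = {page: 1.0 for page in pages}  # assumed starting value
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            incoming = sum(
                pr[src] / len(targets)
                for src, targets in out_links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical graph: A and B both point to C, and C points back to A.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this example C receives links from two pages and ends up with the highest rank, A is next because the highly ranked C points to it, and B, which no page points to, stays at the minimum value 1 - d.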
 SOCIAL NETWORK
 Social network analysis is yet another way of studying the
web link structure. It uses an exponentially varying
damping factor.
 Web structure mining utilizes the hyperlink structure of
the web to apply social network analysis, to model the
underlying link structure of the web itself.
 Social network analysis studies ways to measure the relative
standing or importance of individuals in a network.
 The same process can be mapped to study the link
structures of web pages. The basic premise here is that
if a web page links to another web page, then the
former is, in some sense, endorsing the importance of the
latter.
 Kautz et al., in a pioneering work on web structure
mining, The Hidden Web, propose a measure of the
standing of a node based on path counting. They carry
out social network analysis to model the network of AI
researchers. The standing of a node (page) can be
defined as follows.
 TRANSVERSE AND INTRINSIC LINKS
A link is said to be a transverse link if it is between
pages with different domain names, and
an intrinsic link if it is between pages with the same
domain name.
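The distinction can be sketched in Python with the standard urllib.parse module. The sketch assumes that "domain name" means the full host name of the URL, which is a simplifying assumption; the function name link_type and the example URLs are hypothetical.

```python
from urllib.parse import urlparse


def link_type(source_url, target_url):
    """Classify a hyperlink as 'intrinsic' (same host) or 'transverse' (different hosts)."""
    same_host = urlparse(source_url).hostname == urlparse(target_url).hostname
    return "intrinsic" if same_host else "transverse"


print(link_type("http://example.com/index.html", "http://example.com/about.html"))
print(link_type("http://example.com/index.html", "http://example.org/"))
```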
 REFERENCE NODES AND INDEX NODES
Botafogo et al. propose another way of ranking pages.
They define the notion of index nodes and reference nodes.
 DEFINITION 11.3: INDEX NODE
An index node is a node whose out-degree is
significantly larger than the average out-degree of the graph.
 DEFINITION 11.4: REFERENCE NODE
A reference node is a node whose in-degree is
significantly larger than the average in-degree of the graph.
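The two definitions can be sketched in Python as follows. The definitions do not say how much larger counts as "significantly larger", so the sketch assumes a hypothetical multiplier, factor, applied to the average degree; the graph and the function name are likewise illustrative.

```python
def index_and_reference_nodes(links, factor=2.0):
    """Return (index_nodes, reference_nodes) for a graph given as page -> linked pages.

    A node qualifies when its degree exceeds `factor` times the average degree;
    `factor` is an assumed threshold, not part of the definitions.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    if not nodes:
        return set(), set()
    out_deg = {n: len(set(links.get(n, ()))) for n in nodes}
    in_deg = {n: 0 for n in nodes}
    for src in links:
        for dst in set(links[src]):
            in_deg[dst] += 1
    avg_out = sum(out_deg.values()) / len(nodes)
    avg_in = sum(in_deg.values()) / len(nodes)
    index_nodes = {n for n in nodes if out_deg[n] > factor * avg_out}
    reference_nodes = {n for n in nodes if in_deg[n] > factor * avg_in}
    return index_nodes, reference_nodes


# Hypothetical graph: "hub" links to four pages, which all link to "ref".
links = {"hub": ["a", "b", "c", "d"], "a": ["ref"], "b": ["ref"], "c": ["ref"], "d": ["ref"]}
index_nodes, reference_nodes = index_and_reference_nodes(links)
print(index_nodes, reference_nodes)
```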
 CLUSTERING AND DETERMINING SIMILAR PAGES
For determining the collection of similar pages, we need
to define the similarity measure between pages. There can be
two basic similarity functions.
 DEFINITION 11.5: BIBLIOGRAPHIC COUPLING
For a pair of nodes, p and q, the bibliographic coupling
is equal to the number of nodes that have links from both p
and q.
 DEFINITION 11.6: CO-CITATION
For a pair of nodes, p and q, the co-citation is the
number of nodes that point to both p and q.
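Both similarity functions reduce to counting shared neighbors in the web graph, as the following minimal Python sketch shows. The graph is stored as a mapping from each page to the pages it links to; the page names are hypothetical.

```python
def bibliographic_coupling(links, p, q):
    """Number of nodes that have links from both p and q."""
    return len(set(links.get(p, ())) & set(links.get(q, ())))


def co_citation(links, p, q):
    """Number of nodes that point to both p and q."""
    return sum(1 for targets in links.values() if p in targets and q in targets)


# Hypothetical graph: p and q share the targets y and z, and are both cited by r and s.
links = {
    "p": ["x", "y", "z"],
    "q": ["y", "z", "w"],
    "r": ["p", "q"],
    "s": ["p", "q"],
    "t": ["p"],
}
print(bibliographic_coupling(links, "p", "q"))
print(co_citation(links, "p", "q"))
```

Pages can then be clustered by treating either count as a similarity score between p and q.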
 Web usage mining deals with studying the data generated by
users' interactions with the web.
 Web content and structure mining utilize the real or primary
data on the web.
 Web usage mining mines the secondary data derived from the
interactions of the users with the web.
 The secondary data includes the data from the web server
access logs, proxy server logs, browser logs, user profiles,
registration data, user sessions or transactions, cookies, user
queries, bookmark data, mouse clicks and scrolls, and any
other data which are the results of these interactions.
 This data can be accumulated by the web server.
 Analyses of the web access logs of different web sites can
facilitate an understanding of the user behavior and the web
structure, thereby improving the design of this large collection
of information.
WEB USAGE MINING
There are two main approaches in
web usage mining:
1. GENERAL ACCESS PATTERN TRACKING
 This is to learn user navigation patterns (impersonalized).
 The general access pattern tracking analyzes the web logs
to understand access patterns and trends.
2. CUSTOMIZED USAGE TRACKING
 This is to learn a user profile or user modeling in adaptive
interfaces (personalized).
 Customized usage tracking analyzes individual trends. Its
purpose is to customize web sites to users.
 The information displayed, the depth of the site structure,
and the format of the resources can all be dynamically
customized for each user over time, based on their access
patterns.
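The two approaches can be contrasted with a short Python sketch over web server access logs. It assumes log lines in the Common Log Format and treats the client address as a stand-in for a user, which is a simplification; the sample log lines, the regular expression, and the function names are all hypothetical.

```python
import re
from collections import Counter, defaultdict

# Assumed Common Log Format: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) \S+')

# Hypothetical sample log.
SAMPLE_LOG = """\
10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 1024
10.0.0.2 - - [01/Jan/2024:10:00:05 +0000] "GET /index.html HTTP/1.1" 200 1024
10.0.0.1 - - [01/Jan/2024:10:00:09 +0000] "GET /products.html HTTP/1.1" 200 2048
10.0.0.1 - - [01/Jan/2024:10:00:20 +0000] "GET /cart.html HTTP/1.1" 200 512
10.0.0.2 - - [01/Jan/2024:10:00:30 +0000] "GET /about.html HTTP/1.1" 200 700
"""


def parse(log_text):
    """Yield (host, path) for every successfully served request."""
    for line in log_text.splitlines():
        match = LOG_PATTERN.match(line)
        if match:
            host, _time, _method, path, status = match.groups()
            if status.startswith("2"):
                yield host, path


def general_access_patterns(records):
    """Impersonalized view: how often each page is requested overall."""
    return Counter(path for _host, path in records)


def customized_usage(records):
    """Personalized view: the ordered list of pages requested by each client."""
    paths = defaultdict(list)
    for host, path in records:
        paths[host].append(path)
    return dict(paths)


records = list(parse(SAMPLE_LOG))
print(general_access_patterns(records).most_common(1))
print(customized_usage(records))
```

The first function aggregates over all visitors, while the second keeps one navigation path per client, which is the kind of per-user profile a site could use for customization.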
Text mining
 Text mining corresponds to the extension of the data mining
approach to textual data and is concerned with various tasks,
such as extraction of information implicitly contained in
collections of documents, or similarity-based structuring.
 The text expresses a vast range of information, but encodes the
information in a form that is difficult to interpret
automatically.
 When the data is structured it is easy to define the set of items,
and hence, it becomes easy to employ the traditional mining
techniques.
 Identifying individual items or terms is not so obvious in a
textual database.
 Thus, unstructured data, particularly free-running text, places
new demands on data mining methodology.
OTHER RELATED AREAS
 Information Retrieval (IR),
 Information Extraction (IE),
 Computational Linguistics.
Information Retrieval (IR)
 IR is concerned with finding and ranking documents that
match the user’s information needs.
 The IR community deals with textual information through
a keyword-based document representation.
 A body of text is analyzed by its constituent words, and various
techniques are used to build the core words for a document.
 In effect, IR is the automatic retrieval of all relevant documents.
 The goals of IR are:
 To find documents that are similar, based on some specification of the
user.
 To find the right index terms in a collection, so that querying will return
the appropriate document.
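A keyword-based document representation and similarity ranking can be sketched in Python as follows. The sketch assumes a plain bag-of-words model with cosine similarity, which is one common choice and not necessarily the technique the slides have in mind; the documents and function names are hypothetical.

```python
import math
import re
from collections import Counter


def keywords(text):
    """Represent a text by the counts of its lowercase words."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two keyword-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank(query, documents):
    """Return ids of documents sharing at least one keyword with the query, best match first."""
    q = keywords(query)
    scored = [(cosine(q, keywords(text)), doc_id) for doc_id, text in documents.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ordered if score > 0]


# Hypothetical document collection.
documents = {
    "d1": "Web mining discovers useful knowledge from web data",
    "d2": "Text mining extends data mining to textual data",
    "d3": "Social networks measure the standing of individuals",
}
ranking = rank("web mining", documents)
print(ranking)
```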
Information Extraction (IE)
 IE has the goal of transforming a collection of documents,
with the help of an IR system, into information that is
more readily digested and analyzed.
 IE extracts relevant facts from the documents, while IR
selects relevant documents. Thus, in general, IE works at a
finer level of granularity than IR does on the documents.
 Most IE systems use machine learning or data mining
techniques to learn the extraction patterns or rules for
documents semi-automatically or automatically.
 The results of the IE process could be in the form of a
structured database, or could be a compression or
summary of the original text or documents.
Computational Linguistics
 In the computational linguistics framework, patterns are
discovered to aid other problems within the same
domain, whereas text data mining is aimed at
discovering unknown information for different
applications.
UNSTRUCTURED TEXT
Unstructured documents are free texts, such as news
stories.
 FEATURES
For an unstructured document, features are extracted
to convert it to a structured form. Some of the important
features are listed below.
1. WORD OCCURRENCES
Word occurrences can be used to identify the most
frequent terms or concepts in a set of data.
2. STOP-WORDS
 Feature selection includes removing case distinctions,
punctuation, infrequent words, and stop-words. A good source
for the set of stop-words for the English language is
www.dcs.gla.ac.uk/idorn/irresources/linguisticutil/stopwords
3. LATENT SEMANTIC INDEXING
Latent Semantic Indexing (LSI) transforms the original
document vectors to a lower-dimensional space by
analyzing the correlational structure of terms in the
document collection, such that similar documents that do
not share terms are placed in the same topic.
4. STEMMING
Stemming is a process which reduces words to their
morphological roots. For example, the words “informing”,
“information”, “informer”, and “informed” would be
stemmed to their common root “inform”, and only the
root is used as the feature instead of the four original words.
5. n-GRAM
Other feature representations are also possible, such
as using information about word positions in the
document, or using an n-gram representation (word
sequences of length up to n), as in WEBSOM.
6. PART-OF-SPEECH (POS)
One important feature is the POS. There can be 25 possible
values for POS tags. The most common tags are noun, verb, adjective,
and adverb. Thus, we can assign a number 1, 2, 3, 4, or 5, depending
on whether the word is a noun, verb, adjective, adverb, or any
other, respectively.
7. POSITIONAL COLLOCATIONS
The values of this type of feature are the words that occur
one or two positions to the right or left of the given word.
8. HIGHER ORDER FEATURES
Other features include phrases, document concept
categories, terms, hypernyms, named entities, dates, email
addresses, locations, organizations, or URLs. These features could
be reduced further by applying some other feature selection
techniques, such as information gain, mutual information, cross
entropy, or odds ratio.
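Several of the features listed above (word occurrences, stop-word removal, stemming, n-grams, and positional collocations) can be sketched together in Python. The stop-word list is a tiny illustrative set, and the suffix-stripping stemmer is a deliberately naive stand-in for a real stemming algorithm; both are assumptions made only for this example, as are the function names.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "was", "by"}
# Naive suffix stripping; a stand-in for a real stemmer.
SUFFIXES = ("ation", "ing", "er", "ed")


def tokenize(text):
    """Lowercase the text, drop punctuation, and remove stop-words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def ngrams(tokens, n):
    """All word sequences of length 1 up to n."""
    return [
        tuple(tokens[i:i + k])
        for k in range(1, n + 1)
        for i in range(len(tokens) - k + 1)
    ]


def collocations(tokens, position, width=2):
    """Words up to `width` positions to the left or right of the given word."""
    left = tokens[max(0, position - width):position]
    right = tokens[position + 1:position + 1 + width]
    return left + right


text = "The informer was informing the informed of the information."
tokens = tokenize(text)
stems = [stem(t) for t in tokens]
counts = Counter(stems)  # word occurrences after stemming
print(tokens)
print(counts)
print(ngrams(tokens, 2))
print(collocations(tokens, 1))
```

On this sentence all four content words reduce to the single feature "inform", mirroring the stemming example above. A real stemmer would handle many cases this sketch gets wrong.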
 Once the features are extracted, the text is represented
as structured data, and traditional data mining
techniques can be used.
 The techniques include discovering frequent sets,
frequent sequences and episode rules. We describe
below the preprocessing stage to find frequent
episodes.
EPISODE RULE DISCOVERY FOR TEXTS
 Ahonen et al. propose to apply sequence mining techniques to
text data.
 They consider text as sequential data which consists of a
sequence of pairs (feature vector, index), where the feature
vector is an ordered set of features and the index contains
information about the position of the word in the sequence.
For example, the text “Pathfinder photographs Mars” can be
represented as
((pathfinder_noun_singular,1),(photographs_verb_singular,2),
(Mars_noun_singular,3))
 Similarly, the text “knowledge discovery in databases” can be
represented as the sequence
((knowledge_noun_singular,1),(discovery_noun_singular,2),
(in_preposition,3),(databases_noun_plural,4))
 Instead of considering all occurrences of the episode, a restriction is
set that the episode must occur within a prespecified window of size
w. Thus, we examine the substrings S' of S such that the difference of
the indices in S' is at most w.
 For w=2, the subsequence (knowledge_noun_singular,
discovery_noun_singular) is an episode contained in the window, but
the subsequence (knowledge_noun_singular,
databases_noun_plural) is not contained within the window.
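The window restriction can be checked with a short Python sketch. It represents the text as a list of (feature, index) pairs, as in the examples above, and tests whether the features of an episode occur in order with the difference between the last and first index at most w. The function name occurs_within_window is hypothetical, and the sketch covers only this containment test, not the full episode rule discovery method.

```python
def occurs_within_window(sequence, episode, w):
    """Return True if the features in `episode` occur in order in `sequence`
    with (last index - first index) <= w."""
    if not episode:
        return True
    for start, (feature, first_index) in enumerate(sequence):
        if feature != episode[0]:
            continue
        last_index = first_index
        remaining = list(episode[1:])
        # Match the remaining features greedily at their earliest positions.
        for next_feature, index in sequence[start + 1:]:
            if not remaining:
                break
            if next_feature == remaining[0]:
                remaining.pop(0)
                last_index = index
        if not remaining and last_index - first_index <= w:
            return True
    return False


sequence = [
    ("knowledge_noun_singular", 1),
    ("discovery_noun_singular", 2),
    ("in_preposition", 3),
    ("databases_noun_plural", 4),
]
print(occurs_within_window(sequence, ["knowledge_noun_singular", "discovery_noun_singular"], 2))
print(occurs_within_window(sequence, ["knowledge_noun_singular", "databases_noun_plural"], 2))
```

With w=2 the first call succeeds (indices 1 and 2) and the second fails (indices 1 and 4), matching the example above.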

More Related Content

Similar to WEB MINING.pptx

DWM-MODULE 6.pdf
DWM-MODULE 6.pdfDWM-MODULE 6.pdf
DWM-MODULE 6.pdf
nikshaikh786
 
01635156
0163515601635156
01635156
Mechergui Najla
 
A Study on Web Structure Mining
A Study on Web Structure MiningA Study on Web Structure Mining
A Study on Web Structure Mining
IRJET Journal
 
A Study On Web Structure Mining
A Study On Web Structure MiningA Study On Web Structure Mining
A Study On Web Structure Mining
Nicole Heredia
 
IRJET - Re-Ranking of Google Search Results
IRJET - Re-Ranking of Google Search ResultsIRJET - Re-Ranking of Google Search Results
IRJET - Re-Ranking of Google Search Results
IRJET Journal
 
Web Mining
Web MiningWeb Mining
Web Mining
Shobha Rani
 
Web Usage Mining: A Survey on User's Navigation Pattern from Web Logs
Web Usage Mining: A Survey on User's Navigation Pattern from Web LogsWeb Usage Mining: A Survey on User's Navigation Pattern from Web Logs
Web Usage Mining: A Survey on User's Navigation Pattern from Web Logs
ijsrd.com
 
A Study of Pattern Analysis Techniques of Web Usage
A Study of Pattern Analysis Techniques of Web UsageA Study of Pattern Analysis Techniques of Web Usage
A Study of Pattern Analysis Techniques of Web Usage
ijbuiiir1
 
50320140501002
5032014050100250320140501002
50320140501002
IAEME Publication
 
A detail survey of page re ranking various web features and techniques
A detail survey of page re ranking various web features and techniquesA detail survey of page re ranking various web features and techniques
A detail survey of page re ranking various web features and techniques
ijctet
 
Performance of Real Time Web Traffic Analysis Using Feed Forward Neural Netw...
Performance of Real Time Web Traffic Analysis Using Feed  Forward Neural Netw...Performance of Real Time Web Traffic Analysis Using Feed  Forward Neural Netw...
Performance of Real Time Web Traffic Analysis Using Feed Forward Neural Netw...
IOSR Journals
 
Web Page Recommendation Using Web Mining
Web Page Recommendation Using Web MiningWeb Page Recommendation Using Web Mining
Web Page Recommendation Using Web Mining
IJERA Editor
 
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...
IOSR Journals
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)
theijes
 
Web content mining
Web content miningWeb content mining
Web content mining
Daminda Herath
 
Web Content Mining
Web Content MiningWeb Content Mining
Web Content Mining
Daminda Herath
 
Business Intelligence: A Rapidly Growing Option through Web Mining
Business Intelligence: A Rapidly Growing Option through Web  MiningBusiness Intelligence: A Rapidly Growing Option through Web  Mining
Business Intelligence: A Rapidly Growing Option through Web Mining
IOSR Journals
 
Mining web-logs-to-improve-website-organization1
Mining web-logs-to-improve-website-organization1Mining web-logs-to-improve-website-organization1
Mining web-logs-to-improve-website-organization1
Ijcem Journal
 
Information Organisation for the Future Web: with Emphasis to Local CIRs
Information Organisation for the Future Web: with Emphasis to Local CIRs Information Organisation for the Future Web: with Emphasis to Local CIRs
Information Organisation for the Future Web: with Emphasis to Local CIRs
inventionjournals
 
Pxc3893553
Pxc3893553Pxc3893553
Pxc3893553
Ouzza Brahim
 

Similar to WEB MINING.pptx (20)

DWM-MODULE 6.pdf
DWM-MODULE 6.pdfDWM-MODULE 6.pdf
DWM-MODULE 6.pdf
 
01635156
0163515601635156
01635156
 
A Study on Web Structure Mining
A Study on Web Structure MiningA Study on Web Structure Mining
A Study on Web Structure Mining
 
A Study On Web Structure Mining
A Study On Web Structure MiningA Study On Web Structure Mining
A Study On Web Structure Mining
 
IRJET - Re-Ranking of Google Search Results
IRJET - Re-Ranking of Google Search ResultsIRJET - Re-Ranking of Google Search Results
IRJET - Re-Ranking of Google Search Results
 
Web Mining
Web MiningWeb Mining
Web Mining
 
Web Usage Mining: A Survey on User's Navigation Pattern from Web Logs
Web Usage Mining: A Survey on User's Navigation Pattern from Web LogsWeb Usage Mining: A Survey on User's Navigation Pattern from Web Logs
Web Usage Mining: A Survey on User's Navigation Pattern from Web Logs
 
A Study of Pattern Analysis Techniques of Web Usage
A Study of Pattern Analysis Techniques of Web UsageA Study of Pattern Analysis Techniques of Web Usage
A Study of Pattern Analysis Techniques of Web Usage
 
50320140501002
5032014050100250320140501002
50320140501002
 
A detail survey of page re ranking various web features and techniques
A detail survey of page re ranking various web features and techniquesA detail survey of page re ranking various web features and techniques
A detail survey of page re ranking various web features and techniques
 
Performance of Real Time Web Traffic Analysis Using Feed Forward Neural Netw...
Performance of Real Time Web Traffic Analysis Using Feed  Forward Neural Netw...Performance of Real Time Web Traffic Analysis Using Feed  Forward Neural Netw...
Performance of Real Time Web Traffic Analysis Using Feed Forward Neural Netw...
 
Web Page Recommendation Using Web Mining
Web Page Recommendation Using Web MiningWeb Page Recommendation Using Web Mining
Web Page Recommendation Using Web Mining
 
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)
 
Web content mining
Web content miningWeb content mining
Web content mining
 
Web Content Mining
Web Content MiningWeb Content Mining
Web Content Mining
 
Business Intelligence: A Rapidly Growing Option through Web Mining
Business Intelligence: A Rapidly Growing Option through Web  MiningBusiness Intelligence: A Rapidly Growing Option through Web  Mining
Business Intelligence: A Rapidly Growing Option through Web Mining
 
Mining web-logs-to-improve-website-organization1
Mining web-logs-to-improve-website-organization1Mining web-logs-to-improve-website-organization1
Mining web-logs-to-improve-website-organization1
 
Information Organisation for the Future Web: with Emphasis to Local CIRs
Information Organisation for the Future Web: with Emphasis to Local CIRs Information Organisation for the Future Web: with Emphasis to Local CIRs
Information Organisation for the Future Web: with Emphasis to Local CIRs
 
Pxc3893553
Pxc3893553Pxc3893553
Pxc3893553
 

Recently uploaded

Telemetry Solution for Gaming (AWS Summit'24)
Telemetry Solution for Gaming (AWS Summit'24)Telemetry Solution for Gaming (AWS Summit'24)
Telemetry Solution for Gaming (AWS Summit'24)
GeorgiiSteshenko
 
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
z6osjkqvd
 
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Marlon Dumas
 
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
hqfek
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
bmucuha
 
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
eoxhsaa
 
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
actyx
 
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
Timothy Spann
 
Template xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptxTemplate xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptx
TeukuEriSyahputra
 
一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理
keesa2
 
一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理
ugydym
 
Bangalore ℂall Girl 000000 Bangalore Escorts Service
Bangalore ℂall Girl 000000 Bangalore Escorts ServiceBangalore ℂall Girl 000000 Bangalore Escorts Service
Bangalore ℂall Girl 000000 Bangalore Escorts Service
nhero3888
 
Data Scientist Machine Learning Profiles .pdf
Data Scientist Machine Learning  Profiles .pdfData Scientist Machine Learning  Profiles .pdf
Data Scientist Machine Learning Profiles .pdf
Vineet
 
06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus
Timothy Spann
 
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
uevausa
 
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
asyed10
 
How To Control IO Usage using Resource Manager
How To Control IO Usage using Resource ManagerHow To Control IO Usage using Resource Manager
How To Control IO Usage using Resource Manager
Alireza Kamrani
 
Drownings spike from May to August in children
Drownings spike from May to August in childrenDrownings spike from May to August in children
Drownings spike from May to August in children
Bisnar Chase Personal Injury Attorneys
 
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
9gr6pty
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
hyfjgavov
 

Recently uploaded (20)

Telemetry Solution for Gaming (AWS Summit'24)
Telemetry Solution for Gaming (AWS Summit'24)Telemetry Solution for Gaming (AWS Summit'24)
Telemetry Solution for Gaming (AWS Summit'24)
 
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
 
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
 
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
 
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
 
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
 
Template xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptxTemplate xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptx
 
一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理
 
一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理
 
Bangalore ℂall Girl 000000 Bangalore Escorts Service
Bangalore ℂall Girl 000000 Bangalore Escorts ServiceBangalore ℂall Girl 000000 Bangalore Escorts Service
Bangalore ℂall Girl 000000 Bangalore Escorts Service
 
Data Scientist Machine Learning Profiles .pdf
Data Scientist Machine Learning  Profiles .pdfData Scientist Machine Learning  Profiles .pdf
Data Scientist Machine Learning Profiles .pdf
 
06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus
 
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
 
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
 
How To Control IO Usage using Resource Manager
How To Control IO Usage using Resource ManagerHow To Control IO Usage using Resource Manager
How To Control IO Usage using Resource Manager
 
Drownings spike from May to August in children
Drownings spike from May to August in childrenDrownings spike from May to August in children
Drownings spike from May to August in children
 
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
 

WEB MINING.pptx

  • 2. WEB MINING  We make use of the web in several ways. As Kosala et al put it, we interact with the web for the following purposes. FINDING RELEVANT INFORATION  We either browse or use the search service when we want to find specific information on the web.  We usually specify a simple keyword query and the response from a web- search engine is a list of pages, ranked based on their similarity to the query.  However, today's search tools have the following problems:  how precision: This is due to the irrelevance of many of the search results. We may get many pages of information which are not really relevant to our query.  how recall: This is due to the inability to index all the information available on the web. Because some of the relevant pages are not properly indexed, we may not get those pages through any of the search engines.
  • 3. DISCOVERING NEW KNOWLEDGE FROM THE WEB  We can term the above problem as a query-triggered process (retrieval oriented).  On the other hand, we can have a data-triggered process that presumes that we already have a collection of web data and we want to extract potentially useful knowledge out of it (data mining- oriented). PERSONALIZED WEB PAGE SYNTHESIS  We may wish to synthesize a web page for different individuals from the available set of web pages.  Individuals have their own preferences in the style of the contents and presentations while interacting with the web.  The information providers like to create a system which responds to user queries by potentially aggregating information from several sources, in a manner which is dependent on the user.
  • 4. LEARNING ABOUT INDIVIDUAL USERS  It is about knowing what the customers do and want. Inside this problem, there are sub problems, such as mass customizing the information to the intended consumers or even personalizing it to individual user, problems related to effects web site design and management, problems related to marketing, etc. Web mining techniques provide a set of techniques that can be used to solve the above problems. Sometimes, web mining techniques provide direct solutions to above problems.
  • 5. Mining techniques in the web can be categorized into three areas  Web content mining,  Web structure mining, and  Web usage mining. Figure : web mining tasks Web Mining Web Content Mining Web Structure Mining Web Usage Mining Web Page Content Mining Search Result Mining General Access Pattern Tracking Customized Usage Tracking
  • 6.  Web content mining describes the discovery of useful information from the web contents.  The web contains many kinds of data.  Much of the government information are gradually being placed on the web in recent years.  Existence of Digital Libraries that are also accessible from the web.  Many commercial institutions are transforming their businesses and services electronically.  We cannot ignore another type of web content—the existence of web applications, so that the users could access the applications through web interfaces. Web content mining
  • 7.  Basically, the web content consists of several types of data such as textual, image, audio, video, metadata, as well as hyperlinks.  The textual parts of web content data consists of unstructured data such as free texts, semi – structured data such as HTML documents, and more structured data such as data in the tables or database – generated HTML pages.
  • 8.  Web structure mining is the process of discovering the structure information from the web.  According to the type of web structural data, web structure mining can be divided into two kinds:  Extracting patterns from hyperlinks in the web: a hyperlink is a structural component that connects the web page to a different location.  Mining the document structure: analysis of the tree-like structure of page structures to describe HTML or XML tag usage.  The structure of typical web graph consists of web pages as nodes, and hyperlinks as edges connecting between two related pages. WEB STRUCTURE MINING
  • 9. Web structure mining terminology: ▪ Web graph: directed graph representing the web. ▪ Node: a web page in the graph. ▪ Edge: a hyperlink. ▪ In-degree: number of links pointing to a particular node. ▪ Out-degree: number of links leaving a particular node.
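The terminology above can be made concrete with a small sketch. The edge list below is made-up illustration data (the page names are hypothetical), and the web graph is assumed to be given as (source, target) pairs.

```python
from collections import Counter

# Hypothetical web graph: each pair is a hyperlink (source page, target page).
edges = [
    ("home", "about"),
    ("home", "news"),
    ("news", "about"),
    ("about", "home"),
]

# Out-degree: number of links leaving a node.
out_degree = Counter(source for source, _ in edges)
# In-degree: number of links pointing to a node.
in_degree = Counter(target for _, target in edges)

for page in sorted({p for edge in edges for p in edge}):
    print(page, "in:", in_degree[page], "out:", out_degree[page])
```

A `Counter` returns 0 for pages that never appear as a source or target, so pages without incoming or outgoing links need no special handling.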
  • 10. ▪ Some of the techniques that are useful in modeling web topology are described below. ✓ PAGERANK ▪ Used to discover the most important pages on the web. ▪ A page can have a high PageRank if many pages point to it, or if some pages that point to it themselves have a high PageRank. ▪ PageRank is defined as follows: we assume page A has pages T1, ..., Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1 and is usually set to 0.85. out_deg(A) denotes the number of links going out of page A (the out-degree of A).
  • 11. PR(A) = (1 - d) + d (PR(T1)/out_deg(T1) + ... + PR(Tn)/out_deg(Tn))
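A minimal sketch of how PageRank values could be computed by repeated substitution is given here. It uses the commonly cited, non-normalized form of the definition (written out in the code comment). The three-page graph, the function name `pagerank`, and the fixed iteration count are assumptions made for illustration, and the sketch assumes every page has at least one outgoing link.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / out_deg(T)) over pages T linking to A.

    `links` maps each page to the list of pages it links to.
    Assumes every page has at least one outgoing link.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    out_deg = {p: len(links.get(p, [])) for p in pages}
    pr = {p: 1.0 for p in pages}  # arbitrary starting values
    for _ in range(iterations):
        new_pr = {}
        for a in pages:
            # Pages T1..Tn that point to page a.
            incoming = [t for t in pages if a in links.get(t, [])]
            new_pr[a] = (1 - d) + d * sum(pr[t] / out_deg[t] for t in incoming)
        pr = new_pr
    return pr


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this example C receives links from both other pages and ends up ranked highest, while B, which is pointed to only by A (and shares A's endorsement with C), ranks lowest.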
  • 12. ✓ SOCIAL NETWORK ▪ Social network analysis is yet another way of studying the web link structure. It uses an exponentially varying damping factor. ▪ Web structure mining utilizes the hyperlink structure of the web to apply social network analysis and so model the underlying link structure of the web itself. ▪ Social network analysis studies ways to measure the relative standing or importance of individuals in a network. ▪ The same process can be mapped to study the link structure of web pages. The basic premise is that if a web page links to another web page, then the former is, in some sense, endorsing the importance of the latter.
  • 13. ▪ Kautz et al., in a pioneering work on web structure mining, The Hidden Web, propose a measure of the standing of a node based on path counting. They carry out social network analysis to model the network of AI researchers. The standing of a node (page) can be defined as follows.
  • 14. ✓ TRANSVERSE AND INTRINSIC LINKS A link is said to be a transverse link if it is between pages with different domain names, and an intrinsic link if it is between pages with the same domain name. ✓ REFERENCE NODES AND INDEX NODES Botafogo et al. propose another way of ranking pages. They define the notions of index nodes and reference nodes. ▪ DEFINITION 11.3: INDEX NODE An index node is a node whose out-degree is significantly larger than the average out-degree of the graph. ▪ DEFINITION 11.4: REFERENCE NODE A reference node is a node whose in-degree is significantly larger than the average in-degree of the graph.
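The definitions above can be sketched as follows. Two points are assumptions rather than part of the definitions: "same domain name" is approximated by comparing URL host names, and "significantly larger" is read as "more than `factor` times the average", with `factor=2.0` chosen arbitrarily. The URLs and graph are made-up examples.

```python
from urllib.parse import urlparse


def link_type(source_url, target_url):
    """Classify a link as intrinsic (same host) or transverse (different hosts)."""
    same_host = urlparse(source_url).hostname == urlparse(target_url).hostname
    return "intrinsic" if same_host else "transverse"


def index_and_reference_nodes(edges, factor=2.0):
    """Return (index nodes, reference nodes) for a list of (source, target) links."""
    nodes = {n for edge in edges for n in edge}
    out_deg = {n: 0 for n in nodes}
    in_deg = {n: 0 for n in nodes}
    for source, target in edges:
        out_deg[source] += 1
        in_deg[target] += 1
    # Every link adds one to some out-degree and one to some in-degree,
    # so the average in-degree and average out-degree are the same number.
    average = len(edges) / len(nodes)
    index_nodes = {n for n in nodes if out_deg[n] > factor * average}
    reference_nodes = {n for n in nodes if in_deg[n] > factor * average}
    return index_nodes, reference_nodes


# Hypothetical graph: "hub" links out to four pages, which all link to "ref".
edges = [("hub", p) for p in "abcd"] + [(p, "ref") for p in "abcd"]
index_nodes, reference_nodes = index_and_reference_nodes(edges)
print(index_nodes, reference_nodes)
print(link_type("http://example.com/a", "http://example.org/b"))
```

Comparing host names treats `www.example.com` and `example.com` as different domains; a stricter notion of "domain name" would need extra normalization.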
  • 15. ✓ CLUSTERING AND DETERMINING SIMILAR PAGES To determine a collection of similar pages, we need to define a similarity measure between pages. There are two basic similarity functions. ▪ DEFINITION 11.5: BIBLIOGRAPHIC COUPLING For a pair of nodes, p and q, the bibliographic coupling is the number of nodes that both p and q link to. ▪ DEFINITION 11.6: CO-CITATION For a pair of nodes, p and q, the co-citation is the number of nodes that point to both p and q.
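Both similarity functions reduce to counting shared neighbours, as the following sketch shows. The graph is a made-up example and is assumed to be stored as a mapping from each page to the pages it links to.

```python
def bibliographic_coupling(links, p, q):
    """Number of nodes that both p and q link to."""
    return len(set(links.get(p, ())) & set(links.get(q, ())))


def co_citation(links, p, q):
    """Number of nodes that link to both p and q."""
    return sum(1 for targets in links.values() if p in targets and q in targets)


# Hypothetical graph: page -> pages it links to.
links = {
    "p": ["x", "y", "z"],
    "q": ["y", "z"],
    "r": ["p", "q"],
    "s": ["p", "q", "x"],
}

print(bibliographic_coupling(links, "p", "q"))  # p and q both link to y and z
print(co_citation(links, "p", "q"))             # r and s both link to p and q
```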
  • 16. WEB USAGE MINING ▪ Web usage mining deals with studying the data generated by the use of the web. ▪ Web content and structure mining utilize the real or primary data on the web. ▪ Web usage mining mines the secondary data derived from the interactions of users with the web. ▪ The secondary data includes data from web server access logs, proxy server logs, browser logs, user profiles, registration data, user sessions or transactions, cookies, user queries, bookmark data, mouse clicks and scrolls, and any other data that results from these interactions. ▪ This data can be accumulated by the web server. ▪ Analyses of the web access logs of different web sites can facilitate an understanding of user behavior and the web structure, thereby improving the design of this large collection of information.
  • 17. There are two main approaches in web usage mining: 1. GENERAL ACCESS PATTERN TRACKING ▪ This is to learn user navigation patterns (non-personalized). ▪ General access pattern tracking analyzes the web logs to understand access patterns and trends. 2. CUSTOMIZED USAGE TRACKING ▪ This is to learn a user profile, or to do user modeling in adaptive interfaces (personalized). ▪ Customized usage tracking analyzes individual trends. Its purpose is to customize web sites to users. ▪ The information displayed, the depth of the site structure, and the format of the resources can all be dynamically customized for each user over time, based on their access patterns.
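The difference between the two approaches can be illustrated with a small sketch over a simplified access log. The log records, user ids, and paths are hypothetical, and the log is assumed to be already parsed into (user, page) pairs in time order; real server logs would need parsing and session identification first.

```python
from collections import Counter, defaultdict

# Hypothetical, already-parsed access log: (user id, requested page), in time order.
log = [
    ("u1", "/home"), ("u1", "/products"), ("u1", "/cart"),
    ("u2", "/home"), ("u2", "/products"),
    ("u3", "/home"), ("u3", "/blog"),
]

# Group each user's requests into a navigation path.
paths = defaultdict(list)
for user, page in log:
    paths[user].append(page)

# General access pattern tracking: aggregate over all users.
page_hits = Counter(page for _, page in log)
transitions = Counter(
    (a, b) for pages in paths.values() for a, b in zip(pages, pages[1:])
)

# Customized usage tracking: one profile per user.
profiles = {user: Counter(pages) for user, pages in paths.items()}

print(page_hits.most_common(1))
print(transitions.most_common(1))
print(profiles["u3"])
```

The aggregate counts say which pages and page-to-page moves are popular overall, while the per-user profiles are the kind of input a site could use to tailor what it shows to each visitor.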
  • 18. Text mining ▪ Text mining corresponds to the extension of the data mining approach to textual data and is concerned with various tasks, such as extraction of information implicitly contained in a collection of documents, or similarity-based structuring. ▪ Text expresses a vast range of information, but encodes the information in a form that is difficult to interpret automatically. ▪ When the data is structured, it is easy to define the set of items, and hence it becomes easy to employ the traditional mining techniques. ▪ Identifying individual items or terms is not so obvious in a textual database. ▪ Thus, unstructured data, particularly free-running text, places a new demand on data mining methodology.
  • 19. OTHER RELATED AREAS ▪ Information Retrieval (IR), ▪ Information Extraction (IE), ▪ Computational Linguistics.
  • 20. Information Retrieval (IR) ▪ IR is concerned with finding and ranking documents that match the user's information needs. ▪ The IR community deals with textual information through a keyword-based document representation. ▪ A body of text is analyzed by its constituent words, and various techniques are used to build the core words for a document. ▪ IR is the automatic retrieval of all relevant documents. ▪ The goals of IR are ➢ To find documents that are similar, based on some specification of the user. ➢ To find the right index terms in a collection, so that querying will return the appropriate document.
  • 21. Information Extraction (IE) ▪ IE has the goal of transforming a collection of documents into information that is more readily digested and analyzed with the help of an IR system. ▪ IE extracts relevant facts from the documents, while IR selects relevant documents. Thus, in general, IE works at a finer granularity level than IR does on the documents. ▪ Most IE systems use machine learning or data mining techniques to learn the extraction patterns or rules for documents semi-automatically or automatically. ▪ The results of the IE process could be in the form of a structured database, or could be a compression or summary of the original text or documents.
  • 22. Computational Linguistics ▪ In the computational linguistics framework, patterns are discovered to aid other problems within the same domain, whereas text data mining is aimed at discovering unknown information for different applications.
  • 23. UNSTRUCTURED TEXT Unstructured documents are free texts, such as news stories. ▪ FEATURES For an unstructured document, features are extracted to convert it to a structured form. Some of the important features are listed below. 1. WORD OCCURRENCES Word occurrences can be used to identify the most recurrent terms or concepts in a set of data. 2. STOP-WORDS ▪ Feature selection includes removing case, punctuation, infrequent words, and stop-words. A good source for the set of stop-words for the English language is www.dcs.gla.ac.uk/idorn/irresources/linguisticutil/stopwords
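A minimal sketch of the first two features is given here: counting word occurrences after removing case, punctuation, stop-words, and infrequent words. The tiny `STOP_WORDS` set and the `min_count` threshold are placeholders chosen for illustration, not a real stop-word list.

```python
import re
from collections import Counter

# Placeholder stop-word set; a real list would be much longer.
STOP_WORDS = {"the", "a", "of", "and", "in", "is", "to"}


def word_features(text, min_count=1):
    """Count word occurrences after removing case, punctuation, and stop-words."""
    tokens = re.findall(r"[a-z]+", text.lower())  # lowercase, letters only
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    # Drop infrequent words.
    return {word: n for word, n in counts.items() if n >= min_count}


features = word_features(
    "The web is a graph. Mining the Web graph is web mining.", min_count=2
)
print(features)
```

The resulting word-to-count mapping is one simple structured representation of a free-text document.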
  • 24. 3. LATENT SEMANTIC INDEXING Latent Semantic Indexing (LSI) transforms the original document vectors to a lower-dimensional space by analyzing the correlational structure of terms in the document collection, such that similar documents that do not share terms are placed in the same topic. 4. STEMMING Stemming is a process which reduces words to their morphological roots. For example, the words "informing", "information", "informer", and "informed" would be stemmed to their common root "inform", and only the root is used as the feature instead of the four original words. 5. n-GRAM Other feature representations are also possible, such as using information about word positions in the document, or using an n-gram representation (word sequences of length up to n), as in WEBSOM.
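Stemming and n-grams can be illustrated with a deliberately naive sketch. `toy_stem` is a made-up suffix stripper that only handles a few endings; it is not the Porter stemmer or any other published algorithm. `ngrams_up_to` follows the description above by returning word sequences of every length from 1 to n.

```python
def toy_stem(word, suffixes=("ation", "ing", "ed", "er")):
    """Strip the first matching suffix, keeping at least three letters (toy rule)."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def ngrams_up_to(tokens, n):
    """All word sequences of length 1..n, as tuples."""
    return [
        tuple(tokens[i : i + size])
        for size in range(1, n + 1)
        for i in range(len(tokens) - size + 1)
    ]


stems = {toy_stem(w) for w in ["informing", "information", "informer", "informed"]}
print(stems)
print(ngrams_up_to(["web", "usage", "mining"], 2))
```

A rule this crude will over- or under-stem many English words; it is only meant to show how several surface forms collapse into one feature.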
  • 25. 6. PART-OF-SPEECH (POS) One important feature is the POS. There can be 25 possible values for POS tags. The most common tags are noun, verb, adjective and adverb. Thus, we can assign a number 1, 2, 3, 4 or 5, depending on whether the word is a noun, verb, adjective, adverb or any other, respectively. 7. POSITIONAL COLLOCATIONS The values of this type of feature are the words that occur one or two positions to the right or left of the given word. 8. HIGHER ORDER FEATURES Other features include phrases, document concept categories, terms, hypernyms, named entities, dates, email addresses, locations, organizations, or URLs. These features could be reduced further by applying some other feature selection techniques, such as information gain, mutual information, cross entropy, or odds ratio.
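The POS numbering and the positional collocation feature can be sketched as below. The tag names used as dictionary keys and the example sentence are assumptions; the POS tags themselves are taken as given, since tagging a text would need a separate tagger.

```python
# Numeric codes from the scheme above: 1-4 for the common tags, 5 for any other.
POS_CODE = {"noun": 1, "verb": 2, "adjective": 3, "adverb": 4}


def pos_code(tag):
    """Map a POS tag name to its numeric feature value."""
    return POS_CODE.get(tag, 5)


def collocations(tokens, index, span=2):
    """Words occurring up to `span` positions left or right of tokens[index].

    Keys are relative offsets (-2, -1, 1, 2); offsets outside the text are skipped.
    """
    return {
        offset: tokens[index + offset]
        for offset in range(-span, span + 1)
        if offset != 0 and 0 <= index + offset < len(tokens)
    }


tokens = ["web", "usage", "mining", "finds", "patterns"]
print(pos_code("verb"), pos_code("preposition"))
print(collocations(tokens, 2))
```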
  • 26. ▪ Once the features are extracted, the text is represented as structured data, and traditional data mining techniques can be used. ▪ The techniques include discovering frequent sets, frequent sequences and episode rules. We describe below the preprocessing stage to find frequent episodes.
  • 27. EPISODE RULE DISCOVERY FOR TEXTS ▪ Ahonen et al. propose applying sequence mining techniques to text data. ▪ They consider text as sequential data consisting of a sequence of pairs (feature vector, index), where the feature vector is an ordered set of features and the index contains information about the position of the word in the sequence. For example, the text "Pathfinder photographs Mars" can be represented as ▪ ((pathfinder_noun_singular, 1), (photographs_verb_singular, 2), (Mars_noun_singular, 3))
  • 28. ▪ Similarly, the text "knowledge discovery in databases" can be represented as the sequence ((knowledge_noun_singular, 1), (discovery_noun_singular, 2), (in_preposition, 3), (databases_noun_plural, 4)). ▪ Instead of considering all occurrences of the episode, a restriction is set that the episode must occur within a prespecified window of size w. Thus, we examine the subsequences S' of S such that the difference of the indices in S' is at most w. ▪ For w = 2, the subsequence (knowledge_noun_singular, discovery_noun_singular) is an episode contained in the window, but the subsequence (knowledge_noun_singular, databases_noun_plural) is not contained within the window.
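The window restriction can be checked with a short sketch that enumerates two-element candidate episodes from the example sequence. The function name `pair_episodes` is made up for illustration, and limiting the search to pairs is a simplification; longer episodes would be handled analogously.

```python
from itertools import combinations

# The example sequence: (feature vector, index) pairs.
sequence = [
    ("knowledge_noun_singular", 1),
    ("discovery_noun_singular", 2),
    ("in_preposition", 3),
    ("databases_noun_plural", 4),
]


def pair_episodes(seq, w):
    """Ordered pairs of features whose index difference is at most w."""
    return [
        (a, b)
        for (a, i), (b, j) in combinations(seq, 2)  # keeps sequence order, so j > i
        if j - i <= w
    ]


episodes = pair_episodes(sequence, w=2)
for episode in episodes:
    print(episode)
```

With w = 2 the pair (knowledge_noun_singular, discovery_noun_singular) is kept, while (knowledge_noun_singular, databases_noun_plural) is dropped because its indices differ by 3.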