WEB MINING
References: Data mining techniques by
Arun k. Pujari
MRS.SOWMYA JYOTHI
SDMCBM
MANGALORE
Web Mining:
“Web mining refers to the overall process of
discovering potentially useful and previously
unknown information or knowledge from the Web
data.”
Discovering knowledge from and about the WWW is one
of the basic abilities of an intelligent agent.
[Figure: diagram labelled "Knowledge" and "WWW"]
Web Mining: Subtasks
1. Resource finding
◦ Retrieving intended documents
2. Information selection/pre-processing
◦ Select and pre-process specific information from selected
documents
3. Generalization
◦ Discover general patterns within and across web sites
4. Analysis
◦ Validation and/or interpretation of mined patterns
Web Mining:
As Kosala et al. put it, we interact
with the web for the following
purposes.
1. Finding Relevant Information
We either browse or use the search service when we want to find
specific information on the web.
We usually specify a simple keyword query and the response
from the web search engine is a list of pages, ranked based on
their similarity to the query.
Search tools have the following problems:
a. Low precision: Many of the search results are irrelevant.
We may get many pages of information that are not really
relevant to our query.
b. Low recall/unindexed information: Not all the information
available on the web can be indexed. Because some of the
relevant pages are not properly indexed, we may not get those
pages through any of the search engines.
2. Discovering new knowledge from the web
We can have a data-triggered process that presumes that we
already have a collection of web data and want to extract
potentially useful knowledge out of it (data mining-oriented).
3. Personalized web page synthesis
We may wish to synthesize a web page for different
individuals from the available set of web pages, i.e.
cater to personal preferences in contents and
presentation.
Individuals have their own preferences in the
style of the contents and presentation while
interacting with the web.
4. Learning about individual users
Within this problem there are subproblems, such
as mass-customizing the information for the
intended consumers or even personalizing it for
an individual user, problems related to effective
web site design and management, problems
related to marketing, etc.
• Web mining can be said to have three
operations of interest:
1. Clustering – Finding natural groupings of users,
pages, etc.
2. Associations – Which URLs tend to be
requested together.
3. Sequential Analysis – The order in which URLs
tend to be accessed.
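As a rough illustration of the association and sequential-analysis operations, the sketch below counts URL pairs that are requested together and page-to-page transitions. The session data, the function names and the `min_support` threshold are assumptions made for this example only.

```python
from collections import Counter
from itertools import combinations

def co_requested_pairs(sessions, min_support=2):
    """Associations: URL pairs that occur together in at least min_support sessions."""
    pair_counts = Counter()
    for session in sessions:
        # A URL counts once per session, so work on the distinct URLs.
        for pair in combinations(sorted(set(session)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}

def transition_counts(sessions):
    """Sequential analysis: how often URL a is followed directly by URL b."""
    counts = Counter()
    for session in sessions:
        for a, b in zip(session, session[1:]):
            counts[(a, b)] += 1
    return counts

# Hypothetical sessions: each list is the ordered URLs one visitor requested.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/home", "/about"],
    ["/products", "/cart"],
]
frequent_pairs = co_requested_pairs(sessions)
transitions = transition_counts(sessions)
```

Clustering would typically be run on top of such counts or on per-user vectors; it is left out here to keep the sketch short.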
WEB CONTENT MINING:
• Web content mining describes the discovery of useful information
from the web contents.
• The web contains many kinds of data.
• Much government information has gradually been
placed on the web in recent years.
• Digital libraries are also
accessible from the web.
• Many commercial institutions are moving their business and
services online.
• We cannot ignore another type of web content: web
applications that are reached through web interfaces.
• Some of the web content data are hidden data, and some are generated
dynamically as a result of queries and reside in a DBMS. These data
are generally not indexed.
Web content consists of several types of data such as
textual, image, audio, video, metadata as well as
hyperlinks.
Research on mining multiple types of data is termed
multimedia data mining.
The textual parts of web content data consist of
unstructured data such as free text, semi-structured
data such as HTML documents, and more structured
data such as data in tables or database-generated
HTML pages.
Web Structure Mining:
Web structure mining is concerned with
discovering the model underlying the link
structures of the web. In other words, it studies
how documents within the web are connected.
It is used to study the topology of the hyperlinks
with or without the description of the links.
This model can be used to categorize web pages
and is useful to generate information such as the
similarity and relationship between different web
sites.
Web structure mining is interested in the structure between
web documents (not within a document). It is inspired by
the study of social networks and citation analysis.
PageRank (PR) is an algorithm used by Google
Search to rank web pages in their search engine
results. It is named after both the term "web page"
and co-founder Larry Page. PageRank is a way of
measuring the importance of website pages.
Damping factor:
The PageRank theory holds that an imaginary
surfer who is randomly clicking on links will
eventually stop clicking. The probability, at any
step, that the person will continue is a damping
factor d.
PageRank is defined in terms of the following quantities.
We assume page A has pages T1, ..., Tn which
point to it (i.e., are citations).
The parameter d is a damping factor which can be
set between 0 and 1 and is usually set to 0.85.
out_deg(A) denotes the number of links going out
of page A (out-degree of A).
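With these symbols, the formulation given in Brin and Page's original description of PageRank can be written as PR(A) = (1 - d) + d * (PR(T1)/out_deg(T1) + ... + PR(Tn)/out_deg(Tn)). The sketch below iterates that equation on a small, made-up link graph. The function name, the fixed iteration count and the example graph are assumptions for illustration, and pages without out-links are not given any special treatment.

```python
def pagerank(out_links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / out_deg(T)) over all pages T linking to A."""
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    ranks = {page: 1.0 for page in pages}  # arbitrary starting value
    for _ in range(iterations):
        new_ranks = {}
        for page in pages:
            # Each page T that links here passes on PR(T) / out_deg(T).
            incoming = sum(
                ranks[src] / len(targets)
                for src, targets in out_links.items()
                if page in targets
            )
            new_ranks[page] = (1 - d) + d * incoming
        ranks = new_ranks
    return ranks

# Hypothetical graph: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

In this example C ends up with the highest rank because it is cited by both A and B, while B receives only half of A's rank.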
Social Network:
Social network analysis is another way of studying
the web link structure. It uses an exponentially
varying damping structure.
Web structure mining utilizes the hyperlink
structure of the web to apply social network
analysis and model the underlying link structure of
the web itself.
Social network analysis studies ways to measure the
relative standing or importance of individuals in a
network. The same process can be mapped to the study
of the link structure of web pages.
• INDEX NODE: An index node is a node whose
out-degree is significantly larger than the
average out-degree of the graph.
• REFERENCE NODE: A reference node is a
node whose in-degree is significantly larger
than the average in-degree of the graph.
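A minimal sketch of how index and reference nodes could be picked out of a link graph follows. The slides do not say what "significantly larger" means, so the `factor` threshold (twice the average) is purely an assumption, as are the function name and the example graph.

```python
def index_and_reference_nodes(out_links, factor=2.0):
    """Return (index_nodes, reference_nodes) for a directed link graph.

    `factor` is a hypothetical cut-off: a degree counts as "significantly
    larger" when it exceeds factor times the average degree.
    """
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    out_deg = {n: len(set(out_links.get(n, ()))) for n in nodes}
    in_deg = {n: 0 for n in nodes}
    for targets in out_links.values():
        for target in set(targets):
            in_deg[target] += 1
    avg_out = sum(out_deg.values()) / len(nodes)
    avg_in = sum(in_deg.values()) / len(nodes)
    index_nodes = {n for n in nodes if out_deg[n] > factor * avg_out}
    reference_nodes = {n for n in nodes if in_deg[n] > factor * avg_in}
    return index_nodes, reference_nodes

# Hypothetical graph: "hub" links to everything, and everything links to "d".
graph = {"hub": ["a", "b", "c", "d"], "a": ["d"], "b": ["d"], "c": ["d"]}
index_nodes, reference_nodes = index_and_reference_nodes(graph)
```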
• A link is said to be a transverse link if it is between
pages with different domain names.
• A link is said to be an intrinsic link if it is between pages
with the same domain name. Here by "domain name", we mean the
first level in the URL string associated with the page.
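The distinction can be sketched with Python's standard `urllib.parse` module. Treating the host part of the URL as the "domain name" is an assumption of this example, and the function name and URLs are made up.

```python
from urllib.parse import urlsplit

def link_type(source_url, target_url):
    """Classify a link as 'intrinsic' (same host) or 'transverse' (different hosts)."""
    # Assumption: the URL's host name stands in for the "domain name".
    source_host = urlsplit(source_url).hostname
    target_host = urlsplit(target_url).hostname
    return "intrinsic" if source_host == target_host else "transverse"

same_site = link_type("http://example.com/a.html", "http://example.com/b.html")
cross_site = link_type("http://example.com/a.html", "http://example.org/index.html")
```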
To determine collections of similar pages, we
need to define a similarity measure between pages.
There are two basic similarity functions:
bibliographic coupling and co-citation.
For a pair of nodes p and q, the bibliographic
coupling is equal to the number of nodes that
have links from both p and q.
Example: Documents are said to be
bibliographically coupled if they share one or
more bibliographic references. Coupling is used as an
indicator of subject relatedness. There is no guarantee
that two bibliographically coupled documents (A) and
(B) cite the same piece of information in (C).
For a pair of nodes p and q, the co-citation is
the number of nodes that point to both p and q.
Example: If A and B are both cited by C, they may be said
to be related to one another, even though they
do not directly reference each other.
If A and B are both cited by many other items,
they have a stronger relationship. The more items
they are cited by, the stronger their relationship is.
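Both measures can be computed directly from a table of outgoing links. The sketch below assumes the link graph is stored as a dictionary from each page to the pages it links to; the function names and the example graph are illustrative only.

```python
def bibliographic_coupling(out_links, p, q):
    """Number of nodes that both p and q link to."""
    return len(set(out_links.get(p, ())) & set(out_links.get(q, ())))

def co_citation(out_links, p, q):
    """Number of nodes that link to both p and q."""
    return sum(1 for targets in out_links.values() if p in targets and q in targets)

# Hypothetical graph: A and B share the references C and D,
# and both are cited by X and Y (Z cites only A).
graph = {
    "A": ["C", "D"],
    "B": ["C", "D", "E"],
    "X": ["A", "B"],
    "Y": ["A", "B"],
    "Z": ["A"],
}
coupling_ab = bibliographic_coupling(graph, "A", "B")
cocitation_ab = co_citation(graph, "A", "B")
```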
In some cases we can take into
account both bibliographic coupling and
co-citation. The similarity
measure between two sub-clusters
Sx and Sy is computed as
|Sx ∩ Sy| / |Sx ∪ Sy|
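This ratio of intersection size to union size is commonly known as the Jaccard coefficient. A minimal sketch, assuming each sub-cluster is represented as a set of page identifiers (the representation, the function name and the handling of two empty sets are assumptions of this example):

```python
def subcluster_similarity(sx, sy):
    """Return |Sx ∩ Sy| / |Sx ∪ Sy| for two collections of page identifiers."""
    sx, sy = set(sx), set(sy)
    union = sx | sy
    if not union:
        return 0.0  # assumption: two empty sub-clusters are treated as dissimilar
    return len(sx & sy) / len(union)

similarity = subcluster_similarity({"a", "b", "c"}, {"b", "c", "d"})
```

Here the two sub-clusters share two of four distinct pages, giving a similarity of 0.5.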
Web Usage Mining is the process of
applying data mining techniques to the
discovery of usage patterns from Web data,
in order to understand and better serve the
needs of Web-based applications.
WEB USAGE MINING
Web usage mining studies the data
generated by web surfers' sessions or behaviors.
Web content and structure mining utilize the real or
primary data on the web. In contrast, web usage
mining mines the secondary data derived from users'
interactions with the web: web server access logs, proxy
server logs, browser logs, user profiles, registration data,
user sessions or transactions, cookies, user queries,
bookmark data, mouse clicks and scrolls, and any other
data that results from these interactions.
In simple words we can say that it is a discovery of user
access pattern from the web usage logs.
Two Approaches in web usage mining
1. General access pattern tracking:
This is to learn user navigation patterns
(impersonalized).
The general access pattern tracking analyzes the web
logs to understand access patterns and trends
2. Customized usage tracking
This is to learn a user profile or user modeling in adaptive
interfaces (personalized).
Customized usage tracking analyzes individual trends.
Its purpose is to customize web sites to users.
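The difference between the two approaches can be sketched on a tiny, made-up access log. The log format (pairs of user identifier and URL) and the function names are assumptions for illustration; real server logs need parsing and session identification first.

```python
from collections import Counter, defaultdict

def general_access_pattern(log):
    """General tracking: aggregate page popularity, ignoring who made the request."""
    return Counter(url for _user, url in log)

def per_user_profiles(log):
    """Customized tracking: one page-visit profile per individual user."""
    profiles = defaultdict(Counter)
    for user, url in log:
        profiles[user][url] += 1
    return profiles

# Hypothetical log entries: (user id, requested URL).
log = [
    ("u1", "/home"),
    ("u1", "/sports"),
    ("u2", "/home"),
    ("u1", "/sports"),
    ("u2", "/finance"),
]
overall = general_access_pattern(log)
profiles = per_user_profiles(log)
```

The aggregate counts show site-wide trends, while the per-user profile is the kind of information a site could use to tailor pages to an individual.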
Text mining is the subset of data mining that
involves processing unstructured text
documents into a structured format.
Web mining is a subset of data mining that
involves processing the data related to the
web. It can be web logs, web structure data,
or web content data.
Due to the continuous growth of the volume of text
data, automated extraction of implicit, previously
unknown, and potentially useful information
becomes more necessary to properly utilize this
vast source of knowledge.
Text mining, therefore, corresponds to the
extension of the data mining approach to textual
data and is concerned with various tasks, such as
extraction of information implicitly contained in
collections of documents, or similarity-based
structuring.
UNSTRUCTURED TEXT
Unstructured documents are free texts such as
news stories.
Features
1. Word Occurrences
The bag-of-words or vector representation takes single words in the
training corpus as features, ignoring the sequence in which the words
occur.
2. Stop-words
Feature selection includes removing case, punctuation,
infrequent words, and stop words.
3. Latent Semantic Indexing
Latent Semantic Indexing (LSI) transforms the original
document vectors to a lower-dimensional space by
analyzing the correlational structure of terms in the document
collection, such that similar documents that do not share terms
are placed in the same topic.
4. Stemming
Stemming is a process which reduces words to their
morphological roots. For example, the words "informing", "information",
"informer", and "informed" would be stemmed to their common root
"inform", and only this root is used as the feature instead of
the four original words.
5. N-Gram
Other feature representations are also possible, such as using
information about word positions in the document, or using
n-grams representation (word sequence of length up to n).
6. Part Of Speech (POS)
One important feature is POS. There can be 25 possible values
for POS tags. Most common tags are noun, verb, adjective
and adverb.
7. Positional Collocations
The values of this type of feature are the words that occur one or
two positions to the right or left of the given word.
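Several of the features above can be sketched in a few lines of plain Python. The tokenizer, the tiny stop-word list, the function names and the sample sentence are all assumptions for illustration. Stemming, LSI and POS tagging are omitted because they normally rely on dedicated libraries.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}

def tokenize(text):
    """Remove case and punctuation, keeping alphabetic tokens in order."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(text):
    """Word occurrences with stop words removed; word order is ignored."""
    return Counter(t for t in tokenize(text) if t not in STOP_WORDS)

def ngrams(tokens, n):
    """All word sequences of exactly length n (call for 1..n to get 'up to n')."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def positional_collocations(tokens, word, window=2):
    """Words up to `window` positions to the left or right of each occurrence of `word`."""
    found = []
    for i, token in enumerate(tokens):
        if token == word:
            found.extend(tokens[max(0, i - window):i])
            found.extend(tokens[i + 1:i + 1 + window])
    return found

text = "The web is a source of data, and the web grows."
tokens = tokenize(text)
bag = bag_of_words(text)
bigrams = ngrams(tokens[:3], 2)
neighbours = positional_collocations(tokens, "source")
```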

More Related Content

What's hot

MODULE 5 _ Mining frequent patterns and associations.pptx
MODULE 5 _ Mining frequent patterns and associations.pptxMODULE 5 _ Mining frequent patterns and associations.pptx
MODULE 5 _ Mining frequent patterns and associations.pptxnikshaikh786
 
Team project - Data visualization on Olist company data
Team project - Data visualization on Olist company dataTeam project - Data visualization on Olist company data
Team project - Data visualization on Olist company dataManasa Damera
 
Graph Neural Network (한국어)
Graph Neural Network (한국어)Graph Neural Network (한국어)
Graph Neural Network (한국어)Jungwon Kim
 
Market Basket Analysis in SAS
Market Basket Analysis in SASMarket Basket Analysis in SAS
Market Basket Analysis in SASAndrew Kramer
 
chap7_basic_cluster_analysis.pptx
chap7_basic_cluster_analysis.pptxchap7_basic_cluster_analysis.pptx
chap7_basic_cluster_analysis.pptxDiaaMustafa2
 
Data Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysisData Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysisDataminingTools Inc
 
K-MEDOIDS CLUSTERING USING PARTITIONING AROUND MEDOIDS FOR PERFORMING FACE R...
K-MEDOIDS CLUSTERING  USING PARTITIONING AROUND MEDOIDS FOR PERFORMING FACE R...K-MEDOIDS CLUSTERING  USING PARTITIONING AROUND MEDOIDS FOR PERFORMING FACE R...
K-MEDOIDS CLUSTERING USING PARTITIONING AROUND MEDOIDS FOR PERFORMING FACE R...ijscmc
 
Web Mining & Text Mining
Web Mining & Text MiningWeb Mining & Text Mining
Web Mining & Text MiningHemant Sharma
 
1.11.association mining 3
1.11.association mining 31.11.association mining 3
1.11.association mining 3Krish_ver2
 
Classification techniques in data mining
Classification techniques in data miningClassification techniques in data mining
Classification techniques in data miningKamal Acharya
 
3.3 hierarchical methods
3.3 hierarchical methods3.3 hierarchical methods
3.3 hierarchical methodsKrish_ver2
 
Association rule mining
Association rule miningAssociation rule mining
Association rule miningUtkarsh Sharma
 
2.4 rule based classification
2.4 rule based classification2.4 rule based classification
2.4 rule based classificationKrish_ver2
 
Module 1_Data Warehousing Fundamentals.pptx
Module 1_Data Warehousing Fundamentals.pptxModule 1_Data Warehousing Fundamentals.pptx
Module 1_Data Warehousing Fundamentals.pptxnikshaikh786
 
Classification Algorithm.
Classification Algorithm.Classification Algorithm.
Classification Algorithm.Megha Sharma
 
Cluster Analysis Introduction
Cluster Analysis IntroductionCluster Analysis Introduction
Cluster Analysis IntroductionPrasiddhaSarma
 

What's hot (20)

Clusters techniques
Clusters techniquesClusters techniques
Clusters techniques
 
MODULE 5 _ Mining frequent patterns and associations.pptx
MODULE 5 _ Mining frequent patterns and associations.pptxMODULE 5 _ Mining frequent patterns and associations.pptx
MODULE 5 _ Mining frequent patterns and associations.pptx
 
Team project - Data visualization on Olist company data
Team project - Data visualization on Olist company dataTeam project - Data visualization on Olist company data
Team project - Data visualization on Olist company data
 
Market basket analysis
Market basket analysisMarket basket analysis
Market basket analysis
 
Graph Neural Network (한국어)
Graph Neural Network (한국어)Graph Neural Network (한국어)
Graph Neural Network (한국어)
 
Market Basket Analysis in SAS
Market Basket Analysis in SASMarket Basket Analysis in SAS
Market Basket Analysis in SAS
 
chap7_basic_cluster_analysis.pptx
chap7_basic_cluster_analysis.pptxchap7_basic_cluster_analysis.pptx
chap7_basic_cluster_analysis.pptx
 
Data Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysisData Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysis
 
K-MEDOIDS CLUSTERING USING PARTITIONING AROUND MEDOIDS FOR PERFORMING FACE R...
K-MEDOIDS CLUSTERING  USING PARTITIONING AROUND MEDOIDS FOR PERFORMING FACE R...K-MEDOIDS CLUSTERING  USING PARTITIONING AROUND MEDOIDS FOR PERFORMING FACE R...
K-MEDOIDS CLUSTERING USING PARTITIONING AROUND MEDOIDS FOR PERFORMING FACE R...
 
Web Mining & Text Mining
Web Mining & Text MiningWeb Mining & Text Mining
Web Mining & Text Mining
 
Apriori algorithm
Apriori algorithmApriori algorithm
Apriori algorithm
 
1.11.association mining 3
1.11.association mining 31.11.association mining 3
1.11.association mining 3
 
Classification techniques in data mining
Classification techniques in data miningClassification techniques in data mining
Classification techniques in data mining
 
3.3 hierarchical methods
3.3 hierarchical methods3.3 hierarchical methods
3.3 hierarchical methods
 
Chapter8
Chapter8Chapter8
Chapter8
 
Association rule mining
Association rule miningAssociation rule mining
Association rule mining
 
2.4 rule based classification
2.4 rule based classification2.4 rule based classification
2.4 rule based classification
 
Module 1_Data Warehousing Fundamentals.pptx
Module 1_Data Warehousing Fundamentals.pptxModule 1_Data Warehousing Fundamentals.pptx
Module 1_Data Warehousing Fundamentals.pptx
 
Classification Algorithm.
Classification Algorithm.Classification Algorithm.
Classification Algorithm.
 
Cluster Analysis Introduction
Cluster Analysis IntroductionCluster Analysis Introduction
Cluster Analysis Introduction
 

Similar to WEBMINING_SOWMYAJYOTHI.pdf

Comparable Analysis of Web Mining Categories
Comparable Analysis of Web Mining CategoriesComparable Analysis of Web Mining Categories
Comparable Analysis of Web Mining Categoriestheijes
 
A Study on Web Structure Mining
A Study on Web Structure MiningA Study on Web Structure Mining
A Study on Web Structure MiningIRJET Journal
 
A Study On Web Structure Mining
A Study On Web Structure MiningA Study On Web Structure Mining
A Study On Web Structure MiningNicole Heredia
 
Web Usage Mining: A Survey on User's Navigation Pattern from Web Logs
Web Usage Mining: A Survey on User's Navigation Pattern from Web LogsWeb Usage Mining: A Survey on User's Navigation Pattern from Web Logs
Web Usage Mining: A Survey on User's Navigation Pattern from Web Logsijsrd.com
 
Data mining in web search engine optimization
Data mining in web search engine optimizationData mining in web search engine optimization
Data mining in web search engine optimizationBookStoreLib
 
International conference On Computer Science And technology
International conference On Computer Science And technologyInternational conference On Computer Science And technology
International conference On Computer Science And technologyanchalsinghdm
 
A detail survey of page re ranking various web features and techniques
A detail survey of page re ranking various web features and techniquesA detail survey of page re ranking various web features and techniques
A detail survey of page re ranking various web features and techniquesijctet
 
`A Survey on approaches of Web Mining in Varied Areas
`A Survey on approaches of Web Mining in Varied Areas`A Survey on approaches of Web Mining in Varied Areas
`A Survey on approaches of Web Mining in Varied Areasinventionjournals
 
Data Processing in Web Mining Structure by Hyperlinks and Pagerank
Data Processing in Web Mining Structure by Hyperlinks and PagerankData Processing in Web Mining Structure by Hyperlinks and Pagerank
Data Processing in Web Mining Structure by Hyperlinks and Pagerankijtsrd
 
Odam an optimized distributed association rule mining algorithm (synopsis)
Odam an optimized distributed association rule mining algorithm (synopsis)Odam an optimized distributed association rule mining algorithm (synopsis)
Odam an optimized distributed association rule mining algorithm (synopsis)Mumbai Academisc
 

Similar to WEBMINING_SOWMYAJYOTHI.pdf (20)

WEB MINING.pptx
WEB MINING.pptxWEB MINING.pptx
WEB MINING.pptx
 
Comparable Analysis of Web Mining Categories
Comparable Analysis of Web Mining CategoriesComparable Analysis of Web Mining Categories
Comparable Analysis of Web Mining Categories
 
A Study on Web Structure Mining
A Study on Web Structure MiningA Study on Web Structure Mining
A Study on Web Structure Mining
 
A Study On Web Structure Mining
A Study On Web Structure MiningA Study On Web Structure Mining
A Study On Web Structure Mining
 
ANALYSIS OF RESEARCH ISSUES IN WEB DATA MINING
ANALYSIS OF RESEARCH ISSUES IN WEB DATA MINING ANALYSIS OF RESEARCH ISSUES IN WEB DATA MINING
ANALYSIS OF RESEARCH ISSUES IN WEB DATA MINING
 
Web Usage Mining: A Survey on User's Navigation Pattern from Web Logs
Web Usage Mining: A Survey on User's Navigation Pattern from Web LogsWeb Usage Mining: A Survey on User's Navigation Pattern from Web Logs
Web Usage Mining: A Survey on User's Navigation Pattern from Web Logs
 
Data mining in web search engine optimization
Data mining in web search engine optimizationData mining in web search engine optimization
Data mining in web search engine optimization
 
Web Content Mining
Web Content MiningWeb Content Mining
Web Content Mining
 
Web content mining
Web content miningWeb content mining
Web content mining
 
International conference On Computer Science And technology
International conference On Computer Science And technologyInternational conference On Computer Science And technology
International conference On Computer Science And technology
 
Aa03401490154
Aa03401490154Aa03401490154
Aa03401490154
 
Minning www
Minning wwwMinning www
Minning www
 
A detail survey of page re ranking various web features and techniques
A detail survey of page re ranking various web features and techniquesA detail survey of page re ranking various web features and techniques
A detail survey of page re ranking various web features and techniques
 
Web mining
Web miningWeb mining
Web mining
 
F43033234
F43033234F43033234
F43033234
 
`A Survey on approaches of Web Mining in Varied Areas
`A Survey on approaches of Web Mining in Varied Areas`A Survey on approaches of Web Mining in Varied Areas
`A Survey on approaches of Web Mining in Varied Areas
 
01635156
0163515601635156
01635156
 
Data Processing in Web Mining Structure by Hyperlinks and Pagerank
Data Processing in Web Mining Structure by Hyperlinks and PagerankData Processing in Web Mining Structure by Hyperlinks and Pagerank
Data Processing in Web Mining Structure by Hyperlinks and Pagerank
 
Odam an optimized distributed association rule mining algorithm (synopsis)
Odam an optimized distributed association rule mining algorithm (synopsis)Odam an optimized distributed association rule mining algorithm (synopsis)
Odam an optimized distributed association rule mining algorithm (synopsis)
 
Pxc3893553
Pxc3893553Pxc3893553
Pxc3893553
 

More from SowmyaJyothi3

USER DEFINED FUNCTIONS IN C MRS.SOWMYA JYOTHI.pdf
USER DEFINED FUNCTIONS IN C MRS.SOWMYA JYOTHI.pdfUSER DEFINED FUNCTIONS IN C MRS.SOWMYA JYOTHI.pdf
USER DEFINED FUNCTIONS IN C MRS.SOWMYA JYOTHI.pdfSowmyaJyothi3
 
STRUCTURE AND UNION IN C MRS.SOWMYA JYOTHI.pdf
STRUCTURE AND UNION IN C MRS.SOWMYA JYOTHI.pdfSTRUCTURE AND UNION IN C MRS.SOWMYA JYOTHI.pdf
STRUCTURE AND UNION IN C MRS.SOWMYA JYOTHI.pdfSowmyaJyothi3
 
STRINGS IN C MRS.SOWMYA JYOTHI.pdf
STRINGS IN C MRS.SOWMYA JYOTHI.pdfSTRINGS IN C MRS.SOWMYA JYOTHI.pdf
STRINGS IN C MRS.SOWMYA JYOTHI.pdfSowmyaJyothi3
 
POINTERS IN C MRS.SOWMYA JYOTHI.pdf
POINTERS IN C MRS.SOWMYA JYOTHI.pdfPOINTERS IN C MRS.SOWMYA JYOTHI.pdf
POINTERS IN C MRS.SOWMYA JYOTHI.pdfSowmyaJyothi3
 
MANAGING INPUT AND OUTPUT OPERATIONS IN C MRS.SOWMYA JYOTHI.pdf
MANAGING INPUT AND OUTPUT OPERATIONS IN C    MRS.SOWMYA JYOTHI.pdfMANAGING INPUT AND OUTPUT OPERATIONS IN C    MRS.SOWMYA JYOTHI.pdf
MANAGING INPUT AND OUTPUT OPERATIONS IN C MRS.SOWMYA JYOTHI.pdfSowmyaJyothi3
 
Constants Variables Datatypes by Mrs. Sowmya Jyothi
Constants Variables Datatypes by Mrs. Sowmya JyothiConstants Variables Datatypes by Mrs. Sowmya Jyothi
Constants Variables Datatypes by Mrs. Sowmya JyothiSowmyaJyothi3
 

More from SowmyaJyothi3 (6)

USER DEFINED FUNCTIONS IN C MRS.SOWMYA JYOTHI.pdf
USER DEFINED FUNCTIONS IN C MRS.SOWMYA JYOTHI.pdfUSER DEFINED FUNCTIONS IN C MRS.SOWMYA JYOTHI.pdf
USER DEFINED FUNCTIONS IN C MRS.SOWMYA JYOTHI.pdf
 
STRUCTURE AND UNION IN C MRS.SOWMYA JYOTHI.pdf
STRUCTURE AND UNION IN C MRS.SOWMYA JYOTHI.pdfSTRUCTURE AND UNION IN C MRS.SOWMYA JYOTHI.pdf
STRUCTURE AND UNION IN C MRS.SOWMYA JYOTHI.pdf
 
STRINGS IN C MRS.SOWMYA JYOTHI.pdf
STRINGS IN C MRS.SOWMYA JYOTHI.pdfSTRINGS IN C MRS.SOWMYA JYOTHI.pdf
STRINGS IN C MRS.SOWMYA JYOTHI.pdf
 
POINTERS IN C MRS.SOWMYA JYOTHI.pdf
POINTERS IN C MRS.SOWMYA JYOTHI.pdfPOINTERS IN C MRS.SOWMYA JYOTHI.pdf
POINTERS IN C MRS.SOWMYA JYOTHI.pdf
 
MANAGING INPUT AND OUTPUT OPERATIONS IN C MRS.SOWMYA JYOTHI.pdf
MANAGING INPUT AND OUTPUT OPERATIONS IN C    MRS.SOWMYA JYOTHI.pdfMANAGING INPUT AND OUTPUT OPERATIONS IN C    MRS.SOWMYA JYOTHI.pdf
MANAGING INPUT AND OUTPUT OPERATIONS IN C MRS.SOWMYA JYOTHI.pdf
 
Constants Variables Datatypes by Mrs. Sowmya Jyothi
Constants Variables Datatypes by Mrs. Sowmya JyothiConstants Variables Datatypes by Mrs. Sowmya Jyothi
Constants Variables Datatypes by Mrs. Sowmya Jyothi
 

Recently uploaded

Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxCarlos105
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTiammrhaywood
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxGrade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxChelloAnnAsuncion2
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfTechSoup
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfphamnguyenenglishnb
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4MiaBumagat1
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersSabitha Banu
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...Nguyen Thanh Tu Collection
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYKayeClaireEstoconing
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parentsnavabharathschool99
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONHumphrey A Beña
 

Recently uploaded (20)

Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxGrade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17

WEBMINING_SOWMYAJYOTHI.pdf

  • 1. WEB MINING References: Data mining techniques by Arun k. Pujari MRS.SOWMYA JYOTHI SDMCBM MANGALORE
  • 2. Web Mining: “Web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from the Web data.”
  • 3. Discovering knowledge from and about the WWW is one of the basic abilities of an intelligent agent. [Figure: Knowledge, WWW]
  • 4. Web Mining: Subtasks 1. Resource finding ◦ Retrieving intended documents 2. Information selection/pre-processing ◦ Select and pre-process specific information from selected documents 3. Generalization ◦ Discover general patterns within and across web sites 4. Analysis ◦ Validation and/or interpretation of mined patterns
  • 5. Web Mining: As Kosala et al. put it, we interact with the web for the following purposes. 1. Finding Relevant Information: We either browse or use the search service when we want to find specific information on the web. We usually specify a simple keyword query, and the response from the web search engine is a list of pages ranked by their similarity to the query.
  • 6. Search tools have the following problems: 1. Low precision: Many of the search results are irrelevant. We may get many pages of information that are not really relevant to our query. 2. Low recall/unindexed information: Not all the information available on the web can be indexed. Because some of the relevant pages are not properly indexed, we may not get those pages through any of the search engines.
  • 7. 2. Discovering new knowledge from the web: We can have a data-triggered process that presumes we already have a collection of web data and want to extract potentially useful knowledge out of it (data mining-oriented). 3. Personalized web page synthesis: We may wish to synthesize a web page for different individuals from the available set of web pages, i.e., catering to personal preferences in content and presentation. Individuals have their own preferences in the style of content and presentation while interacting with the web.
  • 8. 4. Learning about individual users: This problem contains subproblems, such as mass-customizing information for the intended consumers or even personalizing it for an individual user, problems related to effective web site design and management, and problems related to marketing.
  • 9. Web mining can be said to have three operations of interest: 1. Clustering: finding natural groupings of users, pages, etc. 2. Associations: which URLs tend to be requested together. 3. Sequential analysis: the order in which URLs tend to be accessed.
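The association and sequential operations on this slide can be illustrated with simple counting. This is a minimal sketch, not a full association-rule or sequence miner: the `sessions` data and both function names are hypothetical, and each session is assumed to be an ordered list of requested URLs.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sessions: each is the ordered list of URLs one visitor requested.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/home", "/about"],
]

def co_requested(sessions):
    """Associations: count how many sessions contain each unordered URL pair."""
    counts = Counter()
    for session in sessions:
        for pair in combinations(sorted(set(session)), 2):
            counts[pair] += 1
    return counts

def sequential_pairs(sessions):
    """Sequential analysis: count each (URL, next URL) transition, order preserved."""
    counts = Counter()
    for session in sessions:
        for first, second in zip(session, session[1:]):
            counts[(first, second)] += 1
    return counts

pair_counts = co_requested(sessions)
transition_counts = sequential_pairs(sessions)
```

Pairs with high counts in `pair_counts` are candidates for "requested together"; `transition_counts` additionally keeps the order of access.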
  • 10.
  • 11. WEB CONTENT MINING: • Web content mining describes the discovery of useful information from web contents. • The web contains many kinds of data. • Much government information has gradually been placed on the web in recent years. • Digital libraries are also accessible from the web. • Many commercial institutions are moving their business and services online. • Another type of web content is web applications accessed through web interfaces. • Some web content is hidden data, and some is generated dynamically as a result of queries and resides in a DBMS. These data are generally not indexed.
  • 12. Web content consists of several types of data such as textual, image, audio, video, metadata as well as hyperlinks. Recent research on mining multi-types of data is termed as multimedia data mining. The textual parts of web content data consist of unstructured data such as free texts, semi-structured data such as HTML documents and more structured data such as data in the tables or database-generated HTML pages.
  • 13. Web Structure Mining: Web structure mining is concerned with discovering the model underlying the link structures of the web; that is, it studies the structure of documents within the web itself. It is used to study the topology of the hyperlinks, with or without descriptions of the links. This model can be used to categorize web pages and is useful for generating information such as the similarity and relationship between different web sites. The interest is in the structure between web documents (not within a document). The approach is inspired by the study of social networks and citation analysis.
  • 14. PageRank (PR) is an algorithm used by Google Search to rank web pages in their search engine results. It is named after both the term "web page" and co-founder Larry Page. PageRank is a way of measuring the importance of website pages. Damping factor: The PageRank theory holds that an imaginary surfer who is randomly clicking on links will eventually stop clicking. The probability, at any step, that the person will continue is a damping factor d.
  • 15. PageRank is defined as follows: We assume page A has pages T1, ..., Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1 and is usually set to 0.85. out_deg(A) denotes the number of links going out of page A (the out-degree of A). Then PR(A) = (1 - d) + d * (PR(T1)/out_deg(T1) + ... + PR(Tn)/out_deg(Tn)).
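A minimal sketch of how these quantities combine, assuming the original Brin and Page formulation PR(A) = (1 - d) + d * (PR(T1)/out_deg(T1) + ... + PR(Tn)/out_deg(Tn)) and a fixed number of iterations. The function name, the `toy_links` graph, and the iteration count are hypothetical choices made for illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / out_deg(T)) over pages T linking to A.

    `links` maps each page to the list of pages it links to (no duplicates assumed).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pr = {page: 1.0 for page in pages}  # start every page at rank 1
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each page T that links here passes on PR(T) / out_deg(T).
            incoming = sum(
                pr[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Hypothetical three-page web: A -> B, A -> C, B -> C, C -> A
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_links)
```

In this toy graph C is pointed to by both A and B, while B is pointed to only by A, so C ends up with a higher rank than B.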
  • 16. Social Network: Social network analysis is another way of studying the web link structure. It uses an exponentially varying damping structure. Web structure mining utilizes the hyperlinks structure of the web to apply social network analysis, to model the underlying links structure of the web itself. The social network studies ways to measure the relative standing or importance of individuals in a network. The same process can be mapped to study the link structures of the web pages.
  • 17. • INDEX NODE: An index node is a node whose out-degree is significantly larger than the average out-degree of the graph. • REFERENCE NODE: A reference node is a node whose in-degree is significantly larger than the average in-degree of the graph. • A link is said to be a transverse link if it is between pages with different domain names. • A link is said to be an intrinsic link if it is between pages with the same domain name. Here, by "domain name" we mean the first level in the URL string associated with the page.
  • 18. For determining the collection of similar pages, we need to define the similarity measure between pages. There can be two basic similarity functions. For the pair of nodes, p and q, the bibliographic coupling is equal to the number of nodes that have links from both p and q. Example: Documents are said to be bibliographically coupled if they share one or more bibliographic references. It is used as an indicator of subject relatedness. There is no guarantee that two bibliographically coupled documents (A) and (B) cite the same piece of information in (C).
  • 19. For the pair of nodes, p and q, the co-citation is the number of nodes that point to both p and q. Eg: If A and B are both cited by C, they may be said to be related to one another, even though they don't directly reference each other. If A and B are both cited by many other items, they have a stronger relationship. The more items they are cited by, the stronger their relationship is.
  • 20. In some cases we can take into account both bibliographic coupling and co-citation. The similarity measure between two sub-clusters Sx and Sy is computed as |Sx ∩ Sy| / |Sx ∪ Sy|.
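The two similarity functions and the set-based measure can be sketched as follows. This assumes the link graph is stored as a mapping from each page to the pages it links to, and it reads the cluster measure as intersection size over union size (the Jaccard coefficient). The graph, the helper names, and the choice to apply the measure to out-link sets are illustrative assumptions; the slides do not say what the sets Sx and Sy contain.

```python
def out_links(graph, node):
    """Set of pages that `node` links to."""
    return set(graph.get(node, ()))

def in_links(graph, node):
    """Set of pages that link to `node`."""
    return {src for src, targets in graph.items() if node in targets}

def bibliographic_coupling(graph, p, q):
    """Number of nodes that have links from both p and q."""
    return len(out_links(graph, p) & out_links(graph, q))

def co_citation(graph, p, q):
    """Number of nodes that point to both p and q."""
    return len(in_links(graph, p) & in_links(graph, q))

def jaccard(sx, sy):
    """|Sx ∩ Sy| / |Sx ∪ Sy|, defined as 0.0 when both sets are empty."""
    union = sx | sy
    return len(sx & sy) / len(union) if union else 0.0

# Hypothetical graph: A and B cite overlapping pages and are both cited by F and G.
graph = {
    "A": ["C", "D"],
    "B": ["C", "D", "E"],
    "F": ["A", "B"],
    "G": ["A", "B"],
}

coupling = bibliographic_coupling(graph, "A", "B")
cocited = co_citation(graph, "A", "B")
similarity = jaccard(out_links(graph, "A"), out_links(graph, "B"))
```

Here A and B share two outgoing references (C and D) and are cited together by two pages (F and G); their out-link sets overlap in two of three distinct targets.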
  • 21.
  • 22. Web Usage Mining is the process of applying data mining techniques to the discovery of usage patterns from Web data, in order to understand and better serve the needs of Web-based applications.
  • 23. WEB USAGE MINING Web usage mining deals with studying the data generated by web surfers' sessions or behaviors. Web content and structure mining utilize the real or primary data on the web. In contrast, web usage mining mines the secondary data derived from web server access logs, proxy server logs, browser logs, user profiles, registration data, user sessions or transactions, cookies, user queries, bookmark data, mouse clicks and scrolls, and any other data that result from these interactions. In simple words, it is the discovery of user access patterns from web usage logs.
  • 24. Two Approaches in web usage mining 1. General access pattern tracking: This learns user navigation patterns (impersonalized). General access pattern tracking analyzes the web logs to understand access patterns and trends. 2. Customized usage tracking: This learns a user profile or user model in adaptive interfaces (personalized). Customized usage tracking analyzes individual trends. Its purpose is to customize web sites to users.
  • 25. Text mining is the subset of data mining that involves processing unstructured text documents into a structured format. Web mining is a subset of data mining that involves processing data related to the web. It can be web logs, web structure data, or web content data.
  • 26. Due to the continuous growth in the volume of text data, automated extraction of implicit, previously unknown, and potentially useful information becomes more necessary to properly utilize this vast source of knowledge. Text mining, therefore, corresponds to the extension of the data mining approach to textual data and is concerned with various tasks, such as extraction of information implicitly contained in a collection of documents, or similarity-based structuring.
  • 27. UNSTRUCTURED TEXT Unstructured documents are free texts such as news stories. Features: 1. Word Occurrences: The bag-of-words or vector representation takes single words in the training corpus as features, ignoring the sequence in which the words occur. 2. Stop-words: Feature selection includes removing case, punctuation, infrequent words, and stop words. 3. Latent Semantic Indexing: Latent Semantic Indexing (LSI) transforms the original document vectors to a lower-dimensional space by analyzing the correlational structure of terms in the document collection, such that similar documents that do not share terms are placed in the same topic.
  • 28. 4. Stemming: Stemming is a process which reduces words to their morphological roots. For example, the words "informing", "information", "informer", and "informed" would be stemmed to their common root "inform", and only the latter is used as the feature instead of the former four. 5. N-Gram: Other feature representations are also possible, such as using information about word positions in the document, or using an n-gram representation (word sequences of length up to n). 6. Part Of Speech (POS): One important feature is POS. There can be 25 possible values for POS tags. The most common tags are noun, verb, adjective and adverb. 7. Positional Collocations: The values of this type of feature are the words that occur one or two positions to the right or left of the given word.
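Several of the text features listed above can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions: the `STOP_WORDS` set is a tiny hypothetical list, the tokenizer keeps only ASCII letters, `ngrams` returns sequences of exactly length n (call it for each length up to n to match the slide's description), and stemming, LSI, and POS tagging are left out because they need dedicated libraries.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "is", "and", "to", "in"}

def tokenize(text):
    """Remove case and punctuation, then drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def bag_of_words(tokens):
    """Word occurrences: counts per word, ignoring word order."""
    return Counter(tokens)

def ngrams(tokens, n):
    """All word sequences of length exactly n, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def collocations(tokens, index, window=2):
    """Words up to `window` positions to the left and right of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left, right

tokens = tokenize("Web mining is the mining of the Web.")
counts = bag_of_words(tokens)
bigrams = ngrams(tokens, 2)
left, right = collocations(tokens, 1)
```

After stop-word removal the sentence reduces to four tokens, so the bag of words and the bigram list stay small enough to inspect by hand.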