Web Mining
Concepts and Applications
Utkarsh Sharma
Jaypee University of Engineering & Technology
India
Introduction
• Web mining is the area of data mining that deals with the
information available on the internet. It extracts useful
information from web pages and from the way they are linked and used.
Why Web Mining?
• Web data comes in three forms:
  • Web content: text, images, records, etc.
  • Web structure: hyperlinks, tags, etc.
  • Web usage: HTTP logs, app server logs, etc.
Application: Recommendation System
DIFFERENT TYPES OF RECOMMENDATION ENGINES
• Collaborative Filtering
• Content-Based Filtering
Web Content Mining
Definition
• Web Content Mining is the process of extracting useful information
from the contents of Web documents.
• Content data corresponds to the collection of facts a Web page was designed
to convey to the users.
• Research activities in this field also involve using techniques from
other disciplines such as Information Retrieval (IR) and natural
language processing (NLP).
Web Content Mining
It may consist of:
Pre-processing Content
• Content Preparation
  • Extract text from HTML.
  • Perform stemming.
  • Remove stop words.
  • Calculate collection-wide word frequencies (DF).
  • Calculate per-document term frequencies (TF).
• Vector Creation
  • Common Information Retrieval technique.
  • Each document (HTML page) is represented by a sparse vector of term weights.
  • TF-IDF weighting is most common.
  • Typically, additional weight is given to terms appearing as keywords or in titles.
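The preparation and vector-creation steps above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the stop-word list is a tiny made-up sample, `crude_stem` is a hypothetical stand-in for a real stemmer, the two pages are invented, and the weight used is the plain `tf * log(N / df)` form of TF-IDF (other variants exist).

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML page and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


STOP_WORDS = {"the", "a", "is", "of", "and", "to", "in"}  # illustrative sample only


def crude_stem(word):
    # Hypothetical stand-in for a real stemmer: strips one common suffix.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def tokenize(html):
    parser = TextExtractor()
    parser.feed(html)                                   # extract text from HTML
    words = re.findall(r"[a-z]+", " ".join(parser.chunks).lower())
    stems = [crude_stem(w) for w in words]              # perform stemming
    return [s for s in stems if s not in STOP_WORDS]    # remove stop words


def tfidf_vectors(pages):
    term_counts = [Counter(tokenize(p)) for p in pages]                    # TF per document
    doc_freq = Counter(t for counts in term_counts for t in counts)        # DF per term
    n_docs = len(pages)
    # One sparse vector (dict of term -> weight) per document.
    return [
        {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in counts.items()}
        for counts in term_counts
    ]


pages = [
    "<html><title>Web mining</title><p>Mining the web graph</p></html>",
    "<html><p>Cooking pasta and mining salt</p></html>",
]
vectors = tfidf_vectors(pages)
for vector in vectors:
    print(vector)
```

A term that occurs in every page (here the stem of "mining") gets weight 0, which is the point of the IDF factor: it cannot help tell the pages apart.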
Common Mining Techniques
• The more basic and popular data mining techniques include:
  • Classification
  • Clustering
  • Associations
• Other significant ideas:
  • Topic identification, tracking and drift analysis
  • Concept hierarchy creation
  • Relevance of content
Web Content Mining Applications
• Identify the topics represented by a Web document
• Categorize Web documents
• Find similar Web pages across different servers
• Applications related to relevance:
  • Queries: enhance standard query relevance with user-, role-, and/or task-based relevance
  • Recommendations: list the top "n" relevant documents in a collection or portion of a collection
  • Filters: show or hide documents based on relevance score
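The recommendation and filter applications can be illustrated with a relevance score computed between sparse term-weight vectors. The sketch below assumes cosine similarity as the relevance score (the slides do not name one); the document names, weights, query and threshold are made up for the illustration.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_n(query, documents, n):
    # Recommendation: the n documents with the highest relevance score.
    scored = ((cosine(query, vec), name) for name, vec in documents.items())
    return heapq.nlargest(n, scored)


def relevance_filter(query, documents, threshold):
    # Filter: show only documents whose score reaches the threshold.
    return [name for name, vec in documents.items() if cosine(query, vec) >= threshold]


# Hypothetical term-weight vectors for three pages.
documents = {
    "page1": {"web": 2.0, "graph": 1.0},
    "page2": {"pasta": 1.0, "salt": 1.0},
    "page3": {"web": 1.0, "mining": 1.0},
}
query = {"web": 1.0}

ranked = top_n(query, documents, 2)
visible = relevance_filter(query, documents, 0.5)
print(ranked)
print(visible)
```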
Web Structure Mining
What is Web Structure Mining?
• The structure of a typical Web graph consists of Web
pages as nodes and hyperlinks as edges connecting
two related pages.
• Web Structure Mining is the process of discovering
structure information from the Web.
• This type of mining can be performed either at the (intra-page)
document level or at the (inter-page) hyperlink level.
• Research at the hyperlink level is also called Hyperlink Analysis.
Motivation to study Hyperlink Structure
• Hyperlinks serve two main purposes:
  • Pure navigation.
  • Pointing to pages with authority on the same topic as the page containing the link.
• This can be used to retrieve useful information from the web.
Web Structure Terminology
• Web-graph: A directed graph that represents the Web.
• Node: Each Web page is a node of the Web-graph.
• Link: Each hyperlink on the Web is a directed edge of the Web-graph.
• In-degree: The in-degree of a node, p, is the number of distinct links
that point to p.
• Out-degree: The out-degree of a node, p, is the number of distinct
links originating at p that point to other nodes.
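As a small illustration of the two degree definitions, the sketch below counts distinct links in a made-up edge list (the node names and links are hypothetical; one link is repeated on purpose to show that duplicates count once).

```python
from collections import Counter

# Hypothetical hyperlinks as (source page, target page) pairs.
edges = [("p", "q"), ("p", "r"), ("q", "r"), ("r", "p"), ("p", "q")]

distinct_links = set(edges)  # the definitions count distinct links only

in_degree = Counter(target for _, target in distinct_links)
out_degree = Counter(source for source, _ in distinct_links)

print("in-degree:", dict(in_degree))
print("out-degree:", dict(out_degree))
```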
Web Structure Terminology (2)
• Directed Path: A sequence of links, starting from p, that can be
followed to reach q.
• Shortest Path: Of all the paths between nodes p and q, the one with the
shortest length, i.e. the fewest links on it.
• Diameter: The maximum of the shortest-path lengths over all pairs of
nodes p and q in the Web-graph.
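Shortest paths and the diameter can be computed with breadth-first search when every link has length 1. The sketch below uses a made-up four-node graph and assumes every node can reach every other node; pairs with no directed path are simply not counted here.

```python
from collections import deque


def shortest_path_lengths(graph, source):
    """Breadth-first search: number of links on the shortest path from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def diameter(graph):
    # Largest shortest-path length over all reachable ordered pairs (p, q).
    return max(max(shortest_path_lengths(graph, s).values()) for s in graph)


# Hypothetical Web-graph as an adjacency list: page -> pages it links to.
graph = {"p": ["q"], "q": ["r"], "r": ["p", "s"], "s": ["p"]}

print(shortest_path_lengths(graph, "p"))
print(diameter(graph))
```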
Google’s PageRank
Example
[Figure: directed graph on the four pages A, B, C, D]
Every page starts with rank ¼ (Iter 1). The Iter 2 values are computed from them:
P(A) = (¼)/3 = 1/12
P(B) = (¼)/2 + (¼)/3 = 2.5/12
P(C) = (¼)/2 + ¼ = 4.5/12
P(D) = ¼ + (¼)/3 = 4/12

      Iter 1   Iter 2    Iter 3    Page Rank
A     ¼        1/12      1.5/12    1
B     ¼        2.5/12    2/12      2
C     ¼        4.5/12    4.5/12    4
D     ¼        4/12      4/12      3
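The iterations in the table can be reproduced with a few lines of Python. The link structure below is an assumption: the figure is not available in this transcript, so the edges were inferred from the update formulas in the example (each term is a source page's rank divided by its number of out-links). The sketch uses the simplified update without a damping factor, as the example does.

```python
from fractions import Fraction

# Assumed link structure, inferred from the example's formulas:
# page -> pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}


def pagerank_step(rank, links):
    """One update: every page splits its rank equally among its out-links."""
    new_rank = {page: Fraction(0) for page in links}
    for page, targets in links.items():
        share = rank[page] / len(targets)
        for target in targets:
            new_rank[target] += share
    return new_rank


rank = {page: Fraction(1, 4) for page in links}  # Iter 1: uniform start
history = [rank]
for _ in range(2):                               # Iter 2 and Iter 3
    rank = pagerank_step(rank, links)
    history.append(rank)

for i, ranks in enumerate(history, start=1):
    print(f"Iter {i}:", {page: str(value) for page, value in ranks.items()})
```

Exact fractions are used so the values can be compared directly with the table (for example 5/24 is the same number as 2.5/12).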
HITS Algorithm
• Hyperlink-Induced Topic Search (HITS) was
developed by Jon Kleinberg.
• HITS is applied to a subgraph obtained after a search
is done on the complete graph.
• It uses hubs and authorities to define a
recursive relationship between web pages.
• An authority is a page that many hubs link to.
• A hub is a page that links to many
authorities.
Example
[Figure: directed graph on the four nodes N1, N2, N3, N4]
Nodes   Out-degree (Hub)   In-degree (Authority)
N1      3                  1
N2      2                  1
N3      2                  2
N4      1                  4
Find the hub and authority scores for the given graph for K = 3, with every entry of the initial hub weight vector set to 1.
Example
Adjacency Matrix A
      N1   N2   N3   N4
N1    0    1    1    1
N2    0    0    1    1
N3    1    0    0    1
N4    0    0    0    1

Transpose Matrix Aᵀ
      N1   N2   N3   N4
N1    0    0    1    0
N2    1    0    0    0
N3    1    1    0    0
N4    1    1    1    1
Example
Assume the initial hub weight vector u has every entry equal to 1.
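The K = 3 iterations can be carried out directly from the adjacency matrix: the authority vector is v = Aᵀu and the hub vector is then updated as u = Av. The sketch below is a plain-Python illustration that leaves the scores unnormalized; whether and how the original worked example normalizes after each step is not recoverable from this transcript, so that choice is an assumption.

```python
nodes = ["N1", "N2", "N3", "N4"]

# Adjacency matrix A from the example: A[i][j] = 1 if node i links to node j.
A = [
    [0, 1, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def mat_vec(matrix, vector):
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


A_T = transpose(A)
hub = [1, 1, 1, 1]                    # initial hub weight vector u

for k in range(1, 4):                 # K = 3 iterations
    authority = mat_vec(A_T, hub)     # v = A^T u
    hub = mat_vec(A, authority)       # u = A v
    print(f"k={k}: authority={authority} hub={hub}")
```

Dividing each vector by its sum (or its Euclidean norm) after every step keeps the numbers bounded without changing the ordering of the nodes.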
Python Code for HITS
import networkx as nx
import matplotlib.pyplot as plt

# Build a directed graph from a list of (source, target) links.
G = nx.DiGraph()
G.add_edges_from([('A', 'D'), ('B', 'C'), ('B', 'E'), ('C', 'A'),
                  ('D', 'C'), ('E', 'D'), ('E', 'B'), ('E', 'F'),
                  ('E', 'C'), ('F', 'C'), ('F', 'H'), ('G', 'A'),
                  ('G', 'C'), ('H', 'A')])

# Draw the graph with node labels.
plt.figure(figsize=(10, 10))
nx.draw_networkx(G, with_labels=True)

# Compute normalized hub and authority scores (at most 50 iterations).
hubs, authorities = nx.hits(G, max_iter=50, normalized=True)
print("Hub Scores: ", hubs)
print("Authority Scores: ", authorities)
Output
Hub Scores: {'A': 0.04642540386472174, 'D': 0.133660375232863,
'B': 0.15763599440595596, 'C': 0.037389132480584515,
'E': 0.2588144594158868, 'F': 0.15763599440595596,
'H': 0.037389132480584515, 'G': 0.17104950771344754}
Authority Scores: {'A': 0.10864044085687284, 'D': 0.13489685393050574,
'B': 0.11437974045401585, 'C': 0.3883728005172019,
'E': 0.06966521189369385, 'F': 0.11437974045401585,
'H': 0.06966521189369385, 'G': 0.0}
