WEB MINING
What is Web Mining?
 The Web is a collection of inter-linked files and documents on the
Internet.
 The World Wide Web contains a large amount of data, which makes it a
rich source for data mining.
 Web Mining is the application of Data Mining techniques to
automatically discover and extract information from Web documents
and services.
 The objective of Web mining is to look for patterns in Web data by
collecting and examining data in order to gain insights.
 Web mining helps to improve the power of web search engines by
identifying web pages and classifying web documents.
 Web mining is very useful to e-commerce websites and e-services.
 Web mining is used to predict user behavior.
 It is used in web search (e.g., Google, Yahoo) and in vertical
search.
 A distinctive property of web mining is that it works with a variety of
data types.
 The web has multiple aspects that yield different approaches for the
mining process:
 Web pages consist of text,
 Web pages are linked via hyperlinks, and
 User activity can be monitored via web server logs.
Challenges in Web Mining
 The amount of information on the Web is huge.
 The coverage of Web information is very wide and diverse.
 Web information is semi-structured.
 Web information is linked.
 Web information is redundant.
 The Web consists of the surface web and the deep web.
 Web information is dynamic.
 Web information is noisy.
Web Mining Taxonomy
Web Mining
 Web Content Mining: extracting useful knowledge from the contents of Web documents
 Web Usage Mining: extracting interesting patterns from user interactions on multiple websites
 Web Structure Mining: extracting the hyperlink structure connecting different websites/applications
Web Structure Mining
Web structure mining is the process of discovering structure
information from the Web.
Structure mining produces a structured summary of a particular
website.
It identifies relationships between web pages that are linked by
related information or by a direct link.
Web structure mining can be very useful for determining the
connection between two commercial websites.
The structure of the web graph consists
of web pages as nodes, and hyperlinks as
edges connecting related pages.
Types of mining
 Document level mining (Intra-page)
 Hyperlink level mining (Inter-page)
Research at the hyperlink level is known
as Hyperlink Analysis.
Web pages can be accessed using a URL,
hyperlinks, or navigational tools.
Web page ranking algorithms: PageRank (PR)
 The Google search engine uses the PageRank algorithm to rank web
pages in its search results.
 It is a way of measuring the importance of website pages.
 The PageRank algorithm outputs a probability distribution used to
represent the likelihood that a person randomly clicking on links
will arrive at any particular page.
 PageRank can be calculated for collections of documents of any
size.
“It counts the number and quality of links to a page to determine a rough estimate of
how important the website is. The underlying assumption is that more important
websites are likely to receive more links from other websites.”
https://www.youtube.com/watch?v=meonLcN7LD4
 Google interprets a link from page A to page B as a vote from page A to
page B.
 All incoming links can be interpreted as votes.
 It also takes into consideration the “importance” of the page that is
“giving” out the vote.
 If the page casting a vote is more important, its links are worth
more and help the pages it links to rank higher.
 A page's importance is equal to the sum of the votes on its incoming
links.
 A web page is important if it is pointed to by other important web
pages.
 Types of Links
 Inbound links (In-links)
 Outbound links (Out-links)
The page rank of a page $P_i$, denoted by $PR(P_i)$, is the sum of the page
ranks of all pages pointing into $P_i$, each divided by that page's number of outlinks.
$$PR(P_i) = \sum_{P_j \to P_i} \frac{PR(P_j)}{|P_j|}$$
 $P_j \to P_i$ ranges over the set of pages pointing to $P_i$
 $|P_j|$ = number of outlinks from $P_j$
 $PR$ is the page rank, initially unknown
 All pages are given an equal initial page rank of $\frac{1}{n}$
 $n$ is the number of pages in Google's index of the web
Example: a web graph of six pages, numbered 1 to 6.
[Figure: directed graph of the six pages]

Page   Iteration 1   Iteration 2   Iteration 3   Iteration 3 (decimal)   Rank
1      1/6           1/18          1/36          0.0277                  5
2      1/6           5/36          1/18          0.0555                  4
3      1/6           1/12          1/36          0.0277                  5
4      1/6           1/4           17/72         0.2361                  1
5      1/6           5/36          11/72         0.1527                  3
6      1/6           1/6           14/72         0.1944                  2
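The iterations in the six-page example can be reproduced with a short script. This is only a sketch: the edges of the figure did not survive in this transcript, so the `OUT_LINKS` edge list is an assumption, chosen because it yields the values listed for iterations 2 and 3. The name `pagerank_step` is illustrative.

```python
from fractions import Fraction

# ASSUMPTION: the figure's edges are not in the transcript. This edge list is a
# guess that happens to reproduce the listed values; it is not taken from the slides.
OUT_LINKS = {
    1: [2, 3],
    2: [],         # no out-links in the assumed graph
    3: [1, 2, 5],
    4: [5, 6],
    5: [4, 6],
    6: [4],
}


def pagerank_step(ranks, out_links):
    """Apply PR(P_i) = sum over P_j -> P_i of PR(P_j) / |P_j| once."""
    new_ranks = {page: Fraction(0) for page in ranks}
    for source, targets in out_links.items():
        for target in targets:
            # Each page splits its current rank evenly over its out-links.
            new_ranks[target] += ranks[source] / len(targets)
    return new_ranks


n = len(OUT_LINKS)
iteration_1 = {page: Fraction(1, n) for page in OUT_LINKS}  # every page starts at 1/n
iteration_2 = pagerank_step(iteration_1, OUT_LINKS)
iteration_3 = pagerank_step(iteration_2, OUT_LINKS)

for page in sorted(OUT_LINKS):
    print(page, iteration_1[page], iteration_2[page], iteration_3[page])
```

Exact fractions are used so the results can be compared directly with the slide values (`Fraction` prints 14/72 in lowest terms as 7/36). In the assumed graph page 2 has no out-links, so the ranks no longer sum to 1 after the first update.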
 Simple PageRank Algorithm
$$PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)$$
 where,
 $PR(A)$: PageRank of page $A$
 $PR(T_i)$: PageRank of the pages $T_i$ which link to page $A$
 $C(T_i)$: number of outbound links on page $T_i$
 $d$: damping factor, which can be set between 0 and 1
(usually set to 0.85)
 It is the probability that a random user continues to click on links
as they browse the web.
 Simple PageRank Algorithm: example
[Figure: web graph of three pages A, B and C]
$$PR(A) = (1 - d) + d \cdot \frac{PR(C)}{C(C)} = (1 - 0.85) + 0.85 \cdot \frac{1}{1} = 0.15 + 0.85 = 1$$
$$PR(B) = (1 - d) + d \cdot \frac{PR(A)}{C(A)} = (1 - 0.85) + 0.85 \cdot \frac{1}{2} = 0.15 + 0.425 = 0.575$$
$$PR(C) = (1 - d) + d \left( \frac{PR(A)}{C(A)} + \frac{PR(B)}{C(B)} \right) = (1 - 0.85) + 0.85 \left( \frac{1}{2} + 0.575 \right) = 0.15 + 0.85 \, (0.5 + 0.575) = 0.15 + 0.91375 \approx 1.06$$
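The A, B, C calculation can be checked with a small script. The link structure below is read off the terms of the worked formulas (C is the only page linking to A, A has two outbound links, B has one); the starting value of 1 for every page and the update order A, B, C follow the arithmetic of the example. The function name `in_place_pass` is illustrative, and this is a sketch of a single pass, not a converged PageRank.

```python
DAMPING = 0.85

# Links inferred from the worked formulas: C(A) = 2, C(B) = 1, C(C) = 1.
OUT_LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}


def in_place_pass(ranks, out_links, d=DAMPING):
    """One sweep of PR(X) = (1 - d) + d * sum(PR(T) / C(T)), reusing updated values."""
    for page in ranks:  # dicts keep insertion order: A, then B, then C
        incoming = sum(
            ranks[source] / len(targets)
            for source, targets in out_links.items()
            if page in targets
        )
        ranks[page] = (1 - d) + d * incoming
    return ranks


ranks = in_place_pass({"A": 1.0, "B": 1.0, "C": 1.0}, OUT_LINKS)
for page, value in ranks.items():
    print(page, round(value, 5))
```

Because each new value is used as soon as it is available, PR(C) is computed from the updated PR(B) = 0.575 rather than from its starting value, which matches the example.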
Hyperlink Induced Topic Search (HITS) Algorithm
 A Link Analysis Algorithm that rates webpages, developed by Jon
Kleinberg.
 This algorithm is used to discover and rank the webpages relevant
for a particular search.
 HITS uses hubs and authorities to define a recursive relationship
between webpages.
 Given a query to a search engine, the set of highly relevant web pages
is called the Root set. These pages are potential Authorities.
 Pages that are not very relevant themselves but point to pages in the Root
set are called Hubs.
 An Authority is a page that many hubs link to, whereas a Hub is a page
that links to many authorities.
 Hub and authority scores:
 Compute for each page d in the base set a hub score ℎ(𝑑) and an
authority score 𝑎(𝑑).
 Initialize for all d : ℎ 𝑑 = 1 ; 𝑎 𝑑 = 1.
 Update h(𝑑) and 𝑎(𝑑) for each iteration.
 Output pages with highest ℎ scores as top hubs.
 Output pages with highest 𝑎 scores as top authorities
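$$a = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_n \end{bmatrix}, \qquad h = \begin{bmatrix} h_1 \\ h_2 \\ h_3 \\ \vdots \\ h_n \end{bmatrix}$$
$$a = A^T \cdot h, \qquad h = A \cdot a$$
$$a_0 = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}, \qquad h_0 = A^T \cdot \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}$$
$$a_k = A^T \cdot A \cdot a_{k-1}, \qquad h_k = A \cdot A^T \cdot h_{k-1}$$
$a$ = authority vector
$h$ = hub vector
$A$ = adjacency matrix of the web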
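Example: a web graph of four pages.
[Figure: directed graph with nodes 1, 2, 3 and 4]
$$A = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}, \qquad A^T = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 1 \\ 0 & 1 & 0 & 0 \end{bmatrix}, \qquad h = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}$$
$$a = A^T \cdot h = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 1 \\ 0 & 1 & 0 & 0 \end{bmatrix} \cdot \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 3 \\ 1 \end{bmatrix}$$
$$h = A \cdot a = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \cdot \begin{bmatrix} 0 \\ 0 \\ 3 \\ 1 \end{bmatrix} = \begin{bmatrix} 3 \\ 4 \\ 0 \\ 3 \end{bmatrix}$$
Top hub = Node 2
Top authority = Node 3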
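The four-page HITS example can be replayed in a few lines of plain Python. The adjacency matrix is the one from the example (row i, column j is 1 when page i links to page j); the helper names `transpose` and `mat_vec` are illustrative. This sketch performs a single update of each score, as the example does. Fuller implementations usually repeat the two updates and normalize the vectors after each round, which is not shown here.

```python
# Adjacency matrix from the example: A[i][j] == 1 when page i+1 links to page j+1.
A = [
    [0, 0, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
]


def transpose(matrix):
    """Return the transpose of a list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


def mat_vec(matrix, vector):
    """Multiply a matrix by a column vector."""
    return [sum(x * y for x, y in zip(row, vector)) for row in matrix]


hub = [1, 1, 1, 1]                      # initial hub scores
authority = mat_vec(transpose(A), hub)  # a = A^T . h
hub = mat_vec(A, authority)             # h = A . a

top_hub = hub.index(max(hub)) + 1                    # pages are numbered from 1
top_authority = authority.index(max(authority)) + 1

print("authority:", authority)
print("hub:", hub)
print("top hub:", top_hub, "top authority:", top_authority)
```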
PageRank vs HITS
 PageRank:
 Ranks all the nodes of the complete graph first; a search is applied
afterwards.
 Based on the 'random surfer' idea.
 Example: Google
 HITS:
 Applied at query time, when a search is done on the complete graph.
 Works on a subgraph selected by the query.
 Defines hubs and authorities recursively.
 Examples: ASK, Yahoo, Clever
Web Content Mining
 Web content mining is the process of extracting useful
information from the content of web documents.
 Web content consists of several types of data: text, images, audio,
video, etc.
 Content data is the collection of facts that a web page is designed to convey.
 It can reveal useful and interesting patterns about user needs.
 Mining text documents draws on text mining, machine learning and
natural language processing.
 Web content mining is also known as text mining.
 It scans and mines the text, images and groups of web pages
according to the content of the input.
Web Content Mining
 With web mining, one can
 Cluster web documents
 Classify web pages
 Extract patterns
Research in this field also draws on techniques
from other disciplines such as Information Retrieval (IR) and
Natural Language Processing (NLP).
Applications
 Document clustering / categorization
 Topic identification / tracking
 Concept discovery
 Web Scraping
 Web Scraping is an automatic method to obtain large amounts of data
from websites.
 Unstructured data in HTML format is converted into structured data
in a spreadsheet or a database so that it can be used in various
applications.
 There are several ways to perform web scraping to obtain data
from websites:
 using online services,
 using particular APIs, or
 writing your own web scraping code from scratch.
 Websites like Google, Twitter, Facebook, StackOverflow, etc. have APIs
that allow you to access their data in a structured format.
 For websites that do not let users access large amounts of data
in a structured form, web scraping can be used to extract the data.
 Web scraping requires two parts, namely
the crawler and the scraper.
 The crawler is an automated program
that browses the web, following links
across the internet to search for the
particular data required.
 The scraper is a specific tool created to
extract the data from the website.
 Web Crawler
 A web crawler, also known as a web spider, is a bot that searches and
indexes content on the internet.
 Web crawlers are responsible for understanding the content on a
web page so they can retrieve it when an inquiry is made.
 Web crawlers are operated by search engines with their own
algorithms.
 A web spider will search (crawl) and categorize all the web pages on
the internet that it can find, and index them.
 The best-known web crawler is the Googlebot.
 How does a Web Crawler work?
 Finds information by crawling
 Crawlers look at webpages and follow links on those
pages, much like you would if you were browsing content
on the web.
 They go from link to link and bring data about those
webpages back to the servers.
 Organizes information by indexing
 When crawlers find a webpage, systems render the
content of the page, just as a browser does.
 It takes note of key signals — from keywords to website
freshness — and keeps track of it all in the Search index.
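The crawl-then-index loop described above can be sketched without any network access. Everything in the `PAGES` dictionary (URLs, text, links) is hypothetical stand-in data, and `crawl_and_index` is an illustrative name. A real crawler would download each page and respect robots.txt and rate limits, none of which is shown here.

```python
from collections import deque

# HYPOTHETICAL in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "https://example.com/": (
        "welcome home page",
        ["https://example.com/a", "https://example.com/b"],
    ),
    "https://example.com/a": ("web mining notes", ["https://example.com/b"]),
    "https://example.com/b": ("mining web logs", ["https://example.com/"]),
}


def crawl_and_index(seed, pages):
    """Follow links breadth-first from the seed and build a keyword -> URLs index."""
    index = {}
    visited = set()
    order = []
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if url in visited or url not in pages:
            continue  # skip pages already seen or not reachable in this toy web
        visited.add(url)
        order.append(url)
        text, links = pages[url]
        for word in text.split():
            index.setdefault(word, set()).add(url)  # "indexing": note the keywords
        queue.extend(links)                          # "crawling": follow the links
    return order, index


order, index = crawl_and_index("https://example.com/", PAGES)
print(order)
print(sorted(index["mining"]))
```

The `visited` set is what stops the crawler from looping forever when pages link back to each other, as the third page does here.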
https://www.youtube.com/watch?v=jkuRWcH1-kk
 Applications of Web Crawler
 The crawlers are the basis for the work of search engines.
 Web crawlers are also used for other purposes:
 Price comparison portals search for information on specific products on the
Web, so that prices or data can be compared accurately.
 In the area of data mining, a crawler may collect publicly available e-mail or
postal addresses of companies.
 Web analysis tools use crawlers or spiders to collect data for page views, or
incoming or outbound links.
 Crawlers serve to provide information hubs with data, for example, news
sites.
Web Scraping
 First, the web scraper is given the
URLs of the required sites.
 Then it loads all the HTML code for those
sites.
 The scraper then extracts the required data
from this HTML code and outputs it
in the format specified by the user.
 The output is usually an Excel
spreadsheet or a CSV file, but the data can
also be saved in other formats such as a
JSON file.
https://www.youtube.com/watch?v=US-HP1hiRHY
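The scraper steps (load HTML, extract the required fields, write CSV or JSON) can be sketched with the Python standard library alone. The HTML snippet, the `name`/`price` class names and the `ProductParser` class are all hypothetical; a real scraper would first download the HTML from the target URL. This sketch uses `html.parser` from the standard library rather than Scrapy or Beautiful Soup, so that it runs without extra packages.

```python
import csv
import io
import json
from html.parser import HTMLParser

# HYPOTHETICAL page content standing in for HTML downloaded from a site.
HTML = """
<table>
  <tr><td class="name">Laptop</td><td class="price">999</td></tr>
  <tr><td class="name">Phone</td><td class="price">499</td></tr>
</table>
"""


class ProductParser(HTMLParser):
    """Collect one dict per table row from cells tagged with a known class."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None  # name of the cell currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "tr":
            self.rows.append({})
        elif tag == "td" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field and self.rows:
            self.rows[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None


parser = ProductParser()
parser.feed(HTML)

# Structured output: CSV text (could be written to a file) and JSON.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "price"])
writer.writeheader()
writer.writerows(parser.rows)
csv_text = buffer.getvalue()
json_text = json.dumps(parser.rows)

print(csv_text)
print(json_text)
```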
Web Scraping Applications
 Price Intelligence
 Market Research
 Alternative data for finance
 Real estate
 News and content monitoring etc.
Web Scraping
 Python is a popular programming language for web scraping.
 Scrapy:
 Open-source web crawling framework.
 Ideal for web scraping as well as extracting data using APIs.
 Beautiful Soup:
 Highly suitable for web scraping.
 It creates a parse tree that can be used to extract data from HTML on a
website.
Web Usage Mining
Focuses on predicting the behavior of users while they are interacting
with the WWW.
Focus:
 How are websites used by users?
 Understanding the behavior of different user segments.
 Predicting future user behavior.
 Extracting meaningful information about an individual user or group.
 What pages are being accessed frequently?
 From what search engine are visitors coming?
 Which browser and operating systems are most commonly used by visitors?
 What are the most recent visits per page?
 Who is visiting which page?
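Several of these questions can be answered by counting fields in a web server access log. The sketch below assumes log lines in the widely used "combined" access-log layout; the three sample lines, the regular expression and the variable names are illustrative assumptions, not data from the slides.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# HYPOTHETICAL log lines in the "combined" access-log layout.
LOG_LINES = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326 "https://www.google.com/" "Mozilla/5.0 (Windows NT 10.0)"',
    '203.0.113.9 - - [10/Oct/2023:13:56:01 +0000] "GET /products.html HTTP/1.1" 200 5120 "https://www.google.com/" "Mozilla/5.0 (X11; Linux x86_64)"',
    '198.51.100.7 - - [10/Oct/2023:13:57:12 +0000] "GET /index.html HTTP/1.1" 200 2326 "https://www.bing.com/" "Mozilla/5.0 (Windows NT 10.0)"',
]

LINE_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

page_hits = Counter()    # which pages are accessed frequently?
referrers = Counter()    # which search engine do visitors come from?
visitors_by_page = {}    # who is visiting which page?

for line in LOG_LINES:
    match = LINE_PATTERN.match(line)
    if match is None:
        continue  # ignore lines that do not follow the assumed layout
    page_hits[match["path"]] += 1
    referrers[urlparse(match["referrer"]).netloc] += 1
    visitors_by_page.setdefault(match["path"], set()).add(match["ip"])

print(page_hits.most_common(1))
print(referrers.most_common(1))
```

Browser and operating system statistics could be derived from the `agent` group in the same way, at the cost of parsing user-agent strings.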
Analyzes user activities on different web pages and tracks them over a
period of time.
Three types of Web Data
 Web content data
 Web structure data
 Web usage data
 The main sources of data here are the web server and the application server.
 It involves log data collected by these sources.
 The data can be categorized into three types based on the
source it comes from:
 Server side
 Client side
 Proxy side
Types of Web Usage mining based on the usage data
 Web server data
 Includes IP addresses, browser logs, proxy server logs, user profiles, etc.
 Application server data
 Consists mainly of business events that are tracked and logged into
application server logs.
 Application level data
 Various new kinds of events can be defined within an application.
Techniques used in Web Usage Mining
 Association Rules
 Classification
 Clustering
 Association Rules
 This technique focuses on relations among the web pages that frequently appear
together in users’ sessions.
 The pages accessed together are always put together into a single server session.
Association rules help in restructuring websites using the access logs.
 Because so many rules can be produced together, some of them may turn out to be
completely inconsequential.
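A small sketch of mining page-pair association rules from sessions follows. The sessions, the 0.5 minimum support and the function name `pair_rules` are hypothetical; the code only considers rules between two single pages, which is a simplification of general association rule mining.

```python
from collections import Counter
from itertools import combinations

# HYPOTHETICAL server sessions: the set of pages requested in each session.
SESSIONS = [
    {"/home", "/products", "/cart"},
    {"/home", "/products"},
    {"/home", "/blog"},
    {"/products", "/cart"},
]


def pair_rules(sessions, min_support=0.5):
    """Return (lhs, rhs, support, confidence) for page pairs that co-occur often."""
    total = len(sessions)
    page_counts = Counter(page for session in sessions for page in session)
    pair_counts = Counter(
        pair for session in sessions for pair in combinations(sorted(session), 2)
    )
    rules = []
    for (first, second), count in pair_counts.items():
        support = count / total  # share of sessions containing both pages
        if support < min_support:
            continue             # drop infrequent, likely inconsequential pairs
        for lhs, rhs in ((first, second), (second, first)):
            confidence = count / page_counts[lhs]  # P(rhs in session | lhs in session)
            rules.append((lhs, rhs, support, confidence))
    return rules


rules = pair_rules(SESSIONS)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

Raising `min_support`, or filtering additionally on confidence, is one way to cut down the number of inconsequential rules.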
Classification
 Develops a profile of the users/customers that belong to a
particular class/category.
 This requires extracting the features that best characterize the
associated class.
 Classification can be implemented with various algorithms, including
Support Vector Machines, K-Nearest Neighbors, Logistic Regression, Decision
Trees, etc.
Clustering
 There are mainly two types of clusters: usage clusters and
page clusters.
 The clustering of pages can be readily performed based on the usage data.
 In usage-based clustering, items that are commonly accessed /purchased
together can be automatically organized into groups.
Virtual Web View
 A view on top of the World Wide Web
 An approach to handling unstructured data
 Uses a Multiple Layered Database (MLDB) built on top of the web

Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 

Web mining

  • 2. What is Web Mining?
      • The Web is a collection of inter-linked files and documents on the Internet.
      • The World Wide Web contains a large amount of data, which makes it a rich source for data mining.
      • Web mining is the application of data mining techniques to automatically discover and extract information from Web documents and services.
      • The objective of Web mining is to find patterns in Web data by collecting and examining that data in order to gain insights.
      • Web mining helps improve web search engines by identifying web pages and classifying web documents.
      • Web mining is very useful to e-commerce websites and e-services.
  • 3. What is Web Mining?
      • Web mining is used to predict user behavior.
      • It is used in web search (e.g., Google, Yahoo) and in vertical search.
      • Web data is distinctive in that it comes in a variety of data types.
      • The Web has multiple aspects that lead to different approaches to mining:
          • Web pages consist of text,
          • Web pages are linked via hyperlinks, and
          • user activity can be monitored via web server logs.
  • 4. Challenges in Web Mining
      • The amount of information on the Web is huge.
      • The coverage of Web information is very wide and diverse.
      • Web information is semi-structured.
      • Web information is linked.
      • Web information is redundant.
      • The Web consists of the Surface Web and the Deep Web.
      • Web information is dynamic.
      • Web information is noisy.
  • 5. Web Mining Taxonomy: Web mining divides into three branches.
      • Web Content Mining: extracting useful knowledge from the contents of Web documents.
      • Web Usage Mining: extracting interesting patterns from user interactions on multiple websites.
      • Web Structure Mining: extracting the hyperlink structure connecting different websites/applications.
  • 6. Web Structure Mining
      • Web structure mining is the process of discovering structure information from the Web.
      • It produces a structured summary of a particular website.
      • It identifies relationships between web pages that are linked by information or by a direct link connection.
      • Web structure mining can be very useful for determining the connection between two commercial websites.
  • 7. Web Structure Mining
      • The web graph consists of web pages as nodes and hyperlinks as edges connecting related pages.
      • Types of mining:
          • Document-level mining (intra-page)
          • Hyperlink-level mining (inter-page)
      • Research at the hyperlink level is known as hyperlink analysis.
      • Web pages can be accessed using a URL, hyperlinks, or navigational tools.
  • 8. Web Structure Mining: Web page ranking algorithms: PageRank (PR)
      • The Google search engine uses the PageRank algorithm to rank websites in its search results.
      • It is a way of measuring the importance of web pages.
      • The PageRank algorithm outputs a probability distribution that represents the likelihood that a person randomly clicking on links will arrive at any particular page.
      • PageRank can be calculated for collections of documents of any size.
      • “It counts the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.”
      • https://www.youtube.com/watch?v=meonLcN7LD4
  • 9. Web Structure Mining: Web page ranking algorithms: PageRank (PR)
  • 10. Web Structure Mining: Web page ranking algorithms: PageRank (PR)
      • Google interprets a link from page A to page B as a vote from page A for page B.
      • All incoming links can be interpreted as votes.
  • 11. Web Structure Mining: Web page ranking algorithms: PageRank (PR)
      • PageRank also takes into consideration the “importance” of the page that is casting the vote.
      • If the page casting a vote is more important, its links are worth more and help the pages it links to rank higher.
      • A page’s importance is equal to the sum of the votes of its incoming links.
      • A web page is important if it is pointed to by other important web pages.
      • Types of links:
          • Inbound links (in-links)
          • Outbound links (out-links)
  • 12. Web Structure Mining: Web page ranking algorithms: PageRank (PR)
      • The PageRank of a page $P_i$, denoted $PR(P_i)$, is the sum of the PageRanks of all pages pointing to $P_i$, each divided by that page's number of out-links:
        $$PR(P_i) = \sum_{P_j \to P_i} \frac{PR(P_j)}{|P_j|}$$
      • $P_j \to P_i$ ranges over the set of pages pointing to $P_i$.
      • $|P_j|$ is the number of out-links from $P_j$.
      • The $PR$ values are initially unknown, so all pages are given an equal PageRank of $\frac{1}{n}$ initially.
      • $n$ is the number of pages in Google's index of the Web.
  • 13–16. Web Structure Mining: Web page ranking algorithms: PageRank (PR). Worked example on a six-page graph (pages 1 to 6; the graph figure is not reproduced here), built up over four slides.

      Page   Iteration 1   Iteration 2   Iteration 3   Decimal
      1      1/6           1/18          1/36          0.0277
      2      1/6           5/36          1/18          0.0555
      3      1/6           1/12          1/36          0.0277
      4      1/6           1/4           17/72         0.2361
      5      1/6           5/36          11/72         0.1527
      6      1/6           1/6           14/72         0.1944

      Rank after iteration 3, highest first: page 4, page 6, page 5, page 2, then pages 1 and 3 tied in last place.
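The iteration values in the six-page PageRank example can be reproduced with a few lines of Python. This is only a sketch: the slide's graph figure is not available, so the edge list `OUT_LINKS` below is an assumption, chosen because it yields exactly the fractions listed in the example. The names `OUT_LINKS` and `pagerank_step` are illustrative.

```python
from fractions import Fraction

# ASSUMED edge list (the graph figure is missing from the transcript).
# It was picked because it reproduces the iteration values in the example.
OUT_LINKS = {
    1: [2, 3],
    2: [],          # page 2 has no out-links in this assumed graph
    3: [1, 2, 5],
    4: [5, 6],
    5: [4, 6],
    6: [4],
}


def pagerank_step(ranks, out_links):
    """One synchronous update: PR(Pi) = sum over Pj -> Pi of PR(Pj) / |Pj|."""
    new_ranks = {page: Fraction(0) for page in out_links}
    for source, targets in out_links.items():
        for target in targets:
            new_ranks[target] += ranks[source] / len(targets)
    return new_ranks


n = len(OUT_LINKS)
iteration_1 = {page: Fraction(1, n) for page in OUT_LINKS}  # every page starts at 1/n
iteration_2 = pagerank_step(iteration_1, OUT_LINKS)
iteration_3 = pagerank_step(iteration_2, OUT_LINKS)

for page in sorted(OUT_LINKS):
    print(page, iteration_1[page], iteration_2[page], iteration_3[page])
```

`Fraction` prints values in lowest terms, so 14/72 appears as 7/36. Under the assumed graph the values stop summing to 1 after the first update, because page 2 receives rank but passes none on.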
  • 17. Web Structure Mining: Web page ranking algorithms: PageRank (PR)
      • Simple PageRank algorithm:
        $$PR(A) = (1 - d) + d\left(\frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)}\right)$$
      • where
          • $PR(A)$ is the PageRank of page $A$,
          • $PR(T_i)$ is the PageRank of a page $T_i$ that links to page $A$,
          • $C(T_i)$ is the number of outbound links on page $T_i$, and
          • $d$ is the damping factor, which can be set between 0 and 1 (usually 0.85). It is the probability that a random user continues to click on links as they browse the web.
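  • 18. Web Structure Mining: Web page ranking algorithms: PageRank (PR)
      • Simple PageRank algorithm, example with three pages A, B and C and $d = 0.85$ (figure not reproduced; as the equations imply, A links to B and C, B links to C, C links to A, and every page starts with a PageRank of 1):
        $$PR(A) = (1 - d) + d\,\frac{PR(C)}{C(C)} = 1 - 0.85 + 0.85\left(\frac{1}{1}\right) = 0.15 + 0.85 = 1$$
        $$PR(B) = (1 - d) + d\,\frac{PR(A)}{C(A)} = 1 - 0.85 + 0.85\left(\frac{1}{2}\right) = 0.15 + 0.425 = 0.575$$
        $$PR(C) = (1 - d) + d\left(\frac{PR(A)}{C(A)} + \frac{PR(B)}{C(B)}\right) = 0.15 + 0.85\,(0.5 + 0.575) = 0.15 + 0.91375 \approx 1.06$$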
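The three-page calculation can be checked with a short Python sketch. It assumes the link structure implied by the slide's equations (A links to B and C, B links to C, C links to A) and a starting PageRank of 1 for every page. Note that the slide reuses the freshly computed PR(B) = 0.575 when computing PR(C), so the sketch updates pages in place and in order; `in_place_pass` is an illustrative name.

```python
D = 0.85  # damping factor used on the slide

# Link structure implied by the slide's equations (the figure is not available).
OUT_LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}


def in_place_pass(ranks, out_links, d=D):
    """One pass of PR(X) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to X.

    Pages are updated in dictionary order and each update is visible to the
    pages processed after it, mirroring the slide's arithmetic.
    """
    for page in ranks:
        incoming = sum(
            ranks[source] / len(targets)
            for source, targets in out_links.items()
            if page in targets
        )
        ranks[page] = (1 - d) + d * incoming
    return ranks


ranks = in_place_pass({"A": 1.0, "B": 1.0, "C": 1.0}, OUT_LINKS)
print({page: round(value, 5) for page, value in ranks.items()})
```

Repeating `in_place_pass` until the values stop changing would give the converged ranks; the slide shows only the first pass.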
  • 19. Web Structure Mining: Hyperlink Induced Topic Search (HITS) Algorithm
      • HITS is a link analysis algorithm that rates web pages, developed by Jon Kleinberg.
      • It is used to discover and rank the web pages relevant to a particular search.
      • HITS uses hubs and authorities to define a recursive relationship between web pages.
      • Given a query to a search engine, the set of highly relevant web pages is called the Root set. These pages are potential authorities.
      • Pages that are not very relevant themselves but point to pages in the Root set are called hubs.
      • An authority is a page that many hubs link to, whereas a hub is a page that links to many authorities.
  • 20. Web Structure Mining: Hyperlink Induced Topic Search (HITS) Algorithm
  • 21. Web Structure Mining: Hyperlink Induced Topic Search (HITS) Algorithm
      • Hub and authority scores:
          • Compute for each page $d$ in the base set a hub score $h(d)$ and an authority score $a(d)$.
          • Initialize for all $d$: $h(d) = 1$; $a(d) = 1$.
          • Update $h(d)$ and $a(d)$ in each iteration.
          • Output the pages with the highest $h$ scores as top hubs.
          • Output the pages with the highest $a$ scores as top authorities.
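  • 22. Web Structure Mining: Hyperlink Induced Topic Search (HITS) Algorithm
        $$a = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_n \end{bmatrix}, \qquad h = \begin{bmatrix} h_1 \\ h_2 \\ h_3 \\ \vdots \\ h_n \end{bmatrix}$$
        $$a = A^T \cdot h, \qquad h = A \cdot a$$
        $$a_0 = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}, \qquad h_0 = A^T \cdot \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}$$
        $$a_k = A^T \cdot A \cdot a_{k-1}, \qquad h_k = A \cdot A^T \cdot h_{k-1}$$
      • $a$ = authority matrix (column vector of authority scores), $h$ = hub matrix (column vector of hub scores), $A$ = adjacency matrix of the web graph.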
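  • 23. Web Structure Mining: Hyperlink Induced Topic Search (HITS) Algorithm
      • Example on a four-node graph (nodes 1 to 4; figure not reproduced):
        $$A = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}, \qquad A^T = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 1 \\ 0 & 1 & 0 & 0 \end{bmatrix}, \qquad h = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}$$
        $$a = A^T \cdot h = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 1 \\ 0 & 1 & 0 & 0 \end{bmatrix} \cdot \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 3 \\ 1 \end{bmatrix}$$
        $$h = A \cdot a = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \cdot \begin{bmatrix} 0 \\ 0 \\ 3 \\ 1 \end{bmatrix} = \begin{bmatrix} 3 \\ 4 \\ 0 \\ 3 \end{bmatrix}$$
      • Top hub = Node 2; top authority = Node 3.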
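One HITS update on the four-node example can be sketched in plain Python. The adjacency matrix used here is the transpose of the $A^T$ printed on the slide, which is the matrix consistent with the slide's results $a = (0, 0, 3, 1)$ and $h = (3, 4, 0, 3)$. The helper names `transpose` and `mat_vec` are illustrative, and the sketch omits the score normalization that a full HITS implementation applies in each iteration.

```python
# A[i][j] = 1 if node i+1 links to node j+1 (edges 1->3, 2->3, 2->4, 4->3).
# Taken as the transpose of the A^T shown on the slide.
A = [
    [0, 0, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
]


def transpose(matrix):
    """Return the transpose of a list-of-lists matrix."""
    return [list(row) for row in zip(*matrix)]


def mat_vec(matrix, vector):
    """Multiply a matrix by a column vector."""
    return [sum(x * y for x, y in zip(row, vector)) for row in matrix]


h0 = [1, 1, 1, 1]                 # initial hub scores
a = mat_vec(transpose(A), h0)     # authority update: a = A^T . h
h = mat_vec(A, a)                 # hub update:       h = A . a

top_authority = a.index(max(a)) + 1   # nodes are numbered from 1
top_hub = h.index(max(h)) + 1
print("a =", a, "h =", h, "top hub:", top_hub, "top authority:", top_authority)
```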
  • 24. Web Structure Mining: Hyperlink Induced Topic Search (HITS) Algorithm (graph figures with nodes 1 to 5 and 1 to 4; not reproduced)
  • 25. Web Structure Mining: PageRank vs HITS
      • PageRank:
          • Ranks all the nodes of the complete graph first, and then applies the search.
          • Based on the “random surfer” idea.
          • Example: Google
      • HITS:
          • Applied at the time a search is run against the complete graph.
          • Operates on a subgraph selected by the query.
          • Defines hubs and authorities recursively.
          • Examples: Ask, Yahoo, Clever
  • 26. Web Content Mining
      • Web content mining is the process of extracting useful information from the content of web documents.
      • Web content consists of several types of data: text, image, audio, video, etc.
      • Content data is the collection of facts that a web page is designed to convey.
      • It can reveal effective and interesting patterns about user needs.
      • Mining text documents draws on text mining, machine learning and natural language processing.
      • Web content mining is also known as text mining.
      • This type of mining scans and mines the text, images and groups of web pages according to the content of the input.
  • 27. Web Content Mining
      • With web content mining, one can
          • cluster web documents,
          • classify web pages, and
          • extract patterns.
      • Research in this field draws on techniques from other disciplines such as Information Retrieval (IR) and Natural Language Processing (NLP).
      • Applications:
          • Document clustering / categorization
          • Topic identification / tracking
          • Concept discovery
  • 28. Web Content Mining: Web Scraping
      • Web scraping is an automatic method of obtaining large amounts of data from websites.
      • Unstructured data in HTML format is converted into structured data in a spreadsheet or a database so that it can be used in various applications.
      • There are several ways to perform web scraping:
          • using online services,
          • using particular APIs, or
          • writing your own scraping code from scratch.
      • Websites like Google, Twitter, Facebook, StackOverflow, etc. have APIs that allow you to access their data in a structured format.
      • For websites that do not let users access large amounts of data in a structured form, web scraping is used to extract the data from the site.
  • 29. Web Content Mining: Web Scraping
      • Web scraping requires two parts: the crawler and the scraper.
      • The crawler is an artificial intelligence algorithm that browses the web to find the required data by following links across the internet.
      • The scraper is a specific tool created to extract the data from the website.
  • 30. Web Content Mining: Web Crawler
      • A web crawler, also known as a web spider, is a bot that searches and indexes content on the internet.
      • Web crawlers are responsible for understanding the content of a web page so that it can be retrieved when a query is made.
      • Web crawlers are operated by search engines, each with its own algorithms.
      • A web spider searches (crawls) and categorizes all the web pages it can find on the internet and indexes them.
      • The best-known web crawler is Googlebot.
  • 31. Web Content Mining: How does a Web Crawler work?
      • It finds information by crawling.
          • Crawlers look at web pages and follow the links on those pages, much as you would if you were browsing content on the web.
          • They go from link to link and bring data about those web pages back to the servers.
      • It organizes information by indexing.
          • When crawlers find a web page, the system renders the content of the page, just as a browser does.
          • It takes note of key signals, from keywords to website freshness, and keeps track of them all in the search index.
      • https://www.youtube.com/watch?v=jkuRWcH1-kk
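The crawl-then-index loop described above can be sketched with Python's standard library. To keep the example self-contained it does not touch the network: `PAGES` is a hypothetical in-memory site standing in for HTTP fetches, and the "index" is simply a map from each visited URL to the links found on it. All names here are illustrative.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web" (URL -> HTML); a real crawler would fetch over HTTP.
PAGES = {
    "/home": '<a href="/about">About</a> <a href="/news">News</a>',
    "/about": '<a href="/home">Home</a>',
    "/news": '<a href="/home">Home</a> <a href="/news/1">Story</a>',
    "/news/1": "<p>No links here.</p>",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start, pages):
    """Breadth-first crawl from `start`; returns {url: links found on that page}."""
    index = {}
    queue = deque([start])
    seen = {start}
    while queue:
        url = queue.popleft()
        parser = LinkExtractor()
        parser.feed(pages[url])
        index[url] = parser.links
        for link in parser.links:
            if link in pages and link not in seen:  # skip unknown or visited URLs
                seen.add(link)
                queue.append(link)
    return index


index = crawl("/home", PAGES)
for url, links in index.items():
    print(url, "->", links)
```

The `seen` set is what keeps the crawler from looping forever on pages that link back to each other.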
  • 32. Web Content Mining: Applications of Web Crawlers
      • Crawlers are the basis of how search engines work.
      • Web crawlers are also used for other purposes:
          • Price comparison portals search the Web for information on specific products so that prices or data can be compared accurately.
          • In data mining, a crawler may collect publicly available e-mail or postal addresses of companies.
          • Web analysis tools use crawlers or spiders to collect data on page views and on incoming or outbound links.
          • Crawlers supply information hubs, such as news sites, with data.
  • 33. Web Content Mining: Web Scraping
      • First, the web scraper is given the URLs of the required sites.
      • Then it loads all the HTML code for those sites.
      • Then the scraper extracts the required data from this HTML code and outputs it in the format specified by the user.
      • The output is usually an Excel spreadsheet or a CSV file, but the data can also be saved in other formats such as a JSON file.
      • https://www.youtube.com/watch?v=US-HP1hiRHY
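The scraping steps above (load HTML, extract the required data, write it out in a structured format) can be illustrated with the standard library alone. In this sketch the HTML is a hypothetical hard-coded snippet rather than a downloaded page, and the column names `product` and `price` are made up for the example.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical HTML standing in for a page the scraper has already downloaded.
HTML = """
<table>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>24.50</td></tr>
</table>
"""


class RowExtractor(HTMLParser):
    """Turns each <tr> into a list holding the text of its <td> cells."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())


parser = RowExtractor()
parser.feed(HTML)

# Write the structured rows out as CSV (to a string here; a file works the same way).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["product", "price"])
writer.writerows(parser.rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Swapping `csv.writer` for `json.dumps(parser.rows)` would produce the JSON variant mentioned above.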
  • 34. Web Content Mining: Web Scraping Applications
      • Price intelligence
      • Market research
      • Alternative data for finance
      • Real estate
      • News and content monitoring, etc.
  • 35. Web Content Mining: Web Scraping
      • Python is a popular programming language for web scraping.
      • Scrapy:
          • Open-source web crawling framework.
          • Well suited to web scraping as well as to extracting data using APIs.
      • Beautiful Soup:
          • Highly suitable for web scraping.
          • It creates a parse tree that can be used to extract data from the HTML of a website.
  • 36. Web Usage Mining
      • Focuses on predicting the behavior of users while they interact with the WWW.
      • Focus:
          • How are websites used by users?
          • Understanding the behavior of different user segments.
          • Predicting future user behavior.
          • Extracting information that is meaningful for an individual user or a group.
          • Which pages are accessed frequently?
          • Which search engines do visitors come from?
          • Which browsers and operating systems are most commonly used by visitors?
          • What are the most recent visits per page?
          • Who is visiting which page?
  • 37. Web Usage Mining
      • Analyzes user activity on different web pages and tracks it over a period of time.
      • Three types of Web data:
          • Web content data
          • Web structure data
          • Web usage data
      • The main sources of data here are the web server and the application server.
      • It involves log data collected by these sources.
      • The data can be categorized into three types based on where it comes from:
          • Server side
          • Client side
          • Proxy side
  • 38. Web Usage Mining: Types of Web usage mining based on the usage data
      • Web server data
          • Includes IP addresses, browser logs, proxy server logs, user profiles, etc.
      • Application server data
          • Consists mainly of various business events that are tracked and logged in application server logs.
      • Application level data
          • Various new kinds of events can be defined within an application.
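Questions such as "which pages are accessed frequently?" and "who is visiting which page?" can be answered directly from web server logs. The sketch below assumes log lines in the widely used Common Log Format; the sample lines, IP addresses and paths are invented for illustration, and `LINE_RE` is a simplified pattern rather than a complete parser.

```python
import re
from collections import Counter

# Invented sample lines in Common Log Format (documentation-range IP addresses).
LOG_LINES = [
    '192.0.2.1 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.2 - - [10/Oct/2024:13:56:01 +0000] "GET /products.html HTTP/1.1" 200 512',
    '192.0.2.1 - - [10/Oct/2024:13:57:12 +0000] "GET /products.html HTTP/1.1" 200 512',
]

# Groups: client IP, timestamp, method, path, status code, response size.
LINE_RE = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+)$'
)

page_hits = Counter()     # path -> number of requests
pages_by_visitor = {}     # client IP -> list of requested paths, in order

for line in LOG_LINES:
    match = LINE_RE.match(line)
    if match is None:
        continue  # skip lines that do not follow the expected format
    ip, _timestamp, _method, path, _status, _size = match.groups()
    page_hits[path] += 1
    pages_by_visitor.setdefault(ip, []).append(path)

print(page_hits.most_common())
print(pages_by_visitor)
```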
  • 39. Web Usage Mining: Techniques used in Web Usage Mining
      • Association rules
      • Classification
      • Clustering
  • 40. Web Usage Mining: Techniques used in Web Usage Mining: Association Rules
      • This technique focuses on relations among web pages that frequently appear together in users’ sessions.
      • Pages accessed together are always grouped into a single server session.
      • Association rules help in restructuring websites using the access logs.
      • Because many rules are produced at once, some of them may be completely inconsequential.
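The idea of pages that "frequently appear together in users' sessions" is usually quantified with support and confidence. The sketch below computes both for page pairs over a handful of invented sessions; the session data and the helper names `support` and `confidence` are illustrative, and a real system would also filter rules by minimum thresholds to discard the inconsequential ones.

```python
from collections import Counter
from itertools import combinations

# Invented user sessions: each is the set of pages visited in one session.
SESSIONS = [
    {"/home", "/products", "/cart"},
    {"/home", "/products"},
    {"/home", "/blog"},
    {"/products", "/cart"},
]

page_counts = Counter()   # page -> number of sessions containing it
pair_counts = Counter()   # (page_a, page_b) sorted -> sessions containing both

for session in SESSIONS:
    page_counts.update(session)
    pair_counts.update(combinations(sorted(session), 2))


def support(a, b):
    """Fraction of all sessions that contain both pages."""
    return pair_counts[tuple(sorted((a, b)))] / len(SESSIONS)


def confidence(a, b):
    """Confidence of the rule a -> b: of the sessions with a, the share that also have b."""
    return pair_counts[tuple(sorted((a, b)))] / page_counts[a]


print("support(/products, /cart)    =", support("/products", "/cart"))
print("confidence(/cart -> /products) =", confidence("/cart", "/products"))
print("confidence(/products -> /cart) =", confidence("/products", "/cart"))
```

Confidence is directional: every session with `/cart` also has `/products`, but not the other way round.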
  • 41. Web Usage Mining: Techniques used in Web Usage Mining: Classification
      • Classification builds profiles of users/customers that belong to a particular class or category.
      • It requires extracting the features best suited to the associated class.
      • Classification can be implemented with various algorithms, including Support Vector Machines, K-Nearest Neighbors, Logistic Regression and Decision Trees.
  • 42. Web Usage Mining: Techniques used in Web Usage Mining: Clustering
      • There are two main types of clusters: usage clusters and page clusters.
      • Pages can be readily clustered based on the usage data.
      • In usage-based clustering, items that are commonly accessed or purchased together can be automatically organized into groups.
  • 43. Virtual Web View
      • A view on top of the World Wide Web.
      • An approach to handling unstructured data.
      • Uses a Multiple Layered Database (MLDB) built on top of the Web.