Chapter-6
Web mining
Data mining in web applications
What is Web mining?
Web mining is the use of data mining techniques to automatically discover and
extract information from Web documents and services.
What is web mining?
• Mining of data related to the WWW
• Data present in Web pages or data related to web activity
• Web data is classified as
• Content of web pages
• Intra-page structure, which includes the code and the actual linkage
• Usage data: how the pages are used by visitors
• User profiles
Taxonomy of Web Mining
Web Content Mining
• Extension of basic search engines
• Search engines are keyword-based
• Traditional search engines use
• crawlers to search the Web and gather information
• indexing techniques to store the information
• query processing to provide fast and accurate information to users
Taxonomy of Web content mining
WEB CONTENT MINING
• Agent-based approach: uses software systems to perform the content mining, e.g. search engines
• Database approach: views Web data as belonging to a database; the Web is a multilevel database and query languages are used for querying the data
Content mining is a type of text mining
Text mining hierarchy
From simple to complex:
• Keyword
• Term association
• Similarity search
• Classification and clustering
• Natural language processing
Crawlers
How do crawlers work?
• A robot, spider, or crawler is a program that traverses the hypertext structure of
the Web
• The page the crawler starts from is referred to as the seed URL
• All links from that page are recorded and saved in a queue
• The new pages are in turn searched and their links are saved
• The crawler collects information about each page, extracts keywords, and stores
indices for users
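The steps above can be sketched as a breadth-first traversal. This is only a minimal illustration: the `LINKS` dictionary is a made-up stand-in for the Web, and `fetch_links` is a hypothetical callback that a real crawler would replace with page downloading and link extraction.

```python
from collections import deque

# Hypothetical link graph standing in for the Web: page -> pages it links to.
LINKS = {
    "seed": ["a", "b"],
    "a": ["b", "c"],
    "b": ["seed"],
    "c": [],
}

def crawl(seed, fetch_links, max_pages=100):
    """Visit pages breadth-first, starting from the seed URL."""
    queue = deque([seed])   # links that were recorded but not yet visited
    seen = {seed}           # prevents queueing the same page twice
    order = []              # pages in the order they were visited
    while queue and len(order) < max_pages:
        page = queue.popleft()
        order.append(page)
        for link in fetch_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

visited = crawl("seed", lambda page: LINKS.get(page, []))
print(visited)
```

The `seen` set matters because the Web contains cycles (here, "b" links back to "seed"); without it the queue would never empty.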
Crawling the web
A Web crawler is an Internet bot which systematically browses the World Wide
Web, typically for the purpose of Web indexing (web spidering). Web search
engines and some other sites use Web crawling or spidering software to
update their web content or indices of other sites' web content.
Including a robots.txt file can request bots to index only parts of a website, or
nothing at all.
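As a small illustration of how a well-behaved crawler might honour robots.txt, the sketch below uses Python's standard `urllib.robotparser` module. The robots.txt content, the bot name "ExampleBot", and the example.com URLs are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt: every bot may crawl everything except /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())  # parse rules without a network request

allowed = parser.can_fetch("ExampleBot", "https://example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "https://example.com/private/data.html")
print(allowed, blocked)
```

Note that robots.txt is a request, not an enforcement mechanism: the crawler itself has to perform this check before fetching a page.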
Crawling
When Google visits your website to discover and read its pages, it does
so with the help of a program known as a web crawler, spider, Googlebot, internet
bot, or automatic indexer.
The process of crawling: Google uses a huge set of computers to fetch (or "crawl")
millions of pages on the web. Googlebot discovers new and updated pages,
with the help of sitemaps, to be added to the Google index. The crawler obtains
information and adds it to the Google index.
Indexing
Once the web crawler has finished crawling, the results are
stored in the Google index. The Google index is similar to the
index of a library, which lists information about all the books
in the library. If you want more pages included
in the Google index, you can create and submit a
Sitemap through Webmaster Tools.
The index is basically a big list of words and the web pages
that feature them; the location of each term within a
particular web page is stored as well.
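The "big list of words" described here is commonly called an inverted index. The following is a minimal sketch, assuming whitespace tokenization and two made-up pages; it is not a description of how Google's index is actually built.

```python
from collections import defaultdict

def build_index(pages):
    """Map each word to the pages containing it and the word positions there."""
    index = defaultdict(lambda: defaultdict(list))
    for url, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            index[word][url].append(position)
    return index

# Hypothetical page contents.
pages = {
    "page1": "web mining uses data mining",
    "page2": "web crawlers index pages",
}
index = build_index(pages)
print(dict(index["mining"]))
print(sorted(index["web"]))
```

Answering a keyword query then becomes a dictionary lookup instead of a scan over every page.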
Ranking
“Ranking – Determining what each page is about, and
how it should rank for relevant queries”
Search engines have two major functions: crawling and
building an index, and providing search users with a
ranked list of the websites they've determined as the most
relevant.
Types of crawlers
• Periodic crawlers: activated periodically; every time it is activated it replaces
the existing index
• Incremental crawler: updates the index incrementally instead of replacing it
• Focused crawler: visits pages related to topics of interest
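A focused crawler can be sketched as a crawler that only keeps, and only follows links from, pages judged relevant to the topic. In this illustration the `PAGES` data is made up, and "relevance" is simply an assumed substring test; real focused crawlers typically use a trained classifier instead.

```python
from collections import deque

# Hypothetical pages: text content plus outgoing links.
PAGES = {
    "seed": {"text": "data mining overview", "links": ["sports", "mining2"]},
    "sports": {"text": "football scores", "links": ["more_sports"]},
    "mining2": {"text": "web mining and text mining", "links": []},
    "more_sports": {"text": "tennis results", "links": []},
}

def focused_crawl(seed, topic):
    """Return the on-topic pages reachable through on-topic pages."""
    queue, seen, relevant = deque([seed]), {seed}, []
    while queue:
        url = queue.popleft()
        page = PAGES[url]
        if topic not in page["text"]:
            continue  # off-topic: do not keep it or follow its links
        relevant.append(url)
        for link in page["links"]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return relevant

print(focused_crawl("seed", "mining"))
```

Because the off-topic "sports" page is pruned, the page it links to is never fetched at all, which is where the savings of focused crawling come from.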
Focused crawling
[Figure: focused crawling from a seed URL across SITE-1 to SITE-6]
Web Harvesting
• Web harvesting, also known as Web scraping, is an increasingly popular
method used by websites to channel customers' searches to their website
• Web harvesting software automatically extracts information from the Web
and picks up where search engines leave off, doing the work the search
engine can't.
• FMiner
Virtual Web View
• Large amounts of unstructured data can be handled using a multiple layered
database (MLDB) built on top of the web data
• Every layer of this database is more generalized than the preceding layer
• The upper layers are structured and can be accessed using SQL
• A view of the MLDB is called a Virtual Web View (VWV)
WebML
• Query language which supports data mining operations on MLDB
• Four primitive operations in WebML are
• COVERS
• COVERED BY
• LIKE
• CLOSE TO
SELECT *
FROM document in "www.engr.smu.edu"
WHERE ONE OF keywords COVERS "cat"
Personalization
• Contents of a web page are modified to fit the desires of the user
• Advertisements are sent to a potential customer based on specific knowledge about that customer
• Personalization is performed on the target web page
• Targeting is different from personalization
• In targeting, businesses display advertisements at other sites visited by their users
• In personalization, when a person visits a Web site, the advertising can be designed
specifically for that person
Personalization Contd….
• Personalization is a combination of clustering, classification and prediction
• Types of personalization are
• Manual techniques – user registration details
• Collaborative filtering
• Content-based filtering
• Eg. MyYahoo
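Collaborative filtering can be illustrated with a very small user-based sketch: find the user whose visited pages overlap most with the target user's, then suggest what that neighbour visited and the target did not. The visit data is made up, and the choice of Jaccard similarity is an assumption for the example, not something the slides prescribe.

```python
def jaccard(a, b):
    """Overlap between two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(target, visits):
    """Suggest pages seen by the most similar other user but not by target."""
    others = [
        (jaccard(visits[target], pages), user)
        for user, pages in visits.items()
        if user != target
    ]
    _, nearest = max(others)
    return sorted(visits[nearest] - visits[target])

# Hypothetical browsing histories.
visits = {
    "alice": {"news", "sports", "weather"},
    "bob": {"news", "sports", "finance"},
    "carol": {"recipes", "travel"},
}
print(recommend("alice", visits))
```

Content-based filtering would instead compare the content of pages with what the user has liked before, without looking at other users.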
Web Usage Mining
It deals with understanding user behavior in interacting
with the web or with a website.
Aim
To obtain information that may help reorganize or adapt a web site
to better suit its users.
To understand users' behaviour:
• Clicking patterns
• Browsing time
• Transactions
Web Usage Mining Applications
• Personalization
• Improve structure of a site’s Web pages
• Aid in caching and prediction of future page references
• Improve design of individual pages
• Improve effectiveness of e-commerce (sales and advertising)
Web Usage Mining Activities
• Preprocessing Web log
• Cleanse
• Remove extraneous information
• Sessionize
Session: Sequence of pages referenced by one user at a sitting.
• Pattern Discovery
• Count patterns that occur in sessions
• A pattern is a sequence of page references in a session.
• Similar to association rules
• Transaction: session
• Itemset: pattern (or subset)
• Order is important
• Pattern Analysis
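The sessionizing and pattern-counting steps above can be sketched as follows. The log format `(user, time_in_seconds, page)`, the 30-minute timeout, and the choice to count only consecutive page pairs are all assumptions made for illustration.

```python
from collections import Counter

def sessionize(log, timeout=1800):
    """Split each user's requests into sessions at gaps longer than timeout."""
    sessions = []
    last_seen = {}  # user -> (time of last request, index of current session)
    for user, time, page in sorted(log, key=lambda r: (r[0], r[1])):
        if user in last_seen and time - last_seen[user][0] <= timeout:
            idx = last_seen[user][1]
            sessions[idx].append(page)
        else:
            sessions.append([page])
            idx = len(sessions) - 1
        last_seen[user] = (time, idx)
    return sessions

def count_pairs(sessions):
    """Count ordered pairs of consecutive pages across all sessions."""
    counts = Counter()
    for pages in sessions:
        counts.update(zip(pages, pages[1:]))
    return counts

# Hypothetical cleansed log entries.
log = [
    ("u1", 0, "home"), ("u1", 60, "products"), ("u1", 120, "cart"),
    ("u2", 30, "home"), ("u2", 90, "products"),
    ("u1", 7200, "home"),
]
sessions = sessionize(log)
pairs = count_pairs(sessions)
print(sessions)
print(pairs.most_common(1))
```

Unlike ordinary association rules, the pairs are ordered: ("home", "products") and ("products", "home") are counted as different patterns.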
Web Usage Mining Issues
• Exact identification of the user is not possible.
• The exact sequence of pages referenced by a user cannot be determined because of caching.
• Sessions are not well defined
• Security, privacy, and legal issues
Web Log Cleansing
• Replace source IP address with unique but non-identifying ID.
• Replace exact URL of pages referenced with unique but non-identifying ID.
• Delete error records and records that contain no page data (such as figures
and code)
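The three cleansing rules can be sketched in a few lines. The record layout `(ip, url, status)`, the list of file extensions treated as non-page data, and the `U1`/`P1` ID scheme are assumptions for the example.

```python
def cleanse(records):
    """Anonymize IPs and URLs and drop error and non-page records."""
    ip_ids, url_ids, cleaned = {}, {}, []
    skip_ext = (".png", ".jpg", ".gif", ".css", ".js")  # figures and code
    for ip, url, status in records:
        if status >= 400 or url.lower().endswith(skip_ext):
            continue  # error record or non-page data
        user_id = ip_ids.setdefault(ip, f"U{len(ip_ids) + 1}")
        page_id = url_ids.setdefault(url, f"P{len(url_ids) + 1}")
        cleaned.append((user_id, page_id))
    return cleaned

# Hypothetical raw log records.
records = [
    ("10.0.0.1", "/index.html", 200),
    ("10.0.0.1", "/logo.png", 200),
    ("10.0.0.2", "/index.html", 200),
    ("10.0.0.2", "/missing.html", 404),
    ("10.0.0.1", "/about.html", 200),
]
cleaned = cleanse(records)
print(cleaned)
```

The same IP always maps to the same ID, so sessions can still be reconstructed even though the address itself is no longer stored.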
Web Structure Mining
• Creating a model of the web organization
• Used to classify Web pages or to create similarity measures between
documents
• Mine the structure (links, graph) of the Web
• Techniques
• PageRank
• HITS
Page Rank
• Designed to increase the effectiveness of search engines and improve their
efficiency
• Used to
• Measure the importance of a page
• Prioritize the pages returned from a traditional search engine using keyword searching
• The PageRank of a page is calculated based on the number of pages that point to it
Page Rank
• Search engine that uses link structure to calculate a quality ranking (PageRank) for each page
• Intuition: PageRank can be seen as the probability that a "random surfer" visits a page
• A page is important if important pages link to it
PageRank
Page Rank: A page is important if many important pages link to it.
(PageRank) + (Website Content) = Overall Rank in Results
• Link i → j: i considers j important.
• The more important i is, the more important j becomes.
• If i has many out-links, each of its links is less important.
Let OutDegree(i) = number of out-links of page i.
Adjust PageRank(j):
$$\mathrm{PageRank}(j) = (1 - d) + d \sum_{i \to j} \frac{\mathrm{PageRank}(i)}{\mathrm{OutDegree}(i)}$$
The sum is the weighted sum of the importance of the pages
referring to page j.
d is the damping factor.
In the formula above, d weights the link term, so d is the probability that the
random surfer follows a link on the current page, and (1 - d) is the probability
that the surfer gets bored and starts on a new random page.
Repeat until the PageRank vector converges…
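The "repeat until convergence" step can be sketched as a simple iteration of the update PageRank(j) = (1 - d) + d × Σ PageRank(i) / OutDegree(i) over the pages i that link to j. The three-page graph, the value d = 0.85, the starting value of 1.0 per page, and the tolerance are assumptions chosen for illustration.

```python
def pagerank(links, d=0.85, tol=1e-9, max_iter=1000):
    """Iterate the PageRank update until the scores stop changing."""
    pages = list(links)
    rank = {p: 1.0 for p in pages}
    out_degree = {p: len(links[p]) for p in pages}
    for _ in range(max_iter):
        new_rank = {}
        for j in pages:
            # Each page i linking to j passes on an equal share of its rank.
            incoming = sum(rank[i] / out_degree[i] for i in pages if j in links[i])
            new_rank[j] = (1 - d) + d * incoming
        converged = max(abs(new_rank[p] - rank[p]) for p in pages) < tol
        rank = new_rank
        if converged:
            break
    return rank

# Hypothetical graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
print(ranks)
```

In this graph C receives links from both A and B while B receives only half of A's vote, so C ends up ranked above B.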
Hyperlink-Induced Topic Search (HITS)
• Finds hubs and authoritative pages
• HITS has two components
• Based on a given set of keywords relevant pages are found
• Hubs and authority measures are associated with these pages. Pages with highest
values are returned
Authorities and hubs
• The algorithm produces two types of pages:
- Authority: pages that provide important, trustworthy information on a
given topic (highly-referenced pages on a topic)
- Hub: pages that "point" to authorities
• A better hub points to many good authorities. A better authority is pointed
to by many good hubs
Definitions
• Authority: pages that provide important, trustworthy information on a
given topic
• Hubs: pages that contain links to authorities
• Indegree: number of incoming links to a given node, used to measure the
authoritativeness
• Outdegree: number of outgoing links from a given node, here it is used to
measure the hubness
HITS Algorithm
• Hubs point to lots of authorities.
• Authorities are pointed to by lots of hubs.
• Together they form a bipartite graph:
[Figure: bipartite graph with hubs on one side and authorities on the other]
HITS
• Hubs: pages that link to a collection of authoritative pages on a broad topic
• Authorities: relevant pages of the highest quality on a broad topic
HITS
Steps for discovering hubs and authorities on a specific topic:
• Collect a seed set of pages S (returned by a search engine)
• Expand the seed set to contain pages that point to or are pointed
to by pages in the seed set (links inside a site are removed)
• Iteratively update the hub weight h(p) and authority weight a(p)
for each page:
$$a(p) = \sum_{q \to p} h(q), \qquad h(p) = \sum_{p \to q} a(q)$$
• After a fixed number of iterations, the pages with the highest
hub/authority weights form the core of the community
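The iterative update can be sketched as below. The small link graph is made up, and the normalization after each round is an assumption added to keep the weights from growing without bound; the slides do not spell out that step.

```python
import math

def hits(links, iterations=20):
    """Return (hub, authority) weights after a fixed number of iterations."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # a(p): sum of hub weights of the pages q that link to p.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages}
        # h(p): sum of authority weights of the pages q that p links to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical expanded set: h1 and h2 are link pages, a1 and a2 are content pages.
links = {"h1": ["a1", "a2"], "h2": ["a1"]}
hub, auth = hits(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

Here a1 is pointed to by both link pages and h1 points to both content pages, so they come out as the top authority and top hub respectively.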
Strengths and weaknesses of HITS
• Strength: its ability to rank pages according to the
query topic, which may provide more relevant
authority and hub pages.
• Weaknesses:
• It is easily spammed: it is quite easy to influence HITS,
since adding out-links to one's own page is so easy.
• Topic drift: many pages in the expanded set may not be on topic.
• Inefficiency at query time: the query-time evaluation is slow.
HITS Algorithm
Application areas of web mining
1. E-commerce: personalized marketing;
2. Fight against terrorism: classify threats;
3. Prediction;
4. And others :)
Future research directions
1. Multimedia data mining: a picture is worth a thousand words;
2. Multilingual knowledge extraction: web page translations;
3. Semantic web mining.
Thank you!
