Web Mining
By:-Mudit Dholakia
Guide:-Dr. Amit Ganatra Sir
What is web mining?
• Web mining is the use of the data mining techniques to automatically
discover and extract information from web documents/services.
• Discovering Knowledge from and about WWW - is one of the basic
abilities of an intelligent agent.
[Figure: Knowledge and the WWW]
Web Mining vs. Data Mining
• Structure (or lack of it)
• Textual information and linkage structure
• Scale
• Data generated per day is comparable to largest conventional data
warehouses
• Speed
• Often need to react to evolving usage patterns in real-time (e.g.,
merchandising)
Web Mining topics
• Web graph analysis
• Power Laws and The Long Tail
• Structured data extraction
• Web advertising
• Systems Issues
Size of the Web
• Number of pages
• Technically, infinite
• Much duplication (30-40%)
• Best estimate of “unique” static HTML pages comes from search engine
claims
• Until last year, Google claimed 8 billion(?), Yahoo claimed 20 billion
• Google recently announced that their index contains 1 trillion pages
• How to explain the discrepancy?
The web as a graph
• Pages = nodes, hyperlinks = edges
• Ignore content
• Directed graph
• High linkage
• 10-20 links/page on average
• Power-law degree distribution
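As a minimal sketch of the graph view above, the snippet below counts in-degrees and out-degrees from a list of hyperlinks and builds the degree histogram one would inspect for a power-law shape. The edge list and the function names are hypothetical illustrations, not part of the original slides.

```python
from collections import Counter

def degree_distributions(edges):
    """Return (in_degree, out_degree) Counters for a directed edge list."""
    in_deg, out_deg = Counter(), Counter()
    for src, dst in edges:
        out_deg[src] += 1   # src has one more outgoing hyperlink
        in_deg[dst] += 1    # dst has one more incoming hyperlink
    return in_deg, out_deg

def degree_histogram(degrees):
    """Map each degree value to the number of pages having that degree."""
    return Counter(degrees.values())

# Hypothetical toy web: (source page, target page) hyperlinks.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c"), ("c", "a")]
in_deg, out_deg = degree_distributions(edges)
in_hist = degree_histogram(in_deg)
```

On real crawl data, plotting the histogram on log-log axes is the usual first check: a roughly straight line is consistent with a power law. Note that pages with degree zero do not appear in the Counters here.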
Structure of Web graph
Power-law degree distribution
Measures
• Structure
• In-degrees
• Out-degrees
• Number of pages per site
• Usage patterns
• Number of visitors
• Popularity e.g., products, movies, music
The Long Tail
Measures
• Shelf space is a scarce commodity for traditional retailers
• Also: TV networks, movie theaters,…
• The web enables near-zero-cost dissemination of information about
products
• More choice necessitates better filters
• Recommendation engines (e.g., Amazon)
• How Into Thin Air made Touching the Void a bestseller
Searching the Web
[Figure: content aggregators, the Web, and content consumers]
Two approaches for analyzing data
• Machine Learning approach
• Emphasizes sophisticated algorithms e.g., Support Vector Machines
• Data sets tend to be small, fit in memory
• Data Mining approach
• Emphasizes big data sets (e.g., in the terabytes)
• Data cannot even fit on a single disk!
• Necessarily leads to simpler algorithms
View of mining system
[Figure: several machines, each with its own CPU, memory (Mem), and disk]
Issues
• Web data sets can be very large
• Tens to hundreds of terabytes
• Cannot mine on a single server!
• Need large farms of servers
• How to organize hardware/software to mine multi-terabyte data sets
• Without breaking the bank!
What should it do?
• Finding relevant information
• Low precision and unindexed information
• Creating new knowledge out of available information on the web
• A data-triggered process
• Personalizing the information
• Personal preference in content and presentation of the information
• Learning about the consumers
• What does the customer want to do?
Direct vs Indirect web mining
• Web mining techniques can be used to solve the information
overload problems:
Directly
Address the problem with web mining techniques
E.g. a newsgroup agent classifies whether a news item is relevant
Indirectly
Used as part of a bigger application that addresses problems
E.g. used to create index terms for a web search service
Web Mining Categories
• Web Content Mining
Discovering useful information from web page
contents/data/documents.
• Web Structure Mining
Discovering the model underlying link structures (topology)
on the Web. E.g. discovering authorities and hubs
• Web Usage Mining
Extraction of interesting knowledge from logging information
produced by web servers.
Usage data from logs, user profiles, user sessions, cookies, user
queries, bookmarks, mouse clicks and scrolls, etc.
Types
• Web Mining
• Web Content Mining
• Web Structure Mining
• Web Usage Mining
[Figure: IR system. Inputs: query, documents source. Output: ranked documents.]
[Figure: Clustering system. Inputs: similarity measure, documents source. Output: groups of documents.]
Web Content Data Structure
• Web content consists of several types of data
• Text, image, audio, video, hyperlinks.
• Unstructured – free text
• Semi-structured – HTML
• More structured – Data in the tables or database generated HTML
pages
Note: much of the Web content data is unstructured text data.
Web Content Mining
• Unstructured Documents
Bag of words to represent unstructured documents
• Takes a single word as a feature
• Ignores the sequence in which words occur
Features could be
• Boolean
• Word either occurs or does not occur in a document
• Frequency based
• Frequency of the word in a document
Variations of the feature selection include
• Removing case, punctuation, infrequent words and stop words
Features can be reduced using different feature selection techniques:
• Information gain, mutual information, cross entropy.
• Stemming, which reduces words to their morphological roots.
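The bag-of-words representation described above can be sketched as follows. This is an illustrative example only: the stop-word list, the tokenizer regex, and the function names are assumptions, and stemming and infrequent-word removal are left out for brevity.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}

def tokenize(text):
    """Lowercase, strip punctuation, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def bag_of_words(docs):
    """Return (vocabulary, boolean vectors, frequency vectors)."""
    vocab = sorted({w for d in docs for w in tokenize(d)})
    freq = []
    for d in docs:
        counts = Counter(tokenize(d))
        freq.append([counts[w] for w in vocab])        # frequency-based features
    boolean = [[1 if c > 0 else 0 for c in row] for row in freq]  # presence/absence
    return vocab, boolean, freq

docs = ["The web is a graph of pages.", "Mining the Web: web mining and pages!"]
vocab, boolean, freq = bag_of_words(docs)
```

Word order is discarded by construction: each document becomes one row indexed by the sorted vocabulary.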
Web Content Mining
• Semi-Structured Documents
Uses richer representations for features
Due to the additional structural information in the hypertext
document (typically HTML and hyperlinks)
Uses common data mining methods (whereas
unstructured might use more text mining methods)
Application:
• Hypertext classification or categorization and clustering,
• learning relations between web documents,
• learning extraction patterns or rules, and
• finding patterns in semi-structured data.
Web Content Mining: DB View
• The database techniques on the Web are related to the problems of managing
and querying the information on the Web.
• DB view tries to infer the structure of a Web site or transform a Web site to
become a database
Better information management
Better querying on the Web
• Can be achieved by:
Finding the schema of Web documents
Building a Web warehouse
Building a Web knowledge base
Building a virtual database
Web Content Mining: DB View
• DB view mainly uses the Object Exchange Model (OEM)
Represents semi-structured data by a labeled graph
The data in the OEM is viewed as a graph, with objects as the vertices
and labels on the edges
• Each object is identified by an object identifier (oid)
• Each object's value is either atomic or complex
• Process typically starts with manual selection of Web sites for
doing Web content mining
• Main applications:
• The task of finding frequent substructures in semi-structured data
• The task of creating a multi-layered database
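To make the OEM description above concrete, here is a minimal sketch of a labeled graph in which each object has an oid and a value that is either atomic or complex. Representing a complex value as a list of (label, child oid) pairs is one common reading of the model; the class name, the `follow` helper, and the sample oids are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

Atomic = Union[str, int, float]
Edges = List[Tuple[str, str]]  # (edge label, child oid)

@dataclass
class OEMObject:
    oid: str
    value: Union[Atomic, Edges]  # atomic value, or labeled edges to child objects

    @property
    def is_atomic(self) -> bool:
        return not isinstance(self.value, list)

def follow(db: Dict[str, OEMObject], oid: str, label: str) -> List[OEMObject]:
    """Return the children of `oid` reached through edges carrying `label`."""
    obj = db[oid]
    if obj.is_atomic:
        return []
    return [db[child] for lbl, child in obj.value if lbl == label]

# Hypothetical site: a root object with two pages, each with a title.
db = {o.oid: o for o in [
    OEMObject("&1", [("page", "&2"), ("page", "&3")]),
    OEMObject("&2", [("title", "&4")]),
    OEMObject("&3", [("title", "&5")]),
    OEMObject("&4", "Web Mining"),
    OEMObject("&5", "Data Mining"),
]}
titles = [t.value for p in follow(db, "&1", "page") for t in follow(db, p.oid, "title")]
```

Because labels live on the edges rather than in a fixed schema, two pages may expose different sets of labels, which is what makes the model suitable for semi-structured data.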
Taxonomies
• Ranking
• Graph Search
• Communities
• Hyperlink Induced Topic Search
• SEO
• Hub & Authorities
Web Structure Mining
• Interested in the structure of the hyperlinks within the Web
• Inspired by the study of social networks and citation analysis
• Can discover specific types of pages (such as hubs, authorities, etc.) based on
the incoming and outgoing links.
• Application:
• Discovering micro-communities in the Web,
• Measuring the “completeness” of a Web site
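Hubs and authorities can be found with an iterative scheme in the style of Hyperlink Induced Topic Search (HITS): a page's authority score sums the hub scores of pages linking to it, and its hub score sums the authority scores of pages it links to. The sketch below is a simplified version under assumed inputs (a small hypothetical edge list, a fixed iteration count, L2 normalization), not a production implementation.

```python
import math

def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a directed edge list."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum hub scores over incoming links.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub update: sum (new) authority scores over outgoing links.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        # Normalize so the scores stay bounded.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {n: v / a_norm for n, v in new_auth.items()}
        hub = {n: v / h_norm for n, v in new_hub.items()}
    return hub, auth

# Hypothetical link structure: h* pages point at a* pages.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a1")]
hub, auth = hits(edges)
```

In this toy graph, `a1` receives the most links from good hubs and therefore ends up as the top authority, while `h3`, which points to only one authority, scores lower as a hub than `h1` and `h2`.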
Web Usage Mining
• Tries to predict user behavior from interaction
with the Web
• Wide range of data (logs)
• Web client data
• Proxy server data
• Web server data
• Two common approaches
• Map the usage data of the Web server into relational tables before
applying adapted data mining techniques
• Use the log data directly by utilizing special pre-processing
techniques
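The first approach above, mapping server usage data into relational tables, might look like the following sketch. It assumes log lines in the Common Log Format and uses an in-memory SQLite database; the table name `hits`, the column choice, and the sample lines are hypothetical.

```python
import re
import sqlite3

# host ident authuser [timestamp] "METHOD url protocol" status bytes
LOG_RE = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) (\S+)')

def load_log(lines):
    """Parse log lines into a relational table and return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hits (host TEXT, ts TEXT, method TEXT, url TEXT, status INTEGER)"
    )
    for line in lines:
        m = LOG_RE.match(line)
        if m is None:
            continue  # skip lines that do not follow the assumed format
        host, ts, method, url, status, _size = m.groups()
        conn.execute(
            "INSERT INTO hits VALUES (?, ?, ?, ?, ?)",
            (host, ts, method, url, int(status)),
        )
    return conn

lines = [
    '10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2000:13:56:01 -0700] "GET /products.html HTTP/1.0" 200 512',
    '10.0.0.2 - - [10/Oct/2000:13:57:12 -0700] "GET /index.html HTTP/1.0" 200 2326',
    'malformed line',
]
conn = load_log(lines)
top = conn.execute(
    "SELECT url, COUNT(*) AS n FROM hits GROUP BY url ORDER BY n DESC, url"
).fetchall()
```

Once the data is relational, ordinary SQL aggregates (page popularity here) or standard data mining tools can be applied to it.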
Web Usage Mining
[Figure: Web usage mining stages: pre-processing, pattern discovery, pattern analysis. Intermediate results: user session file, rules and patterns, interesting knowledge.]
[Figure: XML view with layers Layer0, Layer1, ..., Layern, annotated with generalized descriptions and more generalized descriptions]
Use of Multi-Layer Meta Web
• Benefits of Multi-Layer Meta-Web:
• Multi-dimensional Web info summary analysis
• Approximate and intelligent query answering
• Web high-level query answering (WebSQL, WebML)
• Web content and structure mining
• Observing the dynamics/evolution of the Web
• Is it realistic to construct such a meta-Web?
• Benefits even if it is partially constructed
• Benefits may justify the cost of tool development,
standardization and partial restructuring
Web Search Products and Services
• Alta Vista
• DB2 text extender
• Excite
• Fulcrum
• Glimpse (Academic)
• Google!
• Infoseek Internet
• Infoseek Intranet
• Inktomi (HotBot)
• Lycos
• PLS
• Smart (Academic)
• Oracle text extender
• Verity
• Yahoo!
Web Usage Mining
• Typical problems:
• Distinguishing among unique users, server sessions,
episodes, etc. in the presence of caching and proxy
servers
• Often Usage Mining uses some background or domain
knowledge
E.g. site topology, Web content, etc.
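A common heuristic for the user and session problem above is to treat a (host, user agent) pair as one user and to start a new session after a period of inactivity. The sketch below assumes both heuristics; the 30-minute timeout, the tuple layout, and the sample data are illustrative assumptions, and neither heuristic fully resolves caching or shared proxies.

```python
from collections import defaultdict

def sessionize(hits, timeout=1800):
    """Group requests into sessions.

    hits: iterable of (user_key, unix_time, url).
    Returns {user_key: [[url, ...], ...]}, one inner list per session.
    """
    by_user = defaultdict(list)
    for user, ts, url in hits:
        by_user[user].append((ts, url))
    sessions = {}
    for user, events in by_user.items():
        events.sort()  # order each user's requests by time
        user_sessions, last_ts = [], None
        for ts, url in events:
            # A gap longer than the timeout starts a new session.
            if last_ts is None or ts - last_ts > timeout:
                user_sessions.append([])
            user_sessions[-1].append(url)
            last_ts = ts
        sessions[user] = user_sessions
    return sessions

# Two browsers behind the same (hypothetical) proxy address.
hits = [
    (("10.0.0.1", "Firefox"), 0, "/"),
    (("10.0.0.1", "Firefox"), 60, "/a"),
    (("10.0.0.1", "Chrome"), 30, "/"),
    (("10.0.0.1", "Firefox"), 4000, "/b"),
]
sessions = sessionize(hits)
```

Background knowledge such as site topology could refine this further, for example by splitting a session when a request is not reachable by a link from the previous page.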
Web Usage Mining
• Applications:
• Two main categories:
• Learning a user profile (personalized)
Web users would be interested in techniques that learn their
needs and preferences automatically
• Learning user navigation patterns (impersonalized)
Information providers would be interested in techniques that
improve the effectiveness of their Web site
References
• www.cs.jyu.fi/ai/vagan/Web_Mining.ppt
• www.infolab.stanford.edu/~ullman/mining/webMiningOverview.ppt
• www.psl.cs.columbia.edu/classes/.../Presentation_Jagriti_Mishra.ppt
Thank You