Web Mining
By:-Mudit Dholakia
Guide:-Dr. Amit Ganatra Sir
What is web mining?
• Web mining is the use of data mining techniques to automatically
discover and extract information from web documents/services.
• Discovering knowledge from and about the WWW is one of the basic
abilities of an intelligent agent.
[Figure: knowledge extracted from the WWW]
Web Mining vs. Data Mining
• Structure (or lack of it)
• Textual information and linkage structure
• Scale
• Data generated per day is comparable to the largest conventional data
warehouses
• Speed
• Often need to react to evolving usage patterns in real-time (e.g.,
merchandising)
Web Mining topics
• Web graph analysis
• Power Laws and The Long Tail
• Structured data extraction
• Web advertising
• Systems Issues
Size of the Web
• Number of pages
• Technically, infinite
• Much duplication (30-40%)
• Best estimate of “unique” static HTML pages comes from search engine
claims
• Until last year, Google claimed 8 billion(?), Yahoo claimed 20 billion
• Google recently announced that their index contains 1 trillion pages
• How to explain the discrepancy?
The web as a graph
• Pages = nodes, hyperlinks = edges
• Ignore content
• Directed graph
• High linkage
• 10-20 links/page on average
• Power-law degree distribution
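As a minimal sketch of the graph view above, pages and hyperlinks can be held as a list of directed edges, from which in-degrees, out-degrees and the degree distribution follow directly. The page names are made up for illustration, and a toy graph this small cannot show a power law; it only shows how the counts are obtained.

```python
from collections import Counter

# Hypothetical toy web: each pair (source, target) is one hyperlink.
edges = [
    ("a.html", "b.html"),
    ("a.html", "c.html"),
    ("b.html", "c.html"),
    ("d.html", "c.html"),
    ("c.html", "a.html"),
]

def degree_counts(edges):
    """Return (in_degree, out_degree) Counters for a directed edge list."""
    out_degree = Counter(src for src, _ in edges)
    in_degree = Counter(dst for _, dst in edges)
    return in_degree, out_degree

def degree_distribution(degree):
    """Map each degree value to the number of pages that have it.

    Pages with degree zero never appear in the Counter, so they are
    not included in the distribution.
    """
    return Counter(degree.values())

in_degree, out_degree = degree_counts(edges)
in_distribution = degree_distribution(in_degree)

print(in_degree.most_common())
print(dict(in_distribution))
```

On a real crawl, plotting the distribution on log-log axes is the usual way to check for the power-law shape.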
Structure of Web graph
Power-law degree distribution
Measures
• Structure
• In-degrees
• Out-degrees
• Number of pages per site
• Usage patterns
• Number of visitors
• Popularity e.g., products, movies, music
The Long Tail
Measures
• Shelf space is a scarce commodity for traditional retailers
• Also: TV networks, movie theaters,…
• The web enables near-zero-cost dissemination of information about
products
• More choice necessitates better filters
• Recommendation engines (e.g., Amazon)
• How Into Thin Air made Touching the Void a bestseller
Searching the Web
[Figure: content aggregators, the Web, content consumers]
Two approaches for analyzing data
• Machine Learning approach
• Emphasizes sophisticated algorithms e.g., Support Vector Machines
• Data sets tend to be small, fit in memory
• Data Mining approach
• Emphasizes big data sets (e.g., in the terabytes)
• Data cannot even fit on a single disk!
• Necessarily leads to simpler algorithms
View of mining system
[Figure: a row of machines, each with its own CPU, memory (Mem) and disk]
Issues
• Web data sets can be very large
• Tens to hundreds of terabytes
• Cannot mine on a single server!
• Need large farms of servers
• How to organize hardware/software to mine multi-terabyte data sets
• Without breaking the bank!
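One way to picture mining across a farm of servers is to split the data into chunks, let each machine summarize only its own chunk, and then merge the small summaries. The sketch below is an assumption-laden illustration: the function names are hypothetical, the "servers" are simulated as list slices, and everything runs sequentially in one process.

```python
from collections import Counter

def partition(records, n_workers):
    """Split records round-robin into n_workers chunks (one per simulated server)."""
    return [records[i::n_workers] for i in range(n_workers)]

def local_count(chunk):
    """Work done on one machine: count words in its own chunk only."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts

def merge(partials):
    """Combine the small per-machine summaries into a global result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Made-up page texts standing in for a multi-terabyte data set.
pages = ["web mining", "data mining", "web graph", "web usage mining"]
chunks = partition(pages, n_workers=3)
word_counts = merge(local_count(chunk) for chunk in chunks)

print(word_counts.most_common(2))
```

The per-chunk step is deliberately simple, which matches the point that very large data sets push toward simpler algorithms.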
What should it do?
• Finding relevant information
• Low precision and unindexed information
• Creating new knowledge out of available information on the web
• A data-triggered process
• Personalizing the information
• Personal preference in content and presentation of the information
• Learning about the consumers
• What does the customer want to do?
Direct vs Indirect web mining
• Web mining techniques can be used to solve information
overload problems:
Directly
Address the problem with web mining techniques
E.g., a newsgroup agent classifies whether a news item is relevant
Indirectly
Used as part of a bigger application that addresses the problem
E.g., used to create index terms for a web search service
Web Mining Categories
• Web Content Mining
Discovering useful information from web page
contents/data/documents.
• Web Structure Mining
Discovering the model underlying link structures (topology)
on the Web. E.g. discovering authorities and hubs
• Web Usage Mining
Extraction of interesting knowledge from logging information
produced by web servers.
Usage data from logs, user profiles, user sessions, cookies, user
queries, bookmarks, mouse clicks and scrolls, etc.
Types
• Web Mining
• Web Content Mining
• Web Structure Mining
• Web Usage Mining
[Figure: IR system. Inputs: query, documents source. Output: ranked documents]
[Figure: Clustering system. Inputs: similarity measure, documents source. Output: groups of documents]
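Both systems in the figures rely on some way of comparing texts. As a hedged illustration, the sketch below uses cosine similarity over raw term counts, which is one common choice and not necessarily the measure the slides have in mind. The documents and the query are made up.

```python
import math
from collections import Counter

def tf_vector(text):
    """Term-frequency vector of a text, as a word -> count mapping."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, documents):
    """Return document indices sorted from most to least similar to the query."""
    q = tf_vector(query)
    scores = [cosine(q, tf_vector(d)) for d in documents]
    return sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)

docs = [
    "web usage mining from server logs",
    "cooking pasta at home",
    "web structure mining of hyperlinks",
]
ranking = rank("web mining", docs)
print(ranking)
```

The same `cosine` function could serve as the similarity measure for a clustering step; only the ranking use is shown here.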
Web Content Data Structure
• Web content consists of several types of data
• Text, image, audio, video, hyperlinks.
• Unstructured – free text
• Semi-structured – HTML
• More structured – data in tables or database-generated HTML
pages
Note: much of the Web content data is unstructured text data.
Web Content Mining
• Unstructured Documents
Bag of words is used to represent unstructured documents
• Takes each single word as a feature
• Ignores the sequence in which words occur
Features could be
• Boolean
• A word either occurs or does not occur in a document
• Frequency based
• Frequency of the word in a document
Variations of the feature selection include
• Removing case, punctuation, infrequent words and stop words
Features can be reduced using different feature selection techniques:
• Information gain, mutual information, cross entropy.
• Stemming, which reduces words to their morphological roots.
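The bag-of-words representation can be sketched in a few lines. This is a minimal illustration, assuming a tiny hand-written stop-word list and whitespace tokenization; stemming and the statistical feature selection methods are left out.

```python
import string
from collections import Counter

# Assumed tiny stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "is", "and", "to", "in"}

def tokenize(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in cleaned.split() if w not in STOP_WORDS]

def frequency_features(text):
    """Frequency-based features: word -> number of occurrences."""
    return Counter(tokenize(text))

def boolean_features(text, vocabulary):
    """Boolean features: for each vocabulary word, does it occur in the text?"""
    present = set(tokenize(text))
    return {word: word in present for word in vocabulary}

doc = "The Web is a graph; mining the Web graph is mining links."
freq = frequency_features(doc)
flags = boolean_features(doc, ["web", "graph", "usage"])

print(freq)
print(flags)
```

Word order is discarded as soon as the tokens go into the `Counter`, which is exactly the "ignores the sequence" property.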
Web Content Mining
• Semi-Structured Documents
Uses richer representations for features
Due to the additional structural information in the hypertext
document (typically HTML and hyperlinks)
Uses common data mining methods (whereas
unstructured documents might use more text mining methods)
Applications:
• Hypertext classification or categorization and clustering,
• learning relations between web documents,
• learning extraction patterns or rules, and
• finding patterns in semi-structured data.
Web Content Mining: DB View
• The database techniques on the Web are related to the problems of managing
and querying the information on the Web.
• DB view tries to infer the structure of a Web site or transform a Web site
into a database
Better information management
Better querying on the Web
• Can be achieved by:
Finding the schema of Web documents
Building a Web warehouse
Building a Web knowledge base
Building a virtual database
Web Content Mining: DB View
• DB view mainly uses the Object Exchange Model (OEM)
Represents semi-structured data by a labeled graph
The data in the OEM is viewed as a graph, with objects as the vertices
and labels on the edges
• Each object is identified by an object identifier (oid)
• Its value is either atomic or complex
• Process typically starts with manual selection of Web sites for
doing Web content mining
• Main applications:
• The task of finding frequent substructures in semi-structured data
• The task of creating a multi-layered database
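The OEM description above maps naturally onto a small data structure: objects keyed by oid, each either holding an atomic value or pointing to other objects through labeled edges. The class and function names below are hypothetical, and the oids and labels are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple, Union

Atomic = Union[str, int, float]

@dataclass
class OEMObject:
    """One OEM object: an oid plus an atomic value or labeled child edges."""
    oid: str
    value: Optional[Atomic] = None  # set for atomic objects
    children: List[Tuple[str, str]] = field(default_factory=list)  # (label, child oid)

    @property
    def is_atomic(self) -> bool:
        """True when the object has no outgoing labeled edges."""
        return not self.children

def lookup(db: Dict[str, OEMObject], start_oid: str, path: List[str]) -> List[OEMObject]:
    """Follow a sequence of edge labels from start_oid; return the objects reached."""
    current = [db[start_oid]]
    for label in path:
        current = [db[child] for obj in current
                   for (lab, child) in obj.children if lab == label]
    return current

# Made-up site: a root object with two pages, each with a title.
db = {
    "&1": OEMObject("&1", children=[("page", "&2"), ("page", "&3")]),
    "&2": OEMObject("&2", children=[("title", "&4")]),
    "&3": OEMObject("&3", children=[("title", "&5")]),
    "&4": OEMObject("&4", value="Web Mining"),
    "&5": OEMObject("&5", value="Data Mining"),
}
titles = [obj.value for obj in lookup(db, "&1", ["page", "title"])]
print(titles)
```

Path lookups like `["page", "title"]` are the simplest kind of query over such a graph; finding frequent substructures would build on the same representation.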
Taxonomies
• Ranking
• Graph Search
• Communities
• Hyperlink Induced Topic Search
• SEO
• Hub & Authorities
Web Structure Mining
• Interested in the structure of the hyperlinks within the Web
• Inspired by the study of social networks and citation analysis
• Can discover specific types of pages (such as hubs, authorities, etc.) based on
the incoming and outgoing links.
• Applications:
• Discovering micro-communities in the Web,
• measuring the “completeness” of a Web site
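Hubs and authorities can be found from links alone with the iterative scheme used by Hyperlink-Induced Topic Search (HITS): a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of pages it links to. The sketch below is a plain, unoptimized version on a made-up graph; the fixed iteration count is an assumption, not a convergence test.

```python
import math

def hits(edges, iterations=50):
    """Score pages as hubs and authorities from a directed edge list."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages linking in.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # Hub score: sum of authority scores of pages linked to.
        hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        # Normalize to unit length so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# Hypothetical graph: three hub-like pages pointing at two authority-like pages.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a1")]
hub, auth = hits(edges)

print(max(auth, key=auth.get))
print(sorted(hub, key=hub.get, reverse=True))
```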
Web Usage Mining
• Tries to predict user behavior from interaction
with the Web
• Wide range of data (logs)
• Web client data
• Proxy server data
• Web server data
• Two common approaches
• Map the Web server usage data into relational tables, then
apply an adapted data mining technique
• Use the log data directly by applying special pre-processing
techniques
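The first approach, mapping server logs into relational tables, might look like the sketch below. It assumes the log is in Common Log Format and uses an in-memory SQLite table; the table layout, the regular expression and the sample entries are all illustrative assumptions.

```python
import re
import sqlite3

# Assumed input: lines in Common Log Format; the sample entries are made up.
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

sample_log = [
    '10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2000:13:56:01 -0700] "GET /products.html HTTP/1.0" 200 512',
    '10.0.0.2 - - [10/Oct/2000:13:57:12 -0700] "GET /index.html HTTP/1.0" 200 2326',
    'malformed line that the parser should skip',
]

def load_hits(lines):
    """Parse log lines and insert one row per request into a 'hits' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hits (host TEXT, time TEXT, method TEXT, path TEXT, status INTEGER)"
    )
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not look like log entries
        conn.execute(
            "INSERT INTO hits VALUES (?, ?, ?, ?, ?)",
            (match["host"], match["time"], match["method"],
             match["path"], int(match["status"])),
        )
    return conn

conn = load_hits(sample_log)
page_views = conn.execute(
    "SELECT path, COUNT(*) FROM hits GROUP BY path ORDER BY COUNT(*) DESC, path"
).fetchall()
print(page_views)
```

Once the data is in a table, ordinary SQL aggregation already answers simple usage questions, and the table can feed other mining steps.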
Web Usage Mining
[Figure: three stages, Pre-Processing, Pattern Discovery and Pattern Analysis, producing in turn a user session file, rules and patterns, and interesting knowledge]
[Figure: XML View of the multi-layer meta-Web. Labels: Layer0, Layer1, ..., Layern; Generalized Descriptions; More Generalized Descriptions]
Use of Multi-Layer Meta-Web
• Benefits of Multi-Layer Meta-Web:
• Multi-dimensional Web info summary analysis
• Approximate and intelligent query answering
• Web high-level query answering (WebSQL, WebML)
• Web content and structure mining
• Observing the dynamics/evolution of the Web
• Is it realistic to construct such a meta-Web?
• Benefits even if it is partially constructed
• Benefits may justify the cost of tool development,
standardization and partial restructuring
Web Search Products and Services
• Alta Vista
• DB2 text extender
• Excite
• Fulcrum
• Glimpse (Academic)
• Google!
• Infoseek Internet
• Infoseek Intranet
• Inktomi (HotBot)
• Lycos
• PLS
• Smart (Academic)
• Oracle text extender
• Verity
• Yahoo!
Web Usage Mining
• Typical problems:
• Distinguishing among unique users, server sessions,
episodes, etc. in the presence of caching and proxy
servers
• Often Usage Mining uses some background or domain
knowledge
E.g. site topology, Web content, etc.
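A common heuristic for separating users and sessions is to key hits on something like (IP address, user agent) and start a new session after a period of inactivity. The sketch below assumes a 30-minute timeout and made-up hits with timestamps in seconds; it does not resolve the caching and proxy problems named above, since several people behind one proxy would still look like a single user.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; a commonly used heuristic, assumed here

def sessionize(hits, timeout=SESSION_TIMEOUT):
    """Group (user_key, timestamp, path) hits into per-user sessions."""
    by_user = defaultdict(list)
    for user_key, timestamp, path in hits:
        by_user[user_key].append((timestamp, path))

    sessions = []
    for user_key, events in by_user.items():
        events.sort()  # order each user's hits by time
        current = [events[0][1]]
        last_time = events[0][0]
        for timestamp, path in events[1:]:
            if timestamp - last_time > timeout:
                # Gap too long: close the current session and start a new one.
                sessions.append((user_key, current))
                current = []
            current.append(path)
            last_time = timestamp
        sessions.append((user_key, current))
    return sessions

# Hypothetical hits: the user key is (IP address, user agent).
hits = [
    (("10.0.0.1", "Firefox"), 0, "/index.html"),
    (("10.0.0.1", "Firefox"), 120, "/products.html"),
    (("10.0.0.1", "Firefox"), 4000, "/index.html"),
    (("10.0.0.1", "Chrome"), 60, "/index.html"),
]
sessions = sessionize(hits)
for user_key, pages in sessions:
    print(user_key, pages)
```

Background knowledge such as site topology can refine this, for example by splitting a session when a requested page is not reachable from the previous one.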
Web Usage Mining
• Applications:
• Two main categories:
• Learning a user profile (personalized)
Web users would be interested in techniques that learn their
needs and preferences automatically
• Learning user navigation patterns (impersonalized)
Information providers would be interested in techniques that
improve the effectiveness of their Web site
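For the navigation-pattern case, even a simple count of which page follows which can tell a site owner where visitors tend to go next. The sketch below assumes sessions are already available as ordered lists of page paths; the sessions shown are invented.

```python
from collections import Counter

def transition_counts(sessions):
    """Count how often each (page, next_page) pair occurs across sessions."""
    counts = Counter()
    for pages in sessions:
        counts.update(zip(pages, pages[1:]))
    return counts

# Made-up sessions: each list is one visitor's ordered page views.
sessions = [
    ["/index.html", "/products.html", "/cart.html"],
    ["/index.html", "/products.html"],
    ["/index.html", "/about.html"],
]
counts = transition_counts(sessions)
top_transition = counts.most_common(1)[0]
print(top_transition)
```

Frequent transitions that require many clicks are natural candidates for a direct link, which is one concrete way to improve a site's effectiveness.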
References
• www.cs.jyu.fi/ai/vagan/Web_Mining.ppt
• www.infolab.stanford.edu/~ullman/mining/webMiningOverview.ppt
• www.psl.cs.columbia.edu/classes/.../Presentation_Jagriti_Mishra.ppt
Thank You
