Web-Content Mining
-Akanksha Dombe
JNEC, Aurangabad
Specifies
 The WWW is a huge, widely distributed, global
information service centre for
 Information services:
news, advertisements, consumer
information, financial
management, education, government,
e-commerce, etc.
 Hyper-link information
 Access and usage information
 WWW provides rich sources of data for data mining
The Web: Opportunities & Challenges
1. The amount of information on the Web is huge
2. The coverage of Web information is very wide and
diverse
3. Information/data of almost all types exist on the
Web
4. Much of the Web information is
semi-structured
5. Much of the Web information is linked
6. Much of the Web information is redundant
The Web: Opportunities & Challenges
7. The Web is noisy
8. The Web is also about services
9. The Web is dynamic
10. Above all, the Web is a virtual society
11. The Web consists of surface Web and deep Web.
 Surface Web: pages that can be browsed using a
browser.
 Deep Web: databases that can only be accessed
through parameterized query interfaces
What is Web Data ?
 Web data is
1. Web content – text, image, records, etc.
2. Web structure – hyperlinks, tags, etc.
3. Web usage – HTTP logs, app server logs, etc.
4. Intra-page structures
5. Inter-page structures
6. Supplemental data
1. Profiles
2. Registration information
3. Cookies
Web Mining
 Web Mining is the use of the data mining techniques
to automatically discover and extract information
from web documents/services
 Web mining is the application of data mining
techniques to find interesting and potentially useful
knowledge from web data
 Web mining is the application of data mining
techniques to extract knowledge from web
data, including web documents, hyperlinks between
documents, usage logs of web sites, etc.
Web Mining
• Discovering useful information from the World-Wide
Web and its usage patterns
• My Definition: Using data mining techniques to make the
web more useful and more profitable (for some) and to
increase the efficiency of our interaction with the web
Why Mine the Web?
 Enormous wealth of information on Web
 Financial information (e.g. stock quotes)
 Book/CD/Video stores (e.g. Amazon)
 Restaurant information
 Car prices
 Lots of data on user access patterns
 Web logs contain sequence of URLs accessed by users
 Possible to mine interesting nuggets of information
 People who ski also travel frequently to Europe
 Tech stocks have corrections in the summer and rally from November
until February
Why is Web Mining Different?
 The Web is a huge collection of documents that also carries
 Hyper-link information
 Access and usage information
 The Web is very dynamic
 New pages are constantly being generated
 Challenge: Develop new Web mining algorithms and adapt
traditional data mining algorithms to
 Exploit hyper-links and access patterns
 Be incremental
Web Mining: Subtasks
 Resource finding
 Retrieving intended documents
 Information selection/pre-processing
 Select and pre-process specific information from selected
documents
 Generalization
 Discover general patterns within and across web sites
 Analysis
 Validation and/or interpretation of mined patterns
Web Mining Issues
 Size
 Grows at about 1 million pages a day
 Google indexes 9 billion documents
 Number of web sites
 Netcraft survey says 72 million sites
 (http://news.netcraft.com/archives/web_server_survey.html)
 Diverse types of data
 Images
 Text
 Audio/video
 XML
 HTML
Web Mining Applications
 E-commerce (Infrastructure)
 Generate user profiles
 Targeted advertising
 Fraud
 Similar image retrieval
 Information retrieval (Search) on the Web
 Automated generation of topic hierarchies
 Web knowledge bases
 Extraction of schema for XML documents
 Network Management
 Performance management
 Fault management
Web Mining Taxonomy
Web Data Mining
 Use of data mining techniques to
automatically discover interesting and
potentially useful information from Web
documents and services.
 Web mining may be divided into three
categories:
1. Web content mining
2. Web structure mining
3. Web usage mining
What is “Web Content Mining”?
Web Content Mining
 Discovery of useful information from web
contents / data / documents
 Web data contents:
1. text,
2. image,
3. audio,
4. video,
5. metadata and
6. hyperlinks
Web Content Mining
 Examine the contents of web pages as well as result of web
searching
 Can be thought of as extending the work performed by basic
search engines
 Search engines have crawlers to search the web and gather
information, indexing techniques to store the
information, and query processing support to provide
information to the users
 Web Content Mining is: the process of extracting knowledge
from web contents
Web Content Mining
 Basic search-engine retrieval provides no information about the
structure of the content being searched for and no
information about the categories of
documents that are found.
 Need more sophisticated tools for searching or
discovering Web content.
Web Content mining
 Discovering useful information from contents of Web
pages.
 Web content is very rich, consisting of
text, image, audio, and video data, metadata, as well
as hyperlinks.
 The data may be unstructured (free text) or
structured (data from a database) or semi-structured
(html) although much of the Web is unstructured.
Web Content Data Structure
 Unstructured – free text
 Semi-structured – HTML
 More structured – Table or Database generated
HTML pages
 Multimedia data – receive less attention than text or
hypertext
Web Content mining
 Web content mining is related to data mining
and text mining
 It is related to data mining because many data
mining techniques can be applied in Web content
mining.
 It is related to text mining because much of the
web contents are texts.
 Web data are mainly semi-structured and/or
unstructured, while data mining typically targets structured data and
text mining targets unstructured text.
Web Content Data Structure
 Web content consists of several types of data
 Text, image, audio, video, hyperlinks.
 Unstructured – free text
 Semi-structured – HTML
 More structured – Data in the tables or
database generated HTML pages
 Note: much of the Web content data is unstructured
text data.
Semi-structured Data
 Content is, in general, semi-structured
 Example:
 Title
 Author
 Publication_Date
 Length
 Category
 Abstract
 Content
Web Content Mining: IR View
 Unstructured Documents
 Bag of words, or phrase-based feature
representation
 Features can be boolean or frequency based
 Features can be reduced using different feature
selection techniques
 Word stemming, combining morphological
variations into one feature
Web Content Mining: IR View
 Semi-Structured Documents
 Uses richer representations for features, based on
information from the document structure
(typically HTML and hyperlinks)
 Uses common data mining methods (whereas
unstructured might use more text mining
methods)
Web Content Mining: DB View
 Tries to infer the structure of a Web site or transform
a Web site to become a database
 Better information management
 Better querying on the Web
 Can be achieved by:
 Finding the schema of Web documents
 Building a Web warehouse
 Building a Web knowledge base
 Building a virtual database
Web Content Mining: DB View
 Mainly uses the Object Exchange Model (OEM)
 Represents semi-structured data (some
structure, no rigid schema) by a labeled graph
 Process typically starts with manual selection of Web
sites for content mining
 Main application: building a structural summary of
semi-structured data (schema extraction or
discovery)
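The labeled-graph idea can be sketched in a few lines. This is a minimal illustration, not the OEM specification: it assumes nested Python dicts and lists stand in for a tree-shaped labeled graph (no shared nodes or cycles), and `label_paths` and the sample `site` data are hypothetical names invented for the example. The set of distinct label paths acts as a crude structural summary of data that has no rigid schema.

```python
from typing import Any, Set, Tuple

# Hypothetical semi-structured records: same kind of object, no rigid schema.
site = {
    "article": [
        {"title": "t1", "author": "x", "length": 10},
        {"title": "t2", "category": "c"},  # no author, one extra field
    ]
}

def label_paths(node: Any, prefix: Tuple[str, ...] = ()) -> Set[Tuple[str, ...]]:
    """Collect every distinct path of edge labels reachable from the root."""
    paths: Set[Tuple[str, ...]] = set()
    if isinstance(node, dict):
        for label, child in node.items():
            path = prefix + (label,)
            paths.add(path)
            paths |= label_paths(child, path)
    elif isinstance(node, list):
        # Items of a list share the label of the incoming edge.
        for child in node:
            paths |= label_paths(child, prefix)
    return paths

summary = sorted(".".join(p) for p in label_paths(site))
print(summary)
```

The summary lists every attribute that occurs somewhere, even if some records lack it, which is the kind of information schema discovery tries to recover.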
Techniques for Web Content Mining
Classification
Clustering
Association
Web Content Mining : Topics
 Structured data extraction
 Unstructured text extraction
 Sentiment classification, analysis and summarization
of consumer reviews
 Information integration and schema matching
 Knowledge synthesis
 Template detection and page segmentation
Structured Data Extraction
 Most widely studied research topic
 A large amount of information on the Web is
contained in regularly structured data objects
(retrieved from databases). Such Web data records are
important because they often present the essential
information of their host pages, e.g., lists of products
and services
Structured Data Extraction
 Applications: integrated and value-added
services, e.g., Comparative shopping, meta-search &
query, etc
Structured Data Extraction: Approaches
1. Wrapper Generation
2. Wrapper Induction or Wrapper Learning
3. Automatic Approach
Structured Data Extraction: Approaches
 Wrapper Generation
Write an extraction program for each website
based on observed format patterns
 Labor intensive & time consuming
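A hand-written wrapper might look like the sketch below. The page fragment, the `RECORD` pattern, and the `extract` function are all hypothetical and tied to one invented page layout; that tight coupling is what makes this approach labor intensive, because every site (and every redesign) needs its own pattern.

```python
import re

# Hypothetical page fragment; a real site would need its own pattern.
PAGE = """
<li><b>Blue Kettle</b> <span class="price">$24.50</span></li>
<li><b>Red Toaster</b> <span class="price">$39.00</span></li>
"""

# Hand-written rule based on the observed format of this one site.
RECORD = re.compile(
    r'<li><b>(?P<name>[^<]+)</b>\s*<span class="price">\$(?P<price>[\d.]+)</span></li>'
)

def extract(page: str):
    """Return (name, price) tuples for every record matching the pattern."""
    return [(m["name"], float(m["price"])) for m in RECORD.finditer(page)]

records = extract(PAGE)
print(records)
```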
CS511, Bing Liu, UIC
 Automatic Approach
 Structured data objects on the web are normally
database records
 Retrieved from databases & displayed in web
pages with fixed templates
 Find patterns / grammars from the web pages &
then use them to extract data
 e.g., IEPAD, MDR, ROADRUNNER, EXALG, etc.
 Wrapper Induction or Wrapper Learning
 Main technique currently
 The user first manually labels a set of training
pages
 A learning system then generates rules from the
training pages
 The resulting rules are then applied to extract
target items from web pages
 e.g. WIEN, Stalker, BWI, WL etc
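The label-then-learn-then-apply loop can be illustrated with a deliberately simplified sketch. It is written in the spirit of left/right delimiter wrappers and is not the actual algorithm of WIEN, Stalker, or any other named system. It assumes one target item per training page; `learn_lr`, `apply_lr`, and the sample pages are hypothetical.

```python
import os
import re

def common_prefix(strings):
    # os.path.commonprefix compares character by character, so it works on any strings.
    return os.path.commonprefix(list(strings))

def common_suffix(strings):
    return common_prefix([s[::-1] for s in strings])[::-1]

def learn_lr(examples):
    """examples: (page, target) pairs where the labeled target occurs in the page.
    Returns the text shared immediately before and immediately after the targets."""
    lefts, rights = [], []
    for page, target in examples:
        start = page.index(target)
        lefts.append(page[:start])
        rights.append(page[start + len(target):])
    return common_suffix(lefts), common_prefix(rights)

def apply_lr(page, left, right):
    """Extract every string enclosed by the learned delimiters.
    Empty delimiters (no shared context) would make the rule useless."""
    pattern = re.escape(left) + r"(.*?)" + re.escape(right)
    return re.findall(pattern, page)

# Manually labeled training pages (hypothetical).
training = [
    ("<p>Name: <b>Kettle</b> in stock</p>", "Kettle"),
    ("<p>Name: <b>Toaster</b> sold out</p>", "Toaster"),
]
left, right = learn_lr(training)
print(apply_lr("<p>Name: <b>Lamp</b> in stock</p>", left, right))
```

The learned rule generalizes only as far as the training pages share context, which is why real systems need several labeled pages per site.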
 Supervised Learning
 Supervised learning is a ‘machine learning’ technique for
creating a function from training data.
 Documents are categorized
 The output can predict a class label of the input object (called
classification).
 Techniques used are
 Nearest Neighbor Classifier
 Feature Selection
 Decision Tree
 Removes terms in the training documents which
are statistically uncorrelated with the class labels
 Simple heuristics
 Stop words like “a”, “an”, “the” etc.
 Empirically chosen thresholds for ignoring “too
frequent” or “too rare” terms
 Discard “too frequent” and “too rare” terms
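The simple heuristics above can be sketched as a document-frequency filter. The thresholds `min_df` and `max_df_ratio`, the tiny stop list, and the sample documents are assumptions chosen for illustration; a statistical test of correlation with class labels is not shown here.

```python
from collections import Counter

STOP_WORDS = {"a", "an", "the"}

def select_terms(docs, min_df=2, max_df_ratio=0.9):
    """Keep terms that are not stop words, not too rare, and not too frequent."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))  # count each term once per document
    n = len(docs)
    return sorted(
        t for t, c in df.items()
        if t not in STOP_WORDS and c >= min_df and c / n <= max_df_ratio
    )

docs = [
    "the web is a graph",
    "the web is noisy",
    "a crawler walks the web graph",
    "the crawler is polite",
]
vocab = select_terms(docs)
print(vocab)
```

In practice the thresholds are tuned empirically on the training collection, as the slide notes.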
Examples of Discovered Patterns
 Association rules
 98% of AOL users also have E-trade accounts
 Classification
 People with age less than 40 and salary > 40k trade on-line
 Clustering
 Users A and B access similar URLs
 Outlier Detection
 User A spends more than twice the average amount of time
surfing on the Web
 Important for improving customization
 Provide users with pages, advertisements of interest
 Example profiles: on-line trader, on-line shopper
 Generate user profiles based on their access patterns
 Cluster users based on frequently accessed URLs
 Use classifier to generate a profile for each cluster
 Engage technologies
 Tracks web traffic to create anonymous user profiles of Web
surfers
 Has profiles for more than 35 million anonymous users
 Ads are a major source of revenue for Web
portals (e.g., Yahoo, Lycos) and E-commerce
sites
 Plenty of startups doing internet advertising
 Doubleclick, AdForce, Flycast, AdKnowledge
 Internet advertising is probably the “hottest”
web mining application today
 Scheme 1:
 Manually associate a set of ads with each user
profile
 For each user, display an ad from the set based on
profile
 Scheme 2:
 Automate association between ads and users
 Use ad click information to cluster users (each user
is associated with a set of ads that he/she clicked
on)
 For each cluster, find ads that occur most frequently
in the cluster and these become the ads for the set
of users in the cluster
Internet Advertising
 Use collaborative filtering (e.g. Likeminds, Firefly)
 Each user Ui has a rating for a subset of ads (based
on click information, time spent, items bought etc.)
 Rij - rating of user Ui for ad Aj
 Problem: Compute user Ui's rating for an unrated ad
Aj
[Figure: user-by-ad rating matrix over ads A1, A2, A3 with one unknown entry marked "?"]
Internet Advertising
 Key Idea: User Ui's rating for ad Aj is set to Rkj, where Uk
is the user whose rating of ads is most similar to Ui's
 User Ui's rating for an ad Aj that has not been previously
displayed to Ui is computed as follows:
 Consider each user Uk who has rated ad Aj
 Compute Dik, the distance between Ui's and Uk's ratings on
common ads
 Ui's rating for ad Aj = Rkj (Uk is the user with smallest Dik)
 Display to Ui the ad Aj with the highest computed rating
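The nearest-user rating scheme for ads can be sketched as follows. The slides do not say which distance is used, so Euclidean distance over commonly rated ads is an assumption here, and the `ratings` table, `distance`, and `predict` are hypothetical names.

```python
import math

# ratings[user][ad] = rating; a missing key means the ad was never rated.
ratings = {
    "U1": {"A1": 5, "A2": 1},
    "U2": {"A1": 5, "A2": 2, "A3": 4},
    "U3": {"A1": 1, "A2": 5, "A3": 1},
}

def distance(ri, rk):
    """Euclidean distance over the ads both users have rated (assumed metric)."""
    common = ri.keys() & rk.keys()
    if not common:
        return math.inf
    return math.sqrt(sum((ri[a] - rk[a]) ** 2 for a in common))

def predict(ratings, user, ad):
    """Copy the rating of the most similar user who has rated the ad."""
    candidates = [
        (distance(ratings[user], r), k)
        for k, r in ratings.items()
        if k != user and ad in r
    ]
    if not candidates:
        return None
    _, nearest = min(candidates)
    return ratings[nearest][ad]

print(predict(ratings, "U1", "A3"))
```

Here U2 rates A1 and A2 almost like U1 does, so U1's predicted rating for A3 is taken from U2.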
 With the growing popularity of E-commerce, systems to
detect and prevent fraud on the Web become important
 Maintain a signature for each user based on buying
patterns on the Web (e.g., amount spent, categories of
items bought)
 If buying pattern changes significantly, then signal fraud
 HNC software uses domain knowledge and neural
networks for credit card fraud detection
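A per-user signature check could be sketched like this. It is a crude rule-based illustration, not the neural-network approach attributed to HNC: the z-score threshold, the choice to flag any unseen category, and the function names are all assumptions.

```python
from statistics import mean, pstdev

def build_signature(purchases):
    """purchases: list of (amount, category) pairs from the user's history."""
    amounts = [amt for amt, _ in purchases]
    return {
        "mean": mean(amounts),
        "std": pstdev(amounts),
        "categories": {cat for _, cat in purchases},
    }

def is_suspicious(sig, amount, category, z_threshold=3.0):
    """Flag a purchase whose amount or category departs from the signature."""
    spread = sig["std"] or 1.0  # guard against zero spread
    z = abs(amount - sig["mean"]) / spread
    return z > z_threshold or category not in sig["categories"]

history = [(20.0, "books"), (25.0, "books"), (30.0, "music"), (25.0, "books")]
sig = build_signature(history)
print(is_suspicious(sig, 27.0, "books"), is_suspicious(sig, 900.0, "books"))
```

A production system would update the signature incrementally and weigh several signals rather than rely on a single hard threshold.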
 Given:
 A set of images
 Find:
 All images similar to a given image
 All pairs of similar images
 Sample applications:
 Medical diagnosis
 Weather prediction
 Web search engine for images
 E-commerce
 QBIC, Virage, Photobook
 Compute feature signature for each image
 QBIC uses color histograms
 WBIIS, WALRUS use wavelets
 Use spatial index to retrieve database image whose
signature is closest to the query's signature
 WALRUS decomposes an image into regions
 A single signature is stored for each region
 Two images are considered to be similar if they have
enough similar region pairs
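A color-histogram signature can be sketched in plain Python. This is an illustration only: images are assumed to be lists of RGB tuples, the L1 distance is an assumed similarity measure (not necessarily what QBIC uses), and a linear scan stands in for the spatial index mentioned above.

```python
def color_histogram(pixels, bins=4):
    """Normalized histogram over bins**3 quantized RGB cells."""
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in pixels:
        idx = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[idx] += 1
    total = len(pixels)
    return [h / total for h in hist]

def l1_distance(h1, h2):
    return sum(abs(a - b) for a, b in zip(h1, h2))

def most_similar(query, database):
    """Return the name of the database image whose signature is closest."""
    q = color_histogram(query)
    return min(database, key=lambda name: l1_distance(q, color_histogram(database[name])))

# Tiny hypothetical "images".
database = {
    "red": [(250, 10, 10)] * 4,
    "blue": [(10, 10, 250)] * 4,
}
query = [(240, 20, 20)] * 3 + [(10, 10, 240)]
print(most_similar(query, database))
```

A single global histogram ignores where colors occur, which is the limitation that region-based signatures such as those in WALRUS address.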
 Today's search engines are plagued by
problems:
 the abundance problem (99% of info of no
interest to 99% of people)
 limited coverage of the Web (internet
sources hidden behind search interfaces)
 Largest crawlers cover < 18% of all web
pages
 limited query interface based on keyword-
oriented search
 limited customization to individual users
 Today's search engines are plagued by
problems:
 Web is highly dynamic
 Lot of pages added, removed, and updated every
day
 Very high dimensionality
 Use Web directories (or topic hierarchies)
 Provide a hierarchical classification of documents (e.g., Yahoo!)
 Searches performed in the context of a topic restricts the search to only
a subset of web pages related to the topic
[Figure: topic hierarchy rooted at the Yahoo home page, with topics Recreation, Business, Science, News and subtopics Travel, Sports, Companies, Finance, Jobs]
 In the Clever project, hyper-links between Web pages
are taken into account when categorizing them
 Use a Bayesian classifier
 Exploit knowledge of the classes of immediate neighbors of
document to be classified
 Show that simply taking text from neighbors and using
standard document classifiers to classify page does not work
 Inktomi's Directory Engine uses “Concept Induction” to
automatically categorize millions of documents
 Objective: To deliver content to users quickly and
reliably
• Traffic management
• Fault management
[Figure: service provider network with routers and servers]
 While annual bandwidth demand is increasing ten-fold
on average, annual bandwidth supply is rising only by
a factor of three
 Result is frequent congestion at servers and on
network links
 during a major event (e.g., Princess Diana's death), an
overwhelming number of user requests can result in millions
of redundant copies of data flowing back and forth across the
world
 Olympic sites during the games
 NASA sites close to launch and landing of shuttles
 Key Ideas
 Dynamically replicate/cache content at multiple sites within the
network and closer to the user
 Multiple paths between any pair of sites
 Route user requests to server closest to the user or least
loaded server
 Use path with least congested network links
 Akamai, Inktomi
[Figure: service provider network with routers and servers, showing a request, a congested server, and a congested link]
 Need to mine network and Web traffic to determine
 What content to replicate?
 Which servers should store replicas?
 Which server to route a user request?
 What path to use to route packets?
 Network Design issues
 Where to place servers?
 Where to place routers?
 Which routers should be connected by links?
 One can use association rules, sequential pattern mining
algorithms to cache/prefetch replicas at server
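As a very reduced stand-in for sequential pattern mining, the sketch below counts which page most often follows another within a session and proposes it for prefetching. The session data, the confidence threshold `min_conf`, and the function names are hypothetical; real algorithms consider longer sequences and support thresholds.

```python
from collections import Counter, defaultdict

def successor_counts(sessions):
    """Count, for each page, how often each other page is requested next."""
    nxt = defaultdict(Counter)
    for session in sessions:
        for cur, following in zip(session, session[1:]):
            nxt[cur][following] += 1
    return nxt

def prefetch_candidate(nxt, page, min_conf=0.5):
    """Return the most likely next page if its confidence reaches min_conf."""
    counts = nxt.get(page)
    if not counts:
        return None
    target, hits = counts.most_common(1)[0]
    return target if hits / sum(counts.values()) >= min_conf else None

sessions = [
    ["/home", "/news", "/sports"],
    ["/home", "/news"],
    ["/home", "/shop"],
]
nxt = successor_counts(sessions)
print(prefetch_candidate(nxt, "/home"))
```

The rule "/home is followed by /news" holds in two of three sessions, so /news is the prefetch candidate for requests to /home.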
 Fault management involves
 Quickly identifying failed/congested servers and links in network
 Re-routing user requests and packets to avoid congested/down servers and
links
 Need to analyze alarm and traffic data to carry out root cause analysis of
faults
 Bayesian classifiers can be used to predict the root cause given a set of
alarms
[Figure: Total Sites Across All Domains, August 1995 - October 2007]
 Web data sets can be very large
 Tens to hundreds of terabytes
 Cannot mine on a single server!
 Need large farms of servers
 How to organize hardware/software to
mine multi-terabyte data sets
Without breaking the bank!
 Structured Data
 Unstructured Data
 OLE DB offers some solutions!
 Pages contain information
 Links are 'roads'
 How do people navigate the Internet?
 → Web Usage Mining (clickstream analysis)
 Information on navigation paths
available in log files
 Logs can be mined from a client or a
server perspective
 Why analyze Website usage?
 Knowledge about how visitors use Website could
 Provide guidelines to web site reorganization; Help prevent
disorientation
 Help designers place important information where the visitors
look for it
 Pre-fetching and caching web pages
 Provide adaptive Website (Personalization)
 Questions which could be answered
 What are the differences in usage and access patterns
among users?
 What user behaviors change over time?
 How do usage patterns change with quality of service
(slow/fast)?
 What is the distribution of network traffic over time?
 Analog – Web Log File Analyser
 Gives basic statistics such as
 number of hits
 average hits per time period
 what are the popular pages in your site
 who is visiting your site
 what keywords are users searching for to get to
you
 what is being downloaded
 http://www.analog.cx/
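Basic statistics of this kind can be computed from a server log with a short script. The sketch below is not how Analog works internally; it assumes log lines in the Common Log Format, and the sample lines, the `LOG_LINE` pattern, and `summarize` are made up for illustration.

```python
import re
from collections import Counter

# Assumed Common Log Format: host ident user [time] "request" status size
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

SAMPLE_LOG = """\
10.0.0.1 - - [10/Oct/2007:13:55:36 +0000] "GET /index.html HTTP/1.0" 200 2326
10.0.0.2 - - [10/Oct/2007:13:56:01 +0000] "GET /products.html HTTP/1.0" 200 512
10.0.0.1 - - [10/Oct/2007:13:57:12 +0000] "GET /index.html HTTP/1.0" 200 2326
"""

def summarize(log_text):
    """Count hits per page and per visiting host; unparseable lines are skipped."""
    pages, hosts = Counter(), Counter()
    for line in log_text.splitlines():
        m = LOG_LINE.match(line)
        if m:
            pages[m["path"]] += 1
            hosts[m["host"]] += 1
    return pages, hosts

pages, hosts = summarize(SAMPLE_LOG)
print(sum(pages.values()), pages.most_common(1), hosts.most_common(1))
```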
 Many methods designed to analyze structured data
 If we can represent documents by a set of attributes
we will be able to use existing data mining methods
 How to represent a document?
 Vector-based representation (referred to as “bag of
words” as it is invariant to permutations)
 Use statistics to add a numerical dimension to
unstructured text
 A document representation aims to capture what the
document is about
 One possible approach:
 Each entry describes a document
 Attributes describe whether or not a term appears in the
document
 Another approach:
 Each entry describes a document
 Attributes represent the frequency with
which a term appears in the document
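Both representations can be sketched in a few lines. The two sample documents and the whitespace tokenizer are assumptions made for illustration; each document becomes one row whose attributes are the vocabulary terms.

```python
from collections import Counter

docs = ["web mining mines the web", "text mining mines text"]

# One attribute per distinct term in the collection.
vocab = sorted({term for doc in docs for term in doc.split()})

def frequency_vector(doc):
    """Attribute value = number of times the term occurs in the document."""
    counts = Counter(doc.split())
    return [counts[t] for t in vocab]

def boolean_vector(doc):
    """Attribute value = 1 if the term occurs in the document, else 0."""
    return [int(c > 0) for c in frequency_vector(doc)]

print(vocab)
print(frequency_vector(docs[0]), boolean_vector(docs[0]))
```

Reordering the words of a document leaves both vectors unchanged, which is the permutation invariance behind the name "bag of words".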
 Stop-word removal: many words are not
informative and thus
 irrelevant for document representation: the, and, a,
an, is, of, that, …
 Stemming: reducing words to their root form
(Reduce dimensionality)
 A document may contain several occurrences of
words like fish, fishes, fisher, and fishers, but would
not be retrieved by a query with the keyword
“fishing”
 Different words that share the same word stem
should be represented by that stem (“fish”) instead of the
actual word
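Stop-word removal and stemming can be illustrated with a toy suffix stripper. This is not a real stemming algorithm such as Porter's; the suffix list is an assumption tuned only to the "fish" example above, and the stop list is just the words named on the slide.

```python
STOP_WORDS = {"the", "and", "a", "an", "is", "of", "that"}
SUFFIXES = ("ers", "ing", "es", "er", "s")  # checked in this order, longest first

def stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Lowercase, drop stop words, and reduce the remaining words to stems."""
    tokens = text.lower().split()
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(index_terms("The fishers and a fisher of fishes"))
print(stem("fishing"))
```

With both the documents and the query reduced to stems, a query for "fishing" matches documents that mention fish, fishes, fisher, or fishers.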

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Navigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern EnterpriseNavigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern Enterprise
 
Choreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software EngineeringChoreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software Engineering
 
Modernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaModernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using Ballerina
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 

Web content mining

  • 2. Specifies: The WWW is a huge, widely distributed, global information service centre for: information services (news, advertisements, consumer information, financial management, education, government, e-commerce, etc.); hyper-link information; access and usage information. The WWW provides rich sources of data for data mining.
  • 3. The Web: Opportunities & Challenges: 1. The amount of information on the Web is huge. 2. The coverage of Web information is very wide and diverse. 3. Information/data of almost all types exists on the Web. 4. Much of the Web information is semi-structured. 5. Much of the Web information is linked. 6. Much of the Web information is redundant.
  • 4. The Web: Opportunities & Challenges: 7. The Web is noisy. 8. The Web is also about services. 9. The Web is dynamic. 10. Above all, the Web is a virtual society. 11. The Web consists of the surface Web and the deep Web. Surface Web: pages that can be browsed using a browser. Deep Web: databases that can only be accessed through parameterized query interfaces.
  • 5. What is Web Data? Web data is: 1. Web content (text, images, records, etc.) 2. Web structure (hyperlinks, tags, etc.) 3. Web usage (HTTP logs, app server logs, etc.) 4. Intra-page structures 5. Inter-page structures 6. Supplemental data (profiles, registration information, cookies)
  • 7. Web Mining: Web mining is the use of data mining techniques to automatically discover and extract information from web documents and services; the application of data mining techniques to find interesting and potentially useful knowledge from web data; the application of data mining techniques to extract knowledge from web data, including web documents, hyperlinks between documents, usage logs of web sites, etc.
  • 8. Web Mining: Web mining is the use of data mining techniques to automatically discover and extract information from web documents and services; discovering useful information from the World-Wide Web and its usage patterns. My definition: using data mining techniques to make the web more useful and more profitable (for some) and to increase the efficiency of our interaction with the web.
  • 9. Why Mine the Web? Enormous wealth of information on the Web: financial information (e.g. stock quotes), book/CD/video stores (e.g. Amazon), restaurant information, car prices. Lots of data on user access patterns: Web logs contain the sequence of URLs accessed by users. It is possible to mine interesting nuggets of information, e.g. people who ski also travel frequently to Europe; tech stocks have corrections in the summer and rally from November until February.
  • 10. Why is Web Mining Different? The Web is a huge collection of documents, plus hyper-link information and access and usage information. The Web is very dynamic: new pages are constantly being generated. Challenge: develop new Web mining algorithms and adapt traditional data mining algorithms to exploit hyper-links and access patterns and to be incremental.
  • 11. Web Mining: Subtasks: Resource finding (retrieving intended documents); information selection/pre-processing (selecting and pre-processing specific information from the selected documents); generalization (discovering general patterns within and across web sites); analysis (validation and/or interpretation of mined patterns).
  • 12. Web Mining Issues: Size: grows at about 1 million pages a day; Google indexes 9 billion documents. Number of web sites: Netcraft survey says 72 million sites (http://news.netcraft.com/archives/web_server_survey.html). Diverse types of data: images, text, audio/video, XML, HTML.
  • 13. Web Mining Applications: E-commerce (infrastructure): generate user profiles, targeted advertising, fraud, similar image retrieval. Information retrieval (search) on the Web: automated generation of topic hierarchies, Web knowledge bases, extraction of schema for XML documents. Network management: performance management, fault management.
  • 15. Web Data Mining: Use of data mining techniques to automatically discover interesting and potentially useful information from Web documents and services. Web mining may be divided into three categories: 1. Web content mining 2. Web structure mining 3. Web usage mining
  • 17. Web Content Mining: Discovery of useful information from web contents, data, and documents. Web data contents: 1. text, 2. image, 3. audio, 4. video, 5. metadata, and 6. hyperlinks.
  • 18. Web Content Mining: Examines the contents of web pages as well as the results of web searches. Can be thought of as extending the work performed by basic search engines. Search engines have crawlers to search the web and gather information, indexing techniques to store the information, and query processing support to provide information to users. Web content mining is the process of extracting knowledge from web contents.
  • 19. Web Content Mining: Basic search provides no information about the structure of the content being searched for and no information about the categories of the documents that are found. More sophisticated tools are needed for searching or discovering Web content.
  • 20. Web Content Mining: Discovering useful information from the contents of Web pages. Web content is very rich, consisting of text, images, audio, video, etc., and metadata as well as hyperlinks. The data may be unstructured (free text), structured (data from a database), or semi-structured (HTML), although much of the Web is unstructured.
  • 21. Web Content Data Structure: Unstructured: free text. Semi-structured: HTML. More structured: table or database-generated HTML pages. Multimedia data: receives less attention than text or hypertext.
  • 22. Web Content Mining: Web content mining is related to data mining and text mining. It is related to data mining because many data mining techniques can be applied in Web content mining. It is related to text mining because much of the Web's content is text. The difference is that Web data are mainly semi-structured and/or unstructured, whereas data mining deals mainly with structured data and text mining with unstructured text.
  • 23. Web Content Data Structure: Web content consists of several types of data: text, image, audio, video, hyperlinks. Unstructured: free text. Semi-structured: HTML. More structured: data in tables or database-generated HTML pages. Note: much of the Web content data is unstructured text data.
  • 24. Semi-structured Data: Content is, in general, semi-structured. Example fields: Title, Author, Publication_Date, Length, Category, Abstract, Content.
  • 25. Web Content Mining: IR View: Unstructured documents: bag-of-words or phrase-based feature representation. Features can be boolean or frequency-based. Features can be reduced using different feature selection techniques. Word stemming combines morphological variations into one feature.
  • 26. Web Content Mining: IR View: Semi-structured documents: uses richer representations for features, based on information from the document structure (typically HTML and hyperlinks). Uses common data mining methods (whereas unstructured documents might call for more text mining methods).
  • 27. Web Content Mining: DB View: Tries to infer the structure of a Web site or to transform a Web site into a database, for better information management and better querying on the Web. Can be achieved by: finding the schema of Web documents; building a Web warehouse; building a Web knowledge base; building a virtual database.
  • 28. Web Content Mining: DB View: Mainly uses the Object Exchange Model (OEM), which represents semi-structured data (some structure, no rigid schema) by a labeled graph. The process typically starts with manual selection of Web sites for content mining. Main application: building a structural summary of semi-structured data (schema extraction or discovery).
  • 29. Techniques for Web Content Mining: Classification, Clustering, Association
  • 30. Web Content Mining: Topics: structured data extraction; unstructured text extraction; sentiment classification, analysis, and summarization of consumer reviews; information integration and schema matching; knowledge synthesis; template detection and page segmentation.
  • 31. Structured Data Extraction: The most widely studied research topic. A large amount of information on the Web is contained in regularly structured data objects (retrieved from databases). Such Web data records are important because they often present the essential information of their host pages, e.g., lists of products and services.
  • 32. Structured Data Extraction: Applications: integrated and value-added services, e.g., comparative shopping, meta-search and query, etc.
  • 33. Structured Data Extraction: Approaches: 1. Wrapper generation 2. Wrapper induction or wrapper learning 3. Automatic approach
  • 34. Structured Data Extraction: Approaches: Wrapper generation: write an extraction program for each website based on observed format patterns. Labor-intensive and time-consuming.
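A hand-written wrapper of the kind described on slide 34 might look like the sketch below. The page fragment, its markup pattern, and the names `PAGE`, `RECORD`, and `extract_records` are all hypothetical; the point is only that the extraction rule is tied to one site's observed format, which is why this approach is labor-intensive.

```python
import re

# Hypothetical product-list fragment; the markup is an assumption for illustration.
PAGE = """
<li class="item"><b>Red Kettle</b> <span class="price">$19.99</span></li>
<li class="item"><b>Blue Mug</b> <span class="price">$4.50</span></li>
"""

# Extraction rule written by hand after inspecting the page format.
RECORD = re.compile(
    r'<li class="item"><b>(?P<name>[^<]+)</b>\s*'
    r'<span class="price">\$(?P<price>[\d.]+)</span></li>'
)

def extract_records(html):
    """Return (name, price) tuples for every record matching the pattern."""
    return [(m.group("name"), float(m.group("price")))
            for m in RECORD.finditer(html)]

records = extract_records(PAGE)
```

If the site changes its template, the pattern has to be rewritten, which is the maintenance cost that wrapper induction and automatic approaches try to reduce.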
  • 38. Automatic Approach: Structured data objects on the web are normally database records, retrieved from databases and displayed in web pages with fixed templates. Find patterns/grammars in the web pages and then use them to extract data, e.g. IEPAD, MDR, ROADRUNNER, EXALG, etc.
  • 39. Wrapper Induction or Wrapper Learning: The main technique currently. The user first manually labels a set of training pages. A learning system then generates rules from the training pages. The resulting rules are then applied to extract target items from web pages, e.g. WIEN, Stalker, BWI, WL, etc.
  • 40. Supervised Learning: Supervised learning is a machine learning technique for creating a function from training data. Documents are categorized. The output can predict a class label of the input object (called classification). Techniques used are: nearest neighbor classifier, feature selection, decision tree.
  • 41. Feature selection removes terms in the training documents that are statistically uncorrelated with the class labels. Simple heuristics: remove stop words like “a”, “an”, “the”, etc.; use empirically chosen thresholds to discard “too frequent” and “too rare” terms.
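The simple heuristics on slide 41 can be sketched with document-frequency thresholds. This is a minimal illustration, not a statistical correlation test against class labels; the threshold values, the whitespace tokenizer, and the function name `select_terms` are assumptions.

```python
from collections import Counter

STOP_WORDS = {"a", "an", "the"}

def select_terms(docs, min_df=2, max_df_ratio=0.9):
    """Keep terms that are not stop words and are neither too rare nor too frequent.

    min_df: minimum number of documents a term must appear in (assumed threshold).
    max_df_ratio: maximum fraction of documents a term may appear in (assumed threshold).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document.
        doc_freq.update(set(doc.lower().split()))
    return sorted(
        term for term, count in doc_freq.items()
        if term not in STOP_WORDS
        and count >= min_df
        and count / n_docs <= max_df_ratio
    )

docs = ["the web is huge", "the web is noisy", "web mining is useful", "data mining"]
kept = select_terms(docs, min_df=2, max_df_ratio=0.7)
```

Here "web" and "is" appear in three of four documents and are dropped as too frequent, "huge" and "data" are dropped as too rare, and "the" is a stop word.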
  • 42. Examples of Discovered Patterns: Association rules: 98% of AOL users also have E-trade accounts. Classification: people with age less than 40 and salary > 40k trade on-line. Clustering: users A and B access similar URLs. Outlier detection: user A spends more than twice the average amount of time surfing on the Web.
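A figure such as "98% of AOL users also have E-trade accounts" is the confidence of an association rule. The sketch below shows how support and confidence are computed from a set of transactions; the toy data and function names are made up for illustration and do not reproduce the slide's figure.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    ante = support(transactions, antecedent)
    both = support(transactions, set(antecedent) | set(consequent))
    return both / ante if ante else 0.0

# Hypothetical per-user service sets.
users = [{"aol", "etrade"}, {"aol", "etrade"}, {"aol"}, {"etrade"}]
rule_conf = confidence(users, {"aol"}, {"etrade"})
```

In this toy data two of the three AOL users also have E-trade, so the rule's confidence is 2/3.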
  • 43. Important for improving customization: provide users with pages and advertisements of interest. Example profiles: on-line trader, on-line shopper. Generate user profiles based on their access patterns: cluster users based on frequently accessed URLs; use a classifier to generate a profile for each cluster. Engage Technologies tracks web traffic to create anonymous user profiles of Web surfers and has profiles for more than 35 million anonymous users.
  • 44. Ads are a major source of revenue for Web portals (e.g., Yahoo, Lycos) and e-commerce sites. Plenty of startups are doing internet advertising: Doubleclick, AdForce, Flycast, AdKnowledge. Internet advertising is probably the “hottest” web mining application today.
  • 45. Scheme 1: manually associate a set of ads with each user profile; for each user, display an ad from the set based on the profile. Scheme 2: automate the association between ads and users; use ad click information to cluster users (each user is associated with the set of ads that he/she clicked on); for each cluster, find the ads that occur most frequently in the cluster, and these become the ads for the set of users in the cluster.
  • 46. Internet Advertising: Use collaborative filtering (e.g. Likeminds, Firefly). Each user Ui has a rating for a subset of ads (based on click information, time spent, items bought, etc.). Rij is the rating of user Ui for ad Aj. Problem: compute user Ui's rating for an unrated ad Aj. [Figure: ads A1, A2, A3, with an unknown rating marked "?"]
  • 47. Internet Advertising: Key idea: user Ui's rating for ad Aj is set to Rkj, where Uk is the user whose rating of ads is most similar to Ui's. User Ui's rating for an ad Aj that has not been previously displayed to Ui is computed as follows: consider each user Uk who has rated ad Aj; compute Dik, the distance between Ui's and Uk's ratings on common ads; Ui's rating for ad Aj = Rkj, where Uk is the user with the smallest Dik. Display to Ui the ad Aj with the highest computed rating.
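The nearest-neighbour scheme on slide 47 can be sketched as follows. The slide does not say how the distance Dik is defined, so root-mean-square difference over commonly rated ads is used here as an assumption; the data and the names `distance` and `predict_rating` are hypothetical.

```python
import math

def distance(ratings_i, ratings_k):
    """D_ik: RMS difference over ads rated by both users (assumed metric)."""
    common = set(ratings_i) & set(ratings_k)
    if not common:
        return math.inf
    return math.sqrt(
        sum((ratings_i[a] - ratings_k[a]) ** 2 for a in common) / len(common)
    )

def predict_rating(ratings, user, ad):
    """Return R_kj of the user U_k closest to `user` among those who rated `ad`."""
    candidates = [
        (distance(ratings[user], other), name)
        for name, other in ratings.items()
        if name != user and ad in other
    ]
    candidates = [c for c in candidates if c[0] != math.inf]
    if not candidates:
        return None  # nobody comparable has rated this ad
    _, nearest = min(candidates)
    return ratings[nearest][ad]

# Hypothetical ratings R_ij on a 1-5 scale.
ratings = {
    "U1": {"A1": 5, "A2": 1},
    "U2": {"A1": 5, "A2": 2, "A3": 4},
    "U3": {"A1": 1, "A2": 5, "A3": 1},
}
predicted = predict_rating(ratings, "U1", "A3")
```

U2's ratings on A1 and A2 are much closer to U1's than U3's are, so U1's predicted rating for A3 is U2's rating, 4. To choose which ad to display, the same prediction would be computed for every unrated ad and the highest one taken.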
  • 48. With the growing popularity of e-commerce, systems to detect and prevent fraud on the Web become important. Maintain a signature for each user based on buying patterns on the Web (e.g., amount spent, categories of items bought). If the buying pattern changes significantly, then signal fraud. HNC Software uses domain knowledge and neural networks for credit card fraud detection.
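One very simple reading of "signature" on slide 48 is the mean and spread of a user's past purchase amounts, with a purchase flagged when it falls far outside that range. The sketch below assumes this z-score rule and a threshold of 3 standard deviations; it is an illustration only and is unrelated to HNC's neural-network approach.

```python
from statistics import mean, pstdev

def is_suspicious(history, amount, threshold=3.0):
    """Flag `amount` if it deviates from the user's past spending by more than
    `threshold` standard deviations (assumed rule, assumed threshold)."""
    if len(history) < 2:
        return False  # not enough data to form a signature
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return amount != mu
    return abs(amount - mu) / sigma > threshold

# Hypothetical purchase amounts for one user.
history = [20, 25, 22, 18, 24]
flag_large = is_suspicious(history, 500)
flag_normal = is_suspicious(history, 23)
```

A fuller signature would also cover the categories of items bought, as the slide mentions.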
  • 49. Given: a set of images. Find: all images similar to a given image; all pairs of similar images. Sample applications: medical diagnosis, weather prediction, Web search engine for images, e-commerce.
  • 50. QBIC, Virage, Photobook: compute a feature signature for each image. QBIC uses color histograms; WBIIS and WALRUS use wavelets. Use a spatial index to retrieve the database image whose signature is closest to the query's signature. WALRUS decomposes an image into regions; a single signature is stored for each region; two images are considered similar if they have enough similar region pairs.
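A color-histogram signature of the kind attributed to QBIC can be sketched in a few lines. This is not QBIC's implementation: the 4-bins-per-channel quantization, the L1 distance, and the linear scan (in place of a spatial index) are all assumptions, and images are represented as non-empty lists of (r, g, b) tuples.

```python
def color_histogram(pixels, bins=4):
    """Normalized histogram over bins**3 quantized RGB colors."""
    step = 256 // bins
    hist = [0] * (bins ** 3)
    for r, g, b in pixels:
        index = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[index] += 1
    total = len(pixels)
    return [count / total for count in hist]

def l1_distance(h1, h2):
    """Sum of absolute differences between two signatures."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def most_similar(query_pixels, database):
    """Name of the database image whose signature is closest to the query's."""
    query_sig = color_histogram(query_pixels)
    return min(
        database,
        key=lambda name: l1_distance(query_sig, color_histogram(database[name])),
    )

# Hypothetical tiny "images".
database = {
    "red": [(250, 10, 10)] * 4,
    "blue": [(10, 10, 250)] * 4,
}
query = [(240, 20, 5)] * 3 + [(10, 10, 240)]
best = most_similar(query, database)
```

A production system would precompute the signatures and index them rather than recompute and scan on every query.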
  • 52. Today's search engines are plagued by problems: the abundance problem (99% of info is of no interest to 99% of people); limited coverage of the Web (internet sources hidden behind search interfaces; the largest crawlers cover < 18% of all web pages); a limited query interface based on keyword-oriented search; limited customization to individual users.
  • 53. Today's search engines are plagued by problems: the Web is highly dynamic (lots of pages are added, removed, and updated every day); very high dimensionality.
  • 54. Use Web directories (or topic hierarchies): provide a hierarchical classification of documents (e.g., Yahoo!). A search performed in the context of a topic is restricted to only the subset of web pages related to that topic. [Figure: topic hierarchy under the Yahoo home page, with labels Recreation, Science, Business, News, Sports, Travel, Companies, Finance, Jobs]
  • 55. In the Clever project, hyper-links between Web pages are taken into account when categorizing them: use a Bayesian classifier; exploit knowledge of the classes of the immediate neighbors of the document to be classified; show that simply taking text from neighbors and using standard document classifiers to classify the page does not work. Inktomi's Directory Engine uses “Concept Induction” to automatically categorize millions of documents.
  • 56. Objective: to deliver content to users quickly and reliably: traffic management, fault management. [Figure labels: Service Provider Network, Router, Server]
  • 57. While annual bandwidth demand is increasing ten-fold on average, annual bandwidth supply is rising only by a factor of three. The result is frequent congestion at servers and on network links: during a major event (e.g., Princess Diana's death), an overwhelming number of user requests can result in millions of redundant copies of data flowing back and forth across the world. Other examples: Olympic sites during the games; NASA sites close to the launch and landing of shuttles.
  • 58. Key ideas: dynamically replicate/cache content at multiple sites within the network and closer to the user; provide multiple paths between any pair of sites; route user requests to the server closest to the user or to the least loaded server; use the path with the least congested network links. Examples: Akamai, Inktomi.
  • 60. Need to mine network and Web traffic to determine: What content to replicate? Which servers should store replicas? Which server should a user request be routed to? What path should be used to route packets? Network design issues: Where to place servers? Where to place routers? Which routers should be connected by links? One can use association rules and sequential pattern mining algorithms to cache/prefetch replicas at servers.
  • 61. Fault management involves: quickly identifying failed/congested servers and links in the network; re-routing user requests and packets to avoid congested/down servers and links. Alarm and traffic data need to be analyzed to carry out root cause analysis of faults. Bayesian classifiers can be used to predict the root cause given a set of alarms.
  • 62. Total Sites Across All Domains August 1995 - October 2007
  • 63. Web data sets can be very large: tens to hundreds of terabytes. They cannot be mined on a single server; large farms of servers are needed. How should hardware and software be organized to mine multi-terabyte data sets without breaking the bank?
  • 64. Structured data; unstructured data; OLE DB offers some solutions!
  • 65. Pages contain information; links are 'roads'. How do people navigate the Internet? This is the subject of Web usage mining (clickstream analysis). Information on navigation paths is available in log files. Logs can be mined from a client or a server perspective.
  • 66. Why analyze Website usage? Knowledge about how visitors use a Website could: provide guidelines for web site reorganization and help prevent disorientation; help designers place important information where the visitors look for it; support pre-fetching and caching of web pages; enable an adaptive Website (personalization). Questions which could be answered: What are the differences in usage and access patterns among users? What user behaviors change over time? How do usage patterns change with quality of service (slow/fast)? What is the distribution of network traffic over time?
  • 69. Analog (Web log file analyser): gives basic statistics such as: number of hits; average hits per time period; what the popular pages in your site are; who is visiting your site; what keywords users are searching for to get to you; what is being downloaded. http://www.analog.cx/
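Two of the statistics listed on slide 69, number of hits and popular pages, can be derived from a server access log with a short script. The sketch below is not Analog's implementation; it assumes log lines in Common Log Format, counts only requests with status 200, and uses made-up sample lines.

```python
import re
from collections import Counter

# Common Log Format: host ident user [time] "method path protocol" status size
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

def page_hits(lines):
    """Count successful (status 200) requests per path; skip unparsable lines."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("status") == "200":
            hits[match.group("path")] += 1
    return hits

# Hypothetical log excerpt.
sample_log = [
    '10.0.0.1 - - [10/Oct/2007:13:55:36 +0000] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.2 - - [10/Oct/2007:13:56:01 +0000] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2007:13:57:12 +0000] "GET /about.html HTTP/1.0" 200 512',
    '10.0.0.3 - - [10/Oct/2007:13:58:40 +0000] "GET /missing.html HTTP/1.0" 404 0',
]
hits = page_hits(sample_log)
popular = hits.most_common(1)
```

Grouping the same matches by `host` or by the hour in `time` would give the "who is visiting" and "hits per time period" statistics.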
  • 73. Content is, in general, semi-structured. Example fields: Title, Author, Publication_Date, Length, Category, Abstract, Content.
  • 74. Many methods are designed to analyze structured data. If we can represent documents by a set of attributes, we will be able to use existing data mining methods. How to represent a document? Vector-based representation (referred to as “bag of words” because it is invariant to permutations). Use statistics to add a numerical dimension to unstructured text.
  • 75. A document representation aims to capture what the document is about. One possible approach: each entry describes a document; each attribute describes whether or not a term appears in the document.
  • 76. Another approach: each entry describes a document; each attribute represents the frequency with which a term appears in the document.
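The two representations on slides 75 and 76 differ only in what each attribute stores. A minimal sketch, assuming whitespace tokenization and a vocabulary built from the documents themselves (the function names are hypothetical):

```python
from collections import Counter

def build_vocabulary(docs):
    """Sorted list of all distinct terms; defines the attribute order."""
    return sorted({term for doc in docs for term in doc.lower().split()})

def boolean_vector(doc, vocab):
    """1 if the term appears in the document, else 0 (slide 75)."""
    terms = set(doc.lower().split())
    return [1 if term in terms else 0 for term in vocab]

def frequency_vector(doc, vocab):
    """Number of times each term appears in the document (slide 76)."""
    counts = Counter(doc.lower().split())
    return [counts[term] for term in vocab]

docs = ["web mining web data", "data mining"]
vocab = build_vocabulary(docs)
bool_rows = [boolean_vector(d, vocab) for d in docs]
freq_rows = [frequency_vector(d, vocab) for d in docs]
```

Each row is one document and each column one term, so word order is discarded, which is why the representation is called a bag of words.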
  • 77. Stop word removal: many words are not informative and are thus irrelevant for document representation: the, and, a, an, is, of, that, … Stemming: reducing words to their root form (reduces dimensionality). A document may contain several occurrences of words like fish, fishes, fisher, and fishers, but would not be retrieved by a query with the keyword “fishing”. Different words that share the same word stem should be represented by the stem (“fish”) instead of the actual word.
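The fish/fishes/fisher/fishers/fishing example can be reproduced with a deliberately crude suffix stripper. This is not the Porter stemmer or any standard algorithm; the suffix list, the minimum stem length of 3, and the function names are assumptions chosen to make the slide's example work.

```python
STOP_WORDS = {"the", "and", "a", "an", "is", "of", "that"}
SUFFIXES = ("ers", "ing", "es", "er", "s")  # checked in this order

def stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop stop words, and stem the remaining tokens."""
    return [stem(word) for word in text.lower().split() if word not in STOP_WORDS]

stems = [stem(w) for w in ["fish", "fishes", "fisher", "fishers", "fishing"]]
tokens = preprocess("The fisher and the fishers of that fishing")
```

After preprocessing, a query for "fishing" and a document containing "fishers" both map to the stem "fish" and can be matched.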