Web Mining.pptx

Chapter-6
Web mining
Data mining in web applications

What is Web mining?
Web mining is the use of data mining techniques to automatically discover and
extract information from Web documents and services.

What is web mining?
• Mining of data related to the WWW
• Data present in Web pages or data related to Web activity
• Web data is classified as:
• Content of web pages
• Intra-page structure, which includes the code and actual linkage
• Usage data: how the pages are used by visitors
• User profiles

Web Content Mining
• Extension of basic search engines
• Search engines are keyword-based
• Traditional search engines use
• crawlers to search the Web and gather information
• indexing techniques to store the information
• query processing to provide fast and accurate information to users

Taxonomy of Web content mining
• Agent-based approach
• Uses software systems to perform the content mining
• E.g. search engines
• Database approach
• Views Web data as belonging to a database
• The Web is a multilevel database, and query languages are used for querying the data
• Content mining is a type of text mining

Text mining hierarchy
Techniques ordered from simple to complex:
• Keyword
• Term association
• Similarity search
• Classification and clustering
• Natural language processing

How do crawlers work?
• A robot, spider, or crawler is a program that traverses the hypertext structure
of the Web
• The page at which the crawler starts is referred to as the seed URL
• All links from that page are recorded and saved in a queue
• The new pages are in turn searched and their links are saved
• Crawlers collect information about each page, extract keywords, and store
indices for users
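
The queue-based traversal described above can be sketched as a breadth-first search. This is a minimal illustration, not a production crawler: `TOY_WEB` is a hypothetical in-memory link table standing in for real pages, and `crawl`, `get_links`, and `max_pages` are names assumed here for the example.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
TOY_WEB = {
    "seed": ["a", "b"],
    "a": ["b", "c"],
    "b": ["seed", "d"],
    "c": [],
    "d": ["c"],
}

def crawl(seed_url, get_links, max_pages=100):
    """Breadth-first traversal from seed_url; returns pages in visit order."""
    queue = deque([seed_url])   # links waiting to be visited
    seen = {seed_url}           # avoids queueing the same page twice
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):      # record all links from this page
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

order = crawl("seed", lambda url: TOY_WEB.get(url, []))
print(order)
```

A real crawler would replace `get_links` with a function that downloads the page and parses its hyperlinks, and would extract keywords for the index at the point where a page is visited.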

Crawling the web
A Web crawler is an Internet bot which systematically browses the World Wide
Web, typically for the purpose of Web indexing (web spidering). Web search
engines and some other sites use Web crawling or spidering software to
update their web content or indices of other sites' web content.
Including a robots.txt file can request bots to index only parts of a website, or
nothing at all.
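
As a small sketch of how a well-behaved crawler might honour robots.txt, the example below uses Python's standard `urllib.robotparser`. The rules in `ROBOTS_TXT`, the `example.com` URLs, and the agent name `ExampleBot` are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everything may be crawled except /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse rules from text, no network access

def allowed(url, agent="ExampleBot"):
    """Return True if the parsed rules let this agent fetch the URL."""
    return parser.can_fetch(agent, url)

print(allowed("https://example.com/index.html"))
print(allowed("https://example.com/private/data.html"))
```

Note that robots.txt is only a request: the crawler itself has to perform this check before fetching a page.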

Crawling
When Google visits your website to keep track of its content, it does so
with a program known as a web crawler, spider, Googlebot, internet bot, or
automatic indexer.
The process of crawling: Google uses a huge set of computers to fetch (or "crawl")
millions of pages on the web. Googlebot discovers new and updated pages,
with the help of sitemaps, so that they can be added to the Google index. The
crawler obtains information and adds it to the Google index; this is how crawling works.

Indexing
Once the web crawler has finished crawling, the results are
stored in the Google index. The Google index is like the
catalogue of a library, which lists information about all the
books in the library. If you want more pages included
in the Google index, you can create and submit a
Sitemap through Webmaster Tools.
The index is basically a big list of words and the web pages
that feature them; the location of each term within a page
is stored as well.
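
A "big list of words and the pages that feature them" is commonly called an inverted index. The sketch below is a toy version under simple assumptions (lowercasing and whitespace splitting only); the page IDs and texts are invented, and it is not a description of how Google's index is actually built.

```python
from collections import defaultdict

def build_index(pages):
    """Map each word to {page_id: [positions of the word in that page]}."""
    index = defaultdict(dict)
    for page_id, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(page_id, []).append(position)
    return index

# Hypothetical crawled pages.
pages = {
    "p1": "web mining uses data mining",
    "p2": "web crawlers gather data",
}
index = build_index(pages)
print(index["mining"])        # {'p1': [1, 4]}
print(sorted(index["data"]))  # ['p1', 'p2']
```

Answering a keyword query then becomes a dictionary lookup instead of a scan over every page.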

Ranking
“Ranking – Determining what each page is about, and
how it should rank for relevant queries”
Search engines have two major functions: crawling and
building an index, and providing search users with a
ranked list of the websites they've determined as the most
relevant.

Types of crawlers
• Periodic crawlers: activated periodically; every time it is activated it replaces
the existing index
• Incremental crawler: updates the index incrementally instead of replacing it
• Focused crawler: visits pages related to topics of interest

Focused crawling
[Figure: focused crawl starting from a seed URL and spreading across SITE-1 to SITE-6]

Web Harvesting
• Web harvesting, also known as Web scraping, is an increasingly popular
method used by websites to channel customers' searches to their website
• Web harvesting software automatically extracts information from the Web
and picks up where search engines leave off, doing the work the search
engine can't
• Example tool: FMiner

Virtual Web View
• Large amounts of unstructured data can be handled using a multiple layered
database (MLDB) on top of the web data
• Every layer of this database is more generalized than the preceding layer
• The upper layers are structured and can be accessed using SQL
• A view of the MLDB is called a Virtual Web View (VWV)

WebML
• Query language which supports data mining operations on MLDB
• Four primitive operations in WebML are
• COVERS
• COVERED BY
• LIKE
• CLOSE TO
SELECT *
FROM document in "www.engr.smu.edu"
WHERE ONE OF keywords COVERS "cat"

Personalization
• Contents of a web page are modified to fit the desires of the user
• Advertisements are sent to a potential customer based on specific knowledge about that customer
• Personalization is performed on the target web page
• Targeting is different from personalization
• In targeting, businesses display advertisements at other sites visited by their users
• In personalization, when a person visits a Web site, the advertising can be designed
specifically for that person

Personalization Contd….
• Personalization is a combination of clustering, classification and prediction
• Types of personalization are
• Manual techniques – user registration details
• Collaborative filtering
• Content-based filtering
• E.g. My Yahoo

Web Usage Mining
It deals with understanding user behavior in interacting
with the web or with a website.
Aim: to obtain information that may help web sites
reorganize or adapt to better suit the user.

Data examined to understand users' behaviour:
• Clicking pattern
• Browsing time
• Transactions

Web Usage Mining Applications
• Personalization
• Improve structure of a site’s Web pages
• Aid in caching and prediction of future page references
• Improve design of individual pages
• Improve effectiveness of e-commerce (sales and advertising)

Web Usage Mining Activities
• Preprocessing Web log
• Cleanse
• Remove extraneous information
• Sessionize
• Session: sequence of pages referenced by one user in one sitting
• Pattern Discovery
• Count patterns that occur in sessions
• A pattern is a sequence of page references in a session
• Similar to association rules
• Transaction: session
• Itemset: pattern (or subset)
• Order is important
• Pattern Analysis
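
The sessionizing and pattern-counting steps can be sketched as follows. Several things here are assumptions made for illustration: the log is already cleansed into `(user_id, timestamp, page)` tuples, a sitting ends after a 30-minute gap (`TIMEOUT`), and a pattern is a run of consecutive pages of a fixed length.

```python
from collections import Counter, defaultdict

TIMEOUT = 30 * 60  # assumed gap in seconds that ends a sitting

def sessionize(log, timeout=TIMEOUT):
    """log: (user_id, timestamp_seconds, page) tuples -> list of page sequences."""
    by_user = defaultdict(list)
    for user, ts, page in sorted(log, key=lambda r: (r[0], r[1])):
        by_user[user].append((ts, page))
    sessions = []
    for visits in by_user.values():
        current = [visits[0][1]]
        for (prev_ts, _), (ts, page) in zip(visits, visits[1:]):
            if ts - prev_ts > timeout:   # long pause: start a new session
                sessions.append(current)
                current = []
            current.append(page)
        sessions.append(current)
    return sessions

def count_patterns(sessions, length=2):
    """Count how many sessions contain each ordered run of `length` pages."""
    counts = Counter()
    for pages in sessions:
        found = {tuple(pages[i:i + length])
                 for i in range(len(pages) - length + 1)}
        counts.update(found)             # each session counts once per pattern
    return counts

# Hypothetical cleansed log.
log = [
    ("U1", 0, "A"), ("U1", 60, "B"), ("U1", 120, "C"),
    ("U1", 5000, "A"), ("U1", 5060, "B"),
    ("U2", 10, "B"), ("U2", 70, "A"),
]
sessions = sessionize(log)
patterns = count_patterns(sessions)
print(sessions)
print(patterns[("A", "B")], patterns[("B", "A")])
```

Because order matters, the pattern (A, B) and the pattern (B, A) are counted separately, which is the main difference from ordinary association-rule itemsets.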

Web Usage Mining Issues
• Identification of exact user not possible.
• Exact sequence of pages referenced by a user not possible due to caching.
• Session not well defined
• Security, privacy, and legal issues

Web Log Cleansing
• Replace source IP address with unique but non-identifying ID.
• Replace exact URL of pages referenced with unique but non-identifying ID.
• Delete error records and records containing non-page data (such as figures
and code)
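
A minimal sketch of these cleansing steps, assuming each log record has already been parsed into an `(ip, url, status)` tuple. The list of non-page file extensions, the "status 400 or above is an error" rule, and the `U1`/`P1` ID format are illustrative choices, not part of the original material.

```python
# Assumed extensions for non-page data such as figures and code.
NON_PAGE_EXT = (".gif", ".jpg", ".png", ".css", ".js")

def cleanse(records):
    """records: (ip, url, status) tuples -> list of (user_id, page_id) pairs."""
    user_ids, page_ids = {}, {}
    cleaned = []
    for ip, url, status in records:
        # Drop error records and requests for non-page data.
        if status >= 400 or url.lower().endswith(NON_PAGE_EXT):
            continue
        # Replace IP and URL with unique but non-identifying IDs.
        uid = user_ids.setdefault(ip, f"U{len(user_ids) + 1}")
        pid = page_ids.setdefault(url, f"P{len(page_ids) + 1}")
        cleaned.append((uid, pid))
    return cleaned

# Hypothetical raw log entries.
raw = [
    ("10.0.0.1", "/index.html", 200),
    ("10.0.0.1", "/logo.gif", 200),
    ("10.0.0.2", "/missing.html", 404),
    ("10.0.0.2", "/index.html", 200),
    ("10.0.0.1", "/products.html", 200),
]
cleaned = cleanse(raw)
print(cleaned)
```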

Web Structure Mining
• Creating a model of the web organization
• Used to classify Web pages or to create similarity measures between
documents
• Mine structure (links, graph) of the Web
• Techniques
• PageRank
• HITS

Page Rank
• Designed to increase the effectiveness of search engines and improve their
efficiency
• Used to
• Measure the importance of a page
• Prioritize the pages returned from a traditional search engine using keyword searching
• The PageRank of a page is calculated based on the number of pages that point to it

Page Rank
• Search engine that uses link structure to calculate a quality ranking (PageRank) for each page
• Intuition: PageRank can be seen as the probability that a "random surfer" visits a page
• A page is important if important pages link to it

PageRank
PageRank: a page is important if many important pages link to it.
(PageRank) + (Website Content) = Overall Rank in Results
• Link i → j: page i considers page j important
• The more important i is, the more important j becomes
• If i has many out-links, each link is less important

Let OutDegree(i) = number of out-links of page i
Adjust p_j = PageRank(j):
$$\mathrm{PageRank}(j) \leftarrow (1 - d) + d \sum_{i \to j} \frac{\mathrm{PageRank}(i)}{\mathrm{OutDegree}(i)}$$
The sum is the weighted sum of the importance of the pages i
referring to page j
d: damping factor
Parameter d is the probability that the random surfer follows a link on
the current page
(1 - d) is the probability that the surfer gets bored and starts on
a new random page

Repeat until pagerank vector converges…
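
The iteration can be sketched in a few lines of Python. This is a toy version under stated assumptions: the update used is PageRank(j) = (1 - d) + d * sum of PageRank(i) / OutDegree(i) over pages i linking to j, with d multiplying the link-following term; d = 0.85 is a commonly used value; the three-page graph is invented; and every page is assumed to have at least one out-link.

```python
def pagerank(links, d=0.85, tol=1e-8, max_iter=200):
    """links: {page: [pages it links to]}; every link target must be a key."""
    pages = list(links)
    rank = {p: 1.0 for p in pages}           # initial guess
    for _ in range(max_iter):
        new_rank = {p: 1.0 - d for p in pages}
        for i, outs in links.items():
            for j in outs:
                # Page i shares its rank equally among its out-links.
                new_rank[j] += d * rank[i] / len(outs)
        change = max(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:                      # rank vector has converged
            break
    return rank

# Hypothetical link graph: A -> B, A -> C, B -> C, C -> A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
print({p: round(r, 3) for p, r in ranks.items()})
```

In this graph C ends up with the highest rank because both A and B link to it, while B, which is reached only through one of A's two out-links, ends up lowest.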

Hyperlink-induced topic search (HITS)
• Finds hubs and authoritative pages
• HITS has two components
• First, relevant pages are found based on a given set of keywords
• Then hub and authority measures are associated with these pages, and the pages
with the highest values are returned

Authorities and hubs
• The algorithm produces two types of pages:
- Authority: pages that provide important, trustworthy information on a
given topic (highly referenced pages on a topic)
- Hub: pages that "point" to authorities
• A better hub points to many good authorities. A better authority is pointed
to by many good hubs

Definitions
• Authority: pages that provide important, trustworthy information on a
given topic
• Hubs: pages that contain links to authorities
• Indegree: number of incoming links to a given node, used to measure the
authoritativeness
• Outdegree: number of outgoing links from a given node, here it is used to
measure the hubness

HITS Algorithm
• Hubs point to lots of authorities.
• Authorities are pointed to by lots of hubs.
• Together they form a bipartite graph:
[Figure: bipartite graph with hubs on one side pointing to authorities on the other]

HITS
• Hubs: pages that link to a collection of authoritative pages on a broad topic
• Authorities: relevant pages of the highest quality on a broad topic

HITS
• Steps for discovering hubs and authorities on a specific topic:
• Collect a seed set of pages S (returned by a search engine)
• Expand the seed set to contain pages that point to or are pointed
to by pages in the seed set (links inside a site are removed)
• Iteratively update the hub weight h(p) and authority weight a(p)
for each page:
$$a(p) = \sum_{q \to p} h(q), \qquad h(p) = \sum_{p \to q} a(q)$$
• After a fixed number of iterations, the pages with the highest
hub/authority weights form the core of the community
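
The iterative update can be sketched as below, where a(p) sums h(q) over pages q linking to p and h(p) sums a(q) over pages q that p links to. The graph is a made-up expanded seed set, and the normalization after each round is an added assumption (it is standard practice to keep the weights bounded, but it is not stated in the steps above).

```python
import math

def hits(links, iterations=20):
    """links: {page: [pages it links to]} -> (hub weights, authority weights)."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # a(p) = sum of h(q) over all pages q that link to p
        auth = {p: 0.0 for p in pages}
        for q, outs in links.items():
            for p in outs:
                auth[p] += hub[q]
        # h(p) = sum of a(q) over all pages q that p links to
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Assumed step: scale each weight vector to unit length.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical expanded seed set: three hub-like pages, two authority-like pages.
links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hits(links)
best_authority = max(auth, key=auth.get)
print(best_authority)
```

Here `a1` gets the highest authority weight because all three hubs point to it, and `h1` and `h2` outrank `h3` as hubs because they point to both authorities.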

Strengths and weaknesses of HITS
• Strength: its ability to rank pages according to the
query topic, which may provide more relevant
authority and hub pages
• Weaknesses:
• It is easily spammed: it is quite easy to influence HITS,
since adding out-links to one's own page is so easy
• Topic drift: many pages in the expanded set may not be on topic
• Inefficiency at query time: the query-time evaluation is slow

Application areas of web mining
1. E-commerce: personalized marketing;
2. Fight against terrorism: classify threats;
3. Prediction;
4. And others :)

Future research directions
1. Multimedia data mining: a picture is worth a thousand words;
2. Multilingual knowledge extraction: web page translations;
3. Semantic web mining.
