SEMINAR
ON
WEB MINING
Abstract: With the explosive growth of information sources available on the World Wide Web, it has
become increasingly necessary for users to utilize automated tools to find the desired information
resources, and to track and analyze their usage patterns. These factors give rise to the necessity of
creating server-side and client-side intelligent systems that can effectively mine for knowledge. Web
mining can be broadly defined as the discovery and analysis of useful information from the World Wide
Web. This describes the automatic search of information resources available online, i.e., Web content
mining, and the discovery of user access patterns from Web servers, i.e., Web usage mining. In this
paper we present a detailed statistical formulation and experimental results to show how web mining can be
utilized to identify potential customers. Web usability is an important and sometimes controversial
research area. We propose an integrated system for web mining and usability study in which four core
modules are designed to address the fundamental issues in usability analysis. As an example of cross-module
analysis, we apply association rule mining to the link structure obtained from the web mining
module to automatically discover menus and structures in a web site.
Keywords: web mining, potential customers.
1. INTRODUCTION
With the explosive growth of information sources
available on the World Wide Web, it has become
increasingly necessary for users to utilize auto-
mated tools in order to find, extract, filter, and
evaluate the desired information and resources. In
addition, with the transformation of the web into
the primary tool for electronic commerce, it is
imperative for organizations and companies, who
have invested millions in Internet and Intranet
technologies, to track and analyze user access
patterns. These factors give rise to the necessity of
creating server-side and client-side intelligent
systems that can
effectively mine for knowledge both across the
Internet and in particular web localities.
At present most users rely on search engines such as
www.google.com to find the information they
require. However, the goal of a Web search engine
is only to discover resources on the Web. Each
search engine has its own characteristics and
employs different algorithms to index, rank, and
present web documents. But because all these
search engines are built on exact keyword
matching, and their query languages are artificial,
with restricted syntax and vocabulary rather than
natural language, there are defects that no search
engine can overcome.
Narrow search scope: Web pages indexed by any
search engine are only a tiny fraction of all pages
on the WWW, and the pages returned when a user
submits a query are another tiny fraction of the
pages the search engine has indexed.
Low precision: Users cannot browse all the
returned pages one by one, and most pages are
irrelevant to the user's intent; they are highlighted
and returned by the search engine merely because
they contain the keywords.
Web mining techniques could be used to solve the
information overload problem directly or
indirectly. However, Web mining techniques are
not the only tools. Other techniques and work
from different research areas, such as databases
(DB), Information Retrieval (IR), Natural
Language Processing (NLP), and the Web
document community, could also be used.
Information retrieval
Information retrieval is the art and science of
searching for information in documents, searching
for documents themselves, searching for metadata
which describes documents, or searching within
databases, whether relational standalone databases
or hypertext networked databases such as the
Internet or intranets, for text, sound, images or
data.
Natural language processing
Natural language processing (NLP) is concerned
with the interactions between computers and
human (natural) languages. NLP is a form of
human-to-computer interaction where the
elements of human language, be it spoken or
written, are formalized so that a computer can
perform value-adding tasks based on that
interaction.
Natural language understanding is sometimes
referred to as an AI-complete problem, because
natural-language recognition seems to require
extensive knowledge about the outside world and
the ability to manipulate it.
The purpose of Web mining is to develop methods
and systems for discovering models
of objects and processes on the World Wide Web
and for web-based systems that show adaptive
performance. Web Mining integrates three parent
areas: Data Mining (we use this term here also for
the closely related areas of Machine Learning and
Knowledge Discovery), Internet technology and
Page 3
World Wide Web, and for the more recent
SemanticWeb.
The World Wide Web has made an enormous
amount of information electronically accessible.
The use of email, news and markup languages like
HTML allow users to publish and read documents
at a world-wide scale and to communicate via chat
connections, including information in the form of
images and voice records. The HTTP protocol that
enables access to documents over the network via
Web browsers created an immense improvement
in communication and access to information. For
some years these possibilities were used mostly in
the scientific world but recent years have seen an
immense growth in popularity, supported by the
wide availability of computers and broadband
communication. The use of the internet for tasks
other than finding information and direct
communication is increasing, as can be seen from
the interest in “e-activities” such as e-commerce,
e-learning, e-government, and e-science.
Independently of the development of the Internet,
Data Mining expanded out of the academic world
into industry. Methods and their potential became
known outside the academic world and
commercial toolkits became available that allowed
applications at an industrial scale. Numerous
industrial applications have shown that models
can be constructed from data for a wide variety of
industrial problems. The World-Wide Web is an
interesting area for Data Mining because huge
amounts of information are available. Data
Mining methods can be used to analyze the
behavior of individual users, access patterns of
pages or sites, properties of collections of
documents.
Almost all standard data mining methods are
designed for data that are organized as multiple
“cases” that are comparable and can be viewed as
instances of a single pattern, for example patients
described by a fixed set of symptoms and
diseases, applicants for loans, customers of a shop.
A “case” is typically described by a fixed set of
features (or variables). Data on the Web have a
different nature. They are not so easily
comparable and have the form of free text, semi-
structured text (lists, tables) often with images and
hyperlinks, or server logs. The aim to learn
models of documents has given rise to the interest
in Text Mining methods for modeling documents
in terms of properties of documents. Learning
from the hyperlink structure has given rise to
graph-based methods, and server logs are used to
learn about user behavior.
Instead of searching for a document that matches
keywords, it should be possible to combine
information to answer questions. Instead of
retrieving a plan for a trip to Hawaii, it should be
possible to automatically construct a travel plan
that satisfies certain goals and uses opportunities
that arise dynamically. This gives rise to a wide
range of challenges. Some of them concern the
infrastructure, including the interoperability of
systems and the languages for the exchange of
information rather than data. Many challenges are
in the area of knowledge representation, discovery
and engineering. They include the extraction of
knowledge from data and its representation in a
form understandable by arbitrary parties, the
intelligent questioning and the delivery of answers
to problems as opposed to conventional queries
and the exploitation of formerly extracted
knowledge in this process.
2. WEB MINING
Web mining is the integration of information
gathered by traditional data mining methodologies
and techniques with information gathered over the
World Wide Web.
Data mining is also called knowledge discovery
in databases (KDD). It is the extraction of
useful patterns from data sources, e.g., databases,
texts, the web, images, etc. Patterns must be valid,
novel, potentially useful, and understandable.
Classic data mining tasks include:
 Classification: mining patterns that can
classify future (new) data into known
classes.
 Association rule mining: mining rules of
the form X → Y, where X and Y are sets
of data items. E.g., {Cheese, Milk} → {Bread}
[sup = 5%, conf = 80%].
 Clustering: identifying a set of similarity
groups in the data.
 Sequential pattern mining: a sequential
rule A → B says that event A will be
immediately followed by event B with a
certain confidence.
Fig .1 The Data Mining (KDD) Process
Just as data mining aims at discovering valuable
information that is hidden in conventional
databases, the emerging field of web mining aims
at finding and extracting relevant information that
is hidden in Web-related data, in particular hyper-
text documents published on the Web. Web
Mining is the extraction of interesting and
potentially useful patterns and implicit
information from artifacts or activity related to the
World Wide Web. There are roughly three
knowledge discovery domains that pertain to web
mining: Web Content Mining, Web Structure
Mining, and Web Usage Mining. Web content
mining is the process of extracting knowledge
from the content of documents or their
descriptions. Web document text mining, resource
discovery based on concepts indexing or agent
based technology may also fall in this category.
Web structure mining is the process of inferring
knowledge from the World Wide Web
organization and links between references and
referents in the Web. Finally, web usage mining,
also known as Web Log Mining, is the process of
extracting interesting patterns in web access logs.
The Web is a collection of inter-related files on one or
more Web servers. Web mining is a multi-
disciplinary effort that draws techniques from
fields such as information retrieval, statistics,
machine learning, natural language processing,
and others. Web mining has a different character
from traditional data mining. First, the objects of
Web mining are large numbers of Web documents
that are heterogeneously distributed, and each data
source is itself heterogeneous; second, Web
documents are semi-structured or unstructured and
lack semantics that machines can understand.
3. HISTORY
The term “Web Mining” was first used in [E1996],
where it was defined in a ‘task-oriented’ manner;
an alternate ‘data-oriented’ definition was given in
[CMS1997]. The first panel discussion on the topic
took place at ICTAI 1997 [SM1997], and it remains
a continuing forum:
 WebKDD workshops with ACM
SIGKDD, 1999, 2000, 2001, 2002, …; 60–90 attendees
 SIAM Web analytics workshop 2001,
2002, …
 Special issues of DMKD journal,
SIGKDD Explorations
 Papers in various data mining conferences
& journals
 Surveys [MBNL 1999, BL 1999,
KB2000]
This area of research has grown enormously due to
the tremendous growth of information sources
available on the Web and the recent interest in e-
commerce. Web mining is used to understand
customer behavior, evaluate the effectiveness of a
particular Web site, and quantify the success of a
marketing campaign.
3.1. Web mining subtasks
Web mining can be decomposed into the subtasks,
namely:
1. Resource finding: the task of retrieving
intended Web documents. By resource
finding we mean the process of retrieving
the data that is either online or offline from
the text sources available on the web such
as electronic newsletters, electronic
newswire, the text contents of HTML
documents obtained by removing HTML
tags, and also the manual selection of Web
resources.
2. Information selection and pre-
processing: automatically selecting and
pre-processing specific information from
retrieved Web resources. This is a kind of
transformation process applied to the
original data retrieved in the IR step. The
transformations can be either pre-processing
steps such as stop-word removal, stemming,
etc., or pre-processing aimed at obtaining
the desired representation, such as finding
phrases in the training corpus or
transforming the representation to
relational or first-order logic form.
3. Generalization: automatically discovering
general patterns at individual Web sites as
well as across multiple sites. Machine
learning or data mining techniques are
typically used in the process of
generalization. Humans play an important
role in the information or knowledge
discovery process on the Web since the
Web is an interactive medium.
4. Analysis: validating and/or interpretation
of the mined patterns.
4. CHALLENGES OF WEB
MINING
1. Today the World Wide Web is flooded with
billions of static and dynamic web pages
created with technologies such
as HTML, PHP, and ASP. It is a significant
challenge to find useful and relevant
information on the web.
2. Creating knowledge from available
information.
3. As the coverage of information is very
wide and diverse, personalization of the
information is a tedious process.
4. Learning customer and individual user
patterns.
5. Complexity of Web pages far exceeds the
complexity of any conventional text
document. Web pages on the internet lack
uniformity and standardization.
6. Much of the information present on web is
redundant, as the same piece of
information or its variant appears in many
pages.
7. The web is noisy i.e. a page typically
contains a mixture of many kinds of
information like, main content,
advertisements, copyright notice,
navigation panels.
8. The web is dynamic, information keeps on
changing constantly. Keeping up with the
changes and monitoring them are very
important.
9. The Web is not only about disseminating
information; it is also about services.
Many Web sites and pages enable people
to perform operations with input
parameters, i.e., they provide services.
10. The most important challenge is the
invasion of privacy. Privacy is considered
lost when information concerning an
individual is obtained, used, or
disseminated without their knowledge or
consent.
Techniques to Address the Problem
4.1 Preprocessing Technique - Web
Robots
When attempting to detect web robots from a
stream, it is desirable to monitor both the Web
server log and activity on the client side. The goal
is to distinguish individual Web sessions from
each other. A Web session is a series of requests
to web pages, i.e., visits to web pages. Since the
navigation patterns of web robots differ from
those of human users, the contribution from web
robots has to be eliminated before proceeding with
any further data mining, i.e., when we are studying
the web usage behaviour of real users.
One problem with identifying web robots is
that they may hide their identity behind a facade
resembling a conventional web browser.
Standard approaches to robot detection will fail to
detect such camouflaged web robots. Web robots
are necessary for tasks like website indexing, e.g.,
by Google, or detection of broken links. There is a
special file on every domain called "robots.txt"
which, according to the Robots Exclusion
Standard [M. Koster, 1994], is examined by a
robot in order to prevent it from visiting certain
pages of no interest. Malicious web robots,
however, are not guaranteed to follow the
directives in robots.txt.
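Python's standard library can evaluate such exclusion rules; a small sketch, where the rules and the crawler name are invented, and rp.parse() is used so nothing is fetched over the network:

```python
from urllib.robotparser import RobotFileParser

# Sketch: evaluating Robots Exclusion Standard rules with Python's
# standard-library parser. The rules and crawler name are invented.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]
rp = RobotFileParser()
rp.parse(rules)  # parse in-memory lines instead of fetching robots.txt

# A well-behaved robot checks before requesting each page.
allowed = rp.can_fetch("MyCrawler", "https://example.com/index.html")
blocked = rp.can_fetch("MyCrawler", "https://example.com/private/secret.html")
print(allowed, blocked)  # True False
```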
The feature classes chosen for evaluation are Temporal
Features, Page Features, Communication Features,
and Path Features. It is desirable to be able to
detect the presence of a web robot after as few
requests as possible; this is of course a trade-off
between computational effort and accuracy.
A simple decision model for determining the class
of a visitor is: first, check whether the visitor
requested robots.txt; if so, it is labeled a robot.
Second, the visitor is matched against a list of
previously known robots. Third, the referer "-" is
searched for; since robots seldom assign any value
to the referer field, this is a rewarding place to
look. If a robot is found, the list of known robots
is updated with the new one.
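The three checks of this decision model can be sketched as a small Python function; the session dictionary layout (requests, user_agent, referers) and the sample data are assumptions made for illustration:

```python
# Sketch of the three-step decision model described above.
def classify_session(session, known_robots):
    # 1. Requesting robots.txt is a strong robot signal.
    if any(r.endswith("/robots.txt") for r in session["requests"]):
        return "robot"
    # 2. Match against the list of previously known robots.
    if session["user_agent"] in known_robots:
        return "robot"
    # 3. Robots seldom set the Referer field ("-" in the log).
    if all(ref == "-" for ref in session["referers"]):
        return "robot"
    return "human"

known = {"Googlebot/2.1"}
s1 = {"requests": ["/robots.txt", "/a.html"],
      "user_agent": "SomeBot", "referers": ["-", "-"]}
s2 = {"requests": ["/a.html"],
      "user_agent": "Mozilla/5.0", "referers": ["/index.html"]}
print(classify_session(s1, known), classify_session(s2, known))  # robot human
```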
4.1.1 Avoiding Mislabeled Sessions
To avoid mislabeling of sessions, an ensemble
filtering approach [C. Brodley et al., 1999] is used:
instead of a single classification model, several
models are built and used to find classification
errors by identifying individual mislabeled
sessions.
The set of models acquired is used to classify all
sessions. For each session, the numbers of false
negative and false positive classifications are
counted. A large number of false positive
classifications implies that the session is currently
labeled a non-robot despite being predicted to be a
robot by most of the models. A large number of
false negative classifications implies that the
session might be a non-robot despite carrying the
robot label.
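A minimal sketch of this idea in Python, with stand-in threshold classifiers playing the role of the learned models (the session features and thresholds are invented):

```python
# Each classifier votes on every session; a session whose stored label
# disagrees with the majority vote is flagged as possibly mislabeled.
def ensemble_filter(sessions, labels, classifiers):
    suspicious = []
    for i, session in enumerate(sessions):
        votes = [clf(session) for clf in classifiers]
        majority = max(set(votes), key=votes.count)
        if majority != labels[i]:
            suspicious.append(i)
    return suspicious

# Stand-in models: label a session "robot" above a request-rate threshold.
classifiers = [lambda s, t=t: "robot" if s["req_rate"] > t else "human"
               for t in (5, 10, 20)]

sessions = [{"req_rate": 50}, {"req_rate": 1}]
labels = ["human", "human"]          # the first label looks wrong
print(ensemble_filter(sessions, labels, classifiers))  # [0]
```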
4.2 Mining Issues
4.2.1 Indirect Association
Common association methods employ patterns
that connect objects to each other. Sometimes,
however, it can be valuable to consider indirect
associations between objects. Indirect association
can be used, e.g., to represent the behaviour of
distinct user groups.
4.2.2 Clustering
With the growth of the World Wide Web, it can
be very time-consuming to analyze every web
page on its own. It is therefore a good idea to
cluster web pages based on attributes that can be
considered similar, in order to find more and less
successful attributes and patterns.
5. TAXONOMY OF WEB
MINING
In general, Web mining tasks can be classified
into three categories:
1. Web content mining,
2. Web structure mining and
3. Web usage mining.
However, there are two other different approaches
to categorize Web mining. In both, the categories
are reduced from three to two: Web content
mining and Web usage mining. In one, Web
structure is treated as part of Web Content while
in the other Web usage is treated as part of Web
Structure. All of the three categories focus on the
process of knowledge discovery of implicit,
previously unknown and potentially useful
information from the Web. Each of them focuses
on different mining objects of the Web.
Fig. 2 Taxonomy of Web mining
5.1. Web content mining
Web content mining is an automatic process that
goes beyond keyword extraction. Since the
content of a text document presents no machine-
readable semantics, some approaches have
suggested restructuring the document content into
a representation that can be exploited by
machines. The usual approach to exploiting known
structure in documents is to use wrappers to map
documents to some data model. Techniques using
lexicons for content interpretation are yet to come.
There are two groups of web content mining
strategies: Those that directly mine the content of
documents and those that improve on the content
search of other tools like search engines.
Web Content Mining deals with discovering
useful information or knowledge from web page
contents. Web content mining analyzes the
content of Web resources. Content data is the
collection of facts that are contained in a web
page. It consists of unstructured data such as free
texts, images, audio, video, semi-structured data
such as HTML documents, and a more structured
data such as data in tables or database generated
HTML pages. The primary Web resources that are
mined in Web content mining are individual
pages. They can be used to group, categorize,
analyze, and retrieve documents. Web content
mining could be differentiated from two points of
view:
5.1.1. Agent-Based Approach
This approach aims to assist or to improve the
information finding and filtering the information
to the users. This could be placed into the
following three categories:
a. Intelligent Search Agents: These agents
search for relevant information using
domain characteristics and user profiles to
organize and interpret the discovered
information.
b. Information Filtering/ Categorization:
These agents use information retrieval
techniques and characteristics of open
hypertext Web documents to automatically
retrieve, filter, and categorize them.
c. Personalized Web Agents: These agents
learn user preferences and discover Web
information based on these preferences,
and preferences of other users with similar
interest.
1. Intelligent Search Agents:
Several intelligent Web agents have been
developed that search for relevant information
using domain characteristics and user profiles
to organize and interpret the discovered
information. Agents such as Harvest, FAQ
Finder, Information Manifold, OCCAM, and
ParaSite rely either on pre-specified domain
information about particular types of
documents, or on hard coded models of the
information sources to retrieve and interpret
documents. Agents such as ShopBot and ILA
(Internet Learning Agent) interact with and
learn the structure of unfamiliar information
sources. ShopBot retrieves product
information from a variety of vendor sites
using only general information about the
product domain. ILA learns models of various
information sources and translates these into
its own concept hierarchy.
2. Information Filtering/Categorization:
A number of Web agents use various information
retrieval techniques and characteristics of open
hypertext Web documents to automatically
retrieve, filter, and categorize them. BO
(Bookmark Organizer) combines hierarchical
clustering techniques and user interaction to
organize a collection of Web documents based on
conceptual information.
3. Personalized Web Agents:
This category of Web agents learns user
preferences and discovers Web information
sources based on these preferences and those of
other individuals with similar interests (using
collaborative filtering). A few examples of
such agents include WebWatcher, PAINT, and
Syskill & Webert. For example, Syskill & Webert
utilizes a user profile and learns to rate Web pages
of interest using a Bayesian classifier.
5.1.2. Database Approach
The database approach aims at modeling the data
on the Web in a more structured form in order to
apply standard database querying mechanisms and
data mining applications to analyze it. The two
main categories are:
Multilevel databases: The main idea behind this
approach is that the lowest level of the database
contains semi-structured information stored in
various Web sources, such as hypertext
documents. At the higher level(s) meta data or
generalizations are extracted from lower levels
and organized in structured collections, i.e.
relational or object-oriented databases.
Web query systems: Many Web-based query
systems and languages utilize standard database
query languages such as SQL, structural
information about Web documents, and even
natural language processing for the queries used
in World Wide Web searches. W3QL combines
structure queries, based on the organization of
hypertext documents, with content queries, based
on information retrieval techniques. WebLog is a
logic-based query language for restructuring and
extracting information from Web information
sources. TSIMMIS extracts data from
heterogeneous and semi-structured information
sources and correlates them to generate an
integrated database representation of the extracted
information.
5.2. WEB STRUCTURE MINING
The World Wide Web can reveal more information
than just the information contained in documents.
For example, links pointing to a document
indicate the popularity of the document, while
links coming out of a document indicate the
richness or perhaps the variety of topics covered
in the document. This can be compared to
bibliographical citations. When a paper is cited
often, it ought to be important. The PageRank and
CLEVER methods take advantage of this
information conveyed by the links to find
pertinent web pages. By means of counters, higher
levels accumulate the number of artifacts subsumed
by the concepts they hold. Counters of hyperlinks
into and out of documents retrace the structure of
the web artifacts summarized.
Web structure mining is the process of
discovering structure information from the web.
The structure of a typical web graph consists of
web pages as nodes, and hyperlinks as edges
connecting related pages. This can be further
divided into two kinds based on the kind of
structure information used.
Fig.3 Web graph structure
Hyperlinks
A hyperlink is a structural unit that connects a
location in a web page to a different location,
either within the same web page or on a different
web page. A hyperlink that connects to a different
part of the same page is called an Intra-document
hyperlink, and a hyperlink that connects two
different pages is called an inter-document
hyperlink.
Document Structure
In addition, the content within a Web page can
also be organized in a tree structured format,
based on the various HTML and XML tags within
the page. Mining efforts here have focused on
automatically extracting document object model
(DOM) structures out of documents.
Web structure mining focuses on the hyperlink
structure within the Web itself. The different
objects are linked in some way. Simply applying
the traditional processes and assuming that the
events are independent can lead to wrong
conclusions. However, the appropriate handling of
the links could lead to potential correlations, and
then improve the predictive accuracy of the
learned models.
Two algorithms that have been proposed to deal
with those potential correlations are:
1. HITS and
2. PageRank.
5.2.1. PageRank
PageRank is a metric for ranking hypertext
documents according to their quality. The key
idea is that a page has a high
rank if it is pointed to by many highly ranked
pages. So the rank of a page depends upon the
ranks of the pages pointing to it. This process is
done iteratively till the rank of all the pages is
determined.
The rank of a page p can thus be written as:

PR(p) = d/n + (1 − d) · Σ_{q → p} PR(q) / OutDegree(q)

Here, n is the number of nodes in the graph, the
sum runs over all pages q that link to p,
OutDegree(q) is the number of hyperlinks on page
q, and the damping factor d is the probability that
at each page the random surfer gets bored and
requests another random page.
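The iterative computation can be sketched in a few lines of Python; the graph and parameter values are invented, and d follows the teleport-probability convention defined above:

```python
# Sketch: iterative PageRank on a tiny invented link graph. d is the
# probability that the surfer jumps to a random page.
def pagerank(graph, d=0.15, iterations=50):
    """graph maps each page to the list of pages it links to."""
    n = len(graph)
    rank = {p: 1.0 / n for p in graph}
    for _ in range(iterations):
        new_rank = {}
        for p in graph:
            # Sum the rank flowing in from every page q that links to p.
            incoming = sum(rank[q] / len(graph[q])
                           for q in graph if p in graph[q])
            new_rank[p] = d / n + (1 - d) * incoming
        rank = new_rank
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
print(sorted(ranks, key=ranks.get, reverse=True))  # ['C', 'A', 'B']
```

C ends up with the highest rank because it receives links from both A and B, matching the intuition that a page is important when important pages point to it.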
5.2.2. HITS
Hyperlink-induced topic search (HITS) is an
iterative algorithm for mining the Web graph to
identify topic hubs and authorities. Authorities are
the pages with good sources of content that are
referred by many other pages or highly ranked
pages for a given topic; hubs are pages with good
sources of links. The algorithm takes as input
search results returned by traditional text-indexing
techniques, and filters these results to identify
hubs and authorities. The number and weight of
hubs pointing to a page determine the page's
authority. The algorithm assigns weight to a hub
based on the authoritativeness of the pages it
points to. If many good hubs point to a page p,
then authority of that page p increases. Similarly if
a page p points to many good authorities, then hub
of page p increases.
After the computation, HITS outputs the pages
with the largest hub weights and the pages with
the largest authority weights, which form the
search results for the given topic.
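A minimal sketch of the hub/authority iteration in Python, on an invented example graph:

```python
# Sketch of the HITS iteration: authority scores grow with the hub
# scores of pages linking in; hub scores grow with the authority of
# pages linked to. Scores are normalized each round to stay bounded.
def hits(graph, iterations=50):
    """graph maps each page to the list of pages it links to."""
    hubs = {p: 1.0 for p in graph}
    auths = {p: 1.0 for p in graph}
    for _ in range(iterations):
        for p in graph:  # authority update uses current hub scores
            auths[p] = sum(hubs[q] for q in graph if p in graph[q])
        for p in graph:  # hub update uses the fresh authority scores
            hubs[p] = sum(auths[q] for q in graph[p])
        a = sum(v * v for v in auths.values()) ** 0.5
        h = sum(v * v for v in hubs.values()) ** 0.5
        auths = {p: v / a for p, v in auths.items()}
        hubs = {p: v / h for p, v in hubs.items()}
    return hubs, auths

graph = {"hub1": ["auth1", "auth2"], "hub2": ["auth1"],
         "auth1": [], "auth2": []}
hubs, auths = hits(graph)
print(max(auths, key=auths.get), max(hubs, key=hubs.get))  # auth1 hub1
```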
5.3. WEB USAGE MINING
Web usage mining is the process of extracting
useful information from server logs, i.e., users'
browsing histories. It is the process of
finding out what users are looking for on the
Internet.
Web usage mining focuses on techniques that
could predict the behavior of users while they are
interacting with the WWW. It collects the data
from Web log records to discover user access
patterns of Web pages. Usage data captures the
identity or origin of web users along with their
browsing behavior at a web site.
Web servers record and accumulate data about
user interactions whenever requests for resources
are received. Analyzing the web access logs of
different web sites can help understand the user
behavior and the web structure, thereby improving
the design of this colossal collection of resources.
There are two main tendencies in Web Usage
Mining driven by the applications of the
discoveries: General Access Pattern Tracking and
Customized Usage Tracking. The general access
pattern tracking analyzes the web logs to
understand access patterns and trends. These
analyses can shed light on better structure and
grouping of resource providers. Many web
analysis tools exist, but they are limited and
usually unsatisfactory. We have designed a web
log data mining tool, Web Log Miner, and
proposed techniques for using data mining and
OnLine Analytical Processing (OLAP) on treated
and transformed web access files. Applying data
mining techniques on access logs unveils
interesting access patterns that can be used to
restructure sites in a more efficient grouping,
pinpoint effective advertising locations, and target
specific users for specific selling ads.
Customized usage tracking analyzes individual
trends. Its purpose is to customize web sites to
users. The information displayed, the depth of the
site structure and the format of the resources can
all be dynamically customized for each user over
time based on their access patterns.
While it is encouraging and exciting to see the
various potential applications of web log file
analysis, it is important to know that the success
of such applications depends on what and how
much valid and reliable knowledge one can
discover from the large raw log data. Current web
servers store limited information about the
accesses. Some scripts custom-tailored for some
sites may store additional information. However,
for an effective web usage mining, an important
cleaning and data transformation
step before analysis may be needed.
In the use and mining of Web data, the most
direct source of data is the Web log files on the
Web server. Web log files record the visitor's
browsing behavior very clearly. Web log files
include the server log, agent log, and client log (IP
address, URL, page reference, access time,
cookies, etc.).
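These fields can be pulled out of a server log line with a regular expression; a sketch for the widely used Combined Log Format, where the sample line is invented:

```python
import re

# Sketch: extracting the fields mentioned above (IP address, access
# time, URL, referer, user agent) from one Combined Log Format line.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" '
    r'(?P<status>\d+) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = ('192.168.0.1 - - [10/Oct/2023:13:55:36 +0000] '
        '"GET /index.html HTTP/1.1" 200 2326 '
        '"http://example.com/start.html" "Mozilla/5.0"')

m = LOG_PATTERN.match(line)
print(m.group("ip"), m.group("url"), m.group("agent"))
```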
There are several available research projects and
commercial products that analyze those patterns
for different purposes. The applications generated
from this analysis can be classified as
personalization, system improvement, site
modification, business intelligence and usage
characterization.
The Web Mining Architecture
Fig. 4 Web Usage Mining Process
Web usage mining can be decomposed into
the following three main subtasks:
Fig 5. Web usage mining process
5.3.1. Pre-processing
It is necessary to perform data preparation to
convert the raw data for further processing. The
data actually collected is generally incomplete,
redundant, and ambiguous. In order to mine
knowledge more effectively, pre-processing the
collected data is essential. Preprocessing provides
accurate, concise data for data mining. Data
preprocessing includes data cleaning, user
identification, user session identification, access
path supplement, and transaction identification.
 The main task of data cleaning is to
remove the Web log redundant data which
is not associated with the useful data,
narrowing the scope of data objects.
 Identifying individual users must be done
after data cleaning. The purpose of user
identification is to establish each user's
uniqueness. It can be accomplished by means
of cookie technology, user registration
techniques, and investigative rules.
 User session identification should be done
on the basis of user identification. The
purpose is to divide each user's access
information into several separate session
processes. The simplest way is the time-
out estimation approach: when the
time interval between page requests
exceeds a given value, the user is assumed
to have started a new session.
 Because of the widespread use of page
caching and proxy servers,
the access path recorded in the Web server
access logs may not be the user's complete
access path. An incomplete access log
does not accurately reflect the user's access
patterns, so it is necessary to supplement the access
path. Path supplement can be achieved by
analyzing pages against the Web site's
topology.
 Transaction identification builds on
user session identification; its
purpose is to divide or combine
sessions into transactions according to
the demands of the data mining task, making
the data appropriate for the intended
analysis.
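The timeout heuristic for session identification described above can be sketched in a few lines. The 30-minute threshold used here is a commonly assumed default rather than a fixed standard, and the timestamps are invented for illustration:

```python
from datetime import datetime, timedelta

def sessionize(timestamps, timeout=timedelta(minutes=30)):
    """Group a sorted sequence of request times for one (already identified)
    user into sessions: a new session starts whenever the gap between
    consecutive requests exceeds the timeout."""
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= timeout:
            sessions[-1].append(t)   # continue the current session
        else:
            sessions.append([t])     # gap too large: start a new session
    return sessions

hits = [datetime(2023, 1, 1, 10, 0), datetime(2023, 1, 1, 10, 5),
        datetime(2023, 1, 1, 12, 0)]
# The two-hour gap before the last request starts a second session.
sessions = sessionize(hits)
```

In practice each request would also carry the URL visited, so that a session becomes the ordered list of pages used by the later mining steps.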
5.3.2. Pattern discovery
Pattern discovery mines effective, novel,
potentially useful and ultimately understandable
information and knowledge using mining
algorithms. Its methods include statistical analysis,
classification analysis, association rule discovery,
sequential pattern discovery, clustering analysis,
and dependency modeling.
 Statistical Analysis: Analysts
may perform different kinds of descriptive
statistical analyses (frequency, mean,
median, etc.) on variables
such as page views, viewing time and
length of a navigational path when
analyzing the session file. The
statistical information contained in
periodic web system reports can be
potentially useful for
improving system performance,
enhancing the security of the system,
facilitating the site modification task, and
providing support for marketing decisions.
 Association Rules: In the web domain,
pages that are most often referenced
together within a single server
session can be found by applying association rule
generation. Association rule mining
techniques can be used to discover
unordered correlations between items found
in a database of transactions.
 Clustering analysis: Clustering analysis is
a technique to group together users or data
items (pages) with similar
characteristics. Clustering of user
information or pages can facilitate the
development and execution of future
marketing strategies.
 Classification analysis: Classification is
the technique of mapping a data item into one
of several predefined classes.
Classification can be done using
supervised inductive learning algorithms
such as decision tree classifiers, naïve
Bayes classifiers, k-nearest neighbor
classifiers, Support Vector Machines, etc.
 Sequential Pattern: This technique
intends to find inter-session patterns,
such that the presence of one set of items
is followed by another in a time-ordered set
of sessions or episodes. Sequential patterns
also include some other types of temporal
analysis such as trend analysis, change
point detection, or similarity analysis.
 Dependency Modeling: The goal of this
technique is to establish a model that is
able to represent significant dependencies
among the various variables in the web
domain. The modeling technique provides
a theoretical framework for analyzing the
behavior of users, and is potentially useful
for predicting future web resource
consumption.
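As a small illustration of the association-rule step described above, page pairs that co-occur in many sessions can be found by simple counting. This toy sketch is not an optimized Apriori implementation, and the page names and threshold are invented:

```python
from itertools import combinations
from collections import Counter

def frequent_pairs(sessions, min_support=2):
    """Return page pairs that co-occur in at least `min_support` sessions,
    the kind of unordered correlation the association-rule step looks for."""
    counts = Counter()
    for session in sessions:
        # set() ignores repeat visits within a session; sorted() gives a
        # canonical order so (a, b) and (b, a) are counted together.
        for pair in combinations(sorted(set(session)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

sessions = [["/home", "/products", "/cart"],
            ["/home", "/products"],
            ["/home", "/about"]]
pairs = frequent_pairs(sessions)
# Only ("/home", "/products") appears in two sessions and survives.
```

A full association-rule miner would extend such frequent itemsets to rules with support and confidence measures, e.g. via the Apriori algorithm cited in the literature.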
5.3.3. Pattern Analysis
Pattern Analysis is a final stage of the whole web
usage mining. The goal of this process is to
eliminate the irrelevant rules or patterns and to
understand, visualize and to extract the interesting
rules or patterns from the output of the pattern
discovery process. The output of web mining
algorithms is often not in the form suitable for
direct human consumption, and thus need to be
transform to a format can be assimilate easily.
There are two most common approaches for the
patter analysis. One is to use the knowledge query
mechanism such as SQL, while another is to
construct multi-dimensional data cube before
perform OLAP operation.
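The SQL-style knowledge query mechanism mentioned above can be illustrated with an in-memory database of discovered rules: an analyst filters out uninteresting rules with an ordinary query. The table layout, thresholds and rule values here are purely illustrative:

```python
import sqlite3

# Store discovered association rules in a relational table, then let the
# analyst keep only those above chosen support/confidence thresholds.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rules (antecedent TEXT, consequent TEXT, "
             "support REAL, confidence REAL)")
conn.executemany("INSERT INTO rules VALUES (?, ?, ?, ?)", [
    ("/home", "/products", 0.40, 0.80),   # strong rule
    ("/about", "/careers", 0.02, 0.30),   # weak rule, filtered out below
])
interesting = conn.execute(
    "SELECT antecedent, consequent FROM rules "
    "WHERE support >= 0.1 AND confidence >= 0.6").fetchall()
# Only the /home -> /products rule passes both thresholds.
```

The OLAP alternative mentioned in the text would instead load the usage data into a multi-dimensional cube and slice it along dimensions such as time, page and user group.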
6. APPLICATIONS OF WEB
MINING
Web mining techniques can be applied to
understand and analyze such data and turn it into
actionable information that can support a web-
enabled electronic business in improving its
marketing, sales and customer support operations.
Based on the patterns found and the original cache
and log data, many applications can be developed.
Some of them are:
In order to achieve personalized service, a site first
has to collect information on clients to
grasp customers' spending habits, hobbies,
consumer psychology, etc., and can then
provide targeted, personalized service. Obtaining
consumer spending behavior patterns is very
difficult with traditional marketing approaches,
but it can be done using Web mining techniques.
Early in the life of Amazon.com, its visionary
CEO Jeff Bezos observed: "In a traditional (brick-
and-mortar) store, the main effort is in getting a
customer to the store. Once a customer is in the
store they are likely to make a purchase, since the
cost of going to another store is high, and thus the
marketing budget (focused on getting the
customer to the store) is in general much higher
than the in-store customer experience budget
(which keeps the customer in the store). In the
case of an on-line store, getting in or out requires
exactly one click, and thus the main focus must be
on customer experience in the store." This
fundamental observation has been the driving
force behind Amazon's comprehensive approach to
personalized customer experience, based on the
mantra "a personalized store for every customer." A
host of Web mining techniques, e.g. associations
between pages visited and click-path analysis,
are used to improve the customer's experience
during a store visit. Knowledge gained from Web
mining is the key intelligence behind Amazon's
features such as instant recommendations,
purchase circles, wish-lists, etc.
6.1.Improve the website design
The attractiveness of a site depends on the sound
design of its content and organizational
structure. Web mining can provide details of user
behavior, giving web site designers a basis for
decision making to improve the design of the site.
6.2.System Improvement
Performance and other service quality attributes
are crucial to user satisfaction with services such
as databases, networks, etc. Similar qualities are
expected by the users of Web services. Web
usage mining provides the key to understanding
Web traffic behavior, which can in turn be used
for developing policies for Web caching, network
transmission, load balancing, or data distribution.
Security is an acutely growing concern for Web-
based services, especially as electronic commerce
continues to grow at an exponential rate. Web
usage mining can also provide patterns which are
useful for detecting intrusion, fraud, attempted
break-ins, etc.
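As a toy illustration of turning usage statistics into a caching policy, one might simply cache the most frequently requested pages observed in the logs. The URLs and cache size below are invented; a real policy would also weigh recency, page size and freshness:

```python
from collections import Counter

def pages_to_cache(requested_urls, cache_size):
    """Pick the `cache_size` most frequently requested pages from a log."""
    freq = Counter(requested_urls)
    return [url for url, _ in freq.most_common(cache_size)]

log = ["/home", "/products", "/home", "/cart", "/home", "/products"]
hot = pages_to_cache(log, 2)
# "/home" (3 hits) and "/products" (2 hits) are the pages worth caching.
```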
6.3.Predicting trends
Web mining can predict trends within the retrieved
information to indicate future values. For
example, an electronic auction company provides
information about items to auction, previous
auction details, etc. Predictive modeling can be
utilized to analyze the existing information and to
estimate the values of auctioned items or the number
of people participating in future auctions.
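The predictive modeling described above can be as simple as fitting a least-squares trend line to historical figures. In this sketch the auction participation numbers are invented purely for illustration:

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Participants in the last four auctions, indexed by auction number.
auctions = [1, 2, 3, 4]
participants = [100, 120, 140, 160]
slope, intercept = fit_line(auctions, participants)
predicted_next = slope * 5 + intercept  # extrapolate to auction 5
```

Real trend analysis would of course use richer models (seasonality, confidence intervals), but the principle of extrapolating from mined historical data is the same.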
The predicting capability of the mining
application can also benefit society by identifying
criminal activities.
6.4.To carry out intelligent business
A customer's visit cycle in network marketing
activities can be divided into four steps: being
attracted, staying, purchasing and leaving. Web mining
technology can uncover the customer's motivation
by analyzing the customer's click-stream
information, in order to help sales staff craft reasonable
strategies, customize personalized pages for
customers, and carry out targeted information
feedback and advertising. In short, in e-commerce
network marketing, using Web mining techniques
to analyze large amounts of data can uncover the
patterns of goods consumption and customers'
access patterns, helping businesses
develop effective marketing strategies and enhance
enterprise competitiveness.
Companies can establish better customer
relationships by giving customers exactly what they
need. They can understand the needs of the
customer better and react to customer
needs faster. They can find, attract and
retain customers, and save on production
costs by utilizing the acquired insight into customer
requirements. They can increase profitability through
targeted pricing based on the profiles created. They
can even identify a customer who might defect to a
competitor; the company can then try to retain the
customer by providing promotional offers to that
specific customer, thus reducing the risk of losing
the customer.
7. RESEARCH DIRECTIONS
The techniques being applied to Web content
mining draw heavily from the work on
information retrieval, databases, intelligent agents,
etc. Since most of these techniques are well
known and reported elsewhere, we have focused
on Web usage mining in this survey instead of
Web content mining. In the following we provide
some directions for future research.
7.1 Data Pre-Processing for Mining
Web usage data is collected in various ways, each
mechanism collecting attributes relevant for its
purpose. There is a need to pre-process the data to
make it easier to mine for knowledge.
Specifically, we believe that issues such as
instrumentation and data collection, data
integration and transaction identification need to
be addressed. Clearly improved data quality can
improve the quality of any analysis on it. A
problem in the Web domain is the inherent
conflict between the analysis needs of the analysts
(who want more detailed usage data collected),
and the privacy needs of users (who want as little
data collected as possible). This has led to the
development of cookie files on one side and cache
busting on the other. The emerging OPS standard
on collecting profile data may be a compromise on
what can and will be collected. However, it is not
clear how much compliance with it can be
expected. Hence, there will be a continual need to
develop better instrumentation and data collection
techniques, based on whatever is possible and
allowable at any point in time. Portions of Web
usage data exist in sources as diverse as Web
server logs, referral logs, registration files, and
index server logs. Intelligent integration and
correlation of information from these diverse
sources can reveal usage information which may
not be evident from any one of them. Techniques
from data integration should be examined for this
purpose. Web usage data collected in various logs
is at a very fine granularity. Therefore, while it
has the advantage of being extremely general and
fairly detailed, it also has the corresponding
drawback that it cannot be analyzed directly, since
the analysis may start focusing on micro trends
rather than on the macro trends. On the other
hand, the issue of whether a trend is micro or
macro depends on the purpose of a specific
analysis.
Hence, we believe there is a need to group
individual data collection events into groups,
called Web transactions, before feeding them to the
mining system. While researchers have proposed techniques
to do so, more attention needs to be given to this
issue.
7.2 The Mining Process
The key component of Web mining is the mining
process itself. As discussed in this paper, Web
mining has adapted techniques from the field of
data mining, databases, and information retrieval,
as well as developing some techniques of its own,
e.g. path analysis. A lot of work still remains to be
done in adapting known mining techniques as well
as developing new ones. Web usage mining
studies reported to date have mined for association
rules, temporal sequences, clusters, and path
expressions. As the manner in which the Web is
used continues to expand, there is a continual need
to figure out new kinds of knowledge about user
behavior that needs to be mined. The quality of a
mining algorithm can be measured both in terms
of how effective it is in mining for knowledge and
how efficient it is in computational terms. There
will always be a need to improve the performance
of mining algorithms along both these dimensions.
Usage data collection on the Web is incremental
in nature. Hence, there is a need to develop
mining algorithms that take as input the existing
data, mined knowledge, and the new data, and
develop a new model in an efficient manner.
Usage data collection on the Web is also
distributed by its very nature. If all the data were
to be integrated before mining, a lot of valuable
information could be extracted. However, an
approach of collecting data from all possible
server logs is both non-scalable and impractical.
Hence, there needs to be an approach where
knowledge mined from various logs can be
integrated together into a more comprehensive
model.
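One simple way to integrate knowledge mined from several server logs, in the spirit described above, is for each site to ship its local pattern counts rather than its raw logs, and for these to be summed into one global model. The server names and counts below are invented:

```python
from collections import Counter

def merge_pattern_counts(per_server_counts):
    """Combine locally mined pattern counts from several servers into one
    global count table, without ever centralizing the raw log data."""
    total = Counter()
    for counts in per_server_counts:
        total.update(counts)  # Counter.update sums overlapping keys
    return total

server_a = Counter({("/home", "/products"): 120, ("/home", "/about"): 15})
server_b = Counter({("/home", "/products"): 80})
merged = merge_pattern_counts([server_a, server_b])
```

For frequent-itemset counts this summation is exact; for more complex mined models (clusters, classifiers) the integration step is an open research problem, as the text notes.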
7.3 Analysis of Mined Knowledge
The output of knowledge mining algorithms is
often not in a form suitable for direct human
consumption, and hence there is a need to develop
techniques and tools for helping an analyst better
assimilate it. Issues that need to be addressed in
this area include usage analysis tools and
interpretation of mined knowledge.
There is a need to develop tools which incorporate
statistical methods, visualization, and human
factors to help better understand the mined
knowledge. Section 4 provided a survey of the
current literature in this area. One of the open
issues in data mining, in general, and Web mining,
in particular, is the creation of intelligent tools that
can assist in the interpretation of mined
knowledge. Clearly, these tools need to have
specific knowledge about the particular problem
domain to do any more than altering based on
statistical attributes of the discovered rules or
patterns. In Web mining, for example, intelligent
agents could be developed that based on
discovered access patterns, the topology of the
Web locality, and certain heuristics derived from
user behavior models, could give
recommendations about changing the physical
link structure of a particular site.
8. WEB MINING PROS & CONS
8.1. PROS
Web mining has many advantages that
make this technology attractive to
corporations, including government agencies.
It has enabled e-commerce sites to do
personalized marketing, which eventually results
in higher trade volumes. Government agencies
use this technology to classify threats and
fight terrorism, and its predictive capability
can benefit society by identifying criminal
activities. As discussed in Section 6, companies
can establish better customer relationships,
understand and react to customer needs faster,
save on production costs, increase profitability
through targeted pricing, and retain customers
who might otherwise defect to competitors.
Prospects
The future of Web Mining will to a large extent
depend on developments of the Semantic Web.
The role of Web technology continues to grow in
industry, government, education and entertainment.
This means that the range of data to which Web
Mining can be applied also increases. Even
without technical advances, the role of Web
Mining technology will become larger and more
central. The main technical advances will be in
increasing the types of data to which Web Mining
can be applied. In particular Web Mining for text,
images and video/audio streams will increase the
scope of current methods. These are all active
research topics in Data Mining and Machine
Learning and the results of this can be exploited
for Web Mining.
The second type of technical advance comes from
the integration of Web Mining with other
technologies in application contexts. Examples are
information retrieval, ecommerce, business
process modeling, instruction, and health care.
The widespread use of web-based systems in these
areas makes them amenable to Web Mining.
In this section we outline current generic practical
problems that will be addressed, technology
required for these solutions, and research issues
that need to be addressed for technical progress.
Knowledge Management
Knowledge Management is generally viewed as a
field of great industrial importance. Systematic
management of the knowledge that is available in
an organization can increase its ability
to make optimal use of that knowledge
and to react effectively to new developments,
threats and opportunities. Web Mining technology
creates the opportunity to integrate knowledge
management more tightly with business processes.
Standardization efforts that use Semantic Web
technology and the availability of ever more data
about business processes on the internet create
opportunities for Web Mining technology. More
widespread use of Web Mining for Knowledge
Management requires the availability of low-
threshold Web Mining tools that can be used by
non-experts and that can flexibly be integrated in a
wide variety of tools and systems.
E-commerce
The increased use of XML/RDF to describe
products, services and business processes
increases the scope and power of Data Mining
methods in e-commerce. Another direction is the
use of text mining methods for modeling
technical, social and commercial developments.
This requires advances in text mining and
information extraction.
E-learning
The Semantic Web provides a way of organizing
teaching material, and usage mining can be
applied to suggest teaching materials to a learner.
This opens opportunities for Web Mining. For
example, a recommending approach can be
followed to find courses or teaching material for a
learner. The material can then be organized with
clustering techniques, and ultimately be shared on
the web again, e.g. within a peer-to-peer network.
Web mining methods can be used to construct a
profile of user skills, competence or knowledge
and of the effect of instruction. Another possibility
is to use web mining to analyze student
interactions for teaching purposes. The internet
supports students who collaborate during learning.
Web mining methods can be used to monitor this
process, without requiring the teacher to follow
the interactions in detail. Current web mining
technology already provides a good basis for this.
Research and development must be directed
toward important characteristics of interactions
and to integration in the instructional process.
E-government
Many activities in governments involve large
collections of documents. Think of regulations,
letters, announcements, reports. Managing access
and availability of this amount of textual
information can be greatly facilitated by a
combination of Semantic Web standardization and
text mining tools. Many internal processes in
government involve documents, both textual and
structured. Web mining creates the opportunity to
analyze these governmental processes and to
create models of the processes and the information
involved. It seems likely that standard ontologies
will be used in governmental organizations and
the standardization that this produces will make
Web Mining more widely applicable and more
powerful than it currently is. The issues involved
are those of Knowledge Management. Also
governmental activities that involve the general
public include many opportunities for Web
Mining. Like shops, governments that offer
services via the internet can analyze their
customers' behavior to improve their services.
Information about social processes can be
observed and monitored using Web Mining, in the
style of marketing analyses. Examples of this are
the analysis of research proposals for the
European Commission and the development of
tools for monitoring and structuring internet
discussions of non-political issues. Enabling
technologies for this are more advanced
information extraction methods and tools.
Health care
Medicine is one of the Web’s fastest-growing
areas. It profits from Semantic Web technology in
a number of ways: First, as a means of organizing
medical knowledge - for example, the widely-used
taxonomy International Classification of Diseases
and its variants serve to organize telemedicine
portal content and interfaces. The Unified
Medical Language System
(http://www.nlm.nih.gov/research/umls) integrates
this classification and many others. Second, health
care institutions can profit from interoperability
between the different clinical information systems
and semantic representations of member
institutions’ organization and services. Usage
analyses of medical sites can be employed for
purposes such as Web site evaluation and the
inference of design guidelines for international
audiences, or the detection of epidemics. In
general, similar issues arise, and the same
methods can be used for analysis and design as in
other content classes of Web sites. Some of the
facets of Semantic Web Mining that we have
mentioned in this article form specific challenges,
in particular: the privacy and security of patient
data, the semantics of visual material, and the
cost-induced pressure towards national and
international integration of Web resources.
E-science
In E-Science two main developments are visible.
One is the use of text mining and Data Mining for
information extraction to extract information from
large collections of textual documents. Much
information is “buried” in the huge scientific
literature and can be extracted by combining
knowledge about the domain and information
extraction. Enabling technology for this is
information extraction in combination with
knowledge representation and ontologies. The
other development is large scale data collection
and data analysis. This also requires common
concepts and organisation of the information using
ontologies. However, this form of collaboration
also needs a common methodology, and it needs to
be extended with other means of communication;
see the literature for examples and discussion.
Web mining for images and video and audio
streams
So far, efforts in Semantic Web research have
addressed mostly written documents. Recently this
has been broadened to include sound/voice and images.
Images and parts of images are annotated with
terms from ontologies.
Privacy and security
A factor that limits the application of Web
Mining is the need to protect privacy of users.
Web Mining uses data that are available on the
web anyway but the use of Data Mining makes it
possible to induce general patterns that can be
applied to personal data to inductively infer data
that should remain private. Recent
research addresses this problem and searches for
selective restrictions on access to data that
allow the induction of general patterns but at the
same time preserve a preset uncertainty about
individuals, thereby protecting the privacy of
individuals.
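The idea of preserving "a preset uncertainty about individuals" can be illustrated with a minimal k-anonymity-style check: a table is k-anonymous with respect to chosen quasi-identifier columns if every combination of their values is shared by at least k records. The records and column names below are invented for illustration:

```python
from collections import Counter

def is_k_anonymous(records, quasi_ids, k):
    """True if every combination of quasi-identifier values occurs in at
    least k records, so no individual is uniquely identifiable by them."""
    groups = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return all(count >= k for count in groups.values())

records = [
    {"zip": "12345", "age_band": "20-30", "diagnosis": "flu"},
    {"zip": "12345", "age_band": "20-30", "diagnosis": "cold"},
    {"zip": "67890", "age_band": "40-50", "diagnosis": "flu"},
]
# The last record is unique on (zip, age_band), so 2-anonymity fails here.
```

Anonymization schemes would then generalize or suppress quasi-identifier values until such a check passes, trading data utility for privacy.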
Information extraction with formalized
knowledge
We briefly reviewed the use of concept
hierarchies and thesauri for information
extraction. If knowledge
is represented in more general formal Semantic
Web languages like OWL, in principle there are
stronger possibilities to use this knowledge for
information extraction.
In summary, the main foreseen developments are:
– The extensive use of annotated documents
facilitates the application of Data Mining
techniques to documents.
– The use of a standardized format and a
standardized vocabulary for information on the
web will increase the effect and use of Web
Mining.
– The Semantic Web goal of large-scale
construction of ontologies will require the use of
Data Mining methods, in particular to extract
knowledge from text.
8.2. CONS
Web mining, itself, doesn’t create issues, but this
technology when used on data of personal nature
might cause concerns. The most criticized ethical
issue involving web mining is the invasion of
privacy. Privacy is considered lost when
information concerning an individual is obtained,
used, or disseminated, especially if this occurs
without their knowledge or consent. The obtained
data is analyzed and clustered to form
profiles; the data is made anonymous before
clustering so that there are no personal profiles.
Thus these applications de-individualize users
by judging them by their mouse clicks. De-
individualization can be defined as a tendency to
judge and treat people on the basis of group
characteristics instead of on their own individual
characteristics and merits.
Another important concern is that the companies
collecting the data for a specific purpose might
use the data for a totally different purpose, and
this essentially violates the user’s interests. The
growing trend of selling personal data as a
commodity encourages website owners to trade
personal data obtained from their site. This trend
has increased the amount of data being captured
and traded increasing the likeliness of one’s
privacy being invaded. The companies which buy
the data are obliged to make it anonymous, and these
companies are considered authors of any specific
release of mining patterns. They are legally
responsible for the contents of the release; any
inaccuracies in the release will result in serious
lawsuits, but there is no law preventing them from
trading the data.
Some mining algorithms might use controversial
attributes like sex, race, religion, or sexual
orientation to categorize individuals. These
practices might be against the anti-discrimination
legislation. The applications make it hard to
identify the use of such controversial attributes,
and there is no strong rule against the usage of
such algorithms with such attributes. This process
could result in denial of a service or privilege to
an individual based on race, religion or sexual
orientation; right now this situation can only be
avoided by the high ethical standards maintained by the
data mining company. The collected data is
made anonymous so that the obtained data and
the obtained patterns cannot be traced back to an
individual. Although it might look as if this poses
no threat to one's privacy, much additional information
can in fact be inferred by combining separate,
seemingly innocuous pieces of data about the user.
9. CONCLUSION
The term Web mining has been used to refer to
techniques that encompass a broad range of issues.
However, while meaningful and attractive, this
very broadness has caused Web mining to mean
different things to different people, and there is a
need to develop a common vocabulary. Towards
this goal we proposed a definition of Web mining
and developed a taxonomy of the various ongoing
efforts related to it. Next, we presented a survey of
the research in this area, concentrating on Web
usage mining. We provided a detailed survey of the
efforts in this area, even though the survey is
short because of the area's newness. We also provided a
general architecture of a system for Web usage
mining, and identified the issues and problems in
this area that require further research and
development.
As the Web and its usage continue to grow, so
does the opportunity to analyze Web data and
extract all manner of useful knowledge from it.
The past few years have seen the emergence of
Web mining as a rapidly growing area, due to the
efforts of the research community as well as
various organizations that practice it. The key
component of web mining is the mining process
itself. We have described the key computer science
contributions made in this field, including an
overview of web mining, a taxonomy of web
mining, the prominent successful applications, and
an outline of some promising areas of future research.
Web Archives and the dream of the Personal Search EngineWeb Archives and the dream of the Personal Search Engine
Web Archives and the dream of the Personal Search Engine
 
01635156
0163515601635156
01635156
 
Information Storage and Retrieval : A Case Study
Information Storage and Retrieval : A Case StudyInformation Storage and Retrieval : A Case Study
Information Storage and Retrieval : A Case Study
 
Competitive Intelligence Made easy
Competitive Intelligence Made easyCompetitive Intelligence Made easy
Competitive Intelligence Made easy
 
Web of Data as a Solution for Interoperability. Case Studies
Web of Data as a Solution for Interoperability. Case StudiesWeb of Data as a Solution for Interoperability. Case Studies
Web of Data as a Solution for Interoperability. Case Studies
 
Information retrieval
Information retrievalInformation retrieval
Information retrieval
 
Semantic Search on the Rise
Semantic Search on the RiseSemantic Search on the Rise
Semantic Search on the Rise
 
Is590 eport. knowledge map 1
Is590 eport. knowledge map 1Is590 eport. knowledge map 1
Is590 eport. knowledge map 1
 

Similar to Web Mining

Odam an optimized distributed association rule mining algorithm (synopsis)
Odam an optimized distributed association rule mining algorithm (synopsis)Odam an optimized distributed association rule mining algorithm (synopsis)
Odam an optimized distributed association rule mining algorithm (synopsis)Mumbai Academisc
 
A web content mining application for detecting relevant pages using Jaccard ...
A web content mining application for detecting relevant pages  using Jaccard ...A web content mining application for detecting relevant pages  using Jaccard ...
A web content mining application for detecting relevant pages using Jaccard ...IJECEIAES
 
A Study Web Data Mining Challenges And Application For Information Extraction
A Study  Web Data Mining Challenges And Application For Information ExtractionA Study  Web Data Mining Challenges And Application For Information Extraction
A Study Web Data Mining Challenges And Application For Information ExtractionScott Bou
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)theijes
 
Web Mining for an Academic Portal: The case of Al-Imam Muhammad Ibn Saud Isla...
Web Mining for an Academic Portal: The case of Al-Imam Muhammad Ibn Saud Isla...Web Mining for an Academic Portal: The case of Al-Imam Muhammad Ibn Saud Isla...
Web Mining for an Academic Portal: The case of Al-Imam Muhammad Ibn Saud Isla...IOSR Journals
 
Literature Survey on Web Mining
Literature Survey on Web MiningLiterature Survey on Web Mining
Literature Survey on Web MiningIOSR Journals
 
Information Organisation for the Future Web: with Emphasis to Local CIRs
Information Organisation for the Future Web: with Emphasis to Local CIRs Information Organisation for the Future Web: with Emphasis to Local CIRs
Information Organisation for the Future Web: with Emphasis to Local CIRs inventionjournals
 
Web_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_HabibWeb_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_HabibEl Habib NFAOUI
 
ANALYTICAL IMPLEMENTATION OF WEB STRUCTURE MINING USING DATA ANALYSIS IN ONLI...
ANALYTICAL IMPLEMENTATION OF WEB STRUCTURE MINING USING DATA ANALYSIS IN ONLI...ANALYTICAL IMPLEMENTATION OF WEB STRUCTURE MINING USING DATA ANALYSIS IN ONLI...
ANALYTICAL IMPLEMENTATION OF WEB STRUCTURE MINING USING DATA ANALYSIS IN ONLI...IAEME Publication
 
Comparative Analysis of Collaborative Filtering Technique
Comparative Analysis of Collaborative Filtering TechniqueComparative Analysis of Collaborative Filtering Technique
Comparative Analysis of Collaborative Filtering TechniqueIOSR Journals
 
Structured and Unstructured Information Extraction Using Text Mining and Natu...
Structured and Unstructured Information Extraction Using Text Mining and Natu...Structured and Unstructured Information Extraction Using Text Mining and Natu...
Structured and Unstructured Information Extraction Using Text Mining and Natu...rahulmonikasharma
 

Similar to Web Mining (20)

Odam an optimized distributed association rule mining algorithm (synopsis)
Odam an optimized distributed association rule mining algorithm (synopsis)Odam an optimized distributed association rule mining algorithm (synopsis)
Odam an optimized distributed association rule mining algorithm (synopsis)
 
Research Statement
Research StatementResearch Statement
Research Statement
 
Minning www
Minning wwwMinning www
Minning www
 
CS8080 IRT UNIT I NOTES.pdf
CS8080 IRT UNIT I  NOTES.pdfCS8080 IRT UNIT I  NOTES.pdf
CS8080 IRT UNIT I NOTES.pdf
 
CS8080_IRT__UNIT_I_NOTES.pdf
CS8080_IRT__UNIT_I_NOTES.pdfCS8080_IRT__UNIT_I_NOTES.pdf
CS8080_IRT__UNIT_I_NOTES.pdf
 
5463 26 web mining
5463 26 web mining5463 26 web mining
5463 26 web mining
 
A web content mining application for detecting relevant pages using Jaccard ...
A web content mining application for detecting relevant pages  using Jaccard ...A web content mining application for detecting relevant pages  using Jaccard ...
A web content mining application for detecting relevant pages using Jaccard ...
 
A Study Web Data Mining Challenges And Application For Information Extraction
A Study  Web Data Mining Challenges And Application For Information ExtractionA Study  Web Data Mining Challenges And Application For Information Extraction
A Study Web Data Mining Challenges And Application For Information Extraction
 
Paper24
Paper24Paper24
Paper24
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)
 
A Clustering Based Approach for knowledge discovery on web.
A Clustering Based Approach for knowledge discovery on web.A Clustering Based Approach for knowledge discovery on web.
A Clustering Based Approach for knowledge discovery on web.
 
WEB MINING.pptx
WEB MINING.pptxWEB MINING.pptx
WEB MINING.pptx
 
Web Mining for an Academic Portal: The case of Al-Imam Muhammad Ibn Saud Isla...
Web Mining for an Academic Portal: The case of Al-Imam Muhammad Ibn Saud Isla...Web Mining for an Academic Portal: The case of Al-Imam Muhammad Ibn Saud Isla...
Web Mining for an Academic Portal: The case of Al-Imam Muhammad Ibn Saud Isla...
 
Literature Survey on Web Mining
Literature Survey on Web MiningLiterature Survey on Web Mining
Literature Survey on Web Mining
 
Information Organisation for the Future Web: with Emphasis to Local CIRs
Information Organisation for the Future Web: with Emphasis to Local CIRs Information Organisation for the Future Web: with Emphasis to Local CIRs
Information Organisation for the Future Web: with Emphasis to Local CIRs
 
Web_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_HabibWeb_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_Habib
 
Web mining
Web miningWeb mining
Web mining
 
ANALYTICAL IMPLEMENTATION OF WEB STRUCTURE MINING USING DATA ANALYSIS IN ONLI...
ANALYTICAL IMPLEMENTATION OF WEB STRUCTURE MINING USING DATA ANALYSIS IN ONLI...ANALYTICAL IMPLEMENTATION OF WEB STRUCTURE MINING USING DATA ANALYSIS IN ONLI...
ANALYTICAL IMPLEMENTATION OF WEB STRUCTURE MINING USING DATA ANALYSIS IN ONLI...
 
Comparative Analysis of Collaborative Filtering Technique
Comparative Analysis of Collaborative Filtering TechniqueComparative Analysis of Collaborative Filtering Technique
Comparative Analysis of Collaborative Filtering Technique
 
Structured and Unstructured Information Extraction Using Text Mining and Natu...
Structured and Unstructured Information Extraction Using Text Mining and Natu...Structured and Unstructured Information Extraction Using Text Mining and Natu...
Structured and Unstructured Information Extraction Using Text Mining and Natu...
 

More from Shobha Rani

Night Vision Technology
Night Vision TechnologyNight Vision Technology
Night Vision TechnologyShobha Rani
 
Graphical Password Authentication
Graphical Password AuthenticationGraphical Password Authentication
Graphical Password AuthenticationShobha Rani
 
3D Optical Data Storage
3D Optical Data Storage 3D Optical Data Storage
3D Optical Data Storage Shobha Rani
 
Night Vision Technology
Night Vision TechnologyNight Vision Technology
Night Vision TechnologyShobha Rani
 
Human Computer Interface (HCI)
Human Computer Interface (HCI)Human Computer Interface (HCI)
Human Computer Interface (HCI)Shobha Rani
 
Cluster Computing
Cluster Computing Cluster Computing
Cluster Computing Shobha Rani
 

More from Shobha Rani (8)

Night Vision Technology
Night Vision TechnologyNight Vision Technology
Night Vision Technology
 
Graphical Password Authentication
Graphical Password AuthenticationGraphical Password Authentication
Graphical Password Authentication
 
3D Optical Data Storage
3D Optical Data Storage 3D Optical Data Storage
3D Optical Data Storage
 
Web Mining
Web MiningWeb Mining
Web Mining
 
Night Vision Technology
Night Vision TechnologyNight Vision Technology
Night Vision Technology
 
Brain gate ppt
Brain gate pptBrain gate ppt
Brain gate ppt
 
Human Computer Interface (HCI)
Human Computer Interface (HCI)Human Computer Interface (HCI)
Human Computer Interface (HCI)
 
Cluster Computing
Cluster Computing Cluster Computing
Cluster Computing
 

Recently uploaded

Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Celine George
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfMahmoud M. Sallam
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentInMediaRes1
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfSumit Tiwari
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxDr.Ibrahim Hassaan
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxJiesonDelaCerna
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 

Recently uploaded (20)

Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdf
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media Component
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptx
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptx
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 

Web Mining

effectively mine for knowledge, both across the Internet and within particular web localities.
At present, most users rely on search engines such as www.google.com to find the information they need. The goal of a Web search engine, however, is only to discover resources on the Web. Each search engine has its own characteristics and employs different algorithms to index, rank, and present web documents. Because all of these engines are built on exact keyword matching, and their query languages are artificial, with restricted syntax and vocabulary rather than natural language, they share defects that no search engine can overcome.

Narrow search scope: the pages indexed by any search engine are only a tiny part of all pages on the WWW, and the pages returned for a user's query are another tiny fraction of what the engine has indexed.

Low precision: users cannot browse all the returned pages one by one, and most of the pages are irrelevant to the user's intent; they are returned by the search engine simply because they contain the keywords.

Web mining techniques can be used to address this information overload problem, directly or indirectly. Web mining techniques are not the only tools, however: techniques from other research areas, such as databases (DB), Information Retrieval (IR), Natural Language Processing (NLP), and the Web document community, can also be used.

Information retrieval. Information retrieval is the art and science of searching for information in documents, searching for the documents themselves, searching for metadata that describes documents, or searching within databases, whether relational standalone databases or hypertext networked databases such as the Internet or intranets, for text, sound, images, or data.

Natural language processing. Natural language processing (NLP) is concerned with the interactions between computers and human (natural) languages.
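The exact keyword matching on which search engines are built, as described above, can be illustrated with a minimal inverted index. This is a simplified sketch; the three document texts are invented for illustration:

```python
from collections import defaultdict

docs = {
    1: "web mining discovers useful patterns from web data",
    2: "data mining extracts patterns from databases",
    3: "usability study of a web site",
}

# Build an inverted index: term -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def keyword_search(query):
    """Return ids of documents containing ALL query terms (exact match only)."""
    terms = query.split()
    result = index.get(terms[0], set()).copy() if terms else set()
    for term in terms[1:]:
        result &= index.get(term, set())
    return sorted(result)

print(keyword_search("web mining"))   # -> [1]
print(keyword_search("knowledge"))    # synonym of "patterns": -> []
```

The second query shows the low-precision/low-recall defect: a synonym of an indexed word matches nothing, because matching is purely lexical.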
NLP is a form of human-to-computer interaction in which the elements of human language, spoken or written, are formalized so that a computer can perform value-adding tasks based on that interaction. Natural language understanding is sometimes referred to as an AI-complete problem, because natural-language recognition seems to require extensive knowledge about the outside world and the ability to manipulate it.

The purpose of Web mining is to develop methods and systems for discovering models of objects and processes on the World Wide Web, and for web-based systems that show adaptive performance. Web mining integrates three parent areas: Data Mining (we use this term here also for the closely related areas of Machine Learning and Knowledge Discovery), Internet technology and
the World Wide Web, and, more recently, the Semantic Web.

The World Wide Web has made an enormous amount of information electronically accessible. The use of email, news, and markup languages like HTML allows users to publish and read documents on a world-wide scale and to communicate via chat connections, including information in the form of images and voice recordings. The HTTP protocol, which enables access to documents over the network via Web browsers, brought an immense improvement in communication and access to information. For some years these possibilities were used mostly in the scientific world, but recent years have seen an immense growth in popularity, supported by the wide availability of computers and broadband communication. The use of the Internet for tasks other than finding information and direct communication is increasing, as can be seen from the interest in "e-activities" such as e-commerce, e-learning, e-government, and e-science.

Independently of the development of the Internet, Data Mining expanded out of the academic world into industry. Methods and their potential became known outside academia, and commercial toolkits became available that allowed applications at an industrial scale. Numerous industrial applications have shown that models can be constructed from data for a wide variety of industrial problems.

The World Wide Web is an interesting area for Data Mining because huge amounts of information are available. Data Mining methods can be used to analyze the behavior of individual users, access patterns of pages or sites, and properties of collections of documents. Almost all standard data mining methods are designed for data organized as multiple "cases" that are comparable and can be viewed as instances of a single pattern: for example, patients described by a fixed set of symptoms and diseases, applicants for loans, or customers of a shop. A "case" is typically described by a fixed set of features (or variables).
Data on the Web have a different nature. They are not so easily comparable, and take the form of free text, semi-structured text (lists, tables), often with images and hyperlinks, or server logs. The aim of learning models of documents has given rise to interest in Text Mining methods that model documents in terms of their properties. Learning from the hyperlink structure has given rise to graph-based methods, and server logs are used to learn about user behavior.

Instead of searching for a document that matches keywords, it should be possible to combine information to answer questions. Instead of retrieving a plan for a trip to Hawaii, it should be possible to automatically construct a travel plan that satisfies certain goals and exploits opportunities that arise dynamically. This gives rise to a wide range of challenges. Some concern the infrastructure, including the interoperability of
systems and the languages for the exchange of information rather than data. Many challenges lie in the area of knowledge representation, discovery, and engineering. They include the extraction of knowledge from data and its representation in a form understandable by arbitrary parties, intelligent questioning and the delivery of answers to problems (as opposed to conventional queries), and the exploitation of previously extracted knowledge in this process.

2. WEB MINING

Web mining is the integration of information gathered by traditional data mining methodologies and techniques with information gathered over the World Wide Web. Data mining is also called knowledge discovery in databases (KDD). It is the extraction of useful patterns from data sources, e.g. databases, texts, the web, images, etc. Patterns must be valid, novel, potentially useful, and understandable. Classic data mining tasks include:

• Classification: mining patterns that can classify future (new) data into known classes.
• Association rule mining: mining rules of the form X → Y, where X and Y are sets of data items, e.g. Cheese, Milk → Bread [sup = 5%, conf = 80%].
• Clustering: identifying a set of similarity groups in the data.
• Sequential pattern mining: a sequential rule A → B says that event A will be immediately followed by event B with a certain confidence.

Fig. 1: The Data Mining (KDD) Process

Just as data mining aims at discovering valuable information that is hidden in conventional databases, the emerging field of web mining aims at finding and extracting relevant information that is hidden in Web-related data, in particular hypertext documents published on the Web. Web mining is the extraction of interesting and potentially useful patterns and implicit information from artifacts or activity related to the World Wide Web. There are roughly three knowledge discovery domains that pertain to web mining: Web Content Mining, Web Structure Mining, and Web Usage Mining.
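The support and confidence figures attached to an association rule such as Cheese, Milk → Bread [sup = 5%, conf = 80%] are defined over a transaction database. A minimal sketch, using made-up market-basket transactions:

```python
# Toy market-basket transactions (invented for illustration).
transactions = [
    {"cheese", "milk", "bread"},
    {"cheese", "milk", "bread"},
    {"cheese", "milk"},
    {"milk", "eggs"},
    {"bread", "butter"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs):
    """Of the transactions containing `lhs`, the fraction that also contain `rhs`."""
    return support(lhs | rhs) / support(lhs)

# Rule {cheese, milk} -> {bread}
print(support({"cheese", "milk", "bread"}))      # -> 0.4  (2 of 5 transactions)
print(confidence({"cheese", "milk"}, {"bread"}))  # -> 0.666...  (2 of 3)
```

Algorithms such as Apriori search for all rules whose support and confidence exceed user-chosen thresholds; the computation above is the measure they optimize, not the search itself.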
Web content mining is the process of extracting knowledge from the content of documents or their descriptions. Web document text mining, resource discovery based on concept indexing, and agent-based technology may also fall into this category.
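The usual first step of content mining, extracting the text of an HTML document while discarding markup, can be sketched with the standard library alone. The sample page is invented for illustration:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text content of an HTML page, ignoring tags and scripts."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0          # depth inside <script>/<style> elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

page = ("<html><head><script>var x=1;</script></head>"
        "<body><h1>Web Mining</h1><p>Useful patterns.</p></body></html>")
extractor = TextExtractor()
extractor.feed(page)
print(" ".join(extractor.parts))   # -> Web Mining Useful patterns.
```

The extracted term stream is then ready for the pre-processing and generalization subtasks described in Section 3.1.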
Web structure mining is the process of inferring knowledge from the organization of the World Wide Web and the links between references and referents on the Web. Finally, web usage mining, also known as web log mining, is the process of extracting interesting patterns from web access logs.

The Web is a collection of inter-related files on one or more Web servers. Web mining is a multi-disciplinary effort that draws techniques from fields such as information retrieval, statistics, machine learning, and natural language processing. Web mining has a new character compared with traditional data mining. First, the objects of Web mining are a large number of Web documents that are heterogeneously distributed, and each data source is itself heterogeneous; second, Web documents are semi-structured or unstructured and lack semantics that a machine can understand.

3. HISTORY

The term "Web Mining" was first used in [E1996], where it was defined in a 'task-oriented' manner; an alternate 'data-oriented' definition was given in [CMS1997]. The first panel discussion on the topic was held at ICTAI 1997 [SM1997], and it remains a continuing forum:

• WebKDD workshops with ACM SIGKDD, 1999, 2000, 2001, 2002, ...; 60-90 attendees
• SIAM Web analytics workshops, 2001, 2002, ...
• Special issues of the DMKD journal and SIGKDD Explorations
• Papers in various data mining conferences and journals
• Surveys [MBNL 1999, BL 1999, KB2000]

This area of research has grown so large because of the tremendous growth of information sources available on the Web and the recent interest in e-commerce. Web mining is used to understand customer behavior, evaluate the effectiveness of a particular Web site, and help quantify the success of a marketing campaign.

3.1 Web Mining Subtasks

Web mining can be decomposed into the following subtasks:

1. Resource finding: the task of retrieving intended Web documents.
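Web usage mining starts from server access logs. Parsing one line of the common log format, the raw material of web log mining, can be sketched as follows (the log line itself is fabricated):

```python
import re

# Common Log Format: host ident user [timestamp] "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line):
    """Return a dict of fields from one access-log line, or None if malformed."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

line = ('192.168.0.7 - - [10/Oct/2000:13:55:36 -0700] '
        '"GET /index.html HTTP/1.0" 200 2326')
entry = parse_log_line(line)
print(entry["host"], entry["path"], entry["status"])
# -> 192.168.0.7 /index.html 200
```

Grouping parsed entries by host and time window yields the sessions on which usage-mining algorithms (association rules, sequential patterns) operate.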
By resource finding we mean the process of retrieving data, online or offline, from the text sources available on the web, such as electronic newsletters, electronic newswires, and the text contents of HTML documents obtained by removing HTML tags; it also includes the manual selection of Web resources.

2. Information selection and pre-processing: automatically selecting and pre-processing specific information from the retrieved Web resources. This is a transformation of the original data retrieved in the IR process. The transformations can be either pre-processing steps
• 6. Page 6 such as stop-word removal, stemming, etc., or pre-processing aimed at obtaining the desired representation, such as finding phrases in the training corpus, or transforming the representation to relational or first-order logic form. 3. Generalization: automatically discovering general patterns at individual Web sites as well as across multiple sites. Machine learning or data mining techniques are typically used in the process of generalization. Humans play an important role in the information or knowledge discovery process on the Web, since the Web is an interactive medium. 4. Analysis: validating and/or interpreting the mined patterns. 4. CHALLENGES OF WEB MINING 1. Today the World Wide Web is flooded with billions of static and dynamic web pages created with languages such as HTML, PHP and ASP. It is a significant challenge to search for useful and relevant information on the web. 2. Creating knowledge from the available information. 3. As the coverage of information is very wide and diverse, personalization of the information is a tedious process. 4. Learning customer and individual user patterns. 5. The complexity of Web pages far exceeds the complexity of any conventional text document. Web pages on the internet lack uniformity and standardization. 6. Much of the information present on the web is redundant, as the same piece of information, or a variant of it, appears in many pages. 7. The web is noisy, i.e. a page typically contains a mixture of many kinds of information: main content, advertisements, copyright notices, navigation panels. 8. The web is dynamic; information keeps changing constantly. Keeping up with the changes and monitoring them is very important. 9. The Web is not only about disseminating information; it is also about services. Many Web sites and pages enable people to perform operations with input parameters, i.e., they provide services. 10. The most important challenge faced is invasion of privacy. 
Privacy is considered lost when information concerning an individual is obtained, used, or disseminated without that individual's knowledge or consent. Techniques to Address the Problem 1.1 Preprocessing technique - Web Robots
• 7. Page 7 When attempting to detect web robots from a request stream it is desirable to monitor both the Web server log and activity on the client side. What we are looking for is to distinguish single Web sessions from each other. A Web session is a series of requests to web pages, i.e. visits to web pages. Since the navigation patterns of web robots differ from the navigation patterns of human users, the contribution from web robots has to be eliminated before proceeding with any further data mining, i.e. when we are looking into the web usage behaviour of real users. One problem with identifying web robots is that they might hide their identity behind a facade that looks a lot like a conventional web browser. Standard approaches to robot detection will fail to detect such camouflaged web robots. As web robots are used for tasks like website indexing, e.g. by Google, or detection of broken links, they have to exist. There is a special file on every domain called "robots.txt" which, according to the Robot Exclusion Standard [M. Koster, 1994], will be examined by the robot in order to prevent the robot from visiting certain pages of no interest. Malicious web robots, however, are not guaranteed to follow the advice in robots.txt. The feature classes chosen for evaluation are Temporal Features, Page Features, Communication Features and Path Features. It is desirable to be able to detect the presence of a web robot after as few requests as possible; this is of course a trade-off between computational effort and result accuracy. A simple decision model for determining the class of a visitor is: first, check if the visitor requested robots.txt; if so, it is labeled as a robot. Second, the visitor is matched against a list of formerly known robots. Third, the referer "-" is searched for; since robots seldom assign any value to the referer field, this is a rewarding place to look. If a robot is found, the list of known robots is updated with the new one. 
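The three-step decision model above can be sketched in a few lines. This is a minimal illustration, not the cited detection method: the session layout (`user_agent`, `requests` with `path` and `referer` fields) and the seed list of known robots are assumptions made for the example.

```python
# Hypothetical seed list of previously identified robots (illustrative only).
KNOWN_ROBOTS = {"Googlebot", "Bingbot"}

def label_visitor(session):
    """Label a session as 'robot' or 'human' using the three checks above."""
    # 1. Did the visitor request robots.txt?
    if any(req["path"] == "/robots.txt" for req in session["requests"]):
        return "robot"
    # 2. Does the visitor match a formerly known robot?
    if session["user_agent"] in KNOWN_ROBOTS:
        return "robot"
    # 3. Robots seldom assign a referer; an unassigned referer ("-") on every
    #    request is a strong hint. Newly found robots extend the known list.
    if all(req.get("referer", "-") == "-" for req in session["requests"]):
        KNOWN_ROBOTS.add(session["user_agent"])
        return "robot"
    return "human"

session = {"user_agent": "SomeCrawler/1.0",
           "requests": [{"path": "/robots.txt", "referer": "-"}]}
print(label_visitor(session))  # prints "robot"
```

A production detector would of course combine this with the temporal, page, communication and path features mentioned above rather than rely on these heuristics alone.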
3.1.2 Avoiding Mislabeled Sessions To avoid mislabeling of sessions, an ensemble filtering approach [C. Brodley et al., 1999] is used. The idea is, instead of building just one classification model, to build several models which are used to find classification errors by identifying individual mislabeled sessions. Each of the acquired models is used to classify all sessions. For each session, the number of false negative and false positive classifications is counted. A large number of false positive classifications implies that the session is currently labeled as a non-robot despite being predicted to be a robot by most of the models. A large number of false negative classifications implies that the session might be a non-robot but carries the robot label. 4.2 Mining Issue
• 8. Page 8 3.2.1 Indirect Association Common association methods often employ patterns that connect objects to each other. Sometimes, on the other hand, it might be valuable to consider indirect associations between objects. Indirect association is used, for example, to represent the behaviour of distinct user groups. 3.2.2 Clustering With the growth of the World Wide Web it can be very time consuming to analyze every web page on its own. Therefore it is a good idea to cluster web pages based on attributes that can be considered similar, in order to find successful and less successful attributes and patterns. 5. TAXONOMY OF WEB MINING In general, Web mining tasks can be classified into three categories: 1. Web content mining, 2. Web structure mining and 3. Web usage mining. However, there are two other approaches to categorizing Web mining. In both, the categories are reduced from three to two: Web content mining and Web usage mining. In one, Web structure is treated as part of Web content, while in the other, Web usage is treated as part of Web structure. All three categories focus on the process of knowledge discovery of implicit, previously unknown and potentially useful information from the Web. Each of them focuses on different mining objects of the Web. Fig. 2 Taxonomy of Web mining 5.1. Web content mining Web content mining is an automatic process that goes beyond keyword extraction. Since the content of a text document presents no machine-readable semantics, some approaches have suggested restructuring the document content in a representation that can be exploited by machines. The usual approach to exploiting known structure in documents is to use wrappers to map documents to some data model. Techniques using lexicons for content interpretation are yet to come. There are two groups of web content mining strategies: those that directly mine the content of documents, and those that improve on the content search of other tools like search engines.
• 9. Page 9 Web content mining deals with discovering useful information or knowledge from web page contents. Web content mining analyzes the content of Web resources. Content data is the collection of facts contained in a web page. It consists of unstructured data such as free text, images, audio and video, semi-structured data such as HTML documents, and more structured data such as data in tables or database-generated HTML pages. The primary Web resources mined in Web content mining are individual pages. They can be used to group, categorize, analyze, and retrieve documents. Web content mining can be differentiated from two points of view: 5.1.1. Agent-Based Approach This approach aims to assist in or improve information finding and the filtering of information for users. It can be placed into the following three categories: a. Intelligent Search Agents: These agents search for relevant information using domain characteristics and user profiles to organize and interpret the discovered information. b. Information Filtering/Categorization: These agents use information retrieval techniques and characteristics of open hypertext Web documents to automatically retrieve, filter, and categorize them. c. Personalized Web Agents: These agents learn user preferences and discover Web information based on these preferences, and on the preferences of other users with similar interests. 1. Intelligent Search Agents: Several intelligent Web agents have been developed that search for relevant information using domain characteristics and user profiles to organize and interpret the discovered information. Agents such as Harvest, FAQ Finder, Information Manifold, OCCAM, and ParaSite rely either on pre-specified domain information about particular types of documents, or on hard-coded models of the information sources to retrieve and interpret documents. 
Agents such as ShopBot and ILA (Internet Learning Agent) interact with and learn the structure of unfamiliar information sources. ShopBot retrieves product information from a variety of vendor sites using only general information about the product domain. ILA learns models of various information sources and translates these into its own concept hierarchy. 2. Information Filtering/Categorization: A number of Web agents use various information retrieval techniques and characteristics of open hypertext Web documents to automatically retrieve, filter, and categorize them. BO (Bookmark Organizer) [34] combines hierarchical clustering techniques and user interaction to
• 10. Page 10 organize a collection of Web documents based on conceptual information. 3. Personalized Web Agents: This category of Web agents learns user preferences and discovers Web information sources based on these preferences, and on those of other individuals with similar interests (using collaborative filtering). A few recent examples of such agents include WebWatcher, PAINT, and Syskill & Webert. For example, Syskill & Webert utilizes a user profile and learns to rate Web pages of interest using a Bayesian classifier. 5.1.2. Database Approach The database approach aims at modeling the data on the Web into a more structured form in order to apply standard database querying mechanisms and data mining applications to analyze it. The two main categories are: Multilevel databases: The main idea behind this approach is that the lowest level of the database contains semi-structured information stored in various Web sources, such as hypertext documents. At the higher level(s), metadata or generalizations are extracted from lower levels and organized in structured collections, i.e. relational or object-oriented databases. Web query systems: Many Web-based query systems and languages utilize standard database query languages such as SQL, structural information about Web documents, and even natural language processing for the queries used in World Wide Web searches. W3QL combines structure queries, based on the organization of hypertext documents, and content queries, based on information retrieval techniques. WebLog, a logic-based query language, extracts and restructures information from Web information sources. TSIMMIS extracts data from heterogeneous and semi-structured information sources and correlates them to generate an integrated database representation of the extracted information. 5.2. WEB STRUCTURE MINING The World Wide Web can reveal more information than just the information contained in documents. 
For example, links pointing to a document indicate the popularity of the document, while links coming out of a document indicate the richness, or perhaps the variety, of topics covered in the document. This can be compared to bibliographical citations: when a paper is cited often, it is likely to be important. The PageRank and CLEVER methods take advantage of the information conveyed by links to find pertinent web pages. By means of counters, higher levels accumulate the number of artifacts subsumed by the concepts they hold. Counters of hyperlinks into and out of documents retrace the structure of the web artifacts summarized. Web structure mining is the process of discovering structure information from the web. 
• 11. Page 11 The structure of a typical web graph consists of web pages as nodes, and hyperlinks as edges connecting related pages. Web structure mining can be further divided into two kinds, based on the kind of structure information used. Fig. 3 Web graph structure Hyperlinks A hyperlink is a structural unit that connects a location in a web page to a different location, either within the same web page or on a different web page. A hyperlink that connects to a different part of the same page is called an intra-document hyperlink, and a hyperlink that connects two different pages is called an inter-document hyperlink. Document Structure In addition, the content within a Web page can also be organized in a tree-structured format, based on the various HTML and XML tags within the page. Mining efforts here have focused on automatically extracting document object model (DOM) structures out of documents. Web structure mining focuses on the hyperlink structure within the Web itself. The different objects are linked in some way; simply applying traditional processes and assuming that the events are independent can lead to wrong conclusions. However, appropriate handling of the links can capture potential correlations, and thereby improve the predictive accuracy of the learned models. Two algorithms that have been proposed to deal with those potential correlations are: 1. HITS and 2. PageRank. 5.2.1. PageRank PageRank is a metric for ranking hypertext documents that determines the quality of these documents. The key idea is that a page has a high rank if it is pointed to by many highly ranked pages, so the rank of a page depends upon the ranks of the pages pointing to it. This process is applied iteratively until the rank of every page is determined. 
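The iterative ranking process just described can be sketched as a simple power iteration. This is a minimal illustration on a toy graph; for simplicity it ignores dangling pages (pages without out-links), which a full implementation would have to handle.

```python
def pagerank(links, d=0.15, iterations=50):
    """links: dict mapping each page to the list of pages it links to.
    d is the probability that the random surfer jumps to a random page."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: d / n for p in pages}  # random-jump contribution
        for q, outs in links.items():
            for p in outs:               # q shares its rank equally among out-links
                new[p] += (1 - d) * rank[q] / len(outs)
        rank = new
    return rank

ranks = pagerank({"A": ["B"], "B": ["C"], "C": ["A"]})
# on a symmetric cycle every page converges to rank 1/3
```

Each update adds, for every page q linking to p, the share (1 − d) · PR(q)/OutDegree(q), plus the uniform random-jump term d/n.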
The rank of a page p can thus be written as: PR(p) = d/n + (1 − d) · Σ_{q → p} PR(q) / OutDegree(q). Here, n is the number of nodes in the graph, OutDegree(q) is the number of hyperlinks on page q, and the damping factor d is the probability that at each page the random surfer will get bored and request another random page. 5.2.2. HITS
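HITS, described in this subsection, maintains two scores per page, a hub score and an authority score, each computed iteratively from the other. A minimal power-iteration sketch on a toy graph (names are illustrative):

```python
def hits(links, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # a page's authority grows with the hub scores of pages linking to it
        auth = {p: sum(hub[q] for q, outs in links.items() if p in outs)
                for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # a page's hub score grows with the authorities of the pages it links to
        hub = {q: sum(auth[p] for p in links.get(q, [])) for q in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {q: v / norm for q, v in hub.items()}
    return hub, auth

hub, auth = hits({"h1": ["a"], "h2": ["a", "b"]})
# "a" emerges as the strongest authority, "h2" as the strongest hub
```

The normalization after each step keeps the scores bounded; in practice HITS is run only on the neighborhood graph of a query's search results, as the text below describes.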
• 12. Page 12 Hyperlink-Induced Topic Search (HITS) is an iterative algorithm for mining the Web graph to identify topic hubs and authorities. Authorities are pages that are good sources of content, referred to by many other pages or highly ranked pages for a given topic; hubs are pages that are good sources of links. The algorithm takes as input search results returned by traditional text indexing techniques, and filters these results to identify hubs and authorities. The number and weight of hubs pointing to a page determine the page's authority. The algorithm assigns a weight to a hub based on the authoritativeness of the pages it points to. If many good hubs point to a page p, then the authority of page p increases. Similarly, if a page p points to many good authorities, then the hub score of page p increases. After the computation, HITS outputs the pages with the largest hub weights and the pages with the largest authority weights as the search result for a given topic. 5.3. WEB USAGE MINING Web usage mining is the process of extracting useful information from server logs, i.e. from users' history. Web usage mining is the process of finding out what users are looking for on the Internet. It focuses on techniques that can predict the behavior of users while they interact with the WWW. It collects data from Web log records to discover the user access patterns of Web pages. Usage data captures the identity or origin of web users along with their browsing behavior at a web site. Web servers record and accumulate data about user interactions whenever requests for resources are received. Analyzing the web access logs of different web sites can help in understanding user behavior and the web structure, thereby improving the design of this colossal collection of resources. There are two main tendencies in Web usage mining, driven by the applications of the discoveries: general access pattern tracking and customized usage tracking. 
General access pattern tracking analyzes the web logs to understand access patterns and trends. These analyses can shed light on better structuring and grouping of resource providers. Many web analysis tools exist, but they are limited and usually unsatisfactory. We have designed a web log data mining tool, Web Log Miner, and proposed techniques for applying data mining and OnLine Analytical Processing (OLAP) to treated and transformed web access files. Applying data mining techniques to access logs unveils interesting access patterns that can be used to restructure sites into more efficient groupings, pinpoint effective advertising locations, and target specific users with specific selling ads. Customized usage tracking analyzes individual trends. Its purpose is to customize web sites to users. The information displayed, the depth of the
• 13. Page 13 site structure and the format of the resources can all be dynamically customized for each user over time, based on their access patterns. While it is encouraging and exciting to see the various potential applications of web log file analysis, it is important to know that the success of such applications depends on what, and how much, valid and reliable knowledge one can discover from the large raw log data. Current web servers store limited information about accesses. Some scripts custom-tailored for particular sites may store additional information. However, for effective web usage mining, an important cleaning and data transformation step may be needed before analysis. In the use and mining of Web data, the most direct source of data is the Web log files on the Web server. Web log files record the visitor's browsing behavior in detail. Web log files include the server log, agent log and client log (IP address, URL, page reference, access time, cookies, etc.). There are several available research projects and commercial products that analyze those patterns for different purposes. The applications generated from this analysis can be classified as personalization, system improvement, site modification, business intelligence and usage characterization. The Web Mining Architecture Fig. 4 Web Usage Mining Process Web usage mining can be decomposed into the following three main subtasks: Fig 5. Web usage mining process 5.3.1. Pre-processing It is necessary to perform data preparation to convert the raw data for further processing. The data actually collected are generally incomplete, redundant and ambiguous. In order to mine the knowledge more effectively, pre-processing the collected data is essential. Pre-processing can provide accurate, concise data
• 14. Page 14 for data mining. Data pre-processing includes data cleaning, user identification, user session identification, access path supplementation and transaction identification.  The main task of data cleaning is to remove the redundant Web log data that is not associated with the useful data, narrowing the scope of data objects.  Identifying the individual user must be done after data cleaning. The purpose of user identification is to identify each user uniquely. It can be accomplished by means of cookie technology, user registration techniques and investigative rules.  User session identification should be done on the basis of user identification. The purpose is to divide each user's access information into several separate session processes. The simplest way is to use a time-out estimation approach, that is, when the time interval between page requests exceeds a given value, the user is deemed to have started a new session.  Because of the widespread use of page caching technology and proxy servers, the access path recorded by the Web server access logs may not be the complete access path of users. An incomplete access log does not accurately reflect the user's access patterns, so it is necessary to supplement the access path. Path supplementation can be achieved by using the Web site topology for page analysis.  Transaction identification is based on user session recognition, and its purpose is to divide or combine transactions according to the demands of the data mining tasks, in order to make them appropriate for data mining analysis. 5.3.2. Pattern discovery Pattern discovery mines effective, novel, potentially useful and ultimately understandable information and knowledge using mining algorithms. Its methods include statistical analysis, classification analysis, association rule discovery, sequential pattern discovery, clustering analysis, and dependency modeling. 
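Of the pre-processing steps above, the time-out approach to session identification is the easiest to sketch. The 30-minute threshold below is a common heuristic, not a fixed standard, and the simple (timestamp, URL) log layout is an assumption made for the example.

```python
from datetime import datetime, timedelta

def split_sessions(requests, timeout=timedelta(minutes=30)):
    """requests: (timestamp, url) pairs for ONE identified user, sorted by time.
    A new session starts whenever the gap between requests exceeds the timeout."""
    sessions, current, last = [], [], None
    for ts, url in requests:
        if last is not None and ts - last > timeout:
            sessions.append(current)  # close the previous session
            current = []
        current.append((ts, url))
        last = ts
    if current:
        sessions.append(current)
    return sessions

log = [(datetime(2024, 1, 1, 10, 0), "/home"),
       (datetime(2024, 1, 1, 10, 5), "/products"),
       (datetime(2024, 1, 1, 11, 0), "/home")]
# the 55-minute gap before the third request splits the log into two sessions
```

Real logs would first go through data cleaning and user identification, and the resulting sessions may still need path supplementation as described above.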
 Statistical Analysis: Statistical analysts may perform different kinds of descriptive statistical analyses (frequency, mean, median, etc.) based on different variables such as page views, viewing time and length of a navigational path when analyzing the session file. By analyzing the statistical information contained in the periodic web system report, the extracted report can be potentially useful for improving system performance, enhancing the security of the system, facilitating the site modification task, and providing support for marketing decisions.
  • 15. Page 15  Association Rules: In the web domain, the pages, which are most often referenced together, can be put in one single server session by applying the association rule generation. Association rule mining techniques can be used to discover unordered correlation between items found in a database of transactions.  Clustering analysis: Clustering analysis is a technique to group together users or data items (pages) with the similar characteristics. Clustering of user information or pages can facilitate the development and execution of future marketing strategies.  Classification analysis: Classification is the technique to map a data item into one of several predefined classes. The classification can be done by using supervised inductive learning algorithms such as decision tree classifiers, nave Bayesian classifiers, k-nearest neighbor classifier,Support Vector Machines etc.  Sequential Pattern: This technique intends to find the inter-session pattern, such that a set of the items follows the presence of another in a time-ordered set of sessions or episodes. Sequential patterns also include some other types of temporal analysis such as trend analysis, change point detection, or similarity analysis.  Dependency Modeling: The goal of this technique is to establish a model that is able to represent significant dependencies among the various variables in the web domain. The modeling technique provides a theoretical framework for analyzing the behavior of users, and is potentially useful for predicting future web resource consumption. 5.3.3. Pattern Analysis Pattern Analysis is a final stage of the whole web usage mining. The goal of this process is to eliminate the irrelevant rules or patterns and to understand, visualize and to extract the interesting rules or patterns from the output of the pattern discovery process. 
The output of web mining algorithms is often not in a form suitable for direct human consumption, and thus needs to be transformed into a format that can be assimilated easily. There are two common approaches to pattern analysis. One is to use a knowledge query mechanism such as SQL, while the other is to construct a multi-dimensional data cube before performing OLAP operations. 6. APPLICATIONS OF WEB MINING Web mining techniques can be applied to understand and analyze such data and turn it into actionable information that can support a web-enabled electronic business in improving its marketing, sales and customer support operations.
• 16. Page 16 Based on the patterns found and the original cache and log data, many applications can be developed. Some of them are: In order to provide personalized service, a business first has to obtain and collect information on clients to grasp customers' spending habits, hobbies, consumer psychology, etc., and can then provide targeted, personalized service. Obtaining consumer spending behavior patterns is very difficult with the traditional marketing approach, but it can be done using Web mining techniques. Early on in the life of Amazon.com, its visionary CEO Jeff Bezos observed: "In a traditional (brick-and-mortar) store, the main effort is in getting a customer to the store. Once a customer is in the store they are likely to make a purchase, since the cost of going to another store is high, and thus the marketing budget (focused on getting the customer to the store) is in general much higher than the in-store customer experience budget (which keeps the customer in the store). In the case of an on-line store, getting in or out requires exactly one click, and thus the main focus must be on customer experience in the store." This fundamental observation has been the driving force behind Amazon's comprehensive approach to personalized customer experience, based on the mantra "a personalized store for every customer". A host of Web mining techniques, e.g. associations between pages visited, click-path analysis, etc., are used to improve the customer's experience during a store visit. Knowledge gained from Web mining is the key intelligence behind Amazon's features such as instant recommendations, purchase circles, wish-lists, etc. 6.1. Improve the website design The attractiveness of a site depends on the reasonable design of its content and organizational structure. Web mining can provide details of user behavior, giving web site designers a basis for decision making to improve the design of the site. 
6.2. System Improvement Performance and other service quality attributes are crucial to user satisfaction with services such as databases and networks. Similar qualities are expected by the users of Web services. Web usage mining provides the key to understanding Web traffic behavior, which can in turn be used for developing policies for Web caching, network transmission, load balancing, or data distribution. Security is an acutely growing concern for Web-based services, especially as electronic commerce continues to grow at an exponential rate. Web usage mining can also provide patterns which are useful for detecting intrusion, fraud, attempted break-ins, etc. 6.3. Predicting trends Web mining can predict trends within the retrieved information to indicate future values. For example, an electronic auction company provides information about items to auction, previous
• 17. Page 17 auction details, etc. Predictive modeling can be utilized to analyze the existing information and to estimate the values of auctioned items or the number of people participating in future auctions. The predictive capability of a mining application can also benefit society by identifying criminal activities. 6.4. To carry out intelligent business A customer's visit cycle in network marketing activities can be divided into four steps: being attracted, presence, purchase and departure. Web mining technology can uncover customers' motivations by analyzing customer click-stream information, in order to help sales staff make reasonable strategies, customize personalized pages for customers, and carry out targeted information feedback and advertising. In short, in e-commerce network marketing, using Web mining techniques to analyze large amounts of data can uncover the patterns of consumption of goods and the customers' access patterns, help businesses develop effective marketing strategies, and enhance enterprise competitiveness. Companies can establish better customer relationships by giving customers exactly what they need. Companies can understand the needs of the customer better and react to customer needs faster. Companies can find, attract and retain customers; they can save on production costs by utilizing the acquired insight into customer requirements. They can increase profitability by targeted pricing based on the profiles created. They can even find a customer who might defect to a competitor; the company can then try to retain that customer by providing promotional offers, thus reducing the risk of losing the customer. 7. RESEARCH DIRECTIONS The techniques being applied to Web content mining draw heavily from the work on information retrieval, databases, intelligent agents, etc. Since most of these techniques are well known and reported elsewhere, we have focused on Web usage mining in this survey instead of Web content mining. 
In the following we provide some directions for future research. 7.1 Data Pre-Processing for Mining Web usage data is collected in various ways, each mechanism collecting attributes relevant for its purpose. There is a need to pre-process the data to make it easier to mine for knowledge. Specifically, we believe that issues such as instrumentation and data collection, data integration and transaction identification need to be addressed. Clearly improved data quality can improve the quality of any analysis on it. A problem in the Web domain is the inherent conflict between the analysis needs of the analysts (who want more detailed usage data collected), and the privacy needs of users (who want as little
• 18. Page 18 data collected as possible). This has led to the development of cookie files on one side and cache busting on the other. The emerging OPS standard on collecting profile data may be a compromise on what can and will be collected. However, it is not clear how much compliance with it can be expected. Hence, there will be a continual need to develop better instrumentation and data collection techniques, based on whatever is possible and allowable at any point in time. Portions of Web usage data exist in sources as diverse as Web server logs, referral logs, registration files, and index server logs. Intelligent integration and correlation of information from these diverse sources can reveal usage information which may not be evident from any one of them. Techniques from data integration should be examined for this purpose. Web usage data collected in various logs is at a very fine granularity. Therefore, while it has the advantage of being extremely general and fairly detailed, it also has the corresponding drawback that it cannot be analyzed directly, since the analysis may start focusing on micro trends rather than on macro trends. On the other hand, whether a trend is micro or macro depends on the purpose of a specific analysis. Hence, we believe there is a need to group individual data collection events into groups, called Web transactions, before feeding them to the mining system. While techniques to do so have been proposed, more attention needs to be given to this issue. 7.2 The Mining Process The key component of Web mining is the mining process itself. As discussed in this paper, Web mining has adapted techniques from the fields of data mining, databases, and information retrieval, as well as developing some techniques of its own, e.g. path analysis. A lot of work still remains to be done in adapting known mining techniques as well as developing new ones. 
Web usage mining studies reported to date have mined for association rules, temporal sequences, clusters, and path expressions. As the manner in which the Web is used continues to expand, there is a continual need to figure out new kinds of knowledge about user behavior that needs to be mined. The quality of a mining algorithm can be measured both in terms of how effective it is in mining for knowledge and how efficient it is in computational terms. There will always be a need to improve the performance of mining algorithms along both these dimensions. Usage data collection on the Web is incremental in nature. Hence, there is a need to develop mining algorithms that take as input the existing data, mined knowledge, and the new data, and develop a new model in an efficient manner. Usage data collection on the Web is also distributed by its very nature. If all the data were to be integrated before mining, a lot of valuable information could be extracted. However, an
approach of collecting data from all possible server logs is both non-scalable and impractical. Hence, there needs to be an approach by which knowledge mined from various logs can be integrated into a more comprehensive model.
7.3 Analysis of Mined Knowledge
The output of knowledge mining algorithms is often not in a form suitable for direct human consumption, and hence there is a need to develop techniques and tools for helping an analyst better assimilate it. Issues that need to be addressed in this area include usage analysis tools and interpretation of mined knowledge. There is a need to develop tools which incorporate statistical methods, visualization, and human factors to help better understand the mined knowledge. Section 4 provided a survey of the current literature in this area. One of the open issues in data mining in general, and Web mining in particular, is the creation of intelligent tools that can assist in the interpretation of mined knowledge. Clearly, these tools need to have specific knowledge about the particular problem domain to do any more than filtering based on statistical attributes of the discovered rules or patterns. In Web mining, for example, intelligent agents could be developed that, based on discovered access patterns, the topology of the Web locality, and certain heuristics derived from user behavior models, could give recommendations about changing the physical link structure of a particular site.
8. WEB MINING PROS & CONS
8.1. PROS
Web mining has many advantages that make this technology attractive to corporations and government agencies alike. It has enabled e-commerce to do personalized marketing, which eventually results in higher trade volumes. Government agencies use this technology to classify threats and fight against terrorism. The predictive capability of mining applications can also benefit society by identifying criminal activities.
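The personalized marketing mentioned above typically rests on association rules of the kind discussed in Section 7.2, mined from page-visit transactions. A brute-force frequent-itemset sketch (the counting phase behind such rules; page names, data, and thresholds are invented, and a real system would add Apriori-style candidate pruning):

```python
from itertools import combinations

# Count how often each small set of pages co-occurs in user transactions;
# sets meeting the minimum support are the basis for association rules.
def frequent_itemsets(transactions, min_sup=2, max_size=3):
    items = {i for t in transactions for i in t}
    result = {}
    for size in range(1, max_size + 1):
        for cand in combinations(sorted(items), size):
            support = sum(1 for t in transactions if set(cand) <= t)
            if support >= min_sup:
                result[cand] = support
    return result

visits = [{"/home", "/products", "/cart"},
          {"/home", "/products"},
          {"/home", "/about"}]

print(frequent_itemsets(visits))
# {('/home',): 3, ('/products',): 2, ('/home', '/products'): 2}
```

From these counts, a rule such as /home → /products would have confidence 2/3 (support of the pair divided by support of /home), which is the kind of pattern a personalized-marketing system acts on.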
Companies can establish better customer relationships by giving customers exactly what they need. They can understand customer needs better and react to those needs faster. They can find, attract, and retain customers, and they can save on production costs by utilizing the acquired insight into customer requirements. They can increase profitability through target pricing based on the profiles created. They can even identify a customer who might defect to a competitor, and try to retain that customer with promotional offers, thus reducing the risk of losing the customer.
Prospects
The future of Web Mining will to a large extent depend on developments of the Semantic Web.
The role of Web technology continues to increase in industry, government, education, and entertainment. This means that the range of data to which Web Mining can be applied also increases. Even without technical advances, the role of Web Mining technology will become larger and more central. The main technical advances will be in increasing the types of data to which Web Mining can be applied. In particular, Web Mining for text, images, and video/audio streams will increase the scope of current methods. These are all active research topics in Data Mining and Machine Learning, and their results can be exploited for Web Mining. The second type of technical advance comes from the integration of Web Mining with other technologies in application contexts. Examples are information retrieval, e-commerce, business process modeling, instruction, and health care. The widespread use of web-based systems in these areas makes them amenable to Web Mining. In this section we outline current generic practical problems that will be addressed, the technology required for their solutions, and research issues that need to be addressed for technical progress.
Knowledge Management
Knowledge Management is generally viewed as a field of great industrial importance. Systematic management of the knowledge available in an organization can increase the organization's ability to make optimal use of that knowledge and to react effectively to new developments, threats, and opportunities. Web Mining technology creates the opportunity to integrate knowledge management more tightly with business processes. Standardization efforts that use Semantic Web technology, together with the availability of ever more data about business processes on the internet, create opportunities for Web Mining technology.
More widespread use of Web Mining for Knowledge Management requires the availability of low-threshold Web Mining tools that can be used by non-experts and that can be flexibly integrated into a wide variety of tools and systems.
E-commerce
The increased use of XML/RDF to describe products, services, and business processes increases the scope and power of Data Mining methods in e-commerce. Another direction is the use of text mining methods for modeling technical, social, and commercial developments. This requires advances in text mining and information extraction.
E-learning
The Semantic Web provides a way of organizing teaching material, and usage mining can be applied to suggest teaching materials to a learner. This opens opportunities for Web Mining. For
example, a recommending approach can be followed to find courses or teaching material for a learner. The material can then be organized with clustering techniques, and ultimately be shared on the web again, e.g. within a peer-to-peer network. Web mining methods can be used to construct a profile of user skills, competence, or knowledge, and of the effect of instruction. Another possibility is to use web mining to analyze student interactions for teaching purposes. The internet supports students who collaborate during learning. Web mining methods can be used to monitor this process without requiring the teacher to follow the interactions in detail. Current web mining technology already provides a good basis for this. Research and development must be directed toward important characteristics of interactions and toward integration into the instructional process.
E-government
Many activities in government involve large collections of documents: regulations, letters, announcements, reports. Managing the access and availability of this amount of textual information can be greatly facilitated by a combination of Semantic Web standardization and text mining tools. Many internal processes in government involve documents, both textual and structured. Web mining creates the opportunity to analyze these governmental processes and to create models of the processes and the information involved. It seems likely that standard ontologies will be used in governmental organizations, and the standardization that this produces will make Web Mining more widely applicable and more powerful than it currently is. The issues involved are those of Knowledge Management. Governmental activities that involve the general public also include many opportunities for Web Mining. Like shops, governments that offer services via the internet can analyze their customers' behavior to improve their services.
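The clustering of teaching materials described under E-learning above can be sketched very simply, e.g. by greedily grouping documents whose keyword sets overlap. All document names, keywords, and the similarity threshold below are invented for illustration:

```python
# Hedged sketch: group teaching materials by keyword overlap, as a stand-in
# for the clustering step described in the E-learning discussion above.
def jaccard(a, b):
    """Jaccard similarity of two keyword sets."""
    return len(a & b) / len(a | b)

def cluster(docs, threshold=0.5):
    """Single-pass greedy clustering: join the first cluster whose
    representative keyword set is similar enough, else start a new one."""
    clusters = []  # list of (representative keyword set, [doc names])
    for name, words in docs.items():
        for rep, members in clusters:
            if jaccard(words, rep) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((words, [name]))
    return [members for _, members in clusters]

materials = {
    "intro_sql":    {"sql", "database", "query"},
    "advanced_sql": {"sql", "database", "index"},
    "python_intro": {"python", "programming"},
}
print(cluster(materials))  # [['intro_sql', 'advanced_sql'], ['python_intro']]
```

A production system would use proper term weighting (e.g. TF-IDF) and a standard algorithm such as k-means; the point here is only the grouping idea.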
Information about social processes can be observed and monitored using Web Mining, in the style of marketing analyses. Examples are the analysis of research proposals for the European Commission and the development of tools for monitoring and structuring internet discussion of non-political issues. Enabling technologies for this are more advanced information extraction methods and tools.
Health care
Medicine is one of the Web's fastest-growing areas. It profits from Semantic Web technology in a number of ways. First, as a means of organizing medical knowledge: for example, the widely used taxonomy International Classification of Diseases and its variants serve to organize telemedicine portal content and interfaces. The Unified Medical Language System
  • 22. Page 22 (http://www.nlm.nih.gov/research/umls) integrates this classification and many others. Second, health care institutions can profit from interoperability between the different clinical information systems and semantic representations of member institutions’ organization and services. Usage analyses of medical sites can be employed for purposes such as Web site evaluation and the inference of design guidelines for international audiences, or the detection of epidemics. In general, similar issues arise, and the same methods can be used for analysis and design as in other content classes of Web sites. Some of the facets of Semantic Web Mining that we have mentioned in this article form specific challenges, in particular: the privacy and security of patient data, the semantics of visual material, and the cost-induced pressure towards national and international integration of Web resources. E-science In E-Science two main developments are visible. One is the use of text mining and Data Mining for information extraction to extract information from large collections of textual documents. Much information is “buried” in the huge scientific literature and can be extracted by combining knowledge about the domain and information extraction. Enabling technology for this is information extraction in combination with knowledge representation and ontologies. The other development is large scale data collection and data analysis. This also requires common concept and organisation of the information using ontologies. However, this form of collaboration also needs a common methodology and it needs to be extended with other means of communication, see for examples and discussion. Web mining for images and video and audio streams So far, efforts in Semantic Web research have addressed mostly written documents. Recently this is broadened to include sound/voice and images. Images and parts of images are annotated with terms from ontologies. 
Privacy and security
A factor that limits the application of Web Mining is the need to protect the privacy of users. Web Mining uses data that are available on the web anyway, but the use of Data Mining makes it possible to induce general patterns that can be applied to personal data to inductively infer data that should remain private. Recent research addresses this problem and searches for selective restrictions on access to data that allow the induction of general patterns but at the same time preserve a preset uncertainty about individuals, thereby protecting the privacy of individuals.
Information extraction with formalized knowledge
We briefly reviewed the use of concept hierarchies and thesauri for information extraction. If knowledge is represented in more general formal Semantic Web languages like OWL, there are in principle stronger possibilities to use this knowledge for information extraction. In summary, the main foreseen developments are:
– The extensive use of annotated documents facilitates the application of Data Mining techniques to documents.
– The use of a standardized format and a standardized vocabulary for information on the web will increase the effect and use of Web Mining.
– The Semantic Web goal of large-scale construction of ontologies will require the use of Data Mining methods, in particular to extract knowledge from text.
8.2. CONS
Web mining itself doesn't create issues, but this technology, when used on data of a personal nature, can cause concerns. The most criticized ethical issue involving web mining is the invasion of privacy. Privacy is considered lost when information concerning an individual is obtained, used, or disseminated, especially if this occurs without their knowledge or consent. The obtained data will be analyzed and clustered to form profiles; the data will be made anonymous before clustering so that there are no personal profiles. Thus these applications de-individualize users by judging them by their mouse clicks. De-individualization can be defined as a tendency to judge and treat people on the basis of group characteristics instead of on their own individual characteristics and merits. Another important concern is that companies collecting data for a specific purpose might use the data for a totally different purpose, which essentially violates the user's interests. The growing trend of selling personal data as a commodity encourages website owners to trade personal data obtained from their sites.
This trend has increased the amount of data being captured and traded, increasing the likelihood of one's privacy being invaded. The companies which buy the data are obliged to make it anonymous, and these companies are considered the authors of any specific release of mining patterns. They are legally responsible for the contents of the release; any inaccuracies in the release will result in serious lawsuits, but there is no law preventing them from trading the data. Some mining algorithms might use controversial attributes like sex, race, religion, or sexual orientation to categorize individuals. These practices might violate anti-discrimination legislation. The applications make it hard to identify the use of such controversial attributes,
and there is no strong rule against the usage of such algorithms with such attributes. This process could result in the denial of a service or privilege to an individual based on his race, religion, or sexual orientation; right now this situation can be avoided only by the high ethical standards maintained by the data mining company. The collected data is made anonymous so that the obtained data and patterns cannot be traced back to an individual. It might look as if this poses no threat to one's privacy, but in fact much additional information can be inferred by combining separately obtained pieces of data about the user.
9. CONCLUSION
The term Web mining has been used to refer to techniques that encompass a broad range of issues. However, while meaningful and attractive, this very broadness has caused Web mining to mean different things to different people, and there is a need to develop a common vocabulary. Towards this goal we proposed a definition of Web mining and developed a taxonomy of the various ongoing efforts related to it. Next, we presented a survey of the research in this area, concentrating on Web usage mining. We provided a detailed survey of the efforts in this area, even though the survey is short because of the area's newness. We also provided a general architecture of a system to do Web usage mining, and identified the issues and problems in this area that require further research and development. As the Web and its usage continue to grow, so does the opportunity to analyze Web data and extract all manner of useful knowledge from it. The past few years have seen the emergence of Web mining as a rapidly growing area, due to the efforts of the research community as well as of the various organizations practicing it. The key component of web mining is the mining process itself.
Here we described the key computer science contributions made in this field, including an overview of web mining, a taxonomy of web mining, the prominent successful applications, and an outline of some promising areas of future research.
10. REFERENCES
[1] http://en.wikipedia.org/wiki/Web_mining
[2] http://www.galeas.de/webimining.html
[3] Jaideep Srivastava, Robert Cooley, Mukund Deshpande, Pang-Ning Tan, Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data, SIGKDD Explorations, ACM SIGKDD, Jan 2000.
[4] Miguel Gomes da Costa Júnior, Zhiguo Gong, Web Structure Mining: An Introduction, Proceedings of the 2005 IEEE International Conference on Information Acquisition.
[5] R. Cooley, B. Mobasher, and J. Srivastava, Web Mining: Information and Pattern Discovery on the World Wide Web, ICTAI '97.
[6] Brijendra Singh, Hemant Kumar Singh, Web Data Mining Research: A Survey, 2010 IEEE.
[7] Soumen Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Part 2, 2003 edition.
[8] Anthony Scime (ed.), Web Mining: Applications and Techniques.
[9] R. Agrawal and R. Srikant, Fast Algorithms for Mining Association Rules.
[10] S. Agrawal, R. Agrawal, P. M. Deshpande, A. Gupta, J. Naughton, R. Ramakrishnan, and S. Sarawagi, On the Computation of Multidimensional Aggregates.
[11] R. Armstrong, D. Freitag, T. Joachims, and T. Mitchell, WebWatcher: A Learning Apprentice for the World Wide Web.
[12] M. Balabanovic, Y. Shoham, and Y. Yun, An Adaptive Agent for Automated Web Browsing, Journal of Visual Communication and Image Representation.
[13] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig, Syntactic Clustering of the Web.