Name: Tom Scullion
Date due: December 18, 2002
Project: Evaluation of New Information Systems Technologies
Technology: Data Mining and Clickstream Analysis
Project Deliverable No. 3 – Overview of technology and its place in business
Every time a visitor to the World Wide Web makes a request to open a Web page, the Web server on which
the page resides captures the visitor’s request in a log file. That log entry is referred to as clickstream data.
A tremendous amount of clickstream data is generated by Web traffic. The data accumulates rapidly and
must be managed to have a successful Web site. Clickstream data can be used to understand the interaction
between visitors and a Web site. It may be used to determine the efficiency and effectiveness of the Web
site from a marketing perspective, especially if the site is used for e-commerce. E-retailers must learn as
much as possible about the behavior, individual tastes, and preferences of their site visitors to be
successful. To help accomplish that goal and remain competitive, clickstream data should be compiled from the log
files into a database that can be integrated with other corporate data in a warehouse so that the data can be
mined and analyzed. Analysis of this data will help management understand how visitors are using their
Web site and how the Web site can be improved to "significantly boost customer relationship
management in ways that directly affect online visitor acquisition and retention" (Mena 2000).
Web server logs capture information when a visitor requests a Web page. Log file entries contain
information on the visitor, the pages requested, time of the request, status of the request, and size of the
page. A server may be configured to have four different log files or the files may be combined into an
extended format. The transaction or activity log records all file transfers from the server to a visitor’s
computer. The error log file records any visitor request for information that cannot be fulfilled. The
referrer log (or "referer" log, as traditionally misspelled in the HTTP specification) keeps track of where
incoming visitors to a Web page are coming from and the agent log tracks the software on the visitor’s
system that is being used to make the request. (Tittel 2001, McDunn 2002) Usually the files, other than the
error log, are combined into an extended log file.
The fields generally captured in extended log files are:

Field      Description                                                     Example
Host       Fully qualified domain name of the client, or its IP address   188.8.131.52
Ident      Identity information reported by the client if the             -
           identity-check option is enabled (seldom used)
Authuser   User ID used in a successful SSL request                       -
Date       The date and time of the request, including time zone          [17/Jun/2000:10:39:12 -0600]
Request    Request line from the client browser                           "GET /metadta.html HTTP/1.0"
Status     Three-digit HTTP status code returned to the client            200
Bytes      Number of bytes returned to the client browser for the         19365
           requested page
Referrer   URL of the referring server and requested file from the site   http://www.ewsolutions.com -> /
Agent      Browser and operating system name and version                  Mozilla/4.0 (Windows; I; 32bit)

From: Jennings, Michael F. (2000), "Using Clickstream as an e-Source for the e-Business Intelligence Environment"
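
To make the log format concrete, here is a minimal Python sketch that parses one extended-format entry into the fields named above. The regular expression and the sample line are illustrative assumptions, not a universal log format:

    import re

    # Pattern for one extended-format log entry; group names follow the table above.
    LOG_PATTERN = re.compile(
        r'(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
        r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
        r'(?P<status>\d{3}) (?P<bytes>\S+)'
        r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
    )

    def parse_log_entry(line):
        """Return a dict of named fields, or None if the line is malformed."""
        match = LOG_PATTERN.match(line)
        return match.groupdict() if match else None

    # A hypothetical entry in the format shown above.
    sample = ('188.8.131.52 - - [17/Jun/2000:10:39:12 -0600] '
              '"GET /metadta.html HTTP/1.0" 200 19365 '
              '"http://www.ewsolutions.com/" "Mozilla/4.0 (Windows; I; 32bit)"')

    entry = parse_log_entry(sample)
    if entry:
        print(entry['host'], entry['request'], entry['status'])

Each parsed entry becomes one record of clickstream data that can later be loaded into a database for analysis.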
Since the identity and authentication fields are seldom used, those fields usually contain just a dash. Because
this information is generally unknown but desirable, it is necessary to use other methods to establish a
visitor's identity, since the reason for accumulating this data is to be able to profile a visitor for marketing
purposes.
One method to establish a visitor's identity is to try to determine it from the visitor's IP address. This is
unreliable given the infrastructure of the Internet. Many people still use dial-up access to reach the
Web; the IP addresses for these visitors are dynamically assigned and change every time they sign on, so
they are inconsistent. If the visitor is going through an Internet Service Provider's (ISP) proxy server, the
IP address of the proxy server will be entered in the log file instead of the actual visitor's, which is of no use.
If the site has installed a load balancer to enhance performance and scalability, the log file will pick up the
IP address of the load balancer instead of the actual visitor's. In short, it is not possible to obtain reliable
information on a visitor's identity from an IP address.
Using cookie files is another way to identify visitors to a Web site although cookies may not provide as
much detailed information as registration. Cookies are small files that are placed on a visitor’s hard drive
by a Web server so that it can remember something about the visitor the next time they visit. Typically
cookies have six fields:
Field Name  Description
Name        The name of the cookie variable
Value       The string value assigned to the cookie variable
Domain      The domain name or Web site that created the cookie; it is the only domain
            permitted to receive or modify the cookie on subsequent accesses
Path        The top level of the subtree within the domain for which the cookie is valid and
            returned upon access to a page within the subtree
Expires     The expiration date of the cookie. The cookie persists on the client system until
            this date. If this value is not set, the cookie lasts only for the duration of the
            browser session. Cookies without an expiration date are referred to as transient
            cookies.
Secure      If this field is set to TRUE, a secure connection to the domain is required to
            pass the cookie

From: Sweiger, Mark, "Cookies: The Perfect User Identification Snack"
The server must be configured to use unique variables in the Name field. That way once a cookie is set on
a visitor’s system, every subsequent time that the same visitor accesses the Web page, they will be
identified by the unique value of the cookie variable. It’s not exactly name, address, and phone number but
it does allow a certain amount of personalization.
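
As a rough sketch of this mechanism, the following Python snippet builds a Set-Cookie header that assigns a unique identifier to a new visitor. The variable name visitor_id, the domain, and the expiration date are illustrative assumptions:

    import uuid
    from http.cookies import SimpleCookie

    def issue_visitor_cookie():
        """Build a Set-Cookie header assigning a unique ID to a new visitor."""
        cookie = SimpleCookie()
        cookie['visitor_id'] = uuid.uuid4().hex              # unique Value for the Name field
        cookie['visitor_id']['domain'] = 'www.example.com'   # only this domain may read it
        cookie['visitor_id']['path'] = '/'
        cookie['visitor_id']['expires'] = 'Fri, 31-Dec-2010 23:59:59 GMT'  # persistent, not transient
        return cookie.output()

    print(issue_visitor_cookie())
    # On later requests the browser returns "visitor_id=<hex>", which the
    # server can log alongside the clickstream entry to recognize the visitor.
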
Cookies are used to record a visitor’s preferences when using a particular site. They may be used to
remember what pages have been requested and what ads have been sent as well as what type of browser is
being used. This information enables the server to customize pages for browser type or for other
information gleaned from prior visits to the site.
There are security and privacy issues with the usage of cookies. Cookies are placed on the visitor's system
by the server and are retrieved the next time the visitor accesses that particular site. Because the same
technology could be used to retrieve other information from the visitor's hard drive, cookies are considered
potential threats to security. Because they are used to identify specific visitors, they are also considered
threats to privacy. Since cookies are used so extensively to identify unique visitors, many sites don't allow
access to a browser that is set to refuse placement of cookies. (Dodson 2000)
Another method used by some sites to get visitor identity is to have visitors log in when accessing a Web
site. This method may populate both the ident field and the authuser field of a log entry if passwords are
required. Registration to obtain the login and establish a password may confirm a visitor's identity
because the visitor is generally required to fill out an online form that includes name, address, and other
information that might be useful for marketing. Although this method probably provides the most accurate
demographic data, there are reasons why it may not always work. First, a visitor may not provide truthful
information (go figure) when registering to use the site so the reliability of the information is suspect.
Second, this method may alienate potential visitors that wish to remain anonymous and don’t want to have
to log in every time they access the site. It may be possible to overcome these challenges by using
incentives to obtain accurate information such as a redeemable coupon that will be mailed to the
registrant’s address (Mena 2000). Registration is the best method of obtaining enough information to
match visitors with demographic and behavioral data from third party vendors. This will provide more
complete data to use in profiling visitors to a web site.
Cookies and registration allow an enterprise to obtain enough quality data to profile customers, although
customers and potential customers may consider these methods invasive. Attitudes, however, seem to be
changing in some circumstances. A survey conducted by Cyber Dialogue and documented in
"Privacy vs. Personalization" (Mabley 2000) produced some significant findings:
• 82% of those surveyed were willing to provide personal information such as gender, age, and ethnicity if the
site would remember their preferences and personal information
• 56% said they are more likely to purchase from a site that allows personalization
• 63% said they are more likely to register at a site that allows personalization or content customization
• 84% have refused to provide information at a site that won't disclose how the information will be used
Based on this information, it appears that online visitors to a Web site are willing to provide the
information that marketers need as long as the information is only used to personalize the visitor’s online
experience. Other findings of the survey included the fact that only 58% of people using the Web
understand cookies and how they work. The conclusion is that an online enterprise needs to educate its
visitors regarding what information is being obtained, how it is being used, and how the enterprise's privacy
policy governs the use of the information. Generally, organizations should explain the process and its
benefits and allow visitors to opt out if desired. "Let your customers have access to their own profiles. In
the end they will realize that the objective is to service them better than your competitors.” (Mena 2000)
Registration information, cookies and Web logs provide a great deal of raw data about Web site visitors
that may be useful to an organization. They tell how a visitor responds to the site’s content, what links are
used, how much time is spent on a particular page or on the whole site, what time of day the visitor
browses and more. With all this information at hand, one of the challenges becomes how to organize this
clickstream data so that it is usable in a decision support system.
The first step is to add the clickstream data from the Web to a data warehouse. A data warehouse is a
subject-oriented, nonvolatile collection of data for decision support (Mena 2002). The data warehouse may
also include data from various other sources such as a customer relationship management database or a
sales force automation database. Cookies and Web server log entries are generally plain text, so the data must
be processed by an extraction, transformation, and load (ETL) tool into a format that can be moved into the data
warehouse. Data from registrations and purchases should be added to the clickstream data for more
complete information on visitors and customers. Additional demographic and behavioral information from
third party vendors should be appended to the clickstream data also. This third party data can include
information such as location and even projected income. Third party data is most useful if there is adequate
information from registrations or purchases to match visitors to the demographic data. Once the data is in
the warehouse, data mining methodologies can be applied to provide valuable information on visitors and
customers and on marketing efficiency.
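
A minimal sketch of the load step, assuming log entries have already been parsed into dicts of the fields shown earlier. SQLite stands in for the warehouse here, and the table layout is an assumption:

    import sqlite3
    from datetime import datetime

    def load_clickstream(entries, db_path='warehouse.db'):
        """Transform parsed log dicts and load them into a warehouse fact table."""
        conn = sqlite3.connect(db_path)
        conn.execute("""CREATE TABLE IF NOT EXISTS clickstream (
                            host TEXT, request_time TEXT, page TEXT,
                            status INTEGER, bytes INTEGER, referrer TEXT)""")
        rows = []
        for e in entries:
            # Transform: parse the timestamp and pull the page out of the request line.
            ts = datetime.strptime(e['date'].split()[0], '%d/%b/%Y:%H:%M:%S')
            page = e['request'].split()[1] if e['request'] else None
            size = int(e['bytes']) if str(e['bytes']).isdigit() else 0  # '-' means no body
            rows.append((e['host'], ts.isoformat(), page,
                         int(e['status']), size, e.get('referrer')))
        conn.executemany('INSERT INTO clickstream VALUES (?, ?, ?, ?, ?, ?)', rows)
        conn.commit()
        conn.close()

In practice a commercial ETL tool would also append registration, purchase, and third-party demographic data to each visitor's rows.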
Data mining software programs use a variety of statistical approaches to sort through large amounts of data
to identify patterns, establish relationships within the data and use that information to predict behavior.
Data mining uses various techniques. Association is the technique of looking for patterns where one event
is connected to another event, such as identifying items that are likely to be purchased or viewed in the
same Web site visit. Sequence or path analysis is looking for patterns where one event leads to another
later event such as the path a visitor takes through a Web site and what items are viewed or purchased.
Classification is assigning cases to predefined categories based on their attributes. Clustering, or
segmentation, is identifying visitors and customers that share common characteristics (Greening 2000); it
can break groups down until reaching the famous "segment of one" (Doherty 2000). Forecasting is
discovering patterns in data that can lead to reasonable predictions about visitor behavior.
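
As a minimal illustration of the association technique, the following Python sketch counts how often pairs of pages are viewed in the same visit; the session data is hypothetical:

    from itertools import combinations
    from collections import Counter

    # Hypothetical sessions: the set of pages viewed in each visit.
    sessions = [
        {'/home', '/shoes', '/socks'},
        {'/home', '/shoes', '/belts'},
        {'/home', '/socks'},
        {'/shoes', '/socks'},
    ]

    # Association: count co-occurrences of page pairs within a visit.
    pair_counts = Counter()
    for pages in sessions:
        for pair in combinations(sorted(pages), 2):
            pair_counts[pair] += 1

    for (a, b), n in pair_counts.most_common(3):
        print(f'{a} and {b} viewed together in {n} visits')
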
Data mining generally uses traditional data sources such as customer relationship management systems and
sales force automation systems to supply data that would be mined for marketing and customer
information. Since the Internet always has to have its own terminology, the term Web Mining was recently
coined to refer to mining data from traditional sources that have been appended with data from the Web.
As defined on searchCRM.techtarget.com, “Web mining is the integration of information gathered by
traditional data mining methodologies and techniques with information gathered over the World Wide
Web. Web mining is used to understand customer behavior, evaluate the effectiveness of a particular Web
site, and help quantify the success of a marketing campaign.”
Once the clickstream data is incorporated into an organization’s data warehouse, Web mining software can
be used to extract information to be used to profile visitor behavior and preferences. The object is to find
associations between customer profiles and buying habits such as what age group is most likely to make
multiple purchases. This information can be used to personalize promotions based on demographic
information. Personalization optimizes the advertisements and products that a visitor sees: because
personalized marketing is based on preferences developed from the visitor's profile, the products
displayed are of more interest to the visitor, which increases the likelihood that the visitor will make a purchase.
After examining associations, the data may go through segmentation analysis. Segmentation is the process
of dividing a customer base into smaller markets based on different needs, preferences, behavior, and
attributes. The reason for segmentation is that as the group is segmented into smaller and smaller sectors, it
will enable interaction with each segment in different ways (Mena 2000). Segmentation analysis may
reinforce the information discovered through association analysis and provide more detailed information on
smaller market segments for personalized marketing activity. This may assist an enterprise in maximizing
its marketing budget by targeting the market segments that are most likely to provide the biggest return for
their marketing efforts. It may become apparent that females under age 35 are most likely to make multiple
purchases. Promotions can be targeted at that market segment to reward their loyalty and turn those
customers into long-term clients.
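
A minimal sketch of segmentation in Python, assuming hypothetical customer records; it splits customers by gender and an under-35 age cut and compares repeat-purchase behavior across segments:

    from collections import defaultdict

    # Hypothetical customer records: (gender, age, number_of_purchases).
    customers = [
        ('F', 28, 4), ('F', 33, 3), ('M', 41, 1),
        ('F', 52, 1), ('M', 24, 2), ('F', 30, 5),
    ]

    # Segment by gender and an under/over-35 age split, then compare behavior.
    segments = defaultdict(list)
    for gender, age, purchases in customers:
        key = (gender, 'under 35' if age < 35 else '35 and over')
        segments[key].append(purchases)

    for key, buys in sorted(segments.items()):
        repeat_rate = sum(1 for b in buys if b > 1) / len(buys)
        print(key, f'repeat-purchase rate: {repeat_rate:.0%}')
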
Based on the historical customer purchasing data in the data warehouse, Web mining software programs
may be "trained" to predict outcomes, such as the number of purchases new visitors are likely to make.
Once trained, the software can also indicate which attributes are most important for prediction, such as age,
gender, or projected net worth. This information is useful in modifying a registration form to collect the
information that would be most helpful in raising the conversion rate, which is the percentage of visitors
that are converted to customers. If the model predicts that women between the ages of 30 and 45 are likely to
purchase a particular product, then after completing the registration form, a visitor that fit that profile
would be directed to a welcome page that prominently displayed that product. Having already earned a
promotional coupon toward the purchase price by accurately filling out the registration form, the visitor
may find the purchase even more desirable.
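
The following sketch illustrates the conversion-rate arithmetic and the routing step described above. The profile rule stands in for the trained model's prediction, and all numbers are hypothetical:

    def conversion_rate(visitors, customers):
        """Percentage of visitors who became customers."""
        return customers / visitors if visitors else 0.0

    def welcome_page(profile):
        # Hypothetical stand-in for the trained model's prediction: the model
        # flagged women aged 30-45 as likely buyers of the featured product.
        if profile['gender'] == 'F' and 30 <= profile['age'] <= 45:
            return '/welcome/featured-product'
        return '/welcome/default'

    print(f"Conversion rate: {conversion_rate(12500, 340):.1%}")   # 2.7%
    print(welcome_page({'gender': 'F', 'age': 36}))                # featured page
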
Segmentation of markets to enable personalization of marketing efforts is one of the huge benefits of Web
mining. There are many additional benefits to Web mining and dmreview.com provides an extensive list:
Understand Customer Behavior:
• Companies can optimize e-business sites for maximum commercial impact by understanding the
dynamic behavior of visitors to their Web sites
• E-tailers can now gain knowledge of the individual tastes and preferences of the visitors to their sites.
• Determine the conversion rate of visitors to buyers on your site.
• Determine the repeat frequency of existing buyers (i.e., the likelihood of customers repurchasing).
• Calculate the rate of new customer acquisition.
• Discover actionable browsing and buying patterns of customers.
• Learn who is buying what from your site.
• Discover cross relationships between clients in your e-commerce sites.
Determine Web Site Effectiveness:
• Discover high and low impact areas of your e-commerce site.
• Web administrators no longer have to rely on intuition when designing the layout of a Web site.
• E-tailers can now develop the look and feel of the Web site and personalize online content.
Measure the Success of Marketing Efforts:
• In the physical world it is difficult to get reliable feedback on marketing campaigns. But, on the
Internet you can get real measurements of the success of a marketing campaign.
• Companies can cluster customers with similar patterns, and the Web site can adapt to recognized
customers. Segments can then be targeted with campaigns and special offers.
• Effectively gauge the return on investment of banner advertising.
Web mining is not inexpensive. Maintaining a data warehouse and dealing with the volume of clickstream
data generated by the average Web site can consume a significant amount of an enterprise's resources. But the
benefits of mining clickstream data should offset the cost by allowing an enterprise to distinguish itself
from the competition, providing a higher level of customer service and getting a higher return on its
marketing investment.
Data Mining Process
Data mining is not a panacea; it will not automatically solve all the problems in the data universe. A
structured methodology must therefore be used to "find problems, define solutions, set expectations,
and deliver results." This process is sometimes referred to as Knowledge Discovery, defined as a
multi-stage business process leading to the automated detection of regularities in data that are useful in
new situations. Knowledge discovery attempts to ensure that the goals of mining data align with the goals
of the data users. (Pyle 1998)
Several appropriate methodologies can be found, but most include the following steps:
1) Define the business problem
2) Build data mining database
3) Explore data
4) Prepare data for modeling
5) Build model
6) Evaluate model
7) Act on the results (Edelstein 2001)
A standard known as CRISP-DM (Cross Industry Standard Process for Data Mining) is being developed by
a consortium of mostly European vendors. CRISP-DM is still a work in progress, so most data mining
methodologies use it as a foundation from which to begin defining the process.
Define the business problem
Determine what specific business problem the organization is attempting to solve with data mining.
The problem definition should be clear and concise so that it can form the basis for a project plan.
Data mining can be used for a variety of business problems including improving customer relationship
management, looking for patterns of fraud in credit card usage or identifying reasons for churn among
telecommunications customers. For an e-commerce site, a business problem that data mining may be
used to solve is identifying the online customers that provide the greatest value to the organization and
developing profiles of those customers based on specific attributes.
Once an objective is determined, criteria should be established to measure the success of the project.
Success criteria should be defined in relation to the objective defined by the business problem. For an
e-commerce site trying to identify high value customers, success may be measured by an increase in the
effectiveness of targeted marketing campaigns that use the customer profiles developed by the data
mining project.
Once the business problem is defined, it's necessary to determine whether the available data can in fact
solve it. To identify particular types of customers, there should be enough data on customer attributes
from Web logs, transactions, and registrations to provide a meaningful profile of individual customers.
Data that is currently available in the Enterprise Data Warehouse (EDW) should be analyzed to determine
if it's adequate for the project. It may be necessary to obtain additional data from other sources to
increase the likelihood of success for the project.
Data mining objectives may be somewhat different from the business goals. Data mining goals should
be established that will generate the information necessary to achieve the business goals. So if the
project goal is to profile high value customers, the data mining goal may be to determine the attributes
of customers that provide the highest return to the organization.
Based on the business objectives and the data mining objectives, a project plan should be developed
with input from the primary stakeholders.
Build Data Mining Database
Building the data mining database and the next two steps of exploring the data and preparing the data
for modeling will take most of the time and effort of any data mining project. Most of the data
necessary to build the data mining database should be available in the EDW. The EDW itself is usually not
used for data mining because data may need to be altered for the project. Data mining can also be
resource intensive, and the EDW will probably have other resource demands, so data mining may cause
system degradation that would be unacceptable. Generally a separate data mart should be built for data
mining. The design of the database should correspond to the business problem that is being solved and
the data available.
Explore the Data
Gaining an understanding of the data that will be mined is necessary prior to actually mining the data.
Exploring the data is the best way to gain that understanding. Graphing and visualization tools may be
used to explore the data. They can be important for revealing general patterns in the data. This step
may often lead to some previously unknown realization about the data that may help in the modeling
process.
Clustering is a common way to explore data. Clustering is a method of dividing data into different
groups based on similar attributes. Clustering may appear to be similar to segmentation, but it differs in
that segmentation assigns data to predefined groups, while clustering puts data into groups that were not
previously defined, based on attributes that may not have been identified in advance.
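
As a sketch of clustering, the following uses scikit-learn's KMeans (one of many available tools, an assumption rather than a requirement of the process) to let the algorithm discover two groups in hypothetical visitor data:

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical visitor attributes: [age, total spending].
    visitors = np.array([
        [22,  40], [25,  55], [31,  60],    # younger, lighter spenders
        [45, 400], [50, 380], [48, 420],    # older, heavier spenders
    ])

    # Clustering: let the algorithm find the groups rather than defining them up front.
    model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(visitors)
    for center in model.cluster_centers_:
        print(f'cluster center: age {center[0]:.0f}, spending ${center[1]:.0f}')
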
Link analysis is another method of exploring data. Common approaches to link analysis are
association discovery and sequence or path discovery. Association and path analysis were discussed
previously and are useful in gaining a general understanding of the data before deciding on a model.
A good understanding of the data may provide insight into how to proceed with the remaining steps of
the process in a manner most likely to ensure the success of the project.
Prepare the Data for Modeling
Data preparation may be the most important part of mining. Even though the data has been
transformed to some extent when it was added to the EDW, it will probably still require some
additional data preparation after it’s added to the data mining database. Once the data is in the data
mining database it should be tested for consistency, for missing variables, and for outliers. Missing
data should be reviewed to determine the best method to decrease the impact of the missing data on the
modeling process. It may be decided that it's all right to ignore the missing entries; alternatively, an
attempt may be made to estimate the missing values from other information, or the records may simply be
discarded. Outliers represent data outside expected parameters and may be errors or just new, unexpected
data. Some data mining programs handle missing data and outliers better than others, so whether they are
ignored or included depends on the software being used (Pyle 1998).
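
The following pandas sketch illustrates these checks on a hypothetical slice of a data mining database: missing values are estimated from the column median, and values far from the mean are flagged as possible outliers:

    import pandas as pd

    # Hypothetical slice of the data mining database.
    df = pd.DataFrame({'visits': [3, 5, 4, None, 6, 120],
                       'spend':  [40.0, 55.0, None, 40.0, 60.0, 9000.0]})

    # Missing values: here we estimate them from other information (the median);
    # ignoring or discarding the rows are the other options discussed above.
    df = df.fillna(df.median(numeric_only=True))

    # Outliers: flag values more than two standard deviations from the mean.
    for col in df.columns:
        z = (df[col] - df[col].mean()) / df[col].std()
        outliers = df.loc[z.abs() > 2, col]
        print(f'{col}: {len(outliers)} outlier(s)')
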
In addition to preparing the data for mining, some additional transformation may be necessary. Using
data mining to predict behavior may require new variables that have to be derived from the data. For
transaction data on existing customers, RFM variables may be good predictors. RFM stands for
recency, frequency, and monetary value. Recency is generally some measure of time since the last
transaction; frequency is the number of transactions in a designated period; and monetary value is the
total of transactions within a designated period, as well as an average per transaction. These derived
variables make the data more meaningful to the mining process and provide additional parameters from
which the mining software may discover useful relationships.
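
A minimal sketch of deriving the three RFM variables from a customer's transaction history (the transactions are hypothetical):

    from datetime import date

    # Hypothetical transactions for one customer: (date, amount).
    transactions = [(date(2002, 9, 3), 42.00),
                    (date(2002, 10, 18), 61.50),
                    (date(2002, 11, 30), 38.25)]
    as_of = date(2002, 12, 18)

    # Recency: time since the last transaction.
    recency_days = (as_of - max(t[0] for t in transactions)).days
    # Frequency: number of transactions in the period.
    frequency = len(transactions)
    # Monetary: total and average per transaction in the period.
    total = sum(t[1] for t in transactions)
    average = total / frequency

    print(f'R={recency_days} days, F={frequency}, M=${total:.2f} (avg ${average:.2f})')
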
Build the Data Model
Modeling is the next step in data mining. This is probably what most people think of as data mining
although there is a great deal more to it than just building the models. First it must be decided what
type of prediction is the most appropriate solution to the business problem. Then a model type is
chosen based on the type of prediction, and algorithms are explored that fit the predictive type
and the models being developed. Various algorithms are used for modeling, from simple conditional
logic to complex neural networks. Types of predictive models generally fall into one of two
categories: classification or regression. Moving through these decisions, from type of prediction to
algorithm, generally requires highly skilled analysts that understand the business and have experience
with data mining and statistical analysis. Selection of these tools and the people that can execute the
process depends on the business problem being solved, the data being used to solve it, and the
resources available to the organization.
There are many classification and regression algorithms available for modeling and as many variations
of each type as there are vendors. But to explore them all would be a project in itself so only a select
few are described below.
One type of prediction that may be used is classification. Classification is generally used to predict the
category or class into which a particular case might fall. For instance, a customer will fall into a
particular class depending on the attributes of the customer identified within the data. Classification
models are generally simpler than regression models in that they are more understandable to business
users that may not have a statistical background. Classification models can use categorical data, such
as a state abbreviation, instead of just numerical data. One reason classification models are easier to
understand is that the variables are more meaningful, and the results are easier to interpret for analysts
without statistical experience because the outputs are generally in ranges such as high, medium, and low.
Decision trees are one of the better known classification types of models. They use conditional logic
to partition data into groups based on values for a particular attribute. For instance, customers may be
split into subsets based on RFM variables. One tree node may make a decision based on whether three or
more purchases were made in the last three months; a following node may make a decision based on the
value of recent purchases. If the decision was positive at both nodes, the algorithm may classify that
customer as high value; if the other variables were similar but the purchases averaged between thirty and
fifty dollars, the customer may be classified as medium value, and so on. Decision trees have limitations
because it is necessary to select one specific attribute, such as age or last purchase, for classification at
each stage of the process. Each decision in the tree is also based on the current decision node, without
taking into account any previous or future decisions. The attribute selected for the first decision, or root
node, is likewise subjective, depending on the modeler or the business problem being addressed.
Decisions made at each node also represent hard splits that may lead to somewhat arbitrary results: a
customer with four purchases that averaged $49 would be classified as medium value instead of high
based on a $1 difference, even though the value to the organization was higher due to the larger number
of purchases. There are limitations like these in any modeling algorithm, which is what makes it so
important to have quality people that are familiar with the business and are able to use their business
experience to build and maintain the models being used to solve the business problem.
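
As an illustration, the following sketch trains scikit-learn's DecisionTreeClassifier on hypothetical RFM-style cases; the library choice and the training data are assumptions, not part of any particular vendor's product:

    from sklearn.tree import DecisionTreeClassifier, export_text

    # Hypothetical training cases: [recency_days, purchases_last_3_months, avg_purchase].
    X = [[10, 4, 80], [12, 5, 120], [45, 1, 30],
         [60, 0, 0],  [20, 3, 45],  [15, 4, 49]]
    y = ['high', 'high', 'low', 'low', 'medium', 'medium']

    # Each node splits on one attribute at a time, as described above.
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
    print(export_text(tree, feature_names=['recency', 'purchases', 'avg_purchase']))

    # Classify a new customer: 4 purchases averaging $49 in the last 3 months.
    print(tree.predict([[14, 4, 49]])[0])

Printing the tree makes the hard splits visible, which is why business users generally find these models easier to interpret than regression models.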
Rule induction is another method of classification. Rule induction looks at data and generates a set of
rules that may be used to classify cases. Rules may be generated based on relationships and
confidence levels. For instance based on the data a rule may be generated that states with 75%
confidence that women in a certain age range will purchase product A with product B. Rules may be
"fuzzy" or inexact. Inexact rules have a fixed confidence factor, such as the 75% mentioned above.
Fuzzy rules have a confidence factor that varies with one of the attributes, so the confidence that
women in a certain age range will purchase product A with product B may increase as the age increases.
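
A minimal sketch of computing an inexact rule's confidence from transaction data; the baskets are hypothetical and chosen so the rule comes out at the 75% used in the example above:

    # Hypothetical baskets of products purchased together.
    baskets = [{'A', 'B'}, {'A', 'B'}, {'A', 'B'}, {'A'}, {'B'}]

    def rule_confidence(baskets, antecedent, consequent):
        """Confidence of 'antecedent => consequent': P(consequent | antecedent)."""
        with_a = [b for b in baskets if antecedent in b]
        if not with_a:
            return 0.0
        return sum(1 for b in with_a if consequent in b) / len(with_a)

    # "Customers who buy A also buy B" with a fixed (inexact) confidence:
    print(f"confidence(A => B) = {rule_confidence(baskets, 'A', 'B'):.0%}")
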
Genetic algorithms also generate rules, but not by exploring the data; they base rules on changes in
patterns of data until a pattern emerges. Genetic algorithms use rules that have already been developed
to combine patterns and develop relationships within the data, so they are more a tool for improving the
algorithm being used to build the model than a modeling tool in their own right.
Boosting is a classification method that, in effect, lets majority rule decide how data is classified.
Boosting takes several random samples from the data and builds a classification model for each set.
The training sets are modified based on previous results until the outputs fit the expected
classifications. Additional samples are then processed by each model, and the classifications assigned
most often are used.
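
As one widely available boosting implementation, the sketch below uses scikit-learn's AdaBoostClassifier on hypothetical customer cases; the data and feature choices are assumptions:

    from sklearn.ensemble import AdaBoostClassifier

    # Hypothetical cases: [purchases_last_3_months, avg_purchase] -> high/low value.
    X = [[5, 100], [4, 90], [1, 20], [0, 0], [3, 60], [1, 15]]
    y = ['high', 'high', 'low', 'low', 'high', 'low']

    # Boosting: fit a sequence of simple models, reweighting the training data
    # after each round, then combine their votes.
    model = AdaBoostClassifier(n_estimators=25, random_state=0).fit(X, y)
    print(model.predict([[4, 80]])[0])   # expected: 'high'
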
Regression type predictions generally use some kind of scoring method to predict customer behavior.
Regression type models do not allow categorical data. All data must be numerical, so it's necessary to
convert data such as state abbreviations to numerical values in some logical manner so the program can use
them. Since they only use numerical data, regression models are more difficult to understand because the
data is less recognizable. Regression type models may appear to operate like a “black box” because
it’s more difficult to visualize how the data is being manipulated than with classification algorithms.
The output is also not as easy for non-statistical analysts to comprehend so they generally require more
sophisticated personnel to operate and interpret. Regression models do provide continuously varying
outputs though as opposed to simply putting outputs in buckets. Scoring allows for customers to fall
into a range to fit a category or to segment the results into smaller, more well defined groups.
Neural networks are one of the better known regression modeling algorithms. Neural networks consist
of algorithms that take predictor variables at the input layer and then assign weights to the paths that
the data uses to travel to the next layer of nodes. There may be one or more hidden layers depending
on the number of attributes being analyzed that the data must go through before emerging at the output
layer. The values of the variables and the paths traveled determine the direction the data will take and
the value it will have when it reaches the output layer. The scores generated at the output layer are what
is used for prediction. The neural network must first be "trained" using data
that has already been evaluated. The data is sent through the net several times and the weights
associated with the paths are adjusted each time until the results of the training set correspond to the
expected outputs. Then the network is considered to be trained. Neural nets “(a) evaluate input values,
(b) calculate a total for the combined input values, (c) compare the total with a threshold value and (d)
determine what its own output will be.” (Information Discovery Inc. 1997)
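
A minimal sketch of a one-hidden-layer network using scikit-learn's MLPClassifier; the predictor variables and training data are hypothetical, and the probability output illustrates the continuously varying score discussed above:

    from sklearn.neural_network import MLPClassifier

    # Hypothetical predictor variables: [age, income_thousands] -> bought (1) or not (0).
    X = [[25, 30], [32, 45], [41, 90], [52, 120], [23, 20], [45, 100]]
    y = [0, 0, 1, 1, 0, 1]

    # One hidden layer of four nodes; training repeatedly adjusts the weights on
    # the paths between layers until outputs approximate the known results.
    net = MLPClassifier(hidden_layer_sizes=(4,), max_iter=2000, random_state=0).fit(X, y)

    # The output is a continuously varying score rather than a hard bucket.
    print(net.predict_proba([[38, 85]])[0][1])   # estimated probability of purchase
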
Evaluate the Model
Models should be evaluated from a business perspective based on cost benefit analysis and return on
investment. The results of the model may show some interesting patterns but acting on them may not
provide the incremental revenue or cost savings that would justify their use.
One of the simpler ways to evaluate a model is to test the results in the real world. Select a sample
from the population to test a prediction of the model and see how well the actual results follow the
predicted results. The model may predict the likelihood that a certain segment of the market will
respond to a particular promotion. By implementing the promotion on a limited sample and testing the
results against the prediction, the model’s effectiveness can be measured.
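
The holdout test described above can be reduced to a simple comparison of predicted and actual response rates. In this sketch the tolerance threshold and all numbers are hypothetical:

    def evaluate_campaign(predicted_rate, responses, sample_size, tolerance=0.25):
        """Compare the model's predicted response rate to an actual test mailing."""
        actual_rate = responses / sample_size
        error = abs(actual_rate - predicted_rate) / predicted_rate
        print(f'predicted {predicted_rate:.1%}, actual {actual_rate:.1%}, '
              f'relative error {error:.0%}')
        return error <= tolerance

    # Hypothetical test: the model predicted a 4.0% response; 1,000 promotions
    # were sent to the target segment and 35 customers responded.
    print('model acceptable:', evaluate_campaign(0.040, 35, 1000))
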
Act on the Results
Once working models are available, they can be used to understand customer behavior and customer
expectations. They may be incorporated in production systems such as campaign management
software for marketing purposes. Campaign management software automates marketing campaigns
that are used to target customer segments with specific promotions that are most likely to achieve the
desired results. Targeting specific segments in this manner should increase the response rate to the
promotional campaign based on the model’s predictions. This maximizes marketing efficiency and
effectiveness. Profiles developed from the data mining project should identify customers that are most
likely to respond to cross-selling or up-selling promotions, leading to an increase in the customers'
lifetime value to the organization.
The customer profiles developed from the project can also be used in a Web environment to classify
visitors to the site based on their registration information. Then the site can personalize the content
presented to them based on the classification. This will increase the likelihood of converting the
visitor to a customer. Personalizing the content of the Web site also helps to differentiate the site from
the competition and provide a higher level of customer service.
The object is to use the predictive models to drive marketing efforts that will turn Web site visitors into
customers and customers into long term clients.
Evaluation of Data Mining Products
Data mining products should be evaluated for the same attributes that any software package would be
evaluated for:
• User interface – How easy is it to use?
• How much customization may be necessary?
• Documentation and online help
• Platforms on which it will run
• Databases to which it will connect
• Extensibility – open architecture or proprietary
Another area of evaluation more specific to data mining software is accuracy. There are
organizations that review the software by testing it on known data sets for purposes of providing
some assurance that it will perform as specified. They include Audit Bureau of Verification
Services, Inc., BPA Interactive and the Internet Audit Bureau. An endorsement by one of these
organizations is a good measure of the accuracy of a particular software package.
Data mining software should also be evaluated for its ability to prepare the data for mining. Since
data preparation may be one of the most time consuming tasks in the process, anything a software
package can do to expedite the process will greatly enhance its value.
As discussed above, there are also issues in selecting data mining software relating to the
software’s ability to deal with anomalies in the data. It must be evaluated on its ability to handle
missing data and outliers.
There are other criteria that are important in selecting software, but one more that should be
mentioned is integration. How well will a data mining package integrate with other enterprise
systems? Can it be integrated with the Enterprise Data Warehouse or with campaign management
software? Integration is often the bane of IT departments due to lack of communication in the
selection process, and it should be fully explored prior to committing to a particular product.
Retailing ventures continue to grow on the Web. As more enterprises establish a Web presence the
competition for online retail sales increases. Increased competition makes it more and more difficult for e-
commerce sites to attract and retain online customers. Web mining can be a dynamic resource that
provides an enterprise with ways to distinguish itself from its competition by tailoring its Web site to
visitors based on profiles developed from clickstream data. It will enable an enterprise to improve
customer service, expand product offerings, and, because it should make marketing efforts more cost
effective, perhaps even reduce product prices. Web mining can provide important insights into customer
behavior and expectations that can be used to implement efficient marketing campaigns. It can also
provide the tools to measure the effectiveness of those campaigns. Web mining is a resource that may be
necessary for the survival of an online enterprise as the marketplace grows and customers become more
sophisticated (Mena 2000).
Data mining uses an evolving process that requires highly trained analysts to develop models that may be
used to predict customer behavior based on a set of attributes. The process involves complex statistical
algorithms that are categorized as classification or regression models. The models manipulate data to
assign the output to a particular case (classification) or assign a score to the output (regression) that can be
used to place customers into specific market segments that can be targeted for promotional campaigns.
Models may also be used to predict the behavior of Web site visitors based on attributes learned through
registration. Once a visitor is assigned to a class or given a score, Web site content may be personalized
based on the class to increase the likelihood of converting the visitor to a customer.
Using data mining as a tool in this manner may provide an organization with a competitive advantage
through more personalized interaction with Web site visitors and customers. It will also allow a higher
level of customer service that will differentiate an organization using data mining from those that are not.
Sweiger, Mark, "Cookies: The Perfect User Identification Snack"
Dodson, Jody (January 24, 2000), "It's Time to Slay the Cookie Monster"
Tittel, Ed (June 13, 2001), "Understanding Web Server Log Files"
Mena, Jesus (2002), "Integrating and Mining Web Data in Your Warehouse"
Jennings, Michael F. (June 30, 2000), "Using Clickstream as an e-Source for the e-Business Intelligence Environment"
Doherty, Patricia (January 2000), "Web Mining – The E-Tailer's Holy Grail"
Greening, Dan R. (2000), "Data Mining on the Web", www.webtechniques.com/archives/2000/01/greening/
McDunn (August 15, 2002), "Web Server Log File Analysis – Basics"
Mena, Jesus (July 17, 2000), "Bringing Them Back", www.intelligententerprise.com/000717/feat2.shtml
Mena, Jesus, "Web Mining", www.tdan.com/i011fe04.htm
Mabley, Kevin, Director of Research, Cyber Dialogue (2000), "Privacy vs. Personalization"
Edelstein, Herbert A. (March 12, 2001), "Pan for Gold in the Clickstream", InformationWeek.com
Rudjer Boskovic Institute (2001), Data Mining Tutorial, http://dms.irb.hr/tutorial/
Pyle, Dorian (1998), "Knowledge Discovery and Data Mining: The Expectation of Magic"
Cooley, et al., "Web Mining: Information and Pattern Discovery on the World Wide Web", Department of Computer Science and Engineering, University of Minnesota
Information Discovery, Inc. (1997), "A Characterization of Data Mining Technologies and Processes", Journal of Data Warehousing, www.datamining.com/dm-tech.htm
Two Crows Corporation (1999), "Introduction to Data Mining and Knowledge Discovery", Third Edition