Name: Tom Scullion
Date due: December 18, 2002
Project: Evaluation of New Information Systems Technologies
Technology: Data Mining and Clickstream Analysis
Project Deliverable No. 3 – Overview of technology and its place in business

Introduction

Every time a visitor to the World Wide Web makes a request to open a Web page, the Web server on which the page resides captures the visitor’s request in a log file. That log entry is referred to as clickstream data. Web traffic generates a tremendous amount of clickstream data. The data accumulates rapidly and must be managed if a Web site is to be successful.

Clickstream data can be used to understand the interaction between visitors and a Web site. It may be used to determine the efficiency and effectiveness of the Web site from a marketing perspective, especially if the site is used for e-commerce. E-retailers must learn as much as possible about the behavior, individual tastes and preferences of their site visitors to be competitive. To help accomplish that goal, clickstream data should be compiled from the log files into a database that can be integrated with other corporate data in a warehouse so that the data can be mined and analyzed. Analysis of this data will help management understand how visitors are using their Web site and, ideally, how the Web site can be improved to “significantly boost customer relationship management in ways that directly affect online visitor acquisition and retention” (Mena 2000).

Overview

Web server logs capture information when a visitor requests a Web page. Log file entries contain information on the visitor, the pages requested, the time of the request, the status of the request, and the size of the page. A server may be configured to have four different log files, or the files may be combined into an extended format. The transaction or activity log records all file transfers from the server to a visitor’s computer.
The error log file records any visitor request for information that cannot be fulfilled. The referrer log (or referer log, as traditionally misspelled by Web administrators) keeps track of where incoming visitors to a Web page are coming from, and the agent log tracks the software on the visitor’s system that is being used to make the request (Tittel 2001, McDunn 2002). Usually the files, other than the error log, are combined into an extended log file. The fields generally captured in extended log files are:

Field | Description | Example
Host | Fully qualified domain name of the client or its IP address |
Ident | Identity information reported by the client if identity check option is enabled (seldom used) | -
Authuser | UserID used in successful SSL request | -
Date | The date and time of the request including time zone | [17/Jun/2000:10:39:12 -0600]
Request | Request line from the client browser | “GET /metadata.html HTTP/1.0”
Status | Three digit HTTP status code returned to the client | 200
Bytes | Number of bytes returned to the client browser for the requested object | 19365
Referrer | URL of the referring server and requested file from site | -> /metadata.html
Agent | Browser and operating system name and version | Mozilla/4.0 (Windows; I; 32bit)

From: Jennings, Michael F. 2000 “Using Clickstream as an e-Source for the e-Business Intelligence Environment”
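As an illustration of how such an entry might be turned into named fields, the following is a minimal sketch in Python. The regular expression is an assumption modeled on the widely used combined access-log layout, not a format specified by the sources cited here, and the host address and referrer URL in the sample line are made-up placeholders.

```python
import re
from datetime import datetime

# Field names follow the table above. The pattern is an assumption modeled
# on the common "combined" access-log layout; real servers may differ.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A dash in the bytes field means no body was returned.
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    entry["date"] = datetime.strptime(entry["date"], "%d/%b/%Y:%H:%M:%S %z")
    return entry

# Hypothetical sample line: host and referrer are placeholders.
sample = ('192.0.2.10 - - [17/Jun/2000:10:39:12 -0600] '
          '"GET /metadata.html HTTP/1.0" 200 19365 '
          '"http://www.example.com/index.html" '
          '"Mozilla/4.0 (Windows; I; 32bit)"')
entry = parse_log_line(sample)
```

Once each line is a dictionary of typed values, the entries can be loaded into a database table with one column per field.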
Since identity and authentication fields are seldom used, those fields usually just contain a dash. The reason for accumulating this data is to be able to profile a visitor for marketing purposes, so other methods are needed to establish a visitor’s identity.

One method is to try to determine identity from the visitor’s IP address. This is unreliable considering the infrastructure of the Internet. Many people still use dial-up access to get to the Web. The IP addresses for these visitors are dynamically assigned and change every time they sign on, so they are inconsistent. If the visitor is going through an Internet Service Provider’s (ISP) proxy server, the IP address of the proxy server will be entered in the log file instead of that of the actual visitor, which is of no use. If the site has installed a load balancer to enhance performance and scalability, the log file will pick up the IP address of the load balancer instead of that of the actual visitor. So it is essentially not possible to obtain reliable information on a visitor’s identity from an IP address.

Using cookie files is another way to identify visitors to a Web site, although cookies may not provide as much detailed information as registration. Cookies are small files that are placed on a visitor’s hard drive by a Web server so that it can remember something about the visitor the next time they visit. Typically cookies have six fields:

Field Name | Description
Name | The name of the cookie variable
Value | The string value assigned to the cookie variable
Domain | The domain name or Web site that created the cookie; it is the only domain that is permitted to receive or modify the cookie on subsequent accesses.
Path | The top level of the subtree within the domain for which the cookie is valid and returned upon access to a page within the subtree
Expires | The expiration date of the cookie. The cookie persists on the client system until this date. If this value is not set, the cookie only lasts for the duration of the browser session. Cookies without an expiration date are referred to as transient cookies.
Secure | If this field is set to TRUE, then a secure connection to the domain is needed to pass the cookie.

From: Sweiger, Mark “Cookies: The Perfect User Identification Snack”

The server must be configured to use unique variables in the Name field. That way, once a cookie is set on a visitor’s system, every subsequent time that the same visitor accesses the Web page, they will be identified by the unique value of the cookie variable. It’s not exactly name, address, and phone number, but it does allow a certain amount of personalization.

Cookies are used to record a visitor’s preferences when using a particular site. They may be used to remember what pages have been requested and what ads have been sent, as well as what type of browser is being used. This information enables the server to customize pages for browser type or for other information gleaned from prior visits to the site.

There are security and privacy issues with the usage of cookies. Cookies are placed on the visitor’s system by the server and are retrieved the next time the visitor accesses that particular site. Because of the ability to retrieve other information from the visitor’s hard drive using the same technology, they are considered potential threats to security. Because they are used to identify specific visitors, they are considered threats to privacy. Visitors may configure their Web browsers to refuse cookies. However, because cookies are used extensively to identify unique visitors, many sites don’t allow access to a browser that is set to refuse them (Dodson 2000).

Another method used by some sites to get visitor identity is to have visitors log in when accessing a Web site.
This method may populate both the ident field and the authuser field of a log entry if passwords are required. Registration to obtain the login and establish a password may establish a visitor’s identity because the visitor is generally required to fill out an online form that includes name and address and other information that might be useful for marketing. Although this method probably provides the most accurate demographic data, there are reasons why it may not always work. First, a visitor may not provide truthful information (go figure) when registering to use the site, so the reliability of the information is suspect. Second, this method may alienate potential visitors that wish to remain anonymous and don’t want to have to log in every time they access the site. It may be possible to overcome these challenges by using incentives to obtain accurate information, such as a redeemable coupon that will be mailed to the registrant’s address (Mena 2000). Registration is the best method of obtaining enough information to match visitors with demographic and behavioral data from third party vendors. This will provide more complete data to use in profiling visitors to a Web site.

Cookies and registration allow an enterprise to obtain enough quality data to profile customers, although customers and potential customers may consider these methods invasive. However, attitudes seem to be changing in some circumstances. A survey conducted by Cyber Dialogue and documented in “Privacy VS. Personalization” (Mabley 2000) produced some significant findings:

• 82% surveyed were willing to provide such personal information as gender, age and ethnicity if the site will remember their preferences and personal information
• 56% said they are more likely to purchase from a site that allows personalization
• 63% said they are more likely to register at a site that allows personalization or content customization
• 82% said a Web site’s privacy policy is a critical factor in their decision to purchase online
• 84% refused to provide information at a site that won’t disclose how the information is being used

Based on this information, it appears that online visitors to a Web site are willing to provide the information that marketers need as long as the information is only used to personalize the visitor’s online experience. Other findings of the survey included the fact that only 58% of people using the Web understand cookies and how they work. The conclusion is that an online enterprise needs to educate its visitors regarding what information is being obtained, how it is being used and how the enterprise’s privacy policy governs the use of the information. Generally, organizations should explain the process and its benefits and allow visitors to opt out if desired. “Let your customers have access to their own profiles. In the end they will realize that the objective is to service them better than your competitors.” (Mena 2000)

Registration information, cookies and Web logs provide a great deal of raw data about Web site visitors that may be useful to an organization. They tell how a visitor responds to the site’s content, what links are used, how much time is spent on a particular page or on the whole site, what time of day the visitor browses and more. With all this information at hand, one of the challenges becomes how to organize this clickstream data so that it is usable in a decision support system.
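To make one of those measures concrete, the sketch below estimates time spent on a page from the gap between one visitor’s consecutive requests. It is an illustration only: the 30-minute inactivity cutoff is an assumed convention rather than something stated in the sources, and the function name and sample requests are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed inactivity cutoff: a longer gap is treated as the visitor
# having left, so no time-on-page is recorded for it.
SESSION_GAP = timedelta(minutes=30)

def time_on_pages(requests):
    """requests: list of (timestamp, page) tuples for ONE visitor.

    Returns (page, seconds) pairs. The last page of a visit has no
    following request, so its viewing time cannot be estimated.
    """
    ordered = sorted(requests)
    result = []
    for (start, page), (next_start, _next_page) in zip(ordered, ordered[1:]):
        gap = next_start - start
        if gap <= SESSION_GAP:
            result.append((page, gap.total_seconds()))
    return result

# Hypothetical requests from one visitor, identified for example by a cookie.
visits = [
    (datetime(2000, 6, 17, 10, 39, 12), "/index.html"),
    (datetime(2000, 6, 17, 10, 40, 42), "/metadata.html"),
    (datetime(2000, 6, 17, 13, 0, 0), "/index.html"),
]
durations = time_on_pages(visits)
```

Here only the first page receives a duration (90 seconds); the gap before the afternoon request exceeds the cutoff and is treated as a separate visit.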
The first step is to add the clickstream data from the Web to a data warehouse. A data warehouse is a subject-oriented, nonvolatile collection of data for decision support (Mena 2002). The data warehouse may also include data from various other sources such as a customer relationship management database or a sales force automation database. Cookies and Web server log entries are generally in text, so the data must be manipulated by an extraction, transformation, and load tool into a format that can be moved into the data warehouse. Data from registrations and purchases should be added to the clickstream data for more complete information on visitors and customers. Additional demographic and behavioral information from third party vendors should also be appended to the clickstream data. This third party data can include information such as location and even projected income. Third party data is most useful if there is adequate information from registrations or purchases to match visitors to the demographic data.

Once the data is in the warehouse, data mining methodologies can be applied to provide valuable information on visitors and customers and on marketing efficiency. Data mining software programs use a variety of statistical approaches to sort through large amounts of data to identify patterns, establish relationships within the data and use that information to predict behavior.

Data mining uses various techniques. Association is the technique of looking for patterns where one event is connected to another event, such as identifying items that are likely to be purchased or viewed in the same Web site visit. Sequence or path analysis is looking for patterns where one event leads to another later event, such as the path a visitor takes through a Web site and what items are viewed or purchased. Classification is looking for new patterns. Clustering, or segmentation, is identifying visitors and customers that share common characteristics (Greening 2000). It can break down groups until reaching the famous “segment of one” (Doherty 2000). Forecasting is discovering patterns in data that can lead to reasonable predictions about visitor behavior.

Data mining generally uses traditional data sources such as customer relationship management systems and sales force automation systems to supply data that would be mined for marketing and customer information. Since the Internet always has to have its own terminology, the term Web Mining was recently coined to refer to mining data from traditional sources that have been appended with data from the Web. As defined on, “Web mining is the integration of information gathered by traditional data mining methodologies and techniques with information gathered over the World Wide Web. Web mining is used to understand customer behavior, evaluate the effectiveness of a particular Web site, and help quantify the success of a marketing campaign.”

Once the clickstream data is incorporated into an organization’s data warehouse, Web mining software can be used to extract information to be used to profile visitor behavior and preferences. The object is to find associations between customer profiles and buying habits, such as what age group is most likely to make multiple purchases. This information can be used to personalize promotions based on demographic information.
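The age-group question can be approached, in its simplest form, with a grouped count. The sketch below is a plain group-by comparison rather than a full association-rule algorithm, and the age bands and purchase counts are invented for illustration.

```python
from collections import defaultdict

def multi_purchase_rate_by_group(customers):
    """customers: iterable of (age_group, purchase_count) pairs.

    Returns, for each age group, the share of its customers that made
    two or more purchases.
    """
    totals = defaultdict(int)
    repeaters = defaultdict(int)
    for group, purchases in customers:
        totals[group] += 1
        if purchases >= 2:
            repeaters[group] += 1
    return {group: repeaters[group] / totals[group] for group in totals}

# Hypothetical customer records: (age band, number of purchases).
records = [
    ("18-34", 3), ("18-34", 1), ("18-34", 2),
    ("35-54", 1), ("35-54", 1), ("35-54", 4),
]
rates = multi_purchase_rate_by_group(records)
best_group = max(rates, key=rates.get)
```

In this made-up sample two of three customers aged 18-34 bought more than once, against one of three aged 35-54, so the younger band would be the first candidate for a repeat-purchase promotion.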
Personalization optimizes the advertisements and products that a visitor sees. Because personalized marketing is based on preferences developed from the visitor’s profile, the products displayed are of more interest to the visitor, which increases the likelihood that the visitor will make a purchase (Greening 2000).

After examining associations, the data may go through segmentation analysis. Segmentation is the process of dividing a customer base into smaller markets based on different needs, preferences, behavior, and attributes. The reason for segmentation is that dividing the group into smaller and smaller sectors enables interaction with each segment in different ways (Mena 2000). Segmentation analysis may reinforce the information discovered through association analysis and provide more detailed information on smaller market segments for personalized marketing activity. This may assist an enterprise in maximizing its marketing budget by targeting the market segments that are most likely to provide the biggest return for their marketing efforts. It may become apparent, for example, that females under age 35 are most likely to make multiple purchases. Promotions can be targeted at that market segment to reward their loyalty and turn those customers into long-term clients.

Based on the historical customer purchasing data in the data warehouse, Web mining software programs may be “trained” to predict outcomes, such as the number of purchases new visitors are likely to make. Once trained, the software can indicate which attributes, such as age, gender or projected net worth, are most important for the prediction. This information is useful in modifying a registration form to get the information that would be helpful in raising the conversion rate. Conversion rate is the percentage of visitors that are converted to customers.
If the model predicts that women between the ages of 30 and 45 are likely to purchase a particular product, then after completion of the registration form, a visitor that fits that profile would be directed to a welcome page that prominently displays that product. The visitor would already have earned a promotional coupon towards the purchase price for accurately filling out the registration form, which may make the purchase even more desirable.
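The conversion rate defined above is a simple ratio, and computing it per segment shows whether a targeted profile actually converts better than the rest of the audience. The following sketch uses invented visitor and customer counts; the segment names are placeholders.

```python
def conversion_rate(visitors, customers):
    """Percentage of visitors that were converted to customers."""
    if visitors == 0:
        return 0.0
    return 100.0 * customers / visitors

# Hypothetical counts per segment: (visitors, customers).
segments = {
    "women 30-45": (400, 36),
    "all others": (1600, 64),
}
rates = {name: conversion_rate(v, c) for name, (v, c) in segments.items()}

# Overall rate across both segments: 100 customers out of 2000 visitors.
overall = conversion_rate(2000, 100)
```

With these made-up numbers the targeted segment converts at 9% against 4% for everyone else and 5% overall, which is the kind of gap that would justify a tailored welcome page.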
Segmentation of markets to enable personalization of marketing efforts is one of the huge benefits of Web mining. There are many additional benefits to Web mining and provides an extensive list:

Understand Customer Behavior:
• Companies can optimize e-business sites for maximum commercial impact by understanding the dynamic behavior of visitors to their Web sites
• E-Tailers can now gain knowledge on the individual tastes and preferences of the visitors to their sites
• Determine the conversion rate of visitors to buyers on your site.
• Determine the repeat frequency of existing buyers (i.e. the likelihood of customers repurchasing your brand)
• Calculate the rate of new customer acquisition.
• Discover actionable browsing and buying patterns of customers.
• Learn who is buying what from your site.
• Discover cross relationships between clients in your e-commerce sites.

Determine Web Site Effectiveness:
• Discover high and low impact areas of your e-commerce site.
• Web administrators no longer have to rely on intuition when designing the layout of a Web site.
• E-tailers can now develop the look and feel of the Web site and personalize online content.

Measure the Success of Marketing Efforts:
• In the physical world it is difficult to get reliable feedback on marketing campaigns. But, on the Internet you can get real measurements of the success of a marketing campaign.
• Companies can cluster customers with similar patterns, and the Web site can adapt to recognized customers. Segments can then be targeted with campaigns and special offers.
• Effectively gauge the return on investment of banner advertising.
(Doherty 2000)

Web mining is not inexpensive. Maintaining a data warehouse and dealing with the volume of clickstream data generated by the average Web site can take a significant amount of an enterprise’s resources.
But the benefits of mining clickstream data should offset the cost by allowing an enterprise to distinguish itself from the competition by providing a higher level of customer service and getting a higher return from its marketing investment.

Data Mining Process

Data mining is not a panacea. It will not automatically solve all the problems in the data universe. This means that a structured methodology must be used to “find problems, define solutions, set expectations, and deliver results.” This is sometimes referred to as Knowledge Discovery. Knowledge Discovery is defined as a multi-stage business process leading to the automated detection of regularities in data, which are useful in new situations. Knowledge discovery attempts to ensure that the goals of mining data align with the goals of the data users (Pyle 1998). Several appropriate methodologies can be found. However, most include the following steps:

1) Define the business problem
2) Build data mining database
3) Explore data
4) Prepare data for modeling
5) Build model
6) Evaluate model
7) Act on the results
(Edelstein 2001)

There is a standard being developed by a consortium of mostly European vendors that’s known as CRISP-DM, which stands for Cross Industry Standard Process for Data Mining. CRISP-DM is still a work in progress, so most data mining methodologies use it as a foundation to begin defining the process.

Define the business problem

Determine what specific business problem the organization is attempting to solve with data mining. The problem definition should be clear and concise so that it can form the basis for a project plan. Data mining can be used for a variety of business problems including improving customer relationship management, looking for patterns of fraud in credit card usage or identifying reasons for churn among telecommunications customers. For an e-commerce site, a business problem that data mining may be used to solve is identifying online customers that provide the greatest value to the organization and developing profiles of those customers based on specific attributes.

Once an objective is determined, criteria should be established to measure the success of the project. Success criteria should be defined in relation to the objective defined by the business problem. For an e-commerce site trying to identify high value customers, success may be measured by an increase in effectiveness of targeted marketing campaigns that use the customer profiles developed by the data mining project.

Once the business problem is defined, it’s necessary to determine whether the data that is available can in fact solve it. To identify particular types of customers, there should be enough data on customer attributes from Web logs, transactions, and registrations to provide a meaningful profile of individual customers. Data that is currently available in the Enterprise Data Warehouse (EDW) should be analyzed to determine if it’s adequate for the project.
It may be necessary to obtain additional data from other sources to increase the likelihood of success for the project.

Data mining objectives may be somewhat different from the business goals. Data mining goals should be established that will generate the information necessary to achieve the business goals. So if the project goal is to profile high value customers, the data mining goal may be to determine the attributes of customers that provide the highest return to the organization. Based on the business objectives and the data mining objectives, a project plan should be developed with input from the primary stakeholders.

Build Data Mining Database

Building the data mining database and the next two steps of exploring the data and preparing the data for modeling will take most of the time and effort of any data mining project. Most of the data necessary to build the data mining database should be available in the EDW. The EDW itself is usually not used for data mining because data may need to be altered for the project. Also, data mining can be resource intensive. The EDW will probably have other resource demands, and data mining may cause system degradation that would be unacceptable. Generally a separate data mart should be built for data mining. The design of the database should correspond to the business problem that is being solved and the data available.

Explore the Data

Gaining an understanding of the data that will be mined is necessary prior to actually mining the data. Exploring the data is the best way to gain that understanding. Graphing and visualization tools may be used to explore the data. They can be important for revealing general patterns in the data. This step may often lead to some previously unknown realization about the data that may help in the modeling process.

Clustering is a common way to explore data. Clustering is a method of dividing data into different groups based on similar attributes. Clustering may appear to be similar to segmentation, but the two differ: segmentation assigns data to defined groups, whereas clustering puts data into groups that were not previously defined, based on attributes that may not have been classified.

Link analysis is another method of exploring data. Common approaches to link analysis are association discovery and sequence or path discovery. Association and path analysis were discussed previously and are useful in gaining a general understanding of the data before deciding on a model. A good understanding of the data may provide insight into how to proceed with the remaining steps of the process in a manner that is most likely to ensure the success of the project.

Prepare the Data for Modeling

Data preparation may be the most important part of mining. Even though the data has been transformed to some extent when it was added to the EDW, it will probably still require some additional data preparation after it’s added to the data mining database. Once the data is in the data mining database it should be tested for consistency, for missing variables, and for outliers. Missing data should be reviewed to determine the best method to decrease the impact of the missing data on the modeling process. It may be decided that it’s all right to ignore the missing entries. Alternatively, an attempt may be made to estimate the data from other information, or the data may just be discarded. Outliers represent data outside expected parameters and may be errors or just new, unexpected data. Some data mining programs handle missing data and outliers better than others, so whether they are ignored or included depends on the software being used (Pyle 1998).
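A first pass at the missing-value and outlier checks described above might look like the sketch below. The interquartile-range rule with a 1.5 multiplier is one common convention chosen here as an assumption; it is not the method prescribed by the cited sources, and the column of order totals is invented.

```python
import statistics

def audit_column(values, k=1.5):
    """Count missing entries (None) and list outliers in a numeric column.

    Outliers are values more than k interquartile ranges below the first
    quartile or above the third quartile (an assumed rule of thumb).
    """
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    q1, _median, q3 = statistics.quantiles(present, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    outliers = [v for v in present if v < low or v > high]
    return {"missing": missing, "outliers": outliers}

# Hypothetical order totals with one missing entry and one suspicious value.
order_totals = [20, 22, None, 25, 21, 24, 23, 500]
report = audit_column(order_totals)
```

The report only flags candidates. Whether a flagged value is an error or genuinely new behavior still has to be decided by someone who knows the business.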
In addition to preparing the data for mining, some additional transformation may be necessary. Using data mining to predict behavior may require new variables that have to be derived from the data. For transaction data on existing customers, RFM variables may be good predictors. RFM stands for recency, frequency and monetary. Recency generally would be some measure of time since the last transaction. Frequency would be the number of transactions in a designated period. Monetary would be the total value of transactions within a designated period as well as an average per transaction. These additional variables make the data more meaningful to the mining process and provide additional parameters from which the mining software may discover useful relationships.

Build the Data Model

Modeling is the next step in data mining. This is probably what most people think of as data mining, although there is a great deal more to it than just building the models. First it must be decided what type of prediction is the most appropriate solution to the business problem. Then a model type is chosen based on the type of prediction. Then algorithms should be explored that fit the predictive type and the models being developed. Various algorithms are used for modeling, from simple conditional logic to complex neural networks.

Types of predictive models generally fall into one of two categories, classification or regression. Moving through these decisions from type of prediction to algorithm generally requires highly skilled analysts that understand the business and have experience with data mining and statistical analysis. Selection of these tools and the people that can execute the process depends on the business problem that is being solved, the data being used to solve it, and the systems available. There are many classification and regression algorithms available for modeling and as many variations of each type as there are vendors. To explore them all would be a project in itself, so only a select few are described below.
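Before any model is built, derived inputs such as the RFM variables from the data preparation step have to be computed. The sketch below shows one way to do so for a single customer; the 90-day period, the function name and the sample transactions are assumptions made for illustration.

```python
from datetime import date

def rfm(transactions, as_of, period_days=90):
    """Derive RFM variables for one customer.

    transactions: list of (date, amount) pairs.
    as_of: the date the variables are calculated for.
    period_days: the designated period (an assumed default of 90 days).
    """
    if not transactions:
        return None
    last_purchase = max(day for day, _amount in transactions)
    in_period = [amount for day, amount in transactions
                 if 0 <= (as_of - day).days <= period_days]
    frequency = len(in_period)
    monetary = sum(in_period)
    return {
        "recency_days": (as_of - last_purchase).days,
        "frequency": frequency,
        "monetary_total": monetary,
        "monetary_average": monetary / frequency if frequency else 0.0,
    }

# Hypothetical transaction history for one customer.
history = [
    (date(2002, 11, 30), 40.0),
    (date(2002, 10, 15), 60.0),
    (date(2002, 3, 1), 25.0),
]
variables = rfm(history, as_of=date(2002, 12, 18))
```

For this invented history the last purchase was 18 days before the calculation date, and two purchases totaling 100.0 fall inside the 90-day window. The March purchase is outside the window and does not count.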
One type of prediction that may be used is classification. Classification is generally used to predict the category or class into which a particular case might fall. For instance, a customer will fall into a particular class depending on the attributes of the customer identified within the data. Classification models are generally simpler than regression models in that they are more understandable to business users that may not have a statistical background. Classification models can use categorical data such as a state abbreviation instead of just numerical data. One of the reasons classification models are easier to understand is that the variables are more meaningful. The results are also easier to interpret by analysts that may not have statistical experience, because the outputs are generally in ranges such as high, medium, and low.

Decision trees are one of the better known classification types of models. They use conditional logic to partition data into groups based on values for a particular attribute. For instance, customers may be split into subsets based on RFM variables. One tree node may make a decision based on whether 3 or more purchases were made in the last three months. A following tree node may make a decision based on the value of recent purchases. If the decision was positive on both nodes, the algorithm may classify that customer as high value. If the other variables were similar and the purchases were between thirty and fifty, the customer may be classified as medium, and so on.

Decision trees have limitations because it is necessary to select one specific attribute such as age or last purchase for classification at each stage of the process. Also, each decision in the tree is based on the current decision node without taking into account any previous or future decisions. The attribute that is selected for the first decision or root node is also subjective, depending on the modeler or the business problem being addressed.
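The two-node tree described above can be written out as plain conditional logic. This is a hand-coded sketch rather than a tree learned from data; the thresholds of 3 purchases and the thirty-to-fifty band come from the example in the text, while treating every other case as low value is an assumption.

```python
def classify_customer(purchases_last_3_months, avg_purchase_value):
    """Classify a customer as "high", "medium" or "low" value."""
    # First node: were 3 or more purchases made in the last three months?
    if purchases_last_3_months < 3:
        return "low"  # assumed outcome; the text does not cover this branch
    # Second node: value of recent purchases.
    if avg_purchase_value >= 50:
        return "high"
    if avg_purchase_value >= 30:
        return "medium"
    return "low"  # assumed outcome for values below thirty

borderline = classify_customer(4, 49)   # one dollar short of the top band
top_band = classify_customer(3, 50)
infrequent = classify_customer(2, 500)
```

Each node looks at exactly one attribute and applies a fixed cutoff, so a customer averaging 49 lands in a different class from one averaging 50 however many purchases were made.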
Decisions being made at each node also represent hard splits that may lead to somewhat arbitrary results. If a customer had four purchases that averaged $49, they would be classified as medium value instead of high based on a $1 difference, even though the value to the organization was higher due to the higher number of purchases. There are limitations like these in any modeling algorithm. That’s what makes it so important to have quality people that are familiar with the business and are able to use their business experience to build and maintain the models being used to solve the business problem.

Rule induction is another method of classification. Rule induction looks at data and generates a set of rules that may be used to classify cases. Rules may be generated based on relationships and confidence levels. For instance, based on the data, a rule may be generated that states with 75% confidence that women in a certain age range will purchase product A with product B. Rules may be “fuzzy” or inexact. Inexact rules have a fixed confidence factor such as the 75% mentioned above. Fuzzy rules have a confidence factor that varies with one of the attributes, so the confidence that women in a certain age range will purchase product A with product B may increase as the age increases.

Genetic algorithms also generate rules, but not from exploring the data. Genetic algorithms base rules on changes in patterns of data until a pattern emerges. Genetic algorithms use rules that have already been developed to combine patterns and develop relationships within the data. So they are more of a tool to improve the algorithm being used to build the model than a modeling tool.

Boosting is a method of classification that seems to let majority rule decide how data is classed. Boosting takes several random samples from the data and builds a classification model for each set. The training sets are modified based on previous results until the outputs fit the expected classifications.
Then additional samples are processed by each model and the classifications that are assigned most often are used.

Regression type predictions generally use some kind of scoring method to predict customer behavior. Regression type models do not allow categorical data. All data must be numerical, so it’s necessary to convert data such as state abbreviations to a numerical value in some logical manner so the program can use it. Since they only use numerical data, regression models are more difficult to understand because the data is less recognizable. Regression type models may appear to operate like a “black box” because it’s more difficult to visualize how the data is being manipulated than with classification algorithms. The output is also not as easy for non-statistical analysts to comprehend, so these models generally require more sophisticated personnel to operate and interpret. Regression models do, however, provide continuously varying outputs as opposed to simply putting outputs in buckets. Scoring allows for customers to fall into a range to fit a category or to segment the results into smaller, more well defined groups.

Neural networks are one of the better known regression modeling algorithms. Neural networks consist of algorithms that take predictor variables at the input layer and then assign weights to the paths that the data uses to travel to the next layer of nodes. Depending on the number of attributes being analyzed, there may be one or more hidden layers that the data must pass through before emerging at the output layer. The values of the variables and the paths traveled determine the direction the data will take and the value it will have when it reaches the output layer. The scores that are generated when the data reaches the output layer are what are used for prediction. The neural network must first be “trained” using data that has already been evaluated. The data is sent through the net several times and the weights associated with the paths are adjusted each time until the results of the training set correspond to the expected outputs. Then the network is considered to be trained. Neural nets “(a) evaluate input values, (b) calculate a total for the combined input values, (c) compare the total with a threshold value and (d) determine what its own output will be.” (Information Discovery Inc. 1997)

Evaluate the Model

Models should be evaluated from a business perspective based on cost benefit analysis and return on investment.
The results of the model may show some interesting patterns, but acting on them may not provide the incremental revenue or cost savings that would justify their use. One of the simpler ways to evaluate a model is to test the results in the real world. Select a sample from the population to test a prediction of the model and see how well the actual results follow the predicted results. The model may predict the likelihood that a certain segment of the market will respond to a particular promotion. By implementing the promotion on a limited sample and testing the results against the prediction, the model’s effectiveness can be measured.

Act on the Results

Once working models are available, they can be used to understand customer behavior and customer expectations. They may be incorporated in production systems such as campaign management software for marketing purposes. Campaign management software automates marketing campaigns that are used to target customer segments with specific promotions that are most likely to achieve the desired results. Targeting specific segments in this manner should increase the response rate to the promotional campaign based on the model’s predictions. This maximizes marketing efficiency and effectiveness. Profiles developed from the data mining project should identify customers that are most likely to respond to cross-selling or up-selling promotions that will lead to an increase in customer lifetime value to the organization.

The customer profiles developed from the project can also be used in a Web environment to classify visitors to the site based on their registration information. Then the site can personalize the content presented to them based on the classification. This will increase the likelihood of converting the visitor to a customer. Personalizing the content of the Web site also helps to differentiate the site from the competition and provide a higher level of customer service.
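The limited-sample test described under Evaluate the Model can be sketched as a comparison of the predicted and observed response rates, together with a simple cost-benefit figure. This is only a rough illustration: the function name `evaluate_promotion_test`, its parameters, and every number in the example call are hypothetical and do not come from any real campaign.

```python
def evaluate_promotion_test(predicted_rate, contacted, responders,
                            revenue_per_response, cost_per_contact):
    """Compare a model's predicted response rate with the rate observed
    in a limited test promotion, and compute a simple return figure."""
    actual_rate = responders / contacted
    revenue = responders * revenue_per_response
    cost = contacted * cost_per_contact
    return {
        "actual_rate": actual_rate,
        # Negative gap: the model was more optimistic than reality.
        "rate_gap": actual_rate - predicted_rate,
        "net_benefit": revenue - cost,
        "roi": (revenue - cost) / cost,
    }


# Hypothetical test: the model predicted a 6% response for this segment.
result = evaluate_promotion_test(
    predicted_rate=0.06,
    contacted=1000,
    responders=50,
    revenue_per_response=40.0,
    cost_per_contact=1.5,
)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

A small gap between predicted and actual rates suggests the model is usable for that segment, while the net benefit and return figures indicate whether acting on it is worthwhile from a business perspective.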
The objective is to use the predictive models to drive marketing efforts that will turn Web site visitors into customers and customers into long-term clients.

Evaluation of Data Mining Products
Data mining products should be evaluated for the same attributes that any software package would be evaluated on:

• User interface – How easy is it to use?
• How much customization may be necessary?
• Performance
• Documentation and online help
• Platforms on which it will run
• Databases to which it will connect
• Extensibility – open architecture or proprietary
• Upgradability
• Scalability

Another area of evaluation more specific to data mining software is accuracy. There are organizations that review the software by testing it on known data sets for purposes of providing some assurance that it will perform as specified. They include Audit Bureau of Verification Services, Inc., BPA Interactive and the Internet Audit Bureau. An endorsement by one of these organizations is a good measure of the accuracy of a particular software package.

Data mining software should also be evaluated for its ability to prepare the data for mining. Since data preparation may be one of the most time-consuming tasks in the process, anything a software package can do to expedite the process will greatly enhance its value. As discussed above, there are also issues in selecting data mining software relating to the software’s ability to deal with anomalies in the data. It must be evaluated on its ability to handle missing data and outliers.

There are other criteria that are important in selecting software, but one more that should be mentioned is integration. How well will a data mining package integrate with other enterprise systems? Can it be integrated with the Enterprise Data Warehouse or with Campaign Management software? Integration is often the bane of IT departments due to lack of communication in the selection process. It should be fully explored prior to committing to a particular product.

Conclusion

Retailing ventures continue to grow on the Web. As more enterprises establish a Web presence, the competition for online retail sales increases.
Increased competition makes it more and more difficult for e-commerce sites to attract and retain online customers. Web mining can be a dynamic resource that provides an enterprise with ways to distinguish itself from its competition by tailoring its Web site to visitors based on profiles developed from clickstream data. It will enable an enterprise to improve customer service and expand product offerings, and because it should make marketing efforts more cost effective, it may help reduce product prices. Web mining can provide important insights into customer behavior and expectations that can be used to implement efficient marketing campaigns. It can also provide the tools to measure the effectiveness of those campaigns. Web mining is a resource that may be necessary for the survival of an online enterprise as the marketplace grows and customers become more sophisticated (Mena 2000).

Data mining uses an evolving process that requires highly trained analysts to develop models that may be used to predict customer behavior based on a set of attributes. The process involves complex statistical algorithms that are categorized as classification or regression models. The models manipulate data to assign the output to a particular class (classification) or assign a score to the output (regression) that can be used to place customers into specific market segments that can be targeted for promotional campaigns. Models may also be used to predict the behavior of Web site visitors based on attributes learned through
registration. Once a visitor is assigned to a class or given a score, Web site content may be personalized based on the class to increase the likelihood of converting the visitor to a customer. Using data mining as a tool in this manner may provide an organization with a competitive advantage through more personalized interaction with Web site visitors and customers. It will also allow a higher level of customer service that will differentiate an organization using data mining from those that aren’t.

References:

Sweiger, Mark “Cookies: The Perfect User Identification Snack”
Dodson, Jody (January 24, 2000) “It’s Time to Slay the Cookie Monster”
Tittel, Ed (June 13, 2001) “Understanding Web Server Log Files”, ,289483,sid20_gci849167,00.html
Mena, Jesus (2002) “Integrating and Mining Web Data in Your Warehouse”
Jennings, Michael F. (June 30, 2000) “Using Clickstream as an e-Source for the e-Business Intelligence Environment”
Doherty, Patricia (January 2000) “Web Mining – The E-Tailer’s Holy Grail”
Greening, Dan R. (2000) “Data Mining on the Web”
McDunn (August 15, 2002) “Web Server Log File Analysis – Basics”, www-group.slac/
Mena, Jesus (July 17, 2000) “Bringing Them Back”
Mena, Jesus “Web Mining”
Mabley, Kevin, Director of Research, Cyber Dialogue (2000) “Privacy vs. Personalization”
Edelstein, Herbert A. (March 12, 2001) “Pan for Gold in the Clickstream”
Rudjer Boskovic Institute (2001) Data mining tutorial
Pyle, Dorian (1998) “Knowledge Discovery and Data Mining: The Expectation of Magic”
Cooley, et al. “Web Mining: Information and Pattern Discovery on the World Wide Web”, Department of Computer Science and Engineering, University of Minnesota
Information Discovery, Inc. (1997) “A Characterization of Data Mining Technologies and Processes”, Journal of Data Warehousing
Two Crows Corporation (1999) “Introduction to Data Mining and Knowledge Discovery”, Third Edition