Data Mining.doc



DATA WAREHOUSING AND MINING

SAS, Inc. offers this definition: “A data warehouse is an implementation of an informational database used to store shareable data that originates in an operational database-of-record and in external market data sources.” ( ) This definition illustrates the integrative nature of data within today’s organizations that emphasize CRM. The data are shared throughout the organization, and they come from internal as well as external sources.

The Hyperion Corporation describes a data warehouse as “… not a product but a best-in-class approach for leveraging corporate information. Data warehousing accommodates the need to consolidate and store data in information systems that provide end users with timely access to critical information so they can make decisions that improve their organization’s performance.” ( ) Again, the integrative nature of the use of the data is emphasized, as well as managerial decision support. Data from throughout the company, from every operational area, from both internal and external sources, are stored in a Data Warehouse for subsequent processing and analysis by decision makers.

Dataspace states, “A data warehouse is a computer system designed to give business decision makers instant access to information. The warehouse copies its data from existing systems like order entry, general ledger, human resources and stores it for use by executives rather than programmers.” ( )

Geiger ( ) provides a perspective on a Data Warehouse within an organization. A Data Warehouse serves five primary functions:

• It is a reflection of the business rules of the enterprise – not just of a specific function or business unit – as they apply to strategic decision support information. This characteristic requires resiliency to easily accommodate changes to the business rules that include new data elements, shifts in hierarchical relationships or changes to relationships between existing entities.
• It is the collection point for the integrated, subject-oriented strategic information that is handled by the data acquisition process. This characteristic calls for a modeling technique that supports subject-orientation and provides the flexibility to integrate data from additional sources over time.
• It is the historical store of strategic information, with the history relating to either the data or its relationships. This characteristic calls for a modeling technique that easily supports incorporation of history.
• It is the source of information that is subsequently delivered to the data marts. The data marts in question may be used for exploration, data mining, managed queries or online analytical processing. This requires the model to provide unbiased data that can subsequently be filtered to meet specific objectives. It further requires the model to support summarized and aggregated data.
• It is the source of stable data regardless of how the processes may change. This requires a data model that is not influenced by the operational processes creating the data.

The analysis of the data stored in a Data Warehouse is accomplished through Data Mining. SAS, Inc. provides a description of the relationship between Data Warehousing and Data Mining. Because the quantity of data in a Data Warehouse is vast and continuously updated, it must be analyzed in order to provide useful information. SAS states, “Delivering valuable insights from volumes of customer information collected in the warehouse requires advanced analytical processing and fast access to lots of data.” ( ) The Data Warehouse provides the storage of vast quantities of data and rapid access to the data. Data Mining is the tool that provides the analytical processing required to convert the data into information.

Carol Morris states, “Once you have the database, the next thing you need to consider is how to maximize its value. First and foremost is to put the data into the hands of the people who can best use it.” ( ) Kurt Thearling states, “Data mining … extracts information from a database that the user did not know existed. Relationships between variables and customer behaviors that are non-intuitive are the jewels that data mining hopes to find.” ( )

Data mining utilizes software tools to examine the vast quantities of data stored in a Data Warehouse, searching for hidden relationships that, once understood, permit the development of models to predict future customer behavior. The predictive models can be used to profile and segment customers. Once developed, the models can then be integrated into the Data Warehouse itself in order to make CRM more efficient and effective by facilitating future data analysis.

Benefits that can be realized from the implementation of a Data Warehouse include:

• Improved Organizational Flexibility – a Data Warehouse can reduce or eliminate complexities that exist between functional areas of organizations, particularly large, geographically diversified companies. It provides continuous availability of data from a single source, improving user access, security, and integrative control. Data are continuously updated to provide a current base for analysis.
• Managerial Decision Support – secure, accurate, rapid access to data and analysis techniques provides managers with the information they need, at the time they need it, and in the location where they need it. Managers become less dependent upon IS technicians for their information, modeling, and reports.
• Organizational Learning – the Data Warehouse analysis capabilities permit companies to focus their attention on the business activities required to deliver high levels of CRM. End users, the decision makers themselves, learn how to analyze data in order to develop models to enhance CRM. The result is less reliance on IS staff for data access and model development.
• Information Availability, Timeliness, and Alternatives Evaluation – because data are available on a real-time basis, decision makers are not restricted to historical data that may be days, weeks, or even months old. Managers have an increased ability to take timely actions and are in a better position to respond to competitive pressures. They are better able to make proactive decisions.

Thearling provides an introduction to the relationship between CRM and data mining. He states that the methods used to interact with customers have changed in recent years. (7) These changes have resulted in the recognition that companies have to better understand their customers in order to be able to respond quickly to their requirements. Additionally, the time frame in which CRM is provided is shrinking. It is no longer possible to wait for signs of customer dissatisfaction before taking action. “To succeed, companies must be proactive and anticipate what a customer wants.”

Thearling cites four forces that interact to make CRM complex as well as important:

• Compressed market cycle times decrease customer attention spans and loyalty. Successful CRM reinforces values provided to customers on a continuous basis. The time between the emergence of new customer desires and the point at which companies must meet them is shrinking. If companies don’t react quickly, some other company will.
• Increased marketing costs impact price and profitability. CRM permits companies to compete on a basis other than price alone. As stated above, competition on price alone soon becomes unprofitable. Competition based on customer service is as effective and typically serves the most profitable customers.
• Streams of new product offerings cause customers to constantly shop for new sources of products and services that will satisfy their exact needs. Customers want exactly what they want, not some “close fit.” Indications are that this trend will continue.
• Niche competitors are increasingly able to reach your current customers. The best customers also look good to competitors. Niche competitors will focus on small profitable market segments that may include a significant portion of a company’s customer base.

Successful companies must react to all of these forces in a timely manner. Markets will not wait for companies to respond, and customers will seek new sources for their products and services. Customers and prospective customers want to interact with companies on their own terms. Successful companies must consider multiple criteria when developing CRM strategies.

Data mining must be relevant to business processes if it is to have a positive impact on CRM. The impact is dependent upon the business processes, not the data mining process. Data within a company are generated as a result of the process that defines how the company conducts its business. For example, the marketing department interacts with customers through direct marketing, print advertising, telemarketing, radio/television advertising, and other methods of marketing. Data are generated during each interaction. Data mining analyzes the data and looks for information that the company doesn’t know exists. Unknown relationships between customers and the marketing process are the “gems” being sought. Most valuable is information about customer relationships that is counter-intuitive and enhances understanding of customers so that the appropriate CRM can be developed.

Thearling identifies four CRM factors that must be automated. These factors provide the overall focus for data mining. He suggests that automation must achieve:

• The Right Offer – managing multiple interactions with customers, prioritizing what the offers will be while making sure that irrelevant offers are minimized
• to The Right Person – every customer is unique, so interactions with them need to move toward highly segmented marketing campaigns that target individual wants and needs
• at The Right Time – interactions with customers now occur on a continuous basis
• through The Right Channel – interactions with customers can occur in a variety of ways (direct mail, email, telemarketing), so companies need to be sure that they are choosing the most effective medium for a particular interaction
Data mining first identifies market segments, and customers within those segments, that exhibit high profit potential based on previous buying behavior. Poulos states, “Segmentation is an art form and the company’s most powerful tool with which to create a loyalty focused enterprise.” ( ) Data mining then develops characteristics that can be used to build marketing campaigns and CRM strategies that maximize the future customer-company interface. Data mining can enhance CRM by identifying target markets more accurately, permitting a better alignment of campaigns to the needs, wants, and desires of customers and prospective customers. The following figure illustrates Thearling’s concept of Data Warehouses, Data Mining, and CRM.

Figure 2: The Data Warehousing “Prediction Loop”

Data mining can be defined in terms of three facilities: applications, approaches, and algorithms. These three facilities “sit on top of” the raw data generated by the business. They analyze the data on a continuous basis, looking for current relationships and, more importantly, changes in relationships that indicate future shifts in CRM.

Applications refer to sets of problems that have similar characteristics across different application domains. For example, marketing people can use customer demographics as a method of separating high potential customers from lesser potential customers. The segmentation process is similar regardless of the types of businesses being analyzed. Analyses of data from an auto manufacturer and from a drug manufacturer will be similar in nature, and the results, while specific to the companies, will also be similar in nature.

Approaches refer to the sets of algorithms used to extract information from databases. These approaches differ by the types of problems they are able to solve.

• Association – addresses the typical “market basket” problem. The goal is to find trends across large numbers of transactions that can provide understanding and facilitate exploitation of buying patterns.
• Sequence based analysis – is an extension of the Association approach that considers the sequential order in which purchases are made. This analysis examines buying patterns as well as the sequences in which the patterns occur.
• Clustering – focuses on segmentation problems. Customers are assigned to specifically defined clusters based on buying patterns and demographics. Buying habits of clusters are often compared to see if new marketing campaigns may be developed for other clusters.
• Classification – uses a set of predefined examples to develop a model that can be used to classify all data in a population. Fraud detection and credit risk applications are well suited to this type of analysis.
• Estimation – is a variation of the classification approach that generates “scores” along various dimensions in the data. For credit risk applications, potential clients receive a “credit worthiness score” rather than being classified in a binary fashion.

Algorithms refer to the specific analysis methods utilized to analyze data. Algorithms generally fall into one of two categories:
• Theory driven modeling tools such as correlation and regression, hypothesis tests, ANOVA, discriminant analysis, and forecasting methods. These tools utilize analysis to develop models proposed by managers. For example, regression analysis may develop a model to investigate a relationship between sales and levels of education that is proposed by a sales manager.
• Data driven modeling tools such as cluster analysis, factor analysis, decision trees, and neural networks. These tools utilize analysis to develop patterns in data sets that are then interpreted by managers in order to gain an understanding or perspective on customer behavior.

Data mining is not without its drawbacks. The problems common to the algorithms currently utilized fall into three general categories:

• Susceptibility to “dirty data” – data mining tools simply take the data presented to them and analyze them. They have no semantic structure, nor do they have any ability to identify anomalies or erroneous data. Users must ensure that the data are “clean” before submitting them to analysis. The concept of Data Quality, discussed below, addresses the accuracy and applicability of data utilized in the analysis.
• Inability to “explain” results in human behavior terms – the algorithms develop mathematical representations of relationships, but offer no suggestion about the nature of the relationships or why they exist. Why humans do things doesn’t necessarily map accurately into “if – then” rules, so results must always be interpreted by people. Regression analysis provides an excellent example of this “problem.” A regression model may express a strong relationship between sales and specific customer characteristics, but the relationship cannot explain why customers with the specific characteristics exhibit higher levels of sales.
• Data representation gap – human actions may not be accurately representable as data. Categorizing continuous variable data may lose so much information that the analysis is no longer valid. The most frequent “gap” occurs when a Likert scale is used to ascertain levels of customer satisfaction. There is no guarantee that a five point scale will consistently measure satisfaction levels across all customers. What is “very satisfied” to one customer may be “satisfied” to another customer. Data Mining algorithms have no way to differentiate between similar levels of satisfaction when they are coded differently. What results is an unclear picture of customer satisfaction due to the variability introduced by the representation gaps.

Because Data Mining utilizes vast quantities of data to support managerial decisions, the accuracy and applicability of the data are a primary issue. The Dataflux Company states, “One of the critical foundations for the effective "warehousing" and "mining" of volumes of data is data quality. If data is of insufficient quality, then the knowledge workers who query the data warehouse and the decision makers who receive the information can not and should not trust the results.” ( ) Atkins states, “Data quality includes but goes beyond accuracy – such as correct name and address data at the character level. Rather, it is the level of accuracy, consistency of format and data representation and completeness that permits matching and integration of all records that pertain to an entity, such as a customer or patient, a product or a location.” ( )

Wu illustrates the relationship between a Data Warehouse and data accuracy. He states, “The credibility of the data warehouse solely rests with the integrity of its data.” Data accuracy is of paramount importance when using a Data Warehouse. Managers must understand what data quality is and how it is achieved in order to ensure the viability of their Data Warehouse and, ultimately, their CRM efforts.
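
The data representation gap noted above, where categorizing a continuous variable discards information, can be illustrated with a small numeric sketch. The income and spend figures, the cutoff, and the helper name pearson below are all hypothetical and chosen only for illustration; the data are constructed to be perfectly linear so that the effect of categorization is easy to see.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: income (in thousands) and spend, linear by construction.
income = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
spend = [value / 10 for value in income]

# Categorize income into two groups at a hypothetical cutoff.
CUTOFF = 45
income_group = [1 if value >= CUTOFF else 0 for value in income]

r_continuous = pearson(income, spend)
r_categorized = pearson(income_group, spend)
print(f"continuous:  {r_continuous:.2f}")
print(f"categorized: {r_categorized:.2f}")
```

With these made-up numbers the correlation falls from 1.00 for the continuous variable to about 0.87 once income is reduced to two categories, even though the underlying relationship has not changed. Real data would be noisier, so the size of the loss shown here should not be read as typical.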
Wu provides a perspective on achieving data quality. He states that data integrity (quality) methods fall into two categories:

• Prevention Controls – These controls identify data errors before they enter the Data Warehouse. The controls are part of the process of migrating data from the various application systems into the warehouse. This is the primary method of keeping meaningless, corrupt, and redundant data out of the warehouse.
• Detection Controls – These controls assess the accuracy and completeness of data after they are migrated into the warehouse. They focus on reconciliation of data to detect insufficient or incongruent data.

FirstLogic, Inc. provides a more specific perspective on the data migration process from application system to the Data Warehouse. ( ) Data quality is achieved through three stages:

• Cleansing – This stage focuses on parsing, correcting, and standardizing data.
  o Parsing – This process locates, identifies, and isolates individual elements within the data from the application systems. Such elements as “Customer Number” and “Customer Name” are identified. Parsing makes it easier to correct, standardize, and match data because it facilitates individual data item comparison.
  o Correcting – This process attempts to remove errors and inconsistencies from the data. Such errors as variations in spelling and abbreviation, outdated data, and transpositions are identified and corrected. For example, the customer names “Smith” and “Smtih” will be matched and the second name changed to be consistent with the first. Similarly, the city name Hollywood may be defined in the warehouse as an alias for Los Angeles, providing consistency with respect to the company address.
  o Standardizing – This process permits the representation of data in a consistent and preferred format. Some problems addressed include inconsistent abbreviations, unusual titles, and variant spellings. For example, the International Harvester Company may be abbreviated as “Intl. Harvester,” “Int. Harvester,” or “Internatl. Harvester.” Standardizing will change all of these names to one name selected by management.
• Matching – This stage focuses on comparing data across and between various sources, searching for similar information that may be consolidated in the next stage. Matching permits the discovery of similar data within and across application software data. This is the “heart” of Data Warehouse quality. Duplicate representations, now consistently represented, can be identified and eliminated. All relevant data can then be consolidated, providing a complete, accurate, consistent source of data about customers.
• Consolidation – This stage focuses on developing relationships between customers and between customers and demographics. The goal is to build a consolidated view of each customer so successful one-to-one marketing and CRM can be achieved. One-to-one marketing permits companies to better serve customers at every point of contact by identifying their specific behavioral characteristics. For example, customers who prefer telephone calls from salespeople can be identified and scheduled for calls, while those who prefer e-mails will not be scheduled for calls.

The ultimate goal of data quality is to ensure accurate, consistent, applicable data that can be analyzed with Data Mining tools. Without quality data, CRM efforts are doomed to failure because the data do not accurately represent the customers, their characteristics and demographics, and their buying preferences and behaviors. As the Dataflux Company states, “An information system is only as good as the data which resides within it. A data warehouse, data mart, marketing database, or data mining initiative cannot be expected to deliver a satisfactory return on investment unless the data within the system is accurate, reliable, and trusted.”
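
The three stages described above (cleansing, matching, consolidation) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not FirstLogic's method: the pipe-delimited record layout, the reference lists, and every function name are hypothetical, and the spelling correction simply uses the standard library's difflib.get_close_matches as a stand-in for a real correction engine.

```python
from difflib import get_close_matches

# Hypothetical reference data; a real warehouse would manage these centrally.
KNOWN_SURNAMES = ["Smith", "Jones"]
COMPANY_ALIASES = {
    "intl. harvester": "International Harvester",
    "int. harvester": "International Harvester",
    "internatl. harvester": "International Harvester",
}


def parse(raw):
    """Parsing: isolate the individual elements of an assumed 'a | b | c | d' record."""
    number, name, company, channel = [part.strip() for part in raw.split("|")]
    return {"customer_number": number, "customer_name": name,
            "company": company, "channel": channel}


def correct(record):
    """Correcting: replace a misspelled surname with the closest known spelling."""
    fixed = dict(record)
    close = get_close_matches(record["customer_name"], KNOWN_SURNAMES, n=1, cutoff=0.7)
    if close:
        fixed["customer_name"] = close[0]
    return fixed


def standardize(record):
    """Standardizing: map variant abbreviations to the one preferred company name."""
    fixed = dict(record)
    fixed["company"] = COMPANY_ALIASES.get(record["company"].lower(), record["company"])
    return fixed


def match(records):
    """Matching: group cleansed records that refer to the same customer."""
    groups = {}
    for record in records:
        key = (record["customer_name"], record["company"])
        groups.setdefault(key, []).append(record)
    return groups


def consolidate(groups):
    """Consolidation: build a single view of each customer from its matched records."""
    return [
        {
            "customer_name": name,
            "company": company,
            "source_numbers": sorted(m["customer_number"] for m in members),
            "channels": sorted({m["channel"] for m in members}),
        }
        for (name, company), members in groups.items()
    ]


raw_records = [
    "1001 | Smith | Intl. Harvester | phone",
    "2044 | Smtih | Internatl. Harvester | email",
    "3310 | Jones | Int. Harvester | email",
]
cleansed = [standardize(correct(parse(raw))) for raw in raw_records]
customer_views = consolidate(match(cleansed))
for view in customer_views:
    print(view)
```

In this sketch the two "Smith" records from different source systems collapse into one customer view only because correcting and standardizing ran first, which is why the text treats cleansing as a prerequisite for matching. Matching on an exact name and company key is a deliberate simplification; production matching is usually fuzzier and uses more attributes.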