DATA WAREHOUSING AND MINING
SAS, Inc. offers this definition: “A data warehouse is an implementation of an informational
database used to store shareable data that originates in an operational database-of-record and in
external market data sources.” ( ) This definition illustrates the integrative nature of data within
today’s organizations that emphasize CRM. The data are shared throughout the organization,
and the data are from internal as well as external sources. The Hyperion Corporation describes a
data warehouse as, “… not a product but a best-in-class approach for leveraging corporate
information. Data warehousing accommodates the need to consolidate and store data in
information systems that provide end users with timely access to critical information so they can
make decisions that improve their organization’s performance.” ( ) Again, the integrative
nature of the use of the data is emphasized, as well as managerial decision support. Data from
throughout the company, from every operational area, from both internal and external sources,
are stored in a Data Warehouse for subsequent processing and analysis by decision makers.
Dataspace states, “A data warehouse is a computer system designed to give business decision
makers instant access to information. The warehouse copies its data from existing systems like
order entry, general ledger, human resources and stores it for use by executives rather than
programmers.” ( )
Geiger ( ) provides a perspective on the role of a Data Warehouse within an organization. A Data
Warehouse serves five primary functions:
• It is a reflection of the business rules of the enterprise – not just of a specific function or
business unit – as they apply to strategic decision support information. This characteristic
requires resiliency to easily accommodate changes to the business rules that include new
data elements, shifts in hierarchical relationships or changes to relationships between
• It is the collection point for the integrated, subject-oriented strategic information that is
handled by the data acquisition process. This characteristic calls for a modeling technique
that supports subject-orientation and provides the flexibility to integrate data from
additional sources over time.
• It is the historical store of strategic information, with the history relating to either the data
or its relationships. This characteristic calls for a modeling technique that easily supports
incorporation of history.
• It is the source of information that is subsequently delivered to the data marts. The data
marts in question may be used for exploration, data mining, managed queries or online
analytical processing. This requires the model to provide unbiased data that can
subsequently be filtered to meet specific objectives. It further requires the model to
support summarized and aggregated data.
• It is the source of stable data regardless of how the processes may change. This requires a
data model that is not influenced by the operational processes creating the data.
The analysis of the data stored in a Data Warehouse is accomplished through Data Mining.
SAS, Inc. provides a description of the relationship between Data Warehousing and Data
Mining. Because the quantity of data in a Data Warehouse is vast and continuously updated, it
must be analyzed in order to provide useful information. SAS states, “Delivering valuable
insights from volumes of customer information collected in the warehouse requires advanced
analytical processing and fast access to lots of data.” ( ) The Data Warehouse provides the
storage of vast quantities of data and rapid access to the data. Data Mining is the tool that
provides the analytical processing required to convert the data into information.
Carol Morris states, “Once you have the database, the next thing you need to consider is
how to maximize its value. First and foremost is to put the data into the hands of the people who
can best use it.” ( ) Kurt Thearling states, “Data mining … extracts information from a
database that the user did not know existed. Relationships between variables and customer
behaviors that are non-intuitive are the jewels that data mining hopes to find.” ( ) Data mining
utilizes software tools to examine the vast quantities of data stored in a Data Warehouse,
searching for hidden relationships that, once understood, permit the development of models to
predict future customer behavior. The predictive models can be used to profile and segment
customers. Once developed, the models can then be integrated into the Data Warehouse itself in
order to make CRM more efficient and effective by facilitating future data analysis.
Benefits that can be realized from the implementation of a Data Warehouse include:
• Improved Organizational Flexibility – a Data Warehouse can reduce or eliminate
complexities that exist between functional areas of organizations, particularly large,
geographically diversified companies. It provides continuous availability of data from a
single source, improving user access, security, and integrative control. Data are continuously
updated to provide a current base for analysis.
• Managerial Decision Support – secure, accurate, rapid access to data and analysis techniques
provide managers with the information they need, at the time they need it, and in the location
that they need it. Managers become less dependent upon IS technicians for their information,
modeling, and reports.
• Organizational Learning – the Data Warehouse analysis capabilities permit companies to
focus their attention on the business activities required to deliver high levels of CRM. End
users, the decision makers themselves, learn how to analyze data in order to develop models
to enhance CRM. The result is less reliance on IS staff for data access and model development.
• Information Availability, Timeliness, and Alternatives Evaluation – because data are
available on a real-time basis, decision makers are not restricted to utilization of historical
data that may be days, weeks, or even months old. Managers have an increased ability to
take timely actions and to be in a better position to respond to competitive pressures. They
are better able to make proactive decisions.
Thearling provides an introduction to the relationship between CRM and data mining. He
states that the methods used to interact with customers have changed in recent years. (7) These
changes have resulted in the recognition that companies have to better understand their
customers in order to be able to respond quickly to their requirements. Additionally, the time
frame in which CRM is provided is shrinking. It is no longer possible to wait for signs of
customer dissatisfaction before taking actions. “To succeed, companies must be proactive and
anticipate what a customer wants.” Thearling cites four forces that interact to make CRM
complex as well as important:
• Compressed market cycle times decrease customer attention spans and loyalty. Successful
CRM reinforces values provided to customers on a continuous basis. The time between new
customer desires and when companies must meet the desires is shrinking. If companies don’t
react quickly, some other company will.
• Increased marketing costs impact price and profitability. CRM permits companies to
compete on a basis other than price alone. As stated above, competition on price alone soon
becomes unprofitable. Competition based on customer service is as effective and typically
serves the most profitable customers.
• Streams of new product offerings cause customers to constantly shop for new sources of
products and services that will satisfy their exact needs. Customers want exactly what they
want, not some “close fit.” Indications are that this trend will continue.
• Niche competitors are increasingly able to reach your current customers. The best customers
also look good to competitors. They will focus on small profitable market segments that may
include a significant portion of a company’s customer base.
Successful companies must react to all of these forces in a timely manner. Markets will not
wait for companies to respond, and customers will seek new sources for their products and
services. Customers and prospective customers want to interact with companies on their own
terms. Successful companies must consider multiple criteria when developing CRM strategies.
Data mining must be relevant to business processes if it is to have a positive impact on CRM.
The impact is dependent upon the business processes, not the data mining process. Data within a
company are generated as a result of the process that defines how the company conducts its
business. For example, the marketing department interacts with customers based on numerous
factors such as direct marketing, print advertising, telemarketing, radio/television advertising,
and other methods of marketing. Data are generated during each interaction. Data mining
analyzes the data and looks for information that the company doesn’t know exists. The “gems”
being sought are previously unknown relationships between customers and the marketing
process. Most valuable are relationships that are counter-intuitive, because they enhance
understanding of customers so that the appropriate CRM can be developed.
Thearling identifies four CRM factors that must be automated. These factors provide the
overall focus for data mining. He suggests that automation must achieve:
• The Right Offer - managing multiple interactions with customers, prioritizing what the offers
will be while making sure that irrelevant offers are minimized
• to The Right Person – every customer is unique so interactions with them need to move
toward highly segmented marketing campaigns that target individual wants and needs
• at The Right Time - interactions with customers now occur on a continuous basis
• through The Right Channel – interactions with customers can occur in a variety of ways
(direct mail, email, telemarketing), so companies need to be sure that they are choosing the
most effective medium for a particular interaction
Data mining first identifies market segments and customers within the segments that exhibit
high profit potential based on previous buying behavior. Poulos states, “Segmentation is an art
form and the company’s most powerful tool with which to create a loyalty focused enterprise.”
( ) Data mining then derives characteristics that can be used to develop marketing campaigns and CRM
strategies to maximize the future customer-company interface. Data mining can enhance CRM
by identifying target markets more accurately, permitting a better alignment of campaigns to the
needs, wants, and desires of customers and prospective customers. The following figure
illustrates Thearling’s concept of Data Warehouses, Data Mining, and CRM.
Figure 2: The Data Warehousing “Prediction Loop”
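The segmentation step described before the figure can be illustrated with a minimal sketch. It assumes a hypothetical purchase history of (customer, profit) pairs and a hypothetical profit threshold; neither comes from the sources cited here, and real segmentation would draw on many more variables than total profit.

```python
from collections import defaultdict

# Hypothetical purchase history: (customer_id, profit on one order).
purchases = [
    ("C1", 40.0), ("C1", 55.0), ("C2", 5.0),
    ("C3", 20.0), ("C3", 22.0), ("C3", 18.0), ("C4", 8.0),
]

def segment_by_profit(history, high_threshold):
    """Split customers into segments by total profit from past purchases."""
    totals = defaultdict(float)
    for customer_id, profit in history:
        totals[customer_id] += profit
    segments = {"high": [], "standard": []}
    for customer_id, total in sorted(totals.items()):
        key = "high" if total >= high_threshold else "standard"
        segments[key].append(customer_id)
    return segments

# The threshold is an assumption chosen only for illustration.
segments = segment_by_profit(purchases, high_threshold=50.0)
```

The "high" segment would then be the starting point for campaign design, while the threshold itself is a managerial choice rather than something the data dictate.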
Data mining can be defined in terms of three facilities: applications, approaches, and
algorithms. These three facilities “sit on top of” the raw data generated by the business. They
analyze the data on a continuous basis, looking for current relationships and, more importantly,
changes in relationships that indicate future shifts in CRM.
Applications refer to sets of problems that have similar characteristics across different
application domains. For example, marketing people can use customer demographics as a
method of segmenting high potential customers from lesser potential customers. The
segmentation process is similar regardless of the types of businesses being analyzed. Analyzing
data from an auto manufacturer and a drug manufacturer will be similar in nature, and the results,
while specific to the companies, will also be similar in nature.
Approaches refer to the sets of algorithms used to extract information from databases. These
approaches differ by the types of problems they are able to solve.
• Association – addresses the typical “market basket” problem. The goal is to find trends
across large numbers of transactions that can provide understanding and facilitate
exploitation of buying patterns.
• Sequence based analysis – is an extension of the Association approach that considers the
sequential order in which purchases are made. In this analysis an understanding of
buying patterns is examined as well as the sequences in which the patterns occur.
• Clustering – focuses on segmentation problems. Customers are assigned to specifically
defined clusters based on buying patterns and demographics. Buying habits of clusters
are often compared to see if new marketing campaigns may be developed for other
• Classification – uses a set of predefined examples to develop a model that can be used to
classify all data in a population. Fraud detection and credit risk applications are well
suited to this type of analysis.
• Estimation – is a variation of the classification approach that generates “scores” along
various dimensions in the data. For credit risk applications, potential clients receive a
“credit worthiness score” rather than being classified in a binary fashion.
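The Association approach above can be sketched with two common market basket measures, support and confidence. This is a minimal illustration on invented transactions, not the method of any tool named in this paper; the item names and function names are assumptions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each set is one customer's basket.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def pair_support(baskets):
    """Return the fraction of baskets that contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

def confidence(baskets, antecedent, consequent):
    """Estimate how often the consequent appears in baskets holding the antecedent."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

support = pair_support(transactions)
bread_to_milk = confidence(transactions, "bread", "milk")
```

A pair with high support and high confidence is a candidate buying pattern; whether it is useful for a campaign still has to be judged by people who know the business.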
Algorithms refer to the specific analysis methods utilized to analyze data. Algorithms
generally fall into one of two categories:
• Theory driven modeling tools such as correlation and regression, hypothesis tests, ANOVA,
discriminant analysis, and forecasting methods. These tools utilize analysis to develop
models proposed by managers. For example, regression analysis may develop a model to
investigate a relationship between sales and levels of education that is proposed by a sales
• Data driven modeling tools such as cluster analysis, factor analysis, decision trees, and neural
networks. These tools utilize analysis to develop patterns in data sets that are then
interpreted by managers in order to gain an understanding or perspective on customer
Data mining is not without its drawbacks. The problems common to the algorithms currently
utilized fall into three general categories:
• Susceptibility to “dirty data” – data mining tools simply take data presented to them and
analyze it. They have no semantic structure, nor do they have any ability to identify
anomalies or erroneous data. Users must ensure that the data are “clean” before submitting them
to analysis. The concept of Data Quality, discussed below, addresses the accuracy and
applicability of data utilized in the analysis.
• Inability to “explain” results in human behavior terms – the algorithms develop mathematical
representations of relationships, but offer no suggestion about the nature of the relationships
or why they exist. The reasons people act as they do don’t necessarily map accurately into “if – then”
rules, so results must always be interpreted by people. Regression analysis provides an
excellent example of this “problem.” A regression model may express a strong relationship
between sales and specific customer characteristics, but the relationship cannot explain why
customers with the specific characteristics exhibit higher levels of sales.
• Data representation gap – human actions may not be represented accurately once they are
codified as data. Categorizing continuous variable data may lose so much information
that the analysis is no longer valid. The most frequent “gap” occurs when a Likert scale is
used to ascertain levels of customer satisfaction. There is no guarantee that a five-point scale
will consistently measure satisfaction levels across all customers. What is “very satisfied” to
one customer may be “satisfied” to another customer. Data Mining algorithms have no way
to differentiate between similar levels of satisfaction when they are coded differently. What
results is an unclear picture of customer satisfaction due to the variability introduced by the
Because Data Mining utilizes vast quantities of data to support managerial decisions, the
accuracy and applicability of the data are a primary issue. The Dataflux Company states, “One of
the critical foundations for the effective "warehousing" and "mining" of volumes of data is data
quality. If data is of insufficient quality, then the knowledge workers who query the data
warehouse and the decision makers who receive the information can not and should not trust the
results.” ( ) Atkins states, “Data quality includes but goes beyond accuracy – such as correct
name and address data at the character level. Rather, it is the level of accuracy, consistency of
format and data representation and completeness that permits matching and integration of all
records that pertain to an entity, such as a customer or patient, a product or a location.” ( ) Wu
illustrates the relationship between a Data Warehouse and data accuracy. He states, “The
credibility of the data warehouse solely rests with the integrity of its data.” Data accuracy is of
paramount importance when using a Data Warehouse. Managers must understand what data
quality is and how it is achieved in order to ensure the viability of their Data Warehouse and,
ultimately, their CRM efforts.
Wu provides a perspective on achieving data quality. He states that data integrity
(quality) methods fall into two categories:
• Prevention Controls – These controls identify data errors before they enter the Data
Warehouse. The controls are part of the process of migrating data from the various
applications systems into the warehouse. This is the primary method of keeping
meaningless, corrupt, and redundant data out of the warehouse.
• Detection Controls - These controls assess data accuracy and completeness after the data are
migrated into the warehouse. They focus on reconciliation of data to detect insufficient
or incongruent data.
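The two categories of controls can be sketched as follows. This is a minimal illustration under stated assumptions: the record fields (order_id, customer_name, amount), the validation rules, and the function names are all hypothetical and are not taken from Wu.

```python
def prevention_check(record):
    """Prevention control: list the problems found before a record is loaded."""
    problems = []
    if not record.get("order_id"):
        problems.append("missing order_id")
    if not str(record.get("customer_name", "")).strip():
        problems.append("missing customer_name")
    amount = record.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        problems.append("invalid amount")
    return problems

def migrate(source_records):
    """Load only records that pass the checks; reject bad or redundant ones."""
    warehouse, rejected, seen_ids = [], [], set()
    for record in source_records:
        problems = prevention_check(record)
        if record.get("order_id") in seen_ids:
            problems.append("duplicate order_id")
        if problems:
            rejected.append((record, problems))
        else:
            seen_ids.add(record["order_id"])
            warehouse.append(record)
    return warehouse, rejected

def detection_check(source_records, warehouse, rejected):
    """Detection control: reconcile the loaded warehouse against the source."""
    loaded_ids = {r["order_id"] for r in warehouse}
    return {
        "rows_accounted_for": len(warehouse) + len(rejected) == len(source_records),
        "duplicate_ids_in_warehouse": len(loaded_ids) != len(warehouse),
    }

# Hypothetical source extract with one blank name, one negative amount,
# and one redundant copy of order 1.
source = [
    {"order_id": 1, "customer_name": "Smith", "amount": 120.0},
    {"order_id": 2, "customer_name": "", "amount": 35.5},
    {"order_id": 3, "customer_name": "Jones", "amount": -10},
    {"order_id": 1, "customer_name": "Smith", "amount": 120.0},
    {"order_id": 4, "customer_name": "Lee", "amount": 60.0},
]
warehouse, rejected = migrate(source)
report = detection_check(source, warehouse, rejected)
```

Keeping the rejected records together with their reasons lets the reconciliation step account for every source row rather than silently dropping bad data.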
FirstLogic, Inc. provides a more specific perspective on the data migration process from
application system to the Data Warehouse. ( ) Data quality is achieved through three stages:
• Cleansing – This stage focuses on parsing, correcting, and standardizing data.
o Parsing – This process locates, identifies, and isolates individual elements within
the data from the application systems. Such elements as “Customer Number”
and “Customer Name” are identified. Parsing makes it easier to correct,
standardize, and match data because it facilitates individual data item
o Correcting – This process attempts to remove errors and inconsistencies from the
data. Such errors as variations in spelling and abbreviation, outdated data, and
transpositions are identified and corrected. For example, the customer names
“Smith” and “Smtih” will be matched and the second name changed to be
consistent with the first. Similarly, the city name Hollywood may be defined in
the warehouse as an alias for Los Angeles, providing consistency with respect to
the company address.
o Standardizing – This process permits the representation of data in a consistent
and preferred format. Some problems addressed include inconsistent
abbreviations, unusual titles, and variant spellings. For example, the
International Harvester Company may be abbreviated as “Intl. Harvester,” “Int.
Harvester,” or “Internatl. Harvester.” Standardizing will change all of these
names to one name selected by management.
• Matching – This stage focuses on comparing data across and between various sources,
searching for similar information that may be consolidated in the next stage. Matching
permits the discovery of similar data within and across applications software data. This
is the “heart” of Data Warehouse quality. Duplicate representations, now consistently
represented, can be identified and eliminated. All relevant data can then be consolidated,
providing a complete, accurate, consistent source of data about customers.
• Consolidation – This stage focuses on developing relationships between customers and
between customers and demographics. The goal is to build a consolidated view of each
customer so successful one-to-one marketing and CRM can be achieved. One-to-one
marketing permits companies to better serve customers at every point of contact by
identifying their specific behavioral characteristics. For example, customers who prefer
telephone calls from salespeople can be identified and scheduled for calls, while those
who prefer e-mails will not be scheduled for calls.
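The three stages above can be sketched as a small pipeline. This is a simplified illustration, not FirstLogic's implementation: the semicolon-delimited input layout, the lookup tables, and all function names are assumptions, and exact-match lookup tables stand in for the fuzzy correction a real tool would need to catch errors such as “Smtih.”

```python
import re
from collections import defaultdict

# Hypothetical standardization tables; the preferred forms would be chosen by management.
STANDARD_NAMES = {
    "intl. harvester": "International Harvester",
    "int. harvester": "International Harvester",
    "internatl. harvester": "International Harvester",
}
CITY_ALIASES = {"hollywood": "Los Angeles"}

def parse(raw):
    """Parsing: isolate the individual elements of a 'number;name;city;channel' line."""
    number, name, city, channel = (part.strip() for part in raw.split(";"))
    return {"customer_number": number, "name": name, "city": city, "channel": channel}

def standardize(record):
    """Correcting and standardizing: collapse whitespace and apply the alias tables."""
    name = re.sub(r"\s+", " ", record["name"])
    cleaned = dict(record)
    cleaned["name"] = STANDARD_NAMES.get(name.lower(), name)
    cleaned["city"] = CITY_ALIASES.get(record["city"].lower(), record["city"])
    cleaned["channel"] = record["channel"].lower()
    return cleaned

def match(records):
    """Matching: group records that now share the same name and city."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["name"].lower(), rec["city"].lower())].append(rec)
    return groups

def consolidate(groups):
    """Consolidation: build one view per customer from each matched group."""
    return [
        {
            "name": recs[0]["name"],
            "city": recs[0]["city"],
            "customer_numbers": sorted(r["customer_number"] for r in recs),
            # Assumption: the most recently seen record holds the current preference.
            "preferred_channel": recs[-1]["channel"],
        }
        for recs in groups.values()
    ]

# Hypothetical extracts from two application systems.
raw_lines = [
    "1001; Intl. Harvester ; Hollywood ; Phone",
    "2207;Internatl.  Harvester;Los Angeles;phone",
    "3050;Acme Supply;Chicago;Email",
]
customers = consolidate(match(standardize(parse(line)) for line in raw_lines))
call_list = [c["name"] for c in customers if c["preferred_channel"] == "phone"]
```

Because standardization runs before matching, the two Harvester records collapse into a single customer, and the consolidated view then drives the contact decision: only customers who prefer the telephone appear on the call list.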
The ultimate goal of data quality is to ensure accurate, consistent, applicable data that can be
analyzed with Data Mining tools. Without quality data, the CRM efforts are doomed to failure
because the data do not accurately represent the customers, their characteristics and
demographics, and their buying preferences and behaviors. As the Dataflux Company states,
“An information system is only as good as the data which resides within it. A data warehouse,
data mart, marketing database, or data mining initiative cannot be expected to deliver a
satisfactory return on investment unless the data within the system is accurate, reliable, and