Understanding Data Mining
The effectiveness of our buying decisions is governed by how well we understand our problems, our assets
and our objectives. Data mining, or “DM,” can help us make sense of our available information. Data mining
refers to a variety of applications, tools and processes that almost always involve knowledge creation
and/or extraction. But today’s technology landscape is immense and overcrowded with a mix of products,
vendors and challenges. To truly understand your data and data mining itself, the best place to start is
with business intelligence.
Business intelligence (BI)
As shown in Figure 1, Business Intelligence is an
umbrella term that encapsulates many existing disciplines:
• Data Warehousing (DW)
• Query and Reporting (Q&R)
• On-Line Analytical Processing (OLAP)
• Statistical Analysis
• Data Mining (DM)
Figure 1. Business Intelligence is the end result of the data disciplines listed above
Data is a critical asset of any organization, and BI is
the process of drawing the full value from this critical
asset. The kinds of value that can be drawn from data
include information and reporting, explanation and
insight, and forecasting and prediction.
Business Intelligence provides organizations with a
complete temporal spectrum, from historical to future
perspective, as illustrated in Figure 2. Each of the disciplines plays a key role in providing these perspectives.
Figure 2. Various types of value available from business intelligence over time
Historical and current analytics
Query and reporting (Q&R) includes both static and ad hoc options that provide details on historical and current
values and quantities. These reports do not, however, provide projections or predictive capabilities. OLAP is more
advanced than Q&R because it provides for rapid drill-down and analysis of dimensional data, as well as trending
and forecasting. One drawback to OLAP is that analysis is typically done on aggregated sets of data in the form
of a "cube" (a summarized view of a dataset) and not on the detailed data itself.
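The cube idea can be sketched in a few lines of Python; the dimensions (region, quarter) and the sales measure are illustrative, not taken from any particular product:

```python
from collections import defaultdict

# Toy fact table: (region, quarter, sales) rows -- names are illustrative.
facts = [
    ("East", "Q1", 100), ("East", "Q2", 150),
    ("West", "Q1", 200), ("West", "Q2", 250),
]

# "Cube": the sales measure aggregated over the (region, quarter) dimensions.
cube = defaultdict(int)
for region, quarter, sales in facts:
    cube[(region, quarter)] += sales

# Drill down from a region total to its quarterly cells.
east_total = sum(v for (r, _), v in cube.items() if r == "East")
print(east_total)            # 250
print(cube[("East", "Q1")])  # 100
```

Analysis against the cube is fast because the measures are pre-aggregated, but any detail not captured in the chosen dimensions is lost, which is the drawback noted above.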
Predictive analytics fall into two main categories: statistical and data mining. Statistical approaches emerged
from statistical theory, whereas data mining emerged from computational theory. Statistical analytics and data
mining are both rooted in mathematics and share many of the same techniques. However, statistical approaches
work to uncover the relationships and correlations among the variables in the predictive models, while data
mining, also known as “machine learning,” focuses on the predictive capability of the model without regard to
explanation. Data mining techniques do not require the user to know the relationships or patterns in the data,
nor do they necessarily lead to simple explanations of the relationships. Today’s data mining tools and applications
incorporate many of the algorithms traditionally associated with a statistical approach, and leave the choice
of approach to the analyst.
Data Mining Process
The data mining process involves more than just the execution of a data mining algorithm to produce a predictive
model. In fact, most of the time and effort in data mining is typically spent defining exactly what needs to
be predicted, collecting the data, preparing the data, interpreting the results and deploying the model into an
end user environment. Because of the investment and skill sets necessary to produce working and trustworthy
models, a detailed examination of business needs is required.
The high level process
The high level process consists of the following six steps:
1. Define what is to be predicted
2. Decide on the appropriate modeling type
3. Prepare data sources
4. Build the model
5. Interpret the results
6. Deploy the model
Each of these steps is described in more detail below.
Step 1. Define what is to be predicted
With this first step, you create a clear definition of the prediction and associated requirements, the data that is
needed, the data that is available, the reason(s) the prediction is needed, and the way the prediction will be used.
Step 2. Decide on the appropriate modeling type
For this step, you choose a modeling type, which can be one of four: Classification, Clustering/Segmentation,
Regression or Forecasting/Trending.
Step 3. Prepare data sources
Perhaps the most time-consuming of all the steps is the data preparation step, also known as extract, transform
and load (ETL). At various stages in this process, it's advisable to archive snapshots of the data so that results
can later be traced back to their inputs. Throughout the data mining exercise, changes are made to the data to keep
the whole process moving; later, when analyzing output/results, it will be necessary to know exactly what data
went into the DM process. Preparing the data sources
also involves pulling data from resident systems (it may be necessary to integrate data housed in multiple systems),
transforming the data to a format appropriate for the data mining platform, and identifying data variables
necessary for the data mining effort. Preparing the data also involves data cleanup, in order to avoid "garbage
in, garbage out." Poor data quality typically results in invalid predictions without the end user being aware.
Data quality is a requirement, not a nice-to-have. Cleanup typically involves standardization and/or normalization
of the data, and data hygiene routines are often provided with the data mining application.
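Standardization and normalization, mentioned above, can be sketched in plain Python; the column values are illustrative:

```python
import statistics

values = [10.0, 12.0, 11.0, 55.0, 13.0]  # illustrative column with one outlier

# Standardization (z-scores): rescale to zero mean, unit standard deviation.
mean = statistics.mean(values)
stdev = statistics.stdev(values)
z_scores = [(v - mean) / stdev for v in values]

# Min-max normalization: rescale into the [0, 1] range.
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

print(abs(statistics.mean(z_scores)) < 1e-9)  # True (mean is ~0)
print(min(normalized), max(normalized))       # 0.0 1.0
```

Which rescaling is appropriate depends on the downstream technique; many distance-based methods in particular are sensitive to variables measured on very different scales.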
Step 4. Build the model
Two of the industry-accepted data mining processes are SEMMA (the SAS Institute approach) and CRISP-DM (the
Cross Industry Standard Process for Data Mining). In either case, there is a methodical approach to ensure robust
model development and deployment. Most analytics techniques are built on sampling: representative and
statistically valid samples are required to produce valid results. Improvements in DM applications and
platforms now permit working with entire datasets with little to no performance impact, reducing or even
eliminating the need to sample.
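Where sampling is still used, a simple random sample can be drawn as follows; the population and sample sizes are illustrative, and a real project would also check that the sample is representative (for example, by stratifying on key variables):

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Illustrative population of 100,000 customer record ids.
population = list(range(100_000))

# Simple random sample without replacement.
sample = random.sample(population, k=1_000)

print(len(sample), len(set(sample)))  # 1000 1000 (no duplicate records)
```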
Beyond sampling, such processes include an Explore phase (exploratory data analysis of central tendencies,
population characteristics, dispersion and distribution, frequencies, and outliers and anomalies) and a Modify
phase, which requires more ETL: once outliers, data gaps, skewed variables and data reduction opportunities have
been identified, they need to be modified and/or recoded in preparation for the DM platform. The modeling itself
can use a number of techniques, some of the more popular being neural networks, decision trees, linear and
logistic regression, discriminant analysis and rule-based methods. Finally, Assess is the process of validating
the accuracy and recall of the model. Typically, this is performed against a test dataset that was not included
in the model creation.
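The assessment of accuracy and recall against a held-out test dataset can be sketched as follows; the labels and predictions are illustrative:

```python
# Held-out test set: actual labels vs. model predictions (illustrative values).
actual    = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# Accuracy: fraction of all test cases the model got right.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# Recall: of the actual positives, how many the model found.
true_pos = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
recall = true_pos / sum(actual)

print(accuracy, recall)  # 0.8 0.8
```

Because the test records played no part in building the model, these figures estimate how the model will behave on genuinely new data rather than how well it memorized its training set.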
Step 5. Interpret
This step involves a subject matter expert (SME), who interprets the prediction of the model and translates
the results into a form appropriate for deployment to an end user. Often this becomes an iterative process, with
the SME providing valuable insight at this stage that can be folded back into the data. Step 5 is often revisited
multiple times, but the outcome is typically a superior and trusted prediction.
Step 6. Deploy
This is the process of making the model available to the end user. Without the deployment of a model, the results
are only available to the model developer. The model, in combination with a scoring engine, produces the
prediction for a given dataset. This process depends on the chosen DW, application infrastructure and DM platform.
Most advanced DM platforms can produce their models in the form of C++, Java or PMML for deployment into
a production DW, middle-tier infrastructure, or a combination of these. (PMML stands for Predictive Model Markup
Language. It is an XML-based specification that contains the Data Dictionary, Mining Schema, Transformation
Dictionary and Model Information. The biggest advantage of PMML deployment is that the data does not need to be
moved out of the DW to be scored.)
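A scoring engine, as described above, applies a deployed model to new data rows. The sketch below uses a hypothetical linear model with illustrative coefficient names, values and cutoff; a real deployment would load the model from PMML or generated code rather than a hard-coded dictionary:

```python
# Hypothetical deployed model: a linear score over two variables, with a
# cutoff turning the score into a yes/no prediction. Names and values are
# illustrative, not taken from any real PMML file.
model = {"intercept": -1.0, "coefficients": {"age": 0.02, "balance": 0.001}}
CUTOFF = 0.5

def score(row, model, cutoff=CUTOFF):
    """Minimal scoring engine: apply the model to one data row."""
    s = model["intercept"]
    for name, coef in model["coefficients"].items():
        s += coef * row[name]
    return 1 if s >= cutoff else 0

rows = [
    {"age": 30, "balance": 1200},  # score = -1.0 + 0.6 + 1.2 = 0.8 -> 1
    {"age": 25, "balance": 100},   # score = -1.0 + 0.5 + 0.1 = -0.4 -> 0
]
print([score(r, model) for r in rows])  # [1, 0]
```

In an in-database deployment, the equivalent of `score` runs inside the DW, so the rows never leave the warehouse.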
Successful Data Mining
Businesses and organizations must come to grips with certain required and undeniable needs for success
in data mining. Possessing sophisticated data mining tools and elaborate data warehousing systems does
not alone ensure successful data mining. There must be a focus on other key factors including data itself,
data experts and skillful analysts.
As referenced numerous times already, data is at the heart of data mining. Businesses and organizations that
attempt to apply DM techniques to data without first understanding their data run the risk of being seriously
misguided. When analytics illustrate trends and correlations or provide predictions based on models, businesses
and organizations are better equipped to take action. At some point, leaders have to trust the direction and/or
advice provided by data modeling. Invariably someone asks about the data used to arrive at the decision, and
at this point, it is too late to discover that there are issues with the data (for example: it wasn’t current, a quarter’s
worth of stats was left out, outliers weren’t eliminated, the database query used to pull the data didn’t join all the
tables as expected, and so on). Data quality and understanding have to remain a major focus during data mining.
Subject matter expert
There are silos and pockets of staff who know critical details about their data, recall changes in systems and
understand how data has evolved in their business or organization, yet none of this is written down. Typically,
several subject matter experts collectively possess a vast amount of knowledge and insight about the nature of
the data, the frequency of updates or changes, gaps in coverage and more. The SMEs need to
be involved early in data mining discussions to ensure assumptions do not derail efforts down the road. Subject
matter experts also play a critical role interpreting model results. The SME provides the context and business
perspective that is required to ensure user acceptance and a desirable application of the modeled results.
Data mining exercises are also highly dependent upon the skill and experience of the analyst. The encapsulation
of data mining complexities has broadened the use of predictive analytics to those who may not be experts.
Experts have sufficient knowledge and experience to know when to apply transformations such as the ladder of
powers and how to avoid the pitfalls of collinearity and overtraining. A good data mining platform and tool
alone are not enough
to ensure a valid predictive model is constructed.
Data Mining Vendors and Tools
There are two main categories of data mining platforms: stand-alone and integrated (in-database), the latter
offered by data warehouse vendors.
Stand-alone data mining platforms are those that are not bundled with a data warehouse, in other words, data
warehouse agnostic. Model creation utilizes either extracted data or data that is queried at scoring time from
the DW. Recent technical advances have the data warehouse and stand-alone vendors working together to provide
in-database scoring solutions, utilizing models developed from stand-alone applications. The leading stand-alone
data mining platform vendors are SAS Institute and SPSS. All other vendors are quite small in comparison.
Data warehouse vendors
Integrated DM platforms are bundled with a data warehouse. The data is mined in place in the data warehouse,
utilizing the data warehouse hardware for the data mining computations. All leading data warehouse providers
offer a data mining solution: Oracle, IBM, Microsoft and Teradata. Market research firm IDC defines the market
for data mining and statistical tools as Advanced Analytics Software. Table 1 illustrates just how much
stand-alone advanced analytics software vendors dominate the market.

Vendor            2005 Market Share (%)
SAS Institute     28.3
SPSS              12.6
Visual Numerics    3.2
Oracle             1.7

Table 1. DM solution vendors and their market share

Relative to DM, a recent trend of the large DW vendors has been to acquire and integrate existing stand-alone
DM platform vendors into their product offerings. In some cases, the DM platform, or some form of it, is then
presented as part of the base DW offering, such as Microsoft SQL Server 2005 Analysis Services (SSAS). The
advantages of an integrated DM platform from a DW provider include:
• a single integrated solution with the DW
• data can be mined in place
• the DW vendor often offers a complete suite of BI solutions
(Microsoft is number three behind Business Objects and Cognos)1
The disadvantages of an integrated solution include:
• single vendor reliance
• DW hardware resources typically utilized for computationally expensive mining activities
• current product offerings are not as extensive as market leading stand-alone vendors
1 IDC, Worldwide Business Intelligence Tools 2005 Vendor Shares, Doc #202603, July 2006 (used with permission)