2. What is Data Mining?
Data mining is an information extraction activity
whose goal is to discover hidden facts
contained in databases. It finds patterns and
subtle relationships in data, inferring rules
and generalizations that allow the prediction
of future results. To be a true knowledge
discovery method, a data mining tool should
unearth information automatically.
pres071911a 2
3. Overview
The purpose of this presentation is to
introduce a new and powerful
methodology and associated software
that overcomes many of the limitations
of the other data mining methods in
use today.
4. Background
The manual extraction of patterns from data has occurred for centuries.
Early methods of identifying patterns in data included statistical
methods such as Bayes' theorem (1700s) and regression analysis
(1800s). The proliferation, ubiquity and increasing power of computer
technology have increased data collection, storage and manipulation.
As data sets have grown in size and complexity, direct hands-on data
analysis has increasingly been augmented with indirect, automatic data
processing. This has been aided by other discoveries in computer
science, such as neural networks, clustering, genetic algorithms
(1950s), decision trees (1960s) and support vector machines (1990s).
Data mining is the process of applying these methods to data with the
intention of uncovering hidden patterns. (Wikipedia)
5. How Does Data Mining Work?
Data mining involves building predictive models
that help an enterprise understand how to proceed
more effectively.
In order to build a predictive model, several steps
are necessary. Before we outline these steps, here
is a real-world problem.
8. Steps in Data Mining
Define the problem goals
Identify data sources
9. Then build the model
[Diagram: Data, Mine Data, Model, If-Then Rules]
10. Results
A model containing rules that show the
best-of-breed (BOB) treatment for each case,
and why:
If diagnosis = Congestive HF and Age = 60-70
and previous bypass = yes and . . .
Then BOB Treatment = aortic stent
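As a rough illustration of how such an If-Then rule could be held and checked in software, here is a minimal Python sketch. The field names (`diagnosis`, `age_band`, `previous_bypass`) and the `rule_fires` helper are hypothetical, chosen only for this example. The slide elides further conditions (". . ."), so only the three visible ones are represented. This is not a description of how any particular product stores its rules.

```python
# Hypothetical encoding of the rule shown on the slide.
# Only the three visible conditions are included; the slide's
# elided conditions (". . .") are deliberately left out.
RULE = {
    "if": {
        "diagnosis": "Congestive HF",
        "age_band": "60-70",
        "previous_bypass": "yes",
    },
    "then": ("BOB Treatment", "aortic stent"),
}


def rule_fires(rule, case):
    """Return True when the case matches every condition of the rule."""
    return all(case.get(attr) == value for attr, value in rule["if"].items())


# An illustrative patient record (made up for this sketch).
case = {
    "diagnosis": "Congestive HF",
    "age_band": "60-70",
    "previous_bypass": "yes",
}

if rule_fires(RULE, case):
    target, value = RULE["then"]
    print(f"{target} = {value}")
```

Because the conditions are stored next to the conclusion, the "why" behind a recommendation can be read straight from the rule that fired.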
11. Steps in a Data Mining Project
1. Define the business or scientific problem
Example: Which of my current customers are likely to
become inactive in the next 6 months?
2. Gather a historical data file
Prepare a file of customers (present and past) that
includes predictive descriptors such as start date, date of
first sale, date of last sale, sales by month, how
acquired, etc.
Include the current status (active or inactive) for each
customer.
12. Steps in a Data Mining Project (continued)
3. Cleanse the Data
Data cleansing reduces noisy and missing data and removes
erroneous data.
4. Add Derived Attributes
Create additional variables from the original data if necessary
(example: compute customer account duration from start date and
current date)
5. Create Test and Holdout Files
Randomly separate the original file into two parts, called the test
and holdout files.
Build the predictive model from the test file with modeling software;
the holdout file is set aside for validation.
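Steps 4 and 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the `holdout_fraction` of 0.3, the fixed `seed`, and the "current date" are all made up for the example, and the function names are hypothetical.

```python
import random
from datetime import date


def account_duration_days(start_date, current_date):
    """Derived attribute: customer account duration in days."""
    return (current_date - start_date).days


def split_test_holdout(records, holdout_fraction=0.3, seed=0):
    """Randomly separate records into the 'test' and 'holdout' files."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Made-up customer file with only the fields needed for this sketch.
customers = [
    {"id": i, "start_date": date(2010, 1, 1 + i), "status": "active"}
    for i in range(10)
]

# Step 4: add the derived attribute (the current date is an assumption).
today = date(2011, 7, 19)
for customer in customers:
    customer["duration_days"] = account_duration_days(customer["start_date"], today)

# Step 5: the model is built on test_file; holdout_file is kept for validation.
test_file, holdout_file = split_test_holdout(customers)
print(len(test_file), len(holdout_file))
```

Shuffling before splitting keeps the two files statistically similar, which is the point of separating them at random.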
13. Steps in a Data Mining Project (continued)
6. Validate the Model
Validation uses a set of data that was not used when
building the model. This is the holdout file defined previously.
The learned patterns are applied to the holdout file and the resulting
output is evaluated for accuracy.
For example, a data mining algorithm trying to distinguish spam
from legitimate emails would be trained on a training set of
sample emails. Once trained, the learned patterns would be
applied to the test set of emails on which it had not been trained.
The accuracy of these patterns can then be measured from how
many emails they correctly classify.
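The accuracy measurement described above amounts to counting correct classifications on data the model has not seen. A minimal Python sketch follows; `toy_spam_model` is a hypothetical stand-in for a trained model, and the four emails are invented for illustration.

```python
def accuracy(model, holdout):
    """Fraction of holdout records whose predicted label matches the known label."""
    correct = sum(1 for features, label in holdout if model(features) == label)
    return correct / len(holdout)


def toy_spam_model(text):
    """Hypothetical stand-in for a trained model: flags messages mentioning 'prize'."""
    return "spam" if "prize" in text.lower() else "legitimate"


# Invented holdout emails with their known labels.
holdout_emails = [
    ("Claim your PRIZE now", "spam"),
    ("Meeting moved to 3pm", "legitimate"),
    ("Cheap loans, act today", "spam"),
    ("Lunch on Friday?", "legitimate"),
]

print(accuracy(toy_spam_model, holdout_emails))  # 3 of 4 correct -> 0.75
```

The third email is misclassified because the toy model only knows one pattern, which is exactly the kind of weakness a holdout check is meant to expose.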
14. Steps in a Data Mining Project (continued)
At this point a model has been created and it can
now be used.
7. Use the Model to Make Predictions
The final step of knowledge discovery from data is
to use the model produced by the data mining
algorithms. As new data come in, the model is
applied to them to make predictions.
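Applying a rule-based model to incoming records might look like the following Python sketch. The rules, attribute names and default value are hypothetical and exist only to show the mechanics: each new record is checked against the rules in order and receives the prediction of the first rule it satisfies.

```python
# Hypothetical rule list: each entry is (conditions, prediction).
RULES = [
    ({"months_since_last_sale": "6+", "how_acquired": "promotion"}, "inactive"),
    ({"months_since_last_sale": "0-5"}, "active"),
]
DEFAULT = "unknown"  # returned when no rule applies


def predict(record):
    """Return the prediction of the first rule whose conditions all match."""
    for conditions, prediction in RULES:
        if all(record.get(attr) == value for attr, value in conditions.items()):
            return prediction
    return DEFAULT


# Invented new customer records arriving after the model was built.
new_customers = [
    {"id": 101, "months_since_last_sale": "6+", "how_acquired": "promotion"},
    {"id": 102, "months_since_last_sale": "0-5", "how_acquired": "referral"},
    {"id": 103, "months_since_last_sale": "6+", "how_acquired": "referral"},
]

for customer in new_customers:
    print(customer["id"], predict(customer))
```

A record that matches no rule falls through to the default, which makes gaps in the model's coverage visible instead of hiding them.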
15. Comparison of Methods
Nuggets offers benefits that the other methods don’t
offer. Here are a few:
Handles missing data
Handles very large numbers of predictor attributes
Fast model development
Able to model small data patterns missed by other
methods
Handles a wide variety of data types
Doesn’t require highly trained specialists
16. Principal Data Mining Techniques
Industry Standard Methods
Statistics
Neural Nets
Decision Trees
The following slides compare Nuggets with
these principal competitors.
17. Nuggets
Nuggets is a proprietary technology that uses proprietary
search algorithms to intelligently prospect data for valid
hypotheses.
In the act of searching, the algorithms “learn” about the
training data as they proceed.
The result is a very fast and efficient discovery strategy
that does not preclude any potential rule or generalization
from being found. This document outlines its advantages
over its competitors in providing useful and profitable
information from the vast store of data that are being
accumulated at an ever increasing rate.
18. Statistics Methods Pros/Cons
Statistics Pros
Statistical analysis is sometimes a good ‘first step’ in
understanding data. These methods deal well with numerical
data where important mathematical facts such as the
underlying probability distributions of the data are known.
However, in today’s world these mathematical facts are rarely
known. These methods are not as good with nominal data
values such as “good”, “better”, “best” in the case of a
preference attribute or “Europe”, “North America”, “Asia” or
“South America” in the case of a location attribute.
19. Method Pros/Cons
Statistics (continued)
Some of the statistical methods commonly used are
regression analysis, correlation, CHAID analysis,
hypothesis testing, and discriminant analysis.
20. Statistics Methods Pros/Cons (cont.)
Nuggets Advantages Over Statistics
Statistical methods require statistical expertise, or the heavy
involvement of a project member well versed in statistics.
Such methods rely on statistical assumptions that are difficult
to verify. They also suffer from the “black box aversion
syndrome”. This means that non-technical decision
makers, those who will either accept or reject the results of
the study, are often unwilling to make important decisions
based on a technology that gives them answers but does
not explain how it got the answers.
21. Statistics Method Pros/Cons
Nuggets Advantages Over Statistics
Telling a non-statistician CEO that she or he must make a crucial
business decision because of a favorable R statistic or some other
arcane statistical reason is not usually well received. With Nuggets®,
you are told exactly how the conclusion was reached.
Another problem is that statistical methods are valid only if certain
assumptions about the data are met. Some of these assumptions are:
linear relationships between pairs of variables, non-multicollinearity,
normal probability distributions and independence of samples. If you
do not validate these assumptions, whether because of time limitations
or because you are not familiar with them, your analysis may be faulty
and your results may not be valid. Even if you know about them, you
may not have the time or information to verify them.
22. Method Pros/Cons - Neural Nets
Neural Networks
This is a popular technology, particularly in the financial
community. This method was originally developed in the
1940s to model biological nervous systems in an attempt
to mimic human thought processes.
23. Method Pros/Cons - Neural Nets
Pros
The end result of a Neural Net project is a
mathematical model of the process. It deals
primarily with numerical attributes such as age,
income, height, etc., but not as well with nominal
data such as state, brand preference, vehicle make,
etc.
24. Method Pros/Cons - Neural Nets
Nuggets Advantages
There is still much controversy regarding the
efficacy of Neural Nets. One major objection to
the method is that the development of a Neural
Net model is partly an art and partly a science in
that the results often depend on the individual
who built the model. That is, the model form
(called the network topology), and hence the
results, may differ from one researcher to another
for the same data.
25. Method Pros/Cons - Neural Nets
Neural Nets also have the problem of “overfitting”, which
results in good prediction of the data used to build the
model but bad results with new data. Neural Nets often use a
sigmoid function in their computations. This is a mathematical
function resembling the shape of the letter “S”. It is
questionable whether there is any theoretical justification for
this somewhat arbitrary choice, which makes the approach
somewhat ad hoc.
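For readers who have not seen it, the sigmoid most commonly meant here is the logistic function 1 / (1 + e^(-x)). The short Python sketch below evaluates it at a few points to show the S shape. That the logistic form is the one intended is an assumption; the slide does not give a formula.

```python
import math


def sigmoid(x):
    """Logistic sigmoid: maps any real input to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


# Values rise slowly, then steeply around 0, then level off: the "S" shape.
for x in (-4, -2, 0, 2, 4):
    print(x, round(sigmoid(x), 3))
```

The output climbs from near 0 for large negative inputs, through 0.5 at zero, to near 1 for large positive inputs.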
Another issue is that the modeling results produced by a
Neural Net method are not intuitive. The method is called a
“black box” to indicate the lack of intuitive understanding of
its results. Neural Nets are still in use but becoming less
popular due to these issues.
26. Method Pros/Cons Decision Trees
Decision Trees (CART, CHAID, etc.)
Decision tree methods are techniques for
partitioning a training file into a tree
representation. The starting node is called the root
node. Depending upon the results of a test this
node is then partitioned into two or more sub-sets.
Each node is then further partitioned until a tree is
built. This tree can be mapped into a set of rules.
These rules in the form of a data tree are used to
generate forecasts.
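The mapping from a tree to a set of rules can be sketched as a walk from the root node to every leaf, collecting the test results along the way. The Python example below uses a small hand-written tree; its attributes and values are hypothetical and were not produced by CART, CHAID or any other tool.

```python
# Hypothetical hand-built tree: internal nodes hold a test attribute and
# one branch per outcome; leaves hold the predicted value.
TREE = {
    "test": "how_acquired",
    "branches": {
        "referral": {"leaf": "active"},
        "promotion": {
            "test": "months_since_last_sale",
            "branches": {
                "0-5": {"leaf": "active"},
                "6+": {"leaf": "inactive"},
            },
        },
    },
}


def tree_to_rules(node, conditions=()):
    """Return one (conditions, outcome) rule per root-to-leaf path."""
    if "leaf" in node:
        return [(list(conditions), node["leaf"])]
    rules = []
    for value, child in node["branches"].items():
        rules.extend(tree_to_rules(child, conditions + ((node["test"], value),)))
    return rules


for conds, outcome in tree_to_rules(TREE):
    lhs = " and ".join(f"{attr} = {value}" for attr, value in conds)
    print(f"If {lhs} Then status = {outcome}")
```

Each leaf yields exactly one rule, so this tree with three leaves maps to three If-Then rules.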
27. Method Pros/Cons Decision Trees
Nuggets Advantages
By far the most important negative for decision trees is that
they are forced to make decisions along the way based on
limited information that implicitly leaves out of consideration
the vast majority of potential patterns in the training file. This
approach may leave valuable patterns undiscovered since
decisions made early in the process will preclude some good
rules from being discovered later. This is called “greedy
optimization” and lessens the accuracy of the resulting model.
Furthermore, decision trees do not handle the large numbers of
predictor attributes found in most of today’s data sets.
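The effect of greedy, one-attribute-at-a-time decisions can be seen on a tiny made-up training file in which the target depends on two attributes jointly. In the Python sketch below, splitting on either attribute alone looks no better than not splitting at all, so a greedy learner that demands an improvement at each step has no reason to go further, yet the two attributes together predict the target perfectly. The data and the `split_accuracy` helper are hypothetical, and the sketch illustrates only the limitation, not how Nuggets itself searches.

```python
from collections import Counter

# Made-up training file: the target is "yes" only when A and B differ.
ROWS = [
    {"A": 0, "B": 0, "target": "no"},
    {"A": 0, "B": 1, "target": "yes"},
    {"A": 1, "B": 0, "target": "yes"},
    {"A": 1, "B": 1, "target": "no"},
]


def split_accuracy(rows, attrs):
    """Accuracy when each group of rows sharing the same attrs values
    is assigned its majority target."""
    groups = {}
    for row in rows:
        key = tuple(row[a] for a in attrs)
        groups.setdefault(key, Counter())[row["target"]] += 1
    correct = sum(max(counts.values()) for counts in groups.values())
    return correct / len(rows)


print(split_accuracy(ROWS, []))          # no split at all
print(split_accuracy(ROWS, ["A"]))       # split on A alone
print(split_accuracy(ROWS, ["B"]))       # split on B alone
print(split_accuracy(ROWS, ["A", "B"]))  # A and B considered together
```

The first three figures are identical, which is why a pattern of this kind can be passed over when attributes are evaluated one at a time.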
28. Method Pros/Cons Decision Trees
Nuggets Advantages
Nuggets does not make these greedy
decisions. Instead it “implicitly”
searches all possible patterns and
thus is able to find patterns that are
useful but that wouldn’t be found
with decision trees.
29. Summary of Comparison With Other Methods
Nuggets Advantages
Nuggets offers many advantages over other methods in
common use. A few were presented here.
Nuggets’ advantages vary from method to method, and
most stem from the assumptions these older methods
require, which limit their effectiveness.
Nuggets is designed to circumvent these disadvantages
and offer a superior methodology that can work with the
challenges of the large number of complex databases that
exist in today’s world.