Machine Learning and Data Mining
CNRS, LRI Bât. 490, Université Paris-Sud
91405 Orsay, firstname.lastname@example.org
WORKING PAPER : ENGLISH VERSION OF
"Apprentissage et Fouille de Données" accepted for publication
Deep differences explain why Data Mining has been enthusiastically accepted by Industry, while
Machine Learning and Exploratory Statistics still have problems being accepted by it. This paper
points at all the epistemological, scientific, and industrial differences between the two, and
explains why Data Mining is better accepted in Industry.
Many techniques for developing models out of data have been developed since the 1960s. This work
amounts to building automatic methods for performing inductive reasoning, but it is not always
acknowledged to be of this nature. Since Data Mining (DM) is the latest manifestation of this
attitude, we shall briefly recall the various domains which participated in this effort, in order to
obtain a definition of what DM might be.
Machine Learning was developed at the end of the 70s while Data Mining started at the
beginning of the 90s. In parallel, and since the 60s, several techniques that all belong to Statistical
Learning have been developed, together with their applications to Pattern Recognition. Statistical
Learning includes the latest improvements to regression, in particular regression trees (Breiman et
al., 1984), the domain called Data Analysis, the perceptron (Rosenblatt, 1958) and its extension,
neural networks (around 1985, see Le Cun et al., 1989), and Support Vector Machines (Vapnik,
1995). Independently, Bayesian statistics developed their inductive tools. The so-called “naive
Bayes” technique has been used since the beginning of the 60s (Maron, 1961). It postulates
conditional independence between the features knowing the class variable. Presently, techniques
for the automatic generation of Bayesian networks, including the structure of the network, have
I propose to call Automated Learning (AL) - a domain still in creation - that unifies ML and
Statistical Learning (including Data Analysis, Pattern Recognition , Neural Network and
Bayesian approaches). This leads us to propose the following definition:
From the point of view of its origins as well as from that of daily work, Data Mining is the
merging of Data Base and Automated Learning research.
This shows clearly that, in spite of the still strong influence of Carnap's ideas (1959; see footnote 1) in Science,
researchers in inductive reasoning have not tried to extend the capacities of models of uncertainty
(this work has been done by researchers specialized in the creation of deductive reasoning
models) but to improve our ways to deal with the phenomenon of model emergence from data.
Bayesian networks illustrate the ambiguousness of this phenomenon particularly well. Some
researchers (specialized in deduction) try to improve the capacities of Bayesian networks
inference, given a structure and the conditional probabilities. They improve capacities of
reasoning of the given model, and it happens that this model is capable of deductive and
abductive probabilistic reasoning. Inversely, some other researchers (specialized in induction)
develop methods for the construction from data of the tables of conditional probabilities and of
the structure of the Bayesian network. The last ones try to improve the adequacy between data
and the network, whereas the first try to improve the capacities of a given network.
Because of the relations that we have just underlined between DM and AL, it comes somewhat as
a surprise that Industry enthusiastically adopted the first, while it has always looked down on the
second. Even more surprisingly, the analysis presented in this paper will reveal that the deep
reason for DM's success seems to be that it did not hesitate to innovate relative to traditional data
processing, whereas AL preserves the majority of its epistemological choices.
In reality, and because of this industrial success, the relationships between DM and AL already
underlined tend to be camouflaged, and much of the software sold as DM is nothing but AL systems
with a friendly interface.
In this paper, without insisting more on the camouflage that we have just underlined, we will
point at the most important differences between DM and AL, those that can explain the “fashion
effect” of DM.
2. The components of DM
Before being able to describe the differences between the two domains, we have to recall what
these domains are made of. The review below presents the most widespread methods developed
by each component of DM. Each method will be described by its essential aspect, inputs and
outputs of each system, but without going into detail. In spite of our lack of exhaustiveness, the
wealth of methods - often unexpected to the non specialist – worked out in order to help humans
to build models from data is quite striking.
In order to explain the difference between DM and AL, it is also necessary to separate Supervised
Learning, in which the inductive steps are controlled by the expert before the inductive phase,
from Unsupervised Learning where the expert's opinion is taken into account after the model has
been built automatically.
2.1 Supervised and Unsupervised Learning
2.1.1. Supervised Learning
Footnote 1. Carnap states: « all inductive reasoning, ..., is reasoning in terms of probability; hence inductive logic, the theory of
inductive reasoning, is the same as probability logic ... ».
Supervised Learning essentially consists in the transformation of a description in extension of a
class into a description in intension of the same class.
Inputs are in a data table, one field of which is called the class, or the variable of interest. The
other fields are called features. All the fields can be continuous or discrete.
Outputs are uncertain theorems whose premise is a combination of the feature values, and the
conclusion is one of the class values.
This is an improvement on the obvious condition that the description in intension uses fewer
bits than the description in extension, and this is a means to rate the interest of a description in
The validity measure of such a procedure is most often a so-called "cross-validation," i.e., data
are repeatedly divided into a learning set and a test set, upon which the precision is measured.
In general, 9/10 of the initial set are used for learning, and 1/10 for the test. This procedure is
repeated 10 times and the precision is the average of the precision obtained for each test. The
value of 10 does not have a theoretical justification, but seems to give quite satisfactory results.
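The cross-validation procedure just described can be sketched as follows. This is only a minimal illustration: the function names (cross_validate, train_majority, predict_majority) and the toy majority-class learner are assumptions made for the example, not part of any system discussed in this paper.

```python
import random

def cross_validate(examples, train, predict, n_folds=10, seed=0):
    """Average precision over n_folds learning/test splits.

    examples: list of (features, label) pairs
    train:    function(list of examples) -> model   (hypothetical learner)
    predict:  function(model, features) -> label
    """
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    # Fold i holds every n_folds-th example, starting at position i.
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    precisions = []
    for i, test_set in enumerate(folds):
        if not test_set:
            continue
        # The other folds (9/10 of the data when n_folds = 10) form the learning set.
        learning_set = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train(learning_set)
        hits = sum(1 for x, y in test_set if predict(model, x) == y)
        precisions.append(hits / len(test_set))
    return sum(precisions) / len(precisions)


# Toy learner (an assumption): always predict the majority class of the learning set.
def train_majority(learning_set):
    labels = [y for _, y in learning_set]
    return max(set(labels), key=labels.count)

def predict_majority(model, features):
    return model

data = [((i,), "a") for i in range(80)] + [((i,), "b") for i in range(20)]
print(cross_validate(data, train_majority, predict_majority))
```

Any learner exposing a train and a predict function could be plugged in instead of the toy one.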
2.1.2. Unsupervised Learning
Unsupervised Learning tries to extract structures (or patterns) existing within the data, without
knowing beforehand what is a ‘good’ structure.
Let us first speak of Data Analysis, which developed fundamentally deductive methods that are
often used in an inductive way, such as correspondence analysis and principal component analysis.
Generally, when a matrix is diagonalized and the ‘strongest’ values are kept while the
‘weakest’ ones are ignored, then an inductive use of a deductive method is performed. In fact, the
induction is made by the human who decides what is strong and what is weak. In principle, it
would be enough to add an optimization operation for the system to become perfectly inductive.
It happens that this addition is far from easy, which is why so many directly inductive
methods have been developed.
Depending on the kind of structure looked-for, Unsupervised Learning takes a particular name.
When the system must build classes clustering the individuals judged as the most similar relative
to a certain criterion, then it is most often called clustering, or classification in Data Analysis,
categorization in Cognitive Sciences, and segmentation in industry.
When the system looks for logical relations confirmed by data, that is to say theorems, in general
uncertain ones, then it is referred to as the detection of associations, or relations or patterns within
When a valid functional relation among the variables is looked for, Statistics talks about logistic
regression and ML about ‘scientific discovery’. The main scientific laws, such as the ideal gas
law, PV = nRT, express a functional relation holding among data.
When the searched relations are relative to the spatial or temporal organization, one speaks of the
discovery of spatial sequences (typically: in Genomics) or of temporal sequences (typically: in
the analysis of the stock market).
It is necessary to realize that Unsupervised Learning does not start with enough information to
steer the induction steps towards a solution more or less expected by the user. Therefore, its
results are extremely difficult to validate. In fact, these results are of three kinds: they can be
trivial (for example, the systematic discovery of tautologies) when they are relative to a very
large population; false (due to noise, for example) when they are relative to a tiny population; or
very interesting since they are unexpected.
The results of Unsupervised Learning can always be validated a posteriori, in two ways.
The first consists in a direct confirmation by a domain expert. Like any other scientific
discovery, validation takes place when the discovery raises interest among the experts and brings
progress to a domain. For example, a system of automatic association detection can be coupled to
an Expert System, and Unsupervised Learning is a success when the induced rules, once
introduced into it, improve the Expert System.
The second is a kind of cross validation obtained by several independent optimizations. Its
principle is as follows: use an optimization method, as in 2.2.3 below, to perform the
Unsupervised Learning, then test the obtained results with a supervised method using another
optimization criterion. For instance, our team is developing a system that detects associations,
while using a measure called Non-contradiction, combining the confidence in the validity of the
implication A ⇒ B, P(B | A), and the confidence in the non-validity of this implication, P(¬B | A).
Each rule thus found defines two classes in the examples, those that follow the rule and those
that do not. This creates a clustering relative to the set of found rules, provided we accept that
these classes cover the examples without partitioning them. These classes are considered, in turn,
as data by a supervised induction method, here a decision tree, optimized according to a measure
of entropy. The two types of rules thus created should be equivalent, and they are compared. Our
experience is that rules found by the two methods are very different, but their comparison is
easier to rate for an expert than the rules obtained in an unsupervised way.
2.2 The induction criteria
Almost all programs achieving an inductive step perform an optimized search in a space of
hypotheses. Inductive Learning consists in the following 4 steps.
2.2.1. Definition of the hypothesis space
A few starting examples are chosen, and a space of hypotheses is generated as a subset of the set
of all possible generalizations of these examples. To learn grammatical tagging, i.e., labeling
each word of a text by its grammatical category, for example, one particular tagging is observed,
usually in a very large set of tagged sentences, and all the generalizations describing the context
of this tagging are generated. The system then learns the generalizations that are the most
confirmed in the tagged text. In general, this very important step is described very briefly or even
poorly by the authors: they mix together domain knowledge, arbitrary choices of
knowledge representation, and hidden heuristics. A precise definition of the hypothesis space is
indeed difficult to give. For example, Brill's tagger (1994) learns tags in context,
and it describes the context of a word by the labels or the words preceding or (exclusive or)
following it within a distance of three words. This defines a limited space of possible
generalizations, and the exclusion of generalizations that include words or labels both preceding and
following the one to be tagged does not correspond to a theory about the environment of a word.
Its role is simply to limit the size of the search space. One contribution of DM is to have
insisted on the importance of this step, which must be explicit so that one can understand that the
results can only be a combination of allowed generalizations. For example, association detection
in DM is done under conditions of coverage and precision that have a questionable interest, but
that are perfectly explicit.
2.2.2. Choice of a search strategy within the hypothesis space
The most common strategies are the so-called greedy and exhaustive strategies.
Brill's tagger, cited above, uses an exhaustive search in a very limited hypothesis space.
DM, in association detection, also uses an exhaustive search.
In the greedy strategy, the possible choices are ranked according to a measure of
optimization (see 2.2.3 below) and the path passing through the first point optimizing this
measure is chosen.
The ML approach put a lot of effort into studying these search strategies, whereas other
approaches tend to adopt the exhaustive search.
Random choices, a variant of the exhaustive strategy, seem very efficient.
Genetic Algorithms are also acknowledged as one of the most efficient search strategies.
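The contrast between the exhaustive and the greedy strategies can be sketched on a toy hypothesis space. Everything below is an assumption made for illustration: hypotheses are conjunctions of (feature, value) conditions, and the score (covered positive examples minus covered negative examples) is an arbitrary optimization measure, not one taken from a cited system.

```python
from itertools import combinations

def exhaustive_search(conditions, score, max_size):
    """Score every conjunction of up to max_size conditions and return the best."""
    candidates = (frozenset(c)
                  for k in range(1, max_size + 1)
                  for c in combinations(conditions, k))
    return max(candidates, key=score)

def greedy_search(conditions, score, max_size):
    """Grow a single conjunction, adding at each step the best-ranked condition."""
    current = frozenset()
    best_score = float("-inf")
    for _ in range(max_size):
        options = [current | {c} for c in conditions if c not in current]
        if not options:
            break
        best = max(options, key=score)
        if score(best) <= best_score:
            break  # no improvement: the greedy path stops here
        current, best_score = best, score(best)
    return current


# Toy data (hypothetical): the class is True exactly when a = 1 and b = 1.
examples = [
    ({"a": 1, "b": 1, "c": 1}, True),
    ({"a": 1, "b": 1, "c": 0}, True),
    ({"a": 1, "b": 0, "c": 1}, False),
    ({"a": 0, "b": 1, "c": 1}, False),
    ({"a": 0, "b": 0, "c": 0}, False),
]
conditions = [("a", 1), ("b", 1), ("c", 1)]

def score(conjunction):
    covered = [label for x, label in examples
               if all(x[f] == v for f, v in conjunction)]
    return covered.count(True) - covered.count(False)

print(sorted(exhaustive_search(conditions, score, 3)))
print(sorted(greedy_search(conditions, score, 3)))
```

The exhaustive search examines every conjunction, so its result can never score lower than the greedy one. The greedy search examines far fewer hypotheses but may stop on a locally best path.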
2.2.3. Choice of an optimization criterion
The number of optimization criteria is impressive; consequently, we will discuss them in detail.
The two most often used criteria are precision (number of successes / total number of tries) and
recall (number of successes / number of objects to be recognized). In Supervised Learning the
number of successes is given by the number of cases where the class is correctly recognized by
the hypothesis under test. In Unsupervised Learning, precision is an evaluation given by the
expert examining a subset of the obtained results. Usually, the expert examines only a subset
because, if the automatic method is efficient, then it deals with too much data and too many
results to be entirely checked by a human. Note then that in Unsupervised Learning the number
of objects to recognize is unknown. Therefore, recall is not computable, unless the expert makes
an exhaustive analysis, and we have just underlined that this is unrealistic on real problems.
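As a minimal sketch of the two definitions above in the supervised case, the following code computes precision and recall for one target class. Reading "tries" as the number of times the hypothesis recognizes the target class is an interpretation made for this example, and the function names are hypothetical.

```python
def precision(successes, tries):
    """number of successes / total number of tries"""
    return successes / tries if tries else 0.0

def recall(successes, to_recognize):
    """number of successes / number of objects to be recognized"""
    return successes / to_recognize if to_recognize else 0.0

def evaluate(predicted, actual, target):
    # A "try" is counted each time the hypothesis recognizes the target class.
    tries = sum(1 for p in predicted if p == target)
    to_recognize = sum(1 for a in actual if a == target)
    successes = sum(1 for p, a in zip(predicted, actual) if p == a == target)
    return precision(successes, tries), recall(successes, to_recognize)

predicted = ["x", "x", "y", "x"]
actual = ["x", "y", "y", "y"]
print(evaluate(predicted, actual, "x"))
```

In the unsupervised case the second argument of recall is unknown, which is why only precision, estimated by the expert on a sample, remains computable.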
It is also “well known” (at least in the ML community) that the precision measure expresses a
particular hypothesis on the nature of data and that the Laplace estimator, (number of successes + 1) /
(total number of tests + number of classes to be recognized), is to be preferred (see
http://www.lri.fr/~yk/ for explanations relative to this phenomenon). It is also interesting to
consider the number of times where the class is falsely recognized (the so-called “false positive”
recognition), that is to say that the hypothesis under test is mistaken while recognizing a class.
The ROC curves (Receiver (or Relative) Operating Characteristic), used especially in DM,
represent the variation of the correctly recognized against the falsely recognized classes. The
precision is also used in DM by drawing lift charts giving the variation of the precision
according to the number of examples examined. A good lift chart rises very fast, which is very
important in the unsupervised approach since a high precision is reached with a small number of
examples validated by the expert, and this is worth consideration since expert work is always
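The Laplace estimator mentioned above is a one-line computation. The sketch below only restates the formula given in the text; the function name is hypothetical.

```python
def laplace_estimate(successes, tests, n_classes):
    """(number of successes + 1) / (total number of tests + number of classes)"""
    return (successes + 1) / (tests + n_classes)

# With 3 successes out of 3 tests and 2 classes, the raw precision is 3/3,
# whereas the Laplace estimate is pulled towards 1/n_classes.
print(3 / 3, laplace_estimate(3, 3, 2))
# With no test at all, the estimate falls back to 1/n_classes.
print(laplace_estimate(0, 0, 2))
```

The correction matters mostly for hypotheses tested on very few examples, where raw precision is easily equal to 0 or 1 by chance.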
Another classical criterion is entropy variation, systematically used by decision trees, or the
quadratic entropy (also called Gini index) used in ML and in Data Analysis.
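The entropy and Gini criteria can be sketched as follows, for a list of class labels. This uses the usual Shannon entropy in bits and the usual form of the Gini index, 1 minus the sum of squared class frequencies; the function names are assumptions of this example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of the class distribution in labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Quadratic entropy (Gini index) of the class distribution in labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy_variation(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its children."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

parent = ["a", "a", "b", "b"]
print(entropy(parent), gini(parent))
print(entropy_variation(parent, [["a", "a"], ["b", "b"]]))
```

A decision tree learner would evaluate entropy_variation for each candidate split and keep the split with the largest value.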
When a numerical distance between objects is computable, and this is a usual hypothesis in Data
Analysis and in regression techniques, multiple transformations of the data representation are
then possible, based, in general, on a minimization of the squares of a distance.
The statistical approach very often uses the hypothesis that a distribution of small variance is
better than another of large variance, and it uses the minimization of the variance as a criterion
for optimization, for instance in the case of regression trees.
The Bayesian approach uses the fact that data tables give the probability of the data knowing that
the studied phenomenon, Ph, took place, P(D | Ph). In Supervised Learning Ph is, for example, to
belong to a class, and in the unsupervised case, it can be the validity of a pattern. One can
therefore deduce the probability of the phenomenon given the data by computing P(Ph | D) =
P(D | Ph) * P(Ph) / P(D). This process uses the least possible induction: it simply induces that the
values of P(Ph | D) computed from observed data stay valid for new data.
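The Bayesian computation above can be sketched numerically. The sketch assumes that the candidate phenomena are mutually exclusive and exhaustive, so that P(D) can be obtained by summing P(D | Ph) * P(Ph) over them; the numbers and the labels "class_1" and "class_2" are invented for the example.

```python
def posterior(likelihood, prior):
    """Return P(Ph | D) for each phenomenon Ph.

    likelihood[ph] = P(D | Ph), prior[ph] = P(Ph).
    Assumption: the phenomena are mutually exclusive and exhaustive.
    """
    evidence = sum(likelihood[ph] * prior[ph] for ph in prior)  # P(D)
    return {ph: likelihood[ph] * prior[ph] / evidence for ph in prior}

prior = {"class_1": 0.2, "class_2": 0.8}        # P(Ph), hypothetical values
likelihood = {"class_1": 0.6, "class_2": 0.1}   # P(D | Ph), hypothetical values
print(posterior(likelihood, prior))
```

Although class_1 is a priori four times less probable than class_2, the observed data make it the more probable of the two.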
Finally, it is also classical to use the principle of Minimum Description Length (MDL). For this,
both the length of the description of the model, and the length of the description of the examples
it fails to classify correctly are encoded (in the supervised case). According to this principle, the
minimum of the sum of these two values is an optimum. The software C4.5 transforms the trees
into rules according to this principle. The present methods of Bayesian networks construction
from data also use the MDL principle systematically. In this unsupervised case, the encoding
includes the network and the examples, given this network.
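In the supervised case, the MDL principle can be sketched as a comparison of two-part description lengths. The encoding of exceptions used here (the index of the example plus its true class) is an assumption chosen for simplicity; it is not the encoding used by C4.5 or by the Bayesian network learners cited in the text, and the candidate models and their sizes are invented.

```python
from math import log2

def description_length(model_bits, n_errors, n_examples, n_classes):
    """Bits needed for the model plus bits needed for the examples it misclassifies."""
    # Assumed encoding: each exception costs its index plus its true class.
    exception_bits = n_errors * (log2(n_examples) + log2(n_classes))
    return model_bits + exception_bits

def mdl_choice(candidates, n_examples, n_classes):
    """candidates: dict name -> (model_bits, n_errors). Returns the name of the optimum."""
    return min(
        candidates,
        key=lambda name: description_length(*candidates[name], n_examples, n_classes),
    )

# Hypothetical candidate models for 1024 examples and 2 classes.
candidates = {
    "stump": (20, 100),       # very short model, many exceptions
    "medium tree": (200, 10),
    "full tree": (2000, 0),   # no exception, but a very long model
}
print(mdl_choice(candidates, n_examples=1024, n_classes=2))
```

Neither the shortest model nor the most precise one is selected: the optimum is the model minimizing the sum of both lengths.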
DM also introduced other measures of optimization, more linked to applications, such as
the optimization of operation cost, or the return on investment, etc.
AL was in general satisfied with validations associated with the chosen optimization criteria. As the
criterion of precision is the most often used, validation reduces to showing that the most precise
hypotheses have been used, as in the cross-validation described above.
DM insisted on the importance of a further phase of validation. Either the induced results are
directly examined by an expert who confirms their validity (the comprehensibility of results is
then a primary condition), or, this is the best validation, the results of the induction are used for a
task whose efficiency is measurable. Validation then takes place when efficiency is increased
with the introduction of the induced knowledge.
2.3 Data Mining
One can consider that the date of birth of DM is 1989 when Gregory Piatetsky-Shapiro organized
the first workshop on "Knowledge Discovery in Data Bases". However, the first spectacular
demonstration dates from 1995 when he organized the first KDD conference in Montreal.
Among multiple applications and original points of view, DM gave birth to three main types of
methods that are all included in the DM commercial systems.
The first is association detection, in particular the discovery of uncertain theorems confirmed by
the data, and the multiple measures of interest used to choose among all valid theorems. The DM
approach focuses on the problems raised by applications to very large data bases.
It happens that these methods can be very easily extended to the discovery of temporal series, which
became the second notable success of DM as a scientific discipline.
The defect of classical methods of association detection, that is to say their exhaustiveness
limited only by the cover (if, for example, A & B ⇒ C, then the cover of this implication is the
probability that A, B and C are true together), becomes an advantage when discovering temporal
series, since these can be considered as valid only when the series is repeated often enough.
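Exhaustive association detection limited by the cover can be sketched as follows. This is a naive enumeration written for clarity, not the algorithm of any commercial DM system; the thresholds min_cover and min_confidence and the function names are assumptions of the example. The confidence of A & B ⇒ C is taken as P(C | A & B), in line with the definition of P(B | A) given earlier.

```python
from itertools import combinations

def cover(itemset, transactions):
    """Fraction of transactions in which all items of itemset are true together."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def find_rules(transactions, min_cover, min_confidence):
    """Enumerate every rule premise => conclusion that passes both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for size in range(2, len(items) + 1):
        for itemset in combinations(items, size):
            whole = frozenset(itemset)
            c = cover(whole, transactions)
            if c == 0 or c < min_cover:
                continue  # the cover is the only limit put on exhaustiveness
            for conclusion in itemset:
                premise = whole - {conclusion}
                confidence = c / cover(premise, transactions)
                if confidence >= min_confidence:
                    rules.append((tuple(sorted(premise)), conclusion, c, confidence))
    return rules

transactions = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B"}, {"C"}]
for premise, conclusion, c, conf in find_rules(transactions, 0.5, 0.6):
    print(" & ".join(premise), "=>", conclusion, "cover", c, "confidence", round(conf, 3))
```

Real systems avoid enumerating every itemset by pruning all supersets of an itemset whose cover is already too small.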
Finally, and under industrial influence, DM developed multiple methods for data cleaning and
2.4 Machine Learning (ML)
The first programs developed that learn rules from data are due to Michalski (Michalski and
Chilausky, 1980) and this dates the beginning of ML, although the first ML workshop (that
became the International Machine Learning Conference) took place in 1982.
The work of Dietterich and Michalski (1982) witnesses that learning structures was a very early
concern for ML.
This research field accomplished most of its work in Supervised Learning and generated a
large number of systems, some of which are included in industrial software.
In particular, decision trees ask for an input of discrete or continuous features (that will be made
optimally discrete in regard to the classes) and of imperatively discrete classes. They produce
classification trees that are a description in intension of the classes. The most famous of these
systems, C4.5 (sold now as C5 or See5) generates rules built from the decision tree.
Other systems, such as AQ and CN2, generate classification rules directly from the data, generally
discrete or previously discretized ones.
One of the basic procedures of Learning is generalization. The space of possible generalizations has
been called the version space, and many methods propose their own way to move in the version
space.
Inductive Logic Programming (ILP) is precisely one way to move in a relational version space.
All other methods suppose implicitly that a feature is in relation with only one record, i.e.,
features are postulated to be unary. Otherwise stated, the i-th feature takes a value for the j-th
field. In ILP, a feature can be n-ary, i.e., it describes a relation between n objects. For example,
one can describe the properties of objects A and B with features taking unary values (such as : A
is red, B is blue), or a binary feature, such as the distance between A and B (for instance,
distance(A, B) = 27). From inputs of this type, ILP will learn some general laws about the
distance, for example that there are no objects more than 50 units away from B: [For all x,
distance(x, B) < 50]. However, the space of possible hypotheses becomes huge, and
checking the validity of the hypothesized relations is NP-complete. It follows that the
descriptive power of ILP is balanced by the complexity of the computation necessary to verify
the hypotheses allowing the program to build a model explaining the data.
This is why the domain now seems to be moving towards the so-called propositionalization methods,
in which n-ary descriptions are trivially replaced by unary relations: one creates, in principle, as
many descriptions as there are possible variable matchings. The combinatorial explosion in time is
replaced by a combinatorial explosion in space. The gain comes from the fact that only a few (that
is, thousands of them) “carefully chosen” descriptions are preserved. The heuristics defining the
way to choose the descriptions to keep (including the trivial heuristic of random choices)
constitute the main topic of research for this new approach.
In clustering, the main contribution of ML is the Unsupervised Learning COBWEB system.
COBWEB uses yet another criterion of optimization, called utility. The utility of a class C
containing the feature A taking v possible values is computed by the product of the probabilities
P(A = v) P(A = v | C) P(C | A = v). P(A = v) is the probability that feature A takes value v; P(A =
v | C) is the probability that the feature A takes the value v in class C; P(C | A = v) is the
probability of meeting the concept C when A = v. Of course, Bayes law rewrites this expression,
to be able to compute the sum of utility gains brought by each class, so that the formula giving
the utility of a clustering is:
$$U = \frac{1}{n} \sum_{C} P(C) \sum_{A} \sum_{v} \left[ P(A = v \mid C)^2 - P(A = v)^2 \right]$$
where n is the number of classes, the first sum is over all classes, and the two following ones
are over all features and all their values. U is computed for every possible configuration, which
would be impossible if one did not compute the utility gain incrementally. COBWEB is therefore
very slow, but incremental so that it is very well adapted to problems asking for a regular
updating. Besides, the sums are replaced by integrals when dealing with continuous values, so
that COBWEB adapts well to mixed, continuous and discrete, data.
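The utility of a clustering, as given by the formula above, can be computed directly for discrete features. The sketch below is not COBWEB itself: it evaluates U for a given clustering in one pass, without the incremental computation and without the extension to continuous values, and the toy data are invented.

```python
from collections import Counter

def category_utility(clusters):
    """Utility U of a clustering.

    clusters: list of non-empty classes, each a list of examples;
    an example is a dict mapping every feature to a discrete value.
    """
    everything = [ex for cluster in clusters for ex in cluster]
    total = len(everything)
    features = list(everything[0].keys())

    def sum_sq(examples):
        # Sum, over features A and values v, of P(A = v)^2 estimated on examples.
        s = 0.0
        for a in features:
            counts = Counter(ex[a] for ex in examples)
            s += sum((c / len(examples)) ** 2 for c in counts.values())
        return s

    baseline = sum_sq(everything)  # the P(A = v)^2 terms
    return sum(len(c) / total * (sum_sq(c) - baseline) for c in clusters) / len(clusters)

red = {"color": "red", "shape": "round"}
blue = {"color": "blue", "shape": "square"}
print(category_utility([[red, red], [blue, blue]]))   # classes match the features
print(category_utility([[red, blue], [red, blue]]))   # classes carry no information
```

A clustering whose classes make feature values more predictable than they are in the whole population gets a positive utility; a clustering that tells nothing about the features gets zero.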
In spite of all these qualities, COBWEB, still written in LISP, is not part of a commercial
2.5 Pattern Recognition
Before ML started, researchers in Pattern Recognition developed learning programs of which the
most used is a linear separator, called the perceptron (Rosenblatt, 1958). One can prove that, when
two sets of examples indexed by 0 and 1 are linearly separable, a perceptron is able to separate them
in a finite number k of steps of calculation, where k is bounded by Novikoff’s important theorem,
k ≤ (R/γ)^2. R is the radius of the data (that is to say the radius of the volume they fill in the
space of n features), and γ is the maximum, over all separating hyperplanes, of the minimum, over all
examples, of the distance between the example and the separating hyperplane, what is now called the
functional margin of the separating hyperplane.
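The perceptron learning rule can be sketched as follows. This is the textbook mistake-driven update with a bias term, written as an illustration; the function name, the max_epochs safeguard, and the toy data (the logical AND of two inputs) are assumptions of the example.

```python
def train_perceptron(examples, max_epochs=1000):
    """Learn a separating hyperplane w.x + b = 0.

    examples: list of (x, y) with x a tuple of numbers and y in {0, 1}.
    Returns (w, b, updates), where updates counts the correction steps.
    """
    n = len(examples[0][0])
    w = [0.0] * n
    b = 0.0
    updates = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in examples:
            target = 1 if y == 1 else -1
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if target * activation <= 0:
                # Misclassified example: move the hyperplane towards it.
                w = [wi + target * xi for wi, xi in zip(w, x)]
                b += target
                updates += 1
                mistakes += 1
        if mistakes == 0:
            return w, b, updates
    raise RuntimeError("no separating hyperplane found within max_epochs")

and_examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b, updates = train_perceptron(and_examples)
print(w, b, updates)
```

For linearly separable data the loop terminates, as guaranteed by the bound on the number of corrections; for non-separable data it would cycle forever, hence the max_epochs safeguard.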
Neural Networks (NN) were born from the need to get out of linear separators, but their success
is rather due to the fact that they deal with inputs and outputs, possibly multiple outputs, that can
be continuous, discrete, or mixed. These two properties (mixed variables and multiple outputs),
inherent to the way a NN is built, correspond to a real industrial need. The NN are now used in
the settings of Vapnik's theory, that is to say as support vector machines (SVM) in order to be
able to compute their generalization capacity.
NN led to an unsupervised version, self-organizing maps (Kohonen, 1990). Kohonen's maps
implement a particular kind of NN, the so-called competitive NN. The success of an output
neuron (belonging to what is in this context called the competition layer) in recognizing an input
reinforces the neuron winning the competition and inhibits the other neurons, so that the winner
for an example has a tendency to specialize in the recognition of this example.
A self-organizing map is a NN whose outputs are equal in number to the number of classes. Two
examples belong to the same class if they activate the same output. Outputs are represented as
arranged in a plane as the nodes of a grid, which is where the name “maps” comes from.
2.6 Exploratory Statistics
Statistics are extensively taught in University curricula, but their exploratory aspect is
much less taught; we will therefore give a few details on this aspect. The hypothesis underlying
all inductive statistics, in spite of the diversity of the proposed methods, is that the smaller the
variance, the better the model.
For example, the k-means method minimizes intra-class variance, so that if N is the
number of objects to classify, x_i the coordinates of the i-th object, and μ_m the coordinates of the
center of gravity of the m-th class, then the quantity to minimize is
$$\frac{1}{N} \sum_{m} \sum_{i \in \text{class } m} (x_i - \mu_m)^2$$
The differences between k-means and other approaches come from the various methods of
choosing the seed (i.e., astutely choosing the first μ_m), and the subsequent technique of
allocation-recentering (i.e., astutely computing the next μ_m).
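The allocation-recentering loop of k-means can be sketched as follows. The seeds are simply passed in by the caller, which sidesteps the question of how to choose them astutely; function names and data are assumptions of the example.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two coordinate tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, seeds, n_iter=100):
    """Alternate allocation and recentering until the classes no longer change."""
    centers = [tuple(s) for s in seeds]
    assignment = None
    for _ in range(n_iter):
        # Allocation: each object goes to the class of the nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda m: sq_dist(p, centers[m]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Recentering: each center moves to the center of gravity of its class.
        for m in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == m]
            if members:
                centers[m] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment

def intra_class_variance(points, centers, assignment):
    """The quantity minimized by k-means: mean squared distance to the class center."""
    return sum(sq_dist(p, centers[a]) for p, a in zip(points, assignment)) / len(points)

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, assignment = kmeans(points, seeds=[(0, 0), (10, 10)])
print(centers, assignment, intra_class_variance(points, centers, assignment))
```

Each of the two steps can only decrease the intra-class variance, which is why the loop converges, though possibly to a local minimum that depends on the seeds.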
Regression, be it logistic or not, looks for a solution that minimizes the variance of a
distance, most usually given by the sum of the squares of distances of the objects to the solution.
When the model to be discovered is not given in advance, then the way logistic regression
discovers the model is truly inductive work.
Regression Trees (Breiman et al., 1984) use exactly the same technique, except that
they first divide the space of solutions into cells, each of them being a leaf of the
regression tree. Thus, the building of the regression tree itself brings no new concept to the
foreground. Conversely, the notion of an optimal path for pruning the tree built in this way,
introduced by Breiman, is indeed a new concept added to Exploratory Statistics.
Finally, support vector machines (SVM, Vapnik, 1995), in their simplest linear form, are
nothing but perceptrons that minimize the variance of the distances between the objects and the
separating hyperplane, which is usually described as the maximization of the margin. The notion of
kernel, which makes it possible to simulate non-linear separations, and the notion of Vapnik-Chervonenkis
dimension are, on the contrary, completely original. Through these aspects, SVM introduce exploratory
statistics of a completely new type.
2.7 Data Analysis
Data Analysis (DA) is taught very extensively in university courses; for this reason we will not
give any details on this approach.
The basic method of DA - clustering excepted - consists in studying the points in R^n formed by
the studied objects. After centering (i.e., expressing coordinates as distances to the mean) and
reducing (i.e., dividing by the standard deviation), a family of ellipsoids centered on the mean is
studied, and the one closest (i.e., most often, the one minimizing variance) to the largest number
of objects is deemed the best representation of the data. The axes of this best ellipsoid reflect the
main tendency of the data. As we pointed out above, induction takes place by choosing the
‘relevant’ number of axes of the ellipsoid.
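The centering and reducing steps described above amount to the following computation on each feature. The sketch uses the population standard deviation and assumes that no feature is constant; the function name is hypothetical.

```python
from statistics import mean, pstdev

def center_and_reduce(rows):
    """Center each column on its mean, then divide it by its standard deviation.

    rows: list of coordinate tuples, one per object (no constant column assumed).
    """
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) for c in columns]
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stds))
        for row in rows
    ]

rows = [(1, 10), (3, 30)]
print(center_and_reduce(rows))
```

After this transformation all features have mean 0 and standard deviation 1, so that the ellipsoid fitted afterwards does not depend on the units in which the features were measured.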
DA also developed methods of Unsupervised Learning of classes, which group the individuals that
are nearest in the sense of a numerical distance.
2.8 Bayesian statistics
The main effort of Bayesian statistics is relative to the development of deductive reasoning
methods taking into account the conditional independence of discrete variables. From the point of
view of induction, they developed two techniques.
The first one is a Supervised Learning method called "Naive Bayes", where all features are
assumed to be conditionally independent of one another once the class to be recognized is known. Learning, in this case,
is reduced to taking into account the probabilities of the observed event occurrence, but it is one
of the most efficient methods in precision, and it can deal with many features. Note however that
the generated model is absolutely incomprehensible.
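A minimal sketch of Naive Bayes for discrete features shows why learning reduces to counting occurrences. The use of the Laplace correction for the conditional probabilities is a choice made for this example, to avoid zero probabilities; the function names and the toy data are invented.

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (features_dict, class_label). Learning is pure counting."""
    class_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)  # (class, feature) -> counts of each value
    values_seen = defaultdict(set)       # feature -> set of observed values
    for features, label in examples:
        for f, v in features.items():
            value_counts[(label, f)][v] += 1
            values_seen[f].add(v)
    return class_counts, value_counts, values_seen

def classify(model, features):
    """Return the class maximizing P(class) * product of P(feature = v | class)."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, n in class_counts.items():
        score = n / total  # P(class)
        for f, v in features.items():
            # P(feature = v | class) with Laplace correction (an assumption of this sketch)
            score *= (value_counts[(label, f)][v] + 1) / (n + len(values_seen[f]))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

examples = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "no"}, "stay"),
]
model = train_naive_bayes(examples)
print(classify(model, {"outlook": "sunny", "windy": "no"}))
print(classify(model, {"outlook": "rainy", "windy": "yes"}))
```

The learned model is nothing but tables of counts, which illustrates both its efficiency and its lack of comprehensibility.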
The second one is unsupervised. Two different things can be learned. For a given network
structure, it is possible to learn the conditional probability tables from data. The
comprehensibility is then entirely due to the network. It is also possible to learn the structures
themselves. In this case, the automatic generation of large Bayesian networks (Heckerman et al.,
1995) constitutes an essential progress in the domain of inductive reasoning. The criterion of
optimization used is MDL, the principle of minimal description length. This approach induces
comprehensible structures from examples. However, recent work (Bendou and Munteanu, 2003)
devised experiments showing that a very small amount of noise, of the order of 1%, will change
many structures of the network. They also proved the generality of their experimental results by
using the properties of d-separation. Only the V-substructures, expressing a conditional
dependence of two nodes relative to a variation of knowledge about a third node or its
descendants (in other words: two variables are the common ‘cause’ of a third one), resist
noise well and might possibly be considered as explanatory by the domain expert. The other structures
are not stable, and are established for the sole purpose of optimizing the network behavior in
This approach has also produced a classification method, AUTOCLASS, that builds classes using
an exhaustive search of classes conditionally optimal relative to data, or at least in principle an
exhaustive one. To our knowledge, this approach has not led to any industrial software, even
though an American company tried to sell it.
3. DM/AL differences from the point of view of epistemology
These differences are summarized below in table 1, showing that DM and AL, though both
automate the generation of a model from data, differ otherwise in many epistemological choices.
Classic data processing | Automatic Learning (ML and Statistics) | DM
Differences in the scientific
Simulates a deductive reasoning (= applies an existing model) | Simulates an inductive reasoning (= invents a model) | Simulates an inductive reasoning ("even more inductive")
validation according to precision | validation according to precision | validation according to utility and comprehensibility
Results as universal as possible | Results as universal as | Results relative to particular
elegance = conciseness | elegance = conciseness | elegance = adequacy to the
Position relative to Artificial Intelligence
Tends to reject AI | Either tends to reject AI (Statistics) or claims belonging to AI (ML) | Naturally integrates AI, DB, Stat., and MMI.
Table 1: Differences of epistemological nature among Computer Science, AL and DM.
Classic Computer Science applies existing models, and as we already pointed out, these models
can be of a probabilistic nature. In the same way, methods of fuzzy inference propose a model,
the fuzzy model, and study how this model can be applied to real data. Some approaches induce
fuzzy models, such as fuzzy decision trees (see http://www.lri.fr/~yk/ for a particularly
simple presentation of fuzzy decision trees) or fuzzy rules. This addition of fuzziness makes the
induction more complex (and requires fuzzy data) but does not modify its nature. In the same
way, Rough Sets are a knowledge representation and propose a model. They are therefore by
nature deductive. This does not prevent them from introducing induction within their
representation, but the induction methods then introduced are the same as those of the other
AL obviously works on the automatic generation of models, but while the majority of the systems
stemming from AL perform Supervised Learning, the majority of systems stemming from DM
perform Unsupervised Learning. This is why table 1, above, states that DM is “even more
inductive” than AL.
In classical Computer Science, because of the weight of the deductive approach, a result is
definitely validated after having been integrated into a formal model, so that it seems deducible
from this model. Here, we will not deal with this final phase, but only with the initial phase
during which the first experimental results are obtained. AL, as well as classical data processing,
uses a criterion of precision to choose the most meaningful experimental results. In fact, symbolic
Learning has sometimes also introduced criteria of comprehensibility. For example, the decision
tree induction software C4.5 introduces a final procedure during which decision trees are
transformed into rules, often more comprehensible than trees. Besides, the creation of short trees
or short rules is preferred, even at the cost of a small loss of precision, in order to favor
comprehensibility. The concern for comprehensibility therefore did not appear ex nihilo in DM. It
is necessary to admit, however, that most research efforts, even in the field of symbolic ML, have
been judged on criteria of precision rather than on comprehensibility.
DM considers that precision is only one of the possible criteria, and substitutes the concept of
utility for it. Utility is obviously not universal, and DM therefore introduces at this point a
definition of validation that depends on each problem, which is both very new and very
interesting for each application. Some criteria of utility, such as the patient’s pain in Medicine,
depend closely on the application and are completely incompatible with precision. DM therefore
does not hesitate to introduce some social considerations in the criteria for algorithm
validation, which is classically considered a "scientific heresy." Comprehensibility is also a
criterion of a social nature. More precisely, it means expressing the induced model in the
language of the field concerned, using the concepts of its experts.
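The gap between the two validation criteria can be made concrete with a small computation. This is only a sketch under stated assumptions: the two models, their confusion counts and the utility values are hypothetical, and "precision" is taken here as the rate of correct decisions. The model that wins on precision is not the one that wins on utility.

```python
# Hypothetical confusion counts for two screening models on 1000 patients,
# 50 of whom are actually ill: (true_pos, false_neg, false_pos, true_neg).
models = {
    "A": (20, 30, 10, 940),
    "B": (45, 5, 100, 850),
}

# Hypothetical utilities agreed with the domain experts: missing an ill
# patient is far more costly than raising a false alarm.
UTILITY = {"tp": 10.0, "fn": -50.0, "fp": -1.0, "tn": 0.0}

def precision_rate(tp, fn, fp, tn):
    """Rate of correct decisions (the criterion called precision in the text)."""
    return (tp + tn) / (tp + fn + fp + tn)

def utility(tp, fn, fp, tn):
    """Total utility of the decisions under the expert-supplied values."""
    return (tp * UTILITY["tp"] + fn * UTILITY["fn"]
            + fp * UTILITY["fp"] + tn * UTILITY["tn"])

best_by_precision = max(models, key=lambda m: precision_rate(*models[m]))
best_by_utility = max(models, key=lambda m: utility(*models[m]))
print(best_by_precision, best_by_utility)  # A B
```

With other utility values the ranking could of course be reversed, which is precisely why the criterion has to be fixed with the experts of each application.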
In fact, DM supposes that a society of experts exists, and that it shares the same concepts and
speaks the same language (which is quite sensible), and DM explicitly addresses one of these
societies in each of its applications. Validation happens within this society of experts and not in
an absolute sense.
In fact, choosing utility, as opposed to precision, as a criterion of optimization, is already an
example of choice of the particular versus the universal. AL is essentially about the general
methods of induction and their properties. Conversely, DM is essentially about the application of
induction methods to particular problems. For example, AL assumes that data are not spoiled by
unverifiable mistakes that prevent induction from taking place correctly, whereas DM considers
that each data set requires a particular cleaning treatment.
At the other end of the chain of knowledge acquisition, AL considers its work accomplished once
knowledge that satisfies a given criterion is acquired. DM considers that this knowledge must be
useful in the relevant specialty domain, and it must be validated by an improvement of the
existing methods of this specialty domain. In addition, it is quite characteristic that the DM
conferences, even the academic ones, systematically organize competitions among systems, and
that domain experts are called upon to judge the excellence of the results obtained. In a similar
way, in Text Mining, methods are adapted to a particular corpus, whereas the more classical
Linguistics approach analyzes the general laws of language.
This difference might appear trivial but it is fundamental. It is quite obvious that it is impossible
to rewrite all programs for each application. This is why DM develops tools allowing experts
themselves to develop their application, for every particular case. This requirement forces
user-friendliness in setting the program parameters, and that leads to methods adapting to different
Thus, by a kind of epistemological sleight of hand, DM, which is less interested in the general,
builds systems that have more potential applications (and in a sense they are therefore more
general) than AL.
3.4 Elegant conciseness
There have been many debates about the criterion called "Occam's razor", which prefers the
simplest solution. It remains the rule for most approaches (in the DM community, see a discussion
in Domingos, 1998). Of course, nothing really scientific justifies it, except the aesthetic
pleasure scientists take in using it. This systematic conciseness, when it results in a lack of
clarity in the exposition of the induced model, conflicts with the DM principle of comprehensibility.
3.5 Relations with Artificial Intelligence
It is relatively surprising to note that DM perfectly integrates, apparently without problem, AI
with approaches that traditionally rejected AI. Of the two AL components, the symbolic
component declared that it belongs to AI, whereas Statistics, and even Pattern Recognition,
distanced themselves from AI. It is possible that the academic quarrels for or against AI do not
really concern the industrial world, and that this integration of AI in DM is not a reason for
industrial acceptance, but a consequence of an industrial concern.
4. An industry view of the differences DM/AL
4.1 The twelve tips for successful Data Mining, according to the Oracle Data Mining
These tips can still be found on the web, in .pdf form, at:
We use these tips as interesting evidence of what an industrialist might ask from a DM method. We
shall see that under their humorous formulation, very interesting truths are hidden.
4.1.1 - Mine significantly more data.
AL has a tendency to look deeply into small databases, whereas DM concentrates its efforts
on very large ones.
4.1.2 - Create new variables to tease more information out of your data
AL, and especially ML, developed methods called “constructive induction” and “feature selection”
(Liu and Motoda, 1998), that is to say, ways to create or eliminate features. However, this effort
focused essentially on justifying the modifications made to the features, while DM is
content with a posteriori justifications, observed through the improvement of the obtained
model, rather than carrying out transformations justified beforehand.
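An a posteriori justification of a new variable can be sketched as follows. The data, the derived variable (a simple difference) and the one-threshold model are all hypothetical choices made for illustration: the new variable is kept simply because the model built on it is observed to be better than the models built on the raw features.

```python
# Hypothetical records (x1, x2, label); the label happens to depend on x1 - x2.
data = [(1, 2, 0), (2, 3, 0), (3, 5, 0), (5, 6, 0),
        (2, 1, 1), (4, 3, 1), (6, 4, 1), (3, 2, 1)]
labels = [y for _, _, y in data]

def best_stump_accuracy(values, labels):
    """Best accuracy of a rule 'predict 1 iff value > t' or of its negation."""
    best = 0.0
    for t in sorted(set(values)):
        hits = sum((v > t) == bool(y) for v, y in zip(values, labels))
        best = max(best, hits / len(labels), 1 - hits / len(labels))
    return best

acc_x1 = best_stump_accuracy([x1 for x1, _, _ in data], labels)
acc_x2 = best_stump_accuracy([x2 for _, x2, _ in data], labels)

# New variable created from the existing ones, justified only by the result.
acc_diff = best_stump_accuracy([x1 - x2 for x1, x2, _ in data], labels)

print(acc_x1, acc_x2, acc_diff)  # 0.625 0.75 1.0
```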
4.1.3 - Take a shallow dive into the data first
A superficial approach is never advisable in an academic context. However many crude mistakes
are avoided by a superficial examination.
4.1.4 - Rapidly build many exploratory predictive models
AL tries to build the one ‘best’ explanatory model, whereas DM does not hesitate to produce
several explanatory models. Even in the case of new techniques (actually born after DM started
to exist) such as boosting and bagging, the main effort consists in devising a kind of voting
procedure providing one best result, usually the most precise one. The DM approach would be to
keep the different models generated and help the domain expert to choose among them, or to
combine them in an optimal way.
4.1.5 - Cluster your customers first, and then build multiple targeted predictive models.
As we saw already, in AL, supervised approaches are distinctly dominant, while the unsupervised
ones lead DM. We also saw that one of the goals of Unsupervised Learning is clustering,
therefore a segmentation of records of the DB. Once this segmentation is done, methods of rule
generation, for example, can be applied to each segment.
This advice may appear innocent and somewhat superficial. Yet, it is very important. When
pattern detection methods are applied to the entire database, general laws, valid for all
individuals, are sought, and this often leads to detecting only trivial laws, valid for all the
records. Conversely, a prior segmentation allows us to detect patterns valid on some
subpopulations. If these subpopulations are meaningful, that is to say if the segmentation makes
sense, then the laws thus found have a good chance of being interesting, either unknown or merely
suspected by the expert.
We see that this advice is an illustration of the difference about universality, commented above in
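The effect of a prior segmentation on pattern detection can be sketched on a toy example. The customer records, the segment names and the rule "buys_a -> buys_b" are hypothetical, and the segment label is assumed to come from a clustering step that is not shown. The rule is weak on the whole database but strong on one subpopulation.

```python
from collections import defaultdict

# Hypothetical customer records: (segment, buys_a, buys_b). The segment label
# is assumed to come from a prior clustering step, not shown here.
records = (
    [("young", True, True)] * 18 + [("young", True, False)] * 2
    + [("young", False, False)] * 10
    + [("senior", True, False)] * 25 + [("senior", True, True)] * 5
    + [("senior", False, True)] * 10
)

def confidence(rows):
    """Confidence of the rule 'buys_a -> buys_b' on the given rows."""
    with_a = [r for r in rows if r[1]]
    return sum(r[2] for r in with_a) / len(with_a)

by_segment = defaultdict(list)
for row in records:
    by_segment[row[0]].append(row)

overall = confidence(records)
per_segment = {seg: confidence(rows) for seg, rows in by_segment.items()}

print(round(overall, 2))                                  # 0.46
print({s: round(c, 2) for s, c in per_segment.items()})   # young 0.9, senior 0.17
```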
4.1.6 - Automated model building
This tip advises the use of induction; it does not distinguish between AL and DM. It
nevertheless illustrates that automatic building of models, i.e. the automation of inductive
reasoning, is not a fancy of academics but an industrial need.
4.1.7 - Demystify neural networks and clusters by reverse engineering them using C&RT
Neural network classification techniques are not the only ones to provide incomprehensible
results, and many other approaches could be "demystified" as well. DM does
not recommend the exclusive use of techniques giving comprehensible results, and all data mining
techniques are acceptable. It is, however, unacceptable in DM to provide crude outputs without
interpreting them in a language comprehensible to the user. It follows that the concept of reverse
engineering should become central in DM.
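Reverse engineering an opaque model can be sketched in a few lines. Everything here is a simplifying assumption: the function black_box is a hypothetical stand-in for a trained neural network, the feature names are invented, and a tree of depth one replaces a full C&RT tree. The opaque model is probed on a grid of inputs, and the simple rule that best imitates its decisions is reported together with its fidelity.

```python
import math

def black_box(age, income):
    """Hypothetical opaque model standing in for a trained neural network."""
    score = 1.0 / (1.0 + math.exp(-(0.08 * income - 0.04 * age - 3.0)))
    return int(score > 0.5)

# Probe the black box on a grid of inputs and record its decisions.
samples = [(age, income)
           for age in range(20, 71, 10)
           for income in range(10, 91, 10)]
decisions = [black_box(age, income) for age, income in samples]

def fit_stump(samples, decisions, names=("age", "income")):
    """Depth-one tree imitating the decisions: (feature, threshold, fidelity)."""
    best = None
    for j, name in enumerate(names):
        for t in sorted({s[j] for s in samples}):
            agree = sum((s[j] > t) == bool(d) for s, d in zip(samples, decisions))
            fidelity = agree / len(samples)
            if best is None or fidelity > best[2]:
                best = (name, t, fidelity)
    return best

feature, threshold, fidelity = fit_stump(samples, decisions)
print(f"black box ~ 'accept if {feature} > {threshold}' (fidelity {fidelity:.2f})")
```

The fidelity measures agreement with the black box, not with reality: the rule explains what the opaque model does, which is what the user needs in order to read its outputs.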
4.1.8 - Use predictive modeling to impute missing values
The missing value problem is obviously well-known in AL. The methods used in AL are of three
The data are absent in a natural way (for example, the illnesses specific to one sex will be
missing from records of the other sex). Then, the missing values are replaced by "non
meaningful" and a specific non-meaningful-value treatment is introduced in the algorithm. This
solution is definitely the best in this case.
When the missing data are due to a lack of documentation, then two solutions are used. The first
one consists in introducing a coefficient weakening the variable whose data are missing, as in
C4.5, in order to decrease their contribution to the decision. The second one consists in
replacing the missing values with the mean of the observed values. The mean can be taken over
the whole set of examples, or over the examples of the same class.
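The two variants of completion by the mean can be sketched as follows, on hypothetical records in which None marks an undocumented value.

```python
from statistics import mean

# Hypothetical records: (class_label, value); None marks an undocumented value.
records = [("a", 1.0), ("a", 3.0), ("a", None),
           ("b", 10.0), ("b", None), ("b", 14.0)]

def impute_global(rows):
    """Replace each missing value by the mean over all observed values."""
    fill = mean(v for _, v in rows if v is not None)
    return [(c, fill if v is None else v) for c, v in rows]

def impute_by_class(rows):
    """Replace each missing value by the mean observed within the same class."""
    fills = {}
    for label in {c for c, _ in rows}:
        fills[label] = mean(v for c, v in rows if c == label and v is not None)
    return [(c, fills[c] if v is None else v) for c, v in rows]

print(impute_global(records)[2], impute_global(records)[4])      # ('a', 7.0) ('b', 7.0)
print(impute_by_class(records)[2], impute_by_class(records)[4])  # ('a', 2.0) ('b', 12.0)
```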
The DM approach to this second case follows from the fact that DM does not suppose that the
learning takes place in one step. The domain specialist and the programmer work together to
optimize the results. Models created during the previous iteration, or existing models known to
the experts, are used to compute the missing values.
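Using a model to compute a missing value can be sketched as follows. The records are hypothetical, and a least-squares line fitted on the complete records stands in for "the model created during the previous iteration"; any other predictive model could play that role.

```python
# Hypothetical records (age, income); None marks a missing income.
rows = [(20, 45.0), (30, 65.0), (40, 85.0), (50, None), (60, 125.0)]

def fit_line(pairs):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    slope = sxy / sxx
    return slope, my - slope * mx

# The model is built on the documented records only.
complete = [(x, y) for x, y in rows if y is not None]
slope, intercept = fit_line(complete)

# The model then supplies the values that are missing.
imputed = [(x, slope * x + intercept if y is None else y) for x, y in rows]
print(imputed[3])  # (50, 105.0)
```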
It is necessary to note however that the case of large amounts of missing data is not dealt with.
When, for example, more than 80% of the values of a variable are not documented, there is no
really efficient method to deal with such shortcomings.
4.1.9 - Build multiple models and form a ‘panel of experts’ predictive models
AL developed numerous approaches for simultaneously generating several models, in particular
those including a vote of models. Eventually one of the models will win. The notion of
cooperation between experts is never used. Although this has not been studied much, models of
agents could play an important role in DM.
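The contrast between a vote and a panel can be sketched as follows. The three rule-based models, their thresholds and the applicant record are hypothetical. The vote collapses the opinions into one answer, whereas the panel keeps each opinion available so that a disagreement can be examined by the expert.

```python
from collections import Counter

# Three hypothetical models, each a simple rule on a record (a dict of fields).
def model_income(record):
    return "good" if record["income"] > 30 else "bad"

def model_age(record):
    return "good" if record["age"] > 25 else "bad"

def model_debt(record):
    return "bad" if record["debt"] > 10 else "good"

PANEL = {"income": model_income, "age": model_age, "debt": model_debt}

def panel_opinions(record):
    """Keep every model's opinion so that an expert can inspect disagreements."""
    return {name: model(record) for name, model in PANEL.items()}

def majority_vote(record):
    """Collapse the panel into a single answer, as voting schemes do."""
    return Counter(panel_opinions(record).values()).most_common(1)[0][0]

applicant = {"income": 45, "age": 22, "debt": 3}
print(panel_opinions(applicant))  # {'income': 'good', 'age': 'bad', 'debt': 'good'}
print(majority_vote(applicant))   # good
```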
4.1.10 - Forget about traditional dated hygiene practices
I prefer not to comment on this assertion.
4.1.11 - Enrich your data with external data
AL, in principle, takes data as they are given: they are not obtained by a process with which
interaction is possible. DM supposes that obtaining some necessary new data is possible, which
can solve a problem or yield a solution otherwise impossible to find.
4.1.12 - Feed the models a better ‘balanced fuel mixture’ of data
This advice is similar enough to the one before, except that the model obtained at the previous
iteration is also used to search for data that is better adapted to a future induction.
4.2 What Data Mining techniques do you use regularly?
When consulting Gregory Piatetsky-Shapiro's site, http://www.kdnuggets.com, it is noticeable
that the tools really used in DM are not exactly those that had the greatest success among the AL
community. In particular, categorization tools are used as much as the entire set of the statistical
tools (not including regression and nearest neighbors). Moreover, when one considers
categorization no longer as a tool, but as a problem, 22% declare a need for these tools.
Tool                     Aug. 2001   Oct. 2002
Clustering               -           12% (if ‘type of analysis’, then 22%)
Neural Networks          13%         9%
Decision Trees/Rules     19%         16%
Logistic Regression      14%         9%
Statistics               17%         12%
Bayesian nets            6%          3%
Visualization            8%          6%
Nearest Neighbor         -           5%
Association Rules        7%          8%
Hybrid methods           4%          3%
Text Mining              2%          4%
Sequence Analysis        -           3%
Genetic Algorithms       -           3%
Naive Bayes              -           2%
Web mining               5%          2%
Agents                   1%          -
Other                    4%          3%
Table 2. The DM tools in 2001 and 2002.
New methods have been adopted during the last year, such as sequence analysis, genetic
algorithms, and text and web mining, which together constitute 12% of new methods. The sudden
appearance, at 5%, of a method as old as Nearest Neighbors is all the more striking. In fact, it
is extremely simple to implement, and its precision has been noticed for years in the
academic world. Nevertheless, no really clever changes can be made in its use, and thus it is not
interesting to academics.
Taking these figures into account, a decrease of 17% for the techniques listed in 2001 is
to be expected. The slight increase of association detection is therefore all the more noticeable.
This method of automatic detection of uncertain patterns in data certainly answers an industrial
need. To the best of my knowledge, it was never studied by AL before DM started identifying its
Bayesian networks lose some points, but apparently only because of the distinction now made
between naive Bayes and other Bayesian methods. Similarly, the category "statistics" loses 5%,
which is probably the 5% of the Nearest Neighbors. Decision Trees decrease slightly, but not
significantly.
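The arithmetic behind the figures quoted in this discussion can be checked directly from Table 2. The sketch below only transcribes the table (None stands for a tool not listed that year) and recomputes the sums mentioned in the text.

```python
# Shares from Table 2, in percent: (Aug. 2001, Oct. 2002).
# None means the tool was not listed in the poll of that year.
table2 = {
    "Clustering": (None, 12),
    "Neural Networks": (13, 9),
    "Decision Trees/Rules": (19, 16),
    "Logistic Regression": (14, 9),
    "Statistics": (17, 12),
    "Bayesian nets": (6, 3),
    "Visualization": (8, 6),
    "Nearest Neighbor": (None, 5),
    "Association Rules": (7, 8),
    "Hybrid methods": (4, 3),
    "Text Mining": (2, 4),
    "Sequence Analysis": (None, 3),
    "Genetic Algorithms": (None, 3),
    "Naive Bayes": (None, 2),
    "Web mining": (5, 2),
    "Agents": (1, None),
    "Other": (4, 3),
}

total_2001 = sum(a for a, _ in table2.values() if a is not None)
total_2002 = sum(b for _, b in table2.values() if b is not None)

# The four methods grouped in the text as newly adopted, with their 2002 shares.
grouped = ["Sequence Analysis", "Genetic Algorithms", "Text Mining", "Web mining"]
new_share = sum(table2[name][1] for name in grouped)
with_nearest_neighbor = new_share + table2["Nearest Neighbor"][1]
statistics_loss = table2["Statistics"][0] - table2["Statistics"][1]

print(total_2001, total_2002)            # 100 100
print(new_share, with_nearest_neighbor)  # 12 17
print(statistics_loss)                   # 5
```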
Finally, one can say that the 2002 losers are:
- Logistic Regression, very much taught at University, and therefore probably overvalued
by students gone to work in industry.
- Neural Networks, probably because of the complete lack of understandability of their
results, and their tendency to learn procedures that are not general enough.
- Support Vector Machines, not even cited by industry, in spite of their huge academic
success.
It will be interesting to check if this tendency is confirmed or not in the coming years.
The cause of the industrial acceptance of DM is easy to understand: the creators of
this research topic took the problems of industry into account, while AL researchers are centered
on scientific issues. They are certainly happy when they find an application, but they
are not motivated by the application. As a testimony to the isolation of AL research from
industrial applications, consider the thousands of academic AL papers that report progress of
a few tenths of a percent in precision, improving an already known method, and applied to non grounded
An unexpected consequence of taking applications into account is that DM dares to attack
problems known to be impossible to solve with certainty, that is to say, all unsupervised
problems: categorization and segmentation, discovery of associations, time series, and
construction of a Bayesian network structure from data. Even in the supervised case, DM also
deals with ill-defined problems: large quantities of missing values, very noisy data, data with
few examples (i.e., few records) and a large number of features (i.e., many fields). A striking
example of this last problem, which currently attracts much attention from the DM community, is
DNA chips. It is obvious that many models will fit this special kind of data. It is therefore
hopeless to try to find the one true solution. The real goal is to decrease the failure rate in
order to further ease the work of the human specialists.
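Why many models fit data with few records and many fields can be seen on a deliberately tiny case. The two samples, the three features and the three linear models below are hypothetical. Each model reproduces the known records exactly, yet the models disagree on a new record, so the data alone cannot single out "the one true solution".

```python
# Two hypothetical samples described by three features: more unknowns than
# equations, as with few records and many fields.
X = [(1.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
y = [1.0, 2.0]

def predict(weights, x):
    """Linear model: weighted sum of the feature values."""
    return sum(w * v for w, v in zip(weights, x))

def fits(weights):
    """True if the model reproduces every known record exactly."""
    return all(abs(predict(weights, x) - t) < 1e-12 for x, t in zip(X, y))

# Three different weight vectors, all of them perfect on the known records.
model_1 = (1.0, 2.0, 0.0)
model_2 = (0.0, 1.0, 1.0)
model_3 = (-1.0, 0.0, 2.0)
print([fits(m) for m in (model_1, model_2, model_3)])  # [True, True, True]

# On a record never seen before, the three models give three answers.
new_sample = (1.0, 1.0, 0.0)
predictions = [predict(m, new_sample) for m in (model_1, model_2, model_3)]
print(predictions)  # [3.0, 1.0, -1.0]
```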
Thus, DM is characterized by its audacity in challenging the problems as they are, not as they can
be neatly solved.
Bendou M., Munteanu P. "Analyse de l'effet du bruit dans les algorithmes d'apprentissage
des réseaux Bayésiens," Revue des sciences et technologies de l'information 17, (EGC-2003), pp.
Benzecri, J. P. L'analyse des données, Dunod, Paris 1973.
Breiman L., Friedman J., Olshen R., Stone C. : Classification and Regression Trees.
Wadsworth International Group, 1984.
Brill E. "Some Advances in Transformation-Based Part of Speech Tagging," AAAI,
Cornuéjols A., Miclet L., Apprentissage Artificiel, Eyrolles, Paris, 2002.
Dietterich, T. G., Michalski, R. S. "Inductive Learning of Structural Descriptions:
Evaluation Criteria and Comparative Review of Selected Methods" Artificial Intelligence Journal
16, 1981, pp. 257-294.
Domingos P. "Occam's Two Razors: The Sharp and the Blunt," Proceedings of the
Fourth International Conference on Knowledge Discovery and Data Mining, New York, NY:
AAAI Press, pp. 37-43, 1998.
Fisher D. “Knowledge acquisition via incremental conceptual clustering”, Machine
Learning Journal 2, 139-172, 1987.
Heckerman D., Geiger D., Chickering D. "Learning Bayesian networks: The combination
of knowledge and statistical data," Machine Learning Journal 20, 197-243, 1995.
Kohonen T. "The self-organizing map," Proc. IEEE 78, 1464-1480, 1990.
Liu, H., Motoda, H., Feature Selection, Kluwer Academic Publishers, Norwell, MA, 1998.
LeCun Y., Boser B., Denker J. S., Henderson D., Howard R. E., Hubbard W., Jackel L.
D., "Backpropagation Applied to Handwritten Zip Code Recognition," Neural Computation, vol.
1, no. 4, pp. 541-551, 1989.
Maron M. E. "Automatic indexing: An experimental inquiry," Journal of the
Association for Computing Machinery, 8:404-417, 1961.
Michalski, R. S., Chilausky R. L. "Learning by being told and learning from examples:
An experimental comparison of the two methods of knowledge acquisition in the context of
developing an expert system for soybean disease diagnosis," International Journal of Policy
Analysis and Information Systems 4:125-160, 1980.
Piatetsky-Shapiro G., Frawley W. J. (Eds.), Knowledge Discovery in Data Bases, AAAI/
MIT Press, Menlo Park CA, 1991.
Popper, K. R. The Logic of Scientific Discovery, Harper and Row, NY, 1959.
Quinlan J. R. "Learning Efficient Classification Procedures and their Application to Chess
End Games," in Machine Learning: An Artificial Intelligence Approach, R. S. Michalski, J. G.
Carbonell, T. M. Mitchell (Eds.), Morgan Kaufmann, Los Altos, pp. 463-482, 1983.
Quinlan J. R. C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, 1993.
Rosenblatt F. "The perceptron: a probabilistic model for information storage and
organization in the brain," Psychological Review 65:386-408 (1958).
Vapnik V. The nature of statistical learning theory, Springer-Verlag, 1995.