1. Machine Learning and Data Mining
Yves Kodratoff
CNRS, LRI Bât. 490, Université Paris-Sud
91405 Orsay, yk@lri.fr
http://www.lri.fr/~yk/
“Automatic Learning”: stemming from 4
communities developing 4 approaches
AI
Stats (and DA)
Bayesian Stats.
Pattern Recognition
DM: the ‘daughter’ of DB and AL
1. A good many definitions
A few definitions 1, 2, 3:
Supervised and Unsupervised Learning
What is automated induction?
The components of DM
2. Differences between AL and DM
Differences in the scientific approach
Differences from the point of view of industry 1, 2
Twelve tips for successful Data Mining
3. A few definitions 1:
Supervised and Unsupervised Learning
Supervised Learning (“with teacher”)
Input: description in extension of the problem.
Most often:
           Field 1    Field 2    …   Field k    Class
Record 1   Value 11   Value 12   …   Value 1k   Class value
…
Record p   Value p1   Value p2   …   Value pk   Class value
Output: extract the ‘properties’ of this description
(also called: description in intension)
IF (Field m = Value ml) & Field n ∈ [Value ij, Value mn] & …
THEN Class value = a
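The table-to-rule idea above can be sketched in a few lines: the extensional description is a table of records, and the learned intensional description is an IF–THEN rule covering it. This is a minimal illustration; the field names, values, and the rule itself are hypothetical.

```python
# Extensional description: a toy table of records (fields + class).
records = [
    {"age": 35, "income": "high", "class": "buyer"},
    {"age": 42, "income": "high", "class": "buyer"},
    {"age": 23, "income": "low",  "class": "non-buyer"},
    {"age": 51, "income": "low",  "class": "non-buyer"},
]

def rule(record):
    """Intensional description: IF income = high THEN class = buyer."""
    return "buyer" if record["income"] == "high" else "non-buyer"

# A learned rule is judged by its precision on the records it covers.
correct = sum(rule(r) == r["class"] for r in records)
precision = correct / len(records)
print(precision)  # 1.0 on this toy table
```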
Unsupervised Learning (“without teacher”)
Discover patterns in the data
4. Clustering =
classification, categorization, segmentation
Data Analysis
e.g. main axis of ellipsoid containing the data
Search for logical structures =
Probabilistic theorems (associations)
functional relations among variables (such as
PV = nRT)
Spatial or Temporal sequences
Discover terms in texts
A few definitions 2:
What is automated induction?
Techniques for inventing a new model better fitting the data
Essentially made of 4 steps:
Definition of the hypothesis space
Choice of a search strategy within the hypothesis space
Choice of an optimization criterion
Validation
5. Definition of the hypothesis space
Defines the task and the space of possible solutions
e.g.: tagging.
‘special purposes’ → ‘special-adj purposes-n-plur’
Example task: learn the tags of new words from a set of
tagged texts
Hypothesis space: let W1 be the new word to tag. The hypothesis
space is the ‘context’:
all words and tags within 3 words before or after W1.
Rules will be of the form:
IF context(W1) = … THEN tag W1 as …
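The context-rule form above can be sketched concretely. This is a toy version, assuming a window of 1 word instead of 3 for brevity; the rule and the vocabulary are illustrative, not taken from a real tagger.

```python
def context(words, i, window=1):
    """Words within `window` positions before/after position i."""
    return tuple(words[max(0, i - window):i] + words[i + 1:i + 1 + window])

# A rule of the slide's form, as learned from tagged text:
# IF context(W1) contains 'special' THEN tag W1 as 'n-plur'.
def tag(words, i):
    return "n-plur" if "special" in context(words, i) else "unknown"

words = ["special", "purposes"]
print(tag(words, 1))  # n-plur
```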
Choice of a search strategy within the hypothesis space
Exhaustive
Exhaustive + random choice
Greedy (choose 1st step that leads to best value of
optimization criterion)
Steepest descent (e.g. Neural Networks)
Genetic Algorithms
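Of the strategies listed, the greedy one is easiest to show in miniature: at each step, take the first extension of the current hypothesis that improves the optimization criterion, and stop when none does. The candidate items and the scoring function below are hypothetical stand-ins for a real criterion such as precision.

```python
def greedy_search(candidates, score):
    """Greedy search: repeatedly take the 1st step that improves the score."""
    chosen = []
    best = score(chosen)
    improved = True
    while improved:
        improved = False
        for c in candidates:
            if c in chosen:
                continue
            s = score(chosen + [c])
            if s > best:
                chosen, best = chosen + [c], s
                improved = True
                break  # greedy: commit to the first improving step
    return chosen, best

# Toy criterion: score = number of genuinely useful items selected.
useful = {"a", "c"}
score = lambda selection: len(set(selection) & useful)
print(greedy_search(["a", "b", "c"], score))  # (['a', 'c'], 2)
```

Exhaustive search would score every subset; the greedy loop trades that guarantee for speed, which is why it can stop in a local optimum.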
6. Choice of an optimization criterion
Apply the current hypothesis to the data and then use one of the
following:
Adjust numerical distances (DA)
e.g. hypothesize a cluster, compute its center of gravity,
compute the sum of the distances of the points in the cluster
to the center of gravity, optimum is obtained when distance
is minimum
Decrease variance (Stats)
Increase precision or similar measurements (ML)
Adjust discrete (or Boolean) distances (ML & DA)
Decrease entropy (decision trees)
Increase utility (define utility) (DM)
Increase posterior probability of phenomenon given data:
P(Ph | D) (Bayesian learning)
Minimum description length (ML & Bayesian)
When everything else fails: Occam’s razor ('everyone')
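The first criterion in the list can be sketched exactly as stated: hypothesize a cluster, compute its center of gravity, and sum the distances of its points to that center; between two candidate clusterings, the smaller sum wins. The data points below are illustrative.

```python
import math

def within_distance(cluster):
    """Sum of Euclidean distances to the cluster's center of gravity."""
    cx = sum(x for x, _ in cluster) / len(cluster)
    cy = sum(y for _, y in cluster) / len(cluster)
    return sum(math.dist((x, y), (cx, cy)) for x, y in cluster)

tight = [(0, 0), (1, 0), (0, 1), (1, 1)]
loose = [(0, 0), (10, 0), (0, 10), (10, 10)]
# The optimum is obtained when the summed distance is minimum,
# so the tight cluster is the better hypothesis.
print(within_distance(tight) < within_distance(loose))  # True
```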
7. Validation
Expert
Use the results
A few definitions 3:
The base components of DM
Data Mining
Machine Learning
Pattern Recognition
Exploratory Statistics
Data Analysis
Bayesian statistics
Data Mining (DM) (1989)
Unsupervised:
Association Detection
Temporal Series
Segmentation techniques
Supervised :
Data with many fields and few records : DNA chips
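Association detection, the first unsupervised DM technique listed, can be sketched as counting co-occurrences across records and keeping the pairs whose support passes a threshold. The transactions and the threshold below are illustrative.

```python
from itertools import combinations
from collections import Counter

def associations(transactions, min_support):
    """Item pairs appearing together in at least min_support transactions."""
    counts = Counter()
    for t in transactions:
        for pair in combinations(sorted(set(t)), 2):
            counts[pair] += 1
    return {pair for pair, n in counts.items() if n >= min_support}

baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"milk"},
]
print(associations(baskets, min_support=2))  # {('bread', 'butter')}
```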
9. Data Analysis (60s)
Supervised :
Principal component analysis
Unsupervised:
Numerical clustering
Bayesian statistics
Supervised (1961)
Naive Bayes
Unsupervised (1995)
Learning the structure of large Bayesian networks
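Naive Bayes, the supervised technique named above, can be sketched directly: take the posterior of a class proportional to P(class) times the product of P(field | class), assuming fields independent given the class. The toy table is hypothetical and smoothing is omitted for brevity.

```python
from collections import defaultdict

def train(records):
    """records: list of (field_values_tuple, class_label) pairs."""
    class_n = defaultdict(int)
    field_n = defaultdict(int)  # (class, field position, value) -> count
    for fields, cls in records:
        class_n[cls] += 1
        for i, v in enumerate(fields):
            field_n[(cls, i, v)] += 1
    total = sum(class_n.values())

    def predict(fields):
        best, best_p = None, -1.0
        for cls, n in class_n.items():
            p = n / total  # prior P(class)
            for i, v in enumerate(fields):
                p *= field_n[(cls, i, v)] / n  # likelihood P(field | class)
            if p > best_p:
                best, best_p = cls, p
        return best

    return predict

data = [(("high", "yes"), "buy"), (("high", "no"), "buy"),
        (("low", "no"), "pass"), (("low", "yes"), "pass")]
predict = train(data)
print(predict(("high", "yes")))  # buy
```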
10. Differences between AL and DM
Differences in the scientific approach
Classic data processing     Automatic Learning            DM
                            (ML and Statistics)
Simulates deductive         Simulates inductive           Simulates inductive
reasoning (applies an       reasoning (invents a          reasoning ("even more
existing model)             model)                        inductive")
Validation according        Validation according          Validation according to
to precision                to precision                  utility and comprehensibility
Results as universal        Results as universal          Results relative to
as possible                 as possible                   particular cases
Elegance = conciseness      Elegance = conciseness        Elegance = adequacy to
                                                          the user's model

Position relative to Artificial Intelligence
Tends to reject AI          Either tends to reject AI     Naturally integrates AI,
                            (Statistics) or claims        DB, Stat., and MMI
                            belonging to AI (ML)
12. Differences from the point of view of industry 1
Twelve tips for successful Data Mining
Oracle Data Mining Suite
a - Mine significantly more data
b - Create new variables to tease more information out of your
data
c - Take a shallow dive into the data first
d - Rapidly build many exploratory predictive models
e - Cluster your customers first, and then build multiple
targeted predictive models
apply pattern detection methods to the entire base →
laws valid for all individuals (usually trivial)
apply pattern detection methods to the segmented base →
laws valid for each segment (usually as interesting as the
segmentation is)
f - Automate model building
g - Demystify neural networks and clusters by reverse
engineering them using C&RT models
h - Use predictive modeling to impute missing values
i - Build multiple models and form a ‘panel of experts’
predictive models
j - Forget about traditional data hygiene practices
k - Enrich your data with external data
13. l - Feed the models a better ‘balanced fuel mixture’ of data
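Tip h above (impute missing values with a predictive model) can be sketched in miniature: train on the complete records and fill the hole in the incomplete one. The "model" here is a simple nearest-neighbour lookup on the other fields, and the records are hypothetical.

```python
def impute(records, incomplete, missing_field):
    """Fill incomplete[missing_field] using the most similar complete record."""
    def similarity(r):
        # Count matching fields, ignoring the one we are trying to fill.
        return sum(r[f] == incomplete[f]
                   for f in incomplete if f != missing_field)
    donor = max(records, key=similarity)
    filled = dict(incomplete)
    filled[missing_field] = donor[missing_field]
    return filled

complete = [
    {"region": "north", "segment": "retail", "channel": "web"},
    {"region": "south", "segment": "b2b",    "channel": "phone"},
]
row = {"region": "north", "segment": "retail", "channel": None}
print(impute(complete, row, "channel"))  # channel imputed as 'web'
```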
Differences from the point of view of industry 2
What Data Mining techniques do you use regularly?
http://www.kdnuggets.com
Aug. 2001 Oct. 2002
Clustering na 12% (if ‘type of analysis’, then 22%)
Neural Networks 13% 9%
Decision Trees/Rules 19% 16%
Logistic Regression 14% 9%
Statistics 17% 12%
Bayesian nets 6% 3%
Visualization 8% 6%
Nearest Neighbor na 5%
Association Rules 7% 8%
Hybrid methods 4% 3%
Text Mining 2% 4%
Sequence Analysis na 3%
Genetic Algorithms na 3%
Naive Bayes na 2%
Web mining 5% 2%
Agents 1% na
Other 2% 2%
Conclusion
It is obvious that DM takes care of industrial problems
BUT ALSO