Introduction to Machine Learning
Isabelle Guyon
isabelle@clopinet.com
What is Machine Learning?
Diagram: a learning algorithm takes TRAINING DATA and produces a trained machine; the trained machine takes a query and returns an answer.
What for?
• Classification
• Time series prediction
• Regression
• Clustering
Some Learning Machines
• Linear models
• Kernel methods
• Neural networks
• Decision trees
Applications
Figure: application domains placed by number of inputs and number of training examples (both axes from 10 to 10^5): Bioinformatics, Ecology, OCR, HWR, Market Analysis, Text Categorization, Machine Vision, System Diagnosis.
Banking / Telecom / Retail
• Identify:
– Prospective customers
– Dissatisfied customers
– Good customers
– Bad payers
• Obtain:
– More effective advertising
– Less credit risk
– Less fraud
– Decreased churn rate
Biomedical / Biometrics
• Medicine:
– Screening
– Diagnosis and prognosis
– Drug discovery
• Security:
– Face recognition
– Signature / fingerprint / iris verification
– DNA fingerprinting
Computer / Internet
• Computer interfaces:
– Troubleshooting wizards
– Handwriting and speech
– Brain waves
• Internet
– Hit ranking
– Spam filtering
– Text categorization
– Text translation
– Recommendation
Challenges
Figure: challenge datasets (NIPS 2003 & WCCI 2006) placed by number of inputs and number of training examples (both axes from 10 to 10^5): Arcene, Dorothea, Hiva, Sylva, Gisette, Gina, Ada, Dexter, Nova, Madelon.
Ten Classification Tasks
Figure: test BER (%) distributions for the ten challenge datasets: Ada, Gina, Hiva, Nova, Sylva, Arcene, Dexter, Dorothea, Gisette, Madelon.
Challenge Winning Methods
Figure: best performance of each method family (Linear/Kernel, Neural Nets, Trees/RF, Naïve Bayes), shown as BER normalized by the average BER (BER/<BER>), on each dataset: Gisette (HWR), Gina (HWR), Dexter (Text), Nova (Text), Madelon (Artificial), Arcene (Spectral), Dorothea (Pharma), Hiva (Pharma), Ada (Marketing), Sylva (Ecology).
Conventions
X = {xij}: data matrix with m rows (patterns xi) and n columns (features); y = {yj}: target values; w: weight vector; α: dual coefficients.
Learning Problem
(Example data: colon cancer, Alon et al., 1999)
Unsupervised learning: is there structure in the data?
Supervised learning: predict an outcome y.
Data matrix X:
– m lines = patterns (data points, examples): samples, patients, documents, images, …
– n columns = features (attributes, input variables): genes, proteins, words, pixels, …
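As a minimal Python sketch of these conventions (the numbers and labels below are made up for illustration):

    import numpy as np

    # Data matrix X: m = 4 patterns (rows), n = 3 features (columns).
    X = np.array([[5.1, 3.5, 1.4],
                  [4.9, 3.0, 1.4],
                  [6.2, 3.4, 5.4],
                  [5.9, 3.0, 5.1]])
    y = np.array([-1, -1, +1, +1])   # one target label per pattern
    m, n = X.shape                   # m patterns, n features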
Linear Models
• f(x) = w • x + b = Σj=1:n wj xj + b
Linearity in the parameters, NOT in the input components.
• f(x) = w • Φ(x) + b = Σj wj φj(x) + b (Perceptron)
• f(x) = Σi=1:m αi k(xi, x) + b (Kernel method)
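A minimal sketch of the first two forms in Python (the weights and the quadratic feature map below are arbitrary illustrative choices, not from the slides):

    import numpy as np

    def f_linear(x, w, b):
        # f(x) = w . x + b
        return np.dot(w, x) + b

    def f_features(x, w, b, phi):
        # Same linear-in-the-parameters form after a fixed nonlinear mapping: f(x) = w . phi(x) + b
        return np.dot(w, phi(x)) + b

    phi = lambda x: np.array([x[0], x[1], x[0] * x[1]])   # example feature map
    print(f_linear(np.array([1.0, 2.0]), np.array([0.5, -0.3]), 0.1))
    print(f_features(np.array([1.0, 2.0]), np.array([0.5, -0.3, 0.2]), 0.1, phi))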
Artificial Neurons
McCulloch and Pitts, 1943
f(x) = w • x + b
Figure: inputs x1 … xn plus a constant 1 are weighted by w1 … wn and b and summed (Σ) to give f(x); biological analogy: activations of other neurons reach the dendrites through synapses, the cell potential drives the activation function, and the output travels along the axon.
Linear Decision Boundary
Figure: a separating hyperplane shown in the three-dimensional input space (x1, x2, x3) and in the (x1, x2) plane.
Perceptron
Rosenblatt, 1957
f(x) = w • Φ(x) + b
Figure: inputs x1 … xn feed basis functions φ1(x) … φN(x); their outputs, plus a constant 1, are weighted by w1 … wN and b and summed (Σ) to give f(x).
NL Decision Boundary
Figure: a non-linear decision boundary shown in the three-dimensional input space (x1, x2, x3) (genes Hs.128749, Hs.234680, Hs.7780) and in the (x1, x2) plane.
Kernel Method
Potential functions, Aizerman et al 1964
f(x) = Σi αi k(xi, x) + b
k(. , .) is a similarity measure or “kernel”.
Figure: the input x = (x1 … xn) is compared with each training example through k(x1, x) … k(xm, x); these values, plus a constant 1, are weighted by α1 … αm and b and summed (Σ) to give f(x).
Hebb’s Rule
wj ← wj + yi xij
Figure: the weight wj of the synapse connecting input xj (the axon of another neuron) to the summing unit Σ with output y is reinforced when input and output are active together.
Link to “Naïve Bayes”
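A minimal sketch of Hebb’s rule as a batch training procedure (toy data, labels in {-1, +1}, no bias term; everything below is illustrative):

    import numpy as np

    def hebb_train(X, y):
        # Hebb's rule: wj <- wj + yi * xij, accumulated over all training examples,
        # which amounts to w = sum_i yi * xi.
        w = np.zeros(X.shape[1])
        for i in range(X.shape[0]):
            w += y[i] * X[i]
        return w

    X = np.array([[1.0, 0.2], [0.8, -0.1], [-0.9, 0.3], [-1.1, 0.0]])
    y = np.array([+1, +1, -1, -1])
    w = hebb_train(X, y)
    print(np.sign(X @ w))   # predictions on the training points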
Kernel “Trick” (for Hebb’s rule)
• Hebb’s rule for the Perceptron:
w = Σi yi Φ(xi)
f(x) = w • Φ(x) = Σi yi Φ(xi) • Φ(x)
• Define a dot product: k(xi, x) = Φ(xi) • Φ(x)
f(x) = Σi yi k(xi, x)
Kernel “Trick” (general)
• f(x) = Σi αi k(xi, x)
• k(xi, x) = Φ(xi) • Φ(x)
• f(x) = w • Φ(x)
• w = Σi αi Φ(xi)
Dual forms
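A minimal sketch checking that the primal and dual forms agree, using the degree-2 polynomial kernel, whose feature map Φ(x) = (x1², √2·x1·x2, x2²) is known explicitly (the data points and coefficients below are made up):

    import numpy as np

    def Phi(x):
        return np.array([x[0]**2, np.sqrt(2.0) * x[0] * x[1], x[1]**2])

    def k(s, t):
        # Polynomial kernel of degree 2: k(s, t) = (s . t)^2 = Phi(s) . Phi(t)
        return np.dot(s, t) ** 2

    X = np.array([[1.0, 2.0], [0.5, -1.0], [-2.0, 0.3]])   # training points
    alpha = np.array([0.7, -0.2, 0.1])                      # dual coefficients
    x = np.array([0.4, 1.5])                                # query point

    f_dual = sum(alpha[i] * k(X[i], x) for i in range(len(X)))   # f(x) = Σi αi k(xi, x)
    w = sum(alpha[i] * Phi(X[i]) for i in range(len(X)))         # w = Σi αi Φ(xi)
    f_primal = np.dot(w, Phi(x))                                 # f(x) = w . Φ(x)
    print(f_dual, f_primal)   # the two values agree up to rounding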
What is a Kernel?
A kernel is:
• a similarity measure
• a dot product in some feature space: k(s, t) = Φ(s) • Φ(t)
But we do not need to know the Φ representation.
Examples:
• k(s, t) = exp(-||s - t||² / σ²)  Gaussian kernel
• k(s, t) = (s • t)^q  Polynomial kernel
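The two example kernels, written out in Python (the default σ and q values are arbitrary):

    import numpy as np

    def gaussian_kernel(s, t, sigma=1.0):
        # k(s, t) = exp(-||s - t||^2 / sigma^2)
        return np.exp(-np.sum((s - t) ** 2) / sigma ** 2)

    def polynomial_kernel(s, t, q=2):
        # k(s, t) = (s . t)^q
        return np.dot(s, t) ** q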
Multi-Layer Perceptron
Back-propagation, Rumelhart et al, 1986
Figure: inputs xj feed a layer of summing “hidden units” (internal “latent” variables), whose outputs feed a further summing unit.
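A minimal sketch of the forward pass of a two-layer perceptron (the tanh activation, layer sizes, and random weights are arbitrary choices, not prescribed by the slides):

    import numpy as np

    def mlp_forward(x, W1, b1, W2, b2):
        # Hidden units: h = tanh(W1 x + b1); output: f(x) = W2 h + b2.
        h = np.tanh(W1 @ x + b1)
        return W2 @ h + b2

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)   # 2 inputs -> 3 hidden units
    W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # 3 hidden units -> 1 output
    print(mlp_forward(np.array([0.5, -1.0]), W1, b1, W2, b2))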
Chessboard Problem
Tree Classifiers
CART (Breiman, 1984) or C4.5 (Quinlan, 1993)
At each step, choose the feature that “reduces entropy” most. Work towards “node purity” (see the sketch below).
Figure: starting from all the data, successive splits choose f2 and f1.
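A minimal sketch of the entropy-reduction criterion behind such splits (binary split of a numeric feature at a threshold; the toy data are made up):

    import numpy as np

    def entropy(y):
        # Shannon entropy (in bits) of a label vector.
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def information_gain(y, feature, threshold):
        # Entropy reduction obtained by splitting on one feature at the given threshold.
        left, right = y[feature <= threshold], y[feature > threshold]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        return entropy(y) - weighted

    y = np.array([-1, -1, +1, +1])
    feature = np.array([-2.0, -1.0, 1.0, 2.0])
    print(information_gain(y, feature, 0.0))   # 1.0 bit: the split yields pure nodes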
Iris Data (Fisher, 1936)
Figure: decision regions of four classifiers (linear discriminant, tree classifier, Gaussian mixture, kernel method (SVM)) on the three iris classes setosa, versicolor, and virginica. Figure from Norbert Jankowski and Krzysztof Grabczewski.
Fit / Robustness Tradeoff
Figure: two alternative decision boundaries in the (x1, x2) plane, illustrating the tradeoff between fitting the training data and robustness.
Performance Evaluation
Figures (three slides): the same two classifiers in the (x1, x2) plane with the decision threshold shifted across three values: the level sets f(x) = 0, f(x) = -1, and f(x) = +1, each separating a region where f(x) is above the threshold from a region where it is below.
ROC Curve
For a given threshold on f(x), you get a point on the ROC curve.
Figure: positive class success rate (hit rate, sensitivity) vs. 1 - negative class success rate (false alarm rate, 1 - specificity), both from 0 to 100%; the actual ROC lies between the random ROC and the ideal ROC curve.
ROC Curve
0 ≤ AUC ≤ 1
For a given threshold on f(x), you get a point on the ROC curve.
Figure: same axes (hit rate vs. false alarm rate); the ideal ROC curve has AUC = 1, the random ROC has AUC = 0.5, and the actual ROC lies in between.
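A minimal sketch of how sweeping the threshold on f(x) traces the ROC curve (toy scores; assumes labels in {-1, +1}):

    import numpy as np

    def roc_points(scores, y):
        # For each threshold, record (false alarm rate, hit rate).
        pos, neg = np.sum(y == +1), np.sum(y == -1)
        points = []
        for theta in np.sort(scores)[::-1]:
            pred = np.where(scores >= theta, +1, -1)
            tp = np.sum((pred == +1) & (y == +1))
            fp = np.sum((pred == +1) & (y == -1))
            points.append((fp / neg, tp / pos))
        return points

    scores = np.array([0.9, 0.7, 0.4, 0.2])
    y = np.array([+1, -1, +1, -1])
    print(roc_points(scores, y))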
Lift Curve
Customers are ranked according to f(x); the top ranking customers are selected.
Gini = 2 AUC - 1, with 0 ≤ Gini ≤ 1.
Figure: hit rate (fraction of the good customers selected) vs. fraction of customers selected, both from 0 to 100%; the actual lift lies between the random lift and the ideal lift, and the Gini index is the area between the actual and random lifts relative to the area between the ideal and random lifts.
Performance Assessment
Cost matrix (confusion matrix):
                        Predictions F(x):
                        Class -1          Class +1          Total
Truth y:   Class -1     tn                fp                neg = tn + fp
           Class +1     fn                tp                pos = fn + tp
           Total        rej = tn + fn     sel = fp + tp     m = tn + fp + fn + tp
False alarm rate = fp/neg = type I error rate = 1 - specificity
Hit rate = tp/pos = 1 - type II error rate = sensitivity = recall = test power
Fraction selected = sel/m
Precision = tp/sel
Compare F(x) = sign(f(x)) to the target y, and report (a sketch follows below):
• Error rate = (fn + fp)/m
• {Hit rate, False alarm rate} or {Hit rate, Precision} or {Hit rate, Fraction selected}
• Balanced error rate (BER) = (fn/pos + fp/neg)/2 = 1 - (sensitivity + specificity)/2
• F measure = 2 precision·recall / (precision + recall)
Vary the decision threshold θ in F(x) = sign(f(x) + θ), and plot:
• ROC curve: Hit rate vs. False alarm rate
• Lift curve: Hit rate vs. Fraction selected
• Precision/recall curve: Hit rate vs. Precision
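A minimal sketch computing these quantities from raw decision values (toy scores; assumes labels in {-1, +1}):

    import numpy as np

    def performance_report(f_values, y, theta=0.0):
        # Compare F(x) = sign(f(x) + theta) to the target y and report the metrics above.
        F = np.where(f_values + theta > 0, +1, -1)
        tp = np.sum((F == +1) & (y == +1)); fp = np.sum((F == +1) & (y == -1))
        tn = np.sum((F == -1) & (y == -1)); fn = np.sum((F == -1) & (y == +1))
        pos, neg, m = tp + fn, tn + fp, len(y)
        hit_rate = tp / pos
        false_alarm = fp / neg
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        f_measure = (2 * precision * hit_rate / (precision + hit_rate)
                     if (precision + hit_rate) else 0.0)
        return {"error rate": (fn + fp) / m,
                "BER": (fn / pos + fp / neg) / 2,
                "hit rate": hit_rate,
                "false alarm rate": false_alarm,
                "precision": precision,
                "F measure": f_measure}

    print(performance_report(np.array([0.8, -0.3, 0.1, -0.9]), np.array([+1, +1, -1, -1])))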
What is a Risk Functional?
A function of the parameters of the learning machine, assessing how much it is expected to fail on a given task.
Examples:
• Classification:
– Error rate: (1/m) Σi=1:m 1(F(xi) ≠ yi)
– 1 - AUC (Gini index = 2 AUC - 1)
• Regression:
– Mean square error: (1/m) Σi=1:m (f(xi) - yi)²
How to Train?
• Define a risk functional R[f(x, w)]
• Optimize it w.r.t. w (gradient descent, mathematical programming, simulated annealing, genetic algorithms, etc.)
Figure: the risk R[f(x, w)] plotted over the parameter space (w), with its minimum at w*.
How to Train?
• Define a risk functional R[f(x, w)]
• Find a method to optimize it, typically “gradient descent”:
wj ← wj - η ∂R/∂wj
or any other optimization method (mathematical programming, simulated annealing, genetic algorithms, etc.)
(… to be continued in the next lecture)
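A minimal sketch of this training loop for a linear model f(x) = w • x + b under the mean square error risk (the learning rate, number of epochs, and data are arbitrary illustrative choices):

    import numpy as np

    def train_linear_mse(X, y, eta=0.1, epochs=200):
        # Gradient descent wj <- wj - eta * dR/dwj on R = (1/m) Σi (w . xi + b - yi)^2.
        m, n = X.shape
        w, b = np.zeros(n), 0.0
        for _ in range(epochs):
            residual = X @ w + b - y                  # f(xi) - yi for all i
            w -= eta * (2.0 / m) * (X.T @ residual)   # partial derivatives w.r.t. w
            b -= eta * (2.0 / m) * residual.sum()     # partial derivative w.r.t. b
        return w, b

    X = np.array([[1.0, 0.5], [0.9, 1.2], [-1.0, -0.3], [-1.2, -0.8]])
    y = np.array([+1, +1, -1, -1])
    w, b = train_linear_mse(X, y)
    print(np.sign(X @ w + b))   # should reproduce the training labels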
Summary
• With linear threshold units (“neurons”) we can build:
– Linear discriminant (including Naïve Bayes)
– Kernel methods
– Neural networks
– Decision trees
• The architectural hyper-parameters may include:
– The choice of basis functions φ (features)
– The kernel
– The number of units
• Learning means fitting:
– Parameters (weights)
– Hyper-parameters
– Be aware of the fit vs. robustness tradeoff
Want to Learn More?
• Pattern Classification, R. Duda, P. Hart, and D. Stork. Standard pattern recognition textbook. Limited to classification problems. Matlab code. http://rii.ricoh.com/~stork/DHS.html
• The Elements of Statistical Learning: Data Mining, Inference, and Prediction, T. Hastie, R. Tibshirani, and J. Friedman. Standard statistics textbook. Includes all the standard machine learning methods for classification, regression, and clustering. R code. http://www-stat-class.stanford.edu/~tibs/ElemStatLearn/
• Linear Discriminants and Support Vector Machines, I. Guyon and D. Stork. In Smola et al., Eds., Advances in Large Margin Classifiers, pages 147-169, MIT Press, 2000. http://clopinet.com/isabelle/Papers/guyon_stork_nips98.ps.gz
• Feature Extraction: Foundations and Applications, I. Guyon et al., Eds. Book for practitioners, with the datasets of the NIPS 2003 challenge, tutorials, best performing methods, Matlab code, and teaching material. http://clopinet.com/fextract-book
Speaker Notes
  • Challenge winning methods chart: best performance for each type of method, normalized by the average of these performances.
  • Which model is best, linear or non-linear? The decision comes when we see new data. Very often the simplest model is better; this principle is formalized in learning theory.
  • Explain that this is a global estimator
    Proof that Gini = 2 AUC - 1:
    Let L = area under the lift curve, Hit rate = tp/pos, FA rate = fp/neg.
    Fraction selected = sel/tot = (tp + fp)/tot = (pos/tot) Hit rate + (neg/tot) FA rate
    AUC = ∫ Hit rate d(FA rate)
    L = ∫ Hit rate d(Fraction selected)
      = ∫ Hit rate d((pos/tot) Hit rate + (neg/tot) FA rate)
      = (pos/tot) ∫ Hit rate d(Hit rate) + (neg/tot) ∫ Hit rate d(FA rate)
      = (1/2)(pos/tot) + (neg/tot) AUC
    2L - 1 = -(1 - pos/tot) + 2 (1 - pos/tot) AUC = (1 - pos/tot)(2 AUC - 1)
    Gini = (L - 1/2) / ((1 - pos/tot)/2) = (2L - 1)/(1 - pos/tot) = 2 AUC - 1