# Introduction

• Note: the background image shows through in WebEx; I’ve reduced the image, hope that helps. Do not mention KXEN.
• Include recommendation systems here.
• Best performance for each type of method, normalized by the average of these performances.
• Which is best: linear or non-linear? The decision comes when we see new data. Very often the simplest model is better. This principle is formalized in learning theory.
• Explain that this is a global estimator
• Proof of Gini = 2·AUC − 1. Let L denote the area under the lift curve, Hitrate = tp/pos, and Farate = fp/neg. Then:

Selected = sel/tot = (tp + fp)/tot = (pos/tot)·Hitrate + (neg/tot)·Farate
AUC = ∫ Hitrate d(Farate)
L = ∫ Hitrate d(Selected)
  = ∫ Hitrate d((pos/tot)·Hitrate + (neg/tot)·Farate)
  = (pos/tot) ∫ Hitrate d(Hitrate) + (neg/tot) ∫ Hitrate d(Farate)
  = ½·(pos/tot) + (neg/tot)·AUC

Hence, using neg/tot = 1 − pos/tot:

2L − 1 = −(1 − pos/tot) + 2·(1 − pos/tot)·AUC = (1 − pos/tot)·(2·AUC − 1)
Gini = (L − ½) / ((1 − pos/tot)/2) = (2L − 1)/(1 − pos/tot) = 2·AUC − 1
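The Gini = 2·AUC − 1 identity can be checked numerically. A small sketch with made-up labels, where trapezoidal sums stand in for the integrals (all names are illustrative):

```python
def roc_auc(labels_sorted):
    """AUC as the sum of Hitrate * d(Farate) over the ranking (best score first)."""
    pos = sum(labels_sorted)
    neg = len(labels_sorted) - pos
    tp, fp, auc = 0, 0, 0.0
    for y in labels_sorted:
        if y == 1:
            tp += 1
        else:
            fp += 1
            auc += (tp / pos) * (1 / neg)   # Hitrate * d(Farate)
    return auc

def lift_area(labels_sorted):
    """Area under the lift curve: sum of Hitrate * d(Selected), trapezoidal rule."""
    pos = sum(labels_sorted)
    tot = len(labels_sorted)
    tp, area = 0, 0.0
    for y in labels_sorted:
        h_prev = tp / pos
        tp += y
        area += (h_prev + tp / pos) / 2 * (1 / tot)   # Hitrate * d(Selected)
    return area

labels = [1, 1, 0, 1, 0, 0, 0, 0]   # already ranked by decreasing f(x)
pos_frac = sum(labels) / len(labels)
auc = roc_auc(labels)
L = lift_area(labels)
gini = (L - 0.5) / ((1 - pos_frac) / 2)
print(auc, gini, 2 * auc - 1)       # gini equals 2*auc - 1
```

With distinct scores the two areas satisfy the identity exactly, not just approximately, because the trapezoidal sum over the positive steps contributes exactly ½·pos/tot.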
### Introduction

1. Introduction to Machine Learning. Isabelle Guyon, isabelle@clopinet.com
2. What is Machine Learning? (Diagram: TRAINING DATA feeds a learning algorithm, which produces a trained machine; a query to the trained machine yields an answer.)
3. What for? • Classification • Time series prediction • Regression • Clustering
4. Some Learning Machines • Linear models • Kernel methods • Neural networks • Decision trees
5. Applications. (Chart: application domains plotted by number of inputs, 10 to 10^5, vs. number of training examples, 10 to 10^5: Bioinformatics, Ecology, OCR, HWR, Market Analysis, Text Categorization, Machine Vision, System diagnosis.)
6. Banking / Telecom / Retail • Identify: – Prospective customers – Dissatisfied customers – Good customers – Bad payers • Obtain: – More effective advertising – Less credit risk – Less fraud – Decreased churn rate
7. Biomedical / Biometrics • Medicine: – Screening – Diagnosis and prognosis – Drug discovery • Security: – Face recognition – Signature / fingerprint / iris verification – DNA fingerprinting
8. Computer / Internet • Computer interfaces: – Troubleshooting wizards – Handwriting and speech – Brain waves • Internet: – Hit ranking – Spam filtering – Text categorization – Text translation – Recommendation
9. Challenges. (Chart: the same inputs vs. training examples axes, 10 to 10^5, locating the challenge datasets Arcene, Dorothea, Hiva, Sylva, Gisette, Gina, Ada, Dexter, Nova, Madelon.) NIPS 2003 & WCCI 2006.
10. Ten Classification Tasks. (Figure: test BER (%) distributions for ADA, GINA, HIVA, NOVA, SYLVA, ARCENE, DEXTER, DOROTHEA, GISETTE, MADELON.)
11. Challenge Winning Methods. (Bar chart: BER normalized by the average BER, BER/&lt;BER&gt;, for the method families Linear/Kernel, Neural Nets, Trees/RF, and Naïve Bayes, on Gisette (HWR), Gina (HWR), Dexter (Text), Nova (Text), Madelon (Artificial), Arcene (Spectral), Dorothea (Pharma), Hiva (Pharma), Ada (Marketing), Sylva (Ecology).)
12. Conventions: data matrix X = {x_ij} with m rows (patterns x_i) and n columns; target vector y = {y_j}; parameters α and w.
13. Learning Problem (colon cancer data, Alon et al., 1999). Unsupervised learning: is there structure in the data? Supervised learning: predict an outcome y. Data matrix X: m lines = patterns (data points, examples): samples, patients, documents, images, …; n columns = features (attributes, input variables): genes, proteins, words, pixels, …
14. Linear Models • f(x) = w · x + b = Σ_{j=1..n} w_j x_j + b. Linearity in the parameters, NOT in the input components. • f(x) = w · Φ(x) + b = Σ_j w_j φ_j(x) + b (Perceptron) • f(x) = Σ_{i=1..m} α_i k(x_i, x) + b (Kernel method)
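The three linear-in-the-parameters forms on this slide can be sketched directly; the data, weights, basis functions, and kernel below are made up for illustration:

```python
# Three forms of "linear" models: linear in the parameters, not the inputs.

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

x, b = [2.0, 1.0], 0.2

# 1) Linear model: f(x) = w . x + b
w = [0.5, -1.0]
f_linear = dot(w, x) + b

# 2) Perceptron: f(x) = w . phi(x) + b, with hypothetical basis functions phi_j
phi = lambda z: [z[0], z[1], z[0] * z[1]]
w_phi = [0.5, -1.0, 0.3]
f_perceptron = dot(w_phi, phi(x)) + b

# 3) Kernel method: f(x) = sum_i alpha_i k(x_i, x) + b (polynomial kernel, q=2)
k = lambda s, t: dot(s, t) ** 2
X_train = [[1.0, 0.0], [0.0, 1.0]]
alpha = [0.7, -0.4]
f_kernel = sum(a * k(xi, x) for a, xi in zip(alpha, X_train)) + b
print(f_linear, f_perceptron, f_kernel)
```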
15. Artificial Neurons (McCulloch and Pitts, 1943). f(x) = w · x + b. (Diagram: inputs x_1 … x_n and a constant 1, weighted by w_1 … w_n and b, summed by Σ to produce f(x); biological analogy: synapses on dendrites set the cell potential, the activation function drives the axon, activating other neurons.)
16. Linear Decision Boundary. (Figure: 3-D scatter plot of x_1, x_2, x_3 separated by a hyperplane, with a 2-D projection onto x_1, x_2.)
17. Perceptron (Rosenblatt, 1957). f(x) = w · Φ(x) + b. (Diagram: basis functions φ_1(x) … φ_N(x) of the inputs x_1 … x_n, weighted by w_1 … w_N with bias b and summed by Σ.)
18. Non-Linear Decision Boundary. (Figure: 3-D scatter plot of genes Hs.128749, Hs.234680, Hs.7780 separated by a non-linear surface, with a 2-D projection onto x_1, x_2.)
19. Kernel Method (potential functions, Aizerman et al., 1964). f(x) = Σ_i α_i k(x_i, x) + b, where k(·, ·) is a similarity measure or “kernel”. (Diagram: units k(x_1, x) … k(x_m, x) of the inputs x_1 … x_n, weighted by α_1 … α_m with bias b and summed by Σ.)
20. Hebb’s Rule. w_j ← w_j + y_i x_ij. (Diagram: a synapse of weight w_j on the dendrite connects input x_j, the activation of another neuron, to the summing unit Σ with output y on the axon.) Link to “Naïve Bayes”.
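Hebb’s rule as a one-pass update over a toy training set (data made up for illustration):

```python
# Hebb's rule from the slide: w_j <- w_j + y_i * x_ij, once per example.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1, -1, 1]

n = len(X[0])
w = [0.0] * n
for xi, yi in zip(X, y):
    for j in range(n):
        w[j] += yi * xi[j]          # Hebbian update
print(w)
```

Starting from w = 0, this is exactly w = Σ_i y_i x_i, which is what the kernel-trick slide that follows exploits.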
21. Kernel “Trick” (for Hebb’s rule) • Hebb’s rule for the Perceptron: w = Σ_i y_i Φ(x_i), so f(x) = w · Φ(x) = Σ_i y_i Φ(x_i) · Φ(x). • Define a dot product k(x_i, x) = Φ(x_i) · Φ(x); then f(x) = Σ_i y_i k(x_i, x).
22. Kernel “Trick” (general). Dual forms: • f(x) = Σ_i α_i k(x_i, x) • k(x_i, x) = Φ(x_i) · Φ(x) • f(x) = w · Φ(x) • w = Σ_i α_i Φ(x_i)
23. What is a Kernel? A kernel is: • a similarity measure • a dot product in some feature space: k(s, t) = Φ(s) · Φ(t). But we do not need to know the Φ representation. Examples: • k(s, t) = exp(−||s − t||²/σ²), the Gaussian kernel • k(s, t) = (s · t)^q, the polynomial kernel
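The polynomial-kernel example can be verified against an explicit feature map. For 2-d inputs and q = 2, one standard choice of Φ (an illustration, not the only one) is Φ(x) = (x_1², √2·x_1·x_2, x_2²):

```python
import math

# Check that k(s, t) = (s . t)^2 is a dot product Phi(s) . Phi(t)
# in an explicit 3-d feature space, for 2-d inputs.

def k(s, t):
    return (s[0] * t[0] + s[1] * t[1]) ** 2

def phi(x):
    return [x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

s, t = [1.0, 2.0], [3.0, 0.5]
print(k(s, t), dot(phi(s), phi(t)))   # the two values agree
```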
24. Multi-Layer Perceptron (back-propagation, Rumelhart et al., 1986). (Diagram: inputs x_j feed a layer of Σ “hidden units”, internal “latent” variables, whose outputs feed a final Σ unit.)
25. Chessboard Problem
26. Tree Classifiers, e.g. CART (Breiman, 1984) or C4.5 (Quinlan, 1993). At each step, choose the feature that “reduces entropy” most. Work towards “node purity”. (Diagram: all the data, recursively split by choosing f_1 or f_2 at each node.)
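The entropy-reduction criterion can be sketched on a toy split; the features, labels, and threshold below are made up:

```python
import math

# "Reduce entropy most": information gain of a threshold split on one feature.

def entropy(labels):
    m = len(labels)
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / m
        h -= p * math.log2(p)
    return h

def info_gain(labels, feature_values, threshold):
    left = [y for x, y in zip(feature_values, labels) if x <= threshold]
    right = [y for x, y in zip(feature_values, labels) if x > threshold]
    m = len(labels)
    return entropy(labels) - (len(left) / m * entropy(left)
                              + len(right) / m * entropy(right))

y  = [0, 0, 1, 1]
f1 = [1.0, 2.0, 3.0, 4.0]   # separates the classes perfectly at 2.5
f2 = [1.0, 3.0, 2.0, 4.0]   # leaves both children mixed
print(info_gain(y, f1, 2.5), info_gain(y, f2, 2.5))   # the tree picks f1
```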
27. Iris Data (Fisher, 1936). (Figure: decision boundaries of a linear discriminant, a tree classifier, a Gaussian mixture, and a kernel method (SVM) on the classes setosa, virginica, versicolor. Figure from Norbert Jankowski and Krzysztof Grabczewski.)
28. Fit / Robustness Tradeoff. (Figure: two x_1, x_2 plots contrasting decision boundaries of different complexity on the same data.)
29. Performance Evaluation. (Figure: two x_1, x_2 plots with the decision boundary f(x) = 0 and the regions f(x) > 0 and f(x) < 0.)
30. Performance Evaluation. (Same plots with the threshold shifted to f(x) = −1: regions f(x) > −1 and f(x) < −1.)
31. Performance Evaluation. (Same plots with the threshold shifted to f(x) = 1: regions f(x) > 1 and f(x) < 1.)
32. ROC Curve. For a given threshold on f(x), you get a point on the ROC curve: positive class success rate (hit rate, sensitivity) vs. 1 − negative class success rate (false alarm rate, 1 − specificity). (Plot: the actual ROC lies between the random ROC diagonal and the ideal ROC curve; both axes run from 0 to 100%.)
33. ROC Curve. (Same plot, annotated with areas: the ideal ROC curve has AUC = 1, the random ROC has AUC = 0.5, and in general 0 ≤ AUC ≤ 1.)
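Tracing the ROC curve by sweeping the threshold on f(x), as these slides describe; the scores and labels are invented for illustration:

```python
# Each threshold on f(x) yields one (false alarm rate, hit rate) point.
scores = [0.9, 0.8, 0.7, 0.55, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 0, 0]      # sorted by decreasing score
pos = sum(labels)
neg = len(labels) - pos

tp = fp = 0
roc = [(0.0, 0.0)]                     # threshold above every score
for y in labels:                       # lower the threshold one example at a time
    tp += (y == 1)
    fp += (y == 0)
    roc.append((fp / neg, tp / pos))   # (false alarm rate, hit rate)
print(roc)                             # runs from (0, 0) to (1, 1)
```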
34. Lift Curve. Customers are ranked according to f(x); select the top-ranking customers. (Plot: hit rate, good customers selected, vs. fraction of customers selected, from 0 to 100%; the actual lift lies between the random lift diagonal and the ideal lift.) Gini = 2·AUC − 1, with 0 ≤ Gini ≤ 1; on the figure, Gini is the ratio of the area between the actual and random lift curves to the area between the ideal and random lift curves.
35. Performance Assessment. Confusion matrix (the cost matrix has the same layout):

| Truth y \ Prediction F(x) | Class −1 | Class +1 | Total |
|---|---|---|---|
| Class −1 | tn | fp | neg = tn + fp |
| Class +1 | fn | tp | pos = fn + tp |
| Total | rej = tn + fn | sel = fp + tp | m = tn + fp + fn + tp |

Definitions: false alarm rate = fp/neg = type I error rate = 1 − specificity; hit rate = tp/pos = 1 − type II error rate = sensitivity = recall = test power; fraction selected = sel/m; precision = tp/sel.
Compare F(x) = sign(f(x)) to the target y, and report:
• Error rate = (fn + fp)/m
• {Hit rate, False alarm rate} or {Hit rate, Precision} or {Hit rate, Fraction selected}
• Balanced error rate (BER) = (fn/pos + fp/neg)/2 = 1 − (sensitivity + specificity)/2
• F measure = 2·precision·recall / (precision + recall)
Vary the decision threshold θ in F(x) = sign(f(x) + θ), and plot:
• ROC curve: hit rate vs. false alarm rate
• Lift curve: hit rate vs. fraction selected
• Precision/recall curve: hit rate vs. precision
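The slide’s measures, computed from one confusion matrix; the counts are illustrative:

```python
# Metrics from the confusion matrix entries tn, fp, fn, tp (made-up counts).
tn, fp, fn, tp = 50, 10, 5, 35
pos, neg = tp + fn, tn + fp
sel, m = tp + fp, tn + fp + fn + tp

error_rate  = (fn + fp) / m
hit_rate    = tp / pos                    # sensitivity, recall
false_alarm = fp / neg                    # 1 - specificity
precision   = tp / sel
ber         = (fn / pos + fp / neg) / 2   # balanced error rate
f_measure   = 2 * precision * hit_rate / (precision + hit_rate)
print(error_rate, ber, f_measure)
```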
36. What is a Risk Functional? A function of the parameters of the learning machine, assessing how much it is expected to fail on a given task. Examples: • Classification: – Error rate: (1/m) Σ_{i=1..m} 1(F(x_i) ≠ y_i) – 1 − AUC (Gini index = 2·AUC − 1) • Regression: – Mean square error: (1/m) Σ_{i=1..m} (f(x_i) − y_i)²
37. How to Train? • Define a risk functional R[f(x,w)] • Optimize it with respect to w (gradient descent, mathematical programming, simulated annealing, genetic algorithms, etc.). (Plot: R[f(x,w)] over the parameter space w, with its minimum at w*.) (… to be continued in the next lecture)
38. How to Train? • Define a risk functional R[f(x,w)] • Find a method to optimize it, typically “gradient descent”: w_j ← w_j − η ∂R/∂w_j, or any other optimization method (mathematical programming, simulated annealing, genetic algorithms, etc.). (… to be continued in the next lecture)
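A minimal gradient-descent sketch for a mean-square-error risk with a 1-d linear model f(x,w) = w·x; the data, learning rate, and iteration count are all made up:

```python
# Gradient descent on R[f(x,w)] = (1/m) sum_i (f(x_i) - y_i)^2.
X = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]           # generated by w* = 2

w, eta = 0.0, 0.05            # initial weight, learning rate
for _ in range(200):
    # dR/dw = (2/m) * sum_i (w*x_i - y_i) * x_i
    grad = (2 / len(X)) * sum((w * xi - yi) * xi for xi, yi in zip(X, y))
    w -= eta * grad           # w <- w - eta * dR/dw
print(round(w, 3))            # converges near the minimizer w* = 2
```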
39. Summary • With linear threshold units (“neurons”) we can build: – Linear discriminants (including Naïve Bayes) – Kernel methods – Neural networks – Decision trees • The architectural hyper-parameters may include: – The choice of basis functions φ (features) – The kernel – The number of units • Learning means fitting: – Parameters (weights) – Hyper-parameters – Be aware of the fit vs. robustness tradeoff
40. Want to Learn More? • Pattern Classification, R. Duda, P. Hart, and D. Stork. Standard pattern recognition textbook. Limited to classification problems. Matlab code. http://rii.ricoh.com/~stork/DHS.html • The Elements of Statistical Learning: Data Mining, Inference, and Prediction, T. Hastie, R. Tibshirani, and J. Friedman. Standard statistics textbook. Includes all the standard machine learning methods for classification, regression, and clustering. R code. http://www-stat-class.stanford.edu/~tibs/ElemStatLearn/ • Linear Discriminants and Support Vector Machines, I. Guyon and D. Stork. In Smola et al., Eds., Advances in Large Margin Classifiers, pages 147–169, MIT Press, 2000. http://clopinet.com/isabelle/Papers/guyon_stork_nips98.ps.gz • Feature Extraction: Foundations and Applications, I. Guyon et al., Eds. Book for practitioners, with the datasets of the NIPS 2003 challenge, tutorials, best performing methods, Matlab code, and teaching material. http://clopinet.com/fextract-book