Introduction

• Problem: the background image shows in WebEx. I have reduced the image, which should help. Do not mention KXEN.
• Include recommendation systems here.
• Best performance for each type of method, normalized by the average of these performances.
• Which one is best: linear or non-linear? The decision comes when we see new data. Very often the simplest model is better. This principle is implemented in Learning Theory.
• Explain that this is a global estimator
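• Explain that this is a global estimator. Proof of Gini $= 2\,\mathrm{AUC} - 1$, where $L$ is the lift (area under the lift curve):
  $\mathrm{Hitrate} = tp/pos$, $\mathrm{Farate} = fp/neg$
  $\mathrm{Selected} = sel/tot = (tp+fp)/tot = \frac{pos}{tot}\cdot\frac{tp}{pos} + \frac{neg}{tot}\cdot\frac{fp}{neg} = \frac{pos}{tot}\,\mathrm{Hitrate} + \frac{neg}{tot}\,\mathrm{Farate}$
  $\mathrm{AUC} = \int \mathrm{Hitrate}\; d(\mathrm{Farate})$
  $L = \int \mathrm{Hitrate}\; d(\mathrm{Selected}) = \frac{pos}{tot}\int \mathrm{Hitrate}\; d(\mathrm{Hitrate}) + \frac{neg}{tot}\int \mathrm{Hitrate}\; d(\mathrm{Farate}) = \frac{1}{2}\cdot\frac{pos}{tot} + \frac{neg}{tot}\,\mathrm{AUC}$
  $2L - 1 = -\left(1 - \frac{pos}{tot}\right) + 2\left(1 - \frac{pos}{tot}\right)\mathrm{AUC} = \left(1 - \frac{pos}{tot}\right)(2\,\mathrm{AUC} - 1)$
  $\mathrm{Gini} = \dfrac{L - 1/2}{(1 - pos/tot)/2} = \dfrac{2L - 1}{1 - pos/tot} = 2\,\mathrm{AUC} - 1$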
• Introduction

1. Introduction to Machine Learning. Isabelle Guyon, isabelle@clopinet.com
2. What is Machine Learning? [Diagram: TRAINING DATA feed a learning algorithm, which outputs a trained machine; the trained machine turns a query into an answer.]
3. What for?
   • Classification
   • Time series prediction
   • Regression
   • Clustering
4. Some Learning Machines
   • Linear models
   • Kernel methods
   • Neural networks
   • Decision trees
5. Applications [Figure: application domains (Bioinformatics, Ecology, OCR, HWR, Market Analysis, Text Categorization, Machine Vision, System diagnosis) placed by number of inputs and number of training examples; both axes run from 10 to 10^5.]
6. Banking / Telecom / Retail
   • Identify:
      • Prospective customers
      • Dissatisfied customers
      • Good customers
      • Bad payers
   • Obtain:
      • More effective advertising
      • Less credit risk
      • Less fraud
      • Decreased churn rate
7. Biomedical / Biometrics
   • Medicine:
      • Screening
      • Diagnosis and prognosis
      • Drug discovery
   • Security:
      • Face recognition
      • Signature / fingerprint / iris verification
      • DNA fingerprinting
8. Computer / Internet
   • Computer interfaces:
      • Troubleshooting wizards
      • Handwriting and speech
      • Brain waves
   • Internet:
      • Hit ranking
      • Spam filtering
      • Text categorization
      • Text translation
      • Recommendation
9. Challenges (NIPS 2003 & WCCI 2006) [Figure: challenge datasets (Arcene, Dorothea, Hiva, Sylva, Gisette, Gina, Ada, Dexter, Nova, Madelon) placed by number of inputs and number of training examples; both axes run from 10 to 10^5.]
10. Ten Classification Tasks [Figure: one panel per dataset (ADA, GINA, HIVA, NOVA, SYLVA, ARCENE, DEXTER, DOROTHEA, GISETTE, MADELON); axis label: Test BER (%).]
11. Challenge Winning Methods [Figure; axis label: BER/<BER>.]
12. Conventions: $X = \{x_{ij}\}$ ($m$ lines, $n$ columns), $\mathbf{x}_i$, $\mathbf{y} = \{y_j\}$, $\boldsymbol{\alpha}$, $\mathbf{w}$
13. Learning problem (example data: colon cancer, Alon et al 1999)
   • Unsupervised learning: is there structure in data?
   • Supervised learning: predict an outcome $y$.
   • Data matrix $X$:
      • $m$ lines = patterns (data points, examples): samples, patients, documents, images, …
      • $n$ columns = features (attributes, input variables): genes, proteins, words, pixels, …
14. Linear Models
   • $f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b = \sum_{j=1}^{n} w_j x_j + b$
   • Linearity in the parameters, NOT in the input components.
   • $f(\mathbf{x}) = \mathbf{w} \cdot \Phi(\mathbf{x}) + b = \sum_{j} w_j \phi_j(\mathbf{x}) + b$ (Perceptron)
   • $f(\mathbf{x}) = \sum_{i=1}^{m} \alpha_i k(\mathbf{x}_i, \mathbf{x}) + b$ (Kernel method)
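The three forms on this slide can be sketched in a few lines of Python. This is only an illustration: the helper names (`dot`, `linear_model`, `basis_model`, `kernel_model`), the basis `quad_features`, and all weights are made up for the example and do not come from the lecture.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def linear_model(w, b, x):
    """f(x) = w . x + b"""
    return dot(w, x) + b

def basis_model(w, b, phi, x):
    """f(x) = w . Phi(x) + b: linear in w, not necessarily in x."""
    return dot(w, phi(x)) + b

def kernel_model(alpha, b, k, X_train, x):
    """f(x) = sum_i alpha_i k(x_i, x) + b"""
    return sum(a_i * k(x_i, x) for a_i, x_i in zip(alpha, X_train)) + b

def quad_features(x):
    """Hypothetical basis functions: the inputs followed by their squares."""
    return list(x) + [x_j * x_j for x_j in x]

x = [1.0, 2.0]
f_lin = linear_model([0.5, -1.0], 0.25, x)
f_phi = basis_model([0.5, -1.0, 1.0, 0.0], 0.25, quad_features, x)
# Using the plain dot product as the kernel (an assumption made for brevity).
f_ker = kernel_model([1.0, -1.0], 0.25, dot, [[1.0, 0.0], [0.0, 1.0]], x)
print(f_lin, f_phi, f_ker)
```

In all three cases the output is a weighted sum plus a bias; only what gets weighted changes (raw inputs, basis functions, or similarities to training examples).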
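15. Artificial Neurons (McCulloch and Pitts, 1943): $f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b$ [Diagram: inputs $x_1, x_2, \dots, x_n$ and a constant input 1, weighted by $w_1, w_2, \dots, w_n$ and $b$, are summed ($\Sigma$) to give $f(\mathbf{x})$. Labels: Dendrites, Synapses, Cell potential, Activation function, Axon, Activation of other neurons.]
16. Linear Decision Boundary [Figures: a separating hyperplane in the $(x_1, x_2, x_3)$ space and in the $(x_1, x_2)$ plane.]
17. Perceptron (Rosenblatt, 1957): $f(\mathbf{x}) = \mathbf{w} \cdot \Phi(\mathbf{x}) + b$ [Diagram: inputs $x_1, x_2, \dots, x_n$ feed the basis functions $\phi_1(\mathbf{x}), \phi_2(\mathbf{x}), \dots, \phi_N(\mathbf{x})$; these and a constant input 1, weighted by $w_1, w_2, \dots, w_N$ and $b$, are summed ($\Sigma$) to give $f(\mathbf{x})$.]
18. Non-linear (NL) Decision Boundary [Figures: a non-linear boundary in the $(x_1, x_2)$ plane and in the $(x_1, x_2, x_3)$ space.]
19. Kernel Method (potential functions, Aizerman et al 1964): $f(\mathbf{x}) = \sum_{i} \alpha_i k(\mathbf{x}_i, \mathbf{x}) + b$, where $k(\cdot, \cdot)$ is a similarity measure or “kernel”. [Diagram: inputs $x_1, x_2, \dots, x_n$ feed the units $k(\mathbf{x}_1, \mathbf{x}), k(\mathbf{x}_2, \mathbf{x}), \dots, k(\mathbf{x}_m, \mathbf{x})$; these and a constant input 1, weighted by $\alpha_1, \alpha_2, \dots, \alpha_m$ and $b$, are summed ($\Sigma$).]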
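20. Hebb’s Rule
   • $w_j \leftarrow w_j + y_i x_{ij}$
   • Link to “Naïve Bayes”
   [Diagram: input $x_j$ reaches the neuron through a synapse of weight $w_j$; the neuron sums ($\Sigma$) its inputs and outputs $y$. Labels: Dendrite, Synapse, Axon, Activation of another neuron.]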
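A minimal sketch of the update $w_j \leftarrow w_j + y_i x_{ij}$, applied once per training example. The function name `hebb_train` and the toy data are assumptions for illustration; labels are taken to be $\pm 1$ and the weights start at zero.

```python
def hebb_train(X, y):
    """One pass of Hebb's rule: w_j <- w_j + y_i * x_ij for every example i."""
    n = len(X[0])
    w = [0.0] * n
    for x_i, y_i in zip(X, y):
        for j in range(n):
            w[j] += y_i * x_i[j]
    return w

# Toy data (made up): two positive and two negative examples.
X = [[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]]
y = [1, 1, -1, -1]

w = hebb_train(X, y)
predictions = [1 if sum(w_j * x_j for w_j, x_j in zip(w, x_i)) > 0 else -1 for x_i in X]
print(w, predictions)
```

After one pass, $\mathbf{w}$ is simply $\sum_i y_i \mathbf{x}_i$, which is the form the kernel "trick" slide starts from.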
21. Kernel “Trick” (for Hebb’s rule)
   • Hebb’s rule for the Perceptron:
   • $\mathbf{w} = \sum_{i} y_i \Phi(\mathbf{x}_i)$
   • $f(\mathbf{x}) = \mathbf{w} \cdot \Phi(\mathbf{x}) = \sum_{i} y_i \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x})$
   • Define a dot product:
   • $k(\mathbf{x}_i, \mathbf{x}) = \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x})$
   • $f(\mathbf{x}) = \sum_{i} y_i k(\mathbf{x}_i, \mathbf{x})$
22. Kernel “Trick” (general): dual forms
   • $f(\mathbf{x}) = \sum_{i} \alpha_i k(\mathbf{x}_i, \mathbf{x})$
   • $k(\mathbf{x}_i, \mathbf{x}) = \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x})$
   • $f(\mathbf{x}) = \mathbf{w} \cdot \Phi(\mathbf{x})$
   • $\mathbf{w} = \sum_{i} \alpha_i \Phi(\mathbf{x}_i)$
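The equivalence of the two dual forms can be checked numerically. The sketch below assumes 2-D inputs and a degree-2 polynomial kernel, for which an explicit feature map is known; the map `phi`, the coefficients `alpha`, and the data are all illustrative choices, not taken from the slides.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit feature map for 2-D inputs with phi(s) . phi(t) = (s . t)^2."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]

def k(s, t):
    """Degree-2 polynomial kernel; it never builds phi explicitly."""
    return dot(s, t) ** 2

# Made-up training points, coefficients and query.
X = [[1.0, 2.0], [2.0, -1.0], [-1.0, 0.5]]
alpha = [1.0, -1.0, 0.5]
x_new = [0.5, 1.5]

# Form 1: build w = sum_i alpha_i phi(x_i), then f(x) = w . phi(x).
w = [sum(a_i * phi(x_i)[j] for a_i, x_i in zip(alpha, X)) for j in range(3)]
f_primal = dot(w, phi(x_new))

# Form 2: f(x) = sum_i alpha_i k(x_i, x), using only kernel evaluations.
f_dual = sum(a_i * k(x_i, x_new) for a_i, x_i in zip(alpha, X))

print(f_primal, f_dual)
```

Both values agree up to rounding. Setting `alpha` equal to the labels $y_i$ gives the Hebb's rule case.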
23. What is a Kernel?
   • A kernel is:
      • a similarity measure
      • a dot product in some feature space: $k(\mathbf{s}, \mathbf{t}) = \Phi(\mathbf{s}) \cdot \Phi(\mathbf{t})$
   • But we do not need to know the $\Phi$ representation.
   • Examples:
      • $k(\mathbf{s}, \mathbf{t}) = \exp(-\|\mathbf{s} - \mathbf{t}\|^2 / \sigma^2)$ (Gaussian kernel)
      • $k(\mathbf{s}, \mathbf{t}) = (\mathbf{s} \cdot \mathbf{t})^q$ (Polynomial kernel)
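The two example kernels translate directly into code. This is a sketch with assumed function names and default parameter values (`sigma=1.0`, `q=2`); the slide does not prescribe any.

```python
import math

def gaussian_kernel(s, t, sigma=1.0):
    """k(s, t) = exp(-||s - t||^2 / sigma^2)"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(s, t))
    return math.exp(-sq_dist / sigma ** 2)

def polynomial_kernel(s, t, q=2):
    """k(s, t) = (s . t)^q"""
    return sum(a * b for a, b in zip(s, t)) ** q

s, t = [1.0, 2.0], [3.0, 4.0]
k_self = gaussian_kernel(s, s)            # identical points: maximal similarity
k_near = gaussian_kernel([0.0, 0.0], [1.0, 0.0])
k_poly = polynomial_kernel(s, t)
print(k_self, k_near, k_poly)
```

The Gaussian kernel behaves as a similarity measure: it equals 1 for identical points and decays as the points move apart, at a rate set by `sigma`.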
24. Multi-Layer Perceptron (back-propagation, Rumelhart et al, 1986) [Diagram: inputs $x_j$ feed layers of summation ($\Sigma$) units. The “hidden units” are internal “latent” variables.]
25. Chessboard Problem [Figure]
26. Tree Classifiers
   • CART (Breiman, 1984) or C4.5 (Quinlan, 1993)
   • At each step, choose the feature that “reduces entropy” most. Work towards “node purity”.
   [Diagram: starting from all the data in the $(f_1, f_2)$ plane, first choose $f_2$, then choose $f_1$.]
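A minimal sketch of the "choose the feature that reduces entropy most" step, assuming binary splits at a fixed threshold. The function names, the threshold, and the toy data are hypothetical; CART and C4.5 differ in their exact criteria and also search over thresholds.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    m = len(labels)
    return -sum((c / m) * math.log2(c / m) for c in Counter(labels).values())

def entropy_reduction(values, labels, threshold):
    """Entropy before the split minus the weighted entropy after it."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    m = len(labels)
    after = (len(left) / m) * entropy(left) + (len(right) / m) * entropy(right)
    return entropy(labels) - after

# Toy data (made up): f1 separates the classes, f2 does not.
labels = [-1, -1, 1, 1]
features = {"f1": [1.0, 2.0, 3.0, 4.0], "f2": [1.0, 3.0, 2.0, 4.0]}

gains = {name: entropy_reduction(vals, labels, 2.5) for name, vals in features.items()}
best = max(gains, key=gains.get)
print(gains, best)
```

Here splitting on `f1` yields two pure nodes, so it removes the full bit of entropy and is chosen first.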
27. Iris Data (Fisher, 1936) [Figure: decision regions for the classes setosa, virginica and versicolor obtained with four methods: Linear discriminant, Tree classifier, Gaussian mixture, Kernel method (SVM). Figure from Norbert Jankowski and Krzysztof Grabczewski.]
28. Fit / Robustness Tradeoff [Figures: two decision boundaries in the $(x_1, x_2)$ plane.]
29. Performance evaluation [Figures: two panels in the $(x_1, x_2)$ plane showing the boundary $f(\mathbf{x}) = 0$ between the regions $f(\mathbf{x}) > 0$ and $f(\mathbf{x}) < 0$.]
30. Performance evaluation [Figures: the same two panels with the boundary $f(\mathbf{x}) = -1$ between the regions $f(\mathbf{x}) > -1$ and $f(\mathbf{x}) < -1$.]
31. Performance evaluation [Figures: the same two panels with the boundary $f(\mathbf{x}) = 1$ between the regions $f(\mathbf{x}) > 1$ and $f(\mathbf{x}) < 1$.]
32. ROC Curve: for a given threshold on $f(\mathbf{x})$, you get a point on the ROC curve. [Figure: positive class success rate (hit rate, sensitivity) plotted against 1 - negative class success rate (false alarm rate, 1 - specificity), both from 0 to 100%. Curves: Ideal ROC curve, Actual ROC, Random ROC.]
33. ROC Curve: for a given threshold on $f(\mathbf{x})$, you get a point on the ROC curve. $0 \le \mathrm{AUC} \le 1$. [Figure: positive class success rate (hit rate, sensitivity) plotted against 1 - negative class success rate (false alarm rate, 1 - specificity), both from 0 to 100%. Curves: Ideal ROC curve (AUC = 1), Actual ROC, Random ROC (AUC = 0.5).]
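The statement that each threshold on $f(\mathbf{x})$ gives one ROC point can be sketched by sweeping the threshold down through the sorted scores. The function names and the toy scores are assumptions; the sketch also assumes $\pm 1$ labels, both classes present, and no tied scores.

```python
def roc_points(scores, labels):
    """ROC points (false alarm rate, hit rate), one per threshold position."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]          # threshold above every score: nothing selected
    for _, y in ranked:            # lower the threshold past one example at a time
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def area_under(points):
    """Trapezoidal area under a piecewise-linear curve."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up classifier outputs f(x) and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, -1, 1, -1, -1]

points = roc_points(scores, labels)
auc_value = area_under(points)
print(points, auc_value)
```

In this toy case one negative example outranks one positive example, so 8 of the 9 positive/negative pairs are ordered correctly and the AUC is 8/9.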
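34. Lift Curve: customers are ranked according to $f(\mathbf{x})$ and the top-ranking customers are selected. $\mathrm{Gini} = 2\,\mathrm{AUC} - 1$, $0 \le \mathrm{Gini} \le 1$. [Figure: hit rate (fraction of good customers selected) plotted against the fraction of customers selected, both from 0 to 100%. Curves: Ideal Lift, Actual Lift, Random lift. Labeled points: O, M.]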
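The relation Gini = 2 AUC - 1 can be checked numerically on a toy ranking. The sketch below takes the Gini index as the area between the actual and random lift curves divided by the area between the ideal and random lift curves; that definition, the function names, and the data are assumptions made for illustration (no tied scores, labels $\pm 1$).

```python
def area_under(points):
    """Trapezoidal area under a piecewise-linear curve."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def roc_and_lift(scores, labels):
    """Return ROC points, lift points and the fraction of positives."""
    m = len(labels)
    pos = sum(1 for y in labels if y == 1)
    neg = m - pos
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    tp = fp = 0
    roc, lift = [(0.0, 0.0)], [(0.0, 0.0)]
    for _, y in ranked:
        if y == 1:
            tp += 1
        else:
            fp += 1
        roc.append((fp / neg, tp / pos))        # hit rate vs. false alarm rate
        lift.append(((tp + fp) / m, tp / pos))  # hit rate vs. fraction selected
    return roc, lift, pos / m

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, -1, 1, -1, -1]

roc, lift, pos_frac = roc_and_lift(scores, labels)
auc_value = area_under(roc)
lift_area = area_under(lift)

# Random lift has area 1/2; ideal lift exceeds it by (1 - pos/m)/2.
gini_from_lift = (lift_area - 0.5) / ((1.0 - pos_frac) / 2.0)
gini_from_auc = 2.0 * auc_value - 1.0
print(gini_from_lift, gini_from_auc)
```

Both computations give 7/9 on this example, as the identity predicts.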
35. Performance Assessment
   Confusion matrix (cost matrix):

                        Predictions: F(x)
                        Class -1        Class +1        Total
   Truth: y  Class -1   tn              fp              neg = tn + fp
             Class +1   fn              tp              pos = fn + tp
             Total      rej = tn + fn   sel = fp + tp   m = tn + fp + fn + tp

   False alarm rate = fp/neg = type I error rate = 1 - specificity
   Hit rate = tp/pos = 1 - type II error rate = sensitivity = recall = test power
   Frac. selected = sel/m
   Precision = tp/sel

   • Compare $F(\mathbf{x}) = \mathrm{sign}(f(\mathbf{x}))$ to the target $y$, and report:
      • Error rate = (fn + fp)/m
      • {Hit rate, False alarm rate} or {Hit rate, Precision} or {Hit rate, Frac. selected}
      • Balanced error rate (BER) = (fn/pos + fp/neg)/2 = 1 - (sensitivity + specificity)/2
      • F measure = 2 · precision · recall / (precision + recall)
   • Vary the decision threshold $\theta$ in $F(\mathbf{x}) = \mathrm{sign}(f(\mathbf{x}) + \theta)$, and plot:
      • ROC curve: Hit rate vs. False alarm rate
      • Lift curve: Hit rate vs. Fraction selected
      • Precision/recall curve: Hit rate vs. Precision
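The quantities on this slide follow directly from the four confusion-matrix counts. Below is a minimal sketch; the function name `assess`, the dictionary keys, and the toy predictions are made up, and it assumes $\pm 1$ labels with both classes present and at least one example selected (so no division by zero).

```python
def assess(y_true, y_pred):
    """Compute the slide's metrics from +1/-1 targets and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    pos, neg, sel, m = fn + tp, tn + fp, fp + tp, len(pairs)
    hit_rate = tp / pos          # sensitivity, recall
    precision = tp / sel
    return {
        "error_rate": (fn + fp) / m,
        "hit_rate": hit_rate,
        "false_alarm_rate": fp / neg,
        "precision": precision,
        "frac_selected": sel / m,
        "ber": (fn / pos + fp / neg) / 2.0,
        "f_measure": 2.0 * precision * hit_rate / (precision + hit_rate),
    }

# Toy example: 3 positives, 5 negatives, one error of each kind.
y_true = [1, 1, 1, -1, -1, -1, -1, -1]
y_pred = [1, 1, -1, 1, -1, -1, -1, -1]
results = assess(y_true, y_pred)
print(results)
```

With unbalanced classes the error rate (0.25 here) and the balanced error rate (4/15) differ, which is why the BER is reported separately.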
36. What is a Risk Functional?
   • A function of the parameters of the learning machine, assessing how much it is expected to fail on a given task.
   • Examples:
   • Classification:
      • Error rate: $\frac{1}{m} \sum_{i=1}^{m} \mathbf{1}(F(\mathbf{x}_i) \ne y_i)$
      • $1 - \mathrm{AUC}$ (Gini index $= 2\,\mathrm{AUC} - 1$)
   • Regression:
      • Mean square error: $\frac{1}{m} \sum_{i=1}^{m} (f(\mathbf{x}_i) - y_i)^2$
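Two of the example risk functionals, written as empirical averages over a dataset. This is a sketch: the function names, the toy model $f(x) = 2x - 1$, and the data are assumptions for illustration only.

```python
def error_rate(F, X, y):
    """(1/m) * sum_i 1(F(x_i) != y_i)"""
    return sum(1 for x_i, y_i in zip(X, y) if F(x_i) != y_i) / len(y)

def mean_square_error(f, X, y):
    """(1/m) * sum_i (f(x_i) - y_i)^2"""
    return sum((f(x_i) - y_i) ** 2 for x_i, y_i in zip(X, y)) / len(y)

def f(x):
    """Hypothetical trained model with one input."""
    return 2.0 * x - 1.0

def F(x):
    """Classifier obtained by thresholding f at zero."""
    return 1 if f(x) > 0 else -1

X = [0.0, 1.0, 2.0]
mse = mean_square_error(f, X, [-1.0, 1.0, 2.0])   # regression targets
err = error_rate(F, X, [-1, 1, -1])               # classification targets
print(mse, err)
```

Both risks depend on the model only through its outputs on the data, so they can be compared across very different learning machines.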
37. How to Train?
   • Define a risk functional $R[f(\mathbf{x}, \mathbf{w})]$
   • Optimize it w.r.t. $\mathbf{w}$ (gradient descent, mathematical programming, simulated annealing, genetic algorithms, etc.)
   • (… to be continued in the next lecture)
   [Figure: $R[f(\mathbf{x}, \mathbf{w})]$ plotted over the parameter space ($\mathbf{w}$), with its minimum at $\mathbf{w}^*$.]
38. How to Train?
   • Define a risk functional $R[f(\mathbf{x}, \mathbf{w})]$
   • Find a method to optimize it, typically “gradient descent”:
   • $w_j \leftarrow w_j - \eta \, \partial R / \partial w_j$
   • or any optimization method (mathematical programming, simulated annealing, genetic algorithms, etc.)
   • (… to be continued in the next lecture)
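A minimal sketch of the update $w_j \leftarrow w_j - \eta\, \partial R / \partial w_j$, assuming the risk is the mean square error of a linear model $f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b$. The function name, the learning rate `eta`, the number of steps, and the data are illustrative choices, not values from the lecture.

```python
def gradient_descent(X, y, eta=0.1, steps=1000):
    """Minimize R = (1/m) sum_i (w . x_i + b - y_i)^2 by gradient descent."""
    m, n = len(X), len(X[0])
    w, b = [0.0] * n, 0.0
    for _ in range(steps):
        residuals = [sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b - y_i
                     for x_i, y_i in zip(X, y)]
        # dR/dw_j = (2/m) sum_i residual_i * x_ij ; dR/db = (2/m) sum_i residual_i
        grad_w = [(2.0 / m) * sum(r * x_i[j] for r, x_i in zip(residuals, X))
                  for j in range(n)]
        grad_b = (2.0 / m) * sum(residuals)
        w = [w_j - eta * g_j for w_j, g_j in zip(w, grad_w)]
        b -= eta * grad_b
    return w, b

# Toy data generated from y = 2x + 1 (made up, noise-free).
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]

w, b = gradient_descent(X, y)
print(w, b)
```

The step size matters: too large a value of `eta` makes the iteration diverge, too small a value makes it slow. On this toy problem 0.1 is small enough to converge.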
39. Summary
   • With linear threshold units (“neurons”) we can build:
      • Linear discriminant (including Naïve Bayes)
      • Kernel methods
      • Neural networks
      • Decision trees
   • The architectural hyper-parameters may include:
      • The choice of basis functions $\phi$ (features)
      • The kernel
      • The number of units
   • Learning means fitting:
      • Parameters (weights)
      • Hyper-parameters
      • Be aware of the fit vs. robustness tradeoff
40. Want to Learn More?
   • Pattern Classification, R. Duda, P. Hart, and D. Stork. Standard pattern recognition textbook. Limited to classification problems. Matlab code. http://rii.ricoh.com/~stork/DHS.html
   • The Elements of Statistical Learning: Data Mining, Inference, and Prediction, T. Hastie, R. Tibshirani, and J. Friedman. Standard statistics textbook. Includes all the standard machine learning methods for classification, regression, clustering. R code. http://www-stat-class.stanford.edu/~tibs/ElemStatLearn/
   • Linear Discriminants and Support Vector Machines, I. Guyon and D. Stork. In Smola et al, Eds., Advances in Large Margin Classifiers, pages 147-169, MIT Press, 2000. http://clopinet.com/isabelle/Papers/guyon_stork_nips98.ps.gz
   • Feature Extraction: Foundations and Applications, I. Guyon et al, Eds. Book for practitioners with datasets of the NIPS 2003 challenge, tutorials, best performing methods, Matlab code, teaching material. http://clopinet.com/fextract-book