Introduction

Speaker notes
  • Problem: the background image shows in WebEx. I've reduced the image; hope it helps. Do not mention KXEN.
  • Include recommendation systems here.
  • Best performance for each type of method, normalized by the average of these performances.
  • Which one is best: linear or non-linear? The decision comes when we see new data. Very often the simplest model is better. This principle is implemented in learning theory.
  • Explain that this is a global estimator.
  • Proof of Gini = 2 AUC − 1. Let L = lift, Hitrate = tp/pos, Farate = fp/neg, and Selected = sel/tot = (tp + fp)/tot = (pos/tot) Hitrate + (neg/tot) Farate. Then AUC = ∫ Hitrate d(Farate), and L = ∫ Hitrate d(Selected) = (pos/tot) ∫ Hitrate d(Hitrate) + (neg/tot) ∫ Hitrate d(Farate) = (1/2)(pos/tot) + (neg/tot) AUC. Hence 2L − 1 = pos/tot + 2(1 − pos/tot) AUC − 1 = −(1 − pos/tot) + 2(1 − pos/tot) AUC = (1 − pos/tot)(2 AUC − 1), and Gini = (L − 1/2) / ((1 − pos/tot)/2) = (2L − 1)/(1 − pos/tot) = 2 AUC − 1.
  • Transcript

    • 1. Introduction to Machine Learning. Isabelle Guyon (isabelle@clopinet.com)
    • 2. What is Machine Learning? [Diagram: TRAINING DATA feeds a learning algorithm, which produces a trained machine; the trained machine maps a query to an answer.]
    • 3. What for? Classification; time series prediction; regression; clustering.
    • 4. Some Learning Machines: linear models; kernel methods; neural networks; decision trees.
    • 5. Applications. [Chart: application domains (Bioinformatics, Ecology, OCR, HWR, Market Analysis, Text Categorization, Machine Vision, System diagnosis) placed by number of inputs vs. number of training examples, both on log scales from 10 to 10^5.]
    • 6. Banking / Telecom / Retail. Identify: prospective customers; dissatisfied customers; good customers; bad payers. Obtain: more effective advertising; less credit risk; less fraud; decreased churn rate.
    • 7. Biomedical / Biometrics. Medicine: screening; diagnosis and prognosis; drug discovery. Security: face recognition; signature / fingerprint / iris verification; DNA fingerprinting.
    • 8. Computer / Internet. Computer interfaces: troubleshooting wizards; handwriting and speech; brain waves. Internet: hit ranking; spam filtering; text categorization; text translation; recommendation.
    • 9. Challenges (NIPS 2003 & WCCI 2006). [Chart: the challenge datasets (Arcene, Dorothea, Hiva, Sylva, Gisette, Gina, Ada, Dexter, Nova, Madelon) placed by number of inputs vs. number of training examples, both on log scales from 10 to 10^5.]
    • 10. Ten Classification Tasks. [Charts: test BER (%) distributions for the ten challenge datasets ADA, GINA, HIVA, NOVA, SYLVA, ARCENE, DEXTER, DOROTHEA, GISETTE, and MADELON.]
    • 11. Challenge Winning Methods. [Chart: BER/<BER>, the best BER for each type of method normalized by the average over methods.]
    • 12. Conventions. Data matrix X = {x_ij} with m lines and n columns; patterns x_i; targets y = {y_j}; weight vector w.
    • 13. Learning Problem (colon cancer data, Alon et al., 1999). Unsupervised learning: is there structure in the data? Supervised learning: predict an outcome y. Data matrix X: m lines = patterns (data points, examples): samples, patients, documents, images, ...; n columns = features (attributes, input variables): genes, proteins, words, pixels, ...
    • 14. Linear Models. f(x) = w · x + b = Σ_{j=1..n} w_j x_j + b. Linearity in the parameters, NOT in the input components. f(x) = w · Φ(x) + b = Σ_j w_j φ_j(x) + b (Perceptron). f(x) = Σ_{i=1..m} α_i k(x_i, x) + b (kernel method).
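The linear model on this slide can be sketched in a few lines of Python (a minimal illustration, not the speaker's code; all names are made up):

```python
# Sketch of the slide's linear model f(x) = w . x + b.
# `linear_predict` and `classify` are illustrative names, not from the talk.
def linear_predict(w, b, x):
    """Return the decision value w . x + b for one input vector x."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def classify(w, b, x):
    """Assign class +1 if f(x) > 0, else class -1."""
    return 1 if linear_predict(w, b, x) > 0 else -1
```

For example, with w = [1, 2] and b = 1, the point x = [3, 4] gets the decision value 12 and is classified as +1.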
    • 15. Artificial Neurons (McCulloch and Pitts, 1943). f(x) = w · x + b. [Diagram: inputs x_1 ... x_n and a constant 1 enter with weights w_1 ... w_n and bias b; dendrites and synapses carry the activation of other neurons, the cell potential passes through an activation function, and the axon outputs f(x).]
    • 16. Linear Decision Boundary. [Plots: a separating hyperplane in the (x_1, x_2) plane and in (x_1, x_2, x_3) space.]
    • 17. Perceptron (Rosenblatt, 1957). f(x) = w · Φ(x) + b. [Diagram: inputs x_1 ... x_n and a constant 1 pass through basis functions φ_1(x) ... φ_N(x), which are combined with weights w_1 ... w_N and bias b into f(x).]
    • 18. Non-Linear Decision Boundary. [Plots: a curved decision boundary in the (x_1, x_2) plane and in (x_1, x_2, x_3) space.]
    • 19. Kernel Method (potential functions, Aizerman et al., 1964). f(x) = Σ_i α_i k(x_i, x) + b, where k(·,·) is a similarity measure or "kernel". [Diagram: inputs x_1 ... x_n pass through k(x_1, x) ... k(x_m, x), combined with weights α_1 ... α_m and bias b.]
    • 20. Hebb's Rule. w_j ← w_j + y_i x_ij. Link to "Naïve Bayes". [Diagram: the activation x_j of another neuron reaches the synapse of weight w_j through axon and dendrite; the update strengthens w_j in proportion to y x_j.]
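The update rule on this slide is easy to state in code (a minimal sketch under the talk's conventions; the function name is made up):

```python
# Hebb's rule from the slide: w_j <- w_j + y_i * x_ij,
# applied for one training example (x_i, y_i). `hebb_update` is illustrative.
def hebb_update(w, x_i, y_i):
    """Update the weight vector w in place and return it."""
    for j in range(len(w)):
        w[j] += y_i * x_i[j]
    return w
```

Presenting a positive example (y_i = +1) strengthens the weights in the direction of x_i; a negative example weakens them.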
    • 21. Kernel "Trick" (for Hebb's rule). Hebb's rule for the Perceptron: w = Σ_i y_i Φ(x_i), so f(x) = w · Φ(x) = Σ_i y_i Φ(x_i) · Φ(x). Define a dot product k(x_i, x) = Φ(x_i) · Φ(x); then f(x) = Σ_i y_i k(x_i, x).
    • 22. Kernel "Trick" (general). Dual forms: f(x) = Σ_i α_i k(x_i, x) with k(x_i, x) = Φ(x_i) · Φ(x) is equivalent to f(x) = w · Φ(x) with w = Σ_i α_i Φ(x_i).
    • 23. What is a Kernel? A kernel is: a similarity measure; a dot product in some feature space, k(s, t) = Φ(s) · Φ(t). But we do not need to know the Φ representation. Examples: k(s, t) = exp(−||s − t||^2 / σ^2) (Gaussian kernel); k(s, t) = (s · t)^q (polynomial kernel).
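The two example kernels, and the kernel-machine prediction f(x) = Σ_i α_i k(x_i, x) + b from slide 19, can be sketched as follows (illustrative code, not from the talk):

```python
import math

def gaussian_kernel(s, t, sigma=1.0):
    """Gaussian kernel from the slide: k(s, t) = exp(-||s - t||^2 / sigma^2)."""
    sq_dist = sum((si - ti) ** 2 for si, ti in zip(s, t))
    return math.exp(-sq_dist / sigma ** 2)

def polynomial_kernel(s, t, q=2):
    """Polynomial kernel from the slide: k(s, t) = (s . t)^q."""
    return sum(si * ti for si, ti in zip(s, t)) ** q

def kernel_predict(alphas, X_train, b, x, kernel=gaussian_kernel):
    """Kernel machine: f(x) = sum_i alpha_i k(x_i, x) + b."""
    return sum(a * kernel(xi, x) for a, xi in zip(alphas, X_train)) + b
```

Note that neither kernel requires computing the feature map Φ explicitly, which is the point of the "trick".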
    • 24. Multi-Layer Perceptron (back-propagation, Rumelhart et al., 1986). [Diagram: inputs x_j feed a layer of "hidden units", internal "latent" variables, whose outputs are combined through Σ into f(x).]
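A forward pass through such a network can be sketched as follows (a toy illustration with tanh hidden units; weights and names are made up, and back-propagation itself is left to the next lecture):

```python
import math

def mlp_forward(x, W_hidden, b_hidden, w_out, b_out):
    """One hidden layer of tanh units, then a linear output f(x)."""
    hidden = [math.tanh(sum(wj * xj for wj, xj in zip(row, x)) + bj)
              for row, bj in zip(W_hidden, b_hidden)]
    return sum(wo * h for wo, h in zip(w_out, hidden)) + b_out
```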
    • 25. Chessboard Problem
    • 26. Tree Classifiers. CART (Breiman, 1984) or C4.5 (Quinlan, 1993). At each step, choose the feature that "reduces entropy" most. Work towards "node purity". [Diagram: all the data is first split on f_2, then on f_1.]
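The "reduce entropy" criterion can be made concrete (an illustrative sketch, not the CART or C4.5 implementation):

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(labels, left, right):
    """Entropy reduction achieved by splitting `labels` into `left`/`right`."""
    n = len(labels)
    return (entropy(labels)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))
```

At each node, the tree picks the feature (and threshold) whose split maximizes this gain; a pure node has entropy 0.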
    • 27. Iris Data (Fisher, 1936). [Plots: decision boundaries of a linear discriminant, a tree classifier, a Gaussian mixture, and a kernel method (SVM) on the setosa, versicolor, and virginica classes. Figure from Norbert Jankowski and Krzysztof Grabczewski.]
    • 28. Fit / Robustness Tradeoff. [Plots: two decision boundaries in the (x_1, x_2) plane, one fitting the training points tightly and one smoother.]
    • 29. Performance Evaluation. [Plots: the decision boundary f(x) = 0 separating the regions f(x) > 0 and f(x) < 0 in the (x_1, x_2) plane.]
    • 30. Performance Evaluation. [Plots: the level set f(x) = −1 separating f(x) > −1 from f(x) < −1.]
    • 31. Performance Evaluation. [Plots: the level set f(x) = 1 separating f(x) > 1 from f(x) < 1.]
    • 32. ROC Curve. For a given threshold on f(x), you get a point on the ROC curve: the positive class success rate (hit rate, sensitivity) plotted against 1 − the negative class success rate (false alarm rate, 1 − specificity), both from 0 to 100%. [Plot: actual ROC between the random ROC and the ideal ROC curve.]
    • 33. ROC Curve. 0 ≤ AUC ≤ 1. [Plot: the ideal ROC curve (AUC = 1), an actual ROC, and the random ROC (AUC = 0.5).]
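The threshold sweep described on these two slides can be sketched directly (illustrative code; ties between scores are ignored for simplicity):

```python
def roc_points(scores, labels):
    """Sweep the threshold over decreasing scores; labels are in {-1, +1}.
    Returns (false alarm rate, hit rate) points along the ROC curve."""
    pos = sum(1 for y in labels if y == +1)
    neg = len(labels) - pos
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:                      # each prefix = one threshold
        if labels[i] == +1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

A perfect ranking gives AUC = 1; a chance-level ranking hovers around the slide's random ROC with AUC = 0.5.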
    • 34. Lift Curve. Customers are ranked according to f(x) and the top-ranking customers are selected. Hit rate = fraction of good customers selected, plotted against the fraction of customers selected. Gini = 2 AUC − 1, with 0 ≤ Gini ≤ 1. [Plot: actual lift between the random lift and the ideal lift.]
    • 35. Performance Assessment. Compare F(x) = sign(f(x)) to the target y, and report:
      - Error rate = (fn + fp)/m
      - {Hit rate, False alarm rate} or {Hit rate, Precision} or {Hit rate, Frac. selected}
      - Balanced error rate (BER) = (fn/pos + fp/neg)/2 = 1 − (sensitivity + specificity)/2
      - F measure = 2 precision · recall / (precision + recall)
      Vary the decision threshold θ in F(x) = sign(f(x) + θ), and plot:
      - ROC curve: Hit rate vs. False alarm rate
      - Lift curve: Hit rate vs. Fraction selected
      - Precision/recall curve: Hit rate vs. Precision
      Confusion (cost) matrix, truth y in rows, predictions F(x) in columns:
                  F(x) = −1       F(x) = +1       Total
      y = −1      tn              fp              neg = tn + fp
      y = +1      fn              tp              pos = fn + tp
      Total       rej = tn + fn   sel = fp + tp   m = tn + fp + fn + tp
      False alarm rate = fp/neg = type I error rate = 1 − specificity. Hit rate = tp/pos = 1 − type II error rate = sensitivity = recall = test power. Precision = tp/sel. Fraction selected = sel/m.
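The quantities on this slide follow mechanically from the four confusion-matrix counts (a minimal sketch; the function name and dictionary keys are made up):

```python
def metrics(tn, fp, fn, tp):
    """Derive the slide's performance measures from confusion-matrix counts."""
    pos, neg = tp + fn, tn + fp          # class totals
    m = pos + neg                        # number of examples
    sel = tp + fp                        # number selected (predicted +1)
    hit_rate = tp / pos                  # sensitivity = recall
    false_alarm = fp / neg               # 1 - specificity
    precision = tp / sel
    return {
        "hit_rate": hit_rate,
        "false_alarm": false_alarm,
        "precision": precision,
        "error_rate": (fn + fp) / m,
        "BER": (fn / pos + fp / neg) / 2,
        "F": 2 * precision * hit_rate / (precision + hit_rate),
    }
```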
    • 36. What is a Risk Functional? A function of the parameters of the learning machine, assessing how much it is expected to fail on a given task. Examples: classification: error rate = (1/m) Σ_{i=1..m} 1(F(x_i) ≠ y_i), or 1 − AUC (Gini index = 2 AUC − 1); regression: mean square error = (1/m) Σ_{i=1..m} (f(x_i) − y_i)^2.
    • 37. How to Train? Define a risk functional R[f(x, w)]; optimize it w.r.t. w (gradient descent, mathematical programming, simulated annealing, genetic algorithms, etc.). [Plot: R[f(x, w)] over parameter space (w), with minimum at w*.] (... to be continued in the next lecture)
    • 38. How to Train? Define a risk functional R[f(x, w)]; find a method to optimize it, typically "gradient descent": w_j ← w_j − η ∂R/∂w_j, or any optimization method (mathematical programming, simulated annealing, genetic algorithms, etc.). (... to be continued in the next lecture)
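As a toy instance of this recipe, here is batch gradient descent on the mean-square-error risk of a linear model (an illustrative sketch; the learning rate eta, the epoch count, and all names are made up):

```python
def train_linear(X, y, eta=0.1, epochs=500):
    """Minimize R[w, b] = (1/m) sum_i (w . x_i + b - y_i)^2 using the
    slide's update w_j <- w_j - eta * dR/dw_j (and likewise for b)."""
    n, m = len(X[0]), len(X)
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) + b - yi
            for j in range(n):
                grad_w[j] += 2 * err * xi[j] / m
            grad_b += 2 * err / m
        w = [wj - eta * gj for wj, gj in zip(w, grad_w)]
        b -= eta * grad_b
    return w, b
```

On data generated by y = 2x + 1 this converges to w ≈ [2] and b ≈ 1.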
    • 39. Summary. With linear threshold units ("neurons") we can build: linear discriminants (including Naïve Bayes); kernel methods; neural networks; decision trees. The architectural hyper-parameters may include: the choice of basis functions Φ (features); the kernel; the number of units. Learning means fitting: parameters (weights); hyper-parameters. Be aware of the fit vs. robustness tradeoff.
    • 40. Want to Learn More?
      - Pattern Classification, R. Duda, P. Hart, and D. Stork. Standard pattern recognition textbook. Limited to classification problems. Matlab code. http://rii.ricoh.com/~stork/DHS.html
      - The Elements of Statistical Learning: Data Mining, Inference, and Prediction, T. Hastie, R. Tibshirani, and J. Friedman. Standard statistics textbook. Includes all the standard machine learning methods for classification, regression, and clustering. R code. http://www-stat-class.stanford.edu/~tibs/ElemStatLearn/
      - Linear Discriminants and Support Vector Machines, I. Guyon and D. Stork. In Smola et al., Eds., Advances in Large Margin Classifiers, pages 147-169, MIT Press, 2000. http://clopinet.com/isabelle/Papers/guyon_stork_nips98.ps.gz
      - Feature Extraction: Foundations and Applications, I. Guyon et al., Eds. Book for practitioners, with the datasets of the NIPS 2003 challenge, tutorials, best performing methods, Matlab code, and teaching material. http://clopinet.com/fextract-book
