Modern Machine Learning Regression Methods

AACIMP 2009 Summer School lecture by Igor Tetko. "Environmental Chemoinformatics" course.


Modern Machine Learning Regression Methods

  1. Modern Machine Learning Regression Methods*. Igor & Igor**. *Based on our lecture at the Strasbourg Chemoinformatics School. **Igor Baskin, MSU.
  2. The Goal of Learning. The goal of statistical learning (in chemoinformatics) consists in finding a function f relating the value of some property y (which can be a physicochemical property, biological activity, etc.) to the values of descriptors x1, ..., xM (which can describe chemical compounds, reactions, etc.): y = f(x1, ..., xM). Continuous y are handled by regression analysis, function approximation, etc.; discrete y are handled by discriminant analysis, classification, pattern recognition, etc.
  3. The Goal of Learning. So, the goal of statistical learning is to find such a function class F and such parameters c1, ..., cP that minimize the risk function and therefore provide a model with the highest predictive ability: R(c1, ..., cP) -> min. In classical statistics it is assumed that minimizing the empirical risk, R_emp(c1, ..., cP) -> min, also gives R(c1, ..., cP) -> min. Is this correct? Almost correct for big data sets, and may not be correct for small data sets.
  4. Incorrectness of Empirical Risk Minimization. Data approximation with polynomials of different order: Pol(x, p) = c0 + sum_{i=1..p} ci * x^i. Minimization of the empirical risk function does not guarantee the best predictive performance of the model.
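
A minimal sketch of this point in Python (the noisy one-dimensional data set below is made up for illustration, not taken from the lecture): as the polynomial order grows, the training (empirical) error falls while the error on held-out points eventually rises.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noisy data: y = sin(x) + noise (illustrative only)
x_train = np.linspace(-3, 3, 15)
y_train = np.sin(x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(-3, 3, 200)
y_test = np.sin(x_test)

for p in (1, 3, 6, 9):
    c = np.polyfit(x_train, y_train, deg=p)        # least-squares fit = empirical risk minimization
    train_rmse = np.sqrt(np.mean((np.polyval(c, x_train) - y_train) ** 2))
    test_rmse = np.sqrt(np.mean((np.polyval(c, x_test) - y_test) ** 2))
    print(f"order {p:2d}: train RMSE {train_rmse:.3f}, test RMSE {test_rmse:.3f}")
```
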
  5. Occam's Razor. "Entia non sunt multiplicanda praeter necessitatem": entities should not be multiplied beyond necessity. All things being equal, the simplest solution tends to be the best one. William of Ockham (c. 1285–1349).
  6. Machine Learning Regression Methods
    • Multiple Linear Regression (MLR)
    • Kernel version of this method
    • K Nearest Neighbours (kNN)
    • Back-Propagation Neural Network (BPNN)
    • Associative Neural Network (ASNN)
  7. Multiple Linear Regression. Model: y = c0 + c1*x1 + ... + cM*xM, i.e. Y = XC in matrix form, with the least-squares solution C = (X^T X)^(-1) X^T Y. Here Y = (y1, ..., yN)^T holds the experimental property values, C = (c0, c1, ..., cM)^T the regression coefficients, and row n of X is (1, x1^n, ..., xM^n), the descriptor values. Requires M < N. Topliss: M < N/5 for good models.
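
A minimal numerical sketch of the normal-equation solution above (the descriptor data are random placeholders, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data set: N = 50 compounds, M = 3 descriptors (illustrative only)
N, M = 50, 3
X_desc = rng.normal(size=(N, M))
true_c = np.array([0.5, 1.0, -2.0, 0.3])           # c0, c1, c2, c3
y = true_c[0] + X_desc @ true_c[1:] + rng.normal(0, 0.1, N)

X = np.hstack([np.ones((N, 1)), X_desc])           # design matrix with a leading column of ones for c0
C = np.linalg.solve(X.T @ X, X.T @ y)              # C = (X^T X)^(-1) X^T Y, solved without an explicit inverse
print("estimated coefficients:", np.round(C, 3))
```
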
  8. Mystery of the "Rule of 5". John G. Topliss. J.G. Topliss & R.J. Costello, JMC, 1972, Vol. 15, No. 10, P. 1066-1068.
  9. Mystery of the "Rule of 5". For example, if R2 = 0.40 is regarded as the maximum acceptable level of chance correlation, then the minimum number of observations required to test five variables is about 30, for 10 variables 50 observations, for 20 variables 65 observations, and for 30 variables 85 observations. John G. Topliss. J.G. Topliss & R.J. Costello, JMC, 1972, Vol. 15, No. 10, P. 1066-1068.
  10. Mystery of the "Rule of 5". C. Hansch, K.W. Kim, R.H. Sarma, JACS, 1973, Vol. 95, No. 19, 6447-6449: "Topliss and Costello have pointed out the danger of finding meaningless chance correlations with three or four data points per variable. The correlation coefficient is good and there are almost five data points per variable." Topliss and Hansch: M < N/5 for good models.
  11. Kernel Ridge Regression
    • Regression + regularization: min_w sum_i (yi - <w, xi>)^2 + lambda*||w||^2; in feature space: min_w sum_i (yi - <w, phi(xi)>)^2 + lambda*||w||^2
    • Avoids large, canceling coefficients
    • Bias unnecessary if samples and labels are centered (in feature space)
    • Solution lies in the span of the samples: alpha = (K + lambda*I_n)^(-1) y, where K is the kernel matrix and I the identity matrix
    • Predict new samples x as f(x) = sum_i alpha_i <xi, x>
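
A minimal sketch of these formulas (the Gaussian kernel, its width, and the toy data are assumptions for illustration; they are not prescribed by the lecture):

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(40, 1))                    # hypothetical training samples
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 40)            # hypothetical labels (already roughly centered)

lam = 0.1                                               # regularization strength lambda
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)    # alpha = (K + lambda*I_n)^(-1) y

X_new = np.array([[0.0], [1.5]])
print(rbf_kernel(X_new, X) @ alpha)                     # f(x) = sum_i alpha_i k(xi, x)
```
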
  12. K Nearest Neighbors. Euclidean distance: D_ij = sqrt( sum_{k=1..M} (x_k^i - x_k^j)^2 ); Manhattan distance: D_ij = sum_{k=1..M} |x_k^i - x_k^j|. Non-weighted prediction: y_i^pred = (1/k) * sum over the k neighbours of y_j. Weighted prediction: y_i^pred = sum over the k neighbours of (1/D_ij) * y_j, divided by sum over the k neighbours of (1/D_ij).
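
A minimal sketch of distance-weighted kNN regression following the formulas above (the descriptor matrix and property values are random placeholders):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3, weighted=True):
    """Plain or distance-weighted k-nearest-neighbour regression for one query point."""
    d = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))   # Euclidean distances D_ij
    idx = np.argsort(d)[:k]                                # indices of the k nearest neighbours
    if not weighted:
        return y_train[idx].mean()                         # non-weighted average
    w = 1.0 / (d[idx] + 1e-12)                             # 1/D_ij weights (epsilon avoids division by zero)
    return (w * y_train[idx]).sum() / w.sum()

# Hypothetical descriptors and property values (illustrative only)
rng = np.random.default_rng(3)
X_train = rng.normal(size=(30, 4))
y_train = X_train[:, 0] - 0.5 * X_train[:, 1] + rng.normal(0, 0.1, 30)
print(knn_predict(X_train, y_train, X_train[0], k=5))
```
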
  13. Overfitting by Variable Selection in kNN. Golbraikh A., Tropsha A. Beware of q2! JMGM, 2002, 20, 269-276.
  14. Artificial Neuron. [Diagram: neuron i receives inputs o1, ..., o5 through weights w_i1, ..., w_i5.] Net input: net_i = sum_j w_ji * o_j - t_i; output: o_i = f(net_i). The transfer function f can be a threshold function, f(x) = 1 for x >= 0 and f(x) = 0 for x < 0, or the sigmoid f(x) = 1 / (1 + e^-x).
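
A minimal sketch of a single neuron's forward pass with the two transfer functions above (inputs, weights, and threshold are made-up numbers):

```python
import numpy as np

def neuron_output(inputs, weights, threshold, transfer="sigmoid"):
    """Forward pass of one artificial neuron: net = sum_j w_j*o_j - t, output = f(net)."""
    net = np.dot(weights, inputs) - threshold
    if transfer == "sigmoid":
        return 1.0 / (1.0 + np.exp(-net))       # f(x) = 1 / (1 + e^-x)
    return 1.0 if net >= 0 else 0.0             # threshold (step) transfer function

# Hypothetical inputs o1..o5 and weights w_i1..w_i5 (illustrative only)
o = np.array([0.2, 0.7, 0.1, 0.9, 0.4])
w = np.array([0.5, -0.3, 0.8, 0.1, -0.6])
print(neuron_output(o, w, threshold=0.2))
```
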
  15. Multilayer Neural Network. [Diagram: input layer, hidden layer, output layer.] Neurons in the input layer correspond to descriptors, neurons in the output layer to the properties being predicted, and neurons in the hidden layer to nonlinear latent variables.
  16. Generalized Delta Rule. This is an application of the steepest descent method to training back-propagation neural networks: each weight is changed against the gradient of the empirical risk, delta_w_ij = -eta * dR_emp/dw_ij = eta * delta_j * y_i, where delta_j contains the derivative of the transfer function, df_j(e)/de, and eta is the learning rate constant. History: 1974, Paul Werbos; 1986, David Rumelhart, Geoffrey Hinton, James McClelland.
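
A compact sketch of steepest-descent training of a tiny one-hidden-layer network (the architecture, data, and learning rate are made-up illustrations; only the gradient-based weight update is shown, not the lecture's exact implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(4)

# Hypothetical tiny network: 3 descriptors -> 4 hidden neurons -> 1 output (illustrative only)
W1, W2 = rng.normal(0, 0.5, (3, 4)), rng.normal(0, 0.5, (4, 1))
X = rng.normal(size=(20, 3))
y = X[:, :1] - 0.5 * X[:, 1:2]                     # hypothetical target property

eta = 0.1                                          # learning rate constant
for epoch in range(200):
    h = sigmoid(X @ W1)                            # hidden layer activations
    out = h @ W2                                   # linear output neuron
    err = out - y                                  # residuals entering the empirical risk
    # Steepest descent: w <- w - eta * dR_emp/dw (the generalized delta rule)
    grad_W2 = h.T @ err / len(X)
    grad_W1 = X.T @ ((err @ W2.T) * h * (1 - h)) / len(X)
    W2 -= eta * grad_W2
    W1 -= eta * grad_W1
print("final RMSE:", float(np.sqrt(np.mean((sigmoid(X @ W1) @ W2 - y) ** 2))))
```
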
  17. Multilayer Neural Network. [Diagram: input layer, hidden layer, output layer.] The number of weights corresponds to the number of adjustable parameters of the method.
  18. Origin of the "Rule of 2". The number of weights (adjustable parameters) for the case of one hidden layer is W = (I+1)H + (H+1)O. Parameter rho = N/W, recommended to lie between 1.8 and 2.2. T.A. Andrea and H. Kalayeh, J. Med. Chem., 1991, 34, 2824-2836. End of the "Rule of 2": I.V. Tetko, D.J. Livingstone, A.I. Luik, J. Chem. Inf. Comput. Sci., 1995, 35, 826-833; Baskin, I.I. et al., Foundations Comput. Decision Sci., 1997, v. 22, No. 2, p. 107-116.
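
A one-line sketch of the weight count above (the network size and number of training cases are arbitrary example values):

```python
def n_weights(n_inputs, n_hidden, n_outputs):
    """Adjustable weights of a one-hidden-layer network: W = (I+1)*H + (H+1)*O."""
    return (n_inputs + 1) * n_hidden + (n_hidden + 1) * n_outputs

W = n_weights(10, 5, 1)      # e.g. 10 descriptors, 5 hidden neurons, 1 predicted property
N = 120                      # hypothetical number of training cases
print(W, N / W)              # the "rule of 2" would ask for N/W roughly between 1.8 and 2.2
```
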
  19. Overtraining and Early Stopping. [Plot: error curves for the training set, test set 1, and test set 2 versus training progress, with the point for early stopping marked.]
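
A generic sketch of an early-stopping loop (the model, training step, and validation-error function are placeholders; only the stopping logic illustrated on the slide is shown):

```python
import copy

def train_with_early_stopping(model, train_step, val_error, max_epochs=1000, patience=20):
    """Keep the model state with the lowest error on the held-out (early-stopping) set.

    `model`, `train_step(model)` and `val_error(model)` stand in for whatever network
    and data are actually used; only the early-stopping logic is sketched here.
    """
    best_err, best_state, since_best = float("inf"), copy.deepcopy(model), 0
    for epoch in range(max_epochs):
        train_step(model)                      # one pass of weight updates on the training set
        err = val_error(model)                 # error on the held-out set
        if err < best_err:
            best_err, best_state, since_best = err, copy.deepcopy(model), 0
        else:
            since_best += 1
            if since_best >= patience:         # stop once the held-out error stops improving
                break
    return best_state, best_err
```
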
  20. Associative Neural Network (ASNN). Ensemble approach: the prediction of case x_i by an ensemble of M neural networks (ANNE) is the average z_i = (1/M) * sum_{k=1..M} z_i^k. Pearson's (Spearman) correlation coefficient r_ij = R(z_i, z_j) > 0 in the space of residuals defines the nearest neighbours. ASNN bias correction: z_i' = z_i + (1/k) * sum over j in N_k(x_i) of (y_j - z_j). The correction of the neural network ensemble value is performed using the errors (biases) calculated for the neighbour cases of the analyzed case x_i, detected in the space of neural network models.
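
A rough sketch of the correction idea (ensemble averaging plus a nearest-neighbour bias correction in the space of models; the correlation-based neighbour search and the array shapes here are simplifying assumptions, not the exact ASNN implementation):

```python
import numpy as np

def asnn_correct(Z_train, y_train, Z_query, k=5):
    """Correct an ensemble prediction for one query case.

    Z_train: (N, M) predictions of M ensemble models for N training cases
    y_train: (N,) experimental values for the training cases
    Z_query: (M,) predictions of the M models for the query case
    Neighbours are found by Pearson correlation between model-prediction vectors
    (the "space of models"); the query's ensemble average is shifted by the mean
    residual of those neighbours.
    """
    ens_train = Z_train.mean(axis=1)                      # ensemble averages for training cases
    ens_query = Z_query.mean()
    r = np.array([np.corrcoef(Z_query, z_row)[0, 1] for z_row in Z_train])
    nn = np.argsort(-r)[:k]                               # k most correlated cases (nearest in model space)
    bias = np.mean(y_train[nn] - ens_train[nn])           # average error (bias) of those neighbours
    return ens_query + bias
```
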
  21. Correction of a model by the nearest neighbors.
  22. Nearest neighbors for Gauss function. [Plots A and B: y versus x for a Gaussian curve; plot C: x2 versus x1.] Detection of nearest neighbors in the space of models uses invariants in "structure-property" space.
  23. ASNN local correction strongly improves prediction of extreme data. Tetko, Methods Mol Biol, 2008, 458, 826-833.
  24. Prediction of similar properties. [Plot: "LogD" data points together with the "LogP" training curve.]
  25. 10 new points are measured. [Plot: "LogD" data points and the "LogP" training curve, as on the previous slide.]
  26. Library mode (no retraining). [Plot: Library "LogD" prediction together with the "LogD" data points and the "LogP" training curve.]
  27. Library mode vs development of a new model using 10 cases. [Plot: Library "LogD", the target function "LogD", and retraining from "scratch".]
