4. Machine Learning Algorithms
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
Common algorithms for regression/classification:
I. Linear Regression
II. Logistic Regression
III. Decision Tree
IV. SVM
V. Naive Bayes
VI. kNN
VII. K-Means
VIII. Random Forest
IX. Dimensionality Reduction Algorithms
X. Gradient Boosting Algorithms
XI. Artificial Neural Network
5. Linear Regression
• Regression: a statistical technique for estimating the relationships among variables
$y = X\beta + \varepsilon$
• X is a tensor in ML (in our work mostly a multidimensional matrix), called the feature vector
• y is the target, i.e. what we want to predict (e.g. adsorption energy, barrier height, bandgap, dielectric loss, etc.)
• β is/are the coefficient(s)
• ε is the error in prediction
• The goal is to find the β for which ε is minimal
• X and y are multidimensional: the solution is least squares (a synthetic-data sketch follows below)
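As a minimal illustration of this model, the snippet below generates synthetic data obeying y = Xβ + ε; the sizes, coefficients, and noise level are illustrative assumptions, not values from any real dataset.

import numpy as np

# Synthetic data following y = X.beta + epsilon (all values are
# illustrative assumptions)
rng = np.random.default_rng(0)
n_samples, n_features = 100, 3
X = rng.normal(size=(n_samples, n_features))   # feature matrix
beta_true = np.array([1.5, -2.0, 0.5])         # "true" coefficients
epsilon = 0.1 * rng.normal(size=n_samples)     # prediction error
y = X @ beta_true + epsilon                    # target to predict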
6. (Ordinary) Least Squares Solution
• When the number of variables is not equal to the number of equations: no exact solution
• Approximate solution: least squares
• Least squares: the overall solution minimizes the sum of the squares of the residuals made in the results of every single equation (Source: Wikipedia)
• If the number of equations is larger than the number of unknown variables, the solution for β is
$\beta = (X^T X)^{-1} X^T y$
• If the number of equations is smaller than or equal to the number of unknown variables:
$\beta = X^T (X X^T)^{-1} y$
• The solution is valid only if the inverse matrix exists (collinearity?)
Minimization of the residual sum of squares: $\mathrm{RSS} = \|y - X\beta\|^2$
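Assuming the X and y from the synthetic-data sketch above (more equations than unknowns), the normal-equation solution can be written directly in NumPy:

import numpy as np

# Normal-equation solution: beta = (X^T X)^{-1} X^T y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# np.linalg.lstsq minimizes the same RSS = ||y - X.beta||^2 without
# explicitly forming the (possibly ill-conditioned) inverse
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)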
7. Collinearity in Matrix
1. Inversion of a matrix
2. If the matrix is collinear: its determinant is zero, so the inverse does not exist
Solutions?
Remove the collinearity
1. Inspect the correlation coefficients between features (e.g. Pearson's correlation coefficient) and remove the features which are highly correlated
2. Add a penalty term to the least-squares objective (Lasso, Ridge, etc.)
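A minimal sketch of option 1, assuming the features sit in a pandas DataFrame (the column names and the 0.95 threshold are illustrative assumptions):

import numpy as np
import pandas as pd

df = pd.DataFrame(X, columns=["f0", "f1", "f2"])  # hypothetical feature table
corr = df.corr(method="pearson").abs()            # |Pearson's r| between features

# Keep only the upper triangle so each pair is checked once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)             # features after removal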
8. Python Code
Scikit-Learn Library
from sklearn.linear_model import LinearRegression

model = LinearRegression(fit_intercept=True)  # also fit an intercept term
model.fit(X, y)                               # X: feature matrix, y: target
What if the features available to us are highly collinear?
i.e. we do not have enough features to afford eliminating any of them!
9. Partial Least Squares (PLS) Solution
• Find new latent variables from the old features by principal component analysis (PCA)
• PCA: find an orthonormal matrix P with $U = PX$ such that $\frac{1}{n-1} U U^T$ is diagonal
• The rows of P are the principal components of X
• The new variables are chosen to simultaneously satisfy three conditions:
1. They are highly correlated with the dependent variables
2. They model as much of the variance among the independent variables as possible (maximizing the signal-to-noise ratio)
3. They are uncorrelated with each other (minimizing the number of variables)
Disadvantage: the latent variables are abstract and difficult to interpret
Scikit-Learn Library
from sklearn.cross_decomposition import PLSRegression  # note: not in sklearn.linear_model

model = PLSRegression(n_components=5)  # number of latent variables to keep
optimization of n_components is required!
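One way to carry out that optimization, sketched with scikit-learn's generic cross-validation tools (the 1-10 search range and the 5-fold CV are illustrative assumptions):

from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import GridSearchCV

# Cross-validated search over the number of latent variables
search = GridSearchCV(
    PLSRegression(),
    param_grid={"n_components": list(range(1, 11))},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
model = search.best_estimator_  # PLS model with the best n_components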
10. Ridge Regression (L2 regularization)
1. Developed to overcome the collinearity problem
2. Adds a penalty term to the least-squares loss function
3. Ridge function:
$L_{\mathrm{ridge}}(\beta, \lambda) = \|y - X\beta\|^2 + \lambda \|\beta\|^2$
4. The solution for β is
$\beta = (X^T X + \lambda I_{p \times p})^{-1} X^T y$
5. In practice we have to optimize λ (a hyperparameter; a cross-validation sketch follows the code below)
Scikit-Learn Library
from sklearn.linear_model import Ridge
model = Ridge(alpha=1e-7, max_iter=10000, tol=0.001)  # alpha plays the role of λ
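The optimization of λ can be sketched with scikit-learn's built-in RidgeCV (the logarithmic grid bounds are illustrative assumptions):

import numpy as np
from sklearn.linear_model import RidgeCV

alphas = np.logspace(-7, 2, 50)       # candidate λ values
model = RidgeCV(alphas=alphas, cv=5)  # 5-fold cross-validation
model.fit(X, y)
print(model.alpha_)                   # the selected regularization strength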
11. Lasso Regression (L1 regularization)
• The difference between Lasso and Ridge is the nature of the loss function
• Lasso function:
$L_{\mathrm{lasso}}(\beta, \lambda) = \|y - X\beta\|^2 + \lambda \|\beta\|_1$
• The solution for β (componentwise, for orthonormal features) is
$\beta_i = \mathrm{sgn}(\beta_i^{\mathrm{LS}}) \, (|\beta_i^{\mathrm{LS}}| - \lambda)_+$
Scikit-Learn Library
from sklearn.linear_model import Lasso
model = Lasso(alpha=0.00001,max_iter=100000)
sgn is the signum function for real numbers; $(x)_+ = \max(x, 0)$ denotes the positive part
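The soft-thresholding operator in that solution is simple to write out; a minimal NumPy sketch (the example coefficients and λ are arbitrary):

import numpy as np

def soft_threshold(beta_ls, lam):
    # sgn(b) * (|b| - λ)_+ : shrinks coefficients and zeroes out small ones
    return np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam, 0.0)

# Small coefficients are set exactly to zero -> built-in feature selection
print(soft_threshold(np.array([1.5, -0.3, 0.05]), lam=0.1))
# prints [ 1.4 -0.2  0. ]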
12. Preprocessing of Data
• Prior to constructing any ML model the data need to be preprocessed (this is where most of the time goes)
• All NaN entries should be removed
• Normalize the features (not the target)
• Creating the feature vector is essentially our job (a difference between ML in other fields and in materials science)
• Expertise is extremely important (using elemental properties does not always work)
• Structural and chemical descriptors are needed for better precision
• We should look for the minimum number of features, but effective ones (a sketch of these steps follows below)
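A minimal sketch of these steps, assuming the data sit in a CSV file with a column named "target" (both the file name and the column name are hypothetical):

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("dataset.csv")         # hypothetical data file
df = df.dropna()                        # remove all NaN entries

features = df.drop(columns=["target"])  # hypothetical target column
y = df["target"].to_numpy()             # the target is left unscaled

X = StandardScaler().fit_transform(features)  # normalize the features only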
13. Concept of Overfitting
• Overfitting is a modeling error which occurs when a function is fit too closely to a limited set of data points
• A limited amount of data means a high probability of overfitting (our case, so we should be very careful)
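A standard way to detect overfitting is to hold out a test set; a minimal sketch (the 80/20 split and the Ridge model are illustrative choices):

from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = Ridge(alpha=1.0).fit(X_train, y_train)

# A large gap between train and test scores signals overfitting
print("train R^2:", model.score(X_train, y_train))
print("test R^2:", model.score(X_test, y_test))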
14. Conclusions
• Basic overview of machine learning
• Briefly discussed least-squares regression
• Issues of collinearity
• Discussed PLS, Lasso, and Ridge regression
• A small discussion on preprocessing of data
• Presented a discussion on overfitting of models