Types of Machine Learning:
There are four types of machine learning:
1. Supervised Learning:
Supervised learning can be thought of as learning that is guided by a teacher. We have a labelled dataset which acts as the teacher, and its role is to train the model or the machine. Once the model is trained, it can start making predictions or decisions when new data is given to it.
Supervised learning uses labelled training data to learn the mapping
function that turns input variables (X) into the output variable (Y). In
other words, it solves for f in the following equation:
Y = f (X)
This allows us to accurately generate outputs when given new inputs.
Two types of supervised learning are classification and regression.
Classification is used to predict the outcome of a given sample when the
output variable is in the form of categories. A classification model might look
at the input data and try to predict labels like “sick” or “healthy.”
Regression is used to predict the outcome of a given sample when the
output variable is in the form of real values. For example, a regression
model might process input data to predict the amount of rainfall, the height
of a person, etc.
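A minimal sketch of both tasks, assuming scikit-learn is available; the tiny datasets below are invented purely for illustration:

# Supervised learning: labelled data provides inputs X and outputs Y,
# and the algorithm estimates the mapping f in Y = f(X).
from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: predict a category ("sick" / "healthy") from two measurements.
X_cls = [[37.0, 72], [39.5, 110], [36.8, 68], [40.1, 120]]   # temperature, pulse
y_cls = ["healthy", "sick", "healthy", "sick"]
clf = LogisticRegression().fit(X_cls, y_cls)
print(clf.predict([[38.9, 105]]))          # predicted label for a new sample

# Regression: predict a real value (height in cm) from age in years.
X_reg = [[5], [8], [11], [14]]
y_reg = [110.0, 128.0, 145.0, 162.0]
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[10]]))                 # predicted height for a new sample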
Ensembling is another type of supervised learning. It means combining the
predictions of multiple machine learning models that are individually weak to
produce a more accurate prediction on a new sample.
Thus, in supervised machine learning, "the outcome or output for the given input is known beforehand", and the machine must be able to map or assign the given input to the output.
For example, multiple images of a cat, dog, orange, apple, etc. are labelled and fed into the machine for training, and the machine must learn to identify them. Just as a human child who is shown a cat and told it is a cat will later identify a completely different cat among others as a cat, the same method is employed here. In short, supervised learning means: "Train me!"
2. Unsupervised Learning:
Unsupervised learning models are used when we only have the input
variables (X) and no corresponding output variables.
They use unlabelled training data to model the underlying structure of the
data. Input data is given and the model is run on it; the images or inputs given are mixed together, and insights about the inputs can be found.
The model learns through observation and finds structures in the data. Once
the model is given a dataset, it automatically finds patterns and relationships
in the dataset by creating clusters in it.
What it cannot do is add labels to the clusters: it cannot say that this is a group of apples or that one is a group of mangoes, but it will still separate all the apples from the mangoes.
Two types of unsupervised learning are association and clustering.
Association is used to discover the probability of the co-occurrence of items
in a collection. It is extensively used in market-basket analysis. For example,
an association model might be used to discover that if a customer purchases
bread, s/he is 80% likely to also purchase eggs.
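As an illustration (the numbers here are invented for the example, not taken from the slides): if 200 transactions contain bread and 160 of those also contain eggs, the confidence of the rule bread → eggs is
confidence(bread → eggs) = support(bread and eggs) / support(bread) = 160 / 200 = 0.80, i.e. 80%.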
Clustering is used to group samples such that objects within the same
cluster are more similar to each other than to the objects from another
cluster.
Apriori, K-means, and PCA are examples of unsupervised learning algorithms.
Suppose we present images of apples, bananas, and mangoes to the model. Based on patterns and relationships, it creates clusters and divides the dataset among them. If new data is then fed to the model, it assigns it to one of the existing clusters.
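A minimal clustering sketch, assuming scikit-learn; the two feature columns are invented stand-ins (a colour score and a size) rather than real image features:

from sklearn.cluster import KMeans

# Each row is one fruit image reduced to two made-up features: [colour score, size in cm].
X = [[0.90, 7.0], [0.85, 7.5],    # apple-like samples
     [0.20, 18.0], [0.25, 19.0],  # banana-like samples
     [0.60, 10.0], [0.65, 11.0]]  # mango-like samples

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)                  # cluster index (0, 1 or 2) for each sample
print(kmeans.predict([[0.88, 7.2]]))   # a new sample is assigned to an existing cluster
# Note: the model only groups the samples; it cannot name a cluster "apples".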
Fig.: Grouping of similar data
3. Semi-supervised Learning:
Semi-supervised learning lies in between supervised and unsupervised learning. A combination of the two is used to produce the desired results, and it is the most important setting in real-world scenarios, where the available data is usually a mixture of labelled and unlabelled data.
4. Reinforcement Learning:
The machine is exposed to an environment where it is trained by trial and error; here it learns to make very specific decisions. The machine learns from past experience and tries to capture the best possible knowledge to make accurate decisions based on the feedback received.
The algorithm allows an agent to decide the best next action based on its current state, by learning behaviours that will maximize a reward.
Reinforcement learning is the ability of an agent to interact with the environment and find out what the best outcome is. It follows the concept of trial and error. The agent is rewarded or penalized with a point for a correct or a wrong answer, and on the basis of the positive reward points gained, the model trains itself.
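One standard way to formalize this idea (not covered in detail in these slides) is the Q-learning update rule, in which the agent keeps a value Q(s, a) for how good action a is in state s and adjusts it after each reward r:

Q(s, a) ← Q(s, a) + alpha * [ r + gamma * max over a' of Q(s', a') − Q(s, a) ]

Here alpha is the learning rate, gamma is the discount factor, and s' is the next state; repeated trial and error drives Q towards the values that maximize the total reward.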
1. Overfitting: Overfitting refers to a model that models the training data too well.
Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. This means that the noise or random fluctuations in the training data are picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data and negatively impact the model's ability to generalize.
Overfitting is more likely with nonparametric and nonlinear models that have more flexibility when learning a target function. As such, many nonparametric machine learning algorithms also include parameters or techniques to limit and constrain how much detail the model learns.
2. Underfitting: Underfitting refers to a model that can neither model the training data nor generalize to new data.
An underfit machine learning model is not a suitable model, and this will be obvious because it will have poor performance on the training data.
Underfitting is often not discussed, because it is easy to detect given a good performance metric. The remedy is to move on and try alternative machine learning algorithms. Nevertheless, it provides a good contrast to the problem of overfitting.
Bias: Bias tells us how close our model's predictions are to the training data, on average. Algorithms with high bias learn fast and are easy to understand, but they are less flexible; they lose the ability to model complex problems, and this results in underfitting of our model. Getting more training data will not help much in that case.
Variance: Variance is the variability of the predictions; in simple terms, it tells us how much the predicted values change when a data point changes or when a different training dataset is used. Ideally, the values predicted by the model should remain much the same when moving from one training dataset to another, but if the model has high variance, then its predictions are strongly affected by the particular dataset it was trained on.
"Signal" refers to the true underlying pattern that you wish to learn from the data. "Noise", on the other hand, refers to the irrelevant information or randomness in a dataset.
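These two quantities are usually connected through the standard bias-variance decomposition of the expected prediction error (a textbook result, not derived in these slides):

Expected error = Bias^2 + Variance + Irreducible error (noise)

An underfit model is dominated by the bias term, while an overfit model is dominated by the variance term.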
Overfitting and Underfitting are the two main problems that occur in machine
learning and degrade the performance of the machine learning models.
The main goal of each machine learning model is to generalize well.
Here, generalization refers to the ability of an ML model to provide a suitable output for a new, unseen set of inputs. It means that after being trained on the dataset, the model can produce reliable and accurate output on new data. Hence, underfitting and overfitting are the two conditions that must be checked to judge the performance of the model and whether it is generalizing well or not.
Overfitting:
Overfitting occurs when our machine learning model tries to cover all the data points, or more than the required data points, present in the given dataset. Because of this, the model starts capturing the noise and inaccurate values present in the dataset, and all these factors reduce the efficiency and accuracy of the model. The overfitted model has low bias and high variance.
The chance of overfitting increases the more we train our model: the longer we train, the greater the chance of ending up with an overfitted model.
Overfitting is the main problem that occurs in supervised learning.
Example: The concept of overfitting can be understood from the graph of the linear regression output below:
In the above graph, the model tries to cover all the data points present in the scatter plot. It may look efficient, but in reality it is not, because the goal of the regression model is to find the best-fit line; here we have not obtained a good general fit, so the model will generate prediction errors on new data.
How to avoid overfitting in a model:
Both overfitting and underfitting cause degraded performance of the machine learning model, but the main problem is overfitting, so there are some ways by which we can reduce the occurrence of overfitting in our model (a short sketch follows this list):
Cross-Validation
Training with more data
Removing features
Early stopping the training
Regularization
Ensembling
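A minimal sketch of two of these remedies, regularization and cross-validation, assuming scikit-learn; the synthetic dataset is purely illustrative:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (40, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 40)   # signal plus noise

# A high-degree polynomial with almost no penalty tends to overfit;
# increasing the Ridge penalty (alpha) constrains how much detail the model learns.
for alpha in (1e-6, 1e-2, 1.0):
    model = make_pipeline(PolynomialFeatures(degree=12), Ridge(alpha=alpha))
    scores = cross_val_score(model, X, y, cv=5)    # 5-fold cross-validation
    print(f"alpha={alpha}: mean CV score = {scores.mean():.3f}")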
Underfitting:
Underfitting occurs when our machine learning model is not able to capture the underlying trend of the data. To avoid overfitting, the feeding of training data can be stopped at an early stage, due to which the model may not learn enough from the training data. As a result, it may fail to find the best fit for the dominant trend in the data.
In the case of underfitting, the model is not able to learn enough from the training data, and hence it has reduced accuracy and produces unreliable predictions.
An underfitted model has high bias and low variance.
Example: We can understand underfitting from the output of the linear regression model shown below:
In the above graph, the model is unable to capture the trend of the data points present in the plot.
How to avoid underfitting:
By increasing the training time of the model.
By increasing the number of features.
Goodness of Fit:
The term "goodness of fit" is taken from statistics, and the goal of machine learning models is to achieve a good fit. In statistical modelling, it describes how closely the results or predicted values match the true values of the dataset.
The model with a good fit lies between the underfitted and the overfitted model; ideally, it would make predictions with zero error, but in practice this is difficult to achieve.
There are two other methods by which we can find a good fitting point for our model: the resampling method to estimate model accuracy, and the use of a validation dataset (a short sketch follows).
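A minimal sketch of both ideas, assuming scikit-learn and its bundled iris dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# Resampling: k-fold cross-validation re-uses the data to estimate model accuracy.
print(cross_val_score(model, X, y, cv=5).mean())

# Validation dataset: hold out part of the data and never train on it.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_val, y_val))   # a large gap versus the training score suggests overfitting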
Machine Learning Life Cycle:
The machine learning life cycle is a cyclic process used to build an efficient machine learning project. The main purpose of the life cycle is to find a solution to the problem or project.
The machine learning life cycle involves seven major steps, which are given below (a compressed code sketch follows the list):
Gathering Data
Data preparation
Data Wrangling
Analyze Data
Train the model
Test the model
Deployment
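A compressed sketch of how these steps might look in code, assuming scikit-learn, pandas, and joblib are available; the file name patients.csv and its label column are hypothetical:

import pandas as pd
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("patients.csv")            # 1. Gathering data (hypothetical file)
df = df.sample(frac=1, random_state=0)      # 2. Data preparation: randomize the ordering
df = df.drop_duplicates().dropna()          # 3. Data wrangling: remove duplicates and missing values

X = df.drop(columns=["label"])              # 4. Analyze data: choose features and a technique
y = df["label"]                             #    (here, classification)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier().fit(X_train, y_train)   # 5. Train the model
print(model.score(X_test, y_test))                       # 6. Test the model (accuracy)
joblib.dump(model, "model.joblib")                       # 7. Deployment: save for the real system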
In the complete life cycle process, to solve a problem we create a machine learning system called a "model", and this model is created by providing "training". But to train a model we need data; hence, the life cycle starts with collecting data.
The most important thing in the complete process is to understand the problem and to know its purpose.
1. Gathering Data:
Data gathering is the first step of the machine learning life cycle. The goal of this step is to identify and obtain all the data related to the problem.
In this step, we need to identify the different data sources, as data can be collected from various sources such as files, databases, the internet, or mobile devices. It is one of the most important steps of the life cycle, because the quantity and quality of the collected data will determine the efficiency of the output: the more data we have, the more accurate the predictions will be.
This step includes the below tasks:
Identify various data sources
Collect data
Integrate the data obtained from different sources
By performing the above tasks, we get a coherent set of data, also called a dataset. It will be used in the further steps.
2. Data Preparation:
After collecting the data, we need to prepare it for the further steps. Data preparation is the step where we put our data into a suitable place and prepare it for use in our machine learning training.
In this step, we first put all the data together and then randomize the ordering of the data.
This step can be further divided into two processes:
Data exploration:
It is used to understand the nature of the data that we have to work with. We need to understand the characteristics, format, and quality of the data. A better understanding of the data leads to an effective outcome. In this step, we find correlations, general trends, and outliers.
Data pre-processing:
The next step is the pre-processing of the data for its analysis.
3. Data Wrangling:
Data wrangling is the process of cleaning and converting raw data into a usable format. It is the process of cleaning the data, selecting the variables to use, and transforming the data into a proper format to make it more suitable for analysis in the next step. It is one of the most important steps of the complete process, and cleaning of the data is required to address quality issues.
The data we collect is not always useful to us, as some of it may not be relevant. In real-world applications, collected data may have various issues, including:
Missing values
Duplicate data
Invalid data
Noise
So, we use various filtering techniques to clean the data (a small sketch follows).
It is mandatory to detect and remove the above issues, because they can negatively affect the quality of the outcome.
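A small pandas sketch of these cleaning steps; the toy DataFrame below is invented to show each issue:

import numpy as np
import pandas as pd

df = pd.DataFrame({"age":    [25, 25, np.nan, 200, 41],
                   "income": [30000, 30000, 45000, 52000, None]})

df = df.drop_duplicates()                                    # duplicate data
df = df[(df["age"] > 0) & (df["age"] < 120)]                 # invalid or missing age values
df["income"] = df["income"].fillna(df["income"].median())    # remaining missing values
print(df)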
4. Data Analysis:
Now the cleaned and prepared data is passed on to the analysis step. This step involves:
Selection of analytical techniques
Building models
Reviewing the result
The aim of this step is to build a machine learning model that analyses the data using various analytical techniques, and then to review the outcome. It starts with determining the type of problem, in which we select a machine learning technique such as classification, regression, cluster analysis, or association; we then build the model using the prepared data and evaluate it.
Hence, in this step we take the data and use machine learning algorithms to build the model.
5. Train Model:
The next step is to train the model; in this step we train our model to improve its performance and obtain a better outcome for the problem.
We use datasets to train the model using various machine learning algorithms. Training a model is required so that it can learn the various patterns, rules, and features.
6. Test Model:
Once our machine learning model has been trained on a given dataset, we test the model. In this step, we check the accuracy of our model by providing a test dataset to it.
Testing the model determines the percentage accuracy of the model as per the requirements of the project or problem.
7. Deployment:
The last step of the machine learning life cycle is deployment, where we deploy the model in the real-world system.
If the above-prepared model produces accurate results as per our requirements with acceptable speed, then we deploy the model in the real system. But before deploying the project, we check whether it keeps improving its performance using the available data or not. The deployment phase is similar to making the final report for a project.
Dimensionality Reduction:
In machine learning classification problems, there are often too many factors on the basis of which the final classification is done. These factors are basically variables, called features. The higher the number of features, the harder it gets to visualize the training set and then work on it. Sometimes most of these features are correlated, and hence redundant. This is where dimensionality reduction algorithms come into play. Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection and feature extraction.
In the figure below, a 3-D classification problem can be hard to visualize, whereas a 2-D one can be mapped to a simple two-dimensional space, and a 1-D problem to a simple line. The figure illustrates this concept, where a 3-D feature space is split into two 2-D feature spaces, and later, if the features are found to be correlated, the number of features can be reduced even further.
Components of Dimensionality Reduction:
Feature selection: In this, we try to find a subset of the original set of variables, or features, to get a smaller subset which can be used to model the problem. It usually involves three approaches:
• Filter
• Wrapper
• Embedded
Feature extraction: This reduces data in a high-dimensional space to a lower-dimensional space, i.e. a space with a smaller number of dimensions.
Methods of Dimensionality Reduction:
The various methods used for dimensionality reduction include:
Principal Component Analysis (PCA)
Linear Discriminant Analysis (LDA)
Generalized Discriminant Analysis (GDA)
Dimensionality reduction may be either linear or non-linear, depending upon the method used. The prime linear method is called Principal Component Analysis, or PCA.
Principal Component Analysis (PCA)
This method was introduced by Karl Pearson. It works on the condition that while the data in a higher-dimensional space is mapped to data in a lower-dimensional space, the variance of the data in the lower-dimensional space should be maximum.
It involves the following steps:
Construct the covariance matrix of the data.
Compute the eigenvectors of this matrix.
Eigenvectors corresponding to the largest eigenvalues are used to
reconstruct a large fraction of variance of the original data.
Hence, we are left with a smaller number of eigenvectors, and there may have been some loss of information in the process. However, the most important variance should be retained by the remaining eigenvectors.
Advantages of Dimensionality Reduction:
It helps in data compression, and hence reduces the required storage space.
It reduces computation time.
It also helps remove redundant features, if any.
Disadvantages of Dimensionality Reduction:
It may lead to some amount of data loss.
PCA tends to find linear correlations between variables, which is sometimes undesirable.
PCA fails in cases where the mean and covariance are not enough to define the dataset.
We may not know how many principal components to keep; in practice, some rules of thumb are applied.
Principal Component Analysis:
Principal Component Analysis is an unsupervised learning algorithm that is used for dimensionality reduction in machine learning. It is a statistical process that converts the observations of correlated features into a set of linearly uncorrelated features with the help of an orthogonal transformation. These new transformed features are called the principal components. It is one of the popular tools used for exploratory data analysis and predictive modelling. It is a technique for drawing out strong patterns from the given dataset by reducing the number of dimensions while retaining as much of the variance as possible.
PCA generally tries to find the lower-dimensional surface onto which to project the high-dimensional data.
PCA works by considering the variance of each attribute, because an attribute with high variance shows a good split between the classes; this is what PCA preserves while reducing the dimensionality. Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing the power allocation in various communication channels. It is a feature extraction technique, so it retains the important variables and drops the least important ones.
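A minimal usage sketch with scikit-learn's PCA on its bundled iris dataset (4 correlated features reduced to 2 principal components):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)            # 150 samples, 4 features
X_std = StandardScaler().fit_transform(X)    # standardize before PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)             # project onto the top 2 principal components
print(X_pca.shape)                           # (150, 2)
print(pca.explained_variance_ratio_)         # fraction of variance kept by each component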
The PCA algorithm is based on some mathematical concepts such as:
Variance and covariance
Eigenvalues and eigenvectors
Some common terms used in the PCA algorithm:
Dimensionality: The number of features or variables present in the given dataset; more simply, the number of columns in the dataset.
Correlation: It signifies how strongly two variables are related to each other, such that if one changes, the other variable also changes. The correlation value ranges from -1 to +1: -1 occurs if the variables are inversely proportional to each other, and +1 indicates that the variables are directly proportional to each other.
Orthogonal: It means that the variables are not correlated with each other, and hence the correlation between the pair of variables is zero.
Eigenvectors: Given a square matrix M and a non-zero vector v, v is an eigenvector of M if Mv is a scalar multiple of v.
Covariance Matrix: A matrix containing the covariances between pairs of variables is called the covariance matrix.
Principal Components in PCA:
As described above, the transformed new features, or the output of PCA, are the principal components. The number of these PCs is either equal to or less than the number of original features present in the dataset. Some properties of these principal components are given below:
Each principal component must be a linear combination of the original features.
These components are orthogonal, i.e., the correlation between any pair of components is zero.
The importance of each component decreases when going from 1 to n; that is, the 1st PC has the most importance, and the nth PC has the least importance.
Steps for the PCA algorithm:
1. Getting the dataset
Firstly, we take the input dataset and divide it into two subparts, X and Y, where X is the training set and Y is the validation set.
2. Representing data in a structure
Now we represent our dataset as a structure, such as a two-dimensional matrix of the independent variable X. Here each row corresponds to a data item, and each column corresponds to a feature. The number of columns is the dimensionality of the dataset.
3. Standardizing the data
In this step, we standardize our dataset. In a particular column, the features with high variance are more important compared to the features with lower variance.
If the importance of features is to be independent of the variance of the feature, then we divide each data item in a column by the standard deviation of the column. We will name the resulting matrix Z.
4. Calculating the covariance of Z
To calculate the covariance of Z, we take the matrix Z and transpose it. After transposing, we multiply it by Z. The output matrix will be the covariance matrix of Z.
5. Calculating the eigenvalues and eigenvectors
Now we calculate the eigenvalues and eigenvectors of the resultant covariance matrix. The eigenvectors of the covariance matrix are the directions of the axes carrying the most information, and the corresponding eigenvalues indicate how much variance lies along each of those directions.
6. Sorting the eigenvectors
In this step, we take all the eigenvalues and sort them in decreasing order, i.e. from largest to smallest, and simultaneously sort the corresponding eigenvectors accordingly into a matrix P. The resultant matrix is named P*.
7. Calculating the new features, or principal components
Here we calculate the new features. To do this, we multiply the matrix Z by P*. In the resultant matrix Z*, each observation is a linear combination of the original features, and each column of Z* is independent of the others.
8. Removing less important features from the new dataset
The new feature set has been obtained, so we now decide what to keep and what to remove: we keep only the relevant or important features in the new dataset, and the unimportant features are removed (a NumPy sketch of these steps follows).
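A NumPy sketch of steps 2 to 8 (the split into a training set X and validation set Y from step 1 is omitted, and the data here is random and purely illustrative; dividing the covariance by n - 1 gives the usual sample covariance and does not change the eigenvectors):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))               # step 2: rows = data items, columns = features

Z = (X - X.mean(axis=0)) / X.std(axis=0)    # step 3: standardize each column
cov = Z.T @ Z / (Z.shape[0] - 1)            # step 4: covariance matrix of Z

eigvals, eigvecs = np.linalg.eigh(cov)      # step 5: eigenvalues and eigenvectors
order = np.argsort(eigvals)[::-1]           # step 6: sort from largest to smallest eigenvalue
P_star = eigvecs[:, order]

Z_star = Z @ P_star                         # step 7: new features (principal components)

k = 2                                       # step 8: keep only the most important components
X_reduced = Z_star[:, :k]
print(X_reduced.shape)                      # (100, 2)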
Applications of Principal Component Analysis:
PCA is mainly used as a dimensionality reduction technique in various AI applications such as computer vision, image compression, etc.
It can also be used for finding hidden patterns when the data has high dimensionality. Some fields where PCA is used are finance, data mining, psychology, etc.
Model Evaluation Metrics:
Evaluation metrics are tied to machine learning tasks. There are different metrics for the tasks of classification, regression, ranking, clustering, topic modelling, etc. Some metrics, such as precision and recall, are useful for multiple tasks. Classification, regression, and ranking are examples of supervised learning, which constitutes a majority of machine learning applications.
Model Accuracy:
Model accuracy, in terms of classification models, can be defined as the ratio of correctly classified samples to the total number of samples:
Accuracy = Number of correct predictions / Total number of samples
True Positive (TP): A true positive is an outcome where the model correctly predicts the positive class.
True Negative (TN): A true negative is an outcome where the model correctly predicts the negative class.
False Positive (FP): A false positive is an outcome where the model incorrectly predicts the positive class.
False Negative (FN): A false negative is an outcome where the model incorrectly predicts the negative class.
Problem Statement: Build a prediction model for hospitals to identify whether a patient is suffering from cancer or not.
Binary Classification Model: predict whether the patient has cancer or not.
Let's assume we have a labelled training dataset with 100 cases: 10 labelled as 'Cancer' and 90 labelled as 'Normal'.
Let's try calculating the accuracy of this model on the above dataset, given the following results:
In the above case let's define the TP, TN, FP, FN:
TP (Actual Cancer and predicted Cancer) = 1
TN (Actual Normal and predicted Normal) = 90
FN (Actual Cancer and predicted Normal) = 8
FP (Actual Normal and predicted Cancer) = 1
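Plugging these counts into the accuracy formula:
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (1 + 90) / (1 + 90 + 1 + 8) = 91 / 100 = 0.91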
So the accuracy of this model is 91%. But the question remains: is this model useful, even though it is so accurate?
This highly accurate model may not be useful, as it is barely able to identify the actual cancer patients, and hence it can have the worst possible consequences.
So, in these types of scenarios, how can we trust machine learning models?
Accuracy alone doesn't tell the full story when we're working with a class-imbalanced dataset like this one, where there is a significant disparity between the number of positive and negative labels.
Precision and Recall:
In a classification task, the precision for a class is the number of true
positives (i.e. the number of items correctly labeled as belonging to the
positive class) divided by the total number of elements labeled as belonging
to the positive class (i.e. the sum of true positives and false positives, which
are items incorrectly labeled as belonging to the class).
Recall is defined as the number of true positives divided by the total number of elements that actually belong to the positive class (i.e. the sum of true positives and false negatives, which are items which were not labelled as belonging to the positive class but should have been).
High precision means that an algorithm returned substantially more relevant results than irrelevant ones.
High recall means that an algorithm returned most of the relevant results.
Let's try to measure precision and recall for our cancer prediction use case:
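Using the counts defined earlier (TP = 1, FP = 1, FN = 8):
Precision = TP / (TP + FP) = 1 / (1 + 1) = 0.5
Recall = TP / (TP + FN) = 1 / (1 + 8) ≈ 0.11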
Our model has a precision value of 0.5 — in other words, when it predicts
cancer, it’s correct 50% of the time.
Our model has a recall value of 0.11 — in other words, it correctly
identifies only 11% of all cancer patients.
Classification Accuracy:
Classification accuracy is what we usually mean when we use the term accuracy. It is the ratio of the number of correct predictions to the total number of input samples.
It works well only if there are equal numbers of samples belonging to each class.
For example, consider that there are 98% samples of class A and 2% samples of class B in our training set. Then our model can easily get 98% training accuracy by simply predicting that every training sample belongs to class A.
When the same model is tested on a test set with 60% samples of class A and 40% samples of class B, the test accuracy drops to 60%. Classification accuracy looks great, but it gives us a false sense of achieving high performance.
The real problem arises when the cost of misclassifying the minority-class samples is very high. If we are dealing with a rare but fatal disease, the cost of failing to diagnose the disease in a sick person is much higher than the cost of sending a healthy person for more tests.