This is an introductory write-up on Machine Learning using Python. It covers the basic theory of Machine Learning along with practical implementation.
The lab notebook can be found here: https://github.com/opencubelabs
http://ocl.space
Supervised Learning
y = f(X)
X is the features/inputs
y is the target/output
f is the function to be learned, mapping inputs X to outputs y
Types:
● Regression
● Classification
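The y = f(X) framing can be sketched in a few lines of code. A minimal regression example, assuming scikit-learn is available; the toy data is invented for illustration:

```python
# Learn y = f(X) from labelled examples (supervised regression).
# The data below is a made-up toy set where y = 2x + 1.
from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4]]   # features/inputs
y = [3, 5, 7, 9]           # target/output

f = LinearRegression().fit(X, y)   # learn the function f from (X, y) pairs
print(f.predict([[5]]))            # apply f to an unseen input
```

For classification the same fit/predict pattern applies, but y holds discrete labels instead of continuous values.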
Unsupervised Learning
● We have input data (X) but no corresponding output variable (y).
● The goal is to model the distribution of the data in order to learn more
about the data.
● Types of unsupervised learning:
--> Clustering
--> Association
Regression
● A predictive modelling technique which investigates the relationship between a
dependent variable (target) and independent variable(s) (predictors).
● It is used for forecasting, time series modelling and finding the causal effect
relationship between the variables.
● It indicates the significant relationships between dependent variable and
independent variable.
● It indicates the strength of impact of multiple independent variables on a
dependent variable.
● Types of regression: Linear, Logistic, Polynomial, Stepwise, Ridge, Lasso
and ElasticNet
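Two of the regularised variants listed above, Ridge and Lasso, can be compared in a short sketch. This assumes scikit-learn; the synthetic data and alpha values are chosen only for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
# True relationship uses only the 1st and 3rd features.
y = X @ np.array([2.0, 0.0, -1.0]) + 0.01 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # shrinks all coefficients towards zero
lasso = Lasso(alpha=0.1).fit(X, y)   # can drive some coefficients exactly to zero
print(ridge.coef_.round(2))
print(lasso.coef_.round(2))
```

Lasso tends to zero out the unused middle feature, while Ridge only shrinks it.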
Clustering and Association
● The aim is to segregate groups with similar traits and assign them into clusters.
● Types of Clustering:
--> Hard Clustering: In hard clustering, each data point either belongs to a cluster
completely or not.
--> Soft Clustering: In soft clustering, each data point is assigned a probability or
likelihood of belonging to each cluster, rather than a single hard assignment.
● When we want to discover rules that describe large portions of the input data
(e.g. people who buy X also tend to buy Y), it is known as an association problem.
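The hard/soft distinction can be seen directly in code. A sketch assuming scikit-learn, where KMeans gives hard assignments and a Gaussian mixture gives soft (probabilistic) ones; the four 1-D points are invented:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.array([[0.0], [0.1], [5.0], [5.1]])

# Hard clustering: each point gets exactly one cluster label.
hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Soft clustering: each point gets a probability for every cluster.
soft = GaussianMixture(n_components=2, random_state=0).fit(X).predict_proba(X)

print(hard)           # one label per point
print(soft.round(2))  # per-point cluster probabilities (rows sum to 1)
```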
Linear Regression
● It is used to estimate real values (cost of houses, number of calls, total sales etc.)
based on continuous variable(s).
● Here, we establish relationship between independent and dependent variables
by fitting a best line.
● This best-fit line is known as the regression line and is represented by the linear equation
Y = a * X + b
Y – Dependent Variable
a – Slope
X – Independent variable
b – Intercept
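The slope a and intercept b can be recovered directly from a fitted model. A sketch assuming scikit-learn; the data is synthetic, generated from a known line (a = 3, b = 1):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0], [1], [2], [3]])   # independent variable
Y = 3 * X.ravel() + 1                # dependent variable: Y = 3*X + 1

reg = LinearRegression().fit(X, Y)
print(reg.coef_[0])    # slope a
print(reg.intercept_)  # intercept b
```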
Logistic Regression
● It is used to estimate discrete values (binary values like 0/1, yes/no, true/false)
based on a given set of independent variable(s).
● It predicts the probability of occurrence of an event by fitting data to a logit function.
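A minimal sketch, assuming scikit-learn; the pass/fail toy data (e.g. hours studied) is invented for illustration:

```python
from sklearn.linear_model import LogisticRegression

X = [[1], [2], [3], [7], [8], [9]]   # e.g. hours studied
y = [0, 0, 0, 1, 1, 1]               # fail/pass

clf = LogisticRegression().fit(X, y)
print(clf.predict([[2], [8]]))    # discrete 0/1 predictions
print(clf.predict_proba([[5]]))   # probability of each class, via the logit fit
```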
Overfitting & Underfitting
● Overfitting happens when a model performs too well on training data but does
not perform well on unseen data.
● Underfitting happens when a model performs poorly on training data as well as
unseen data.
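Both effects can be seen with polynomial fits of different degrees. A NumPy-only sketch; the sine data and the chosen degrees are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=10)   # noisy training data

under = np.polyfit(x, y, 0)   # degree 0: a constant -> underfits
over = np.polyfit(x, y, 9)    # degree 9: passes through every point -> overfits

train_err_under = np.abs(np.polyval(under, x) - y).max()
train_err_over = np.abs(np.polyval(over, x) - y).max()
print(train_err_under)  # large: fails even on the training data
print(train_err_over)   # near zero on training data, yet poor on unseen points
```

The degree-9 fit looks perfect on the training set, which is exactly why a separate test on unseen data is needed.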
Cross Validation
● A method to test how well a model performs on unseen data.
● Types of Cross Validation methods:
--> Hold out method
--> K-fold method
--> Leave-one-out cross validation
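The K-fold method can be sketched in one call, assuming scikit-learn and its bundled iris dataset; each of the 5 folds serves once as the held-out test set:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
# cv=5 splits the data into 5 folds and trains/tests 5 times.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)         # one accuracy score per fold
print(scores.mean())  # overall estimate of performance on unseen data
```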
Naive Bayes
● Naive Bayes is a supervised learning algorithm which is based on Bayes' theorem.
● The word naive comes from the assumption of independence among features.
● We can write Bayes' theorem as follows:
P(y | x) = ( P(x | y) * P(y) ) / P(x)
Where,
P(x) is the prior probability of a feature.
P(x | y) is the probability of a feature given the target, also known as the likelihood.
P(y) is the prior probability of a target (or class, in the case of classification).
P(y | x) is the posterior probability of the target given the feature.
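A minimal sketch, assuming scikit-learn's GaussianNB and its bundled iris dataset; the probabilities it returns are the posterior P(y | x) above:

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
nb = GaussianNB().fit(X, y)

posterior = nb.predict_proba(X[:1])   # P(y | x) for the first sample
print(posterior.round(3))
print(nb.predict(X[:1]))              # the class with the highest posterior
```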
Support Vector Machines (SVMs)
● SVMs are among the best supervised learning algorithms.
● They are effective in high-dimensional spaces and memory efficient as well.
● We plot each data item as a point in n-dimensional space and perform classification
by finding the hyperplane that separates the two classes well.
● Many such hyperplanes can be drawn.
● The optimal hyperplane is the one that maximizes the margin.
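A minimal sketch, assuming scikit-learn's SVC; the 2-D toy points are invented. The support vectors it exposes are the points that define the maximum margin:

```python
from sklearn.svm import SVC

X = [[0, 0], [1, 1], [4, 4], [5, 5]]   # points in 2-D space
y = [0, 0, 1, 1]                       # two classes

svm = SVC(kernel="linear").fit(X, y)   # find the max-margin hyperplane
print(svm.predict([[0.5, 0.5], [4.5, 4.5]]))
print(svm.support_vectors_)            # the points that pin down the margin
```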
Decision Tree
● Decision Tree is a supervised learning algorithm which can be used for
classification as well as regression problems.
● Here we split the population into homogeneous sets by asking a series of questions.
● Example : To decide what to do on a particular day.
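The "what to do today" example can be sketched with scikit-learn's DecisionTreeClassifier; the features (outlook encoded as 0 = sunny, 1 = rainy, plus temperature) and activities are invented:

```python
from sklearn.tree import DecisionTreeClassifier

X = [[0, 30], [0, 20], [1, 20], [1, 10]]   # [outlook, temperature °C]
y = ["beach", "walk", "read", "read"]      # activity for that day

# The tree learns a series of yes/no questions that split the data
# into homogeneous groups.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.predict([[0, 28]]))   # a warm, sunny day
```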
Random Forest
● Random Forest is the most common type of Ensemble Learning.
● It is a collection of decision trees.
● To classify a new object based on attributes, each tree gives a classification
and we say the tree “votes” for that class. The forest chooses the classification
having the most votes (over all the trees in the forest).
● Random forests have a plethora of advantages: for example, they are fast to train
and require little input preparation.
● One disadvantage of random forests is that the model may become very large.
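The voting described above can be sketched with scikit-learn's RandomForestClassifier on its bundled iris dataset; the individual trees are available for inspection:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

print(len(forest.estimators_))   # the 100 individual decision trees that vote
print(forest.predict(X[:1]))     # the majority vote over all the trees
```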
K-nearest Neighbors (KNN)
● KNN can be used for both classification and regression problems.
● It stores all available cases and classifies new cases by a majority vote of its k
neighbors.
● KNN is computationally expensive.
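A minimal sketch, assuming scikit-learn's KNeighborsClassifier; the 1-D toy data is invented. Each new case is labelled by a majority vote of its k = 3 nearest stored cases:

```python
from sklearn.neighbors import KNeighborsClassifier

X = [[0], [1], [2], [10], [11], [12]]   # stored cases
y = [0, 0, 0, 1, 1, 1]                  # their labels

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)   # "fit" just stores the data
print(knn.predict([[1.5], [10.5]]))     # majority vote of the 3 nearest neighbors
```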
K-means clustering
● K-means is one of the simplest unsupervised learning algorithms, used for
clustering problems.
● Our goal is to group objects based on their feature similarity.
● The basic idea behind K-means is that we define k centroids, one for each cluster,
and assign each point to its nearest centroid.
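A minimal sketch, assuming scikit-learn's KMeans; the 1-D toy data is invented. With k = 2, the algorithm finds one centroid per cluster and assigns each point to the nearest one:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0], [0.2], [9.8], [10.0]])   # two obvious groups

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment for each point
print(km.cluster_centers_)  # the k = 2 centroids (the cluster means)
```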
Neural Networks
● Neural Network is an information processing system, that is, we pass some
input to the Neural Network, some processing happens and we get some output.
● Neural Networks are inspired by the biological connections between neurons and how
information processing happens in the brain.
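The input → processing → output idea can be sketched with scikit-learn's MLPClassifier, here learning XOR, a function a single linear model cannot represent; the layer size and solver are arbitrary choices for this toy problem:

```python
from sklearn.neural_network import MLPClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]   # inputs
y = [0, 1, 1, 0]                       # XOR outputs

# One hidden layer of 8 neurons does the "processing" between input and output.
net = MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs",
                    max_iter=2000, random_state=0)
net.fit(X, y)
print(net.predict(X))   # the network's output for each input
```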