Unit 1
• Learning, Types of Learning
• Well defined learning problems, Designing a Learning System
• History of ML, Introduction of Machine Learning Approaches
• Artificial Neural Network, Clustering, Reinforcement Learning
• Decision Tree Learning, Bayesian networks
• Support Vector Machine, Genetic Algorithm
• Issues in Machine Learning
• Data Science Vs Machine Learning
Have you ever heard of these?
• Virtual Personal Assistants
• Smart Speakers: Amazon Echo and
Google Home
• Mobile Apps: Ok Google
• Predictions while Commuting
• GPS navigation
• Video Surveillance
• Social Media Services
• People You May Know
• Face Recognition
• Similar Pins
• Email Spam and Malware Filtering
• Online Customer Support
• Search Engine Result Refining
• Product Recommendations
• Online Fraud Detection
Machine Learning definition
• Arthur Samuel (1959). Machine Learning: Field of study that gives computers the
ability to learn without being explicitly programmed.
• Machine learning (ML) is a type of artificial intelligence (AI) that allows software
applications to become more accurate at predicting outcomes without being
explicitly programmed to do so. Machine learning algorithms use historical data as
input to predict new output values.
• Machine learning is an application of AI that enables systems to learn and improve
from experience without being explicitly programmed. Machine learning focuses
on developing computer programs that can access data and use it to learn for themselves.
Well Posed Learning Problem
Machine learning Types
• Machine learning algorithms:
• Supervised learning
• Unsupervised learning
• Others: Reinforcement learning, recommender systems.
Supervised Learning
Supervised machine learning algorithms are designed to learn from labeled examples. The
name “supervised” learning originates from the idea that training this type of algorithm is
like having a teacher supervise the whole process.
When training a supervised learning algorithm, the training data will consist of inputs paired
with the correct outputs. During training, the algorithm will search for patterns in the data that
correlate with the desired outputs. After training, a supervised learning algorithm will take in
new unseen inputs and will determine which label the new inputs will be classified as based on
prior training data. The objective of a supervised learning model is to predict the correct label
for newly presented input data. At its most basic form, a supervised learning algorithm can be
written simply as:
$Y = f(x)$
where Y is the predicted output that is determined by a mapping function f that assigns a
class to an input value x. The function used to connect input features to a predicted output is
learned by the model during training.
Supervised learning can be split into two subcategories:
• Regression
  • Linear Regression
  • Polynomial Regression
  • Decision Tree Regression
• Classification
  • Logistic Regression (a classification method, despite its name)
  • Linear Classifiers
  • Support Vector Machines
  • Decision Tree Classification
  • K-Nearest Neighbor
  • Random Forest
Regression : example
A classification algorithm will be given data points with an assigned category. The job of a
classification algorithm is to then take an input value and assign it a class, or category, that
it fits into based on the training data provided.
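As a minimal sketch of a classification algorithm, the following uses k-nearest neighbor (one of the classifiers listed above) in plain Python. The toy data, the feature meaning, and the function name `knn_predict` are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from the new point to every labeled training point
    distances = [(math.dist(x, x_new), label) for x, label in zip(train_X, train_y)]
    # Keep the k closest neighbors and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy labeled data (assumed for illustration): [size, weight] -> category
train_X = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0], [3.0, 3.2], [3.1, 2.9], [2.9, 3.0]]
train_y = ["cherry", "cherry", "cherry", "apple", "apple", "apple"]

prediction = knn_predict(train_X, train_y, [2.8, 3.1], k=3)
print(prediction)
```

The training data pairs each input with its correct output, and the new, unseen input is assigned the category of the training points it most resembles.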
Unsupervised learning
Unsupervised learning occurs when an algorithm learns from plain examples without any
associated response, leaving the algorithm to determine the data patterns on its own.
When no labels are present in the dataset used to train the model, this is called unsupervised ML.
• Clustering
• Neural Networks
• Fuzzy C-Means
Reinforcement learning
• Reinforcement learning (RL) is a feedback-based machine learning technique in which an agent learns to behave
in an environment by performing actions and seeing the results of those actions. For each action, the agent gets
feedback. E.g. the AWS DeepRacer car.
• In RL, an agent interacts with an environment with the objective of maximizing its total reward.
• A reinforcement learning model learns from its experience and over time is able to identify
which actions lead to the best rewards.
• The main components of RL are:
  • Agent
  • Environment
  • State
  • Reward
  • Action
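The interplay of agent, environment, state, action and reward can be sketched with tabular Q-learning on a tiny corridor. Q-learning is one possible RL algorithm chosen here for illustration; the environment, the reward of 1 at the goal, and all hyperparameters are assumptions.

```python
import random

N_STATES = 5          # states 0..4; state 4 is the goal
ACTIONS = [-1, +1]    # move left or move right

def step(state, action):
    """Environment: apply the action and return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done

def train(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Agent: learn a table q[state][action_index] from experience."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy: explore sometimes (and on ties), otherwise exploit
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, ACTIONS[a])
            # Move the estimate toward reward + discounted best future value
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q

q_table = train()
# Greedy action per non-goal state after training
greedy_policy = [ACTIONS[0] if q[0] > q[1] else ACTIONS[1] for q in q_table[:-1]]
print(greedy_policy)
```

After training, the greedy policy moves right in every state, which is the behavior that leads to the reward.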
Applications of RL
Video gameplay: Reinforcement learning has been used to teach bots to
play a number of video games.
Resource management: Given finite resources and a defined goal,
reinforcement learning can help enterprises plan out how to allocate resources.
Applications of ML
Email-Spam Filtering
Traffic Prediction
Virtual Personal Assistant: Google Assistant, Alexa, Cortana, Siri
Social Media Personalization
Online Fraud Detection
Stock Market Prediction
Weather Prediction
Speech Recognition
Medical Diagnosis
Self-driving cars
Image Recognition
Issues in ML
Poor quality of data
• Unclean and noisy data. Typical remedies:
  • Remove outliers
  • Filter missing values
  • Remove unwanted features
Underfitting of training data
Underfitting occurs when the model is unable to establish an accurate relationship between input and output
variables. It is like trying to fit into undersized jeans: the model is too simple to capture the
relationship in the data. To overcome this issue:
• Enhance the complexity of the model
• Add more features to the data
ML is a complex process
It includes analyzing the data, removing data bias, training data, applying complex mathematical calculations, and a lot more.
Hence it is a really complicated process which is another big challenge for Machine learning professionals.
Lack of training data
The most important task in the machine learning process is to train the model on data so that it
produces accurate output. Too little training data will produce inaccurate or overly biased
predictions.
Slow implementation
This is one of the common issues faced by machine learning professionals. The machine learning models
are highly efficient in providing accurate results, but it takes a tremendous amount of time. Slow programs,
data overload, and excessive requirements usually take a lot of time to provide accurate results. Further, it
requires constant monitoring and maintenance to deliver the best output.
Imperfections in the Algorithm When Data Grows
Suppose you have found quality data, trained the model well, and the predictions are precise and accurate. Yay,
you have learned how to create a machine learning model! But wait, there is a twist: the model may
become useless in the future as data grows. The best model of the present may become inaccurate in the
future and require retraining or adjustment. So you need regular monitoring and maintenance to keep
the algorithm working. This is one of the most exhausting issues faced by machine learning professionals.
Data Science vs. Machine Learning
Data Science:
1. A field of deep study of data that includes extracting useful insights from the data and processing that
information using different tools, statistical models, and machine learning algorithms.
2. It is used for discovering insights from the data.
3. It is a broad term that includes the various steps needed to create a model for a given problem and deploy
the model.
4. A data scientist needs skills in big data tools like Hadoop, Hive and Pig, statistics,
programming in Python, R, or Scala, data …
5. It can work with raw, structured, and unstructured data.
6. Data scientists spend a lot of time handling the data, cleansing the data, and understanding its …
Machine Learning:
1. Machine learning allows computers to learn from past experience on their own; it uses statistical
methods to improve performance and predict the output without being explicitly programmed.
2. It is used for making predictions and classifying the result for new data points.
3. It is used in the data modeling step of the complete data science process.
4. A machine learning engineer needs skills such as computer science fundamentals,
programming in Python or R, statistics and probability concepts, etc.
5. It mostly requires structured data to work on.
6. ML engineers spend a lot of time managing the complexities that occur during the
implementation of algorithms and the mathematical concepts behind them.
Choosing the Training Experience
The first task is to choose the training experience (training data) that will be fed to the
machine learning algorithm. Three attributes of the training experience matter:
• Whether it provides direct or indirect feedback regarding the choices made by the learner
• The degree to which the learner controls the sequence of training examples
• How well it represents the distribution of examples over which final performance is measured
Choosing target function:
The next step is to choose the target function, that is, exactly what kind of knowledge is to be learned.
Here the algorithm learns a NextMove function that, given the current position, selects one of the legal
moves. For example: while playing chess, after the opponent moves, the algorithm determines the set of
legal moves available and must choose the one most likely to lead to success.
Choosing Representation for Target
Once the algorithm knows all the possible legal moves, the next step is to choose a representation for the
target function so that the best move can be selected, e.g. linear equations, a hierarchical graph
representation, or a tabular form. Using this representation, the NextMove function picks, out of the legal
moves, the one with the highest expected success rate. For example: if the machine has 4 possible moves in a
chess position, it chooses the move that is most likely to lead to success.
Choosing Function Approximation
An optimized move cannot be chosen from the training data alone. The learner has to work through a
set of training examples and use them to approximate which moves should be chosen; feedback on the
outcome is then used to refine that approximation. For example: when chess-playing training data is
fed to the algorithm, it is not known in advance whether a move will fail or succeed. From each
failure or success, the algorithm estimates which move should be chosen next and what its
success rate is.
Final Design:
The final design emerges after the system has worked through many examples, failures and successes, and
correct and incorrect decisions. Example: Deep Blue, IBM's chess computer, won a match against
world chess champion Garry Kasparov in 1997, becoming the first computer to defeat a reigning
world chess champion in a match.
Introduction of Machine Learning Approaches
We can decide which machine learning approach/algorithm to select based on the problem
statement, its interaction with the environment, and the type of data and inputs that are
expected. We can categorize machine learning algorithms into two groups:
1) Learning algorithms
2) Similarity algorithms
The similarity algorithms are further used as learning models, depending on the type of problem.
Machine Learning Algorithms
Similarity Algorithms
• Regression Algorithms
• Clustering
• Decision Tree Algorithms
• Artificial Neural Networks
• Support Vector Machine
• Reinforcement Learning
• Bayesian networks
• Support Vector Machine
• Genetic Algorithm
Artificial Neural Network
Warren McCulloch and Walter Pitts published the first concept of a simplified brain cell, the
so-called McCulloch-Pitts (MCP) neuron, in 1943 (A Logical Calculus of the Ideas
Immanent in Nervous Activity, W. S. McCulloch and W. Pitts, Bulletin of Mathematical
Biophysics, 5(4): 115-133, 1943). Biological neurons are interconnected nerve cells in the
brain that are involved in the processing and transmitting of chemical and electrical signals.
McCulloch and Pitts described such a nerve cell as a simple logic gate with binary outputs;
multiple signals arrive at the dendrites, they are then integrated into the cell body, and, if the
accumulated signal exceeds a certain threshold, an output signal is generated that will be
passed on by the axon. Frank Rosenblatt published the first concept of the perceptron
learning rule based on the MCP neuron model (The Perceptron: A Perceiving and
Recognizing Automaton, F. Rosenblatt, Cornell Aeronautical Laboratory, 1957). With his
perceptron rule, Rosenblatt proposed an algorithm that would automatically learn the
optimal weight coefficients that would then be multiplied with the input features in order to
make the decision of whether a neuron fires (transmits a signal) or not. In the context of
supervised learning and classification, such an algorithm could then be used to predict
whether a new data point belongs to one class or the other.
The Formal Definition of An Artificial Neuron
More formally, we can put the idea behind artificial neurons into the context of a binary
classification task where we refer to our two classes as 1 (positive class) and –1 (negative
class) for simplicity. We can then define a decision function, $\phi(z)$, that takes a linear
combination of certain input values, x, and a corresponding weight vector, w, where z is the
so-called net input:
$z = w_1 x_1 + w_2 x_2 + \dots + w_m x_m$, with $\mathbf{w} = [w_1, \dots, w_m]^T$ and $\mathbf{x} = [x_1, \dots, x_m]^T$
• If the net input of a particular example, $\mathbf{x}^{(i)}$, is greater than or equal to a defined threshold, $\theta$, we
predict class 1, and class –1 otherwise. In the perceptron algorithm, the decision function,
$\phi(\cdot)$, is a variant of a unit step function:
$\phi(z) = 1$ if $z \ge \theta$, and $\phi(z) = -1$ otherwise
• For simplicity, we can bring the threshold, $\theta$, to the left side of the equation and define a
weight-zero as $w_0 = -\theta$ and $x_0 = 1$ so that we write z in a more compact form:
$z = w_0 x_0 + w_1 x_1 + \dots + w_m x_m = \mathbf{w}^T \mathbf{x}$, and $\phi(z) = 1$ if $z \ge 0$, and $\phi(z) = -1$ otherwise
• In machine learning literature, the negative threshold, or weight, $w_0 = -\theta$, is usually called
the bias unit.
The following figure illustrates how the net input, $z = \mathbf{w}^T \mathbf{x}$, is squashed into a binary output
(–1 or 1) by the decision function of the perceptron (left subfigure) and how it can be used
to discriminate between two linearly separable classes (right subfigure).
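A minimal sketch of the net input and the unit step decision function, assuming the compact form in which the threshold is folded into a bias weight w[0] = -theta with x0 = 1. The weights and inputs below are hypothetical values chosen for illustration.

```python
def net_input(x, w):
    """z = w0*x0 + w1*x1 + ... + wm*xm with x0 = 1, so w[0] acts as the bias unit."""
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))

def predict(x, w):
    """Unit step decision function: class 1 if z >= 0, class -1 otherwise."""
    return 1 if net_input(x, w) >= 0 else -1

theta = 0.5                 # hypothetical threshold
w = [-theta, 1.0, 1.0]      # w[0] = -theta is the bias unit

print(predict([0.2, 0.1], w))   # net input is below zero
print(predict([0.4, 0.3], w))   # net input is above zero
```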
The perceptron learning rule
The whole idea behind the MCP neuron and Rosenblatt's thresholded perceptron model is to
use a reductionist approach to mimic how a single neuron in the brain works: it either fires
or it doesn't. Thus, Rosenblatt's initial perceptron rule is fairly simple, and the perceptron
algorithm can be summarized by the following steps:
1. Initialize the weights to 0 or small random numbers.
2. For each training example, $\mathbf{x}^{(i)}$:
a. Compute the output value, $\hat{y}$.
b. Update the weights.
• Here, the output value is the class label predicted by the unit step function that we defined
earlier, and the simultaneous update of each weight, $w_j$, in the weight vector, w, can be
more formally written as: $w_j := w_j + \Delta w_j$
• The update value for $w_j$ (or change in $w_j$), which we refer to as $\Delta w_j$, is calculated by the perceptron
learning rule as follows:
$\Delta w_j = \eta (y^{(i)} - \hat{y}^{(i)}) x_j^{(i)}$
• Where $\eta$ is the learning rate (typically a constant between 0.0 and 1.0), $y^{(i)}$ is the true class label of
the ith training example, and $\hat{y}^{(i)}$ is the predicted class label. It is important to note that all weights
in the weight vector are being updated simultaneously, which means that we don't recompute the
predicted label $\hat{y}^{(i)}$ before all of the weights are updated via the respective update values $\Delta w_j$.
Concretely, for a two-dimensional dataset, we would write the update as:
$\Delta w_0 = \eta (y^{(i)} - \hat{y}^{(i)})$
$\Delta w_1 = \eta (y^{(i)} - \hat{y}^{(i)}) x_1^{(i)}$
$\Delta w_2 = \eta (y^{(i)} - \hat{y}^{(i)}) x_2^{(i)}$
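• Let's go through a simple thought experiment to illustrate how beautifully simple this
learning rule really is. In the two scenarios where the perceptron predicts the class label
correctly, the weights remain unchanged, since the update values are 0:
(1) $y^{(i)} = -1,\ \hat{y}^{(i)} = -1$: $\Delta w_j = \eta(-1 - (-1)) x_j^{(i)} = 0$
(2) $y^{(i)} = 1,\ \hat{y}^{(i)} = 1$: $\Delta w_j = \eta(1 - 1) x_j^{(i)} = 0$
• However, in the case of a wrong prediction, the weights are being pushed toward the
direction of the positive or negative target class:
(3) $y^{(i)} = 1,\ \hat{y}^{(i)} = -1$: $\Delta w_j = \eta(1 - (-1)) x_j^{(i)} = 2\eta x_j^{(i)}$
(4) $y^{(i)} = -1,\ \hat{y}^{(i)} = 1$: $\Delta w_j = \eta(-1 - 1) x_j^{(i)} = -2\eta x_j^{(i)}$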
• To get a better understanding of the multiplicative factor, $x_j^{(i)}$, let's go through another
simple example, where:
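The concrete values of this example are not recoverable from the text, so the following worked instance uses assumed values: a misclassified positive example with $y^{(i)} = 1$, $\hat{y}^{(i)} = -1$ and $\eta = 1$, evaluated for two hypothetical feature values.

```latex
\Delta w_j = \eta \left( y^{(i)} - \hat{y}^{(i)} \right) x_j^{(i)}

x_j^{(i)} = 0.5: \quad \Delta w_j = (1 - (-1)) \cdot 0.5 = 2 \cdot 0.5 = 1

x_j^{(i)} = 2: \quad \Delta w_j = (1 - (-1)) \cdot 2 = 2 \cdot 2 = 4
```

The weight update is proportional to the value of $x_j^{(i)}$: the larger the feature value of a misclassified example, the further the decision boundary is pushed to classify that example correctly next time.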
It is important to note that the convergence of the perceptron is only guaranteed if the two classes are
linearly separable and the learning rate is sufficiently small. If the two classes can't be separated by a
linear decision boundary, we can set a maximum number of passes over the training dataset (epochs)
and/or a threshold for the number of tolerated misclassifications; the perceptron would never stop
updating the weights otherwise.
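The perceptron learning rule, including the cap on the number of epochs, can be sketched as follows. This is an illustrative plain-Python version, not a reference implementation; the AND-like toy dataset, the learning rate of 0.5 and the function names are assumptions.

```python
def predict(x, w):
    """Unit step on the net input; w[0] is the bias unit."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1 if z >= 0 else -1

def train_perceptron(X, y, eta=0.5, max_epochs=20):
    """Perceptron rule: w_j := w_j + eta * (y - y_hat) * x_j, one example at a time."""
    w = [0.0] * (len(X[0]) + 1)          # step 1: initialize weights to 0
    for _ in range(max_epochs):          # cap on passes over the training data
        errors = 0
        for xi, target in zip(X, y):
            # The prediction is computed once, then all weights are updated
            update = eta * (target - predict(xi, w))
            w[0] += update               # bias unit (x0 = 1)
            for j, xj in enumerate(xi):
                w[j + 1] += update * xj
            errors += int(update != 0.0)
        if errors == 0:                  # every example classified correctly
            break
    return w

# Linearly separable toy data: class 1 only when both inputs are 1
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [-1, -1, -1, 1]

w = train_perceptron(X, y)
print([predict(xi, w) for xi in X])
```

Because this data is linearly separable, the loop stops once an epoch produces no misclassification; on non-separable data it would run until `max_epochs` is reached.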
General concept of the perceptron
The three general layers of a neural network
The middle layers are considered hidden because, like human vision, they covertly
process objects between the input and output layers. When faced with four lines
connected in the shape of a square, our eyes instantly recognize those four lines as a
square. We don’t notice the mental processing that is involved to register the four
polylines (input) as a square (output).
Multilayer Perceptrons
Multilayer Perceptron: The multilayer perceptron (MLP), as with other ANN techniques,
is an algorithm for predicting a categorical (classification) or continuous (regression) target
variable. Multilayer perceptrons are powerful because they aggregate multiple models into a
unified prediction model, as demonstrated by the classification model.
We used supervised learning techniques to build machine learning models, using data where
the answer was already known—the class labels were already available in our training data.
Now, we will switch gears and explore cluster analysis, a category of unsupervised learning
techniques that allows us to discover hidden structures in data where we do not know the
right answer upfront. The goal of clustering is to find a natural grouping in data so that
items in the same cluster are more similar to each other than to those from different clusters.
Grouping objects by similarity using k-means
• k-means is one of the most popular clustering algorithms, widely used in academia as
well as in industry. Clustering (or cluster analysis) is a technique that allows us to find
groups of similar objects that are more related to each other than to objects in other groups.
• Examples of business oriented applications of clustering include the grouping of
documents, music, and movies by different topics, or finding customers that share similar
interests based on common purchase behaviors as a basis for recommendation engines.
K-means clustering Algorithm
• The k-means algorithm is extremely easy to implement and is also computationally very efficient
compared to other clustering algorithms, which might explain its popularity. The k-means algorithm
belongs to the category of prototype-based clustering. We will discuss two other categories of clustering,
hierarchical and density-based clustering.
• Prototype-based clustering means that each cluster is represented by a prototype, which is usually either
the centroid (average) of similar points with continuous features, or the medoid (the most representative
or the point that minimizes the distance to all other points that belong to a particular cluster) in the case
of categorical features. While k-means is very good at identifying clusters with a spherical shape, one of
the drawbacks of this clustering algorithm is that we have to specify the number of clusters, k, a priori.
An inappropriate choice for k can result in poor clustering performance. Later, we will discuss the elbow
method and silhouette plots, which are useful techniques to evaluate the quality of a clustering to help us
determine the optimal number of clusters, k.
K-means clustering Algorithm for k=3
If we were to set k to 4, an additional cluster would be derived from the dataset to produce four clusters.
How does k-means clustering separate the data
• The first step is to examine the unclustered data and manually select a centroid for each
cluster. That centroid then forms the epicenter of an individual cluster.
• Centroids can be chosen at random, which means you can nominate any data point on the
scatterplot to act as a centroid. However, you can save time by selecting centroids
dispersed across the scatterplot and not directly adjacent to each other. In other words,
start by guessing where you think the centroids for each cluster might be located. The
remaining data points on the scatterplot are then assigned to the nearest centroid by
measuring the Euclidean distance.
Each data point can be assigned to only one cluster, and each cluster is discrete. This means
that there’s no overlap between clusters and no case of nesting a cluster inside another
cluster. Also, all data points, including anomalies, are assigned to a centroid irrespective of
how they impact the final shape of the cluster. However, due to the statistical force that pulls
all nearby data points to a central point, clusters will typically form an elliptical or spherical shape.
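The assignment of points to the nearest centroid, followed by moving each centroid to the mean of its cluster, can be sketched as below. This is a minimal plain-Python version under stated assumptions: the toy points and the manually chosen initial centroids are made up, and the loop stops when assignments no longer change.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Assign each point to its nearest centroid, then move each centroid to the
    mean of its assigned points, until the assignments stop changing."""
    centroids = [list(c) for c in centroids]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance
        new_assignments = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Update step: centroid becomes the mean of its cluster (unchanged if empty)
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
initial = [points[0], points[3]]    # two well-separated starting centroids (k = 2)

centroids, labels = kmeans(points, initial)
print(labels)
```

Each point ends up in exactly one cluster, and choosing initial centroids that are spread out (rather than adjacent) lets the procedure settle after very few iterations on this data.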
Decision Tree Learning
Decision tree classifiers are attractive models if we care about interpretability. As the name
"decision tree" suggests, we can think of this model as breaking down our data by making a
decision based on asking a series of questions. Let's consider the following example in
which we use a decision tree to decide upon an activity on a particular day:
Based on the features in our training dataset, the decision tree model learns a series of
questions to infer the class labels of the examples. Although the preceding figure illustrates
the concept of a decision tree based on categorical variables, the same concept applies if our
features are real numbers, like in the Iris dataset. For example, we could simply define a
cut-off value along the sepal width feature axis and ask a binary question: "Is the sepal
width ≥ 2.8 cm?" Using the decision algorithm, we start at the tree root and split the data on
the feature that results in the largest information gain (IG), which will be explained in more
detail in the following section. In an iterative process, we can then repeat this splitting
procedure at each child node until the leaves are pure. This means that the training examples
at each node all belong to the same class. In practice, this can result in a very deep tree with
many nodes, which can easily lead to overfitting. Thus, we typically want to prune the tree
by setting a limit for the maximal depth of the tree.
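Information gain can be sketched as the reduction in impurity from a parent node to its child nodes. The version below assumes entropy as the impurity measure, which is one common choice and is not fixed by the text above; the label lists are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of the class labels at a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent impurity minus the size-weighted impurity of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 4 + ["no"] * 4
pure_split = [["yes"] * 4, ["no"] * 4]                      # both leaves are pure
mixed_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

print(information_gain(parent, pure_split))
print(information_gain(parent, mixed_split))
```

A split that produces pure leaves yields the largest gain, while a split that leaves the class mix unchanged yields no gain, so the tree prefers the former.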
Decision Tree
In general, decision trees represent a disjunction of conjunctions of constraints on the attribute values
of instances. Each path from the tree root to a leaf corresponds to a conjunction of attribute tests, and
the tree itself to a disjunction of these conjunctions.
(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
Decision trees classify instances by sorting them down the tree from the root to some leaf
node, which provides the classification of the instance. Each node in the tree specifies a
test of some attribute of the instance, and each branch descending from that node
corresponds to one of the possible values for this attribute. An instance is classified by
starting at the root node of the tree, testing the attribute specified by this node, then moving
down the tree branch corresponding to the value of the attribute in the given example. This
process is then repeated for the subtree rooted at the new node. The decision tree in this example classifies
Saturday mornings according to whether they are suitable for a given activity,
e.g. (Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong)
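Read as a disjunction of conjunctions, the tree can be written directly as a Boolean expression. The sketch below encodes the expression (Outlook = Sunny and Humidity = Normal) or (Outlook = Overcast) or (Outlook = Rain and Wind = Weak) and applies it to the example instance; the function name `suitable` and the dictionary encoding are assumptions made for illustration.

```python
def suitable(instance):
    """One conjunction per root-to-leaf path that ends in a positive leaf."""
    outlook = instance["Outlook"]
    return (
        (outlook == "Sunny" and instance["Humidity"] == "Normal")
        or outlook == "Overcast"
        or (outlook == "Rain" and instance["Wind"] == "Weak")
    )

example = {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "High", "Wind": "Strong"}
print(suitable(example))
```

The example instance follows the Sunny branch and fails the Humidity = Normal test, so it is classified as negative. Temperature is never tested because it does not appear on any path of this tree.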
• Instances are represented by attribute-value pairs: Instances are described by a fixed set of attributes
(e.g. Temperature) and their values (e.g., Hot). The easiest situation for decision tree learning is when
each attribute takes on a small number of disjoint possible values (e.g., Hot, Mild, Cold). However,
extensions to the basic algorithm allow handling real-valued attributes as well (e.g., representing
Temperature numerically).
• The target function has discrete output values: The decision tree assigns a Boolean classification
(e.g., yes or no) to each example. Decision tree methods easily extend to learning functions with more
than two possible output values. A more substantial extension allows learning target functions with real-
valued outputs, though the application of decision trees in this setting is less common.
• The training data may contain errors.
• The training data may contain missing attribute values.
Decision tree learning has therefore been applied to problems such as learning to classify
medical patients by their disease, equipment malfunctions by their cause, and loan
applicants by their likelihood of defaulting on payments. Such problems, in which the task
is to classify examples into one of a discrete set of possible categories, are often referred to
as classification problems.
What is Inductive Learning?
From the perspective of inductive learning, we are given input samples (x) and output samples (f(x)), and
the problem is to estimate the function (f). Specifically, the problem is to generalize from the samples so
that the learned mapping is useful for estimating the output for new samples in the future. In practice it is
almost always too hard to estimate the function exactly, so we look for very good approximations of the function.
Examples:
• Credit risk assessment.
  • The x is the properties of the customer.
  • The f(x) is credit approved or not.
• Disease diagnosis.
  • The x are the properties of the patient.
  • The f(x) is the disease they suffer from.
• Face recognition.
  • The x are bitmaps of people's faces.
  • The f(x) is to assign a name to the face.
• Automatic steering.
  • The x are bitmap images from a camera in front of the car.
  • The f(x) is the degree the steering wheel should be turned.
When Should You Use Inductive Learning?
There are problems where inductive learning is not a good idea. It is important to know when to use
and when not to use supervised machine learning.
Four problems where inductive learning might be a good idea:
• Problems where there is no human expert. If people do not know the answer, they
cannot write a program to solve it. These are areas of true discovery.
• Humans can perform the task but no one can describe how to do it. There are
problems where humans can do things that computers cannot do, or cannot do well. Examples
include riding a bike or driving a car.
• Problems where the desired function changes frequently. Humans could describe it
and they could write a program to do it, but the problem changes too often, so it is not cost
effective. Examples include the stock market.
• Problems where each user needs a custom function. It is not cost effective to write a
custom program for each user. Examples include recommendations of movies or books on
Netflix or Amazon.
Two perspectives on inductive learning:
• Learning is the removal of uncertainty. Having data removes some uncertainty.
By selecting a class of hypotheses, we remove more uncertainty.
• Learning is guessing a good and small hypothesis class. It requires guessing: we don't
know the solution, so we must use a trial and error process. If you knew the domain with
certainty, you wouldn't need learning. But we are not guessing in the dark.
A Framework For Studying Inductive Learning
• Training example: a sample from x including its output from the target function
• Target function: the mapping function f from x to f(x)
• Hypothesis: approximation of f, a candidate function.
• Concept: A Boolean target function, positive examples and negative examples for the 1/0
class values.
• Classifier: Learning program outputs a classifier that can be used to classify.
• Learner: Process that creates the classifier.
• Hypothesis space: set of possible approximations of f that the algorithm can create.
• Version space: subset of the hypothesis space that is consistent with the observed data
• Linear
• Logistic
Linear Regression
Regression models are used to predict target variables on a continuous scale, which makes
them attractive for addressing many questions in science.
They also have applications in industry, such as understanding relationships between
variables, evaluating trends, or making forecasts. One example is predicting the sales of a
company in future months.
Introducing linear regression
The goal of linear regression is to model the relationship between one or multiple features
and a continuous target variable.
In contrast to classification—a different subcategory of supervised learning—regression
analysis aims to predict outputs on a continuous scale rather than categorical class labels.
Simple linear regression
• The goal of simple (univariate) linear regression is to model the relationship between a
single feature (explanatory variable, x) and a continuous-valued target (response variable,
y). The equation of a linear model with one explanatory variable is defined as follows:
$y = w_0 + w_1 x$
• Here, $w_0$ represents the y-axis intercept and $w_1$ is the weight coefficient of the explanatory
variable. Our goal is to learn the weights of the linear equation to describe the relationship
between the explanatory variable and the target variable, which can then be used to
predict the responses of new explanatory variables that were not part of the training dataset.
The values w0 and w1 must be chosen so that they minimize the error. If the sum of squared
errors is taken as the metric to evaluate the model, then the goal is to obtain the line that minimizes
this error. If we don't square the errors, then positive and negative errors will cancel each other out.
Intercept calculation: $w_0 = \bar{y} - w_1 \bar{x}$
Coefficient formula: $w_1 = \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i} (x_i - \bar{x})^2}$
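A minimal sketch of fitting w0 and w1 for simple linear regression, assuming the standard least-squares closed form: w1 is the sum of products of deviations from the means divided by the sum of squared deviations of x, and w0 = mean(y) - w1 * mean(x). The data below is made up so that the points lie exactly on a line.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (w0, w1) minimizing the sum of squared errors of y = w0 + w1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w1 = sxy / sxx
    # Intercept: makes the line pass through the point of means
    w0 = y_mean - w1 * x_mean
    return w0, w1

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]     # generated from y = 1 + 2x

w0, w1 = fit_simple_linear_regression(xs, ys)
print(w0, w1)
```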
• Exploring w1
  • If w1 > 0, then x (predictor) and y (target) have a positive relationship: an increase
  in x increases y.
  • If w1 < 0, then x (predictor) and y (target) have a negative relationship: an increase
  in x decreases y.
Exploring w0
• If the data does not include x = 0, then a prediction based on w0 alone is meaningless.
For example, take a dataset that relates height (x) and weight (y). Setting x = 0 (that
is, height 0) leaves only the w0 term in the equation, which is meaningless because
in reality height and weight can never be zero. The problem arises from using the
model beyond the range of the data.
• If the data includes x = 0, then w0 is the average predicted value when
x = 0. But setting all the predictor variables to zero is often impossible.
• The w0 term guarantees that the residuals have mean zero. If there is no w0 term, the
regression line is forced to pass through the origin, and both the regression coefficient and
the predictions will be biased.
Housing Prices
[Figure: scatter plot of price (in 1000s of $) against size (feet²).]
Training Set (m = 47)
Size in feet² (x)    Price ($) in 1000's (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …
m = Number of training examples
x's = "input" variable / features
y's = "output" variable / "target" variable
Hypothesis: $h_w(x) = w_0 + w_1 x$
$w_i$'s: Parameters
How to choose the $w_i$'s?
Idea: Choose $w_0, w_1$ so that $h_w(x)$ is close to $y$ for our training examples $(x, y)$
Linear regression with one variable
Cost Function:
$J(w_0, w_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right)^2$
[Figure: left panel, $h_w(x)$ (for fixed $w_1$, this is a function of x); right panel, $J(w_1)$ (function of the parameter $w_1$).]
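A sketch of the cost function for linear regression with one variable, assuming the squared-error cost with a 1/(2m) factor and a hypothesis of the form w0 + w1 * x. The three training points are made up so that the line with w0 = 0 and w1 = 1 fits them exactly.

```python
def hypothesis(x, w0, w1):
    """Prediction of the linear model for a single input x."""
    return w0 + w1 * x

def cost(xs, ys, w0, w1):
    """Squared-error cost: (1 / 2m) * sum of (prediction - target)^2."""
    m = len(xs)
    return sum((hypothesis(x, w0, w1) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

# Cost as a function of w1 alone, with w0 fixed at 0
for w1 in (0.0, 0.5, 1.0):
    print(w1, cost(xs, ys, 0.0, w1))
```

The cost is zero only for the parameters that reproduce every training target, and it grows as w1 moves away from that value, which is why choosing the parameters amounts to minimizing this function.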
Email: Spam / Not Spam?
Online Transactions: Fraudulent (Yes / No)?
Tumor: Malignant / Benign ?
0: “Negative Class” (e.g., benign tumor)
1: “Positive Class” (e.g., malignant tumor)
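Threshold the classifier output $h_w(x)$ (the hypothesis) at 0.5:
If $h_w(x) \ge 0.5$, predict "y = 1"
If $h_w(x) < 0.5$, predict "y = 0"
[Figure: Malignant? (1 = Yes, 0 = No) plotted against Tumor Size.]
Classification: y = 0 or 1
With linear regression, $h_w(x)$ can be > 1 or < 0
Logistic Regression: $0 \le h_w(x) \le 1$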
Logistic Regression
As demonstrated, linear regression is a useful technique to quantify relationships between
continuous variables. Predicting discrete variables also plays a major part in data analysis
and machine learning. For instance, is something “A” or “B?” Is it “positive” or “negative?”
Is this person a “new customer” or a “returning customer?” Unlike linear regression, the
dependent variable (y) is no longer a continuous variable (such as price) but rather a discrete
categorical variable. The independent variables used as input to predict the dependent
variable can be either categorical or continuous.
Logistic Regression Model
$h_w(x) = g(\mathbf{w}^T \mathbf{x})$, where $g(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function (also called the logistic function)
Figure: A sigmoid function used to classify data points
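A minimal sketch of how a sigmoid function can be used to classify data points, assuming a linear score passed through the sigmoid and thresholded at 0.5. The weights are hypothetical values picked so that the decision boundary sits at a tumor size of 4; they are not learned from data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w):
    """Estimated probability of the positive class; w[0] is the bias term."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return sigmoid(z)

def predict_label(x, w, threshold=0.5):
    """Class 1 if the estimated probability reaches the threshold, else class 0."""
    return 1 if predict_proba(x, w) >= threshold else 0

w = [-4.0, 1.0]    # hypothetical weights: z = tumor_size - 4

print(predict_label([2.0], w))   # small tumor
print(predict_label([6.0], w))   # large tumor
```

Unlike the output of a linear regression model, the sigmoid output always stays between 0 and 1, so it can be read as a probability before thresholding.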
Example: Linear regression (housing prices)
Overfitting: If we have too many features, the learned hypothesis
may fit the training set very well (training cost $J \approx 0$), but
fail to generalize to new examples (predict prices on new examples).
Example: Logistic regression
($g$ = sigmoid function)