Machine Learning
ASSIGNMENT 2
NAME: Faizan Arshad
REG.NO: 2018-P/2019-EE-139
SECTION: B
SUBMITTED TO
DR KASHIF JAVAID
University of Engineering & Technology Lahore, Pakistan
Logistic regression
Logistic regression is a statistical method that is used for building machine learning models where
the dependent variable is dichotomous: i.e. binary. Logistic regression is used to describe data and
the relationship between one dependent variable and one or more independent variables. The
independent variables can be nominal, ordinal, or of interval type. The name “logistic regression”
is derived from the concept of the logistic function that it uses. The logistic function is also known
as the sigmoid function. The value of this logistic function lies between zero and one. For
example, a logistic function can be used to find the probability of a vehicle breaking down,
depending on how many years it has been since it was last serviced.[1]
Advantages of the Logistic Regression Algorithm
Logistic regression performs better when the data is linearly separable
It does not require many computational resources, and it is highly interpretable
Scaling the input features is not a problem, and it does not require extensive tuning
It is easy to implement and train a model using logistic regression
It gives a measure of how relevant a predictor (coefficient size) is, and its direction of
association (positive or negative)
How Does the Logistic Regression Algorithm Work?
The sigmoid function (the logistic regression model) is used to map predicted values to
probabilities. The sigmoid function traces an ‘S’-shaped curve when plotted on a graph. The
graph plots the predicted values between 0 and 1. The values are then plotted towards the margins
at the top and the bottom of the Y-axis, with the labels as 0 and 1. Based on these values, the target
variable can be classified in either of the classes.
The equation for the sigmoid function is given as:
y = 1 / (1 + e^(-x)),
where e is the exponential constant, with a value of approximately 2.718.
This equation gives a value of y (the predicted value) close to zero if x is a large negative
value. Similarly, if x is a large positive value, the predicted value of y is close to one.
A decision boundary can be set to predict the class to which the data belongs. Based on the set
value, the estimated values can be classified into classes.
For instance, let us take the example of classifying emails as spam or not spam. If the predicted
value (p) is greater than or equal to 0.5, the email is classified as spam; otherwise it is classified as not spam.[2]
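As a minimal sketch of the idea above, the sigmoid and a 0.5 decision threshold can be written in plain Python (the function names here are purely illustrative):

```python
import math

def sigmoid(x):
    # Logistic (sigmoid) function: maps any real x into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def classify(x, threshold=0.5):
    # Assign class 1 (e.g. "spam") when the predicted probability
    # reaches the threshold, otherwise class 0.
    return 1 if sigmoid(x) >= threshold else 0

print(sigmoid(-6))   # close to zero for a large negative input
print(sigmoid(6))    # close to one for a large positive input
print(classify(2.5))
```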
Types of logistic regression
Logistic regression models are generally used for predictive analysis in the binary classification of
data. However, they can also be used for multi-class classification. Logistic regression models fall
into three main categories:
Binary Logistic Regression Model
This is one of the most widely used logistic regression models, used to predict and categorize data
into either of two classes. For example, a patient either has cancerous cells or they do not; the
data can’t belong to both categories at the same time.
Multinomial Logistic Regression Model
The multinomial logistic regression model is used to classify the target variable into multiple
classes, irrespective of any quantitative significance. For instance, the type of food an individual
is likely to order based on their diet preferences: vegetarian, non-vegetarian, or vegan.
Ordinal Logistic Regression Model
The ordinal logistic regression model is used to classify the target variable into classes that follow
a natural order. For example, a pupil’s performance in an examination can be classified as poor,
good, or excellent in a hierarchical order. Thus, the data is not only classified into three distinct
categories, but each category also has a unique level of importance.
The logistic regression algorithm can be used in a plethora of cases such as tumor classification,
spam detection, and sex categorization, to name a few. Let’s have a look at some logistic regression
examples to get a better idea.[2]
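For a concrete illustration, here is a sketch of binary logistic regression with scikit-learn (assuming scikit-learn is installed; the data below is synthetic, not from any of the examples above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic two-class data with four predictors.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)

# Coefficient size and sign indicate each predictor's relevance
# and direction of association, as noted in the advantages above.
print(model.coef_)
print(model.score(X_test, y_test))       # accuracy on held-out data
print(model.predict_proba(X_test[:3]))   # class probabilities in (0, 1)
```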
Random Forest
A random forest is a supervised machine learning algorithm that is constructed from decision tree
algorithms. This algorithm is applied in various industries such as banking and e-commerce to
predict behavior and outcomes. A random forest algorithm consists of many decision trees. The
‘forest’ generated by the random forest algorithm is trained through bagging or bootstrap
aggregating. Bagging is an ensemble meta algorithm that improves the accuracy of machine
learning algorithms. A random forest eradicates the limitations of a decision tree algorithm. It
reduces the overfitting of datasets and increases precision. It generates predictions without
requiring many configurations in packages. Classification in random forests employs an ensemble
methodology to attain the outcome. The training data is fed to train various decision trees. This
dataset consists of observations and features that will be selected randomly during the splitting of
nodes.
A random forest system relies on various decision trees. Every decision tree consists of decision nodes,
leaf nodes, and a root node. The leaf node of each tree is the final output produced by that specific
decision tree. The selection of the final output follows a majority-voting system: the
output chosen by the majority of the decision trees becomes the final output of the random forest
system. [3]
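The bagging-and-voting procedure described above can be sketched with scikit-learn's RandomForestClassifier (assuming scikit-learn is available; the iris data set stands in for any tabular data):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 100 trees is trained on a bootstrap sample (bagging);
# the final class is chosen by majority vote across the trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print(forest.score(X_test, y_test))
# Estimates of which variables matter in the classification:
print(forest.feature_importances_)
```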
Some of the applications of the random forest may include:
1. Banking
2. Stock market
3. E-Commerce
4. Health Care System
Features of Random Forests
It is unexcelled in accuracy among current algorithms.
It runs efficiently on large databases.
It can handle thousands of input variables without variable deletion.
It gives estimates of what variables are important in the classification.
It generates an internal unbiased estimate of the generalization error as the forest building
progresses.
It has an effective method for estimating missing data and maintains accuracy when a large
proportion of the data are missing.
It has methods for balancing error in data sets with unbalanced class populations.
Generated forests can be saved for future use on other data.
Prototypes are computed that give information about the relation between the variables and
the classification.
It computes proximities between pairs of cases that can be used in clustering, locating
outliers, or (by scaling) give interesting views of the data.
The capabilities of the above can be extended to unlabeled data, leading to unsupervised
clustering, data views and outlier detection.
It offers an experimental method for detecting variable interactions.[4]
Support Vector Machines
A support vector machine (SVM) is a supervised machine learning model that uses classification
algorithms for two-group classification problems. After an SVM model is given sets of labeled
training data for each category, it is able to categorize new text. Compared to newer algorithms
like neural networks, SVMs have two main advantages: higher speed and better performance with a
limited number of samples (in the thousands). This makes the algorithm very suitable for text
classification problems, where it is common to have access to a dataset of at most a couple of
thousand tagged samples.[5]
How Does SVM Work?
The basics of Support Vector Machines and how it works are best understood with a simple
example. Let’s imagine we have two tags, red and blue, and our data has two features, x and y. We
want a classifier that, given a pair of (x, y) coordinates, outputs whether the point is red or blue.
We plot our already-labeled training data on a plane:
A support vector machine takes these data points and outputs the hyperplane (which in two
dimensions is simply a line) that best separates the tags. This line is the decision boundary:
anything that falls to one side of it we will classify as blue, and anything that falls to the other
as red.
But what exactly is the best hyperplane? For SVM, it’s the one that maximizes the margins from
both tags. In other words: the hyperplane (remember, in this case a line) whose distance to the
nearest element of each tag is the largest.
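The maximum-margin idea can be demonstrated on a tiny made-up data set (assuming scikit-learn; the six points below are purely illustrative):

```python
import numpy as np
from sklearn.svm import SVC

# "red" points (label 0) cluster low, "blue" points (label 1) cluster high.
X = np.array([[1, 1], [2, 1], [1, 2], [6, 6], [7, 6], [6, 7]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear")
clf.fit(X, y)

# The support vectors are the training points nearest the separating line;
# the maximum-margin boundary is defined entirely by them.
print(clf.support_vectors_)
print(clf.predict([[2, 2], [6, 5]]))
```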
Nonlinear data
Now this example was easy, since clearly the data was linearly separable — we could draw a
straight line to separate red and blue. Sadly, usually things aren’t that simple. Take a look at this
case:
It’s pretty clear that there’s not a linear decision boundary (a single straight line that separates both
tags). However, the vectors are very clearly segregated and it looks as though it should be easy to
separate them.
So here’s what we’ll do: we will add a third dimension. Up until now we had two
dimensions: x and y. We create a new z dimension, and we rule that it be calculated a certain way
that is convenient for us: z = x² + y² (you’ll notice that’s the equation for a circle).
This will give us a three-dimensional space. In that space, SVM can find a separating hyperplane.
Note that since we are in three dimensions now, the hyperplane is a plane parallel to the x-y plane
at a certain z (let’s say z = 1).[5]
What’s left is mapping it back to two dimensions: since z = x² + y², the boundary z = 1
corresponds to the circle x² + y² = 1 in the original plane.
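The z = x² + y² lift does not have to be coded by hand: a kernel performs an equivalent mapping implicitly. As a sketch (assuming scikit-learn), an RBF-kernel SVM separates a circular class layout that no straight line can split:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings of points: not linearly separable in 2D.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)

print("linear accuracy:", linear.score(X, y))  # poor: no straight line works
print("rbf accuracy:", rbf.score(X, y))        # the kernel handles the rings
```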
Advantages of SVM:
Effective in high-dimensional cases
It is memory-efficient, as it uses a subset of training points in the decision function, called
support vectors
Different kernel functions can be specified for the decision function, and it is possible to
specify custom kernels.[6]
ANOVA
Analysis of variance (ANOVA) is an analysis tool used in statistics that splits an observed
aggregate variability found inside a data set into two parts: systematic factors and random factors.
The systematic factors have a statistical influence on the given data set, while the random factors
do not. Analysts use the ANOVA test to determine the influence that independent variables have
on the dependent variable in a regression study.
The t- and z-test methods developed in the 20th century were used for statistical analysis until
1918, when Ronald Fisher created the analysis of variance method. ANOVA is also called the
Fisher analysis of variance, and it is the extension of the t- and z-tests. The term became well-
known in 1925, after appearing in Fisher's book, "Statistical Methods for Research Workers."
The ANOVA test is the initial step in analyzing factors that affect a given data set. Once the test
is finished, an analyst performs additional testing on the methodical factors that measurably
contribute to the data set's inconsistency. The analyst utilizes the ANOVA test results in an F-test
to generate additional data that aligns with the proposed regression models.
The ANOVA test allows a comparison of more than two groups at the same time to determine
whether a relationship exists between them. The result of the ANOVA formula, the F statistic
(also called the F-ratio), allows for the analysis of multiple groups of data to determine the
variability between samples and within samples.
If no real difference exists between the tested groups, which is called the null hypothesis, the
result of the ANOVA's F-ratio statistic will be close to 1. The distribution of all possible values
of the F statistic is the F-distribution. This is actually a group of distribution functions, with two
characteristic numbers, called the numerator degrees of freedom and the denominator degrees of
freedom.[7]
The formula for ANOVA is:
F = MST / MSE
where:
F = ANOVA coefficient
MST = mean sum of squares due to treatment
MSE = mean sum of squares due to error
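Under the usual one-way ANOVA definitions (MST over k − 1 treatment degrees of freedom, MSE over n − k error degrees of freedom), the formula can be computed by hand; the three samples below are made up for illustration:

```python
groups = [
    [85, 86, 88, 75, 78],
    [91, 92, 94, 89, 90],
    [79, 78, 88, 94, 92],
]

k = len(groups)                   # number of groups
n = sum(len(g) for g in groups)   # total observations
grand_mean = sum(sum(g) for g in groups) / n

# Between-group (treatment) sum of squares and mean square.
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
mst = ss_between / (k - 1)        # numerator degrees of freedom = k - 1

# Within-group (error) sum of squares and mean square.
ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
mse = ss_within / (n - k)         # denominator degrees of freedom = n - k

F = mst / mse
print(F)
```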
Method
A researcher might, for example, test students from multiple colleges to see if students from one
of the colleges consistently outperform students from the other colleges. In a business application,
an R&D researcher might test two different processes of creating a product to see if one process
is better than the other in terms of cost efficiency.
The type of ANOVA test used depends on a number of factors. It is applied when the data are
experimental. Analysis of variance is also employed when there is no access to statistical software,
in which case ANOVA is computed by hand. It is simple to use and best suited for small samples.
With many experimental designs, the sample sizes have to be the same for the various factor level
combinations.
ANOVA is helpful for testing three or more groups. It is similar to running multiple two-sample t-tests;
however, it results in fewer type I errors and is appropriate for a range of issues. ANOVA identifies
differences by comparing the means of each group, and involves spreading out the variance into
diverse sources. It is employed with subjects, test groups, between groups, and within groups.
One-Way ANOVA versus Two-Way ANOVA
There are two main types of ANOVA: one-way (or unidirectional) and two-way. There are also
variations of ANOVA. For example, MANOVA (multivariate ANOVA) differs from ANOVA as
the former tests for multiple dependent variables simultaneously while the latter assesses only
one dependent variable at a time. One-way or two-way refers to the number of independent
variables in your analysis of variance test. A one-way ANOVA evaluates the impact of a sole
factor on a sole response variable. It determines whether all the samples are the same. The one-
way ANOVA is used to determine whether there are any statistically significant differences
between the means of three or more independent (unrelated) groups.
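As a sketch of the college example above (assuming SciPy is installed; the scores below are invented), a one-way ANOVA over three independent groups looks like this:

```python
from scipy.stats import f_oneway

# Hypothetical exam scores from three colleges.
college_a = [85, 86, 88, 75, 78, 94, 98, 79, 71, 80]
college_b = [91, 92, 93, 85, 87, 84, 82, 88, 95, 96]
college_c = [79, 78, 88, 94, 92, 85, 83, 85, 82, 81]

f_stat, p_value = f_oneway(college_a, college_b, college_c)
print("F =", f_stat, "p =", p_value)

# If p < 0.05, we reject the null hypothesis that all group means are equal.
```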
A two-way ANOVA is an extension of the one-way ANOVA. With a one-way, you have one
independent variable affecting a dependent variable. With a two-way ANOVA, there are two
independents. For example, a two-way ANOVA allows a company to compare worker
productivity based on two independent variables, such as salary and skill set. It is utilized to
observe the interaction between the two factors and tests the effect of two factors at the same
time.
Results and Conclusion:
Split        RF Classifier    SVM Classifier    Logistic Regression Classifier
1st Split    0.7821           0.8718            0.7821
2nd Split    0.7179           0.8077            0.7436
3rd Split    0.8077           0.8718            0.8590
Average      76.92 %          85.04 %           79.49 %
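Per-split scores like those in the table can be produced with a procedure such as the following sketch (assuming scikit-learn; the data set here is synthetic, so the numbers will not match the table above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the assignment's data set.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    "RF": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
    "LogReg": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    # Three cross-validation splits -> three accuracy scores per model.
    scores = cross_val_score(model, X, y, cv=3)
    print(name, scores, "average:", scores.mean())
```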
And we have following results for ANOVA test:
The F value in a one-way ANOVA is a tool to help you answer the question: is the variance between
the group means significantly different from the variance within the groups? The F value in the
ANOVA test also determines the p-value; the p-value is the probability of getting a result at least
as extreme as the one that was actually observed, given that the null hypothesis is true.