Fdp ppt

Machine Learning
By
B.JAYARAM
Assistant Professor
Department of Computer Science and Engineering
Malla Reddy Institute of Technology
Hyderabad - 500055
5/16/2020 FDP ON HADOOP AND MACHINE LEARNING

Contents
• Machine Learning.
• Usage of Machine Learning.
• Supervised vs Unsupervised Learning.
• Classification.
• Regression Models.
• Decision trees.
• Random Forest.
• Logistic Regression.

Machine Learning
• What is data?
• Where is data available?
• Types of data.
• What is data analytics?
• How is data related to machine learning?
• Architecture of a machine learning model.

What is Data?
• A collection of information, typically stored in files or databases.
– Structured form
• Data with a predefined schema, such as relational database tables
in which relationships between attributes can be defined. Eg: data
held in relational database systems (Oracle, MySQL, etc.) and
queried with SQL.
– Unstructured form
• Any form of data that does not have a predefined
structure. Eg: videos, images, comments, posts, and some
websites such as blogs and Wikipedia.

Machine Learning
– There are many sources of data available.
– Primary source of data
• Eg: data created by an individual or a business on
their own.
– Secondary source of data
• Eg: data obtained from cloud servers and websites
(Kaggle, UCI, AWS, Google Cloud, Twitter,
Facebook, YouTube, GitHub, etc.)

Machine Learning
• Types of data

Qualitative Data
• Nominal Data.
– The values of the attribute have no natural ordering.
– Eg: color, gender, nouns (name, place, animal,
thing)
• Ordinal Data.
– The values of the attribute have a natural ordering.
– Eg: size (S, M, L, XL, XXL), rating (excellent, good,
average, poor)

Quantitative Data
• Discrete Attribute:
– It takes only a finite or countable set of numerical values
(typically integers).
– Eg: number of buttons, number of days for product
delivery, etc.
• Continuous Attribute:
– It can take any value within a range, including fractional values.
– Eg: price, discount, height, weight, length,
temperature, speed, etc.
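The four attribute types above map onto R's basic column types. The sketch below is only an illustration: the `products` data frame and its values are made up for this example and do not come from the slides.

```r
# Hypothetical product records, one column per attribute type.
products <- data.frame(
  color   = factor(c("red", "blue", "red")),            # nominal: no order
  size    = factor(c("S", "XL", "M"),
                   levels = c("S", "M", "L", "XL", "XXL"),
                   ordered = TRUE),                     # ordinal: ordered levels
  buttons = c(4L, 6L, 5L),                              # discrete: integers
  price   = c(499.50, 1299.00, 749.25)                  # continuous: real values
)

str(products)

# Ordered factors support comparisons; unordered (nominal) factors do not.
smaller_than_L <- products$size < "L"
print(smaller_than_L)
```

Declaring the level order explicitly matters: without `levels`, R would sort the sizes alphabetically.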

Sample Dataset
• Covid 19 Dataset (statewise in India)

Machine Learning
– Data analytics is the science of analyzing raw data in
order to draw conclusions about that information. ...
This information can then be used to optimize
processes to increase the overall efficiency of a
business or system.
Types:
– Descriptive analytics. Eg: observation, case studies,
surveys.
– Predictive analytics. Eg: healthcare, sports, weather,
insurance, social media analysis.
– Prescriptive analytics. Eg: healthcare, banking.

Machine Learning Cont….
• What way data is related to machine learning?
• Data analytics is a subcomponent of machine
learning.
(Figure: analytics diagram; only the label "Analytics" survives.)

Machine Learning
• Machine learning is an application of
artificial intelligence (AI) that gives
systems the ability to learn and improve
automatically from experience without being
explicitly programmed.
• Machine learning focuses on the
development of computer programs that can
access data and use it to learn for themselves.

Assumptions in Machine Learning
• If the assumptions are not met, the model may
not reflect the data accurately and will likely
give inaccurate predictions.
• The points to check are
– Diagnostics.
– Multicollinearity.
– Dataset distributions.
– Outliers.

Diagnostics
• Diagnostics are used to evaluate the model
assumptions and to find out whether there are
observations with a large, undue influence on the
analysis.
• They are mainly used in regression analysis (how
the dependent variable Y changes when one
of the independent X variables changes).
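One common way to look for observations with undue influence is Cook's distance. This is a minimal sketch, assuming R's built-in `cars` dataset as a stand-in; the 4/n cut-off is a widely used rule of thumb, not something stated in the slides.

```r
# Fit a simple regression on the built-in cars data (stopping distance vs speed).
fit <- lm(dist ~ speed, data = cars)

# Cook's distance measures how much the fit changes when one row is left out.
cooks <- cooks.distance(fit)

# Rule-of-thumb threshold (an assumption for this sketch).
threshold <- 4 / nrow(cars)
influential <- which(cooks > threshold)

# Show the rows flagged as potentially influential.
print(cars[influential, ])
```

Calling `plot(fit)` afterwards shows R's standard diagnostic plots for the same model.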

Multicollinearity
• Multicollinearity occurs when a dataset's
features (the X variables) are not independent
of each other.
• It is a major problem in regression analysis, because it
makes the individual coefficient estimates unstable.
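Multicollinearity is often checked with the variance inflation factor (VIF), which is 1 / (1 - R^2) when one predictor is regressed on the others. The sketch below computes it with base R only; the choice of `mtcars` and of the three predictors is an assumption made for illustration.

```r
# Three predictors from the built-in mtcars data that are known to be related.
predictors <- mtcars[, c("wt", "hp", "disp")]

# VIF for each predictor: regress it on the remaining ones and use R-squared.
vif <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
  1 / (1 - r2)
})

print(round(vif, 2))
print(round(cor(predictors), 2))  # pairwise correlations for comparison
```

A VIF of 1 means no linear relation to the other predictors; larger values indicate stronger multicollinearity.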

Dataset Distribution
• The distribution of a dataset shows the
different possible values for a characteristic of
a population, and how often each occurs.
• The normal distribution is the one most often assumed.

Sample Normal Distribution
(Figure: bell-shaped normal density curve; x-axis from -4 to 4, y-axis from 0 to 0.45.)
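A curve like the one on this slide can be redrawn in R. This sketch assumes the plot shows the standard normal density (mean 0, standard deviation 1), which matches the axis ranges but is not stated in the slides.

```r
# Evaluate the standard normal density on a grid from -4 to 4.
x <- seq(-4, 4, by = 0.1)
dens <- dnorm(x, mean = 0, sd = 1)

# Draw the bell curve.
plot(x, dens, type = "l",
     main = "Sample Normal Distribution", xlab = "x", ylab = "density")
```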

Outliers
• Outliers can greatly influence our model and
alter its effectiveness.
• The mean is more sensitive to outliers than the median.
• Outliers can be identified using a box plot.
• Eg:
– Series 1: 3, 5.0, 5.1, 5.2, 5.3, 5.3, 5.4, 5.7, 5.8, 5.9
– Series 2: 2.1, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
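The two series can be checked directly in R. The sketch below uses `boxplot.stats`, which applies the usual 1.5 x IQR whisker rule; the values are copied from the slide as they appear.

```r
series1 <- c(3, 5.0, 5.1, 5.2, 5.3, 5.3, 5.4, 5.7, 5.8, 5.9)
series2 <- c(2.1, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)

# The single low value pulls the mean of series 1 below its median.
mean1   <- mean(series1)
median1 <- median(series1)

# Points beyond the whiskers (1.5 x IQR from the hinges) are reported in $out.
out1 <- boxplot.stats(series1)$out
out2 <- boxplot.stats(series2)$out

print(c(mean = mean1, median = median1))
print(out1)
print(out2)

# Draw both box plots side by side.
boxplot(list(Series1 = series1, Series2 = series2))
```

With this rule the value 3 in series 1 is flagged, while 2.1 in series 2 is not, because series 2 is much more spread out.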

BoxPlot
(Figure: chart of Series1 and Series2 over positions 1 to 11; vertical axis from -5 to 30.)

Usage of Machine Learning
• Virtual personal assistants:
– AI-based assistants. Eg: Google Home (speaker), Amazon
Alexa, Google Allo (mobile app).
• Predictions in commuting: Eg: traffic predictions.
• CCTV surveillance: Eg: helping to detect theft.
• Social media services:
– Eg: suggesting mutual friends, face recognition, etc.

Usage of Machine Learning
• Email spam and malware filtering.
– Uses ML-based filtering algorithms. Eg: decision
trees, multi-layer perceptrons, etc.
• Online customer support
– AI-based chatbots built with ML. These can also be created
using AWS.
– Eg: banking, insurance.
• Search engine result refining.
– Eg: Google's ranking algorithms and updates such as Panda,
Penguin, Hummingbird, Pigeon, Mobile, RankBrain, Possum,
and Fred.

Supervised Learning
• Supervised learning algorithms require a data
analyst to provide both the inputs and the desired
outputs, and to give feedback on the accuracy of the
predictions during training.
• Supervised learning algorithms use labelled data.
• The workflow has 3 parts
– Extraction (of features)
– Training
– Prediction
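The three parts of the workflow can be sketched in a few lines of R. This is only an illustration under assumptions: it uses the built-in `iris` data, a 70/30 split, and an `rpart` decision tree as the learner, none of which are prescribed by the slides.

```r
library(rpart)

set.seed(42)  # make the random split reproducible

# Extraction: the feature columns and the label (Species) are already in iris.
train_idx <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Training: learn from labelled examples.
model <- rpart(Species ~ ., data = train, method = "class")

# Prediction: apply the model to rows it has not seen, then compare with the labels.
predicted <- predict(model, test, type = "class")
accuracy <- mean(predicted == test$Species)
print(accuracy)
```

Comparing predictions with known labels on held-out rows is the "feedback" step: it tells the analyst how accurate the model is.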

Supervised Learning Workflow

List of Supervised Algorithms
• A few supervised algorithms are listed
below.
– Decision Trees
– Naive Bayes Classification
– Support vector machines for classification problems
– Random forest for classification and regression
problems
– Linear regression for regression problems
– Ordinary Least Squares Regression
– Logistic Regression
– Ensemble Methods

Unsupervised Learning
• Unsupervised learning algorithms do not need to be
trained with desired outcome data; instead they review
the data and find structure in it on their own.
• Unsupervised learning uses un-labelled data.
• Used in applications such as image
processing and speech-to-text conversion, often together
with neural networks.

Unsupervised Learning Workflow

List of Unsupervised Learning
Algorithms
• Some common unsupervised algorithms are
listed below
– K-means for clustering problems
– Apriori algorithm for association rule learning
problems
– Principal Component Analysis.
– Singular Value Decomposition.
– Independent Component Analysis.
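As an illustration of the first item, here is a minimal k-means sketch. It assumes the built-in `iris` measurements with the species label deliberately left out, and three clusters; both choices are made for this example only.

```r
set.seed(1)  # k-means starts from random centres

# Use only the four numeric columns; the Species label is not given to the algorithm.
features <- scale(iris[, 1:4])

# Group the rows into 3 clusters, trying 20 random starts and keeping the best.
km <- kmeans(features, centers = 3, nstart = 20)

# Afterwards, compare the discovered clusters with the held-back labels.
print(table(Cluster = km$cluster, Species = iris$Species))
```

The cluster numbers themselves are arbitrary; only the grouping is meaningful.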

Supervised vs Unsupervised Learning Algorithms
Aspect                   Supervised Learning Algorithm    Unsupervised Learning Algorithm
Input data               Labelled data                    Un-labelled data
Computation complexity   Very high                        Less complexity
Real-time usage          Uses off-line analysis           Uses real-time analysis
No. of classes           Known (fixed)                    Unknown
Accuracy of results      Accurate and reliable            Moderate and reliable
Category                 Classification and regression    Clustering and association rule mining

Classification
• Classification is a technique where we
categorize data into a given number of classes.
• Classification-based machine learning algorithms
include
– Decision Trees.
– Bayesian Classifiers.
– Neural Networks.
– K-Nearest Neighbor.
– Support Vector Machines.
– Logistic Regression.

Working in RStudio (Covid-19 dataset)
• covid_india <- read.csv("C:/Users/Admin/Downloads/covid19-in-india/covid_19_india.csv", header = TRUE)
• state <- table(covid_india$State.UnionTerritory)
• barplot(state)

State-wise count (obtained from RStudio)

Decision Trees
• A decision tree is a tree with the following properties.
– An inner node represents an attribute.
– An edge represents an outcome of the test on the
attribute of its parent node.
– A leaf represents one of the classes.
• Construction of a decision tree is based on training data.

Types of Decision Tree
• Binary variable decision tree: a decision tree
which has a binary target variable. Eg: will you
play chess? (Yes/No)
• Continuous variable decision tree: a decision
tree which has a continuous target variable.
Eg: predicting the amount of insurance premium
a customer will pay.

R code for creating decision tree
• library(rpart)
• library(rpart.plot)
• decisionTree_model <- rpart(Class ~ ., data = creditcard_data, method = 'class')
• predicted_val <- predict(decisionTree_model, creditcard_data, type = 'class')
• probability <- predict(decisionTree_model, creditcard_data, type = 'prob')
• rpart.plot(decisionTree_model)
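The code above depends on `creditcard_data`, which is not included with the slides. As a runnable stand-in, the sketch below follows the same steps on `kyphosis`, a small binary-class dataset that ships with the `rpart` package; the dataset swap is an assumption made only so the example can run as-is.

```r
library(rpart)

# Fit a classification tree; Kyphosis is a two-level factor (absent / present).
tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# Predicted class labels and class probabilities, as in the slide's code.
pred_class <- predict(tree, kyphosis, type = "class")
pred_prob  <- predict(tree, kyphosis, type = "prob")

# Text view of the splits, and predictions compared with the actual classes.
print(tree)
print(table(Actual = kyphosis$Kyphosis, Predicted = pred_class))
```

If the `rpart.plot` package is installed, `rpart.plot::rpart.plot(tree)` draws the same tree as a diagram.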

Decision Tree for credit card dataset

Advantages of decision trees
• Very easy to understand.
• Easy data exploration.
• Less data cleaning is required.
• Handles all data types (qualitative or
quantitative).

Disadvantage of Decision Trees
• Overfitting.
• Less suited to continuous variables: information is
lost when they are split into ranges.
We use the random forest algorithm to reduce
these drawbacks.

Random Forest Algorithm
• To be discussed in tomorrow's session.

Regression Models
• Linear Regression.

Regression
• Regression is a supervised machine learning
technique where the output variable is continuous.
• Ex: predicting product sales, stock price, temperature,
house price, etc.
What is Linear Regression:
– It is a way of finding a relationship between a single
continuous variable, called the dependent or target
variable, and one or more other variables
(continuous or not), called independent variables.

y = a + bx

• Where y is the dependent variable
• x is the independent variable
• b is the slope --> how much the line rises for each unit
increase in x
• a is the intercept --> the value of y when x = 0.
Simple Linear Regression: When there is a single
independent variable, we call it Simple Linear
Regression.
• Ex: Height (input) --> Weight; Experience (input) --> Salary
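The height-to-weight example can be tried with R's built-in `women` dataset (average heights and weights). This is a minimal sketch; the dataset choice and the new height of 66 are assumptions for illustration.

```r
# Simple linear regression: weight is the dependent variable, height the independent one.
fit <- lm(weight ~ height, data = women)

a <- coef(fit)[["(Intercept)"]]  # intercept: value of y when x = 0
b <- coef(fit)[["height"]]       # slope: change in y per unit increase in x
print(c(intercept = a, slope = b))

# Predict the weight for a new height; this equals a + b * 66.
new_weight <- predict(fit, newdata = data.frame(height = 66))
print(new_weight)
```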

Multiple Linear Regression:
• When there are multiple independent variables,
we call it Multiple Linear Regression.
• Ex: sq ft, number of bedrooms, location, brand, floor rise,
etc. --> predict house price

Estimate beta coefficients
Ordinary Least Squares:
• The objective of OLS is to minimize the sum of
squared residuals: $\sum \text{error}^2 = \sum (Y_{act} - Y_{pred})^2$
• $\hat{\beta} = (X^{T} X)^{-1} X^{T} Y$. The fitted values are
$Y_{pred} = X \hat{\beta} = H Y$, where $H = X (X^{T} X)^{-1} X^{T}$ is the hat matrix.
• We make use of linear algebra (matrices).
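The matrix formula for the coefficients can be verified against R's `lm`. The sketch assumes the built-in `mtcars` data with `mpg` as Y and `wt` and `hp` as predictors, chosen only for illustration.

```r
# Design matrix with a column of ones for the intercept.
X <- cbind(Intercept = 1, wt = mtcars$wt, hp = mtcars$hp)
y <- mtcars$mpg

# Beta = Inverse(Xtranspose * X) * Xtranspose * Y, solved without forming the inverse.
beta <- solve(t(X) %*% X, t(X) %*% y)

# Hat matrix: maps the observed Y to the fitted (predicted) Y.
H <- X %*% solve(t(X) %*% X) %*% t(X)
fitted_by_hat <- H %*% y

# Compare with the built-in least-squares fit.
fit <- lm(mpg ~ wt + hp, data = mtcars)
print(cbind(matrix_formula = as.numeric(beta), lm = as.numeric(coef(fit))))
```

In practice `lm` uses a QR decomposition rather than this formula because it is numerically more stable, but the results agree.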

Variable Selection Methods (for regression only):
• Forward selection: starts with no variables (or a single
variable), then adds other variables one at a time based on
AIC values (AIC: Akaike Information Criterion, a model
performance measure; lower is better).
• Backward elimination: starts with all variables and
iteratively removes the variables of low
importance based on AIC values.
• Stepwise regression (bi-directional):
runs in both directions.
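R's `step` function performs all three searches using AIC. The sketch below assumes `mtcars` and four candidate predictors purely for illustration.

```r
# Largest and smallest models considered in the search.
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
null <- lm(mpg ~ 1, data = mtcars)

# Backward elimination: start from the full model and drop terms while AIC improves.
backward <- step(full, direction = "backward", trace = 0)

# Forward selection: start from the intercept-only model and add terms while AIC improves.
forward <- step(null, scope = formula(full), direction = "forward", trace = 0)

# Stepwise: terms may be added or dropped at each step.
both <- step(null, scope = formula(full), direction = "both", trace = 0)

print(formula(backward))
print(formula(forward))
print(formula(both))
```

Setting `trace = 1` prints the AIC of every candidate model at each step.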

How to find the best regression line, the
line of best fit:
• The regression line establishes a relationship between the
independent and dependent variables.
• The line that explains this relationship best is said
to be the BEST FIT LINE.
• In other words, the best fit line tends to return the
most accurate value of Y based on X, i.e. it gives the
smallest difference between the actual and
predicted values of Y (lowest prediction error).

Assumptions in regression:
• Regression is a parametric approach. Parametric means it
makes assumptions about the data for the purpose of analysis.
• Linear and additive: the effect of one variable 'x1' on Y is
independent of the other variables.
• There should be no correlation between the residual terms.
Correlated residuals --> autocorrelation (common in time series).
• Independent variables should not be correlated with each other.
Correlated predictors --> multicollinearity.
• Error terms must have constant variance.
– Constant --> Homoscedasticity;
– Non-constant --> Heteroscedasticity
• Error terms must be normally distributed.
Errors
• Each error is Actual - Predicted, i.e. $Y - \hat{Y}$
• Sum of all errors: $\sum \text{error} = \sum (Y - \hat{Y})$
• Sum of absolute values of all errors: $\sum |\text{error}|$
• Sum of squares of all errors: $\sum \text{error}^2$
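The three sums can be computed for any fitted model. This sketch assumes the built-in `women` data (weight predicted from height) as an example.

```r
fit <- lm(weight ~ height, data = women)

# Error = actual - predicted for every observation.
errors <- women$weight - fitted(fit)

sum_errors <- sum(errors)       # positive and negative errors cancel out
sum_abs    <- sum(abs(errors))  # total size of the errors, ignoring sign
sum_sq     <- sum(errors^2)     # the quantity that least squares minimizes

print(c(sum = sum_errors, sum_abs = sum_abs, sum_sq = sum_sq))
```

For a least-squares line with an intercept, the plain sum of errors is zero up to rounding, which is why the absolute or squared sums are used to measure fit.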

Logistic Regression
• Logistic regression is a technique borrowed by
machine learning from the field of statistics.
• It is the go-to method for binary classification (2 class
values: success/failure, yes/no, etc.).
• Logistic regression, also called logit regression or the logit
model, is a regression model where the dependent
variable is categorical.

Logistic Regression
• Logistic regression measures the relationship between
a categorical dependent variable and one or more independent
variables by estimating probabilities using a
logistic function.
• It is used to predict a binary outcome given a set of
independent variables.

Logistic Regression
• LR can be seen as a special case of GLM (Generalized
Linear Models) and is thus similar to linear regression.
• The key differences are:
– Predicted values are probabilities and are therefore restricted
to (0,1) through the logistic distribution function
– The conditional distribution of Y given X, i.e. P(Y=0 | X) and
P(Y=1 | X), is a Bernoulli distribution rather than a Gaussian
distribution

Applications
• Email: spam/No spam
• Online transaction: F/NF
• Customer churn: (R/E)
• HR status: J/NJ
• Credit scoring: D/ND

Advantages
• Highly interpretable
• Outputs are well-calibrated predicted probabilities
• Model training and prediction are fast
• Features don't need scaling
• Can perform well with a small number of observations

Probability to log of odds ratio:
• Let Y be the primary outcome variable, indicating
S/F; 1/0..
• P is the probability of Y being 1, P(Y=1);
the probability of Y being 0 is P(Y=0) = 1 - P
• X1, X2, …, Xn are the predictor variables
• B1, B2, …, Bn are the model coefficients

Probability to log of odds ratio
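The equation for this slide did not survive. The standard relationship between the probability and the log of odds, written with the symbols defined earlier, is given below. The intercept $B_0$ is an assumption added here; the slides list only $B_1, \dots, B_n$.

```latex
\begin{align}
\text{odds} &= \frac{P}{1 - P} \\
\operatorname{logit}(P) = \ln\!\left(\frac{P}{1 - P}\right) &= B_0 + B_1 X_1 + B_2 X_2 + \dots + B_n X_n \\
P &= \frac{1}{1 + e^{-(B_0 + B_1 X_1 + B_2 X_2 + \dots + B_n X_n)}}
\end{align}
```

The last line is the second line solved for P; it is the logistic function that keeps the predicted probability between 0 and 1.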

Logit Function:
• Logistic regression estimates the logit function.
• The logit function is simply the log of the odds in
favour of the event.
• Its inverse, the logistic function, gives an S-shaped curve
for the probability estimate.
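A logistic regression can be fitted in R with `glm`. The sketch assumes the built-in `mtcars` data, with transmission type `am` (0/1) as the binary outcome and weight `wt` as the only predictor; these choices are for illustration only.

```r
# family = binomial gives logistic regression (logit link by default).
fit <- glm(am ~ wt, data = mtcars, family = binomial)
print(coef(fit))

# Predictions on the log-odds (logit) scale and on the probability scale.
log_odds <- predict(fit, type = "link")
prob     <- predict(fit, type = "response")

# plogis is the logistic function (inverse logit); qlogis is the logit function.
same_prob  <- isTRUE(all.equal(plogis(log_odds), prob))
same_logit <- isTRUE(all.equal(qlogis(prob), log_odds))
print(c(same_prob, same_logit))

# Plot the S-shaped curve of predicted probability against weight.
wt_grid <- data.frame(wt = seq(1.5, 5.5, by = 0.1))
plot(wt_grid$wt, predict(fit, newdata = wt_grid, type = "response"),
     type = "l", xlab = "wt", ylab = "P(am = 1)")
```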

In general, we can use the below
for classification
• Confusion matrix (sensitivity, specificity, F1…)
• K-fold cross validation
• AUC-ROC (Area Under the Curve of the Receiver Operating
Characteristic) --> the closer this score is to 1,
the better
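The confusion matrix and AUC can be computed with base R. This sketch assumes a logistic model of `am` on `wt` in `mtcars` and a 0.5 cut-off; it evaluates on the training data for brevity, whereas k-fold cross validation would repeat the same calculation on held-out folds.

```r
fit <- glm(am ~ wt, data = mtcars, family = binomial)
prob <- predict(fit, type = "response")
actual <- mtcars$am

# Turn probabilities into class labels with a 0.5 threshold (an assumption).
predicted <- factor(as.integer(prob >= 0.5), levels = c(0, 1))
cm <- table(Actual = factor(actual, levels = c(0, 1)), Predicted = predicted)
print(cm)

# Sensitivity: share of actual 1s predicted as 1. Specificity: share of actual 0s predicted as 0.
sensitivity <- cm["1", "1"] / sum(cm["1", ])
specificity <- cm["0", "0"] / sum(cm["0", ])

# AUC via the rank (Mann-Whitney) formula: the chance that a random positive
# case gets a higher predicted probability than a random negative case.
n1 <- sum(actual == 1)
n0 <- sum(actual == 0)
auc <- (sum(rank(prob)[actual == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

print(c(sensitivity = sensitivity, specificity = specificity, auc = auc))
```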

Queries & Suggestions
• Feel free to mail me at
jayaramb05@gmail.com.

Thank you
