3. Contents
• Machine Learning.
• Usage of Machine Learning.
• Supervised vs Unsupervised Learning.
• Classification.
• Regression Models.
• Decision trees.
• Random Forest.
• Logistic Regression.
5/16/2020 FDP ON HADOOP AND MACHINE LEARNING 3
4. Machine Learning
• What is data?
• Where is data available?
• Types of data.
• What is data analytics?
• How is data related to machine learning?
• Architecture of a machine learning model.
5. What is Data?
• A collection of information, typically stored in a file or database.
– Structured form
• Data with a relational structure, where relations between attributes can be defined. Eg: tables queried with SQL in database systems such as Oracle or MySQL.
– Unstructured form
• Any form of data that does not have a predefined structure. Eg: videos, images, comments, posts, and some websites such as blogs and Wikipedia.
6. Machine Learning
• Where is data available?
– Many sources of data are available.
– Primary source of data
• Eg: data created by an individual or a business concern on their own.
– Secondary source of data
• Eg: data extracted from cloud servers and websites (Kaggle, UCI, AWS, Google Cloud, Twitter, Facebook, YouTube, GitHub etc.)
9. Qualitative Data
• Nominal data.
– The values of the attribute have no natural ordering.
– Eg: color, gender, nouns (name, place, animal, thing)
• Ordinal data.
– The values of the attribute have a natural ordering.
– Eg: size (S, M, L, XL, XXL), rating (excellent, good, average, poor)
10. Quantitative Data
• Discrete attribute:
– It takes only a countable set of numerical values (typically integers).
– Eg: number of buttons, number of days for product delivery etc.
• Continuous attribute:
– It can take any value within a range, including fractional values.
– Eg: price, discount, height, weight, length, temperature, speed etc.
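The four attribute types from the qualitative and quantitative slides map onto R's basic vector types. The sketch below is only an illustration: all values are hypothetical, and the variable names are not from the session.

```r
# Hypothetical values, chosen only to illustrate the four attribute types.
color <- factor(c("red", "blue", "red", "green"))       # nominal: no order
size  <- factor(c("M", "S", "XL", "L"),
                levels = c("S", "M", "L", "XL", "XXL"),
                ordered = TRUE)                          # ordinal: S < M < L < ...
days  <- c(2L, 5L, 3L, 7L)                               # discrete: integer counts
price <- c(199.99, 349.50, 120.00, 89.95)                # continuous: fractional

# Order comparisons are meaningful for ordered factors only.
print(size[1] < size[3])     # compares M with XL
print(is.ordered(color))     # a nominal factor carries no order

# str() shows how R stores each attribute type.
str(data.frame(color, size, days, price))
```

Declaring the levels explicitly matters for ordinal data, because R would otherwise sort the labels alphabetically.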
11. Sample Dataset
• Covid 19 Dataset (statewise in India)
13. Machine Learning
• What is data analytics?
– Data analytics is the science of analyzing raw data in order to draw conclusions about that information. ... This information can then be used to optimize processes to increase the overall efficiency of a business or system.
• Types:
– Descriptive analytics. Eg: observation, case studies, surveys.
– Predictive analytics. Eg: healthcare, sports, weather, insurance, social media analysis.
– Prescriptive analytics. Eg: healthcare, banking.
15. Machine Learning (cont.)
• How is data related to machine learning?
• Architecture of a machine learning model.
• Data analytics is a subcomponent of machine learning.
16. Machine Learning
• Machine learning is an application of
artificial intelligence (AI) that provides
systems the ability to automatically learn and
improve from experience without being
explicitly programmed.
• Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.
17. Assumptions in Machine Learning
• If assumptions are not met, the model may reflect the data inaccurately and will likely produce inaccurate predictions.
• The main things to check are
– Diagnostics.
– Multicollinearity.
– Dataset distributions.
– Outliers.
18. Diagnostics
• Diagnostics are used to evaluate the model assumptions and to find out whether any observations have a large, undue influence on the analysis.
• They are mainly used in regression analysis (how the dependent Y variable changes when one of the independent X variables changes).
19. Multicollinearity
• Multicollinearity occurs when a dataset’s features, or X variables, are not independent of each other.
• It is a major problem in regression analysis because it makes the coefficient estimates unstable and hard to interpret.
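One way to check for multicollinearity is to look at pairwise correlations and at the variance inflation factor (VIF) of each feature. The sketch below uses R's built-in mtcars data; the choice of dataset and of the four predictors is an assumption made purely for illustration.

```r
# Built-in mtcars data; the choice of predictors is an assumption for illustration.
predictors <- mtcars[, c("disp", "hp", "wt", "drat")]

# Pairwise correlations: values near +1 or -1 signal dependent features.
cor_matrix <- round(cor(predictors), 2)
print(cor_matrix)

# VIF of one predictor: regress it on the others and compute 1 / (1 - R^2).
vif <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  fit <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(fit)$r.squared)
})
print(round(vif, 2))
```

A VIF of 1 means a feature is unrelated to the others; a common rule of thumb treats values above roughly 5 to 10 as a warning sign.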
20. Dataset Distribution
• The distribution of a dataset shows the different possible values of a characteristic of a population and how often each occurs.
• The normal distribution is the one most commonly assumed.
21. Sample Normal Distribution
[Figure: sample normal distribution curve (Series1); x-axis from -4 to 4, y-axis from 0 to 0.45]
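A bell curve like the one on this slide can be drawn in R with dnorm(). This is a minimal sketch that assumes a standard normal distribution (mean 0, standard deviation 1), which is consistent with the axis ranges of the figure but is not stated on the slide.

```r
# Standard normal density over the x-range shown on the slide.
x <- seq(-4, 4, by = 0.1)
dens <- dnorm(x, mean = 0, sd = 1)

# Highest point of the bell curve, reached at the mean.
peak <- max(dens)
print(round(peak, 4))

# Draw the curve.
plot(x, dens, type = "l", xlab = "x", ylab = "density",
     main = "Standard normal distribution")
```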
22. Outliers
• Outliers can greatly influence a model and alter its effectiveness.
• The mean is more sensitive to outliers than the median.
• Outliers can be identified using a box plot.
• Eg:
– Series 1: 3, 5.0, 5.1, 5.2, 5.3, 5.3, 5.4, 5.7, 5.8, 5.9
– Series 2: 2.1, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
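The two series can be checked with R's box plot rule. The sketch assumes that Series 1 starts with the value 3 (the slide's punctuation is ambiguous). It uses boxplot.stats(), which flags points lying more than 1.5 box lengths beyond the hinges.

```r
# The two series from the slide (Series 1 is read as starting with 3).
series1 <- c(3, 5.0, 5.1, 5.2, 5.3, 5.3, 5.4, 5.7, 5.8, 5.9)
series2 <- c(2.1, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)

# Points flagged as outliers by the box plot rule.
out1 <- boxplot.stats(series1)$out
out2 <- boxplot.stats(series2)$out
print(out1)
print(out2)

# How far the mean and the median move when the first value is dropped.
mean_shift   <- abs(mean(series1) - mean(series1[-1]))
median_shift <- abs(median(series1) - median(series1[-1]))
print(c(mean = mean_shift, median = median_shift))

boxplot(series1, series2, names = c("Series 1", "Series 2"))
```

Comparing the two shifts is a quick way to see why the mean is described as more sensitive to outliers.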
25. Usage of Machine Learning
• Virtual personal assistants:
– AI-based assistants. Eg: Google Home and Amazon Echo (speakers), Google Allo (mobile app).
• Predictions in commuting. Eg: traffic predictions.
• CCTV video surveillance. Eg: helping to detect theft.
• Social media services:
– Eg: friend suggestions based on mutual friends, face recognition etc.
26. Usage of Machine Learning
• Email spam and malware filtering.
– Uses ML classifiers alongside rule-based filters. Eg: decision tree, multilayer perceptron etc.
• Online customer support
– Uses AI-based chatbots built with ML; these can also be created using AWS.
– Eg: banking, insurance
• Search engine result refining.
– Eg: Google uses many algorithms such as Google Panda, Google Penguin, Google Hummingbird, Google Pigeon, Google Mobile, Google RankBrain, Google Possum, Google Fred
29. Supervised Learning
• Supervised learning algorithms require a data analyst with machine learning skills to provide both the input and the desired output, and to give feedback on the accuracy of the predictions.
• Supervised learning algorithms work with labelled data.
• The process has 3 parts
– Extraction
– Training
– Prediction
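The three parts can be sketched end to end in R. This is only an illustration under stated assumptions: the built-in iris data, a 70/30 split, the seed value and the use of an rpart decision tree are all choices made here, not taken from the session.

```r
library(rpart)
set.seed(42)  # reproducible split; the seed value is arbitrary

# Extraction: pick the input features and the label from the built-in iris data.
data <- iris[, c("Sepal.Length", "Petal.Length", "Petal.Width", "Species")]

# Hold out 30% of the rows so predictions are checked on unseen data.
test_idx <- sample(nrow(data), size = round(0.3 * nrow(data)))
train <- data[-test_idx, ]
test  <- data[test_idx, ]

# Training: fit a classifier on the labelled rows.
model <- rpart(Species ~ ., data = train, method = "class")

# Prediction: label the held-out rows and compare with the true labels.
predicted <- predict(model, test, type = "class")
accuracy <- mean(predicted == test$Species)
print(accuracy)
```

The accuracy on held-out rows plays the role of the feedback mentioned in the first bullet.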
31. List of Supervised Algorithms
• A few supervised algorithms are listed below.
– Decision Trees
– Naive Bayes Classification
– Support vector machines for classification problems
– Random forest for classification and regression
problems
– Linear regression for regression problems
– Ordinary Least Squares Regression
– Logistic Regression
– Ensemble Methods
32. Unsupervised Learning
• Unsupervised learning algorithms do not need to be trained with desired outcome data. Instead, they review the data iteratively and find patterns or groupings on their own.
• Unsupervised learning works with un-labelled data.
• Used in applications such as image processing and speech-to-text conversion, for example through neural networks.
34. List of Unsupervised Learning Algorithms
• Some common unsupervised algorithms are listed below
– K-means for clustering problems
– Apriori algorithm for association rule learning
problems
– Principal Component Analysis.
– Singular Value Decomposition.
– Independent Component Analysis.
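Two of the listed algorithms, K-means and Principal Component Analysis, are available in base R. The sketch below assumes the built-in iris data with its label column hidden; the number of clusters (3) and the seed are choices made for the illustration.

```r
set.seed(1)  # k-means starts from random centres; the seed is arbitrary

# Drop the Species label so the algorithm sees un-labelled data only.
features <- scale(iris[, 1:4])

# Ask for three clusters (the choice of k is an assumption here).
clusters <- kmeans(features, centers = 3, nstart = 20)

# Compare the discovered groups with the labels that were hidden.
print(table(cluster = clusters$cluster, species = iris$Species))

# Principal Component Analysis on the same features.
pca <- prcomp(features)
print(summary(pca)$importance[2, ])  # proportion of variance per component
```

The cluster numbers are arbitrary identifiers, so only the grouping matters when reading the table, not which number lines up with which species.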
35. Supervised vs Unsupervised Learning Algorithms
Criterion              | Supervised Learning Algorithm | Unsupervised Learning Algorithm
Input data             | Labelled data                 | Un-labelled data
Computation complexity | Very high                     | Less complexity
Real-time usage        | Off-line analysis             | Real-time analysis
No. of classes         | Known (fixed)                 | Unknown
Accuracy of results    | Accurate and reliable         | Moderately accurate and reliable
Category               | Classification and regression | Clustering and association rule mining
37. Classification
• Classification is a technique where we categorize data into a given number of classes.
• Classification-based machine learning algorithms are
– Decision Trees.
– Bayesian Classifiers.
– Neural Networks.
– K-Nearest Neighbor.
– Support Vector Machines.
– Logistic Regression (linear regression, by contrast, predicts continuous values and belongs under Regression).
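38. Working in RStudio (Covid-19 dataset)
• # Read the state-wise Covid-19 file (local path as used in the session)
• covid_india <- read.csv("C:/Users/Admin/Downloads/covid19-in-india/covid_19_india.csv", header = TRUE)
• # Count the rows recorded for each state or union territory
• state <- table(covid_india$State.UnionTerritory)
• # Bar chart of those counts
• barplot(state)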
40. Decision Trees
• A decision tree is a tree with the following properties.
– An inner node represents a test on an attribute.
– An edge represents an outcome of that test and leads to the next node.
– A leaf represents one of the classes.
• A decision tree is constructed from training data.
41. Types of Decision Tree
• Binary variable decision tree: a decision tree which has a binary target variable. Eg: will you play chess? (Yes/No)
• Continuous variable decision tree: a decision tree which has a continuous target variable. Eg: predicting the premium amount that a customer of an insurance company will pay.
42. R code for creating a decision tree
• library(rpart)
• library(rpart.plot)
• # creditcard_data: data frame loaded beforehand, with the label in column Class
• decisionTree_model <- rpart(Class ~ ., data = creditcard_data, method = "class")
• predicted_val <- predict(decisionTree_model, creditcard_data, type = "class")
• probability <- predict(decisionTree_model, creditcard_data, type = "prob")
• rpart.plot(decisionTree_model)
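The credit card data is not included with these slides, so the same steps can be tried on a dataset that ships with rpart. The sketch below swaps in the kyphosis data as a stand-in (an assumption made here, not the session's data), with the Kyphosis column playing the role of Class.

```r
library(rpart)

# kyphosis ships with rpart and stands in for creditcard_data here.
tree_model <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                    method = "class")

# Predicted class and class probabilities for each row, as on the slide.
predicted_val <- predict(tree_model, kyphosis, type = "class")
probability   <- predict(tree_model, kyphosis, type = "prob")

# Text view of the tree: inner nodes test an attribute, leaves carry a class.
print(tree_model)
print(table(actual = kyphosis$Kyphosis, predicted = predicted_val))
```

If the rpart.plot package is installed, rpart.plot(tree_model) draws the same tree graphically.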
43. Decision Tree for credit card dataset
44. Advantages of decision trees
• Very easy to understand.
• Easy data exploration.
• Less data cleaning is required.
• All data types are accepted (qualitative or quantitative).
45. Disadvantage of Decision Trees
• Overfitting.
• Less suited to continuous variables, because information is lost when values are split into ranges.
• The random forest algorithm is used to overcome these drawbacks.
47. Random Forest Algorithm
• To be discussed in tomorrow's session.
50. Regression
Regression is a supervised machine learning technique where the output variable is continuous.
Ex: predicting product sales, stock price, temperature, house price etc.
What is Linear Regression:
– It is a way of finding a relationship between a single continuous variable, called the dependent or target variable, and one or more other variables (continuous or not), called independent variables.
51. Equation of the regression line: $y = a + bx$
52. Where y is the dependent variable
x is the independent variable
b is the slope --> how much the line rises for each unit increase in x
a is the intercept --> the value of y when x = 0.
Simple Linear Regression: when there is a single independent variable, it is called Simple Linear Regression.
• Ex: Height (input) --> Weight; Experience (input) --> Salary
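The height-to-weight example can be tried with lm() on R's built-in women dataset (heights and weights of 15 women). Using this dataset, and the new height of 66, are assumptions made for the illustration.

```r
# Built-in women data: height is the input, weight is the target.
fit <- lm(weight ~ height, data = women)

a <- coef(fit)[["(Intercept)"]]  # intercept: value of y when x = 0
b <- coef(fit)[["height"]]       # slope: change in y per unit increase in x
print(c(intercept = a, slope = b))

# Predict the weight for a new height using y = a + b * x.
new_height <- 66
by_formula <- a + b * new_height
by_predict <- predict(fit, data.frame(height = new_height))
print(c(by_formula, by_predict))
```

Both predictions agree, since predict() applies the same line that the intercept and slope describe.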
54. Multiple Linear Regression:
When there are multiple independent variables, it is called Multiple Linear Regression.
Ex: sq. ft., number of bedrooms, location, brand, floor rise etc. --> predict house price
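55. Estimate beta coefficients
Ordinary Least Squares (OLS):
The objective of OLS is to minimize the sum of squared residuals: $\sum \text{error}^2 = \sum (Y_{act} - Y_{pred})^2$
$\hat{\beta} = (X^T X)^{-1} X^T Y$ (the normal-equation solution)
The hat matrix $H = X (X^T X)^{-1} X^T$ maps the actual values to the predicted values: $Y_{pred} = H Y$
This calculation uses linear algebra (matrices).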
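The matrix formula for the beta coefficients can be checked against lm(). The sketch below is a multiple linear regression on the built-in mtcars data, predicting mpg from wt and hp; the dataset and the choice of variables are assumptions made for the illustration.

```r
# Design matrix with a leading column of 1s for the intercept.
X <- cbind(Intercept = 1, as.matrix(mtcars[, c("wt", "hp")]))
Y <- mtcars$mpg

# Beta = Inverse(Xtranspose * X) * Xtranspose * Y
beta <- solve(t(X) %*% X) %*% t(X) %*% Y

# Hat matrix: maps the actual Y to the predicted Y.
H <- X %*% solve(t(X) %*% X) %*% t(X)
Y_pred <- H %*% Y

# lm() should give the same coefficients.
fit <- lm(mpg ~ wt + hp, data = mtcars)
print(cbind(normal_equations = drop(beta), lm = coef(fit)))
print(sum((Y - Y_pred)^2))  # sum of squared residuals
```

In practice lm() uses a numerically more stable decomposition than an explicit matrix inverse, so the direct formula is mainly useful for understanding.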
56. Variable Selection Methods (for regression only):
Forward selection: starts with no variables (or a single variable), then adds other variables one at a time based on AIC values (AIC: Akaike Information Criterion, a model performance measure; lower is better).
Backward elimination: starts with all variables and iteratively removes the variables of low importance based on AIC values.
Stepwise regression (bi-directional): runs in both directions, so a variable can be added or removed at each step.
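All three selection methods are available through the step() function in base R, which compares candidate models by AIC. The sketch assumes the built-in mtcars data with mpg as the target; that choice is made only for the illustration.

```r
# Full and empty (intercept-only) models on the built-in mtcars data.
full_model <- lm(mpg ~ ., data = mtcars)
null_model <- lm(mpg ~ 1, data = mtcars)

# Backward elimination: start from all variables and drop while AIC improves.
backward <- step(full_model, direction = "backward", trace = 0)

# Forward selection: start empty and add while AIC improves.
forward <- step(null_model, direction = "forward", trace = 0,
                scope = list(lower = ~ 1, upper = formula(full_model)))

# Stepwise: variables may be added or removed at each step.
both <- step(null_model, direction = "both", trace = 0,
             scope = list(lower = ~ 1, upper = formula(full_model)))

print(formula(backward))
print(formula(forward))
print(c(full = AIC(full_model), backward = AIC(backward),
        forward = AIC(forward), both = AIC(both)))
```

The methods need not agree on the final set of variables, since each one explores the candidate models in a different order.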
57. How to find the best regression line (the line of best fit):
The regression line establishes a relationship between the independent (IND) and dependent (DEP) variables.
The line which explains this relationship best is called the BEST FIT LINE.
In other words, the best fit line returns the most accurate value of Y based on X, i.e. it gives the minimum difference between the actual and predicted values of Y (lowest prediction error).
58. Assumptions in regression:
Regression is a parametric approach. Parametric means it makes assumptions about the data for the purpose of analysis.
• Linear and additive: the effect of one variable 'x1' on Y is independent of the other variables.
• There should be no correlation between the residual terms. Its presence is called autocorrelation (common in time series).
• Independent variables should not be correlated with each other. Their correlation is called multicollinearity.
• Error terms must have constant variance.
– Constant --> homoscedasticity
– Non-constant --> heteroscedasticity
• Error terms must be normally distributed.
59. Errors
Error for one observation: actual - predicted = $Y - \hat{Y}$
Sum of all errors: $\sum \text{error} = \sum (Y - \hat{Y})$
Sum of absolute values of all errors: $\sum |\text{error}|$
Sum of squares of all errors: $\sum \text{error}^2$
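The three error sums can be computed directly from a fitted model. The sketch below assumes a simple regression of mpg on wt in the built-in mtcars data; any fitted regression would serve equally well.

```r
# Any fitted regression would do; this model is an assumption for illustration.
fit <- lm(mpg ~ wt, data = mtcars)
actual    <- mtcars$mpg
predicted <- fitted(fit)
error     <- actual - predicted

sum_error     <- sum(error)        # positive and negative errors cancel out
sum_abs_error <- sum(abs(error))
sum_sq_error  <- sum(error^2)      # the quantity OLS minimises
print(c(sum = sum_error, abs = sum_abs_error, squared = sum_sq_error))
```

For a least-squares fit with an intercept, the plain sum of errors is zero up to rounding, which is why the absolute and squared sums are the ones used to judge a model.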
60. Logistic Regression
Machine learning borrowed the logistic regression technique from the field of statistics.
It is the go-to method for binary classification (2 class values, such as success/failure or yes/no).
Logistic regression, also called logit regression or the logit model, is a regression model where the dependent variable is categorical.
61. Logistic Regression
Logistic regression measures the relationship between a categorical dependent variable (DV) and one or more independent variables by estimating probabilities using a logistic function.
It is used to predict a binary outcome given a set of independent variables.
62. Logistic Regression
Logistic regression (LR) can be seen as a special case of the GLM (Generalized Linear Model) and is thus similar to linear regression.
The key differences are:
– Predicted values are probabilities and are therefore restricted to the interval (0, 1) through the logistic function.
– The conditional distribution of Y given X, i.e. P(Y=0 | X) and P(Y=1 | X), is a Bernoulli distribution rather than a Gaussian distribution.
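In R, logistic regression is fitted with glm() and a binomial family. The sketch below assumes the built-in mtcars data, with am (0 = automatic, 1 = manual) as the binary outcome and wt as the only predictor; the 0.5 cut-off is likewise a choice made for the illustration.

```r
# Binary outcome: am (0 = automatic, 1 = manual) in the built-in mtcars data.
model <- glm(am ~ wt, data = mtcars, family = binomial)
print(coef(model))

# type = "response" returns probabilities P(Y = 1 | X), always inside (0, 1).
prob <- predict(model, type = "response")
print(range(prob))

# Turn probabilities into classes with a 0.5 cut-off (the cut-off is a choice).
predicted_class <- ifelse(prob > 0.5, 1, 0)
print(table(actual = mtcars$am, predicted = predicted_class))
```

The coefficients are on the log-odds scale, so their sign gives the direction in which a predictor moves the probability.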
64. Advantages
• Highly interpretable.
• Outputs are well-calibrated predicted probabilities.
• Model training and prediction are fast.
• Features don’t need scaling.
• Can perform well with a small number of observations.
65. Probability to log of odds ratio:
Let Y be the primary outcome variable, coded as success/failure (1/0).
Let P be the probability that Y is 1, P(Y=1); then 1 - P is the probability that Y is 0, P(Y=0).
Let X1, X2, ..., Xn be the set of predictor variables.
Let B1, B2, ..., Bn be the model coefficients.
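With this notation, the step from probability to log of odds can be written out as below. The intercept term $B_0$ is an assumption added here, since the slide lists only B1 to Bn.

```latex
\[
\text{odds} = \frac{P}{1 - P}, \qquad
\log\!\left(\frac{P}{1 - P}\right) = B_0 + B_1 X_1 + B_2 X_2 + \dots + B_n X_n
\]
\[
P = \frac{1}{1 + e^{-(B_0 + B_1 X_1 + B_2 X_2 + \dots + B_n X_n)}}
\]
```

The second line is the first one solved for P, which is why the predicted probability always stays between 0 and 1.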
67. Logit Function:
Logistic regression estimates the logit function as a linear combination of the predictors.
The logit function is simply the log of the odds in favour of the event: log(P / (1 - P)).
Its inverse, the logistic function, produces an S-shaped curve for the probability estimate.
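The logit function and its inverse take only a line each in R. The function names logit and inv_logit below are illustrative; base R provides the same pair as qlogis() and plogis().

```r
logit     <- function(p) log(p / (1 - p))    # probability -> log of odds
inv_logit <- function(z) 1 / (1 + exp(-z))   # log of odds -> probability

p <- c(0.1, 0.5, 0.8)
z <- logit(p)
print(z)
print(inv_logit(z))   # recovers the original probabilities

# Base R ships the same pair as qlogis() and plogis().
print(all.equal(z, qlogis(p)))

# The S-shaped curve.
curve(inv_logit, from = -6, to = 6, xlab = "log of odds", ylab = "probability")
```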
69. In general, the following can be used to evaluate classification models:
– Confusion matrix (sensitivity, specificity, F1 etc.)
– K-fold cross validation
– AUC-ROC (Area Under the Curve, Receiver Operating Characteristic): the closer this score is to 1, the better
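The confusion-matrix measures and the AUC can be computed in base R without extra packages. The sketch below makes several assumptions: a logistic model of am on wt in the built-in mtcars data, a 0.5 cut-off, and the rank (Mann-Whitney) formula for AUC. It scores the training rows themselves, so k-fold cross validation is not shown.

```r
# Fit a logistic model on built-in data and score its own training rows.
model  <- glm(am ~ wt, data = mtcars, family = binomial)
prob   <- predict(model, type = "response")
actual <- mtcars$am
pred   <- as.integer(prob > 0.5)

# Confusion matrix with fixed levels so all four cells exist.
cm <- table(actual = factor(actual, levels = c(0, 1)),
            predicted = factor(pred, levels = c(0, 1)))
print(cm)

tp <- cm["1", "1"]; tn <- cm["0", "0"]
fp <- cm["0", "1"]; fn <- cm["1", "0"]
sensitivity <- tp / (tp + fn)   # true positive rate (recall)
specificity <- tn / (tn + fp)   # true negative rate
precision   <- tp / (tp + fp)
f1 <- 2 * precision * sensitivity / (precision + sensitivity)

# AUC via the rank formula: the chance that a random positive case
# gets a higher score than a random negative case.
n1 <- sum(actual == 1); n0 <- sum(actual == 0)
auc <- (sum(rank(prob)[actual == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
print(round(c(sensitivity = sensitivity, specificity = specificity,
              f1 = f1, auc = auc), 3))
```

Scores measured on the training rows tend to be optimistic; repeating the same calculation on held-out folds gives a fairer estimate.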
71. Queries & Suggestions
• Feel free to mail me at jayaramb05@gmail.com.