Introduction to
Machine Learning
Girish Gore
Introducing the Speaker
• Girish Gore: 10+ years of experience in Data Analytics / Data Science
• B.E. Computer Science from VIT Pune, M.S. from BITS Pilani
• Spent time on data products, mainly in companies like
• Cognizant (Innovations Group)
• SAS (Pricing & Revenue Management)
• VuClip (Video Entertainment)
• Shoptimize (E-Commerce)
• Worked in fields like
• Text Mining
• Forecasting and Optimization
• Recommender Systems
Knowing the Audience
Average Experience in Industry?
Average ML Experience?
Understanding Terminologies
Artificial Intelligence
AI involves machines that can perform tasks that are characteristic of human
intelligence.
Machine Learning
Machine learning is an application of artificial intelligence (AI) that provides
systems the ability to automatically learn and improve from experience without
being explicitly programmed.
Deep Learning
Deep Learning uses multi-layer artificial neural networks, loosely inspired by
the workings of the brain. Deep Learning is one of many approaches to machine learning.
The Hierarchy
Traditional Programming vs Machine Learning
• If programming automates processes, machine learning automates program
generation, i.e. it automates the automation itself.
• Data and the desired output are run on the computer to create a program.
This program can then be used like one written by traditional programming.
What is Machine Learning ?
• Machine Learning is
• the study of algorithms that
• improve their performance at a particular task
• with experience (previous data, output)
• Optimizing a performance criterion using example data or past experience
• Role of Computer Science : Efficient Algorithms
• Solve the optimization problem
• Represent and Evaluate the model for inference
Why Are We Here Now? (Google Trends)
• Exponential increase in data generation and accumulation
• Increasing computational power
• Growing progress in available algorithms and research
• Software becoming too complex to write by hand
Common Applications of Machine Learning
• Web search: ranking pages based on what you are most likely to click on.
• Finance: deciding who to send which credit card offers to, evaluating the risk of credit
offers, and deciding where to invest money.
• E-commerce: predicting customer churn; detecting whether or not a transaction is fraudulent.
• Robotics: handling uncertainty in new environments; autonomous systems such as self-driving cars.
• Information extraction: asking questions over databases across the web.
• Social networks: data on relationships and preferences; machine learning extracts value
from that data.
• Debugging: use in computer science, especially in labor-intensive processes like
debugging; a model could suggest where the bug might be.
• Gaming, IBM Watson
Types Of Machine Learning
• Learning Associations
• Supervised Learning
• Regression
• Classification
• Unsupervised Learning
• Reinforcement Learning
• Semi-supervised Learning
• Training data includes a few desired outputs; sits between supervised and unsupervised learning
Learning Associations
• Market Basket Analysis:
P(Y | X): probability that somebody who buys X also buys Y, where X and Y
are products/services.
Example: P(diaper | beer) = 0.7
TransactionID   BasketItems
1               Bread, Milk
2               Bread, Diaper, Beer, Eggs
3               Milk, Diaper, Beer, Coke
4               Bread, Milk, Diaper, Beer
5               Bread, Milk, Diaper, Coke
Learning Associations
• Support: the probability that a customer buys diaper and beer together,
among all sales transactions (the higher the support, the better).
• Confidence: given that a customer picks up diaper, how likely is he/she
to buy beer? (The closer to 1, the better.)
• Lift: compares our rule with a naive model that assumes the two purchases are
independent, i.e. how much more likely a customer is to buy both together than
expected from buying them separately (a useful rule has lift > 1).
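The three measures can be computed by hand for the rule "diaper → beer" on the five baskets from the transaction table. This is a minimal sketch in base R; the `support` helper and the variable names are illustrative assumptions, not part of the original deck.

```r
# Transactions copied from the basket table on the slide.
baskets <- list(
  c("Bread", "Milk"),
  c("Bread", "Diaper", "Beer", "Eggs"),
  c("Milk", "Diaper", "Beer", "Coke"),
  c("Bread", "Milk", "Diaper", "Beer"),
  c("Bread", "Milk", "Diaper", "Coke")
)

# Fraction of baskets that contain every item in `items` (hypothetical helper).
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

supp_both  <- support(c("Diaper", "Beer"))   # P(diaper and beer)
confidence <- supp_both / support("Diaper")  # P(beer | diaper)
lift       <- confidence / support("Beer")   # confidence relative to P(beer)

print(c(support = supp_both, confidence = confidence, lift = lift))
```

For these five baskets the rule has support of about 0.6, confidence of about 0.75 and lift of about 1.25, so on this tiny sample diaper buyers are somewhat more likely than average to buy beer.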
Supervised Learning
• Supervised Learning is the Machine Learning task of inferring a generalized function
from labelled training data. Training data includes desired outputs.
Example: Spam Detection, Credit Scoring, Face Detection
• In Supervised Learning for spam detection we have
• Email contents with labels marking Spam or Non-Spam
• Task is to label newer emails
• Two main types of Supervised Learning problems
• Regression
• Classification
Supervised Learning
• Regression Problems
• Maps input data to a continuous prediction variable
• Example: Predicting retail house prices (price as a continuous variable)
• Classification Problems
• Maps input data to a set of predefined classes
• Example: Benign or Malignant Tumours
Regression : House Price Prediction
• We have historic data about the size of houses and their prices for the last year
• Task is to predict the price of a house given its size
• Model Derivation:
Price = Slope of Line * Size + Constant
Classification : Credit Scoring
We have labelled data of low and high risk customers.
Task is differentiating between low-risk and high-risk customers from their
income and savings.
Model Derivation:
IF income > θ1 AND savings > θ2
THEN low-risk ELSE high-risk
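The rule above translates almost directly into code. The sketch below is illustrative only: the threshold values for θ1 and θ2 and the three customers are made-up assumptions, since the slide gives no numbers. In practice the thresholds would be learned from the labelled data.

```r
# Hypothetical thresholds (theta1, theta2); the slide does not give values.
theta1 <- 50000   # income threshold
theta2 <- 20000   # savings threshold

# Apply the IF-THEN rule to vectors of income and savings.
classify_risk <- function(income, savings) {
  ifelse(income > theta1 & savings > theta2, "low-risk", "high-risk")
}

# Made-up customers, for illustration only.
customers <- data.frame(
  income  = c(80000, 30000, 60000),
  savings = c(40000, 50000, 10000)
)
customers$risk <- classify_risk(customers$income, customers$savings)
print(customers)
```

Only the first customer clears both thresholds, so only that one is labelled low-risk.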
Unsupervised Learning
• Training data does not include desired output.
Task is to find hidden structure in unlabeled data
• Common Approaches to Unsupervised Learning
• Clustering or Segmentation (Customer Segmentation)
• Dimensionality Reduction (PCA (Principal Component Analysis), SVD (Singular Value Decomposition))
• Summarization
Unsupervised Learning
• Customer Segmentation: helps marketers discover distinct groups in their customer bases,
and then use this knowledge to develop targeted marketing programs.
• The clustering algorithm forms 3 different groups of customers to target.
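A segmentation like this can be sketched with k-means from base R. The deck does not say which clustering algorithm produced the three groups, so k-means is an assumption here, and the customer data (spend, visits) is synthetic and purely illustrative.

```r
set.seed(42)  # k-means starts from random centers; fix the seed for repeatability

# Synthetic customers: three loose groups in (spend, visits per month).
customers <- data.frame(
  spend  = c(rnorm(20, 200, 20), rnorm(20, 600, 30), rnorm(20, 1000, 40)),
  visits = c(rnorm(20, 2, 0.5),  rnorm(20, 6, 0.8),  rnorm(20, 12, 1))
)

# Scale so both features contribute comparably, then ask for 3 clusters.
fit <- kmeans(scale(customers), centers = 3, nstart = 10)

# Attach the cluster label to each customer and count the segment sizes.
customers$segment <- fit$cluster
print(table(customers$segment))
```

No labels are supplied anywhere: the algorithm groups customers only by how close they are to each other in the feature space.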
Reinforcement Learning
• Learning from interaction with the environment to achieve a goal,
using rewards from a sequence of actions.
• Every action results in either a
• Reward OR
• Observation
• Examples
• Self-Driving Cars
• Recommender Systems
• Research Link (UT Austin)
https://www.cs.utexas.edu/~eladlieb/RLRG.html
ML – Data Science Relationship
Supervised Learning
Linear Regression
Linear Regression
• In statistics, linear regression is an approach for modeling the
relationship between a scalar dependent variable y and one or more
explanatory variables (or independent variables) denoted X
• The case of one explanatory variable is called simple linear
regression
• For more than one explanatory
variable, the process is
called multiple linear regression
https://en.wikipedia.org/wiki/Linear_regression
From School Book: Linear Equations
Y = mX + b
m = Slope = (Change in Y) / (Change in X)
b = Y-intercept
[Figure: straight line on X-Y axes with the slope and the Y-intercept marked]
Linear Regression : A Common Example
Ohm’s Law:
• In physics, it is observed that the relationship between Voltage (V), Current (I)
and Resistance (R) is a linear relationship expressed as
V = I * R
I = V / R
• In a circuit with a given Resistance R,
as you increase the Voltage V,
the Current I increases proportionately
http://www.electronics-tutorials.ws/dccircuits/dcp_1.html
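Because V = I * R is a straight line through the origin, a linear fit on voltage and current readings recovers the resistance as the slope. The sketch below uses noise-free, made-up readings for an assumed 10-ohm resistor. Real measurements would include noise, and the estimate would then only be approximate.

```r
# Hypothetical circuit: a 10-ohm resistor (R_true is an assumed value).
R_true  <- 10
current <- c(0.1, 0.2, 0.3, 0.4, 0.5)   # amperes
voltage <- current * R_true             # volts, from V = I * R

# Fit a line through the origin: V = slope * I, so the slope estimates R.
fit <- lm(voltage ~ 0 + current)
R_est <- unname(coef(fit)["current"])
print(R_est)
```

The `0 +` term in the formula removes the intercept, matching the physical law, which has no constant term.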
Sample Monthly Income-Expense Data of a Household
Monthly Income (in Rs.)   Monthly Expense (in Rs.)
5,000                     8,000
6,000                     7,000
10,000                    4,500
10,000                    2,000
12,500                    12,000
14,000                    8,000
15,000                    16,000
18,000                    20,000
19,000                    9,000
20,000                    9,000
20,000                    18,000
22,000                    25,000
23,400                    5,000
24,000                    10,500
24,000                    10,000
We have to find the relationship between Income and Expenses
of a household
y = 0.3008x + 6319.1
R² = 0.4215
[Figure: scatter plot "Income Vs. Expense" with the fitted line; x-axis: Monthly Income, y-axis: Monthly Expense]
Line of Best Fit
[Figure: scatter plot "Income Vs. Expense" with several candidate lines; x-axis: Monthly Income, y-axis: Monthly Expense]
Which of these lines best describes the relationship
between Household Income and Expenses?
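Line of Best Fit
[Figure: scatter plot "Income Vs. Expense" with the fitted line and the errors e_m and e_n marked for two sample points; x-axis: Monthly Income, y-axis: Monthly Expense]
The Line of Best Fit will be the one where the Sum of Squares of Error (SSE) term is minimum (OLS technique).
Error: $e_m = Y_m - \hat{Y}_m$
$\hat{Y}_i = b_0 + b_1 X_i$ is the sample regression equation.
\[ SSE = \sum_{i} \hat{e}_i^2 \quad (1) \]
\[ \phantom{SSE} = \sum_{i} (Y_i - \hat{Y}_i)^2 \quad (2) \]
\[ \phantom{SSE} = \sum_{i} (Y_i - b_0 - b_1 X_i)^2 \quad (3) \]
Using calculus we get
\[ b_1 = \frac{n \sum X_i Y_i - \sum X_i \sum Y_i}{n \sum X_i^2 - \left(\sum X_i\right)^2} \]
\[ b_0 = \frac{\sum Y_i - b_1 \sum X_i}{n} \]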
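The closed-form least-squares estimates for the slope b1 and intercept b0 can be checked numerically. The sketch below applies the standard formulas to R's built-in `cars` data set (the one used in the R slides further on) and compares the result with `lm()`. The variable names are illustrative.

```r
x <- cars$speed
y <- cars$dist
n <- length(x)

# Closed-form OLS estimates obtained by setting the derivatives of SSE to zero.
b1 <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)
b0 <- (sum(y) - b1 * sum(x)) / n

# Sum of squared errors at the optimum.
sse <- sum((y - b0 - b1 * x)^2)

# Cross-check against R's built-in fit.
print(c(b0 = b0, b1 = b1, sse = sse))
print(coef(lm(dist ~ speed, data = cars)))
```

Both routes should agree to within floating-point error, since `lm()` minimises the same SSE.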
Least Squares
• ‘Best Fit’ Means Difference Between Actual Y Values & Predicted Y Values is
a Minimum. But Positive Differences Off-Set Negative Ones. So Square
Errors!
• LS Minimizes the Sum of the Squared Differences (Errors) (SSE)
\[ \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} \hat{\epsilon}_i^2 \]
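A quick way to see why the errors are squared is to compare the plain sum of errors with the sum of squared errors for a fitted line. This is a small sketch on R's built-in `cars` data set, which is an assumption of convenience here.

```r
fit <- lm(dist ~ speed, data = cars)
e <- resid(fit)          # actual minus predicted stopping distance

sum_e  <- sum(e)         # positive and negative errors cancel out
sum_e2 <- sum(e^2)       # squared errors cannot cancel: this is the SSE

print(c(sum_of_errors = sum_e, sse = sum_e2))
```

The plain sum is zero up to rounding error for any least-squares line with an intercept, so it cannot rank candidate lines, while the SSE can.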
Simple Linear Regression in R
### CODE SNIPPET ###
# Open the help page for the built-in cars data set
?cars
# Investigating the basics of the data set
str(cars)
attributes(cars)
Examining the data
### CODE SNIPPET ###
# How do the speed and distance value summaries look? Any NA's?
summary(cars)
# Is there a correlation between speed and distance to stop?
cor(cars$speed, cars$dist)
Plotting the data
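### CODE SNIPPET ###
# Scatter plot of stopping distance against speed
plot(cars, main="Relationship between Speed and Distance to Stop")
# Same scatter plot with a smoothed trend line
scatter.smooth(cars, lpars = list(col = "red", lwd = 3, lty = 3))
# Box plot to spot outliers in stopping distance
boxplot(cars$dist, main="Outliers for Distance")
# Density of speed, drawn as vertical bars
plot(density(cars$speed), main="Density Distribution of Speed",
     type="h", col="blue")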
Basic Linear Model
### CODE SNIPPET ###
linear_model = lm(dist ~ speed , data=cars)
summary(linear_model)
Coefficient Analysis
• Coefficient - Estimate
• The Y intercept given is -17.5791
• For every 1 mph increase in the speed of a car, the required distance to stop goes up by 3.9324 feet.
• Coefficient - Standard Error
• The coefficient Standard Error estimates how much the coefficient estimate would vary if the
model were refitted on new samples. We’d ideally want a lower number relative to its
coefficient.
• Coefficient - t value
• The coefficient t-value measures how many standard errors our coefficient estimate is
away from 0. We want it to be far away from zero, as this would indicate we could reject the null
hypothesis, that is, we could declare that a relationship between speed and distance exists. In general,
t-values are also used to compute p-values.
• Coefficient - Pr(>|t|)
• A small p-value for the slope indicates that we can reject the null hypothesis, which
allows us to conclude that there is a relationship between speed and distance.
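The four coefficient columns are linked by simple arithmetic, which can be reproduced from the fitted model. The sketch below refits the same `dist ~ speed` model under the name `fit` (an illustrative name) so that it runs on its own.

```r
fit <- lm(dist ~ speed, data = cars)

# Matrix with columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(fit))
print(round(coefs, 4))

# The t value is the estimate divided by its standard error.
t_manual <- coefs[, "Estimate"] / coefs[, "Std. Error"]

# Two-sided p-value from the t distribution with the residual degrees of freedom.
p_manual <- 2 * pt(abs(t_manual), df = df.residual(fit), lower.tail = FALSE)
print(cbind(t_manual, p_manual))
```

The manually computed values should match the `t value` and `Pr(>|t|)` columns that `summary()` reports.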
Residual Analysis
### CODE SNIPPET ###
# Predicted stopping distances for the data the model was fitted on
pred_dist <- predict(linear_model, newdata=cars)
# Residual = actual distance minus predicted distance
residuals <- cars$dist - pred_dist
summary(residuals)
plot(pred_dist, residuals,
     xlab="Predicted Values",
     ylab="Residuals",
     main="Residual Plot", col="blue")
Which residual plot suggests a good fit? (Poll)
Residual Standard Error
• Residual Standard Error is a measure of the quality of a linear
regression fit.
• The Residual Standard Error is roughly the average amount that the response
(dist) deviates from the fitted regression line.
• In our example, the actual distance required to stop deviates from
the fitted regression line by approximately 15.3795867 feet, on
average (about 4 times the slope of 3.93).
• The Residual Standard Error was calculated with 48 degrees of
freedom. Simplistically, degrees of freedom are the number of data
points left after accounting for the estimated parameters: 50 observations
minus 2 parameters (intercept and slope).
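The Residual Standard Error and its degrees of freedom can be recomputed from the residuals. This is a minimal sketch that refits the same `dist ~ speed` model as `fit`, an illustrative name.

```r
fit <- lm(dist ~ speed, data = cars)

n  <- nrow(cars)          # number of observations
p  <- length(coef(fit))   # estimated parameters: intercept and slope
df <- n - p               # residual degrees of freedom

# RSE = sqrt(SSE / degrees of freedom)
rse <- sqrt(sum(resid(fit)^2) / df)

print(c(df = df, rse = rse, from_summary = summary(fit)$sigma))
```

Dividing by the degrees of freedom rather than by n compensates for the two parameters having been estimated from the same data.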
Coefficient of Determination
• In statistics, the coefficient of determination, denoted R² or r² and pronounced
"R squared", is a number that indicates the proportion of the variance in the
dependent variable that is predictable from the independent variable(s)
• The R² we get is 0.6511. Roughly 65% of the variance found in the response
variable (distance) can be explained by the predictor variable (speed)
• What counts as a good R² value depends on the domain; Adjusted R² is used for multiple linear regression
https://en.wikipedia.org/wiki/Coefficient_of_determination
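R squared follows directly from two sums of squares: the variation left in the residuals and the total variation in the response. The sketch below recomputes it, along with the adjusted version, for the `dist ~ speed` model. The variable names are illustrative.

```r
fit <- lm(dist ~ speed, data = cars)

sse <- sum(resid(fit)^2)                      # variation left unexplained
sst <- sum((cars$dist - mean(cars$dist))^2)   # total variation in dist

r2 <- 1 - sse / sst

# Adjusted R squared penalises extra predictors (k = number of predictors).
n <- nrow(cars)
k <- 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(c(r2 = r2, adj_r2 = adj_r2), 4))
```

With a single predictor the adjustment is small, but it matters when comparing models with different numbers of predictors.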
F Statistic & P-Value
• Indicator of whether there is a relationship between our predictor and the
response variables
• An F-statistic well above 1 suggests we can reject the null hypothesis: no relation between
speed and distance exists
• We can consider a linear model to be statistically significant only when both
the model p-value and the predictor's p-value are less than the pre-determined
statistical significance level, which is commonly 0.05
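The F-statistic and its p-value can be pulled out of the model summary and checked against the significance level. A minimal sketch, again refitting `dist ~ speed` as `fit` so it runs standalone:

```r
fit <- lm(dist ~ speed, data = cars)

# Named vector: value (the F-statistic), numdf and dendf (its degrees of freedom).
f <- summary(fit)$fstatistic

# p-value: probability of an F at least this large if there were no relationship.
p_value <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

print(c(F = f[["value"]], p_value = p_value))
print(p_value < 0.05)   # compare with the chosen significance level
```

With a single predictor the F-statistic equals the square of the slope's t value, so the two tests agree.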
Summary
What did we do?
• Examined the data
• Plotted the data
• Created a Simple Linear Regression Model
• Coefficient Analysis
• Residual Analysis
• R² Analysis
• F Statistic
Is the current state of the model good enough to be deployed /
used live?
Evaluation of Model : Split Train / Test
### CODE SNIPPET ###
## 80% of the sample size
sample_size <- floor(0.80 * nrow(cars))
## set the seed to make your partition reproducible
set.seed(123)
train_index <- sample(seq_len(nrow(cars)), size = sample_size)
train <- cars[ train_index, ]
test <- cars[-train_index, ]
## fit on the training rows only, then predict the held-out test rows
linear_model_subset <- lm(dist ~ speed, data=train)
distPred <- predict(linear_model_subset, test)
summary(linear_model_subset)
## predicted vs. actual stopping distance on the test set
plot(distPred, test$dist)
RMSE : To compare between models
### CODE SNIPPET ###
# Root mean squared error of a vector of prediction errors
rmse <- function(error)
{
  sqrt(mean(error^2))
}
# distPred holds the test-set predictions from the previous snippet
print(rmse(test$dist - distPred))
• RMSE : Root Mean Squared Error
• Typical distance between the observed values and the model predictions
OR
• How far the residuals are from zero
Food for thought !!!
Is the train / test split model the best
generalization we have?
(Covered in upcoming sessions)

More Related Content

What's hot

Machine learning
Machine learningMachine learning
Machine learning
Vatsal Gajera
 
Machine Learning Basics
Machine Learning BasicsMachine Learning Basics
Machine Learning Basics
Suresh Arora
 
Lecture #1: Introduction to machine learning (ML)
Lecture #1: Introduction to machine learning (ML)Lecture #1: Introduction to machine learning (ML)
Lecture #1: Introduction to machine learning (ML)
butest
 
Machine Learning and Real-World Applications
Machine Learning and Real-World ApplicationsMachine Learning and Real-World Applications
Machine Learning and Real-World Applications
MachinePulse
 
Machine learning basics
Machine learning basics Machine learning basics
Machine learning basics
Akanksha Bali
 
Machine Learning Using Python
Machine Learning Using PythonMachine Learning Using Python
Machine Learning Using Python
SavitaHanchinal
 
Lecture 1: What is Machine Learning?
Lecture 1: What is Machine Learning?Lecture 1: What is Machine Learning?
Lecture 1: What is Machine Learning?
Marina Santini
 
Machine learning with Big Data power point presentation
Machine learning with Big Data power point presentationMachine learning with Big Data power point presentation
Machine learning with Big Data power point presentation
David Raj Kanthi
 
Machine learning
Machine learningMachine learning
Machine learning
Dr Geetha Mohan
 
Introduction to machine learningunsupervised learning
Introduction to machine learningunsupervised learningIntroduction to machine learningunsupervised learning
Introduction to machine learningunsupervised learning
Sardar Alam
 
Machine learning
Machine learning Machine learning
Machine learning
Saurabh Agrawal
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
Rahul Kumar
 
Understanding Basics of Machine Learning
Understanding Basics of Machine LearningUnderstanding Basics of Machine Learning
Understanding Basics of Machine Learning
Pranav Ainavolu
 
Building a performing Machine Learning model from A to Z
Building a performing Machine Learning model from A to ZBuilding a performing Machine Learning model from A to Z
Building a performing Machine Learning model from A to Z
Charles Vestur
 
What is Machine Learning
What is Machine LearningWhat is Machine Learning
What is Machine Learning
Bhaskara Reddy Sannapureddy
 
Machine learning
Machine learningMachine learning
Machine learning
Sanjay krishne
 
"An Introduction to Machine Learning and How to Teach Machines to See," a Pre...
"An Introduction to Machine Learning and How to Teach Machines to See," a Pre..."An Introduction to Machine Learning and How to Teach Machines to See," a Pre...
"An Introduction to Machine Learning and How to Teach Machines to See," a Pre...
Edge AI and Vision Alliance
 
An overview of machine learning
An overview of machine learningAn overview of machine learning
An overview of machine learning
drcfetr
 
Application of machine learning in industrial applications
Application of machine learning in industrial applicationsApplication of machine learning in industrial applications
Application of machine learning in industrial applications
Anish Das
 
Machine Learning for dummies!
Machine Learning for dummies!Machine Learning for dummies!
Machine Learning for dummies!
ZOLLHOF - Tech Incubator
 

What's hot (20)

Machine learning
Machine learningMachine learning
Machine learning
 
Machine Learning Basics
Machine Learning BasicsMachine Learning Basics
Machine Learning Basics
 
Lecture #1: Introduction to machine learning (ML)
Lecture #1: Introduction to machine learning (ML)Lecture #1: Introduction to machine learning (ML)
Lecture #1: Introduction to machine learning (ML)
 
Machine Learning and Real-World Applications
Machine Learning and Real-World ApplicationsMachine Learning and Real-World Applications
Machine Learning and Real-World Applications
 
Machine learning basics
Machine learning basics Machine learning basics
Machine learning basics
 
Machine Learning Using Python
Machine Learning Using PythonMachine Learning Using Python
Machine Learning Using Python
 
Lecture 1: What is Machine Learning?
Lecture 1: What is Machine Learning?Lecture 1: What is Machine Learning?
Lecture 1: What is Machine Learning?
 
Machine learning with Big Data power point presentation
Machine learning with Big Data power point presentationMachine learning with Big Data power point presentation
Machine learning with Big Data power point presentation
 
Machine learning
Machine learningMachine learning
Machine learning
 
Introduction to machine learningunsupervised learning
Introduction to machine learningunsupervised learningIntroduction to machine learningunsupervised learning
Introduction to machine learningunsupervised learning
 
Machine learning
Machine learning Machine learning
Machine learning
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Understanding Basics of Machine Learning
Understanding Basics of Machine LearningUnderstanding Basics of Machine Learning
Understanding Basics of Machine Learning
 
Building a performing Machine Learning model from A to Z
Building a performing Machine Learning model from A to ZBuilding a performing Machine Learning model from A to Z
Building a performing Machine Learning model from A to Z
 
What is Machine Learning
What is Machine LearningWhat is Machine Learning
What is Machine Learning
 
Machine learning
Machine learningMachine learning
Machine learning
 
"An Introduction to Machine Learning and How to Teach Machines to See," a Pre...
"An Introduction to Machine Learning and How to Teach Machines to See," a Pre..."An Introduction to Machine Learning and How to Teach Machines to See," a Pre...
"An Introduction to Machine Learning and How to Teach Machines to See," a Pre...
 
An overview of machine learning
An overview of machine learningAn overview of machine learning
An overview of machine learning
 
Application of machine learning in industrial applications
Application of machine learning in industrial applicationsApplication of machine learning in industrial applications
Application of machine learning in industrial applications
 
Machine Learning for dummies!
Machine Learning for dummies!Machine Learning for dummies!
Machine Learning for dummies!
 

Similar to Introduction to machine learning and model building using linear regression

Machine learning introduction to unit 1.ppt
Machine learning introduction to unit 1.pptMachine learning introduction to unit 1.ppt
Machine learning introduction to unit 1.ppt
ShivaShiva783981
 
1440 track 2 boire_using our laptop
1440 track 2 boire_using our laptop1440 track 2 boire_using our laptop
1440 track 2 boire_using our laptop
Rising Media, Inc.
 
Ml ppt at
Ml ppt atMl ppt at
Ml ppt at
pradeep kumar
 
Machine_Learning.pptx
Machine_Learning.pptxMachine_Learning.pptx
Machine_Learning.pptx
VickyKumar131533
 
Machine Learning event gdsc haldia
Machine Learning event gdsc haldiaMachine Learning event gdsc haldia
Machine Learning event gdsc haldia
XAnLiFE
 
Machine learning full guide gdsc haldia
Machine learning full guide  gdsc haldiaMachine learning full guide  gdsc haldia
Machine learning full guide gdsc haldia
XAnLiFE
 
A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...
A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...
A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...
PAPIs.io
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learning
Sanghamitra Deb
 
Machine Learning in the Financial Industry
Machine Learning in the Financial IndustryMachine Learning in the Financial Industry
Machine Learning in the Financial Industry
Subrat Panda, PhD
 
Intro to machine learning
Intro to machine learningIntro to machine learning
Intro to machine learning
Tamir Taha
 
Machine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional ManagersMachine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional Managers
Albert Y. C. Chen
 
Supervised learning
Supervised learningSupervised learning
Supervised learning
Johnson Ubah
 
Unveiling the Market: Predicting House Prices with Data Science
Unveiling the Market: Predicting House Prices with Data ScienceUnveiling the Market: Predicting House Prices with Data Science
Unveiling the Market: Predicting House Prices with Data Science
Boston Institute of Analytics
 
Predicting House Prices: A Machine Learning Approach
Predicting House Prices: A Machine Learning ApproachPredicting House Prices: A Machine Learning Approach
Predicting House Prices: A Machine Learning Approach
Boston Institute of Analytics
 
Fast Parallel Similarity Calculations with FPGA Hardware
Fast Parallel Similarity Calculations with FPGA HardwareFast Parallel Similarity Calculations with FPGA Hardware
Fast Parallel Similarity Calculations with FPGA Hardware
TigerGraph
 
Predicting user demographics in social networks - Invited Talk at University ...
Predicting user demographics in social networks - Invited Talk at University ...Predicting user demographics in social networks - Invited Talk at University ...
Predicting user demographics in social networks - Invited Talk at University ...
Nikolaos Aletras
 
Machine Learning With ML.NET
Machine Learning With ML.NETMachine Learning With ML.NET
Machine Learning With ML.NET
Dev Raj Gautam
 
Business Analytics.pptx
Business Analytics.pptxBusiness Analytics.pptx
Business Analytics.pptx
Parveen Vashisth
 
Market Basket Analysis Revisited using SQL Pattern Matching
Market Basket Analysis Revisited using SQL Pattern Matching Market Basket Analysis Revisited using SQL Pattern Matching
Market Basket Analysis Revisited using SQL Pattern Matching
Shankar Somayajula
 
Kp-Data Analytics-ts.pptx
Kp-Data Analytics-ts.pptxKp-Data Analytics-ts.pptx
Kp-Data Analytics-ts.pptx
CloudBusiness2
 

Similar to Introduction to machine learning and model building using linear regression (20)

Machine learning introduction to unit 1.ppt
Machine learning introduction to unit 1.pptMachine learning introduction to unit 1.ppt
Machine learning introduction to unit 1.ppt
 
1440 track 2 boire_using our laptop
1440 track 2 boire_using our laptop1440 track 2 boire_using our laptop
1440 track 2 boire_using our laptop
 
Ml ppt at
Ml ppt atMl ppt at
Ml ppt at
 
Machine_Learning.pptx
Machine_Learning.pptxMachine_Learning.pptx
Machine_Learning.pptx
 
Machine Learning event gdsc haldia
Machine Learning event gdsc haldiaMachine Learning event gdsc haldia
Machine Learning event gdsc haldia
 
Machine learning full guide gdsc haldia
Machine learning full guide  gdsc haldiaMachine learning full guide  gdsc haldia
Machine learning full guide gdsc haldia
 
A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...
A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...
A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learning
 
Machine Learning in the Financial Industry
Machine Learning in the Financial IndustryMachine Learning in the Financial Industry
Machine Learning in the Financial Industry
 
Intro to machine learning
Intro to machine learningIntro to machine learning
Intro to machine learning
 
Machine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional ManagersMachine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional Managers
 
Supervised learning
Supervised learningSupervised learning
Supervised learning
 
Unveiling the Market: Predicting House Prices with Data Science
Unveiling the Market: Predicting House Prices with Data ScienceUnveiling the Market: Predicting House Prices with Data Science
Unveiling the Market: Predicting House Prices with Data Science
 
Predicting House Prices: A Machine Learning Approach
Predicting House Prices: A Machine Learning ApproachPredicting House Prices: A Machine Learning Approach
Predicting House Prices: A Machine Learning Approach
 
Fast Parallel Similarity Calculations with FPGA Hardware
Fast Parallel Similarity Calculations with FPGA HardwareFast Parallel Similarity Calculations with FPGA Hardware
Fast Parallel Similarity Calculations with FPGA Hardware
 
Predicting user demographics in social networks - Invited Talk at University ...
Predicting user demographics in social networks - Invited Talk at University ...Predicting user demographics in social networks - Invited Talk at University ...
Predicting user demographics in social networks - Invited Talk at University ...
 
Machine Learning With ML.NET
Machine Learning With ML.NETMachine Learning With ML.NET
Machine Learning With ML.NET
 
Business Analytics.pptx
Business Analytics.pptxBusiness Analytics.pptx
Business Analytics.pptx
 
Market Basket Analysis Revisited using SQL Pattern Matching
Market Basket Analysis Revisited using SQL Pattern Matching Market Basket Analysis Revisited using SQL Pattern Matching
Market Basket Analysis Revisited using SQL Pattern Matching
 
Kp-Data Analytics-ts.pptx
Kp-Data Analytics-ts.pptxKp-Data Analytics-ts.pptx
Kp-Data Analytics-ts.pptx
 

Recently uploaded

End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
Social Samosa
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Kiwi Creative
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
apvysm8
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
zsjl4mimo
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
soxrziqu
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
Social Samosa
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
Timothy Spann
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
vikram sood
 


Introduction to machine learning and model building using linear regression

  • 2. Introducing the Speaker • Girish Gore: 10+ years of experience in Data Analytics / Data Science • B.E. Computer Science from VIT Pune, M.S. from BITS Pilani • Spent time on data products, mainly in companies like • Cognizant (Innovations Group) • SAS (Pricing & Revenue Management) • VuClip (Video Entertainment) • Shoptimize (E-Commerce) • Worked in fields like • Text Mining • Forecasting and Optimization • Recommender Systems
  • 3. Knowing the Audience Average Experience in Industry ? Average ML Experience ?
  • 4. Understanding Terminologies • Artificial Intelligence: AI involves machines that can perform tasks that are characteristic of human intelligence. • Machine Learning: Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. • Deep Learning: Deep Learning is an attempt to mimic the workings of the brain. Deep Learning is one of many approaches to machine learning.
  • 6. Traditional Programming vs Machine Learning • If programming automates processes, Machine Learning automates program generation, i.e. it automates automation. • Data and the desired output are run on the computer to create a program. This program can then be used in traditional programming.
  • 7. What is Machine Learning? • Machine Learning is • the study of algorithms that • improve their performance at a particular task • with experience (previous data, output) • Optimize a performance criterion using example data or past experience • Role of Computer Science: Efficient Algorithms • Solve the optimization problem • Represent and evaluate the model for inference
  • 8. Why are we here now? (Google Trends) • Exponential increase in data generation and accumulation • Increasing computational power • Growing progress in available algorithms and research • Software becoming too complex to write by hand
  • 9. Common Applications of Machine Learning • Web search: ranking pages based on what you are most likely to click on. • Finance: deciding who to send which credit card offers to. Evaluating risk on credit offers. Deciding where to invest money. • E-commerce: predicting customer churn. Deciding whether or not a transaction is fraudulent. • Robotics: handling uncertainty in new environments. Autonomous, self-driving cars. • Information extraction: asking questions over databases across the web. • Social networks: data on relationships and preferences. Machine learning to extract value from data. • Debugging: use in computer science, especially in labor-intensive processes like debugging. Could suggest where the bug could be. • Gaming, IBM Watson
  • 10. Types of Machine Learning • Learning Associations • Supervised Learning • Regression • Classification • Unsupervised Learning • Reinforcement Learning • Semi-supervised Learning • Training data includes a few desired outputs. Sits between supervised and unsupervised learning
  • 11. Learning Associations • Market Basket analysis: P(Y | X) is the probability that somebody who buys X also buys Y, where X and Y are products/services. Example: P(diaper | beer) = 0.7
      TransactionID | Basket Items
      1 | Bread, Milk
      2 | Bread, Diaper, Beer, Eggs
      3 | Milk, Diaper, Beer, Coke
      4 | Bread, Milk, Diaper, Beer
      5 | Bread, Milk, Diaper, Coke
  • 12. Learning Associations • Support: the probability that a customer buys diaper and beer together, among all sales transactions (the higher the support, the better) • Confidence: given that a customer picks up diaper, how likely is he/she to also buy beer? (the closer to 1, the better) • Lift: compares our rule with a naive model in which the items are bought independently, i.e. how much more likely a customer is to buy both together than would be expected from buying them separately (a useful rule has lift > 1)
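The three measures can be computed by hand for the five transactions in the basket table on slide 11. This is a minimal sketch in base R for the rule "Diaper => Beer"; the helper name `support` is introduced here for illustration only. The values come from the five-row table, so they need not match the illustrative 0.7 quoted on the slide.

```r
# Transactions copied from the basket table on slide 11
baskets <- list(
  c("Bread", "Milk"),
  c("Bread", "Diaper", "Beer", "Eggs"),
  c("Milk", "Diaper", "Beer", "Coke"),
  c("Bread", "Milk", "Diaper", "Beer"),
  c("Bread", "Milk", "Diaper", "Coke")
)

# Illustrative helper: fraction of baskets that contain every item in `items`
support <- function(items) {
  mean(sapply(baskets, function(b) all(items %in% b)))
}

sup_both   <- support(c("Diaper", "Beer"))  # P(Diaper and Beer) = 3/5
confidence <- sup_both / support("Diaper")  # P(Beer | Diaper)   = 0.6 / 0.8
lift       <- confidence / support("Beer")  # confidence relative to baseline P(Beer)

print(c(support = sup_both, confidence = confidence, lift = lift))
# support = 0.6, confidence = 0.75, lift = 1.25
```

A lift of 1.25 means that, in this tiny sample, beer is 25% more likely to be in a basket that contains diapers than in an arbitrary basket.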
  • 13. Supervised Learning • Supervised Learning is the Machine Learning task of inferring a generalized function from labelled training data. Training data includes desired outputs. Examples: Spam Detection, Credit Scoring, Face Detection • In Supervised Learning for spam detection we have • Email contents with labels marking Spam or Non-Spam • The task is to label newer emails • Two main types of Supervised Learning problems • Regression • Classification
  • 14. Supervised Learning • Regression Problems • Map input data to a continuous prediction variable • Example: Predicting retail house prices (price as a continuous variable) • Classification Problems • Map input data to a set of predefined classes • Example: Benign or Malignant Tumours
  • 15. Regression: House Price Prediction • We have historic data about the size of houses and their prices for the last 1 year • The task is to predict the price of a house given its size • Model Derivation: Price = Slope of Line * Size + Constant
  • 16. Classification : Credit Scoring We have labelled data of low and high risk customers. Task is differentiating between low-risk and high-risk customers from their income and savings. Model Derivation: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
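The IF/THEN rule on slide 16 translates directly into a small R function. This is only a sketch: the slide does not give values for θ1 and θ2, so the thresholds below (and the sample customers) are hypothetical placeholders. In practice the thresholds would be learned from the labelled data.

```r
# Hypothetical thresholds; the slide leaves theta1 and theta2 unspecified
theta1 <- 30000  # income threshold (assumed)
theta2 <- 10000  # savings threshold (assumed)

# Rule from the slide: low-risk only if BOTH income and savings exceed their thresholds
classify_risk <- function(income, savings) {
  ifelse(income > theta1 & savings > theta2, "low-risk", "high-risk")
}

# Three made-up customers
risk <- classify_risk(income  = c(45000, 45000, 20000),
                      savings = c(15000,  5000, 25000))
print(risk)
# "low-risk" "high-risk" "high-risk"
```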
  • 17. Unsupervised Learning • Training data does not include the desired output. The task is to find hidden structure in unlabeled data • Common approaches to Unsupervised Learning • Clustering or Segmentation (Customer Segmentation) • Dimensionality Reduction (PCA (Principal Component Analysis), SVD (Singular Value Decomposition)) • Summarization
  • 18. Un Supervised Learning • Customer Segmentation: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs. • The clustering algorithm forms 3 different groups of customers to target.
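A three-group segmentation like the one described on slide 18 could be sketched with base R's `kmeans`. The deck does not say which clustering algorithm or data set produced its chart, so both the choice of k-means and the synthetic customers below (columns `income` and `spending`) are assumptions made for illustration.

```r
set.seed(42)

# Synthetic customers (assumed data): three well-separated groups of 20 each
group_centers <- rbind(c(20, 20), c(60, 80), c(100, 30))
customers <- data.frame(
  income   = rep(group_centers[, 1], each = 20) + rnorm(60, sd = 3),
  spending = rep(group_centers[, 2], each = 20) + rnorm(60, sd = 3)
)

# Ask k-means for 3 segments; nstart = 10 retries from several random starts
segments <- kmeans(customers, centers = 3, nstart = 10)

table(segments$cluster)  # number of customers per segment
plot(customers, col = segments$cluster,
     main = "Customer Segments (synthetic data)")
```

No labels are passed to `kmeans`; the groups are found from the data alone, which is what distinguishes this from the supervised examples.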
  • 19. Reinforcement Learning • Learning from interaction with the environment to achieve a goal. Rewards come from a sequence of actions. • Every action has either a • Reward OR • Observation • Examples • Self-Driving Cars • Recommender Systems • Research link: https://www.cs.utexas.edu/~eladlieb/RLRG.html
  • 20. ML – Data Science Relationship
  • 22. Linear Regression • In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X • The case of one explanatory variable is called simple linear regression • For more than one explanatory variable, the process is called multiple linear regression • https://en.wikipedia.org/wiki/Linear_regression
  • 23. From School Book: Linear Equations • Y = mX + b • m = Slope = (Change in Y) / (Change in X) • b = Y-intercept
  • 24. Linear Regression: A Common Example • Ohm's Law: In physics, it is observed that the relationship between Voltage (V), Current (I) and Resistance (R) is a linear relationship expressed as V = I * R, or equivalently I = V / R • In a circuit board, for a given Resistance R, as you increase the Voltage V, the Current I increases proportionately • http://www.electronics-tutorials.ws/dccircuits/dcp_1.html
  • 25. Sample Monthly Income-Expense Data of a Household • We have to find the relationship between Income and Expenses of a household
      Monthly Income (in Rs.) | Monthly Expense (in Rs.)
      5,000 | 8,000
      6,000 | 7,000
      10,000 | 4,500
      10,000 | 2,000
      12,500 | 12,000
      14,000 | 8,000
      15,000 | 16,000
      18,000 | 20,000
      19,000 | 9,000
      20,000 | 9,000
      20,000 | 18,000
      22,000 | 25,000
      23,400 | 5,000
      24,000 | 10,500
      24,000 | 10,000
      [Chart: Income vs. Expense scatter plot, Monthly Income (x-axis) vs. Monthly Expense (y-axis), with trend line y = 0.3008x + 6319.1, R² = 0.4215]
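As a rough sketch, the 15 income/expense pairs listed on slide 25 can be fitted with `lm` in R. The variable names are chosen here for illustration. The chart's axis runs to 60,000, so the trend line on the slide was presumably fitted on more rows than the sample shown; the coefficients from this snippet should not be expected to reproduce y = 0.3008x + 6319.1 exactly.

```r
# The 15 rows listed on slide 25 (amounts in Rs.)
income  <- c(5000, 6000, 10000, 10000, 12500, 14000, 15000, 18000,
             19000, 20000, 20000, 22000, 23400, 24000, 24000)
expense <- c(8000, 7000, 4500, 2000, 12000, 8000, 16000, 20000,
             9000, 9000, 18000, 25000, 5000, 10500, 10000)
household <- data.frame(income, expense)

# Simple linear regression: expense as a linear function of income
fit <- lm(expense ~ income, data = household)

print(coef(fit))               # intercept and slope of the fitted line
print(summary(fit)$r.squared)  # share of variance in expense explained by income

plot(household, main = "Income vs. Expense (15 listed rows)")
abline(fit, col = "red")       # overlay the fitted line
```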
  • 26. Line of Best Fit • [Chart: Income vs. Expense scatter plot, Monthly Income vs. Monthly Expense, with several candidate lines] • Which of these lines best describes the relationship between Household Income and Expenses?
  • 27. Line of Best Fit • [Chart: Income vs. Expense scatter plot, Monthly Income vs. Monthly Expense, with a fitted line and the errors $e_m$ and $e_n$ marked] • The Line of Best Fit is the one for which the Sum of Squared Errors (SSE) is minimum (OLS technique) • Error: $e_m = y_m - \hat{y}_m$ • $\hat{Y}_i = b_0 + b_1 X_i$ is the sample regression equation • $SSE = \sum_i \hat{e}_i^2$ (1) $= \sum_i (Y_i - \hat{Y}_i)^2$ (2) $= \sum_i (Y_i - b_0 - b_1 X_i)^2$ (3) • Using calculus we get $b_1 = \frac{n \sum_i X_i Y_i - \sum_i X_i \sum_i Y_i}{n \sum_i X_i^2 - \left(\sum_i X_i\right)^2}$ and $b_0 = \frac{\sum_i Y_i - b_1 \sum_i X_i}{n}$
  • 28. Least Squares • 'Best Fit' means the difference between actual Y values and predicted Y values is a minimum. But positive differences offset negative ones, so square the errors! • LS minimizes the Sum of the Squared Differences (errors) (SSE): $\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} \hat{e}_i^2$
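The closed-form least-squares estimates can be checked numerically against `lm`. This sketch uses R's built-in `cars` data set (the same one the later slides use) and the standard simple-regression formulas b1 = (nΣXY - ΣXΣY) / (nΣX² - (ΣX)²) and b0 = (ΣY - b1ΣX) / n; the variable names are illustrative.

```r
x <- cars$speed
y <- cars$dist
n <- length(x)

# Closed-form OLS estimates for simple linear regression
b1 <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)
b0 <- (sum(y) - b1 * sum(x)) / n

# Sum of squared errors at the fitted line
sse <- sum((y - (b0 + b1 * x))^2)

print(c(b0 = b0, b1 = b1, sse = sse))

# lm() minimizes the same criterion, so its coefficients should agree
print(coef(lm(dist ~ speed, data = cars)))
```

Moving `b0` or `b1` away from these values and recomputing `sse` gives a larger number, which is what "line of best fit" means under this criterion.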
  • 29. Simple Linear Regression in R ### CODE SNIPPET ###
      ?cars
      # Investigating the basics of the data set
      str(cars)
      attributes(cars)
  • 30. Examining the data ### CODE SNIPPET ###
      # How do the speed and distance value summaries look? Any NAs?
      summary(cars)
      # Is there a correlation between speed and distance to stop?
      cor(cars$speed, cars$dist)
  • 31. Plotting the data ### CODE SNIPPET ###
      # Scatter plot of speed against stopping distance
      plot(cars, main = "Speed vs. Distance to Stop")
      # Same scatter plot with a smoothed trend line
      scatter.smooth(cars, lpars = list(col = "red", lwd = 3, lty = 3))
      # Box plot to spot outliers in stopping distance
      boxplot(cars$dist, main = "Outliers for Distance")
      # Density of speed
      plot(density(cars$speed), main = "Density Distribution of Speed", type = "h", col = "blue")
  • 32. Basic Linear Model ### CODE SNIPPET ###
      linear_model <- lm(dist ~ speed, data = cars)
      summary(linear_model)
  • 33. Coefficient Analysis • Coefficient - Estimate • The Y-intercept is -17.5791 • For every 1 mph increase in the speed of a car, the required distance to stop goes up by 3.9324 feet. • Coefficient - Standard Error • The coefficient Standard Error estimates how much the coefficient estimate would vary if the model were fitted on repeated samples. We'd ideally want it to be small relative to the coefficient. • Coefficient - t value • The coefficient t-value measures how many standard errors our coefficient estimate is away from 0. We want it to be far from zero, as this would indicate we could reject the null hypothesis, that is, we could declare that a relationship between speed and distance exists. In general, t-values are also used to compute p-values. • Coefficient - Pr(>|t|) • A small p-value for the intercept and the slope indicates that we can reject the null hypothesis, which allows us to conclude that there is a relationship between speed and distance.
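The four columns discussed on slide 33 can be pulled out of the model summary and partly recomputed by hand. A minimal sketch, refitting the same `lm(dist ~ speed, data = cars)` model so the snippet is self-contained; `t_manual` and `p_manual` are illustrative names.

```r
linear_model <- lm(dist ~ speed, data = cars)

# Matrix with columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(linear_model))
print(coefs)

# t value = estimate divided by its standard error
t_manual <- coefs[, "Estimate"] / coefs[, "Std. Error"]

# Two-sided p-value from the t distribution with the residual degrees of freedom
p_manual <- 2 * pt(abs(t_manual), df = df.residual(linear_model), lower.tail = FALSE)

print(cbind(t_manual, p_manual))
```

The recomputed columns should match the `t value` and `Pr(>|t|)` columns printed by `summary`.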
  • 34. Residual Analysis ### CODE SNIPPET ###
      # Predictions on the training data and the corresponding residuals
      pred_dist <- predict(linear_model, newdata = cars)
      residuals <- cars$dist - pred_dist
      summary(residuals)
      plot(pred_dist, residuals, xlab = "Predicted Values", ylab = "Residuals", main = "Residual Plot", col = "blue")
  • 35. Which residual plot suggests a good fit? (Poll)
  • 36. Residual Standard Error • Residual Standard Error is a measure of the quality of a linear regression fit. • The Residual Standard Error is the average amount that the response (dist) will deviate from the true regression line. • In our example, the actual distance required to stop can deviate from the true regression line by approximately 15.3795867 feet, on average (roughly 4 times the slope of 3.93). • The Residual Standard Error was calculated with 48 degrees of freedom. Simplistically, degrees of freedom are the number of data points that went into the estimation, minus the number of parameters estimated (50 observations - 2 coefficients = 48).
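The Residual Standard Error and its degrees of freedom can be reproduced from the residuals. A short sketch, again refitting the `cars` model so it runs on its own; `rse` is an illustrative name.

```r
linear_model <- lm(dist ~ speed, data = cars)

n <- nrow(cars)                  # 50 observations
p <- length(coef(linear_model))  # 2 estimated parameters (intercept, slope)
df_resid <- n - p                # 48 degrees of freedom

# RSE = sqrt(SSE / residual degrees of freedom)
rse <- sqrt(sum(resid(linear_model)^2) / df_resid)

print(rse)                          # about 15.38
print(summary(linear_model)$sigma)  # the value reported by summary()
```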
  • 37. Coefficient of Determination • In statistics, the coefficient of determination, denoted R2 or r2 and pronounced "R squared", is a number that indicates the proportion of the variance in the dependent variable that is predictable from the independent variable(s) • The R2 we get is 0.6511. Roughly 65% of the variance found in the response variable (distance) can be explained by the predictor variable (speed) • Whether an R2 value is good depends on the domain; Adjusted R2 is used for multiple linear regression • https://en.wikipedia.org/wiki/Coefficient_of_determination
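R2 can be computed directly as 1 - SSE / SST, which makes the "proportion of variance explained" reading concrete. A minimal sketch on the `cars` model; the adjusted R2 line uses the standard formula 1 - (1 - R2)(n - 1) / (n - p), with p counting the intercept.

```r
linear_model <- lm(dist ~ speed, data = cars)

sse <- sum(resid(linear_model)^2)            # unexplained variation
sst <- sum((cars$dist - mean(cars$dist))^2)  # total variation around the mean

r2 <- 1 - sse / sst

n <- nrow(cars)
p <- length(coef(linear_model))              # parameters, including the intercept
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p)

print(c(r2 = r2, adj_r2 = adj_r2))
```

With a single predictor the two values are close; the adjustment matters more as predictors are added.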
  • 38. F-Statistic & p-Value • The F-statistic is an indicator of whether there is a relationship between our predictor and the response variable • A value well above 1 suggests we can reject the null hypothesis that no relationship between speed and distance exists • We can consider a linear model to be statistically significant only when both p-values (the coefficient's and the overall model's) are less than the pre-determined statistical significance level, commonly 0.05
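`summary` prints the model p-value but does not store it, so it is usually recomputed from the stored F-statistic. A sketch on the `cars` model; `model_p` is an illustrative name.

```r
linear_model <- lm(dist ~ speed, data = cars)

# Named vector: value (the F-statistic), numdf, dendf
fstat <- summary(linear_model)$fstatistic
print(fstat)

# Upper-tail probability of the F distribution gives the model p-value
model_p <- pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)
print(unname(model_p))

# With a single predictor, F equals the square of the slope's t value
slope_t <- coef(summary(linear_model))["speed", "t value"]
print(slope_t^2)
```

Because F equals the squared t value here, the model p-value and the slope's p-value coincide in simple linear regression; they differ once there are several predictors.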
  • 40. What did we do? • Examined the data • Plotted the data • Created a Simple Linear Regression model • Coefficient Analysis • Residual Analysis • R2 Analysis • F Statistics • Is the current state of the model good enough to be deployed / used live?
  • 41. Evaluation of Model: Split Train / Test ### CODE SNIPPET ###
      ## 80% of the sample size
      sample_size <- floor(0.80 * nrow(cars))
      ## set the seed to make your partition reproducible
      set.seed(123)
      train_index <- sample(seq_len(nrow(cars)), size = sample_size)
      train <- cars[train_index, ]
      test <- cars[-train_index, ]
      ## fit on the training rows, predict on the held-out rows
      linear_model_subset <- lm(dist ~ speed, data = train)
      distPred <- predict(linear_model_subset, test)
      summary(linear_model_subset)
      plot(distPred, test$dist)
  • 42. RMSE: To compare between models ### CODE SNIPPET ###
      # Root of the mean of the squared errors
      rmse <- function(error) { sqrt(mean(error^2)) }
      # Errors on the held-out test rows (distPred comes from the train/test split)
      print(rmse(test$dist - distPred))
  • RMSE: Root Mean Squared Error • A measure of the typical distance between the observed values and the model predictions, OR • how far the residuals are from zero
  • 43. Food for thought! Is the model from a single train / test split the best generalization we have? Covered in upcoming sessions.