Data mining with R - Regression Models
A curation from: Data Analysis Course
Weeks 4-5-6



Usage Rights

© All Rights Reserved

Presentation Transcript

  • Data Mining with R: Regression Models
    Hamideh Iraj
  • Slides Reference
    This is a curation from: Data Analysis Course, Weeks 4-5-6
  • Galton Data – Introduction
    library(UsingR)
    data(galton)
    head(galton)
    tail(galton)
    dim(galton)
    str(galton)
    summary(galton)
    summary(galton$child)
  • Galton Data – Plotting
    par(mfrow=c(1,2))
    hist(galton$child, col="blue", breaks=100)
    hist(galton$parent, col="blue", breaks=100)
  • Galton Data – Plotting (cont.)
    pairs(galton)
  • What is Regression Analysis?
    Regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables.
  • Fitting a Line
    plot(galton$parent, galton$child, pch=19, col="blue")   # predictor on x, outcome on y, matching child ~ parent
    lm1 <- lm(child ~ parent, data=galton)
    lines(galton$parent, lm1$fitted, col="red", lwd=3)      # lwd sets the line width
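As a rough illustration of what lm() computes for a single predictor, the sketch below compares the closed-form least-squares estimates with the fitted coefficients. The data are simulated stand-ins (hypothetical values, not the Galton set), so that the example runs without the UsingR package.

```r
# Simulated stand-in for parent/child heights (hypothetical data, not Galton's)
set.seed(1)
parent <- rnorm(200, mean = 68, sd = 1.8)
child  <- 24 + 0.65 * parent + rnorm(200, sd = 2.2)

# Closed-form least-squares estimates for one predictor
slope     <- cov(parent, child) / var(parent)
intercept <- mean(child) - slope * mean(parent)

# The same estimates from lm()
fit <- lm(child ~ parent)
print(c(intercept = intercept, slope = slope))
print(coef(fit))
```

The two printed pairs should agree up to floating-point error.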
  • Plot Residuals
    plot(galton$parent, lm1$residuals, col="blue", pch=19)
    abline(c(0,0), col="red", lwd=3)   # horizontal reference line at zero
  • Linear Model Coefficients
    summary(lm1)
    lm1$coeff
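A residual plot is read against the zero line because, for a least-squares fit with an intercept, the residuals sum to zero and are uncorrelated with the predictor. A minimal sketch of that check, assuming R's built-in cars data set in place of the Galton data:

```r
# Built-in data set used as a stand-in: stopping distance vs. speed
fit <- lm(dist ~ speed, data = cars)
res <- residuals(fit)

# Both quantities are zero up to floating-point error
sum_res <- sum(res)
cor_res <- cor(res, cars$speed)
print(c(sum = sum_res, correlation = cor_res))
```

Any visible trend or curvature in the residual plot therefore points to structure the straight line has missed.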
  • Why care about model Accuracy?
  • Model Accuracy Measures
    - P-value
    - Confidence Interval
    - R^2
    - Adjusted R^2
  • P-value
    Most common measure of statistical significance.
    Idea: Suppose nothing is going on. How unusual is it to see the estimate we got?
    Some typical values (single test):
    - P < 0.05 (significant)
    - P < 0.01 (strongly significant)
    - P < 0.001 (very significant)
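The p-values reported by summary() for a linear model come from a t-test on each coefficient. The sketch below, again assuming the built-in cars data as a stand-in, recomputes the slope's two-sided p-value from its t statistic and compares it with the value in the coefficient table.

```r
fit   <- lm(dist ~ speed, data = cars)
coefs <- summary(fit)$coefficients   # columns: Estimate, Std. Error, t value, Pr(>|t|)

# Two-sided p-value from the t statistic and the residual degrees of freedom
t_val    <- coefs["speed", "t value"]
p_manual <- 2 * pt(abs(t_val), df = df.residual(fit), lower.tail = FALSE)
p_table  <- coefs["speed", "Pr(>|t|)"]
print(c(manual = p_manual, table = p_table))
```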
  • Confidence Intervals
    A confidence interval is a type of interval estimate of a population parameter and is used to indicate the reliability of an estimate.
    confint(lm1, level=0.95)
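For a linear model, the interval returned by confint() is the estimate plus or minus a t quantile times the standard error. A small sketch of that calculation, assuming the built-in cars data rather than the Galton fit:

```r
fit <- lm(dist ~ speed, data = cars)

est   <- coef(fit)
se    <- sqrt(diag(vcov(fit)))            # standard errors of the coefficients
tcrit <- qt(0.975, df.residual(fit))      # two-sided 95% critical value

manual  <- cbind(lower = est - tcrit * se, upper = est + tcrit * se)
builtin <- confint(fit, level = 0.95)
print(manual)
print(builtin)
```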
  • R^2
    R^2 is the proportion of response variation "explained" by the regressors in the model. R^2 = 1 indicates that the fitted model explains all variability in the response, while R^2 = 0 indicates no linear relationship between the response variable and the regressors (for straight-line regression, this means the fitted line is constant: slope = 0, intercept = bar{y}).
  • Adjusted R^2
    The use of an adjusted R^2 (often written as bar R^2 and pronounced "R bar squared") is an attempt to take account of the phenomenon of R^2 automatically and spuriously increasing when extra explanatory variables are added to the model.
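One common way to write this is R^2 = 1 - SS_res / SS_tot, the share of total variation left after subtracting the residual variation. The sketch below computes it by hand on the built-in cars data (a stand-in, not the Galton fit) and compares it with summary(); for a single predictor it also equals the squared correlation.

```r
fit <- lm(dist ~ speed, data = cars)

ss_res <- sum(residuals(fit)^2)                    # unexplained variation
ss_tot <- sum((cars$dist - mean(cars$dist))^2)     # total variation around the mean

r2_manual  <- 1 - ss_res / ss_tot
r2_summary <- summary(fit)$r.squared
r2_cor     <- cor(cars$speed, cars$dist)^2         # valid for one predictor only
print(c(manual = r2_manual, summary = r2_summary, cor_squared = r2_cor))
```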
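To see the effect described here, the sketch below adds a pure-noise column (a hypothetical predictor, generated at random) to a model on the built-in cars data. R^2 cannot go down when a predictor is added, while adjusted R^2 applies a penalty through the usual formula 1 - (1 - R^2)(n - 1)/(n - p - 1), with p the number of predictors.

```r
set.seed(42)
d <- cars
d$noise <- rnorm(nrow(d))   # hypothetical predictor unrelated to dist

small <- lm(dist ~ speed, data = d)
big   <- lm(dist ~ speed + noise, data = d)

# Adjusted R^2 by hand; p counts predictors, excluding the intercept
adj_r2 <- function(fit) {
  r2 <- summary(fit)$r.squared
  n  <- nobs(fit)
  p  <- length(coef(fit)) - 1
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

print(c(r2_small = summary(small)$r.squared, r2_big = summary(big)$r.squared))
print(c(adj_small = adj_r2(small), adj_big = adj_r2(big)))
```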
  • Predicting with Linear Regression
    coef(lm1)[1] + coef(lm1)[2]*80        # prediction by hand
    newdata <- data.frame(parent=80)
    predict(lm1, newdata)                 # the same prediction via predict()
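predict() can also attach uncertainty to the point prediction through its interval argument. A minimal sketch on the built-in cars data (a stand-in for the Galton model), showing that the prediction interval for a new observation is wider than the confidence interval for the mean response:

```r
fit     <- lm(dist ~ speed, data = cars)
new_obs <- data.frame(speed = 15)

# Point prediction by hand and via predict()
by_hand  <- unname(coef(fit)[1] + coef(fit)[2] * 15)
conf_int <- predict(fit, new_obs, interval = "confidence", level = 0.95)
pred_int <- predict(fit, new_obs, interval = "prediction", level = 0.95)

print(by_hand)
print(conf_int)   # columns: fit, lwr, upr
print(pred_int)
```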
  • Multivariate Linear Regression
    WHO childhood hunger data
    Dataset: ofile=text&filter=COUNTRY:*
    hunger <- read.csv("./hunger.csv")
    hunger <- hunger[hunger$Sex!="Both sexes", ]
  • Multivariate Linear Regression – cont.
    # Same slopes for both sexes
    lmBoth <- lm(hunger$Numeric ~ hunger$Year + hunger$Sex)
    # Different slopes for each sex (interaction term)
    lmBoth2 <- lm(hunger$Numeric ~ hunger$Year + hunger$Sex + hunger$Sex*hunger$Year)
  • Model Selection
    step(lmBoth2)
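The difference between the two models can be sketched on a data set that ships with R. The example below assumes mtcars as a stand-in for the hunger data, with transmission type playing the role of Sex and weight the role of Year. The additive model shares one slope across groups; the interaction model gives each group its own slope, matching a separate regression per group.

```r
d <- mtcars
d$am <- factor(d$am, labels = c("automatic", "manual"))   # 0 = automatic, 1 = manual

same_slope <- lm(mpg ~ wt + am, data = d)   # one slope, two intercepts
diff_slope <- lm(mpg ~ wt * am, data = d)   # wt * am expands to wt + am + wt:am

# Group-specific slopes recovered from the interaction model
slope_auto   <- unname(coef(diff_slope)["wt"])
slope_manual <- unname(coef(diff_slope)["wt"] + coef(diff_slope)["wt:ammanual"])

# Cross-check against a regression fitted to manual cars only
check_manual <- unname(coef(lm(mpg ~ wt, data = d, subset = am == "manual"))["wt"])
print(c(automatic = slope_auto, manual = slope_manual, check = check_manual))
```

Passing data = d, rather than writing d$mpg ~ d$wt, also lets predict() work with new data frames.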
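By default, step() searches by adding or dropping terms and keeps a change only when it lowers the AIC. The sketch below assumes mtcars and an arbitrary set of four predictors as a stand-in for the hunger model; which terms survive depends on the data, so no particular result is claimed.

```r
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# trace = 0 suppresses the step-by-step log
selected <- step(full, trace = 0)

print(formula(selected))
print(c(AIC_full = AIC(full), AIC_selected = AIC(selected)))
```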
  • Regression with Factor Variables
    - Outcome is still quantitative
    - Covariate(s) are factor variables
    - Fitting lines = fitting means
    - Want to evaluate contribution of all factor levels at once
  • Regression with Factor Variables – cont.
    Dataset:
    movies <- read.table("./movies.txt", sep="\t", header=T, quote="")   # tab-separated file
    head(movies)
  • Regression with Factor Variables – cont.
    lm2 <- lm(movies$score ~ as.factor(movies$rating))
    summary(lm2)
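"Fitting lines = fitting means" can be checked directly: with a single factor covariate, the intercept is the mean of the reference level and each other coefficient is a difference from it. The sketch below assumes the built-in PlantGrowth data as a stand-in for the movies file, and uses anova() to assess all factor levels at once.

```r
fit <- lm(weight ~ group, data = PlantGrowth)   # group has levels ctrl, trt1, trt2

# Group means computed directly
group_means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)

# The same means rebuilt from the model coefficients
b <- coef(fit)
from_model <- c(ctrl = b[["(Intercept)"]],
                trt1 = b[["(Intercept)"]] + b[["grouptrt1"]],
                trt2 = b[["(Intercept)"]] + b[["grouptrt2"]])
print(group_means)
print(from_model)

# One F-test for the factor as a whole, rather than one t-test per level
print(anova(fit))
```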