Data mining with R - regression models

on Dec 31, 2013

A curation from: Data Analysis Course, Weeks 4-5-6
https://www.coursera.org/course/dataanalysis

Data mining with R - regression models: Presentation Transcript

• Data Mining with R: Regression Models. Hamideh Iraj, Hamideh.iraj@ut.ac.ir
• Slides Reference: This is a curation from the Data Analysis Course, Weeks 4-5-6: https://www.coursera.org/course/dataanalysis
• Galton Data – Introduction
    library(UsingR)
    data(galton)
    # first and last rows
    head(galton)
    tail(galton)
    # dimensions and structure
    dim(galton)
    str(galton)
    # summaries of the whole data frame and of one column
    summary(galton)
    summary(galton$child)
• Galton Data - Plotting par(mfrow=c(1,2)) hist(galton\$child,col="blue",breaks=100) hist(galton\$parent,col="blue",breaks=100) View slide
• Galton Data – Plotting pairs(galton) - cont.
• What is Regression Analysis? Regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables. Source: http://en.wikipedia.org/wiki/Regression_analysis
• Fitting a line
    plot(galton$parent, galton$child, pch=19, col="blue")        # parent on the x-axis, matching the model below
    lm1 <- lm(child ~ parent, data=galton)
    lines(galton$parent, lm1$fitted.values, col="red", lwd=3)    # lwd sets the line width
• Plot Residuals plot(galton\$parent,lm1\$residuals,col="blue",pch=19) Abline (c(0,0),col="red",lwd=3)
• Linear Model Coefficients
    summary(lm1)
    lm1$coefficients
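
The coefficients reported by lm() for a single predictor can be reproduced from the closed-form least-squares formulas. The following is a minimal sketch; it assumes R's built-in cars dataset (dist ~ speed) as a stand-in for galton so that it runs without the UsingR package.

```r
# Assumption: the built-in 'cars' data replaces galton for a self-contained example.
fit <- lm(dist ~ speed, data = cars)

# Closed-form least-squares estimates for one predictor:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
slope <- cov(cars$speed, cars$dist) / var(cars$speed)
intercept <- mean(cars$dist) - slope * mean(cars$speed)

manual <- c(intercept, slope)
fitted_coefs <- unname(coef(fit))

# The two rows should agree up to floating-point error.
print(rbind(manual, fitted_coefs))
```
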
• Why care about model Accuracy? http://en.wikipedia.org/wiki/Linear_regression
• Model Accuracy Measures: p-value; confidence interval; R²; adjusted R²
• P-value: the most common measure of statistical significance. Idea: suppose nothing is going on (the null hypothesis is true); how unusual is it to see the estimate we got? Some typical thresholds (single test): P < 0.05 (significant); P < 0.01 (strongly significant); P < 0.001 (very significant).
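
As an illustration of where the p-value of a regression coefficient comes from, this sketch reads it from the coefficient table of summary() and recomputes it from the t statistic. It assumes the built-in cars dataset rather than galton; the variable names are illustrative only.

```r
# Assumption: built-in 'cars' data stands in for galton.
fit <- lm(dist ~ speed, data = cars)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- summary(fit)$coefficients
p_slope <- coef_table["speed", "Pr(>|t|)"]

# Two-sided p-value recomputed from the t statistic and residual degrees of freedom
t_slope <- coef_table["speed", "t value"]
p_manual <- 2 * pt(abs(t_slope), df = df.residual(fit), lower.tail = FALSE)

print(c(from_summary = p_slope, recomputed = p_manual))
```
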
• Confidence intervals: A confidence interval is a type of interval estimate of a population parameter and is used to indicate the reliability of an estimate. http://en.wikipedia.org/wiki/Confidence_interval
    confint(lm1, level=0.95)
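
For a linear model, the interval returned by confint() is the estimate plus or minus a t quantile times the standard error. A small sketch of that calculation, assuming the built-in cars dataset in place of galton:

```r
# Assumption: built-in 'cars' data stands in for galton.
fit <- lm(dist ~ speed, data = cars)
ci <- confint(fit, level = 0.95)

# Manual 95% interval: estimate +/- t quantile * standard error
est <- coef(fit)
se <- summary(fit)$coefficients[, "Std. Error"]
t_crit <- qt(0.975, df = df.residual(fit))
ci_manual <- cbind(est - t_crit * se, est + t_crit * se)

print(ci)
print(ci_manual)
```
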
• R²: the proportion of response variation "explained" by the regressors in the model. R² = 1 indicates that the fitted model explains all variability in y; R² = 0 indicates no 'linear' relationship between the response variable and the regressors (for straight-line regression, this means that the fitted line is constant: slope = 0, intercept = mean of y). http://en.wikipedia.org/wiki/Coefficient_of_determination
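
The definition above corresponds to R² = 1 - SS_res / SS_tot. The sketch below computes it by hand and compares it with the value from summary(); it assumes the built-in cars dataset as a stand-in for galton.

```r
# Assumption: built-in 'cars' data stands in for galton.
fit <- lm(dist ~ speed, data = cars)

ss_res <- sum(residuals(fit)^2)                    # variation left unexplained
ss_tot <- sum((cars$dist - mean(cars$dist))^2)     # total variation in the response
r2_manual <- 1 - ss_res / ss_tot

r2_summary <- summary(fit)$r.squared

# With a single predictor, R^2 also equals the squared correlation.
r2_cor <- cor(cars$speed, cars$dist)^2

print(c(manual = r2_manual, summary = r2_summary, squared_cor = r2_cor))
```
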
• Adjusted R²: The use of an adjusted R² (often written with a bar over the R and pronounced "R bar squared") is an attempt to take account of the phenomenon of R² automatically and spuriously increasing when extra explanatory variables are added to the model. http://en.wikipedia.org/wiki/Coefficient_of_determination
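
To see the effect described above, this sketch adds a random, unrelated predictor to a model. R² cannot go down when a term is added, while adjusted R², computed as 1 - (1 - R²)(n - 1)/(n - p - 1), penalises the extra term and may go down. The cars dataset, the noise column, and the helper adj_r2 are assumptions made for illustration.

```r
# Assumption: built-in 'cars' data plus a made-up 'noise' predictor.
set.seed(1)
cars2 <- cars
cars2$noise <- rnorm(nrow(cars2))   # hypothetical predictor unrelated to dist

fit_small <- lm(dist ~ speed, data = cars2)
fit_big <- lm(dist ~ speed + noise, data = cars2)

# Hypothetical helper: adjusted R^2 from R^2, sample size n and predictor count p
adj_r2 <- function(fit) {
  r2 <- summary(fit)$r.squared
  n <- nobs(fit)
  p <- length(coef(fit)) - 1        # predictors, excluding the intercept
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

print(c(r2_small = summary(fit_small)$r.squared,
        r2_big = summary(fit_big)$r.squared))
print(c(adj_small = adj_r2(fit_small),
        adj_big = adj_r2(fit_big)))
```
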
• Predicting with Linear Regression
    coef(lm1)[1] + coef(lm1)[2]*80      # by hand: intercept + slope * 80
    newdata <- data.frame(parent=80)
    predict(lm1, newdata)               # the same prediction via predict()
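
predict() can also attach an interval to the point prediction. The sketch below is an extension of the slide rather than part of it: it assumes the built-in cars dataset and a hypothetical new speed of 20, and compares the confidence interval for the mean response with the wider prediction interval for a single new observation.

```r
# Assumption: built-in 'cars' data stands in for galton; speed = 20 is an arbitrary example value.
fit <- lm(dist ~ speed, data = cars)
newdata <- data.frame(speed = 20)

point <- predict(fit, newdata)
manual <- coef(fit)[1] + coef(fit)[2] * 20   # same value computed by hand

# Interval for the mean response at speed = 20
conf_int <- predict(fit, newdata, interval = "confidence")
# Interval for one new observation at speed = 20 (also includes residual variance)
pred_int <- predict(fit, newdata, interval = "prediction")

print(conf_int)
print(pred_int)
```
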
• Multivariate Linear Regression: WHO childhood hunger data. Dataset: http://apps.who.int/gho/athena/data/GHO/WHOSIS_000008.csv?profile=text&filter=COUNTRY:*
    hunger <- read.csv("./hunger.csv")
    hunger <- hunger[hunger$Sex != "Both sexes", ]   # keep only the sex-specific rows
• Multivariate Linear Regression – cont.
    # Same slopes: one Year slope, with a separate intercept for each Sex
    lmBoth <- lm(hunger$Numeric ~ hunger$Year + hunger$Sex)
    # Different slopes: the Sex*Year term adds a Sex-by-Year interaction
    lmBoth2 <- lm(hunger$Numeric ~ hunger$Year + hunger$Sex + hunger$Sex*hunger$Year)
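
The difference between the "same slopes" and "different slopes" models shows up in the coefficients. Since hunger.csv is not bundled with R, this sketch assumes the built-in mtcars dataset instead, with mpg as the outcome, wt as the numeric covariate, and transmission type (am) playing the role of Sex.

```r
# Assumption: built-in 'mtcars' replaces the WHO hunger data.
cars_df <- mtcars
cars_df$am <- factor(cars_df$am, labels = c("automatic", "manual"))

fit_same <- lm(mpg ~ wt + am, data = cars_df)   # parallel lines: one slope, two intercepts
fit_diff <- lm(mpg ~ wt * am, data = cars_df)   # wt * am expands to wt + am + wt:am

print(names(coef(fit_same)))
print(names(coef(fit_diff)))

# Group-specific slopes implied by the interaction model
slope_auto <- coef(fit_diff)["wt"]
slope_manual <- coef(fit_diff)["wt"] + coef(fit_diff)["wt:ammanual"]
print(c(automatic = unname(slope_auto), manual = unname(slope_manual)))
```

With a two-level factor, the interaction model gives the same slopes as fitting a separate regression within each group.
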
• Model Selection
    step(lmBoth2)
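
step() searches for a model with a lower AIC; when called with only a fitted model, it works backward by dropping terms. A minimal sketch, assuming the built-in mtcars dataset and a made-up noise predictor in place of the hunger data:

```r
# Assumption: built-in 'mtcars' plus a hypothetical irrelevant predictor.
set.seed(42)
df <- mtcars[, c("mpg", "wt", "hp")]
df$noise <- rnorm(nrow(df))

full <- lm(mpg ~ wt + hp + noise, data = df)

# Backward selection by AIC; trace = 0 suppresses the step-by-step log
selected <- step(full, trace = 0)

print(formula(selected))
print(c(AIC_full = AIC(full), AIC_selected = AIC(selected)))
```
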
• Regression with Factor Variables: the outcome is still quantitative; the covariate(s) are factor variables; fitting lines = fitting means; we want to evaluate the contribution of all factor levels at once.
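
"Fitting lines = fitting means" can be checked directly: with a single factor covariate, the intercept is the mean of the reference level and each other coefficient is a difference from it. This sketch assumes the built-in iris dataset (Sepal.Length by Species), which is not part of the slides, and uses anova() as one way to test all factor levels at once.

```r
# Assumption: built-in 'iris' data, outcome Sepal.Length, factor covariate Species.
fit <- lm(Sepal.Length ~ Species, data = iris)

# Plain group means for comparison
group_means <- tapply(iris$Sepal.Length, iris$Species, mean)

# Intercept = mean of the reference level; other coefficients = differences from it
implied_means <- c(coef(fit)[1], coef(fit)[1] + coef(fit)[-1])

print(group_means)
print(unname(implied_means))

# F test for the contribution of all Species levels together
print(anova(fit))
```
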