
# A2DataDive workshop: Introduction to R


A2DataDive workshop speakers: Alex Ocampo, Chong Zhang, Sangwon Hyun, Yiqun Hu. A2 Data Dive, Feb. 10-12, 2012. Visit the wiki for more information: http://wiki.datawithoutborders.cc/index.php?title=Project:Current_events:A2_DD

Published in: Education, Technology
• Why do we do linear regression?
  * Give two modern examples of prediction. (1) Genetics: cancer, disease and personalized medicine, where there are so many candidate variables that we tend not to test all of them and work with a subset. (2) Computational advertising: information about your history serves as the variables.
  * With multiple contributing factors, how does each one uniquely affect the response variable?
  * When? When you need these kinds of information.
  * Y is a continuous variable.
  * The X's can be continuous, discrete or categorical.
  * If p = 1, the model is called simple linear regression. If p > 1, it is called multiple linear regression.
  * The typical linear model is displayed; epsilon is the additive error term.
  * The unknown parameters beta zero to beta p are what we are going to estimate.
  * Note 1: "linear" means linear in the betas, not in the x's. Oral examples given.
  * Note 2: there are always p + 1 unknown parameters when there are p predictor variables. Beta zero is called the intercept.
  * The basic mechanism for building a linear regression model is to minimize the residual sum of squares, which involves a large amount of calculation.
  * With the help of R, we can obtain a model with a single line of code; we will get to that later.
• We don't want to include every variable we have obtained. We also want to simplify the model, so that it does not contain, say, 20 variables.
  * Some variables are highly correlated with each other; selection reduces the chance of multicollinearity.
  * Reduce the noise caused by unnecessary variables.
  * Save time and money.
  * Testing-based methods: backward elimination, forward selection and stepwise regression.
  * Testing-based methods are sensitive to outliers, so we introduce a better alternative: criterion-based procedures.
  * General idea: choose the model that optimizes a criterion, minimizing information loss while balancing goodness of fit against model size (accuracy against complexity).
  * AIC: Akaike Information Criterion.
  * BIC: Bayesian Information Criterion.
  * Adjusted R-squared and Mallows's C_p are similar criteria that R can calculate easily.
  * What exactly is a criterion? It is a number that one can calculate from the model that has been fitted.
  * n is the number of observations, p is the number of predictor variables, and RSS, the residual sum of squares, is a number based on your model.
  * Once AIC is calculated, we fit other candidate models containing other combinations of variables and check whether they give a smaller AIC or BIC.
  * Our goal is to find the model with the smallest AIC or BIC.
  * R can help us check all possible AIC or BIC values.
• (With animation.) How do the predictor variables population, income, illiteracy percentage, murder rate, high school graduation percentage, frost and land area affect life expectancy (1969 to 1971)?
  * R can easily help us get the result that we want.
  * After we read the data set into R, we only need the lm function to set up the linear regression model.
  * We can use "~ ." to indicate all variables in the data.
  * summary(g) gives the estimates of the intercept and of the coefficients of the x's, which are the betas that we are estimating.
  * As mentioned, R calculates the betas by minimizing the residual sum of squares.
  * People who are familiar with statistics might notice that the p-values are large for some variables, which suggests that these predictor variables will be eliminated in later steps.
  * Here we have our complete model, including all possible variables.
  * R also provides the ANOVA table, where we can easily see the RSS that is used to calculate AIC.
  * But we don't need to calculate AIC ourselves; R will do it for us.
  * Simply use the "step" function.
• This slide goes by very quickly, with just a simple explanation of the mechanism of AIC.
• (With animation.) Final result and visualization of the model.
• Pima data documentation: http://rss.acs.unt.edu/Rdoc/library/faraway/html/pima.html
  * pregnant: Number of times pregnant
  * glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
  * diastolic: Diastolic blood pressure (mm Hg)
  * triceps: Triceps skin fold thickness (mm)
  * insulin: 2-hour serum insulin (mu U/ml)
  * bmi: Body mass index (weight in kg/(height in metres squared))
  * diabetes: Diabetes pedigree function
  * age: Age (years)
  * test: Test whether the patient shows signs of diabetes (coded 0 if negative, 1 if positive)
  * Artificial results: makeBMI + 73*Cholesterol + Age
• The difference from linear regression is that linear regression has a specific goal: it is clear which variable is the dependent variable and which ones are the independent variables whose effect we want to examine. The goal of principal component analysis is less specific: we are examining the correlations between the variables. (Note to Professor Shedden: could you be more specific about what I should talk about on this slide?)
• Goal: obtain several "principal components".
  * Often the first several principal components account for most of the variation (shown in the last slide).
  * We can see that in the PC1 direction, insulin accounts for most of the variation. We can verify this by inspecting the PCs.
  * The usefulness of PCA is that the first several principal components can tell us which variables account for more of the "variation".
  * Unlike variable selection in linear regression, we preserve the effects of all the columns while creating new (fewer) variables that can explain the data.
  * Many dimensions can be reduced to several principal components whose "directions" are combinations of the original variables.
  * (The downside is that interpretation is sometimes unclear: what does it mean to have -0.3 of insulin and 0.594 of glucose? Maybe I shouldn't talk about this.)
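• Round down to 3 digits. R code used for the PCA slides:
    library(faraway)                  # provides the pima data set
    data <- pima
    summary(pima)
    data.pca <- prcomp(data[, -9])    # drop column 9 (test), the categorical outcome
    quartz()                          # opens a new graphics window (macOS only)
    plot(data.pca)
    totalrep <- data.pca$sdev
    totalrep <- totalrep / sum(totalrep)   # share of the summed standard deviations; variance shares would use sdev^2
    barplot(totalrep, main = "Representation of Principal Components", xlab = "Principal Component", ylab = "% of Total Variance")
    biplot(data.pca)
    plotAll(data.pca)                 # not a base R function; presumably a helper defined by the speaker
    plot(data.pca$x[, 1], data.pca$x[, 2])   # scores on PC1 against scores on PC2
    biplot(data.pca, xlabs = rep("+", 768), xlim = c(-0.05, 0.3), ylim = c(-0.15, 0.12)); abline(h = 0, v = 0)
    summary(data.pca)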
• ### A2DataDive workshop: Introduction to R

3. Descriptive Statistics: quantitatively describe the main features of a collection of data.
    [Illustration captions: "How do salaries vary across the company?" and "What should I make of all this???!!!"]
4. Descriptive Statistics in R
    Mean:                       > mean(x); > mean(x, trim=a)
    Median:                     > median(x)
    Mode:                       > sort(table(x))
    Standard deviation:         > sd(x)
    Variance:                   > var(x)
    Median absolute deviation:  > mad(c(x))
    Interquartile range:        > IQR(x)
    Range:                      > range(x)
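A minimal sketch of these functions in use. The vector `x` and its values are made up for illustration and do not come from the workshop; the last two lines show one way to turn `sort(table(x))` into a single modal value.

```r
# Illustrative data (hypothetical values, not from the slides)
x <- c(2, 4, 4, 5, 7, 9, 30)

mean(x)               # arithmetic mean
mean(x, trim = 0.2)   # trimmed mean: drops 20% of the values from each end
median(x)
sd(x)
var(x)                # equals sd(x)^2
mad(x)                # median absolute deviation (scaled by 1.4826 by default)
IQR(x)
range(x)              # smallest and largest value

# R has no built-in statistical mode: tabulate and take the most frequent value
counts <- sort(table(x), decreasing = TRUE)
mode_x <- as.numeric(names(counts)[1])
```

The trimmed mean and `mad()` are much less affected by the extreme value 30 than `mean()` and `sd()`, which is the usual reason for reporting them alongside each other.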
5. Data Dimensions
    Vector x:
    > length(x)
    [1] 1000
    -------------------------
    Matrix X:
    > nrow(X)
    [1] 2030
    > ncol(X)
    [1] 100000
    > dim(X)
    [1] 2034 100000
6. Vectorization in R (X is a matrix)
    > apply(X, MARGIN=1, FUN=mean)    # mean of each row
    > apply(X, MARGIN=2, FUN=mean)    # mean of each column
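To see what the two `MARGIN` values do, here is a small sketch on a made-up 3 x 4 matrix (the matrix is an assumption for illustration only). It also compares the result with `rowMeans()` and `colMeans()`, the specialised base R functions for this particular case.

```r
# Small illustrative matrix: 3 rows, 4 columns, filled column by column
X <- matrix(1:12, nrow = 3, ncol = 4)

row_means <- apply(X, MARGIN = 1, FUN = mean)  # one value per row
col_means <- apply(X, MARGIN = 2, FUN = mean)  # one value per column

# rowMeans() and colMeans() give the same answers for means
all.equal(row_means, rowMeans(X))
all.equal(col_means, colMeans(X))
```

`apply()` is the more general tool, since `FUN` can be any function (for example `sd` or `median`), while `rowMeans()` and `colMeans()` only compute means.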
7. > boxplot(X)
    • Good for small data sets
    • Easy to compare groups side by side
    • 1.5*IQR defines outliers
    [Figure: side-by-side boxplots of epiE, epiS, epiImp, epilie and epiNeur]
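The 1.5*IQR rule can be sketched by hand as follows. The data are made up for illustration. One assumption worth stating: the sketch uses the hinges from `fivenum()`, which is what R's boxplot code uses, and these can differ slightly from the quartiles behind `IQR()`.

```r
# Illustrative data with one extreme value (hypothetical, not from the slides)
x <- c(12, 14, 15, 15, 16, 17, 18, 19, 21, 45)

h <- fivenum(x)                 # min, lower hinge, median, upper hinge, max
spread <- h[4] - h[2]           # hinge spread, the box length in a boxplot
lower <- h[2] - 1.5 * spread    # points below this are flagged
upper <- h[4] + 1.5 * spread    # points above this are flagged

outliers <- x[x < lower | x > upper]

# boxplot.stats() applies the same rule (coef = 1.5 by default)
identical(outliers, boxplot.stats(x)$out)
```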
8. The Big Six: Minimum, 1st Q, Median, Mean, 3rd Q, Maximum
    > summary(X)
9. R tries to understand you
    > summary(X)
10. Histograms:
    > hist(X)
    [Figure: histograms (y-axis: Frequency) of epiE, epiS, epiImp, epilie, epiNeur, bfagree, bfcon, bfext, bfneur, bfopen and bdi]
11. Correlation
    > cor(wt,mpg)
    [1] -0.8676594
    > plot(x=wt,y=mpg)
    [Figure: "Scatterplot Example"; x-axis: Car Weight; y-axis: Miles Per Gallon]
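The value printed on this slide is the one obtained from R's built-in `mtcars` data, so the sketch below assumes that `wt` and `mpg` are those two columns; the slide itself does not name the data set.

```r
# Assumption: wt and mpg come from the built-in mtcars data set
wt  <- mtcars$wt    # car weight (in 1000 lbs)
mpg <- mtcars$mpg   # miles per gallon

r <- cor(wt, mpg)   # Pearson correlation by default
r

plot(x = wt, y = mpg,
     main = "Scatterplot Example",
     xlab = "Car Weight", ylab = "Miles Per Gallon")
```

A correlation near -0.87 indicates a strong negative linear association: heavier cars tend to travel fewer miles per gallon.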
12. Scatterplot Matrix
    • Iris dataset
    • 150 flowers
    • 5 variables
    [Photo credit: Goingslo, flickr]
13. Scatterplot Matrix plot
    > pairs(data)
    [Figure: scatterplot matrix of Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species; legend: setosa, versicolor, virginica]
14. > coplot(lat ~ long | depth)
    [Figure: conditioning plot of lat against long, given depth]
15. Linear Regression
    Why?
    • Prediction of future or unknown observations
    • Assessment of relationship between variables
    • General description of data structure
    What?
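As a rough sketch of what fitting a linear model Y = beta_0 + beta_1 x_1 + ... + beta_p x_p + epsilon involves, the code below estimates the betas by least squares in two ways: by solving the normal equations directly and with `lm()`. The choice of the built-in `mtcars` data and of the predictors `wt` and `hp` is an assumption made only for illustration.

```r
# Assumption: built-in mtcars data, chosen only for illustration
y <- mtcars$mpg
X <- cbind(Intercept = 1, wt = mtcars$wt, hp = mtcars$hp)  # design matrix, p = 2 predictors

# Least-squares estimate: solve the normal equations (X'X) beta = X'y
beta_hat <- solve(crossprod(X), crossprod(X, y))

# lm() performs the same fit in one line
fit <- lm(mpg ~ wt + hp, data = mtcars)

cbind(by_hand = drop(beta_hat), lm = coef(fit))  # p + 1 = 3 estimates each
rss <- sum(residuals(fit)^2)                     # the quantity being minimised
```

Both columns of the comparison agree up to rounding error, which illustrates that `lm()` is doing the minimization of the residual sum of squares for us.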
16. Variable Selection
    Why?
    • Simplification
    • Elimination of multicollinearity and noise
    • Time and money saving
    How?
    • Testing-based variable selection methods: backward, forward, stepwise
    • Criterion-based procedures
    What?
    • AIC = n ln(RSS/n) + 2p
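The AIC formula can be checked by hand. The sketch below assumes that the workshop's `statedata` corresponds to R's built-in `state.x77` matrix, which is not stated on the slides. It also assumes that p counts every estimated coefficient including the intercept, because that is the count `extractAIC()` uses (and `step()` relies on `extractAIC()`). Counting only the predictors shifts every AIC by the same constant and does not change which model is ranked best.

```r
# Assumption: statedata is built from R's built-in state.x77 matrix
statedata <- data.frame(state.x77)   # column names become Life.Exp, HS.Grad, ...

g <- lm(Life.Exp ~ ., data = statedata)   # "~ ." uses all other columns as predictors

n   <- nrow(statedata)
rss <- sum(residuals(g)^2)
k   <- length(coef(g))                    # estimated coefficients, intercept included

aic_by_hand <- n * log(rss / n) + 2 * k
bic_by_hand <- n * log(rss / n) + log(n) * k   # BIC swaps the penalty 2 for log(n)

# extractAIC() returns c(equivalent degrees of freedom, criterion value)
extractAIC(g)
extractAIC(g, k = log(n))
```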
17. Example: U.S. State Facts and Figures
    Response: Life Expectancy
    Predictors: Population, Income, Illiteracy, Murder, HS Grad, Frost, Area
    Selected R code
    • Linear Regression
    > g <- lm(Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost + Area, data = statedata)
    > summary(g)
    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept)  7.094e+01  1.748e+00  40.586  < 2e-16 ***
    Population   5.180e-05  2.919e-05   1.775   0.0832 .
    Income      -2.180e-05  2.444e-04  -0.089   0.9293
    Illiteracy   3.382e-02  3.663e-01   0.092   0.9269
    Murder      -3.011e-01  4.662e-02  -6.459 8.68e-08 ***
    HS.Grad      4.893e-02  2.332e-02   2.098   0.0420 *
    Frost       -5.735e-03  3.143e-03  -1.825   0.0752 .
    Area        -7.383e-08  1.668e-06  -0.044   0.9649
    > anova(g)
    Analysis of Variance Table
    Response: Life.Exp
               Df  Sum Sq Mean Sq F value    Pr(>F)
    Population  1  0.4089  0.4089  0.7372  0.395434
    Income      1 11.5946 11.5946 20.9028 4.218e-05 ***
    Illiteracy  1 19.4207 19.4207 35.0116 5.228e-07 ***
    Murder      1 27.4288 27.4288 49.4486 1.308e-08 ***
    HS.Grad     1  4.0989  4.0989  7.3895  0.009494 **
    Frost       1  2.0488  2.0488  3.6935  0.061426 .
    Area        1  0.0011  0.0011  0.0020  0.964908
    Residuals  42 23.2971  0.5547
    • AIC
    > step(g)
    AIC = n ln(RSS/n) + 2p
18. Continued: U.S. State Facts and Figures
    Start: AIC=-22.18
    Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost + Area
                 Df Sum of Sq    RSS     AIC
    - Area        1    0.0011 23.298 -24.182
    - Income      1    0.0044 23.302 -24.175
    - Illiteracy  1    0.0047 23.302 -24.174
    <none>                    23.297 -22.185
    - Population  1    1.7472 25.044 -20.569
    - Frost       1    1.8466 25.144 -20.371
    - HS.Grad     1    2.4413 25.738 -19.202
    - Murder      1   23.1411 46.438  10.305
    Step: AIC=-24.18
    Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost
                 Df Sum of Sq    RSS     AIC
    - Illiteracy  1    0.0038 23.302 -26.174
    - Income      1    0.0059 23.304 -26.170
    <none>                    23.298 -24.182
    - Population  1    1.7599 25.058 -22.541
    - Frost       1    2.0488 25.347 -21.968
    - HS.Grad     1    2.9804 26.279 -20.163
    - Murder      1   26.2721 49.570  11.569
19. Continued: U.S. State Facts and Figures
    Step: AIC=-28.16
    Life.Exp ~ Population + Murder + HS.Grad + Frost
                 Df Sum of Sq    RSS     AIC
    <none>                    23.308 -28.161
    - Population  1     2.064 25.372 -25.920
    - Frost       1     3.122 26.430 -23.877
    - HS.Grad     1     5.112 28.420 -20.246
    - Murder      1    34.816 58.124  15.528
    Coefficients:
    (Intercept)   Population       Murder      HS.Grad        Frost
      7.103e+01    5.014e-05   -3.001e-01    4.658e-02   -5.943e-03
    [Figure: "Effect on Response Variable of One Unit Change of Predict Variable"; y-axis: Life Expectancy; x-axis: Predict Variables (Intercept, x1, x4, x5, x6); bar labels: 71.03, 0.00005014, 0.3001, 0.04658, 0.005943]
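The whole selection shown on these slides can be run in a few lines. As before, this sketch assumes that `statedata` is the built-in `state.x77` matrix converted to a data frame; `trace = 0` is used only to suppress the step-by-step printout reproduced above.

```r
# Assumption: statedata is built from R's built-in state.x77 matrix
statedata <- data.frame(state.x77)

g <- lm(Life.Exp ~ Population + Income + Illiteracy + Murder +
          HS.Grad + Frost + Area, data = statedata)

# Backward elimination by AIC: at each step, drop the variable whose
# removal gives the smallest AIC, and stop when no removal improves it
final <- step(g, direction = "backward", trace = 0)

formula(final)      # predictors that survive the selection
coef(final)         # compare with the coefficients on the slide
extractAIC(final)   # c(equivalent degrees of freedom, AIC)
```

Dropping `trace = 0` prints each intermediate table, which is useful for seeing why Area, Illiteracy and Income are removed in turn.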
20. What is Principal Component Analysis (PCA)?
    Two general approaches to reducing the number of variables: feature selection and feature extraction
    • Feature selection: "Akaike Information Criterion" (AIC), BIC or backward elimination
    • Feature extraction: "Principal Component Analysis" (PCA) is the most widely used
    • Create several artificial variables
    • Built-in functions in R = Convenient!
21. Actual Pima Data
      pregnant glucose diastolic triceps insulin  bmi diabetes age test
    1        6     148        72      35       0 33.6    0.627  50    1
    2        1      85        66      29       0 26.6    0.351  31    0
    3        8     183        64       0       0 23.3    0.672  32    1
    4        1      89        66      23      94 28.1    0.167  21    0
    5        0     137        40      35     168 43.1    2.288  33    1
    6        5     116        74       0       0 25.6    0.201  30    0
    ….
    (Imagine a data set with many more (~1000) columns.)
    (Imagine a linear regression: which variables affect diabetes, and in what ways?)
22. PCA Example: Pima Indians
    The National Institute of Diabetes and Digestive and Kidney Diseases conducted a study on 768 adult female Pima Indians living near Phoenix.
    9 variables (8 continuous, 1 categorical)
    • pregnant: Number of times pregnant
    • glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
    • diastolic: Diastolic blood pressure (mm Hg)
    • triceps: Triceps skin fold thickness (mm)
    • insulin: 2-hour serum insulin (mu U/ml)
    • bmi: Body mass index (weight in kg/(height in metres squared))
    • diabetes: Diabetes pedigree function
    • age: Age (years)
    • test: diabetes (coded 0 if negative, 1 if positive)
    Next slide: PCA implementation
23. What principal components might look like:
    PC1: 1*Insulin + 0.01*Glucose + ..
    PC2: 1*Glucose + 0.12*Age + 0.12*DiastolicBP + ..
    PC3: 0.92*DiastolicBP + 0.31*Triceps
    • Principal components: what are they composed of? (less important)
    • Difference from linear regression
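A short sketch of what "composed of" means in practice: each principal component score is a weighted sum of the centred original variables, with the weights (loadings) stored in the `rotation` element returned by `prcomp()`. The built-in `iris` measurements are used here as a stand-in for the Pima data, purely so that the sketch runs without installing the faraway package.

```r
# Assumption: iris measurements stand in for the Pima data
X <- as.matrix(iris[, 1:4])

pca <- prcomp(X)         # centres the columns; does not rescale them by default
pca$rotation[, 1:2]      # loadings: weight of each variable in PC1 and PC2

# Scores are the centred data multiplied by the loadings
X_centred <- scale(X, center = TRUE, scale = FALSE)
scores_by_hand <- X_centred %*% pca$rotation

all.equal(unname(scores_by_hand), unname(pca$x))

# Each loading vector has unit length
colSums(pca$rotation^2)
```

Because the variables are not rescaled by default, a variable measured on a much larger scale can dominate the first component. Passing `scale. = TRUE` to `prcomp()` is the usual way to put variables on a common scale first.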
24. Goal: obtain a summary of the data in lower dimensions
    • How many dimensions?
    • R code in the next slide
    [Figure: biplot of the first two principal components (axes PC1 and PC2), observations plotted as "+", with arrows for insulin, triceps, pregnant, age, bmi, diastolic and glucose]
25. Brief: R Code
    > data.pca <- prcomp(data[,-9]); summary(data.pca);
    Importance of components:
                               PC1     PC2     PC3     PC4     PC5     PC6     PC7
    Standard deviation     116.002 30.5411 19.7630 14.0777 10.6155 6.76973 2.78575
    Proportion of Variance   0.889  0.0616  0.0258  0.0131 0.00744 0.00303 0.00051
    Cumulative Proportion    0.889  0.950   0.976   0.9890 0.996   0.999   1.00000
    > data.pca
    Rotation:
                 PC1   PC2   PC3   PC4    PC5    PC6    PC7
    pregnant   0.002 -0.02  0.02  0.05  2e-01 -0.005 -1e+00
    glucose   -0.098 -0.97 -0.14 -0.12 -9e-02  0.051 -9e-04
    Diastolic -0.016 -0.14  0.92  0.26 -2e-01  0.076  1e-03
    triceps   -0.061  0.06  0.31 -0.88  3e-01  0.221  4e-04
    insulin   -0.993  0.09 -0.02  0.07 -2e-04 -0.006 -1e-03
    bmi       -0.014 -0.05  0.13 -0.19  2e-02 -0.971  3e-03
    age        0.004 -0.14  0.13  0.30  9e-01 -0.015  2e-01
    > barplot(totalrep, main="Representation of Principal Components", xlab="Principal Component", ylab="% of Total Variance")
    > biplot(data.pca, xlabs=rep("+",768), xlim = c(-0.05,0.3), ylim = c(-0.15,0.12)); abline(h=0,v=0);
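The "Proportion of Variance" row can be recomputed from the standard deviations: variance is the square of the standard deviation, so each component's share is `sdev^2` divided by the sum of all `sdev^2`. The sketch below again uses the built-in `iris` measurements as a stand-in for the Pima data, which is an assumption made so that the code runs without extra packages.

```r
# Assumption: iris measurements stand in for the Pima data
pca <- prcomp(iris[, 1:4])

var_explained <- pca$sdev^2 / sum(pca$sdev^2)   # variance is sdev squared
cumulative    <- cumsum(var_explained)

round(var_explained, 3)
summary(pca)$importance["Proportion of Variance", ]   # the same shares, as reported by R

barplot(var_explained,
        names.arg = paste0("PC", seq_along(var_explained)),
        main = "Representation of Principal Components",
        xlab = "Principal Component", ylab = "Proportion of Total Variance")
```

Dividing the unsquared `sdev` by its sum instead gives shares of standard deviation, which look flatter than the variance shares. On the slide above, 116.002^2 over the sum of the squared standard deviations gives roughly 0.889 for PC1, matching the printed "Proportion of Variance".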
26. [Figure: bar chart "Representation of Principal Components"; x-axis: Principal Component; y-axis: % of Total Variance]