# B409 W11 Sas Collaborative Stats Guide V4.2

Published in: Technology
• The Strategic Relationship Marketing Co-op Program from GBC 2010 - 2011 made this little SAS Guide for all you newbies out there.. Enjoy!!


### B409 W11 Sas Collaborative Stats Guide V4.2

2. 2. Likelihood of certain candidates to be elected
3. 3. Reactions to certain new products
4. 4. Survey response rate reliability
5. 5. Predict results based on previous research</li></ul>

##### Using a Confidence Interval

The following example shows how one can arrive at a Confidence Interval.

Getting statistics from an entire population may be impossible: information may be correct but outdated, and response rates on surveys may be very low. Researchers therefore simplify the statistical process by picking a sample of the population of interest, answering their research questions from that sample, and then estimating the reliability and precision of the results. This reliability estimate is where the Confidence Interval comes in.

For example, let's answer the following question: with 95% confidence, what is the average number of languages spoken by a student at George Brown?

We could ask every student at George Brown, but that would be time consuming and some students might not answer truthfully. A convenient alternative is to pick and analyze a sample that we can work with, and to calculate from it the Confidence Interval that answers our question. The sample should be a reasonably large proportion of the students in the school, so that the results are representative of the larger population (we will use a representative class). Once we have chosen the sample, we estimate a range (the Confidence Interval) that we can be confident contains the mean of the entire population.

Results:

Mean = 2.6 languages per student

Standard deviation = 1.836

(Intervals are calculated from the mean, the standard deviation and the size of the sample.)

The Confidence Interval calculation leads to a conclusion: with 95% confidence, the mean number of languages is between 1.945 and 3.255.

##### Applying Confidence Intervals to SAS

The Distribution Analysis produces statistics describing the distribution of a single variable.
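Before moving on to the Enterprise Guide steps, the hand calculation from the George Brown languages example can be sketched in Python. The guide does not state the sample size, so `n = 30` below is purely an assumption (chosen because it lands close to the reported interval), as is the use of the normal critical value 1.96 rather than a t value. The function name `mean_confidence_interval` is hypothetical.

```python
import math

def mean_confidence_interval(mean, sd, n, z=1.96):
    """Return (lower, upper) confidence limits for a population mean.

    Uses the normal approximation: mean +/- z * sd / sqrt(n).
    z = 1.96 corresponds to a 95% confidence level.
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    half_width = z * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

# Mean and standard deviation come from the languages example.
# n = 30 is an ASSUMED class size, not a figure given in the guide.
lower, upper = mean_confidence_interval(mean=2.6, sd=1.836, n=30)
print(f"95% CI: {lower:.3f} to {upper:.3f}")
```

Under these assumptions the limits come out within a few thousandths of the reported 1.945 and 3.255. A smaller sample or a higher confidence level would widen the interval.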
The next example explores the distribution of the variable Height in the Volcanoes data set. In the Process Flow window, click the Volcanoes data icon to make it active. Then select Task → Describe → Distribution.

In the Data tab, choose the Height variable for analysis. Then, in the Distributions tab, click Normal.

In the Tables tab you can choose all the statistics you would like to explore. We are particularly interested in Basic Confidence Intervals and Basic Measures (mean, standard deviation, and variance). To calculate confidence intervals, specify the confidence level in the drop-down box at the top right. You can choose among 90%, 95%, and 99%. After making your selection, click Run.

The resulting report starts with basic statistical measures of the variable's distribution: mean, median, standard deviation, variance, and range. Another section of the report contains confidence limits assuming normality. This table shows confidence intervals for the main parameters (mean, standard deviation, and variance) at the 95% confidence level.

We can also build a plot to better evaluate the normality of the variable's distribution. Click Modify Task and, in the window that opens, click the Plots page. You can choose among different appearances. Choose Histogram Plot.

Click the Insert page and choose the statistics you would like to include in the plot (for this example we took sample size, sample mean, and standard deviation). Choose the location of this information on the graph and click Run.

From the example we can see that the sample size is 32. The graph shows that the data is normally distributed and that the mean Height of the volcanoes is 3113.563.
With 95% confidence, the mean height of a volcano is between 2481.3 and 3745.9.

#### Chapter 04: Simple Regression

Sukhoi

Amit Bansal, Sheleena Jaria, Kalpesh Patel, Ishan Sangrai, Pranay Sankhe

##### Introduction to Regression Analysis

In statistical terms, regression is the study of the relationship between variables, so that one can predict the unknown value of one variable from a known value of another.

According to the Oxford English Dictionary, the word ‘regression’ means “stepping back” or “returning to average value”. The term was first used in the 19th century by Sir Francis Galton, who found an interesting result by studying the heights of about 1000 fathers and sons. His findings were that (i) sons of tall fathers tend to be tall and sons of short fathers tend to be short, but (ii) the mean height of the tall fathers was greater than the mean height of their sons, whereas the mean height of the sons of short fathers was greater than the mean height of the short fathers. Galton termed this tendency of mankind to turn back towards average height ‘Regression towards Mediocrity’, and the line that shows the trend was named the ‘Regression Line’.

In the words of M.M Blair, ‘Regression is the measure of the average relationship between two or more variables.’

Regression analysis is used to:

- Predict the value of a dependent variable based on the value of at least one independent variable.
- Explain the impact of changes in an independent variable on the dependent variable.

Dependent variable: the variable we wish to predict or explain.

Independent variable: the variable used to explain the dependent variable.

##### Regression Formula

To calculate the relationship between X and Y we need the regression equation:

Y = a + bX

where X and Y are the variables, b is the slope of the regression line, and a is the intercept of the regression line. With n the number of (X, Y) pairs:

$$b = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2}$$

$$a = \frac{\sum Y - b\sum X}{n}$$

Figure 4.1 shows Simple Regression

As Figure 4.1 shows, the regression line represents the average relationship between two variables. It is also known as the Line of Best Fit. Using the regression line, we can predict the value of the dependent variable from a given value of the independent variable, so the regression line of Y on X gives the best estimate of Y for any given value of X.

##### Steps In Linear Regression

1. State the hypothesis.
2. State the null hypothesis.
3. Gather the data.
4. Compute the regression equation.
5. Examine tests of statistical significance and measures of association.
6. Relate statistical findings to the hypothesis. Accept or reject the null hypothesis.
7. Reject, accept or revise the original hypothesis. Make suggestions for research design and management aspects of the problem.

##### Regression Example

To work through a simple regression, let's take an example where X is Cattle and Y is Cost, and examine the relationship between them. First we need a data set.

To find the regression equation, we first find the slope and the intercept, then use them to form the equation.

Step 1: Count the number of values (n).

Step 2: Find $XY$, $X^2$ and $Y^2$ for each pair of values.

Step 3: Find $\sum X$, $\sum Y$, $\sum XY$, $\sum X^2$ and $\sum Y^2$.

$\sum X = 116.969$; $\sum Y = 670.575$; $\sum XY = 5570.426$; $\sum X^2 = 1036.087$; $\sum Y^2 = 32134.66$

Step 4: Substitute the values into the slope formula.

$$b = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2} = 1.4086$$

Step 5: Substitute the values into the intercept formula.

$$a = \frac{\sum Y - b\sum X}{n} = 26.6211$$

Step 6: Substitute these values into the regression equation.

Y = a + bX = 26.6211 + 1.4086X

Suppose we want to know the approximate value of Y for X = 3.437. We substitute the value into the equation above:

Y = a + bX = 26.6211 + 1.4086 (3.437) = 26.6211 + 4.8416 = 31.4627

The example above shows how to find the relationship between two variables by calculating the regression through these steps.

##### Assumptions Of Simple Regression

In theory, several important assumptions must be satisfied if linear regression is to be used:

- Both the independent (X) and the dependent (Y) variables are measured at the interval or ratio level.
- The relationship between the independent (X) and the dependent (Y) variables is linear.
- Errors in prediction of the value of Y are distributed in a way that approaches the normal curve.
- Errors in prediction of the value of Y are all independent of one another.
- The distribution of the errors in prediction of the value of Y is constant regardless of the value of X.

##### Implementing within SAS

To do the same task in SAS, we need a data set containing the variables whose relationship we want to calculate.

First, open SAS and select File → Open → Data.

Figure 4.2 shows how to access Data in SAS

Now select the data on your computer that you want to analyze. After you select the data, a window like the one shown below appears:

Figure 4.3

After selecting the data, go to Graph → Scatter Chart.

Figure 4.4

Click 2D Scatter chart.

Figure 4.5

Figure 4.6 shows Columns to assign different Task Roles

Drag Cattle into Horizontal and Cost into Vertical, then click Run.

Figure 4.7 shows after selection of variables in their Task roles

Figure 4.8 shows Scatter Plot Graph

Now we need to find the relationship between X and Y through SAS. Select the Process Flow and then double-click the Market data set.

Figure 4.9

Select Analyze → Regression → Linear Regression.

Figure 4.10

Then insert Cost into Dependent variable and Cattle into Explanatory variables, matching the roles of Y and X in the example above.

Figure 4.11

Click Run. The output contains several graphs, but we focus only on the one shown below.

Figure 4.12 shows relationship between Cattle and Cost.

Figure 4.13 shows the window after clicking Process Flow

In SAS, we can modify the output. Right-click Linear Regression → Modify Linear Regression.

Figure 4.14

The Linear Regression window will pop up. Here we want a name in the footer, so click Titles → Footnote.

Figure 4.15

Click Default text, write your name in place of “the SAS system”, then click Run.

Figure 4.16

##### Conclusion

After doing the analysis first manually and then with SAS software, we find that the output is the same but the effort required differs greatly.
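For comparison with the SAS route, the manual route from Steps 1 to 6 of the regression example comes down to a few lines of Python once the sums are formed. This is only a sketch: the names `fit_line` and `predict` and the four data points are invented for illustration (the guide's Cattle and Cost data set is not reproduced in the text), and only the final prediction reuses the fitted equation reported in the example.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for Y = a + bX,
    computed with the sum formulas from the Regression Formula section."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)                                   # Step 1: count the values
    sum_x = sum(xs)                               # Steps 2-3: form the sums
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are identical; slope is undefined")
    b = (n * sum_xy - sum_x * sum_y) / denom      # Step 4: slope
    a = (sum_y - b * sum_x) / n                   # Step 5: intercept
    return a, b

def predict(a, b, x):
    """Step 6: evaluate the regression equation Y = a + bX."""
    return a + b * x

# HYPOTHETICAL data lying exactly on Y = 1 + 2X (not the guide's data).
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # 1.0 2.0

# Prediction with the intercept and slope reported in the guide.
y_hat = predict(26.6211, 1.4086, 3.437)
print(round(y_hat, 2))
```

Keeping the fit and the prediction as separate functions mirrors the guide's split between forming the equation and then substituting a value of X into it.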
SAS software makes it easy to get output that would otherwise take many tedious hours. The best thing about SAS is that you can make changes at any point in a fraction of a second, whereas by hand you would need to redo the complete calculation.

In a nutshell, simple regression gives us the relationship between two variables, so that we can predict one value if the other is known, and with SAS software we get the output quickly and error free.

#### Chapter 05: Correlation Coefficient

Fusion

Gaurav Anand, Maninder Kaur, Anil Khurana, Rizwan Maknojia, Bikramjit Singh

##### Definition

Correlation is one of the most common and most useful statistics. A correlation is a single number that describes the degree of relationship between two variables. It puts a number on whether two numeric variables are related, and it ranges from -1 to +1.

- A “+1” correlation indicates a perfect positive correlation, meaning that both variables move in the same direction together.
- A “-1” correlation indicates a perfect negative correlation, meaning that as one variable goes up, the other goes down.
- A “0” correlation indicates that there is no relationship between the variables.

In mathematical terms, correlation is denoted “r”. The degree of relationship between variables can be read from the value of r, as shown in Table 5.1.

| Value of r | Strength of relationship |
| --- | --- |
| -1.0 to -0.5 or 0.5 to 1.0 | Strong |
| -0.5 to -0.3 or 0.3 to 0.5 | Moderate |
| -0.3 to -0.1 or 0.1 to 0.3 | Weak |
| -0.1 to 0.1 | None or very weak |

Table 5.1 – r value table

##### Correlation Example

Let's assume that we want to look at the relationship between two variables, the age of a student and their marks. Perhaps we have a hypothesis that a student's age affects their marks.
We have sample data for 10 students and their marks out of 50.

| Age | Marks |
| --- | --- |
| 25 | 35 |
| 30 | 48 |
| 26 | 36 |
| 24 | 36 |
| 28 | 45 |
| 25 | 40 |
| 31 | 46 |
| 31 | 40 |
| 26 | 36 |
| 25 | 31 |

Table 5.2

Based on the data in Table 5.2, the calculated correlation value is r = .105. This indicates that there is no strong positive relationship between a student's age and their marks, so an older student will not necessarily get higher marks. Nor is there a negative relationship between the two. Because r = .105 is very close to 0, there is hardly any relationship between the two variables.

##### Implementing Within SAS

Let's see how correlation can be used in SAS Enterprise Guide.

Open SAS Enterprise Guide 4.2 and open the Class data set from Library → SASHelp. The Class data set contains each student's name, sex, age, height and weight. We will check whether there is any relationship between a student's height and weight. On the menu bar at the top, click Tasks → Multivariate → Correlations, as shown in Figure 5.1.

Figure 5.1 – Path to open Correlation

The Correlation window will pop up.

Figure 5.2 – Correlation Window

Drag Height under Analysis variable and Weight under Correlate with, then click Run.

Figure 5.3 – Assigning variables for correlation

Figure 5.4 – Correlation output window

In the output in Figure 5.4, “Correlation Analysis” at the top displays the variables whose degree of relationship you want to check. Below that, “Simple Statistics” shows the mean, standard deviation, minimum value and maximum value for both Height and Weight, where N is the number of students in the class. All these statistics are used to calculate the correlation between the two variables. In the output displayed above, we can see that the Correlation Coefficient value is .877.
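To make the number in the output less of a black box, here is a minimal Python sketch of Pearson's r together with a lookup that applies the bands of Table 5.1. The function names `pearson_r` and `strength` and the height and weight values are hypothetical (they are not the Class data set), and assigning a boundary value such as 0.5 to the stronger band is an assumption, since the table lists it in two rows.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)

def strength(r):
    """Label r using the bands of Table 5.1.

    ASSUMPTION: a value exactly on a boundary goes to the stronger band.
    """
    size = abs(r)
    if size >= 0.5:
        return "Strong"
    if size >= 0.3:
        return "Moderate"
    if size >= 0.1:
        return "Weak"
    return "None or very weak"

# HYPOTHETICAL heights and weights, for illustration only.
heights = [150, 160, 170, 180]
weights = [50, 58, 69, 80]
r = pearson_r(heights, weights)
print(round(r, 3), strength(r))

# The coefficient reported in Figure 5.4.
print(strength(0.877))
```

The sign of r carries the direction of the relationship, which is why `strength` looks only at the absolute value.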
A coefficient of .877 indicates a strong positive relationship between a student's height and weight: taller students tend to weigh more.

##### Modifying Output

In SAS Enterprise Guide, we can modify the output in different ways. For example, suppose we want to check the correlation between height and weight separately for males and females, and we also want a scatter plot in the output.

- Right-click Correlations → Modify Correlations under the Process Flow.

Figure 5.5 – Modify correlation path

- The Correlations window will pop up. Drag Sex under Group analysis by, as shown in Figure 5.6.

Figure 5.6 – Assigning variables for group analysis

- Click Results and check the option Create a scatter plot for each correlation pair.

- The data shows trends of beer sales and the relationship
- This chapter will focus on computing confidence and prediction intervals as well as interpreting the associated output.

##### How to Complete in SAS EG

<ul><li>Open the dataset: beer_sales.sas,
24. 24. As you can see from the raw data, an increase in temperature is strongly and positively correlated to beer sales.
25. 25. If we make a simple line plot before we start computing confidence intervals, it will give us a better sense of the information we’re looking at.